I. Business flow
The following is the simple business flow I assumed: after the overnight batch processing finishes, many data scientists start their work :)

II. Architecture with the use of Parquet
The data scientists want to use the DataFrames created by the overnight batch processing and to run queries against them with good performance. For this case, I think an architecture based on Parquet is suitable.
DataFrame 1 is converted to Parquet 1. When data scientists want to use DataFrame 1, they load it from Parquet 1, and the same applies to DataFrame 4. However, not all of the columns have to be loaded: Parquet is a columnar storage format, so it allows loading only the columns a query actually needs.
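As a quick sketch (the column names id and value are assumptions for illustration), selecting only some columns makes Spark read just those columns from the Parquet file.
・load only the columns a query needs
val prunedDataFrame1 = sqlContext.read.parquet("parquetFilePath").select("id", "value")
// only the id and value columns are read from disk; the other columns are skipped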
Therefore, queries against a Parquet table perform better than queries against a Text table.

III. Usage of Parquet
Using Parquet is easy. We can convert a DataFrame to a Parquet file as follows.
・convert a DataFrame to a Parquet file
dataFrame1.write.format("parquet").save("parquetFilePath")
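The shorthand dataFrame1.write.parquet("parquetFilePath") does the same thing.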
We can also load a DataFrame from a Parquet file as follows.
・load a DataFrame from a Parquet file
val loadedDataFrame1 = sqlContext.read.parquet("parquetFilePath")
// Not all of the columns are loaded; only the columns the query references are read.
import org.apache.spark.sql.functions.avg

val dataFrame5 = loadedDataFrame1.join(dataFrame3, loadedDataFrame1("id") === dataFrame3("id"))
  .groupBy(dataFrame3("category1"), dataFrame3("category2"))
  .agg(avg(loadedDataFrame1("value")))
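In this query, only the id and value columns of the loaded DataFrame are referenced, so Spark's optimizer pushes the column pruning down to the Parquet reader and the remaining columns of Parquet 1 are never read from disk.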