Sep 15, 2016

Architecture for Overnight Batch Processing and Data Scientists' Work with Apache Spark

I'm on vacation, and I'm thinking about an architecture for overnight batch processing and data scientists' work with Apache Spark :)



I. Business flow

The following is the simple business flow I assume: after the overnight batch processing finishes, many data scientists start their work :)


II. Architecture with the use of Parquet

The data scientists want to reuse the DataFrames created by the overnight batch processing, and they need good query performance against them. In this case, I think an architecture based on Parquet is suitable.
An example of this architecture is shown below.


DataFrame 1 is converted to Parquet 1. When data scientists want to use DataFrame 1, they load it from Parquet 1; the same applies to DataFrame 4. However, not all of the columns have to be loaded: Parquet is a columnar storage format, so it allows reading only the columns a query actually needs.
Therefore, queries against a Parquet table perform better than queries against a text-based table.
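
Here is a minimal sketch of that column pruning with the Spark 1.x API used in this post. The path "parquet1Path" and the column names "id" and "value" are hypothetical names I use only for illustration.

・read only selective columns from a Parquet table (sketch)
val prunedDf = sqlContext.read
  .parquet("parquet1Path")
  .select("id", "value")   // only these columns are read from the Parquet files
prunedDf.show()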

III. Usage of Parquet

Using Parquet is easy. We can convert a DataFrame to a Parquet file like the following.

・convert a DataFrame to a Parquet file
dataFrame1.write.format("parquet").save("parquetFilePath")
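
By the way, since the overnight batch re-runs every night, it may be convenient to overwrite the previous night's output instead of failing when the path already exists. The save mode below is just my assumption, not part of the flow above.

・overwrite the previous Parquet output (my assumption, for nightly re-runs)
dataFrame1.write.mode("overwrite").format("parquet").save("parquetFilePath")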

We can also load a DataFrame from a Parquet file like the following.

・load a DataFrame from a Parquet file
val loadedDataFrame1 = sqlContext.read.parquet("parquetFilePath")
                                              // Not all of the columns are loaded; only the columns the query needs are read.

import org.apache.spark.sql.functions.avg

val dataFrame5 = loadedDataFrame1.join(dataFrame3, loadedDataFrame1("id") === dataFrame3("id"))
  .groupBy(dataFrame3("category1"), dataFrame3("category2"))
  .agg(avg(loadedDataFrame1("value")))
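
If the result of this query should be shared with other data scientists, it can be saved back to Parquet in the same way. The path "parquet5FilePath" is just a hypothetical name for illustration.

・save the aggregated result as Parquet so it can be reloaded later
dataFrame5.write.format("parquet").save("parquet5FilePath")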