Oct 30, 2016

A Design for Filling In Missing Data with Apache Spark

I worked out a design for filling in missing data with Apache Spark :)



1. Introduction

The ids 1, 4, and 5 each have rows for all three categories A, B, and C. In contrast, id 2 has no row for category B, and id 3 has no rows for categories A and C. I needed to fill in these missing rows as follows.

The purpose of this article is to propose an efficient design that produces the result shown in the following figure.
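To make the goal concrete, here is a small plain-Python sketch of the problem, not Spark code. The ids and categories come from the description above; the `value` column, the sample numbers, and the helper name `missing_pairs` are my own assumptions for illustration.

```python
# Hypothetical rows of (id, category, value).
# Only id and category come from the article; the values are made up.
CATEGORIES = ["A", "B", "C"]

rows = [
    (1, "A", 10), (1, "B", 11), (1, "C", 12),
    (2, "A", 20), (2, "C", 22),                 # id 2 lacks category B
    (3, "B", 31),                               # id 3 lacks categories A and C
    (4, "A", 40), (4, "B", 41), (4, "C", 42),
    (5, "A", 50), (5, "B", 51), (5, "C", 52),
]


def missing_pairs(rows, categories):
    """Return the (id, category) pairs for which no row exists."""
    present = {(row_id, category) for row_id, category, _ in rows}
    ids = sorted({row_id for row_id, _, _ in rows})
    return [(i, c) for i in ids for c in categories if (i, c) not in present]


print(missing_pairs(rows, CATEGORIES))
# [(2, 'B'), (3, 'A'), (3, 'C')]
```

These three pairs are exactly the rows that the design in section 2 has to generate.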



2. The Design

The design proposed in this article consists of two parts.

2.1. Part1

It is inefficient to filter an RDD by id, check whether it contains a particular row, and then fill in the missing data, because this has to be repeated for every id. To make the check and the fill-in efficient, I created an RDD of key-value pairs with the id as the key and grouped the elements by key.


   ・Stage1: create Pair RDD and group by key
      map() -> groupByKey()
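Stage 1 can be sketched in plain Python as follows. This is only an emulation under my own assumptions: `to_pair` and `group_by_key` are hypothetical helpers that mimic what `rdd.map(...)` and `rdd.groupByKey()` do on a pair RDD, and the `(id, category, value)` row layout is made up for the example.

```python
from collections import defaultdict

# Hypothetical (id, category, value) rows; the value column is an assumption.
rows = [(2, "A", 20), (2, "C", 22), (3, "B", 31)]


def to_pair(row):
    """map() step: turn a row into a (key, value) pair keyed by id."""
    row_id, category, value = row
    return (row_id, (category, value))


def group_by_key(pairs):
    """Local stand-in for groupByKey(): collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


grouped = group_by_key(map(to_pair, rows))
print(grouped)
# {2: [('A', 20), ('C', 22)], 3: [('B', 31)]}
```

After this step every id appears exactly once, together with all of its existing categories, so the check for missing categories no longer needs a filter per id.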


2.2. Part2

groupByKey() triggers a Spark shuffle, and the partitions are rebuilt as shown on the left of the following figure. Without groupByKey(), the rows for one id may be spread over many partitions, and all of them must be searched. After groupByKey(), however, only the single partition that holds id 2 has to be searched, because all rows with the same id are placed in the same partition.
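The effect of the shuffle on the partitions can be illustrated with a toy model in plain Python. Everything here is an assumption made for the illustration: the layout of `before`, the number of partitions, and the rule `id % num_partitions`, which is a simplified stand-in for hash partitioning on integer keys rather than Spark's actual partitioner.

```python
NUM_PARTITIONS = 3  # assumed partition count

# Assumed layout before the shuffle: rows for id 2 sit in two different partitions.
before = [
    [(1, "A"), (2, "A")],
    [(3, "B"), (1, "B")],
    [(2, "C"), (1, "C")],
]


def shuffle_by_id(partitions, num_partitions):
    """Redistribute rows so that all rows with the same id share a partition."""
    after = [[] for _ in range(num_partitions)]
    for partition in partitions:
        for row_id, category in partition:
            after[row_id % num_partitions].append((row_id, category))
    return after


def partitions_holding(partitions, wanted_id):
    """Return the indexes of the partitions that contain rows for wanted_id."""
    return [
        index
        for index, partition in enumerate(partitions)
        if any(row_id == wanted_id for row_id, _ in partition)
    ]


after = shuffle_by_id(before, NUM_PARTITIONS)
print(partitions_holding(before, 2))  # [0, 2]
print(partitions_holding(after, 2))   # [2]
```

In this toy layout, id 2 has to be looked up in two partitions before the shuffle and in only one afterwards.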

In addition, the check and the fill-in are narrow transformations that Spark pipelines within a single stage, so no further shuffle is needed and the execution time is reduced.


   ・Stage2: check and fill in missing data
      map() -> flatMap()
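Stage 2 can be sketched the same way in plain Python. The split of work between the two steps is my own assumption: `check` plays the role of the map() step and finds the missing categories of one id, and `fill` plays the role of the flatMap() step and emits one row per category. The names `check`, `fill`, and `FILL_VALUE`, as well as using `None` for the filled rows, are hypothetical choices for this example.

```python
CATEGORIES = ["A", "B", "C"]
FILL_VALUE = None  # assumed placeholder for rows that did not exist

# Assumed output of Stage 1: (id, [(category, value), ...]) per id.
grouped = [
    (2, [("A", 20), ("C", 22)]),
    (3, [("B", 31)]),
]


def check(group):
    """map() step: find which categories are missing for one id."""
    row_id, entries = group
    present = {category for category, _ in entries}
    missing = [c for c in CATEGORIES if c not in present]
    return (row_id, list(entries), missing)


def fill(checked):
    """flatMap() step: emit one row per category, filling in the missing ones."""
    row_id, entries, missing = checked
    values = dict(entries)
    for category in missing:
        values[category] = FILL_VALUE
    return [(row_id, category, values[category]) for category in CATEGORIES]


# Both steps run element by element, like pipelined map() -> flatMap().
completed = [row for group in grouped for row in fill(check(group))]
for row in completed:
    print(row)
# (2, 'A', 20)
# (2, 'B', None)
# (2, 'C', 22)
# (3, 'A', None)
# (3, 'B', 31)
# (3, 'C', None)
```

Each group is processed independently of the others, which is why the two steps need no further shuffle.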



3. Conclusion

If you want to avoid a full scan, that is, a slow search over all partitions, the design proposed in section 2 works efficiently and greatly reduces the execution time.