1. Introduction
The rows whose id is 1, 4, or 5 have all three categories A, B, and C. However, the row whose id is 2 and category is B does not exist, and the rows whose id is 3 and category is A or C do not exist either. I needed to fill in this missing data as shown in the following figure.
The purpose of this article is to propose an efficient design that achieves the result in the following figure.
2. The Design
The design proposed in this article consists of two parts.
2.1. Part1
It is inefficient to filter an RDD by id, check whether the RDD contains a particular row, and then fill in the missing data, because this has to be repeated for every id. To make the check and the fill-in efficient, I created an RDD of key-value pairs and grouped the elements by key.
・Stage1: create Pair RDD and group by key
map() -> groupByKey()
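Stage 1 can be sketched as follows. This is a plain-Python stand-in, not Spark code: the sample rows, the field names, and the helpers to_pair and group_by_key are assumptions made for illustration. On a real RDD the same two steps would be rdd.map(...) followed by groupByKey().

```python
from collections import defaultdict

# Hypothetical sample rows mirroring the introduction:
# id 2 lacks category B, id 3 lacks categories A and C.
rows = [
    {"id": 1, "category": "A"}, {"id": 1, "category": "B"}, {"id": 1, "category": "C"},
    {"id": 2, "category": "A"}, {"id": 2, "category": "C"},
    {"id": 3, "category": "B"},
    {"id": 4, "category": "A"}, {"id": 4, "category": "B"}, {"id": 4, "category": "C"},
    {"id": 5, "category": "A"}, {"id": 5, "category": "B"}, {"id": 5, "category": "C"},
]


def to_pair(row):
    """map() step: turn a row into a (key, value) pair keyed by id."""
    return (row["id"], row["category"])


def group_by_key(pairs):
    """Local stand-in for groupByKey(): collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


pairs = map(to_pair, rows)
grouped = group_by_key(pairs)
```

After grouping, every id maps to the list of categories that exist for it, so the gaps for id 2 and id 3 can be found by looking at a single group instead of scanning the whole dataset.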
2.2. Part2
groupByKey() triggers a Spark shuffle, and the partitions are rebuilt as shown on the left of the following figure. Without groupByKey(), the computer must search many partitions to find the rows for a given id. After groupByKey(), however, the computer only needs to search the one partition that holds id 2, because all rows with the same id are placed in the same partition.
In addition, the check and the fill-in described above are pipelined, so the execution time is reduced.
・Stage2: check and fill in missing data
map() -> flatMap()
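Stage 2 can be sketched in the same spirit. Again this is plain Python rather than Spark code, and several things are assumptions for illustration: the grouped input, the ALL_CATEGORIES constant, the helpers check_missing, fill_up, and flat_map, and the boolean flag that marks a row as filled in. On a real RDD the two steps would be map(...) followed by flatMap(...).

```python
# Hypothetical output of Stage 1: (id, categories present for that id).
grouped = [
    (1, ["A", "B", "C"]),
    (2, ["A", "C"]),
    (3, ["B"]),
    (4, ["A", "B", "C"]),
    (5, ["A", "B", "C"]),
]

# Assumption: the full set of expected categories is known in advance.
ALL_CATEGORIES = ("A", "B", "C")


def check_missing(group):
    """map() step: attach the categories that are absent for this id."""
    row_id, categories = group
    present = set(categories)
    missing = [c for c in ALL_CATEGORIES if c not in present]
    return (row_id, categories, missing)


def fill_up(checked):
    """flatMap() step: emit existing rows plus one filler row per missing category.

    The third field is a hypothetical flag: True means the row was filled in.
    """
    row_id, categories, missing = checked
    rows = [(row_id, c, False) for c in categories]
    rows += [(row_id, c, True) for c in missing]
    return sorted(rows)


def flat_map(func, items):
    """Local stand-in for flatMap(): flatten the lists returned by func."""
    for item in items:
        yield from func(item)


# map() and the generator are lazy, so each group passes through the check
# and the fill-in one after the other without an intermediate collection.
filled = list(flat_map(fill_up, map(check_missing, grouped)))
```

With this sample input, every id ends up with one row per category, and the rows for (2, B), (3, A), and (3, C) carry the filled-in flag.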