Mar 13, 2017

Fun Quiz on SparkR: How to Create Data Frame without Spark-CSV Package :)





1. Introduction

With the spark-csv package, we can write Spark applications that create a data frame directly from a CSV file, so the package is widely used.

Mahler's Symphony No. 5 will be performed by Eliahu Inbal and the Konzerthausorchester Berlin in Tokyo tonight. I arrived at the hall an hour before the concert started, so I came up with a fun quiz about creating data frames from CSV files without the package above :)

This article uses Spark 1.6.3.

2. Fun Quiz

In this chapter, I first show the source code that uses the spark-csv package, and then the quiz itself :)
Before starting the quiz, we need to prepare the following CSV file.

・ test_data.csv
20170228,scheme1,BermudanCallableSwap,JPY,21261339422
20170228,scheme2,BermudanCallableSwap,JPY,22759109989
20170228,scheme3,BermudanCallableSwap,JPY,21405741891
...
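If you want to generate the sample file yourself, a few lines of plain R will do. This is just a convenience sketch using the three rows shown above (the remaining rows, indicated by "...", are omitted here):

```r
# the three sample records shown above
lines <- c("20170228,scheme1,BermudanCallableSwap,JPY,21261339422",
           "20170228,scheme2,BermudanCallableSwap,JPY,22759109989",
           "20170228,scheme3,BermudanCallableSwap,JPY,21405741891")

# write them to test_data.csv in the working directory
writeLines(lines, "test_data.csv")
```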


2.1. Use Spark-CSV Package

First, here is the program that uses the spark-csv package. There are two important points: one is setting 'SPARKR_SUBMIT_ARGS' so that the spark-csv package is loaded, and the other is loading the CSV file directly as a data frame.

Setting 'SPARKR_SUBMIT_ARGS' and loading a CSV file as a data frame look like this.

・ set 'SPARKR_SUBMIT_ARGS' (com.databricks:spark-csv)
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.11:1.5.0" "sparkr-shell"')

・ load a csv file as data frame
df <- read.df(sqlContext, inputFilePath, source = "com.databricks.spark.csv", schema = schema)

The full source code is as follows.

・ source code
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.11:1.5.0" "sparkr-shell"')

# init SparkR
sc <- sparkR.init(appName = "visualization", master = "local[*]")
sqlContext <- sparkRSQL.init(sc)

# parameter
inputFilePath <- "C:/dev/R/test_data.csv"

# schema of data frame
schema <- structType(structField(x = "date", type = "integer", nullable = FALSE),
                     structField(x = "scheme_number", type = "string", nullable = FALSE),
                     structField(x = "product", type = "string", nullable = FALSE),
                     structField(x = "currency", type = "string", nullable = FALSE),
                     structField(x = "sensitivity", type = "double", nullable = FALSE))

# load a csv file as data frame
df <- read.df(sqlContext, inputFilePath, source = "com.databricks.spark.csv", schema = schema)


showDF(df)

2.2. Quiz

Next, here is a simple quiz. Enjoy!! :)

・ quiz
If you don't use the spark-csv package, how would you create a data frame with the following schema from test_data.csv?

・ schema
  date,scheme_number,product,currency,sensitivity

・ test_data.csv
  20170228,scheme1,BermudanCallableSwap,JPY,21261339422
  20170228,scheme2,BermudanCallableSwap,JPY,22759109989
  20170228,scheme3,BermudanCallableSwap,JPY,21405741891
  ...

The time limit is 5 minutes :)


3. Answer

The answer is as follows. It's a bit long-winded, but there is some fun in it :)
First, you prepare a record splitter.

・ record splitter
splitRecord <- function(record) {
  # force the C locale so string handling behaves consistently on worker processes
  Sys.setlocale("LC_ALL", "C")
  # split the comma-separated record into its five fields
  part <- strsplit(record, ",")[[1]]
  # convert each field to the type expected by the schema
  list(column1 = as.integer(part[1]),   # date
       column2 = part[2],               # scheme_number
       column3 = part[3],               # product
       column4 = part[4],               # currency
       column5 = as.double(part[5]))    # sensitivity
}
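Before wiring the splitter into Spark, you can check what it produces on a single record in plain R (no Spark required). This is a local sketch of the same logic, with the locale call dropped since it only matters on worker processes:

```r
# same splitting logic as splitRecord above, minus the locale call
splitRecordLocal <- function(record) {
  part <- strsplit(record, ",")[[1]]
  list(column1 = as.integer(part[1]),   # date
       column2 = part[2],               # scheme_number
       column3 = part[3],               # product
       column4 = part[4],               # currency
       column5 = as.double(part[5]))    # sensitivity
}

# try it on the first sample record from test_data.csv
result <- splitRecordLocal("20170228,scheme1,BermudanCallableSwap,JPY,21261339422")
str(result)
```

The `str()` output should show column1 as an integer, column5 as a numeric, and the rest as character fields, matching the schema's types.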

Then you create an RDD, split each record, and convert the result to a data frame. Note that textFile and lapply are not exported from SparkR, so they are accessed with the ::: operator.

・ load a csv file and create data frame
# load a csv file
orgData <- SparkR:::textFile(sc, inputFilePath)

# split data
splitData <- SparkR:::lapply(orgData, splitRecord)

# create data frame
df <- SparkR:::createDataFrame(sqlContext, splitData, schema)

The full source code is as follows :)

・ source code
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

# init SparkR
sc <- sparkR.init(appName = "visualization", master = "local[*]")
sqlContext <- sparkRSQL.init(sc)

# parameter
inputFilePath <- "C:/dev/R/test_data.csv"

# schema of data frame
schema <- structType(structField(x = "date", type = "integer", nullable = FALSE),
                     structField(x = "scheme_number", type = "string", nullable = FALSE),
                     structField(x = "product", type = "string", nullable = FALSE),
                     structField(x = "currency", type = "string", nullable = FALSE),
                     structField(x = "sensitivity", type = "double", nullable = FALSE))

# record splitter
splitRecord <- function(record) {
  # force the C locale so string handling behaves consistently on worker processes
  Sys.setlocale("LC_ALL", "C")
  # split the comma-separated record into its five fields
  part <- strsplit(record, ",")[[1]]
  # convert each field to the type expected by the schema
  list(column1 = as.integer(part[1]),   # date
       column2 = part[2],               # scheme_number
       column3 = part[3],               # product
       column4 = part[4],               # currency
       column5 = as.double(part[5]))    # sensitivity
}


# load a csv file
orgData <- SparkR:::textFile(sc, inputFilePath)


# split data
splitData <- SparkR:::lapply(orgData, splitRecord)


# create data frame
df <- SparkR:::createDataFrame(sqlContext, splitData, schema)


showDF(df)


4. Conclusion

If you enjoyed this quiz, I'll be pleased :) The concert will start soon!


・ Gustav Mahler    Symphony No. 5
・ Richard Wagner  Tristan und Isolde: Prelude and Liebestod
・ Konzerthausorchester Berlin, cond. Eliahu Inbal
・ Tokyo