1. Introduction
I often meet many lovers of R in financial investment banks. If they could process 'Big Data' with the R they love, they would be very happy :) SparkR makes this possible. In this article, I note how to enjoy SparkR.
・SparkR
https://github.com/apache/spark/tree/master/R
SparkR is included in Apache Spark. I use RStudio and Visual Studio as IDEs. The versions of Apache Spark, R, RStudio and Visual Studio are as follows.
Apache Spark 1.6.3
R 3.3.2
RStudio 1.0.136
Visual Studio Community 2015 Update 3 (Update 3 or later are necessary)
2. SparkR with IDE
I deployed Spark 1.6.3 as follows and set SPARK_HOME to C:\enjyoyspace\spark-1.6.3 (it can also be set from within R, as sketched after the directory listing). RStudio is often used as the IDE for running SparkR, but we can also use Visual Studio, and there are many lovers of Visual Studio. So, I note how to execute SparkR with both RStudio and Visual Studio.
C:\enjyoyspace\spark-1.6.3
├─bin
├─conf
├─data
├─ec2
├─examples
├─lib
├─licenses
├─python
├─R
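If SPARK_HOME is not already set at the OS level, it can also be set from inside the R session before loading SparkR. A minimal sketch with base R, assuming the deployment path above:
# set SPARK_HOME for the current R session (path from the deployment above)
Sys.setenv(SPARK_HOME = "C:/enjyoyspace/spark-1.6.3")
# confirm the setting
Sys.getenv("SPARK_HOME")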
2.1. Use RStudio
SparkR is loaded as follows. SparkR's APIs are easy and friendly for lovers of the R language, so the learning cost should be very low for them :)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
The full source code is as follows :)
# load SparkR
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sc <- sparkR.init(appName = "CalcProfitAndLoss")
sqlContext <- sparkRSQL.init(sc)
# create data frames (R) with shocked and non-shocked present values
shockedPV <- data.frame(date = c("20161001", "20161101", "20161224"), PV = c(10000000000, 10000001000, 10000002000))
nonShockedPV <- data.frame(date_nonShocked = c("20161001", "20161101", "20161224"), PV_nonShocked = c(9000000000, 9000001000, 9000002000))
# create data frames (SparkR) from the data frames (R)
shockedPVforSparkR <- createDataFrame(sqlContext, shockedPV)
nonShockedPVforSparkR <- createDataFrame(sqlContext, nonShockedPV)
# join the SparkR DataFrames on the date columns
masterPV <- join(shockedPVforSparkR, nonShockedPVforSparkR, shockedPVforSparkR$date == nonShockedPVforSparkR$date_nonShocked)
# register the joined DataFrame as a temporary table
registerTempTable(masterPV, "masterPV")
# SparkSQL: profit and loss = shocked PV - non-shocked PV
profitAndLossForSparkR <- sql(sqlContext, "SELECT date, PV - PV_nonShocked AS profit_and_loss FROM masterPV")
# collect query results to the driver as an R data frame
profitAndLoss <- collect(profitAndLossForSparkR)
# display collected results
print(profitAndLoss)
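Before collecting, the joined SparkR DataFrame can also be inspected directly, which is handy for checking the join, and the SparkContext can be stopped when the job is done. A minimal sketch using SparkR 1.6 functions (printSchema, head, sparkR.stop); this is not part of the script above:
# inspect the joined SparkR DataFrame before running SparkSQL
printSchema(masterPV)
head(masterPV)
# stop the SparkContext when the calculation is finished
sparkR.stop()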
2.2. Use Visual Studio Community
To execute SparkR with Visual Studio, it is necessary to install the following plugin. The version of Visual Studio must be 2015 Update 3 or later.
・R Tools for Visual Studio
https://microsoft.github.io/RTVS-docs/
We are able to use the script described in 2.1. There is nothing to change :)
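Before running the script in Visual Studio, it may be worth confirming in the R Interactive window that the session can see SPARK_HOME. A quick check with base R, assuming the deployment path from section 2:
# confirm that SPARK_HOME is visible from the R session in Visual Studio
Sys.getenv("SPARK_HOME")
# the SparkR package directory should exist under it
file.exists(file.path(Sys.getenv("SPARK_HOME"), "R", "lib", "SparkR"))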
3. Execution Result
The execution result is as follows. The profit and loss values are computed correctly :)
date profit_and_loss
1 20161224 1e+09
2 20161001 1e+09
3 20161101 1e+09
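The values appear as 1e+09 only because R prints large numbers in scientific notation by default; the raw differences are 10000000000 - 9000000000 = 1000000000 and so on. A minimal sketch with base R to display the full digits (just a presentation tweak, not part of the SparkR API):
# print the collected results without scientific notation
options(scipen = 15)
print(profitAndLoss)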
4. Conclusion
We can use the attractive tool 'SparkR' with Visual Studio or RStudio. If you use Visual Studio, the version must be 2015 Update 3 or later.