Feb 18, 2017

SparkR with Visual Studio and RStudio





1. Introduction

I often meet R enthusiasts in investment banks. If they could process 'Big Data' with the R they love, they would be very happy :) SparkR makes that possible. In this article, I note how to get started with SparkR.

・SparkR
 https://github.com/apache/spark/tree/master/R

SparkR is included in Apache Spark. I use RStudio and Visual Studio as IDEs. The versions of Apache Spark, R, Visual Studio and RStudio are as follows.

Apache Spark 1.6.3
R 3.3.2
RStudio 1.0.136
Visual Studio Community 2015 Update 3 (Update 3 or later is required)

 

2. SparkR with IDE

I deployed Spark 1.6.3 as follows and set SPARK_HOME to C:\enjyoyspace\spark-1.6.3. RStudio is often used as the IDE for running SparkR, but Visual Studio also works and has many fans of its own. So, I note how to run SparkR with both RStudio and Visual Studio.

C:\enjyoyspace\spark-1.6.3
    ├─bin
    ├─conf
    ├─data
    ├─ec2
    ├─examples
    ├─lib
    ├─licenses
    ├─python
    ├─R
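
If SPARK_HOME is not set at the OS level, it can also be set for the current R session only. The following is a minimal sketch, not part of the original script: the helper name sparkr_lib_dir is hypothetical, and the forward-slash path assumes the same install location as above.

```r
# Hypothetical helper: returns the SparkR library directory under a Spark install.
sparkr_lib_dir <- function(spark_home = Sys.getenv("SPARK_HOME")) {
  if (!nzchar(spark_home)) {
    stop("SPARK_HOME is not set")
  }
  file.path(spark_home, "R", "lib")
}

# Assumption: set SPARK_HOME for this session only when the OS has not set it.
if (!nzchar(Sys.getenv("SPARK_HOME"))) {
  Sys.setenv(SPARK_HOME = "C:/enjyoyspace/spark-1.6.3")
}

lib_dir <- sparkr_lib_dir()

# TRUE when the SparkR package folder is present under R/lib
print(dir.exists(file.path(lib_dir, "SparkR")))
```

If the last line prints FALSE, the path probably does not point at the root of the Spark distribution.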


2.1. Use RStudio

SparkR is loaded as follows. SparkR's APIs are easy and friendly for R users, so the learning cost should be very low for them :)

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))


The full source code is as follows :)

# load SparkR
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

sc <- sparkR.init(appName = "CalcProhitAndLoss")
sqlContext <- sparkRSQL.init(sc)

# create data frame (R)
shockedPV <- data.frame(date = c("20161001", "20161101", "20161224"), PV = c(10000000000, 10000001000, 10000002000))
nonShockedPV <- data.frame(date_nonShocked = c("20161001", "20161101", "20161224"), PV_nonShocked = c(9000000000, 9000001000, 9000002000))

# create data frame (SparkR) from data frame (R)
shockedPVforSparkR <- createDataFrame(sqlContext,shockedPV)
nonShockedPVforSparkR <- createDataFrame(sqlContext,nonShockedPV)

# join the two SparkR DataFrames on date
masterPV <- join(shockedPVforSparkR, nonShockedPVforSparkR, shockedPVforSparkR$date == nonShockedPVforSparkR$date_nonShocked)

# register table
registerTempTable(masterPV, "masterPV")

# SparkSQL
prohitAndLossForSparkR <- sql(sqlContext, "SELECT date, PV-PV_nonShocked AS prohit_and_loss FROM masterPV")

# collect query results
prohitAndLoss <- collect(prohitAndLossForSparkR)

# display collected results
print(prohitAndLoss)


2.2. Use Visual Studio Community

To run SparkR with Visual Studio, you need to install the following plugin. The version of Visual Studio must be '2015 Update 3 or later'.

・R Tools for Visual Studio
 https://microsoft.github.io/RTVS-docs/

We can use the script described in 2.1 as-is. There is nothing to change :)


3. Execution Result

The execution result is as follows. The profit and loss is computed correctly for each date :) The rows come back in whatever order the join produces, so they are not sorted by date.

      date prohit_and_loss
1 20161224           1e+09
2 20161001           1e+09
3 20161101           1e+09
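
The values above can be cross-checked locally without Spark. The following is a minimal sketch in base R, assuming the same input data as in 2.1. Here merge() stands in for the SparkR join and plain column arithmetic stands in for the SparkSQL query, so this is a sanity check rather than what the script actually runs. The name localCheck is introduced only for this illustration.

```r
# Same input data as in 2.1, kept as plain R data frames
shockedPV <- data.frame(date = c("20161001", "20161101", "20161224"),
                        PV = c(10000000000, 10000001000, 10000002000),
                        stringsAsFactors = FALSE)
nonShockedPV <- data.frame(date_nonShocked = c("20161001", "20161101", "20161224"),
                           PV_nonShocked = c(9000000000, 9000001000, 9000002000),
                           stringsAsFactors = FALSE)

# Local equivalent of the join: match date against date_nonShocked
masterPV <- merge(shockedPV, nonShockedPV, by.x = "date", by.y = "date_nonShocked")

# Local equivalent of "SELECT date, PV-PV_nonShocked AS prohit_and_loss"
localCheck <- data.frame(date = masterPV$date,
                         prohit_and_loss = masterPV$PV - masterPV$PV_nonShocked,
                         stringsAsFactors = FALSE)

# Every row should show 1e+09, here sorted by date because merge() sorts on the key
print(localCheck)
```

Unlike the SparkR result, the rows here are ordered by date, because merge() sorts on the join key by default.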


4. Conclusion

We can use the attractive tool 'SparkR' with either Visual Studio or RStudio. If you use Visual Studio, it must be '2015 Update 3 or later'.