Dec 4, 2016

A Point to Note When Upgrading Apache Spark to 2.0 and Running Applications




1. Introduction

Upgrading Spark to 2.0 is worthwhile, because Spark 2.0's Tungsten engine is faster than Spark 1.6's, as described in the 'Faster' section of the following Databricks blog post.

・Technical Preview of Apache Spark 2.0 Now on Databricks: Easier, Faster, and Smarter
 https://databricks.com/blog/2016/05/11/apache-spark-2-0-technical-preview-easier-faster-and-smarter.html

However, when I executed a Spark 2.0 application, the spark-submit command that I had used for Spark 1.6 applications no longer worked. The purpose of this article is to show the cause and its solution.

2. Problem

My Spark version was 1.6.0. Last week I upgraded it to 2.0.1, rebuilt my application against Spark 2.0.1, and deployed it. Then I executed the following spark-submit command.

spark-submit --master yarn-client --class xxxxx --num-executors 8 --executor-memory 16G --executor-cores 20 yyyyy.jar ... 

But the following error occurred.

16/12/01 19:35:21 WARN datasources.DataSource: Error while looking for metadata directory.
org.apache.spark.SparkException: Unable to create database default as failed to create its directory hdfs://namenode:9000/usr/spark-2.0.1-bin-hadoop2.7/spark-warehouse

...

3. Solution

In Spark 2.0, a warehouse directory is used to store tables, and a new property, 'spark.sql.warehouse.dir', was added to set the warehouse location. If the default file system is a local file system, the above error does not occur. My default file system was set to HDFS, so Spark tried to create hdfs://namenode:9000/usr/spark-2.0.1-bin-hadoop2.7/spark-warehouse and failed.
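
If you are not sure which default file system is in effect, you can query the Hadoop configuration (this assumes the hdfs client is available on the machine):

hdfs getconf -confKey fs.defaultFS

If this prints an hdfs:// URI, Spark 2.0 derives its default warehouse location from HDFS as described above.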

I added the following option to the spark-submit command to explicitly set a local directory path as the warehouse path.

--conf spark.sql.warehouse.dir=file:///c:\temp\spark-warehouse 


The full spark-submit command is as follows. I executed it and confirmed that 'c:\temp\spark-warehouse' was created and the job terminated successfully.

spark-submit --master yarn-client --conf spark.sql.warehouse.dir=file:///c:\temp\spark-warehouse --class xxxxx --num-executors 8 --executor-memory 16G --executor-cores 20 yyyyy.jar ... 
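
The warehouse location can also be set programmatically when the SparkSession is built, which may be convenient if you cannot change the submit command. The following Scala sketch is only illustrative; the application name and table are made up for the example, and forward slashes are used in the URI so the backslashes are not treated as escape characters:

import org.apache.spark.sql.SparkSession

// Point the SQL warehouse at a local directory explicitly instead of
// letting Spark derive it from the default (HDFS) file system.
val spark = SparkSession.builder()
  .appName("warehouse-dir-example")
  .config("spark.sql.warehouse.dir", "file:///c:/temp/spark-warehouse")
  .getOrCreate()

// Any catalog operation (such as creating a table in the default
// database) now uses the local warehouse path.
spark.sql("CREATE TABLE IF NOT EXISTS sample (id INT) USING parquet")

spark.stop()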

4. Conclusion

If your default file system is set to HDFS and you execute a Spark 2.0 application, you need to explicitly set a local directory path as the default warehouse path via 'spark.sql.warehouse.dir'.
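
As an alternative to passing --conf on every submit, the same property can be set once in conf/spark-defaults.conf under the Spark installation directory, for example (forward slashes avoid properties-file escaping):

spark.sql.warehouse.dir    file:///c:/temp/spark-warehouse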