1. Introduction
Upgrading Spark to 2.0 is well worth it: as the following Databricks blog post explains, the Tungsten engine in Spark 2.0 performs significantly better than that of 1.6.
・Technical Preview of Apache Spark 2.0 Now on Databricks: Easier, Faster, and Smarter
https://databricks.com/blog/2016/05/11/apache-spark-2-0-technical-preview-easier-faster-and-smarter.html
However, when I ran a Spark 2.0 application, the spark-submit script that I had been using with Spark 1.6 no longer worked. The purpose of this article is to show the solution.
2. Problem
My Spark version was 1.6.0, and I upgraded it to 2.0.1 last week. I built my application against Spark 2.0.1, deployed it, and then executed the following spark-submit script.
spark-submit --master yarn-client --class xxxxx --num-executors 8 --executor-memory 16G --executor-cores 20 yyyyy.jar ...
But the following error occurred.
16/12/01 19:35:21 WARN datasources.DataSource: Error while looking for metadata directory.
org.apache.spark.SparkException: Unable to create database default as failed to create its directory hdfs://namenode:9000/usr/spark-2.0.1-bin-hadoop2.7/spark-warehouse
...
org.apache.spark.SparkException: Unable to create database default as failed to create its directory hdfs://namenode:9000/usr/spark-2.0.1-bin-hadoop2.7/spark-warehouse
...
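For reference, the error surfaced as soon as the application initialized Spark SQL. The following is a minimal sketch of the kind of code that triggers it (the object name and the query are illustrative, not my actual application):

import org.apache.spark.sql.SparkSession

object WarehouseRepro {
  def main(args: Array[String]): Unit = {
    // Building a SparkSession and issuing any Spark SQL statement makes
    // Spark initialize its catalog, which tries to create the "default"
    // database directory under spark.sql.warehouse.dir.
    val spark = SparkSession.builder()
      .appName("warehouse-repro")
      .getOrCreate()

    // This first catalog access is where the SparkException above was thrown.
    spark.sql("SHOW TABLES").show()

    spark.stop()
  }
}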
3. Solution
In Spark 2.0, a warehouse directory is used to store tables, and a new property, 'spark.sql.warehouse.dir', was added to set the warehouse location. If the default file system is a local file system, the above error does not occur.
My default file system was set to HDFS, so Spark tried to create hdfs://namenode:9000/usr/spark-2.0.1-bin-hadoop2.7/spark-warehouse and failed.
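If you want to confirm which file system is the default on your own cluster, Hadoop's getconf tool reports it:

hdfs getconf -confKey fs.defaultFS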
I added the following option to the spark-submit script to explicitly set a local directory path as the warehouse path.
--conf spark.sql.warehouse.dir=file:///c:\temp\spark-warehouse
The full spark-submit script follows. I executed this script and confirmed that 'c:\temp\spark-warehouse' was created and that the job terminated successfully.
spark-submit --master yarn-client --conf spark.sql.warehouse.dir=file:///c:\temp\spark-warehouse --class xxxxx --num-executors 8 --executor-memory 16G --executor-cores 20 yyyyy.jar ...
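If you control the application code, the same property can also be set programmatically when building the SparkSession, instead of passing --conf on every submit. A minimal sketch (the object name is illustrative; use the same path as in the script above):

import org.apache.spark.sql.SparkSession

object WarehouseConfigured {
  def main(args: Array[String]): Unit = {
    // Setting spark.sql.warehouse.dir here has the same effect as the
    // --conf option in the spark-submit script above. The path is shown
    // with forward slashes; adjust it to match your environment.
    val spark = SparkSession.builder()
      .appName("warehouse-configured")
      .config("spark.sql.warehouse.dir", "file:///c:/temp/spark-warehouse")
      .getOrCreate()

    spark.sql("SHOW TABLES").show()
    spark.stop()
  }
}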