How to get rid of derby.log, metastore_db from Spark Shell - apache-spark

When running spark-shell it creates a file derby.log and a folder metastore_db. How do I configure spark to put these somewhere else?
For the derby log, I've tried the approach from Getting rid of derby.log, like so: spark-shell --driver-memory 10g --conf "-spark.driver.extraJavaOptions=Dderby.stream.info.file=/dev/null", with a couple of different properties, but Spark ignores them.
Does anyone know how to get rid of these or specify a default directory for them?

The use of hive.metastore.warehouse.dir has been deprecated since Spark 2.0.0,
see the docs.
As hinted by this answer, the real culprit for both the metastore_db directory and the derby.log file being created in every working directory is the derby.system.home property, which defaults to . (the current directory).
Thus, a default location for both can be specified by adding the following line to spark-defaults.conf:
spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby
where /tmp/derby can be replaced by the directory of your choice.
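To confirm from PySpark that the option actually reached the driver, here is a minimal sketch (it assumes the spark-defaults.conf line above is in place):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Should contain -Dderby.system.home=/tmp/derby if spark-defaults.conf was picked up.
print(spark.sparkContext.getConf().get("spark.driver.extraJavaOptions"))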

For spark-shell, to avoid having the metastore_db directory, and to avoid doing it in code (since the context/session is already created and you won't stop it and recreate it with the new configuration each time), you have to set its location in a hive-site.xml file and copy that file into the Spark conf directory.
A sample hive-site.xml file that puts metastore_db in /tmp (refer to my answer here):
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/tmp/metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/tmp/</value>
    <description>location of default database for the warehouse</description>
  </property>
</configuration>
After that, you can start your spark-shell as follows to get rid of derby.log as well:
$ spark-shell --conf "spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp"

Try setting derby.system.home to some other directory as a system property before firing up the Spark shell. Derby will create new databases there. The default value for this property is . (the current directory).
Reference: https://db.apache.org/derby/integrate/plugin_help/properties.html

Use the hive.metastore.warehouse.dir property (superseded in Spark 2.0.0 by spark.sql.warehouse.dir, as used below). From the docs:
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
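For PySpark, a roughly equivalent sketch (warehouse_location is a placeholder path of your choosing):
from pyspark.sql import SparkSession

# Point the Spark SQL warehouse at a directory of your choice instead of the default.
warehouse_location = "/tmp/spark-warehouse"
spark = SparkSession.builder \
    .appName("Spark Hive Example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()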
For the derby log: Getting rid of derby.log could be the answer. In general, create a derby.properties file in your working directory with the following content:
derby.stream.error.file=/path/to/desired/log/file
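If you prefer to create that file programmatically, a minimal Python sketch (the log path is a placeholder):
# Write derby.properties into the directory Spark is launched from, which is
# where embedded Derby looks for it by default.
with open("derby.properties", "w") as f:
    f.write("derby.stream.error.file=/tmp/derby.log\n")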

For me, setting the Spark property didn't work, on either the driver or the executor. So, after searching for this issue, I ended up setting the property on the JVM system instead with:
System.setProperty("derby.system.home", "D:\\tmp\\derby")
val spark: SparkSession = SparkSession.builder
.appName("UT session")
.master("local[*]")
.enableHiveSupport
.getOrCreate
[...]
And that finally got rid of those annoying items for me.

If you are using Jupyter/JupyterHub/JupyterLab, or are just setting this conf parameter inside Python, the following will work:
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
.setMaster("local[*]")
.set('spark.driver.extraJavaOptions','-Dderby.system.home=/tmp/derby')
)
sc = SparkContext(conf = conf)

I used the configuration below for a PySpark project; I was able to set up the Spark warehouse DB and the Derby DB in a chosen path, and so avoid having them created in the current directory.
from pyspark.sql import SparkSession
from os.path import abspath

location = abspath(r"C:\self\demo_dbx\data\spark-warehouse")  # Path where you want to set up the Spark warehouse
local_spark = SparkSession.builder \
    .master("local[*]") \
    .appName('Spark_Dbx_Session') \
    .config("spark.sql.warehouse.dir", location) \
    .config("spark.driver.extraJavaOptions",
            f"-Dderby.system.home={location}") \
    .getOrCreate()

Related

pyspark connection to MariaDB fails with ClassNotFoundException

I'm trying to retrieve data from MariaDB with pyspark.
I created the Spark session with configuration to include the JDBC jar file, but couldn't solve the problem. The current code to create the session looks like below.
path = "hdfs://nameservice1/user/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
# or path = "/home/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
spark = SparkSession.builder\
    .config("spark.jars", path)\
    .config("spark.driver.extraClassPath", path)\
    .config("spark.executor.extraClassPath", path)\
    .enableHiveSupport()\
    .getOrCreate()
Note that I've tried every configuration variant I know of
(checking permissions, changing the directory between HDFS and local, adding or removing configuration ...).
And then, the code to load the data is:
sql = "SOME_SQL_TO_RETRIEVE_DATA"
df = spark.read.format('jdbc')\
    .option('dbtable', sql)\
    .option('url', 'jdbc:mariadb://{host}:{port}/{db}')\
    .option("user", SOME_USER)\
    .option("password", SOME_PASSWORD)\
    .option("driver", 'org.mariadb.jdbc.Driver')\
    .load()
But it fails with java.lang.ClassNotFoundException: org.mariadb.jdbc.Driver
When I tried this with spark-submit, I saw this log message:
... INFO SparkContext: Added Jar /PATH/TO/JDBC/mariadb-java-client-2.7.1.jar at spark://SOME_PATH/jars/mariadb-java-client-2.7.1.jar with timestamp SOME_TIMESTAMP
What is wrong?
For anyone who suffers from the same problem:
I figured it out. The Spark documentation says:
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
So instead of setting the configuration in Python code, I added the arguments to spark-submit, following that document:
spark-submit {other arguments ...} \
--driver-class-path PATH/TO/JDBC/my-jdbc.jar \
--jars PATH/TO/JDBC/my-jdbc.jar \
MY_PYTHON_SCRIPT.py
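If you launch PySpark from a plain Python process or a notebook rather than through spark-submit, a commonly used alternative is the PYSPARK_SUBMIT_ARGS environment variable. A sketch under that assumption, reusing the placeholder jar path from the question (the value must end with pyspark-shell and be set before the first session is created):
import os

# Placeholder jar path; adjust to your environment.
jar = "/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    f"--jars {jar} --driver-class-path {jar} pyspark-shell"
)

from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()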

SparkSession Application Source Code Config Properties not Overriding JupyterHub & Zeppelin on AWS EMR defaults

I have the Spark driver set up to use Zeppelin and/or JupyterHub as a client for interactive Spark programming on AWS EMR. However, when I create the SparkSession with custom config properties (application name, number of cores, executor RAM, number of executors, serializer, etc.) it is not overriding the default values for those configs (confirmed under the Environment tab in the Spark UI and via spark.conf.get(...)).
Like any Spark app, these clients on EMR should be using my custom config properties, because SparkSession code takes the highest precedence, ahead of spark-submit flags, the Spark config file, and then spark-defaults. JupyterHub also immediately launches a Spark application without my coding one, even when just running an empty cell.
Is there a setting specific to Zeppelin, JupyterHub, or a separate XML conf that needs adjusting to get custom configs recognized and working? Any help is much appreciated.
Example of creating a basic application where these cluster resource configs should take effect, instead of the standard defaults, which is what happens with Zeppelin/JupyterHub on EMR.
# via zep or jup [configs NOT being recognized]
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("app_name")\
.master("yarn")\
.config("spark.submit.deployMode","client")\
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
.config("spark.executor.instances", 11)\
.config("spark.executor.cores", 5)\
.config("spark.executor.memory", "19g")\
.getOrCreate()
# via ssh terminal [configs ARE recognized at run-time]
pyspark \
--name "app_name" \
--master yarn \
--deploy-mode client \
--num-executors 11 \
--executor-cores 5 \
--executor-memory 19g \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"
Found a solution. The config.json file under /etc/jupyter/conf had some default Spark config values, so I removed them to leave an empty key/value, i.e. "_configs":{}. Creating a custom SparkSession via JupyterHub now picks up the specified cluster configs.
The %%configure magic command (from https://github.com/jupyter-incubator/sparkmagic) also always works; see the sketch below.
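A sketch of how %%configure is typically used with sparkmagic; the field names follow Livy's session API, so verify them against your sparkmagic version:
%%configure -f
{"numExecutors": 11, "executorCores": 5, "executorMemory": "19g",
 "conf": {"spark.serializer": "org.apache.spark.serializer.KryoSerializer"}}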

Unable to write data on hive using spark

I am using Spark 1.6. I am creating a HiveContext from the Spark context. When I save the data into Hive it gives an error. I am using the Cloudera VM. My Hive is inside the Cloudera VM and Spark is on my system. I can access the VM using its IP. I have started the thrift server and hiveserver2 on the VM, and I have used the thrift server URI for hive.metastore.uris.
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.metastore.uris", "thrift://IP:9083")
............
............
df.write.mode(SaveMode.Append).insertInto("test")
I get the following error:
FAILED: SemanticException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Probably hive-site.xml is not available inside the Spark conf folder; I have added the details below.
Add hive-site.xml to the Spark configuration folder by creating a symlink that points to hive-site.xml in the Hive configuration folder:
sudo ln -s /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf/hive-site.xml
After the above steps, restarting spark-shell should help.

configuration for `spark.hadoop.fs.s3` gets applied to `fs.s3a` not `fs.s3`

I have read the answer posted on how to configure S3 access keys for Dataproc, but I find it unsatisfactory.
The reason is that when I follow the steps and set the Hadoop conf for spark.hadoop.fs.s3, an s3://... path still has access issues, whereas an s3a://... path works. A test spark-shell run can be seen below.
s3 vs s3n vs s3a is a topic of its own, although I guess we don't have to worry about s3n. But I find it strange that configuration applied to s3 shows up under s3a.
Here are my questions:
Is this Dataproc or Spark? I'm assuming Spark, considering spark-shell is having this issue.
Is there a way to configure s3 from spark-submit conf flags without a code change?
Is this a bug, or are we now preferring s3a over s3?
Thanks,
$ spark-shell --conf spark.hadoop.fs.s3.awsAccessKeyId=CORRECT_ACCESS_KEY \
> --conf spark.hadoop.fs.s3.awsSecretAccessKey=SECRETE_KEY
// Try to read existing path, which breaks...
scala> spark.read.parquet("s3://bucket/path/to/folder")
17/06/01 16:19:58 WARN org.apache.spark.sql.execution.datasources.DataSource: Error while looking for metadata directory.
java.io.IOException: /path/to/folder doesn't exist
at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:170)
...
// notice `s3` not `s3a`
scala> spark.conf.getAll("spark.hadoop.fs.s3.awsAccessKeyId")
res3: String = CORRECT_ACCESS_KEY
scala> spark.conf.getAll("fs.s3.awsAccessKeyId")
java.util.NoSuchElementException: key not found: fs.s3.awsAccessKeyId
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
... 48 elided
scala> sc
res5: org.apache.spark.SparkContext = org.apache.spark.SparkContext#426bf2f2
scala> sc.hadoopConfiguration
res6: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, file:/etc/hive/conf.dist/hive-site.xml
scala> sc.hadoopConfiguration.get("fs.s3.access.key")
res7: String = null <--- ugh... wtf?
scala> sc.hadoopConfiguration.get("fs.s3n.access.key")
res10: String = null <--- I understand this...
scala> sc.hadoopConfiguration.get("fs.s3a.access.key")
res8: String = CORRECT_ACCESS_KEY <--- But what is this???
// Successful file read
scala> spark.read.parquet("s3a://bucket/path/to/folder")
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
res9: org.apache.spark.sql.DataFrame = [whatev... ... 22 more fields]
There's some magic in spark-submit which picks up your AWS_ environment variables and sets them for the {s3, s3n, s3a} filesystems; that may be what's happening under the hood.
There is no magic copying of settings between the s3a and the s3/s3n options in the Hadoop JARs, or anywhere else, so it may be one of the -site.xml files that is defining it.
Do use s3a:// if you have the Hadoop 2.7.x JARs on your classpath anyway: better performance, V4 API auth (mandatory for some regions), and ongoing maintenance.
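Following that advice, a PySpark sketch that sets the credentials explicitly under the s3a scheme (placeholder key values; _jsc is PySpark's internal handle to the JVM context, so treat this as a workaround rather than a public API):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Set the keys on the Hadoop configuration under fs.s3a.*, which is what the
# S3A connector in the Hadoop 2.7.x JARs actually reads.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "CORRECT_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "SECRET_KEY")

df = spark.read.parquet("s3a://bucket/path/to/folder")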

Spark 2.1 - Error While instantiating HiveSessionState

With a fresh install of Spark 2.1, I am getting an error when executing the pyspark command.
Traceback (most recent call last):
File "/usr/local/spark/python/pyspark/shell.py", line 43, in <module>
spark = SparkSession.builder\
File "/usr/local/spark/python/pyspark/sql/session.py", line 179, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/local/spark/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"
I have Hadoop and Hive on the same machine. Hive is configured to use MySQL for the metastore. I did not get this error with Spark 2.0.2.
Can someone please point me in the right direction?
I was getting the same error in a Windows environment, and the trick below worked for me.
In shell.py the Spark session is defined with .enableHiveSupport():
spark = SparkSession.builder\
.enableHiveSupport()\
.getOrCreate()
Remove Hive support and redefine the Spark session as below:
spark = SparkSession.builder\
.getOrCreate()
You can find shell.py in your Spark installation folder; for me it's in "C:\spark-2.1.1-bin-hadoop2.7\python\pyspark".
Hope this helps.
I had the same problem. Some of the suggested answers, such as sudo chmod -R 777 /tmp/hive/ or downgrading Spark with Hadoop to 2.6, didn't work for me.
I realized that what caused this problem for me was that I was doing SQL queries using the sqlContext instead of using the sparkSession.
sparkSession = SparkSession.builder.master("local[*]").appName("appName").config("spark.sql.warehouse.dir", "./spark-warehouse").getOrCreate()
sqlCtx.registerDataFrameAsTable(..)
df = sparkSession.sql("SELECT ...")
This works perfectly for me now.
Spark 2.1.0: when I run it in yarn client mode I don't see this issue, but yarn cluster mode gives "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState'".
Still looking for an answer.
The issue for me was solved by disabling the HADOOP_CONF_DIR environment variable. It was pointing to the Hadoop configuration directory, and while starting the pyspark shell, the variable caused Spark to try to reach a Hadoop cluster that wasn't started.
So if you have the HADOOP_CONF_DIR variable set, you have to start the Hadoop cluster before using the Spark shells,
or you need to unset the variable, for example as sketched below.
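A minimal sketch of the second option, assuming you create the session from a plain Python process so the driver JVM has not been launched yet:
import os

# Remove HADOOP_CONF_DIR for this process only, so the driver JVM spawned by
# PySpark does not pick up the external Hadoop configuration.
os.environ.pop("HADOOP_CONF_DIR", None)

from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()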
You are missing the spark-hive jar.
For example, if you are running on Scala 2.11, with Spark 2.1, you can use this jar.
https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.11/2.1.0
I saw this error on a new (2018) Mac, which came with Java 10. The fix was to set JAVA_HOME to Java 8:
export JAVA_HOME=`/usr/libexec/java_home -v 1.8`
I too was struggling in cluster mode. I added hive-site.xml to the Spark conf directory; if you have an HDP cluster, it should be at /usr/hdp/current/spark2-client/conf. It's working for me.
I was getting this error when trying to run pyspark and spark-shell while my HDFS wasn't started.
I removed ".enableHiveSupport()\" from the shell.py file and it works perfectly:
/*****Before********/
spark = SparkSession.builder\
.enableHiveSupport()\
.getOrCreate()
/*****After********/
spark = SparkSession.builder\
.getOrCreate()
/*************************/
Project location and file permissions could be the issue. I observed this error happening in spite of changes to my pom file. Then I changed my project's directory to a user directory where I have full permissions, and this solved my issue.
