Spark 2.1 - Error While instantiating HiveSessionState - apache-spark

With a fresh install of Spark 2.1, I am getting an error when executing the pyspark command.
Traceback (most recent call last):
File "/usr/local/spark/python/pyspark/shell.py", line 43, in <module>
spark = SparkSession.builder\
File "/usr/local/spark/python/pyspark/sql/session.py", line 179, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/local/spark/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"
I have Hadoop and Hive on the same machine. Hive is configured to use MySQL for the metastore. I did not get this error with Spark 2.0.2.
Can someone please point me in the right direction?

I was getting the same error in a Windows environment, and the trick below worked for me.
In shell.py the Spark session is defined with .enableHiveSupport():
spark = SparkSession.builder\
.enableHiveSupport()\
.getOrCreate()
Remove the Hive support and redefine the Spark session as below:
spark = SparkSession.builder\
.getOrCreate()
You can find shell.py in your Spark installation folder; for me it's in "C:\spark-2.1.1-bin-hadoop2.7\python\pyspark".
Hope this helps.

I had the same problem. Some of the suggested fixes, such as sudo chmod -R 777 /tmp/hive/ or downgrading Spark-with-Hadoop to 2.6, didn't work for me.
I realized that what caused this problem for me was that I was running SQL queries through the sqlContext instead of through the sparkSession:
sparkSession = SparkSession.builder.master("local[*]").appName("appName").config("spark.sql.warehouse.dir", "./spark-warehouse").getOrCreate()
sqlCtx.registerDataFrameAsTable(..)
df = sparkSession.sql("SELECT ...")
This works perfectly for me now.
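For reference, here's a minimal self-contained sketch of the SparkSession-only flow; the table and column names are just illustrative, and createOrReplaceTempView plays the role of the old sqlContext.registerDataFrameAsTable call:
from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.master("local[*]").appName("appName").getOrCreate()
# build a small DataFrame and register it as a temp view on the same session
df_in = sparkSession.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df_in.createOrReplaceTempView("my_table")
# run the query through the session, not through a separate sqlContext
df = sparkSession.sql("SELECT id, value FROM my_table")
df.show()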

Spark 2.1.0 - when I run it in yarn client mode I don't see this issue, but yarn cluster mode gives "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':".
Still looking for answer.

The issue for me was solved by unsetting the HADOOP_CONF_DIR environment variable. It was pointing to the Hadoop configuration directory, and when starting the pyspark shell the variable caused Spark to try to connect to a Hadoop cluster that had never been started.
So if you have the HADOOP_CONF_DIR variable set, you either have to start the Hadoop cluster before using the Spark shells, or you need to unset the variable, for example as sketched below.
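A minimal sketch of the second option, assuming you launch PySpark from your own Python script rather than via the pyspark command; the variable has to be removed before the SparkContext (and therefore the JVM) is created:
import os

# drop HADOOP_CONF_DIR before the JVM is launched, so Spark does not try to
# reach a Hadoop cluster that was never started
os.environ.pop("HADOOP_CONF_DIR", None)

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("no-hadoop-conf").getOrCreate()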

You are missing the spark-hive jar.
For example, if you are running Spark 2.1 built for Scala 2.11, you can use this jar:
https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.11/2.1.0
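If you would rather not place the jar by hand, here is a sketch of pulling it in by its Maven coordinates from PySpark; the coordinates are taken from the link above, the snippet assumes no SparkContext exists yet, and whether this is needed at all depends on how your Spark distribution was built:
from pyspark.sql import SparkSession

# spark.jars.packages is resolved at JVM launch, so it must be set before the
# first SparkContext/SparkSession is created
spark = SparkSession.builder \
    .appName("hive-enabled") \
    .config("spark.jars.packages", "org.apache.spark:spark-hive_2.11:2.1.0") \
    .enableHiveSupport() \
    .getOrCreate()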

I saw this error on a new (2018) Mac, which came with Java 10. The fix was to set JAVA_HOME to Java 8:
export JAVA_HOME=`/usr/libexec/java_home -v 1.8`

I too was struggling in cluster mode. I added hive-site.xml to the Spark conf directory; if you have an HDP cluster, it should be at /usr/hdp/current/spark2-client/conf. It's working for me.

I was getting this error trying to run pyspark and spark-shell when my HDFS wasn't started.

I removed ".enableHiveSupport()\" from the shell.py file and it's working perfectly.
/*****Before********/
spark = SparkSession.builder\
.enableHiveSupport()\
.getOrCreate()
/*****After********/
spark = SparkSession.builder\
.getOrCreate()
/*************************/

Project location and file permissions could be the issue. I observed this error happening in spite of changes to my pom file. I then moved my project to a user directory where I have full permissions, and that solved my issue.

Related

org.apache.spark.SparkException: Writing job aborted on Databricks

I have used Databricks to ingest data from an Event Hub and process it in real time with PySpark Streaming. The code works fine, but after this line:
df.writeStream.trigger(processingTime='100 seconds').queryName("myquery")\
.format("console").outputMode('complete').start()
I'm getting the following error:
org.apache.spark.SparkException: Writing job aborted.
Caused by: java.io.InvalidClassException: org.apache.spark.eventhubs.rdd.EventHubsRDD; local class incompatible: stream classdesc
I have read that this could be due to low processing power, but I am using a Standard_F4 machine, standard cluster mode with autoscaling enabled.
Any ideas?
This looks like a JAR issue. Go to your JARs folder in Spark and check whether you have multiple jars for azure-eventhubs-spark_XXX.XX. I think you've downloaded different versions of it and placed them there; you should remove any JAR with that name from your collection. This error may also occur if your JAR version is incompatible with other JARs. Try adding Spark jars using the Spark config:
spark = SparkSession \
.builder \
.appName('my-spark') \
.config('spark.jars.packages', 'com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.12') \
.getOrCreate()
This way spark will download JAR files through maven.

Not able to load Twitter jar file in Spark cluster

I am working on a basic Spark Twitter application, but I am not able to load the Twitter jar files into the Spark cluster.
spark-shell --jars /usr/local/Twitter/spark-streaming-twitter_2.11-2.0.1.jar,\
/usr/local/Twitter/twitter4j-core-4.0.6.jar,\
/usr/local/Twitter/twitter4j-stream-4.0.6.jar
I am using the above command to add the jar files to the Spark environment, but I am getting a file-not-found exception:
18/03/28 09:11:39 ERROR SparkContext: Failed to add
file:/usr/local/Twitter/spark-streaming-twitter_2.11-2.0.1.jar to
Spark environment java.io.FileNotFoundException: Jar
/usr/local/Twitter/spark-streaming-twitter_2.11-2.0.1.jar not found

saveAsTable ends in failure in Spark-yarn cluster environment

I set up a spark-yarn cluster environment, and try spark-SQL with spark-shell:
spark-shell --master yarn --deploy-mode client --conf spark.yarn.archive=hdfs://hadoop_273_namenode_ip:namenode_port/spark-archive.zip
One thing to mention is that Spark is running on Windows 7. After spark-shell starts up successfully, I execute the commands below:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val df_mysql_address = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://mysql_db_ip/db").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "ADDRESS").option("user", "root").option("password", "root").load()
scala> df_mysql_address.show
scala> df_mysql_address.write.format("parquet").saveAsTable("address_local")
"show" command returns result-set correctly, but the "saveAsTable" ends in failure. The error message says:
java.io.IOException: Mkdirs failed to create file:/C:/jshen.workspace/programs/spark-2.2.0-bin-hadoop2.7/spark-warehouse/address_local/_temporary/0/_temporary/attempt_20171018104423_0001_m_000000_0 (exists=false, cwd=file:/tmp/hadoop/nm-local-dir/usercache/hduser/appcache/application_1508319604173_0005/container_1508319604173_0005_01_000003)
I expected the table to be saved in the Hadoop cluster, but you can see that the directory (C:/jshen.workspace/programs/spark-2.2.0-bin-hadoop2.7/spark-warehouse) is a folder on my Windows 7 machine, not in HDFS, not even on the Hadoop Ubuntu machine.
What should I do? Please advise, thanks.
The way to get rid of the problem is to provide a "path" option prior to the "saveAsTable" operation, as shown below:
scala> df_mysql_address.write.option("path", "/spark-warehouse").format("parquet").saveAsTable("address_local")
Thanks #philantrovert.

Connect to Oracle DB using PySpark

I am trying to connect to an Oracle DB using PySpark.
spark_config = SparkConf().setMaster(config['cluster']).setAppName('sim_transactions_test').set("jars", "..\Lib\ojdbc7.jar")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
df_sim_input = self.sqlContext.read\
.format("jdbc")\
.option("driver", "oracle.jdbc.driver.OracleDriver")\
.option("url", config["db.url"])\
.option("dbtable", query)\
.option("user", config["db.user"])\
.option("password", config["db.password"])\
.load()
This gives me a
py4j.protocol.Py4JJavaError: An error occurred while calling o31.load.
: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
So it seems it cannot find the jar file in the SparkContext. It seems to be possible to load a PySpark shell with external jars, but I want to load them from the Python code.
Can someone explain to me how you can add this external jar from Python and make a query to an Oracle DB?
An extra question: how come the code works fine for a Postgres DB without adding an external JDBC driver? Is that because, if the driver is installed on your system, it will be found automatically?
You should probably also set driver-class-path, as jars sends the jar file only to the workers, not to the driver.
That said, you should be very careful when setting JVM configuration from Python code, as you need to make sure the JVM loads with it (you can't add it later). You can try setting PYSPARK_SUBMIT_ARGS, e.g.:
export PYSPARK_SUBMIT_ARGS="--jars jarname --driver-class-path jarname pyspark-shell"
This will tell pyspark to add these options when the JVM is loaded, the same as if you had added them on the command line.
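Alternatively, if you want to stay entirely in Python, here is a sketch of setting PYSPARK_SUBMIT_ARGS from the script itself; it must be set before the SparkContext (and hence the JVM) is created, and the ojdbc7.jar path is simply the one from the question, so adjust it to your environment:
import os

# must be set before the SparkContext is created, i.e. before the JVM starts
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars ../Lib/ojdbc7.jar --driver-class-path ../Lib/ojdbc7.jar pyspark-shell"
)

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(conf=SparkConf().setMaster("local[*]").setAppName("sim_transactions_test"))
sqlContext = SQLContext(sc)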

How to get rid of derby.log, metastore_db from Spark Shell

When running spark-shell it creates a file derby.log and a folder metastore_db. How do I configure spark to put these somewhere else?
For the derby log I've tried Getting rid of derby.log, like so: spark-shell --driver-memory 10g --conf "-spark.driver.extraJavaOptions=Dderby.stream.info.file=/dev/null", with a couple of different properties, but Spark ignores them.
Does anyone know how to get rid of these or specify a default directory for them?
The use of the hive.metastore.warehouse.dir is deprecated since Spark 2.0.0,
see the docs.
As hinted by this answer, the real culprit for both the metastore_db directory and the derby.log file being created in every working subdirectory is the derby.system.home property defaulting to . (the current directory).
Thus, a default location for both can be specified by adding the following line to spark-defaults.conf:
spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby
where /tmp/derby can be replaced by the directory of your choice.
For spark-shell, to avoid having the metastore_db directory, and to avoid doing it in code (since the context/session is already created and you won't stop it and recreate it with the new configuration each time), you have to set its location in the hive-site.xml file and copy this file into the Spark conf directory.
A sample hive-site.xml file that puts metastore_db in /tmp (refer to my answer here):
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=/tmp/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/tmp/</value>
<description>location of default database for the warehouse</description>
</property>
</configuration>
After that you can start your spark-shell as follows to get rid of derby.log as well:
$ spark-shell --conf "spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp"
Try setting derby.system.home to some other directory as a system property before firing up the Spark shell. Derby will create new databases there. The default value for this property is . (the current directory).
Reference: https://db.apache.org/derby/integrate/plugin_help/properties.html
Use the spark.sql.warehouse.dir property. From the docs:
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
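The PySpark equivalent is a near mirror of the Scala snippet above (warehouseLocation is just a path of your choosing):
from pyspark.sql import SparkSession

warehouseLocation = "/tmp/spark-warehouse"  # illustrative location
spark = SparkSession.builder \
    .appName("Spark Hive Example") \
    .config("spark.sql.warehouse.dir", warehouseLocation) \
    .enableHiveSupport() \
    .getOrCreate()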
For the derby log: Getting rid of derby.log could be the answer. In general, create a derby.properties file in your working directory with the following content:
derby.stream.error.file=/path/to/desired/log/file
For me, setting the Spark property didn't work, either on the driver or on the executor. So, searching for this issue, I ended up setting the property for my system instead with:
System.setProperty("derby.system.home", "D:\\tmp\\derby")
val spark: SparkSession = SparkSession.builder
.appName("UT session")
.master("local[*]")
.enableHiveSupport
.getOrCreate
[...]
And that finally got rid of those annoying items for me.
In case you are using Jupyter/JupyterHub/JupyterLab, or are just setting this conf parameter inside Python, the following will work:
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
.setMaster("local[*]")
.set('spark.driver.extraJavaOptions','-Dderby.system.home=/tmp/derby')
)
sc = SparkContext(conf = conf)
I used the configuration below for a PySpark project. I was able to set up the Spark warehouse and the Derby db in a path of my choosing, and so avoided having them created in the current directory.
from pyspark.sql import SparkSession
from os.path import abspath
location = abspath(r"C:\self\demo_dbx\data\spark-warehouse")  # path where you want the Spark warehouse
local_spark = SparkSession.builder \
.master("local[*]") \
.appName('Spark_Dbx_Session') \
.config("spark.sql.warehouse.dir", location)\
.config("spark.driver.extraJavaOptions",
f"Dderby.system.home='{location}'")\
.getOrCreate()
