Zeppelin - Unable to instantiate SessionHiveMetaStoreClient - apache-spark

I am trying to get Zeppelin to work, but when I run a notebook twice, the second time it fails with "Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient" (full log at the end of the post).
It seems to be caused by a lock in the metastore that doesn't get removed. It is also commonly advised to use, for example, Postgres as the metastore database instead of the default embedded Derby, since that allows multiple users to run jobs in Zeppelin.
I created a Postgres DB and a hive-site.xml pointing to it. I added this file to the config folder of Zeppelin and also to the config folder of Spark. In the jdbc interpreter of Zeppelin I also added parameters similar to the ones in the hive-site.xml.
The problem persists, though.
Error log: http://pastebin.com/Jqf9cdtU
hive-site.xml: http://pastebin.com/RZdXHPX4

Try using the Thrift Server architecture in your Spark setup instead of working against a single-JVM instance of Hive, where you cannot create multiple sessions.
There are mainly three ways of connecting to Hive:
Single JVM - the metastore is stored locally in the warehouse, which doesn't allow multiple sessions
Multiple JVMs - each worker behaves as its own metastore
Thrift Server architecture - multiple users can access the SQL engine and parallelism can be achieved (see the JDBC client sketch after this list)
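For illustration, a minimal Scala sketch of a JDBC client talking to a running Spark Thrift Server; it assumes the server listens on the default localhost:10000 and that the Hive JDBC driver (org.apache.hive:hive-jdbc) is on the classpath:

import java.sql.DriverManager

object ThriftServerClient {
  def main(args: Array[String]): Unit = {
    // Assumes the Thrift Server was started with sbin/start-thriftserver.sh
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "user", "")
    try {
      val rs = conn.createStatement().executeQuery("SHOW TABLES")
      while (rs.next()) println(rs.getString(1))
    } finally {
      conn.close()
    }
  }
}

Because every client goes through the one long-running server, they all share the same metastore connection, and the single-session Derby limitation no longer applies.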

Another instance of Derby may have already booted the database
By default, Spark uses Derby as the metadata store, which can only serve one user. It seems you are starting multiple Spark interpreters; that's why you see the above error message. So here are the two solutions for you:
Disable Hive in the Spark interpreter by setting zeppelin.spark.useHiveContext to false if you don't need Hive.
Set up a Hive metastore that supports multiple users (a sketch follows this list). Refer to https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cdh_ig_hive_metastore_configure.html
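For the second option, a minimal Scala sketch of pointing a Spark session at a Postgres-backed metastore (matching the Postgres DB from the question) by forwarding the usual hive-site.xml properties through the spark.hadoop.* prefix; host, database name, and credentials are placeholders:

import org.apache.spark.sql.SparkSession

object SharedMetastoreSketch {
  def main(args: Array[String]): Unit = {
    // The javax.jdo.option.* keys are the same ones that go in hive-site.xml;
    // the spark.hadoop. prefix forwards them into the Hive configuration.
    val spark = SparkSession.builder()
      .appName("shared-metastore")
      .config("spark.hadoop.javax.jdo.option.ConnectionURL",
        "jdbc:postgresql://localhost:5432/metastore") // placeholder URL
      .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver")
      .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hiveuser") // placeholder
      .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "hivepass") // placeholder
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SHOW DATABASES").show() // should hit the Postgres metastore
  }
}

Several such sessions can run concurrently, because Postgres (unlike embedded Derby) accepts multiple connections.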

Stop Zeppelin. Go to the bin folder of Apache Zeppelin and try deleting the metastore_db directory:
sudo rm -r metastore_db/
Start Zeppelin again and try now.

Related

Cannot read persisted spark warehouse databases on subsequent sessions

I am trying to create a locally persisted spark warehouse database that will be present/loaded/accessible to future spark sessions created by the same application.
I have configured the spark session conf with:
.config("spark.sql.warehouse.dir", "C:/path/to/my/long/lived/mock-hive")
When I create the databases, I see the mock-hive folder get created, and underneath it the two distinct databases I create get their own folders: db1.db and db2.db
However, these folders are EMPTY after the session completes, despite the databases being successfully created and subsequently queried in the run that stands them up.
On a subsequent run with the same configured Spark session, if I call
baseSparkSession.catalog.listDatabases().collect() I only see the default database. The two I created did not persist into the second Spark session.
What is the trick to get these local persisted databases to be available to read in subsequent execution?
I've noticed that the *.db folders under spark.sql.warehouse.dir are empty after creation, which might have something to do with it...
Spark Version: 3.0.1
Turns out spark.sql.warehouse.dir is not where the database definitions are stored... those live in the Derby metastore stored in metastore_db. To relocate that, you need to change a system property:
System.setProperty("derby.system.home", derbyPath)
I didn't even have to set spark.sql.warehouse.dir; just relocating derbyPath to a common location that all Spark sessions use was enough.
NOTE - You don't need to specify the "metastore_db" portion of the derbyPath, it will be auto appended to the location.
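Put together, a minimal sketch of the fix in application code (the Derby path is a placeholder):

import org.apache.spark.sql.SparkSession

object PersistentCatalogSketch {
  def main(args: Array[String]): Unit = {
    // Point Derby at a fixed location BEFORE the first SparkSession is built;
    // Derby creates (or reuses) a metastore_db directory under this path.
    System.setProperty("derby.system.home", "C:/path/to/shared/derby-home") // placeholder

    val spark = SparkSession.builder()
      .appName("persistent-catalog")
      .master("local[*]")
      .enableHiveSupport() // assumes the Hive-backed (Derby) catalog is in use
      .getOrCreate()

    // Databases created here are visible to any later session
    // started with the same derby.system.home.
    spark.sql("CREATE DATABASE IF NOT EXISTS db1")
    spark.catalog.listDatabases().show()
  }
}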

Spark SQL cannot access Spark Thrift Server

I cannot configure Spark SQL so that I can access a Hive table created through the Spark Thrift Server (without using JDBC, natively from Spark).
I use a single configuration file conf/hive-site.xml for both the Spark Thrift Server and Spark SQL. I have the javax.jdo.option.ConnectionURL property set to jdbc:derby:;databaseName=/home/user/spark-2.4.0-bin-hadoop2.7/metastore_db;create=true. I also set the spark.sql.warehouse.dir property to an absolute path pointing to the spark-warehouse directory. I run the Thrift Server with ./start-thriftserver.sh and can observe that an embedded Derby database is created with a metastore_db directory. I can connect with beeline, create a table and see the spark-warehouse directory created with a subdirectory for the table. So at this stage it's fine.
I launch the pyspark shell with Hive support enabled, ./bin/pyspark --conf spark.sql.catalogImplementation=hive, and try to access the Hive table with:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
hc.sql('show tables')
I get errors like:
ERROR XJ040: Failed to start database
'/home/user/spark-2.4.0-bin-hadoop2.7/metastore_db' with class loader
sun.misc.Launcher$AppClassLoader@1b4fb997
ERROR XSDB6: Another instance of Derby may have already booted the
database /home/user/spark-2.4.0-bin-hadoop2.7/metastore_db
pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException:
java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Apparently Spark is trying to create a new Derby database instead of using the metastore I pointed to in the config file. If I stop the Thrift Server and run only Spark, everything is fine. How can I fix it?
Is the embedded Derby metastore database fine for having both the Thrift Server and Spark access one Hive, or do I need to use e.g. MySQL? I don't have a cluster and do everything locally.
The embedded Derby metastore database is fine for local use, but it can be opened by only one JVM at a time - that is exactly what the XSDB6 error above is telling you - so the Thrift Server and your pyspark shell cannot share it. For a production environment (or any setup with concurrent processes), it is recommended to use another metastore database.
Yes, you can definitely use MySQL as the metastore. For this, you have to make an entry in hive-site.xml.
You can follow the configuration guide at Use MySQL for the Hive Metastore for the exact details.
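As a quick sanity check (a sketch, assuming a Spark 2.x session with Hive support), you can print which metastore a session has actually picked up; Spark loads hive-site.xml into the Hadoop configuration, so the connection URL should be visible there:

// In spark-shell; run a catalog query first so the session state is initialized
spark.sql("SHOW TABLES").show()
val url = spark.sparkContext.hadoopConfiguration.get("javax.jdo.option.ConnectionURL")
println(s"Metastore in use: $url")
// If this prints a fresh jdbc:derby: path instead of the one from your
// hive-site.xml, the file was not picked up by this session.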

Running Spark App: Persist Metastore

I work on a Spark 2.1 application that also uses Spark SQL and saves data with dataframe.write.saveAsTable(tbl). My understanding is that an in-memory Derby DB is used for the Hive metastore (right?). This means that a table I create in the first execution is not available in any subsequent execution. In many cases that might be the intended behaviour - but I would like to persist the metastore across executions (since this is also the behaviour I have in my production system).
So, a simple question: How can I change the configuration to persist the metastore on disk?
One remark: I am not starting the Spark job with spark-shell or spark-submit, but as a standalone Scala application.
It is already persisted on disk: the default Derby metastore is not in-memory but lives in a metastore_db directory under the application's working directory. As long as both sessions use the same working directory or the same explicit metastore configuration, permanent tables will be persisted between sessions.
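A minimal sketch of such a standalone application; run it twice from the same working directory and the second run should find the table created by the first (table name and data are placeholders):

import org.apache.spark.sql.SparkSession

object PersistMetastoreApp {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport routes saveAsTable through the (Derby) metastore,
    // which lives in ./metastore_db relative to the working directory.
    val spark = SparkSession.builder()
      .appName("persist-metastore")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    val exists = spark.catalog.listTables().collect().exists(_.name == "tbl")
    if (!exists) {
      Seq((1, "a"), (2, "b")).toDF("id", "value").write.saveAsTable("tbl")
      println("Created tbl")
    } else {
      println(s"tbl already exists with ${spark.table("tbl").count()} rows")
    }
    spark.stop()
  }
}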

Registering temp tables in ThriftServer

I am new to Spark and am trying to understand how (if at all) it is possible to register dataframes as temp tables in the Spark Thrift Server.
To clarify, this is what I am trying to do:
Submit an application that generates a dataframe and registers it as a temporary table
Connect from a JDBC client to the Spark ThriftServer (running on the master) and query the temporary table, even after the application that registered it has completed.
So far I've had no success with this - the Spark ThriftServer is running on the Spark master, but I'm unable to actually register any temp table to it.
Is this possible? I know I can use HiveThriftServer2.startWithContext to serve a dataframe via JDBC, but that requires the application to keep running forever + it requires me to launch additional applications.
The key idea is to register all temp tables in the Spark job and then start the Spark Thrift Server from within that job. This will keep your job running until you terminate the Thrift Server, and you will be able to query all the temp tables via JDBC.
Here it is described with an example.
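A compact sketch of the pattern, assuming the spark-hive-thriftserver module is on the classpath; the view name and data are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object TempTableServer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("temp-table-server")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Build the dataframe and register it as a temp view in this session.
    val df = Seq((1, "a"), (2, "b")).toDF("id", "value") // placeholder data
    df.createOrReplaceTempView("my_temp_table")

    // Start the Thrift Server inside this application; JDBC clients
    // share this session's state and therefore see the temp view.
    HiveThriftServer2.startWithContext(spark.sqlContext)

    // Keep the driver alive - the temp view disappears when the job ends.
    Thread.currentThread().join()
  }
}

A JDBC client (e.g. beeline) pointed at this server can then run SELECT * FROM my_temp_table for as long as the job is up.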

thrift server - hive contexts - load/update data from spark code

Does the ThriftServer create its own HiveContext?
My aim is to create tables and load data from Spark code (spark-submit) via HiveContext, such that clients of the Thrift Server will be able to see them.
Yes, of course it creates a context:
Thrift Code
But I have seen a strange issue - it looks like the Hive context is cached when the Thrift Server starts. If I run some other app which creates or changes a Hive table, the Thrift Server doesn't see the changes. Only restarting the service helps.
