Running Spark App: Persist Metastore - apache-spark

I work on a Spark 2.1 application that also uses SparkSQL and saves data with dataframe.write.saveAsTable(tbl). My understanding is that an in-memory Derby DB is used for the Hive metastore (right?). This means that a table that I create in the first execution is not available in any subsequent executions. In many cases that might be the intended behaviour - but I would like to persist the metastore across executions (since this is also the behavior I have in my production system).
So, a simple question: How can I change the configuration to persist the metastore on disc?
One remark: I am not starting the Spark job with spark-shell or spark-submit, but as a standalone Scala application.

It is already persisted on disk. As long as both sessions use the same working directory or specific metastore configuration, the permanent table will be persisted between sessions.

Related

Cannot read persisted spark warehouse databases on subsequent sessions

I am trying to create a locally persisted spark warehouse database that will be present/loaded/accessible to future spark sessions created by the same application.
I have configured the spark session conf with:
.config("spark.sql.warehouse.dir", "C:/path/to/my/long/lived/mock-hive")
When I create the databases, I see the mock-hive folder get created, and underneath two distinct databases that I create have folders: db1.db and db2.db
However, these folders are EMPTY after the session completes, despite the databases being successfully created and subsequently queried in the run that stands them up.
On a subsequent run with the same configured spark session, if I
baseSparkSession.catalog.listDatabases().collect() I only see the default database. The two I created did not persist into the second spark session.
What is the trick to get these local persisted databases to be available to read in subsequent execution?
I've noticed that spark.sql.warehouse.dir *.db folders empty after creation, which might have something to do with it...
Spark Version: 3.0.1
Turns out spark.sql.warehouse.dir is not where local db data is stored... it's in the derby database stored in metastore_db. To relocate that, you need to change a system param:
System.setProperty("derby.system.home", derbyPath)
I didn't even have to set spark.sql.warehouse.dir, just relocate the derbyPath to a common location all spark sessions use.
NOTE - You don't need to specify the "metastore_db" portion of the derbyPath, it will be auto appended to the location.

Possible memory leak on hadoop cluster ? (hive, hiveserver2, zeppelin, spark)

The heap usage of hiveserver2 is constantly increasing (first pic).
There are applications such as nifi, zeppelin, spark related to hive. Nifi use puthivesql, zeppelin use jdbc(hive) and spark use spark-sql. I couldn't find any clue to this.
Hive requires a lot of resources for establishing connection. So, first reason is a lot of queries in your puthiveql processor, cause for everyone of them hive need to open connection. Get attention on your hive job browser (you can use hue for this purpose)
Possible resolution: e.g. if you use insert queries - so use orc files to insert data. If you use update queries - use temporary table and merge query.

Where is Spark catalog metadata stored?

Have been trying to get an accurate view of how Spark's catalog API stores the metadata.
I have found some resources, but no answer:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-Catalog.html
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-CatalogImpl.html
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/catalog/Catalog.html
I see some tutorials that take for granted the existence of Hive Metastore.
Is Hive Metastore potentially included with Spark distribution?
Spark cluster can be short-lived, but Hive metastore would obviously need to be long-lived
Apart from the catalog feature, partitioning and sorting features when writing out a DF seem to depend on Hive... So "everyone" seems to take Hive as granted when talking about key Spark features of persisting a DF.
Spark becomes aware of Hive MetaStore when it is provided with hive-site.xml, which is typically placed under $SPARK_HOME/conf. Whenever enableHiveSupport() method is used while creating SparkSession, Spark finds where and how to
get connected with Hive metastore. Spark therefore does not explicitly stores hive settings.

Zeppelin - Unable to instantiate SessionHiveMetaStoreClient

I am trying to get Zeppelin to work. But when I run a notebook twice, the second time it fails due to Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient. (full log at the end of the post)
It seems to be due to the fact that the lock in the metastore doesn't get removed. It is also advised to use for example Postgres instead of Hive as it allows multiple users to run jobs in Zeppelin.
I made a postgres DB and a hive-site.xml pointing to this DB. I added this file into the config folder of Zeppelin but also into the config folder of Spark. Also in the jdbc interpreter of Zeppelin I added similar parameters than the ones in the hive-site.xml.
The problems persists though.
Error log: http://pastebin.com/Jqf9cdtU
hive-site.xml: http://pastebin.com/RZdXHPX4
Try using Thrift server architecture in the Spark setup instead of working on a single instance JVM of Hive where you cannot generate multiple of sessions.
There are mainly three types of connection to Hive:
Single JVM - Metastore stored locally in the warehouse which doesn't allow multiple sessions
Mutiple JVM - where each worker behaves as a metastore
Thrift Server Architecture - Multiple Users can access the SQL engine and parallelism can be achieved
Another instance of Derby may have already booted the database
By default, spark use derby as the metadata store which can only serve one user. It seems you start multiple spark interpreter, that's why you see the above error message. So here's the 2 solutions for you
Disable hive in spark interpreter via setting zeppelin.spark.useHiveContext to false if you don't need hive.
Set up hive metadata store which support multiple users. Refer this https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cdh_ig_hive_metastore_configure.html
Stop Zeppelin. Go to your bin folder in Apache Zeppelin and try deleting metastore_db
sudo rm -r metastore_db/
Start Zeppelin again and try now.

Storing Data in Spark In Memory

I have got a requirement of keeping the data in Spark's in memory in table format even when the SparkContext object dies, so that Tableau can access it.
I have used registerTempTable , but data gets removed once the SparkContext object dies.
Is it possible to store data like this?If not what possible way I can look into to feed data to Tableau without reading it from HDFS location.
You will need to do one of the below:
run your Spark application as a long running application. Spark streaming usually does that out of the box (when you do StreamingContext.awaitTermination()). I have never tried it myself but I think YARN and MESOS have support for long running tasks. As you mentioned whenever your SparkContext dies, all the data is lost (because all the information is stored in the context). I consider spark-shell a long running application, that's why most Tableau/Spark demos use it because the context never dies.
store it into a data store (HDFS, database, etc.)
Try to use some distributed in-memory framework/file system like Tachyon - not sure if it has Tableau connectors, though.
Does Tableau read data from custom Spark Application?
I use PowerBi (instead Tableau) and it queries Spark through Thrift client, so each time it dies and restarts, I send him "cache table myTable" query through odbc/jdbc driver
I came to know a very interesting answer to the question asked above.
TACHYON.
http://ampcamp.berkeley.edu/5/exercises/tachyon.html

Resources