Registering temp tables in ThriftServer - apache-spark

I am new to Spark and am trying to understand how (if at all) is it possible to register dataframes as temp tables in the Spark thrift server.
To clarify, this is what I am trying to do:
Submit an application that generates a dataframe and registers it as a temporary table
Connect from a JDBC client to the Spark ThriftServer (running on the master) and query the temporary table, even after the application that registered it completed.
So far I've had no success with this - the Spark ThriftServer is running on the Spark master, but I'm unable to actually register any temp table to it.
Is this possible? I know I can use HiveThriftServer2.startWithContext to serve a dataframe via JDBC, but that requires the application to keep running forever + it requires me to launch additional applications.

The key idea is to register all temp tables in the Spark job and finally start SparkThriftServer from this job. It will keep your job running until you terminate thrift server. Also you will be able to query SparkThriftServer for all temp table via JDBC.
Here it is described with example

Related

Spark SQL query history

Is there any way to get a list of Spark SQL queries executed by various users in a Hadoop cluster?
For example, is there any log file where a Spark application stores the query in string format ?
There is a Spark History Server (port 18080 by default). If you have spark.eventLog.enabled,spark.eventLog.dir configured and Spark HS is running - you can check what Spark apps have been executed on your cluster. Each job there might contain SQL tab in UI where you can see SQL queries. But there are no single place or log file which stores them all.

How to connect local spark to Hive in cluster in Scala IDE

Can you please let me know the steps to connect scala ide which I use for developing spark to connect to hive. Currently the output goes to hdfs and then I create an external table on top of it. But as spark streaming creates small files , the performance is getting bad and I want spark to write directly to Hive and I am not sure what configuration in my PC that I should make for that to happen for my development.

Running Spark App: Persist Metastore

I work on a Spark 2.1 application that also uses SparkSQL and saves data with dataframe.write.saveAsTable(tbl). My understanding is that an in-memory Derby DB is used for the Hive metastore (right?). This means that a table that I create in the first execution is not available in any subsequent executions. In many cases that might be the intended behaviour - but I would like to persist the metastore across executions (since this is also the behavior I have in my production system).
So, a simple question: How can I change the configuration to persist the metastore on disc?
One remark: I am not starting the Spark job with spark-shell or spark-submit, but as a standalone Scala application.
It is already persisted on disk. As long as both sessions use the same working directory or specific metastore configuration, the permanent table will be persisted between sessions.

Zeppelin - Unable to instantiate SessionHiveMetaStoreClient

I am trying to get Zeppelin to work. But when I run a notebook twice, the second time it fails due to Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient. (full log at the end of the post)
It seems to be due to the fact that the lock in the metastore doesn't get removed. It is also advised to use for example Postgres instead of Hive as it allows multiple users to run jobs in Zeppelin.
I made a postgres DB and a hive-site.xml pointing to this DB. I added this file into the config folder of Zeppelin but also into the config folder of Spark. Also in the jdbc interpreter of Zeppelin I added similar parameters than the ones in the hive-site.xml.
The problems persists though.
Error log: http://pastebin.com/Jqf9cdtU
hive-site.xml: http://pastebin.com/RZdXHPX4
Try using Thrift server architecture in the Spark setup instead of working on a single instance JVM of Hive where you cannot generate multiple of sessions.
There are mainly three types of connection to Hive:
Single JVM - Metastore stored locally in the warehouse which doesn't allow multiple sessions
Mutiple JVM - where each worker behaves as a metastore
Thrift Server Architecture - Multiple Users can access the SQL engine and parallelism can be achieved
Another instance of Derby may have already booted the database
By default, spark use derby as the metadata store which can only serve one user. It seems you start multiple spark interpreter, that's why you see the above error message. So here's the 2 solutions for you
Disable hive in spark interpreter via setting zeppelin.spark.useHiveContext to false if you don't need hive.
Set up hive metadata store which support multiple users. Refer this https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cdh_ig_hive_metastore_configure.html
Stop Zeppelin. Go to your bin folder in Apache Zeppelin and try deleting metastore_db
sudo rm -r metastore_db/
Start Zeppelin again and try now.

thrift server - hive contexts - load/update data from spark code

Does the ThriftServer create its own HiveContext?
My aim is to create tables/load data from spark code (spark-submit) by HiveContext such that clients of thriftServer will be able to see it.
yes, of course it creates context:
Thrift Code
But I have seen strange issue - it looks like hive context is cached on starting of thrift server. If I run some other app which creates/changes hive table, thrift server doesn't see the changes. Only restarting the service helps

Resources