NullPointerException in Spark Thrift Server using Apache Superset or Redash - apache-spark

I get a NullPointerException after connecting BI tools like Redash or Superset to a Spark Thriftserver (both tools use PyHive). Apache Zeppelin works fine for queries using STS and I could never reproduce the error there (Zeppelin uses org.apache.hive.jdbc.HiveDriver).
DB engine Error
hive error: ('Query error', 'Error running query: java.lang.NullPointerException')
This sends the STS into a state where only a restart can bring it back. Queries from all clients will fail (Zeppelin, beeline, Redash, Superset). It seems to occur mostly when schema is automatically fetched (which doesn't quite work, DB name is fetched correctly, table names are wrong). While browsing PyHive code I encountered some incompatibilities between PyHive <-> STS (like this and this). The connection between Redash/Superset and STS works, I am able to do queries until the Thriftserver enters the broken state.
I understand why schema refresh doesn't work (and might be able work around it), but I don't understand why the Thriftserver enters an unrecoverable, broken state with the NullPointerException.
My setup:
Kubernetes
Delta Lake with data formatted as delta
Hive Metastore
Spark Cluster where a Spark Thriftserver is started: start-thriftserver.sh --total-executor-cores 3 --driver-memory 3G --executor-memory 1536M --hiveconf hive.server2.thrift.port 10000 --hiveconf hive.server2.thrift.max.worker.threads 2000 --hiveconf hive.server2.thrift.bind.host my-host
(I also tried spark.sql.thriftServer.incrementalCollect=false but that didn't affect anything.)
Redash / Apache Superset connected to the STS

Related

Spark SQL cannot access Spark Thrift Server

I cannot configure Spark SQL so that I could access Hive Table in Spark Thrift Server (without using JDBC, but natively from Spark)
I use single configuration file conf/hive-site.xml for both Spark Thrift Server and Spark SQL. I have javax.jdo.option.ConnectionURL property set to jdbc:derby:;databaseName=/home/user/spark-2.4.0-bin-hadoop2.7/metastore_db;create=true. I also set spark.sql.warehouse.dir property to absolute path pointing to spark-warehouse directory. I run Thrift server with ./start-thriftserver.sh and I can observe that embedded Derby database is being created with metastore_db directory. I can connect with beeline, create a table and see spark-warehouse directory created with subdirectory for table. So at this stage it's fine.
I launch pyspark shell with Hive support enabled ./bin/pyspark --conf spark.sql.catalogImplementation=hive, and try to access the Hive table with:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
hc.sql('show tables')
I got errors like:
ERROR XJ040: Failed to start database
'/home/user/spark-2.4.0-bin-hadoop2.7/metastore_db' with class loader
sun.misc.Launcher$AppClassLoader#1b4fb997
ERROR XSDB6: Another instance of Derby may have already booted the
database /home/user/spark-2.4.0-bin-hadoop2.7/metastore_db
pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException:
java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Apparently Spark is trying to create new Derby database instead of using Metastore I put in config file. If I stop Thrift Server and run only spark, everything is fine. How could I fix it?
Is Embedded Derby Metastore Database fine to have both Thrift Server and Spark access one Hive or I need to use e.g. MySQL? I don't have a cluster and do everything locally.
Embedded Derby Metastore Database is fine to be used in local, but for production environment, it is recommended to use any other Metastore database.
Yes, you can definitely use MYSQL as metastore. For this, you have to make an entry in hive-site.xml.
You can follow the configuration guide at Use MySQL for the Hive Metastore for the exact details.

Zeppelin - Unable to instantiate SessionHiveMetaStoreClient

I am trying to get Zeppelin to work. But when I run a notebook twice, the second time it fails due to Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient. (full log at the end of the post)
It seems to be due to the fact that the lock in the metastore doesn't get removed. It is also advised to use for example Postgres instead of Hive as it allows multiple users to run jobs in Zeppelin.
I made a postgres DB and a hive-site.xml pointing to this DB. I added this file into the config folder of Zeppelin but also into the config folder of Spark. Also in the jdbc interpreter of Zeppelin I added similar parameters than the ones in the hive-site.xml.
The problems persists though.
Error log: http://pastebin.com/Jqf9cdtU
hive-site.xml: http://pastebin.com/RZdXHPX4
Try using Thrift server architecture in the Spark setup instead of working on a single instance JVM of Hive where you cannot generate multiple of sessions.
There are mainly three types of connection to Hive:
Single JVM - Metastore stored locally in the warehouse which doesn't allow multiple sessions
Mutiple JVM - where each worker behaves as a metastore
Thrift Server Architecture - Multiple Users can access the SQL engine and parallelism can be achieved
Another instance of Derby may have already booted the database
By default, spark use derby as the metadata store which can only serve one user. It seems you start multiple spark interpreter, that's why you see the above error message. So here's the 2 solutions for you
Disable hive in spark interpreter via setting zeppelin.spark.useHiveContext to false if you don't need hive.
Set up hive metadata store which support multiple users. Refer this https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cdh_ig_hive_metastore_configure.html
Stop Zeppelin. Go to your bin folder in Apache Zeppelin and try deleting metastore_db
sudo rm -r metastore_db/
Start Zeppelin again and try now.

Privileges for spark sql with sentry

I'm trying to make the privileges to access Spark-SQL with sentry, and spark sql connects with thrift port withe hiveserver2( --hiveconf hive.server2.thrift.port). However, while I can limit users' privileges on hive successfully, I cannot limit the access with spark SQL through sentry.
Anyone who met with the same problem?
Follow the doc,config hive in Spark.
Add Sentry jars to the classpath, Spark will load them automatically.
It works for me.

thrift server - hive contexts - load/update data from spark code

Does the ThriftServer create its own HiveContext?
My aim is to create tables/load data from spark code (spark-submit) by HiveContext such that clients of thriftServer will be able to see it.
yes, of course it creates context:
Thrift Code
But I have seen strange issue - it looks like hive context is cached on starting of thrift server. If I run some other app which creates/changes hive table, thrift server doesn't see the changes. Only restarting the service helps

Use JDBC (eg Squirrel SQL) to query Cassandra with Spark SQL

I have a Cassandra cluster with a co-located Spark cluster, and I can run the usual Spark jobs by compiling them, copying them over, and using the ./spark-submit script. I wrote a small job that accepts SQL as a command-line argument, submits it to Spark as Spark SQL, Spark runs that SQL against Cassandra and writes the output to a csv file.
Now I feel like I'm going round in circles trying to figure out if it's possible to query Cassandra via Spark SQL directly in a JDBC connection (eg from Squirrel SQL). The Spark SQL documentation says
Connect through JDBC or ODBC.
A server mode provides industry standard JDBC and ODBC connectivity for
business intelligence tools.
The Spark SQL Programming Guide says
Spark SQL can also act as a distributed query engine using its JDBC/ODBC or
command-line interface. In this mode, end-users or applications can interact
with Spark SQL directly to run SQL queries, without the need to write any
code.
So I can run the Thrift Server, and submit SQL to it. But what I can't figure out, is how do I get the Thrift Server to connect to Cassandra? Do I simply pop the Datastax Cassandra Connector on the Thrift Server classpath? How do I tell the Thrift Server the IP and Port of my Cassandra cluster? Has anyone done this already and can give me some pointers?
Configure those properties in spark-default.conf file
spark.cassandra.connection.host 192.168.1.17,192.168.1.19,192.168.1.21
# if you configured security in you cassandra cluster
spark.cassandra.auth.username smb
spark.cassandra.auth.password bigdata#123
Start your thrift server with spark-cassandra-connector dependencies and mysql-connector dependencies with some port that you will connect via JDBC or Squirrel.
sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.bind.host 192.168.1.17 --hiveconf hive.server2.thrift.port 10003 --jars <shade-jar>-0.0.1.jar --driver-class-path <shade-jar>-0.0.1.jar
For getting cassandra table run Spark-SQL queries like
CREATE TEMPORARY TABLE mytable USING org.apache.spark.sql.cassandra OPTIONS (cluster 'BDI Cassandra', keyspace 'testks', table 'testtable');
why don`t you use the spark-cassandra-connector and cassandra-driver-core? Just add the dependencies, specify the host address/login in your spark context and then you can read/write to cassandra using sql.

Resources