Spark SQL cannot access Spark Thrift Server - apache-spark

I cannot configure Spark SQL so that I could access Hive Table in Spark Thrift Server (without using JDBC, but natively from Spark)
I use single configuration file conf/hive-site.xml for both Spark Thrift Server and Spark SQL. I have javax.jdo.option.ConnectionURL property set to jdbc:derby:;databaseName=/home/user/spark-2.4.0-bin-hadoop2.7/metastore_db;create=true. I also set spark.sql.warehouse.dir property to absolute path pointing to spark-warehouse directory. I run Thrift server with ./start-thriftserver.sh and I can observe that embedded Derby database is being created with metastore_db directory. I can connect with beeline, create a table and see spark-warehouse directory created with subdirectory for table. So at this stage it's fine.
I launch pyspark shell with Hive support enabled ./bin/pyspark --conf spark.sql.catalogImplementation=hive, and try to access the Hive table with:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
hc.sql('show tables')
I got errors like:
ERROR XJ040: Failed to start database
'/home/user/spark-2.4.0-bin-hadoop2.7/metastore_db' with class loader
sun.misc.Launcher$AppClassLoader#1b4fb997
ERROR XSDB6: Another instance of Derby may have already booted the
database /home/user/spark-2.4.0-bin-hadoop2.7/metastore_db
pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException:
java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Apparently Spark is trying to create new Derby database instead of using Metastore I put in config file. If I stop Thrift Server and run only spark, everything is fine. How could I fix it?
Is Embedded Derby Metastore Database fine to have both Thrift Server and Spark access one Hive or I need to use e.g. MySQL? I don't have a cluster and do everything locally.

Embedded Derby Metastore Database is fine to be used in local, but for production environment, it is recommended to use any other Metastore database.
Yes, you can definitely use MYSQL as metastore. For this, you have to make an entry in hive-site.xml.
You can follow the configuration guide at Use MySQL for the Hive Metastore for the exact details.

Related

Can Spark-sql work without a hive installation?

I have installed spark 2.4.0 on a clean ubuntu instance. Spark dataframes work fine but when I try to use spark.sql against a dataframe such as in the example below,i am getting an error "Failed to access metastore. This class should not accessed in runtime."
spark.read.json("/data/flight-data/json/2015-summary.json")
.createOrReplaceTempView("some_sql_view")
spark.sql("""SELECT DEST_COUNTRY_NAME, sum(count)
FROM some_sql_view GROUP BY DEST_COUNTRY_NAME
""").where("DEST_COUNTRY_NAME like 'S%'").where("sum(count) > 10").count()
Most of the fixes that I have see in relation to this error refer to environments where hive is installed. Is hive required if I want to use sql statements against dataframes in spark or am i missing something else?
To follow up with my fix. The problem in my case was that Java 11 was the default on my system. As soon as I set Java 8 as the default metastore_db started working.
Yes, we can run spark sql queries on spark without installing hive, by default hive uses mapred as an execution engine, we can configure hive to use spark or tez as an execution engine to execute our queries much faster. Hive on spark hive uses hive metastore to run hive queries. At the same time, sql queries can be executed through spark. If spark is used to execute simple sql queries or not connected with hive metastore server, its uses embedded derby database and a new folder with name metastore_db will be created under the user home folder who executes the query.

Zeppelin - Unable to instantiate SessionHiveMetaStoreClient

I am trying to get Zeppelin to work. But when I run a notebook twice, the second time it fails due to Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient. (full log at the end of the post)
It seems to be due to the fact that the lock in the metastore doesn't get removed. It is also advised to use for example Postgres instead of Hive as it allows multiple users to run jobs in Zeppelin.
I made a postgres DB and a hive-site.xml pointing to this DB. I added this file into the config folder of Zeppelin but also into the config folder of Spark. Also in the jdbc interpreter of Zeppelin I added similar parameters than the ones in the hive-site.xml.
The problems persists though.
Error log: http://pastebin.com/Jqf9cdtU
hive-site.xml: http://pastebin.com/RZdXHPX4
Try using Thrift server architecture in the Spark setup instead of working on a single instance JVM of Hive where you cannot generate multiple of sessions.
There are mainly three types of connection to Hive:
Single JVM - Metastore stored locally in the warehouse which doesn't allow multiple sessions
Mutiple JVM - where each worker behaves as a metastore
Thrift Server Architecture - Multiple Users can access the SQL engine and parallelism can be achieved
Another instance of Derby may have already booted the database
By default, spark use derby as the metadata store which can only serve one user. It seems you start multiple spark interpreter, that's why you see the above error message. So here's the 2 solutions for you
Disable hive in spark interpreter via setting zeppelin.spark.useHiveContext to false if you don't need hive.
Set up hive metadata store which support multiple users. Refer this https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cdh_ig_hive_metastore_configure.html
Stop Zeppelin. Go to your bin folder in Apache Zeppelin and try deleting metastore_db
sudo rm -r metastore_db/
Start Zeppelin again and try now.

thrift server - hive contexts - load/update data from spark code

Does the ThriftServer create its own HiveContext?
My aim is to create tables/load data from spark code (spark-submit) by HiveContext such that clients of thriftServer will be able to see it.
yes, of course it creates context:
Thrift Code
But I have seen strange issue - it looks like hive context is cached on starting of thrift server. If I run some other app which creates/changes hive table, thrift server doesn't see the changes. Only restarting the service helps

Use JDBC (eg Squirrel SQL) to query Cassandra with Spark SQL

I have a Cassandra cluster with a co-located Spark cluster, and I can run the usual Spark jobs by compiling them, copying them over, and using the ./spark-submit script. I wrote a small job that accepts SQL as a command-line argument, submits it to Spark as Spark SQL, Spark runs that SQL against Cassandra and writes the output to a csv file.
Now I feel like I'm going round in circles trying to figure out if it's possible to query Cassandra via Spark SQL directly in a JDBC connection (eg from Squirrel SQL). The Spark SQL documentation says
Connect through JDBC or ODBC.
A server mode provides industry standard JDBC and ODBC connectivity for
business intelligence tools.
The Spark SQL Programming Guide says
Spark SQL can also act as a distributed query engine using its JDBC/ODBC or
command-line interface. In this mode, end-users or applications can interact
with Spark SQL directly to run SQL queries, without the need to write any
code.
So I can run the Thrift Server, and submit SQL to it. But what I can't figure out, is how do I get the Thrift Server to connect to Cassandra? Do I simply pop the Datastax Cassandra Connector on the Thrift Server classpath? How do I tell the Thrift Server the IP and Port of my Cassandra cluster? Has anyone done this already and can give me some pointers?
Configure those properties in spark-default.conf file
spark.cassandra.connection.host 192.168.1.17,192.168.1.19,192.168.1.21
# if you configured security in you cassandra cluster
spark.cassandra.auth.username smb
spark.cassandra.auth.password bigdata#123
Start your thrift server with spark-cassandra-connector dependencies and mysql-connector dependencies with some port that you will connect via JDBC or Squirrel.
sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.bind.host 192.168.1.17 --hiveconf hive.server2.thrift.port 10003 --jars <shade-jar>-0.0.1.jar --driver-class-path <shade-jar>-0.0.1.jar
For getting cassandra table run Spark-SQL queries like
CREATE TEMPORARY TABLE mytable USING org.apache.spark.sql.cassandra OPTIONS (cluster 'BDI Cassandra', keyspace 'testks', table 'testtable');
why don`t you use the spark-cassandra-connector and cassandra-driver-core? Just add the dependencies, specify the host address/login in your spark context and then you can read/write to cassandra using sql.

Can't load a Hive table through Spark

I am new to Spark and needed help in figuring out why my Hive databases are not accessible to perform a data load through Spark.
Background:
I am running Hive, Spark, and my Java program on a single machine. It's a Cloudera QuickStart VM, CDH5.4x, on a VirtualBox.
I have downloaded pre-built Spark 1.3.1.
I am using the Hive bundled with the VM and can run hive queries through Spark-shell and Hive cmd line without any issue. This includes running the command:
LOAD DATA INPATH 'hdfs://quickstart.cloudera:8020/user/cloudera/test_table/result.parquet/' INTO TABLE test_spark.test_table PARTITION(part = '2015-08-21');
Problem:
I am writing a Java program to read data from Cassandra and load it into Hive. I have saved the results of the Cassandra read in parquet format in a folder called 'result.parquet'.
Now I would like to load this into Hive. For this, I
Copied the Hive-site.xml to the Spark conf folder.
I made a change to this xml. I noticed that I had two hive-site.xml - one which was auto generated and another which had Hive execution parameters. I combined both into a single hive-site.xml.
Code used (Java):
HiveContext hiveContext = new
HiveContext(JavaSparkContext.toSparkContext(sc));
hiveContext.sql("show databases").show();
hiveContext.sql("LOAD DATA INPATH
'hdfs://quickstart.cloudera:8020/user/cloudera/test_table/result.parquet/'
INTO TABLE test_spark.test_table PARTITION(part = '2015-08-21')").show();
So, this worked. And I could load data into Hive. Except, after I restarted my VM, it has stopped working.
When I run the show databases Hive query, I get a result saying
result
default
instead of the databases in Hive, which are
default
test_spark
I also notice a folder called metastore_db being created in my Project Folder. From googling around, I know this happens when Spark can't connect to the Hive metastore, so it creates one of its own.I thought I had fixed that, but clearly not.
What am I missing?

Resources