I've set up an AWS EMR cluster that includes Spark 2.3.2, Hive 2.3.3, and HBase 1.4.7. How can I configure Spark to access Hive tables?
I've taken the following steps, but the result is this error message:
java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError:
org/apache/tez/dag/api/SessionNotRunning when creating Hive client
using classpath:
Please make sure that jars for your version of hive and hadoop are
included in the paths passed to spark.sql.hive.metastore.jars
Steps:
cp /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf
In /usr/lib/spark/conf/spark-defaults.conf I added:
spark.sql.hive.metastore.jars /usr/lib/hadoop/lib/*:/usr/lib/hive/lib/*
In Zeppelin I create a Spark session:
val spark = SparkSession.builder.appName("clue").enableHiveSupport().getOrCreate()
import spark.implicits._
Steps 1 and 2 that you mention are mostly fine, but there is a small tweak that might help.
Since you are using Hive 2.x, set spark.sql.hive.metastore.jars to maven instead, and set spark.sql.hive.metastore.version to match the version of your metastore, 2.3.3. It should be sufficient to use just 2.3 as the version; see why in the Apache Spark code.
Here is a sample of my working configuration that I set in spark-defaults.conf:
spark.sql.broadcastTimeout 600 # An arbitrary number that you can change
spark.sql.catalogImplementation hive
spark.sql.hive.metastore.jars maven
spark.sql.hive.metastore.version 2.3 # No need for minor version
spark.sql.hive.thriftServer.singleSession true
spark.sql.warehouse.dir {hdfs | s3 | etc}
hive.metastore.uris thrift://hive-host:9083
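The same settings can also be passed programmatically when the session is built; here is a rough sketch mirroring the values above (note that in Zeppelin the session is normally created by the interpreter, so spark-defaults.conf or the interpreter settings remain the more reliable place):
import org.apache.spark.sql.SparkSession
// Sketch: the metastore-related options must be set before the first SparkSession is created.
val spark = SparkSession.builder()
  .appName("clue")
  .config("spark.sql.hive.metastore.jars", "maven")
  .config("spark.sql.hive.metastore.version", "2.3")
  .config("hive.metastore.uris", "thrift://hive-host:9083")
  .enableHiveSupport()
  .getOrCreate()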
With the previous setup, I have been able to execute queries against my data warehouse in Zeppelin as follows:
val rows = spark.sql("YOUR QUERY")
rows.show()
More details on connecting to an external Hive metastore can be found here (Databricks).
Related
I use HDP 3.1 and I added Spark2, Hive, and the other services which are needed. I turned off the ACID feature in Hive. The Spark job can't find the table in Hive, but the table exists in Hive. The exception looks like:
org.apache.spark.sql.AnalysisException: Table or view not found
There is a hive-site.xml in Spark's conf folder. It is automatically created by HDP, but it isn't the same as the file in Hive's conf folder. From the log, Spark can get the thrift URI of Hive correctly.
I use Spark SQL and created one Hive table in spark-shell. I found the table was created in the folder specified by spark.sql.warehouse.dir. I changed its value to the value of hive.metastore.warehouse.dir, but the problem is still there.
I also enabled Hive support when creating the Spark session.
val ss = SparkSession.builder().appName("统计").enableHiveSupport().getOrCreate()
There is a metastore.catalog.default property in the hive-site.xml in Spark's conf folder. Its value is spark; it should be changed to hive. And by the way, we should disable the ACID feature of Hive.
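For illustration, here is a rough sketch of forcing the Hive catalog from the application side, assuming the spark.hadoop.* passthrough reaches the Hive client (editing metastore.catalog.default in Spark's hive-site.xml achieves the same thing):
import org.apache.spark.sql.SparkSession
// Sketch: ask for the Hive catalog instead of the default Spark catalog on HDP 3.x.
val ss = SparkSession.builder()
  .appName("hive-catalog-example")  // hypothetical app name
  .config("spark.hadoop.metastore.catalog.default", "hive")
  .enableHiveSupport()
  .getOrCreate()
ss.sql("show tables").show()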
You can use the Hive Warehouse Connector and enable LLAP in the Hive conf.
In HDP 3.0 and later, Spark and Hive use independent catalogs for accessing SparkSQL or Hive tables on the same or different platforms.
Spark, by default, only reads the Spark catalog. This means Spark applications that attempt to read or write tables created using the Hive CLI will fail with a "table not found" exception.
Workarounds:
Create the table in both the Hive CLI and Spark SQL
Use the Hive Warehouse Connector (a sketch follows below)
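For the second workaround, here is a minimal sketch of reading a Hive-managed table through the Hive Warehouse Connector, assuming the HWC jar is on the classpath and the usual settings (spark.sql.hive.hiveserver2.jdbc.url and the LLAP options) are configured; the database and table names are made up:
import com.hortonworks.hwc.HiveWarehouseSession
// Build an HWC session on top of the existing SparkSession.
val hive = HiveWarehouseSession.session(spark).build()
// Query a table that lives in the Hive catalog (names are illustrative).
hive.executeQuery("SELECT * FROM some_db.some_table").show()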
I am using snappydata-1.0.1 on HDP 2.6.2 with Spark 2.1.1 and was able to connect from an external Spark application. But when I enable Hive support by adding hive-site.xml to the Spark conf, SnappySession lists the tables from the Hive metastore instead of the Snappy store.
SparkConf sparkConf = new SparkConf().setAppName("TEST APP");
JavaSparkContext javaSparkContxt = new JavaSparkContext(sparkConf);
SparkSession sps = new SparkSession.Builder().enableHiveSupport().getOrCreate();
SnappySession snc = new SnappySession(new SparkSession(javaSparkContxt.sc()).sparkContext());
snc.sqlContext().sql("show tables").show();
The above code gives me the list of tables in the Snappy store when hive-site.xml is not in the Spark conf; if hive-site.xml is added, it lists the tables from the Hive metastore.
Is it not possible to use the Hive metastore and the SnappyData metastore in the same application?
Can I read a Hive table into one DataFrame and a SnappyData table into another DataFrame in the same application?
Thanks in advance
So, it isn't the Hive metastore that is the problem. You can use Hive tables and Snappy tables in the same application, e.g. copy a Hive table into Snappy in-memory, as sketched below.
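Here is one rough, unverified sketch of that copy (in Scala), reusing the sps (Hive-enabled SparkSession) and snc (SnappySession) from the question; the table names are made up and the SnappyData write options should be double-checked against the SnappyData docs:
// Read the Hive table through the Hive-enabled SparkSession (name is illustrative).
val hiveDF = sps.sql("SELECT * FROM hive_db.src_table")
// Re-wrap the rows in the SnappySession so the write goes to the Snappy catalog.
val snappyDF = snc.createDataFrame(hiveDF.rdd, hiveDF.schema)
// Save it as an in-memory SnappyData column table (name is illustrative).
snappyDF.write.format("column").saveAsTable("snappy_copy")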
But we will need to test the use of an external Hive metastore configured in hive-site.xml. Perhaps it's a bug.
You should try using the Snappy Smart Connector, i.e. run your Spark app using the Spark distribution in HDP and connect to the SnappyData cluster using the connector (see the docs). Here it looks like you are trying to run your Spark app using the SnappyData distribution.
When working with HDP 2.5 and Spark 1.6.2 we used Hive with Tez as its execution engine, and it worked.
But when we moved to HDP 2.6 with Spark 2.1.0, Hive didn't work with Tez as its execution engine, and the following exception was thrown when the DataFrame.saveAsTable API was called:
java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:529)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:188)
After looking at the answer to this question, we switched the Hive execution engine to MR (MapReduce) instead of Tez and it worked.
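For reference, the switch boils down to the hive.execution.engine property in the hive-site.xml that Spark picks up (a sketch; the same setting can also be changed through the cluster's Hive configuration UI):
<property>
  <name>hive.execution.engine</name>
  <value>mr</value>
</property>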
However, we'd like to work with Hive on Tez. What's required to solve the above exception in order for Hive on Tez to work?
I had the same issue when the Spark job was running in YARN cluster mode. It was resolved when the correct hive-site.xml was added to spark.yarn.dist.files (in the spark-defaults configuration).
Basically there are two different hive-site.xml files:
One is for the Hive configuration: /usr/hdp/current/hive-client/conf/hive-site.xml
The other one is a lighter version for Spark (it has only the details needed for Spark to work with Hive): /etc/spark//0/hive-site.xml (please check the path for your setup)
We need to use the second file for spark.yarn.dist.files.
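For example, the entry in spark-defaults.conf would look something like this (the path is a placeholder; point it at the lighter Spark-side hive-site.xml on your cluster):
spark.yarn.dist.files /path/to/spark-conf/hive-site.xml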
I am a beginner in Spark.
I installed Java and spark-1.6.1-bin-hadoop2.6.tgz (I have not installed Hadoop) and, without changing any configuration in the conf directory, ran spark-shell.
In the directory where Spark is installed, I see a metastore_db folder created with a tmp folder inside it.
Why is this metastore_db created, and where is this configured?
Also, I see a sqlContext being created after running spark-shell. What does this sqlContext represent?
When running spark-shell, a SparkContext and a SQLContext are created. SQLContext is built on top of the SparkContext and enables Spark SQL support. It has a method to execute SQL queries (sql) and methods to create DataFrames.
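For example, a minimal sketch of what the pre-created sqlContext offers in spark-shell (Spark 1.6 API; the sample data is made up):
import sqlContext.implicits._
// Build a small DataFrame from local data.
val df = Seq(("alice", 1), ("bob", 2)).toDF("name", "id")
// Register it so it can be queried with SQL.
df.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE id = 1").show()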
metastore_db is the local Hive metastore directory. Spark supports Apache Hive queries via HiveContext. If there is no hive-site.xml configured, Spark uses an embedded Derby metastore and creates the metastore_db directory in the working directory; see the documentation for details.
However, it would be good if you downloaded Spark 2.0. There you get a unified entry point to Spark, named SparkSession. This class allows you to read data from many sources, create Datasets, etc.
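For example, a minimal Spark 2.x sketch of the unified entry point (the app name is made up):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("example")
  .getOrCreate()
// A simple Dataset created through the unified entry point.
val ds = spark.range(5)
ds.show()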
I am using Spark (standalone) from CDH 5.4.2.
After copying hive-site.xml to $SPARK_HOME/conf, I can query Hive in spark-shell, as below:
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@6c6f3a15
scala> hiveContext.sql("show tables").show();
But when I open spark-sql, it shows an error:
java.lang.ClassNotFoundException: org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
Failed to load main class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
You need to build Spark with -Phive and -Phive-thriftserver.
What is the difference between spark-shell and spark-sql? If CDH's Spark doesn't support Hive, why can I use HiveContext?
Cloudera has a list of unsupported features here:
https://docs.cloudera.com/runtime/7.2.6/spark-overview/topics/spark-unsupported-features.html
The Thrift server is not supported.
This is a copy of the list for 7.2.6:
Apache Spark experimental features/APIs are not supported unless stated otherwise.
Using the JDBC Datasource API to access Hive or Impala is not supported
ADLS not supported for All Spark Components. Microsoft Azure Data Lake Store (ADLS) is a cloud-based filesystem that you can access through Spark applications. Spark with Kudu is not currently supported for ADLS data. (Hive on Spark is available for ADLS.)
IPython / Jupyter notebooks is not supported. The IPython notebook system (renamed to Jupyter as of IPython 4.0) is not supported.
Certain Spark Streaming features, such as the mapWithState method, are not supported.
Thrift JDBC/ODBC server is not supported
Spark SQL CLI is not supported
GraphX is not supported
SparkR is not supported
Structured Streaming is supported, but the following features of it are not:
Continuous processing, which is still experimental, is not supported.
Stream static joins with HBase have not been tested and therefore are not supported.
Spark cost-based optimizer (CBO) not supported.