Hive on Tez doesn't work in Spark 2 - apache-spark

When working with HDP 2.5 and Spark 1.6.2, we used Hive with Tez as its execution engine and it worked.
But after we moved to HDP 2.6 with Spark 2.1.0, Hive no longer worked with Tez as its execution engine, and the following exception was thrown when the DataFrame.saveAsTable API was called:
java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:529)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:188)
After looking at the answer to this question, we switched the Hive execution engine to MR (MapReduce) instead of Tez and it worked.
However, we'd like to work with Hive on Tez. What is required to resolve the above exception so that Hive on Tez works?

I had the same issue when the Spark job was running in YARN cluster mode; it was resolved by adding the correct hive-site.xml to spark.yarn.dist.files (set in the spark-defaults configuration).
Basically there are two different hive-site.xml files:
One is the Hive configuration: /usr/hdp/current/hive-client/conf/hive-site.xml
The other is a lighter version for Spark (it contains only the details Spark needs to work with Hive): /etc/spark//0/hive-site.xml (please verify the path for your setup).
We need to use the second file for spark.yarn.dist.files, as shown below.
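In spark-defaults.conf this is a single property; a minimal sketch, using the Spark-specific hive-site.xml path from this setup (verify the path for yours):
spark.yarn.dist.files /etc/spark//0/hive-site.xml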

Related

Spark Program queries executing via Tez engine instead of Spark

We have a Spark program which executes multiple queries and the tables are Hive tables.
Currently the queries are executed using the Tez engine from Spark.
I set sqlContext.sql("SET hive.execution.engine=spark") in the program and understand that the queries/program would then run on Spark. We are using HDP 2.6.5 and Spark 2.3.0 in our cluster.
Can someone confirm whether this is the correct approach, since we do not need to run the queries using the Tez engine and Spark should run them as is?
In the config file /etc/spark2/conf/hive-site.xml we do not have any execution-engine property set; it only contains Kerberos and metastore property details.
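One way to check which engine actually runs a query submitted from the Spark program is to look at its physical plan. This is a minimal sketch (the table name is hypothetical): if the plan shows Spark operators such as FileScan, HashAggregate, or Exchange, the query is executed by Spark's own engine rather than handed off to Hive on Tez.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("engine-check")
  .enableHiveSupport()   // metastore access only; execution stays on Spark
  .getOrCreate()

// Spark operators in the plan (FileScan, HashAggregate, Exchange, ...)
// mean the query runs on Spark, regardless of hive.execution.engine.
spark.sql("SELECT count(*) FROM some_hive_table").explain(true)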

Apache Spark: how can I understand and control if my query is executed on Hive engine or on Spark engine?

I am running a local instance of Spark 2.4.0.
I want to execute a SQL query against Hive.
Before, with Spark 1.x, I was using HiveContext for this:
import org.apache.spark.sql.hive.HiveContext
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val hivequery = hc.sql("show databases")
But now I see that HiveContext is deprecated: https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/hive/HiveContext.html. Inside the HiveContext.sql() code I see that it is now simply a wrapper over SparkSession.sql(). The recommendation is to use enableHiveSupport in the SparkSession builder, but as this question clarifies, that only concerns the metastore and the list of tables; it does not change the execution engine.
So the questions are:
how can I understand if my query is running on Hive engine or on Spark engine?
how can I control this?
From my understanding there is no separate Hive engine to run your query. You submit a query to Hive, and Hive executes it on one of these engines:
Spark
Tez (a DAG engine that generalizes MapReduce)
MapReduce (commonly just called Hadoop)
If you use Spark, your query will be executed by Spark using Spark SQL (starting with Spark 1.5.x, if I recall correctly).
How the Hive engine is configured depends on your distribution; I remember seeing a Hive-on-Spark configuration in the Cloudera distribution.
In that case Hive would use Spark to execute the job matching your query (instead of MapReduce or Tez), but Hive would still parse and analyze it.
With a local Spark instance you will only use the Spark engine (Spark SQL / Catalyst), but you can use it with Hive support. That means you can read an existing Hive metastore and interact with it.
It requires a Spark installation with Hive support: the Hive dependencies and hive-site.xml on your classpath.
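As a minimal sketch of the SparkSession-based replacement for the deprecated HiveContext (assuming hive-site.xml is on the classpath as described above):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-metastore-access")
  .enableHiveSupport()   // gives access to the Hive metastore; execution is Spark SQL / Catalyst
  .getOrCreate()

// Equivalent of the old hc.sql("show databases"); calling .explain on any
// query shows a Spark physical plan, confirming the Spark engine is used.
spark.sql("show databases").show()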

Accessing Hive Tables with Spark SQL

I've set up an AWS EMR cluster that includes Spark 2.3.2, Hive 2.3.3, and HBase 1.4.7. How can I configure Spark to access Hive tables?
I've taken the following steps, but the result is the error message:
java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError:
org/apache/tez/dag/api/SessionNotRunning when creating Hive client
using classpath:
Please make sure that jars for your version of hive and hadoop are
included in the paths passed to spark.sql.hive.metastore.jars
Steps:
cp /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf
In /usr/lib/spark/conf/spark-defaults.conf I added:
spark.sql.hive.metastore.jars /usr/lib/hadoop/lib/*:/usr/lib/hive/lib/*
In Zeppelin I create a Spark session:
val spark = SparkSession.builder.appName("clue").enableHiveSupport().getOrCreate()
import spark.implicits._
Steps 1 and 2 that you mentioned are partially fine, except for a little tweak that might help you.
Since you are using Hive 2.x, set spark.sql.hive.metastore.jars to maven instead, and set spark.sql.hive.metastore.version to match your metastore version (2.3.3). It is sufficient to use just 2.3 as the version; see why in the Apache Spark code.
Here is a sample of my working configuration that I set in spark-defaults.conf:
spark.sql.broadcastTimeout 600 # An arbitrary number that you can change
spark.sql.catalogImplementation hive
spark.sql.hive.metastore.jars maven
spark.sql.hive.metastore.version 2.3 # No need for minor version
spark.sql.hive.thriftServer.singleSession true
spark.sql.warehouse.dir {hdfs | s3 | etc}
hive.metastore.uris thrift://hive-host:9083
With the previous setup, I have been able to execute queries against my data warehouse in Zeppelin as follows:
val rows = spark.sql("YOUR QUERY").show
More details on connecting to an external Hive metastore can be found here (Databricks).
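Once those properties are picked up, a quick sanity check from the same Zeppelin session is to list what the catalog can see; this is a sketch assuming the configuration above:
// If the external Hive metastore is wired up correctly, this shows the
// warehouse's databases rather than only a local Derby 'default'.
spark.sql("SHOW DATABASES").show()
spark.catalog.listTables("default").show()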

Can Spark SQL work without a Hive installation?

I have installed Spark 2.4.0 on a clean Ubuntu instance. Spark DataFrames work fine, but when I try to use spark.sql against a DataFrame, as in the example below, I get the error "Failed to access metastore. This class should not accessed in runtime."
spark.read.json("/data/flight-data/json/2015-summary.json")
.createOrReplaceTempView("some_sql_view")
spark.sql("""SELECT DEST_COUNTRY_NAME, sum(count)
FROM some_sql_view GROUP BY DEST_COUNTRY_NAME
""").where("DEST_COUNTRY_NAME like 'S%'").where("sum(count) > 10").count()
Most of the fixes that I have seen for this error refer to environments where Hive is installed. Is Hive required if I want to use SQL statements against DataFrames in Spark, or am I missing something else?
To follow up with my fix: the problem in my case was that Java 11 was the default on my system. As soon as I set Java 8 as the default, metastore_db started working.
Yes, we can run Spark SQL queries on Spark without installing Hive. By default Hive uses MapReduce as its execution engine; Hive can be configured to use Spark or Tez as the execution engine to run queries much faster. With Hive on Spark, Hive uses the Hive metastore to run Hive queries, and at the same time SQL queries can be executed through Spark. If Spark is used to execute simple SQL queries and is not connected to a Hive metastore server, it uses an embedded Derby database, and a new folder named metastore_db is created under the home folder of the user who executes the query.
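As a minimal sketch of that last point, the snippet from the question runs on a plain local Spark build with no Hive installed; the only side effect is a Derby-backed metastore_db folder in the working directory (the JSON path is the one from the question, and the aggregate is aliased here for readability):
import org.apache.spark.sql.SparkSession

// No enableHiveSupport() and no Hive installation required: temp views and
// plain Spark SQL only need Spark's built-in (Derby-backed) catalog.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sql-without-hive")
  .getOrCreate()

spark.read.json("/data/flight-data/json/2015-summary.json")
  .createOrReplaceTempView("some_sql_view")

spark.sql("""
  SELECT DEST_COUNTRY_NAME, sum(count) AS total
  FROM some_sql_view
  GROUP BY DEST_COUNTRY_NAME
""").where("DEST_COUNTRY_NAME like 'S%'").where("total > 10").count()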

How to configure Hive to use Spark execution engine on Google Dataproc?

I'm trying to configure Hive, running on Google Dataproc image v1.1 (so Hive 2.1.0 and Spark 2.0.2), to use Spark as an execution engine instead of the default MapReduce one.
Following the instructions here https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started doesn't really help; I keep getting Error running query: java.lang.NoClassDefFoundError: scala/collection/Iterable errors when I set hive.execution.engine=spark.
Does anyone know the specific steps to get this running on Dataproc? From what I can tell it should just be a question of making Hive see the right JARs, since both Hive and Spark are already installed and configured on the cluster, and using Hive from Spark (so the other way around) works fine.
This will probably not work with the jars in a Dataproc cluster. In Dataproc, Spark is compiled with Hive bundled (-Phive), which is not suggested / supported by Hive on Spark.
If you really want to run Hive on Spark, you might want to try bringing your own Spark, compiled as described in the wiki, via an initialization action.
If you just want to move Hive off MapReduce on Dataproc, running Tez with this initialization action would probably be easier.

Resources