How to configure Hive to use Spark execution engine on Google Dataproc?

I'm trying to configure Hive, running on Google Dataproc image v1.1 (so Hive 2.1.0 and Spark 2.0.2), to use Spark as an execution engine instead of the default MapReduce one.
Following the instructions at https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started doesn't really help: I keep getting Error running query: java.lang.NoClassDefFoundError: scala/collection/Iterable whenever I set hive.execution.engine=spark.
Does anyone know the specific steps to get this running on Dataproc? From what I can tell it should just be a question of making Hive see the right JARs, since both Hive and Spark are already installed and configured on the cluster, and using Hive from Spark (so the other way around) works fine.

This will probably not work with the jars in a Dataproc cluster. In Dataproc, Spark is compiled with Hive bundled (-Phive), which is neither suggested nor supported by Hive on Spark.
If you really want to run Hive on Spark, you might try bringing in, via an initialization action, your own Spark compiled as described in the wiki.
If you just want to move Hive off MapReduce on Dataproc, running it on Tez with this initialization action would probably be easier.
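For reference, the direction the asker says already works (Spark querying Hive tables through the cluster's shared metastore) needs no extra setup on Dataproc. A minimal sketch, assuming an existing Hive table named my_table (the table name is illustrative, not from the original post):

import org.apache.spark.sql.SparkSession

// On Dataproc, Spark picks up the cluster's hive-site.xml automatically,
// so enabling Hive support is enough to query existing Hive tables.
val spark = SparkSession.builder()
  .appName("spark-reads-hive")
  .enableHiveSupport()
  .getOrCreate()
spark.sql("SELECT count(*) FROM my_table").show()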

Related

Spark Program queries executing via Tez engine instead of Spark

We have a Spark program that executes multiple queries against Hive tables.
Currently the queries are executed using the Tez engine from Spark.
I set sqlContext.sql("SET hive.execution.engine=spark") in the program, with the understanding that the queries would then run on Spark. We are using HDP 2.6.5 and Spark 2.3.0 in our cluster.
Can someone confirm whether this is the correct approach? We do not want to run the queries using the Tez engine; Spark should run them as is.
In the config file /etc/spark2/conf/hive-site.xml we do not have any engine property set up; it contains only the Kerberos and metastore properties.
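A minimal sketch of the approach described above, with one caveat worth noting: hive.execution.engine is a Hive-side setting, so queries submitted through spark.sql are planned and executed by Spark's own engine regardless of its value (the table name below is illustrative):

import org.apache.spark.sql.SparkSession

// Sketch of the asker's setup on Spark 2.3. The SET statement stores the
// property in the session, as in the question; note that spark.sql queries
// run on Spark's own engine, while hive.execution.engine governs sessions
// that go through Hive itself (Hive CLI / HiveServer2).
val spark = SparkSession.builder()
  .appName("hive-tables-from-spark")
  .enableHiveSupport()
  .getOrCreate()
spark.sql("SET hive.execution.engine=spark")
spark.sql("SELECT * FROM some_hive_table LIMIT 10").show()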

CDH 6.2 Hive cannot execute queries on either Spark or MapReduce

I'm trying to run a simple select count(*) from table query on Hive, but it fails with the following error:
FAILED: Execution Error, return code 30041 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Failed to create Spark client for Spark session 5414a8a4-5252-4ccf-b63e-2ee563f7d772_0: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
This is happening since I've moved to CDH 6.2 and enabled Spark (version 2.4.0-cdh6.2.0) as the execution engine of Hive (version 2.1.1-cdh6.2.0).
My guess is that Hive is not correctly configured to launch Spark. I've tried setting the spark.home property in hive-site.xml to /opt/cloudera/parcels/CDH/lib/spark/, and setting the SPARK_HOME environment variable to the same value, but it made no difference.
A similar issue was reported here, but the solution (i.e., putting the spark-assembly.jar file in Hive's lib directory) cannot be applied, as that file is no longer built in recent Spark versions.
A previous question addressed a similar but different issue, related to memory limits on YARN.
Also, switching to MapReduce as the execution engine still fails, but with a different error:
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org/apache/hadoop/hdfs/protocol/SystemErasureCodingPolicies
Searching Google for the latter error shows no results at all.
UPDATE: I discovered that queries do work when connecting to Hive through other tools (e.g., Beeline, Hue, Spark) and independently of the underlying execution engine (i.e., MapReduce or Spark). Thus, the error may lie within the Hive CLI, which is currently deprecated.
UPDATE 2: the same problem actually happened on Beeline and Hue with a CREATE TABLE query; I was able to execute it only with the Hive interpreter of Zeppelin.
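For anyone triaging a similar setup, the check from the first update (running the query through Spark instead of the Hive CLI) looks roughly like this minimal sketch; the table name is a stand-in:

import org.apache.spark.sql.SparkSession

// Runs the same count through Spark to confirm that the metastore and the
// data are reachable independently of the Hive CLI.
val spark = SparkSession.builder()
  .appName("metastore-check")
  .enableHiveSupport()
  .getOrCreate()
spark.sql("SELECT count(*) FROM my_table").show()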

Can Spark-sql work without a hive installation?

I have installed Spark 2.4.0 on a clean Ubuntu instance. Spark DataFrames work fine, but when I try to use spark.sql against a DataFrame, as in the example below, I get the error "Failed to access metastore. This class should not accessed in runtime."
spark.read.json("/data/flight-data/json/2015-summary.json")
  .createOrReplaceTempView("some_sql_view") // DataFrame => SQL view

spark.sql("""SELECT DEST_COUNTRY_NAME, sum(count)
  FROM some_sql_view GROUP BY DEST_COUNTRY_NAME""")
  .where("DEST_COUNTRY_NAME like 'S%'")
  .where("`sum(count)` > 10") // backticks: the result column is literally named sum(count)
  .count()
Most of the fixes I have seen for this error refer to environments where Hive is installed. Is Hive required if I want to use SQL statements against DataFrames in Spark, or am I missing something else?
To follow up with my fix: the problem in my case was that Java 11 was the default on my system. As soon as I set Java 8 as the default, metastore_db started working.
Yes, you can run Spark SQL queries without installing Hive. By default Hive uses MapReduce as its execution engine; it can be configured to use Spark or Tez instead so that queries execute much faster, and in that Hive-on-Spark setup Hive still uses the Hive metastore to run Hive queries. Independently of that, SQL queries can be executed through Spark itself. If Spark is used to execute simple SQL queries and is not connected to a Hive metastore server, it uses an embedded Derby database, and a new folder named metastore_db is created under the home folder of the user who executes the query.
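A minimal sketch of that last point, assuming the stock Spark binary distribution (which bundles the Hive classes) and no hive-site.xml pointing at a metastore server; the JSON path is the one from the question:

import org.apache.spark.sql.SparkSession

// With no external metastore configured, enabling Hive support makes Spark
// fall back to an embedded Derby database and create a metastore_db folder
// the first time the catalog is touched.
val spark = SparkSession.builder()
  .appName("sql-without-a-hive-installation")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

spark.read.json("/data/flight-data/json/2015-summary.json")
  .createOrReplaceTempView("some_sql_view")
spark.sql("SELECT count(*) FROM some_sql_view").show()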

Hive on Tez doesn't work in Spark 2

When working with HDP 2.5 and Spark 1.6.2, we used Hive with Tez as its execution engine and it worked.
But when we moved to HDP 2.6 with Spark 2.1.0, Hive didn't work with Tez as its execution engine, and the following exception was thrown when the DataFrame.saveAsTable API was called:
java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:529)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:188)
After looking at the answer to this question, we switched the Hive execution engine to MR (MapReduce) instead of Tez, and it worked.
However, we'd like to work with Hive on Tez. What's required to resolve the above exception so that Hive on Tez works?
I had the same issue when the Spark job was running in YARN cluster mode; it was resolved once the correct hive-site.xml was added to spark.yarn.dist.files (set in the spark-defaults configuration).
Basically, there are two different hive-site.xml files:
one is the Hive configuration: /usr/hdp/current/hive-client/conf/hive-site.xml
the other is a lighter version for Spark (it has only the details Spark needs to work with Hive): /etc/spark//0/hive-site.xml (please check the exact path for your setup)
We need to use the second file for spark.yarn.dist.files.

Spark with custom hive bindings

How can I build Spark with current (Hive 2.1) bindings instead of 1.2?
http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support
does not mention how this works.
Does Spark work well with Hive 2.x?
I had the same question, and this is what I've found so far. You can try to build Spark with the newer version of Hive:
mvn -Dhive.group=org.apache.hive -Dhive.version=2.1.0 clean package
This runs for a long time and fails in unit tests. If you skip the tests, you get a bit farther but then run into compilation errors. In summary, Spark does not work well with Hive 2.x!
I also searched through the ASF JIRA for Spark and Hive and haven't found any mention of upgrading. This is the closest ticket I was able to find: https://issues.apache.org/jira/browse/SPARK-15691
