Spark Program queries executing via Tez engine instead of Spark - apache-spark

We have a Spark program which executes multiple queries, and the tables are Hive tables.
Currently the queries are executed using the Tez engine from Spark.
I set sqlContext.sql("SET hive.execution.engine=spark") in the program and understand that the queries/program would then run on Spark. We are using HDP 2.6.5 and Spark 2.3.0 in our cluster.
Can someone confirm whether this is the correct approach, as we do not want the queries to run on the Tez engine and would like Spark to execute them directly?
In the config file /etc/spark2/conf/hive-site.xml, we do not have any specific engine property set up; we only have the Kerberos and metastore property details.
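For reference, one way to check what actually executes such a query is to look at the physical plan Spark produces. A minimal sketch, assuming the same sqlContext as above and a hypothetical Hive table my_db.my_table (placeholder name):

// Sketch only: my_db.my_table is a placeholder table name.
// As the related answers below explain, a query submitted through
// sqlContext/SparkSession is planned and executed by Spark itself;
// hive.execution.engine only matters for queries submitted to Hive
// (e.g. via beeline).
sqlContext.sql("SET hive.execution.engine=spark")

val df = sqlContext.sql("SELECT count(*) FROM my_db.my_table")

// Prints a Spark physical plan (HiveTableScan / FileScan, HashAggregate,
// Exchange, ...) -- no Tez or MapReduce job is involved in running it.
df.explain()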

Related

Apache Spark: how can I understand and control if my query is executed on Hive engine or on Spark engine?

I am running a local instance of Spark 2.4.0.
I want to execute an SQL query against Hive.
Before, with Spark 1.x, I was using HiveContext for this:
import org.apache.spark.sql.hive.HiveContext
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val hivequery = hc.sql("show databases")
But now I see that HiveContext is deprecated: https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/hive/HiveContext.html. Inside the HiveContext.sql() code I see that it is now simply a wrapper over SparkSession.sql(). The recommendation is to use enableHiveSupport in the SparkSession builder, but as this question clarifies, that only concerns the metastore and the list of tables; it does not change the execution engine.
So the questions are:
how can I understand if my query is running on Hive engine or on Spark engine?
how can I control this?
From my understanding, there is no separate "Hive engine" to run your query. You submit a query to Hive, and Hive executes it on an engine:
Spark
Tez (a DAG-based generalization of MapReduce)
MapReduce (classic Hadoop)
If you use Spark, your query will be executed by Spark using Spark SQL (starting with Spark 1.5.x, if I recall correctly).
How the Hive engine is configured depends on your distribution's configuration; I remember seeing a Hive on Spark configuration on the Cloudera distribution.
So Hive would use Spark to execute the job matching your query (instead of MapReduce or Tez), but Hive would still parse and analyze it.
Using a local Spark instance, you will only use the Spark engine (Spark SQL / Catalyst), but you can use it with Hive support. That means you would be able to read an existing Hive metastore and interact with it.
It requires a Spark installation with Hive support: the Hive dependencies and hive-site.xml on your classpath.
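As a rough sketch of that last point, a Spark session with Hive support might look like this (assuming the Hive dependencies and hive-site.xml are already on the classpath; the table name is a placeholder):

import org.apache.spark.sql.SparkSession

// SparkSession with Hive support (the Spark 2.x replacement for HiveContext).
// enableHiveSupport() only wires up the Hive metastore, SerDes and UDFs;
// the queries themselves are still planned and run by Spark SQL / Catalyst.
val spark = SparkSession.builder()
  .appName("spark-with-hive-support")
  .enableHiveSupport()
  .getOrCreate()

// Browse the existing Hive metastore ...
spark.sql("show databases").show()

// ... and query a Hive table (placeholder name).
spark.sql("SELECT * FROM my_db.my_table LIMIT 10").show()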

Can Spark-sql work without a hive installation?

I have installed Spark 2.4.0 on a clean Ubuntu instance. Spark DataFrames work fine, but when I try to use spark.sql against a DataFrame, such as in the example below, I am getting the error "Failed to access metastore. This class should not accessed in runtime."
spark.read.json("/data/flight-data/json/2015-summary.json")
  .createOrReplaceTempView("some_sql_view")
spark.sql("""SELECT DEST_COUNTRY_NAME, sum(count)
  FROM some_sql_view GROUP BY DEST_COUNTRY_NAME
""")
  .where("DEST_COUNTRY_NAME like 'S%'")
  .where("`sum(count)` > 10")
  .count()
Most of the fixes that I have seen in relation to this error refer to environments where Hive is installed. Is Hive required if I want to use SQL statements against DataFrames in Spark, or am I missing something else?
To follow up with my fix: the problem in my case was that Java 11 was the default on my system. As soon as I set Java 8 as the default, metastore_db started working.
Yes, we can run Spark SQL queries on Spark without installing Hive. By default Hive uses MapReduce as its execution engine; we can configure Hive to use Spark or Tez as the execution engine to execute our queries much faster. With Hive on Spark, Hive uses the Hive metastore to run Hive queries. At the same time, SQL queries can be executed through Spark. If Spark is used to execute simple SQL queries and is not connected to a Hive metastore server, it uses an embedded Derby database, and a new folder named metastore_db will be created under the home folder of the user who executes the query.
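To illustrate the first point, here is a minimal, self-contained sketch of the question's example with a plain SparkSession and no Hive installation at all (only the JSON path from the question is assumed):

import org.apache.spark.sql.SparkSession

// No enableHiveSupport() and no Hive installation: temp views and spark.sql
// go through Spark's built-in catalog and the Catalyst engine.
val spark = SparkSession.builder()
  .appName("spark-sql-without-hive")
  .master("local[*]")
  .getOrCreate()

spark.read.json("/data/flight-data/json/2015-summary.json")
  .createOrReplaceTempView("some_sql_view")

spark.sql("""SELECT DEST_COUNTRY_NAME, sum(count)
    FROM some_sql_view GROUP BY DEST_COUNTRY_NAME
  """)
  .where("DEST_COUNTRY_NAME like 'S%'")
  .where("`sum(count)` > 10") // backticks: the aggregated column is literally named sum(count)
  .count()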

Hive queries in Spark cluster

I need to understand how a Hive query will get executed in a Spark cluster. Will it operate as a MapReduce job running in memory, or will it use the Spark architecture for running the Hive queries? Please clarify.
If you run Hive queries in Hive or Beeline, they will use MapReduce; but if you run the queries in the Spark REPL or a Spark program, they will simply be converted into DataFrames, the logical and physical plans will be created the same way as for a DataFrame, and Spark will execute them. Hence they will use all the power of Spark.
Assuming that you have a Hadoop cluster with YARN and Spark configured:
The Hive execution engine is controlled by the hive.execution.engine property. According to the docs it can be mr (the default), tez or spark.

What is the difference between "Hive on Spark" and "Spark SQL with Hive Metastore"? In production, which one should be used? Why?

This is my opinion:
Hive on Spark provides Hive with the ability to utilize Apache Spark as its execution engine. Spark SQL also supports reading and writing data stored in Apache Hive. Hive on Spark only uses the Spark execution engine. Spark SQL with Hive Metastore not only uses the Spark execution engine, but also uses Spark SQL, which is a Spark module for structured data processing and for executing SQL queries. Because Spark SQL with Hive Metastore does not support all Hive configurations or every version of the Hive metastore (the available versions are 0.12.0 through 1.2.1), in production the Hive on Spark deployment is better and more effective.
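For context, the metastore-version limitation mentioned above is what the spark.sql.hive.metastore.* settings control on the Spark SQL side; a minimal sketch (the values here are only illustrative):

import org.apache.spark.sql.SparkSession

// Spark SQL against an external Hive metastore: the metastore client version
// and its jars are configured explicitly (illustrative values only).
val spark = SparkSession.builder()
  .appName("spark-sql-with-hive-metastore")
  .config("spark.sql.hive.metastore.version", "1.2.1")
  .config("spark.sql.hive.metastore.jars", "maven") // or "builtin", or a classpath of jars
  .enableHiveSupport()
  .getOrCreate()

// Queries still run on Spark's engine; only the metastore client changes.
spark.sql("show databases").show()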
So, am I wrong? Does anyone have other ideas?

How to configure Hive to use Spark execution engine on Google Dataproc?

I'm trying to configure Hive, running on Google Dataproc image v1.1 (so Hive 2.1.0 and Spark 2.0.2), to use Spark as an execution engine instead of the default MapReduce one.
Following the instructions here https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started doesn't really help; I keep getting Error running query: java.lang.NoClassDefFoundError: scala/collection/Iterable errors when I set hive.execution.engine=spark.
Does anyone know the specific steps to get this running on Dataproc? From what I can tell it should just be a question of making Hive see the right JARs, since both Hive and Spark are already installed and configured on the cluster, and using Hive from Spark (so the other way around) works fine.
This will probably not work with the jars in a Dataproc cluster. In Dataproc, Spark is compiled with Hive bundled (-Phive), which is not suggested / supported by Hive on Spark.
If you really want to run Hive on Spark, you might want to try to bring your own Spark in an initialization action compiled as described in the wiki.
If you just want to move Hive off MapReduce on Dataproc, running Tez with this initialization action would probably be easier.

Resources