Difference between executing Hive queries on MapReduce vs Hive on Spark

By default, a Hive query runs on MapReduce backed by the YARN engine. We could use Tez as an execution engine for better performance when working with the Hortonworks distribution. We could also run Hive on Spark, which is likewise backed by YARN. Could you please explain in detail the pros and cons of using the Spark execution engine for Hive with respect to performance, efficiency, and other factors?
set hive.execution.engine=spark;
vs
set hive.execution.engine=mr;

Related

Spark Program queries executing via Tez engine instead of Spark

We have a Spark program which executes multiple queries, and the tables are Hive tables.
Currently the queries are executed using the Tez engine from Spark.
I set sqlContext.sql("SET hive.execution.engine=spark") in the program, and my understanding is that the queries/program would then run as Spark. We are using HDP 2.6.5 and Spark 2.3.0 in our cluster.
Can someone confirm whether this is the correct approach? We do not want the queries to run using the Tez engine; Spark should run them as is.
In the config file /etc/spark2/conf/hive-site.xml, we do not have any specific engine property set up; we only have the Kerberos and metastore property details.
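For context, a minimal sketch of such a program, assuming the goal is for Spark itself to plan and execute the queries through its own SQL engine (the app, database, and table names below are illustrative, not from the original question):

import org.apache.spark.sql.SparkSession

object HiveQueryJob {
  def main(args: Array[String]): Unit = {
    // Hive support makes Spark read table metadata from the Hive metastore
    // configured in the hive-site.xml found on the classpath.
    val spark = SparkSession.builder()
      .appName("hive-query-job")
      .enableHiveSupport()
      .getOrCreate()

    // Spark SQL plans and runs this query itself; hive.execution.engine is a
    // Hive-side property and does not change how Spark executes a DataFrame.
    val df = spark.sql("SELECT id, name FROM mydb.customers WHERE active = true")
    df.show()

    spark.stop()
  }
}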

Why is Spark SQL preferred over Hive?

I am evaluating both Spark SQL and Hive with Spark as the processing engine. Most people prefer to use Spark SQL over Hive on Spark. I feel Hive on Spark is the same as Spark SQL, or am I missing something here? Are there any advantages to using Spark SQL over Hive running on the Spark processing engine?
Any clue would be helpful.
One point would be the difference in how the queries are executed.
With Hive on the Spark execution engine, each query spins up a new set of executors, whereas with Spark SQL you have a Spark session with a set of long-living executors in which you can cache data (create temporary tables), which can speed up your queries substantially.
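To illustrate the caching point, a hedged sketch (the table and view names are made up for the example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("long-lived-session")
  .enableHiveSupport()
  .getOrCreate()

// Register the Hive table under a temporary name and pin it in executor memory.
spark.table("sales.orders").createOrReplaceTempView("orders_cached")
spark.sql("CACHE TABLE orders_cached")

// Subsequent queries in this session read the in-memory copy instead of
// re-scanning the files on HDFS, which is where the speed-up comes from.
spark.sql("SELECT customer_id, sum(amount) FROM orders_cached GROUP BY customer_id").show()
spark.sql("SELECT count(*) FROM orders_cached WHERE amount > 100").show()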

Hive queries in Spark cluster

I need to understand how a Hive query gets executed in a Spark cluster. Will it operate as a MapReduce job running in memory, or will it use the Spark architecture for running the Hive queries? Please clarify.
If you run Hive queries in Hive or Beeline, they will use MapReduce; but if you run Hive queries from the Spark REPL or a Spark program, the queries are simply converted into DataFrames: the same logical and physical plans are created as for a DataFrame, and they execute accordingly. Hence they will use all the power of Spark.
Assuming that you have a Hadoop cluster with YARN and Spark configured:
The Hive execution engine is controlled by the hive.execution.engine property. According to the docs, it can be mr (the default), tez, or spark.
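One way to verify the DataFrame point above from the Spark shell is to inspect the query plan; a small sketch, assuming Spark 2.x with Hive support (the table name is illustrative):

// In spark-shell the Hive query simply becomes a DataFrame; explain(true)
// prints Spark's parsed, analyzed, optimized, and physical plans, and no
// MapReduce or Tez stages appear anywhere in the output.
val df = spark.sql("SELECT category, count(*) AS cnt FROM mydb.products GROUP BY category")
df.explain(true)
df.show()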

What is the difference between "Hive on Spark" and "Spark SQL with Hive Metastore"? Which one should be used in production, and why?

This is my opinion:
Hive on Spark provides Hive with the ability to utilize Apache Spark as its execution engine. Spark SQL, for its part, supports reading and writing data stored in Apache Hive. Hive on Spark only uses Spark as the execution engine; Spark SQL with the Hive Metastore not only uses the Spark execution engine but also Spark SQL, which is a Spark module for structured data processing and for executing SQL queries. Because Spark SQL with the Hive Metastore does not support all Hive configurations or every version of the Hive metastore (the available versions are 0.12.0 through 1.2.1), the Hive on Spark deployment mode is better and more effective in production.
So, am I wrong? Does anyone have other ideas?
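As a concrete illustration of the metastore-version point above, Spark exposes the spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars settings; a hedged sketch (the values shown are examples, so check your Spark release's documentation):

import org.apache.spark.sql.SparkSession

// Pin Spark SQL to a specific Hive metastore version; the supported range
// (0.12.0 through 1.2.1 for the Spark releases discussed above) is exactly
// the limitation the answer mentions.
val spark = SparkSession.builder()
  .appName("spark-sql-hive-metastore")
  .config("spark.sql.hive.metastore.version", "1.2.1")
  .config("spark.sql.hive.metastore.jars", "builtin") // or "maven", or a classpath
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()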

SparkSQL vs Hive on Spark - Difference and pros and cons?

The SparkSQL CLI internally uses HiveQL, and in the case of Hive on Spark (HIVE-7292), Hive uses Spark as the backend engine. Can somebody shed some more light on how exactly these two scenarios differ, and on the pros and cons of both approaches?
When SparkSQL uses Hive
SparkSQL can use the Hive Metastore to get the metadata of the data stored in HDFS. This metadata enables SparkSQL to do better optimization of the queries that it executes. Here Spark is the query processor.
When Hive uses Spark (see the JIRA entry: HIVE-7292)
Here the data is accessed via Spark, and Hive is the query processor, so we have all the design features of Spark Core to take advantage of. But this is a major improvement for Hive and was still "in progress" as of February 2, 2016.
There is a third option to process data with SparkSQL:
Use SparkSQL without using Hive. Here SparkSQL does not have access to the metadata from the Hive Metastore, and the queries run slower. I have done some performance tests comparing options 1 and 3. The results are here.
SparkSQL vs the Spark API: you can simply imagine you are in the RDBMS world.
SparkSQL is pure SQL, and the Spark API is the language for writing stored procedures.
Hive on Spark is similar to SparkSQL: it is a pure SQL interface that uses Spark as the execution engine. SparkSQL uses Hive's syntax, so as a language I would say they are almost the same (see the sketch after this paragraph).
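To make the analogy concrete, here is the same aggregation written both ways; a sketch with an illustrative table name:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// "Pure SQL" style, as in SparkSQL or Hive on Spark.
val bySql = spark.sql(
  "SELECT dept, avg(salary) AS avg_salary FROM hr.employees GROUP BY dept")

// "Stored procedure" style: the same logic through the DataFrame API.
val byApi = spark.table("hr.employees")
  .groupBy("dept")
  .agg(avg("salary").as("avg_salary"))

bySql.show()  // both produce the same plan and the same result
byApi.show()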
That said, Hive on Spark has much better support for Hive features, especially HiveServer2 and the security features. The Hive features in SparkSQL are really buggy: there is a HiveServer2 implementation in SparkSQL, but in the latest release version (1.6.x), HiveServer2 in SparkSQL no longer works with the hivevar and hiveconf arguments, and the username for login via JDBC doesn't work either...
see https://issues.apache.org/jira/browse/SPARK-13983
I believe Hive support in the Spark project is really very low-priority stuff...
Sadly, the Hive-on-Spark integration is not that easy; there are a lot of dependency conflicts, such as
https://issues.apache.org/jira/browse/HIVE-13301
And when I'm trying the Hive-with-Spark integration, for debugging purposes I always start the Hive CLI like this:
export HADOOP_USER_CLASSPATH_FIRST=true
bin/hive --hiveconf hive.root.logger=DEBUG,console
Our requirement is to use Spark with HiveServer2 in a secure way (with authentication and authorization); currently SparkSQL alone cannot provide this, so we are using Ranger/Sentry + Hive on Spark.
Hope this helps you get a better idea of which direction you should go.
Here is a related answer I found on the official Hive site:
1.3 Comparison with Shark and Spark SQL
There are two related projects in the Spark ecosystem that provide Hive QL support on Spark: Shark and Spark SQL.
● The Shark project translates query plans generated by Hive into its own representation and executes them over Spark.
● Spark SQL is a feature in Spark. It uses Hive's parser as the frontend to provide Hive QL support. Spark application developers can easily express their data processing logic in SQL, as well as with the other Spark operators, in their code. Spark SQL supports a different use case than Hive.
Compared with Shark and Spark SQL, our approach by design supports all existing Hive features, including Hive QL (and any future extension), and Hive's integration with authorization, monitoring, auditing, and other operational tools.
3. Hive-Level Design
As noted in the introduction, this project takes a different approach from that of Shark or Spark SQL in the sense that we are not going to implement SQL semantics using Spark's primitives. On the contrary, we will implement it using MapReduce primitives. The only new thing here is that these MapReduce primitives will be executed in Spark. In fact, only a few of Spark's primitives will be used in this design.
The approach of executing Hive's MapReduce primitives on Spark, which is different from what Shark or Spark SQL does, has the following direct advantages:
1. Spark users will automatically get the whole set of Hive's rich features, including any new features that Hive might introduce in the future.
2. This approach avoids or reduces the necessity of any customization work in Hive's Spark execution engine.
3. It will also limit the scope of the project and reduce long-term maintenance by keeping Hive-on-Spark congruent to Hive MapReduce and Tez.
