Why is Spark SQL preferred over Hive? - apache-spark

I am evaluating both Spark SQL and Hive with Spark as the processing engine. Most people prefer to use Spark SQL over Hive on Spark. I feel Hive on Spark is the same as Spark SQL, or am I missing something here? Are there any advantages to using Spark SQL over Hive running on the Spark processing engine?
Any clue would be helpful.

One point is the difference in how queries are executed.
With Hive on the Spark execution engine, each query spins up a new set of executors. With Spark SQL, you have a Spark session with a set of long-living executors where you can cache data (for example, by creating temporary tables), which can speed up your queries substantially.
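For illustration, here is a minimal Scala sketch of that caching pattern; the session setup, database, table, and column names are assumptions for the example, not taken from the question above:

import org.apache.spark.sql.SparkSession

// One long-living session; its executors stay up across queries
val spark = SparkSession.builder()
  .appName("spark-sql-caching")
  .enableHiveSupport()
  .getOrCreate()

// First query reads from storage ...
val orders = spark.sql("SELECT * FROM sales.orders")   // hypothetical Hive table
orders.createOrReplaceTempView("orders_tmp")
spark.sql("CACHE TABLE orders_tmp")                     // materialize it in executor memory

// ... later queries against orders_tmp are served from the cached data
spark.sql("SELECT count(*) FROM orders_tmp WHERE status = 'OPEN'").show()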

Related

Hive with Hadoop vs Hive with Spark vs Spark SQL vs HDFS - how do they all work with each other?

I am a little confused about which combination I should use to achieve my goal: I need to store data in HDFS and perform analysis on the queried data.
I have a few questions about it:
If I use Hive with Hadoop, it will use MapReduce, which will slow down my queries. (Since I am using Hadoop, HDFS is already present for data storage.)
If I use the Spark engine instead of Hadoop to evaluate my queries, it will be faster, but what about HDFS? Will I have to create another Hadoop cluster to store data in HDFS?
If we have Spark SQL, then what is the need for Hive?
If I use Spark SQL, how would it connect to HDFS?
It would help if someone could explain the usage of these tools.
Thanks !!
You can use Hive on Spark. https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
You do not need to create another Hadoop cluster. Spark can access data from HDFS.
Spark can work with Hive or without Hive.
Spark can connect to multiple data sources, including HDFS.
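As a rough sketch of the last two points (the HDFS path, file format, and view name below are made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hdfs-access").getOrCreate()

// Spark reads HDFS directly through an hdfs:// URI; no extra cluster is needed
val events = spark.read
  .option("header", "true")
  .csv("hdfs:///data/events/")          // hypothetical HDFS path

events.createOrReplaceTempView("events")
spark.sql("SELECT count(*) FROM events").show()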

Hive queries in a Spark cluster

I need to understand how a Hive query gets executed in a Spark cluster. Will it operate as a MapReduce job running in memory, or will it use the Spark architecture to run the Hive queries? Please clarify.
If you run Hive queries in Hive or Beeline, they will use MapReduce. But if you run Hive queries in the Spark REPL or a Spark program, the queries are simply converted into DataFrames, with the logical and physical plans created and executed the same way as for a DataFrame, so they use the full power of Spark.
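A small sketch of what that looks like in the Spark REPL (the table and column names are assumed; spark is the session that spark-shell provides):

// Submitted through spark.sql, the Hive query is planned and run by Spark, not MapReduce
val df = spark.sql("SELECT dept, count(*) AS cnt FROM employees GROUP BY dept")
df.explain(true)   // prints the parsed, analyzed, optimized and physical plans
df.show()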
Assuming that you have a Hadoop cluster with YARN and Spark configured: the Hive execution engine is controlled by the hive.execution.engine property. According to the docs it can be mr (the default), tez, or spark.

Difference between executing Hive queries on MapReduce vs Hive on Spark

By default, Hive queries run on MapReduce backed by YARN. We could use Tez as an execution engine for better performance when working with the Hortonworks distribution. We could also run Hive on Spark, which is also backed by YARN. Could you please explain in detail the pros and cons of using Hive on the Spark execution engine with respect to performance, efficiency, and other factors?
set hive.execution.engine=spark;
vs
set hive.execution.engine=mr;

What is the difference between "Hive on Spark" and "Spark SQL with Hive Metastore"? In production, which one should be used? Why?

This is my opinion:
Hive on Spark gives Hive the ability to use Apache Spark as its execution engine. Spark SQL also supports reading and writing data stored in Apache Hive. Hive on Spark only uses the Spark execution engine, whereas Spark SQL with a Hive Metastore not only uses the Spark execution engine but also Spark SQL itself, the Spark module for structured data processing and executing SQL queries. Because Spark SQL with a Hive Metastore does not support all Hive configurations or all versions of the Hive metastore (the available versions are 0.12.0 through 1.2.1), in production the Hive on Spark deployment mode is better and more effective.
So, am I wrong? Does anyone have other ideas?
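For reference, a minimal sketch of the "Spark SQL with Hive Metastore" setup being compared; the metastore version and jars setting below are illustrative assumptions, not a recommendation:

import org.apache.spark.sql.SparkSession

// Spark's own engine plus the Hive catalog
val spark = SparkSession.builder()
  .appName("spark-sql-with-hive-metastore")
  .config("spark.sql.hive.metastore.version", "1.2.1")   // assumed metastore version
  .config("spark.sql.hive.metastore.jars", "builtin")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()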

Spark SQL: how does it tell Hive to run the query on Spark?

As rightly pointed out here:
Spark SQL query execution on Hive
Spark SQL, when running through HiveContext, will make the SQL query use the Spark engine.
How does Spark SQL setting hive.execution.engine=spark tell Hive to do so?
Note that this works automatically; we do not have to specify it in hive-site.xml in Spark's conf directory.
There are two independent projects here:
Hive on Spark - Hive project that integrates Spark as an additional engine.
Spark SQL - Spark module that makes use of the Hive code.
HiveContext belongs to the second, and hive.execution.engine is a property of the first.
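A minimal Spark 1.x-style sketch of the second project (the table name is hypothetical); because HiveContext is part of Spark SQL, the query runs on Spark's engine regardless of hive.execution.engine:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hivecontext-demo"))
val hiveContext = new HiveContext(sc)

// Planned and executed by Spark SQL; Hive is used only for the metastore/catalog
hiveContext.sql("SELECT count(*) FROM some_hive_table").show()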
