How to set up metadata database for Spark SQL? - apache-spark

Hive has its own metastore and stores table, column, and partition information there.
If I do not want to use Hive, can we create a metadata store for Spark the same way Hive does?
I want to query with Spark SQL (not using DataFrames) like Hive (SELECT, FROM and WHERE). Can we do that? If yes, which relational DB can we use for metadata storage?

Can we create a metadata store for Spark the same way Hive does?
Spark does this for you and you don't have to use a separate installation of Hive or even just part of it (e.g. a Hive metastore).
Regardless of the installation of Apache Spark you use, Spark SQL uses a Hive metastore internally for the same purpose as Hive does (but the metastore is now part of Spark SQL).
If yes, which relational DB can we use for metadata storage?
Anything that Hive supports, e.g. Oracle, MySQL, PostgreSQL. The configuration is pretty much as you would do with a separate Hive installation (which is usually the case in such enterprisey installations).
You may want to read Hive Metastore.
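As a hedged sketch of what that configuration could look like (the MySQL host, database name, driver and credentials below are placeholders): the usual Hive metastore connection properties can live in a hive-site.xml under $SPARK_HOME/conf, or be passed to the session as Hadoop/Hive configuration, roughly like this:

import org.apache.spark.sql.SparkSession

// Sketch only: the JDBC URL, driver class and credentials are placeholders, and the
// same four properties are more commonly kept in $SPARK_HOME/conf/hive-site.xml.
val spark = SparkSession.builder()
  .appName("spark-sql-with-mysql-metastore")
  .enableHiveSupport()
  .config("spark.hadoop.javax.jdo.option.ConnectionURL",
    "jdbc:mysql://db-host:3306/metastore?createDatabaseIfNotExist=true")
  .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver")
  .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
  .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "hive")
  .getOrCreate()

// Metadata for this table would now live in the MySQL-backed metastore.
spark.sql("CREATE TABLE IF NOT EXISTS demo_table (id INT, name STRING) USING parquet")
spark.sql("SHOW TABLES").show()

If none of this is configured, Spark falls back to an embedded Derby metastore.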

Spark is essentially a distributed computation system rather than a distributed storage system. Therefore, we mostly use Spark to do the computation work, which needs metadata from different storage systems.
However, Spark internally provides an InMemoryCatalog to store the metadata if it's not configured with Hive.
You can take a look at this for more information.
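As a small sketch of what that looks like in practice (no Hive involved; the view name below is made up), the session falls back to the in-memory catalog and still answers plain SQL:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("in-memory-catalog-demo")
  .master("local[*]")
  .getOrCreate()

// Without enableHiveSupport() this typically reports "in-memory", i.e. the
// InMemoryCatalog mentioned above ("hive" when a Hive metastore is configured).
println(spark.conf.get("spark.sql.catalogImplementation", "in-memory"))

// Table metadata registered here lives only in the in-memory catalog and is
// gone once the application stops.
spark.range(5).createOrReplaceTempView("numbers")
spark.sql("SELECT id FROM numbers WHERE id > 2").show()
spark.catalog.listTables().show()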

Related

Spark-SQL plug in on HIVE

Hive has a metastore, and HiveServer2 listens for SQL requests; with the help of the metastore, the query is executed and the result is passed back.
HiveServer2 is essentially a customised Thrift service. In this way, Hive acts as a service, and from a programming language we can use Hive as a database.
The relationship between Spark SQL and Hive is this:
Spark SQL just utilises the Hive setup (the HDFS file system, the Hive metastore, HiveServer2). When we invoke sbin/start-thriftserver.sh (present in the Spark installation), we are supposed to give the hiveserver2 port number and hostname. Then, via Spark's Beeline, we can actually create, drop and manipulate tables in Hive. The API can be either Spark SQL or Hive QL.
If we create or drop a table, it will be clearly visible if we log into Hive and check (say, via Hive Beeline or the Hive CLI). In other words, changes made via Spark can be seen in the Hive tables.
My understanding is that Spark does not have its own metastore setup like Hive. Spark just utilises the Hive setup, and the SQL execution simply happens via the Spark SQL API.
Is my understanding correct here?
Then I am a little confused about the usage of bin/spark-sql (which is also present in the Spark installation). The documentation says that via this SQL shell we can create tables like we do above (via the Thrift server/Beeline). Now my question is: how is the metadata information maintained by Spark then?
Or, like the first approach, can we make the spark-sql CLI communicate with Hive (to be specific, Hive's hiveserver2)?
If yes, how can we do that?
Thanks in advance!
My understanding is that Spark does not have its own metastore setup like Hive
Spark will start an embedded Derby metastore on its own if a Hive metastore is not provided.
can we make the spark-sql CLI communicate with Hive
Start an external metastore process and add a hive-site.xml file to $SPARK_CONF_DIR that sets hive.metastore.uris, or use SET SQL statements to do the same.
Then the spark-sql CLI should be able to query Hive tables. From code, you need to use the enableHiveSupport() method on the SparkSession.
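A hedged sketch of the programmatic route (the thrift URI, database and table names are placeholders for your own metastore host and tables; the same property can instead sit in $SPARK_CONF_DIR/hive-site.xml for the spark-sql CLI):

import org.apache.spark.sql.SparkSession

// "thrift://metastore-host:9083" is a placeholder; point it at your metastore service.
val spark = SparkSession.builder()
  .appName("query-hive-tables")
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
spark.sql("SELECT * FROM some_db.some_table LIMIT 10").show()  // hypothetical Hive table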

Hive with Hadoop vs Hive with spark vs spark sql vs HDFS - How do they all work with each other?

I am a little confused about which combination I should use to achieve my goal: I need to store data in HDFS and perform analysis on the queried data.
I have a few queries regarding it:
If I use Hive with Hadoop, it will use MapReduce, which is going to slow down my queries (as I am using Hadoop, HDFS is present here for data storage).
If, instead of Hadoop, I use the Spark engine to evaluate my queries, it would be faster, but what about HDFS? I will have to create another Hadoop cluster to store data in HDFS.
If we have Spark SQL, then what is the need for Hive?
If I use Spark SQL, how would it connect to HDFS?
Could anyone explain the usage of these tools?
Thanks !!
You can use Hive on Spark. https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
You do not need to create another Hadoop cluster. Spark can access data from HDFS.
Spark can work with Hive or without Hive.
Spark can connect to multiple data sources, including HDFS.
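For example, a minimal sketch of Spark reading straight from HDFS and querying it with Spark SQL, with no Hive in the picture (the namenode address, path, and column names are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-sql-on-hdfs")
  .getOrCreate()

// Placeholder HDFS path; Spark talks to HDFS directly, no extra cluster needed.
val events = spark.read.parquet("hdfs://namenode:8020/data/events")
events.createOrReplaceTempView("events")

spark.sql("SELECT event_type, count(*) AS cnt FROM events GROUP BY event_type").show()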

Need use case or example for Spark’s Relationship to Hive

I am reading the Spark Definitive Guide.
In the "Spark’s Relationship to Hive" section, the following lines are given:
"With Spark SQL, you can connect to your Hive metastore (if you already have one) and access table metadata to reduce file listing when accessing information. This is popular for users who are migrating from a legacy Hadoop environment and beginning to run all their workloads using Spark."
I am not able to understand what this means. Could someone please help me with examples for the above use case?
Spark, being one of the more recent tools in the Hadoop ecosystem, has connectivity with the earlier Hadoop tools. Hive was the most popular until recently. Most Hadoop platforms have data stored in Hive tables, which can be accessed using Hive as a SQL engine. However, Spark can do the same things.
So, the given statements mean that you can connect to the Hive metastore (which contains information about existing tables, databases, their location, schema, file types, etc.) and then run similar Hive queries on them just like you would with Hive.
Below are two examples of what you can do with Spark once you connect to the Hive metastore.
spark.sql("show databases")
spark.sql("select * from test_db.test_table")
I hope this answers your question.

Where is Spark catalog metadata stored?

I have been trying to get an accurate view of how Spark's catalog API stores metadata.
I have found some resources, but no answer:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-Catalog.html
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-CatalogImpl.html
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/catalog/Catalog.html
I see some tutorials that take the existence of a Hive metastore for granted.
Is a Hive metastore potentially included with the Spark distribution?
A Spark cluster can be short-lived, but a Hive metastore would obviously need to be long-lived.
Apart from the catalog feature, the partitioning and sorting features when writing out a DataFrame seem to depend on Hive... So "everyone" seems to take Hive for granted when talking about the key Spark feature of persisting a DataFrame.
Spark becomes aware of the Hive metastore when it is provided with a hive-site.xml, which is typically placed under $SPARK_HOME/conf. Whenever the enableHiveSupport() method is used while creating the SparkSession, Spark finds where and how to connect to the Hive metastore. Spark therefore does not explicitly store the Hive settings itself.
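As a hedged check of where things end up on a given setup (defaults vary by version and configuration): without Hive, Spark typically creates an embedded Derby metastore_db directory in the application's working directory and keeps managed table data under spark.sql.warehouse.dir:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("where-is-my-metadata")
  .getOrCreate()

// "hive" when hive-site.xml / enableHiveSupport() is in play, otherwise "in-memory".
println(spark.conf.get("spark.sql.catalogImplementation", "in-memory"))

// Location for managed table data (a local spark-warehouse directory unless overridden);
// the embedded Derby metastore_db folder shows up in the working directory.
println(spark.conf.get("spark.sql.warehouse.dir"))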

SparkSQL vs Hive on Spark - Difference and pros and cons?

The SparkSQL CLI internally uses HiveQL, and in the case of Hive on Spark (HIVE-7292), Hive uses Spark as its backend engine. Can somebody shed some more light on how exactly these two scenarios differ, and the pros and cons of both approaches?
When SparkSQL uses Hive
SparkSQL can use the Hive metastore to get the metadata of the data stored in HDFS. This metadata enables SparkSQL to do better optimization of the queries it executes. Here, Spark is the query processor.
When Hive uses Spark: see the JIRA entry HIVE-7292
Here the data is accessed via Spark, and Hive is the query processor. So we have all the design features of Spark Core to take advantage of. But this is a major improvement for Hive and was still "in progress" as of Feb 2, 2016.
There is a third option to process data with SparkSQL
Use SparkSQL without using Hive. Here SparkSQL does not have access to the metadata from the Hive Metastore. And the queries run slower. I have done some performance tests comparing options 1 and 3. The results are here.
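A hedged sketch of the practical difference between options 1 and 3 at the session level (the database, table, and path names are placeholders, and the two setups would normally live in separate applications):

import org.apache.spark.sql.SparkSession

// Option 1: SparkSQL backed by the Hive metastore (assumes a hive-site.xml is present).
val withHive = SparkSession.builder()
  .appName("sparksql-with-hive-metastore")
  .enableHiveSupport()
  .getOrCreate()
withHive.sql("SELECT count(*) FROM warehouse_db.orders").show()  // table known to the metastore

// Option 3: SparkSQL on its own; no metastore, so tables are registered from files.
// (Run this as a separate application: getOrCreate() would otherwise reuse the session above.)
val withoutHive = SparkSession.builder()
  .appName("sparksql-without-hive")
  .getOrCreate()
withoutHive.read.parquet("hdfs:///data/orders").createOrReplaceTempView("orders")  // placeholder path
withoutHive.sql("SELECT count(*) FROM orders").show()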
SparkSQL vs the Spark API: you can simply imagine you are in the RDBMS world:
SparkSQL is pure SQL, and the Spark API is the language for writing stored procedures.
Hive on Spark is similar to SparkSQL: it is a pure SQL interface that uses Spark as the execution engine. SparkSQL uses Hive's syntax, so as a language, I would say they are almost the same.
But Hive on Spark has much better support for Hive features, especially hiveserver2 and the security features. Hive features in SparkSQL are really buggy: there is a hiveserver2 implementation in SparkSQL, but in the latest release (1.6.x), hiveserver2 in SparkSQL doesn't work with the hivevar and hiveconf arguments anymore, and the username for login via JDBC doesn't work either...
see https://issues.apache.org/jira/browse/SPARK-13983
I believe Hive support in the Spark project is really very low-priority stuff...
Sadly, the Hive on Spark integration is not that easy; there are a lot of dependency conflicts... such as
https://issues.apache.org/jira/browse/HIVE-13301
And when I'm trying the Hive with Spark integration, for debugging purposes I always start the Hive CLI like this:
export HADOOP_USER_CLASSPATH_FIRST=true
bin/hive --hiveconf hive.root.logger=DEBUG,console
Our requirement is to use Spark with hiveserver2 in a secure way (with authentication and authorization). Currently SparkSQL alone cannot provide this, so we are using Ranger/Sentry + Hive on Spark.
I hope this helps you get a better idea of which direction you should go.
Here is a related answer I found on the official Hive site:
1.3 Comparison with Shark and Spark SQL 
There are two related projects in the Spark ecosystem that provide Hive QL support on Spark: Shark and Spark SQL. 
● The Shark project translates query plans generated by Hive into its own representation and executes them over Spark.
● Spark SQL is a feature in Spark. It uses Hive’s parser as the frontend to provide Hive QL support. Spark application developers can easily express their data processing logic in SQL, as well as the other Spark operators, in their code. Spark SQL supports a different use case than Hive.
Compared with Shark and Spark SQL, our approach by design supports all existing Hive features, including Hive QL (and any future extension), and Hive’s integration with authorization, monitoring, auditing, and other operational tools. 
3. Hive-Level Design
As noted in the introduction, this project takes a different approach from that of Shark or Spark SQL in the sense that we are not going to implement SQL semantics using Spark's primitives. On the contrary, we will implement it using MapReduce primitives. The only new thing here is that these MapReduce primitives will be executed in Spark. In fact, only a few of Spark's primitives will be used in this design. 
The approach of executing Hive’s MapReduce primitives on Spark that is different from what Shark or Spark SQL does has the following direct advantages: 
1. Spark users will automatically get the whole set of Hive’s rich features, including any new features that Hive might introduce in the future.
2. This approach avoids or reduces the necessity of any customization work in Hive’s Spark execution engine.
3. It will also limit the scope of the project and reduce long-term maintenance by keeping Hive-on-Spark congruent to Hive MapReduce and Tez.

Resources