Where is Spark catalog metadata stored? - apache-spark

I have been trying to get an accurate view of how Spark's catalog API stores metadata.
I have found some resources, but no answer:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-Catalog.html
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-CatalogImpl.html
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/catalog/Catalog.html
I see some tutorials that take for granted the existence of Hive Metastore.
Is a Hive metastore potentially included with the Spark distribution?
A Spark cluster can be short-lived, but a Hive metastore would obviously need to be long-lived.
Apart from the catalog feature, the partitioning and sorting features used when writing out a DataFrame also seem to depend on Hive... So "everyone" seems to take Hive for granted when talking about key Spark features for persisting a DataFrame.

Spark becomes aware of the Hive metastore when it is provided with a hive-site.xml, which is typically placed under $SPARK_HOME/conf. Whenever the enableHiveSupport() method is used while creating the SparkSession, Spark finds where and how to connect to the Hive metastore. Spark therefore does not explicitly store the Hive settings itself.
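For illustration, here is a minimal PySpark sketch of that setup, assuming hive-site.xml is already in place under $SPARK_HOME/conf (the application name is just a placeholder):

from pyspark.sql import SparkSession

# enableHiveSupport() makes Spark pick up hive-site.xml from $SPARK_HOME/conf
# and connect to the Hive metastore it points to.
spark = (SparkSession.builder
         .appName("hive-metastore-example")   # placeholder name
         .enableHiveSupport()
         .getOrCreate())

spark.sql("show databases").show()            # databases now come from the Hive metastore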

Related

Need use case or example for Spark’s Relationship to Hive

I am reading Spark: The Definitive Guide.
In the "Spark's Relationship to Hive" section, the below lines are given:
"With Spark SQL, you can connect to your Hive metastore (if you already have one) and access table metadata to reduce file listing when accessing information. This is popular for users who are migrating from a legacy Hadoop environment and beginning to run all their workloads using Spark."
I am not able to understand what it means. Could someone please help me with examples for the above use case?
Spark, being the latest tool in the Hadoop ecosystem, has connectivity with earlier Hadoop tools. Hive was the most popular until recently. Most Hadoop platforms have data stored in Hive tables, which can be accessed using Hive as a SQL engine. However, Spark can do the same things.
So, the given statement means that you can connect to the Hive metastore (which contains information about existing tables, databases, their locations, schemas, file types, etc.) and then run similar queries on them from Spark, just as you would with Hive.
Below are two examples of what you can do with Spark once you have connected to the Hive metastore.
spark.sql("show databases")
spark.sql("select * from test_db.test_table")
I hope this answers your question.

Hadoop 3 and spark.sql: working with both HiveWarehouseSession and spark.sql

Previously I could work entirely within the spark.sql API to interact with both Hive tables and Spark DataFrames. I could query views registered with Spark or the Hive tables with the same API.
I'd like to confirm: is that no longer possible with Hadoop 3.1 and PySpark 2.3.2? To do any operation on a Hive table, must one use the HiveWarehouseSession API and not the spark.sql API? Is there any way to continue using the spark.sql API to interact with Hive, or will I have to refactor all my code?
hive = HiveWarehouseSession.session(spark).build()
hive.execute("arbitrary example query here")
spark.sql("arbitrary example query here")
It's confusing because the Spark documentation says
Connect to any data source the same way
and specifically gives Hive as an example, but then the Hortonworks Hadoop 3 documentation says
As a Spark developer, you execute queries to Hive using the JDBC-style HiveWarehouseSession API
These two statements are in direct contradiction.
The Hadoop documentation continues "You can use the Hive Warehouse Connector (HWC) API to access any type of table in the Hive catalog from Spark. When you use SparkSQL, standard Spark APIs access tables in the Spark catalog."
At least at present, spark.sql is no longer universal, correct? And I can no longer seamlessly interact with Hive tables using the same API?
Yep, correct. I'm using Spark 2.3.2, but I can no longer access Hive tables using the default Spark SQL API.
From HDP 3.0, the catalogs for Apache Hive and Apache Spark are separate and mutually exclusive.
As you mentioned, you have to use HiveWarehouseSession from the pyspark-llap library, as sketched below.
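For what it's worth, here is a hedged sketch of the two catalogs side by side, assuming an HDP 3.x cluster where pyspark_llap is installed and spark.sql.hive.hiveserver2.jdbc.url is configured (the table name is a placeholder):

from pyspark_llap import HiveWarehouseSession

# Hive catalog: accessed through the Hive Warehouse Connector.
hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show()
hive.executeQuery("select * from test_db.test_table").show()

# Spark catalog: plain spark.sql only sees Spark-managed tables and views on HDP 3.x.
spark.sql("show databases").show()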

How to set up metadata database for Spark SQL?

Hive has its own metastore and stores table, column, and partition information there.
If I do not want to use Hive, can we create a metadata store for Spark the same as Hive?
I want to query Spark SQL (not using the DataFrame API) like Hive (select, from, and where). Can we do that? If yes, which relational DB can we use for metadata storage?
Can we create a metadata store for Spark the same as Hive?
Spark does this for you and you don't have to use a separate installation of Hive or even just part of it (e.g. a Hive metastore).
Regardless of the installation of Apache Spark you use, Spark SQL uses a Hive metastore internally for the same purpose as Hive does (but the metastore is now part of Spark SQL).
If yes, which relational DB can we use for metadata storage?
Anything that Hive supports, e.g. Oracle, MySQL, or PostgreSQL. The configuration is pretty much the same as for a separate Hive installation (which is usually the case in such enterprisey setups).
You may want to read Hive Metastore.
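For example, here is a hedged PySpark sketch of pointing the embedded metastore at PostgreSQL instead of the default local Derby database; the JDBC URL, driver, user, and password are placeholders, and the same javax.jdo.option.* settings would more typically live in hive-site.xml:

from pyspark.sql import SparkSession

# Sketch only: every connection setting below is a placeholder.
spark = (SparkSession.builder
         .enableHiveSupport()
         .config("spark.hadoop.javax.jdo.option.ConnectionURL",
                 "jdbc:postgresql://metastore-host:5432/metastore")
         .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver")
         .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
         .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "hive")
         .getOrCreate())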
Spark is essentially a distributed computation system rather than a distributed storage system. Therefore, we mostly use Spark to do the computation work, which needs metadata from different storage systems.
However, Spark internally provides an InMemoryCatalog to store the metadata if it is not configured with Hive.
You can take a look at this for more information.
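A small sketch of that fallback, forcing the in-memory catalog explicitly (the table name is a placeholder): the table metadata disappears when the application ends, even though the data files stay under spark.sql.warehouse.dir.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.catalogImplementation", "in-memory")
         .getOrCreate())

spark.range(10).write.saveAsTable("numbers")   # metadata kept only in the InMemoryCatalog
print(spark.catalog.listTables())              # visible only within this application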

Spark and Metastore Relation

I was aware of the fact that the Hive metastore is used to store metadata of the tables that we create in Hive, but why does Spark require a metastore, and what is the default relation between the metastore and Spark?
Is the metastore used by Spark SQL, and if so, is it used to store DataFrame metadata?
Why does Spark by default check for metastore connectivity even though I am not using any SQL libraries?
Here is the explanation from the Spark 2.2.0 documentation:
When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse.
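A minimal sketch of overriding that default warehouse location (the path is only illustrative):

from pyspark.sql import SparkSession

# Sketch: control where managed-table data is written instead of the launch directory.
spark = (SparkSession.builder
         .config("spark.sql.warehouse.dir", "/tmp/my-spark-warehouse")
         .getOrCreate())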

Does Spark SQL use Hive Metastore?

I am developing a Spark SQL application and I've got a few questions:
I read that Spark-SQL uses a Hive metastore under the covers. Is this true? I'm talking about a pure Spark-SQL application that does not explicitly connect to any Hive installation.
I am starting a Spark-SQL application, and have no need to use Hive. Is there any reason to use Hive? From what I understand Spark-SQL is much faster than Hive; so, I don't see any reason to use Hive. But am I correct?
I read that Spark-SQL uses a Hive metastore under the covers. Is this true? I'm talking about a pure Spark-SQL application that does not explicitly connect to any Hive installation.
Spark SQL does not use a Hive metastore under the covers by default (it defaults to in-memory, non-Hive catalogs, unless you're in spark-shell, which does the opposite).
The default external catalog implementation is controlled by the internal spark.sql.catalogImplementation property and can be one of two possible values: hive and in-memory.
Use the SparkSession to know what catalog is in use.
scala> :type spark
org.apache.spark.sql.SparkSession
scala> spark.version
res0: String = 2.4.0
scala> :type spark.sharedState.externalCatalog
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener
scala> println(spark.sharedState.externalCatalog.unwrapped)
org.apache.spark.sql.hive.HiveExternalCatalog@49d5b651
Please note that I used spark-shell, which starts a Hive-aware SparkSession, so I had to start it with --conf spark.sql.catalogImplementation=in-memory to turn that off.
I am starting a Spark-SQL application, and have no need to use Hive. Is there any reason to use Hive? From what I understand Spark-SQL is much faster than Hive; so, I don't see any reason to use Hive.
That's a very interesting question and can have different answers (some even primarily opinion-based so we have to be extra careful and follow the StackOverflow rules).
Is there any reason to use Hive?
No.
But... if you want to use the very recent feature of Spark 2.2, i.e. the cost-based optimizer, you may want to consider Hive, as computing cost statistics with ANALYZE TABLE can be fairly expensive, so doing it once for tables that are used over and over again across different Spark application runs can give a performance boost.
Please note that Spark SQL without Hive can do this too, but it has a limitation: the local default metastore only supports single-user access, and reusing the metadata across Spark applications submitted at the same time won't work.
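For example, a sketch of collecting those statistics once (the table and column names are placeholders); with a shared, persistent Hive metastore the statistics survive across application runs:

spark.sql("ANALYZE TABLE test_db.test_table COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE test_db.test_table COMPUTE STATISTICS FOR COLUMNS id, name")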
I don't see any reason to use Hive.
I wrote a blog post, Why is Spark SQL so obsessed with Hive?! (after just a single day with Hive), where I asked a similar question, and to my surprise it's only now (almost a year after I posted the blog post on Apr 9, 2016) that I think I may have understood why the concept of a Hive metastore is so important, especially in multi-user Spark notebook environments.
Hive itself is just a data warehouse on HDFS, so it is not much use if you've got Spark SQL, but there are still some concepts Hive has done fairly well that are of much use to Spark SQL (until it fully stands on its own legs with a Hive-like metastore).
It will connect to a Hive Metastore or instantiate one if none is found when you initialize a HiveContext() object or a spark-shell.
The main reason to use Hive is if you are reading HDFS data in from Hive's managed tables or if you want the convenience of selecting from external tables.
Remember that Hive is simply a lens for reading and writing HDFS files and not an execution engine in and of itself.
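As a hedged illustration of that convenience (assuming Hive support is enabled; the table name and HDFS location are placeholders), an external table defined once in the metastore can then be queried without touching the files directly:

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS logs (line STRING)
    LOCATION 'hdfs:///data/logs'
""")
spark.sql("select count(*) from logs").show()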
