Spark and Metastore Relation - apache-spark

I was aware that the Hive Metastore is used to store the metadata of the tables we create in Hive, but why does Spark require a metastore, and what is the default relation between the metastore and Spark?
Is the metastore used by Spark SQL, and if so, is it used to store DataFrame metadata?
Why does Spark by default check for metastore connectivity even though I am not using any SQL libraries?

Here is the explanation from the Spark 2.2.0 documentation:
When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse.
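A minimal sketch of that default behaviour (run as a standalone app or in spark-shell; the table name demo_tbl is made up):
import org.apache.spark.sql.SparkSession
// No hive-site.xml on the classpath: Spark falls back to an embedded Derby
// metastore and a local warehouse directory.
val spark = SparkSession.builder()
  .appName("default-metastore-demo")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()
spark.sql("CREATE TABLE demo_tbl AS SELECT 1 AS id")
// After this, ./metastore_db (Derby files) and ./spark-warehouse/demo_tbl appear
// in the directory the application was started from.
println(spark.conf.get("spark.sql.warehouse.dir"))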

Related

Spark sql can't find table in hive in HDP

I use HDP 3.1 and I added Spark2, Hive and the other services which are needed. I turned off the ACID feature in Hive. The Spark job can't find the table in Hive, but the table exists in Hive. The exception looks like:
org.apache.spark.sql.AnalysisException: Table or view not found
There is a hive-site.xml in Spark's conf folder. It is automatically created by HDP, but it isn't the same as the file in Hive's conf folder. And from the log, Spark can get the thrift URI of Hive correctly.
I use Spark SQL and created one Hive table in spark-shell. I found the table was created in the folder specified by spark.sql.warehouse.dir. I changed its value to the value of hive.metastore.warehouse.dir, but the problem is still there.
I also enabled hive support when creating spark session.
val ss = SparkSession.builder().appName("统计").enableHiveSupport().getOrCreate()
There is metastore.catalog.default in the hive-site.xml in Spark's conf folder. Its value is spark; it should be changed to hive. And by the way, we should disable the ACID feature of Hive.
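If editing the HDP-generated hive-site.xml is not an option, the same Hive property can often be overridden from the Spark side through the spark.hadoop. prefix. This is only a sketch and assumes the property is honoured that way in your HDP version:
import org.apache.spark.sql.SparkSession
// Assumption: metastore.catalog.default is a Hive/Hadoop property, so it is
// forwarded via Spark's spark.hadoop.* prefix; verify against your HDP release.
val spark = SparkSession.builder()
  .appName("read-hive-catalog")
  .config("spark.hadoop.metastore.catalog.default", "hive")
  .enableHiveSupport()
  .getOrCreate()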
You can use the Hive Warehouse Connector and enable LLAP in the Hive configuration.
In HDP 3.0 and later, Spark and Hive use independent catalogs for accessing SparkSQL or Hive tables on the same or different platforms.
By default, Spark only reads the Spark catalog. This means Spark applications that attempt to read from or write to tables created using the Hive CLI will fail with a table-not-found exception.
Workarounds:
Create the table in both the Hive CLI and Spark SQL
Use the Hive Warehouse Connector (a rough usage sketch follows below)
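This is only an outline of the Hive Warehouse Connector pattern from the HDP 3 documentation; the exact package name, connector artifact and LLAP/HiveServer2 settings depend on your HDP version, so verify before relying on it:
import com.hortonworks.hwc.HiveWarehouseSession
// Builds a session that reads Hive 3 managed tables through HiveServer2/LLAP
// instead of Spark's own catalog (some_hive_db.some_table is a placeholder).
val hive = HiveWarehouseSession.session(spark).build()
val df = hive.executeQuery("SELECT * FROM some_hive_db.some_table")
df.show()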

Where is table data stored in Spark?

Hi, I'm trying to find out where Spark SQL stores the table metadata. If it is not stored in the Hive metastore by default, then where is it stored?
Here is the explanation from the Spark 2.2.0 documentation:
When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse.
Here is the link:
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html

Where is Spark table created if Hive support is not enabled?

Where is the table created when I use the SQL syntax for creating a table in Spark 2.x?
CREATE TABLE ... AS
Does Spark store it internally somewhere, or does there always have to be Impala or Hive support for Spark tables to work?
By default (when no connection is specified, e.g. in hive-site.xml), Spark will save the table metadata in a local metastore backed by a Derby database.
The table data itself should be stored in the spark-warehouse directory under the directory you started Spark from, or wherever the spark.sql.warehouse.dir property points.
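A small sketch of checking this yourself (table and column values are made up); DESCRIBE EXTENDED prints the resolved Location, which ends up under spark-warehouse unless spark.sql.warehouse.dir says otherwise:
spark.sql("CREATE TABLE cities AS SELECT 'Oslo' AS name, 634293 AS population")
spark.sql("DESCRIBE EXTENDED cities").show(100, truncate = false)
// Look for the Location row, e.g. file:/path/where/you/started/spark-warehouse/cities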

Where is Spark catalog metadata stored?

I have been trying to get an accurate view of how Spark's catalog API stores the metadata.
I have found some resources, but no answer:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-Catalog.html
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-CatalogImpl.html
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/catalog/Catalog.html
I see some tutorials that take the existence of a Hive metastore for granted.
Is a Hive metastore potentially included with the Spark distribution?
A Spark cluster can be short-lived, but a Hive metastore would obviously need to be long-lived.
Apart from the catalog feature, the partitioning and sorting features when writing out a DataFrame also seem to depend on Hive... So "everyone" seems to take Hive for granted when talking about key Spark features for persisting a DataFrame.
Spark becomes aware of the Hive metastore when it is provided with a hive-site.xml, which is typically placed under $SPARK_HOME/conf. Whenever the enableHiveSupport() method is used while creating the SparkSession, Spark finds out where and how to connect to the Hive metastore. Spark therefore does not explicitly store Hive settings itself.
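A quick way to see this from a running session (just a sketch; the output noted in the comments is illustrative):
// "hive" when enableHiveSupport() took effect, "in-memory" otherwise.
println(spark.sparkContext.getConf.get("spark.sql.catalogImplementation", "in-memory"))
// The Catalog API works against whichever catalog is active.
spark.catalog.listDatabases().show(truncate = false)
spark.catalog.listTables("default").show(truncate = false)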

How to set hive.metastore.warehouse.dir in HiveContext?

I'm trying to write a unit test case that relies on DataFrame.saveAsTable() (since it is backed by a file system). I point the hive warehouse parameter to a local disk location:
sql.sql(s"SET hive.metastore.warehouse.dir=file:///home/myusername/hive/warehouse")
By default, the metastore runs in embedded mode, so it shouldn't require an external database.
But HiveContext seems to be ignoring this configuration, since I still get this error when calling saveAsTable():
MetaException(message:file:/user/hive/warehouse/users is not a directory or unable to create one)
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/user/hive/warehouse/users is not a directory or unable to create one)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:619)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:172)
at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:224)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:54)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:54)
at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:64)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1099)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1099)
at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1121)
at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1071)
at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1037)
This is quite annoying. Why is it still happening, and how can I fix it?
According to http://spark.apache.org/docs/latest/sql-programming-guide.html#sql
Note that the hive.metastore.warehouse.dir property in hive-site.xml
is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir
to specify the default location of database in warehouse.
tl;dr Set hive.metastore.warehouse.dir while creating a SQLContext (or SparkSession).
The location of the default database for the Hive metastore warehouse is /user/hive/warehouse by default. It used to be set using hive.metastore.warehouse.dir Hive-specific configuration property (in a Hadoop configuration).
It's been a while since you asked this question (it's Spark 2.3 days now), but that part has not changed since: if you use the sql method of SQLContext (or SparkSession these days), it's simply too late to change where Spark creates the metastore database, because the underlying infrastructure has already been set up (so that you can use the SQLContext in the first place). The warehouse location has to be set up before the HiveContext / SQLContext / SparkSession initialization.
You should set hive.metastore.warehouse.dir while creating the SparkSession (or SQLContext before Spark SQL 2.0) using config, and (very important) enable Hive support using enableHiveSupport.
config(key: String, value: String): Builder Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration.
enableHiveSupport(): Builder Enables Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.
You could also use a hive-site.xml configuration file or the spark.hadoop prefix, but I'm digressing (and it strongly depends on your current configuration).
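Something along these lines (the paths, app name and DataFrame are illustrative; spark.sql.warehouse.dir is the property to prefer on Spark 2.0+):
import org.apache.spark.sql.SparkSession
// Set the warehouse location BEFORE the session (and thus the metastore) is created.
val warehousePath = "file:///home/myusername/hive/warehouse"
val spark = SparkSession.builder()
  .appName("saveAsTable-unit-test")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", warehousePath)       // Spark 2.0+ property
  .config("hive.metastore.warehouse.dir", warehousePath)  // pre-2.0 / Hive-side property
  .enableHiveSupport()
  .getOrCreate()
val df = spark.range(5).toDF("id")
df.write.saveAsTable("users")  // now lands under warehousePath instead of /user/hive/warehouse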
Another option is to just create a new database, then USE new_database, and then create the table. The warehouse will be created under the folder from which you ran the Spark SQL job.
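For instance (database and table names are made up):
spark.sql("CREATE DATABASE new_database")
spark.sql("USE new_database")
spark.sql("CREATE TABLE events AS SELECT 1 AS id")
// Data ends up under <current working dir>/spark-warehouse/new_database.db/events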
I faced exactly the same issue. I was running the spark-submit command in a shell action via Oozie.
Setting the warehouse directory while creating the SparkSession didn't work for me.
All you need to do is pass the hive-site.xml to the spark-submit command using the property below:
--files ${location_of_hive-site.xml}
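For example (the class, jar and hive-site.xml path are placeholders for your own job):
spark-submit \
  --class com.example.MySparkJob \
  --master yarn \
  --deploy-mode cluster \
  --files /etc/hive/conf/hive-site.xml \
  my-spark-job.jar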
