How to set hive.metastore.warehouse.dir in HiveContext? - apache-spark

I'm trying to write a unit test case that relies on DataFrame.saveAsTable() (since it is backed by a file system). I point the hive warehouse parameter to a local disk location:
sql.sql(s"SET hive.metastore.warehouse.dir=file:///home/myusername/hive/warehouse")
By default, the metastore should run in embedded mode, so no external database is required.
But HiveContext seems to ignore this setting, since I still get this error when calling saveAsTable():
MetaException(message:file:/user/hive/warehouse/users is not a directory or unable to create one)
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/user/hive/warehouse/users is not a directory or unable to create one)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:619)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:172)
at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:224)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:54)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:54)
at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:64)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1099)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1099)
at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1121)
at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1071)
at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1037)
This is quite annoying, why is it still happening and how to fix it?

According to http://spark.apache.org/docs/latest/sql-programming-guide.html#sql
Note that the hive.metastore.warehouse.dir property in hive-site.xml
is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir
to specify the default location of database in warehouse.

tl;dr Set hive.metastore.warehouse.dir while creating a SQLContext (or SparkSession).
The location of the default database for the Hive metastore warehouse is /user/hive/warehouse by default. It used to be set using hive.metastore.warehouse.dir Hive-specific configuration property (in a Hadoop configuration).
It's been a while since you asked this question (these are Spark 2.3 days), but that part has not changed: once you can call the sql method of SQLContext (or SparkSession these days), it is too late to change where Spark creates the metastore database, because the underlying infrastructure has already been set up. The warehouse location has to be set before the HiveContext / SQLContext / SparkSession is initialized.
You should set hive.metastore.warehouse.dir while creating SparkSession (or SQLContext before Spark SQL 2.0) using config and (very important) enable the Hive support using enableHiveSupport.
config(key: String, value: String): Builder Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration.
enableHiveSupport(): Builder Enables Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.
You could use hive-site.xml configuration file or spark.hadoop prefix, but I'm digressing (and it strongly depends on the current configuration).
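As a minimal pyspark sketch of the builder approach described above: it uses spark.sql.warehouse.dir, following the documentation note quoted earlier, and the helper names, the application name and the temporary directory are assumptions made for illustration only. The session is created inside a function so that the configuration helper can be checked without a Spark installation.

```python
import tempfile


def warehouse_conf(warehouse_dir):
    """Return the builder options that point the warehouse at warehouse_dir."""
    # spark.sql.warehouse.dir is the Spark 2.0+ replacement for
    # hive.metastore.warehouse.dir, per the documentation quoted above.
    return {"spark.sql.warehouse.dir": warehouse_dir}


def build_session(conf, app_name="warehouse-test"):
    """Create a Hive-enabled SparkSession with conf applied at build time."""
    # Imported lazily so warehouse_conf works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).master("local[1]")
    for key, value in conf.items():
        builder = builder.config(key, value)
    # The options must be set before getOrCreate(); afterwards it is too late.
    return builder.enableHiveSupport().getOrCreate()


# Hypothetical scratch directory for a unit test.
conf = warehouse_conf(tempfile.mkdtemp(prefix="warehouse-"))
print(conf)
```

In a test, spark = build_session(conf) followed by spark.range(3).write.saveAsTable("users") should then write under the temporary directory, provided no session was already running in the same JVM.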

Another option is to just create a new database, then USE new_database, and then create the table. The warehouse will be created under the folder from which you ran spark-sql.

I faced exactly the same issue. I was running the spark-submit command in a shell action via Oozie.
Setting the warehouse directory while creating the SparkSession didn't work for me.
All you need to do is pass hive-site.xml to the spark-submit command using the property below:
--files ${location_of_hive-site.xml}
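A small sketch of how that command line could be assembled from a script. The application name, the hive-site.xml location and the extra arguments are hypothetical; only the --files option comes from the answer above.

```python
import shlex


def spark_submit_command(app, hive_site_path, extra_args=()):
    """Build a spark-submit argument list that ships hive-site.xml with the job."""
    return ["spark-submit", "--files", hive_site_path, *extra_args, app]


cmd = spark_submit_command(
    "my_job.py",                      # hypothetical application
    "/etc/hive/conf/hive-site.xml",   # hypothetical location of hive-site.xml
    extra_args=["--master", "yarn"],  # hypothetical extra options
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run(cmd, check=True) on a machine where spark-submit is on the PATH.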

Related

How can I show Hive tables using pyspark

Hello, I created a Spark HDInsight cluster on Azure and I'm trying to read Hive tables with pyspark, but the problem is that it shows me only the default database.
Does anyone have an idea?
If you are using HDInsight 4.0, Spark and Hive no longer share metadata.
By default you will not see Hive tables from pyspark; it is a problem that I described in this post: How save/update table in hive, to be readbale on spark.
But, anyway, here are things you can try:
If you only want to test on the head node, you can edit hive-site.xml and change the value of the property "metastore.catalog.default" to hive, then open pyspark from the command line.
If you want to apply the change to all cluster nodes, it needs to be made in Ambari.
Log in to Ambari as admin
Go to spark2 > Configs > hive-site-override
Again, set the property "metastore.catalog.default" to hive
Restart all required services from the Ambari panel
These changes make the Hive metastore catalog the default.
You can now see Hive databases and tables, but depending on the table structure, you may not see the table data properly.
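For the head-node test, the property change could be scripted roughly as follows. This is only a sketch: the sample XML is a hypothetical stand-in for a real hive-site.xml, its starting value "spark" is an assumption, and set_property is a helper invented here, not part of any Spark or Hive tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the contents of hive-site.xml.
SAMPLE = (
    "<configuration>"
    "<property><name>metastore.catalog.default</name><value>spark</value></property>"
    "</configuration>"
)


def set_property(hive_site_xml, name, value):
    """Return hive_site_xml with property `name` set to `value` (added if absent)."""
    root = ET.fromstring(hive_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # Property not present yet: append a new <property> element.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


updated = set_property(SAMPLE, "metastore.catalog.default", "hive")
print(updated)
```

Back up the original file before writing the result over it, and restart pyspark afterwards so the new value is picked up.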
If you have created tables in other databases, try show tables from database_name, replacing database_name with the actual name.
You are missing the details of the Hive server in your SparkSession. If you haven't added any, Spark will create and use a default database to run Spark SQL.
If you've added spark.sql.warehouse.dir and spark.hadoop.hive.metastore.uris to the Spark default conf file, then add enableHiveSupport() while creating the SparkSession.
Otherwise, add the configuration details while creating the SparkSession:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .enableHiveSupport()
    .getOrCreate())

Where is table data stored in Spark?

Hi, I'm trying to find out where Spark SQL stores table metadata. If it is not in the Hive metastore by default, then where is it stored?
Here is the explanation from the Spark 2.2.0 documentation:
When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse.
Here is the link:
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
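To see this on a local machine, one could look for the two directories named in the quote. The sketch below assumes no hive-site.xml is present and that the application was started from start_dir; both helper names are made up for illustration.

```python
import os


def default_locations(start_dir):
    """Paths Spark is documented to create when hive-site.xml is not configured."""
    return {
        "metastore_db": os.path.join(start_dir, "metastore_db"),  # table metadata (embedded metastore)
        "warehouse": os.path.join(start_dir, "spark-warehouse"),  # managed table data
    }


def configured_warehouse():
    """Ask a running session where it will put managed tables (requires pyspark)."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    return spark.conf.get("spark.sql.warehouse.dir")


for name, path in default_locations(os.getcwd()).items():
    print(name, path, "exists" if os.path.isdir(path) else "missing")
```

Calling configured_warehouse() in an environment with pyspark installed shows the value the session actually uses, which is the more reliable check when the setting has been overridden.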

Spark SQL cannot access Spark Thrift Server

I cannot configure Spark SQL so that I can access a Hive table in Spark Thrift Server (without using JDBC, but natively from Spark).
I use a single configuration file, conf/hive-site.xml, for both Spark Thrift Server and Spark SQL. I have the javax.jdo.option.ConnectionURL property set to jdbc:derby:;databaseName=/home/user/spark-2.4.0-bin-hadoop2.7/metastore_db;create=true. I also set the spark.sql.warehouse.dir property to an absolute path pointing to the spark-warehouse directory. I run the Thrift server with ./start-thriftserver.sh and can see that an embedded Derby database is created in the metastore_db directory. I can connect with beeline, create a table, and see the spark-warehouse directory created with a subdirectory for the table. So at this stage it's fine.
I launch the pyspark shell with Hive support enabled, ./bin/pyspark --conf spark.sql.catalogImplementation=hive, and try to access the Hive table with:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
hc.sql('show tables')
I got errors like:
ERROR XJ040: Failed to start database
'/home/user/spark-2.4.0-bin-hadoop2.7/metastore_db' with class loader
sun.misc.Launcher$AppClassLoader@1b4fb997
ERROR XSDB6: Another instance of Derby may have already booted the
database /home/user/spark-2.4.0-bin-hadoop2.7/metastore_db
pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException:
java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Apparently Spark is trying to create a new Derby database instead of using the metastore I put in the config file. If I stop the Thrift Server and run only Spark, everything is fine. How can I fix it?
Is an embedded Derby metastore database fine for having both the Thrift Server and Spark access one Hive, or do I need to use e.g. MySQL? I don't have a cluster and do everything locally.
An embedded Derby metastore database is fine for local use by a single process, but embedded Derby lets only one process boot the database at a time, which is what the XSDB6 error above reports. For sharing the metastore between the Thrift Server and Spark, or for a production environment, it is recommended to use another metastore database.
Yes, you can definitely use MySQL as the metastore. For this, you have to make an entry in hive-site.xml.
You can follow the configuration guide at Use MySQL for the Hive Metastore for the exact details.
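As a rough illustration of what such an entry might look like, the sketch below generates a hive-site.xml fragment with the standard javax.jdo.option.* connection properties. The host, database name, user and password are placeholders, and the driver class depends on the MySQL connector jar in use; the linked guide remains the reference for exact values.

```python
import xml.etree.ElementTree as ET


def metastore_site_xml(jdbc_url, driver, user, password):
    """Render a hive-site.xml document with JDBC metastore connection settings."""
    props = {
        "javax.jdo.option.ConnectionURL": jdbc_url,
        "javax.jdo.option.ConnectionDriverName": driver,
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = metastore_site_xml(
    "jdbc:mysql://localhost:3306/metastore",  # placeholder host and database
    "com.mysql.jdbc.Driver",                  # assumption: depends on the connector jar
    "hiveuser",                               # placeholder user
    "hivepassword",                           # placeholder password
)
print(xml_text)
```

With the same file in conf/ for both the Thrift Server and pyspark, the two processes would talk to one shared database instead of competing for the embedded Derby files.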

Spark and Metastore Relation

I am aware that the Hive Metastore is used to store metadata of the tables that we create in Hive, but why does Spark require a metastore? What is the default relation between the metastore and Spark?
Is the metastore used by Spark SQL, and if so, is it to store DataFrame metadata?
Why does Spark by default check for metastore connectivity even though I am not using any SQL libraries?
Here is the explanation from the Spark 2.2.0 documentation:
When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse.

Apache spark installation and db_metastore

I am a beginner in Spark.
I installed Java and spark-1.6.1-bin-hadoop2.6.tgz (I have not installed Hadoop) and, without changing any configuration in the conf directory, ran spark-shell.
In the directory where Spark is installed, I see a metastore_db directory created with a tmp folder inside it.
Why is this metastore_db created, and where is this configured?
Also, I see sqlContext being created after running spark-shell. What does this sqlContext represent?
When running spark-shell, a SparkContext and a SQLContext are created. SQLContext is built on top of a SparkContext and is the entry point to Spark SQL. It has a method to execute SQL queries (method sql) and methods to create DataFrames.
metastore_db is the directory of the embedded Hive metastore. Spark supports Apache Hive queries via HiveContext. If there is no hive-site.xml configured, Spark creates metastore_db in the current directory; see the documentation for details.
However, it would be good to download Spark 2.0. There you get a unified entry point to Spark, named SparkSession. This class allows you to read data from many sources, create Datasets, etc.

Resources