How to specify which hive metastore to connect to? - apache-spark

Going back a few versions of Spark it used to be required to put the
hive-site.xml
in the $SPARK_HOME/conf directory. Is that still the case?
The motivation for this question: we are unable to see Hive tables that are defined in the metastore instance whose hive-site.xml we did copy into the conf dir.

I have verified that the hive-site.xml is still used. It is picked up from Spark's classpath, which may be set up via
export SPARK_CLASSPATH=/path/to/conf/dir
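For reference, a minimal hive-site.xml that points Spark at a remote metastore usually only needs the metastore URI. This is a sketch; the host and port are placeholders:
<property>
<name>hive.metastore.uris</name>
<value>thrift://metastore-host:9083</value>
<description>Thrift URI of the remote Hive metastore</description>
</property>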

Related

Where is table data stored in Spark?

Hi, I'm trying to find out where Spark SQL stores table metadata. If it is not in the Hive metastore by default, then where is it stored?
Here is the explanation from the Spark 2.2.0 documentation:
When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse.
Here is the link:
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
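As a sketch (assuming a Spark 2.x build with Hive support on the classpath; paths are placeholders), the warehouse location can be set explicitly when the session is built. Without a hive-site.xml on the classpath, a local Derby-backed metastore_db directory is created in the working directory on first use:

import org.apache.spark.sql.SparkSession

// Placeholder warehouse path; adjust for your environment.
val spark = SparkSession.builder()
  .appName("warehouse-location-example")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")
  .enableHiveSupport()
  .getOrCreate()

// Creating a table triggers metastore initialization: with no hive-site.xml
// on the classpath this creates a local metastore_db directory.
spark.sql("CREATE TABLE IF NOT EXISTS demo_table (id INT) USING parquet")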

Sparklyr not connecting to my hive warehouse

I'm doing a very silly thing and trying to install a Yarn/Hive/Spark/R platform from scratch, not using Hortonworks or Cloudera. I've gotten many pieces figured out but am stuck trying to get my sparklyr to connect to my Hive warehouse.
I am using RStudio on one machine and connecting to a yarn-client on a separate cluster. I've put hive-site.xml pretty much everywhere: the local $SPARK_HOME/conf and each of the Hadoop nodes' $SPARK_HOME/conf and $HADOOP_CONF_DIR. In hive-site.xml I've included this parameter:
<property>
<name>spark.sql.warehouse.dir</name>
<value>hdfs://<driver node>/user/hive/warehouse/</value>
<description>The location of the hive warehouse</description>
</property>
I feel that should make it pretty clear that I'm trying to use Hive, but when I run this code:
DBI::dbGetQuery(sc, "CREATE DATABASE test")
DBI::dbGetQuery(sc, "use test")
iris_spark_table <- copy_to(sc, iris, overwrite = TRUE)
sdf_copy_to(sc, iris_spark_table)
DBI::dbGetQuery(sc, "create table iris_hive as SELECT * FROM iris_spark_table")
I get this error:
org.apache.hadoop.hive.ql.metadata.HiveException:
java.io.IOException:
Mkdirs failed to create file:/<my-r-code's-working-dir>/spark-warehouse/test.db/iris_hive/.hive-staging_hive_2018-08-05_14-18-58_646_6160231583951115949-1/-ext-10000/_temporary/0/_temporary/attempt_20180805141859_0013_m_000000_3
(exists=false, cwd=file:/tmp/hadoop-hadoop/nm-local-dir/usercache/dzafar/appcache/application_1533357216333_0015/container_1533357216333_0015_01_000002)
What am I missing??? Thanks in advance!!!
First of all, Spark-specific properties should be placed in Spark configuration files. That means you should put
spark.sql.warehouse.dir
in $SPARK_HOME/conf/spark-defaults.conf rather than in hive-site.xml.
Additionally, you might have a problem if hdfs-site.xml is not present on the search path.
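As a sketch, the relevant spark-defaults.conf entry would look like this (reusing the warehouse path from the question):

spark.sql.warehouse.dir hdfs://<driver node>/user/hive/warehouse/

As for hdfs-site.xml, a common approach is to point HADOOP_CONF_DIR at the directory containing the cluster's core-site.xml and hdfs-site.xml so that Spark can resolve the hdfs:// paths.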

Spark and Metastore Relation

I am aware that the Hive metastore is used to store metadata for the tables that we create in Hive, but why does Spark require a metastore, and what is the default relation between the metastore and Spark?
Is the metastore used by Spark SQL, and if so, is it used to store DataFrame metadata?
Why does Spark check for metastore connectivity by default even though I am not using any SQL libraries?
Here is the explanation from the Spark 2.2.0 documentation:
When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse.
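To make the relationship concrete, here is a small sketch (local mode, names are illustrative): Spark SQL always keeps table metadata in a catalog. Without Hive support it is an in-memory catalog that lives only for the session; with enableHiveSupport the catalog is backed by a Hive metastore (the embedded Derby one by default).

import org.apache.spark.sql.SparkSession

// Session built WITHOUT enableHiveSupport: Spark SQL uses its in-memory
// catalog, so no Hive metastore is needed for plain DataFrame work.
val spark = SparkSession.builder()
  .appName("catalog-implementation-check")
  .master("local[*]")
  .getOrCreate()

// Prints "in-memory" here; it would print "hive" if Hive support were enabled.
println(spark.sparkContext.getConf.get("spark.sql.catalogImplementation", "in-memory"))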

How to set hive.metastore.warehouse.dir in HiveContext?

I'm trying to write a unit test case that relies on DataFrame.saveAsTable() (since it is backed by a file system). I point the hive warehouse parameter to a local disk location:
sql.sql(s"SET hive.metastore.warehouse.dir=file:///home/myusername/hive/warehouse")
By default, the embedded mode of the metastore should be enabled, so it doesn't require an external database.
But HiveContext seems to be ignoring this configuration, since I still get this error when calling saveAsTable():
MetaException(message:file:/user/hive/warehouse/users is not a directory or unable to create one)
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/user/hive/warehouse/users is not a directory or unable to create one)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:619)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:172)
at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:224)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:54)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:54)
at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:64)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1099)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1099)
at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1121)
at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1071)
at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1037)
This is quite annoying. Why is it still happening, and how do I fix it?
According to http://spark.apache.org/docs/latest/sql-programming-guide.html#sql
Note that the hive.metastore.warehouse.dir property in hive-site.xml
is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir
to specify the default location of database in warehouse.
tl;dr Set hive.metastore.warehouse.dir while creating a SQLContext (or SparkSession).
The location of the default database for the Hive metastore warehouse is /user/hive/warehouse by default. It used to be set using the Hive-specific hive.metastore.warehouse.dir configuration property (in a Hadoop configuration).
It's been a while since you asked this question (these are Spark 2.3 days), but that part has not changed since: if you use the sql method of SQLContext (or SparkSession these days), it is simply too late to change where Spark creates the metastore database, because the underlying infrastructure has already been set up (so that you can use the SQLContext in the first place). The warehouse location has to be set before the HiveContext / SQLContext / SparkSession is initialized.
You should set hive.metastore.warehouse.dir while creating the SparkSession (or SQLContext before Spark SQL 2.0) using config and, very importantly, enable Hive support using enableHiveSupport, as sketched below.
config(key: String, value: String): Builder Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration.
enableHiveSupport(): Builder Enables Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.
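A minimal sketch of that approach (Spark 2.x, reusing the local path from the question):

import org.apache.spark.sql.SparkSession

val warehouseDir = "file:///home/myusername/hive/warehouse"

val spark = SparkSession.builder()
  .appName("warehouse-dir-test")
  .master("local[*]")
  // spark.sql.warehouse.dir supersedes hive.metastore.warehouse.dir since 2.0,
  // but setting both here does no harm.
  .config("spark.sql.warehouse.dir", warehouseDir)
  .config("hive.metastore.warehouse.dir", warehouseDir)
  .enableHiveSupport()
  .getOrCreate()

// The managed table now lands under the configured warehouse directory.
spark.range(3).write.saveAsTable("users")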
You could also use the hive-site.xml configuration file or the spark.hadoop prefix, but I'm digressing (and it strongly depends on the current configuration).
Another option is to just create a new database, then USE new_database, and then create the table. The warehouse will be created under the folder from which you ran Spark SQL.
I faced exactly the same issue. I was running the spark-submit command in a shell action via Oozie.
Setting the warehouse directory while creating the SparkSession didn't work for me.
All you need to do is pass hive-site.xml to the spark-submit command using the following option:
--files ${location_of_hive-site.xml}
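For example (application class and jar names are placeholders):

spark-submit \
  --master yarn \
  --files /path/to/hive-site.xml \
  --class com.example.MyApp \
  my-app.jar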

How can I access a CFS URL from a remote non-DSE (DataStax) node?

I am trying to do the following from my program:
val file = sc.textFile("cfs://ip/.....")
but I get a java.io.IOException: No FileSystem for scheme: cfs exception.
How should I modify core-site.xml, and where? Should it be on the DSE nodes, or should I add it as a resource in my jar?
I use Maven to build my jar and execute the jobs remotely from a non-DSE node which does not have Cassandra, Spark, or anything similar installed. Other types of flows that don't involve CFS files work fine, so the jar itself is okay so far.
Thanks!
There is some info in the middle of this page about Spark using Hadoop for some operations, such as CFS access: http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkCassProps.html
I heard about a problem using Hive from a non-DSE node that was solved by adding a property to the core-site.xml file. This is really a long shot since it's Spark, but if you're willing to experiment, try adding the IP address of the remote machine to the core-site.xml file:
<property>
<name>cassandra.host</name>
<value>192.168.2.100</value>
</property>
Find the core-site.xml in /etc/dse/hadoop/conf/ or install_location/resources/hadoop/conf/, depending on the type of installation.
I assume you started the DSE cluster in hadoop and spark mode: http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkStart.html
Been quite some time.
The integration is done the same way as any integration of a Hadoop client with a compatible Hadoop filesystem.
Copy core-site.xml (appending dse-core-default.xml to it) along with dse.yaml and cassandra.yaml, and then set up the proper dependencies on the classpath, e.g. dse.jar, cassandra-all, etc.
Note: this is not officially supported, so it is better to use another approach.
