Setting MySQL as the metastore for Spark's built-in Hive - apache-spark

I have a Scala sbt project using Spark. I need to create multiple HiveContexts, which is not allowed by the built-in Derby metastore that Spark's Hive support uses by default. Can someone help me set up MySQL as the metastore instead of Derby, the default database? I don't have Hive or Spark actually installed; I use the sbt dependencies for Spark and Hive.

Copy the hive-site.xml file into Spark's conf directory and set the following properties in it:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
  <description>password to use against metastore database</description>
</property>
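In a project that uses Spark only as an sbt dependency (no installed conf directory), the same properties can also be set programmatically. A minimal sketch, assuming Spark 1.x's HiveContext and a mysql-connector-java dependency on the classpath:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// build.sbt also needs the MySQL JDBC driver, e.g.:
//   libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.49"
val conf = new SparkConf().setAppName("mysql-metastore-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)

// Point the metastore at MySQL instead of the default embedded Derby,
// before the first metastore access.
hiveContext.setConf("javax.jdo.option.ConnectionURL",
  "jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true")
hiveContext.setConf("javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver")
hiveContext.setConf("javax.jdo.option.ConnectionUserName", "hive")
hiveContext.setConf("javax.jdo.option.ConnectionPassword", "hive")
Alternatively, a hive-site.xml placed on the application classpath (e.g. under src/main/resources) should also be picked up by Spark's Hive support.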

Shakti
You need to have the conf files on the class path. I'm using Hadoop, Hive, and Spark with IntelliJ. In IntelliJ I have file:/usr/local/spark/conf/, file:/usr/local/hadoop/etc/hadoop/, and file:/usr/local/hive/conf/ on my class path. You can use the following to print your runtime class path:
val cl = ClassLoader.getSystemClassLoader
// Note: this cast works on Java 8, where the system class loader is a URLClassLoader; it fails on Java 9+.
cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println)
I hope this helps if you haven't already found a fix.
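If you build with sbt rather than relying on IntelliJ's classpath settings, one way to get the conf directories onto the runtime classpath is unmanagedClasspath in build.sbt; a sketch, with the paths as assumptions to adjust for your own installs:
// Add the Spark/Hive conf directories to the runtime classpath
unmanagedClasspath in Runtime += Attributed.blank(file("/usr/local/spark/conf"))
unmanagedClasspath in Runtime += Attributed.blank(file("/usr/local/hive/conf"))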

Related

hive on spark :org.apache.hadoop.hive.ql.metadata.HiveException

Buddy!
I hit a problem while using Hive (version 3.1.2) on Spark (version 3.2.1) on my Mac (Catalina 10.15.7). My Hadoop and Hive run on my Mac in local mode and both work well (I can insert records into a Hive table and select them out when I set hive.execution.engine=mr).
After I made the configuration listed below in Hive's hive-site.xml:
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>spark.home</name>
  <value>/Users/admin/spark/spark-3.2.1-bin-hadoop3.2/</value>
</property>
<property>
  <name>spark.master</name>
  <value>spark://localhost:7077</value>
</property>
Then I run Hive in the terminal and it works well (I can run SELECT statements).
But when I insert records into a table, I get this exception:
FAILED: SemanticException Failed to get a spark session:
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark client for
Spark session b9704c3a-6932-44b3-a011-600ae80f39e1

Configure Hive metastore in my local Spark shell

I need to configure the Hive metastore for use with Spark SQL in spark-shell.
I copied my hive-site.xml to the spark/conf folder, but it didn't work.
Then I tried in the Spark shell:
spark.conf.set("hive.metastore.uris","jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true")
spark.conf.set("spark.sql.catalogImplementation","hive")
But got an error:
scala> spark.conf.set("spark.sql.catalogImplementation","hive")
org.apache.spark.sql.AnalysisException: Cannot modify the value of a static config: spark.sql.catalogImplementation;
at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:155)
at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:41)
... 49 elided
Tried opening the Spark shell using:
spark-shell --conf spark.sql.catalogImplementation=hive hive.metastore.uris=jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true
but still not able to read the tables.
Error:
scala> spark.sql("select * from car_test.car_data_table").show
org.apache.spark.sql.AnalysisException: Table or view not found: `car_test`.`car_data_table`
The Hive metastore is not getting attached to Spark SQL.
My hive-site.xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/raptor/tmp/hive/warehouse/</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>HIVE_USER</value>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>HIVE_PASSWORD</value>
  </property>
</configuration>
spark-env.sh
#!/usr/bin/env bash
export JAVA_HOME=/home/user/Softwares/jdk1.8.0_221/
export SPARK_LOCAL_IP=127.0.1.1
export HADOOP_HOME=/home/user/Softwares/hadoop-2.7.3/
export HIVE_HOME=/home/user/Softwares/apache-hive-2.1.1-bin/
# Use the output of `hadoop classpath`, not the path to the hadoop binary itself
export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
#SPARK_DIST_CLASSPATH=$HADOOP_HOME/share/hadoop/common/lib:$HADOOP_HOME/share/hadoop/common:$HADOOP_HOME/share/hadoop/mapreduce:$HADOOP_HOME/share/hadoop/mapreduce/lib:$HADOOP_HOME/share/hadoop/yarn:$HADOOP_HOME/share/hadoop/yarn/lib:$HADOOP_HOME/share/hadoop/hdfs:$HADOOP_HOME/share/hadoop/hdfs/lib
export SPARK_HOME=/home/user/Softwares/spark-2.4.4-bin-hadoop-2.7-scala-2.12
export SPARK_CONF_DIR=${SPARK_HOME}/conf
export SPARK_LOG_DIR=${SPARK_HOME}/logs
I got help from this post:
How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?
One more thing: out of the box, Spark 2.x talks to the metastore through its built-in Hive 1.2.1 client, so it only works with metastores from Hive 0.12.0 up to 1.2.x unless you set spark.sql.hive.metastore.version (and matching spark.sql.hive.metastore.jars). I was using Hive 2.1.1, and that was the issue.
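For reference, pinning the metastore version when starting the shell looks roughly like this (a sketch; the version and the jars classpath must match your own Hive install, and the classpath must also include the right Hadoop jars):
$SPARK_HOME/bin/spark-shell \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.sql.hive.metastore.version=2.1.1 \
  --conf "spark.sql.hive.metastore.jars=${HIVE_HOME}/lib/*"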

How to connect to remote Hive server without hive downloaded?

I am trying to access a Hive cluster without having Hive downloaded on my machine. I read on here that I just need a JDBC client to do so. I have a URL, username, and password for the Hive cluster. I have tried making a hive-site.xml with these, as well as configuring things programmatically, although that method does not seem to have a place to input the username and password. No matter what I do, the following error keeps me from accessing Hive:
Unable to instantiate
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
I feel like this is because I do not have Hive downloaded on my computer, judging from the answers to this error online. What exactly do I need to do to access the cluster without Hive downloaded, or do I actually have to download it? Here is my code for reference:
spark = SparkSession \
    .builder \
    .appName("interfacing spark sql to hive metastore without configuration file") \
    .config("hive.metastore.uris", "https://prod-fmhdinsight-eu.azurehdinsight.net") \
    .enableHiveSupport() \
    .getOrCreate()

data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = spark.createDataFrame(data)

# see the frame created
df.show()

# write the frame
df.write.mode("overwrite").saveAsTable("t4")
and the hive-site.xml:
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>https://prod-fmhdinsight-eu.azurehdinsight.net</value>
  </property>
  <!--
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
  </property>
  -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>https://prod-fmhdinsight-eu.azurehdinsight.net</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>username</value>
    <description>user name for connecting to mysql server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
    <description>password for connecting to mysql server</description>
  </property>
</configuration>
tl;dr Use the spark.sql.hive.metastore.jars configuration property with the maven value to let Spark SQL download the required jars.
The other options are builtin (which simply assumes Hive 1.2.1) and a JVM classpath of the Hive jars (e.g. spark.sql.hive.metastore.jars="/Users/jacek/dev/apps/hive/lib/*").
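For example, a spark-shell session against a Hive 2.3 metastore could be started like this (the exact version here is an assumption; match it to your metastore):
$SPARK_HOME/bin/spark-shell \
  --conf spark.sql.hive.metastore.version=2.3.2 \
  --conf spark.sql.hive.metastore.jars=maven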
If your Hive metastore is available remotely via thrift protocol you may want to create $SPARK_HOME/conf/hive-site.xml as follows:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9083</value>
</property>
</configuration>
A nice feature of Hive is that configuration properties can be defined as Java system properties, so the above can instead be passed as:
$SPARK_HOME/bin/spark-shell \
--driver-java-options="-Dhive.metastore.uris=thrift://localhost:9083"
You may want to add the following to conf/log4j.properties for more low-level logging:
log4j.logger.org.apache.spark.sql.hive.HiveUtils$=ALL
log4j.logger.org.apache.spark.sql.internal.SharedState=ALL

spark 1.6.1 -- hive-site.xml -- not connecting to mysql [duplicate]

This question already has answers here:
How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?
(11 answers)
Closed 2 years ago.
The following are the versions that we have
Spark 1.6.1
Hadoop 2.6.2
Hive 1.1.0
I have the hive-site.xml in $SPARK_HOME/conf directory. The hive.metastore.uris property is also configured properly.
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://host.domain.com:3306/metastore</value>
  <description>metadata is stored in a MySQL server</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>MySQL JDBC driver class</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>user name for connecting to mysql server</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>*****</value>
  <description>password for connecting to mysql server</description>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://host.domain.com:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
Unfortunately, Spark is creating a temporary Derby database instead of connecting to the MySQL metastore.
I need Spark to connect to the MySQL metastore, as that is the central store for all metadata. Please help.
Regards
Bala
Can you try passing the hive-site.xml to spark-submit with --files when running in cluster mode?
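Something along these lines (the jar, class, and path are placeholders):
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files /path/to/hive-site.xml \
  --class com.example.YourApp \
  your-app.jar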

Error in Configuring Spark/Shark on DSE

I have installed:
1) scala-2.10.3
2) spark-1.0.0
Changed spark-env.sh with the variables below:
export SCALA_HOME=$HOME/scala-2.10.3
export SPARK_WORKER_MEMORY=16g
I can see the Spark master.
3) shark-0.9.1-bin-hadoop1
Changed shark-env.sh with the variables below:
export SHARK_MASTER_MEM=1g
SPARK_JAVA_OPTS=" -Dspark.local.dir=/tmp "
SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "
SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "
export SPARK_JAVA_OPTS
export HIVE_HOME=/usr/share/dse/hive
export HIVE_CONF_DIR="/etc/dse/hive"
export SPARK_HOME=/home/ubuntu/spark-1.0.0
export SPARK_MEM=16g
source $SPARK_HOME/conf/spark-env.sh
4) In DSE, the Hive version is 0.11.
The existing hive-site.xml is:
<configuration>
  <!-- Hive Execution Parameters -->
  <property>
    <name>hive.exec.mode.local.auto</name>
    <value>false</value>
    <description>Let hive determine whether to run in local mode automatically</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>cfs:///user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
  <property>
    <name>hive.hwi.war.file</name>
    <value>lib/hive-hwi.war</value>
    <description>This sets the path to the HWI war file, relative to ${HIVE_HOME}</description>
  </property>
  <property>
    <name>hive.metastore.rawstore.impl</name>
    <value>com.datastax.bdp.hadoop.hive.metastore.CassandraHiveMetaStore</value>
    <description>Use the Apache Cassandra Hive RawStore implementation</description>
  </property>
  <property>
    <name>hadoop.bin.path</name>
    <value>${dse.bin}/dse hadoop</value>
  </property>
  <!-- Set this to true to enable auto-creation of Cassandra keyspaces as Hive Databases -->
  <property>
    <name>cassandra.autoCreateHiveSchema</name>
    <value>true</value>
  </property>
</configuration>
5) While running the Shark shell I get the error:
Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
And
6) While running the Shark shell with -skipRddReload, I am able to get the Shark shell, but I am not able to connect to Hive or execute any commands.
shark> DESCRIBE mykeyspace;
and I get the error message:
FAILED: Error in metastore: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
Please provide details on how to configure Spark/Shark on DataStax Enterprise (Cassandra).
