Spark Thrift Server connection to S3 - apache-spark

I'm trying to connect to AWS S3 using the Spark Thrift Server. I'm using:
spark-defaults.conf
spark.sql.warehouse.dir s3://demo-metastore-001/
spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3.aws.credentials.provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
spark.hadoop.fs.s3.access.key XXXXXXXXXXXXX
spark.hadoop.fs.s3.secret.key yyyyyyyyyyyyyyyyyyyy
spark.hadoop.fs.s3a.access.key XXXXXXXXXXXXX
spark.hadoop.fs.s3a.secret.key yyyyyyyyyyyyyyyyyyyy
hive-site.xml
<property>
<name>hive.metastore.warehouse.dir</name>
<value>s3://demo-metastore-001/</value>
</property>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>XXXXXXXXXXXXX</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>yyyyyyyyyyyyyyyyyyyy</value>
</property>
<property>
<name>fs.s3a.awsAccessKeyId</name>
<value>XXXXXXXXXXXXX</value>
</property>
<property>
<name>fs.s3a.awsSecretAccessKey</name>
<value>yyyyyyyyyyyyyyyyyyyy</value>
</property>
As you can see, I'm brute-forcing it by mixing s3 and s3a settings; I'm not sure which parameters are the right ones.
I'm running:
start-thriftserver.sh --packages org.apache.spark:spark-hadoop-cloud_2.12:3.3.1 --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem --master local[*]
The log doesn't show any error:
23/01/12 04:16:12 INFO MetricsSystemImpl: s3a-file-system metrics system started
23/01/12 04:16:13 INFO SharedState: Warehouse path is 's3a://demo-metastore-001/'.
23/01/12 04:16:14 INFO HiveUtils: Initializing HiveMetastoreConnection version 2.3.9 using Spark classes.
23/01/12 04:16:14 INFO HiveClientImpl: Warehouse location for Hive client (version 2.3.9) is s3a://demo-metastore-001/
23/01/12 04:16:18 INFO HiveUtils: Initializing execution hive, version 2.3.9
23/01/12 04:16:18 INFO HiveClientImpl: Warehouse location for Hive client (version 2.3.9) is s3a://demo-metastore-001/
But the metastore is always created in the master server's local directory, not in S3. Any idea how I can connect the Spark Thrift Server to AWS S3?
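(For reference, a minimal s3a-only sketch of the relevant settings, assuming the hadoop-aws / spark-hadoop-cloud jars are on the classpath; the bucket name and dummy keys are taken from the question. Note that fs.s3a.awsAccessKeyId / fs.s3a.awsSecretAccessKey are not s3a property names; s3a reads fs.s3a.access.key and fs.s3a.secret.key. Also, the embedded Derby metastore database (the metastore_db directory) always lives locally unless an external metastore such as MySQL is configured; the warehouse dir only controls where table data is written, so managed tables should land under s3a://demo-metastore-001/ even if metastore_db stays on the master.)
spark-defaults.conf
spark.sql.warehouse.dir s3a://demo-metastore-001/
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
spark.hadoop.fs.s3a.access.key XXXXXXXXXXXXX
spark.hadoop.fs.s3a.secret.key yyyyyyyyyyyyyyyyyyyy
hive-site.xml
<property>
<name>hive.metastore.warehouse.dir</name>
<value>s3a://demo-metastore-001/</value>
</property>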

Related

hive on spark :org.apache.hadoop.hive.ql.metadata.HiveException

Buddy!
I'm having a problem using Hive (version 3.1.2) on Spark (version 3.2.1) on my Mac (Catalina 10.15.7). My Hadoop and Hive run on my Mac in local mode and both work well (I can insert records into a Hive table and select them back when I set hive.execution.engine=mr).
After I added the configuration listed below to Hive's hive-site.xml:
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
<property>
<name>spark.home</name>
<value>/Users/admin/spark/spark-3.2.1-bin-hadoop3.2/</value>
</property>
<property>
<name>spark.master</name>
<value>spark://localhost:7077</value>
</property>
Then I run Hive in the terminal and it works well (I can run SELECT statements).
But when I insert records into a table, I get this exception:
FAILED: SemanticException Failed to get a spark session:
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark client for
Spark session b9704c3a-6932-44b3-a011-600ae80f39e1

Connect to hive metastore from remote spark

I have a Hadoop cluster with Hive and Spark installed. In addition, I have a separate workstation machine and I am trying to connect to the cluster from it.
I installed Spark on this machine and try to connect using the following command:
pyspark --name testjob --master spark://hadoop-master.domain:7077
In the results I see the running application on the Spark WebUI page.
I want to connect to the Hive database (in the cluster) from my workstation, but I can't. I have the hive-site.xml config in my Spark conf directory on the local workstation with the following contents:
<configuration>
<property>
<name>metastore.thrift.uris</name>
<value>thrift://hadoop-master.domain:9083</value>
<description>IP address (or domain name) and port of the metastore host</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs://hadoop-master.domain:9000/user/hive/warehouse</value>
<description>Warehouse location</description>
</property>
<property>
<name>metastore.warehouse.dir</name>
<value>hdfs://hadoop-master.domain:9000/user/hive/warehouse</value>
<description>Warehouse location</description>
</property>
<property>
<name>spark.sql.hive.metastore.version</name>
<value>3.1.0</value>
<description>Metastore version</description>
</property>
</configuration>
I tried this construction, but can't make it work with external Hive databases:
spark = SparkSession \
.builder \
.appName('test01') \
.config('hive.metastore.uris', "thrift://hadoop-master.domain:9083") \
.config("spark.sql.warehouse.dir", "hdfs://hadoop-master.domain:9000/user/hive/warehouse") \
.enableHiveSupport() \
.getOrCreate()
What should I do to connect from local PySpark to the remote Hive database?
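(A hedged sketch of the same session with a quick verification step; the host, port, and warehouse path are taken from the question. SHOW DATABASES should return the databases defined in the remote metastore rather than only 'default' from a freshly created local Derby catalog.)
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('test01') \
    .config('hive.metastore.uris', 'thrift://hadoop-master.domain:9083') \
    .config('spark.sql.warehouse.dir', 'hdfs://hadoop-master.domain:9000/user/hive/warehouse') \
    .enableHiveSupport() \
    .getOrCreate()

# quick checks: the catalog implementation should be 'hive' and the cluster's databases should be listed
print(spark.conf.get('spark.sql.catalogImplementation'))
spark.sql('SHOW DATABASES').show()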

No FileSystem for scheme: wasbs Apache Zeppelin 0.8.0 with Spark 2.3.2

I am trying to run a note in Apache Zeppelin 0.8.0 with Spark 2.3.2 and Azure Blob Storage, but I'm getting a "No FileSystem for scheme: wasbs" error, though I configured everything as recommended in related issues.
Here are some conf files:
spark-defaults.conf
spark.driver.extraClassPath /opt/jars/*
spark.driver.extraLibraryPath /opt/jars
spark.jars /opt/jars/hadoop-azure-2.7.3.jar,/opt/jars/azure-storage-2.2.0.jar
spark.driver.memory 28669m
core-site.xml
<configuration>
<property>
<name>fs.AbstractFileSystem.wasb.Impl</name>
<value>org.apache.hadoop.fs.azure.Wasb</value>
</property>
<property>
<name>fs.azure.account.key.{storage_account_name}.blob.core.windows.net</name>
<value>{account_key_value}</value>
</property>

spark 1.6.1 -- hive-site.xml -- not connecting to mysql [duplicate]

This question already has answers here:
How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?
(11 answers)
Closed 2 years ago.
The following are the versions that we have
Spark 1.6.1
Hadoop 2.6.2
Hive 1.1.0
I have the hive-site.xml in $SPARK_HOME/conf directory. The hive.metastore.uris property is also configured properly.
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://host.domain.com:3306/metastore</value>
<description>metadata is stored in a MySQL server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>MySQL JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>user name for connecting to mysql server </description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>*****</value>
<description>password for connecting to mysql server </description>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://host.domain.com:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
Unfortunately, Spark is creating a temporary Derby DB instead of connecting to the MySQL metastore.
I need Spark to connect to the MySQL metastore, as that is the central store for all metadata. Please help.
Regards
Bala
Can you try passing the hive-site.xml (--files) with spark-submit when running in cluster mode?
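For example (a sketch; the deploy mode, paths, and application name are placeholders, not from the question):
spark-submit --master yarn --deploy-mode cluster --files /path/to/hive-site.xml --class your.main.Class your-app.jar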

Setting MySQL as the metastore for built in spark hive

I have a Spark Scala sbt project. I need to create multiple HiveContexts, which the built-in Derby metastore for Spark Hive doesn't allow. Can someone help me set up MySQL as the metastore instead of Derby, which is the default DB? I don't have Hive or Spark actually installed; I use sbt dependencies for Spark and Hive.
Copy the hive-site.xml file into Spark's conf directory and change the following properties in that file:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>password to use against metastore database</description>
</property>
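(One thing hive-site.xml alone doesn't cover: the MySQL JDBC driver must also be on the classpath. Since the question mentions an sbt build, a hedged sketch of the extra dependency; the version shown is only an example.)
// build.sbt: pull in the MySQL Connector/J driver referenced by javax.jdo.option.ConnectionDriverName
libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.49"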
Shakti
You need to have the conf files on the classpath. I'm using Hadoop, Hive, and Spark with IntelliJ. In IntelliJ I have file:/usr/local/spark/conf/, file:/usr/local/hadoop/etc/hadoop/, and file:/usr/local/hive/conf/ on my classpath. You can use the following to print your runtime classpath:
val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println)
I hope this helps if you haven't already found a fix.
