No FileSystem for scheme: wasbs Apache Zeppelin 0.8.0 with Spark 2.3.2 - azure

I am trying to run a note in Apache Zeppelin 0.8.0 with Spark 2.3.2 and Azure Blob Storage, but I'm getting a "No FileSystem for scheme: wasbs" error, even though I configured everything as recommended in related issues.
Here are some conf files:
spark-defaults.conf
spark.driver.extraClassPath /opt/jars/*
spark.driver.extraLibraryPath /opt/jars
spark.jars /opt/jars/hadoop-azure-2.7.3.jar,/opt/jars/azure-storage-2.2.0.jar
spark.driver.memory 28669m
core-site.xml
<configuration>
<property>
<name>fs.AbstractFileSystem.wasb.Impl</name>
<value>org.apache.hadoop.fs.azure.Wasb</value>
</property>
<property>
<name>fs.azure.account.key.{storage_account_name}.blob.core.windows.net</name>
<value>{account_key_value}</value>
</property>
</configuration>
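For reference, a minimal sketch of how the wasbs scheme can be registered and exercised directly from a Zeppelin %spark paragraph. It assumes the hadoop-azure and azure-storage jars listed in spark.jars are actually on the driver classpath and that the hadoop-azure build ships NativeAzureFileSystem$Secure and Wasbs; the {container} placeholder and the test path are hypothetical.
// Zeppelin %spark paragraph (sketch): register the wasbs filesystem explicitly, then try a read.
sc.hadoopConfiguration.set("fs.wasbs.impl",
  "org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure")
sc.hadoopConfiguration.set("fs.AbstractFileSystem.wasbs.impl",
  "org.apache.hadoop.fs.azure.Wasbs")
sc.hadoopConfiguration.set(
  "fs.azure.account.key.{storage_account_name}.blob.core.windows.net",
  "{account_key_value}")

// {container} and the path are hypothetical placeholders for this sketch.
val df = spark.read.text("wasbs://{container}@{storage_account_name}.blob.core.windows.net/path/to/file")
df.show(5)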

Related

Spark Thrift Server connection to S3

I'm trying to connect to AWS S3 using the Spark Thrift Server. I'm using:
spark-defaults.conf
spark.sql.warehouse.dir s3://demo-metastore-001/
spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3.aws.credentials.provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
spark.hadoop.fs.s3.access.key XXXXXXXXXXXXX
spark.hadoop.fs.s3.secret.key yyyyyyyyyyyyyyyyyyyy
spark.hadoop.fs.s3a.access.key XXXXXXXXXXXXX
spark.hadoop.fs.s3a.secret.key yyyyyyyyyyyyyyyyyyyy
hive-site.xml
<property>
<name>hive.metastore.warehouse.dir</name>
<value>s3://demo-metastore-001/</value>
</property>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>XXXXXXXXXXXXX</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>yyyyyyyyyyyyyyyyyyyy</value>
</property>
<property>
<name>fs.s3a.awsAccessKeyId</name>
<value>XXXXXXXXXXXXX</value>
</property>
<property>
<name>fs.s3a.awsSecretAccessKey</name>
<value>yyyyyyyyyyyyyyyyyyyy</value>
</property>
As you can see, I'm brute-forcing it by mixing s3 and s3a properties; I'm not sure which are the right parameters.
I'm running:
start-thriftserver.sh --packages org.apache.spark:spark-hadoop-cloud_2.12:3.3.1 --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem --master local[*]
The log doesn't show any error:
23/01/12 04:16:12 INFO MetricsSystemImpl: s3a-file-system metrics system started
23/01/12 04:16:13 INFO SharedState: Warehouse path is 's3a://demo-metastore-001/'.
23/01/12 04:16:14 INFO HiveUtils: Initializing HiveMetastoreConnection version 2.3.9 using Spark classes.
23/01/12 04:16:14 INFO HiveClientImpl: Warehouse location for Hive client (version 2.3.9) is s3a://demo-metastore-001/
23/01/12 04:16:18 INFO HiveUtils: Initializing execution hive, version 2.3.9
23/01/12 04:16:18 INFO HiveClientImpl: Warehouse location for Hive client (version 2.3.9) is s3a://demo-metastore-001/
But the metastore is always created in the master server's local directory, not in S3. Any idea how I can connect the Spark Thrift Server to AWS S3?
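For what it's worth, before involving the Thrift Server I would check in a plain Hive-enabled session that the s3a warehouse settings are picked up at all. A minimal sketch, assuming hadoop-aws/spark-hadoop-cloud is on the classpath and the credentials live in the standard AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables (the bucket name is the one from the question):
import org.apache.spark.sql.SparkSession

// Sketch only: confirm the warehouse really resolves to S3 before starting the Thrift Server.
val spark = SparkSession.builder()
  .appName("s3a-warehouse-check")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "s3a://demo-metastore-001/")
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .enableHiveSupport()
  .getOrCreate()

println(spark.conf.get("spark.sql.warehouse.dir"))           // expect the s3a:// path
spark.sql("CREATE TABLE IF NOT EXISTS warehouse_check (id INT) USING parquet")
spark.sql("DESCRIBE EXTENDED warehouse_check").show(false)   // Location should be under s3a://demo-metastore-001/
Note that the metastore itself is a relational database (Derby by default, created locally as metastore_db); spark.sql.warehouse.dir and hive.metastore.warehouse.dir only control where table data is written, so the metastore database will not end up in S3 regardless of these settings.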

Configure Hive metastore in my local Spark shell

I need to configure Hive metastore for use with Spark SQL in spark-shell.
I copied my hive-site.xml to the spark/conf folder, but it didn't work.
Then I tried this in the spark shell:
spark.conf.set("hive.metastore.uris","jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true")
spark.conf.set("spark.sql.catalogImplementation","hive")
But got an error:
scala> spark.conf.set("spark.sql.catalogImplementation","hive")
org.apache.spark.sql.AnalysisException: Cannot modify the value of a static config: spark.sql.catalogImplementation;
at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:155)
at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:41)
... 49 elided
I tried opening the spark shell using
spark-shell --conf spark.sql.catalogImplementation=hive hive.metastore.uris=jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true
but I'm still not able to read the table.
Error:
scala> spark.sql("select * from car_test.car_data_table").show
org.apache.spark.sql.AnalysisException: Table or view not found: `car_test`.`car_data_table`
The Hive metastore is not getting attached to Spark SQL.
My hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/raptor/tmp/hive/warehouse/</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>HIVE_USER</value>
</property>
<property>
<name>hive.metastore.local</name>
<value>true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>HIVE_PASSWORD</value>
</property>
</configuration>
spark-env.sh
#!/usr/bin/env bash
export JAVA_HOME=/home/user/Softwares/jdk1.8.0_221/
export SPARK_LOCAL_IP=127.0.1.1
export HADOOP_HOME=/home/user/Softwares/hadoop-2.7.3/
export HIVE_HOME=/home/user/Softwares/apache-hive-2.1.1-bin/
export SPARK_LOCAL_IP=127.0.1.1
export SPARK_DIST_CLASSPATH=$HADOOP_HOME/bin/hadoop
#SPARK_DIST_CLASSPATH=$HADOOP_HOME/share/hadoop/common/lib:$HADOOP_HOME/share/hadoop/common:$HADOOP_HOME/share/hadoop/mapreduce:$HADOOP_HOME/share/hadoop/mapreduce/lib:$HADOOP_HOME/share/hadoop/yarn:$HADOOP_HOME/share/hadoop/yarn/lib:$HADOOP_HOME/share/hadoop/hdfs:$HADOOP_HOME/share/hadoop/hdfs/lib
export SPARK_HOME=/home/user/Softwares/spark-2.4.4-bin-hadoop-2.7-scala-2.12
export SPARK_CONF_DIR=${SPARK_HOME}/conf
export SPARK_LOG_DIR=${SPARK_HOME}/logs
I got help from this post:
How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?
And one more thing: Spark 2.x only supports Hive metastores from Hive 0.12.0 to Hive 1.2.0. I was using Hive 2.1.1, and that was the issue.
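For reference, a sketch of the same setup done from application code rather than spark-shell, assuming a compatible metastore version and the MySQL JDBC driver jar on the driver classpath; the JDBC settings mirror the hive-site.xml above and are passed to the embedded Hive client via the spark.hadoop. prefix:
import org.apache.spark.sql.SparkSession

// Sketch only: enableHiveSupport() fixes the static spark.sql.catalogImplementation=hive
// before the session exists, which is why spark.conf.set(...) fails with the
// AnalysisException shown above.
val spark = SparkSession.builder()
  .appName("hive-metastore-check")
  .config("spark.hadoop.javax.jdo.option.ConnectionURL",
    "jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true")
  .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver")
  .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "HIVE_USER")
  .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "HIVE_PASSWORD")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()   // databases from the MySQL-backed metastore should appear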

Enabling dynamic allocation on spark on YARN mode

This question is similar to this one, but it got no answer.
I am trying to enable dynamic allocation for Spark in YARN mode. I have an 11-node cluster with 1 master node and 10 worker nodes. I am following the links below for instructions:
For setup in YARN:
http://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service
Config variables that need to be set in spark-defaults.conf: https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior
I have also referred to the link below and a few other resources:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-dynamic-allocation.html#spark.dynamicAllocation.testing
Here are the steps I am doing:
Setting up config variables in spark-defaults.conf.
My spark-defaults.conf settings related to dynamic allocation and the shuffle service are:
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
Making changes in yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
<name>yarn.nodemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value> $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/common/*,$HADOOP_MAPRED_HOME/share/hadoop/common/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/tools/*,$HADOOP_MAPRED_HOME/share/hadoop/tools/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/client/*,$HADOOP_MAPRED_HOME/share/hadoop/client/lib/*,/home/hadoop/spark/common/network-yarn/target/scala-2.11/spark-2.2.2-SNAPSHOT-yarn-shuffle.jar </value>
</property>
All these steps are replicated on all worker nodes, i.e. spark-defaults.conf has the above-mentioned values and yarn-site.xml has these properties. I have made sure that /home/hadoop/spark/common/network-yarn/target/scala-2.11/spark-2.2.2-SNAPSHOT-yarn-shuffle.jar exists on all worker nodes.
Then I run $SPARK_HOME/sbin/start-shuffle-service.sh on the worker nodes and the master node. On the master node, I restart YARN using stop-yarn.sh and then start-yarn.sh.
Then I run yarn node -list -all to see the worker nodes, but I am not able to see any node.
When I disable the property
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle</value>
</property>
I can see all the worker nodes as normal, so it seems the shuffle service is not properly configured.
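As a side note, once the NodeManagers do register and an application comes up, a quick way to confirm from spark-shell that the dynamic allocation settings were actually applied is a sketch like this:
// Sketch: sanity-check the dynamic allocation settings from a running application.
println(spark.conf.get("spark.dynamicAllocation.enabled"))       // expect "true"
println(spark.conf.get("spark.shuffle.service.enabled"))         // expect "true"
println(spark.conf.get("spark.shuffle.service.port", "7337"))    // expect "7337"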

spark 1.6.1 -- hive-site.xml -- not connecting to mysql [duplicate]

This question already has answers here:
How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?
(11 answers)
Closed 2 years ago.
The following are the versions we have:
Spark 1.6.1
Hadoop 2.6.2
Hive 1.1.0
I have hive-site.xml in the $SPARK_HOME/conf directory. The hive.metastore.uris property is also configured properly:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://host.domain.com:3306/metastore</value>
<description>metadata is stored in a MySQL server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>MySQL JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>user name for connecting to mysql server </description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>*****</value>
<description>password for connecting to mysql server </description>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://host.domain.com:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
Unfortunately, Spark is creating a temporary Derby DB instead of connecting to the MySQL metastore.
I need Spark to connect to the MySQL metastore, as that is the central store for all metadata. Please help.
Regards
Bala
Can you try passing hive-site.xml (via --files) with spark-submit when running in cluster mode?
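If hive-site.xml does reach the driver, a small sketch for Spark 1.6 (the version in the question) to confirm that the MySQL-backed metastore, not Derby, is being used:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch only: with hive-site.xml on the driver classpath, the HiveContext should reach
// the configured metastore; a silent fallback to Derby shows up as a fresh metastore_db
// directory in the working directory instead.
val sc = new SparkContext(new SparkConf().setAppName("metastore-check"))
val hiveContext = new HiveContext(sc)
hiveContext.sql("show databases").show()   // databases defined in the central metastore should be listed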

Setting MySQL as the metastore for built in spark hive

I have a Scala sbt project using Spark. I need to create multiple HiveContexts, which is not allowed by the built-in Derby metastore for Spark Hive. Can someone help me set up MySQL as the metastore instead of Derby, the default DB? I don't have an actual Hive or Spark installation; I use sbt dependencies for Spark and Hive.
Copy the hive-site.xml file into Spark's conf directory and change some properties in that file:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>password to use against metastore database</description>
</property>
Shakti
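Since the project pulls Spark and Hive in through sbt, the MySQL JDBC driver has to come in the same way; a sketch of the build.sbt dependencies this setup assumes (versions are illustrative, matching the com.mysql.jdbc.Driver class referenced above):
// build.sbt (sketch; versions are illustrative)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1",
  "org.apache.spark" %% "spark-hive" % "1.6.1",   // provides HiveContext
  "mysql" % "mysql-connector-java" % "5.1.38"     // JDBC driver named in hive-site.xml above
)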
You need to have the conf files on the classpath. I'm using Hadoop, Hive, and Spark with IntelliJ. In IntelliJ I have file:/usr/local/spark/conf/, file:/usr/local/hadoop/etc/hadoop/, and file:/usr/local/hive/conf/ on my classpath. You can use the following to print your runtime classpath:
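// Prints every URL on the JVM system classpath (on Java 8, the system class loader is a URLClassLoader)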
val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println)
I hope this helps if you haven't already found a fix.
