howto add hive properties at runtime in spark-shell - apache-spark

How do you set a hive property like: hive.metastore.warehouse.dir at runtime? Or at least a more dynamic way of setting a property like the above, than putting it in a file like spark_home/conf/hive-site.xml

I faced the same issue and for me it worked by setting Hive properties from Spark (2.4.0). Please find below all the options through spark-shell, spark-submit and SparkConf.
Option 1 (spark-shell)
spark-shell --conf spark.hadoop.hive.metastore.warehouse.dir=some_path\metastore_db_2
Initially I tried with spark-shell with hive.metastore.warehouse.dir set to some_path\metastore_db_2. Then I get the next warning:
Warning: Ignoring non-spark config property:
hive.metastore.warehouse.dir=C:\winutils\hadoop-2.7.1\bin\metastore_db_2
Although when I create a Hive table with:
bigDf.write.mode("overwrite").saveAsTable("big_table")
The Hive metadata are stored correctly under metastore_db_2 folder.
When I use spark.hadoop.hive.metastore.warehouse.dir the warning disappears and the results are still saved in the metastore_db_2 directory.
Option 2 (spark-submit)
In order to use hive.metastore.warehouse.dir when submitting a job with spark-submit I followed the next steps.
First I wrote some code to save some random data with Hive:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setAppName("metastore_test").setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
import spark.implicits._
var dfA = spark.createDataset(Seq(
(1, "val1", "p1"),
(2, "val1", "p2"),
(3, "val2", "p3"),
(3, "val3", "p4"))).toDF("id", "value", "p")
dfA.write.mode("overwrite").saveAsTable("metastore_test")
spark.sql("select * from metastore_test").show(false)
Next I submitted the job with:
spark-submit --class org.tests.Main \
--conf spark.hadoop.hive.metastore.warehouse.dir=C:\winutils\hadoop-2.7.1\bin\metastore_db_2
spark-scala-test_2.11-0.1.jar
The metastore_test table was properly created under the C:\winutils\hadoop-2.7.1\bin\metastore_db_2 folder.
Option 3 (SparkConf)
Via SparkSession in the Spark code.
val sparkConf = new SparkConf()
.setAppName("metastore_test")
.set("spark.hadoop.hive.metastore.warehouse.dir", "C:\\winutils\\hadoop-2.7.1\\bin\\metastore_db_2")
.setMaster("local")
This attempt was successful as well.
The question which still remains is why I have to extend the property with spark.hadoop in order to work as expected?

Related

Set spark configuration

I am trying to set the configuration of a few spark parameters inside the pyspark shell.
I tried the following
spark.conf.set("spark.executor.memory", "16g")
To check if the executor memory has been set, I did the following
spark.conf.get("spark.executor.memory")
which returned "16g".
I tried to check it through sc using
sc._conf.get("spark.executor.memory")
and that returned "4g".
Why do these two return different values and whats the correct way to set these configurations.
Also, I am fiddling with a bunch of parameters like
"spark.executor.instances"
"spark.executor.cores"
"spark.executor.memory"
"spark.executor.memoryOverhead"
"spark.driver.memory"
"spark.driver.cores"
"spark.driver.memoryOverhead"
"spark.memory.offHeap.size"
"spark.memory.fraction"
"spark.task.cpus"
"spark.memory.offHeap.enabled "
"spark.rpc.io.serverThreads"
"spark.shuffle.file.buffer"
Is there a way that will set the configurations for all the variables.
EDIT
I need to set the configuration programmatically. How do I change it after I have done spark-submit or started the pyspark shell? I am trying to reduce the runtime of my jobs for which I am going through multiple iterations changing the spark configuration and recording the runtimes.
You can set environment variables by using: (e.g. in spark-env.sh, only stand-alone)
SPARK_EXECUTOR_MEMORY=16g
You can also set the spark-defaults.conf:
spark.executor.memory=16g
But these solutions are hardcoded and pretty much static, and you want to have different parameters for different jobs, however, you might want to set up some defaults.
The best approach is to use spark-submit:
spark-submit --executor-memory 16G
The problem of defining variables programmatically is that some of them need to be defined at startup time if not precedence rules will take over and your changes after the initiation of the job will be ignored.
Edit:
The amount of memory per executor is looked up when SparkContext is created.
And
once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
See: SparkConf Documentation
Have you tried changing the variable before the SparkContext is created, then running your iteration, stopping your SparkContext and changing your variable to iterate again?
import org.apache.spark.{SparkContext, SparkConf}
val conf = new SparkConf.set("spark.executor.memory", "16g")
val sc = new SparkContext(conf)
...
sc.stop()
val conf2 = new SparkConf().set("spark.executor.memory", "24g")
val sc2 = new SparkContext(conf2)
You can debug your configuration using: sc.getConf.toDebugString
See: Spark Configuration
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
You'll need to make sure that your variable is not defined with higher precedence.
Precedence order:
conf/spark-defaults.conf
--conf or -c - the command-line option used by spark-submit
SparkConf
I hope this helps.
In Pyspark,
Suppose I want to increase the driver memory and executor in code. I can do it as below:
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '23g'), ('spark.driver.memory','9.7g')])
To view the updated settings:
spark.sparkContext._conf.getAll()

SPARK_HOME read from IntellIj IDEA

I know it might be trivial question, but I stuck with it for a while :) Basically, I am querying Hive tables from Spark. All connections settings for Hive I have in hive-site.xml file which is in SPARK_HOME/conf directory on my Windows PC. It works fine through spark-shell. However, when I run the same from IntelliJ Idea, it will query Hive locally instead connecting to remote metastore defined in hive-site.xml.
package main.scala
import org.apache.spark.sql.SparkSession
object Main {
def main(args: Array[String]) {
val spark = SparkSession
.builder()
.appName("Hive Tables")
.master("local")
.enableHiveSupport()
.getOrCreate()
import spark.sql
sql("show tables").show()
}
}
I have added environmental variables for HADOOP_HOME and SPARK_HOME. As I mentioned it works fine through spark-shell. But seems like hive-site.xml is not read in IntelliJ Idea.
Do you have any ideas what I may do wrong ?
Thanks

How to start sparksession in pyspark

I want to change the default memory, executor and core settings of a spark session.
The first code in my pyspark notebook on HDInsight cluster in Jupyter looks like this:
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("Juanita_Smith")\
.config("spark.executor.instances", "2")\
.config("spark.executor.cores", "2")\
.config("spark.executor.memory", "2g")\
.config("spark.driver.memory", "2g")\
.getOrCreate()
On completion, I read the parameters back, which looks like the statement worked
However if I look in yarn, the setting have indeed not worked.
Which settings or commands do I need to make to let the session configuration take effect ?
Thank you for help in advance
By the time your notebook kernel has started, the SparkSession is already created with parameters defined in a kernel configuration file. To change this, you will need to update or replace the kernel configuration file, which I believe is usually somewhere like <jupyter home>/kernels/<kernel name>/kernel.json.
Update
If you have access to the machine hosting your Jupyter server, you can find the location of the current kernel configurations using jupyter kernelspec list. You can then either edit one of the pyspark kernel configurations, or copy it to a new file and edit that. For your purposes, you will need to add the following arguments to the PYSPARK_SUBMIT_ARGS:
"PYSPARK_SUBMIT_ARGS": "--conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=2g --conf spark.driver.memory=2g"

Can't access Spark 2.0 Temporary Table from beeline

With Spark 1.5.1, I've already been able to access spark-shell temporary tables from Beeline using Thrift Server. I've been able to do so by reading answers to related questions on Stackoverflow.
However, after upgrading to Spark 2.0, I can't see temporary tables from Beeline anymore, here are the steps I'm following.
I'm launching spark-shell using the following command:
./bin/spark-shell --master=myHost.local:7077 —conf spark.sql.hive.thriftServer.singleSession=true
Once the spark shell is ready I enter the following lines to launch thrift server and create a temporary view from a data frame taking its source in a json file
import org.apache.spark.sql.hive.thriftserver._
spark.sqlContext.setConf("hive.server2.thrift.port","10002")
HiveThriftServer2.startWithContext(spark.sqlContext)
val df = spark.read.json("examples/src/main/resources/people.json")
df.createOrReplaceTempView("people")
spark.sql("select * from people").show()
The last statement displays the table, it runs fine.
However when I start beeline and log to my thrift server instance, I can't see any temporary tables:
show tables;
+------------+--------------+--+
| tableName | isTemporary |
+------------+--------------+--+
+------------+--------------+--+
No rows selected (0,658 seconds)
Did I miss something regarding my spark upgrade from 1.5.1 to 2.0, how can I gain access to my temporary tables ?
This worked for me after upgrading to spark 2.0.1
val sparkConf =
new SparkConf()
.setAppName("Spark Thrift Server Demo")
.setMaster(sparkMaster)
.set("hive.metastore.warehouse.dir", hdfsDataUri + "/hive")
val spark = SparkSession
.builder()
.enableHiveSupport()
.config(sparkConf)
.getOrCreate()
val sqlContext = new org.apache.spark.sql.SQLContext(spark.sparkContext)
HiveThriftServer2.startWithContext(sqlContext)

NoClassDefFoundError when using avro in spark-shell

I keep getting
java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema when creating the DataFrame object, but if I execute any actions on the data it will always throw the NoClassDefFoundError when trying to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar in my $HDFS_USER/lib directory on the cluster and even included it using the --jar option when launching the shell. Neither of these options worked. Any advice would be greatly appreciated. Below is example code:
scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError
The DataFrame object itself is created at the val df =... line, but data is not read yet. Spark only starts reading and processing the data, when you ask for some kind of output (like a df.count(), or df.show()).
So the original issue is that the avro-mapred package is missing.
Try launching your Spark Shell like this:
spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
The Spark Avro package marks the Avro Mapred package as provided, but it is not available on your system (or classpath) for one or other reason.
If anyone else runs into this problem, I finally solved it. I removed the CDH spark package and downloaded it from http://spark.apache.org/downloads.html. After that everything worked fine. Not sure what the issues was with the CDH version, but I'm not going to waste anymore time trying to figure it out.

Resources