I am trying to run Hive on Spark using the properties below. I have also tried tweaking other properties, such as the number of executor instances and spark.master, but it throws the error "FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client."
It runs fine when spark.master is set to local.
The job is not getting instantiated. Any inputs?
set hive.execution.engine=spark;
set spark.executor.memory=2g;
set yarn.scheduler.maximum-allocation-mb=8192;
set yarn.nodemanager.resource.memory-mb=40960;
set spark.executor.cores=4;
set spark.executor.memory=4g;
set spark.yarn.executor.memoryOverhead=750;
set hive.spark.client.server.connect.timeout=900000ms;
set yarn.nodemanager.resource.memory-mb=2048;
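For reference, properties like these can also be made permanent in hive-site.xml rather than repeated as per-session SET statements; a hedged sketch using two of the names above (values are illustrative only):
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>4g</value>
</property>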
Hive on Spark will not work for Hive versions < 2.1.
Env: Linux (spark-submit xxx.py)
Target database: Hive
We used to use Beeline to execute HQL, but now we are trying to run the HQL through PySpark, and we ran into an issue when setting table properties while creating the table.
SQL
CREATE EXTERNAL TABLE example.a(
column_a string)
TBLPROPERTIES (
'discover.partitions'='true',
'spark.sql.sources.schema.numPartCols'='1',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"column_a","type":"string","nullable":true,"metadata":{}}]}',
'spark.sql.sources.schema.partCol.0'='received_utc_date_partition');
Error message
Hive - ERROR - Cannot persist
example.a into Hive metastore as table property
keys may not start with 'spark.sql.': [spark.sql.sources.schema.partCol.0, spark.sql.sources.schema.numParts,
spark.sql.sources.schema.numPartCols, spark.sql.sources.schema.part.0];
Lines 130-147 of the Spark source code seem to show that it rejects all table properties that start with "spark.sql".
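For illustration only, a rough Python paraphrase of that guard, based on the wording of the error above (this is not the actual Scala source; the function and variable names here are made up):
SPARK_SQL_PREFIX = "spark.sql."

def verify_table_properties(table_name, properties):
    # Reject any user-supplied table property whose key starts with "spark.sql."
    invalid = [key for key in properties if key.startswith(SPARK_SQL_PREFIX)]
    if invalid:
        raise ValueError(
            "Cannot persist %s into Hive metastore as table property keys "
            "may not start with '%s': %s" % (table_name, SPARK_SQL_PREFIX, invalid))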
I am not sure if I did something wrong or whether there is another way to set table properties for a Hive table.
Any suggestions are appreciated.
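One hedged workaround sketch, assuming the intent is simply an external table partitioned by received_utc_date_partition: declare the partition column with PARTITIONED BY and drop the spark.sql.sources.* keys, since Spark maintains those properties itself when it creates the table. The S3 location below is a placeholder (Spark requires an explicit LOCATION for external tables):
# location below is a placeholder; session name "spark" assumed
spark.sql("""
    CREATE EXTERNAL TABLE example.a (
        column_a string)
    PARTITIONED BY (received_utc_date_partition string)
    LOCATION 's3://your-bucket/your-path/'
    TBLPROPERTIES ('discover.partitions'='true')
""")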
I am trying to set the configuration of a few spark parameters inside the pyspark shell.
I tried the following
spark.conf.set("spark.executor.memory", "16g")
To check if the executor memory has been set, I did the following
spark.conf.get("spark.executor.memory")
which returned "16g".
I tried to check it through sc using
sc._conf.get("spark.executor.memory")
and that returned "4g".
Why do these two return different values, and what is the correct way to set these configurations?
Also, I am fiddling with a bunch of parameters like
"spark.executor.instances"
"spark.executor.cores"
"spark.executor.memory"
"spark.executor.memoryOverhead"
"spark.driver.memory"
"spark.driver.cores"
"spark.driver.memoryOverhead"
"spark.memory.offHeap.size"
"spark.memory.fraction"
"spark.task.cpus"
"spark.memory.offHeap.enabled "
"spark.rpc.io.serverThreads"
"spark.shuffle.file.buffer"
Is there a way to set the configurations for all of these variables at once?
EDIT
I need to set the configuration programmatically. How do I change it after I have run spark-submit or started the pyspark shell? I am trying to reduce the runtime of my jobs, so I am going through multiple iterations, changing the Spark configuration and recording the runtimes.
You can set environment variables (e.g. in spark-env.sh; standalone mode only):
SPARK_EXECUTOR_MEMORY=16g
You can also set it in spark-defaults.conf:
spark.executor.memory=16g
But these solutions are hardcoded and pretty much static, while you want different parameters for different jobs; they are mainly useful for setting up defaults.
The best approach is to use spark-submit:
spark-submit --executor-memory 16G
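For the longer list of parameters in the question, the same approach extends to the other dedicated flags, and --conf covers anything without one. A hedged sketch (app.py and all values are placeholders; --num-executors applies on YARN):
spark-submit \
  --driver-memory 8G \
  --executor-memory 16G \
  --executor-cores 4 \
  --num-executors 10 \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.memory.fraction=0.6 \
  app.py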
The problem with defining variables programmatically is that some of them need to be defined at startup time; if not, precedence rules take over and changes made after the job has started are ignored.
Edit:
The amount of memory per executor is looked up when SparkContext is created.
And
once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
See: SparkConf Documentation
Have you tried changing the variable before the SparkContext is created, then running your iteration, stopping your SparkContext and changing your variable to iterate again?
import org.apache.spark.{SparkContext, SparkConf}
val conf = new SparkConf().set("spark.executor.memory", "16g")
val sc = new SparkContext(conf)
...
sc.stop()
val conf2 = new SparkConf().set("spark.executor.memory", "24g")
val sc2 = new SparkContext(conf2)
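A PySpark equivalent of the same stop-and-recreate pattern, since the question uses the pyspark shell (a sketch; the new setting only takes effect because the previous session is stopped before a new one is built):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.executor.memory", "16g") \
    .getOrCreate()
# ... run the job and record the runtime ...
spark.stop()

spark = SparkSession.builder \
    .config("spark.executor.memory", "24g") \
    .getOrCreate()
# ... run the next iteration ...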
You can debug your configuration using: sc.getConf.toDebugString
See: Spark Configuration
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
You'll need to make sure that your variable is not defined with higher precedence.
Precedence order (lowest to highest):
conf/spark-defaults.conf
--conf or -c - the command-line option used by spark-submit
SparkConf
I hope this helps.
In PySpark, suppose I want to increase the driver and executor memory in code. I can do it as below:
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '23g'), ('spark.driver.memory','9.7g')])
To view the updated settings:
spark.sparkContext._conf.getAll()
There is an external table in Hive pointing to an S3 location that is not partitioned. The table points to a folder in S3, but the data is in multiple subfolders inside that folder.
This table can be queried even though it is not partitioned, by setting a few properties in Hive like the ones below:
set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
However, when the same table is used in Spark to load the data into a DataFrame using a SQL statement like df = sqlContext.sql("select * from table_name"), the action fails, saying 'The subfolders in the external s3 location is not a file'.
I tried setting the above Hive properties in Spark using sc.hadoopConfiguration.set("mapred.input.dir.recursive","true"), but it did not help. It looks like this only helps for sc.textFile-style loading.
This can be achieved by setting the following property in Spark:
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true")
Note here that the property is set using sqlContext instead of sparkContext.
I tested this in Spark 1.6.2.
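Put together, a minimal PySpark sketch of the above (Spark 1.6-style HiveContext; table_name as in the question):
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="recursive-subdirs")
sqlContext = HiveContext(sc)

# Tell the Hive input format to descend into subfolders under the table location
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")

df = sqlContext.sql("select * from table_name")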
I'm trying to insert some data into a table that will have 1500 dynamic partitions, and I receive this error:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
Number of dynamic partitions created is 1500, which is more than 1000.
To solve this try to set hive.exec.max.dynamic.partitions to at least 1500.
So I tried SET hive.exec.max.dynamic.partitions=2048, but I still get the same error.
How can I change this value from Spark?
Code:
this.spark.sql("SET hive.exec.dynamic.partition=true")
this.spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
this.spark.sql("SET hive.exec.max.dynamic.partitions=2048")
this.spark.sql(
"""
|INSERT INTO processed_data
|PARTITION(event, date)
|SELECT c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,event,date FROM csv_data DISTRIBUTE BY event, date
""".stripMargin
).show()
Using Spark 2.0.0 standalone mode.
Thank you!
From Spark 2.x onward, adding Hive SET properties in the Spark CLI may not work. Add your Hive SET properties to hive-site.xml in both the Spark and Hive conf directories.
Adding the property below to hive-site.xml should resolve your issue.
<property>
  <name>hive.exec.max.dynamic.partitions</name>
  <value>2048</value>
  <description></description>
</property>
Note: if it does not take effect, restart HiveServer2 and the Spark history server.
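If editing hive-site.xml is not practical, another approach that is often reported to work is passing the Hive property as a config when the session is built, before any Hive access, rather than via a runtime SET statement. A hedged PySpark sketch (the question itself uses Scala-style code):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("hive.exec.max.dynamic.partitions", "2048") \
    .enableHiveSupport() \
    .getOrCreate()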
I'm trying to execute my queries in Hive using the Spark engine, but I need to run them in a specific queue. I couldn't find any queue-name property other than spark-submit --queue. So far I've used these settings:
set hive.execution.engine=spark;
set spark.job.queue.name=MyQueue;
set spark.executor.instances=50;
or
set spark.queue.name=MyQueue;
but they won't start the jobs.
I found another option:
set spark.yarn.queue=MyQueue
but it doesn't work either.