Spark queue name in hive - apache-spark

I'm trying to execute my queries in Hive using the Spark execution engine, but I need to run them in a specific queue. I couldn't find any queue name property except the spark-submit --queue option. So far I've used these settings:
set hive.execution.engine=spark;
set spark.job.queue.name=MyQueue;
set spark.executor.instances=50;
or
set spark.queue.name=MyQueue;
but they won't start the jobs.
I also found another option:
set spark.yarn.queue=MyQueue;
but it doesn't work either.
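For comparison, when a Spark job is launched directly rather than through Hive, the YARN queue can be picked with spark.yarn.queue. A minimal PySpark sketch, assuming a YARN cluster where MyQueue exists and a hypothetical application name:

from pyspark.sql import SparkSession

# minimal sketch: choose the YARN queue when starting Spark directly
spark = (SparkSession.builder
         .appName("queue-demo")                  # hypothetical name
         .config("spark.yarn.queue", "MyQueue")
         .getOrCreate())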

Related

How to list hive variables available in a notebook session?

In a hive session it is possible to list the variables available with the following:
0: jdbc:hive2://127.0.0.1:10000>set;
How can I list the hive variables from a databricks notebook?
The solution was to just run the SET SQL command:
%sql
SET
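From a Python cell, a roughly equivalent sketch is to run the same command through spark.sql:

# lists the current session settings as key/value rows
spark.sql("SET").show(truncate=False)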

how to set the description of a job in sparklyr

Within PySpark you can set the description of a job using
SparkContext.setJobDescription('some job description')
Is there an equivalent method available within sparklyr?
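For reference, a minimal PySpark sketch of the call the question mentions (hypothetical description and job); jobs triggered while the description is set show it in the Spark UI:

spark.sparkContext.setJobDescription("count rows in the events table")  # hypothetical description
spark.range(1000).count()  # this job appears in the UI with the description above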

Set spark configuration

I am trying to set the configuration of a few spark parameters inside the pyspark shell.
I tried the following
spark.conf.set("spark.executor.memory", "16g")
To check if the executor memory has been set, I did the following
spark.conf.get("spark.executor.memory")
which returned "16g".
I tried to check it through sc using
sc._conf.get("spark.executor.memory")
and that returned "4g".
Why do these two return different values, and what's the correct way to set these configurations?
Also, I am fiddling with a bunch of parameters like
"spark.executor.instances"
"spark.executor.cores"
"spark.executor.memory"
"spark.executor.memoryOverhead"
"spark.driver.memory"
"spark.driver.cores"
"spark.driver.memoryOverhead"
"spark.memory.offHeap.size"
"spark.memory.fraction"
"spark.task.cpus"
"spark.memory.offHeap.enabled "
"spark.rpc.io.serverThreads"
"spark.shuffle.file.buffer"
Is there a way to set the configuration for all of these variables at once?
EDIT
I need to set the configuration programmatically. How do I change it after I have run spark-submit or started the pyspark shell? I am trying to reduce the runtime of my jobs, for which I am going through multiple iterations of changing the spark configuration and recording the runtimes.
You can set environment variables (e.g. in spark-env.sh; standalone mode only):
SPARK_EXECUTOR_MEMORY=16g
You can also set it in spark-defaults.conf:
spark.executor.memory=16g
But these solutions are hardcoded and pretty much static, while you want different parameters for different jobs; they are still useful for setting up defaults, though.
The best approach is to use spark-submit:
spark-submit --executor-memory 16G
The problem with defining variables programmatically is that some of them need to be defined at startup time; otherwise, precedence rules take over and changes made after the job has started are ignored.
Edit:
The amount of memory per executor is looked up when SparkContext is created.
And
once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
See: SparkConf Documentation
Have you tried changing the variable before the SparkContext is created, then running your iteration, stopping your SparkContext and changing your variable to iterate again?
import org.apache.spark.{SparkContext, SparkConf}
val conf = new SparkConf().set("spark.executor.memory", "16g")
val sc = new SparkContext(conf)
...
sc.stop()  // stop the current context before creating a new one with different settings
val conf2 = new SparkConf().set("spark.executor.memory", "24g")
val sc2 = new SparkContext(conf2)
You can debug your configuration using: sc.getConf.toDebugString
See: Spark Configuration
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
You'll need to make sure that your variable is not defined with higher precedence.
Precedence order (lowest to highest):
conf/spark-defaults.conf
--conf or -c - the command-line option used by spark-submit
SparkConf
I hope this helps.
In Pyspark,
Suppose I want to increase the driver and executor memory in code. I can do it as below:
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '23g'), ('spark.driver.memory', '10g')])
To view the updated settings:
spark.sparkContext._conf.getAll()
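Since, as noted above, most of these values are only read when the SparkContext is created, a minimal sketch (with hypothetical values) that applies a batch of settings by stopping the current session and rebuilding it:

from pyspark import SparkConf
from pyspark.sql import SparkSession

spark.stop()  # stop the running session so the new settings can take effect
conf = SparkConf().setAll([
    ('spark.executor.memory', '16g'),   # hypothetical values
    ('spark.executor.instances', '4'),
    ('spark.driver.memory', '8g'),
])
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print(spark.sparkContext.getConf().get('spark.executor.memory'))  # expect '16g'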

Running Hive on spark

Trying to run Hive on Spark using the properties below. I tried tweaking some other properties as well, like the number of executor instances and spark.master, but it throws the error "FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client."
It runs fine when spark.master is set to local.
The job is not getting instantiated. Any inputs?
set hive.execution.engine=spark;
set spark.executor.memory=2g;
set yarn.scheduler.maximum-allocation-mb=8192;
set yarn.nodemanager.resource.memory-mb=40960;
set spark.executor.cores=4;
set spark.executor.memory=4g;
set spark.yarn.executor.memoryOverhead=750;
set hive.spark.client.server.connect.timeout=900000ms;
set yarn.nodemanager.resource.memory-mb=2048;
Hive on Spark will not work for Hive versions < 2.1.

Get data from subfolders of an unpartitioned hive table into a dataframe in spark

There is an external table in Hive pointing to an s3 location that is not partitioned. The table points to a folder in s3, but the data is in multiple subfolders inside that folder.
This table can be queried, even though it is not partitioned, by setting a few properties in Hive like below:
set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
However, when the same table is used in Spark to load the data into a dataframe using a SQL statement like df = sqlContext.sql("select * from table_name"), the action fails saying 'The subfolders in the external s3 location is not a file'.
I tried setting the above Hive properties in Spark using the sc.hadoopConfiguration.set("mapred.input.dir.recursive","true") method, but it did not help. It looks like this would only help for sc.textFile-style loading.
This can be achieved by setting the following property in Spark:
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true")
Note here that the property is set using sqlContext instead of sparkContext.
I tested this in Spark 1.6.2.
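Putting it together, a minimal sketch of how the setting above is used before running the query (Spark 1.6-style sqlContext; table_name stands in for the asker's external table):

# set the recursive-input property first, then query the unpartitioned table
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")
df = sqlContext.sql("select * from table_name")
df.show()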
