How to execute Spark programs with Dynamic Resource Allocation? - apache-spark

I am using the spark-submit command to execute Spark jobs with parameters such as:
spark-submit --master yarn-cluster --driver-cores 2 \
--driver-memory 2G --num-executors 10 \
--executor-cores 5 --executor-memory 2G \
--class com.spark.sql.jdbc.SparkDFtoOracle2 \
Spark-hive-sql-Dataframe-0.0.1-SNAPSHOT-jar-with-dependencies.jar
Now I want to execute the same program using Spark's dynamic resource allocation. Could you please help with how to use dynamic resource allocation when submitting Spark programs?

For Spark dynamic allocation, spark.dynamicAllocation.enabled needs to be set to true, because it is false by default.
This in turn requires spark.shuffle.service.enabled to be set to true, since the application runs on YARN. Check this link for how to start the shuffle service on each NodeManager in YARN.
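Per the Spark-on-YARN documentation, starting the external shuffle service means adding the spark-&lt;version&gt;-yarn-shuffle.jar to each NodeManager's classpath, adding roughly the following to yarn-site.xml, and restarting all NodeManagers (a sketch only; check the docs matching your Spark version):

```xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```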
The following configurations are also relevant:
spark.dynamicAllocation.minExecutors,
spark.dynamicAllocation.maxExecutors, and
spark.dynamicAllocation.initialExecutors
These options can be set for a Spark application in three ways:
1. From Spark submit with --conf <prop_name>=<prop_value>
spark-submit --master yarn-cluster \
--driver-cores 2 \
--driver-memory 2G \
--num-executors 10 \
--executor-cores 5 \
--executor-memory 2G \
--conf spark.dynamicAllocation.minExecutors=5 \
--conf spark.dynamicAllocation.maxExecutors=30 \
--conf spark.dynamicAllocation.initialExecutors=10 \
--class com.spark.sql.jdbc.SparkDFtoOracle2 \
Spark-hive-sql-Dataframe-0.0.1-SNAPSHOT-jar-with-dependencies.jar
(Here spark.dynamicAllocation.initialExecutors=10 plays the same role as --num-executors 10.)
2. Inside the Spark program with SparkConf
Set the properties on a SparkConf, then create the SparkSession or SparkContext from it:
val conf: SparkConf = new SparkConf()
conf.set("spark.dynamicAllocation.minExecutors", "5")
conf.set("spark.dynamicAllocation.maxExecutors", "30")
conf.set("spark.dynamicAllocation.initialExecutors", "10")
// ... e.g. SparkSession.builder().config(conf).getOrCreate()
3. spark-defaults.conf, usually located in $SPARK_HOME/conf/
Place the same settings in spark-defaults.conf to apply them to all Spark applications whenever they are not overridden on the command line or in code.
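For example, a spark-defaults.conf carrying the same illustrative values as above might look like this (note the precedence: properties set in code via SparkConf override spark-submit --conf flags, which in turn override spark-defaults.conf):

```
spark.dynamicAllocation.enabled          true
spark.shuffle.service.enabled            true
spark.dynamicAllocation.minExecutors     5
spark.dynamicAllocation.maxExecutors     30
spark.dynamicAllocation.initialExecutors 10
```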
Spark - Dynamic Allocation Confs

I just did a small demo with Spark's dynamic resource allocation. The code is on my Github. Specifically, the demo is in this release.

Related

Spark fail if not all resources are allocated

Does Spark or YARN have any flag to fail the job fast if we can't allocate all resources?
For example, if I run
spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-client \
--num-executors 7 \
--driver-memory 512m \
--executor-memory 4g \
--executor-cores 1 \
/usr/hdp/current/spark2-client/examples/jars/spark-examples_*.jar 1000
For now, if Spark can allocate only 5 executors, it will just go with 5. Can we make it run only with all 7, or fail otherwise?
You can set the spark.dynamicAllocation.minExecutors config in your job. For it to work you also need to set spark.dynamicAllocation.enabled=true, as detailed in this doc.
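Spark itself has no fail-fast flag for this, but one workaround is to poll the registered-executor count from the driver after startup and abort if the target isn't reached in time. A minimal pure-Python sketch of that logic; get_executor_count is a hypothetical callable standing in for however you actually read the count (e.g. the Spark REST API):

```python
import time

def wait_for_executors(get_executor_count, required, timeout_s, poll_s=1.0):
    """Poll until `required` executors are registered or `timeout_s` elapses.

    Returns the final count on success; raises RuntimeError on timeout.
    `get_executor_count` is any zero-argument callable returning an int.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        count = get_executor_count()
        if count >= required:
            return count
        if time.monotonic() >= deadline:
            raise RuntimeError(
                f"only {count}/{required} executors allocated within {timeout_s}s"
            )
        time.sleep(poll_s)

# Simulated usage: the count stalls at 5, so requiring 7 fails fast.
try:
    wait_for_executors(lambda: 5, required=7, timeout_s=0.2, poll_s=0.05)
except RuntimeError as e:
    print(e)  # only 5/7 executors allocated within 0.2s
```

The same shape works for the reverse policy (kill the job only after a grace period) by tuning timeout_s.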

Spark-submit with properties file for python

I am trying to load the Spark configuration through a properties file, using the following command on Spark 2.4.0:
spark-submit --properties-file props.conf sample.py
It fails with the following error:
org.apache.spark.SparkException: Dynamic allocation of executors requires the external shuffle service. You may enable this through spark.shuffle.service.enabled.
The props.conf file has this
spark.master yarn
spark.submit.deployMode client
spark.authenticate true
spark.sql.crossJoin.enabled true
spark.dynamicAllocation.enabled true
spark.driver.memory 4g
spark.driver.memoryOverhead 2048
spark.executor.memory 2g
spark.executor.memoryOverhead 2048
Now, when I try to run the same by adding all arguments to the command itself, it works fine.
spark2-submit \
--conf spark.master=yarn \
--conf spark.submit.deployMode=client \
--conf spark.authenticate=true \
--conf spark.sql.crossJoin.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.driver.memory=4g \
--conf spark.driver.memoryOverhead=2048 \
--conf spark.executor.memory=2g \
--conf spark.executor.memoryOverhead=2048 \
sample.py
This works as expected.
spark-submit does support --properties-file, but when one is supplied it is read instead of $SPARK_HOME/conf/spark-defaults.conf rather than merged with it, so cluster-wide defaults such as spark.shuffle.service.enabled=true are dropped, which is most likely why dynamic allocation then fails. One workaround is to make the change in $SPARK_HOME/conf/spark-defaults.conf itself, which Spark loads automatically.
You can refer to https://spark.apache.org/docs/latest/submitting-applications.html#loading-configuration-from-a-file
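A toy model of the documented loading rule in Python (the dicts are illustrative stand-ins for the parsed files, not real file handling), showing how a --properties-file can silently drop the shuffle-service setting:

```python
def effective_conf(cli_conf, properties_file=None, spark_defaults=None):
    """Mimic spark-submit's config loading: --properties-file *replaces*
    spark-defaults.conf entirely; --conf flags override file values."""
    base = properties_file if properties_file is not None else (spark_defaults or {})
    return {**base, **cli_conf}

spark_defaults = {"spark.shuffle.service.enabled": "true"}  # cluster-wide file
props_conf = {"spark.dynamicAllocation.enabled": "true"}    # user's props.conf

# With --properties-file the shuffle-service setting is gone -> the error above.
merged = effective_conf({}, properties_file=props_conf, spark_defaults=spark_defaults)
print("spark.shuffle.service.enabled" in merged)  # False
```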

How does spark-submit.sh work with different modes and different cluster managers?

In Apache Spark, how does spark-submit.sh work with different modes and different cluster managers? Specifically:
In local deployment mode,
does spark-submit.sh skip calling any cluster manager?
Is it correct that there is no need to install a cluster manager on the local machine?
In client or cluster deployment mode,
Does spark-submit.sh work with different cluster managers (Spark standalone, YARN, Mesos, Kubernetes)? Do different cluster managers have different interfaces, so that spark-submit.sh has to invoke them in different ways?
Does spark-submit.sh present the same interface to programmers apart from --master, the option used to specify the cluster manager?
Thanks.
To make things clear: in local mode there is no need for a cluster manager at all; Spark runs the driver and executors inside a single JVM, so nothing extra has to be installed on the local machine. In client or cluster deploy mode, spark-submit delegates resource allocation to whichever cluster manager --master points at (Spark standalone, YARN, Mesos, or Kubernetes), but the spark-submit interface itself stays the same.
So the spark-submit command does not need a cluster manager present just to run.
The different ways in which you can use the command are:
1) local mode:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100
2) client mode with the Spark standalone cluster manager (standalone mode is itself a simple cluster manager shipped with Spark):
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
3) cluster mode with the Spark standalone cluster manager:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--deploy-mode cluster \
--supervise \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
4) Client/cluster mode with a resource manager such as YARN (pass --deploy-mode client for client mode):
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
1000
As you can see above, spark-submit.sh behaves in the same way whether or not a separate cluster manager is involved; if you use a resource manager such as YARN or Mesos, only --master (and a few manager-specific options) change.
You can read more about spark-submit here.

Spark: Entire dataset concentrated in one executor

I am running a Spark job over 3 files of 100 MB each, yet for some reason the Spark UI shows the entire dataset concentrated in 2 executors. This makes the job run for 19 hours, and it is still running.
Below is my Spark configuration (the version used is Spark 2.3):
spark2-submit --class org.mySparkDriver \
--master yarn-cluster \
--deploy-mode cluster \
--driver-memory 8g \
--num-executors 100 \
--conf spark.default.parallelism=40 \
--conf spark.yarn.executor.memoryOverhead=6000mb \
--conf spark.dynamicAllocation.executorIdleTimeout=6000s \
--conf spark.executor.cores=3 \
--conf spark.executor.memory=8G \
I tried repartitioning inside the code, which works, as it makes the input go into 20 partitions (I used rdd.repartition(20)). But why should I have to repartition? I believed that specifying spark.default.parallelism=40 in the script should make Spark divide the input file into 40 partitions and process it on 40 executors.
Can anyone help?
Thanks,
Neethu
I am assuming you're running your jobs on YARN. If yes, you can check the following properties:
yarn.scheduler.maximum-allocation-mb
yarn.nodemanager.resource.memory-mb
yarn.scheduler.maximum-allocation-vcores
yarn.nodemanager.resource.cpu-vcores
In YARN these properties bound the number of containers that can be instantiated on a NodeManager, given the spark.executor.cores and spark.executor.memory values (along with the executor memory overhead).
For example, take a cluster with 10 nodes (RAM: 16 GB, cores: 6 each) set with the following YARN properties:
yarn.scheduler.maximum-allocation-mb=10GB
yarn.nodemanager.resource.memory-mb=10GB
yarn.scheduler.maximum-allocation-vcores=4
yarn.nodemanager.resource.cpu-vcores=4
Then with the Spark properties spark.executor.cores=2 and spark.executor.memory=4GB you can expect 2 executors per node, so in total you'll get 19 executors + 1 container for the driver.
If the Spark properties are spark.executor.cores=3 and spark.executor.memory=8GB, then you will get 9 executors (only 1 executor per node) + 1 container for the driver.
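The arithmetic above can be sketched as a small helper; this is an approximation (hypothetical functions, ignoring YARN's rounding to yarn.scheduler.minimum-allocation-mb and the memory-overhead details):

```python
def executors_per_node(node_mem_gb, node_vcores, exec_mem_gb, exec_cores):
    """Executors one NodeManager can host: the tighter of the
    memory and vcore limits (simplified; real YARN also rounds allocations)."""
    return min(node_mem_gb // exec_mem_gb, node_vcores // exec_cores)

def cluster_executors(nodes, per_node):
    """Total executors, reserving one container for the driver."""
    return nodes * per_node - 1

# 10 nodes capped at 10 GB / 4 vcores each, as in the example above:
per_node = executors_per_node(10, 4, exec_mem_gb=4, exec_cores=2)
print(per_node, cluster_executors(10, per_node))  # 2 19

per_node = executors_per_node(10, 4, exec_mem_gb=8, exec_cores=3)
print(per_node, cluster_executors(10, per_node))  # 1 9
```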
You can refer to this link for more details.
Hope this helps.

SparkConf not reading spark-submit arguments

SparkConf in PySpark does not pick up the configuration arguments passed to spark-submit.
My python code is something like
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("foo")
sc = SparkContext(conf=conf)
# processing code...
sc.stop()
and I submit it with
PYSPARK_PYTHON="/opt/anaconda/bin/python" spark-submit foo.py \
--master local[4] --conf="spark.driver.memory=16g" --executor-memory 16g
but none of the configuration arguments are applied; that is, the application executes with the default values of local[*] for master, 1g for driver memory, and 1g for executor memory. This was confirmed in the Spark UI.
However, the configuration arguments are followed if I use pyspark to submit the application:
PYSPARK_PYTHON="/opt/anaconda/bin/python" pyspark --master local[4] \
--conf="spark.driver.memory=8g"
Notice that --executor-memory 16g was also changed to --conf="spark.executor.memory=16g" because the former doesn't work either.
What am I doing wrong?
I believe you need to remove the = sign from --conf=. Your spark-submit script should be
PYSPARK_PYTHON="/opt/anaconda/bin/python" spark-submit foo.py \
--master local[4] --conf spark.driver.memory=16g --executor-memory 16g
Note that spark-submit also supports setting driver memory with the flag --driver-memory 16G
Apparently, the order of the arguments matters: the last argument must be the name of the Python script, because spark-submit treats everything after the application file as arguments to the application rather than as its own options. So the call should be
PYSPARK_PYTHON="/opt/anaconda/bin/python" spark-submit \
--master local[4] --conf="spark.driver.memory=16g" --executor-memory 16g foo.py
or, following #glennie-helles-sindholt's advice,
PYSPARK_PYTHON="/opt/anaconda/bin/python" spark-submit \
--master local[4] --driver-memory 16g --executor-memory 16g foo.py
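The underlying rule: spark-submit stops parsing its own options at the first positional argument (the application jar or .py file), and everything after it is passed to the application untouched. A toy illustration of that split (simplified; real spark-submit also knows which options take no value):

```python
def split_submit_args(args):
    """Split a spark-submit arg list into (spark_opts, app, app_args).
    Simplified: assumes every --option consumes exactly one value."""
    spark_opts, i = [], 0
    while i < len(args) and args[i].startswith("--"):
        spark_opts.extend(args[i:i + 2])  # option plus its value
        i += 2
    app = args[i] if i < len(args) else None
    return spark_opts, app, args[i + 1:]

# Wrong order: foo.py first, so every flag becomes an *application* argument.
print(split_submit_args(["foo.py", "--master", "local[4]"]))
# -> ([], 'foo.py', ['--master', 'local[4]'])

# Right order: the flags are consumed by spark-submit, then the script runs.
print(split_submit_args(["--master", "local[4]", "foo.py"]))
# -> (['--master', 'local[4]'], 'foo.py', [])
```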
