Spark executor configuration priority - apache-spark

I saw a Spark submit command with following parameters
spark-submit --class ${my_class} \
--master yarn \
--deploy-mode cluster \
--executor-cores 2 \ <--- executor cores
--driver-cores 2\ <--- driver cores
--num-executors 12 \ <--- number of executors
--files hdfs:///blah.xml \
--conf spark.executor.instances=15 \ <--- number of executors again?
--conf spark.executor.cores=4 \ <--- driver cores again?
--conf spark.driver.cores=4 \ <--- executor cores again?
It seems like there can have multiple ways to set up core number and instance number for executor and driver node, just wondering, in above setting which way take priority and overwrite the other one? The -- parameter or conf parameter? Eventually how many cores and instances are given to the spark job?

Configuration are picked up depending upon the order of preference.
Priority wise , config defined in application through set() gets the highest priority.
Second priority is given to spark-submit parameters and then the next priority is given to default config parameters.
--executor-cores 2 \ <--- executor cores
--driver-cores 2\ <--- driver cores
--num-executors 12 \ <--- number of executors
The above configuration will take priority over --conf parameters as these properties are used to override the default conf priorities.

Related

Spark: Entire dataset concentrated in one executor

I am running a spark job with 3 files each of 100MB size, for some reason my spark UI shows all dataset concentrated into 2 executors.This is making the job run for 19 hrs and still running.
Below is my spark configuration . spark 2.3 is the version used.
spark2-submit --class org.mySparkDriver \
--master yarn-cluster \
--deploy-mode cluster \
--driver-memory 8g \
--num-executors 100 \
--conf spark.default.parallelism=40 \
--conf spark.yarn.executor.memoryOverhead=6000mb \
--conf spark.dynamicAllocation.executorIdleTimeout=6000s \
--conf spark.executor.cores=3 \
--conf spark.executor.memory=8G \
I tried repartitioning inside the code which works , as this makes the file go into 20 partitions (i used rdd.repartition(20)). But why should I repartition , i believe specifying spark.default.parallelism=40 in the script should let spark divide the input file to 40 executors and process the file in 40 executors.
Can anyone help.
Thanks,
Neethu
I am assuming you're running your jobs in YARN if yes, you can check following properties.
yarn.scheduler.maximum-allocation-mb
yarn.nodemanager.resource.memory-mb
yarn.scheduler.maximum-allocation-vcores
yarn.nodemanager.resource.cpu-vcores
In YARN these properties would affect number of containers that can be instantiated in a NodeManager based on spark.executor.cores, spark.executor.memory property values (along with executor memory overhead)
For example, if a cluster with 10 nodes (RAM : 16 GB, cores : 6) and set with following yarn properties
yarn.scheduler.maximum-allocation-mb=10GB
yarn.nodemanager.resource.memory-mb=10GB
yarn.scheduler.maximum-allocation-vcores=4
yarn.nodemanager.resource.cpu-vcores=4
Then with spark properties spark.executor.cores=2, spark.executor.memory=4GB you can expect 2 Executors/Node so total you'll get 19 executors + 1 container for Driver
If the spark properties are spark.executor.cores=3, spark.executor.memory=8GB then you will get 9 Executor (only 1 Executor/Node) + 1 container for Driver
you can refer to link for more details
Hope this helps

Dynamic Allocation with spark streaming on yarn not scaling down executors

I'm using spark-streaming (spark version 2.2) on yarn cluster and am trying to enable dynamic core allocation for my application.
The number of executors scale up as required but once executors are assigned, they are not being scaled down even when the traffic is reduced, i.e, and executor once allocated is not being released. I have also enabled external shuffle service on yarn, as mentioned here:
https://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service
The configs that I have set in spark-submit command are:
--conf spark.dynamicAllocation.enabled=false \
--conf spark.streaming.dynamicAllocation.enabled=true \
--conf spark.streaming.dynamicAllocation.scalingInterval=30 \
--conf spark.shuffle.service.enabled=true \
--conf spark.streaming.dynamicAllocation.initialExecutors=15 \
--conf spark.streaming.dynamicAllocation.minExecutors=10 \
--conf spark.streaming.dynamicAllocation.maxExecutors=30 \
--conf spark.streaming.dynamicAllocation.executorIdleTimeout=60s \
--conf spark.streaming.dynamicAllocation.cachedExecutorIdleTimeout=60s \
Can someone help if there is any particular config that I'm missing ?
Thanks
The document added as part of this JIRA helped me out: https://issues.apache.org/jira/browse/SPARK-12133.
The key point to note was that the number of executors are scaled down when the ratio (batch processing time/batch duration) is less than 0.5(default value) which essentially means that the executors are idle for half of the time. The config that can be used to alter this default value is "spark.streaming.dynamicAllocation.scalingDownRatio"

Spark Standalone --total-executor-cores

Im using Spark 2.1.1 Standalone cluster,
Although I have 29 free cores in my cluster (Cores in use: 80 Total, 51 Used), when submitting new spark job with --total-executor-cores 16 this config is not taking affect and the job submitted only with 6 cores..
What am I missing?
(deleting checkpoints doesn't help)
Here is my spark-submit command:
PYSPARK_PYTHON="/usr/bin/python3.4"
PYSPARK_DRIVER_PYTHON="/usr/bin/python3.4" \
/opt/spark/spark-2.1.1-bin-hadoop2.7/bin/spark-submit \
--master spark://XXXX.XXXX:7077 \
--conf "spark.sql.shuffle.partitions=2001" \
--conf "spark.port.maxRetries=200" \
--conf "spark.executorEnv.PYTHONHASHSEED=0" \
--executor-memory 24G \
--total-executor-cores 16 \
--driver-memory 8G \
/home/XXXX/XXXX.py \
--spark_master "spark://XXXX.XXXX:7077" \
--topic "XXXX" \
--broker_list "XXXX" \
--hdfs_prefix "hdfs://XXXX"
My problem was the high number of memory I asked from spark (--executor-memory 24G) - spark tried to find worker nodes with 24G free memory and found only 2 nodes, each node had 3 free cores (that's why I saw only 6 cores).
When decreasing the number of memory to 8G, spark found the number of cores specified.

Spark - Capping the number of CPU cores or memory of slave servers

I am using Spark 2.1. This question is for use cases where some of Spark slave servers run other apps as well. Is there a way to tell the Spark Master server to to use only certain # of CPU cores or memory of a slave server ?
Thanks.
To limit the number of cores used by a spark job, you need to add the --total-executor-cores option into your spark-submit command. To limit the amount of memory used by each executor, use the --executor-memory option. For example:
spark-submit --total-executor-cores 10 \
--executor-memory 8g \
--class com.example.SparkJob \
SparkJob.jar
This also works with spark-shell
spark-shell --total-executor-cores 10 \
--executor-memory 8g

How to execute Spark programs with Dynamic Resource Allocation?

I am using spark-summit command for executing Spark jobs with parameters such as:
spark-submit --master yarn-cluster --driver-cores 2 \
--driver-memory 2G --num-executors 10 \
--executor-cores 5 --executor-memory 2G \
--class com.spark.sql.jdbc.SparkDFtoOracle2 \
Spark-hive-sql-Dataframe-0.0.1-SNAPSHOT-jar-with-dependencies.jar
Now i want to execute the same program using Spark's Dynamic Resource allocation. Could you please help with the usage of Dynamic Resource Allocation in executing Spark programs.
In Spark dynamic allocation spark.dynamicAllocation.enabled needs to be set to true because it's false by default.
This requires spark.shuffle.service.enabled to be set to true, as spark application is running on YARN. Check this link to start the shuffle service on each NodeManager in YARN.
The following configurations are also relevant:
spark.dynamicAllocation.minExecutors,
spark.dynamicAllocation.maxExecutors, and
spark.dynamicAllocation.initialExecutors
These options can be configured to Spark application in 3 ways
1. From Spark submit with --conf <prop_name>=<prop_value>
spark-submit --master yarn-cluster \
--driver-cores 2 \
--driver-memory 2G \
--num-executors 10 \
--executor-cores 5 \
--executor-memory 2G \
--conf spark.dynamicAllocation.minExecutors=5 \
--conf spark.dynamicAllocation.maxExecutors=30 \
--conf spark.dynamicAllocation.initialExecutors=10 \ # same as --num-executors 10
--class com.spark.sql.jdbc.SparkDFtoOracle2 \
Spark-hive-sql-Dataframe-0.0.1-SNAPSHOT-jar-with-dependencies.jar
2. Inside Spark program with SparkConf
Set the properties in SparkConf then create SparkSession or SparkContext with it
val conf: SparkConf = new SparkConf()
conf.set("spark.dynamicAllocation.minExecutors", "5");
conf.set("spark.dynamicAllocation.maxExecutors", "30");
conf.set("spark.dynamicAllocation.initialExecutors", "10");
.....
3. spark-defaults.conf usually located in $SPARK_HOME/conf/
Place the same configurations in spark-defaults.conf to apply for all spark applications if no configuration is passed from command-line as well as code.
Spark - Dynamic Allocation Confs
I just did a small demo with Spark's dynamic resource allocation. The code is on my Github. Specifically, the demo is in this release.

Resources