Spark fail if not all resources are allocated - apache-spark

Does Spark or YARN have any flag to fail a job fast if we can't allocate all of the requested resources?
For example, if I run
spark-submit --class org.apache.spark.examples.SparkPi
--master yarn-client
--num-executors 7
--driver-memory 512m
--executor-memory 4g
--executor-cores 1
/usr/hdp/current/spark2-client/examples/jars/spark-examples_*.jar 1000
Right now, if Spark can allocate only 5 executors, it just runs with 5. Can we make it run only with all 7, or fail otherwise?

You can set the spark.dynamicAllocation.minExecutors config in your job. For that you also need to set spark.dynamicAllocation.enabled=true, as detailed in the Spark dynamic allocation documentation.
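For example, a submit command along those lines might look like this (a sketch only: dynamic allocation on YARN normally also needs the external shuffle service, and whether the job actually fails rather than waits when 7 executors cannot be granted depends on your cluster setup):
spark-submit --class org.apache.spark.examples.SparkPi
--master yarn-client
--conf spark.dynamicAllocation.enabled=true
--conf spark.dynamicAllocation.minExecutors=7
--conf spark.shuffle.service.enabled=true
--driver-memory 512m
--executor-memory 4g
--executor-cores 1
/usr/hdp/current/spark2-client/examples/jars/spark-examples_*.jar 1000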

Related

Spark Submit Command Configurations

Can someone please help me understand the below spark-submit configuration in detail? What is the use of each configuration, and how exactly does Spark utilize these configurations for better performance?
--num-executors 3
--master yarn
--deploy-mode cluster
--driver-cores 3
--driver-memory 1G
--executor-cores 3
--executor-memory 1G
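For reference, a brief gloss of what each flag controls (my summary of standard spark-submit semantics, not an answer from the original thread):
--num-executors 3       number of executor processes YARN launches for the application
--master yarn           use YARN as the cluster manager
--deploy-mode cluster   run the driver inside the cluster (in the YARN ApplicationMaster) instead of on the submitting machine
--driver-cores 3        CPU cores given to the driver (honoured in cluster deploy mode)
--driver-memory 1G      heap memory for the driver process
--executor-cores 3      concurrent task slots per executor
--executor-memory 1G    heap memory per executor process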

Do spark-submit parameters work in local mode?

When I run spark-submit --master local[10] --num-executors 8 --executor-cores 5 --executor-memory 5g foo.jar, which means I am running the application in local mode, do --num-executors 8 --executor-cores 5 --executor-memory 5g work together with local[10]? If not, which parameters decide the resource allocation?
In other words, do --num-executors 8 --executor-cores 5 --executor-memory 5g only work on YARN? In local mode, does only local[K] matter?
No, the spark-submit parameters num-executors, executor-cores and executor-memory won't work in local mode. These parameters apply when you deploy your Spark job on a cluster rather than a single machine; they only take effect when you run the job in client or cluster mode against a cluster manager.
Please refer to the Spark documentation on submitting applications for more information on the different ways to submit a Spark application.
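In local mode the driver and executor run in one JVM, so only the thread count in local[K] and the driver settings apply; a minimal sketch (the 5g value is just an illustrative assumption):
spark-submit --master local[10] --driver-memory 5g foo.jar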

Spark Scala memory management issues

I am trying to submit a Spark Scala job with the below configuration:
spark-submit --class abcd --queue new --master yarn --executor-cores 1 --executor-memory 4g --driver-memory 2g --num-executors 1
The allocated capacity for the queue is 700 GB, and the job takes up the entire 700 GB while running.
Is there a way to restrict it to 100 GB only?
Thanks in advance.
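If the culprit is dynamic allocation (the same behaviour discussed in the answer to the over-utilization question further down this page; the question above does not confirm it), a sketch of disabling it for this submit would be:
spark-submit --class abcd --queue new --master yarn --executor-cores 1 --executor-memory 4g --driver-memory 2g --num-executors 1 --conf spark.dynamicAllocation.enabled=false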

Over-utilization of YARN resources with Spark

I have an EMR cluster with the below configuration.
Data Nodes : 6
RAM per Node : 56 GB
Cores per Node: 32
Instance Type: M4*4xLarge
I am running the below spark-sql commands to execute 5 Hive scripts in parallel.
spark-sql --master yarn --num-executors 1 --executor-memory 20G --executor-cores 20 --driver-memory 4G -f hive1.hql &
spark-sql --master yarn --num-executors 1 --executor-memory 20G --executor-cores 20 --driver-memory 4G -f hive2.hql &
spark-sql --master yarn --num-executors 1 --executor-memory 20G --executor-cores 20 --driver-memory 4G -f hive3.hql &
spark-sql --master yarn --num-executors 1 --executor-memory 20G --executor-cores 20 --driver-memory 4G -f hive4.hql &
spark-sql --master yarn --num-executors 1 --executor-memory 20G --executor-cores 20 --driver-memory 4G -f hive5.hql
But 270 GB of memory is being utilized by YARN.
As per the parameters in the given command, the 5 jobs together should utilize only 120 GB of RAM:
1 executor * 20 GB + 4 GB driver = 24 GB per job
5 jobs = 5 * 24 = 120 GB
So why is YARN utilizing 270 GB of RAM? (No other Hadoop jobs are running in the cluster.)
Do I need to include any extra parameters to limit YARN resource utilization?
Set spark.dynamicAllocation.enabled to false in spark-defaults.conf (../../spark/spark-x.x.x/conf/spark-defaults.conf).
This should help you limit/avoid dynamic allocation of resources.
Even though we set the executor memory on the command line, Spark allocates executors dynamically as long as resources are available in the cluster. To restrict usage to just the requested executor memory, the dynamic allocation parameter should be set to false.
You can change it directly in the Spark config file or pass it as a config parameter on the command line:
spark-sql --master yarn --num-executors 1 --executor-memory 20G --executor-cores 20 --driver-memory 4G --conf spark.dynamicAllocation.enabled=false -f hive1.hql
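If you go the config-file route instead, the corresponding spark-defaults.conf entry would look roughly like this (the maxExecutors line is an optional alternative I am adding, not part of the original answer):
spark.dynamicAllocation.enabled    false
# or, to keep dynamic allocation but cap how far it can scale:
# spark.dynamicAllocation.maxExecutors    10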

Show number of executors and executor memory

I am running a PySpark job using the command
spark-submit ./exp-1.py --num-executors 8 --executor-memory 4G
Is there a way to confirm that these configurations are actually reflected during execution?
spark-submit has a --verbose flag that prints the resolved configuration when the job is submitted. Note that the options have to come before the application file; anything placed after ./exp-1.py is passed to the script as application arguments rather than to spark-submit:
spark-submit --verbose --num-executors 8 --executor-memory 4G ./exp-1.py
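If you also want the effective configuration echoed in the driver logs once the SparkContext starts, spark.logConf is a standard Spark option you could add on top (my suggestion, not part of the original answer):
spark-submit --verbose --conf spark.logConf=true --num-executors 8 --executor-memory 4G ./exp-1.py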
