spark scala memory management issues - apache-spark

I am trying to submit a Spark Scala job with the configuration below:
spark-submit --class abcd --queue new --master yarn --executor-cores 1 --executor-memory 4g --driver-memory 2g --num-executors 1
The allocated capacity for the queue is 700 GB, and the job takes the entire 700 GB while running.
Is there a way to restrict it to 100 GB only?
Thanks in advance.
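One thing worth trying (a hedged sketch, assuming dynamic allocation is what is scaling the job past the single 4 GB executor requested): cap the number of executors the job may grow to, mirroring the command from the question:
spark-submit --class abcd --queue new --master yarn \
  --executor-cores 1 --executor-memory 4g --driver-memory 2g \
  --num-executors 1 \
  --conf spark.dynamicAllocation.maxExecutors=25
With 4 GB executors, roughly 25 of them keep the job near 100 GB (YARN's per-container memory overhead pushes it somewhat higher), so adjust maxExecutors to the ceiling you want; the answers further down cover disabling dynamic allocation entirely.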

Related

Spark Submit Command Configurations

Can someone please help me understand the spark-submit configuration below in detail? What is the use of each setting, and how exactly does Spark use these settings for better performance?
--num-executors 3
--master yarn
--deploy-mode cluster
--driver-cores 3
--driver-memory 1G
--executor-cores 3
--executor-memory 1G
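In broad terms (a rough annotation of the flags above, not tuning advice):
--num-executors 3      : ask YARN for 3 executor JVMs (with dynamic allocation enabled this becomes the initial count instead)
--master yarn          : submit the application to a YARN cluster
--deploy-mode cluster  : run the driver inside a YARN container instead of on the client machine
--driver-cores 3       : CPU cores for the driver (only takes effect in cluster deploy mode)
--driver-memory 1G     : heap size for the driver process
--executor-cores 3     : concurrent task slots (cores) per executor
--executor-memory 1G   : heap size per executor (YARN adds a per-container memory overhead on top of this)
How these translate into performance depends on the workload; the usual trade-off is fewer, larger executors versus more, smaller ones.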

Why does spark-shell/pyspark get fewer resources compared to spark-submit?

I am running PySpark code which runs on a single node due to code requirements.
But when I run the code using PySpark
pyspark --master yarn --executor-cores 1 --driver-memory 60g --executor-memory 5g --conf spark.driver.maxResultSize=4g
I get a segment error and the code fails. When I checked in YARN, I saw that my job was not getting resources even when they were available.
But when I use spark-submit
spark-submit --master yarn --executor-cores 1 --driver-memory 60g --executor-memory 5g --conf spark.driver.maxResultSize=4g code.py
the job gets all the resources it needs and runs perfectly.
Am I missing some fundamental aspect of Spark, or why is this happening?

Spark fail if not all resources are allocated

Does Spark or YARN have any flag to fail the job fast if we can't allocate all the resources?
For example, if I run
spark-submit --class org.apache.spark.examples.SparkPi
--master yarn-client
--num-executors 7
--driver-memory 512m
--executor-memory 4g
--executor-cores 1
/usr/hdp/current/spark2-client/examples/jars/spark-examples_*.jar 1000
For now, if Spark can allocate only 5 executors, it will just go with 5. Can we make it run only with 7, or fail otherwise?
You can set the spark.dynamicAllocation.minExecutors config in your job. For that, you need to set spark.dynamicAllocation.enabled=true, as detailed in this doc.
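A minimal sketch of that suggestion, reusing the command from the question (note that, as far as I know, minExecutors is a floor the allocator aims to keep rather than a hard fail-fast guarantee, and on most Spark-on-YARN setups dynamic allocation also needs the external shuffle service):
spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-client \
  --driver-memory 512m \
  --executor-memory 4g \
  --executor-cores 1 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=7 \
  /usr/hdp/current/spark2-client/examples/jars/spark-examples_*.jar 1000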

Does spark-submit parameters work in local mode?

When I run spark-submit --master local[10] --num-executors 8 --executor-cores 5 --executor-memory 5g foo.jar, which means I am running an application in local mode, will --num-executors 8 --executor-cores 5 --executor-memory 5g work together with local[10]? If not, which parameters will decide the resource allocation?
In other words, do --num-executors 8, --executor-cores 5 and --executor-memory 5g only work on YARN? In local mode, does only local[K] matter?
No, the spark-submit parameters num-executors, executor-cores and executor-memory won't work in local mode, because they apply when you deploy your Spark job on a cluster rather than on a single machine; they only take effect when you run the job in client or cluster mode.
Please refer here for more information on the different ways to submit a Spark application.
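To illustrate that in a hedged way: in local mode the whole application runs inside a single JVM (the driver), so the thread count in local[K] and the driver memory are what effectively size the run, e.g.:
spark-submit --master local[10] --driver-memory 5g foo.jar
Here local[10] gives 10 worker threads and --driver-memory 5g sizes that one JVM; --num-executors, --executor-cores and --executor-memory are simply not consulted.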

Over utilization of yarn resources with spark

I have an EMR cluster with the configuration below.
Data Nodes : 6
RAM per Node : 56 GB
Cores per Node: 32
Instance Type: M4*4xLarge
I am running the spark-sql commands below to execute 5 Hive scripts in parallel.
spark-sql --master yarn --num-executors 1 --executor-memory 20G --executor-cores 20 --driver-memory 4G -f hive1.hql &
spark-sql --master yarn --num-executors 1 --executor-memory 20G --executor-cores 20 --driver-memory 4G -f hive2.hql &
spark-sql --master yarn --num-executors 1 --executor-memory 20G --executor-cores 20 --driver-memory 4G -f hive3.hql &
spark-sql --master yarn --num-executors 1 --executor-memory 20G --executor-cores 20 --driver-memory 4G -f hive4.hql &
spark-sql --master yarn --num-executors 1 --executor-memory 20G --executor-cores 20 --driver-memory 4G -f hive5.hql
But 270 GB of memory is being used by YARN.
As per the parameters in the given command, the jobs should use only 120 GB of RAM in total:
1 × 20 + 4 = 24 GB RAM per job
5 jobs = 5 × 24 = 120 GB
So why is YARN using 270 GB of RAM? (No other Hadoop jobs are running in the cluster.)
Do I need to include any extra parameters to limit YARN resource utilization?
Set "spark.dynamicAllocation.enabled" to false in spark-defaults.conf (../../spark/spark-x.x.x/conf/spark-defaults.conf).
This should help you limit/avoid dynamic allocation of resources.
Even though we set executor memory in the command, Spark allocates resources dynamically if they are available in the cluster. To restrict usage to the executor memory you requested, the dynamic allocation parameter should be set to false.
You can change it directly in the Spark config file or pass it as a config parameter to the command.
spark-sql --master yarn --num-executors 1 --executor-memory 20G --executor-cores 20 --driver-memory 4G --conf spark.dynamicAllocation.enabled=false -f hive1.hql
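Equivalently, the same setting can be made cluster-wide in spark-defaults.conf so every job picks it up (a sketch, assuming a standard conf layout):
spark.dynamicAllocation.enabled  false
With dynamic allocation off, each spark-sql invocation should stay much closer to the 1 executor × 20 GB plus 4 GB driver that the flags request (YARN still adds a per-container memory overhead on top).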
