Spark set driver memory config in Databricks - apache-spark

I'm working on Azure Databricks. My driver node and worker node specs are: 14.0 GB Memory, 4 Cores, 0.75 DBU (Standard_DS3_v2).
My pyspark notebook fails with a Java heap space error. I checked online and one suggestion was to increase driver memory. I'm trying to use the following conf parameter in the notebook:
spark.conf.get("spark.driver.memory")
to get the driver memory, but my notebook cell fails with this error:
java.util.NoSuchElementException: spark.driver.memory
Any idea how to check driver memory and change its value?

You can set the Spark config when you set up your cluster on Databricks. When you create a cluster and expand the "Advanced Options" menu, you will see a "Spark Config" section. In this field you can set the configurations you want.
For more information you can always check the documentation page of Azure Databricks.

This should increase the memory available to the cluster when you hit the limit.
It's the same answer as above, but as text instead of a screenshot:
spark.executor.memory 19g
spark.driver.memory 19g
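One sanity check worth doing with values like these: the question's nodes are 14 GB Standard_DS3_v2 machines, so a requested size should be compared against the node's physical RAM. A small hypothetical helper (to_bytes is not part of Spark or Databricks, just an illustration):

```python
# Parse JVM-style memory strings like "19g" or "4655m" and compare
# them against a node's physical RAM. Hypothetical helper, not a
# Spark or Databricks API.
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}

def to_bytes(mem: str) -> int:
    """Convert a JVM memory string such as '19g' or '4655m' to bytes."""
    mem = mem.strip().lower()
    if mem[-1] in UNITS:
        return int(mem[:-1]) * UNITS[mem[-1]]
    return int(mem)  # plain byte count

node_ram = to_bytes("14g")    # Standard_DS3_v2 from the question has 14 GB
requested = to_bytes("19g")   # the value suggested above
print(requested <= node_ram)  # False: 19g will not fit on a 14 GB node
```

So on the questioner's cluster, the 19g values above would need to be scaled down to fit the node size.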

Related

Spark Insufficient Memory

My Spark job fails with the following error:
java.lang.IllegalArgumentException: Required executor memory (33792 MB), offHeap memory (0) MB, overhead (8192 MB), and PySpark memory (0 MB)
is above the max threshold (24576 MB) of this cluster!
Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
I have defined executor memory to be 33g and executor memory overhead to be 8g. However, the total should be less than or equal to 24g as per the error log. Can someone help me understand what exactly 24g refers to? Is it the RAM on the master node or something else? Why is it capped at 24g?
Once I figure it out, I can programmatically calculate my other values to not run into this issue again.
Setup: Running make command which houses multiple spark-submit commands on Jenkins which launches it on an AWS EMR cluster running Spark 3.x
This error is happening because you're requesting more resources than are available on the cluster (see the org.apache.spark.deploy.yarn.Client source). For your case specifically (AWS EMR), I think you should check the value of yarn.nodemanager.resource.memory-mb as the message says (in yarn-site.xml or via the NodeManager web UI), and do not try to allocate more than this value per YARN container.
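Plugging the numbers from the error message into that check makes the failure obvious (a simplified illustration of the validation in org.apache.spark.deploy.yarn.Client, not Spark's actual code):

```python
def yarn_container_request(executor_mb, overhead_mb, offheap_mb=0, pyspark_mb=0):
    """Total memory Spark asks YARN for per executor container:
    executor memory + overhead + off-heap + PySpark memory.
    Simplified sketch of the check in org.apache.spark.deploy.yarn.Client."""
    return executor_mb + overhead_mb + offheap_mb + pyspark_mb

max_alloc_mb = 24576  # yarn.scheduler.maximum-allocation-mb from the error
required = yarn_container_request(executor_mb=33792, overhead_mb=8192)
print(required, required <= max_alloc_mb)  # 41984 False: request exceeds the cap
```

So the 24g cap is not the master's RAM; it is the largest single container YARN will hand out, and 33g + 8g overhead blows past it.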

Spark driver pod getting killed with 'OOMKilled' status

We are running a Spark Streaming application on a Kubernetes cluster using Spark 2.4.5.
The application is receiving massive amounts of data through a Kafka topic (one message every 3 ms). 4 executors and 4 Kafka partitions are being used.
While running, the memory of the driver pod keeps increasing until it is killed by K8s with an 'OOMKilled' status. The memory of the executors is not facing any issues.
When checking the driver pod resources using this command:
kubectl top pod podName
we can see that the memory increases until it reaches 1.4 GB, at which point the pod is killed.
However, when checking the storage memory of the driver in the Spark UI, we can see that the storage memory is not fully used (50.3 KB / 434 MB). Is there any difference between the storage memory of the driver and the memory of the pod containing the driver?
Has anyone had experience with a similar issue before?
Any help would be appreciated.
Here are a few more details about the app:
Kubernetes version : 1.18
Spark version : 2.4.5
Batch interval of spark streaming context : 5 sec
Rate of input data : 1 kafka message each 3 ms
Scala language
In brief, Spark memory consists of three parts:
Reserved memory (300 MB)
User memory ((all - 300 MB) * 0.4), used for data-processing logic.
Spark memory ((all - 300 MB) * 0.6, i.e. spark.memory.fraction), used for cache and shuffle in Spark.
Besides this, there is also max(executor memory * 0.1, 384 MB) (0.1 is spark.kubernetes.memoryOverheadFactor) of extra memory used by non-JVM memory in K8s.
Raising the pod memory limit by the memory overhead in K8s should fix the OOM.
You can also decrease spark.memory.fraction to allocate more RAM to user memory.
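The split described above can be written down as a quick calculation (a sketch of Spark's unified memory model with the default spark.memory.fraction of 0.6, not the exact internal accounting):

```python
RESERVED_MB = 300  # fixed reserved memory

def memory_regions(heap_mb, memory_fraction=0.6):
    """Split a JVM heap into the three regions described above:
    reserved (300 MB), user memory ((heap - 300) * (1 - fraction)),
    and Spark memory ((heap - 300) * fraction). Simplified sketch."""
    usable = heap_mb - RESERVED_MB
    return {
        "reserved_mb": RESERVED_MB,
        "user_mb": usable * (1 - memory_fraction),
        "spark_mb": usable * memory_fraction,
    }

# With the 1g default driver heap this gives roughly 434 MB of Spark
# memory, which matches the "434 MB" storage figure the question
# reports from the Spark UI.
regions = memory_regions(1024)
print(round(regions["spark_mb"]))  # 434
```

This also answers the question directly: the Spark UI's storage memory is only one slice of the heap, and the pod additionally holds user memory, reserved memory, and non-JVM overhead, so the pod can OOM while storage memory looks nearly empty.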

Cannot change spark.driver.memory even though is set in spark-defaults

I am using an Amazon EMR cluster and I want to change the driver heap memory.
I gave the following configuration to the cluster:
[{"classification": "spark-defaults",
  "properties": {
    "spark.executor.memory": "36g",
    "spark.driver.memory": "36g",
    "spark.driver.cores": "3",
    "spark.default.parallelism": "174",
    "spark.executor.cores": "3",
    "spark.executor.instances": "29",
    "spark.yarn.executor.memoryOverhead": "4g",
    "spark.dynamicAllocation.enabled": "false"}}]
However, when I check the Environment tab in the Spark UI, the driver memory remains fixed at 1000m while the rest is correctly set. What am I doing wrong?
I am using a cluster of 10 r4.4xlarge machines.
Thank you.
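One cheap thing to rule out first (just a sanity check, not a confirmed fix): verify that the classification JSON itself parses and contains the key you expect, since a classification that doesn't parse or uses the wrong classification name won't be applied and the driver stays at its default:

```python
import json

# The "configurations" JSON from the question, verbatim. json.loads is a
# quick way to catch syntax errors before launching a cluster.
config = json.loads("""
[{"classification": "spark-defaults",
  "properties": {
    "spark.executor.memory": "36g",
    "spark.driver.memory": "36g",
    "spark.driver.cores": "3",
    "spark.default.parallelism": "174",
    "spark.executor.cores": "3",
    "spark.executor.instances": "29",
    "spark.yarn.executor.memoryOverhead": "4g",
    "spark.dynamicAllocation.enabled": "false"}}]
""")

print(config[0]["properties"]["spark.driver.memory"])  # 36g
```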

emr-5.4.0 (Spark executors memory allocation issue)

I created a Spark cluster (for learning, so I did not create a high-memory/CPU cluster) with 1 master node and 2 core nodes to run executors, using the config below:
Master: 1 × m4.large (2 cores, 8 GB)
Core: 2 × c4.large (2 cores, 3.5 GB)
Hive 2.1.1, Pig 0.16.0, Hue 3.11.0, Spark 2.1.0, Sqoop 1.4.6, HBase 1.3.0
When pyspark is run, I get the error below:
Required executor memory (1024+384 MB) is above the max threshold (896 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
Before trying to increase the yarn-site.xml config, I'm curious to understand why EMR is taking just 896 MB as the limit when the master has 8 GB and each worker node has 3.5 GB.
Also, the Resource Manager URL (for the master: http://master-public-dns-name:8088/) is showing 1.75 GB, whereas the VM memory is 8 GB. Is HBase or other software taking up too much memory?
If anyone has encountered a similar issue, please share your insight into why EMR is setting such low defaults. Thanks!
Before trying to increase the yarn-site.xml config, I'm curious to understand why EMR is taking just 896 MB as the limit when the master has 8 GB and each worker node has 3.5 GB.
If you run Spark jobs in YARN cluster mode (which you probably were), the executors will run on the core nodes, and the master's memory will not be used.
Now, although your CORE EC2 instance (c4.large) has 3.75 GB to use, EMR configures YARN not to use all of this memory for running YARN containers or Spark executors. This is because you have to leave enough memory for other permanent daemons (like HDFS's DataNode, YARN's NodeManager, EMR's own daemons, etc., depending on the applications you provision).
EMR publishes the default YARN configuration it sets for all instance types on this page: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html
c4.large defaults:
Configuration Option                   Default Value
mapreduce.map.java.opts                -Xmx717m
mapreduce.map.memory.mb                896
yarn.scheduler.maximum-allocation-mb   1792
yarn.nodemanager.resource.memory-mb    1792
So yarn.nodemanager.resource.memory-mb = 1792, which means 1792 MB is the physical memory that will be allocated to YARN containers on a core node that has 3.75 GB of actual memory. Also check spark-defaults.conf, where EMR sets some defaults for Spark executor memory. These are defaults, and of course you can change them before starting the cluster using EMR's configurations API. But keep in mind that if you over-provision memory for YARN containers, you might starve some other processes.
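Running the numbers from the error against these defaults shows why the default pyspark session fails (the overhead rule max(384 MB, 10% of executor memory) is Spark's documented default on YARN; the function itself is only an illustration):

```python
def spark_on_yarn_fits(executor_mem_mb, max_alloc_mb,
                       overhead_factor=0.10, min_overhead_mb=384):
    """Sketch of the check behind the error above: Spark requests
    executor memory plus max(384 MB, 10% of executor memory) overhead,
    and the request must stay under yarn.scheduler.maximum-allocation-mb."""
    overhead = max(min_overhead_mb, int(executor_mem_mb * overhead_factor))
    return executor_mem_mb + overhead <= max_alloc_mb

# 1024 MB default executor memory + 384 MB overhead against the 896 MB
# cap the cluster reported, and against the documented c4.large limit.
print(spark_on_yarn_fits(1024, 896))   # False: 1408 MB > 896 MB
print(spark_on_yarn_fits(1024, 1792))  # True: 1408 MB fits under 1792 MB
```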
Given that, it is important to understand YARN configs and how Spark interacts with YARN.
https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
http://spark.apache.org/docs/latest/running-on-yarn.html
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
It's not really a property of EMR but rather of YARN, which is the resource manager running on EMR.
My personal take on YARN is that it is really built for managing long-running clusters that continuously take in a variety of jobs which they have to run simultaneously. In these cases it makes sense for YARN to only assign a small part of the available memory to each job.
Unfortunately, when it comes to special-purpose clusters (like: "I will just spin up a cluster, run my job and terminate the cluster again"), these YARN defaults are simply annoying, and you have to configure a bunch of stuff to make YARN utilise your resources optimally. But running on EMR, that's what we are stuck with these days, so one has to live with it...

How to update spark configuration after resizing worker nodes in Cloud Dataproc

I have a DataProc Spark cluster. Initially, the master and 2 worker nodes are of type n1-standard-4 (4 vCPU, 15.0 GB memory), then I resized all of them to n1-highmem-8 (8 vCPUs, 52 GB memory) via the web console.
I noticed that the two worker nodes are not being fully used. In particular, there are only 2 executors on the first worker node and 1 executor on the second, with
spark.executor.cores 2
spark.executor.memory 4655m
in /usr/lib/spark/conf/spark-defaults.conf. I thought that with spark.dynamicAllocation.enabled set to true, the number of executors would be increased automatically.
Also, the information on the Dataproc page of the web console doesn't get updated automatically either. It seems that Dataproc still thinks all nodes are n1-standard-4.
My questions are
Why are there more executors on the first worker node than on the second?
Why aren't more executors added to each node?
Ideally, I want the whole cluster to be fully utilized; if the Spark configuration needs to be updated, how?
As you've found, a cluster's configuration is set when the cluster is first created and does not adjust to manual resizing.
To answer your questions:
The Spark ApplicationMaster takes a container in YARN on a worker node, usually the first worker if only a single spark application is running.
When a cluster is started, Dataproc attempts to fit two YARN containers per machine.
The YARN NodeManager configuration on each machine determines how much of the machine's resources should be dedicated to YARN. This can be changed on each VM under /etc/hadoop/conf/yarn-site.xml, followed by a sudo service hadoop-yarn-nodemanager restart. Once machines are advertising more resources to the ResourceManager, Spark can start more containers. After adding more resources to YARN, you may want to modify the size of containers requested by Spark by modifying spark.executor.memory and spark.executor.cores.
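To estimate how many executors YARN can start per node after such a change, you can do the arithmetic the scheduler effectively does (a back-of-the-envelope sketch only; real scheduling also accounts for vcores and allocation rounding, and the 12 GiB NodeManager figure below is an illustrative assumption, not the exact Dataproc value):

```python
def executors_per_node(nodemanager_mb, executor_mb,
                       overhead_factor=0.10, min_overhead_mb=384):
    """How many executor containers fit in one NodeManager, given that
    each container needs executor memory plus max(384 MB, 10% of it)
    overhead. Back-of-the-envelope: vcore limits and rounding ignored."""
    container_mb = executor_mb + max(min_overhead_mb,
                                     int(executor_mb * overhead_factor))
    return nodemanager_mb // container_mb

# With the question's spark.executor.memory of 4655m and an assumed
# ~12 GiB advertised to YARN on an n1-standard-4, two containers fit
# per node -- consistent with "two YARN containers per machine" above.
print(executors_per_node(12288, 4655))  # 2
```

After raising yarn.nodemanager.resource.memory-mb, re-running this estimate with the new value tells you whether it is worth also growing spark.executor.memory instead of just packing in more small executors.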
Instead of resizing cluster nodes and manually editing configuration files afterwards, consider starting a new cluster with the new machine sizes and copying any data from the old cluster to the new one. In general, the simplest way to move data is to use Hadoop's built-in distcp utility. An example usage would be something along the lines of:
$ hadoop distcp hdfs:///some_directory hdfs://other-cluster-m:8020/
Or if you can use Cloud Storage:
$ hadoop distcp hdfs:///some_directory gs://<your_bucket>/some_directory
Alternatively, consider always storing data in Cloud Storage and treating each cluster as an ephemeral resource that can be torn down and recreated at any time. In general, any time you would save data to HDFS, you can also save it as:
gs://<your_bucket>/path/to/file
Saving to GCS has the nice benefit of allowing you to delete your cluster (and data in HDFS, on persistent disks) when not in use.
