We are currently running into an issue where Spark shows that each of our nodes has only 4GB of memory, even though we have allocated 10GB by setting spark-worker.jvmOptions = -Xmx10g. We cannot figure out what is causing this unusual limitation/incorrect memory allocation.
When we run Spark jobs, they behave as if there is only 4GB of memory per worker.
Any help would be great! Thanks!
[Screenshot of the Solr UI]
You should set the worker memory using --executor-memory in your spark-submit command.
Try setting the following parameters inside the conf/spark-defaults.conf file:
spark.executor.memory 10g
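If you prefer to set it per job rather than in spark-defaults.conf, the same value can be passed on the command line. A minimal sketch (the master URL, class name, and JAR name are placeholders, assuming a standalone cluster as in the question):
spark-submit --master spark://your-master:7077 --executor-memory 10g --class com.example.YourApp yourApp.jar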
I know that when the Spark cluster in the production environment runs a job, it runs in standalone mode.
While I was running a job, memory overflow on a few workers caused the worker node processes to die.
I would like to ask how to analyze the error shown in the image below:
[Screenshot: Spark worker fatal error]
EDIT: This is a relatively common problem; if the steps below don't help, please also see Spark java.lang.OutOfMemoryError: Java heap space.
Without seeing your code, here is the process you should follow:
(1) If the issue is primarily caused by the Java allocation running out of space within the container allocation, I would advise adjusting your memory overhead settings (below). The current values are a little high and will cause excess spin-up of vcores. Add the two settings below to your spark-submit and re-run.
--conf "spark.yarn.executor.memoryOverhead=4000m"
--conf "spark.yarn.driver.memoryOverhead=2000m"
(2) Adjust Executor and Driver Memory Levels. Start low and climb. Add these values to the spark-submit statement.
--driver-memory 10g
--executor-memory 5g
(3) Adjust Number of Executor Values in the spark submit.
--num-executors ##
(4) Look at the YARN stages of the job and figure out where inefficiencies in the code are present and where persistence (caching) can be added or replaced. I would advise looking heavily into Spark tuning. A combined spark-submit sketch with these options is shown below.
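Putting steps (1) to (3) together, a hedged sketch of the full command might look like this (the master, class, JAR name, and executor count are placeholders; tune the numbers to your cluster):
spark-submit --master yarn --deploy-mode cluster \
  --conf "spark.yarn.executor.memoryOverhead=4000m" \
  --conf "spark.yarn.driver.memoryOverhead=2000m" \
  --driver-memory 10g --executor-memory 5g --num-executors 4 \
  --class com.example.YourJob yourJob.jar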
I have a Spark application that keeps failing with the error:
"Diagnostics: Container [pid=29328,containerID=container_e42_1512395822750_0026_02_000001] is running beyond physical memory limits. Current usage: 1.5 GB of 1.5 GB physical memory used; 2.3 GB of 3.1 GB virtual memory used. Killing container."
I have seen lots of different parameters that were suggested to change in order to increase the physical memory. Can I please have some explanation of the following parameters?
mapreduce.map.memory.mb (currently set to 0, so it is supposed to take the default, which is 1GB, so why do we see 1.5 GB? Changing it also didn't affect the number.)
mapreduce.reduce.memory.mb (currently set to 0, so it is supposed to take the default, which is 1GB, so why do we see 1.5 GB? Changing it also didn't affect the number.)
mapreduce.map.java.opts/mapreduce.reduce.java.opts set to 80% of the previous number
yarn.scheduler.minimum-allocation-mb=1GB (when changing this I do see the effect on the maximum physical memory, but with the value 1 GB it is still 1.5 GB)
yarn.app.mapreduce.am.resource.mb/spark.yarn.executor.memoryOverhead cannot be found in the configuration at all.
We are running Spark on YARN (yarn-cluster deploy mode) using Cloudera CDH 5.12.1.
spark.driver.memory
spark.executor.memory
These control the base amount of memory Spark will try to allocate for its driver and for all of the executors. These are probably the ones you want to increase if you are running out of memory.
// options before Spark 2.3.0
spark.yarn.driver.memoryOverhead
spark.yarn.executor.memoryOverhead
// options after Spark 2.3.0
spark.driver.memoryOverhead
spark.executor.memoryOverhead
This value is an additional amount of memory to request when you are running Spark on YARN. It is intended to account for the extra RAM needed by the YARN container that is hosting your Spark executors.
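On Spark 2.3.0 and later, a hedged example of passing these explicitly on spark-submit (the sizes here are purely illustrative) would be:
--conf "spark.executor.memoryOverhead=1g"
--conf "spark.driver.memoryOverhead=512m"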
yarn.scheduler.minimum-allocation-mb
yarn.scheduler.maximum-allocation-mb
When Spark asks YARN to reserve a block of RAM for an executor, it asks for the base memory plus the overhead memory. However, YARN may not give back a container of exactly that size. These parameters control the smallest and largest container sizes that YARN will grant. If you are only using the cluster for one job, I find it easiest to set these to very small and very large values, and then use the Spark memory settings mentioned above to set the true container size.
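As a hedged worked example (assuming Spark's default overhead of max(384MB, 10% of the executor heap) and yarn.scheduler.minimum-allocation-mb=1024): requesting --executor-memory 5g makes Spark ask YARN for 5120MB + 512MB = 5632MB, and YARN will typically round that request up to a multiple of its allocation increment, here granting a 6144MB container.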
mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
mapreduce.map.java.opts/mapreduce.reduce.java.opts
I don't think these have any bearing on your Spark/Yarn job.
You can set spark.driver.memory and spark.executor.memory, which are described as follows:
spark.driver.memory (default: 1g): Amount of memory to use for the driver process.
spark.executor.memory (default: 1g): Amount of memory to use per executor process (e.g. 2g, 8g).
The above configuration says memory. So is it RAM or disk?
(I must admit it's a very intriguing question.)
In short, it's RAM (and, honestly, Spark does not support disk as a resource to accept/request from a cluster manager).
From the official documentation Application Properties:
Amount of memory to use for the driver process, i.e. where SparkContext is initialized. (e.g. 1g, 2g).
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option or in your default properties file.
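In practice that means the driver memory for a client-mode job is set on the command line or in spark-defaults.conf, not in code. A minimal sketch (the master, class, and JAR names are placeholders):
spark-submit --master yarn --deploy-mode client --driver-memory 4g --executor-memory 4g --class com.example.YourApp yourApp.jar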
If my server has 50GB of memory and HBase is using 40GB of it, and I run Spark with --executor-memory 30G, will Spark grab some memory from HBase, since there is only 10GB left?
Another question: if Spark only needs 1GB of memory but I give it 10GB, will Spark occupy the full 10GB?
The behavior will differ depending on the deployment mode. If you are using local mode, then --executor-memory will not change anything, as you only have one executor and that is your driver, so you need to increase the memory of your driver instead.
If you are using standalone mode and submitting your job in cluster mode, then the following applies:
--executor-memory is the memory required per executor; it is the executor's heap size. By default, 60% of the configured --executor-memory is used to cache RDDs, and the remaining 40% is available for objects created during task execution. It is equivalent to -Xms and -Xmx, so if you request more memory than is available, your executors will show errors about insufficient memory.
When you give a Spark executor 30G of memory, the OS will not immediately give it actual physical memory. But as and when your executor requires actual memory, either for caching or processing, it will push your other processes, like HBase, into swap. If your system's swap is set to zero, you will face an OOM error.
The OS swaps out the idle parts of a process, which can make that process behave very slowly.
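As a rough sizing sketch for the numbers in the question (the figures are illustrative): with 50GB total and 40GB already held by HBase, only about 10GB remains, so something like --executor-memory 8g leaves headroom for JVM off-heap usage and the OS, whereas --executor-memory 30G will eventually force HBase into swap or trigger OOM kills as described above.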
I get the following WARN message:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
when I try to run the following Spark task:
spark/bin/spark-submit --master $SPARK_MASTER_URL --executor-memory 8g --driver-memory 8g --name "Test-Task" --class path.to.my.Class myJAR.jar
The master and all workers have enough memory for this task (see picture), but it seems like it doesn't get allocated.
My setup looks like this:
SparkConf conf = new SparkConf().set("spark.executor.memory", "8g");
When I start my task and then type
ps -fux | more
in my console, it shows me these options:
-Xms512m -Xmx512m
Can anyone tell me what I'm doing wrong?
Edit:
What I am doing:
I have a huge file saved on my master's disk, which is about 5GB when I load it into memory (it's a map of maps). So I first load the whole map into memory and then give each node a part of this map to process. As I understand it, that's the reason why I also need a lot of memory on my master instance. Maybe not a good solution?
To enlarge the heap size of the master node you can set the SPARK_DAEMON_MEMORY environment variable (in spark-env.sh, for instance). But I doubt it will solve your memory allocation problem, since the master node is not the one loading the data.
I don't understand what your "map of maps" file is. But usually, to process a big file, you make it available to each worker node using a shared folder (NFS) or, better, a distributed file system (HDFS, GlusterFS). Then each worker can read a part of the file and process it. This works as long as the file format is splittable; Spark supports the JSON file format, for instance.
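A minimal sketch of that approach in Java (assuming the file is line-delimited and has been copied to a hypothetical HDFS path; the class name and processing logic are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DistributedRead {
    public static void main(String[] args) {
        // Executor memory can still be set in code; driver memory must be passed
        // via --driver-memory on spark-submit because the driver JVM has already started.
        SparkConf conf = new SparkConf().setAppName("Test-Task").set("spark.executor.memory", "8g");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each worker reads and processes only its own partitions of the file,
        // so the whole 5GB structure never has to fit in the driver's heap.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/huge-file.json");
        long nonEmpty = lines.filter(line -> !line.isEmpty()).count();
        System.out.println("Non-empty lines: " + nonEmpty);

        sc.stop();
    }
}

This keeps the driver small, since the data is partitioned across the workers instead of being built as one in-memory map on the master.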