The off-heap memory usage of the 3 Spark executor processes keeps increasing until the physical RAM is exhausted. This happened two weeks ago, at which point the system came to a grinding halt because it was unable to spawn new processes. At such a moment restarting Spark is the obvious solution. The collectd memory usage graph below shows the two moments at which we restarted Spark: last week, when we upgraded Spark from 1.4.1 to 1.5.1, and two weeks ago, when the physical memory was exhausted.
As can be seen below, the Spark executor process uses approx. 62GB of memory, while the heap size max is set to 20GB. This means the off-heap memory usage is approx. 42GB.
$ ps aux | grep 40724
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
apache-+ 40724 140 47.1 75678780 62181644 ? Sl Nov06 11782:27 /usr/lib/jvm/java-7-oracle/jre/bin/java -cp /opt/spark-1.5.1-bin-hadoop2.4/conf/:/opt/spark-1.5.1-bin-hadoop2.4/lib/spark-assembly-1.5.1-hadoop2.4.0.jar:/opt/spark-1.5.1-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.5.1-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.5.1-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar -Xms20480M -Xmx20480M -Dspark.driver.port=7201 -Dspark.blockManager.port=7206 -Dspark.executor.port=7202 -Dspark.broadcast.port=7204 -Dspark.fileserver.port=7203 -Dspark.replClassServer.port=7205 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver#xxx.xxx.xxx.xxx:7201/user/CoarseGrainedScheduler --executor-id 2 --hostname xxx.xxx.xxx.xxx --cores 10 --app-id app-20151106125547-0000 --worker-url akka.tcp://sparkWorker#xxx.xxx.xxx.xxx:7200/user/Worker
$ sudo -u apache-spark jps
40724 CoarseGrainedExecutorBackend
40517 Worker
30664 Jps
$ sudo -u apache-spark jstat -gc 40724
S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
158720.0 157184.0 110339.8 0.0 6674944.0 1708036.1 13981184.0 2733206.2 59904.0 59551.9 41944 1737.864 39 13.464 1751.328
$ sudo -u apache-spark jps -v
40724 CoarseGrainedExecutorBackend -Xms20480M -Xmx20480M -Dspark.driver.port=7201 -Dspark.blockManager.port=7206 -Dspark.executor.port=7202 -Dspark.broadcast.port=7204 -Dspark.fileserver.port=7203 -Dspark.replClassServer.port=7205 -XX:MaxPermSize=256m
40517 Worker -Xms2048m -Xmx2048m -XX:MaxPermSize=256m
10693 Jps -Dapplication.home=/usr/lib/jvm/java-7-oracle -Xms8m
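The estimate above can be reproduced from the numbers in the ps and jps output. This is a minimal sketch, assuming that ps reports RSS in KiB (the procps default) and that the -Xmx value bounds the Java heap; the function name is made up for illustration. In binary units the gap comes out at roughly 39 GiB rather than 42 GB, but the conclusion is the same.

```python
KIB_PER_GIB = 1024 * 1024
MIB_PER_GIB = 1024

def off_heap_gib(rss_kib, xmx_mib):
    """Estimate memory outside the Java heap: resident set size minus max heap."""
    return rss_kib / KIB_PER_GIB - xmx_mib / MIB_PER_GIB

# Values copied from the ps output (RSS column) and the -Xmx20480M flag.
rss_kib = 62181644
xmx_mib = 20480

estimate = off_heap_gib(rss_kib, xmx_mib)
print(f"RSS: {rss_kib / KIB_PER_GIB:.1f} GiB, off-heap estimate: {estimate:.1f} GiB")
```

Since the heap is not necessarily fully resident, this should be read as a rough lower bound on the off-heap usage.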
Some info:
We use Spark Streaming lib.
Our code is written in Java.
We run Oracle Java v1.7.0_76
Data is read from Kafka (Kafka runs on different boxes).
Data is written to Cassandra (Cassandra runs on different boxes).
1 Spark master and 3 Spark executors/workers, running on 4 separate boxes.
We recently upgraded Spark to 1.4.1 and 1.5.1 and the memory usage pattern is identical on all those versions.
What can be the cause of this ever-increasing off-heap memory use?
The error log is as follows :
20/05/10 18:40:47 ERROR yarn.Client: Application diagnostics message: Application application_1588683044535_1067 failed 2 times due to AM Container for appattempt_1588683044535_1067_000002 exited with exitCode: -104
Failing this attempt.Diagnostics: [2020-05-10 18:40:47.661]Container [pid=209264,containerID=container_e142_1588683044535_1067_02_000001] is running 3313664B beyond the 'PHYSICAL' memory limit. Current usage: 1.5 GB of 1.5 GB physical memory used; 3.6 GB of 3.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_e142_1588683044535_1067_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 209264 209262 209264 209264 (bash) 0 0 22626304 372 /bin/bash -c LD_LIBRARY_PATH="/cdhparcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop/../../../CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop/lib/native:" /usr/java/jdk1.8.0_181-cloudera/bin/java -server -Xmx1024m -Djava.io.tmpdir=/hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/tmp -Dspark.yarn.app.container.log.dir=/hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'com.airtel.spark.execution.driver.SparkDriver' --jar hdfs:///user/aiuat/lib/platform/di-platform-main-1.0.jar --arg 'hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irsparkbatchjobconf.json,hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irruleexecution.json' --properties-file /hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/__spark_conf__/__spark_conf__.properties 1> /hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/stdout 2> /hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/stderr
|- 209280 209264 209264 209264 (java) 34135 2437 3845763072 393653 /usr/java/jdk1.8.0_181-cloudera/bin/java -server -Xmx1024m -Djava.io.tmpdir=/hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/tmp -Dspark.yarn.app.container.log.dir=/hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class com.airtel.spark.execution.driver.SparkDriver --jar hdfs:///user/aiuat/lib/platform/di-platform-main-1.0.jar --arg hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irsparkbatchjobconf.json,hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irruleexecution.json --properties-file /hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/__spark_conf__/__spark_conf__.properties
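The 3313664B figure in the diagnostics can be checked against the process-tree dump. This is a minimal sketch, assuming a 4096-byte page size (typical on x86-64 Linux, but not stated in the log) and reading the "1.5 GB" limit as 1.5 GiB; the variable names are illustrative.

```python
PAGE_SIZE = 4096  # assumed page size in bytes; not stated in the log

# RSSMEM_USAGE(PAGES) per process, copied from the process-tree dump.
rss_pages = {"bash (209264)": 372, "java (209280)": 393653}

limit_bytes = int(1.5 * 1024**3)  # "1.5 GB physical memory" read as 1.5 GiB
used_bytes = sum(rss_pages.values()) * PAGE_SIZE
excess_bytes = used_bytes - limit_bytes

print(f"used={used_bytes} limit={limit_bytes} excess={excess_bytes}")
```

Under these assumptions the excess works out to exactly 3313664 bytes, which suggests that YARN sums the resident memory of every process in the container tree, and that the JVM started with -Xmx1024m holds roughly 0.5 GiB outside its heap.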
Some observations:
The application master is getting killed. The memory error is in the container of the application master itself, not in the executor containers.
This job is scheduled via Oozie; some instances of the job succeed and some fail randomly, without any pattern. The amount of input data is the same in every case.
I have tried most of the solutions suggested on the internet.
yarn.mapareduce.map.mb and yarn.mapareduce.reduce.mb are already set to 8 GB.
I have also tried setting the driver memory, the executor memory, and the memory overhead of both to very high and very low values, but some instances still failed in every case.
yarn.nodemanager.vmem-pmem-ratio is set to 2.1; the vmem check is disabled and the pmem check is enabled. Unfortunately these configurations can't be changed, as it's a production cluster.
yarn.app.mapreduce.am.resource.mb is already set to 5 GB. yarn.scheduler.maximum-allocation-mb is set to 26 GB.
Some other points I am confused about:
Why is the memory available to the application master container only 1.5 GB, as shown in the logs, if yarn.app.mapreduce.am.resource.mb is set to 5 GB?
This error occurs in the container of the application master itself and, as per my understanding, the application master and the Spark driver run in the same JVM. I am therefore concluding that this error is caused by either the Spark driver memory or the application master memory not being sufficient. Does my conclusion seem correct?
I have fixed this error, so I thought I would answer it here.
In cluster mode, the driver memory cannot be configured at runtime once a SparkSession has been created, because the application master has already been launched and the driver runs inside the YARN application master container. I was trying to pass the driver memory via "spark.driver.memory" after creating the SparkSession. Spark doesn't report any error in this case, and the Environment tab of the Spark UI even shows the driver memory exactly as provided via this setting, which makes the issue even harder to identify. The application master memory was therefore taken as the default value of 1 GB instead of the value I provided, and that is why I was getting this error.
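To illustrate the fix: in cluster mode the driver memory has to be known before the application master container is requested, so it belongs on the spark-submit command line (or in spark-defaults.conf) rather than in the session builder. Below is a minimal sketch that assembles such a command. The 4g value and the helper name are illustrative assumptions; the class and jar are taken from the log in the question.

```python
import shlex

def build_submit_command(main_class, jar, driver_memory):
    """Assemble a spark-submit argv that fixes driver memory at launch time."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        # Must be set here: the driver JVM is already running by the time
        # application code could set spark.driver.memory.
        "--driver-memory", driver_memory,
        "--class", main_class,
        jar,
    ]

cmd = build_submit_command(
    "com.airtel.spark.execution.driver.SparkDriver",
    "hdfs:///user/aiuat/lib/platform/di-platform-main-1.0.jar",
    "4g",  # illustrative value, not from the original post
)
print(" ".join(shlex.quote(part) for part in cmd))
```

When the job is launched through a scheduler such as Oozie, the same setting would have to go into whatever option list that scheduler passes to spark-submit.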
My Spark application reads 3 files of 7 MB, 40 MB, and 100 MB, applies many transformations, and stores the results in multiple directories.
Spark version CDH1.5
MASTER_URL=yarn-cluster
NUM_EXECUTORS=15
EXECUTOR_MEMORY=4G
EXECUTOR_CORES=6
DRIVER_MEMORY=3G
My Spark job ran for some time, then threw the error below and restarted from the beginning:
18/03/27 18:59:44 INFO avro.AvroRelation: using snappy for Avro output
18/03/27 18:59:47 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
18/03/27 18:59:47 INFO CuratorFrameworkSingleton: Closing ZooKeeper client.
Once it restarted, it ran for some time and then failed with this error:
Application application_1521733534016_7233 failed 2 times due to AM Container for appattempt_1521733534016_7233_000002 exited with exitCode: -104
For more detailed output, check application tracking page:http://entline.com:8088/proxy/application_1521733534016_7233/Then, click on links to logs of each attempt.
Diagnostics: Container [pid=52716,containerID=container_e98_1521733534016_7233_02_000001] is running beyond physical memory limits. Current usage: 3.5 GB of 3.5 GB physical memory used; 4.3 GB of 7.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_e98_1521733534016_7233_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 52720 52716 52716 52716 (java) 89736 8182 4495249408 923677 /usr/java/jdk1.7.0_67-cloudera/bin/java -server -Xmx3072m -Djava.io.tmpdir=/apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001 -XX:MaxPermSize=256m org.apache.spark.deploy.yarn.ApplicationMaster --class com.sky.ids.dovetail.asrun.etl.DovetailAsRunETLMain --jar file:/apps/projects/dovetail_asrun_etl/jars/EntLine-1.0-SNAPSHOT-jar-with-dependencies.jar --arg --app.conf.path --arg application.conf --arg --run_type --arg AUTO --arg --bus_date --arg 2018-03-27 --arg --code_base_id --arg EntLine-1.0-SNAPSHOT --executor-memory 4096m --executor-cores 6 --properties-file /apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/__spark_conf__/__spark_conf__.properties
|- 52716 52714 52716 52716 (bash) 2 0 108998656 389 /bin/bash -c LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/../../../CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/lib/native: /usr/java/jdk1.7.0_67-cloudera/bin/java -server -Xmx3072m -Djava.io.tmpdir=/apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001 -XX:MaxPermSize=256m org.apache.spark.deploy.yarn.ApplicationMaster --class 'com.sky.ids.dovetail.asrun.etl.DovetailAsRunETLMain' --jar file:/apps/projects/dovetail_asrun_etl/jars/EntLine-1.0-SNAPSHOT-jar-with-dependencies.jar --arg '--app.conf.path' --arg 'application.conf' --arg '--run_type' --arg 'AUTO' --arg '--bus_date' --arg '2018-03-27' --arg '--code_base_id' --arg 'EntLine-1.0-SNAPSHOT' --executor-memory 4096m --executor-cores 6 --properties-file /apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/__spark_conf__/__spark_conf__.properties 1> /var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/stdout 2> /var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/stderr
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.
As per my CDH configuration:
Container Memory[Amount of physical memory, in MiB, that can be allocated for containers]
yarn.nodemanager.resource.memory-mb 50655 MiB
Please see the containers running in my driver node
Why are there so many containers running on one node?
I know that container_e98_1521733534016_7880_02_000001 is for my Spark application. I don't know about the other containers. Any idea what they are?
Also, I see that the physical memory for container_e98_1521733534016_7880_02_000001 is 3584, which is close to 3.5 GB.
What does this error mean? When does it usually occur?
What is 3.5 GB of 3.5 GB physical memory? Is it driver memory?
Could someone help me with this issue?
container_e98_1521733534016_7233_02_000001 is the first container started and given MASTER_URL=yarn-cluster that's not only the ApplicationMaster, but also the driver of the Spark application.
It appears that the memory setting for the driver, i.e. DRIVER_MEMORY=3G, is too low and you have to bump it up.
Spark on YARN runs two executors by default (see --num-executors) and so you'll end up with 3 YARN containers with 000001 for the ApplicationMaster (perhaps with the driver) and 000002 and 000003 for the two executors.
What is 3.5 GB of 3.5 GB physical memory? Is it driver memory?
Since you use yarn-cluster, the driver, the ApplicationMaster and container_e98_1521733534016_7233_02_000001 are all the same and live in the same JVM. It follows that the error is about how much memory you assigned to the driver.
My understanding is that you set DRIVER_MEMORY=3G, which turned out to be too little for your processing. Once YARN detected that, it killed the driver (and hence the entire Spark application, as a Spark application cannot keep running without its driver).
See the document Running Spark on YARN.
A small addition to what @Jacek already wrote, to answer the question
why do you get 3.5GB instead of 3GB?
Apart from DRIVER_MEMORY=3G you need to consider spark.driver.memoryOverhead (spark.yarn.driver.memoryOverhead before Spark 2.3), which is calculated as MAX(DRIVER_MEMORY * 0.10, 384MB) = 384MB here. The container request is therefore 3GB + 384MB ~ 3.5GB.
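The arithmetic can be sketched as follows. Spark takes the larger of 10% of the driver memory and 384 MB as overhead. The rounding step is an assumption: YARN typically rounds requests up to a multiple of a scheduler increment, and 512 MB is picked here only because it reproduces the 3584 figure seen in the question. The function names are illustrative.

```python
import math

def driver_container_mb(driver_memory_mb, overhead_factor=0.10, min_overhead_mb=384):
    """Driver heap plus overhead, where overhead is the larger of 10% and 384 MB."""
    overhead_mb = max(int(driver_memory_mb * overhead_factor), min_overhead_mb)
    return driver_memory_mb + overhead_mb

def round_up(value_mb, increment_mb):
    """Round a request up to the next multiple of the scheduler increment."""
    return math.ceil(value_mb / increment_mb) * increment_mb

requested = driver_container_mb(3072)   # DRIVER_MEMORY=3G
granted = round_up(requested, 512)      # 512 MB increment is an assumption
print(requested, granted)
```

With these inputs the request is 3456 MB and the rounded container size is 3584 MB, i.e. the 3.5 GB limit reported in the error.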
I'm trying to set up a standalone Spark 2.0 server to process an analytics function in parallel. To do this I want to run 8 workers, with a single core per worker. However, the Spark Master/Worker UI doesn't seem to reflect my configuration.
I'm using :
Standalone Spark 2.0
8 Cores 24gig RAM
windows server 2008
pyspark
spark-env.sh file is configured as follows:
SPARK_WORKER_INSTANCES = 8
SPARK_WORKER_CORES = 1
SPARK_WORKER_MEMORY = 2g
spark-defaults.conf is configured as follows:
spark.cores.max = 8
I start the master:
spark-class org.apache.spark.deploy.master.Master
I start the workers by running this command 8 times within a batch file:
spark-class org.apache.spark.deploy.worker.Worker spark://10.0.0.10:7077
The problem is that the UI shows up as follows:
As you can see, each worker has 8 cores instead of the 1 core I assigned via the SPARK_WORKER_CORES setting. The memory also reflects the memory of the entire machine, not the 2g assigned to each worker. How can I configure Spark to run with 1 core/2g per worker in standalone mode?
I fixed this by adding the cores and memory arguments to the worker command itself:
start spark-class org.apache.spark.deploy.worker.Worker --cores 1 --memory 2g spark://10.0.0.10:7077
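For the 8-worker setup described in the question, the same command has to be issued once per worker. A minimal sketch that generates those command lines, assuming spark-class is on the PATH; the helper name is made up for illustration. Passing the limits as explicit flags does not depend on any environment file being picked up.

```python
def worker_commands(master_url, instances, cores, memory):
    """Build one Worker launch command per instance with explicit limits."""
    base = [
        "spark-class", "org.apache.spark.deploy.worker.Worker",
        "--cores", str(cores),
        "--memory", memory,
        master_url,
    ]
    return [list(base) for _ in range(instances)]

commands = worker_commands("spark://10.0.0.10:7077", instances=8, cores=1, memory="2g")
for argv in commands:
    print(" ".join(argv))
```

Each list could be handed to subprocess.Popen, or the printed lines pasted into the batch file.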
We are running Spark drivers and executors in Docker containers, orchestrated by Kubernetes. We'd like to be able to set the Java heap size for them at runtime, via the Kubernetes controller YAML. What Spark config has to be set to do this? If I do nothing and look at the launched process via ps -ef, I see:
root 639 638 0 00:16 ? 00:00:23 /opt/ibm/java/jre/bin/java -cp /opt/ibm/spark/conf/:/opt/ibm/spark/lib/spark-assembly-1.5.2-hadoop2.6.0.jar:/opt/ibm/spark/lib/datanucleus-api-jdo-3.2.6.jar:/opt/ibm/spark/lib/datanucleus-core-3.2.10.jar:/opt/ibm/spark/lib/datanucleus-rdbms-3.2.9.jar:/opt/ibm/hadoop/conf/ -Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=172.17.48.29:2181,172.17.231.2:2181,172.17.47.17:2181 -Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=172.17.48.29:2181,172.17.231.2:2181,172.17.47.17:2181 -Dcom.ibm.apm.spark.logfilename=master.log -Dspark.deploy.defaultCores=2 **-Xms1g -Xmx1g** org.apache.spark.deploy.master.Master --ip sparkmaster-1 --port 7077 --webui-port 18080
Something is setting the -Xms and -Xmx options. I tried setting SPARK_DAEMON_JAVA_OPTS="-Xms1G -Xmx2G" in spark-env.sh and got:
root 2919 2917 2 19:16 ? 00:00:15 /opt/ibm/java/jre/bin/java -cp /opt/ibm/spark/conf/:/opt/ibm/spark/lib/spark-assembly-1.5.2-hadoop2.6.0.jar:/opt/ibm/spark/lib/datanucleus-api-jdo-3.2.6.jar:/opt/ibm/spark/lib/datanucleus-core-3.2.10.jar:/opt/ibm/spark/lib/datanucleus-rdbms-3.2.9.jar:/opt/ibm/hadoop/conf/ -Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=172.17.48.29:2181,172.17.231.2:2181,172.17.47.17:2181 **-Xms1G -Xmx2G** -Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=172.17.48.29:2181,172.17.231.2:2181,172.17.47.17:2181 **-Xms1G -Xmx2G** -Dcom.ibm.apm.spark.logfilename=master.log -Dspark.deploy.defaultCores=2 **-Xms1g -Xmx1g** org.apache.spark.deploy.master.Master --ip sparkmaster-1 --port 7077 --webui-port 18080
A friend suggested setting
spark.driver.memory 2g
in spark-defaults.conf, but the results looked like the first example. Maybe the values in the ps -ef command were overridden by this setting, but how would I know? If spark.driver.memory is the right override, can you set the heap min and max this way, or does this just set the max?
Thanks in advance.
Setting SPARK_DAEMON_MEMORY environment variable in conf/spark-env.sh should do the trick:
SPARK_DAEMON_MEMORY Memory to allocate to the Spark master and worker daemons themselves (default: 1g).
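On the "how would I know?" part of the question: when a flag such as -Xmx appears more than once, JVMs generally honor the last occurrence, so the trailing -Xms1g -Xmx1g appended by the launcher is most likely what took effect. A minimal sketch for pulling the last values out of a ps line, under that assumption; the function name is illustrative and the sample is a shortened version of the output above.

```python
import re

def effective_heap_flags(command_line):
    """Return the last -Xms and -Xmx values found on a java command line."""
    flags = {}
    for name, value in re.findall(r"-(Xms|Xmx)(\d+[kKmMgG]?)", command_line):
        flags[name] = value  # later occurrences overwrite earlier ones
    return flags

# Shortened version of the second ps -ef line from the question.
sample = (
    "java -cp /opt/ibm/spark/conf/ -Xms1G -Xmx2G "
    "-Dspark.deploy.defaultCores=2 -Xms1g -Xmx1g "
    "org.apache.spark.deploy.master.Master --port 7077"
)
print(effective_heap_flags(sample))
```

If the last pair on the command line still reads 1g after a configuration change, the setting did not reach the daemon's heap options.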
I want to run a simple Spark program, but I am blocked by some errors.
My environment is:
CentOS:6.6
Java: 1.7.0_51
Scala: 2.10.4
Spark: spark-1.4.0-bin-hadoop2.6
Mesos: 0.22.1
Everything is installed and the nodes are up. I now have one Mesos master and one Mesos slave node. My Spark properties are below:
spark.app.id 20150624-185838-2885789888-5050-1291-0005
spark.app.name Spark shell
spark.driver.host 192.168.1.172
spark.driver.memory 512m
spark.driver.port 46428
spark.executor.id driver
spark.executor.memory 512m
spark.executor.uri http://192.168.1.172:8080/spark-1.4.0-bin-hadoop2.6.tgz
spark.externalBlockStore.folderName spark-91aafe3b-01a8-4c86-ac3b-999e278807c5
spark.fileserver.uri http://192.168.1.172:51240
spark.jars
spark.master mesos://zk://192.168.1.172:2181/mesos
spark.mesos.coarse true
spark.repl.class.uri http://192.168.1.172:51600
spark.scheduler.mode FIFO
Now when I start Spark, it gets to the Scala prompt (scala>).
After that I get the following error: mesos task 1 is now TASK_FAILED, blacklisting mesos slave value due to too many failures is Spark installed on it
How can I resolve this?
With only 900MB and spark.driver.memory = 512m, you will be able to launch the scheduler/REPL, but you won't have enough memory for spark.executor.memory = 512m, so any tasks will fail. Either increasing your VM memory size or reducing the driver/executor memory requirements will help you get around these memory limits.
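The budget can be checked with a few lines. This is a rough sketch that assumes the driver and one executor share the same machine and takes the 900 MB figure from the text above; it ignores any per-JVM overhead unless one is passed in, and the function name is illustrative.

```python
def fits_in_memory(total_mb, driver_mb, executor_mb, executor_overhead_mb=0):
    """Check whether driver and one executor fit in the available memory."""
    needed_mb = driver_mb + executor_mb + executor_overhead_mb
    return needed_mb <= total_mb, needed_mb

ok, needed = fits_in_memory(total_mb=900, driver_mb=512, executor_mb=512)
print(f"needed={needed} MB, fits={ok}")
```

Even before any overhead is counted, the two 512m settings already exceed the available memory, which is consistent with the tasks failing.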
Could you check the Mesos slave logs/task information for more output on why the task failed? You could have a look at :5050.
Probably unrelated question: Do you actually have zookeeper:
spark.master mesos://zk://192.168.1.172:2181/mesos
running (as you mentioned you only have one master)?