My Spark job showed, for a task, about 75% or more GC time out of the total run time. It uses almost all of the CPU (about 85%, per the Spark configuration) and memory. While following this reference on Spark tuning, I turned on the GC log print options by adding:
spark.executor.extraJavaOptions = " -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark"
As you can see, all options are about logging and diagnosis.
However, these options improved the runtime from 5 hours to 2 hours!
How can we explain this improvement?
[Update]
With -XX:+PrintFlagsFinal option only, 1 hr 23 min.
With -XX:+UseG1GC option only, 7+ hrs.
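For reference, a minimal sketch of how such executor JVM options can be passed at submission time (the job script name is a placeholder; the whole value is quoted because it contains spaces):
spark-submit \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintFlagsFinal" \
  my_job.py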
Context
I am processing some data (5 billion rows, ~7 columns) via pyspark on EMR.
The first steps, including some joins, work as expected, up to and including a cache() (MEMORY_AND_DISK_SER). Then I filter one column for nulls and do a count() on this big dataframe.
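A minimal sketch of the shape of that last step (df_joined and the column name are placeholders; the point is that cache() is lazy, so count() is the first action that actually materializes the work):
from pyspark.sql import functions as F

df_cached = df_joined.cache()                                # lazy: nothing is computed yet
n = df_cached.filter(F.col("some_col").isNotNull()).count()  # first action: triggers the whole job
# (or .isNull(), depending on which way the null filter goes)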
Problem
It runs for hours and then fails with a 'no connection' error (I do not remember the exact message, but I am more interested in why it is slow than in the final error).
What I noticed
Out of my 256 vCores, 1 is always at 100% while the rest are idle. The one at 100% is used by a DataNode JVM.
Configuration
I have 4 r5a.16xlarge instances, each with 4 EBS SSDs.
EMR is supposed to take care of its own configuration, and this is what I see in the Spark UI:
spark.emr.default.executor.memory 18971M
spark.driver.memory 2048M
spark.executor.cores 4
I am setting these myself:
spark.network.timeout: 800s
spark.executor.heartbeatInterval: 60s
spark.dynamicAllocation.enabled: True
spark.dynamicAllocation.shuffleTracking.enabled: True
spark.executor.instances: 0
spark.default.parallelism: 128
spark.shuffle.spill.compress: true
spark.shuffle.compress: true
spark.rdd.compress: true
spark.storage.level: MEMORY_AND_DISK_SER
spark.executor.extraJavaOptions: -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p' -Duser.timezone=GMT
spark.driver.extraJavaOptions: -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p' -Duser.timezone=GMT
Question
What am I doing wrong, or what do I not understand properly? Counting a cached dataframe that was built in 10 minutes, even when filtering out nulls, should not take hours.
Some more details
The data source is on S3, stored as homogeneous Parquet files. Reading those always works fine, since the joins succeed.
During the count(), I see 200 tasks; 195 succeed within a few seconds, but 5 consistently never complete. The stuck ones are all NODE_LOCAL (though some NODE_LOCAL tasks do complete).
It's a bit hard to tell what is going wrong without looking at the resource manager.
But first, make sure that you are not 'measuring' success on non-action APIs (cache() is not an action, and neither are joins and so on).
My best guess is the Spark configuration. I would use cluster mode with these settings (a spark-submit sketch follows the list):
spark.default.parallelism=510
--num-executors=13 (or spark.executor.instances)
spark.executor.cores = spark.driver.cores = 5
spark.executor.memory = spark.driver.memory = 36g
spark.driver.memoryOverhead=4g
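If it helps, a sketch of how those settings could be passed at submission in cluster mode (the application file is a placeholder):
spark-submit \
  --deploy-mode cluster \
  --num-executors 13 \
  --executor-cores 5 --driver-cores 5 \
  --executor-memory 36g --driver-memory 36g \
  --conf spark.driver.memoryOverhead=4g \
  --conf spark.default.parallelism=510 \
  your_app.py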
I think the problem lies in spark.executor.instances, which you have set to 0.
Otherwise, these settings (from the official AWS guide) are close to optimal for the AWS instance types you are using.
I have 16 receivers in a Spark Streaming 2.2.1 job. After a while, some of the receivers process fewer and fewer records, eventually handling only one record per second. The behaviour can be observed in the screenshot:
While I understand the root-cause can be difficult to find and not obvious, is there a way I could debug this problem further? Currently I have no idea where to start digging. Could it be related to back-pressure?
Spark streaming properties:
spark.app.id application_1599135282140_1222
spark.cores.max 64
spark.driver.cores 4
spark.driver.extraJavaOptions -XX:+PrintFlagsFinal -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dump/ -Dlog4j.configuration=file:///tmp/4f892127ad794245aef295c97ccbc5c9/driver_log4j.properties
spark.driver.maxResultSize 3840m
spark.driver.memory 4g
spark.driver.port 36201
spark.dynamicAllocation.enabled false
spark.dynamicAllocation.maxExecutors 10000
spark.dynamicAllocation.minExecutors 1
spark.eventLog.enabled false
spark.executor.cores 4
spark.executor.extraJavaOptions -XX:+PrintFlagsFinal -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dump/
spark.executor.id driver
spark.executor.instances 16
spark.executor.memory 4g
spark.jars file:/tmp/4f892127ad794245aef295c97ccbc5c9/main-e41d1cc.jar
spark.master yarn
spark.rpc.message.maxSize 512
spark.scheduler.maxRegisteredResourcesWaitingTime 300s
spark.scheduler.minRegisteredResourcesRatio 1.0
spark.scheduler.mode FAIR
spark.shuffle.service.enabled true
spark.sql.cbo.enabled true
spark.streaming.backpressure.enabled true
spark.streaming.backpressure.initialRate 25
spark.streaming.backpressure.pid.minRate 1
spark.streaming.concurrentJobs 1
spark.streaming.receiver.maxRate 100
spark.submit.deployMode client
It seems the problem started manifesting after running for about 30 minutes. I think back-pressure could be the reason. According to this article:
With activated backpressure, the driver monitors the current batch scheduling delays and processing times and dynamically adjusts the maximum rate of the receivers. The communication of new rate limits can be verified in the receiver log:
2016-12-06 08:27:02,572 INFO org.apache.spark.streaming.receiver.ReceiverSupervisorImpl Received a new rate limit: 51.
Here is what I would recommend you try:
Check the receiver log to see if backpressure is triggered.
Check your stream sink to see if there is any error.
Check YARN resource manager for resource utilization.
Tune the Spark rate/backpressure parameters to see if that makes a difference (a sketch follows this list).
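For the last point, a sketch of the backpressure-related properties that are usually worth experimenting with; the values are only illustrative, not recommendations:
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.streaming.backpressure.enabled", "true")
        .set("spark.streaming.backpressure.initialRate", "100")  # currently 25
        .set("spark.streaming.backpressure.pid.minRate", "10")   # currently 1
        .set("spark.streaming.receiver.maxRate", "500"))         # currently 100
# pass conf to the SparkContext / StreamingContext when building the application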
I have a large table saved as Parquet, and when I try to load it I get a crazy amount of GC time, around 80%. I use Spark 2.4.3. The Parquet data is saved with the following layout:
/parentfolder/part_0001/parquet.file
/parentfolder/part_0002/parquet.file
/parentfolder/part_0003/parquet.file
[...]
2432 in total
The table is 2.6 TiB in total and looks like this (both fields are 64-bit ints):
+-----------+------------+
| a | b |
+-----------+------------+
|85899366440|515396105374|
|85899374731|463856482626|
|85899353599|661424977446|
[...]
I have 7.4 TiB of cluster memory in total, with 480 cores across 10 workers, and I read the Parquet like this:
df = spark.read.parquet('/main/parentfolder/*/').cache()
As I said, I get a crazy amount of garbage collection time; right now it stands at Task Time (GC Time) | 116.9 h (104.8 h), with only 110 GiB loaded after 22 min of wall time.
I monitor one of the workers, and memory usually hovers around 546G/748G.
What am I doing wrong here? Do I need a larger cluster? If my dataset is 2.6 TiB, why isn't 7.4 TiB of memory enough? And then again, why isn't the memory full on my workers?
Just try to remove .cache().
There are only a few cases where you need to cache your data; the most obvious one is a single dataframe reused across several actions. But if your dataframe is that big, do not use cache(). Use persist() with a disk-backed storage level:
from pyspark import StorageLevel
df = spark.read.parquet('/main/parentfolder/*/').persist(StorageLevel.DISK_ONLY)
See the Databricks article on this:
Tuning Java Garbage Collection for Apache Spark Applications
G1 GC Running Status (after Tuning)
-XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -Xms88g -Xmx88g -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=20
You need garbage collector tuning in this case; try the example configuration above.
Also make sure that in your spark-submit you are passing the right parameters, such as executor memory and driver memory, as sketched below.
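A sketch of what such a submission could look like (the memory values and script name are placeholders to adapt; note that the executor heap itself comes from --executor-memory, since Spark rejects -Xmx inside spark.executor.extraJavaOptions):
spark-submit \
  --executor-memory 36g --driver-memory 8g \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=20 -verbose:gc -XX:+PrintGCDetails" \
  your_job.py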
Use
scala.collection.Map<String,scala.Tuple2<Object,Object>> getExecutorMemoryStatus()
which returns a map from each executor to the maximum memory available for caching and the remaining memory available for caching. You can call it and debug from PySpark through the py4j bridge:
sc._jsc.sc().getExecutorMemoryStatus()
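A small PySpark sketch of that call; it goes through the internal _jsc handle, so treat it as a debugging aid rather than a stable API (sc is an active SparkContext):
status = sc._jsc.sc().getExecutorMemoryStatus()
print(status.toString())   # e.g. Map(host:port -> (maxMemoryForCaching, remainingMemory), ...)
print("entries (driver + executors):", status.size())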
How can I adjust the on-heap and off-heap memory for an application running on Spark 1.5.0? Using -XX:+PrintGCDetails -XX:+PrintGCTimeStamps, I noticed in the GC reports retrieved from the file $SPARK_HOME/work/application_id/stdout that the JVM runs a GC about every minute. Although 50g of executor memory is allocated via the --executor-memory 50g option, and I tried various --conf spark.storage.memoryFraction values, the PSYoungGen region always occupies 30% of (PSYoungGen + ParOldGen). PSPermGen always stays at around 54,272 KB with 99% usage.
What I have tried:
spark.executor.extraJavaOptions='-Xms50g -Xmx50g -XX:PermSize=8g' does not work, though plenty of blogs claim this setting works (see the sketch after this list).
The JAVA_OPTS setting in both spark-env.sh and spark-defaults.conf does not work either.
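For reference, a sketch of how heap and PermGen are usually set for Spark 1.x executors on Java 7 (sizes are examples only; the heap comes from --executor-memory because Spark rejects -Xmx inside spark.executor.extraJavaOptions):
spark-submit \
  --executor-memory 50g \
  --conf "spark.executor.extraJavaOptions=-XX:PermSize=512m -XX:MaxPermSize=1g -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  your_app.py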
With no explicit on-heap and off-heap memory settings in Spark 1.5.0, what is the solution to my problem?
JVM keeps on GC in about every 1 minute
Since you haven't posted any actual data, only your own analysis of the data, I cannot say for certain, but generally speaking 1 GC event per minute is perfectly normal, quite good even. So there is no tuning required.
I am processing data with spark and it works with a day worth of data (40G) but fails with OOM on a week worth of data:
import pyspark
import datetime
import operator
sc = pyspark.SparkContext()
sqc = pyspark.sql.SQLContext(sc)
sc.union([sqc.parquetFile(hour.strftime('.....'))
.map(lambda row:(row.id, row.foo))
for hour in myrange(beg,end,datetime.timedelta(0,3600))]) \
.reduceByKey(operator.add).saveAsTextFile("myoutput")
The number of different IDs is less than 10k.
Each ID is a smallish int.
The job fails because too many executors fail with OOM.
When the job succeeds (on small inputs), "myoutput" is about 100k.
What am I doing wrong?
I tried replacing saveAsTextFile with collect (because I actually want to do some slicing and dicing in Python before saving); there was no difference in behavior, the same failure. Is this to be expected?
I used to have reduce(lambda x,y: x.union(y), [sqc.parquetFile(...)...]) instead of sc.union - which is better? Does it make any difference?
The cluster has 25 nodes with 825GB RAM and 224 cores among them.
Invocation is spark-submit --master yarn --num-executors 50 --executor-memory 5G.
A single RDD has ~140 columns and covers one hour of data, so a week is a union of 168(=7*24) RDDs.
Spark very often suffers from out-of-memory errors when scaling up. In these cases, fine tuning should be done by the programmer. Also recheck your code to make sure you are not doing anything excessive, such as collecting all of the big data in the driver, which is very likely to exceed the memoryOverhead limit no matter how large you set it.
To understand what is happening, you should know when YARN decides to kill a container for exceeding memory limits: that happens when the container goes beyond the memoryOverhead limit.
In the Scheduler you can check the Event Timeline to see what happened to the containers. If YARN has killed a container, it will appear red, and when you hover over or click on it you will see a message like:
Container killed by YARN for exceeding memory limits. 16.9 GB of 16 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
So in that case, what you want to focus on is these configuration properties (values are examples on my cluster):
# More executor memory overhead
spark.yarn.executor.memoryOverhead 4096
# More driver memory overhead
spark.yarn.driver.memoryOverhead 8192
# Max on my nodes
#spark.executor.cores 8
#spark.executor.memory 12G
# For the executors
spark.executor.cores 6
spark.executor.memory 8G
# For the driver
spark.driver.cores 6
spark.driver.memory 8G
The first thing to do is to increase the memoryOverhead.
In the driver or in the executors?
When you are overviewing your cluster from the UI, you can click on the Attempt ID and check the Diagnostics Info which should mention the ID of the container that was killed. If it is the same as with your AM Container, then it's the driver, else the executor(s).
That didn't resolve the issue, now what?
You have to fine tune the number of cores and the heap memory you are providing. PySpark does most of its work in off-heap memory, so you do not want to give too much space to the heap, since that would be wasted. You also do not want to give too little, because the garbage collector will then have issues. Recall that these are JVMs.
As described here, a worker can host multiple executors, so the number of cores used affects how much memory every executor has; decreasing the number of cores might therefore help (see the sketch below).
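To make the cores-versus-memory trade-off concrete, a tiny arithmetic illustration using the 8G executor memory from the settings above (illustrative only, not a recommendation):
# With a fixed executor heap, each core runs one task at a time, so fewer cores
# per executor leaves more heap for every concurrently running task.
executor_memory_gb = 8  # matches spark.executor.memory above
for executor_cores in (8, 6, 4, 2):
    per_task = executor_memory_gb / executor_cores
    print(f"{executor_cores} cores -> ~{per_task:.1f} GB of heap per concurrent task")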
I have written about this in more detail in memoryOverhead issue in Spark and Spark – Container exited with a non-zero exit code 143, mostly so that I won't forget! Another option would be spark.default.parallelism and/or spark.storage.memoryFraction, which, based on my experience, did not help.
You can pass configuration flags as sds mentioned, or like this:
spark-submit --properties-file my_properties
where "my_properties" is something like the attributes I list above.
For non-numerical values, you could do this:
spark-submit --conf spark.executor.memory='4G'
It turned out that the problem was not with Spark, but with YARN.
The solution is to run Spark with
spark-submit --conf spark.yarn.executor.memoryOverhead=1000
(or modify yarn config).