"GC Overhead limit exceeded" on Hadoop .20 datanode - garbage-collection

I've searched and haven't found much information about Hadoop datanode processes dying due to GC overhead limit exceeded, so I thought I'd post a question.
We are running a test where we need to confirm that our Hadoop cluster can handle having ~3 million files stored on it (currently a 4-node cluster). We are using a 64-bit JVM and we've allocated 8 GB to the namenode. However, as my test program writes more files to DFS, the datanodes start dying off with this error:
Exception in thread "DataNode: [/var/hadoop/data/hadoop/data]" java.lang.OutOfMemoryError: GC overhead limit exceeded
I saw some posts about options (parallel GC?) that I guess can be set in hadoop-env.sh, but I'm not sure of the syntax and I'm kind of a newbie, so I didn't quite grok how it's done.
Thanks for any help here!

Try increasing the memory for the datanode by using this (a Hadoop restart is required for it to take effect):
export HADOOP_DATANODE_OPTS="-Xmx10g"
This will set the heap to 10 GB; you can increase it as per your need.
You can also add this line near the top of the $HADOOP_CONF_DIR/hadoop-env.sh file.

If you are running a MapReduce job from the command line, you can increase the heap using the parameters -D 'mapreduce.map.java.opts=-Xmx1024m' and/or -D 'mapreduce.reduce.java.opts=-Xmx1024m'. Example:
hadoop --config /etc/hadoop/conf jar /usr/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar --conf /etc/hbase/conf/hbase-site.xml -D 'mapreduce.map.java.opts=-Xmx1024m' --hbase-indexer-file $HOME/morphline-hbase-mapper.xml --zk-host 127.0.0.1/solr --collection hbase-collection1 --go-live --log4j /home/cloudera/morphlines/log4j.properties
Note that in some Cloudera documentation, they still use the old parameters mapred.child.java.opts, mapred.map.child.java.opts and mapred.reduce.child.java.opts. These parameters don't work anymore for Hadoop 2 (see What is the relation between 'mapreduce.map.memory.mb' and 'mapred.map.child.java.opts' in Apache Hadoop YARN?).

This post solved the issue for me.
So, the key is to "prepend that environment variable" (the first time I'd seen this Linux command syntax):
HADOOP_CLIENT_OPTS="-Xmx10g" hadoop jar "your.jar" "source.dir" "target.dir"

GC overhead limit exceeded indicates that your (tiny) heap is full.
This is what often happens in MapReduce operations when you process a lot of data. Try this:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m -XX:-UseGCOverheadLimit</value>
</property>
Also, try the following:
Use combiners; the reducers shouldn't get any lists longer than a small multiple of the number of maps.
You can also generate a heap dump on OOME and analyze it with YourKit or a similar tool.

Related

Pyspark: Container exited with a non-zero exit code 143

I have seen various threads on this issue but the solutions given are not working in my case.
The environment is PySpark 2.1.0 with Java 7, and it has enough memory and cores.
I am running a spark-submit job that deals with JSON files. The job runs perfectly fine when the file size is < 200 MB, but beyond that it fails with Container exited with a non-zero exit code 143. I then checked the YARN logs, and the error there is java.lang.OutOfMemoryError: Requested array size exceeds VM limit.
Since the JSON files are not in a format that can be read directly with spark.read.json(), the first step in the application is to read the JSON as a text file into an RDD, apply map and flatMap to convert it into the required format, and then use spark.read.json(rdd) to create the dataframe for further processing. The code is below:
def read_json(self, spark_session, filepath):
    # Read the concatenated JSON objects as plain text
    raw_rdd = spark_session.sparkContext.textFile(filepath)
    # Double the '}{' boundary so that splitting on '}{' leaves each record
    # with its own closing and opening brace
    raw_rdd_add_sep = raw_rdd.map(lambda x: x.replace('}{', '}}{{'))
    raw_rdd_split = raw_rdd_add_sep.flatMap(lambda x: x.split('}{'))
    required_df = spark_session.read.json(raw_rdd_split)
    return required_df
I have tried increasing the memory overhead for the executor and driver with the options spark.driver.memoryOverhead and spark.executor.memoryOverhead, which didn't help.
I have also enabled the off-heap option spark.memory.offHeap.enabled and set the value of spark.memory.offHeap.size.
I have tried setting the JVM memory option with spark.driver.extraJavaOptions=-Xms10g.
None of the above options worked in this scenario. Some of the JSON files are nearly 1 GB, and we need to process ~200 files a day.
Can someone help resolve this issue, please?
Regarding "Container exited with a non-zero exit code 143", it is probably because of the memory problem.
You need to check out on Spark UI if the settings you set is taking effect.
BTW, the proportion for executor.memory:overhead.memory should be about 4:1
I don't know why you change the JVM setting directly spark.driver.extraJavaOptions=-Xms10g, I recommend using --driver-memory 10g instread. e.g.: spark-submit --driver-memory 10G (I remember driver-memory only works with spark-submit sometimes)
from my perspective, you just need to update the four arguments to feed your machine resources:
spark.driver.memoryOverhead ,
spark.executor.memoryOverhead,
spark.driver.memory ,
spark.executor.memory
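A minimal sketch of what this could look like when the SparkSession is built in PySpark; the values and app name are illustrative placeholders rather than recommendations, and on a real cluster these are usually passed via spark-submit or the cluster configuration instead:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("json-preprocess")  # illustrative app name
         # Heap sizes for driver and executors (illustrative values)
         .config("spark.driver.memory", "8g")
         .config("spark.executor.memory", "8g")
         # Off-heap headroom, roughly 1/4 of the heap per the 4:1 guideline above
         .config("spark.driver.memoryOverhead", "2g")
         .config("spark.executor.memoryOverhead", "2g")
         .getOrCreate())
Keep in mind that driver memory only takes effect if it is set before the driver JVM starts, which is another reason to prefer passing it on spark-submit.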

Databricks Spark: java.lang.OutOfMemoryError: GC overhead limit exceeded

I am executing a Spark job on a Databricks cluster. I am triggering the job via an Azure Data Factory pipeline, and it executes at a 15-minute interval. After three or four successful executions, it fails and throws the exception "java.lang.OutOfMemoryError: GC overhead limit exceeded".
Though there are many answers to that question, in most of those cases the jobs never run at all, whereas in my case the job fails after some previous runs succeeded.
My data size is less than 20 MB.
My cluster configuration is:
So my question is: what changes should I make in the server configuration? If the issue comes from my code, then why does it succeed most of the time? Please advise and suggest a solution.
This is most probably related to the executor memory being a bit low. I am not sure what the current setting is, and if it is the default, what the default value is in this particular Databricks distribution. Even though the job passes, there will be a lot of GCs happening because of the low memory, so it will keep failing once in a while. Under the Spark configuration, please provide spark.executor.memory and also some other params related to the number of executors and cores per executor. In spark-submit the config would be provided as spark-submit --conf spark.executor.memory=1g
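If you just want to confirm what the running cluster actually picked up, here is a minimal sketch from a notebook cell (spark is the session Databricks provides; if a key is unset, the fallback string is returned):
# Print the effective executor settings, falling back to "not set" if absent
print(spark.conf.get("spark.executor.memory", "not set"))
print(spark.conf.get("spark.executor.cores", "not set"))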
You may try increasing the memory of the driver node.
Sometimes the garbage collector does not release all the loaded objects in the driver's memory.
What you can try is to force the GC to do that, by executing the following:
spark.catalog.clearCache()
for (id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():
    rdd.unpersist()
    print("Unpersisted {} rdd".format(id))

GC overhead limit exceeded while reading data from MySQL on Spark

I have a > 5 GB table in MySQL. I want to load that table into Spark as a dataframe and create a Parquet file out of it.
This is my python function to do the job:
from pyspark.sql import SparkSession

def import_table(tablename):
    spark = SparkSession.builder.appName(tablename).getOrCreate()
    df = spark.read.format('jdbc').options(
        url="jdbc:mysql://mysql.host.name:3306/dbname?zeroDateTimeBehavior=convertToNull",
        driver="com.mysql.jdbc.Driver",
        dbtable=tablename,
        user="root",
        password="password"
    ).load()
    df.write.parquet("/mnt/s3/parquet-store/%s.parquet" % tablename)
I am running the following script to run my spark app:
./bin/spark-submit ~/mysql2parquet.py --conf "spark.executor.memory=29g" --conf "spark.storage.memoryFraction=0.9" --conf "spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit" --driver-memory 29G --executor-memory 29G
When I run this script on an EC2 instance with 30 GB of RAM, it fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.
Meanwhile, I am only using 1.42 GB of total memory available.
Here is full console output with stack trace: https://gist.github.com/idlecool/5504c6e225fda146df269c4897790097
Here is part of stack trace:
Here is HTOP output:
I am not sure if I am doing something wrong or spark is not meant for this use-case. I hope spark is.
A bit of a crude explanation of Spark's memory management is provided below; you can read more about it in the official documentation, but here is my take:
I believe the option "spark.storage.memoryFraction=0.9" is problematic in your case. Roughly speaking, an executor has three types of memory that can be allocated. First is the storage memory, which you have set to 90% of the executor memory, i.e. about ~27 GB, and which is used to keep persistent datasets.
Second is heap memory, which is used to perform computations and is typically set high for cases where you are doing machine learning or a lot of calculations. This is what is insufficient in your case; your program needs more heap memory, which is what causes this error.
The third type of memory is shuffle memory, which is used for communicating between different partitions. It needs to be set to a high value in cases where you are doing a lot of joins between dataframes/RDDs, or in general anything that requires a large amount of network overhead. It can be configured with the setting "spark.shuffle.memoryFraction".
So basically you can set the memory fractions using these two settings; the rest of the memory available after shuffle and storage goes to the heap.
Since you have such a high storage fraction, the heap memory available to the program is extremely small. You will need to play with these parameters to get an optimal value. Since you are outputting a Parquet file, you will usually need a higher amount of heap space, since the program requires computations for compression. I would suggest the following settings for you (a sketch of applying them follows). The idea is that you are not doing any operations which require a lot of shuffle memory, hence it can be kept small. Also, you do not need such a high amount of storage memory.
"spark.storage.memoryFraction=0.4"
"spark.shuffle.memoryFraction=0.2"
More about this can be read here:
https://spark.apache.org/docs/latest/configuration.html#memory-management
Thanks to Gaurav Dhama for the good explanation; you may also need to set spark.executor.extraJavaOptions to -XX:-UseGCOverheadLimit.
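A minimal sketch, assuming the flag is set when the session is built (it can equally be passed with --conf on spark-submit):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mysql2parquet")
         # Disable the GC overhead limit check on the executors
         .config("spark.executor.extraJavaOptions", "-XX:-UseGCOverheadLimit")
         .getOrCreate())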

"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

I'm running a 5-node Spark cluster on AWS EMR, each node sized m3.xlarge (1 master, 4 slaves). I successfully ran through a 146 MB bzip2-compressed CSV file and ended up with a perfectly aggregated result.
Now I'm trying to process a ~5GB bzip2 CSV file on this cluster but I'm receiving this error:
16/11/23 17:29:53 WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal): ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I'm confused as to why I'm getting a ~10.5 GB memory limit on a ~75 GB cluster (15 GB per m3.xlarge instance)...
Here is my EMR config:
[
  {
    "classification": "spark-env",
    "properties": {},
    "configurations": [
      {
        "classification": "export",
        "properties": {
          "PYSPARK_PYTHON": "python34"
        },
        "configurations": []
      }
    ]
  },
  {
    "classification": "spark",
    "properties": {
      "maximizeResourceAllocation": "true"
    },
    "configurations": []
  }
]
From what I've read, setting the maximizeResourceAllocation property should tell EMR to configure Spark to fully utilize all resources available on the cluster, i.e. I should have ~75 GB of memory available... So why am I getting a ~10.5 GB memory limit error?
Here is the code I'm running:
def sessionize(raw_data, timeout):
    # https://www.dataiku.com/learn/guide/code/reshaping_data/sessionization.html
    window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
              .orderBy("timestamp"))
    diff = (pyspark.sql.functions.lag(raw_data.timestamp, 1)
            .over(window))
    time_diff = (raw_data.withColumn("time_diff", raw_data.timestamp - diff)
                 .withColumn("new_session", pyspark.sql.functions.when(pyspark.sql.functions.col("time_diff") >= timeout.seconds, 1).otherwise(0)))
    window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
              .orderBy("timestamp")
              .rowsBetween(-1, 0))
    sessions = (time_diff.withColumn("session_id", pyspark.sql.functions.concat_ws("_", "user_id", "site_id", pyspark.sql.functions.sum("new_session").over(window))))
    return sessions

def aggregate_sessions(sessions):
    median = pyspark.sql.functions.udf(lambda x: statistics.median(x))
    aggregated = sessions.groupBy(pyspark.sql.functions.col("session_id")).agg(
        pyspark.sql.functions.first("site_id").alias("site_id"),
        pyspark.sql.functions.first("user_id").alias("user_id"),
        pyspark.sql.functions.count("id").alias("hits"),
        pyspark.sql.functions.min("timestamp").alias("start"),
        pyspark.sql.functions.max("timestamp").alias("finish"),
        median(pyspark.sql.functions.collect_list("foo")).alias("foo"),
    )
    return aggregated

spark_context = pyspark.SparkContext(appName="process-raw-data")
spark_session = pyspark.sql.SparkSession(spark_context)

raw_data = spark_session.read.csv(sys.argv[1],
                                  header=True,
                                  inferSchema=True)

# Windowing doesn't seem to play nicely with TimestampTypes.
#
# Should be able to do this within the ``spark.read.csv`` call, I'd
# think. Need to look into it.
convert_to_unix = pyspark.sql.functions.udf(lambda s: arrow.get(s).timestamp)

raw_data = raw_data.withColumn("timestamp",
                               convert_to_unix(pyspark.sql.functions.col("timestamp")))

sessions = sessionize(raw_data, SESSION_TIMEOUT)
aggregated = aggregate_sessions(sessions)
aggregated.foreach(save_session)
Basically, nothing more than windowing and a groupBy to aggregate the data.
It starts with a few of those errors, and the number of identical errors increases until the job halts.
I've tried running spark-submit with --conf spark.yarn.executor.memoryOverhead but that doesn't seem to solve the problem either.
I feel your pain..
We had similar issues of running out of memory with Spark on YARN. We have five 64 GB, 16-core VMs, and regardless of what we set spark.yarn.executor.memoryOverhead to, we just couldn't get enough memory for these tasks; they would eventually die no matter how much memory we gave them. And it was a relatively straightforward Spark application that was causing this to happen.
We figured out that the physical memory usage was quite low on the VMs but the virtual memory usage was extremely high (despite the logs complaining about physical memory). We set yarn.nodemanager.vmem-check-enabled in yarn-site.xml to false and our containers were no longer killed, and the application appeared to work as expected.
Doing more research, I found the answer to why this happens here: http://web.archive.org/web/20190806000138/https://mapr.com/blog/best-practices-yarn-resource-management/
Since on Centos/RHEL 6 there are aggressive allocation of virtual memory due to OS behavior, you should disable virtual memory checker or increase yarn.nodemanager.vmem-pmem-ratio to a relatively larger value.
That page had a link to a very useful page from IBM: https://web.archive.org/web/20170703001345/https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en
In summary, glibc > 2.10 changed its memory allocation. And although huge amounts of virtual memory being allocated isn't the end of the world, it doesn't work with the default settings of YARN.
Instead of setting yarn.nodemanager.vmem-check-enabled to false, you could also play with setting the MALLOC_ARENA_MAX environment variable to a low number in hadoop-env.sh. This bug report has helpful information about that: https://issues.apache.org/jira/browse/HADOOP-7154
I recommend reading through both pages -- the information is very handy.
If you're not using spark-submit, and you're looking for another way to specify the yarn.nodemanager.vmem-check-enabled parameter mentioned by Duff, here are 2 other ways:
Method 2
If you're using a JSON Configuration file (that you pass to the AWS CLI or to your boto3 script), you'll have to add the following configuration:
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.vmem-check-enabled": "false"
    }
  }
]
Method 3
If you use the EMR console, add the following configuration:
classification=yarn-site,properties=[yarn.nodemanager.vmem-check-enabled=false]
See, I had the same problem in a huge cluster that I'm working on now. The problem will not be solved by adding memory to the worker. Sometimes during aggregation Spark will use more memory than it has, and the Spark jobs will start to use off-heap memory.
One simple example: if you have a dataset that you need to reduceByKey, it will sometimes aggregate more data in one worker than in another, and if this data exceeds the memory of that worker you get that error message.
Setting the option spark.yarn.executor.memoryOverhead to 50% of the memory used for the worker will help (just as a test to see if it works; you can use less with more testing).
But you need to understand how Spark handles memory allocation in the cluster:
By default, Spark uses 75% of the machine's memory. The rest goes to the OS.
Spark has two types of memory during execution. One part is for execution and the other is for storage. Execution is used for shuffles, joins, aggregations, etc. Storage is used for caching and propagating data across the cluster.
One good thing about the memory allocation: if you are not using cache in your execution, you can configure Spark to use that storage space for execution, partially avoiding the OOM error. As you can see in the Spark documentation:
This design ensures several desirable properties. First, applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills. Second, applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted. Lastly, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring user expertise of how memory is divided internally.
But how can we use that?
You can change some configurations: add the memoryOverhead configuration to your job call, and consider this too: change spark.memory.fraction to 0.8 or 0.85 and reduce spark.memory.storageFraction to 0.35 or 0.2, as sketched below.
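A minimal sketch, assuming the session is built from PySpark; the values are the suggestions above rather than tuned numbers, and spark.yarn.executor.memoryOverhead is the YARN-era name of the setting (later renamed spark.executor.memoryOverhead):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("aggregation-job")  # illustrative app name
         # Extra off-heap headroom per executor container, in MB
         .config("spark.yarn.executor.memoryOverhead", "2048")
         # Give more of the heap to the unified execution/storage region...
         .config("spark.memory.fraction", "0.8")
         # ...and reserve less of it for cached data
         .config("spark.memory.storageFraction", "0.2")
         .getOrCreate())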
Other configurations can help, but you need to check what applies in your case. See all of these configurations here.
Now, what helped in my case:
I have a cluster with 2.5K workers and 2.5 TB of RAM, and we were facing an OOM error like yours. We just increased spark.yarn.executor.memoryOverhead to 2048 and enabled dynamic allocation. When we call the job, we don't set the memory for the workers; we leave that for Spark to decide and only set the overhead.
For some tests on my small cluster, I changed the sizes of the execution and storage memory instead. That solved the problem.
Try repartitioning. It worked in my case.
The dataframe was not so big at the very beginning, when it was loaded with read.csv(). The data file amounted to 10 MB or so, which might require, say, several hundred MB of memory in total for each processing task in an executor.
I checked that the number of partitions was 2 at the time.
Then it grew like a snowball during the following operations, joining with other tables and adding new columns, and I ran into the memory-exceeding-limits issue at a certain step.
I checked the number of partitions again: it was still 2, derived from the original dataframe I guess.
So I tried to repartition it at the very beginning, and there was no problem anymore.
I have not read much material about Spark and YARN yet. What I do know is that there are executors in nodes, and an executor can handle many tasks depending on the resources. My guess is that one partition is atomically mapped to one task, and its volume determines the resource usage; Spark cannot slice it up if one partition grows too big.
A reasonable strategy is to determine the node and container memory first, say either 10 GB or 5 GB. Ideally, either could serve any data processing job; it's just a matter of time. Given the 5 GB memory setting, suppose the reasonable number of rows per partition you find after testing is 1000 (it won't fail at any step during the processing); then we could do it as in the following pseudo code:
RWS_PER_PARTITION = 1000

# Load the input and check how many partitions it came in with
input_df = spark.read.csv("file_uri", *other_args)
total_rows = input_df.count()
original_num_partitions = input_df.rdd.getNumPartitions()

# Repartition so no partition holds more than RWS_PER_PARTITION rows
numPartitions = max(int(total_rows / RWS_PER_PARTITION), original_num_partitions)
input_df = input_df.repartition(numPartitions)
Hope it helps!
I had the same issue on a small cluster running a relatively small job on Spark 2.3.1.
The job reads a Parquet file, removes duplicates using groupBy/agg/first, then sorts and writes a new Parquet file. It processed 51 GB of Parquet files on 4 nodes (4 vcores, 32 GB RAM each).
The job was constantly failing at the aggregation stage. I wrote a bash script to watch the executors' memory usage and found out that in the middle of the stage one random executor starts taking double memory for a few seconds. When I correlated the time of this moment with the GC logs, it matched a full GC that empties a big amount of memory.
At last I understood that the problem is somehow related to GC. ParallelGC and G1 cause this issue constantly, but ConcMarkSweepGC improves the situation. The issue appears only with a small number of partitions. I ran the job on EMR, where OpenJDK 64-Bit (build 25.171-b10) was installed. I don't know the root cause of the issue; it could be related to the JVM or the operating system. But it is definitely not related to heap or off-heap usage in my case.
UPDATE 1
I tried Oracle HotSpot; the issue is reproduced.
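A minimal sketch of switching the executor GC as described above, assuming the session is built from PySpark (the same flag can be passed with --conf spark.executor.extraJavaOptions=... on spark-submit):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dedup-job")  # illustrative app name
         # Use CMS on the executors instead of the ParallelGC/G1 defaults
         .config("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")
         .getOrCreate())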

Worker doesn't have sufficient memory

I get the following WARN message:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
when I try to run the following Spark task:
spark/bin/spark-submit --master $SPARK_MASTER_URL --executor-memory 8g --driver-memory 8g --name "Test-Task" --class path.to.my.Class myJAR.jar
The master and all workers have enough memory for this task (see picture), but it seems they don't get it allocated.
My setup looks like this:
SparkConf conf = new SparkConf().set("spark.executor.memory", "8g");
When I start my task and then type
ps -fux | more
in my console, it shows me these options:
-Xms512m -Xmx512m
Can anyone tell me what I'm doing wrong?
Edit:
What I am doing:
I have a huge file saved on my master's disk, which is about 5 GB when I load it into memory (it's a map of maps). So I first load the whole map into memory and then give each node a part of this map to process. As I understand it, that's the reason why I also need a lot of memory on my master instance. Maybe that's not a good solution?
To enlarge the heap size of the master node you can set the SPARK_DAEMON_MEMORY environment variable (in spark-env.sh, for instance). But I doubt it will solve your memory allocation problem, since the master node is not loading the data.
I don't understand what your "map of maps" file is. But usually, to process a big file, you make it available to each worker node using a shared folder (NFS) or, better, a distributed file system (HDFS, GlusterFS). Then each worker can read a part of the file and process it. This works as long as the file format is splittable; Spark supports the JSON file format, for instance.
