Flink Job on EMR Cluster "GC overhead limit exceeded" - garbage-collection

My EMR Flink job is failing with a "GC overhead limit exceeded" error. The EMR cluster is created within a VPC with the default EMR roles, and Hadoop and Flink are selected from the advanced options (I have tried different versions of both).
The custom JAR is submitted as an EMR step with a set of arguments, and the job reads its data from an Aurora database.
Problem: The job executes successfully when the read request returns a small number of rows from Aurora, but once the row count climbs into the millions I start getting "GC overhead limit exceeded". I am using a JDBC driver for the Aurora connection. On my local machine I don't face any error, and everything works regardless of the size of the read request.
The exact error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
-XX:OnOutOfMemoryError="kill -9 %p"
Executing /bin/sh -c "kill -9 8344"...
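As an illustrative aside on the read path (not something from my original setup): a plain JDBC scan can buffer the whole result set in a single task's heap. Below is a rough sketch of capping that with a fetch size using the Flink JDBC connector through PyFlink's Table API; the table name, schema and connection values are placeholders, the original job is a custom JAR, and whether the hint is honoured depends on the JDBC driver.
from pyflink.table import EnvironmentSettings, TableEnvironment

# Batch Table API job; the flink-connector-jdbc jar and the Aurora (MySQL) driver
# must be on the classpath. All names below are placeholders.
t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

t_env.execute_sql("""
    CREATE TABLE aurora_source (
        id BIGINT,
        payload STRING
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:mysql://<aurora-endpoint>:3306/<database>',
        'table-name' = '<table>',
        'username' = '<user>',
        'password' = '<password>',
        'scan.fetch-size' = '10000'
    )
""")

# Downstream processing would go here; this just counts rows as a smoke test.
t_env.execute_sql("SELECT COUNT(*) FROM aurora_source").print()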
Solutions tried:
1: I followed the steps in https://aws.amazon.com/premiumsupport/knowledge-center/emr-outofmemory-gc-overhead-limit-error/.
2: I also tried providing Flink configuration at cluster creation time, such as the settings below (see the classification sketch after this list):
taskmanager.heap.mb:13926
jobmanager.heap.mb:13926
taskmanager.memory.preallocate:true
taskmanager.memory.off-heap:true
3: I also tried other options and additional Flink configuration settings, but nothing worked for me.
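For reference, here is a sketch of how the settings from item 2 can be expressed as an EMR configuration classification; the values are the ones listed above, and this list is what would be passed as the cluster's Configurations at creation time.
# EMR "flink-conf" classification carrying the Flink settings from item 2 above.
flink_conf = [
    {
        "Classification": "flink-conf",
        "Properties": {
            "taskmanager.heap.mb": "13926",
            "jobmanager.heap.mb": "13926",
            "taskmanager.memory.preallocate": "true",
            "taskmanager.memory.off-heap": "true",
        },
    }
]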

The problem was with the Hadoop memory, as seen below:
============= Java processes for user hadoop =============
8228 com.amazonaws.elasticmapreduce.statepusher.StatePusher -Dlog4j.defaultInitOverride
4522 aws157.instancecontroller.Main -Xmx1024m -XX:+ExitOnOutOfMemoryError -XX:MinHeapFreeRatio=10 -Dlog4j.defaultInitOverride
=========== End java processes for user hadoop ===========
The following configuration worked for me and solved the problem:
[
  {
    "Classification": "hadoop-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "HADOOP_HEAPSIZE": "10000"
        },
        "Configurations": []
      }
    ]
  }
]
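For completeness, here is a minimal boto3 sketch of creating a cluster with the hadoop-env classification above; the region, release label, instance types and counts are assumptions rather than values from my actual setup.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

response = emr.run_job_flow(
    Name="flink-aurora-cluster",    # hypothetical name
    ReleaseLabel="emr-5.30.1",      # assumed release label
    Applications=[{"Name": "Hadoop"}, {"Name": "Flink"}],
    Configurations=[
        {
            "Classification": "hadoop-env",
            "Properties": {},
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {"HADOOP_HEAPSIZE": "10000"},
                    "Configurations": [],
                }
            ],
        }
    ],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "r5.2xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])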

Related

Why does it take so long for notebooks to run on Databricks Workflows?

I am migrating an ETL implemented in Pentaho to Spark.
For this, I am using Databricks Workflows to orchestrate the Spark notebooks that are part of my new ETL.
The problem I have is that, when launching the job created in Databricks Workflows, the execution times are higher than in Pentaho and higher than expected.
These graphics show the comparison between Databricks and Pentaho execution times. As you can see, Databricks takes much longer than Pentaho. Each of these tables corresponds to a different notebook, and they are executed as parallel tasks inside the job. Each notebook runs several select statements against SQL tables, performs joins (lookups) and transformations, and inserts the final results into a table.
I have tried running one of the notebooks in isolation, outside the job I created in Databricks Workflows. Concretely, I tried "TABLE 10", and it takes 4.04 minutes. I added the following settings in the notebook, and the execution time of "TABLE 10" decreased to 1.99 minutes.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", True)
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", 10000)
sqlContext.setConf("spark.sql.shuffle.partitions", 32)
However, if I re-execute the job with these settings, "TABLE 10" takes 11.35 minutes.
These are the characteristics of the cluster in which I have launched the job:
Databricks Runtime Version: 9.1 LTS (Scala 2.12, Spark 3.1.2)
Worker type: Standard_DS3_v2 (14 GB Memory, 4 Cores, Min worker 2, Max worker 4)
Driver type: Standard_DS3_v2 (14 GB Memory, 4 Cores)
Enable autoscaling: true
It is important to mention that in the "TABLE 10" notebook we read from 8 database tables, do multiple transformations and 3 joins, and finally write 128,497 rows to the database table. The code has not been optimized using cache(), repartition() or groupByKey().
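As an illustrative aside (the table name and JDBC options are hypothetical, and spark is the session a Databricks notebook provides), caching an input that several joins reuse is one of the optimizations the notebook currently skips:
# Hypothetical lookup table reused by several joins; caching it avoids re-reading it
# from the database for every join in the notebook.
customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<server>:1433;databaseName=<db>")
    .option("dbtable", "dbo.Customers")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
    .cache()
)
customers.count()  # materialize the cache once, before the joins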
I have tried changing the characteristics of the cluster to the following:
Databricks Runtime Version: 9.1 LTS (Scala 2.12, Spark 3.1.2)
Worker type: Standard_DS4_v2 (28 GB Memory, 8 Cores, Min worker 1, Max worker 2)
Driver type: Standard_DS4_v2 (28 GB Memory, 8 Cores)
Enable autoscaling: true
These are the results of the execution times for "TABLE 10":
Without any configuration, running the notebook in isolation: 1.65 minutes
With the configuration mentioned above, running the notebook in isolation: 1.53 minutes
Without any configuration, running the notebook in the job: 11.57 minutes
With the configuration mentioned above, running the notebook in the job: 10.08 minutes
Why is there such a big difference in execution times between running the tasks separately and running them in parallel? Should we set the Spark configuration on the cluster rather than in the notebook? Is there any other way to optimize execution times?
EDIT:
This image shows part of the job in Databricks Workflows. Each task corresponds to a different notebook, and these tasks are executed in parallel.
I created the job using a JSON file. First I create the task "Task 1", on which the rest of the tasks depend. Then I create the tasks that are executed in parallel (here I show the creation of the task "Table 3"; it is the same for the rest of the tasks). All the tasks are executed on the same cluster.
{
  "name": "ETL Spark",
  "max_concurrent_runs": 1,
  "tasks": [
    {
      "task_key": "task_1",
      "description": "Executing Task 1",
      "notebook_task": {
        "notebook_path": "task_1",
        "base_parameters": {
          ...
        }
      },
      "existing_cluster_id": "0315-122...",
      "timeout_seconds": 3600,
      "max_retries": 0,
      "retry_on_timeout": true
    },
    {
      "task_key": "table_3",
      "description": "TABLE_3",
      "notebook_task": {
        "notebook_path": "table_3",
        "base_parameters": {
          ...
        }
      },
      "depends_on": [
        {
          "task_key": "task_1"
        }
      ],
      "existing_cluster_id": "0315-122...",
      "timeout_seconds": 3600,
      "max_retries": 0,
      "retry_on_timeout": true
    },
    ...
  ],
  "git_source": {
    "git_url": ...,
    "git_branch": ...,
    "git_provider": ...
  }
}
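For comparison, here is a minimal sketch of creating this kind of job through the Jobs API 2.1 with a dedicated job cluster whose Spark settings are defined once via spark_conf, instead of calling setConf inside each notebook. The host, token and task list are placeholders, and the job cluster replaces the existing_cluster_id used above.
import json
import requests

DATABRICKS_HOST = "https://<workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                            # placeholder

job_spec = {
    "name": "ETL Spark",
    "max_concurrent_runs": 1,
    "job_clusters": [
        {
            "job_cluster_key": "etl_cluster",
            "new_cluster": {
                "spark_version": "9.1.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "autoscale": {"min_workers": 2, "max_workers": 4},
                # Cluster-level Spark settings, applied to every task on this cluster.
                "spark_conf": {
                    "spark.sql.shuffle.partitions": "32",
                    "spark.sql.inMemoryColumnarStorage.compressed": "true",
                    "spark.sql.inMemoryColumnarStorage.batchSize": "10000",
                },
            },
        }
    ],
    "tasks": [
        {
            "task_key": "task_1",
            "notebook_task": {"notebook_path": "task_1"},
            "job_cluster_key": "etl_cluster",
        },
        {
            "task_key": "table_3",
            "notebook_task": {"notebook_path": "table_3"},
            "job_cluster_key": "etl_cluster",
            "depends_on": [{"task_key": "task_1"}],
        },
    ],
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    data=json.dumps(job_spec),
)
print(resp.json())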

Experiencing very low CPU utilization of Spark jobs on AWS EMR

We have a Spark job that reads a CSV file, applies a series of transformations, and writes the result to an ORC file.
The job breaks into close to 20 stages and runs for around an hour.
Input CSV file size: 10 GB
spark-submit job resource configuration:
driver-memory= 5 GB
num-executors= 2
executor-core= 3
executor-memory= 20 GB
EC2 instance type: r5d.xlarge, i.e. 32 GB memory and 4 vCPUs, with an attached 128 GB EBS volume
The EMR cluster comprises 1 master node and 2 core nodes.
When we run the Spark job with the above cluster configuration, CPU utilization is only around 10-15%.
Our requirement is to maximize the CPU utilization of the EC2 instances for this Spark job.
Any suggestions are appreciated!
AFAIK, if you increase the parallelism, CPU usage will automatically increase. Try using these settings in your Spark job configuration:
num-executors= 4
executor-core= 5
executor-memory= 25 GB
Especially if you increase the number of CPU cores, parallelism will increase.
More than 5 cores per executor is not recommended; this is based on studies showing that more than 5 concurrent threads per executor starts to hamper performance.
spark.dynamicAllocation.enabled could be another option.
spark.default.parallelism = 2 * number of CPUs in total on worker nodes
Make sure that you always use YARN mode.
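As a rough sketch of the parallelism rule of thumb above (the executor count, cores and memory are illustrative and must fit within the vCores and memory YARN advertises on each node; the same keys can also be passed to spark-submit with --conf):
from pyspark.sql import SparkSession

# Illustrative numbers only; size them to what YARN can actually allocate per node.
num_executors = 2
cores_per_executor = 3
default_parallelism = 2 * num_executors * cores_per_executor  # 2 x total executor cores

spark = (
    SparkSession.builder
    .appName("csv-to-orc")
    .config("spark.executor.instances", str(num_executors))
    .config("spark.executor.cores", str(cores_per_executor))
    .config("spark.executor.memory", "20g")
    .config("spark.default.parallelism", str(default_parallelism))
    .config("spark.sql.shuffle.partitions", str(default_parallelism))
    .getOrCreate()
)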
Follow "Using maximizeResourceAllocation" in the AWS docs, where all of these things are discussed in detail:
You can configure your executors to utilize the maximum resources possible on each node in a cluster by using the spark configuration classification to set maximizeResourceAllocation option to true. This EMR-specific option calculates the maximum compute and memory resources available for an executor on an instance in the core instance group. It then sets the corresponding spark-defaults settings based on this information.
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
Further reading
Best practices for successfully managing memory for Apache Spark applications on Amazon EMR
EMR-spark-tuning-demystified

Databricks Spark: java.lang.OutOfMemoryError: GC overhead limit exceeded

I am executing a Spark job on a Databricks cluster. The job is triggered by an Azure Data Factory pipeline and runs at a 15-minute interval; after three or four successful executions it starts failing with the exception "java.lang.OutOfMemoryError: GC overhead limit exceeded".
There are many answers to this kind of question, but in most of those cases the jobs never run at all, whereas in my case the job fails after some previous runs have succeeded.
My data size is less than 20 MB.
My cluster configuration is:
So my question is: what changes should I make to the cluster configuration? And if the issue comes from my code, why does it succeed most of the time? Please advise and suggest a solution.
This is most probably related to the executor memory being a bit low. I'm not sure what the current setting is, or, if it is the default, what the default value is in this particular Databricks distribution. Even when the job passes, a lot of GC activity will be happening because of the low memory, so it will keep failing once in a while. Under the Spark configuration, set spark.executor.memory, along with the related parameters for the number of executors and cores per executor. With spark-submit, the config would be provided as spark-submit --conf spark.executor.memory=1g
You may try increasing the memory of the driver node.
Sometimes the Garbage Collector is not releasing all the loaded objects in the driver's memory.
What you can try is to force the GC to do that. You can do that by executing the following:
spark.catalog.clearCache()
for (id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():
    rdd.unpersist()
    print("Unpersisted {} rdd".format(id))

SPARK: YARN kills containers for exceeding memory limits

We're currently encountering an issue where Spark jobs are seeing a number of containers being killed for exceeding memory limits when running on YARN.
16/11/18 17:58:52 WARN TaskSetManager: Lost task 53.0 in stage 49.0 (TID 32715, XXXXXXXXXX):
ExecutorLostFailure (executor 23 exited caused by one of the running tasks)
Reason: Container killed by YARN for exceeding memory limits. 12.4 GB of 12 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.
The following arguments are being passed via spark-submit:
--executor-memory=6G
--driver-memory=4G
--conf "spark.yarn.executor.memoryOverhead=6G"`
I am using Spark 2.0.1.
We have increased the memoryOverhead to this value after reading several posts about YARN killing containers (e.g. How to avoid Spark executor from getting lost and yarn container killing it due to memory limit?).
Given my parameters and the log message it does seem that "Yarn kills executors when its memory usage is larger than (executor-memory + executor.memoryOverhead)".
It is not practical to continue increasing this overhead in the hope that eventually we find a value at which these errors do not occur. We are seeing this issue on several different jobs. I would appreciate any suggestions as to parameters I should change, things I should check, where I should start looking to debug this etc. Am able to provide further config options etc.
You can reduce the memory usage with the following configurations in spark-defaults.conf:
spark.default.parallelism
spark.sql.shuffle.partitions
And there is a difference when you use more than 2000 partitions for spark.sql.shuffle.partitions. You can see it in Spark's code on GitHub:
private[spark] object MapStatus {
  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
}
I recommend trying more than 2000 partitions as a test. It can sometimes be faster with very large datasets, and according to this your tasks can be as short as 200 ms. The correct configuration is not easy to find, but depending on your workload it can make a difference of hours.
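As a minimal sketch of that test (the value just above the threshold is illustrative, and the same keys can instead go into spark-defaults.conf as suggested above):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-partitions-test")
    .config("spark.sql.shuffle.partitions", "2048")  # > 2000 selects HighlyCompressedMapStatus
    .config("spark.default.parallelism", "2048")
    .getOrCreate()
)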

Why does Yarn on EMR not allocate all nodes to running Spark jobs?

I'm running a job on Apache Spark on Amazon Elastic Map Reduce (EMR). Currently I'm running on emr-4.1.0 which includes Amazon Hadoop 2.6.0 and Spark 1.5.0.
When I start the job, YARN correctly has allocated all the worker nodes to the spark job (with one for the driver, of course).
I have the magic "maximizeResourceAllocation" property set to "true", and the spark property "spark.dynamicAllocation.enabled" also set to "true".
However, if I resize the EMR cluster by adding nodes to the CORE pool of worker machines, YARN only adds some of the new nodes to the Spark job.
For example, this morning I had a job that was using 26 nodes (m3.2xlarge, if that matters) - 1 for the driver, 25 executors. I wanted to speed up the job so I tried adding 8 more nodes. YARN has picked up all of the new nodes, but only allocated 1 of them to the Spark job. Spark did successfully pick up the new node and is using it as an executor, but my question is why is YARN letting the other 7 nodes just sit idle?
It's annoying for obvious reasons - I have to pay for the resources even though they're not being used, and my job hasn't sped up at all!
Anybody know how YARN decides when to add nodes to running spark jobs? What variables come into play? Memory? V-Cores? Anything?
Thanks in advance!
Okay, with the help of #sean_r_owen, I was able to track this down.
The problem was this: when setting spark.dynamicAllocation.enabled to true, spark.executor.instances shouldn't be set - an explicit value for that will override dynamic allocation and turn it off. It turns out that EMR sets it in the background if you do not set it yourself. To get the desired behaviour, you need to explicitly set spark.executor.instances to 0.
For the record, here are the contents of one of the files we pass to the --configurations flag when creating an EMR cluster:
[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    }
  },
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.dynamicAllocation.enabled": "true",
      "spark.executor.instances": "0"
    }
  }
]
This gives us an EMR cluster where Spark uses all the nodes, including added nodes, when running jobs. It also appears to use all/most of the memory and all (?) the cores.
(I'm not entirely sure that it's using all the actual cores; but it is definitely using more than 1 VCore, which it wasn't before, but following Glennie Helles's advice it is now behaving better and using half of the listed VCores, which seems to equal the actual number of cores...)
I observed the same behavior with nearly the same settings using emr-5.20.0. I didn't try adding nodes while the cluster was already running; instead I used TASK nodes (together with just one CORE node). I'm using InstanceFleets to define MASTER, CORE and TASK nodes (with InstanceFleets I don't know exactly which InstanceTypes I will get, which is why I don't want to define the number of executors, cores and memory per executor myself, but want that to be maximized/optimized automatically).
With this, it only uses two TASK nodes (probably the first two that are ready to use?) and never scales up while more TASK nodes get provisioned and finish the bootstrap phase.
What made it work in my case was setting the spark.default.parallelism parameter to the total number of cores of my TASK nodes, which is the same number used for the TargetOnDemandCapacity or TargetSpotCapacity of the TASK InstanceFleet:
[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    }
  },
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.dynamicAllocation.enabled": "true",
      "spark.default.parallelism": "<Sum_of_Cores_of_all_TASK_nodes>"
    }
  }
]
For the sake of completeness: I'm using one CORE node and several TASK nodes mainly to make sure the cluster has at least 3 nodes (1 MASTER, 1 CORE and at least one TASK node). Before this I tried using only CORE nodes, but since in my case the number of cores is calculated from the actual task, it was possible to end up with a cluster consisting of just one MASTER and one CORE node. With the maximizeResourceAllocation option, such a cluster runs forever doing nothing, because the executor running the YARN application master occupies that single CORE node completely.
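A small sketch of that calculation (the capacity number is illustrative; as described above, the TASK fleet's target capacity is chosen to equal the total vCores of the TASK nodes, and the same number is reused for spark.default.parallelism):
# Derive the spark-defaults classification from the TASK InstanceFleet target before
# creating the cluster; the capacity value below is illustrative.
task_fleet_target_capacity = 32  # TargetOnDemandCapacity / TargetSpotCapacity

spark_defaults = {
    "Classification": "spark-defaults",
    "Properties": {
        "spark.dynamicAllocation.enabled": "true",
        "spark.default.parallelism": str(task_fleet_target_capacity),
    },
}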
