Memory issue with spark structured streaming - apache-spark

I'm facing memory issues when running a structured streaming query with aggregation and partitioning in Spark 2.2.0:
session
    .readStream()
    .schema(inputSchema)
    .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
    .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
    .csv("s3://test-bucket/input")
    .as(Encoders.bean(TestRecord.class))
    .flatMap(mf, Encoders.bean(TestRecord.class))
    .dropDuplicates("testId", "testName")
    .withColumn("year", functions.date_format(dataset.col("testTimestamp").cast(DataTypes.DateType), "YYYY"))
    .writeStream()
    .option("path", "s3://test-bucket/output")
    .option("checkpointLocation", "s3://test-bucket/checkpoint")
    .trigger(Trigger.ProcessingTime(60, TimeUnit.SECONDS))
    .partitionBy("year")
    .format("parquet")
    .outputMode(OutputMode.Append())
    .queryName("test-stream")
    .start();
During testing I noticed that the amount of used memory increases each time new data arrives, and eventually the executors exit with code 137:
ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed: container_1520214726510_0001_01_000003 on host: ip-10-0-1-153.us-west-2.compute.internal. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal
I've created a heap dump and found that most of the memory is used by org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider, which is referenced from StateStore.
At first glance this looks normal, since that is how Spark keeps aggregation keys in memory. However, I did my testing by renaming files in the source folder so that Spark would pick them up again. Since the input records are the same, all further rows should be rejected as duplicates and memory consumption shouldn't increase, but it does.
Moreover, GC time took more than 30% of the total processing time.
Here is a heap dump taken from an executor running with a smaller amount of memory than in the screenshots above, because when I tried to create a dump from that one, the Java process terminated in the middle of the dump.

Migrating my comment from SPARK-23682, the issue that the asker of this question also filed.
The HDFS-backed state store provider excessively caches multiple versions of the state in memory (100 versions by default). This is addressed by SPARK-24717, after which only two versions of the state (the current one for replay and the new one for update) are maintained in memory. The patch will be available in Spark 2.4.0.

I think the root cause is that you do not use a watermark along with dropDuplicates, so all the state is kept and never dropped.
Have a look at: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication
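To make the effect of a missing watermark concrete, here is a small pure-Python simulation. It is only a toy model, not Spark code and not how Spark implements its state store: the class `DedupState`, the helper `run`, and the integer event times are all made up for illustration. Without a delay threshold the set of remembered keys grows forever; with one, keys older than "max event time minus threshold" can be evicted.

```python
class DedupState:
    """Toy model of streaming deduplication state (not Spark's implementation)."""

    def __init__(self, delay_threshold=None):
        # delay_threshold=None models "no watermark": nothing is ever evicted.
        self.delay_threshold = delay_threshold
        self.seen = {}            # key -> event time of the first occurrence
        self.max_event_time = 0

    def process(self, key, event_time):
        """Return True if the record is emitted, False if it is dropped."""
        self.max_event_time = max(self.max_event_time, event_time)
        if self.delay_threshold is not None:
            watermark = self.max_event_time - self.delay_threshold
            # Keys older than the watermark can no longer receive duplicates.
            self.seen = {k: t for k, t in self.seen.items() if t >= watermark}
            if event_time < watermark:
                return False      # record arrived too late
        if key in self.seen:
            return False          # duplicate
        self.seen[key] = event_time
        return True


def run(delay_threshold, n=1000):
    """Feed n distinct keys with increasing event times; return the state size."""
    state = DedupState(delay_threshold)
    for t in range(n):
        state.process("id-{}".format(t), t)
    return len(state.seen)


print("state size without watermark:", run(None))
print("state size with threshold 10:", run(10))
```

In this model the state without a threshold holds every key ever seen, while the state with a threshold stays bounded by the threshold, which is the behaviour the linked guide describes for deduplication with a watermark.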

Related

Databricks Spark: java.lang.OutOfMemoryError: GC overhead limit exceeded i

I am executing a Spark job on a Databricks cluster. I trigger the job via an Azure Data Factory pipeline that runs at 15-minute intervals. After three or four successful executions, it fails with the exception "java.lang.OutOfMemoryError: GC overhead limit exceeded".
There are many answers to this question already, but in most of those cases the jobs do not run at all, whereas in my case the job fails after several successful runs.
My data size is less than 20 MB only.
My cluster configuration is:
So my question is: what changes should I make to the server configuration? If the issue is coming from my code, then why does it succeed most of the time? Please advise and suggest a solution.
This is most probably caused by executor memory being a bit low. I'm not sure what the current setting is or, if it is the default, what the default value is in this particular Databricks distribution. Even though the job passes, a lot of GC is probably happening because of the low memory, so it will keep failing once in a while. In the Spark configuration, set spark.executor.memory, along with the parameters for the number of executors and cores per executor. With spark-submit the setting would be passed as spark-submit --conf spark.executor.memory=1g
You may try increasing the memory of the driver node.
Sometimes the Garbage Collector is not releasing all the loaded objects in the driver's memory.
What you can try is to force the GC to do that. You can do that by executing the following:
spark.catalog.clearCache()
for (id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():
    rdd.unpersist()
    print("Unpersisted {} rdd".format(id))

The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached

I am trying to analyze a 500 MB dataset in Databricks. The data is stored in an Excel file. The first thing I did was install the Spark Excel package com.crealytics.spark.excel from Maven (latest version, 0.11.1).
These are the parameters of the cluster:
Then I executed the following code in Scala notebook:
val df_spc = spark.read
  .format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .load("dbfs:/FileStore/tables/test.xlsx")
But I got an error about the Java heap size, and then another error: "java.io.IOException: GC overhead limit exceeded". I executed the code again and got yet another error after it had been running for 5 minutes:
The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.
I do not understand why this happens. The dataset is quite small for distributed computing, and the cluster should be large enough to process it. What should I check to solve it?
I got stuck in the same situation, where I was unable to process my 35000-record xlsx file.
Here are the workarounds I tried:
With the free Azure subscription and the 14-day pay-as-you-go mode, you can only process xlsx files with a smaller number of records. In my case, with the trial version, I had to reduce it to 25 records.
I also downgraded the worker type to Standard_F4S (8GB memory, 4 cores, 0.5 DBU) with a 1-worker configuration.
I added the options below:
sqlContext.read.format("com.crealytics.spark.excel").
option("location","filename here...").option("useHeader","true").option("treatEmptyValueAsNulls","true").option("maxRowsInMemory",20).option("inferSchema","true").load("filename here...")
I had this same issue. We reached out to Databricks, who gave us this answer:
"In the past we were able to address this issue by simply restarting a cluster that has been up for a long period of time.
This issue occurs due the fact that JVMs reuse the memory locations too many times and start misbehaving."

Spark Memory Usage Concentrated on Driver / Master

I'm currently developing a Spark (v 2.2.0) Streaming application and am running into issues with the way Spark seems to be allocating work across the cluster. This application is submitted to AWS EMR using client mode, so there is a driver node and a couple of worker nodes. Here is a screenshot of Ganglia that shows memory usage in the last hour:
The left-most node is the "master" or "driver" node, and the other two are worker nodes. There are spikes in the memory usage for all three nodes that correspond to workloads coming in through the stream, but the spikes are not equal (even when scaled to % memory usage). When a large workload comes in, the driver node appears to be overworked, and the job will crash with an error regarding memory:
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x000000053e980000, 674234368, 0) failed; error='Cannot allocate memory' (errno=12)
I've also run into this:
Exception in thread "streaming-job-executor-10" java.lang.OutOfMemoryError: Java heap space
This happens when the master runs out of memory, which is equally confusing, as my understanding is that "client" mode would not use the driver / master node as an executor.
Pertinent details:
As mentioned earlier, this application is submitted in client mode: spark-submit --deploy-mode client --master yarn ....
Nowhere in the program am I running collect or coalesce
Any work that I suspect of being run on a single node (jdbc reads mainly) is repartition'd after completion.
There are a couple of very, very small datasets persisted in memory.
1 x Driver specs: 4 cores, 16GB RAM (m4.xlarge instance)
2 x Worker specs: 4 cores, 30.5GB RAM (r3.xlarge instance)
I have tried both allowing Spark to choose executor size / cores and specifying them manually. Both cases behave the same. (I manually specified 6 executors, 1 core, 9GB RAM)
I'm certainly at a loss here. I'm not sure what is going on in the code to be triggering the driver to hog the workload like this.
The only suspect I can think of is a code snippet similar to the following:
val scoringAlgorithm = HelperFunctions.scoring(_: Row, batchTime)
val rawScored = dataToScore.map(scoringAlgorithm)
Here, a function is being loaded from a static object, and used to map over the Dataset. It is my understanding that Spark will serialize this function across the cluster (re: http://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#passing-functions-to-spark). However perhaps I am mistaken and it is simply running this transformation on the driver.
If anyone has any insight to this issue, I would love to hear it!
I ended up solving this issue. Here's how I addressed it:
I made an incorrect assertion in stating the problem: there was a collect statement at the beginning of the Spark program.
I had a transaction that required collect() to run as it was designed. My assumption was that calling repartition(n) on the resulting data would split the data back amongst the executors in the cluster. From what I can tell, this strategy does not work. Once I re-wrote this line, Spark started behaving as I expected and farming jobs out to worker nodes.
My advice to any lost soul who stumbles across this issue: don't collect unless it's the end of your Spark program. You can not recover from it. Find another way to perform your task. (I ended up switching a SQL transaction from where col in (,,,) syntax to a join on the database.)

"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

I'm running a 5-node Spark cluster on AWS EMR, each node sized m3.xlarge (1 master, 4 slaves). I successfully ran through a 146 MB bzip2-compressed CSV file and ended up with a perfectly aggregated result.
Now I'm trying to process a ~5GB bzip2 CSV file on this cluster but I'm receiving this error:
16/11/23 17:29:53 WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal): ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I'm confused as to why I'm getting a ~10.5GB memory limit on a ~75GB cluster (15GB per m3.xlarge instance)...
Here is my EMR config:
[
  {
    "classification": "spark-env",
    "properties": {},
    "configurations": [
      {
        "classification": "export",
        "properties": {
          "PYSPARK_PYTHON": "python34"
        },
        "configurations": []
      }
    ]
  },
  {
    "classification": "spark",
    "properties": {
      "maximizeResourceAllocation": "true"
    },
    "configurations": []
  }
]
From what I've read, setting the maximizeResourceAllocation property should tell EMR to configure Spark to fully utilize all resources available on the cluster. I.e., I should have ~75GB of memory available... So why am I getting a ~10.5GB memory limit error?
Here is the code I'm running:
def sessionize(raw_data, timeout):
    # https://www.dataiku.com/learn/guide/code/reshaping_data/sessionization.html
    window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
              .orderBy("timestamp"))
    diff = (pyspark.sql.functions.lag(raw_data.timestamp, 1)
            .over(window))
    time_diff = (raw_data.withColumn("time_diff", raw_data.timestamp - diff)
                 .withColumn("new_session", pyspark.sql.functions.when(pyspark.sql.functions.col("time_diff") >= timeout.seconds, 1).otherwise(0)))
    window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
              .orderBy("timestamp")
              .rowsBetween(-1, 0))
    sessions = (time_diff.withColumn("session_id", pyspark.sql.functions.concat_ws("_", "user_id", "site_id", pyspark.sql.functions.sum("new_session").over(window))))
    return sessions

def aggregate_sessions(sessions):
    median = pyspark.sql.functions.udf(lambda x: statistics.median(x))
    aggregated = sessions.groupBy(pyspark.sql.functions.col("session_id")).agg(
        pyspark.sql.functions.first("site_id").alias("site_id"),
        pyspark.sql.functions.first("user_id").alias("user_id"),
        pyspark.sql.functions.count("id").alias("hits"),
        pyspark.sql.functions.min("timestamp").alias("start"),
        pyspark.sql.functions.max("timestamp").alias("finish"),
        median(pyspark.sql.functions.collect_list("foo")).alias("foo"),
    )
    return aggregated

spark_context = pyspark.SparkContext(appName="process-raw-data")
spark_session = pyspark.sql.SparkSession(spark_context)
raw_data = spark_session.read.csv(sys.argv[1],
                                  header=True,
                                  inferSchema=True)
# Windowing doesn't seem to play nicely with TimestampTypes.
#
# Should be able to do this within the ``spark.read.csv`` call, I'd
# think. Need to look into it.
convert_to_unix = pyspark.sql.functions.udf(lambda s: arrow.get(s).timestamp)
raw_data = raw_data.withColumn("timestamp",
                               convert_to_unix(pyspark.sql.functions.col("timestamp")))
sessions = sessionize(raw_data, SESSION_TIMEOUT)
aggregated = aggregate_sessions(sessions)
aggregated.foreach(save_session)
Basically, nothing more than windowing and a groupBy to aggregate the data.
It starts with a few of those errors, and the number of them keeps increasing until the job halts.
I've tried running spark-submit with --conf spark.yarn.executor.memoryOverhead but that doesn't seem to solve the problem either.
I feel your pain..
We had similar issues of running out of memory with Spark on YARN. We have five 64GB, 16 core VMs and regardless of what we set spark.yarn.executor.memoryOverhead to, we just couldn't get enough memory for these tasks -- they would eventually die no matter how much memory we would give them. And this was a relatively straightforward Spark application that was causing this to happen.
We figured out that the physical memory usage was quite low on the VMs but the virtual memory usage was extremely high (despite the logs complaining about physical memory). We set yarn.nodemanager.vmem-check-enabled in yarn-site.xml to false and our containers were no longer killed, and the application appeared to work as expected.
Doing more research, I found the answer to why this happens here: http://web.archive.org/web/20190806000138/https://mapr.com/blog/best-practices-yarn-resource-management/
Since on Centos/RHEL 6 there are aggressive allocation of virtual memory due to OS behavior, you should disable virtual memory checker or increase yarn.nodemanager.vmem-pmem-ratio to a relatively larger value.
That page had a link to a very useful page from IBM: https://web.archive.org/web/20170703001345/https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en
In summary, glibc >= 2.10 changed its memory allocation. And although huge amounts of virtual memory being allocated isn't the end of the world, it doesn't work with the default settings of YARN.
Instead of setting yarn.nodemanager.vmem-check-enabled to false, you could also play with setting the MALLOC_ARENA_MAX environment variable to a low number in hadoop-env.sh. This bug report has helpful information about that: https://issues.apache.org/jira/browse/HADOOP-7154
I recommend reading through both pages -- the information is very handy.
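As a rough illustration of the virtual memory check discussed above: as far as I know, the NodeManager compares a container's virtual memory against its physical allocation multiplied by yarn.nodemanager.vmem-pmem-ratio, which defaults to 2.1. The function names and the sample numbers below are made up for the example; this is a sketch of the arithmetic, not YARN's code.

```python
def vmem_limit_gb(container_pmem_gb, vmem_pmem_ratio=2.1):
    """Virtual memory ceiling for a container, given its physical allocation."""
    return container_pmem_gb * vmem_pmem_ratio


def vmem_check_passes(vmem_used_gb, container_pmem_gb, vmem_pmem_ratio=2.1):
    """True if the container would survive the virtual memory check."""
    return vmem_used_gb <= vmem_limit_gb(container_pmem_gb, vmem_pmem_ratio)


# Hypothetical container: 10 GB physical allocation, 25 GB virtual in use.
print(vmem_limit_gb(10.0))                 # ceiling with the default ratio
print(vmem_check_passes(25.0, 10.0))       # fails with the default ratio
print(vmem_check_passes(25.0, 10.0, 4.0))  # passes with a larger ratio
```

This is why raising the ratio, or disabling the check entirely, stops the kills even though physical memory usage is unchanged.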
If you're not using spark-submit, and you're looking for another way to specify the yarn.nodemanager.vmem-check-enabled parameter mentioned by Duff, here are 2 other ways:
Method 2
If you're using a JSON Configuration file (that you pass to the AWS CLI or to your boto3 script), you'll have to add the following configuration:
[{
  "Classification": "yarn-site",
  "Properties": {
    "yarn.nodemanager.vmem-check-enabled": "false"
  }
}]
Method 3
If you use the EMR console, add the following configuration:
classification=yarn-site,properties=[yarn.nodemanager.vmem-check-enabled=false]
I had the same problem in a huge cluster that I'm working on now. The problem will not be solved by adding memory to the workers. Sometimes during aggregation Spark will use more memory than it has, and the Spark jobs will start to use off-heap memory.
One simple example:
If you have a dataset that you need to reduceByKey, it will sometimes aggregate more data in one worker than in the others, and if this data exceeds the memory of one worker you get that error message.
Adding the option spark.yarn.executor.memoryOverhead will help if you set it to 50% of the memory used for the worker (just as a test to see if it works; you can lower it after more tests).
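To see how the overhead relates to the limit YARN enforces, here is a small sketch of the arithmetic. It assumes the Spark 2.x default for spark.yarn.executor.memoryOverhead of max(10% of executor memory, 384 MB), and it ignores YARN rounding the request up to its allocation increment. The function names and the 8192 MB executor are illustrative only.

```python
def default_overhead_mb(executor_memory_mb, factor=0.10, minimum_mb=384):
    """Assumed Spark 2.x default: 10% of executor memory, at least 384 MB."""
    return max(int(executor_memory_mb * factor), minimum_mb)


def container_request_mb(executor_memory_mb, overhead_mb=None):
    """Memory requested from YARN per executor: heap plus overhead."""
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(executor_memory_mb)
    return executor_memory_mb + overhead_mb


executor_mb = 8192  # hypothetical spark.executor.memory of 8g
print(container_request_mb(executor_mb))                    # default overhead
print(container_request_mb(executor_mb, executor_mb // 2))  # 50% overhead, as suggested above
```

The limit in the "Container killed by YARN" message applies to this per-container total, not to the memory of the whole cluster.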
But you need to understand how Spark handles memory allocation in the cluster:
In the most common setup, Spark uses 75% of the machine's memory. The rest goes to the OS.
Spark has two types of memory during execution. One part is for execution and the other is for storage. Execution memory is used for shuffles, joins, aggregations, etc. Storage memory is used for caching and propagating data across the cluster.
One good thing about this memory allocation: if you are not using the cache in your job, you can let Spark use that storage space for execution, which partly avoids the OOM error. As the Spark documentation puts it:
This design ensures several desirable properties. First, applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills. Second, applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted. Lastly, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring user expertise of how memory is divided internally.
But how can we use that?
You can change some configurations. Add the memoryOverhead configuration to your job call, but consider adding this too: change spark.memory.fraction to 0.8 or 0.85 and reduce spark.memory.storageFraction to 0.35 or 0.2.
Other configurations can help, but you need to check them for your case. See all of these configurations here.
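Here is a rough sketch of what those two settings do to the numbers. It follows the Spark tuning guide's description: the unified region is spark.memory.fraction of (heap minus 300 MB reserved), and spark.memory.storageFraction is the share of that region protected from eviction. The function name and the 10300 MB heap are made up for the example.

```python
RESERVED_MB = 300  # reserved memory, per the Spark tuning guide


def unified_memory_mb(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Split an executor heap into Spark's unified region and the rest."""
    usable = heap_mb - RESERVED_MB
    unified = usable * memory_fraction              # execution + storage
    protected_storage = unified * storage_fraction  # cached blocks immune to eviction
    return {
        "unified": unified,
        "protected_storage": protected_storage,
        "user": usable - unified,                   # user data structures, metadata
    }


heap = 10300  # hypothetical executor heap in MB
print(unified_memory_mb(heap))            # Spark 2.x defaults
print(unified_memory_mb(heap, 0.8, 0.2))  # values suggested above
```

With the suggested values more of the heap goes to the unified region and less of it is reserved for cached data, at the cost of less room for user objects.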
Now, what helped in my case.
I have a cluster with 2.5K workers and 2.5TB of RAM, and we were facing an OOM error like yours. We just increased spark.yarn.executor.memoryOverhead to 2048 and enabled dynamic allocation. When we call the job, we don't set the memory for the workers; we leave that for Spark to decide. We just set the overhead.
But in some tests on my small cluster, changing the size of the execution and storage memory solved the problem.
Try repartition. It worked in my case.
The dataframe was not so big at the very beginning when it was loaded with read.csv(). The data file was 10 MB or so, which may require, say, several hundred MB of memory in total for each processing task in an executor.
I checked the number of partitions at the time, and it was 2.
Then it grew like a snowball during the following operations: joining with other tables and adding new columns. At a certain step I ran into the memory limit issue.
I checked the number of partitions again; it was still 2, derived from the original dataframe I guess.
So I tried to repartition it at the very beginning, and there was no problem anymore.
I have not read much about Spark and YARN yet. What I do know is that there are executors in nodes, and an executor can handle many tasks depending on the resources. My guess is that one partition is atomically mapped to one task, and its volume determines the resource usage. Spark cannot slice a partition if it grows too big.
A reasonable strategy is to decide on the node and container memory first, either 10GB or 5GB. Ideally, either could serve any data processing job; it is just a matter of time. Given the 5GB memory setting, suppose the reasonable number of rows per partition that you find after testing is 1000 (it won't fail at any step during processing); then we could do it as in the following pseudocode:
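ROWS_PER_PARTITION = 1000
# Read (not write) the input; "file_uri" and other_args are placeholders
input_df = spark.read.csv("file_uri", *other_args)
total_rows = input_df.count()
# getNumPartitions() is a method of the underlying RDD
original_num_partitions = input_df.rdd.getNumPartitions()
# Integer division, because repartition() expects an int
num_partitions = max(total_rows // ROWS_PER_PARTITION, original_num_partitions)
input_df = input_df.repartition(num_partitions)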
Hope it helps!
I had the same issue on a small cluster running a relatively small job on Spark 2.3.1.
The job reads a parquet file, removes duplicates using groupBy/agg/first, then sorts and writes a new parquet file. It processed 51 GB of parquet files on 4 nodes (4 vcores, 32 GB RAM).
The job was constantly failing at the aggregation stage. I wrote a bash script to watch the executors' memory usage and found that in the middle of the stage one random executor starts taking double the memory for a few seconds. When I correlated the time of this moment with the GC logs, it matched a full GC that frees a large amount of memory.
At last I understood that the problem is somehow related to GC. ParallelGC and G1 cause this issue constantly, but ConcMarkSweepGC improves the situation. The issue appears only with a small number of partitions. I ran the job on EMR where OpenJDK 64-Bit (build 25.171-b10) was installed. I don't know the root cause of the issue; it could be related to the JVM or the operating system. But it is definitely not related to heap or off-heap usage in my case.
UPDATE1
Tried Oracle HotSpot, the issue is reproduced.

spark on yarn, Container exited with a non-zero exit code 143

I am using HDP 2.5, running spark-submit in yarn cluster mode.
I have tried to generate data using a dataframe cross join,
i.e.
val generatedData = df1.join(df2).join(df3).join(df4)
generatedData.saveAsTable(...)....
The storage level of df1 is MEMORY_AND_DISK.
The storage level of df2, df3 and df4 is MEMORY_ONLY.
df1 has many more records (about 5 million), while df2 to df4 have at most 100 records each.
With this setup, the explain plan shows a BroadcastNestedLoopJoin, which should give better performance.
For some reason it always fails. I don't know how to debug it or where the memory explodes.
Error log output:
16/12/06 19:44:08 WARN YarnAllocator: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
16/12/06 19:44:08 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
16/12/06 19:44:08 ERROR YarnClusterScheduler: Lost executor 1 on hdp4: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
16/12/06 19:44:08 WARN TaskSetManager: Lost task 1.0 in stage 12.0 (TID 19, hdp4): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
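As a side note on reading these exit statuses: by the usual shell convention, a status above 128 means the process was ended by signal number (status - 128). A minimal sketch that decodes this, with a helper name of my own choosing:

```python
import signal


def describe_exit_code(code):
    """Explain an exit status using the 128 + signal-number convention."""
    if code > 128:
        try:
            return "killed by " + signal.Signals(code - 128).name
        except ValueError:
            return "exit status {} (unknown signal {})".format(code, code - 128)
    return "exited with status {}".format(code)


for status in (137, 143):
    print(status, "->", describe_exit_code(status))
```

So 143 corresponds to SIGTERM and 137 to SIGKILL, which matches the "Killed by external signal" line in the log. The exit code by itself does not say who sent the signal or why, so the container logs are still needed to find the cause.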
I didn't see any WARN or ERROR logs before this error.
What is the problem? Where should I look for the memory consumption?
I cannot see anything on the Storage tab of the Spark UI.
The log was taken from the YARN resource manager UI on HDP 2.5.
EDIT
Looking at the container log, it seems to be a java.lang.OutOfMemoryError: GC overhead limit exceeded
I know how to increase the memory, but I don't have any more memory available.
How can I do a cartesian / product join with 4 Dataframes without getting this error?
I also ran into this problem and tried to solve it by referring to some blog posts.
1. Run Spark with the conf below:
--conf 'spark.driver.extraJavaOptions=-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps' \
--conf 'spark.executor.extraJavaOptions=-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC ' \
When the JVM runs a GC, you will get a message like the following:
Heap after GC invocations=157 (full 98):
PSYoungGen total 940544K, used 853456K [0x0000000781800000, 0x00000007c0000000, 0x00000007c0000000)
eden space 860160K, 99% used [0x0000000781800000,0x00000007b5974118,0x00000007b6000000)
from space 80384K, 0% used [0x00000007b6000000,0x00000007b6000000,0x00000007bae80000)
to space 77824K, 0% used [0x00000007bb400000,0x00000007bb400000,0x00000007c0000000)
ParOldGen total 2048000K, used 2047964K [0x0000000704800000, 0x0000000781800000, 0x0000000781800000)
object space 2048000K, 99% used [0x0000000704800000,0x00000007817f7148,0x0000000781800000)
Metaspace used 43044K, capacity 43310K, committed 44288K, reserved 1087488K
class space used 6618K, capacity 6701K, committed 6912K, reserved 1048576K
}
Both PSYoungGen and ParOldGen are at 99%, so you will get java.lang.OutOfMemoryError: GC overhead limit exceeded if more objects are created.
Try to add more memory for your executor or your driver when more memory resources are available:
--executor-memory 10000m \
--driver-memory 10000m \
In my case, the memory for PSYoungGen was smaller than for ParOldGen, which caused many young objects to be promoted into the ParOldGen area until ParOldGen was full, so a java.lang.OutOfMemoryError: Java heap space error appeared.
Add this conf for the executor:
'spark.executor.extraJavaOptions=-XX:NewRatio=1 -XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps'
-XX:NewRatio=rate
rate = ParOldGen/PSYoungGen
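To see what a given NewRatio means in megabytes, here is a small sketch of the arithmetic. It assumes the whole heap is split between the two generations at exactly old/young = NewRatio; the helper name and the 3000 MB heap are made up for illustration.

```python
def generation_sizes_mb(heap_mb, new_ratio):
    """Return (young, old) sizes for a heap where old / young == new_ratio."""
    young = heap_mb / (new_ratio + 1)
    old = heap_mb - young
    return young, old


heap = 3000  # hypothetical executor heap in MB
print(generation_sizes_mb(heap, 2))  # old generation twice the young one
print(generation_sizes_mb(heap, 1))  # NewRatio=1: both generations equal
```

In the GC log above, ParOldGen (2048000K) is roughly twice PSYoungGen, which looks like a ratio of about 2; setting -XX:NewRatio=1 gives the young generation half of the heap instead.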
It depends. You can try a different GC strategy, such as:
-XX:+UseSerialGC :Serial Collector
-XX:+UseParallelGC :Parallel Collector
-XX:+UseParallelOldGC :Parallel Old collector
-XX:+UseConcMarkSweepGC :Concurrent Mark Sweep
Java Concurrent and Parallel GC
If both step 4 and step 6 are done but you still get the error, you should consider changing your code, for example by reducing the number of iterations in an ML model.
Logs for all containers and the AM are available with:
yarn logs -applicationId application_1480922439133_0845
If you just want the AM logs:
yarn logs -am -applicationId application_1480922439133_0845
If you want to find the containers that ran for this job:
yarn logs -applicationId application_1480922439133_0845 | grep container_e33_1480922439133_0845_02
If you want just a single container's log:
yarn logs -containerId container_e33_1480922439133_0845_02_000002
For these commands to work, log aggregation must be enabled; otherwise you will have to get the logs from the individual server directories.
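If you only have a container ID from the error log, the application ID can be derived from it, because the container ID embeds the cluster timestamp and application number, followed by the attempt number and container number. A minimal sketch, where the helper name and the regular expression are my own:

```python
import re

# container_[e<epoch>_]<clusterTimestamp>_<appNumber>_<attempt>_<container>
_CONTAINER_RE = re.compile(
    r"^container_(?:e\d+_)?(?P<cluster_ts>\d+)_(?P<app>\d+)_(?P<attempt>\d+)_(?P<container>\d+)$"
)


def application_id(container_id):
    """Derive the YARN application ID from a container ID."""
    match = _CONTAINER_RE.match(container_id)
    if match is None:
        raise ValueError("not a recognised container ID: " + container_id)
    return "application_{}_{}".format(match.group("cluster_ts"), match.group("app"))


print(application_id("container_e33_1480922439133_0845_02_000002"))
```

For the container in the log above this gives application_1480922439133_0845, which is the value to pass to -applicationId.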
Update
There is nothing you can do apart from trying swap, but that will degrade performance a lot.
The GC overhead limit error means that GC has been running non-stop in quick succession but was not able to recover much memory. The only reasons for that are either that the code is poorly written and holds a lot of back references (which is doubtful, as you are doing a simple join), or that the memory capacity has been reached.
REASON 1
By default the number of shuffle partitions (spark.sql.shuffle.partitions) is 200. Having too many shuffle partitions increases complexity and the chance of the program crashing. Try controlling the number of shuffle partitions in the Spark session. I changed the count to 5 using the code below.
implicit val sparkSession = org.apache.spark.sql.SparkSession.builder().enableHiveSupport().getOrCreate()
sparkSession.sql("set spark.sql.shuffle.partitions=5")
Additionally, if you are using dataframes and you are not repartitioning the dataframe, the execution may be done in a single executor. If only one executor is running for some time, YARN will shut the other executors down. Later, if more memory is required, YARN tries to bring the other executors back, but sometimes they won't come up, so the process might fail with a memory overflow issue. To overcome this, try repartitioning the dataframe before an action is called.
val df = df_temp.repartition(5)
Note that you might need to change the shuffle and partition counts according to your requirements. In my case the above combination worked.
REASON 2
It can occur because memory is not being cleared in time. For example, suppose you are running a Spark job in Scala that executes a bunch of SQL statements and exports the results to CSV. The data in some Hive tables can be very large, and you have to manage the memory in your code.
For example, consider the code below, where lst_Sqls is a list that contains a set of SQL commands:
lst_Sqls.foreach(sqlCmd => spark.sql(sqlCmd).coalesce(1).write.format("com.databricks.spark.csv").option("delimiter","|").save("s3 path..."))
When you run this command you will sometimes end up seeing the same error. This is because, although Spark clears the memory, it does so lazily, i.e., your loop continues but Spark might clear the memory at some later point.
In such cases, you need to manage the memory in your code, i.e., clear the memory after each command is executed. For this, let us change the code a little. I have commented what each line does in the code below.
import org.apache.spark.storage.StorageLevel

lst_Sqls.foreach(sqlCmd =>
{
  val df = spark.sql(sqlCmd)
  // Keep the result in memory; if memory is full, spill it to disk
  df.persist(StorageLevel.MEMORY_AND_DISK)
  // Export the DataFrame to csv
  df.coalesce(1).write.format("com.databricks.spark.csv").save("s3 path")
  // Clear the memory; blocking = true waits until it is freed before the next iteration
  df.unpersist(blocking = true)
})

Resources