spark spilling independent of executor memory assigned - apache-spark

I've noticed strange behavior when running a pyspark application with spark 2.0. In the first step in my script involving a reduceByKey (and thus shuffle) operation, I observe that the amount the shuffle writes is roughly in line with my expectations, but that much more spills occur than I had expected. I tried to avoid these spills by increasing the amount of memory assigned per executor up to 8x the original amount, but see basically no difference in the amount spilled. Strangely, I also see that while this stage is running, hardly any of the assigned storage memory is used (as reported in the executors tab in the spark web UI).
I saw this earlier question, which led me to believe that increasing executor memory might help avoid the spills: How to optimize shuffle spill in Apache Spark application
. This leads me to believe that some hard limit is leading to the spills, and not the spark.shuffle.memoryFraction parameter. Does such a hard limit exist, possibly among HDFS parameters? Otherwise, what could be done to avoid spills besides increasing executor memory?
Many thanks, R

Spilling behavior in PySpark is controlled using spark.python.worker.memory:
Amount of memory to use per python worker process during aggregation, in the same format as JVM memory strings (e.g. 512m, 2g). If the memory used during aggregation goes above this amount, it will spill the data into disks.
which is by default set to 512MB. Moreover PySpark uses its own reducing mechanism with External(GroupBy|Sorter|Merger) and exhibits slightly different behavior than its native counterpart.

Related

How do you find out exactly what had caused the high GC time for the spark tasks in any given spark stage?

I do have a spark application where in one of the spark stage took most of the time 2.5hrs + . I did a dive deep and found for majority of the tasks the GC time was pretty high 60% of total task execution time.
The question that I have is :
How do i co-relate this piece of spark task with my code ?
enter image description here
How do I identify what part of my spark code written using PySpark had caused the high GC time ? enter image description here
In general what causes high GC time for any given spark task , I want to know ?
High GC means frequent GC or GC taking long time. A few suggestions with the limited info I could gather from the screenshots:
One thing to check is are you caching big rdd/rdd's. Uncaching them as soon as they are no longer required will reduce the memory pressure. Is stage 68 part of first job, uncache unrequired data from previous jobs?
How to figure out which operation is this: Use DAG visualization link on the top of stage,job pages to understand the flow. For SQL use SQL tab on the UI.
Also there are 2000 tasks for ~ 40GB of shuffle data, each task handling 20 MB which is very small. better to have atleast ~128MB per task. tune this parallelism back to default 200 ?
If you can't optimize your code then, use more memory by adding more nodes or nodes with larger memory.
From experience, high GC time is caused by tasks requiring more than the available memory. High GC time is often also accompanied by the tasks spilling to disk (entries in the Memory Spill and Disk Spill columns).
Also, from Learning Spark:
A high GC time signals too many objects on the heap (your executors may be memory-starved).
Damji, Jules S.,Wenig, Brooke,Das, Tathagata,Lee, Denny.
In my experience, a good mitigation is to increase the number of partitions read by the given stage to reduce the memory required by the individual tasks e.g. by decreasing spark.files.maxPartitionBytes when reading files, or increasing spark.sql.shuffle.partitions when joining dataframes.

Why caching small Spark RDDs takes big memory allocation in Yarn?

The RDDs that are cached (in total 8) are not big, only around 30G, however, on Hadoop UI, it shows that the Spark application is taking lots of memory (no active jobs are running), i.e. 1.4T, why so much?
Why it shows around 100 executors (here, i.e. vCores) even when there's no active jobs running?
Also, if cached RDDs are stored across 100 executors, are those executors preserved and no more other Spark apps can use them for running tasks any more? To rephrase the question: will preserving a little memory resource (.cache) in executors prevents other Spark app from leveraging the idle computing resource of them?
Is there any potential Spark config / zeppelin config that can cause this phenomenon?
UPDATE 1
After checking the Spark conf (zeppelin), it seems there's the default (configured by administrator by default) setting for spark.executor.memory=10G, which is probably the reason why.
However, here's a new question: Is it possible to keep only the memory needed for the cached RDDs in each executors and release the rest, instead of holding always the initially set memory spark.executor.memory=10G?
Spark configuration
Perhaps you can try to repartition(n) your RDD to a fewer n < 100 partitions before caching. A ~30GB RDD would probably fit into storage memory of ten 10GB executors. A good overview of Spark memory management can be found here. This way, only those executors that hold cached blocks will be "pinned" to your application, while the rest can be reclaimed by YARN via Spark dynamic allocation after spark.dynamicAllocation.executorIdleTimeout (default 60s).
Q: Is it possible to keep only the memory needed for the cached RDDs in each executors and release the rest, instead of holding always the initially set memory spark.executor.memory=10G?
When Spark uses YARN as its execution engine, YARN allocates the containers of a specified (by application) size -- at least spark.executor.memory+spark.executor.memoryOverhead, but may be even bigger in case of pyspark -- for all the executors. How much memory Spark actually uses inside a container becomes irrelevant, since the resources allocated to a container will be considered off-limits to other YARN applications.
Spark assumes that your data is equally distributed on all the executors and tasks. That's the reason why you set memory per task. So to make Spark to consume less memory, your data has to be evenly distributed:
If you are reading from Parquet files or CSVs, make sure that they have similar sizes. Running repartition() causes shuffling, which if the data is so skewed may cause other problems if executors don't have enough resources
Cache won't help to release memory on the executors because it doesn't redistribute the data
Can you please see under "Event Timeline" on the Stages "how big are the green bars?" Normally that's tied to the data distribution, so that's a way to see how much data is loaded (proportionally) on every task and how much they are doing. As all tasks have same memory assigned, you can see graphically if resources are wasted (in case there are mostly tiny bars and few big bars). A sample of wasted resources can be seen on the image below
There are different ways to create evenly distributed files for your process. I mention some possibilities, but for sure there are more:
Using Hive and DISTRIBUTE BY clause: you need to use a field that is equally balanced in order to create as many files (and with proper size) as expected
If the process creating those files is a Spark process reading from a DB, try to create as many connections as files you need and use a proper field to populate Spark partitions. That is achieved, as explained here and here with partitionColumn, lowerBound, upperBound and numPartitions properties
Repartition may work, but see if coalesce also make sense in your process or in the previous one generating the files you are reading from

setting tuning parameters of a spark job

I'm relatively new to spark and I have a few questions related to the tuning optimizations with respect to the spark submit command.
I have followed : How to tune spark executor number, cores and executor memory?
and I understand how to utilise maximum resources out of my spark cluster.
However, I was recently asked how to define the number of cores, memory and cores when I have a relatively smaller operation to do as if I give maximum resources, it is going to be underutilised .
For instance,
if I have to just do a merge job (read files from hdfs and write one single huge file back to hdfs using coalesce) for about 60-70 GB (assume each file is of 128 mb in size which is the block size of HDFS) of data(in avro format without compression), what would be the ideal memory, no of executor and cores required for this?
Assume I have the configurations of my nodes same as the one mentioned in the link above.
I can't understand the concept of how much memory will be used up by the entire job provided there are no joins, aggregations etc.
The amount of memory you will need depends on what you run before the write operation. If all you're doing is reading data combining it and writing it out, then you will need very little memory per cpu because the dataset is never fully materialized before writing it out. If you're doing joins/group-by/other aggregate operations all of those will require much ore memory. The exception to this rule is that spark isn't really tuned for large files and generally is much more performant when dealing with sets of reasonably sized files. Ultimately the best way to get your answers is to run your job with the default parameters and see what blows up.

"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

I'm running a 5 node Spark cluster on AWS EMR each sized m3.xlarge (1 master 4 slaves). I successfully ran through a 146Mb bzip2 compressed CSV file and ended up with a perfectly aggregated result.
Now I'm trying to process a ~5GB bzip2 CSV file on this cluster but I'm receiving this error:
16/11/23 17:29:53 WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal): ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I'm confused as to why I'm getting a ~10.5GB memory limit on a ~75GB cluster (15GB per 3m.xlarge instance)...
Here is my EMR config:
[
{
"classification":"spark-env",
"properties":{
},
"configurations":[
{
"classification":"export",
"properties":{
"PYSPARK_PYTHON":"python34"
},
"configurations":[
]
}
]
},
{
"classification":"spark",
"properties":{
"maximizeResourceAllocation":"true"
},
"configurations":[
]
}
]
From what I've read, setting the maximizeResourceAllocation property should tell EMR to configure Spark to fully utilize all resources available on the cluster. Ie, I should have ~75GB of memory available... So why am I getting a ~10.5GB memory limit error?
Here is the code I'm running:
def sessionize(raw_data, timeout):
# https://www.dataiku.com/learn/guide/code/reshaping_data/sessionization.html
window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
.orderBy("timestamp"))
diff = (pyspark.sql.functions.lag(raw_data.timestamp, 1)
.over(window))
time_diff = (raw_data.withColumn("time_diff", raw_data.timestamp - diff)
.withColumn("new_session", pyspark.sql.functions.when(pyspark.sql.functions.col("time_diff") >= timeout.seconds, 1).otherwise(0)))
window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
.orderBy("timestamp")
.rowsBetween(-1, 0))
sessions = (time_diff.withColumn("session_id", pyspark.sql.functions.concat_ws("_", "user_id", "site_id", pyspark.sql.functions.sum("new_session").over(window))))
return sessions
def aggregate_sessions(sessions):
median = pyspark.sql.functions.udf(lambda x: statistics.median(x))
aggregated = sessions.groupBy(pyspark.sql.functions.col("session_id")).agg(
pyspark.sql.functions.first("site_id").alias("site_id"),
pyspark.sql.functions.first("user_id").alias("user_id"),
pyspark.sql.functions.count("id").alias("hits"),
pyspark.sql.functions.min("timestamp").alias("start"),
pyspark.sql.functions.max("timestamp").alias("finish"),
median(pyspark.sql.functions.collect_list("foo")).alias("foo"),
)
return aggregated
spark_context = pyspark.SparkContext(appName="process-raw-data")
spark_session = pyspark.sql.SparkSession(spark_context)
raw_data = spark_session.read.csv(sys.argv[1],
header=True,
inferSchema=True)
# Windowing doesn't seem to play nicely with TimestampTypes.
#
# Should be able to do this within the ``spark.read.csv`` call, I'd
# think. Need to look into it.
convert_to_unix = pyspark.sql.functions.udf(lambda s: arrow.get(s).timestamp)
raw_data = raw_data.withColumn("timestamp",
convert_to_unix(pyspark.sql.functions.col("timestamp")))
sessions = sessionize(raw_data, SESSION_TIMEOUT)
aggregated = aggregate_sessions(sessions)
aggregated.foreach(save_session)
Basically, nothing more than windowing and a groupBy to aggregate the data.
It starts with a few of those errors, and towards halting increases in the amount of the same error.
I've tried running spark-submit with --conf spark.yarn.executor.memoryOverhead but that doesn't seem to solve the problem either.
I feel your pain..
We had similar issues of running out of memory with Spark on YARN. We have five 64GB, 16 core VMs and regardless of what we set spark.yarn.executor.memoryOverhead to, we just couldn't get enough memory for these tasks -- they would eventually die no matter how much memory we would give them. And this as a relatively straight-forward Spark application that was causing this to happen.
We figured out that the physical memory usage was quite low on the VMs but the virtual memory usage was extremely high (despite the logs complaining about physical memory). We set yarn.nodemanager.vmem-check-enabled in yarn-site.xml to false and our containers were no longer killed, and the application appeared to work as expected.
Doing more research, I found the answer to why this happens here: http://web.archive.org/web/20190806000138/https://mapr.com/blog/best-practices-yarn-resource-management/
Since on Centos/RHEL 6 there are aggressive allocation of virtual memory due to OS behavior, you should disable virtual memory checker or increase yarn.nodemanager.vmem-pmem-ratio to a relatively larger value.
That page had a link to a very useful page from IBM: https://web.archive.org/web/20170703001345/https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en
In summary, glibc > 2.10 changed its memory allocation. And although huge amounts of virtual memory being allocated isn't the end of the world, it doesn't work with the default settings of YARN.
Instead of setting yarn.nodemanager.vmem-check-enabled to false, you could also play with setting the MALLOC_ARENA_MAX environment variable to a low number in hadoop-env.sh. This bug report has helpful information about that: https://issues.apache.org/jira/browse/HADOOP-7154
I recommend reading through both pages -- the information is very handy.
If you're not using spark-submit, and you're looking for another way to specify the yarn.nodemanager.vmem-check-enabled parameter mentioned by Duff, here are 2 other ways:
Method 2
If you're using a JSON Configuration file (that you pass to the AWS CLI or to your boto3 script), you'll have to add the following configuration:
[{
"Classification": "yarn-site",
"Properties": {
"yarn.nodemanager.vmem-check-enabled": "false"
}
}]
Method 3
If you use the EMR console, add the following configuration:
classification=yarn-site,properties=[yarn.nodemanager.vmem-check-enabled=false]
See,
I had the same problem in a huge cluster that I'm working now. The problem will not be solved to adding memory to the worker. Sometimes in process aggregation spark will use more memory than it has and the spark jobs will start to use off-heap memory.
One simple example is:
If you have a dataset that you need to reduceByKey it will, sometimes, agregate more data in one worker than other, and if this data exeeds the memory of one worker you get that error message.
Adding the option spark.yarn.executor.memoryOverhead will help you if you set for 50% of the memory used for the worker (just for test, and see if it works, you can add less with more tests).
But you need to understand how Spark works with the Memory Allocation in the cluster:
The more common way Spark uses 75% of the machine memory. The rest goes to SO.
Spark has two types of memory during the execution. One part is for execution and the other is the storage. Execution is used for Shuffles, Joins, Aggregations and Etc. The storage is used for caching and propagating data accross the cluster.
One good thing about memory allocation, if you are not using cache in your execution you can set the spark to use that sotorage space to work with execution to avoid in part the OOM error. As you can see this in documentation of spark:
This design ensures several desirable properties. First, applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills. Second, applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted. Lastly, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring user expertise of how memory is divided internally.
But how can we use that?
You can change some configurations, Add the MemoryOverhead configuration to your job call but, consider add this too: spark.memory.fraction change for 0.8 or 0.85 and reduce the spark.memory.storageFraction to 0.35 or 0.2.
Other configurations can help, but it need to check in your case. Se all these configuration here.
Now, what helps in My case.
I have a cluster with 2.5K workers and 2.5TB of RAM. And we were facing OOM error like yours. We just increase the spark.yarn.executor.memoryOverhead to 2048. And we enable the dynamic allocation. And when we call the job, we don't set the memory for the workers, we leave that for the Spark to decide. We just set the Overhead.
But for some tests for my small cluster, changing the size of execution and storage memory. That solved the problem.
Try repartition. It works in my case.
The dataframe was not so big at the very beginning when it was loaded with write.csv(). The data file amounted to be 10 MB or so, as may required say totally several 100 MB memory for each processing task in executor.
I checked the number of partitions to be 2 at the time.
Then it grew like a snowball during the following operations joining with other tables, adding new columns. And then I ran into the memory exceeding limits issue at a certain step.
I checked the number of partitions, it was still 2, derived from the original data frame I guess.
So I tried to repartition it at the very beginning, and there was no problem anymore.
I have not read many materials about Spark and YARN yet. What I do know is that there are executors in nodes. An executor could handle many tasks depending on the resources. My guess is one partition would be atomically mapped to one task. And its volume determines the resource usage. Spark could not slice it if one partition grows too big.
A reasonable strategy is to determine the nodes and container memory first, either 10GB or 5GB. Ideally, both could serve any data processing job, just a matter of time. Given the 5GB memory setting, the reasonable row for one partition you find, say is 1000 after testing (it won't fail any steps during the processing), we could do it as the following pseudo code:
RWS_PER_PARTITION = 1000
input_df = spark.write.csv("file_uri", *other_args)
total_rows = input_df.count()
original_num_partitions = input_df.getNumPartitions()
numPartitions = max(total_rows/RWS_PER_PARTITION, original_num_partitions)
input_df = input_df.repartition(numPartitions)
Hope it helps!
I had the same issue on small cluster running relatively small job on spark 2.3.1.
The job reads parquet file, removes duplicates using groupBy/agg/first then sorts and writes new parquet. It processed 51 GB of parquet files on 4 nodes (4 vcores, 32Gb RAM).
The job was constantly failing on aggregation stage. I wrote bash script watch executors memory usage and found out that in the middle of the stage one random executor starts taking double memory for a few seconds. When I correlated time of this moment with GC logs it matched with full GC that empties big amount of memory.
At last I understood that the problem is related somehow to GC. ParallelGC and G1 causes this issue constantly but ConcMarkSweepGC improves the situation. The issue appears only with small amount of partitions. I ran the job on EMR where OpenJDK 64-Bit (build 25.171-b10) was installed. I don't know the root cause of the issue, it could be related to JVM or operating system. But it is definitely not related to heap or off-heap usage in my case.
UPDATE1
Tried Oracle HotSpot, the issue is reproduced.

Understanding Spark shuffle spill

If I understand correctly, when a reduce task goes about gathering its input shuffle blocks ( from outputs of different map tasks ) it first keeps them in memory ( Q1 ). When the amount of shuffles-reserved memory of an executor ( before the change in memory management ( Q2 ) ) is exhausted, the in-memory data is "spilled" to disk. if spark.shuffle.spill.compress is true then that in-memory data is written to disk in a compressed fashion.
My questions:
Q0: Is my understanding correct?
Q1: Is the gathered data inside the reduce task always uncompressed?
Q2: How can I estimate the amount of executor memory available for gathering shuffle blocks?
Q3: I've seen the claim "shuffle spill happens when your dataset cannot fit in memory", but to my understanding as long as the shuffle-reserved executor memory is big enough to contain all the ( uncompressed ) shuffle input blocks of all its ACTIVE tasks, then no spill should occur, is that correct?
If so, to avoid spills one needs to make sure that the ( uncompressed ) data which ends up in all parallel reduce-side tasks is less than the executor's shuffle-reserved memory part?
There are differences in memory management in before and after 1.6. In both cases, there are notions of execution memory and storage memory. The difference is that before 1.6 it's static. Meaning there is a configuration parameter that specifies how much memory is for execution and for storage. And there is a spill, when either one is not enough.
One of the issues that Apache Spark has to workaround is a concurrent execution of:
different stages that are executed in parallel
different tasks like aggregation or sorting.
​
I'd say that your understanding is correct.
What's in memory is uncompressed or else it cannot be processed. Execution memory is spilled to disk in blocks and as you mentioned can be compressed.
Well, since 1.3.1 you can configure it, then you know the size. As of what's left at any moment in time, you can see that by looking at the executor process with something like jstat -gcutil <pid> <period>. It might give you a clue of how much memory is free there. Knowing how much memory is configured for storage and execution, having as little default.parallelism as possible might give you a clue.
That's true, but it's hard to reason about; there might be skew in the data such as some keys have more values than the others, there are many parallel executions, etc.

Resources