Understanding Spark shuffle spill

If I understand correctly, when a reduce task gathers its input shuffle blocks (from the outputs of different map tasks), it first keeps them in memory (Q1). When the executor's shuffle-reserved memory (before the change in memory management (Q2)) is exhausted, the in-memory data is "spilled" to disk. If spark.shuffle.spill.compress is true, that in-memory data is written to disk in compressed form.
My questions:
Q0: Is my understanding correct?
Q1: Is the gathered data inside the reduce task always uncompressed?
Q2: How can I estimate the amount of executor memory available for gathering shuffle blocks?
Q3: I've seen the claim "shuffle spill happens when your dataset cannot fit in memory", but to my understanding, as long as the shuffle-reserved executor memory is big enough to hold all the (uncompressed) shuffle input blocks of all its ACTIVE tasks, no spill should occur. Is that correct?
If so, to avoid spills, does one need to make sure that the (uncompressed) data that ends up in all parallel reduce-side tasks is smaller than the executor's shuffle-reserved memory?

Memory management differs before and after Spark 1.6. In both cases there are notions of execution memory and storage memory. The difference is that before 1.6 the split is static: a configuration parameter specifies how much memory goes to execution and how much to storage, and a spill occurs when either one runs out.
One of the issues that Apache Spark has to work around is the concurrent execution of:
different stages that are executed in parallel
different tasks like aggregation or sorting.
Q0: I'd say that your understanding is correct.
Q1: What's in memory is uncompressed, or else it could not be processed. Execution memory is spilled to disk in blocks and, as you mentioned, can be compressed.
Q2: Since 1.3.1 you can configure it, so you know the size. As for what's left at any moment in time, you can look at the executor process with something like jstat -gcutil <pid> <period>, which might give you a clue of how much memory is free there. Knowing how much memory is configured for storage and execution, and keeping default.parallelism as low as possible, might also give you a clue.
Q3: That's true, but it's hard to reason about: there might be skew in the data such that some keys have more values than others, there are many parallel executions, etc.
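To make the Q2 estimate a bit more concrete, here is a rough back-of-the-envelope sketch in plain Python (no Spark needed). It assumes the default values as I recall them: spark.shuffle.memoryFraction = 0.2 and spark.shuffle.safetyFraction = 0.8 for the static model before 1.6, and 300 MB of reserved memory plus spark.memory.fraction = 0.6 for the unified model in 2.x (it was 0.75 in 1.6). The function names are made up for illustration, and the even per-task split is only an approximation, so check the numbers against the documentation for your version.

```python
RESERVED_MB = 300  # assumed reserved system memory in the unified model (1.6+)


def legacy_shuffle_memory_mb(heap_mb, memory_fraction=0.2, safety_fraction=0.8):
    """Static model (before 1.6): heap * shuffle.memoryFraction * shuffle.safetyFraction."""
    return heap_mb * memory_fraction * safety_fraction


def unified_memory_mb(heap_mb, memory_fraction=0.6):
    """Unified model (1.6+): (heap - reserved) * spark.memory.fraction.

    This pool is shared by execution and storage, so it is an upper bound
    on what execution (shuffle, sort, aggregation) can use.
    """
    return max(heap_mb - RESERVED_MB, 0) * memory_fraction


def per_task_share_mb(pool_mb, active_tasks):
    """Rough even split of a memory pool across concurrently running tasks."""
    if active_tasks < 1:
        raise ValueError("active_tasks must be at least 1")
    return pool_mb / active_tasks


if __name__ == "__main__":
    heap = 4096  # hypothetical 4 GB executor heap
    print("legacy shuffle pool (MB):", legacy_shuffle_memory_mb(heap))
    print("unified pool (MB):", unified_memory_mb(heap))
    print("per task with 4 active tasks (MB):",
          per_task_share_mb(unified_memory_mb(heap), 4))
```

If the uncompressed shuffle input of a task is well above its share, expect spills, keeping in mind the skew caveat from Q3.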


Repartitioning of large dataset in spark

I have a 20TB file and I want to repartition it in Spark so that each partition is 128MB.
By my calculation that gives n = 20TB / 128MB = 156250 partitions.
I believe 156250 is a very big number for
How should I approach repartitioning in this case?
Or should I increase the block size from 128MB to, let's say, 128GB?
But 128GB per task will blow up the executor.
Please help me with this.
Divide and conquer. You don’t need to load the whole dataset in one place, because that would cost a huge amount of resources and also put pressure on the network from the shuffle exchange.
The block size that you are referring to here is an HDFS concept: HDFS stores data by breaking it into chunks (128MB by default) and then replicating them for fault tolerance. If you store your 20TB file on HDFS, it will automatically be broken into 20TB / 128MB = 156250 chunks for storage.
As for the Spark dataframe repartition: first, it is a transformation rather than an action (more information on the difference between the two: https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations). This means that merely calling this function on the dataframe does nothing unless the dataframe is eventually used in some action.
Further, the repartition value lets you define the level of parallelism for operations involving the dataframe, and should mostly be thought of in those terms rather than as the amount of data processed per executor. The aim should be to maximize parallelism given the available resources rather than to process a certain amount of data per executor. The only exception to this rule is when the executor either needs to persist all this data in memory or needs to collect some information from the data that is proportional to the size of the data being processed. The same applies to any executor task running on 128GB of data.
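To put numbers on this, here is a small plain-Python sketch of the arithmetic. It reproduces the 156250 figure from the question (using decimal units) and then looks at it from the parallelism side: how many "waves" of tasks that many partitions would need on a cluster. The 500-core cluster and the function names are made up for illustration.

```python
TB = 10**12  # decimal units, which is what gives 156250 in the question
MB = 10**6


def partition_count(total_bytes, partition_bytes):
    """Number of partitions needed so that none exceeds partition_bytes."""
    return -(-total_bytes // partition_bytes)  # integer ceiling division


def task_waves(num_partitions, total_cores):
    """How many rounds of tasks run if every core processes one task at a time."""
    return -(-num_partitions // total_cores)


n = partition_count(20 * TB, 128 * MB)
waves = task_waves(n, 500)  # 500 cores is a hypothetical cluster size

if __name__ == "__main__":
    print(n)      # 156250
    print(waves)  # 313
```

So a large partition count is not a problem in itself; it mostly means more waves of reasonably small tasks on the cores you have.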

How do you determine shuffle partitions for Spark application?

I am new to spark so am following this amazing tutorial from sparkbyexamples.com and while reading I found this section:
Shuffle partition size & Performance
Based on your dataset size, a number of cores and memory PySpark
shuffling can benefit or harm your jobs. When you dealing with less
amount of data, you should typically reduce the shuffle partitions
otherwise you will end up with many partitioned files with less number
of records in each partition. which results in running many tasks with
lesser data to process.
On other hand, when you have too much of data and having less number
of partitions results in fewer longer running tasks and some times you
may also get out of memory error.
Getting the right size of the shuffle partition is always tricky and
takes many runs with different values to achieve the optimized number.
This is one of the key properties to look for when you have
performance issues on PySpark jobs.
Can someone help me understand how you determine how many shuffle partitions you will need for your job?
As you quoted, it’s tricky, but this is my strategy:
If you’re using “static allocation”, meaning you tell Spark how many executors to allocate for the job, then it’s easy: the number of partitions could be executors * cores per executor * factor. factor = 1 means each core handles one task, factor = 2 means each core handles two tasks, and so on.
If you’re using “dynamic allocation”, then it’s trickier. You can read the long description here: https://databricks.com/blog/2021/03/17/advertising-fraud-detection-at-scale-at-t-mobile.html. The general idea is that you need to answer many questions: what’s the size of your data (how big in gigabytes), what its structure looks like (how many files, how many folders, how many rows, etc.), how you would read it (from HDFS, from Hive, or over JDBC), how many resources you have (cores, executors, memory), and so on. Then you run and benchmark over and over to find the sweet spot that is “just right” for your circumstances.
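For the static allocation case, the rule of thumb is simple enough to write down. This is just a sketch of the formula in plain Python; the function name and the example cluster (10 executors with 4 cores each) are made up for illustration.

```python
def shuffle_partitions(num_executors, cores_per_executor, factor=1):
    """Rule of thumb: executors * cores per executor * factor."""
    if min(num_executors, cores_per_executor, factor) < 1:
        raise ValueError("all arguments must be at least 1")
    return num_executors * cores_per_executor * factor


if __name__ == "__main__":
    # Hypothetical cluster: 10 executors with 4 cores each, factor 2.
    n = shuffle_partitions(10, 4, factor=2)
    print(n)  # 80
    # In a real job the result would then be applied with something like:
    #   spark.conf.set("spark.sql.shuffle.partitions", n)
```

Treat the result as a starting point for benchmarking rather than a final answer.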
Update #1:
So what is the general industry practice: will a company simply use the first tactic and allocate more hardware, or will it use dynamic allocation?
Usually, if you have an on-premise Hadoop environment, you can choose between static (default mode) and dynamic allocation (advanced mode). I often start with dynamic because I have no idea how big the data and its transformations are, so sticking with dynamic gives me the flexibility to expand my work without thinking too much about Spark configuration. But you can also start with static if you want to; nothing prevents you from doing so.
Then eventually, when it comes to productionizing the process, you can again choose between static (very stable but consumes more resources) and dynamic (less stable, i.e. it sometimes fails due to resource allocation, but saves resources).
Finally, most Hadoop cloud solutions (like Databricks) come with dynamic allocation by default, which is less costly.

setting tuning parameters of a spark job

I'm relatively new to spark and I have a few questions related to the tuning optimizations with respect to the spark submit command.
I have followed : How to tune spark executor number, cores and executor memory?
and I understand how to utilise maximum resources out of my spark cluster.
However, I was recently asked how to define the number of executors, memory and cores when I have a relatively small operation to do, since if I give it maximum resources they are going to be underutilised.
For instance,
if I just have to do a merge job (read files from HDFS and write one single huge file back to HDFS using coalesce) on about 60-70 GB of data (in Avro format without compression; assume each file is 128 MB, which is the HDFS block size), what would be the ideal memory, number of executors and cores for this?
Assume I have the configurations of my nodes same as the one mentioned in the link above.
I can't understand the concept of how much memory will be used up by the entire job provided there are no joins, aggregations etc.
The amount of memory you will need depends on what you run before the write operation. If all you're doing is reading data, combining it and writing it out, then you will need very little memory per CPU, because the dataset is never fully materialized before it is written out. If you're doing joins, group-bys or other aggregate operations, all of those will require much more memory. The exception to this rule is that Spark isn't really tuned for large files and is generally much more performant when dealing with sets of reasonably sized files. Ultimately, the best way to get your answers is to run your job with the default parameters and see what blows up.

How does spark behave without enough memory (RAM) to create RDD

When I do sc.textFile("abc.txt")
Spark creates RDD in RAM (memory).
So should the cluster's collective memory be greater than the size of the file “abc.txt”?
My worker nodes have disk space, so could I use disk space while reading the text file to create the RDD? If so, how do I do it?
How to work on big data which doesn’t fit into memory?
When I do sc.textFile("abc.txt") Spark creates RDD in RAM (memory).
The above point is not necessarily true. In Spark, there are transformations and there are actions. sc.textFile("abc.txt") is a lazy operation, like a transformation, and it does not load any data until you trigger an action, e.g. count().
To answer all of your questions together, I would urge you to understand how Spark execution works. There are logical and physical plans. As part of the physical plan, Spark does a cost calculation (a calculation of the available resources across the cluster(s)) before it starts the jobs. If you understand these, you will get a clear idea about all of your questions.
Your first assumption is incorrect:
Spark creates RDD in RAM (memory).
Spark doesn't create RDDs "in-memory". It uses memory but it is not limited to in-memory data processing. So:
So should the cluster's collective memory be greater than the size of the file “abc.txt”?
My worker nodes have disk space, so could I use disk space while reading the text file to create the RDD? If so, how do I do it?
No special steps are required.
How to work on big data which doesn’t fit into memory?
See above.
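To illustrate why the file does not have to fit in memory, here is a plain-Python analogy using generators. It is not Spark code and not how Spark is implemented; it only mimics the idea that building the pipeline reads nothing, and that an action such as count() then streams through the data one record at a time. The function names text_file, count and demo are made up for this sketch.

```python
import os
import tempfile


def text_file(path):
    """Lazily yield lines; nothing is read until the generator is consumed."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def count(records):
    """An 'action': pulls records through the pipeline one at a time."""
    return sum(1 for _ in records)


def demo():
    # Write a tiny stand-in for "abc.txt" to a temporary file.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("spark\n\nshuffle\nspill\n")
        path = tmp.name
    try:
        lines = text_file(path)                     # no I/O has happened yet
        non_empty = (line for line in lines if line)  # still lazy
        return count(non_empty)                     # the read happens here
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())  # 3
```

At no point does the whole file sit in memory, which is roughly why a plain read-and-count job works on files much larger than RAM.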

spark spilling independent of executor memory assigned

I've noticed strange behavior when running a pyspark application with spark 2.0. In the first step in my script involving a reduceByKey (and thus shuffle) operation, I observe that the amount the shuffle writes is roughly in line with my expectations, but that much more spills occur than I had expected. I tried to avoid these spills by increasing the amount of memory assigned per executor up to 8x the original amount, but see basically no difference in the amount spilled. Strangely, I also see that while this stage is running, hardly any of the assigned storage memory is used (as reported in the executors tab in the spark web UI).
I saw this earlier question, which led me to believe that increasing executor memory might help avoid the spills: How to optimize shuffle spill in Apache Spark application. This leads me to believe that some hard limit is leading to the spills, and not the spark.shuffle.memoryFraction parameter. Does such a hard limit exist, possibly among HDFS parameters? Otherwise, what could be done to avoid spills besides increasing executor memory?
Many thanks, R
Spilling behavior in PySpark is controlled using spark.python.worker.memory:
Amount of memory to use per python worker process during aggregation, in the same format as JVM memory strings (e.g. 512m, 2g). If the memory used during aggregation goes above this amount, it will spill the data into disks.
which is set to 512MB by default. Moreover, PySpark uses its own reducing mechanism with External(GroupBy|Sorter|Merger) and exhibits slightly different behavior than its native counterpart.
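As a sketch of how you might try a larger value, here is a small plain-Python helper that builds a spark-submit command line with spark.python.worker.memory raised. The helper name, the script name my_job.py and the 2g / 4g values are placeholders for illustration, not recommendations.

```python
def spark_submit_args(script, python_worker_memory="2g", executor_memory="4g"):
    """Build a spark-submit command line that raises the PySpark spill threshold."""
    return [
        "spark-submit",
        "--conf", f"spark.python.worker.memory={python_worker_memory}",
        "--executor-memory", executor_memory,
        script,
    ]


if __name__ == "__main__":
    # my_job.py is a hypothetical script name.
    print(" ".join(spark_submit_args("my_job.py")))
```

The same property can also be set on the SparkConf before the context is created. Whether a higher value actually removes the spills depends on the data, so it is worth comparing the spill metrics in the web UI before and after.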
