Spark: Difference between Shuffle Write, Shuffle spill (memory), Shuffle spill (disk)? - apache-spark

I have the following spark job, trying to keep everything in memory:
val myOutRDD = myInRDD.flatMap { fp =>
val tuple2List: ListBuffer[(String, myClass)] = ListBuffer()
:
tuple2List
}.persist(StorageLevel.MEMORY_ONLY).reduceByKey { (p1, p2) =>
myMergeFunction(p1,p2)
}.persist(StorageLevel.MEMORY_ONLY)
However, when I looked in to the job tracker, I still have a lot of Shuffle Write and Shuffle spill to disk ...
Total task time across all tasks: 49.1 h
Input Size / Records: 21.6 GB / 102123058
Shuffle write: 532.9 GB / 182440290
Shuffle spill (memory): 370.7 GB
Shuffle spill (disk): 15.4 GB
Then the job failed because "no space left on device" ... I am wondering for the 532.9 GB Shuffle write here, is it written to disk or memory?
Also, why there are still 15.4 G data spill to the disk while I specifically ask to keep them in the memory?
Thanks!

The persist calls in your code are entirely wasted if you don't access the RDD multiple times. What's the point of storing something if you never access it? Caching has no bearing on shuffle behavior other than you can avoid re-doing shuffles by keeping their output cached.
Shuffle spill is controlled by the spark.shuffle.spill and spark.shuffle.memoryFraction configuration parameters. If spill is enabled (it is by default) then shuffle files will spill to disk if they start using more than given by memoryFraction (20% by default).
The metrics are very confusing. My reading of the code is that "Shuffle spill (memory)" is the amount of memory that was freed up as things were spilled to disk. The code for "Shuffle spill (disk)" looks like it's the amount actually written to disk. By the code for "Shuffle write" I think it's the amount written to disk directly — not as a spill from a sorter.

One more note on how to prevent shuffle spill, since I think that is the most important part of the question from a performance aspect (shuffle write, as mentioned above, is a required part of shuffling).
Spilling occurs when the at shuffle read, any reducer cannot fit all of the records assigned to it in memory in the shuffle space on that executor. If your shuffle is unbalanced (e.g. some output partitions are much larger than some input partitions), you may have shuffle spill even if the partitions "fit in memory" before the shuffle. The best way to control this is by
A) balancing the shuffle... e.g changing your code to reduce before shuffling or by shuffling on different keys
or
B) changing the shuffle memory settings as suggested above
Given the extent of the spill to disk you probably need to do A rather than B.

shuffle data
Shuffle write means those data which have written to your local file system in temporary cache location. In yarn cluster mode, you may set this property with attribute "yarn.nodemanager.local-dirs" in yarn-site.xml. Therefor, the "shuffle write" means the size of data which you've written to the temporary location; "Shuffle spill" is more likely your shuffle stage result. Anyway, those figure are accumulated.

Related

What is spark spill (disk and memory both)?

As per the documentation:
Shuffle spill (memory) is the size of the deserialized form of the shuffled data in memory.
Shuffle spill (disk) is the size of the serialized form of the data on disk.
My understanding of shuffle is this:
Every executor takes all the partitions on it and hashpartitions them into 200 new partitions (this 200 can be changed). Each new partition is associated with an executor that it will later on go to. For example: For each existing partition: new_partition = hash(partitioning_id)%200; target_executor = new_partition%num_executors where % is the modulo operator and the num_executors is the number of executors on the cluster.
These new partitions are dumped onto the disk of each node of their initial executors. Each new partitions will, later on, be read by the target_executor
Target executors pick up their respective new partitions (out of the 200 generated)
Is my understanding of the shuffle operation correct?
Can you help me put the definition of shuffle spill (memory) and shuffle spill (disk) in the context of the shuffle mechanism (the one described above if it is correct)? For example (maybe): "shuffle spill (disk) is the part that is happening in point 2 mentioned above where the 200 partitions are dumped to the disk of their respective nodes" (I do not know if it is correct to say that; just giving an example)
Lets take a look at docu where we can find this:
Shuffle read: Total shuffle bytes and records read, includes both data read locally and data read from remote executors
This is what your executor loads into memory when stage processing is starting, you can think about this as shuffle files prepared in previous stage by other executors
Shuffle write: Bytes and records written to disk in order to be read by a shuffle in a future stage
This is size of output of your stage which may be picked up by next stage for processing, in other words this is a size of shuffle files which this stage created
And now what is shuffle spill
Shuffle spill (memory) is the size of the deserialized form of the shuffled data in memory.
Shuffle spill (disk) is the size of the serialized form of the data on disk.
Shuffle spill hapens when your executor is reading shuffle files but they cannot fit into execution memory of this executor. When this happens, some chunk of data is removed from memory and written to disc (its spilled to disc in other words)
Moving back to your question: what is the difference between spill(memory) and spill(disc)? Its describing excatly the same chunk of data. First metric is describing space occupied by those spilled data in memory before they were moved to disc, second is describing their size when written to disc. Those two metrics may be different because data may be represented differently when written to disc, for example they may be compressed.
If you want to read more:
Cloudera questions
"Shuffle spill (memory) is the size of the deserialized form of the
data in memory at the time when we spill it, whereas shuffle spill
(disk) is the size of the serialized form of the data on disk after we
spill it. This is why the latter tends to be much smaller than the
former. Note that both metrics are aggregated over the entire duration
of the task (i.e. within each task you can spill multiple times)."
Medium 1 Medium 2
Spill is represented by two values: (These two values are always presented together.)
Spill (Memory): is the size of the data as it exists in memory before it is spilled.
Spill (Disk): is size of the data that gets spilled, serialized and, written into disk and gets compressed.

Repartitioning of large dataset in spark

I have 20TB file and I want to repartition it in spark with each partition = 128MB.
But after calculating n=20TB/128mb= 156250 partitions.
I believe 156250 is a very big number for
df.repartition(156250)
how should I approach repartitiong in this?
or should I increase the block size from 128mb to let's say 128gb.
but 128 gb per task will explode executor.
Please help me with this.
Divide and conquer it. You don’t need to load all the dataset in one place cause it would cost you huge amount resources and also network pressure because of shuffle exchanging.
The block size that you are referring to here is an HDFS concept related to storing the data by breaking it into chunks (say 128M default) & replicating thereafter for fault tolerance. In case you are storing your 20TB file on HDFS, it will automatically be broken into 20TB/128mb=156250 chunks for storage.
Coming to the Spark dataframe repartition, firstly it is a tranformation rather than an action (more information on the differences between the two: https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations). Which means merely calling this function on the dataframe does nothing unless the dataframe is eventually used in some action.
Further, the repartition value allows you to define the parallelism level of your operation involving the dataframe & should mostly be though upon in those terms rather than the amount of data being processed per executor. The aim should be to maximize parallelism as per the available resources rather than trying to process certain amount of data per executor. The only exception to this rule should be in cases where the executor either needs to persist all this data in memory or collect some information from this data which is proportional to the data size being processed. And the same applies to any executor task running on 128GB of data.

Spark OOM error explanation and alleviation

Sometimes, you will get an OutOfMemoryError not because your RDDs don’t fit in memory, but because the working set of one of your tasks, such as one of the reduce tasks in groupByKey, was too large. Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large. The simplest fix here is to increase the level of parallelism, so that each task’s input set is smaller.
I think it this way, please correct me if I am wrong.
Suppose there are 2 Data Nodes to process the Dataset and both these nodes collectively has a memory of 32GB(16 GB per Data Node). The data set size is 100 GB and let us suppose this data, when read by spark, is partitioned into 10 partitions of 10GB each. It is obvious that the 100GB file cannot be fit into 32 GB RAM at a time. so the partitions have to be loaded into memory and processed in a iterative manner. so I assume as below.
first iteration, 2 partitions, 10GB each are loaded into memory on each data node.
second iteration, 2 partitions, 10GB each are loaded into memory on each data node.
....
....
Fifth iteration, 2 partitions, 10GB each are loaded into memory on each data node.
If this is how the spark is processing, during every iteration, only 2 partitions are loaded into memory. Does that mean, the other partitions which were unable to be accommodated in memory, were read but spilled to disk and they are waiting for the memory to be freed? or those partitions are not read at all and they will be read only when the resources are available. which is true?
During processing if there is a need to groupby/reduceby/join, then it mandates a shuffle. so if one of the shuffle partition is greater than RAM size then the job will fail with OOM error. Example, 10 partitions were processed and shuffled. Now the shuffle partitions are only 4 partitions with 25GB each.
(Default shuffle partitions are 200, but only 4 partitions have the total data remaining are empty.) since the shuffle partition size is greater than 16MB RAM, will the spark job fail? Is my understanding correct?
I understand that, you do not really need that your data fit in memory. Spark processes the data on partition basis. But My question is what if the partition itself is not fitting in memory. Would it still spill the data to disk and start processing or it will fail with OOM error?
The second question I have is, If another spark job(Job2) is triggered during the above spark job(job1) is under execution, and suppose this is also having 100GB file to process with 10 partitions of 10GB each. so when job1 Iteration1 is under execution, there is only 6 MB free slot available in the memory. The job2's partition of 10GB cannot be loaded into memory for processing job2. so will the Job2 wait till the memory is freed up? or will this job also fail with OOM error?
The explanation (bold) is correct.
On your comments:
Unless you explicitly repartition, your partitions will be HDFS block size related, the 128MB size and as many that make up that file.
Then you have number of executors, say 2, per Worker / Data Node. Then max 4 tasks / partitions will be active at any given time.
What would be the point of loading all partitions to memory if you can service at most 4? You would be clogging up the system to the detriment of other Spark Apps. This is all like a normal OS then.
Of course it is a bit more complicated, e.g. if you have 10 Data Nodes and allocation only 2 Executors, there is traffic to move stuff about. Just keeping it simple.
OOM errors only occur if a partition exceeds max partition size. For the rest disk space is needed for spilling.

Spill to disk and shuffle write spark

I'm getting confused about spill to disk and shuffle write. Using the default Sort shuffle manager, we use an appendOnlyMap for aggregating and combine partition records, right? Then when execution memory fill up, we start sorting map, spilling it to disk and then clean up the map for the next spill(if occur), my questions are :
What is the difference between spill to disk and shuffle write? They consist basically in creating file on local file system and also record.
Admit are different, so Spill records are sorted because the are passed through the map, instead shuffle write records no because they don't pass from the map.
I have the idea that the total size of the spilled file, should be equal to the size of the Shuffle write, maybe I'm missing something, please help to understand that phase.
Thanks.
Giorgio
spill to disk and shuffle write are two different things
spill to disk - Data move from Host RAM to Host Disk - is used when there is no enough RAM on your machine, and it place part of its RAM into disk
http://spark.apache.org/faq.html
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
shuffle write - Data move from Executor(s) to another Executor(s) - is used when data needs to move between executors (e.g. due to JOIN, groupBy, etc)
more data can be found here:
https://0x0fff.com/spark-architecture-shuffle/
http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/
An edge case example which might help clearing this issue:
You have 10 executors
Each executor with 100GB RAM
Data size is 1280MB, and is partitioned into 10 partitions
Each executor holds 128MB of data.
Assuming that the data holds one key, Performing groupByKey, will bring all the data into one partition. Shuffle size will be 9*128MB (9 executors will transfer their data into the last executor), and there won't be any spill to disk as the executor has 100GB of RAM and only 1GB of data
Regarding AppendOnlyMap :
As written in the AppendOnlyMap code (see above) - this function is
a low level implementation of a simple open hash table optimized for
the append-only use case, where keys are never removed, but the value
for each key may be changed.
The fact that two different modules uses the same low-level function doesn't mean that those functions are related in hi-level.

How to optimize shuffle spill in Apache Spark application

I am running a Spark streaming application with 2 workers.
Application has a join and an union operations.
All the batches are completing successfully but noticed that shuffle spill metrics are not consistent with input data size or output data size (spill memory is more than 20 times).
Please find the spark stage details in the below image:
After researching on this, found that
Shuffle spill happens when there is not sufficient memory for shuffle data.
Shuffle spill (memory) - size of the deserialized form of the data in memory at the time of spilling
shuffle spill (disk) - size of the serialized form of the data on disk after spilling
Since deserialized data occupies more space than serialized data. So, Shuffle spill (memory) is more.
Noticed that this spill memory size is incredibly large with big input data.
My queries are:
Does this spilling impacts the performance considerably?
How to optimize this spilling both memory and disk?
Are there any Spark Properties that can reduce/ control this huge spilling?
Learning to performance-tune Spark requires quite a bit of investigation and learning. There are a few good resources including this video. Spark 1.4 has some better diagnostics and visualisation in the interface which can help you.
In summary, you spill when the size of the RDD partitions at the end of the stage exceed the amount of memory available for the shuffle buffer.
You can:
Manually repartition() your prior stage so that you have smaller partitions from input.
Increase the shuffle buffer by increasing the memory in your executor processes (spark.executor.memory)
Increase the shuffle buffer by increasing the fraction of executor memory allocated to it (spark.shuffle.memoryFraction) from the default of 0.2. You need to give back spark.storage.memoryFraction.
Increase the shuffle buffer per thread by reducing the ratio of worker threads (SPARK_WORKER_CORES) to executor memory
If there is an expert listening, I would love to know more about how the memoryFraction settings interact and their reasonable range.
To add to the above answer, you may also consider increasing the default number (spark.sql.shuffle.partitions) of partitions from 200 (when shuffle occurs) to a number that will result in partitions of size close to the hdfs block size (i.e. 128mb to 256mb)
If your data is skewed, try tricks like salting the keys to increase parallelism.
Read this to understand spark memory management:
https://0x0fff.com/spark-memory-management/
https://www.tutorialdocs.com/article/spark-memory-management.html

Resources