Spill to disk and shuffle write spark - apache-spark

I'm confused about spill to disk and shuffle write. With the default sort shuffle manager, an AppendOnlyMap is used to aggregate and combine partition records, right? Then, when execution memory fills up, Spark sorts the map, spills it to disk, and clears the map for the next spill (if one occurs). My questions are:
What is the difference between spill to disk and shuffle write? Both basically consist of creating files on the local file system and writing records to them.
Assuming they are different: are spilled records sorted because they pass through the map, while shuffle write records are not because they do not pass through the map?
I would expect the total size of the spilled files to equal the size of the shuffle write. Maybe I'm missing something; please help me understand this phase.

Spill to disk and shuffle write are two different things.
Spill to disk: data moves from host RAM to host disk. It happens when there is not enough RAM on the machine, so part of what is held in memory is written to disk.
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
Shuffle write: data moves from one executor to other executor(s). It happens when data needs to move between executors (e.g. due to a join, groupBy, etc.); the data is first written to local shuffle files, which the other executors then fetch.
An edge-case example that might help clear this up:
You have 10 executors
Each executor with 100GB RAM
Data size is 1280MB, and is partitioned into 10 partitions
Each executor holds 128MB of data.
Assuming all the data shares a single key, performing groupByKey will bring all the data into one partition. The shuffle size will be 9*128MB = 1152MB (9 executors will transfer their data to the last executor), and there won't be any spill to disk, as the executor has 100GB of RAM and only about 1.25GB of data.
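The arithmetic of this example can be checked with a few lines of Python. This is only a sketch of the numbers given above: it ignores serialization overhead and the fact that only part of an executor's RAM is available as execution memory.

```python
# Worked arithmetic for the edge-case example (all figures come from the example).
NUM_EXECUTORS = 10
PARTITION_MB = 128            # data initially held by each executor
EXECUTOR_RAM_MB = 100 * 1024  # 100GB of RAM per executor

total_mb = NUM_EXECUTORS * PARTITION_MB

# With a single key, every record ends up on one executor, so the other
# nine executors each send their 128MB partition over the network.
shuffled_mb = (NUM_EXECUTORS - 1) * PARTITION_MB

# The receiving executor ends up holding the whole dataset.
spill_expected = total_mb > EXECUTOR_RAM_MB

print(total_mb, shuffled_mb, spill_expected)  # 1280 1152 False
```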
Regarding AppendOnlyMap :
As written in the AppendOnlyMap code (see above) - this function is
a low level implementation of a simple open hash table optimized for
the append-only use case, where keys are never removed, but the value
for each key may be changed.
The fact that two different modules use the same low-level data structure doesn't mean that they are related at a high level.
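To make "an open hash table optimized for the append-only use case" more concrete, here is a toy Python sketch. It is not Spark's implementation: the class name ToyAppendOnlyMap, the linear probing scheme, and the growth policy are all assumptions chosen for illustration, and None cannot be used as a key because it marks an empty slot.

```python
class ToyAppendOnlyMap:
    """Toy open-addressing hash table: keys are never removed, values may change."""

    def __init__(self, capacity=8):
        self._keys = [None] * capacity
        self._values = [None] * capacity
        self._size = 0

    def _slot(self, key):
        # Linear probing: walk forward until we find the key or an empty slot.
        i = hash(key) % len(self._keys)
        while self._keys[i] is not None and self._keys[i] != key:
            i = (i + 1) % len(self._keys)
        return i

    def change_value(self, key, update):
        """Set the value for key to update(had_value, old_value)."""
        i = self._slot(key)
        if self._keys[i] is None:
            self._keys[i] = key
            self._values[i] = update(False, None)
            self._size += 1
            # Keep the table at most half full so probing always terminates.
            if self._size * 2 > len(self._keys):
                self._grow()
        else:
            self._values[i] = update(True, self._values[i])

    def _grow(self):
        # Double the capacity and re-insert every existing entry.
        old = [(k, v) for k, v in zip(self._keys, self._values) if k is not None]
        self._keys = [None] * (len(self._keys) * 2)
        self._values = [None] * len(self._keys)
        for k, v in old:
            i = self._slot(k)
            self._keys[i], self._values[i] = k, v

    def items(self):
        return [(k, v) for k, v in zip(self._keys, self._values) if k is not None]


counts = ToyAppendOnlyMap()
for word in ["a", "b", "a", "c", "a", "b"]:
    counts.change_value(word, lambda had, old: old + 1 if had else 1)
print(sorted(counts.items()))  # [('a', 3), ('b', 2), ('c', 1)]
```

Because there is no delete operation, the table never needs tombstones, which is what keeps a structure like this simple and compact.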


What is spark spill (disk and memory both)?

As per the documentation:
Shuffle spill (memory) is the size of the deserialized form of the shuffled data in memory.
Shuffle spill (disk) is the size of the serialized form of the data on disk.
My understanding of shuffle is this:
1. Every executor takes all the partitions on it and hash-partitions them into 200 new partitions (this 200 can be changed). Each new partition is associated with the executor that it will later go to. For example, for each existing partition: new_partition = hash(partitioning_id) % 200; target_executor = new_partition % num_executors, where % is the modulo operator and num_executors is the number of executors on the cluster.
2. These new partitions are dumped onto the disk of the node of their initial executor. Each new partition will later be read by its target_executor.
3. Target executors pick up their respective new partitions (out of the 200 generated).
Is my understanding of the shuffle operation correct?
Can you help me put the definition of shuffle spill (memory) and shuffle spill (disk) in the context of the shuffle mechanism (the one described above if it is correct)? For example (maybe): "shuffle spill (disk) is the part that is happening in point 2 mentioned above where the 200 partitions are dumped to the disk of their respective nodes" (I do not know if it is correct to say that; just giving an example)
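The model described in the question can be written out in Python as follows. This only restates the question's own formulas: NUM_EXECUTORS is a hypothetical cluster size, and the partition-to-executor rule in particular is the asker's assumption, not a confirmed description of how Spark schedules tasks.

```python
NUM_SHUFFLE_PARTITIONS = 200  # the "200" from the question
NUM_EXECUTORS = 4             # hypothetical cluster size


def new_partition(partitioning_id: int) -> int:
    # Hash partitioning: records with the same id land in the same partition.
    return hash(partitioning_id) % NUM_SHUFFLE_PARTITIONS


def target_executor(partition: int) -> int:
    # The asker's assumed rule for which executor reads a partition.
    return partition % NUM_EXECUTORS


for pid in (7, 207, 1234):
    p = new_partition(pid)
    print(pid, p, target_executor(p))
```

Ids 7 and 207 collide on the same new partition, which illustrates why several distinct keys can share one shuffle partition.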
Let's take a look at the documentation, where we can find this:
Shuffle read: Total shuffle bytes and records read, includes both data read locally and data read from remote executors
This is what your executor loads into memory when stage processing starts. You can think of it as the shuffle files prepared in the previous stage by other executors.
Shuffle write: Bytes and records written to disk in order to be read by a shuffle in a future stage
This is the size of the output of your stage, which may be picked up by the next stage for processing. In other words, it is the size of the shuffle files that this stage created.
And now, what is shuffle spill?
Shuffle spill (memory) is the size of the deserialized form of the shuffled data in memory.
Shuffle spill (disk) is the size of the serialized form of the data on disk.
Shuffle spill happens when your executor is reading shuffle files but they cannot fit into the execution memory of this executor. When this happens, some chunk of data is removed from memory and written to disk (in other words, it is spilled to disk).
Moving back to your question: what is the difference between spill (memory) and spill (disk)? They describe exactly the same chunk of data. The first metric describes the space occupied by the spilled data in memory before it was moved to disk; the second describes its size when written to disk. The two metrics may differ because data may be represented differently when written to disk; for example, it may be compressed.
"Shuffle spill (memory) is the size of the deserialized form of the
data in memory at the time when we spill it, whereas shuffle spill
(disk) is the size of the serialized form of the data on disk after we
spill it. This is why the latter tends to be much smaller than the
former. Note that both metrics are aggregated over the entire duration
of the task (i.e. within each task you can spill multiple times)."
Spill is represented by two values, which are always presented together:
Spill (Memory): the size of the data as it exists in memory before it is spilled.
Spill (Disk): the size of the data after it has been spilled, i.e. serialized, compressed, and written to disk.
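A small Python experiment shows why the two numbers can differ so much for the same data. Here pickle and zlib are only stand-ins for Spark's own serializers and compression codecs, and the in-memory figure is a rough estimate, so treat this as an illustration of the effect rather than a measurement of Spark.

```python
import pickle
import sys
import zlib

# Some repetitive records, as shuffle data often is.
records = [(i % 100, "value-%d" % (i % 10)) for i in range(10_000)]

# Rough "deserialized" footprint: the list object plus every tuple object
# (the contained ints and strings would add even more).
memory_bytes = sys.getsizeof(records) + sum(sys.getsizeof(r) for r in records)

# "Serialized" footprint: what would be written to disk after
# serialization and compression.
disk_bytes = len(zlib.compress(pickle.dumps(records)))

print(memory_bytes, disk_bytes)
```

The serialized and compressed form is much smaller than the in-memory estimate, which matches the observation that spill (disk) tends to be much smaller than spill (memory).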

Repartitioning of large dataset in spark

I have a 20TB file and I want to repartition it in Spark so that each partition is 128MB.
But that works out to n = 20TB/128MB = 156250 partitions.
I believe 156250 is a very big number for
How should I approach repartitioning in this case?
Or should I increase the block size from 128MB to, let's say, 128GB?
But 128GB per task will blow up the executor.
Please help me with this.
Divide and conquer. You don't need to load the whole dataset in one place, because that would cost a huge amount of resources and also put pressure on the network because of the shuffle exchange.
The block size that you are referring to here is an HDFS concept: data is stored by breaking it into chunks (128MB by default), which are then replicated for fault tolerance. If you store your 20TB file on HDFS, it will automatically be broken into 20TB/128MB = 156250 chunks for storage.
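The chunk count is easy to verify. Note that the figure 156250 corresponds to decimal units (1TB = 10^12 bytes); with binary units the count is slightly higher. The helper name num_chunks is just for illustration.

```python
def num_chunks(total_bytes: int, chunk_bytes: int) -> int:
    # Ceiling division: a partial last chunk still counts as a chunk.
    return -(-total_bytes // chunk_bytes)


MB_DEC, TB_DEC = 10**6, 10**12  # decimal units
MIB, TIB = 2**20, 2**40         # binary units

print(num_chunks(20 * TB_DEC, 128 * MB_DEC))  # 156250
print(num_chunks(20 * TIB, 128 * MIB))        # 163840
```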
Coming to the Spark DataFrame repartition: firstly, it is a transformation rather than an action (more information on the differences between the two: https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations). This means that merely calling this function on the DataFrame does nothing unless the DataFrame is eventually used in some action.
Further, the repartition value lets you define the level of parallelism of operations involving the DataFrame, and should mostly be thought of in those terms rather than in terms of the amount of data processed per executor. The aim should be to maximize parallelism given the available resources rather than to process a certain amount of data per executor. The only exception to this rule is when the executor either needs to persist all this data in memory or needs to collect some information from the data that is proportional to the size of the data being processed. The same applies to any executor task running on 128GB of data.

How does an iterator-to-iterator transformation in Spark allow spilling of data to disk?

If an iterator-to-iterator transformation is used in mapPartitions, how does it allow data to be spilled to disk? As I understand it, mapPartitions needs the whole partition in memory to process it, so if I use an iterator-to-iterator transformation, how can data be spilled to disk (given that mapPartitions needs the whole partition in memory)?
It is a misconception that mapPartitions needs the complete data in memory.
mapPartitions is just like map, with the difference that it acts on one partition at a time.
It reads one record after another, sequentially.
It returns once it has processed all the records.
Conceptually, an iterator-to-iterator transformation means defining a
process for evaluating elements one at a time. Thus, Spark can apply
that procedure to batches of records rather than reading an entire
partition into memory or creating a collection with all of the output
records in-memory and then returning it. Consequently,
iterator-to-iterator transformations allow Spark to manipulate
partitions that are too large to fit in memory on a single executor
without out-of-memory errors.
Furthermore, keeping the partition as an iterator allows Spark to use disk space more selectively. Rather than spilling an entire partition when it doesn’t fit in memory, the iterator-to-iterator transformation allows Spark to spill only those records that do not fit in memory, thereby saving disk I/O and the cost of recomputation.
Excerpts from "High Performance Spark"
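The difference between materializing a partition and an iterator-to-iterator transformation can be sketched in plain Python. The names parse_eager, parse_lazy, and CountingSource are hypothetical; CountingSource merely stands in for a partition and counts how many records have been pulled from it.

```python
from typing import Iterable, Iterator


def parse_eager(records: Iterable[str]) -> Iterator[int]:
    # Builds the full output collection in memory before returning anything.
    out = [int(r) for r in records]
    return iter(out)


def parse_lazy(records: Iterable[str]) -> Iterator[int]:
    # Iterator-to-iterator: each record is processed only when requested.
    for r in records:
        yield int(r)


class CountingSource:
    """Stand-in for a partition that counts how many records were read."""

    def __init__(self, n: int):
        self.n = n
        self.pulled = 0

    def __iter__(self):
        for i in range(self.n):
            self.pulled += 1
            yield str(i)


lazy_src = CountingSource(1000)
first_lazy = next(parse_lazy(lazy_src))
print(first_lazy, lazy_src.pulled)    # 0 1

eager_src = CountingSource(1000)
first_eager = next(parse_eager(eager_src))
print(first_eager, eager_src.pulled)  # 0 1000
```

In PySpark, a function shaped like parse_lazy is the kind of thing one would pass to rdd.mapPartitions, since it never needs to hold more than one record at a time.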

Apache Spark running out of memory with smaller amount of partitions

I have a Spark application that keeps running out of memory. The cluster has two nodes with around 30GB of RAM, and the input data size is a few hundred GB.
The application is a Spark SQL job: it reads data from HDFS, creates a table and caches it, then runs some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and got an OOM; I was then able to fix the memory issue by using 1024 partitions. But why did using more partitions solve the OOM issue?
The solution to big data is partitioning (divide and conquer), since not all of the data can fit into memory, nor can it be processed on a single machine.
Each partition can fit into memory and be processed (map) in a relatively short time. After the data for each partition has been processed, it needs to be merged (reduce). This is traditional MapReduce.
Splitting the data into more partitions means that each partition gets smaller.
Spark uses a revolutionary concept called the Resilient Distributed Dataset (RDD).
There are two types of operations: transformations and actions.
Transformations map one RDD to another. They are lazily evaluated. The resulting RDDs can be treated as intermediate results that we don't want to retrieve.
Actions are used when you really want to get the data. The resulting RDDs/data can be treated as what we actually want, such as taking the top failing items.
Spark analyzes all the operations and creates a DAG (Directed Acyclic Graph) before execution.
Spark starts computing from the source RDD when an action is fired, and then forgets it.
I made a small screencast for a presentation on YouTube: Spark Makes Big Data Sparking.
"Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data". The issue with large
partitions generating OOM
Partitions determine the degree of parallelism. The Apache Spark documentation says that the number of partitions should be at least equal to the number of cores in the cluster.
Fewer partitions result in:
Less concurrency
Increased memory pressure for transformations that involve a shuffle
More susceptibility to data skew
Too many partitions might also have a negative impact:
Too much time spent scheduling multiple tasks
If you store your data on HDFS, it will already be partitioned into 64MB or 128MB blocks, as per your HDFS configuration. When reading HDFS files with Spark, the number of DataFrame partitions (df.rdd.getNumPartitions) depends on the following properties:
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
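As far as I understand Spark's file-source planning, these three properties combine roughly as in the sketch below to give the maximum split size. The function name and the example file sizes are hypothetical, and the exact logic varies between Spark versions, so this is an approximation rather than the definitive rule.

```python
MIB = 1024 * 1024


def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MIB,
                    open_cost_in_bytes=4 * MIB):
    # Every file is padded with the "open cost" so that many tiny files
    # are not all packed into a single partition.
    total = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total // default_parallelism
    # Never exceed maxPartitionBytes, never go below the open cost.
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


# Ten 1GiB files on 16 cores: the 128MiB cap applies.
big = max_split_bytes([1024 * MIB] * 10, default_parallelism=16)
# One 64MiB file on 16 cores: splits shrink so that every core gets work.
small = max_split_bytes([64 * MIB], default_parallelism=16)

print(big // MIB, small / MIB)  # 128 4.25
```

So for large inputs the partition count is driven by maxPartitionBytes, while for small inputs it is driven by the available parallelism.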
During Spark Summit, Aaron Davidson gave some tips about tuning partitions. He also defined a reasonable number of partitions, summarized in the three points below:
Commonly between 100 and 10000 partitions (note: the two points below are more reliable, because "commonly" here depends on the sizes of the dataset and the cluster)
lower bound = at least 2 * the number of cores in the cluster
upper bound = tasks should still take at least 100 ms each (otherwise scheduling overhead dominates)
Rockie's answer is right, but it misses the point of your question.
When you cache an RDD, all of its partitions are persisted (according to the storage level), respecting the spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, at some point Spark can automatically drop some partitions from memory (or you can do this manually for an entire RDD with RDD.unpersist()), according to the documentation.
Thus, as you have more partitions, Spark stores fewer partitions in the LRU cache, so that they do not cause an OOM (this may have a negative impact too, such as the need to re-cache partitions).
Another important point is that when you write the result back to HDFS using X partitions, you have X tasks for all your data. Take the total data size and divide it by X: this is the memory needed by each task, each of which is executed on a (virtual) core. So it is not difficult to see that X = 64 leads to an OOM, but X = 1024 does not.
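To put rough numbers on this, here is the division for both partition counts. The 300GB total is an assumed figure (the question only says "a few hundred GB"), and the calculation ignores compression and deserialization overhead.

```python
TOTAL_GB = 300  # hypothetical input size; the question says "a few hundred GB"


def per_task_gb(total_gb: float, partitions: int) -> float:
    # Data each task has to handle if partitions are evenly sized.
    return total_gb / partitions


for partitions in (64, 1024):
    print(partitions, round(per_task_gb(TOTAL_GB, partitions), 2))
# 64 4.69
# 1024 0.29
```

With several tasks running concurrently on a node with around 30GB of RAM, close to 5GB per task leaves little headroom, whereas about 0.3GB per task fits comfortably.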

Spark: Difference between Shuffle Write, Shuffle spill (memory), Shuffle spill (disk)?

I have the following spark job, trying to keep everything in memory:
val myOutRDD = myInRDD.flatMap { fp =>
val tuple2List: ListBuffer[(String, myClass)] = ListBuffer()
}.persist(StorageLevel.MEMORY_ONLY).reduceByKey { (p1, p2) =>
However, when I looked at the job tracker, I still saw a lot of shuffle write and shuffle spill to disk ...
Total task time across all tasks: 49.1 h
Input Size / Records: 21.6 GB / 102123058
Shuffle write: 532.9 GB / 182440290
Shuffle spill (memory): 370.7 GB
Shuffle spill (disk): 15.4 GB
Then the job failed because of "no space left on device" ... I am wondering about the 532.9 GB of shuffle write here: is it written to disk or to memory?
Also, why is 15.4 GB of data still spilled to disk when I specifically asked to keep it in memory?
The persist calls in your code are entirely wasted if you don't access the RDD multiple times. What's the point of storing something if you never access it? Caching has no bearing on shuffle behavior other than you can avoid re-doing shuffles by keeping their output cached.
Shuffle spill is controlled by the spark.shuffle.spill and spark.shuffle.memoryFraction configuration parameters. If spill is enabled (it is by default), then shuffle data will spill to disk once it starts using more memory than the share given by memoryFraction (20% by default). Note that these are legacy settings: since Spark 1.6, unified memory management sizes this region through spark.memory.fraction, and spark.shuffle.spill is ignored.
The metrics are very confusing. My reading of the code is that "Shuffle spill (memory)" is the amount of memory that was freed up as things were spilled to disk. The code for "Shuffle spill (disk)" looks like it's the amount actually written to disk. From the code for "Shuffle write", I think it's the amount written to disk directly, not as a spill from a sorter.
One more note on how to prevent shuffle spill, since I think that is the most important part of the question from a performance aspect (shuffle write, as mentioned above, is a required part of shuffling).
Spilling occurs when, at shuffle read, a reducer cannot fit all of the records assigned to it into the shuffle memory space on its executor. If your shuffle is unbalanced (e.g. some output partitions are much larger than some input partitions), you may have shuffle spill even if the partitions "fit in memory" before the shuffle. The best way to control this is by:
A) balancing the shuffle, e.g. changing your code to reduce before shuffling or shuffling on different keys
B) changing the shuffle memory settings as suggested above
Given the extent of the spill to disk you probably need to do A rather than B.
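Option A, "reduce before shuffling", can be illustrated without Spark. The sketch below counts how many records would cross the shuffle boundary with and without a local (map-side) combine; the three small partitions are made-up data.

```python
from collections import Counter

# Made-up input: three partitions of keys to be counted.
partitions = [
    ["a", "a", "a", "b"],
    ["a", "a", "b", "b"],
    ["a", "c", "c", "c"],
]

# Without a local combine, every record is shuffled.
records_without = sum(len(p) for p in partitions)

# With a local combine, each partition sends one (key, partial_count)
# pair per distinct key.
combined = [Counter(p) for p in partitions]
records_with = sum(len(c) for c in combined)

# The reduce side merges the partial counts.
totals = Counter()
for c in combined:
    totals.update(c)

print(records_without, records_with, dict(totals))
# 12 6 {'a': 6, 'b': 3, 'c': 3}
```

This is the same reason reduceByKey usually shuffles less data than groupByKey followed by a reduction: the former combines values within each partition before the shuffle.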
shuffle data
Shuffle write refers to the data that has been written to your local file system in a temporary cache location. In YARN cluster mode, you can set this location with the "yarn.nodemanager.local-dirs" property in yarn-site.xml. Therefore, "shuffle write" is the size of the data that you've written to the temporary location; "shuffle spill" is more likely your shuffle stage result. In any case, these figures are accumulated.
