How Iterator-to-Iterator transformation in Spark allows spilling of Data to disk in Spark.? - apache-spark

If Iterator-to-Iterator transformation is used in MapPartiton then how will it allow spilling of data to disk. As i understand MapPartition needs whole partition in memory to process, but if i use Iterator-to-Iterator then how data can be spilled to disk (despite of fact that MapPartiton needs whole partition in memmory).

This is a wrong notion that MapPartition needs complete data in memory.
MapPartition is just like Map with a difference it acts on a partition at a time.
It will read one record after another sequentially.
It will return once it has processes all the records.
http://bytepadding.com/big-data/spark/spark-map-vs-mappartitions/

Conceptually, an iterator-to-iterator transformation means defining a
process for evaluating elements one at a time. Thus, Spark can apply
that procedure to batches of records rather than reading an entire
partition into memory or creating a collection with all of the output
records in-memory and then returning it. Consequently,
iterator-to-iterator transformations allow Spark to manipulate
partitions that are too large to fit in memory on a single executor
without out memory errors.
Furthermore, keeping the partition as an iterator allows Spark to use disk space more selectively. Rather than spilling an entire partition when it doesn’t fit in memory, the iterator-to-iterator transformation allows Spark to spill only those records that do not fit in memory, thereby saving disk I/O and the cost of recomputation.
Excerpts from "High Performance Spark"

Related

Repartitioning of large dataset in spark

I have 20TB file and I want to repartition it in spark with each partition = 128MB.
But after calculating n=20TB/128mb= 156250 partitions.
I believe 156250 is a very big number for
df.repartition(156250)
how should I approach repartitiong in this?
or should I increase the block size from 128mb to let's say 128gb.
but 128 gb per task will explode executor.
Please help me with this.
Divide and conquer it. You don’t need to load all the dataset in one place cause it would cost you huge amount resources and also network pressure because of shuffle exchanging.
The block size that you are referring to here is an HDFS concept related to storing the data by breaking it into chunks (say 128M default) & replicating thereafter for fault tolerance. In case you are storing your 20TB file on HDFS, it will automatically be broken into 20TB/128mb=156250 chunks for storage.
Coming to the Spark dataframe repartition, firstly it is a tranformation rather than an action (more information on the differences between the two: https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations). Which means merely calling this function on the dataframe does nothing unless the dataframe is eventually used in some action.
Further, the repartition value allows you to define the parallelism level of your operation involving the dataframe & should mostly be though upon in those terms rather than the amount of data being processed per executor. The aim should be to maximize parallelism as per the available resources rather than trying to process certain amount of data per executor. The only exception to this rule should be in cases where the executor either needs to persist all this data in memory or collect some information from this data which is proportional to the data size being processed. And the same applies to any executor task running on 128GB of data.

How can Spark process data that is way larger than Spark storage?

Currently taking a course in Spark and came across the definition of an executor:
Each executor will hold a chunk of the data to be processed. This
chunk is called a Spark partition. It is a collection of rows that
sits on one physical machine in the cluster. Executors are responsible
for carrying out the work assigned by the driver. Each executor is
responsible for two things: (1) execute code assigned by the driver,
(2) report the state of the computation back to the driver
I am wondering what will happen if the storage of the spark cluster is less than the data that needs to be processed? How executors will fetch the data to sit on the physical machine in the cluster?
The same question goes for streaming data, which unbound data. Do Spark save all the incoming data on disk?
The Apache Spark FAQ briefly mentions the two strategies Spark may adopt:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
Although Spark uses all available memory by default, it could be configured to run the jobs only with disk.
In section 2.6.4 Behavior with Insufficient Memory of Matei's PhD dissertation on Spark (An Architecture for Fast and General Data Processing on Large Clusters) benchmarks the performance impact due to the reduced amount of memory available.
In practice, you don't usually persist the source dataframe of 100TB, but only the aggregations or intermediate computations that are reused.

how spark handles out of memory error when cached( MEMORY_ONLY persistence) data does not fit in memory?

I'm new to the spark and i am not able to find clear answer that What happens when a cached data does not fit in memory?
many places i found that If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.
for example:lets say 500 partition is created and say 200 partition didn't cached then again we have to re-compute the remaining 200 partition by re-evaluating the RDD.
If that is the case then OOM error should never occur but it does.What is the reason?
Detailed explanation is highly appreciated.Thanks in advance
There are different ways you can persist in your dataframe in spark.
1)Persist (MEMORY_ONLY)
when you persist data frame with MEMORY_ONLY it will be cached in spark.cached.memory section as deserialized Java objects. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level and can some times cause OOM when the RDD is too big and cannot fit in memory(it can also occur after recalculation effort).
To answer your question
If that is the case then OOM error should never occur but it does.What is the reason?
even after recalculation you need to fit those rdd in memory. if there no space available then GC will try to clean some part and try to allocate it.if not successfully then it will fail with OOM
2)Persist (MEMORY_AND_DISK)
when you persist data frame with MEMORY_AND_DISK it will be cached in spark.cached.memory section as deserialized Java objects if memory is not available in heap then it will be spilled to disk. to tackle memory issues it will spill down some part of data or complete data to disk. (note: make sure to have enough disk space in nodes other no-disk space errors will popup)
3)Persist (MEMORY_ONLY_SER)
when you persist data frame with MEMORY_ONLY_SER it will be cached in spark.cached.memory section as serialized Java objects (one-byte array per partition). this is generally more space-efficient than MEMORY_ONLY but it is a cpu-intensive task because compression is involved (general suggestion here is to use Kyro for serialization) but this still faces OOM issues similar to MEMORY_ONLY.
4)Persist (MEMORY_AND_DISK_SER)
it is similar to MEMORY_ONLY_SER but one difference is when no heap space is available then it will spill RDD array to disk the same as (MEMORY_AND_DISK) ... we can use this option when you have a tight constraint on disk space and you want to reduce IO traffic.
5)Persist (DISK_ONLY)
In this case, heap memory is not used.RDD's are persisted to disk. make sure to have enough disk space and this option will have huge IO overhead. don't use this when you have dataframes that are repeatedly used.
6)Persist (MEMORY_ONLY_2 or MEMORY_AND_DISK_2)
These are similar to above mentioned MEMORY_ONLY and MEMORY_AND_DISK. the only difference is these options replicate each partition on two cluster nodes just to be on the safe side.. use these options when you are using spot instances.
7)Persist (OFF_HEAP)
Off heap memory generally contains thread stacks, spark container application code, network IO buffers, and other OS application buffers. even you can utilize this part of the memory from RAM for caching your RDD with the above option.

How does spark behave without enough memory (RAM) to create RDD

When I do sc.textFile("abc.txt")
Spark creates RDD in RAM (memory).
So does the cluster collective memory should be greater than size of the file “abc.txt”?
My worker nodes have disk space so could I use disk space while reading texfile to create RDD? If so how to do it?
How to work on big data which doesn’t fit into memory?
When I do sc.textFile("abc.txt") Spark creates RDD in RAM (memory).
The above point is not certainly true. In Spark, their is something called transformations and something called actions. sc.textFile("abc.txt") is transformation operation and it does not simply load data straight away unless you trigger any action eg count().
To give you a collective answer to your all questions, I would urge you to understand how spark execution works. Their is something called logical and physical plans.As part of physical plan, it does cost calculation(available resource calculation across the cluster(s)) before it starts the jobs. if you understand them, you will get clear idea on all your questions.
You first assumption is incorrect:
Spark creates RDD in RAM (memory).
Spark doesn't create RDDs "in-memory". It uses memory but it is not limited to in-memory data processing. So:
So does the cluster collective memory should be greater than size of the file “abc.txt”?
No
My worker nodes have disk space so could I use disk space while reading texfile to create RDD? If so how to do it?
No special steps are required.
How to work on big data which doesn’t fit into memory?
See above.

Spill to disk and shuffle write spark

I'm getting confused about spill to disk and shuffle write. Using the default Sort shuffle manager, we use an appendOnlyMap for aggregating and combine partition records, right? Then when execution memory fill up, we start sorting map, spilling it to disk and then clean up the map for the next spill(if occur), my questions are :
What is the difference between spill to disk and shuffle write? They consist basically in creating file on local file system and also record.
Admit are different, so Spill records are sorted because the are passed through the map, instead shuffle write records no because they don't pass from the map.
I have the idea that the total size of the spilled file, should be equal to the size of the Shuffle write, maybe I'm missing something, please help to understand that phase.
Thanks.
Giorgio
spill to disk and shuffle write are two different things
spill to disk - Data move from Host RAM to Host Disk - is used when there is no enough RAM on your machine, and it place part of its RAM into disk
http://spark.apache.org/faq.html
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
shuffle write - Data move from Executor(s) to another Executor(s) - is used when data needs to move between executors (e.g. due to JOIN, groupBy, etc)
more data can be found here:
https://0x0fff.com/spark-architecture-shuffle/
http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/
An edge case example which might help clearing this issue:
You have 10 executors
Each executor with 100GB RAM
Data size is 1280MB, and is partitioned into 10 partitions
Each executor holds 128MB of data.
Assuming that the data holds one key, Performing groupByKey, will bring all the data into one partition. Shuffle size will be 9*128MB (9 executors will transfer their data into the last executor), and there won't be any spill to disk as the executor has 100GB of RAM and only 1GB of data
Regarding AppendOnlyMap :
As written in the AppendOnlyMap code (see above) - this function is
a low level implementation of a simple open hash table optimized for
the append-only use case, where keys are never removed, but the value
for each key may be changed.
The fact that two different modules uses the same low-level function doesn't mean that those functions are related in hi-level.

Resources