In Spark jobs, an out-of-memory error can have several causes, most often a large shuffle. However, Spark is expected to spill to disk whenever the data (either cached data or shuffle data) does not fit into executor memory, so in theory we should never see an out-of-memory error. In practice, this is not what happens. What is the reason?


How can Spark process data that is way larger than Spark storage?

I am currently taking a course on Spark and came across this definition of an executor:
Each executor will hold a chunk of the data to be processed. This
chunk is called a Spark partition. It is a collection of rows that
sits on one physical machine in the cluster. Executors are responsible
for carrying out the work assigned by the driver. Each executor is
responsible for two things: (1) execute code assigned by the driver,
(2) report the state of the computation back to the driver
I am wondering what happens if the storage of the Spark cluster is smaller than the data that needs to be processed. How will executors fetch the data so that it sits on a physical machine in the cluster?
The same question applies to streaming data, which is unbounded. Does Spark save all the incoming data on disk?
The Apache Spark FAQ briefly mentions the two strategies Spark may adopt:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
Although Spark uses all available memory by default, it can be configured to run jobs using only disk.
Section 2.6.4, Behavior with Insufficient Memory, of Matei Zaharia's PhD dissertation on Spark (An Architecture for Fast and General Data Processing on Large Clusters) benchmarks the performance impact of reducing the amount of available memory.
In practice, you don't usually persist the source dataframe of 100TB, but only the aggregations or intermediate computations that are reused.
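To make the "spill to disk" idea concrete, here is a minimal pure-Python sketch of a sum-by-key aggregation that writes sorted runs to disk when its in-memory buffer is full and merges them at the end. It is a toy under stated assumptions, not Spark's implementation: the names spilling_sum_by_key, read_run and max_keys_in_memory are hypothetical, and the "memory limit" is simply a count of buffered keys.

```python
import heapq
import os
import pickle
import tempfile
from itertools import groupby
from operator import itemgetter


def read_run(path):
    """Lazily yield (key, value) records from one spilled run."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


def spilling_sum_by_key(pairs, max_keys_in_memory, spill_dir):
    """Sum values per key, keeping at most max_keys_in_memory keys buffered."""
    buffer = {}
    spill_paths = []

    for key, value in pairs:
        if key not in buffer and len(buffer) >= max_keys_in_memory:
            # Buffer is full: write it out as a sorted run and start over.
            path = os.path.join(spill_dir, f"spill-{len(spill_paths)}.bin")
            with open(path, "wb") as f:
                for record in sorted(buffer.items()):
                    pickle.dump(record, f)
            spill_paths.append(path)
            buffer.clear()
        buffer[key] = buffer.get(key, 0) + value

    # Merge the in-memory remainder with the sorted runs streamed from disk.
    runs = [sorted(buffer.items())] + [read_run(p) for p in spill_paths]
    merged = heapq.merge(*runs, key=itemgetter(0))
    # The result is collected into a dict only to keep the demo short.
    totals = {k: sum(v for _, v in group)
              for k, group in groupby(merged, key=itemgetter(0))}
    return totals, len(spill_paths)


with tempfile.TemporaryDirectory() as tmp:
    data = [(i % 10, 1) for i in range(100)]
    totals, spills = spilling_sum_by_key(data, max_keys_in_memory=3, spill_dir=tmp)
```

The point of the sketch is that the input can be much larger than the buffer: only one buffer's worth of keys plus one record per run is held at a time during the merge.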

How does Spark handle out-of-memory errors when cached (MEMORY_ONLY persistence) data does not fit in memory?

I'm new to Spark and have not been able to find a clear answer to this: what happens when cached data does not fit in memory?
In many places I have read that if the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.
For example, say 500 partitions are created and 200 of them are not cached; those 200 partitions then have to be recomputed by re-evaluating the RDD.
If that is the case, then an OOM error should never occur, but it does. What is the reason?
A detailed explanation would be highly appreciated. Thanks in advance.
There are different ways to persist a DataFrame in Spark.
1)Persist (MEMORY_ONLY)
When you persist a DataFrame with MEMORY_ONLY, it is cached in Spark's storage memory as deserialized Java objects. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level for RDD.cache() (DataFrame.cache() defaults to MEMORY_AND_DISK), and it can sometimes cause OOM when the RDD is too big to fit in memory (this can also happen after a recomputation).
To answer your question:
If that is the case, then an OOM error should never occur, but it does. What is the reason?
Even after recomputation, those partitions still need to fit in memory while they are processed. If no space is available, GC will try to free some memory and allocate it; if that does not succeed, the task fails with OOM.
2)Persist (MEMORY_AND_DISK)
When you persist a DataFrame with MEMORY_AND_DISK, it is cached in Spark's storage memory as deserialized Java objects; if memory is not available in the heap, part or all of the data is spilled to disk to relieve memory pressure. (Note: make sure the nodes have enough disk space, otherwise you will get out-of-disk-space errors.)
3)Persist (MEMORY_ONLY_SER)
When you persist a DataFrame with MEMORY_ONLY_SER, it is cached in Spark's storage memory as serialized Java objects (one byte array per partition). This is generally more space-efficient than MEMORY_ONLY, but it is more CPU-intensive because of the serialization involved (the general suggestion here is to use Kryo for serialization). It can still hit OOM issues in the same way as MEMORY_ONLY.
4)Persist (MEMORY_AND_DISK_SER)
This is similar to MEMORY_ONLY_SER, with one difference: when no heap space is available, it spills the serialized partitions to disk, the same as MEMORY_AND_DISK. You can use this option when you have a tight constraint on disk space and want to reduce IO traffic.
5)Persist (DISK_ONLY)
In this case, heap memory is not used and RDDs are persisted to disk. Make sure to have enough disk space; this option has a large IO overhead, so don't use it for DataFrames that are used repeatedly.
6)Persist (MEMORY_ONLY_2, MEMORY_AND_DISK_2)
These are similar to the MEMORY_ONLY and MEMORY_AND_DISK levels mentioned above. The only difference is that these options replicate each partition on two cluster nodes to be on the safe side. Use them when you are running on spot instances.
7)Persist (OFF_HEAP)
Off-heap memory generally contains thread stacks, Spark container application code, network IO buffers, and other OS application buffers. With the option above, you can also use this part of RAM for caching your RDD.
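The difference between the memory-only and memory-and-disk levels can be sketched with a small pure-Python toy. This is only an illustration of the caching policy described above, not Spark's BlockManager: ToyBlockStore, memory_slots and use_disk are hypothetical names, the "disk" is just a second dictionary, and the toy does not model eviction or the execution memory a task needs while computing a partition (which is where the OOM described above comes from).

```python
class ToyBlockStore:
    """Toy cache: memory holds a fixed number of partitions, disk is optional."""

    def __init__(self, memory_slots, use_disk):
        self.memory_slots = memory_slots
        self.use_disk = use_disk
        self.memory = {}
        self.disk = {}      # stand-in for local disk storage
        self.computed = 0   # how many times a partition had to be (re)computed

    def put(self, partition_id, data):
        if len(self.memory) < self.memory_slots:
            self.memory[partition_id] = data
        elif self.use_disk:
            # MEMORY_AND_DISK-like: what does not fit in memory goes to disk.
            self.disk[partition_id] = data
        # MEMORY_ONLY-like: what does not fit is simply not cached.

    def get(self, partition_id, compute):
        if partition_id in self.memory:
            return self.memory[partition_id]
        if partition_id in self.disk:
            return self.disk[partition_id]
        # Cache miss: recompute the partition from its lineage.
        self.computed += 1
        data = compute(partition_id)
        self.put(partition_id, data)
        return data


def run_two_passes(store, num_partitions):
    """Read every partition twice, as a job reusing a cached dataset would."""
    for _ in range(2):
        for pid in range(num_partitions):
            store.get(pid, lambda p: [p] * 4)
    return store.computed


memory_only = run_two_passes(ToyBlockStore(memory_slots=3, use_disk=False), 5)
memory_and_disk = run_two_passes(ToyBlockStore(memory_slots=3, use_disk=True), 5)
```

With three memory slots and five partitions, the memory-only store recomputes the two uncached partitions on the second pass, while the memory-and-disk store computes each partition only once.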

Why is Spark's read from file after a stage so fast?

Spark materializes its results on disk after a shuffle. While running an experiment, I saw a Spark task read 65MB of materialized data in 1ms (some tasks were even shown to read it in 0ms :)). My question is: how can Spark read data from an HDD so fast? Is it actually reading this data from a file or from memory?
The answer by @zero323 on this Stack Overflow post states: "To disk are written shuffle files. It doesn't mean that data after the shuffle is not kept in memory." But I couldn't find any official Spark source saying that Spark keeps shuffle output in memory and that the next task prefers it when reading.
Is the Spark task reading shuffle output from disk or from memory? (If from memory, I would be thankful if someone could point to an official source.)
Spark shuffle outputs are written to disk. You can find this in the Spark documentation, under the Performance Impact topic:
Shuffle also generates a large number of intermediate files on disk.
As of Spark 1.3, these files are preserved until the
corresponding RDDs are no longer used and are garbage collected.
This is done so the shuffle files don’t need to be re-created if the
lineage is re-computed. Garbage collection may happen only after a
long period of time, if the application retains references to these RDDs
or if GC does not kick in frequently.
This means that long-running Spark jobs may consume a large amount of
disk space.

spark spilling independent of executor memory assigned

I've noticed strange behavior when running a PySpark application with Spark 2.0. In the first step of my script that involves a reduceByKey (and thus a shuffle) operation, I observe that the amount of shuffle write is roughly in line with my expectations, but that many more spills occur than I had expected. I tried to avoid these spills by increasing the memory assigned per executor to up to 8x the original amount, but saw basically no difference in the amount spilled. Strangely, I also see that while this stage is running, hardly any of the assigned storage memory is used (as reported in the Executors tab of the Spark web UI).
I saw this earlier question, which led me to believe that increasing executor memory might help avoid the spills: How to optimize shuffle spill in Apache Spark application.
This leads me to believe that some hard limit is causing the spills, and not the spark.shuffle.memoryFraction parameter. Does such a hard limit exist, possibly among the HDFS parameters? Otherwise, what could be done to avoid spills besides increasing executor memory?
Many thanks, R
Spilling behavior in PySpark is controlled using spark.python.worker.memory:
Amount of memory to use per python worker process during aggregation, in the same format as JVM memory strings (e.g. 512m, 2g). If the memory used during aggregation goes above this amount, it will spill the data into disks.
which is set to 512MB by default. Moreover, PySpark uses its own reducing mechanism with External(GroupBy|Sorter|Merger) and exhibits slightly different behavior than its native counterpart.
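As a rough sketch of how this setting could be raised, the snippet below builds a spark-submit command line that passes spark.python.worker.memory through the standard --conf key=value option. The helper spark_submit_command, the script name job.py, and the values 2g and 8g are assumptions for illustration only; suitable values depend on the workload.

```python
import shlex


def spark_submit_command(script, conf):
    """Build a spark-submit argument list with one --conf per setting."""
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(script)
    return args


# Hypothetical job script and illustrative values.
cmd = spark_submit_command(
    "job.py",
    {
        "spark.python.worker.memory": "2g",  # per Python worker, used during aggregation
        "spark.executor.memory": "8g",       # JVM executor heap
    },
)
print(shlex.join(cmd))
```

Keeping the two settings side by side reflects the point above: raising executor memory alone does not change when the Python worker starts spilling.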

spark streaming failed batches

I see some failed batches in my Spark Streaming application because of memory-related issues such as:
Could not compute split, block input-0-1464774108087 not found
I was wondering if there is a way to reprocess those batches on the side without interfering with the currently running application. I am asking in general; it does not have to be this exact exception.
Thanks in advance
This may happen when the rate of data ingestion into Spark is higher than what the allocated memory can hold. You can try changing the StorageLevel to MEMORY_AND_DISK_SER so that Spark can spill data to disk when it is low on memory. This should prevent the error.
Also, I don't think this error means that any data was lost during processing; rather, the input block that was added by your block manager timed out before processing started.
See the similar question on the Spark user mailing list.
Data is not lost; it was just not present where the task expected it to be. As per the Spark docs:
You can mark an RDD to be persisted using the persist() or cache()
methods on it. The first time it is computed in an action, it will be
kept in memory on the nodes. Spark’s cache is fault-tolerant – if any
partition of an RDD is lost, it will automatically be recomputed using
the transformations that originally created it.
