When are Spark RDD blocks created and destroyed/removed? - apache-spark

There's a column called RDD blocks in the Spark UI in executors tab. One observation made is that the number of RDD blocks keeps increasing for a particular streaming job where messages are streamed from Kafka.
Certain executors were removed automatically and application slows down after long run with a large number of RDD blocks. DStreams and RDDs are not persisted manually anywhere.
It would be a great help if someone explains when these blocks are created and on what basis are the blocks being removed (are there any parameters that need to be modified?).

Good explanation of Spark UI is this. RDD blocks can represent cached RDD partitions, intermediate shuffle outputs, broadcasts, etc. Check out BlockManager section of this book.

Related

spark partitionBy out of memory failures

I have a Spark 2.2 job written in pyspark that's trying to read in 300BT of Parquet data in a hive table, run it through a python udf, and then write it out.
The input is partitioned on about five keys and results in about 250k partitions.
I then want to write it out using the same partition scheme using the .partitionBy clause for the dataframe.
When I don't use a partitionBy clause the data writes out and the job does finish eventually. However with the partitionBy clause I continuously see out of memory failures on the spark UI.
Upon further investigation the source parquet data is about 800MB on disk (compressed using snappy), and each node has about 50G of memory available to it.
Examining the spark UI I see that the last step before writing out is doing a sort. I believe this sort is the cause of all my issues.
When reading in a dataframe of partitioned data, is there any way to preserve knowledge of this partitioning so spark doesn't run an unnecessary sort before writing it out?
I'm trying to avoid a shuffle step here by repartitioning that could equally result in further delays of this.
Ultimately I can rewrite to read one partition at a time, but I think that's not a good solution and that spark should already be able to handle this use case.
I'm running with about 1500 executors across 150 nodes on ec2 r3.8xlarge.
I've tried smaller executor configs and larger ones and always hit the same out of memory issues.

spark streaming failed batches

I see some failed batches in my spark streaming application because of memory related issues like
Could not compute split, block input-0-1464774108087 not found
, and I was wondering if there is a way to re process those batches on the side without messing with the current running application, just in general , does not have to be the same exact exception.
Thanks in advance
Pradeep
This may happen in cases where your data ingestion rate into spark is higher than memory allocated or can be kept. You can try changing StorageLevel to MEMORY_AND_DISK_SER so that when it is low on memory Spark can spill data to disk. This will prevent your error.
Also, I don't think this error means that any data was lost while processing, but that input block which was added by your block manager just timed out before processing started.
Check similar question on Spark User list.
Edit:
Data is not lost, it was just not present where the task was expecting it to be. As per Spark docs:
You can mark an RDD to be persisted using the persist() or cache()
methods on it. The first time it is computed in an action, it will be
kept in memory on the nodes. Spark’s cache is fault-tolerant – if any
partition of an RDD is lost, it will automatically be recomputed using
the transformations that originally created it.

What is the difference between spark's shuffle read and shuffle write?

I need to run a spark program which has huge amount of data. I am trying to optimize the spark program and working through spark UI and trying to reduce the Shuffle part.
There are couple of components mentioned, shuffle read and shuffle write. I can understand the difference based their terminology, but I would like to understand the exact meaning of them and which one of spark's shuffle read/write reduces the performance?
I have searched over the internet, but could not find solid in depth details about them, so wanted to see if any one can explain them here.
From the UI tooltip
Shuffle Read
Total shuffle bytes and records read (includes both data read locally and data read from remote executors
Shuffle Write
Bytes and records written to disk in order to be read by a shuffle in a future stage
I've recently begun working with Spark. I have been looking for answers to the same sort of questions.
When the data from one stage is shuffled to a next stage through the network, the executor(s) that process the next stage pull the data from the first stage's process through TCP. I noticed the shuffle "write" and "read" metrics for each stage are displayed in the Spark UI for a particular job. A stage also potentially had an "input" size (eg. input from HDFS or hive table scan).
I noticed that the shuffle write size from one stage that fed into another stage did not match that stages shuffle read size. If I remember correctly, there are reducer-type operations that can be performed on the shuffle data prior to it being transferred to the next stage/executor as an optimization. Maybe this contributes to the difference in size and therefore the relevance of reporting both values.

Why does Spark save Map phase output to local disk?

I'm trying to understand spark shuffle process deeply. When i start reading i came across the following point.
Spark writes the Map task(ShuffleMapTask) output directly to disk on completion.
I would like to understand the following w.r.t to Hadoop MapReduce.
If both Map-Reduce and Spark writes the data to the local disk then how spark shuffle process is different from Hadoop MapReduce?
Since data is represented as RDD's in Spark why don't these outputs remain in the node executors memory?
How is the output of the Map tasks from Hadoop MapReduce and Spark different?
If there are lot of small intermediate files as output how spark handles the network and I/O bottleneck?
First of all Spark doesn't work in a strict map-reduce manner and map output is not written to disk unless it is necessary. To disk are written shuffle files.
It doesn't mean that data after the shuffle is not kept in memory. Shuffle files in Spark are written mostly to avoid re-computation in case of multiple downstream actions. Why to write to a file system at all? There at least two interleaved reasons:
memory is a valuable resource and in-memory caching in Spark is ephemeral. Old data can be evicted from cache when needed.
shuffle is an expensive process we want to avoid if not necessary. It makes more sense to store shuffle data in a manner which makes it persistent during a lifetime of a given context.
Shuffle itself, apart from the ongoing low level optimization efforts and implementation details, isn't different at all. It is based on the same basic approach with all its limitations.
How tasks are different form Hadoo maps? As nicely illustrated by Justin Pihony multiple transformations which doesn't require shuffles are squashed together in a single tasks. Since these operate on standard Scala Iterators operations on individual elements can be piped.
Regarding network and I/O bottlenecks there is no silver bullet here. While Spark can reduce amount of data which is written to disk or shuffled by combining transformations, caching in memory and providing transformation aware worker preferences, it is a subject to the same limitations like any other distributed framework.
If both Map-Reduce and Spark writes the data to the local disk then how spark shuffle process is different from Hadoop MapReduce?
When you execute a Spark application, the very first thing is starting the SparkContext first that becomes the home of multiple interconnected services with DAGScheduler, TaskScheduler and SchedulerBackend being among the most important ones.
DAGScheduler is the main orchestrator and is responsible for transforming a RDD lineage graph (i.e. a directed acyclic graph of RDDs) into stages. While doing it, DAGScheduler traverses the parent dependencies of the final RDD and creates a ResultStage with parent ShuffleMapStages.
A ResultStage is (mostly) the last stage with ShuffleMapStages being its parents. I said mostly because I think I may have seen that you can "schedule" a ShuffleMapStage.
This is the very early and first optimization Spark applies to your Spark jobs (that together create a Spark application) - execution pipelining where multiple transformations are wired together to create a single stage (because their inter-dependencies are narrow). That's what makes Spark faster than Hadoop MapReduce since two or more transformations can get executed one by one with no data shuffling possibly all in memory.
A single stage is as wide until it hits ShuffleDependency (aka wide dependency).
There are RDD transformations that will cause shuffling (due to creating a ShuffleDependency). That's the moment where Spark is very much like Hadoop's MapReduce since it will save partial shuffle outputs to...local disks on executors.
When a Spark application starts it requests executors from a cluster manager (there are three supported: Spark Standalone, Apache Mesos and Hadoop YARN). This is what SchedulerBackend is for -- to manage communication between your Spark application and cluster resources.
(Let's assume you are not using External Shuffle Manager)
Executors host their own local BlockManagers that are responsible for managing RDD blocks that are kept on local hard drive (possibly in memory and replicated too). You can control RDD block persistence using cache and persist operators and StorageLevels. You can use Storage and Executors tabs in web UI to track blocks with their location and size.
The difference between Spark storing data locally (on executors) and Hadoop MapReduce is that:
The partial results (after computing ShuffleMapStages) are saved on local hard drives not HDFS which is a distributed file system with a very expensive saves.
Only some files are saved to local hard drive (after operations being pipelined) which does not happen in Hadoop MapReduce that saves all maps to HDFS.
Let me answer the following item:
If there are lot of small intermediate files as output how spark handles the network and I/O bottleneck?
That's the trickest part in the Spark execution plan and heavily depends on how wide the shuffling is. If you work only with local data (multiple executors on a single machine) you will see no data traffic since the data is in place already.
If the data shuffle is required, executors will send data between each other and that will increase the traffic.
Data Exchange Between Nodes in Spark Application
Just to elaborate on the traffic between nodes in a Spark application.
Broadcast variables are the means of sending data from the driver to executors.
Accumulators are the means of sending data from executors to the driver.
Operators like collect will pull all the remote blocks from executors to the driver.

RDD distribution in Spark Streaming

In spark streaming, the received data is replicated among multiple Spark executors in worker nodes in the cluster (default replication factor is 2)(http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html). But how can I get the location of the replication of an specific RDD?
In Spark UI there is a tab called "Storage" that tell you which RDDs are cached and where (memory, disk, serialized, etc).
For Spark Streaming by default it will serialize the RDD in memory and remove old ones as needed. If you don't have computations that depend on previous results it's better if you set spark.streaming.unpersist to True, so once processed get's removed to avoid putting pressure on the garbage collector.

Resources