In Spark Streaming, the received data is replicated among multiple Spark executors on worker nodes in the cluster (the default replication factor is 2) (http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html). But how can I get the location of the replicas of a specific RDD?
In the Spark UI there is a tab called "Storage" that tells you which RDDs are cached and how they are stored (memory, disk, serialized, etc.).
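If you want the same information programmatically rather than from the UI, a minimal sketch (the RDD and its name below are placeholders):

val rdd = sc.parallelize(1 to 100).setName("demo-rdd").cache()
rdd.count()   // first action materializes the cache

// getPersistentRDDs lists the RDDs currently registered as persistent on the driver
sc.getPersistentRDDs.values.foreach { r =>
  println(s"${r.name} -> ${r.getStorageLevel.description}")
}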
For Spark Streaming, by default the received RDDs are serialized in memory and old ones are removed as needed. If you don't have computations that depend on previous results, it's better to set spark.streaming.unpersist to true, so that each RDD is removed once it has been processed, avoiding pressure on the garbage collector.
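As a rough sketch of how that is wired up (the app name, host, port and batch interval are illustrative, not taken from the question; spark.streaming.unpersist and StorageLevel.MEMORY_AND_DISK_SER_2 are real Spark settings):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-storage-demo")      // illustrative app name
  .set("spark.streaming.unpersist", "true")  // drop RDDs once they have been processed

val ssc = new StreamingContext(conf, Seconds(10))

// Receiver-based sources take an explicit storage level;
// the *_2 levels replicate each received block to two executors.
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)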
Related
From where does Spark load data for an RDD? Is the data already present on the executor nodes, or does Spark shuffle data from the driver node first?
From the name itself - RDD (Resilient Distributed Dataset) - it indicates that the data resides across the executors whenever you create it.
Let's say you run parallelize() on 100 entries: Spark will distribute those 100 entries across your executors so that each executor has its own chunk of data for distributed processing.
Shuffling happens when you run operations such as repartition() (or coalesce() with shuffle enabled).
Also, if you call functions like collect(), Spark will pull all the data from the executors and bring it to the driver (and you lose the ability to process it in a distributed way).
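A small sketch of those three cases, assuming a spark-shell session where sc is already available:

val rdd = sc.parallelize(1 to 100, numSlices = 4)  // 100 entries split into 4 partitions on the executors
println(rdd.getNumPartitions)                      // 4

val repartitioned = rdd.repartition(8)             // shuffles the data into 8 partitions

val everything = repartitioned.collect()           // pulls all 100 values back to the driver
println(everything.length)                         // 100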
This reference has more details on the internals of Spark: Apache Spark architecture.
Say my Spark cluster has 100 GB of memory, and during the computation more data (new DataFrames, caches) with a size of 200 GB is generated. In this case, will Spark store some of this data on disk, or will it just OOM?
Spark only starts reading in the data when an action (like count, collect or write) is called. Once an action is called, Spark loads the data in partitions - the number of concurrently loaded partitions depends on the number of cores you have available. So in Spark you can think of 1 partition = 1 core = 1 task.
If you apply no transformation but only do, for instance, a count, Spark will still read the data in partitions, but it will not store any data in your cluster, and if you do the count again it will read all the data once more. To avoid reading data several times, you can call cache or persist, in which case Spark will try to store the data in your cluster. With cache (which is the same as persist(StorageLevel.MEMORY_ONLY)) it will store all partitions in memory - if they don't fit in memory you will get an OOM. If you call persist(StorageLevel.MEMORY_AND_DISK) it will store as much as it can in memory and put the rest on disk. If the data doesn't fit on disk either, the OS will usually kill your workers.
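A rough sketch of both behaviours (the input path is a placeholder):

import org.apache.spark.storage.StorageLevel

// Nothing is read yet: textFile and filter only build the lineage.
val errors = sc.textFile("/data/logs/*.txt").filter(_.contains("ERROR"))

errors.count()   // reads the data, one task per partition
errors.count()   // no cache yet, so the data is read again from scratch

// Keep what fits in memory, spill the remaining partitions to disk.
errors.persist(StorageLevel.MEMORY_AND_DISK)
errors.count()   // materializes the stored partitions
errors.count()   // now served from the cache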
In Apache Spark, if the data does not fit into memory then Spark simply persists that data to disk. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on data of any size. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
The persist method in Apache Spark offers several storage levels for persisting data: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER (Java and Scala), MEMORY_AND_DISK_SER (Java and Scala), DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, and OFF_HEAP.
The OFF_HEAP storage level is experimental.
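A minimal sketch of picking a level (the RDD is a placeholder):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000)

// Pick one level per RDD; the _SER variants store serialized bytes,
// the _2 variants replicate each partition to two nodes.
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)

println(rdd.getStorageLevel)   // shows which level is in effect for this RDD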
There's a column called "RDD blocks" in the Executors tab of the Spark UI. One observation is that the number of RDD blocks keeps increasing for a particular streaming job where messages are streamed from Kafka.
Certain executors were removed automatically, and the application slows down after a long run with a large number of RDD blocks. DStreams and RDDs are not persisted manually anywhere.
It would be a great help if someone could explain when these blocks are created and on what basis they are removed (are there any parameters that need to be modified?).
A good explanation of the Spark UI is this. RDD blocks can represent cached RDD partitions, intermediate shuffle outputs, broadcasts, etc. Check out the BlockManager section of this book.
In case of task failures, does Spark clear the persisted RDD (StorageLevel.MEMORY_ONLY_SER) and recompute it when the task is restarted from the beginning? Or will the cached RDD be appended to?
I am seeing duplicate records for a persisted RDD whenever there are task failures. Any help would be appreciated.
A task is the smallest individual unit of execution; it is launched to compute an RDD partition.
When a task fails, its run method notifies TaskContextImpl that the task has failed and requests the MemoryStore to release the unroll memory for this task (for both ON_HEAP and OFF_HEAP memory modes). ContextCleaner is a Spark service responsible for application-wide cleanup of shuffles, RDDs, broadcasts, accumulators and checkpointed RDDs.
As we know, an RDD is Resilient, i.e. fault-tolerant with the help of the RDD lineage graph, and is therefore able to recompute missing or damaged partitions caused by node failures.
Caching computes and materializes an RDD in memory while keeping track of its lineage (dependencies). Since caching remembers an RDD's lineage, Spark can recompute lost partitions in the event of node failures. Lastly, a cached RDD lives within the context of the running application, and once the application terminates, cached RDDs are deleted as well.
Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
If the RDD is persisted in memory, then on task failure the executor JVM process fails too, so the memory is released.
If the RDD is persisted on disk, then on task failure Spark's shutdown hook simply wipes the temp files.
You can call
rdd.unpersist()
to clear the cached rdd.
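For example, assuming an RDD persisted with MEMORY_ONLY_SER as in the question (the input path is a placeholder):

import org.apache.spark.storage.StorageLevel

val cached = sc.textFile("/tmp/input.txt").persist(StorageLevel.MEMORY_ONLY_SER)

cached.count()                     // materializes the cached partitions

cached.unpersist(blocking = true)  // removes the blocks from every executor before returning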
Where is the Spark RDD lineage stored? According to the white paper on RDDs, it is persisted in memory, but I want to know whether it is kept on the driver side or somewhere else on the cluster.
Also, how is fault tolerance ensured, i.e. how many replications of the RDD (metadata) are created by default?
I want to understand the core framework behaviour when we are not using the persist() method.
The RDD lineage lives on the driver, where the RDDs themselves live. Once jobs are submitted, this information is no longer relevant. It's an internal part of any RDD and is how an RDD knows its parents.
When the driver fails, the RDD lineage is gone, as is the entire computation. The driver is...well...the driver, and without it nothing really happens.
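A quick way to see that lineage from the driver (a sketch; the input path is a placeholder):

val words = sc.textFile("/tmp/input.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

// toDebugString prints the lineage (the chain of parent RDDs) that the driver holds
println(words.toDebugString)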