In case of task failures, does Spark clear the persisted RDD (StorageLevel.MEMORY_ONLY_SER) and recompute it from the beginning when the task attempt is restarted? Or will records be appended to the cached RDD?
I am seeing duplicate records for a persisted RDD whenever tasks fail. Any help would be appreciated.
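For concreteness, a minimal PySpark sketch of what I am doing; the file name, parse_record function and the duplicate check are illustrative, not my actual code:
from pyspark import StorageLevel

# persist the RDD; note that in PySpark data is always pickled, so MEMORY_ONLY
# plays the role of the serialized MEMORY_ONLY_SER level on the JVM side
rdd = sc.textFile("events.txt").map(parse_record)   # parse_record is a placeholder
rdd.persist(StorageLevel.MEMORY_ONLY)

# after a job with task retries, compare total vs. distinct counts to see
# whether the persisted data really contains appended duplicates
total = rdd.count()
unique = rdd.distinct().count()
print(total, unique)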
A task is the smallest individual unit of execution; it is launched to compute a single RDD partition.
When a task fails, the run method notifies TaskContextImpl that the task has failed. run then requests MemoryStore to release the unroll memory for this task (for both ON_HEAP and OFF_HEAP memory modes). ContextCleaner is a Spark service that is responsible for application-wide cleanup of shuffles, RDDs, broadcasts, accumulators and checkpointed RDDs.
As we know, an RDD is Resilient, i.e. fault-tolerant with the help of the RDD lineage graph, and so it is able to recompute missing or damaged partitions caused by node failures.
Caching computes and materializes an RDD in memory while keeping track of its lineage (dependencies). Since caching remembers an RDD's lineage, Spark can recompute lost partitions in the event of node failures. Lastly, a cached RDD lives within the context of the running application; once the application terminates, cached RDDs are deleted as well.
Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
If the RDD is persisted in memory, then on a failure that brings down the executor JVM process, the memory is released along with it.
If the RDD is persisted on disk, Spark's shutdown hook simply wipes the temporary files on failure.
You can call
rdd.unpersist()
to clear the cached rdd.
Related
RDDs that have been cached using the rdd.cache() method from the Scala shell are stored in memory.
That means they consume part of the RAM available to the Spark process itself.
Having said that, if RAM is limited and more and more RDDs are cached, when will Spark automatically clean the memory occupied by the RDD cache?
Spark will clean cached RDDs and Datasets / DataFrames:
When it is explicitly asked to by calling RDD.unpersist (How to uncache RDD?) / Dataset.unpersist methods or Catalog.clearCache.
At regular intervals, by the cache cleaner:
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
When the corresponding distributed data structure is garbage collected.
Spark will automatically unpersist/clean the RDD or DataFrame if the RDD is no longer used. To check whether an RDD is cached, open the Spark UI, go to the Storage tab and look at the memory details.
From the shell, we can use rdd.unpersist() or sqlContext.uncacheTable("sparktable")
to remove the RDD or table from memory. Spark is built around lazy evaluation: unless and until you call an action, it does not load or process any data into the RDD or DataFrame.
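A minimal sketch of these calls in PySpark, assuming a shell where sc and sqlContext already exist; the file name and table name are illustrative:
# cache an RDD and a temp table; nothing is materialized until an action runs
rdd = sc.parallelize(range(1000)).cache()
df = sqlContext.read.json("myfile.json")
df.registerTempTable("sparktable")
sqlContext.cacheTable("sparktable")

rdd.count()                                              # action: the RDD is now materialized in memory
sqlContext.sql("SELECT COUNT(*) FROM sparktable").collect()

# explicit cleanup once the cached data is no longer needed
rdd.unpersist()
sqlContext.uncacheTable("sparktable")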
Many say:
Spark does not replicate data the way HDFS does.
Spark arranges the operations in a DAG and builds the RDD lineage. If an RDD is lost, it can be rebuilt with the help of the lineage graph.
So there is no need for data replication, as the RDDs can be recomputed from the lineage graph.
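A quick way to see the lineage Spark keeps around for recomputation is toDebugString; a minimal PySpark sketch (the input path is illustrative):
rdd = sc.textFile("hdfs:///data/input.txt") \
        .flatMap(lambda line: line.split()) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)

# prints the chain of parent RDDs (the lineage graph) that would be used
# to recompute lost partitions
print(rdd.toDebugString())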
And my question is:
If a node fails, Spark will only recompute the RDD partitions lost on that node, but where does the data needed for the recomputation come from? Do you mean its parent RDD is still there when the node fails? What if the RDD that lost some partitions doesn't have a parent RDD (e.g. the RDD comes from a Spark Streaming receiver)?
What if we lose something part way through computation?
Rely on the key insight from MR! Determinism provides safe recompute.
Track 'lineage' of each RDD. Can recompute from parents if needed.
Interesting: only need to record tiny state to do recompute.
Need parent pointer, function applied, and a few other bits.
Log 10 KB per transform rather than re-output 1 TB -> 2 TB
Source
A child RDD contains metadata that describes how to compute it from its parent RDD. Read more in What is RDD dependency in Spark?
If a node fails, Spark will only recompute the RDD partitions lost on that node, but where does the data needed for the recomputation come from? Do you mean its parent RDD is still there when the node fails?
The core idea is that you can use the lineage to recover lost RDDs because RDDs are
built from another RDD or
built from data in stable storage.
(source: RDD paper, beginning of section 2.1)
If some RDD is lost, you can simply go back in the lineage until you reach an RDD, or the initial data, that is still available.
The data in stable storage is replicated across multiple nodes, therefore unlikely to be lost.
From what I've read about streaming receivers, the received data seems to be saved in stable storage as well, so it behaves just like any other data source.
Can anyone please correct my understanding of persisting in Spark?
If we have performed a cache() on an RDD, its value is cached only on those nodes where the RDD was actually computed initially.
Meaning, if there is a cluster of 100 nodes, and the RDD is computed in partitions on the first and second nodes, and we cache this RDD, then Spark is going to cache its value only in first or second worker nodes.
So when this Spark application tries to use this RDD in later stages, the Spark driver has to get the value from the first/second nodes.
Am I correct?
(OR)
Is it something that the RDD value is persisted in driver memory and not on nodes ?
Change this:
then Spark is going to cache its value only in first or second worker nodes.
to this:
then Spark is going to cache its value only in first and second worker nodes.
and...Yes correct!
Spark tries to minimize memory usage (and we love it for that!), so it won't make any unnecessary memory loads: it evaluates every statement lazily, i.e. it won't do any actual work on a transformation; it will wait for an action to happen, which leaves Spark no choice but to do the actual work (read the file, send the data over the network, do the computation, collect the result back to the driver, for example).
You see, we don't want to cache everything unless we really can (that is, the memory capacity allows for it; yes, we can ask for more memory in the executors and/or the driver, but sometimes our cluster just doesn't have the resources, which is really common when we handle big data) and it really makes sense, i.e. the cached RDD is going to be used again and again (so caching it will speed up the execution of our job).
That's why you want to unpersist() your RDD, when you no longer need it...! :)
Check this image; it is from one of my jobs where I had requested 100 executors; however, the Executors tab displayed 101, i.e. 100 slaves/workers and one master/driver:
RDD.cache is a lazy operation: it does nothing until you call an action like count. Once you call the action, subsequent operations will use the cache; they simply take the data from the cache and perform the operation.
RDD.cache persists the RDD with the default storage level (MEMORY_ONLY).
Spark RDD API
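For illustration, a small PySpark sketch of cache versus an explicit persist level, assuming an existing SparkContext sc:
from pyspark import StorageLevel

rdd = sc.parallelize(range(100))

rdd.cache()                                   # same as persist(StorageLevel.MEMORY_ONLY)
rdd.unpersist()
rdd.persist(StorageLevel.MEMORY_AND_DISK)     # spill partitions to disk if memory is short

rdd.count()                                   # the action that actually materializes the cache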
2. Is it something that the RDD value is persisted in driver memory and not on nodes?
An RDD can be persisted to disk as well as to memory. Click the link to the Spark documentation for all the options:
Spark RDD Persist
# no actual caching at the end of this statement
rdd1 = sqlContext.read.json('myfile.json').rdd.map(lambda row: myfunc(row)).cache()

# again, no actual caching yet, because Spark is lazy and won't evaluate anything
# until an action (a reduction op) is called
rdd2 = rdd1.map(mysecondfunc)

# the caching is done when this action runs; the result of rdd1 will be cached
# in the memory of each worker node
n = rdd1.count()
So to answer your question
If we have performed a cache() on an RDD, its value is cached only on those nodes where the RDD was actually computed initially
The only place where anything can be cached is on the worker nodes, not on the driver node.
The cache function can only be applied to an RDD (refer), and since an RDD only exists in the worker nodes' memory (Resilient Distributed Datasets!), its results are cached in the respective worker nodes' memory. Once you apply an operation like count, which brings the result back to the driver, it's not really an RDD anymore; it's merely the result of a computation done on the RDD by the worker nodes in their respective memories.
Since cache in the above example was called on rdd1, which still lives on multiple worker nodes, the caching only happens in the worker nodes' memory.
In the above example, when you do another map-reduce operation on rdd1, it won't read the JSON again, because it was cached.
FYI, I am using the word memory based on the assumption that the caching level is set to MEMORY_ONLY. Of course, if that level is changed, Spark will cache to either memory or disk based on the setting.
Here is an excellent answer on caching
(Why) do we need to call cache or persist on a RDD
Basically, caching stores the RDD in the memory/disk (based on the persistence level set) of that node, so that when this RDD is called again it does not need to recompute its lineage (lineage: the set of prior transformations executed to reach the current state).
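A small PySpark sketch of that effect, using an accumulator to count how often the map function actually runs; the names are illustrative:
calls = sc.accumulator(0)

def expensive(x):
    calls.add(1)          # count how many times the transformation really executes
    return x * x

squared = sc.parallelize(range(10)).map(expensive).cache()

squared.count()           # first action: the lineage is executed and the result cached
squared.count()           # second action: served from the cache
print(calls.value)        # roughly 10, not 20, because the lineage was not recomputed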
Where is the Spark RDD lineage stored? As per the RDD white paper, it is kept in memory, but I want to know whether it lives on the driver side or somewhere else in the cluster.
Also, how is fault tolerance ensured, i.e. how many replicas of the RDD (metadata) are created by default?
I want to understand core framework behaviour when we are not using persist() method.
The RDD lineage lives on the driver, where RDDs live. When jobs are submitted, this information is no longer relevant. It's an internal part of any RDD and that's how it knows its parents.
When the driver fails, the RDD lineage is gone, as is the entire computation. The driver is... well... the driver, and without it nothing really happens.
In Spark Streaming, the received data is replicated among multiple Spark executors on worker nodes in the cluster (the default replication factor is 2; see http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html). But how can I get the location of the replicas of a specific RDD?
In the Spark UI there is a tab called "Storage" that tells you which RDDs are cached and where (memory, disk, serialized, etc.).
For Spark Streaming, by default it will serialize the RDDs in memory and remove old ones as needed. If you don't have computations that depend on previous results, it's better to set spark.streaming.unpersist to true, so that once the data is processed it gets removed, avoiding pressure on the garbage collector.
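A minimal sketch of setting that flag when building a streaming application; the app name and batch interval are illustrative:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("streaming-unpersist-example")
        .set("spark.streaming.unpersist", "true"))    # drop RDDs once they are processed

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=10)          # 10 second batches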