What happens to the previous RDD when it gets transformed into a new RDD?

I am a beginner in Apache Spark. What I have understood so far regarding RDDs is that once an RDD is generated, it cannot be modified but can be transformed into another RDD. Now I have several doubts here:
If an RDD is transformed into another RDD by applying some transformation, what happens to the previous RDD? Are both RDDs stored in memory?
If an RDD is cached, and some transformation is applied to the cached RDD to generate a new RDD, can there be a scenario where there is not enough space in RAM to hold the newly generated RDD? If such a scenario occurs, how will Spark handle it?
Thanks in advance!

Due to Spark's lazy evaluation, nothing happens when you do transformations on RDDs. Spark only starts computing when you call an action (save, take, collect, ...). So to answer your questions,
The original RDD stays where it is, and the transformed RDD does not exist / has not been computed yet due to lazy evaluation. Only a query plan is generated for it.
Regardless of whether the original RDD is cached, the transformed RDD will not be computed until an action is called. Therefore running out of memory shouldn't happen.
Normally when you run out of memory, either you encounter an OOM error, or the cached RDDs in memory will be spilled onto the disk to free up memory.
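As a rough illustration of the lazy behaviour described above (a minimal local sketch; the SparkContext setup and data are made up for the example), the map below only records a plan, and nothing is computed until take() is called:

from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-eval-demo")

numbers = sc.parallelize(range(1_000_000))   # the "original" RDD
squares = numbers.map(lambda x: x * x)       # transformation: only a plan is recorded

# Only now does Spark actually compute anything, because take() is an action.
print(squares.take(5))                       # [0, 1, 4, 9, 16]

sc.stop()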

In order to understand the answers to your questions, you need to know a couple of things about Spark:
Spark evaluation Model (Lazy Evaluation)
Spark Operations (Transformations and Actions)
Directed Acyclic Graph (DAG)
Answer to your first Question:
You can think of an RDD as a virtual data structure that does not get filled with values until some action is called on it, which materializes the RDD/dataframe. When you perform transformations, Spark just builds a query plan, which reflects its lazy evaluation behaviour. When an action is called, Spark performs all the transformations based on the physical plan that gets generated. So, nothing happens to the RDDs; RDD data only gets pulled into memory when an action is called.
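A small sketch of this (names and data are illustrative, not from the question): transformations only extend the lineage, and toDebugString() lets you inspect that plan without materializing anything:

from pyspark import SparkContext

sc = SparkContext("local[*]", "lineage-demo")

words = sc.parallelize(["spark", "rdd", "lazy", "evaluation"])
upper = words.map(lambda w: w.upper())        # transformation: plan extended, no data loaded
filtered = upper.filter(lambda w: "A" in w)   # still nothing pulled into memory

print(filtered.toDebugString().decode())      # prints the lineage/plan only
print(filtered.collect())                     # the action materializes the result

sc.stop()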
Answer to your Second Question:
If an RDD is cached and you perform multiple transformations on top of the cached RDD, nothing actually happens to the RDDs, because cache itself is lazy. The cached RDD will only be in memory once some action has been performed. So, you won't run out of memory at that point.
You could run into memory issues if you try to cache every step of the transformation chain, which should be avoided. (Whether or not to cache a dataframe/RDD is a million-dollar question as a beginner, but you get a feel for it as you learn the basics and Spark's architecture.)
The other workflow where you can run out of memory is when you have a huge data size and you cache the RDD after some transformation because you want to perform multiple actions on it, or because it is used multiple times in the workflow. In this case you need to verify your cluster configuration and make sure that it can handle the data you intend to cache.
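A hedged sketch of that last workflow (the input path is a placeholder, and MEMORY_AND_DISK is just one reasonable choice): persisting with a storage level that can spill to disk lets an RDD that is reused by several actions avoid recomputation without risking an OOM:

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "cache-demo")

raw = sc.textFile("/tmp/some_big_file.txt")            # placeholder input path
cleaned = raw.map(str.strip).filter(lambda line: line)

# MEMORY_AND_DISK lets partitions that don't fit in RAM spill to local disk
# instead of failing or silently dropping cached blocks.
cleaned.persist(StorageLevel.MEMORY_AND_DISK)

print(cleaned.count())   # first action: computes and persists the result
print(cleaned.take(3))   # second action: served from the persisted copy

cleaned.unpersist()      # free the storage once the RDD is no longer needed
sc.stop()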

Related

Does spark automatically un-cache and delete unused dataframes?

I have the following strategy to change a dataframe df.
df = T1(df)
df.cache()
df = T2(df)
df.cache()
.
.
.
df = Tn(df)
df.cache()
Here T1, T2, ..., Tn are n transformations that return spark dataframes. Repeated caching is used because df has to pass through a lot of transformations and is used multiple times in between; without caching, lazy evaluation of the transformations might make using df in between very slow. What I am worried about is that the n dataframes that are cached one by one will gradually consume the RAM. I read that spark automatically un-caches "least recently used" items. Based on this I have the following queries -
How is "least recently used" parameter determined? I hope that a dataframe, without any reference or evaluation strategy attached to it, qualifies as unused - am I correct?
Does a spark dataframe, having no reference and evaluation strategy attached to it, get selected for garbage collection as well? Or does a spark dataframe never get garbage collected?
Based on the answer to the above two queries, is the above strategy correct?
How is "least recently used" parameter determined? I hope that a dataframe, without any reference or evaluation strategy attached to it, qualifies as unused - am I correct?
Results are cached on Spark executors. A single executor runs multiple tasks and can hold multiple caches in its memory at a given point in time. An executor's caches are ranked by when they were last requested: the cache most recently used in some computation always has rank 1, and the others are pushed down. Eventually, when the available space is full, the lowest-ranked cache is dropped to make room for a new cache.
Does a spark dataframe, having no reference and evaluation strategy attached to it, get selected for garbage collection as well? Or does a spark dataframe never get garbage collected?
A dataframe is an execution expression, and unless an action is called, no computation is materialised. Moreover, everything is cleared once the executor has finished the computation for that task. Only when a dataframe is cached (before calling the action) are the results kept aside in executor memory for further use. And these result caches are cleared based on LRU.
Based on the answer to the above two queries, is the above strategy correct?
In your example the transformations are done in sequence and the reference to the previous dataframe is not used further (it is not clear why you are using cache at all). If multiple executions are handled by the same executor, it is possible that some cached results are dropped, and when they are asked for again they will be re-computed.
N.B. - Nothing is executed unless a Spark action is called. Transformations are chained and optimised by the Spark engine when an action is called.
In my experience working with Spark, and from the communication I have had with Cloudera, we should unpersist/uncache the data; if we do not, the job will start to slow down, and the problem becomes more severe in the case of a streaming job.
I have nothing concrete to support my answer, but read here and here for details.
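One way to avoid relying on LRU eviction is to unpersist the previous cache explicitly once the next one has been materialized. The sketch below mirrors the questioner's pattern; T1 and T2 are placeholder transformations invented for the example:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("uncache-demo").getOrCreate()

def T1(df):   # placeholder transformation
    return df.withColumn("doubled", F.col("value") * 2)

def T2(df):   # placeholder transformation
    return df.filter(F.col("doubled") > 2)

df = spark.range(10).withColumnRenamed("id", "value")

prev = None
for T in (T1, T2):
    df = T(df).cache()
    df.count()            # materialize the new cache with an action
    if prev is not None:
        prev.unpersist()  # drop the previous cache instead of waiting for LRU eviction
    prev = df

print(df.collect())
spark.stop()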

RDD in Spark: where and how are they stored?

I've always heard that Spark is 100x faster than classic MapReduce frameworks like Hadoop. But recently I'm reading that this is only true if RDDs are cached, which I thought was always done but instead requires the explicit cache() method.
I would like to understand how all produced RDDs are stored throughout the work. Suppose we have this workflow:
I read a file -> I get the RDD_ONE
I use the map on the RDD_ONE -> I get the RDD_TWO
I use any other transformation on the RDD_TWO
QUESTIONS:
if I don't use cache() or persist(), is every RDD stored in memory, in cache or on disk (local file system or HDFS)?
if RDD_THREE depends on RDD_TWO, and this in turn depends on RDD_ONE (lineage), and I didn't use the cache() method on RDD_THREE, should Spark recalculate RDD_ONE (reread it from disk) and then RDD_TWO to get RDD_THREE?
Thanks in advance.
In Spark there are two types of operations: transformations and actions. A transformation on a dataframe will return another dataframe and an action on a dataframe will return a value.
Transformations are lazy, so when a transformation is performed Spark will add it to the DAG and execute it when an action is called.
Suppose you read a file into a dataframe, then perform a filter, join, aggregate, and then count. The count operation, which is an action, will actually kick off all the previous transformations.
If we call another action (like show), the whole set of operations is executed again, which can be time consuming. So, if we don't want to run the whole set of operations again and again, we can cache the dataframe.
A few pointers to consider while caching:
Cache only when the resulting dataframe is generated from significant transformations. If Spark can regenerate the cached dataframe in a few seconds, then caching is not required.
Cache should be performed when the dataframe is used for multiple actions. If there are only 1-2 actions on the dataframe, then it is not worth saving that dataframe in memory; see the sketch below.
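An illustrative sketch of those two pointers (the column names and sizes are made up): cache a DataFrame only when it comes from a non-trivial transformation and is hit by more than one action:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("when-to-cache").getOrCreate()

orders = spark.range(1_000_000).select(
    (F.col("id") % 100).alias("customer_id"),
    (F.rand() * 100).alias("amount"),
)

# A non-trivial aggregation that is reused by several actions below.
per_customer = orders.groupBy("customer_id").agg(F.sum("amount").alias("total")).cache()

print(per_customer.count())                      # action 1: computes and caches
per_customer.orderBy(F.desc("total")).show(5)    # action 2: reuses the cached result

spark.stop()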
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes
To answer your questions:
Q1: if I don't use cache() or persist(), is every RDD stored in memory, in cache or on disk (local file system or HDFS)? Ans: Considering the data which is available on the worker nodes as blocks in HDFS, when creating an RDD for the file as
val rdd=sc.textFile("<HDFS Path>")
the underlying blocks of data from each node (HDFS) will be loaded into their RAM (i.e. memory) as partitions (in Spark, the blocks of HDFS data are called partitions once loaded into memory).
Q2: if RDD_THREE depends on RDD_TWO and this in turn depends on RDD_ONE (lineage), and I didn't use the cache() method on RDD_THREE, should Spark recalculate RDD_ONE (reread it from disk) and then RDD_TWO to get RDD_THREE? Ans: Yes. Since cache() was not used in this scenario, the intermediate results are not kept in memory, so the lineage has to be recomputed.
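A rough sketch of that lineage (the input path is a placeholder): without cache(), every action on RDD_THREE re-reads the source and replays the transformations; after cache(), later actions are served from memory:

from pyspark import SparkContext

sc = SparkContext("local[*]", "lineage-recompute")

RDD_ONE = sc.textFile("/tmp/input.txt")             # placeholder HDFS/local path
RDD_TWO = RDD_ONE.map(lambda line: line.split())
RDD_THREE = RDD_TWO.map(len)

RDD_THREE.count()   # action 1: reads the file and runs both maps
RDD_THREE.count()   # action 2: the whole lineage is recomputed again

RDD_THREE.cache()
RDD_THREE.count()   # recomputes once more, this time caching the result
RDD_THREE.count()   # now served from memory, no re-read of the file

sc.stop()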

Is RDD kept in memory or is it immediately flushed out of memory after an action is completed?

I am going through a book which, to me, makes contradictory statements. Quoting the book:
"Spark’s RDDs are by default recomputed each time you run an action on them."
But in the next few lines, it states:
"After computing it the first time, Spark will store the RDD contents in memory and reuse them in future actions."
My question is, if RDDs are stored in memory, why is it recomputed each time an action is called on them?
In the first statement it says RDD is recomputed each time, and in the second statement it says, RDD is stored in memory to reuse them for future actions.
"Spark’s RDDs are by default recomputed each time you run an action on them." For your this statement, yes RDDs are recomputed each time you run an action on them. Now the reason behind it is that if it store all the RDDs content in memory then your memory would get exhausted very soon. So, it cannot keep each and every RDDs in memory. When you perform any action on it, it reads from the source data and performs transformations on it and give you the output for your action.
"After computing it the first time, Spark will store the RDD contents in memory and reuse them in future actions." It does not store it in memory by default but based on your use case you can persist the specific RDD using df.cache() or df.persist() and then it will store that RDD content in memory and when you perform any action second time of the RDD that depends upon the cached RDD then it won't read that from the source but it would use it from memory. You only should cache the RDD if you are performing multiple actions on it or there is a complex transformation logic that you don't want spark to perform every time an action is called.

Transformation vs Action in the context of Laziness

As mentioned in the "Learning Spark: Lightning-Fast Big Data Analysis" book:
Transformations and actions are different because of the way Spark computes RDDs.
After some explanation about laziness, as far as I can tell, both transformations and actions work lazily. Therefore, the question is: what does the quoted sentence mean?
It is not necessarily valid to contrast laziness of RDD actions vs transformations.
The correct statement would be that RDDs are lazily evaluated, from the perspective of an RDD as a collection of data: there's not necessarily "data" in memory when the RDD instance is created.
The question raised by this statement is: when does the RDD's data get loaded in memory? Which can be rephrased as "when does the RDD get evaluated?". It's here that we have the distinction between actions and transformations:
Consider the following sequence of code:
Line #1:
rdd = sc.textFile("text-file-path")
Does the RDD exist? Yes.
Is the data loaded in memory? No.
--> RDD evaluation is lazy
Line #2:
rdd2 = rdd.map(lambda line: line.split())
Does the RDD exist? Yes. In fact, there are 2 RDDs.
Is the data loaded in memory? No.
--> Still, it's lazy, all Spark does is record how to load the data and transform it, remembering the lineage (how to derive RDDs one from another).
Line #3
print(rdd2.collect())
Does the RDD exist? Yes (2 RDDs still).
Is the data loaded in memory? Yes.
What's the difference? collect() forces Spark to return the result of the transformations. Spark now does everything it recorded in steps #1 and #2.
In spark's terminology, #1 and #2 are transformations.
Transformations typically return another RDD instance, and that's a hint for recognizing the lazy part.
#3 contains an action, which simply means an operation that causes the plans recorded by the transformations to be carried out in order to return a result or perform a final step, such as saving results (yes, "such as saving the actual collection of data loaded in memory").
So, in short, I'd say that RDDs are lazily evaluated, but, in my opinion, it's incorrect to label operations (actions or transformations) as lazy or not.
Transformations are lazy, actions are not.
Definitions:
Transformation - A function that describes how to derive new data from existing data out on the cluster. Since RDDs are immutable, a transformation produces a new dataset rather than changing data in place. Examples of this are map, filter, and aggregate. These are not executed until an action is called.
Action - Any function that results in data being persisted or returned to the driver (also foreach, which doesn't really fall into those two categories).
In order to run an action (like saving the data), all the transformations you have requested up till now have to be run to materialize the data. Spark can implement optimizations if it looks at the total execution plan of the operations you want to run, so it is beneficial not to compute anything until it is required.
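A short sketch of that last point (DataFrame API, with made-up column names): the transformations are only recorded, explain() shows the single optimized plan, and nothing runs until the action:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("plan-demo").getOrCreate()

df = spark.range(1_000_000)
result = (
    df.withColumn("x", F.col("id") * 2)   # transformation: recorded, not executed
      .filter(F.col("x") > 10)            # transformation: recorded, not executed
      .select("x")
)

result.explain()        # prints the optimized physical plan; still nothing computed
print(result.count())   # the action finally runs the whole optimized plan

spark.stop()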

Does Spark write intermediate shuffle outputs to disk

I'm reading Learning Spark, and I don't understand what it means that Spark's shuffle outputs are written to disk. See Chapter 8, Tuning and Debugging Spark, pages 148-149:
Spark’s internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. A second case in which this truncation can happen is when an RDD is already materialized as a side effect of an earlier shuffle, even if it was not explicitly persisted. This is an under-the-hood optimization that takes advantage of the fact that Spark shuffle outputs are written to disk, and exploits the fact that many times portions of the RDD graph are recomputed.
As I understand it, there are different persistence policies; for example, the default MEMORY_ONLY means the intermediate result will never be persisted to disk.
When and why will a shuffle persist something on disk? How can that be reused by further computations?
When
It happens when an operation that requires a shuffle is evaluated for the first time (by an action), and it cannot be disabled.
Why
This is an optimization. Shuffling is one of the expensive things that happen in Spark.
How can that be reused by further computations?
It is automatically reused with any subsequent action executed on the same RDD.
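A rough sketch of that reuse (numbers are arbitrary): the first action performs the shuffle and writes its map-side output to local disk; a second action on the same RDD can skip that shuffle stage (it shows up as a skipped stage in the Spark UI):

from pyspark import SparkContext

sc = SparkContext("local[*]", "shuffle-reuse")

pairs = sc.parallelize(range(100_000)).map(lambda x: (x % 10, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)   # requires a shuffle

print(counts.count())     # action 1: runs the shuffle, output written to local disk
print(counts.collect())   # action 2: reuses the shuffle output instead of recomputing it

sc.stop()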
