RDDs in Spark: where and how are they stored? - apache-spark

I've always heard that Spark is 100x faster than classic MapReduce frameworks like Hadoop. But recently I've been reading that this is only true if RDDs are cached, which I thought was done automatically but instead requires an explicit cache() call.
I would like to understand how all the RDDs produced are stored throughout the job. Suppose we have this workflow:
I read a file -> I get RDD_ONE
I apply a map on RDD_ONE -> I get RDD_TWO
I apply any other transformation on RDD_TWO
QUESTIONS:
If I don't use cache() or persist(), is every RDD stored in memory, in a cache, or on disk (local file system or HDFS)?
If RDD_THREE depends on RDD_TWO, which in turn depends on RDD_ONE (lineage), and I didn't use the cache() method on RDD_THREE, does Spark have to recalculate RDD_ONE (reread it from disk) and then RDD_TWO to get RDD_THREE?
Thanks in advance.

In Spark there are two types of operations: transformations and actions. A transformation on a dataframe returns another dataframe, and an action on a dataframe returns a value.
Transformations are lazy, so when a transformation is performed Spark adds it to the DAG and only executes it when an action is called.
Suppose you read a file into a dataframe, then perform a filter, a join, and an aggregation, and then count. The count operation, which is an action, actually kicks off all the previous transformations.
If we call another action (like show), the whole chain of operations is executed again, which can be time consuming. So, if we don't want to run the whole set of operations again and again, we can cache the dataframe (see the sketch after the pointers below).
A few pointers you can consider while caching:
Cache only when the resulting dataframe is generated from significant transformations. If Spark can regenerate the cached dataframe in a few seconds, caching is not required.
Caching should be done when the dataframe is used by multiple actions. If there are only 1-2 actions on the dataframe, it is not worth keeping it in memory.
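A rough sketch of that pipeline, assuming a SparkSession named spark (as in spark-shell); the file paths and column names are invented for the example:
import spark.implicits._

val orders    = spark.read.json("/data/orders.json")
val customers = spark.read.json("/data/customers.json")

val result = orders
  .filter($"amount" > 100)               // transformation: lazy
  .join(customers, Seq("customerId"))    // transformation: lazy
  .groupBy($"country").count()           // transformation: lazy
  .cache()                               // only marks the dataframe for caching

result.count()   // first action: runs the whole plan and fills the cache
result.show()    // second action: served from the cache, not recomputed from the files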

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.

To answer your questions:
Q1: If I don't use cache() or persist(), is every RDD stored in memory, in a cache, or on disk (local file system or HDFS)?
Ans: Consider the data available on the worker nodes as blocks in HDFS. When you create an RDD for the file as
val rdd=sc.textFile("<HDFS Path>")
the underlying blocks of data from each node (HDFS) are loaded into that node's RAM (i.e. memory) as partitions (in Spark, the blocks of HDFS data are called partitions once loaded into memory).
Q2: If RDD_THREE depends on RDD_TWO, which in turn depends on RDD_ONE (lineage), and I didn't use the cache() method on RDD_THREE, does Spark have to recalculate RDD_ONE (reread it from disk) and then RDD_TWO to get RDD_THREE?
Ans: Yes. Since the intermediate results were not kept around with cache() in this scenario, Spark has to recompute the whole lineage.
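A sketch of that lineage (the path and the functions are placeholders); caching RDD_TWO stops Spark from going all the way back to the file on later actions:
val rddOne   = sc.textFile("<HDFS Path>")              // RDD_ONE: read from HDFS
val rddTwo   = rddOne.map(line => line.toUpperCase)    // RDD_TWO: lazy transformation
val rddThree = rddTwo.filter(_.nonEmpty)               // RDD_THREE: lazy transformation

rddThree.count()   // action: reads the file, then runs map and filter
rddThree.count()   // nothing was cached, so the whole lineage runs again, file read included

rddTwo.cache()     // mark RDD_TWO for caching (still lazy)
rddThree.count()   // this run computes RDD_TWO and keeps its partitions in memory
rddThree.count()   // now only the filter runs; the file is not re-read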

Related

What happens to the previous RDD when it gets transformed into a new RDD?

I am a beginner in Apache Spark. What I have understood so far regarding RDDs is that, once an RDD is generated, it cannot be modified, but it can be transformed into another RDD. Now I have a few doubts:
If an RDD is transformed into another RDD by applying some transformation, what happens to the previous RDD? Are both RDDs stored in memory?
If an RDD is cached, and some transformation is applied to the cached RDD to generate a new RDD, can there be a scenario where there is not enough space in RAM to hold the newly generated RDD? If such a scenario occurs, how will Spark handle it?
Thanks in advance!
Due to Spark's lazy evaluation, nothing happens when you do transformations on RDDs. Spark only starts computing when you call an action (save, take, collect, ...). So to answer your questions,
The original RDD stays where it is, and the transformed RDD does not exist / has not been computed yet due to lazy evaluation. Only a query plan is generated for it.
Regardless of whether the original RDD is cached, the transformed RDD will not be computed until an action is called. Therefore running out of memory shouldn't happen.
Normally when you run out of memory, either you encounter an OOM error, or the cached RDDs in memory will be spilled onto the disk to free up memory.
In order to understand the answers to your questions, you need to know a couple of things about Spark:
Spark evaluation Model (Lazy Evaluation)
Spark Operations (Transformations and Actions)
Directed Acyclic Graph (DAG)
Answer to your first Question:
You can think of an RDD as a virtual data structure that does not get filled with values until some action is called on it, which materializes the rdd/dataframe. When you perform transformations, Spark just builds a query plan, which reflects its lazy evaluation behavior. When an action is called, it performs all the transformations based on the physical plan that gets generated. So nothing happens to the previous RDDs; RDD data is only pulled into memory when an action is called.
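You can actually inspect that plan before any action runs; for instance (the path is a placeholder), toDebugString shows an RDD's lineage and explain shows a dataframe's query plan:
val rdd = sc.textFile("<HDFS Path>").map(_.split(",")).filter(_.length > 1)
println(rdd.toDebugString)   // prints the lineage / DAG; nothing has been computed yet

val df = spark.read.text("<HDFS Path>").filter("value != ''")
df.explain()                 // prints the query plan; the filter has not been executed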
Answer to your Second Question:
If an RDD is cached and you perform multiple transformations on top of the cached RDD, nothing actually happens right away, because cache() itself is lazy. The RDD you have cached will only be in memory once an action has been performed. So you won't run out of memory just by applying transformations.
You could run into memory issues if you try to cache every step of the transformation, which should be avoided. (Whether or not to cache a dataframe/RDD is a million-dollar question as a beginner, but you get a feel for it as you learn the basics and the Spark architecture.)
Another workflow where you can run out of memory is when you have a huge data size and you cache the RDD after some transformations because you want to perform multiple actions on it, or because it is used multiple times in the workflow. In this case you need to verify your cluster configuration and make sure it can handle the data you intend to cache (a rough sketch of the relevant settings follows).
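A hedged sketch of the kind of sizing involved; the values below are invented for illustration and depend entirely on the data you want to cache and the cluster you have:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cache-sizing-example")
  .config("spark.executor.instances", "10")   // how many executors hold cached partitions
  .config("spark.executor.memory", "8g")      // heap per executor
  .config("spark.executor.cores", "4")
  .config("spark.memory.fraction", "0.6")     // share of the heap for execution + storage (cache)
  .getOrCreate()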

Spark driver running out of memory when reading multiple files

My program works like this:
Read in a lot of files as dataframes. Among those files there is a group of about 60 files with 5k rows each, where I create a separate Dataframe for each of them, do some simple processing and then union them all into one dataframe which is used for further joins.
I perform a number of joins and column calculations on a number of dataframes, which finally results in a target dataframe.
I save the target dataframe as a Parquet file.
In the same spark application, I load that Parquet file and do some heavy aggregation followed by multiple self-joins on that dataframe.
I save the second dataframe as another Parquet file.
The problem
If I have just one file instead of 60 in the group of files mentioned above, everything works with the driver having 8g of memory. With 60 files, the first 3 steps work fine, but the driver runs out of memory when preparing the second file. Things improve only when I increase the driver's memory to 20g.
The Question
Why is that? When calculating the second file I do not use the Dataframes used to calculate the first file, so their number and content should not really matter as long as the size of the first Parquet file remains constant, should they? Do those 60 dataframes get cached somehow and occupy the driver's memory? I don't do any caching myself. I also never collect anything. I don't understand why 8g of memory would not be sufficient for the Spark driver.
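For reference, the union pattern described in step 1 of the question might look roughly like this (the paths, the header option and the added column are invented for the sketch); the union itself is lazy, so it does not by itself pull data onto the driver:
import org.apache.spark.sql.functions.lit

// hypothetical list of ~60 input paths
val paths = (1 to 60).map(i => s"/data/input_$i.csv")

// one dataframe per file, with some simple processing
val parts = paths.map { p =>
  spark.read.option("header", "true").csv(p).withColumn("source", lit(p))
}

// union them all into a single dataframe used for the later joins
val combined = parts.reduce(_ union _)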
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//you have to use serialization configuration if you are using MEMORY_AND_DISK_SER
val rdd1 = sc.textFile("some data")
rdd1.persist(storageLevel.MEMORY_AND_DISK_SER) // marks rdd as persist
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.persist(storageLevel.MEMORY_AND_DISK_SER)
rdd3.persist(storageLevel.MEMORY_AND_DISK_SER)
rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")
rdd1.unpersist()
rdd2.unpersist()
rdd3.unpersist()
For tuning your code follow this link
Caching and persistence are optimisation techniques for (iterative and interactive) Spark computations. They help save interim partial results so they can be reused in subsequent stages. These interim results are kept as RDDs in memory (the default) or in more durable storage such as disk, and/or replicated.
RDDs can be cached using cache operation. They can also be persisted using persist operation.
The difference between cache and persist operations is purely syntactic. cache is a synonym of persist or persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY.
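In code the equivalence reads like this (the path is a placeholder):
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("<HDFS Path>")

// for an RDD these two calls mean exactly the same thing:
rdd.cache()
// rdd.persist(StorageLevel.MEMORY_ONLY)

rdd.count()      // the first action materializes and caches the partitions
rdd.unpersist()  // releases the cached partitions once they are no longer needed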
refer to use of persist and unpersist

How does Spark recover the data from a failed node?

Suppose we have an RDD which is used multiple times. So, to avoid computing it again and again, we persisted this RDD using the rdd.persist() method.
When we persist this RDD, the nodes computing the RDD will store their partitions.
Now suppose the node containing a persisted partition of the RDD fails; what will happen? How will Spark recover the lost data? Is there any replication mechanism, or some other mechanism?
When you call rdd.persist, the RDD doesn't materialize its content. It does so when you perform an action on it; it follows the same lazy evaluation principle.
Now an RDD knows the partitions on which it should operate and the DAG associated with it. With the DAG it is perfectly capable of recreating a materialized partition.
So, when a node fails, the driver spawns another executor on some other node and provides it, in a closure, with the data partition on which it was supposed to work and the DAG associated with it. With this information it can recompute the data and materialize it.
In the meantime, the cached RDD won't have all of its data in memory; the data of the lost node has to be fetched from disk or recomputed, so it will take a little more time.
On replication: yes, Spark supports in-memory replication. You need to set StorageLevel.MEMORY_AND_DISK_2 when you persist.
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)
This ensures the data is replicated on two nodes.
I think the best way I was able to understand how Spark is resilient was when someone told me that I should not think of RDDs as big, distributed arrays of data.
Instead, I should picture them as containers holding instructions on what steps to take to convert data from the data source, one step at a time, until a result is produced.
Now if you really care about losing data when persisting, then you can specify that you want to replicate your cached data.
For this, you need to select storage level. So instead of normally using this:
MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
You can specify that you want your persisted data replicated:
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. - Same as the levels above, but replicate each partition on two cluster nodes.
So if the node fails, you will not have to recompute the data.
Check storage levels here: http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence

When will Spark clean the cached RDDs automatically?

The RDDs that have been cached using the rdd.cache() method from the Scala shell are stored in memory.
That means they consume part of the RAM available to the Spark process itself.
Having said that, if RAM is limited and more and more RDDs get cached, when will Spark automatically clean the memory occupied by the RDD cache?
Spark will clean cached RDDs and Datasets / DataFrames:
When it is explicitly asked to by calling RDD.unpersist (How to uncache RDD?) / Dataset.unpersist methods or Catalog.clearCache.
At regular intervals, by the cache cleaner:
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
When the corresponding distributed data structure is garbage collected.
Spark will automatically un-persist/clean the RDD or DataFrame if the RDD is no longer used. To check whether an RDD is cached, open the Spark UI, go to the Storage tab, and look at the memory details.
From the shell, we can use rdd.unpersist() or sqlContext.uncacheTable("sparktable")
to remove the RDD or tables from memory. Spark is built around lazy evaluation: unless and until you call an action, it does not load or process any data into the RDD or DataFrame.
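A minimal sketch of those manual clean-up calls (the paths and the table name are placeholders):
val rdd = sc.textFile("<HDFS Path>").cache()
val df  = spark.read.json("<HDFS Path>").cache()

rdd.unpersist()              // drop the cached RDD explicitly
df.unpersist()               // drop the cached DataFrame / Dataset
spark.catalog.clearCache()   // drop everything cached through the SQL / DataFrame API
// older API mentioned above, for a table that was cached by name:
// sqlContext.uncacheTable("sparktable")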

Is Spark RDD cached on worker node or driver node (or both)?

Can anyone please correct my understanding of persisting in Spark?
If we have performed a cache() on an RDD its value is cached only on those nodes where actually RDD was computed initially.
Meaning, If there is a cluster of 100 Nodes, and RDD is computed in partitions of first and second nodes. If we cached this RDD, then Spark is going to cache its value only in first or second worker nodes.
So when this Spark application is trying to use this RDD in later stages, then Spark driver has to get the value from first/second nodes.
Am I correct?
(OR)
Is it something that the RDD value is persisted in driver memory and not on nodes ?
Change this:
then Spark is going to cache its value only in first or second worker nodes.
to this:
then Spark is going to cache its value only in first and second worker nodes.
and...Yes correct!
Spark tries to minimize memory usage (and we love it for that!), so it won't do any unnecessary memory loads, since it evaluates every statement lazily: it won't do any actual work for a transformation, it will wait for an action to happen, which leaves Spark no choice but to do the actual work (read the file, send the data over the network, do the computation, collect the result back to the driver, for example).
You see, we don't want to cache everything unless we really can (that is, the memory capacity allows for it; yes, we can ask for more memory in the executors and/or the driver, but sometimes our cluster just doesn't have the resources, which is really common when we handle big data) and it really makes sense, i.e. the cached RDD is going to be used again and again (so caching it will speed up the execution of our job).
That's why you want to unpersist() your RDD, when you no longer need it...! :)
Check this image; it is from one of my jobs, where I had requested 100 executors, but the Executors tab displayed 101, i.e. 100 slaves/workers and one master/driver:
RDD.cache is a lazy operation; it does nothing until you call an action like count. Once you call the action, the operation will use the cache: it will just take the data from the cache and do the operation.
RDD.cache - persists the RDD with the default storage level (memory only).
Spark RDD API
2. Is it something that the RDD value is persisted in driver memory and not on the nodes?
An RDD can be persisted to disk as well as memory. Click the link to the Spark documentation for all the options:
Spark Rdd Persist
# no actual caching at the end of this statement
rdd1 = spark.read.json('myfile.json').rdd.map(lambda row: myfunc(row)).cache()
# again, no actual caching yet, because Spark is lazy, and won't evaluate anything unless
# a reduction op is called
rdd2 = rdd1.map(mysecondfunc)
# caching happens on this reduce operation. The result of rdd1 will be cached in the memory of each worker node
n = rdd1.count()
So to answer your question
If we have performed a cache() on an RDD its value is cached only on those nodes where actually RDD was computed initially
The only possibility of caching something is on worker nodes, and not on driver nodes.
The cache function can only be applied to an RDD (refer), and since an RDD only exists in the worker nodes' memory (Resilient Distributed Datasets!), its results are cached in the respective worker nodes' memory. Once you apply an operation like count, which brings the result back to the driver, it's not really an RDD anymore; it's merely the result of a computation done on the RDD by the worker nodes in their respective memories.
Since cache in the above example was called on rdd1, which still lives on multiple worker nodes, the caching only happens in the worker nodes' memory.
In the above example, when we do another map-reduce op on rdd1, it won't read the JSON again, because it was cached.
FYI, I am using the word memory on the assumption that the caching level is set to MEMORY_ONLY. Of course, if that level is changed, Spark will cache to either memory or disk based on the setting.
Here is an excellent answer on caching
(Why) do we need to call cache or persist on a RDD
Basically, caching stores the RDD in the memory/disk (based on the persistence level set) of that node, so that when this RDD is needed again it does not have to recompute its lineage (lineage - the set of prior transformations executed to reach the current state).
