How to checkpoint a RDD without saving all of its data? - apache-spark

I am running a series of jobs, and an intermediate RDD is used in all of them. I cached the intermediate RDDs, but after some iterations the jobs slow down. I then checkpointed the RDDs after caching to break the lineage that is no longer required. In the Spark UI I can confirm that the checkpointing is done correctly, but it also takes time because it writes each RDD to the local file system. What is an effective way to break unnecessary lineage without saving the actual RDD data?

The whole point of checkpointing is to save all the data. This is what makes it possible to break the lineage and "forget" about the past. Without saving the data, breaking the lineage is simply not possible.
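As a minimal sketch of the cache-then-checkpoint pattern discussed above (assuming a local master and a temporary directory standing in for a reliable store such as HDFS; the object name and the sample transformations are made up for illustration):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object CheckpointSketch {
  // Returns (isCheckpointed, lineage before the checkpoint, lineage after it).
  def run(): (Boolean, String, String) = {
    val spark = SparkSession.builder()
      .master("local[2]")                 // assumption: local test run
      .appName("checkpoint-sketch")
      .getOrCreate()
    val sc = spark.sparkContext
    // Assumption: a temp directory stands in for HDFS or another reliable store.
    sc.setCheckpointDir(Files.createTempDirectory("ckpt").toString)

    val rdd = sc.parallelize(1 to 1000).map(_ + 1).filter(_ % 2 == 0).map(_ * 3)
    val before = rdd.toDebugString

    rdd.cache()       // avoids computing the RDD a second time when the checkpoint is written
    rdd.checkpoint()  // only marks the RDD; nothing is written yet
    rdd.count()       // the first action fills the cache and writes the checkpoint

    val result = (rdd.isCheckpointed, before, rdd.toDebugString)
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = {
    val (done, before, after) = run()
    println(s"checkpointed: $done")
    println(s"lineage before:\n$before")
    println(s"lineage after:\n$after")
  }
}
```

Comparing the two printed lineages should show that the chain of map and filter steps is no longer referenced once the checkpoint has been written, which is exactly the part that requires the data to be saved.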

Related

What happens to the previous RDD when it gets transformed into a new RDD?

I am a beginner in Apache Spark. What I have understood so far is that once an RDD is generated it cannot be modified, but it can be transformed into another RDD. I have several doubts here:
If an RDD is transformed into another RDD by applying some transformation, what happens to the previous RDD? Are both RDDs stored in memory?
If an RDD is cached and a transformation is applied to the cached RDD to generate a new RDD, can there be a scenario where there is not enough space in RAM to hold the newly generated RDD? If such a scenario occurs, how will Spark handle it?
Thanks in advance!
Because Spark evaluates lazily, nothing is computed when you apply transformations to RDDs. Spark only starts computing when you call an action (save, take, collect, ...). So to answer your questions:
The original RDD stays where it is, and the transformed RDD does not exist yet because it has not been computed. Only its lineage (the plan for computing it) is recorded.
Regardless of whether the original RDD is cached, the transformed RDD will not be computed until an action is called, so defining the transformation cannot by itself exhaust memory.
When memory does run short at execution time, you either hit an OOM error or cached partitions are evicted: with a storage level that includes disk (for example MEMORY_AND_DISK) they are written to disk, while with the RDD default MEMORY_ONLY they are dropped and recomputed when needed.
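The lazy behaviour described in this answer can be observed with a small sketch. It assumes a local master and uses an accumulator purely as a call counter; the object and accumulator names are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvalSketch {
  // Returns how often the map function had run before and after the action.
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[2]")               // assumption: local test run
      .appName("lazy-eval-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val calls = sc.longAccumulator("mapCalls")
    val source = sc.parallelize(1 to 100)
    // Transformation: only recorded in the lineage, not executed.
    val doubled = source.map { x => calls.add(1); x * 2 }

    val before = calls.sum  // 0, because nothing has executed yet
    doubled.count()         // action: the map function runs now
    val after = calls.sum   // 100 in a failure-free local run

    spark.stop()
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val (before, after) = run()
    println(s"map calls before action: $before, after action: $after")
  }
}
```

Accumulators updated inside transformations can be incremented more than once if a task is retried, so the counter is only a rough illustration and not something to rely on in production code.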
To understand the answers to your questions, you need to know a couple of things about Spark:
Spark evaluation Model (Lazy Evaluation)
Spark Operations (Transformations and Actions)
Directed Acyclic Graph (DAG)
Answer to your first Question:
You can think of an RDD as a virtual data structure that is not filled with values until an action is called on it, which materializes the RDD/DataFrame. Performing transformations only builds up a plan (the lineage for an RDD, a query plan for a DataFrame), which is Spark's lazy evaluation at work. When an action is called, Spark performs all the transformations according to the physical plan that gets generated. So nothing happens to the earlier RDDs; data is pulled into memory only when an action is called.
Answer to your Second Question:
If an RDD is cached and you perform multiple transformations on top of the cached RDD, nothing happens to the RDDs at that point, because cache is lazy just like a transformation. The cached RDD is placed in memory only when an action is performed, so defining further transformations does not by itself use up memory.
You could run into memory issues if you try to cache every step of the transformation chain, which should be avoided. (Whether or not to cache a DataFrame/RDD is the million-dollar question for a beginner, but it becomes clearer as you learn the basics and the Spark architecture.)
Another workflow where you can run out of memory is when the data is huge and you cache the RDD after some transformation because you want to perform multiple actions on it or it is used several times in the workflow. In this case you need to check your cluster configuration and make sure it can handle the data you intend to cache.
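A hedged sketch of the "cache once, run several actions, then release" workflow mentioned above. The local master, the sample data and the choice of MEMORY_AND_DISK are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheReuseSketch {
  // Returns (sum of squares, number of even squares, storage level after unpersist).
  def run(): (Long, Long, StorageLevel) = {
    val spark = SparkSession.builder()
      .master("local[2]")               // assumption: local test run
      .appName("cache-reuse-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for an expensive intermediate result.
    val expensive = sc.parallelize(1 to 10000).map(x => x.toLong * x)

    // Lazy: partitions that do not fit in memory are stored on disk instead.
    expensive.persist(StorageLevel.MEMORY_AND_DISK)

    val total = expensive.reduce(_ + _)               // first action computes and stores
    val evens = expensive.filter(_ % 2 == 0).count()  // second action reuses stored partitions

    expensive.unpersist(blocking = true)              // release the storage once it is no longer needed
    val level = expensive.getStorageLevel

    spark.stop()
    (total, evens, level)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Persisting only the RDD that is actually reused, and unpersisting it afterwards, keeps the cache from filling up with every intermediate step.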

Differences between persist(DISK_ONLY) vs manually saving to HDFS and reading back

This answer clearly explains RDD persist() and cache() and the need for it - (Why) do we need to call cache or persist on a RDD
So, I understand that calling someRdd.persist(DISK_ONLY) is lazy, but someRdd.saveAsTextFile("path") is eager.
But other than this (also disregarding the cleanup of text file stored in HDFS manually), is there any other difference (performance or otherwise) between using persist to cache the rdd to disk versus manually writing and reading from disk?
Is there a reason to prefer one over the other?
More Context: I came across code which manually writes to HDFS and reads it back in our production application. I've just started learning Spark and was wondering if this can be replaced with persist(DISK_ONLY). Note that the saved rdd in HDFS is deleted before every new run of the job and this stored data is not used for anything else between the runs.
There are at least these differences:
Writing to HDFS will have the replicas overhead, while caching is written locally on the executor (or to second replica if DISK_ONLY_2 is chosen).
Writing to HDFS is persistent, while cached data might get lost if/when an executor is killed for any reason. And you already mentioned the benefit of writing to HDFS when the entire application goes down.
Caching does not change the partitioning, but reading from HDFS might/will result in different partitioning than the original written DataFrame/RDD. For example, small partitions (files) will be aggregated and large files will be split.
I usually prefer to cache small/medium data sets that are expensive to evaluate, and write larger data sets to HDFS.
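The two options compared in this thread can be put side by side in a small sketch. It assumes a local master and a temporary local directory in place of HDFS; the object name and data are illustrative.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object DiskPersistVsSaveSketch {
  // Returns the record counts seen through the persisted RDD and the reloaded files.
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[2]")               // assumption: local test run
      .appName("persist-vs-save-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 1000, numSlices = 8).map(_.toString)

    // Option 1: lazy, stored on the executors' local disks, tied to this application.
    rdd.persist(StorageLevel.DISK_ONLY)
    val persistedCount = rdd.count()

    // Option 2: eager, written to a file system and read back as a new RDD.
    // Assumption: a temp directory stands in for an HDFS path.
    val out = Files.createTempDirectory("rdd-out").resolve("data").toString
    rdd.saveAsTextFile(out)
    val reloaded = sc.textFile(out)
    val reloadedCount = reloaded.count()

    // Partitioning of the reloaded RDD is decided by the input format, not by the writer.
    println(s"partitions: persisted=${rdd.getNumPartitions}, reloaded=${reloaded.getNumPartitions}")

    spark.stop()
    (persistedCount, reloadedCount)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Note that the reloaded RDD has no lineage back to the original computation, whereas the persisted one can still be recomputed if an executor and its blocks are lost.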

Spark createDataFrame(df.rdd, df.schema) vs checkPoint for breaking lineage

I'm currently using
val df=longLineageCalculation(....)
val newDf=sparkSession.createDataFrame(df.rdd, df.schema)
newDf.join......
In order to save time when calculating plans. However, the docs say that checkpointing is the suggested way to "cut" lineage, BUT I don't want to pay the price of saving the RDD to disk.
My process is a batch process which is not so long and can be restarted without issues, so checkpointing is of no benefit to me (I think).
What problems can arise from using "my" method? (The docs suggest checkpointing, which is more expensive, instead of this approach for breaking lineage, and I would like to know the reason.)
The only thing I can guess is that if some node fails after my "lineage breaking", maybe my process will fail where the checkpointed one would have worked correctly? (What if the DF is cached instead of checkpointed?)
Thanks!
EDIT:
From SMaZ's answer, my own knowledge, and the article he provided: using createDataFrame (which is a Dev-API, so use at "my"/your own risk) will keep the lineage in memory (not a problem for me, since I don't have memory problems and the lineage is not big).
With this, it looks (not 100% tested) like Spark should be able to rebuild whatever is needed if it fails.
As I'm not using the data in the following executions, I'll go with cache+createDataFrame rather than checkpointing (which, if I'm not wrong, is actually cache+saveToHDFS+"createDataFrame").
My process is not that critical (if it crashes), since a user will always be expecting the result and they launch it manually. If it gives problems, they can relaunch (+Spark will relaunch it) or call me, so I can take some risk anyway, but I'm 99% sure there's no risk :)
Let me start with creating dataframe with below line :
val newDf=sparkSession.createDataFrame(df.rdd, df.schema)
If we take a close look at the SparkSession class, this method is annotated with @DeveloperApi. To understand what this annotation means, take a look at the lines below from the DeveloperApi class:
A lower-level, unstable API intended for developers.
Developer API's might change or be removed in minor versions of Spark.
So it is not advised to use this method in production solutions; in the open source world this is known as a "use at your own risk" implementation.
However, let's dig deeper into what happens when we call createDataFrame from an RDD. It calls the private internalCreateDataFrame method and creates a LogicalRDD.
LogicalRDD is created when:
Dataset is requested to checkpoint
SparkSession is requested to create a DataFrame from an RDD of internal binary rows
So it is much the same as the checkpoint operation, but without saving the dataset physically. It just creates a DataFrame from an RDD of internal binary rows and a schema. This might truncate the lineage in memory, but not at the physical level.
So I believe it is just the overhead of creating another RDD, and it cannot be used as a replacement for checkpoint.
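To compare the two approaches side by side, here is a minimal sketch. It assumes a local master and a temporary checkpoint directory, and the foldLeft over withColumn is only a hypothetical stand-in for longLineageCalculation(...).

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object LineageCutSketch {
  // Returns the row counts of the rebuilt and the checkpointed DataFrame.
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[2]")               // assumption: local test run
      .appName("lineage-cut-sketch")
      .getOrCreate()
    // Assumption: a temp directory stands in for a reliable checkpoint location.
    spark.sparkContext.setCheckpointDir(Files.createTempDirectory("df-ckpt").toString)

    // Hypothetical stand-in for longLineageCalculation(...): 20 chained projections.
    val df: DataFrame = (1 to 20).foldLeft(spark.range(100).toDF("v")) {
      (acc, _) => acc.withColumn("v", col("v") + 1)
    }

    // Approach from the question: a new DataFrame over the same RDD.
    // Nothing is written to storage; the RDD keeps its full lineage.
    val rebuilt = spark.createDataFrame(df.rdd, df.schema)

    // Checkpoint: eager by default, writes the data to the checkpoint directory.
    val checkpointed = df.checkpoint()

    // Print both plans for comparison.
    rebuilt.explain()
    checkpointed.explain()

    val counts = (rebuilt.count(), checkpointed.count())
    spark.stop()
    counts
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Both variants give the optimizer a short plan to work with; the difference is that only the checkpointed DataFrame is backed by saved data, while the rebuilt one still has to recompute from the original source when partitions are lost.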
Now, checkpointing is the process of truncating the lineage graph and saving the data to a reliable distributed or local file system.
Why checkpoint?
The computation takes a long time, the lineage is too long, or it depends on too many RDDs
Keeping heavy lineage information comes at the cost of memory.
The checkpoint files are not deleted automatically even after the Spark application has terminated, so they can be used by some other process
What are the problems which can arise using "my" method? (Docs
suggests checkpointing, which is more expensive, instead of this one
for breaking lineages and I would like to know the reason)
This article will give detailed information on cache and checkpoint. IIUC, your question is more about where we should use checkpointing. Let's discuss some practical scenarios where checkpointing is helpful:
Take a scenario where we have one dataset on which we want to perform 100 iterative operations, and each iteration takes the result of the previous iteration as input (Spark MLlib use cases). During this iterative process the lineage keeps growing. Here, checkpointing the dataset at a regular interval (say every 10 iterations) ensures that after a failure we can resume from the last checkpoint.
Take a batch example. Imagine we have a batch job that creates one master dataset with heavy lineage or complex computations. At regular intervals we then receive data that should use the previously calculated master dataset. If we checkpoint the master dataset, it can be reused by all subsequent processes from a different SparkSession.
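The first scenario (checkpointing every few iterations) could look roughly like the sketch below. The interval of 10, the 30 iterations, the column name and the temporary checkpoint directory are all assumptions chosen to keep the example small.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

object IterativeCheckpointSketch {
  // Returns the sum of column v after all iterations.
  def run(): Long = {
    val spark = SparkSession.builder()
      .master("local[2]")               // assumption: local test run
      .appName("iterative-checkpoint-sketch")
      .getOrCreate()
    // Assumption: a temp directory stands in for a reliable checkpoint location.
    spark.sparkContext.setCheckpointDir(Files.createTempDirectory("iter-ckpt").toString)

    var df = spark.range(5).toDF("v")   // values 0..4
    for (i <- 1 to 30) {
      // Each iteration builds on the previous result, so the plan keeps growing.
      df = df.withColumn("v", col("v") + 1)
      // Every 10 iterations, save the data and continue from the truncated plan.
      if (i % 10 == 0) df = df.checkpoint()
    }

    val total = df.agg(sum("v")).first().getLong(0)
    spark.stop()
    total
  }

  def main(args: Array[String]): Unit = println(s"sum after 30 iterations: ${run()}")
}
```

Without the periodic checkpoint the plan would contain all 30 projections by the end; with it, the plan after each checkpoint starts again from the saved data.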
My process is a batch process which is not-so-long and can be
restarted without issues, so checkpointing is not benefit for me (I
think).
That's correct. If your process does not involve heavy computation or a big lineage, there is no point in checkpointing. The rule of thumb is that if your dataset is not used multiple times and can be rebuilt faster than the time and resources needed for checkpoint/cache, then we should avoid it. That leaves more resources for your process.
I think sparkSession.createDataFrame(df.rdd, df.schema) will impact Spark's fault tolerance.
checkpoint() saves the RDD in HDFS or S3, so if a failure occurs, it recovers from the last checkpoint data.
createDataFrame(), by contrast, just breaks the lineage graph.

Should you always unpersist earlier cached rdds after you cache an rdd that appears later in the same lineage graph?

I have an rdd that I cache after loading data from s3, since I don't want to have to re-pull from s3 if I lose an executor. I then make a bunch of transformations on that rdd, and then cache again.
At this point, is there any reason to leave the first cached rdd in the cache? Will all later stages just pull from the more recently cached transformation if I don't use the earlier rdd again?
I don't want to have to re-pull from s3 if I lose an executor.
Default caching variants don't protect you from executor loss. Spark provides replicated cache options (MEMORY_ONLY_SER_2, MEMORY_AND_DISK_SER_2, DISK_ONLY_2) which add some protection in case of node failure, but they are more expensive than the non-replicated variants.
is there any reason to leave the first cached rdd in the cache?
If the second one has been materialized, there is no reason to keep the first one, but the LRU cleaner should be able to handle this case without your help if necessary.
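A sketch of the pattern from this thread, assuming a local master and a parallelized range as a hypothetical stand-in for the data loaded from S3. On a single local executor there is no peer to replicate to, so the replicated level only shows the API here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ReplicatedCacheSketch {
  // Returns (level of first RDD after unpersist, level of second RDD, row count).
  def run(): (StorageLevel, StorageLevel, Long) = {
    val spark = SparkSession.builder()
      .master("local[2]")               // assumption: local test run
      .appName("replicated-cache-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical stand-in for the RDD loaded from S3.
    val loaded = sc.parallelize(1 to 1000)
    // Replicated variant: each partition is stored on two executors where possible.
    loaded.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

    val transformed = loaded.map(_ * 2).filter(_ % 3 == 0)
    transformed.cache()                 // RDD default level: MEMORY_ONLY
    val n = transformed.count()         // materializes both caches

    // The second cache is materialized, so the first can be released explicitly
    // instead of waiting for eviction.
    loaded.unpersist(blocking = true)

    val result = (loaded.getStorageLevel, transformed.getStorageLevel, n)
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```

One trade-off to keep in mind: once the first RDD is unpersisted, any lost partition of the second RDD has to be recomputed all the way from the source.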

What does "Stage Skipped" mean in Apache Spark web UI?

From my Spark UI. What does it mean by skipped?
Typically it means that the data has been fetched from cache and there was no need to re-execute the given stage. This is consistent with your DAG, which shows that the next stage requires shuffling (reduceByKey). Whenever shuffling is involved, Spark automatically caches the generated data:
Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed.
Suppose you have an initial data frame with some data. You perform a couple of transformations on top of it and then perform multiple actions on the final data frame. If you had cached a data frame, Spark materializes it when you call an action and keeps it in memory in materialized form. When the next action is called, Spark goes through the whole DAG, sees that the data frame was cached, and skips those stages by using the materialized state it already has in memory.
When a stage is skipped, you will see it marked as skipped in the Spark UI. This speeds up your operation, because Spark does not have to compute the DAG from the root and can start from the cached data frame.
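A small sketch that should reproduce a skipped stage through shuffle file reuse, assuming a local master; the word list and object name are made up. To actually see the "skipped" label you would need to look at the Spark UI while the application is still running.

```scala
import org.apache.spark.sql.SparkSession

object SkippedStageSketch {
  // Returns the word counts computed by the second job.
  def run(): Map[String, Int] = {
    val spark = SparkSession.builder()
      .master("local[2]")               // assumption: local test run
      .appName("skipped-stage-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)               // introduces a shuffle, hence a stage boundary

    // First job: runs the map-side stage and the reduce-side stage.
    counts.count()

    // Second job on the same RDD: the shuffle files from the first job still exist,
    // so the map-side stage does not need to run again.
    val result = counts.collect().toMap

    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```

No explicit cache() call is involved here; the reuse comes from the shuffle output that Spark keeps on disk for as long as the RDD is referenced.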

Resources