How to use RDD checkpointing to share datasets across Spark applications? - apache-spark

I have a spark application, and checkpoint the rdd in the code, a simple code snippet is as follows(It is very simple, just for illustrating my question.):
#Test
def testCheckpoint1(): Unit = {
val data = List("Hello", "World", "Hello", "One", "Two")
val rdd = sc.parallelize(data)
//sc is initialized in the setup
sc.setCheckpointDir(Utils.getOutputDir())
rdd.checkpoint()
rdd.collect()
}
When the rdd is checkpointed on the file system.I write another Spark application and would pick up the data checkpointed in the above code,
and make it as an RDD as a starting point in this second application
The ReliableCheckpointRDD is exactly the RDD that does the work, but this RDD is private to Spark.
So,since ReliableCheckpointRDD is private, it looks spark doesn't recommend to use ReliableCheckpointRDD outside spark.
I would ask if there is a way to do it.

Quoting the scaladoc of RDD.checkpoint (highlighting mine):
checkpoint(): Unit Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext#setCheckpointDir and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
So, RDD.checkpoint will cut the RDD lineage and trigger partial computation so you've got something already pre-computed in case your Spark application may fail and stop.
Note that RDD checkpointing is very similar to RDD caching but caching would make the partial datasets private to some Spark application.
Let's read Spark Streaming's Checkpointing (that in some way extends the concept of RDD checkpointing making it closer to your needs to share the results of computations between Spark applications):
Data checkpointing Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.
So, yes, in a sense you could share the partial results of computations in a form of RDD checkpointing, but why would you even want to do it if you could save the partial results using the "official" interface using JSON, parquet, CSV, etc.
I doubt using this internal persistence interface could give you more features and flexibility than using the aforementioned formats. Yes, it is indeed technically possible to use RDD checkpointing to share datasets between Spark applications, but it's too much effort for not much gain.

Related

RDD in Spark: where and how are they stored?

I've always heard that Spark is 100x faster than classic Map Reduce frameworks like Hadoop. But recently I'm reading that this is only true if RDDs are cached, which I thought was always done but instead requires the explicit cache () method.
I would like to understand how all produced RDDs are stored throughout the work. Suppose we have this workflow:
I read a file -> I get the RDD_ONE
I use the map on the RDD_ONE -> I get the RDD_TWO
I use any other transformation on the RDD_TWO
QUESTIONS:
if I don't use cache () or persist () is every RDD stored in memory, in cache or on disk (local file system or HDFS)?
if RDD_THREE depends on RDD_TWO and this in turn depends on RDD_ONE (lineage) if I didn't use the cache () method on RDD_THREE Spark should recalculate RDD_ONE (reread it from disk) and then RDD_TWO to get RDD_THREE?
Thanks in advance.
In spark there are two types of operations: transformations and actions. A transformation on a dataframe will return another dataframe and an action on a dataframe will return a value.
Transformations are lazy, so when a transformation is performed spark will add it to the DAG and execute it when an action is called.
Suppose, you read a file into a dataframe, then perform a filter, join, aggregate, and then count. The count operation which is an action will actually kick all the previous transformation.
If we call another action(like show) the whole operation is executed again which can be time consuming. So, if we want not to run the whole set of operation again and again we can cache the dataframe.
Few pointers you can consider while caching:
Cache only when the resulting dataframe is generated from significant transformation. If spark can regenerate the cached dataframe in few seconds then caching is not required.
Cache should be performed when the dataframe is used for multiple actions. If there are only 1-2 actions on the dataframe then it is not worth saving that dataframe in memory.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes
To Answer your question:
Q1:if I don't use cache () or persist () is every RDD stored in memory, in cache or on disk (local file system or HDFS)? Ans: Considering the data which is available in workers node as blocks in HDFS, when creating rdd for the file as
val rdd=sc.textFile("<HDFS Path>")
the underlying blocks of data from each nodes (HDFS) will be loaded to their RAM's(i,e memory) as partitions (in spark, the blocks of hdfs data are called as partitions once loaded into memory)
Q2: if RDD_THREE depends on RDD_TWO and this in turn depends on RDD_ONE (lineage) if I didn't use the cache () method on RDD_THREE Spark should recalculate RDD_ONE (reread it from disk) and then RDD_TWO to get RDD_THREE? Ans: Yes.Since the underlying results are not stored in drivers memory by using cache() in this scenario.

What's the overhead of converting an RDD to a DataFrame and back again?

It was my assumption that Spark Data Frames were built from RDDs. However, I recently learned that this is not the case, and Difference between DataFrame, Dataset, and RDD in Spark does a good job explaining that they are not.
So what is the overhead of converting an RDD to a DataFrame, and back again? Is it negligible or significant?
In my application, I create a DataFrame by reading a text file into an RDD and then custom-encoding every line with a map function that returns a Row() object. Should I not be doing this? Is there a more efficient way?
RDDs have a double role in Spark. First of all is the internal data structure for tracking changes between stages in order to manage failures and secondly until Spark 1.3 was the main interface for interaction with users. Therefore after after Spark 1.3 Dataframes constitute the main interface offering much richer functionality than RDDs.
There is no significant overhead when converting one Dataframe to RDD with df.rdd since the dataframes they already keep an instance of their RDDs initialized therefore returning a reference to this RDD should not have any additional cost. On the other side, generating a dataframe from an RDD requires some extra effort. There are two ways to convert an RDD to dataframe 1st by calling rdd.toDF() and 2nd with spark.createDataFrame(rdd, schema). Both methods will evaluate lazily although there will be an extra overhead regarding the schema validation and execution plan (you can check the toDF() code here for more details). Of course that would be identical to the overhead that you have just by initializing your data with spark.read.text(...) but with one less step, the conversion from RDD to dataframe.
This the first reason that I would go directly with Dataframes instead of working with two different Spark interfaces.
The second reason is that when using the RDD interface you are missing some significant performance features that dataframes and datasets offer related to Spark optimizer (catalyst) and memory management (tungsten).
Finally I would use the RDDs interface only if I need some features that are missing in dataframes such as key-value pairs, zipWithIndex function etc. But even then you can access those via df.rdd which is costless as already mentioned. As for your case , I believe that would be faster to use directly a dataframe and use the map function of that dataframe to ensure that Spark leverages the usage of tungsten ensuring efficient memory management.

What does Spark recover the data from a failed node?

Suppose we have an RDD, which is being used multiple times. So to save the computations again and again, we persisted this RDD using the rdd.persist() method.
So when we are persisting this RDD, the nodes computing the RDD will be storing their partitions.
So now suppose, the node containing this persisted partition of RDD fails, then what will happen? How will spark recover the lost data? Is there any replication mechanism? Or some other mechanism?
When you do rdd.persist, rdd doesn't materialize the content. It does when you perform an action on the rdd. It follows the same lazy evaluation principle.
Now an RDD knows the partition on which it should operate and the DAG associated with it. With the DAG it is perfectly capable of recreating the materialized partition.
So, when a node fails the driver spawn another executor in some other node and provides it the Data partition on which it was supposed to work and the DAG associated with it in a closure. Now with this information it can recompute the data and materialize it.
In the mean time the cached data in the RDD won't have all the data in memory, the data of the lost nodes it has to fetch from the disk it will take so little more time.
On the replication, yes spark supports in memory replication. You need to set StorageLevel.MEMORY_DISK_2 when you persist.
rdd.persist(StorageLevel.MEMORY_DISK_2)
This ensures the data is replicated twice.
I think the best way I was able to understand how Spark is resilient was when someone told me that I should not think of RDDs as big, distributed arrays of data.
Instead I should picture them as a container that had instructions on what steps to take to convert data from data source and take one step at a time until a result was produced.
Now if you really care about losing data when persisting, then you can specify that you want to replicate your cached data.
For this, you need to select storage level. So instead of normally using this:
MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
You can specify that you want your persisted data replcated
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. - Same as the levels above, but replicate each partition on two cluster nodes.
So if the node fails, you will not have to recompute the data.
Check storage levels here: http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence

Misunderstanding of spark RDD fault tolerant

Many say:
Spark does not replicate data in hdfs.
Spark arranges the operations in DAG graph.Spark builds RDD lineage. If a RDD is lost they can be rebuilt with the help of lineage graph.
So there is no need of data replication as the RDDs can be recalculated from the lineage graph.
And my question is:
If a node fails, spark will only recompute the RDD partitions lost on this node, but where does the data source needed in the recompution process come from ? Do you mean its parent RDD is still there when the node fails?What if the RDD that lost some partitions didn't have parent RDD(like the RDD is from spark streaming receiver) ?
What if we lose something part way through computation?
Rely on the key insight from MR! Determinism provides safe recompute.
Track 'lineage' of each RDD. Can recompute from parents if needed.
Interesting: only need to record tiny state to do recompute.
Need parent pointer, function applied, and a few other bits.
Log 10 KB per transform rather than re-output 1 TB -> 2 TB
Source
The child RDD is metadata that describes how to calculate the RDD from the parent RDD. Read more in What is RDD dependency in Spark?
If a node fails, spark will only recompute the RDD partitions lost on this node, but where does the data source needed in the recompution process come from ? Do you mean its parent RDD is still there when the node fails?
The core idea is that you can use the lineage to recover lost RDDs because RDDs are
built from another RDD or
built from data in stable storage.
(source: RDD paper, beginning of section 2.1)
If some RDD is lost, you can just go back in the lineage until you reach some RDD or the initial data record that is still available.
The data in stable storage is replicated across multiple nodes, therefore unlikely to be lost.
As far from what I've read about Streaming Receivers, the received data seems to be saved in stable storage as well, so it behaves just like any other data source.

How to update an RDD?

We are developing Spark framework wherein we are moving historical data into RDD sets.
Basically, RDD is immutable, read only dataset on which we do operations.
Based on that we have moved historical data into RDD and we do computations like filtering/mapping, etc on such RDDs.
Now there is a use case where a subset of the data in the RDD gets updated and we have to recompute the values.
HistoricalData is in the form of RDD.
I create another RDD based on request scope and save the reference of that RDD in a ScopeCollection
So far I have been able to think of below approaches -
Approach1: broadcast the change:
For each change request, my server fetches the scope specific RDD and spawns a job
In a job, apply a map phase on that RDD -
2.a. for each node in the RDD do a lookup on the broadcast and create a new Value which is now updated, thereby creating a new RDD
2.b. now I do all the computations again on this new RDD at step2.a. like multiplication, reduction etc
2.c. I Save this RDDs reference back in my ScopeCollection
Approach2: create an RDD for the updates
For each change request, my server fetches the scope specific RDD and spawns a job
On each RDD, do a join with the new RDD having changes
now I do all the computations again on this new RDD at step2 like multiplication, reduction etc
Approach 3:
I had thought of creating streaming RDD where I keep updating the same RDD and do re-computation. But as far as I understand it can take streams from Flume or Kafka. Whereas in my case the values are generated in the application itself based on user interaction.
Hence I cannot see any integration points of streaming RDD in my context.
Any suggestion on which approach is better or any other approach suitable for this scenario.
TIA!
The usecase presented here is a good match for Spark Streaming. The two other options bear the question: "How do you submit a re-computation of the RDD?"
Spark Streaming offers a framework to continuously submit work to Spark based on some stream of incoming data and preserve that data in RDD form. Kafka and Flume are only two possible Stream sources.
You could use Socket communication with the SocketInputDStream, reading files in a directory using FileInputDStream or even using shared Queue with the QueueInputDStream. If none of those options fit your application, you could write your own InputDStream.
In this usecase, using Spark Streaming, you will read your base RDD and use the incoming dstream to incrementally transform the existing data and maintain an evolving in-memory state. dstream.transform will allow you to combine the base RDD with the data collected during a given batch interval, while the updateStateByKey operation could help you build an in-memory state addressed by keys. See the documentation for further information.
Without more details on the application is hard to go up to the code level on what's possible using Spark Streaming. I'd suggest you to explore this path and make new questions for any specific topics.
I suggest to take a look at IndexedRDD implementation, which provides updatable RDD of key value pairs. That might give you some insights.
The idea is based on the knowledge of the key and that allows you to zip your updated chunk of data with the same keys of already created RDD. During update it's possible to filter out previous version of the data.
Having historical data, I'd say you have to have sort of identity of an event.
Regarding streaming and consumption, it's possible to use TCP port. This way the driver might open a TCP connection spark expects to read from and sends updates there.

Resources