How to access cached data in Spark Streaming application? - apache-spark

I have a Kafka broker with JSON data from my IoT applications. I connect to this server from a Spark Streaming application in order to do some processing.
I'd like to save in memory (RAM) some specific fields of my json data which I believe I could achieve using cache() and persist() operators.
Next time when I receive a new JSON data in the Spark Streaming application, I check in memory (RAM) if there are fields in common that I can retrieve. And if yes, I do some simple computations and I finally update the values of fields I saved in memory (RAM).
Thus, I would like to know if what I previously descibed is possible. If yes, do I have to use cache() or persist() ? And How can I retrieve from memory my fields?

It's possible with cache / persist which uses memory or disk for the data in Spark applications (not necessarily for Spark Streaming applications only -- it's a more general use of caching in Spark).
But...in Spark Streaming you've got special support for such use cases which are called stateful computations. See Spark Streaming Programming Guide to explore what's possible.
I think for your use case mapWithState operator is exactly what you're after.

Spark does not work that way. Please think it through in a distributed way.
For the first part of keeping in RAM. You can use cache() or persist() anyone as by default they keep data in memory, of the worker.
You can verify this from Apache Spark Code.
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()
As far as I understand your use case, you need the UpdateStateByKey Operation to implement your second use case !
For more on Windowing see here.

Related

Spark 2.3.1 Structured Streaming state store inner working

I have been going through the documentation of spark 2.3.1 on structured streaming, but could not find details of how stateful operation works internally with the the state store. More specifically what i would like to know is, (1) is the state store distributed? (2) if so then how, per worker or core ?
It seems like in previous version of Spark it was per worker but no idea for now. I know that it is backed by HDFS, but nothing explained how the in-memory store actually works.
Indeed is it a distributed in-memory store ? I am particularly interested in de-duplication, if data are stream from let say a large data set, then this need to be planned as the all "Distinct" DataSet will be ultimately held in memory as the end of the processing of that data set. Hence one need to plan the size of the worker or master depending on how that state store work.
There is only one implementation of State Store in Structured Streaming which is backed by In-memory HashMap and HDFS.
While In-Memory HashMap is for data storage, HDFS is for fault rolerance.
The HashMap occupies executor memory on the worker and each HashMap represents a versioned key-value data of aggregated partition (generated after aggregator operator like deduplication, groupByy, etc)
But this does not explain how the HDFSBackedStateStore actually work. i don't see it in the documentation
You are correct that there is no such documentation available.
I had to understand the code (2.3.1) , wrote an article on how State Store works internally in Structured Streaming. You might like to have a look : https://www.linkedin.com/pulse/state-management-spark-structured-streaming-chandan-prakash/

How to use RDD checkpointing to share datasets across Spark applications?

I have a spark application, and checkpoint the rdd in the code, a simple code snippet is as follows(It is very simple, just for illustrating my question.):
#Test
def testCheckpoint1(): Unit = {
val data = List("Hello", "World", "Hello", "One", "Two")
val rdd = sc.parallelize(data)
//sc is initialized in the setup
sc.setCheckpointDir(Utils.getOutputDir())
rdd.checkpoint()
rdd.collect()
}
When the rdd is checkpointed on the file system.I write another Spark application and would pick up the data checkpointed in the above code,
and make it as an RDD as a starting point in this second application
The ReliableCheckpointRDD is exactly the RDD that does the work, but this RDD is private to Spark.
So,since ReliableCheckpointRDD is private, it looks spark doesn't recommend to use ReliableCheckpointRDD outside spark.
I would ask if there is a way to do it.
Quoting the scaladoc of RDD.checkpoint (highlighting mine):
checkpoint(): Unit Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext#setCheckpointDir and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
So, RDD.checkpoint will cut the RDD lineage and trigger partial computation so you've got something already pre-computed in case your Spark application may fail and stop.
Note that RDD checkpointing is very similar to RDD caching but caching would make the partial datasets private to some Spark application.
Let's read Spark Streaming's Checkpointing (that in some way extends the concept of RDD checkpointing making it closer to your needs to share the results of computations between Spark applications):
Data checkpointing Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.
So, yes, in a sense you could share the partial results of computations in a form of RDD checkpointing, but why would you even want to do it if you could save the partial results using the "official" interface using JSON, parquet, CSV, etc.
I doubt using this internal persistence interface could give you more features and flexibility than using the aforementioned formats. Yes, it is indeed technically possible to use RDD checkpointing to share datasets between Spark applications, but it's too much effort for not much gain.

Caching a large stream

I am working on a streaming application, which I am caching a large RDD (that is only in memory)..
Dstream.cache()
Dstream.foreachRDD(..)
Dstream.foreachRDD(..)
I wanted to know if the Dstream can not be fit into the memory..Is the RDD recomputed or raise an exception?
I am asking this question since I am developing a stateful application using mapwithState function which internally uses an internal stream which is presisted only in memory.(https://github.com/wliuxad/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/MapWithStateDStream.scala#L109-109)
Depends on which RDD we're talking about. MapWithStateDStream caches the data inside a OpenHashMapBasedStateMap. It doesn't spill to disk. This means that you need to have sufficient memory in order for your application to work properly. When you think about it, how can you evict state? It isn't some RDD that is being persisted, it is part of your application logic.
One thing that is evicted is the cached RDD from your source. From your previous example I see that you're using Kafka, so that means that the cached KafkaRDD will be evicted once Spark sees fit.

Repartition in repartitionAndSortWithinPartitions happens on driver or on worker

I am trying to understand the concept of repartitionAndSortWithinPartitions in Spark Streaming whether the repartition happens on driver or on worker. If suppose it happens on driver then does worker wait for all the data to come before sorting happens.
Like any other transformation it is handled by executors. Data is not passed via the driver. In other words this standard shuffle mechanism and there is nothing streaming specific here.
Destination of each record is defined by:
Its key.
Partitioner used for a given shuffle.
Number of partitions.
and data is passed directly between executor nodes.
From the comments it looks like you're more interested in a Spark Streaming architecture. If that's the case you should take a look at Diving into Apache Spark Streaming’s Execution Model. To give you some overview there can exist two different types of streams:
Receiver-based with a receiver node per stream.
Direct (without receiver) where only metadata is assigned to executors but data is fetched directly.

How to update an RDD?

We are developing Spark framework wherein we are moving historical data into RDD sets.
Basically, RDD is immutable, read only dataset on which we do operations.
Based on that we have moved historical data into RDD and we do computations like filtering/mapping, etc on such RDDs.
Now there is a use case where a subset of the data in the RDD gets updated and we have to recompute the values.
HistoricalData is in the form of RDD.
I create another RDD based on request scope and save the reference of that RDD in a ScopeCollection
So far I have been able to think of below approaches -
Approach1: broadcast the change:
For each change request, my server fetches the scope specific RDD and spawns a job
In a job, apply a map phase on that RDD -
2.a. for each node in the RDD do a lookup on the broadcast and create a new Value which is now updated, thereby creating a new RDD
2.b. now I do all the computations again on this new RDD at step2.a. like multiplication, reduction etc
2.c. I Save this RDDs reference back in my ScopeCollection
Approach2: create an RDD for the updates
For each change request, my server fetches the scope specific RDD and spawns a job
On each RDD, do a join with the new RDD having changes
now I do all the computations again on this new RDD at step2 like multiplication, reduction etc
Approach 3:
I had thought of creating streaming RDD where I keep updating the same RDD and do re-computation. But as far as I understand it can take streams from Flume or Kafka. Whereas in my case the values are generated in the application itself based on user interaction.
Hence I cannot see any integration points of streaming RDD in my context.
Any suggestion on which approach is better or any other approach suitable for this scenario.
TIA!
The usecase presented here is a good match for Spark Streaming. The two other options bear the question: "How do you submit a re-computation of the RDD?"
Spark Streaming offers a framework to continuously submit work to Spark based on some stream of incoming data and preserve that data in RDD form. Kafka and Flume are only two possible Stream sources.
You could use Socket communication with the SocketInputDStream, reading files in a directory using FileInputDStream or even using shared Queue with the QueueInputDStream. If none of those options fit your application, you could write your own InputDStream.
In this usecase, using Spark Streaming, you will read your base RDD and use the incoming dstream to incrementally transform the existing data and maintain an evolving in-memory state. dstream.transform will allow you to combine the base RDD with the data collected during a given batch interval, while the updateStateByKey operation could help you build an in-memory state addressed by keys. See the documentation for further information.
Without more details on the application is hard to go up to the code level on what's possible using Spark Streaming. I'd suggest you to explore this path and make new questions for any specific topics.
I suggest to take a look at IndexedRDD implementation, which provides updatable RDD of key value pairs. That might give you some insights.
The idea is based on the knowledge of the key and that allows you to zip your updated chunk of data with the same keys of already created RDD. During update it's possible to filter out previous version of the data.
Having historical data, I'd say you have to have sort of identity of an event.
Regarding streaming and consumption, it's possible to use TCP port. This way the driver might open a TCP connection spark expects to read from and sends updates there.

Resources