Where is Spark structured streaming state of mapGroupsWithState stored? - apache-spark

I know that the state is persisted at the checkpoint location as the state store,
but while it is still in memory, where is it stored?
I created a streaming job that uses mapGroupsWithState, but I see that the storage memory used by the executors is 0.
Does this mean that the state is stored in the execution memory?
I can't tell how much memory the state consumes, so I'm not sure whether I need to increase the executor memory or not.
Also, is it possible to avoid checkpointing the state at all and keep the state always in memory?

As mapGroupsWithState is a stateful aggregation, its state is stored where all aggregations are kept within the lifetime of a Spark application: in the execution memory (as you have already assumed).
Looking at the signature of the method
def mapGroupsWithState[S: Encoder, U: Encoder](
func: (K, Iterator[V], GroupState[S]) => U): Dataset[U]
you will notice that S is the type of the user-defined state, and this is where the state is managed.
As this will be sent to the executors it must be encodable to Spark SQL types. Therefore, you would typically use a case class in Scala or a Bean in Java. The GroupState is a typed wrapper object that provides methods to access and manage the state value.
It is crucial that you, as the developer, also take care of how data gets removed from this state. Otherwise, your state will inevitably cause an OOM, as it will only grow and never shrink.
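For illustration, here is a minimal sketch (the Event/UserState case classes, the userId grouping key and the timeout value are made up) of a mapGroupsWithState call that keeps a running count per key and uses a processing-time timeout so that idle state eventually gets removed:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Hypothetical input and state types; both must be encodable to Spark SQL types.
case class Event(userId: String, value: Long)
case class UserState(userId: String, count: Long)

def updateState(userId: String, events: Iterator[Event], state: GroupState[UserState]): UserState = {
  if (state.hasTimedOut) {
    val last = state.get
    state.remove()                         // drop state for idle keys so it never grows unbounded
    last
  } else {
    val current = state.getOption.getOrElse(UserState(userId, 0L))
    val updated = current.copy(count = current.count + events.size)
    state.update(updated)
    state.setTimeoutDuration("1 hour")     // keys idle longer than this are timed out
    updated
  }
}

// events: Dataset[Event] read from some streaming source; spark.implicits._ in scope
val updates = events
  .groupByKey(_.userId)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateState)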
If you do not enable checkpointing in your structured stream then nothing is stored, but you have the drawback of losing your state during a failure. In case you have enabled checkpointing, e.g. to keep track of the input source, Spark will also store the current state in the checkpoint location.
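Continuing the sketch above (sink, trigger and path are placeholders), checkpointing for a structured stream is enabled on the query itself via the checkpointLocation option:

import org.apache.spark.sql.streaming.{OutputMode, Trigger}

val query = updates.writeStream
  .outputMode(OutputMode.Update())
  .format("console")                                               // placeholder sink
  .option("checkpointLocation", "hdfs:///tmp/my-query-checkpoint") // placeholder path
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()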

If you enable checkpointing, the state is stored in the State Store. By default this is an HDFSBackedStateStore, but that can be overridden too. A good read on this would be https://medium.com/@polarpersonal/state-storage-in-spark-structured-streaming-e5c8af7bf509
As the other answer already mentioned, if you don't enable checkpointing you will lose fault tolerance and at-least-once guarantees.

Related

Spark checkpointing behaviour

Does Spark use checkpoints when we start a new job? Let's say we used a checkpoint to write some RDD to disk. Will that RDD be recalculated or loaded from disk during a new job?
In addition to the points given by @maxime G...
Spark does not offer checkpointing by default; we need to set it explicitly.
Checkpointing is actually a feature of Spark Core (that Spark SQL uses for distributed computations) that allows a driver to be restarted on failure with previously computed state of a distributed computation described as an RDD.
Spark offers two varieties of checkpointing.
Reliable checkpointing: uses reliable data storage like Hadoop HDFS or S3, and you can achieve it simply by doing
sparkContext.setCheckpointDir("(hdfs:// or s3://)tmp/checkpoint/")
and then
dataframe.checkpoint(eager = true)
Non-reliable checkpointing: local checkpointing uses executor storage (i.e. node-local disk storage) to write checkpoint files to, and due to the executor lifecycle it is considered unreliable; it does not promise that data will be available if the job terminates abruptly.
sparkContext.setCheckpointDir("/tmp/checkpoint/")
dataframe.localCheckpoint(eager = true)
(Be careful when you are using local checkpointing and cluster autoscaling is enabled.)
Note:
Checkpointing can be eager or lazy, per the eager flag of the checkpoint operator. Eager checkpointing is the default and happens immediately when requested; lazy checkpointing only happens when an action is executed.
An eager checkpoint creates an immediate stage barrier, while a lazy one waits for an action to happen and remembers all previous transformations until then.
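A minimal sketch of the difference (the DataFrame df and the path are hypothetical):

sparkContext.setCheckpointDir("hdfs:///tmp/checkpoint/")

val eagerCp = df.checkpoint(eager = true)   // materialized immediately, stage barrier right away
val lazyCp  = df.checkpoint(eager = false)  // nothing is written yet
lazyCp.count()                              // this action triggers the lazy checkpoint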
At the start of the job, if an RDD is present in your checkpoint location, it will be loaded.
That also means that if you change your code you should be careful about checkpointing, because an RDD produced by the old code is loaded into the new code, and that can cause conflicts.

Where is the State of Stateful Operations saved in a Spark Cluster

I was experimenting with 'flatMapGroupsWithState' in Spark Structured Streaming. The idea is interesting, but now I am asking myself: due to the distributed nature of Spark, where is this state information kept?
Let's say I have a cluster of 10 nodes. Will all 10 share the storage load to keep this state information, or is there a risk that one node in the cluster gets overloaded?
I read somewhere that the state object must be Java Serializable. Considering Java serialization is extremely inefficient, is there a way to customize this to use Protobuf or Avro, etc.?
Thx for answers..
where is this state information kept?
On executors.
By default there are 200 state stores, as there are 200 partitions. You can change that using the spark.sql.shuffle.partitions configuration property; the number of state stores is always equal to the number of partitions. It also means that whatever you use as grouping keys will shuffle your data across partitions, and (most likely) some of the available state stores will have no state at all (be empty).
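As a sketch (the value 50 is just an example), this is set on the session before the streaming query is started, and it is then locked in by the checkpoint for the lifetime of the query:

spark.conf.set("spark.sql.shuffle.partitions", "50")  // 50 partitions => 50 state stores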
Let's say I have a cluster of 10 nodes. Will all 10 share the storage load to keep this state information, or is there a risk that one node in the cluster gets overloaded?
Yes, but it's controlled by the grouping keys and partitions, i.e. the code a Spark developer writes.
I read somewhere that the state object must be Java Serializable. Considering Java serialization is extremely inefficient
No need to think about serialization as state stores are local to tasks (on executors).
, is there a way to customize this to use Protobuf or Avro, etc.?
Sure. You would have to write your own state store implementation. By default there is one and only one, HDFSBackedStateStoreProvider, which is configured using the spark.sql.streaming.stateStore.providerClass internal configuration property.
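As a sketch (com.example.state.MyStateStoreProvider is a hypothetical class you would have to write by implementing the StateStoreProvider trait):

spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "com.example.state.MyStateStoreProvider")  // hypothetical custom provider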

Spark 2.3.1 Structured Streaming state store inner working

I have been going through the documentation of Spark 2.3.1 on Structured Streaming, but could not find details of how stateful operations work internally with the state store. More specifically, what I would like to know is: (1) is the state store distributed? (2) if so, then how, per worker or per core?
It seems like in previous versions of Spark it was per worker, but I have no idea for the current one. I know that it is backed by HDFS, but nothing explains how the in-memory store actually works.
Indeed, is it a distributed in-memory store? I am particularly interested in de-duplication: if data is streamed from, let's say, a large data set, then this needs to be planned for, as the whole "distinct" data set will ultimately be held in memory by the end of processing that data set. Hence one needs to plan the size of the workers or the master depending on how that state store works.
There is only one implementation of the State Store in Structured Streaming, which is backed by an in-memory HashMap and HDFS.
While the in-memory HashMap is for data storage, HDFS is for fault tolerance.
The HashMap occupies executor memory on the worker, and each HashMap represents the versioned key-value data of an aggregated partition (generated after an aggregation operator like deduplication, groupBy, etc.).
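Conceptually (a toy sketch only, not the actual HDFSBackedStateStore code), each store keeps something like a map from batch version to the key-value state of its partition, and persists a delta per version to the checkpoint location:

// Toy illustration: batch version -> (key -> value) for one partition's state
val loadedMaps = scala.collection.mutable.Map.empty[Long, Map[String, String]]

def commit(version: Long, partitionState: Map[String, String]): Unit = {
  loadedMaps(version) = partitionState   // held in executor memory
  // the real store also writes a delta file for this version to HDFS,
  // which is what provides fault tolerance
}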
But this does not explain how the HDFSBackedStateStore actually works. I don't see it in the documentation.
You are correct that there is no such documentation available.
I had to read the code (2.3.1) and wrote an article on how the State Store works internally in Structured Streaming. You might like to have a look: https://www.linkedin.com/pulse/state-management-spark-structured-streaming-chandan-prakash/

Caching a large stream

I am working on a streaming application in which I am caching a large RDD (memory-only):
Dstream.cache()
Dstream.foreachRDD(..)
Dstream.foreachRDD(..)
I wanted to know: if the DStream cannot fit into memory, is the RDD recomputed, or does it raise an exception?
I am asking this question since I am developing a stateful application using the mapWithState function, which internally uses a stream that is persisted only in memory (https://github.com/wliuxad/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/MapWithStateDStream.scala#L109-109).
It depends on which RDD we're talking about. MapWithStateDStream caches the data inside an OpenHashMapBasedStateMap. It doesn't spill to disk. This means that you need to have sufficient memory in order for your application to work properly. When you think about it, how could you evict state? It isn't some RDD that is being persisted; it is part of your application logic.
One thing that is evicted is the cached RDD from your source. From your previous example I see that you're using Kafka, so that means that the cached KafkaRDD will be evicted once Spark sees fit.
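As a hedged sketch (the key/value types and the trackUser function are made up), the usual lever for bounding that in-memory state with mapWithState is a StateSpec timeout, which marks idle keys as timed out so their entries can be dropped:

import org.apache.spark.streaming.{Minutes, State, StateSpec}

// Hypothetical: running count of events per user id
def trackUser(userId: String, value: Option[Long], state: State[Long]): (String, Long) = {
  val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0L)
  if (!state.isTimingOut()) state.update(newCount)   // updating a timing-out state would throw
  (userId, newCount)
}

// pairs: DStream[(String, Long)]
val stateDstream = pairs.mapWithState(
  StateSpec.function(trackUser _).timeout(Minutes(30)))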

How can I understand checkpoint recovery when using a Kafka direct InputDStream and stateful stream transformations?

On a YARN cluster I use a Kafka direct stream as input (the batch time is 15 s, for example), and I want to aggregate the input messages per user id.
So I use a stateful streaming API like updateStateByKey or mapWithState. But from the API source I see that mapWithState's default checkpoint duration is batchDuration * 10 (in my case 150 s), while in the Kafka direct stream the partition offsets are checkpointed at every batch (15 s). Actually, every DStream can set a different checkpoint duration.
So, my question is:
When the streaming app crashes and I restart it, the Kafka offsets and the state stream RDD are checkpointed asynchronously; in this case, how can I avoid losing data? Or do I misunderstand the checkpoint mechanism?
How can I avoid losing data?
Stateful streams such as mapWithState or updateStateByKey require you to provide a checkpoint directory because that's part of how they operate: they store the state at regular intervals to be able to recover it upon a crash.
Other than that, each DStream in the chain is free to request checkpointing as well; the question is "do you really need to checkpoint the other streams?"
If an application crashes, Spark takes all the state RDDs stored inside the checkpoint and brings them back into memory, so your data there is as good as it was the last time Spark checkpointed it. One thing to keep in mind is that if you change your application code, you cannot recover the state from the checkpoint; you'll have to delete it. This means that if, for instance, you need to do a version upgrade, all data that was previously stored in the state will be gone unless you manually save it yourself in a manner which allows versioning.
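As a sketch (the path, app name and batch interval are placeholders), recovery with the DStream API goes through StreamingContext.getOrCreate: on a clean start the factory function builds the pipeline, and after a crash the context, state RDDs and offsets are restored from the checkpoint directory instead.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/app-checkpoint"   // placeholder

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("stateful-app")
  val ssc = new StreamingContext(conf, Seconds(15))
  ssc.checkpoint(checkpointDir)
  // build the Kafka direct stream and the mapWithState / updateStateByKey pipeline here
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()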
