Where is the state of stateful operations saved in a Spark cluster? - apache-spark

I was experimenting with 'flatMapGroupsWithState' in Spark Structured Streaming. The idea is interesting, but now I am asking myself: given the distributed nature of Spark, where is this state information kept?
Let's say I have a cluster of 10 nodes: will all 10 share the storage load to keep this state information, or is there a risk that one node in the cluster gets overloaded?
I read somewhere that the state object must be Java serializable. Considering that Java serialization is extremely inefficient, is there a way to customize this to use Protocol Buffers, Avro, etc.?
Thanks for any answers.

where is this state information kept?
On executors.
By default there are as many state stores as there are partitions, i.e. 200. You can change this using the spark.sql.shuffle.partitions configuration property; the number of partitions is always equivalent to the number of state stores. It also means that whatever you use as grouping keys will shuffle your data across partitions, and (most likely) some of the available state stores will hold no state at all (be empty).
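For illustration, a minimal sketch of tuning that property before starting a stateful query (the value 32 is an arbitrary example):

// Fewer shuffle partitions means fewer (and larger) state stores per query.
spark.conf.set("spark.sql.shuffle.partitions", "32")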
Let's say I have a cluster of 10 nodes: will all 10 share the storage load to keep this state information, or is there a risk that one node in the cluster gets overloaded?
Yes, but the distribution is controlled by the grouping keys and the number of partitions, both of which are in the hands of the Spark developer writing the code.
I read somewhere that the state object must be Java serializable, considering Java serialization is extremely inefficient
No need to think about serialization as state stores are local to tasks (on executors).
is there a way to customise this to use Protobuffer or Avro, etc...
Sure. You would have to write your own state store implementation. By default there is one and only one, HDFSBackedStateStoreProvider, which is configured using the internal spark.sql.streaming.stateStore.providerClass configuration property.
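A sketch of pointing Spark at such a provider; the class name below is purely hypothetical and would have to implement Spark's StateStoreProvider interface:

// Hypothetical provider class; Spark itself ships only HDFSBackedStateStoreProvider.
spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "com.example.state.AvroStateStoreProvider")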

Related

Where is Spark structured streaming state of mapGroupsWithState stored?

I know that the state is persisted at the checkpoint location as the state store,
but I don't know where it is stored while it is still in memory.
I created a streaming job that uses mapGroupsWithState, but I see that the storage memory used by the executors is 0.
Does this mean that the state is stored in the execution memory?
I can't tell the amount of memory consumed by the state, so I am not sure whether I need to increase the executor memory or not!
Also, is it possible to avoid checkpointing of the state at all and keep the state always in memory?
As mapGroupsWithState is an aggregation, it will be stored where all aggregations are kept within the lifetime of a Spark application: in the execution memory (as you have already assumed).
Looking at the signature of the method
def mapGroupsWithState[S: Encoder, U: Encoder](
func: (K, Iterator[V], GroupState[S]) => U): Dataset[U]
you will notice that S is the type of the user-defined state, and this is where the state is managed.
As this will be sent to the executors, it must be encodable to Spark SQL types. Therefore, you would typically use a case class in Scala or a bean in Java. The GroupState is a typed wrapper object that provides methods to access and manage the state value.
It is crucial that you, as a developer, also take care of how data gets removed from this state. Otherwise your state will inevitably cause an OOM, as it will only grow and never shrink.
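A minimal sketch of such cleanup, using a processing-time timeout to drop idle keys; the Event and RunningTotal types and the 30-minute timeout are assumptions for illustration:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Hypothetical event and state types.
case class Event(userId: String, value: Long)
case class RunningTotal(total: Long)

// Sums values per user; removes state for keys idle longer than the timeout,
// so the state only grows as long as keys stay active.
def updateTotals(
    userId: String,
    events: Iterator[Event],
    state: GroupState[RunningTotal]): (String, Long) = {
  if (state.hasTimedOut) {
    val last = state.get.total
    state.remove()                      // shrink the state, avoid the OOM
    (userId, last)
  } else {
    val sum = state.getOption.map(_.total).getOrElse(0L) + events.map(_.value).sum
    state.update(RunningTotal(sum))
    state.setTimeoutDuration("30 minutes")
    (userId, sum)
  }
}

// events is assumed to be a streaming Dataset[Event]:
// events.groupByKey(_.userId)
//   .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateTotals)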
If you do not enable checkpointing in your structured stream, then nothing is stored, but you have the drawback of losing your state during a failure. In case you have enabled checkpointing, e.g. to keep track of the input source, Spark will also store the current state in the checkpoint location.
If you enable checkpointing, the states are stored in the State Store. By default it's an HDFSBackedStateStore, but that can be overridden too. A good read on this would be https://medium.com/@polarpersonal/state-storage-in-spark-structured-streaming-e5c8af7bf509
As the other answer already mentioned, if you don't enable checkpointing you will lose fault tolerance and at-least-once guarantees.
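A sketch of enabling checkpointing when starting a query, assuming a streaming Dataset named events; the checkpoint path and console sink are placeholders:

// State (and source offsets) are persisted under the checkpoint location.
val query = events.writeStream
  .outputMode("update")
  .option("checkpointLocation", "hdfs:///tmp/checkpoints/my-query")  // placeholder path
  .format("console")
  .start()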

"total-executor-cores" parameter in Spark in relation to Data Nodes

Another item that I read little about.
Leaving S3 aside, and not being in a position just now to try this out on a bare-metal, classic data-locality setup of Spark and Hadoop, and assuming Dynamic Resource Allocation is not in use:
What if a large dataset in HDFS is distributed over (all) N data nodes in the cluster, but the total-executor-cores parameter is set lower than N, and we obviously need to read all the data on (all) N relevant data nodes?
I assume Spark has to ignore this parameter for reading from HDFS. Or not?
If it is ignored, does an executor core need to be allocated on each such data node, and thus get acquired by the overall job, so that this parameter has to be interpreted as applying to processing and not to reading blocks?
Is the data from such a data node immediately shuffled to where the executors were allocated?
Thanks in advance.
There seems to be a little bit of confusion here.
Optimal data locality (node-local) is something we want to achieve, not something that is guaranteed. All Spark can do is request resources (for example with YARN - How YARN knows data locality in Apache Spark in cluster mode) and hope that it will get resources which satisfy the data locality constraints.
If it doesn't, it will simply fetch the data from remote nodes. However, this is not a shuffle; it is just a simple transfer over the network.
So to answer your question - Spark will use the resources that have been allocated, doing its best to satisfy the constraints. It cannot use nodes which haven't been acquired, so it won't automatically get additional nodes for reads.
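For reference, on a standalone cluster the spark-submit flag --total-executor-cores corresponds to the spark.cores.max property, so a sketch of that cap in code would look like this (the value 6 and the app name are arbitrary):

import org.apache.spark.sql.SparkSession

// Caps the total cores across all executors; reads from data nodes without
// an executor then become network transfers, not shuffles.
val spark = SparkSession.builder()
  .appName("locality-demo")
  .config("spark.cores.max", "6")
  .getOrCreate()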

Spark 2.3.1 Structured Streaming state store inner working

I have been going through the documentation of Spark 2.3.1 on Structured Streaming, but could not find details of how stateful operations work internally with the state store. More specifically, what I would like to know is: (1) is the state store distributed? (2) if so, how - per worker or per core?
It seems like in previous versions of Spark it was per worker, but I have no idea now. I know that it is backed by HDFS, but nothing explains how the in-memory store actually works.
Indeed, is it a distributed in-memory store? I am particularly interested in de-duplication: if data is streamed from, let's say, a large data set, then this needs to be planned for, as the whole "distinct" data set will ultimately be held in memory by the end of the processing of that data set. Hence one needs to plan the size of the workers or the master depending on how that state store works.
There is only one implementation of the State Store in Structured Streaming, which is backed by an in-memory HashMap and HDFS.
While the in-memory HashMap is for data storage, HDFS is for fault tolerance.
Each HashMap occupies executor memory on a worker and represents the versioned key-value data of an aggregated partition (generated after an aggregation operator like deduplication, groupBy, etc.).
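On the de-duplication concern specifically, a sketch of bounding that state with a watermark, so old keys are eventually evicted instead of being held for the lifetime of the job (the column names are assumptions):

// Keys older than the watermark are dropped from the state store,
// so the "distinct" state does not accumulate indefinitely.
val deduped = events
  .withWatermark("eventTime", "1 hour")
  .dropDuplicates("userId", "eventTime")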
But this does not explain how the HDFSBackedStateStore actually works. I don't see it in the documentation.
You are correct that there is no such documentation available.
I had to understand the code (2.3.1) and wrote an article on how the State Store works internally in Structured Streaming. You might like to have a look: https://www.linkedin.com/pulse/state-management-spark-structured-streaming-chandan-prakash/

Spark Ingestion path: "Source to Driver to Worker" or "Source to Workers"

When Spark ingests data, is there a specific situation where it has to go through the driver and then from the driver to the workers? The same question applies to a direct read by the workers.
I guess I am simply trying to map out the conditions or situations that lead to one path or the other, and how partitioning happens in each case.
If you limit yourself to built-in methods, then unless you create a distributed data structure from a local one with a method like:
SparkSession.createDataset
SparkContext.parallelize
data is always accessed directly by the workers, though the details of the data distribution will vary from source to source.
RDDs typically depend on Hadoop input formats, but Spark SQL and the data source API are at least partially independent, at least when it comes to configuration.
It doesn't mean data is always properly distributed, though. In some cases (JDBC, streaming receivers) data may still be piped through a single node.
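A sketch contrasting the two paths (the file path is a placeholder):

// Driver path: the local collection lives on the driver and is shipped
// to the executors when the RDD is materialized.
val fromDriver = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)

// Worker path: each executor task reads its own split of the files
// directly from the source.
val fromSource = spark.read.parquet("hdfs:///data/events")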

Apache Spark node asking master for more data?

I'm trying to benchmark a few approaches to putting an image processing algorithm into Apache Spark. For one step in this algorithm, a computation on a pixel in the image will depend on an unknown amount of surrounding data, so we can't partition the image with guaranteed sufficient overlap a priori.
One solution to that problem I need to benchmark is for a worker node to ask the master node for more data when it encounters a pixel with insufficient surrounding data. I'm not convinced this is the way to do things, but I need to benchmark it anyway, because of reasons.
Unfortunately, after a bunch of googling and reading docs, I can't find any way for a processingFunc called as part of sc.parallelize(partitions).map(processingFunc) to query the master node for more data from a different partition mid-computation.
Does a way for a worker node to ask the master for more data exist in Spark, or will I need to hack something together that kind of goes around Spark?
The master node in Spark is for allocating resources to a particular job; once the resources are allocated, the driver ships the complete code with all its dependencies to the various executors.
The first step in every job is to load the data into the Spark cluster. You can read the data from any underlying data repository, like a database, filesystem, web services, etc.
Once data is loaded it is wrapped into an RDD, which is partitioned across the nodes in the cluster and stored in the workers'/executors' memory. You can control the number of partitions by leveraging various RDD APIs, but you should do so only when you have valid reasons to.
All operations are then performed over RDDs using the various methods/operations exposed by the RDD API. The RDD keeps track of partitions and partitioned data, and depending upon the need or request it automatically queries the appropriate partition.
In a nutshell, you do not have to worry about the way data is partitioned by the RDD, which partition stores which data, or how partitions communicate with each other; but if you do care, you can write your own custom partitioner, instructing Spark how to partition your data (see the sketch just below).
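A minimal custom-partitioner sketch under assumed types (an Int tile id as the key); it is purely illustrative, not the actual image pipeline:

import org.apache.spark.Partitioner

// Routes records to partitions by a caller-defined rule on the key.
class TilePartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = {
    val tileId = key.asInstanceOf[Int]  // hypothetical image-tile id
    math.abs(tileId.hashCode) % numPartitions
  }
}

// pairs is assumed to be an RDD[(Int, TileData)]:
// val repartitioned = pairs.partitionBy(new TilePartitioner(16))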
Secondly, if your data cannot be partitioned, then I do not think Spark would be an ideal choice, because that would result in processing everything on one single machine, which is itself contrary to the idea of distributed computing.
I am not sure what exactly your use case is, but there are people who have been leveraging Spark for image processing; see here for the comments from Databricks.
