Reliable checkpoint (keeping complex state) for Spark Streaming jobs

We are using Spark 1.6 with JVM 1.6 on Red Hat 4.4.7 to run our Spark Streaming applications/jobs. Some of our streaming jobs use complex state, which we represent with Scala case classes. While testing the upgrade cycle of the jobs we ran into the issues described below. Since streaming jobs run forever, we need help designing an application that is easy to upgrade.
I tried to pin down the exact case in which the job is not able to restart from the checkpoint:
Restarting the job without changing anything did not cause the problem.
Restarting the job after making a random change (unrelated to state) did not cause the problem.
Restarting the job after changing the state-handling function (e.g. by adding a print) did not cause the problem.
Restarting the job after changing the state class (by adding a new Boolean field) did cause the problem.
After some googling, the general guidelines for handling the issue seem to be:
Implement the state as "a format that stores the schema along with the data", such as JSON or Avro (a small sketch follows this list).
The client code has to serialize the state before writing it and deserialize it after reading it back. The serialization and deserialization happen on every streaming interval; mapWithState might help slightly.
Upgrading the state from version x to version y has to be handled explicitly if multiple versions of the job can co-exist!
Stop the input, finish processing the pending input, and restart as a fresh job with a new checkpoint.
Though this is simple to implement, it is not possible for a couple of our jobs, and it makes the upgrade cycle slightly more complex.
In parallel, also save the data to an external store, and on upgrade load it as the initialRDD.
This introduces an external dependency for keeping the state.
Upgrading the state from version x to version y has to be handled explicitly if multiple versions of the job can co-exist!
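For the first option, here is a minimal sketch of what a schema-tolerant state codec might look like, assuming json4s (bundled with Spark) and a hypothetical JobState class; the checkpointed state type is then always a plain String, so adding an optional field does not change the checkpointed class:

import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization

// Hypothetical state class for illustration; the newly added field is an
// Option so that JSON written by an older job version still deserializes.
case class JobState(count: Long, lastSeen: Long, newFlag: Option[Boolean] = None)

object StateCodec {
  implicit val formats = DefaultFormats

  // Serialize before storing the state (the checkpoint only ever sees a String).
  def encode(state: JobState): String = Serialization.write(state)

  // Deserialize after reading it back; a missing "newFlag" simply becomes None.
  def decode(json: String): JobState = Serialization.read[JobState](json)
}

Whether mapWithState or updateStateByKey is used, the state held by Spark is then a String (or Array[Byte] for Avro), and upgrading from version x to y becomes a matter of evolving the decode function.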
As the information is scattered all across the web, I found it hard to reach a conclusion. My questions are:
If the structure of the state class changes, the checkpoint becomes invalid. Are there other known cases where the checkpoint becomes invalid, for example if the assembly jar or the functionality (but not the structure) of the state class changes?
What strategy are you using to enable easy upgrades of a stateful Spark Streaming job?

Consider the case of an environment upgrade (JVM/Scala/Spark/etc.). There is no guarantee a checkpoint will stay reliable forever, irrespective of the change.
The checkpoint is designed to help recovery in the unfortunate event of a fault/crash only; it is not designed to be used as a data store!
The best alternative is to periodically flush the state into a reliable store (HDFS/DB/etc.) and read it back as the initial RDD in the event of any form of upgrade.

"The best alternative is to periodically flush the data in to a reliable store (HDFS/DB/etc...) and read the same as initial RDD in the event of any form of upgrade"
How to periodically flush Spark State data into externale store? Is there an API call that can provide access to Spark StateRDD?
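One way this could be done with the DStream mapWithState API (a sketch of my own, not stated in the answer above; the key/state types and paths are illustrative): stateSnapshots() exposes the current (key, state) pairs each batch, and StateSpec.initialState feeds the last dump back in after an upgrade.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.MapWithStateDStream

// Hypothetical state class, for illustration only.
case class SessionState(count: Long)

// The mapping function used with mapWithState.
def track(key: String, value: Option[Long], state: State[SessionState]): (String, SessionState) = {
  val updated = SessionState(state.getOption().map(_.count).getOrElse(0L) + value.getOrElse(0L))
  state.update(updated)
  (key, updated)
}

// Periodically flush the full (key, state) view to a reliable store.
// In practice, prefer a schema-carrying format (e.g. the JSON codec sketched
// earlier) over Java object files so the dump survives state-class changes.
def backup(stateStream: MapWithStateDStream[String, Long, SessionState, (String, SessionState)],
           backupDir: String): Unit =
  stateStream.stateSnapshots().foreachRDD { (rdd, time) =>
    rdd.saveAsObjectFile(s"$backupDir/state-${time.milliseconds}")  // illustrative path scheme
  }

// After an upgrade, rebuild the StateSpec from the last snapshot instead of the checkpoint.
def restoredSpec(sc: SparkContext, lastSnapshotPath: String): StateSpec[String, Long, SessionState, (String, SessionState)] = {
  val initial: RDD[(String, SessionState)] = sc.objectFile[(String, SessionState)](lastSnapshotPath)
  StateSpec.function(track _).initialState(initial)
}

The snapshots are independent of the checkpoint, so the checkpoint directory can be thrown away on upgrade and the job restarted with the restored StateSpec.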

Related

Does Watermark in Update output mode clean the stored state in Spark Structured Streaming?

I am working on a Spark Streaming application, and while reading about sinks and watermarking logic I couldn't find a clear answer to this: if I use a watermark with, say, a 10-minute threshold while outputting aggregations in update output mode, will the intermediate state maintained by Spark be cleared after the 10-minute threshold has expired?
A watermark allows late-arriving data to be considered for inclusion in already computed results for a period of time, using windows. Its premise is that it tracks back to a point in time (the threshold) before which it is assumed no more late events are supposed to arrive; if they do arrive, they are discarded.
As a consequence, Spark needs to maintain the state of the windows/aggregates already computed in order to handle these potential late updates based on event time. However, this costs resources, and if done indefinitely it would blow up a Structured Streaming app.
Will the intermediate state maintained by Spark be cleared after the 10-minute threshold has expired? Yes, it will. This is by design: there is no point holding on to state that can no longer be updated once the threshold has expired.
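A minimal sketch of the scenario described above, assuming the socket source from the Structured Streaming guide and illustrative window/watermark durations: with update output mode, only changed aggregates are emitted each trigger, and state for windows older than the watermark is dropped.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().appName("watermark-demo").getOrCreate()
import spark.implicits._

// Socket source with an event-time column (columns: value, timestamp).
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost").option("port", 9999)
  .option("includeTimestamp", true)
  .load()

val counts = lines
  .withWatermark("timestamp", "10 minutes")              // late-data threshold
  .groupBy(window($"timestamp", "5 minutes"), $"value")
  .count()

// Update mode: only windows whose counts changed in this trigger are emitted,
// and the state for windows older than the watermark is cleared.
counts.writeStream
  .outputMode("update")
  .format("console")
  .start()
  .awaitTermination()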
You need to run through some simple examples, as I find it is easy to forget the subtleties of the output modes.
See
Why does streaming query with update output mode print out all rows?
which gives an excellent example of update-mode output as well. This also gives an even better update example: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Even better is this blog with some good graphics: https://towardsdatascience.com/watermarking-in-spark-structured-streaming-9e164f373e9

Can Spark automatically detect nondeterministic results and adjust failure recovery accordingly?

If nondeterministic code runs on Spark, this can cause a problem when recovery from the failure of a node is necessary, because the new output may not be exactly the same as the old output. My interpretation is that the entire job might need to be rerun in this case, because otherwise the output data could be inconsistent with itself (since different data was produced at different times). At the very least, any nodes downstream from the recovered node would probably need to be restarted from scratch, because they have processed data that may now change. That's my understanding of the situation anyway; please correct me if I am wrong.
My question is whether Spark can somehow automatically detect whether code is nondeterministic (for example by comparing the old output to the new output) and adjust its failure recovery accordingly. If this were possible, it would relieve application developers of the requirement to write deterministic code, which can sometimes be challenging and is in any case easy to forget.
No. Spark will not be able to handle non-deterministic code in case of failures. The fundamental data structure of Spark, the RDD, is not only immutable but should also be a deterministic function of its input. This is necessary because otherwise the Spark framework cannot recompute a partial RDD (a partition) in case of failure. If the recomputed partition is not deterministic, Spark would have to re-run the transformations on the full RDDs in the lineage. I don't think Spark is the right framework for non-deterministic code.
If Spark has to be used for such a use case, the application developer has to take care of keeping the output consistent by writing the code carefully. This can be done by using RDDs only (no DataFrames or Datasets) and persisting the output after every transformation that executes non-deterministic code, as sketched below. If performance is a concern, the intermediate RDDs can be persisted on Alluxio.
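A small sketch of that suggestion (the path and the random scoring step are illustrative). It uses RDD checkpointing rather than cache(), since cached partitions can still be lost and recomputed; a reliable checkpoint truncates the lineage so a recovery re-reads the materialized result instead of recomputing it differently.

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

val conf = new SparkConf().setAppName("nondeterministic-recovery")
val sc = new SparkContext(conf)
sc.setCheckpointDir("hdfs:///tmp/rdd-checkpoints")   // placeholder path

val input = sc.parallelize(1 to 1000)

// Non-deterministic step: recomputing a lost partition would give different values.
val scored = input.map(x => (x, Random.nextDouble()))

// Reliable checkpoint truncates the lineage; an action forces materialization.
scored.checkpoint()
scored.count()

// Downstream work now recomputes from the checkpointed data, not from map() again.
val filtered = scored.filter { case (_, score) => score > 0.5 }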
A longer-term approach would be to open a feature request in the Apache Spark JIRA, though I am not too optimistic about its acceptance: a small hint in the syntax indicating whether code is deterministic would let the framework switch between recovering an RDD partially or fully.
Non-deterministic results are not detected and accounted for in failure recovery (at least in Spark 2.4.1, which I'm using).
I have encountered issues with this a few times on Spark. For example, let's say I use a window function:
first_value(field_1) over (partition by field_2 order by field_3)
If field_3 is not unique, the result is non-deterministic and can differ each time the function is run. If a Spark executor dies and restarts while calculating this window function, you can actually end up with two different first_value results for the same field_2 partition.
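A small reproduction of the above in the DataFrame API (the data is made up): with ties on field_3, which row is "first" within a field_2 partition is not guaranteed, so a partial recomputation after an executor failure can legitimately pick a different row.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder().appName("window-nondeterminism").master("local[*]").getOrCreate()
import spark.implicits._

// field_3 has ties within the field_2 = 1 partition, so the ordering is ambiguous.
val df = Seq(
  ("a", 1, 10),
  ("b", 1, 10),
  ("c", 1, 20)
).toDF("field_1", "field_2", "field_3")

val w = Window.partitionBy($"field_2").orderBy($"field_3")

// first_value over an ambiguous ordering: "a" or "b" may win on different runs,
// including the re-run of a partition after an executor failure.
df.withColumn("fv", first($"field_1").over(w)).show()

Adding a unique tiebreaker column to the orderBy (or ordering by a unique key) is the usual way to make such a window deterministic.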

Spark Structured Streaming Checkpoint Compatibility

Am I safe to use Kafka and Spark Structured Streaming (SSS, >= v2.2) with checkpointing on HDFS in cases where I have to upgrade the Spark library or change the query? I'd like to continue seamlessly from the offsets left behind even in those cases.
I've found different answers when searching the net for compatibility issues in SSS's (>= 2.2) checkpoint mechanism. Maybe someone out there can shed some light on the situation, ideally backed up with facts/references or first-person experience?
In Spark's programming guide (currently v2.3) they just state "..should be a directory in an HDFS-compatible" but don't say a single word about compatibility constraints.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Databricks at least gives some hints that this is an issue at all.
https://docs.databricks.com/spark/latest/structured-streaming/production.html#recover-after-changes-in-a-streaming-query
A Cloudera blog recommends storing the offsets in ZooKeeper instead, but this actually refers to the "old" Spark Streaming implementation. Whether this applies to Structured Streaming too is unclear.
https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
A guy in this conversation claims that there is no problem in that regard anymore ... but without pointing to facts.
How to get Kafka offsets for structured query for manual and reliable offset management?
Help is highly appreciated.
Checkpoints are great when you don't need to change the code; fire-and-forget procedures are perfect use cases.
I read the Databricks post you linked; the truth is that you can't know what kind of changes you will be called on to make until you have to make them. I wonder how they can predict the future.
About the Cloudera link: yes, they are speaking about the old procedure, but with Structured Streaming, code changes still void your checkpoints.
So, in my opinion, that much automation is only good for fire-and-forget procedures.
If this is not your case, saving the Kafka offsets elsewhere is a good way to restart from where you left off last time; Kafka can hold a lot of data, and restarting from zero to avoid data loss, or accepting the idea of restarting from the latest offset, is not always acceptable.
Remember: any change to the stream logic will be ignored as long as there are checkpoints, so you can't make changes to your job once it is deployed unless you accept throwing away the checkpoints.
By throwing away the checkpoints you either force the job to reprocess the entire Kafka topic (earliest), or to start right at the end (latest), skipping unprocessed data.
It's great, is it not?
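A sketch of the "save the Kafka offsets elsewhere" idea mentioned above (the broker, topic, offsets and the println sink are all illustrative): a StreamingQueryListener records the offsets each micro-batch reaches, and startingOffsets lets a fresh query resume from them once the checkpoint is thrown away.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder().appName("offset-tracking").getOrCreate()

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // endOffset is a JSON string such as {"my-topic":{"0":42,"1":17}};
    // persist it to a DB/HDFS of your choice (println is a stand-in).
    event.progress.sources.foreach(s => println(s.endOffset))
  }
})

// After dropping the checkpoint, resume from the recorded offsets:
val resumed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "my-topic")
  .option("startingOffsets", """{"my-topic":{"0":42,"1":17}}""")
  .load()

This does not make checkpoints upgrade-proof, but it gives you a known restart point that is neither earliest nor latest.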

Lack of Data Node locality when working with S3 for HDFS - eventual consistency

Two points here after reading https://wiki.apache.org/hadoop/AmazonS3
Not sure what to make of this below.
...
eventual consistency: changes made by one application (creation, updates and deletions) will not be visible until some undefined time.
...
Some undefined time? What does that mean for writing Spark applications then? If I have n jobs, could it be that something is not yet visible?
How does Spark's default partitioning apply then for S3 data?
That Hadoop doc is a bit out of date; I'd google for "spark and object stores" to get some more up-to-date material.
The Spark documentation has some Spark-specific details.
Some undefined time? What does that mean for writing Spark applications then?
Good question. AWS never gives hard data here; the best empirical study is Benchmarking Eventual Consistency: Lessons Learned from Long-Term Experimental Studies.
That study showed that consistency delays depend on total AWS load, along with some other patterns. Because it is so variable, nobody dares give a firm value for "undefined time".
My general expectations are:
Normally, list inconsistency lasts a few seconds, but under load it can get worse.
If nothing has actually gone wrong with S3, a few minutes is enough for listing inconsistencies to be resolved.
All the S3 connectors mimic rename by listing all files under a path, then copying and deleting them, so renaming a directory immediately after processes have written to it may miss data.
Because the means by which Spark jobs commit their output to a filesystem depends on rename(), it is not safe to use S3 to commit the output of tasks.
If I have n jobs, could it be that something is not yet visible?
It's worse than that. You can't rely on the rename operations within a single job to get it right.
That's why Amazon offers a consistent EMRFS option that uses DynamoDB for listing consistency, and Hadoop 2.9+ has a feature, S3Guard, which uses DynamoDB for the same purpose. Neither deals with update inconsistency though, which is why Hadoop 3.1's "S3A committers" default to generating unique filenames for new files.
If you are using the Apache S3A connector to commit work to S3 using the normal filesystem FileOutputCommitter then, without S3Guard, you are at risk of losing data.
Don't worry about chaining work; worry about that.
BTW: I don't know what Databricks do here. Ask them for the details.
How does Spark's default partitioning apply then for S3 data?
The partitioning is based on whatever block size the object store connector makes up; for example, for the S3A connector, it is that of fs.s3a.block.size.
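A small illustration (the bucket, path and 128 MB value are made up): the S3A block size can be set through the Hadoop configuration that Spark passes to the connector, and for a splittable Hadoop input format each reported "block" becomes a candidate split, i.e. a partition.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-partitioning-demo")
  // Hadoop/S3A setting passed through Spark; 128 MB, expressed in bytes.
  .config("spark.hadoop.fs.s3a.block.size", (128L * 1024 * 1024).toString)
  .getOrCreate()

// Placeholder location; with a splittable format, splits are computed from the
// block size the connector reports, so this influences the partition count.
val rdd = spark.sparkContext.textFile("s3a://my-bucket/logs/")
println(rdd.getNumPartitions)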

Is it possible to get and use a JavaSparkContext from within a task?

I've come across a situation where I'd like to do a "lookup" within a Spark and/or Spark Streaming pipeline (in Java). The lookup is somewhat complex, but fortunately, I have some existing Spark pipelines (potentially DataFrames) that I could reuse.
For every incoming record, I'd like to potentially launch a Spark job from the task to get the necessary information to decorate it with.
Considering the performance implications, would this ever be a good idea?
Not considering the performance implications, is this even possible?
Is it possible to get and use a JavaSparkContext from within a task?
No. The Spark context is only valid on the driver, and Spark will prevent it from being serialized. Therefore it's not possible to use the Spark context from within a task.
For every incoming record, I'd like to potentially launch a Spark job from the task to get the necessary information to decorate it with.
Considering the performance implications, would this ever be a good idea?
Without more details, my umbrella answer would be: Probably not a good idea.
Not considering the performance implications, is this even possible?
Yes, probably by bringing the base collection to the driver (collect) and iterating over it. If that collection doesn't fit in the driver's memory, please see the previous point.
If you need to process every record, consider performing some form of join with the 'decorating' dataset; that will be only one large job instead of tons of small ones.
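A sketch of the join suggestion (the sources, join key and output path are placeholders), in Scala for brevity even though the question mentions Java: decorate the whole batch with one join instead of a per-record job, broadcasting the lookup side if it is small.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("decorate-by-join").getOrCreate()

val incoming = spark.read.json("incoming/")          // placeholder source
val lookup   = spark.read.parquet("lookup-table/")   // the reusable "decorating" dataset

// One join over the whole batch; broadcast() hints that the lookup side is
// small enough to ship to every executor, avoiding a shuffle.
val decorated = incoming.join(broadcast(lookup), Seq("key"), "left_outer")
decorated.write.parquet("decorated/")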
