Spark Structured Streaming Checkpoint Compatibility - apache-spark

Am I safe to use Kafka and Spark Structured Streaming (SSS) (>=v2.2) with checkpointing on HDFS in cases where I have to upgrade the Spark library or change the query? I'd like to seamlessly continue with the offsets left behind even in those cases.
I've found different answers when searching the net for compatibility issues in SSS's (>=2.2) checkpoint mechanism. Maybe someone out there can shed some light on the situation, ideally backed up with facts/references or first-hand experience?
In Spark's programming guide (current=v2.3) they just state that the checkpoint location "...should be a directory in an HDFS-compatible file system", but don't say a single word about compatibility constraints.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Databricks at least gives some hints that this is an issue at all.
https://docs.databricks.com/spark/latest/structured-streaming/production.html#recover-after-changes-in-a-streaming-query
A Cloudera blog recommends storing the offsets in ZooKeeper instead, but this actually refers to the "old" Spark Streaming implementation. Whether this applies to Structured Streaming as well is unclear.
https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
A guy in this conversation claims that there is no problem in that regard anymore ... but without pointing to facts.
How to get Kafka offsets for structured query for manual and reliable offset management?
Help is highly appreciated.

Checkpoints are great when you don't need to change the code; fire-and-forget procedures are perfect use cases.
I read the post from Databricks you linked; the truth is that you can't know what kind of changes you will be called to make until you have to make them. I wonder how they can predict the future.
About the link on Cloudera: yes, they are speaking about the old procedure, but with Structured Streaming, code changes still void your checkpoints.
So, in my opinion, that much automation is only good for fire-and-forget procedures.
If this is not your case, saving the Kafka offsets elsewhere is a good way to restart from where you left off last time (see the sketch below); remember that Kafka can hold a lot of data, and neither reprocessing everything from zero to avoid data loss nor restarting from the latest offset and skipping records is always acceptable.
Remember: any change to the stream logic will be ignored as long as there are checkpoints, so you can't make changes to your job once it is deployed unless you accept the idea of throwing away the checkpoints.
And by throwing away the checkpoints you force the job to either reprocess the entire Kafka topic (earliest) or start right at the end (latest), skipping unprocessed data.
It's great, is it not?
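To make the "save the offsets elsewhere" idea concrete, here is a minimal sketch for Structured Streaming using a StreamingQueryListener; the saveOffsets helper and its storage target are hypothetical placeholders, use whatever store you trust:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class OffsetTracker extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // For the Kafka source, endOffset is a JSON string like {"topic":{"0":42,"1":17}}
    event.progress.sources.foreach { src =>
      saveOffsets(src.description, src.endOffset)
    }
  }
  // Hypothetical helper: persist the offsets to ZooKeeper/DB/HDFS of your choice
  private def saveOffsets(source: String, offsetsJson: String): Unit =
    println(s"$source -> $offsetsJson")
}

// spark.streams.addListener(new OffsetTracker())
// On a restart without checkpoints, the saved JSON can be passed back to the
// Kafka source via the "startingOffsets" option.

Progress is reported once per micro-batch, so the stored offsets are at most one batch behind what was processed.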

Related

Can Kafka-Spark Streaming pair be used for both batch+real time data?

Hi All,
I am currently working on developing an architecture which should be able to handle both real-time and batch data (coming from disparate sources and point solutions - third-party tools). The existing architecture is old-school and uses mostly RDBMS (I am not going to go into detail on that).
What I have come up with is two different pipelines - one for batch data (sqoop/spark/hive) and the other for real-time data (kafka-spark streaming).
But I have been told to use the kafka-spark streaming pair for handling all kinds of data.
If anyone has experience working with the kafka-spark streaming pair for handling all kinds of data, could you please give me brief details on whether this could be a viable solution and better than having two different pipelines?
Thanks in advance!
What I have come up with is two different pipelines - one for batch data (sqoop/spark/hive) and the other for real-time data (kafka-spark streaming).
Pipeline 1: Sqoop is a good choice for batch load, but it will be slow because the underlying architecture is still MapReduce. There are options to run Sqoop on Spark, but I haven't tried them. Once the data is in HDFS you can use Hive, which is a great solution for batch processing. That said, you can replace Sqoop with Spark if you are worried about the RDBMS fetch time, and you can do your batch transformations in Spark as well. I would say this is a good solution.
Pipeline 2: Kafka and Spark Streaming are the most obvious choice, and a good one. But if you are using the Confluent distribution of Kafka, you could replace most of the Spark transformations with KSQL or Kafka Streams, which give you real-time transformations.
I would say it's good to have one system for batch and one for real-time; this is what the lambda architecture is. But if you are looking for a more unified framework, you can try Apache Beam, which provides a unified framework for both batch and real-time processing, and you can choose from multiple runners to execute your pipeline.
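For what it's worth, a minimal sketch of how Spark's unified DataFrame API can consume the same Kafka topic either as a batch or as a stream; the broker address and topic name are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("unified-kafka").getOrCreate()

// Batch: read whatever is currently in the topic and process it like any other DataFrame
val batchDf = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", "events")                     // placeholder topic
  .load()

// Streaming: the same source, consumed continuously, with the same transformations applicable
val streamDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

Whether that makes one pipeline or two the better fit is still the architectural call discussed above.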
Hope this helps :)
Lambda architecture would be the way to go!
Hope this link gives you enough ideas:
https://dzone.com/articles/lambda-architecture-how-to-build-a-big-data-pipeli
Thanks much.

Lack of Data Node locality when working with S3 for HDFS - eventual consistency

Two points here after reading https://wiki.apache.org/hadoop/AmazonS3
Not sure what to make of this below.
...
eventual consistency: changes made by one application (creation, updates and deletions) will not be visible until some undefined time.
...
Some undefined time? What does that mean for writing Spark applications then? If I have n jobs, does that mean something may not yet be visible?
How does the SPARK default partitioning apply then for S3 data?
That Hadoop doc is a bit out of date; I'd google for "spark and object stores" to get more up-to-date material.
The spark documentation has some spark-specific details.
Some undefined time? What does that mean for writing Spark applications then?
Good question. AWS never gives hard data here; the best empirical study is Benchmarking Eventual Consistency: Lessons Learned from Long-Term Experimental Studies.
That study showed that consistency delays depend on total AWS load and follow some other patterns. Because it's so variable, nobody dares give a firm value for "undefined time".
My general expectations are:
Normally, list inconsistency can take a few seconds, but under load it can get worse.
If nothing has actually gone wrong with S3, then a few minutes is enough for listing inconsistencies to be resolved.
All the S3 connectors mimic rename by listing all files under a path, then copying and deleting them, so renaming a directory immediately after processes have written to it may miss data.
Because the means by which Spark jobs commit their output to a filesystem depends on rename(), it is not safe to use them to commit the output of tasks.
If I have n jobs, does that mean something may not yet be visible?
It's worse than that. You can't rely on the rename operations within a single job to get it right.
It's why Amazon offers a consistent EMRFS option using DynamoDB for listing consistency, and Hadoop 2.9+ has a feature, S3Guard, which uses DynamoDB for that same operation. Neither deals with update inconsistency though, which is why Hadoop 3.1's "S3A committers" default to generating unique filenames for new files.
If you are using the Apache S3A connector to commit work to S3 using the normal filesystem FileOutputCommitter then, without S3Guard, you are at risk of losing data.
Don't worry about chaining work; worry about that.
BTW: I don't know what Databricks do here. Ask them for the details.
How does the SPARK default partitioning apply then for S3 data?
The partitioning is based on whatever block size the object store connector makes up. For example, for the S3A connector, it's that of fs.s3a.blocksize.
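To illustrate, a small sketch of how that connector-reported block size feeds into the default partitioning of an RDD read; the property spelling here follows recent Hadoop releases (fs.s3a.block.size), so treat the exact name as an assumption and check your Hadoop version, and the bucket path is a placeholder:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-partitioning")
  // The S3A connector reports this as the "block size" of objects in S3,
  // and Hadoop's input formats use it when computing splits.
  .config("spark.hadoop.fs.s3a.block.size", (128 * 1024 * 1024).toString)
  .getOrCreate()

// Each file ends up split into roughly ceil(fileSize / blockSize) partitions.
val rdd = spark.sparkContext.textFile("s3a://my-bucket/logs/")
println(rdd.getNumPartitions)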

Spark Streaming - TIMESTAMP field based processing

I'm pretty new to spark streaming and I need some basic clarification that I couldn't fully understand reading the documentation.
The use case is that I have a set of files containing dumped EVENTS, and each event already has a TIMESTAMP field inside.
At the moment I'm loading these files and extracting all the events into a JavaRDD, and I would like to pass them to Spark Streaming in order to collect some stats based on the TIMESTAMP (a sort of replay).
My question is whether it is possible to process these events using the EVENT TIMESTAMP as the temporal reference instead of the actual time of the machine (sorry for the silly question).
In case it is possible, will I simply need Spark Streaming, or do I need to switch to Structured Streaming?
I found a similar question here:
Aggregate data based on timestamp in JavaDStream of spark streaming
Thanks in advance
TL;DR
yes you could use either Spark Streaming or Structured Streaming, but I wouldn't if I were you.
Detailed answer
Sorry, no simple answer to this one. Spark Streaming might be better for the per-event processing if you need to individually examine each event. Structured Streaming will be a nicer way to perform aggregations and any processing where per-event work isn't necessary.
However, there is a whole bunch of complexity in your requirements, and how much of that complexity you address depends on the cost of inaccuracy in the streaming job's output.
Spark Streaming makes no guarantee that events will be processed in any kind of order. To impose ordering, you will need to set up a window in which to do your processing that minimises the risk of out-of-order processing to an acceptable level. You will need to use a big enough window of data to accurately capture your temporal ordering.
You'll need to give these points some thought:
If a batch fails and is retried, how will that affect your counters?
If events arrive late, will you ignore them, re-process the whole affected window, or update the output? If the latter, how can you guarantee the update is done safely?
Will you minimise risk of corruption by keeping hold of a large window of events, or accept any inaccuracies that may arise from a smaller window?
Will the partitioning of events cause complexity in the order that they are processed?
My opinion is that, unless you have relaxed constraints over accuracy, Spark is not the right tool for the job.
I hope that helps in some way.
It is easy to do aggregations based on event time with Spark SQL (in either batch or Structured Streaming). You just need to group by a time window over your timestamp column. For example, the following will bucket your data into 1-minute intervals and give you the count for each bucket.
import org.apache.spark.sql.functions.window  // plus import spark.implicits._ for the $"..." syntax

df.groupBy(window($"timestamp", "1 minute") as "time")
  .count()
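And if late events matter (the point raised above), Structured Streaming's withWatermark bounds how late an event may arrive and still update its window; a sketch assuming a streaming DataFrame named events with a timestamp column:

val counts = events
  .withWatermark("timestamp", "10 minutes")              // tolerate events up to 10 minutes late
  .groupBy(window($"timestamp", "1 minute") as "time")
  .count()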

Reliable checkpoint (keeping complex state) for spark streaming jobs

We are using Spark 1.6 with JVM 1.6 on Red Hat 4.4.7 for running our Spark Streaming applications/jobs. Some of our streaming jobs use complex state, and we have Scala case classes to represent it. But while testing the upgrade cycle of the jobs we are facing some issues, described below. As streaming jobs run forever, we need help in designing an application that is easy to upgrade.
I was checking the exact use case where the job is not able to restart from the checkpoint.
Just restarting the job without changing anything did not cause the problem.
Restarting the job after making a random change (unrelated to state) did not cause the problem.
Restarting the job after making a change to the state handling function (such as adding a print) did not cause the problem.
Restarting the job after making a change to the state itself (by adding a new Boolean field) did cause the problem.
After doing some googling, the general guidelines for handling the issue seem to be:
1. Implement the state in a format that stores the schema along with the data, such as JSON or Avro (a minimal sketch follows below).
The client code will have to serialize the object before putting it into the state and de-serialize it after reading it from the state. The serialization and de-serialization happen every streaming interval; mapWithState might help slightly.
Upgrading the state from version x to y has to be handled explicitly if multiple versions of the job can co-exist!!!
2. Stop the input, finish processing the pending input, and re-start as a fresh job with a new checkpoint.
Though this is simple to implement, it is not possible for a couple of our jobs. Also, the upgrade cycle becomes slightly more complex.
3. In parallel, save the data to an external store as well, and on an upgrade load it as the initial RDD.
This introduces an external dependency for keeping the state.
Upgrading the state from version x to y has to be handled explicitly if multiple versions of the job can co-exist!!!
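A minimal sketch of option 1: keep the state as a self-describing JSON string and do the (de)serialization in client code. It uses json4s, which ships with Spark (or can be added as a dependency); the SessionState fields and the version handling are hypothetical:

import org.json4s.DefaultFormats
import org.json4s.jackson.JsonMethods.parse
import org.json4s.jackson.Serialization

implicit val formats = DefaultFormats

// Hypothetical state class; "flag" is the field added in version 2
case class SessionState(version: Int, count: Long, flag: Boolean)

// The Spark state then only ever holds a plain String, so evolving the case class
// does not invalidate previously stored state records
def serialize(s: SessionState): String = Serialization.write(s)

def deserialize(json: String): SessionState = {
  val j = parse(json)
  SessionState(
    version = (j \ "version").extract[Int],
    count   = (j \ "count").extract[Long],
    flag    = (j \ "flag").extractOrElse(false)   // explicit x -> y upgrade for records written by the old job
  )
}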
As the information is scattered all across the web, I found it confusing to come to a conclusion. Below are my questions:
If the structure of the state class changes, the checkpoint becomes invalid; but is there any other known issue where the checkpoint becomes invalid, for example if the assembly jar or the functionality (not the structure) of the state class changes?
What is the strategy you are using for enabling easy upgrades of stateful Spark Streaming jobs?
Consider the case of an environment upgrade (JVM/Scala/Spark/etc.). There is no guarantee the checkpoint will remain reliable forever, irrespective of any change.
The checkpoint is designed to help in recovering from the unfortunate event of a fault/crash only, and is not designed to be used as a data store!!!
The best alternative is to periodically flush the data into a reliable store (HDFS/DB/etc.) and read it back as the initial RDD in the event of any form of upgrade.
"The best alternative is to periodically flush the data into a reliable store (HDFS/DB/etc.) and read it back as the initial RDD in the event of any form of upgrade"
How do you periodically flush Spark state data into an external store? Is there an API call that provides access to the Spark state RDD?
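One approach with the DStream API is to let mapWithState expose the state every batch via stateSnapshots() and to seed a new job with StateSpec.initialState(). A rough sketch, assuming a DStream[(String, Long)] called keyedStream, an existing StreamingContext ssc, a hypothetical SessionState case class, and placeholder HDFS paths:

import org.apache.spark.streaming.{State, StateSpec}

case class SessionState(count: Long)

// Hypothetical update function: accumulate a count per key
def update(key: String, value: Option[Long], state: State[SessionState]): Option[(String, SessionState)] = {
  val next = SessionState(state.getOption().map(_.count).getOrElse(0L) + value.getOrElse(0L))
  state.update(next)
  Some((key, next))
}

// Seed the upgraded job from the last snapshot written by the previous version (placeholder path)
val initial = ssc.sparkContext
  .textFile("hdfs:///state/snapshots/latest")
  .map { line => val Array(k, c) = line.split("\t"); (k, SessionState(c.toLong)) }

val spec = StateSpec.function(update _).initialState(initial)
val stateful = keyedStream.mapWithState(spec)

// Periodically persist the full key/state pairs so a future upgrade can start from them;
// written as text here, in the spirit of the "schema along with data" advice above
stateful.stateSnapshots()
  .map { case (k, v) => s"$k\t${v.count}" }
  .foreachRDD { rdd =>
    rdd.saveAsTextFile(s"hdfs:///state/snapshots/${System.currentTimeMillis}")
  }

In practice you would throttle how often you write and manage which snapshot counts as "latest" yourself.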

Parallelism of Streams in Spark Streaming Context

I have multiple input sources (~200) coming in on Kafka topics - the data for each is similar, but each must be run separately because there are differences in schemas - and we need to perform aggregate health checks on the feeds (so we can't throw them all into one topic in a simple way without creating more work downstream). I've created a Spark app with a Spark streaming context, and everything seems to be working, except that it only runs the streams sequentially. There are certain bottlenecks in each stream which make this very inefficient, and I would like all streams to run at the same time - is this possible? I haven't been able to find a simple way to do this. I've seen the concurrentJobs parameter, but that didn't work as desired. Any design suggestions are also welcome if there is not an easy technical solution.
Thanks
The answer was here:
https://spark.apache.org/docs/1.3.1/job-scheduling.html
with the fairscheduler.xml file.
By default it is FIFO... it only worked for me once I explicitly wrote the file (couldn't set it programmatically for some reason).
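For reference, a rough sketch of the knobs involved, with property names from the Spark job-scheduling docs; the pool name and file path are placeholders, and as noted above the XML file itself had to exist on disk:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("parallel-streams")
  .set("spark.scheduler.mode", "FAIR")
  // fairscheduler.xml defines the pools, e.g.
  // <allocations><pool name="streams"><schedulingMode>FAIR</schedulingMode></pool></allocations>
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

// Jobs submitted from a given thread go to the pool set as a local property on the SparkContext:
// sc.setLocalProperty("spark.scheduler.pool", "streams")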
