Avoiding re-processing of data during Spark Structured Streaming application updates - apache-spark

I am using Structured Streaming with Spark 2.2. We are using Kafka as our source and are using checkpoints for failure recovery and end-to-end exactly-once guarantees. I would like to get more information on how to handle updates to the application when there is a change in stateful operations and/or output schema.
As some sources suggest, I can start the updated application in parallel with the old one until it catches up on the data, and then kill the old one. But then the new application has to re-read/re-process all the data in Kafka, which could take a long time.
I want to avoid this re-processing of the data in the newly deployed updated application.
One way I can think of is for the application to keep writing the offsets to a location outside the checkpoint directory, for example to ZooKeeper/HDFS. Then, when the application is updated, I tell the Kafka readStream() to start reading from the offsets stored in this new location (ZooKeeper/HDFS), since the updated application can't read from the checkpoint directory, which is now deemed incompatible.
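To make the idea concrete, here is a minimal, spark-shell-style sketch of that approach, assuming the spark-sql-kafka-0-10 source; the OffsetStore helper, its file path, the broker address, and the topic name are placeholders, not part of any Spark API:

```scala
// Sketch only: persist the Kafka offsets outside the checkpoint directory so an
// updated application (with a fresh, incompatible checkpoint) can resume from them.
// OffsetStore and its file path are placeholders; in practice ZooKeeper or HDFS
// would be used. The Kafka option names are from the spark-sql-kafka-0-10 source.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

object OffsetStore {
  private val path = java.nio.file.Paths.get("/tmp/kafka-offsets.json")  // placeholder
  def save(json: String): Unit = java.nio.file.Files.write(path, json.getBytes("UTF-8"))
  def load(): Option[String] =
    if (java.nio.file.Files.exists(path))
      Some(new String(java.nio.file.Files.readAllBytes(path), "UTF-8"))
    else None
}

val spark = SparkSession.builder.appName("kafka-offset-tracking").getOrCreate()

// Record the offsets reached by every completed micro-batch.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(e: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(e: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(e: QueryProgressEvent): Unit =
    e.progress.sources.headOption.flatMap(s => Option(s.endOffset)).foreach(OffsetStore.save)
})

// On an upgrade (new checkpoint directory), start from the externally stored
// offsets; otherwise fall back to "latest".
val startingOffsets = OffsetStore.load().getOrElse("latest")

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")        // placeholder broker
  .option("subscribe", "events")                           // placeholder topic
  .option("startingOffsets", startingOffsets)              // e.g. {"events":{"0":1234}}
  .load()
```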
So a couple of questions:
Is the solution described above a valid one?
If yes, how can I automate the detection of whether the application is being restarted because of a failure/maintenance, or because of code changes to stateful operations and/or output schema?
Any guidance, example or information source is appreciated.

Related

How do I monitor progress and recover in a long-running Spark map job?

We're using Spark to run an ETL process in which data gets loaded from a massive (500+ GB) MySQL database, converted into aggregated JSON files, and then written out to Amazon S3.
My question is two-fold:
This job could take a long time to run, and it would be nice to know how the mapping is going. I know Spark has a built-in log manager. Is it as simple as putting a log statement inside each map? I'd like to know when each record gets mapped.
Suppose this massive job fails in the middle (maybe it chokes on a DB record or the MySQL connection drops). Is there an easy way to recover from this in Spark? I've heard that caching/checkpointing can potentially solve this, but I'm not sure how.
Thanks!
These seem like two questions, each with many possible answers and details. Anyway, assuming a non-Spark-Streaming answer is wanted, and referencing other sources based on my own reading/research, a limited response:
The following options exist for logging and checking the progress of stages, tasks, and jobs:
Global logging via log4j, tailored by using the log4j.properties.template file stored under the SPARK_HOME/conf folder, which serves as a basis for defining your own logging requirements at the Spark level.
Programmatically by using Logger, via import org.apache.log4j.{Level, Logger} (see the sketch after this list).
The REST API to get the status of Spark jobs. See this enlightening blog: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
There is also a SparkListener that can be used (also shown in the sketch below).
The Web UI at http://<master-host>:8080 to see progress.
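A small sketch combining the programmatic-logging and SparkListener options above: a per-partition log statement via log4j's Logger plus a minimal listener that reports task and stage completion. The logger name and the toy dataset are illustrative only:

```scala
// Sketch only: programmatic log4j logging plus a minimal SparkListener.
// Logging once per partition is usually enough to follow progress without
// drowning the executors in per-record log lines.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("progress-logging").getOrCreate()
val sc = spark.sparkContext

// Driver-side progress reporting: called as tasks and stages complete.
sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    Option(taskEnd.taskMetrics).foreach { m =>
      Logger.getLogger("progress").info(
        s"Task ended in stage ${taskEnd.stageId}, records read: ${m.inputMetrics.recordsRead}")
    }
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
    Logger.getLogger("progress").info(s"Stage ${stage.stageInfo.stageId} completed")
})

// Executor-side logging inside the map itself, once per partition.
val mapped = sc.parallelize(1 to 1000000, numSlices = 8).mapPartitionsWithIndex { (idx, it) =>
  val log = Logger.getLogger("progress")
  log.setLevel(Level.INFO)
  log.info(s"Mapping partition $idx")
  it.map(_ * 2)
}
mapped.count()
```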
It depends on the type of failure: graceful vs. non-graceful, fault-tolerance aspects vs. memory-usage issues, and things like serious database duplicate-key errors, depending on the API used.
See How does Apache Spark handle system failure when deployed in YARN? Spark handles its own failures by looking at the DAG and attempting to reconstruct a partition by re-executing what is needed. This all falls under fault tolerance, for which nothing needs to be done.
Things outside of Spark's domain and control mean it's over, e.g. memory issues that result from exceeding various parameters in large-scale computations, a DataFrame JDBC write against a store with a duplicate-key error, or JDBC connection outages. These mean re-execution.
As an aside, some aspects are not logged as failures even though they are, e.g. duplicate key inserts on some Hadoop Storage Managers.

How to upgrade or restart Spark streaming application without state loss?

I'm using updateStateByKey() in my application code, and I want to save state even if I restart this application.
This could be done by saving the state somewhere every batch, but doing that may take a lot of time.
So, I wonder if there is a solution that can store the state only when the application is stopped.
Or is there another solution to upgrade application code without losing the current state?
Currently, as of Spark 2.1.0, there isn't a solution which makes this work out of the box; you have to store the data yourself if you want to upgrade. One possibility would be to not use updateStateByKey or mapWithState at all and instead store the state somewhere external, such as in a key-value store.
Spark 2.2 is going to bring a new HDFS-based state store, but I haven't had a chance to look at it to see whether it overcomes the weaknesses of the current checkpointing implementation.
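A minimal sketch of that external-state idea, assuming a DStream word count. ExternalCounterStore is a hypothetical stand-in for Redis, Cassandra, HBase, or similar; the in-memory stub below only behaves like a shared store in local mode, but it keeps the snippet self-contained:

```scala
// Sketch only: keep running totals outside Spark instead of in updateStateByKey,
// so they survive an application upgrade.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ExternalCounterStore {
  private val counters = scala.collection.concurrent.TrieMap.empty[String, Long]
  def incrementBy(key: String, delta: Long): Unit = synchronized {
    counters.update(key, counters.getOrElse(key, 0L) + delta)
  }
}

val conf = new SparkConf().setAppName("external-state-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)       // placeholder source

lines.flatMap(_.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)                                      // per-batch counts only
  .foreachRDD { rdd =>
    rdd.foreachPartition { iter =>
      // A real deployment would open one connection to the external store per
      // partition and batch the writes.
      iter.foreach { case (word, count) => ExternalCounterStore.incrementBy(word, count) }
    }
  }

ssc.start()
ssc.awaitTermination()
```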
There are many options for saving state during each batch. I've listed the majority of them in this answer. Since you highlight the latency this adds (going over the network, serialization etc), I wanted to point out SnappyData. SnappyData deeply integrates an in-memory database with Spark such that they share the same JVM and block manager. This eliminates the serialization step during each batch which should improve latency as you write out your state. Further, it can persist the state when your application stops, as you requested.
(disclaimer: I am an employee of SnappyData)

Process of deploying changes to Spark-Streaming to production

What is the process followed to make some changes on production in Spark-Streaming without any downtime?
If you are looking for upgrading application code, please refer to the Spark Streaming documentation.
Upgrading Application Code
If a running Spark Streaming application needs to be upgraded with new application code, then there are two possible mechanisms.
The upgraded Spark Streaming application is started and run in parallel to the existing application. Once the new one (receiving the same data as the old one) has been warmed up and is ready for prime time, the old one can be brought down. Note that this can be done for data sources that support sending the data to two destinations (i.e., the earlier and upgraded applications).
The existing application is shut down gracefully (see StreamingContext.stop(...) or JavaStreamingContext.stop(...) for graceful shutdown options), which ensures that data that has been received is completely processed before shutdown. Then the upgraded application can be started, which will start processing from the same point where the earlier application left off. Note that this can be done only with input sources that support source-side buffering (like Kafka and Flume), as data needs to be buffered while the previous application is down and the upgraded application is not yet up. Restarting from the earlier checkpoint information of the pre-upgrade code cannot be done: the checkpoint information essentially contains serialized Scala/Java/Python objects, and trying to deserialize objects with new, modified classes may lead to errors. In this case, either start the upgraded app with a different checkpoint directory, or delete the previous checkpoint directory.
https://spark.apache.org/docs/latest/streaming-programming-guide.html
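A hedged sketch of the second mechanism (graceful shutdown before redeploying), assuming a marker file on HDFS as the stop signal; the marker path and the placeholder pipeline are illustrative, while stop(stopSparkContext, stopGracefully) and awaitTerminationOrTimeout are the documented StreamingContext APIs:

```scala
// Sketch only: an operator signals a clean stop by creating a marker file (path
// assumed here), the driver notices it and calls stop(stopGracefully = true) so
// all received data is processed, and the new version is then deployed with a
// fresh checkpoint directory.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("graceful-shutdown-sketch")
val ssc = new StreamingContext(conf, Seconds(10))

val lines = ssc.socketTextStream("localhost", 9999)       // placeholder pipeline
lines.count().print()

ssc.start()

val marker = new Path("/tmp/streaming-shutdown-marker")   // hypothetical marker file
val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)

// Poll instead of blocking forever in awaitTermination(), so the marker is checked
// roughly every ten seconds.
var stopped = false
while (!stopped) {
  stopped = ssc.awaitTerminationOrTimeout(10000)          // true once the context stops
  if (!stopped && fs.exists(marker)) {
    ssc.stop(stopSparkContext = true, stopGracefully = true)
    stopped = true
  }
}
```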

How to change Spark Streaming application with checkpointing?

Please consider the following scenario:
created initial pipeline via Spark streaming
enable checkpointing
run the application for a while
stop streaming application
made tiny changes to the pipeline, e.g. the business logic remained untouched but there was some refactoring, renaming, moving of classes, etc.
restart the streaming application
get an exception because the pipeline stored in the checkpoint directory differs at the class level from the new one
What are the best practices to deal with such a scenario? How can we seamlessly upgrade streaming application with checkpointing enabled? What are the best practices for versioning of streaming applications?
tl;dr Checkpointing is for recovery situations, not for upgrades.
From the official documentation about Checkpointing:
A streaming application must operate 24/7 and hence must be resilient to failures unrelated to the application logic (e.g., system failures, JVM crashes, etc.). For this to be possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system such that it can recover from failures.
So, to answer your question about using checkpointing (which is meant for fault tolerance) while changing your application code: you should not expect it to work, since that goes against the design.
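For contrast, a short sketch of checkpointing used as designed, i.e. purely for recovery with unchanged code: the driver is rebuilt from the checkpoint directory when one exists, otherwise the creating function builds a fresh pipeline. The checkpoint path and the placeholder source are assumptions:

```scala
// Sketch only: StreamingContext.getOrCreate recovers from the checkpoint if present.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-streaming-app"   // assumed path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpoint-recovery-sketch")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // The whole pipeline must be defined inside this function. On recovery it is NOT
  // executed: the serialized DAG is deserialized from the checkpoint instead, which
  // is exactly why refactored/renamed classes break the restart.
  val lines = ssc.socketTextStream("localhost", 9999)        // placeholder source
  lines.count().print()
  ssc
}

// Recover from the checkpoint if present, otherwise create a new context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```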

Memsql Spark-Kafka Transform Failure

We have a Spark cluster running under MemSQL, with different pipelines running. The ETL setup is as below:
Extract: Spark reads messages from the Kafka cluster (using MemSQL Kafka-ZooKeeper)
Transform: we have a custom JAR deployed for this step
Load: data from the Transform stage is loaded into a columnstore
I have the following doubts:
What happens to a message polled from Kafka if the job fails in the Transform stage?
- Does MemSQL take care of loading that message again?
- Or is the data lost?
If the data gets lost, how can I solve this problem? Are there any configuration changes which need to be made for this?
As it stands, at-least-once semantics are not available in MemSQL Ops. This is on the roadmap and will be present in one of the future releases of Ops.
If you haven't yet, you should check out MemSQL 5.5 Pipelines.
http://blog.memsql.com/pipelines/
This one isn't based on Spark (and transforms are done a bit differently, so you might have to rewrite your code), but we have native Kafka streams now.
The way we get exactly-once with the native version is simple: store the offsets in the database in the same atomic transaction as the actual data. If something fails and the transaction isn't committed, the offsets won't be committed either, so we'll naturally and automatically retry that partition-offset range.
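For readers who want to apply the same idea themselves, here is a generic sketch of the pattern: commit the data and the Kafka offsets in one database transaction, so a failed batch leaves the offsets untouched and is retried from the last committed position. The table names, schema, and JDBC URL parameter are assumptions for illustration, not MemSQL Pipelines internals:

```scala
// Sketch only: the data and the Kafka offsets are committed in one database
// transaction, so either both land or neither does. REPLACE INTO is MySQL/MemSQL
// syntax; adjust for other stores.
import java.sql.DriverManager

case class Record(key: String, value: String)

def writeBatchAtomically(jdbcUrl: String,
                         topic: String,
                         partition: Int,
                         untilOffset: Long,
                         records: Seq[Record]): Unit = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    conn.setAutoCommit(false)                      // one transaction for data + offsets

    val insert = conn.prepareStatement("INSERT INTO events(k, v) VALUES (?, ?)")
    records.foreach { r =>
      insert.setString(1, r.key); insert.setString(2, r.value); insert.addBatch()
    }
    insert.executeBatch()

    val offsets = conn.prepareStatement(
      "REPLACE INTO kafka_offsets(topic, part, until_offset) VALUES (?, ?, ?)")
    offsets.setString(1, topic); offsets.setInt(2, partition); offsets.setLong(3, untilOffset)
    offsets.executeUpdate()

    conn.commit()                                  // both succeed or neither does
  } catch {
    case e: Exception => conn.rollback(); throw e
  } finally {
    conn.close()
  }
}
```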
