How to change Spark Streaming application with checkpointing? - apache-spark

Please consider the following scenario:
create an initial pipeline with Spark Streaming
enable checkpointing
run the application for a while
stop the streaming application
make tiny changes to the pipeline, e.g. the business logic remains untouched, but there is some refactoring, renaming, class moving, etc.
restart the streaming application
get an exception, because the pipeline stored in the checkpoint directory differs at the class level from the new one
What are the best practices for dealing with such a scenario? How can we seamlessly upgrade a streaming application with checkpointing enabled? What are the best practices for versioning streaming applications?

tl;dr Checkpointing is for recovery situations, not for upgrades.
From the official documentation about Checkpointing:
A streaming application must operate 24/7 and hence must be resilient to failures unrelated to the application logic (e.g., system failures, JVM crashes, etc.). For this to be possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system such that it can recover from failures.
So, to answer your question about using checkpointing (which is meant for fault tolerance) while also changing your application code: you should not expect it to work, since that is against the design.
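The mechanism behind the exception is ordinary serialization: a checkpoint records classes by name, and recovery resolves those names against the new classpath, much like JVM deserialization does. A toy Python model of that lookup (none of these names are Spark APIs; the registry stands in for the classpath):

```python
import json

# The "checkpoint" stores the *name* of a class alongside its state and
# resolves that name again at recovery time - a model of how JVM
# deserialization resolves checkpointed classes from the classpath.
CLASS_REGISTRY = {}

def checkpoint(obj):
    return json.dumps({"cls": type(obj).__name__, "state": vars(obj)})

def recover(blob):
    data = json.loads(blob)
    cls = CLASS_REGISTRY[data["cls"]]  # KeyError ~ ClassNotFoundException
    obj = cls.__new__(cls)
    obj.__dict__.update(data["state"])
    return obj

class Pipeline:
    def __init__(self, topic):
        self.topic = topic

CLASS_REGISTRY["Pipeline"] = Pipeline
blob = checkpoint(Pipeline("events"))  # run for a while, checkpoint

# Refactoring before restart: the class was renamed/moved.
CLASS_REGISTRY.clear()
CLASS_REGISTRY["EventPipeline"] = Pipeline

try:
    recover(blob)
    recovered = True
except KeyError:
    recovered = False  # recovery fails: the checkpointed name is gone

print(recovered)
```

Renaming or moving a class changes the name stored in the checkpoint's world, so recovery cannot resolve it, regardless of whether the business logic changed.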

Related

Databricks Streaming and Cluster updates?

Does anyone know if there's any documentation about how to handle Structured Streaming in Databricks and cluster maintenance updates (runtime, OS, etc)?
I would like to know more about how it works, and whether we (the users) need to handle that manually or whether there's some mechanism that handles it internally.
See recovery from failures under streaming.
For any infrastructure problem that could occur:
Hence, to make your queries fault tolerant, you must enable query checkpointing and configure Databricks jobs to restart your queries automatically after a failure.
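The "restart your queries automatically after a failure" part is just supervision around the query. A toy sketch of that loop in Python (run_query is an illustrative stand-in, not a Databricks API; with checkpointing enabled, each restart resumes from the last committed offsets rather than from scratch):

```python
import time

def run_query():
    """Stand-in for starting a streaming query and blocking until it
    stops. Here it always fails, simulating an infrastructure problem."""
    raise RuntimeError("executor lost")

def supervise(max_restarts=3, backoff_s=0.0):
    """Restart the query after failures, as a Databricks job configured
    with automatic retries would. Returns the number of failed attempts
    (the initial try plus max_restarts restarts)."""
    failures = 0
    while failures <= max_restarts:
        try:
            run_query()
            return failures
        except RuntimeError:
            failures += 1
            time.sleep(backoff_s)  # back off before restarting
    return failures

print(supervise())
```

The restart policy lives outside the query; the checkpoint is what makes each restart a resume rather than a replay.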

Avoiding re-processing of data during Spark Structured Streaming application updates

I am using Structured Streaming with Spark 2.2. We are using Kafka as our source and are using checkpoints for failure recovery and e2e exactly once guarantees. I would like to get some more information on how to handle updates to the application when there is a change in stateful operations and/or output schema.
As some sources suggest, I can run the updated application in parallel with the old one until it catches up in terms of data, and then kill the old one. But then the new application will have to re-read/re-process all the data in Kafka, which could take a long time.
I want to avoid this re-processing of the data in the newly deployed updated application.
One way I can think of is for the application to keep writing the offsets somewhere in addition to the checkpoint directory, for example in ZooKeeper/HDFS. Then, on an update of the application, I tell the Kafka readStream() to start reading from the offsets stored in this new location (ZooKeeper/HDFS), since the updated application can't read from the checkpoint directory, which is now deemed incompatible.
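The hand-off described above maps onto the `startingOffsets` option of Spark's Kafka source, which accepts a JSON string of per-partition offsets. A small sketch of building that string from externally stored offsets (the topic and offset values are illustrative):

```python
import json

def to_starting_offsets(offsets):
    """Render committed offsets in the JSON shape Spark's Kafka source
    documents for the `startingOffsets` option, e.g.
    {"events": {"0": 42, "1": 17}}. Partition keys must be strings."""
    return json.dumps(
        {topic: {str(p): o for p, o in parts.items()}
         for topic, parts in offsets.items()},
        sort_keys=True,
    )

# Offsets your application recorded externally (e.g. in ZooKeeper/HDFS)
# for the last fully committed batch.
committed = {"events": {0: 42, 1: 17}}
opt = to_starting_offsets(committed)
print(opt)

# The updated application would then pass this instead of relying on the
# (now incompatible) checkpoint directory:
#   spark.readStream.format("kafka")
#        .option("startingOffsets", opt) ...
```

Note that `startingOffsets` only applies when a query starts without a checkpoint, which is exactly the upgrade case here.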
So a couple of questions:
Is the above-stated solution a valid solution?
If yes, how can I automate detecting whether the application is being restarted because of a failure/maintenance or because of code changes to stateful operations and/or the output schema?
Any guidance, example or information source is appreciated.
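For the detection question, one pragmatic approach is to store a fingerprint of the upgrade-relevant parts of the application (version, output schema, stateful-operation signature) next to the checkpoint and compare it on startup. A sketch under those assumptions (all names and schemas are illustrative):

```python
import hashlib
import pathlib
import tempfile

def fingerprint(*parts: str) -> str:
    """Hash the upgrade-relevant facts, e.g. app version + schema DDL."""
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

def restart_kind(marker: pathlib.Path, app_fingerprint: str) -> str:
    """Decide why we are starting: if the fingerprint stored alongside
    the checkpoint matches the running code, this is recovery after a
    failure; if it differs, this is an upgrade and the externally stored
    offsets, not the checkpoint, should drive the restart."""
    previous = marker.read_text() if marker.exists() else None
    marker.write_text(app_fingerprint)
    if previous is None:
        return "first-run"
    return "recovery" if previous == app_fingerprint else "upgrade"

marker = pathlib.Path(tempfile.mkdtemp()) / "app.fingerprint"
v1 = fingerprint("1.0", "key STRING, count LONG")
v2 = fingerprint("1.1", "key STRING, count LONG, ts TIMESTAMP")
print(restart_kind(marker, v1))  # first-run
print(restart_kind(marker, v1))  # recovery
print(restart_kind(marker, v2))  # upgrade
```

In a real deployment the marker would live in the same durable store as the offsets (HDFS/ZooKeeper), so the decision survives node loss.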

How to ensure that DAG is not recomputed after the driver is restarted?

How can I ensure that an entire Spark DAG is highly available, i.e. not recomputed from scratch when the driver is restarted (default HA in YARN cluster mode)?
Currently, I use spark to orchestrate multiple smaller jobs i.e.
read table1
hash some columns
write to HDFS
this is performed for multiple tables.
Now, when the driver is restarted while working on the second table, the first one is reprocessed, even though it has already been stored successfully.
I believe that the default mechanism of checkpointing (the raw input values) would not make sense.
What would be a good solution here?
Is it possible to checkpoint the (small) configuration information and only reprocess what has not already been computed?
TL;DR Spark is not a task-orchestration tool. While it has a built-in scheduler and some fault-tolerance mechanisms, it is about as suitable for granular task management as it is for, say, server orchestration (hey, we can call pipe on each machine to execute bash scripts, right?).
If you want granular recovery choose a minimal unit of computation that makes sense for a given process (read, hash, write looks like a good choice, based on the description), make it an application and use external orchestration to submit the jobs.
You can build a poor man's alternative by checking whether the expected output exists and skipping that part of the job if it does, but really, don't - we have a variety of battle-tested tools which can do a far better job than this.
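For completeness, the poor man's variant amounts to testing for the `_SUCCESS` marker that Hadoop-style output committers leave in each completed output directory, and skipping finished units. A toy sketch (local paths stand in for HDFS; the table names are illustrative):

```python
import pathlib
import tempfile

def process_table(name: str, out_dir: pathlib.Path) -> bool:
    """Stand-in for one table's read -> hash -> write job. Returns True
    if work was done, False if a previous (possibly pre-restart) run
    already produced the output, detected via the _SUCCESS marker."""
    done_marker = out_dir / name / "_SUCCESS"
    if done_marker.exists():
        return False  # already computed before the driver restarted
    done_marker.parent.mkdir(parents=True, exist_ok=True)
    # ... read the table, hash the columns, write the result ...
    done_marker.touch()  # committed: mark the unit as done
    return True

out = pathlib.Path(tempfile.mkdtemp())
first = [process_table(t, out) for t in ("table1", "table2")]
# Driver "restarts": only the missing output is recomputed.
second = [process_table(t, out) for t in ("table1", "table2", "table3")]
print(first, second)  # [True, True] [False, False, True]
```

An external orchestrator (Airflow, Oozie, etc.) gives you the same skip-completed semantics plus retries, alerting, and backfills, which is why the answer recommends it instead.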
As a side note, Spark doesn't provide HA for the driver, only supervision with automatic restarts. Also, independent jobs (read -> transform -> write) create independent DAGs - there is no global DAG, and a proper checkpoint of the application would require a full snapshot of its state (like good old BLCR).
when the driver is restarted (default HA in yarn cluster mode).
When the driver of a Spark application is gone, your Spark application is gone and so are all the cached datasets. That's by default.
You have to use some sort of caching solution like https://www.alluxio.org/ or https://ignite.apache.org/. Both work with Spark, and both claim to offer data that outlives a Spark application.
There have been times when people used Spark Job Server to share data across Spark applications (which is similar to surviving Spark driver restarts).

How to upgrade or restart Spark streaming application without state loss?

I'm using updateStateByKey() in my application code, and I want to save state even if I restart this application.
This can be done by saving state into somewhere every batch, but doing that may take a lot of time.
So, I wonder if there is a solution that can store state when the application is stopped.
Or is there another solution to upgrade application code without losing the current state?
Currently, as of Spark 2.1.0, there isn't a solution which makes this work out of the box; you have to store the data yourself if you want to upgrade. One possibility would be not using updateStateByKey or mapWithState and instead storing the state somewhere external, such as in a key-value store.
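A minimal sketch of that alternative: keep the running state in an external key-value store, keyed the same way updateStateByKey would key it, so it survives an application upgrade (here a dict stands in for Redis/HBase/etc.; the batches are illustrative):

```python
# The external key-value store. Because it lives outside Spark rather
# than inside the checkpoint, an upgraded application can keep using it.
kv_store = {}

def update_counts(batch):
    """Per-batch equivalent of updateStateByKey for running counts:
    merge this micro-batch's (key, value) pairs into external state."""
    for key, value in batch:
        kv_store[key] = kv_store.get(key, 0) + value

# Two micro-batches processed by the "old" application version.
update_counts([("a", 1), ("b", 2)])
update_counts([("a", 3)])

# "Upgrade": new code, same external store - no state is lost.
update_counts([("b", 1), ("c", 5)])
print(kv_store)  # {'a': 4, 'b': 3, 'c': 5}
```

The trade-off the question anticipates is real: every batch now pays a network round-trip to the store, which is the latency concern the SnappyData answer below addresses.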
Spark 2.2 is going to bring a new stateful store based on HDFS, but I haven't had a chance to look at it to see if it overcomes the weakness of the current checkpointing implementation.
There are many options for saving state during each batch. I've listed the majority of them in this answer. Since you highlight the latency this adds (going over the network, serialization etc), I wanted to point out SnappyData. SnappyData deeply integrates an in-memory database with Spark such that they share the same JVM and block manager. This eliminates the serialization step during each batch which should improve latency as you write out your state. Further, it can persist the state when your application stops, as you requested.
(disclaimer: I am an employee of SnappyData)

Drawbacks of using embedded Spark in Application

I have a use case wherein I launch a local (embedded) Spark inside an application server rather than going for a Spark REST job server or kernel, because the former (embedded Spark) has much lower latency than the others. I am interested in
Drawbacks of this approach if there are any.
Can same be used in production?
P.S. Low latency is priority here.
EDIT: The size of the data being processed will, in most cases, be less than 100 MB.
I don't think it is a drawback at all. If you look at the implementation of the Hive Thriftserver within the Spark project itself, they also manage the SQLContext etc. in the Hive server process. This works especially well if the amount of data is small and the driver can handle it easily. So I would see this as a hint that it is okay for production use.
But I totally agree: the documentation and general advice on how to integrate Spark into interactive, customer-facing applications is lagging behind the information available for big-data pipelines.