Process of deploying changes to Spark-Streaming to production - apache-spark

What is the process for deploying changes to a Spark Streaming application in production without any downtime?

If you are looking for Upgrading Application Code, please refer to the Spark Streaming documentation.
Upgrading Application Code
If a running Spark Streaming application needs to be upgraded with new application code, then there are two possible mechanisms.
1. The upgraded Spark Streaming application is started and run in parallel to the existing application. Once the new one (receiving the same data as the old one) has been warmed up and is ready for prime time, the old one can be brought down. Note that this can be done for data sources that support sending the data to two destinations (i.e., the earlier and upgraded applications).
2. The existing application is shut down gracefully (see StreamingContext.stop(...) or JavaStreamingContext.stop(...) for graceful shutdown options), which ensures that data that has been received is completely processed before shutdown. Then the upgraded application can be started, which will start processing from the same point where the earlier application left off. Note that this can be done only with input sources that support source-side buffering (like Kafka and Flume), as data needs to be buffered while the previous application is down and the upgraded application is not yet up. Also, restarting from the earlier checkpoint information of pre-upgrade code cannot be done. The checkpoint information essentially contains serialized Scala/Java/Python objects, and trying to deserialize objects with new, modified classes may lead to errors. In this case, either start the upgraded app with a different checkpoint directory, or delete the previous checkpoint directory.
https://spark.apache.org/docs/latest/streaming-programming-guide.html
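As a concrete illustration of the second mechanism, the graceful shutdown is often triggered from outside the job, for example by polling for a marker file. The sketch below is a minimal example of that pattern, not part of the documentation; the marker-file path, app name and batch interval are placeholder choices.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulShutdownSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("upgradeable-streaming-app")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // ... define the streaming pipeline here ...

    ssc.start()

    // Instead of blocking on awaitTermination(), poll for an external
    // shutdown marker (hypothetical path) and stop gracefully so that all
    // received data is processed before the old version goes down.
    val markerPath = new Path("/tmp/streaming-app/shutdown-marker")
    val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
    var stopped = false
    while (!stopped) {
      stopped = ssc.awaitTerminationOrTimeout(10000)  // true if already stopped
      if (!stopped && fs.exists(markerPath)) {
        ssc.stop(stopSparkContext = true, stopGracefully = true)
        stopped = true
      }
    }
  }
}
```

Once the old application has drained and exited, the upgraded jar can be started against the buffered source (Kafka/Flume) with a fresh checkpoint directory.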

Related

Updating jar job on databricks

I have a shared cluster which is used by several jobs on Databricks.
The updated jar corresponding to the job is not picked up when I launch the job; on the cluster I see that it uses an old version of the jar.
To clarify, I publish the jar through the Databricks API 2.0.
My question: why does the execution on the cluster always use an old version when I start my job?
Thank you for your help.
The old jar will be removed from the cluster only when the cluster is terminated. If you have a shared cluster that never terminates, then this doesn't happen. This is a limitation not of Databricks but of Java, which can't unload classes that are already in use (or at least it's very hard to implement reliably).
For most cases it's really not recommended to use a shared cluster, for several reasons:
it costs significantly more (~4x)
tasks affect each other from a performance point of view
there is a high probability of dependency conflicts, plus the inability to update libraries without affecting other tasks
a kind of "garbage" accumulates on the driver nodes
...
If you use a shared cluster to get faster execution, I recommend looking into Instance Pools, especially in combination with preloading the Databricks Runtime onto the nodes in the instance pool.
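If the cluster has to stay shared, the other option is to restart it so the old classes are unloaded and the new jar is loaded on the next run. Below is a rough sketch of triggering that restart through the Clusters REST API; the workspace URL, token environment variable and cluster id are placeholders.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Sketch: restart a Databricks cluster via the Clusters API 2.0 so that the
// newly installed jar is picked up on the next job run. Host, token and
// cluster_id below are placeholders.
object RestartClusterSketch {
  def main(args: Array[String]): Unit = {
    val host      = "https://<your-workspace>.cloud.databricks.com"
    val token     = sys.env("DATABRICKS_TOKEN")
    val clusterId = "<cluster-id>"

    val request = HttpRequest.newBuilder()
      .uri(URI.create(s"$host/api/2.0/clusters/restart"))
      .header("Authorization", s"Bearer $token")
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(s"""{"cluster_id": "$clusterId"}"""))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    println(s"${response.statusCode()} ${response.body()}")
  }
}
```

Note that restarting a shared cluster interrupts every job running on it, which is one more reason to prefer job clusters or instance pools.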

Slow hazelcast migration when using index

I'm running a microservice in an OpenShift environment using Hazelcast 4.1.1 and the 2.2.1 Kubernetes discovery plugin. I have configured Hazelcast in embedded mode and I'm running 4 instances of that service. When I scale the application down from 4 to 3 pods, the migration never finishes and my application keeps throwing a WrongTargetException (after one minute).
I analyzed the diagnostic file and I believe that the error comes from the index calculation. If I disable all the indices on my maps, everything works like a charm. I think this might be related to https://github.com/hazelcast/hazelcast/issues/18079
It seems that the deserialization of my objects is invoked once per index. Since we have configured a custom (de-)serializer which also applies compression (LZ4), the migration takes ages.
Can somebody confirm my assumptions? Or are there any other known issues with index calculation and migration?
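For reference, a minimal sketch of the kind of embedded-member setup described above, with the map index made explicit so it can be toggled off when comparing migration times; the map name and indexed attribute are placeholders, and the custom serializer is omitted.

```scala
import com.hazelcast.config.{Config, IndexConfig, IndexType}
import com.hazelcast.core.Hazelcast

// Sketch of an embedded Hazelcast member whose map carries an index.
// "orders" and "customerId" are placeholder names.
object HazelcastIndexSketch {
  def main(args: Array[String]): Unit = {
    val config    = new Config()
    val mapConfig = config.getMapConfig("orders")

    // Suspected culprit: each index forces entries to be deserialized (and,
    // with a compressing serializer, decompressed) during migration.
    // Comment this out to reproduce the "works like a charm" case.
    mapConfig.addIndexConfig(new IndexConfig(IndexType.HASH, "customerId"))

    val instance = Hazelcast.newHazelcastInstance(config)
  }
}
```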

Avoiding re-processing of data during Spark Structured Streaming application updates

I am using Structured Streaming with Spark 2.2. We are using Kafka as our source and are using checkpoints for failure recovery and end-to-end exactly-once guarantees. I would like to get some more information on how to handle updates to the application when there is a change in stateful operations and/or the output schema.
As some sources suggest, I can start the updated application in parallel with the old application until it catches up with the old one in terms of data, and then kill the old one. But then the new application will have to re-read/re-process all the data in Kafka, which could take a long time.
I want to avoid this re-processing of the data in the newly deployed updated application.
One way I can think of is for the application to keep writing the offsets to something in addition to the checkpoint directory, for example ZooKeeper/HDFS. Then, on an update of the application, I instruct the Kafka readStream() to start reading from the offsets stored in this new location (ZooKeeper/HDFS), since the updated application can't read from the checkpoint directory, which is now deemed incompatible.
So a couple of questions:
Is the above-stated solution a valid solution?
If yes, how can I automate the detection of whether the application is being restarted because of a failure/maintenance or because of code changes to stateful operations and/or the output schema?
Any guidance, example or information source is appreciated.
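To make the proposed approach concrete, here is a hedged sketch: a StreamingQueryListener reports the latest committed offsets on every progress event (persisting them to ZooKeeper/HDFS is left abstract), and the upgraded application feeds that JSON back to the Kafka source via the startingOffsets option, using a fresh checkpoint directory. Topic, broker and offset values are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

object OffsetTrackingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("offset-tracking-sketch").getOrCreate()

    // 1) While the old application runs, record the latest committed Kafka
    //    offsets outside the checkpoint directory (printed here; in practice
    //    write them to ZooKeeper/HDFS as proposed in the question).
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        event.progress.sources.foreach { s =>
          // endOffset is a JSON string like {"my-topic":{"0":42,"1":17}}
          println(s"latest offsets: ${s.endOffset}")
        }
      }
    })

    // 2) In the upgraded application, feed the saved JSON back to the Kafka
    //    source instead of relying on the (now incompatible) old checkpoint.
    val savedOffsetsJson = """{"my-topic":{"0":42,"1":17}}"""  // placeholder
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "my-topic")
      .option("startingOffsets", savedOffsetsJson)
      .load()
  }
}
```

Keep in mind that startingOffsets only takes effect when the query starts without an existing checkpoint, and that any in-memory state from stateful operations is not carried over by this trick, only the read position.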

How to upgrade or restart Spark streaming application without state loss?

I'm using updateStateByKey() in my application code, and I want to save state even if I restart this application.
This can be done by saving the state somewhere on every batch, but doing that may take a lot of time.
So, I wonder if there is a solution that can store the state only when the application is stopped.
Or is there another solution to upgrade application code without losing the current state?
Currently, as of Spark 2.1.0, there isn't a solution that makes this work out of the box; you have to store the data yourself if you want to upgrade. One possibility would be to not use updateStateByKey or mapWithState and instead store the state somewhere external, such as in a key-value store.
Spark 2.2 is going to bring a new stateful store based on HDFS, but I haven't had a chance to look at it to see if it overcomes the weakness of the current checkpointing implementation.
There are many options for saving state during each batch. I've listed the majority of them in this answer. Since you highlight the latency this adds (going over the network, serialization etc), I wanted to point out SnappyData. SnappyData deeply integrates an in-memory database with Spark such that they share the same JVM and block manager. This eliminates the serialization step during each batch which should improve latency as you write out your state. Further, it can persist the state when your application stops, as you requested.
(disclaimer: I am an employee of SnappyData)
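To illustrate the "store the state yourself" option mentioned above, here is a rough sketch that skips updateStateByKey entirely and merges each batch's counts into an external key-value store inside foreachRDD; KeyValueClient and its get/put methods are hypothetical stand-ins for whatever store (Redis, HBase, SnappyData, ...) is actually used.

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical client for an external key-value store; substitute your own.
trait KeyValueClient {
  def get(key: String): Long
  def put(key: String, value: Long): Unit
}

object ExternalStateSketch {
  // Instead of updateStateByKey, merge each batch's counts into the external
  // store so the state survives application restarts and code upgrades.
  def updateExternalState(counts: DStream[(String, Long)],
                          newClient: () => KeyValueClient): Unit = {
    counts.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val client = newClient()  // one connection per partition, on the executor
        partition.foreach { case (key, delta) =>
          client.put(key, client.get(key) + delta)
        }
      }
    }
  }
}
```

Because the state lives outside the DStream graph, a redeployed jar can simply keep reading and writing the same store, at the cost of the per-batch network and serialization overhead discussed above.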

How to change Spark Streaming application with checkpointing?

Please consider the following scenario:
created the initial pipeline via Spark Streaming
enabled checkpointing
ran the application for a while
stopped the streaming application
made tiny changes to the pipeline, e.g. the business logic remained untouched but I did some refactoring, renaming, class moving, etc.
restarted the streaming application
got an exception because the pipeline stored in the checkpoint directory differs at the class level from the new one
What are the best practices to deal with such a scenario? How can we seamlessly upgrade streaming application with checkpointing enabled? What are the best practices for versioning of streaming applications?
tl;dr Checkpointing is for recovery situations not for upgrades.
From the official documentation about Checkpointing:
A streaming application must operate 24/7 and hence must be resilient to failures unrelated to the application logic (e.g., system failures, JVM crashes, etc.). For this to be possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system such that it can recover from failures.
So, to answer your question about using checkpointing (which is meant for fault tolerance) while changing your application code: you should not expect it to work, since that goes against the design.
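The reason upgrades break becomes clearer when you look at how checkpoint recovery is usually wired up: StreamingContext.getOrCreate deserializes the entire DStream graph, including your functions and classes, from the checkpoint directory, so any class-level change makes that deserialization fail. A minimal sketch of that recovery pattern, with a placeholder checkpoint path:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRecoverySketch {
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // placeholder path

  // All pipeline setup must live inside this factory: when a checkpoint
  // exists, the factory is NOT called and the serialized DStream graph
  // (built from the *old* classes) is restored instead.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpointed-app")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // ... build the DStream pipeline here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recovers from the checkpoint if present; fails if the checkpointed
    // classes no longer match the code on the classpath.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

This is why the usual practice after a refactoring is to point the upgraded application at a new (or emptied) checkpoint directory and rely on source-side buffering or externally stored state, as discussed in the earlier answers.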
