There is a way to enable graceful shutdown of Spark Streaming by setting the property spark.streaming.stopGracefullyOnShutdown to true and then killing the process with the kill -SIGTERM command. However, I don't see such an option available for Structured Streaming (SQLContext.scala).
Is the shutdown process different in structured streaming? Or is it simply not implemented yet?
This feature is not implemented yet. But the write-ahead logs of Spark Structured Streaming are claimed to recover state and offsets without any issues.
Try this GitHub example code:
https://github.com/kali786516/FraudDetection/blob/master/src/main/scala/com/datamantra/spark/jobs/RealTimeFraudDetection/StructuredStreamingFraudDetection.scala#L76
https://github.com/kali786516/FraudDetection/blob/master/src/main/scala/com/datamantra/spark/GracefulShutdown.scala#L26
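The gist of that example is to poll for a marker file and stop the streaming query once it appears. A rough sketch of that pattern (the paths and names here are illustrative, not taken from the repo):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery

// Poll for a marker file; once it shows up, let the in-flight trigger finish and stop the query.
def stopOnMarkerFile(spark: SparkSession, query: StreamingQuery, markerPath: String): Unit = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  while (query.isActive) {
    if (fs.exists(new Path(markerPath))) {
      // Avoid cutting a micro-batch in half: wait until no trigger is active.
      while (query.status.isTriggerActive) Thread.sleep(500)
      query.stop() // blocks until the query's execution threads have terminated
    } else {
      Thread.sleep(5000) // re-check every 5 seconds
    }
  }
}

You would run this in a separate thread after query.start(), and touch the marker file whenever you want the job to drain and exit.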
This feature is not implemented yet, and killing the job from the Resource Manager while a batch is running will also give you duplicates.
Correction: the duplicates will only be in the output directory; the write-ahead logs handle everything beautifully, so you don't need to worry about anything. Feel free to kill it at any time.
Related
I am working on Spark Structured Streaming, which is a pretty easy use case.
I will be reading data from Kafka and persisting to an HDFS sink after parsing the JSON.
I have almost completed that part. The problem now is that we need a good way of shutting down the streaming job without closing it abruptly (Ctrl+C or yarn application -kill).
I have tried the option below, but to no avail:
sparkConf.set("spark.streaming.stopGracefullyOnShutdown", "true")
My requirement is that while the streaming job is running, it should stop when some touch file is created at an HDFS or local Linux path.
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-demo-StreamingQueryManager-awaitAnyTermination-resetTerminated.html
In the link above, they create a thread that runs for a fixed duration. But I need something that exits execution when some dummy file is created.
I am a newbie, so I would really appreciate your help with this.
Thanks in advance.
I am not sure whether sparkConf.set("spark.streaming.stopGracefullyOnShutdown", "true") actually works currently. Some claim it does, though others say it doesn't.
In any event, the question is about a direct kill versus a graceful stop.
You still need to kill the JVM, though, or if you are on Databricks, they provide a whole set of utilities.
But you will not lose data, thanks to the checkpointing and write-ahead logs that Spark Structured Streaming provides; that is to say, it can recover state and offsets without any issues, since Spark maintains its own offset management. So how you stop it seems less of an issue, which may explain the confusion and the "but no use".
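To lean on that built-in offset management, the main thing is to give the sink a checkpointLocation. A minimal Kafka-to-HDFS sketch (broker, topic and paths are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-to-hdfs").getOrCreate()

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()

// Keep it simple here; parse the JSON payload downstream as needed.
val parsed = raw.selectExpr("CAST(value AS STRING) AS json")

val query = parsed.writeStream
  .format("parquet")
  .option("path", "/data/events")                      // HDFS sink
  .option("checkpointLocation", "/checkpoints/events") // offsets and commit log live here
  .start()

query.awaitTermination()

On restart with the same checkpointLocation, the query resumes from the last committed offsets, which is why an abrupt stop usually costs nothing beyond possibly reprocessing the last micro-batch.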
Killing a Spark job using the command prompt
This is the thread that I hoped would answer my question. But all four answers explain how to kill the entire application.
How can I stop a job? Like a count for example?
I can do it from the Spark web UI by clicking "kill" on the respective job. I suppose it must also be possible to list running jobs and interact with them directly via the CLI.
Practically speaking, I am working in a notebook with PySpark on a Glue endpoint. If I kill the application, the entire endpoint dies and I have to spin up a new cluster. I just want to stop a job. Cancelling it within the notebook merely detaches synchronization; the job keeps running, blocking any further commands from being executed.
The Spark History Server provides a REST API. Unfortunately, it only exposes monitoring capabilities for applications, jobs, stages, etc.
There is also a REST submission interface that provides the ability to submit, kill and check the status of applications. It is undocumented AFAIK, and is only supported on Spark standalone and Mesos clusters, not YARN. (That's why there is no "kill" link on the Jobs UI screen for Spark on YARN, I guess.)
So you can try using that "hidden" API, but if you know your application's Spark UI URL and the id of the job you want to kill, the easier way is something like:
$ curl -G http://<Spark-Application-UI-host:port>/jobs/job/kill/?id=<job_id>
Since I don't work with Glue, I'd be interested to find out myself how it's going to react, because the kill normally results in org.apache.spark.SparkException: Job <job_id> cancelled.
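As a side note, if you can still run code against the same SparkContext (e.g. from the notebook itself), Spark's cancellation API is another way to stop an individual job without touching the application. A rough sketch; the group id is arbitrary and largeDf is a placeholder:

// In the cell/thread that runs the long action, tag it with a job group first.
spark.sparkContext.setJobGroup("adhoc-count", "long running count", interruptOnCancel = true)
val n = largeDf.count() // this job now carries the group id

// Later, from a different cell or thread, cancel just that group
// instead of killing the whole application:
spark.sparkContext.cancelJobGroup("adhoc-count")
// or, as a blunter tool: spark.sparkContext.cancelAllJobs()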
Building on the answer by mazaneicha, it appears that for Spark 2.4.6 in standalone mode, with jobs submitted in client mode, the curl request to kill an app with a known applicationID is
curl -d "id=<your_appID>&terminate=true" -X POST <your_spark_master_url>/app/kill/
We had a similar problem with people not disconnecting their notebooks from the cluster and hence hogging resources.
We get the list of running applications by parsing the web UI. I'm pretty sure there are less painful ways to manage a Spark cluster...
List the job in Linux and kill it.
I would do
ps -ef | grep spark-submit
if it was started using spark-submit. Get the PID from the output and then
kill -9 <PID>
Kill a running job by:
Open the Spark application UI.
Go to the Jobs tab.
Find the job among the running jobs.
Click the kill link and confirm.
I want to know how deleting a job works on Databricks. Does it immediately terminate the code execution when it terminates the job cluster? If I am using micro-batching, does it make sure that the last batch is processed and then terminate, or is it an abrupt termination which can cause data loss/corruption? How can I avoid that?
Also, what happens when I delete a job on a running cluster?
It will terminate immediately, not gracefully.
Are you using Structured Streaming or true micro-batching? If the former, a checkpoint file will suffice to start in the right place again. (https://docs.databricks.com/spark/latest/structured-streaming/production.html)
If you have your own batch process, you will need to manually write a checkpoint file to keep track of where you are up to. Given the lack of transactions, I would ensure your pipeline is idempotent, so that if you do restart and repeat a batch there is no impact.
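For the Structured Streaming case, one way to get that idempotence is to write each micro-batch to a batch-scoped location in overwrite mode, so a replayed batch simply rewrites the same files. A minimal sketch; parsedStream and the paths are placeholders:

import org.apache.spark.sql.DataFrame

val query = parsedStream.writeStream
  .option("checkpointLocation", "/checkpoints/myQuery") // lets a restart resume at the right batch
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Overwrite a directory keyed by batchId: replaying the batch after an
    // abrupt termination produces the same output instead of duplicates.
    batch.write
      .mode("overwrite")
      .parquet(s"/data/output/batch_id=$batchId")
  }
  .start()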
I am using Structured Streaming with Spark 2.2. We are using Kafka as our source and are using checkpoints for failure recovery and e2e exactly once guarantees. I would like to get some more information on how to handle updates to the application when there is a change in stateful operations and/or output schema.
As some sources suggest, I can run the updated application in parallel with the old application until it catches up in terms of data, and then kill the old one. But then the new application will have to re-read/re-process all the data in Kafka, which could take a long time.
I want to avoid this re-processing of the data in the newly deployed updated application.
One way I can think of is for the application to keep writing the offsets to something in addition to the checkpoint directory, for example ZooKeeper/HDFS. Then, on an update of the application, I tell the Kafka readStream() to start reading from the offsets stored in this new location (ZooKeeper/HDFS), since the updated application can't read from the checkpoint directory, which is now deemed incompatible.
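Roughly, what I have in mind looks like the sketch below. saveOffsetsToHdfs and loadOffsetsFromHdfs are hypothetical helpers and the topic/paths are placeholders, but the offset JSON is in the format the Kafka source accepts for startingOffsets:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// 1) Keep writing the latest processed Kafka offsets somewhere outside the checkpoint directory.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(e: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(e: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(e: QueryProgressEvent): Unit = {
    // endOffset is a JSON string like {"my-topic":{"0":1234,"1":5678}}
    e.progress.sources.foreach(s => saveOffsetsToHdfs(s.endOffset)) // hypothetical helper
  }
})

// 2) After a code change that invalidates the checkpoint, start with a fresh
//    checkpoint directory but from the externally stored offsets.
val storedOffsets: String = loadOffsetsFromHdfs() // hypothetical helper returning the JSON above

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "my-topic")
  .option("startingOffsets", storedOffsets) // only consulted when there is no existing checkpoint
  .load()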
So a couple of questions:
Is the above-stated solution a valid solution?
If yes, how can I automate the detection of whether the application is being restarted because of a failure/maintenance or because of code changes to its stateful operations and/or output schema?
Any guidance, example or information source is appreciated.
How can I ensure that an entire Spark DAG is highly available, i.e. not recomputed from scratch, when the driver is restarted (default HA in YARN cluster mode)?
Currently, I use Spark to orchestrate multiple smaller jobs, i.e.:
read table1
hash some columns
write to HDFS
This is performed for multiple tables.
Now when the driver is restarted, say while working on the second table, the first one is reprocessed, even though it would already have been stored successfully.
I believe that the default mechanism of checkpointing (the raw input values) would not make sense.
What would be a good solution here?
Is it possible to checkpoint the (small) configuration information and only reprocess what has not already been computed?
TL;DR: Spark is not a task orchestration tool. While it has a built-in scheduler and some built-in fault-tolerance mechanisms, it is about as suitable for granular task management as, say, bash is for server orchestration (hey, we can call pipe on each machine to execute bash scripts, right?).
If you want granular recovery, choose a minimal unit of computation that makes sense for the given process (read, hash, write looks like a good choice, based on the description), make it an application, and use external orchestration to submit the jobs.
You can build a poor man's alternative by checking whether the expected output exists and skipping that part of the job if it does, but really, don't: we have a variety of battle-tested tools that can do a far better job than this.
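(For completeness, that skip-if-output-exists check could look roughly like the sketch below; the table names, paths and hashed column are illustrative.)

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.{col, sha2}

val tables = Seq("table1", "table2", "table3")
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

tables.foreach { table =>
  val outDir = new Path(s"/data/hashed/$table")
  // Spark writes a _SUCCESS marker only after the whole output has been committed.
  if (fs.exists(new Path(outDir, "_SUCCESS"))) {
    println(s"Skipping $table: output already present")
  } else {
    spark.table(table)
      .withColumn("id_hashed", sha2(col("id").cast("string"), 256)) // "hash some columns"
      .write.mode("overwrite").parquet(outDir.toString)
  }
}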
As a side note, Spark doesn't provide HA for the driver, only supervision with automatic restarts. Also, independent jobs (read -> transform -> write) create independent DAGs; there is no global DAG, and a proper checkpoint of the application would require a full snapshot of its state (like good old BLCR).
when the driver is restarted (default HA in yarn cluster mode).
When the driver of a Spark application is gone, your Spark application is gone and so are all the cached datasets. That's by default.
You have to use some sort of caching solution like https://www.alluxio.org/ or https://ignite.apache.org/. Both work with Spark, and both claim to offer the ability to outlive a Spark application.
There have been times when people used Spark Job Server to share data across Spark applications (which is similar to restarting Spark drivers).