I want to know how deleting a job works on Databricks. Does it immediately terminate the code execution and terminate the job cluster? If I am using micro-batching, does it make sure the last batch is processed before terminating, or is it an abrupt termination that can cause data loss/data corruption? How can I avoid that?
Also, what happens when I delete a job on a running cluster?
It will terminate immediately - not gracefully.
Are you using Structured Streaming or true micro-batching of your own? If the former, then a checkpoint location will suffice to start in the right place again. (https://docs.databricks.com/spark/latest/structured-streaming/production.html)
If you have your own batch process, you will need to manually write a checkpoint file to keep track of where you are up to. Given the lack of transactions, I would ensure your pipeline is idempotent, so that if you do restart and repeat a batch there is no impact.
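To make that concrete, here is a minimal Structured Streaming sketch (the broker list, topic and paths are placeholders, not from the question): because the query records its progress under checkpointLocation, killing or deleting the job and starting it again resumes from the last committed offsets.

import org.apache.spark.sql.SparkSession

object CheckpointedStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CheckpointedStream").getOrCreate()

    // Source: Kafka (placeholder broker and topic).
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .load()

    // Sink: Parquet files. The checkpointLocation records the committed offsets,
    // so a restarted query continues from the last completed micro-batch instead
    // of reprocessing everything or skipping data.
    val query = input.selectExpr("CAST(value AS STRING) AS json")
      .writeStream
      .format("parquet")
      .option("path", "/data/events")                       // placeholder output path
      .option("checkpointLocation", "/checkpoints/events")  // placeholder checkpoint path
      .start()

    query.awaitTermination()
  }
}

If you have a hand-rolled batch process instead, the same principle applies: persist your own offset/high-water mark and make the write step safe to repeat.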
Related
I am working on a Spark Structured Streaming job, which is a pretty simple use case.
I will be reading data from Kafka and persisting it to an HDFS sink after parsing the JSON.
I have almost completed that part. The problem now is that we need a good way of shutting down the streaming job without having to kill it abruptly (Ctrl+C or yarn application -kill).
I have used the option below:
sparkConf.set("spark.streaming.stopGracefullyOnShutdown", "true"), but no use.
My requirement is that while the streaming job is running, it should stop when a touch (marker) file is created in HDFS or on a Linux edge node path.
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-demo-StreamingQueryManager-awaitAnyTermination-resetTerminated.html
In the link above, they create a thread that runs for a fixed duration. But I need something similar that stops the execution when some dummy (marker) file is created.
I am a newbie, so I need your help with this.
Thanks in advance.
I am not sure whether sparkConf.set("spark.streaming.stopGracefullyOnShutdown", "true") actually works currently. Some claim it does work, though some claim it doesn't.
In any event, it comes down to a direct kill versus graceful stopping.
You need to kill the JVM, though; or, if you are on Databricks, they have a whole set of utilities for this.
But you will not lose data, thanks to the checkpointing and write-ahead logs that Spark Structured Streaming provides; that is, it can recover state and offsets without any issues, since Spark maintains its own offset management. So how you stop it seems less of an issue, which may explain the confusion and the "but no use".
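For the original "stop when a touch file is created" requirement, here is a rough sketch of one common pattern (not a built-in Spark feature; the marker path, Kafka settings and poll interval are assumptions): poll HDFS from the driver instead of blocking in awaitTermination(), and call query.stop() when the marker appears. Thanks to checkpointing, a batch that gets cut off is simply reprocessed on the next start.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object MarkerFileShutdown {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MarkerFileShutdown").getOrCreate()

    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder
      .option("subscribe", "events")                        // placeholder
      .load()

    val query = input.selectExpr("CAST(value AS STRING) AS json")
      .writeStream
      .format("parquet")
      .option("path", "/data/events")                       // placeholder
      .option("checkpointLocation", "/checkpoints/events")  // placeholder
      .start()

    // Poll HDFS for a marker ("touch") file instead of blocking forever in
    // awaitTermination(). When the file appears, stop the query.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val marker = new Path("/tmp/stop_streaming_job")         // placeholder marker path

    while (query.isActive && !fs.exists(marker)) {
      // Wait up to 10 seconds for termination, then check for the marker again.
      query.awaitTermination(10000)
    }

    if (query.isActive) {
      // Checkpointing ensures the next run resumes from the recorded offsets,
      // so stopping here does not lose data.
      query.stop()
    }
    spark.stop()
  }
}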
Sorry, but I can't find the configuration option I need. I schedule Spark applications, and sometimes they still have not succeeded after 1 hour; in this case I want to automatically kill the task (because I am sure it will never succeed, and another scheduled run may start).
I found a timeout configuration, but as I understand it, it is used to delay the start of a workflow.
So is there a kind of 'living' timeout?
Oozie cannot kill a workflow that it triggered. However, you can ensure that only a single workflow is running at the same time by setting concurrency = 1 in the coordinator.
You can also have a second Oozie workflow monitoring the status of the Spark job.
Anyway, you should investigate the root cause of the Spark job not succeeding or being blocked.
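Since Oozie itself cannot kill the job, one possible workaround (only a sketch, run outside Oozie on a schedule such as cron; the 1-hour threshold is an assumption, and in practice you would also filter by application name or queue so you only kill your own jobs) is a small watchdog that uses the YARN client API to kill applications that have been running too long:

import java.util.EnumSet
import scala.collection.JavaConverters._

import org.apache.hadoop.yarn.api.records.YarnApplicationState
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object YarnWatchdog {
  def main(args: Array[String]): Unit = {
    val maxAgeMillis = 60L * 60L * 1000L  // 1 hour (assumed threshold)

    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    val now = System.currentTimeMillis()
    val running = yarnClient
      .getApplications(EnumSet.of(YarnApplicationState.RUNNING))
      .asScala

    // Kill any RUNNING application older than the threshold.
    // Add a filter on getName or getQueue here to target only your own jobs.
    running
      .filter(app => now - app.getStartTime > maxAgeMillis)
      .foreach { app =>
        println(s"Killing ${app.getApplicationId} (${app.getName}), started at ${app.getStartTime}")
        yarnClient.killApplication(app.getApplicationId)
      }

    yarnClient.stop()
  }
}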
There is a way to enable graceful shutdown of Spark Streaming by setting the property spark.streaming.stopGracefullyOnShutdown to true and then killing the process with the kill -SIGTERM command. However, I don't see such an option available for Structured Streaming (SQLContext.scala).
Is the shutdown process different in Structured Streaming? Or is it simply not implemented yet?
This feature is not implemented yet. But the write-ahead logs of Spark Structured Streaming claim to recover state and offsets without any issues.
Try this GitHub example code:
https://github.com/kali786516/FraudDetection/blob/master/src/main/scala/com/datamantra/spark/jobs/RealTimeFraudDetection/StructuredStreamingFraudDetection.scala#L76
https://github.com/kali786516/FraudDetection/blob/master/src/main/scala/com/datamantra/spark/GracefulShutdown.scala#L26
This feature is not implemented yet, and it will also give you duplicates if you kill the job from the Resource Manager while a batch is running.
Corrected: the duplicates will only be in the output directory; the write-ahead logs handle everything beautifully, and you don't need to worry about anything else. Feel free to kill it at any time.
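If the duplicates in the output directory are still a concern, one option (only a sketch; the batch_id path layout and Kafka settings are assumptions, not part of the linked example) is to route each micro-batch through foreachBatch and write it to a location derived from the batch ID, so a replayed batch overwrites its own earlier, partial output:

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object IdempotentSink {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("IdempotentSink").getOrCreate()

    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder
      .option("subscribe", "events")                        // placeholder
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    // Write each micro-batch to a directory derived from its batch ID.
    // If the job is killed mid-batch and the batch is replayed after a
    // restart, Overwrite mode replaces the earlier partial output instead
    // of leaving duplicates behind.
    val writeBatch: (DataFrame, Long) => Unit = (batch, batchId) =>
      batch.write
        .mode(SaveMode.Overwrite)
        .parquet(s"/data/events/batch_id=$batchId")          // placeholder layout

    val query = input.writeStream
      .option("checkpointLocation", "/checkpoints/events")   // placeholder
      .foreachBatch(writeBatch)
      .start()

    query.awaitTermination()
  }
}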
We are using an HDInsight Linux-based Hadoop cluster, and we run Azure Data Factory jobs that basically execute Hive scripts against this cluster. Nobody uses this cluster for ad-hoc query execution. Generally, the average number of YARN apps at any given time is around 50 running and 40-50 pending.

For some reason, the cluster sometimes seems to misbehave, and I suddenly see a surge in the number of YARN jobs. Once every few days we notice something weird: the number of YARN apps starts increasing, both running and pending, but especially pending. The running apps go above 100, and the pending ones go above 400 or sometimes even 500+. We have a script that kills all YARN apps one by one, but it takes a long time, and that is not really a solution either. From our experience, the only fix we have found when this happens is to delete and recreate the cluster.

It may be that for some time the cluster's response time is delayed (the Hive component especially), and that ADF keeps retrying a failing slice several times. In that case, is it possible that the cluster stores all the supposedly failed slice execution requests (according to ADF) in a pool and tries to run them when it can? That is probably the only explanation for why this could be happening. Has anyone faced this issue?
Check if all the running jobs in the default queue are Templeton jobs. If so, then your queue is deadlocked.
Azure Data Factory uses WebHCat (Templeton) to submit jobs to HDInsight. WebHCat spins up a parent Templeton job, which then submits a child job that is the actual Hive script you are trying to run. The YARN queue can get deadlocked if there are so many parent jobs at one time, filling up the cluster capacity, that no child job (the actual work) is able to spin up an Application Master, and thus no work is actually being done. Note that if you kill the Templeton job, Data Factory will mark the time slice as completed even though it obviously was not.
If you are already in a deadlock, you can try raising the Maximum AM Resource percentage from the default 33% and/or scaling up your cluster. The goal is to allow some of the pending child jobs to run and slowly drain the queue.
As the correct long-term fix, you need to configure WebHCat so that the parent Templeton job is submitted to a separate YARN queue. You can do this by (1) creating a separate YARN queue and (2) setting templeton.hadoop.queue.name to the newly created queue.
To create the queue, you can use Ambari > YARN Queue Manager.
To update the WebHCat config via Ambari, go to the Hive tab > Advanced > Advanced webhcat-site, and update the config value there.
More info on WebHCat config:
https://cwiki.apache.org/confluence/display/Hive/WebHCat+Configure
I have been using Google Dataproc for a few weeks now, and since I started I have had a problem with cancelling and stopping jobs.
It seems like there must be some server, other than those created during cluster setup, that keeps track of and supervises jobs.
I have never had a process that does its job without error actually stop when I hit Stop in the dev console. The spinner just keeps spinning and spinning.
Cluster restart or stop does nothing, even if stopped for hours.
Only when the cluster is entirely deleted will the jobs disappear... (But wait, there's more!) If you create a new cluster with the same settings before the previous cluster's jobs have been deleted, the old jobs will start on the new cluster!!!
I have seen jobs that terminate on their own due to OOM errors restart themselves after cluster restart! (with no coding for this sort of fault tolerance on my side)
How can I forcefully stop Dataproc jobs? (gcloud beta dataproc jobs kill does not work)
Does anyone know what is going on with these seemingly related issues?
Is there a special way to shutdown a Spark job to avoid these issues?
Jobs keep running
In some cases, errors have not been successfully reported to the Cloud Dataproc service. Thus, if a job fails, it appears to run forever even though it has (probably) failed on the back end. This should be fixed by a soon-to-be-released version of Dataproc in the next 1-2 weeks.
Job starts after restart
This would be unintended and undesirable. We have tried to replicate this issue and cannot. If anyone can replicate it reliably, we'd like to know so we can fix it! This may be (and probably is) related to the issue above, where the job has failed but appears to be running, even after the cluster restarts.
Best way to shut down
Ideally, the best way to shut down a Cloud Dataproc cluster is to terminate the cluster and start a new one. If that would be problematic, you can try a bulk restart of the Compute Engine VMs; creating a new cluster will be much easier, however.