Why doesn't my Databricks Spark job finish even after all tasks have finished? - apache-spark

I'm seeing some of my jobs stuck in the "Running command..." stage even after all tasks have finished and "0" are shown as running.
What might be causing this?
Which logs should I be looking at to resolve this?
Thanks
Related

Jobs stuck in azure pipeline queue

We are currently experiencing an issue with Azure Pipelines where a job appears to be stuck running, which stops other jobs from being processed. The running job has been cancelled, yet the agent says it is still running. Are there any solutions to this? We've tried deleting the 'azure pipelines' and turning the agent off and back on again, but no luck. Is this likely to be an Azure bug? We have not hit any caps or limits.
Below you can see there is one running job.
When I click into Azure Pipelines, no processes are running.
But the agent thinks it is running Job 938, although, as can be seen, it is not running.
Any help appreciated, thanks.

Failed to solve the job scheduling with snakemake on SLURM scheduler

I'm running a snakemake pipeline on Slurm and am observing a strange error:
Failed to solve the job scheduling problem with pulp
Without SLURM, the pipeline works perfectly fine. However, when I run it on SLURM, the job scheduling is strange: the scheduler skips the first job (Job 0) and jumps directly to Job 1. Since Job 0 was skipped, there are no input files for Job 1.
Any help/direction would be much appreciated.

Oozie: kill a job after a timeout

Sorry, but I can't find the configuration point I need. I schedule Spark applications, and sometimes they have not succeeded after 1 hour. In this case I want to kill the task automatically (because I am sure it will never succeed, and another scheduled run may start).
I found a timeout configuration, but as I understand it, this is used to delay the start of a workflow.
So is there a kind of 'living' timeout?
Oozie cannot kill a workflow that it triggered. However, you can ensure that only a single workflow is running at the same time by setting Concurrency = 1 in the Coordinator.
You can also have a second Oozie workflow monitor the status of the Spark job.
Anyway, you should investigate the root cause of the Spark job failing or being blocked.
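As a rough illustration of the "monitor and kill after a timeout" idea, done outside Oozie: a minimal Python sketch that launches a command and kills it if it runs longer than a limit. The function name run_with_timeout and the limits used are illustrative assumptions, not Oozie features.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd as a child process and kill it if it exceeds timeout_s seconds.

    Returns the exit code, or None if the process had to be killed.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()   # hard-stop the launcher process
        proc.wait()   # reap it so no zombie is left behind
        return None


if __name__ == "__main__":
    # Stand-in for a long-running job: a child Python process that sleeps.
    # A real wrapper would pass the actual submit command and a limit such as 3600.
    slow_job = [sys.executable, "-c", "import time; time.sleep(30)"]
    if run_with_timeout(slow_job, timeout_s=1) is None:
        print("job exceeded the time limit and was killed")
```

Note that killing the local launcher process does not necessarily stop an application that is already running on the cluster (for example in yarn-cluster mode), so a real setup would likely also need to kill the cluster-side application.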

Job spark blocked and runs indefinitely

We are encountering a problem with a Spark 1.6 job (on YARN) that never ends when several jobs are launched simultaneously.
We found that when we launch the Spark job in yarn-client mode we do not have this problem, unlike when we launch it in yarn-cluster mode.
This could be a lead for finding the cause.
We changed the code to add a sparkContext.stop().
Indeed, the SparkContext was created (val sparkContext = createSparkContext) but never stopped. This change has decreased the number of jobs that remain blocked, but we still have some blocked jobs.
By analyzing the logs, we found these lines, which repeat without stopping:
17/09/29 11:04:37 DEBUG SparkEventPublisher: Enqueue SparkListenerExecutorMetricsUpdate(1,WrappedArray())
17/09/29 11:04:41 DEBUG ApplicationMaster: Sending progress
17/09/29 11:04:41 DEBUG ApplicationMaster: Number of pending allocations is 0. Sleeping for 5000.
It seems that the job blocks when we call newAPIHadoopRDD to get data from HBase. That may be the issue.
Does anyone have any idea about this issue?
Thank you in advance.
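The sparkContext.stop() change described in this question only helps if stop() is reached on every code path, including when the job throws. The question's code is Scala; the same try/finally pattern is sketched here in Python with a stand-in object, so the helper name stopping and the FakeContext class are hypothetical and only assume that the context exposes a stop() method.

```python
from contextlib import contextmanager


@contextmanager
def stopping(context):
    """Yield the context and always call context.stop() afterwards,
    even if the body raises an exception."""
    try:
        yield context
    finally:
        context.stop()


class FakeContext:
    """Hypothetical stand-in for a Spark context; only stop() is assumed."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


if __name__ == "__main__":
    ctx = FakeContext()
    try:
        with stopping(ctx):
            raise RuntimeError("job failed")  # simulate a failing job body
    except RuntimeError:
        pass
    print("context stopped:", ctx.stopped)
```

This only guarantees that the context is released; it would not explain or fix a job that hangs inside a call such as newAPIHadoopRDD.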

Google Dataproc Jobs Never Cancel, Stop, or Terminate

I have been using Google Dataproc for a few weeks now and since I started I had a problem with canceling and stopping jobs.
It seems like there must be some server other than those created on cluster setup, that keeps track of and supervises jobs.
I have never had a process that does its job without error actually stop when I hit stop in the dev console. The spinner just keeps spinning and spinning.
Cluster restart or stop does nothing, even if stopped for hours.
Only when the cluster is entirely deleted will the jobs disappear... (But wait there's more!) If you create a new cluster with the same settings, before the previous cluster's jobs have been deleted, the old jobs will start on the new cluster!!!
I have seen jobs that terminate on their own due to OOM errors restart themselves after cluster restart! (with no coding for this sort of fault tolerance on my side)
How can I forcefully stop Dataproc jobs? (gcloud beta dataproc jobs kill does not work)
Does anyone know what is going on with these seemingly related issues?
Is there a special way to shut down a Spark job to avoid these issues?
Jobs keep running
In some cases, errors have not been successfully reported to the Cloud Dataproc service. Thus, if a job fails, it appears to run forever even though it has probably failed on the back end. This should be fixed by a soon-to-be-released version of Dataproc in the next 1-2 weeks.
Job starts after restart
This would be unintended and undesirable. We have tried to replicate this issue and cannot. If anyone can replicate this reliably, we'd like to know so we can fix it! This is probably related to the issue above, where the job has failed but appears to be running, even after a cluster restart.
Best way to shutdown
Ideally, the best way to shut down a Cloud Dataproc cluster is to terminate the cluster and start a new one. If that would be problematic, you can try a bulk restart of the Compute Engine VMs; it will be much easier to create a new cluster, however.
