Oozie: kill a job after a timeout - apache-spark

Sorry, but I can't find the configuration point I need. I schedule Spark applications, and sometimes they do not succeed after 1 hour. In that case I want to automatically kill the task (because I am sure it will never succeed, and another scheduled run may start).
I found a timeout configuration, but as I understand it, this is used to delay the start of a workflow.
So is there a kind of 'running time' timeout?

Oozie cannot kill a workflow that it triggered. However, you can ensure that only a single workflow instance runs at the same time by setting concurrency = 1 in the coordinator controls.
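A minimal sketch of a coordinator definition with that control (the app name, frequency and property names here are illustrative, not from your setup):

    <coordinator-app name="spark-job-coord" frequency="${coord:hours(1)}"
                     start="${start}" end="${end}" timezone="UTC"
                     xmlns="uri:oozie:coordinator:0.4">
        <controls>
            <!-- minutes a materialized action may wait before being discarded;
                 this is the "delay the start" timeout you already found -->
            <timeout>60</timeout>
            <!-- run at most one workflow instance at a time -->
            <concurrency>1</concurrency>
        </controls>
        <action>
            <workflow>
                <app-path>${workflowAppPath}</app-path>
            </workflow>
        </action>
    </coordinator-app>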
Also you can have a second Oozie workflow monitoring the status of the Spark job.
Anyway, you should investigate the root cause of the Spark job not succeeding or being blocked.

Related

Long-running tasks are cancelled - Celery 5.2.6

I have a project hosted on DigitalOcean in a Basic Droplet with 2 GB RAM. On my local machine, the long-running task takes between 8-10 minutes and still succeeds. However, on the DigitalOcean droplet, Celery often does not complete the long-running task.
Current Celery version: 5.2.6
I have two programs configured in supervisor:
Running the celery worker: celery -A myproject worker -l info
Running the celery beat: celery -A myproject beat -l info
This is the message from celeryd.log
CPendingDeprecationWarning:
In Celery 5.1 we introduced an optional breaking change which
on connection loss cancels all currently executed tasks with late acknowledgment enabled.
These tasks cannot be acknowledged as the connection is gone, and the tasks are automatically redelivered back to the queue.
You can enable this behavior using the worker_cancel_long_running_tasks_on_connection_loss setting.
In Celery 5.1 it is set to False by default. The setting will be set to True by default in Celery 6.0.
warnings.warn(CANCEL_TASKS_BY_DEFAULT, CPendingDeprecationWarning)
[2022-07-07 04:25:36,998: ERROR/MainProcess] consumer: Cannot connect to redis://localhost:6379//: Error 111 connecting to localhost:6379. Connection refused..
Trying again in 2.00 seconds... (1/100)
[2022-07-07 04:25:39,066: ERROR/MainProcess] consumer: Cannot connect to redis://localhost:6379//: Error 111 connecting to localhost:6379. Connection refused..
Trying again in 4.00 seconds... (2/100)
As a temporary solution, I restart the server and re-run new tasks, but this does not guarantee that the long-running task will succeed, and the previously failed task will not restart.
My goals are:
Prevent long-running tasks from being canceled
If the long-running task is already canceled and cancellation can't be avoided, I need it to rerun and continue instead of starting a new task.
Is this possible? Any ideas on how?
As stated in the warning message, you can control this behavior with worker_cancel_long_running_tasks_on_connection_loss to prevent the task from being cancelled on connection loss. On your celery version it is off by default, so your tasks should not be cancelled. However, even if a late-acknowledging task completes successfully in this scenario, the task is still redelivered to the queue and will be run again -- this happens irrespective of this setting and is unavoidable for tasks with late acknowledgment.
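If you want to keep that behavior explicit (and be prepared for the default flipping to True in Celery 6.0), you can pin the setting in your Celery configuration. A minimal sketch, reusing the app name and broker from your setup:

    from celery import Celery

    app = Celery("myproject", broker="redis://localhost:6379/0")

    # Keep the pre-6.0 default explicit: do not cancel late-acknowledging
    # tasks when the broker connection is lost. Note that such tasks are
    # still redelivered and executed again once the connection returns.
    app.conf.worker_cancel_long_running_tasks_on_connection_loss = False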
This is why it is vital that you design your tasks to be idempotent.
If your job is not idempotent, an alternative solution is to have your tasks ack early (the default), but this risks the possibility that you may drop a task without it actually being completed.
If you must avoid dropping tasks, you must set acks_late=True on your task, and it must be designed to be idempotent. This is necessary irrespective of the specific connection-loss issue, as many other things can happen that interrupt your tasks and produce the same scenario.
I need it to rerun and continue instead of starting a new task.
This comes down to how you design your task for idempotency. For example, you might want to have your job keep track of its progress in persistent storage, so when the task fails and is run again, it can determine how best to recover.
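A minimal sketch of that pattern, combining acks_late with a checkpoint kept in Redis (the task name, the job_id argument and the process_item helper are illustrative, not from your project):

    import redis
    from celery import Celery

    app = Celery("myproject", broker="redis://localhost:6379/0")
    checkpoint = redis.Redis(host="localhost", port=6379, db=1)

    def process_item(item):
        # Placeholder for the real unit of work done on each item.
        pass

    @app.task(acks_late=True)
    def long_running(job_id, items):
        # Resume from the last recorded position instead of starting over
        # when the task is redelivered after a crash or connection loss.
        start = int(checkpoint.get(f"progress:{job_id}") or 0)
        for index in range(start, len(items)):
            process_item(items[index])
            checkpoint.set(f"progress:{job_id}", index + 1)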

How to make Slurm make a scheduling decision when jobs are submitted?

I'm using the backfill scheduler with Slurm to manage a small GPU cluster. The backfill scheduler makes a scheduling decision every bf_interval seconds (the default value is 30 seconds). This means that even when GPU resources are available, I sometimes have to wait a while until they are allocated. I can obviously reduce bf_interval, but given that we don't have a lot of job submissions, it would be good if I could force Slurm to run the scheduling routine the moment a job is queued. Is this possible?
By default Slurm does it. From the documentation:
Slurm is designed to perform a quick and simple scheduling attempt at events such as job submission or completion and configuration changes.
Have you changed the default configuration for this? And are you sure that not scheduling on submission is your problem?
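One thing worth checking is the live scheduler configuration; in particular, the defer option in SchedulerParameters disables the per-job scheduling attempt at submit time, and sched_min_interval throttles how often it may run:

    # show the scheduler settings Slurm is actually running with
    scontrol show config | grep -E 'SchedulerType|SchedulerParameters'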

Do the successful tasks also get reprocessed on an executor crash?

I am seeing about 3018 failed tasks for the job, as about 4 executors died.
The Executors summary (as below in the Spark UI) has completely different statistics: out of 3018, about 2994 completed properly. My questions are:
Will they be re-tried again?
Is there a config to override/limit this?
After monitoring the job and manually validating the attempt counts, even for successful tasks, I realised:
Will they be re-tried again?
- Yes, even the successful tasks are retried.
Is there a config to override/limit this?
- Did not find any config to override this behaviour.
If an executor (Kubernetes pod) dies (for example with an OOM or timeout), all of its tasks, even those that completed successfully, are re-executed. One of the main reasons is that the shuffle writes from the executor are lost along with the executor itself.

How to tell if your Spark job is waiting for resources

I have a PySpark job that I submit to a standalone Spark cluster - this is an auto-scaling cluster on EC2 boxes, so when jobs are submitted and not enough nodes are available, a few more boxes spin up and become available after a few minutes.
We have a @timeout decorator on the main part of the Spark job to time out and error when it has exceeded a certain time threshold (put in place because some jobs were hanging). The issue is that sometimes a job may not have actually started because it is waiting on resources, yet the @timeout function is evaluated and the job errors out as a result.
So I'm wondering if there's any way to tell from within the application itself, with code, whether the job is waiting for resources?
To know the status of the application, you need to access the Spark History Server, from which you can get the current status of the job.
You can solve your problem as follows:
Get the application ID of your job through sc.applicationId.
Then use this application ID with the Spark History Server REST APIs to get the status of the submitted job.
You can find the Spark History Server REST APIs at link.
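A minimal sketch of that approach, using the requests library against Spark's monitoring REST API (the host and port are deployment-specific assumptions; a running application serves the same API on its driver UI port, typically 4040, while the History Server default is 18080):

    import requests

    app_id = sc.applicationId  # 'sc' is the existing SparkContext

    # Point this at your History Server (or the driver UI of the running app).
    base_url = "http://localhost:18080/api/v1"
    jobs = requests.get(f"{base_url}/applications/{app_id}/jobs").json()

    if not jobs:
        # No jobs recorded yet - the application is most likely still
        # waiting for executors/resources rather than hanging mid-computation.
        print("No jobs started yet - probably waiting for resources")
    else:
        for job in jobs:
            print(job["jobId"], job["status"])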

Spark Streaming active jobs stuck/piled up in Web UI

I'm experiencing strange behavior while streaming from Kafka using Spark 2.1.0/2.0.2 on AWS EMR.
"spark.streaming.concurrentJobs" was explicitly set to 1 for the streaming job, but after running for a while, the Jobs tab showed more than 1 active job running, and such "active" jobs keep increasing.
Inside such jobs, some stages remain unexecuted forever (status is --). However, all the tasks under those jobs are shown as succeeded.
What could be wrong here? Even stranger, this behavior does not seem to occur unless I open the Spark UI page frequently to check the current status.
Jobs tab - http://ibb.co/j6XkXk
Stages - http://ibb.co/budg55
It was only Job 12109 at the beginning. Things piled up when I switched tabs a couple of times.
Regards,
Alex
