Set slurm to send email after all my jobs are done? - slurm

Is it possible to do that without writing my own daemon? I know Slurm can send you an email for each job, but I would like a single email when I have no more pending or running jobs.

One option is to submit an empty job just to send the email, and ask Slurm to run that job last.
You can do that using the --dependency=singleton option. From the documentation:
singleton This job can begin execution after any previously launched
jobs sharing the same job name and user have terminated.
So you need to give all your jobs the same name (--job-name=commonname), and you should request the minimum resources possible for the email job so that it is not delayed further once all your other jobs have finished.
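As a minimal sketch of the approach above, the following builds the sbatch command line for such a notifier job. The helper name, the email address, and the one-minute time limit are assumptions made for illustration; the flags themselves (--job-name, --dependency, --mail-type, --mail-user, --ntasks, --time, --wrap) are standard sbatch options.

```python
import shlex


def build_notifier_command(job_name, email):
    """Return the sbatch argument list for a tiny 'mail me when done' job.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    return [
        "sbatch",
        f"--job-name={job_name}",   # must match the name used by the real jobs
        "--dependency=singleton",   # wait for earlier same-name jobs of this user
        "--mail-type=END",          # Slurm sends mail when this job itself ends
        f"--mail-user={email}",
        "--ntasks=1",               # minimal resources so it is not held up
        "--time=00:01:00",          # assumed limit; adjust to site policy
        "--wrap=true",              # the job body does nothing
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected before being run on a cluster.
    print(shlex.join(build_notifier_command("commonname", "me@example.com")))
```

The notifier should be submitted after all the real jobs, since the singleton dependency only waits for jobs that were launched before it.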

Related

Azure batch service add webhook on job completion

I've set up a batch service for media file encoding with ffmpeg. Each job can contain multiple tasks, each task will encode one file. I use the task specific resource- and output-file system, so the batch service automatically fetches and delivers the files from and to the blob storage.
However: how do I know that a job or task has completed or failed?
Since the job can take very long - even more so on low priority nodes - I need some sort of webhook or event. Continuous polling on the job status is not viable.
The options I could think of:
After running the ffmpeg command, chain a curl command. Something like:
"commandLine": "/bin/bash -c \"ffmpeg -i inputFile outputFile && curl https://my-webhook-receiver.org\""
Technically it works, but I'm worried about timing. The curl request is probably(?) done before the batch service pushes the result file back to the blob storage. If it's a big file, and it takes maybe half a minute to upload, I will get notified before the file exists on the output container.
Use the blob storage event system.
This has the advantage that the result file must have arrived. However, what if the job failed? The event would never be triggered in that case.
Batch alert system. You apparently can create alerts for certain batch events (e.g. task completion) and you can hook it up to an action group and finally a webhook. Is that the right call? It feels kinda hacky and not the right way to use this system.
Isn't there a way to connect Azure Batch directly with, for example, Azure Event Grid?
What is the "correct" way to let my server know that the encoded file is ready?
There are a few ways to handle this, although admittedly some of these solutions are not very elegant:
Create a task dependency on each task. The dependent task is the one that invokes the webhook. You can make it such that the dependent task is invoked even if the task it is depending on fails with certain exit codes. You can also create a "merge task" that is dependent on all tasks in the job that can let you know when everything completes.
Use a job manager task instead. Job managers are typically used to monitor progression of a workflow and spawn other tasks, so you would be able to query status of task completion (success or failure) and send your webhook commands via this task or a task spawned by the job manager.
Use job release mechanisms to run actions when a job completes. This does not solve your per-task notification problem, but can be used as a job completion signal.
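The first option (a dependent task that calls the webhook) could look roughly like the sketch below, which builds task bodies as plain dictionaries. The field names (dependsOn, taskIds, exitConditions, dependencyAction) follow my understanding of the Batch REST task schema and should be checked against the current API reference; the helper names, task ids, and file names are hypothetical. As far as I know the job also has to be created with usesTaskDependencies enabled.

```python
import json


def encode_task(task_id, input_file, output_file):
    """Task body for one ffmpeg encode (hypothetical helper)."""
    return {
        "id": task_id,
        "commandLine": f"/bin/bash -c 'ffmpeg -i {input_file} {output_file}'",
        # Assumption: let dependents run even when this task fails,
        # so the webhook fires on failure as well as on success.
        "exitConditions": {"default": {"dependencyAction": "satisfy"}},
    }


def notify_task(task_id, depends_on_ids, webhook_url):
    """Task body that calls the webhook once its dependencies finish."""
    return {
        "id": task_id,
        "commandLine": f"/bin/bash -c 'curl -fsS {webhook_url}'",
        "dependsOn": {"taskIds": list(depends_on_ids)},
    }


if __name__ == "__main__":
    tasks = [
        encode_task("encode-1", "a.mov", "a.mp4"),
        encode_task("encode-2", "b.mov", "b.mp4"),
    ]
    # A single "merge task" that depends on every encode task in the job.
    tasks.append(
        notify_task(
            "notify-all",
            [t["id"] for t in tasks],
            "https://my-webhook-receiver.org",
        )
    )
    print(json.dumps(tasks, indent=2))
```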

Detecting the end of an Azure Batch job

In my application I create an Azure batch job. It's a Node app and I use an azure-batch Node client, but I could also be using REST, I don't think it matters. I can't switch to a C# client, however.
I expect the job to be completed in a few seconds and I wish to pause the code until the batch job is over but I am not sure how to detect the end of the job without polling the Job Status API. Neither the Node client nor the REST API exposes such functionality. I thought I could maybe register for an event of some sort but was not able to find anything like that. There are job release tasks but I am not sure if I can achieve this using them.
Any ideas how the end of an Azure batch job can be detected from within my application?
One way to do this: once you have added your tasks to the job, set the job's onAllTasksComplete property to 'terminatejob'.
Then you can poll the Job-Get API, and check the state property on the job for when the job is complete (https://learn.microsoft.com/en-us/rest/api/batchservice/job/get#jobstate or https://learn.microsoft.com/en-us/javascript/api/azure-batch/job?view=azure-node-latest#get-string--object-).
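A minimal sketch of the polling loop follows. It deliberately takes a get_state callable rather than calling any particular SDK, so the function name, parameters, and timeout values are all assumptions for illustration; in practice get_state would wrap the Job-Get call and return the job's state string.

```python
import time


def wait_for_job(get_state, poll_seconds=2.0, timeout_seconds=60.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns 'completed' or the timeout expires.

    get_state is a hypothetical callable wrapping the Job-Get request.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        state = get_state()
        if state == "completed":
            return state
        if clock() >= deadline:
            raise TimeoutError(f"job still in state {state!r} after timeout")
        sleep(poll_seconds)


if __name__ == "__main__":
    # Stand-in for the real API call: the job completes on the third poll.
    fake_states = iter(["active", "terminating", "completed"])
    print(wait_for_job(lambda: next(fake_states), poll_seconds=0.01))
```

For a job expected to finish in a few seconds, a short poll interval with a hard timeout keeps the number of requests small while still bounding the wait.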

Prevent users from killing each other's jobs in the queue

What settings, or what additional framework, are needed to prevent users from killing each other's Spark jobs that were submitted to the master and are visible on localhost:8080? I want a situation where only an admin or the user who submitted a job can kill it, and no other user.

SQS: Know remaining jobs

I'm creating an app that uses a JobQueue using Amazon SQS.
Every time a user logs in, I create a bunch of jobs for that specific user, and I want him to wait until all his jobs have been processed before taking the user to a specific screen.
My problem is that I don't know how to query the queue to see if there are still pending jobs for a specific user, or what the correct way to implement such a solution is.
Everything regarding the queue (job creation and processing) is working as expected. But I am missing that final step.
Just for the record:
In my previous implementation I was using Redis + Kue and I had created a key with the user ID and the job count. Every time a job was added, that job count was incremented, and every time a job finished or failed, I decremented that count. But now I want to move away from Redis + Kue and I am not sure how to implement this step.
Amazon SQS is not the ideal tool for the scenario you describe. A queueing system is normally used in a "Send and Forget" situation, where the sending system doesn't remain interested in later processing.
You could investigate Amazon Simple Workflow (SWF), which allows work to be monitored as it goes through several processes. Your existing code could mostly be re-used, just with the SWF framework added. Or even power it from Lambda, since you are already using node.js.
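The counting scheme described in the question does not depend on the queue itself, so it can be kept next to SQS. The sketch below holds the counts in process memory purely for illustration; the class and method names are hypothetical, and a real deployment would need a shared store that both the producers and the workers can reach.

```python
import threading


class PendingJobCounter:
    """Per-user count of jobs that were queued but have not finished yet."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def job_added(self, user_id):
        # Call when a job for this user is sent to the queue.
        with self._lock:
            self._counts[user_id] = self._counts.get(user_id, 0) + 1

    def job_finished(self, user_id):
        # Call when a job finishes or fails; never drops below zero.
        with self._lock:
            current = self._counts.get(user_id, 0)
            if current > 0:
                self._counts[user_id] = current - 1

    def pending(self, user_id):
        with self._lock:
            return self._counts.get(user_id, 0)


if __name__ == "__main__":
    counter = PendingJobCounter()
    counter.job_added("user-1")
    counter.job_added("user-1")
    counter.job_finished("user-1")
    print(counter.pending("user-1"))
```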

gearman and retrying workers with unreliable external dependencies

I'm using gearman to queue a variety of different jobs, some which can always be serviced immediately, and some which can "fail", because they require an unreliable external service. (For example, sending email might require an SMTP server that's frequently unavailable.)
If an external service goes down, I'd like to keep all jobs which require that service on the queue, and retry one job occasionally (every few minutes, say) until the service becomes available again. (Perhaps optionally sending email if the service has not been available for hours.)
However I'd like jobs that don't require a failed service to be passed on to workers as soon as possible. How can this be achieved? (I'm happy to put some of the logic in the workers if necessary, although it seems to be a bit "late" to throttle on the worker side.)
Gearman should already handle this, as long as you have some workers that specialise in handling jobs with unreliable dependencies and don't handle other jobs, along with some workers that either do all jobs or only jobs without unreliable dependencies.
All you would need to do is add some code to the unreliable-dependency workers so that they only accept jobs once they have checked that the dependent service is running. If the service is down, have them wait a bit and retest the service (and continue ad infinitum). Once the service is up, have them join the gearmand server, do a job, return the work, retest the service, and so on.
While the dependent service is down, the workers that don't handle jobs that need the service will keep on trundling through the job queue for the other jobs. Gearmand won't block an entire job queue (or worker) on one job type if there are workers available to handle other job types.
The key is to be sensible about how you define your job types and workers.
EDIT--
Ah-ha, I knew my thinking was a little out (I wrote my gearman system about a year ago and haven't really touched it since). My solution to this type of issue was to have all the workers that normally handle the dependent job unregister that capability with the gearmand server once a failure was detected with the dependent service (and any workers that are currently trying to complete that job should return a failure). Once the service is back up, get those same workers to re-register their ability to handle that job. Do note this does require another channel of communication for the workers to be notified of the status of the dependent services.
Hope this helps
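The register/unregister idea from the edit can be sketched as a small state-sync step that a worker runs between jobs. The three callables are hypothetical placeholders: service_is_up would be a health check against the external service, and register and unregister would wrap whatever the Gearman client library in use provides for announcing or withdrawing a job type.

```python
def sync_registration(service_is_up, registered, register, unregister):
    """Bring the worker's registration in line with the service status.

    Returns the new 'registered' flag. All callables are hypothetical
    stand-ins for a health check and the client library's own calls.
    """
    up = service_is_up()
    if up and not registered:
        register()      # service recovered: accept dependent jobs again
        return True
    if not up and registered:
        unregister()    # service failed: stop taking dependent jobs
        return False
    return registered   # nothing changed


if __name__ == "__main__":
    log = []
    statuses = iter([True, False, False, True])
    registered = False
    for _ in range(4):
        registered = sync_registration(
            lambda: next(statuses),
            registered,
            lambda: log.append("register"),
            lambda: log.append("unregister"),
        )
    print(log)
```

Workers for other job types never call this, so they keep draining their part of the queue while the dependent service is down.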
