Slurm requeued jobs are given a BeginTime that lets newer jobs skip queue order when a higher-priority job is quickly cancelled

I'm having an unexpected problem with Slurm when it comes to jobs that are requeued due to preemption.
Job ID 5 is queued with priority 1 and PreemptMode=requeue
Job ID 66 is queued with priority 1 and PreemptMode=requeue
Job ID 777 is queued with priority 3
Job ID 777 is cancelled shortly after being queued. Job ID 66 starts.
What appears to be happening is that when Job ID 5 is requeued, it is given a BeginTime roughly two minutes in the future. If Job ID 777 is cancelled before that BeginTime is reached, Job ID 66 starts, because Job ID 5 is queued for a later time and is not yet eligible to run.
How can I set up Slurm to get around this problem? I want it to respect queue order in situations like this and always start Job ID 5 before Job ID 66.
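For now, if I read the scontrol man page right, I can clear the deferred begin time by hand (job ID 5 used for illustration):
scontrol show job 5                     # shows the begin/eligible time about two minutes out
scontrol update JobId=5 StartTime=now   # make the requeued job eligible again immediately
but I would rather not have to babysit the queue, so a configuration-level fix is what I'm after.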

Related

Django-Q scheduling tasks every X seconds

I am using Django-Q to schedule a simple periodic task that has to run more often than once a minute.
Croniter, used under the hood to parse cron expressions for the scheduler, specifies that cron "seconds" support is available:
https://pypi.org/project/croniter/#about-second-repeats
So I created a cron-type schedule that looks like this:
Schedule.objects.update_or_create(
    name='mondrian_scheduler',
    defaults={'func': 'mondrianapi.tasks.run_scheduler',
              'schedule_type': Schedule.CRON, 'cron': '* * * * * */20'})
Django-Q correctly parses and schedules the job, but the actual interval never drops below 30 seconds (31, in practice), regardless of what the 6th (seconds) field says:
2021-05-12 10:17:08.528307+00:00---run_bot ID 1
2021-05-12 10:17:39.166822+00:00---run_bot ID 1
2021-05-12 10:18:09.899772+00:00---run_bot ID 1
2021-05-12 10:18:40.648140+00:00---run_bot ID 1
2021-05-12 10:19:11.176563+00:00---run_bot ID 1
2021-05-12 10:19:41.857376+00:00---run_bot ID 1
The guard (or sentinel) process is responsible for querying for any scheduled tasks which are due, and it only does this twice per minute:
Scheduler
Twice a minute the scheduler checks for any scheduled tasks that should be starting.
Creates a task from the schedule
Subtracts 1 from django_q.Schedule.repeats
Sets the next run time if there are repeats left or if it has a negative value.
https://django-q.readthedocs.io/en/latest/architecture.html?highlight=scheduler#scheduler
The guard process is also responsible for checking that all of the other processes are still running, so the interval is not exactly thirty seconds.
Unfortunately the scheduler interval is not configurable. If you're comfortable modifying django_q, the relevant code is in django_q/cluster.py, in Sentinel.guard().
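If patching django_q is not an option, one workaround is to keep the schedule at the granularity Django-Q can honour (once a minute) and handle the sub-minute repetition inside the task itself. A minimal sketch, where run_scheduler_burst is a hypothetical wrapper of your own and run_scheduler is the existing task from the question:
# hypothetical wrapper in mondrianapi/tasks.py, next to the existing run_scheduler()
import time

def run_scheduler_burst():
    # do the real work three times, roughly 20 seconds apart, per scheduled run
    for i in range(3):
        run_scheduler()
        if i < 2:
            time.sleep(20)

# register the wrapper once, e.g. from a Django shell or a data migration
from django_q.models import Schedule
Schedule.objects.update_or_create(
    name='mondrian_scheduler',
    defaults={'func': 'mondrianapi.tasks.run_scheduler_burst',
              'schedule_type': Schedule.MINUTES, 'minutes': 1})
The guard still wakes up only about twice a minute, so the outer schedule drifts by a few seconds per cycle, but the 20-second spacing is now controlled by the task rather than by the scheduler.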

Add a job to the SLURM queue with higher priority than previously submitted jobs

I want to submit a job X to a SLURM queue and have it run ahead of other jobs YZ that are already waiting in that queue.
Basically, I want to avoid running scontrol hold YZ manually, or at least find an automated way to scontrol hold YZ when X is submitted and to scontrol release YZ as soon as job X finishes.
Cheers
There is the scontrol top <jobID> command, which puts a job on top of other jobs of the same user ID. But it has to be enabled by the system administrator.
To quote the scontrol man-page:
top job_list
Move the specified job IDs to the top of the queue of jobs belonging to the identical user ID, partition name, account, and QOS. The job_list argument is a comma separated ordered list of job IDs. Any job not matching all of those fields will not be effected. Only jobs submitted to a single partition will be effected. This operation changes the order of jobs by adjusting job nice values. The net effect on that user's throughput will be negligible to slightly negative. This operation is disabled by default for non-privileged (non-operator, admin, SlurmUser, or root) users. This operation may be enabled for non-privileged users by the system administrator by including the option "enable_user_top" in the SchedulerParameters configuration parameter.
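Assuming enable_user_top is set (or you have operator rights), a typical pattern is to capture the new job's ID at submission and move it up immediately; X.sh below is just a placeholder for the batch script of job X:
jobid=$(sbatch --parsable X.sh)   # --parsable prints only the job ID
scontrol top "$jobid"             # move X ahead of your other pending jobs
Because scontrol top only reorders your own pending jobs within the same partition, account, and QOS, jobs YZ never need to be held or released.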

Kubernetes CronJob - Skip job if previous is still running AND wait for the next schedule time

I have scheduled a K8s CronJob to run every 30 minutes.
If the current job is still running when the next scheduled time arrives, it shouldn't create a new job but rather wait for the next schedule.
And it should repeat the same check for as long as the previous job is still in the Running state.
Set the following property to Forbid in the CronJob YAML:
.spec.concurrencyPolicy
https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#concurrency-policy
spec.concurrencyPolicy: Forbid will hold off starting a second job if there is still an old one running. However that job will be queued to start immediately after the old job finishes.
To skip running a new job entirely and instead wait until the next scheduled time, set .spec.startingDeadlineSeconds to be smaller than the cronjob interval (but larger than the max expected startup time of the job).
If you're running a job every 30 minutes and know the job will never take more than one minute to start, set .spec.startingDeadlineSeconds: 60
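Putting the two settings together, a CronJob manifest might look like this sketch (the name, image, and command are placeholders, not taken from the question):
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cron                  # placeholder name
spec:
  schedule: "*/30 * * * *"            # every 30 minutes
  concurrencyPolicy: Forbid           # never start a second job while one is running
  startingDeadlineSeconds: 60         # skip the missed run instead of queueing it
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker            # placeholder container
              image: busybox:1.36     # placeholder image
              command: ["sh", "-c", "echo do-the-work"]   # placeholder command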

Slurm does not allocate the resources and keeps waiting

I'm trying to use our cluster but I have issues. I tried allocating some resources with:
salloc -N 1 --ntasks-per-node=5 bash
but it keeps waiting at:
salloc: Pending job allocation ...
salloc: job ... queued and waiting for resources
or when I try:
srun -N1 -l echo test
it lingers in the waiting queue!
Am I making a mistake, or is there something wrong with our cluster?
It might help to set a time limit for the Slurm job using the option --time, for instance set a limit of 10 minutes like this:
srun --job-name="myJob" --ntasks=4 --nodes=2 --time=00:10:00 --label echo test
Without a time limit, Slurm will use the partition's default time limit. The issue is that this is sometimes set to infinity or to several days, which can delay the start of the job. To check the partitions' time limits, use:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
prod* up infinite 198 ....
gpu* up 4-00:00:00 70 ....
From the Slurm docs:
-t, --time=<time>
Set a limit on the total run time of the job allocation. If the requested time limit exceeds the partition's time limit, the job will be left in a PENDING state (possibly indefinitely). The default time limit is the partition's default time limit. When the time limit is reached, each task in each job step is sent SIGTERM followed by SIGKILL.
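Applying that to the original salloc request, and checking the default (not just the maximum) time limit of each partition, assuming your sinfo supports the %L format field:
salloc -N 1 --ntasks-per-node=5 --time=00:10:00 bash
sinfo -o "%P %l %L"   # partition, maximum time limit, default time limit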

Hill climbing search algorithm stopping criteria for job assignment

Let's say there are 10 jobs and 15 workers. The objective is to assign jobs to workers that can satisfy the jobs' requirements while minimizing total job processing time.
In each iteration, a job is selected randomly and reassigned to the worker with the next-lowest processing time compared with the currently assigned worker. For example, if the currently assigned worker (say, worker 3) has a processing time of 10 and the next-lowest processing time is 8 at worker 5, the job is reassigned to worker 5.
My question is: how do I determine the stopping criterion for the iterations? For the time being, I just set the number of iterations to the number of jobs or to the number of workers.
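There is no single correct stopping rule for hill climbing, but a common choice is to stop once no improving move has been found for some number of consecutive iterations (a "patience" window), with a hard cap as a safety net, rather than tying the count to the number of jobs or workers. A minimal sketch, with all names hypothetical and the cost and neighbour functions supplied by the caller:
def hill_climb(initial_assignment, cost, neighbour, patience=50, max_iters=10_000):
    # Stop when `patience` consecutive iterations bring no improvement,
    # or after `max_iters` iterations as a hard cap.
    best = initial_assignment
    best_cost = cost(best)
    stall = 0
    for _ in range(max_iters):
        candidate = neighbour(best)   # e.g. move one random job to a cheaper worker
        candidate_cost = cost(candidate)
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
            stall = 0                 # improvement found: reset the counter
        else:
            stall += 1
            if stall >= patience:     # no improvement for `patience` moves in a row
                break
    return best, best_cost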
