Create low-priority Slurm jobs that suspend or requeue if another job is submitted

I am currently running a job on my school's HPC cluster, which has 20 compute nodes. I would like to use all of them in a considerate way: if another student needs a compute node, one of my jobs should be paused/suspended and then requeued once the resources become available again. My plan is to submit 10 jobs that each use two nodes, like so:
#!/bin/bash
#SBATCH --job-name=cpu-detect
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --mem=50G
#SBATCH --cpus-per-task=32
#SBATCH --partition=compute
srun conda run -n fires3.7 python detector.py
From what I have seen, most students only request one node, so my thinking is that when such a request is made, one of my jobs would be stopped and requeued, freeing up two nodes for that student, and once the student is done, the stopped job would start again. Is this possible? I could not find much information online.
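If the administrators have preemption enabled on the cluster, the submission side of this usually only needs the jobs marked as requeueable and low priority. A minimal sketch of the extra directives (the --nice value and the low-priority QOS name are assumptions that depend on the site configuration):
#SBATCH --requeue        # allow Slurm to requeue this job if it is preempted
#SBATCH --nice=10000     # lower this job's scheduling priority
# #SBATCH --qos=low      # hypothetical low-priority QOS, if the site defines one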

You can use scontrol suspend jobid to suspend your jobs if your architecture/configuration supports it. Your job also needs to support this.
Slurm supports preemption, the act of "stopping" one or more "low-priority" jobs to let a "high-priority" job run. It all depends on the way the cluster is configured.
From the scontrol man page:
suspend <job_list>
Suspend a running job. The job_list argument is a comma separated list of job IDs. Use the resume command to resume its execution. User processes must stop on receipt of SIGSTOP signal and resume upon receipt of SIGCONT for this operation to be effective. Not all architectures and configurations support job suspension. If a suspended job is requeued, it will be placed in a held state. The time a job is suspended will not count against a job's time limit. Only an operator, administrator, SlurmUser, or root can suspend jobs.
You can resume it with scontrol resume jobid:
resume <job_list>
Resume a previously suspended job. The job_list argument is a comma separated list of job IDs. Also see suspend.
NOTE: A suspended job releases its CPUs for allocation to other jobs. Resuming a previously suspended job may result in multiple jobs being allocated the same CPUs, which could trigger gang scheduling with some configurations or severe degradation in performance with other configurations. Use of the scancel command to send SIGSTOP and SIGCONT signals would stop a job without releasing its CPUs for allocation to other jobs and would be a preferable mechanism in many cases.
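For example, with 123456 as a placeholder job ID (note that, per the quote above, ordinary users may not be permitted to suspend jobs on a given cluster):
scontrol suspend 123456   # pause the job; the CPUs it held become available
scontrol resume 123456    # continue the job once resources are free again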
In my opinion, if your cluster supports suspend, you could indeed write a script on the login node (I am not advising it, because login node resources are shared by all users) that runs in the background and checks the job information (using squeue). If there are any pending jobs, you can send scontrol suspend jobid to your job (or cancel your job using scancel jobid), and then resume it when needed (if the job was suspended).
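A minimal sketch of such a watcher, assuming the site allows you to suspend your own jobs and that the job IDs to manage are known; the partition name, job IDs and polling interval are all placeholders:
#!/bin/bash
# Hypothetical watcher: suspend my jobs while other users have pending jobs,
# and resume them once the queue is clear again.
MYJOBS="1001 1002 1003"   # placeholder IDs of my own running jobs
while true; do
    # count pending jobs in the partition that are not mine
    pending=$(squeue --partition=compute --states=PENDING --noheader | grep -vc "$USER")
    if [ "$pending" -gt 0 ]; then
        for j in $MYJOBS; do scontrol suspend "$j"; done
    else
        for j in $MYJOBS; do scontrol resume "$j"; done
    fi
    sleep 300
done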
But as a responsible user, you don't really need to worry about this: you should only request the resources that you need and run the jobs that are essential. It is the responsibility of the administrators to come up with a fair scheduling policy, for instance by creating different queues (a test queue for small, short jobs; a micro queue for small jobs with long duration; a large queue for large jobs; and so on). Your institute could also enforce other policies to keep things fair, for example that a user cannot submit more than 2 jobs to a queue.
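As an illustration of such a limit (administrator-only, and the QOS name and value here are just placeholders), a per-user job cap can be attached to a QOS with sacctmgr:
# hypothetical: limit each user to 2 jobs under the "normal" QOS
sacctmgr modify qos normal set MaxJobsPerUser=2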

Related

How to represent Activity Diagram with multiple scheduled background jobs?

I would like to know the correct way to visualize background jobs that run on a schedule controlled by a scheduling service.
In my opinion, the correct way should be to have an action representing the scheduling itself, followed by a fork node which splits the flow for each of the respective scheduled jobs.
Example: on one schedule, Service X is supposed to collect data from an API every day; on another schedule, Service Y is supposed to aggregate the collected data.
I've tried to research older threads and find a diagram representing a similar activity.
Your current diagram says that:
first the scheduler does something (e.g. identifying the jobs to launch)
then passes control in parallel to all the jobs it wants to launch
no other jobs are scheduled after the scheduler has finished its task
the first job that finishes interrupts all the others.
The way it should work would be:
first the scheduler is set up
then the setup launches the real scheduler, which will run in parallel to the scheduled jobs
scheduled jobs can finish (flow final, i.e. a circle with an X inside) without terminating everything
the activity would stop (activity final) only when the scheduler is finished.
Note that the UML specification does not prescribe how parallelism is implemented, and neither does your scheduler: whether it is true parallelism using multithreaded or multiple CPUs, or time slicing where interrupts are used to switch between tasks that in reality execute in small sequential pieces, is not relevant for this model.
The remaining challenges are:
the scheduler could launch additional jobs. One way of doing it could be to fork back to itself and to a new job.
the scheduler could launch a variable number of jobs in parallel. A better way to represent this is with a «parallel» expansion region, with the input corresponding to task objects, and actions that consume the tasks by executing them.
if the scheduler runs in parallel to the expansion region, you could also imagine that the scheduler provides additional input (new tasks to be processed) at any moment.

How many celery workers we can spawn on a machine?

Tech stack: celery 5.0.5, flask, python, Windows OS (8 CPUs).
To give some background, my use case requires spawning one worker and one queue per country, as per the request payload.
I am using celery.control.inspect().active() to get the list of active workers and check whether a worker named {country}_worker exists in that list. If not, I spawn a new worker using:
subprocess.Popen(f'celery -A main.celery worker --loglevel=info -Q {queue_name} --logfile=logs\\{queue_name}.log --concurrency=1 -n {worker_name}')
This basically starts a new celery worker and a new queue.
My initial understanding was that we can spawn only n workers, where n is the cpu_count(). With that understanding in mind, while testing my code I assumed that when my 9th worker was spawned it would wait for one of the previous 8 workers to finish execution before taking up a task. Instead, as soon as it was spawned it started consuming from its queue while the other 8 workers were still executing, and the same happened when I spawned more workers (15 workers in total).
This brings me to my question: is the --concurrency argument of a celery worker responsible for parallel execution within that worker? If I spawn 15 independent workers, does that mean 15 different processes can execute in parallel?
Any help is appreciated in understanding this concept.
Edit: I also noticed that each new task received by the corresponding worker spawns a new python.exe process (as per Task Manager), and the previously spawned python process remains in memory, unused. This does not happen when I start the worker with the "solo" pool rather than "prefork". The problem with using solo is that celery.inspect().active() does not return anything while the workers are executing something, and only responds when no tasks are in progress.
If your tasks are I/O bound, and it seems they are, then perhaps you should change the concurrency type to Eventlet. Then you can in theory have concurrency set even to 1000. However, it is a different execution model so you need to write your tasks carefully to avoid deadlocks.
If the tasks are CPU-bound, then I suggest you have concurrency set to N-1, where N is the number of cores, unless you want to overutilise, in which case you can pick a slightly bigger number.
PS. You CAN spawn many worker processes, but since they all run concurrently (as separate processes in this case), their individual CPU utilisation would be low, so it really makes no sense to go above the number of available cores.
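For reference, switching a worker to the eventlet pool is done with the --pool option; a sketch reusing the placeholder names from the question (eventlet has to be installed separately, e.g. with pip install eventlet):
# hypothetical example: one worker using green threads instead of prefork processes
celery -A main.celery worker --pool=eventlet --concurrency=500 -Q {queue_name} -n {worker_name} --loglevel=info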

SLURM QOS Preemption

I was trying to set up preemption in my SLURM 19.05 cluster, but I could not figure out how to make preemption work the way I had planned.
Basically, I have two QOS.
$ sacctmgr show qos format=name,priority,preempt
Name Priority Preempt
---------- ---------- ----------
normal 0
premium 5000 normal
These are the relevant setting in my configuration for preemption:
# SCHEDULING
SelectType=select/cons_res
FastSchedule=1
SelectTypeParameters=CR_CPU_Memory
PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG
PriorityType=priority/multifactor
PriorityWeightFairshare=10000
PriorityWeightAge=10000
PriorityWeightJobSize=10000
PriorityFavorSmall=YES
PriorityWeightQOS=10000
PartitionName=Compute OverSubscribe=FORCE:1 State=UP Nodes=compute01,compute02
My plan was to allow a premium job to preempt a normal job, suspending the normal job until the premium job finishes running in the cluster.
However, the preemption I observed seems to time-slice: it suspends the two jobs in turn every 30 seconds. Is there anything I missed in the configuration files, or can SLURM simply not offer the preemption I was planning, where there is no time slicing of the resources?
The problem is that PreemptMode=SUSPEND,GANG with PreemptType=preempt/qos results in timeslicing.
You must either set PreemptType to preempt/partition_prio, resulting in "suspend and automatically resume the low priority jobs", or set PreemptMode to REQUEUE, where jobs will be aborted and put back in the queue.
As far as I know these are the options closest to what I think you want.
https://slurm.schedmd.com/slurm.conf.html#PreemptMode
GANG
enables gang scheduling (time slicing) of jobs in the same partition. NOTE: Gang scheduling is performed independently for each partition, so configuring partitions with overlapping nodes and gang scheduling is generally not recommended.
REQUEUE
preempts jobs by requeuing them (if possible) or canceling them. For jobs to be requeued they must have the --requeue sbatch option set or the cluster wide JobRequeue parameter in slurm.conf must be set to one.
SUSPEND
If PreemptType=preempt/partition_prio is configured then suspend and automatically resume the low priority jobs. If PreemptType=preempt/qos is configured, then the jobs sharing resources will always time slice rather than one job remaining suspended. The SUSPEND may only be used with the GANG option (the gang scheduler module performs the job resume operation).
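A sketch of the partition_prio variant, using two partitions to separate the workloads; the partition names and PriorityTier values below are placeholders, not taken from the question:
# slurm.conf (hypothetical): partition-based preemption with suspend/resume
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG
PartitionName=normal  Nodes=compute01,compute02 Default=YES OverSubscribe=FORCE:1 PriorityTier=1 State=UP
PartitionName=premium Nodes=compute01,compute02 OverSubscribe=FORCE:1 PriorityTier=2 State=UP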

YARN scheduler: reject application after timeout

I have a cluster on which there's one queue for low priority jobs. These jobs can wait for hours before being executed, it does not matter. The only problem I have is that my applications run under a timeout command to kill any suspiciously long running job. I recently added a new job which takes up the entirety of the queue's capacity and runs for several hours. The behaviour I would like to have is that incoming jobs are rejected after a certain amount of time if no capacity could be allocated to them. This way, they could give up and come back later. I do not want to modify my own timeout thresholds - their semantic is supposed to be how long the job runs for, not how long the whole scheduling + job execution lasted.
I did not see anything like this after some research. Is anybody aware of an existing scheduler allowing that, or of a cheap way to do it using an existing scheduler (like the default CapacityScheduler)?
PS: justification for the apache-spark tag is that it will give this question broader visibility and will have more chance to reach answerers and future readers looking for questions about YARN-Spark.

Gearman callback with nested jobs

I have a gearman job that runs and itself executes more jobs, which in turn may execute more jobs. I would like some kind of callback when all nested jobs have completed. I can easily do this, but my implementations would tie up workers (spinning until children are complete), which I do not want to do.
Is there a workaround? There is no concept of "groups" in Gearman AFAIK, so I can't add jobs to a group and have something fire once that group has completed.
As you say, there's nothing built-in to Gearman to handle this. If you don't want to tie up a worker (and letting that worker add tasks and track their completion for you), you'll have to do out-of-band status tracking.
A way to do this is to keep a group identifier in memcached: increment the number of finished subtasks when a task finishes, and increment the total number of tasks when you add a new one to the same group. You can then poll memcached to see the current state of execution (tasks finished vs. tasks total).
