Requesting specific nodes with TORQUE qsub? - pbs

There's a cluster with TORQUE qsub installed. I want to send a job, but I want to make sure that it runs on one of a specific set of nodes.
Is it possible to request a list of possible nodes in qsub, so that the job is sent to one of the nodes in the requested set, never to a node outside the set?

Using just TORQUE, the way to do this is to add a feature (or property) to each of the nodes in the set and add the feature as part of the job request. For example:
# nodes file entry
node01 fast np=32
# line in job script to request 2 'fast' nodes with 16 execution slots on each
#PBS -l nodes=2:fast:ppn=16
Depending on which scheduler you're using, there may be easier ways to accomplish this task.
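If you administer the cluster, the property can also be attached to a node on the fly with qmgr instead of editing the nodes file. A minimal sketch, assuming you have manager/operator access and reusing node01 and the 'fast' property from the example above:
qmgr -c "set node node01 properties += fast"
# verify the property is now listed for the node
pbsnodes node01 | grep properties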

Related

Slurm: can I create a sub-queue using a subset of resources in a single node?

I have a use case with slurm and I wonder if there is a way to handle it.
Constraints:
I would like to run several jobs (say 60 jobs).
Each one takes a few hours, e.g. 3h/job.
In the cluster managed by slurm, I use a queue with 2 nodes with 4 gpus each (so I can restrict my batch script to one node).
Each job takes 1 gpu.
Problem: if I put everything in the queue, I will block 4 gpus even if I specify only 1 node.
Desired solution: avoid blocking a whole machine by taking, say, 2 gpus only.
How can I put them in the queue without them taking all 4 gpus?
Could I create a kind of sub-queue that would be limited to a subset of a node's resources, for example?
You can use the Slurm consumable trackable resources plug-in (cons_tres, enabled in your slurm.conf file; more info here: https://slurm.schedmd.com/cons_res.html#using_cons_tres) to:
Specify the --gpus-per-task=X
-or-
Bind a specific number of gpus to the task with --gpus=X
-or-
Bind the task to a specific gpu by its ID with --gpu-bind=GPUID
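As a minimal sketch, a batch script along these lines requests a single GPU per job, leaving the node's remaining GPUs free for other jobs (the time limit and the program name ./my_gpu_program are placeholders):
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
#SBATCH --time=03:00:00
# each submission of this script claims exactly one GPU
srun ./my_gpu_program
Submitting this script once per job (or as a job array) lets Slurm pack several single-GPU jobs onto the same node instead of blocking all four GPUs.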

Cores assigned to SLURM job

Let's say I want to submit a Slurm job specifying only the total number of tasks (--ntasks=someNumber), without specifying the number of nodes or the tasks per node. Is there a way to know, within the launched Slurm script, how many cores Slurm has assigned on each of the reserved nodes? I need this info to properly create a machine file for the program I'm launching, which must be structured like this:
node02:7
node06:14
node09:3
Once the job is launched, the only way I figured out to see what cores have been allocated on the nodes is using the command:
scontrol -dd show job <jobid>
Its output contains the above-mentioned info (together with plenty of other details).
Is there a better way to get this info?
The way the srun documentation illustrates creating a machine file is by running srun hostname. To get the output you want, you could run
srun hostname -s | sort | uniq -c | awk '{print $2":"$1}' > $MACHINEFILE
You should check the documentation of your program to see if it accepts a machine file with repeated hostnames rather than a hostname:count format. If so, you can simplify the command to
srun hostname -s > $MACHINEFILE
And of course, the first step is to make sure you actually need a machine file at all, as many parallel programs/libraries have Slurm support and can gather the needed information from the environment variables set up by Slurm at job start.
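Putting it together, a minimal job-script sketch might look like this (mpirun, its -machinefile option, and ./my_program are placeholders for whatever launcher and binary you actually use):
#!/bin/bash
#SBATCH --ntasks=24
# build a node:count machine file from the allocation
MACHINEFILE=machinefile.$SLURM_JOB_ID
srun hostname -s | sort | uniq -c | awk '{print $2":"$1}' > $MACHINEFILE
# hand the machine file to the launcher
mpirun -np $SLURM_NTASKS -machinefile $MACHINEFILE ./my_program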

How to submit jobs across multiple partitions at the same time (Slurm)

After I submitted a job to node/partition cn430 today, I found that the node stays occupied.
After the previous job finished, my job still did not start running due to priority. Then I noticed that all of these jobs have the same prefix, namely 4988443, which is ahead of my job ID 4988560.
It seems that the user has submitted about 1000 jobs together, with the same priority, across multiple partitions.
I am wondering how to implement this.
First off, cn430 really looks like a node rather than a partition. The partition to which it belongs seems to be named shared-gp.
What you see is a job array. It is a way to submit a large number of jobs that differ only in a specific parameter. Each job in the array is scheduled independently, so if you do not request a specific node (e.g. with -w or --nodelist), Slurm will spread them across whichever nodes are available.
Note that job priorities will decay over time if fairshare is enabled, so the jobs that are currently pending will see their priority decrease because of those currently running.
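For reference, a minimal job-array sketch (the array size, script contents, and input-file naming are illustrative assumptions):
#!/bin/bash
#SBATCH --array=1-1000
#SBATCH --partition=shared-gp
# each array task gets its own SLURM_ARRAY_TASK_ID (1, 2, ..., 1000)
./my_program "input_${SLURM_ARRAY_TASK_ID}.dat"
Submitted once with sbatch, this creates 1000 tasks that appear with IDs of the form <arrayjobid>_<index>, which matches the common prefix you observed.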

Getting the load for each job

Where can I find the load (used/claimed CPUs) per job? I know how to get it per host using sinfo, but that does not directly tell me which job causes a possibly 'incorrect' load (anything other than 1).
(I want to get this for all jobs, i.e. logging in to the node and running top is not my objective.)
You can use
sacct --format='jobid,ReqCPUS,elapsed,AveCPU'
and compare Elapsed with AveCPU. The latter will only be available for job steps, not for the whole job.
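For example, to inspect a single job (the job ID 1234567 is a placeholder; TotalCPU is an optional extra field):
sacct -j 1234567 --format='JobID,ReqCPUS,Elapsed,AveCPU,TotalCPU'
A step whose AveCPU is much lower than its Elapsed time is the likely cause of a load well below 1.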

How does PBS_NODEFILE work in pbs?

For example, I have a file /home/user/nodes, which contains:
node1
node2
node3
node4
...
When I try to submit a job like:
qsub -v PBS_NODEFILE=/home/user/nodes -l nodes=2
Does it mean that PBS will select 2 nodes from the /home/user/nodes list? I tried, but it did not. It still chose the nodes from $PBS_HOME/server_priv/nodes, which is the default configuration.
I really want PBS to choose nodes from my own nodefile; is there a good way to do this?
Thanks.
With a queueing system, the idea is that you let the system assign nodes to your job when it runs, so your own 'nodes' file will never be used. At run time, the variable PBS_NODEFILE is set to the location of a file, created for your job, that contains the list of assigned nodes.
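A minimal sketch of how $PBS_NODEFILE is normally consumed inside a job script (mpirun, its -machinefile option, and ./my_program are placeholders):
#PBS -l nodes=2:ppn=16
# $PBS_NODEFILE is created by TORQUE at run time, one line per assigned execution slot
cat $PBS_NODEFILE
# count the allocated slots
NPROCS=$(wc -l < $PBS_NODEFILE)
mpirun -np $NPROCS -machinefile $PBS_NODEFILE ./my_program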
