Is there a simple way to get SLURM to print out, for a given user, the number of jobs of each status (e.g., running, pending, completed, failed, etc.)?
One way to get that information is with:
squeue -u $USER -o%T -ST | uniq -c
The -u argument will filter jobs for the specific user, the -o%T argument will only output the job state, and the -S argument will sort them. Then uniq -c will do the counting.
Example output:
$ squeue -u $USER -o%T -ST | uniq -c
147 PENDING
49 RUNNING
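Note that squeue only lists jobs that are still in the queue, so completed or failed jobs will not show up in that count. Below is a minimal sketch that counts whatever list of states it is given, one per line. The function name count_states is only illustrative, and the sacct call in the comment assumes that Slurm accounting is enabled on the cluster.

```shell
# count_states: read one job state per line on stdin and print "<count> <STATE>".
count_states() {
    # NF skips blank lines; $1 drops any padding around the state column.
    awk 'NF { print $1 }' | sort | uniq -c | awk '{ print $1, $2 }'
}

# Queued jobs only (-h suppresses the STATE header line):
#   squeue -h -u "$USER" -o %T | count_states
# Including finished jobs, assuming Slurm accounting is available:
#   sacct -n -X -u "$USER" -o State | count_states
```

The final awk only normalizes the padding that uniq -c puts in front of the counts, so the output is easy to parse in a script.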
Often I need to cancel all jobs created after a certain time or job id. Is there some syntax like scancel -j -gt <myid> or scancel -j $(< some call to sacct to get jobs after time "T">)?
Turns out that "> job id" is simple:
squeue -u $USER | awk '{if (NR!=1 && $1 > 8122014) {print $1}}' | xargs -n 1 scancel
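A variant of the same idea, written as a filter so the selection can be previewed before anything is cancelled. This is a sketch under two assumptions: the helper name ids_after is made up for illustration, and job array elements (IDs such as 8122016_3) should be compared by the part before the underscore.

```shell
# ids_after: read job IDs on stdin, print those whose base ID is above the threshold.
ids_after() {
    awk -v min="$1" '{
        split($1, parts, "_")            # "8122016_3" -> base ID "8122016"
        if (parts[1] + 0 > min + 0) print $1
    }'
}

# Preview first, then append "| xargs -n 1 scancel" once the list looks right:
#   squeue -h -u "$USER" -o %i | ids_after 8122014
```

Using -h and -o %i makes squeue print bare job IDs, so the NR!=1 check for the header is no longer needed.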
Which command is correct for checking the jobs currently in running status on a specific agent in Autosys?
1 ) Autorep -I machine_name | find RU
OR
2 ) Autorep -j ALL - M machine_name | find RU
The command below returns the list of jobs running on a particular machine/agent.
autorep -M <machine/agent_name> -d
example:
autorep -M host123#some.com -d
Output Columns:
JobName | Machine | Status | Load | Priority
If executing from a Linux server, use grep or awk as needed to reformat the report.
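As a rough sketch of such a filter: the function below keeps only the lines whose status is RU and prints the job name and machine. It assumes the job lines are whitespace-separated in the column order listed above (JobName, Machine, Status, Load, Priority). The real report layout may differ between Autosys versions, so check the field numbers against your own output. The name running_jobs is hypothetical.

```shell
# running_jobs: keep report lines whose third whitespace-separated field is RU
# and print the job name and machine. The column positions are an assumption.
running_jobs() {
    awk '$3 == "RU" { print $1, $2 }'
}

# Usage on a Linux server:
#   autorep -M <machine/agent_name> -d | running_jobs
```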
I have a bunch of job arrays that are running right now (SLURM).
For example, 2552376_1, 2552376_10, 2552376_20, 2552376_80, 2552377_1, 2552377_10, 2552377_20, 2552377_80 and so on.
Currently, I am only interested in those that end with _1.
Is there any way to hold all others without specifying job ids (because I have several hundreds of them)?
The following command works for holding all the jobs:
squeue -r -t PD -u $USER -o "scontrol hold %i" | tail -n +2 | sh
To release the ones with the needed ID I use
squeue -r -u $USER -o "scontrol release %i" | tail -n +2 | grep "_1$" | sh
which picks correct jobs.
Mass update of jobs can be done by abusing the output formatting of squeue:
Hold all your pending jobs:
squeue -h -r -t PD -u $USER -o "scontrol hold %i" | sh
then release all your jobs ending in _1
squeue -r -t PD -u $USER -o "scontrol release %i" | grep "_1$" | sh
First run the commands without the | sh part to make sure it is working the way intended.
Note the -r option to display one job array element per line.
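The two steps can also be collapsed into one by generating hold commands only for the array elements that do not end in the wanted suffix. A minimal sketch, assuming one job ID per line on stdin; hold_cmds_except is a made-up helper name, and it only prints the commands so they can be checked before being piped to sh.

```shell
# hold_cmds_except: read job IDs on stdin and print a "scontrol hold" command
# for every ID that does NOT end with the given suffix (for example "_1").
hold_cmds_except() {
    awk -v suf="$1" '{
        n = length(suf)
        # Compare the last n characters of the ID with the suffix.
        if (substr($1, length($1) - n + 1) != suf) print "scontrol hold " $1
    }'
}

# Dry run (append "| sh" once the printed commands look right):
#   squeue -h -r -t PD -u "$USER" -o %i | hold_cmds_except _1
```

Because the comparison is a plain suffix match, 2552376_1 is skipped while 2552376_10 and 2552376_21 are held.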
I am trying to hold all jobs submitted from my account. However, scontrol hold only takes a job array, and I have many arrays. Is there an alternative command like scancel -u user?
Edit1:
If iterating over all job IDs is the only way, this is my method:
squeue -h -u user | awk '{print $1;}' | while read jobid; do scontrol hold $jobid; done
While piping formatted text to sh is clever, I would probably do something like this:
squeue -u <user> --format "%i" --noheader | xargs scontrol hold
or
sacct --allocation --user=<user> --noheader --format=jobid | xargs scontrol hold
If you wanted to filter by state, you could do that as well:
squeue -u <user> --format "%i" --noheader --states=PENDING | xargs scontrol hold
or
sacct --allocation --user=<user> --noheader --format=jobid --state=PENDING | xargs scontrol hold
source: Slurm man pages
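As far as I can tell, the scontrol man page describes the argument of hold as a comma-separated job list, so another option is to join the IDs into one list before the call. A small sketch; join_ids is a hypothetical helper name.

```shell
# join_ids: turn one job ID per line into a single comma-separated list.
join_ids() {
    # NF skips blank lines; $1 strips any padding around the ID.
    awk 'NF { print $1 }' | paste -s -d, -
}

# Usage (a single scontrol call for all of the user's pending jobs):
#   ids=$(squeue -h -u "$USER" -t PD -o %i | join_ids)
#   [ -n "$ids" ] && scontrol hold "$ids"
```

The emptiness check avoids calling scontrol hold with no argument when the user has no matching jobs.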
An often-used method is to (ab)use the formatting possibilities of squeue to build the scontrol line:
squeue -u user --format "scontrol hold job %i"
and then pipe that into a shell:
squeue -u user --format "scontrol hold job %i" | sh
If I know the name of a job I have run, how can I return only its jobID from a script?
For example, running sacct --name run.sh returns the following output, from which I want to return only 50 (the jobID).
$ sacct --name run.sh
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
50               run.sh      debug      alper          1  COMPLETED      0:0
50.batch          batch                 alper          1  COMPLETED      0:0
As a solution I can run sacct --name run.sh | head -n3 | tail -n1 | awk '{print $1}', which returns 50, but the order of 50 and 50.batch sometimes changes for other jobs.
Use the following combination of options:
sacct -n -X --format jobid --name run.sh
where
-n will suppress the header
-X will suppress the .batch part
--format jobid will only show the jobid column
This will output only the jobid, but if several jobs correspond to the given job name, you will get several results.
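If a single ID is needed when several jobs share the name, one option is to keep only the largest one. This is a sketch that assumes job IDs grow over time on your cluster, so the largest ID is the most recently submitted job; latest_jobid is an illustrative name, not a Slurm command.

```shell
# latest_jobid: read job IDs on stdin and print the numerically largest one.
latest_jobid() {
    # $1 strips the padding sacct adds around the column; NF skips blank lines.
    awk 'NF { print $1 }' | sort -n | tail -n 1
}

# Usage:
#   sacct -n -X --format jobid --name run.sh | latest_jobid
```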