one-liner to cancel all sbatch jobs after a certain time or job id? - slurm

Often I need to cancel all jobs created after a certain time or job ID. Is there some syntax like scancel -j -gt <myid>, or scancel -j $(< some call to sacct to get jobs after time "T" >)?

Turns out that the "> job ID" case is simple (8122014 is the cutoff job ID; NR!=1 skips the header line):
squeue -u $USER | awk '{if (NR!=1 && $1 > 8122014) {print $1}}' | xargs -n 1 scancel
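For the "after time T" part of the question, squeue can also print each job's submission time with the %V format specifier (ISO 8601, so plain string comparison works). A sketch along the same lines, assuming your squeue supports %V; the cutoff timestamp is just an example value:

```shell
# Cancel all of my jobs submitted after a cutoff time.
# %i = job id, %V = submit time (ISO 8601, so string comparison works).
cutoff="2023-05-01T12:00:00"
squeue -u "$USER" --noheader -o "%i %V" \
  | awk -v t="$cutoff" '$2 > t {print $1}' \
  | xargs -r -n 1 scancel
```

Run it without the final xargs stage first to check which job IDs it selects.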

Related

Easy way to hold/release jobs by job array task id in slurm

I have a bunch of job arrays that are running right now (SLURM).
For example, 2552376_1, 2552376_10, 2552376_20, 2552376_80, 2552377_1, 2552377_10, 2552377_20, 2552377_80 and so on.
Currently, I am only interested in the ones that end with _1.
Is there any way to hold all others without specifying job ids (because I have several hundreds of them)?
The following command works for holding all the jobs:
squeue -r -t PD -u $USER -o "scontrol hold %i" | tail -n +2 | sh
For releasing the one with needed id I use
squeue -r -u $USER -o "scontrol release %i" | tail -n +2 | grep "_1$" | sh
which picks the correct jobs.
Mass update of jobs can be done by abusing the output formatting of squeue:
Hold all your pending jobs:
squeue -r -t PD -u $USER -o "scontrol hold %i" | sh
then release all your jobs ending in _1
squeue -r -t PD -u $USER -o "scontrol release %i" | grep "_1$" | sh
First run the commands without the | sh part to make sure it is working the way intended.
Note the -r option to display one job array element per line.
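A variant of the same idea (a sketch, not from the answers above): print only the job IDs and filter with grep before handing them to scontrol via xargs, which avoids generating shell commands entirely. The -r flag tells xargs to do nothing if the list comes out empty:

```shell
# Hold every pending array element EXCEPT those ending in _1.
squeue -r -t PD -u "$USER" --noheader -o "%i" \
  | grep -v "_1$" \
  | xargs -r -n 1 scontrol hold
```

As with the | sh approach, drop the xargs stage first to see exactly which IDs will be held.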

scontrol all jobs in user account

I am trying to hold all jobs submitted from my account. However, scontrol hold only takes a job ID or array ID, and I have many arrays. Is there an alternative command, like scancel -u user?
Edit1:
If iterating over all job IDs is the only way, this is my method:
squeue -u user | awk 'NR>1 {print $1}' | while read jobid; do scontrol hold $jobid; done
While piping formatted text to sh is clever, I would probably do something like this:
squeue -u <user> --format "%i" --noheader | xargs scontrol hold
or
sacct --allocation --user=<user> --noheader --format=jobid | xargs scontrol hold
If you wanted to filter by state, you could do that as well:
squeue -u <user> --format "%i" --noheader --states=PENDING | xargs scontrol hold
or
sacct --allocation --user=<user> --noheader --format=jobid --state=PENDING | xargs scontrol hold
source: Slurm man pages
An often-used method is to (ab)use the formatting possibilities of squeue to build the scontrol command line:
squeue -u user --format "scontrol hold job %i"
and then pipe that into a shell:
squeue -u user --format "scontrol hold job %i" | sh

List number of jobs of each status

Is there a simple way to get SLURM to print out, for a given user, the number of jobs of each status (e.g., running, pending, completed, failed, etc.)?
One way to get that information is with:
squeue -u $USER -o%T -ST | uniq -c
The -u argument filters jobs for the specific user, -o%T outputs only the job state, and -S T sorts by state (sorting is required so uniq can group identical states). Then uniq -c does the counting.
Example output:
$ squeue -u $USER -o%T -ST | uniq -c
147 PENDING
49 RUNNING
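Note that squeue only sees jobs that are still pending or running; for counts that include completed and failed jobs, a similar pipeline over sacct should work (a sketch; -X restricts output to the job allocation itself rather than individual job steps, and sort is needed because sacct output is not grouped by state):

```shell
# Count jobs per state for the current user, including finished ones.
sacct -u "$USER" -X --noheader -o State \
  | sort | uniq -c
```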

qsub array job delay

#!/bin/bash
#PBS -S /bin/bash
#PBS -N garunsmodel
#PBS -l mem=2g
#PBS -l walltime=1:00:00
#PBS -t 1-2
#PBS -e error/error.txt
#PBS -o error/output.txt
#PBS -A improveherds_my
#PBS -m ae
set -x
c=$PBS_ARRAYID
nodeDir=`mktemp -d /tmp/phuong.XXXXX`
cp -r /group/dairy/phuongho/garuns $nodeDir
cp /group/dairy/phuongho/jo/parity1/my/simplex.bin $nodeDir/garuns/simplex.bin
cp /group/dairy/phuongho/jo/parity1/nttp.txt $nodeDir/garuns/my.txt
cp /group/dairy/phuongho/jo/parity1/delay_input.txt $nodeDir/garuns/delay_input.txt
cd $nodeDir/garuns
module load gcc vle
XXX=`pwd`
sed -i "s|/group/dairy/phuongho/garuns/out|$XXX/out/|" exp/garuns.vpz
awk -v i="$c" 'NR == 1 || $8==i' my.txt > simplex-observed.txt
awk -v i="$c" 'NR == 1 || $7==i {print $6}' delay_input.txt > afm_param.txt
cp "/group/dairy/phuongho/garuns_param.txt" "$nodeDir/garuns/garuns_param.txt"
while true
do
./simplex.bin &
sleep 5m
done
awk 'NR >1' < simplex-optimum-output.csv>> /group/dairy/phuongho/jo/parity1/my/finalresuls${c}.csv
cp simplex-all-output.csv "/group/dairy/phuongho/jo/parity1/my/simplex-all-output${c}.csv"
#awk '$28==1{print $1, $12,$26,$28,c}' c=$c out/exp_tempfile.csv > /group/dairy/phuongho/jo/parity1/my/simulated_my${c}.csv
cp /out/exp_tempfile.csv /group/dairy/phuongho/jo/parity1/my/exp_tempfile${c}.csv
rm simplex-observed.txt
rm garuns_param.txt
I have the above bash script, which submits multiple jobs at the same time via PBS_ARRAYID. My issue is that when my model (simplex.bin) executes, it writes something to my home directory. If only one job runs at a time, or each job waits until the previous one has finished writing to home, it is fine. However, since I want >1000 jobs running at once, 1000 of them try to write the same files to home, which leads to a crash.
Is there a smart way to submit the second job only after the first one has been running for a certain amount of time (let's say 5 minutes)?
I already checked and found two options: start the 2nd job when the 1st finishes, or start at a specific date/time.
Thanks
You can try something like the following:
while [ yes ]
do
./simplex.bin &
sleep 2
done
It endlessly starts a ./simplex.bin process in the background, waits for 2 seconds, starts a new ./simplex.bin, and so on.
Please note that, depending on your exact requirements, you may also need nohup and standard input/output redirection for ./simplex.bin.
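If the endlessness of that loop is a problem (it never stops spawning processes), a bounded variant starts a fixed number of staggered workers and then blocks until all of them exit. This is a sketch; NUM_WORKERS and the 300-second stagger are made-up values for illustration:

```shell
#!/bin/bash
# Sketch: launch NUM_WORKERS copies of simplex.bin, 5 minutes apart,
# then block until every one of them has exited.
NUM_WORKERS=4          # hypothetical worker count
STAGGER_SECONDS=300    # 5-minute gap between starts

for i in $(seq 1 "$NUM_WORKERS"); do
    ./simplex.bin &
    sleep "$STAGGER_SECONDS"
done
wait    # returns only after all background jobs have finished
```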
If you are using Torque, you can set a limit on the number of jobs that can run concurrently:
# Only allow 100 jobs to concurrently execute from this job array
qsub myscript.sh -t 0-10000%100
I know this isn't exactly what you're looking for, but I'm guessing you can find a slot limit that'll make it run without crashing.
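A different angle on the underlying problem (an assumption about the setup, not from the answers above): if the crash comes from 1000 jobs writing the same files in $HOME at once, each job can serialize just that section with a mkdir-based lock, since mkdir is atomic even on a shared filesystem. LOCKDIR is a hypothetical path:

```shell
#!/bin/bash
# Sketch: serialize the home-directory writes across array elements
# with an atomic mkdir lock. LOCKDIR is a hypothetical path.
LOCKDIR="$HOME/.simplex.lock"

# Spin until we acquire the lock; mkdir fails if the directory exists.
until mkdir "$LOCKDIR" 2>/dev/null; do
    sleep 5
done

# ... critical section: the writes to $HOME go here ...

rmdir "$LOCKDIR"   # release the lock for the next job
```

This lets all 1000 jobs run concurrently while only the conflicting writes are forced to take turns.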

How to terminate a job dispatcher running in the background in Linux?

I have a job dispatcher bash shell script containing below codes:
for (( i=0; i<$toBeDoneNum; i=i+1 ))
do
    while true
    do
        # [C]hecking instead of Checking so grep does not count itself
        processNum=`ps aux | grep -c "[C]hecking"`
        if [ $processNum -lt $maxProcessNum ]; then
            break
        fi
        echo "Too many processes: Max process is $maxProcessNum."
        sleep $sleepSec
    done
    java -classpath ".:./conf:./lib/*" odx.comm.cwv.main.Checking $i
done
I run the script like this to be in the background:
./dispatcher.sh &
I want to terminate this dispatcher process with kill -9, but I didn't record its PID when I started it. I used jobs to show all jobs, but it shows nothing. Even fg cannot bring the process to the foreground:
fg
bash: fg: current: no such job
But I think the dispatcher process is still running, because it keeps launching the java program. How should I terminate this job dispatcher script?
Edit: I used jobs, jobs -l, jobs -r and jobs -s. Nothing showed.
create test.sh with content
sleep 60
then
jobs -l | grep 'test.sh &' | grep -v grep | awk '{print $2}'
This gives me the process ID on Ubuntu and OS X (note that jobs only lists jobs started from the current shell).
You can assign it to a variable and then kill it:
pid=`jobs -l | grep 'test.sh &' | grep -v grep | awk '{print $2}'`
kill -9 $pid
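The simplest fix going forward (a sketch, not in the original answer) is to record the PID with $! at launch time, so no jobs/grep digging is needed later; dispatcher.pid is just an illustrative filename:

```shell
# Launch the dispatcher and remember its PID in a file.
./dispatcher.sh &
echo $! > dispatcher.pid

# Later, from any shell:
kill "$(cat dispatcher.pid)"       # try SIGTERM first
# kill -9 "$(cat dispatcher.pid)"  # only if it refuses to die
```

Note that killing the dispatcher does not stop a java process it has already launched; that has to be killed separately.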
