How can I see in Slurm the details of all the jobs per user

I want to see the details of all jobs of a user.
I know that I can do the following:
scontrol show job
and then I can see all the details of all the jobs of all the users.
But I am searching for something like this:
scontrol show job UserId=Jon
Thanks.

One way to do that is to use squeue with the formatting option to build the command line and pipe that into a shell:
squeue --user Jon --format "scontrol show job %i" | sh
You can then use all the filtering options of squeue like per partition, per state, etc.
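For instance, a sketch combining such filters (the partition name debug is hypothetical), using --noheader so that the header line is not fed to the shell:
squeue --user Jon --partition debug --states PENDING --noheader --format "scontrol show job %i" | sh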

Related

SLURM get job id of last run jobs

Is there a way to get the last x SLURM job IDs of (finished) jobs for a certain user (me)?
Or maybe all job IDs run in the last x hours?
(My use case is that I want to get some metrics via sacct, but ideally I don't want to parse them from output files etc.)
For next time it may be advisable to plan this in advance, like here…
To strictly answer the question, you can use sacct like this:
sacct -X --start now-3hours -o jobid
This will list the IDs of the jobs that started within the past 3 hours.
But then, if what you want is to feed those job IDs to sacct to get metrics, you can directly add the metrics to the -o option, or remove that -o option altogether.
Also, the -X is there to have one line per job, but memory-related metrics are stored per job step, so you might want to remove it at some point to get one line per job step instead.
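For instance, a sketch that drops -X so job steps are listed (MaxRSS is recorded per step) and requests some timing and memory fields directly (the field list is just an example; see sacct --helpformat for the full list):
sacct --start now-3hours -o jobid,jobname,elapsed,maxrss,state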

Cancel jobs submitted previous to a date or with JOBID lower than a given integer

I have realized that the jobs submitted with a previous version of my software are useless because of a bug, so I want to cancel them. However, I also have newer jobs that I would like to keep running. All the jobs have the same job name and are running in the same partition.
I have written the following script to cancel the jobs with an ID lower than a given one.
#!/bin/bash
if [ -n "$1" ]
then
    MAX_JOBID=$1
else
    echo "An integer value is needed"
    exit 1
fi
# --noheader keeps the header line out of the list so the numeric comparison works
JOBIDLIST=$(squeue -u "$USER" --noheader -o "%F")
for JOBID in $JOBIDLIST
do
    if [ "$JOBID" -lt "$MAX_JOBID" ]
    then
        echo "Cancelling job $JOBID"
        scancel "$JOBID"
    fi
done
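For instance, assuming the script above is saved as cancel_older.sh (a hypothetical name), all of my jobs with an ID lower than 294724 would be cancelled with:
bash cancel_older.sh 294724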
I would say that this is a recurrent situation for someone developing software, and I wonder if there is a direct way to do it using Slurm commands. Alternatively, do you use some tricks, like appending the software commit ID to the job name, to overcome this kind of situation?
Unfortunately there is no direct way to cancel jobs in such scenarios.
Alternatively, like you pointed out, naming the job with the software version/commit appended is useful. In that case you can use scancel --name=JOB_NAME_VERSION to cancel all the jobs with that job name.
Also, newly submitted jobs can be held using scontrol hold <jobid>, and then all PENDING jobs can be cancelled using scancel --state=PENDING.
In my case, I used a similar approach (like yours) by piping the output of squeue to awk and cancelling the first N jobs I wanted to remove. It's a one-liner.
Something like this:
e.g.: squeue <arguments> | awk 'NR>=2 && NR<=N{print $1}' | xargs /usr/bin/scancel
In addition to the suggestions by j23, you can organise your jobs with:
job arrays: if all your jobs are similar in terms of submission script, make them a job array, and submit one job array per version of your software. Then you can cancel an entire job array with a single scancel command (see the sketch after this list).
a workflow management system: these make it easy to submit and manage sets of jobs (possibly on different clusters), for instance:
Fireworks https://materialsproject.github.io/fireworks/
Bosco https://osg-bosco.github.io/docs/
Slurm pipelines https://github.com/acorg/slurm-pipeline
Luigi https://github.com/spotify/luigi
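As a sketch of the job-array idea (the job name, array size and program below are hypothetical), the software version goes into the job name so the whole batch can be removed at once:
#!/bin/bash
#SBATCH -J mysim_v2.1          # hypothetical job name encoding the software version
#SBATCH --array=0-99           # one array with 100 tasks instead of 100 separate jobs
srun ./my_simulation input_${SLURM_ARRAY_TASK_ID}.dat
The entire array (or every job carrying that name) can then be cancelled with a single command:
scancel <array_job_id>
scancel --name=mysim_v2.1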

How to find the command used for a Slurm job based on job ID?

After submitting a slurm job using sbatch file.slurm, you get a job ID. You can use squeue and sacct to check the job's status. But neither returns the original submission command (sbatch file.slurm) for the job. Is there a command to show the submission command, namely sbatch file.slurm? I need to link job IDs with my submission commands.
So far, the only way I have found is to save the output of the sbatch command somewhere.
No, there is no command to show the submission command. One workaround is to set the job name to the file name.
#SBATCH -J "file_name"
So when you run squeue or scontrol show job, you can match your job ID with the file name.
There is no other way to achieve the desired objective.
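A sketch of the workaround mentioned in the question, i.e. saving the return of the sbatch command (the log file name is hypothetical); sbatch --parsable prints only the job ID, which makes it easy to record alongside the command:
jobid=$(sbatch --parsable file.slurm)
echo "$(date +%F_%T) ${jobid} sbatch file.slurm" >> ~/submitted_jobs.log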

Hold several jobs in Slurm

I know that for a specific job ID, I can use scontrol hold $JOBID.
How can I hold jobs for several IDs and/or for a range of job IDs (e.g. scontrol hold 294724-294749)?
Also, how can I hold jobs based on my $USER?
First off, if all your jobs have the same name, you can use
scontrol hold <jobname>
to hold them all.
But the scontrol command accepts a list of job IDs, which can be either space- or comma-separated. So if your jobs have consecutive job IDs, you can use Bash's {1..n} (Brace expansion) construct to generate the list and feed it to scontrol:
scontrol hold {294724..294749}
Otherwise, one common idiom is to use squeue's output formatting capabilities to generate scontrol commands and feed them to a shell:
squeue --user $USER --format "scontrol hold %i" | sh
(When doing that, it is wise to first run the squeue command without piping to sh to review its output before running it again through sh)
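The same pattern can later be used to release the held jobs, e.g. a sketch restricted to the pending jobs of the current user:
squeue --user $USER --states PENDING --noheader --format "scontrol release %i" | sh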

SLURM: When we reboot the node, does jobID assignments start from 0?

For example:
sacct --start=1990-01-01 -A user returns a job table whose latest job ID is 136, but when I submit a new job with sbatch -A user -N1 run.sh, it reports Submitted batch job 100, and 100 is smaller than 136. And it seems like sacct -L -A user returns a list which ends with 100.
So it seems like submitted batch jobs overwrite previous jobs' information, which I don't want.
[Q] When we reboot the node, do job ID assignments start from 0? If yes, what should I do to continue from the latest job ID assigned before the reboot?
Thank you for your valuable time and help.
There are two main reasons why job IDs might be recycled:
the maximum job ID was reached (see MaxJobId in slurm.conf)
the Slurm controller was restarted with FirstJobId set to a new value
Other than that, Slurm will always increase the job IDs.
Note that the job information in the database is not overwritten; each record has a unique internal ID which is different from the job ID. sacct has a -D, --duplicates option to view all jobs in the database. By default, it only shows the most recent one among all those that share the same job ID.
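For example, a sketch showing every accounting record that ever carried job ID 100 (the fields chosen are just an illustration):
sacct --duplicates -j 100 -o jobid,jobname,submit,state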
