SLURM get job id of last run jobs

Is there a way to get the last x SLURM job ids of (finished) jobs for a certain user (me)?
Or maybe all job IDs run in the last x hours?
(My use case is that I want to get some metrics via sacct, but ideally I don't want to parse them from output files etc.)
Next time it may be advisable to plan this in advance, like here…

To strictly answer the question, you can use sacct like this:
sacct -X --start now-3hours -o jobid
This will list the IDs of your jobs that were in any state (for example still running, or finished) during the past 3 hours.
But then, if what you want is to feed those job IDs to sacct to get metrics, you can directly add the metrics to the -o option, or remove that -o option altogether.
Also, the -X is there to have one line per job, but memory-related metrics are stored per job step, so you might want to remove it at some point to get one line per job step instead.
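As a rough sketch of the two uses described above, the helpers below wrap sacct. The function names, the default window of 24 hours, and the chosen fields (elapsed, maxrss) are assumptions made for illustration; -n drops the header and -P produces pipe-delimited output so the result is easy to post-process.

```shell
# Hypothetical helpers; names and defaults are illustrative only.

# Print the last N job IDs known to the accounting database within the
# past HOURS hours (default 24). -X gives one line per job.
# Assumes sacct lists jobs in ascending order, which is worth checking
# on your own cluster.
last_job_ids() {
    sacct -X -n -P --starttime "now-${2:-24}hours" -o jobid | tail -n "$1"
}

# Print per-step metrics for the same window. -X is omitted on purpose,
# because memory metrics such as MaxRSS are recorded per job step.
recent_job_metrics() {
    sacct -n -P --starttime "now-${1:-24}hours" -o jobid,jobname,elapsed,maxrss,state
}
```

For example, `last_job_ids 5 3` would print up to five job IDs from the past three hours, and `recent_job_metrics 3` would print the step-level metrics for that window.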

Related

Slurm - job name, job ids, how to know which job is which?

I often run many jobs on Slurm. Some finish faster than others. However, it is always hard to keep track of which job is which. Can I give custom job names in Slurm? If so, what is the option in the batch script? Would that show up when I do squeue --me?
The parameter is --job-name (or -J), for instance:
#SBATCH --job-name=exp1_run2
The squeue output will list exp1_run2 for the corresponding job ID under column NAME.
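The same option can also be given on the sbatch command line, where it takes precedence over the #SBATCH directive. A minimal sketch, assuming a naming scheme of experiment label plus run label (the function names and the scheme are hypothetical):

```shell
# Hypothetical wrapper: submit a batch script under a name built from
# an experiment label ($1) and a run label ($2); $3 is the script.
submit_named() {
    sbatch --job-name="${1}_${2}" "$3"
}

# List job ID, name, and state for the current user's jobs that carry
# a given name ($1).
jobs_by_name() {
    squeue --me --name="$1" --format="%i %j %T"
}
```

With these, `submit_named exp1 run2 job.sh` followed by `jobs_by_name exp1_run2` would show only the matching jobs.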

Cancel jobs submitted previous to a date or with JOBID lower than a given integer

I have realized that the jobs submitted with a previous version of my software are useless because of a bug, so I want to cancel them. However, I also have newer jobs that I would like to keep running. All the jobs have the same job name and are running in the same partition.
I have written the following script to cancel the jobs with an ID lower than a given one.
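#!/bin/bash
# Cancel all of my jobs whose ID is lower than the given integer.
if [ -n "$1" ]
then
    MAX_JOBID=$1
else
    echo "An integer value is needed"
    exit 1
fi
# -h suppresses the header line so that only job IDs are compared
JOBIDLIST=$(squeue -h -u "$USER" -o "%F")
for JOBID in $JOBIDLIST
do
    if [ "$JOBID" -lt "$MAX_JOBID" ]
    then
        echo "Cancelling job $JOBID"
        scancel "$JOBID"
    fi
done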
I would say that this is a recurrent situation for someone developing software, and I wonder if there is a direct way to do it using Slurm commands. Alternatively, do you use tricks like appending the software commit ID to the job name to handle this kind of situation?
Unfortunately, there is no direct way to cancel jobs in such scenarios.
Alternatively, as you pointed out, adding the software version or commit to the job name is useful. In that case you can use scancel --name=JOB_NAME_VERSION to cancel all the jobs with that job name.
Also, newly submitted jobs can be held using scontrol hold <jobid>, and all the PENDING jobs can then be cancelled using scancel --state=PENDING.
In my case, I used an approach similar to yours: I piped the output of squeue to awk and cancelled the first N jobs I wanted to remove. It's a one-liner.
Something like this (replace N with the last row to cancel; row 1 is the squeue header):
squeue arguments | awk 'NR>=2 && NR<=N{print $1}' | xargs /usr/bin/scancel
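A variant of the same pipeline can select jobs by ID threshold rather than by row number. The filter below is only a sketch; the function name ids_below is hypothetical, and it assumes one job ID per line on standard input.

```shell
# Hypothetical filter: read job IDs on stdin (one per line) and print
# those whose numeric ID is below the threshold given as $1.
# Array tasks such as "1234_7" are compared using the part before the
# underscore; non-numeric lines (e.g. a header) are skipped.
ids_below() {
    awk -v max="$1" -F_ '$1 ~ /^[0-9]+$/ && $1 + 0 < max + 0 { print $0 }'
}
```

A dry run could look like `squeue -h -u "$USER" -o "%i" | ids_below 5000 | xargs echo scancel`; dropping the `echo` would actually cancel the selected jobs.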
In addition to the suggestions by @j23, you can organise your jobs with:
job arrays: if all your jobs are similar in terms of submission script, make them a job array, and submit one job array per version of your software. Then you can cancel an entire job array with just one scancel command.
a workflow management system: these make it easy to submit and manage sets of jobs (possibly on different clusters), for example:
Fireworks https://materialsproject.github.io/fireworks/
Bosco https://osg-bosco.github.io/docs/
Slurm pipelines https://github.com/acorg/slurm-pipeline
Luigi https://github.com/spotify/luigi
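The job-array and version-in-the-name suggestions can be combined. The sketch below assumes a naming scheme of base name plus version or commit ID; the function names and that scheme are hypothetical.

```shell
# Hypothetical helper: submit one job array per software version.
# $1 = base job name, $2 = version or commit ID,
# $3 = array specification (e.g. 0-9), $4 = batch script.
# --parsable makes sbatch print just the job ID (plus the cluster name,
# if any), which can then be passed to scancel to cancel the whole array.
submit_version() {
    sbatch --parsable --array="$3" --job-name="${1}_${2}" "$4"
}

# Cancel every job of the current user carrying the name of a version.
# $1 = base job name, $2 = version or commit ID.
cancel_version() {
    scancel --user="$USER" --name="${1}_${2}"
}
```

For instance, `submit_version sim abc123 0-9 run.sh` would submit ten array tasks named sim_abc123, and `cancel_version sim abc123` would later remove whatever is left of them without touching jobs from newer versions.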

How can I see in Slurm the details of all the jobs per user

I want to see the details of all jobs of a user.
I know that I can do the following:
scontrol show job
and then I can see all the details of all the jobs of all the users.
But I am searching for something like this:
scontrol show job UserId=Jon
Thanks.
One way to do that is to use squeue with the formatting option to build the command line and pipe that into a shell:
squeue --noheader --user Jon --format "scontrol show job %i" | sh
You can then use all the filtering options of squeue like per partition, per state, etc.
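If piping generated commands into sh feels fragile, the same idea can be written as an explicit loop. This is a sketch only; the function name show_user_jobs is hypothetical, and %A is used on the assumption that a plain numeric job ID per line is wanted.

```shell
# Hypothetical helper: print "scontrol show job" details for every job
# of user $1. Any further arguments are passed on to squeue as filters
# (for example --partition=... or --states=...).
show_user_jobs() {
    local user="$1"
    shift
    squeue --noheader --user "$user" --format "%A" "$@" | sort -u |
    while read -r jobid; do
        scontrol show job "$jobid"
    done
}
```

For example, `show_user_jobs Jon --states=RUNNING` would print the details of Jon's running jobs only.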

slurm job status for an old already finished job

I want to see the status of one of my older jobs submitted using Slurm. I have used sacct -j, but it does not tell me exactly when the job was submitted, terminated, etc. I want to check the date and time of the job submission. I tried to use scontrol, but I suppose that only works for currently running or pending jobs, not for older jobs which have already finished. It would be great if someone could suggest a Slurm command for checking the job status along with the submission date and time for an old job that has already finished. Thanks in advance.
As you mentioned that sacct -j is working but not providing the proper information, I'll assume that accounting is properly set and working.
You can select the output of the sacct command with the -o flag, so to get exactly what you want you can use:
sacct -j JOBID -o jobid,submit,start,end,state
You can use sacct --helpformat to get the list of all the available fields for the output.
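When the result is needed in a script rather than on screen, the same query can be made parsable. A minimal sketch, with a hypothetical function name and output wording:

```shell
# Hypothetical helper: print submit/start/end times and the state of a
# job ($1) in one readable line.
# -X keeps only the line for the job itself, -n drops the header, and
# -P separates the fields with "|".
job_times() {
    sacct -X -n -P -j "$1" -o jobid,submit,start,end,state |
    while IFS='|' read -r jobid submit start end state; do
        printf 'job %s: submitted %s, started %s, ended %s, state %s\n' \
            "$jobid" "$submit" "$start" "$end" "$state"
    done
}
```

Calling `job_times JOBID` prints one line per matching accounting record.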

SLURM: When we reboot the node, do job ID assignments start from 0?

For example:
sacct --start=1990-01-01 -A user returns a job table whose latest job ID is 136, but when I submit a new job with sbatch -A user -N1 run.sh, the submitted batch job gets ID 100, which is smaller than 136. And it seems that sacct -L -A user returns a list which ends with 100.
So it seems that newly submitted batch jobs overwrite the information of previous jobs, which I don't want.
[Q] When we reboot the node, do job ID assignments start from 0? If yes, what should I do to make them continue from the latest job ID assigned before the reboot?
Thank you for your valuable time and help.
There are two main reasons why job IDs might be recycled:
the maximum job ID was reached (see MaxJobId in slurm.conf)
the Slurm controller was restarted with FirstJobId set to a new value
Other than that, Slurm will always increase the job IDs.
Note that the job information in the database is not overwritten; each record has a unique ID which is different from the job ID. sacct has a -D, --duplicates option to view all jobs in the database. By default, it only shows the most recent one among all those which have the same job ID.
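Both points can be checked from the command line. The helpers below are a sketch with hypothetical names; they assume that scontrol show config reports FirstJobId and MaxJobId as "Name = value" lines.

```shell
# Hypothetical helper: print the job ID range currently configured on
# the controller.
job_id_range() {
    scontrol show config | grep -E '^(FirstJobId|MaxJobId)[[:space:]]*='
}

# Hypothetical helper: show every accounting record that shares job ID
# $1, including older records that sacct hides by default.
job_history() {
    sacct --duplicates -X -j "$1" -o jobid,jobname,submit,state
}
```

In the situation from the question, `job_history 100` would be expected to list both the older job 100 and the newly submitted one, if an earlier job with that ID exists in the database.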
