SLURM: When we reboot the node, do jobID assignments start from 0?

For example:
sacct --start=1990-01-01 -A user returns a job table whose latest jobID is 136, but when I submit a new job with sbatch -A user -N1 run.sh, the submitted batch job gets ID 100, which is smaller than 136. And it seems that sacct -L -A user returns a list which ends with 100.
So it seems that newly submitted batch jobs overwrite previous jobs' information, which I don't want.
[Q] When we reboot the node, do jobID assignments start from 0? If so, what should I do to continue from the latest jobID assigned before the reboot?
Thank you for your valuable time and help.

There are two main reasons why job IDs might be recycled:
the maximum job ID was reached (see MaxJobId in slurm.conf)
the Slurm controller was restarted with FirstJobId set to a new value
Other than that, Slurm will always increase the job IDs.
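For reference, a minimal slurm.conf sketch of those two parameters (the values shown are the usual defaults; check your site's configuration):
# slurm.conf (excerpt)
FirstJobId=1          # first job ID assigned after the controller starts with a clean state
MaxJobId=67043328     # once this is reached, job IDs wrap around to FirstJobId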
Note that job information in the database is not overwritten; each record has a unique internal ID which is different from the job ID. sacct has a -D, --duplicates option to view all jobs in the database. By default, it only shows the most recent one among all those which have the same job ID.
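For example, a sketch of how to list every stored record, including those that share a job ID (the account name user is taken from the question, and the field list is just an example):
sacct -A user --duplicates --start=1990-01-01 -o jobid,submit,state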

Related

SLURM get job id of last run jobs

Is there a way to get the last x SLURM job ids of (finished) jobs for a certain user (me)?
Or maybe all job IDs run in the last x hours?
(My use case is that I want to get some metrics via sacct but ideally don't want to parse them from output files etc.)
For the next time it's maybe advisable to plan this in advance, like here…
To strictly answer the question, you can use sacct like this:
sacct -X --start now-3hours -o jobid
This will list the IDs of the jobs that started within the past 3 hours.
But then, if what you want is to feed those job IDs to sacct to get metrics, you can directly add the metrics to the -o option, or remove that -o option altogether.
Also, the -X is there to have one line per job, but memory-related metrics are stored per job step, so you might want to remove it at some point to get one line per job step instead.
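For instance, a sketch that pulls a few common metrics directly (the field names here are examples; list all valid ones with sacct --helpformat):
sacct --start now-3hours -o jobid,jobname,elapsed,maxrss,state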

how to find the command used for a slurm job based on job id?

After submitting a slurm job using sbatch file.slurm, you get a job ID. You can use squeue and sacct to check the job's status. But neither returns the original submission command (sbatch file.slurm) for the job. Is there a command to show the submission command, namely sbatch file.slurm? I need to link job IDs with my submission commands.
So far, the only way is to save the output of the sbatch command somewhere.
No, there is no command to show the submission command. One workaround is to use the filename as the job name:
#SBATCH -J "file_name"
So, when you do squeue or scontrol show job, you can match your job ID with the filename.
There is no other way to achieve the desired objective.
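To illustrate the workaround, a sketch (file.slurm and the format strings are just examples):
# inside file.slurm, reuse the filename as the job name:
#SBATCH -J "file.slurm"
# later, match job IDs to job names:
squeue -u $USER -o "%i %j"     # pending/running jobs (%i = job ID, %j = job name)
sacct -X -o jobid,jobname      # finished jobs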

slurm job status for an old already finished job

I want to see the status of one of my older jobs submitted using Slurm. I have used sacct -j, but it does not give me information on exactly the date when the job was submitted/terminated etc. I want to check the date and time of the job submission. I tried to use scontrol, but I suppose that only works for currently running/pending jobs, not for older jobs which have already finished. It would be great if someone could suggest a Slurm command for checking the job status along with the job submission date and time for an already finished old job. Thanks in advance
As you mentioned that sacct -j is working but not providing the proper information, I'll assume that accounting is properly set up and working.
You can select the output of the sacct command with the -o flag, so to get exactly what you want you can use:
sacct -j JOBID -o jobid,submit,start,end,state
You can use sacct --helpformat to get the list of all the available fields for the output.
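For example, a sketch with a few more fields picked from that list (these names are examples, not an exhaustive set):
sacct -j JOBID -o jobid,submit,start,end,elapsed,nodelist,exitcode,state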

Slurm: Is it possible to give or change pid of the submitted job via sbatch

When we submit a job via sbatch, the pid given to jobs increases incrementally. Based on my observation, this order starts again from 1.
sbatch -N1 run.sh
Submitted batch job 20
//Goal is to change submitted batch job's id, if possible.
[Q1] For example, there is a running job under Slurm. When we reboot the node, does the job continue running? And does its pid get updated or stay as it was before?
[Q2] Is it possible to give or change the pid of the submitted job to a unique ID that the cluster owner wants to give?
Thank you for your valuable time and help.
If the node fails, the job is requeued - if this is permitted by the JobRequeue parameter in slurm.conf. It will get the same Job ID as the previously started run since this is the only identifier in the database for managing the jobs. (Users can override requeueing with the --no-requeue sbatch parameter.)
It's not possible to change job IDs, no.
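For context, a sketch of the settings mentioned above (the slurm.conf line is a cluster-wide default that only an administrator can change; 1 is the usual default):
# slurm.conf: allow batch jobs to be requeued after a node failure
JobRequeue=1
# per-job override at submission time:
sbatch --no-requeue -N1 run.sh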

Find jobs which were preempted in SLURM

As a user (not an admin), is there any way that I can look up jobs which were preempted at some point, then requeued? I tried:
sacct --allusers --state=PR --starttime=2016-01-01
And didn't get anything, but I don't think this command should actually work, because a job which got preempted and then requeued would not ultimately end up in the preempted state.
You need to use the --duplicate option of sacct; that will show you all the "intermediate states".
From the manpage:
-D, --duplicates
If Slurm job ids are reset, some job numbers will probably appear more than once in the accounting log file but refer to different jobs. Such jobs can be distinguished by the "submit" time stamp in the data records.
When data for specific jobs are requested with the --jobs option, sacct returns the most recent job with that number. This behavior can be overridden by specifying --duplicates, in which case all records that match the selection criteria will be returned.
When jobs are preempted, or requeued, you end up with several records in the database for the job, and this option allows you to see all of them.
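Putting that together, a sketch (the field list is an example; PREEMPTED is the state recorded for the preempted run):
sacct --allusers --duplicates --starttime=2016-01-01 -o jobid,submit,state | grep -i preempted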
