Using sacct I want to obtain information about my completed jobs.
The answer explains how we can obtain a job's information.
I have submitted a job named jobName.sh, which was given jobID 176. After 12 hours and 200 new job submissions, I want to check my job's (jobID=176) information, but I get the following error:
scontrol show job 176
slurm_load_jobs error: Invalid job id specified
And the following command returns nothing: sacct --name jobName.sh
I assume there is a time limit for keeping previously submitted jobs' information, and that the old records have somehow been removed. Is there such a limit? How can I set that limit to a very large value in order to prevent the records from being deleted?
Please note that JobRequeue=0 is set in slurm.conf.
Assuming that you are using MySQL to store that data, you can tune, among other things, the purging time in your database configuration file slurmdbd.conf. Here are some examples:
PurgeJobAfter=12hours
PurgeJobAfter=1month
PurgeJobAfter=24months
If not set (default), then job records are never purged.
More info can be found in the slurmdbd.conf documentation.
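To illustrate, here is a minimal sketch of what the relevant part of slurmdbd.conf could look like, assuming MySQL storage; the values are only examples, and PurgeStepAfter/PurgeEventAfter are optional companion settings:
# slurmdbd.conf (excerpt) -- example values only
StorageType=accounting_storage/mysql
# keep job records for 12 months before purging them from the database
PurgeJobAfter=12months
# optionally purge step and event records on the same schedule
PurgeStepAfter=12months
PurgeEventAfter=12months
slurmdbd typically needs to be restarted to pick up changes to these settings.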
The Slurm documentation mentions that:
MinJobAge: The minimum age of a completed job before its record is purged from Slurm's active database. Set the values of MaxJobCount and MinJobAge to ensure the slurmctld daemon does not exhaust its memory or other resources. The default value is 300 seconds. A value of zero prevents any job record purging. In order to eliminate some possible race conditions, the minimum non-zero value for MinJobAge recommended is 2.
In my slurm.conf file, MinJobAge was 300, which is 5 minutes. That's why each completed job's information was removed after 5 minutes. I increased MinJobAge's value in order to prevent job records from being purged so quickly.
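For illustration, a sketch of the change on the controller side; the value is an example (MinJobAge is expressed in seconds), and whether a plain reconfigure is enough may depend on your Slurm version:
# slurm.conf (excerpt) -- keep completed jobs in slurmctld's memory for 24 hours
MinJobAge=86400
scontrol reconfigure   # ask slurmctld to re-read slurm.conf
Keep in mind that MinJobAge only controls how long slurmctld keeps completed jobs in memory (what scontrol show job sees); the long-term history shown by sacct comes from the accounting database and is governed by the purge settings above.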
Related
Is there a way to find out why a job was canceled by slurm? I would like to distinguish the cases where a resource limit was hit from all other reasons (like a manual cancellation). In case a resource limit was hit, I also would like to know which one.
The slurm log file contains that information explicitly. It is also written to the job's output file with something like:
JOB <jobid> CANCELLED AT <time> DUE TO TIME LIMIT
or
Job <jobid> exceeded <mem> memory limit, being killed:
or
JOB <jobid> CANCELLED AT <time> DUE TO NODE FAILURE
etc.
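If the output file is no longer available, a hedged alternative is to look at the job's final state in the accounting database: states such as TIMEOUT, NODE_FAIL, CANCELLED and, in recent Slurm versions, OUT_OF_MEMORY distinguish several of the cases above. A sketch, with <jobid> as a placeholder:
sacct -j <jobid> --format=JobID,JobName,State,ExitCode,Elapsed,Timelimit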
For example:
sacct --start=1990-01-01 -A user returns a job table whose latest jobID is 136, but when I submit a new job with sbatch -A user -N1 run.sh, the submitted batch job gets ID 100, which is smaller than 136. It also seems that sacct -L -A user returns a list that ends with 100.
So it seems that submitted batch jobs overwrite previous jobs' information, which I don't want.
[Q] When we reboot the node, do jobID assignments start over from 0? If so, what should I do to continue from the latest jobID assigned before the reboot?
Thank you for your valuable time and help.
There are two main reasons why job IDs might be recycled:
the maximum job ID was reached (see MaxJobId in slurm.conf)
the Slurm controller was restarted with FirstJobId set to a new value
Other than that, Slurm will always increase the job IDs.
Note that the job information in the database is not overwritten; each record has a unique internal ID which is different from the job ID. sacct has a -D, --duplicates option to view all jobs in the database. By default, it only shows the most recent one among all those that have the same job ID.
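For reference, both parameters mentioned above live in slurm.conf; a sketch of what an administrator might set, with placeholder values:
# slurm.conf (excerpt) -- job IDs are assigned between FirstJobId and MaxJobId
FirstJobId=1000
MaxJobId=67043328
# once MaxJobId is reached, numbering wraps around and starts again at FirstJobId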
When we submit a job via sbatch, job IDs are assigned in incremental order. Based on my observation, this numbering starts again from 1 after a reboot.
sbatch -N1 run.sh
Submitted batch job 20
// Goal: change the submitted batch job's ID, if possible.
[Q1] For example, there is a running job under Slurm. When we reboot the node, does the job continue running? And does its job ID get updated or stay as it was before?
[Q2] Is it possible to assign, or later change, the ID of a submitted job to a unique ID that the cluster owner wants to give?
Thank you for your valuable time and help.
If the node fails, the job is requeued - if this is permitted by the JobRequeue parameter in slurm.conf. It will get the same Job ID as the previously started run since this is the only identifier in the database for managing the jobs. (Users can override requeueing with the --no-requeue sbatch parameter.)
It's not possible to change job IDs, no.
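For completeness, a sketch of how requeue behaviour can be controlled per job at submission time; both flags are standard sbatch options, while the cluster-wide default comes from JobRequeue in slurm.conf:
sbatch --requeue -N1 run.sh      # allow the job to be requeued (it keeps the same job ID)
sbatch --no-requeue -N1 run.sh   # never requeue this job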
As a user (not an admin), is there any way that I can look up jobs which were preempted at some point, then requeued? I tried:
sacct --allusers --state=PR --starttime=2016-01-01
I didn't get anything, but I don't think this command should actually work anyway, because a job that got preempted and then requeued would not ultimately end up in the preempted state.
You need to use the --duplicate option of sacct; that will show you all the "intermediate states".
From the manpage:
-D, --duplicates
If Slurm job ids are reset, some job numbers will probably appear more than once in the accounting log file but refer to different jobs. Such jobs can be distinguished by the "submit" time stamp in the data records.
When data for specific jobs are requested with the --jobs option, sacct returns the most recent job with that number. This behavior can be overridden by specifying --duplicates, in which case all records that match the selection criteria will be returned.
When jobs are preempted, or requeued, you end up with several records in the database for the job, and this option allows you to see all of them.
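For example, a sketch of the kind of query this enables; the format fields are standard sacct fields, and the dates and <jobid> are placeholders to adapt:
# list records that were in the PREEMPTED state at some point in the window
sacct --allusers --duplicates --state=PREEMPTED --starttime=2016-01-01 --endtime=now --format=JobID,State,Submit,Start,End
# then inspect the full history (all intermediate records) of one job
sacct -j <jobid> --duplicates --format=JobID,State,Submit,Start,End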
I have a job running on a Linux machine managed by Slurm.
Now that the job has been running for a few hours, I realize that I underestimated the time required for it to finish, so the value of the --time argument I specified is not enough. Is there a way to add time to an existing running job through Slurm?
Use the scontrol command to modify a job:
scontrol update jobid=<job_id> TimeLimit=<new_timelimit>
Use the Slurm time format, e.g. for 8 days and 15 hours: TimeLimit=8-15:00:00
On some machines, this requires admin privileges.
On most machines, users will be allowed to do this only if the job is not running yet.
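To check the limit before and after the change, a sketch using standard commands (%i, %l and %L are squeue format codes for the job ID, the time limit and the time left):
scontrol show job <job_id> | grep -i TimeLimit
squeue -j <job_id> -o "%.10i %.12l %.12L"   # JOBID, TIMELIMIT, TIME_LEFT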
To build on the example provided above, you can also use "+" and "-" to increment / decrement the TimeLimit.
From the scontrol man page (https://slurm.schedmd.com/scontrol.html):
either specify a new time limit value or precede the time and equal sign with a "+" or "-" to increment or decrement the current time limit (e.g. "TimeLimit+=30")
We regularly get requests like "I need 3 more hours for job XXXXX to finish!!!", which would translate to:
scontrol update job=XXXXX TimeLimit=+03:00:00