Is there a way to find out why a job was canceled by Slurm? I would like to distinguish the cases where a resource limit was hit from all other reasons (such as a manual cancellation). If a resource limit was hit, I would also like to know which one.
The Slurm log file contains that information explicitly. It is also written to the job's output file, with messages such as:
JOB <jobid> CANCELLED AT <time> DUE TO TIME LIMIT
or
Job <jobid> exceeded <mem> memory limit, being killed:
or
JOB <jobid> CANCELLED AT <time> DUE TO NODE FAILURE
etc.
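If the output file is no longer available, the same distinction can usually be made from the accounting records. This is a sketch that assumes accounting is enabled on the cluster; 12345 is a placeholder job ID:
$ sacct -j 12345 --format=JobID,JobName,State,ExitCode
States such as TIMEOUT, NODE_FAIL, CANCELLED and (in recent Slurm versions) OUT_OF_MEMORY then correspond to the messages above.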
Dear UNIX/PBS experts:
I am a user of a UNIX HPC system (CentOS Linux 7 (Core), Linux 3.10.0-693.5.2.el7.x86_64) and I do not have any root privileges.
Various jobs have been submitted to the HPC system and almost all resources are in use.
Jobs from other users may run for weeks, while my submitted job would finish in less than a day.
My goal is to have my job run as soon as the first resources are freed, instead of waiting for all other users' jobs to finish.
My submitted job has the ID 66005.pbs.
However, the last job currently running has the ID 55004.pbs.
When I check the status of job 55005 with qstat 55005,
I obtain: qstat: Unknown Job Id 55005.pbs
So my question is whether it is possible to change the ID of job 66005.pbs to 55005.pbs, and whether doing so would allow my job to run.
If so, how can this be achieved?
If not, are there any other solutions/alternatives for making sure that my jobs run before those of the other users in the queue?
Thank you very much for your help and any suggestions.
The good thing about the computer system is that it is not human. It would be unfair to run your job (which was clearly submitted after the other users' jobs) before theirs, and for that reason the answer is "No": it is not possible to change your job ID.
You can work with your admin to move the job to a higher priority queue instead.
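If your site does allow moving jobs, the standard PBS command for that is qmove; whether you may target a higher-priority queue depends on local policy, and the queue name below is hypothetical:
$ qmove high_priority 66005.pbs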
I'm trying to use our cluster, but I'm running into issues. I tried allocating some resources with:
salloc -N 1 --ntasks-per-node=5 bash
but it keeps waiting at:
salloc: Pending job allocation ...
salloc: job ... queued and waiting for resources
or when I try:
srun -N1 -l echo test
it lingers in the waiting queue!
Am I making a mistake, or is there something wrong with our cluster?
It might help to set a time limit for the Slurm job using the --time option; for instance, set a limit of 10 minutes like this:
srun --job-name="myJob" --ntasks=4 --nodes=2 --time=00:10:00 --label echo test
Without a time limit, Slurm will use the partition's default time limit. The issue is that this is sometimes set to infinity or to several days, which can delay the start of the job. To check the partitions' time limits, use:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
prod* up infinite 198 ....
gpu* up 4-00:00:00 70 ....
From the Slurm docs:
-t, --time=<time>
Set a limit on the total run time of the job allocation. If the requested time limit exceeds the partition's time limit, the job will be left in a PENDING state (possibly indefinitely). The default time limit is the partition's default time limit. When the time limit is reached, each task in each job step is sent SIGTERM followed by SIGKILL.
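If the job still stays pending, the reason Slurm gives for holding it back can be checked directly; this is a sketch with 12345 as a placeholder job ID:
$ squeue -j 12345 -o "%i %T %r"        # job ID, state, and pending reason
$ scontrol show job 12345 | grep -i reason
Reasons like Resources or Priority mean the job is simply waiting its turn, while PartitionTimeLimit means the requested --time exceeds what the partition allows.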
Using sacct I want to obtain information about my completed jobs.
This answer mentions how we can obtain a job's information.
I have submitted a job named jobName.sh, which has job ID 176. Twelve hours and 200 new jobs later, I want to check my job's (job ID 176) information, but I obtain slurm_load_jobs error: Invalid job id specified.
scontrol show job 176
slurm_load_jobs error: Invalid job id specified
And the following command returns nothing: sacct --name jobName.sh
I assume there is a time limit for keeping previously submitted jobs' information, and that the older jobs' records have somehow been removed because of it. Is there such a limit? How can I set that limit to a very large value in order to prevent the records from being deleted?
Please note that JobRequeue=0 is set in slurm.conf.
Assuming that you are using MySQL to store the accounting data, you can tune, among other things, the purge times in your database configuration file slurmdbd.conf. Here are some examples:
PurgeJobAfter=12hours
PurgeJobAfter=1month
PurgeJobAfter=24months
If not set (default), then job records are never purged.
More info.
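As a minimal sketch of where that goes (the path and values here are illustrative), the setting lives in slurmdbd.conf next to the storage options, and slurmdbd has to be restarted to pick it up:
# slurmdbd.conf (illustrative)
StorageType=accounting_storage/mysql
PurgeJobAfter=24months
$ systemctl restart slurmdbd    # on systemd-based systems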
The Slurm documentation mentions that:
MinJobAge
The minimum age of a completed job before its record is purged from Slurm's active database. Set the values of MaxJobCount and MinJobAge to ensure the slurmctld daemon does not exhaust its memory or other resources. The default value is 300 seconds. A value of zero prevents any job record purging. In order to eliminate some possible race conditions, the minimum non-zero value for MinJobAge recommended is 2.
In my slurm.conf file, MinJobAge was 300, which is 5 minutes. That is why each completed job's information was removed after 5 minutes. I increased MinJobAge's value in order to keep the records around longer.
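For illustration, the change amounts to something like the following in slurm.conf; the one-day value is arbitrary, and telling slurmctld to re-read its configuration normally requires admin rights:
# slurm.conf (illustrative: keep completed job records for one day)
MinJobAge=86400
$ scontrol reconfigure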
When we submit a job via sbatch, the IDs given to jobs are assigned in incremental order. Based on my observation, this numbering starts again from 1 at some point.
sbatch -N1 run.sh
Submitted batch job 20
//Goal is to change submitted batch job's id, if possible.
[Q1] For example, suppose there is a running job under Slurm. When we reboot the node, does the job continue running? And does its ID get updated, or does it stay as it was before?
[Q2] Is it possible to give the submitted job a unique ID chosen by the cluster owner, or to change its ID afterwards?
Thank you for your valuable time and help.
If the node fails, the job is requeued - if this is permitted by the JobRequeue parameter in slurm.conf. It will get the same Job ID as the previously started run since this is the only identifier in the database for managing the jobs. (Users can override requeueing with the --no-requeue sbatch parameter.)
It's not possible to change Job IDs, no.
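For completeness, requeue behaviour can be chosen per job at submission time (run.sh is the script from the question):
$ sbatch --requeue -N1 run.sh      # allow this job to be requeued, e.g. after a node failure
$ sbatch --no-requeue -N1 run.sh   # never requeue this job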
As a user (not an admin), is there any way that I can look up jobs which were preempted at some point, then requeued? I tried:
sacct --allusers --state=PR --starttime=2016-01-01
And didn't get anything, but I don't think this command should actually work, because a job which got preempted and then requeued would not ultimately end up in the preempted state.
You need to use the --duplicates option of sacct; that will show you all the "intermediate states".
From the manpage:
-D, --duplicates
If Slurm job ids are reset, some job numbers will probably appear more than once in the accounting log file but refer to different jobs. Such jobs can be distinguished by the "submit" time stamp in the data records.
When data for specific jobs are requested with the --jobs option, sacct returns the most recent job with that number. This behavior can be overridden by specifying --duplicates, in which case all records that match the selection criteria will be returned.
When jobs are preempted, or requeued, you end up with several records in the database for the job, and this option allows you to see all of them.
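Applied to the question's query, a sketch (the --format fields are just one reasonable selection):
$ sacct --allusers --duplicates --starttime=2016-01-01 --format=JobID,State,Submit,Start,End
Jobs that were preempted and later requeued then show up with more than one record, and the intermediate record (typically in the PREEMPTED state) is what identifies them.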