Find jobs which were preempted in SLURM - slurm

As a user (not an admin), is there any way that I can look up jobs which were preempted at some point, then requeued? I tried:
sacct --allusers --state=PR --starttime=2016-01-01
And didn't get anything, but I don't think this command should actually work, because a job which got preempted and then requeued would not ultimately end up in the preempted state.

You need to use the --duplicate option of sacct; that will show you all the "intermediate states".
From the manpage:
-D, --duplicates
If Slurm job ids are reset, some job numbers will probably appear more than once in the accounting log file but refer to different jobs. Such
jobs can be distinguished by the "submit" time stamp in the data records.
When data for specific jobs are requested with the --jobs option, sacct returns the most recent job with that number. This behavior can be
overridden by specifying --duplicates, in which case all records that match the selection criteria will be returned.
When jobs are preempted, or requeued, you end up with several records in the database for the job, and this option allows you to see all of them.

Related

Add a job to SLURM queue with higher priority as previously submitted jobs

I want to submit and run a job X to a SLURM queue while already having other jobs YZ waiting in that queue.
Basically, I want to avoid doing scontrol hold YZ manually or find an automated way to scontrol hold YZ with the submission of X and scontrol release YZ as soon as the job X finishes.
Cheers
There is the scontrol top <jobID> command, which puts a job on top of other jobs of the same user ID. But it has to be enabled by the system administrator.
To quote the scontrol man-page:
top job_list
Move the specified job IDs to the top of the queue of jobs belonging to the identical user ID, partition name, account, and QOS.
The job_list argument is a comma separated ordered list of job IDs.
Any job not matching all of those fields will not be effected. Only
jobs submitted to a single partition will be effected. This operation
changes the order of jobs by adjusting job nice values. The net effect
on that user's throughput will be negligible to slightly negative.
This operation is disabled by default for non-privileged
(non-operator, admin, SlurmUser, or root) users. This operation may be
enabled for non-privileged users by the system administrator by
including the option "enable_user_top" in the SchedulerParameters
configuration parameter.

Cron Job sometimes "missed | Too late for the schedule" sometimes running

Sometimes Crons is working sometimes getting missed. I have attached all setting and result. Anyone can check and revert.
It's completely normal behaviour. Some jobs are skipped caused the time frame is out of scheduled time for specified cron job. In your case the reindex process is scheduled every 1 minute. If there is more things to index (lot of changes on products, categories etc.) one minute is's not enough to complete. Also there is only one process per cron group, in your case index. Use Separate Process in cron configuration means that indexes process will run as separate process in relation to other cron groups.

Edit the Job ID number of a pbs submitted job to achieve submission before other jobs in queue

Dear UNIX/PBS experts:
I am user of a UNIX HPC system (CentOS Linux 7 (Core),Linux 3.10.0-693.5.2.el7.x86_64) and I do not have any root privileges.
Various jobs have been submitted at an HPC system and almost all resources are being used.
Jobs from other users may run for weeks while my submitted job would finish in less than a day.
My goal is to run my job exactly after the first resources will be freed instead of waiting for
all other users to have their jobs finished.
My submitted job has a number qid 66005.pbs.
However the last job running at this moment has number 55004.pbs.
By checking the status of job: qstat 55005,
I obtain: qstat: Unknown Job Id 55005.pbs
Thus my question is whether it is possible to change the name of job 66005.pbs to 55005.pbs, and if this action will allow my job to run?
If yes, how can this be achieved?
If not, are there any other solutions/alternatives for making sure that my jobs run before those ones of other users in queue?
Thank you very much for your help and any suggestion.
The good thing about the computer system is that it is not human. It will be unfair to run your job (which clearly was submitted after other users) before other users and because of that "No" it is not possible to change your job-id.
You can work with your admin to move the job to a higher priority queue instead.

How to submit jobs across multiple partitions at the same time (Slurm)

After I submit a job to node/partition cn430 today, I find that the node is keeping obsessed,
After the previous job finished, my job still didn't get running due to priority. Then I noticed that all of these jobs have the same prefix, namely 4988443, which is ahead of my job id 4988560.
It seems that the user has submitted about 1000 jobs together with the same priority across multiple partitions,
I am wondering how to implement it.
Firstoff, cn430 really looks like a node rather than a partition. The partition to which it belongs seems to be named shared-gp.
What you see is a job array. It is a way to submit a large number of jobs that only differ in a specific parameter. Each job in the array is scheduled independently, so if you do not request a specific node (e.g. with -wor --nodelist), Slurm will broadcast them to the nodes that are available.
Note that the job priorities will decay overtime if faishare is being implemented so the jobs that are currently pending will have their priority decrease because of those currently running.

SLURM: When we reboot the node, does jobID assignments start from 0?

For example:
sacct --start=1990-01-01 -A user returns job table with latest jobID as 136, but when I submit a new job as sbatch -A user -N1 run.sh submitted bash job returns 100 which is smaller than 136. And seems like sacct -L -A user returns a list which ends with 100.
So it seems like submitted batch jobs overwrites to previous jobs' informations, which I don't want.
[Q] When we reboot the node, does jobID assignments start from 0? If yes, what should I do it to continue from latest jobID assignment before the reboot?
Thank you for your valuable time and help.
There are two main reasons why job ID's might be recycled:
the maximum job ID was reached (see MaxJobId in slurm.conf)
the Slurm controller was restarted with FirstJobId set to a new value
Other than that, Slurm will always increase the job ID's.
Note that the job information in the database is not overwrite; they have a unique ID which is different from the job ID. sacct has a -D, --duplicates option to view all jobs in the database. By default, it only shows the most recent one among all those which have the same job ID.

Resources