How can I find out the "command" (batch script filename) of a finished SLURM job? - slurm

I often have lots of SLURM jobs running from different directories. Therefore, it is useful to query the workdir of the jobs. I can do this for jobs in the queue (e.g. pending, running, etc.) with something like this:
squeue -u $USER -o "%i %Z"
and I can do this for finished jobs (e.g. completed, timeout, cancelled, etc.) with something like this:
sacct -u $USER -o JobID,WorkDir
The problem is, sometimes I have a directory with two (or more) SLURM batch scripts in it, e.g. submit.sh and restart.sh. Therefore, it is also useful to query the "command" of the jobs, i.e. the filename of the batch script. I can do this for jobs in the queue with something like this:
squeue -u $USER -o "%i %o"
However, from checking the documentation of sacct and playing around with sacct, there appears to be no equivalent option for sacct so I cannot currently get the command for finished jobs. I also cannot use the squeue method for finished jobs - it just says slurm_load_jobs error: Invalid job id specified because finished jobs are not included in the squeue list. So, how can I find out the command of a finished SLURM job (using sacct or otherwise)?

Slurm indeed does not store the command in the accounting database. Two workarounds:
For a single user: use the JobName or Comment field to store the script name upon submission. These are stored in the database, but this approach is error-prone;
Cluster-wide: enable the Elasticsearch job completion plugin, as this stores not only the script name but the whole script contents as well.
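The first workaround can be sketched as a small submission wrapper. This is only a sketch under assumptions: whether the Comment field actually lands in the accounting database depends on the cluster's configuration (AccountingStoreFlags including job_comment), and the wrapper below merely prints the sbatch command it would run, so you can inspect it before dropping the echo to submit for real. submit_tagged is a hypothetical helper name, not a Slurm command.

```shell
# Sketch: wrap sbatch so the batch script's path is recorded in the job's
# Comment field, visible later via: sacct -o JobID,Comment%-60
# Assumes the cluster stores comments in accounting
# (AccountingStoreFlags=job_comment), which may not be the case everywhere.
submit_tagged() {
    script="$1"
    shift
    # echo the command instead of running it, so the sketch is safe to try;
    # remove the echo to actually submit
    echo sbatch --comment="$PWD/$script" "$@" "$script"
}
```

For example, `submit_tagged restart.sh` prints the sbatch line with the full script path embedded in --comment.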

Related

Check sbatch script of running job

When running a Slurm job from an sbatch script, is there a command that lets me see what was in the sbatch script that I used to start this job?
For example sacct tells me I'm on SLURM_JOB_ID.3 and I would like to see how many job steps there will be in total.
I'm looking for a command that takes the job id and prints the sbatch script it is running.
You can use
scontrol write batch_script <jobid>
This writes the submission script of the job with the given job id to a file (slurm-<jobid>.sh by default); pass - as an extra argument to print it to stdout instead.
More info: https://slurm.schedmd.com/scontrol.html#OPT_write-batch_script

Slurm - Accessing stdout/stderr location of a completed job

I am trying to get the location of the stdout and stderr file of an already completed job.
Indeed, while the job is running, I could do
scontrol show job $JobId
However, this does not work after a job is finished.
I am able to get information about previously completed jobs with sacct. However, there is no option to display the location of stderr and stdout with this command.
The only information I found about this issue is this: https://groups.google.com/g/slurm-users/c/e4cZMbtrMM0 . However, this suggests changing slurm.conf so that scontrol show job $JobId retains information after the job finishes. This is not possible in my case because I do not have access to slurm.conf.
So I was wondering if there was a way with slurm to get the location of the stdout and stderr of a completed job?
Thanks for your help
---- edit ----
The jobs are submitted with a bash file
#SBATCH --output=...
#SBATCH --error=...
By running the command sbatch $submission_file
This means retrieving the command used to submit the file does not help: it will only return sbatch $submission_file and gives no further information on the output and error locations.
Although Slurm does not seem to save that information in the accounting database explicitly, it does save the working directory, which you can obtain with
sacct -j <JOBID> -o workdir%-100
Most of the time, chances are the output and error files will be relative to that directory.
Slurm also saves the submission command, which you can retrieve with
sacct -j <JOBID> -o SubmitLine%-100
which will reveal the output and error files in case they were provided on the command line.
Finally, note that Slurm will also save the full submission script if configured to do so (which may not be the case on your cluster). If so, you can retrieve it with
sacct -j <JOBID> --batch-script
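The first suggestion can be put into a small helper. This is a sketch under assumptions: it only works for jobs that used Slurm's default output name slurm-<jobid>.out relative to the working directory; jobs that set #SBATCH --output explicitly will not match this guess. job_stdout_path is a hypothetical helper name.

```shell
# Sketch: guess the stdout location of a finished job from its WorkDir.
# Assumes the default output file name slurm-<jobid>.out in the job's
# working directory.
job_stdout_path() {
    jobid="$1"
    # -n: no header, -P: parseable output; take the first line (the job
    # itself, not its steps)
    workdir=$(sacct -j "$jobid" -n -P -o WorkDir | head -n 1)
    echo "$workdir/slurm-$jobid.out"
}
```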

"qsub script.sh" yielding "Unknown queue" error

Let's say I have two bash scripts (small.sh & super.sh).
small.sh
#!/bin/bash
cd /current_path/
chmod a+x *.sh
bash super.sh
super.sh
#!/bin/bash
qsub test.sh
When I submit my job to the PBS system with
qsub small.sh
super.sh is not executed, which means it never runs
qsub test.sh
Am I doing something wrong? How can I achieve this?
If your script has no #PBS directives, and you don't submit with something like qsub -q batch ..., then it seems like you either a) have no default queue defined, or b) the queue name being submitted to does not exist (or has a typo). Run this (as an admin) to see the default queue:
qmgr -c 'print server default_queue'
Run this to see the queue settings:
qmgr -c 'print queue <queue_name>'
If you have no default queue, then either set one, or make sure to always submit directly to a queue with qsub -q <queue_name> ... (and of course make sure the queue actually exists, which you can still do with print queue as mentioned).
This is what I found out from here:
Queue is Unknown
Be sure to use the correct queue. For Pleiades jobs, use the common queue names normal, long, vlong, and debug. For Endeavour jobs, use the queue names e_normal, e_long, e_vlong, and e_debug. The PBS server pbspl1 recognizes the queue names for both Pleiades and Endeavour, and will route them appropriately. However, the pbspl3 server only recognizes the queue names for Endeavour jobs, as shown below:
pfe20% qsub -q normal@pbspl3 job_script
qsub: unknown queue

Need to know qsub completed job ids

I have a script that submits a bunch of qsub jobs. I can use the following command to check job status:
qstat -u <user name>
I want to find out the completed job ids. What command do I need to use for that?
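One hedged sketch, assuming a Torque cluster configured to keep finished jobs visible in qstat for a while (a non-zero keep_completed setting), where they appear with state C. The column positions assume Torque's default qstat layout, and completed_jobs is a hypothetical helper name.

```shell
# Sketch: print the ids of completed jobs for the current user.
# Assumes Torque's default qstat columns (Job ID, Name, User, Time Use,
# S, Queue) and that completed jobs stay listed with state "C".
completed_jobs() {
    qstat | awk -v user="$USER" '$3 == user && $5 == "C" { print $1 }'
}
```

On other batch systems the equivalent differs: PBS Pro can list finished jobs with qstat -x, and Grid Engine records them in the accounting file queried by qacct -j <jobid>.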

can i delete a shell script after it has been submitted using qsub without affecting the job?

I want to submit a bunch of jobs using qsub - the jobs are all very similar. I have a script that has a loop, and in each instance it rewrites over a file tmpjob.sh and then does qsub tmpjob.sh. Before the job has had a chance to run, tmpjob.sh may have been overwritten by the next instance of the loop. Is another copy of tmpjob.sh stored while the job is waiting to run? Or do I need to be careful not to change tmpjob.sh before the job has begun?
Assuming you're talking about Torque, then yes: Torque reads in the script at submission time. In fact the submission script need never exist as a file at all; as given as an example in the Torque documentation, you can pipe commands to qsub (from the docs: cat pbs.cmd | qsub).
But several other batch systems (SGE/OGE, PBS Pro) use qsub as a queue submission command, so you'll have to tell us which queuing system you're using to be sure.
Yes. You can even create jobs and sub-jobs with HERE Documents. Below is an example of a test I was doing with a script initiated by a cron job:
#!/bin/env bash
printenv
qsub -N testCron -l nodes=1:vortex:compute -l walltime=1:00:00 <<QSUB
cd \$PBS_O_WORKDIR
printenv
qsub -N testsubCron -l nodes=1:vortex:compute -l walltime=1:00:00 <<QSUBNEST
cd \$PBS_O_WORKDIR
pwd
date -Isec
QSUBNEST
QSUB
