SLURM job dependency by job name not job id - slurm

The format for job dependencies in the documentation is as follows:
sbatch --dependency=<type:job_id[:job_id][,type:job_id[:job_id]]> ...
Is it possible to make a job dependency using job name instead of job ID?

Slurm does not seem to support that directly, but a workaround that works on the command line (not in a #SBATCH directive inside a script) is:
sbatch --dependency=$(squeue --noheader --format %i --name <JOB_NAME>) ...
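If several queued jobs share the same name, squeue prints one job ID per line, which no longer fits the documented type:job_id[:job_id] format. The following is a minimal sketch of one way to handle that case, not a Slurm feature: the helper names join_job_ids and dependency_for_name, and the default type afterok, are assumptions chosen for illustration.

```shell
#!/bin/bash
# join_job_ids: read job IDs (one per line) on stdin and print a dependency
# string "<type>:<id>[:<id>...]". Prints nothing if no IDs arrive.
# The function name and the default type "afterok" are illustrative choices.
join_job_ids() {
    local dep_type="${1:-afterok}"
    local ids
    # Drop blank lines and surrounding whitespace, then join the IDs with ':'
    ids=$(awk 'NF { print $1 }' | paste -sd: -)
    if [ -n "$ids" ]; then
        printf '%s:%s\n' "$dep_type" "$ids"
    fi
}

# dependency_for_name: hypothetical wrapper that looks up the IDs by job name
# with the same squeue call as the workaround above.
dependency_for_name() {
    local job_name="$1" dep_type="${2:-afterok}"
    squeue --noheader --format %i --name "$job_name" | join_job_ids "$dep_type"
}
```

It could then be used as sbatch --dependency="$(dependency_for_name <JOB_NAME>)" ... on the command line. The output is empty when no job matches the name, so it is worth checking the result before submitting.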

Related

Check sbatch script of running job

When running a Slurm job from an sbatch script, is there a command that lets me see what was in the sbatch script that I used to start this job?
For example sacct tells me I'm on SLURM_JOB_ID.3 and I would like to see how many job steps there will be in total.
I'm looking for a command that takes the job id and prints the sbatch script it is running.
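You can use
scontrol write batch_script <job_id>
This writes the submission script of the job with the given job ID to a file named slurm-<job_id>.sh in the current directory. To print the script to standard output instead, pass - as the filename argument after the job ID.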
More info: https://slurm.schedmd.com/scontrol.html#OPT_write-batch_script

Running parallel jobs in slurm

I was wondering if I could ask something about running Slurm jobs in parallel. (Please note that I am new to Slurm and Linux and only started using them 2 days ago...)
Following the instructions at https://hpc.nmsu.edu/discovery/slurm/serial-parallel-jobs/,
I have written the following bash script:
#!/bin/bash
#SBATCH --job-name fmriGLM # to give each job a distinct name
#SBATCH --nodes=1
#SBATCH -t 16:00:00 # Time for running job
#SBATCH -o /scratch/connectome/dyhan316/fmri_preprocessing/FINAL_loop_over_all/output_fmri_glm.o%j #%j : the job id [>
#SBATCH -e /scratch/connectome/dyhan316/fmri_preprocessing/FINAL_loop_over_all/error_fmri_glm.e%j
pwd; hostname; date
#SBATCH --ntasks=30
#SBATCH --mem-per-cpu=3000MB
#SBATCH --cpus-per-task=1
for num in {0..29}
do
srun --ntasks=1 python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &
done
wait
Then, I ran sbatch as follows: sbatch test_bash
However, when I view the outputs, it is apparent that only one of the sruns in the bash script is being executed... Could anyone tell me where I went wrong and how I can fix it?
Update: when I look at the error file I see the following: srun: Job 43969 step creation temporarily disabled, retrying. From what I found online, this can happen when no memory is specified, so there is not enough memory left for the second job step. But I thought I had already specified the memory with --mem-per-cpu=3000MB?
Update: I have tried changing the code as suggested in "Why are my slurm job steps not launching in parallel?", but it still didn't work.
Potentially pertinent information: our node has about 96 cores, which seems odd compared to tutorials that describe nodes with around 4 cores.
Thank you!!
Try adding --exclusive to the srun command line:
srun --exclusive --ntasks=1 python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &
This will instruct srun to use a sub-allocation and work as you intended.
Note that the --exclusive option has a different meaning in this context than if used with sbatch.
Note also that different versions of Slurm have a distinct canonical way of doing this, but using --exclusive should work across most versions.
Even though you have solved your problem, which turned out to be something else, and you had already specified --mem-per-cpu=3000MB in your sbatch script, I would like to add that my Slurm setup does not allow --mem-per-cpu in sbatch, only --mem. In that case, the first srun command still allocates all the memory and blocks the subsequent steps. The key for me was to specify --mem-per-cpu (or --mem) in the srun command itself.
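One more thing worth checking in the script from the question: sbatch stops processing #SBATCH directives at the first executable line, and there the --ntasks, --mem-per-cpu and --cpus-per-task lines come after pwd; hostname; date, so they would likely be ignored. The sketch below flags such lines; the function name late_sbatch_directives is a hypothetical helper, not part of Slurm.

```shell
#!/bin/bash
# late_sbatch_directives: print every "#SBATCH" line that appears after the
# first executable (non-blank, non-comment) line of a script read from stdin.
# sbatch does not apply such lines. The function name is a hypothetical helper.
late_sbatch_directives() {
    awk '
        /^[[:space:]]*$/ { next }                            # skip blank lines
        /^#SBATCH/       { if (seen_command) print NR ": " $0; next }
        /^[[:space:]]*#/ { next }                            # shebang and other comments
        { seen_command = 1 }                                 # first real command reached
    '
}
```

Running late_sbatch_directives < test_bash prints the line number and text of each directive that sits too late in the file. Moving those lines above the first command is the usual fix.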

[slurm] How to check the job type on Slurm: batch or interactive?

I want to determine whether a Slurm job is a batch job or an interactive job.
It is possible to check the batch host with the following command, but is there a better way?
squeue -O "Name,BatchHost"
NAME EXEC_HOST
int login001
batch compute009
scontrol show job has a flag BatchFlag. The scontrol manpage says:
BatchFlag
Jobs submitted using the sbatch command have BatchFlag set to 1. Jobs submitted using other commands have BatchFlag set to 0.
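As a rough sketch of how that flag could be read in a script, the helper below parses the text printed by scontrol show job <job_id>. The function name job_kind and the labels it prints are assumptions for illustration. Treating any non-zero value as batch is also an assumption.

```shell
#!/bin/bash
# job_kind: read `scontrol show job <job_id>` output on stdin and print
# "batch" if BatchFlag is non-zero, "interactive" if it is 0 (that is, the job
# was submitted with a command other than sbatch).
# Returns non-zero if no BatchFlag field is found.
# The function name and the printed labels are illustrative choices.
job_kind() {
    local flag
    # Extract the value of the first BatchFlag=<n> field
    flag=$(grep -o 'BatchFlag=[0-9]*' | head -n 1 | cut -d= -f2)
    case "$flag" in
        '') return 1 ;;
        0)  echo interactive ;;
        *)  echo batch ;;   # assumption: any non-zero value means sbatch
    esac
}
```

A typical call would be scontrol show job <job_id> | job_kind.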

How can I find out the "command" (batch script filename) of a finished SLURM job?

I often have lots of SLURM jobs running from different directories. Therefore, it is useful to query the workdir of the jobs. I can do this for jobs in the queue (e.g. pending, running, etc.) something like this:
squeue -u $USER -o "%i %Z"
and I can do this for finished jobs (e.g. completed, timeout, cancelled, etc.) something like this:
sacct -u $USER -o JobID,WorkDir
The problem is, sometimes I have a directory with two (or more) SLURM batch scripts in it, e.g. submit.sh and restart.sh. Therefore, it is also useful to query the "command" of the jobs, i.e. the filename of the batch script. I can do this for jobs in the queue something like this:
squeue -u $USER -o "%i %o"
However, from checking the documentation of sacct and playing around with sacct, there appears to be no equivalent option for sacct so I cannot currently get the command for finished jobs. I also cannot use the squeue method for finished jobs - it just says slurm_load_jobs error: Invalid job id specified because finished jobs are not included in the squeue list. So, how can I find out the command of a finished SLURM job (using sacct or otherwise)?
Indeed, Slurm does not store the command in the accounting database. Two workarounds:
For a single user: use the JobName or Comment to store the script name upon submission. These are stored in the database, but this approach is error-prone;
Cluster-wide: enable the Elasticsearch job completion plugin, as this stores not only the script name but the whole contents of the script as well.
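The single-user workaround can be sketched as a small wrapper that sets the comment from the script's file name at submission time. The function name submit_tagged is hypothetical, and whether the Comment field actually reaches the accounting database depends on how the cluster is configured, so treat this as an assumption to verify locally.

```shell
#!/bin/bash
# submit_tagged: hypothetical wrapper around sbatch that records the batch
# script's file name in the job's Comment field at submission time.
# Usage: submit_tagged <script> [script args...]
submit_tagged() {
    local script="${1:-}"
    if [ -z "$script" ]; then
        echo "usage: submit_tagged <script> [args...]" >&2
        return 2
    fi
    shift
    # --comment is a regular sbatch option; basename strips the directory part
    sbatch --comment="$(basename "$script")" "$script" "$@"
}
```

After submitting with submit_tagged restart.sh, the script name could be queried for finished jobs with sacct -u $USER -o JobID,WorkDir,Comment, provided the cluster stores job comments. Using --job-name instead of --comment is the other variant mentioned above.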

SLURM how to know if a running job is a srun or a sbatch job?

I need to distinguish between batch and interactive jobs when they are in the RUNNING state.
I can't find a way with sacct or sstat to know whether a job is an interactive session.
Has anyone already solved a similar problem?
You can use the batchflag formatting keyword in the squeue command to infer if a job has been submitted using the sbatch command.
$ squeue --Format=batchflag -u ${USER} --states=RUNNING
From the BatchFlag description in the scontrol help page:
Jobs submitted using the sbatch command have BatchFlag set to 1. Jobs submitted using other commands have BatchFlag set to 0.
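To list only the non-batch jobs, the job ID can be requested alongside the flag and the result filtered. This is a minimal sketch; the function name interactive_job_ids is an assumption, and it expects two whitespace-separated columns as printed by squeue --noheader --Format=jobid,batchflag.

```shell
#!/bin/bash
# interactive_job_ids: read "<jobid> <batchflag>" lines on stdin, as printed by
#   squeue --noheader --Format=jobid,batchflag
# and print the IDs of jobs whose BatchFlag is 0.
# The function name is a hypothetical helper.
interactive_job_ids() {
    awk 'NF >= 2 && $2 == "0" { print $1 }'
}
```

For example: squeue --noheader --Format=jobid,batchflag -u ${USER} --states=RUNNING | interactive_job_ids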
