Bash variable expansion in Slurm's #SBATCH directive

Is it possible to use variable expansion in #SBATCH lines in Slurm? For instance, I want a line like the one below:
#SBATCH --array=0-100%{$1-10}
so that by default it uses 10 concurrent jobs unless I pass an argument when I call sbatch.
The above gives me an "Invalid job array specification" error.

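No, this isn't possible: sbatch parses #SBATCH lines itself, and the shell never expands them. But you can override the script's default --array by giving the option explicitly on the sbatch command line, since command-line options take precedence over #SBATCH directives.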
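As a rough sketch of that approach, a small wrapper can supply the default of 10 and let an optional argument replace it. The names array_spec and job.sh are hypothetical, and the sketch only prints the sbatch command instead of running it:

```shell
# Hypothetical wrapper: the first argument sets the concurrency limit of
# the job array, defaulting to 10 when it is omitted.
array_spec() {
    # ${1:-10} expands to the first argument, or to 10 if it is unset or empty.
    printf '0-100%%%s\n' "${1:-10}"
}

# Print the command rather than running it, so the sketch can be tried on a
# machine without Slurm; remove the leading "echo" to submit for real.
# job.sh is a placeholder name for the batch script.
echo sbatch --array="$(array_spec "$@")" job.sh
```

The default lives in the wrapper because that is the only place where the shell gets a chance to expand it before sbatch sees the option.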

Related

Having issue with slurm. error: "no such file or directory"

I'm trying to run a Slurm script using sbatch <script.sh>. However, despite checking my path variable multiple times, I get a "file not found" error. I also get a "cannot import absolute path" error, which I think has to do with my Go environment. I'm not sure what the issue is. My batch script and the error output are below.
#!/bin/bash
#SBATCH --partition production
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=5:00:00
#SBATCH --mem=2GB
#SBATCH --job-name=myTest
#SBATCH --mail-type=END
#SBATCH --mail-user=atd341#nyu.edu
#SBATCH --output=slurm_%j.out
module purge
module load go/1.17
##RUNDIR=${SCRATCH}/run-${SLURM_JOB_ID/.*}
##mkdir -p ${RUNDIR}
DATADIR=${SCRATCH}/inmap_sandbox
cd $SLURM_WORK_DIR
source $DATADIR/setup.sh
go run $DATADIR/
Here is the output:
/var/spool/slurmd/job16296/slurm_script: line 19: /inmap_sandbox/setup.sh: No such file or directory
import "/inmap_sandbox": cannot import absolute path
I have tried checking my path variable and making sure I'm following the correct path. For reference, my directory structure is /scratch/inmap_sandbox, and I'm running the sbatch file from the /scratch directory.
Offhand, it appears the ${SCRATCH} variable is not set in the environment running the script: the error mentions /inmap_sandbox/setup.sh, which is what ${SCRATCH}/inmap_sandbox/setup.sh becomes when ${SCRATCH} is empty. Try explicitly setting it to /scratch.
Once you get past that problem, note that if this batch script is running on a compute node that is separate from the frontend node you are using interactively, then they might not both mount the same ${SCRATCH} file system (or possibly mount it in different places).
Consult the system documentation to find out which file systems are shared between the frontend and the compute nodes. You might even need to pass SLURM capability options to request certain shared filesystems. In the absence of documentation, comparing the output of mount on the frontend and from within the batch script might be helpful. More specifically, add the mount command on a line by itself early in your batch script, and compare the output it generates to the output of the same command on the frontend.
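One way to make the batch script fail loudly instead of silently building a wrong path is sketched below. It assumes /scratch is the right fallback (taken from the directory layout given in the question), and resolve_datadir is a hypothetical helper name:

```shell
# Fall back to /scratch (the location given in the question) when SCRATCH
# is unset or empty in the job environment.
resolve_datadir() {
    printf '%s/inmap_sandbox\n' "${SCRATCH:-/scratch}"
}

DATADIR=$(resolve_datadir)
echo "SCRATCH=${SCRATCH:-<unset>} DATADIR=$DATADIR"

# Report a missing setup.sh together with the host name, which helps to spot
# a file system that is not mounted on the compute node. A real batch script
# would probably also "exit 1" at this point.
if [ ! -f "$DATADIR/setup.sh" ]; then
    echo "setup.sh not found under $DATADIR on $(hostname)" >&2
fi
```

Separately, the script's cd $SLURM_WORK_DIR may be worth checking: as far as I know, the variable the sbatch man page documents for the submission directory is SLURM_SUBMIT_DIR.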

How to get SLURM task ID in program

I'm running srun -n 100 python foo.py. Inside the Python script, how can it find out which task number/ID/rank it is? Is there an environment variable set?
Have a look at man srun or man sbatch for a list of environment variables. $SLURM_PROCID (the MPI rank, or relative process ID, of the current task) might be the one you need.
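A minimal shell sketch of reading that variable, with a fallback so it also runs outside Slurm (task_rank is a hypothetical helper name). In Python the same value should be available through os.environ:

```shell
# SLURM_PROCID is set by srun for each task; default to 0 so the sketch
# also works when it is run outside of a Slurm job.
task_rank() {
    printf '%s\n' "${SLURM_PROCID:-0}"
}

# SLURM_NTASKS holds the total number of tasks; default to 1 outside Slurm.
echo "task $(task_rank) of ${SLURM_NTASKS:-1}"
```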

Where can I find documentation on the syntax of a "batch script"

Is the syntax (and semantics, for that matter) of "batch scripts" for sbatch formally documented anywhere?
(I'm looking for formal documentation, as opposed to examples.)
The DESCRIPTION section of the man page for sbatch begins with this paragraph:
sbatch submits a batch script to Slurm. The batch script may be given to sbatch through a file name on the command line, or if no file name is specified, sbatch will read in a script from standard input. The batch script may contain options preceded with "#SBATCH" before any executable commands in the script.
That's about all I can find in the sbatch man page as to the syntax of a "batch script".
It says nothing, for example, about the fact that this script is required to begin with a shebang line. (One may infer this requirement, however, from the fact that all the examples in the EXAMPLES section meet it.)
It also says nothing of what interpreter one should put on the shebang line. Again, from the examples in the EXAMPLES section one may infer that /bin/sh is suitable, but one would have no reason to think that /bin/bash or /bin/zsh is also suitable (let alone, e.g., /bin/perl or /bin/python or /bin/ruby, etc.).
The documentation indeed only specifies 'batch script', which is to be understood as 'non-interactive script'.
Only if you try to submit a compiled program are you told about the shebang:
$ sbatch /usr/bin/time
sbatch: error: This does not look like a batch script. The first
sbatch: error: line must start with #! followed by the path to an interpreter.
sbatch: error: For instance: #!/bin/sh
But you can submit a script written in any language that uses # as its comment symbol; the most common in that context are Bash, Python and Perl. You can also use, for instance, Lua, but because Lua comments start with -- rather than #, you then cannot incorporate the resource requirements in the script with #SBATCH directives.
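The shebang requirement can be checked before submitting. The following is only a sketch of such a check, not what sbatch itself does internally; has_shebang is a hypothetical helper name:

```shell
# Return success if the file starts with "#!", the one requirement that
# the sbatch error message above spells out.
has_shebang() {
    # Read only the first two bytes of the file and compare them to "#!".
    [ "$(head -c 2 "$1")" = '#!' ]
}

# Small demonstration on a temporary file.
demo=$(mktemp)
printf '#!/bin/bash\n#SBATCH --time=00:01:00\nhostname\n' > "$demo"
if has_shebang "$demo"; then
    echo "looks like a batch script"
fi
rm -f "$demo"
```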

Stop slurm sbatch from copying script to compute node

Is there a way to stop sbatch from copying the script to the compute node? For example, when I run:
sbatch --mem=300 /shared_between_all_nodes/test.sh
test.sh is copied to /var/lib/slurm-llnl/slurmd/etc/ on the executing compute node. The trouble with this is there are other scripts in /shared_between_all_nodes/ that test.sh needs to use and I would like to avoid hard coding the path.
In SGE I could use qsub -b y to stop it from copying the script to the compute node. Is there a similar option or config in Slurm?
Using sbatch --wrap is a nice solution for this:
sbatch --wrap /shared_between_all_nodes/test.sh
Quotes are required if the script has parameters:
sbatch --wrap "/shared_between_all_nodes/test.sh param1 param2"
From the sbatch docs (http://slurm.schedmd.com/sbatch.html):
--wrap=
Sbatch will wrap the specified command string in a simple "sh" shell script, and submit that script to the slurm controller. When --wrap is used, a script name and arguments may not be specified on the command line; instead the sbatch-generated wrapper script is used.
The script might be copied there, but the working directory will be the directory in which the sbatch command is launched. So if the command is launched from /shared_between_all_nodes/ it should work.
To be able to launch sbatch from anywhere, use this option:
-D, --workdir=<directory>
Set the working directory of the batch script to directory before
it is executed.
For example (recent Slurm releases spell the long option --chdir; the short form -D is unchanged):
sbatch --mem=300 -D /shared_between_all_nodes /shared_between_all_nodes/test.sh
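Inside the batch script, sibling scripts can then be located without hard-coding the path or relying on the copied script's own location. This sketch assumes the job is submitted from the shared directory, uses SLURM_SUBMIT_DIR (documented as the directory from which sbatch was invoked), and falls back to the current directory; script_base and helper.sh are hypothetical names:

```shell
# Resolve helper scripts relative to the submission directory instead of
# the location of the (copied) batch script.
script_base() {
    # Outside Slurm, SLURM_SUBMIT_DIR is unset, so use the current directory.
    printf '%s\n' "${SLURM_SUBMIT_DIR:-$PWD}"
}

BASE=$(script_base)
echo "helpers resolved under: $BASE"
# A helper would then be called as "$BASE/helper.sh" (hypothetical name).
```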

Use Bash variable within SLURM sbatch script

I'm trying to obtain a value from another file and use this within a SLURM submission script. However, I get an error that the value is non-numerical, in other words, it is not being dereferenced.
Here is the script:
#!/bin/bash
# This reads out the number of procs based on the decomposeParDict
numProcs=`awk '/numberOfSubdomains/ {print $2}' ./meshModel/decomposeParDict`
echo "NumProcs = $numProcs"
#SBATCH --job-name=SnappyHexMesh
#SBATCH --output=./logs/SnappyHexMesh.log
#
#SBATCH --ntasks=`$numProcs`
#SBATCH --time=240:00
#SBATCH --mem-per-cpu=4000
#First run blockMesh
blockMesh
#Now decompose the mesh
decomposePar
#Now run snappy in parallel
mpirun -np $numProcs snappyHexMesh -parallel -overwrite
When I run this as a normal Bash shell script, it prints out the number of procs correctly and makes the correct mpirun call. Thus the awk command parses out the number of procs correctly and the variable is dereferenced as expected.
However, when I submit this to SLURM using:
sbatch myScript.sh
I get the error:
sbatch: error: Invalid numeric value "`$numProcs`" for number of tasks.
Can anyone help with this?
This won't work. What happens when you run
sbatch myscript.sh
is that Slurm parses the script for those special #SBATCH lines, generates a job record, and stores the batch script somewhere. The batch script is executed only later, when the job runs, so the shell has not yet assigned $numProcs when the directives are read.
So you need to structure your workflow slightly differently and calculate the number of procs you need before submitting the job. Note that you can use something like
sbatch -n $numProcs myscript.sh
so you don't need to autogenerate the script. (Also, mpirun should be able to get the number of procs in your allocation automatically, so there is no need to use "-np".)
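The pre-submission calculation could be wrapped in a small submit script along these lines. It is a sketch: read_num_procs is a hypothetical helper, the stripping of non-digit characters (such as a trailing ";") is a precaution I have added, and the sbatch command is only printed:

```shell
# Read the subdomain count from a decomposeParDict-style file.
read_num_procs() {
    # Keep only the digits of the second field and stop at the first match.
    awk '/numberOfSubdomains/ { gsub(/[^0-9]/, "", $2); print $2; exit }' "$1"
}

dict=./meshModel/decomposeParDict
if [ -f "$dict" ]; then
    numProcs=$(read_num_procs "$dict")
    # Print the command instead of running it; remove "echo" to submit.
    echo sbatch -n "$numProcs" myScript.sh
fi
```

With this arrangement the #SBATCH --ntasks line can simply be dropped from the batch script, because the value arrives on the command line.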
Slurm stops processing #SBATCH directives at the first line of executable code in a script. For users whose #SBATCH directives do not depend on the code placed above them, the fix is simply to put the #SBATCH lines at the top.
See the other answer for a workaround/solution if, as with OP, your sbatch options are dependent on the commands you've placed above them.
The batch script may contain options preceded with "#SBATCH" before
any executable commands in the script. sbatch will stop processing
further #SBATCH directives once the first non-comment non-whitespace
line has been reached in the script.
From the sbatch docs, my emphasis.
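To see which directives in a given script fall before the first executable line, the quoted rule can be approximated with a few lines of awk. This is an approximation of the documented behaviour, not sbatch's actual parser, and honored_directives is a hypothetical helper name:

```shell
# Print the #SBATCH lines that appear before the first non-comment,
# non-whitespace line of a script.
honored_directives() {
    awk '
        /^#SBATCH/  { print; next }   # directive still in the header
        /^[ \t]*$/  { next }          # blank line: keep scanning
        /^[ \t]*#/  { next }          # shebang or ordinary comment
                    { exit }          # first executable line: stop
    ' "$1"
}

# Demonstration on a script shaped like the one in the question.
demo=$(mktemp)
cat > "$demo" <<'EOF'
#!/bin/bash
#SBATCH --job-name=SnappyHexMesh
numProcs=4
#SBATCH --ntasks=4
EOF
honored_directives "$demo"
rm -f "$demo"
```

For the demonstration file only the --job-name line is printed, because the numProcs assignment ends the header before the --ntasks directive is reached.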
