SLURM how to qsub a task when another task is finished?

I am currently using a Linux-based HPC cluster that uses only SLURM to submit jobs, and it only allows a job to run for 12 hours. However, I may need to run 24 jobs continuously for a week to get good results.
Is there a way to run a job again (automatically) when it is finished?
Kind regards
Add:
When the job is finished, a .out file is created. In other words, the number of .out files increases by 1.
Is it possible to requeue the job when the number of .out files increases?
#!/bin/bash
#!
#! Example SLURM job script for Darwin (Sandy Bridge, ConnectX3)
#! Last updated: Sat Apr 18 13:05:53 BST 2015
#!
#!#############################################################
#!#### Modify the options in this section as appropriate ######
#!#############################################################
#! sbatch directives begin here ###############################
#! Name of the job:
#SBATCH -J Validation
#! Which project should be charged:
#SBATCH -A SOGA
#! How many whole nodes should be allocated?
#SBATCH --nodes=1
#! How many (MPI) tasks will there be in total? (<= nodes*16)
#SBATCH --ntasks=1
#!SBATCH --mem=200
#! How much wallclock time will be required?
#SBATCH --time=12:00:00
#SBATCH --mail-user=zl352
#SBATCH --mail-type=ALL
#! Uncomment this to prevent the job from being requeued (e.g. if
#! interrupted by node failure or system downtime):
##SBATCH --no-requeue
#! Do not change:
#SBATCH -p sandybridge
#! sbatch directives end here (put any additional directives above this line)
#! Notes:
#! Charging is determined by core number*walltime.
#! The --ntasks value refers to the number of tasks to be launched by SLURM only. This
#! usually equates to the number of MPI tasks launched. Reduce this from nodes*16 if
#! demanded by memory requirements, or if OMP_NUM_THREADS>1.
#! Each task is allocated 1 core by default, and each core is allocated 3994MB. If this
#! is insufficient, also specify --cpus-per-task and/or --mem (the latter specifies
#! MB per node).
#! Number of nodes and tasks per node allocated by SLURM (do not change):
numnodes=$SLURM_JOB_NUM_NODES
numtasks=$SLURM_NTASKS
mpi_tasks_per_node=$(echo "$SLURM_TASKS_PER_NODE" | sed -e 's/^\([0-9][0-9]*\).*$/\1/')
#! ############################################################
#! Modify the settings below to specify the application's environment, location
#! and launch method:
#! Optionally modify the environment seen by the application
#! (note that SLURM reproduces the environment at submission irrespective of ~/.bashrc):
. /etc/profile.d/modules.sh # Leave this line (enables the module command)
module purge # Removes all modules still loaded
module load default-impi # REQUIRED - loads the basic environment
#! Insert additional module load commands after this line if needed:
#! Full path to application executable:
application="~/scratch/code7/viv"
#! Run options for the application:
options=" > test.e"
#! Work directory (i.e. where the job will run):
workdir="$SLURM_SUBMIT_DIR" # The value of SLURM_SUBMIT_DIR sets workdir to the directory
# in which sbatch is run.
#! Are you using OpenMP (NB this is unrelated to OpenMPI)? If so increase this
#! safe value to no more than 16:
export OMP_NUM_THREADS=1
#! Number of MPI tasks to be started by the application per node and in total (do not change):
np=$[${numnodes}*${mpi_tasks_per_node}]
#! The following variables define a sensible pinning strategy for Intel MPI tasks -
#! this should be suitable for both pure MPI and hybrid MPI/OpenMP jobs:
export I_MPI_PIN_DOMAIN=omp:compact # Domains are $OMP_NUM_THREADS cores in size
export I_MPI_PIN_ORDER=scatter # Adjacent domains have minimal sharing of caches/sockets
#! Notes:
#! 1. These variables influence Intel MPI only.
#! 2. Domains are non-overlapping sets of cores which map 1-1 to MPI tasks.
#! 3. I_MPI_PIN_PROCESSOR_LIST is ignored if I_MPI_PIN_DOMAIN is set.
#! 4. If MPI tasks perform better when sharing caches/sockets, try I_MPI_PIN_ORDER=compact.
#! Uncomment one choice for CMD below (add mpirun/mpiexec options if necessary):
#! Choose this for a MPI code (possibly using OpenMP) using Intel MPI.
#!CMD="mpirun -ppn $mpi_tasks_per_node -np $np $application $options"
#! Choose this for a pure shared-memory OpenMP parallel program on a single node:
#! (OMP_NUM_THREADS threads will be created):
CMD="$application $options"
#! Choose this for a MPI code (possibly using OpenMP) using OpenMPI:
#!CMD="mpirun -npernode $mpi_tasks_per_node -np $np $application $options"
###############################################################
### You should not have to change anything below this line ####
###############################################################
cd $workdir
echo -e "Changed directory to `pwd`.\n"
JOBID=$SLURM_JOB_ID
echo -e "JobID: $JOBID\n======"
echo "Time: `date`"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"
if [ "$SLURM_JOB_NODELIST" ]; then
#! Create a machine file:
export NODEFILE=`generate_pbs_nodefile`
cat $NODEFILE | uniq > machine.file.$JOBID
echo -e "\nNodes allocated:\n================"
echo `cat machine.file.$JOBID | sed -e 's/\..*$//g'`
fi
echo -e "\nnumtasks=$numtasks, numnodes=$numnodes, mpi_tasks_per_node=$mpi_tasks_per_node (OMP_NUM_THREADS=$OMP_NUM_THREADS)"
echo -e "\nExecuting command:\n==================\n$CMD\n"
eval $CMD

If your job is intrinsically restartable, all you need to do is call sbatch at the end of your submission script. Assuming the script is called submit.sh:
if ! job_is_done; then
    sbatch submit.sh
fi
The job_is_done part should be replaced by a command that returns 0 when the job is done (i.e. computation finished, process converged, etc.), for instance by grepping the log file for a message that marks completion.
You can also re-queue the job:
job_is_done || scontrol requeue $SLURM_JOB_ID
If your program is not intrinsically restartable, you could use a wrapper such as DMTCP to make it restartable.
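As an illustration of the job_is_done placeholder, here is a minimal sketch that greps the application's log for a completion message. The names LOG_FILE, DONE_MARKER and resubmit_unless_done are hypothetical, the marker text is an assumption about what the program prints, and test.e is simply the output file used in the question's script. It also assumes the application exits on its own before the wall-time limit, so that the end of the script is actually reached.

```shell
#!/bin/bash
# Hypothetical settings: adjust to whatever your program really writes.
LOG_FILE="${LOG_FILE:-test.e}"
DONE_MARKER="${DONE_MARKER:-Simulation finished}"

# Returns 0 only if the completion marker appears in the log file.
job_is_done() {
    [ -f "$LOG_FILE" ] && grep -qF -- "$DONE_MARKER" "$LOG_FILE"
}

# Submits submit.sh again unless the marker has been found.
resubmit_unless_done() {
    if ! job_is_done; then
        sbatch submit.sh
    fi
}

# Last line of submit.sh, after the application has exited:
#   resubmit_unless_done
```

Keeping the check in a function makes it easy to swap the grep for another test, such as the existence of a final result file.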

Related

Submit batch job to a server with linux but no slurm?

I used to have access to a slurm server where I would submit the following batch job:
#!/bin/bash
#SBATCH --job-name=sdmodel
#SBATCH --output=logs/out/%a
#SBATCH --error=logs/err/%a
#SBATCH --nodes=1
#SBATCH --partition=common,scavenger
#SBATCH -c 10
#SBATCH --mem-per-cpu=12GB
#SBATCH --array=1-236
module load Matlab/R2021a
matlab -nodisplay -r "run('main.m'); exit"
The new server is simply Linux (no slurm), so the sbatch command does not work. Is there any way to do something similar?
If the "new server" is the frontend for a cluster you want to use that runs a different batch scheduling system (there are several alternatives to SLURM out there), then you'll need to consult the documentation or sysadmin to identify the new batch scheduling system and then read its documentation.
If the new server is just a single interactive time-shared Linux server (as opposed to a batch-scheduled cluster), then you can probably execute your same script unmodified directly from the command line. One of the benefits of the #SBATCH directive format is that the directives are just comments to bash and will be ignored when the script is executed interactively by the shell.
If your question is actually asking how to run your script in the background and capture the output into files (in a manner similar to execution of your script under SLURM), you could try a command like the following (assuming your script above is named myscript.sh):
$ mkdir -p logs/{out,err}
$ id=`date +%Y-%m-%d_%H:%M:%S` ; echo "Running $id" ; nohup bash myscript.sh >logs/out/$id 2>logs/err/$id &
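If the goal is also to reproduce the --array=1-236 behaviour on a single machine, one possible sketch is a small loop that sets SLURM_ARRAY_TASK_ID itself and writes one log file per index, as %a did. The function name run_array and its parameters are hypothetical (they are not part of Slurm), and the sketch assumes the script reads the index from the SLURM_ARRAY_TASK_ID environment variable, which the question does not show.

```shell
#!/bin/bash
# run_array FIRST LAST MAX_PARALLEL COMMAND [ARGS...]
# Runs COMMAND once per index, at most MAX_PARALLEL at a time, with
# SLURM_ARRAY_TASK_ID set to the index and output in logs/out/<index>.
run_array() {
    local first=$1 last=$2 max_parallel=$3
    shift 3
    mkdir -p logs/out logs/err
    local i=$first running=0
    while [ "$i" -le "$last" ]; do
        SLURM_ARRAY_TASK_ID=$i "$@" >"logs/out/$i" 2>"logs/err/$i" &
        running=$((running + 1))
        # Simple throttle: once a batch is full, wait for all of it.
        if [ "$running" -ge "$max_parallel" ]; then
            wait
            running=0
        fi
        i=$((i + 1))
    done
    wait   # wait for the last, possibly partial, batch
}

# Example call (4 parallel tasks is an arbitrary choice):
#   run_array 1 236 4 bash myscript.sh
```

Waiting for a whole batch is less efficient than refilling slots as tasks finish, but it keeps the sketch short and avoids overloading the machine.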

Run on-worker setup programs in SLURM sbatch script

How do I run setup code in a SLURM sbatch script? Can I just use two srun lines?
Are these two srun lines guaranteed to run on the same node, without cleanup in between?
#!/bin/bash
# Parameters
#SBATCH ...
# setup
srun cp /nfs/data $TMPDIR
# job
srun a.out $TMPDIR
The srun command will start as many instances of the command as specified with the --ntasks parameter. It is typically used with MPI programs and programs that run embarrassingly parallel workloads.
A command like srun cp ... only makes sense when multiple nodes are requested and only one task runs per node, for instance with --nodes=N, or --ntasks=N --ntasks-per-node=1, or a similar combination. It can then be used to copy files from a network filesystem to each node's local filesystem.
If there is only one node and multiple tasks, the srun could cause problems, because the tasks would concurrently try to write to the same file.
If there is only one task, the srun calls are not really needed (unless you want to use sstat to monitor them).
In any case, consecutive srun calls run on the same set of nodes, with no cleanup in between.
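A minimal sketch of the "one copy per node, then the real step" pattern described above. The function name stage_then_run is hypothetical; --ntasks and --ntasks-per-node are standard srun options, and cp -r is used here on the assumption that /nfs/data may be a directory.

```shell
#!/bin/bash
# stage_then_run SRC DEST COMMAND [ARGS...]
stage_then_run() {
    local src=$1 dest=$2
    shift 2
    # Step 1: exactly one copy task on each allocated node, so that
    # no two tasks write to the same destination on the same node.
    srun --ntasks="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 \
        cp -r "$src" "$dest" || return 1
    # Step 2: the actual job step, using the allocation's default task count.
    srun "$@" "$dest"
}

# In the job script from the question this would be called as:
#   stage_then_run /nfs/data "$TMPDIR" ./a.out
```

Stopping when the copy step fails avoids launching the main step on nodes where the data is missing.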

Slurm: Why do we need Srun in Sbatch script file?

I am new to Slurm and, although I found related questions about this topic, I am still confused about several points of how to use srun. According to the official documentation, srun will typically first allocate resources and then run the parallel job. For example, I want to run 20 tasks, and if I submit my job with the following script, I am not sure how many tasks are created, because sbatch only takes care of allocating resources rather than executing the program.
#!/bin/sh
#SBATCH -n 20
#SBATCH --mpi=pmi2
#SBATCH -o myoutputfile.txt
module load mpi/mpich-x86_64
mpirun mpiprogram < inputfile.txt
If I run a sequential program like the following, I am not sure whether there will be a difference. For example, what will happen if I simply remove the srun commands from this script?
#!/bin/sh
#SBATCH -n 1
#SBATCH -N 1
srun tar zxf julia-0.3.11.tar.gz
echo "prefix=/software/julia-0.3.11" > julia/Make.user
cd julia
srun make
The first example will spawn 20 tasks: sbatch will request 20 CPUs and also set up the environment so that mpirun knows how many CPUs were requested for the job. mpirun will then spawn as many processes as were allocated (provided that OpenMPI was compiled with Slurm support).
The #SBATCH --mpi=pmi2 part is meant for srun so it will have no effect if srun is not called in the submission script.
In the second example, there will be no difference in the number of processes spawned as only one is needed. But, with srun, the output of sstat will be more reliable, the management of signals will be more precise, and the buffering of the output will be more controlled (via the srun command line options).
If you request multiple tasks, srun will instantiate that many processes. It can be an MPI program, or a sequential program that adapts its behaviour based on the SLURM_PROCID environment variable.
Also you can run multiple srun in the same submission script. Each instance of srun (called a "step") is then accounted separately in the accounting (sacct).
Finally, srun can use a subset of the allocation and organise the micro-scheduling of many small tasks in a single job (see the example in the srun manpage).
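To illustrate a sequential program adapting to its rank, here is a minimal sketch of a worker script in which each task picks its own slice of the work from SLURM_PROCID and SLURM_NTASKS, the variables srun sets for every task. The script name worker.sh, the function chunk_bounds and the figure of 100 work items are hypothetical.

```shell
#!/bin/bash
# worker.sh (hypothetical); launched from the job script as: srun ./worker.sh

# chunk_bounds TOTAL NTASKS RANK
# Prints the first and last item index (0-based, inclusive) for RANK when
# TOTAL items are split as evenly as possible among NTASKS tasks.
chunk_bounds() {
    local total=$1 ntasks=$2 rank=$3
    local base=$(( total / ntasks ))
    local extra=$(( total % ntasks ))
    local start=$(( rank * base + (rank < extra ? rank : extra) ))
    local size=$(( base + (rank < extra ? 1 : 0) ))
    echo "$start $(( start + size - 1 ))"
}

# Outside srun the variables are unset, so fall back to one task with rank 0.
bounds=$(chunk_bounds 100 "${SLURM_NTASKS:-1}" "${SLURM_PROCID:-0}")
echo "task ${SLURM_PROCID:-0} handles items ${bounds% *} to ${bounds#* }"
```

With 10 items and 4 tasks, for example, ranks 0 and 1 get three items each and ranks 2 and 3 get two each.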

GPU allocation within a SBATCH

I have access to a large GPU cluster (20+ nodes, 8 GPUs per node) and I want to launch a task several times on n GPUs (1 per GPU, n > 8) within one single batch without booking full nodes with the --exclusive flag.
I managed to pre-allocate the resources (see below), but I am struggling to launch the task several times within the job. Specifically, my log shows no value for the CUDA_VISIBLE_DEVICES variable.
I know how to do this on fully booked nodes with the --nodes and --gres flags. In that situation, I use --nodes=1 --gres=gpu:1 for each srun. However, that solution does not work here: the job hangs indefinitely.
In the MWE below, I have a job asking for 16 GPUs (--ntasks and --gpus-per-task). The job is composed of 28 tasks, which are launched with the srun command.
#!/usr/bin/env bash
#SBATCH --job-name=somename
#SBATCH --partition=gpu
#SBATCH --nodes=1-10
#SBATCH --ntasks=16
#SBATCH --gpus-per-task=1
for i in {1..28}
do
    srun echo $(hostname) $CUDA_VISIBLE_DEVICES &
done
wait
The output of this script should look like this:
nodeA 1
nodeR 2
...
However, this is what I got:
nodeA
nodeR
...
When you write
srun echo $(hostname) $CUDA_VISIBLE_DEVICES &
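the $(hostname) and $CUDA_VISIBLE_DEVICES expansions are performed by the batch shell on the master node of the allocation (where the script runs), before srun starts, rather than on the node targeted by srun. Escaping the $ alone is not enough, because srun executes the command directly, with no shell on the target node to expand the variable. Wrap the command in a shell and single-quote it so that the expansion happens inside the task:
srun bash -c 'echo $(hostname) $CUDA_VISIBLE_DEVICES' &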
By the way, the --gpus-per-task option appeared in the sbatch manpage in version 19.05. I am not sure how it behaves with an earlier version.
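The difference between expanding a variable in the submitting shell and expanding it inside the launched task can be reproduced without Slurm. In this sketch, run_task is a hypothetical stand-in for srun and DEMO_DEVICES a stand-in for CUDA_VISIBLE_DEVICES: the variable exists only in the environment of the launched command, just as the GPU variable exists only inside the job step.

```shell
#!/bin/bash
# run_task is a hypothetical stand-in for srun: it runs the given command
# directly (no shell in between) with DEMO_DEVICES set only for that command.
run_task() { env DEMO_DEVICES=3 "$@"; }

unset DEMO_DEVICES   # the variable is not set in the "submitting" shell

# Expanded too early: the calling shell substitutes the (empty) value
# before run_task is even started.
early=$(run_task echo "devices=$DEMO_DEVICES")

# Expanded inside the task: single quotes pass the text through untouched,
# and the shell started by the task performs the expansion.
late=$(run_task sh -c 'echo "devices=$DEMO_DEVICES"')

echo "early: $early"   # prints "early: devices="
echo "late:  $late"    # prints "late:  devices=3"
```

The same reasoning applies to $(hostname): unless it is inside the single-quoted string, it is evaluated once by the submitting shell.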

Run a "monitor" task alongside mpi task in SLURM

I've got an mpi job I run in slurm using an sbatch script which looks something like:
# request 384 processors across 16 nodes for exclusive use:
#SBATCH --exclusive
#SBATCH --ntasks-per-node=24
#SBATCH -n 384
#SBATCH -N 16
#SBATCH --time 3-00:00:00
mpirun myprog
I want to monitor the memory/cpu usage and some other behaviour of the "myprog" processes. I've written a simple script (call it "monitor") which can do this, but I'm stumped on how to use sbatch to run ONE copy of it on each allocated node, at the same time as "myprog".
I think I need to modify the above to something like:
...
srun monitor
mpirun myprog
But I'm confused about whether a) that means "monitor" will run in the background and b) how I can control where "monitor" runs.
To have monitor run 'in the background', so that the srun is non-blocking and the subsequent mpirun command can start, simply add an ampersand (&) at the end of the srun line.
To make sure the program runs only on the 'master node' of the allocation, just remove the srun command.
If you need the program to run on a specific node, use the -n1 and --nodelist options (you will probably need to get the list of allocated nodes first). You should also consider using the --overcommit option of srun to avoid dedicating a full CPU to your monitoring program, which I assume is not CPU-bound.
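Putting these pieces together, a possible sketch starts the monitor in the background, runs the main program, and stops the monitor afterwards. The function names monitor_step and run_with_monitor are hypothetical; --overcommit, --ntasks and --ntasks-per-node are standard srun options, but whether two steps may share the nodes can depend on the Slurm version and site configuration, so treat the srun line as an assumption to check locally.

```shell
#!/bin/bash
# Hypothetical wrapper for the monitoring step: one "monitor" per node.
# 'exec' replaces the background subshell with srun, so that $! below is
# the PID of srun itself. Only call this function with a trailing '&'.
monitor_step() {
    exec srun --overcommit --ntasks="$SLURM_JOB_NUM_NODES" \
        --ntasks-per-node=1 monitor
}

# run_with_monitor MONITOR_FUNCTION MAIN_COMMAND [ARGS...]
run_with_monitor() {
    local monitor_cmd=$1
    shift
    "$monitor_cmd" &              # '&' makes the monitor non-blocking
    monitor_pid=$!
    local status=0
    "$@" || status=$?             # run the main program in the foreground
    kill "$monitor_pid" 2>/dev/null || true
    wait "$monitor_pid" 2>/dev/null || true
    return "$status"
}

# In the job script:
#   run_with_monitor monitor_step mpirun myprog
```

Returning the main program's exit status keeps the job's final state tied to myprog rather than to the monitor.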
