How can I set the time of a grouped job in the Snakemake file? - slurm

I have a rule that I want to run many times, e.g.,
rule my_rule:
    input:
        expand(my_input, x=xs)
    output:
        expand(my_output, x=xs)
    threads: 1
    resources:
        time = T
    shell:
        '''
        {my_command}
        '''

rule run_all:
    input:
        expand(my_output, x=xs)
where T is an integer specifying the maximum number of minutes to allocate to this rule.
I want to parallelize this on SLURM and so run a command like
snakemake run_all --groups my_rule=my_rule --group-components my_rule=N --jobs 1 --profile SLURM
where N is an integer specifying the number of jobs to run in parallel.
Doing so asks SLURM for T*N minutes, but since the jobs run in parallel, all I actually need is T minutes.
At the moment I work around this by removing the time = T line and editing my ~/.config/snakemake/slurm/cluster_config.yaml file (setting time: T). But this is a pain to do and makes my pipeline less repeatable.
Is there a way to set the resources in the Snakemake file so that when grouping N copies of a rule that takes time T we only ask SLURM for T time? Or am I going about this the wrong way?
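One hedged thing to try, assuming a Snakemake release recent enough to have --set-resources: override the rule's time at submission time instead of editing the profile. Note that this only replaces the per-rule value; whether the grouped submission then asks SLURM for T or T*N still depends on how your Snakemake version aggregates resources across group components, so treat it as a sketch to test rather than a guaranteed fix:
snakemake run_all \
    --groups my_rule=my_rule \
    --group-components my_rule=N \
    --jobs 1 \
    --set-resources my_rule:time=T \
    --profile SLURM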

Related

How to make only a subset of sbatch jobs run at a time

I have a bash script (essentially what it does is align all files in a directory to all the files of another specified directory). The number of jobs gets quite large if there are 100 files being aligned individually to another 100 files (10,000 jobs), which all get submitted to slurm individually.
I have been doing them in batches manually but I think there must be a way to include it in a script so that, for example, only 50 jobs are running at a time.
I tried
$ sbatch --array [1-1000]%50
but it didn't work
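The likely problems are the brackets around the range and the missing job script: sbatch writes the throttle with % directly after the range and needs a batch script to submit. A hedged example, where align_pair.sh stands in for the actual alignment script, which keeps at most 50 of the 1000 array tasks running at once:
sbatch --array=1-1000%50 align_pair.sh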

Scheduling more jobs than MaxArraySize

Let's say I have 6233 simulations to run. The commands are generated and stored in a file, one in each line. I would like to use Slurm to schedule and run these commands. However, the MaxArraySize limit is 2000. So I can't use one job array to schedule all of them.
One solution is given here, where we create four separate jobs and use arithmetic indexing into the file, with the last job having a smaller number of tasks to run (233).
Is it possible to do this using one sbatch script with one job ID?
I set ntasks=1 when using job arrays. Do larger ntasks help in such situations?
Update:
Following Damien's solution and examples given here, I ended up with the following line in my bash script:
curID=$(( ${SLURM_ARRAY_TASK_ID} * ${SLURM_NTASKS} + ${SLURM_PROCID} ))
The same can be done using Python (shown in the referenced page). The only difference is that the environment variables should be imported into the script.
Is it possible to do this using one sbatch script with one job ID?
No, that solution will give you multiple job IDs.
I set ntasks=1 when using job arrays. Do larger ntasks help in such situations?
Yes, that is a factor that you can leverage.
Each job in the array can spawn multiple tasks (--ntasks=...). In that case, the line number in the command file must be computed from $SLURM_ARRAY_TASK_ID and $SLURM_PROCID, and the program must be started with srun. Each task in a job member of the array will run in parallel. How large the job can be will depend on the MaxJobsize limit defined on the cluster/partition/qos you have access to.
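A hedged sketch of this first variant, assuming the 6233 commands sit one per line in a file called commands.txt, and picking 39 array members of 160 tasks each (both sizes are arbitrary; adjust them to your MaxArraySize and job-size limits):
#!/bin/bash
#SBATCH --array=0-38          # 39 array members (assumption)
#SBATCH --ntasks=160          # 160 tasks per member; 39 * 160 >= 6233 (assumption)
#SBATCH --cpus-per-task=1

# srun starts SLURM_NTASKS copies of the inner shell in parallel; each copy
# derives its own line number from the array index and its task rank.
srun bash -c '
    line=$(( SLURM_ARRAY_TASK_ID * SLURM_NTASKS + SLURM_PROCID + 1 ))  # +1: sed lines are 1-based
    cmd=$(sed -n "${line}p" commands.txt)
    if [ -n "$cmd" ]; then eval "$cmd"; fi
'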
Another option is to chain the tasks inside each job of the array, with a Bash loop (for i in $(seq ...) ; do ...; done). In that case, the line number in the command file must be computed from $SLURM_ARRAY_TASK_ID and $i. Each task in a job member of the array will run serially. How large the job can be will depend on the MaxWall limit defined on the cluster/partition/qos you have access to.
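And a hedged sketch of the second, serial variant, under the same commands.txt assumption, with each of 63 array members working through 100 consecutive lines:
#!/bin/bash
#SBATCH --array=0-62          # 63 members x 100 commands each >= 6233 (assumption)

chunk=100                     # commands handled serially per array member (assumption)
for i in $(seq 0 $(( chunk - 1 ))); do
    line=$(( SLURM_ARRAY_TASK_ID * chunk + i + 1 ))   # +1: sed lines are 1-based
    cmd=$(sed -n "${line}p" commands.txt)
    [ -n "$cmd" ] || break    # past the end of the file: nothing left to do
    eval "$cmd"
done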

one input file to yield many output files

This is a bit of a backwards approach for Snakemake, whose main paradigm is "one job -> one output", but I need many reruns of my script in parallel on the same input matrix, on a SLURM batch-submission cluster. How do I achieve that?
I tried specifying multiple threads and multiple nodes, each time indicating one CPU per task, but it never submits an array of many jobs, just an array of one job.
I don't think there is a nice way to submit an array job like that. In snakemake, you need to specify a unique output for each job. But you can have the same input. If you want 1000 runs of a job:
ids = range(1000)

rule all:
    input: expand('output_{sample}_{id}', sample=samples, id=ids)

rule simulation:
    input: 'input_{sample}'
    output: 'output_{sample}_{id}'
    shell: 'echo {input} > {output}'
If that doesn't help, provide more information about the rule/job you are trying to run.
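To actually fan these out on the cluster you would then just raise the job limit when invoking Snakemake. A minimal example, assuming your cluster profile is named slurm and 100 is an arbitrary concurrency cap:
snakemake all --jobs 100 --profile slurm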

Using option --array as an argument in slurm

Is it possible to use the --array option as an argument? I have an R script that uses job arrays, and the number of array tasks depends on the file the script runs on. I would like to pass the number of array tasks as an argument on the sbatch my_code.R command line, so that I never have to modify my Slurm script: for example, a file with 550,000 columns needs 10 array tasks, a file with 1,000,000 columns needs 19, and so on. I am after something like "sbatch --array 1-nb_of_arrays_needed my_code.R". The goal is to make the code usable by everyone, without the user having to edit the line #SBATCH --array=x-y in the Slurm script.
My R code (I don't show it in full) :
data<-read.table(opt$file, h=T, row.names=1, sep="\t")+1
ncol=ncol(data)
nb_arrays=ncol/55000
nb_arrays=round(nb_arrays)
opt$number=nb_arrays
...
Your R script will start only when the job is scheduled. To be scheduled, it must be submitted, and to be submitted, it must know the argument to --array.
So you have two options:
Either split your R script in two: one part that runs before the job is submitted and another that runs when the job starts. The first part computes the necessary number of jobs in the array (and can submit the job array automatically) and the second part does the actual calculations.
If you prefer having only one R script, you can differentiate the behaviour based on the presence or absence of the SLURM_JOB_ID variable in the environment: if it is not set, compute the number of jobs and submit; if it is set, do the actual calculations.
The other option is to set --array in the submission job to a large value, and when the first job in the array starts, it computes the number of jobs that are necessary, and cancels the superfluous jobs.
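As a minimal sketch of the first option, assuming the R code is launched by a batch script named my_code.sh and that each array task handles 55,000 columns, a small shell wrapper can compute the range and submit:
#!/bin/bash
# wrapper.sh (hypothetical): derive the array size from the input file, then submit
file="$1"
ncol=$(head -n 1 "$file" | awk -F'\t' '{print NF}')   # count columns in the header line
nb_arrays=$(( (ncol + 54999) / 55000 ))               # ceiling of ncol / 55000
sbatch --array=1-"$nb_arrays" my_code.sh "$file"
The user then runs the wrapper on their data file and never touches the #SBATCH --array line.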

How to submit a list of jobs to PBS system in a proper way?

I can only submit 10 jobs to the PBS system at the same time.
I have 99 independent scripts, and I would like a single script that gets all 99 of them run with one command.
jobs1.sh
jobs2.sh
jobs3.sh
.
.
jobs99.sh
The idea is to submit the next 10 jobs once the previous 10 have finished.
Currently I separate every 10 jobs with sleep, estimating how much time they need, which I know is not a nice way to do it.
You need to check out the PBS section about dependencies. There are good ways of queuing jobs to run after the previous ones have finished. For example:
qsub -W depend=afterok:Job_ID job11.sh
where Job_ID is the job ID that was returned when job1.sh was submitted.
So job11 runs only after job1 has finished. You can elaborate on this idea and set up a loop.
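A hedged sketch of such a loop, submitting the scripts in batches of 10 and holding each batch until every job of the previous batch has finished successfully (use afterany instead of afterok if they should start regardless of exit status):
#!/bin/bash
prev_ids=""                                   # colon-separated job IDs of the previous batch
for start in $(seq 1 10 99); do
    ids=""
    for i in $(seq "$start" $(( start + 9 ))); do
        [ -f "jobs${i}.sh" ] || continue      # the last batch only has 9 scripts
        if [ -n "$prev_ids" ]; then
            id=$(qsub -W depend=afterok"$prev_ids" "jobs${i}.sh")
        else
            id=$(qsub "jobs${i}.sh")          # first batch has nothing to wait for
        fi
        ids="${ids}:${id}"
    done
    prev_ids="$ids"
done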
