How to generate batch files in a loop in Python (not a loop in a batch file) with slightly changed parameters per iteration - Linux

I am a researcher who needs to run files for a set of years on a SLURM system (a high-performance computing center). The nodes available for long compute times have a long queue. I have 42 years to run, and the only way to get my files processed quickly (given the wait times, and that this is many GB of data, it takes time) is to submit them individually, one batch file per year, as jobs. I cannot include multiple years in a single batch file, or I have to wait a week in the queue to run my data because of the time I would have to reserve per batch file. This is the fastest way my university's system lets me run my data.
To do this, I have two lines in my batch script that I have to change every time: the name of the job, and the last line, which is the Python script name plus a parameter passed to it (the year), like so: pythonscript.py 2020.
I would like to generate the batch files with a Python or other script I can run: it would loop over a list of years, change the job name to jobNameYEAR and the last line to pythonscript.py YEAR, write that to a file jobNameYEAR.sl, then continue the loop to output the next batch file. Even better if it can write the batch file and submit the job (sjob jobNameYEAR) before continuing in the loop, but I realize that may be asking too much. But separately...
Is there a way to submit jobs in a loop once these files are created? E.g. loop through the year list and submit sjob jobName2000.sl, sjob jobName2001.sl, sjob jobName2002.sl
I do not want a loop inside the batch file changing the variable, because that would mean reserving too many hours on the SLURM system for a single job. I want a loop outside of the batch file that generates multiple batch files I can submit as jobs.
Thank you for your help!
This is what one of my .sl files looks like; it works fine, I just want to generate these files in a loop so I can stop editing them by hand:
#!/bin/bash -l
# The -l above is required to get the full environment with modules
# Set the allocation to be charged for this job
# not required if you have set a default allocation
#SBATCH -A MYFOLDER
# The name of the job
#SBATCH -J jobNameYEAR
# 3 hour wall-clock time will be given to this job
#SBATCH -t 3:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6
#SBATCH --mem=30GB
# Job partition
#SBATCH -p main
# load the anaconda module
ml PDC/21.11
ml Anaconda3/2021.05
conda activate myEnv
python pythonfilename.py YEAR

Create a script with the following content (let's call it chainsubmit.sh):
#!/bin/bash
# Usage: ./chainsubmit.sh script.slurm arg1 arg2 ...
SCRIPT=${1?Usage: $0 script.slurm arg1 arg2 ...}
shift
ARG=$1
# submit the first job and record its job ID
ID=$(sbatch --job-name=jobName$ARG --parsable $SCRIPT $ARG)
shift
# submit each remaining job so it starts only after the previous one completes successfully
for ARG in "$@"; do
    ID=$(sbatch --job-name=jobName$ARG --parsable --dependency=afterok:${ID%%;*} $SCRIPT $ARG)
done
Then, adapt your script so that the last line
python pythonfilename.py YEAR
is replaced with
python pythonfilename.py $1
Finally submit all the jobs with
./chainsubmit.sh jobName.sl {2000..2004}
for instance, for YEAR ranging from 2000 to 2004. Note that because of the --dependency=afterok chaining, each job will start only after the previous one has finished successfully.

… script I can run, where it loops over a list of years and just changes the job name to jobNameYEAR and changes the last line to pythonscript.py YEAR, writes that to a file jobNameYEAR.sl… submit the job (sjob jobNameYEAR) before continuing in the loop…
It can easily be done with a few shell commands and sed. Assume you have a template file jobNameYEAR.sl as shown, which literally contains jobNameYEAR and YEAR as the parameters. Then we can substitute YEAR with each given year in the loop, e.g.
seq 2000 2002 | while read year
do
    <jobNameYEAR.sl sed s/YEAR$/$year/ >jobName$year.sl
    sjob jobName$year.sl
done
If your years aren't in sequence, we can use e.g. echo 1962 1965 1970 instead of seq ….
Other variants are also possible on Linux, like for year in {2000..2002} instead of seq 2000 2002 | while read year, or using envsubst instead of sed.
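If you would rather have one self-contained script that writes each batch file from scratch and submits it right away, a heredoc loop is a minimal sketch of the same idea (it assumes sbatch is your submit command; replace it with your site's sjob wrapper if that is what you use, and adjust the year range, account, partition, and resources to match your file above):
#!/bin/bash
# generate one .sl file per year and submit it before moving on to the next
for year in {1979..2020}; do
    cat > jobName$year.sl <<EOF
#!/bin/bash -l
#SBATCH -A MYFOLDER
#SBATCH -J jobName$year
#SBATCH -t 3:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6
#SBATCH --mem=30GB
#SBATCH -p main
ml PDC/21.11
ml Anaconda3/2021.05
conda activate myEnv
python pythonfilename.py $year
EOF
    sbatch jobName$year.sl    # or: sjob jobName$year.sl
done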

Related

How to make subsection of sbatch jobs run at a time

I have a bash script (essentially what it does is align all files in a directory to all the files of another specified directory). The number of jobs gets quite large if there are 100 files being aligned individually to another 100 files (10,000 jobs), which all get submitted to slurm individually.
I have been doing them in batches manually but I think there must be a way to include it in a script so that, for example, only 50 jobs are running at a time.
I tried
$ sbatch --array [1-1000]%50
but it didn't work
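For what it's worth, the throttle is written without brackets, either on the command line or as an #SBATCH directive in the job script; a minimal sketch (the array bounds and the script name align_job.sh are placeholders):
# inside the submission script
#SBATCH --array=1-1000%50    # 1000 tasks, at most 50 running at a time

# or equivalently on the command line
sbatch --array=1-1000%50 align_job.sh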

How to determine job array size for lots of jobs?

What is the best way to process lots of files in parallel via Slurm?
I have a lot of files (say 10000) in a folder, and each file takes about 10 seconds to process. Naturally, I want to set the sbatch job array size accordingly (#SBATCH --array=1-10000%100), but it seems I can't go above some limit (probably 1000). How do you handle job array sizes? Since my processing doesn't take much time, I think I should assign one job NOT to one file but to multiple files, right?
Thank you
If the process time is 10 seconds you should consider packing the tasks in a single job, both because such short jobs take longer to schedule than to run and because there is a limit on the number of jobs in an array.
Your submission script could look like this:
#!/bin/bash
#SBATCH --ntasks=16 # or any other number depending on the size of the cluster and the maximum allowed wall time
#SBATCH --mem-per-cpu=...
#SBATCH --time=... # based on the number of files and number of tasks
find . -name file_pattern -print0 | xargs -I{} -0 -P $SLURM_NTASKS srun -n1 -c1 --exclusive name_of_the_program {}
Make sure to replace all the ... and file_pattern and name_of_the_program with appropriate values.
The script will look for all files matching file_pattern in the submission directory and run the name_of_the_program program on each of them, limiting the number of concurrent instances to the number of CPUs (more precisely, the number of tasks) requested. Note the use of --exclusive here, which is specific to this use case and has been replaced by --exact in recent Slurm versions.

Dealing with job submission limits

I am running slurm job arrays with --array, and I would like to run about 2000 tasks/array items. However this is beyond the cluster's job submission limit of ~500 at a time.
Are there any tips/best practices for splitting this up? I'd like to submit it all at once and still be able to pass the array id arguments 1-2000 to my programs if possible. I think something like waiting to submit pieces of the array might be helpful but I'm not sure how to do this at the moment.
If the limit is on the size of an array:
You will have to split the array into several job arrays. The --array parameter accepts values of the form <START>-<END> so you can submit four jobs:
sbatch --array=1-500 ...
sbatch --array=501-1000 ...
sbatch --array=1001-1500 ...
sbatch --array=1501-2000 ...
This way you will bypass the 500-limit and still keep the SLURM_ARRAY_TASK_ID ranging from 1 to 2000.
To ease things a bit, you can write this all in one line like this:
paste -d- <(seq 1 500 2000) <(seq 500 500 2000) | xargs -I {} sbatch --array={} ...
If the limit is on the number of submitted jobs:
Then one option is to have the last job of the array submit the following chunk.
#!/bin/bash
#SBATCH ...
...
...
if [[ $((SLURM_ARRAY_TASK_ID % 500)) == 0 ]] ; then
    sbatch --array=$((SLURM_ARRAY_TASK_ID+1))-$((SLURM_ARRAY_TASK_ID+500)) $0
fi
Note that ideally the job that submits the next chunk should be the last running job of the array, which may or may not be the one with the highest task ID; still, this has worked for all practical purposes in many situations.
Another option is to set up a cron job that monitors the queue and submits each chunk when possible, or to use a workflow manager that will do that for you.
You can also run a script that submits your jobs and sleeps a few seconds after every 500 submissions; see https://www.osc.edu/resources/getting_started/howto/howto_submit_multiple_jobs_using_parameters
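A rough sketch of that approach as a shell loop (the submission script name my_job_script.sh, the parameter range, and the sleep duration are placeholders to adapt):
#!/bin/bash
# submit one job per parameter and pause after every 500 submissions
count=0
for i in $(seq 1 2000); do
    sbatch my_job_script.sh "$i"    # my_job_script.sh is a placeholder
    count=$((count + 1))
    if (( count % 500 == 0 )); then
        sleep 60    # give the queue time to drop back under the submission limit
    fi
done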

How to time a SLURM job array?

I am submitting a SLURM job array and want to have the total runtime (i.e. not the runtime of each task) printed to the log.
This is what I tried:
#!/bin/bash
#SBATCH --job-name=step1
#SBATCH --output=logs/step1.log
#SBATCH --error=logs/step1.log
#SBATCH --array=0-263%75
start=$SECONDS
python worker.py ${SLURM_ARRAY_TASK_ID}
echo "Completed step1 in $SECONDS seconds"
What I get in step1.log is something like this:
Completed step1 in 42 seconds
Completed step1 in 94 seconds
Completed step1 in 88 seconds
...
which appear to be giving the runtimes for the last group of tasks in the array. I want a single timer for the whole array, from submission to the end of the last task. Is that possible?
With job arrays, each task is an identical submission of your script, so the way you're measuring time will necessarily only be per-task, as you're seeing. To get the overall elapsed time of the entire job array, you'll need to get the submit time of the first task and subtract it from the end time of the last task.
e.g.
# get submit time for first task in array
sacct -j <job_id>_0 --format=submit
# get end time for last task in array
sacct -j <job_id>_263 --format=end
Then use date -d <timestamp from sacct> +%s to convert the timestamps to seconds since the epoch, to make them easier to subtract.
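Putting it together, a rough sketch (the job ID 123456 is a placeholder and 263 is the last task index from your array; --noheader keeps only the timestamp, and head -n1 drops the extra lines sacct prints for job steps):
submit=$(sacct -j 123456_0 --format=Submit --noheader | head -n1 | tr -d ' ')
end=$(sacct -j 123456_263 --format=End --noheader | head -n1 | tr -d ' ')
elapsed=$(( $(date -d "$end" +%s) - $(date -d "$submit" +%s) ))
echo "Whole array took $elapsed seconds"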
Also note that each of your 264 tasks will overwrite step1.log with its own output. I would typically use #SBATCH --output=step1-%A_%a.out to distinguish outputs from different tasks.

creating cron job that sends output to file every day and overwrites this file every month

I need help with a cron job that sends output to a file every day and overwrites this file every month. My only problem is how to make it overwrite the file each month. I need this in one job, so creating two jobs, one that writes to the file and another that removes it every month, is out of the picture.
You could run it every day but use date +%d to print the day of the month and act differently (call with > to clobber the file instead of >> to append) based on that.
Note that cron treats % specially in the command field, hence the \%.
# Run every day at 00:30 but overwrite the file on the first of the month; append every other day.
# Note that this requires bash as your shell.
# May need to override with SHELL=/bin/bash
30 00 * * * if [ "$(date +\%d)" = "01" ]; then /your/command > /your/logfile; else /your/command >> /your/logfile; fi
Edit:
You mention in comments above that your actual goal is log rotation.
The norm for Linux systems is to use something like logrotate to manage logs like this. That also has the advantage that you can keep multiple previous log files and compress them if you like.
I would recommend making use of a logrotate config snippet to accomplish your goal instead of doing it in the cron job itself. Putting this in the cron job is counter-intuitive if it's merely for log rotation.
Here's an example logrotate snippet, which may go in a location like /etc/logrotate.d/yourapp depending on which Linux distribution you're using.
/var/log/yourlog {
    daily
    missingok
    # keep one year of logs
    rotate 365
    compress
    # keep the first one uncompressed for ease of viewing
    delaycompress
}
This will result in your log file being rotated daily, with the first iteration being like /var/log/yourlog.1 and then compressed iterations like /var/log/yourlog.2.gz, /var/log/yourlog.3.gz and so on.
In my opinion, therefore, your question is not actually a cron question. The kind of cron trickery used above would only be appropriate in situations such as when you want a job to fire on the last Sunday of the month, or the last day of the month, or other criteria that can't be expressed in cron syntax.
