Combining xargs parallel and mpirun - multithreading

I have an embarrassingly parallel (bash) script that is running on a computing cluster.
The script is a shell script and is not linked to any MPI library: this means that the only way I can send the MPI rank to it is with a command-line parameter.
So far, I have only executed it within a single node, and the solution was simple:
#!/bin/bash
#SBATCH --nodes=1
N=16
seq $N | xargs -P $N -I% my_script.bash % $N
How can I scale it to two nodes? If I just use '--nodes=2' and N=32, then xargs will try to spawn all threads on the same node. On the other hand, I cannot use mpiexec alone, because the script is not linked to an MPI library and I do not know how to tell the script which thread it is.

You can use srun within your submission script to do that:
seq $N | xargs -P $N -I% srun --exclusive -N1 my_script.bash % $N
This will use srun to launch your bash script and distribute it to the allocated CPUs.
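For completeness, a minimal two-node submission script along those lines might look like the following sketch (the --ntasks line and the -n1 on each step are assumptions, added so that every job step gets exactly one task slot):
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=32
N=32
# one job step per input value; srun places each step on a free CPU across both nodes
seq $N | xargs -P $N -I% srun --exclusive -N1 -n1 my_script.bash % $N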

Related

Running multiple executables in parallel with a limit of 5

I am running a fairly large number (100) of calculations. Each calculation needs 5 hours to finish and can run only on one core. So in order to make the whole process more efficient, I need to run 5 (or more, depending on the number of CPUs) of them at the same time.
I have 100 folders, and in each one is one executable. How can I run all 100 executables in such a way that only 5 are running at the same time?
I have tried xargs, as seen in: Parallelize Bash script with maximum number of processes
cpus=5
find . -name ./*/*.exe | xargs --max-args=1 --max-procs=$cpus echo
I can't find a way to run them; echo only prints the paths on the screen.
You can execute the input as the command by specifying a substitution pattern and using that as the command:
$ printf '%s\n' uname pwd | xargs -n 1 -I{} {}
Linux
/data/data/com.termux/files/home
or more easily but less directly by using env as the command, since that will execute its first argument:
printf '%s\n' uname pwd | xargs -n 1 env
In your case, your full command might be:
find . -name '*.exe' | xargs -n 1 -P 5 env
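If each executable also needs to run from inside its own folder, a variant along the same lines might look like this sketch (the -print0/-0 pair just keeps it safe for unusual paths):
cpus=5
# run at most $cpus executables at a time, each started from its own directory
find . -name '*.exe' -print0 |
    xargs -0 -n 1 -P "$cpus" sh -c 'cd "$(dirname "$0")" && exec "./$(basename "$0")"'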

SLURM sbatch multiple parallel calls to executable

I have an executable that takes multiple options and multiple file inputs in order to run. The executable can be called with a variable number of cores to run.
E.g. executable -a -b -c -file fileA --file fileB ... --file fileZ --cores X
I'm trying to create an sbatch file that will enable me to have multiple calls of this executable with different inputs. Each call should be allocated to a different node (in parallel with the rest), using X cores. The parallelization at the core level is taken care of by the executable, while at the node level it is handled by SLURM.
I tried with ntasks and multiple sruns but the first srun was called multiple times.
Another take was to rename the files and use a SLURM process or node number as filename before the extension but it's not really practical.
Any insight on this?
I always do these kinds of jobs with the help of a bash script that I run via sbatch. The easiest approach would be to have a loop in the sbatch script where you spawn the different jobs and job steps under your executable with srun, specifying e.g. the corresponding node name in your partition with -w. You may also read up on the documentation of SLURM array jobs if that suits you better (see the short array-job sketch after the wait loop below). Alternatively, you could store all parameter combinations in a file and then loop over them with the script, or have a look at the "array job" manual page.
Maybe the following script (I just wrapped it up) helps you get a feeling for what I have in mind (I hope it's what you need). It's not tested, so don't just copy and paste it!
#!/bin/bash
parameters=(10 5 2)
node_names=(node1 node2 node3)
# let's run one job per node, each taking one parameter
for parameter in "${parameters[@]}"
do
    # assign the parameter to a node
    # (script some if/else condition here to pick parameters)
    # -w specifies the name of the node to use
    # -N specifies the number of nodes
    # assign the job to the first node in the list
    node=${node_names[0]}
    JOBNAME="myjob$node-$parameter"
    # delete the first node from the list
    unset 'node_names[0]'
    # reindex the list
    node_names=("${node_names[@]}")
    srun -N1 -w "$node" -p somepartition -J "$JOBNAME" executable.sh "$parameter" &
done
You will have the problem that you need to force your sbatch script to wait for the last job step. In this case, the following additional while loop might help you.
# Wait for the last job step to complete
while true
do
    # wait for the last job to finish; use the state reported by sacct for that
    echo "waiting for last job to finish"
    sleep 10
    # sacct shows your jobs; -s R,PD lists only running/pending steps
    sacct -s R,PD | grep "myjob"   # your job name indicator
    # check the exit status of grep (1 if nothing was found)
    if [ "$?" == "1" ]
    then
        echo "found no running jobs anymore"
        sacct -s R | grep "myjob"
        echo "stopping loop"
        break
    fi
done
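The array-job route mentioned at the beginning of this answer could look roughly like the following sketch (one array task per parameter; SLURM picks the nodes):
#!/bin/bash
#SBATCH --array=0-2
#SBATCH --nodes=1
#SBATCH --job-name=myjob
# each array task picks its own parameter via SLURM_ARRAY_TASK_ID
parameters=(10 5 2)
srun executable.sh "${parameters[$SLURM_ARRAY_TASK_ID]}"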
I managed to find one possible solution, so I'm posting it for reference:
I declared as many tasks as there are calls to the executable, as well as the number of nodes and the desired number of CPUs per call.
Then I use a separate srun for each call, declaring the number of nodes and tasks for each one. All the sruns are backgrounded with ampersands (&):
srun -n 1 -N 1 --exclusive executable -a1 -b1 -c1 -file fileA1 --file fileB1 ... --file fileZ1 --cores X1 &
srun -n 1 -N 1 --exclusive executable -a2 -b2 -c2 -file fileA2 --file fileB2 ... --file fileZ2 --cores X2 &
....
srun -n 1 -N 1 --exclusive executable -aN -bN -cN -file fileAN --file fileBN ... --file fileZN --cores XN
--Edit: After some tests (as I mentioned in a comment below), if the process of the last srun ends before the rest, it seems to end the whole job, leaving the rest unfinished.
--edited based on the comment by Carles Fenoy
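A common SLURM pattern that avoids the issue described in the edit is to background every srun, including the last one, and end the batch script with wait, so the job only finishes once all the steps have:
srun -n 1 -N 1 --exclusive executable ... --cores X1 &
srun -n 1 -N 1 --exclusive executable ... --cores X2 &
....
srun -n 1 -N 1 --exclusive executable ... --cores XN &
wait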
Write a bash script to populate multiple xyz.slurm files and submit each of them using sbatch. The following script uses a nested for loop to create 8 files, then iterates over them to replace a string in those files, and then submits them with sbatch. You might need to modify the script to suit your needs.
#!/usr/bin/env bash
#Path Where you want to create slurm files
slurmpath=~/Desktop/slurms
rm -rf $slurmpath
mkdir -p $slurmpath/sbatchop
mkdir -p /exports/home/schatterjee/reports
echo "Folder /slurms and /reports created"
declare -a threads=("1" "2" "4" "8")
declare -a chunks=("1000" "32000")
declare -a modes=("server" "client")
## now loop through the above array
for i in "${threads[@]}"
{
for j in "${chunks[@]}"
{
#following are the content of each slurm file
cat <<EOF >$slurmpath/net-$i-$j.slurm
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=$slurmpath/sbatchop/net-$i-$j.out
#SBATCH --wait-all-nodes=1
echo \$SLURM_JOB_NODELIST
cd /exports/home/schatterjee/cs553-pa1
srun ./MyNETBench-TCP placeholder1 $i $j
EOF
#Now schedule them
for m in "${modes[@]}"
{
for value in {1..5}
do
#Following command replaces placeholder1 with the value of m
sed -i -e 's/placeholder1/'"$m"'/g' $slurmpath/net-$i-$j.slurm
sbatch $slurmpath/net-$i-$j.slurm
done
}
}
}
You can also try this Python wrapper, which can execute your command on the files you provide.

Executing several bash scripts simultaneously from one script?

I want to make a bash script that will execute around 30 or so other scripts simultaneously; these 30 scripts all have wget commands iterating through some lists.
I thought of doing something with screen (send Ctrl + Shift + A + D) or sending the scripts to the background, but really I don't know what to do.
To summarize: 1 master script execution will trigger all other 30 scripts to execute at the same time.
PS: I've seen the other questions, but I don't quite understand how they work, or they are a bit more than what I need (expecting a return value, etc.).
EDIT:
Small snippet of the script(this part is the one that executes with the config params I specified)
if [ $WP_RANGE_STOP -gt 0 ]; then
    #WP RANGE
    for (( count="$WP_RANGE_START"; count<"$WP_RANGE_STOP"+1; count=count+1 ))
    do
        if cat downloaded.txt | grep "$count" >/dev/null
        then
            echo "File already downloaded!"
        else
            echo $count >> downloaded.txt
            wget --keep-session-cookies --load-cookies=cookies.txt --referer=server.com http://server.com/wallpaper/$count
            cat $count | egrep -o "http://wallpapers.*(png|jpg|gif)" | wget --keep-session-cookies --load-cookies=cookies.txt --referer=http://server.com/wallpaper/$number -i -
            rm $count
        fi
    done
fi
Probably the most straightforward approach would be to use xargs -P or GNU parallel. Generate the different arguments for each child in the master script. For simplicity's sake, let's say you just want to download a bunch of different content at once. Either of
xargs -P 30 wget < urls_file
parallel -j 30 wget '{}' < urls_file
will spawn up to 30 simultaneous wget processes with different args from the given input. If you give more information about the scripts you want to run, I might be able to provide more specific examples.
Parallel has some more sophisticated tuning options compared to xargs, such as the ability to automatically split jobs across cores or cpus.
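For instance, parallel can take a percentage of the available cores for -j and report progress with --eta; a small sketch:
# use half of the available cores and show an estimated time of completion
parallel --eta -j 50% wget '{}' < urls_file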
If you're just trying to run a bunch of heterogeneous bash scripts in parallel, define each individual script in its own file, make each file executable, and list the invocations in a file that you pass to parallel:
$ cat list_of_scripts
/path/to/script1 arg1 arg2
/path/to/script2 -o=5 --beer arg3
…
/path/to/scriptN
then
parallel -j 30 < list_of_scripts

How can I use a pipe or redirect in a qsub command?

There are some commands I'd like to run on a grid using qsub (SGE 8.1.3, CentOS 5.9) that need to use a pipe (|) or a redirect (>). For example, let's say I have to parallelize the command
echo 'hello world' > hello.txt
(Obviously a simplified example: in reality I might need to redirect the output of a program like bowtie directly to samtools). If I did:
qsub echo 'hello world' > hello.txt
the resulting content of hello.txt would look like
Your job 123454321 ("echo") has been submitted
Similarly if I used a pipe (echo "hello world" | myprogram), that message is all that would be passed to myprogram, not the actual stdout.
I'm aware I could write small bash scripts that each contain the command with the pipe/redirect, and then do qsub ./myscript.sh. However, I'm trying to run many parallelized jobs at the same time using a script, so I'd have to write many such bash scripts, each with a slightly different command. When scripted, this solution can start to feel very hackish. An example of such a script in Python:
for i, (infile1, infile2, outfile) in enumerate(files):
    command = ("bowtie -S %s %s | " +
               "samtools view -bS - > %s\n") % (infile1, infile2, outfile)
    script = "job" + str(i) + ".sh"
    open(script, "w").write(command)
    os.system("chmod 755 %s" % script)
    os.system("qsub -cwd ./%s" % script)
This is frustrating for a few reasons, among them that my program can't even delete the many jobXX.sh scripts afterwards to clean up after itself, since I don't know how long the job will be waiting in the queue, and the script has to be there when the job starts.
Is there a way to provide my full echo 'hello world' > hello.txt command to qsub without having to create another file containing the command?
You can do this by turning it into a bash -c command, which lets you put the | in a quoted statement:
qsub bash -c "cmd <options> | cmd2 <options>"
As #spuder has noted in the comments, it seems that in other versions of qsub (not SGE 8.1.3, which I'm using), one can solve the problem with:
echo "cmd <options> | cmd2 <options>" | qsub
as well.
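If your qsub accepts a job script on standard input (as the echo ... | qsub form above relies on), a here-document is a convenient way to avoid quoting headaches; a sketch:
qsub -cwd -N hello <<'EOF'
#!/bin/bash
echo 'hello world' > hello.txt
EOF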
Although my answer is a bit late, I am adding it for any incoming viewers. To use a pipe/redirect and submit that as a qsub job, you need to do a couple of things. But first: using qsub at the end of a pipe like you're doing will only result in one job being sent to the queue (i.e., your code will run serially rather than get parallelized).
Run qsub with binary mode enabled, since the default qsub behavior rather expects compiled code. For that you use the "-b y" flag to qsub, and you'll avoid any errors of the sort "command required for a binary mode" or "script length does not match declared length".
Echo each call to qsub and then pipe that to a shell.
Suppose you have a file params-query.txt which hold several bowtie commands and piped calls to samtools of the following form:
bowtie -q query -1 param1 -2 param2 ... | samtools ...
To send each query as a separate job, first prepare your command-line units from STDIN through xargs. Notice that the quotes around the braces are important if you are submitting a command made of piped parts; that way your entire query is treated as a single unit.
cat params-query.txt | xargs -i echo qsub -b y -o output_log -e error_log -N job_name \"{}\" | sh
If that didn't work as expected then you're probably better off generating an intermediate output between bowtie and samtools before calling samtools to accept that intermediate output. You won't need to change the qsub call through xargs but the code in params-query.txt should look like:
bowtie -q query -o intermediate_query_out -1 param1 -2 param2 && samtools read_from_intermediate_query_out
This page has interesting qsub tricks you might like:
grep http *.job | awk -F: '{print $1}' | sort -u | xargs -I {} qsub {}

KSH: constrain the number of threads that can run at one time

I have a script that loops, and each iteration invokes a thread that runs in the background, like below:
xn_run_process.sh
...
for each in `ls ${INPUT_DIR}/MDX*.txt`
do
java -Xms256m -Xmx1024m -cp ${CLASSPATH} com.wf.xn.etcc.Main -config=${CONFIG_FILE}
...
for SCALE_PDF in `ls ${PROCESS_DIR}/*.pdf`
do
OUTPUT_AFP=${OUTPUT_DIR}/`basename ${SCALE_PDF}`
OUTPUT_AFP=`print ${OUTPUT_AFP} | sed s/pdf/afp/g`
${PROJ_DIR}/myscript.sh -i ${SCALE_PDF} -o ${OUTPUT_AFP} &
sleep 30
done
done
When I did this, I thought there would only be 5 threads of myscript.sh executing concurrently at any one time; however, things changed, and the list now executes 30 threads, each running quite a heavy process. How do I constrain the number of concurrent processes to 5?
While this is possible in pure shell scripting, the easiest approach would be using a parallelization tool like GNU parallel or GNU make. Makefile example:
SOURCES = ${SOME_LIST}
STAMPS = $(SOURCES:=.did-run-stamp)
all : $(STAMPS)
# the stamp file records that the script has already run for this input
# (the recipe line must start with a tab)
%.did-run-stamp : %
	/full/path/myscript.sh -f $< && touch $@
and then calling make as make -j 5.
Use GNU Parallel (adjust -j as you see fit; remove it if you want one job per CPU):
for each in `ls ${INPUT_DIR}/MDX*.txt`
do
java -Xms256m -Xmx1024m -cp ${CLASSPATH} com.wf.xn.etcc.Main -config=${CONFIG_FILE}
...
for SCALE_PDF in `ls ${PROCESS_DIR}/*.pdf`
do
OUTPUT_AFP=${OUTPUT_DIR}/`basename ${SCALE_PDF}`
OUTPUT_AFP=`print ${OUTPUT_AFP} | sed s/pdf/afp/g`
sem --id myid -j 5 ${PROJ_DIR}/myscript.sh -i ${SCALE_PDF} -o ${OUTPUT_AFP}
done
done
sem --wait --id myid
sem is part of GNU Parallel.
This will keep 5 jobs running until there are only 5 jobs left. Then it will allow your java to run while finishing the last 5. The sem --wait will wait until the last 5 are finished, too.
Alternatively:
for each ...
java ...
...
ls ${PROCESS_DIR}/*.pdf |
parallel -j 5 ${PROJ_DIR}/myscript.sh -i {} -o ${OUTPUT_DIR}/{/.}.afp
done
This will run 5 jobs in parallel and only let java run when all the jobs are finished.
Alternatively you can use the queue trick described in GNU Parallel's man page: https://www.gnu.org/software/parallel/man.html#example__gnu_parallel_as_queue_system_batch_manager
echo >jobqueue; tail -f jobqueue | parallel -j5 &
for each ...
...
ls ${PROCESS_DIR}/*.pdf |
parallel echo ${PROJ_DIR}/myscript.sh -i {} -o ${OUTPUT_DIR}/{/.}.afp >> jobqueue
done
echo killall -TERM parallel >> jobqueue
wait
This will run java, then add jobs to be run to a queue. After adding the jobs, java will be run immediately. At all times, 5 jobs will be run from the queue until the queue is empty.
You can install GNU Parallel simply by:
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
Watch the intro videos to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1 and walk through the tutorial (man parallel_tutorial). Your command line will love you for it.
If you have ksh93 check if JOBMAX is available:
JOBMAX
This variable defines the maximum number of running background
jobs that can run at a time. When this limit is reached, the
shell will wait for a job to complete before starting a new job.
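A minimal sketch of that approach, reusing the variables from the question (and assuming your ksh93 build supports JOBMAX):
#!/bin/ksh93
JOBMAX=5                     # at most 5 background jobs at a time
for SCALE_PDF in ${PROCESS_DIR}/*.pdf
do
    OUTPUT_AFP=${OUTPUT_DIR}/$(basename "${SCALE_PDF}" .pdf).afp
    ${PROJ_DIR}/myscript.sh -i "${SCALE_PDF}" -o "${OUTPUT_AFP}" &
done
wait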
