How to run 5 shell jobs simultaneously and keep the rest in a queue? - multithreading

Both "Task Spooler" and "at" handle multiple queues and allow commands to be executed at a later point; the difference is that at emails the output of each command to the user who queued it, while Task Spooler lets you retrieve the results from the command line instead.
But what I am looking for is a way to run 5 jobs simultaneously and keep the rest of the jobs in a queue, so that when any one of the 5 finishes, the next one starts.
So, with 5 jobs running and 4 more in the queue, as soon as any of them finishes the next one starts executing, and 5 jobs are again running simultaneously.
Is there a way to handle such a task?

It depends, of course, on how you want to start your tasks, but let's assume they are loop-based. The following would launch all N commands in the background (N is a placeholder for your actual number of jobs):
#!/usr/bin/env bash
N=9   # placeholder: total number of jobs
for i in $(seq 1 "$N"); do
    # do awesome command based on $i
    command "$i" &
done
wait
So if you want to run only 5 jobs at a time, you need to keep track of what is running:
#!/usr/bin/env bash
N=9        # placeholder: total number of jobs
Njobs=5    # maximum number of jobs running at once
for i in $(seq 1 "$N"); do
    # wait while the maximum number of jobs is already running
    while (( $(jobs -rp | wc -l) >= Njobs )); do
        sleep 0.1
    done
    # do awesome command based on $i
    command "$i" &
done
wait
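A polling-free variant of the same idea, assuming bash 4.3 or newer (which added the wait -n builtin that blocks until any one background job exits; command and N are placeholders as above):
#!/usr/bin/env bash
N=9        # placeholder: total number of jobs
Njobs=5    # maximum number of jobs running at once
for i in $(seq 1 "$N"); do
    # once Njobs jobs are running, block until one of them exits
    while (( $(jobs -rp | wc -l) >= Njobs )); do
        wait -n
    done
    # do awesome command based on $i
    command "$i" &
done
wait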

If you're using task spooler you can do what you're asking. Use the -S <number> flag to specify the number of "slots" (jobs that run concurrently). You can even use -D <job id> to make different jobs depend on another specific job's completion.
So in your example, if you set tsp -S 5, task spooler would run the first 5 jobs and queue up the next 4. Once one of the original 5 jobs completed, the next queued up job (based on lowest job id) would then begin. This would continue to happen as running jobs finish and more slots open up.
Also note, for anyone else reading this: on Ubuntu (and maybe other Debian-based systems) task spooler is installed as tsp so as not to conflict with the openssl-ts tool. On most other systems it should be called just ts, which is why even on Ubuntu task spooler refers to itself as ts.
From the manual, regarding slots:
MULTI-SLOT
ts by default offers a queue where each job runs only after the previous finished. Nevertheless, you can change the maximum number of jobs running at once with the -S [num] parameter. We call that number the amount of slots. You can also set the initial number of jobs with the environment variable TS_SLOTS. When increasing this setting, queued waiting jobs will be run at once until reaching the maximum set. When decreasing this setting, no other job will be run until it can meet the amount of running jobs set. When using an amount of slots greater than 1, the action of some commands may change a bit. For example, -t without jobid will tail the first job running, and -d will try to set the dependency with the last job added.
-S [num]
Set the maximum amount of running jobs at once. If you don't specify num it will return the maximum amount of running jobs set.
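Putting the above together, a minimal sketch of the workflow (assuming the binary is installed as tsp, and job1.sh through job9.sh are placeholder scripts):
tsp -S 5            # allow at most 5 jobs to run at once
for s in job1.sh job2.sh job3.sh job4.sh job5.sh job6.sh job7.sh job8.sh job9.sh; do
    tsp ./"$s"      # queue each job; the first 5 start immediately, the rest wait
done
tsp -l              # list the queue: 5 running, 4 queued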

You already have a tool that does this: GNU Parallel
parallel --jobs 4 bash ::: script1.sh script2.sh script3.sh script4.sh
See Parallel tutorial for examples.
For the case where there are more tasks than job slots:
for f in $(seq "$TASKS"); do
    echo ${RANDOM}e-04
done | parallel --jobs "$JOBS" "echo {#} {}; sleep {}"
Example results for TASKS=9:
JOBS=1          JOBS=5
1 17994e-04     4 2844e-04
2 25155e-04     2 5752e-04
3 7859e-04      3 13084e-04
4 11812e-04     1 13749e-04
5 19851e-04     8 2546e-04
6 1568e-04      7 12086e-04
7 24074e-04     6 16087e-04
8 8435e-04      9 9826e-04
9 1407e-04      5 27257e-04

Related

Slurm Question: Array Job vs srun in an sbatch

What's the difference between the two following parallelization schemes on Slurm?
Scheme 1
Run sbatch script.sh
#!/bin/bash
#SBATCH --ntasks=8
## more options
srun echo hello
This summons 8 jobs that run echo hello
Scheme 2
I've accomplished something similar using array jobs.
#!/bin/bash
#SBATCH --job-name=arrayJob
#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --error=arrayJob_%A_%a.err
#SBATCH --array=1-8
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
# Print this sub-job's task ID
echo hello
Is there any difference between the two schemes? They both seem to accomplish the same thing.
Scheme 1 is one single job (with 8 tasks) while Scheme 2 is 8 distinct jobs (each with one task). In the first case, all the tasks will be scheduled at the same time, while in the second case, the 8 tasks will be scheduled independently one of another.
With the job array (Scheme 2), if 8 CPUs become available at once, all tasks will start at the same time, but if only 4 CPUs become available at first, 4 tasks will run while the other 4 remain pending. When the initial 4 are done, the other 4 are started. This is typically used for embarrassingly parallel jobs, where the processes do not need to communicate or synchronise, for example applying the same program to a list of files.
By contrast, with a single job (Scheme 1), Slurm will start the 8 tasks at the same time, so it needs 8 CPUs to become available at the same time. This is typically only used for parallel jobs whose processes need to communicate with each other, for instance using a Message Passing Interface (MPI) library.
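Relevant to the original question: a job array can additionally be throttled so that only a fixed number of its tasks run at once, using the % suffix of --array. A minimal sketch with placeholder values:
#!/bin/bash
#SBATCH --job-name=throttledArray
#SBATCH --array=1-9%5        # 9 array tasks, at most 5 running at the same time
#SBATCH --ntasks=1
# each array task works on its own input, selected via the array index
echo "processing task $SLURM_ARRAY_TASK_ID"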

Synchronizing four shell scripts to run one after another in unix

I have 4 shell scripts that each generate a file (let's say param.txt) which is used by another tool (Informatica); when the tool is done processing, it deletes param.txt.
The intent is that all four scripts can be invoked at different times, say 12:10 am, 12:13 am, 12:16 am and 12:17 am. The first script runs at 12:10 am, creates param.txt and triggers the Informatica process that uses it. The Informatica process takes another 5-10 minutes to complete and then deletes param.txt. The 2nd script, invoked at 12:13 am, waits for param.txt to become unavailable; once the Informatica process deletes it, script 2 creates a new param.txt and triggers the same Informatica process again. The same happens for the other 2 scripts.
I am using until and sleep in all 4 shell scripts to wait for param.txt to become unavailable, like below:
until [ ! -f "$paramfile" ]
do
    sleep 10
done
<create param.txt file>
The issue is that when all 4 scripts are running, the first one succeeds and generates param.txt (as there was no param.txt before) and the others wait; but when the Informatica process completes and deletes param.txt, the remaining 3 (or 2) scripts check for its absence at the same time, so more than one of them passes the check even though only one actually creates the file. I have tried different combinations of sleep intervals between the four scripts, but this situation occurs almost every time.
You are experiencing a classic race condition. To solve it, you need a shared "lock" (or similar) between your 4 scripts.
There are several ways to implement this. One way to do this in bash is by using the flock command, and an agreed-upon filename to use as a lock. The flock man page has some usage examples which resemble this:
(
    flock -x 200   # acquire an exclusive lock on file descriptor 200
    # do whatever check you want. You are guaranteed to be the only one
    # holding the lock
    if [ -f "$paramfile" ]; then
        :   # do something
    fi
) 200>/tmp/lock-life-for-all-scripts
# The lock is automatically released when the above block is exited
You can also ask flock to fail right away if the lock can't be acquired, or to fail after a timeout (e.g. to print "still trying to acquire the lock" and restart).
Depending on your use case, you could also put the lock on the 'informatica' binary (be sure to use 200< in that case, to open the file for reading instead of (over)writing)
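For instance, a sketch of the non-blocking and timeout variants mentioned above (the lock file path is arbitrary, as before):
(
    # fail immediately if another script currently holds the lock
    flock -n 200 || { echo "lock is busy, giving up"; exit 1; }
    # ... critical section ...
) 200>/tmp/lock-life-for-all-scripts

(
    # or: keep trying for up to 60 seconds before giving up
    flock -w 60 200 || { echo "still locked after 60 seconds"; exit 1; }
    # ... critical section ...
) 200>/tmp/lock-life-for-all-scripts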
You can use GNU Parallel as a counting semaphore or a mutex by invoking it as sem instead of parallel (see the Mutex section of the GNU Parallel documentation).
So, you could use:
sem --id myGlobalId 'create input file; run informatica'
sem --id myGlobalId 'create input file; run informatica'
sem --id myGlobalId 'create input file; run informatica'
sem --id myGlobalId 'create input file; run informatica'
Note I have specified a global id in case you run the jobs from different terminals or cron. This is not necessary if you are starting all jobs from one terminal.
Thanks for the valuable suggestions; they helped me think in another direction. However, I failed to mention that I am on Solaris, where I couldn't find flock or an equivalent. I could have asked the team to install a utility, but in the meantime I found a workaround for this issue.
I read that mkdir is atomic, whereas creating a file with touch is not: mkdir either creates the directory or fails because it already exists, as a single operation. That means that at any given time only 1 of the 4 scripts can successfully create 'lockdir', and the other 3 have to wait.
while true
do
    if mkdir "$lockdir" 2>/dev/null; then
        # we won the race: create the param file
        <create param file>
        break
    fi
    sleep 30
done
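One thing to keep in mind with this approach: the lock directory has to be removed again once it is safe for the next script to proceed (in this workflow, presumably after the Informatica process has consumed and deleted param.txt), otherwise the other scripts will wait forever. A sketch of the release step, assuming the same $lockdir variable:
# once the next script may go ahead, release the lock
rmdir "$lockdir"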

Automatic qsub job completion status notification

I have a shell script that calls five other scripts. The first script creates 50 qsub jobs on the cluster. Individual job execution times vary from a couple of minutes to an hour. I need to know when all 50 jobs have finished, because after they complete I need to run the second script. How can I find out whether all the qsub jobs are completed? One possible solution is an infinite loop that checks the job status with the qstat command and the job IDs, but then I have to poll the job status continuously, which is not an elegant solution. Is it possible for a qsub job to notify me by itself after it finishes, so that I don't need to monitor the job status frequently?
qsub is capable of handling job dependencies, using -W depend=afterok:jobid.
e.g.
#!/bin/bash
# commands to run on the cluster
COMMANDS="script1.sh script2.sh script3.sh"
# initialize the JOBIDS variable
JOBIDS=""
# queue all commands
for CMD in $COMMANDS; do
    # queue the command and append its job id (JOBIDS ends up as ":id1:id2:id3")
    JOBIDS="$JOBIDS:$(qsub $CMD)"
done
# queue post-processing, dependent on all the submitted jobs
qsub -W depend=afterok$JOBIDS postprocessing.sh
exit 0
More examples can be found here http://beige.ucs.indiana.edu/I590/node45.html
I have never heard of a way to do that, and I would be really interested if someone comes up with a good answer.
In the meantime, I suggest you use file tricks: either your script writes a file at the end, or you check for the existence of the log files (assuming they are created only at the end).
while [ ! -e ~/logs/myscript.log-1 ]; do
    sleep 30
done

Bash: Process monitoring and manipulation

I have a C program that processes an input file. I'm using a Bash script to feed the input files one by one to this program, along with some other parameters. Each input file is processed by the program 4 times, each time with different parameter values. You can think of it as an experiment to test the C program with different parameters.
This C program can consume memory very quickly (it can even take up more than 95% of the OS memory, slowing down the system). So, in my script, I monitor 2 things for every test run of the program: the total running time and the percentage of memory consumed (obtained from the top command). When either of them crosses a threshold, I kill the C program using killall -q 0 processname and begin the next test run.
This is how my script is structured:
# run in background
./program file_input1 file_input2 param1 param2 &

# now monitor the process
# monitor time
sleep 1
((seconds++))
if [ $seconds -ge $timeout ]; then
    timedout=1
    break
fi

# monitor memory percentage used
memused=$(top -bn1 | grep "$(pidof genpbddh2)" | awk '{print $10}' | cut -d'.' -f1)
if [ $memused -ge $memorylimit ]; then
    overmemory=1
    break
fi
This entire thing runs inside a loop that keeps generating new parameter values for the C program.
When the script breaks out of the monitoring loop due to a timeout or exceeding the memory limit, this command is executed:
killall -q 0 program
The problem:
My intention was that once the program is started in the background (1st line above), I can monitor it and then go to the next run of the program: a sequential execution of test cases.
But it seems all the future runs of the program get scheduled by the OS (Linux) for some reason. That is, while Test Run 1 is running, Test Runs 2, 3, 4 and so on are somehow also scheduled (without Run 1 having finished). At least, it seems that way from the observation below:
When I pressed Ctrl-C to end the script, it exited cleanly, but new instances of the program kept being created continuously. The script had ended (I checked and made sure of that), yet instances of the program were still being started. In the end I wrote a script that repeatedly checks for new instances of the program and kills them; eventually all the pre-scheduled instances were killed and no new ones were created. But it was all a lot of pain.
Is this the correct way to externally monitor a program?
Any clues on why this problem is occurring, and how to fix it?
I would say that a more correct way to monitor a program like this would be:
ulimit -v $memorylimit
With such a limit set, any process started from that shell is prevented from using more virtual memory than allowed (and will typically fail or be killed when it tries). It is also possible to set other limits, such as the maximum CPU time used or the maximum number of open files.
To see your current limits you can use
ulimit -a
ulimit is the bash builtin; if you use tcsh, the corresponding command is limit.
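A sketch of how this could fit into the existing test loop: each run is started in a subshell, so the limits apply only to that run. The variable names are placeholders; note that ulimit -v is given in KiB and ulimit -t limits CPU time (in seconds), not wall-clock time:
(
    ulimit -v "$memorylimit_kb"   # allocations beyond this many KiB of virtual memory will fail
    ulimit -t "$cpu_timeout"      # the process receives SIGXCPU after this many CPU seconds
    exec ./program file_input1 file_input2 param1 param2
)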

Run several jobs in parallel efficiently

OS: CentOS
I have some 30,000 jobs (or scripts) to run. Each job takes 3-5 minutes. I have 48 CPUs (nproc = 48) and can use 40 of them to run 40 jobs in parallel. Please suggest a script or tool that can handle 30,000 jobs by running 40 of them in parallel at a time.
What I have done:
I created 40 different folders and executed the jobs in parallel by creating a shell script for each directory.
I want to know better ways to handle this kind of job next time.
As Mark Setchell says: GNU Parallel.
find scripts/ -type f | parallel
If you insist on keeping 8 CPUs free:
find scripts/ -type f | parallel -j-8
But usually it is more efficient simply to use nice as that will give you all 48 cores when no one else needs them:
find scripts/ -type f | nice -n 15 parallel
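With 30,000 jobs it may also be worth logging progress so that an interrupted run can be picked up where it left off; a sketch using GNU Parallel's --joblog and --resume options:
find scripts/ -type f | parallel -j 40 --joblog jobs.log --resume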
To learn more:
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial (man parallel_tutorial). Your command line will love you for it.
I have used REDIS to do this sort of thing - it is very simple to install and the CLI is easy to use.
I mainly used LPUSH to push all the jobs onto a "queue" in Redis, and a blocking pop (BRPOP in the demo below, so the queue behaves FIFO) to remove a job from the queue. So you would LPUSH 30,000 jobs (or script names or parameters) at the start, then start 40 processes in the background (1 per CPU), and each process would sit in a loop doing a blocking pop to get a job, run it, and fetch the next.
You can add layers of sophistication to log completed jobs in another "queue".
Here is a little demonstration of what to do...
First, start a Redis server on any machine in your network:
./redis-server & # start REDIS server in background
Or, you could put this in your system startup if you use it always.
Now push 3 jobs onto queue called jobs:
./redis-cli # start REDIS command line interface
redis 127.0.0.1:6379> lpush jobs "job1"
(integer) 1
redis 127.0.0.1:6379> lpush jobs "job2"
(integer) 2
redis 127.0.0.1:6379> lpush jobs "job3"
(integer) 3
See how many jobs there are in queue:
redis 127.0.0.1:6379> llen jobs
(integer) 3
Wait with infinite timeout for job
redis 127.0.0.1:6379> brpop jobs 0
1) "jobs"
2) "job1"
redis 127.0.0.1:6379> brpop jobs 0
1) "jobs"
2) "job2"
redis 127.0.0.1:6379> brpop jobs 0
1) "jobs"
2) "job3"
This last one will wait a LONG time as there are no jobs in queue:
redis 127.0.0.1:6379> brpop jobs 0
Of course, this is readily scriptable:
Put 30,000 jobs in queue:
for ((i=0;i<30000;i++)) ; do
    echo "lpush jobs job$i" | redis-cli
done
If your Redis server is on a remote host, just use:
redis-cli -h <HOSTNAME>
Here's how to check progress:
echo "llen jobs" | redis-cli
(integer) 30000
Or, more simply maybe:
redis-cli llen jobs
(integer) 30000
And you could start 40 jobs like this:
#!/bin/bash
for ((i=0;i<40;i++)) ; do
    ./Keep1ProcessorBusy $i &
done
And then Keep1ProcessorBusy would be something like this:
#!/bin/bash
# Endless loop picking up jobs and processing them
while :
do
    # BRPOP returns the queue name and then the value; keep only the value
    job=$(redis-cli brpop jobs 0 | tail -1)
    # Set processor affinity here too if you want to force it, using the $1 parameter we were called with
    "$job"    # run the job (this assumes the queue holds runnable script names)
done
Of course, the actual script or job you want to run could also be stored in Redis.
As a totally different option, you could look at GNU Parallel. Also remember that you can run the output of find through xargs with the -P option to parallelise things.
Just execute those scripts; Linux will internally distribute the tasks among the available CPUs. This is up to the Linux task scheduler. But if you want, you can also pin a task to a particular CPU by using taskset (see man taskset). You can do this from a script to execute your 30K tasks, but if you take this manual route, be sure you know what you are doing.
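For reference, a minimal taskset sketch (myscript.sh is a placeholder):
taskset -c 0 ./myscript.sh    # run myscript.sh only on CPU core 0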
