Run several jobs in parallel and efficiently - Linux

OS: CentOS
I have about 30,000 jobs (scripts) to run. Each job takes 3-5 minutes. I have 48 CPUs (nproc = 48) and can spare 40 of them to run 40 jobs in parallel. Please suggest a script or tool that can work through the 30,000 jobs while keeping 40 of them running in parallel.
What I have done so far:
I created 40 different folders and executed the jobs in parallel by creating a shell script for each directory.
I want to know a better way to handle this kind of workload next time.

As Mark Setchell says: GNU Parallel.
find scripts/ -type f | parallel
If you insist on keeping 8 CPUs free:
find scripts/ -type f | parallel -j-8
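Or, if you want exactly the 40 simultaneous jobs described in the question, you can give parallel a fixed job count instead:
find scripts/ -type f | parallel -j 40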
But usually it is more efficient simply to use nice as that will give you all 48 cores when no one else needs them:
find scripts/ -type f | nice -n 15 parallel
To learn more:
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial (man parallel_tutorial). Your command line
will love you for it.

I have used REDIS to do this sort of thing - it is very simple to install and the CLI is easy to use.
I mainly used LPUSH to push all the jobs onto a "queue" in REDIS and BRPOP to do a blocking remove of a job from the other end of the queue. So you would LPUSH 30,000 jobs (or script names or parameters) at the start, then start 40 processes in the background (1 per CPU), and each process would sit in a loop doing BRPOP to get a job, run it, and fetch the next one.
You can add layers of sophistication to log completed jobs in another "queue".
Here is a little demonstration of what to do...
First, start a Redis server on any machine in your network:
./redis-server & # start REDIS server in background
Or, you could put this in your system startup if you use it always.
Now push 3 jobs onto queue called jobs:
./redis-cli # start REDIS command line interface
redis 127.0.0.1:6379> lpush jobs "job1"
(integer) 1
redis 127.0.0.1:6379> lpush jobs "job2"
(integer) 2
redis 127.0.0.1:6379> lpush jobs "job3"
(integer) 3
See how many jobs there are in queue:
redis 127.0.0.1:6379> llen jobs
(integer) 3
Wait with an infinite timeout for a job:
redis 127.0.0.1:6379> brpop jobs 0
1) "jobs"
2) "job1"
redis 127.0.0.1:6379> brpop jobs 0
1) "jobs"
2) "job2"
redis 127.0.0.1:6379> brpop jobs 0
1) "jobs"
2) "job3"
This last one will wait a LONG time as there are no jobs in queue:
redis 127.0.0.1:6379> brpop jobs 0
Of course, this is readily scriptable:
Put 30,000 jobs in queue:
for ((i=0;i<30000;i++)) ; do
echo "lpush jobs job$i" | redis-cli
done
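If starting a new redis-cli process for every job feels slow, you can instead feed all 30,000 commands through a single redis-cli (same idea, just batched; redis-cli reads commands from standard input):
for ((i=0;i<30000;i++)) ; do
    echo "lpush jobs job$i"
done | redis-cli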
If your Redis server is on a remote host, just use:
redis-cli -h <HOSTNAME>
Here's how to check progress:
echo "llen jobs" | redis-cli
(integer) 30000
Or, more simply maybe:
redis-cli llen jobs
(integer) 30000
And you could start 40 worker processes like this:
#!/bin/bash
for ((i=0;i<40;i++)) ; do
./Keep1ProcessorBusy $i &
done
And then Keep1ProcessorBusy would be something like this:
#!/bin/bash
# Endless loop picking up jobs and processing them
while :
do
    # BRPOP returns the queue name and then the job; keep only the job itself
    job=$(redis-cli brpop jobs 0 | tail -1)
    # Set processor affinity here too if you want to force it, using the $1 parameter we were called with
    $job
done
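If you do want to pin each worker to its own core, a hedged variant of the loop body might look like this, using the CPU number passed in as $1 and an optional "done" queue (the queue name is an assumption) for the logging idea mentioned earlier:
    job=$(redis-cli brpop jobs 0 | tail -1)
    taskset -c "$1" $job            # run the job pinned to CPU $1
    redis-cli lpush done "$job"     # record completion in another queue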
Of course, the actual script or job you want run could also be stored in Redis.
As a totally different option, you could look at GNU Parallel (https://www.gnu.org/software/parallel/). And also remember that you can run the output of find through xargs with the -P option to parallelise stuff.
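For example, a minimal sketch of the xargs approach, assuming each file under scripts/ is a shell script that can be run with bash:
find scripts/ -type f -print0 | xargs -0 -P 40 -n 1 bash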

Just execute those scripts; Linux will internally distribute the tasks amongst the available CPUs. That is the job of the Linux task scheduler. But if you want, you can also pin a task to a particular CPU by using taskset (see man taskset). You could do that from a script that launches your 30,000 tasks. If you go this manual route, be sure you know what you are doing.
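For example, a hedged sketch of manual pinning (the script names and CPU numbers are just placeholders):
taskset -c 0 ./job_0001.sh &   # run this job only on CPU 0
taskset -c 1 ./job_0002.sh &   # run this job only on CPU 1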

Related

Celery executes the tasks but does not reflect them in the Celery logs. Why?

For example, it runs perfectly the first time, but if we try to run the same thing again, then even though it ran 3 tasks the first time it will show only 1 of them (the first or the last of the three) in the logs. It will still execute all of them.
The command used to run Celery here is
celery -A project worker -B -l info --scheduler django_celery_beat.schedulers:DatabaseScheduler
The Celery logs look like this the first time:
task received 1
task received 2
task received 3
The second time it may be:
task received 3    # or "task received 1" - either of these two
But it will still execute all three of them.
So what might be preventing them from appearing in the logs?

qsub slow on cluster with a single computer

Torque is installed on a single computer, which is used as both the head node
and the computation node. I didn't install Maui for job scheduling, but use the built-in scheduler of Torque.
I find qsub is slow when submitting many jobs, for example:
for i in `ls *tt.sh`
do
echo $i
qsub $i
done
it takes a while to submit the jobs at the end of the scripts list.
This happens even if the computer is under low load. The submission is slow even with merely 70 scripts in the list.
Are there some options I could tweak in Torque, or do I have to install Maui for job scheduling?
Thanks!

How to run 5 shell jobs simultaneously and keep the rest in a queue?

While both "Task Spooler" and "at" handle multiple queues and allow the execution of commands at a later point, the at project handles output from commands by emailing the results to the user who queued the command, while Task Spooler allows you to get at the results from the command line instead.
But what I am looking for is a way that would allow me to run 5 jobs simultaneously and keep the rest of the jobs in a queue, so that when any one of the 5 finishes it would start the next one.
So, if 5 jobs are running and 4 more are in the queue, then as soon as any of them finishes, the next one would start executing and again 5 jobs would be running simultaneously.
Is there a way to handle such a task?
It depends, of course, on how you want to start your tasks. But let's assume they are loop-based. The following would launch all N commands in the background.
#!/usr/bin/env bash
for i in $(seq 1 "$N"); do   # N holds the total number of tasks
# do awesome command based on $i
command $i &
done
wait
So if you want to launch only 5 jobs at a time, you need to keep track of what is running:
#!/usr/bin/env bash
Njobs=5
for i in $(seq 1 "$N"); do   # N holds the total number of tasks
# Checks how many jobs are currently running
while [[ $(jobs -rp | wc -l) -ge $Njobs ]]; do
sleep 0.1
done
# do awesome command based on $i
command $i &
done
wait
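A hedged alternative on bash 4.3 or newer is to block with wait -n instead of polling with sleep; the loop then wakes up exactly when one of the background jobs exits:
#!/usr/bin/env bash
N=20        # total number of tasks (placeholder)
Njobs=5     # maximum number of concurrent jobs
for i in $(seq 1 "$N"); do
    # once Njobs jobs are running, wait for any one of them to finish
    while (( $(jobs -rp | wc -l) >= Njobs )); do
        wait -n
    done
    # do awesome command based on $i
    command "$i" &
done
wait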
If you're using task spooler you can do what you're asking. Use the -S <number> flag to specify the number of "slots" (jobs that run concurrently). You can even use -D <job id> to make different jobs depend on another specific job's completion.
So in your example, if you set tsp -S 5, task spooler would run the first 5 jobs and queue up the next 4. Once one of the original 5 jobs completed, the next queued up job (based on lowest job id) would then begin. This would continue to happen as running jobs finish and more slots open up.
Also note for anyone else reading this: on Ubuntu (and maybe other Debian-based systems) task spooler is called tsp so as not to conflict with the openssl ts tool. On most other systems it should be called just ts, which is why, even on Ubuntu, task spooler will refer to itself as ts.
From the manual, regarding slots:
MULTI-SLOT
ts by default offers a queue where each job runs only after the previous finished. Nevertheless, you can change the maximum number of jobs running at once with the -S [num] parameter. We call that number the amount of slots. You can also set the initial number of jobs with the environment variable TS_SLOTS. When increasing this setting, queued waiting jobs will be run at once until reaching the maximum set. When decreasing this setting, no other job will be run until it can meet the amount of running jobs set. When using an amount of slots greater than 1, the action of some commands may change a bit. For example, -t without jobid will tail the first job running, and -d will try to set the dependency with the last job added.
-S [num]
Set the maximum amount of running jobs at once. If you don't specify num it will return the maximum amount of running jobs set.
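For example, a minimal sketch of the workflow described above (script1.sh through script9.sh are placeholder names):
tsp -S 5              # allow 5 jobs to run at once
for s in script*.sh; do
    tsp ./"$s"        # queue each script
done
tsp                   # list queued and running jobs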
You already have a tool that does this: GNU Parallel
parallel --jobs 4 bash ::: script1.sh script2.sh script3.sh script4.sh
See the GNU Parallel tutorial (man parallel_tutorial) for examples.
For the case where there are fewer job slots than tasks:
for f in $(seq $TASKS); do
echo ${RANDOM}e-04
done | parallel --jobs $JOBS "echo {#} {}; sleep {}"
Example results for TASKS=9:
JOBS=1 JOBS=5
1 17994e-04 4 2844e-04
2 25155e-04 2 5752e-04
3 7859e-04 3 13084e-04
4 11812e-04 1 13749e-04
5 19851e-04 8 2546e-04
6 1568e-04 7 12086e-04
7 24074e-04 6 16087e-04
8 8435e-04 9 9826e-04
9 1407e-04 5 27257e-04

Automatic qsub job completion status notification

I have a shell script that calls five other scripts. The first script creates 50 qsub jobs in the cluster. Individual job execution time varies from a couple of minutes to an hour. I need to know when all 50 jobs have finished, because after they complete I need to run the second script. How can I find out whether all the qsub jobs are completed or not? One possible solution is an infinite loop that checks the job status using the qstat command with the job ID, but then I would need to check the job status continuously, which is not an elegant solution. Is it possible for a qsub job to notify me by itself after execution, so that I don't need to monitor the job status frequently?
qsub is capable of handling job dependencies, using -W depend=afterok:jobid.
e.g.
#!/bin/bash
# commands to run on the cluster
COMMANDS="script1.sh script2.sh script3.sh"
# initialize the JOBIDS variable
JOBIDS=""
# queue all commands
for CMD in $COMMANDS; do
# queue command and store the job id
JOBIDS="$JOBIDS:`qsub $CMD`"
done
# queue post processing, dependent on the submitted jobs
# (JOBIDS already starts with a ":", so it slots straight into the afterok list)
qsub -W depend=afterok$JOBIDS postprocessing.sh
exit 0
More examples can be found here http://beige.ucs.indiana.edu/I590/node45.html
I have never heard of a way to do that, and I would be really interested if someone came up with a good answer.
In the meantime, I suggest that you use file tricks: either your script outputs a marker file at the end, or you check for the existence of the log files (assuming they are created only at the end).
while [ ! -e ~/logs/myscript.log-1 ]; do
sleep 30;
done
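A hedged extension of the same idea for the 50 jobs in the question, assuming each job writes a marker file when it finishes (the ~/logs/job-*.done names and second_script.sh are placeholders):
# block until all 50 marker files exist, then run the second script
while [ "$(ls ~/logs/job-*.done 2>/dev/null | wc -l)" -lt 50 ]; do
    sleep 30
done
./second_script.sh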

Better way of running a series of commands simultaneously in UNIX/Linux Shell?

I want to know the good practice for performing a series of commands simultaneously in a UNIX/Linux shell. Suppose that I have a program, program_a, which requires one parameter. I have stored the parameters line by line in a file. So I wrote:
while read line
do
./program_a "$line" > "$line.log" 2>&1
done < parameter_file
The problem is that the execution of program_a takes a long time. Because each execution of program_a is independent of the others, I think these executions can be run simultaneously. I don't know whether this involves multithreading or some other technique. The following is my thought: use & to run each execution in the background.
while read line
do
./program_a "$line" > "$line.log" 2>&1 &
done < parameter_file
Is there any better way of launching multiple tasks?
Did you know that xargs can launch tasks in parallel? Check out the -P and -n parameters!
An example:
xargs -P 4 -n 1 ./program_a < parameter_file
That will run up to 4 program_a instances at a time (-P 4), each processing one line (-n 1). You'll probably have to wrap program_a within a shell script or something so that each child process's stdout and stderr can be redirected appropriately; see the sketch below.
Why this is better than putting processes in the background: suppose you have 1000 lines in the input file; obviously you wouldn't want 1000 processes launched at once. xargs lets you treat the file as a queue, with P workers each consuming and processing n items from it.
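For the wrapping mentioned above, a minimal sketch (run_one.sh is a hypothetical name):
#!/bin/bash
# run_one.sh - run program_a on one parameter and capture its output
./program_a "$1" > "$1.log" 2>&1
Then launch it as: xargs -P 4 -n 1 ./run_one.sh < parameter_file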
With GNU Parallel you can get a logfile for each parameter and run one job per CPU core:
parallel --results logdir ./program_a :::: parameter_file
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial (man parallel_tutorial). Your command line
will love you for it.
