Bash: Running the same program over multiple cores - linux

I have access to a machine where 10 of the cores are available to me, and I would like to actually use them. What I am used to doing on my own machine would be something like this:
for f in *.fa; do
myProgram (options) "./$f" "./$f.tmp"
done
I have 10 files I'd like to do this on -- let's call them blah00.fa, blah01.fa, ... blah09.fa.
The problem with this approach is that myProgram only uses 1 core at a time, so on the multi-core machine I'd be using 1 core at a time, 10 times over, and I wouldn't be using my machine to its max capability.
How could I change my script so that it runs all 10 of my .fa files at the same time? I looked at "Run a looped process in bash across multiple cores" but I couldn't get the command from that to do exactly what I wanted.

You could use
for f in *.fa; do
myProgram (options) "./$f" "./$f.tmp" &
done
wait
which would start all of your jobs in parallel, then wait until they all complete before moving on. In the case where you have more jobs than cores, you would start all of them and let your OS scheduler worry about swapping processes in and out.
One modification is to start 10 jobs at a time
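count=0
for f in *.fa; do
myProgram (options) "./$f" "./$f.tmp" &
(( count++ ))
# Compare with ==; a single = would assign 10 to count on every pass
if (( count == 10 )); then
wait
count=0
fi
done
# Wait for the last, possibly incomplete, batch
wait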
but this is inferior to using GNU Parallel because you can't start new jobs as old ones finish, and you also can't detect whether an older job finished before you managed to start 10 jobs. Plain wait lets you wait on one particular process or on all background processes, but doesn't tell you when any one of an arbitrary set of background processes completes (bash 4.3 and later add wait -n, which returns as soon as any one background job exits).
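On bash 4.3 or later, wait -n makes a sliding window of jobs possible without any extra tools. A minimal sketch, assuming bash 4.3+; run_throttled is a hypothetical helper name and not part of the answer above:

```shell
#!/bin/bash
# run_throttled MAX CMD ARG...   (hypothetical helper)
# Runs CMD once per ARG, keeping at most MAX instances in the background.
# Requires bash 4.3+ for `wait -n`.
run_throttled() {
    local max=$1 cmd=$2 running=0 arg
    shift 2
    for arg in "$@"; do
        if (( running >= max )); then
            wait -n || true              # block until any one job exits
            running=$(( running - 1 ))
        fi
        "$cmd" "$arg" &
        running=$(( running + 1 ))
    done
    wait                                 # wait for the remaining jobs
}
```

For the files in the question this could be called as run_throttled 10 process_one *.fa, where process_one is a hypothetical function that wraps myProgram (options) "./$1" "./$1.tmp".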

With GNU Parallel you can do:
parallel myProgram (options) {} {.}.tmp ::: *.fa
From: http://git.savannah.gnu.org/cgit/parallel.git/tree/README
= Full installation =
Full installation of GNU Parallel is as simple as:
./configure && make && make install
If you are not root you can add ~/bin to your path and install in
~/bin and ~/share:
./configure --prefix=$HOME && make && make install
Or if your system lacks 'make' you can simply copy src/parallel
src/sem src/niceload src/sql to a dir in your path.
= Minimal installation =
If you just need parallel and do not have 'make' installed (maybe the
system is old or Microsoft Windows):
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
mv parallel sem dir-in-your-$PATH/bin/
Watch the intro videos to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
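A note on the parallel command at the top of this answer: {} stands for the input file name and {.} for the file name with its extension removed, so blah00.fa is written to blah00.tmp rather than blah00.fa.tmp as in the question's loop (use {}.tmp to keep the original naming). The same substitutions, sketched with plain bash parameter expansion:

```shell
#!/bin/bash
f="blah00.fa"
echo "$f"             # what {} is replaced with
echo "${f%.*}.tmp"    # what {.}.tmp is replaced with
echo "$f.tmp"         # what {}.tmp is replaced with
```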

# Wait until fewer than $3 instances of $1 are running, then start one more instance and return
function runParallel () {
cmd=$1
args=$2
number=$3
currNumber="1024"
while true ; do
currNumber=`ps -e | grep -v "grep" | grep " $1$" | wc -l`
if [ $currNumber -lt $number ] ; then
break
fi
sleep 1
done
echo "run: $cmd $args"
$cmd $args &
}
loop=0
# We will run 12 sleep commands for 10 seconds each
# and only five of them will work simultaneously
while [ $loop -ne 12 ] ; do
runParallel "sleep" 10 5
loop=`expr $loop + 1`
done

Related

Parallel run and wait for processes from subshell

Hi all, I'm trying to make something like the parallel tool for the shell, simply because the functionality of parallel is not enough for my task. The reason is that I need to run different versions of a compiler.
Imagine that I need to compile 12 programs with different compilers, but I can run only 4 of them simultaneously (otherwise PC runs out of memory and crashes :). I also want to be able to observe what's going on with each compile, therefore I execute every compile in new window.
Just to make it easier, here I'll replace the compiler that I run with a small script, sleep.sh, that waits and prints its process ID:
#!/bin/bash
sleep 30
echo $$
So the main script should look like parallel_run.sh :
#!/bin/bash
for i in {0..11}; do
xfce4-terminal -H -e "./sleep.sh" &
pids[$i]=$!
pstree -p $pids
if (( $i % 4 == 0 ))
then
for pid in ${pids[*]}; do
wait $pid
done
fi
done
The problem is that with $! I get the PID of xfce4-terminal and not of the program it executes. So if I look at the pstree output of the 1st iteration, I can see this from the main script:
xfce4-terminal(31666)----{xfce4-terminal}(31668)
|--{xfce4-terminal}(31669)
and sleep.sh says that it had pid = 30876 at that time. Thus wait doesn't work at all in this case.
Q: How do I get the right PID of the compiler that runs in the subshell?
Maybe there is another way to solve a task like this?
It seems there is no way to trace the PID from parent to child if you invoke the process in a new xfce4-terminal, as the terminal process dies right after it has executed the given command. So I came to a solution which is not perfect, but acceptable in my situation. I run the compiler processes in the background and redirect their output to .log files. Then I run tail on these log files in new terminal windows, kill all tail processes that belong to the current $USER when the compilers from the current batch are done, and then run the next batch.
#!/bin/bash
for i in {1..8}; do
./sleep.sh > ./process_$i.log &
prcid=$!
xfce4-terminal -e "tail -f ./process_$i.log" &
pids[$i]=$prcid
if (( $i % 4 == 0 ))
then
for pid in ${pids[*]}; do
wait $pid
done
killall -u $USER tail
fi
done
Hopefully there will be no other tails running at that time :)

Linux command execution rate limiting

I have a Linux command that can be called by another application multiple times (in quick succession) with different parameters. The problem is that if the command is executed too quickly in succession, the function it performs will not work properly.
What I’m looking for is some simple way to ensure that each call to the command will be properly delayed/spaced (by a couple milliseconds) from each other.
Order of execution does not matter in this case and I have no control over how the application makes the calls.
Edit: The command being called is used to transmit an RF signal on a Raspberry Pi. As such, the command execution must be exclusive (no concurrency) with an additional delay between executions to prevent the receivers from misreading the signals.
For anyone with the same problem, this worked for me: https://unix.stackexchange.com/questions/408934/how-to-serialize-command-execution-on-linux
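CMD="<some command> && sleep <some delay in seconds>"
# -c hands the string to a shell, so the && is interpreted instead of being passed as an argument
flock /tmp/some_lockfile -c "$CMD"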
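The same idea can be written as a small wrapper that holds the lock for the duration of the command plus the delay. A sketch, assuming flock from util-linux is installed; the function name serialized, the default lock path and the 0.2 second delay are illustrative choices, not taken from the linked answer:

```shell
#!/bin/bash
LOCKFILE=${LOCKFILE:-/tmp/some_lockfile}   # any writable path works
DELAY=${DELAY:-0.2}                        # spacing between runs, in seconds

# serialized CMD ARG...   (hypothetical helper)
# Runs CMD while holding an exclusive lock, then keeps the lock for DELAY
# seconds so that the next caller cannot start immediately.
serialized() {
    (
        flock -x 9          # blocks until the lock on fd 9 is free
        "$@"
        status=$?
        sleep "$DELAY"
        exit "$status"
    ) 9>"$LOCKFILE"
}
```

The lock is released when the subshell exits and file descriptor 9 is closed, so callers run strictly one after another regardless of how quickly they were started.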
For simple concurrency control that limits concurrent execution to N instances, consider the following while loop (modify as needed).
Note that the script must be invoked as /path/to/script.sh so that it will find other instances. Starting with 'bash /path/to/script.sh' will require changes!
#! /bin/bash
# Process identifier.
echo "START $$"
ME=${0##*/}
# Max number of instances
N=5
# Sleep while there are more than N instances.
while [[ "$(pgrep -c -x $ME)" -gt "$N" ]] ; do echo Waiting ... ; sleep 1 ; done
# Execute the job
sleep "$@"
echo "Done $$"

Fill up four slots of parallel processes constantly even when some finish

I have a script that runs batches of 4 processes at a time. I don't care about getting the return codes of each process, but I never want to run more than 4 processes concurrently. The issue with the approach below is that it does not keep 4 processes running at all times. For example, if proc2 and proc3 finish early, I would like proc5 and proc6 to start right away, rather than only once 1-4 are all complete. How can I achieve this in bash?
run_func_1 &
run_func_2 &
run_func_3 &
run_func_4 &
wait
run_func_5 &
run_func_6 &
run_func_7 &
run_func_8 &
wait
I tried a custom implementation with a pool of workers and a queue of jobs.
A new worker takes a job from the queue as soon as it has finished the previous one.
You can probably adapt this script to whatever you need, but I hope you will see my intentions.
Here's the script:
#!/bin/bash
f1() { echo Started f1; sleep 10; echo Finished f1; }
f2() { echo Started f2; sleep 8; echo Finished f2; }
f3() { echo Started f3; sleep 12; echo Finished f3; }
f4() { echo Started f4; sleep 14; echo Finished f4; }
f5() { echo Started f5; sleep 7; echo Finished f5; }
declare -r MAX_WORKERS=2
declare -a worker_pids
declare -a jobs=('f1' 'f2' 'f3' 'f4' 'f5')
available_worker_index() {
# If number of workers is less than MAX_WORKERS
# We still have workers that are idle
declare worker_count="${#worker_pids[@]}"
if [[ $worker_count -lt $MAX_WORKERS ]]; then
echo "$worker_count"
return 0
fi
# If we reached this code it means
# All workers are already created and executing a job
# We should check which of them has finished and return its index as available
declare -i index=0
for pid in "${worker_pids[@]}"; do
is_running=$(ps -p "$pid" > /dev/null; echo "$?")
if [[ $is_running != 0 ]]; then
echo "$index"
return 0
fi
index+=1
done
echo "None"
}
for job in "${jobs[@]}"; do
declare worker_index
worker_index=$(available_worker_index)
while [[ $worker_index == "None" ]]; do
# Wait for available worker
sleep 3
worker_index=$(available_worker_index)
done
# Run the job in background
"$job" &
# Save its pid for later
pid="$!"
worker_pids["$worker_index"]="$pid"
done
# Wait all workers to finish
wait
You can easily change size of the worker pool only by changing MAX_WORKERS variable.
With GNU Parallel it is as simple as:
parallel -j4 ::: run_func_{1..8}
Just remember to export -f the functions.
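The export is needed because GNU Parallel runs its jobs in child shells, and a child shell only sees functions that were exported. A minimal sketch of the effect, using bash -c as a stand-in for the child shell (the function body here is made up):

```shell
#!/bin/bash
run_func_1() { echo "run_func_1 ran"; }

# Not exported yet: a child bash cannot find the function.
bash -c 'run_func_1' 2>/dev/null || echo "child shell: run_func_1 not found"

# After export -f, child bash processes inherit the definition.
export -f run_func_1
bash -c 'run_func_1'
```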
If GNU Parallel is not installed, use
parallel --embed > new_script
to generate a shell script which embeds GNU Parallel. You then simply change the end of new_script.
By default it will run one job per cpu-core. This can be adjusted with --jobs.
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
Installation
For security reasons you should install GNU Parallel with your package manager, but if GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
$ (wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
fetch -o - http://pi.dk/3 ) > install.sh
$ sha1sum install.sh | grep 883c667e01eed62f975ad28b6d50e22a
12345678 883c667e 01eed62f 975ad28b 6d50e22a
$ md5sum install.sh | grep cc21b4c943fd03e93ae1ae49e28573c0
cc21b4c9 43fd03e9 3ae1ae49 e28573c0
$ sha512sum install.sh | grep da012ec113b49a54e705f86d51e784ebced224fdf
79945d9d 250b42a4 2067bb00 99da012e c113b49a 54e705f8 6d51e784 ebced224
fdff3f52 ca588d64 e75f6033 61bd543f d631f592 2f87ceb2 ab034149 6df84a35
$ bash install.sh
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Linux bash multithread/process small jobs

I have a script that runs some data processing command 10K times.
foreach f (folderName/input*.txt)
mycmd $f
end
I have timed the runtime for each "mycmd $f" to be 0.25 secs.
With 10K runs, it adds up to be more than 1 hr.
I'm running it on a 16-core Nehalem.
It's a huge waste to not run on the remaining 15 cores.
I have tried & with sleep, but somehow the script just dies with a warning or error around 3900 iterations, see below. The shorter the sleep, the faster it dies.
foreach f (folderName/input*.txt)
mycmd $f & ; sleep 0.1
end
There has got to be a better way.
Note: I would prefer shell script solutions, let's not wander into C/C++ land.
Thanks
Regards
Pipe the list of files to
xargs -n 1 -P 16 mycmd
For example:
echo folderName/input*.txt | xargs -n 1 -P 16 mycmd
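The echo form splits on whitespace, so it breaks on file names that contain spaces. A null-delimited variant avoids that. A sketch, assuming an xargs that supports -0 and -P (GNU and BSD xargs both do); run_on_files is a hypothetical wrapper, and the command must be an external program because xargs cannot call shell functions:

```shell
#!/bin/bash
# run_on_files WORKERS CMD FILE...   (hypothetical helper)
# Runs CMD once per FILE, with up to WORKERS processes at a time.
run_on_files() {
    local workers=$1 cmd=$2
    shift 2
    # printf emits each file name followed by a NUL byte; xargs -0 splits on NUL.
    printf '%s\0' "$@" | xargs -0 -n 1 -P "$workers" "$cmd"
}
```

For the question this would be run_on_files 16 mycmd folderName/input*.txt, run from bash rather than csh.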
There are a few other solutions possible using one of the following applications:
xjobs
Parallel
PPSS - Parallel Processing Shell Script
runpar.sh
Submit the jobs with batch; that should fix load balancing and resource starvation issues.
for f in folderName/input*.txt; do
batch <<____HERE
mycmd "$f"
____HERE
done
(Not 100% sure whether the quotes are correct and/or useful.)
With GNU Parallel you can do:
parallel mycmd ::: folderName/input*.txt
From: http://git.savannah.gnu.org/cgit/parallel.git/tree/README
= Full installation =
Full installation of GNU Parallel is as simple as:
./configure && make && make install
If you are not root you can add ~/bin to your path and install in
~/bin and ~/share:
./configure --prefix=$HOME && make && make install
Or if your system lacks 'make' you can simply copy src/parallel
src/sem src/niceload src/sql to a dir in your path.
= Minimal installation =
If you just need parallel and do not have 'make' installed (maybe the
system is old or Microsoft Windows):
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
mv parallel sem dir-in-your-$PATH/bin/
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Scons command with time limit

I want to limit the execution time of a program I am running under Linux. I put in my scons script a line like:
Command("com", "", "ulimit -t 1; myprogram")
and tested it with an infinite loop program: it did not work and the program ran forever.
Am I missing something?
-- tsf
ulimit -t 1 means that the limit is set to 1 second of CPU time. If your infinite loop program uses any sort of sleep in its inner loop then it will use practically no CPU time. This means it will not get killed after 1 second of real, wall-clock time. In fact it may take minutes or hours to use up its 1 second allocation.
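The difference between CPU time and wall-clock time is easy to check from a shell. A sketch, with with_cpu_limit and busy_loop as made-up helper names: a sleeping process outlives its 1 second CPU limit, while a busy loop is killed once it has burned 1 second of CPU.

```shell
#!/bin/bash
# Run a command under a CPU-time limit, in a subshell so that the limit
# does not stick to the calling shell. (Hypothetical helper.)
with_cpu_limit() {
    local secs=$1
    shift
    # Only run the command if the limit could actually be set.
    ( ulimit -t "$secs" && "$@" )
}

busy_loop() { while :; do :; done; }

sleep_status=0
with_cpu_limit 1 sleep 2 || sleep_status=$?
echo "sleep 2 exited with status $sleep_status"

busy_status=0
with_cpu_limit 1 busy_loop || busy_status=$?
echo "busy loop exited with status $busy_status"
```

An exit status above 128 means the process was terminated by a signal, which is what the CPU limit does to the busy loop.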
What happens if you run the command outside of SCons? Perhaps you don't have permission to change the limit at all...
ulimit -t 1; ./myprogram
For example, it may say the following if the limit is already set to 0:
bash: ulimit: cpu time: cannot modify limit: Operation not permitted
Edit: it seems that the -t option is broken on Ubuntu 9.04. A fix has been committed 05 June 2009, but it may take a while to trickle into the updates - it may not be fixed until 9.10.
As an historical note, this problem no longer exists in Ubuntu 10.04.
You can also use this script:
(taken from http://newsgroups.derkeiler.com/Archive/Comp/comp.sys.mac.system/2005-12/msg00247.html)
#!/bin/bash
# timeout script
#
usage()
{
echo "usage: timeout seconds command args ..."
exit 1
}
[[ $# -lt 2 ]] && usage
seconds=$1; shift
timeout()
{
sleep $seconds
kill -9 $pid >/dev/null 2>/dev/null
}
eval "$@" &
pid=$!
timeout &
wait $pid
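On current Linux systems the timeout utility from GNU coreutils covers the same need, and it limits wall-clock time rather than CPU time. A sketch, assuming GNU coreutils timeout is installed; sleep 5 stands in for myprogram:

```shell
#!/bin/bash
status=0
timeout 1 sleep 5 || status=$?
if [ "$status" -eq 124 ]; then
    echo "command hit the 1 second limit and was terminated"
else
    echo "command finished on its own with status $status"
fi
```

GNU timeout reports an expired limit with exit status 124, which is what the check above relies on.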

Resources