How to (trivially) parallelize with the Linux shell by starting one task per Linux core? - linux

Today's CPUs typically comprise several physical cores. These might even be multi-threaded, so that the Linux kernel sees quite a large number of logical cores and schedules processes on each of them independently. When multiple tasks run on a Linux system, the scheduler normally distributes the total workload well across all logical cores (some of which may share the same physical core).
Now, say, I have a large number of files to process with the same executable. I usually do this with the "find" command:
find <path> <option> <exec>
However, this starts just one task at a time and waits for it to complete before starting the next one. Thus only one core is ever in use, leaving the majority of cores idle (assuming this find command is the only task running on the system). It would be much better to launch N tasks at the same time, where N is the number of cores seen by the Linux kernel.
Is there a command that will do that?

Use find with the -print0 option. Pipe it to xargs with the -0 option. xargs also accepts the -P option to specify a number of processes. -P should be used in combination with -n or -L.
Read man xargs for more information.
An example command:
find . -print0 | xargs -0 -P4 -n4 grep searchstring
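A variation on this (just a sketch, not from the original answer): let nproc supply the process count so the pipeline matches the number of cores the kernel reports, and restrict find to regular files:
find . -type f -print0 | xargs -0 -P"$(nproc)" -n4 grep -l searchstring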

If you have GNU Parallel http://www.gnu.org/software/parallel/ installed you can do this:
find | parallel do stuff {} --option_a\; do more stuff {}
You can install GNU Parallel simply by:
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
Watch the intro videos for GNU Parallel to learn more:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
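For the original use case, a minimal sketch (./process_file is a placeholder for whatever executable you run on each file; parallel defaults to one job per CPU core, and -j overrides that):
find <path> -type f | parallel -j"$(nproc)" ./process_file {}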

GNU Parallel or xargs -P is probably a better way to handle this, but you can also write a sort-of multitasking framework in bash. It's a little messy and unreliable, however, due to the lack of certain facilities.
#!/bin/bash
# Needs bash: uses job control (set -m), ${var//} expansion and %jobspec.
MAXJOBS=3
CJ=0    # number of currently running jobs
SJ=""   # PID of the placeholder "sleeper" job, if one is waiting
# Extract the bare job number from a jobs listing entry such as "[1]-".
gj() {
echo ${1//[][-]/}
}
# SIGCHLD handler: reap one finished job and kill the sleeper so startj wakes up.
endj() {
trap "" SIGCHLD
ej=$(gj $(jobs | grep Done))
jobs %$ej
wait %$ej
CJ=$(( $CJ - 1 ))
if [ -n "$SJ" ]; then
kill $SJ
SJ=""
fi
}
# Start a job in the background, blocking first while MAXJOBS jobs are already running.
startj() {
j=$*
while [ $CJ -ge $MAXJOBS ]; do
sleep 1000 &
SJ=$!
echo too many jobs running: $CJ
echo waiting for sleeper job [$SJ]
trap endj SIGCHLD
wait $SJ 2>/dev/null
done
CJ=$(( $CJ + 1 ))
echo $CJ jobs running. starting: $j
eval "$j &"
}
set -m
# test
startj sleep 2
startj sleep 10
startj sleep 1
startj sleep 1
startj sleep 1
startj sleep 1
startj sleep 1
startj sleep 1
startj sleep 2
startj sleep 10
wait
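On bash 4.3 and later, the builtin wait -n waits for any one background job to finish, which makes this kind of throttle much simpler. A minimal sketch, not part of the original answer (some_command and the file list are placeholders):
#!/bin/bash
MAXJOBS=3
for f in file1 file2 file3 file4 file5; do
while [ "$(jobs -rp | wc -l)" -ge "$MAXJOBS" ]; do
wait -n   # block until any one background job exits
done
some_command "$f" &
done
wait   # wait for the remaining jobs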

Related

bash: how to keep some delay between multiple instances of a script

I am trying to download 100 files using a script
I don't want more than 4 downloads to be happening at any point in time.
So I have created a folder /home/user/file_limit. The script creates a file there before the download starts and deletes it after the download is complete.
The script checks that the number of files in the folder is less than 4; only then will it create a file in /home/user/file_limit.
I am running a script like this
today=`date +%Y-%m-%d-%H_%M_%S_%N`;
while true
do
sleep 1
# The below command will find number of files in the folder /home/user/file_limit
lines=$(find /home/user/file_limit -iname 'download_*' -type f| wc -l)
if [ $lines -lt 5 ]; then
echo "Create file"
touch "/home/user/file_limit/download_${today}"
break;
else
echo "Number of files equals 4"
fi
done
#After this some downloading happens and once the downloading is complete
rm "/home/user/file_limit/download_${today}"
The problem I am facing is when 100 such scripts are running. E.g. when the number of files in the folder is less than 4, many touch "/home/user/file_limit/download_${today}" commands get executed simultaneously and all of them create files, so the total number of files becomes more than 4, which I don't want because more downloads make my system slower.
How do I ensure there is a delay between each script's check of lines=$(find /home/user/file_limit -iname 'download_*' -type f| wc -l) so that only one touch command gets executed?
Or how do I ensure that the lines=$(find /home/user/file_limit -iname 'download_*' -type f| wc -l) check is performed by each script in a queue, so that no two scripts can check it at the same time?
How to ensure there is a delay between each script checking lines=$(find ... | wc -l) so that only one touch command gets executed
Adding a delay won't solve the problem. You need a lock, mutex, or semaphore to ensure that the check and creation of files is executed atomically.
Locks limit the number of parallel processes to 1. Locks can be created with flock (usually pre-installed).
Semaphores are generalized locks limiting the number of concurrent processes to any number N. Semaphores can be created with sem (part of GNU Parallel, which has to be installed).
The following script allows 4 downloads in parallel. If 4 downloads are running and you start the script a 5th time, that 5th download will pause until one of the 4 running downloads finishes.
#! /usr/bin/env bash
main() {
# put your code for downloading here
}
export -f main
sem --id downloadlimit -j4 main
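For completeness, flock (mentioned above) can also serialize the asker's check-and-create step, so that no two scripts can both see "fewer than 4 files" and create one at the same time. This is only a rough sketch, not part of the original answer, and the lock-file path is hypothetical:
#!/usr/bin/env bash
lockfile=/home/user/file_limit/.lock   # hypothetical lock file
today=$(date +%Y-%m-%d-%H_%M_%S_%N)
while true; do
(
flock -x 9   # exclusive lock on fd 9; other scripts block here
lines=$(find /home/user/file_limit -iname 'download_*' -type f | wc -l)
if [ "$lines" -lt 4 ]; then
touch "/home/user/file_limit/download_${today}"
exit 0   # slot acquired
fi
exit 1   # still full, retry after a pause
) 9>"$lockfile" && break
sleep 1
done
# ... download here, then release the slot ...
rm "/home/user/file_limit/download_${today}"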
My solution starts at most MAXPARALLELJOBS processes and then waits until that whole batch of processes is done before starting the next batch...
Hope it helps with your problem.
MAXPARALLELJOBS=4
count=0
while <not done the job>
do
((count++))
( <download job> ) &
[ ${count} -ge ${MAXPARALLELJOBS} ] && count=0 && wait
done

How to wait on a backgrounded sub-process with `wait` command [duplicate]

Is there any builtin feature in Bash to wait for a process to finish?
The wait command only allows one to wait for child processes to finish.
I would like to know if there is any way to wait for any process to finish before proceeding in any script.
A mechanical way to do this is as follows but I would like to know if there is any builtin feature in Bash.
while ps -p `cat $PID_FILE` > /dev/null; do sleep 1; done
To wait for any process to finish
Linux (doesn't work on Alpine, where ash doesn't support tail --pid):
tail --pid=$pid -f /dev/null
Darwin (requires that $pid has open files):
lsof -p $pid +r 1 &>/dev/null
With timeout (seconds)
Linux:
timeout $timeout tail --pid=$pid -f /dev/null
Darwin (requires that $pid has open files):
lsof -p $pid +r 1m%s -t | grep -qm1 $(date -v+${timeout}S +%s 2>/dev/null || echo INF)
There's no builtin. Use kill -0 in a loop for a workable solution:
anywait(){
for pid in "$@"; do
while kill -0 "$pid"; do
sleep 0.5
done
done
}
Or as a simpler oneliner for easy one time usage:
while kill -0 PIDS 2> /dev/null; do sleep 1; done;
As noted by several commentators, if you want to wait for processes that you do not have the privilege to send signals to, you have to find some other way to detect whether the process is running, to replace the kill -0 $pid call. On Linux, test -d "/proc/$pid" works; on other systems you might have to use pgrep (if available) or something like ps | grep "^$pid ".
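A sketch of the /proc-based variant just mentioned (Linux only; $pid is whatever PID you are watching):
while test -d "/proc/$pid"; do sleep 0.5; done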
I found "kill -0" does not work if the process is owned by root (or other), so I used pgrep and came up with:
while pgrep -u root process_name > /dev/null; do sleep 1; done
This would have the disadvantage of probably matching zombie processes.
This bash script loop ends if the process does not exist, or it's a zombie.
PID=<pid to watch>
while s=`ps -p $PID -o s=` && [[ "$s" && "$s" != 'Z' ]]; do
sleep 1
done
EDIT: The script above was contributed by Rockallite. Thanks!
My original answer below works on Linux, relying on procfs, i.e. /proc/. I don't know how portable it is:
while [[ ( -d /proc/$PID ) && ( -z `grep zombie /proc/$PID/status` ) ]]; do
sleep 1
done
This limitation is not specific to the shell: operating systems themselves do not have system calls to watch for non-child process termination.
FreeBSD and Solaris have the handy pwait(1) utility, which does exactly what you want.
I believe other modern OSes also have the necessary system calls (macOS, for example, implements BSD's kqueue), but not all make them available from the command line.
From the bash manpage
wait [n ...]
Wait for each specified process and return its termination status
Each n may be a process ID or a job specification; if a
job spec is given, all processes in that job's pipeline are
waited for. If n is not given, all currently active child processes
are waited for, and the return status is zero. If n
specifies a non-existent process or job, the return status is
127. Otherwise, the return status is the exit status of the
last process or job waited for.
Okay, so it seems the answer is -- no, there is no built in tool.
After setting /proc/sys/kernel/yama/ptrace_scope to 0, it is possible to use the strace program. Further switches can be used to make it silent, so that it really waits passively:
strace -qqe '' -p <PID>
All these solutions are tested in Ubuntu 14.04:
Solution 1 (by using ps command):
Just to add up to Pierz answer, I would suggest:
while ps axg | grep -vw grep | grep -w process_name > /dev/null; do sleep 1; done
In this case, grep -vw grep ensures that grep matches only process_name and not grep itself. It has the advantage of supporting the cases where the process_name is not at the end of a line at ps axg.
Solution 2 (by using top command and process name):
while [[ $(awk '$12=="process_name" {print $0}' <(top -n 1 -b)) ]]; do sleep 1; done
Replace process_name with the process name that appears in top -n 1 -b. Please keep the quotation marks.
To see the list of processes that you wait for them to be finished, you can run:
while : ; do p=$(awk '$12=="process_name" {print $0}' <(top -n 1 -b)); [[ $p ]] || break; echo $p; sleep 1; done
Solution 3 (by using top command and process ID):
while [[ $(awk '$1=="process_id" {print $0}' <(top -n 1 -b)) ]]; do sleep 1; done
Replace process_id with the process ID of your program.
Blocking solution
Use wait in a loop to wait for all the processes to terminate:
function anywait()
{
for pid in "$@"
do
wait $pid
echo "Process $pid terminated"
done
echo 'All processes terminated'
}
This function returns as soon as all the processes have terminated. This is the most efficient solution.
Non-blocking solution
Use kill -0 in a loop to wait for all the processes to terminate, while doing anything you like between checks:
function anywait_w_status()
{
for pid in "$@"
do
while kill -0 "$pid"
do
echo "Process $pid still running..."
sleep 1
done
done
echo 'All processes terminated'
}
The reaction time is bounded by the sleep interval, which is needed to prevent high CPU usage.
A realistic usage:
Waiting for all processes to terminate while telling the user which PIDs are still running.
function anywait_w_status2()
{
while true
do
alive_pids=()
for pid in "$@"
do
kill -0 "$pid" 2>/dev/null \
&& alive_pids+=("$pid")
done
if [ ${#alive_pids[@]} -eq 0 ]
then
break
fi
echo "Process(es) still running... ${alive_pids[@]}"
sleep 1
done
echo 'All processes terminated'
}
Notes
These functions receive the PIDs as arguments and iterate over them with "$@".
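A hypothetical usage sketch (the sleep commands stand in for real background jobs):
sleep 5 & pid1=$!
sleep 7 & pid2=$!
anywait_w_status "$pid1" "$pid2"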
I had the same issue; I solved it by killing the process and then waiting for each process to finish using the proc filesystem:
while [ -e /proc/${pid} ]; do sleep 0.1; done
There is no builtin feature to wait for any process to finish.
You could send kill -0 to any PID found, so you don't get puzzled by zombies and stuff that will still be visible in ps (while still retrieving the PID list using ps).
If you need to both kill a process and wait for it to finish, this can be achieved with killall(1) (based on process names) and start-stop-daemon(8) (based on a pidfile).
To kill all processes matching someproc and wait for them to die:
killall someproc --wait # wait forever until matching processes die
timeout 10s killall someproc --wait # timeout after 10 seconds
(Unfortunately, there's no direct equivalent of --wait with kill for a specific pid).
To kill a process based on a pidfile /var/run/someproc.pid using signal SIGINT, while waiting for it to finish, with SIGKILL being sent after 20 seconds of timeout, use:
start-stop-daemon --stop --signal INT --retry 20 --pidfile /var/run/someproc.pid
Use inotifywait to monitor some file that gets closed, when your process terminates. Example (on Linux):
yourproc >logfile.log & disown
inotifywait -q -e close logfile.log
-e specifies the event to wait for, -q means minimal output only on termination. In this case it will be:
logfile.log CLOSE_WRITE,CLOSE
A single inotifywait command can be used to wait for multiple processes:
yourproc1 >logfile1.log & disown
yourproc2 >logfile2.log & disown
yourproc3 >logfile3.log & disown
inotifywait -q -e close logfile1.log logfile2.log logfile3.log
The output of inotifywait will tell you which process terminated. This only works with 'real' files, not with something in /proc/.
Rauno Palosaari's solution for "Timeout in Seconds, Darwin" is an excellent workaround for a UNIX-like OS that does not have GNU tail (it is not specific to Darwin). But, depending on the age of the UNIX-like operating system, the command line offered is more complex than necessary, and it can fail:
lsof -p $pid +r 1m%s -t | grep -qm1 $(date -v+${timeout}S +%s 2>/dev/null || echo INF)
On at least one old UNIX, the lsof argument +r 1m%s fails (even for a superuser):
lsof: can't read kernel name list.
The m%s is an output format specification. A simpler post-processor does not require it. For example, the following command waits on PID 5959 for up to five seconds:
lsof -p 5959 +r 1 | awk '/^=/ { if (T++ >= 5) { exit 1 } }'
In this example, if PID 5959 exits of its own accord before the five seconds elapse, ${?} is 0. If not, ${?} is 1 after five seconds.
It may be worth expressly noting that in +r 1, the 1 is the poll interval (in seconds), so it may be changed to suit the situation.
On a system like OS X you might not have pgrep, so you can try this approach when looking for processes by name:
while ps axg | grep process_name$ > /dev/null; do sleep 1; done
The $ at the end of the process name ensures that grep matches only process_name up to the end of the line in the ps output, and not grep itself.

Why the "while" concurrent is running slower and slower in shell script?

I want to zip lots of files using multiple CPUs at the same time, using the shell's ability to run jobs concurrently, like this:
#!/bin/bash
#set -x
function zip_data()
{
while true
do
{
echo "do zip something"
}&
done
}
zip_data
wait
At the beginning, each loop iteration runs quickly.
But as more and more iterations have run, it gets slower and slower.
Why???
I think the reason may be that too many child processes are running.
So I tried to make the while loop run one iteration at a time, like this:
#!/bin/bash
#set -x
function trap_exit
{
exec 1000>&-;exec 1000<&-
kill -9 0
}
trap 'trap_exit; exit 2' 1 2 3 15 9
mkfifo testfifo ; exec 1000<>testfifo ; rm -rf testfifo
function zip_data()
{
echo >&1000
while true
read -u 1000
do
{
echo "do something"
echo >&1000
}&
done
}
zip_data
wait
However, the behaviour is the same as before.
So I still don't understand why it gets slower and slower while running.
Today I tried the following, but it doesn't work either:
#!/bin/bash
#set -x
c=0
while true
do
c=$(jobs -p | wc -l)
while [ $c -ge 20 ]; do
c=$(jobs -p | wc -l)
sleep 0.01
done
{
echo "$c"
sleep 0.8
}&
done
So I tried another way to implement this, like this. Thank you!
#!/bin/bash
#set -x
function EXPECT_FUNC()
{
para=$1
while true
do
{
do something $1
}
done
}
EXPECT_FUNC 1 &
EXPECT_FUNC 2 &
EXPECT_FUNC 3 &
EXPECT_FUNC 4 &
wait
Any single-threaded util can be run as well-managed concurrent jobs with parallel. man parallel offers dozens of examples, e.g.:
Create a directory for each zip-file and unzip it in that dir:
parallel 'mkdir {.}; cd {.}; unzip ../{}' ::: *.zip
Recompress all .gz files in current directory using bzip2 running 1 job
per CPU core in parallel:
parallel "zcat {} | bzip2 >{.}.bz2 && rm {}" ::: *.gz
A particularly interesting example that only works with gzip shows how to use several CPUs to simultaneously work on one archive with a single-threaded archiver, which sounds impossible:
To process a big file or some output you can use --pipe to split up
the data into blocks and pipe the blocks into the processing program.
If the program is gzip -9 you can do:
cat bigfile | parallel --pipe --recend '' -k gzip -9 >bigfile.gz
This will split bigfile into blocks of 1 MB and pass that to
gzip -9 in parallel. One gzip will be run per CPU core. The output
of gzip -9 will be kept in order and saved to bigfile.gz
If parallel is too complex, here are some compression utils with built-in parallel archiving:
XZ: pixz
LZMA: plzip, pxz
GZIP: pigz
BZIP2: pbzip2
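As a quick sketch (assuming pigz is installed), pigz is a drop-in replacement for gzip that spreads the compression of a single file across all cores:
pigz -9 bigfile        # compresses bigfile to bigfile.gz using all available cores
pigz -p 4 -9 bigfile   # or cap it at 4 threads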

How to dispatch tasks in Linux when the system is not busy

I'm using a 12-core, 24-thread Linux machine to perform the following tasks. Each task is independent.
while read parameter
do
./program_a $parameter $parameter.log 2>&1 &
done < parameter_file
However, this code dispatches all the tasks at once, which could lead to serious context switching, a lack of idle CPU power, and/or a lack of memory.
I want to exploit system information tools such as free, top, and ps to determine if the task should be dispatched.
Using free.
while read parameter
do
#for instance using free
free_mem=`free -g | grep Mem | awk '{print $4}'`
if [ $free_mem -gt 10 ]; then
./program_a $parameter $parameter.log 2>&1 &
fi
done < parameter_file
But this won't work, because it won't wait until the condition is met. How should I do this?
Besides, how should I use top and ps to determine whether the system is busy or not? I don't want to dispatch new tasks when the system is too busy.
Maybe I can use
ps aux | grep "program_a " | grep -v "grep" | wc -l
to limit the number of dispatched tasks. But it is an implicit way to determine whether the system is busy or not. Any other thoughts?
while read parameter
do
#for instance using free
while true; do
free_mem=`free -g | awk '/Mem/{print $4}'`
if (( $free_mem > 10 )); then
break
fi
sleep 1 # wait so that some tasks might finish
done
./program_a $parameter $parameter.log 2>&1 &
done < parameter_file
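Since the question also asks how to judge overall busyness with tools like top, here is a hedged sketch (an assumption, not part of the answer) that additionally gates dispatch on the 1-minute load average read from /proc/loadavg:
max_load=20   # hypothetical threshold for a 12-core/24-thread machine
while read parameter
do
while :; do
load=$(awk '{print int($1)}' /proc/loadavg)
free_mem=$(free -g | awk '/Mem/{print $4}')
(( load < max_load && free_mem > 10 )) && break
sleep 1   # wait so that some tasks might finish
done
./program_a $parameter $parameter.log 2>&1 &
done < parameter_file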

KSH: constrain the number of threads that can run at one time

I have a script that loops, and each iteration invokes a thread that runs in the background, like below:
xn_run_process.sh
...
for each in `ls ${INPUT_DIR}/MDX*.txt`
do
java -Xms256m -Xmx1024m -cp ${CLASSPATH} com.wf.xn.etcc.Main -config=${CONFIG_FILE}
...
for SCALE_PDF in `ls ${PROCESS_DIR}/*.pdf`
do
OUTPUT_AFP=${OUTPUT_DIR}/`basename ${SCALE_PDF}`
OUTPUT_AFP=`print ${OUTPUT_AFP} | sed s/pdf/afp/g`
${PROJ_DIR}/myscript.sh -i ${SCALE_PDF} -o ${OUTPUT_AFP} &
sleep 30
done
done
When I wrote this, I thought only 5 instances of myscript.sh would be executed concurrently at any time. However, things changed, and this list now executes 30 threads, each doing quite heavy processing. How do I constrain the number of concurrent processes to 5?
While this is possible in pure shell scripting, the easiest approach would be using a parallelization tool like GNU parallel or GNU make. Makefile example:
SOURCES = ${SOME_LIST}
STAMPS = $(SOME_LIST:=.did-run-stamp)
all : $(STAMPS)
%.did-run-stamp : %
	/full/path/myscript.sh -f $<
and then calling make as make -j 5.
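A hypothetical invocation, with SOME_LIST passed on the command line (it can just as well be defined inside the Makefile):
make -j 5 SOME_LIST="one.pdf two.pdf three.pdf"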
Use GNU Parallel (adjust -j as you see fit; remove it if you want one job per CPU core):
for each in `ls ${INPUT_DIR}/MDX*.txt`
do
java -Xms256m -Xmx1024m -cp ${CLASSPATH} com.wf.xn.etcc.Main -config=${CONFIG_FILE}
...
for SCALE_PDF in `ls ${PROCESS_DIR}/*.pdf`
do
OUTPUT_AFP=${OUTPUT_DIR}/`basename ${SCALE_PDF}`
OUTPUT_AFP=`print ${OUTPUT_AFP} | sed s/pdf/afp/g`
sem --id myid -j 5 ${PROJ_DIR}/myscript.sh -i ${SCALE_PDF} -o ${OUTPUT_AFP}
done
done
sem --wait --id myid
sem is part of GNU Parallel.
This will keep 5 jobs running until there are only 5 jobs left. Then it will allow your java to run while the last 5 finish. sem --wait will wait until those last 5 are finished, too.
Alternatively:
for each ...
java ...
...
ls ${PROCESS_DIR}/*.pdf |
parallel -j 5 ${PROJ_DIR}/myscript.sh -i {} -o ${OUTPUT_DIR}/{/.}.afp
done
This will run 5 jobs in parallel and only let java run when all the jobs are finished.
Alternatively you can use the queue trick described in GNU Parallel's man page: https://www.gnu.org/software/parallel/man.html#example__gnu_parallel_as_queue_system_batch_manager
echo >jobqueue; tail -f jobqueue | parallel -j5 &
for each ...
...
ls ${PROCESS_DIR}/*.pdf |
parallel echo ${PROJ_DIR}/myscript.sh -i {} -o ${OUTPUT_DIR}/{/.}.afp >> jobqueue
done
echo killall -TERM parallel >> jobqueue
wait
This will run java, then add jobs to a queue to be run. After adding the jobs, java will be run immediately. At all times, 5 jobs from the queue will be running until the queue is empty.
You can install GNU Parallel simply by:
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
Watch the intro videos to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1 and walk through the tutorial (man parallel_tutorial). Your command line will love you for it.
If you have ksh93 check if JOBMAX is available:
JOBMAX
This variable defines the maximum number of running background
jobs that can run at a time. When this limit is reached, the
shell will wait for a job to complete before starting a new job.
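A minimal sketch of how that might look for this script (assuming a ksh93 build that supports JOBMAX; paths are those from the question):
#!/bin/ksh93
JOBMAX=5   # at most 5 background jobs at a time
for SCALE_PDF in ${PROCESS_DIR}/*.pdf
do
OUTPUT_AFP=${OUTPUT_DIR}/$(basename ${SCALE_PDF} .pdf).afp
${PROJ_DIR}/myscript.sh -i ${SCALE_PDF} -o ${OUTPUT_AFP} &
done
wait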
