Why the "while" concurrent is running slower and slower in shell script? - linux

I want to zip lots of files using several CPUs at once by running shell jobs concurrently, like this:
#!/bin/bash
#set -x
function zip_data()
{
    while true
    do
        {
            echo "do zip something"
        } &
    done
}
zip_data
wait
At the beginning each loop iteration runs quickly, but as the number of iterations grows the loop gets slower and slower. Why?
I think the reason may be that too many child processes are running, so I tried to make the while loop run only one iteration at a time, like this:
#!/bin/bash
#set -x
function trap_exit
{
    exec 1000>&-; exec 1000<&-
    kill -9 0
}
trap 'trap_exit; exit 2' 1 2 3 15 9
mkfifo testfifo; exec 1000<>testfifo; rm -rf testfifo
function zip_data()
{
    echo >&1000
    while true
    read -u 1000
    do
        {
            echo "do something"
            echo >&1000
        } &
    done
}
zip_data
wait
However, the behaviour is the same as before, so I still don't understand why the loop keeps getting slower.
Today I tried the following, but it doesn't work either:
#!/bin/bash
#set -x
c=0
while true
do
    c=$(jobs -p | wc -l)
    while [ $c -ge 20 ]; do
        c=$(jobs -p | wc -l)
        sleep 0.01
    done
    {
        echo "$c"
        sleep 0.8
    } &
done
So I tried another way to get this done, shown below. Thank you!
#!/bin/bash
#set -x
function EXPECT_FUNC()
{
    para=$1
    while true
    do
        {
            do something $1
        }
    done
}
EXPECT_FUNC 1 &
EXPECT_FUNC 2 &
EXPECT_FUNC 3 &
EXPECT_FUNC 4 &
wait

Any single-threaded utility can be run as well-managed concurrent jobs with GNU parallel. man parallel offers dozens of examples, e.g.:
Create a directory for each zip-file and unzip it in that dir:
parallel 'mkdir {.}; cd {.}; unzip ../{}' ::: *.zip
Recompress all .gz files in current directory using bzip2 running 1 job
per CPU core in parallel:
parallel "zcat {} | bzip2 >{.}.bz2 && rm {}" ::: *.gz
A particularly interesting example, which only works with gzip, shows how to use several CPUs to work on a single archive simultaneously with a single-threaded archiver, which sounds impossible:
To process a big file or some output you can use --pipe to split up
the data into blocks and pipe the blocks into the processing program.
If the program is gzip -9 you can do:
cat bigfile | parallel --pipe --recend '' -k gzip -9 >bigfile.gz
This will split bigfile into blocks of 1 MB and pass that to
gzip -9 in parallel. One gzip will be run per CPU core. The output
of gzip -9 will be kept in order and saved to bigfile.gz
If parallel is too complex, here are some compression utils with built-in parallel archiving:
XZ: pixz
LZMA: plzip, pxz
GZIP: pigz
BZIP2: pbzip2
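Applied to the original goal of zipping many files, a minimal sketch could look like this (the *.log glob and the one-job-per-core setting are assumptions, not taken from the question):
parallel -j "$(nproc)" 'zip -q {}.zip {}' ::: *.log
Each input file gets its own archive, and parallel keeps at most one zip job per core running at a time.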

Related

bash: how to keep some delay between multiple instances of a script

I am trying to download 100 files using a script. I don't want more than 4 downloads happening at any point in time.
So I have created a folder /home/user/file_limit. The script creates a file there before each download and deletes it after the download is complete.
The script only allows a new file to be created in /home/user/file_limit if the number of files in that folder is less than 4.
I am running a script like this:
today=`date +%Y-%m-%d-%H_%M_%S_%N`;
while true
do
    sleep 1
    # The below command will find number of files in the folder /home/user/file_limit
    lines=$(find /home/user/file_limit -iname 'download_*' -type f| wc -l)
    if [ $lines -lt 5 ]; then
        echo "Create file"
        touch "/home/user/file_limit/download_${today}"
        break;
    else
        echo "Number of files equals 4"
    fi
done
#After this some downloading happens and once the downloading is complete
rm "/home/user/file_limit/download_${today}"
The problem I am facing is when 100 such scripts are running. E.g. when the number of files in the folder is less than 4, many touch "/home/user/file_limit/download_${today}" commands get executed simultaneously and all of them create files, so the total number of files becomes more than 4. I don't want that, because more downloads make my system slower.
How do I ensure there is a delay between each script's check of lines=$(find /home/user/file_limit -iname 'download_*' -type f| wc -l) so that only one touch command gets executed?
Or how do I ensure the lines=$(find /home/user/file_limit -iname 'download_*' -type f| wc -l) check is performed by each script in a queue, so that no two scripts check it at the same time?
How to ensure there is a delay between each script for checking the lines=$(find ... | wc -l) so that only one touch command gets executed
Adding a delay won't solve the problem. You need a lock, mutex, or semaphore to ensure that the check and creation of files is executed atomically.
Locks limit the number of parallel processes to 1. Locks can be created with flock (usually pre-installed).
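For example, a minimal flock-based sketch of the check-and-touch from the question (the lock-file path /home/user/file_limit.lock is an assumption; everything else mirrors the question):
today=$(date +%Y-%m-%d-%H_%M_%S_%N)
while true
do
    if (
        flock -x 9                     # exclusive lock on fd 9; released when the subshell exits
        lines=$(find /home/user/file_limit -iname 'download_*' -type f | wc -l)
        if [ "$lines" -lt 4 ]; then
            touch "/home/user/file_limit/download_${today}"
            exit 0                     # slot acquired
        fi
        exit 1                         # all slots busy, try again
    ) 9> /home/user/file_limit.lock
    then
        break
    fi
    sleep 1
done
# ... download ...
rm "/home/user/file_limit/download_${today}"
Because the count and the touch both happen while the lock is held, no two scripts can pass the check at the same time.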
Semaphores are generalized locks limiting the number of concurrent processes to any number N. Semaphores can be created with sem (part of GNU parallel, which has to be installed).
The following script allows 4 downloads in parallel. If 4 downloads are running and you start the script a 5th time, then that 5th download will pause until one of the 4 running downloads finishes.
#! /usr/bin/env bash
main() {
    # put your code for downloading here
    :    # placeholder so the function body is not empty
}
export -f main
sem --id downloadlimit -j4 main
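If the calling script also needs to block until every download started this way has finished, sem additionally supports a --wait flag (see man sem), e.g. sem --id downloadlimit --wait.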
My solution starts at most MAXPARALELLJOBS processes and waits until all of them are done.
Hope it helps.
MAXPARALELLJOBS=4
count=0
while <not done the job>
do
    ((count++))
    ( <download job> ) &
    [ ${count} -ge ${MAXPARALELLJOBS} ] && count=0 && wait
done
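On bash 4.3 or newer a similar cap can be kept without waiting for a whole batch by using wait -n, which returns as soon as any single background job exits. A minimal sketch under that assumption (the URL list and the echo/sleep body are placeholders for the real download):
#!/bin/bash
MAXPARALELLJOBS=4
running=0

for url in "$@"; do                          # URLs passed as arguments (assumption)
    ( echo "downloading $url"; sleep 1 ) &   # placeholder for the real download job
    (( running++ ))
    if (( running >= MAXPARALELLJOBS )); then
        wait -n                              # bash 4.3+: wait for any one job to exit
        (( running-- ))
    fi
done
wait                                         # wait for the remaining jobs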

How to add threading to the bash script?

#!/bin/bash
cat input.txt | while read ips
do
    cmd="$(snmpwalk -v2c -c abc#123 $ips sysUpTimeInstance)"
    echo "$ips ---> $cmd"
    echo "$ips $cmd" >> out_uptime.txt
done
How can I add threading to this bash script? I have around 80,000 inputs and it takes a lot of time.
Simple method. Assuming the order of the output is unimportant, and that snmpwalk's output is of no interest if it should fail, put a && at the end of each command you want to background, except the last command, which should end with a single &:
#!/bin/bash
while read ips
do
    cmd="$(nice snmpwalk -v2c -c abc#123 $ips sysUpTimeInstance)" &&
    echo "$ips ---> $cmd" &&
    echo "$ips $cmd" >> out_uptime.txt &
done < input.txt
Less simple. If snmpwalk can fail, and that output is also needed, lose the && and surround the code with curly braces, {}, followed by &. To make the appended output include standard error, use &>>:
#!/bin/bash
while read ips
do
    {
        cmd="$(nice snmpwalk -v2c -c abc#123 $ips sysUpTimeInstance)"
        echo "$ips ---> $cmd"
        echo "$ips $cmd" &>> out_uptime.txt
    } &
done < input.txt
The braces can contain more complex if ... then ... else ... fi statements, all of which would be backgrounded.
For those who don't have a complex snmpwalk command to test, here's a similar loop, which prints one through five but sleeps for random durations between echo commands:
for f in {1..5}; do
    RANDOM=$f &&
    sleep $((RANDOM/6000)) &&
    echo $f &
done 2> /dev/null | cat
The output will be the same every time (remove the RANDOM=$f && for varying output) and requires three seconds to run:
2
4
1
3
5
Compare that to code without the &&s and &:
for f in {1..5}; do
    RANDOM=$f
    sleep $((RANDOM/6000))
    echo $f
done 2> /dev/null | cat
This version requires seven seconds to run, with this output:
1
2
3
4
5
You can send tasks to the background with &. If you intend to wait for all of them to finish, you can use the wait command:
process_to_background &
echo Processing ...
wait
echo Done
You can get the pid of a task started in the background if you want to wait for one (or a few) specific tasks.
important_process_to_background &
important_pid=$!

for i in {1..10}; do
    less_important_process_to_background $i &
done

wait $important_pid
echo Important task finished

wait
echo All tasks finished
One note though: the background processes can mess up the output, as they run asynchronously. You might want to use a named pipe to collect the output from them.
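A minimal sketch of that idea, reusing the snmpwalk loop from the question above (treat the details, such as the mktemp-generated FIFO path, as assumptions rather than a drop-in solution):
#!/bin/bash
# Serialize the output of background jobs through a named pipe so lines do not interleave.
fifo=$(mktemp -u) && mkfifo "$fifo"

cat "$fifo" > out_uptime.txt &       # a single reader collects everything into one file
reader=$!
exec 3> "$fifo"                      # keep one writer open so the reader does not see EOF early

while read -r ips; do
    {
        cmd=$(snmpwalk -v2c -c abc#123 "$ips" sysUpTimeInstance)
        echo "$ips $cmd" >&3         # each job writes one line to the FIFO
    } &
done < input.txt

wait                                 # wait for all snmpwalk jobs
exec 3>&-                            # close our writer; the reader now reaches EOF
wait "$reader"                       # make sure everything has been flushed to the file
rm -f "$fifo"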

Bash sizeout script

I very much like the way bash handles things.
I am looking for a native solution that wraps a bash command, checks the size of a result file, and stops the command if the file becomes too big.
I am thinking about a command like
sizeout $fileName $maxSize otherBashCommand
It would be useful in a backup script like:
sizeout $fileName $maxSize timeout 600s ionice nice sudo rear mkbackup
To make it one step more complicated, I would call it over ssh:
ssh $remoteuser@$remoteServer sizeout $fileName $maxSize timeout 600s ionice nice sudo rear mkbackup
What kind of design pattern should I use for this?
Solution
I have modified Socowi's code a little
#! /bin/bash
# shell script to stop an encapsulated script when the
# checked file reaches the file size limit
# usage:
# sizeout.sh filename filesize[Bytes] encapsulated_command arguments
fileName=$1    # file we are checking
maxSize=$2     # max. file size (in bytes) at which to stop the pid
shift 2
echo "fileName: $fileName"
echo "maxSize: $maxSize"

function limitReached() {
    if [[ ! -f $fileName ]]; then
        return 1    # file doesn't exist, return false
    fi
    actSize=$(stat --format %s $fileName)
    if [[ $actSize -lt $maxSize ]]; then
        return 1    # file size under maxSize, return false
    fi
    return 0
}

# run the command as a background job
"$@" &
pid=$!

# monitor the file size while the job is running
while kill -0 $pid; do
    limitReached && kill $pid
    sleep 1
done 2> /dev/null

wait $pid    # return with the exit code of $pid
I added wait $pid at the end so that the script returns with the exit code of the background process instead of its own exit code.
Monitor the File Size Every n Time Units
I don't know whether there is a design pattern for your problem, but you could write the sizeout script as follows:
#! /bin/bash
filename="$1"
maxsize="$2"    # max. file size (in bytes)
shift 2

limitReached() {
    [[ -e "$filename" ]] &&
    (( "$(stat --printf="%s" "$filename")" >= maxsize ))
}

limitReached && exit 0

# run the command as a background job
"$@" &
pid="$!"

# monitor the file size while the job is running
while kill -0 "$pid"; do
    limitReached && kill "$pid"
    sleep 0.2
done 2> /dev/null
This script checks the file size every 200 ms and kills your command if the file size exceeds the maximum. Since we only check every 200 ms, the file may end up (yourWriteSpeed bytes/s × 0.2 s) larger than the specified maximum size; for example, at a write speed of 100 MB/s the overshoot can be up to 20 MB.
The following points can be improved:
Validate parameters.
Set a trap to kill the background job in every case, for instance when pressing Ctrl+C.
Monitor File Changes
The script from above is not very efficient, since we check the file size every 200ms, even if the file does not change at all. inotifywait allows you to wait until the file changes. See this answer for more information.
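A rough sketch of that event-driven variant, assuming inotify-tools is installed (the one-second timeout is an assumption so the loop still notices when the job exits even if the file never changes):
#! /bin/bash
filename="$1"
maxsize="$2"
shift 2

"$@" &
pid="$!"

while kill -0 "$pid" 2> /dev/null; do
    # block until something in the file's directory changes (or 1s passes)
    inotifywait -qq -t 1 -e modify -e create "$(dirname "$filename")"
    if [[ -e "$filename" ]] && (( "$(stat --printf="%s" "$filename")" >= maxsize )); then
        kill "$pid"
        break
    fi
done
wait "$pid"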
A Word on SSH
You just need to copy the sizeout script over to your remote server, then you can use it like on your local machine:
ssh $remoteuser@$remoteServer path/to/sizeout filename maxSize ... mkbackup

Executing several bash scripts simultaneously from one script?

I want to make a bash script that will execute around 30 or so other scripts simultaneously. These 30 scripts all have wget commands iterating through some lists.
I thought of doing something with screen (send Ctrl+Shift+A+D) or sending the scripts to the background, but really I don't know what to do.
To summarize: one master script execution should trigger all 30 other scripts to execute at the same time.
PS: I've seen the other questions, but I don't quite understand how they work, or they are a bit more than what I need (expecting a return value, etc.).
EDIT:
Small snippet of the script (this part is the one that executes with the config params I specified):
if [ $WP_RANGE_STOP -gt 0 ]; then
    #WP RANGE
    for (( count="$WP_RANGE_START"; count<"$WP_RANGE_STOP"+1; count=count+1 ));
    do
        if cat downloaded.txt | grep "$count" >/dev/null
        then
            echo "File already downloaded!"
        else
            echo $count >> downloaded.txt
            wget --keep-session-cookies --load-cookies=cookies.txt --referer=server.com http://server.com/wallpaper/$count
            cat $count | egrep -o "http://wallpapers.*(png|jpg|gif)" | wget --keep-session-cookies --load-cookies=cookies.txt --referer=http://server.com/wallpaper/$number -i -
            rm $count
        fi
    done
fi
Probably the most straightforward approach would be to use xargs -P or GNU parallel. Generate the different arguments for each child in the master script. For simplicity's sake, let's say you just want to download a bunch of different content at once. Either of
xargs -P 30 wget < urls_file
parallel -j 30 wget '{}' < urls_file
will spawn up to 30 simultaneous wget processes with different args from the given input. If you give more information about the scripts you want to run, I might be able to provide more specific examples.
Parallel has some more sophisticated tuning options compared to xargs, such as the ability to automatically split jobs across cores or cpus.
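For instance, -j also accepts a percentage of the available cores, so an illustrative variant of the command above (not from the original answer) is:
parallel -j 100% wget '{}' < urls_file
which starts one wget per CPU core instead of a fixed 30.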
If you're just trying to run a bunch of heterogeneous different bash scripts in parallel, define each individual script in its own file, then make each file executable and pass it to parallel:
$ cat list_of_scripts
/path/to/script1 arg1 arg2
/path/to/script2 -o=5 --beer arg3
…
/path/to/scriptN
then
parallel -j 30 < list_of_scripts

How to (trivially) parallelize with the Linux shell by starting one task per Linux core?

Today's CPUs typically comprise several physical cores. These might even be multi-threaded, so that the Linux kernel sees quite a large number of logical cores and accordingly runs an instance of the Linux scheduler per core. When running multiple tasks on a Linux system, the scheduler normally achieves a good distribution of the total workload across all logical cores (some of which may share a physical core).
Now, say, I have a large number of files to process with the same executable. I usually do this with the "find" command:
find <path> <option> <exec>
However, this starts just one task at a time and waits for its completion before starting the next task. Thus, only one core is in use at any time. This leaves the majority of the cores idle (if this find command is the only task running on the system). It would be much better to launch N tasks at the same time, where N is the number of cores seen by the Linux kernel.
Is there a command that would do that?
Use find with the -print0 option. Pipe it to xargs with the -0 option. xargs also accepts the -P option to specify a number of processes. -P should be used in combination with -n or -L.
Read man xargs for more information.
An example command:
find . -print0 | xargs -0 -P4 -n4 grep searchstring
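To start exactly one job per core without hard-coding the number, you can pass the output of nproc to -P (a small variation on the example above, with the same grep workload):
find . -print0 | xargs -0 -P "$(nproc)" -n4 grep searchstring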
If you have GNU Parallel http://www.gnu.org/software/parallel/ installed you can do this:
find | parallel do stuff {} --option_a\; do more stuff {}
You can install GNU Parallel simply by:
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
Watch the intro videos for GNU Parallel to learn more:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
GNU parallel or xargs -P is probably a better way to handle this, but you can also write a sort-of multi-tasking framework in bash. It's a little messy and unreliable, however, due to the lack of certain facilities.
#!/bin/bash
MAXJOBS=3
CJ=0
SJ=""

# extract the job number from a line like "[2]-  Done ..."
gj() {
    echo ${1//[][-]/}
}

# SIGCHLD handler: reap the finished job and kill the sleeper that startj is waiting on
endj() {
    trap "" sigchld
    ej=$(gj $(jobs | grep Done))
    jobs %$ej
    wait %$ej
    CJ=$(( $CJ - 1 ))
    if [ -n "$SJ" ]; then
        kill $SJ
        SJ=""
    fi
}

# start a job, blocking while MAXJOBS jobs are already running
startj() {
    j=$*
    while [ $CJ -ge $MAXJOBS ]; do
        sleep 1000 &
        SJ=$!
        echo too many jobs running: $CJ
        echo waiting for sleeper job [$SJ]
        trap endj sigchld
        wait $SJ 2>/dev/null
    done
    CJ=$(( $CJ + 1 ))
    echo $CJ jobs running. starting: $j
    eval "$j &"
}
set -m
# test
startj sleep 2
startj sleep 10
startj sleep 1
startj sleep 1
startj sleep 1
startj sleep 1
startj sleep 1
startj sleep 1
startj sleep 2
startj sleep 10
wait
