bash: how to keep some delay between multiple instances of a script - linux

I am trying to download 100 files using a script.
I don't want more than 4 downloads happening at any point in time.
So I have created a folder /home/user/file_limit. The script creates a file there before a download starts and deletes it once the download is complete.
The script only allows creating a file in /home/user/file_limit when the number of files in that folder is less than 4.
I am running a script like this:
today=`date +%Y-%m-%d-%H_%M_%S_%N`;
while true
do
    sleep 1
    # The command below counts the number of download_* files in /home/user/file_limit
    lines=$(find /home/user/file_limit -iname 'download_*' -type f | wc -l)
    if [ "$lines" -lt 4 ]; then
        echo "Create file"
        touch "/home/user/file_limit/download_${today}"
        break
    else
        echo "Number of files is already 4"
    fi
done
# After this some downloading happens, and once the download is complete:
rm "/home/user/file_limit/download_${today}"
The problem I am facing is when 100 such scripts are running. For example, when the number of files in the folder is less than 4, many of the touch "/home/user/file_limit/download_${today}" commands get executed simultaneously and all of them create files. So the total number of files becomes more than 4, which I don't want because more downloads make my system slower.
How do I ensure there is a delay between each script's check of lines=$(find /home/user/file_limit -iname 'download_*' -type f | wc -l) so that only one touch command gets executed?
Or how do I ensure the lines=$(find /home/user/file_limit -iname 'download_*' -type f | wc -l) command is checked by each script in a queue, so that no two scripts can check it at the same time?

How to ensure there is a delay between each script for checking lines=$(find ... | wc -l) so that only one touch command gets executed?
Adding a delay won't solve the problem. You need a lock, mutex, or semaphore to ensure that checking the count and creating the file happen atomically.
Locks limit the number of parallel processes to 1. Locks can be created with flock (usually pre-installed).
Semaphores are generalized locks that limit the number of concurrent processes to any number N. Semaphores can be created with sem (part of GNU parallel, which has to be installed).
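If you want to keep the file-per-download scheme from the question, flock can serialize just the check-and-create step. A minimal sketch, assuming the same /home/user/file_limit folder and an arbitrary extra lock file at /home/user/file_limit/.lock:
today=$(date +%Y-%m-%d-%H_%M_%S_%N)
while true
do
    if (
        flock -x 9   # only one script at a time gets past this line
        lines=$(find /home/user/file_limit -iname 'download_*' -type f | wc -l)
        if [ "$lines" -lt 4 ]; then
            touch "/home/user/file_limit/download_${today}"
            exit 0   # slot claimed, leave the subshell with success
        fi
        exit 1       # all 4 slots busy, retry after a sleep
    ) 9>/home/user/file_limit/.lock
    then
        break
    fi
    sleep 1
done
# ... download here ...
rm "/home/user/file_limit/download_${today}"
Because the count and the touch now happen while holding the lock, two scripts can no longer both see a free slot and both claim it.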
The following sem-based script allows 4 downloads in parallel. If 4 downloads are running and you start the script a 5th time, that 5th download will pause until one of the 4 running downloads finishes.
#! /usr/bin/env bash
main() {
    # put your code for downloading here
    :
}
export -f main
sem --id downloadlimit -j4 main
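As a small follow-up (based on the GNU parallel documentation rather than on the answer above): if a wrapper script needs to block until every download queued under this semaphore has finished, sem's --wait flag does that:
sem --id downloadlimit --wait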

My solution starts at most MAXPARALLELJOBS processes at a time and waits until all of them are done before starting the next batch.
Hope it helps with your problem.
MAXPARALLELJOBS=4
count=0
while <not done the job>
do
    ((count++))
    ( <download job> ) &
    [ ${count} -ge ${MAXPARALLELJOBS} ] && count=0 && wait
done
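To make the placeholders concrete, here is a hedged sketch of the same pattern, with a hypothetical urls.txt list and wget standing in for <not done the job> and <download job>:
MAXPARALLELJOBS=4
count=0
while read -r url                                       # "not done the job" = still URLs left to read
do
    ((count++))
    ( wget -q "$url" ) &                                # the download job, run in the background
    # every 4 jobs, pause until the whole batch has finished
    [ ${count} -ge ${MAXPARALLELJOBS} ] && count=0 && wait
done < urls.txt
wait                                                    # catch the final, possibly partial batch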

Related

Shell script file watcher concurrency

I have the following script (a shell script running on OEL 5.6) that is currently scheduled via cron to pick files up from given directories (specified in a database table) and to call a processing script on them on a directory and filemask basis. The script works fine at the moment, but with this implementation, if one folder has a large number of files to process, the script won't exit until that folder is done, even if all the other folders have completed, which means files landing in the other folders won't be picked up until the next run.
I'd like to use a similar approach but have it constantly check the folders for new files instead of sequentially running through all folders once and then exiting, so it would run more like a daemon constantly in the background. Any ideas other than wrapping this in a while true loop? I've filtered out a bit of code from this example to keep it short.
readonly HOME_DIR="$(cd $(dirname $0)/;echo $PWD)"
export LOCK_DIR="/tmp/lock_folder"

check_lock() {
    # Try and create the $LOCK_DIR lock directory. Exit the script on failure.
    # Do some checks to make sure the script is actually running and hasn't just failed and left a lock dir behind.
    :  # (actual locking code filtered out of this example)
}

main(){
    # Check to see if there's already an instance of the watcher running.
    check_lock

    # When the watcher script exits, remove the lock directory for the next run.
    trap 'rm -r $LOCK_DIR;' EXIT

    # Pull folder and file details into a csv file from the database -> $FEEDS_FILE

    # Loop through all the files in the given folders.
    while IFS="," read feed_name feed_directory file_mask
    do
        # Count the number of files to process using the directory and file mask.
        num_files=$(find $feed_directory/$file_mask -mmin +5 -type f 2> /dev/null | wc -l 2> /dev/null)
        if [[ $num_files -lt 1 ]]; then
            # There are no files older than 5 mins to pick up here. Move on to the next folder.
            continue
        fi

        # Files found! Try and create a new feed_name lock dir. This should always pass on the first loop.
        if mkdir $LOCK_DIR/$feed_name 2>/dev/null; then
            $HOME_DIR/another_script.sh "$feed_name" "$feed_directory" "$file_mask" & # Call some script to do processing. That script removes its child lock dir when done.
        else
            log.sh "Watcher still running" f
            continue
        fi

        # If the number of processes running, as indicated by the child lock dirs present in $LOCK_DIR, is greater than or equal to the max allowed, then wait before trying another.
        while [ $(find $LOCK_DIR -maxdepth 1 -type d -not -path $LOCK_DIR | wc -l) -ge 5 ]; do
            sleep 10
        done
    done < $FEEDS_FILE

    # Now all folders have been processed, make sure this script doesn't exit until all child scripts have completed (and removed their lock dirs).
    while [ $(find $LOCK_DIR -type d | wc -l) -gt 1 ]; do
        sleep 10
    done
    exit 0
}
main "$@"
One idea is to use inotifywait from inotify-tools to monitor the directories for changes. This is more efficient than repeatedly scanning the directories. Something like this:
inotifywait -m -r -e create,modify,move,delete /dir1 /dir2 |
while IFS= read -r event; do
    # parse $event, act accordingly
done
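A slightly fuller sketch of the parsing step, assuming inotifywait's default output format of "watched-dir events filename" (the handler logic is only illustrative):
inotifywait -m -r -e create,modify,move,delete /dir1 /dir2 |
while read -r dir events file; do
    case "$events" in
        CREATE*|MOVED_TO*)
            echo "new file: ${dir}${file}"
            # e.g. kick off the per-feed processing script here
            ;;
    esac
done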

Why the "while" concurrent is running slower and slower in shell script?

I want to zip lots of files using multiple CPUs at once, using the shell's background jobs like this:
#!/bin/bash
#set -x
function zip_data()
{
    while true
    do
        {
            echo "do zip something"
        }&
    done
}
zip_data
wait
At the beginning, each loop iteration runs quickly, but the more iterations have run, the slower the loop gets. Why???
I think the reason may be that too many child processes are running, so I tried to make the loop run only one job at a time, like this:
#!/bin/bash
#set -x
function trap_exit
{
    exec 1000>&-; exec 1000<&-
    kill -9 0
}
trap 'trap_exit; exit 2' 1 2 3 15 9
mkfifo testfifo ; exec 1000<>testfifo ; rm -rf testfifo

function zip_data()
{
    echo >&1000
    while true
    read -u 1000
    do
        {
            echo "do something"
            echo >&1000
        }&
    done
}
zip_data
wait
However, the behaviour is the same as before, so I don't understand why it still gets slower and slower while running.
Today I tried the following, but it doesn't work either:
#!/bin/bash
#set -x
c=0
while true
do
    c=$(jobs -p | wc -l)
    while [ $c -ge 20 ]; do
        c=$(jobs -p | wc -l)
        sleep 0.01
    done
    {
        echo "$c"
        sleep 0.8
    }&
done
So I tried another way to implement this, as follows. Thank you!
#!/bin/bash
#set -x
function EXPECT_FUNC()
{
    para=$1
    while true
    do
        {
            do something $1
        }
    done
}
EXPECT_FUNC 1 &
EXPECT_FUNC 2 &
EXPECT_FUNC 3 &
EXPECT_FUNC 4 &
wait
Any single-threaded util can be run as well-managed concurrent jobs with parallel. man parallel offers dozens of examples, e.g.:
Create a directory for each zip-file and unzip it in that dir:
parallel 'mkdir {.}; cd {.}; unzip ../{}' ::: *.zip
Recompress all .gz files in current directory using bzip2 running 1 job
per CPU core in parallel:
parallel "zcat {} | bzip2 >{.}.bz2 && rm {}" ::: *.gz
A particularly interesting example, which only works with gzip, shows how to use several CPUs to work on a single archive simultaneously with a single-threaded archiver, which sounds impossible:
To process a big file or some output you can use --pipe to split up
the data into blocks and pipe the blocks into the processing program.
If the program is gzip -9 you can do:
cat bigfile | parallel --pipe --recend '' -k gzip -9 >bigfile.gz
This will split bigfile into blocks of 1 MB and pass that to
gzip -9 in parallel. One gzip will be run per CPU core. The output
of gzip -9 will be kept in order and saved to bigfile.gz
If parallel is too complex, here are some compression utils with built-in parallel archiving:
XZ: pixz
LZMA: plzip, pxz
GZIP: pigz
BZIP2: pbzip2
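For example, pigz works as a near drop-in replacement for gzip and spreads compression across threads; the thread count below is only an illustration:
pigz -p 8 bigfile      # compresses to bigfile.gz using 8 threads
pigz -d bigfile.gz     # decompress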

Executing several bash scripts simultaneously from one script?

I want to make a bash script that will execute around 30 or so other scripts simultaneously; these 30 scripts all have wget commands iterating through some lists.
I thought of doing something with screen (send ctrl + shift + a + d) or sending the scripts to the background, but I really don't know what to do.
To summarize: one master script execution will trigger all the other 30 scripts to execute at the same time.
PS: I've seen the other questions, but I don't quite understand how they work, or they are a bit more than what I need (expecting a return value, etc.)
EDIT:
Small snippet of the script (this is the part that executes with the config params I specified):
if [ $WP_RANGE_STOP -gt 0 ]; then
    # WP RANGE
    for (( count="$WP_RANGE_START"; count<"$WP_RANGE_STOP"+1; count=count+1 ));
    do
        if cat downloaded.txt | grep "$count" >/dev/null
        then
            echo "File already downloaded!"
        else
            echo $count >> downloaded.txt
            wget --keep-session-cookies --load-cookies=cookies.txt --referer=server.com http://server.com/wallpaper/$count
            cat $count | egrep -o "http://wallpapers.*(png|jpg|gif)" | wget --keep-session-cookies --load-cookies=cookies.txt --referer=http://server.com/wallpaper/$number -i -
            rm $count
        fi
    done
fi
Probably the most straightforward approach would be to use xargs -P or GNU parallel. Generate the different arguments for each child in the master script. For simplicity's sake, let's say you just want to download a bunch of different content at once. Either of
xargs -n 1 -P 30 wget < urls_file
parallel -j 30 wget '{}' < urls_file
will spawn up to 30 simultaneous wget processes with different args from the given input. If you give more information about the scripts you want to run, I might be able to provide more specific examples.
Parallel has some more sophisticated tuning options compared to xargs, such as the ability to automatically split jobs across cores or CPUs.
If you're just trying to run a bunch of heterogeneous different bash scripts in parallel, define each individual script in its own file, then make each file executable and pass it to parallel:
$ cat list_of_scripts
/path/to/script1 arg1 arg2
/path/to/script2 -o=5 --beer arg3
…
/path/to/scriptN
then
parallel -j 30 < list_of_scripts
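If you do not need any concurrency cap and really just want all 30 scripts fired off at once, plain backgrounding with & plus a final wait is enough. A minimal sketch, assuming the scripts live in a hypothetical /path/to/downloaders directory:
#!/bin/bash
# launch every downloader script in the background
for script in /path/to/downloaders/*.sh; do
    "$script" &
done
# block until every backgrounded script has exited
wait
echo "all downloads finished"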

Wait for all files with a certain extension to stop existing

I have a shell script that unzips a bunch of files, then processes the files and then zips them back up again. I want to wait with the processing until all the files are done unzipping.
I know how to do it for one file:
while [ -s /homes/ndeklein/mzml/JG-C2-1.mzML.gz ]
do
    echo "test"
    sleep 10
done
However, when I do
while [ -s /homes/ndeklein/mzml/*.gz ]
I get the following error:
./test.sh: line 2: [: too many arguments
I assume that's because there is more than one result. So how can I do this for multiple files?
You can execute a subcommand in the shell and check that there is output:
while [ -n "$(ls /homes/ndeklein/mzml/*.gz 2> /dev/null)" ]; do
    # your code goes here
    sleep 1; # generally a good idea to sleep at end of while loops in bash
done
If the directory could potentially have thousands of files, you may want to consider using find instead of ls with the wildcard, i.e. find -maxdepth 1 -name "*.gz"
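A hedged sketch of that find-based variant, using GNU find's -print -quit so the scan stops at the first match instead of listing thousands of files:
while [ -n "$(find /homes/ndeklein/mzml -maxdepth 1 -name '*.gz' -print -quit 2>/dev/null)" ]; do
    sleep 10
done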
xargs is your friend if you are not forced to use a while loop.
ls /homes/ndeklein/mzml/*.gz | xargs -I {} gunzip {}

How to (trivially) parallelize with the Linux shell by starting one task per Linux core?

Today's CPUs typically comprise several physical cores. These might even be multi-threaded, so that the Linux kernel sees quite a large number of cores and accordingly runs its scheduler on each of them. When running multiple tasks on a Linux system, the scheduler normally achieves a good distribution of the total workload across all cores (some of which might share the same physical core).
Now, say I have a large number of files to process with the same executable. I usually do this with the "find" command:
find <path> <option> <exec>
However, this starts just one task at a time and waits for its completion before starting the next one. Thus, only one core is in use at any time, which leaves the majority of the cores idle (if this find command is the only task running on the system). It would be much better to launch N tasks at the same time, where N is the number of cores seen by the Linux kernel.
Is there a command that would do that?
Use find with the -print0 option. Pipe it to xargs with the -0 option. xargs also accepts the -P option to specify a number of processes. -P should be used in combination with -n or -L.
Read man xargs for more information.
An example command:
find . -print0 | xargs -0 -P4 -n4 grep searchstring
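To tie the job count to the number of cores the kernel reports, the -P value can come from nproc (a sketch; grep is just a stand-in for your real executable):
find . -type f -print0 | xargs -0 -P "$(nproc)" -n4 grep searchstring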
If you have GNU Parallel http://www.gnu.org/software/parallel/ installed you can do this:
find | parallel do stuff {} --option_a\; do more stuff {}
You can install GNU Parallel simply by:
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
Watch the intro videos for GNU Parallel to learn more:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
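Since GNU Parallel defaults to running one job per CPU core, a minimal sketch for the find use case looks like this (gzip is only a stand-in for the real executable, and <path> is the same placeholder as in the question):
find <path> -type f -print0 | parallel -0 gzip {}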
GNU parallel or xargs -P is probably a better way to handle this, but you can also write a sort-of multi-tasking framework in bash. It's a little messy and unreliable, however, due to the lack of certain facilities.
#!/bin/bash
MAXJOBS=3
CJ=0
SJ=""

gj() {
    echo ${1//[][-]/}
}

endj() {
    trap "" sigchld
    ej=$(gj $(jobs | grep Done))
    jobs %$ej
    wait %$ej
    CJ=$(( $CJ - 1 ))
    if [ -n "$SJ" ]; then
        kill $SJ
        SJ=""
    fi
}

startj() {
    j=$*
    while [ $CJ -ge $MAXJOBS ]; do
        sleep 1000 &
        SJ=$!
        echo too many jobs running: $CJ
        echo waiting for sleeper job [$SJ]
        trap endj sigchld
        wait $SJ 2>/dev/null
    done
    CJ=$(( $CJ + 1 ))
    echo $CJ jobs running. starting: $j
    eval "$j &"
}
set -m
# test
startj sleep 2
startj sleep 10
startj sleep 1
startj sleep 1
startj sleep 1
startj sleep 1
startj sleep 1
startj sleep 1
startj sleep 2
startj sleep 10
wait
