Shell script file watcher concurrency - linux

I have the following shell script (running on OEL 5.6) that is currently scheduled via cron to pick up files from given directories (specified in a database table) and to call a processing script on them on a directory & filemask basis. The script works fine at the moment, but with this implementation, if one folder has a large number of files to process, the script won't exit until that folder is done even if all the other folders have completed, which means files landing in the other folders won't be picked up until the next run. I'd like to use a similar approach but have it constantly checking the folders for new files instead of sequentially running through all folders once and then exiting, so that it runs more like a daemon constantly in the background. Any ideas other than wrapping this in a while true loop? I've filtered out a bit of code from this example to keep it short.
readonly HOME_DIR="$(cd "$(dirname "$0")" && pwd)"
export LOCK_DIR="/tmp/lock_folder"

check_lock() {
    # Try to create the $LOCK_DIR lock directory. Exit the script on failure.
    # Do some checks to make sure the script is actually running and hasn't just failed and left a stale lock dir behind.
}

main() {
    # Check to see if there's already an instance of the watcher running.
    check_lock
    # When the watcher script exits, remove the lock directory for the next run.
    trap 'rm -r "$LOCK_DIR"' EXIT
    # Pull folder and file details into a csv file from the database -> $FEEDS_FILE
    # Loop through all the files in the given folders.
    while IFS="," read -r feed_name feed_directory file_mask
    do
        # Count the number of files to process using the directory and file mask
        # (the mask is deliberately left unquoted so the shell expands it).
        num_files=$(find $feed_directory/$file_mask -mmin +5 -type f 2>/dev/null | wc -l)
        if [[ $num_files -lt 1 ]]; then
            # There are no files older than 5 mins to pick up here. Move on to the next folder.
            continue
        fi
        # Files found! Try to create a new feed_name lock dir. This should always pass on the first loop.
        if mkdir "$LOCK_DIR/$feed_name" 2>/dev/null; then
            # Call the processing script; it removes its child lock dir when done.
            "$HOME_DIR/another_script.sh" "$feed_name" "$feed_directory" "$file_mask" &
        else
            log.sh "Watcher still running" f
            continue
        fi
        # If the number of processes running (as indicated by child lock dirs present in $LOCK_DIR)
        # is greater than or equal to the max allowed, wait before trying another.
        while [ $(find "$LOCK_DIR" -maxdepth 1 -type d -not -path "$LOCK_DIR" | wc -l) -ge 5 ]; do
            sleep 10
        done
    done < "$FEEDS_FILE"
    # Now all folders have been processed, make sure this script doesn't exit
    # until all child scripts have completed (and removed their lock dirs).
    while [ $(find "$LOCK_DIR" -type d | wc -l) -gt 1 ]; do
        sleep 10
    done
    exit 0
}
main "$@"

One idea is to use inotifywait from inotify-tools to monitor the directories for changes; this is more efficient than repeatedly scanning them. Something like this:
inotifywait -m -r -e create,modify,move,delete /dir1 /dir2 |
while IFS= read -r event; do
    # parse $event, act accordingly
done
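For only picking up files that have finished being written, the close_write and moved_to events tend to be more useful than create/modify. A minimal sketch of parsing the event stream, assuming inotify-tools' --format option and watch paths that don't contain a literal '|'; the dispatch comment is just a placeholder for the existing per-feed processing:
inotifywait -m -e close_write,moved_to --format '%w|%f' /dir1 /dir2 |
while IFS='|' read -r dir file; do
    # $dir is the watched directory, $file the file that just finished being
    # written (close_write) or was moved into the directory (moved_to).
    echo "new file in $dir: $file"
    # dispatch to the existing per-feed processing from here
done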

Related

bash: how to keep some delay between multiple instances of a script

I am trying to download 100 files using a script.
I don't want more than 4 downloads happening at any point in time.
So I have created a folder /home/user/file_limit. The script creates a file there before a download starts and deletes it once the download is complete.
The script checks that the number of files in the folder is less than 4, and only then does it create a file in /home/user/file_limit.
I am running a script like this:
today=`date +%Y-%m-%d-%H_%M_%S_%N`
while true
do
    sleep 1
    # The below command will find the number of files in the folder /home/user/file_limit
    lines=$(find /home/user/file_limit -iname 'download_*' -type f | wc -l)
    if [ $lines -lt 5 ]; then
        echo "Create file"
        touch "/home/user/file_limit/download_${today}"
        break
    else
        echo "Number of files equals 4"
    fi
done
# After this some downloading happens and once the downloading is complete
rm "/home/user/file_limit/download_${today}"
The problem I am facing is when 100 such scripts are running. E.g. when the number of files in the folder is less than 4, many of the touch "/home/user/file_limit/download_${today}" commands get executed simultaneously and all of them create files, so the total number of files becomes more than 4, which I don't want because more downloads make my system slower.
How do I ensure there is a delay between each script checking lines=$(find /home/user/file_limit -iname 'download_*' -type f | wc -l) so that only one touch command gets executed?
Or how do I ensure the lines=$(find /home/user/file_limit -iname 'download_*' -type f | wc -l) command is checked by each script in a queue, so that no two scripts check it at the same time?
How to ensure there is a delay between each script for checking the lines=$(find ... | wc -l) so that only one touch command gets executed
Adding a delay won't solve the problem. You need a lock, mutex, or semaphore to ensure that the check and creation of files is executed atomically.
Locks limit the number of parallel processes to 1. Locks can be created with flock (usually pre-installed).
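For example, a minimal sketch of the flock approach applied to the marker-file scheme from the question (the lock file path is invented, and the limit of 4 matches the question; only the check-and-touch is serialized, the downloads themselves still run in parallel):
#!/usr/bin/env bash
today=$(date +%Y-%m-%d-%H_%M_%S_%N)
while true; do
    (
        flock -x 9    # block until the exclusive lock on fd 9 is ours
        lines=$(find /home/user/file_limit -iname 'download_*' -type f | wc -l)
        if [ "$lines" -lt 4 ]; then
            touch "/home/user/file_limit/download_${today}"
            exit 0    # slot acquired
        fi
        exit 1        # all slots taken; release the lock and retry
    ) 9>/home/user/file_limit.lock && break
    sleep 1
done
# ... do the download here ...
rm -f "/home/user/file_limit/download_${today}"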
Semaphores are generalized locks limiting the number of concurrent processes to any number N. Semaphores can be created with sem (part of GNU parallel, which has to be installed).
The following script allows 4 downloads in parallel. If 4 downloads are running and you start the script a 5th time, that 5th download will pause until one of the 4 running downloads finishes.
#! /usr/bin/env bash

main() {
    # put your code for downloading here
}
export -f main

sem --id downloadlimit -j4 main
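If the calling side later needs to block until every download queued under that semaphore has finished, GNU parallel's sem also has a wait mode, e.g.:
sem --wait --id downloadlimit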
My solution starts at most MAXPARALELLJOBS processes at a time and waits until all of those processes are done.
Hope it helps with your problem.
MAXPARALELLJOBS=4
count=0
while <not done the job>
do
    ((count++))
    ( <download job> ) &
    [ ${count} -ge ${MAXPARALELLJOBS} ] && count=0 && wait
done

Shell script to copy one file at a time in a cron job

I have some csv files in location A like this
abc1.csv,
abc2.csv,
abc3.csv
I have a cron job which runs every 30 mins, and in each execution I want to copy only 1 file (which shouldn't be repeated) and place it in location B.
I had thought of 2 ways of doing this:
1) I will pick the first file in the list of files, copy it to location B and delete it once copied. The problem with this is that I am not sure when the file will be completely copied, and if I delete it before the copy has finished it can be an issue.
2) I will have a temp folder. So I will copy the file from location A to location B and also keep it in the temp location. In the next iteration, when I pick a file from the list of files, I will check whether it exists in the temp location. If it exists I will move on to the next file. I think this will be more time consuming, etc.
Please suggest if there is any other better way.
You can use this bash script for your use case:
source="/path/to/.csv/directory"
dest="/path/to/destination/directory"

cd "$source"
for file in *.csv
do
    if [ ! -f "$dest/$file" ]
    then
        cp -v "$file" "$dest"
        break
    fi
done
You can ensure you move the already-copied file with:
cp abc1.csv destination/ && mv abc1.csv abc1.csv.done
(Here you can add your logic to pick up only *.csv files and ignore the *.done files that have already been processed by your script... or use any suffix you want.)
If the cp does not succeed, nothing after the && will get executed, so the file will not be renamed.
You can also replace the mv with rm to delete it:
cp abc1.csv destination/ && rm -f abc1.csv
Furthermore, you can add an error message to the above commands in case you want to be informed when the cp fails:
cp abc1.csv destination/ && mv abc1.csv abc1.csv.done || echo "copy of file abc1.csv failed"
and get informed via the CRON/email output.
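Putting those pieces together, a sketch of the per-run loop might look like this (the paths and the .done suffix are just the ones used above; it copies at most one file per cron run):
cd /path/to/.csv/directory
for file in *.csv; do
    [ -e "$file" ] || continue   # the glob matched nothing, so there is nothing left to copy
    cp -v -- "$file" destination/ && mv -- "$file" "$file.done" \
        || echo "copy of file $file failed"
    break   # only one file per run
done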
Finally, I took some ideas from both of the suggested solutions. Here is the final script:
source="/path/to/.csv/directory"
dest="/path/to/destination/directory"
cd $source
for file in *.csv
do
if [ ! -f $dest/"$file" ]
then
cp -v $file $dest || echo "copy of file $file failed"
rm -f $file
break
fi
done

Bash script to iterate contents of directory moving only the files not currently open by other process

I have people uploading files to a directory on my Ubuntu Server.
I need to move those files to the final location (another directory) only when I know these files are fully uploaded.
Here's my script so far:
#!/bin/bash
cd /var/uploaded_by_users
for filename in *; do
    lsof $filename
    if [ -z $? ]; then
        # file has been closed, move it
    else
        echo "*** File is open. Skipping..."
    fi
done
cd -
However, it's not working: it says some files are open when that's not true. I supposed $? would be 0 if the file was closed and 1 if it wasn't, but I think that's wrong.
I'm not a Linux expert, so I'm looking for how to implement this simple script, which will run from a cron job every minute.
[ -z $? ] checks whether $? is of zero length or not. Since $? will never be a null string, your check will always fail and result in the else part being executed.
You need to test for numeric zero, as below:
lsof "$filename" >/dev/null; lsof_status=$?
if [ "$lsof_status" -eq 0 ]; then
# file is open, skipping
else
# move it
fi
Or more simply (as Benjamin pointed out):
if lsof "$filename" >/dev/null; then
# file is open, skip
else
# move it
fi
Using negation, we can shorten the if statement (as dimo414 pointed out):
if ! lsof "$filename" >/dev/null; then
# move it
fi
You can shorten it even further, using &&:
for filename in *; do
    lsof "$filename" >/dev/null && continue  # skip if the file is open
    # move the file
done
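Putting it together, a cron-ready sketch might look like this (the final destination directory is a placeholder):
#!/bin/bash
cd /var/uploaded_by_users || exit 1
for filename in *; do
    [ -f "$filename" ] || continue           # skip non-regular files and an empty glob
    if ! lsof "$filename" >/dev/null; then
        mv -- "$filename" /path/to/final/location/   # placeholder destination
    fi
done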
You may not need to worry about when the write is complete, if you are moving the file to a different location in the same file system. As long as the client is using the same file descriptor to write to the file, you can simply create a new hard link for the upload file, then remove the original link. The client's file descriptor won't be affected by one of the links being removed.
cd /var/uploaded_by_users
for f in *; do
    ln "$f" /somewhere/else/"$f"
    rm "$f"
done

Shell script: Count files, delete 'X' oldest file

I am new to scripting. Currently I have a script that backs up a directory every day to a file server, and it deletes backup files that are more than 14 days old. My issue is that I need it to count the actual files and delete the oldest ones so that the 14 most recent are always kept. When going by days, if the file server or host is down for a few days or longer, then once it is back up the script will delete a couple of days' worth of backups, or even all of them, depending on the downtime. I want it to always have 14 days' worth of backups.
I tried searching around and could only find solutions that delete by date, like what I have now.
Thank you for the help/advice!
Here is my code; sorry, it's my first attempt at scripting:
#! /bin/sh
# Check for file. If not found, the connection to the file server is down!
if [ -f /backup/connection ]
then
    echo "File Server is connected!"
    # Directory to be backed up.
    backup_source="/var/www/html/moin-1.9.7"
    # Backup directory.
    backup_destination="/backup"
    # Current date to name files.
    date=`date '+%m%d%y'`
    # Naming the file.
    filename="$date.tgz"
    echo "Backing up directory"
    # Create the backup of the backup_source directory and place it in the backup_destination directory.
    tar -cvpzf $backup_destination/$filename $backup_source
    echo "Backup Finished!"
    # Search for backups older than '+X' days and delete them.
    find /backup -type f -ctime +13 -exec rm -rf {} \;
else
    echo "File Server is NOT connected! Date:`date '+%m-%d-%y'` Time:`date '+%H:%M:%S'`" > /user/Desktop/error/`date '+%m-%d-%y'`
fi
Something along these lines might work:
ls -1t /path/to/directory/ | head -n 14 | tail -n 1
In the ls command, -1 lists just the filenames (nothing else) and -t lists them in chronological order (newest first). Piping through head -n 14 takes just the first 14 lines of the ls output, and then tail -n 1 takes just the last line of that list. This should give you the file that is the 14th newest.
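To actually prune everything beyond the 14 newest backups, the same idea extends with tail, e.g. (a sketch assuming GNU xargs and backup names without spaces or newlines, as with the MMDDYY.tgz names above):
ls -1t /backup/*.tgz | tail -n +15 | xargs -r rm -f --
Here tail -n +15 prints everything from the 15th newest onwards, and xargs -r skips the rm entirely when there is nothing to delete.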
Here is another suggestion. The following script simply enumerates the backups. This eases the task of keeping track of the last n backups. If you need to know the actual creation date you can simply check the file metadata, e.g. using stat.
#!/bin/sh
set -e

backup_source='somedir'
backup_destination='backup'
retain=14
filename="backup-$retain.tgz"

check_fileserver() {
    nc -z -w 5 file.server.net 80 2>/dev/null || exit 1
}

backup_advance() {
    if [ -f "$backup_destination/$filename" ]; then
        echo "removing $filename"
        rm "$backup_destination/$filename"
    fi
    for i in $(seq $retain -1 2); do
        file_to="backup-$i.tgz"
        file_from="backup-$(($i - 1)).tgz"
        if [ -f "$backup_destination/$file_from" ]; then
            echo "moving $backup_destination/$file_from to $backup_destination/$file_to"
            mv "$backup_destination/$file_from" "$backup_destination/$file_to"
        fi
    done
}

do_backup() {
    tar czf "$backup_destination/backup-1.tgz" "$backup_source"
}

check_fileserver
backup_advance
do_backup

exit 0

Wait for all files with a certain extension to stop existing

I have a shell script that unzips a bunch of files, then processes the files and then zips them back up again. I want to wait with the processing until all the files are done unzipping.
I know how to do it for one file:
while [ -s /homes/ndeklein/mzml/JG-C2-1.mzML.gz ]
do
echo "test"
sleep 10
done
However, when I do
while [ -s /homes/ndeklein/mzml/*.gz ]
I get the following error:
./test.sh: line 2: [: too many arguments
I assume this is because there is more than one result. So how can I do this for multiple files?
You can execute a subcommand in the shell and check that there is output:
while [ -n "$(ls /homes/ndeklein/mzml/*.gz 2> /dev/null)" ]; do
# your code goes here
sleep 1; # generally a good idea to sleep at end of while loops in bash
done
If the directory could potentially have thousands of files, you may want to consider using find instead of ls with the wildcard, e.g. find /homes/ndeklein/mzml -maxdepth 1 -name '*.gz'.
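For example, a sketch of the same wait loop using find (assuming GNU find, whose -quit stops at the first match so the whole directory isn't scanned):
while [ -n "$(find /homes/ndeklein/mzml -maxdepth 1 -name '*.gz' -print -quit 2>/dev/null)" ]; do
    sleep 10
done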
xargs is your friend if you'd rather not use a while loop.
ls /homes/ndeklein/mzml/*.gz | xargs -I {} gunzip {}
