bash script to watch a folder - linux

I have the following situation:
A Windows folder has been mounted on a Linux machine. The mount may contain multiple folders (set up beforehand),
and I have to do something (preferably a script, to start with) to watch these folders.
These are the steps:
Watch for any incoming file(s). Make sure they have been transferred completely.
Move each one to another folder.
I do not have any control over the file transfer program on the Windows machine. I believe it is a secure FTP,
so I cannot ask that process to send me a trailer file to signal that a transfer is complete.
I have written a bash script and would like to know about any potential pitfalls with this approach. The reason is that
multiple copies of this script may be running at once, one for each directory.
At the moment, there could be up to 100 directories that have to be monitored.
Following is the script. I'm sorry for pasting a very long one here. Please take your time to review it and
comment / criticize it. :-)
It takes 3 parameters, the folder that has to be watched, the folder where the file has to be moved,
and a time interval, which has been explained below.
I'm sorry there seems to be a problem with the alignment. Markdown doesn't seem to like it. I tried to organize it properly, but not able to do so.
Linux servername 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:27:17 EDT 2006 i686 i686 i386 GNU/Linux
#!/bin/bash

log_this()
{
    message="$1"
    now=$(date "+%D-%T")
    echo "$$: $now: $message"
}

usage()
{
    cat << EOF
Usage: $0 <Directory to be watched> <Directory to transfer> <time interval>
Time interval is the amount of time after which the modification time of a
file will be monitored.
EOF
    exit 1
}

if [ $# -lt 2 ]
then
    usage
fi

WATCH_DIR=$1
APP_DIR=$2

if [ ! -d "$WATCH_DIR" ]
then
    log_this "FATAL: WATCH_DIR, $WATCH_DIR does not exist. Exiting"
    exit 1
fi

if [ ! -d "$APP_DIR" ]
then
    log_this "APP_DIR: $APP_DIR does not exist. Exiting"
    exit 1
fi

# This needs to be set after considering the rate of file transfer.
# Represents the seconds elapsed after the last modification to the file.
# If not supplied as parameter, defaults to 3.
seconds_between_mods=$3
if ! [[ "$seconds_between_mods" =~ ^[0-9]+$ ]]; then
    if [ ${#seconds_between_mods} -eq 0 ]; then
        log_this "No value supplied for elapse time. Defaulting to 3."
        seconds_between_mods=3
    else
        log_this "Invalid value provided for elapse time"
        exit 1
    fi
fi

log_this "Start Monitor."
while true
do
    ls -1 "$WATCH_DIR" | while read -r file_name
    do
        log_this "Start Monitoring for $file_name"

        # Refer only the modification with reference to the mount folder.
        # If there is a diff in time between servers, we are in trouble.
        token_file="$WATCH_DIR/foo.$$"
        current_time=$(touch "$token_file" && stat -c "%Y" "$token_file")
        rm -f "$token_file" 2>/dev/null
        log_this "Current Time: $current_time"
        last_mod_time=$(stat -c "%Y" "$WATCH_DIR/$file_name")
        elapsed_time=$((current_time - last_mod_time))
        log_this "Elapsed time ==> $elapsed_time"

        if [ "$elapsed_time" -ge "$seconds_between_mods" ]
        then
            log_this "Moving $file_name to $APP_DIR"
            # In case if there is no space left on the target mount, hide the file
            # in the mount itself and remove the incomplete file from APP_DIR.
            if ! mv "$WATCH_DIR/$file_name" "$APP_DIR"
            then
                log_this "FATAL: mv failed!! Hiding $file_name"
                rm -f "$APP_DIR/$file_name"
                mv "$WATCH_DIR/$file_name" "$WATCH_DIR/.$file_name"
                log_this "Removed $APP_DIR/$file_name. Look for $WATCH_DIR/.$file_name and submit later."
            fi
            log_this "End Monitoring for $file_name"
        else
            log_this "$file_name: Transfer seems to be in progress"
        fi
    done
    log_this "Nothing more to monitor."
    echo
    sleep 5
done

This isn't going to work for any length of time. In production, you will have network problems and other errors which can leave a partial file in the upload directory. I also don't like the idea of a "trailer" file. The usual approach is to upload the file under a temporary name and then rename it after the upload completes.
This way, you just have to list the directory, filter out the temporary names, and use whatever is left.
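A minimal sketch of that listing step, assuming the uploader writes each file under a `.part` suffix and renames it when the upload completes. The suffix, the function name `move_finished` and the example paths are illustrative assumptions, not part of the original setup.

```shell
#!/bin/bash
# Sketch only: TMP_SUFFIX and move_finished are hypothetical names.
TMP_SUFFIX=".part"

# Move every regular file in $1 that does not carry the temporary
# suffix into $2; files that are still being uploaded are left alone.
move_finished() {
    local watch_dir="$1" app_dir="$2" f
    for f in "$watch_dir"/*; do
        [ -f "$f" ] || continue              # skip directories and an unmatched glob
        case "$f" in
            *"$TMP_SUFFIX") continue ;;      # upload still in progress
        esac
        mv -- "$f" "$app_dir"/
    done
}
```

It could be called from a loop or from cron, for example `move_finished /mnt/incoming /data/app`.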
If you can't make this change, then ask your boss for written permission to implement something which can lead to arbitrary data corruption. This serves two purposes: 1) to make them understand that this is a real problem and not something you made up, and 2) to protect yourself when it breaks ... because it will, and guess who'll get all the blame?

I believe a much saner approach would be to use a kernel-level filesystem notification mechanism such as inotify. The accompanying command-line tools are worth getting as well.

incron is an "inotify cron" system. It consists of a daemon and a table manipulator. You can use it in much the same way as regular cron. The difference is that the inotify cron handles filesystem events rather than time periods.

First make sure inotify-tools is installed.
Then use them like this:
logOfChanges="/tmp/changes.log.csv" # Set your file name here.
# Lock and load
inotifywait -mrcq "$DIR" > "$logOfChanges" & # monitor, recursively, output CSV, be quiet.
IN_PID=$! # PID of the background inotifywait
# Do your stuff here
...
# Kill and analyze
kill "$IN_PID"
while read -r entry; do
    # Split your CSV, but beware that file names may contain spaces too.
    # Just look up how to parse CSV with bash. :)
    path=...
    event=...
    ... # Other stuff like time stamps
    # Depending on the event…
    case "$event" in
        SOME_EVENT) myHandlingCode "$path" ;;
        ...
        *) myDefaultHandlingCode "$path" ;;
    esac
done < "$logOfChanges"
Alternatively, using --format instead of -c on inotifywait would be an idea.
See man inotifywait and man inotifywatch for more information.
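As a rough sketch of the --format variant, the loop below reacts to each event as it arrives instead of analysing a log afterwards. The function names `handle_event` and `watch_and_move` and the `|` separator are assumptions made for illustration. One caveat worth checking before relying on this: inotify only reports events generated through the local kernel, so files written by the Windows side of a network mount may never trigger an event.

```shell
#!/bin/bash
# Sketch only: handle_event and watch_and_move are hypothetical names.

# Handle one line printed by inotifywait with --format '%w%f|%e'.
# $1: the line, $2: destination directory.
handle_event() {
    local line="$1" dest="$2"
    local path="${line%|*}"      # everything before the last '|'
    local events="${line##*|}"   # comma-separated event names
    case ",$events," in
        *,CLOSE_WRITE,*|*,MOVED_TO,*)
            if [ -f "$path" ]; then
                mv -- "$path" "$dest"/
            fi
            ;;
    esac
}

# Watch directory $1 and move each file to $2 once its writer has closed it.
watch_and_move() {
    inotifywait -m -q -e close_write -e moved_to --format '%w%f|%e' "$1" |
    while IFS= read -r line; do
        handle_event "$line" "$2"
    done
}
```

A file name that itself contains `|` is still handled, because only the last `|` on the line is treated as the separator.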

To be honest, a Python app set up to run at start-up will do this quickly and efficiently. Python's OS support is excellent and fairly complete.
Running the script will likely work, but it will be troublesome to maintain and manage. I take it you will run these as frequent cron jobs?

To get you started, here is a small app I wrote which takes a path and looks at the binary output of JPEG files. I never quite finished it, but it shows the structure of a Python program as well as some use of the os module.
I wouldn't spend too much time worrying about my code.
import os
import shutil

# analyze() takes in a path and moves into the output_files folder, to then analyze files
def analyze(path):
    output_dir = os.path.join(path, "output_files")
    list_outputfiles = os.listdir(output_dir)
    print(list_outputfiles)
    for name in list_outputfiles:
        # print(name)
        with open(os.path.join(output_dir, name), 'rb') as f:
            f.readlines()

# txtmaker() copies the media file's binary contents to a .txt file in output_files.
# It relies on parser() having changed into the media directory first.
def txtmaker(c_file):
    print(c_file)
    shutil.copyfile(c_file, os.path.join("output_files", c_file + ".txt"))

# parser() takes in the inputted path, reads and lists all files, creates a directory, then calls txtmaker.
def parser(path):
    os.chdir(path)
    os.mkdir(os.path.join(path, "output_files"), 0o777)
    list_files = os.listdir(path)
    for name in list_files:
        if os.path.isdir(name):
            print(name, "is a directory")
        else:
            txtmaker(name)
    analyze(path)

def main():
    path = input("Enter the full path to the media: ")
    parser(path)

if __name__ == '__main__':
    main()

Related

Bash script deletes files older than N days using lftp - but does not remove recursive directories and files

I have finally got this script working: it logs on to my remote FTP and removes files in a folder that are older than N days. I cannot, however, get it to remove directories recursively as well. What can be changed or added to make this script remove files in subfolders, as well as subfolders that are also older than N days? I have tried adding the -r option in a few places but it did not work. I think it needs to be added where the script builds the list of files to be removed. Any help would be greatly appreciated. Thank you in advance!
#!/bin/bash
# Simple script to delete files older than specific number of days from FTP.
# This script uses 'lftp', and 'date' with the '-d' option, which is not POSIX compatible.

# FTP credentials and path
FTP_HOST="xxxxxxxxxxxx"
FTP_USER="xxxxxx"
FTP_PASS="xxxxxxxxxxxxxxxxx"
FTP_PATH="/directadmin"

# Full path to lftp executable
LFTP=$(which lftp)

# Take the number of days to keep from the first argument, or hardcode it; uncomment one to use
STORE_DAYS=${1:? "Usage ${0##*/} X, where X - count of daily archives to store"}
# STORE_DAYS=7

function removeOlderThanDays() {
    # Make some temp files to store intermediate data
    LIST=$(mktemp)
    DELLIST=$(mktemp)

    # Connect to ftp get file list and store it into temp file
    ${LFTP} << EOF
open ${FTP_USER}:${FTP_PASS}@${FTP_HOST}
cd ${FTP_PATH}
cache flush
cls -q -1 --date --time-style="+%Y%m%d" > ${LIST}
quit
EOF

    # Print obtained list, uncomment for debug
    # echo "File list"
    # cat ${LIST}
    # Delete list header, uncomment for debug
    # echo "Delete list"

    # Let's find date to compare
    STORE_DATE=$(date -d "now - ${STORE_DAYS} days" '+%Y%m%d')
    while read LINE; do
        if [[ ${STORE_DATE} -ge ${LINE:0:8} && "${LINE}" != *\/ ]]; then
            echo "rm -f \"${LINE:9}\"" >> ${DELLIST}
            # Print files which are subject to deletion, uncomment for debug
            #echo "${LINE:9}"
        fi
    done < ${LIST}
    # More debug strings
    # echo "Delete list complete"

    # Print notify if list is empty and exit.
    if [ ! -f ${DELLIST} ] || [ -z "$(cat ${DELLIST})" ]; then
        echo "Delete list doesn't exist or empty, nothing to delete. Exiting"
        exit 0;
    fi

    # Connect to ftp and delete files by previously formed list
    ${LFTP} << EOF
open ${FTP_USER}:${FTP_PASS}@${FTP_HOST}
cd ${FTP_PATH}
$(cat ${DELLIST})
quit
EOF
}

removeOlderThanDays

I have addressed this sort of thing a few times.
How to connect to a ftp server via bash script?
Provide commands automatically to ftp in bash script
Bash FTP upload - events to log
Better to use scp and/or ssh when you can, especially if you can set up passwordless access with public keys. Otherwise, I recommend a more robust language like Python or Perl that lets you check the return codes of these steps individually and respond accordingly.

Node.js delete first N bytes from a file

How can I delete (remove | trim) N bytes from the beginning of a binary file without loading it into memory?
We have fs.ftruncate(fd, len, callback), which cuts bytes off the end of the file (if it is bigger).
How can I cut bytes from the beginning of a file in Node.js without reading the file into memory?
I need something like truncateFromBeggining(fd, len, callback) or removeBytes(fd, 0, N, callback).
If it is not possible, what is the fastest way to do it with file streams?
On most filesystems you can't "cut" a part out from the beginning or from the middle of a file; you can only truncate it at the end.
With that in mind, I imagine we probably have to open the input file stream, seek past the Nth byte, and pipe the rest of the bytes to an output file stream.

You're asking for an OS file system operation: the ability to remove some bytes from the beginning of a file in place, without rewriting the file.
You're asking for a file system operation that does not exist, at least in Linux / FreeBSD / macOS / Windows.
If your program is the only user of the file and it fits in RAM, your best bet is to read the whole thing into RAM, then reopen the file for writing, then write out the part you want to keep.
Or you can create a new file. Let's say your input file is called q. Then you'd create a file called, maybe new_q with a stream attached. You'd pipe the contents you wanted to the new file. Then you'd unlink (delete) the input file q and rename the output file new_q to q.
Careful: this unlink / rename operation will create a short time when no file named q is available. So if some other program tries to open it and doesn't find it, it should try again a few times.
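The new-file approach can be sketched in shell, where `tail -c +K` starts its output at byte K, so dropping N bytes means K = N + 1. The helper name `drop_head_bytes` is hypothetical. On POSIX systems a rename within one filesystem replaces the target in a single step, so this sketch skips the explicit unlink and avoids the moment in which no file named q exists.

```shell
#!/bin/bash
# Sketch only: drop_head_bytes is a hypothetical helper.

# Remove the first $2 bytes of file $1 by streaming the remainder
# into a temporary file and renaming it over the original.
drop_head_bytes() {
    local file="$1" n="$2" tmp
    tmp=$(mktemp "$file.XXXXXX") || return 1   # same directory, hence same filesystem
    if tail -c "+$((n + 1))" "$file" > "$tmp"; then
        mv -f -- "$tmp" "$file"                 # the rename replaces the old file
    else
        rm -f -- "$tmp"
        return 1
    fi
}
```

The data is streamed rather than held in memory, but the replacement file gets the restrictive permissions that mktemp assigns, so mode and ownership would need restoring where they matter.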
If you're creating a queueing scheme, you might consider using some other scheme to hold your queue data. This file read / rewrite / unlink / rename sequence has lots of ways it can go wrong on you under heavy load. (Ask me how I know that when you have a couple of hours to spare ;-) Redis is worth a look.

I decided to solve the problem in bash.
The script truncates the files in a temp folder first, then moves them back to the original folder.
The truncate is done with tail:
tail --bytes="$max_size" "$from_file" > "$to_file"
The full script:
#!/bin/bash

declare -r store="/my/data/store"
declare -r temp="/my/data/temp"
declare -r max_size=$(( 200000 * 24 ))

or_exit() {
    local exit_status=$?
    local message=$*

    if [ $exit_status -gt 0 ]
    then
        echo "$(date '+%F %T') [$(basename "$0" .sh)] [ERROR] $message" >&2
        exit $exit_status
    fi
}

# Checks if there are any files in 'temp'. It should be empty.
! ls "$temp/"* &> '/dev/null'
or_exit 'Temp folder is not empty'

# Loops over all the files in 'store'
for file_path in "$store/"*
do
    # Trim files bigger than 'max_size' from 'store' to 'temp'
    if [ "$( stat --format=%s "$file_path" )" -gt "$max_size" ]
    then
        # Truncates the file to the temp folder
        tail --bytes="$max_size" "$file_path" > "$temp/$(basename "$file_path")"
        or_exit "Cannot tail: $file_path"
    fi
done
unset -v file_path

# If there are files in 'temp', move all of them back to 'store'
if ls "$temp/"* &> '/dev/null'
then
    # Moves all the truncated files back to the store
    mv "$temp/"* "$store/"
    or_exit 'Cannot move files from temp to store'
fi

One liner to append a file into another file but only if it hasn't already been added

I have an automated process that has a number of lines like the following pattern:
sudo cat /some/path/to/a/file >> /some/other/file
I'd like to transform that into a one liner that will only append to /some/other/file if /some/path/to/a/file has not already been added.
Edit
It's clear I need some examples here.
example 1: Updating a .bashrc script for a specific login
example 2: Creating a .screenrc for different logins
example 3: Appending to the end of a /etc/ config file
Some other caveats: the text is going to be added as a block (>>). Consequently, it should be relatively straightforward to see whether the entire block is already present near the end of a file. I am trying to come up with a simple method for determining whether the file has already been appended to the original.
Thanks!
Example python script...
def check_for_appended(new_file, original_file):
    """ Checks original_file to see if it has the contents of new_file """
    new_lines = reversed(new_file.split("\n"))
    original_lines = reversed(original_file.split("\n"))
    appended = None
    for new_line, orig_line in zip(new_lines, original_lines):
        if new_line != orig_line:
            appended = False
            break
        else:
            appended = True
    return appended

Maybe this will get you started - this GNU awk script:
gawk -v RS='^$' 'NR==FNR{f1=$0;next} {print (index($0,f1) ? "present" : "absent")}' file1 file2
will tell you if the contents of "file1" are present in "file2". It cannot tell you why, e.g. because you previously concatenated file1 onto the end of file2.
Is that all you need? If not, update your question to clarify/explain.

Here's a technique to see if a file contains another file:
contains_file_in_file() {
    local small=$1
    local big=$2
    # RS='^$' makes GNU awk read each file as a single record
    gawk -v RS='^$' 'NR==FNR {small=$0; next} {found=index($0, small)} END {exit !found}' "$small" "$big"
}
if ! contains_file_in_file /some/path/to/a/file /some/other/file; then
    sudo cat /some/path/to/a/file >> /some/other/file
fi
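If the block is always appended at the very end, as the question describes, a narrower check is to compare the tail of the target with the file to be added. This is only a sketch under that assumption; `ends_with_file` and `append_once` are hypothetical names, and the check no longer matches once something else has been appended afterwards.

```shell
#!/bin/bash
# Sketch only: ends_with_file and append_once are hypothetical names.

# Succeed if the last bytes of $2 are exactly the contents of $1.
ends_with_file() {
    local small="$1" big="$2" size
    [ -f "$big" ] || return 1
    size=$(( $(wc -c < "$small") ))
    tail -c "$size" "$big" | cmp -s - "$small"
}

# Append $1 to $2 unless $2 already ends with it.
append_once() {
    ends_with_file "$1" "$2" || cat "$1" >> "$2"
}
```

When the target needs root, the append itself can be done with `cat /some/path/to/a/file | sudo tee -a /some/other/file > /dev/null`, because a `>>` redirection is opened by the calling shell rather than by sudo.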

EDIT: OP just told me in the comments that the files he wants to concatenate are bash scripts -- this brings us back to the good ole C preprocessor include guard tactics:
prepend every file with
if [ -z "$__<filename>__" ]; then __<filename>__=1
(of course replacing <filename> with the name of the file, reduced to characters that are valid in a variable name) and at the end
fi
This way, you surround the script in each file with a test for something that's only true once.
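A small sketch of the guard in action, assuming a fragment whose guard variable is called `__my_fragment_loaded` (a made-up name; shell variable names may only contain letters, digits and underscores). The fragment is concatenated twice to imitate a block that was appended twice, and a counter shows how often its body actually runs.

```shell
#!/bin/bash
# Sketch only: __my_fragment_loaded and fragment_runs are hypothetical names.
frag=$(mktemp)
combined=$(mktemp)

# A guarded fragment: the body runs only while the guard variable is unset.
cat > "$frag" << 'EOF'
if [ -z "${__my_fragment_loaded:-}" ]; then
    __my_fragment_loaded=1
    fragment_runs=$(( ${fragment_runs:-0} + 1 ))
fi
EOF

# Simulate the block having been appended twice, then source the result.
cat "$frag" "$frag" > "$combined"
. "$combined"

echo "body ran $fragment_runs time(s)"
rm -f "$frag" "$combined"
```

The guard only suppresses the second execution; the duplicated text is still present in the file.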

Does this work for you?
sudo bash -c 'set -o noclobber; date > /tmp/testfile'
noclobber prevents overwriting an existing file.
I suspect it doesn't, since you wrote that you want to append something, but this technique might help.
When all the appending occurs in one script, use a flag:
if [ -z "${appended_the_file}" ]; then
    cat /some/path/to/a/file >> /some/other/file
    appended_the_file="Yes I have done it except for permission/right issues"
fi
I would go on to write a function appendOnce { .. } with the content above. If you really want an ugly one-liner (ugly: a pain for the eye and for colleagues):
test -z "${ugly}" && cat /some/path/to/a/file >> /some/other/file && ugly="dirt"
Combining this with sudo:
test -z "${ugly}" && sudo sh -c 'cat /some/path/to/a/file >> /some/other/file' && ugly="dirt"

It appears that what you want is a collection of script segments which can be run as a unit. Your approach -- making them into a single file -- is hard to maintain and subject to a variety of race conditions, making its implementation tricky.
A far simpler approach, similar to that used by most modern Linux distributions, is to create a directory of scripts, say ~/.bashrc.d and keep each chunk as an individual file in that directory.
The driver (which replaces the concatenation of all those files) just runs the scripts in the directory one at a time:
if [[ -d ~/.bashrc.d ]]; then
    for f in ~/.bashrc.d/*; do
        if [[ -f "$f" ]]; then
            source "$f"
        fi
    done
fi
To add a file from a skeleton directory, just make a new symlink.
add_fragment() {
    if [[ -f "$FRAGMENT_SKELETON/$1" ]]; then
        # The following will silently fail if the symlink already
        # exists. If you wanted to report that, you could add || echo...
        ln -s "$FRAGMENT_SKELETON/$1" "$HOME/.bashrc.d/$1" 2>/dev/null
    else
        echo "Not a valid fragment name: '$1'"
        return 1
    fi
}
Of course, it is possible to effectively index the files by contents rather than by name. But in most cases, indexing by name will work better, because it is robust against editing the script fragment. If you used content checks (md5sum, for example), you would run the risk of having an old and a new version of the same fragment, both active, and without an obvious way to remove the old one.
But it should be straight-forward to adapt the above structure to whatever requirements and constraints you might have.
For example, if symlinks are not possible (because the skeleton and the instance do not share a filesystem, for example), then you can copy the files instead. You might want to avoid the copy if the file is already present and has the same content, but that's just for efficiency and it might not be very important if the script fragments are small. Alternatively, you could use rsync to keep the skeleton and the instance(s) in sync with each other; that would be a very reliable and low-maintenance solution.
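The copy variant might look like the sketch below, which skips the copy when an identical file is already in place. The function name `install_fragment` and its three arguments are assumptions for illustration only.

```shell
#!/bin/bash
# Sketch only: install_fragment and its arguments are hypothetical.

# Copy fragment $1 from skeleton directory $2 into instance directory $3,
# skipping the copy when an identical file is already present.
install_fragment() {
    local name="$1" skel="$2" dest="$3"
    if [ ! -f "$skel/$name" ]; then
        echo "Not a valid fragment name: '$name'" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    if cmp -s "$skel/$name" "$dest/$name" 2>/dev/null; then
        return 0                      # already present with the same content
    fi
    cp -- "$skel/$name" "$dest/$name"
}
```

Because the check is by name first and content second, an edited skeleton fragment simply overwrites the older copy instead of ending up next to it.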

Rollover shell script

Assume a shell script (commands.sh) containing a few commands.
I need to write a script that sends the output of the commands executed by commands.sh to a file f1.csv.
If the file size exceeds 1 MB, the output should then go to file f2.csv.
If that file exceeds 1 MB as well, the output should go to file f3.csv.
If f3.csv exceeds 1 MB, the older f1 should be deleted and a new f1 created,
and the output should be written to f1 again. This process should go on indefinitely.
I can write the crontab file; it is just the shell script that is a bit tricky.
I have been experimenting:
#!/usr/bin/env bash
PREFIX="f"
# Maximum size after which you want a new file in bytes
MAX_SIZE=1048576
LAST_FILE=$(ls "$PREFIX"*.csv 2>/dev/null | tail -1)
# Check if file exists and if it does not, create it.
if [[ -z "$LAST_FILE" ]]
then
    LAST_FILE="${PREFIX}1.csv"
    touch "$LAST_FILE"
fi
LAST_FILE_NO=$(echo "$LAST_FILE" | sed -e "s/^$PREFIX//" -e 's/\.csv$//')
LAST_FILE_SIZE=$(stat -c %s "$LAST_FILE")
if [ "$LAST_FILE_SIZE" -lt "$MAX_SIZE" ]
then
    /bin/sh ./sam.sh >> "$LAST_FILE"
else
    UPCOMING_FILE_NO=$((LAST_FILE_NO+1))
    /bin/sh ./sam.sh >> "$PREFIX$UPCOMING_FILE_NO.csv"
fi
Help is appreciated, guys.
EDIT: I have got the secondary shell script to work too...
Now, if anyone could help me with resetting after 3 files are done and starting again from f1.
Thanks

It sounds like you'd be better off using logrotate, depending on how your script is running. If you are running 'commands.sh' on a cron, you can have logrotate rotate out the logs. There is a good guide on logrotate here:
http://linuxers.org/howto/howto-use-logrotate-manage-log-files
If your commands.sh isn't going to be on a cron, meaning it's not a regular time interval that triggers it, you could manually set up a log rotation at the beginning of your script. I once had to do something similar. I found this guide really useful:
http://wazem.blogspot.com/2013/11/simple-bash-log-rotate-function.html
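If logrotate is not an option, the three-file cycle from the question can be sketched by hand. This is an illustration under assumptions, not the method from either guide: `rotate_target`, `MAX_FILES` and the `.current` state file are made-up names, while `PREFIX` and `MAX_SIZE` mirror the question's variables.

```shell
#!/bin/bash
# Sketch only: rotate_target, MAX_FILES and the .current state file are
# hypothetical; PREFIX and MAX_SIZE mirror the question's variables.
PREFIX="f"
MAX_FILES=3
MAX_SIZE=1048576   # 1 MB in bytes

# Print the file in directory $1 that should receive the next output.
# Advances f1 -> f2 -> f3 -> f1 whenever the current file has reached MAX_SIZE.
rotate_target() {
    local dir="$1" state n file
    state="$dir/.current"
    n=1
    [ -f "$state" ] && n=$(cat "$state")
    file="$dir/$PREFIX$n.csv"
    if [ -f "$file" ] && [ "$(( $(wc -c < "$file") ))" -ge "$MAX_SIZE" ]; then
        n=$(( n % MAX_FILES + 1 ))
        file="$dir/$PREFIX$n.csv"
        : > "$file"                  # discard the old contents before reuse
    fi
    echo "$n" > "$state"
    echo "$file"
}
```

A cron entry could then run something like `sh ./commands.sh >> "$(rotate_target /path/to/logs)"`. The size is checked once per run, so a single run can still push a file somewhat past the limit.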

Multi-threaded BASH programming - generalized method?

Ok, I was running POV-Ray on all the demos, but POV's still single-threaded and wouldn't utilize more than one core. So, I started thinking about a solution in BASH.
I wrote a general function that takes a list of commands and runs them in the designated number of sub-shells. This actually works but I don't like the way it handles accessing the next command in a thread-safe multi-process way:
It takes, as an argument, a file with commands (1 per line).
To get the "next" command, each process ("thread") will:
Wait until it can create a lock file, with: ln $CMDFILE $LOCKFILE
Read the command from the file,
Modify $CMDFILE by removing the first line,
Remove the $LOCKFILE.
Is there a cleaner way to do this? I couldn't get the sub-shells to read a single line from a FIFO correctly.
Incidentally, the point of this is to enhance what I can do on a BASH command line, and not to find non-bash solutions. I tend to perform a lot of complicated tasks from the command line and want another tool in the toolbox.
Meanwhile, here's the function that handles getting the next line from the file. As you can see, it modifies an on-disk file each time it reads/removes a line. That seems hackish, but I'm not coming up with anything better, since FIFOs didn't work without setvbuf() in bash.
#
# Get/remove the first line from FILE, using LOCK as a semaphore (with
# short sleep for collisions). Returns the text on standard output,
# returns zero on success, non-zero when file is empty.
#
parallel__nextLine()
{
    local line rest file=$1 lock=$2

    # Wait for lock...
    until ln "${file}" "${lock}" 2>/dev/null
    do sleep 1
        [ -s "${file}" ] || return $?
    done

    # Open, read one "line" save "rest" back to the file:
    exec 3<"$file"
    read line <&3 ; rest=$(cat<&3)
    exec 3<&-

    # After last line, make sure file is empty:
    ( [ -z "$rest" ] || echo "$rest" ) > "${file}"

    # Remove lock and 'return' the line read:
    rm -f "${lock}"
    [ -n "$line" ] && echo "$line"
}

# adjust these as required
args_per_proc=1 # 1 is fine for long running tasks
procs_in_parallel=4
xargs -n$args_per_proc -P$procs_in_parallel povray < list
Note that the nproc command from coreutils reports the number of
available processing units, which can then be passed to -P.
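For a file that holds one complete command per line, as in the question, the same idea could be wrapped as in the sketch below. `run_parallel` is a hypothetical name, and `-d` and `-P` are GNU xargs options, so this assumes GNU findutils and coreutils.

```shell
#!/bin/bash
# Sketch only: run_parallel is a hypothetical wrapper; -d and -P are
# GNU xargs options, and nproc is part of GNU coreutils.

# Run every line of file $1 as a shell command, at most $2 at a time
# (default: one per available processing unit).
run_parallel() {
    local cmdfile="$1" jobs="${2:-$(nproc)}"
    xargs -d '\n' -n 1 -P "$jobs" sh -c < "$cmdfile"
}
```

Each line is handed to `sh -c` as a single argument, so redirections and pipes inside a line work. xargs itself hands out the next line as soon as a slot is free, which removes the need for a lock file.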

If you need real thread safety, I would recommend migrating to a better scripting system.
With Python, for example, you can create real threads with safe synchronization using semaphores/queues.

Sorry to bump this after so long, but I pieced together a fairly good solution for this, IMO.
It doesn't work perfectly, but it will limit the script to a certain number of child tasks running, and then wait for all the rest at the end.
#!/bin/bash
pids=()
thread() {
    local this
    while [ ${#} -gt 6 ]; do
        this=${1}
        wait "$this"
        shift
    done
    pids=($1 $2 $3 $4 $5 $6)
}
for i in 1 2 3 4 5 6 7 8 9 10
do
    sleep 5 &
    pids=( ${pids[@]-} $! )
    thread ${pids[@]}
done
for pid in ${pids[@]}
do
    wait "$pid"
done
It seems to work great for what I'm doing (handling parallel uploading of a bunch of files at once) and keeps it from breaking my server, while still making sure all the files get uploaded before the script finishes.

I believe you're actually forking processes here, and not threading. I would recommend looking for threading support in a different scripting language like perl, python, or ruby.
