Read updates from continuously updated directory

Read updates from continuously updated directory - linux

I am writing a bash script that looks at each file in a directory and does some sort of action to it. It's supposed to look something like this (maybe?).
for file in "$dir"* ; do
something
done
Cool, right? The problem is, this directory is being updated frequently (with new files). There is no guarantee that, at some point, I will technically be done with all the files in the dir (therefore exiting the for-loop), but not actually done feeding the directory with extra files. There is no guarantee that I will never be done feeding the directory (well... take that with a grain of salt).
I do NOT want to process the same file more than once.
I was thinking of making a while loop that runs forever and keeps updating some file-list A, while making another file-list B that keeps track of all the files I already processed, and the first file in file-list A that is not in file-list B gets processed.
Is there a better method? Does this method even work? Thanks
Edit: Mandatory "I am bash newb"

#Barmar has a good suggestion. One way to handle this is using inotify to watch for new files. After installing the inotify-tools on your system, you can use the inotifywait command to feed new-file events into a loop.
You may start with something like:
inotifywait -m -e MOVED_TO,CLOSED_WRITE myfolder |
while read dir events file, do
echo "Processing file $file"
...do something with $dir/$file...
mv $dir/$file /some/place/for/processed/files
done
This inotifywait command will generate events for (a) files that are moved into the directory and (b) files that are closed after being opened for writing. This will generally get you what you want, but there are always corner cases that depend on your particular application.
The output of inotifywait looks something like:
tmp/work/ CLOSE_WRITE,CLOSE file1
tmp/work/ MOVED_TO file2

Related

Listing files while working with them - Shell Linux

I have a database server that it basic work is to import some specific files, do some calculations and provide data in a web interface.
It's planned for next weeks a hardware replacement, it needs to migrate the database. But there's one problem in it: the actual database is corrupted and show some errors in web interface. This is due to server freezing while importing/calculating, that's why the replacement.
So I'm not willing to just dump the db and restore in the new server. Doesn't make sense to still use the corrupted database and while dumping the old server goes really slow. I have a backup from all files to be imported (the current number is 551) and I'm working on a script to "re-import" all of them and have a nice database again.
The actual server takes ~20 minutes to import each new file. Let's say that new server takes 10 for each file due to its power... It's a long time! And here comes the problem: it receives new file hourly, so there will be more files when it finishes the job.
Restore script start like this:
for a in $(ls $BACKUP_DIR | grep part_of_filename); do
Question is: does this "ls" will have new file names when they come? File names are timestamp based, so they will be in the end of the list.
Or does this "ls" is execute once and results goes to a temp var?
Thanks.

ls will execute once, at the beginning, and any new files won't show up.
You can rewrite that statement to list the files again at the start of each loop (and, as Trey mentioned, better to use find, not ls):
while all=$(find $BACKUP_DIR/* -type f | grep part_of_filename); do
for a in $all; do
But this has a major problem: it will repeatedly process the same files over and over again.
The script needs to record which files are done. Then it can list the directory again and process any (and only) new files. Here's one way:
touch ~/done.list
cd $BACKUP_DIR
# loop while f=first file not in done list:
# find list the files; more portable and safer than ls in pipes and scripts
# fgrep -v -f ~/done.list pass through only files not in the done list
# head -n1 pass through only the first one
# grep . control the loop (true iff there is something)
while f=`find * -type f | fgrep -v -f ~/done.list | head -n1 | grep .`; do
<process file $f>
echo "$f" >> ~/done.list
done

inotifywait doesn't work as expected

I wanted to write a script that triggers some code when a file gets changed (meaning the content changes or the file gets overwritten by file with the same name) in a specific directory (or in a subdirectory). When running my code and changing a file it seems to run it twice everytime since I get the echo output twice. Is there something I am missing?
while true; do
change=$(inotifywait -e close_write /home/bla)
change=${change#/home/bla/ * }
echo "$change"
done
Also it doesn't do anything when I change something in a subdirectory of the specified directory.
The outpoot looks like this after i change a file in the specified directory:
Setting up watches.
Watches established.
filename
Setting up watches.
Watches established.
filename
Setting up watches.
Watches established.

I can't reproduce that the script outputs a message twice. Are you sure you don't run it twice (in the background)? Or are you using an editor to change the file? Some editors place a backup file beside the edited file while the file is open. This would explain that you see two messages.
For recursive directory watching you need to pass the option -r to inotifywait. However, you should not run that on a super larger filesystem tree since the number of inotify watches is limited. You can obtain the current limit on your system through
cat /proc/sys/fs/inotify/max_user_watches

Prevent files to be moved by another process in linux

I have problem with bash script.
I have two cron tasks, which gets some number of files from same folder for further processing.
ls -1h "targdir/*.json" | head -n ${LIMIT} > ${TMP_LIST_FILE}
while read REMOTE_FILE
do
mv $REMOTE_FILE $SCRDRL
done < "${TMP_LIST_FILE}"
rm -f "${TMP_LIST_FILE}"
But then two instances of script run simultaneously same file beeing moved to $SRCDRL which different for instances.
The question is how to prevent files to be moved by different script?
UPD:
Maybe I was little uncleare...
I have folder "targdir" where I store json files. And I have two cron tasks which gets some files from that directory to process. For example in targdir exists 25 files first cron task should get first 10 files and move them to /tmp/task1, second cron task should get next 10 files and move them to /tmp/task2 , e.t.c.
But now first 10 files moves to /tmp/task1 and /tmp/task2.

First and foremost: rename is atomic. It is not possible for a file to be moved twice. One of the moves will fail, because the file is no longer there. If the scripts run in parallel, both list the same 10 files and instead of first 10 files moved to /tmp/task1 and next 10 to /tmp/task2 you may get 4 moved to /tmp/task1 and 6 to /tmp/task2. Or maybe 5 and 5 or 9 and 1 or any other combination. But each file will only end up in one task.
So nothing is incorrect; each file is still processed only once. But it will be inefficient, because you could process 10 files at a time, but you are only processing 5. If you want to make sure you always process 10 if there is enough files available, you will have to do some synchronization. There are basically two options:
Place lock around the list+copy. This is most easily done using flock(1) and a lock file. There are two ways to call that too:
Call the whole copying operation via flock:
flock targdir -c copy-script
This requires that you make the part that should be excluded a separate script.
Lock via file descriptor. Before the copying, do
exec 3>targdir/.lock
flock 3
and after it do
flock -u 3
This lets you lock over part of the script only. This does not work in Cygwin (but you probably don't need that).
Move the files one by one until you have enough.
ls -1h targdir/*.json > ${TMP_LIST_FILE}
# ^^^ do NOT limit here
COUNT=0
while read REMOTE_FILE
do
if mv $REMOTE_FILE $SCRDRL 2>/dev/null; then
COUNT=$(($COUNT + 1))
fi
if [ "$COUNT" -ge "$LIMIT" ]; then
break
fi
done < "${TMP_LIST_FILE}"
rm -f "${TMP_LIST_FILE}"
The mv will sometimes fail, in which case you don't count the file and try to move the next one, assuming the mv failed because the file was meanwhile moved by the other script. Each script copies at most $LIMIT files, but it may be rather random selection.
On a side-note if you don't absolutely need to set environment variables in the while loop, you can do without a temporary file. Simply:
ls -1h targdir/*.json | while read REMOTE_FILE
do
...
done
You can't propagate variables out of such loop, because as part of a pipeline it runs in subshell.
If you do need to set environment variables and can live with using bash specifically (I usually try to stick to /bin/sh), you can also write
while read REMOTE_FILE
do
...
done <(ls -1h targdir/*.json)
In this case the loop runs in current shell, but this kind of redirection is bash extension.

The fact that two cron jobs move the same file to the same path should not matter for you unless you are disturbed by the error you get from one of them (one will succeed and the other will fail).
You can ignore the error by using:
...
mv $REMOTE_FILE $SCRDRL 2>/dev/null
...

Since your script is supposed to move a specific number of files from the list, two instances will at best move twice as many files. Unless they even interfere with each other, then the number of moved files might be less.
In any case, this is probably a bad situation to begin with. If you have any way of preventing two scripts running at the same time, you should do that.
If, however, you have no way of preventing two script instances from running at the same time, you should at least harden the scripts against errors:
mv "$REMOTE_FILE" "$SCRDRL" 2>/dev/null
Otherwise your scripts will produce error output (no good idea in a cron script).
Further, I hope that your ${TMP_LIST_FILE} is not the same in both instances (you could use $$ in it to avoid that); otherwise they'd even overwrite this temp file, in the worst case resulting in a corrupted file containing paths you do not want to move.

How can I loop through some files in my script?

I am very much a beginner at this and have searched for answers my question but have not found any that I understand how to implement. Any help would be greatly appreciated.
I have a script:
FILE$=`ls ~/Desktop/File_Converted/`
mkdir /tmp/$FILE
mv ~/Desktop/File_Converted/* /tmp/$FILE/
So I can use Applescript to say when a file is dropped into this desktop folder, create a temp directory, move the file there and the do other stuff. I then delete the temp directory. This is fine as far as it goes, but the problem is that if another file is dropped into File_Converted directory before I am doing doing stuff to the file I am currently working with it will change the value of the $FILE variable before the script has completed operating on the current file.
What I'd like to do is use a variable set up where the variable is, say, $FILE1. I check to see if $FILE1 is defined and, if not, use it. If it is defined, then try $FILE2, etc... In the end, when I am done, I want to reclaim the variable so $FILE1 get set back to null again and the next file dropped into the File_Converted folder can use it again.
Any help would be greatly appreciated. I'm new to this so I don't know where to begin.
Thanks!
Dan

Your question is a little difficult to parse, but I think you're not really understanding shell globs or looping constructs. The globs are expanded based on what's there now, not what might be there earlier or later.
DIR=$(mktemp -d)
mv ~/Desktop/File_Converted/* "$DIR"
cd "$DIR"
for file in *; do
: # whatever you want to do to "$file"
done

You don't need a LIFO -- multiple copies of the script run for different events won't have conflict over their variable names. What they will conflict on is shared temporary directories, and you should use mktemp -d to create a temporary directory with a new, unique, and guaranteed-nonconflicting name every time your script is run.
tempdir=$(mktemp -t -d mytemp.XXXXXX)
mv ~/Desktop/File_Converted/* "$tempdir"
cd "$tempdir"
for f in *; do
...whatever...
done

What you describe is a classic race condition, in which it is not clear that one operation will finish before a conflicting operation starts. These are not easy to handle, but you will learn so much about scripting and programming by handling them that it is well worth the effort to do so, even just for learning's sake.
I would recommend that you start by reviewing the lockfile or flock manpage. Try some experiments. It looks as though you probably have the right aptitude for this, for you are asking exactly the right questions.
By the way, I suspect that you want to kill the $ in
FILE$=`ls ~/Desktop/File_Converted/`
Incidentally, #CharlesDuffy correctly observes that "using ls in scripts is indicative of something being done wrong in and of itself. See mywiki.wooledge.org/ParsingLs and mywiki.wooledge.org/BashPitfalls." One suspects that the suggested lockfile exercise will clear up both points, though it will probably take you several hours to work through it.

Monitor Directory for Changes

Much like a similar SO question, I am trying to monitor a directory on a Linux box for the addition of new files and would like to immediately process these new files when they arrive. Any ideas on the best way to implement this?

Look at inotify.
With inotify you can watch a directory for file creation.

First make sure inotify-tools in installed.
Then use them like this:
logOfChanges="/tmp/changes.log.csv" # Set your file name here.
# Lock and load
inotifywait -mrcq $DIR > "$logOfChanges" &
IN_PID=$$
# Do your stuff here
...
# Kill and analyze
kill $IN_PID
while read entry; do
# Split your CSV, but beware that file names may contain spaces too.
# Just look up how to parse CSV with bash. :)
path=...
event=...
... # Other stuff like time stamps?
# Depending on the event…
case "$event" in
SOME_EVENT) myHandlingCode path ;;
...
*) myDefaultHandlingCode path ;;
done < "$logOfChanges"
Alternatively, using --format instead of -c on inotifywait would be an idea.
Just man inotifywait and man inotifywatch for more infos.
You can also use incron and use it to call a handling script.

One solution I thought of is to create a "file listener" coupled with a cron job. I'm not crazy about this but I think it could work.

fschange (Linux File System Change Notification) is a perfect solution, but it needs to patch your kernel

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string