List files which have a corresponding "ready" file - linux

I have a service "A" which generates compressed files comprising the data it receives in requests. In parallel, another service "B" consumes these compressed files.
The trick is that "B" shouldn't consume any of the files unless they are completely written. It deduces this by looking for a ".ready" file that service "A" creates once compression is done, named exactly like the generated file plus the ".ready" extension. Service "B" uses Apache Camel to do this filtering.
Now I am writing a shell script which needs the same compressed files, so the same filtering has to be implemented in shell. I need help writing this script. I am aware of the find command, but I am a naive shell user with very limited knowledge.
Example:
Compressed file: sumit_20171118_1.gz
Corresponding ready file: sumit_20171118_1.gz.ready
Another compressed file: sumit_20171118_2.gz
No ready file is present for this one.
Of the files listed above, only the first should be picked up, as it has a corresponding ready file.
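For the simple one-shot case, here is a minimal sketch (assuming everything sits in a single directory; the path is a placeholder) that tests for the marker before picking a file up:

for f in /path/to/dir/*.gz; do
    # pick the file up only if service "A" has already dropped the matching .ready marker
    [ -e "$f.ready" ] && echo "$f"
done

The answers below go a step further and avoid polling altogether by reacting to filesystem events.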

The most obvious way would be to use a busy loop. But if you are on GNU/Linux you can do better than that (from: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-GNU-Parallel-as-dir-processor)
inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f my_dir |
parallel -uj1 echo Do stuff to file {}
This way you do not even have to wait for the .ready file: The command will only be run when writing to the file is finished and the file is closed.
If, however, the .ready file is only written much later, then you can watch for that one instead:
inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f my_dir |
grep --line-buffered '\.ready$' |
parallel -uj1 echo Do stuff to file {.}
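Here {.} is GNU Parallel's replacement string with the last extension removed, so a line like sumit_20171118_1.gz.ready hands sumit_20171118_1.gz to the command.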

Related

Read updates from continuously updated directory

I am writing a bash script that looks at each file in a directory and does some sort of action to it. It's supposed to look something like this (maybe?).
for file in "$dir"* ; do
    something
done
Cool, right? The problem is, this directory is being updated frequently (with new files). There is a real chance that, at some point, I will technically be done with all the files currently in the dir (and therefore exit the for-loop) while not actually being done feeding the directory with extra files. There is also no guarantee that I will ever be done feeding the directory (well... take that with a grain of salt).
I do NOT want to process the same file more than once.
I was thinking of making a while loop that runs forever and keeps updating some file-list A, while making another file-list B that keeps track of all the files I already processed, and the first file in file-list A that is not in file-list B gets processed.
Is there a better method? Does this method even work? Thanks
Edit: Mandatory "I am bash newb"
@Barmar has a good suggestion: one way to handle this is to use inotify to watch for new files. After installing inotify-tools on your system, you can use the inotifywait command to feed new-file events into a loop.
You may start with something like:
inotifywait -m -e MOVED_TO,CLOSE_WRITE myfolder |
while read -r dir events file; do
    echo "Processing file $file"
    ...do something with $dir/$file...
    mv "$dir/$file" /some/place/for/processed/files
done
This inotifywait command will generate events for (a) files that are moved into the directory and (b) files that are closed after being opened for writing. This will generally get you what you want, but there are always corner cases that depend on your particular application.
The output of inotifywait looks something like:
tmp/work/ CLOSE_WRITE,CLOSE file1
tmp/work/ MOVED_TO file2
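If the file names can contain whitespace, the field-splitting read above becomes fragile. A variation (a sketch, not part of the original answer) is to let inotifywait print only the full path via --format and read it as a single field:

inotifywait -m -e MOVED_TO -e CLOSE_WRITE --format '%w%f' myfolder |
while IFS= read -r path; do
    echo "Processing file $path"
    # ...do something with "$path"...
done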

Grep on a file that is being written by another application

I am using grep to match a string in a file, but at the same time another application might be writing to the same file.
In that case, what will grep do?
Will it allow the file to be written by the other application, or will it block access to the file?
Also, if it does give access, will my grep results be based on the file's contents from before the write or after?
Basically, I want grep not to lock access to the file, but if it does, is there an alternative to prevent it from doing so?
My Sample command:
egrep -r -i "regex" /directory/*
grep does not lock the file, so it is safe to use while the file is being actively written by another application.
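One addition (not from the original answer): grep only sees whatever bytes are in the file at the moment it reads them; it will not report matches appended afterwards. If you want to keep matching as the file grows, the usual pattern is to follow it with tail:

tail -f /path/to/logfile | grep --line-buffered -i "regex"

Here /path/to/logfile is a placeholder; --line-buffered makes grep emit each match as soon as it is found instead of waiting for a full output buffer.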

Listing files while working with them - Shell Linux

I have a database server whose basic job is to import some specific files, do some calculations, and provide the data in a web interface.
A hardware replacement is planned for the coming weeks, and the database needs to be migrated. But there's one problem: the current database is corrupted and shows errors in the web interface. This is due to the server freezing while importing/calculating, which is why the replacement is happening.
So I'm not willing to just dump the db and restore it on the new server. It doesn't make sense to keep using the corrupted database, and while dumping, the old server gets really slow. I have a backup of all the files to be imported (currently 551) and I'm working on a script to "re-import" all of them and get a clean database again.
The current server takes ~20 minutes to import each new file. Let's say the new server takes 10 per file thanks to its extra power... It's still a long time! And here comes the problem: it receives a new file hourly, so there will be more files by the time it finishes the job.
Restore script start like this:
for a in $(ls $BACKUP_DIR | grep part_of_filename); do
The question is: will this "ls" pick up new file names as they arrive? File names are timestamp based, so they would appear at the end of the list.
Or is this "ls" executed only once, with the results going to a temp var?
Thanks.
ls will execute once, at the beginning, and any new files won't show up.
You can rewrite that statement to list the files again at the start of each loop (and, as Trey mentioned, better to use find, not ls):
while all=$(find $BACKUP_DIR/* -type f | grep part_of_filename); do
for a in $all; do
But this has a major problem: it will repeatedly process the same files over and over again.
The script needs to record which files are done. Then it can list the directory again and process any (and only) new files. Here's one way:
touch ~/done.list
cd "$BACKUP_DIR"
# loop while f = first file not in the done list:
#   find                      list the files; more portable and safer than ls in pipes and scripts
#   fgrep -v -f ~/done.list   pass through only files not in the done list
#   head -n1                  pass through only the first one
#   grep .                    control the loop (true iff there is something)
while f=$(find * -type f | fgrep -v -f ~/done.list | head -n1 | grep .); do
    <process file $f>
    echo "$f" >> ~/done.list
done

Redirecting the cat output of a file to the same file

In a particular directory, I made a file named "fileName" and added contents to it. When I typed cat fileName, its contents were printed on the terminal. Then I used the following command:
cat fileName>fileName
No error was shown. But now when I try to see the contents of the file using
cat fileName
nothing is shown in the terminal and the file is empty (I checked it). What is the reason for this?
The > redirection to the same file creates/truncates the file before the cat command is even invoked, because the shell sets up the redirection first. You could avoid this by writing to an intermediate file and then copying it back to the actual file, or you could use tee like:
cat fileName | tee fileName
To clarify SMA's answer: the file is truncated because redirection is handled by the shell, which opens the file for writing before invoking the command. When you run cat file > file, the shell truncates and opens the file for writing, sets stdout to the file, and then executes ["cat", "file"]. So you will have to use some other command for the task, like tee.
The answers given here are wrong. You will have a problem with truncation regardless of whether you use the redirect or the pipeline, although it may APPEAR to work sometimes, depending on the size of the file or the length of your pipeline. It is a race condition: the reader may get a chance to read some or all of the file before the writer starts, but the whole point of a pipeline is to run all of its commands at the same time, so they start together, and the first thing the tee executable does is open its output file (truncating it in the process). The only way you will not have a problem in this scenario is if the end of the pipeline loads the entire output into memory and only writes it to the file on shutdown. That is unlikely to happen and defeats the point of having a pipeline.
The proper way to make this reliable is to write to a temp file and then rename the temp file back to the original filename:
TMP="$(mktemp fileName.XXXXXXXX)"
cat fileName | grep something | tee "${TMP}"
mv "${TMP}" fileName

linux - watch a directory for new files, then run a script

I want to watch a directory in Ubuntu 14.04, and when a new file is created in this directory, run a script.
Specifically, I have security cameras that upload captured video via FTP when they detect motion. I want to run a script on this FTP server so that when new files are created, they get mirrored (uploaded) to a cloud storage service immediately, which is done via a script I've already written.
I found iWatch, which lets me do this (http://iwatch.sourceforge.net/index.html). The problem I am having is that iwatch kicks off the cloud upload script the instant the file is created in the FTP directory, even while the file is still being uploaded. This causes the cloud sync script to upload 0-byte files, which are useless to me.
I could maybe add a 'wait' to the cloud upload script, but that seems hacky, and it's impossible to predict how long to wait since that depends on file size, network conditions, etc.
What's a better way to do this?
Although inotifywait was mentioned in comments, a complete solution might be useful to others. This seems to be working:
inotifywait -m -e close_write /tmp/upload/ | gawk '{print $1$3; fflush()}' | xargs -L 1 yourCommandHere
will run
yourCommandHere /tmp/upload/filename
when a newly uploaded file is closed
Notes:
inotifywait is part of apt package inotify-tools in Ubuntu. It uses the kernel inotify service to monitor file or directory events
-m option is monitor mode, outputs one line per event to stdout
-e close_write for file close events for files that were open for writing. File close events hopefully avoid receiving incomplete files.
/tmp/upload can be replaced with some other directory to monitor
the pipe to gawk reformats the inotifywait output lines to drop the 2nd column (the event name, which is redundant here since only close_write is being watched). It combines the dirname in column 1 with the filename in column 3 to make a new line, and flushes after every line to defeat buffering and encourage immediate action by xargs
xargs takes a list of files and runs the given command for each file, appending the filename on the end of the command. -L 1 causes xargs to run after each line received on standard input.
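As an illustration of what yourCommandHere might look like, here is a hypothetical upload wrapper (the script name, the rclone call, and the remote name are all assumptions, not part of the original answer); xargs appends the full path as its single argument:

#!/bin/bash
# upload.sh - hypothetical wrapper invoked by xargs with one full path per call
file="$1"
# rclone and the "remote:camera-uploads" target are placeholders;
# substitute whatever cloud-upload tool you already use
rclone copy "$file" remote:camera-uploads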
You were close to the solution there. You can watch many different events with iwatch; the one that interests you is close_write. Syntax:
iwatch -e close_write <directory_name>
This of course works only if the file is closed once the writing is complete, which, while a sane assumption, is not necessarily a true one (yet often is).
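If memory serves (check iwatch's man page, since this is an assumption rather than something from the original answer), iwatch can also launch your existing upload script directly via -c, with %f expanding to the path of the file that triggered the event:

iwatch -e close_write -c '/path/to/upload-script.sh %f' /path/to/ftp/dir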
Here's another version of reacting to a filesystem event by making a POST request to a given URL.
#!/bin/bash
set -euo pipefail
cd "$(dirname "$0")"
watchRoot=$1
uri=$2
function post()
{
    while read -r path action file; do
        echo '{"Directory": "", "File": ""}' |
            jq ".Directory |= \"$path\"" |
            jq ".File |= \"$file\"" |
            curl --data-binary @- -H 'Content-Type: application/json' -X POST "$uri" || continue
    done
}
inotifywait -r -m -e close_write "$watchRoot" | post
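A possible invocation (the script name, directory, and endpoint are placeholders): save the script as watch-and-post.sh, make it executable, and run

./watch-and-post.sh /tmp/upload http://localhost:8080/notify

so every close_write event under /tmp/upload is posted as JSON to the given URL.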
