lftp mirror possible race condition during file transfer?

I was trying to keep a local folder [lftp_source_folder] and a remote folder [lftp_server_path] in sync using lftp's mirror command.
The script for running it continuously is given below.
while true
do
lftp -f $BATCHFILE
sleep 20
done
$BATCHFILE mainly consists of the following:
# sftp config, plus:
echo "mirror --Remove-source-files -R /lftp_source_folder /lftp_server_path/" >> $BATCHFILE
The problem is that I have another script which keeps moving files into /lftp_source_folder.
Now I am wondering whether there is a chance of a race condition with this implementation.
For example:
1. My script is moving a file to /lftp_source_folder; say 50% of the file has been moved and the mv command is interrupted by the OS, leaving /lftp_source_folder/new_file.txt at only 50% of its final size.
2. lftp is invoked by the while loop.
3. The mv command continues.
In step 2, will lftp upload the file which is only 50% complete to the server and then delete it? Data would be lost in this case.
If it's a race condition, what's the solution?

If you're moving files within the same filesystem, there's no race condition: mv simply performs a rename() operation, which is atomic; no file data is copied.
But if you're moving between different filesystems, you can indeed get a race condition. In that case mv does a copy followed by deleting the original, and your script might upload the file to the server when only part of it has been copied.
The solution is to move the file first to a temporary folder on the same filesystem as /lftp_source_folder, then move it from there into /lftp_source_folder. The second move is an atomic rename, so when the mirror script sees the file, it's guaranteed to be complete.
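A minimal sketch of that staging approach, assuming a hypothetical /lftp_staging directory on the same filesystem as /lftp_source_folder:
# slow, interruptible cross-filesystem copy into the staging area
mv /other/fs/new_file.txt /lftp_staging/
# same-filesystem move: a single atomic rename(), so the mirror job
# can never see a half-written new_file.txt
mv /lftp_staging/new_file.txt /lftp_source_folder/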

Related

Wait Until Previous Command Completes

I have written a bash script on my Mac mini to execute any time a file has completed downloading. After the download completes, the Mac mounts my NAS, renames the file, copies the file from the Mac to the NAS, deletes the file from the Mac, and then unmounts the NAS.
My issue is that sometimes the NAS takes a few seconds to mount. When that happens, I receive an error that the file could not be copied because the directory doesn't exist.
When the NAS mounts instantly (if the file size is small), the file copies, the file is deleted, and the NAS unmounts.
When the file size is large, the copy is still in progress when the file is deleted, so the copying process stops.
What I'm looking for is: how do I make the script wait until the NAS is mounted, and then how do I make it wait again until the file copy is complete?
Thank you for any input. Below is my code.
#connect to NAS
open 'smb://username:password@ip_address_of_NAS/folder_for_files'
#I WANT A WAIT COMMAND HERE SO THAT SCRIPT DOES NOT CONTINUE UNTIL NAS IS MOUNTED
#move to folder where files are downloaded
cd /Users/account/Downloads
#rename files and copy to server
for f in *; do
newName="${f/)*/)}";
mv "$f"/*.zip "$newName".zip;
cp "$newName".zip /Volumes/folder_for_files/"$newName".zip;
#I NEED SCRIPT TO WAIT HERE UNTIL FILE HAS COMPLETED ITS TRANSFER
rm -r "$f";
done
#unmount drive
diskutil unmount /Volumes/folder_for_files
I no longer have a Mac to try this on, but it seems open 'smb://...' is the only command here that does not wait for completion; it does the actual work in the background instead.
The best way to fix this would be to use something other than open to mount the NAS share. According to this answer the following should work, but due to the lack of a Mac and NAS I cannot test it.
# replace `open ...` with this
osascript <<< 'mount volume "smb://username:password@ip_address_of_NAS"'
If that does not work, use this workaround which manually waits until the NAS is mounted.
open 'smb://username:password@ip_address_of_NAS/folder_for_files'
while [ ! -d /Volumes/folder_for_files/ ]; do sleep 0.1; done
# rest of the script
You can use a loop that sleeps for, say, five seconds and then runs smbstatus, checking whether its output contains a string identifying your smb://username:password@ip_address_of_NAS/folder_for_files connection.
Once that string is found, start copying your files. You could also keep a counter variable so the loop gives up after a certain number of sleep-and-check rounds if the connection never succeeds.
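A rough sketch of that poll-with-timeout idea, using the directory test from the previous answer rather than smbstatus (the mount point and limits are illustrative):
tries=0
until [ -d /Volumes/folder_for_files ]; do
    tries=$((tries + 1))
    if [ "$tries" -ge 12 ]; then      # give up after 12 x 5 = 60 seconds
        echo "NAS did not mount in time" >&2
        exit 1
    fi
    sleep 5
done
# NAS is mounted; safe to start copying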

Using incrontab mv file results in 0 byte file

I'm watching a folder using incrontab, with this line in the incrontab -e editor:
/media/pi/VDRIVE IN_CLOSE_WRITE sudo mv $@/$# /media/pi/VDRIVE/ready/$#
The watched folder is receiving a file over the network from another machine. The file shows up OK and appears to trigger the incrontab job, presumably once the copy process has closed the file, but the mv command results in a 0-byte file with the correct name in the destination folder.
Everything runs as root.
It seems that there is a bug in Samba support on OS X which results in two events being generated when writing to a shared folder on the network. This makes incrontab pretty much unworkable with OS X clients (roughly OS X 10.7 and up).
So when OS X writes a file to the Linux Samba share, there are two events, and the first one triggers the mv action before the file has actually finished writing. It's a bug in OS X's SMB implementation.
In the end I used inotify to write events to a log file (of which there are always two), then scanned the log for two instances of the event before performing the action.
Another strategy was to use lsof in a cron routine and simply ignore any files still open for writing.
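A sketch of that lsof guard (the file path is illustrative): lsof exits with status 0 when some process has the file open, non-zero otherwise:
f=/media/pi/VDRIVE/incoming_file
if lsof "$f" >/dev/null 2>&1; then
    echo "still being written, skipping"   # some process has it open
else
    mv "$f" /media/pi/VDRIVE/ready/
fi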

Read updates from continuously updated directory

I am writing a bash script that looks at each file in a directory and does some sort of action to it. It's supposed to look something like this (maybe?):
for file in "$dir"* ; do
something
done
Cool, right? The problem is that this directory is being updated frequently with new files. By the time I've worked through all the files currently in the directory (and therefore exited the for-loop), more files may have arrived, and there is no guarantee the directory will ever stop being fed new files (well... take that with a grain of salt).
I do NOT want to process the same file more than once.
I was thinking of making a while loop that runs forever and keeps updating some file-list A, while maintaining another file-list B of all the files I have already processed; the first file in list A that is not in list B gets processed.
Is there a better method? Does this method even work? Thanks.
Edit: Mandatory "I am bash newb"
@Barmar has a good suggestion: one way to handle this is to use inotify to watch for new files. After installing the inotify-tools package on your system, you can use the inotifywait command to feed new-file events into a loop.
You may start with something like:
inotifywait -m -e moved_to,close_write myfolder |
while read dir events file; do
    echo "Processing file $file"
    # ...do something with "$dir/$file"...
    mv "$dir/$file" /some/place/for/processed/files
done
This inotifywait command will generate events for (a) files that are moved into the directory and (b) files that are closed after being opened for writing. This will generally get you what you want, but there are always corner cases that depend on your particular application.
The output of inotifywait looks something like:
tmp/work/ CLOSE_WRITE,CLOSE file1
tmp/work/ MOVED_TO file2

Prevent files from being moved by another process in Linux

I have a problem with a bash script.
I have two cron tasks, each of which takes some number of files from the same folder for further processing:
ls -1h targdir/*.json | head -n ${LIMIT} > ${TMP_LIST_FILE}
while read REMOTE_FILE
do
mv $REMOTE_FILE $SCRDRL
done < "${TMP_LIST_FILE}"
rm -f "${TMP_LIST_FILE}"
But when two instances of the script run simultaneously, the same file ends up being listed by both and moved to $SCRDRL, which is different for each instance.
The question is: how do I prevent files from being moved by a different instance of the script?
UPD:
Maybe I was a little unclear...
I have a folder "targdir" where I store JSON files, and I have two cron tasks which take some files from that directory to process. For example, if 25 files exist in targdir, the first cron task should take the first 10 files and move them to /tmp/task1, the second cron task should take the next 10 files and move them to /tmp/task2, etc.
But right now the same first 10 files get moved to both /tmp/task1 and /tmp/task2.
First and foremost: rename is atomic. It is not possible for a file to be moved twice. One of the moves will fail, because the file is no longer there. If the scripts run in parallel, both list the same 10 files, and instead of the first 10 files going to /tmp/task1 and the next 10 to /tmp/task2, you may get 4 moved to /tmp/task1 and 6 to /tmp/task2. Or maybe 5 and 5, or 9 and 1, or any other combination. But each file will only end up in one task.
So nothing is incorrect; each file is still processed only once. But it is inefficient, because each task could be processing 10 files at a time yet may only be processing 5. If you want to make sure you always process 10 when there are enough files available, you will have to do some synchronization. There are basically two options:
Place a lock around the list+move. This is most easily done using flock(1) and a lock file. There are two ways to call that, too:
Call the whole copying operation via flock:
flock targdir -c copy-script
This requires that you make the part that should be excluded a separate script.
Lock via a file descriptor. Before the copying, do
exec 3>targdir/.lock
flock 3
and after it do
flock -u 3
This lets you lock only part of the script; a combined sketch follows below. This does not work in Cygwin (but you probably don't need that).
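Putting the file-descriptor variant together with the question's own listing code, a sketch of the locked critical section (variable names taken from the question):
exec 3>targdir/.lock        # open (and create) the lock file on fd 3
flock 3                     # block until we hold an exclusive lock
ls -1h targdir/*.json | head -n ${LIMIT} > ${TMP_LIST_FILE}
while read REMOTE_FILE
do
    mv "$REMOTE_FILE" "$SCRDRL"
done < "${TMP_LIST_FILE}"
rm -f "${TMP_LIST_FILE}"
flock -u 3                  # release the lock for the other cron task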
Move the files one by one until you have enough.
ls -1h targdir/*.json > ${TMP_LIST_FILE}
# ^^^ do NOT limit here
COUNT=0
while read REMOTE_FILE
do
if mv $REMOTE_FILE $SCRDRL 2>/dev/null; then
COUNT=$(($COUNT + 1))
fi
if [ "$COUNT" -ge "$LIMIT" ]; then
break
fi
done < "${TMP_LIST_FILE}"
rm -f "${TMP_LIST_FILE}"
The mv will sometimes fail, in which case you don't count the file and simply try the next one, on the assumption that the mv failed because the file was meanwhile moved by the other script. Each script moves at most $LIMIT files, but the selection may be rather random.
On a side note: if you don't absolutely need to set variables in the while loop, you can do without a temporary file. Simply:
ls -1h targdir/*.json | while read REMOTE_FILE
do
...
done
You can't propagate variables out of such a loop, because as part of a pipeline it runs in a subshell.
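A quick demonstration of that subshell behaviour (illustrative, not from the original answer):
COUNT=0
ls -1 targdir/*.json | while read REMOTE_FILE
do
    COUNT=$(($COUNT + 1))    # increments only inside the subshell
done
echo $COUNT                  # still prints 0 in the parent shell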
If you do need to set variables and can live with using bash specifically (I usually try to stick to /bin/sh), you can also write
while read REMOTE_FILE
do
...
done < <(ls -1h targdir/*.json)
In this case the loop runs in the current shell, but this kind of redirection (process substitution) is a bash extension.
The fact that two cron jobs move the same file to the same path should not matter for you unless you are disturbed by the error you get from one of them (one will succeed and the other will fail).
You can ignore the error by using:
...
mv $REMOTE_FILE $SCRDRL 2>/dev/null
...
Since your script is supposed to move a specific number of files from the list, two instances will at best move twice as many files. Unless they interfere with each other, in which case the number of moved files might be less.
In any case, this is probably a bad situation to begin with. If you have any way of preventing two scripts from running at the same time, you should use it.
If, however, you have no way of preventing two script instances from running at the same time, you should at least harden the scripts against errors:
mv "$REMOTE_FILE" "$SCRDRL" 2>/dev/null
Otherwise your scripts will produce error output (not a good idea in a cron script).
Further, I hope your ${TMP_LIST_FILE} is not the same in both instances (you could use $$ in its name to avoid that); otherwise they would overwrite each other's temp file, in the worst case producing a corrupted file containing paths you do not want to move.
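For example, a per-instance temp file name (the path is illustrative):
TMP_LIST_FILE="/tmp/json-move-list.$$"   # $$ expands to this shell's PID,
                                         # so each instance gets its own list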

lftp mirroring directories that don't meet my criteria

I've been writing an lftp script that should mirror a remote directory to a local directory efficiently, possibly transferring multiple gigabyte files at a time.
One of the requirements is that a local user can delete the local file when it is no longer needed, and since I will have multiple "local" computers running this script, I don't want to delete the remote file until I know everyone who needs it has it. So the script uses the --newer-than flag to only mirror files that are new or modified on the remote server since the last time the lftp script ran locally.
Here's the important bits of the script:
lftp -u $login,$pass $host << EOF
set ftp:ssl-allow yes
set ftp:ssl-protect-data yes
set ftp:ssl-protect-list yes
set ftp:ssl-force yes
set mirror:use-pget-n 5
mirror -X * -I share*/* --newer-than=/local/file/last.run --continue --parallel=5 $remote_dir $local_dir
quit
EOF
Note that the EOF isn't the actual end of the bash script.
So I EXCLUDE everything in $remote_dir except anything in the share/ directory (including the share/ directory itself) that is NEWER than the last.run file's timestamp.
This works as expected except in one case: say I have another specifically named directory inside share/ called shareWHATEVER/,
so share/shareWHATEVER/stuff.txt exists.
The first time the script runs, shareWHATEVER/stuff.txt is copied from remote to local, and all is well.
If I then delete the shareWHATEVER directory locally in its entirety, including stuff.txt, the next time the script runs, stuff.txt is NOT mirrored, but shareWHATEVER is, even though the timestamps have not changed on the remote server.
So locally I end up with share/shareWHATEVER/ as an empty directory.
Any idea why shareWHATEVER is being copied over even though neither its own timestamp nor any of its files' timestamps are --newer-than my local check?
Thanks.
Apparently, creating directories even when no files are copied is just the way lftp's mirror works (and the --no-empty-dirs option doesn't change this behaviour).
You could raise this on the lftp mailing list.
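If the empty directories bother you, one possible workaround (my suggestion, not from the original answer; assumes GNU find) is to prune them after each mirror run:
# delete any directories under the local mirror that ended up empty
find "$local_dir" -mindepth 1 -type d -empty -delete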
