Prevent files from being moved by another process in Linux

I have a problem with a bash script.
I have two cron tasks, each of which takes some number of files from the same folder for further processing.
ls -1h "targdir/*.json" | head -n ${LIMIT} > ${TMP_LIST_FILE}
while read REMOTE_FILE
do
mv $REMOTE_FILE $SCRDRL
done < "${TMP_LIST_FILE}"
rm -f "${TMP_LIST_FILE}"
But when two instances of the script run simultaneously, the same file gets picked up by both and moved to $SCRDRL, which is different for each instance.
The question is: how do I prevent the same files from being moved by both scripts?
UPD:
Maybe I was a little unclear...
I have a folder "targdir" where I store JSON files, and two cron tasks which take some files from that directory to process. For example, if 25 files exist in targdir, the first cron task should take the first 10 files and move them to /tmp/task1, the second cron task should take the next 10 files and move them to /tmp/task2, etc.
But right now the first 10 files get moved to both /tmp/task1 and /tmp/task2.

First and foremost: rename is atomic. It is not possible for a file to be moved twice. One of the moves will fail, because the file is no longer there. If the scripts run in parallel, both list the same 10 files, and instead of the first 10 files going to /tmp/task1 and the next 10 to /tmp/task2, you may get 4 moved to /tmp/task1 and 6 to /tmp/task2. Or maybe 5 and 5, or 9 and 1, or any other combination. But each file will only end up in one task.
So nothing is incorrect; each file is still processed only once. But it is inefficient, because you could be processing 10 files at a time and you are only processing 5. If you want to make sure you always process 10 when there are enough files available, you will have to do some synchronization. There are basically two options:
Place a lock around the list+copy. This is most easily done using flock(1) and a lock file. There are two ways to call it:
Call the whole copying operation via flock:
flock targdir -c copy-script
This requires that you put the part that should run under the lock into a separate script.
Lock via file descriptor. Before the copying, do
exec 3>targdir/.lock
flock 3
and after it do
flock -u 3
This lets you lock over part of the script only. This does not work in Cygwin (but you probably don't need that).
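Putting the two flock pieces together, a minimal sketch of the locked list+move step might look like this (same variable names as in the question, and assuming targdir is writable so the .lock file can live there):
exec 3>targdir/.lock          # open the lock file on fd 3
flock 3                       # block until this instance holds the lock
ls -1h targdir/*.json | head -n "${LIMIT}" > "${TMP_LIST_FILE}"
while read REMOTE_FILE
do
    mv "$REMOTE_FILE" "$SCRDRL"
done < "${TMP_LIST_FILE}"
rm -f "${TMP_LIST_FILE}"
flock -u 3                    # release; the other instance now sees only the remaining files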
Move the files one by one until you have enough.
ls -1h targdir/*.json > ${TMP_LIST_FILE}
# ^^^ do NOT limit here
COUNT=0
while read REMOTE_FILE
do
    if mv "$REMOTE_FILE" "$SCRDRL" 2>/dev/null; then
        COUNT=$(($COUNT + 1))
    fi
    if [ "$COUNT" -ge "$LIMIT" ]; then
        break
    fi
done < "${TMP_LIST_FILE}"
rm -f "${TMP_LIST_FILE}"
The mv will sometimes fail, in which case you don't count the file and simply try the next one, on the assumption that the mv failed because the file was meanwhile moved by the other script. Each script moves at most $LIMIT files, but the selection may be rather random.
On a side note, if you don't absolutely need to set variables in the while loop, you can do without a temporary file. Simply:
ls -1h targdir/*.json | while read REMOTE_FILE
do
...
done
You can't propagate variables out of such a loop, because, as part of a pipeline, it runs in a subshell.
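As a quick illustration (a hedged sketch reusing the names from above), COUNT is still 0 after the pipeline, because the loop body ran in a subshell:
COUNT=0
ls -1h targdir/*.json | while read REMOTE_FILE
do
    COUNT=$(($COUNT + 1))
done
echo "$COUNT"   # prints 0, not the number of files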
If you do need to set variables and can live with using bash specifically (I usually try to stick to /bin/sh), you can also write:
while read REMOTE_FILE
do
...
done < <(ls -1h targdir/*.json)
In this case the loop runs in the current shell, but this kind of redirection (process substitution) is a bash extension.

The fact that two cron jobs sometimes try to move the same file should not matter to you, unless you are disturbed by the error message from one of them (one will succeed and the other will fail).
You can ignore the error by using:
...
mv $REMOTE_FILE $SCRDRL 2>/dev/null
...

Since your script is supposed to move a specific number of files from the list, two instances will at best move twice as many files. If they interfere with each other, the number of moved files might even be less.
In any case, this is probably a bad situation to begin with. If you have any way of preventing two scripts from running at the same time, you should do that.
If, however, you have no way of preventing two script instances from running at the same time, you should at least harden the scripts against errors:
mv "$REMOTE_FILE" "$SCRDRL" 2>/dev/null
Otherwise your scripts will produce error output (not a good idea in a cron script).
Further, I hope that your ${TMP_LIST_FILE} is not the same in both instances (you could use $$ in its name to avoid that); otherwise they would even overwrite each other's temp file, in the worst case producing a corrupted list containing paths you did not want to move.
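For example, a minimal sketch of making the list file unique per instance (the /tmp/task-list name is just a placeholder; mktemp is the more robust choice if it is available):
TMP_LIST_FILE="/tmp/task-list.$$"              # $$ = PID of this script instance
# or, more robustly:
TMP_LIST_FILE=$(mktemp /tmp/task-list.XXXXXX)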

Related

lftp mirror possible race condition during file transfer?

I was trying to keep a local folder [lftp_source_folder] and a remote folder [lftp_server_path] in sync using lftp's mirror command.
The script that runs it continuously is given below.
while true
do
lftp -f $BATCHFILE
sleep 20
done
$BATCHFILE mainly consists of the following:
# sftp config +
echo "mirror --Remove-source-files -R /lftp_source_folder /lftp_server_path/" >> $BATCHFILE
The problem is that I have another script which keeps moving files into /lftp_source_folder.
Now I am wondering whether there is a chance of a race condition with this implementation.
For example:
1. My script is moving files into /lftp_source_folder.
2. Say 50% of a file has been moved into /lftp_source_folder and the mv command is interrupted by the OS, so /lftp_source_folder/new_file.txt is only 50% of its final size.
3. lftp is invoked by the while loop.
4. The mv command continues.
At step 3, will lftp upload the half-complete file to the server and then delete it? Data would be lost in that case.
If it's a race condition, what's the solution?
If you're moving files within the same filesystem, there's no race condition: mv simply performs a rename() operation, which is atomic, and no file data is copied.
But if you're moving between different filesystems, you can indeed get a race condition. Such a move is done as a copy followed by deleting the original, and the mirror might upload the file to the FTP server while only part of it has been copied.
The solution to this is to move the file first to a temporary folder on the same filesystem as /lftp_source_folder, then move it from there to /lftp_source_folder. So when the mirror script sees it there, it's guaranteed to be complete.
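A minimal sketch of that staging approach (the /lftp_staging path and the source path are placeholders; the only requirement is that /lftp_staging is on the same filesystem as /lftp_source_folder):
mkdir -p /lftp_staging
# This step may cross filesystems and be slow and non-atomic; lftp never sees it.
mv /some/other/fs/new_file.txt /lftp_staging/new_file.txt
# This step is a rename within one filesystem, so it is atomic:
# the mirror job either sees the whole file or no file at all.
mv /lftp_staging/new_file.txt /lftp_source_folder/new_file.txt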

Read updates from continuously updated directory

I am writing a bash script that looks at each file in a directory and does some sort of action to it. It's supposed to look something like this (maybe?).
for file in "$dir"* ; do
something
done
Cool, right? The problem is, this directory is updated frequently with new files, so the for loop may finish while I am still feeding the directory with extra files. In fact, there is no guarantee that I will ever be done feeding the directory (well... take that with a grain of salt).
I do NOT want to process the same file more than once.
I was thinking of making a while loop that runs forever and keeps updating a file list A, while maintaining another file list B of all the files I have already processed; the first file in list A that is not in list B gets processed.
Is there a better method? Does this method even work? Thanks
Edit: Mandatory "I am bash newb"
@Barmar has a good suggestion. One way to handle this is to use inotify to watch for new files. After installing inotify-tools on your system, you can use the inotifywait command to feed new-file events into a loop.
You may start with something like:
inotifywait -m -e moved_to -e close_write myfolder |
while read dir events file; do
    echo "Processing file $file"
    ...do something with $dir/$file...
    mv "$dir/$file" /some/place/for/processed/files
done
This inotifywait command will generate events for (a) files that are moved into the directory and (b) files that are closed after being opened for writing. This will generally get you what you want, but there are always corner cases that depend on your particular application.
The output of inotifywait looks something like:
tmp/work/ CLOSE_WRITE,CLOSE file1
tmp/work/ MOVED_TO file2

RH Linux Bash Script help. Need to move files with specific words in the file

I have a Red Hat Linux box, and in the past I wrote a script to move files containing a specific piece of text from one location to another.
I typically only write scripts once a year, so every year I forget more and more... That being said,
last year I wrote this script, used it, and it worked.
For some reason, I cannot get it to work today, and I know it's a simple issue and I shouldn't even be asking for help, but for some reason I'm just not looking at it correctly today.
Here is the script.
ls -1 /var/text.old | while read file
do
    grep -q "to.move" $file && mv $file /var/text.old/TBD
done
I'm listing all the files inside the /var/text.old directory.
I'm reading each file name.
Then I'm grep'ing for "to.move" and checking the result.
Then I'm moving the matching files to the folder /var/text.old/TBD.
I am an admin and I have rights to the above files and folders.
I can see the data in each file
I can mv them manually
I have used pwd to grab the correct spelling of the directory.
If anyone can just help me to see what the heck I'm missing here that would really make my day.
Thanks in advance.
UPDATE:
The files I need to move do not have whitespace in their names.
The error I'm getting is as follows:
grep: 9829563.msg: No such file or directory
NOTE: the file "982953.msg" is one of the files I need to move.
Also note: I'm getting this error for every file in the directory that I'm listing.
You didn't post any error, but I'm going to take a guess and say that you have a filename with a space or a special shell character in it.
Let's say you have 3 files, and ls -1 gives us:
hello
world
hey there
Now, the shell splits words on the value of the special $IFS variable, which is set to <space><tab><newline> by default. read file puts the whole line into $file, but as soon as you expand $file unquoted (as in grep -q "to.move" $file), it is split again on $IFS.
So instead of the 3 names you expect (hello, world, and hey there), grep and mv end up seeing 4 words (hello, world, hey, and there).
To fix this, we can do 2 things:
Set IFS to only a newline:
IFS="
"
ls -1 /var/text.old | while read file
...
In general, I like setting IFS to a newline at the start of the script, since I consider this to be slightly "safer", but opinions on this probably vary.
But much better is to not parse the output of ls at all, and to use a for loop:
for file in /var/text.old/*; do
This won't fork any external processes (piping ls into while starts two), and behaves less surprisingly in other ways.
The second problem is that you're not quoting $file. You should always quote pathnames with double quotes: "$file". If $file contains a space (or a special shell character, such as *), the meaning of your command changes:
file=hey\ *
mv $file /var/text.old/TBD
Becomes:
mv hey * /var/text.old/TBD
Which is obviously very different from what you intended! What you intended was:
mv "hey *" /var/text.old/TBD

linux server create symbolic links from filenames

I need to write a shell script to run as a cron task, or preferably on creation of a file in a certain folder.
I have an incoming and an outgoing folder (they will be used to log mail). There will be files created with codes as follows...
bmo-001-012-dfd-11 for outgoing and 012-dfd-003-11 for incoming. I need to extract the project/client code (012-dfd) and then file a link under the matching project folder.
Project folders are located in /projects and follow the format 012-dfd. I need to create symbolic links inside the incoming or outgoing folders of the projects that lead to the correct file in the general incoming and outgoing folders.
/incoming/012-dfd-003-11.pdf -> /projects/012-dfd/incoming/012-dfd-003-11.pdf
/outgoing/bmo-001-012-dfd-11.pdf -> /projects/012-dfd/outgoing/bmo-001-012-dfd-11.pdf
So my questions
How would I make my script run when a file is added to either incoming or outgoing folder
Additionally, are there any disadvantages to running on file modification compared with running as a cron task every 5 minutes?
How would I get the filename of recent (since script last run) files
How would I extract the code from the filename
How would I use the code to create a symlink in the desired folder
EDIT: What I ended up doing...
while inotifywait outgoing; do find -L . -type l -delete; ls outgoing | php -R '
if(
preg_match("/^\w{3}-\d{3}-(\d{3}-\w{3})-\d{2}(.+)$/", $argn, $m)
&& $m[1] && (file_exists("projects/$m[1]/outgoing/$argn") != TRUE)
){
`ln -s $(pwd)/outgoing/$argn projects/$m[1]/outgoing/$argn;`;
}
'; done;
This works quite well, cleaning up deleted symlinks as well (with find -L . -type l -delete), but I would prefer to do it without the overhead of calling PHP. I just don't know bash well enough yet.
Some near-answers for your task breakdown:
On linux, use inotify, possibly through one of its command-line tools, or script language bindings.
See above
Assuming the project code can be extracted purely positionally from your examples (meaning not only does the project code follow a strict 7-character format, but so does what precedes it in the outgoing filename):
echo `basename /incoming/012-dfd-003-11.pdf` | cut -c 1-7
012-dfd
echo `basename /outgoing/bmo-001-012-dfd-11.pdf`| cut -c 9-15
012-dfd
mkdir -p /projects/$i/incoming/ creates the directory /projects/012-dfd/incoming/ if $i is 012-dfd, and
ln -s /incoming/foo /projects/$i/incoming/foo creates a symbolic link at the latter path pointing to the preexisting file /incoming/foo.
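A hedged sketch of those per-file steps for an incoming file (the example path comes from the question; outgoing names would use cut -c 9-15 instead):
f=/incoming/012-dfd-003-11.pdf
name=$(basename "$f")
code=$(echo "$name" | cut -c 1-7)              # -> 012-dfd
mkdir -p "/projects/$code/incoming"
ln -s "$f" "/projects/$code/incoming/$name"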
How would I make my script run when a file is added to either incoming or outgoing folder
Additionally, are there any disadvantages to running on file modification compared with running as a cron task every 5 minutes?
If a 5-minute delay isn't an issue, I would go for the cron job (it's easier and, IMHO, more flexible).
How would I get the filename of recent (since script last run) files
If your script runs every 5 minutes, then all the files created between now and 5 minutes ago are new, so you can list them using ls or find.
How would I extract the code from the filename
You can use the sed command
How would I use the code to create a symlink in the desired folder
Once you have the desired file names, you can use the ln -s command to create the symbolic links.
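For instance, a sketch combining the two (assuming a 5-minute cron interval and incoming names like 012-dfd-003-11.pdf, as in the question):
find /incoming -maxdepth 1 -type f -mmin -5 | while read -r f
do
    name=$(basename "$f")
    code=$(echo "$name" | sed 's/^\([0-9]\{3\}-[a-z]\{3\}\).*/\1/')   # -> 012-dfd
    mkdir -p "/projects/$code/incoming"
    [ -e "/projects/$code/incoming/$name" ] || ln -s "$f" "/projects/$code/incoming/$name"
done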

Modifying files nested in tar archive

I am trying to do a grep and then a sed to search for specific strings inside files, which are inside multiple tars, all inside one master tar archive. Right now, I modify the files by
First extracting the master tar archive.
Then extracting all the tars inside it.
Then doing a recursive grep and then sed to replace a specific string in files.
Finally packaging everything again into tar archives, and all the archives inside the master archive.
Pretty tedious. How do I do this automatically using shell scripting?
There isn't going to be much option except automating the steps you outline, for the reasons demonstrated by the caveats in the answer by Kimvais.
tar modify operations
The tar command has some options to modify existing tar files. They are, however, not appropriate for your scenario for multiple reasons, one of them being that it is the nested tarballs that need editing rather than the master tarball. So, you will have to do the work longhand.
Assumptions
Are all the archives in the master archive extracted into the current directory or into a named/created sub-directory? That is, when you run tar -tf master.tar.gz, do you see:
subdir-1.23/tarball1.tar
subdir-1.23/tarball2.tar
...
or do you see:
tarball1.tar
tarball2.tar
(Note that nested tars should not themselves be gzipped if they are to be embedded in a bigger compressed tarball.)
master_repackager
Assuming you have the subdirectory notation, then you can do:
for master in "$#"
do
tmp=$(pwd)/xyz.$$
trap "rm -fr $tmp; exit 1" 0 1 2 3 13 15
cat $master |
(
mkdir $tmp
cd $tmp
tar -xf -
cd * # There is only one directory in the newly created one!
process_tarballs *
cd ..
tar -czf - * # There is only one directory down here
) > new.$master
rm -fr $tmp
trap 0
done
If you're working in a malicious environment, use something other than xyz.$$ for the directory name. However, this sort of repackaging is usually not done in a malicious environment, and the chosen name based on the process ID is sufficient to give everything a unique name. The use of tar -f - for input and output allows you to switch directories but still handle relative pathnames on the command line. There are likely other ways to handle that if you want. I also used cat to feed the input to the sub-shell so that the top-to-bottom flow is clear; technically, I could improve things by using ) > new.$master < $master at the end, but that hides some crucial information several lines away.
The trap commands make sure that (a) if the script is interrupted (signals HUP, INT, QUIT, PIPE or TERM), the temporary directory is removed and the exit status is 1 (not success) and (b) once the subdirectory is removed, the process can exit with a zero status.
You might need to check whether new.$master exists before overwriting it. You might need to check that the extract operation actually extracted stuff. You might need to check whether the sub-tarball processing actually worked. If the master tarball extracts into multiple sub-directories, you need to convert the 'cd *' line into some loop that iterates over the sub-directories it creates.
All these issues can be skipped if you know enough about the contents and nothing goes wrong.
process_tarballs
The second script is process_tarballs; it processes each of the tarballs on its command line in turn, extracting the file, making the substitutions, repackaging the result, etc. One advantage of using two scripts is that you can test the tarball processing separately from the bigger task of dealing with a tarball containing multiple tarballs. Again, life will be much easier if each of the sub-tarballs extracts into its own sub-directory; if any of them extracts into the current directory, make sure you create a new sub-directory for it.
for tarball in "$#"
do
# Extract $tarball into sub-directory
tar -xf $tarball
# Locate appropriate sub-directory.
(
cd $subdirectory
find . -type f -print0 | xargs -0 sed -i 's/name/alternative-name/g'
)
mv $tarball old.$tarball
tar -cf $tarball $subdirectory
rm -f old.$tarball
done
You should add traps to clean up here, too, so the script can be run in isolation from the master script above and still not leave any intermediate directories around. In the context of the outer script, you might not need to be so careful to preserve the old tarball before the new one is created (so rm -f $tarball instead of the move and remove commands), but treated in its own right, the script should be careful not to damage anything.
Summary
What you're attempting is not trivial.
For debuggability, the job is split into two scripts that can be tested independently.
Handling the corner cases is much easier when you know what is really in the files.
You can probably sed the actual tar stream, since tar itself does not do the compression.
e.g.
zcat archive.tar.gz|sed -e 's/foo/bar/g'|gzip > archive2.tar.gz
However, beware that this will also replace foo with bar in filenames, usernames and group names, and it ONLY works if foo and bar are of equal length.
