How to know if a file in a shared disk is in use by another Linux instance using bash? - linux

I need to know whether a file on a shared disk is in use by another Linux instance.
I have two Linux machines sharing a disk. At random times, the first machine writes a results file (with a consecutive filename) to the shared disk when an analysis process finishes.
On the other machine I have a bash script that checks whether the first machine has finished writing the file.
The way I check now is a for loop in the bash script that runs stat to see whether the file's last-modified date is later than the machine's current date. If it is, I can process the file; if not, I sleep and run stat again.
So, is there any way to avoid this and to know whether the file on the shared disk is in use by the other machine? Or what is the best way to wait for the finished file?
Thanks in advance.

Write the result file into the same directory with a temporary name. Only rename it to its final name after closing the file under its temporary name, ensuring that contents are flushed. Thus, if the file exists under its final name, it is guaranteed to be complete.
This needs to be in the same directory because NFS renames are only guaranteed to be atomic within a single directory (whereas in non-NFS scenarios, any location on the same filesystem would work).
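A minimal sketch of that pattern, assuming the writer can choose its filenames (run_analysis, process and the paths here are hypothetical placeholders):
# Writer (machine 1): write under a temporary name, then rename within the same directory.
OUT=/shared/results_001.txt
./run_analysis > "$OUT.tmp"   # the redirection is closed when run_analysis exits
mv "$OUT.tmp" "$OUT"          # rename within the same directory: atomic, even over NFS
# Reader (machine 2): the file is complete as soon as it exists under its final name.
until [ -e "$OUT" ]; do
    sleep 5
done
./process "$OUT"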

You can try using flock as demonstrated here: How to prevent a script from running simultaneously?. There are other ways to synchronize file access as well (lockfile, mkdir, just to name a few). A simple Google search should give you what you need. I'm not 100% sure if this is appropriate for your setup with the shared disk, though.
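A hedged sketch of the flock approach (paths are placeholders); note that flock uses kernel advisory locks, so over NFS it only helps if lock support (lockd or NFSv4 locking) works on both machines and both sides cooperate by taking the lock:
(
    flock -x 9                        # take an exclusive lock on file descriptor 9
    cp /shared/results_001.txt /tmp/  # work on the file while holding the lock
) 9>/shared/results_001.txt.lock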

Kind of the same approach as comparing the date:
Get the file size two times and compare:
FILE=/path/to/result/file        # placeholder path
STAT1=$(du "$FILE" | awk '{print $1}')
STAT2=$(du "$FILE" | awk '{print $1}')
[ "$STAT1" -ne "$STAT2" ] \
  && echo "writing to..." \
  || echo "FINISHED"
If you expect heavy I/O and iowaits, put a sleep 1 between the two stats, as in the loop below.
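A hedged variant of the same idea that keeps polling until the size stops changing between two samples (FILE is the same placeholder path as above):
while true; do
    SIZE1=$(du "$FILE" | awk '{print $1}')
    sleep 1                      # give heavy I/O a chance to show up between samples
    SIZE2=$(du "$FILE" | awk '{print $1}')
    [ "$SIZE1" -eq "$SIZE2" ] && break
done
echo "FINISHED"
Note that a stable size only suggests the writer is done; the rename approach above is more reliable.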

Related

Linux ~/.bashrc export most recent directory

I have several environment variables in my ~/.bashrc that point to different directories. I am running a program that creates a new folder every time it runs and puts a timestamp in the directory name. For example, baseline_2015_11_10_15_40_31-model-stride_1-type_1. Is there a way of making a variable that can link to the last created directory?
cd $CURRENT_DIR
Your mileage may vary a lot depending on what exactly you need to accomplish. However, in almost all cases I would advise against doing something as weird and unreliable as what's described below; revise your architecture to avoid hunting for directories instead.
Method 1
If your program creates a subdirectory inside the current directory, and you know that nothing else happens in that directory and you want the most recently modified subdirectory, then you can do something like:
your_complex_program_that_creates_dir
TARGET_DIR=$(ls -t1 --group-directories-first | head -n1)
cd "$TARGET_DIR"
Method 2
If a lot of stuff happens on the system, then you'll end up monitoring what your program does with the filesystem and reacting when it creates a directory. There are two ways to do that, using strace or inotify; both are relatively complex. Here's the way to do it with strace:
strace -o some_temp_file.strace your_complex_program_that_creates_dir
TARGET_DIR=$(sed -ne '/^mkdir(/ { s/^mkdir("\(.*\)", .*).*$/\1/; p }' some_temp_file.strace)
cd "$TARGET_DIR"
This snippet runs your_complex_program_that_creates_dir under control of strace, which essentially logs every system call your program makes into a file. Afterwards, this file is analyzed to seek a line like
mkdir("target_dir", 0777) = 0
and extract the value "target_dir" into a variable. Note that:
If your program creates more than one directory (even for temporary purposes, deleting them afterwards, or whatever), there's really no way to determine which of them to grab.
Running a program under strace is much slower than normal due to the huge overhead of logging all the syscalls.
It's highly non-portable: facilities like strace exist on most modern OSes, but the implementations vary a lot.
A solution with inotify works in a similar way, but using a different mechanism: it uses an OS hook to report the operations the process performs on the filesystem and reacts to them (remembering the created directory).
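A rough sketch of the inotify variant, assuming the inotify-tools package (which provides inotifywait) is installed; it watches the current directory while the program runs and grabs the first subdirectory reported as created:
# Start a watcher coprocess that reports create events in the current directory.
coproc WATCH { inotifywait -m -e create --format '%e %f' .; }
your_complex_program_that_creates_dir
# Read events until a directory creation (CREATE,ISDIR) shows up, then stop watching.
# Plain file creations are skipped; this blocks forever if no directory is ever created.
while read -r EVENT NAME <&"${WATCH[0]}"; do
    if [[ $EVENT == *ISDIR* ]]; then
        TARGET_DIR=$NAME
        break
    fi
done
kill "$WATCH_PID"
cd "$TARGET_DIR"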
However, I repeat, I'd strongly advise against using any of these solutions beyond research interest.

Modifying a file that is being used as an output redirection by another program

If I have a file that some output is being redirected to, what will happen if I modify that file from another program? Will both changes be recorded in the file?
To illustrate:
Terminal 1 (a file is used to store the output of a program, using either tee or the redirection operator >):
$ ./program | tee output.log
Terminal 2 (at the same time, the log file is being modified by another program, e.g. vim):
$ vim output.log
It depends on the program and the system calls they make.
vim, for example, will not write to the file until you issue the ":w" or ":x" command. It will then detect that the file has changed and make you confirm the overwrite.
If the program does open(2) on the file with the O_APPEND flag, before each write(2) the file offset is positioned at the end of the file, as if with lseek(2).
So if you have two commands that append, like tee, they will take turns appending.
However, with NFS you still may get corrupted files if more than one process appends data to a file at once, because NFS doesn't support appending to a file and the kernel has to simulate it.
The effect of two or more processes modifying the data of the same file (inode in tech lingo) is undefined. The result depends on the particular order the writing processes are scheduled. This is a classic case of a race condition, i.e. a result depends on the particular order of process execution.
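A small demo of the O_APPEND behaviour on a local filesystem (not over NFS, per the caveat above): two background loops append to the same file with >> and the lines interleave without clobbering each other.
: > /tmp/append_demo.log                   # truncate/create the demo file
for i in $(seq 1 100); do echo "writer-A $i" >> /tmp/append_demo.log; done &
for i in $(seq 1 100); do echo "writer-B $i" >> /tmp/append_demo.log; done &
wait
wc -l /tmp/append_demo.log                 # expect 200 complete lines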

Does the Linux system lock the file when I copy it?

I have written a program that updates a file periodically, and sometimes I want to copy the file to another computer to check its contents. If I copy the file while the program is not writing it, there is no problem. But if I copy it while the program is writing it, the copied file is partial. So I wonder whether Linux has a locking strategy to prevent this situation.
In fact, I copy the file in a bash script, so I want the script to check whether the program is writing it. If so, the script should check again after a few seconds and then copy the completed version. So, in a bash script, how can I check whether the file is open or being modified by another program?
You could check from your script whether the file is being written to, and abort/pause the copy if it is:
fuser -v /path/to/your/file 2>&1 | awk 'BEGIN{FS=""}$38=="F"{num++}END{print num}'
(fuser -v prints its table on stderr, hence the 2>&1.) If the output is smaller than 1, you're good to copy :)
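A simpler variant relies on fuser's exit status instead of parsing its output: with -s it is silent and returns success while some process has the file open (path and destination are placeholders):
FILE=/path/to/your/file
while fuser -s "$FILE"; do
    sleep 5        # still open by some process; check again later
done
cp "$FILE" /destination/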
When your code writes into the file, it actually writes into an output buffer in memory. The buffer is flushed out to disk when it becomes full. Thus, when you copy the file while its buffer has not yet been flushed to disk, you will observe a partial file.
You can modify the buffer size by using the call to setvbuf. If you set the buffer size to zero, it will get flushed out as it is written. Another thing you can do is to make a call to fflush() to flush the output to disk. Either of these two should update the file as it is written.
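If you cannot modify the program, a related shell-side trick (an assumption, not something from the question) is stdbuf from GNU coreutils, which adjusts stdio buffering for programs that use the C stdio library and do not override their own buffering:
stdbuf -o0 ./program > /shared/output.txt   # unbuffered stdout
stdbuf -oL ./program > /shared/output.txt   # or line-buffered stdout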

Redirecting multiple stdouts to single file

I have a program running on multiple machines with NFS and I'd like to log all their outputs into a single file. Can I just run ./my_program >> filename on every machine or is there an issue with concurrency I should be aware of? Since I'm only appending, I don't think there would be a problem, but I'm just trying to make sure.
That could work, but yes, you will have concurrency issues with it, and the log file will be basically indecipherable.
What I would recommend is that there be a log file for each machine, and then on a periodic basis (say nightly) concatenate the files together, with the machine name as the file name:
for i in /path/to/logfiles/*; do
    echo "Machine: $i"
    cat "$i"
done > filename.log
That should give you some ideas, I think.
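One way to set up the per-machine files, assuming the shared directory is writable from every host (paths are placeholders):
# On each machine, append to its own log file, named after the host:
./my_program >> "/path/to/logfiles/$(hostname).log"
# Hypothetical nightly cron entry on one machine to merge them:
# 0 2 * * * cat /path/to/logfiles/*.log > /path/to/combined.log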
The NFS protocol does not support atomic append writes, so append writes are never atomic on NFS for any platform. Files WILL end up corrupt if you try.
When appending to a file from multiple threads or processes, the fwrites to that file are atomic under the condition that the file was opened in append mode, the string written does not exceed the filesystem block size, and the filesystem is local; with NFS, that last condition does not hold.
There is a workaround, although I would not know how to do it from a shell script. The technique is called close-to-open cache consistency.

What happens if there are too many files under a single directory in Linux?

If there are like 1,000,000 individual files (mostly around 100k in size) in a single directory, flatly (no subdirectories), are there going to be any compromises in efficiency or disadvantages in any other ways?
ARG_MAX is going to take issue with that... for instance, rm -rf * (while in the directory) is going to fail with "Argument list too long". Utilities that want to do some kind of globbing (or a shell) will have some functionality break.
If that directory is available to the public (let's say via FTP or a web server) you may encounter additional problems.
The effect on any given file system depends entirely on that file system. How frequently are these files accessed, what is the file system? Remember, Linux (by default) prefers keeping recently accessed files in memory while putting processes into swap, depending on your settings. Is this directory served via http? Is Google going to see and crawl it? If so, you might need to adjust VFS cache pressure and swappiness.
Edit:
ARG_MAX is a system-wide limit on the total size of the arguments that can be presented to a program's entry point. So, let's take rm and the example "rm -rf *": the shell is going to turn '*' into a space-delimited list of files, which in turn becomes the arguments to rm.
The same thing is going to happen with ls, and several other tools. For instance, ls foo* might break if too many files start with 'foo'.
I'd advise (no matter what filesystem is in use) breaking it up into smaller directory chunks, just for that reason alone; a workaround with find is sketched below.
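A possible workaround when the glob hits ARG_MAX is to let find enumerate the entries and hand them to the tool in batches (directory path and pattern are placeholders):
# Delete matching files in batches instead of expanding foo* in the shell:
find /path/to/big_dir -maxdepth 1 -type f -name 'foo*' -print0 | xargs -0 rm --
# Or let find delete everything in the directory itself:
find /path/to/big_dir -mindepth 1 -maxdepth 1 -delete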
My experience with large directories on ext3 with dir_index enabled:
If you know the name of the file you want to access, there is almost no penalty
If you want to do operations that need to read in the whole directory entry (like a simple ls on that directory) it will take several minutes for the first time. Then the directory will stay in the kernel cache and there will be no penalty anymore
If the number of files gets too high, you run into ARG_MAX et al problems. That basically means that wildcarding (*) does not always work as expected anymore. This is only if you really want to perform an operation on all the files at once
Without dir_index however, you are really screwed :-D
Most distros use Ext3 by default, which can use b-tree indexing for large directories.
Some distros have this dir_index feature enabled by default; in others you'd have to enable it yourself. If you enable it, there's no slowdown even for millions of files.
To see if dir_index feature is activated do (as root):
tune2fs -l /dev/sdaX | grep features
To activate dir_index feature (as root):
tune2fs -O dir_index /dev/sdaX
e2fsck -D /dev/sdaX
Replace /dev/sdaX with partition for which you want to activate it.
When you accidentally execute "ls" in that directory, or use tab completion, or want to execute "rm *", you'll be in big trouble. In addition, there may be performance issues depending on your file system.
It's considered good practice to group your files into directories named by the first 2 or 3 characters of the filenames, e.g. (a sketch for doing this follows the listing):
aaa/
aaavnj78t93ufjw4390
aaavoj78trewrwrwrwenjk983
aaaz84390842092njk423
...
abc/
abckhr89032423
abcnjjkth29085242nw
...
...
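A hedged sketch of bucketing existing files that way, run from inside the big directory (this assumes plain filenames at least 3 characters long):
for f in *; do
    [ -f "$f" ] || continue      # skip anything that is not a regular file
    prefix=${f:0:3}              # first 3 characters of the name
    mkdir -p "$prefix"
    mv -- "$f" "$prefix/"
done
The glob here is expanded inside the shell itself, so it is not subject to ARG_MAX; that limit only applies when arguments are passed to external commands.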
The obvious answer is that the folder will be extremely difficult for humans to use long before any technical limit is hit (the time taken to read the output of ls, for one; there are dozens of other reasons). Is there a good reason why you can't split it into subfolders?
Not every filesystem supports that many files.
On some of them (ext2, ext3, ext4) it's very easy to hit the inode limit.
I've got a host with 10M files in a directory. (don't ask)
The filesystem is ext4.
It takes about 5 minutes to run ls.
One limitation I've found is that my shell script to read the files (because AWS snapshot restore is a lie and files aren't present until first read) wasn't able to handle the argument list, so I needed to do two passes. First, construct a file list with find (-wholename in case you want to do partial matches):
find /path/to_dir/ -wholename '*.ldb' | tee filenames.txt
Then read from the file containing the filenames and process all of the files (with limited parallelism):
while read -r line; do
    if test "$(jobs | wc -l)" -ge 10; then
        wait -n
    fi
    {
        : # do something with "$line" here (10x fanout)
    } &
done < filenames.txt
Posting here in case anyone finds the specific work-around useful when working with too many files.
