Does the Linux system lock the file when I copy it?

I have written a program that updates a file periodically. Sometimes I want to copy the file to another computer to check its contents. If I copy the file while the program is not writing to it, there is no problem. But if I copy it while the program is writing, the copied file is only partial. So I wonder whether Linux has a locking strategy that prevents this situation.
In fact, I copy the file from a bash script, so I want the script to check whether the program is currently writing to the file. If it is, the script should wait a few seconds, check again, and then copy the completed version. So, in a bash script, how can I check whether a file is open or being modified by another program?

You could check from your script whether the file is currently open for writing, and abort or pause the copy if it is:
fuser -v /path/to/your/file 2>&1 | awk 'BEGIN{FS=""} $38=="F"{num++} END{print num+0}'
If the output is less than 1, you're good to copy. :)
(Note: fuser sends its verbose table to stderr, hence the 2>&1. The $38=="F" test relies on the default column layout of fuser -v, where an F in the ACCESS column marks a file opened for writing.)

When your code writes to the file through stdio, it actually writes into an output buffer in the process's memory. That buffer is only handed to the kernel (via write(2)) when it fills up or when the stream is flushed explicitly. So if you copy the file while data is still sitting in an unflushed buffer, you will see a partial file.
You can change the buffering with setvbuf(): selecting unbuffered mode (_IONBF) hands the data to the kernel as soon as it is written. Alternatively, call fflush() after each complete record. Either approach makes the file contents visible to other processes (such as your copy) as soon as they are written; forcing them onto the physical disk additionally requires fsync().
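A minimal sketch of both options, assuming the writer is a C program using stdio (the file name and record format are made up for illustration):

    #include <stdio.h>

    int main(void)
    {
        /* "status.log" is just an illustrative name. */
        FILE *fp = fopen("status.log", "w");
        if (!fp) { perror("fopen"); return 1; }

        /* Option 1: make the stream unbuffered, so every fprintf()
           is handed to the kernel immediately. */
        setvbuf(fp, NULL, _IONBF, 0);

        for (int i = 0; i < 10; i++) {
            fprintf(fp, "record %d\n", i);
            /* Option 2: keep the default buffering and flush explicitly
               after each complete record instead. */
            fflush(fp);
        }

        fclose(fp);
        return 0;
    }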

Related

How safe is it reading / copying a file which is being appended to?

If a log file has events constantly being appended to it, how safe is it to read that file (or copy it) with another process?
Unix allows concurrent reading and writing. It is totally safe to read a file while others are appending to it.
Of course, it can happen that an append is still in progress when a reader reaches the end of the file; that reader will then get an incomplete version (e.g. only part of the newest log entry at the end of the file). But technically this is correct, because the file really was in that state at the moment it was read (e.g. copied).
EDIT
There's more to it.
If a writer process has an open file handle, the file will stay on disk as long as that process keeps the handle open.
If you remove the file (rm(1), unlink(2)), it is removed from its directory only. It stays on disk, and that writer (and anyone else who happens to have an open file handle) can still read the contents of the already removed file. Only after the last process closes its file handle are the file's contents freed on disk.
This is sometimes an issue when a process writes a large log file that is filling up the disk. If it keeps an open file handle to the log file, the system administrator cannot free this disk capacity using rm.
A typical approach then is to kill the process as well. Hence it is a good idea for a process to close the log file's handle after writing to the log (or at least to close and reopen it from time to time).
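A small sketch of that behaviour (the file name is illustrative): the directory entry disappears immediately, but the process can keep reading and writing through its descriptor, and the blocks are only freed on close().

    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main(void)
    {
        /* "scratch.log" is just an example name. */
        int fd = open("scratch.log", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Remove the directory entry.  The inode and its data blocks
           survive because we still hold an open descriptor. */
        unlink("scratch.log");

        const char msg[] = "still writable after unlink\n";
        write(fd, msg, sizeof msg - 1);

        /* We can still read our own data back ... */
        char buf[64];
        lseek(fd, 0, SEEK_SET);
        ssize_t n = read(fd, buf, sizeof buf - 1);
        if (n > 0) { buf[n] = '\0'; fputs(buf, stdout); }

        /* ... and only this close() lets the kernel free the blocks. */
        close(fd);
        return 0;
    }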
There's more:
If a process has an open file handle on a log file, this file handle contains a position. If the log file is now emptied (truncate(1), truncate(2), open(2) for writing without the append flag, : > filepath), the file's contents are indeed removed from the disk. If the process holding the open file handle then writes to the file, it writes at its old position, e.g. at an offset of several megabytes. Doing this to an empty file fills the gap with zeros.
This is not a real problem if the file system can create a sparse file (typically possible on Unix file systems); otherwise it will quickly fill the disk again. In any case, it can be very confusing.
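A sketch of that effect in one program (the file name is illustrative); after the truncation, the single small write leaves a multi-megabyte hole of zeros in front of it, which shows up as a large apparent size with few allocated blocks:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/stat.h>

    int main(void)
    {
        /* "app.log" is just an example name; note: no O_APPEND. */
        int fd = open("app.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Write a few MB so the descriptor's offset moves forward. */
        static char chunk[1024 * 1024];
        memset(chunk, 'x', sizeof chunk);
        for (int i = 0; i < 4; i++)
            write(fd, chunk, sizeof chunk);

        /* Someone empties the file, e.g. ": > app.log" or truncate(1). */
        truncate("app.log", 0);

        /* Our offset is still at ~4 MB, so this write leaves a 4 MB
           hole of zeros in front of it (sparse on most Unix FSes). */
        const char msg[] = "written after truncation\n";
        write(fd, msg, sizeof msg - 1);

        struct stat st;
        fstat(fd, &st);
        printf("apparent size: %lld bytes, 512-byte blocks allocated: %lld\n",
               (long long)st.st_size, (long long)st.st_blocks);

        close(fd);
        return 0;
    }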

Determine the offset and the size of another process's write

I'm working on a backup service. It tracks changes to the files in the directory being backed up. It does this by setting a watch (using inotify on Linux) and comparing the modification time and size after a file has changed. When a file has changed, the whole file is copied to the backup. I'm wondering whether this could be done much more efficiently: if the backup service could determine the offset and the number of bytes written, it could copy just that range instead of the whole file. I've been looking at fanotify, which offers some interesting features, such as an fd for the file modified (by the other process). But here it stops, I think: as far as I can see, there is no way for the process using fanotify to determine from the fd how the file was changed.
Am I overlooking something, or is it not possible to get this information?

Modifying a file that is being used as an output redirection by another program

If I have a file that some output is redirected to, what will happen if I modify that file from another program? Will both changes be recorded in the file?
To illustrate:
Terminal 1 (a file is used to store the output of a program, using either tee or the redirection operator >):
$ ./program | tee output.log
Terminal 2 (at the same time, the log file is being modified by another program, e.g. vim):
$ vim output.log
It depends on the program and the system calls they make.
vim, for example, will not write to the file until you issue the ":w" or ":x" command. It will then detect that the file has changed on disk and make you confirm overwriting it.
If the program does open(2) on the file with the O_APPEND flag, before each write(2) the file offset is positioned at the end of the file, as if with lseek(2).
So if you have two commands that open the file for appending (e.g. tee -a), they will take turns appending.
However, with NFS you still may get corrupted files if more than one process appends data to a file at once, because NFS doesn't support appending to a file and the kernel has to simulate it.
The effect of two or more processes modifying the data of the same file (the same inode, in tech lingo) is undefined. This is a classic race condition: the result depends on the particular order in which the writing processes are scheduled.
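A minimal sketch of the O_APPEND behaviour described above (the file name matches the example; the messages are made up): each instance opens the log with O_APPEND, so every write(2) is atomically positioned at the current end of the file, and concurrent writers take turns appending whole lines rather than overwriting each other.

    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main(void)
    {
        int fd = open("output.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char line[64];
        for (int i = 0; i < 5; i++) {
            int len = snprintf(line, sizeof line, "writer %d, line %d\n",
                               (int)getpid(), i);
            /* With O_APPEND the kernel moves the offset to EOF and writes
               in one step, so concurrent writers do not clobber each other
               (their lines may still interleave, of course). */
            write(fd, line, len);
            sleep(1);
        }

        close(fd);
        return 0;
    }

Run two instances at the same time and the lines from both PIDs end up interleaved but intact (except, as noted above, on NFS).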

How can we create 'special' files, like /dev/random, in linux?

In the Linux file system there are files such as /dev/zero and /dev/random which are not real files on the hard disk.
Is there any way we can create a similar file and tell it to get its output from executing a program?
For example, can I create a file, say /tmp/tarfile, such that any program reading it actually gets the output of the execution of a different program (/usr/bin/tar ...)?
It is possible to create such a file/program, but it would require creation of a special filesystem in order to insert hooks into the VFS so that accesses can be detected and handled properly.
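The answer above only names the general mechanism (a special filesystem hooking into the VFS). One common way to do that from user space, which the thread itself does not mention, is FUSE. The sketch below exposes a single read-only virtual file; a real implementation would produce the data in the read handler by running /usr/bin/tar instead of returning a fixed string. All names here are illustrative.

    /* Build (assuming libfuse3 is installed):
         gcc virtfile.c $(pkg-config fuse3 --cflags --libs) -o virtfile
       Run:  ./virtfile /some/mountpoint
       Then /some/mountpoint/tarfile can be read like an ordinary file. */
    #define FUSE_USE_VERSION 31
    #include <fuse.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/stat.h>

    static const char *virt_name = "tarfile";             /* illustrative name   */
    static const char *virt_data = "pretend tar bytes\n"; /* placeholder content */

    static int vf_getattr(const char *path, struct stat *st,
                          struct fuse_file_info *fi)
    {
        (void)fi;
        memset(st, 0, sizeof *st);
        if (strcmp(path, "/") == 0) {
            st->st_mode = S_IFDIR | 0755;
            st->st_nlink = 2;
            return 0;
        }
        if (strcmp(path + 1, virt_name) == 0) {
            st->st_mode = S_IFREG | 0444;
            st->st_nlink = 1;
            st->st_size = strlen(virt_data);
            return 0;
        }
        return -ENOENT;
    }

    static int vf_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                          off_t off, struct fuse_file_info *fi,
                          enum fuse_readdir_flags flags)
    {
        (void)off; (void)fi; (void)flags;
        if (strcmp(path, "/") != 0)
            return -ENOENT;
        filler(buf, ".", NULL, 0, 0);
        filler(buf, "..", NULL, 0, 0);
        filler(buf, virt_name, NULL, 0, 0);
        return 0;
    }

    static int vf_open(const char *path, struct fuse_file_info *fi)
    {
        if (strcmp(path + 1, virt_name) != 0)
            return -ENOENT;
        if ((fi->flags & O_ACCMODE) != O_RDONLY)
            return -EACCES;
        return 0;
    }

    static int vf_read(const char *path, char *buf, size_t size, off_t off,
                       struct fuse_file_info *fi)
    {
        (void)fi;
        if (strcmp(path + 1, virt_name) != 0)
            return -ENOENT;
        /* A real version would generate data here, e.g. from /usr/bin/tar. */
        size_t len = strlen(virt_data);
        if ((size_t)off >= len)
            return 0;
        if (size > len - (size_t)off)
            size = len - (size_t)off;
        memcpy(buf, virt_data + off, size);
        return (int)size;
    }

    static const struct fuse_operations vf_ops = {
        .getattr = vf_getattr,
        .readdir = vf_readdir,
        .open    = vf_open,
        .read    = vf_read,
    };

    int main(int argc, char *argv[])
    {
        return fuse_main(argc, argv, &vf_ops, NULL);
    }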

Forcing a program to flush file contents to disk

I have to debug a program that writes a log file. It takes time for the actual log file to be generated because it takes a while for the contents to be flushed to disk. On top of that, the log file is on a Unix drive mounted on my Windows machine. I was wondering whether there is a command to make the operating system flush the written contents to disk. Does it also take a while for the file to be updated on the mounted drive in Windows?
P.S. I actually don't want to go in and edit the program.
Ted
Look at the following APIs:
http://www.cplusplus.com/reference/clibrary/cstdio/setvbuf/
fsync
Also see the ever-great "Eat My Data: How Everybody Gets File IO Wrong".
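A minimal sketch of how those two calls fit together, assuming you could touch the code (the file name is illustrative): fflush() empties stdio's user-space buffer into the kernel, and fsync() asks the kernel to push the file's cached data to stable storage.

    #include <stdio.h>
    #include <unistd.h>

    /* Flush one stdio stream all the way to the disk. */
    static int flush_to_disk(FILE *fp)
    {
        if (fflush(fp) != 0)        /* user-space buffer -> kernel */
            return -1;
        if (fsync(fileno(fp)) != 0) /* kernel page cache -> disk   */
            return -1;
        return 0;
    }

    int main(void)
    {
        /* "debug.log" is just an example name. */
        FILE *fp = fopen("debug.log", "a");
        if (!fp) { perror("fopen"); return 1; }

        fprintf(fp, "an important log line\n");
        if (flush_to_disk(fp) != 0)
            perror("flush_to_disk");

        fclose(fp);
        return 0;
    }

Since you don't want to edit the program: the sync(1) command makes the kernel write out its dirty pages for all filesystems, but it cannot flush data that is still sitting in the program's own stdio buffer, so it only helps with the second half of the problem.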
