How safe is it reading / copying a file which is being appended to? - linux

If a log file has events constantly being appended to it, how safe is it to read that file (or copy it) with another process?

Unix allows concurrent reading and writing. It is totally safe to read a file while others are appending to it.
Of course, it can happen that an append is still in progress when a reader reaches the end of the file; that reader will then get an incomplete version (e.g. only part of the newest log entry at the end of the file). But technically this is correct, because the file really was in that state at the moment it was read (or copied).
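To make this concrete, here is a minimal Python sketch (file names are made up for illustration): the copy is always a valid snapshot, and for a line-oriented log a missing trailing newline tells you that the last entry was caught mid-append.
import shutil

def snapshot_log(src="app.log", dst="app.log.copy"):   # hypothetical paths
    shutil.copyfile(src, dst)               # safe even while another process appends
    with open(dst, "rb") as f:
        tail = f.read()
    if tail and not tail.endswith(b"\n"):
        print("last log entry is incomplete (an append was in progress)")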
EDIT
There's more to it.
If a writer process has an open file handle, the file will stay on disk as long as this process keeps the open file handle.
If you remove the file (rm(1), unlink(2)), it will be removed from its directory only. It will stay on disk, and that writer (and everybody else who happens to have an open file handle) will still be able to read the contents of the already removed file. Only after the last process closes its file handle will the file's contents be freed on disk.
This is sometimes an issue when a process writes a large log file that is filling up the disk: as long as it keeps an open file handle on the log file, the system administrator cannot free that disk capacity using rm.
A typical approach then is to kill the process as well. Hence it is a good idea, as a process, to close the file handle of the log file after writing to it (or at least to close and reopen it from time to time).
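A minimal Python sketch of this behaviour (the file name is just illustrative): the name disappears immediately, but the data stays on disk until the last handle is closed.
import os

path = "example.log"                        # hypothetical file name
log = open(path, "w")
log.write("first entry\n")
log.flush()

os.unlink(path)                             # same effect as rm: only the directory entry is gone
print(os.path.exists(path))                 # False - the name no longer exists

log.write("still being written\n")          # the open handle keeps working
log.flush()
log.close()                                 # only now is the disk space actually freed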
There's more:
If a process has an open file handle on a log file, this file handle contains a position. If the log file is now emptied (truncate(1), truncate(2), open(2) for writing without the append flag, : > filepath), the file's contents are indeed removed from the disk. If the process holding the open file handle then writes to the file, it writes at its old position, e.g. at an offset of several megabytes. Doing this to an empty file fills the gap with zeros.
This is no real problem if a sparse file can be created (typically possible on Unix file systems); otherwise it will quickly fill the disk again. But in any case it can be very confusing.
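A small Python sketch makes the effect visible (file name and sizes are arbitrary): after truncation the writer's stale offset leaves a gap, and comparing the apparent size with the allocated blocks shows whether the file system stored that gap sparsely.
import os

path = "big.log"                            # hypothetical file name
log = open(path, "w")                       # plain write mode, no O_APPEND
log.write("x" * 1_000_000)                  # the writer's position is now at about 1 MB
log.flush()

os.truncate(path, 0)                        # like ": > big.log" - the contents are gone

log.write("next entry\n")                   # lands at the old ~1 MB position, leaving a gap
log.flush()
log.close()

st = os.stat(path)
print(st.st_size)                           # apparent size: about 1 MB plus the new entry
print(st.st_blocks * 512)                   # allocated size: much smaller if the gap is sparse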

Related

When writing to a newly created file, can I create the directory entry only after writing is completed?

I'm writing a file that takes minutes to write. External software monitors for this file to appear, but unfortunately doesn't monitor for inotify IN_CLOSE_WRITE events, but rather checks periodically "the file is there" and then starts to process it, which will fail if the file is incomplete. I cannot fix the external software. A workaround I've been using so far is to write a temporary file and then rename it when it's finished, but this workaround complicates my workflow for reasons beyond the scope of this question¹.
Files are not directory entries. Using hardlinks, there can be multiple pointers to the same file. When I open a file for writing, both the inode and the directory entry are created immediately. Can I prevent this? Can I postpone the creation of the directory entry until the file is closed, rather than when the file is opened for writing?
Example Python code, but the question is not specific to Python:
fp = open(dest, 'w') # currently both inode and directory entry are created here
fp.write(...)
fp.write(...)
fp.write(...)
fp.close() # I would like to create the directory entry only here
Reading everything into memory and then writing it all in one go is not a good solution, because writing will still take time and the file might not fit into memory.
I found the related question Is it possible to create an unlinked file on a selected filesystem?, but I would want to first create an anonymous/unnamed file, then naming it when I'm done writing (I agree with the answer there that creating an inode is unavoidable, but that's fine; I just want to postpone naming it).
Tagging this as linux, because I suspect the answer might be different between Linux and Windows and I only need a solution on Linux.
¹Many files are produced in parallel within dask graphs, and injecting a "move as soon as finished" task in our system would be complicated, so we're really renaming 50 files when 50 files have been written, which causes delays.
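For reference, a minimal Python sketch of the temporary-name-plus-rename workaround mentioned above (file names and data are made up): the final name only appears once the file is complete, because rename(2) within one filesystem is atomic.
import os

def write_then_rename(dest, chunks):
    tmp = dest + ".partial"                 # temporary name on the same filesystem
    with open(tmp, "w") as fp:
        for chunk in chunks:
            fp.write(chunk)
    os.rename(tmp, dest)                    # atomic: 'dest' either doesn't exist yet or is complete

write_then_rename("result.dat", ["part one\n", "part two\n"])
The external monitor then never sees a half-written result.dat, as long as it does not also match the temporary name.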

What is the OS-level handle of tempfile.mkstemp good for?

I use tempfile.mkstemp when I need to create files in a directory which might stay, but I don't care about the filename. It only has to be a name that doesn't exist yet and carries a given prefix and suffix.
One part of the documentation that I have ignored so far is:
mkstemp() returns a tuple containing an OS-level handle to an open file (as would be returned by os.open()) and the absolute pathname of that file, in that order.
What is the OS-level handle and how should one use it?
Background
I always used it like this:
import os
from tempfile import mkstemp

_, path = mkstemp(prefix=prefix, suffix=suffix, dir=dir)
with open(path, "w") as f:
    f.write(data)
# do something
os.remove(path)
It worked fine so far. However, today I wrote a small script which generates huge files and deletes them. The script aborted the execution with the message
OSError: [Errno 28] No space left on device
When I checked, there were 80 GB free.
My suspicion is that os.remove only "marked" the files for deletion, but the files were not properly removed. And the next suspicion was that I might need to close the OS-level handle before the OS can actually free that disk space.
Your suspicion is correct. The os.remove only removes the directory entry that contains the name of the file. However, the file data remains intact and continues to consume space on the disk until the last open descriptor on the file is closed. During that time normal operations on the file through existing descriptors continue to work, which means you could still use the _ descriptor to seek in, read from, or write to the file after os.remove has returned.
In fact it's common practice to immediately os.remove the file before moving on to using the descriptor to operate on the file contents. This prevents the file from being opened by any other process, and also means that the file won't be left hanging around if this program dies unexpectedly before reaching a later os.remove.
Of course that only works if you're willing and able to use the low-level descriptor for all of your operations on the file, or if you use the os.fdopen method to construct a file object on top of the descriptor and use that new object for all operations. Obviously you only want to do one of those things; mixing descriptor access and file-object access to the same underlying file can produce unexpected results.
os.fdopen(_) also avoids opening the file a second time by path. The file object it returns behaves like one from open, including as a context manager, so you can use it directly in a with construct.
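A minimal sketch of that pattern, assuming a scratch file that never needs a name after creation (prefix and suffix are made up):
import os
from tempfile import mkstemp

fd, path = mkstemp(prefix="job-", suffix=".tmp")
os.remove(path)                             # the name is gone; the data lives on through fd

with os.fdopen(fd, "w+") as f:              # wrap the descriptor in a normal file object
    f.write("scratch data\n")
    f.seek(0)
    print(f.read())
# closing the file object closes fd, and only then is the disk space freed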

will io redirection lock the file?

I have a growing nginx log file, already about 20 GB, and I wish to rotate it:
1. I mv the old log file to a new log file.
2. I run > old_log_file.log to truncate the old log file, which takes about 2-3 seconds.
Is there a lock (a write lock?) on the old log file while I am truncating it during those 2-3 seconds?
During that period, does nginx return 502 while waiting to append logs to the old log file until the lock is released?
Thank you for explaining.
On Linux, there are (almost) no mandatory file locks (more precisely, there used to be a mandatory locking feature in the kernel, but it is deprecated and you really should avoid using it). File locking happens with flock(2) or lockf(3); it is advisory and has to be explicit (e.g. with the flock(1) command, or some program calling flock or lockf).
So any locking related to files is in practice a convention between all the software using that file (and mv(1) or the redirection done by your shell do not use file locking).
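To make "advisory and explicit" concrete, here is a minimal Python sketch using flock(2); mv and the shell redirection from the question simply ignore such a lock, so it only coordinates processes that opt in.
import fcntl

with open("old_log_file.log", "a") as f:    # path taken from the question
    fcntl.flock(f, fcntl.LOCK_EX)           # block until we hold the exclusive advisory lock
    f.write("one complete record\n")
    fcntl.flock(f, fcntl.LOCK_UN)           # release it for other cooperating processes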
Remember that a file on Linux is mostly an i-node (see inode(7)) which can have zero, one or several file paths (see path_resolution(7) and be aware of link(2), rename(2), unlink(2)) and is used through some file descriptor. Read ALP (and perhaps Operating Systems: Three Easy Pieces) for more.
No file locking happens in the scenario of your question (and the i-nodes and file descriptors involved are independent).
Consider using logrotate(8).
Some software provides a way to reload its configuration and re-open its log files. You should read the documentation of your nginx.
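For nginx specifically, the usual rotation (and what a logrotate rule typically does) is to rename the log and then ask the master process to reopen its log files. A rough Python sketch; the pid-file path is an assumption, check it against your own configuration.
import os
import signal

os.rename("access.log", "access.log.1")     # the writer keeps appending to the renamed file
with open("/var/run/nginx.pid") as f:       # assumed pid-file location; verify in nginx.conf
    pid = int(f.read().strip())
os.kill(pid, signal.SIGUSR1)                # nginx reopens its log files without being stopped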
Whether the file is locked depends on the application. The application that generates this log file must itself offer a way to clear or reopen the log file. For example, in an editor like vim, a file can be modified externally while it is still open in the editor.

Does the Linux system lock the file when I copy it?

I have written a program that updates a file periodically. Sometimes I want to copy the file to another computer to check its content. If I copy the file when the program is not writing to it, there is no problem; but if I copy it while the program is writing, the copied file is partial. So I wonder whether Linux has a locking strategy to prevent this situation.
In fact, I copy the file from a bash script, so I want to check in that script whether the program is currently writing the file. If it is, the script should check again after a few seconds and then copy the completed version. So, in a bash script, how can I check whether the file is open or being modified by another program?
You could check from your script whether the file is being written to, and abort/pause copy if it is...
fuser -v /path/to/your/file | awk 'BEGIN{FS=""}$38=="F"{num++}END{print num}'
If the output is less than 1, you're good to copy :)
When your code writes into the file, it actually writes into an output buffer in memory. The buffer is flushed out to disk when it becomes full. Thus, if you copy the file while its buffer has not yet been flushed, you will observe a partial file.
You can modify the buffer size with a call to setvbuf. If you set the buffer size to zero, output is flushed as it is written. Another thing you can do is call fflush() to flush the output. Either of these will keep the file up to date as it is written.
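The same advice translated into Python, where the buffering argument plays the role of setvbuf and flush() plays the role of fflush() (the file name is made up):
log = open("updates.log", "a", buffering=1) # line-buffered: complete lines reach the kernel immediately
log.write("a complete record\n")
log.flush()                                 # or flush explicitly after each record
log.close()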

Is there a file mutex in Linux? How can I implement it?

In Windows, if I open a file with MS Word and then try to delete it, the system stops me: it prevents the file from being deleted.
Is there a similar mechanism in Linux?
How can I implement it when writing my own program?
There is no similar mechanism in Linux. I, in fact, find that feature of Windows to be an incredible misfeature and a big problem.
It is not typical for a program to hold open a file that it is working on anyway, unless the program is a database that updates the file as it works. Programs usually just open the file, write its contents, and close it when you save your document.
vim's .swp file is updated as vim works, and vim holds it open the whole time, so even if you delete it, the file doesn't really go away. vim will just lose its recovery ability if you delete the .swp file while it's running.
In Linux, if you delete a file while a process has it open, the system keeps it in existence until all references to it are gone. The name in the filesystem that refers to the file will be gone. But the file itself is still there on disk.
If the system crashes while the file is still open, it will be cleaned up and removed from the disk when the system comes back up.
The reason this is such a problem in Windows is that mandatory locking frequently prevents operations that should succeed from succeeding. For example, a backup process should be able to read a file that is being written to. It shouldn't have to stop the process that is doing the writing before the backup proceeds. In many other cases, operations that should be able to move forward are blocked for silly reasons.
The semantics of most Unix filesystems (such as Linux's ext2 fs family) is that a file can be unlink(2)'d at any time, even if it is open. However, after such a call, if the file has been opened by some other process, they can continue to read and write to the file through the open file descriptor. The filesystem does not actually free the storage until all open file descriptors have been closed. These are very long-standing semantics.
You may wish to read more about file locking in Unix and Linux (e.g., the Wikipedia article on File Locking.) Basically, mandatory and advisory locks on Linux exist but they're not guaranteed to prevent what you want to prevent.

Resources