My program does something like:
Open a file (append mode)
Write some stuff
Close
[repeat]
The file is usually different each time, but fairly often the same file comes up again, either on consecutive iterations or only a few iterations apart.
Is there any chance the kernel can play tricks on me, so that opening the file leaves the position somewhere other than the end of the file? Say the previous write is not yet complete (it is still buffered somewhere in the kernel), and opening the file again makes the fd point to a position that is not the real end of the file. That could result in overlapping writes.
My program is single-threaded, so I see no reason why this would happen, but I do not fully understand the kernel's guarantees when it comes to this.
Thanks!
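For what it's worth, here is a minimal sketch of the open/append/close loop described above, written in Python purely for illustration (the function name and file name are made up). It assumes a local POSIX filesystem, where a file opened with O_APPEND has every write positioned at the current end of the file, and where data accepted by write() counts toward the file size even while it still sits in the kernel's cache.

```python
import os
import tempfile


def append_chunks(path, chunks):
    """Open in append mode, write one chunk, close; repeat for every chunk."""
    for chunk in chunks:
        # Mode "ab" opens the file with O_APPEND on POSIX systems.
        with open(path, "ab") as f:
            f.write(chunk)
        # No fsync here: the data may not be on disk yet, but the kernel
        # already treats it as part of the file.


if __name__ == "__main__":
    # Hypothetical file name, created in a fresh temporary directory.
    path = os.path.join(tempfile.mkdtemp(), "demo.log")
    append_chunks(path, [b"first;", b"second;", b"third;"])
    with open(path, "rb") as f:
        print(f.read())
```

Reopening the same file on consecutive iterations is the interesting case here; the chunks should come out back to back, in the order they were written.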
Related
I have a "producer" executable (that I can run, but don't control the source for) that continually writes to a growing output file. (Like a log file -- it happens to be binary, but I think of it like a log, just without the nice line breaks).
And I have another "consumer" process that continually reads from that file in fairly big (10 MB) chunks.
The good news:
Once the consumer reads a chunk from the file, it is ok to discard that part of the file, we are done with it forever.
I can keep the consumer from going too fast and catching up.
I'm confident the consumer is fast enough to not fall too far behind the producer.
The bad news: if I let it run, eventually the file will get huge enough to fill up the disk (many GBs) and I have to kill off both processes, erase the file, and start everything over. :(
I'd like to have to do that restarting less often! Is there a way to put this on a "treadmill," or ring-buffer, or something similar, so that I don't need a huge amount of disk space for the full file? The only part I actually need to keep around is the roughly 100 MB buffer between the producer and consumer. I'd even be ok with some bridge process in between them, or some pipe magic, or virtual filesystems, or ???
As I keep discovering, there are many kinds of file descriptors. Almost everything is abstracted behind a file descriptor: regular files, sockets, signals and timers (for example). All file descriptors are merely integers.
Given a file descriptor, is it possible to know what type it is? For example, it would be nice to have a system call such as getFdType(fd).
If an epoll_wait is awakened due to multiple file descriptors getting ready, the processing of each file descriptor will be based upon its type. That is the reason I need the type.
Of course, I can maintain this info separately myself but it would be more convenient to have the system support it.
Also, are all file descriptors, irrespective of the type, sequential? I mean, if you open a regular data file, then create a timer file descriptor, then a signal file descriptor, are they all guaranteed to be numbered sequentially?
As "that other guy" mentioned, the most obvious such call is fstat. The st_mode member contains bits to distinguish between regular files, devices, sockets, pipes, etc.
But in practice, you will almost certainly need to keep track yourself of which fd is which. Knowing it's a regular file doesn't help too much when you have several different regular files open. So since you have to maintain this information somewhere in your code anyway, then referring back to that record would seem to be the most robust way to go.
(It's also going to be much faster to check some variables within your program than to make one or several additional system calls.)
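As a rough illustration of the fstat approach, here is a small sketch in Python, whose os.fstat and stat.S_IS* helpers wrap the same system call and st_mode checks. The function name fd_type and its labels are made up for this example; a descriptor that matches none of the checks simply falls through to "other".

```python
import os
import socket
import stat


def fd_type(fd):
    """Return a short label for the kind of object behind a file descriptor."""
    mode = os.fstat(fd).st_mode
    if stat.S_ISREG(mode):
        return "regular file"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISFIFO(mode):
        return "pipe/FIFO"
    if stat.S_ISCHR(mode):
        return "character device"
    if stat.S_ISBLK(mode):
        return "block device"
    return "other"


if __name__ == "__main__":
    read_end, write_end = os.pipe()
    left, right = socket.socketpair()
    print(fd_type(read_end), "/", fd_type(left.fileno()))
    os.close(read_end)
    os.close(write_end)
    left.close()
    right.close()
```

Note that this only gives the broad category. It cannot tell two regular files apart, which is why keeping your own record is still the more robust option.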
Also, are all file descriptors, irrespective of the type, sequential? I mean, if you open a regular data file, then create a timer file descriptor, then a signal file descriptor, are they all guaranteed to be numbered sequentially?
Not really.
As far as I know, calls that create a new fd will always return the lowest-numbered available fd. There are old programs that rely on this behavior; before dup2 existed, I believe the accepted way to move standard input to a new file was close(0); open("myfile", ...);.
However, it's hard to really be sure what fds are available. For example, the user may have run your program as /usr/bin/prog 5>/some/file/somewhere and then it will appear that fd 5 gets skipped, because /some/file/somewhere is already open on fd 5. As such, if you open a bunch of files in succession, you cannot really be sure that you will get sequential fds, unless you have just closed all those fds yourself and are sure that all lower-numbered fds are already in use. And doing that seems much more of a hassle (and a source of potential problems) than just keeping track of the fds in the first place.
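The "lowest-numbered available fd" behavior is easy to observe. A small sketch in Python (the function name is made up; it assumes a single-threaded program, so nothing else opens or closes descriptors in between):

```python
import os


def reopen_after_close():
    """Open two fds, close the first, open a third; return all three numbers."""
    first = os.open(os.devnull, os.O_RDONLY)
    second = os.open(os.devnull, os.O_RDONLY)
    os.close(first)
    # The slot freed by closing `first` is now the lowest available number,
    # so this open should reuse it rather than continue after `second`.
    third = os.open(os.devnull, os.O_RDONLY)
    os.close(second)
    os.close(third)
    return first, second, third


if __name__ == "__main__":
    print(reopen_after_close())
```

The actual numbers depend on which descriptors the process inherited, which is exactly why relying on them being sequential is fragile.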
Consider this pseudo-code and an ext4 file system:
f = open("/tmp/new_file", "w")
write(f, "Test")
close(f)
In another process, I try to open /tmp/new_file immediately afterwards.
Questions
1. Can the other process open the file?
2. What content does the other process see? Is it Test?
Expectations
I expect (1) to be true (the metadata is probably synchronized between processes) but (2) to be false (data might be buffered)
More questions
How can I ensure that my file changes are visible to other processes? flush seems to work but it is bad for performance because it forces a write-to-disk. Is there something like soft-flush that makes the changes visible to other processes without flushing it to disk?
Is it guaranteed that the other process can see the file?
No, it's not guaranteed.
A third process can delete the file, even while you have it open.
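On the question of whether the written data is visible, here is a hedged sketch in Python, assuming a local POSIX filesystem such as ext4. In Python, file.flush() only hands the buffered bytes to the kernel via write(); forcing them to disk is a separate os.fsync() call, which this sketch deliberately omits. The function names are made up for the example.

```python
import os
import subprocess
import sys
import tempfile


def read_in_other_process(path):
    """Read a file's bytes from a separate Python process."""
    code = "import sys; sys.stdout.buffer.write(open(sys.argv[1], 'rb').read())"
    result = subprocess.run(
        [sys.executable, "-c", code, path], check=True, capture_output=True
    )
    return result.stdout


def write_then_read_elsewhere(path, data):
    """Write and flush (no fsync), then read back from another process."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # user-space buffer -> kernel; no os.fsync() here
        return read_in_other_process(path)


if __name__ == "__main__":
    # Hypothetical file name in a fresh temporary directory.
    path = os.path.join(tempfile.mkdtemp(), "new_file")
    print(write_then_read_elsewhere(path, b"Test"))
```

The other process reads through the same kernel cache, so it does not need the data to have reached the disk first.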
Could someone please explain the proper way to use the appendto function?
I am trying to use it to write debug text to a file. I want it written immediately when I call the function, but for some reason the program waits until it exits, and then writes everything at once.
Am I using the right function? Do I need to open, then write, then close the file each time I write to it instead?
Thanks.
Looks like you are having an issue with buffering (this is also a common question in other languages, btw). The data you want to write to the file is being held in a memory buffer and is only written to disk at a later time (this is done to batch writes to disk together, for better performance).
One possibility is to open and close the file each time, as you already suggested. Closing a file handle will flush the contents of the buffer to disk.
A second possibility is to use the flush function to explicitly request that the data be written to disk. In Lua 4.0.1, you can either call flush passing a file handle:
-- If you opened the file yourself with openfile:
local myfile = openfile("myfile.txt", "a")
write(myfile, "debug text\n")
flush(myfile)
-- If you used appendto, the output file handle is in the _OUTPUT global variable
appendto("myfile.txt")
write("debug text\n")
flush(_OUTPUT)
or you can call flush with no arguments, in which case it will flush all the files you have currently open.
flush()
For details, see the reference manual: http://www.lua.org/manual/4.0/manual.html#6.
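The buffering effect itself is not specific to Lua. As an illustration only (this is Python, not Lua 4.0, and the function name is made up), the sketch below writes a short line to a file and checks the file's size on disk before and after flushing. It assumes the text is small enough to stay in Python's default buffer.

```python
import os
import tempfile


def sizes_around_flush(path, text):
    """Return the file size seen just before and just after flush()."""
    with open(path, "a") as f:
        f.write(text)
        before = os.path.getsize(path)  # text is still in the program's buffer
        f.flush()
        after = os.path.getsize(path)   # text has been handed to the OS
    return before, after


if __name__ == "__main__":
    # Hypothetical file name in a fresh temporary directory.
    path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
    print(sizes_around_flush(path, "debug text\n"))
```

Until the flush (or the close at the end of the with block), nothing shows up in the file, which matches the symptom of everything appearing only when the program exits.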
Probably a dumb question, but if the program is asynchronously writing to a file, and you access that file while it's still writing, are the contents messed up?
In fact, it does not matter whether you access the file synchronously or asynchronously: if some other process (yours or someone else's) modifies the file while you are in the middle of reading it, you will get inconsistent results.
The exact kind of inconsistency you'll see depends on how the file is written and when reading starts.
In node's default mode (w), a file's existing contents are truncated when the file is opened.
An in-flight read will stop early (without erroring), meaning you'll only get part of the original file.
A read started after the write begins will read up to the last written byte. Depending on how far along and fast the write is, and how you read the file, the read may or may not see the complete file.
If the file is written in r+ mode, the contents are not truncated when the file is opened for writing. This means a read will see part of the old data and part of the new data. Things are further muddied if the write changes the file size.
This is all true regardless of whether you use streams (i.e. createReadStream), readFile, or even readFileSync. Any part of the file on disk can be changed while node is in the process of buffering the file into memory. (The only notable exception here is if you use writeFileSync and then readFileSync in the same process, since the write call would prevent the read from starting until after the write is complete. However, this still doesn't prevent other processes from changing the file mid-read, and you shouldn't be using the sync methods anyway.)
In other words, reading and writing a file is non-atomic. To avoid inconsistency, you should write the file with a temporary name and then rename it when the write is complete.
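The write-then-rename pattern might look roughly like this. The sketch is in Python rather than node, and the function name write_atomically is made up; the same idea applies with node's fs.rename. It assumes the temporary file lives in the same directory as the target, so the rename stays on one filesystem.

```python
import os
import tempfile


def write_atomically(path, data):
    """Write data to a temporary file, then rename it over the target path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # optional: push the data to disk before renaming
        # Readers see either the old file or the new one, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


if __name__ == "__main__":
    # Hypothetical file name in a fresh temporary directory.
    target = os.path.join(tempfile.mkdtemp(), "data.json")
    write_atomically(target, b'{"version": 1}')
    write_atomically(target, b'{"version": 2}')
    with open(target, "rb") as f:
        print(f.read())
```

A reader that already has the old file open keeps reading the old contents to the end, since the rename replaces the directory entry rather than modifying the old file in place.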