Why does inotify error when watching one directory but not another? - linux

A program that uses inotify with IN_CREATE to watch a directory for file creation fails on some directories but works on others. For example, it works on /home/randomtroll/testdir but fails on /home/randomtroll; both have the same owner and permissions. When it fails, the read returns EINVAL. The inotify descriptor and watch are created successfully; the buffer into which the program reads is properly aligned and large enough to accommodate the data read.

The buffer into which I read from the inotify descriptor was too small. I had made it just large enough to accommodate the names of the files I was looking for; when a file with a longer name was created, the read failed. You cannot limit how many bytes you read from an inotify descriptor: if the buffer cannot hold the next complete event, read() fails with EINVAL instead of returning a partial event. This seems like a bug to me.
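For illustration, here is a minimal sketch of reading the descriptor with a worst-case-sized buffer (Python with ctypes calling the libc inotify functions; the watched path is the one from the question, and error handling is omitted). Each event needs at most sizeof(struct inotify_event) + NAME_MAX + 1 bytes, so a buffer sized that way can never trigger EINVAL, however long the created file's name is:

import ctypes, ctypes.util, os, struct

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

IN_CREATE = 0x00000100
NAME_MAX = 255
EVENT_HDR = struct.calcsize("iIII")        # struct inotify_event header: wd, mask, cookie, len
BUF_LEN = 64 * (EVENT_HDR + NAME_MAX + 1)  # room for many worst-case events

fd = libc.inotify_init1(0)
wd = libc.inotify_add_watch(fd, b"/home/randomtroll", IN_CREATE)

while True:
    data = os.read(fd, BUF_LEN)            # fails with EINVAL only if BUF_LEN cannot hold the next event
    pos = 0
    while pos < len(data):
        wd, mask, cookie, length = struct.unpack_from("iIII", data, pos)
        name = data[pos + EVENT_HDR : pos + EVENT_HDR + length].split(b"\0", 1)[0]
        print("created:", name.decode())
        pos += EVENT_HDR + length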

Related

When writing to a newly created file, can I create the directory entry only after writing is completed?

I'm writing a file that takes minutes to write. External software monitors for this file to appear, but unfortunately it doesn't watch for inotify IN_CLOSE_WRITE events; rather, it checks periodically whether the file is there and then starts to process it, which will fail if the file is incomplete. I cannot fix the external software. A workaround I've been using so far is to write a temporary file and then rename it when it's finished, but this workaround complicates my workflow for reasons beyond the scope of this question¹.
Files are not directory entries. Using hardlinks, there can be multiple pointers to the same file. When I open a file for writing, both the inode and the directory entry are created immediately. Can I prevent this? Can I postpone the creation of the directory entry until the file is closed, rather than when the file is opened for writing?
Example Python-code, but the question is not specific to Python:
fp = open(dest, 'w') # currently both inode and directory entry are created here
fp.write(...)
fp.write(...)
fp.write(...)
fp.close() # I would like to create the directory entry only here
Reading everything into memory and then writing it all in one go is not a good solution, because writing will still take time and the file might not fit into memory.
I found the related question Is it possible to create an unlinked file on a selected filesystem?, but I would want to first create an anonymous/unnamed file and then name it when I'm done writing (I agree with the answer there that creating an inode is unavoidable, but that's fine; I just want to postpone naming it).
Tagging this as linux, because I suspect the answer might be different between Linux and Windows and I only need a solution on Linux.
¹Many files are produced in parallel within dask graphs, and injecting a "move as soon as finished" task in our system would be complicated, so we're really renaming 50 files when 50 files have been written, which causes delays.
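One way to get the postponed naming the question asks for (not part of the original post, and Linux-specific: it assumes a kernel with O_TMPFILE support and a filesystem such as ext4, XFS or tmpfs that implements it) is to open an unnamed file in the destination directory and only link it into place once writing is finished. A sketch with hypothetical paths:

import ctypes, ctypes.util, os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
AT_FDCWD = -100
AT_SYMLINK_FOLLOW = 0x400

dest = "/data/out/result.xml"        # hypothetical final name
dirpath = os.path.dirname(dest)

# O_TMPFILE creates an inode on dirpath's filesystem with no directory entry.
fd = os.open(dirpath, os.O_TMPFILE | os.O_WRONLY, 0o644)
with os.fdopen(fd, "w", closefd=False) as fp:
    fp.write("...")                  # the slow writes happen on the anonymous file

# Only now create the directory entry, by linking the still-open fd into place.
if libc.linkat(AT_FDCWD, f"/proc/self/fd/{fd}".encode(),
               AT_FDCWD, dest.encode(), AT_SYMLINK_FOLLOW) != 0:
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))
os.close(fd)

Until the linkat() call the file has an inode but no name, so the monitoring software cannot see a half-written file; if the process dies before the link is made, the inode is simply reclaimed.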

How to read a [nonblocking] file descriptor of a file that is appended to (aka, like tail -f)?

Actually, I am using libev; but under the hood this is using epoll (I'm only on Linux). When I add a watcher to read a file and all data has been read, I still get a callback saying there is data to read, but read(2) returns 0 (EOF). At that point I have to stop the watcher or else it will continue to tell me that there is something to read. However, if I stop the watcher and some other process then appends data to that file, I'll never see it.
What is the correct way to get notified that there is additional/appended data in a file that can be read, when I have already read to the end?
I'd prefer the answer in terms of libev, but lower level will do too (I can then probably translate that to how to do that with libev).
It is very common, for some reason, for people to think that making an fd nonblocking, or calling poll/select/..., behaves differently for files than for other types of file descriptors, but nonblocking behaviour and I/O-readiness behaviour are essentially the same for all types of file descriptors: the kernel will return immediately from read/write etc. if the outcome is known, and will signal I/O readiness when that is the case. When a socket has an EOF condition, select will signal that the socket is ready to read, and you will get 0 (for EOF). The same happens for files: if you are at the end of a file, the kernel will return immediately from read and return 0 to signal EOF.
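A quick way to see this behaviour (a small Python demonstration, not from the original answer): select() reports a regular file as readable even at EOF, and read() returns immediately with no data:

import os, select, tempfile

fd, path = tempfile.mkstemp()          # an empty regular file
readable, _, _ = select.select([fd], [], [], 0)
print(readable == [fd])                # True: a regular file is always "ready"
print(os.read(fd, 4096))               # b'': read() returns at once, 0 bytes means EOF
os.close(fd)
os.remove(path)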
The important difference is that files can change contents at random places, and can be extended. Pipes and sockets are not random access and cannot be appended to once closed. Thus, while the behaviour is consistent, it is often not what is wanted, namely waiting for a file to change in some way.
The conflict in many people's minds is simply that they want to be told "when there is new data", but if you think about it a bit, you will realise that simply waking you up would not be an adequate interface for this, as you have no way of knowing why you woke up, and what changed.
POSIX doesn't have an interface to do that, other than regularly polling the fd or file (and in the case of random changes, regularly reading the whole file!). Some operating systems have interfaces that do something similar (kqueue on the BSDs, inotify on GNU/Linux), but they are usually not a perfect match either (for example, inotify cannot watch an fd for changes; it watches a path for changes).
The closest you can get with libev is to use an ev_stat watcher. It behaves as if you would stat() a path regularly, invoking the watcher callback whenever the stat data changes. Portably, it does just that: it regularly calls stat. On some operating systems (currently only via inotify on GNU/Linux, as kqueue doesn't have the right semantics for this) it can use other mechanisms to speed this up in some cases, although it will fall back to regular stat polling everywhere, for example when the file is on a network file system, where inotify can't see remote changes.
To answer your question: If you have a path, you can use an ev_stat watcher to watch for stat data changes, such as size/mtime etc. changes. Doing this right can be a bit tricky (see the libev documentation, especially the part about stat time resolution: http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod#code_ev_stat_code_did_the_file_attri), and you have to keep in mind that this watches a path, not a file descriptor, so you might want to compare the device/inode of your file descriptor and the watched path regularly to see if you still have the correct file open.
This still doesn't tell you what part of the file has changed.
Alternatively, since you apparently only want to read appended data, you could opt to just read() the file regularly (in an ev_timer callback) and do away with all the complexity and hassle of an ev_stat watcher setup (while not forgetting to also compare the path's stat data with your fd's stat data to see if you still have the right file open, depending on whether the file you are reading might get renamed or replaced; sometimes programs also truncate files, something you can detect by seeing the size decrease between stat calls).
This is essentially what older tail -f implementations do, while newer ones might, for example, take hints (only) from inotify, just like ev_stat watchers do.
None of that is easy, and details depend on your knowledge of how exactly the file changes, but it's the best you can do.
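As a concrete illustration of that last approach (a Python sketch, since the idea is not libev-specific; the polling interval and the rotation handling are assumptions): read appended data regularly, and compare the path's stat data with the fd's to notice renames, replacements and truncation:

import os, time

def follow(path, interval=1.0):
    """Yield data appended to path, reopening it if it is rotated or truncated."""
    f = open(path, "rb")
    f.seek(0, os.SEEK_END)             # start at the current end, like tail -f
    while True:
        chunk = f.read()               # at EOF this just returns b"" immediately
        if chunk:
            yield chunk
            continue
        try:
            path_stat = os.stat(path)
        except FileNotFoundError:
            path_stat = None
        fd_stat = os.fstat(f.fileno())
        if path_stat is None or (path_stat.st_dev, path_stat.st_ino) != (fd_stat.st_dev, fd_stat.st_ino):
            f.close()                  # the name now points at a different file: reopen
            f = open(path, "rb")
        elif fd_stat.st_size < f.tell():
            f.seek(0)                  # the file was truncated: start over
        time.sleep(interval)

# usage (hypothetical path):
# for data in follow("/var/log/app.log"):
#     print(data.decode(errors="replace"), end="")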

Linux add listener to the log file

Is there a better way to monitor log files for changes than using inotify (http://linux.die.net/man/7/inotify)? I have several programs that write to different log files, and I want to send a POST request every time a new line is added to a log.
Currently, my plan is to set up inotify to listen for file changes, read the data that has been added since the last visit, and do the POST.
Things that are important:
Reaction to an event within 1 second at most.
Low CPU and HDD consumption.
Keeping the log file (i.e. I want it to stay on the machine, complete and unmodified).
New lines are added about once a minute.
Thanks for ideas.
Inotify is fine for getting notified about file events such as writes, but how will you know how much data has been appended? If you know the log files in advance, you might as well read until end of file and just sleep for a short while before trying to read again (much like "tail -f" does). That way you still have a pointer to where the newly written data starts. You could even combine that with inotify to know when to pick up reading. If you want to use only inotify, you will probably have to store the last read position somewhere.
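A minimal sketch of that suggestion (Python; the file path, the endpoint and the 1-second sleep are assumptions, the remembered offset plays the role of the stored "last read position", and log rotation is not handled):

import time
import urllib.request

LOG = "/var/log/app.log"                      # hypothetical log file
ENDPOINT = "http://example.invalid/ingest"    # hypothetical POST target

def post_line(line):
    req = urllib.request.Request(ENDPOINT, data=line, method="POST")
    urllib.request.urlopen(req)

offset = 0
while True:
    with open(LOG, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):      # EOF, or a line still being written
                break
            post_line(line)
            offset = f.tell()                 # remember where the next read starts
    time.sleep(1)                             # comfortably within the 1-second reaction target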

Syncing a file system that has no file on it

Say I want to synchronize the data buffers of a file system to disk (in my case that of a USB stick partition) on a Linux box.
While searching for a function to do that I found the following:
DESCRIPTION
sync() causes all buffered modifications to file metadata and data to be written to the underlying file systems.
syncfs(int fd) is like sync(), but synchronizes just the file system containing the file referred to by the open file descriptor fd.
But what if the file system has no file on it that I can open and pass to syncfs? Can I "abuse" the dot file? Does it appear on all file systems?
Is there another function that does what I want? Perhaps by providing a device file with major / minor numbers or some such?
Yes, I think you can do that. The root directory of your file system always has an inode, so you can open its "." entry and pass that descriptor to syncfs. Also play around with ls -i to see the inode numbers.
Could you avoid your problem by mounting your file system with the sync option, or would performance issues get in the way? Did you have a look at remounting? In particular cases that can sync your file system as well.
I do not know what your application is, but I have had problems synchronizing files to a USB stick using the FAT32 file system; it resulted in weird read and write errors. I cannot imagine many other valid reasons to sync an empty file system.
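A minimal sketch of the first suggestion (Python; the mount point is hypothetical, and syncfs() is called through ctypes since it needs Linux 2.6.39+ and glibc 2.14+): open the file system's root directory, which always exists even on an otherwise empty file system, and pass that descriptor to syncfs:

import ctypes, ctypes.util, os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

mountpoint = "/mnt/usbstick"                  # hypothetical mount point of the partition

# The root directory of a mounted file system can always be opened,
# even when the file system contains no regular files.
fd = os.open(mountpoint, os.O_RDONLY | os.O_DIRECTORY)
try:
    if libc.syncfs(fd) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
finally:
    os.close(fd)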
From the man 8 sync description:
"sync writes any data buffered in memory out to disk. This can include (but is not limited to) modified superblocks, modified inodes, and delayed reads and writes. This must be implemented by the kernel; the sync program does nothing but exercise the sync(2) system call."
So note that it is all about modifications (modified inodes, superblocks, etc.). If you don't have any modifications, there is nothing to sync.

how does kernel handle new file creation

I wish to understand how the kernel works when a user/app tries to create a file in a directory.
The background: we have a Java application which consumes messages over JMS, processes them, and then writes the XML to an outbound queue plus a local directory. Yesterday we observed unusual delays in writing to the directory. With 'ls|wc -l' we found more than 300,000 files in there. A quick strace on the process showed it was full of mutex calls (more than 3/4 of the calls in the strace were mutex-related).
So I thought that new file creation was taking time because the system has to check certain things every time (e.g. file names, to make sure a new file with a specific name can be created) amongst 300,000 files before it can create a file.
I cleared the directory and the application returned to normal service levels.
My questions
Was my analysis correct (it seems so, because the app started working fine after the clear-down)?
More important, how does the kernel work when you try to create a new file in a directory?
Can the abnormal number of mutex calls be attributed to the high number of files in the directory?
Many thanks
J
Please read about the Linux file system, inodes and directory entries.
http://en.wikipedia.org/wiki/Inode_pointer_structure
The file system is organized into fixed-sized blocks. If your directory is relatively small, it fits in the direct blocks and things are fast. If your directory is not too big, it fits in the direct blocks and some indirect blocks, and is still reasonably fast. If your directory becomes too big, it spills into double indirect blocks and becomes slow.
Actual sizes depend on file system and kernel configuration.
A rule of thumb is to keep the directory under 12 blocks, depending on your block size. Many systems use 8K blocks, so a fast directory is under 98,304 bytes.
A file entry is something like 16*4 = 64 bytes in size (IIRC), so plan on no more than about 1,500 files per directory (98,304 / 64 ≈ 1,536) as a practical upper limit.
Directories with large numbers of entries are often slow - how slow depends on the underlying filesystem.
The common solution is to create a hierarchy of directories, so each dir only has a few hundred entries.
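For example, a common way to build such a hierarchy (a sketch; the base directory and file names are hypothetical) is to hash each file name and use a couple of characters of the digest as subdirectory names, so the entries spread out evenly:

import hashlib, os

def shard_path(base, name, levels=2, width=2):
    """Return a path for name under base, spread across hashed subdirectories."""
    digest = hashlib.sha1(name.encode()).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    directory = os.path.join(base, *parts)
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, name)

# "message-000123.xml" ends up somewhere like /data/outbox/3a/7f/message-000123.xml,
# so each leaf directory holds only a small fraction of the 300,000 files.
path = shard_path("/data/outbox", "message-000123.xml")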
Mutex system calls are a result of the application (probably something in the JVM or the Java libraries) making mutex calls.
Synchronisation internal to the kernel you will not see via strace, as this only examines system calls themselves.
A directory with lots of files should not become inefficient if you are using a filesystem which uses directory indexes; most now do (ext3 does so optionally, but it is normally enabled nowadays).
Non-indexed directories (like those used on the bad old filesystems - ext2, vfat etc) get really bad with lots of files, and you'll see the "open" system call taking a lot longer.
