Right way to handle this inotify race? - linux

I want to maintain a cache that mirrors a particular directory, so I add a watch whose events are monitored by thread A and then tell thread B to scan that directory and put the filenames into my cache. I have separate threads because I want the application to still be responsive to incoming inotify events during the scan. Otherwise, I could lose events because I wasn't reading them and the inotify queue filled up during the scan.
It is entirely possible that a delete or move_from event for a file will be processed before the file was added to my cache by the directory scan. In that case a naive implementation would end up with a cache entry referring to a file that doesn't exist. What's the right way to deal with this particular race condition?

The way I'd have done it is to keep two permanent threads: a single utility thread and a single inotify thread that reads from the inotify file descriptor non-stop. The threads communicate via a blocking queue.
When the inotify thread detects an event, it is one of two types:
1. An event indicating that the entire cache for the observed directory must be destroyed and re-created: queue overflow or unmount.
2. An event that can be handled by changing a single cache entry (most other inotify events).
Upon detection, the event is immediately queued to the utility thread.
When the utility thread receives an event of the 1st type, it recreates the entire cache from scratch by reading the full directory contents into the cache. The same happens when there is no cache yet and an event of the 2nd type arrives. In other cases a full readdir() is avoided, and the cache is simply modified according to the event.
The race condition described in your question can happen only if multiple threads are allowed to modify the cache. The approach above avoids it by making the utility thread the only thread allowed to modify the cache.
If you want to allow other threads to modify the cache (for example, because you don't know whether inotify is supported by the filesystem), you can use a simpler and more robust approach: do not track individual directory modification events, and have the utility thread perform a full readdir() on every arriving event. In the worst case there will be too many readdir() calls, but reading directory contents is itself so cheap that I wouldn't worry about it.
If reading the full directory contents is not cheap (for example, because the directory may be very, very big), then you shouldn't store all of it in memory to begin with. Such a scenario works better with a small partial cache that can be quickly refreshed by using telldir(), seekdir() and fstat() to track the small number of files currently visible to the user.
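A minimal sketch of the utility thread's side of this design, assuming the inotify thread has already translated raw events into three hypothetical classes (EV_RESCAN, EV_CREATE, EV_DELETE). The names dir_cache, cache_apply and the fixed capacity are illustrative only. Because cache_apply() is the only function that touches the cache and it is called from one thread, a delete that arrives before the first scan simply triggers the scan.

```c
#include <assert.h>
#include <dirent.h>
#include <stdbool.h>
#include <string.h>

#define CACHE_MAX 256   /* hypothetical capacity, enough for a sketch */
#define NAME_LEN  256

/* Hypothetical event classes produced by the inotify thread. */
enum ev_kind { EV_RESCAN, EV_CREATE, EV_DELETE };

struct dir_cache {
    bool   valid;                        /* false until the first full scan */
    size_t count;
    char   names[CACHE_MAX][NAME_LEN];
};

static int cache_find(const struct dir_cache *c, const char *name) {
    for (size_t i = 0; i < c->count; i++)
        if (strcmp(c->names[i], name) == 0)
            return (int)i;
    return -1;
}

static void cache_add(struct dir_cache *c, const char *name) {
    if (cache_find(c, name) >= 0 || c->count == CACHE_MAX)
        return;                          /* already present, or cache full */
    strncpy(c->names[c->count], name, NAME_LEN - 1);
    c->names[c->count][NAME_LEN - 1] = '\0';
    c->count++;
}

static void cache_remove(struct dir_cache *c, const char *name) {
    int i = cache_find(c, name);
    if (i < 0)
        return;                          /* unknown name: nothing to do */
    c->count--;
    if ((size_t)i != c->count)           /* move the last entry into the hole */
        memcpy(c->names[i], c->names[c->count], NAME_LEN);
}

/* Throw the cache away and rebuild it with a full readdir() pass. */
static int cache_rescan(struct dir_cache *c, const char *path) {
    DIR *d = opendir(path);
    if (!d)
        return -1;
    c->count = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL)
        if (strcmp(e->d_name, ".") != 0 && strcmp(e->d_name, "..") != 0)
            cache_add(c, e->d_name);
    closedir(d);
    c->valid = true;
    return 0;
}

/* Called only by the utility thread, once per queued event. */
int cache_apply(struct dir_cache *c, const char *path,
                enum ev_kind kind, const char *name) {
    assert(c != NULL && path != NULL);
    if (kind == EV_RESCAN || !c->valid)
        return cache_rescan(c, path);    /* 1st type, or no cache yet */
    if (kind == EV_CREATE)
        cache_add(c, name);
    else
        cache_remove(c, name);
    return 0;
}
```

The blocking queue between the two threads is left out here; any mutex plus condition variable queue would do.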

Related

POSIX-compliant file locking (within a single process)?

I'm making a client/server system where clients can download and upload files to the server (one client can do several such operations at once). In case of a client crash it has to resume its interrupted operations on restart.
Obviously, I need some metadata file that keeps track of current operations. It would be accessed by a thread every time a chunk of data is downloaded/uploaded. Also, the client application should be able to print all files' download/upload progress in %.
I don't mind locking the whole meta-file for a single entry (that corresponds to single download/upload operation) update but at least reading it should allow thread concurrency (unless finding and reading a line in a file is fast).
This article says that inter-process file locking sucks in POSIX. What options do I have?
EDIT: it might be clear already but concurrency in my system must be based on pthreads.
This article says that inter-process file locking sucks in POSIX. What options do I have?
The article correctly describes the state of inter-process file locking. In your case you have a single process, so no inter-process locking is taking place.
[...] concurrency in my system must be based on pthreads.
As you say, one possibility is to use a global mutex to synchronize access to the meta-file.
One way to minimize the locking is to give each per-thread/per-file entry in the meta-file a predictable size (for best results, aligned to the file-system block size). That allows the following:
1. While holding the global mutex, allocate the per-file/per-thread entry by appending it to the end of the meta-file, and save its file offset.
2. With that file offset, use the pread()/pwrite() functions to update the information about the file operation, without using or affecting the file descriptor's shared offset.
Since every thread would be writing into its own area of the meta-file, there should be no race conditions.
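A rough sketch of the two steps above, assuming a 512-byte entry and an invented record layout (meta_entry, meta_alloc, meta_update and meta_read are hypothetical names, not part of any library). Only the allocation takes the mutex; progress updates go straight to the entry's own offset with pwrite().

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define ENTRY_SIZE 512  /* assumed entry size; pick a divisor of the FS block size */

struct meta_entry {     /* hypothetical on-disk layout of one transfer */
    char     name[256];
    uint64_t bytes_done;
    uint64_t bytes_total;
    char     pad[ENTRY_SIZE - 256 - 2 * sizeof(uint64_t)];
};
_Static_assert(sizeof(struct meta_entry) == ENTRY_SIZE, "entry must be fixed size");

static pthread_mutex_t meta_lock = PTHREAD_MUTEX_INITIALIZER;

int meta_open(const char *path) {
    assert(path != NULL);
    return open(path, O_RDWR | O_CREAT, 0600);
}

/* Step 1: append a new entry under the global mutex, return its offset. */
off_t meta_alloc(int fd, const char *name, uint64_t total) {
    struct meta_entry e;
    memset(&e, 0, sizeof e);
    strncpy(e.name, name, sizeof e.name - 1);
    e.bytes_total = total;

    pthread_mutex_lock(&meta_lock);
    off_t off = lseek(fd, 0, SEEK_END);          /* current end of file */
    ssize_t n = (off < 0) ? -1 : pwrite(fd, &e, sizeof e, off);
    pthread_mutex_unlock(&meta_lock);

    return n == (ssize_t)sizeof e ? off : (off_t)-1;
}

/* Step 2: update only this entry's progress field; no lock, no shared offset. */
int meta_update(int fd, off_t off, uint64_t done) {
    off_t field = off + (off_t)offsetof(struct meta_entry, bytes_done);
    return pwrite(fd, &done, sizeof done, field) == (ssize_t)sizeof done ? 0 : -1;
}

/* Read one entry back, e.g. to print progress or to resume after a crash. */
int meta_read(int fd, off_t off, struct meta_entry *out) {
    return pread(fd, out, sizeof *out, off) == (ssize_t)sizeof *out ? 0 : -1;
}
```

Each thread keeps the offset returned by meta_alloc() for the lifetime of its transfer; progress in % is then bytes_done * 100 / bytes_total for each entry read back.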
(If the number of entries has an upper limit, it is also possible to preallocate the whole file and mmap() it, and then use it as if it were plain memory. If the application crashes, the most recent state (possibly with some corruption; the application has crashed, after all) would be present in the file. To speed up restart after a software crash, some applications go as far as keeping their whole state in an mmap()ed file.)
Another alternative is to use the meta-file as a "journal": open the meta-file in append mode and write each status change of the pending file operations as an entry, including the start and end of each operation. After a crash, to recover the state of the file transfers, you "replay" the journal file: read the entries from the file and update the state of the file operations in memory. Once you reach the end of the journal, the state in memory should be up to date and ready to be resumed. The typical complication of this approach, since the journal file is only ever appended to, is that it has to be cleaned up periodically by purging old finished operations from it. (The simplest method is to implement the clean-up during recovery (after recovery, write a new journal and delete the old one) and then periodically restart the application gracefully.)
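To make the journal idea concrete, here is a small sketch assuming fixed-size binary records and at most MAX_OPS transfers. The record layout and the names j_record, journal_append and journal_replay are made up for illustration. The file is opened with O_APPEND so each write() lands at the current end of the file, and replay stops at the first short read, which also skips a record torn by a crash.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_OPS 64  /* hypothetical upper bound on tracked transfers */

enum j_kind { J_START = 1, J_PROGRESS = 2, J_END = 3 };

struct j_record {       /* one status change, 16 bytes, no padding */
    uint32_t kind;
    uint32_t op_id;     /* index of the transfer, must be < MAX_OPS */
    uint64_t value;     /* total size for J_START, bytes done for J_PROGRESS */
};

struct op_state { int active; uint64_t total, done; };

int journal_open(const char *path) {
    assert(path != NULL);
    return open(path, O_RDWR | O_CREAT | O_APPEND, 0600);
}

/* Append one record; O_APPEND makes the kernel place it at end of file. */
int journal_append(int fd, uint32_t kind, uint32_t op_id, uint64_t value) {
    assert(op_id < MAX_OPS);
    struct j_record r = { kind, op_id, value };
    return write(fd, &r, sizeof r) == (ssize_t)sizeof r ? 0 : -1;
}

/* Rebuild the in-memory state from the journal; returns records replayed. */
int journal_replay(int fd, struct op_state ops[MAX_OPS]) {
    struct j_record r;
    off_t off = 0;
    int replayed = 0;

    for (int i = 0; i < MAX_OPS; i++)
        ops[i] = (struct op_state){ 0, 0, 0 };

    while (pread(fd, &r, sizeof r, off) == (ssize_t)sizeof r) {
        off += (off_t)sizeof r;
        replayed++;
        if (r.op_id >= MAX_OPS)
            continue;                    /* ignore records we cannot place */
        switch (r.kind) {
        case J_START:    ops[r.op_id] = (struct op_state){ 1, r.value, 0 }; break;
        case J_PROGRESS: ops[r.op_id].done = r.value;                       break;
        case J_END:      ops[r.op_id].active = 0;                           break;
        default:         break;          /* unknown kind: skip */
        }
    }
    return replayed;
}
```

After replay, every entry with active set is an interrupted transfer that can be resumed from its done offset. Clean-up then amounts to writing a fresh journal containing only those entries.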

Adding watches to Inotify in multi-threaded program

I wanted to use inotify for monitoring some files in my C program.
I am wondering whether it is safe to have one thread reading from the inotify descriptor (the one returned by inotify_init), and thus blocking until some event happens, while another thread adds a new file to the watch list using inotify_add_watch during that wait.
Do I need to synchronize those actions or is it safe to do such thing?
I don't have the exact answer, but I do know from experience that you can't even open files in another thread without triggering the read() in the thread where you are using inotify. I recall reading that you need to use inotify_init1() along with the IN_CLOEXEC flag to allow file I/O in other threads. I'm not sure whether that means you can actually use inotify in more than one thread simultaneously, though.

Reducing seek times when reading many small files

I need to write some code (in any language) to process 10,000 files that reside on a local Linux filesystem. Each file is ~500KB in size, and consists of fixed-size records of 4KB each.
The processing time per record is negligible, and the records can be processed in any order, both within and across different files.
A naïve implementation would read the files one by one, in some arbitrary order. However, since my disks are very fast to read but slow to seek, this will almost certainly produce code that's bound by disk seeks.
Is there any way to code the reading up so that it's bound by disk throughput rather than seek time?
One line of inquiry is to try and get an approximate idea of where the files reside on disk, and use that to sequence the reads. However, I am not sure what API could be used to do that.
I am of course open to any other ideas.
The filesystem is ext4, but that's negotiable.
Perhaps you could do the reads by scheduling all of them in quick succession with aio_read. That would put all reads in the filesystem read queue at once, and then the filesystem implementation is free to complete the reads in a way that minimizes seeks.
A very simple approach, although no results guaranteed. Open as many of the files at once as you can and read all of them at once - either using threads or asynchronous I/O. This way the disk scheduler knows what you read and can reduce the seeks by itself. Edit: as wildplasser observes, parallel open() is probably only doable using threads, not async I/O.
The alternative is to try to do the heavy lifting yourself. Unfortunately this involves a difficult step: getting the mapping of the files to physical blocks. There is no portable, standard interface for that; you could probably extract the logic from something like e2fsprogs or the kernel FS driver. And this involves reading the physical device underlying a mounted filesystem, which can be written to at the same time you're trying to get a consistent snapshot.
Once you get the physical blocks, just order them, reverse the mapping back to the file offsets and execute the reads in the physical block order.
Could you consider using an SSD for the file storage? That should reduce seek times greatly, as there's no head to move.
Since operations are similar and data are independent you can try using a thread pool to submit jobs that work on a number of files (can be a single file). Then you can have an idle thread complete a single job. This might help overlapping IO operations with execution.
A simple way would be to keep the original program, but fork an extra process which has no other task than to prefetch the files and prime the disk buffer cache. (A Unix/Linux system uses all "free" memory as disk buffer.)
The main task will stay a few files behind (say ten). The hard part would be to keep things synchronised. A pipe seems the obvious way to accomplish this.
UPDATE:
Pseudo code for the main process:
1. fetch filename from worklist
   if empty goto 2.
   (maybe) fork a worker process or thread
   add to prefetch queue
   add to internal queue
   if fewer than XXX items on internal queue goto 1
2. fetch filename from internal queue
   if empty: quit
   process it
   goto 1
For the slave processes:
   fetch from queue
   if empty: quit
   prefetch file
   loop or quit
For the queue, a message queue seems most appropriate, since it maintains message boundaries. Another way would be to have one pipe per child (in the fork() case) or use mutexes (when using threads).
You'll need approximately seektime_per_file / processing_time_per_file worker threads/processes.
As a simplification: if seeking within the files is not required (only sequential access), the slave processes could consist of the equivalent of
dd if=name of=/dev/null bs=500K
which could be wrapped in a popen() or a pipe+fork().
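If you'd rather not spawn dd, the prefetch step is only a few lines of C. This is a sketch of the same idea (read the file sequentially in 500K chunks and throw the data away so it ends up in the page cache); prefetch_file is a hypothetical name.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK (500 * 1024)   /* same block size as the dd example */

/* Reads the whole file and discards the data, which primes the page cache.
 * Returns the number of bytes read, or -1 on error. */
ssize_t prefetch_file(const char *path) {
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char *buf = malloc(CHUNK);
    if (!buf) {
        close(fd);
        return -1;
    }

    ssize_t total = 0, n;
    while ((n = read(fd, buf, CHUNK)) > 0)
        total += n;                      /* data is intentionally ignored */

    free(buf);
    close(fd);
    return n < 0 ? -1 : total;
}
```

A worker would call this for every filename it takes off the prefetch queue; the main process then finds the data already cached when it opens the same file a little later.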

When does an O_SYNC write become visible in the pagecache (mmap'd file)?

I have a file mmap'd read-only/shared, with multiple threads/processes reading the data concurrently. A single writer is allowed to modify the data at any time (using a mutex in a separate shared memory region). Changes are performed using a write() on the underlying file. The overall setup is part of a database that is intended to be transactionally consistent.
A number of arbitrary data pages will be written out in any order, and then fdatasync() is called. Nothing in the file points to these altered pages until a root page is written. The root page is written using a second file descriptor that was opened with O_SYNC, so the write will not return until the root page has been written successfully. All of the pages being written are part of the mmap region, so they will eventually become visible to all of the readers.
The question is - does the final O_SYNC write become visible immediately, as soon as the kernel copies the user buffer into the page cache? Or does it become visible only after the synchronous write completes? I've read thru the kernel code a bit but haven't followed it all the way; it looks to me like the user data is copied immediately to the page cache, and then a write is scheduled, and then it waits for the write to complete. In the meantime, the written data is already present in the page cache and so is immediately visible to the reader processes. This is undesirable because if the physical write actually fails, the transaction must be rolled back, and readers should never be allowed to see anything that was written by an unsuccessful transaction.
Anyone know for certain how O_SYNC writes interact with the page cache? I suppose just to be safe I can wrap accesses to the root page with a mutex, but that adds a layer of overhead that would be better to avoid.
Under the formal POSIX standard, updates to MAP_SHARED regions can appear at any time. The Synchronised I/O definition specifies that the write will only return once the data has landed on physical media, but doesn't talk about the data seen by other processes.
In practice on Linux, it works as you have described - the page cache is the staging area from where device writes are dispatched, and a MAP_SHARED mapping is a view of the page cache.
As an alternative, you could put a copy of the root page into a shared anonymous region. The reading processes would use that copy, and the writing process would update it after it has synched the root page to disk. You will still need synchronisation though, because you can't atomically update an entire page.
You should use msync(2) for mmapped files. Mixing write() and mmapped access is asking for trouble.

Transferring data between process calls

I have a Linux process that is being called numerous times, and I need to make this process as fast as possible.
The problem is that I must maintain a state between calls (load data from previous call and store it for the next one), without running another process / daemon.
Can you suggest fast ways to do so? I know I can use files for I/O, and would like to avoid it, for obvious performance reasons. Should (can?) I create a named pipe to read/write from and by that avoid real disk I/O?
Pipes aren't appropriate for this. Use posix shared memory or a posix message queue if you are absolutely sure files are too slow - which you should test first.
In the shared memory case your program creates the segment with shm_open() if it doesn't exist or opens it if it does. You mmap() the memory and make whatever changes and exit. You only shm_unlink() when you know your program won't be called anymore and no longer needs the shared memory.
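A minimal sketch of that shared-memory lifecycle, assuming the state is just a call counter (call_state and bump_shared_state are hypothetical names). A freshly created segment is zero-filled after ftruncate(), so the first call sees a counter of 0; later calls see whatever the previous run left behind.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct call_state { uint64_t calls; };   /* hypothetical state kept between runs */

/* Opens (or creates) the segment, increments the counter, returns its new value. */
int64_t bump_shared_state(const char *shm_name) {
    assert(shm_name != NULL && shm_name[0] == '/');

    int fd = shm_open(shm_name, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, sizeof(struct call_state)) != 0) {  /* no-op if already sized */
        close(fd);
        return -1;
    }

    struct call_state *st = mmap(NULL, sizeof *st, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    close(fd);                           /* the mapping stays valid after close */
    if (st == MAP_FAILED)
        return -1;

    int64_t calls = (int64_t)++st->calls;
    munmap(st, sizeof *st);
    return calls;
}
```

If several invocations can overlap, the update needs a lock or an atomic; the sketch assumes calls are strictly sequential. On older glibc versions you may need to link with -lrt.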
With message queues, just set up the queue. Your program reads the queue, makes whatever changes, writes the queue and exits. Call mq_unlink() when you no longer need the queue.
Both methods have kernel persistence so you lose the shared memory and the queue on a reboot.
It sounds like you have a process that is continuously executed by something.
Why not create a factory that spawns the worker threads?
The factory could provide the workers with any information needed.
... I can use files for I/O, and would like to avoid it, for obvious performance reasons.
I wonder what these reasons are...
Linux caches files in kernel memory, in the page cache. Writes go to the page cache first; in other words, a write() syscall is a kernel call that only copies the data from user space to the page cache (it is a bit more complicated when the system is under stress). Some time later the kernel's writeback threads (pdflush on older kernels) write the data to disk asynchronously.
File read() first checks the page cache to see if the data is already available in memory to avoid a disk read. What it means is that if one program writes data to files and another program reads it, these two programs are effectively communicating via kernel memory as long as the page cache keeps those files.
If you want to avoid disk writes entirely, that is, the state does not need to persist across OS reboots, those files can be put in /dev/shm (normally a tmpfs mount) or in /tmp (an in-memory filesystem on some distributions, but not all).
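For comparison, here is what the plain-file version might look like, again assuming the state is a single counter (bump_file_state is a hypothetical name). Point it at a path under /dev/shm to keep the data off the disk entirely; with a regular path the read and write still normally hit only the page cache.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Loads the counter left by the previous call, increments it, stores it back.
 * Returns the new value, or -1 on error. */
int64_t bump_file_state(const char *path) {
    assert(path != NULL);
    uint64_t calls = 0;

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* On the very first call the file is empty and pread() returns 0,
     * which leaves calls at its initial value of 0. */
    if (pread(fd, &calls, sizeof calls, 0) < 0) {
        close(fd);
        return -1;
    }

    calls++;
    ssize_t n = pwrite(fd, &calls, sizeof calls, 0);
    close(fd);
    return n == (ssize_t)sizeof calls ? (int64_t)calls : -1;
}
```

Timing this against the shared-memory variant for your real state size is the test suggested above; for small states the difference is often hard to measure.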
