I'm making a client/server system where clients can download files from and upload files to the server (one client can run several such operations at once). In case of a client crash, it has to resume its interrupted operations on restart.
Obviously, I need some metadata file that keeps track of the current operations. It would be accessed by a thread every time a chunk of data is downloaded/uploaded. The client application should also be able to print every file's download/upload progress as a percentage.
I don't mind locking the whole meta-file to update a single entry (each entry corresponds to one download/upload operation), but reading it should at least allow thread concurrency (unless finding and reading a line in a file is fast anyway).
This article says that inter-process file locking sucks in POSIX. What options do I have?
EDIT: it might be clear already but concurrency in my system must be based on pthreads.
This article says that inter-process file locking sucks in POSIX. What options do I have?
The article correctly describes the state of inter-process file locking. In your case, though, you have a single process, so no inter-process locking is taking place.
[...] concurrency in my system must be based on pthreads.
As you say, one possibility is to use a global mutex to synchronize accesses to the meta-file.
One way to minimize the locking is to give each per-thread/per-file entry in the meta-file a predictable size (for best results, aligned on the file-system block size). That allows the following:
while holding the global mutex, allocate the per-file/per-thread entry by appending it to the end of the meta-file, saving its file offset;
with the saved offset, use the pread()/pwrite() functions to update the information about the file operation, without using or affecting the global file offset.
Since every thread would be writing into its own area of the meta-file, there should be no race conditions.
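A minimal sketch of that scheme, assuming one entry per transfer and an illustrative 512-byte slot size (all names here are hypothetical):
#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define ENTRY_SIZE 512                 /* padded, predictable slot size */

struct meta_entry {
    char  path[256];
    off_t total_bytes;
    off_t done_bytes;
};

static pthread_mutex_t meta_lock = PTHREAD_MUTEX_INITIALIZER;

/* Allocate a slot at the end of the meta-file; return its offset. */
off_t alloc_entry(int meta_fd, const struct meta_entry *e)
{
    char buf[ENTRY_SIZE] = {0};
    memcpy(buf, e, sizeof *e);

    pthread_mutex_lock(&meta_lock);    /* global mutex, held briefly */
    off_t off = lseek(meta_fd, 0, SEEK_END);
    pwrite(meta_fd, buf, ENTRY_SIZE, off);
    pthread_mutex_unlock(&meta_lock);
    return off;
}

/* Progress updates need no lock: each thread owns its own slot. */
void update_entry(int meta_fd, off_t off, const struct meta_entry *e)
{
    char buf[ENTRY_SIZE] = {0};
    memcpy(buf, e, sizeof *e);
    pwrite(meta_fd, buf, ENTRY_SIZE, off);
}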
(If the number of entries has an upper limit, then it is also possible to preallocate the whole file and mmap() it, and then use it as if it were plain memory. If the application crashes, the most recent state (possibly with some corruption; the application has crashed, after all) would be present in the file. To speed up restart after a crash, some applications go as far as keeping the whole state of the application in an mmap()ed file.)
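Continuing the sketch above, the mmap() variant could look like this (MAX_ENTRIES is a hypothetical upper limit):
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_ENTRIES 1024

struct meta_entry *map_meta(const char *path)
{
    size_t len = MAX_ENTRIES * sizeof(struct meta_entry);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    ftruncate(fd, len);                /* preallocate the whole file */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                         /* the mapping survives the close */
    return p == MAP_FAILED ? NULL : p;
}

/* Updates are then plain memory writes, e.g.:
   entries[i].done_bytes += chunk;
   The kernel writes dirty pages back eventually; msync() forces it. */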
Another alternative is to use the meta-file as a "journal": open the meta-file in append mode and write an entry for every status change of the pending file operations, including the start and end of each operation. After a crash, to recover the state of the file transfers, you "replay" the journal: read the entries from the file and update the state of the file operations in memory. When you reach the end of the journal, the in-memory state is up-to-date and the transfers are ready to be resumed. The typical complication of the approach is that, since the journal is only ever appended to, it has to be cleaned up periodically, purging old finished operations from it. (The simplest method is to do the clean-up during recovery (after recovery, write a new journal and delete the old one) and then periodically gracefully restart the application.)
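A sketch of the journal variant, with an illustrative fixed-size record per status change:
#include <fcntl.h>
#include <unistd.h>

struct journal_rec {
    int   op_id;                       /* which transfer */
    int   event;                       /* e.g. START, PROGRESS, DONE */
    off_t done_bytes;
};

/* O_APPEND makes every write land atomically at the current end. */
int journal_open(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
}

void journal_append(int fd, const struct journal_rec *r)
{
    write(fd, r, sizeof *r);
}

/* Recovery: replay every record in order; the last state wins. */
void journal_replay(const char *path,
                    void (*apply)(const struct journal_rec *))
{
    int fd = open(path, O_RDONLY);
    struct journal_rec r;
    while (read(fd, &r, sizeof r) == (ssize_t)sizeof r)
        apply(&r);
    close(fd);
}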
I want to maintain a cache that mirrors a particular directory, so I add a watch whose events are monitored by thread A and then tell thread B to scan that directory and put the filenames into my cache. I have separate threads because I want the application to still be responsive to incoming inotify events during the scan. Otherwise, I could lose events because I wasn't reading them and the inotify queue filled up during the scan.
It is entirely possible that a delete or move_from event for a file will be processed before that file was added to my cache by the directory scan. In that case a naive implementation would end up with a cache entry referring to a file that doesn't exist. What's the right way to deal with this particular race condition?
The way I'd have done it is to keep two permanent threads: a single utility thread and a single inotify thread that does nothing but read from the inotify file descriptor. These threads communicate via a blocking queue.
When the inotify thread detects an event, it can be one of two types:
1. An event indicating that the entire cache for the observed directory must be destroyed and re-created: queue overflow or unmount.
2. An event that can be handled by changing a single entry in the cache (most other inotify events).
Upon detection, the event is immediately queued to the utility thread.
When the utility thread receives an event of the 1st type, it recreates the entire cache from scratch by reading the full directory contents into the cache. The same happens when there is no cache yet and an event of the 2nd type arrives. In all other cases, the full readdir() is avoided, and the cache is simply modified according to the event.
The race condition described in your question can happen only if multiple threads are allowed to modify the cache. The described approach handles it by making the utility thread the only thread allowed to modify the cache.
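A minimal sketch of the inotify thread under that design, assuming a blocking queue with a queue_push() function (not shown) and a watch on a single directory:
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

enum { EV_REBUILD, EV_MODIFY };
extern void queue_push(int type, char *name);   /* assumed blocking queue */

void *inotify_thread(void *arg)
{
    const char *dir = arg;
    int fd = inotify_init();
    inotify_add_watch(fd, dir,
                      IN_CREATE | IN_DELETE | IN_MOVED_FROM | IN_MOVED_TO);

    char buf[4096] __attribute__((aligned(8)));
    ssize_t len;
    while ((len = read(fd, buf, sizeof buf)) > 0) {  /* never stop reading */
        for (char *p = buf; p < buf + len; ) {
            struct inotify_event *ev = (struct inotify_event *)p;
            if (ev->mask & (IN_Q_OVERFLOW | IN_UNMOUNT))
                queue_push(EV_REBUILD, NULL);            /* 1st type */
            else
                queue_push(EV_MODIFY, strdup(ev->name)); /* 2nd type */
            p += sizeof *ev + ev->len;
        }
    }
    return NULL;
}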
If you want to allow other threads to modify the cache (for example, because you don't know whether inotify is supported by the filesystem), you can use a simpler and more robust approach: do not track individual directory modification events, and have the utility thread perform a full readdir() on every arriving event. In the worst case there will be too many readdir() calls, but reading directory contents is by itself so cheap that I wouldn't worry about it.
If reading the full directory contents is not cheap (for example, because the directory may be very, very big), then you shouldn't store all of it in memory to begin with. Such a scenario works better with a small partial cache that can be quickly refreshed by using telldir(), seekdir(), and fstat() to track the small number of files currently visible to the user.
I need to write some code (in any language) to process 10,000 files that reside on a local Linux filesystem. Each file is ~500KB in size, and consists of fixed-size records of 4KB each.
The processing time per record is negligible, and the records can be processed in any order, both within and across different files.
A naïve implementation would read the files one by one, in some arbitrary order. However, since my disks are very fast to read but slow to seek, this will almost certainly produce code that's bound by disk seeks.
Is there any way to code the reading up so that it's bound by disk throughput rather than seek time?
One line of inquiry is to try and get an approximate idea of where the files reside on disk, and use that to sequence the reads. However, I am not sure what API could be used to do that.
I am of course open to any other ideas.
The filesystem is ext4, but that's negotiable.
Perhaps you could do the reads by scheduling all of them in quick succession with aio_read. That would put all reads in the filesystem read queue at once, and then the filesystem implementation is free to complete the reads in a way that minimizes seeks.
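A sketch of that approach, assuming the files are already open in fds[] and each one is read in a single gulp (FILESZ is illustrative; real code must check return values and system AIO limits):
#include <aio.h>
#include <stdlib.h>
#include <string.h>

#define NFILES 10000
#define FILESZ (512 * 1024)

int submit_all(int fds[NFILES], struct aiocb cbs[NFILES])
{
    for (int i = 0; i < NFILES; i++) {
        memset(&cbs[i], 0, sizeof cbs[i]);
        cbs[i].aio_fildes = fds[i];
        cbs[i].aio_offset = 0;
        cbs[i].aio_buf    = malloc(FILESZ);
        cbs[i].aio_nbytes = FILESZ;
        if (aio_read(&cbs[i]) != 0)    /* queue everything up front */
            return -1;
    }
    /* Later, collect results with aio_suspend()/aio_error()/aio_return(). */
    return 0;
}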
A very simple approach, although no results are guaranteed: open as many of the files at once as you can and read all of them at once - either using threads or asynchronous I/O. This way the disk scheduler knows what you read and can reduce the seeks by itself. Edit: as wildplasser observes, parallel open() is probably only doable using threads, not async I/O.
The alternative is to try to do the heavy lifting yourself. Unfortunately this involves a difficult step - getting the mapping of the files to physical blocks. There is no standard interface to do that; you could probably extract the logic from something like e2fsprogs or the kernel FS driver. And it involves reading the physical device underlying a mounted filesystem, which may be written to at the same time you're trying to get a consistent snapshot.
Once you get the physical blocks, just order them, reverse the mapping back to the file offsets and execute the reads in the physical block order.
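On Linux specifically there is a shortcut worth knowing about: the (non-standard) FIEMAP ioctl reports a file's extents, and the first extent's physical offset is enough to sort files by on-disk position. A sketch:
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/* Physical byte offset of the file's first extent, or UINT64_MAX on failure. */
uint64_t first_physical_offset(int fd)
{
    struct fiemap *fm = calloc(1, sizeof *fm + sizeof(struct fiemap_extent));
    uint64_t off = UINT64_MAX;

    fm->fm_start        = 0;
    fm->fm_length       = ~0ULL;       /* whole file */
    fm->fm_extent_count = 1;           /* first extent suffices for sorting */
    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents > 0)
        off = fm->fm_extents[0].fe_physical;

    free(fm);
    return off;
}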
Could you recommend using an SSD for the file storage? That should reduce seek times greatly, as there's no head to move.
Since the operations are similar and the data are independent, you can try using a thread pool to submit jobs that each work on a number of files (possibly a single file). Any idle thread then picks up a job. This might help overlap the I/O operations with execution, as sketched below.
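A sketch of that idea, with hypothetical files[]/process_file() supplied by the caller: each worker pulls the next unprocessed file from a shared counter, so several reads are always in flight.
#include <pthread.h>
#include <stdatomic.h>

extern const char *files[];             /* assumed: the file names */
extern int nfiles;
extern void process_file(const char *); /* assumed: open, read, process */

static atomic_int next_file;

void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int i = atomic_fetch_add(&next_file, 1);
        if (i >= nfiles)
            return NULL;
        process_file(files[i]);
    }
}
/* Start with pthread_create(&tid[k], NULL, worker, NULL) for k workers. */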
A simple way would be to keep the original program, but fork an extra process which has no other task than to prefetch the files and prime the disk buffer cache (a Unix/Linux system uses all "free" memory as disk buffer).
The main task will stay a few files behind (say ten). The hard part would be to keep things synchronised. A pipe seems the obvious way to accomplish this.
UPDATE:
Pseudo code for the main process:
1. fetch filename from worklist
2. if empty, goto 7.
3. (maybe) fork a worker process or thread
4. add to prefetch queue
5. add to internal queue
6. if fewer than XXX items on internal queue, goto 1.
7. fetch filename from internal queue
8. process it
9. goto 1.
For the slave processes:
1. fetch from queue
2. if empty: quit
3. prefetch file
4. loop or quit
For the queue, a message queue seems most appropriate, since it maintains message boundaries. Another way would be to have one pipe per child (in the fork() case) or to use mutexes (when using threads).
You'll need approximately seektime_per_file / processing_time_per_file worker threads/processes.
As a simplification: if seeking within the files is not required (sequential access only), the slave processes could consist of the equivalent of
dd if=name bs=500K
, which could be wrapped into a popen() or a pipe+fork().
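A sketch of the popen() wrapping (assuming the file name contains no shell metacharacters):
#include <stdio.h>

void prefetch(const char *name)
{
    char cmd[4096];
    snprintf(cmd, sizeof cmd,
             "dd if='%s' of=/dev/null bs=500K 2>/dev/null", name);
    FILE *p = popen(cmd, "r");          /* dd primes the page cache */
    if (p)
        pclose(p);
}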
I have a file mmap'd read-only/shared, with multiple threads/processes reading the data concurrently. A single writer is allowed to modify the data at any time (using a mutex in a separate shared memory region). Changes are performed using a write() on the underlying file. The overall setup is part of a database that is intended to be transactionally consistent.
A number of arbitrary data pages will be written out in any order, and then fdatasync() is called. Nothing in the file points to these altered pages until a root page is written. The root page is written using a second file descriptor that was opened with O_SYNC, so the write will not return until the root page has been written successfully. All of the pages being written are part of the mmap region, so they will eventually become visible to all of the readers.
The question is - does the final O_SYNC write become visible immediately, as soon as the kernel copies the user buffer into the page cache? Or does it become visible only after the synchronous write completes? I've read through the kernel code a bit but haven't followed it all the way; it looks to me like the user data is copied immediately to the page cache, then a write is scheduled, and then the caller waits for the write to complete. In the meantime, the written data is already present in the page cache and so is immediately visible to the reader processes. This is undesirable because if the physical write actually fails, the transaction must be rolled back, and readers should never be allowed to see anything written by an unsuccessful transaction.
Anyone know for certain how O_SYNC writes interact with the page cache? I suppose just to be safe I can wrap accesses to the root page with a mutex, but that adds a layer of overhead that would be better to avoid.
Under the formal POSIX standard, updates to MAP_SHARED regions can appear at any time. The Synchronised I/O definition specifies that the write will only return once the data has landed on physical media, but doesn't talk about the data seen by other processes.
In practice on Linux, it works as you have described - the page cache is the staging area from where device writes are dispatched, and a MAP_SHARED mapping is a view of the page cache.
As an alternative, you could put a copy of the root page into a shared anonymous region. The reading processes would use that copy, and the writing process would update it after it has synched the root page to disk. You will still need synchronisation though, because you can't atomically update an entire page.
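A sketch of that alternative, assuming the region is created before the reader processes start and that the root page is PAGE_SIZE bytes (names are illustrative):
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

struct shared_root {
    pthread_mutex_t lock;               /* process-shared mutex */
    char            page[PAGE_SIZE];
};

struct shared_root *create_shared_root(void)
{
    struct shared_root *s = mmap(NULL, sizeof *s, PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (s == MAP_FAILED)
        return NULL;
    pthread_mutexattr_t a;
    pthread_mutexattr_init(&a);
    pthread_mutexattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&s->lock, &a);
    return s;
}

/* Writer: publish the copy only after the O_SYNC write has succeeded. */
void publish_root(struct shared_root *s, const char *new_page)
{
    pthread_mutex_lock(&s->lock);
    memcpy(s->page, new_page, PAGE_SIZE);
    pthread_mutex_unlock(&s->lock);
}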
You should use msync(2) for mmapped files. Mixing write() and mmapped access is asking for trouble.
I have a Linux process that is being called numerous times, and I need to make this process as fast as possible.
The problem is that I must maintain state between calls (load data from the previous call and store it for the next one), without running another process/daemon.
Can you suggest fast ways to do so? I know I can use files for I/O, and would like to avoid it, for obvious performance reasons. Should (can?) I create a named pipe to read/write from and by that avoid real disk I/O?
Pipes aren't appropriate for this. Use POSIX shared memory or a POSIX message queue if you are absolutely sure files are too slow - which you should test first.
In the shared memory case your program creates the segment with shm_open() if it doesn't exist or opens it if it does. You mmap() the memory and make whatever changes and exit. You only shm_unlink() when you know your program won't be called anymore and no longer needs the shared memory.
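A sketch of that flow, with a hypothetical segment name and state size:
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define STATE_SIZE 4096                 /* hypothetical fixed state size */

char *open_state(void)
{
    /* Creates the segment on the first run, opens it afterwards. */
    int fd = shm_open("/myapp_state", O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    ftruncate(fd, STATE_SIZE);
    char *state = mmap(NULL, STATE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    close(fd);                          /* the mapping stays valid */
    return state == MAP_FAILED ? NULL : state;
}
/* Read the old state, update it, munmap(), exit. Link with -lrt if needed. */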
With message queues, just set up the queue. Your program reads the queue, makes whatever changes, writes the queue, and exits. Call mq_unlink() when you no longer need the queue.
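A sketch of the queue variant, again with hypothetical names, holding the whole state in a single message:
#include <fcntl.h>
#include <mqueue.h>

#define STATE_SIZE 4096

void roundtrip_state(void (*update)(char *state))
{
    struct mq_attr attr = { .mq_maxmsg = 1, .mq_msgsize = STATE_SIZE };
    mqd_t q = mq_open("/myapp_state", O_RDWR | O_CREAT | O_NONBLOCK,
                      0600, &attr);
    if (q == (mqd_t)-1)
        return;

    char state[STATE_SIZE] = {0};
    /* Fails with EAGAIN on the first run, leaving an empty state. */
    mq_receive(q, state, sizeof state, NULL);
    update(state);
    mq_send(q, state, sizeof state, 0); /* put the new state back */
    mq_close(q);
}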
Both methods have kernel persistence so you lose the shared memory and the queue on a reboot.
It sounds like you have a process that is continuously executed by something.
Why not create a factory that spawns the worker threads?
The factory could provide the workers with any information needed.
... I can use files for I/O, and would like to avoid it, for obvious performance reasons.
I wonder what these reasons are...
Linux caches files in kernel memory in the page cache. Writes go to the page cache first; in other words, a write() syscall only copies the data from user space to the page cache (it is a bit more complicated when the system is under memory stress). Some time later, pdflush writes the data to disk asynchronously.
File read() first checks the page cache to see if the data is already available in memory to avoid a disk read. What it means is that if one program writes data to files and another program reads it, these two programs are effectively communicating via kernel memory as long as the page cache keeps those files.
If you want to avoid disk writes entirely, that is, the state does not need to be persisted across OS reboots, those files can be put in /dev/shm or in /tmp, which are normally the mount points of in-memory filesystems.
If one of my processes opens a file, let's say for reading only, does the OS guarantee that no other process will write to it as I'm reading, maybe leaving the reading process with the first part of the old file version and the second part of the newer version, making data integrity questionable?
I am not talking about pipes, which have no seek, but about regular files, which do (at least when opened by only one process).
No, other processes can change the file contents as you are reading it. Try running "man fcntl" and ignore the section on "advisory" locks; those are "optional" locks that processes only have to pay attention to if they want. Instead, look for the (alas, non-POSIX) "mandatory" locks. Those are the ones that will protect you from other programs. Try a read lock.
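The fcntl() interface is the same either way; whether a lock is advisory or mandatory depends on the filesystem's mount options and the file's mode bits. A minimal read-lock sketch:
#include <fcntl.h>

/* Take a whole-file read (shared) lock, waiting until it is granted. */
int lock_for_reading(int fd)
{
    struct flock fl = {
        .l_type   = F_RDLCK,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,                  /* 0 = to end of file */
    };
    return fcntl(fd, F_SETLKW, &fl);
}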
No, if you open a file, other processes can write to it, unless you use a lock.
On Linux, you can add an advisory lock on a file with:
#include <sys/file.h>
...
flock(file_descriptor,LOCK_EX); // apply an advisory exclusive lock
Any process which can open the file for writing may write to it. Writes can happen concurrently with your own writes, resulting in (potentially) indeterminate states.
It is your responsibility as an application writer to ensure that Bad Things don't happen. In my opinion mandatory locking is not a good idea.
A better idea is not to grant write access to processes which you don't want to write to the file.
If several processes open a file, they will have independent file pointers, so they can lseek() without affecting one another.
If a file is opened by a threaded program (or, more generally, by tasks which share their file descriptors), the file pointer is also shared, so you need to use another method to access the file to avoid race conditions causing chaos - normally pread and pwrite, or their scatter/gather variants preadv and pwritev.
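For example (a sketch; fd and offset come from the caller):
#include <unistd.h>

/* pread() reads at an explicit offset and never moves the shared
   file position, so concurrent threads cannot race on it. */
ssize_t read_record(int fd, void *buf, size_t len, off_t where)
{
    return pread(fd, buf, len, where);
}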