Linux splice() + kernel AIO when writing to disk

With kernel AIO and O_DIRECT|O_SYNC, there is no copying into kernel buffers and it is possible to get fine-grained notification when data has actually been flushed to disk. However, it requires the data to be held in user-space buffers for io_prep_pwrite().
With splice(), it is possible to move data to disk directly from kernel-space buffers (pipes) without ever having to copy it around. However, splice() returns immediately after the data is queued and does not wait for the actual writes to the disk.
The goal is to move data from sockets to disk without copying it around while getting confirmation that it has been flushed out. How to combine both previous approaches?
If splice() is combined with O_SYNC, I expect splice() to block, so one would have to use multiple threads to mask the latency. Alternatively, one could use the asynchronous io_prep_fsync()/io_prep_fdsync(), but that waits for all data to be flushed, not for a specific write. Neither is perfect.
What would be required is a combination of splice() with kernel AIO, allowing zero-copy and asynchronous confirmation of writes, so that a single event-driven thread can move data from sockets to disk and get confirmations when required; however, this doesn't seem to be supported. Is there a good workaround or alternative approach?

To get a confirmation of the writes, you can't use splice().
There is AIO support in user space, but if you were doing this in the kernel it might come down to finding out which bios (block I/O structures) are generated and waiting for those:
Block I/O structure:
http://www.makelinux.net/books/lkd2/ch13lev1sec3
If you want to use AIO, you will need to use io_getevents():
http://man7.org/linux/man-pages/man2/io_getevents.2.html
Here are some examples on how to perform AIO:
http://www.fsl.cs.sunysb.edu/~vass/linux-aio.txt
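To make the kernel AIO path from the question concrete, here is a minimal sketch using libaio (link with -laio); the file name, buffer size and single-request setup are arbitrary choices for illustration, and error handling is pared down. The point is that io_getevents() reports completion for the specific io_prep_pwrite() request, which with O_DIRECT|O_SYNC amounts to a per-write flush confirmation.

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);
    if (fd < 0)
        return 1;

    void *buf;
    posix_memalign(&buf, 4096, 4096);   /* O_DIRECT requires aligned buffers */
    memset(buf, 'x', 4096);

    io_context_t ctx = 0;
    io_setup(1, &ctx);                  /* room for one in-flight request */

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pwrite(&cb, fd, buf, 4096, 0);
    io_submit(ctx, 1, cbs);             /* queues the write, returns immediately */

    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL); /* blocks until this write has been flushed */

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}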
If you do it from user space with msync(), it is still somewhat up in the air whether the data has actually reached the spinning rust yet.
msync() docs:
http://man7.org/linux/man-pages/man2/msync.2.html
You might have to soften your expectations in order to keep things practical, because it can be very expensive to be certain that the writes are physically on the disk.
The 'highest' typical standard for write assurance in the face of something like power removal is a journal recording every operation that modifies the storage. The journal itself is append-only, and you can see whether entries are complete when you play it back. The very last journal entry may not be complete, so something may still potentially be lost.

Related

File Access (read/write) synchronization between 'n' processes in Linux

I am studying Operating Systems this semester and was just wondering how Linux handles file access (read/write) synchronization. What is the default implementation: does it use semaphores, mutexes, or monitors? And can you please tell me where I would find this in the source code of my own copy of Ubuntu, and how to disable it?
I need to disable it so I can check whether my own implementation works, and also: how do I add my own implementation to the system?
Here's my current plan; please tell me if it's okay:
Disable the default implementation, add my own. (recompile kernel if need be)
My own version would keep track of every incoming process and maintain a list of what files they were using, and whenever a file repeated I would check whether it was a reader process or a writer process.
I will be going with a reader-preferred solution to the readers-writers problem.
The kernel doesn't impose process synchronization (it should be performed by the processes themselves, while the kernel only provides tools for it), but it can guarantee atomicity for some operations: an atomic operation cannot be interrupted, and its result cannot be altered by another operation running in parallel.
Writing to a file has some atomicity guarantees. From man -s3 write:
Atomic/non-atomic: A write is atomic if the whole amount written in one operation is not interleaved with data from any other process. This is useful when there are multiple writers sending data to a single reader. Applications need to know how large a write request can be expected to be performed atomically. This maximum is called {PIPE_BUF}. This volume of IEEE Std 1003.1-2001 does not say whether write requests for more than {PIPE_BUF} bytes are atomic, but requires that writes of {PIPE_BUF} or fewer bytes shall be atomic.
Some discussion on SO: Atomicity of write(2) to a local filesystem.
To maintain atomicity, various kernel routines hold the i_mutex mutex of the inode. E.g., in generic_file_write_iter():
mutex_lock(&inode->i_mutex);
ret = __generic_file_write_iter(iocb, from);
mutex_unlock(&inode->i_mutex);
So other write() calls won't mess with your call. Readers, however, don't lock i_mutex, so they may get stale data. The actual locking for readers is performed in the page cache, so a page (4096 bytes on x86) is the minimum amount of data for which the kernel guarantees atomicity.
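A quick way to observe this (an illustrative experiment, not a guarantee): have several forked writers append fixed-size records to the same file and check afterwards that no record is interleaved. The file name and record size below are arbitrary.

#include <fcntl.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    for (int w = 0; w < 4; w++) {
        if (fork() == 0) {                      /* child: one writer */
            int fd = open("records.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
            char rec[64];
            memset(rec, 'A' + w, sizeof(rec) - 1);
            rec[sizeof(rec) - 1] = '\n';
            for (int i = 0; i < 1000; i++)
                write(fd, rec, sizeof(rec));    /* one whole record per write() */
            close(fd);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;                                       /* reap all writers */
    return 0;
}

Every line of records.log should then consist of a single repeated letter; a mixed line would indicate a torn write.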
Speaking of recompiling the kernel to test your own implementation, there are two ways of doing that: download a vanilla kernel from http://kernel.org/ (or from Git), patch it and build it -- that is the easy way. Recompiling Ubuntu's kernel is harder, since it requires working with the Debian build tools: https://help.ubuntu.com/community/Kernel/Compile
I'm not clear on what you're trying to achieve with your own implementation. If you want to apply stricter synchronization rules, maybe it is time to look at TxOS?

Is it safe to use O_DIRECT without O_SYNC?

I have a Linux application that streams data to files on a directly-attached SAS storage array. It fills large buffers, writes them in O_DIRECT mode, then recycles the buffers (i.e. fills them again, etc.). I do not need to use O_SYNC for data integrity, because I can live with data loss on crashes, delayed writing, etc. I'm primarily interested in high throughput, and I seem to get better performance without O_SYNC. However, I am wondering if it is safe: if O_DIRECT is used but not O_SYNC, when exactly does the write() system call return?
If the write() returns after the DMA to the storage array's cache has been completed, then my application is safe to fill the buffer again. The array itself is in write-back mode: it will write to disk eventually, which is acceptable to me.
If the write returns immediately after the DMA has been initiated (but not yet completed), then my application is not safe, because it would overwrite the buffer while the DMA is still in progress. Obviously I don't want to write corrupted data; but in this case there is also no way that I know to figure out when the DMA for a buffer has been completed and it is safe to refill.
(There are actually several parallel threads, each one with its pool of buffers, although this may not really matter for the question above.)
When the write call returns you can reuse the buffer without any danger. You don't know that the write has made it to disk, but you indicated that was not an issue for you.
One supporting reference is at http://www.makelinux.net/ldd3/chp-15-sect-3, which states:
For example, the use of direct I/O requires that the write system call
operate synchronously; otherwise the application does not know when it
can reuse its I/O buffer.
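As an illustration of the fill/write/recycle pattern from the question (a sketch with made-up sizes and file name, not production code): with O_DIRECT and no O_SYNC, write() returns once the kernel and the DMA engine are finished with the user buffer, so the buffer can be refilled on the next loop iteration.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE (1 << 20)       /* 1 MiB; must stay a multiple of the block size */

int main(void)
{
    int fd = open("stream.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    void *buf;
    posix_memalign(&buf, 4096, BUF_SIZE);       /* O_DIRECT needs aligned buffers */

    for (int i = 0; i < 100; i++) {
        memset(buf, 'a' + (i % 26), BUF_SIZE);  /* "fill" the buffer */
        if (write(fd, buf, BUF_SIZE) != BUF_SIZE)
            break;                              /* returns once the kernel/DMA is
                                                   finished with buf */
        /* safe to refill buf on the next iteration; the array's write-back
           cache decides when the data actually reaches the platters */
    }
    close(fd);
    free(buf);
    return 0;
}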

Reducing seek times when reading many small files

I need to write some code (in any language) to process 10,000 files that reside on a local Linux filesystem. Each file is ~500KB in size, and consists of fixed-size records of 4KB each.
The processing time per record is negligible, and the records can be processed in any order, both within and across different files.
A naïve implementation would read the files one by one, in some arbitrary order. However, since my disks are very fast to read but slow to seek, this will almost certainly produce code that's bound by disk seeks.
Is there any way to code the reading up so that it's bound by disk throughput rather than seek time?
One line of inquiry is to try and get an approximate idea of where the files reside on disk, and use that to sequence the reads. However, I am not sure what API could be used to do that.
I am of course open to any other ideas.
The filesystem is ext4, but that's negotiable.
Perhaps you could do the reads by scheduling all of them in quick succession with aio_read. That would put all reads in the filesystem read queue at once, and then the filesystem implementation is free to complete the reads in a way that minimizes seeks.
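A rough sketch of that idea with POSIX AIO (aio_read; note that glibc implements POSIX AIO with a user-space thread pool, so the effect is similar to the threaded approach in the next answer): queue all the reads up front, then reap them in any order. File names come from the command line, the 500 KB size is taken from the question, and error handling is omitted.

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NFILES 64
#define FILE_SIZE (500 * 1024)

int main(int argc, char **argv)
{
    static struct aiocb cbs[NFILES];
    int n = (argc - 1 < NFILES) ? argc - 1 : NFILES;

    for (int i = 0; i < n; i++) {               /* queue everything up front */
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_fildes = open(argv[i + 1], O_RDONLY);
        cbs[i].aio_buf    = malloc(FILE_SIZE);
        cbs[i].aio_nbytes = FILE_SIZE;
        cbs[i].aio_offset = 0;
        aio_read(&cbs[i]);
    }

    for (int i = 0; i < n; i++) {               /* then reap in any order */
        const struct aiocb *list[1] = { &cbs[i] };
        while (aio_error(&cbs[i]) == EINPROGRESS)
            aio_suspend(list, 1, NULL);         /* wait for this one to finish */
        ssize_t got = aio_return(&cbs[i]);
        (void)got;  /* process cbs[i].aio_buf[0 .. got) here, one 4 KB record at a time */
        close(cbs[i].aio_fildes);
        free((void *)cbs[i].aio_buf);
    }
    return 0;
}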
A very simple approach, although no results are guaranteed: open as many of the files at once as you can and read all of them at once, either using threads or asynchronous I/O. This way the disk scheduler knows what you are reading and can reduce the seeks by itself. Edit: as wildplasser observes, parallel open() is probably only doable using threads, not async I/O.
The alternative is to try to do the heavy lifting yourself. Unfortunately this involves a difficult step: getting the mapping of the files to physical blocks. There is no standard interface to do that; you could probably extract the logic from something like e2fsprogs or the kernel FS driver. It also involves reading the physical device underlying a mounted filesystem, which may be written to at the same time you're trying to get a consistent snapshot.
Once you get the physical blocks, just order them, reverse the mapping back to the file offsets and execute the reads in the physical block order.
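One Linux-specific (non-portable, normally root-only) shortcut for the block-mapping step is the FIBMAP ioctl, which translates a logical block index within a file into a physical block number on the underlying device. A sketch, assuming the filesystem supports it:

#include <fcntl.h>
#include <linux/fs.h>      /* FIBMAP */
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns the physical block number of the file's first block, or -1. */
long first_physical_block(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int block = 0;                       /* logical block index 0 */
    long phys = -1;
    if (ioctl(fd, FIBMAP, &block) == 0)  /* needs root / CAP_SYS_RAWIO */
        phys = block;                    /* the kernel rewrote it in place */

    close(fd);
    return phys;
}

Sorting the files by this value approximates their on-disk layout, so reading them in that order turns many random seeks into something closer to a single sweep.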
Could you consider using an SSD for the file storage? That should reduce seek times greatly, as there's no head to move.
Since the operations are similar and the data are independent, you can try using a thread pool to submit jobs that each work on a number of files (which can be a single file). An idle thread then picks up and completes a single job. This might help overlap I/O operations with execution.
A simple way would be to keep the original program, but fork an extra process which has no other task than to prefetch the files and prime the disk buffer cache (a Unix/Linux system uses all "free" memory as disk buffer cache).
The main task will stay a few files behind (say ten). The hard part would be to keep things synchronised. A pipe seems the obvious way to accomplish this.
UPDATE:
Pseudo code for the main process:
fetch filename from worklist
if empty goto 2.
(maybe) fork a worker process or thread
add to prefetch queue
add to internal queue
if fewer than XXX items on internal queue goto 1
fetch filename from internal queue
process it
goto 1
For the slave processes:
fetch from queue
if empty: quit
prefetch file
loop or quit
For the queue, a message queue seems most appropriate, since it maintains message boundaries. Another way would be to have one pipe per child (in the fork() case) or to use mutexes (when using threads).
You'll need approximately seektime_per_file / processing_time_per_file worker threads or processes.
As a simplification: if seeking within the files is not required (only sequential access), the slave processes could consist of the equivalent of
dd if=name bs=500K
which could be wrapped in a popen() or a pipe+fork().
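For illustration, a minimal fire-and-forget variant of that dd prefetch using fork() plus exec() (prefetch_file() is a made-up helper name): the child reads the file sequentially, which primes the page cache, while the parent carries on with files it queued a few iterations earlier.

#include <stdio.h>
#include <unistd.h>

static void prefetch_file(const char *name)
{
    char ifarg[1024];
    snprintf(ifarg, sizeof(ifarg), "if=%s", name);

    if (fork() == 0) {
        /* child: a sequential read of the whole file primes the page cache */
        execlp("dd", "dd", ifarg, "of=/dev/null", "bs=500K", (char *)NULL);
        _exit(127);                      /* only reached if exec failed */
    }
    /* the parent returns immediately; children are reaped elsewhere,
       e.g. with signal(SIGCHLD, SIG_IGN) or periodic waitpid(..., WNOHANG) */
}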

Recovering data after process restart

I have this requirement on an x86-based Linux system running a 2.6.3x kernel.
My process has some dynamic data (not much, in the few-megabytes range) that has to be recovered if the process crashes. The obvious solution is to store the data in shared memory and read it again when the process restarts. Writes to the shared memory have to be done carefully, so that a crash in the middle of an update won't leave the data corrupted.
Before coding this myself, I just wanted to check whether there is any open-source program/library that provides this functionality. Thanks.
I don't think your proposed design is sound. An OS crash (e.g. a power failure) may cause an mmap'd area to be only partially synced to disc (maybe the pages are written out in a different order than you wrote them, etc.), which means your data structures will get corrupted in arbitrary ways.
If you need your database changes to be durable and atomic (maybe consistency and integrity wouldn't hurt either, right?) then I'd strongly recommend using an existing database system which supports ACID, or the appropriate subset. Maybe sqlite or Berkeley DB would do the trick.
You could do it yourself, in principle, but not in the way that you've described: you'd need to create some kind of log file that is updated in a way that can be read back atomically, and be able to "replay" events from some known snapshot, etc., which is technically challenging (a rough sketch of such a journal follows at the end of this answer).
Remember that:
An OS failure might cause a write initiated by msync() or similar to be only partially completed to durable disc
mmap() does not guarantee never to write data back at other times, i.e. when you haven't called msync() for a while
Pages aren't necessarily written back in the same order that you modified the pages in memory - e.g. you can write to a[0] and then a[4096], and have a[4096] durable but a[0] not after a crash.
Even flushing an individual page is not absolutely guaranteed to be atomic.
I realise that using a library (e.g. bdb or sqlite) for every read or write operation to your data structure is an intrusive change, but if you want this kind of robustness, I think it's necessary.
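To make the append-only journal idea from this answer (and from the splice() answer earlier) concrete, here is a hedged sketch: each record carries its length and a checksum, so a torn final record can be detected and discarded during replay. The names journal_append/journal_replay and the FNV-1a checksum are illustrative choices, not from any particular library; a real implementation would use a proper CRC.

#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

struct rec_hdr {
    uint32_t len;      /* payload length */
    uint32_t sum;      /* checksum of the payload */
};

/* Stand-in for a real CRC32 (FNV-1a); enough to detect a torn record. */
static uint32_t checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t s = 2166136261u;
    for (size_t i = 0; i < len; i++)
        s = (s ^ p[i]) * 16777619u;
    return s;
}

/* Append one record and make it durable before returning. */
int journal_append(int fd, const void *payload, uint32_t len)
{
    struct rec_hdr h = { len, checksum(payload, len) };
    if (write(fd, &h, sizeof(h)) != (ssize_t)sizeof(h))
        return -1;
    if (write(fd, payload, len) != (ssize_t)len)
        return -1;
    return fsync(fd);                   /* the durability point for this record */
}

/* Replay: hand complete records to cb(); stop at the first torn one. */
void journal_replay(int fd, void (*cb)(const void *, uint32_t))
{
    struct rec_hdr h;
    static char buf[1 << 20];
    while (read(fd, &h, sizeof(h)) == (ssize_t)sizeof(h) && h.len <= sizeof(buf)) {
        if (read(fd, buf, h.len) != (ssize_t)h.len)
            break;                      /* torn tail: incomplete payload */
        if (checksum(buf, h.len) != h.sum)
            break;                      /* torn tail: header/payload mismatch */
        cb(buf, h.len);
    }
}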

Transferring data between process calls

I have a Linux process that is being called numerous times, and I need to make this process as fast as possible.
The problem is that I must maintain a state between calls (load data from previous call and store it for the next one), without running another process / daemon.
Can you suggest fast ways to do so? I know I can use files for I/O, and would like to avoid it, for obvious performance reasons. Should (can?) I create a named pipe to read from and write to, and thereby avoid real disk I/O?
Pipes aren't appropriate for this. Use POSIX shared memory or a POSIX message queue if you are absolutely sure files are too slow - which you should test first.
In the shared memory case your program creates the segment with shm_open() if it doesn't exist or opens it if it does. You mmap() the memory and make whatever changes and exit. You only shm_unlink() when you know your program won't be called anymore and no longer needs the shared memory.
With message queues, just set up the queue. Your program reads the queue, makes whatever changes, writes the queue and exits. mq_unlink() when you no longer need the queue.
Both methods have kernel persistence so you lose the shared memory and the queue on a reboot.
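A minimal sketch of the shm_open() approach described above (struct app_state is just an example layout; link with -lrt on older glibc): because the segment has kernel persistence, each invocation maps it, reads the state left by the previous call, updates it and exits.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct app_state {                       /* illustrative layout */
    long call_count;
    char last_arg[256];
};

int main(int argc, char **argv)
{
    int fd = shm_open("/myprog_state", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(struct app_state));       /* zero-filled on first run */

    struct app_state *st = mmap(NULL, sizeof(*st),
                                PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    printf("call #%ld, argument from previous call: '%s'\n",
           st->call_count, st->last_arg);          /* state left by the last run */

    st->call_count++;                              /* update state for the next run */
    if (argc > 1)
        snprintf(st->last_arg, sizeof(st->last_arg), "%s", argv[1]);

    munmap(st, sizeof(*st));
    close(fd);
    /* shm_unlink("/myprog_state") only when the state is no longer needed */
    return 0;
}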
It sounds like you have a process that is continuously executed by something.
Why not create a factory that spawns the worker threads?
The factory could provide the workers with any information needed.
... I can use files for I/O, and would like to avoid it, for obvious performance reasons.
I wonder what these reasons are, please...
Linux caches files in kernel memory, in the page cache. Writes go to the page cache first; in other words, a write() syscall is a kernel call that only copies the data from user space to the page cache (it is a bit more complicated when the system is under stress). Some time later, pdflush writes the data to disk asynchronously.
File read() first checks the page cache to see if the data is already available in memory to avoid a disk read. What it means is that if one program writes data to files and another program reads it, these two programs are effectively communicating via kernel memory as long as the page cache keeps those files.
If you want to avoid disk writes entirely, that is, the state does not need to be persisted across OS reboots, those files can be put in /dev/shm or in /tmp, which are normally the mount points of in-memory filesystems.
