I have a linux application that streams data to files on a directly-attached SAS storage array. It fills large buffers, writes them in O_DIRECT mode, then recycles the buffers (i.e. fills them again etc.). I do not need to use O_SYNC for data integrity, because I can live with data loss on crashes, delayed writing etc. I'm primarily interested in high throughput and I seem to get better performance without O_SYNC. However, I am wondering if it is safe: if O_DIRECT is used but not O_SYNC, when exactly does the write() system call return?
If the write() returns after the DMA to the storage array's cache has been completed, then my application is safe to fill the buffer again. The array itself is in write-back mode: it will write to disk eventually, which is acceptable to me.
If the write returns immediately after the DMA has been initiated (but not yet completed), then my application is not safe, because it would overwrite the buffer while the DMA is still in progress. Obviously I don't want to write corrupted data; but in this case there is also no way I know of to figure out when the DMA for a buffer has completed and it is safe to refill.
(There are actually several parallel threads, each one with its pool of buffers, although this may not really matter for the question above.)
When the write call returns you can reuse the buffer without any danger. You don't know that the write has made it to disk, but you indicated that was not an issue for you.
One supporting reference is http://www.makelinux.net/ldd3/chp-15-sect-3, which states:

    For example, the use of direct I/O requires that the write system call operate synchronously; otherwise the application does not know when it can reuse its I/O buffer.
vkMapMemory states:
vkMapMemory does not check whether the device memory is currently in use before returning the host-accessible pointer. The application must guarantee that any previously submitted command that writes to this range has completed before the host reads from or writes to that range, and that any previously submitted command that reads from that range has completed before the host writes to that region
It links to this site which sadly doesn't seem to exist yet. I am wondering how I would synchronize this?
Basically I need to worry about two things
Only 1 thread accesses the same range at the same time
The Gpu isn't currently trying to read the range
The only real way that I see to synchronize this is with a thread safe list. Every time you want to write/read to/from that buffer you have to add the memory range that you are currently trying to read or write into that thread safe list.
That means that when you want to access the buffer you need to lock the list and search for the range you are trying to access.
Is that how you would synchronize vkMapMemory or are there other ways to do this?
The only time the GPU will try to access the mapped memory is after a command buffer accessing that memory has been submitted. That memory remains in use until the associated VkFence has been signaled.
A fully general solution would be to track every memory access by the GPU and surround each CPU access to mapped memory with a begin/end pair that waits on the appropriate fences and calls flush/invalidate as needed. This is a lot of state tracking and a plethora of potentially blocking calls.
However, for persistent mesh/texture data you only need to write to a mapped staging buffer and then copy to a device-local, non-host-visible buffer. You shouldn't need to do this often, so a single fence tracking whether a copy from the staging buffer is in flight is enough. For data that only needs to survive a single frame (per-object transforms, say), or for reading back GPU occlusion-test or compute results, a ring buffer works well.
I hope you can see the pattern emerging: use just a few mapped ring buffers, be very conscious of when they are used by the GPU, and then you only need to keep a small array of VkFence+offset+size per ring buffer to ensure no data hazard occurs.
We'd like to measure the I/O time from an application by instrumenting the read() and write() routines on a Linux system. However, the calls to write() return very fast. According to my OS man page for write (man 2 write):
NOTES
    A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.
(Linux manual as of 2013-01-27)
so we understand that write() merely initiates the I/O; the data is flushed to disk asynchronously at some later point.
So the question is, is there a way to know when the data (even if it has been grouped for caching purposes) is being actually written into disk? -- preferably, when that process starts and ends?
EDIT1 We're particularly interested on measuring the application behavior and we'd like to avoid changing the semantics of the application by changing the parameters to open() -- adding O_SYNC -- or injecting calls to sync(). By changing the application semantics, you can't actually tell about the behavior of the original application.
You could open the file with O_SYNC, which in theory means that write won't return until the data is written to disk. Which data is written synchronously, file contents or metadata as well, depends on the file system and how it is mounted. This changes how your application really works, though.
If you're really interested in handling the actual I/O to storage yourself (are you a database?) then O_DIRECT gives you control. Again, this is a change in behaviour and imposes additional constraints on your application. It may be what you need, it may not.
You really appear to be asking about benchmarking real performance, so the real question is what you want to know. Since a real system does so much caching, the "instant" return from the write is "real" in the sense of what delays on your application actually are. If you're looking for I/O throughput you might be better looking at higher level system statistics.
You basically can't know when the data is really written to disk, and the actual disk writing may happen a long time (typically a few minutes) after your process has terminated. Also, the disk itself has some cache inside the disk controller. Be happy with that, since the page cache of your system is very effective (and is what makes your Linux system behave quickly).
You might consider calling the sync(2) system call, but you often should not: it can be slow, and it still doesn't guarantee the data is on disk; it is often just a request asking the kernel to flush buffers later.
On a given opened file descriptor, you could consider fsync(2). As Joe answered, you might pass O_SYNC to open, but that would slow down the system.
I strongly suggest, for performance reasons, trusting your kernel's page cache management and avoiding forcing any disk flush manually. See also the related posix_fadvise(2) & madvise(2) system calls.
If you benchmark some program, run it several times, and decide what matters most to you: an average of the measured times (perhaps excluding the best and/or worst of them), or the worst, or the best of them. The point is that the I/O time (or the CPU time, or the elapsed real time) of an application is something quite ambiguous. You probably want to explain your benchmarking process when publishing benchmark results.
You can refer to this link. It might help you.
Flush Data to disk
As far as when the data actually reaches the disk, it is unpredictable; there is no definitive way of telling. But you can make sure the data is written to disk by calling sync.
With kernel AIO and O_DIRECT|O_SYNC, there is no copying into kernel buffers and it is possible to get fine-grained notification when the data is actually flushed to disk. However, it requires the data to be held in user-space buffers for io_prep_pwrite().
With splice(), it is possible to move data directly to disk from kernel-space buffers (pipes) without ever having to copy it around. However, splice() returns immediately after the data is queued and does not wait for the actual writes to the disk.
The goal is to move data from sockets to disk without copying it around while getting confirmation that it has been flushed out. How to combine both previous approaches?
By combining splice() with O_SYNC, I expect splice() to block and one has to use multiple threads to mask latency. Alternatively, one could use asynchronous io_prep_fsync()/io_prep_fdsync(), but this waits for all data to be flushed, not for a specific write. Neither is perfect.
What would be required is a combination of splice() with kernel AIO, allowing zero copy and asynchronous confirmation of writes, such that a single event driven thread can move data from sockets to the disk and get confirmations when required, but this doesn't seem to be supported. Is there a good workaround / alternative approach?
To get a confirmation of the writes, you can't use splice().
There's AIO stuff in userspace, but if you were doing it in the kernel it might come down to finding out which bios (block I/O structures) are generated and waiting for those:
Block I/O structure:
http://www.makelinux.net/books/lkd2/ch13lev1sec3
If you want to use AIO, you will need to use io_getevents():
http://man7.org/linux/man-pages/man2/io_getevents.2.html
Here are some examples on how to perform AIO:
http://www.fsl.cs.sunysb.edu/~vass/linux-aio.txt
If you do it from userspace and use msync it's still kind of up in the air if it is actually on spinning rust yet.
msync() docs:
http://man7.org/linux/man-pages/man2/msync.2.html
You might have to soften your expectations to make this robust, because it can be very expensive to be truly sure that writes have physically reached the disk.
The 'highest' typical standard for write assurance in the face of something like power removal is a journal recording every operation that modifies the storage. The journal itself is append-only, and when you play it back you can see which entries are complete. The very last journal entry may be incomplete, so something may still be lost.
I have this requirement on an x86-based Linux system running a 2.6.3x kernel.
My process has some dynamic data (not much, in the few-megabyte range) that has to be recovered if the process crashes. The obvious solution is to store the data in shared memory and read it back when the process restarts. Writes to the shared memory have to be done carefully, so that a crash in the middle of an update won't leave the data corrupted.
Before coding this myself just wanted to check if there is any open source program/library that provides this functionality.. Thanks.
-Santhosh.
I don't think your proposed design is sound. An OS crash (a power failure, for example) may cause an mmap'd area to be partially synced to disc (the pages may be written out in a different order than you wrote them, for instance), which means your data structures will get corrupted in arbitrary ways.
If you need your database changes to be durable and atomic (maybe consistency and integrity wouldn't hurt either, right?) then I'd strongly recommend using an existing database system which supports ACID, or the appropriate subset. Maybe sqlite or Berkeley DB would do the trick.
You could do it yourself, in principle, but not in the way that you've described - you'd need to create some kind of log file which was updated in a way which could be read back atomically, and be able to "replay" events from some known snapshot etc, which is technically challenging.
Remember that:
An OS failure might cause a write initiated by msync() or similar, to be partially completed to durable disc
mmap does not guarantee to never write back data at other times, i.e. when you haven't called msync() for a while
Pages aren't necessarily written back in the same order that you modified the pages in memory - e.g. you can write to a[0] and then a[4096], and have a[4096] durable but a[0] not after a crash.
Even flushing an individual page is not absolutely guaranteed to be atomic.
I realise that using a library (e.g. bdb or sqlite) for every read or write operation to your data structure is an intrusive change, but if you want this kind of robustness, I think it's necessary.
I have a Linux process that is being called numerous times, and I need to make this process as fast as possible.
The problem is that I must maintain a state between calls (load data from previous call and store it for the next one), without running another process / daemon.
Can you suggest fast ways to do so? I know I can use files for I/O, and would like to avoid it, for obvious performance reasons. Should (can?) I create a named pipe to read/write from and by that avoid real disk I/O?
Pipes aren't appropriate for this. Use posix shared memory or a posix message queue if you are absolutely sure files are too slow - which you should test first.
In the shared memory case your program creates the segment with shm_open() if it doesn't exist or opens it if it does. You mmap() the memory and make whatever changes and exit. You only shm_unlink() when you know your program won't be called anymore and no longer needs the shared memory.
With message queues, just set up the queue. Your program reads the queue, makes whatever changes, writes the queue and exits. Call mq_unlink() when you no longer need the queue.
Both methods have kernel persistence so you lose the shared memory and the queue on a reboot.
It sounds like you have a process that is continuously executed by something.
Why not create a factory that spawns the worker threads?
The factory could provide the workers with any information needed.
... I can use files for I/O, and would like to avoid it, for obvious performance reasons.
I wonder what those reasons are...
Linux caches files in kernel memory in the page cache. Writes go to the page cache first; in other words, a write() syscall only copies the data from user space into the page cache (it is a bit more complicated when the system is under stress). Some time later, pdflush writes the data to disk asynchronously.
File read() first checks the page cache to see if the data is already available in memory to avoid a disk read. What it means is that if one program writes data to files and another program reads it, these two programs are effectively communicating via kernel memory as long as the page cache keeps those files.
If you want to avoid disk writes entirely, that is, the state does not need to be persisted across OS reboots, those files can be put in /dev/shm or in /tmp, which are normally the mount points of in-memory filesystems.