Should the libaio engine be used with unbuffered I/O (direct) only? - performance-testing

I was trying to get performance numbers (simple 4K random read) using the fio tool with ioengine set to libaio.
I observed that if direct I/O is disabled (direct=0), IOPS fall drastically; when direct=1 was set, IOPS were 50 times better!
Setup: fio is run from a Linux client connected to a PCIe-based appliance over Fibre Channel.
Here is a snippet from my fio config file:
[global]
filename=/dev/dm-30
size=10G
runtime=300
time_based
group_reporting
[test]
rw=randread
bs=4k
iodepth=16
runtime=300
ioengine=libaio
refill_buffers
ioscheduler=noop
#direct=1
With this setup I observed around 8000 IOPS, and when I enabled direct=1 in the config file shown above, IOPS jumped to 250K! (which is realistic for the setup I am using)
So my question is: if we use the libaio engine, are there any issues with buffered I/O? Is it mandatory that if we use libaio, we stick to direct I/O?

Per the docs on Kernel Asynchronous I/O (AIO) Support for Linux:
What Does Not Work?
AIO read and write on files opened without O_DIRECT (i.e. normal buffered filesystem AIO). On ext2, ext3, jfs, xfs and nfs, these do not return an explicit error, but quietly default to synchronous or rather non-AIO behaviour (i.e io_submit waits for I/O to complete in these cases). For most other filesystems, -EINVAL is reported.
In short, if you don't use O_DIRECT, AIO still "works" for many of the most common file systems, but becomes a slow form of synchronous I/O (you may as well have just used read/write and saved yourself a few system calls). The massive performance increase is the result of actually benefiting from asynchronous behavior.
So to answer the question in your title: Yes, libaio should only be used with unbuffered/O_DIRECT file descriptors if you expect to derive any benefit from it.
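To make that concrete, here is a minimal C sketch of what fio's libaio engine does under the hood; the device path, queue depth and block size are placeholders mirroring the config above. The descriptor is opened with O_DIRECT and the buffer is sector-aligned, which is what allows io_submit() to queue the read without blocking (compile with -laio):
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/dm-30", O_RDONLY | O_DIRECT);

    io_context_t ctx = 0;
    io_setup(16, &ctx);                     /* queue depth 16, like iodepth=16 */

    void *buf;
    posix_memalign(&buf, 4096, 4096);       /* O_DIRECT needs an aligned buffer */

    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);   /* one 4K read at offset 0 */

    io_submit(ctx, 1, cbs);                 /* returns without waiting for the I/O */

    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL);     /* reap the completion */

    io_destroy(ctx);
    free(buf);
    close(fd);
    return 0;
}
Without O_DIRECT, that io_submit() call would simply block until the read completed, which is exactly the buffered behaviour described above.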

Related

Hey could someone help me understand sync syscall usage?

Like the title says, I don't really understand the usage of this syscall. I was writing a program that writes some data to a file, and the tutorial I've seen told me to use the sys_sync syscall. But my problem is: why and when should we use this? Isn't the data already written to the file?
The manual says:
sync - Synchronize cached writes to persistent storage
So it is written to the file cache in memory, not on disk.
You rarely have to use sync unless you are writing really important data and need to make sure that data is on disk before you go on. One example of systems that use sync a lot are databases (such as MySQL or PostgreSQL).
So in other words, it is theoretically in your file, just not on disk; therefore if you lose power, you could lose the data. Especially if you have a lot of RAM and many writes in a row, the system may keep those writes in the cache for a long while, increasing the risk of data loss.
But how can a file be not on the disk? I understand the concept of cache but if I wrote in the disk why would it be in a different place?
First, when you write to a file, you send the data to the Kernel. You don't directly send it to the disk. Some kernel driver is then responsible to write the data to disk. In my days on Apple 2 and Amiga computers, I would actually directly read/write to disk. And at least the Amiga had a DMA so you could setup a buffer, then tell the disk I/O to do a read or a write and it would send you an interrupt when done. On the Apple 2, you had to write loops in assembly language with precise timings to read/write data on floppy disks... A different era!
Although you could, of course, directly access the disk (but with a kernel like Linux, you'd have to make sure the kernel gives you a free hand to do that...).
Cache is primarily used for speed. It is very slow to write to disk (as far as a human is concerned, it looks extremely fast, but compared to how much data the CPU can push to the drive, it's still slow).
So what happens is that the kernel has a task to write data to disk. That task wakes up as soon as data appears in the cache and ends once all the caches are transferred to disk. This task works in parallel. You can have one such task per drive (which is especially useful when you have a system such as RAID 1).
If your application fills up the cache, then a further write will block until some of the cache can be replaced.
and the tutorial I've seen told me to use sys_sync syscall
Well that sounds silly, unless you're doing filesystem write benchmarking or something.
If you have one really critical file that you want to make sure is "durable" wrt. power outages before you do something else (like sending a network packet to acknowledge a complete transfer), use fsync(fd) to sync just that one file's data and metadata.
(In asm, call number SYS_fsync from sys/syscall.h, with the file descriptor as the first register arg.)
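For example, a minimal sketch of that pattern in C (the filename and payload are just placeholders):
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("important.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    const char *msg = "critical record\n";

    write(fd, msg, strlen(msg));   /* lands in the page cache, not yet on disk */
    fsync(fd);                     /* blocks until data and metadata reach the device */

    /* Only after fsync() returns is it reasonable to e.g. acknowledge the
     * transfer over the network. */
    close(fd);
    return 0;
}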
But my problem is why and when should we use this?
Generally never use the sync system call in programs you're writing.
There are interactive use-cases where you'd normally use the wrapper command of the same name, sync(1). e.g. with removable media, to get the kernel started doing write-back now, so unmount will take less time once you finish typing it. Or for some benchmarking use-cases.
The system shutdown scripts may run sync after unmounting filesystems (and remounting / read-only), before making a reboot(2) system call.
Re: why sync(2) exists
No, your data isn't already on disk right after echo foo > bar.txt.
Most OSes, including Linux, do write-back caching, not write-through, for file writes.
You don't want write() system calls to wait for an actual magnetic disk when there's free RAM, because the traditional way to do I/O is synchronous so simple single-threaded programs wouldn't be able to do anything else (like reading more data or computing anything) while waiting for write() to return. Blocking for ~10 ms on every write system call would be disastrous; that's as long as a whole scheduler timeslice. (It would still be bad even with SSDs, but of course OSes were designed before SSDs were a thing.) Even just queueing up the DMA would be slow, especially for small file writes that aren't a whole number of aligned sectors, so even letting the disk's own write-back write caching work wouldn't be good enough.
Therefore, file writes do create "dirty" pages of kernel buffers that haven't yet been sent to the disk. Sometimes we can even avoid the IO entirely, e.g. for tmp files that get deleted before anything triggers write-back. On Linux, dirty_writeback_centisecs defaults to 1500 (15 seconds) before the kernel starts write-back, unless it's running low on free pages. (Heuristics for what "low" means use other tunable values).
If you really want writes to flush to disk immediately and wait for data to be on disk, mount with -o sync. Or for one program, have it use open(O_SYNC) or O_DSYNC (for just the data, not metadata like timestamps).
See Are file reads served from dirtied pages in the page cache?
There are other advantages to write-back, including delayed allocation even at the filesystem level. The FS can wait until it knows how big the file will be before even deciding where to put it, allowing better decisions that reduce fragmentation. e.g. a small file can go into a gap that would have been a bad place to start a potentially-large file. (It just has to reserve space to make sure it can put it somewhere.) XFS was one of the first filesystems to do "lazy" delayed allocation, and ext4 has also had the feature for a while.
https://en.wikipedia.org/wiki/XFS#Delayed_allocation
https://en.wikipedia.org/wiki/Allocate-on-flush
https://lwn.net/Articles/323169/

What does O_DIRECT really mean?

If I open a file with O_DIRECT flag, does it mean that whenever a write(blocking mode) to that file returns, the data is on disk?
(This answer pertains to Linux - other OSes may have different caveats/semantics)
Let's start with the sub-question:
If I open a file with O_DIRECT flag, does it mean that whenever a write(blocking mode) to that file returns, the data is on disk?
No (as #michael-foukarakis commented) - if you need a guarantee your data made it to non-volatile storage you must use/add something else.
What does O_DIRECT really mean?
It's a hint that you want your I/O to bypass the Linux kernel's caches. What will actually happen depends on things like:
Disk configuration
Whether you are opening a block device or a file in a filesystem
If using a file within a filesystem, the exact filesystem used and the options in use on the filesystem and the file
Whether you've correctly aligned your I/O
Whether a filesystem has to do a new block allocation to satisfy your I/O
If the underlying disk is local, what layers you have in your kernel storage stack before you reach the disk block device
Linux kernel version
...
The list above is not exhaustive.
In the "best" case, setting O_DIRECT will avoid making extra copies of data while transferring it and the call will return after transfer is complete. You are more likely to be in this case when directly opening block devices of "real" local disks. As previously stated, even this property doesn't guarantee that data of a successful write() call will survive sudden power loss. IF the data is DMA'd out of RAM to non-volatile storage (e.g. battery backed RAID controller) or the RAM itself is persistent storage THEN you may have a guarantee that the data reached stable storage that can survive power loss. To know if this is the case you have to qualify your hardware stack so you can't assume this in general.
In the "worst" case, O_DIRECT can mean nothing at all even though setting it wasn't rejected and subsequent calls "succeed". Sometimes things in the Linux storage stack (like certain filesystem setups) can choose to ignore it because of what they have to do or because you didn't satisfy the requirements (which is legal) and just silently do buffered I/O instead (i.e. write to a buffer/satisfy read from already buffered data). It is unclear whether extra effort will be made to ensure that the data of an acknowledged write was at least "with the device" (but in the O_DIRECT and barriers thread Christoph Hellwig posts that the O_DIRECT fallback will ensure data has at least been sent to the device). A further complication is that using O_DIRECT implies nothing about file metadata so even if write data is "with the device" by call completion, key file metadata (like the size of the file because you were doing an append) may not be. Thus you may not actually be able to get at the data you thought had been transferred after a crash (it may appear truncated, or all zeros etc).
While brief testing can make it look like O_DIRECT alone always implies data will be on disk after a write returns, changing things (e.g. using an Ext4 filesystem instead of XFS) can weaken what is actually achieved in very drastic ways.
As you mention "guarantee that the data" (rather than metadata) perhaps you're looking for O_DSYNC/fdatasync()? If you want to guarantee metadata was written too, you will have to look at O_SYNC/fsync().
References
Ext4 Wiki: Clarifying Direct IO's Semantics. Also contains notes about what O_DIRECT does on a few non-Linux OSes.
The "[PATCH 1/1 linux-next] ext4: add compatibility flag check to the patch" LKML thread has a reply from Ext4 lead dev Ted Ts'o talking about how filesystems can fallback to buffered I/O for O_DIRECT rather than failing the open() call.
In the "ubifs: Allow O_DIRECT" LKML thread Btrfs lead developer Chris Mason states Btrfs resorts to buffered I/O when O_DIRECT is requested on compressed files.
ZFS on Linux commit message discussing the semantics of O_DIRECT in different scenarios. Also see the (at the time of writing mid-2020) proposed new O_DIRECT semantics for ZFS on Linux (the interactions are complex and defy a brief explanation).
Linux open(2) man page (search for O_DIRECT in the Description section and the Notes section)
Ensuring data reaches disk LWN article
Infamous Linus Torvalds O_DIRECT LKML thread summary (for even more context you can see the full LKML thread)

How are the O_SYNC and O_DIRECT flags in open(2) different/alike?

The use and effects of the O_SYNC and O_DIRECT flags are very confusing and appear to vary somewhat among platforms. From the Linux man page (see an example here), O_DIRECT provides synchronous I/O, minimizes cache effects and requires you to handle block size alignment yourself. O_SYNC just guarantees synchronous I/O. Although both guarantee that data is written into the hard disk's cache, I believe that direct I/O operations are supposed to be faster than plain synchronous I/O since they bypass the page cache (though FreeBSD's man page for open(2) states that the cache is bypassed when O_SYNC is used. See here).
What exactly are the differences between the O_DIRECT and O_SYNC flags? Some implementations suggest using O_SYNC | O_DIRECT. Why?
O_DIRECT alone only promises that the kernel will avoid copying data from user space to kernel space, and will instead write it directly via DMA (Direct memory access; if possible). Data does not go into caches. There is no strict guarantee that the function will return only after all data has been transferred.
O_SYNC guarantees that the call will not return before all data has been transferred to the disk (as far as the OS can tell). This still does not guarantee that the data isn't somewhere in the harddisk write cache, but it is as much as the OS can guarantee.
O_DIRECT|O_SYNC is the combination of these, i.e. "DMA + guarantee".
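As a rough sketch of that "DMA + guarantee" combination (the filename, block size and 4096-byte alignment are placeholder choices; the real alignment requirements vary by filesystem and kernel, as discussed below):
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* O_DIRECT bypasses the page cache; O_SYNC makes each write wait until
     * the device has acknowledged the transfer. */
    int fd = open("data.bin", O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);

    void *buf;
    posix_memalign(&buf, 4096, 4096);   /* buffer address must be aligned */
    memset(buf, 0xAB, 4096);

    /* Transfer length and file offset must also satisfy the alignment rules. */
    pwrite(fd, buf, 4096, 0);

    free(buf);
    close(fd);
    return 0;
}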
Actually, under Linux 2.6, O_DIRECT is synchronous; see the man page of open(2), there are two sections about it. Under 2.4 it is not guaranteed:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block devices is described in raw(8).
but under 2.6 it is guaranteed, see
O_DIRECT
The O_DIRECT flag may impose alignment restrictions on the length and address of userspace buffers and the file offset of I/Os. In Linux alignment restrictions vary by file system and kernel version and might be absent entirely. However there is currently no file system-independent interface for an application to discover these restrictions for a given file or file system. Some file systems provide their own interfaces for doing so, for example the XFS_IOC_DIOINFO operation in xfsctl(3).
Under Linux 2.4, transfer sizes, and the alignment of the user buffer and the file offset must all be multiples of the logical block size of the file system. Under Linux 2.6, alignment to 512-byte boundaries suffices.
O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on the heap and statically allocated buffers). Any such I/Os, whether submitted via an asynchronous I/O interface or from another thread in the process, should be completed before fork(2) is called. Failure to do so can result in data corruption and undefined behavior in parent and child processes. This restriction does not apply when the memory buffer for the O_DIRECT I/Os was created using shmat(2) or mmap(2) with the MAP_SHARED flag. Nor does this restriction apply when the memory buffer has been advised as MADV_DONTFORK with madvise(2), ensuring that it will not be available to the child after fork(2).
The O_DIRECT flag was introduced in SGI IRIX, where it has alignment restrictions similar to those of Linux 2.4. IRIX has also a fcntl(2) call to query appropriate alignments, and sizes. FreeBSD 4.x introduced a flag of the same name, but without alignment restrictions.
O_DIRECT support was added under Linux in kernel version 2.4.10. Older Linux kernels simply ignore this flag. Some file systems may not implement the flag and open() will fail with EINVAL if it is used.
Applications should avoid mixing O_DIRECT and normal I/O to the same file, and especially to overlapping byte regions in the same file. Even when the file system correctly handles the coherency issues in this situation, overall I/O throughput is likely to be slower than using either mode alone. Likewise, applications should avoid mixing mmap(2) of files with direct I/O to the same files.
The behaviour of O_DIRECT with NFS will differ from local file systems. Older kernels, or kernels configured in certain ways, may not support this combination. The NFS protocol does not support passing the flag to the server, so O_DIRECT I/O will only bypass the page cache on the client; the server may still cache the I/O. The client asks the server to make the I/O synchronous to preserve the synchronous semantics of O_DIRECT. Some servers will perform poorly under these circumstances, especially if the I/O size is small. Some servers may also be configured to lie to clients about the I/O having reached stable storage; this will avoid the performance penalty at some risk to data integrity in the event of server power failure. The Linux NFS client places no alignment restrictions on O_DIRECT I/O.
In summary, O_DIRECT is a potentially powerful tool that should be used with caution. It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default.
"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances."---Linus
AFAIK, O_DIRECT bypasses the page cache. O_SYNC uses the page cache but syncs it immediately. The page cache is shared between processes, so another process working on the same file without the O_DIRECT flag can read the correct data.
This IBM doc explains the difference rather clearly, I think.
For a file opened in O_DIRECT mode ("direct I/O"), GPFS™ transfers data directly between the user buffer and the file on the disk. Using direct I/O may provide some performance benefits in the following cases:
The file is accessed at random locations.
There is no access locality.
Direct transfer between the user buffer and the disk can only happen if all of the following conditions are true:
The number of bytes transferred is a multiple of 512 bytes.
The file offset is a multiple of 512 bytes.
The user memory buffer address is aligned on a 512-byte boundary.
When these conditions are not all true, the operation will still proceed, but will be treated more like other normal file I/O with the O_SYNC flag, which flushes the dirty buffer to disk.

Bursty writes to SD/USB stalling my time-critical apps on embedded Linux

I'm working on an embedded Linux project that interfaces an ARM9 to a hardware video encoder chip, and writes the video out to SD card or USB stick. The software architecture involves a kernel driver that reads data into a pool of buffers, and a userland app that writes the data to a file on the mounted removable device.
I am finding that above a certain data rate (around 750kbyte/sec) I start to see the userland video-writing app stalling for maybe half a second, about every 5 seconds. This is enough to cause the kernel driver to run out of buffers - and even if I could increase the number of buffers, the video data has to be synchronised (ideally within 40ms) with other things that are going on in real time. Between these 5 second "lag spikes", the writes complete well within 40ms (as far as the app is concerned - I appreciate they're buffered by the OS)
I think this lag spike has to do with the way Linux flushes data out to disk - I note that pdflush is designed to wake up every 5s, and my understanding is that this is what does the writing. As soon as the stall is over the userland app is able to quickly service and write the backlog of buffers (that didn't overflow).
I think the device I'm writing to has reasonable ultimate throughput: copying a 15MB file from a memory fs and waiting for sync to complete (and the usb stick's light to stop flashing) gave me a write speed of around 2.7MBytes/sec.
I'm looking for two kinds of clues:
How can I stop the bursty writing from stalling my app - perhaps process priorities, realtime patches, or tuning the filesystem code to write continuously rather than burstily?
How can I make my app(s) aware of what is going on with the filesystem in terms of write backlog and throughput to the card/stick? I have the ability to change the video bitrate in the hardware codec on the fly which would be much better than dropping frames, or imposing an artificial cap on maximum allowed bitrate.
Some more info: this is a 200MHz ARM9 currently running a Montavista 2.6.10-based kernel.
Updates:
Mounting the filesystem with the sync option makes throughput much too poor.
The removable media is FAT/FAT32 formatted and must be as the purpose of the design is that the media can be plugged into any Windows PC and read.
Regularly calling sync() or fsync(), say every second, causes regular stalls and unacceptably poor throughput.
I am using write() and open(O_WRONLY | O_CREAT | O_TRUNC) rather than fopen() etc.
I can't immediately find anything online about the mentioned "Linux realtime filesystems". Links?
I hope this makes sense. First embedded Linux question on stackoverflow? :)
For the record, there turned out to be two main aspects that seem to have eliminated the problem in all but the most extreme cases. This system is still in development and hasn't been thoroughly torture-tested yet but is working fairly well (touch wood).
The big win came from making the userland writer app multi-threaded. It is the calls to write() that block sometimes: other processes and threads still run. So long as I have a thread servicing the device driver and updating frame counts and other data to synchronise with other apps that are running, the data can be buffered and written out a few seconds later without breaking any deadlines. I tried a simple ping-pong double buffer first but that wasn't enough; small buffers would be overwhelmed and big ones just caused bigger pauses while the filesystem digested the writes. A pool of 10 1MB buffers queued between threads is working well now.
The other aspect is keeping an eye on ultimate write throughput to physical media. For this I am keeping an eye on the stat Dirty: reported by /proc/meminfo. I have some rough and ready code to throttle the encoder if Dirty: climbs above a certain threshold, seems to vaguely work. More testing and tuning needed later. Fortunately I have lots of RAM (128M) to play with giving me a few seconds to see my backlog building up and throttle down smoothly.
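For reference, a rough sketch of that kind of monitoring in C (the threshold is arbitrary, and the parsing assumes the usual "Dirty: <n> kB" line format of /proc/meminfo):
#include <stdio.h>

/* Returns the "Dirty:" value from /proc/meminfo in kB, or -1 on error. */
static long dirty_kb(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f)
        return -1;

    char line[128];
    long kb = -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "Dirty: %ld kB", &kb) == 1)
            break;
    }
    fclose(f);
    return kb;
}

int main(void)
{
    long kb = dirty_kb();
    if (kb > 8 * 1024)            /* arbitrary threshold: 8 MB of dirty pages */
        printf("backlog building up (%ld kB dirty): throttle the encoder\n", kb);
    else
        printf("dirty pages: %ld kB\n", kb);
    return 0;
}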
I'll try to remember to pop back and update this answer if I find I need to do anything else to deal with this issue. Thanks to the other answerers.
I'll throw out some suggestions, advice is cheap.
make sure you are using a lower-level API for writing to the disk; don't use user-mode caching functions like fopen, fread and fwrite, use the lower-level functions open, read, write.
pass the O_SYNC flag when you open the file; this will cause each write operation to block until written to disk, which will remove the bursty behavior of your writes... at the expense of each write being slower.
If you are doing reads/ioctls from a device to grab a chunk of video data, you may want to consider allocating a shared memory region between the application and kernel, otherwise you are getting hit with a bunch of copy_to_user calls when transferring video data buffers from kernel space to user space.
You may need to validate that your USB flash device is fast enough with sustained transfers to write the data.
Just a couple thoughts, hope this helps.
Here is some information about tuning pdflush for write-heavy operations.
Sounds like you're looking for linux realtime filesystems. Be sure to search Google et al for that.
XFS has a realtime option, though I haven't played with it.
hdparm might let you turn off the caching altogether.
Tuning the filesystem options (turn off all the extra unneeded file attributes) might reduce what you need to flush, thus speeding the flush. I doubt that'd help much, though.
But my suggestion would be to avoid using the stick as a filesystem at all and instead use it as a raw device. Stuff data on it like you would using 'dd'. Then elsewhere read that raw data and write it out after baking.
Of course, I don't know if that's an option for you.
As a debugging aid, you could use strace to see which operations are taking time.
There might be some surprises with FAT/FAT32.
Do you write into a single file, or into multiple files?
You can make a reading thread that maintains a pool of video buffers ready to be written, in a queue.
When a frame is received, it is added to the queue and the writing thread is signaled.
Shared data
empty_buffer_queue
ready_buffer_queue
video_data_ready_semaphore
Reading thread:
buf = get_buffer()
buffer_to_write = buf_dequeue(empty_buffer_queue)
memcpy(buffer_to_write, buf)
buf_enqueue(buffer_to_write, ready_buffer_queue)
sem_post(video_data_ready_semaphore)
Writing thread:
sem_wait(video_data_ready_semaphore)
buffer_to_write = buf_dequeue(ready_buffer_queue)
write_buffer(buffer_to_write)
buf_enqueue(buffer_to_write, empty_buffer_queue)
If your writing thread is blocked waiting in the kernel, this could work.
However, if you are blocked inside kernel space, then there is not much you can do, except look for a more recent kernel than your 2.6.10.
Without knowing more about your particular circumstances, I can only offer the following guesses:
Try using fsync()/sync() to force the kernel to flush data to the storage device more frequently. It sounds like the kernel buffers all your writes and then ties up the bus or otherwise stalls your system while performing the actual write. With careful calls to fsync() you can try to schedule writes over the system bus in a more fine grained way.
It might make sense to structure the application in such a way that the encoding/capture (you didn't mention video capture, so I'm making an assumption here - you might want to add more information) task runs in its own thread and buffers its output in userland - then, a second thread can handle writing to the device. This will give you a smoothing buffer to allow the encoder to always finish its writes without blocking.
One thing that sounds suspicious is that you only see this problem at a certain data rate - if this really was a buffering issue, I'd expect the problem to happen less frequently at lower data rates, but I'd still expect to see this issue.
In any case, more information might prove useful. What's your system's architecture? (In very general terms.)
Given the additional information you provided, it sounds like the device's throughput is rather poor for small writes and frequent flushes. If you're sure that for larger writes you can get sufficient throughput (and I'm not sure that's the case, but the file system might be doing something stupid, like updating the FAT after every write), then have an encoding thread pipe data to a writing thread, with sufficient buffering in the writing thread to avoid stalls. I've used shared memory ring buffers in the past to implement this kind of scheme, but any IPC mechanism that would allow the writer to write to the I/O process without stalling unless the buffer is full should do the trick.
A useful Linux function and alternative to sync or fsync is sync_file_range. This lets you schedule data for writing without waiting for the in-kernel buffer system to get around to it.
To avoid long pauses, make sure your IO queue (for example: /sys/block/hda/queue/nr_requests) is large enough. That queue is where data goes in between being flushed from memory and arriving on disk.
Note that sync_file_range isn't portable, and is only available in kernels 2.6.17 and later.
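Here is a minimal sketch of that pattern (the chunk size and file name are arbitrary; sync_file_range() is Linux-specific and needs _GNU_SOURCE). It follows the staged approach from the man page: start write-back for the chunk just written, and bound the backlog by waiting for the chunk written one iteration earlier:
#define _GNU_SOURCE            /* for sync_file_range() */
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (256 * 1024)     /* arbitrary chunk size for illustration */

int main(void)
{
    int fd = open("video.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    static char chunk[CHUNK];  /* zero-filled dummy data */

    off_t off = 0;
    for (int i = 0; i < 16; i++) {
        write(fd, chunk, CHUNK);

        /* Kick off write-back for this chunk right away, without waiting. */
        sync_file_range(fd, off, CHUNK, SYNC_FILE_RANGE_WRITE);

        /* Bound the dirty backlog: before moving on, wait for the chunk
         * written in the previous iteration to reach the device. */
        if (off >= CHUNK)
            sync_file_range(fd, off - CHUNK, CHUNK,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER);
        off += CHUNK;
    }
    close(fd);
    return 0;
}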
I've been told that after the host sends a command, MMC and SD cards "must respond within 0 to 8 bytes".
However, the spec allows these cards to respond with "busy" until they have finished the operation, and apparently there is no limit to how long a card can claim to be busy (please, please tell me if there is such a limit).
I see that some low-cost flash chips such as the M25P80 have a guaranteed "maximum single-sector erase time" of 3 seconds, although typically it "only" requires 0.6 seconds.
That 0.6 seconds sounds suspiciously similar to your "stalling for maybe half a second".
I suspect the tradeoff between cheap, slow flash chips and expensive, fast flash chips has something to do with the wide variation in USB flash drive results:
http://www.testfreaks.com/blog/information/16gb-usb-drive-comparison-17-drives-compared/
http://www.tomshardware.com/reviews/data-transfer-run,1037-10.html
I've heard rumors that every time a flash sector is erased and then re-programmed, it takes a little bit longer than the last time.
So if you have a time-critical application, you may need to (a) test your SD cards and USB sticks to make sure they meet the minimum latency, bandwidth, etc. required by your application, and (b) periodically re-test or pre-emptively replace these memory devices.
The obvious thing first: have you tried explicitly telling the file to flush? I also think there might be some ioctl you can use to do it, but I honestly haven't done much C/POSIX file programming.
Seeing you're on a Linux kernel you should be able to tune and rebuild the kernel to something that suits your needs better, eg. much more frequent but then also smaller flushes to the permanent storage.
A quick check in my man pages finds this:
SYNC(2) Linux Programmer’s Manual SYNC(2)
NAME
sync - commit buffer cache to disk
SYNOPSIS
#include <unistd.h>
void sync(void);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sync(): _BSD_SOURCE || _XOPEN_SOURCE >= 500
DESCRIPTION
sync() first commits inodes to buffers, and then buffers to disk.
ERRORS
This function is always successful.
Doing your own flush()ing sounds right to me - you want to be in control, not leave it to the vagaries of the generic buffer layer.
This may be obvious, but make sure you're not calling write() too often - make sure every write() has enough data to be written to make the syscall overhead worth it. Also, in the other direction, don't call it too seldom, or it'll block for long enough to cause a problem.
On a more difficult-to-reimplement track, have you tried switching to asynchronous i/o? Using aio you could fire off a write and hand it one set of buffers while you're sucking video data into the other set, and when the write finishes you switch sets of buffers.
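A rough sketch of that double-buffering idea using POSIX AIO (aio_write() from <aio.h>; link with -lrt on older glibc). The fill_buffer() helper, file name and buffer size are hypothetical stand-ins for the video capture side:
#include <aio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE (1024 * 1024)

static char bufs[2][BUF_SIZE];

/* Hypothetical stand-in for the capture side: returns BUF_SIZE bytes of
 * dummy data for a fixed number of "frames", then 0 to end the stream. */
static ssize_t fill_buffer(char *buf, size_t len)
{
    static int frames = 16;
    if (frames-- <= 0)
        return 0;
    memset(buf, 0x55, len);
    return (ssize_t)len;
}

int main(void)
{
    int fd = open("video.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;

    int cur = 0, pending = 0;
    off_t offset = 0;

    for (;;) {
        /* Fill one buffer while the other buffer's write is still in flight. */
        ssize_t n = fill_buffer(bufs[cur], BUF_SIZE);

        if (pending) {             /* wait for the previous write to complete */
            const struct aiocb *list[1] = { &cb };
            aio_suspend(list, 1, NULL);
            aio_return(&cb);
        }
        if (n <= 0)
            break;

        cb.aio_buf    = bufs[cur];
        cb.aio_nbytes = (size_t)n;
        cb.aio_offset = offset;
        aio_write(&cb);            /* queue the write, then go refill the other buffer */
        pending = 1;

        offset += n;
        cur ^= 1;
    }
    close(fd);
    return 0;
}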

SD card write performance

I am writing a little application which writes JPEG images at a constant rate to an SD card.
I chose an ext3 filesystem, but the same behaviour was observed with an ext2 filesystem.
My writing loop looks like this:
get_image()
fwrite()
fsync()
Or like this :
get_image()
fopen()
fwrite()
fsync()
fclose()
I also display some timing statistics, and I can see my program is sometimes blocked for several seconds.
The average rate is still good, because if I keep the incoming images in a FIFO, I will write many images in a short period of time after such a stall. Do you know if it is a problem with the OS or if it is related to the SD card itself?
How could I move closer to realtime ? I don't need strong realtime, but being stalled for several seconds is not acceptable.
Some precision :
Yes it is necessary to fsync after every file, because I want the image to be on disk, not in some user or kernel buffer. Without fsyncing, I have much better throughput,
but still unacceptable stalls. I don't think it is a buffer problem, since the first stall happens after 50 Mbytes have been written. And according to the man page, fsync is here precisely to ensure there is no data buffered.
Precision regarding the average write rate :
I am writing at a rate that is sustainable by the card I am using. If I pile incoming image while waiting for an fsync to complete, then after this stall the write transfer rate will increase and I will quickly go back to the average rate.
The average transfer rate is around 1.4 MBytes /s.
The system is a modern laptop running Ubuntu 8.04 with the stock kernel (2.6.24.19).
Try to open the file with O_DIRECT and do the caching at the application level.
We met a similar issue when we were implementing a PVR (Personal Video Recorder) feature in a set-top box. The O_DIRECT trick finally satisfied our need. (*)
Without O_DIRECT, the data of write() will first be cached in the kernel buffer and then flushed to the media when you call fsync or when the kernel cache buffer is full. (**)
With O_DIRECT, the kernel will do DMA directly to/from the physical memory pointed to by the userspace buffer passed as a parameter to the write syscall. So no CPU and memory bandwidth is spent on copies between userspace memory and the kernel cache, and no CPU time is spent in the kernel on the management of the cache (like cache lookups, per-page locks etc.). (copied from here)
Not sure it will also solve your problem, but you might want to give it a try.
* Despite Linus's criticism of O_DIRECT, it did solve our problems.
** assuming you did not open the file with O_DSYNC or O_SYNC
Is it necessary to fsync() after every file? You may have better luck letting the OS decide when a good time is to write out all enqueued images to the SD card (amortizing the startup cost of manipulating the SD card filesystem over many images, rather than incurring it for every image).
Can you provide some more details about your platform? Slow I/O times may have to do with other processes on the system, a slow I/O controller, etc..
You might also consider using a filesystem more suited to how flash memory works. FAT32 is more common than extN, but a filesystem specifically built for SD may be in order as well. JFFS is a good example of this. You will probably get better performance with a filesystem designed for flash (as opposed to spinning magnetic media), and you get better wear-leveling (and thus device lifetime/reliability) properties as well.
AFAIK some flash disks have really bad write performance (esp. cheap brands). So if you measure the write speed of your application (including the time required for fsync), what do you get? It might easily be in the order of very few megabytes per second - just because the hardware doesn't do better.
Also, apparently writing can be much slower if you write many small blocks instead of one big block (the flash disk might only get about 10 writes per second done, in bad cases). This is probably something that can be mitigated by the kernel buffer, so using fsync frequently might slow down the writing...
Btw. did you measure write performance on FAT32? I would guess it is about the same, but if not, maybe there's some optimization still available?
I'm not very knowledgeable in this area, but the symptoms you describe sound an awful lot like filling up a buffer. You may be filling a buffer in the file writer or in the I/O device communicating with the SD card itself. You then have to wait until it actually writes the data to the card (thus emptying the buffer) before you can write more. SD cards are not particularly fast writers. If you can find a way to check if data is actually being written to the card during these pauses, that would verify my theory. Some card readers have an LED that blinks when data is being accessed -- that would probably be a good indicator.
Just a hunch... take it with some salt :)
May be this will help - Benchmarking Filesystems:
...I was quite surprised how slow ext3 was overall, as many distributions use this file system as their default file system...
And "ext3 fsync batching":
...This patch measures the time it takes to commit a transaction to the disk, and sleeps based on the speed of the underlying disk.
For anyone reading this and using a kernel above 2.6.28, the recommendation is to use ext4 instead of ext3, which is a filesystem that you can tune for better performance. The best performance is obtained in data=writeback mode, where data is not journaled. Read the Data Mode section from https://www.kernel.org/doc/Documentation/filesystems/ext4.txt.
If you have a partition already created, say /dev/sdb1, then these are some steps that can be used to format it with ext4 without journaling:
mkfs.ext4 /dev/sdb1 -L jp # Creates the ext4 filesystem
tune2fs -o journal_data_writeback /dev/sdb1 # Set to writeback mode
tune2fs -O ^has_journal /dev/sdb1 # Disable journaling
sudo e2fsck -f /dev/sdb1 # Filesystem check is required
Then, you can mount this partition (or set an entry in /etc/fstab if you know what you're doing) with the corresponding flags:
mount -t ext4 -o noatime,nodiratime,data=writeback /dev/sdb1 /mnt/sd
Moving from ext3 to an optimized ext4 filesystem should make a drastic difference. And, of course, if your SD card is quicker that should help (e.g. class 10).
See also https://developer.ridgerun.com/wiki/index.php/High_performance_SD_card_tuning_using_the_EXT4_file_system
You might also consider the SD card itself: is it NOR or NAND? This page shows an order of magnitude difference between SD cards (2 MB/s vs 20 MB/s).
http://www.robgalbraith.com/bins/camera_multi_page.asp?cid=6007-9597
I think ZFS is optimized for flash memory.
