I'm currently adding DMA to my PCIe driver for Linux. As I read through the documentation, it mentions consistent, or coherent, memory via the API:
pci_set_consistent_dma_mask(...)
but it never really explains why to use it or what it does; it only seems to say that calling the function is best practice and future-proofing. The best I can gather is that consistent DMA memory has no cache effects, and that data moves between the device (an FPGA) and the CPU without any software/driver intervention once it is set up correctly (assuming I read correctly).
So my questions are:
Assuming a PCIe device does not require consistent memory, why would anyone use it, and in what cases is consistent memory used?
If I use consistent memory, do I not need to implement an interrupt in the PCIe driver for DMA? If so, how do the userspace code and the device know a transfer has occurred?
If I transfer a lot of small packets, ~50 bytes, continuously and on occasion larger packets, ~6 kB, which DMA memory is better: consistent or streaming?
Think about it this way: "Consistent" means it will be automatically coherent between CPU and bus without doing anything to specifically synchronize it. For example - say I have a memory ring for inbound and outbound packets. Its lifespan will be the entire time the system is in use, and I'm going to be checking it all the time. I want this to be always consistent, because if it isn't I would have to (manually) flush or synchronize the caches, and if this were costly, and I had to do it every time I touched the ring - it would be a nightmare.
On the other hand - let's take a single data buffer I'm transferring. It's kind of a "one off" deal. I can let the device transfer it - and maybe it takes many PCI cycles to complete the DMA. And maybe this is inconsistent. That's okay - but when it's done I can flush/sync caches/force consistency. If it took a tiny bit of extra time to do so - no problem - because I'm just doing it once.
So you might ask "why not make everything consistent?" The answer is that there is generally some overhead in making things consistent. Depending on the architecture, this can be significant. So in such cases, there are provisions for inconsistent (streaming) mappings which don't do cache consistency (but require an explicit sync). Allowing an inconsistent transfer can gain you some performance.
Remember too - there are some cases where you would never need any consistency. For example - reading a buffer from a network device to memory, then writing that memory to a disk controller. This data may never be read/used by the CPU at all - so why bother placing any overhead on the CPU cache to track it.
As for your comment about the "interrupt" - this is kind of odd. In a "normal" case, you might have a control structure in consistent memory (like Tx/Rx rings) which you could poll to tell you whether a transaction is done. But the actual data transferred would be in different memory, which could be streaming or non-consistent.
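To make the ring-vs-buffer distinction concrete, here is a rough sketch using the generic Linux DMA API (the pci_* calls in the question are thin wrappers around these); the device pointer, sizes and buffer names are placeholders and error handling is omitted:

/* Long-lived ring the CPU polls constantly: use a coherent ("consistent")
   allocation so no explicit cache synchronisation is ever needed. */
dma_addr_t ring_dma;
void *ring = dma_alloc_coherent(&pdev->dev, RING_BYTES, &ring_dma, GFP_KERNEL);

/* One-off data buffer: use a streaming mapping and sync explicitly. */
dma_addr_t buf_dma = dma_map_single(&pdev->dev, buf, BUF_BYTES, DMA_FROM_DEVICE);
/* ... program the device with buf_dma, wait for the DMA to complete ... */
dma_sync_single_for_cpu(&pdev->dev, buf_dma, BUF_BYTES, DMA_FROM_DEVICE);
/* the CPU can now safely read buf; hand ownership back before reusing it: */
dma_sync_single_for_device(&pdev->dev, buf_dma, BUF_BYTES, DMA_FROM_DEVICE);
/* ... */
dma_unmap_single(&pdev->dev, buf_dma, BUF_BYTES, DMA_FROM_DEVICE);
dma_free_coherent(&pdev->dev, RING_BYTES, ring, ring_dma);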
1) Imagine you want to transfer a huge amount of data over PCIe at a high rate. You have to use a scatter/gather list, and you can use consistent memory to prepare this list for the FPGA, so the FPGA can read the list very fast and then do the transfers.
2) Of course you need interrupts; otherwise you have to use polling, which is slow and unreliable.
3) If you use a larger consistent memory region, you can minimize interrupt/polling overhead, so transfers are faster, but Windows usually doesn't allow you to allocate large consistent memory regions.
I am doing fio testing of /dev/pmem for sequential reads, using the command:
fio --name=readf --filename=/dev/pmem --iodepth=4 --ioengine=libaio --direct=1 --buffered=0 --groupreprting --timebased --bs=64k --size=10g --rw=read --norandommap --refillbuffers=1 --randrepeat=0 --runtime=300
The question is vague and can be read as "what does your maximum disk bandwidth depend on?" Among other things, it depends on:
Speed of the underlying device.
State of the underlying device.
Busyness of the system.
Amount of I/O that can be sent down in tandem.
How I/O is batched together.
Block size chosen (disks generally have an optimal size).
Whether the I/O is sequential or random.
Whether there is other I/O happening to the same disk (e.g. SMART updates).
Whether cache on the device is exhausted.
Whether the device has to do maintenance.
Size of caches.
Amount of I/O put down.
Compressibility of data.
Configuration parameters of the OS.
Configuration parameters of the hardware.
Size of the region I/O is done within.
In the job given, the things that stand out are:
The iodepth looks a bit low; try boosting it until you no longer see a benefit.
Setting norandommap is meaningless for a sequential job.
Setting both direct=1 and buffered=0 is redundant.
groupreprting and timebased are spelt incorrectly (group_reporting and time_based).
refillbuffers is also spelt wrong (refill_buffers); you might get away with scramble_buffers at lower overhead, but at the risk of less random data.
You might get some benefit from pinning fio to appropriate CPUs
You might get some benefit submitting I/O in batches
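Putting the spelling fixes together, a cleaned-up invocation might look something like this (iodepth=32 is just an example starting point to tune from; group_reporting, time_based and refill_buffers are the intended option names):

fio --name=readf --filename=/dev/pmem --iodepth=32 --ioengine=libaio --direct=1 --group_reporting --time_based --bs=64k --size=10g --rw=read --refill_buffers --randrepeat=0 --runtime=300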
I have stumbled more than once over the terms "non-coherent" and "coherent" memory in tech papers related to graphics programming. I have been searching for a simple and clear explanation, but have found mostly 'hardcore' papers of this type. I would be glad to receive a layman's-style answer on what coherent memory actually is on GPU architectures and how it compares to other (presumably non-coherent) memory types.
Memory is memory. But different things can access that memory. The GPU can access memory, the CPU can access memory, maybe other hardware bits, whatever.
A particular thing has "coherent" access to memory if changes made by others to that memory are visible to the reader. Now, you might think this is foolishness. After all, if the memory has been changed, how could someone possibly be unable to see it?
Simply put, caches.
It turns out that changing memory is expensive. So we do everything possible to avoid changing memory unless we absolutely have to. When you write a single byte from the CPU to a pointer in memory, the CPU doesn't write that byte yet. Or at least, not to memory. It writes it to a local copy of that memory called a "cache."
The reason for this is that, generally speaking, applications do not write (or read) single bytes. They are more likely to write (and read) lots of bytes, in small chunks. So if you're going to perform an expensive operation like a memory load or store, you should load or store a large chunk of memory. So you store all of the changes you're going to make to a chunk of memory in a cache, then make a single write of that cached chunk to actual memory at some point in the future.
But if you have two separate devices that use the same memory, you need some way to be certain that writes one device makes are visible to other devices. Most GPUs can't read the CPU cache. And most CPU languages don't have language-level support to say "hey, that stuff I wrote to memory? I really mean for you to write it to memory now." So you usually need something to ensure visibility of changes.
In Vulkan, memory which is labeled by VK_MEMORY_PROPERTY_HOST_COHERENT_BIT means that, if you read/write that memory (via a mapped pointer, since that's the only way Vulkan lets you directly write to memory), you don't need to use functions vkInvalidateMappedMemoryRanges/vkFlushMappedMemoryRanges to make sure the CPU/GPU can see those changes. The visibility of any changes is guaranteed in both directions. If that flag isn't available on the memory, then you must use the aforementioned functions to ensure the coherency of the specific regions of data you want to access.
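As a rough sketch of what that looks like in code (assuming a host-visible but non-coherent VkDeviceMemory called memory that you have already mapped and written through the mapped pointer; alignment of offset/size to nonCoherentAtomSize is glossed over by using VK_WHOLE_SIZE):

VkMappedMemoryRange range = {0};
range.sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
range.memory = memory;            /* the non-coherent allocation */
range.offset = 0;
range.size   = VK_WHOLE_SIZE;     /* or just the bytes you touched */

/* CPU wrote through the mapped pointer -> make it visible to the GPU */
vkFlushMappedMemoryRanges(device, 1, &range);

/* GPU wrote to the memory -> make it visible to CPU reads through the pointer */
vkInvalidateMappedMemoryRanges(device, 1, &range);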
With coherent memory, one of two things is going on in terms of hardware. Either CPU access to the memory is not cached in any of the CPU's caches, or the GPU has direct access to the CPU's caches (perhaps due to being on the same die as the CPU(s)). You can usually tell that the latter is happening, because on-die GPU implementations of Vulkan don't bother to offer non-coherent memory options.
If memory is coherent then all threads accessing that memory must agree on the state of the memory at all times, e.g.: if thread 0 reads memory location A and thread 1 reads the same location at the same time, both threads should always read the same value.
But if memory is not coherent, then threads 0 and 1 might read back different values. Thread 0 could think that location A contains a 1, while thread 1 thinks that location contains a 2. The different threads would have an incoherent view of the memory.
Coherence is hard to achieve with a high number of cores. Often every core must be aware of memory accesses from all other cores. So if you have 4 cores in a quad-core CPU, coherence is not that hard to achieve, as every core must be informed about the memory access addresses of the 3 other cores; but in a GPU with 16 cores, every core must be made aware of the memory accesses of the 15 other cores. The cores exchange data about the contents of their caches using so-called "cache coherence protocols".
This is why GPUs often only support limited forms of coherency. If some memory locations are read only or are only accessed by a single thread, then no coherence is required. If caches are small and coherence is not always required but only at specific instructions of the program, then it is possible to achieve correct behavior of the program using cache flushes before or after specific memory accesses.
If your hardware offers both coherent and non-coherent memory types, then you can expect that non-coherent memory will be faster, but if you try to run parallel algorithms using this memory they will fail in really weird ways.
My question is this: how can I determine when it is safe to disable cache snooping when I am correctly using [pci_]dma_sync_single_for_{cpu,device} in my device driver?
I'm working on a device driver for a device which writes directly to RAM over PCI Express (DMA), and I am concerned about managing cache coherence. There is a control bit I can set when initiating DMA to enable or disable cache snooping during DMA; clearly, for performance, I would like to leave cache snooping disabled if at all possible.
In the interrupt routine I call pci_dma_sync_single_for_cpu() and ..._for_device() as appropriate, when switching DMA buffers, but on 32-bit Linux 2.6.18 (RHEL 5) it turns out that these commands are macros which expand to nothing ... which explains why my device returns garbage when cache snooping is disabled on this kernel!
I've trawled through the history of the kernel sources, and it seems that up until 2.6.25 only 64-bit x86 had hooks for DMA synchronisation. From 2.6.26 there seems to be a generic unified indirection mechanism for DMA synchronisation (currently in include/asm-generic/dma-mapping-common.h) via fields sync_single_for_{cpu,device} of dma_map_ops, but so far I've failed to find any definitions of these operations.
I'm really surprised no one has answered this, so here we go on a non-Linux specific answer (I have insufficient knowledge of the Linux kernel itself to be more specific) ...
Cache snooping simply tells the DMA controller to send cache invalidation requests to all CPUs for the memory being DMAed into. This obviously adds load to the cache coherency bus, and it scales particularly badly with additional processors as not all CPUs will have a single hop connection with the DMA controller issuing the snoop. Therefore, the simple answer to "when it is safe to disable cache snooping" is when the memory being DMAed into either does not exist in any CPU cache OR its cache lines are marked as invalid. In other words, any attempt to read from the DMAed region will always result in a read from main memory.
So how do you ensure reads from a DMAed region will always go to main memory?
Back in the day before we had fancy features like DMA cache snooping, what we used to do was to pipeline DMA memory by feeding it through a series of broken up stages as follows:
Stage 1: Add "dirty" DMA memory region to the "dirty and needs to be cleaned" DMA memory list.
Stage 2: Next time the device interrupts with fresh DMA'ed data, issue an async local CPU cache invalidate for DMA segments in the "dirty and needs to be cleaned" list for all CPUs which might access those blocks (often each CPU runs its own lists made up of local memory blocks). Move said segments into a "clean" list.
Stage 3: Next DMA interrupt (which of course you're sure will not occur before the previous cache invalidate has completed), take a fresh region from the "clean" list and tell the device that its next DMA should go into that. Recycle any dirty blocks.
Stage 4: Repeat.
As much as this is more work, it has several major advantages. Firstly, you can pin DMA handling to a single CPU (typically the primary CPU0) or a single SMP node, which means only a single CPU/node need worry about cache invalidation. Secondly, you give the memory subsystem much more opportunity to hide memory latencies for you by spacing out operations over time and spreading out load on the cache coherency bus. The key for performance is generally to try and make any DMA occur on a CPU as close to the relevant DMA controller as possible and into memory as close to that CPU as possible.
If you always hand off newly DMAed into memory to user space and/or other CPUs, simply inject freshly acquired memory in at the front of the async cache invalidating pipeline. Some OSs (not sure about Linux) have an optimised routine for preordering zeroed memory, so the OS basically zeros memory in the background and keeps a quick satisfy cache around - it will pay you to keep new memory requests below that cached amount because zeroing memory is extremely slow. I'm not aware of any platform produced in the past ten years which uses hardware offloaded memory zeroing, so you must assume that all fresh memory may contain valid cache lines which need invalidating.
I appreciate this only answers half your question, but it's better than nothing. Good luck!
Niall
Maybe a bit overdue, but:
If you disable cache snooping, hardware will no longer take care of cache coherency. Hence, the kernel needs to do this itself. Over the past few days, I've spent some time reviewing the x86 variants of [pci_]dma_sync_single_for_{cpu,device}. I've found no indication that they make any effort to maintain coherency. This seems consistent with the fact that cache snooping is turned on by default in the PCI(e) spec.
Hence, if you are turning off cache snooping, you will have to maintain coherency yourself, in your driver. Possibly by calling clflush_cache_range() (X86) or similar?
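For example (x86-only, and only a sketch): assuming buf and len describe a receive buffer the device has just DMAed into with snooping disabled, the driver could do something like:

#include <asm/cacheflush.h>

/* With snooping off, any stale cache lines covering buf must be flushed
   before the CPU reads the freshly DMAed data, or it may see old contents. */
clflush_cache_range(buf, len);
/* the CPU can now read buf safely */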
Refs:
http://lkml.indiana.edu/hypermail/linux/kernel/0709.0/1329.html
I'm writing lots and lots of data that will not be read again for weeks. As my program runs, the amount of free memory on the machine (displayed with 'free' or 'top') drops very quickly; the amount of memory my app uses does not increase, and neither does the amount of memory used by other processes.
This leads me to believe the memory is being consumed by the filesystem's cache. Since I do not intend to read this data for a long time, I'm hoping to bypass the system's buffers, so that my data is written directly to disk. I don't have dreams of improving perf or being a super ninja; my hope is to give a hint to the filesystem that I'm not going to be coming back for this memory any time soon, so don't spend time optimizing for those cases.
On Windows I've faced similar problems and fixed them using FILE_FLAG_NO_BUFFERING|FILE_FLAG_WRITE_THROUGH - the machine's memory was not consumed by my app and the machine was more usable in general. I'm hoping to duplicate the improvements I've seen, but on Linux. On Windows there is the restriction of writing in sector-sized pieces; I'm happy with this restriction for the amount of gain I've measured.
Is there a similar way to do this in Linux?
The closest equivalent to the Windows flags you mention I can think of is to open your file with the open(2) flags O_DIRECT | O_SYNC:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In
general this will degrade performance, but it is useful in special
situations, such as when applications do their own caching. File I/O
is done directly to/from user space buffers. The O_DIRECT flag on its
own makes an effort to transfer data synchronously, but does not
give the guarantees of the O_SYNC flag that data and necessary metadata
are transferred. To guarantee synchronous I/O, O_SYNC must be used in
addition to O_DIRECT. See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block devices is
described in raw(8).
Granted, while researching this flag to confirm it's what you want, I found an interesting piece saying that unbuffered I/O is a bad idea, with Linus describing it as "brain damaged". According to that, you should be using madvise() instead to tell the kernel how to cache pages. YMMV.
You can use O_DIRECT, but in that case you need to do the block I/O yourself; you must write in multiples of the FS block size and on block boundaries (it is possible that this is not strictly mandatory, but if you don't, performance will be orders of magnitude worse because every unaligned write will need a read first).
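As a minimal sketch of what that looks like (4096 is assumed as a safe alignment/block size, and bigfile.dat is just a placeholder name):

#define _GNU_SOURCE                 /* exposes O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    /* O_DIRECT needs the buffer, length and file offset aligned; 4096 is
       a safe choice for most filesystems/devices. */
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;
    memset(buf, 'x', 4096);         /* stand-in for real data */

    int fd = open("bigfile.dat", O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);
    if (fd < 0)
        return 1;
    if (write(fd, buf, 4096) != 4096)   /* block-sized, aligned write */
        return 1;
    close(fd);
    free(buf);
    return 0;
}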
Another much less impacting way of stopping your blocks using up the OS cache without using O_DIRECT, is to use posix_fadvise(fd, offset,len, POSIX_FADV_DONTNEED). Under Linux 2.6 kernels which support it, this immediately discards (clean) blocks from the cache. Of course you need to use fdatasync() or such like first, otherwise the blocks may still be dirty and hence won't be cleared from the cache.
It is probably a bad idea to call fdatasync() and posix_fadvise( ... POSIX_FADV_DONTNEED) after every write; instead, wait until you've written a reasonable amount (50 MB, 100 MB maybe).
So in short, after every significant chunk of writes:
Call fdatasync() followed by posix_fadvise( ... POSIX_FADV_DONTNEED).
This will flush the data to disc and immediately remove it from the OS cache, leaving space for more important things.
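A small sketch of that pattern (the 50 MB threshold, the counters and the note_write helper are all made up for illustration):

#include <fcntl.h>
#include <unistd.h>

#define FLUSH_CHUNK (50 * 1024 * 1024)   /* flush and drop every ~50 MB */

static off_t flushed_up_to;          /* start of the not-yet-dropped range */
static off_t written_since_flush;

void note_write(int fd, ssize_t n)   /* call after each successful write() */
{
    written_since_flush += n;
    if (written_since_flush >= FLUSH_CHUNK) {
        fdatasync(fd);               /* make the dirty pages clean first */
        posix_fadvise(fd, flushed_up_to, written_since_flush,
                      POSIX_FADV_DONTNEED);   /* then drop them from the cache */
        flushed_up_to += written_since_flush;
        written_since_flush = 0;
    }
}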
Some users have found that things like fast-growing log files can easily blow "more useful" stuff out of the disc cache, which reduces cache hits a lot on a box which needs to have a lot of read cache, but also writes logs quickly. This is the main motivation for this feature.
However, like any optimisation
a) You're not going to need it so
b) Do not do it (yet)
as my program runs the amount of free memory on the machine drops very quickly
Why is this a problem? Free memory is memory that isn't serving any useful purpose. When it's used to cache data, at least there is a chance it will be useful.
If one of your programs requests more memory, file caches will be the first thing to go. Linux knows that it can re-read that data from disk whenever it wants, so it will just reap the memory and give it a new use.
It's true that Linux by default waits around 30 seconds (this is what the value used to be anyhow) before flushing writes to disk. You can speed this up with a call to fsync(). But once the data has been written to disk, there's practically zero cost to keeping a cache of the data in memory.
Seeing as you write to the file and don't read from it, Linux will probably guess that this data is the best to throw out, in preference to other cached data. So don't waste effort trying to optimise unless you've confirmed that it's a performance problem.
I'm working on an embedded Linux project that interfaces an ARM9 to a hardware video encoder chip, and writes the video out to SD card or USB stick. The software architecture involves a kernel driver that reads data into a pool of buffers, and a userland app that writes the data to a file on the mounted removable device.
I am finding that above a certain data rate (around 750kbyte/sec) I start to see the userland video-writing app stalling for maybe half a second, about every 5 seconds. This is enough to cause the kernel driver to run out of buffers - and even if I could increase the number of buffers, the video data has to be synchronised (ideally within 40ms) with other things that are going on in real time. Between these 5 second "lag spikes", the writes complete well within 40ms (as far as the app is concerned - I appreciate they're buffered by the OS)
I think this lag spike is to do with the way Linux flushes data out to disk - I note that pdflush is designed to wake up every 5 s, and my understanding is that this is what does the writing. As soon as the stall is over, the userland app is able to quickly service and write the backlog of buffers (that didn't overflow).
I think the device I'm writing to has reasonable ultimate throughput: copying a 15MB file from a memory fs and waiting for sync to complete (and the usb stick's light to stop flashing) gave me a write speed of around 2.7MBytes/sec.
I'm looking for two kinds of clues:
How can I stop the bursty writing from stalling my app - perhaps process priorities, realtime patches, or tuning the filesystem code to write continuously rather than burstily?
How can I make my app(s) aware of what is going on with the filesystem in terms of write backlog and throughput to the card/stick? I have the ability to change the video bitrate in the hardware codec on the fly which would be much better than dropping frames, or imposing an artificial cap on maximum allowed bitrate.
Some more info: this is a 200MHz ARM9 currently running a Montavista 2.6.10-based kernel.
Updates:
Mounting the filesystem with the sync option causes throughput to be much too poor.
The removable media is FAT/FAT32 formatted and must be as the purpose of the design is that the media can be plugged into any Windows PC and read.
Regularly calling sync() or fsync(), say every second, causes regular stalls and unacceptably poor throughput.
I am using write() and open(O_WRONLY | O_CREAT | O_TRUNC) rather than fopen() etc.
I can't immediately find anything online about the mentioned "Linux realtime filesystems". Links?
I hope this makes sense. First embedded Linux question on stackoverflow? :)
For the record, there turned out to be two main aspects that seem to have eliminated the problem in all but the most extreme cases. This system is still in development and hasn't been thoroughly torture-tested yet but is working fairly well (touch wood).
The big win came from making the userland writer app multi-threaded. It is the calls to write() that block sometimes: other processes and threads still run. So long as I have a thread servicing the device driver and updating frame counts and other data to synchronise with other apps that are running, the data can be buffered and written out a few seconds later without breaking any deadlines. I tried a simple ping-pong double buffer first but that wasn't enough; small buffers would be overwhelmed and big ones just caused bigger pauses while the filesystem digested the writes. A pool of 10 1MB buffers queued between threads is working well now.
The other aspect is keeping an eye on ultimate write throughput to physical media. For this I am keeping an eye on the stat Dirty: reported by /proc/meminfo. I have some rough and ready code to throttle the encoder if Dirty: climbs above a certain threshold, seems to vaguely work. More testing and tuning needed later. Fortunately I have lots of RAM (128M) to play with giving me a few seconds to see my backlog building up and throttle down smoothly.
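For reference, the Dirty: check amounts to something like the sketch below (the 8 MB threshold and throttle_encoder() are hypothetical):

#include <stdio.h>

/* Returns the "Dirty:" figure from /proc/meminfo in kB, or -1 on error. */
long dirty_kb(void)
{
    char line[128];
    long kb = -1;
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "Dirty: %ld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;
}

/* elsewhere, once per frame or so:
   if (dirty_kb() > 8 * 1024)
       throttle_encoder();            // hypothetical throttle hook
*/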
I'll try to remember to pop back and update this answer if I find I need to do anything else to deal with this issue. Thanks to the other answerers.
I'll throw out some suggestions, advice is cheap.
make sure you are using a lower-level API for writing to the disk; don't use user-mode caching functions like fopen, fread and fwrite, use the lower-level functions open, read and write.
pass the O_SYNC flag when you open the file; this will cause each write operation to block until it is written to disk, which will remove the bursty behavior of your writes... at the expense of each write being slower.
If you are doing reads/ioctls from a device to grab a chunk of video data, you may want to consider allocating a shared memory region between the application and kernel, otherwise you are getting hit with a bunch of copy_to_user calls when transferring video data buffers from kernel space to user space.
You may need to validate that your USB flash device is fast enough with sustained transfers to write the data.
Just a couple thoughts, hope this helps.
Here is some information about tuning pdflush for write-heavy operations.
Sounds like you're looking for linux realtime filesystems. Be sure to search Google et al for that.
XFS has a realtime option, though I haven't played with it.
hdparm might let you turn off the caching altogether.
Tuning the filesystem options (turn off all the extra unneeded file attributes) might reduce what you need to flush, thus speeding the flush. I doubt that'd help much, though.
But my suggestion would be to avoid using the stick as a filesystem at all and instead use it as a raw device. Stuff data on it like you would using 'dd'. Then elsewhere read that raw data and write it out after baking.
Of course, I don't know if that's an option for you.
As a debugging aid, you could use strace to see which operations are taking time.
There might be something surprising with FAT/FAT32.
Do you write into a single file, or into multiple files?
You can make a reading thread that maintains a pool of video buffers ready to be written, in a queue.
When a frame is received, it is added to the queue, and the writing thread is signaled.
Shared data
empty_buffer_queue
ready_buffer_queue
video_data_ready_semaphore
Reading thread:
buf = get_buffer()
buffer_to_write = buf_dequeue(empty_buffer_queue)
memcpy(buffer_to_write, buf, buf_size)
buf_enqueue(buffer_to_write, ready_buffer_queue)
sem_post(video_data_ready_semaphore)
Writing thread:
sem_wait(video_data_ready_semaphore)
buffer_to_write = buf_dequeue(ready_buffer_queue)
write_buffer(buffer_to_write)
buf_enqueue(buffer_to_write, empty_buffer_queue)
If your writing thread is blocked waiting for the kernel, this could work.
However, if you are blocked inside kernel space, then there is not much you can do, except look for a more recent kernel than your 2.6.10.
Without knowing more about your particular circumstances, I can only offer the following guesses:
Try using fsync()/sync() to force the kernel to flush data to the storage device more frequently. It sounds like the kernel buffers all your writes and then ties up the bus or otherwise stalls your system while performing the actual write. With careful calls to fsync() you can try to schedule writes over the system bus in a more fine grained way.
It might make sense to structure the application in such a way that the encoding/capture (you didn't mention video capture, so I'm making an assumption here - you might want to add more information) task runs in its own thread and buffers its output in userland - then, a second thread can handle writing to the device. This will give you a smoothing buffer to allow the encoder to always finish its writes without blocking.
One thing that sounds suspicious is that you only see this problem at a certain data rate - if this really was a buffering issue, I'd expect the problem to happen less frequently at lower data rates, but I'd still expect to see this issue.
In any case, more information might prove useful. What's your system's architecture? (In very general terms.)
Given the additional information you provided, it sounds like the device's throughput is rather poor for small writes and frequent flushes. If you're sure that for larger writes you can get sufficient throughput (and I'm not sure that's the case, but the file system might be doing something stupid, like updating the FAT after every write), then have an encoding thread pipe data to a writing thread, with enough buffering in the writing thread to avoid stalls. I've used shared memory ring buffers in the past to implement this kind of scheme, but any IPC mechanism that lets the writer write to the I/O process without stalling unless the buffer is full should do the trick.
A useful Linux function and alternative to sync or fsync is sync_file_range. This lets you schedule data for writing without waiting for the in-kernel buffer system to get around to it.
To avoid long pauses, make sure your IO queue (for example: /sys/block/hda/queue/nr_requests) is large enough. That queue is where data goes in between being flushed from memory and arriving on disk.
Note that sync_file_range isn't portable, and is only available in kernels 2.6.17 and later.
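A hedged example of the call (fd, offset and len are placeholders; the flag combinations follow the sync_file_range(2) man page):

#define _GNU_SOURCE
#include <fcntl.h>

/* Start writeback of [offset, offset+len) now, without waiting for it. */
sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);

/* Later, wait for an earlier-submitted range to reach the device: */
sync_file_range(fd, offset, len,
                SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
                SYNC_FILE_RANGE_WAIT_AFTER);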
I've been told that after the host sends a command, MMC and SD cards "must respond within 0 to 8 bytes".
However, the spec allows these cards to respond with "busy" until they have finished the operation, and apparently there is no limit to how long a card can claim to be busy (please, please tell me if there is such a limit).
I see that some low-cost flash chips such as the M25P80 have a guaranteed "maximum single-sector erase time" of 3 seconds, although typically it "only" requires 0.6 seconds.
That 0.6 seconds sounds suspiciously similar to your "stalling for maybe half a second".
I suspect the tradeoff between cheap, slow flash chips and expensive, fast flash chips has something to do with the wide variation in USB flash drive results:
http://www.testfreaks.com/blog/information/16gb-usb-drive-comparison-17-drives-compared/
http://www.tomshardware.com/reviews/data-transfer-run,1037-10.html
I've heard rumors that every time a flash sector is erased and then re-programmed, it takes a little bit longer than the last time.
So if you have a time-critical application, you may need to (a) test your SD cards and USB sticks to make sure they meet the minimum latency, bandwidth, etc. required by your application, and (b) periodically re-test or pre-emptively replace these memory devices.
Well, the obvious thing first: have you tried explicitly telling the file to flush? I also think there might be some ioctl you can use to do it, but I honestly haven't done much C/POSIX file programming.
Seeing as you're on a Linux kernel, you should be able to tune and rebuild the kernel to something that suits your needs better, e.g. much more frequent but then also smaller flushes to the permanent storage.
A quick check in my man pages finds this:
SYNC(2) Linux Programmer’s Manual SYNC(2)
NAME
sync - commit buffer cache to disk
SYNOPSIS
#include <unistd.h>
void sync(void);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sync(): _BSD_SOURCE || _XOPEN_SOURCE >= 500
DESCRIPTION
sync() first commits inodes to buffers, and then buffers to disk.
ERRORS
This function is always successful.
Doing your own flush()ing sounds right to me - you want to be in control, not leave it to the vagaries of the generic buffer layer.
This may be obvious, but make sure you're not calling write() too often - make sure every write() has enough data to be written to make the syscall overhead worth it. Also, in the other direction, don't call it too seldom, or it'll block for long enough to cause a problem.
On a more difficult-to-reimplement track, have you tried switching to asynchronous i/o? Using aio you could fire off a write and hand it one set of buffers while you're sucking video data into the other set, and when the write finishes you switch sets of buffers.
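If POSIX AIO is available on your platform, that might look roughly like the sketch below (buffer names, sizes and fd are placeholders; glibc's implementation needs -lrt):

#include <aio.h>
#include <errno.h>
#include <string.h>

struct aiocb cb;
memset(&cb, 0, sizeof(cb));
cb.aio_fildes = fd;                  /* file opened earlier */
cb.aio_buf    = bufA;                /* buffer full of video data */
cb.aio_nbytes = BUF_SIZE;
cb.aio_offset = file_offset;

aio_write(&cb);                      /* returns immediately */
/* ... keep capturing into bufB while the write is in flight ... */

while (aio_error(&cb) == EINPROGRESS)
    ;                                /* or check once per captured frame */
ssize_t done = aio_return(&cb);      /* bytes written, or -1 on error */
/* swap bufA and bufB, advance file_offset by done, and repeat */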