Queueing writes to file system on Linux?

On a very large SMP machine with many CPUs, scripts are run with tens of simultaneous jobs (fewer than the number of CPUs) like this:
some_program -in FIFO1 >OUTPUT1 2>s_p1.log </dev/null &
some_program -in FIFO2 >OUTPUT2 2>s_p2.log </dev/null &
...
some_program -in FIFO40 >OUTPUT40 2>s_p40.log </dev/null &
splitter_program -in real_input.dat -out FIFO1,FIFO2...FIFO40
The splitter reads the input data flat out and distributes it to the FIFOs in order. (Records 1,41,81... to FIFO1; 2,42,82... to FIFO2; etc.) The splitter has low overhead and can pretty much process data as fast as the file system can supply it.
Each some_program processes its stream and writes it to its output file. However, nothing controls the order in which the file system sees these writes. The writes are also very small, on the order of 10 bytes. The script "knows" that there are 40 streams here and that they could be buffered in 20M (or whatever) chunks, and then each chunk written to the file system sequentially. That is, queued writes should be used to maximize write speed to the disks. The OS, however, just sees a bunch of writes at about the same rate on each of the 40 streams.
What happens in practice during a run is that the subprocesses get a lot of CPU time (in top, >80%), then a flush process appears (10% CPU) and all the others drop to low CPU (1%), then it goes back to the higher rate. These pauses go on for several seconds at a time. The flush means that the writes are overwhelming the file cache. Also I think the OS and/or the underlying RAID controller is probably bouncing the physical disk heads around erratically, which is reducing the ultimate write speed to the physical disks. That is just a guess though, since it is hard to say what exactly is happening with a file cache (in a system with over 500 GB of RAM) and a RAID controller between the writes and the disk.
Is there a program or method around for controlling this sort of IO, forcing the file system writes to queue nicely to maximize write speed?
The "buffer" program is not going to help much here because, while it would accumulate an output stream into a big chunk, there wouldn't be an orderly queuing of the writes, so several could go out at the same time. If the data rate in the output streams were uncorrelated this would be less of a problem, but in some cases the data rate is exactly the same in all streams, which means the buffers would all fill at the same time. This would stall the entire tree until the last one was written, because any process that cannot write an output will not read its next input, and that would stall the splitter, as all I/O is synchronous. The buffers need to be emptied in a cyclical manner, preferably before any of them completely fill up, although that may not be avoidable when the data output rate exceeds the file system write rate.
There are dozens of parameters for tuning the file system, some of those might help. The scheduler was changed from cfq to deadline because the system was locking up for minutes at a time with the former.

If the problem is sheer I/O bandwidth, then buffering won't solve anything. In that case, you need to shrink the data or send it to a higher-bandwidth sink to improve and level your performance. One way to do that would be to reduce the number of parallel jobs, as @thatotherguy said.
If in fact the problem is with the number of distinct I/O operations rather than with the overall volume of data, however, then buffering might be a viable solution. I am unfamiliar with the buffer program you mentioned, but I suppose it does what its name suggests. I don't completely agree with your buffering comments, however:
The "buffer" program is not going to help much here because, while it would accumulate an output stream into a big chunk, there wouldn't be an orderly queuing of the writes, so several could go out at the same time.
You don't necessarily need big chunks. It would probably be ideal to chunk at the native block size of the file system, or a small integer multiple thereof. That might be, say, 4096- or 8192-byte chunks.
Moreover, I don't see why you think you have an "orderly queueing of writes" now, or why you're confident that such a thing is needed.
If the data rate in the output streams were uncorrelated this would be less of a problem, but in some cases the data rate is exactly the same in all streams, which means the buffers would all fill at the same time. This would stall the entire tree until the last one was written, because any process that cannot write an output will not read its next input, and that would stall the splitter, as all I/O is synchronous.
Your splitter is writing to FIFOs. Though it may do so serially, that is not "synchronous" in the sense that the data needs to be drained out the other end before the splitter can proceed -- at least, not if the writes do not exceed the size of the FIFOs' buffers. FIFO buffer capacity varies from system to system, adapts dynamically on some systems, and is configurable (e.g. via fcntl()) on some systems. The default buffer size on modern Linux is 64kB.
The buffers need to be emptied in a cyclical manner, preferably before any of them completely fill up, although that may not be avoidable when data output rate exceeds the file system write rate.
I think this is a problem that pretty much solves itself. If one of the buffers backs up enough to block the splitter, then that ensures that the competing processes will, before too long, give the blocked buffer the opportunity to write. But this is also why you don't want enormous buffers -- you want to interleave disk I/O from different processes relatively finely to try to keep everything going.
The alternative to an external buffer program is to modify your processes to perform internal buffering. That might be an advantage because it removes a whole set of pipes (to an external buffering program) from the mix, and it lightens the process load on the machines. It does mean modifying your working processing program, though, so perhaps it would be better to start with external buffering to see how well that does.

If the problem is that your 40 streams each have a high data rate, and your RAID controller cannot write to physical disk fast enough, then you need to redesign your disk system. Basically, divide it into 40 RAID-1 mirrors and write one file to each mirror set. That makes the writes sequential for each stream, but requires 80 disks.
If the data rate isn't the problem, then you need to add more buffering. You might need a pair of threads: one to collect the data into memory buffers, and another to write it to the data files and fsync() them. To make the disk writes sequential, it should fsync() each output file one at a time. That should result in writing large sequential chunks of whatever your buffer size is. 8 MB, maybe?
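Here is a minimal sketch of that two-thread scheme, with stdin standing in for the data source; the 8 MB buffer size, queue depth, and output file name are illustration values, not anything the question prescribes:

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define BUF_SIZE (8 * 1024 * 1024)   /* 8 MB chunks, as suggested above */
#define NBUFS 4

struct chunk { char data[BUF_SIZE]; size_t len; };

static struct chunk bufs[NBUFS];
static int head = 0, tail = 0, count = 0, done = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

/* Collector: reads input into the next free buffer. */
static void *collect(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == NBUFS)
            pthread_cond_wait(&not_full, &lock);
        struct chunk *c = &bufs[head];
        pthread_mutex_unlock(&lock);

        ssize_t n = read(STDIN_FILENO, c->data, BUF_SIZE);

        pthread_mutex_lock(&lock);
        if (n <= 0) {                       /* EOF or error: stop */
            done = 1;
            pthread_cond_signal(&not_empty);
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        c->len = (size_t)n;
        head = (head + 1) % NBUFS;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }
}

/* Writer: drains buffers in order, fsync()ing after each chunk. */
static void *drain(void *arg)
{
    int fd = *(int *)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == 0 && !done)
            pthread_cond_wait(&not_empty, &lock);
        if (count == 0 && done) {
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        struct chunk *c = &bufs[tail];
        pthread_mutex_unlock(&lock);

        if (write(fd, c->data, c->len) < 0)
            perror("write");
        fsync(fd);               /* force this chunk out before the next */

        pthread_mutex_lock(&lock);
        tail = (tail + 1) % NBUFS;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
    }
}

int main(void)
{
    int fd = open("OUTPUT1", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    pthread_t c, w;
    pthread_create(&c, NULL, collect, NULL);
    pthread_create(&w, NULL, drain, &fd);
    pthread_join(c, NULL);
    pthread_join(w, NULL);
    close(fd);
    return 0;
}

The collector never touches a slot the writer is draining, and the writer fsync()s each chunk before dequeuing the next, so the file system sees one large sequential write at a time.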

Related

Fastest way to copy a large file locally

I was asked this in an interview.
I said let's just use cp. Then I was asked to mimic the implementation of cp itself.
So I thought okay, let's open the file, read it byte by byte, and write each byte to another file.
Then I was asked to optimize it further. I thought let's read in chunks and write those chunks. I didn't have a good answer about what would be a good chunk size. Please help me out with that.
Then I was asked to optimize even further. I thought maybe we could read from different threads in parallel and write in parallel.
But I quickly realized that reading in parallel is OK, but writing will not work in parallel (without locking, I mean) since data from one thread might overwrite data from another.
So I thought okay, let's read in parallel, put it in a queue, and then a single thread will take it off the queue and write it to the file one piece at a time.
Does that even improve performance? (Not for small files, I mean, where it would be more overhead, but for large files.)
Also, is there an OS trick where I could just point two files to the same data on disk? I mean, I know there are symlinks, but apart from that?
"The fastest way to copy a file" is going to depend on the system - all the way from the storage media to the CPUs. The most likely bottleneck will be the storage media - but it doesn't have to be. Imagine high-end storage that can move data faster than your system can create physical page mappings to read the data into...
In general, the fastest way to move a lot of data is to make as few copies of it as possible, and to avoid any extra operations, especially S-L-O-W ones such as physical disk head seeks.
So for a local copy on a common single-rotating-disk workstation/desktop/laptop system, the biggest thing to do is minimize physical disk seeks. That means read and write single-threaded, in large chunks (1 MB, for example) so the system can do whatever optimization it can, such as read-ahead or write coalescing.
That will likely get you to 95% or even better of the system's maximum copy performance. Even standard C buffered fopen()/fread()/fwrite() probably gets at least 80-90% of the best possible performance.
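For concreteness, a minimal sketch of that baseline: a single-threaded copy loop in 1 MB chunks, with error handling abbreviated:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)   /* 1 MB, as suggested above */

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s src dst\n", argv[0]); return 1; }
    int in = open(argv[1], O_RDONLY);
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    char *buf = malloc(CHUNK);
    if (in < 0 || out < 0 || !buf) { perror("setup"); return 1; }

    ssize_t n;
    while ((n = read(in, buf, CHUNK)) > 0) {
        char *p = buf;
        while (n > 0) {                 /* write() may be partial */
            ssize_t w = write(out, p, (size_t)n);
            if (w < 0) { perror("write"); return 1; }
            p += w;
            n -= w;
        }
    }
    if (n < 0) perror("read");
    free(buf);
    close(in);
    close(out);
    return 0;
}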
You can get the last few percentage points in a few ways. First, match your IO block size to a multiple of the file system's block size so that you're always reading full blocks from the filesystem. Second, use direct IO to bypass copying your data through the page cache. It will be faster to go disk->userspace or userspace->disk than it is to go disk->page cache->userspace and userspace->page cache->disk, but for a single-spinning-disk copy that's not going to matter much, if it's even measurable.
You can use various dd options to test copying a file like this. Try iflag=direct / oflag=direct, or conv=notrunc.
You can also try using sendfile() to avoid copying data into userspace entirely. Depending on the implementation, that might be faster than using direct IO.
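A sketch of a sendfile()-based copy loop; note that on Linux, sendfile() to a regular file (rather than a socket) needs kernel 2.6.33 or later, so treat that as an assumption to verify:

#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copies src to dst entirely inside the kernel: no userspace buffer. */
int copy_sendfile(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    struct stat st;
    if (in < 0 || out < 0 || fstat(in, &st) < 0) { perror("setup"); return -1; }

    off_t off = 0;
    while (off < st.st_size) {
        /* sendfile() advances off by the number of bytes transferred;
         * one call moves at most ~2 GB, hence the loop. */
        ssize_t n = sendfile(out, in, &off, (size_t)(st.st_size - off));
        if (n < 0) { perror("sendfile"); return -1; }
    }
    close(in);
    close(out);
    return 0;
}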
Pre-allocating the destination file may or may not improve copy performance - that will depend on the filesystem. If the filesystem doesn't support sparse files, though, preallocating the file to a specific length might very well be very, very slow.
There just isn't all that much you can do to dramatically improve performance of a copy from and to the same single spinning physical disk - those disk heads will dance, and that will take time.
SSDs are much easier - to get maximal IO rates, just use parallel IO via multiple threads. But again, the "normal" IO will probably be at 80-90% of maximal.
Things get a lot more interesting and complex optimizing IO performance for other types of storage systems such as large RAID arrays and/or complex filesystems that can stripe single files across multiple underlying storage devices. Maximizing IO on such systems involves matching the software's IO patterns to the characteristics of the storage, and that can be quite complex.
Finally, one important part of maximizing IO rates is not doing things that dramatically slow things down. It's really easy to drag a physical disk down to a few KB/sec IO rates - read/write small chunks from/to random locations all over the disk. If your write process drops 16-byte chunks to random locations, the disk will spend almost all its time seeking and it won't move much data at all while doing that.
In fact, not "killing yourself" with bad IO patterns is a lot more important than spending a lot of effort attempting to get four or five percentage points faster in optimal cases.
Because if IO is a bottleneck on a simple system, just go buy a faster disk.
But I quickly realized that reading in parallel is OK, but writing will not work in parallel (without locking, I mean) since data from one thread might overwrite data from another.
Multithreading is not normally going to speed up a process like this. Any performance benefit you may gain could be wiped out by the synchronization overhead.
So I thought okay, let's read in parallel, put it in a queue, and then a single thread will take it off the queue and write it to the file one piece at a time.
That's only going to give an advantage on a system that supports asynchronous I/O.
To get the maximum speed you'd want to write in buffer sizes that are multiples of the disk's cluster size (assuming a hard-disk file system). This could be sped up on systems that permit queuing asynchronous I/O (as, say, Windows does).
You'd also want to create the output file with its initial size being the same as the input file. That way your write operations never have to extend the file.
Probably the fastest file copy possible would be to memory map the input and output files and do a memory copy. This is especially efficient in systems that treat mapped files as page files.
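A rough sketch of that mmap()-and-memcpy approach, assuming both files fit in the address space (fine on 64-bit; on 32-bit you would map and copy a window at a time):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int copy_mmap(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    int out = open(dst, O_RDWR | O_CREAT | O_TRUNC, 0644);
    struct stat st;
    if (in < 0 || out < 0 || fstat(in, &st) < 0) { perror("setup"); return -1; }
    if (st.st_size == 0) { close(in); close(out); return 0; }

    /* The destination must be extended to the source length first, or
     * the stores into the mapping will fault. */
    if (ftruncate(out, st.st_size) < 0) { perror("ftruncate"); return -1; }

    void *s = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, in, 0);
    void *d = mmap(NULL, st.st_size, PROT_WRITE, MAP_SHARED, out, 0);
    if (s == MAP_FAILED || d == MAP_FAILED) { perror("mmap"); return -1; }

    memcpy(d, s, st.st_size);          /* the page cache does the I/O */

    munmap(s, st.st_size);
    munmap(d, st.st_size);
    close(in);
    close(out);
    return 0;
}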

Should I send data in chunks, or send it all at once?

I have python code that sends data to socket (a rather large file). Should I divide it into 1kb chunks, or would just conn.sendall(file.read()) be acceptable?
It will make little difference to the sending operation. (I assume you are using a TCP socket for the purposes of this discussion.)
When you attempt to send 1K, the kernel will take that 1K, copy it into kernel TCP buffers, and return success (and probably begin sending to the peer at the same time), at which point you will send another 1K and the same thing happens. Eventually, if the file is large enough and the network can't send it fast enough, or the receiver can't drain it fast enough, the kernel buffer space used by your data will reach some internal limit and your process will be blocked until the receiver drains enough data. (This limit can often be pretty high with TCP -- depending on the OS, you may be able to send a megabyte or two without ever hitting it.)
If you try to send in one shot, pretty much the same thing will happen: data will be transferred from your buffer into kernel buffers until/unless some limit is reached. At that point, your process will be blocked until data is drained by the receiver (and so forth).
However, with the first mechanism, you can send a file of any size without using undue amounts of memory -- your in-memory buffer (not including the kernel TCP buffers) only needs to be 1K long. With the sendall approach, file.read() will read the entire file into your program's memory. If you attempt that with a truly giant file (say 40G or something), that might take more memory than you have, even including swap space.
So, as a general purpose mechanism, I would definitely favor the first approach. For modern architectures, I would use a larger buffer size than 1K though. The exact number probably isn't too critical; but you could choose something that will fit several disk blocks at once, say, 256K.
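For illustration, the chunked-send loop looks like this in C (the Python version has the same shape: loop over file.read(CHUNK) and conn.sendall(chunk)); 256K is just the example value from above:

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#define CHUNK (256 * 1024)

/* Reads fd in 256K chunks and pushes each chunk out the socket. */
int send_file(int sock, int fd)
{
    static char buf[CHUNK];            /* one fixed buffer; not reentrant */
    ssize_t n;
    while ((n = read(fd, buf, CHUNK)) > 0) {
        char *p = buf;
        while (n > 0) {                /* send() may accept less than asked */
            ssize_t s = send(sock, p, (size_t)n, 0);
            if (s < 0) { perror("send"); return -1; }
            p += s;
            n -= s;
        }
    }
    return (n < 0) ? -1 : 0;
}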

linux: smart fsync()?

I'm recording audio and writing it to an SD card; the data rate is around 1.5 MB/s. I'm using a class 4 SD card with an ext4 file system.
After a certain interval, the kernel auto-syncs the files. The downside of this is that my application's buffers pile up, waiting to be written to disk.
I think that if the kernel synced more frequently than it does now, it might solve the issue.
I used fsync() in the application to sync after certain intervals. But this does not solve the problem, because at times the kernel has synced just before the application called fsync(), so that fsync() was a waste of time.
I need a syncing mechanism (say, smart_fsync()) such that when the application calls smart_fsync(), the kernel will sync only if it has not synced in a while; otherwise it will just return.
Since there is no such function as smart_fsync(), what would be a possible workaround?
The first question to ask is, what exactly is the problem you're experiencing? The kernel will flush dirty (unwritten cached) buffers periodically - this is because doing so tends to be faster than flushing synchronously (less latency hit for applications). The downside is that this means a larger latency hit if you reach the kernel's limit on dirty data (and potentially more data loss after an unclean shutdown).
If you want to ensure that the data hits disk ASAP, then you should simply open the file with the O_SYNC option. This will flush the data to disk immediately upon write(). Of course, this implies a significant performance penalty, but on the other hand you have complete control over when the data is flushed.
If you are experiencing drops in throughput while the syncing is going on, most likely you are attempting to write faster than the disk can support, and reaching the dirty page memory limit. Unfortunately, this would mean the hardware is simply not up to the write rate you are attempting to push at it - you'll need to write slower, or buffer the data up on faster media (or add more RAM!).
Note also that your 'smart fsync' is exactly what the kernel implements - it will flush pages when one of the following is true:
* There is too much dirty data in memory. Triggers asynchronously (without blocking writes) when the total amount of dirty data exceeds /proc/sys/vm/dirty_background_bytes, or when the percentage of total memory exceeds /proc/sys/vm/dirty_background_ratio. Triggers synchronously (blocking your application's write() for an extended time) when the total amount of data exceeds /proc/sys/vm/dirty_bytes, or the percentage of total memory exceeds /proc/sys/vm/dirty_ratio.
* Dirty data has been pending in memory for too long. The pdflush daemon checks for old dirty blocks every /proc/sys/vm/dirty_writeback_centisecs centiseconds (1/100 seconds), and will expire blocks if they have been in memory for longer than /proc/sys/vm/dirty_expire_centisecs.
It's possible that tuning these parameters might help a bit, but you're probably better off figuring out why the defaults aren't keeping up as is.
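That said, if you do want smart_fsync() behavior in userspace, a minimal sketch is to gate fsync() on a timer so that repeated calls within a chosen interval become no-ops. Note this only tracks the application's own syncs, not the kernel's background writeback described above:

#include <time.h>
#include <unistd.h>

/* fsync() fd only if at least min_interval seconds have passed since
 * the last sync through this wrapper.  Single-fd version for brevity. */
int smart_fsync(int fd, double min_interval)
{
    static struct timespec last;       /* zeroed at program start */
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    double elapsed = (now.tv_sec - last.tv_sec)
                   + (now.tv_nsec - last.tv_nsec) / 1e9;
    if (elapsed < min_interval)
        return 0;                      /* synced recently enough; skip */
    last = now;
    return fsync(fd);
}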

Does the full 64K get used for every pipe created?

How are pipes implemented re buffering? I might be creating many pipes but only ever sending/receiving a few bytes through them at a time, so don't want to waste memory unnecessarily.
Edit: I understand what buffering is; I am asking how buffering is implemented in Linux pipes specifically, i.e. does the full 64K get allocated regardless of the high-water mark?
Buffers are used to even out the difference in speed between producer and consumer. If you didn't have a buffer, you would have to switch tasks after every byte produced, which would be very inefficient due to the cost of context switches, data and code caches never becoming hot, etc. If your consumer can consume data about as fast as the producer produces it, your buffer use will usually be low (but read on). If the producer is much faster than the consumer, the buffer will fill up completely and the producer will be forced to wait until more space becomes available. The reversed case of a slow producer and fast consumer will use a very small part of the buffer most of the time.
The usage also depends on whether both your processes actually run in parallel (e.g. on separate cores), or whether they share a core and are merely fooled into thinking they are concurrent by the OS's process management. If you have real concurrency (separate core/CPU), your buffer will usually be used less.
Anyway, if your applications are not producing much data and their speeds are similar, the buffer will not be very full most of the time. However, I wouldn't be surprised if at the OS level the full 64 kB were allocated anyway. But unless you are using an embedded device, 64 kB is not much, so even if the maximum size is always allocated, I wouldn't worry about it.
By the way, it is not easy to modify the size of the pipe buffer, for example in this discussion a number of tricks are suggested but they are actually workarounds which modify the way data from the buffer is consumed, not modifying the actual buffer size. You could check ulimit -p but I'm not 100% sure it will give you the control you need.
EDIT: Looking at fs/pipe.c and include/linux/pipe_fs_i.h in the Linux code, it looks like the buffers do change their size. The minimum size of the buffer is a full page, though, so if you only need a few bytes, there will be waste. I'm not sure at this point, but some code that uses PIPE_DEF_BUFFERS, which is 16, giving 64 kB with 4 kB pages, makes me wonder whether the buffer can actually fall below 64 kB (the one-page minimum could be just an additional restriction).
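For what it's worth, on Linux 2.6.35 and later you can query and change a pipe's capacity per-pipe with fcntl(). A small sketch; the kernel rounds requests up and enforces a one-page minimum, so treat the printed values as system-dependent:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) < 0) { perror("pipe"); return 1; }

    long cap = fcntl(fds[0], F_GETPIPE_SZ);
    printf("default capacity: %ld bytes\n", cap);   /* typically 65536 */

    /* Ask for a single page; the kernel clamps to its minimum. */
    if (fcntl(fds[0], F_SETPIPE_SZ, 4096) < 0)
        perror("F_SETPIPE_SZ");
    printf("new capacity: %ld bytes\n", (long)fcntl(fds[0], F_GETPIPE_SZ));

    close(fds[0]);
    close(fds[1]);
    return 0;
}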

Bursty writes to SD/USB stalling my time-critical apps on embedded Linux

I'm working on an embedded Linux project that interfaces an ARM9 to a hardware video encoder chip, and writes the video out to SD card or USB stick. The software architecture involves a kernel driver that reads data into a pool of buffers, and a userland app that writes the data to a file on the mounted removable device.
I am finding that above a certain data rate (around 750kbyte/sec) I start to see the userland video-writing app stalling for maybe half a second, about every 5 seconds. This is enough to cause the kernel driver to run out of buffers - and even if I could increase the number of buffers, the video data has to be synchronised (ideally within 40ms) with other things that are going on in real time. Between these 5 second "lag spikes", the writes complete well within 40ms (as far as the app is concerned - I appreciate they're buffered by the OS)
I think this lag spike is to do with the way Linux is flushing data out to disk - I note that pdflush is designed to wake up every 5s; my understanding is that this is what does the writing. As soon as the stall is over, the userland app is able to quickly service and write the backlog of buffers (that didn't overflow).
I think the device I'm writing to has reasonable ultimate throughput: copying a 15MB file from a memory fs and waiting for sync to complete (and the usb stick's light to stop flashing) gave me a write speed of around 2.7MBytes/sec.
I'm looking for two kinds of clues:
How can I stop the bursty writing from stalling my app - perhaps process priorities, realtime patches, or tuning the filesystem code to write continuously rather than burstily?
How can I make my app(s) aware of what is going on with the filesystem in terms of write backlog and throughput to the card/stick? I have the ability to change the video bitrate in the hardware codec on the fly which would be much better than dropping frames, or imposing an artificial cap on maximum allowed bitrate.
Some more info: this is a 200MHz ARM9 currently running a Montavista 2.6.10-based kernel.
Updates:
Mounting the filesystem SYNC causes throughput to be much too poor.
The removable media is FAT/FAT32 formatted and must be as the purpose of the design is that the media can be plugged into any Windows PC and read.
Regularly calling sync() or fsync(), say every second, causes regular stalls and unacceptably poor throughput
I am using write() and open(O_WRONLY | O_CREAT | O_TRUNC) rather than fopen() etc.
I can't immediately find anything online about the mentioned "Linux realtime filesystems". Links?
I hope this makes sense. First embedded Linux question on stackoverflow? :)
For the record, there turned out to be two main aspects that seem to have eliminated the problem in all but the most extreme cases. This system is still in development and hasn't been thoroughly torture-tested yet but is working fairly well (touch wood).
The big win came from making the userland writer app multi-threaded. It is the calls to write() that block sometimes: other processes and threads still run. So long as I have a thread servicing the device driver and updating frame counts and other data to synchronise with other apps that are running, the data can be buffered and written out a few seconds later without breaking any deadlines. I tried a simple ping-pong double buffer first but that wasn't enough; small buffers would be overwhelmed and big ones just caused bigger pauses while the filesystem digested the writes. A pool of 10 1MB buffers queued between threads is working well now.
The other aspect is keeping an eye on ultimate write throughput to physical media. For this I am keeping an eye on the stat Dirty: reported by /proc/meminfo. I have some rough and ready code to throttle the encoder if Dirty: climbs above a certain threshold, seems to vaguely work. More testing and tuning needed later. Fortunately I have lots of RAM (128M) to play with giving me a few seconds to see my backlog building up and throttle down smoothly.
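A sketch of what such a check can look like (an illustration, not the actual code from this system): parse /proc/meminfo and return the dirty figure in kB, leaving the threshold and throttling policy to the caller:

#include <stdio.h>

/* Returns the current Dirty: value from /proc/meminfo in kB, or -1. */
long dirty_kb(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f)
        return -1;
    char line[128];
    long kb = -1;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "Dirty: %ld kB", &kb) == 1)
            break;
    }
    fclose(f);
    return kb;
}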
I'll try to remember to pop back and update this answer if I find I need to do anything else to deal with this issue. Thanks to the other answerers.
I'll throw out some suggestions, advice is cheap.
Make sure you are using a lower-level API for writing to the disk: don't use the user-mode caching functions fopen, fread, and fwrite; use the lower-level functions open, read, and write.
Pass the O_SYNC flag when you open the file. This will cause each write operation to block until the data is written to disk, which will remove the bursty behavior of your writes... at the expense of each write being slower.
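That is, something like this hypothetical helper (path and mode are illustration values):

#include <fcntl.h>

/* Returns a descriptor whose write()s block until the data reaches
 * the device. */
int open_sync(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
}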
If you are doing reads/ioctls from a device to grab a chunk of video data, you may want to consider allocating a shared memory region between the application and the kernel; otherwise you are getting hit with a bunch of copy_to_user calls when transferring video data buffers from kernel space to user space.
You may need to validate that your USB flash device is fast enough with sustained transfers to write the data.
Just a couple thoughts, hope this helps.
Here is some information about tuning pdflush for write-heavy operations.
Sounds like you're looking for linux realtime filesystems. Be sure to search Google et al for that.
XFS has a realtime option, though I haven't played with it.
hdparm might let you turn off the caching altogether.
Tuning the filesystem options (turn off all the extra unneeded file attributes) might reduce what you need to flush, thus speeding the flush. I doubt that'd help much, though.
But my suggestion would be to avoid using the stick as a filesystem at all and instead use it as a raw device. Stuff data on it like you would using 'dd'. Then elsewhere read that raw data and write it out after baking.
Of course, I don't know if that's an option for you.
As a debugging aid, you could use strace to see which operations are taking time.
There might be some surprising thing with the FAT/FAT32.
Do you write into a single file, or into multiple files?
You can make a reading thread that maintains a pool of video buffers ready to be written, in a queue.
When a frame is received, it is added to the queue, and the writing thread is signaled.
Shared data:
empty_buffer_queue
ready_buffer_queue
video_data_ready_semaphore
Reading thread:
buf = get_buffer()
buffer_to_write = buf_dequeue(empty_buffer_queue)
memcpy(buffer_to_write, buf)
buf_enqueue(buffer_to_write, ready_buffer_queue)
sem_post(video_data_ready_semaphore)
Writing thread:
sem_wait(video_data_ready_semaphore)
buffer_to_write = buf_dequeue(ready_buffer_queue)
write(buffer_to_write)
buf_enqueue(buffer_to_write, empty_buffer_queue)
If your writing thread is blocked waiting for the kernel, this could work.
However, if you are blocked inside kernel space, then there is not much you can do except look for a more recent kernel than your 2.6.10.
Without knowing more about your particular circumstances, I can only offer the following guesses:
Try using fsync()/sync() to force the kernel to flush data to the storage device more frequently. It sounds like the kernel buffers all your writes and then ties up the bus or otherwise stalls your system while performing the actual write. With careful calls to fsync() you can try to schedule writes over the system bus in a more fine grained way.
It might make sense to structure the application in such a way that the encoding/capture (you didn't mention video capture, so I'm making an assumption here - you might want to add more information) task runs in its own thread and buffers its output in userland - then, a second thread can handle writing to the device. This will give you a smoothing buffer to allow the encoder to always finish its writes without blocking.
One thing that sounds suspicious is that you only see this problem at a certain data rate - if this really was a buffering issue, I'd expect the problem to happen less frequently at lower data rates, but I'd still expect to see this issue.
In any case, more information might prove useful. What's your system's architecture? (In very general terms.)
Given the additional information you provided, it sounds like the device's throughput is rather poor for small writes and frequent flushes. If you're sure that you can get sufficient throughput with larger writes (and I'm not sure that's the case -- the file system might be doing something stupid, like updating the FAT after every write), then having an encoding thread pipe data to a writing thread, with sufficient buffering in the writing thread to avoid stalls, should help. I've used shared memory ring buffers in the past to implement this kind of scheme, but any IPC mechanism that would allow the writer to write to the I/O process without stalling unless the buffer is full should do the trick.
A useful Linux function and alternative to sync or fsync is sync_file_range. This lets you schedule data for writing without waiting for the in-kernel buffer system to get around to it.
To avoid long pauses, make sure your IO queue (for example: /sys/block/hda/queue/nr_requests) is large enough. That queue is where data goes in between being flushed from memory and arriving on disk.
Note that sync_file_range isn't portable, and is only available in kernels 2.6.17 and later.
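A sketch of how it might be used to kick off writeback of a finished chunk without waiting for it (the helper name is mine; flags per the sync_file_range(2) man page):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

/* Begin writeback of [chunk_start, chunk_start+chunk_len) now, without
 * waiting for completion.  Add SYNC_FILE_RANGE_WAIT_BEFORE and
 * SYNC_FILE_RANGE_WAIT_AFTER to make it behave like a ranged fsync. */
void kick_writeback(int fd, off64_t chunk_start, off64_t chunk_len)
{
    if (sync_file_range(fd, chunk_start, chunk_len,
                        SYNC_FILE_RANGE_WRITE) < 0)
        perror("sync_file_range");
}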
I've been told that after the host sends a command, MMC and SD cards "must respond within 0 to 8 bytes".
However, the spec allows these cards to respond with "busy" until they have finished the operation, and apparently there is no limit to how long a card can claim to be busy (please, please tell me if there is such a limit).
I see that some low-cost flash chips such as the M25P80 have a guaranteed "maximum single-sector erase time" of 3 seconds, although typically it "only" requires 0.6 seconds.
That 0.6 seconds sounds suspiciously similar to your "stalling for maybe half a second".
I suspect the tradeoff between cheap, slow flash chips and expensive, fast flash chips has something to do with the wide variation in USB flash drive results:
http://www.testfreaks.com/blog/information/16gb-usb-drive-comparison-17-drives-compared/
http://www.tomshardware.com/reviews/data-transfer-run,1037-10.html
I've heard rumors that every time a flash sector is erased and then re-programmed, it takes a little bit longer than the last time.
So if you have a time-critical application, you may need to (a) test your SD cards and USB sticks to make sure they meet the minimum latency, bandwidth, etc. required by your application, and (b) periodically re-test or pre-emptively replace these memory devices.
Well, the obvious first: have you tried explicitly telling the file to flush? I also think there might be some ioctl you can use to do it, but I honestly haven't done much C/POSIX file programming.
Seeing as you're on a Linux kernel, you should be able to tune and rebuild the kernel to something that suits your needs better, e.g. much more frequent but also smaller flushes to the permanent storage.
A quick check in my man pages finds this:
SYNC(2) Linux Programmer’s Manual SYNC(2)
NAME
sync - commit buffer cache to disk
SYNOPSIS
#include <unistd.h>
void sync(void);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sync(): _BSD_SOURCE || _XOPEN_SOURCE >= 500
DESCRIPTION
sync() first commits inodes to buffers, and then buffers to disk.
ERRORS
This function is always successful.
Doing your own flush()ing sounds right to me - you want to be in control, not leave it to the vagaries of the generic buffer layer.
This may be obvious, but make sure you're not calling write() too often - make sure every write() has enough data to be written to make the syscall overhead worth it. Also, in the other direction, don't call it too seldom, or it'll block for long enough to cause a problem.
On a more difficult-to-reimplement track, have you tried switching to asynchronous i/o? Using aio you could fire off a write and hand it one set of buffers while you're sucking video data into the other set, and when the write finishes you switch sets of buffers.
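A rough sketch of that double-buffered aio scheme, using POSIX AIO from <aio.h> (link with -lrt on older glibc); stdin stands in for the capture source, and the buffer size is an illustration value:

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE (1024 * 1024)

static char bufs[2][BUF_SIZE];

int main(void)
{
    int fd = open("video.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;   /* we poll, no signal */
    int cur = 0, pending = 0;
    off_t off = 0;

    for (;;) {
        /* Fill the current buffer while the previous write (if any)
         * is still in flight on the other buffer. */
        ssize_t n = read(STDIN_FILENO, bufs[cur], BUF_SIZE);
        if (n <= 0)
            break;

        /* Wait for the previous write to finish before reusing cb. */
        if (pending) {
            const struct aiocb *list[1] = { &cb };
            while (aio_error(&cb) == EINPROGRESS)
                aio_suspend(list, 1, NULL);
            if (aio_return(&cb) < 0) { perror("aio_write"); return 1; }
        }

        cb.aio_buf = bufs[cur];
        cb.aio_nbytes = (size_t)n;
        cb.aio_offset = off;
        if (aio_write(&cb) < 0) { perror("aio_write"); return 1; }
        pending = 1;
        off += n;
        cur = 1 - cur;                 /* switch buffers */
    }

    if (pending) {                     /* drain the final write */
        const struct aiocb *list[1] = { &cb };
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);
        aio_return(&cb);
    }
    close(fd);
    return 0;
}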
