linux: smart fsync()?

linux: smart fsync()? - linux

I'm recording audio and writing the same to a SD Card, the data rate is around 1.5 MB/s. I'm using a class 4 SD Card with ext4 file system.
After certain interval, kernel auto syncs the files. The downside of this is, my application buffers pile up waiting to be written to disk.
I think, if the kernel syncs frequently that what it is doing now, it may solve the issue.
I used fsync() in application to sync after certain intervals. But this does not solve the problem, because certain times kernel has synced just before the application called fsync(), so the fsync() called from application was a waste of time.
I need a syncing mechanism (say, smart_fsync() ), so that when application calls smart_fsync(), then the kernel will sync only if it has not synced in a while, else it will just return.
Since there is no function as smart_fsync(). what can be a possible workaround?

The first question to ask is, what exactly is the problem you're experiencing? The kernel will flush dirty (unwritten cached) buffers periodically - this is because doing so tends to be faster than flushing synchronously (less latency hit for applications). The downside is that this means a larger latency hit if you reach the kernel's limit on dirty data (and potentially more data loss after an unclean shutdown).
If you want to ensure that the data hits disk ASAP, then you should simply open the file with the O_SYNC option. This will flush the data to disk immediately upon write(). Of course, this implies a significant performance penalty, but on the other hand you have complete control over when the data is flushed.
If you are experiencing drops in throughput while the syncing is going on, most likely you are attempting to write faster than the disk can support, and reaching the dirty page memory limit. Unfortunately, this would mean the hardware is simply not up to the write rate you are attempting to push at it - you'll need to write slower, or buffer the data up on faster media (or add more RAM!).
Note also that your 'smart fsync' is exactly what the kernel implements - it will flush pages when one of the following is true:
* There is too much dirty data in memory. Triggers asynchronously (without blocking writes) when the total amount of dirty data exceeds /proc/sys/vm/dirty_background_bytes, or when the percentage of total memory exceeds /proc/sys/vm/dirty_background_ratio. Triggers synchronously (blocking your application's write() for an extended time) when the total amount of data exceeds /proc/sys/vm/dirty_bytes, or the percentage of total memory exceeds /proc/sys/vm/dirty_ratio.
* Dirty data has been pending in memory for too long. The pdflush daemon checks for old dirty blocks every /proc/sys/vm/dirty_writeback_centisecs centiseconds (1/100 seconds), and will expire blocks if they have been in memory for longer than /proc/sys/vm/dirty_expire_centisecs.
It's possible that tuning these parameters might help a bit, but you're probably better off figuring out why the defaults aren't keeping up as is.

Related

Hey could someone help me understand sync syscall usage?

like said in the title, I don't really understand the usage of this syscall. I was writing some program that write some data in a file, and the tutorial I've seen told me to use sys_sync syscall. But my problem is why and when should we use this? The data isn't already written on the file?

The manual says:
sync - Synchronize cached writes to persistent storage
So it is written to the file cache in memory, not on disk.
You rarely have to use sync unless you are writing really important data and need to make sure that data is on disk before you go on. One example of systems that use sync a lot are databases (such as MySQL or PostgreSQL).
So in other words, it is theoretically in your file, just not on disk and therefore if you lose electricity, you could lose the data, especially if you have a lot of RAM and many writes in a raw, it may privilege the writes to cache for a long while, increasing the risk of data loss.
But how can a file be not on the disk? I understand the concept of cache but if I wrote in the disk why would it be in a different place?
First, when you write to a file, you send the data to the Kernel. You don't directly send it to the disk. Some kernel driver is then responsible to write the data to disk. In my days on Apple 2 and Amiga computers, I would actually directly read/write to disk. And at least the Amiga had a DMA so you could setup a buffer, then tell the disk I/O to do a read or a write and it would send you an interrupt when done. On the Apple 2, you had to write loops in assembly language with precise timings to read/write data on floppy disks... A different era!
Although you could, of course, directly access the disk (but with a Kernel like Linux, you'd have to make sure the kernel gives you hands free to do that...).
Cache is primarily used for speed. It is very slow to write to disk (as far as a human is concerned, it looks extremely fast, but compared to how much data the CPU can push to the drive, it's still slow).
So what happens is that the kernel has a task to write data to disk. That task wakes up as soon as data appears in the cache and ends once all the caches are transferred to disk. This task works in parallel. You can have one such task per drive (which is especially useful when you have a system such as RAID 1).
If your application fills up the cache, then a further write will block until some of the cache can be replaced.

and the tutorial I've seen told me to use sys_sync syscall
Well that sounds silly, unless you're doing filesystem write benchmarking or something.
If you have one really critical file that you want to make sure is "durable" wrt. power outages before you do something else (like sent a network packet to acknowledge a complete transfer), use fsync(fd) to sync just that one file's data and metadata.
(In asm, call number SYS_fsync from sys/syscall.h, with the file descriptor as the first register arg.)
But my problem is why and when should we use this?
Generally never use the sync system call in programs you're writing.
There are interactive use-cases where you'd normally use the wrapper command of the same name, sync(1). e.g. with removable media, to get the kernel started doing write-back now, so unmount will take less time once you finish typing it. Or for some benchmarking use-cases.
The system shutdown scripts may run sync after unmounting filesystems (and remounting / read-only), before making a reboot(2) system call.
Re: why sync(2) exists
No, your data isn't already on disk right after echo foo > bar.txt.
Most OSes, including Linux, do write-back caching, not write-through, for file writes.
You don't want write() system calls to wait for an actual magnetic disk when there's free RAM, because the traditional way to do I/O is synchronous so simple single-threaded programs wouldn't be able to do anything else (like reading more data or computing anything) while waiting for write() to return. Blocking for ~10 ms on every write system call would be disastrous; that's as long as a whole scheduler timeslice. (It would still be bad even with SSDs, but of course OSes were designed before SSDs were a thing.) Even just queueing up the DMA would be slow, especially for small file writes that aren't a whole number of aligned sectors, so even letting the disk's own write-back write caching work wouldn't be good enough.
Therefore, file writes do create "dirty" pages of kernel buffers that haven't yet been sent to the disk. Sometimes we can even avoid the IO entirely, e.g. for tmp files that get deleted before anything triggers write-back. On Linux, dirty_writeback_centisecs defaults to 1500 (15 seconds) before the kernel starts write-back, unless it's running low on free pages. (Heuristics for what "low" means use other tunable values).
If you really want writes to flush to disk immediately and wait for data to be on disk, mount with -o sync. Or for one program, have it use open(O_SYNC) or O_DSYNC (for just the data, not metadata like timestamps).
See Are file reads served from dirtied pages in the page cache?
There are other advantages to write-back, including delayed allocation even at the filesystem level. The FS can wait until it knows how big the file will be before even deciding where to put it, allowing better decisions that reduce fragmentation. e.g. a small file can go into a gap that would have been a bad place to start a potentially-large file. (It just have to reserve space to make sure it can put it somewhere.) XFS was one of the first filesystems to do "lazy" delayed allocation, and ext4 has also had the feature for a while.
https://en.wikipedia.org/wiki/XFS#Delayed_allocation
https://en.wikipedia.org/wiki/Allocate-on-flush
https://lwn.net/Articles/323169/

EXT4 slower performance successive writes

I'm working with a C program that does a bunch of things but in some part it writes a whole HDD with multiple calls to fwrite on a single file with a fixed size.
The calls are something like this:
fwrite(some_memory,size_element,total_elements,file);
When I measure the wall time of this calls each call takes a bit longer than the previous one. So for example, if want to write in chunks of 900MB of data, the first call (with empty disk) ends within 7 seconds but the last ones takes somewhere between 10~11 secs (with the disk almost at full capacity).
Is this an expected behavior? Is there any way of getting consistent write times independently of disk current capacity?
I'm using an EXT4 wd green 2TB volume.

I'd say this is to be expected as your early calls are most likely satisfied by the kernel's writeback cache and thus returning slightly quicker because at the time fwrite returns not all the data has reached the disk yet. However I don't know how much memory your system has compared to the 900MBytes of data you're trying to write so this is a guess...
If the kernel's cache becomes full (e.g. because the disk can't keep up) then your userspace program is made to block until it is sufficiently empty and able to accept a bit more data leaky bucket style. Only once all data has gone in to the cache can the fwrite complete. However at that point you are likely doing another fwrite call that tops the cache up again and your subsequent call is forced to wait a bit longer because the cache has not been fully emptied. I'd imagine you would reach a fixed point though...
To see if caches really were behind the behaviour you could issue an fsync after each fwrite (destroying your performance) and take the time from the fwrite submission to fsync completion and see if the variance was so large.
Another thing you could do that might help is to preallocate the full size of the file up front so that the filesystem isn't forced to keep regrowing it as new data is appended to the end (this should cut down on metadata operations and fight fragmentation too).
The dirty_* knobs in https://www.kernel.org/doc/Documentation/sysctl/vm.txt are also likely comming in to play given the large amounts of data you are writing.

How to prioritize write() over mmap updates (or delay mmap page cache flush)

I'm running a specialized DB daemon on a debian-64 with 64G of RAM and lots of disk space. It uses an on-disk hashtable (mmaped) and writes the actual data into a file with regular write() calls. When doing really a lot of updates, a big part of the mmap gets dirty and the page cache tries to flush it to disk, producing lots of random writes which in turn slows down the performance of the regular (sequential) writes to the data file.
If it were possible to delay the page cache flush of the mmaped area performance would improve (I assume), since several (or all) changes to the dirty page would be written at once instead of once for every update (worst case, in reality of course it aggregates a lot of changes anyway).
So my question: Is it possible to delay page cache flush for a memory-mapped area? Or is it possible to prioritze the regular write? Or does anyone have any other ideas? madvise and posix_fadvise don't seem to make any difference...

You could play with the tuneables in /proc/sys/vm. For example, increase the value in dirty_writeback_centisecs to make pdflush wake up somewhat less often, increase dirty_expire_centiseconds so data is allowed to stay dirty for longer until it must be written out, and increase dirty_background_ratio to allow more dirty pages to stay in RAM before something must be done.
See here for a somewhat comprehensive description of what all the values do.
Note that this will affect every process on your machine, but seeing how you're running a huge database server, chances are that this is no problem since you don't want anything else to run on the same machine anyway.
Now of course this delays writes, but it still doesn't fully solve the problem of dirty page writebacks competing with write (though it will likely collapse a few writes if there are many updates).
But: You can use the sync_file_range syscall to force beginning write-out of pages in a given range on your "write" file descriptor (SYNC_FILE_RANGE_WRITE). So while the dirty pages will be written back at some unknown time later (and with greater grace periods), you manually kick off writeback on the ones you're interested.
This doesn't give any guarantees, but it should just work.
Be sure to absolutely positively read the documentation, better read it twice. sync_file_range can very easily corrupt or lose data if you use it wrong. In particular, you must be sure metadata is up-to-date and flushed if you appended to a file, or data that has been "successfully written" will just be "gone" in case of a crash.

I would try mlock. If you mlock the relevant memory range, it may prevent the flush from occurring. You could munlock when you're done.

Write IO breakups on linux?

My application is using O_DIRECT for flushing 2MB worth of data directly to a 3-way-stripe storage (mounted as an lvm volume)..
I am getting a very pathetic write speed on this storage. The iostat shows that the large request size is being broken into smaller ones.
avgrq-sz is <20... There aren't much read on that drive.
It takes around 2 seconds to flush down 2MB worth of contiguous memory blocks (using mlock to assure that), sector aligned (using posix_memalign), whereas tests with dd and iozone rate the storage capable of > 20Mbps of write speed.
I would appreciate any clues on how to investigate this issue further.
PS: If this is not the right forum for this query, I would appreciate indicators to a one that could be helpful.
Thanks.

Write IO breakups on linux?
The disk itself may have a maximum request size, there is a tradeoff being block size and latency (the bigger the request being sent to the disk the longer it will likely take to to be consumed) and there can be constraints on how much vectored I/O a driver can consume in a single request. Given all the above, the kernel is going to "break up" single requests that are too large when submitting further down the stack.
I would appreciate any clues on how to investigate this issue further.
Unfortunately it's hard to say why the avgrq-sz is so small (if its in sectors that about 10KBytes per I/O) without seeing the code actually submitting the I/O (maybe your program is submitting 10KByte buffers?). We also don't know if iozone and dd were using O_DIRECT during the questioners test. If they weren't then their I/O would have been going into the write back cache and then streamed out later and the kernel can do that in a more optimal fashion.
Note: Using O_DIRECT is NOT a go faster stripe. In the right circumstances O_DIRECT can lower overhead BUT writing O_DIRECTly to do disk increases the pressure on you to submit I/O in parallel (e.g. via AIO/io_uring or via multiple processes/threads) if you want to reach the highest possible throughput because you have robbed the kernel of its best way of creating parallel submission to the device for you.

Bursty writes to SD/USB stalling my time-critical apps on embedded Linux

I'm working on an embedded Linux project that interfaces an ARM9 to a hardware video encoder chip, and writes the video out to SD card or USB stick. The software architecture involves a kernel driver that reads data into a pool of buffers, and a userland app that writes the data to a file on the mounted removable device.
I am finding that above a certain data rate (around 750kbyte/sec) I start to see the userland video-writing app stalling for maybe half a second, about every 5 seconds. This is enough to cause the kernel driver to run out of buffers - and even if I could increase the number of buffers, the video data has to be synchronised (ideally within 40ms) with other things that are going on in real time. Between these 5 second "lag spikes", the writes complete well within 40ms (as far as the app is concerned - I appreciate they're buffered by the OS)
I think this lag spike is to do with the way Linux is flushing data out to disk - I note that pdflush is designed to wake up every 5s, my understanding is that this would be what does the writing. As soon as the stall is over the userland app is able to quickly service and write the backlog of buffers (that didn't overflow).
I think the device I'm writing to has reasonable ultimate throughput: copying a 15MB file from a memory fs and waiting for sync to complete (and the usb stick's light to stop flashing) gave me a write speed of around 2.7MBytes/sec.
I'm looking for two kinds of clues:
How can I stop the bursty writing from stalling my app - perhaps process priorities, realtime patches, or tuning the filesystem code to write continuously rather than burstily?
How can I make my app(s) aware of what is going on with the filesystem in terms of write backlog and throughput to the card/stick? I have the ability to change the video bitrate in the hardware codec on the fly which would be much better than dropping frames, or imposing an artificial cap on maximum allowed bitrate.
Some more info: this is a 200MHz ARM9 currently running a Montavista 2.6.10-based kernel.
Updates:
Mounting the filesystem SYNC causes throughput to be much too poor.
The removable media is FAT/FAT32 formatted and must be as the purpose of the design is that the media can be plugged into any Windows PC and read.
Regularly calling sync() or fsync() say, every second causes regular stalls and unacceptably poor throughput
I am using write() and open(O_WRONLY | O_CREAT | O_TRUNC) rather than fopen() etc.
I can't immediately find anything online about the mentioned "Linux realtime filesystems". Links?
I hope this makes sense. First embedded Linux question on stackoverflow? :)

For the record, there turned out to be two main aspects that seem to have eliminated the problem in all but the most extreme cases. This system is still in development and hasn't been thoroughly torture-tested yet but is working fairly well (touch wood).
The big win came from making the userland writer app multi-threaded. It is the calls to write() that block sometimes: other processes and threads still run. So long as I have a thread servicing the device driver and updating frame counts and other data to sychronise with other apps that are running, the data can be buffered and written out a few seconds later without breaking any deadlines. I tried a simple ping-pong double buffer first but that wasn't enough; small buffers would be overwhelmed and big ones just caused bigger pauses while the filesystem digested the writes. A pool of 10 1MB buffers queued between threads is working well now.
The other aspect is keeping an eye on ultimate write throughput to physical media. For this I am keeping an eye on the stat Dirty: reported by /proc/meminfo. I have some rough and ready code to throttle the encoder if Dirty: climbs above a certain threshold, seems to vaguely work. More testing and tuning needed later. Fortunately I have lots of RAM (128M) to play with giving me a few seconds to see my backlog building up and throttle down smoothly.
I'll try to remember to pop back and update this answer if I find I need to do anything else to deal with this issue. Thanks to the other answerers.

I'll throw out some suggestions, advice is cheap.
make sure you are using a lower level API for writing to the disk, don't use user-mode caching functions like fopen, fread, fwrite use the lower level functions open, read, write.
pass the O_SYNC flag when you open the file, this will cause each write operation to block until written to disk, which will remove the bursty behavior of your writes...with the expense of each write being slower.
If you are doing reads/ioctls from a device to grab a chunk of video data, you may want to consider allocating a shared memory region between the application and kernel, otherwise you are getting hit with a bunch of copy_to_user calls when transferring video data buffers from kernel space to user space.
You may need to validate that your USB flash device is fast enough with sustained transfers to write the data.
Just a couple thoughts, hope this helps.

Here is some information about tuning pdflush for write-heavy operations.

Sounds like you're looking for linux realtime filesystems. Be sure to search Google et al for that.
XFS has a realtime option, though I haven't played with it.
hdparm might let you turn off the caching altogether.
Tuning the filesystem options (turn off all the extra unneeded file attributes) might reduce what you need to flush, thus speeding the flush. I doubt that'd help much, though.
But my suggestion would be to avoid using the stick as a filesystem at all and instead use it as a raw device. Stuff data on it like you would using 'dd'. Then elsewhere read that raw data and write it out after baking.
Of course, I don't know if that's an option for you.

Has a debugging aid, you could use strace to see what operations is taking time.
There might be some surprising thing with the FAT/FAT32.
Do you write into a single file, or in multiple file ?
You can make a reading thread, that will maintain a pool of video buffer ready to be written in a queue.
When a frame is received, it is added to the queue, and the writing thread is signaled
Shared data
empty_buffer_queue
ready_buffer_queue
video_data_ready_semaphore
Reading thread :
buf=get_buffer()
bufer_to_write = buf_dequeue(empty_buffer_queue)
memcpy(bufer_to_write, buf)
buf_enqueue(bufer_to_write, ready_buffer_queue)
sem_post(video_data_ready_semaphore)
Writing thread
sem_wait(vido_data_ready_semaphore)
bufer_to_write = buf_dequeue(ready_buffer_queue)
write_buffer
buf_enqueue(bufer_to_write, empty_buffer_queue)
If your writing threaded is blocked waiting for the kernel, this could work.
However, if you are blocked inside the kerne space, then thereis nothing much you can do, except looking for a more recent kernel than your 2.6.10

Without knowing more about your particular circumstances, I can only offer the following guesses:
Try using fsync()/sync() to force the kernel to flush data to the storage device more frequently. It sounds like the kernel buffers all your writes and then ties up the bus or otherwise stalls your system while performing the actual write. With careful calls to fsync() you can try to schedule writes over the system bus in a more fine grained way.
It might make sense to structure the application in such a way that the encoding/capture (you didn't mention video capture, so I'm making an assumption here - you might want to add more information) task runs in its own thread and buffers its output in userland - then, a second thread can handle writing to the device. This will give you a smoothing buffer to allow the encoder to always finish its writes without blocking.
One thing that sounds suspicious is that you only see this problem at a certain data rate - if this really was a buffering issue, I'd expect the problem to happen less frequently at lower data rates, but I'd still expect to see this issue.
In any case, more information might prove useful. What's your system's architecture? (In very general terms.)
Given the additional information you provided, it sounds like the device's throughput is rather poor for small writes and frequent flushes. If you're sure that for larger writes you can get sufficient throughput (and I'm not sure that's the case, but the file system might be doing something stupid, like updating the FAT after every write) then having an encoding thread piping data to a writing thread with sufficient buffering in the writing thread to avoid stalls. I've used shared memory ring buffers in the past to implement this kind of scheme, but any IPC mechanism that would allow the writer to write to the I/O process without stalling unless the buffer is full should do the trick.

A useful Linux function and alternative to sync or fsync is sync_file_range. This lets you schedule data for writing without waiting for the in-kernel buffer system to get around to it.
To avoid long pauses, make sure your IO queue (for example: /sys/block/hda/queue/nr_requests) is large enough. That queue is where data goes in between being flushed from memory and arriving on disk.
Note that sync_file_range isn't portable, and is only available in kernels 2.6.17 and later.

I've been told that after the host sends a command, MMC and SD cards "must respond within 0 to 8 bytes".
However, the spec allows these cards to respond with "busy" until they have finished the operation, and apparently there is no limit to how long a card can claim to be busy (please, please tell me if there is such a limit).
I see that some low-cost flash chips such as the M25P80 have a guaranteed "maximum single-sector erase time" of 3 seconds, although typically it "only" requires 0.6 seconds.
That 0.6 seconds sounds suspiciously similar to your "stalling for maybe half a second".
I suspect the tradeoff between cheap, slow flash chips and expensive, fast flash chips has something to do with the wide variation in USB flash drive results:
http://www.testfreaks.com/blog/information/16gb-usb-drive-comparison-17-drives-compared/
http://www.tomshardware.com/reviews/data-transfer-run,1037-10.html
I've heard rumors that every time a flash sector is erased and then re-programmed, it takes a little bit longer than the last time.
So if you have a time-critical application, you may need to (a) test your SD cards and USB sticks to make sure they meet the minimum latency, bandwidth, etc. required by your application, and (b) peridically re-test or pre-emptively replace these memory devices.

Well obvious first, have you tried explicitly telling the file to flush? I also think there might be some ioctl you can use to do it, but I honestly haven't done much C/POSIX file programming.
Seeing you're on a Linux kernel you should be able to tune and rebuild the kernel to something that suits your needs better, eg. much more frequent but then also smaller flushes to the permanent storage.
A quick check in my man pages finds this:
SYNC(2) Linux Programmer’s Manual SYNC(2)
NAME
sync - commit buffer cache to disk
SYNOPSIS
#include <unistd.h>
void sync(void);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sync(): _BSD_SOURCE || _XOPEN_SOURCE >= 500
DESCRIPTION
sync() first commits inodes to buffers, and then buffers to disk.
ERRORS
This function is always successful.

Doing your own flush()ing sounds right to me - you want to be in control, not leave it to the vagaries of the generic buffer layer.
This may be obvious, but make sure you're not calling write() too often - make sure every write() has enough data to be written to make the syscall overhead worth it. Also, in the other direction, don't call it too seldom, or it'll block for long enough to cause a problem.
On a more difficult-to-reimplement track, have you tried switching to asynchronous i/o? Using aio you could fire off a write and hand it one set of buffers while you're sucking video data into the other set, and when the write finishes you switch sets of buffers.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string