Since writes are immediate anyway (copy to kernel buffer and return), what's the advantage of using io_submit for writes?
In fact, it (aio/io_submit) seems worse since you have to allocate the write buffers on the heap and can't use stack-based buffers.
My question is only about writes, not reads.
EDIT: I am talking about relatively small writes (few KB at most), not MB or GB, so buffer copy should not be a big problem.
Copying a buffer into the kernel is not necessarily instantaneous.
First the kernel needs to find a free page. If there is none (which is fairly likely under heavy disk-write pressure), it has to decide to evict one. If it decides to evict a dirty page (instead of evicting your process for instance), it will have to actually write it before it can use that page.
There's a related issue on Linux when saturating writes to a slow drive: the page cache fills up with dirty pages backed by that drive. Whenever the kernel needs a page, for any reason, it can take a long time to acquire one, and the whole system freezes as a result.
The size of each individual write is less relevant than the write pressure of the system. If you have a million small writes already queued up, this may be the one that has to block.
Whether the allocation lives on the stack or the heap is also less relevant. If you want efficient allocation of blocks to write, you can use a dedicated pool allocator (backed by the heap) and not pay for the general-purpose heap allocator.
aio_write() gets around this by not copying the buffer into the kernel at all; the data may even be DMAed straight out of your buffer (given the alignment requirements), which means you're likely to save a copy as well.
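For comparison, here is a minimal sketch of what a single asynchronous write looks like with the Linux-native io_submit interface (via libaio, linked with -laio). The 4096-byte size/alignment and the assumption that fd was opened with O_DIRECT are illustrative, not requirements of your program:

/* Sketch: queue one asynchronous write with io_submit and reap it later.
 * Assumes fd was opened with O_WRONLY | O_DIRECT and that 4096 bytes is a
 * valid alignment/size for the device. Compile with -laio. */
#include <libaio.h>
#include <stdlib.h>
#include <string.h>

int submit_one_write(int fd, long long offset)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    void *buf;

    if (io_setup(1, &ctx) < 0)
        return -1;
    if (posix_memalign(&buf, 4096, 4096))        /* O_DIRECT wants aligned buffers */
        return -1;
    memset(buf, 'x', 4096);

    io_prep_pwrite(&cb, fd, buf, 4096, offset);  /* describe the write */
    if (io_submit(ctx, 1, cbs) != 1)             /* queue it; returns without waiting for the disk */
        return -1;

    /* ... do other work here; the buffer must stay valid until completion ... */

    io_getevents(ctx, 1, 1, &ev, NULL);          /* block until the write completes */
    free(buf);
    io_destroy(ctx);
    return 0;
}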
I'm working with a C program that does a bunch of things but in some part it writes a whole HDD with multiple calls to fwrite on a single file with a fixed size.
The calls are something like this:
fwrite(some_memory,size_element,total_elements,file);
When I measure the wall time of these calls, each call takes a bit longer than the previous one. For example, if I want to write in chunks of 900MB of data, the first call (with an empty disk) finishes within 7 seconds, but the last ones take somewhere between 10 and 11 seconds (with the disk almost at full capacity).
Is this an expected behavior? Is there any way of getting consistent write times independently of disk current capacity?
I'm using an EXT4 wd green 2TB volume.
I'd say this is to be expected: your early calls are most likely satisfied by the kernel's writeback cache and thus return slightly quicker, because at the time fwrite returns not all the data has reached the disk yet. However, I don't know how much memory your system has compared to the 900MB of data you're trying to write, so this is a guess...
If the kernel's cache becomes full (e.g. because the disk can't keep up), then your userspace program is made to block until it is sufficiently empty and able to accept a bit more data, leaky-bucket style. Only once all data has gone into the cache can the fwrite complete. However, at that point you are likely doing another fwrite call that tops the cache up again, and your subsequent call is forced to wait a bit longer because the cache has not been fully emptied. I'd imagine you would reach a fixed point though...
To see if caches really were behind the behaviour, you could issue an fsync after each fwrite (destroying your performance), take the time from the fwrite submission to fsync completion, and see whether the variance is still as large.
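A rough sketch of that measurement, assuming the same some_memory/size_element/total_elements/file variables from the question; fflush is needed so stdio's own buffer doesn't hide part of the cost:

/* Time fwrite alone vs. fwrite + fflush + fsync for one chunk (sketch). */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double seconds_between(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

void timed_chunk_write(const void *some_memory, size_t size_element,
                       size_t total_elements, FILE *file)
{
    struct timespec t0, t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    fwrite(some_memory, size_element, total_elements, file);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    fflush(file);                 /* push stdio's buffer into the kernel */
    fsync(fileno(file));          /* force the dirty pages out to the disk */
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("fwrite: %.3fs, fwrite+fsync: %.3fs\n",
           seconds_between(t0, t1), seconds_between(t0, t2));
}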
Another thing you could do that might help is to preallocate the full size of the file up front so that the filesystem isn't forced to keep regrowing it as new data is appended to the end (this should cut down on metadata operations and fight fragmentation too).
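If you want to try the preallocation, a posix_fallocate call before the write loop is probably the simplest route (a sketch; total_size stands in for your fixed file size):

/* Reserve the whole file up front so the filesystem doesn't keep regrowing it (sketch). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>

int preallocate(FILE *file, off_t total_size)
{
    return posix_fallocate(fileno(file), 0, total_size);  /* returns 0 on success */
}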
The dirty_* knobs in https://www.kernel.org/doc/Documentation/sysctl/vm.txt are also likely coming into play given the large amounts of data you are writing.
The experiment is on Linux, x86 32-bit.
So suppose in my assembly program, I need to periodically (for instance, every time after executing 100000 basic blocks) dump an array in the .bss section from memory to the disk. The starting address and size of the array are fixed. The array records the executed basic blocks' addresses; the size is 16MB right now.
I tried to write some native code to memcpy from the .bss section to the stack, and then write it back to disk. But it seems very tedious, and I am worried about the performance and memory consumption of allocating a very large buffer on the stack every time...
So here is my question, how can I dump the memory from global data sections in an efficient way? Am I clear enough?
First of all, don't write this part of your code in asm, esp. not at first. Write a C function to handle this part, and call it from asm. If you need to perf-tune the part that only runs when it's time to dump another 16MiB, you can hand-tune it then. System-level programming is all about checking error returns from system calls (or C stdio functions), and doing that in asm would be painful.
Obviously you can write anything in asm, since making system calls isn't anything special compared to C. And there's no part of any of this that's easier in asm compared to C, except for maybe throwing in an MFENCE around the locking.
Anyway, I've addressed three variations on what exactly you want to happen with your buffer:
Overwrite the same buffer in place (mmap(2) / msync(2))
Append a snapshot of the buffer to a file (with either write(2) or probably-not-working zero-copy vmsplice(2) + splice(2) idea.)
Start a new (zeroed) buffer after writing the old one. mmap(2) sequential chunks of your output file.
In-place overwrites
If you just want to overwrite the same area of disk every time, mmap(2) a file and use that as your array. (Call msync(2) periodically to force the data to disk.) The mmapped method won't guarantee a consistent state for the file, though. Writes can get flushed to disk other than on request. IDK if there's a way to avoid that with any kind of guarantee (i.e. not just choosing buffer-flush timers and so on so your pages usually don't get written except by msync(2).)
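A minimal sketch of that approach, assuming a 16MiB array and an illustrative file path; MS_SYNC vs. MS_ASYNC decides whether the flush call waits for the write:

/* In-place overwrite sketch: back the array with a file mapping and msync it
 * periodically. The path and the 16MiB size are placeholders. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define ARRAY_SIZE (16 * 1024 * 1024)

void *map_array(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, ARRAY_SIZE) < 0) {   /* file must cover the whole mapping */
        close(fd);
        return NULL;
    }
    void *buf = mmap(NULL, ARRAY_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);                             /* the mapping stays valid after close */
    return buf == MAP_FAILED ? NULL : buf;
}

/* Call every N basic blocks to push the current contents to disk. */
void flush_array(void *buf)
{
    msync(buf, ARRAY_SIZE, MS_SYNC);       /* MS_ASYNC just schedules the writeback */
}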
Append snapshots
The simple way to append a buffer to a file would be to simply call write(2) when you want it written. write(2) does everything you need. If your program is multi-threaded, you might need to take a lock on the data before the system call, and release the lock afterwards. I'm not sure how fast the write system call would return. It may only return after the kernel has copied your data to the page cache.
If you just need a snapshot, but all writes into the buffer are atomic transactions (i.e. the buffer is always in a consistent state, rather than pairs of values that need to be consistent with each other), then you don't need to take a lock before calling write(2). There will be a tiny amount of bias in this case (data at the end of the buffer will be from a slightly later time than data from the start, assuming the kernel copies in order).
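A sketch of that, with a hypothetical buf_lock mutex standing in for whatever synchronization your program already has (drop the lock/unlock pair if the snapshot doesn't need to be consistent):

/* Append one snapshot of the buffer to fd (sketch). */
#include <pthread.h>
#include <unistd.h>

extern pthread_mutex_t buf_lock;       /* hypothetical lock protecting the buffer */

int dump_snapshot(int fd, const void *buf, size_t len)
{
    pthread_mutex_lock(&buf_lock);
    ssize_t n = write(fd, buf, len);   /* returns once the data is in the page cache;
                                          may write fewer than len bytes, so a real
                                          version would loop on short writes */
    pthread_mutex_unlock(&buf_lock);
    return n == (ssize_t)len ? 0 : -1;
}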
IDK if write(2) returns slower or faster with direct IO (zero-copy, bypassing the page cache). open(2) your file with O_DIRECT, then write(2) normally.
There has to be a copy somewhere in the process, if you want to write a snapshot of the buffer and then keep modifying it. Or else MMU copy-on-write trickery:
Zero-copy append snapshots
There is an API for doing zero-copy writes of user pages to disk files. Linux's vmsplice(2) and splice(2) in that order will let you tell the kernel to map your pages into the page cache. Without SPLICE_F_GIFT, I assume it sets them up as copy-on-write. (oops, actually the man page says without SPLICE_F_GIFT, the following splice(2) will have to copy. So IDK if there is a mechanism to get copy-on-write semantics.)
Assuming there was a way to get copy-on-write semantics for your pages, until the kernel was done writing them to disk and could release them:
Further writes might need the kernel to memcpy one or two pages before the data hit disk, but save copying the whole buffer. The soft page faults and page-table manipulation overhead might not be worth it anyway, unless your data access pattern is very spatially-localized over the short periods of time until the write hits disk and the to-be-written pages can be released. (I think an API that works this way doesn't exist, because there's no mechanism for getting the pages released right after they hit disk. Linux wants to take them over and keep them in the page cache.)
I haven't ever used vmsplice, so I might be getting some details wrong.
If there's a way to create a new copy-on-write mapping of the same memory, maybe by mmaping a new mapping of a scratch file (on a tmpfs filesystem, prob. /dev/shm), that would get you snapshots without holding the lock for long. Then you can just pass the snapshot to write(2), and unmap it ASAP before too many copy-on-write page faults happen.
New buffer for every chunk
If it's ok to start with a zeroed buffer after every write, you could mmap(2) successive chunks of the file, so the data you generate is always already in the right place.
(optional) fallocate(2) some space in your output file, to prevent fragmentation if your write pattern isn't sequential.
mmap(2) your buffer to the first 16MiB of your output file.
run normally
When you want to move on to the next 16MiB:
take a lock to prevent other threads from using the buffer
munmap(2) your buffer
mmap(2) the next 16MiB of the file to the same address, so you don't need to pass the new address around to writers. These pages will be pre-zeroed, as required by POSIX (can't have the kernel exposing memory).
release the lock
Possibly mmap(buf, 16MiB, ... MAP_FIXED, fd, new_offset) could replace the munmap / mmap pair. MAP_FIXED discards old mappings that it overlaps. I assume this doesn't mean that modifications to the file / shared memory are discarded, but rather that the actual mapping changes, even without an munmap.
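A sketch of that switch-over step, assuming the output file has already been extended (with ftruncate or fallocate) to cover the next chunk, and that buf is the fixed address your writers use:

/* Remap the same virtual address onto the next 16MiB of the output file (sketch). */
#include <sys/mman.h>
#include <sys/types.h>

#define CHUNK (16 * 1024 * 1024)

int advance_chunk(void *buf, int fd, off_t *offset)
{
    *offset += CHUNK;
    /* take your lock here so no thread writes into buf mid-switch */
    void *p = mmap(buf, CHUNK, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, fd, *offset);  /* replaces the old mapping */
    /* release the lock here */
    return p == MAP_FAILED ? -1 : 0;
}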
Two clarifications for the Append snapshots case from Peter's answer.
1. Appending without O_DIRECT
As Peter said, if you don't use O_DIRECT, write() will return as soon as the data has been copied to the page cache. If the page cache is full, it will block until some old pages have been flushed to disk.
If you are only appending data without reading it (soon), you can benefit from periodically calling sync_file_range(2) to schedule writeback of previously written pages, and posix_fadvise(2) with the POSIX_FADV_DONTNEED flag to remove already flushed pages from the page cache. This could significantly reduce the possibility that write() would block.
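A sketch of that pattern; the chunk offsets are whatever bookkeeping your writer already keeps, and SYNC_FILE_RANGE_WRITE only starts writeback without waiting for it:

/* After writing a chunk: schedule its writeback and drop already-flushed pages (sketch). */
#define _GNU_SOURCE                /* sync_file_range is Linux-specific */
#include <fcntl.h>

void after_chunk_written(int fd, off_t chunk_off, off_t chunk_len)
{
    /* start writeback of the chunk we just wrote, without blocking */
    sync_file_range(fd, chunk_off, chunk_len, SYNC_FILE_RANGE_WRITE);

    /* everything before this chunk should be clean by now; drop it from the cache */
    if (chunk_off > 0)
        posix_fadvise(fd, 0, chunk_off, POSIX_FADV_DONTNEED);
}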
2. Appending with O_DIRECT
With O_DIRECT, write() normally would block until the data is sent to disk (although it's not strictly guaranteed, see here). Since this is slow, be prepared to implement your own I/O scheduling if you need non-blocking writes.
The benefits you could achieve are more predictable behaviour (you control when you will block) and probably reduced memory and CPU usage through cooperation between your application and the kernel.
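For reference, an O_DIRECT write typically looks something like this sketch; the 4096-byte alignment is an assumption (the real requirement depends on the filesystem and device), and fd is assumed to have been opened with O_WRONLY | O_DIRECT:

/* Write one aligned, padded block at an aligned offset with O_DIRECT (sketch). */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define ALIGN 4096

ssize_t direct_write(int fd, const void *data, size_t len, off_t offset)
{
    size_t padded = (len + ALIGN - 1) / ALIGN * ALIGN;  /* round up to the block size */
    void *buf;
    if (posix_memalign(&buf, ALIGN, padded))
        return -1;
    memset(buf, 0, padded);
    memcpy(buf, data, len);
    ssize_t n = pwrite(fd, buf, padded, offset);        /* usually blocks until the disk has it */
    free(buf);
    return n;
}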
I'm writing a memory allocation routine, and it's currently running smoothly. I get my memory from the OS with mmap() in 4096-byte pages. When I start my memory allocator I allocate 1gig of virtual address space with mmap(), and then as allocations are made I divide it up into hunks according to the specifics of my allocation algorithm.
I feel safe allocating as much as 1gig of memory on a whim because I know mmap() doesn't actually put pages into physical memory until I actually write to them.
Now, the program using my allocator might have a spurt where it needs a lot of memory, and in this case the OS would have to eventually put a whole 1gig worth of pages into physical RAM. The trouble is that the program might then go into a dormant period where it frees most of that 1gig and then uses only minimal amounts of memory. Yet, all I really do inside my allocator's MyFree() function is flip a few bits of bookkeeping data which mark the previously used gig as free, but I know this doesn't cause the OS to remove those pages from physical memory.
I can't use something like munmap() to fix this problem, because the nature of the allocation algorithm is such that it requires a continuous region of memory without any holes in it. Basically I need a way to tell the OS "Listen, you can take these pages out of physical memory and clear them to 0, but please remap them on the fly when I need them again, as if they were freshly mmap()'d"
What would be the best way to go about this?
Actually, after writing this all up I just realized that I can probably do an munmap() followed immediately by a fresh mmap(). Would that be the correct way to go about it? I get the sense that there's probably some more efficient way to do this.
You are looking for madvise(addr, length, MADV_DONTNEED). From the manpage:
MADV_DONTNEED: Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
Note especially the language about how subsequent accesses will succeed but revert to zero-fill-on-demand (for mappings without an underlying file).
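A sketch of how the two pieces fit together in an allocator like yours; the sizes are illustrative, and MAP_NORESERVE is explained in the note below:

/* Reserve a large arena up front, and drop the physical pages behind a freed
 * run without giving up the address range (sketch). */
#include <stddef.h>
#include <sys/mman.h>

void *reserve_arena(size_t size)                  /* e.g. 1 GiB */
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

void release_physical(void *base, size_t len)     /* base/len page-aligned */
{
    /* Pages stay mapped; the kernel may reclaim the RAM behind them.
       The next access gets fresh zero-filled pages on demand. */
    madvise(base, len, MADV_DONTNEED);
}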
Your thinking-out-loud alternative of an munmap followed immediately by another mmap will also work, but risks kernel-side inefficiencies because the kernel is no longer tracking the allocation as a single contiguous region; if there are many such unmap-and-remap events, the kernel-side data structures might wind up quite bloated.
By the way, with this kind of allocator it's very important that you use MAP_NORESERVE for the initial allocation, and then touch each page as you allocate it, and trap any resulting SIGSEGV and fail the allocation. (And you'll need to document that your allocator installs a handler for SIGSEGV.) If you don't do this your application will not work on systems that have disabled memory overcommit. See the mmap manpage for more detail.
It's not clear to me what the difference is between the two Linux memory concepts: buffer and cache. I've read through this post and it seems to me that the difference between them is the expiration policy:
buffer's policy is first-in, first-out
cache's policy is Least Recently Used.
Am I right?
In particular, I'm looking at the two commands: free and vmstat
james#utopia:~$ vmstat -S M
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
5 0 0 173 67 912 0 0 19 59 75 1087 24 4 71 1
james#utopia:~$ free -m
total used free shared buffers cached
Mem: 2007 1834 172 0 67 914
-/+ buffers/cache: 853 1153
Swap: 2859 0 2859
Buffers are associated with a specific block device, and cover caching
of filesystem metadata as well as tracking in-flight pages. The cache
only contains parked file data. That is, the buffers remember what's
in directories, what file permissions are, and keep track of what
memory is being written from or read to for a particular block device.
The cache only contains the contents of the files themselves.
Cited answer (for reference):
Short answer: Cached is the size of the page cache. Buffers is the size of in-memory block I/O buffers. Cached matters; Buffers is largely irrelevant.
Long answer: Cached is the size of the Linux page cache, minus the memory in the swap cache, which is represented by SwapCached (thus the total page cache size is Cached + SwapCached). Linux performs all file I/O through the page cache. Writes are implemented as simply marking as dirty the corresponding pages in the page cache; the flusher threads then periodically write back to disk any dirty pages. Reads are implemented by returning the data from the page cache; if the data is not yet in the cache, it is first populated. On a modern Linux system, Cached can easily be several gigabytes. It will shrink only in response to memory pressure. The system will purge the page cache along with swapping data out to disk to make available more memory as needed.
Buffers are in-memory block I/O buffers. They are relatively short-lived. Prior to Linux kernel version 2.4, Linux had separate page and buffer caches. Since 2.4, the page and buffer cache are unified and Buffers is raw disk blocks not represented in the page cache, i.e., not file data. The Buffers metric is thus of minimal importance. On most systems, Buffers is often only tens of megabytes.
"Buffers" represent how much portion of RAM is dedicated to cache disk blocks. "Cached" is similar like "Buffers", only this time it caches pages from file reading.
quote from:
https://web.archive.org/web/20110207101856/http://www.linuxforums.org/articles/using-top-more-efficiently_89.html
It's not 'quite' as simple as this, but it might help you understand:
Buffer is for storing file metadata (permissions, location, etc). Every memory page is kept track of here.
Cache is for storing actual file contents.
Explained by Red Hat:
Cache Pages:
A cache is the part of the memory which transparently stores data so that future requests for that data can be served faster. This memory is utilized by the kernel to cache disk data and improve i/o performance.
The Linux kernel is built in such a way that it will use as much RAM as it can to cache information from your local and remote filesystems and disks. As reads and writes are performed on the system over time, the kernel tries to keep data stored in memory for the processes running on the system, or for data that those processes are likely to use in the near future. The cache is not reclaimed when a process stops or exits; however, when other processes require more memory than is freely available, the kernel runs heuristics to reclaim memory by freeing cached data and allocating that memory to the new process.
When any kind of file/data is requested, the kernel will look for a copy of the part of the file the user is acting on and, if no such copy exists, it will allocate one new page of cache memory and fill it with the appropriate contents read from the disk.
The data that is stored within a cache might be values that have been computed earlier or duplicates of original values that are stored elsewhere on the disk. When some data is requested, the cache is first checked to see whether it contains that data. The data can be retrieved more quickly from the cache than from its original source.
SysV shared memory segments are also accounted as cache, though they do not represent any data on the disks. One can check the size of the shared memory segments using the ipcs -m command and checking the bytes column.
Buffers:
Buffers are the disk-block representation of the data that is stored in the page cache. Buffers contain the metadata of the files/data that reside in the page cache.
Example: When data that is present in the page cache is requested, the kernel first checks the buffers, which contain the metadata pointing to the actual files/data in the page cache. Once the actual block address of the file is known from the metadata, the kernel picks the data up for processing.
buffer and cache.
A buffer is something that has yet to be "written" to disk.
A cache is something that has been "read" from the disk and stored for later use.
I think this page will help you understand the difference between buffer and cache in depth: http://www.tldp.org/LDP/sag/html/buffer-cache.html
Reading from a disk is very slow compared to accessing (real) memory. In addition, it is common to read the same part of a disk several times during relatively short periods of time. For example, one might first read an e-mail message, then read the letter into an editor when replying to it, then make the mail program read it again when copying it to a folder. Or, consider how often the command ls might be run on a system with many users. By reading the information from disk only once and then keeping it in memory until no longer needed, one can speed up all but the first read. This is called disk buffering, and the memory used for the purpose is called the buffer cache.
Since memory is, unfortunately, a finite, nay, scarce resource, the buffer cache usually cannot be big enough (it can't hold all the data one ever wants to use). When the cache fills up, the data that has been unused for the longest time is discarded and the memory thus freed is used for the new data.
Disk buffering works for writes as well. On the one hand, data that is written is often soon read again (e.g., a source code file is saved to a file, then read by the compiler), so putting data that is written in the cache is a good idea. On the other hand, by only putting the data into the cache, not writing it to disk at once, the program that writes runs quicker. The writes can then be done in the background, without slowing down the other programs.
Seth Robertson's Link 2 said "For thorough understanding of those terms, refer to Linux kernel book like Linux Kernel Development by Robert M. Love."
I found some contents about 'buffer' in the 2nd edition of the book.
Although the physical device itself is addressable at the sector level, the kernel performs all disk operations in terms of blocks.
When a block is stored in memory (say, after a read or pending a write), it is stored in a 'buffer'. Each 'buffer' is associated with exactly one block. The 'buffer' serves as the object that represents a disk block in memory.
A 'buffer' is the in-memory representation of a single physical disk block.
Block I/O operations manipulate a single disk block at a time. A common block I/O operation is reading and writing inodes. The kernel provides the bread() function to perform a low-level read of a single block from disk. Via 'buffers', disk blocks are mapped to their associated in-memory pages.
Quote from the book:
Introduction to Information Retrieval
Cache
We want to keep as much data as possible in memory, especially those data that we need to access frequently. We call the technique of keeping frequently used disk data in main memory caching.
Buffer
Operating systems generally read and write entire blocks. Thus, reading a single byte from disk can take as much time as reading the entire block. Block sizes of 8, 16, 32, and 64 kilobytes (KB) are common. We call the part of main memory where a block being read or written is stored a buffer.
Buffer is an area of memory used to temporarily store data while it's being moved from one place to another.
Cache is a temporary storage area used to store frequently accessed data for rapid access. Once the data is stored in the cache, future use can be done by accessing the cached copy rather than re-fetching the original data, so that the average access time is shorter.
Note: buffer and cache can be allocated on disk as well
Buffer contains metadata which helps improve write performance
Cache contains the file content itself (sometimes yet to write to disk) which improves read performance
For starters, the general concept would be helpful: a buffer is an area of memory used to temporarily store data while it is being moved from one place to another. A cache, on the other hand, is a temporary storage area used to store frequently accessed data for rapid access.
In Linux:
The cache in Linux is called the page cache. It is the portion of system memory that the kernel reserves for caching filesystem disk accesses, to make overall performance faster. During Linux read system calls, the kernel checks whether the cache contains the requested blocks of data; if it does, that is a cache hit, and the data is returned without doing any I/O to the disk. The Linux cache approach is called a write-back cache: data is first written to cache memory and marked as dirty until it is synchronized to disk. The kernel maintains internal data structures to decide which data to evict from the cache when additional space is needed. For example, when memory usage reaches certain thresholds, background tasks start writing dirty data to disk, thereby freeing up cache memory.
Cache: This is an area of physical RAM that the kernel acquires to store pages. We then need some sort of index to find the addresses of those pages in the cache; this is what the buffers for the page cache provide, keeping the metadata of the page cache.
From the man page for free:
DESCRIPTION
free displays the total amount of free and used physical and swap memory in the system, as well as the buffers and caches used by the
kernel. The information is gathered by parsing /proc/meminfo. The displayed columns are:
total Total installed memory (MemTotal and SwapTotal in /proc/meminfo)
used Used memory (calculated as total - free - buffers - cache)
free Unused memory (MemFree and SwapFree in /proc/meminfo)
shared Memory used (mostly) by tmpfs (Shmem in /proc/meminfo)
buffers
Memory used by kernel buffers (Buffers in /proc/meminfo)
cache Memory used by the page cache and slabs (Cached and SReclaimable in /proc/meminfo)
buff/cache
Sum of buffers and cache
available
Estimation of how much memory is available for starting new applications, without swapping. Unlike the data provided by the cache
or free fields, this field takes into account page cache and also that not all reclaimable memory slabs will be reclaimed due to
items being in use (MemAvailable in /proc/meminfo, available on kernels 3.14, emulated on kernels 2.6.27+, otherwise the same as
free)
I'm writing lots and lots of data that will not be read again for weeks. As my program runs, the amount of free memory on the machine (displayed with 'free' or 'top') drops very quickly, but the amount of memory my app uses does not increase, and neither does the amount of memory used by other processes.
This leads me to believe the memory is being consumed by the filesystem's cache. Since I do not intend to read this data for a long time, I'm hoping to bypass the system's buffers so that my data is written directly to disk. I don't have dreams of improving performance or being a super ninja; my hope is to give the filesystem a hint that I'm not going to come back for this data any time soon, so don't spend time optimizing for those cases.
On Windows I've faced similar problems and fixed them using FILE_FLAG_NO_BUFFERING|FILE_FLAG_WRITE_THROUGH: the machine's memory was not consumed by my app and the machine was more usable in general. I'm hoping to duplicate the improvements I've seen, but on Linux. On Windows there is the restriction of writing in sector-sized pieces; I'm happy to accept this restriction for the amount of gain I've measured.
Is there a similar way to do this on Linux?
The closest equivalent to the Windows flags you mention I can think of is to open your file with the open(2) flags O_DIRECT | O_SYNC:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In
general this will degrade performance, but it is useful in special
situations, such as when applications do their own caching. File I/O
is done directly to/from user space buffers. The O_DIRECT flag on its
own makes an effort to transfer data synchronously, but does not
give the guarantees of the O_SYNC that data and necessary metadata are
transferred. To guarantee synchronous I/O the O_SYNC must be used in
addition to O_DIRECT. See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block devices is
described in raw(8).
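The open call itself would look roughly like this (a sketch; the path and mode are placeholders, and _GNU_SOURCE is needed for O_DIRECT on glibc):

/* Open a file for unbuffered, write-through output (sketch). Writes then have
 * to be in suitably aligned, block-sized pieces, much like with
 * FILE_FLAG_NO_BUFFERING on Windows. */
#define _GNU_SOURCE
#include <fcntl.h>

int open_unbuffered(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);
}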
Granted, while trying to research this flag to confirm it's what you want, I found this interesting piece telling you that unbuffered I/O is a bad idea, with Linus describing it as "brain damaged". According to that you should be using madvise() instead to tell the kernel how to cache pages. YMMV.
You can use O_DIRECT, but in that case you need to do the block I/O yourself; you must write in multiples of the FS block size and on block boundaries (it is possible that this is not mandatory, but if you don't, performance will suck x1000 because every unaligned write will need a read first).
Another, much less intrusive, way of stopping your blocks from using up the OS cache without using O_DIRECT is to use posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED). Under Linux 2.6 kernels which support it, this immediately discards (clean) blocks from the cache. Of course you need to use fdatasync() or the like first, otherwise the blocks may still be dirty and hence won't be cleared from the cache.
It is probably a bad idea to call fdatasync() and posix_fadvise( ... POSIX_FADV_DONTNEED) after every write; instead, wait until you've written a reasonable amount (50MB or 100MB, maybe).
So in short:
after every significant chunk of writes,
call fdatasync followed by posix_fadvise( ... POSIX_FADV_DONTNEED).
This will flush the data to disk and immediately remove it from the OS cache, leaving space for more important things.
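That step might look like this sketch, called every 50-100MB of output rather than after each write; a length of 0 in posix_fadvise means "to the end of the file":

/* Flush what we've written so far and evict it from the page cache (sketch). */
#include <fcntl.h>
#include <unistd.h>

void flush_and_drop(int fd)
{
    fdatasync(fd);                                   /* make sure the written pages are clean */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);    /* then drop them from the cache */
}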
Some users have found that things like fast-growing log files can easily blow "more useful" stuff out of the disc cache, which reduces cache hits a lot on a box which needs to have a lot of read cache, but also writes logs quickly. This is the main motivation for this feature.
However, like any optimisation
a) You're not going to need it so
b) Do not do it (yet)
as my program runs the amount of free memory on the machine drops very quickly
Why is this a problem? Free memory is memory that isn't serving any useful purpose. When it's used to cache data, at least there is a chance it will be useful.
If one of your programs requests more memory, file caches will be the first thing to go. Linux knows that it can re-read that data from disk whenever it wants, so it will just reap the memory and give it a new use.
It's true that Linux by default waits around 30 seconds (this is what the value used to be anyhow) before flushing writes to disk. You can speed this up with a call to fsync(). But once the data has been written to disk, there's practically zero cost to keeping a cache of the data in memory.
Seeing as you write to the file and don't read from it, Linux will probably guess that this data is the best to throw out, in preference to other cached data. So don't waste effort trying to optimise unless you've confirmed that it's a performance problem.