What are the semantics of vmsplice(2), with and without gifting?

I'm trying to understand the functionality of the vmsplice(2) syscall (see the vmsplice(2) man page). I have two questions about the effect of the SPLICE_F_GIFT flag:
The man page says that once you gift pages to the kernel, you must never modify the memory again. Does that mean the memory is pinned forever, or does it perhaps refer to virtual memory that can be unmapped by the gifting process, rather than physical memory? In other words, what does a typical use of this look like?
If I don't set SPLICE_F_GIFT, is vmsplice(2) any different than a vectorized write syscall like writev(2)?

1 - Yes, it's different.
If you write 1GB to a pipe with write(2), it will loop until the whole 1GB has been delivered to the pipe, unless a signal interrupts the work.
If you vmsplice 1GB to a pipe, it will block only if the pipe buffer is full, and each call transfers only as much as fits in the pipe's buffer.
It is frustrating that it doesn't loop and keep writing the way a regular write does: you trade away the copy for making a whole bunch of vmsplice calls and implementing a loop for partial writes, as sketched below.
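Here is a minimal sketch of such a loop in C; pipefd is assumed to be the write end of a pipe, and buf/len are placeholder names:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/uio.h>

    /* Splice an entire buffer into a pipe, retrying partial splices. */
    static int vmsplice_all(int pipefd, void *buf, size_t len)
    {
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        while (iov.iov_len > 0) {
            ssize_t n = vmsplice(pipefd, &iov, 1, 0);
            if (n < 0) {
                if (errno == EINTR)
                    continue;              /* retry on signal, like a write loop */
                return -1;
            }
            /* Partial splice: advance the iovec and keep going. */
            iov.iov_base = (char *)iov.iov_base + n;
            iov.iov_len -= n;
        }
        return 0;
    }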
2 - I was vmsplicing from mmap()ed areas and was able to munmap immediately after vmsplicing, without crashes or data corruption.

Does that mean the memory is pinned forever, or does it perhaps refer to virtual memory that can be unmapped by the gifting process, rather than physical memory? In other words, what does a typical use of this look like?
You are promising not to modify the page's contents, not its virtual addressing. For most use cases the suggested sequence of operations is something like this (a sketch follows the list):
mmap
read
vmsplice
munmap
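As a rough sketch (error handling and the partial-splice loop from answer 1 are elided; srcfd and pipefd are assumed descriptors):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Gift one chunk to the pipe. Gifting requires page-aligned address
     * and length, so len should be a multiple of the page size. */
    void gift_one_chunk(int srcfd, int pipefd, size_t len)
    {
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        read(srcfd, buf, len);                      /* fill the pages */

        struct iovec iov = { .iov_base = buf, .iov_len = len };
        vmsplice(pipefd, &iov, 1, SPLICE_F_GIFT);   /* promise never to modify them again */

        munmap(buf, len);                           /* unmapping is fine; modifying is not */
    }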
Generally you want to use mmap over malloc because you need to be sure you own whole pages, not just 4096 bytes of RAM that could sit in the middle of a 2MB or 1GB huge page if your allocator decides that is more efficient.
If I don't set SPLICE_F_GIFT, is vmsplice(2) any different than a vectorized write syscall like writev(2)?
Yes
Most buffers in the kernel are pipes; or rather, pipes are represented by the same data structure the kernel uses for buffers.

Related

Dump global data to disk in assembly code

The experiment is on Linux, x86 32-bit.
So suppose that in my assembly program I need to periodically (for instance, every time after executing 100000 basic blocks) dump an array in the .bss section from memory to disk. The starting address and size of the array are fixed. The array records the executed basic blocks' addresses; the size is 16MB right now.
I tried to write some native code to memcpy from the .bss section to the stack, and then write it back to disk. But that seems very tedious, and I am worried about the performance and memory consumption of allocating a very large buffer on the stack every time...
So here is my question, how can I dump the memory from global data sections in an efficient way? Am I clear enough?
First of all, don't write this part of your code in asm, especially not at first. Write a C function to handle this part, and call it from asm. If you need to performance-tune the part that only runs when it's time to dump another 16MiB, you can hand-tune it then. System-level programming is all about checking error returns from system calls (or C stdio functions), and doing that in asm would be painful.
Obviously you can write anything in asm, since making system calls isn't anything special compared to C. And there's no part of any of this that's easier in asm compared to C, except for maybe throwing in an MFENCE around the locking.
Anyway, I've addressed three variations on what exactly you want to happen with your buffer:
Overwrite the same buffer in place (mmap(2) / msync(2))
Append a snapshot of the buffer to a file (with either write(2) or the probably-not-working zero-copy vmsplice(2) + splice(2) idea.)
Start a new (zeroed) buffer after writing the old one. mmap(2) sequential chunks of your output file.
In-place overwrites
If you just want to overwrite the same area of disk every time, mmap(2) a file and use that as your array. (Call msync(2) periodically to force the data to disk.) The mmapped method won't guarantee a consistent state for the file, though: writes can get flushed to disk at times other than when you request it. IDK if there's a way to avoid that with any kind of guarantee (i.e. other than choosing buffer-flush timers and so on so that your pages usually don't get written except by msync(2)).
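A minimal sketch of the in-place variant (file name and size are placeholders):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BUF_SIZE (16 * 1024 * 1024)

    int main(void)
    {
        int fd = open("dump.bin", O_RDWR | O_CREAT, 0644);
        ftruncate(fd, BUF_SIZE);            /* the file must cover the mapping */

        unsigned char *array = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0);

        array[0] = 42;                      /* ...program records basic blocks... */

        msync(array, BUF_SIZE, MS_SYNC);    /* force the current contents to disk */

        munmap(array, BUF_SIZE);
        close(fd);
        return 0;
    }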
Append snapshots
The simple way to append the buffer to a file is to call write(2) whenever you want it written. write(2) does everything you need. If your program is multi-threaded, you might need to take a lock on the data before the system call and release it afterwards. I'm not sure how fast the write system call returns; it may only return after the kernel has copied your data to the page cache.
If you just need a snapshot, but all writes into the buffer are atomic transactions (i.e. the buffer is always in a consistent state, rather than pairs of values that need to be consistent with each other), then you don't need to take a lock before calling write(2). There will be a tiny amount of bias in this case (data at the end of the buffer will be from a slightly later time than data from the start, assuming the kernel copies in order).
IDK if write(2) returns slower or faster with direct IO (zero-copy, bypassing the page cache): open(2) your file with O_DIRECT, then write(2) normally.
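Here's a rough sketch of the locked snapshot write (bss_array and buf_lock are hypothetical names; for the O_DIRECT variant, add O_DIRECT to the open(2) flags and mind the buffer/length alignment requirements):

    #include <pthread.h>
    #include <unistd.h>

    extern unsigned char bss_array[16 * 1024 * 1024];
    extern pthread_mutex_t buf_lock;

    void append_snapshot(int fd)
    {
        pthread_mutex_lock(&buf_lock);   /* skip if per-word tearing is acceptable */

        unsigned char *p = bss_array;
        size_t left = sizeof bss_array;
        while (left > 0) {               /* write(2) may be partial */
            ssize_t n = write(fd, p, left);
            if (n <= 0)
                break;                   /* real code: check errno */
            p += n;
            left -= n;
        }

        pthread_mutex_unlock(&buf_lock);
    }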
There has to be a copy somewhere in the process, if you want to write a snapshot of the buffer and then keep modifying it. Or else MMU copy-on-write trickery:
Zero-copy append snapshots
There is an API for doing zero-copy writes of user pages to disk files. Linux's vmsplice(2) and splice(2) in that order will let you tell the kernel to map your pages into the page cache. Without SPLICE_F_GIFT, I assume it sets them up as copy-on-write. (oops, actually the man page says without SPLICE_F_GIFT, the following splice(2) will have to copy. So IDK if there is a mechanism to get copy-on-write semantics.)
Assuming there was a way to get copy-on-write semantics for your pages, until the kernel was done writing them to disk and could release them:
Further writes might need the kernel to memcpy one or two pages before the data hit disk, but save copying the whole buffer. The soft page faults and page-table manipulation overhead might not be worth it anyway, unless your data access pattern is very spatially-localized over the short periods of time until the write hits disk and the to-be-written pages can be released. (I think an API that works this way doesn't exist, because there's no mechanism for getting the pages released right after they hit disk. Linux wants to take them over and keep them in the page cache.)
I haven't ever used vmsplice, so I might be getting some details wrong.
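For reference, the plumbing would look something like this untested sketch; a real version must loop, since a pipe only holds about 64KiB at a time, and whether it actually avoids the copy is exactly the open question above:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Move a buffer into filefd at *off via a transient pipe. */
    int splice_buffer_to_file(int filefd, loff_t *off, void *buf, size_t len)
    {
        int p[2];
        if (pipe(p) < 0)
            return -1;

        struct iovec iov = { .iov_base = buf, .iov_len = len };
        ssize_t in = vmsplice(p[1], &iov, 1, 0);    /* map user pages into the pipe */

        /* ...then move them from the pipe into the file's page cache. */
        ssize_t out = in > 0
            ? splice(p[0], NULL, filefd, off, (size_t)in, SPLICE_F_MOVE)
            : -1;

        close(p[0]);
        close(p[1]);
        return (in < 0 || out < 0) ? -1 : 0;
    }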
If there's a way to create a new copy-on-write mapping of the same memory, maybe by mmapping a new mapping of a scratch file (on a tmpfs filesystem, prob. /dev/shm), that would get you snapshots without holding the lock for long. Then you can just pass the snapshot to write(2), and unmap it ASAP before too many copy-on-write page faults happen.
New buffer for every chunk
If it's ok to start with a zeroed buffer after every write, you could mmap(2) successive chunks of the file, so the data you generate is always already in the right place.
(optional) fallocate(2) some space in your output file, to prevent fragmentation if your write pattern isn't sequential.
mmap(2) your buffer to the first 16MiB of your output file.
run normally
When you want to move on to the next 16MiB:
take a lock to prevent other threads from using the buffer
munmap(2) your buffer
mmap(2) the next 16MiB of the file to the same address, so you don't need to pass the new address around to writers. These pages will be pre-zeroed, as required by POSIX (can't have the kernel exposing memory).
release the lock
Possibly mmap(buf, 16MiB, ... MAP_FIXED, fd, new_offset) could replace the munmap / mmap pair (sketched below). MAP_FIXED discards old mappings that it overlaps. I assume this doesn't mean that modifications to the file / shared memory are discarded, but rather that the actual mapping changes, even without an munmap.
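A sketch of that remap step (error checks elided; fd is the output file, and buf keeps its address so writers need no update):

    #include <sys/mman.h>
    #include <unistd.h>

    #define CHUNK (16 * 1024 * 1024)

    void *advance_chunk(void *buf, int fd, off_t new_offset)
    {
        /* Grow the file so the new chunk exists before mapping it. */
        ftruncate(fd, new_offset + CHUNK);

        /* MAP_FIXED atomically replaces the old mapping at buf; data already
         * written through the old mapping stays in the file. */
        return mmap(buf, CHUNK, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, new_offset);
    }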
Two clarifications for the Append snapshots case from Peter's answer.
1. Appending without O_DIRECT
As Peter said, if you don't use O_DIRECT, write() will return as soon as the data has been copied to the page cache. If the page cache is full, it will block until some old pages have been flushed to disk.
If you are only appending data without reading it (soon), you can benefit from periodically calling sync_file_range(2) to schedule writeback of previously written pages, and posix_fadvise(2) with the POSIX_FADV_DONTNEED flag to drop the already-flushed pages from the page cache. This can significantly reduce the chance that write() blocks.
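A sketch of that pattern, assuming the caller tracks the byte range it just finished writing:

    #define _GNU_SOURCE
    #include <fcntl.h>

    void flush_and_drop(int fd, off_t start, off_t len)
    {
        /* Start writeback for the range and wait so the pages are clean... */
        sync_file_range(fd, start, len,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER);

        /* ...because only clean pages can actually be dropped from the cache. */
        posix_fadvise(fd, start, len, POSIX_FADV_DONTNEED);
    }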
2. Appending with O_DIRECT
With O_DIRECT, write() normally blocks until the data has been sent to the disk (although this is not strictly guaranteed). Since this is slow, be prepared to implement your own I/O scheduling if you need non-blocking writes.
The benefits you can achieve are more predictable behaviour (you control when you will block) and probably reduced memory and CPU usage, thanks to the collaboration between your application and the kernel.

mmap(): resetting old memory to a zero'd non-resident state

I'm writing a memory allocation routine, and it's currently running smoothly. I get my memory from the OS with mmap() in 4096-byte pages. When I start my memory allocator I allocate 1gig of virtual address space with mmap(), and then as allocations are made I divide it up into hunks according to the specifics of my allocation algorithm.
I feel safe allocating as much as a 1gig of memory on a whim because I know mmap() doesn't actually put pages into physical memory until I actually write to them.
Now, the program using my allocator might have a spurt where it needs a lot of memory, in which case the OS would eventually have to put a whole gig's worth of pages into physical RAM. The trouble is that the program might then go into a dormant period where it frees most of that gig and uses only minimal amounts of memory. Yet all I really do inside my allocator's MyFree() function is flip a few bits of bookkeeping data to mark the previously used gig as free, and I know this doesn't cause the OS to remove those pages from physical memory.
I can't use something like munmap() to fix this problem, because the nature of the allocation algorithm is such that it requires a continuous region of memory without any holes in it. Basically I need a way to tell the OS "Listen, you can take these pages out of physical memory and clear them to 0, but please remap them on the fly when I need them again, as if they were freshly mmap()'d"
What would be the best way to go about this?
Actually, after writing this all up I just realized that I can probably do an munmap() followed immediately by a fresh mmap(). Would that be the correct way to go about it? I get the sense that there's probably some more efficient way to do this.
You are looking for madvise(addr, length, MADV_DONTNEED). From the manpage:
MADV_DONTNEED: Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
Note especially the language about how subsequent accesses will succeed but revert to zero-fill-on-demand (for mappings without an underlying file).
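So MyFree() (the name from the question) only needs something like this minimal sketch:

    #include <sys/mman.h>

    /* Give the physical pages back but keep the virtual range mapped.
     * addr and length must be page-aligned. */
    void release_physical(void *addr, size_t length)
    {
        /* The next touch of these pages gets fresh zero-filled memory,
         * as if they were freshly mmap()'d. */
        madvise(addr, length, MADV_DONTNEED);
    }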
Your thinking-out-loud alternative of an munmap followed immediately by another mmap will also work, but it risks kernel-side inefficiencies because the kernel is no longer tracking the allocation as a single contiguous region; if there are many such unmap-and-remap events, the kernel-side data structures might wind up quite bloated.
By the way, with this kind of allocator it's very important that you use MAP_NORESERVE for the initial allocation, and then touch each page as you allocate it, and trap any resulting SIGSEGV and fail the allocation. (And you'll need to document that your allocator installs a handler for SIGSEGV.) If you don't do this your application will not work on systems that have disabled memory overcommit. See the mmap manpage for more detail.
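A sketch of the reserve-then-probe idea; the sigsetjmp trick is one common way to trap the fault and fail the allocation, and details like thread-safety are elided:

    #define _GNU_SOURCE
    #include <setjmp.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/mman.h>

    static sigjmp_buf probe_env;

    static void segv_handler(int sig)
    {
        (void)sig;
        siglongjmp(probe_env, 1);
    }

    /* Touch one page; returns 0 if it was committed, -1 if overcommit refused. */
    static int probe_page(volatile char *page)
    {
        struct sigaction sa, old;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = segv_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, &old);

        int rc = 0;
        if (sigsetjmp(probe_env, 1) == 0)
            *page = 0;        /* first write commits the page... */
        else
            rc = -1;          /* ...or faults if no memory is available */

        sigaction(SIGSEGV, &old, NULL);
        return rc;
    }

    void *reserve_arena(size_t len)
    {
        /* Reserve address space only; no swap/commit charge up front. */
        return mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    }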

Linux async (io_submit) write v/s normal (buffered) write

Since writes are immediate anyway (copy to kernel buffer and return), what's the advantage of using io_submit for writes?
In fact, it (aio/io_submit) seems worse since you have to allocate the write buffers on the heap and can't use stack-based buffers.
My question is only about writes, not reads.
EDIT: I am talking about relatively small writes (few KB at most), not MB or GB, so buffer copy should not be a big problem.
Copying a buffer into the kernel is not necessarily instantaneous.
First the kernel needs to find a free page. If there is none (which is fairly likely under heavy disk-write pressure), it has to decide to evict one. If it decides to evict a dirty page (instead of evicting your process for instance), it will have to actually write it before it can use that page.
There's a related issue in Linux when saturating writes to a slow drive: the page cache fills up with dirty pages backed by that drive. Whenever the kernel needs a page, for any reason, it can take a long time to acquire one, and the whole system freezes as a result.
The size of each individual write is less relevant than the write pressure of the system. If you have a million small writes already queued up, this may be the one that has to block.
Whether the allocation lives on the stack or the heap also matters less. If you want efficient allocation of blocks to write, you can use a dedicated pool allocator (from the heap) and not pay for the general-purpose heap allocator.
aio_write() gets around this by not copying the buffer into the kernel at all; it may even be DMA'd straight out of your buffer (given the alignment requirements), which means you're likely to save a copy as well.
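For contrast, a minimal sketch of the io_submit path using libaio (link with -laio; file name is a placeholder, and the 4096-byte alignment is required by O_DIRECT):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        io_context_t ctx = 0;
        int rc = io_setup(8, &ctx);              /* returns -errno on failure */
        if (rc < 0) { fprintf(stderr, "io_setup: %s\n", strerror(-rc)); return 1; }

        int fd = open("out.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT needs sector-aligned buffer, offset, and length. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096)) return 1;
        memset(buf, 'x', 4096);

        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        io_prep_pwrite(&cb, fd, buf, 4096, 0);
        if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }

        /* The write proceeds (possibly by DMA straight from buf) while we
         * could do other work; here we just wait for completion. */
        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);
        printf("write result: %lld\n", (long long)ev.res);

        io_destroy(ctx);
        return 0;
    }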

munmap performance on Linux

I have a multi-threaded application on RHEL 5.8 which reads large files (about 500MB each) via mmap and does some processing on them; one thread does the mmap and the other threads do the processing. When a file is no longer on the filesystem, munmap is performed to free the memory.
My problem is that munmap (and sometimes close on the file) slows down the other threads, which are operating on different memory, so I am wondering if there is a better way to implement this. I have 2 ideas: split the memory into smaller chunks and munmap the smaller blocks (is this even possible?), or not use munmap at all, allocate / deallocate the memory myself, and optionally cache the memory blocks if the file is no longer on the filesystem and reuse them for the next file.
Thanks for any ideas.
The actual reason it gets slow is that munmap() takes the mm->mmap_sem lock for the entire duration of the syscall. Several other operations are liable to be blocked by this, for example (but not limited to) fork()/mmap(). This is especially important to note for architectures that do not implement a lockless get_user_pages_fast() operation for pages already in-memory, because a bunch of futex operations (that underpin pthread primitives) will call get_user_pages_fast() and the default implementation will try to take a read lock on mmap_sem.
If you're reading the memory sequentially, try regularly calling posix_madvise() with POSIX_MADV_DONTNEED on the pages you've already read; under Linux the same advice is also available via madvise(2) with MADV_DONTNEED.
When a file is no longer on the filesystem, munmap is performed
So you call munmap when the file is unlinked from the filesystem. In that case, what is probably slowing the system down is the actual deletion of the inode, which happens once all the directory entries, file descriptors and memory maps are released.
There are known issues with the performance of deletes in some Linux filesystems (ext3). If that is the case, you could try switching to ext4 (with extents), if that is feasible in your scenario.
Another option would be to hard link the files into another directory, so they are not really deleted when you munmap them. Then you could call ionice -c 3 rm <last-link> or similar to actually delete them in the background...
What I ended up doing (and it proved sufficient) was to munmap the big memory block in pieces; e.g. for a 500MB block I performed munmap in 100MB chunks, as in the sketch below.
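In code that is just (chunk must be a multiple of the page size):

    #include <sys/mman.h>

    /* Release a large mapping in pieces so no single munmap call holds
     * mmap_sem for the whole region. */
    void munmap_chunked(char *base, size_t total, size_t chunk)
    {
        for (size_t off = 0; off < total; off += chunk) {
            size_t n = (total - off < chunk) ? total - off : chunk;
            munmap(base + off, n);
        }
    }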

Is fork() copy-on-write a stable exposed behavior that can be used to implement read-only shared memory?

The man page on fork() states that it does not copy data pages; it maps them into the child process and marks them copy-on-write. Is that behavior:
consistent between flavors of Linux?
considered an implementation detail and therefore likely to change?
I'm wondering if I can use fork() as a means to get a shared read-only memory block on the cheap. If the memory is physically copied, it would be rather expensive - there's a lot of forking going on, and the data area is big enough - but I'm hoping not...
Linux running on machines without a MMU (memory management unit) will copy all process memory on fork().
However, those systems are usually very small and embedded and you probably don't have to worry about them.
Many services, such as Apache with its fork model, use this initialize-then-fork() method to share initialized data structures.
You should be aware that if you are using languages like Perl and Python that use reference-counted variables, or C++ shared_ptr's, this model will not work. It will not work because as the reference counts are adjusted up and down, the memory becomes unshared and gets copied.
This causes huge amounts of memory usage in Perl daemons like SpamAssassin that attempt to use an initialize and fork model.
Yes, you can certainly rely on it on MMU-Linux kernels, which covers almost every system.
However, the page size isn't the same everywhere.
It is possible to explicitly make a shared memory area for forked process, by using mmap() to create an anonymous map - one which is not backed by a physical file. On fork, this area will always remain shared (provided the child doesn't unmap it, or map something else in at the same address). You can mprotect it to be readonly if you want.
Memory allocated with (for example) malloc can easily end up sharing a page with something that isn't readonly, which means it gets copied anyway when another structure is modified. This includes internal structures used by the malloc implementation. So you might want to mmap a specific area for this purpose and allocate from that.
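A minimal sketch of that approach:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 4096;
        /* Anonymous shared mapping: stays shared across fork, no CoW copies. */
        char *shared = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        strcpy(shared, "initialized before fork");
        mprotect(shared, len, PROT_READ);        /* freeze it read-only */

        if (fork() == 0) {
            printf("child sees: %s\n", shared);  /* same physical pages */
            _exit(0);
        }
        wait(NULL);
        munmap(shared, len);
        return 0;
    }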
Can you rely on the fact that all Linux flavors do it this way? No. But you can rely on the fact that any flavor that doesn't uses an even faster method.
Therefore you should use the feature and rely on it and revisit your decision if you get a performance problem.
The success of this approach depends on how well you stick to your self-imposed "read-only" limitation. Both parent and child have to obey this stricture, else the memory gets copied.
This may not be the catastrophe you're envisioning, however. The kernel can copy as little as a single page (typically 4 KB) to implement CoW semantics. A typical Linux server will use something more complex, some sort of slab allocator, so the copied region could be much larger.
The main point is that this is decoupled from your program's conception of its memory use. If you malloc() 1 GB of RAM, fork off a child, and the child changes just the first byte of that memory block, the entire 1 GB block isn't copied. Perhaps as little as one page is copied, up to the slab size containing that first byte.
Yes
All the Linux distros use the same kernel, albeit with slightly different versions and releases of it.
It's unlikely that another underlying fork(2) implementation will be faster any time soon, so it's a safe bet that copy-on-write will continue to be the mechanism. Perhaps it won't be forever, but for years, definitely.
Certainly some major software systems (for example, Phusion Passenger) use fork(2) in the same way that you want to, so you would not be the only one taking advantage of CoW.
