munmap performance on Linux - linux

I have a multi-threaded application on RHEL 5.8 which reads large files (about 500 MB each) via mmap and does some processing on them; one thread does the mmap and other threads do the processing. When the file is no longer on the filesystem, munmap is performed to free the memory.
My problem is that munmap (and sometimes close on the file) slows down the other threads, even though they operate on different memory, so I am wondering if there is a better way to implement this. I have two ideas: split the mapping into smaller chunks so I can munmap smaller blocks (is this even possible?), or not use munmap at all and allocate/deallocate the memory myself, optionally caching the blocks if the file is no longer on the filesystem and reusing them for the next file.
Thanks for any ideas.

The actual reason it gets slow is that munmap() takes the mm->mmap_sem lock for the entire duration of the syscall. Several other operations are liable to be blocked by this, for example (but not limited to) fork()/mmap(). This is especially important to note for architectures that do not implement a lockless get_user_pages_fast() operation for pages already in-memory, because a bunch of futex operations (that underpin pthread primitives) will call get_user_pages_fast() and the default implementation will try to take a read lock on mmap_sem.

If you're reading the memory sequentially, regularly call posix_madvise() with POSIX_MADV_DONTNEED on the pages you have already read. See posix_madvise(3).
The same hint is also available under Linux as madvise(2) with MADV_DONTNEED. See madvise(2).

When the file is no longer on filesystem, munmap is performed
So you call munmap when the file is unlinked from the filesystem. Then, probably, what is slowing down the system is the actual deletion of the inode, which happens once all the directory entries, file descriptors and memory maps are released.
There are known issues with delete performance in some Linux filesystems (notably ext3). If that is the case, you could try switching to ext4 (with extents), if that is feasible in your scenario.
Another option would be to hard-link the files into another directory, so they are not really deleted when you unlink them. Then you could run ionice -c 3 rm <last-link> or similar to actually delete them in the background...

What I ended up doing (and it proved sufficient) was to munmap the big memory block in pieces, e.g. I had a 500 MB block and performed munmap in 100 MB chunks.

Related

Hey could someone help me understand sync syscall usage?

Like the title says, I don't really understand the usage of this syscall. I was writing a program that writes some data to a file, and the tutorial I've seen told me to use the sys_sync syscall. But my problem is: why and when should we use this? Isn't the data already written to the file?
The manual says:
sync - Synchronize cached writes to persistent storage
So it is written to the file cache in memory, not on disk.
You rarely have to use sync unless you are writing really important data and need to make sure that data is on disk before you go on. One example of systems that use sync a lot are databases (such as MySQL or PostgreSQL).
So in other words, it is theoretically in your file, just not on disk, and therefore if you lose power you could lose the data. Especially if you have a lot of RAM and many writes in a row, the kernel may keep those writes in cache for a long while, increasing the risk of data loss.
But how can a file be not on the disk? I understand the concept of a cache, but if I wrote to the disk, why would it be in a different place?
First, when you write to a file, you send the data to the kernel; you don't send it directly to the disk. A kernel driver is then responsible for writing the data to disk. In my days on Apple II and Amiga computers, I would actually read/write the disk directly. At least the Amiga had DMA, so you could set up a buffer, tell the disk I/O to do a read or a write, and it would send you an interrupt when done. On the Apple II, you had to write loops in assembly language with precise timings to read/write data on floppy disks... A different era!
You could, of course, still access the disk directly (but with a kernel like Linux, you'd have to convince the kernel to get out of the way first...).
Cache is primarily used for speed. It is very slow to write to disk (as far as a human is concerned, it looks extremely fast, but compared to how much data the CPU can push to the drive, it's still slow).
So what happens is that the kernel has a task to write data to disk. That task wakes up as soon as data appears in the cache and ends once all the caches are transferred to disk. This task works in parallel. You can have one such task per drive (which is especially useful when you have a system such as RAID 1).
If your application fills up the cache, then a further write will block until some of the cache can be replaced.
and the tutorial I've seen told me to use sys_sync syscall
Well that sounds silly, unless you're doing filesystem write benchmarking or something.
If you have one really critical file that you want to make sure is "durable" wrt. power outages before you do something else (like sending a network packet to acknowledge a complete transfer), use fsync(fd) to sync just that one file's data and metadata.
(In asm, call number SYS_fsync from sys/syscall.h, with the file descriptor as the first register arg.)
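In C, that looks like this sketch (the path and payload are placeholders):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write one file and block until it is durable on stable storage,
 * using fsync(2) rather than a whole-system sync(2). */
int write_durably(const char *path, const char *data)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    size_t len = strlen(data);
    if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }
    /* Flush this file's data and metadata to disk before proceeding. */
    if (fsync(fd) != 0) { close(fd); return -1; }
    return close(fd);
}
```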
But my problem is why and when should we use this?
Generally never use the sync system call in programs you're writing.
There are interactive use-cases where you'd normally use the wrapper command of the same name, sync(1). e.g. with removable media, to get the kernel started doing write-back now, so unmount will take less time once you finish typing it. Or for some benchmarking use-cases.
The system shutdown scripts may run sync after unmounting filesystems (and remounting / read-only), before making a reboot(2) system call.
Re: why sync(2) exists
No, your data isn't already on disk right after echo foo > bar.txt.
Most OSes, including Linux, do write-back caching, not write-through, for file writes.
You don't want write() system calls to wait for an actual magnetic disk when there's free RAM, because the traditional way to do I/O is synchronous so simple single-threaded programs wouldn't be able to do anything else (like reading more data or computing anything) while waiting for write() to return. Blocking for ~10 ms on every write system call would be disastrous; that's as long as a whole scheduler timeslice. (It would still be bad even with SSDs, but of course OSes were designed before SSDs were a thing.) Even just queueing up the DMA would be slow, especially for small file writes that aren't a whole number of aligned sectors, so even letting the disk's own write-back write caching work wouldn't be good enough.
Therefore, file writes do create "dirty" pages of kernel buffers that haven't yet been sent to the disk. Sometimes we can even avoid the IO entirely, e.g. for tmp files that get deleted before anything triggers write-back. On Linux, dirty_writeback_centisecs defaults to 1500 (15 seconds) before the kernel starts write-back, unless it's running low on free pages. (Heuristics for what "low" means use other tunable values).
If you really want writes to flush to disk immediately and wait for data to be on disk, mount with -o sync. Or for one program, have it use open(O_SYNC) or O_DSYNC (for just the data, not metadata like timestamps).
See Are file reads served from dirtied pages in the page cache?
There are other advantages to write-back, including delayed allocation even at the filesystem level. The FS can wait until it knows how big the file will be before even deciding where to put it, allowing better decisions that reduce fragmentation. e.g. a small file can go into a gap that would have been a bad place to start a potentially-large file. (It just has to reserve space to make sure it can put it somewhere.) XFS was one of the first filesystems to do "lazy" delayed allocation, and ext4 has also had the feature for a while.
https://en.wikipedia.org/wiki/XFS#Delayed_allocation
https://en.wikipedia.org/wiki/Allocate-on-flush
https://lwn.net/Articles/323169/

Dump global datas to disk in assembly code

The experiment is on Linux, x86 32-bit.
So suppose that in my assembly program I need to periodically (for instance, every time after executing 100000 basic blocks) dump an array in the .bss section from memory to disk. The starting address and size of the array are fixed. The array records the executed basic blocks' addresses; its size is 16 MB right now.
I tried to write some native code to memcpy from the .bss section to the stack and then write it back to disk. But that seems very tedious, and I am worried about the performance and memory consumption of, say, allocating a very large buffer on the stack every time...
So here is my question, how can I dump the memory from global data sections in an efficient way? Am I clear enough?
First of all, don't write this part of your code in asm, esp. not at first. Write a C function to handle this part, and call it from asm. If you need to perf-tune the part that only runs when it's time to dump another 16MiB, you can hand-tune it then. System-level programming is all about checking error returns from system calls (or C stdio functions), and doing that in asm would be painful.
Obviously you can write anything in asm, since making system calls isn't anything special compared to C. And there's no part of any of this that's easier in asm compared to C, except for maybe throwing in an MFENCE around the locking.
Anyway, I've addressed three variations on what exactly you want to happen with your buffer:
Overwrite the same buffer in place (mmap(2) / msync(2))
Append a snapshot of the buffer to a file (with either write(2) or probably-not-working zero-copy vmsplice(2) + splice(2) idea.)
Start a new (zeroed) buffer after writing the old one. mmap(2) sequential chunks of your output file.
In-place overwrites
If you just want to overwrite the same area of disk every time, mmap(2) a file and use that as your array. (Call msync(2) periodically to force the data to disk.) The mmapped method won't guarantee a consistent state for the file, though. Writes can get flushed to disk other than on request. IDK if there's a way to avoid that with any kind of guarantee (i.e. not just choosing buffer-flush timers and so on so your pages usually don't get written except by msync(2).)
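A sketch of the in-place scheme (path, size, and error handling are illustrative):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Use a file-backed shared mapping as the array itself; on success
 * returns the mapping and stores the fd in *fd_out. */
void *map_array(const char *path, size_t size, int *fd_out)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return MAP_FAILED;
    if (ftruncate(fd, size) != 0) { close(fd); return MAP_FAILED; }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return MAP_FAILED; }
    *fd_out = fd;
    return p;
}

/* Force the current contents to disk.  MS_SYNC blocks until the pages
 * are written; MS_ASYNC would merely schedule the write-back. */
int checkpoint(void *p, size_t size)
{
    return msync(p, size, MS_SYNC);
}
```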
Append snapshots
The simple way to append a buffer to a file is just to call write(2) when you want it written. write(2) does everything you need. If your program is multi-threaded, you might need to take a lock on the data before the system call and release the lock afterwards. I'm not sure how fast the write system call returns; it may only return after the kernel has copied your data to the page cache.
If you just need a snapshot, but all writes into the buffer are atomic transactions (i.e. the buffer is always in a consistent state, rather than pairs of values that need to be consistent with each other), then you don't need to take a lock before calling write(2). There will be a tiny amount of bias in this case (data at the end of the buffer will be from a slightly later time than data from the start, assuming the kernel copies in order).
IDK if write(2) returns slower or faster with direct IO (zero-copy, bypassing the page cache): open(2) your file with O_DIRECT, then write(2) normally.
There has to be a copy somewhere in the process, if you want to write a snapshot of the buffer and then keep modifying it. Or else MMU copy-on-write trickery:
Zero-copy append snapshots
There is an API for doing zero-copy writes of user pages to disk files. Linux's vmsplice(2) and splice(2) in that order will let you tell the kernel to map your pages into the page cache. Without SPLICE_F_GIFT, I assume it sets them up as copy-on-write. (oops, actually the man page says without SPLICE_F_GIFT, the following splice(2) will have to copy. So IDK if there is a mechanism to get copy-on-write semantics.)
Assuming there was a way to get copy-on-write semantics for your pages, until the kernel was done writing them to disk and could release them:
Further writes might need the kernel to memcpy one or two pages before the data hit disk, but save copying the whole buffer. The soft page faults and page-table manipulation overhead might not be worth it anyway, unless your data access pattern is very spatially-localized over the short periods of time until the write hits disk and the to-be-written pages can be released. (I think an API that works this way doesn't exist, because there's no mechanism for getting the pages released right after they hit disk. Linux wants to take them over and keep them in the page cache.)
I haven't ever used vmsplice, so I might be getting some details wrong.
If there's a way to create a new copy-on-write mapping of the same memory, maybe by mmaping a new mapping of a scratch file (on a tmpfs filesystem, prob. /dev/shm), that would get you snapshots without holding the lock for long. Then you can just pass the snapshot to write(2), and unmap it ASAP before too many copy-on-write page faults happen.
New buffer for every chunk
If it's ok to start with a zeroed buffer after every write, you could mmap(2) successive chunks of the file, so the data you generate is always already in the right place.
(optional) fallocate(2) some space in your output file, to prevent fragmentation if your write pattern isn't sequential.
mmap(2) your buffer to the first 16MiB of your output file.
run normally
When you want to move on to the next 16MiB:
take a lock to prevent other threads from using the buffer
munmap(2) your buffer
mmap(2) the next 16MiB of the file to the same address, so you don't need to pass the new address around to writers. These pages will be pre-zeroed, as required by POSIX (can't have the kernel exposing memory).
release the lock
Possibly mmap(buf, 16MiB, ... MAP_FIXED, fd, new_offset) could replace the munmap / mmap pair. MAP_FIXED discards old mappings that it overlaps. I assume this doesn't mean that modifications to the file / shared memory are discarded, but rather that the actual mapping changes, even without an munmap.
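The whole rotation can be sketched as follows (locking is omitted; the helper name, file growth via ftruncate, and the MAP_FIXED replacement trick are illustrative assumptions):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define WINDOW (16u * 1024 * 1024)   /* 16 MiB buffer/window */

/* Map the 16 MiB window at file offset `off`, replacing whatever mapping
 * was at *bufp (pass *bufp == NULL on the first call).  After the first
 * call the buffer address never changes, so writers need no update. */
int advance_window(int fd, char **bufp, off_t off)
{
    if (ftruncate(fd, off + WINDOW) != 0)   /* grow the file; new space
                                               reads back as zeros */
        return -1;
    void *hint  = *bufp;                    /* NULL on first call */
    int   flags = MAP_SHARED | (hint ? MAP_FIXED : 0);
    void *p = mmap(hint, WINDOW, PROT_READ | PROT_WRITE, flags, fd, off);
    if (p == MAP_FAILED)
        return -1;
    *bufp = p;                              /* same address on later calls */
    return 0;
}
```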
Two clarifications for Append snapshots case from Peter's answer.
1. Appending without O_DIRECT
As Peter said, if you don't use O_DIRECT, write() will return as soon as the data has been copied to the page cache. If the page cache is full, it will block until some old page has been flushed to disk.
If you are only appending data without reading it (soon), you can benefit from periodically calling sync_file_range(2) to schedule flushing of previously written pages, and posix_fadvise(2) with the POSIX_FADV_DONTNEED flag to remove already-flushed pages from the page cache. This could significantly reduce the possibility that write() would block.
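A sketch of that pattern (the function name is mine, and the range bookkeeping is deliberately simplistic; a real implementation would track which ranges have actually finished flushing before evicting them):

```c
#define _GNU_SOURCE        /* for sync_file_range(2) */
#include <fcntl.h>
#include <unistd.h>

/* Append `len` bytes, kick off asynchronous write-back for them, and
 * drop pages appended on earlier calls from the page cache.  `*done`
 * tracks how many bytes were appended before this call. */
int append_and_trim(int fd, const void *buf, size_t len, off_t *done)
{
    ssize_t n = write(fd, buf, len);
    if (n < 0)
        return -1;

    /* Start write-back of what we just appended, without waiting. */
    (void)sync_file_range(fd, *done, n, SYNC_FILE_RANGE_WRITE);
    /* Evict pages from earlier appends; by now they are likely flushed. */
    if (*done > 0)
        (void)posix_fadvise(fd, 0, *done, POSIX_FADV_DONTNEED);
    *done += n;
    return 0;
}
```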
2. Appending with O_DIRECT
With O_DIRECT, write() normally blocks until the data has been sent to disk (although that's not strictly guaranteed, see here). Since this is slow, be prepared to implement your own I/O scheduling if you need non-blocking writes.
The benefits you could achieve are more predictable behaviour (you control when you will block) and probably reduced memory and CPU usage through cooperation between your application and the kernel.

Linux async (io_submit) write v/s normal (buffered) write

Since writes are immediate anyway (copy to kernel buffer and return), what's the advantage of using io_submit for writes?
In fact, it (aio/io_submit) seems worse since you have to allocate the write buffers on the heap and can't use stack-based buffers.
My question is only about writes, not reads.
EDIT: I am talking about relatively small writes (few KB at most), not MB or GB, so buffer copy should not be a big problem.
Copying a buffer into the kernel is not necessarily instantaneous.
First the kernel needs to find a free page. If there is none (which is fairly likely under heavy disk-write pressure), it has to decide to evict one. If it decides to evict a dirty page (instead of evicting your process for instance), it will have to actually write it before it can use that page.
There's a related issue in Linux when saturating a slow drive with writes: the page cache fills up with dirty pages backed by the slow drive. Then, whenever the kernel needs a page, for any reason, it takes a long time to acquire one, and the whole system freezes as a result.
The size of each individual write is less relevant than the write pressure of the system. If you have a million small writes already queued up, this may be the one that has to block.
Whether the allocation lives on the stack or the heap is also less relevant. If you want efficient allocation of blocks to write, you can use a dedicated pool allocator (from the heap) and not pay for the general-purpose heap allocator.
aio_write() gets around this by not copying the buffer into the kernel at all, it may even be DMAd straight out of your buffer (given the alignment requirements), which means you're likely to save a copy as well.

Does madvise(___, ___, MADV_DONTNEED) instruct the OS to lazily write to disk?

Hypothetically, suppose I want to perform sequential writing to a potentially very large file.
If I mmap() a gigantic region and madvise(MADV_SEQUENTIAL) on that entire region, then I can write to the memory in a relatively efficient manner. This I have gotten to work just fine.
Now, in order to free up various OS resources as I am writing, I occasionally perform a munmap() on small chunks of memory that have already been written to. My concern is that munmap() and msync() will block my thread, waiting for the data to be physically committed to disk. I cannot slow down my writer at all, so I need to find another way.
Would it be better to use madvise(MADV_DONTNEED) on the small, already-written chunk of memory? I want to tell the OS to write that memory to disk lazily, and not to block my calling thread.
The manpage on madvise() has this to say, which is rather ambiguous:
MADV_DONTNEED
Do not expect access in the near future. (For the time being, the
application is finished with the given range, so the kernel can free
resources associated with it.) Subsequent accesses of pages in this
range will succeed, but will result either in re-loading of the memory
contents from the underlying mapped file (see mmap(2)) or
zero-fill-on-demand pages for mappings without an underlying file.
No!
For your own good, stay away from MADV_DONTNEED. Linux will not take this as a hint to throw pages away after writing them back, but to throw them away immediately. This is not considered a bug, but a deliberate decision.
Ironically, the reasoning is that the functionality of a non-destructive MADV_DONTNEED is already provided by msync(MS_INVALIDATE|MS_ASYNC). MS_ASYNC, on the other hand, does not start I/O (in fact, it does nothing at all, on the reasoning that dirty-page writeback works fine anyway); fsync always blocks; and sync_file_range may block if you exceed some obscure limit and is considered "extremely dangerous" by the documentation, whatever that means.
Either way, you must msync(MS_SYNC), or fsync (both blocking), or sync_file_range (possibly blocking) followed by fsync, or you will lose data with MADV_DONTNEED. If you cannot afford to possibly block, you have no choice, sadly, but to do this in another thread.
For recent Linux kernels (just tested on Linux 5.4), MADV_DONTNEED works as expected when the mapping is NOT private, i.e. mmap without the MAP_PRIVATE flag. I'm not sure what the behavior is on previous versions of the Linux kernel.
From release 4.15 of the Linux man-pages project's madvise manpage:
After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are changed: subsequent accesses of pages in the range will succeed, but will result in either repopulating the memory contents from the up-to-date contents of the underlying mapped file (for shared file mappings, shared anonymous mappings, and shmem-based techniques such as System V shared memory segments) or zero-fill-on-demand pages for anonymous private mappings.
Linux 4.5 added a new flag, MADV_FREE, with the same behavior as in BSD systems:
it just marks pages as available to be freed if needed, but does not free them immediately, making it possible to reuse the memory range without incurring the cost of faulting the pages in again.
For why MADV_DONTNEED on a private mapping may result in zero-filled pages on future access, watch Bryan Cantrill's rant, as mentioned in the comments on Damon's answer. Spoiler: it comes from Tru64 UNIX.
As already mentioned, MADV_DONTNEED is not your friend. Since Linux 5.4, you can use MADV_COLD to tell the kernel it should page out that memory when there is memory pressure. This seems to be exactly what is wanted in this situation.
Read more here:
https://lwn.net/Articles/793462/
First, MADV_SEQUENTIAL enables aggressive readahead, which doesn't help a write-only workload, so you don't need it.
Second, the OS will lazily write dirty file-backed memory to disk anyway, even if you do nothing; MADV_DONTNEED just instructs it to free the memory immediately (what you call "various OS resources"). Third, it is not clear that mmapping files for sequential writing has any advantage. You would probably be better served by plain write(2) (but use buffers, either manual or stdio).

Any use of buffering for writing data on linux ext4 filesystem?

I am using ext4 on a Linux 2.6 kernel. I have records in byte arrays, which can range from a few hundred bytes to 16 MB. Is there any benefit in an application using write() for every record, as opposed to buffering, say, X MB and then calling write() on the X MB?
If there is a benefit in buffering, what would be a good value for ext4. This question is for someone who has profiled the behavior of the multiblock allocator in ext4.
My understanding is that the filesystem will buffer in multiples of the page size and attempt to flush them to disk. What happens if the buffer provided to write() is bigger than the filesystem buffer? Is this a crude way to force the filesystem to flush to disk?
The "correct" answer depends on what you really want to do with the data.
write(2) is designed as a single trip into kernel space, and provides good control over I/O. However, unless the file is opened with O_SYNC, the data goes into the kernel's cache only, not to disk. O_SYNC changes that, ensuring the file is synchronized to disk. The actual writing to disk is issued from the kernel cache, and ext4 will try to allocate as big a buffer as possible to write, to minimize fragmentation, iirc. In general, write(2) with either a buffered or an O_SYNC file is a good way to control whether the data goes to the kernel or stays in your application's cache.
However, for writing lots of records, you might be interested in writev(2), which writes data from a list of buffers. Similarly to write(2), it's an atomic call (though of course that's only in OS semantics, not actually on disk, unless, again, direct I/O is used).
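For example, a sketch gathering two records into one system call (the record contents and helper name are placeholders):

```c
#include <string.h>
#include <sys/uio.h>

/* Write two record buffers with one writev(2) call: a single trip into
 * the kernel, and the records land contiguously, atomically with respect
 * to other writers of this fd. */
ssize_t write_records(int fd, const char *rec1, const char *rec2)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)rec1, .iov_len = strlen(rec1) },
        { .iov_base = (void *)rec2, .iov_len = strlen(rec2) },
    };
    return writev(fd, iov, 2);
}
```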
