vm/min_free_kbytes - Why Keep Minimum Reserved Memory? - linux

According to this article:
/proc/sys/vm/min_free_kbytes: This controls the amount of memory that is kept free for use by special reserves including “atomic” allocations (those which cannot wait for reclaim)
My question is that what does it mean by "those which cannot wait for reclaim"? In other words, I would like to understand why there's a need to tell the system to always keep a certain minimum amount of memory free and under what circumstances will this memory be used? [It must be used by something; don't see the need otherwise]
My second question: does setting this memory to something higher than 4MB (on my system) leads to better performance? We have a server which occasionally exhibit very poor shell performance (e.g. ls -l takes 10-15 seconds to execute) when certain processes get going and if setting this number to something higher will lead to better shell performance?

(link is dead, looks like it's now here)
That text is referring to atomic allocations, which are requests for memory that must be satisfied without giving up control (i.e. the current thread can not be suspended). This happens most often in interrupt routines, but it applies to all cases where memory is needed while holding an essential lock. These allocations must be immediate, as you can't afford to wait for the swapper to free up memory.
See Linux-MM for a more thorough explanation, but here is the memory allocation process in short:
_alloc_pages first iterates over each memory zone looking for the first one that contains eligible free pages
_alloc_pages then wakes up the kswapd task [..to..] tap into the reserve memory pools maintained for each zone.
If the memory allocation still does not succeed, _alloc pages will either give up [..] In this process _alloc_pages executes a cond_resched() which may cause a sleep, which is why this branch is forbidden to allocations with GFP_ATOMIC.
min_free_kbytes is unlikely to help much with the described "ls -l takes 10-15 seconds to execute"; that is likely caused by general memory pressure and swapping rather than zone exhaustion. The min_free_kbytes setting only needs to allow enough free pages to handle immediate requests. As soon as normal operation is resumed, the swapper process can be run to rebalance the memory zones. The only time I've had to increase min_free_kbytes is after enabling jumbo frames on a network card that didn't support dma scattering.
To expand on your second question a bit, you will have better results tuning vm.swappiness and the dirty ratios mentioned in the linked article. However, be aware that optimizing for "ls -l" performance may cause other processes to become slower. Never optimize for a non-primary usecase.

All linux systems will attempt to make use of all physical memory available to the system, often through the creation of a filesystem buffer cache, which put simply is an I/O buffer to help improve system performance. Technically this memory is not in use, even though it is allocated for caching.
"wait for reclaim", in your question, refers to the process of reclaiming that cache memory that is "not in use" so that it can be allocated to a process. This is supposed to be transparent but in the real world there are many processes that do not wait for this memory to become available. Java is a good example, especially where a large minimum heap size has been set. The process tries to allocate the memory and if it is not instantly available in one large contiguous (atomic?) chunk, the process dies.
Reserving a certain amount of memory with min_free_kbytes allows this memory to be instantly available and reduces the memory pressure when new processes need to start, run and finish while there is a high memory load and a full buffer cache.
4MB does seem rather low because if the buffer cache is full, any process that wants an immediate allocation of more than 4MB will likely fail. The setting is very tunable and system-specific, but if you have a few GB of memory available it can't hurt to bump up the reserve memory to 128MB. I'm not sure what effect it will have on shell interactivity, but likely positive.

This memory is kept free from use by normal processes. As #Arno mentioned, the special processes that can run include interrupt routines, which must be run now (as it's an interrupt), and finish before any other processes can run (atomic). This can include things like swapping out memory to disk when memory is full.
If the memory is filled an interrupt (memory management) process runs to swap some memory into disk so it can free some memory for use by normal processes. But if vm.min_free_kbytes is too small for it to run, then it locks up the system. This is because this interrupt process must run first to free memory so others can run, but then it's stuck because it doesn't have enough reserved memory vm.min_free_kbytes to do its task resulting in a deadlock.
Also see:
https://www.linbit.com/en/kernel-min_free_kbytes/ and
https://askubuntu.com/questions/41778/computer-freezing-on-almost-full-ram-possibly-disk-cache-problem (where the memory management process has so little memory to work with it takes so long to swap little by little that it feels like a freeze.)

Related

What is memory reclaim in linux

I am very new to Linux memory management. While reading some documents on the topic, I had some basic questions.
Below is my config:
vm.swappiness=10
vm.vfs_cache_pressure=140
vm.min_free_kbytes=2013265
My understanding is, if free memory falls below vm.min_free_kbytes, then the OS will reclaim memory.
Is Memory reclaim a deletion of unwanted files or copying to Swap memory from RAM?
If it's copying to Swap memory from RAM, then if I am not using Swap memory, what will happen?
Is swappiness always greater than the vm.min_free_kbytes?
What is the significance of vm.vfs_cache_pressure?
Memory reclaim is the mechanism of creating more free RAM pages, by throwing somewhere else the data residing in them. It has nothing to do with files. When more RAM is needed, data is dropped from RAM (trashed away, if it can be refetched) or copied to the swap file (so the data will be refetchable).
If there is not a swap file, but some data should be saved to the (non existent) swap area, then an out-of-memory error happens. Typically, this is notified to the process which is trying to get the memory (via alloc() and similar) - the alloc() fails and returns NULL. The process can choose what to do, or even crash. If the memory is needed by the kernel itself (normally quite rare), a PANIC happens and the system locks completely.
swappiness is, in percentage, the tendency of the kernel to use the swap, even if not strictly needed, in order to have plenty of ram ready for memory requests. Simply put, a 100% swappiness means the kernel tries to always swap, a swappiness of 0 means the kernel tries to not do swap (there are some special values however). min_free_kbytes indicates real kilobytes, it is not a percentage, and it is the minimum amount that should always be free in order to let the kernel to work well. Even starting a memory reclaim could require some more ram to do the job: it would be catastrophic if, to get some memory, you need just a little memory but you don't have it! :-)
vfs_cache_pressure is again a percentage. It indicates how much the kernel tries to get rid of (memory) cache used for the file system (vfs=virtual file system). The cache for the filesystem is quite a good candidate to throw away, because it keeps information easily readable from the disk. Unfortunately, if the computer needs frequently to use the file system, it has to read, and read again, and read again always the same data. Caching is a big performance boost. Of course, if a system does little disk I/O, then this cache is the best candidate to throw away when memory hungry.
All this things are succintly explained here: https://www.kernel.org/doc/Documentation/sysctl/vm.txt

How can I shrink the Linux page cache from within kernel space?

I'm working on a system that involves some custom hardware and a custom Linux device driver I wrote for the hardware. The system occasionally needs to move large amounts of data very rapidly and therefore my driver dynamically (i.e. when needed) allocates large (1 GB) DMA buffers which are used and then freed when they are no longer needed. To allocate such large buffers I actually allocate a bunch of smaller buffers (256 X 4MB) using dma_alloc_coherent and then map them contiguously into user space using remap_pfn_range. This works very well most of the time.
During testing, after the system has been running test cases for a long time, I sometimes see DMA allocation failures where one of the dma_alloc_coherent calls in my driver fails which causes my application layer software to crash. I was finally able to track down this problem and I discovered that when I see DMA allocation failures the Linux kernel page cache is very full.
For example, on the last failure that I captured the page cache filled 27 GB of the 32 GB of RAM on my system. I suspected that the page cache "fullness" was causing dma_alloc_coherent calls to fail. To test this theory I manually emptied the page cache using:
# echo 1 > /proc/sys/vm/drop_caches
This dropped the size of the cache from 27 GB to 94 MB and I was able to allocate 20+ 1 GB DMA buffers with no issues.
Clearly the page cache is a beneficial thing so I would prefer not to have to completely empty it every time I run out of space when allocating DMA buffers. My questions is this: how can I dynamically shrink the page cache in kernel space such that if a call to dma_alloc_coherent fails I can recover just enough space so that I can retry the call and have it succeed?
My system is x86_64 based running a 3.16.x Linux kernel.
I have found some vague references that suggest what I'm attempting may be possible, for example "These objects are automatically
reclaimed by the kernel when memory is needed elsewhere on the system." (from: https://www.kernel.org/doc/Documentation/sysctl/vm.txt). But I have not yet found any specifics that indicate how the memory is reclaimed.
Any assistance with this would be greatly appreciated!
TL;DR : Scan for active superblocks and drop references to non-dirty ones until you have reclaimed as much system memory as you need. (or you finally run out of references to active superblocks.)
How to write kernel code to dynamically shrink the fs page-cache,
to recover just enough space so that a subsequent call to dma_alloc_coherent() succeeds?
To answer this question, let us take a look at what the "drop_caches operation" did to reduce the fs page-cache from 27GB to 94MB on your system.
echo 1 > /proc/sys/vm/drop_caches
invokes
drop_caches_sysctl_handler()
which in turn invokes iterate_supers() and
passes it the pointer to the function drop_pagecache_sb().
What happens next is that iterate_supers() scans for active superblocks and everytime it finds one, it calls drop_pagecache_sb(), passing it a reference to the active superblock.
This iterative procedure continues until references to all the active superblocks are freed from the fs page-cache. This is a non-destructive operation and will only free blocks that are completely unused. Dirty-objects will continue to be in use until written out to disk and are not free-able. If you run sync first to flush them out to disk, the "drop_caches operation" tends to free more memory.
Since you are interested in running this process to reclaim a limited/known amount of memory i.e. what is soon going to be requested using dma_alloc_coherent(), you simply need to implement the above functionality with an additional check at the end of each iteration and abort the superblock scan immediately once the amount of free system memory crosses the desired level.
A couple of points to keep in mind to further optimise this procedure :
Is there a preference for certain block devices over others?
You may want to iterate over active superblocks of the block devices that you do not care about first. If enough memory is not reclaimed, then scan the block devices that you would prefer to retain in the fs page-cache unless absolutely necessary to reclaim required memory. get_active_super() might be of help here.
iterate_supers_type() seems interesting
It allows one to iterate over superblocks of specific file_system_type
Please note that this is a speculative solution based purely on the analysis of existing code within the Linux kernel that you have observed to already solve your problem. Once the above approach is implemented, it will only allow you to control the same i.e. attempt to reclaim fs page-cache memory only to the extent required for your immediate needs.
Technically when certain allocation fails then Kernel will try to free memory.Depending upon memory failures(soft failure/hard failure). Hard failures causes Kernel to enter into direct reclaim path. Direct reclaim is costly operation which might take undefined time to complete and even after that allocation might fail.
Here you have two options:
1) Play with VM settings like dirty_ratio,dirty_background_ratio etc to maintain free ram. see : https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-tunables.html
2) Write a kernel daemon, which calls kernel function which handles drop_cache (because drop_cache migh sleep).

How to determine the real memory usage of a single process?

How can I calculate the real memory usage of a single process? I am not talking about the virtual memory, because it just keeps growing. For instance, there are proc files like smaps, where you can get the mappings of a process. But this is virtual memory and the values of that file just keeps growing for running process. But I would like to reflect the real memory usage of a process. E.g. if you plot the memory usage of a process it should represent the allocations of memory and also the freeing of memory. So the plot should be like an up and down movement instead of a linear function, that just keeps growing for a running process.
So, how could I calculate the real memory usage? I would appreciate any helpful answer.
It's actually kind of a complicated question. The two most common metrics for a program's memory usage at the OS level are virtual size and resident set size. (These show in the output of ps -u as the VSZ and RSS columns.) Roughly speaking, these tell the total memory the program has assigned to it, versus how much it is currently actively using.
Further complicating the question is that when you use malloc (or the C++ new operator) to allocate memory, memory is allocated from a pool in your process which is built by occasionally requesting an allocation of memory from the operating system. But when you free memory, the memory goes back into this pool, but it is typically not returned to the OS. So as your program allocates and frees memory, you typically will not see its memory footprint go up and down. (However, if it frees a lot of memory and then doesn't allocate it any more, eventually you may see its rss go down.)

How to stop page cache for disk I/O in my linux system?

Here is my system based on Linux2.6.32.12:
1 It contains 20 processes which occupy a lot of usr cpu
2 It needs to write data on rate 100M/s to disk and those data would not be used recently.
What I expect:
It can run steadily and disk I/O would not affect my system.
My problem:
At the beginning, the system run as I thought. But as the time passed, Linux would cache a lot data for the disk I/O, that lead to physical memory reducing. At last, there will be not enough memory, then Linux will swap in/out my processes. It will cause I/O problem that a lot cpu time was used to I/O.
What I have try:
I try to solved the problem, by "fsync" everytime I write a large block.But the physical memory is still decreasing while cached increasing.
How to stop page cache here, it's useless for me
More infomation:
When Top show free 46963m, all is well including cpu %wa is low and vmstat shows no si or so.
When Top show free 273m, %wa is so high which affect my processes and vmstat shows a lot si and so.
I'm not sure that changing something will affect overall performance.
Maybe you might use posix_fadvise(2) and sync_file_range(2) in your program (and more rarely fsync(2) or fdatasync(2) or sync(2) or syncfs(2), ...). Also look at madvise(2), mlock(2) and munlock(2), and of course mmap(2) and munmap(2). Perhaps ionice(1) could help.
In the reader process, you might perhaps use readhahead(2) (perhaps in a separate thread).
Upgrading your kernel (to a 3.6 or better) could certainly help: Linux has improved significantly on these points since 2.6.32 which is really old.
To drop pagecache you can do the following:
"echo 1 > /proc/sys/vm/drop_caches"
drop_caches are usually 0. And, can be changed as per need. As you've identified yourself, that you need to free pagecache, so this is how to do it. You can also take a look at dirty_writeback_centisecs (and it's related tunables)(http://lxr.linux.no/linux+*/Documentation/sysctl/vm.txt#L129) to make quick writeback, but note it might have consequences, as it calls up kernel flasher thread to write out dirty pages. Also, note the uses of dirty_expire_centices, which defines how much time some data needs to be eligible for writeout.

Reclaim memory after program exit

Here is my problem: after running a suite of programs, free tells me that after execution there is about 1 GB less memory free. After some searches I found SO: What really happens when you dont free after malloc which (as I understand it) makes clear that missing memory deallocations should not be the problem... (is that correct?)
top does not show any processes that use significant amounts of memory.
How can I find out 'what happend' to the memory, i.e. which program allocated it and why it is not free after program execution?
Where does free collect its information?
(I am running a recent Ubuntu version)
Yes, memory used by your program is freed after your program exits.
The statistics in "free" are confusing, but the fact is that the memory IS available to other programs:
http://kevinclosson.wordpress.com/2009/11/17/linux-free-memory-is-it-free-or-reclaimable-yes-when-i-want-free-memory-i-want-free-memory/
http://sourcefrog.net/weblog/software/linux-kernel/free-mem.html
Here's an event better link:
http://www.linuxatemyram.com/
free (1) is a misnomer, it should more correctly be called unused, because that's what it shows. Or maybe it should be called physicalfree (or, more precisely, the "free" column in the output should be named "unused").
You'll note that "buffers" and "cached" tends to go up as "free" goes down. Memory does not disappear, it just gets assigned to a different "bucket".
The difference between free memory and unused memory is that while both are "free", the unused memory is truly so (no physical memory in use) whereas the simply "free" memory is often moved into the buffer cache. That is for example the case for all executable images and libraries, anything that is read-only or read-execute. If the same file is loaded again later, the "free" page is mapped into the process again and no data must be loaded.
Note that "unused" is actually a bad thing, although it is not immediately obvious (it sounds good, doesn't it?). Free (but physically used) memory serves a purpose, whereas free (unused) memory means you could as well have saved on money for RAM. Therefore, having unused memory (e.g. by purging pages) is exactly what you don't want.
Stunningly, under Windows there exists a lot of "memory optimizer" tools which cost real money and which do just that...
About reclaiming memory, the way this works is easy: The OS simply removes the references to all pages in the working set. If a page is shared with another process, nothing spectacular happens. If it belongs to a non-anonymous mapping and is not writeable (or writeable and not written), it goes into the buffer cache. Otherwise, it goes zap poof.
This removes any memory allocated with malloc as well as the memory used by executables and file mappings, and (since all memory is based on pages) everything else.
It is probably your OS using up that space for its own purposes.
For example, many modern OS's will keep programs loaded in memory after they terminate, in case you want to start them up again. If their guess is right, it saves a lot of time at the cost of some memory that wasn't being used anyway. Some OS's will even speculatively load some commonly used programs.
CPU utilization works the same way. Often your OS will speculatively do some work when the CPU would otherwise be "idle".

Resources