What cause kernel to eat CPU on page_fault?

What cause kernel to eat CPU on page_fault? - linux

hw/os: linux 4.9, 64G RAM.
16 daemons running. Each reading random short (100 bytes) pieces of 5GiB file accessing it as a memory mapped via mmap() at daemon startup. Each daemon reads its own file, so 16 5GiB files total.
Each daemon making maybe 10 reads per second. Not too much, disk load is rather small.
Sometimes (1 event in 5 minutes, no period, totally random) some random daemon stuck in kernel code with the followind stack (see picture) for 300 milliseconds. This does not corellate with major-faults: the major-faults go at constant rate about 100...200 per second. Disk reads are also constant.
What can cause this?
Text of the image: __list_del_entry isolate_lru_pages.isra.48 shrink_inactive_list shrink_node_memcg shrink_node node_reclaim get_page_from_freelist enqueue_task_fair sched_clock __alloc_pages_nodemask alloc_pages_vma handle_mm_fault __do_page_fault page_fault

You have shrink_node and node_reclaim functions in your stack. They are called to free memory (which is shown as buff/cache by free command-line tool): https://www.kernel.org/doc/html/latest/admin-guide/mm/concepts.html#reclaim
The process of freeing the reclaimable physical memory pages and repurposing them is called (surprise!) reclaim. Linux can reclaim pages either asynchronously or synchronously, depending on the state of the system. When the system is not loaded, most of the memory is free and allocation requests will be satisfied immediately from the free pages supply. As the load increases, the amount of the free pages goes down and when it reaches a certain threshold (high watermark), an allocation request will awaken the kswapd daemon. It will asynchronously scan memory pages and either just free them if the data they contain is available elsewhere, or evict to the backing storage device (remember those dirty pages?). As memory usage increases even more and reaches another threshold - min watermark - an allocation will trigger direct reclaim. In this case allocation is stalled until enough memory pages are reclaimed to satisfy the request.
So your 64 GB RAM system has situation when there is no free memory left. This amount of memory is enough to hold a copy of 12 files of 5 GB each, and your daemons uses 16 files. Linux may read more data from files than it was required by application with Readahead technique ("Linux readahead: less tricks for more", ols 2007 pp273-284, man 2 readahead). MADV_SEQUENTIAL may also turn on this mechanism, https://man7.org/linux/man-pages/man2/madvise.2.html
MADV_SEQUENTIAL
Expect page references in sequential order. (Hence, pages in
the given range can be aggressively read ahead, and may be
freed soon after they are accessed.)
MADV_RANDOM
Expect page references in random order. (Hence, read ahead
may be less useful than normally.)
Not sure how your daemons did open and read files, was MADV_SEQUENTIAL active for them or not (or was this flag added by glibc or any other library). Also there can be some effect from THP - Transpartent huge pages https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html. Normal 4.9 kernel is from 2016 and thp expansion for filesystems was planned in 2019 https://lwn.net/Articles/789159/, but if you use RHEL/CentOS, some features may be backported into fork of 4.9 kernel.
You should check free and cat /proc/meminfo output periodically to check how your daemons and linux kernel readahead uses memory.

Related

Find exact physical memory usage in Ubuntu/Linux

(I'm new to Linux)
Say I've 1300 MB memory, on a Ubuntu machine. OS and other default programs consumes 300 MB memory and 1000 MB is free for my own applications.
I installed my application and I could configure it to use 700 MB memory, when the application starts.
However I couldn't verify its actual memory usage. Even I disabled swap space.
The "VIRT" value shows a huge value and "RES", "SHR", "%MEM" shows very less value.
It is difficult to find actual physical memory usage, similar to "Resource monitor" in Windows, which will say my application is using 700 MB memory.
Is there any way to find actual physical memory in Ubuntu/Linux ?

TL;DR - Virtual memory is complicated.
The best measure of a Linux processes current usage of physical memory is RES.
The RES value represents the sum of all of the processes pages that are currently resident in physical memory. It includes resident code pages and resident data pages. It also includes shared pages (SHR) that are currently RAM resident, though these pages cannot be exclusively ascribed to >>this<< process.
The VIRT value is actually the sum of all notionally allocated pages for the process, and it includes pages that are currently RAM resident, pages that are currently swapped to disk.
See https://stackoverflow.com/a/56351211/1184752 for another explanation.
Note that RES is giving you (roughly) instantaneous RAM usage. That is what you asked about ...
The "actual" memory usage over time is more complicated because the OS's virtual memory subsystem is typically be swapping pages in and out according to demand. So, for example, some of your application's pages may not have been accesses recently, and the OS may then swap them out (to swap space) to free up RAM for other pages required by your application ... or something else.
The VIRT value while actually representing virtual address space, is a good approximation of total (virtual) memory usage. However, it may be an over-estimate:
Some pages in a processes address space are shared between multiple processes. This includes read-only code segments, pages shared between parent and child processes between vfork and exec, and shared memory segments created using mmap.
Some pages may be set to have illegal access (e.g. for stack red-zones) and may not be backed by either RAM or swap device pages.
Some pages of the address space in certain states may not have been committed to either RAM or disk yet ... depending on how the virtual memory system is implemented. (Consider the case where a process requests a huge memory segment and neither reads from it or writes to it. It is possible that the virtual memory implementation will not allocate RAM pages until the first read or write in the page. And if you use lazy swap reservation, swap pages not be committed either. But beware that you can get into trouble with lazy swap reservation.)
VIRT can also be under-estimate because the OS usually reserves swap space for all pages ... whether they are currently swapped in or swapped out. So if you count the RAM and swap versions of a given page as separate units of storage, VIRT usually underestimates the total storage used.
Finally, if your real goal is to limit your application to using at most
700 MB (of virtual address space) then you can use ulimit -v ... to do this. If the application tries to request memory beyond its limit, the request fails.

How can I shrink the Linux page cache from within kernel space?

I'm working on a system that involves some custom hardware and a custom Linux device driver I wrote for the hardware. The system occasionally needs to move large amounts of data very rapidly and therefore my driver dynamically (i.e. when needed) allocates large (1 GB) DMA buffers which are used and then freed when they are no longer needed. To allocate such large buffers I actually allocate a bunch of smaller buffers (256 X 4MB) using dma_alloc_coherent and then map them contiguously into user space using remap_pfn_range. This works very well most of the time.
During testing, after the system has been running test cases for a long time, I sometimes see DMA allocation failures where one of the dma_alloc_coherent calls in my driver fails which causes my application layer software to crash. I was finally able to track down this problem and I discovered that when I see DMA allocation failures the Linux kernel page cache is very full.
For example, on the last failure that I captured the page cache filled 27 GB of the 32 GB of RAM on my system. I suspected that the page cache "fullness" was causing dma_alloc_coherent calls to fail. To test this theory I manually emptied the page cache using:
# echo 1 > /proc/sys/vm/drop_caches
This dropped the size of the cache from 27 GB to 94 MB and I was able to allocate 20+ 1 GB DMA buffers with no issues.
Clearly the page cache is a beneficial thing so I would prefer not to have to completely empty it every time I run out of space when allocating DMA buffers. My questions is this: how can I dynamically shrink the page cache in kernel space such that if a call to dma_alloc_coherent fails I can recover just enough space so that I can retry the call and have it succeed?
My system is x86_64 based running a 3.16.x Linux kernel.
I have found some vague references that suggest what I'm attempting may be possible, for example "These objects are automatically
reclaimed by the kernel when memory is needed elsewhere on the system." (from: https://www.kernel.org/doc/Documentation/sysctl/vm.txt). But I have not yet found any specifics that indicate how the memory is reclaimed.
Any assistance with this would be greatly appreciated!

TL;DR : Scan for active superblocks and drop references to non-dirty ones until you have reclaimed as much system memory as you need. (or you finally run out of references to active superblocks.)
How to write kernel code to dynamically shrink the fs page-cache,
to recover just enough space so that a subsequent call to dma_alloc_coherent() succeeds?
To answer this question, let us take a look at what the "drop_caches operation" did to reduce the fs page-cache from 27GB to 94MB on your system.
echo 1 > /proc/sys/vm/drop_caches
invokes
drop_caches_sysctl_handler()
which in turn invokes iterate_supers() and
passes it the pointer to the function drop_pagecache_sb().
What happens next is that iterate_supers() scans for active superblocks and everytime it finds one, it calls drop_pagecache_sb(), passing it a reference to the active superblock.
This iterative procedure continues until references to all the active superblocks are freed from the fs page-cache. This is a non-destructive operation and will only free blocks that are completely unused. Dirty-objects will continue to be in use until written out to disk and are not free-able. If you run sync first to flush them out to disk, the "drop_caches operation" tends to free more memory.
Since you are interested in running this process to reclaim a limited/known amount of memory i.e. what is soon going to be requested using dma_alloc_coherent(), you simply need to implement the above functionality with an additional check at the end of each iteration and abort the superblock scan immediately once the amount of free system memory crosses the desired level.
A couple of points to keep in mind to further optimise this procedure :
Is there a preference for certain block devices over others?
You may want to iterate over active superblocks of the block devices that you do not care about first. If enough memory is not reclaimed, then scan the block devices that you would prefer to retain in the fs page-cache unless absolutely necessary to reclaim required memory. get_active_super() might be of help here.
iterate_supers_type() seems interesting
It allows one to iterate over superblocks of specific file_system_type
Please note that this is a speculative solution based purely on the analysis of existing code within the Linux kernel that you have observed to already solve your problem. Once the above approach is implemented, it will only allow you to control the same i.e. attempt to reclaim fs page-cache memory only to the extent required for your immediate needs.

Technically when certain allocation fails then Kernel will try to free memory.Depending upon memory failures(soft failure/hard failure). Hard failures causes Kernel to enter into direct reclaim path. Direct reclaim is costly operation which might take undefined time to complete and even after that allocation might fail.
Here you have two options:
1) Play with VM settings like dirty_ratio,dirty_background_ratio etc to maintain free ram. see : https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-tunables.html
2) Write a kernel daemon, which calls kernel function which handles drop_cache (because drop_cache migh sleep).

vm/min_free_kbytes - Why Keep Minimum Reserved Memory?

According to this article:
/proc/sys/vm/min_free_kbytes: This controls the amount of memory that is kept free for use by special reserves including “atomic” allocations (those which cannot wait for reclaim)
My question is that what does it mean by "those which cannot wait for reclaim"? In other words, I would like to understand why there's a need to tell the system to always keep a certain minimum amount of memory free and under what circumstances will this memory be used? [It must be used by something; don't see the need otherwise]
My second question: does setting this memory to something higher than 4MB (on my system) leads to better performance? We have a server which occasionally exhibit very poor shell performance (e.g. ls -l takes 10-15 seconds to execute) when certain processes get going and if setting this number to something higher will lead to better shell performance?

(link is dead, looks like it's now here)
That text is referring to atomic allocations, which are requests for memory that must be satisfied without giving up control (i.e. the current thread can not be suspended). This happens most often in interrupt routines, but it applies to all cases where memory is needed while holding an essential lock. These allocations must be immediate, as you can't afford to wait for the swapper to free up memory.
See Linux-MM for a more thorough explanation, but here is the memory allocation process in short:
_alloc_pages first iterates over each memory zone looking for the first one that contains eligible free pages
_alloc_pages then wakes up the kswapd task [..to..] tap into the reserve memory pools maintained for each zone.
If the memory allocation still does not succeed, _alloc pages will either give up [..] In this process _alloc_pages executes a cond_resched() which may cause a sleep, which is why this branch is forbidden to allocations with GFP_ATOMIC.
min_free_kbytes is unlikely to help much with the described "ls -l takes 10-15 seconds to execute"; that is likely caused by general memory pressure and swapping rather than zone exhaustion. The min_free_kbytes setting only needs to allow enough free pages to handle immediate requests. As soon as normal operation is resumed, the swapper process can be run to rebalance the memory zones. The only time I've had to increase min_free_kbytes is after enabling jumbo frames on a network card that didn't support dma scattering.
To expand on your second question a bit, you will have better results tuning vm.swappiness and the dirty ratios mentioned in the linked article. However, be aware that optimizing for "ls -l" performance may cause other processes to become slower. Never optimize for a non-primary usecase.

All linux systems will attempt to make use of all physical memory available to the system, often through the creation of a filesystem buffer cache, which put simply is an I/O buffer to help improve system performance. Technically this memory is not in use, even though it is allocated for caching.
"wait for reclaim", in your question, refers to the process of reclaiming that cache memory that is "not in use" so that it can be allocated to a process. This is supposed to be transparent but in the real world there are many processes that do not wait for this memory to become available. Java is a good example, especially where a large minimum heap size has been set. The process tries to allocate the memory and if it is not instantly available in one large contiguous (atomic?) chunk, the process dies.
Reserving a certain amount of memory with min_free_kbytes allows this memory to be instantly available and reduces the memory pressure when new processes need to start, run and finish while there is a high memory load and a full buffer cache.
4MB does seem rather low because if the buffer cache is full, any process that wants an immediate allocation of more than 4MB will likely fail. The setting is very tunable and system-specific, but if you have a few GB of memory available it can't hurt to bump up the reserve memory to 128MB. I'm not sure what effect it will have on shell interactivity, but likely positive.

This memory is kept free from use by normal processes. As #Arno mentioned, the special processes that can run include interrupt routines, which must be run now (as it's an interrupt), and finish before any other processes can run (atomic). This can include things like swapping out memory to disk when memory is full.
If the memory is filled an interrupt (memory management) process runs to swap some memory into disk so it can free some memory for use by normal processes. But if vm.min_free_kbytes is too small for it to run, then it locks up the system. This is because this interrupt process must run first to free memory so others can run, but then it's stuck because it doesn't have enough reserved memory vm.min_free_kbytes to do its task resulting in a deadlock.
Also see:
https://www.linbit.com/en/kernel-min_free_kbytes/ and
https://askubuntu.com/questions/41778/computer-freezing-on-almost-full-ram-possibly-disk-cache-problem (where the memory management process has so little memory to work with it takes so long to swap little by little that it feels like a freeze.)

Different between memory usage in WHM/Cpanel and Linux

If I go to WHM and see my server's memory usage, it says that only 16% of memory is in use.
But when I connect to server using SSH and run command "free -m" then it shows that 80% is in use. Why is that? I want to know exact memory usage of all applications running like MySQL, Apache e.t.c.
How do I view that?
Thanks

As they say, "It's Complicated".
Linux uses unused memory for disk buffering and caching. It speeds things up. But you may need to look at the -/+ buffers/cache line of free.
'ps' can show you, for any given process, or for all processes, the %cpu, %mem, cumulative cpu-time, rss (resident set size, the non-swapped physical memory that a process is using), size (very approximate amount of swap space that would be required if the process were to dirty all writable pages and then be swapped out), vsize (virtual memory usage of entire process (vm_lib + vm_exe + vm_data + vm_stack)), and much much more.
For any given process, you can cat /proc/$PID/status -- it's human readable -- and check out the VmSize, VmLck, VmRSS, VmData, VmStk, VmExe, VmLib, and VmPTE values, along with others...
But that's just for starters... Processes can allocate memory but not use it. (Memory can be allocated, but the memory pages are not created/issued until they're actually used. That whole on-demand thing.)
Processes can map in hardware space, showing up as using a large quantity of memory that's not actually coming from system RAM. (X-servers are known to sometimes do this. It's some wonky stuff involved kernel drivers...)
There's the executable, which is usually a memory-mapped file. Meaning that parts that are swapped-in are taking up RAM, but when swapped out it never takes up swapfile space.
Processes can have other memory-mapped files...
There's shared-memory libraries, where the same RAM pages are used by multiple programs concurrently.
So we have to ask, as far as memory goes, what exactly counts and what doesn't?

What are the exact conditions based on which Linux swaps process(s) memory from RAM to a swap file?

My server has 8Gigs of RAM and 8Gigs configured for swap file. I have memory intensive apps running. These apps have peak loads during which we find swap usage increase. Approximately 1 GIG of swap is used.
I have another server with 4Gigs of RAM and 8 Gigs of swap and similar memory intensive apps running on it. But here swap usage is very negligible. Around 100 MB.
I was wondering what are the exact conditions or a rough formula based on which Linux will do a swapout of a process memory in RAM to the swap file.
I know its based on swapiness factor. What else is it based on? Swap file size? Any pointers to Linux kernel documentation/source code explaining this will be great.

I've seen a lot of people posting subjective explanations of what this does. Here is hopefully a more full answer.
In the split LRU on post 2.6.28 Linux swappiness is a multiplier used to arbitrarily modify the fraction that is calculated determining the pressure built up in both LRUs.
So, for example on a system with no free memory left - the value of the existing memory you have is measured based off of the rate of how much memory is listed as 'Active' and the rate of how often pages are promoted to active after falling into the inactive list.
An LRU with many promotions/demotions of pages between active and inactive is in a lot of use.
Typically file backed storage is cheaper and safer to evict when your running out of memory and automatically is given a modifier of 200 (this makes file backed memory 200 times more worthless than swap backed memory (Which has a value of 0) when it multiplies this fraction.
What swappiness does is modify this value by deducting the swappiness number you gave (default 60) to file memory and adding the swappiness value you gave as a multiplier to anon memory. Thus the default swappiness leaves you with anonymous memory being 80 times more valuable than file memory (200-60 for file, 0+60 for anon). Thus, on a typical linux system that has used up all its memory, page cache would have to be 80 TIMES more active than anonymous memory for anonymous memory to be swapped out in favour of page cache.
If you set swappiness to 100 this gives anon a modifier of 100 and file memory a modifier of 100 (200 - 100) leaving both LRUs equally weighted. Thus on a file heavy system that wants page cache providing the anon memory is not as active as page cache then anon memory will be swapped to disk to make space for extra page cache.

Linux (or any other OS) divides memory up into pages (typically 4Kb). Each of these pages represent a chunk of memory. Usage information for these pages is maintained, which basically contains info about whether the page is free or is in use (part of some process), whether it has been accessed recently, what kind of data it contains (process data, executable code etc.), owner of the page, etc. These pages can also be broadly divided into two categories - filesystem pages or the page cache (in which all data read/written to your filesystem resides) and pages belonging to processes.
When the system is running low on memory, the kernel starts swapping out pages based on their usage. Using a list of pages sorted w.r.t recency of access is common for determining which pages can be swapped out (linux kernel has such a list too).
During swapping, Linux kernel needs to decide what to trade-off when nuking pages in memory and sending them to swap. If it swaps filesystem pages too aggressively, more reads are required from the filesystem to read those pages back when they are needed. However, if it swaps out process pages more aggressively it can hurt interactivity, because when the user tries to use the swapped out processes, they will have to be read back from the disk. See a nice discussion here on this.
By setting swappiness = 0, you are telling the linux kernel not to swap out pages belonging to processes. When setting swappiness = 100 instead, you tell the kernel to swap out pages belonging to processes more aggressively. To tune your system, try changing the swappiness parameter in steps of 10, monitoring performance and pages being swapped in/out at each setting using the "vmstat" command. Keep the setting that gives you the best results. Remember to do this testing during peak usage hours. :)
For database applications, swappiness = 0 is generally recommended. (Even then, test different settings on your systems to arrive to a good value).
References:
http://www.linuxvox.com/2009/10/what-is-the-linux-kernel-parameter-vm-swappiness/
http://www.pythian.com/news/1913/

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string