What are the exact conditions based on which Linux swaps process(s) memory from RAM to a swap file? - linux

My server has 8Gigs of RAM and 8Gigs configured for swap file. I have memory intensive apps running. These apps have peak loads during which we find swap usage increase. Approximately 1 GIG of swap is used.
I have another server with 4Gigs of RAM and 8 Gigs of swap and similar memory intensive apps running on it. But here swap usage is very negligible. Around 100 MB.
I was wondering what are the exact conditions or a rough formula based on which Linux will do a swapout of a process memory in RAM to the swap file.
I know its based on swapiness factor. What else is it based on? Swap file size? Any pointers to Linux kernel documentation/source code explaining this will be great.

I've seen a lot of people posting subjective explanations of what this does. Here is hopefully a more full answer.
In the split LRU on post 2.6.28 Linux swappiness is a multiplier used to arbitrarily modify the fraction that is calculated determining the pressure built up in both LRUs.
So, for example on a system with no free memory left - the value of the existing memory you have is measured based off of the rate of how much memory is listed as 'Active' and the rate of how often pages are promoted to active after falling into the inactive list.
An LRU with many promotions/demotions of pages between active and inactive is in a lot of use.
Typically file backed storage is cheaper and safer to evict when your running out of memory and automatically is given a modifier of 200 (this makes file backed memory 200 times more worthless than swap backed memory (Which has a value of 0) when it multiplies this fraction.
What swappiness does is modify this value by deducting the swappiness number you gave (default 60) to file memory and adding the swappiness value you gave as a multiplier to anon memory. Thus the default swappiness leaves you with anonymous memory being 80 times more valuable than file memory (200-60 for file, 0+60 for anon). Thus, on a typical linux system that has used up all its memory, page cache would have to be 80 TIMES more active than anonymous memory for anonymous memory to be swapped out in favour of page cache.
If you set swappiness to 100 this gives anon a modifier of 100 and file memory a modifier of 100 (200 - 100) leaving both LRUs equally weighted. Thus on a file heavy system that wants page cache providing the anon memory is not as active as page cache then anon memory will be swapped to disk to make space for extra page cache.

Linux (or any other OS) divides memory up into pages (typically 4Kb). Each of these pages represent a chunk of memory. Usage information for these pages is maintained, which basically contains info about whether the page is free or is in use (part of some process), whether it has been accessed recently, what kind of data it contains (process data, executable code etc.), owner of the page, etc. These pages can also be broadly divided into two categories - filesystem pages or the page cache (in which all data read/written to your filesystem resides) and pages belonging to processes.
When the system is running low on memory, the kernel starts swapping out pages based on their usage. Using a list of pages sorted w.r.t recency of access is common for determining which pages can be swapped out (linux kernel has such a list too).
During swapping, Linux kernel needs to decide what to trade-off when nuking pages in memory and sending them to swap. If it swaps filesystem pages too aggressively, more reads are required from the filesystem to read those pages back when they are needed. However, if it swaps out process pages more aggressively it can hurt interactivity, because when the user tries to use the swapped out processes, they will have to be read back from the disk. See a nice discussion here on this.
By setting swappiness = 0, you are telling the linux kernel not to swap out pages belonging to processes. When setting swappiness = 100 instead, you tell the kernel to swap out pages belonging to processes more aggressively. To tune your system, try changing the swappiness parameter in steps of 10, monitoring performance and pages being swapped in/out at each setting using the "vmstat" command. Keep the setting that gives you the best results. Remember to do this testing during peak usage hours. :)
For database applications, swappiness = 0 is generally recommended. (Even then, test different settings on your systems to arrive to a good value).
References:
http://www.linuxvox.com/2009/10/what-is-the-linux-kernel-parameter-vm-swappiness/
http://www.pythian.com/news/1913/

Related

What cause kernel to eat CPU on page_fault?

hw/os: linux 4.9, 64G RAM.
16 daemons running. Each reading random short (100 bytes) pieces of 5GiB file accessing it as a memory mapped via mmap() at daemon startup. Each daemon reads its own file, so 16 5GiB files total.
Each daemon making maybe 10 reads per second. Not too much, disk load is rather small.
Sometimes (1 event in 5 minutes, no period, totally random) some random daemon stuck in kernel code with the followind stack (see picture) for 300 milliseconds. This does not corellate with major-faults: the major-faults go at constant rate about 100...200 per second. Disk reads are also constant.
What can cause this?
Text of the image: __list_del_entry isolate_lru_pages.isra.48 shrink_inactive_list shrink_node_memcg shrink_node node_reclaim get_page_from_freelist enqueue_task_fair sched_clock __alloc_pages_nodemask alloc_pages_vma handle_mm_fault __do_page_fault page_fault
You have shrink_node and node_reclaim functions in your stack. They are called to free memory (which is shown as buff/cache by free command-line tool): https://www.kernel.org/doc/html/latest/admin-guide/mm/concepts.html#reclaim
The process of freeing the reclaimable physical memory pages and repurposing them is called (surprise!) reclaim. Linux can reclaim pages either asynchronously or synchronously, depending on the state of the system. When the system is not loaded, most of the memory is free and allocation requests will be satisfied immediately from the free pages supply. As the load increases, the amount of the free pages goes down and when it reaches a certain threshold (high watermark), an allocation request will awaken the kswapd daemon. It will asynchronously scan memory pages and either just free them if the data they contain is available elsewhere, or evict to the backing storage device (remember those dirty pages?). As memory usage increases even more and reaches another threshold - min watermark - an allocation will trigger direct reclaim. In this case allocation is stalled until enough memory pages are reclaimed to satisfy the request.
So your 64 GB RAM system has situation when there is no free memory left. This amount of memory is enough to hold a copy of 12 files of 5 GB each, and your daemons uses 16 files. Linux may read more data from files than it was required by application with Readahead technique ("Linux readahead: less tricks for more", ols 2007 pp273-284, man 2 readahead). MADV_SEQUENTIAL may also turn on this mechanism, https://man7.org/linux/man-pages/man2/madvise.2.html
MADV_SEQUENTIAL
Expect page references in sequential order. (Hence, pages in
the given range can be aggressively read ahead, and may be
freed soon after they are accessed.)
MADV_RANDOM
Expect page references in random order. (Hence, read ahead
may be less useful than normally.)
Not sure how your daemons did open and read files, was MADV_SEQUENTIAL active for them or not (or was this flag added by glibc or any other library). Also there can be some effect from THP - Transpartent huge pages https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html. Normal 4.9 kernel is from 2016 and thp expansion for filesystems was planned in 2019 https://lwn.net/Articles/789159/, but if you use RHEL/CentOS, some features may be backported into fork of 4.9 kernel.
You should check free and cat /proc/meminfo output periodically to check how your daemons and linux kernel readahead uses memory.

Find exact physical memory usage in Ubuntu/Linux

(I'm new to Linux)
Say I've 1300 MB memory, on a Ubuntu machine. OS and other default programs consumes 300 MB memory and 1000 MB is free for my own applications.
I installed my application and I could configure it to use 700 MB memory, when the application starts.
However I couldn't verify its actual memory usage. Even I disabled swap space.
The "VIRT" value shows a huge value and "RES", "SHR", "%MEM" shows very less value.
It is difficult to find actual physical memory usage, similar to "Resource monitor" in Windows, which will say my application is using 700 MB memory.
Is there any way to find actual physical memory in Ubuntu/Linux ?
TL;DR - Virtual memory is complicated.
The best measure of a Linux processes current usage of physical memory is RES.
The RES value represents the sum of all of the processes pages that are currently resident in physical memory. It includes resident code pages and resident data pages. It also includes shared pages (SHR) that are currently RAM resident, though these pages cannot be exclusively ascribed to >>this<< process.
The VIRT value is actually the sum of all notionally allocated pages for the process, and it includes pages that are currently RAM resident, pages that are currently swapped to disk.
See https://stackoverflow.com/a/56351211/1184752 for another explanation.
Note that RES is giving you (roughly) instantaneous RAM usage. That is what you asked about ...
The "actual" memory usage over time is more complicated because the OS's virtual memory subsystem is typically be swapping pages in and out according to demand. So, for example, some of your application's pages may not have been accesses recently, and the OS may then swap them out (to swap space) to free up RAM for other pages required by your application ... or something else.
The VIRT value while actually representing virtual address space, is a good approximation of total (virtual) memory usage. However, it may be an over-estimate:
Some pages in a processes address space are shared between multiple processes. This includes read-only code segments, pages shared between parent and child processes between vfork and exec, and shared memory segments created using mmap.
Some pages may be set to have illegal access (e.g. for stack red-zones) and may not be backed by either RAM or swap device pages.
Some pages of the address space in certain states may not have been committed to either RAM or disk yet ... depending on how the virtual memory system is implemented. (Consider the case where a process requests a huge memory segment and neither reads from it or writes to it. It is possible that the virtual memory implementation will not allocate RAM pages until the first read or write in the page. And if you use lazy swap reservation, swap pages not be committed either. But beware that you can get into trouble with lazy swap reservation.)
VIRT can also be under-estimate because the OS usually reserves swap space for all pages ... whether they are currently swapped in or swapped out. So if you count the RAM and swap versions of a given page as separate units of storage, VIRT usually underestimates the total storage used.
Finally, if your real goal is to limit your application to using at most
700 MB (of virtual address space) then you can use ulimit -v ... to do this. If the application tries to request memory beyond its limit, the request fails.

How does the kernel-side page cache virt <-> phys mapping interact with the TLB?

I am writing an application that makes heavy use of mmap, including from distinct processes (not concurrently, but serially). A big determinant of performance is how the TLB is managed user and kernel side for such mappings.
I understand reasonably well the user-visible aspects of the Linux page cache. I think this understanding extends to the userland performance impacts1.
What I don't understand is how those same pages are mapped into kernel space, and how this interacts with the TLB (on x86-64). You can find lots of information on how this worked in the 32-bit x86 world2, but I didn't dig up the answer for 64-bit.
So the two questions are (both interrelated and probably answered in one shot):
How is the page cache mapped3 in kernel space on x86-64?
If you read() N pages from a file in some process, then again read exactly those N pages again from another process on the same CPU, it possible that all the kernel side reads (during the kernel -> userpace copy of the contents) hit in the TLB? Note that this is (probably) a direct consequence of (1).
My overall goal here is to understand at a deep level the performance difference of one-off accessing of cached files via mmap or non-mmap calls such as read.
1 For example, if you mmap a file into your processes' virtual address space, you have effectively asked for your process page tables to contain a mapping from the returned/requested virual address range to a physical range corresponding to the pages for that file in the page cache (even if they don't exist in the page cache, yet). If MAP_POPULATE is specified, all the page table entries will actually be populated before the mmap call returns, and if not they will be populated as you fault-in the associated pages (sometimes with optimizations such as fault-around).
2 Basically, (for 3:1 mappings anyway) Linux uses a single 1 GB page to map approximately the first 1 GB of physical RAM directly (and places it at the top 1 GB of virtual memory), which is the end of story for machines with <= 1 GB RAM (the page cache necessarily goes in that 1GB mapping and hence a single 1 GB TLB entry covers everything). With more than 1GB RAM, the page cache is preferentially allocated from "HIGHMEM" - the region above 1GB which isn't covered by the kernel's 1GB mapping, so various temporary mapping strategies are used.
3 By mapped I mean how are the page tables set up for its access, aka how does the virtual <-> physical mapping work.
Due to vast virtual address space compared to physical ram installed (128TB for the kernel), the common trick is to permanently map all the ram. This is known as "direct map".
In principle it is possible that both relevant TLB and cache entries survive the context switch and all the other code executed, but it is hard to say how likely this can be in the real world.

Linux using swap instead of RAM with large image processing

I'm processing large images on a Linux server using the R programming language, so I expect much of the RAM to be used in the image processing and file writing process.
However, the server is using swap memory long before it appears to need to, thus slowing down the processing time significantly. See following image:
This shows I am using roughly 50% of the RAM for the image processing, about 50% appears to be reserved for disk cache (yellow) and yet 10Gb of swap is being used!
I was watching the swap being eaten up, and it didn't happen when the RAM was any higher in use than is being shown in this image. The swap appears to be eaten up during the processed data being written to a GeoTiff file.
My working theory is that the disk writing process is using much of the disk cache area (yellow region), and therefore the yellow isn't actually available to the server (as is often assumed of disk cache RAM)?
Does that sound reasonable? Is there another reason for swap being used when RAM is apparently available?
I believe you may be affected by swappiness kernel parameter:
When an application needs memory and all the RAM is fully occupied, the kernel has two ways to free some memory at its disposal: it can either reduce the disk cache in the RAM by eliminating the oldest data or it may swap some less used portions (pages) of programs out to the swap partition on disk. It is not easy to predict which method would be more efficient. The kernel makes a choice by roughly guessing the effectiveness of the two methods at a given instant, based on the recent history of activity.
Swappiness takes a value between 0 and 100 to change the balance between swapping applications and freeing cache. At 100, the kernel will always prefer to find inactive pages and swap them out. A value of 0 gives something close to the old behavior where applications that wanted memory could shrink the cache to a tiny fraction of RAM.
If you want to force the kernel to avoid swapping whenever possible and give the RAM from device buffers and disk cache to your application, you can set swappiness to zero:
echo 0 > /proc/sys/vm/swappiness
Note that you may actually worsen the performace with this setting, because your disk cache may shrink to a tiny fraction of what it is now, making disk access slower.

Find out how many pages of memory a process uses on linux

I need to find out how many pages of memory a process allocates?
Each page is 4096, the process memory usage I'm having some problems locating the correct value. When I'm looking in the gome-system-monitor there are a few values to choose from under memory map.
Thanks.
The point of this is to divide the memory usage by the page count and verify the page size.
It's hard to figure exact amount of memory allocated correctly: there are pages shared with other processes (r/o parts of libraries), never used memory allocated by brk and anonymous mmap, mmaped file which are not fetched from disk completely due to efficient processing algorithms which touch only small part of file etc, swapped out pages, dirty pages to-be-written-on-disk etc.
If you want to deal with all this complexity and figure out True count of pages, the detailed information is available at /proc/<pid>/smaps, and there are tools, like mem_usage.py or smem.pl (easily googlable) to turn it into more-or-less usable summary.
This would be the "Resident Set Size", assuming you process doesn't use swap.
Note that a process may allocate far more memory ("Virtual Memory Size"), but as long as it don't writes to the memory, it is not represented by physical memory, be it in RAM or on the disk.
Some system tools, like top, display a huge value for "swap" for each process - this is of course completly wrong, the value is the difference between VMS and RSS and most likely those unused, but allocated, memory pages.

Resources