How to measure minor page fault cost? - page-fault

I want to verify that transparent huge pages (THP) can cause large page-fault latency, because Linux must zero pages before returning them to the user, and a THP is 512x larger than a 4KB page and therefore slower to clear. In addition, when memory is fragmented, the OS often has to compact memory to produce a THP.
So I want to measure minor page fault latency (cost), but I still have no idea how to do it.

Check the https://www.kernel.org/doc/Documentation/vm/transhuge.txt documentation and search the LWN & Red Hat docs for THP latency and THP faults.
https://www.kernel.org/doc/Documentation/vm/transhuge.txt says this about the huge zero page:
By default kernel tries to use huge zero page on read page fault to
anonymous mapping. It's possible to disable huge zero page by writing 0
or enable it back by writing 1:
echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
You can vary this setting (introduced around 2012: https://lwn.net/Articles/517465/ "Adding a huge zero page") and measure page-mapping and access latency. Read some system time with rdtsc/rdtscp/CLOCK_MONOTONIC, access the page, reread the time; record statistics about the time differences, like min/max/avg, and draw a histogram: count how many differences fell into the 0..100, 101..300, 301..600, ... ranges and how many were bigger than some huge value. The array used to count the histogram can be quite small.
You may also try mmap() with the MAP_POPULATE flag, which prefaults the mapping at mmap time (http://d3s.mff.cuni.cz/teaching/advanced_operating_systems/slides/10_huge_pages.pdf, page 17).
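For a concrete starting point, here is a minimal sketch of that measurement, assuming x86-64 Linux: it times the first write to every 4 KB offset of a fresh anonymous mapping with CLOCK_MONOTONIC and bins the differences into a coarse histogram. The 64 MB mapping size and the bucket boundaries are arbitrary illustration values.

/* fault_hist.c - time the first touch of each page of a fresh anonymous mapping.
 * Build: gcc -O2 fault_hist.c -o fault_hist */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

static inline uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void) {
    const size_t sz = 64u << 20;              /* 64 MB of anonymous memory */
    const long page = sysconf(_SC_PAGE_SIZE); /* usually 4096 */
    /* Add MAP_POPULATE here to prefault everything inside mmap() instead. */
    unsigned char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Histogram buckets in nanoseconds: <1us, <10us, <100us, <1ms, >=1ms */
    const uint64_t bound[4] = { 1000, 10000, 100000, 1000000 };
    uint64_t hist[5] = { 0 }, min = UINT64_MAX, max = 0, total = 0;

    for (size_t off = 0; off < sz; off += page) {
        uint64_t t0 = now_ns();
        p[off] = 1;                           /* first touch: minor fault + zeroing */
        uint64_t d = now_ns() - t0;

        if (d < min) min = d;
        if (d > max) max = d;
        total += d;
        int b = 0;
        while (b < 4 && d >= bound[b]) b++;
        hist[b]++;
    }

    size_t n = sz / page;
    printf("pages=%zu min=%lluns avg=%lluns max=%lluns\n", n,
           (unsigned long long)min, (unsigned long long)(total / n),
           (unsigned long long)max);
    printf("<1us:%llu <10us:%llu <100us:%llu <1ms:%llu >=1ms:%llu\n",
           (unsigned long long)hist[0], (unsigned long long)hist[1],
           (unsigned long long)hist[2], (unsigned long long)hist[3],
           (unsigned long long)hist[4]);
    return 0;
}

With THP enabled you would expect a latency spike roughly once per 512 pages (the 2 MB huge-page fault, which zeroes the whole 2 MB) and cheap touches in between; with THP set to never, the per-page times should be far more uniform.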
The Red Hat blog has a post about THP & page-fault latency (measured with the help of their SystemTap tracing tool, stap): https://developers.redhat.com/blog/2014/03/10/examining-huge-pages-or-transparent-huge-pages-performance/
To prevent information leakage from the previous user of the page the kernel writes zeros in the entire page. For a 4096 byte page this is a relatively short operation and will only take a couple of microseconds. The x86 hugepages are 2MB in size, 512 times larger than the normal page. Thus, the operation may take hundreds of microseconds and impact the operation of latency sensitive code. Below is a simple SystemTap command line script to show which applications have huge pages zeroed out and how long those operations take. It will run until cntl-c is pressed.
stap -e 'global huge_clear probe kernel.function("clear_huge_page").return {
huge_clear [execname(), pid()] <<< (gettimeofday_us() - @entry(gettimeofday_us()))}'
Also, I'm not sure about this, but in theory the Linux kernel may have a kernel thread that pre-zeroes huge pages before any application requires them.

Related

Memory capacity saturation and minor page faults

In USE Method: Linux Performance Checklist it is mentioned that
The goal is a measure of memory capacity saturation - the degree to which a process is driving the system beyond its ability (and causing paging/swapping). [...] Another metric that may serve a similar goal is minor-fault rate by process, which could be watched from /proc/PID/stat.
I'm not sure I understand what minor-faults have to do with memory saturation.
Quoting wikipedia for reference
If the page is loaded in memory at the time the fault is generated, but is not marked in the memory management unit as being loaded in memory, then it is called a minor or soft page fault.
I think what the book is referring to is the following OS behaviour, which could make soft page faults increase with memory pressure. But there are other reasons for soft page faults (e.g. allocating new pages with mmap(MAP_ANONYMOUS) and then freeing them again; every first touch of a new page costs a soft page fault, although fault-around for a group of contiguous pages can reduce that to one fault per N pages, for some small N, when iterating through a new large allocation).
When approaching memory-pressure limits, Linux (like many other OSes) will un-wire a page in the hardware page tables to see whether a soft page fault happens soon afterwards. If not, it may actually evict that page from memory (footnote 1).
But if the page does soft-fault before being evicted, the kernel just has to wire it back into the page table, having saved a hard page fault (and the I/O to write the page out in the first place).
Footnote 1: Writing it to disk if dirty, either in swap space or a file-backed mapping if not anonymous; otherwise just dropping it. The kernel could start this disk I/O while waiting to see if it gets faulted back in; IDK if Linux does this or not.
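For completeness, the per-process minor-fault counter mentioned in the checklist can also be read programmatically. The sketch below (my own illustration, not from the checklist) touches a fresh anonymous mapping and reports the resulting minor-fault delta via getrusage(2), which reflects the same counter exposed as the minflt field (field 10) of /proc/PID/stat.

/* minflt_delta.c - how many minor faults does a first pass over new memory cost?
 * Build: gcc -O2 minflt_delta.c -o minflt_delta */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minflt_now(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);   /* same counter as the minflt field of /proc/self/stat */
    return ru.ru_minflt;
}

int main(void) {
    const size_t sz = 16u << 20;   /* 16 MB, an arbitrary illustration size */
    unsigned char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    long before = minflt_now();
    memset(p, 1, sz);              /* first touch of every page */
    long after = minflt_now();

    /* With plain 4 KB pages this is roughly 4096 faults for 16 MB;
     * THP can reduce it to roughly 8 (one fault per 2 MB region). */
    printf("minor faults for first touch of %zu bytes: %ld\n", sz, after - before);
    munmap(p, sz);
    return 0;
}

Watching how that per-process rate changes under memory pressure, rather than its absolute value, is closer to what the checklist is after.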

Bypassing 4KB block size limitation on block layer/device

We are developing an SSD-type storage hardware device that can take read/write requests for block sizes larger than 4KB at a time (even MBs in size).
My understanding is that Linux and its filesystems will "chop" files into 4KB blocks that are passed to the block device driver, which then has to physically transfer each block's data to or from the device (e.g., for a write).
I am also aware the kernel page size has a role in this limitation as it is set at 4KB.
As an experiment, I want to find out whether there is a way to actually increase this block size, so that we save some time (instead of doing multiple 4KB writes, we could do one write with a bigger block size).
Is there any FS or any existing project that I can take a look for this?
If not, what is needed to do this experiment, i.e. which parts of Linux need to be modified?
I am trying to find out the level of difficulty and the resources needed, or whether it is even impossible and/or there is a reason we do not even need to do so. Any comment is appreciated.
Thanks.
The 4k limitation is due to the page cache. The main issue is that if you have a 4k page size, but a 32k block size, what happens if the file is only 2000 bytes long, so you only allocate a 4k page to cover the first 4k of the block. Now someone seeks to offset 20000, and writes a single byte. Now suppose the system is under a lot of memory pressure, and the 4k page for the first 2000 bytes, which is clean, gets pushed out of memory. How do you track which parts of the 32k block contain valid data, and what happens when the system needs to write out the dirty page at offset 20000?
Also, let's assume that the system is under a huge amount of memory pressure and we need to write out that last page; what if there isn't enough memory available to instantiate the other 28k of the 32k block, so we can do the read-modify-write cycle just to update that one dirty 4k page at offset 20000?
These problems can all be solved, but it would require a lot of surgery in the VM layer. The VM layer would need to know that for this file system, pages need to be instantiated in chunks of 8 pages at a time, and if there is memory pressure to push out a particular page, you need to write out all 8 pages at the same time if the block is dirty, and then drop all 8 pages from the page cache at the same time. All of this implies that you want to track page usage and page dirtiness not at the 4k page level, but at the compound 32k page/"block" level. It basically involves changes to almost every single part of the VM subsystem, from the page cleaner, to the page fault handler, the page scanner, the writeback algorithms, etc., etc., etc.
Also consider that even if you did hire a Linux VM expert to do this work (which the HDD vendors would deeply love you for, since they also want to be able to deploy HDDs with a 32k or 64k physical sector size), it will be 5-7 years before such a modified VM layer would make its appearance in a Red Hat Enterprise Linux kernel, or the equivalent enterprise or LTS kernel for SuSE or Ubuntu. So if you are working at a startup that is hoping to sell your SSD product into the enterprise market --- you might as well give up now with this approach. It's just not going to work before you run out of money.
Now, if you happen to be working for a large Cloud company who is making their own hardware (ala Facebook, Amazon, Google, etc.) maybe you could go down this particular path, since they don't use enterprise kernels that add new features at a glacial pace --- but for that reason, they want to stick relatively close to the upstream kernel to minimize their maintenance cost.
If you do work for one of these large cloud companies, I'd strongly recommend that you contact other companies who are in this same space; maybe you could collaborate with them to see if together you could do this kind of development work and together try to get this kind of change upstream. It really, really is not a trivial change, though --- especially since the upstream Linux kernel developers will demand that this not negatively impact performance in the common case, which will not involve >4k block devices any time in the near future. And if you work at a Facebook, Google, Amazon, etc., this is not the sort of change that you would want to maintain as a private patch to your kernel, but something that you would want to get upstream, since otherwise it would be such a massive, invasive change that supporting it as an out-of-tree patch would be a huge headache.
Although I've never written a device driver for Linux, I find it very unlikely that this is a real limitation of the driver interface. I guess it's possible that you would want to break I/O into scatter-gather lists where each entry in the list is one page long (to improve memory allocation performance and decrease memory fragmentation), but most device types can handle those directly nowadays, and I don't think anything in the driver interface actually requires it. In fact, the simplest way that requests are issued to block devices (described on page 13 -- marked as page 476 -- of that text) looks like it receives:
a sector start number
a number of sectors to transfer (no limit is mentioned, let alone a limit of 8 512B sectors)
a pointer to write the data into / read the data from (not a scatter-gather list for this simple case, I guess)
whether this is a read versus a write
I suspect that if you're seeing exclusively 4K accesses it's probably a result of the caller not requesting more than 4K at a time -- if the filesystem you're running on top of your device only issues 4K reads, or whatever is using the filesystem only accesses one block at a time, there is nothing your device driver can do to change that on its own!
Using one block at a time is common for random access patterns like database read workloads, but database log or FS journal writes or large serial file reads on a traditional (not copy-on-write) filesystem would issue large I/Os more like what you're expecting. If you want to try issuing large reads against your device directly to see if it's possible through whatever driver you have now, you could use dd if=/dev/rdiskN of=/dev/null bs=N to see if increasing the bs parameter from 4K to 1M shows a significant throughput increase.
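If you would rather drive that experiment from C than from dd, something along these lines issues a single large read with O_DIRECT so the page cache does not get in the way; the device path and the 1 MB request size are placeholders, and whether the hardware actually sees one command still depends on the driver's advertised queue limits.

/* bigread.c - issue one large O_DIRECT read against a block device.
 * Build: gcc -O2 bigread.c -o bigread
 * Usage: ./bigread /dev/sdX   (placeholder path; needs read permission) */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s <block-device>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    const size_t len = 1u << 20;            /* one 1 MB request */
    void *buf;
    /* O_DIRECT needs an aligned buffer; 4096 covers typical logical block sizes. */
    if (posix_memalign(&buf, 4096, len)) { perror("posix_memalign"); close(fd); return 1; }

    ssize_t n = read(fd, buf, len);         /* one syscall; the block layer may still
                                             * split it according to queue limits */
    if (n < 0) perror("read");
    else printf("read %zd bytes in a single request\n", n);

    free(buf);
    close(fd);
    return 0;
}

Whether such a request reaches the device intact is governed by the limits the driver reports (see /sys/block/<dev>/queue/max_sectors_kb and max_hw_sectors_kb), not by the 4 KB page size; watching iostat -x or blktrace while this runs shows the request sizes the device actually receives.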

Changing memory page size

I was reading that the number of virtual memory pages is equal to the number of physical memory frames, and that the size of frames and pages is equal; for example, on my 32-bit system the page size is 4096 bytes.
I was wondering: is there any way to change the page size or the frame size?
I am using Linux. I have searched a lot, and what I found is that we can change the page size, or rather increase it, by switching to huge pages. Is there any other way to change (increase or decrease) or set a page size of our choice?
(Not coding anything, just a general question.)
In practice it is (nearly) impossible to "change" the memory page size, since the page size is determined by the MMU hardware and the operating system takes it into account. However, notice that some Linux systems (and hardware!) have hugetlbpage support, and Linux mmap(2) might accept MAP_HUGETLB (but your code should handle the case of processors or kernels without huge-page support, e.g. by calling mmap again without MAP_HUGETLB when the first mmap with MAP_HUGETLB has failed).
From what I read, on some Linux systems you can use hugetlbpage with various sizes. But the sysadmin can restrict these (or some kernels disable them), so your code should always be prepared for an mmap with MAP_HUGETLB to fail.
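A minimal sketch of that fallback pattern (my own illustration, assuming an x86-64 system where the default huge page size is 2 MB):

/* hugetlb_fallback.c - request a huge-page mapping, fall back to normal pages.
 * Build: gcc -O2 hugetlb_fallback.c -o hugetlb_fallback
 * Note: MAP_HUGETLB only succeeds if huge pages have been reserved, e.g.
 *       echo 8 > /proc/sys/vm/nr_hugepages   (as root). */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const size_t sz = 2u << 20;   /* one 2 MB huge page on x86-64 */

    void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        /* No huge pages configured, or no kernel support: retry normally. */
        p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        printf("fell back to normal pages\n");
    } else {
        printf("got a MAP_HUGETLB mapping\n");
    }

    /* The base page size the rest of the process uses is unchanged either way. */
    printf("sysconf(_SC_PAGE_SIZE) = %ld\n", sysconf(_SC_PAGE_SIZE));

    munmap(p, sz);
    return 0;
}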
Even with those "huge pages", the page size is not arbitrary. Use sysconf(_SC_PAGE_SIZE) on POSIX systems to get the standard page size (it is usually 4 Kbytes). See also sysconf(3).
AFAIK, even on systems with the hugetlbpage feature, mmap can be called without MAP_HUGETLB and the page size (as reported by sysconf(_SC_PAGE_SIZE)) is still 4 Kbytes. Perhaps some recent kernels with unusual configurations use huge pages everywhere, and IIRC some kernels might be configured with a 1 Mbyte page size (I am not sure about that and I might be wrong)...

What happens in paginated (virtual memory) systems when a process is started up?

I'm studying through Tanenbaum's "Modern Operating Systems" book and just read the following paragraph in the book:
When a process is started up, all of its page table entries are marked as not in memory. As soon as any page is referenced, a page fault will occur. The operating system then sets the R bit (in its internal tables), changes the page table entry to point to the correct page, with mode READ ONLY, and restarts the instruction. If the page is subsequently modified, another page fault will occur, allowing the operating system to set the M bit and change the page's mode to READ/WRITE.
This seems extremely inefficient to me. It suggests that when a process is started up, a lot of page faults must occur and real memory is filled only as the instructions are executed.
It appears more logical to me that at least the text of the process should be put into memory at the beginning, instead of being brought in as execution proceeds (potentially with a page fault per instruction executed).
Could someone explain to me what the advantage of the method the book describes is?
Tanenbaum describes two techniques in this paragraph:
When a process is started up, all of its page table entries are marked as not in memory. As soon as any page is referenced, a page fault will occur. The operating system then sets the R bit (in its internal tables), changes the page table entry to point to the correct page, with mode READ ONLY, and restarts the instruction.
This technique is also called demand-paging (the pages are loaded from disk to memory on-demand, if a page-fault occurs). I can think of at least two reasons why you would want to do this:
Memory consumption: Only the pages that are really needed are loaded from disk into main memory; there might be parts of your program you never execute, or parts of your data section you never write to during execution. In that case, those parts are never loaded in the first place, which means you have more RAM available for other processes. Nowadays, with huge amounts of memory, you can of course debate whether this is still a valid argument.
Speed: Loading from disk is slow, and was much slower a decade ago. Doing the page-table setup on demand, in a lazy fashion, allows the kernel to defer fetching blocks from disk. Loading everything at once might delay the execution of your program. Again, disks are now a lot faster and SSDs make this argument even weaker. On the other hand, because of dynamic libraries, binaries are not that big and usually require only a few page faults until they are loaded into RAM.
If the page is subsequently modified, another page fault will occur, allowing the
operating system to set the M bit and change the page's mode to READ/WRITE.
Again, the reason for this is memory consumption. In the old days, when memory was scarce, swapping (moving pages back to disk when memory became full) was the solution for providing the illusion of a much larger working set of pages. If a page had already been swapped out before and was never modified in between, you could get rid of it simply by clearing the present bit in the page table, freeing up the memory the page previously occupied in order to load another frame. The modified bit helps you detect whether you need to write a new version of the page back out to disk, or whether you can leave the old version as is and swap it back in when it is needed.
The method you mention, where a process is set up with all page table entries pre-populated (also known as pre-paging), is perfectly valid. You are trading memory consumption for speed. The page-table walk and the setting of the modified bit are implemented in hardware (on x86), which means they do not perform that badly. Pre-population, however, saves you from executing the page-fault handler, which, although usually heavily optimized, is implemented in software.
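To make that trade-off tangible, here is a small sketch (not from the book) that compares the minor-fault count of a lazily faulted anonymous mapping with one created with MAP_POPULATE, which pre-populates the page tables much as pre-paging would; the 32 MB size is arbitrary.

/* prepage_compare.c - demand paging vs. MAP_POPULATE, measured via ru_minflt.
 * Build: gcc -O2 prepage_compare.c -o prepage_compare */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

/* Map, touch every byte, and report the minor faults taken during the access phase. */
static long touch(int extra_flags, size_t sz) {
    unsigned char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | extra_flags, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return -1; }
    long before = faults();
    memset(p, 1, sz);              /* the "access every page" phase */
    long delta = faults() - before;
    munmap(p, sz);
    return delta;
}

int main(void) {
    const size_t sz = 32u << 20;   /* 32 MB, arbitrary */
    printf("demand paging : %ld faults during access\n", touch(0, sz));
    printf("MAP_POPULATE  : %ld faults during access\n", touch(MAP_POPULATE, sz));
    return 0;
}

The populated mapping pays its faults inside mmap() instead, so its access phase should report close to zero; that is exactly the memory-for-speed trade pre-paging makes.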

What are the exact conditions based on which Linux swaps process(s) memory from RAM to a swap file?

My server has 8Gigs of RAM and 8Gigs configured for swap file. I have memory intensive apps running. These apps have peak loads during which we find swap usage increase. Approximately 1 GIG of swap is used.
I have another server with 4Gigs of RAM and 8 Gigs of swap and similar memory intensive apps running on it. But here swap usage is very negligible. Around 100 MB.
I was wondering what the exact conditions are, or whether there is a rough formula, based on which Linux will swap out a process's memory from RAM to the swap file.
I know it's based on the swappiness factor. What else is it based on? The swap file size? Any pointers to Linux kernel documentation/source code explaining this would be great.
I've seen a lot of people posting subjective explanations of what this does. Here is, hopefully, a fuller answer.
In the split LRU used since Linux 2.6.28, swappiness is a multiplier used to modify the fraction that determines how much reclaim pressure builds up on each of the two LRUs.
So, for example, on a system with no free memory left, the value of the memory you already have is judged from how much of it is listed as 'Active' and how often pages are promoted back to the active list after falling onto the inactive list.
An LRU with many promotions/demotions of pages between active and inactive is in a lot of use.
Typically, file-backed memory is cheaper and safer to evict when you are running out of memory, so it automatically starts with a weight of 200, while swap-backed (anonymous) memory starts with a weight of 0; these weights are applied when the fraction is multiplied, making file-backed memory far more "worthless" than anonymous memory.
What swappiness does is adjust these weights: the swappiness value you set (default 60) is subtracted from the file weight and added to the anon weight. The default therefore gives file memory a weight of 200 - 60 = 140 and anonymous memory a weight of 0 + 60 = 60, a bias of 80 in favour of keeping anonymous memory. So, on a typical Linux system that has used up all its memory, the page cache has to be substantially more active than anonymous memory before anonymous memory is swapped out in favour of page cache.
If you set swappiness to 100, anon gets a weight of 100 and file memory also gets a weight of 100 (200 - 100), leaving both LRUs equally weighted. On a file-heavy system that benefits from page cache, anonymous memory will then be swapped to disk to make room for extra page cache, provided the anonymous memory is not more active than the page cache.
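As a worked illustration of that arithmetic (a simplification, not kernel code; it ignores the recent-scan/rotation ratios the reclaim code also multiplies in), the weights for a range of swappiness values look like this; the class with the larger number is the one reclaim leans on harder:

/* swap_weights.c - the anon/file weighting described above (illustration only). */
#include <stdio.h>

int main(void) {
    for (int swappiness = 0; swappiness <= 100; swappiness += 20) {
        int anon_weight = swappiness;         /*   0 + swappiness */
        int file_weight = 200 - swappiness;   /* 200 - swappiness */
        printf("swappiness=%3d  anon=%3d  file=%3d\n",
               swappiness, anon_weight, file_weight);
    }
    return 0;
}

At the default of 60 this prints anon=60, file=140; at 100 the two are equal, matching the behaviour described above.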
Linux (or any other OS) divides memory up into pages (typically 4 KB). Each of these pages represents a chunk of memory. Usage information is maintained for these pages, basically covering whether the page is free or in use (part of some process), whether it has been accessed recently, what kind of data it contains (process data, executable code, etc.), who owns the page, and so on. These pages can be broadly divided into two categories: filesystem pages, i.e. the page cache (in which all data read from or written to your filesystem resides), and pages belonging to processes.
When the system is running low on memory, the kernel starts evicting pages based on their usage. Using a list of pages sorted by recency of access is a common way of determining which pages can be evicted (the Linux kernel has such a list too).
During reclaim, the Linux kernel needs to decide what to trade off when removing pages from memory and sending them to swap. If it evicts filesystem pages too aggressively, more reads are required from the filesystem to bring those pages back when they are needed. However, if it swaps out process pages more aggressively, it can hurt interactivity, because when the user tries to use the swapped-out processes, they will have to be read back from disk. See a nice discussion here on this.
By setting swappiness = 0, you are telling the Linux kernel not to swap out pages belonging to processes. By setting swappiness = 100 instead, you tell the kernel to swap out process pages more aggressively. To tune your system, try changing the swappiness parameter in steps of 10, monitoring performance and the pages being swapped in/out at each setting using the "vmstat" command. Keep the setting that gives you the best results. Remember to do this testing during peak usage hours. :)
For database applications, swappiness = 0 is generally recommended. (Even then, test different settings on your systems to arrive at a good value.)
References:
http://www.linuxvox.com/2009/10/what-is-the-linux-kernel-parameter-vm-swappiness/
http://www.pythian.com/news/1913/

Resources