virtual/physical page sizes - linux

Is the page size constant? To be more specific, getconf PAGE_SIZE gives 4096, fair enough. But can this change during a program's runtime? Or is it constant for the entire lifetime of a process? I.e., is it possible for a process to have 1024 AND 2048 AND 4096 page sizes? Let's just talk about virtual page sizes for now. But going further, is it possible for a virtual page to span a physical page of greater size?

It is possible for a process to use more than one page size. On newer kernels this may even happen without notice; see Andrea Arcangeli's transparent huge pages.
Other than that, you can request memory with a different (usually larger) page size through hugetlbfs.
The main reason for having big pages is performance: the TLB in the processor is very limited in size, and fewer but bigger pages mean more hits.
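Note that the base page size reported to a process does not change at runtime, even when some of its mappings end up backed by huge pages. A minimal sketch to check it (standard POSIX call, nothing Linux-specific assumed):

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* The kernel's base page size for this process, e.g. 4096.
       This value stays the same for the whole process lifetime. */
    long page = sysconf(_SC_PAGESIZE);
    printf("base page size: %ld bytes\n", page);
    return 0;
}
```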

Related

variable page size in linux

I was going through the paging concept in Linux, where most of the time I found that the page size is fixed (the default value is 4 KB).
When I searched further on the internet, I found that IBM's POWER architecture supports variable-size paging.
https://www-304.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.performance/variable_large_page.htm
The POWER7® processor supports mixing 4 KB, 64 KB, and 16 MB page sizes within a single segment.
One thing I did not get: is this variable page size handled at run time? I mean, can a variable page size be allocated on demand at any time? And if yes, how is that possible?
If this is not the right platform, please move this question to the right one.
Linux has a fixed base page size, which is 4 KB on x86. Since this leads to a huge number of page entries to be managed by the MMU, Linux (RH) also supports transparent huge pages. This feature can be enabled at boot time or at runtime and allows a page size of 2 MB (1 GB pages are available through hugetlbfs). Be aware that the kernel does some kind of defragmentation (compaction) which can degrade performance. This can be switched off by writing 'never' to /sys/kernel/mm/transparent_hugepage/defrag.
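As a small, hedged sketch, the current policy can be inspected from a program by reading the sysfs file next to the one mentioned above (the path exists only if the kernel was built with transparent huge page support; the kernel marks the active choice with brackets, e.g. "always [madvise] never"):

```c
#include <stdio.h>

int main(void) {
    char buf[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (!f) {
        perror("transparent huge pages not available");
        return 1;
    }
    /* Prints something like: always [madvise] never */
    if (fgets(buf, sizeof buf, f))
        printf("THP policy: %s", buf);
    fclose(f);
    return 0;
}
```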

Changing memory page size

I was reading that the number of virtual memory pages is equal to the number of physical memory frames, and that the size of frames and pages is equal; for example, on my 32-bit system the page size is 4096.
Well, I was wondering: is there any way to change the page size or the frame size?
I am using Linux. I have searched a lot, and what I found is that we can change the page size, or in fact increase it, by switching to huge pages. Is there any other way to change (increase or decrease) or set a page size of our choice?
(Not coding anything, just a general question)
In practice it is (nearly) impossible to "change" the memory page size, since the page size is known and determined by the MMU hardware, so the operating system takes that into account. However, notice that some Linux systems (and hardware!) have hugetlbpage support, and Linux mmap(2) might accept MAP_HUGETLB (but your code should handle the case of processors or kernels without huge page support, e.g. by calling mmap again without MAP_HUGETLB when the first mmap with MAP_HUGETLB has failed).
From what I read, on some Linux systems you can use hugetlbpage with various sizes. But the sysadmin can restrict these (or some kernels disable it), so your code should always be prepared for an mmap with MAP_HUGETLB to fail.
Even with those "huge pages", the page size is not arbitrary. Use sysconf(_SC_PAGE_SIZE) on POSIX systems to get the standard page size (it is usually 4 KB). See also sysconf(3).
AFAIK, even on systems with the hugetlbpage feature, mmap can be called without MAP_HUGETLB and the page size (as reported by sysconf(_SC_PAGE_SIZE)) is still 4 KB. Perhaps some recent kernels with some unusual configurations use huge pages everywhere, and IIRC some kernels might be configured with a 1 MB page (I am not sure about that and I might be wrong)...
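A minimal sketch of the fallback pattern described above, assuming a 2 MB huge page size (the common x86_64 default) and that huge pages may simply be unavailable or forbidden:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 2 * 1024 * 1024;   /* assume one 2 MB huge page */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        /* No huge page support, none reserved, or not permitted:
           retry with normal pages. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("mapped %zu bytes at %p\n", len, p);
    munmap(p, len);
    return 0;
}
```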

Why is the page size of Linux (x86) 4 KB, how is that calculated?

The default memory page size of the Linux kernel on the x86 architecture is 4 KB. I wonder how that was calculated, and why?
The default page size is fixed by what the MMU (memory management unit) of the CPU supports. In 32-bit protected mode x86 supports two kinds of pages:
normal ones, 4 KiB
huge ones, 4 MiB
Not all x86 processors support large pages. One needs a CPU with Page Size Extension (PSE) capabilities, which excludes pre-Pentium processors. Virtually all current-generation x86 CPUs implement it.
4 KiB is a widely popular page granularity in other architectures too. One could argue that this size comes from the division of a 32-bit virtual address into two 10-bit indexes into page directories/tables, with the remaining 12 bits giving the 4 KiB page size.
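To make that 10 + 10 + 12 split concrete, here is a small illustrative sketch that carves an arbitrary 32-bit linear address into the two table indexes and the page offset:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t va = 0xC0123ABC;   /* arbitrary sample address */

    uint32_t pde_index = (va >> 22) & 0x3FF;  /* bits 31..22: page directory */
    uint32_t pte_index = (va >> 12) & 0x3FF;  /* bits 21..12: page table     */
    uint32_t offset    =  va        & 0xFFF;  /* bits 11..0:  4 KiB offset   */

    printf("PDE %u, PTE %u, offset 0x%03x\n",
           (unsigned)pde_index, (unsigned)pte_index, (unsigned)offset);
    return 0;
}
```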
Introduction
The first Intel processor that supported the paging virtual memory technique was the Intel 80386. The processor supported a single page size, 4 KB. Since it was released in 1985, we have to go back to that period of time to understand why Intel chose that particular page size.
Atlas was the first computer to support paging with a page size of 3 KB and it had a profound influence on the design of virtual memory and motivated related research. The system was designed between 1958-1962. It's interesting to note that the page size supported by the 80386 is somewhat close to the page size supported by the Atlas, even though the 80386 was designed about 20 years later and computers (and the workloads they ran) have evolved radically during that period of time! In fact, many computers from that period used page sizes that range between 0.5-5 KB. Researchers at that time actually spent a substantial amount of effort studying virtual memory systems (paging and segmentation).
One of the big questions was: What is the optimal page size? A large number of works were published in the 60s and 70s that attempt to study and understand the impact of the page size on the performance of applications and recommend or provide guidelines on how to choose a page size. There are certainly a number of papers that were never published. As far as I know, this includes the document from Intel that says "... Therefore, the page size should be 4 KB." But the factors that influence or interact with the page size and the process of choosing a page size (or multiple page sizes for that matter) are well known, which is what I'll discuss in this answer at a basic level. I'll also explain in particular why a page size of 4 KB is reasonable.
The Page Size Problem
In the paging method, physical memory is organized as a sequence of contiguous regions of memory, called page frames, that are of the same size (which is the defining characteristic of paging (1)). Each page frame can be mapped to an equally-sized chunk of a virtual address space, called a virtual page.
Suppose that a page consists of N bytes (2) (which implies that a page frame is also N bytes in size by definition) and consider a virtual address space that consists of P pages (i.e., the page numbers are {0, 1, 2, ..., P - 1} and the total number of virtual addresses is N * P). Suppose also that the physical address space consists of F page frames (i.e., the page frame numbers are {0, 1, 2, ..., F - 1} and the total number of physical addresses is N * F).
Given a virtual address VA, a mechanism (a mapping device) is required to determine the physical address, PA, it is mapped to or a page fault should be raised in case it was not mapped. The mapping device uses a data structure (the page table) stored somewhere to perform the mapping. There should be an entry in that table for each allocated virtual page that describes how the page is mapped and potentially some additional attributes (such as protection attributes). The design of the page table entry, as you'll see, interacts with the page size. I'll discuss the design of the page table entry in the Intel 80386 later.
The size of a virtual address is log2(N * P) and the size of a physical address is log2(N * F). Some bits of VA would represent the offset within the page while the other bits would represent the page number, which identifies the page using the mapping device.
How many options do we have for the page size? Well, it can be as little as one byte and as large as N * P or N * F, whichever is smaller. That's a lot of options.
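To make the notation concrete, here is a tiny sketch with made-up values for N, P and F (the numbers are purely illustrative, not taken from any real machine; link with -lm):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical values: 4096-byte pages, 2^20 virtual pages,
       2^18 page frames. */
    double N = 4096, P = 1048576, F = 262144;

    printf("virtual address bits:  %.0f\n", log2(N * P));  /* log2(N*P) = 32 */
    printf("physical address bits: %.0f\n", log2(N * F));  /* log2(N*F) = 30 */
    return 0;
}
```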
It is Convenient for The Page Size to be a Power of 2
A virtual address, VA, is equivalent to a pair of page number and offset, (PN, OFF). The translation process should be as efficient as possible. It's convenient for programmers (3) to have the bytes within a page be contiguous in the address space. In this way, the addresses of the items within a multi-byte data structure can be calculated with simple arithmetic on a single address, which would constitute the base address of the data structure. This can be achieved by using the least significant log2(N) bits (rounded up) of a virtual address to represent the offset and the rest of the bits to represent the page number.
If N is not a power of 2, there will be some bits that are shared between the offset and the page number, depending on the values of those bits. By making N a power of 2, such a complication does not exist. So it would be neat to use page sizes that are a power of 2. All real processors that support paging use page sizes that are powers of two (although the unit of addressability may not be 8 bits), which makes sense. But to be honest, it's not clear whether this really matters. Using today's technology, whether N is a power of 2 may not have any measurable impact on performance or any other metric of interest. In fact, in the future, as larger and larger page sizes are needed, it may happen that some page size that is not a power of 2 is better. But so far, this has not happened. The point I'm trying to make here is that the design option of making the page size not a power of 2 should not be automatically dismissed. I believe this is a good opportunity for research in future virtual memory systems.
Anyway, keeping in mind that the choice of 4 KB pages was made in the 80s, such extremely small variations in page sizes were shown (both theoretically and experimentally) to be of little importance. So the convenience of power-of-2 page sizes triumphed. This reduces the number of page sizes to consider exponentially. But we still have a good range of options.
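A small sketch of why a power-of-two N is convenient in practice: the split becomes a shift and a mask instead of a division and a modulo (the sample address and the non-power-of-two size of 3000 bytes are arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

enum { N = 4096 };   /* power-of-two page size */

int main(void) {
    uint32_t va = 0x00403A7F;   /* arbitrary sample address */

    uint32_t page_pow2   = va >> 12;      /* same as va / N */
    uint32_t offset_pow2 = va & (N - 1);  /* same as va % N */

    uint32_t page_div    = va / 3000;     /* non-power-of-two page size */
    uint32_t offset_div  = va % 3000;     /* needs a real division      */

    printf("page %u, offset %u (N = 4096)\n",
           (unsigned)page_pow2, (unsigned)offset_pow2);
    printf("page %u, offset %u (N = 3000)\n",
           (unsigned)page_div, (unsigned)offset_div);
    return 0;
}
```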
Favoring Smaller Page Sizes
Since the mapping device works at the level of pages, the unit of allocation from the perspective of the operating system is a virtual page (4). Even if an application needs to allocate only 1 byte, it still has to tell the OS to allocate a whole virtual page for that 1 byte. This issue is called internal fragmentation. Each application (or maybe even each component of an application) has its own virtual address space from which it allocates memory in page-sized chunks. Each application is expected not to use a whole page for a single object it allocates, but rather to allocate as many objects as possible from the same page before allocating more pages. However, because page attributes work at the level of pages, because the same application may use multiple user-mode memory managers (such as when using multiple C/C++ runtimes), and because it's difficult for applications to share the parts of pages they are not using with other applications, internal fragmentation can occur in many pages in the system. Using smaller page sizes can help reduce the amount of wasted physical (for the whole system) and virtual (per process) memory.
In addition, applications typically go through a number of phases throughout their lifetime, exhibiting different memory requirements in different phases. If the page size is, say, 16 KB, but many applications require less than 10 KB for many of their phases, then there would be a lot of wasted physical memory, which may lead to out-of-memory situations that could have been avoided if smaller page sizes, such as 8 or 4 KB, were supported.
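A back-of-the-envelope sketch of the internal fragmentation in the 10 KB example above (the candidate page sizes are just illustrative):

```c
#include <stdio.h>

int main(void) {
    unsigned need    = 10 * 1024;               /* bytes actually needed  */
    unsigned sizes[] = { 4096, 8192, 16384 };   /* candidate page sizes   */

    for (int i = 0; i < 3; i++) {
        unsigned ps    = sizes[i];
        unsigned pages = (need + ps - 1) / ps;  /* round up to whole pages */
        unsigned waste = pages * ps - need;     /* internal fragmentation  */
        printf("%5u-byte pages: %u page(s), %5u bytes wasted\n",
               ps, pages, waste);
    }
    return 0;
}
```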
Smaller page sizes are preferable for handling copy-on-write soft page faults because the smaller the page, the less time it takes to create a copy of it. For extremely small page sizes, this may not make any measurable difference, depending on the memory bus bandwidth.
Typical amounts of physical memory available in computers in the 70s were in the range of tens or hundreds of KBs. It wouldn't make sense to have page sizes of hundreds of KBs or larger. In fact, the working sets of applications at that time were typically only a few KBs or tens of KBs. So even a page size of 16 KB is unlikely to have been practical, because just a couple of pages could consume all of the available physical memory. The page size should be coherent with the amount of physical memory. This argument can be applied to today's systems of course (it wouldn't make sense to have 128-GB pages, for example).
So considering working set sizes and physical memory availability of the 70s and early 80s, the page size should be a power of 2 in the range 2^0-2^14. Cool, now we have only 15 options to choose from.
Favoring Larger Page Sizes
We can also argue that larger page sizes are better. For the same working set, smaller page sizes imply a larger number of pages per application, each of which requires a page table entry to enable translation. This fundamentally requires larger page tables irrespective of the structure of the page tables (although the exact overhead depends on the design of the page table entry itself, which I'll discuss later). Imagine having, for example, 4-byte pages and typical working sets of tens of KBs. Then most of the physical memory would actually be allocated to hold the page tables, not the data. Paging out page tables to secondary storage leads to double page faults for individual memory references, which would be absolutely terrible for performance. Such a design is obviously ridiculous.
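A rough sketch of that argument in numbers: the entries a single flat table over a 32-bit address space would need for a few (partly absurd) page sizes, assuming 4-byte entries:

```c
#include <stdio.h>

int main(void) {
    unsigned long long va_space = 1ULL << 32;          /* 4 GB virtual space */
    unsigned long long sizes[]  = { 4, 4096, 65536 };  /* candidate pages    */

    for (int i = 0; i < 3; i++) {
        unsigned long long entries = va_space / sizes[i];
        /* 4 bytes per entry: total size of the table in KB */
        printf("%6llu-byte pages -> %10llu entries (%llu KB of entries)\n",
               sizes[i], entries, entries * 4 / 1024);
    }
    return 0;
}
```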
Essentially, the page size shouldn't be (much) smaller than the smallest working set size that can possibly ever occur. This immediately rules out pages of sizes 2^0-2^6, leaving us with 8 options. The papers of the 70s and early 80s that study page sizes mostly study only these 8 options.
There is another reason that makes larger page sizes advantageous. One of the benefits of virtual memory is being able to transparently use secondary storage in addition to main memory to hold volatile data. However, secondary storage devices are mechanical and perform best with bulk transfers. This is more of a guideline really, and we should not rule out any page sizes yet. There are devices with different designs and characteristics, and eventually the performance advantage of bulk transfers saturates at some point. But it's certainly something to take into account when measuring the impact of page sizes on performance. If the type of applications being considered exhibits little spatial locality, then smaller page sizes would still be preferable, because copying extra bytes to or from the disk is not exactly free.
Spatial locality of reference encourages the use of certain page sizes. For very small page sizes, it's highly likely that all bytes in the page will be used within a small period of time. As the size of a page gets larger, the number of bytes that are less likely to be used increases. However, for very large pages, all of the working set may fit within a small number of pages irrespective of locality. Therefore, to minimize the number of page faults, the page size should be either very small or very large, but not in between. But ultimately, this varies from one application to another. Emerging programming paradigms, such as object-oriented programming and functional programming, and techniques such as multithreading may actually reduce spatial locality due to the extensive use of linked structures and due to the way different applications interact with each other. Larger page sizes are preferable to reduce the number of page faults. However, smaller page sizes might be better for memory shared between multiple applications, to reduce internal fragmentation of the shared pages. It has also been shown experimentally that the number of page frames that can be held in main memory is correlated with the number of page faults, favoring smaller page sizes.
The need for TLBs was well recognized at that time. Intel called them page caches in their patents (not sure whether Intel was first to use that term). Fast TLBs are very important because address translation is on the critical path of executing instructions. To make them as fast as possible, they should be made tiny, but then they can only cache a small number of page table entries. This motivates the use of larger page sizes.
In the search for the optimal page size, it turns out that there isn't one. It was known at that time that there is a need for supporting multiple page sizes. In fact, the Multics system designed in 1965 supported 64-word or 1,024-word pages (a word is 36 bits in size). Page sizes in the range 2^7-2^14 were shown to be optimal both empirically and theoretically in different scenarios. Intel must have observed that 4 KB pages result in the best average performance for the applications that their customers used in the 80s. For today's computers and applications, such tiny differences in page sizes don't make as much difference as they used to in the 70s and 80s.
The Design of the Page Table Entry of the Intel 80386
The design of the page directory entry and page table entry is discussed in detail in an Intel patent. This is important because the size of the page table entry and the overall structure of the page table were taken into account in many studies about the page size because they all interact with each other. I prefer not to discuss this in more detail to keep the answer short.
The Page Size of the Near Future
Intel was granted a patent a couple of months ago that apparently proposes a system with a default page size of 64 KB, while at the same time supporting 4 KB pages (and other IA-32e page sizes) for backward compatibility. I quote from the patent:
As a result of support of the mapping of the 64 KB page into 4 KB
subpages, VA64 directly supports all currently defined operations on 4
KB pages, including independent protection bits per 4 KB page and
arbitrary 4 KB-aligned address mappings. VA64 also supports OS kernel
page management on 4 KB boundaries, even when the OS kernel allocates
memory in 64 KB units. As a result of support of large pages, VA64
supports all divisions of the virtual address range into pages that an
existing paging system such as Intel Corporation's IA-32e paging
system supports. Therefore, VA64 supports applications and hardware
devices that work with a 4 KB-page Windows® OS kernel, while also
taking full advantage of 64 KB pages when 64 KB pages can be used.
The capabilities of VA64 can be adopted gradually by the OS kernel,
rather than requiring them all to be supported in the first generation
VA64-capable OS kernel. For example, a VA64-capable OS kernel could
start by mapping all pages to current sizes (e.g., 4 KB/2 MB/1 GB in
Intel Corporation's IA-32e paging system), but changing to a new page
table format. After the change in page table format, the OS kernel
could be modified to map virtual memory in 64 KB units and change to
store 64 KB pages in its free list. Then the OS kernel could start
using 64 KB pages whenever alignment and protections permit, and add
support for other VA64 capabilities.
I don't know anything else about the VA64 paging system other than what's written in the patent. There is nothing on it anywhere on the Internet. I guess we'll find out more soon.
Selected References
Denning, P.J. (1970). Virtual memory. ACM Computing Surveys Volume 2 Issue 3, 153-189.
Gelenbe, E., Tiberio, P., & Boekhorst, J. C. A. (1973). Page size in demand-paging systems. Acta Informatica, 3(1), 1-23.
Alanko, T. O., & Verkamo, A. I. (1983). Segmentation, paging and optimal page sizes in virtual memory. Performance Evaluation, 3(1), 13-33.
Corbató, F. J., & Vyssotsky, V. A. (1965). Introduction and overview of the Multics system. In Proceedings of the November 30--December 1, 1965, fall joint computer conference, part I (pp. 185-196).
Footnotes
(1) Actually the size of a single virtual page can be a multiple of the size of a page frame, but let's not go there.
(2) We can generalize the formulation and use the term "word" to refer to the smallest addressable unit of memory rather than byte, but that is not important here.
(3) Maybe not programmers, depending on the programming language, but compilers, linkers, assemblers, and tools that work with binary code.
(4) It's certainly possible to design a system that supports using paging and nonpaging at the same time, thereby potentially supporting multiple units of allocation, but let's not go there.
The design of the 4 KB normal page size on a 32-bit architecture is actually very interesting :)
And I'd like to add an extra answer to demonstrate why it is reasonable.
x86 uses a '2 level page table' to translate virtual memory addresses into physical memory addresses.
So assume that both page directory and page table contain 2^x entries, and the page size is 2^y bytes. To make full use of the 2^32 addresses, we have:
2^x × 2^x × 2^y = 2^32 ⇒ 2x + y = 32
Each entry in the page directory/table consumes 4 bytes (32 bits), therefore:
2^y / 4 = 2^x ⇒ y - 2 = x
Thus y = 12, and the page size in bytes will be 2^y = 2^12 = 4 KB.
And what about '1 level page table'? This is interesting because logically we can use a single page table for address lookup.
Assume that the page table contains 2^x entries, each mapping an address to the corresponding page, and the page size is 2^y bytes.
Again, to make full use of the 2^32 addresses, we need:
2^x × 2^y = 2^32 ⇒ x + y = 32
And:
2^y / 4 = 2^x ⇒ y - 2 = x
We get y = 17, and the page size is 2^y = 2^17 = 128 KB.
We might also argue that, in the '2 level page table' version, the page directory and the page table might have different sizes. However, this would mean using a larger page directory, which would occupy more than one memory page. Sadly, every time a new user process is spawned, the OS would have to allocate contiguous pages for its page directory, which is not elegant by design.
That depends on the processor architecture.
The default page size is 4 KB on many architectures. It can generally be increased (sometimes a lot, such as AMD64's 1 GB) by switching to huge page mode. This allows the page table to be smaller, which can result in performance improvements.
I got this from the wikipedia article and I quote:
Page size is usually determined by processor architecture
Check out the article below:
http://en.wikipedia.org/wiki/Page_(computer_memory)
I add this answer/comment because the calculation by sleepsort is very interesting and I have to add it to my webpage (with a mention of the source, of course).
A (possible) answer to the question of how you can calculate the page size programmatically can be found here.
The way it is calculated, as mentioned by sleepsort, is very interesting. I did the same for x86_64 and the result wasn't what I expected.
However, reading further on memory management for x86_64, I found that for 64-bit Intel, 16 bits are not used, leaving 48 bits for us to calculate.
9 bits for the top level of the memory tree,
another 9 bits for the next level,
another 9 bits for the level below that,
and finally 9 bits for the entries of the last level.
So 48 - 36 = 12 bits for the offset within a memory page.
2^12 = 4096, as mentioned before by sleepsort.
I just wonder how many levels this architecture has, 3 or 4 (if anyone can explain that).
Following sleepsort's calculation:
2^x × 2^x × 2^x × 2^x × 2^y = 2^48 ⇒ 4x + y = 48
This time we have 8 bytes for each entry (8 bytes × 8 bits/byte = 64 bits):
2^y / 2^3 = 2^x ⇒ y - 3 = x
Filled into the previous equation:
4(y - 3) + y = 48
4y - 12 + y = 48
5y = 60
y = 12
Because 2^y represents the page size, the size = 2^12 = 4096 bytes.
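To make the 9 + 9 + 9 + 9 + 12 split concrete (those four 9-bit indexes correspond to the four translation levels of 48-bit x86_64 paging), here is a small illustrative sketch with an arbitrary address:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t va = 0x00007F3A12345678ULL;   /* arbitrary sample address */

    unsigned pml4 = (va >> 39) & 0x1FF;    /* level 4 table index */
    unsigned pdpt = (va >> 30) & 0x1FF;    /* level 3 table index */
    unsigned pd   = (va >> 21) & 0x1FF;    /* level 2 table index */
    unsigned pt   = (va >> 12) & 0x1FF;    /* level 1 table index */
    unsigned off  =  va        & 0xFFF;    /* 12-bit page offset  */

    printf("PML4 %u, PDPT %u, PD %u, PT %u, offset 0x%03x\n",
           pml4, pdpt, pd, pt, off);
    return 0;
}
```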
That leaves me with the question: "what about those huge pages in Linux"?

What are the exact conditions based on which Linux swaps process(s) memory from RAM to a swap file?

My server has 8 GB of RAM and 8 GB configured for the swap file. I have memory-intensive apps running. These apps have peak loads during which we see swap usage increase. Approximately 1 GB of swap is used.
I have another server with 4 GB of RAM and 8 GB of swap, with similar memory-intensive apps running on it. But here swap usage is negligible. Around 100 MB.
I was wondering what the exact conditions are, or a rough formula, based on which Linux swaps out a process's memory from RAM to the swap file.
I know it's based on the swappiness factor. What else is it based on? The swap file size? Any pointers to Linux kernel documentation/source code explaining this would be great.
I've seen a lot of people posting subjective explanations of what this does. Here is hopefully a more complete answer.
In the split LRU on post-2.6.28 Linux, swappiness is a multiplier used to arbitrarily modify the fraction that is calculated to determine the pressure built up in both LRUs.
So, for example, on a system with no free memory left, the value of the existing memory you have is measured based on the rate at which memory is listed as 'Active' and the rate at which pages are promoted to active after falling into the inactive list.
An LRU with many promotions/demotions of pages between active and inactive is in a lot of use.
Typically, file-backed storage is cheaper and safer to evict when you're running out of memory, and it is automatically given a modifier of 200 (this makes file-backed memory 200 times more 'worthless' than swap-backed memory, which has a value of 0, when it multiplies this fraction).
What swappiness does is modify this value by deducting the swappiness number you gave (default 60) from file memory and adding the swappiness value you gave to anon memory. Thus the default swappiness leaves you with anonymous memory being 80 points more valuable than file memory (200 - 60 = 140 for file, 0 + 60 = 60 for anon). Thus, on a typical Linux system that has used up all its memory, page cache would have to be 80 times more active than anonymous memory for anonymous memory to be swapped out in favour of page cache.
If you set swappiness to 100, this gives anon a modifier of 100 and file memory a modifier of 100 (200 - 100), leaving both LRUs equally weighted. Thus, on a file-heavy system that wants page cache, provided the anon memory is not as active as the page cache, anon memory will be swapped to disk to make space for extra page cache.
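A tiny sketch of the weighting described above, mirroring this answer's description (200 - swappiness for file-backed pages, swappiness for anonymous pages) rather than quoting the kernel source:

```c
#include <stdio.h>

static void show(int swappiness) {
    int file_weight = 200 - swappiness;  /* file-backed LRU */
    int anon_weight = swappiness;        /* anonymous (swap-backed) LRU */
    printf("swappiness %3d: file weight %3d, anon weight %3d\n",
           swappiness, file_weight, anon_weight);
}

int main(void) {
    show(0);    /* strongly prefer evicting page cache over swapping */
    show(60);   /* default */
    show(100);  /* both LRUs weighted equally */
    return 0;
}
```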
Linux (or any other OS) divides memory up into pages (typically 4 KB). Each of these pages represents a chunk of memory. Usage information for these pages is maintained, which basically contains info about whether the page is free or in use (part of some process), whether it has been accessed recently, what kind of data it contains (process data, executable code etc.), the owner of the page, and so on. These pages can also be broadly divided into two categories: filesystem pages, or the page cache (in which all data read from/written to your filesystem resides), and pages belonging to processes.
When the system is running low on memory, the kernel starts swapping out pages based on their usage. Using a list of pages sorted with respect to recency of access is common for determining which pages can be swapped out (the Linux kernel has such a list too).
During swapping, the Linux kernel needs to decide what to trade off when evicting pages from memory and sending them to swap. If it swaps filesystem pages too aggressively, more reads are required from the filesystem to read those pages back when they are needed. However, if it swaps out process pages more aggressively, it can hurt interactivity, because when the user tries to use the swapped-out processes, they will have to be read back from the disk. See a nice discussion of this here.
By setting swappiness = 0, you are telling the Linux kernel not to swap out pages belonging to processes. When setting swappiness = 100 instead, you tell the kernel to swap out pages belonging to processes more aggressively. To tune your system, try changing the swappiness parameter in steps of 10, monitoring performance and the pages being swapped in/out at each setting using the "vmstat" command. Keep the setting that gives you the best results. Remember to do this testing during peak usage hours. :)
For database applications, swappiness = 0 is generally recommended. (Even then, test different settings on your systems to arrive at a good value.)
References:
http://www.linuxvox.com/2009/10/what-is-the-linux-kernel-parameter-vm-swappiness/
http://www.pythian.com/news/1913/

Is an entire process’s virtual address space split into pages

Note: I'm attempting to study a high-level overview of Virtual Memory allocation
Is an entire process's virtual address space split into pages of a particular size:
.text
.bss
.data
Does this also include heap space and stack - or is this always non-pageable?
First note that "pages" are simply regions of an address space. A region that is "non-pageable" (by which I assume you mean it cannot be swapped to disk) is still logically divided into pages, but the OS might implement a different policy on those pages.
The most common page size is 4096 bytes. Many architectures support use of multiple page sizes at the same time (e.g. 4K pages as well as 1MB pages). However, operating systems often stick with just one page size, since under most circumstances, the costs of managing multiple page sizes are much higher than the benefits this provides. Exceptions exist but I don't think you need worry about them.
Every virtual page has certain permissions attached to it, like whether it's readable, writeable, executable (varies depending on hardware support). The OS can use this to help enforce security, cache coherency (for shared memory), and swapping pages out of physical memory.
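As a small illustration of how those per-page permissions surface to a program, here is a sketch that maps one page, writes to it, and then makes it read-only with mprotect() (protection changes apply at page granularity, which is exactly the point above):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    long ps = sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, ps, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello");                     /* fine: the page is writable */
    if (mprotect(p, ps, PROT_READ) != 0) {  /* drop the write permission  */
        perror("mprotect");
        return 1;
    }
    printf("page at %p is now read-only: %s\n", (void *)p, p);
    /* p[0] = 'x';  <- this would now fault (SIGSEGV) */
    munmap(p, ps);
    return 0;
}
```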
The .text, .bss and .data regions need not be known to the OS (though most OSes do know about them, for security and performance reasons).
The OS may not actually allocate memory for a stack/heap page until the first time that page is accessed. The OS may provide system calls to request more pages of heap/stack space. Some OSes provide shared memory or shared library functionality which leads to more regions appearing in the address space. Depends on the OS.
Typically, on a paged operating system, a process's entire address space is split into pages. Each linear address contains two components: a page number in the most significant bits, and an offset within the page in the least significant bits.
For example, with 32 bit linear addresses and 4kB pages, the upper 20 bits are a page number and the lower 12 bits are a page offset.
.data is where the program's initialized global variables lie. .bss contains the globals without an explicit initializer (with a default value of 0). The heap and stack are separate memory zones from these and from each other. All the memory seen by a process is virtual memory, split into pages. A process doesn't see anything other than virtual memory.
