How does the Linux kernel decide which memory zone to use? - linux

When I check pagetypeinfo
cat /proc/pagetypeinfo
I see three types of memory zones:
DMA
DMA32
Normal
How does Linux choose a memory zone when allocating a new page?

This zone layout (in particular ZONE_HIGHMEM) applies to 32-bit systems; it is not needed on 64-bit.
Remember, we are talking about the main memory as the kernel sees it. On a 32-bit (4 GB) system, the virtual address split between the kernel and user space is 1:3, meaning the kernel gets 1 GB and user space gets 3 GB. The kernel's 1 GB is split as follows:
ZONE_DMA (0-16 MB): permanently mapped into the kernel address space.
Exists for compatibility with older ISA devices that can only address the lower 16 MB of main memory.
ZONE_NORMAL (16 MB-896 MB): permanently mapped into the kernel address space.
Many kernel operations can only take place using ZONE_NORMAL, so it is the most performance-critical zone and is where most kernel allocations come from.
ZONE_HIGHMEM (above 896 MB): not permanently mapped into the kernel's address space.
The kernel can reach the entire 4 GB of main memory: its own 1 GB through ZONE_DMA and ZONE_NORMAL, and the rest through ZONE_HIGHMEM. With Intel's Physical Address Extension (PAE), one gets 4 extra physical address bits, resulting in 36 bits, so a total of 64 GB of memory can be addressed. That extra physical address space beyond what 32 bits can cover is also reached through ZONE_HIGHMEM mappings.
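To keep the numbers straight, here is a small C sketch (my own illustration, using the usual default boundaries; the exact limits depend on the kernel configuration) that just prints the zone ranges described above:

#include <stdio.h>

struct zone_desc { const char *name; unsigned long start; unsigned long end; };

int main(void) {
    const struct zone_desc zones[] = {
        { "ZONE_DMA",     0UL,          16UL << 20 },    /* 0 - 16 MB           */
        { "ZONE_NORMAL",  16UL << 20,   896UL << 20 },   /* 16 MB - 896 MB      */
        { "ZONE_HIGHMEM", 896UL << 20,  0xFFFFFFFFUL },  /* 896 MB - top of RAM */
    };
    for (unsigned i = 0; i < sizeof zones / sizeof zones[0]; i++)
        printf("%-13s 0x%08lx - 0x%08lx\n",
               zones[i].name, zones[i].start, zones[i].end);
    return 0;
}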
Read more:
http://www.quora.com/Linux-Kernel/Why-is-there-ZONE_HIGHMEM-in-the-x86-32-Linux-kernel-but-not-in-the-x86-64-kernel
http://www.quora.com/Linux-Kernel/What-is-the-difference-between-high-memory-and-normal-memory
Linux 3/1 virtual address split

For every memory allocation request (e.g. via kmalloc()), the kernel selects the memory zone based on the flags passed to the function. These requests internally end up in the kernel function alloc_pages().
zonelist is an argument that gets passed to alloc_pages(); it points to a zonelist data structure describing, in order of preference, the memory zones suitable for the memory allocation.
See the memory management chapter in the book Understanding the Linux Kernel.
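As a minimal, illustrative kernel-module sketch (not taken from the book; the module boilerplate and names are mine), this shows how the GFP flags passed to kmalloc()/alloc_pages() are what steer the allocator toward a particular zone via the zonelist:

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/gfp.h>

static int __init zone_demo_init(void)
{
    /* ZONE_DMA: for old ISA-style devices that can only address the low 16 MB */
    void *dma_buf = kmalloc(512, GFP_DMA | GFP_KERNEL);

    /* ZONE_NORMAL (on 32-bit): the common case for kernel allocations */
    void *normal_buf = kmalloc(4096, GFP_KERNEL);

    /* On 32-bit, GFP_HIGHUSER allows pages from ZONE_HIGHMEM; the kernel
       must temporarily map such pages before touching their contents. */
    struct page *high_page = alloc_pages(GFP_HIGHUSER, 0);

    kfree(dma_buf);
    kfree(normal_buf);
    if (high_page)
        __free_pages(high_page, 0);
    return 0;
}

static void __exit zone_demo_exit(void) { }

module_init(zone_demo_init);
module_exit(zone_demo_exit);
MODULE_LICENSE("GPL");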

Related

How much memory does a 64bit Linux Kernel take up?

The address space is huge on x86-64, even though 48-bit addresses are mainly used.
On x86 32-bit machines it was pretty clear how much RAM the kernel took up. Generally around 1 GB of ZONE_NORMAL sat at the bottom of memory, while everything above that 1 GB in PHYSICAL (not virtual) addresses was ZONE_HIGHMEM (for user space). This would be a 3:1 split. Of course we can have configurations with a 1:3, 2:2, etc. split (by changing VMSPLIT).
How much of the RAM is used for kernel space on 64-bit kernels?
I know PAGE_OFFSET is set to a value far above physically addressable memory on x86-64 (for both 48- and 57-bit virtual addresses). PAGE_OFFSET on x86-64 just describes the split in virtual address space, not physical (the 48-bit PAGE_OFFSET is 0xffff888000000000).
So does 1 GB of memory house kernel space? 2 GB? 3? Are there variables or macros that describe the size? Is it calculated?
Each user-space process can use its own 2^47 bytes (128 TiB) of virtual address space. Or more on a system with PML5 support.
The available physical RAM to back those pages is the total size of physical RAM, minus maybe 30 MiB or so that the kernel needs for its own code/data. (Not including the pagecache: Linux will use any spare pages as buffers and disk cache). This is mostly unrelated to virtual address-space limits.
1G is how much virtual address space a kernel used up. Not how much physical RAM.
The address-space question mattered for how much memory a single process could use at the same time, but the kernel can still use all your RAM for caching file data, etc. Unless you're finding the 2^(48-1) or 2^(57-1) bytes of the low half virtual address-space range cramped, there's no equivalent problem.
See the kernel's Documentation/x86/x86-64/mm.txt for the x86-64 virtual memory map. Also Why 4-level paging can only cover 64 TiB of physical address re: x86-64 Linux not doing inconvenient HIGHMEM stuff - the entire high half of virtual address space is reserved for the kernel, and it maps all the RAM because it's a kernel.
Virtual address space usage does indirectly set a 64 TiB limit on how much physical RAM the kernel can use, but if you have less than that there's no effect. Just like how a 32-bit kernel wasn't a problem if your machine had less than 1 or 2 GiB of RAM.
The amount of physical RAM actually reserved by the kernel depends on build options and modules, but might be something like 16 to 32 MiB.
Check dmesg output and look for something like this kernel log line (this one is from an x86-64 5.16.3-arch1 kernel's boot log):
Memory: 32538176K/33352340K available (14344K kernel code, 2040K rwdata, 8996K rodata, 1652K init, 4336K bss, 813904K reserved, 0K cma-reserved)
Don't count the init (freed after boot) or reserved parts; I'm pretty sure Linux doesn't actually reserve ~800 MiB in a way that makes it unusable for anything else.
Also look for the later Freeing unused decrypted memory: 2036K / Freeing unused kernel image (initmem) memory: 1652K etc. (That's the same size as the init part listed earlier, which is why you don't have to count it.)
It might also dynamically allocate some memory during startup; that initial "memory" line is just the sum of its .text, .data, and .bss sections, static code+data sizes.
On 64-bit systems, the only limitation is how much physical memory the kernel can use. The kernel will map all the available RAM, and user-space applications should be able to get access to as much as the kernel can provide while keeping enough for the kernel itself to operate.

Virtual Address Space

I have started to learn about Virtual Address Space (VAS) and I have few questions:
How much of VAS is created for each process depending on the architecture (32-bit and 64-bit)?
Is VAS for each process created on hard disk? If so, what happens if there is not enough space?
What is the difference between VAS and Virtual Memory (VM)?
Virtual address versus physical address
During the execution of your program, the variables (integers, arrays, strings, etc.) are stored somewhere in the main memory of your computer (RAM). Some programming languages (like C or C++) allow you to obtain the memory address at which a given variable is stored (with the & operator), and to manipulate that address (add to it, subtract from it, print it, etc.).
Here is a C program that prints the memory address of a variable:
#include <stdio.h>

int main(void) {
    int variable = 1234;
    void *address = &variable;
    printf("Memory address of variable: %p\n", address);
    return 0;
}
Output:
Memory address of variable: 0x7ffc9e9662a4
Now, if you compile and execute this program on a typical desktop computer, with a typical operating system (like GNU/Linux or Windows), the memory address that is printed by this program is not the hardware address at which the data 1234 is actually located in the memory chip. This may be surprising, but there is a level of indirection between the addresses used by your program and the hardware addresses.
Virtual address space on 64-bit computers
On a 64-bit computer, a memory address manipulated by your program is an integer between 0 and 18446744073709551615 inclusive. Such an address is called a virtual memory address. The range of those addresses is called the virtual address space of the process. You can ask the operating system to map a range of virtual memory addresses to the physical memory of your computer, so that when you try to read or write bytes at those addresses, your program doesn't crash for accessing unmapped virtual memory addresses.
Typically, on x86-64 computers, only 2^48 virtual memory addresses can be successfully mapped to physical memory, because 256 TiB of usable virtual address space is considered sufficient. In the future, processor manufacturers may raise or remove this limit if there is a need for it.
Virtual address space on 32-bit computers
On 32-bit computers, there are 2^32 virtual memory addresses. On those computers, a memory address manipulated by your program is an integer between 0 and 4294967295 inclusive.
On x86 32-bit computers, there is usually no restriction on the range of virtual memory addresses that can be mapped to physical memory addresses.
Mapping a range of virtual memory addresses
On GNU/Linux, you can request a mapping by calling the function mmap(). On Windows, you can request a mapping by calling the function VirtualAlloc(). Those functions take the size of the mapping as argument, and return the first virtual address that is now backed by actual physical memory. Those functions can fail to create a new mapping if the physical memory is already completely used by other processes. And again, if you try to access (read or write) the content of a virtual memory address that is outside an area mapped by mmap() or VirtualAlloc(), the operating system will terminate your program (by sending a segmentation fault signal).
On GNU/Linux, a process can examine the mappings created in its virtual address space just by reading the file /proc/self/maps. You can learn a lot by reading the output of the command cat /proc/self/maps.
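As a small illustration (a sketch of my own, assuming GNU/Linux and a C compiler), the following program creates one anonymous mapping with mmap() and then dumps /proc/self/maps, where the new region shows up:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 4096;
    /* Ask the OS to back one page of virtual addresses with memory. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("new mapping starts at %p\n", p);
    memset(p, 0, len);              /* touching these addresses is now legal */

    /* Print this process's mappings; the address above appears in the list. */
    FILE *f = fopen("/proc/self/maps", "r");
    if (f) {
        char line[512];
        while (fgets(line, sizeof line, f))
            fputs(line, stdout);
        fclose(f);
    }
    munmap(p, len);
    return 0;
}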
Hard disk drive
On a typical computer, the main memory is a semiconductor memory, and the hard disk drive is only a secondary storage device.
On a typical operating system, a range of virtual memory addresses can only be mapped to the main memory (which is usually a semiconductor memory device). Such a range cannot be directly mapped to a secondary storage device (usually a hard disk drive) without using the main memory as intermediary.
On an n-bit machine, the VAS is 2^n bytes large. So, on a 32-bit machine, the VAS is 2^32 bytes = 4 GiB large.
Virtual memory is not created on disk. In fact, the existence of a disk is not needed for implementing virtual memory. Most implementations of virtual memory are paged. So, when a 4 GiB VAS is created, only the pages that are needed are mapped into that VAS. For example, suppose a process only uses 16 pages of memory on a 32-bit system with 4 KiB pages. Despite having a 4 GiB VAS, only 16 * 4 KiB = 2^16 bytes (64 KiB) of memory are mapped into the VAS. The rest of the memory is unmapped. If the CPU tries to access this unmapped memory, a segmentation fault will occur. If a process wants to map memory at such an address, then (in a POSIX-compliant OS) it can request the mapping from the OS using mmap(2). This will make a lot more sense once you learn about page tables.
Virtual memory is a concept. A virtual address space is an entity that stems from the concept of virtual memory. These terms go hand in hand, but refer to different things.
I will list a couple of caveats.
Caveat 1.1
I am not aware of any 64-bit processor that truly supports a 64-bit VAS. The addresses themselves are 64 bits wide, but a certain number of upper bits are not really usable. AMD's first implementation of x86_64 only supported 48-bit addresses: the upper 16 bits of an address must just repeat bit 47 (the address must be "canonical"), so the real size of the VAS is limited to 2^48 bytes. Subsequent architectures added support for 57-bit addresses (5-level paging).
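As a small aside (my own illustration, not part of the original answer), this is what the canonical-address rule looks like for a 48-bit implementation; the upper bits must repeat bit 47 rather than being freely usable:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* An address is canonical on a 48-bit implementation if bits 63..47
   are all zeros or all ones. */
static bool is_canonical_48(uint64_t addr) {
    uint64_t top = addr >> 47;           /* bits 63..47 */
    return top == 0 || top == 0x1ffff;   /* all zeros or all ones */
}

int main(void) {
    printf("%d\n", is_canonical_48(0x00007fffffffe000ULL)); /* 1: top of the user half   */
    printf("%d\n", is_canonical_48(0xffff888000000000ULL)); /* 1: kernel half            */
    printf("%d\n", is_canonical_48(0x0000800000000000ULL)); /* 0: non-canonical "hole"   */
    return 0;
}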
Caveat 1.2
If a processor supports PAE, then the machine can address more than 2^n bytes of physical memory, even though each process's VAS is still 2^n bytes. This is how 32-bit processors can support more than 4 GiB of RAM.
Caveat 2.1
Not really a caveat, but this is related to your question. You asked what happens when there isn't enough space on disk to create a VAS. As I mentioned in the main answer, the VAS is not created on disk. However, any computer only has a finite amount of physical memory. What happens when a process requests that a page be mapped, but there is no physical memory available? There are several ways to handle this:
Swapping is done by temporarily moving a page that is mapped in virtual memory to disk. The entire contents of the page are copied to disk. Then, the process that requested the page has the physical page mapped into its memory. Eventually, the old page may be requested. If this occurs, then the OS copies the page back from disk and remaps it into the corresponding VAS. This is what Linux and most modern operating systems do.
The process is simply told there is no memory available, for example through an error number like ENOMEM (see the sketch after this list).
The process is blocked until memory is available. I haven't seen this in practice.
Swapping implies the use of a disk, but virtual memory does not imply the use of swapping, hence a disk is not necessary for virtual memory.
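As a rough sketch of the second case (my illustration, assuming a 64-bit GNU/Linux system; the exact behaviour depends on the overcommit policy), an oversized anonymous mapping request can come back with ENOMEM:

#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t huge = (size_t)1 << 46;   /* 64 TiB, far beyond any real machine */
    void *p = mmap(NULL, huge, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED && errno == ENOMEM)
        puts("mmap failed with ENOMEM: no memory (or address space) available");
    else if (p != MAP_FAILED)
        puts("mapping succeeded (overcommit allowed it; pages are not yet backed)");
    return 0;
}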
Virtual Address Space - wikipedia
When a new application on a 32-bit OS is executed, the process has a 4 GiB VAS: each one of the memory addresses (from 0 to 2^32 − 1) in that space can have a single byte as a value. Initially, none of them have values.
For an n-bit OS, these n address lines allow an address space of up to 2^n addresses, i.e., 0 to 2^n − 1. This would mean 16 EiB for a 64-bit OS. (Though in actual implementations, less space is used, as this much space is unnecessary.)
CPU Cache - wikipedia
Most general purpose CPUs implement some form of virtual memory. To summarize, either each program running on the machine sees its own simplified address space, which contains code and data for that program only, or all programs run in a common virtual address space. A program executes by calculating, comparing, reading and writing to addresses of its virtual address space, rather than addresses of physical address space, making programs simpler and thus easier to write.
For example, in C++, program memory is divided into stack, heap, data, and code. I'm not sure if the analogy is exact (it may be), but it gives some insight if you're familiar with that.
Virtual memory - wikipedia
In computing, virtual memory is a memory management technique that provides an "idealized abstraction of the storage resources that are actually available on a given machine"[3] which "creates the illusion to users of a very large (main) memory".[4]
The computer's operating system, using a combination of hardware and software, maps memory addresses used by a program, called virtual addresses, into physical addresses in computer memory. Main storage, as seen by a process or task, appears as a contiguous address space or collection of contiguous segments. The operating system manages virtual address spaces and the assignment of real memory to virtual memory.
Address translation hardware in the CPU, often referred to as a memory management unit (MMU), automatically translates virtual addresses to physical addresses. Software within the operating system may extend these capabilities to provide a virtual address space that can exceed the capacity of real memory and thus reference more memory than is physically present in the computer.
If you know about computer architecture (which I'm sure you do from the question), it'd be clarified by now.
Still, for anyone in general, I'm giving a bit of explanation.
Assume addresses are like pointers in C++. If you don't know C++, the closest analogy would be array/list indices in any language. The addresses point to memory locations, just like pointers point to variables. The actual data is stored in the variable. To get the variable's data using a pointer/index, you provide the address from which the data is to be extracted. Now, in physical memory there is no such thing as a variable. There is memory, and the location address through which it is accessed.
The real memory is physical memory, i.e. the RAM. It is accessed with physical addresses, which are unique for each byte.
Accessing physical memory directly with physical addresses would be cumbersome. Thus the addresses are simplified by the OS into virtual addresses. These need not be globally unique (they aren't physical addresses, remember); virtual addresses from different processes may point to the same physical location.
Virtual memory does not physically exist; it is just the concept of physical memory presented through virtual addresses, giving the user the illusion of a space where the next memory location sits at the next (virtual) address.
Since multiple virtual addresses can be mapped, by using the MMU, to the same physical address, and thus point to the same physical memory location, the virtual memory size can be made to exceed the physical memory size (virtually). But effectively, the amount of memory is still the same as the physical memory.
Thus, to access data in memory, virtual addresses are specified by the user/program to the OS, converted to physical addresses by the memory management unit (MMU), and then applied to the address lines of the computer architecture (electronics spotted!!), which yields the data at the corresponding physical location. And this concept is called virtual memory.
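If it helps, here is a toy, single-level "page table" in C (purely illustrative and of my own making; real x86-64 hardware uses 4 or 5 levels of tables) showing the kind of translation the MMU performs:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   4096u
#define NUM_PAGES   8

/* page_table[vpn] = physical frame number, or -1 if the page is unmapped */
static int page_table[NUM_PAGES] = { 3, 7, -1, 0, -1, -1, 5, -1 };

static int translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr / PAGE_SIZE;     /* virtual page number */
    uint32_t offset = vaddr % PAGE_SIZE;     /* offset within the page */
    if (vpn >= NUM_PAGES || page_table[vpn] < 0)
        return -1;                           /* would be a page fault / segfault */
    *paddr = (uint32_t)page_table[vpn] * PAGE_SIZE + offset;
    return 0;
}

int main(void) {
    uint32_t p;
    if (translate(0x1234, &p) == 0)          /* vpn 1 maps to frame 7 */
        printf("virtual 0x1234 -> physical 0x%x\n", p);
    if (translate(0x2234, &p) != 0)          /* vpn 2 is unmapped */
        puts("virtual 0x2234 -> fault (unmapped)");
    return 0;
}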
-Himanshu

Linux x86: Where is the real mode address space mapped to in protected kernel mode?

In Linux running on an x86 platform, where is the real mode address space mapped to in protected kernel mode? In kernel mode, a thread can access the kernel address space directly. The kernel is in the lower 8 MB, the page table is at a certain position, etc. (as described here). But where does the real mode address space go? Can it be accessed directly? For example, the BIOS and BIOS add-ons (see here)?
(My x86-fu is a bit weak. I'll add some tags so that other people can (hopefully) correct me if I'm lying anywhere.)
Physical addresses are the same in real and protected mode. The only difference is in how you get from an address (offset) specified in an instruction to a physical address:
In real mode, the physical address is basically (segment_reg << 4) + offset.
In protected mode, the physical address is translate_via_page_table([segment_reg] + offset).
By [segment_reg] I mean the base address of the segment, looked up in the Global or Local Descriptor Table at the offset in segment_reg. translate_via_page_table() means the address translation done via paging (if enabled).
Looking here, it seems the BIOS ROM appears at physical addresses 0x000F0000-0x000FFFFF. To get at that memory in protected mode with paging, you would have to map it into the virtual address space somewhere by setting up correct page table entries. Assuming 4 KB pages (the usual case), mapping the entire range should require 16 ((0xFFFFF-0xF0000+1)/4096) entries.
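A small sketch of that arithmetic (my own illustration) in C:

#include <stdint.h>
#include <stdio.h>

/* Real mode: physical address = (segment_reg << 4) + offset */
static uint32_t real_mode_phys(uint16_t seg, uint16_t off) {
    return ((uint32_t)seg << 4) + off;
}

int main(void) {
    /* F000:FFF0 is the classic reset vector inside the BIOS ROM */
    printf("F000:FFF0 -> 0x%05x\n", real_mode_phys(0xF000, 0xFFF0)); /* 0xFFFF0 */

    /* How many 4 KiB pages cover the BIOS ROM range 0xF0000-0xFFFFF? */
    uint32_t start = 0x000F0000, end = 0x000FFFFF;
    printf("pages to map BIOS ROM: %u\n", (unsigned)((end - start + 1) / 4096)); /* 16 */
    return 0;
}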
To see how the Linux kernel does things, you could look into how e.g. /dev/mem, which allows reading of arbitrary physical addresses, is implemented. The implementation is in drivers/char/mem.c.
The following command (from e.g. this answer) will dump the memory range 0xC0000-0xFFFFF (meaning it includes the video BIOS too, per the memory map linked above):
$ dd if=/dev/mem bs=1k skip=768 count=256 > bios
1024*768 = 0xC0000, and 1024*(768+256) - 1 = 0xFFFFF, which gives the expected physical memory range.
Tracing things a bit, read_mem() in drivers/char/mem.c calls xlate_dev_mem_ptr(), which has an x86-specific implementation in arch/x86/mm/ioremap.c. The ioremap_cache() call in that function seems to be responsible for mapping in the page if needed.
Note that BIOS routines won't work in protected mode by the way. They assume the CPU is running in real mode.
For Linux x86 32 bits, the first 896MB of physical RAM is mapped to a contiguous block of virtual memory starting at virtual address 0xC0000000 to 0xF7FFFFFF. Virtual addresses from 0xF8000000 to 0xFFFFFFFF are assigned dynamically to various parts of the physical memory, so the kernel can have a window of 128MB mapped into any part of physical memory beyond the 896MB limit.
The kernel itself loads at physical address 1 MB and up, leaving the first MB free. This first MB is used, for instance, for DMA buffers that ISA devices need to be there, because they use the 8237 DMA controller, which can only be pointed at such addresses.
So, reading from virtual memory address 0xC0000000 is actually reading from physical address 0x00000000 (provided the kernel has flagged that page as present)
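A sketch of that fixed lowmem mapping as plain arithmetic (my illustration; the constants are the usual 3G/1G defaults, and the helper names only mirror what the kernel's __va()/__pa() do for lowmem):

#include <stdint.h>
#include <stdio.h>

#define PAGE_OFFSET 0xC0000000u
#define LOWMEM_END  0x38000000u      /* 896 MB of directly mapped RAM */

static uint32_t phys_to_virt(uint32_t phys) { return phys + PAGE_OFFSET; }
static uint32_t virt_to_phys(uint32_t virt) { return virt - PAGE_OFFSET; }

int main(void) {
    printf("phys 0x00000000 -> virt 0x%08x\n", phys_to_virt(0x00000000));
    printf("phys 0x00100000 -> virt 0x%08x (kernel loads at 1 MB)\n",
           phys_to_virt(0x00100000));
    printf("virt 0xC0000000 -> phys 0x%08x\n", virt_to_phys(0xC0000000));
    /* Physical addresses at or above LOWMEM_END are not in this fixed map;
       the kernel reaches them through the dynamic 128 MB window (highmem). */
    return 0;
}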

Linux 3/1 virtual address split

I am missing something when it comes to understanding the need for highmem to address more than 1GB of RAM. Could someone point out where I go wrong? Thanks!
What I know:
1 GB of a process's virtual memory (the high memory region) is reserved for kernel operations. The user space can use the remaining 3 GB. This is the 3/1 split.
The virtual memory features of the VM map the (continuous) virtual memory pages to physical pages (RAM).
What I don't know:
What operations use the kernel virtual memory? I suppose things like kmalloc(...) in kernel-space would use kernel virtual memory.
I would think that 4GB of RAM could be used under this scheme. I don't get why the kernel 1 GB virtual space is the limiting factor when addressing physical space. This is where my understanding breaks down. Please advise.
I've been reading this (http://kerneltrap.org/node/2450), which is great. But it doesn't quite address my question to my liking.
The reason that kernel virtual space is a limiting factor on useable physical memory is because the kernel needs access to all physical memory, and the way it accesses physical memory is through kernel virtual addresses. The kernel doesn't use special instructions that allow direct access to physical memory locations - it has to set up page table entries for any physical ranges that it wants to talk to.
In the "old style" scheme, the kernel set things up so that every process's page tables mapped virtual addresses from 0xC0000000 to 0xFFFFFFFF directly to physical addresses from 0x00000000 to 0x3FFFFFFF (these pages were marked so that they were only accessible in ring 0 - kernel mode). These are the "kernel virtual addresses". Under this scheme, the kernel could directly read and write any physical memory location without having to fiddle with the MMU to change the mappings.
Under the HIGHMEM scheme, the mappings from kernel virtual addresses to physical addresses aren't fixed - parts of physical memory are mapped in and out of the kernel virtual address space as the kernel needs access to that memory. This allows more physical memory to be used, but at the cost of having to constantly change the virtual-to-physical mappings, which is quite an expensive operation.
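To illustrate that cost (this is my own sketch, not actual kernel code, using the classic kmap()/kunmap() interface; newer kernels prefer kmap_local_page()): a page from ZONE_HIGHMEM has to be temporarily mapped before the kernel can touch it and unmapped afterwards, whereas lowmem pages never need this step.

#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/string.h>

static void highmem_demo(void)
{
    struct page *page = alloc_page(GFP_HIGHUSER);   /* may come from ZONE_HIGHMEM */
    if (!page)
        return;

    /* Create a temporary kernel virtual mapping for this page ... */
    void *vaddr = kmap(page);
    memset(vaddr, 0, PAGE_SIZE);
    /* ... and tear it down again; this map/unmap churn is the expensive
       part mentioned above. */
    kunmap(page);

    __free_page(page);
}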
Mapping 1 GB to kernel in each process allows processes to switch to kernel mode without also performing a context switch. Responses to system calls such as read(), mmap() and others can then be appropriately processed in the calling process' address space.
If space for the kernel were not reserved in each process, switching to "kernel mode" in between executing user space code would be more expensive, and be unable to use virtual address mapping through the hardware MMU (memory management unit) for the system calls being serviced.
Systems running a 32-bit kernel with more than 1 GB of physical memory are able to assign physical memory locations in ZONE_HIGHMEM (roughly above the 1 GB mark), which can require the kernel to jump through hoops for certain operations to interact with them. The addition of PAE (Physical Address Extension) extends this problem by allowing up to 64 GB of physical memory, decreasing the ratio of memory within the first 1 GB of physical addresses to regions allocated in ZONE_HIGHMEM.
For example, system calls use the kernel space.
You can have 64 GB of physical RAM, but on 32-bit platforms processors can access only 4 GB because of the 32-bit virtual addressing. Actually, you can have 1 GB of RAM and 3 GB of swap, and virtual addressing will make it look like you have 4 GB. On 64-bit platforms virtual addressing is practically unlimited.

why do we need zone_highmem on x86?

In the Linux kernel, mem_map is the array which holds all the struct page descriptors. Those pages include the 128 MiB of lowmem used for dynamically mapping highmem.
Since the lowmem size is 1 GiB, the mem_map array has only 1 GiB / 4 KiB = 256 Ki entries. If each entry is 32 bytes, then mem_map takes 8 MiB of memory. But even if we used mem_map to cover all 4 GiB of physical memory (if we had that much physical memory available on x86-32), the mem_map array would only occupy 32 MiB, which is not a lot of kernel memory (or am I wrong?).
So my question is: why do we need to use that 128 MiB in lowmem for indirect highmem mapping in the first place? Or, put another way, why not map all of that (at most 4 GiB of) physical memory into the kernel space directly?
Note: if my understanding of the kernel source above is wrong, please correct me. Thanks!
Look Here: http://www.xml.com/ldd/chapter/book/ch13.html
Kernel low memory is the 'real' memory map, addressed with 32-bit pointers on x86.
Kernel high memory is the 'virtual' memory map, addressed with virtual structures on x86.
You don't want to map it all into the kernel address space, because you can't always address all of it, and you need most of your memory for virtual memory segments (virtual, page-mapped process space.)
At least, that's how I read it. Wow, that's a complicated question you asked.
To throw more confusion, chapter 13 talks about some PCI devices not being able to address the 32-bit space, which was the genesis of my previous comment:
On x86, some kernel memory usage is limited to the first gigabyte of memory because of DMA addressing concerns. I'm not 100% familiar with the topic, but there's a compatibility mode for DMA on the PCI bus. That may be what you are looking at.
3.6 GB is not the ceiling when using physical address extension, which is commonly needed on most modern x86 boards, especially with memory hotplug.
Or, put another way, why not map all of that (at most 4 GiB of) physical memory into the kernel space directly?
One reason is userspace: every userspace process has its own virtual address space. Suppose you have 4 GB of RAM on x86. If we say the kernel owns 1 GB of memory (~800 MB directly mapped + ~200 MB vmalloc), all the other ~3 GB should be dynamically distributed between the processes spinning in user space. So how can you map your 4 GB directly when you have several address spaces?
why do we need zone_highmem on x86?
The reason is the same. The kernel reserves only ~800 MB for lowmem. All other memory will be allocated and connected with a particular virtual address only on demand. For example, if you execute a binary, a new virtual address space will be created and some pages will be allocated for storing your binary's code and data (heap, stack, ...). So the key purpose of highmem is to serve dynamic memory allocation requests; you never know in advance what will be triggered by userspace...
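A small userspace sketch of that on-demand behaviour (my own illustration, assuming a 64-bit GNU/Linux): the mapping is created instantly, but resident memory (VmRSS) only grows as the pages are actually touched.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void print_rss(const char *when) {
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    while (f && fgets(line, sizeof line, f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%s %s", when, line);
    if (f)
        fclose(f);
}

int main(void) {
    size_t len = 64u << 20;                          /* 64 MB */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    print_rss("after mmap:    ");                    /* almost unchanged */
    memset(p, 1, len);                               /* faults pages in on demand */
    print_rss("after touching:");                    /* roughly 64 MB larger */

    munmap(p, len);
    return 0;
}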
