Note: I'm trying to get a high-level overview of virtual memory allocation.
Is an entire process's virtual address space split into pages of a particular size:
.text
.bss
.data
Does this also include heap space and the stack - or are those always non-pageable?
First note that "pages" are simply regions of an address space. A region that is "non-pageable" (by which I assume you mean it cannot be swapped to disk) is still logically divided into pages, but the OS might implement a different policy on those pages.
The most common page size is 4096 bytes. Many architectures support the use of multiple page sizes at the same time (e.g. 4K pages as well as 1MB pages). However, operating systems often stick with just one page size, since under most circumstances the costs of managing multiple page sizes are much higher than the benefits this provides. Exceptions exist, but I don't think you need to worry about them.
Every virtual page has certain permissions attached to it, like whether it's readable, writeable, executable (varies depending on hardware support). The OS can use this to help enforce security, cache coherency (for shared memory), and swapping pages out of physical memory.
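To make the permission idea concrete, here is a minimal C sketch (POSIX; error handling abbreviated) that changes a page's permissions with mprotect(). Once the page is read-only, a write to it would raise SIGSEGV:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    long pagesz = sysconf(_SC_PAGESIZE);           /* typically 4096 */
    /* Get one anonymous, readable+writable page from the OS. */
    char *p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;
    strcpy(p, "hello");                            /* allowed: page is writable */
    /* Revoke write permission on that page. */
    if (mprotect(p, pagesz, PROT_READ) == 0)
        printf("page is now read-only; writing to it would raise SIGSEGV\n");
    munmap(p, pagesz);
    return 0;
}
```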
The .text, .bss and .data regions need not be known to the OS (though most OSes do know about them, for security and performance reasons).
The OS may not actually allocate memory for a stack/heap page until the first time that page is accessed. The OS may provide system calls to request more pages of heap/stack space. Some OSes provide shared memory or shared library functionality which leads to more regions appearing in the address space. Depends on the OS.
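A small C sketch of that lazy behaviour (POSIX mmap; the exact accounting is OS-specific):

```c
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1UL << 30;  /* ask for 1 GiB of address space */
    /* The kernel only records the mapping here; no physical RAM is
       assigned yet (demand paging). */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    /* Touching a byte faults in exactly one physical page (e.g. 4 KiB),
       not the whole gigabyte. */
    p[0] = 1;
    p[len - 1] = 2;   /* faults in one more page at the far end */
    printf("two pages resident out of a 1 GiB mapping\n");
    munmap(p, len);
    return 0;
}
```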
Typically, on a paged operating system a process's entire address space is split into pages. Each linear address contains two components - a page number in the most significant bits, and an offset within the page in the least significant bits.
For example, with 32 bit linear addresses and 4kB pages, the upper 20 bits are a page number and the lower 12 bits are a page offset.
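That split is just a shift and a mask; a quick C illustration (the address value is arbitrary):

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12                      /* 4 kB pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

int main(void) {
    uint32_t linear = 0xCAFEBABE;          /* arbitrary 32-bit linear address */
    uint32_t page_number = linear >> PAGE_SHIFT;   /* upper 20 bits */
    uint32_t page_offset = linear & PAGE_MASK;     /* lower 12 bits */
    printf("page number = 0x%05X, offset = 0x%03X\n", page_number, page_offset);
    return 0;
}
```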
.data is where the program's initialized global variables live. .bss contains the globals without an explicit initializer (with a default value of 0). The heap and the stack are memory zones separate from these and from each other. All the memory a process sees is virtual memory, split into pages. A process doesn't see anything other than virtual memory.
I am writing an application that makes heavy use of mmap, including from distinct processes (not concurrently, but serially). A big determinant of performance is how the TLB is managed user and kernel side for such mappings.
I understand reasonably well the user-visible aspects of the Linux page cache. I think this understanding extends to the userland performance impacts¹.
What I don't understand is how those same pages are mapped into kernel space, and how this interacts with the TLB (on x86-64). You can find lots of information on how this worked in the 32-bit x86 world², but I didn't dig up the answer for 64-bit.
So the two questions are (both interrelated and probably answered in one shot):
How is the page cache mapped³ in kernel space on x86-64?
If you read() N pages from a file in some process, then read exactly those same N pages from another process on the same CPU, is it possible that all the kernel-side reads (during the kernel -> userspace copy of the contents) hit in the TLB? Note that this is (probably) a direct consequence of (1).
My overall goal here is to understand at a deep level the performance difference of one-off accessing of cached files via mmap or non-mmap calls such as read.
¹ For example, if you mmap a file into your process's virtual address space, you have effectively asked for your process page tables to contain a mapping from the returned/requested virtual address range to a physical range corresponding to the pages for that file in the page cache (even if they don't exist in the page cache yet). If MAP_POPULATE is specified, all the page table entries will actually be populated before the mmap call returns; if not, they will be populated as you fault in the associated pages (sometimes with optimizations such as fault-around).
² Basically, (for 3:1 mappings anyway) Linux uses a single 1 GB page to map approximately the first 1 GB of physical RAM directly (and places it at the top 1 GB of virtual memory), which is the end of the story for machines with <= 1 GB RAM (the page cache necessarily goes in that 1 GB mapping, and hence a single 1 GB TLB entry covers everything). With more than 1 GB RAM, the page cache is preferentially allocated from "HIGHMEM" - the region above 1 GB which isn't covered by the kernel's 1 GB mapping - so various temporary mapping strategies are used.
³ By mapped I mean how the page tables are set up for its access, i.e. how the virtual <-> physical mapping works.
Due to the vast virtual address space compared to the physical RAM installed (128 TB for the kernel), the common trick is to permanently map all of the RAM. This is known as the "direct map".
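A simplified C sketch of the idea (the constant below is illustrative; the actual x86-64 direct-map base is the kernel's PAGE_OFFSET, whose value depends on kernel version and KASLR):

```c
#include <stdint.h>

/* Hypothetical direct-map base, standing in for the kernel's PAGE_OFFSET. */
#define DIRECT_MAP_BASE 0xffff888000000000ULL

/* With a direct map, kernel virtual <-> physical translation for ordinary
   RAM is just an offset; no software page-table walk is needed. */
static inline uint64_t phys_to_virt(uint64_t phys) {
    return phys + DIRECT_MAP_BASE;
}

static inline uint64_t virt_to_phys(uint64_t virt) {
    return virt - DIRECT_MAP_BASE;
}
```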
In principle it is possible that both the relevant TLB and cache entries survive the context switch and all the other code executed, but it is hard to say how likely this is in the real world.
Is virtually contiguous memory also always physically contiguous? If not, how is virtually contiguous memory allocated and memory-mapped over physically non-contiguous RAM blocks? A detailed answer is appreciated.
Short answer: You need not care (unless you're a kernel/driver developer). It is all the same to you.
Longer answer: On the contrary, virtually contiguous memory is usually not physically contiguous (it is only in very small amounts), except by coincidence, or shortly after the machine has just booted. That isn't necessary, however.
The only way of allocating larger amounts of physically contiguous RAM is by using large pages (since the memory within one page needs to be contiguous). It is, however, a useless endeavor: there is no observable difference for your process between memory that you think is contiguous and memory that actually is, and there are disadvantages to using large pages.
Memory mapping over physically non-contiguous RAM works in no particularly "special" way. It follows the same method which all memory management follows.
The OS divides virtual memory into "pages" and creates page table entries for your process. When you access memory at some location, either the corresponding page does not exist at all, or it exists and corresponds to a real page in RAM, or it exists but doesn't correspond to a real page in RAM.
If the page exists in RAM, nothing happens at all¹. Otherwise a fault is generated and some operating system code is run. If it turns out the page doesn't exist at all (or does not have the correct access rights), your process is killed with a segmentation fault.
Otherwise, the OS chooses an arbitrary page that isn't used (or it swaps out the one it thinks is the least important one), and loads the data from disk into that page. In the case of a memory mapping, the data comes from the mapped file, otherwise it comes from swap (and for completely new allocated memory, the zero page is copied). The OS then returns control back to your process. You never know this happened.
If you access another location in a "contiguous" (or so you think!) memory area which lies in a different page, the exact same procedure runs.
¹ In reality, it is a little more complicated, since a page may exist in RAM but not exist "officially", being part of a list of pages that are to be recycled or such. But this gets too complicated.
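If you want to see the scattering yourself on Linux, /proc/self/pagemap exposes the virtual-to-physical mapping. A sketch (note that since kernel 4.0 the PFN field reads as zero unless the process has CAP_SYS_ADMIN):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Print the physical frame number (PFN) backing one virtual address.
   Recent kernels report PFN 0 unless the process has CAP_SYS_ADMIN. */
static void show_pfn(FILE *pm, void *addr, long pagesz) {
    uint64_t entry = 0;
    long idx = (long)((uintptr_t)addr / (uintptr_t)pagesz);
    fseek(pm, idx * (long)sizeof entry, SEEK_SET);
    if (fread(&entry, sizeof entry, 1, pm) != 1) return;
    if (entry & (1ULL << 63))                    /* bit 63: page present */
        printf("virt %p -> PFN 0x%llx\n", addr,
               (unsigned long long)(entry & ((1ULL << 55) - 1)));
    else
        printf("virt %p -> not present\n", addr);
}

int main(void) {
    long pagesz = sysconf(_SC_PAGESIZE);
    FILE *pm = fopen("/proc/self/pagemap", "rb");
    if (!pm) { perror("pagemap"); return 1; }
    char *buf = malloc(4 * (size_t)pagesz);
    for (int i = 0; i < 4; i++) {
        buf[i * pagesz] = 1;        /* touch the page so it gets faulted in */
        show_pfn(pm, buf + i * pagesz, pagesz);
    }
    /* Adjacent virtual pages typically report non-adjacent PFNs. */
    fclose(pm);
    free(buf);
    return 0;
}
```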
No, it doesn't have to. Any page of the virtual memory can be mapped to an arbitrary physical page. Therefore you can have adjacent pages of your virtual memory pointing to non-adjacent physical pages. This mapping is maintained by the OS and is used by the MMU unit of CPU.
This is a noob question about computer science: how is RAM allocated?
For example, I use Windows. Can I know which addresses are used by a program? How does Windows allocate memory? Contiguously or non-contiguously?
Is it the same on Linux OSes?
And can I have access to the whole RAM with a program? (I don't believe that's possible, but...)
Do you know any good lectures/documentation on this?
First, when you think you are allocating RAM, you really are not. This is confusing, I know, but it's really not complicated once you understand how it works. Keep reading.
RAM is allocated by the operating system in units called "pages". Usually, this means contiguous regions of 4 KiB, but other sizes are possible (to complicate things further, there is support for "large pages" (usually on the order of 1-4 MiB) on modern processors, and the operating system may have an allocation granularity different from the page size; for example, Windows has a page size of 4 KiB with a granularity of 64 KiB).
Let's ignore those additional details and just think of "pages" that have one particular size (4 KiB).
If you allocate and use areas that are greater than the system's page size, you will usually not have contiguous memory, but you will nevertheless see it as contiguous, since your program can only "think" in virtual addresses. In reality you may be using two (or more) pages that are not contiguous at all, but they appear to be. These virtual addresses are transparently translated to the actual addresses by the MMU.
Further, not all memory that you believe to have allocated necessarily exists in RAM at all times, and the same virtual address may correspond to entirely different pieces of RAM at different times (for example when a page is swapped out and is later swapped in again -- your program will see it at the same address, but in reality it is most likely in a different piece of RAM).
Virtual memory is a very powerful instrument. While one address in your program can only refer to [at most] one physical address (in a particular page) in RAM, one physical page of RAM can be mapped to several different addresses in your program, and even in several independent programs.
It is for example possible to create "circular" memory regions, and code from shared libraries is often loaded into one memory location, but used by many programs at the same time (and it will have different addresses in those different programs). Or, you can share memory between programs with that technique so when one program writes to some address, the value in the other program's memory location changes (because it is the exact same memory!).
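A hedged sketch of that sharing on POSIX systems, using shm_open plus MAP_SHARED (the name "/my_region" is made up for the example; link with -lrt on older glibc):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Two cooperating processes can each run this: both end up seeing the
   same physical page, typically at different virtual addresses. */
int main(void) {
    int fd = shm_open("/my_region", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    ftruncate(fd, 4096);                       /* one page */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    /* The second process to run this prints what the first one wrote. */
    printf("mapped at %p, contents: \"%s\"\n", (void *)p, p);
    strcpy(p, "written by one process, visible to the other");
    munmap(p, 4096);
    close(fd);
    return 0;
}
```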
On a high level, you ask your standard library for memory (e.g. malloc), and the standard library manages a pool of regions that it has reserved in a more or less unspecified way (there are many different allocator implementations; they all have in common that you can ask them for memory and they give back an address - this is where you think that you are allocating RAM when you are not).
When the allocator needs more memory, it asks the operating system to reserve another block. Under Linux, this might be sbrk or mmap; under Windows, this would for example be VirtualAlloc.
Generally, there are 3 things you can do with memory, and it generally works the same under Linux and Windows (and every other modern OS), although the API functions used are different, and there are a few more minor differences.
You can reserve it; this does more or less nothing, apart from logically dividing up your address space (only your process cares about that).
Next, you can commit it; this again doesn't do much, but it somewhat influences other processes. The system has a total limit on how much memory it can commit for all processes (physical RAM plus page file size), and it keeps track of that. This means memory you commit counts against the same limit that another process could commit against. Otherwise, again, not much happens.
Last, you can access memory. This, finally, has a noticeable effect. Upon first accessing a page, a fault occurs (because the page does not exist at all!), and the operating system either fetches some data from a file (if the page belongs to a mapping) or it clears some page (possibly after first saving it to disk). The OS then adjusts the structures in the virtual memory system so you see this physical page of RAM at the address you accessed.
From your point of view, none of that is visible. It just works as if by magic.
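The reserve/commit/access distinction described above can be made visible with the Windows API. A sketch (error handling trimmed):

```c
#include <windows.h>
#include <stdio.h>

int main(void) {
    /* Reserve 1 MiB of address space: no RAM, no commit charge yet. */
    SIZE_T len = 1 << 20;
    char *p = VirtualAlloc(NULL, len, MEM_RESERVE, PAGE_NOACCESS);
    if (!p) return 1;
    /* Commit the first 64 KiB: counts against the system commit limit,
       but still no physical page is necessarily assigned. */
    if (!VirtualAlloc(p, 64 * 1024, MEM_COMMIT, PAGE_READWRITE)) return 1;
    /* Access it: the first touch faults, and only now does the OS wire
       a physical page behind the virtual address. */
    p[0] = 42;
    printf("reserved, committed, and touched one page\n");
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}
```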
It is possible to inspect processes for what areas in their address space are used, and it is possible (but kind of meaningless) to translate this to physical addresses. Note that the same program run at different times might store e.g. one particular variable at a different address. Under Windows, you can for example use the VMMap tool to inspect process memory allocation.
You can only use all RAM if you write your own operating system, since there is always a little memory which the OS reserves that user processes cannot use.
Otherwise you can in principle use [almost] all memory. However, whether or not you can directly use that much depends on whether your process is 32-bit or 64-bit. Computers nowadays typically have more RAM than you can address with 32 bits, so either you need to use address windowing extensions or your process must be 64-bit. Also, even given an amount of RAM that is in principle addressable using 32 bits, address space factors (e.g. fragmentation, kernel reserve) may prevent you from directly using all memory.
I have an application where I have to allocate on Windows (using operator new) quite a large memory space (hundreds of MB). The application is 32 bit (we don't use 64 bit for now, even on 64 bit systems) and I enabled /LARGEADDRESSAWARE linker option to be able to use 4 GB of user space memory.
Question: If I need to allocate, say, 450 MB of contiguous memory, does the virtual address space of the process need to have a contiguous, large enough hole, and additionally does the physical memory need to be unfragmented on the system? I ask this because I can make my application reserve a large enough contiguous space, but I don't know whether other applications on the system can affect me in this way. Do the OS page tables need to translate contiguous virtual addresses seen by the application into contiguous physical addresses?
If the memory is simply used in your software, then your 450 MB allocation will only need a hole of 450 MB in the virtual address space. It can be satisfied with pages from every corner of the memory system [as long as there is at least 450 MB available somewhere in the system - including swap space].
Your system will get a little better performance if the OS is able to allocate the pages in contiguous blocks of 2 MB apiece [using "large pages" of 2 MB at a time]. But the system will fall back to individual 4 KB pages if it needs to.
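On Linux you can hint that preference yourself with madvise. A sketch (whether the kernel honours it depends on the transparent-huge-page settings):

```c
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 4UL << 20;                    /* 4 MiB; 2 MiB-aligned sizes help */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    /* Ask for 2 MiB transparent huge pages; the kernel may or may not comply. */
    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        perror("madvise");                     /* falls back to 4 KiB pages */
    p[0] = 1;
    printf("mapping may now be backed by 2 MiB pages\n");
    munmap(p, len);
    return 0;
}
```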
One of several benefits of a paged memory architecture is that any physical page can be placed at any virtual address. In some systems, for example the Xen virtualization manager in debug mode, pages are INTENTIONALLY allocated out of sequence, to make it easier to detect when the system makes assumptions about memory pages being contiguous.
You don't need to be concerned about contiguity of the physical memory. That's one thing that virtual to physical address translation helps you with. As long as you can reserve a chunk of the address space and back it with physical memory, wherever it happens to be, things are going to work.
I have a confused notion of the process of segmentation & paging in x86 Linux machines. I'd be glad if someone could clarify all the steps involved from start to end.
x86 uses a paged segmentation technique for memory management.
Can anyone please explain what happens from the moment an executable ELF-format file is loaded from hard disk into main memory to the time it dies? When compiled, the executable has different sections in it (text, data, stack, heap, bss). How will these be loaded, and how will they be set up under the paged segmentation memory technique?
I want to know how the page tables get set up for the loaded program, how the GDT gets set up, and how the registers are loaded. Also, why is it said that logical addresses (the ones processed by the segmentation unit of the MMU) are 48 bits (16-bit segment selector + 32-bit offset) when it is a 32-bit machine? How will the other 16 bits be stored? Anything accessed from RAM must be 32 bits (4 bytes), so how are the remaining 16 bits accessed (to be loaded into segment registers)?
Thanks in advance. The question covers a lot of things, but I wanted clarification about the entire life cycle of an executable. I'd be glad if someone answers and starts a discussion on this.
Unix has traditionally implemented protection via paging. The 286+ provides segmentation, and the 386+ provides paging. Everyone uses paging; few make any real use of segmentation.
In x86, every memory operand has an implicit segment (so the address is really 16 bit selector + 32 bit offset), depending on the register used. So if you access [ESP + 8] the implied segment register is SS, if you access [ESI] the implied segment register is DS, if you access [EDI+4] the implied segment register is ES,... You can override this via segment prefix overrides.
Linux, and virtually every modern x86 OS, uses a flat memory model (or something similar). Under a flat memory model each segment provides access to the whole memory, with a base of 0 and a limit of 4 GB, so you don't have to worry about the complications segmentation brings about. Basically there are 4 segments: kernelspace code (RX), kernelspace data (RW), userspace code (RX), userspace data (RW).
An ELF file consists of some headers that point to "program segments" and "sections". Sections are used for linking. Program segments are used for loading. Program segments are mapped into memory via mmap(); this sets up page-table entries with appropriate permissions.
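You can inspect those program segments yourself. A minimal C sketch using <elf.h> (64-bit ELF assumed; error handling kept short, and the example path is arbitrary):

```c
#include <elf.h>
#include <stdio.h>

/* Dump the PT_LOAD program headers (the "load view") of a 64-bit ELF file.
   Usage: ./a.out /bin/ls   -- the path is just an example. */
int main(int argc, char **argv) {
    if (argc < 2) return 1;
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    Elf64_Ehdr eh;
    if (fread(&eh, sizeof eh, 1, f) != 1) return 1;
    fseek(f, (long)eh.e_phoff, SEEK_SET);
    for (int i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        if (fread(&ph, sizeof ph, 1, f) != 1) return 1;
        if (ph.p_type == PT_LOAD)              /* the segments that get mmap()ed */
            printf("LOAD vaddr 0x%lx filesz 0x%lx flags %c%c%c\n",
                   (unsigned long)ph.p_vaddr, (unsigned long)ph.p_filesz,
                   (ph.p_flags & PF_R) ? 'R' : '-',
                   (ph.p_flags & PF_W) ? 'W' : '-',
                   (ph.p_flags & PF_X) ? 'X' : '-');
    }
    fclose(f);
    return 0;
}
```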
Now, older x86 CPUs' paging mechanism only provided RW access control (read permission implied execute permission), while segmentation provided RWX access control. The end permission takes both segmentation and paging into account (e.g. RW (data segment) + R (read-only page) = R (read-only), while RX (code segment) + R (read-only page) = RX (read and execute)).
So there are some patches that provide execution prevention via segmentation: e.g. Openwall provided a non-executable stack by shrinking the code segment (the one with execute permission), and having special emulation in the page fault handler for anything that needed execution from a high memory address (e.g. GCC trampolines, self-modifying code created on the stack to efficiently implement nested functions).
There's no such thing as paged segmentation, not in the official documentation at least. There are two different mechanisms working together and more or less independently of each other:
Translation of a logical address of the form 16-bit segment selector value:16/32/64-bit segment offset value, that is, a pair of two numbers, into a 32/64-bit virtual address.
Translation of the virtual address into a 32/64-bit physical address.
Logical addresses are what your applications operate on directly. They then undergo the above 2-step translation into what the RAM will understand, physical addresses.
In the first step the GDT (or it can be LDT, depends on the selector value) is indexed by the selector to find the relevant segment's base address and size. The virtual address will be the sum of the segment base address and the offset. The segment size and other things in segment descriptors are needed to provide protection.
In the second step the page tables are indexed by different parts of the virtual address and the last indexed table in the hierarchy gives the final, physical address that goes out on the address bus for the RAM to see. Just like with segment descriptors, page table entries contain not only addresses but also protection control bits.
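For x86-64 with 4-level paging and 48-bit virtual addresses, the "different parts" look like this. A C sketch of the index split only, not an actual table walk (the address value is arbitrary):

```c
#include <stdio.h>
#include <stdint.h>

/* x86-64, 4-level paging: a 48-bit virtual address is split into four
   9-bit table indices plus a 12-bit page offset. */
int main(void) {
    uint64_t va = 0x00007f1234567abcULL;       /* arbitrary user address */
    unsigned pml4 = (va >> 39) & 0x1ff;        /* level 4 index */
    unsigned pdpt = (va >> 30) & 0x1ff;        /* level 3 index */
    unsigned pd   = (va >> 21) & 0x1ff;        /* level 2 index */
    unsigned pt   = (va >> 12) & 0x1ff;        /* level 1 index */
    unsigned off  =  va        & 0xfff;        /* offset within 4 KiB page */
    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n",
           pml4, pdpt, pd, pt, off);
    return 0;
}
```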
That's about it on the mechanisms.
Now, in many x86 OSes the segment selectors that are used for applications are fixed, they are the same in all of them, they never change and they point to segment descriptors that have base addresses equal to 0 and sizes equal to the possible maximum (e.g. 4GB in non-64-bit modes). Such a GDT setup effectively means that the first step does no useful work and the offset part of the logical address translates into numerically equal virtual address.
This makes the segment selector values practically useless. They still have to be loaded into the CPU's segment registers (in non-64-bit modes into at least CS, SS, DS and ES), but beyond that point they can be forgotten about.
This all (except Linux-related details and the ELF format) is explained in or directly follows from Intel's and AMD's x86 CPU manuals. You'll find many more details there.
Perhaps read the Assembly HOWTO. When a Linux process starts to execute an ELF executable via the execve system call, it is essentially (sort of) mmap-ing some segments (and initializing registers, and a tiny part of the stack). Read also the SVR4 x86 ABI supplement and its x86-64 variant. Don't forget that a Linux process only sees memory mappings for its address space and only deals with virtual memory.
There are many good books on operating system (OS) kernels, notably by A. Tanenbaum and by M. Bach, and some on the Linux kernel.
NB: segment registers are nearly unused on Linux.