How are page tables initialized? - linux

I have been learning about virtual memory recently, and some questions came up, especially regarding the initialization of all the structures involved.
Assume the x86 architecture and Linux 2.4 (hence two-level paging).
At the beginning, what do the PGD's entries contain if they don't point to any allocated page table?
Same question for page tables: how are their entries initialized?
When a process creates a new memory area, say for virtual addresses 100-200, does it also create (if needed) and initialize the page tables that correspond to those addresses, or does it wait until there is an access to a specific address?
When a page table entry needs to be initialized with a physical address (say, on a write access), how does the OS select that physical page?
Thanks in advance.

Entries have a valid (present) bit. So if no page tables have been allocated under a page directory, all of its entries theoretically have the valid bit off, and it wouldn't matter what else is in them.
Same as above, except I think that if a page table is created, it means a page from its range has been accessed, so at least one entry will be set valid when the page table is initialized. Otherwise there would be no reason to create an empty page table at all and take up memory.
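For illustration, here is a minimal sketch of how a 32-bit x86 entry is interpreted; the bit positions come from the architecture manuals, while the macro and function names are made up for this example:

    #include <stdint.h>
    #include <stdio.h>

    /* Layout of a 32-bit x86 page-directory/page-table entry.
     * Bit 0 is the Present ("valid") bit; when it is clear, the MMU
     * ignores the other 31 bits, so a freshly zeroed page directory
     * simply means "no page tables anywhere". */
    #define PTE_PRESENT  (1u << 0)
    #define PTE_RW       (1u << 1)
    #define PTE_USER     (1u << 2)
    #define PTE_ACCESSED (1u << 5)
    #define PTE_DIRTY    (1u << 6)

    static int entry_is_valid(uint32_t entry)
    {
        return entry & PTE_PRESENT;
    }

    int main(void)
    {
        uint32_t pgd[1024] = { 0 };  /* all entries: Present bit clear */
        printf("entry 0 valid? %d\n", entry_is_valid(pgd[0]));  /* 0 */
        return 0;
    }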
I'm interpreting your "creating a new memory area" as a malloc() call. malloc() is a way to ask the OS for memory and have it mapped into your virtual address space. That memory comes from your heap's virtual address range, and I don't think you can control which specific addresses the OS uses, only the size. If you use mmap() I think you do have the ability to ask for specific addresses, but in general you only want to do this for special cases like shared memory.
As for the page tables, I imagine that when the OS hands you memory during a malloc() call, it updates your page tables with the new mappings. If it doesn't do so during the malloc(), then it will when you try to access the memory and cause a page fault.
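You can actually observe that lazy behavior from user space with mincore(); a small sketch (the helper name is my own):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Count how many pages of a mapping are currently resident in RAM. */
    static size_t resident_pages(void *addr, size_t len)
    {
        long psz = sysconf(_SC_PAGESIZE);
        size_t npages = (len + psz - 1) / psz;
        unsigned char vec[npages];
        size_t n = 0;
        if (mincore(addr, len, vec) == 0)
            for (size_t i = 0; i < npages; i++)
                n += vec[i] & 1;
        return n;
    }

    int main(void)
    {
        size_t len = 16 * 4096;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        printf("resident after mmap:  %zu\n", resident_pages(p, len)); /* usually 0 */
        p[0] = 1;  /* first touch: page fault populates one page table entry */
        printf("resident after touch: %zu\n", resident_pages(p, len)); /* usually 1 */
        munmap(p, len);
        return 0;
    }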
In Linux, the OS generally keeps a free list of pages so that it can easily grab memory without worrying about anyone else using it. My guess is that the free list is initialized at boot from the memory map reported by the firmware (on x86, the BIOS e820 map), which tells the kernel which ranges of physical memory exist and which are already in use.
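On a current Linux system (not 2.4-specific) you can see the outcome of that boot-time discovery in /proc/iomem; a tiny sketch, noting that recent kernels show the real addresses only to root:

    #include <stdio.h>
    #include <string.h>

    /* Print the "System RAM" ranges the kernel learned from the
     * firmware memory map at boot. */
    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/iomem", "r");
        if (!f) { perror("/proc/iomem"); return 1; }
        while (fgets(line, sizeof line, f))
            if (strstr(line, "System RAM"))
                fputs(line, stdout);
        fclose(f);
        return 0;
    }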

Related

Do cache hits count as page references in Linux?

In Linux, pages in memory have a PG_referenced bit which is set when there is a reference to the page. My question is, if there is a memory read/write to an address in a page and there was a cache hit for that address, will it count as a reference to that page?
Is PG_referenced what Linux calls the bit in the hardware page tables that the CPU updates when a page is accessed? Or is it tracking whether a page is referenced by another kernel data structure?
The CPU hardware sets a bit in the page-table entry on access to a virtual page, if it wasn't already set. (I assume most ISAs have functionality like this to help detect less recently used pages: the OS can clear the Accessed bit on some pages and later check which pages still haven't been accessed, without having to actually invalidate them and force a soft page fault on access.)
On x86 for example, the check (for whether to "trap" to an internal microcode assist to atomically set the Accessed and/or Dirty bit in the PTE and higher levels in the page directory) is based on the TLB entry caching the PTE.
This is unrelated to D-cache or I-cache hit or miss for the physical address.
Updating the Accessed and/or Dirty bits works even for pages set to be uncacheable, I think.
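On Linux you can drive that clear-and-check cycle from user space through /proc/<pid>/clear_refs and the Referenced: fields of smaps; a rough sketch, assuming a kernel built with CONFIG_PROC_PAGE_MONITOR:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Writing "1" to clear_refs clears PG_referenced and the
         * hardware Accessed bits for all of this process's pages. */
        FILE *cr = fopen("/proc/self/clear_refs", "w");
        if (!cr) { perror("clear_refs"); return 1; }
        fputs("1\n", cr);
        fclose(cr);

        static char buf[1 << 20];
        memset(buf, 1, sizeof buf);  /* touch ~1 MiB: sets Accessed bits again */

        /* smaps reports, per mapping, how many bytes sit in pages whose
         * referenced/Accessed bits are currently set. */
        char line[256];
        FILE *sm = fopen("/proc/self/smaps", "r");
        if (!sm) { perror("smaps"); return 1; }
        while (fgets(line, sizeof line, sm))
            if (strncmp(line, "Referenced:", 11) == 0)
                fputs(line, stdout);
        fclose(sm);
        return 0;
    }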

Find out pages used by mem_map[] array

Recently I have been working on the ARM Linux kernel, and I need to split the HighMem zone into two parts, so I added a new zone to the kernel, let's call it "NewMem". I therefore have three zones in my system: Normal, NewMem, and HighMem. The size of the NewMem zone is 512 MB (131072 pages in total). My purpose is to manage all the page frames in the NewMem zone in my own way; currently I use a doubly linked list to allocate/de-allocate pages. Note that the buddy system for the NewMem zone still exists, but I do not use it. To achieve this, I modified the page allocation routine to make sure that the kernel cannot allocate any page frame from my zone.
My concern is whether I can use all the page frames in that zone, since it is suggested that each zone is concerned with a subset of the mem_map[] array. I found that not all of the 131072 pages in the NewMem zone are free. Therefore, some page frames in my zone may be used to store mem_map[], and writing data to those pages may lead to unpredictable errors. So is there any way to find out which page frames are used to store mem_map[], so that I can avoid overwriting them?
You have to check the breakdown of physical and virtual memory. Usually mem_map is stored at the first mappable address of the kernel's virtual memory. On x86 Linux, for example, the kernel image (usually about 8 MiB) is loaded at physical address 1 MiB and accessed at virtual address PAGE_OFFSET + 0x00100000; 8 MiB of virtual memory is reserved for the kernel image. Then come the 16 MiB of ZONE_DMA, so the first address the kernel can use for mappings is 0xC1000000, which is supposed to contain the mem_map array.
I am not familiar with the ARM memory breakdown, but from your post it is evident that there is no ZONE_DMA, at least in your case. So your best bet is that address 0xC0800000 stores mem_map. I am assuming the kernel image is 8 MiB.
As stated above, in general the first mappable virtual address stores mem_map. You can calculate that address from the size and location of the kernel image and of ZONE_DMA (present or not).
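Rather than guessing, you can also compute the occupied frame range inside the kernel itself; a sketch, assuming a flat memory model where the kernel's own global mem_map and max_mapnr symbols are available:

    /* Kernel-side sketch: the struct page array lives in lowmem, so
     * virt_to_phys() gives its physical location. Any pfn in
     * [first, last] backs mem_map[] and must not be handed out by a
     * private allocator. */
    #include <linux/mm.h>
    #include <asm/io.h>

    static void report_mem_map_frames(void)
    {
        unsigned long bytes = max_mapnr * sizeof(struct page);
        unsigned long first = virt_to_phys((void *)mem_map) >> PAGE_SHIFT;
        unsigned long last  = virt_to_phys((void *)((char *)mem_map + bytes - 1))
                                  >> PAGE_SHIFT;

        printk(KERN_INFO "mem_map[] occupies pfns %lu..%lu\n", first, last);
    }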
Please follow up with your feedback.

In Linux, are physical memory pages belonging to the kernel data segment swappable or not?

I'm asking because I remember that all physical pages belonging to the kernel are pinned in memory and thus unswappable, as stated here: http://www.cse.psu.edu/~axs53/spring01/linux/memory.ppt
However, I'm reading a research paper and am confused, as it says:
"(physical) pages frequently move between the kernel data segment and user space."
It also mentions that, in contrast, physical pages do not move between the kernel code segment and user space.
I think if a physical page sometimes belongs to the kernel data segment and sometimes belongs to user space, that must mean that physical pages belonging to the kernel data segment are swappable, which contradicts my current understanding.
So, are physical pages belonging to the kernel data segment swappable or unswappable?
P.S. The research paper is available here:
https://www.cs.cmu.edu/~arvinds/pubs/secvisor.pdf
Please search "move between" and you will find it.
P.S. again: a virtual memory area ranging from 3G + 896M to 4G belongs to the kernel and is used for mapping physical pages in ZONE_HIGHMEM (x86 32-bit Linux, 3G + 1G split). In such a case, the kernel may first map some virtual pages in that area to the physical pages that host the current process's page table, modify some page table entries, and then unmap the virtual pages. This way the physical pages may sometimes belong to the kernel and sometimes to user space: they no longer belong to the kernel after the unmapping and thus become swappable. Is this the reason?
tl;dr - memory pools and swapping are different concepts. You cannot make any deductions about one from the other.
kmalloc() and other kernel data allocations come from slab/slub, etc. - the same place the kernel gets memory for user space. Ergo, pages frequently move between the kernel data segment and user space. This is correct. It doesn't say anything about swapping; that is a separate issue, and you cannot deduce anything from it.
The kernel code is typically populated at boot, marked read-only, and never changed after that. Ergo, physical pages do not move between the kernel code segment and user space.
Why do you think that because something comes from the same pool, it is the same? Network sockets also come from the same memory pool. It is a separation of concerns. The linux-mm (memory management) system handles swap. A page can be pinned (unswappable). The check for static kernel memory (which may include .bss and .data) is a simple range check; that memory is normally pinned and marked unswappable at the linux-mm layer. User data (whose allocations come from the same pool) can be marked swappable by linux-mm. For instance, even without swap, user-space text is still swappable because it is backed by an inode; caching is much simpler for read-only data. If data is swapped out, it is marked as such in the MMU tables, and a fault handler must distinguish between swap and a SIGBUS; that is part of linux-mm.
There are also versions of Linux with no MMU (no-mm), and these will never swap anything. In theory someone might be able to swap kernel data, but then why is it in the kernel? The Linux way would be to use a module and load it only when needed. Certainly the linux-mm data itself is kernel data, and hopefully you can see the problem with swapping that.
The problem with conceptual questions like this:
It can differ with Linux versions.
It can differ with Linux configurations.
The advice can change as Linux evolves.
For certain, the linux-mm code cannot be swappable, nor can any interrupt handler. It is possible that at some point in time kernel code and/or data could be swapped; I don't think that is ever the case currently, outside of module loading/unloading (and it is rather pedantic/esoteric whether you call that swapping or not).
I think if a physical page sometimes belongs to the kernel data segment and sometimes belongs to user space, that must mean that physical pages belonging to the kernel data segment are swappable, which contradicts my current understanding.
There is no connection between swappable memory and page movement between user space and kernel space. Whether a page can be swapped or not depends entirely on whether it is pinned. Pinned pages are not swapped, so their mapping is considered permanent.
So, are physical pages belonging to the kernel data segment swappable or unswappable?
Usually the pages used by the kernel are pinned and thus are meant not to be swappable.
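The user-space analogue of that pinning is mlock(2), which asks linux-mm to mark pages unswappable; a small sketch (needs RLIMIT_MEMLOCK headroom or CAP_IPC_LOCK):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        static char buf[4 * 4096];
        memset(buf, 0, sizeof buf);

        /* Pin the pages: linux-mm will no longer consider them for swap. */
        if (mlock(buf, sizeof buf) != 0) { perror("mlock"); return 1; }
        /* ... buf stays resident in RAM while locked ... */
        munlock(buf, sizeof buf);  /* pages become swappable again */
        return 0;
    }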
However, I'm reading a research paper and am confused, as it says: "(physical) pages frequently move between the kernel data segment and user space."
Could you please give a link to this research paper?
As far as I know (just from UNIX lectures and labs at school), the pages for kernel space are allocated to the kernel with a simple, fixed mapping algorithm, and they are all pinned. After the kernel turns on paging mode (bit operations on CR0 and CR3 for x86), the first user-mode process appears, and the pages that have been allocated to the kernel will not be in the set of pages available to user space.

Allocating "temporary" memory (in Linux)

I'm trying to find any system functionality that would allow a process to allocate "temporary" memory - i.e. memory that is considered discardable by the process and can be taken away by the system when memory is needed, while allowing the process to benefit from available memory when possible. In other words, the process tells the system it's OK to sacrifice the block of memory when the process is not using it. Freeing the block is also preferable to swapping it out (it's as expensive, or more expensive, to swap it out than to reconstitute its contents).
Systems (e.g. Linux) have things like this in the kernel, such as the filesystem page cache. I am looking for something like this, but available to user space.
I understand there are ways to do this from the program itself, but it's really more of a kernel job to deal with this. To some extent, I'm asking the kernel:
if you need to reduce my residency, or another process's, take these temporary pages off first
if you are taking these temporary pages off, don't swap them out, just unmap them
Specifically, I'm interested in a solution that would work on Linux, but I'd be interested to learn whether any exists for another OS.
UPDATE
An example of how I expect this to work:
map a page (over swap). No different from what's available right now.
tell the kernel that the page is "temporary" (for lack of a better name), meaning that if this page goes away, I don't want it paged back in.
tell the kernel that I need the temporary page "back". If the page was unmapped since I marked it "temporary", I am told that this happened. If it wasn't, then it starts behaving as a regular page again.
Here are the problems with doing this on top of the existing MM:
To make pages not be paged back in, I have to map them over nothing. But then they can get paged out at any time, without notice. Testing with mincore() doesn't guarantee that the page will still be there by the time mincore() finishes. Using mlock() requires elevated privileges.
So the closest I can get to this is by using mlock() and anonymous pages. Following the expectations I outlined earlier, it would be:
map an anonymous, locked page (MAP_ANON|MAP_LOCKED|MAP_NORESERVE). Stamp the page with magic.
to make the page "temporary", unlock it
when needing the page, lock it again. If the magic is there, it's my data; otherwise it's been lost and I need to reconstitute it.
However, I don't really need the pages to be locked in RAM while I'm using them. Also, MAP_NORESERVE is problematic if memory is overcommitted.
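For concreteness, a minimal sketch of the scheme above (MAGIC and the function names are made up; MAP_NORESERVE is left out to keep it runnable):

    #include <stdio.h>
    #include <sys/mman.h>

    #define MAGIC 0xC0FFEEUL

    /* Map one anonymous, locked page and stamp it while pinned.
     * MAP_LOCKED needs RLIMIT_MEMLOCK headroom (one page fits the
     * usual 64 KiB default). */
    static unsigned long *make_page(void)
    {
        unsigned long *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED,
                                -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        p[0] = MAGIC;
        return p;
    }

    static void make_temporary(void *p) { munlock(p, 4096); }

    static int reclaim(unsigned long *p) /* 1 = data survived, 0 = lost */
    {
        mlock(p, 4096);
        return p[0] == MAGIC;
    }

    int main(void)
    {
        unsigned long *p = make_page();
        if (!p) { perror("mmap"); return 1; }
        make_temporary(p);
        /* ... memory pressure may or may not strike here ... */
        printf("contents %s\n", reclaim(p) ? "survived" : "lost");
        return 0;
    }

Note that on stock Linux the "lost" branch will essentially never fire: a dirty anonymous page that is evicted while unlocked goes to swap and is faulted back intact, which is exactly the behavior being complained about here.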
This is what the VMware ESXi server, a.k.a. the Virtual Machine Monitor (VMM) layer, implements. It is used with virtual machines and is a way to reclaim memory from the guests: virtual machines that have more memory allocated than they actually use are made to release/free it to the VMM, so that it can be assigned back to the guests that need it.
This technique of memory reclamation is described in this paper: http://www.vmware.com/files/pdf/mem_mgmt_perf_vsphere5.pdf
Along similar lines, you could implement something similar in your kernel.
I'm not sure I understand exactly what you need. Remember that processes run in virtual memory (their address space is virtual), and that the kernel deals with virtual-to-physical address translation (using the MMU) and with paging, so a page fault can happen at any time. The kernel chooses to page in or page out at arbitrary moments, and chooses which pages to swap (only the kernel cares about RAM, and it can page out any physical RAM page at will). Perhaps you want the kernel to tell you when a page is genuinely discarded. How would the kernel take away temporary memory from your process without your process being notified? The kernel could take some RAM away and later give it back... (so you want to know when the given-back memory is fresh).
You might use mmap(2) with MAP_NORESERVE first, then again (on the same memory range) with MAP_FIXED|MAP_PRIVATE. See also mincore(2) and mlock(2).
You can also later use madvise(2) with MADV_DONTNEED or MADV_WILLNEED, etc.
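MADV_DONTNEED in particular gives discard-not-swap semantics on private anonymous mappings; a quick sketch:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        strcpy(p, "expensive-to-rebuild data");

        /* Drop the contents immediately: nothing is written to swap,
         * and the next access faults in a fresh zero page. */
        madvise(p, 4096, MADV_DONTNEED);
        printf("after MADV_DONTNEED: \"%s\"\n", p);  /* prints "" */
        return 0;
    }

The difference from the wish list above is that the discard is immediate and unconditional, rather than deferred until the kernel is actually under memory pressure.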
Perhaps you want to mmap some device like /dev/null, /dev/full, /dev/zero or (more likely) write your own kernel module providing a similar device.
GNU Hurd has an external pager mechanism... You cannot yet get exactly that on Linux. (Perhaps consider mmap on some FUSE mounted file).
I don't understand what you want to happen when the kernel pages out your memory, and what you want to happen when the kernel pages such a page in again because your process is accessing it. Do you want to get a zeroed page, or a SIGSEGV?

Virtually contiguous vs. physically contiguous memory

Is virtually contiguous memory also always physically contiguous? If not, how is virtually contiguous memory allocated and memory-mapped over physically non-contiguous RAM blocks? A detailed answer is appreciated.
Short answer: You need not care (unless you're a kernel/driver developer). It is all the same to you.
Longer answer: On the contrary, virtually contiguous memory is usually not physically contiguous (only in very small amounts), except by coincidence or shortly after the machine has booted. It doesn't need to be, however.
The only way of allocating larger amounts of physically contiguous RAM is to use large pages (since the memory within one page needs to be contiguous). It is, however, a useless endeavor: there is no observable difference for your process between memory you think is contiguous and memory that actually is, and there are disadvantages to using large pages.
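If you do want a physically contiguous chunk anyway, a hugetlb mapping is the usual user-space route; a sketch, assuming a 2 MiB huge-page size and that huge pages have been reserved (e.g. echo 8 > /proc/sys/vm/nr_hugepages):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* A single huge page is physically contiguous by definition. */
        size_t len = 2 * 1024 * 1024;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }
        printf("2 MiB physically contiguous block at %p\n", p);
        munmap(p, len);
        return 0;
    }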
Memory mapping over physically non-contiguous RAM works in no particularly "special" way; it follows the same method that all memory management follows.
The OS divides virtual memory into "pages" and creates page table entries for your process. When you access memory at some location, either the corresponding page does not exist at all, or it exists and corresponds to a real page in RAM, or it exists but doesn't correspond to a real page in RAM.
If the page exists in RAM, nothing happens at all.[1] Otherwise a fault is generated and some operating system code runs. If it turns out the page doesn't exist at all (or does not have the correct access rights), your process is killed with a segmentation fault.
Otherwise, the OS chooses an arbitrary page that isn't used (or swaps out the one it thinks is the least important) and loads the data from disk into that page. In the case of a memory mapping, the data comes from the mapped file; otherwise it comes from swap (and for completely new allocated memory, the zero page is copied). The OS then returns control to your process. You never know this happened.
If you access another location in a "contiguous" (or so you think!) memory area that lies in a different page, the exact same procedure runs.
[1] In reality, it is a little more complicated, since a page may exist in RAM but not exist "officially", being part of a list of pages that are to be recycled or such. But this gets too complicated.
No, it doesn't have to be. Any page of virtual memory can be mapped to an arbitrary physical page, so adjacent pages of your virtual memory can point to non-adjacent physical pages. This mapping is maintained by the OS and is used by the CPU's MMU.
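You can verify this yourself by reading /proc/self/pagemap, which exposes the physical frame number behind each virtual page; a sketch (run as root, since the PFN field is zeroed for unprivileged readers on kernels since about 4.0):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Return the physical frame number backing a virtual address.
     * Each pagemap entry is 64 bits; bits 0-54 hold the PFN when the
     * page is present. */
    static uint64_t pfn_of(void *addr)
    {
        uint64_t entry = 0;
        long psz = sysconf(_SC_PAGESIZE);
        int fd = open("/proc/self/pagemap", O_RDONLY);
        pread(fd, &entry, sizeof entry,
              (off_t)((uintptr_t)addr / psz) * sizeof entry);
        close(fd);
        return entry & ((1ULL << 55) - 1);
    }

    int main(void)
    {
        long psz = sysconf(_SC_PAGESIZE);
        char *p = mmap(NULL, 2 * psz, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        p[0] = 1; p[psz] = 1;  /* fault both pages in */
        printf("virt %p -> pfn %llu\n", (void *)p,
               (unsigned long long)pfn_of(p));
        printf("virt %p -> pfn %llu\n", (void *)(p + psz),
               (unsigned long long)pfn_of(p + psz));
        return 0;
    }

Two adjacent virtual pages will typically report unrelated frame numbers.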
