What about the many addresses not listed in the /proc/$pid/maps file? - linux

Brief version:
What is the status of the addresses that do not appear in the maps file? Do they belong to unallocated virtual pages, to pages allocated from an anonymous file, or to something else?
Detailed version:
I'm learning about virtual memory. In my book (CS:APP), I learned that all virtual pages can be partitioned into three sets: unallocated, allocated but not cached, and allocated and cached. My questions are: what are allocated and unallocated pages, and when are pages allocated? Also, do the stack and the heap belong to allocated pages, to unallocated pages, or are they only allocated when used?
To answer these questions, I read the /proc/$pid/maps file, thinking I could get everything I wanted from it. As I understand it, the file contains all memory mappings. But it has no information about whether a page is cached (I know that may not be visible from user mode), and are the pages that do not appear in it unallocated?
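For reference, the holes between the ranges listed in a maps file can be computed with a short sketch like the one below. It parses text in the maps line format and reports the address ranges that fall between consecutive entries. The sample text, the /bin/demo path and the function names parse_maps and find_gaps are made up for illustration; they are not real output or a real API. As far as I understand, addresses in such holes belong to no mapping at all, and the file says nothing about whether a listed page is currently resident in RAM.

```python
def parse_maps(text):
    """Return a sorted list of (start, end) integer ranges from maps-style text."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The first field looks like "00400000-00401000" (hex, end exclusive).
        start_hex, end_hex = fields[0].split("-")
        ranges.append((int(start_hex, 16), int(end_hex, 16)))
    return sorted(ranges)


def find_gaps(ranges):
    """Return the (start, end) holes between consecutive mapped ranges."""
    gaps = []
    for (_, prev_end), (next_start, _) in zip(ranges, ranges[1:]):
        if next_start > prev_end:
            gaps.append((prev_end, next_start))
    return gaps


# Hypothetical sample in the /proc/<pid>/maps line format (not real output).
SAMPLE = """\
00400000-00401000 r-xp 00000000 08:01 1234 /bin/demo
00600000-00601000 rw-p 00000000 08:01 1234 /bin/demo
00601000-00622000 rw-p 00000000 00:00 0 [heap]
"""

if __name__ == "__main__":
    for start, end in find_gaps(parse_maps(SAMPLE)):
        print(f"not mapped: {start:#x}-{end:#x} ({end - start} bytes)")
```

To try it on a live process, the same functions can be fed the contents of /proc/self/maps instead of SAMPLE.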

Honestly, I don't really know about the maps file. What I do know is that information about every page is kept in page structures at all times. I'll take x86-64 as an example.
For x86-64 on Linux (with 4-level paging) you have the Page Global Directory (PGD), the Page Upper Directory (PUD), the Page Middle Directory (PMD) and, at the last level, the page table (PT), whose entries are the PTEs. The physical base address of the PGD is stored in the CR3 register. The PGD contains addresses of the PUDs, PUDs contain addresses of the PMDs, PMDs contain addresses of the PTs and PTs contain addresses of the physical pages.
A virtual address is 64 bits wide, but only the 48 least significant bits are used for translation (the upper 16 bits must be copies of bit 47). Those 48 bits are split into 5 parts. The 12 least significant bits are the offset in the physical page. The next chunk of 9 bits is the index into the PT, the next chunk of 9 bits is the index into the PMD, and so on up to the PGD. For example, let's say you have virtual address 0x0000000000000123. The MMU in the CPU translates it by looking at entry 0 of the PGD, entry 0 of the PUD, entry 0 of the PMD, entry 0 of the PT and finally offset 0x123 in the actual physical page in RAM.
At boot, the kernel makes checks to determine how much memory is available. It then builds its kernel structures accordingly.
When the kernel boots it will mark all pages as unallocated in its own structures (except for the pages the kernel itself needs). The page structure is important for this. The kernel has a struct page for every physical page in the system (https://linux-kernel-labs.github.io/refs/heads/master/labs/memory_mapping.html and https://elixir.bootlin.com/linux/v4.6/source/include/linux/mm_types.h). This structure informs the kernel whether the page is allocated or not.
Each physical page in the system has a struct page associated with
it to keep track of whatever it is we are using the page for at the
moment. Note that we have no way to track which tasks are using
a page, though if it is a pagecache page, rmap structures can tell us
who is mapping it.
At first the pages are mostly unallocated. When you start a new process by launching an executable as the user of the system, pages are allocated for your process. On Linux, executables are ELF files. ELF is a conventional format which organizes code and data into loadable segments. Each segment gets a virtual address at which it is going to be loaded in the virtual address space.
Let's say you have an ELF file with one segment which should be loaded at virtual address 0x400000. When you launch that ELF executable, the Linux kernel will call certain functions which will look at the size of the code and allocate pages accordingly. The kernel will look at its structures and determine, using its allocation algorithms, where the process will be placed in RAM. It will then set up the page tables according to where the virtual addresses for that process should land in actual physical memory.
What's important to understand is that each CPU core in the system runs only one process at a time. Each process has its own set of page tables, and each core's CR3 points to the tables of the process it is currently running. When a process switch occurs on a core, CR3 is reloaded so that translation goes through a different set of tables elsewhere in RAM. The same virtual address can point anywhere in RAM depending on how the page tables are set up.
The kernel holds a task_struct for every process running in the system. The task_struct points, through its mm field, to an mm_struct whose pgd field is a pointer to the PGD of the process. Each process has its very own PGD. If you dereference the pointer to the PGD you get the first entry of the PGD, which holds the address of a PUD. With this single pointer, the kernel can reach every table belonging to the process and modify them at will.
While a process is running, it can ask for more memory. This is called dynamic memory allocation. The kernel has no way to know in advance how much memory the process is going to ask for, since it is dynamic (done while code is executing). When the process asks for more memory, the kernel determines what page to give to the process depending on an algorithm. It then marks this page as allocated to that process. The mm_struct mentioned above (https://manybutfinite.com/post/how-the-kernel-manages-your-memory/) is the memory descriptor for that process, so that the kernel can know what memory the process is using. The process itself doesn't need that information, since it should rely only on itself to ask the operating system for memory properly and to not jump somewhere in RAM where it doesn't belong.
You ask about heap and stack. The stack for a process is set up when the process starts. On Linux the main stack is not a fixed-size block: it grows on demand up to a limit (RLIMIT_STACK, commonly 8 MiB by default). If you overflow that limit, the access raises a page fault that the kernel cannot resolve, and it kills your process (with SIGSEGV). Each CPU core has a special register called RSP. This is the stack pointer. It points to the top of the stack (the stack grows downward toward low memory). When the kernel sets up a stack for a process you launch, it initializes this register to point at the top of it. The stack pointer contains a virtual address. It is thus translated using the page tables just like any address.
The heap has no special register like the stack. It grows only when the process asks for more memory during code execution: the allocator in the language runtime requests address space from the kernel (through brk or mmap) and hands out pieces of it. The kernel knows in advance how much static memory a process requires, because it is all written in the ELF executable: the sizes of code and static data are fixed at build time. The only moment the kernel has to give new memory to a process is when the process actually asks for it. In C++ you use the keyword new to ask for heap memory dynamically. If you don't use it, every variable lives either in static storage, whose location is fixed in the executable, or on the stack.

Related

Is it possible to overwrite kernel code by replacing the page tables?

So it seems that you cannot modify kernel code because the PTE that points to it is marked as executable rather than writeable. I was wondering whether you could overwrite kernel code using the following method. (This only applies to x86 and assumes we have root access, so we run the following steps as a kernel module.)
1. Read in the contents of the CR3 register
2. Use kmalloc to allocate memory big enough to replicate all the PTEs and PDEs
3. Copy all the paging data into the newly allocated memory using the value obtained from the CR3 register
4. Mark the relevant pages as executable and writeable
5. Overwrite the CR3 register with a pointer to the memory we kmalloc'ed in step 2
At this point, assuming this all worked, wouldn't you be able to overwrite return addresses and other parts of the kernel code, whereas before doing this we would be stopped by the protections of the paging mechanism?
3. Copy all the paging data into the newly allocated memory using the value obtained from the CR3 register
5. Overwrite the CR3 register with a pointer to the memory we kmalloc'ed in step 2
These two steps might not work:
CR3 gives you a physical address; however, to read the page data you need a virtual address. It is not even guaranteed that the PTD is currently mapped (and therefore accessible).
And to overwrite the CR3 register you need to know the physical address of the memory you have allocated using kmalloc; however, you only know the virtual address.
However, you might use virt_to_phys and phys_to_virt to translate between virtual and physical addresses.
Is it possible to overwrite kernel code ...?
I'm not sure, but the following attempt should work:
The page tables themselves should be read-write - at least the ones used by kmalloc.
Instead of copying the PTD and the page tables, you could allocate some memory using kmalloc which is 2 page sizes long (8 KiB if the "traditional" 4 KiB memory pages are used). This means that "your" memory block under all circumstances contains one complete memory page.
When you have the virtual addresses of the PTD and the page tables, you can re-map "your" memory page so that it no longer points to your "kmalloc memory" but to the kernel code you want to modify...
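The claim that a block of two page sizes always contains one complete page is just alignment arithmetic, which can be checked with a small sketch. It assumes 4 KiB pages; the function name first_full_page and the example addresses are hypothetical, and this is plain arithmetic, not kernel code.

```python
PAGE_SIZE = 4096  # "traditional" 4 KiB pages, as assumed in the answer


def first_full_page(block_start, block_size=2 * PAGE_SIZE):
    """Return the address of the first page-aligned page fully inside the block."""
    # Round the start address up to the next page boundary.
    aligned = (block_start + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)
    if aligned + PAGE_SIZE > block_start + block_size:
        raise ValueError("block too small to contain a complete page")
    return aligned


if __name__ == "__main__":
    # Hypothetical block addresses, chosen only to exercise the arithmetic.
    for start in (0x1000, 0x1001, 0x1FFF):
        print(hex(start), "->", hex(first_full_page(start)))
```

Even in the worst case, where the block starts one byte past a page boundary, rounding up skips at most PAGE_SIZE - 1 bytes, so a full page still fits in the remaining space.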
At this point, assuming this all worked, wouldn't you be able to overwrite return addresses and other parts of the kernel code?
I'm not sure if I understand your question correctly.
But a kernel module is part of the kernel - so nothing stops a kernel module from doing something completely stupid (intentionally or because of a bug).
For this reason you have to be very careful when programming kernel modules.
And because "root" has the ability to load kernel modules, it is important that hackers or malware never get "root" access. Otherwise malware could be injected into the kernel using insmod.

What structure is traversed to deallocate pages, when a process terminates? (Page Table or something else?)

I am trying to understand the operations carried out to deallocate physical memory when a process terminates.
Assume that the page table for the process is a multi-level tree structure, as implemented on Linux.
My current understanding is that the OS would need to deallocate each physical page frame that is mapped by whatever subset of virtual addresses has a page table entry (PTE). This could happen by traversing the multi-level page table tree: for each PTE that has its valid bit set, the physical frame descriptor corresponding to the PTE is added to the free list (which is used by the buddy allocator).
My question is: is the page table actually traversed for this? An alternative, faster way would be to maintain, for each process, a linked list of the page frame descriptors allotted to it and then traverse that list linearly during process termination. Is this more generic and faster method used instead?
I'm not sure that pages get physically deallocated when a process ends.
My understanding is that the MMU is managed by the kernel.
But each process has its own virtual address space, which the kernel changes:
for explicit syscalls changing it, i.e. mmap(2)
at program start through execve(2) (which can be thought of as several virtual mmap-s, as described by the segments of the ELF executable file)
at process termination, as if each segment of the address space was virtually munmap-ed
And when a process terminates, it is its virtual address space (but not any physical RAM pages) which gets destroyed or deallocated!
So the page table (whatever definition you give to it) is probably managed inside the kernel by a few primitives, like adding a segment to the virtual address space and removing a segment from it. The virtual space is lazily managed, since the kernel uses copy-on-write techniques to make fork fast.
Don't forget that some pages (e.g. the code segment of shared libraries) are shared between processes and that all tasks of a multi-threaded process share the same virtual address space.
BTW, the Linux kernel is free software, so you should study its source code (from http://kernel.org/). Look also on http://kernelnewbies.org ; memory management happens inside the mm/ subtree of the kernel source.
There are lots of resources. Look into linux-kernel-slides, slides#245 for a start, and there are many books and resources about the Linux kernel... Look for vm_area_struct, pagetable, etc...

High memory mappings in kernel virtual address space

Linear addresses beyond 896 MB correspond to the high-memory region, ZONE_HIGHMEM.
So the page allocator functions will not work on this region, since they return the linear address of directly mapped page frames in ZONE_NORMAL and ZONE_DMA.
I am confused about these lines in Understanding the Linux Kernel:
What do they mean when they say "In 64 bit hardware platforms ZONE_HIGHMEM is always empty."
What does this highlighted statement mean: "The allocation of high-memory page frames is done only through alloc_pages() function. These functions do not return linear address since they do not exist. Instead the functions return linear address of the page descriptor of the first allocated page frame. These linear addresses always exist, because all page descriptors are allocated in low memory once and forever during kernel initialization."
What are these page descriptors, and does the first 896 MB already hold the page descriptors for the entire RAM?
The x86-32 kernel needs high memory to access more than 1G of physical memory, as it is impossible to permanently map more than 2^{32} addresses within a 32-bit address space and the kernel/user split is 1G/3G.
The x86-64 kernel has no such limitation, as the amount of physically-addressable memory (currently 256T) fits within its 64-bit address space and thus may always be permanently mapped.
High memory is a hack. Ideally you don't need it. Indeed, the point of x86-64 is to be able to directly address all the memory you could possibly want. Taken from https://www.quora.com/Linux-Kernel/What-is-the-difference-between-high-memory-and-normal-memory
I think page descriptor means struct page. And considering the size of struct page, yes, all of them can be stored in ZONE_NORMAL.
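To put a rough number on that, here is a back-of-the-envelope sketch. It assumes 4 KiB pages and a 32-byte struct page, a figure often quoted for 32-bit x86; the real size depends on kernel version and configuration, so treat it as an assumption. The 4 GiB of RAM and the function name page_descriptor_bytes are likewise only illustrative.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def page_descriptor_bytes(ram_bytes, page_size=4 * KIB, descriptor_size=32):
    """Bytes needed for one descriptor per physical page frame.

    descriptor_size=32 is an assumed sizeof(struct page), not a fixed constant.
    """
    return (ram_bytes // page_size) * descriptor_size


if __name__ == "__main__":
    ram = 4 * GIB                      # hypothetical machine
    needed = page_descriptor_bytes(ram)
    low_memory = 896 * MIB             # directly mapped region from the question
    print(f"descriptors: {needed // MIB} MiB of {low_memory // MIB} MiB low memory")
```

Under these assumptions, the descriptors for 4 GiB of RAM take 32 MiB, a small fraction of the 896 MiB that is directly mapped.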

How do kernel threads use the memory descriptor (mm_struct) of the last-run process in Linux?

Some of the points made in the book Linux Kernel Development (by Robert Love) about mm_struct and kernel threads are:
"Kernel threads do not have a process address space and therefore do not
have an associated memory descriptor. Thus, the mm field of a kernel
thread's process descriptor is NULL. "
"Because kernel threads do not have any pages in user-space, they do not
really deserve their own memory descriptor and page tables (page tables are
discussed later in the chapter). Despite this, kernel threads need some of
the data, such as the page tables, even to access kernel memory."
"Kernel threads do not have an address space and mm is NULL. Therefore, when
a kernel thread is scheduled, the kernel notices that mm is NULL and keeps
the previous process's address space loaded. The kernel then updates the
active_mm field of the kernel thread's process descriptor to refer to the
previous process's memory descriptor. The kernel thread can then use the
previous process's page tables as needed."
Now my queries are:
1. First it is mentioned that kernel threads don't have any pages in user space and hence don't deserve a memory descriptor and page tables, and the next line says they need some data, such as page tables, to access kernel memory. Which page table is it referring to here? Every process has its own page table for mapping virtual to physical addresses; why does a kernel thread require that?
2. How is the page table used by a kernel thread?
Every thread, whether it is a user-space process or a kernel thread, requires a page table. The kernel address space (the kernel part of the virtual address space) is directly mapped to the physical address space, whereas the user-space address space isn't directly mapped. Moreover, the user-space mappings keep changing as processes are created, terminated and swapped, whereas the kernel-space mappings remain constant.
To learn more you can visit the following link :-
Process address Space
Or post the queries here.
There are some user-space applications and kernel threads in your system. Every virtual address space consists of a kernel part and a user part. The kernel part is the same for all processes; the user part is different.
Every process has its own page table for mapping the virtual to
physical address, why kernel thread requires that?
Kernel threads need page tables to translate virtual addresses to physical addresses when accessing memory.
How is the page table used by a kernel thread?
Imagine a simple case: a memory write such as a[i] = 5; in kernel space. It typically goes through the MMU, which uses the page tables to get the physical address for the virtual address (in this case &a[i]). So there is nothing special about kernel threads. The difference is that on a context switch they don't change the pgd (page global directory); they use the pgd of the last process, because all processes have the same kernel part, so you can just pick the last one (see active_mm) and it will be fine.
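The borrowing rule can be illustrated with a toy model. This is not kernel code: the classes Mm and Task and the function context_switch are invented here, and they only mimic the idea that a task whose mm is NULL keeps using the previous task's address space through active_mm.

```python
class Mm:
    """Stand-in for a memory descriptor (one set of page tables)."""

    def __init__(self, name):
        self.name = name


class Task:
    """Stand-in for a process descriptor; mm=None models a kernel thread."""

    def __init__(self, name, mm=None):
        self.name = name
        self.mm = mm
        self.active_mm = mm


def context_switch(prev, nxt):
    """Return the Mm whose page tables are in use after switching prev -> nxt."""
    if nxt.mm is None:
        # Kernel thread: borrow the previous address space, no pgd change.
        nxt.active_mm = prev.active_mm
    else:
        # Ordinary process: switch to its own page tables.
        nxt.active_mm = nxt.mm
    return nxt.active_mm


if __name__ == "__main__":
    shell = Task("shell", Mm("shell-mm"))
    kworker = Task("kworker")                 # kernel thread, mm is None
    editor = Task("editor", Mm("editor-mm"))
    print(context_switch(shell, kworker).name)   # shell-mm (borrowed)
    print(context_switch(kworker, editor).name)  # editor-mm
```

Borrowing works because only the kernel part of the address space matters to the kernel thread, and that part is the same in every process.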

Virtual memory sections and memory mapping area

A process has virtual memory, which is copied into RAM at run time, as described in the previous post:
Which part of process virtual memory layout does mmap() uses?
I have the following doubts:
If a memory mapping lies in the unallocated memory inside a process's virtual memory, and virtual memory prevents one process from touching another process's virtual memory, then how can memory mapping be used for interprocess communication (IPC)?
In an OS like Linux, does each individual process have its own separate heap, stack and memory-mapping sections, or do all processes share one common section for heap, stack and mmap?
Example:
If processes P1, P2 and P3 are running on Linux, do they all share a common table as shown in the picture, or does each individual task have a separate table for each section?
In a 32-bit system, 2^32 = 4 gigabytes of virtual memory is possible, with 1 gigabyte reserved for the kernel and 3 gigabytes for user-space applications. Can each individual process have up to 3 gigabytes of virtual memory, or is 3 gigabytes the limit on the sum of all user-space applications (i.e. virtual memory size of (P1+P2+P3) <= 3 gigabytes)?
--
Learner
Using memory mapping for IPC works by mapping the same range of physical memory into two or more virtual address ranges in different processes. This works for communication because both processes are using the exact same memory cells (although they might "see" them differently, at different addresses). You change a value in one mapping, and it is instantly visible in the other mapping in a different process because it is the very same memory.
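A minimal sketch of this, assuming a Unix system where os.fork is available: the parent creates an anonymous shared mapping, the child writes a number into it, and the parent reads the number back. The function name share_value_with_child is made up. Note that here both processes see the page at the same virtual address because the child inherits the mapping; unrelated processes would more typically map the same file or shared-memory object, possibly at different addresses.

```python
import mmap
import os
import struct


def share_value_with_child(value):
    """Return the integer that a forked child wrote into a shared mapping."""
    # fileno -1 requests an anonymous mapping; on Unix the default flags are
    # MAP_SHARED, so parent and child keep using the same physical page.
    shared = mmap.mmap(-1, mmap.PAGESIZE)
    pid = os.fork()
    if pid == 0:
        # Child: write into the shared page, then exit immediately.
        shared[:8] = struct.pack("q", value)
        os._exit(0)
    # Parent: wait for the child, then read what it wrote.
    os.waitpid(pid, 0)
    (seen,) = struct.unpack("q", shared[:8])
    shared.close()
    return seen


if __name__ == "__main__":
    print(share_value_with_child(42))
```

With a private mapping instead of a shared one, the child's write would land in its own copy of the page and the parent would still read zero.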
Every process has its own independent stack and heap. The OS does not care about that at all, it only cares about pages. The heap and the stack are things that are implemented by the application (via the runtime). When you call a function like malloc, the allocator in the runtime either returns a block that it already had reserved earlier or one that it has recycled (you called free earlier), or it asks the OS to reserve some more memory (sbrk or mmap). When you first access this memory, the OS sees a page fault, verifies that you are allowed to access this location (because you've reserved it) and then provides a valid page.
Every process can use (as in "reserve") the whole available address space (3 GiB in your example). This does not interfere with any other process. Note that due to fragmentation and alignment, and because your executable and the stack take away a little bit, you will in practice not be able to allocate the full 3 GiB, but you can get close to it.
All processes together can use as much virtual memory as is available on the system (physical RAM plus swap space), but they can only use as much as there is physical memory available at the same time (minus a little bit for this and that, like unpageable kernel memory and such).

Resources