How kernel page table get initialized? - linux

I am following Gorman's virtual memory management book.
There is a section about kernel table page initialization which is said to be divided into two phases, bootstrapping and finalizing.
Here is what it says about the bootstrapping phase.
The assembler function startup_32() is responsible for enabling the paging unit in
arch/i386/kernel/head.S. While all normal kernel code in vmlinuz is compiled
with the base address at PAGE_OFFSET + 1MiB, the kernel is actually loaded beginning
at the first megabyte (0x00100000) of memory. The first megabyte is used
by some devices for communication with the BIOS and is skipped. The bootstrap
code in this file treats 1MiB as its base address by subtracting __PAGE_OFFSET from
any address until the paging unit is enabled. Therefore before the paging unit is
enabled, a page table mapping has to be established that translates the 8MiB of
physical memory to the virtual address PAGE_OFFSET.
Why we want to subtract __PAGE_OFFEST? For what purpose?
Why we have to do subtracting before the paging unit is enabled? Isn't that we always use subtracting for mapping kernel virtual address to physical memory address?
Why it is 8MB?
Thanks,

Since x86 code isn't generally position-independent, if it's compiled to execute at address X (__PAGE_OFFSET + 1MB) but loaded at address Y (1MB), all addresses inside of it need to be decremented by Y-X (__PAGE_OFFSET + 1MB - 1MB = __PAGE_OFFSET) for it to work.
For example, if there's an instruction to read a byte of memory from the beginning of the kernel, __PAGE_OFFSET + 1MB, the address is reduced by __PAGE_OFFSET and the actual read location becomes 1MB, exactly where the kernel starts in the memory.
When page translation is finally enabled, __PAGE_OFFSET can and, I believe, is effectively subtracted by the page translation mechanism by mapping a range of virtual addresses to a range of physical addresses that are smaller by __PAGE_OFFSET (that is, physical=virtual-__PAGE_OFFSET per the page tables).
Unless there's some additional kernel relocation involved, 8MB is likely just the size of the mapping range, sufficient to map the entire kernel.

Related

Is it possible to overwrite kernel code by replacing the page tables?

So it seems that you cannot modify kernel code because the PTE that points to it is marked as executable as opposed to writeable. I was wondering if you could overwrite kernel code using the following method? (this only applies to x86 and assumes we have root access so we run the following steps as a kernel module)
Read in the contents of the CR3 register
Use kmalloc to allocate memory big enough to replicate all the PTE and the PDE
Copy all the paging data into the newly allocated memory using the value obtained from the CR3 register
Mark the relevant pages as executable and writeable
Overwrite the CR3 register with a pointer to the memory we kmalloc'ed in step 2
At this point, assuming this all worked, wouldnt you be able to overwrite return addresses and other parts of the kernel code? Whereas before doing this we would be stopped from the paging mechanism protections?
3 . Copy all the paging data into the newly allocated memory using the value obtained from the CR3 register
5 . Overwrite the CR3 register with a pointer to the memory we kmalloc'ed in step 2
These two steps might not work:
CR3 gives you an physical address; however, for reading the page data you require a virtual address. It is not even guaranteed that the PTD is currently mapped (and therefore accessible).
And to overwrite the CR3 register you need to know the physical address of the memory you have allocated using kmalloc; however, you only know the virtual address.
However, you might use virt_to_phys and phys_to_virt to translate physical to virtual addresses.
Is it possible to overwrite kernel code ...?
I'm not sure, but the following attempt should work:
The page tables themselves should be read-write - at least the ones used by kmalloc.
Instead of copying the PTD and the page tables, you could allocate some memory using kmalloc which is 2 page sizes long (8 KiB if the "traditional" 4 KiB memory pages are used). This means that "your" memory block under all circumstances contains one complete memory page.
When you have the virtual addresses of the PTD and the page tables, you can re-map "your" memory page so it does no longer point to your "kmalloc memory" but to the kernel code you want to modify...
At this point, assuming this all worked, wouldn't you be able to overwrite return addresses and other parts of the kernel code?
I'm not sure if I understand your question correctly.
But a kernel module is part of the kernel - so nothing stops a kernel module from doing something completely stupid (intentionally or because of a bug).
For this reason you have to be very careful when programming kernel modules.
And because "root" has the ability to load kernel modules, it is important that hackers or malware never get "root" access. Otherwise malware could be injected into the kernel using insmod.

How exactly do kernel virtual addresses get translated to physical RAM?

On the surface, this appears to be a silly question. Some patience please.. :-)
Am structuring this qs into 2 parts:
Part 1:
I fully understand that platform RAM is mapped into the kernel segment; esp on 64-bit systems this will work well. So each kernel virtual address is indeed just an offset from physical memory (DRAM).
Also, it's my understanding that as Linux is a modern virtual memory OS, (pretty much) all addresses are treated as virtual addresses and must "go" via hardware - the TLB/MMU - at runtime and then get translated by the TLB/MMU via kernel paging tables. Again, easy to understand for user-mode processes.
HOWEVER, what about kernel virtual addresses? For efficiency, would it not be simpler to direct-map these (and an identity mapping is indeed setup from PAGE_OFFSET onwards). But still, at runtime, the kernel virtual address must go via the TLB/MMU and get translated right??? Is this actually the case? Or is kernel virtual addr translation just an offset calculation?? (But how can that be, as we must go via hardware TLB/MMU?). As a simple example, lets consider:
char *kptr = kmalloc(1024, GFP_KERNEL);
Now kptr is a kernel virtual address.
I understand that virt_to_phys() can perform the offset calculation and return the physical DRAM address.
But, here's the Actual Question: it can't be done in this manner via software - that would be pathetically slow! So, back to my earlier point: it would have to be translated via hardware (TLB/MMU).
Is this actually the case??
Part 2:
Okay, lets say this is the case, and we do use paging in the kernel to do this, we must of course setup kernel paging tables; I understand it's rooted at swapper_pg_dir.
(I also understand that vmalloc() unlike kmalloc() is a special case- it's a pure virtual region that gets backed by physical frames only on page fault).
If (in Part 1) we do conclude that kernel virtual address translation is done via kernel paging tables, then how exactly does the kernel paging table (swapper_pg_dir) get "attached" or "mapped" to a user-mode process?? This should happen in the context-switch code? How? Where?
Eg.
On an x86_64, 2 processes A and B are alive, 1 cpu.
A is running, so it's higher-canonical addr
0xFFFF8000 00000000 through 0xFFFFFFFF FFFFFFFF "map" to the kernel segment, and it's lower-canonical addr
0x0 through 0x00007FFF FFFFFFFF map to it's private userspace.
Now, if we context-switch A->B, process B's lower-canonical region is unique But
it must "map" to the same kernel of course!
How exactly does this happen? How do we "auto" refer to the kernel paging table when
in kernel mode? Or is that a wrong statement?
Thanks for your patience, would really appreciate a well thought out answer!
First a bit of background.
This is an area where there is a lot of potential variation between
architectures, however the original poster has indicated he is mainly
interested in x86 and ARM, which share several characteristics:
no hardware segments or similar partitioning of the virtual address space (when used by Linux)
hardware page table walk
multiple page sizes
physically tagged caches (at least on modern ARMs)
So if we restrict ourselves to those systems it keeps things simpler.
Once the MMU is enabled, it is never normally turned off. So all CPU
addresses are virtual, and will be translated to physical addresses
using the MMU. The MMU will first look up the virtual address in the
TLB, and only if it doesn't find it in the TLB will it refer to the
page table - the TLB is a cache of the page table - and so we can
ignore the TLB for this discussion.
The page table
describes the entire virtual 32 or 64 bit address space, and includes
information like:
whether the virtual address is valid
which mode(s) the processor must be in for it to be valid
special attributes for things like memory mapped hardware registers
and the physical address to use
Linux divides the virtual address space into two: the lower portion is
used for user processes, and there is a different virtual to physical
mapping for each process. The upper portion is used for the kernel,
and the mapping is the same even when switching between different user
processes. This keep things simple, as an address is unambiguously in
user or kernel space, the page table doesn't need to be changed when
entering or leaving the kernel, and the kernel can simply dereference
pointers into user space for the
current user process. Typically on 32bit processors the split is 3G
user/1G kernel, although this can vary. Pages for the kernel portion
of the address space will be marked as accessible only when the processor
is in kernel mode to prevent them being accessible to user processes.
The portion of the kernel address space which is identity mapped to RAM
(kernel logical addresses) will be mapped using big pages when possible,
which may allow the page table to be smaller but more importantly
reduces the number of TLB misses.
When the kernel starts it creates a single page table for itself
(swapper_pg_dir) which just describes the kernel portion of the
virtual address space and with no mappings for the user portion of the
address space. Then every time a user process is created a new page
table will be generated for that process, the portion which describes
kernel memory will be the same in each of these page tables. This could be
done by copying all of the relevant portion of swapper_pg_dir, but
because page tables are normally a tree structures, the kernel is
frequently able to graft the portion of the tree which describes the
kernel address space from swapper_pg_dir into the page tables for each
user process by just copying a few entries in the upper layer of the
page table structure. As well as being more efficient in memory (and possibly
cache) usage, it makes it easier to keep the mappings consistent. This
is one of the reasons why the split between kernel and user virtual
address spaces can only occur at certain addresses.
To see how this is done for a particular architecture look at the
implementation of pgd_alloc(). For example ARM
(arch/arm/mm/pgd.c) uses:
pgd_t *pgd_alloc(struct mm_struct *mm)
{
...
init_pgd = pgd_offset_k(0);
memcpy(new_pgd + USER_PTRS_PER_PGD, init_pgd + USER_PTRS_PER_PGD,
(PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
...
}
or
x86 (arch/x86/mm/pgtable.c) pgd_alloc() calls pgd_ctor():
static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
{
/* If the pgd points to a shared pagetable level (either the
ptes in non-PAE, or shared PMD in PAE), then just copy the
references from swapper_pg_dir. */
...
clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
swapper_pg_dir + KERNEL_PGD_BOUNDARY,
KERNEL_PGD_PTRS);
...
}
So, back to the original questions:
Part 1: Are kernel virtual addresses really translated by the TLB/MMU?
Yes.
Part 2: How is swapper_pg_dir "attached" to a user mode process.
All page tables (whether swapper_pg_dir or those for user processes)
have the same mappings for the portion used for kernel virtual
addresses. So as the kernel context switches between user processes,
changing the current page table, the mappings for the kernel portion
of the address space remain the same.
The kernel address space is mapped to a section of each process for example on 3:1 mapping after address 0xC0000000. If the user code try to access this address space it will generate a page fault and it is guarded by the kernel.
The kernel address space is divided into 2 parts, the logical address space and the virtual address space. It is defined by the constant VMALLOC_START. The CPU is using the MMU all the time, in user space and in kernel space (can't switch on/off).
The kernel virtual address space is mapped the same way as user space mapping. The logical address space is continuous and it is simple to translate it to physical so it can be done on demand using the MMU fault exception. That is the kernel is trying to access an address, the MMU generate fault , the fault handler map the page using macros __pa , __va and change the CPU pc register back to the previous instruction before the fault happened, now everything is ok. This process is actually platform dependent and in some hardware architectures it mapped the same way as user (because the kernel doesn't use a lot of memory).

Linux x86: Where is the real mode address space mapped to in protected kernel mode?

In Linux running on an x86 platform where is the real mode address space mapped to in protected kernel mode? In kernel mode, a thread can access the kernel address space directly. The kernel is in the lower 8MB, The page table is at a certain position, etc (as describe here). But where does the real mode address space go? Can it be accessed directly? For example the BIOS and BIOS addons (See here)?
(My x86-fu is a bit weak. I'll add some tags so that other people can (hopefully) correct me if I'm lying anywhere.)
Physical addresses are the same in real and protected mode. The only difference is in how you get from an address (offset) specified in an instruction to a physical address:
In real mode, the physical address is basically (segment_reg << 4) + offset.
In protected mode, the physical address is translate_via_page_table([segment_reg] + offset).
By [segment_reg] I mean the base address of the segment, looked up in the Global or Local Descriptor Table at the offset in segment_reg. translate_via_page_table() means the address translation done via paging (if enabled).
Looking here, it seems the BIOS ROM appears at physical addresses 0x000F0000-0x000FFFFF. To get at that memory in protected mode with paging, you would have to map it into the virtual address space somewhere by setting up correct page table entries. Assuming 4 KB pages (the usual case), mapping the entire range should require 16 ((0xFFFFF-0xF0000+1)/4096) entries.
To see how the Linux kernel does things, you could look into how e.g. /dev/mem, which allows reading of arbitrary physical addresses, is implemented. The implementation is in drivers/char/mem.c.
The following command (from e.g. this answer) will dump the memory range 0xC0000-0xFFFFF (meaning it includes the video BIOS too, per the memory map linked above):
$ dd if=/dev/mem bs=1k skip=768 count=256 > bios
1024*768 = 0xC0000, and 1024*(768+256) - 1 = 0xFFFFF, which gives the expected physical memory range.
Tracing things a bit, read_mem() in drivers/char/mem.c calls xlate_dev_mem_ptr(), which has an x86-specific implementation in arch/x86/mm/ioremap.c. The ioremap_cache() call in that function seems to be responsible for mapping in the page if needed.
Note that BIOS routines won't work in protected mode by the way. They assume the CPU is running in real mode.
For Linux x86 32 bits, the first 896MB of physical RAM is mapped to a contiguous block of virtual memory starting at virtual address 0xC0000000 to 0xF7FFFFFF. Virtual addresses from 0xF8000000 to 0xFFFFFFFF are assigned dynamically to various parts of the physical memory, so the kernel can have a window of 128MB mapped into any part of physical memory beyond the 896MB limit.
The kernel itself loads at physical address 1MB and up, leaving the first MB free. This first MB is used, for instance, to have DMA buffers that ISA devices needs to be there, because they use the 8237 DMA controller, which can only be mapped to such addresses.
So, reading from virtual memory address 0xC0000000 is actually reading from physical address 0x00000000 (provided the kernel has flagged that page as present)

High memory mappings in kernel virtual address space

The linear address beyond 896MB correspond to High memory region ZONE_HIGHMEM.
So the page allocator functions will not work on this region, since they give the linear address of directly mapped page frames in ZONE_NORMAL and ZONE_DMA.
I am confused about these lines specified in Undertanding linux Kernel:
What do they mean when they say "In 64 bit hardware platforms ZONE_HIGHMEM is always empty."
What does this highlighted statement mean: "The allocation of high-memory page frames is done only through alloc_pages() function. These functions do not return linear address since they do not exist. Instead the functions return linear address of the page descriptor of the first allocated page frame. These linear addresses always exist, because all page descriptors are allocated in low memory once and forever during kernel initialization."
What are these Page descriptors and does the 896MB already have all page descriptors of entire RAM.
The x86-32 kernel needs high memory to access more than 1G of physical memory, as it is impossible to permanently map more than 2^{32} addresses within a 32-bit address space and the kernel/user split is 1G/3G.
The x86-64 kernel has no such limitation, as the amount of physically-addressable memory (currently 256T) fits within its 64-bit address space and thus may always be permanently mapped.
High memory is a hack. Ideally you don't need it. Indeed, the point of x86-64 is to be able to directly address all the memory you could possibly want. Taken
from https://www.quora.com/Linux-Kernel/What-is-the-difference-between-high-memory-and-normal-memory
I think page descriptor means struct page. And considering the sizeof struct page. Yes all of them can be stored in ZONE_NORMAL

program life in terms of paged segmentation memory

I have a confusing notion about the process of segmentation & paging in x86 linux machines. Will be glad if some clarify all the steps involved from the start to the end.
x86 uses paged segmentation memory technique for memory management.
Can any one please explain what happens from the moment an executable .elf format file is loaded from hard disk in to main memory to the time it dies. when compiled the executable has different sections in it (text, data, stack, heap, bss). how will this be loaded ? how will they be set up under paged segmentation memory technique.
Wanted to know how the page tables get set up for the loaded program ? Wanted to know how GDT table gets set up. how the registers are loaded ? and why it is said that logical addresses (the ones that are processed by segmentation unit of MMU are 48 bits (16 bits of segment selector + 32 bit offset) when it is a bit 32 bit machine. how will other 16 bits be stored ? any thing accessed from ram must be 32 bits or 4 bytes how does the rest of 16 bits be accessed (to be loaded into segment registers) ?
Thanks in advance. the question can have a lot of things. but wanted to get clarification about the entire life cycle of an executable. Will be glad if some answers and pulls up a discussion on this.
Unix traditionally has implemented protection via paging. 286+ provides segmentation, and 386+ provides paging. Everyone uses paging, few make any real use of segmentation.
In x86, every memory operand has an implicit segment (so the address is really 16 bit selector + 32 bit offset), depending on the register used. So if you access [ESP + 8] the implied segment register is SS, if you access [ESI] the implied segment register is DS, if you access [EDI+4] the implied segment register is ES,... You can override this via segment prefix overrides.
Linux, and virtually every modern x86 OS, uses a flat memory model (or something similar). Under a flat memory model each segment provides access to the whole memory, with a base of 0 and a limit of 4Gb, so you don't have to worry about the complications segmentation brings about. Basically there are 4 segments: kernelspace code (RX), kernelspace data (RW), userspace code (RX), userspace data (RW).
An ELF file consists of some headers that pont to "program segments" and "sections". Section are used for linking. Program segments are used for loading. Program segments are mapped into memory via mmap(), this setups page-table entries with appropriate permissions.
Now, older x86 CPUs' paging mechanism only provided RW access control (read permission implies execute permission), while segmentation provided RWX access control. The end permission takes into account both segmentation and paging (e.g: RW (data segment) + R (read only page) = R (read only), while RX (code segment) + R (read only page) = RX (read and execute)).
So there are some patches that provide execution prevention via segmentation: e.g. OpenWall provided a non-executable stack by shrinking the code segment (the one with execute permission), and having special emulation in the page fault handler for anything that needed execution from a high memory address (e.g: GCC trampolines, self-modified code created on the stack to efficiently implement nested functions).
There's no such thing as paged segmentation, not in the official documentation at least. There are two different mechanisms working together and more or less independently of each other:
Translation of a logical address of the form 16-bit segment selector value:16/32/64-bit segment offset value, that is, a pair of 2 numbers into a 32/64-bit virtual address.
Translation of the virtual address into a 32/64-bit physical address.
Logical addresses is what your applications operate directly with. Then follows the above 2-step translation of them into what the RAM will understand, physical addresses.
In the first step the GDT (or it can be LDT, depends on the selector value) is indexed by the selector to find the relevant segment's base address and size. The virtual address will be the sum of the segment base address and the offset. The segment size and other things in segment descriptors are needed to provide protection.
In the second step the page tables are indexed by different parts of the virtual address and the last indexed table in the hierarchy gives the final, physical address that goes out on the address bus for the RAM to see. Just like with segment descriptors, page table entries contain not only addresses but also protection control bits.
That's about it on the mechanisms.
Now, in many x86 OSes the segment selectors that are used for applications are fixed, they are the same in all of them, they never change and they point to segment descriptors that have base addresses equal to 0 and sizes equal to the possible maximum (e.g. 4GB in non-64-bit modes). Such a GDT setup effectively means that the first step does no useful work and the offset part of the logical address translates into numerically equal virtual address.
This makes the segment selector values practically useless. They still have to be loaded into the CPU's segment registers (in non-64-bit modes into at least CS, SS, DS and ES), but beyond that point they can be forgotten about.
This all (except Linux-related details and the ELF format) is explained in or directly follows from Intel's and AMD's x86 CPU manuals. You'll find many more details there.
Perhaps read the Assembly HOWTO. When a Linux process starts to execute an ELF executable using the execve system call, it is essentially (sort of) mmap-ing some segments (and initializing registers, and a tiny part of the stack). Read also the SVR4 x86 ABI supplement and its x86-64 variant. Don't forget that a Linux process only see memory mapping for its address space and only cares about virtual memory
There are many good books on Operating Systems (=O.S.) kernels, notably by A.Tanenbaum & by M.Bach, and some on the linux kernel
NB: segment registers are nearly (almost) unused on Linux.

Resources