How to read kernel page table?

Linux splits the 32-bit virtual address space into two parts: 0x00000000 ~ 0xBFFFFFFF for user space and 0xC0000000 ~ 0xFFFFFFFF for the kernel. As I understand it, all processes share the same kernel virtual space, 0xC0000000 ~ 0xFFFFFFFF.
I am trying to lock a TLB entry for a system call on the ARM architecture. For example, for the raw_spin system call, I got the virtual address 0xc04d35b0 from System.map, and now I want to find the corresponding physical address so that I can lock one TLB entry.
My question is: how can I read the kernel page table? Thanks!

Related

what is kernel mapping in linux?

What is kernel mapping? What are permanent mappings and temporary mappings? What is a window in this context? I went through the code and explanations of this but could not understand it.

I'm assuming you're talking about memory mapping in the Linux kernel.
Memory mapping is the process of mapping kernel address space directly into a user process's address space.
Types of addresses :
User virtual address : These are the regular addresses seen by user-space programs
Physical addresses : The addresses used between the processor and the system’s memory.
Bus addresses : The addresses used between peripheral buses and memory. Often, they are the same as the physical addresses used by the processor, but that is not necessarily the case.
Kernel logical addresses : These make up the normal address space of the kernel.
Kernel virtual addresses : Similar to logical addresses in that they map a kernel-space address to a physical address. Unlike logical addresses, they do not necessarily have a linear, one-to-one mapping to physical addresses.
High and Low Memory :
Low memory : Memory for which logical addresses exist in kernel space. On almost every system you will likely encounter, all memory is low memory.
High memory : Memory for which logical addresses do not exist, because it is beyond the address range set aside for kernel virtual addresses. This means the kernel has to use temporary mappings for the pieces of physical memory that it wants to access.
The kernel splits the virtual address space into two parts: user address space and kernel address space. The kernel's code and data structures must fit into the kernel part, but the biggest consumer of kernel address space is virtual mappings for physical memory. The kernel needs its own virtual address for any memory it must touch directly. So the maximum amount of physical memory the kernel could handle was the amount that could be mapped into the kernel's portion of the virtual address space, minus the space used by the kernel code itself.
Temporary mapping : When a mapping must be created but the current context cannot sleep, the kernel provides temporary mappings (also called atomic mappings). The kernel can atomically map a high memory page into one of the reserved mappings (which can hold temporary mappings). Consequently, a temporary mapping can be used in places that cannot sleep, such as interrupt handlers, because obtaining the mapping never blocks.
Ref :
kernel.org/doc/Documentation/vm/highmem.txt
static.lwn.net/images/pdf/LDD3/ch15.pdf
man mmap
notes.shichao.io/lkd/ch12/

A full answer would be very long; for details, refer (for example) to Linux Kernel Addressing or Understanding the Linux Kernel (pages 306-). These concepts relate to the way address spaces are organized in Linux: first, how kernel space is mapped into each user address space (which simplifies switching between user and kernel mode), and second, how physical memory is mapped into kernel space (because the kernel has to manage physical memory).
Beware that this is of no concern in modern 64bit architectures.

How exactly do kernel virtual addresses get translated to physical RAM?

On the surface, this appears to be a silly question. Some patience please.. :-)
I am structuring this question into 2 parts:
Part 1:
I fully understand that platform RAM is mapped into the kernel segment; esp on 64-bit systems this will work well. So each kernel virtual address is indeed just an offset from physical memory (DRAM).
Also, it's my understanding that as Linux is a modern virtual memory OS, (pretty much) all addresses are treated as virtual addresses and must "go" via hardware - the TLB/MMU - at runtime and then get translated by the TLB/MMU via kernel paging tables. Again, easy to understand for user-mode processes.
HOWEVER, what about kernel virtual addresses? For efficiency, would it not be simpler to direct-map these (and an identity mapping is indeed setup from PAGE_OFFSET onwards). But still, at runtime, the kernel virtual address must go via the TLB/MMU and get translated right??? Is this actually the case? Or is kernel virtual addr translation just an offset calculation?? (But how can that be, as we must go via hardware TLB/MMU?). As a simple example, lets consider:
char *kptr = kmalloc(1024, GFP_KERNEL);
Now kptr is a kernel virtual address.
I understand that virt_to_phys() can perform the offset calculation and return the physical DRAM address.
But, here's the Actual Question: it can't be done in this manner via software - that would be pathetically slow! So, back to my earlier point: it would have to be translated via hardware (TLB/MMU).
Is this actually the case??
Part 2:
Okay, lets say this is the case, and we do use paging in the kernel to do this, we must of course setup kernel paging tables; I understand it's rooted at swapper_pg_dir.
(I also understand that vmalloc() unlike kmalloc() is a special case- it's a pure virtual region that gets backed by physical frames only on page fault).
If (in Part 1) we do conclude that kernel virtual address translation is done via kernel paging tables, then how exactly does the kernel paging table (swapper_pg_dir) get "attached" or "mapped" to a user-mode process?? This should happen in the context-switch code? How? Where?
Eg.
On an x86_64, 2 processes A and B are alive, 1 cpu.
A is running, so its higher-canonical addr
0xFFFF8000 00000000 through 0xFFFFFFFF FFFFFFFF "map" to the kernel segment, and its lower-canonical addr
0x0 through 0x00007FFF FFFFFFFF map to its private userspace.
Now, if we context-switch A->B, process B's lower-canonical region is unique But
it must "map" to the same kernel of course!
How exactly does this happen? How do we "auto" refer to the kernel paging table when
in kernel mode? Or is that a wrong statement?
Thanks for your patience, would really appreciate a well thought out answer!

First a bit of background.
This is an area where there is a lot of potential variation between
architectures, however the original poster has indicated he is mainly
interested in x86 and ARM, which share several characteristics:
no hardware segments or similar partitioning of the virtual address space (when used by Linux)
hardware page table walk
multiple page sizes
physically tagged caches (at least on modern ARMs)
So if we restrict ourselves to those systems it keeps things simpler.
Once the MMU is enabled, it is never normally turned off. So all CPU
addresses are virtual, and will be translated to physical addresses
using the MMU. The MMU will first look up the virtual address in the
TLB, and only if it doesn't find it in the TLB will it refer to the
page table - the TLB is a cache of the page table - and so we can
ignore the TLB for this discussion.
The page table
describes the entire virtual 32 or 64 bit address space, and includes
information like:
whether the virtual address is valid
which mode(s) the processor must be in for it to be valid
special attributes for things like memory mapped hardware registers
and the physical address to use
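As a concrete illustration of the fields listed above, here is a minimal sketch that decodes a page table entry. It assumes the 32-bit x86 non-PAE entry format with 4 KiB pages; ARM and 64-bit formats differ. The names pte_info and decode_pte are made up for this example and are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag bits of a 32-bit x86 (non-PAE) page table entry. */
#define PTE_PRESENT    (1u << 0)   /* the virtual address is valid */
#define PTE_WRITABLE   (1u << 1)   /* writes are allowed */
#define PTE_USER       (1u << 2)   /* accessible from user mode as well */
#define PTE_CACHE_OFF  (1u << 4)   /* PCD: caching disabled, e.g. for MMIO */
#define PTE_FRAME_MASK 0xFFFFF000u /* physical address of the 4 KiB frame */

/* Hypothetical helper type: the decoded view of one entry. */
struct pte_info {
    bool valid;      /* whether the virtual address is valid */
    bool writable;   /* whether stores are permitted */
    bool user_ok;    /* false means kernel mode only */
    bool uncached;   /* special attribute for device memory */
    uint32_t frame;  /* physical address to use (page aligned) */
};

/* Split a raw entry into the pieces of information it carries. */
static struct pte_info decode_pte(uint32_t pte)
{
    struct pte_info info = {
        .valid    = (pte & PTE_PRESENT) != 0,
        .writable = (pte & PTE_WRITABLE) != 0,
        .user_ok  = (pte & PTE_USER) != 0,
        .uncached = (pte & PTE_CACHE_OFF) != 0,
        .frame    = pte & PTE_FRAME_MASK,
    };
    return info;
}
```

For example, decode_pte(0x00851003) describes a present, writable, kernel-only page backed by the frame at physical address 0x00851000.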
Linux divides the virtual address space into two: the lower portion is
used for user processes, and there is a different virtual to physical
mapping for each process. The upper portion is used for the kernel,
and the mapping is the same even when switching between different user
processes. This keeps things simple, as an address is unambiguously in
user or kernel space, the page table doesn't need to be changed when
entering or leaving the kernel, and the kernel can simply dereference
pointers into user space for the
current user process. Typically on 32bit processors the split is 3G
user/1G kernel, although this can vary. Pages for the kernel portion
of the address space will be marked as accessible only when the processor
is in kernel mode to prevent them being accessible to user processes.
The portion of the kernel address space which is linearly mapped to RAM
(kernel logical addresses) will be mapped using big pages when possible,
which may allow the page table to be smaller but more importantly
reduces the number of TLB misses.
When the kernel starts it creates a single page table for itself
(swapper_pg_dir) which just describes the kernel portion of the
virtual address space and with no mappings for the user portion of the
address space. Then every time a user process is created a new page
table will be generated for that process, the portion which describes
kernel memory will be the same in each of these page tables. This could be
done by copying all of the relevant portion of swapper_pg_dir, but
because page tables are normally tree structures, the kernel is
frequently able to graft the portion of the tree which describes the
kernel address space from swapper_pg_dir into the page tables for each
user process by just copying a few entries in the upper layer of the
page table structure. As well as being more efficient in memory (and possibly
cache) usage, it makes it easier to keep the mappings consistent. This
is one of the reasons why the split between kernel and user virtual
address spaces can only occur at certain addresses.
To see how this is done for a particular architecture look at the
implementation of pgd_alloc(). For example ARM
(arch/arm/mm/pgd.c) uses:
pgd_t *pgd_alloc(struct mm_struct *mm)
{
    ...
    init_pgd = pgd_offset_k(0);
    memcpy(new_pgd + USER_PTRS_PER_PGD, init_pgd + USER_PTRS_PER_PGD,
           (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
    ...
}
or
x86 (arch/x86/mm/pgtable.c) pgd_alloc() calls pgd_ctor():
static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
{
    /* If the pgd points to a shared pagetable level (either the
       ptes in non-PAE, or shared PMD in PAE), then just copy the
       references from swapper_pg_dir. */
    ...
    clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
                    swapper_pg_dir + KERNEL_PGD_BOUNDARY,
                    KERNEL_PGD_PTRS);
    ...
}
So, back to the original questions:
Part 1: Are kernel virtual addresses really translated by the TLB/MMU?
Yes.
Part 2: How is swapper_pg_dir "attached" to a user mode process?
All page tables (whether swapper_pg_dir or those for user processes)
have the same mappings for the portion used for kernel virtual
addresses. So as the kernel context switches between user processes,
changing the current page table, the mappings for the kernel portion
of the address space remain the same.
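The grafting described above can be modelled in user space. This is only a toy sketch: the array toy_swapper_pg_dir and the function toy_pgd_alloc are invented for illustration, and the constants assume a 32-bit layout with 1024 top-level entries and a 3G/1G split, so the kernel part starts at entry 0xC0000000 >> 22 = 768.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PTRS_PER_PGD        1024 /* assumed: each entry covers 4 MiB */
#define KERNEL_PGD_BOUNDARY 768  /* assumed: 0xC0000000 >> 22 */

typedef uint32_t pgd_entry_t;

/* Toy stand-in for the kernel's own top-level table. */
static pgd_entry_t toy_swapper_pg_dir[PTRS_PER_PGD];

/* Build a top-level table for a new process: the user part starts
 * empty, the kernel part is copied from the kernel's table. Because
 * only top-level entries are copied, the lower-level tables they
 * point to are shared by every process. */
static pgd_entry_t *toy_pgd_alloc(void)
{
    pgd_entry_t *pgd = calloc(PTRS_PER_PGD, sizeof(*pgd));

    if (!pgd)
        return NULL;
    memcpy(pgd + KERNEL_PGD_BOUNDARY,
           toy_swapper_pg_dir + KERNEL_PGD_BOUNDARY,
           (PTRS_PER_PGD - KERNEL_PGD_BOUNDARY) * sizeof(*pgd));
    return pgd;
}
```

Two tables returned by toy_pgd_alloc() can differ in entries 0 to 767 but always agree in entries 768 to 1023, which is why switching between them leaves the kernel mappings unchanged.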

The kernel address space is mapped into a section of each process's address space, for example above address 0xC0000000 with a 3:1 split. If user code tries to access this address range, it generates a page fault, which the kernel handles; this is how the range is protected.
The kernel address space is divided into two parts, the logical address space and the virtual address space; the boundary is defined by the constant VMALLOC_START. The CPU uses the MMU all the time, in both user space and kernel space (it can't be switched on and off).
The kernel virtual address space is mapped the same way as user space. The logical address space is contiguous and simple to translate to physical addresses, so it can be mapped on demand using the MMU fault exception: the kernel tries to access an address, the MMU generates a fault, the fault handler maps the page using the macros __pa and __va and sets the CPU's pc register back to the faulting instruction, and then the access succeeds. This process is platform dependent, and on some hardware architectures the logical address space is mapped the same way as user space (because the kernel doesn't use a lot of memory).

How does ARM Linux maintain segments?

Linux translates flat virtual addresses to physical addresses using the MMU. In the Linux virtual address space there are many types of segments:
Kernel space
User stack
Memory mapping region
User heap
Bss segment
Data segment
Text segment
How does Linux maintain these segments (aka sections)? Where are the base addresses and sizes of these segments stored? Registers, GDT/LDT, mm_struct or other data structures in kernel?
Appreciate any help.

GDT/LDT is an x86-family feature. Kernel space is translated via the kernel part of the page tables, user space via the user part. Page tables live in main memory. mm_struct is the structure the Linux kernel uses to describe a memory layout. It is per-process.
User stack
User heap
Bss segment
Data segment
Text segment
This layout is described in mm_struct. mm_struct also contains a ->pgd field, which is the root page table pointer (loaded into TTBR0/TTBR1 on ARM).
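To make the idea of a layout described by a per-process structure more concrete, here is a simplified user-space sketch. The field names follow those in the kernel's mm_struct (start_code, end_code, start_data, end_data, start_brk, brk, start_stack), but struct toy_mm and toy_classify are hypothetical and leave out almost everything the real structure holds.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the layout fields of mm_struct. */
struct toy_mm {
    uintptr_t start_code, end_code; /* text segment */
    uintptr_t start_data, end_data; /* initialized data */
    uintptr_t start_brk, brk;       /* heap: from start_brk up to brk */
    uintptr_t start_stack;          /* where the user stack starts */
};

enum toy_region {
    REGION_TEXT,
    REGION_DATA,
    REGION_HEAP,
    REGION_OTHER /* stack, mmap regions, ...: tracked as separate VMAs */
};

/* Decide which region an address belongs to, using only the
 * boundaries stored in the structure (no registers, no GDT/LDT). */
static enum toy_region toy_classify(const struct toy_mm *mm, uintptr_t addr)
{
    if (addr >= mm->start_code && addr < mm->end_code)
        return REGION_TEXT;
    if (addr >= mm->start_data && addr < mm->end_data)
        return REGION_DATA;
    if (addr >= mm->start_brk && addr < mm->brk)
        return REGION_HEAP;
    return REGION_OTHER;
}
```

The point of the sketch is that segment boundaries are ordinary data in a kernel structure, while the hardware only sees the page tables reached through ->pgd.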

Linux x86: Where is the real mode address space mapped to in protected kernel mode?

In Linux running on an x86 platform, where is the real mode address space mapped to in protected kernel mode? In kernel mode, a thread can access the kernel address space directly. The kernel is in the lower 8MB, the page table is at a certain position, etc. (as described here). But where does the real mode address space go? Can it be accessed directly? For example, the BIOS and BIOS add-ons (see here)?

(My x86-fu is a bit weak. I'll add some tags so that other people can (hopefully) correct me if I'm lying anywhere.)
Physical addresses are the same in real and protected mode. The only difference is in how you get from an address (offset) specified in an instruction to a physical address:
In real mode, the physical address is basically (segment_reg << 4) + offset.
In protected mode, the physical address is translate_via_page_table([segment_reg] + offset).
By [segment_reg] I mean the base address of the segment, looked up in the Global or Local Descriptor Table at the offset in segment_reg. translate_via_page_table() means the address translation done via paging (if enabled).
Looking here, it seems the BIOS ROM appears at physical addresses 0x000F0000-0x000FFFFF. To get at that memory in protected mode with paging, you would have to map it into the virtual address space somewhere by setting up correct page table entries. Assuming 4 KB pages (the usual case), mapping the entire range should require 16 ((0xFFFFF-0xF0000+1)/4096) entries.
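The two calculations above are easy to check in code. The following is a small sketch, with hypothetical helper names, of the real mode address formula and of the number of 4 KB page table entries needed to cover a physical range.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* 4 KB pages */

/* Real mode: physical address = (segment << 4) + offset. */
static uint32_t real_mode_phys(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Number of 4 KB pages (and so page table entries) needed to map
 * the inclusive physical range [first, last]. */
static uint32_t pages_to_map(uint32_t first, uint32_t last)
{
    assert(first <= last);
    return (last >> PAGE_SHIFT) - (first >> PAGE_SHIFT) + 1;
}
```

With these, real_mode_phys(0xF000, 0xFFF0) gives 0xFFFF0, and pages_to_map(0xF0000, 0xFFFFF) gives the 16 entries computed above.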
To see how the Linux kernel does things, you could look into how e.g. /dev/mem, which allows reading of arbitrary physical addresses, is implemented. The implementation is in drivers/char/mem.c.
The following command (from e.g. this answer) will dump the memory range 0xC0000-0xFFFFF (meaning it includes the video BIOS too, per the memory map linked above):
$ dd if=/dev/mem bs=1k skip=768 count=256 > bios
1024*768 = 0xC0000, and 1024*(768+256) - 1 = 0xFFFFF, which gives the expected physical memory range.
Tracing things a bit, read_mem() in drivers/char/mem.c calls xlate_dev_mem_ptr(), which has an x86-specific implementation in arch/x86/mm/ioremap.c. The ioremap_cache() call in that function seems to be responsible for mapping in the page if needed.
Note that BIOS routines won't work in protected mode by the way. They assume the CPU is running in real mode.

On 32-bit x86 Linux, the first 896MB of physical RAM is mapped to a contiguous block of virtual memory from 0xC0000000 to 0xF7FFFFFF. Virtual addresses from 0xF8000000 to 0xFFFFFFFF are assigned dynamically to various parts of physical memory, so the kernel has a 128MB window that can be mapped onto any part of physical memory beyond the 896MB limit.
The kernel itself loads at physical address 1MB and up, leaving the first MB free. This first MB is used, for instance, for DMA buffers that ISA devices need to have there, because they use the 8237 DMA controller, which can only address such low memory.
So, reading from virtual memory address 0xC0000000 is actually reading from physical address 0x00000000 (provided the kernel has flagged that page as present).
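The layout described above can be written down as a small sketch. The constants assume the default 3G/1G split and the 896MB direct mapping mentioned in this answer; the enum and function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET 0xC0000000u /* start of kernel virtual space */
#define LOWMEM_SIZE 0x38000000u /* 896MB directly mapped */

enum addr_kind {
    ADDR_USER,    /* below 0xC0000000 */
    ADDR_LOWMEM,  /* 0xC0000000 .. 0xF7FFFFFF, fixed offset mapping */
    ADDR_DYNAMIC  /* 0xF8000000 .. 0xFFFFFFFF, the 128MB window */
};

/* Classify a 32-bit virtual address according to the layout above. */
static enum addr_kind classify(uint32_t va)
{
    if (va < PAGE_OFFSET)
        return ADDR_USER;
    if (va - PAGE_OFFSET < LOWMEM_SIZE)
        return ADDR_LOWMEM;
    return ADDR_DYNAMIC;
}

/* Only valid for the directly mapped region: physical = virtual - 3GB. */
static uint32_t lowmem_to_phys(uint32_t va)
{
    assert(classify(va) == ADDR_LOWMEM);
    return va - PAGE_OFFSET;
}
```

Addresses in the dynamic window have no such formula; their physical address can only be found by consulting the page tables.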

x86 paging in linux kernel with mmu

On the x86 architecture, Linux kernel 2.6.x, 32-bit system:
I understand that the virtual address range 0xC0000000 ~ 0xFFFFFFFF
is reserved for the kernel,
and that such a virtual address can be translated to a physical address by
subtracting 0xC0000000.
However, I think that even though the result is the same, the MMU will translate
a kernel virtual address (such as 0xC0851000) to a physical address by walking the page table,
like this:
CR3 -> page directory -> page table -> PFN.
Am I correct? Please correct me if I'm wrong.
I need to develop a hardware-based kernel monitor for a 32-bit x86 Linux system,
so I need to understand this.
Please help.

For kernel logical addresses, you are correct. Kernel virtual addresses, like memory allocated by vmalloc, however, do not necessarily have the one-to-one mapping to physical addresses that characterizes the logical address space.
Just bear in mind that kernel logical addresses aren't always translated to physical addresses by subtracting an offset (that's true on x86 but not on, say, AVR32).

"and this virtual address can be translated to physical address by
subtracting 0xC0000000"
Since the page tables for kernel virtual addresses are configured that way, people use the shortcut you mentioned.
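The CR3 -> page directory -> page table -> PFN walk from the question can be modelled in user space to show why it gives the same result as the subtraction. This is a toy sketch assuming the 32-bit non-PAE format with 4 KB pages only (it ignores the large pages the kernel may use for this region); struct toy_mmu, toy_map and toy_walk are invented names, and the tables are ordinary arrays rather than real physical memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRIES     1024
#define PTE_PRESENT 1u
#define FRAME_MASK  0xFFFFF000u

/* Toy model: page_dir plays the role of the table CR3 points to.
 * page_tables[i] is the second-level table behind directory entry i
 * (real hardware finds it via the physical address in the entry). */
struct toy_mmu {
    uint32_t page_dir[ENTRIES];
    uint32_t *page_tables[ENTRIES];
};

/* Install a 4 KB mapping va -> pa using the caller-supplied table. */
static void toy_map(struct toy_mmu *mmu, uint32_t *table,
                    uint32_t va, uint32_t pa)
{
    uint32_t pd_index = va >> 22;

    mmu->page_dir[pd_index] = PTE_PRESENT;
    mmu->page_tables[pd_index] = table;
    table[(va >> 12) & 0x3FF] = (pa & FRAME_MASK) | PTE_PRESENT;
}

/* Two-level walk. Returns 0 and stores the physical address in *pa,
 * or returns -1 where real hardware would raise a page fault. */
static int toy_walk(const struct toy_mmu *mmu, uint32_t va, uint32_t *pa)
{
    uint32_t pd_index = va >> 22;           /* bits 31..22 */
    uint32_t pt_index = (va >> 12) & 0x3FF; /* bits 21..12 */
    uint32_t offset = va & 0xFFF;           /* bits 11..0  */
    uint32_t pte;

    if (!(mmu->page_dir[pd_index] & PTE_PRESENT))
        return -1;
    pte = mmu->page_tables[pd_index][pt_index];
    if (!(pte & PTE_PRESENT))
        return -1;
    *pa = (pte & FRAME_MASK) | offset;
    return 0;
}
```

If the table maps 0xC0851000 to 0x00851000, walking 0xC0851234 yields 0x00851234, the same value as 0xC0851234 - 0xC0000000. The subtraction works only because the entries were filled in that way.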
