Is kernel space mapped into user space on Linux x86?

It seems that on 32-bit Windows, the kernel will reserve 1G of virtual memory out of the total 4G virtual address space and map some of the kernel space into this 1G region.
So my questions are:
Is there any similar situation on 32-bit Linux?
If so, how can we see the whole memory layout?
I think
cat /proc/<pid>/maps
can only show the user-space layout of a given process.
Thank you!

Actually, on 32-bit Windows, without the /3G boot option, the kernel is mapped in the top 2GB of linear address space, leaving 2GB for the user process.
Linux does a similar thing, but it maps the kernel in the top 1GB of linear space, thus leaving 3GB for the user process.
I don't know if you can peek at the entire memory layout by just using the /proc filesystem. For a lab I designed for my students, I created a tiny device driver that allows a user to peek at a physical memory address, and to get the contents of several control registers, such as CR3 (the page directory base address).
By using these two operations, one can walk through the page directory of the current process (the one doing this operation) and see which pages are present, which ones are owned by the user and the kernel, or just by the kernel, which ones are read/write or read only, etc. With that information, the students have to display a map showing memory usage, including kernel space.
Take a look at this PDF. It's the compiled version of all the labs we did in my course.
http://www.atc.us.es/asignaturas/tpbn/PracticasTPBN2011.pdf
On page 36 of the PDF (page 30 of the document) you will see what a memory map looks like. This is the result of doing exercise #3.2 from lab #3.
The text is in Spanish, but I'm sure you can use a translator or something like that if there are things you cannot understand. This lab assumes the student has previously read about how the paging system works and how to interpret the layout of the directory and page entries.
The map is like this: a 16x64 block. Each cell in the block represents 4MB of the current process's virtual address space. The map should really be three-dimensional, as there are 4MB regions that are described by a page table with 1024 entries (pages), and not all pages may be present; but to keep the map clear, the exercise requires the user to collapse these regions, showing the contents of the first page entry that describes a present page, in the hope that all subsequent pages in that page table share the same attributes (which may or may not be actually true).
This map is used with 2.6.x kernels in which PAE is not used and PSE is used (PAE and PSE being two bit fields of control register CR4). PAE enables 2MB pages and PSE enables 4MB pages. 4KB pages are always available.
. : PDE not present, or page table empty.
X : 4MB page, supervisor.
R : 4MB page, user, read only.
* : 4MB page, user, read/write.
x : Page table with at least one entry describing a supervisor page.
r : Page table with at least one entry describing a user page, read only.
+ : Page table with at least one entry describing a user page, read/write.
................................r...............................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
...............................+..............................+.
xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX..x...........................xx
You can see there is a vast space of 3GB of memory, almost empty in this case (the process is just a little C application, and uses less than 4MB, all contained in a page table, whose first present page is a read only page, assumed to be part of the program code, or maybe static strings).
Near the 3GB border there are two small regions read/write, which may belong to shared libraries loaded by the user program.
The last 4 rows (256 directory entries) belong almost entirely to the kernel. 224 of those entries are actually present and in use; they map the first 896MB of physical memory, which is the space where the kernel lives. The last 32 entries are used by the kernel to access physical memory beyond the 896MB mark on systems with more than 896MB of RAM.
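To make the legend concrete, here is a minimal, user-space-only sketch (plain C, hypothetical helper name) of how a single 32-bit page-directory entry could be classified into those characters, assuming non-PAE paging with PSE enabled; the actual lab exercise reads the entries from physical memory through the custom driver described above.

#include <stdint.h>
#include <stdio.h>

/* Classify one 32-bit PDE into the map legend (hypothetical helper).
 * Bit 0 = Present, bit 1 = Read/Write, bit 2 = User/Supervisor,
 * bit 7 = PS (4MB page when PSE is enabled). */
static char classify_pde(uint32_t pde)
{
    if (!(pde & 0x01))                      /* not present */
        return '.';
    if (pde & 0x80) {                       /* 4MB page */
        if (!(pde & 0x04))                  /* supervisor */
            return 'X';
        return (pde & 0x02) ? '*' : 'R';    /* user RW / user RO */
    }
    /* Otherwise the entry points to a page table; a full walk would
     * inspect its 1024 PTEs to choose between 'x', 'r' and '+'. */
    return 'x';
}

int main(void)
{
    /* 0x00000183: present, read/write, supervisor, 4MB page -> 'X' */
    printf("%c %c\n", classify_pde(0x00000183u), classify_pde(0));
    return 0;
}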

Is there any similar situation on 32-bit Linux?
Yes. On 32-bit Linux, by default, the kernel reserves the high quarter of the address space (the 1GB from 0xC0000000 to the top of the address space) for its own use.
If so, how can we see the whole memory layout?
You can't. /proc/pid/maps only displays mappings which are present in userspace. Kernel memory is not accessible from userspace applications, so it is not shown.
Keep in mind the reason why this arrangement is used - while the kernel is active, it needs to be able to install its own mappings while still keeping userspace mappings active (so that, for instance, it can copy data from or to userspace). It accomplishes this by reserving that high memory range for itself.
The locations of memory mappings within the kernel are not relevant to anything besides the kernel itself, so they are not exposed to userspace at all, except by accident or in some debug messages.
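As an illustration of the copy-to/from-userspace point above, here is a hedged sketch (hypothetical module and device names) of a tiny misc-device write handler: because the kernel occupies the high part of the same address space as the calling process, it can copy from the user buffer with copy_from_user() without switching page tables.

#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/fs.h>
#include <linux/uaccess.h>

static ssize_t demo_write(struct file *f, const char __user *ubuf,
                          size_t len, loff_t *off)
{
    char kbuf[64];
    size_t n = min(len, sizeof(kbuf) - 1);

    /* We run in kernel mode here, but the calling process's user
     * mappings are still installed, so the checked copy_from_user()
     * helper can read the user buffer directly. */
    if (copy_from_user(kbuf, ubuf, n))
        return -EFAULT;
    kbuf[n] = '\0';
    pr_info("demo_copy: got %zu bytes: %s\n", n, kbuf);
    return len;
}

static const struct file_operations demo_fops = {
    .owner = THIS_MODULE,
    .write = demo_write,
};

static struct miscdevice demo_dev = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = "demo_copy",
    .fops  = &demo_fops,
};

module_misc_device(demo_dev);
MODULE_LICENSE("GPL");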

Related

Is it possible to overwrite kernel code by replacing the page tables?

So it seems that you cannot modify kernel code because the PTE that points to it is marked as executable as opposed to writeable. I was wondering if you could overwrite kernel code using the following method (this only applies to x86 and assumes we have root access, so we run the following steps as a kernel module):
1. Read in the contents of the CR3 register
2. Use kmalloc to allocate memory big enough to replicate all the PTEs and the PDE
3. Copy all the paging data into the newly allocated memory using the value obtained from the CR3 register
4. Mark the relevant pages as executable and writeable
5. Overwrite the CR3 register with a pointer to the memory we kmalloc'ed in step 2
At this point, assuming this all worked, wouldn't you be able to overwrite return addresses and other parts of the kernel code, whereas before doing this we would be stopped by the paging mechanism's protections?
3. Copy all the paging data into the newly allocated memory using the value obtained from the CR3 register
5. Overwrite the CR3 register with a pointer to the memory we kmalloc'ed in step 2
These two steps might not work:
CR3 gives you a physical address; however, for reading the page data you require a virtual address. It is not even guaranteed that the PTD is currently mapped (and therefore accessible).
And to overwrite the CR3 register you need to know the physical address of the memory you have allocated using kmalloc; however, you only know the virtual address.
However, you might use virt_to_phys and phys_to_virt to translate between physical and virtual addresses.
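A minimal sketch of those helpers in a module (hypothetical names; only valid for lowmem/kmalloc addresses, not for vmalloc or user pointers):

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/io.h>        /* virt_to_phys(), phys_to_virt() */

static int __init v2p_init(void)
{
    void *buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
    phys_addr_t pa;

    if (!buf)
        return -ENOMEM;

    /* kmalloc() returns a kernel *virtual* address; the physical
     * address (what CR3 and the PDEs contain) is obtained by the
     * fixed offset translation done by virt_to_phys(). */
    pa = virt_to_phys(buf);
    pr_info("virt=0x%lx phys=0x%llx\n",
            (unsigned long)buf, (unsigned long long)pa);

    /* The reverse direction also works for directly mapped memory. */
    WARN_ON(phys_to_virt(pa) != buf);

    kfree(buf);
    return 0;
}

static void __exit v2p_exit(void) { }

module_init(v2p_init);
module_exit(v2p_exit);
MODULE_LICENSE("GPL");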
Is it possible to overwrite kernel code ...?
I'm not sure, but the following attempt should work:
The page tables themselves should be read-write - at least the ones used by kmalloc.
Instead of copying the PTD and the page tables, you could allocate some memory using kmalloc which is 2 page sizes long (8 KiB if the "traditional" 4 KiB memory pages are used). This means that "your" memory block under all circumstances contains one complete memory page.
When you have the virtual addresses of the PTD and the page tables, you can re-map "your" memory page so it does no longer point to your "kmalloc memory" but to the kernel code you want to modify...
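For the re-mapping idea above, here is a hedged, inspect-only sketch (x86-specific, hypothetical names) of how you could at least locate the PTE that maps a given kernel virtual address, using lookup_address(); actually flipping its write bit is deliberately left out.

#include <linux/module.h>
#include <linux/mm.h>
#include <asm/pgtable.h>     /* lookup_address() on x86 */

static int __init ptepeek_init(void)
{
    unsigned int level;
    /* Any kernel virtual address will do; here we look at this
     * module's own init function (an arbitrary choice). */
    unsigned long addr = (unsigned long)ptepeek_init;
    pte_t *pte = lookup_address(addr, &level);

    if (pte)
        pr_info("va 0x%lx: page-table level %u, writable=%d\n",
                addr, level, pte_write(*pte));
    return 0;
}

static void __exit ptepeek_exit(void) { }

module_init(ptepeek_init);
module_exit(ptepeek_exit);
MODULE_LICENSE("GPL");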
At this point, assuming this all worked, wouldn't you be able to overwrite return addresses and other parts of the kernel code?
I'm not sure if I understand your question correctly.
But a kernel module is part of the kernel - so nothing stops a kernel module from doing something completely stupid (intentionally or because of a bug).
For this reason you have to be very careful when programming kernel modules.
And because "root" has the ability to load kernel modules, it is important that hackers or malware never get "root" access. Otherwise malware could be injected into the kernel using insmod.

How does the kernel-side page cache virt <-> phys mapping interact with the TLB?

I am writing an application that makes heavy use of mmap, including from distinct processes (not concurrently, but serially). A big determinant of performance is how the TLB is managed user and kernel side for such mappings.
I understand reasonably well the user-visible aspects of the Linux page cache. I think this understanding extends to the userland performance impacts[1].
What I don't understand is how those same pages are mapped into kernel space, and how this interacts with the TLB (on x86-64). You can find lots of information on how this worked in the 32-bit x86 world[2], but I didn't dig up the answer for 64-bit.
So the two questions are (both interrelated and probably answered in one shot):
How is the page cache mapped[3] in kernel space on x86-64?
If you read() N pages from a file in some process, then read exactly those N pages again from another process on the same CPU, is it possible that all the kernel-side reads (during the kernel -> userspace copy of the contents) hit in the TLB? Note that this is (probably) a direct consequence of (1).
My overall goal here is to understand at a deep level the performance difference of one-off accessing of cached files via mmap or non-mmap calls such as read.
[1] For example, if you mmap a file into your process's virtual address space, you have effectively asked for your process page tables to contain a mapping from the returned/requested virtual address range to a physical range corresponding to the pages for that file in the page cache (even if they don't exist in the page cache, yet). If MAP_POPULATE is specified, all the page table entries will actually be populated before the mmap call returns, and if not they will be populated as you fault in the associated pages (sometimes with optimizations such as fault-around).
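A small user-space illustration of footnote [1] (hedged sketch; the file path comes from argv): with MAP_POPULATE the page-table entries for the file's page-cache pages are filled in up front instead of on first touch.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0)
        return 1;

    /* MAP_POPULATE asks the kernel to populate the page tables (and
     * fault the pages into the page cache) before mmap() returns. */
    const char *p = mmap(NULL, st.st_size, PROT_READ,
                         MAP_PRIVATE | MAP_POPULATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    /* These accesses should then take no minor faults for pages that
     * were already resident in the page cache. */
    long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += p[i];
    printf("checksum: %ld\n", sum);

    munmap((void *)p, st.st_size);
    close(fd);
    return 0;
}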
[2] Basically, (for 3:1 mappings anyway) Linux uses a single 1 GB page to map approximately the first 1 GB of physical RAM directly (and places it at the top 1 GB of virtual memory), which is the end of story for machines with <= 1 GB RAM (the page cache necessarily goes in that 1GB mapping and hence a single 1 GB TLB entry covers everything). With more than 1GB RAM, the page cache is preferentially allocated from "HIGHMEM" - the region above 1GB which isn't covered by the kernel's 1GB mapping - so various temporary mapping strategies are used.
[3] By mapped I mean how are the page tables set up for its access, aka how does the virtual <-> physical mapping work.
Due to the vast virtual address space compared to the physical RAM installed (the kernel has 128TB of virtual space on x86-64), the common trick is to permanently map all of RAM. This is known as the "direct map".
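A conceptual, user-space sketch of that direct map (assumed value: the non-KASLR, 4-level-paging base 0xffff888000000000; real kernels may randomize it, and __va()/__pa() are the in-kernel equivalents of these hypothetical helpers):

#include <inttypes.h>
#include <stdio.h>

/* Assumed direct-map base for x86-64, 4-level paging, no KASLR. */
#define DIRECT_MAP_BASE 0xffff888000000000ULL

/* Every physical page is permanently mapped at a fixed offset, so the
 * kernel-side translation for page-cache pages is pure arithmetic:
 * no per-page kernel mappings need to be set up or torn down. */
static uint64_t phys_to_kernel_virt(uint64_t phys) { return phys + DIRECT_MAP_BASE; }
static uint64_t kernel_virt_to_phys(uint64_t virt) { return virt - DIRECT_MAP_BASE; }

int main(void)
{
    uint64_t page = 0x12345000;   /* hypothetical page-cache page (physical) */
    uint64_t virt = phys_to_kernel_virt(page);

    printf("phys 0x%" PRIx64 " -> kernel virt 0x%" PRIx64 " -> phys 0x%" PRIx64 "\n",
           page, virt, kernel_virt_to_phys(virt));
    return 0;
}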
In principle it is possible that both relevant TLB and cache entries survive the context switch and all the other code executed, but it is hard to say how likely this can be in the real world.

How exactly do kernel virtual addresses get translated to physical RAM?

On the surface, this appears to be a silly question. Some patience please.. :-)
I'm structuring this question into 2 parts:
Part 1:
I fully understand that platform RAM is mapped into the kernel segment; especially on 64-bit systems this works well. So each kernel virtual address is indeed just an offset from physical memory (DRAM).
Also, it's my understanding that as Linux is a modern virtual memory OS, (pretty much) all addresses are treated as virtual addresses and must "go" via hardware - the TLB/MMU - at runtime and then get translated by the TLB/MMU via kernel paging tables. Again, easy to understand for user-mode processes.
HOWEVER, what about kernel virtual addresses? For efficiency, would it not be simpler to direct-map these (and an identity mapping is indeed set up from PAGE_OFFSET onwards)? But still, at runtime, the kernel virtual address must go via the TLB/MMU and get translated, right? Is this actually the case? Or is kernel virtual address translation just an offset calculation? (But how can that be, as we must go via the hardware TLB/MMU?) As a simple example, let's consider:
char *kptr = kmalloc(1024, GFP_KERNEL);
Now kptr is a kernel virtual address.
I understand that virt_to_phys() can perform the offset calculation and return the physical DRAM address.
But, here's the Actual Question: it can't be done in this manner via software - that would be pathetically slow! So, back to my earlier point: it would have to be translated via hardware (TLB/MMU).
Is this actually the case??
Part 2:
Okay, let's say this is the case, and we do use paging in the kernel to do this; we must of course set up kernel paging tables. I understand they are rooted at swapper_pg_dir.
(I also understand that vmalloc(), unlike kmalloc(), is a special case - it's a pure virtual region that gets backed by physical frames only on page fault.)
If (in Part 1) we do conclude that kernel virtual address translation is done via kernel paging tables, then how exactly does the kernel paging table (swapper_pg_dir) get "attached" or "mapped" to a user-mode process? This should happen in the context-switch code? How? Where?
E.g.: on x86_64, two processes A and B are alive, 1 CPU. A is running, so its higher-canonical addresses 0xFFFF8000 00000000 through 0xFFFFFFFF FFFFFFFF "map" to the kernel segment, and its lower-canonical addresses 0x0 through 0x00007FFF FFFFFFFF map to its private userspace.
Now, if we context-switch A->B, process B's lower-canonical region is unique, but it must "map" to the same kernel of course! How exactly does this happen? How do we "auto" refer to the kernel paging table when in kernel mode? Or is that a wrong statement?
Thanks for your patience, would really appreciate a well thought out answer!
First a bit of background.
This is an area where there is a lot of potential variation between
architectures, however the original poster has indicated he is mainly
interested in x86 and ARM, which share several characteristics:
no hardware segments or similar partitioning of the virtual address space (when used by Linux)
hardware page table walk
multiple page sizes
physically tagged caches (at least on modern ARMs)
So if we restrict ourselves to those systems it keeps things simpler.
Once the MMU is enabled, it is never normally turned off. So all CPU
addresses are virtual, and will be translated to physical addresses
using the MMU. The MMU will first look up the virtual address in the
TLB, and only if it doesn't find it in the TLB will it refer to the
page table - the TLB is a cache of the page table - and so we can
ignore the TLB for this discussion.
The page table
describes the entire virtual 32 or 64 bit address space, and includes
information like:
whether the virtual address is valid
which mode(s) the processor must be in for it to be valid
special attributes for things like memory mapped hardware registers
and the physical address to use
Linux divides the virtual address space into two: the lower portion is
used for user processes, and there is a different virtual to physical
mapping for each process. The upper portion is used for the kernel,
and the mapping is the same even when switching between different user
processes. This keeps things simple, as an address is unambiguously in
user or kernel space, the page table doesn't need to be changed when
entering or leaving the kernel, and the kernel can simply dereference
pointers into user space for the
current user process. Typically on 32-bit processors the split is 3G
user/1G kernel, although this can vary. Pages for the kernel portion
of the address space will be marked as accessible only when the processor
is in kernel mode to prevent them being accessible to user processes.
The portion of the kernel address space which is identity mapped to RAM
(kernel logical addresses) will be mapped using big pages when possible,
which may allow the page table to be smaller but more importantly
reduces the number of TLB misses.
When the kernel starts it creates a single page table for itself
(swapper_pg_dir) which just describes the kernel portion of the
virtual address space and with no mappings for the user portion of the
address space. Then every time a user process is created a new page
table will be generated for that process, the portion which describes
kernel memory will be the same in each of these page tables. This could be
done by copying all of the relevant portion of swapper_pg_dir, but
because page tables are normally tree structures, the kernel is
frequently able to graft the portion of the tree which describes the
kernel address space from swapper_pg_dir into the page tables for each
user process by just copying a few entries in the upper layer of the
page table structure. As well as being more efficient in memory (and possibly
cache) usage, it makes it easier to keep the mappings consistent. This
is one of the reasons why the split between kernel and user virtual
address spaces can only occur at certain addresses.
To see how this is done for a particular architecture look at the
implementation of pgd_alloc(). For example ARM
(arch/arm/mm/pgd.c) uses:
pgd_t *pgd_alloc(struct mm_struct *mm)
{
    ...
    init_pgd = pgd_offset_k(0);
    memcpy(new_pgd + USER_PTRS_PER_PGD, init_pgd + USER_PTRS_PER_PGD,
           (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
    ...
}
or
x86 (arch/x86/mm/pgtable.c) pgd_alloc() calls pgd_ctor():
static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
{
    /* If the pgd points to a shared pagetable level (either the
       ptes in non-PAE, or shared PMD in PAE), then just copy the
       references from swapper_pg_dir. */
    ...
    clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
                    swapper_pg_dir + KERNEL_PGD_BOUNDARY,
                    KERNEL_PGD_PTRS);
    ...
}
So, back to the original questions:
Part 1: Are kernel virtual addresses really translated by the TLB/MMU?
Yes.
Part 2: How is swapper_pg_dir "attached" to a user-mode process?
All page tables (whether swapper_pg_dir or those for user processes)
have the same mappings for the portion used for kernel virtual
addresses. So as the kernel context switches between user processes,
changing the current page table, the mappings for the kernel portion
of the address space remain the same.
The kernel address space is mapped into a section of each process's address space; for example, with the 3:1 mapping it occupies everything above address 0xC0000000. If user code tries to access this address space, it generates a page fault; the region is guarded by the kernel.
The kernel address space is divided into 2 parts, the logical address space and the virtual (vmalloc) address space; the boundary between them is defined by the constant VMALLOC_START. The CPU is using the MMU all the time, in user space and in kernel space (it can't be switched on/off).
The kernel virtual (vmalloc) address space is mapped the same way as user-space mappings. The logical address space is contiguous, and it is simple to translate it to physical addresses, so this can be done on demand using the MMU fault exception: the kernel tries to access an address, the MMU generates a fault, the fault handler maps the page using the macros __pa and __va and changes the CPU's pc register back to the instruction before the fault, and then everything is fine. This process is actually platform dependent, and on some hardware architectures it is mapped the same way as user space (because the kernel doesn't use a lot of memory).
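A hedged module sketch (hypothetical names) showing the two kernel regions mentioned above: a kmalloc() buffer comes from the directly mapped logical address space, while a vmalloc() buffer lives above VMALLOC_START.

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>        /* is_vmalloc_addr() */

static int __init regions_init(void)
{
    void *k = kmalloc(PAGE_SIZE, GFP_KERNEL);
    void *v = vmalloc(PAGE_SIZE);

    if (k && v)
        pr_info("kmalloc 0x%lx in vmalloc range: %d, vmalloc 0x%lx in vmalloc range: %d\n",
                (unsigned long)k, is_vmalloc_addr(k),
                (unsigned long)v, is_vmalloc_addr(v));

    kfree(k);
    vfree(v);
    return 0;
}

static void __exit regions_exit(void) { }

module_init(regions_init);
module_exit(regions_exit);
MODULE_LICENSE("GPL");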

In Linux, are physical memory pages belonging to the kernel data segment swappable or not?

I'm asking because I remember that all physical pages belonging to the kernel are pinned in memory and thus unswappable, like what is said here: http://www.cse.psu.edu/~axs53/spring01/linux/memory.ppt
However, I'm reading a research paper and feel confused as it says,
"(physical) pages frequently move between the kernel data segment and user space."
It also mentions that, in contrast, physical pages do not move between the kernel code segment and user space.
I think if a physical page sometimes belongs to the kernel data segment and sometimes belongs to user space, it must mean that physical pages belonging to the kernel data segment are swappable, which is against my current understanding.
So, are physical pages belonging to the kernel data segment swappable or unswappable?
P.S. The research paper is available here:
https://www.cs.cmu.edu/~arvinds/pubs/secvisor.pdf
Please search "move between" and you will find it.
P.S. again, a virtual memory area ranging from [3G + 896M] to 4G belongs to the kernel and is used for mapping physical pages in ZONE_HIGHMEM (x86 32-bit Linux, 3G + 1G setting). In such a case, the kernel may first map some virtual pages in the area to the physical pages that host the current process's page table, modify some page table entries, and unmap the virtual pages. This way, the physical pages may sometimes belong to the kernel and sometimes belong to user space, because they do not belong to the kernel after the unmapping and thus become swappable. Is this the reason?
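That temporary mapping of ZONE_HIGHMEM pages described in the P.S. is exactly what kmap()/kunmap() do. A hedged sketch (32-bit era API; on 64-bit there is no highmem so kmap() is effectively a no-op, and newer kernels prefer kmap_local_page()):

#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/string.h>

static int __init hm_init(void)
{
    /* GFP_HIGHUSER may hand back a ZONE_HIGHMEM page, i.e. one with
     * no permanent kernel mapping on 32-bit. */
    struct page *pg = alloc_page(GFP_HIGHUSER);
    void *addr;

    if (!pg)
        return -ENOMEM;

    addr = kmap(pg);          /* map it into kernel virtual space... */
    memset(addr, 0, PAGE_SIZE);
    kunmap(pg);               /* ...and drop the mapping again */

    __free_page(pg);
    return 0;
}

static void __exit hm_exit(void) { }

module_init(hm_init);
module_exit(hm_exit);
MODULE_LICENSE("GPL");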
tl;dr - the memory pools and swapping are different concepts. You can not make any deductions from one about the other.
kmalloc() and other kernel data allocations come from slab/slub, etc. - the same place that the kernel gets pages for user space. Ergo pages frequently move between the kernel data segment and user space. This is correct. It doesn't say anything about swapping. That is a separate issue and you can not deduce anything.
The kernel code is typically populated at boot and marked read-only and never changes after that. Ergo physical pages do not move between the kernel code segment and user space.
Why do you think that because something comes from the same pool, it is the same? The network sockets also come from the same memory pool. It is a separation of concerns. The linux-mm (memory management system) handles swap. A page can be pinned (unswappable). The check for static kernel memory (this may include .bss and .data) is a simple range check. That memory is normally pinned and marked unswappable at the linux-mm layer. The user data (whose allocations come from the same pool) can be marked as swappable by the linux-mm. For instance, even without swap, user-space text is still swappable because it is backed by an inode. Caching is much simpler for read-only data. If data is swapped, it is marked as such in the MMU tables, and a fault handler must distinguish between swap and a SIGBUS; this is part of the linux-mm.
There are also versions of Linux with no-mm (or no MMU), and these will never swap anything. In theory someone might be able to swap kernel data; but then why is it in the kernel? The Linux way would be to use a module and only load it as needed. Certainly, the linux-mm data is kernel data and, hopefully, you can see the problem with swapping that.
The problem with conceptual questions like this,
It can differ with Linux versions.
It can differ with Linux configurations.
The advice can change as Linux evolves.
For certain, the linux-mm code can not be swappable, nor any interrupt handler. It is possible that at some point in time, kernel code and/or data could be swapped. I don't think that this is ever the current case outside of module loading/unloading (and it is rather pedantic/esoteric as to whether you call this swapping or not).
I think if a physical page sometimes belongs to the kernel data segment and sometimes belongs to user space, it must mean that physical pages belong to the kernel data segment are swappable, which is against my current understanding.
There is no connection between swappable memory and page movement between user space and kernel space. Whether a page can be swapped or not depends totally on whether it is pinned or not. Pinned pages are not swapped, so their mapping is considered permanent.
So, physical pages belong to the kernel data segment are swappable? unswappable?
Usually, pages used by the kernel are pinned and so are meant not to be swappable.
However, I'm reading a research paper and feel confused as it says, "(physical) pages frequently move between the kernel data segment and user space."
Could you please give a link to this research paper?
As far as I know (just from UNIX lectures and labs at school), the pages for kernel space are allocated to the kernel with a simple, fixed mapping algorithm, and they are all pinned. After the kernel turns on paging mode (bit operations on CR0 and CR3 for x86), the first user-mode process is created, and the pages which have been allocated to the kernel will not be in the set of pages available to user space.

ARM Linux kernel page table

Ref. Linux kernel ARM Translation table base (TTB0 and TTB1)
I have a further doubt/query on the topic discussed in the previous link:
0 to 0xbfffffff is the lower part of memory (for user processes) and is managed by the page table in TTB0; it contains the page table of the current process.
Ref. arm/include/asm/pgtable-2level.h: PTRS_PER_PGD = 2048, PTRS_PER_PMD = 1, PTRS_PER_PTE = 512
0xc0000000 to 0xffffffff is the upper part (OS and memory-mapped I/O) of the address space, managed/translated by the page table in TTBR1.
The TTB1 table is fixed in size and alignment (to 16K). Each level-1 entry is 32 bits in size and represents a 1MB page/section. This is the swapper_pg_dir page table (ref. System.map), which is placed 16K below the actual text address.
Is it the case that the first 768 entries in swapper_pg_dir are 0 (0x0 to 0xbfffffff, for user processes) and that the valid entries run from 768 to 1024 (0xc0000000 to 0xffffffff, for the OS and memory-mapped I/O)?
Would anyone like to share some sample code in kernel space (a kernel module) to browse this swapper_pg_dir PGD?
Because of how the ARM MMU was designed, both translation tables (TTB0 and TTB1) can only be used with a 1:1 mapped kernel.
Most Linux Kernels have a 3:1 mapping (3GB User space : 1GB Kernel space for ARM).
This means that 0-0xBFFFFFFF is user space while 0xC0000000 - 0xFFFFFFFF is kernel space.
Now for the HW memory translations, only TTBR0 is used. TTBR1 only holds the address of the initial swapper page (which contains all the kernel mappings) and isn't really used for virtual address translations. TTBR0 holds the address of the currently used page directory (the page table that the HW is using for translations). Now each user process has its own page tables, and on each process switch, TTBR0 is changed to the current user process's page table (they are all located in kernel space).
For example, for each new user process, the kernel creates a new page directory, copies all the kernel mappings from the swapper page(page frames from 3-4GB) to the new page table and clears the user pages(page frames from 0-3GB). It then sets TTB0 to the base address of this page directory and flushes cache to install the new address space. The swapper page is also always kept up to date with changes to the mappings.
For your question:
Simplified, hardware-wise the first-level page table has 4096 entries. Each entry represents 1MB of virtual address space, totalling 4GB. Entries 0-3071 represent user space and entries 3072-4095 represent kernel space.
The swapper page is usually located at addresses 0xC0004000 - 0xC0008000 (4096 entries × 4 bytes per entry = 16384 = 16KB, 0x4000 in hex). By examining the memory at 0xC0004000-0xC0007000 you can find the entries for user space (empty), and from 0xC0007000-0xC0008000 you can find the kernel entries. I use gdb with the command x /100x 0xc0007000 in order to examine the first 100 kernel entries. You can then consult the technical reference manual for your platform in order to decipher the page table attributes.
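In answer to the "sample code" request, here is a hedged module sketch (hypothetical names) that prints the last few first-level entries of the current process's page directory, which, as described above, mirror the kernel entries in swapper_pg_dir. It should build for the 2-level ARM layout discussed here as well as for x86-32, but treat it as a sketch, not production code.

#include <linux/module.h>
#include <linux/sched.h>
#include <linux/mm.h>
#include <asm/pgtable.h>

static int __init pgddump_init(void)
{
    /* The kernel part of every process's PGD is copied from
     * swapper_pg_dir, so current->mm->pgd (the insmod process)
     * is good enough for a look at the kernel entries. */
    pgd_t *pgd = current->mm->pgd;
    int i;

    for (i = PTRS_PER_PGD - 16; i < PTRS_PER_PGD; i++)
        pr_info("pgd[%d] (va 0x%lx) = 0x%lx\n", i,
                (unsigned long)i << PGDIR_SHIFT,
                (unsigned long)pgd_val(pgd[i]));
    return 0;
}

static void __exit pgddump_exit(void) { }

module_init(pgddump_init);
module_exit(pgddump_exit);
MODULE_LICENSE("GPL");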
If you want to learn more about the Linux kernel, I recommend you use Qemu to simulate the Beagleboard, together with gdb, to examine and debug the source code. I did this to learn how the kernel builds the page table during initialization.
