Is it possible to overwrite kernel code by replacing the page tables? - linux

So it seems that you cannot modify kernel code because the PTE that points to it is marked as executable as opposed to writeable. I was wondering if you could overwrite kernel code using the following method? (this only applies to x86 and assumes we have root access so we run the following steps as a kernel module)
Read in the contents of the CR3 register
Use kmalloc to allocate memory big enough to replicate all the PTE and the PDE
Copy all the paging data into the newly allocated memory using the value obtained from the CR3 register
Mark the relevant pages as executable and writeable
Overwrite the CR3 register with a pointer to the memory we kmalloc'ed in step 2
At this point, assuming this all worked, wouldnt you be able to overwrite return addresses and other parts of the kernel code? Whereas before doing this we would be stopped from the paging mechanism protections?

3 . Copy all the paging data into the newly allocated memory using the value obtained from the CR3 register
5 . Overwrite the CR3 register with a pointer to the memory we kmalloc'ed in step 2
These two steps might not work:
CR3 gives you an physical address; however, for reading the page data you require a virtual address. It is not even guaranteed that the PTD is currently mapped (and therefore accessible).
And to overwrite the CR3 register you need to know the physical address of the memory you have allocated using kmalloc; however, you only know the virtual address.
However, you might use virt_to_phys and phys_to_virt to translate physical to virtual addresses.
Is it possible to overwrite kernel code ...?
I'm not sure, but the following attempt should work:
The page tables themselves should be read-write - at least the ones used by kmalloc.
Instead of copying the PTD and the page tables, you could allocate some memory using kmalloc which is 2 page sizes long (8 KiB if the "traditional" 4 KiB memory pages are used). This means that "your" memory block under all circumstances contains one complete memory page.
When you have the virtual addresses of the PTD and the page tables, you can re-map "your" memory page so it does no longer point to your "kmalloc memory" but to the kernel code you want to modify...
At this point, assuming this all worked, wouldn't you be able to overwrite return addresses and other parts of the kernel code?
I'm not sure if I understand your question correctly.
But a kernel module is part of the kernel - so nothing stops a kernel module from doing something completely stupid (intentionally or because of a bug).
For this reason you have to be very careful when programming kernel modules.
And because "root" has the ability to load kernel modules, it is important that hackers or malware never get "root" access. Otherwise malware could be injected into the kernel using insmod.

Related

What about the many addresses unpresented in the /proc/$pid/maps file?

Brief Version:
what status are the addresses unpresented in the maps file? Are they belongs to unallocated virtual pages or allocated from anonymous file or others?
Detailed Version
I'm learning about VM. In my book(CS:APP), I learned that all virtual pages can be cut into three sets: unallocated, allocated but not cached, allocated and cached.I have some questions about "what are allocated pages and unallocated pages? When are pages allocated?" And also, is stack and heap belongs to allocated pages or unallocated or only allocate when used?
Trying to solve these problems, I read the /proc/$pid/maps file, while I think I can get anything I want from it. In my mind, the file contains all memory mapping relations. But there isn't information about is it cached(I know maybe it cannot be seen from user mode...), and are the unpresented pages unallocated?
Honestly, I don't really know about the maps file. What I do know is that the information on every page is stored in page structures at all time. I'm gonna take x86-64 as an example.
For x86-64 on Linux you have Page Global Directory (PGD), Page Upper Directory (PUD), Page Middle Directory (PMD) and Page Directory (PD). The address of the bottom of the PGD table is stored in the CR3 register. PGD contains addresses of the PUDs, PUDs contain addresses of the PMDs, PMDs contain addresses of the PDs and PDs contain addresses of the physical pages.
A virtual address, of which only 48 bits are used, is split into 5 parts. The 12 least significant bits are the offset in the physical page. The next chunk of 9 bits is the offset in the PD. The next chunk is the offset in PMD etc. For example let's say you have virtual address 0x0000000000000123. This virtual address will be translated by the MMU in the CPU by looking at entry (offset) 0 of the PGD, entry 0 of the PUD, entry 0 of the PMD, entry 0 of the PD and finally offset 0x123 in the actual physical page in RAM. Every virtual address is 64 bits of which only the 48 least significant bits will be used.
At boot, the kernel makes checks to determine how much memory is available. It then builds its kernel structures accordingly.
When the kernel boots it will mark all pages as unallocated in its own structures (except for kernel needs). The page structure is important for this. The kernel has a page C structure for every page in the system (https://linux-kernel-labs.github.io/refs/heads/master/labs/memory_mapping.html and https://elixir.bootlin.com/linux/v4.6/source/include/linux/mm_types.h). This structure informs the kernel whether the page is allocated or not.
Each physical page in the system has a struct page associated with
it to keep track of whatever it is we are using the page for at the
moment. Note that we have no way to track which tasks are using
a page, though if it is a pagecache page, rmap structures can tell us
who is mapping it.
At first the pages are mostly unallocated. When you start a new process by launching an executable as the user of the system, pages are allocated for your process. On Linux, executables are ELF files. ELF is a conventional format which separates code in loadable segments. Each segment gets a virtual address at which it is going to be loaded in the virtual address space.
Let's say you have an elf file with one segment which should be loaded at virtual address 0x400000. When you launch that ELF executable, the Linux kernel will call certain functions which will look at the size of the code and allocate pages accordingly. The kernel will look at its structures and determine using algorithms where the process will be allocated in RAM. It will then setup the page tables according to where the virtual addresses for that process should land in actual physical memory.
What's important to understand is that each CPU core in the system has only one process running at a time. Each core has it's own set of page tables. When a process switch occurs for one core, the page tables are swapped completely to point to somewhere else in RAM. The same virtual address can point anywhere in RAM depending on how the page tables are set up.
The kernel holds a task_struct for every process running in the system. The task_struct contains a field named pgd which is a pointer to the PGD of the process. Each process has its very own PGD. If you dereference the pointer to the PGD you get the actual value of the first entry of the PGD. This first entry is the address of the PUD. With this only pointer, the kernel can reach every table belonging to the process and modify them at will.
While a process is running, it can ask for more memory. This is called dynamic memory allocation. The kernel has no way to know how much memory the process is going to ask in advance since it is dynamic (done while code is executing). When the process asks for more memory, the kernel determines what page to give to the process depending on an algorithm. It then marks this page as allocated to that process. The task_struct contains a mm field which is of type mm_struct (https://manybutfinite.com/post/how-the-kernel-manages-your-memory/). It is a memory descriptor for that process so that the kernel can know what memory the process is using. In itself the process doesn't need that information since the process should rely only on itself to ask for memory properly to the operating system and to not jump somewhere in RAM where it doesn't belong.
You ask about heap and stack. The stack for a process is allocated at the beginning of the process and I think it has a fixed size. If you overflow the stack, you will throw a CPU exception which will prompt the kernel to kill your process. Each CPU core has a special register called RSP. This is the stack pointer. It points to the top of the stack (the stack grows downward toward low memory). When the kernel allocates a stack for a process you launch, it will set up this register to point at the top of it. The stack pointer contains a virtual address. It will thus be translated using the page tables just like any address.
The heap is allocated and managed completely by the OS. It doesn't have special registers like the stack. It is allocated only when the process asks for more memory during code execution. The kernel knows in advance how much memory a process requires. It is all written in the ELF executable. All static memory is allocated during compilation and thus the kernel knows everything about the size of static memory. The only moment it requires to allocate new memory to a process is when the process actually asks for it. In C++ you use the keyword new to ask for heap memory dynamically. If you don't use this keyword, then the kernel knows in advance where your variables will be allocated (where they will be in memory). Only the stack will be used by static memory.

How can I force a cache flush for a process from a Linux device driver?

I'm working on a research project that requires me to perform a memory capture from custom hardware. I am working with a Zedboard SoC (dual-core ARM Cortex-A9 with FPGA fabric attached). I have designed a device driver that allows me to perform virtual memory captures and physical memory captures (using an AXI4-Lite peripheral that controls the Xilinx AXI DMA IP).
My goal is to capture all mapped pages, so I check /proc/pid/maps for mapped regions, then obtain PFNs from /proc/pid/pagemaps, pass the physical addresses into my device driver, and then pass them to my custom hardware (which invokes the Xilinx AXI DMA to obtain the contents from physical memory).
NOTE: I am using Xilinx's PetaLinux distribution, which is built on Linux version 4.14.
My device driver implements the following procedure through a series of IOCTL calls:
Stop the target process.
Perform virtual memory capture (using the access_process_vm() function).
Flush the cache (using the flush_user_range() function).
Perform physical memory capture.
Resume the target process.
What I'm noticing, however, is that the virtual memory capture and the physical memory capture differ in the [heap] section (which is the first section that extends past one page). The first page matches, but none of the other pages are even close. The [stack] section does not match at all. I should note that for the first two memory sections, .text and .rodata, the captures match exactly. The conclusion for now is that data that does not change during runtime matches between virtual and physical captures while data that does change during runtime does not match.
So this leaves me wondering: am I using the correct function to ensure coherency between the cache and the RAM? If not, what is the proper function to use to force a cache flush to RAM? It is necessary that the data in RAM is up-to-date to the point when the target process is stopped because I cannot access the caches from the custom hardware.
Edit 1:
In regards to this question being marked as a possible duplicate of this question, I am using a function from the accepted answer to initiate a cache flush. However, from my perspective, it doesn't seem to be working as the physical memory does not match the virtual memory as I would expect if a cache flush were occurring.
For anyone coming across this question in the future, the problem was not what I thought it was. The flush_user_range() function I mentioned is the correct function to use to push pages back to main memory from cache.
However, what I didn't think about at the time was that pages that are virtually contiguous are not necessarily (and are very often not) physically contiguous. In my kernel code, I passed the length of the mapped region to my hardware, which requested that length of data from the AXI DMA. What I should have done was a virtual-to-physical translation to get the physical address of each page in each region, then requested one page length of data from main memory, repeating that process for each page in each mapped region.
I realize this is a pretty specific problem that likely won't help anyone else doing the same thing I was doing, but hopefully the learned lesson can help someone in the future: Linux allocates pages in physical memory (that are usually 4kB in size, though you shouldn't assume that's the case), and a collection of physical pages is contained in one mapped region. If you're working on any code that requires you to check physical memory, be sure to be wary of where your data may cross physical page boundaries and act accordingly.

How exactly do kernel virtual addresses get translated to physical RAM?

On the surface, this appears to be a silly question. Some patience please.. :-)
Am structuring this qs into 2 parts:
Part 1:
I fully understand that platform RAM is mapped into the kernel segment; esp on 64-bit systems this will work well. So each kernel virtual address is indeed just an offset from physical memory (DRAM).
Also, it's my understanding that as Linux is a modern virtual memory OS, (pretty much) all addresses are treated as virtual addresses and must "go" via hardware - the TLB/MMU - at runtime and then get translated by the TLB/MMU via kernel paging tables. Again, easy to understand for user-mode processes.
HOWEVER, what about kernel virtual addresses? For efficiency, would it not be simpler to direct-map these (and an identity mapping is indeed setup from PAGE_OFFSET onwards). But still, at runtime, the kernel virtual address must go via the TLB/MMU and get translated right??? Is this actually the case? Or is kernel virtual addr translation just an offset calculation?? (But how can that be, as we must go via hardware TLB/MMU?). As a simple example, lets consider:
char *kptr = kmalloc(1024, GFP_KERNEL);
Now kptr is a kernel virtual address.
I understand that virt_to_phys() can perform the offset calculation and return the physical DRAM address.
But, here's the Actual Question: it can't be done in this manner via software - that would be pathetically slow! So, back to my earlier point: it would have to be translated via hardware (TLB/MMU).
Is this actually the case??
Part 2:
Okay, lets say this is the case, and we do use paging in the kernel to do this, we must of course setup kernel paging tables; I understand it's rooted at swapper_pg_dir.
(I also understand that vmalloc() unlike kmalloc() is a special case- it's a pure virtual region that gets backed by physical frames only on page fault).
If (in Part 1) we do conclude that kernel virtual address translation is done via kernel paging tables, then how exactly does the kernel paging table (swapper_pg_dir) get "attached" or "mapped" to a user-mode process?? This should happen in the context-switch code? How? Where?
Eg.
On an x86_64, 2 processes A and B are alive, 1 cpu.
A is running, so it's higher-canonical addr
0xFFFF8000 00000000 through 0xFFFFFFFF FFFFFFFF "map" to the kernel segment, and it's lower-canonical addr
0x0 through 0x00007FFF FFFFFFFF map to it's private userspace.
Now, if we context-switch A->B, process B's lower-canonical region is unique But
it must "map" to the same kernel of course!
How exactly does this happen? How do we "auto" refer to the kernel paging table when
in kernel mode? Or is that a wrong statement?
Thanks for your patience, would really appreciate a well thought out answer!
First a bit of background.
This is an area where there is a lot of potential variation between
architectures, however the original poster has indicated he is mainly
interested in x86 and ARM, which share several characteristics:
no hardware segments or similar partitioning of the virtual address space (when used by Linux)
hardware page table walk
multiple page sizes
physically tagged caches (at least on modern ARMs)
So if we restrict ourselves to those systems it keeps things simpler.
Once the MMU is enabled, it is never normally turned off. So all CPU
addresses are virtual, and will be translated to physical addresses
using the MMU. The MMU will first look up the virtual address in the
TLB, and only if it doesn't find it in the TLB will it refer to the
page table - the TLB is a cache of the page table - and so we can
ignore the TLB for this discussion.
The page table
describes the entire virtual 32 or 64 bit address space, and includes
information like:
whether the virtual address is valid
which mode(s) the processor must be in for it to be valid
special attributes for things like memory mapped hardware registers
and the physical address to use
Linux divides the virtual address space into two: the lower portion is
used for user processes, and there is a different virtual to physical
mapping for each process. The upper portion is used for the kernel,
and the mapping is the same even when switching between different user
processes. This keep things simple, as an address is unambiguously in
user or kernel space, the page table doesn't need to be changed when
entering or leaving the kernel, and the kernel can simply dereference
pointers into user space for the
current user process. Typically on 32bit processors the split is 3G
user/1G kernel, although this can vary. Pages for the kernel portion
of the address space will be marked as accessible only when the processor
is in kernel mode to prevent them being accessible to user processes.
The portion of the kernel address space which is identity mapped to RAM
(kernel logical addresses) will be mapped using big pages when possible,
which may allow the page table to be smaller but more importantly
reduces the number of TLB misses.
When the kernel starts it creates a single page table for itself
(swapper_pg_dir) which just describes the kernel portion of the
virtual address space and with no mappings for the user portion of the
address space. Then every time a user process is created a new page
table will be generated for that process, the portion which describes
kernel memory will be the same in each of these page tables. This could be
done by copying all of the relevant portion of swapper_pg_dir, but
because page tables are normally a tree structures, the kernel is
frequently able to graft the portion of the tree which describes the
kernel address space from swapper_pg_dir into the page tables for each
user process by just copying a few entries in the upper layer of the
page table structure. As well as being more efficient in memory (and possibly
cache) usage, it makes it easier to keep the mappings consistent. This
is one of the reasons why the split between kernel and user virtual
address spaces can only occur at certain addresses.
To see how this is done for a particular architecture look at the
implementation of pgd_alloc(). For example ARM
(arch/arm/mm/pgd.c) uses:
pgd_t *pgd_alloc(struct mm_struct *mm)
{
...
init_pgd = pgd_offset_k(0);
memcpy(new_pgd + USER_PTRS_PER_PGD, init_pgd + USER_PTRS_PER_PGD,
(PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
...
}
or
x86 (arch/x86/mm/pgtable.c) pgd_alloc() calls pgd_ctor():
static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
{
/* If the pgd points to a shared pagetable level (either the
ptes in non-PAE, or shared PMD in PAE), then just copy the
references from swapper_pg_dir. */
...
clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
swapper_pg_dir + KERNEL_PGD_BOUNDARY,
KERNEL_PGD_PTRS);
...
}
So, back to the original questions:
Part 1: Are kernel virtual addresses really translated by the TLB/MMU?
Yes.
Part 2: How is swapper_pg_dir "attached" to a user mode process.
All page tables (whether swapper_pg_dir or those for user processes)
have the same mappings for the portion used for kernel virtual
addresses. So as the kernel context switches between user processes,
changing the current page table, the mappings for the kernel portion
of the address space remain the same.
The kernel address space is mapped to a section of each process for example on 3:1 mapping after address 0xC0000000. If the user code try to access this address space it will generate a page fault and it is guarded by the kernel.
The kernel address space is divided into 2 parts, the logical address space and the virtual address space. It is defined by the constant VMALLOC_START. The CPU is using the MMU all the time, in user space and in kernel space (can't switch on/off).
The kernel virtual address space is mapped the same way as user space mapping. The logical address space is continuous and it is simple to translate it to physical so it can be done on demand using the MMU fault exception. That is the kernel is trying to access an address, the MMU generate fault , the fault handler map the page using macros __pa , __va and change the CPU pc register back to the previous instruction before the fault happened, now everything is ok. This process is actually platform dependent and in some hardware architectures it mapped the same way as user (because the kernel doesn't use a lot of memory).

In Linux, physical memory pages belong to the kernel data segment are swappable or not?

I'm asking because I remember that all physical pages belong to the kernel are pinned in memory and thus are unswappable, like what is said here: http://www.cse.psu.edu/~axs53/spring01/linux/memory.ppt
However, I'm reading a research paper and feel confused as it says,
"(physical) pages frequently move between the kernel data segment and user space."
It also mentions that, in contrast, physical pages do not move between the kernel code segment and user space.
I think if a physical page sometimes belongs to the kernel data segment and sometimes belongs to user space, it must mean that physical pages belong to the kernel data segment are swappable, which is against my current understanding.
So, physical pages belong to the kernel data segment are swappable? unswappable?
P.S. The research paper is available here:
https://www.cs.cmu.edu/~arvinds/pubs/secvisor.pdf
Please search "move between" and you will find it.
P.S. again, a virtual memory area ranging from [3G + 896M] to 4G belongs to the kernel and is used for mapping physical pages in ZONE_HIGHMEM (x86 32-bit Linux, 3G + 1G setting). In such a case, the kernel may first map some virtual pages in the area to the physical pages that host the current process's page table, modify some page table entries, and unmap the virtual pages. This way, the physical pages may sometimes belong to the kernel and sometimes belong to user space, because they do not belong to the kernel after the unmapping and thus become swappable. Is this the reason?
tl;dr - the memory pools and swapping are different concepts. You can not make any deductions from one about the other.
kmalloc() and other kernel data allocation come from slab/slub, etc. The same place that the kernel gets data for user-space. Ergo pages frequently move between the kernel data segment and user space. This is correct. It doesn't say anything about swapping. That is a separate issue and you can not deduce anything.
The kernel code is typically populated at boot and marked read-only and never changes after that. Ergo physical pages do not move between the kernel code segment and user space.
Why do you think because something comes from the same pool, it is the same? The network sockets also come from the same memory pool. It is a seperation of concern. The linux-mm (memory management system) handles swap. A page can be pinned (unswappable). The check for static kernel memory (this may include .bss and .data) is a simple range check. The memory is normally pinned and marked unswappable at the linux-mm layer. The user data (whos allocation come from the same pool) can be marked as swappable by the linux-mm. For instance, even without swap, user-space text is still swappable because it is backed by an inode. Caching is much simpler for read-only data. If data is swapped, it is marked as such in the MMU tables and a fault handler must distinguish between swap and a SIGBUS; which is part of the linux-mm.
There are also versions of Linux with no-mm (or no MMU) and these will never swap anything. In theory someone might be able to swap kernel data; but the why is it in the kernel? The Linux way would be to use a module and only load them as needed. Certainly, the linux-mm data is kernel data and hopefully, you can see a problem with swapping that.
The problem with conceptual questions like this,
It can differ with Linux versions.
It can differ with Linux configurations.
The advice can change as Linux evolves.
For certain, the linux-mm code can not be swappable, nor any interrupt handler. It is possible that at some point in time, kernel code and/or data could be swapped. I don't think that this is ever the current case outside of module loading/unloading (and it is rather pedantic/esoteric as to whether you call this swapping or not).
I think if a physical page sometimes belongs to the kernel data segment and sometimes belongs to user space, it must mean that physical pages belong to the kernel data segment are swappable, which is against my current understanding.
there is no connection between swappable memory and page movement between user space and kernel space. whether a page can be swapped or not depends totally on whether it is pinned or not. Pinned pages are not swapped so their mapping is considered permanent.
So, physical pages belong to the kernel data segment are swappable? unswappable?
usually pages used by kernel are pinned and so are meant not to be swappable.
However, I'm reading a research paper and feel confused as it says, "(physical) pages frequently move between the kernel data segment and user space."
Could you please give a link of this research papaer?
As far as I known, (just from UNIX lectures and labs at school) the pages for kernel space has been allocated for kernel, with a simple, fixed mapping algorithm, and they are all pinned. After kernel turn on the paging mode, (bits operation of CR0&CR3 for x86) there will be the first user mode process, and the pages which has been allocated for kernel will not be in the available set of pages for user space.

Write to a cacheable physical address in linux kernel without using ioremap or mmap

I am changing the linux kernel scheduler to print the pid of the next process in a known physical memory location. mmap is used for userspace programs while i read that ioremap marks the page as non-cacheable which would slowdown the execution of the program. I would like a fast way to write to a known physical memory. phys_to_virt is the option that i think is feasible. Any idea for a different technique.
PS: i am running this linux kernel on top of qemu. the physical address will be used by qemu to read information sent by guest kernel. writing to a known io-port is not feasible since the device code backing this io-device will be called every time there is an access to the device.
EDIT : I want the physical address location of the pid to be safe. How can I make sure that a physical address that the kernel is using is not being assigned to any process. As far as my knowledge goes, ioremap would mark the page as cacheable and would hence not be of great use.
The simplest way to do this would be to do kmalloc() to get some memory in the kernel. Then you can get the physical address of the pointer that returns by passing it to virt_to_phys(). This is a total hack but for your case of debugging / tracing under qemu, it should work fine.
EDIT: I misunderstood the question. If you want to use a specific physical address, there are a couple of things you could do. Maybe the cleanest thing to do would be to modify the e820 map that qemu passes in to mark the RAM page as reserved, and then the kernel won't use it. (ie the same way that ACPI tables are passed in).
If you don't want to modify qemu, you could also modify the early kernel startup (around arch/x86/kernel/setup.c probably) to do reserve_bootmem() on the specific physical page you want to protect from being used.
To actually use the specified physical address, you can just use ioremap_cache() the same way the ACPI drivers access their tables.
It seems I misunderstood the cache coherency between VM and host part, here is an updated answer.
What you want is "virtual adress in VM" <-> "virtual or physical adress in QEMU adress space".
Then you can either kmalloc it, but it may vary from instance to instance,
or simply declare a global variable in the kernel.
Then virt_to_phys would give you access to the physical address in VM space, and I suppose you can translate this in a QEMU adress space. What do you mean by "a physical address that the kernel is using is not assigned to any process ?" You are afraid the page conatining your variable might be swapped ? kmalloced memory is not swappable
Original (and wrong) answer
If the adress where you want to write is in it's own page, I can't see how an ioremap
of this page would slow down code executing in a different page.
You need a cache flush anyway, and without SSE, I can't see how you can bypass the cache if MMU and cache are on. I can see only this two options :
ioremap and declare a particular page non cacheable
use a "normal" address, and manually do a cache flush each time you write.

Resources