Changing kernel page permission for allowing user access - linux

In x86 or x64 Linux, I am trying to write a kernel module that changes the permissions of a specific kernel page so that a user application can access that memory. For example, if there is a readable kernel page at 0xC0001000 (assuming a 3:1 split), I want to change the user/supervisor bit of this page so that user applications can do something like this:
int *m = (int *)0xC0001000;
printf("reading kernel memory from user : %08x\n", (unsigned int)*m);
In my kernel module, I changed the access bits of the corresponding kernel page table entry from 0x67 to 0x63 (lower bits 111 -> 011), clearing the supervisor bit.
After that, I flushed the TLB entry for virtual address 0xC0001000 using the invlpg instruction.
I have confirmed that the page entry I manipulated was indeed the corresponding one.
However, accessing 0xC0001000 from a user application still causes a segmentation fault.
Am I missing something important here? Perhaps the CS segment and the GDT? Or is that irrelevant?
Some advice would be nice, thank you in advance :)

From your kernel module, you can simply change the process's effective user ID to 0 so that it can read /dev/kmem.
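A side note on the page-table part of the question. The sketch below decodes the flag values mentioned there, assuming the standard x86 entry layout in which bit 2 is U/S and 1 means user-accessible. Under that layout 0x67 is the user-accessible value and 0x63 the supervisor-only one, and user access is granted only if U/S is set in every paging-structure entry on the walk (the page directory entry as well as the page table entry). The helper names are made up for illustration; this is a user-space model, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Low flag bits of an x86 paging-structure entry (standard layout). */
#define PTE_P   (1u << 0)  /* present */
#define PTE_RW  (1u << 1)  /* writable */
#define PTE_US  (1u << 2)  /* 1 = user-accessible, 0 = supervisor-only */
#define PTE_A   (1u << 5)  /* accessed */
#define PTE_D   (1u << 6)  /* dirty */

/* True if this single entry is present and permits user-mode access. */
static bool entry_allows_user(uint64_t entry)
{
    return (entry & PTE_P) && (entry & PTE_US);
}

/*
 * True only if every entry on the walk (for example PDE then PTE with
 * 32-bit non-PAE paging) is present and has U/S set.
 */
static bool walk_allows_user(const uint64_t *entries, size_t levels)
{
    assert(entries != NULL && levels > 0);
    for (size_t i = 0; i < levels; i++) {
        if (!entry_allows_user(entries[i]))
            return false;
    }
    return true;
}
```

With these helpers, a walk whose PDE is 0x63 and whose PTE is 0x67 is rejected by walk_allows_user, which would be consistent with a segmentation fault that persists after only the last-level entry was edited.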

Related

Page Table Entry, Present Bit?

Quoting from: http://www.cburch.com/books/vm/index.html
The final bit (labeled P) indicates whether the page is present in
RAM. If this bit is 0, then any access to the page will trigger a page
fault.
My professor doesn't agree. He said the bit can be 0 while the page is in RAM, and he added that this can happen when the page is shared between multiple processes and one of them does something to it.
Can someone kindly explain this? I still don't get it. I'm looking for detailed examples where a page is in RAM but its present bit in the PTE is 0 rather than 1.
Yes, it's possible to have a page in RAM with the P bit cleared.
This technique is useful when writing a kernel (or other system software) for a multi-threaded, multi-processor environment, where a process needs exclusive access to a page or where one piece of code must not interfere with another. We can temporarily deny other cores/processors access by clearing the P bit in the page table entry; the kernel/software must then handle the resulting page fault accordingly.
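As a rough illustration of the idea, here is a user-space model of a page table entry whose P bit is cleared while the frame address is kept, so a fault handler can later restore the mapping. It assumes the x86-64 entry layout; the use of bit 9 (one of the software-available bits) as an "still in RAM" marker and all function names are hypothetical. As far as I know, Linux's PROT_NONE mappings work in a similar spirit, but this sketch does not reproduce the kernel's actual encoding.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_P          (1ull << 0)            /* hardware present bit */
#define PTE_SW_INRAM   (1ull << 9)            /* software-available bit; hypothetical marker */
#define PTE_FRAME_MASK 0x000ffffffffff000ull  /* physical frame address bits (x86-64) */

/* Revoke access: clear P but record that the frame is still in RAM. */
static uint64_t pte_revoke(uint64_t pte)
{
    return (pte & ~PTE_P) | PTE_SW_INRAM;
}

/* What a (hypothetical) page-fault handler would do to re-enable the mapping. */
static uint64_t pte_restore(uint64_t pte)
{
    return (pte | PTE_P) & ~PTE_SW_INRAM;
}

/* What the MMU looks at: only the P bit. */
static bool pte_hw_present(uint64_t pte)
{
    return (pte & PTE_P) != 0;
}

/* What the software knows: the page is in RAM if either bit is set. */
static bool pte_in_ram(uint64_t pte)
{
    return (pte & (PTE_P | PTE_SW_INRAM)) != 0;
}

/* The frame address survives the revoke/restore round trip. */
static uint64_t pte_frame(uint64_t pte)
{
    return pte & PTE_FRAME_MASK;
}
```

After pte_revoke, any access through the entry faults because the hardware sees P = 0, yet pte_in_ram and pte_frame still report where the data lives.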

Where can I find the Process Control Block (PCB) and GDTR/LDTR contents using GDB and QEMU?

I have a barebone linux kernel with buildroot setup for debugging using QEMU and GDB. I am using the x86_64 architecture.
I want to check how the memory protection works for each process. So basically, I need to find the base and limit values that govern the access to the physical memory.
If I understood correctly, the GDTR register in the x86 architecture "holds the base address (32 bits in protected mode; 64 bits in IA-32e mode) and the 16-bit table limit for the GDT." If not, please let me know where such information is held.
I tried using i r (info registers) in GDB, but the output does not show the GDTR/LDTR contents. I read somewhere that there is another method for displaying the contents of these registers while inside the kernel.
I also need to check the PCB (Process Control Block) contents. I can't seem to find a way to do so. I read somewhere that if we do memory dump in the kernel, we can get the PCB contents, but I can't figure out how to do so.
So, how can I check the contents of PCB and GDTR/LDTR from gdb?
The setup is a simple QEMU instance that launches the Linux kernel with buildroot; I connect GDB using target remote :1234 and execute a simple C program that calls fork and exec.
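To make the GDTR layout quoted above concrete, here is a small sketch that decodes the 10-byte image the SGDT instruction stores in 64-bit mode (a 16-bit limit followed by a 64-bit base). The struct and function names are made up for illustration, and the raw bytes are assumed to have been obtained elsewhere, for instance from a memory dump. As far as I know, the QEMU monitor's info registers output also prints GDT= and LDT= lines, which may be the easier route in this setup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded GDTR (or IDTR) contents in IA-32e (64-bit) mode. */
struct desc_table_reg {
    uint16_t limit;  /* table size in bytes, minus 1 */
    uint64_t base;   /* linear base address of the table */
};

/*
 * Decode the 10-byte image stored by SGDT in 64-bit mode:
 * bytes 0-1 hold the limit, bytes 2-9 hold the base (little-endian).
 */
static struct desc_table_reg decode_gdtr(const uint8_t raw[10])
{
    struct desc_table_reg r;

    assert(raw != NULL);
    r.limit = (uint16_t)(raw[0] | (raw[1] << 8));
    r.base = 0;
    for (int i = 7; i >= 0; i--)
        r.base = (r.base << 8) | raw[2 + i];
    return r;
}

/* Number of 8-byte descriptor slots covered by the limit. */
static unsigned gdt_slot_count(const struct desc_table_reg *r)
{
    assert(r != NULL);
    return ((unsigned)r->limit + 1u) / 8u;
}
```

Note that in 64-bit mode the base and limit fields of most segment descriptors are ignored, so per-process memory protection is mostly a matter of the page tables rather than the GDT.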

Can kernel code make things read-only in a way that other kernel code can't undo?

I'm under the impression that the Linux kernel's attempts to protect itself revolve around not letting malicious code run in kernelspace. In particular, if a malicious kernel module were to get loaded, it would be "game over" from a security standpoint. However, I recently came across a post that contradicts this belief and says that there is some way that the kernel can protect parts of itself from other parts of itself:
There is plenty of mechanisms to protect you against malicious modules. I write kernel code for fun so I have some experience in the field; it's basically a flag in the pagetable.
What's there to stop any kernel module from changing that flag in the pagetable back? The only protection against malicious modules is keeping them from loading at all. Once one loads, it's game over.
Make the pagetable readonly. Done.
Kernel modules can just make it read-write again, the same way your code made it read-only, then carry on with their changes.
You can actually lock this down so that kernel mode cannot modify the page table until an interrupt occurs. If your IDT is read-only as well there is no way for the module to do anything about it.
That doesn't make any sense to me. Am I missing something big about how kernel memory works? Can kernelspace code restrict itself from modifying the page table? Would that really prevent kernel rootkits? If so, then why doesn't the Linux kernel do that today to put an end to all kernel rootkits?
If the malicious kernel code is loaded in the trusted way (e.g. loading a kernel module and not exploiting a vulnerability) then no: kernel code is kernel code.
Intel CPUs do have a series of mechanisms to disable read/write access to kernel memory:
CR0.WP, if set, disallows supervisor-mode write accesses to read-only pages, both user and kernel. Used to detect bugs in the kernel code.
CR4.PKE, if set (4-level paging must be enabled, which is mandatory in 64-bit mode), disallows the kernel from accessing (not including instruction fetches) user-mode pages unless these are tagged with the right key (which marks their RW permissions). Used to allow the kernel to write to structures like the vDSO and KUSER_SHARED_DATA but not other user-mode structures. The key permissions are in a register (PKRU), not in memory; the keys themselves are in the page table entries.
CR4.SMEP, if set, disallows kernel instruction fetches from user-mode pages. Used to prevent attacks where a kernel function pointer is redirected to a page allocated in user mode (e.g. the nelson.c privilege escalation exploit).
CR4.SMAP, if set, disallows kernel data accesses to user-mode pages: implicit accesses always, and explicit accesses as well when EFLAGS.AC = 0 (thus overriding the protection keys). Used to enforce a stricter no-user-mode-access policy.
Of course, the R/W and U/S bits in the paging structures control whether a page is read-only or read-write and whether it is assigned to user or kernel mode.
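For reference, the control bits listed above can be picked out of raw CR0/CR4 values as in the following sketch. The bit positions are the architectural ones (CR0.WP is bit 16; CR4.SMEP, SMAP and PKE are bits 20, 21 and 22); the struct and function names are made up for illustration, and the register values are assumed to come from somewhere else, such as a debugger.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_WP   (1ull << 16)  /* write protect */
#define CR4_SMEP (1ull << 20)  /* supervisor-mode execution prevention */
#define CR4_SMAP (1ull << 21)  /* supervisor-mode access prevention */
#define CR4_PKE  (1ull << 22)  /* protection keys for user-mode pages */

/* Which of the protections discussed above are switched on. */
struct kernel_protections {
    bool wp;
    bool smep;
    bool smap;
    bool pke;
};

/* Decode raw CR0 and CR4 values into individual flags. */
static struct kernel_protections decode_protections(uint64_t cr0, uint64_t cr4)
{
    struct kernel_protections p = {
        .wp   = (cr0 & CR0_WP) != 0,
        .smep = (cr4 & CR4_SMEP) != 0,
        .smap = (cr4 & CR4_SMAP) != 0,
        .pke  = (cr4 & CR4_PKE) != 0,
    };
    return p;
}
```

Since all four are plain bits in control registers, any code running at CPL 0 can flip them, which is the point made further down.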
You can read how permissions are applied for supervisor-mode accesses in the Intel manual:
Data writes to supervisor-mode addresses.
Access rights depend on the value of CR0.WP:
- If CR0.WP = 0, data may be written to any supervisor-mode address.
- If CR0.WP = 1, data may be written to any supervisor-mode address with a translation for which the
R/W flag (bit 1) is 1 in every paging-structure entry controlling the translation; data may not be written
to any supervisor-mode address with a translation for which the R/W flag is 0 in any paging-structure
entry controlling the translation.
So even if the kernel protected a page X as read-only and then protected the page structures themselves as read-only, a malicious module could simply clear CR0.WP.
It could also change CR3 and use its own paging structures.
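The rule quoted from the manual can be written down as a small predicate, which makes the weakness easy to see: once CR0.WP is cleared, the R/W flags no longer matter for supervisor writes. This is only a model of the quoted rule for supervisor-mode addresses; the function name and the array-of-flags representation are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Model of the quoted rule for data writes to supervisor-mode addresses.
 * rw_flags[i] is the R/W flag of the i-th paging-structure entry that
 * controls the translation (e.g. PML4E, PDPTE, PDE, PTE).
 */
static bool supervisor_write_allowed(bool cr0_wp, const bool *rw_flags,
                                     size_t levels)
{
    assert(rw_flags != NULL || levels == 0);

    /* With CR0.WP = 0, any supervisor-mode address may be written. */
    if (!cr0_wp)
        return true;

    /* With CR0.WP = 1, R/W must be 1 in every controlling entry. */
    for (size_t i = 0; i < levels; i++) {
        if (!rw_flags[i])
            return false;
    }
    return true;
}
```

For a translation whose last-level entry is read-only, the predicate denies the write with cr0_wp set and allows it as soon as cr0_wp is false.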
Note that Intel developed SGX to address the threat model where the kernel itself is evil.
However, running kernel components in enclaves in a secure way (i.e. with no single point of failure) may not be trivial.
Another approach is virtualizing the kernel with the VMX extension, though this is by no means trivial to implement.
Finally, the CPU has four protection levels at the segmentation layer but paging has only two: supervisor (CPL = 0) and user (CPL > 0).
It is theoretically possible to run a kernel component in "Ring 1" but then you'd need to make an interface (e.g. something like a call gate or syscall) for it to access the other kernel functions.
It's easier to run it in user mode altogether (since you don't trust the module in the first place).
I have no idea what this is supposed to mean:
You can actually lock this down so that kernel mode cannot modify the page table until an interrupt occurs.
I don't recall any mechanism by which the interrupt handling will lock/unlock anything.
I'm curious though, if anybody can shed some light they are welcome.
Security in x86 CPUs (though this may be generalized) has always been hierarchical: whoever comes first sets up the constraints for whoever comes later.
There is usually little to no protection between nonisolated components at the same hierarchical level.

page fault in copy_to_user, how kernel map a page for user space address?

I've learned that when a page fault occurs in the copy_to_user function, the exception table is used.
But I found that almost all fixups just set the return value and jump to the instruction after the one that triggered the page fault.
Where does the kernel do the mapping work for the user-space address?
I mean, there must be at least some place where the kernel modifies the page table.
Your question is very unclear. copy_to_user is basically a function for copying data from kernel space to user space. It exists mainly for security reasons: we don't want to give user code direct access to kernel data structures and kernel space, so we need a mechanism to ask the kernel for this data.
A new mapping will indeed be added to the page tables. The mapping is done in kernel space, where the page tables reside: the page-fault handler maps the page when the faulting user address is valid, and the exception-table fixup only comes into play when the fault cannot be resolved that way.
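To illustrate how the two paths relate, here is a simplified user-space model of the decision made on a fault inside a copy routine. It assumes an exception table of (faulting instruction, fixup address) pairs sorted by instruction address; the real kernel table stores relative offsets and has more fields, and every name below is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified exception table entry: where a fault may occur, where to resume. */
struct ex_entry {
    uintptr_t insn;
    uintptr_t fixup;
};

/* Binary search in a table sorted by insn; returns the fixup address or 0. */
static uintptr_t find_fixup(const struct ex_entry *table, size_t n,
                            uintptr_t fault_ip)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (table[mid].insn == fault_ip)
            return table[mid].fixup;
        if (table[mid].insn < fault_ip)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}

enum fault_result { FAULT_MAPPED, FAULT_FIXED_UP, FAULT_OOPS };

/*
 * Valid user address: the page is mapped and the same instruction is retried.
 * Invalid address: resume at the fixup if one exists, otherwise give up.
 */
static enum fault_result handle_kernel_fault(bool addr_is_valid,
                                             const struct ex_entry *table,
                                             size_t n, uintptr_t fault_ip,
                                             uintptr_t *resume_ip)
{
    assert(resume_ip != NULL);

    if (addr_is_valid) {
        *resume_ip = fault_ip;
        return FAULT_MAPPED;
    }

    uintptr_t fixup = find_fixup(table, n, fault_ip);
    if (fixup != 0) {
        *resume_ip = fixup;
        return FAULT_FIXED_UP;
    }
    return FAULT_OOPS;
}
```

In this model the fixup code never touches the page tables; it only runs on the branch where mapping was not possible, which is why the fixups in the kernel source look like they merely set a return value.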

Can protection mode be turned off with inline assembly?

If a user didn't have root privileges, could that user still write a user-space program with inline assembly to turn off protection mode on the computer and overwrite memory in other segments, assuming the OS is Linux?
Not unless the user knows of a security vulnerability to get root permissions. Mechanisms like /dev/mem allow root to read and write all userspace memory, and kernel module loading allows root access to kernel memory and the rest of the system's IO space.
Assuming the system is working as intended, no.
In reality, there are undoubtedly a few holes somewhere that would allow it -- given a code base that size, bugs are inevitable, and a few could probably be exploited to get into ring 0.
That said, I'm only guessing based on statistics -- I can't point to a specific vulnerability.
