I want to know how privilege separation is enforced by the kernel, and which part of the kernel is responsible for this task.
For example, assume there are two processes running -- one at ring 0 and another at ring 3. How does the kernel keep track of the ring number of each process?
Edit: I know about ring numbers. My question is about the part of the kernel (a module or something) which performs checks on processes to find out their privilege level. I believe there might be a component of the kernel which checks the ring number of a process.
There is no concept of a ring number of a process.
The kernel is mapped in one area of memory, userspace in another. On boot, the kernel specifies an address the CPU has to jump to when the syscall instruction is executed. So someone executes syscall, the CPU switches to ring 0 and jumps to the address the kernel configured. It is now executing kernel code. Then, on return, the CPU switches back to ring 3 and resumes execution.
Similar story for other ways of entering the kernel like exceptions.
So, how does the Linux kernel enforce separation? It sets things up so that userspace executes in ring 3. Anything that triggers the CPU to switch to ring 0 also makes it jump to an address configured by the kernel on boot, so no code other than kernel code ever executes in ring 0.
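To make "an address configured by the kernel on boot" concrete: on x86-64 the target of the syscall instruction is held in the IA32_LSTAR model-specific register, which only ring 0 can write. A minimal sketch of the boot-time setup (the entry stub name is made up; the real kernel does something similar in syscall_init() and also programs MSR_STAR and EFER.SCE):

    /* Sketch: register the kernel's syscall entry point at boot.
     * syscall_entry here is an illustrative name, not the real symbol. */

    #define MSR_LSTAR 0xC0000082  /* IA32_LSTAR: target RIP for syscall */

    static inline void wrmsr64(unsigned int msr, unsigned long long val)
    {
        asm volatile("wrmsr" : : "c"(msr), "a"((unsigned int)val),
                                 "d"((unsigned int)(val >> 32)));
    }

    extern void syscall_entry(void);  /* hypothetical ring-0 entry stub */

    void setup_syscall_entry(void)
    {
        /* From now on, `syscall` in ring 3 switches the CPU to ring 0
         * and jumps here; userspace never gets to choose the target. */
        wrmsr64(MSR_LSTAR, (unsigned long long)(unsigned long)syscall_entry);
    }

Since wrmsr is itself a privileged instruction, userspace cannot redirect this entry point; that is the whole enforcement mechanism.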
I came across the following problem on a previous exam from my operating systems class.
Consider an architecture in which the TRAP instruction has the following effects: it loads a predefined value into the Processor Status Register (PCR), which contains the user/kernel mode bit, saves the value of the Program Counter (PC) to a special Save PC register, and loads a predefined value into the PC. Explain why loading a new value for the PCR without also changing the PC in the same instruction cycle would be unsafe.
I know that the PCR would be set to kernel mode with memory management off. Is it unsafe because the PC is still in the user program? If so, where could it go wrong? If not, why is it unsafe? Why would changing the PC first also be unsafe?
Aside: there is no reason to assume that "memory management" is turned "off" by loading the new processor status; in fact, in the CPUs in my experience that would not happen. But that is not relevant to this answer.
We're executing in user mode and a TRAP instruction is fetched. The program counter is then (let's say) pointing to the instruction after TRAP.
Now the processor executes the TRAP. It loads the new processor status, which switches the CPU to kernel mode. Assume this does not in itself inhibit device interrupts.
NOW... a device interrupts. The hardware or software mechanism saves the processor status (=kernel mode) and program counter (=the user-mode address of the instruction after TRAP). The device interrupt service routine does its thing and executes a return from interrupt to restore program counter and processor status. We can't resume "half-way through the TRAP instruction" - the only thing that can happen is that we start to execute the instruction that PC points to, i.e., we're executing the instruction after the TRAP but in kernel mode.
The exact problem depends on the system architecture:
If the kernel address map is a superset of the user address map (typical on OSes where user space is half the total address space) then we're executing user-provided code in kernel mode, which is at least a serious privilege problem, and may cause us to fail by page faulting when we can't handle it.
If the kernel address map doesn't include user space (frequently the case on systems with limited virtual address size) then this is equivalent to taking a wild jump into the kernel.
The summary is that you need both the processor status and program counter to define "where you are in execution", and they both need to be saved/updated together; or in other words, no change of control (such as an interrupt) can be permitted in the middle.
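A toy model of that interleaving in C (the "CPU" here is just a (mode, PC) pair; trap_step1 models a hypothetical TRAP that updates the status register one cycle before the PC):

    #include <stdio.h>

    struct cpu { int kernel_mode; unsigned pc; };

    /* Hypothetical non-atomic TRAP: step 1 flips the mode bit;
     * step 2 (loading the kernel entry PC) has not happened yet. */
    static void trap_step1(struct cpu *c) { c->kernel_mode = 1; }

    static void device_interrupt(struct cpu *c)
    {
        /* Hardware saves (status, PC) as a pair, but between the two
         * TRAP steps it captures kernel mode paired with a user PC. */
        printf("saved: mode=%s pc=%#x\n",
               c->kernel_mode ? "kernel" : "user", c->pc);
    }

    int main(void)
    {
        struct cpu c = { .kernel_mode = 0, .pc = 0x1000 }; /* user code */
        trap_step1(&c);       /* status register updated...            */
        device_interrupt(&c); /* ...interrupt before the PC changes:   */
        return 0;             /* return-from-interrupt now resumes the */
                              /* user instruction after TRAP in kernel */
                              /* mode, exactly the failure above.      */
    }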
Mapped this memory in my Thread in Userspace:
b7fd0000-b7fd1000 rwxp 00000000 00:00 0
Thread is running (endless loop)
Made a breakpoint in the Kernel and trying to access it:
Thread 466 received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 3908]
0xc10d4060 in kgdb_breakpoint ()
(gdb) x/01i 0xb7fd0000
0xb7fd0000: Cannot access memory at address 0xb7fd0000
But it is not accessible.
How can I access 0xb7fd0000 from kernel space? What address will it be under?
Is it even possible?
The address the memory will appear under depends on which user space context is currently mapped.
The way this works is, some of the virtual addresses are reserved to the kernel, and these are the same in all contexts. This is why you can set a break point on a kernel address without worrying about which user space process is currently mapped.
For user space, this is not the case. Each time a new process is mapped in, the virtual addresses for user space change completely.
This is, likely, an X-Y problem. You're trying to do something, and you think that a kernel level break point is how you want to achieve it.
Taking a guess, you want your kernel driver to do something to communicate with your user space thread. If that's the case, your best bet is to export a character device, and have the userspace open it and mmap from there (rather than just an anonymous mmap). You can then control which memory it receives, and thus also map it to the kernel address space, where pointers are stable.
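A minimal sketch of that approach (names are hypothetical, device registration and cleanup omitted; a production driver would reserve the page or use dma_alloc_coherent rather than plain kmalloc):

    #include <linux/fs.h>
    #include <linux/io.h>
    #include <linux/mm.h>
    #include <linux/module.h>
    #include <linux/slab.h>

    static void *shared_buf;  /* stable kernel-side pointer to the page */

    static int mydev_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;

        if (size > PAGE_SIZE)
            return -EINVAL;

        /* Map the kernel buffer into the calling process's address space. */
        return remap_pfn_range(vma, vma->vm_start,
                               virt_to_phys(shared_buf) >> PAGE_SHIFT,
                               size, vma->vm_page_prot);
    }

    static const struct file_operations mydev_fops = {
        .owner = THIS_MODULE,
        .mmap  = mydev_mmap,
    };

    static int __init mydev_init(void)
    {
        shared_buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
        /* registering the character device itself is omitted here */
        return shared_buf ? 0 : -ENOMEM;
    }
    module_init(mydev_init);
    MODULE_LICENSE("GPL");

Userspace then opens the device and mmaps it; both sides refer to the same physical page, and the kernel-side address (shared_buf) never changes, so a kernel breakpoint can inspect it regardless of which process is currently mapped.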
I'm no expert in kgdb, but it is worth checking that current->pid, at the time the breakpoint is hit, contains the pid of the process you're tracking. Even though a memory location inside kernel space has been hit, the current user-space process may not be the one you are interested in. Also, to guard against the possibility of pages being swapped out, it may be safer to lock the mapped pages using mlock.
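The userspace side of that advice might look like this (error handling abbreviated; it mirrors the rwxp anonymous mapping and endless loop from the question):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        if (mlock(p, 4096) != 0)    /* pin the page so it can't swap out */
            return 1;
        *(volatile char *)p = 0;    /* touch it so it is faulted in */
        printf("mapped at %p\n", p);
        for (;;)
            ;                       /* endless loop, as in the question */
    }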
There are at least two things to pay attention to.
When to break. If you simply hit ctrl-C in the kernel debugger, you don't know what the userspace context is going to be; it could be any userspace process. You want the kernel debugger to pause and give you control when the userspace context is that of the process you are interested in. One way to do this is as follows: if you have the ability to recompile the kernel, add a new system call, and invoke it from the userspace process after the region is mmap'd in. Start debugging the kernel after placing a breakpoint on the newly added system call (a minimal sketch follows the next point). When the breakpoint is hit, you know for a fact that the userspace context is that of the process you are interested in.
Virtual memory late binding. Even if you follow the steps in [1], you will still have trouble accessing the contents of the buffer in userspace if you have not read or written anything at that location. Make sure that after mmap'ing the region, you read or write the mmap'd location before invoking your newly added system call.
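A minimal sketch of such a marker system call (the name is made up, and wiring it into the architecture's syscall table is omitted):

    #include <linux/printk.h>
    #include <linux/sched.h>
    #include <linux/syscalls.h>

    /* Hypothetical marker syscall: does nothing useful, exists only so
     * a kernel breakpoint on it guarantees the right userspace context. */
    SYSCALL_DEFINE1(dbg_marker, unsigned long, uaddr)
    {
        pr_info("dbg_marker: pid=%d uaddr=%lx\n", current->pid, uaddr);
        return 0;
    }

Break on the new handler; when the breakpoint fires, current is your process and the mmap'd, touched buffer is addressable.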
What's the point in keeping a different kernel stack for each process in linux?
Why not keep just one stack for the kernel to work with?
What's the point in keeping a different kernel stack for each process in linux?
It simplifies preemption of processes in kernel space.
Why not keep just one stack for the kernel to work with?
It would be a nightmare to implement preemption without separate stacks.
Separate kernel stacks are not really mandated; each architecture is free to do whatever it wants. If there were no preemption during a system call, then a single kernel stack might make sense.
*nix has processes, and each process can make a system call. Linux allows one task to be preempted during a write(), etc., and another task to be scheduled. The kernel stack is a snapshot of the context of the kernel work being performed for each process.
Also, the per-process kernel stacks come with little overhead. Some mechanism, such as thread_info, is needed to get the process information from assembler, and that costs at least a page allocation. By placing the kernel-mode stack in the same allocation, a simple mask recovers the thread_info from assembler. So we already need the per-process variable and the allocation; why not use it as a stack to store kernel context and allow preemption during system calls?
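The "simple mask" trick looks roughly like this (it mirrors the classic 32-bit x86 layout, where thread_info sat at the bottom of the kernel stack; recent kernels moved thread_info into task_struct, but the allocation argument is the same):

    /* Classic scheme: the kernel stack is one THREAD_SIZE-aligned block
     * with thread_info at its base, so masking the stack pointer finds it. */
    #define THREAD_SIZE 8192   /* 8k kernel stack, as on old 32-bit x86 */

    struct thread_info;  /* per-task bookkeeping (flags, task pointer, ...) */

    static inline struct thread_info *current_thread_info(unsigned long sp)
    {
        /* Round the live stack pointer down to the start of the block. */
        return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
    }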
The efficiency of preemption can be demonstrated by the write mentioned above. If the write() is to disk or network, it will take time to complete. A 5k to 8k buffer written to disk or network takes many CPU cycles (if synchronous), and the user process blocks until it is finished. The transfer in the driver can be done with DMA, where a hardware element completes the task of moving the buffer to the device. In the meantime, a lower-priority process can have the CPU and be allowed to make system calls, because the kernel keeps a separate stack per process. These stacks are near zero cost, as the kernel already needs bookkeeping information for process state, and the two are kept together in a 4k or 8k page.
Why not keep just one stack for the kernel to work with?
In this case only one process/thread would be able to enter the kernel at a time.
Basically, each thread has its own stack, and crossing the user-space to kernel boundary does not change this fact. The kernel also has its own kernel threads (not belonging to any user-space process), and they all have their own stacks.
I want to use the monitor and mwait instructions in a userspace application. Unfortunately, they're privileged instructions, executable only in ring 0.
My application has root access. How can I escalate privileges to ring 0?
I've considered a kernel module that adds them as a syscall, but that destroys the performance improvement I need them for.
Compiling a custom kernel is an option. I have no idea where in the source the switch to ring 0 might be located, however, nor whether it'll have any side effects on e.g. virtual memory.
Any ideas?
It is not possible to get ring 0 from user space with a standard Linux kernel, and it's preferable to write a kernel module to do the things you want. But if you really want ring 0 in user space, I'll give you a starting point.
x86 processors store the Current Privilege Level (CPL) in the two least significant bits of the cs register.
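You can observe this from userspace: read cs and mask off the low two bits (x86 GCC/Clang inline assembly assumed):

    #include <stdio.h>

    int main(void)
    {
        unsigned short cs;

        /* Copy the code-segment selector into a general register. */
        asm volatile("mov %%cs, %0" : "=r"(cs));
        printf("cs=%#x CPL=%u\n", (unsigned)cs, (unsigned)(cs & 3));
        return 0;  /* prints CPL=3 when run as a normal user program */
    }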
When a new thread is created, the Linux kernel checks whether it is a user thread or a kernel one and stores the appropriate cs value for the task. (Proof: copy_thread() in arch/x86/kernel/process_32.c.)
So you are able to get a pointer to the task's registers with the task_pt_regs() macro (arch/x86/include/asm/processor.h) and alter cs to set the ring to 0 with regs->cs &= ~0x3; or something similar.
But again, I strongly recommend you, don't do it.
I am studying about how CPU changes from user mode to kernel mode in linux. I came across two different methods: Interrupts and using sysenter.
I could not understand how sysenter works. Could someone please explain what exactly happens in the CPU when the sysenter instruction is executed?
The problem that a program faces when it wants to get into the kernel (aka "making syscalls") is that user programs cannot access anything kernel-related, yet the program has to somehow switch the CPU into "kernel mode".
On an interrupt, this is done by the hardware.
It also happens automatically when a (CPU, not C++) exception occurs, like accessing memory that doesn't exist, a division by zero, or invoking a privileged instruction in user code. Or trying to execute an unimplemented instruction. This last one is actually a decent way to implement a "call the kernel" interface: the CPU runs into an instruction it doesn't know, so it raises an exception which drops the CPU into kernel mode and into the kernel. The kernel code can then check whether the "correct" unimplemented instruction was used and perform the syscall work if it was, or just kill the process if it was any other unimplemented instruction.
Of course, doing something like this isn't, well, "clean". It's more like a dirty hack, abusing what should be an error to implement a perfectly valid control flow change. Hence, CPUs do tend to have actual instructions to do essentially the same thing, just in a more "defined" way. The main purpose of anything like a "sysenter" instruction is still the same: it changes the CPU into "kernel mode", saves the position where the "sysenter" was called, and continues execution somewhere in the kernel.
As for the difference between a "software interrupt" and "sysenter": "sysenter" is specifically optimized for this kind of use case. For example, it doesn't fetch the kernel address to call from memory the way a (software) interrupt does, but instead takes the address from a special register, which saves the memory lookup. It may also have additional internal optimizations, based on the fact that software interrupts are handled more like hardware interrupts, which the sysenter instruction doesn't need. I don't know the precise implementation details of these instructions on real CPUs; you would probably have to read the Intel manuals to really get into such details.
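For the curious, the "special register" is really a small set of model-specific registers. On Intel CPUs, sysenter takes its target from IA32_SYSENTER_CS/ESP/EIP (MSRs 0x174 through 0x176), which the kernel programs once at boot, so no memory lookup (like the IDT walk for int 0x80) is needed on entry. A hedged sketch of that setup (the entry stub and stack symbols are made up):

    /* Program the SYSENTER target MSRs at boot (see the Intel SDM). */
    #define MSR_IA32_SYSENTER_CS   0x174  /* kernel code segment selector */
    #define MSR_IA32_SYSENTER_ESP  0x175  /* kernel stack pointer to load */
    #define MSR_IA32_SYSENTER_EIP  0x176  /* kernel entry point to jump to */

    static inline void wrmsr(unsigned int msr, unsigned long long val)
    {
        asm volatile("wrmsr" : : "c"(msr), "a"((unsigned int)val),
                                 "d"((unsigned int)(val >> 32)));
    }

    extern void sysenter_entry(void);  /* hypothetical entry stub */
    extern char kernel_stack_top[];    /* hypothetical kernel stack */

    void setup_sysenter(unsigned short kernel_cs)
    {
        wrmsr(MSR_IA32_SYSENTER_CS,  kernel_cs);
        wrmsr(MSR_IA32_SYSENTER_ESP, (unsigned long)kernel_stack_top);
        wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)sysenter_entry);
    }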