I want to use monitor and mwait instructions in a userspace application. Unfortunately, they're privileged instructions only executable by ring 0.
My application has root access. How can I escalate privileges to ring 0?
I've considered a kernel module that adds them as a syscall, but that destroys the performance improvement I need them for.
Compiling a custom kernel is an option. I have no idea where in the source the switch to ring 0 might be located however, nor if it'll have any side-effects on e.g. virtual memory.
Any ideas?
It is not possible to get ring 0 from user space with a standard Linux kernel, and it's preferable to write a kernel module to do the things you want. But if you really want ring 0 in user space, I'll give you a starting point.
x86 processors store the Current Privilege Level (CPL) in the two least significant bits of the cs register.
When a new thread is created, the Linux kernel checks whether it is a user thread or a kernel thread and stores the appropriate cs value for that task. (Proof: copy_thread() in arch/x86/kernel/process_32.c.)
So you can get a pointer to the task's saved registers with the task_pt_regs() macro (arch/x86/include/asm/processor.h) and alter cs to set the ring to 0 with regs->cs &= ~0x3; or something similar.
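As a very rough sketch of the idea only (drop_to_ring0 is a made-up name, and this is unsafe by design), the core of such a module would look something like:

    /* Hypothetical sketch -- do NOT use this on a machine you care about.
     * task_pt_regs() and struct pt_regs are real kernel interfaces;
     * everything else here is illustrative. */
    #include <linux/sched.h>
    #include <asm/processor.h>

    static void drop_to_ring0(struct task_struct *task)
    {
        struct pt_regs *regs = task_pt_regs(task);

        regs->cs &= ~0x3;  /* clear the two RPL bits of the saved cs */
    }

Note that clearing the RPL bits of a user code selector leaves it pointing at a DPL=3 descriptor, so the return to user space will almost certainly fault; making this actually work would take far more surgery, which is exactly why a kernel module is the sane route.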
But again, I strongly recommend that you don't do it.
After a long read, I am really confused.
From what I read:
Modern OS does not use segments at all.
The GDT is used to define a segment in the memory (including constraints).
The page table has a supervisor bit that indicates if the current location is for the kernel.
Wikipedia says that "The GDT is still present in 64-bit mode; a GDT must be defined but is generally never changed or used for segmentation."
Why do we need it at all? And how does Linux use it?
Modern OS does not use segments at all.
A modern OS (for 64-bit 80x86) still uses segment registers; it's just that their use is "mostly hidden" from user-space (and most user-space code can ignore them). Specifically; the CPU will determine if the code is 64-bit (or 32-bit or 16-bit) from whatever the OS loads (from GDT or LDT) into CS, interrupts still save CS and SS for the interrupted code (and load them again at iret), GS and/or FS are typically used for thread-local and/or CPU local storage, etc.
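For example, here's a minimal illustration of the thread-local storage point: on x86-64 Linux, a __thread variable is typically compiled into an access relative to the FS segment base:

    #include <stdio.h>

    /* On x86-64 Linux, the compiler typically addresses this through the
     * FS segment register (e.g. movl %fs:counter@tpoff, %eax). */
    static __thread int counter;

    int main(void)
    {
        counter++;
        printf("counter = %d\n", counter);
        return 0;
    }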
The GDT is used to define a segment in the memory (including constraints).
Code and data segments are just one of the things that GDT is used for. The other main use is defining where the Task State Segment is (which is used to find IO port permission map, values to load into CS, SS and RSP when there's a privilege level change caused by an interrupt, etc). It's also still possible for 64-bit code (and 32-bit code/processes running under a 64-bit kernel) to use call gates defined in the GDT, but most operating systems don't use that feature for 64-bit code (they use syscall instead).
The page table has a supervisor bit that indicates if the current location is for the kernel.
Yes. The page table's supervisor bit determines if code running at CPL=3 can/can't access the page (or if the code must be CPL=2, CPL=1 or CPL=0 to access the page).
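In an x86 page-table entry that's bit 2; a minimal sketch (the macro names are illustrative, not from any particular kernel):

    #include <stdint.h>

    #define PTE_PRESENT  (UINT64_C(1) << 0)
    #define PTE_WRITABLE (UINT64_C(1) << 1)
    #define PTE_USER     (UINT64_C(1) << 2)  /* clear = supervisor-only: CPL=3 access faults */

    /* A kernel-only mapping simply leaves PTE_USER clear. */
    static inline int user_can_access(uint64_t pte)
    {
        return (pte & PTE_PRESENT) && (pte & PTE_USER);
    }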
Wikipedia says that "The GDT is still present in 64-bit mode; a GDT must be defined but is generally never changed or used for segmentation."
Yes - Wikipedia is right. Typically an OS will set up a GDT early during boot (for TSS, CS, SS, etc) and then not have any reason to modify it after boot; and the segment registers aren't used for "segmented memory protection" (but are used for other things - determining code size, if an interrupt handler should return to CPL=0 or not, etc).
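As a sketch of what "set up once during boot" can look like, here is an illustrative long-mode GDT (the values follow the standard descriptor encoding, but the layout is not Linux's actual GDT):

    #include <stdint.h>

    /* Illustrative long-mode GDT: base/limit are ignored for 64-bit code,
     * so only the access and flag bits really matter. */
    static uint64_t gdt[] = {
        0x0000000000000000ULL,  /* 0x00: mandatory null descriptor  */
        0x00af9a000000ffffULL,  /* 0x08: kernel code, 64-bit, DPL=0 */
        0x00cf92000000ffffULL,  /* 0x10: kernel data, DPL=0         */
        0x00affa000000ffffULL,  /* 0x18: user code, 64-bit, DPL=3   */
        0x00cff2000000ffffULL,  /* 0x20: user data, DPL=3           */
        /* plus a 16-byte TSS descriptor filled in at boot */
    };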
The following quote is from the "Understanding the Linux Kernel 3rd Edition" book:
When a User Mode process attempts to access an I/O port by means of an
in or out instruction, the CPU may need to access an I/O Permission
Bitmap stored in the TSS to verify whether the process is allowed to
address the port.
More precisely, when a process executes an in or out I/O instruction
in User Mode, the control unit performs the following operations:
1. It checks the 2-bit IOPL field in the eflags register. If it is set to 3, the control unit executes the I/O instructions. Otherwise, it performs the next check.
2. It accesses the tr register to determine the current TSS, and thus the proper I/O Permission Bitmap.
3. It checks the bit of the I/O Permission Bitmap corresponding to the I/O port specified in the I/O instruction. If it is cleared, the instruction is executed; otherwise, the control unit raises a “General protection” exception.
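In C-like pseudocode, the check described above is roughly this (the names are illustrative; the general hardware rule compares CPL against IOPL, which for a user-mode process means IOPL must be 3):

    #include <stdint.h>

    /* Conceptual sketch of the hardware's decision, not real code. */
    static int io_access_allowed(unsigned cpl, unsigned iopl,
                                 const uint8_t *io_bitmap, unsigned port)
    {
        if (cpl <= iopl)                 /* IOPL check passes: no bitmap lookup */
            return 1;
        /* otherwise consult the TSS I/O Permission Bitmap: bit clear = allowed */
        return !(io_bitmap[port / 8] & (1u << (port % 8)));
    }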
The following quote is also from the same book:
Although Linux doesn’t use hardware context switches, it is
nonetheless forced to set up a TSS for each distinct CPU in the
system.
Now, if Linux has only one TSS structure for all processes (instead of each process having its own TSS structure), and we know that each process must have its own I/O Permission Bitmap, does that mean that when Linux schedules execution to another process, it changes the I/O Permission Bitmap in the single TSS structure the CPU uses to that of the process about to be executed (which Linux presumably stores somewhere in kernel memory)?
Yes. From the same section of the book, it says:
The tss_struct structure describes the format of the TSS. As already
mentioned in Chapter 2, the init_tss array stores one TSS for each CPU
on the system. At each process switch, the kernel updates some fields
of the TSS so that the corresponding CPU’s control unit may safely
retrieve the information it needs. Thus, the TSS reflects the
privilege of the current process on the CPU, but there is no need to
maintain TSSs for processes when they’re not running.
In later versions of the kernel, init_tss was renamed to cpu_tss. The TSS structure of each processor is initialized in cpu_init, which is executed once per processor when booting the system.
When switching from one task to another, __switch_to_xtra is called, which calls switch_to_bitmap, which simply copies the IO bitmap of the next task into the TSS structure of the processor on which it is scheduled to run next.
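Conceptually, the copy looks something like the sketch below (simplified; the real kernel also tracks how much of the bitmap is valid, and recent kernels have reworked this path):

    #include <stdint.h>
    #include <string.h>

    #define IO_BITMAP_BYTES (65536 / 8)  /* one bit per I/O port */

    struct cpu_tss { uint8_t io_bitmap[IO_BITMAP_BYTES]; };
    struct task_io { uint8_t *io_bitmap; };  /* NULL if the task has none */

    static void switch_io_bitmap(struct cpu_tss *tss, const struct task_io *next)
    {
        if (next->io_bitmap)
            memcpy(tss->io_bitmap, next->io_bitmap, IO_BITMAP_BYTES);
        else
            memset(tss->io_bitmap, 0xff, IO_BITMAP_BYTES);  /* all ports denied */
    }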
Related: How do Intel CPUs that use the ring bus topology decode and handle port I/O operations.
I want to know how privilege separation is enforced by the kernel, and which part of the kernel is responsible for this task.
For example, assume there are two processes running -- one at ring 0 and another at ring 3. How does the kernel keep track of the ring number of each process?
Edit: I know about ring numbers. My question is about the part of the kernel (a module or something) that performs checks on processes to find out their privilege level. I believe there might be a kernel component that checks the ring number of a process.
There is no concept of a ring number of a process.
The kernel is mapped in one area of memory, userspace is mapped in another. On boot the kernel specifies an address where the cpu has to jump to when the syscall instruction is executed. So someone does syscall, the cpu switches to ring0 and jumps to the address as instructed by the kernel. It is now executing kernel code. Then, on return, the cpu switches back to ring3 and resumes execution.
Similar story for other ways of entering the kernel like exceptions.
So, how does the Linux kernel enforce separation? It sets things up so that userspace executes in ring3. Anything triggering the CPU to switch to ring0 also makes it jump to an address configured by the kernel on boot. No code other than kernel code executes in ring0.
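As a sketch of the "address configured by the kernel on boot" part on x86-64 (illustrative, not actual Linux code; syscall_entry is a hypothetical entry stub): the kernel writes its syscall entry point into the LSTAR MSR once, and every later syscall instruction switches to ring0 and jumps there:

    #include <stdint.h>

    #define MSR_LSTAR 0xC0000082u  /* long-mode syscall target address */

    static inline void wrmsr(uint32_t msr, uint64_t val)
    {
        asm volatile("wrmsr" :: "c"(msr),
                     "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
    }

    extern void syscall_entry(void);  /* hypothetical kernel entry stub */

    static void setup_syscall_entry(void)
    {
        wrmsr(MSR_LSTAR, (uint64_t)syscall_entry);
    }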
I am studying how the CPU changes from user mode to kernel mode in Linux. I came across two different methods: interrupts and the sysenter instruction.
I could not understand how sysenter works. Could someone please explain what exactly happens in the cpu when the sysenter instruction is run?
The problem that a program faces when it wants to get into the kernel (aka "making syscalls") is that user programs cannot access anything kernel-related, yet the program has to somehow switch the CPU into "kernel mode".
On an interrupt, this is done by the hardware.
It also happens automatically when a (CPU, not C++) exception occurs, like accessing memory that doesn't exist, a division by zero, or invoking a privileged instruction in user code. Or trying to execute an unimplemented instruction. This last one is actually a decent way to implement a "call the kernel" interface: the CPU runs into an instruction it doesn't know, so it raises an exception that drops the CPU into kernel mode and into the kernel. The kernel code can then check whether the "correct" unimplemented instruction was used and perform the syscall work if it was, or just kill the process if it was any other unimplemented instruction.
Of course, doing something like this isn't, well, "clean". It's more like a dirty hack, abusing what should be an error to implement a perfectly valid control flow change. Hence, CPUs do tend to have actual instructions to do essentially the same thing, just in a more "defined" way. The main purpose of anything like a "sysenter" instruction is still the same: it changes the CPU into "kernel mode", saves the position where the "sysenter" was called, and continues execution somewhere in the kernel.
As for the difference between a "software interrupt" and "sysenter": "sysenter" is specifically optimized for this kind of use case. For example, it doesn't fetch the kernel address to call from memory like a (software) interrupt does, but instead takes the address from a special register, which saves the memory lookup. It might also have additional internal optimizations, based on the fact that software interrupts may be handled more like hardware interrupts, which sysenter doesn't actually need. I don't know the precise details of how these instructions are implemented on the CPUs; you would probably have to read the Intel manuals to really get into such details.
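For comparison, here is what the software-interrupt flavor looks like on 32-bit x86 Linux, where __NR_write is 4 (compile with -m32). The sysenter path is normally reached indirectly through the vDSO's __kernel_vsyscall rather than hand-coded:

    #include <stddef.h>

    /* write(fd, buf, len) via int $0x80 on 32-bit x86 Linux */
    static long sys_write_int80(int fd, const void *buf, size_t len)
    {
        long ret;
        asm volatile("int $0x80"
                     : "=a"(ret)                           /* result in eax  */
                     : "a"(4), "b"(fd), "c"(buf), "d"(len) /* __NR_write = 4 */
                     : "memory");
        return ret;
    }

    int main(void)
    {
        sys_write_int80(1, "hello\n", 6);
        return 0;
    }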
Presumably there is a library or simple asm blob that can get me the number of the current CPU that I am executing on.
Use sched_getcpu to determine the CPU on which the calling thread is running. See man getcpu (the system call) and man sched_getcpu (a library wrapper). However, note what it says:
The information placed in cpu is only guaranteed to be current at the time of the call: unless the CPU affinity has been fixed using sched_setaffinity(2), the kernel might change the CPU at any time. (Normally this does not happen because the scheduler tries to minimize movements between CPUs to keep caches hot, but it is possible.) The caller must be prepared to handle the situation when cpu and node are no longer the current CPU and node.
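A minimal usage example:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        int cpu = sched_getcpu();
        if (cpu == -1)
            perror("sched_getcpu");
        else
            printf("running on CPU %d\n", cpu);  /* may be stale immediately */
        return 0;
    }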
You need to do something like:
- Call sched_getaffinity and identify the CPU bits.
- Iterate over the CPUs, doing sched_setaffinity to each one. (I'm not sure whether after sched_setaffinity you're guaranteed to be on that CPU, or whether you need to yield explicitly?)
- Execute CPUID (the asm instruction)... there is a way of getting a unique per-core ID out of one of its outputs (see the Intel docs). I vaguely recall it's the "APIC ID".
- Build a table (a std::map?) from APIC IDs to a CPU number or affinity mask or something.
If you did this on your main thread, don't forget to set sched_setaffinity back to all CPUs!
Now you can CPUID again whenever you need to and look up which core you're on.
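For the CPUID step above, the initial APIC ID is in bits 31:24 of EBX for CPUID leaf 1. A minimal sketch using GCC's cpuid.h (newer CPUs expose a wider x2APIC ID via leaf 0Bh):

    #include <cpuid.h>
    #include <stdio.h>

    static unsigned initial_apic_id(void)
    {
        unsigned eax, ebx, ecx, edx;
        __get_cpuid(1, &eax, &ebx, &ecx, &edx);
        return ebx >> 24;  /* initial APIC ID of the current logical CPU */
    }

    int main(void)
    {
        printf("APIC ID: %u\n", initial_apic_id());
        return 0;
    }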
But I'd query why you need to do this; normally you want to take control via sched_setaffinity rather than finding out which core you're on (and even that's a pretty rare thing to want/need). (That's why I don't know the crucial detail of what to pull out of CPUID exactly, sorry!)
Update: Just learned about sched_getcpu from litb's response here. Much better! (my Debian/etch libc is too old to have it though).
I don't know of anything to get your current core ID. With kernel-level task/process migration, you wouldn't be guaranteed that it would remain constant for any length of time, unless you were running in some form of real-time mode.
If you want to be on a specific core, you can use the sched_setaffinity() function or the taskset command to launch your program. I believe these need elevated permissions to work, though. In your program, you could then run sched_getaffinity() to see the mask that was set earlier and use that as a best guess at the core on which you are executing.
sysconf(_SC_NPROCESSORS_ONLN);
(Note: this returns the number of online processors, not the index of the CPU you're currently running on.)