Trap instruction: why must the program counter and processor status register be changed atomically?

I came across the following problem on a previous exam from my operating systems class.
Consider an architecture in which the TRAP instruction has two effects: it loads a predefined value into the Processor Status Register (PCR), which contains the user/kernel mode bit, and it saves the value of the Program Counter (PC) to a special Save PC register and loads a predefined value into the PC. Explain why loading a new value for the PCR without also changing the PC in the same instruction cycle would be unsafe.
I know that the PCR would be set to kernel mode with memory management off. Is it unsafe because the PC is still in the user program? If so where could it go wrong? If not why is it unsafe? Why would changing the PC first also be unsafe?

Aside: there is no reason to assume that "memory management" is turned "off" by loading the new processor status; in fact, in the CPUs in my experience that would not happen. But that is not relevant to this answer.
We're executing in user mode and a TRAP instruction is fetched. The program counter is then (let's say) pointing to the instruction after TRAP.
Now the processor executes the TRAP. It loads the new processor status, which switches the CPU to kernel mode. Assume this does not in itself inhibit device interrupts.
NOW... a device interrupts. The hardware or software mechanism saves the processor status (=kernel mode) and program counter (=the user-mode address of the instruction after TRAP). The device interrupt service routine does its thing and executes a return from interrupt to restore program counter and processor status. We can't resume "half-way through the TRAP instruction" - the only thing that can happen is that we start to execute the instruction that PC points to, i.e., we're executing the instruction after the TRAP but in kernel mode.
The exact problem depends on the system architecture:
If the kernel address map is a superset of the user address map (typical on OSes where user space is half the total address space) then we're executing user-provided code in kernel mode, which is at least a serious privilege problem, and may cause us to fail by page faulting when we can't handle it.
If the kernel address map doesn't include user space (frequently the case on systems with limited virtual address size) then this is equivalent to taking a wild jump into the kernel.
The summary is that you need both the processor status and program counter to define "where you are in execution", and they both need to be saved/updated together; or in other words, no change of control (such as an interrupt) can be permitted in the middle.
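The scenario above can be sketched as a toy model in C. This is only an illustration under stated assumptions: the struct cpu layout, the TRAP_VECTOR address and all function names are hypothetical and do not describe any real architecture.

```c
#include <stdbool.h>
#include <stdint.h>

enum mode { USER = 0, KERNEL = 1 };

/* Hypothetical address of the kernel's trap handler; everything below it
 * is treated as user code in this toy model. */
#define TRAP_VECTOR 0x80000000u

struct cpu {
    uint32_t pc;        /* program counter */
    enum mode psr;      /* only the mode bit of the status register is modelled */
    uint32_t saved_pc;  /* the "Save PC" register */
};

/* A device interrupt saves (psr, pc), runs its service routine,
 * and restores exactly the pair it saved. */
static void device_interrupt(struct cpu *c)
{
    enum mode psr = c->psr;
    uint32_t pc = c->pc;
    /* ... the interrupt service routine would run here ... */
    c->psr = psr;
    c->pc = pc;
}

/* Atomic TRAP: no interrupt can be taken between the three updates. */
static void trap_atomic(struct cpu *c)
{
    c->saved_pc = c->pc;
    c->psr = KERNEL;
    c->pc = TRAP_VECTOR;
}

/* Split TRAP: the status register changes first. If an interrupt arrives
 * before the PC is updated, the return from interrupt resumes at c->pc,
 * and the second half of the TRAP is never executed. */
static void trap_split(struct cpu *c, bool irq_between)
{
    c->psr = KERNEL;
    if (irq_between) {
        device_interrupt(c);
        return;
    }
    c->saved_pc = c->pc;
    c->pc = TRAP_VECTOR;
}

/* True when the CPU is about to execute user code with kernel privileges. */
static bool runs_user_code_in_kernel_mode(const struct cpu *c)
{
    return c->psr == KERNEL && c->pc < TRAP_VECTOR;
}
```

Starting from pc = 0x1004 in user mode, trap_atomic() ends at TRAP_VECTOR in kernel mode, while trap_split() with an interrupt in between ends at 0x1004 in kernel mode, which is the unsafe state described above.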

Related

X86 clear interrupt flag instruction `cli` not working in user space?

I try to stop interrupts from user space for a specific isolated core,
so I set CPU affinity:
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(2, &set);
assert(sched_setaffinity(getpid(),sizeof(set),&set)==0);
and use iopl(3) to execute the privileged instructions cli/sti in user space:
iopl(3);
__asm__("cli;");
// busy looping for a while
__asm__("sti;");
and there are two phenomena I can't explain:
1 cli doesn't actually stop interrupts (at least not all of them); an interrupt such as LOC (Local Timer Interrupt) still comes in every now and then;
I notice the latest kernel patches prevent cli in user space (reference), but this result can be reproduced in kernel 4.19.0.
2 AFAIK, cli only clears the interrupt flag of the CPU on which the program is running, but in practice my whole system is stuck, not responding to my mouse or keyboard.

(2): Many parts of the Linux kernel depend on communicating with other cores, including RCU, which relies on things like "for each core: run_on(core)" (https://lwn.net/Articles/262464/). Any kernel code doing that will get stuck when this core doesn't respond to the IPI that other cores send to ask the kernel on this core to switch to a certain task, or perhaps to do TLB shootdowns.
I don't know what exact thing would tend to lead to getting stuck, but I don't find it surprising at all that other parts of the kernel are waiting for something that depends on hearing back from the kernel on this core, and that blocks progress of something involved in getting keyboard/mouse events to an X server and to user-space. (Or even to a text console? That might have more hope, fewer layers of software.)
Or it's always possible that some keyboard or mouse interrupts get distributed to this core, and ignored.
As for (1): do you leave the NMI watchdog enabled, or other source of NMIs? That could get the kernel running temporarily in a state where (other?) interrupts are enabled.
I use kernel/nmi_watchdog = 0 in /etc/sysctl.d/99-local.conf to free up an extra perf counter, but the default is enabled.
(cli doesn't stop Non-Maskable Interrupts, as you might guess from the name.)
Other than that guess, I don't know why you'd still see occasional LOCal timer interrupts; maybe someone more familiar with modern x86 interrupts would know.
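One way to check whether LOC interrupts really land on the isolated core during the cli window is to sample /proc/interrupts before and after the busy loop and compare the counts. The sketch below is not from the original posts; the function names are hypothetical, and it assumes the usual /proc/interrupts layout of a label followed by one count per CPU.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

/* Parse one row of /proc/interrupts, e.g.
 *   " LOC:   100   200   300   Local timer interrupts"
 * and return the count in column `cpu`, or -1 if the row has a different
 * label or fewer numeric columns than requested. */
static long long irq_count_from_line(const char *line, const char *label, int cpu)
{
    while (isspace((unsigned char)*line))
        line++;

    size_t n = strlen(label);
    if (strncmp(line, label, n) != 0 || line[n] != ':')
        return -1;

    const char *p = line + n + 1;
    for (int col = 0; ; col++) {
        char *end;
        unsigned long long v = strtoull(p, &end, 10);
        if (end == p)
            return -1;          /* reached the description text */
        if (col == cpu)
            return (long long)v;
        p = end;
    }
}

/* Scan /proc/interrupts for `label` and return the count for `cpu`.
 * The fixed line buffer is an assumption that holds for small CPU counts. */
long long read_irq_count(const char *label, int cpu)
{
    char line[4096];
    long long result = -1;
    FILE *f = fopen("/proc/interrupts", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        result = irq_count_from_line(line, label, cpu);
        if (result >= 0)
            break;
    }
    fclose(f);
    return result;
}
```

Calling read_irq_count("LOC", 2) before the cli and again after the sti gives the number of local timer interrupts that CPU 2 handled in between; repeating this with "NMI" would show whether NMIs are arriving as well.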

Which part of Linux kernel enforces privilege separation and how?

I want to know how privilege separation is enforced by the kernel and the part of kernel that is responsible for this task.
For example, assume there are two processes running -- one at ring 0 and another at ring 3. How does the kernel keep track of the ring number of each process?
Edit: I know about ring numbers. My question is about the part of kernel (module or something) which performs checks on the processes to find out their privilege level. I believe there might be a component of kernel which would check the ring number of a process.

There is no concept of a ring number of a process.
The kernel is mapped in one area of memory, userspace is mapped in another. On boot the kernel specifies an address where the cpu has to jump to when the syscall instruction is executed. So someone does syscall, the cpu switches to ring0 and jumps to the address as instructed by the kernel. It is now executing kernel code. Then, on return, the cpu switches back to ring3 and resumes execution.
Similar story for other ways of entering the kernel like exceptions.
So, how does the Linux kernel enforce separation? It sets things up for userspace to execute in ring3. Anything triggering the cpu to switch to ring0 also makes the jump to an address configured by the kernel on boot. No code other than kernel code executes in ring0.

How an actual system call is made?

I have a question about how an actual system call is made. I know that the magic of system call (like read etc.) is done in C library but don’t understand the exact mechanism. My main issues are
The C library routine is in user address space; then how can it get the address of the interrupt service routines? Are interrupt service routines predefined (on boot up) in physical memory?
Even if somehow the ISR routine is called how does the address space change? I mean before we start the execution of ISR how will the 'page table base register' change to point to kernel's page table. If the 'C' routine does it then how does it know the address of Kernel's page table?
How are parameters copied from user space to kernel space?
Please excuse me if my questions are too basic but I am new to this. :)
Thanks
Rohit

On most systems, there's an instruction that can be executed by user code to invoke a user-defined interrupt (for example, int on x86 and swi on ARM will request a "software interrupt").
The CPU, executing in user mode, will switch to kernel mode upon seeing one of these instructions, and will jump to a predefined ISR location for that particular interrupt. The interrupt number is typically fixed, and the corresponding ISR is the system call handler for the kernel.
The kernel can inspect the user-mode registers and stack which were present at the time the interrupt was called (in a manner similar to saving all registers on the stack during a context switch), and obtain the system call arguments from there.
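From the user side, the C library wrapper does little more than place the system call number and arguments where the kernel expects them and execute the entry instruction. A minimal sketch on Linux, assuming glibc's generic syscall() function; the raw_getpid and raw_write names are hypothetical helpers for illustration:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for our PID without going through the getpid() wrapper.
 * syscall() loads the number and arguments into registers and executes
 * the architecture's kernel-entry instruction. */
static pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* write(2) through the same generic entry point; the kernel reads
 * fd, buf and len from the saved user-mode registers. */
static long raw_write(int fd, const void *buf, size_t len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

raw_getpid() should return the same value as getpid(), since both end up in the same kernel handler.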

OK, I think I found the answer (at least I think so) at
questions about kernel space
1. The C library routine is in user address space; then how can it get the address of the interrupt service routines? Are interrupt service routines predefined (on boot up) in physical memory?
The ISR location is predefined, as answered by nneonneo above.
2. Even if somehow the ISR routine is called, how does the address space change? I mean, before we start the execution of the ISR, how will the 'page table base register' change to point to the kernel's page table? If the 'C' routine does it, then how does it know the address of the kernel's page table?
There is no change in address space, as the kernel space is essentially the same address space as the user's (the only difference is the protection level).

What happens when the sysenter instruction is used in Linux?

I am studying about how CPU changes from user mode to kernel mode in linux. I came across two different methods: Interrupts and using sysenter.
I could not understand how sysenter works. Could someone please explain what exactly happens in the cpu when the sysenter instruction is run?

The problem that a program faces when it wants to get into the kernel (aka "making syscalls") is that user programs cannot access anything kernel-related, yet the program has to somehow switch the CPU into "kernel mode".
On an interrupt, this is done by the hardware.
It also happens automatically when a (CPU-, not C++) exception occurs, like accessing memory that doesn't exist, a division by zero, or invoking a privileged instruction in user code. Or trying to execute an unimplemented instruction. This last thing is actually a decent way to implement a "call the kernel" interface: the CPU runs into an instruction that it doesn't know, so it raises an exception which drops the CPU into kernel mode and into the kernel. The kernel code could then check whether the "correct" unimplemented instruction was used and perform the syscall stuff if it was, or just kill the process if it was any other unimplemented instruction.
Of course, doing something like this isn't, well, "clean". It's more like a dirty hack, abusing what should be an error to implement a perfectly valid control flow change. Hence, CPUs do tend to have actual instructions to do essentially the same thing, just in a more "defined" way. The main purpose of anything like a "sysenter" instruction is still the same: it changes the CPU into "kernel mode", saves the position where the "sysenter" was called, and continues execution somewhere in the kernel.
As for the difference between a "software interrupt" and "sysenter": "sysenter" is specifically optimized for this kind of use case. For example, it doesn't get the kernel address to call from memory like a (software-)interrupt does, but instead uses a special register to get the address from, which saves the memory address lookup. It might also have additional optimizations internally, based on the fact that software-interrupts might be handled more like interrupts, and the sysenter instruction doesn't actually need that. I don't know the precise details of the implementations of these instructions on the CPUs, you would probably have to read the Intel manuals to really get into such details.

How does copy_from_user from the Linux kernel work internally?

How exactly does the copy_from_user() function work internally? Does it use any buffers or is there any memory mapping done, considering the fact that kernel does have the privilege to access the user memory space?

The implementation of copy_from_user() is highly dependent on the architecture.
On x86 and x86-64, it simply does a direct read from the userspace address and write to the kernelspace address, while temporarily disabling SMAP (Supervisor Mode Access Prevention) if it is configured. The tricky part of it is that the copy_from_user() code is placed into a special region so that the page fault handler can recognise when a fault occurs within it. A memory protection fault that occurs in copy_from_user() doesn't kill the process like it would if it were triggered by any other process-context code, or panic the kernel like it would if it occurred in interrupt context - it simply resumes execution in a code path which returns -EFAULT to the caller.

regarding "how bout copy_to_user since the kernel is passing on the kernel space address,how can a user space process access it"
A user space process can attempt to access any address. However, if the address is not mapped in that process's user space (i.e. in the page tables of that process) or if there is a problem with the access like a write attempt to a read-only location, then a page fault is generated. Note that at least on x86, every process has all of kernel space mapped in the highest 1 gigabyte of that process's virtual address space, while the lower 3 gigabytes of the 4GB total address space (I'm using the classic 32-bit case here) are used for the process text (i.e. code) and data.
A copy to or from user space is executed by the kernel code that is executing on behalf of the process and actually it's the memory mapping (i.e. page tables) of that process that are in-use during the copy. This takes place while execution is in kernel mode - i.e. privileged/supervisor mode in x86 language.
Assuming the user-space code has passed a legitimate target location (i.e. an address properly mapped in that process's address space) to have data copied to, copy_to_user, run from kernel context, can write to that address/region without problems. After control returns to the user, user space can also read from this location, which was set up by the process itself to start with.
More interesting details can be found in chapters 9 and 10 of Understanding the Linux Kernel, 3rd Edition, by Daniel P. Bovet and Marco Cesati. In particular, access_ok() is a necessary but not sufficient validity check. The user can still pass addresses that do not belong to the process address space. In this case, a Page Fault exception will occur while the kernel code is executing the copy. The most interesting part is how the kernel page fault handler determines that the page fault in such a case is not due to a bug in the kernel code but rather a bad address from the user (especially if the kernel code in question is from a loaded kernel module).
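To make "necessary but not sufficient" concrete, here is a sketch of the kind of range check that access_ok() stands for. It is an illustration, not the kernel's code: user_range_ok is a hypothetical name, and USER_LIMIT assumes the classic 32-bit split with user space below 0xC0000000.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed boundary between user and kernel addresses (classic 32-bit
 * layout: 3 GB of user space below, 1 GB of kernel space above). */
#define USER_LIMIT 0xC0000000u

/* True if the whole range [addr, addr + size) lies below USER_LIMIT.
 * Written so that addr + size is never computed and therefore cannot
 * wrap around past 4 GB. */
static bool user_range_ok(uint32_t addr, uint32_t size)
{
    return size <= USER_LIMIT && addr <= USER_LIMIT - size;
}
```

A range that passes this check is only known to lie in the user part of the address space. Whether anything is actually mapped there is unknown until the copy touches it, which is why the page fault path described above is still needed.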

The best answer gets one thing wrong: copy_(from|to)_user can't be used in interrupt context, because they may sleep; they can only be used in process context.
The process's page table includes all the information that the kernel needs to access user memory, so the kernel could access a user-space address directly if we could be sure the addressed page is in memory. Use the copy_(from|to)_user functions instead, because they take care of this for us: if the addressed user-space page is not resident, the resulting fault is handled and the page is brought in.

copy_from_user() (a kernel function, not a system call) works with two buffers in different regions of the virtual address space:
The user-space buffer in user virtual address space.
The kernel-space buffer in kernel virtual address space.
When copy_from_user() is called, data is copied from the user buffer to the kernel buffer.
Part of a character device driver's write operation, where copy_from_user() is used, is given below:
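ssize_t cdev_fops_write(struct file *filp, const char __user *ubuf,
                        size_t count, loff_t *f_pos)
{
        unsigned int kbuf;      /* kernel-side destination with real storage */

        if (count < sizeof(kbuf))
                return -EINVAL;
        /* copy_from_user() returns the number of bytes it could NOT copy */
        if (copy_from_user(&kbuf, ubuf, sizeof(kbuf)))
                return -EFAULT;
        printk(KERN_INFO "Data: %u\n", kbuf);
        return sizeof(kbuf);    /* number of bytes consumed */
}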
