User mode vs supervisor mode

I have a few questions on the user-mode and supervisor-mode on Unix-like machines.
What is the difference between user-mode and supervisor-mode? I know that user processes cannot access all of memory or hardware, or execute all instructions. Is there more to this?
What are the advantages of having different modes?
What are the steps involved when one switches from the user-mode to the supervisor mode?
When a system call is made by a user program, the mode has to change from user-mode to supervisor mode. I have read elsewhere that this is achieved on x86 machines by using an int 0x80. So how is a mode switch different from interrupt handling?
How is it different from a context-switch?
How are supervisor modes implemented in different architectures?
Any answers or pointers will be appreciated!

The CPU will not physically allow access to the areas which are determined as "privileged". Because this is enforced in hardware, it gives your operating system the capability to protect itself. Without this mechanism there would be no "security" in an operating system, as the most obscure piece of code could simply access kernel memory and read all the passwords for instance.
A switch from user mode to supervisor mode has a real cost, and a full context switch between processes is more expensive still: for protection, per-process state such as TLB entries must be flushed or invalidated (otherwise you might be able to access something that you weren't meant to).
As for a context switch, this inherently involves a switch to kernel mode to perform a task. When the scheduler's timer interrupt fires, the CPU switches into kernel mode, the scheduler selects the next task to execute, and the CPU switches back to user mode to resume that task.
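To put a rough number on that cost, here is a minimal sketch (my own example, with figures that vary widely by CPU, kernel version and mitigations such as KPTI) that times a trivial system call; syscall(SYS_getpid) is used rather than getpid() so the C library cannot answer from a cache:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void) {
        enum { N = 1000000 };
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++)
            syscall(SYS_getpid);          /* one user->kernel->user round trip */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.0f ns per getpid() system call\n", ns / N);
        return 0;
    }

Comparing the result against a plain function call in the same loop shows why syscall-heavy code is slow: the mode switch alone typically costs the equivalent of hundreds of instructions.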

Two concepts exist:
software user/kernel modes, which are switched from each other when performing a system call or a return from a system call,
hardware user/supervisor modes, which are switched from each other on interrupts.
Very little code is executed in HW supervisor mode: mainly low-level interrupt routines and the very beginning of startup. In such a design, even most of the SW kernel-mode code is executed in HW user mode.

Related

RISC-V: Handling multiple interrupts

Is it possible to give different priorities to different interrupts in machine mode? And unlike traps from different modes, how does the processor control nested traps within the same mode?
Is it possible to give different priorities to different interrupts in machine mode?
As far as I understand, the different interrupts in machine mode have fixed priorities, from high to low: external, software, timer, synchronous traps; see riscv-privileged-v1.10.pdf, end of section 3.1.14. Multiple external interrupts are prioritized by an interrupt controller such as the PLIC described in chapter 7.
Unlike traps from different modes, how does the processor control nested traps within the same mode?
By stacking the global interrupt enable for the interrupted mode, together with the previous privilege mode; see section 3.1.7.
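To make that stacking concrete, here is a hedged C model (a simulation of the CSR behaviour, not real hardware access) of what a machine-mode trap and a subsequent mret do to mstatus, using the field positions from the privileged spec (MIE is bit 3, MPIE bit 7, MPP bits 12:11):

    #include <stdint.h>
    #include <stdio.h>

    #define MIE       (1u << 3)        /* machine-mode global interrupt enable */
    #define MPIE      (1u << 7)        /* previous MIE, saved on trap entry    */
    #define MPP_SHIFT 11               /* previous privilege mode, 2 bits      */
    #define MPP_MASK  (3u << MPP_SHIFT)

    /* Trap entry: MIE is "stacked" into MPIE and then cleared, which is why a
       same-mode trap cannot nest unless the handler deliberately re-enables MIE. */
    static uint32_t trap_entry(uint32_t mstatus, unsigned prev_priv) {
        mstatus &= ~(MPIE | MPP_MASK);
        if (mstatus & MIE)
            mstatus |= MPIE;                          /* save the old enable bit */
        mstatus &= ~MIE;                              /* mask same-mode interrupts */
        mstatus |= (uint32_t)prev_priv << MPP_SHIFT;  /* remember where we came from */
        return mstatus;
    }

    /* mret: restore MIE from MPIE (the spec then sets MPIE to 1). */
    static uint32_t trap_return(uint32_t mstatus) {
        mstatus = (mstatus & ~MIE) | ((mstatus & MPIE) ? MIE : 0);
        mstatus |= MPIE;
        return mstatus;
    }

    int main(void) {
        uint32_t m = MIE;                 /* M-mode, interrupts enabled */
        m = trap_entry(m, 3);             /* trap taken from M-mode (privilege 3) */
        printf("in handler: MIE=%d MPIE=%d\n", !!(m & MIE), !!(m & MPIE));
        m = trap_return(m);
        printf("after mret: MIE=%d MPIE=%d\n", !!(m & MIE), !!(m & MPIE));
        return 0;
    }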
There are many subtleties to consider, but I hope the above broad answers lead in the right direction.

Does the kernel only execute on occurrence of an exception

I'm learning about embedded Linux. I can't seem to find proper answers for my questions below.
My understanding is that while a user-space application is executing, if we want to perform IO, for example, a system call is made. This causes a SW interrupt, which generally makes the MCU switch from non-privileged mode to privileged mode, and the kernel performs the IO on behalf of the application.
Similarly, when a hardware interrupt occurs, I'm guessing this will cause the modes to switch again and execute an interrupt handler within the kernel.
What's not clear to me is, are these the only times when the kernel code gets control of the CPU?
With only one core, for example, if user application code is running, shouldn't the kernel be getting control of the CPU from time to time to check things, regardless of whether an interrupt has occurred? Perhaps there is a periodic timer interrupt allowing this?
Also, if we have multiple cores, could the kernel just be running all the time on one core while user applications on another?
Read Operating Systems: Three Easy Pieces since an entire book is needed to answer your questions. Later, study the source code of the kernel, with the help of https://kernelnewbies.org/
Interrupts happen really often (perhaps hundreds, or even thousands, per second). Try cat /proc/interrupts (see proc(5)) a few times in a terminal.
the kernel will perform the IO on behalf of the application.
Not always immediately. If you read a file, its content could be in the page cache (in which case no physical IO is needed). If disk access (or networking) is required, the kernel will schedule (read about preemptive scheduling) some IO to happen and context switch to other runnable tasks. Much later, after several interrupts have been handled (some of which may be triggered by the physical devices involved in your IO), your process could finally, perhaps many milliseconds later, return in user space from the read(2) system call and be running again. During that delay, other processes have been running.
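A minimal sketch of that flow, with a pipe standing in for a slow device: the parent blocks inside read(2), the scheduler runs other tasks (here, the child), and the parent resumes only once data arrives:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        int fds[2];
        char buf[32];

        if (pipe(fds) != 0)
            return 1;

        if (fork() == 0) {          /* child: runs while the parent is blocked */
            sleep(1);               /* pretend to be a slow device */
            write(fds[1], "data ready", 11);
            _exit(0);
        }

        /* parent: read(2) blocks; the kernel switches to other runnable tasks
           and wakes us up only when the "IO" completes */
        ssize_t n = read(fds[0], buf, sizeof buf);
        printf("woke up with %zd bytes: %s\n", n, buf);
        wait(NULL);
        return 0;
    }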
Also, if we have multiple cores, could the kernel just be running all the time on one core while user applications on another?
It depends a lot (even on the kernel version). Probably the kernel is not running on the same core. YMMV.
What's not clear to me is, are these the only times when the kernel code gets control of the CPU?
Essentially, yes: the kernel cannot wrest the CPU from running user code on its own. But the kernel programs timer hardware to generate an interrupt at a fixed period, and it uses that interrupt to implement task scheduling.
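One way to watch that periodic timer from userspace is to snapshot the local-timer line of /proc/interrupts twice; this sketch assumes the x86 label "LOC:" (other architectures use different names, e.g. lines mentioning arch_timer on ARM):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Print the local APIC timer line of /proc/interrupts. */
    static void show_timer_line(void) {
        FILE *f = fopen("/proc/interrupts", "r");
        char line[1024];
        if (!f)
            return;
        while (fgets(line, sizeof line, f))
            if (strstr(line, "LOC:"))
                fputs(line, stdout);
        fclose(f);
    }

    int main(void) {
        show_timer_line();
        sleep(1);
        show_timer_line();   /* the per-CPU counters will have advanced */
        return 0;
    }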
Also, if we have multiple cores, could the kernel just be running all the time on one core while user applications on another?
You can think of a multi-core system as multiple machines that share memory and are able to send interrupts to each other.

Under what circumstances does control pass from userspace to the Linux kernel space?

I'm trying to understand which events can cause a transition from userspace to the Linux kernel. If it's relevant, the scope of this question can be limited to the x86/x86_64 architecture.
Here are some sources of transitions that I'm aware of:
System calls (which include accessing devices) cause a context switch from userspace to kernel space.
Interrupts will cause a context switch. As far as I know, this also includes scheduler preemptions, since a scheduler usually relies on a timer interrupt to do its work.
Signals. It seems like at least some signals are implemented using interrupts but I don't know if some are implemented differently so I'm listing them separately.
I'm asking two things here:
Am I missing any userspace->kernel path?
What are the various code paths that are involved in these context switches?
One you are missing: Exceptions
(which can be further broken down into faults, traps and aborts)
For example a page fault, breakpoint, division by zero or floating-point exception. Technically, one can view exceptions as interrupts but not really the way you have defined an interrupt in your question.
You can find a list of x86 exceptions at this osdev webpage.
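As a small illustration of the fault path: an access to an unmapped page raises a CPU exception, the kernel's page-fault handler decides the access is invalid, and the process receives SIGSEGV. A hedged Linux sketch:

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static void on_segv(int sig, siginfo_t *info, void *ctx) {
        (void)sig; (void)info; (void)ctx;
        /* only async-signal-safe calls in here, hence write(2) */
        static const char msg[] = "caught SIGSEGV: page fault -> kernel -> signal\n";
        write(STDERR_FILENO, msg, sizeof msg - 1);
        _exit(1);
    }

    int main(void) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_segv;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        volatile int *bad = NULL;
        *bad = 42;           /* unmapped access: exception -> kernel -> SIGSEGV */
        return 0;            /* never reached */
    }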
With regard to your second question:
What are the various code paths that are involved in these context switches?
That really depends on the architecture and OS; you will need to be more specific. For x86, when an interrupt occurs you go through the IDT entry, and for SYSENTER you go to the address specified in the MSR. What happens after that is completely up to the OS.
No one wrote a complete answer so I will try to incorporate the comments and partial answers into an answer. Feel free to comment or edit the answer to improve it.
For the purposes of this question and answer, userspace to kernel transitions mean a change in processor state that allows access to kernel code and memory. In short, I will refer to these transitions as context switches.
When discussing events that can trigger userspace to kernel transitions, it is important to separate the OS constructs that we are used to (signals, system calls, scheduling), which require context switches, from the way these constructs are implemented, using context switches.
In x86, there are two central ways for context switches to occur: interrupts and SYSENTER. Interrupts are a processor feature, which causes a context switch when certain events happen:
Hardware devices may request an interrupt, for example, a timer/clock can cause an interrupt when a certain amount of time has elapsed. A keyboard can interrupt when keys are pressed. It's also called a hardware interrupt.
Userspace can initiate an interrupt. For example, the old way to perform a system call in Linux on x86 was to execute INT 0x80 with arguments passed through the registers (see the sketch after this list). Debugging breakpoints are also implemented using interrupts, with the debugger replacing an instruction with INT 0x3. This type of interrupt is called a software interrupt.
The CPU itself generates interrupts in certain situations, such as when memory is accessed without permission, when a program divides by zero, or when one core must notify another core that it needs to do something. This type of interrupt is called an exception, and you can read more about them in @esm's answer.
For a broader discussion of interrupts see here: http://wiki.osdev.org/Interrupt
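Here is the legacy INT 0x80 path from the list above as a minimal sketch; it assumes a 32-bit x86 Linux build (e.g. gcc -m32) and the i386 syscall number 20 for getpid:

    #include <stdio.h>

    int main(void) {
        long pid;
        /* i386 ABI: syscall number in EAX (20 = __NR_getpid), result in EAX */
        __asm__ volatile ("int $0x80" : "=a"(pid) : "a"(20));
        printf("getpid via int 0x80: %ld\n", pid);
        return 0;
    }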
SYSENTER is an instruction that provides the modern, faster path to cause a context switch for the particular case of performing a system call (on x86-64, the SYSCALL instruction plays this role for 64-bit code).
The code that handles the context switching due to interrupts or SYSENTER in Linux can be found in arch/x86/kernel/entry_{32|64}.S.
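For comparison, a minimal sketch of the 64-bit fast path: 64-bit code on x86-64 Linux enters the kernel via the SYSCALL instruction (the entry address lives in an MSR, as with SYSENTER), and getpid is syscall number 39 in the x86-64 ABI:

    #include <stdio.h>

    int main(void) {
        long pid;
        /* x86-64 ABI: number in RAX (39 = __NR_getpid); the instruction
           clobbers RCX and R11 */
        __asm__ volatile ("syscall" : "=a"(pid) : "a"(39) : "rcx", "r11", "memory");
        printf("getpid via syscall: %ld\n", pid);
        return 0;
    }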
There are many situations in which a higher-level Linux construct might cause a context switch. Here are a few examples:
If a system call reaches the int 0x80 or sysenter instruction, a context switch occurs. Some system call routines can obtain the information the system call was meant to get using only userspace state (the vDSO works this way, for instance); in that case, no context switch occurs.
Many times scheduling doesn't require an interrupt: a thread performs a system call, and the return from the syscall is delayed until the thread is scheduled again. For processes that are in a section where no syscalls are performed, Linux relies on timer interrupts to regain control.
A virtual memory access to a location that was paged out (or never mapped in) causes a page fault, and therefore a context switch; see the sketch after this list.
Signals are usually delivered when a process is already "switched out" (see comments by @caf on the question), but sometimes an inter-processor interrupt is used to deliver a signal to a process running on another core.
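The page-fault item above is easy to observe from userspace: the first touch of each freshly mmap'd anonymous page traps into the kernel as a minor fault. A sketch using getrusage:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    int main(void) {
        const size_t len = 64 * 1024 * 1024;   /* 64 MiB of anonymous memory */
        struct rusage before, after;

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        getrusage(RUSAGE_SELF, &before);
        memset(p, 1, len);        /* first touch of each page enters the kernel */
        getrusage(RUSAGE_SELF, &after);

        printf("minor page faults while touching: %ld\n",
               after.ru_minflt - before.ru_minflt);
        munmap(p, len);
        return 0;
    }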

Linux Process Context and SVC call in ARM

As per some Linux books
kernel code that services system calls issued by user applications runs on behalf of the corresponding application process and is said to be executing in process context. Interrupt handlers run in interrupt context.
Now, SVC and IRQ are two exceptions.
So when Linux is handling an SVC it is in process context, and while it is handling an IRQ it is in interrupt context. Is that how it is mapped?
Just one edit to this:
It is also mentioned in books that tasklets/softirqs run in interrupt context while workqueues run in process context. So does that mean a tasklet would run with CPSR.mode = IRQ?
If I understand your confusion correctly:
Since Linux is a capable, preemptive, complex operating system, it handles concepts such as interrupts and software traps in a much finer-grained way than bare-metal software does.
For example, when a supervisor call (svc) happens, the hardware switches to SVC mode. Linux handles this by doing little more than preparing some data structures for further handling, then quits SVC mode so the core can continue serving in user mode; this makes it possible to take many more exceptions instead of blocking them.
It is the same for IRQ mode: Linux handles the bare minimum in IRQ mode. It records which IRQ happened, which handler should be invoked, etc., then exits IRQ mode immediately to allow more to happen on that core. Later on, some other internal kernel thread may process that interrupt further. Since the hardware, while being relatively simple, runs really fast, the handling of interrupts proceeds in parallel with many processes.
The downside of this advanced approach is that it gives no guarantees on response time, and its overhead becomes visible on slower hardware such as MCUs.
So ARM's exception modes provide two things for Linux: message type and priority, backed by hardware support.
Message type is what an exception mode conveys: whether it was an SVC, IRQ, FIQ, data abort, undefined instruction, etc. So when the hardware enters an exception mode, Linux implicitly knows what it is handling.
Priority is about providing capable and responsive hardware; for example, the system should be able to acknowledge an interrupt while handling some less important supervisor call.
Hardware support makes handling the above two things easier and faster. For example, some registers are banked, and there is an extra system mode to make handling reentrant IRQs easier.
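To connect this back to the tasklet/workqueue edit in the question: a tasklet does not run with CPSR.mode = IRQ. By the time softirqs run, Linux has already left IRQ mode, yet the tasklet is still in (softirq) interrupt context and must not sleep, whereas a work item runs in process context in a kworker thread and may sleep. A hedged module sketch (assumes the tasklet_setup API of kernels 5.9 and later):

    #include <linux/module.h>
    #include <linux/interrupt.h>
    #include <linux/workqueue.h>
    #include <linux/delay.h>

    static struct tasklet_struct demo_tasklet;
    static struct work_struct demo_work;

    static void demo_tasklet_fn(struct tasklet_struct *t)
    {
        /* softirq context: in_interrupt() is true, sleeping is forbidden */
        pr_info("tasklet: in_interrupt()=%s\n", in_interrupt() ? "yes" : "no");
    }

    static void demo_work_fn(struct work_struct *w)
    {
        /* process context (a kworker thread): sleeping is allowed */
        pr_info("work: in_interrupt()=%s\n", in_interrupt() ? "yes" : "no");
        msleep(10);   /* fine here; calling this from the tasklet would be a bug */
    }

    static int __init demo_init(void)
    {
        tasklet_setup(&demo_tasklet, demo_tasklet_fn);
        INIT_WORK(&demo_work, demo_work_fn);
        tasklet_schedule(&demo_tasklet);
        schedule_work(&demo_work);
        return 0;
    }

    static void __exit demo_exit(void)
    {
        tasklet_kill(&demo_tasklet);
        flush_work(&demo_work);
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");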

Context switching kernel processes in Linux

Consider the process keventd. It spends all its lifetime in kernel mode.
Now, as far as I know, Linux checks whether a context switch is due while a process is switching from kernel mode to user mode, and as far as I know, keventd will never switch from kernel mode to user mode. So how will the Linux kernel know when to switch it out?
If the kernel were to do as you say, and only check whether a process is due to be switched out on an explicit user-to-kernel-mode transition, then the following loop would lock up a core of your computer:
while (1);
Obviously, this does not happen on normal desktop operating systems. The reason is preemption: after a process has run for its time slice, the kernel receives a timer interrupt, steps in, and forcibly switches contexts as necessary.
Preemption could in principle work for kernel processes too. However, I'm not sure that's what keventd does; it's more likely that it voluntarily relinquishes its time slice on a regular basis (see sched_yield, a userspace call with the same effect), especially since the kernel can be configured to be non-preemptible. That is a kernel process's prerogative.
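To see preemption beat the while (1); loop in practice, a sketch (pinning both processes to CPU 0 is an assumption; pick any core that exists): two busy loops share one CPU, and the timer-driven scheduler forces involuntary context switches on each of them:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                        /* assumption: CPU 0 is usable */
        sched_setaffinity(0, sizeof set, &set);  /* the child inherits this */

        pid_t child = fork();
        if (child == 0)
            for (;;) ;                           /* the would-be "lockup" loop */

        for (volatile unsigned long i = 0; i < 1000000000UL; i++)
            ;                                    /* busy work competing with it */

        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        printf("involuntary context switches: %ld\n", ru.ru_nivcsw);

        kill(child, SIGKILL);
        wait(NULL);
        return 0;
    }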
