How does the Linux kernel switch between the user-mode and kernel-mode stacks when a system call or an interrupt occurs? I mean, what is the exact mechanism: what happens to the user-mode stack pointer, and where does the kernel-mode stack pointer come from? What is done by hardware, and what must be done by software?
Everything below is about x86.
I will describe the entire syscall path, which covers the requested information.
First of all, you need to understand what the interrupt descriptor table (IDT) is. This table stores the entry-point addresses of the exception and interrupt handlers. The legacy system call is delivered through the same table, as a software interrupt. To raise it, user code executes the
int x
assembly instruction. Each exception or interrupt, including the system call, has its own vector number. On x86 Linux the system call looks like
int 0x80
The int instruction is a complex multi-step instruction. Here is an explanation of what it does:
1.) It fetches the descriptor from the IDT (the IDT address is stored in a special register, IDTR) and checks that CPL <= DPL. CPL is the current privilege level, which can be read from the low two bits of the CS register. DPL is stored in the IDT descriptor.
As a consequence, you can't generate some exceptions (for example, a page fault) from user space directly with the int instruction. If you try, you will get a general protection exception.
2.) The processor switches to the stack defined in the TSS.
The TSS was initialized earlier and already contains the ESP0 and SS0 values, which hold the kernel stack address. So now ESP points to the kernel stack.
3.) The processor pushes the user-space registers onto the newly switched kernel stack: ss, esp, eflags, cs, eip. We need to return after the syscall is served, right?
4.) Next, the processor loads CS and EIP from the IDT descriptor. This address is the entry point of the exception handler.
5.) Now we are in the syscall handler in the kernel.
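The five steps above can be sketched as a toy Python model. This is only an illustration, not how any real CPU or kernel is implemented: the `Gate` and `Cpu` classes, the `software_interrupt` function, and all addresses are hypothetical, and the target privilege level is approximated by the low two bits of the handler's CS selector. The saved eip is assumed to already point past the int instruction.

```python
from dataclasses import dataclass, field


class GeneralProtectionFault(Exception):
    """Raised when user code uses `int n` on a gate it may not call."""


@dataclass
class Gate:
    handler_cs: int   # selector loaded into CS (step 4)
    handler_eip: int  # entry point loaded into EIP (step 4)
    dpl: int          # least-privileged level allowed to use `int n` here


@dataclass
class Cpu:
    cs: int
    eip: int
    ss: int
    esp: int
    eflags: int
    tss_ss0: int      # kernel stack segment stored in the TSS
    tss_esp0: int     # kernel stack pointer stored in the TSS
    memory: dict = field(default_factory=dict)  # address -> 32-bit word

    def push(self, value):
        self.esp -= 4
        self.memory[self.esp] = value


def software_interrupt(cpu, idt, vector):
    gate = idt[vector]
    cpl = cpu.cs & 3                       # CPL: low two bits of CS
    if cpl > gate.dpl:                     # step 1: require CPL <= DPL
        raise GeneralProtectionFault(vector)
    old_ss, old_esp = cpu.ss, cpu.esp
    if (gate.handler_cs & 3) < cpl:        # entering a more privileged level
        cpu.ss, cpu.esp = cpu.tss_ss0, cpu.tss_esp0   # step 2: stack from TSS
        cpu.push(old_ss)                   # step 3: user ss and esp
        cpu.push(old_esp)
    cpu.push(cpu.eflags)                   # step 3: eflags, cs, eip
    cpu.push(cpu.cs)
    cpu.push(cpu.eip)
    cpu.cs, cpu.eip = gate.handler_cs, gate.handler_eip  # step 4
```

With a gate for vector 0x80 whose DPL is 3 and a gate for vector 14 whose DPL is 0, calling `software_interrupt` from a ring-3 `Cpu` succeeds for 0x80 and raises `GeneralProtectionFault` for 14, which mirrors the remark about page faults in step 1.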
And a few words about ARM. ARM doesn't have a TSS; it has banked per-mode registers. So the SVC and USR modes have separate stack pointers. If you are interested, you can take a look at the trap entry code.
Interesting links:
MIT JOS lab 3,
XV6 manual
While reading the vector_swi() routine for the ARM Linux system call, I found that the r0-r12 registers are copied to the kernel stack (below is the code):
ENTRY(vector_swi)
#ifdef CONFIG_CPU_V7M
v7m_exception_entry
#else
sub sp, sp, #S_FRAME_SIZE
stmia sp, {r0 - r12} @ Calling r0 - r12
As per my understanding, during a system call ARM enters SVC mode, jumps to the vector_swi() routine, and begins execution. The sp register of SVC mode (sp_svc) points to the kernel stack, and the r0-r12 registers are copied there.
My question is: how is the sp (sp_svc) register set up?
How does it know the address of the kernel stack?
Is this kernel stack the same as the kernel stack of the process that made the system call?
On the arm32 architecture, sp (r13) is banked, which means there are physically separate registers for USR and SVC modes.
For each userspace thread, the corresponding kernel thread always exists, and has its stack allocated and the SVC mode r13 points there. On system call entry, the software-visible r13 is switched to the one for SVC mode, and the instructions you point to are executed after that.
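A toy Python model may make the banking easier to picture. It is a sketch only: the `Arm32Core` class and the addresses are made up for illustration, only the USR and SVC modes are modelled, and the value 72 for S_FRAME_SIZE is an assumption (18 saved 32-bit words).

```python
S_FRAME_SIZE = 72  # assumed size of the saved register frame, in bytes


class Arm32Core:
    """Toy model of the banked r13 (sp); only USR and SVC modes exist here."""

    def __init__(self, sp_usr, sp_svc):
        self.mode = "usr"
        self._banked_sp = {"usr": sp_usr, "svc": sp_svc}

    @property
    def sp(self):
        # The software-visible sp is whichever physical register
        # belongs to the current mode.
        return self._banked_sp[self.mode]

    @sp.setter
    def sp(self, value):
        self._banked_sp[self.mode] = value

    def svc(self):
        # System call entry: the mode changes, no stack pointer is copied.
        self.mode = "svc"

    def return_to_user(self):
        self.mode = "usr"


def vector_swi_prologue(core):
    # Mirrors `sub sp, sp, #S_FRAME_SIZE`: reserve room on the kernel stack.
    core.sp -= S_FRAME_SIZE
```

In this model the user-mode sp is never touched by the kernel-side `sub`; it simply becomes visible again once the core returns to USR mode.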
This is in reference to CVE-2018-8897 (which appears related to CVE-2018-1087), described as follows:
A statement in the System Programming Guide of the Intel 64 and IA-32 Architectures Software Developer's Manual (SDM) was mishandled in the development of some or all operating-system kernels, resulting in unexpected behavior for #DB exceptions that are deferred by MOV SS or POP SS, as demonstrated by (for example) privilege escalation in Windows, macOS, some Xen configurations, or FreeBSD, or a Linux kernel crash. The MOV to SS and POP SS instructions inhibit interrupts (including NMIs), data breakpoints, and single step trap exceptions until the instruction boundary following the next instruction (SDM Vol. 3A; section 6.8.3). (The inhibited data breakpoints are those on memory accessed by the MOV to SS or POP to SS instruction itself.) Note that debug exceptions are not inhibited by the interrupt enable (EFLAGS.IF) system flag (SDM Vol. 3A; section 2.3). If the instruction following the MOV to SS or POP to SS instruction is an instruction like SYSCALL, SYSENTER, INT 3, etc. that transfers control to the operating system at CPL < 3, the debug exception is delivered after the transfer to CPL < 3 is complete. OS kernels may not expect this order of events and may therefore experience unexpected behavior when it occurs.
When reading this related git commit to the Linux kernel, I noted that the commit message states:
x86/entry/64: Don't use IST entry for #BP stack
There's nothing IST-worthy about #BP/int3. We don't allow kprobes
in the small handful of places in the kernel that run at CPL0 with
an invalid stack, and 32-bit kernels have used normal interrupt
gates for #BP forever.
Furthermore, we don't allow kprobes in places that have usergs while
in kernel mode, so "paranoid" is also unnecessary.
In light of the vulnerability, I'm trying to understand the last sentence/paragraph in the commit message. I understand that an IST entry refers to one of the (allegedly) "known good" stack pointers in the Interrupt Stack Table that can be used to handle interrupts. I also understand that #BP refers to a breakpoint exception (equivalent to INT3), and that kprobes is the debugging mechanism that is claimed to only run in a few places in the kernel at ring 0 (CPL0) privilege level.
But I'm completely lost in the next part, which may be because "usergs" is a typo and I'm simply missing what was intended:
Furthermore, we don't allow kprobes in places that have usergs while
in kernel mode, so "paranoid" is also unnecessary.
What does this statement mean?
usergs refers to the state controlled by the x86-64 swapgs instruction, which exchanges the current GS base with a value saved in an MSR (IA32_KERNEL_GS_BASE) so that the kernel can find the kernel stack from a syscall entry point. Having "usergs" means the user-space GS base is still the active one. swapgs swaps only the cached gsbase segment info; it does not reload anything from the GDT based on the gs selector value itself. (wrgsbase can change the GS base independently of the GDT/LDT.)
AMD's design is that syscall doesn't change RSP to point to the kernel stack, and doesn't read/write any memory, so syscall itself can be fast. But then you enter the kernel with all registers holding their user-space values. See Why does Windows64 use a different calling convention from all other OSes on x86-64? for some links to mailing list discussions between kernel devs and AMD architects in ~2000, tweaking the design of syscall and swapgs to make it usable before any AMD64 CPUs were sold.
Apparently, keeping track of whether GS currently holds the kernel or the user value is tricky for error handling: there's no way to say "I want kernelgs now"; you have to know whether or not to run swapgs in any error-handling path. The only instruction is a swap, not one that sets GS to a specific one of the two values.
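To illustrate why a swap-only instruction is awkward, here is a minimal Python sketch. The `GsState` class, the `enter_handler` helper, and the base addresses are hypothetical; the model only captures the fact that swapgs exchanges two values and has no notion of which one is "the kernel one".

```python
class GsState:
    """Toy model: the active GS base plus the value parked in the MSR."""

    def __init__(self, user_base, kernel_base):
        self.gs_base = user_base           # base used by gs-relative accesses
        self.kernel_gs_base = kernel_base  # models IA32_KERNEL_GS_BASE

    def swapgs(self):
        # A pure exchange: the result depends entirely on the state before.
        self.gs_base, self.kernel_gs_base = self.kernel_gs_base, self.gs_base


def enter_handler(state, came_from_user):
    # The entry path must decide from other information (here a flag)
    # whether a swap is needed; the instruction itself cannot tell.
    if came_from_user:
        state.swapgs()
    return state.gs_base
```

If a handler that was entered from kernel mode is wrongly treated as coming from user mode, `enter_handler` performs a second swap and ends up with the user base active while running kernel code, which is the kind of mistake the "paranoid" entry paths are meant to avoid.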
Read comments in arch/x86/entry/entry_64.S e.g. https://github.com/torvalds/linux/blob/9fb71c2f230df44bdd237e9a4457849a3909017d/arch/x86/entry/entry_64.S#L1267 (from current Linux) which mentions usergs, and the next block of comments describes doing a swapgs before jumping to some error handling code with kernel gsbase.
IIRC, in the x86-64 Linux kernel, the kernel gs base points to a per-CPU data area. That area includes the current task's kernel stack pointer (as an absolute address, not relative to gs), which the syscall entry code loads into rsp.
I wouldn't be surprised if this bug basically tricks the kernel into loading the kernel rsp from a user-controlled gsbase, or otherwise screws up the dead-reckoning of swapgs so that it has the wrong gs at some point.
Before a program is executed, when is the esp register set to point to a valid address? During the call to exec? Or in user space itself? I've gone through the kernel code and can't seem to find it anywhere.
Background
x86 CPUs have two (actually four, one per privilege level) stacks per task: one for user mode and one for kernel mode.
When an interrupt occurs in user mode the CPU will set esp to the address of the kernel's stack (see "TSS" for more information) and push the original value of esp (location of the user mode's stack) to the (kernel's) stack. eip, cs and eflags are always pushed to the stack when an interrupt occurs.
When returning from the interrupt the iret instruction will pop the "old" register values from the (kernel's) stack and the stack pointer will point to the user's stack again.
A preemptive multi-tasking operating system typically works the following way:
Some task is running, which means that this task takes 100% of the CPU load for a very short amount of time. When a timer interrupt occurs, the register values of the currently running task are stored on the stack (by the CPU). The OS will push the values of all other registers and change esp to the kernel stack of another task (whose registers were saved there when an earlier timer interrupt happened). Then it pops the registers and performs an iret, so all registers contain the values of the other task and the other task is running.
In Linux (4.12.2), x86-32 this is done by the function __switch_to_asm in the assembly source "entry_32.S".
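The switching scheme described above can be sketched in a few lines of Python. This is a deliberately simplified model, not the code in entry_32.S: tasks are plain dicts, each kernel stack is a list of saved register frames, and the `timer_switch` function name is hypothetical.

```python
def timer_switch(cpu_regs, current, nxt):
    """Save the running task's registers and resume another task.

    cpu_regs: dict with the live register values (mutated in place)
    current:  task dict of the task being interrupted
    nxt:      task dict whose kernel stack holds a previously saved frame
    """
    # Interrupt entry plus the OS pushes: the whole register set of the
    # interrupted task ends up on that task's own kernel stack.
    current["kernel_stack"].append(dict(cpu_regs))
    # Switching esp to the other kernel stack, popping the registers and
    # executing iret: the other task's saved values become the live ones.
    cpu_regs.clear()
    cpu_regs.update(nxt["kernel_stack"].pop())
    return nxt
```

Calling `timer_switch` twice in a row, first from task A to task B and then back, restores exactly the register values A had when it was interrupted.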
Direct answer to your question
When a new task is created, the two stacks (user and kernel stack) are allocated for that task, and the initial register values to be popped on return from the interrupt are written to the kernel stack. This includes the initial value of esp for user mode.
Some timer interrupt later, the task is started for the first time (the same way an already running task is re-activated).
In (old versions of) Linux there are two system calls involved in starting a new program:
fork() will simply copy the kernel stack. fork() duplicates an existing task, so all register values (including esp) must be identical to those of the already existing task.
execve() will not allocate a new kernel stack (no new task is created, but another executable is run in the current task). execve() will allocate a new user stack and overwrite the esp value on the kernel stack. (Mark Plotnick's comment shows the position where this is done.)
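As a rough illustration of the difference, here is a Python sketch in which a task is a dict and its kernel stack is a list holding the register frame that the next return to user mode will pop. The function names `new_task`, `fork_task` and `execve_in_task` are hypothetical, and the model follows the simplified description above (real kernels do more, and differ in the details).

```python
import copy


def new_task(entry_eip, user_esp):
    # The initial frame is written to the kernel stack ahead of time, so
    # the first "return from interrupt" starts the task at entry_eip.
    return {"kernel_stack": [{"eip": entry_eip, "esp": user_esp}]}


def fork_task(parent):
    # fork(): a new kernel stack that is a copy of the parent's, so the
    # child resumes with the same register values, including esp.
    return {"kernel_stack": copy.deepcopy(parent["kernel_stack"])}


def execve_in_task(task, entry_eip, new_user_esp):
    # execve(): same task, same kernel stack; only the saved user-mode
    # eip and esp in the topmost frame are overwritten.
    frame = task["kernel_stack"][-1]
    frame["eip"] = entry_eip
    frame["esp"] = new_user_esp
```

Running `fork_task` followed by `execve_in_task` on the child leaves the parent's saved frame untouched, while the child's frame now points at the new program's entry point and its freshly allocated user stack.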
In 32 bit Intel architecture, the mmap2 system call has 6 parameters. The sixth parameter is stored in the ebp register. However, right before entering the kernel via sysenter, this happens (in linux-gate.so.1, the page of code mapped into user processes by the kernel):
push %ebp
movl %esp, %ebp
sysenter
This means that ebp should now have the stack pointer's contents in it instead of the sixth parameter. How does Linux get the parameter right?
That blog post you linked in comments has a link to Linus's post, which gave me the clue to the answer:
Which means that now the kernel can happily trash %ebp as part of the
sixth argument setup, since system call restarting will re-initialize
it to point to the user-level stack that we need in %ebp because
otherwise it gets totally lost.
I'm a disgusting pig, and proud of it to boot.
-- Linus Torvalds
It turns out sysenter is designed to require user-space to cooperate with the kernel in saving the return address and user-space stack pointer. (Upon entering the kernel, %esp will be the kernel stack.) It does way less stuff than int 0x80, which is why it's way faster.
After entry into the kernel, the kernel has user-space's %esp value in %ebp, which it needs anyway. It accesses the 6th param from the user-space stack memory, along with the return address for SYSEXIT. Immediately after entry, (%ebp) holds the 6th syscall param. (Matching the standard int 0x80 ABI where user-space puts the 6th parameter there directly.)
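The trick can be traced with a small Python model of the three-instruction stub quoted in the question. Everything here is a sketch under stated assumptions: registers are a dict, memory is a dict of 32-bit words, `KERNEL_ESP` is a made-up address standing in for whatever kernel stack pointer sysenter loads, and the function names are hypothetical.

```python
KERNEL_ESP = 0xC1000000  # hypothetical kernel stack pointer


def user_stub(regs, mem):
    # push %ebp: the 6th argument goes onto the user stack
    regs["esp"] -= 4
    mem[regs["esp"]] = regs["ebp"]
    # movl %esp, %ebp: ebp now carries the user stack pointer
    regs["ebp"] = regs["esp"]
    # sysenter: esp is replaced by the kernel stack pointer; the user esp
    # is not saved by the hardware and survives only in ebp
    regs["esp"] = KERNEL_ESP


def kernel_sixth_arg(regs, mem):
    # The kernel reads the 6th argument from user memory at (%ebp).
    return mem[regs["ebp"]]
```

After `user_stub` runs, `regs["ebp"]` no longer holds the sixth parameter itself, but the kernel can still recover it through `kernel_sixth_arg`, and it also knows where the user stack is.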
From Michael's comment: "Here's the 32-bit sysenter_target code: look at the part starting at line 417"
From Intel's instruction reference manual entry for SYSENTER (links in the x86 wiki):
The SYSENTER and SYSEXIT instructions are companion instructions, but
they do not constitute a call/return pair. When executing a SYSENTER
instruction, the processor does not save state information for the
user code (e.g., the instruction pointer), and neither the SYSENTER
nor the SYSEXIT instruction supports passing parameters on the stack.
To use the SYSENTER and SYSEXIT instructions as companion instructions
for transitions between privilege level 3 code and privilege level 0
operating system procedures, the following conventions must be
followed:
The segment descriptors for the privilege level 0 code and
stack segments and for the privilege level 3 code and stack segments
must be contiguous in a descriptor table. This convention allows the
processor to compute the segment selectors from the value entered in
the SYSENTER_CS_MSR MSR.
The fast system call “stub” routines
executed by user code (typically in shared libraries or DLLs) must
save the required return IP and processor state information if a
return to the calling procedure is required. Likewise, the operating
system or executive procedures called with SYSENTER instructions must
have access to and use this saved return and state information when
returning to the user code.
How can I retrieve the system call address from /proc/kcore? I can get the system call table address from the System.map file.
If you're using an x86-based machine, you can use the sidt instruction to get the interrupt descriptor table register and consequently the interrupt descriptor table itself. With that in hand, you can get the address of the system_call (or the ia32 equivalent for x86-64 compatibility) function invoked by the 0x80 system-call interrupt. Disassembling that interrupt handler and scanning for a specific indirect call instruction, you can extract the address within the call instruction. That address is your system call table (on x86) or the IA32 compatibility system call table on x86-64.
Getting the x86-64 native system call table is similar: instead of locating the interrupt table with sidt, read the processor's IA32_LSTAR MSR. The address at (high << 32 | low) is the system call dispatcher. Scan the memory as before and extract the sys_call_table address from the call instruction, but remember that the instruction encodes only the low 32 bits of the address, so you have to fill in the high 32 bits (all ones for a kernel address) yourself.
This glosses over a lot of even more technical information (like which bytes to search for) that you should understand before poking around in the kernel code. After a quick Google search I found the entire process documented (with example module code) here.
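The address arithmetic involved can be sketched in Python on a byte string, without touching a real kernel. The sketch assumes that the dispatcher contains an indirect call of the form `call *disp32(,%rax,8)`, encoded as the bytes ff 14 c5 followed by a little-endian 32-bit displacement; that holds only for some kernel versions and is an assumption here, as are the function names and sample addresses.

```python
import struct

# Assumed encoding of `call *disp32(,%rax,8)` on x86-64.
CALL_TABLE_X86_64 = b"\xff\x14\xc5"


def lstar_address(high, low):
    """Combine the two 32-bit halves read from the IA32_LSTAR MSR."""
    return (high << 32) | low


def find_table_address(code, pattern=CALL_TABLE_X86_64):
    """Return the table address encoded after `pattern`, or None."""
    idx = code.find(pattern)
    if idx < 0 or idx + len(pattern) + 4 > len(code):
        return None
    # The displacement is a signed 32-bit value that the CPU sign-extends.
    (disp,) = struct.unpack_from("<i", code, idx + len(pattern))
    # Masking a negative Python int to 64 bits yields the sign-extended
    # address, i.e. the high 32 bits become all ones.
    return disp & 0xFFFFFFFFFFFFFFFF
```

For example, the displacement bytes 40 01 60 81 decode to the full address 0xffffffff81600140 rather than 0x81600140.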
Good luck, and try not to blow yourself up!