How does Linux extract the sixth parameter of a syscall?

On the 32-bit Intel architecture, the mmap2 system call has 6 parameters. The sixth parameter is stored in the ebp register. However, right before entering the kernel via sysenter, this happens (in linux-gate.so.1, the page of code mapped into user processes by the kernel):
push %ebp
movl %esp, %ebp
sysenter
This means that ebp should now have the stack pointer's contents in it instead of the sixth parameter. How does Linux get the parameter right?

That blog post you linked in comments has a link to Linus's post, which gave me the clue to the answer:
Which means that now the kernel can happily trash %ebp as part of the
sixth argument setup, since system call restarting will re-initialize
it to point to the user-level stack that we need in %ebp because
otherwise it gets totally lost.
I'm a disgusting pig, and proud of it to boot.
-- Linus Torvalds
It turns out sysenter is designed to require user-space to cooperate with the kernel in saving the return address and user-space stack pointer. (Upon entering the kernel, %esp will be the kernel stack.) It does way less stuff than int 0x80, which is why it's way faster.
After entry into the kernel, the kernel has user-space's %esp value in %ebp, which it needs anyway. It accesses the 6th param from the user-space stack memory, along with the return address for SYSEXIT. Immediately after entry, (%ebp) holds the 6th syscall param. (Matching the standard int 0x80 ABI where user-space puts the 6th parameter there directly.)
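To make the hand-off concrete, here is a toy model of the three-instruction sequence from the question. It is only a sketch: the register values, the dict-based "memory" and the function names are made up for illustration, and it does not model the real kernel entry code.

```python
# Toy model of "push %ebp; movl %esp, %ebp; sysenter".
# Memory is a dict of 4-byte slots; nothing here touches real registers.

def sysenter_stub(regs, mem):
    """User side: spill the sixth argument and hand the stack pointer over in ebp."""
    regs = dict(regs)
    mem = dict(mem)
    regs["esp"] -= 4                 # push %ebp: make room on the user stack...
    mem[regs["esp"]] = regs["ebp"]   # ...and store the sixth argument there
    regs["ebp"] = regs["esp"]        # movl %esp, %ebp: ebp now holds the user stack pointer
    return regs, mem

def kernel_sixth_arg(regs, mem):
    """Kernel side: fetch the sixth argument from user memory at (%ebp)."""
    return mem[regs["ebp"]]

# Hypothetical user state: 0x1234 stands in for the sixth syscall argument.
user_regs = {"esp": 0xBFFF0000, "ebp": 0x1234}
regs_after, mem_after = sysenter_stub(user_regs, {})

print(hex(regs_after["ebp"]))                        # the user stack pointer, not the argument
print(hex(kernel_sixth_arg(regs_after, mem_after)))  # the argument, recovered via (%ebp)
```

The point of the model is that nothing is lost: ebp no longer holds the argument, but it points at the slot that does.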
From Michael's comment: "Here's the 32-bit sysenter_target code: look at the part starting at line 417"
From Intel's instruction reference manual entry for SYSENTER (links in the x86 wiki):
The SYSENTER and SYSEXIT instructions are companion instructions, but
they do not constitute a call/return pair. When executing a SYSENTER
instruction, the processor does not save state information for the
user code (e.g., the instruction pointer), and neither the SYSENTER
nor the SYSEXIT instruction supports passing parameters on the stack.
To use the SYSENTER and SYSEXIT instructions as companion instructions
for transitions between privilege level 3 code and privilege level 0
operating system procedures, the following conventions must be
followed:
The segment descriptors for the privilege level 0 code and
stack segments and for the privilege level 3 code and stack segments
must be contiguous in a descriptor table. This convention allows the
processor to compute the segment selectors from the value entered in
the SYSENTER_CS_MSR MSR.
The fast system call “stub” routines
executed by user code (typically in shared libraries or DLLs) must
save the required return IP and processor state information if a
return to the calling procedure is required. Likewise, the operating
system or executive procedures called with SYSENTER instructions must
have access to and use this saved return and state information when
returning to the user code.

Related

Explain Linux commit message that patches/secures POP SS followed by a #BP interrupt (INT3)

This is in reference to CVE-2018-8897 (which appears related to CVE-2018-1087), described as follows:
A statement in the System Programming Guide of the Intel 64 and IA-32 Architectures Software Developer's Manual (SDM) was mishandled in the development of some or all operating-system kernels, resulting in unexpected behavior for #DB exceptions that are deferred by MOV SS or POP SS, as demonstrated by (for example) privilege escalation in Windows, macOS, some Xen configurations, or FreeBSD, or a Linux kernel crash. The MOV to SS and POP SS instructions inhibit interrupts (including NMIs), data breakpoints, and single step trap exceptions until the instruction boundary following the next instruction (SDM Vol. 3A; section 6.8.3). (The inhibited data breakpoints are those on memory accessed by the MOV to SS or POP to SS instruction itself.) Note that debug exceptions are not inhibited by the interrupt enable (EFLAGS.IF) system flag (SDM Vol. 3A; section 2.3). If the instruction following the MOV to SS or POP to SS instruction is an instruction like SYSCALL, SYSENTER, INT 3, etc. that transfers control to the operating system at CPL < 3, the debug exception is delivered after the transfer to CPL < 3 is complete. OS kernels may not expect this order of events and may therefore experience unexpected behavior when it occurs.
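The ordering problem in that description can be sketched with a tiny model. This is a simplification under loud assumptions: every SS load is assumed to hit an armed data breakpoint, the instruction names are plain strings, and the function name is invented for illustration.

```python
# Toy model of a #DB deferred by MOV SS / POP SS.
# Assumption: a data breakpoint is armed on the memory each SS load reads.

SS_LOADS = {"mov ss", "pop ss"}
KERNEL_ENTRIES = {"int3", "syscall", "sysenter"}

def db_delivery_cpls(instructions):
    """Return the CPL at which each deferred #DB is finally delivered."""
    cpl = 3                  # start in user mode
    pending_db = False       # a #DB held back by the previous instruction
    delivered = []
    for insn in instructions:
        deliver_after = pending_db
        pending_db = False
        if insn in KERNEL_ENTRIES:
            cpl = 0          # control has transferred to the kernel
        if insn in SS_LOADS:
            pending_db = True  # breakpoint hit, but delivery waits one instruction
        if deliver_after:
            delivered.append(cpl)  # delivered at the boundary after this instruction
    return delivered

print(db_delivery_cpls(["pop ss", "nop"]))   # the ordinary case
print(db_delivery_cpls(["pop ss", "int3"]))  # the case the CVE is about
```

In the second call the debug exception arrives after the transfer to ring 0 has completed, which is the order of events the description says kernels did not expect.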
When reading this related git commit to the Linux kernel, I noted that the commit message states:
x86/entry/64: Don't use IST entry for #BP stack
There's nothing IST-worthy about #BP/int3. We don't allow kprobes
in the small handful of places in the kernel that run at CPL0 with
an invalid stack, and 32-bit kernels have used normal interrupt
gates for #BP forever.
Furthermore, we don't allow kprobes in places that have usergs while
in kernel mode, so "paranoid" is also unnecessary.
In light of the vulnerability, I'm trying to understand the last sentence/paragraph in the commit message. I understand that an IST entry refers to one of the (allegedly) "known good" stack pointers in the Interrupt Stack Table that can be used to handle interrupts. I also understand that #BP refers to a breakpoint exception (equivalent to INT3), and that kprobes is the debugging mechanism that is claimed to only run in a few places in the kernel at ring 0 (CPL0) privilege level.
But I'm completely lost in the next part, which may be because "usergs" is a typo and I'm simply missing what was intended:
Furthermore, we don't allow kprobes in places that have usergs while
in kernel mode, so "paranoid" is also unnecessary.
What does this statement mean?
usergs refers to the state in which GS still holds the user-space base, i.e. before the kernel has run the x86-64 swapgs instruction. swapgs exchanges the current GS base with a saved value held in the IA32_KERNEL_GS_BASE MSR, so that the kernel can find its own data (and from there the kernel stack) at a syscall entry point. It swaps the cached GS base directly rather than reloading it from the GDT based on the gs selector value. (wrgsbase can change the GS base independently of the GDT/LDT.)
AMD's design is that syscall doesn't change RSP to point to the kernel stack, and doesn't read/write any memory, so syscall itself can be fast. But then you enter the kernel with all registers holding their user-space values. See Why does Windows64 use a different calling convention from all other OSes on x86-64? for some links to mailing list discussions between kernel devs and AMD architects in ~2000, tweaking the design of syscall and swapgs to make it usable before any AMD64 CPUs were sold.
Apparently keeping track of whether GS currently holds the kernel or the user value is tricky for error handling: there's no way to say "I want kernelgs now"; you have to know whether or not to run swapgs in every error-handling path. The only instruction available is a swap, not one that sets GS to one value or the other.
Read comments in arch/x86/entry/entry_64.S e.g. https://github.com/torvalds/linux/blob/9fb71c2f230df44bdd237e9a4457849a3909017d/arch/x86/entry/entry_64.S#L1267 (from current Linux) which mentions usergs, and the next block of comments describes doing a swapgs before jumping to some error handling code with kernel gsbase.
IIRC, the Linux kernel [gs:0] holds a thread info block, at the lowest addresses of the kernel stack for that thread. The block includes the kernel stack pointer (as an absolute address, not relative to gs).
I wouldn't be surprised if this bug is basically tricking the kernel into loading the kernel rsp from a user-controlled gsbase, or otherwise screwing up the dead reckoning of swapgs so that it has the wrong gs at some point.
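The "swap only, never set" problem is easy to see in a small model. This is a sketch, not kernel code: the class, the field names and the two base addresses are invented for illustration.

```python
# Toy model of swapgs: the only operation is an exchange, never "set to kernel".

class Cpu:
    def __init__(self, user_gs_base, kernel_gs_base):
        self.gs_base = user_gs_base            # active GS base (user value in user mode)
        self.kernel_gs_base = kernel_gs_base   # models the IA32_KERNEL_GS_BASE MSR

    def swapgs(self):
        # Exchange the active base with the saved one.
        self.gs_base, self.kernel_gs_base = self.kernel_gs_base, self.gs_base

USER_BASE = 0x00007F0000001000      # hypothetical user-controlled value
KERNEL_BASE = 0xFFFF888000000000    # hypothetical kernel per-CPU base

cpu = Cpu(USER_BASE, KERNEL_BASE)

cpu.swapgs()                        # normal entry from user mode
after_entry = cpu.gs_base
print(hex(after_entry))             # the kernel base, as intended

cpu.swapgs()                        # a nested path that wrongly assumes it came from user mode
after_nested = cpu.gs_base
print(hex(after_nested))            # back to the user base while still in kernel mode
```

An entry path that guesses wrong about where it came from ends up running kernel code with a user-controlled GS base, which is why the entry code has to track this so carefully.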

How does linux kernel switch between user-mode and kernel-mode stack?

How does linux kernel switch between user-mode and kernel-mode stack when a system call or an interrupt appears? I mean what is the exact mechanism - what happens to user-mode stack pointer and where does kernel-mode stack pointer come from? What is done by hardware and what must be done by software?
Everything below is about x86.
I will describe the entire syscall path, and that will cover the requested information.
First of all, you need to understand what the interrupt descriptor table (IDT) is. This table stores the addresses of the exception/interrupt handlers. A system call is delivered as a software-generated exception. To raise it, user code executes the
int x
assembly instruction. Each exception, including the system call, has its own number. On x86 Linux this looks like
int 0x80
The int instruction is a complex multi-step instruction. Here is an explanation of what it does:
1.) It extracts the descriptor from the IDT (the IDT address is stored in a special register) and checks that CPL <= DPL. CPL is the current privilege level, which can be read from the CS register. DPL is stored in the IDT descriptor.
As a consequence, you can't generate some exceptions (e.g. a page fault) directly from user space with the int instruction. If you try, you will get a general protection exception.
2.) The processor switches to the stack defined in the TSS.
The TSS was initialized earlier and already contains the ESP and SS values that hold the kernel stack address. So now ESP points to the kernel stack.
3.) The processor pushes the user-space registers onto the newly switched kernel stack: ss, esp, eflags, cs, eip. We need to return after the syscall is served, right?
4.) Next, the processor sets CS and EIP from the IDT descriptor. This address defines the exception vector entry point.
5.) Here we are in the syscall exception vector in the kernel.
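The five steps above can be sketched as a small model. It is a simplification under stated assumptions: the selector, offset and stack values are illustrative, the caller is assumed to be at CPL 3, and changes to EFLAGS are not modeled.

```python
# Toy model of "int n" issued from user mode (steps 1 to 5 above).

def software_int(vector, cpu, idt, tss):
    gate = idt[vector]
    # Step 1: privilege check, CPL must be <= the gate's DPL.
    if cpu["cpl"] > gate["dpl"]:
        raise PermissionError(f"#GP: int {vector:#x} not allowed from CPL {cpu['cpl']}")
    # Step 3 (contents): the user-mode state that gets saved, in push order.
    frame = [cpu["ss"], cpu["esp"], cpu["eflags"], cpu["cs"], cpu["eip"]]
    kernel = dict(cpu)
    # Step 2: switch to the kernel stack taken from the TSS.
    kernel["ss"] = tss["ss0"]
    kernel["esp"] = tss["esp0"] - 4 * len(frame)   # five 4-byte pushes
    # Step 4: load CS:EIP from the IDT descriptor.
    kernel["cs"] = gate["selector"]
    kernel["eip"] = gate["offset"]
    kernel["cpl"] = 0
    # Step 5: execution continues at the handler.
    return kernel, frame

# All values below are hypothetical.
idt = {
    0x80: {"dpl": 3, "selector": 0x60, "offset": 0xC0100000},  # reachable from user mode
    14:   {"dpl": 0, "selector": 0x60, "offset": 0xC0100400},  # page fault gate
}
tss = {"ss0": 0x68, "esp0": 0xC8000000}
user_cpu = {"cpl": 3, "ss": 0x7B, "esp": 0xBFFF0000, "eflags": 0x202,
            "cs": 0x73, "eip": 0x08048123}

kernel_cpu, saved = software_int(0x80, user_cpu, idt, tss)
print(hex(kernel_cpu["esp"]), [hex(v) for v in saved])

try:
    software_int(14, user_cpu, idt, tss)
except PermissionError as exc:
    print(exc)
```

The saved frame is what iret later consumes to get back to user mode, and the failed second call corresponds to the general protection exception mentioned in step 1.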
And a few words about ARM. ARM doesn't have a TSS; it has banked per-mode registers. So for SVC and USR modes you have separate stack pointers. If you are interested, you can take a look at the trap entry code.
Interesting links:
MIT JOS lab 3,
XV6 manual

Allocate memory (x64 assembly)

How can I allocate memory on the heap in x64 assembly? I want to store the value produced by the sidt instruction, but I can't seem to find a way to do so.
I use Visual studio 2012.
You have two options (assuming you're running in user space on top of an operating system):
1.) Use whatever your operating system provides to map some writable memory (on UNIX: brk/sbrk/mmap).
2.) Call the malloc function from the C standard library (which will do (1) under the hood for you).
I'd go for number 2 as it's much simpler and kind of portable.
Something similar to the following should do the trick:
movq $0x10, %rdi
callq malloc
# %rax will now contain the pointer to the memory
Assuming the System V AMD64 ABI calling convention, that'll call malloc(16), which should return a pointer to a 16-byte memory block. The address will be in the %rax register after the call returns (or 0 if there is not enough memory).
EDIT: Wikipedia says about the x86-64 calling conventions that Microsoft uses a different calling convention (first argument in RCX, not RDI). So you'd need to change movq $0x10, %rdi to movq $0x10, %rcx. The Microsoft convention also expects the caller to reserve 32 bytes of shadow space on the stack before the call.
Judging by your environment, I'm guessing that you're writing assembly code on Windows. You'll need to use the Windows equivalent of an sbrk system call. You may find this MSDN reference useful!
Write the code to call malloc in C, then have the compiler produce an assembly listing, which will show you the name used for malloc (probably _malloc in the case of Microsoft compilers), and how to call it.
Another option would be to allocate space on the stack by subtracting from rsp (esp in 32-bit code) the size of a structure that will hold the sidt information.
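Whichever way the buffer is obtained, it helps to know how big the sidt result is. In 64-bit mode sidt stores 10 bytes: a 2-byte limit followed by an 8-byte base (6 bytes in 32-bit mode). The sketch below decodes such a 10-byte image; the sample bytes and the function name are made up for illustration.

```python
import struct

def parse_idtr(image: bytes):
    """Split a 10-byte 64-bit IDTR image into (limit, base)."""
    if len(image) != 10:
        raise ValueError("a 64-bit sidt result is exactly 10 bytes")
    # Little-endian, no padding: 16-bit limit, then 64-bit linear base address.
    limit, base = struct.unpack("<HQ", image)
    return limit, base

# Hypothetical image: 256 gates of 16 bytes each, at an invented base address.
sample = struct.pack("<HQ", 256 * 16 - 1, 0xFFFFFE0000000000)

limit, base = parse_idtr(sample)
print(hex(limit), hex(base))
print((limit + 1) // 16)   # number of 16-byte gate descriptors the table can hold
```

So a 16-byte malloc block, or 16 bytes carved out of the stack, is more than enough to hold the result.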

What is better "int 0x80" or "syscall" in 32-bit code on Linux?

I am studying the Linux kernel and found out that on the x86_64 architecture the int 0x80 interrupt doesn't work for making system calls (see footnote 1).
For the i386 architecture (32-bit x86 user-space), which is preferable: syscall or int 0x80, and why?
I use Linux kernel version 3.4.
Footnote 1: int 0x80 does work in some cases in 64-bit code, but is never recommended. What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?
syscall is the default way of entering kernel mode on x86-64. This instruction is not available in 32 bit modes of operation on Intel processors.
sysenter is an instruction most frequently used to invoke system calls in 32 bit modes of operation. It is similar to syscall, a bit more difficult to use though, but that is the kernel's concern.
int 0x80 is a legacy way to invoke a system call and should be avoided.
The preferred way to invoke a system call is to use the vDSO, a part of memory mapped into each process's address space that allows system calls to be made more efficiently (for example, by not entering kernel mode at all in some cases). The vDSO also takes care of the handling of the syscall or sysenter instructions, which is more difficult than the legacy int 0x80 way.
Also, see this and this.
My answer here covers your question.
In practice, recent kernels implement a VDSO, notably to dynamically optimize system calls (the kernel sets the VDSO to the code best suited to the current processor). So you should use the VDSO, and for existing syscalls you had better use the interface provided by the libc.
Notice that, AFAIK, a significant part of the cost of simple syscalls is going from user-space to kernel and back. Hence, for some syscalls (probably gettimeofday, getpid ...) the VDSO might avoid even that (and technically might avoid doing a real syscall). For most syscalls (like open, read, send, mmap ....) the kernel cost of the syscall is large enough to make any improvement of the user-space to kernel space transition (e.g. using SYSENTER or SYSCALL machine instructions instead of INT) insignificant.
Beware of this before changing: system call numbers differ between int 0x80 and syscall, e.g. sys_write is 4 with int 0x80 and 1 with syscall.
http://docs.cs.up.ac.za/programming/asm/derick_tut/syscalls.html for 32 bits or 0x80
http://blog.rchapman.org/post/36801038863/linux-system-call-table-for-x86-64 for syscall
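As a quick illustration of that warning, here are a few well-known numbers from the two tables side by side. It is only a small excerpt, and the dictionary and function names are invented for this sketch; consult the tables linked above for the full lists.

```python
# A few syscall numbers in the i386 table (used by int 0x80)
# and in the x86-64 table (used by the syscall instruction).
SYSCALL_NUMBERS = {
    "read":  {"i386": 3, "x86_64": 0},
    "write": {"i386": 4, "x86_64": 1},
    "open":  {"i386": 5, "x86_64": 2},
    "close": {"i386": 6, "x86_64": 3},
    "exit":  {"i386": 1, "x86_64": 60},
}

def syscall_number(name, abi):
    """Look up the number to load into eax/rax for the given entry method's table."""
    return SYSCALL_NUMBERS[name][abi]

for name in SYSCALL_NUMBERS:
    print(name, syscall_number(name, "i386"), syscall_number(name, "x86_64"))
```

Note the trap this sets: the number 1 means exit in the i386 table but write in the x86-64 table, so mixing up the tables does not fail cleanly.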

Get system call address in system call table from /proc/kcore

How can I retrieve the system call address from /proc/kcore? I can get the system call table address from the System.map file.
If you're using an x86-based machine, you can use the sidt instruction to get the interrupt descriptor table register and consequently the interrupt descriptor table itself. With that in hand, you can get the address of the system_call (or the ia32 equivalent for x86-64 compatibility) function invoked by the 0x80 system-call interrupt. Disassembling that interrupt handler and scanning for a specific indirect call instruction, you can extract the address within the call instruction. That address is your system call table (on x86) or the IA32 compatibility system call table on x86-64.
Getting the x86-64 native system call table is similar: instead of reconstructing the interrupt table with sidt, read the processor's IA32_LSTAR MSR. The address at (high << 32 | low) is the system call dispatcher. Scan the memory as before, extract the sys_call_table address from the call instruction, but remember to mask the high 32 bits of the address.
This glosses over a lot of even more technical information (like which bytes to search for) that you should understand before poking around in the kernel code. After a quick Google search I found the entire process documented (with example module code) here.
Good luck, and try not to blow yourself up!
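For the arithmetic in the x86-64 case, here is a minimal sketch. Several things in it are assumptions rather than facts from the answer: the byte pattern is the encoding of call *disp32(,%rax,8), which may not match what a given kernel build contains; the "mask the high 32 bits" remark is read here as sign-extending the 32-bit displacement; and the sample bytes and addresses are invented.

```python
import struct

# Encoding of "call *disp32(,%rax,8)". Assumption: the dispatcher uses this form.
CALL_DISP32_RAX8 = b"\xff\x14\xc5"

def combine_msr(high, low):
    """Build the 64-bit MSR value from the two 32-bit halves rdmsr returns."""
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)

def find_table_address(code: bytes):
    """Scan dispatcher bytes for the indirect call and return the table address."""
    i = code.find(CALL_DISP32_RAX8)
    if i < 0 or i + 7 > len(code):
        return None
    (disp,) = struct.unpack_from("<i", code, i + 3)   # signed 32-bit displacement
    return disp & 0xFFFFFFFFFFFFFFFF                  # sign-extended to 64 bits

# Invented sample: two nops, the call with a made-up displacement, one nop.
sample_code = b"\x90\x90" + CALL_DISP32_RAX8 + struct.pack("<I", 0x81801400) + b"\x90"

print(hex(combine_msr(0xFFFFFFFF, 0x81600000)))
print(hex(find_table_address(sample_code)))
```

On a real system the bytes would come from kernel memory at the address read from the MSR; the sample buffer here only stands in for that.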
