In Linux x86_64 are syscalls and int 0x80 related? - linux

I know that in Linux x86-64 the "syscall" and "int 0x80" assembler instructions both trap into the kernel, asking it to do some work. They have different opcodes (0F 05 vs CD 80), and the former is faster.
It's not clear to me if there is any relationship between them: are they really independent? (i.e.: does "syscall" call "int 0x80"?)
Thank you.

The syscall (x86-64) and sysenter (x86-32) instructions are newer and faster, and so are used when available; the int 0x80 mechanism is preserved for compatibility with old binaries. The two are independent entry points: syscall does not invoke int 0x80 under the hood. Note also that they select different ABIs: int 0x80 uses the 32-bit system-call numbering (arguments in ebx, ecx, edx, esi, edi, ebp) even when executed from a 64-bit process, while syscall uses the 64-bit numbering with arguments in rdi, rsi, rdx, r10, r8, r9.
I have a dim recollection of there being a small number of system calls that could only be made using int 0x80 because of their unusual stack-related behavior (clone? execve? sigreturn?) but that might no longer be true.

int 0x80 is considered obsolete on x86-64 (since it is slow). BTW, you really want to use vdso(7) where possible.
AFAIK, both instructions enter the kernel, and each has some (short) sequence of kernel processing which ultimately jumps into the syscall table.

int 0x80 is the legacy 32-bit system-call interrupt used by i386 Linux. x86_64 uses syscall in its place. See the differences between /usr/include/asm/unistd_32.h and unistd_64.h for the different call numbers each expects when invoking the various kernel functions.

Related

Does asmlinkage mean stack or registers?

In most languages, C included, the stack is used for function calls. That's why you get a "Stack Overflow" error if you are not careful with recursion. (Pun not intended.)
If that is true, then what is so special about the asmlinkage GCC directive?
It says, from #kernelnewbies:
The asmlinkage tag is one other thing that we should observe about
this simple function. This is a #define for some gcc magic that tells
the compiler that the function should not expect to find any of its
arguments in registers (a common optimization), but only on the CPU's
stack.
I mean I don't think the registers are used in normal function calls.
What is even more strange is when you learn it is implemented using the GCC regparm function attribute on x86.
The documentation of regparm is as follows:
On x86-32 targets, the regparm attribute causes the compiler to pass arguments number one to number (the attribute's argument), if they are of integral type, in registers EAX, EDX, and ECX instead of on the stack.
This is basically saying the opposite of what asmlinkage is trying to do. So what happens? Are they on the stack or in the registers?
Where am I going wrong?
The information isn't very clear.
On x86 32-bit, the asmlinkage macro expands to __attribute__((regparm(0))), which basically tells GCC that no parameters should be passed through registers (the 0 is the important part). As of Linux 5.17, x86-32 and ia64 seem to be the only two architectures re-defining this macro, which by default expands to no attribute at all.
So asmlinkage does not by itself mean "parameters are passed on the stack". By default, the normal calling convention is used. This includes x86 64bit, which follows the System V AMD64 ABI calling convention, passing function parameters through RDI, RSI, RDX, RCX, R8, R9, [XYZ]MM0–7.
HOWEVER there is an important clarification to make: even with no special __attribute__ to force the compiler to use the stack for parameters, syscalls in recent kernel versions still take parameters from the stack indirectly through a pointer to a pt_regs structure (holding all the user-space registers saved on the stack on syscall entry). This is achieved through a moderately complex set of macros (SYSCALL_DEFINEx) that does everything transparently.
So technically, although asmlinkage does not change the calling convention, parameters are not passed inside registers as one would think by simply looking at the syscall function signature.
For example, the following syscall:
SYSCALL_DEFINE3(something, int, one, int, two, int, three)
{
// ...
do_something(one, two, three);
// ...
}
Actually becomes (roughly):
asmlinkage long __x64_sys_something(struct pt_regs *regs)
{
// ...
do_something(regs->di, regs->si, regs->dx);
// ...
}
Which compiles to something like:
/* ... */
mov rdx,QWORD PTR [rdi+0x60]
mov rsi,QWORD PTR [rdi+0x68]
mov rdi,QWORD PTR [rdi+0x70]
call do_something
/* ... */
On i386 and x86-64 at least, asmlinkage means to use the standard calling convention you'd get with no GCC options and no __attribute__. (Like what user-space programs normally use for that target.)
For i386, that means stack args only. For x86-64, it's still the same registers as usual.
For x86-64, there's no difference; the kernel already uses the standard calling convention from the AMD64 System V ABI doc everywhere, because it's well-designed for efficiency, passing the first 6 integer args in registers.
But i386 has more historical baggage, with the standard calling convention (i386 SysV ABI) inefficiently passing all args on the stack. Presumably at some point in ancient history, Linux was compiled by GCC using this convention, and the hand-written asm entry points that called C functions were already using that convention.
So (I'm guessing here), when Linux wanted to switch from gcc -m32 to gcc -m32 -mregparm=3 to build the kernel with a more efficient calling convention, they had a choice to either modify the hand-written asm at the same time to use the new convention, or to force a few specific C functions to still use the traditional calling convention so the hand-written asm could stay the same.
If they'd made the former choice, asmlinkage for i386 would be __attribute__((regparm(3))) to force that convention even if the kernel is compiled a different way.
But instead, they chose to keep the asm the same, and #define asmlinkage __attribute__((regparm(0))) for i386, which indeed is zero register args, using the stack right away.
I don't know if that maybe had any debugging benefit, like in terms of being able to see what args got passed into a C function from asm without the only copy likely getting modified right away.
If -mregparm=3 and the corresponding attribute were new GCC features, Linus probably wanted to keep it possible to build the kernel with older GCC. That would rule out changing the asm to require __attribute__((regparm(3))). The asmlinkage = regparm(0) choice they actually made also has the advantage of not having to modify any asm, which means no correctness concerns, and that can be disentangled from any possible GCC bugs with using the new(?)-at-the-time calling convention.
At this point I think it would be totally possible to modify the asm code that calls asmlinkage functions, and swap it to being regparm(3). But that's a pretty minor thing. And not worth doing now since 32-bit x86 kernels are long since obsolete for almost all use cases. You almost always want a 64-bit kernel even if using a 32-bit user-space.
There might even be an efficiency benefit to stack args if saving the registers at a system-call entry point involved saving them with EBX at the lowest address, where they're already in place to be used as function args. You'd be all set to call *ia32_sys_call_table(,%eax,4). But that isn't actually safe because callees own their stack args and are allowed to write them, even though GCC usually doesn't use the incoming stack arg locations as scratch space. So I doubt Linux would have done this.
Other ISAs cope just fine with asmlinkage passing args in registers, so there's nothing fundamental about stack args that's important for how Linux works. (Except possibly for i386-specific code, but I doubt even that.)
The whole "asmlinkage means to pass args on the stack" is purely an i386 thing.
Most other ISAs that Linux runs on are more recent than 32-bit x86 (and/or are RISC-like with more registers), and have a standard calling convention that's more efficient with modern CPUs, using registers for the first few args. That includes x86-64.

Can ptrace tell if an x86 system call used the 64-bit or 32-bit ABI?

I'm trying to use ptrace to trace all syscalls made by a separate process, be it 32-bit (IA-32) or 64-bit (x86-64). My tracer would run on a 64-bit x86 installation with IA-32 emulation enabled, but ideally would be able to trace both 64-bit and 32-bit applications, including if a 64-bit application forks and execs a 32-bit process.
The issue is that, since 32-bit and 64-bit syscall numbers differ, I need to know whether a process is 32-bit or 64-bit to determine which syscall it used, even if I have the syscall number. There seem to be imperfect methods, like checking /proc/<pid>/exe or (as strace does) the size of the registers struct, but nothing reliable.
Complicating this is the fact that 64-bit processes can far-jump to a 32-bit code segment and execute 32-bit code directly. They can also make 32-bit int $0x80 syscalls, which, of course, use the 32-bit syscall numbers. I don't "trust" the processes I trace not to use these tricks, so I want to detect them correctly. And I've independently verified that in at least the latter case, ptrace sees the 32-bit syscall numbers and argument register assignments, not the 64-bit ones.
I poked around in the kernel source and came across the TS_COMPAT flag in arch/x86/include/asm/processor.h, which appears to be set whenever a 32-bit syscall is made by a 64-bit process. The only problem is that I have no idea how to access this flag from userland, or if it is even possible.
I also thought about reading the %cs and comparing it to $0x23 or $0x33, inspired by this method for switching bitness in a running process. But this only detects 32-bit processes, not necessarily 32-bit syscalls (those made with int $0x80) from a 64-bit process. It's also fragile since it relies on undocumented kernel behavior.
Finally, I noticed that the x86 architecture has a bit for long mode in the Extended Feature Enable Register MSR. But ptrace has no way of reading the MSR from a tracee, and I feel like reading it from within my tracer will be inadequate because my tracer is always running in long mode.
I'm at a loss. Perhaps I could try and use one of those hacks—at this point I'm leaning towards %cs or the /proc/<pid>/exe method—but I want something durable that will actually distinguish between 32-bit and 64-bit syscalls. How can a process using ptrace under x86-64, which has detected that its tracee made a syscall, reliably determine whether that syscall was made with the 32-bit (int $0x80) or 64-bit (syscall) ABI? Is there some other way for a user process to gain this information about another process that it is authorized to ptrace?
Interesting, I hadn't realized that there wasn't an obvious smarter way that strace could use to correctly decode int 0x80 from 64-bit processes. (This is being worked on, see this answer for links to a proposed kernel patch to add PTRACE_GET_SYSCALL_INFO to the ptrace API. strace 4.26 already supports it on patched kernels.)
Update: strace now supports per-syscall detection. I don't know which mainline kernel version added the feature; I tested on Arch Linux with kernel version 5.5 and strace version 5.5.
e.g. this NASM source assembled into a static executable:
global _start
_start:
mov eax, 4
int 0x80
mov eax, 60
syscall
gives this trace: nasm -felf64 foo.asm && ld -o foo foo.o && strace ./foo
execve("./foo", ["./foo"], 0x7ffcdc233180 /* 51 vars */) = 0
strace: [ Process PID=1262249 runs in 32 bit mode. ]
write(0, NULL, 0) = 0
strace: [ Process PID=1262249 runs in 64 bit mode. ]
exit(0) = ?
+++ exited with 0 +++
strace prints a message every time a system call uses a different ABI bitness than previously. Note that the message about runs in 32 bit mode is completely wrong; it's merely using the 32-bit ABI from 64-bit mode. "Mode" has a specific technical meaning for x86-64, and this is not it.
With older kernels
As a workaround, I think you could disassemble the code at RIP and check whether it was the syscall instruction (0F 05) or not, because ptrace does let you read the target process's memory.
But for a security use-case like disallowing some system calls, this would be vulnerable to a race condition: another thread in the syscall process could rewrite the syscall bytes to int 0x80 after they execute, but before you can peek at them with ptrace.
You only need to do that check if the process is running in 64-bit mode; if it's not, only the 32-bit ABI is available, so there's nothing to disambiguate. (The vdso page can potentially use 32-bit mode syscall on AMD CPUs that support it but not sysenter. Not checking in the first place for 32-bit processes avoids this corner case.) I think you're saying you have a reliable way to detect that, at least.
(I haven't used the ptrace API directly, just the tools like strace that use it. So I hope this answer makes sense.)

Fastest Linux system call

On an x86-64 Intel system that supports syscall and sysret what's the "fastest" system call from 64-bit user code on a vanilla kernel?
In particular, it must be a system call that exercises the syscall/sysret user <-> kernel transition1, but does the least amount of work beyond that. It doesn't even need to do the syscall itself: some type of early error which never dispatches to the specific call on the kernel side is fine, as long as it doesn't go down some slow path because of that.
Such a call could be used to estimate the raw syscall and sysret overhead independent of any work done by the call.
1 In particular, this excludes things that appear to be system calls but are implemented in the VDSO (e.g., clock_gettime) or are cached by the runtime (e.g., getpid).
One that doesn't exist, and therefore returns -ENOSYS quickly.
From arch/x86/entry/entry_64.S:
#if __SYSCALL_MASK == ~0
cmpq $__NR_syscall_max, %rax
#else
andl $__SYSCALL_MASK, %eax
cmpl $__NR_syscall_max, %eax
#endif
ja 1f /* return -ENOSYS (already in pt_regs->ax) */
movq %r10, %rcx
/*
* This call instruction is handled specially in stub_ptregs_64.
* It might end up jumping to the slow path. If it jumps, RAX
* and all argument registers are clobbered.
*/
#ifdef CONFIG_RETPOLINE
movq sys_call_table(, %rax, 8), %rax
call __x86_indirect_thunk_rax
#else
call *sys_call_table(, %rax, 8)
#endif
.Lentry_SYSCALL_64_after_fastpath_call:
movq %rax, RAX(%rsp)
1:
Use an invalid system call number so the dispatching code simply returns with
eax = -ENOSYS instead of dispatching to a system-call handling function at all.
Unless this causes the kernel to use the iret slow path instead of sysret / sysexit. That might explain the measurements showing an invalid number being 17 cycles slower than syscall(SYS_getpid), because glibc error handling (setting errno) probably doesn't explain it. But from my reading of the kernel source, I don't see any reason why it wouldn't still use sysret while returning -ENOSYS.
This answer is for sysenter, not syscall. The question originally said sysenter / sysret (which was weird because sysexit goes with sysenter, while sysret goes with syscall). I answered based on sysenter for a 32-bit process on an x86-64 kernel.
Native 64-bit syscall is handled more efficiently inside the kernel. (Update: with Meltdown / Spectre mitigation patches, it still dispatches via C do_syscall_64 in 4.16-rc2.)
My What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? Q&A gives an overview of the kernel side of system-call entry points from compat mode into an x86-64 kernel (entry_64_compat.S). This answer is just taking the relevant parts of that.
The links in that answer and this are to Linux 4.12 sources, which doesn't contain the Meltdown mitigation page-table manipulation, so that will be significant extra overhead.
int 0x80 and sysenter have different entry points. You're looking for entry_SYSENTER_compat. AFAIK, sysenter always goes there, even if you execute it in a 64-bit user-space process. Linux's entry point pushes a constant __USER32_CS as the saved CS value, so it will always return to user-space in 32-bit mode.
After pushing registers to construct a struct pt_regs on the kernel stack, there's a TRACE_IRQS_OFF hook (no idea how many instructions that amounts to), then call do_fast_syscall_32 which is written in C. (Native 64-bit syscall dispatching is done directly from asm, but 32-bit compat system calls are always dispatched through C).
do_syscall_32_irqs_on in arch/x86/entry/common.c is pretty light-weight: just a check if the process is being traced (I think this is how strace can hook system calls via ptrace), then
...
if (likely(nr < IA32_NR_syscalls)) {
regs->ax = ia32_sys_call_table[nr]( ... arg );
}
syscall_return_slowpath(regs);
}
AFAIK, the kernel can use sysexit after this function returns.
So the return path is the same whether or not EAX had a valid system call number, and obviously returning without dispatching at all is the fastest path through that function, especially in a kernel with Spectre mitigation where the indirect branch on the table of function pointers would go through a retpoline and always mispredict.
If you want to really test sysenter/sysexit without all that extra overhead, you'll need to modify Linux to put a much simpler entry point without checking for tracing or pushing / popping all the registers.
You'd probably also want to modify the ABI to pass a return address in a register (like syscall does on its own) instead of saved on the user-space stack which Linux's current sysenter ABI does; it has to get_user() to read the EIP value it should return to.
Or, if all this overhead is part of what you want to measure, you're definitely all set with an eax that gives you -ENOSYS; at worst you'll be getting one extra branch miss from the range-check if branch predictors are hot for that branch based on normal 32-bit system calls.
In this benchmark by Brendan Gregg (linked from this blog post which is interesting reading on the topic) close(999) (or some other fd not in use) is recommended.
Some system calls don't even go through any user->kernel transition; read vdso(7).
I suspect that these VDSO "system calls" (e.g. time(2), ...) are the fastest. You could argue that they are not "real" system calls at all.
BTW, you could add a dummy system call to your kernel (e.g. some system call always returning 0, or a hello world system call, see also this) and measure it.
I suspect (without having benchmarked it) that getpid(2) should be a very fast system call, because the only thing it needs to do is fetch some data from kernel memory. And AFAIK, it is a genuine system call, not using VDSO techniques. And you could use syscall(2) to avoid the caching done by your libc and force the genuine system call.
I maintain my position (given in a comment on your initial question): without actual motivation, your question does not make concrete sense. That said, I still think that syscall(2) doing getpid measures the typical overhead of making a system call (and I guess that's what you really care about). In practice, almost all system calls do more work than such a getpid (or getppid).

Linux syscall strategy through vsyscall page

I am reading about VM handling on Linux. Apparently, to perform a syscall, there's a page at 0xFFFFF000 on x86 called the vsyscall page. In the past, the strategy to call a syscall was to use int 0x80. Is this vsyscall page strategy still using int 0x80 under the hood, or is it using a different call strategy (e.g. the syscall opcode)? Collateral question: is the int 0x80 method outdated?
If you run ldd on a modern Linux binary, you'll see that it's linked against a dynamic library called linux-vdso.so.1 (on amd64) or linux-gate.so.1 (on x86), which is located in that vsyscall page. This is a shared library provided by the kernel, mapped into every process's address space, which contains C functions that encapsulate the specifics of how to perform a system call.
The reason for this encapsulation is that the "preferred" way to perform a system call can differ from one machine to another. The interrupt 0x80 method should always work on x86, but recent processors support the sysenter (Intel) or syscall (AMD) instructions, which are much more efficient. You want your programs to use those when available, but you also want the same compiled binary to run on both Intel and AMD (and other) processors, so it shouldn't contain vendor-specific opcodes. The linux-vdso/linux-gate library hides these processor-specific decisions behind a consistent interface.
For more information, see this article.

Which linux process handles syscalls?

This might be a silly question, but I was debugging a binary with gdb trying to "reverse engineer" it and reached an instruction that makes a syscall after which the effect I want to reverse engineer appears. I assume that another process is taking over and does the job so I was wondering if it was possible to debug the kernel code that handles the syscall with gdb.
Here is the x86-64 assembly snippet that makes the syscall (note: eax = 0x14 = 20 selects sys_writev in the 64-bit ABI; 20 would be getpid only in the 32-bit table):
0x00007ffff7660d3e <+14>: movsxd rdx,edx
0x00007ffff7660d41 <+17>: movsxd rdi,edi
0x00007ffff7660d44 <+20>: mov eax,0x14
0x00007ffff7660d49 <+25>: syscall
The syscall (or sysenter or int 0x80 etc...) machine instruction is for making syscalls which by definition are handled by the Linux kernel. Details are defined in the x86-64 ABI specification. Read Advanced Linux Programming to get an overview of most of them. See also Linux Assembly HowTo.
From the point of view of a user application, a syscall is a virtual atomic instruction.
No specific userland process is handling syscalls, it is the job of the kernel to handle them, and it is nearly the sole way for an application to interact with the kernel.
The processing of syscalls by the kernel for a given process is accounted as system CPU time, e.g. by time(1).
The list of documented syscalls is given in syscalls(2). See also <asm/unistd.h> and <asm/unistd_64.h> etc... headers.
You could use strace(1) to understand the sequence of syscalls done by a particular run (of some process).
See also vdso(7).
