Does asmlinkage mean stack or registers? - linux

In most languages, C included, the stack is used for function calls. That's why you get a "Stack Overflow" error if you are not careful in recursion. (Pun not intended).
If that is true, then what is so special about the asmlinkage GCC directive?
It says, from #kernelnewbies
The asmlinkage tag is one other thing that we should observe about
this simple function. This is a #define for some gcc magic that tells
the compiler that the function should not expect to find any of its
arguments in registers (a common optimization), but only on the CPU's
stack.
I mean I don't think the registers are used in normal function calls.
What is even more strange is when you learn it is implemented using the GCC regparm function attribute on x86.
The documentation of regparm is as follows:
On x86-32 targets, the regparm attribute causes the compiler to pass
arguments number one to number if they are of integral type in
registers EAX, EDX, and ECX instead of on the stack.
This is basically saying the opposite of what asmlinkage is trying to do.
So what happens? Are they on the stack or in the registers?
Where am I going wrong?
The information isn't very clear.

On x86 32bit, the asmlinkage macro expands to __attribute__((regparm(0))), which basically tells GCC that no parameters should be passed through registers (the 0 is the important part). As of Linux 5.17, x86-32 and Itanium (IA-64) seem to be the only two architectures re-defining this macro, which by default expands to no attribute at all.
So asmlinkage does not by itself mean "parameters are passed on the stack". By default, the normal calling convention is used. This includes x86 64bit, which follows the System V AMD64 ABI calling convention, passing function parameters through RDI, RSI, RDX, RCX, R8, R9, [XYZ]MM0–7.
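To make the regparm point concrete, here is a small user-space sketch. The names my_asmlinkage and my_fastcall are hypothetical stand-ins, not the kernel's macros, and the attributes are only applied when building for 32-bit x86; on any other target both macros expand to nothing, mirroring the default described above.

```c
/* Hypothetical stand-ins for illustration only; the kernel's own macro
 * lives in its linkage headers. Only 32-bit x86 gets an attribute here. */
#if defined(__i386__)
#define my_asmlinkage __attribute__((regparm(0)))
#define my_fastcall   __attribute__((regparm(3)))
#else
#define my_asmlinkage
#define my_fastcall
#endif

/* On i386: regparm(0), so all three arguments are read from the stack. */
my_asmlinkage long sum3_stack(long a, long b, long c)
{
    return a + b + c;
}

/* On i386: regparm(3), so a, b and c arrive in EAX, EDX and ECX. */
my_fastcall long sum3_regs(long a, long b, long c)
{
    return a + b + c;
}
```

Both functions return the same value; the difference only shows up in the generated code. Compiling with something like gcc -m32 -O2 -S and comparing the two function bodies should show stack loads in the first and none in the second.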
HOWEVER there is an important clarification to make: even with no special __attribute__ to force the compiler to use the stack for parameters, syscalls in recent kernel versions still take parameters from the stack indirectly through a pointer to a pt_regs structure (holding all the user-space registers saved on the stack on syscall entry). This is achieved through a moderately complex set of macros (SYSCALL_DEFINEx) that does everything transparently.
So technically, although asmlinkage does not change the calling convention, parameters are not passed inside registers as one would think by simply looking at the syscall function signature.
For example, the following syscall:
SYSCALL_DEFINE3(something, int, one, int, two, int, three)
{
// ...
do_something(one, two, three);
// ...
}
Actually becomes (roughly):
asmlinkage long __x64_sys_something(const struct pt_regs *regs)
{
// ...
do_something(regs->di, regs->si, regs->dx);
// ...
}
Which compiles to something like:
/* ... */
mov rdx,QWORD PTR [rdi+0x60]
mov rsi,QWORD PTR [rdi+0x68]
mov rdi,QWORD PTR [rdi+0x70]
call do_something
/* ... */
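The wrapper idea can be imitated in plain user-space C. This is only a sketch: struct mock_pt_regs is a cut-down, hypothetical stand-in that does not match the real pt_regs layout, and MOCK_SYSCALL_DEFINE3 is a made-up macro that is far simpler than the kernel's SYSCALL_DEFINEx family.

```c
/* Hypothetical, cut-down stand-in for struct pt_regs: only the three
 * registers used below, and not the real field layout. */
struct mock_pt_regs {
    unsigned long dx;
    unsigned long si;
    unsigned long di;
};

/* Hypothetical macro, loosely modelled on the idea behind SYSCALL_DEFINE3:
 * emit a wrapper that takes a saved-register snapshot, unpacks the
 * arguments, and forwards them to the typed body that follows. */
#define MOCK_SYSCALL_DEFINE3(name, t1, a1, t2, a2, t3, a3)              \
    static long do_##name(t1 a1, t2 a2, t3 a3);                         \
    long mock_sys_##name(const struct mock_pt_regs *regs)               \
    {                                                                   \
        return do_##name((t1)regs->di, (t2)regs->si, (t3)regs->dx);     \
    }                                                                   \
    static long do_##name(t1 a1, t2 a2, t3 a3)

/* Reads like a three-argument function, but the only externally visible
 * entry point is mock_sys_something(const struct mock_pt_regs *). */
MOCK_SYSCALL_DEFINE3(something, int, one, int, two, int, three)
{
    return one * 100L + two * 10L + three;
}
```

Calling mock_sys_something with di = 1, si = 2, dx = 3 gives 123, which shows that the arguments come out of the structure in di, si, dx order rather than being passed individually.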

On i386 and x86-64 at least, asmlinkage means to use the standard calling convention you'd get with no GCC options and no __attribute__. (Like what user-space programs normally use for that target.)
For i386, that means stack args only. For x86-64, it's still the same registers as usual.
For x86-64, there's no difference; the kernel already uses the standard calling convention from the AMD64 System V ABI doc everywhere, because it's well-designed for efficiency, passing the first 6 integer args in registers.
But i386 has more historical baggage, with the standard calling convention (i386 SysV ABI) inefficiently passing all args on the stack. Presumably at some point in ancient history, Linux was compiled by GCC using this convention, and the hand-written asm entry points that called C functions were already using that convention.
So (I'm guessing here), when Linux wanted to switch from gcc -m32 to gcc -m32 -mregparm=3 to build the kernel with a more efficient calling convention, they had a choice to either modify the hand-written asm at the same time to use the new convention, or to force a few specific C functions to still use the traditional calling convention so the hand-written asm could stay the same.
If they'd made the former choice, asmlinkage for i386 would be __attribute__((regparm(3))) to force that convention even if the kernel is compiled a different way.
But instead, they chose to keep the asm the same, and #define asmlinkage __attribute__((regparm(0))) for i386, which indeed is zero register args, using the stack right away.
I don't know if that maybe had any debugging benefit, like in terms of being able to see what args got passed into a C function from asm without the only copy likely getting modified right away.
If -mregparm=3 and the corresponding attribute were new GCC features, Linus probably wanted to keep it possible to build the kernel with older GCC. That would rule out changing the asm to require __attribute__((regparm(3))). The asmlinkage = regparm(0) choice they actually made also has the advantage of not having to modify any asm, which means no correctness concerns, and that can be disentangled from any possible GCC bugs with using the new(?)-at-the-time calling convention.
At this point I think it would be totally possible to modify the asm code that calls asmlinkage functions, and swap it to being regparm(3). But that's a pretty minor thing. And not worth doing now since 32-bit x86 kernels are long since obsolete for almost all use cases. You almost always want a 64-bit kernel even if using a 32-bit user-space.
There might even be an efficiency benefit to stack args if saving the registers at a system-call entry point involved saving them with EBX at the lowest address, where they're already in place to be used as function args. You'd be all set to call *ia32_sys_call_table(,%eax,4). But that isn't actually safe because callees own their stack args and are allowed to write them, even though GCC usually doesn't use the incoming stack arg locations as scratch space. So I doubt Linux would have done this.
Other ISAs cope just fine with asmlinkage passing args in registers, so there's nothing fundamental about stack args that's important for how Linux works. (Except possibly for i386-specific code, but I doubt even that.)
The whole "asmlinkage means to pass args on the stack" is purely an i386 thing.
Most other ISAs that Linux runs on are more recent than 32-bit x86 (and/or are RISC-like with more registers), and have a standard calling convention that's more efficient with modern CPUs, using registers for the first few args. That includes x86-64.

Related

Fastest Linux system call

On an x86-64 Intel system that supports syscall and sysret what's the "fastest" system call from 64-bit user code on a vanilla kernel?
In particular, it must be a system call that exercises the syscall/sysret user <-> kernel transition [1], but does the least amount of work beyond that. It doesn't even need to do the syscall itself: some type of early error which never dispatches to the specific call on the kernel side is fine, as long as it doesn't go down some slow path because of that.
Such a call could be used to estimate the raw syscall and sysret overhead independent of any work done by the call.
[1] In particular, this excludes things that appear to be system calls but are implemented in the VDSO (e.g., clock_gettime) or are cached by the runtime (e.g., getpid).
One that doesn't exist, and therefore returns -ENOSYS quickly.
From arch/x86/entry/entry_64.S:
#if __SYSCALL_MASK == ~0
cmpq $__NR_syscall_max, %rax
#else
andl $__SYSCALL_MASK, %eax
cmpl $__NR_syscall_max, %eax
#endif
ja 1f /* return -ENOSYS (already in pt_regs->ax) */
movq %r10, %rcx
/*
* This call instruction is handled specially in stub_ptregs_64.
* It might end up jumping to the slow path. If it jumps, RAX
* and all argument registers are clobbered.
*/
#ifdef CONFIG_RETPOLINE
movq sys_call_table(, %rax, 8), %rax
call __x86_indirect_thunk_rax
#else
call *sys_call_table(, %rax, 8)
#endif
.Lentry_SYSCALL_64_after_fastpath_call:
movq %rax, RAX(%rsp)
1:
Use an invalid system call number so the dispatching code simply returns with
eax = -ENOSYS instead of dispatching to a system-call handling function at all.
Unless this causes the kernel to use the iret slow path instead of sysret / sysexit. That might explain the measurements showing an invalid number being 17 cycles slower than syscall(SYS_getpid), because glibc error handling (setting errno) probably doesn't explain it. But from my reading of the kernel source, I don't see any reason why it wouldn't still use sysret while returning -ENOSYS.
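From user space, the invalid-number approach might look like the sketch below. The number 100000 is an assumption: it is simply a value that I expect no current kernel to assign. Note that a seccomp filter or sandbox can change the reported error, so ENOSYS is what an unfiltered kernel should give, not a guarantee.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: no kernel assigns this number, so the entry code rejects it
 * before any system-call handler runs. */
#define BOGUS_SYSCALL_NR 100000L

/* Issues the bogus call through syscall(2). Returns the wrapper's return
 * value and stores errno in *err (ENOSYS on an unfiltered kernel). */
long call_bogus_syscall(int *err)
{
    errno = 0;
    long ret = syscall(BOGUS_SYSCALL_NR);
    *err = errno;
    return ret;
}
```

The libc wrapper turns the kernel's negative return value into -1 plus errno, so a benchmark loop built on this also pays for that small amount of user-space error handling.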
This answer is for sysenter, not syscall. The question originally said sysenter / sysret (which was weird because sysexit goes with sysenter, while sysret goes with syscall). I answered based on sysenter for a 32-bit process on an x86-64 kernel.
Native 64-bit syscall is handled more efficiently inside the kernel. (Update; with Meltdown / Spectre mitigation patches, it still dispatches via C do_syscall_64 in 4.16-rc2).
My What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? Q&A gives an overview of the kernel side of system-call entry points from compat mode into an x86-64 kernel (entry_64_compat.S). This answer is just taking the relevant parts of that.
The links in that answer and this are to Linux 4.12 sources, which don't contain the Meltdown mitigation page-table manipulation, so that will be significant extra overhead.
int 0x80 and sysenter have different entry points. You're looking for entry_SYSENTER_compat. AFAIK, sysenter always goes there, even if you execute it in a 64-bit user-space process. Linux's entry point pushes a constant __USER32_CS as the saved CS value, so it will always return to user-space in 32-bit mode.
After pushing registers to construct a struct pt_regs on the kernel stack, there's a TRACE_IRQS_OFF hook (no idea how many instructions that amounts to), then call do_fast_syscall_32 which is written in C. (Native 64-bit syscall dispatching is done directly from asm, but 32-bit compat system calls are always dispatched through C).
do_syscall_32_irqs_on in arch/x86/entry/common.c is pretty light-weight: just a check if the process is being traced (I think this is how strace can hook system calls via ptrace), then
...
if (likely(nr < IA32_NR_syscalls)) {
regs->ax = ia32_sys_call_table[nr]( ... arg );
}
syscall_return_slowpath(regs);
}
AFAIK, the kernel can use sysexit after this function returns.
So the return path is the same whether or not EAX had a valid system call number, and obviously returning without dispatching at all is the fastest path through that function, especially in a kernel with Spectre mitigation where the indirect branch on the table of function pointers would go through a retpoline and always mispredict.
If you want to really test sysenter/sysexit without all that extra overhead, you'll need to modify Linux to put a much simpler entry point without checking for tracing or pushing / popping all the registers.
You'd probably also want to modify the ABI to pass a return address in a register (like syscall does on its own) instead of saved on the user-space stack which Linux's current sysenter ABI does; it has to get_user() to read the EIP value it should return to.
Or if all this overhead is part of what you want to measure, you're definitely all set with an eax that gives you -ENOSYS; at worst you'll be getting one extra branch miss from the range-check if branch predictors are hot for that branch based on normal 32-bit system calls.
In this benchmark by Brendan Gregg (linked from this blog post which is interesting reading on the topic) close(999) (or some other fd not in use) is recommended.
Some system calls don't even go through any user->kernel transition; read vdso(7).
I suspect that these VDSO system calls (e.g. time(2), ...) are the fastest. You could claim that there are no "real" system calls.
BTW, you could add a dummy system call to your kernel (e.g. some system call always returning 0, or a hello world system call, see also this) and measure it.
I suspect (without having benchmarked it) that getpid(2) should be a very fast system call, because the only thing it needs to do is fetch some data from the kernel memory. And AFAIK, it is a genuine system call, not using VDSO techniques. And you could use syscall(2) to avoid its caching done by your libc and forcing the genuine system call.
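A rough way to try this is sketched below. The function names are my own, and the timing includes the libc syscall() wrapper and the loop itself, so the number is better read as an upper bound than as the bare transition cost.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/syscall.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Always enters the kernel: syscall(2) bypasses any caching that the
 * libc getpid() wrapper might do. */
pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* Average wall-clock nanoseconds per raw getpid call over iters calls.
 * Includes wrapper and loop overhead. */
double avg_getpid_ns(long iters)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++)
        (void)raw_getpid();
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (double)(end.tv_sec - start.tv_sec) * 1e9
                   + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed / (double)iters;
}
```

Running avg_getpid_ns with a large iteration count (say a million) and taking the minimum over several runs usually gives a steadier figure than a single run.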
I maintain my position (given in a comment to your initial question): without actual motivation your question does not make any concrete sense. Then I still do think that syscall(2) doing getpid is measuring the typical overhead to make a system call (and I guess you really care about that one). In practice almost all system calls are doing more work than such a getpid (or getppid).

How to inform GCC to not use a particular register

Assume I have a very big source code and intend to make the rdx register totally unused during the execution, i.e., while generating the assembly code, all I want is to inform my compiler (GCC) that it should not use rdx at all.
NOTE: register rdx is just an example. I am OK with any available Intel x86 register.
I am even happy to update the source code of the compiler and use my custom GCC. But which changes to the source code are needed?
You tell GCC not to allocate a register via the -ffixed-reg option (gcc docs).
-ffixed-reg
Treat the register named reg as a fixed register; generated code should never refer to it (except perhaps as a stack pointer, frame pointer or in some other fixed role).
reg must be the name of a register. The register names accepted are machine-specific and are defined in the REGISTER_NAMES macro in the machine description macro file.
For example, gcc -ffixed-r13 will make gcc leave it alone entirely. Using registers that are part of the calling convention, or required for certain instructions, may be problematic.
You can put some global variable to this register.
For ARM CPU you can do it this way:
register volatile type *global_ptr asm ("r8");
This instruction uses general purpose register "r8" to hold
the value of global_ptr pointer.
See the source in U-Boot for real-life example:
http://git.denx.de/?p=u-boot.git;a=blob;f=arch/arm/include/asm/global_data.h;h=4e3ea55e290a19c766017b59241615f7723531d5;hb=HEAD#l83
File arch/arm/include/asm/global_data.h (line ~83).
#define DECLARE_GLOBAL_DATA_PTR register volatile gd_t *gd asm ("r8")
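The same GCC extension (a global register variable) is available on x86-64. The sketch below assumes GCC rather than Clang, and picks r12 purely as an illustration because it is callee-saved in the x86-64 calling convention; the variable and function names are made up. It only reserves the register inside this translation unit, so code compiled elsewhere (libc included) is still free to use r12 as long as it restores it.

```c
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__)
/* GCC extension: pin a file-scope variable to r12 for this translation
 * unit. r12 is an illustrative choice, not a requirement. */
register unsigned long reserved_value __asm__("r12");
#else
/* Fallback so the sketch still compiles elsewhere; nothing is reserved. */
static unsigned long reserved_value;
#endif

/* Stores v in the pinned variable (r12 itself in the GCC x86-64 case). */
void set_reserved(unsigned long v)
{
    reserved_value = v;
}

/* Reads the pinned variable back. */
unsigned long get_reserved(void)
{
    return reserved_value;
}
```

Registers with a fixed role in instructions or in argument passing, such as rdx, are a much worse fit for this than a callee-saved register.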
I don't know whether there is a simple mechanism to tell that to gcc at run time. I would assume that you must recompile. From what I read I understand that there are description files for the different CPUs, e.g. this file, but what exactly needs to be changed in order to prevent gcc from using the register, and what potential side effects such a change could have, is beyond me.
I would ask on the gcc mailing list for assistance. Chances are that the modification is not so difficult per se, except that building gcc isn't trivial in my experience. In your case, if I analyze the situation correctly, a caveat applies. You are essentially cross-compiling, i.e. building for a different architecture. In particular I understand that you have to build your system and other libraries which your program uses because their code would normally use that register. If you intend to link dynamically you probably would also have to build your own ld.so (the dynamic loader) because starting a dynamically linked executable actually starts that loader which would use that register. (Therefore maybe linking statically is better.)
Consider the divq instruction - the dividend is represented by [rdx][rax], and, assuming the divisor (D) satisfies rdx < D, the quotient is stored in %rax and remainder in %rdx. There are no alternative registers that can be used here.
The same applies with the mul/mulq instructions, where the product is stored in [rdx][rax] - even the recent mulx instruction, while more flexible, still uses %rdx as a source register. (If memory serves)
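In C terms, these are the operations in question. This is a sketch assuming a 64-bit GCC or Clang target that provides unsigned __int128; the function names are mine. For the multiply, an x86-64 compiler will typically emit a single mul (or mulx) with the high half landing in rdx. For the division, the compiler may well call a library helper instead of emitting divq, since it cannot know that the quotient fits in 64 bits, so the second function only illustrates the arithmetic.

```c
#include <stdint.h>

/* Upper 64 bits of the full 128-bit product a * b (the half that the
 * x86-64 mul instruction leaves in rdx). */
uint64_t mul_high(uint64_t a, uint64_t b)
{
    unsigned __int128 product = (unsigned __int128)a * b;
    return (uint64_t)(product >> 64);
}

/* Divides the 128-bit value hi:lo by d, like divq with rdx = hi and
 * rax = lo. The caller must ensure d != 0 and hi < d so that the
 * quotient fits in 64 bits. */
uint64_t div_128_by_64(uint64_t hi, uint64_t lo, uint64_t d, uint64_t *rem)
{
    unsigned __int128 n = ((unsigned __int128)hi << 64) | lo;
    *rem = (uint64_t)(n % d);
    return (uint64_t)(n / d);
}
```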
More importantly, %rdx is used to pass parameters in the x86-64 ELF ABI. You could never call C library functions (or any other ELF library for that matter) - even kernel syscalls use %rdx to pass parameters - though the register use is not the same.
I'm not clear on your motivation - but the fact is, you won't be able to do anything practical on any x86[-64] platform (let alone an ELF/Linux platform) - at least in user-space.

In Linux x86_64 are syscalls and int 0x80 related?

I know that in Linux x64 "syscall" and "int 0x80" assembler instructions generate an interrupt in software asking the kernel to do some work. They have different opcodes (0F 05 vs CD 80) and the former is faster.
It's not clear to me if there is any relationship between them: are they really independent? (i.e.: does "syscall" call "int 0x80"?)
Thank you.
The syscall (x86-64) and sysenter (x86-32) instructions are newer and faster, and so are used when available; but the int 0x80 mechanism is preserved for compatibility with old binaries. Within the 32-bit ABI there is no semantic difference -- system call numbering is the same whether int 0x80 or sysenter is used to transfer control into the kernel, and I think the arguments are all in the same places as well. The native 64-bit syscall instruction, however, uses a different numbering and different argument registers than int 0x80, which always invokes the 32-bit ABI.
I have a dim recollection of there being a small number of system calls that could only be made using int 0x80 because of their unusual stack-related behavior (clone? execve? sigreturn?) but that might no longer be true.
int 0x80 is rumored to be obsolete (since slow). BTW, you really want to use vdso(7)
AFAIK, both instructions are going inside the kernel, and each has some (short) sequence of kernel processing which ultimately jump into the syscall table.
int 0x80 is the 32-bit interrupt used with 8086 - 80386 assembly. x86_64 uses syscall in its place. See the differences in /usr/include/asm/unistd_32.h and unistd_64.h for the different call numbers each expect to invoke the various kernel functions.
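The difference in numbering is easy to check from C. In this sketch the enum values are the exit and write numbers from the i386 and x86-64 tables, and the helper name is made up; on architectures other than x86 the helper does not check anything and just returns 1.

```c
#include <sys/syscall.h>

/* exit and write numbers from the i386 and x86-64 system call tables. */
enum {
    I386_NR_exit    = 1,
    I386_NR_write   = 4,
    X86_64_NR_exit  = 60,
    X86_64_NR_write = 1
};

/* Returns 1 if this build's SYS_* constants match the table for its own
 * ABI. Other architectures are not checked and also report 1. */
int native_numbers_match(void)
{
#if defined(__x86_64__) && !defined(__ILP32__)
    return SYS_exit == X86_64_NR_exit && SYS_write == X86_64_NR_write;
#elif defined(__i386__)
    return SYS_exit == I386_NR_exit && SYS_write == I386_NR_write;
#else
    return 1;
#endif
}
```

So number 1 means exit when issued through int 0x80 but write when issued through the 64-bit syscall instruction, which is why mixing the two tables goes wrong quickly.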

System calls : difference between sys_exit(), SYS_exit and exit()

What is the difference between SYS_exit, sys_exit() and exit()?
What I understand :
The linux kernel provides system calls, which are listed in man 2 syscalls.
There are wrapper functions of those syscalls provided by glibc which have mostly similar names as the syscalls.
My question : In man 2 syscalls, there is no mention of SYS_exit and sys_exit(), for example. What are they?
Note : The syscall exit here is only an example. My question really is : What are SYS_xxx and sys_xxx()?
I'll use exit() as in your example although this applies to all system calls.
The functions of the form sys_exit() are the actual entry points to the kernel routine that implements the function you think of as exit(). These symbols are not even available to user-mode programmers. That is, unless you are hacking the kernel, you cannot link to these functions because their symbols are not available outside the kernel. If I wrote libmsw.a which had a file scope function like
static int msw_func() {}
defined in it, you would have no success trying to link to it because it is not exported in the libmsw symbol table; that is:
cc your_program.c libmsw.a
would yield an error like:
ld: cannot resolve symbol msw_func
because it isn't exported; the same applies for sys_exit() as contained in the kernel.
In order for a user program to get to kernel routines, the syscall(2) interface needs to be used to effect a switch from user-mode to kernel mode. When that mode-switch (sometimes called a trap) occurs a small integer is used to look up the proper kernel routine in a kernel table that maps integers to kernel functions. An entry in the table has the form
{SYS_exit, sys_exit},
Where SYS_exit is a preprocessor macro which is
#define SYS_exit (1)
and has been 1 since before you were born because there hasn't been reason to change it. It also happens to be the first entry in the table of system calls which makes lookup a simple array index.
As you note in your question, the proper way for a regular user-mode program to access sys_exit is through the thin wrapper in glibc (or similar core library). The only reason you'd ever need to mess with SYS_exit or sys_exit is if you were writing kernel code.
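The number-to-function lookup described above can be mimicked in user space. Everything in this sketch is hypothetical (the MOCK_ numbers, the handlers, the table); it only shows the shape of the mechanism: a small integer indexes an array of function pointers, and out-of-range numbers are rejected with -ENOSYS.

```c
#include <errno.h>

/* Hypothetical numbers; real kernels generate theirs per architecture. */
enum { MOCK_NR_exit = 1, MOCK_NR_getpid = 2, MOCK_NR_MAX = 2 };

typedef long (*mock_syscall_fn)(long arg);

/* Placeholder for slots that have no implementation. */
static long mock_sys_ni(long arg)     { (void)arg; return -ENOSYS; }
/* Pretend handlers: one echoes its argument, one returns a fixed id. */
static long mock_sys_exit(long arg)   { return arg; }
static long mock_sys_getpid(long arg) { (void)arg; return 1234; }

/* The table: the "SYS_xxx" number is the index, "sys_xxx" is the entry. */
static const mock_syscall_fn mock_call_table[MOCK_NR_MAX + 1] = {
    [0]              = mock_sys_ni,
    [MOCK_NR_exit]   = mock_sys_exit,
    [MOCK_NR_getpid] = mock_sys_getpid,
};

/* Range-check the number, then call through the table. */
long mock_dispatch(unsigned long nr, long arg)
{
    if (nr > MOCK_NR_MAX)
        return -ENOSYS;
    return mock_call_table[nr](arg);
}
```

User code only ever sees the numbers; the handler functions are static here for the same reason the kernel's sys_xxx symbols are out of reach of user-space linkers.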
This is now addressed in man syscall itself,
Roughly speaking, the code belonging to the system call with number __NR_xxx defined in /usr/include/asm/unistd.h can be found in the Linux kernel source in the routine sys_xxx(). (The dispatch table for i386 can be found in /usr/src/linux/arch/i386/kernel/entry.S.) There are many exceptions, however, mostly because older system calls were superseded by newer ones, and this has been treated somewhat unsystematically. On platforms with proprietary operating-system emulation, such as parisc, sparc, sparc64, and alpha, there are many additional system calls; mips64 also contains a full set of 32-bit system calls.
At least now /usr/include/asm/unistd.h is a preprocessor hack that links to either,
/usr/include/asm/unistd_32.h
/usr/include/asm/unistd_x32.h
/usr/include/asm/unistd_64.h
The C function exit() is defined in stdlib.h. Think of this as a high level event driven interface that allows you to register a callback with atexit()
/* Call all functions registered with `atexit' and `on_exit',
in the reverse of the order in which they were registered,
perform stdio cleanup, and terminate program execution with STATUS. */
extern void exit (int __status) __THROW __attribute__ ((__noreturn__));
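The "reverse of the order in which they were registered" part can be observed with a small experiment. This is a sketch with made-up names: a child process registers two handlers and calls exit(), and the parent reads what the handlers wrote through a pipe.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1;   /* write end of the pipe, set in the child */

/* Writes one letter to the pipe; bails out of the child on failure. */
static void report(char c)
{
    if (write(report_fd, &c, 1) != 1)
        _exit(1);
}

static void handler_a(void) { report('A'); }
static void handler_b(void) { report('B'); }

/* Forks a child that registers handler_a, then handler_b, and calls
 * exit(). Copies the letters the handlers wrote into out (room for
 * 3 bytes) and returns 0 on success, -1 on failure. */
int exit_handler_order(char out[3])
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        report_fd = fds[1];
        atexit(handler_a);   /* registered first */
        atexit(handler_b);   /* registered second */
        exit(0);             /* library exit(): runs the handlers first */
    }

    close(fds[1]);
    size_t got = 0;
    while (got < 2) {
        ssize_t n = read(fds[0], out + got, 2 - got);
        if (n <= 0)
            break;
        got += (size_t)n;
    }
    out[got] = '\0';
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return (got == 2 && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The parent ends up with "BA": handler_b was registered last and runs first. Replacing exit(0) with _exit(0) in the child would skip the handlers entirely, which is the practical difference between the library function and the bare system call.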
So essentially the kernel provides an interface (C symbols) called __NR_xxx. Traditionally people want sys_exit() which is defined with a preprocessor macro SYS_exit. This macro creates the sys_exit() function. The exit() function is part of the standard C library stdlib.h and ported to other operating systems that lack the Linux Kernel ABI entirely (there may not be __NR_xxx functions) and potentially don't even have sys_* functions available either (you could write exit() to send the interrupt or use VDSO in Assembly).

Can we modify the int 0x80 routine?

How does linux 2.6 differ from 2.4?
Can we modify the source kernel?
Can we modify the int 0x80 service routine?
UPDATE:
1. the 0x80 handler is essentially the same between 2.4 and 2.6, although the function called from the handler is called by the 'syscall' instruction handler for x86-64 in 2.6.
2. the 0x80 handler can be modified like the rest of the kernel.
3. You won't break anything by modifying it, unless you remove backwards compatibility. E.g., you can add your own trace or backdoor if you feel so inclined. The other post that says you will break your libs and toolchain if you modify the handler is incorrect. If you break the dispatch algorithm, or modify the dispatch table incorrectly, then you will break things.
3a. As I originally posted, the best way to extend the 0x80 service is to extend the system call handler.
As the kernel source says:
What: The kernel syscall interface
Description:
This interface matches much of the POSIX interface and is based
on it and other Unix based interfaces. It will only be added to
over time, and not have things removed from it.
Note that this interface is different for every architecture
that Linux supports. Please see the architecture-specific
documentation for details on the syscall numbers that are to be
mapped to each syscall.
The system call table entries for i386 are in:
arch/i386/kernel/syscall_table.S
Note that the table is a sequence of pointers, so if you want to maintain a degree of forward compatibility with the kernel maintainers, you'd need to pad the table before placement of your pointer.
The syscall vector number is defined in irq_vectors.h
Then traps.c sets the address of the system_call function via set_system_gate, which places the entry into the interrupt descriptor table. The system_call function itself is in entry.S, and calls the requested pointer from the system call table.
There are a few housekeeping details, which you can see reading the code, but direct modification of the 0x80 interrupt handler is accomplished in entry.S inside the system_call function. In a more sane fashion, you can modify the system call table, inserting your own function without modifying the dispatch mechanism.
In fact, having read the 2.6 source, it says directly that int 0x80 and x86-64 syscall use the same code, so far. So you can make portable changes for x86-32 and x86-64.
END Update
The INT 0x80 method invokes the system call table handler. This matches register arguments to a call table, invoking kernel functions based on the contents of the EAX register. You can easily extend the system call table to add custom kernel API functions.
This may even work with the new syscall code on x86-64, as it uses the system call table, too.
If you alter the current system call table in any manner other than to extend it, you will break all dependent libraries and code, including libc, init, etc.
Here's the current Linux system call table: http://asm.sourceforge.net/syscall.html
It's an architectural overhaul. Everything has changed internally. SMP support is complete, the process scheduler is vastly improved, memory management got an overhaul, and many, many other things.
Yes. It's open-source software. If you do not have a copy of the source, you can get it from your vendor or from kernel.org.
Yes, but it's not advisable because it will break libc, it will break your baselayout, and it will break your toolchain if you change the sequence of existing syscalls, and nearly everything you might think you want to do should be done in userspace when at all possible.

Resources