Does eax always, and only, store the system call? - linux

[I'm confused about the CPU registers and I haven't found any truly clear and coherent explanation of them across the whole internet. If anyone has a link to something useful I'd really appreciate it if you'd post it in a comment or answer.]
The primary reason I'm here now is because I have been looking at sample NASM programs in a [thus far vain] attempt to learn the language. The program always ends by placing a system call code in eax and then calling int 0x80 (which I would love if someone could explain as well). However, from what I understand, eax is a 32-bit register - why do you need 32 bits to store system calls (I'm sure there aren't 2^32 worth)? Also, sometimes I see other values and strings moved into eax during the program itself. Does that mean eax only has a special use when you finally want to perform a system call but for the rest of the time you can do with it as you please?

All bits of eax are used because that's how the system call interface is implemented. It's true there aren't 2^32 system calls, not even 2^16. But that's how it is. It allows for easy extension of the set of system calls. You don't need to think hard about it, just accept it as a fact and live on.
eax is a general purpose register and you can do anything you please with it. The fact that it's used to contain the system call ID is just an established convention and nothing else. eax is not in any way forbidden for other uses.

The program always ends by placing a system call code in eax and then calling int 0x80 (which I would love if someone could explain as well).
This is because you're only looking at old 32-bit examples for Linux, and that is what the Linux developers felt like doing. There's no reason why they couldn't have used a different register, and not much reason they couldn't have used half a register (e.g. ax instead of eax, or bx or ..). In a similar way, there's no reason they couldn't have used a call gate or a different interrupt number. Of course once Linux developers made their decision ("kernel will expect function number in EAX and use int 0x80") everything that calls their kernel has to comply with their decision; and they can't easily change their decision without breaking all existing software (but can, and did, support alternatives - e.g. adding support for sysenter and syscall when those instructions got invented, while ensuring that int 0x80 still works the same).
However, from what I understand, eax is a 32 bit register - why do you need 32 bits to store system calls (I'm sure there aren't 2^32 worth)
They didn't "need" 32-bits; but you can expect that the function number will (after a "is the value too big" sanity check) end up being used inside a call [table+eax*4] instruction to call the selected function, and because that uses 32-bit addressing it needs to use a 32-bit register. Using half (or a quarter) of a register would've involved zero extension (e.g. an extra and eax,0x0000FFFF or movzx eax,ax instruction) to convert the 16-bit value into a 32-bit value. It's also typically faster to use all 32 bits for other reasons (e.g. a mov ax,123 that sets the lowest 16 bits of EAX and leaves the highest 16 bits unchanged will depend on the previous value of the highest 16 bits, and that can cause a "dependency stall" in the CPU if it needs to wait until the previous value of EAX is known).
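To make the "sanity check, then call [table+eax*4]" pattern concrete, here is a minimal user-space sketch in C. The table, the two handlers and the dispatch function are hypothetical names invented for illustration; this is not kernel code, only a model with the same shape.

```c
#include <errno.h>

/* Hypothetical handler type: one argument in, one result out. */
typedef long (*handler_fn)(long arg);

static long handler_double(long arg) { return arg * 2; }
static long handler_negate(long arg) { return -arg; }

/* Hypothetical call table: the index plays the role of the number in eax. */
static const handler_fn call_table[] = { handler_double, handler_negate };
#define CALL_TABLE_SIZE (sizeof call_table / sizeof call_table[0])

/* Range-check the number, then make an indirect call through the table. */
long dispatch(unsigned long nr, long arg)
{
    if (nr >= CALL_TABLE_SIZE)
        return -ENOSYS;          /* "is the value too big" check failed */
    return call_table[nr](arg);  /* the call [table + nr * pointer_size] step */
}
```

Because nr is a full-width unsigned value, a single comparison rejects every out-of-range number, and no zero extension is needed before indexing.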
Does that mean eax only has a special use when you finally want to perform a system call but for the rest of the time you can do with it as you please?
It means that when you call someone else's code, you have to comply with someone else's calling conventions, regardless of what they are. This can mean using other registers (ebx, ecx, etc) for whatever purpose they decided, and can mean using a specific stack layout (e.g. pushing things onto stack in a specific order).
Note that there are various instructions that do expect specific registers to be used in a specific way - mul, div, stosd, movsd, loop, in, out, enter, leave, etc; and there are "rare special cases" for every general purpose register. Despite this; they are still "general purpose registers" because they are not "specific purpose registers" (like eip or flags, which can only be used for one specific purpose and can never be used for anything else).

eax is a general purpose register, you can put whatever you want in it. int 0x80 is the interrupt for a system call... that interrupt looks at the value in eax and calls that system routine.
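If it helps to see the "number goes in, kernel looks it up" idea from C rather than NASM, here is a small sketch assuming Linux with glibc. The syscall(2) wrapper takes the system call number as its first argument and places it in whatever register the platform's convention requires (eax for the 32-bit int 0x80 interface). The function name raw_getpid is made up for this example.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for our PID by system call number, bypassing getpid(). */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

The result should match what the ordinary getpid() wrapper reports, since both end up at the same kernel function.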

Related

Fastest Linux system call

On an x86-64 Intel system that supports syscall and sysret what's the "fastest" system call from 64-bit user code on a vanilla kernel?
In particular, it must be a system call that exercises the syscall/sysret user <-> kernel transition [1], but does the least amount of work beyond that. It doesn't even need to do the syscall itself: some type of early error which never dispatches to the specific call on the kernel side is fine, as long as it doesn't go down some slow path because of that.
Such a call could be used to estimate the raw syscall and sysret overhead independent of any work done by the call.
[1] In particular, this excludes things that appear to be system calls but are implemented in the VDSO (e.g., clock_gettime) or are cached by the runtime (e.g., getpid).
One that doesn't exist, and therefore returns -ENOSYS quickly.
From arch/x86/entry/entry_64.S:
#if __SYSCALL_MASK == ~0
    cmpq $__NR_syscall_max, %rax
#else
    andl $__SYSCALL_MASK, %eax
    cmpl $__NR_syscall_max, %eax
#endif
    ja 1f /* return -ENOSYS (already in pt_regs->ax) */
    movq %r10, %rcx
    /*
     * This call instruction is handled specially in stub_ptregs_64.
     * It might end up jumping to the slow path. If it jumps, RAX
     * and all argument registers are clobbered.
     */
#ifdef CONFIG_RETPOLINE
    movq sys_call_table(, %rax, 8), %rax
    call __x86_indirect_thunk_rax
#else
    call *sys_call_table(, %rax, 8)
#endif
.Lentry_SYSCALL_64_after_fastpath_call:
    movq %rax, RAX(%rsp)
1:
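From user space, triggering that early -ENOSYS return might look like the following sketch (Linux with glibc assumed). BOGUS_SYSCALL_NR is an assumption: it is simply a number chosen to be far above any real system call number, and bogus_syscall_errno is a made-up helper name. Note that a sandbox or seccomp filter could intercept the call and report a different error.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>

/* Assumption: no Linux architecture defines a system call with this number. */
#define BOGUS_SYSCALL_NR 100000L

/* Issue a non-existent system call and return the errno it produced. */
int bogus_syscall_errno(void)
{
    errno = 0;
    long ret = syscall(BOGUS_SYSCALL_NR);
    /* glibc turns the kernel's negative return value into -1 plus errno. */
    return (ret == -1) ? errno : 0;
}
```

On an unfiltered kernel the helper would be expected to return ENOSYS, which is the value the range check above leaves in pt_regs->ax.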
Use an invalid system call number so the dispatching code simply returns with eax = -ENOSYS instead of dispatching to a system-call handling function at all.
Unless this causes the kernel to use the iret slow path instead of sysret / sysexit. That might explain the measurements showing an invalid number being 17 cycles slower than syscall(SYS_getpid), because glibc error handling (setting errno) probably doesn't explain it. But from my reading of the kernel source, I don't see any reason why it wouldn't still use sysret while returning -ENOSYS.
This answer is for sysenter, not syscall. The question originally said sysenter / sysret (which was weird because sysexit goes with sysenter, while sysret goes with syscall). I answered based on sysenter for a 32-bit process on an x86-64 kernel.
Native 64-bit syscall is handled more efficiently inside the kernel. (Update; with Meltdown / Spectre mitigation patches, it still dispatches via C do_syscall_64 in 4.16-rc2).
My Q&A "What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?" gives an overview of the kernel side of system-call entry points from compat mode into an x86-64 kernel (entry_64_compat.S). This answer is just taking the relevant parts of that.
The links in that answer and this one are to Linux 4.12 sources, which don't contain the Meltdown mitigation page-table manipulation, so that will be significant extra overhead.
int 0x80 and sysenter have different entry points. You're looking for entry_SYSENTER_compat. AFAIK, sysenter always goes there, even if you execute it in a 64-bit user-space process. Linux's entry point pushes a constant __USER32_CS as the saved CS value, so it will always return to user-space in 32-bit mode.
After pushing registers to construct a struct pt_regs on the kernel stack, there's a TRACE_IRQS_OFF hook (no idea how many instructions that amounts to), then call do_fast_syscall_32 which is written in C. (Native 64-bit syscall dispatching is done directly from asm, but 32-bit compat system calls are always dispatched through C).
do_syscall_32_irqs_on in arch/x86/entry/common.c is pretty light-weight: just a check if the process is being traced (I think this is how strace can hook system calls via ptrace), then
...
if (likely(nr < IA32_NR_syscalls)) {
regs->ax = ia32_sys_call_table[nr]( ... arg );
}
syscall_return_slowpath(regs);
}
AFAIK, the kernel can use sysexit after this function returns.
So the return path is the same whether or not EAX had a valid system call number, and obviously returning without dispatching at all is the fastest path through that function, especially in a kernel with Spectre mitigation where the indirect branch on the table of function pointers would go through a retpoline and always mispredict.
If you want to really test sysenter/sysexit without all that extra overhead, you'll need to modify Linux to put a much simpler entry point without checking for tracing or pushing / popping all the registers.
You'd probably also want to modify the ABI to pass a return address in a register (like syscall does on its own) instead of saved on the user-space stack which Linux's current sysenter ABI does; it has to get_user() to read the EIP value it should return to.
Or if all this overhead is part of what you want to measure, you're definitely all set with an eax that gives you -ENOSYS; at worst you'll be getting one extra branch miss from the range-check if branch predictors are hot for that branch based on normal 32-bit system calls.
In this benchmark by Brendan Gregg (linked from this blog post which is interesting reading on the topic) close(999) (or some other fd not in use) is recommended.
Some system calls don't even go thru any user->kernel transition, read vdso(7).
I suspect that these VDSO system calls (e.g. time(2), ...) are the fastest. You could claim that there are no "real" system calls.
BTW, you could add a dummy system call to your kernel (e.g. some system call always returning 0, or a hello world system call, see also this) and measure it.
I suspect (without having benchmarked it) that getpid(2) should be a very fast system call, because the only thing it needs to do is fetch some data from the kernel memory. And AFAIK, it is a genuine system call, not using VDSO techniques. And you could use syscall(2) to avoid its caching done by your libc and forcing the genuine system call.
I maintain my position (given in a comment to your initial question): without actual motivation your question does not make any concrete sense. Then I still do think that syscall(2) doing getpid is measuring the typical overhead to make a system call (and I guess you really care about that one). In practice almost all system calls are doing more work than such a getpid (or getppid).
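For anyone who wants to try the measurement, here is a rough timing sketch, assuming Linux with glibc. It calls syscall(SYS_getpid) in a loop so that libc caching cannot short-circuit the call, and reports the average wall-clock cost. The helper names now_ns and avg_getpid_ns are invented for this example, and the figure includes loop and wrapper overhead, so treat it as an upper bound rather than a precise syscall/sysret cost.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Current monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average nanoseconds per raw getpid system call over `iterations` calls. */
double avg_getpid_ns(unsigned long iterations)
{
    uint64_t start = now_ns();
    for (unsigned long i = 0; i < iterations; i++)
        syscall(SYS_getpid);   /* forces a real kernel entry each time */
    uint64_t end = now_ns();
    return (double)(end - start) / (double)iterations;
}
```

Swapping the loop body for close(999), or for an invalid system call number, gives the other candidates discussed in the answers above with a one-line change.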

What is the use of EAX register in the context of system calls in Linux?

In the text book Linux Kernel Development by Robert Love, it is mentioned that (pg no. 101):
The return value is sent to user-space also via register. On x86, it
is written into the eax register.
And in the text book The Linux Programming Interface by Michael Kerrisk, it is mentioned that (pg no. 88):
Since all system calls enter the kernel in the same way, the kernel
needs some method of identifying the system call. To permit this, the
wrapper function copies the system call number into a specific CPU
register (%eax).
Then, what conclusion can I draw upon the utility of EAX register in system calls?
When the kernel comes across a system call, it copies the system call number to the EAX register, and the value in the register is then replaced by the return value of the system call at the time of return from the system call. Is this conclusion correct?
X86 Assembly/Interfacing with Linux
Making a syscall
For making a syscall using an interrupt, you have to pass all required
information to the kernel by copying them into general purpose
registers.
Each syscall has a fixed number (note: the numbers differ between int
$0x80 and syscall!). You specify the syscall by writing the number
into the eax/rax register.
Most syscalls take parameters to perform their task. Those parameters
are passed by writing them in the appropriate registers before making
the actual call. Each parameter index has a specific register. See the
tables in the subsections as the mapping differs between int $0x80
and syscall. Parameters are passed in the order they appear in the
function signature of the corresponding C wrapper function. You may
find syscall functions and their signatures in every Linux API
documentation, like the reference manual (type man 2 open to see the
signature of the open syscall).
After everything is set up correctly, you call the interrupt using
int $0x80 or syscall and the kernel performs the task.
The return / error value of a syscall is written to eax/rax.
The kernel uses its own stack to perform the actions. The user stack
is not touched in any way.
So to sum up:
In user space:
prepare the syscall by writing the parameters into specified registers
put the syscall number into eax
call the interrupt by int $0x80 or syscall
In kernel space:
the kernel reads the syscall number from eax
reads the parameters from the specific registers
performs the task (on its own stack)
writes the result in eax
returns control to user space
In user space again:
you can find the result of the interrupt in the eax register
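The round trip summarised above can be sketched in C, assuming Linux with glibc. The syscall(2) wrapper loads the number and parameters into the registers the platform expects and reads the result back from eax/rax. The kernel reports errors as small negative values in that register, which glibc converts to -1 plus errno; the hypothetical helper raw_write below converts back so the caller sees the raw convention.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Invoke write by number with its three parameters.
 * Returns the byte count on success, or -errno on failure,
 * mirroring what the kernel leaves in eax/rax.
 */
long raw_write(int fd, const void *buf, size_t len)
{
    long ret = syscall(SYS_write, fd, buf, len);
    return (ret == -1) ? -(long)errno : ret;
}
```

For example, writing five bytes to a valid descriptor should return 5, while writing to descriptor -1 should return -EBADF.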

How to inform GCC to not use a particular register

Assume I have a very big source code and intend to make the rdx register totally unused during the execution, i.e., while generating the assembly code, all I want is to inform my compiler (GCC) that it should not use rdx at all.
NOTE: register rdx is just an example. I am OK with any available Intel x86 register.
I am even happy to update the source code of the compiler and use my custom GCC. But which changes to the source code are needed?
You tell GCC not to allocate a register via the -ffixed-reg option (gcc docs).
-ffixed-reg
Treat the register named reg as a fixed register; generated code should never refer to it (except perhaps as a stack pointer, frame pointer or in some other fixed role).
reg must be the name of a register. The register names accepted are machine-specific and are defined in the REGISTER_NAMES macro in the machine description macro file.
For example, gcc -ffixed-r13 will make gcc leave it alone entirely. Using registers that are part of the calling convention, or required for certain instructions, may be problematic.
You can put some global variable to this register.
For ARM CPU you can do it this way:
register volatile type *global_ptr asm ("r8");
This instruction uses general purpose register "r8" to hold
the value of global_ptr pointer.
See the source in U-Boot for real-life example:
http://git.denx.de/?p=u-boot.git;a=blob;f=arch/arm/include/asm/global_data.h;h=4e3ea55e290a19c766017b59241615f7723531d5;hb=HEAD#l83
File arch/arm/include/asm/global_data.h (line ~83).
#define DECLARE_GLOBAL_DATA_PTR register volatile gd_t *gd asm ("r8")
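The same trick can be tried on x86-64. The sketch below assumes GCC targeting x86-64 and pins a global variable to r13; on any other compiler or architecture it falls back to an ordinary global so it still builds. The names pinned and count_in_pinned_register are made up for this example. Since r13 is callee-saved in the x86-64 ABI, the function saves and restores the previous value so the caller is not disturbed.

```c
/* Assumption: GCC on x86-64. Elsewhere, fall back to a plain global. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__)
register unsigned long pinned __asm__("r13");
#else
static unsigned long pinned;
#endif

/* Count up to n using the pinned variable as the accumulator. */
unsigned long count_in_pinned_register(unsigned long n)
{
    unsigned long saved = pinned;   /* preserve the caller's r13 */
    pinned = 0;
    for (unsigned long i = 0; i < n; i++)
        pinned += 1;
    unsigned long result = pinned;
    pinned = saved;                 /* restore before returning */
    return result;
}
```

This only keeps the compiler from allocating the register in this translation unit; code in libraries built without the declaration (or without -ffixed-r13) may still use it.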
I don't know whether there is a simple mechanism to tell that to gcc at run time. I would assume that you must recompile. From what I read I understand that there are description files for the different CPUs, e.g. this file, but what exactly needs to be changed in order to prevent gcc from using the register, and what potential side effects such a change could have, is beyond me.
I would ask on the gcc mailing list for assistance. Chances are that the modification is not so difficult per se, except that building gcc isn't trivial in my experience. In your case, if I analyze the situation correctly, a caveat applies. You are essentially cross-compiling, i.e. building for a different architecture. In particular I understand that you have to build your system and other libraries which your program uses because their code would normally use that register. If you intend to link dynamically you probably would also have to build your own ld.so (the dynamic loader) because starting a dynamically linked executable actually starts that loader which would use that register. (Therefore maybe linking statically is better.)
Consider the divq instruction - the dividend is represented by [rdx][rax], and, assuming the divisor (D) satisfies rdx < D, the quotient is stored in %rax and remainder in %rdx. There are no alternative registers that can be used here.
The same applies with the mul/mulq instructions, where the product is stored in [rdx][rax] - even the recent mulx instruction, while more flexible, still uses %rdx as a source register. (If memory serves)
More importantly, %rdx is used to pass parameters in the x86-64 ELF ABI. You could never call C library functions (or any other ELF library for that matter) - even kernel syscalls use %rdx to pass parameters - though the register use is not the same.
I'm not clear on your motivation - but the fact is, you won't be able to do anything practical on any x86[-64] platform (let alone an ELF/Linux platform) - at least in user-space.
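To see how easily rdx gets pulled in, consider a plain 64x64 to 128-bit multiply written in C. This sketch assumes a 64-bit target where GCC or Clang provide the unsigned __int128 extension; mul_high is a made-up name. On x86-64 these compilers typically implement it with a single mul, which leaves the high half of the product in rdx and the low half in rax.

```c
#include <stdint.h>

/* Return the high 64 bits of the 128-bit product a * b. */
uint64_t mul_high(uint64_t a, uint64_t b)
{
    unsigned __int128 product = (unsigned __int128)a * b;
    return (uint64_t)(product >> 64);   /* the half that mul leaves in rdx */
}
```

Nothing in the source mentions a register, yet the generated code would normally depend on rdx, which is why reserving it is impractical.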

How to interpret segment register accesses on x86-64?

With this function:
mov 1069833(%rip),%rax # 0x2b5c1bf9ef90 <_fini+3250648>
add %fs:0x0,%rax
retq
How do I interpret the second instruction and find out what was added to RAX?
This code:
mov 1069833(%rip),%rax # 0x2b5c1bf9ef90 <_fini+3250648>
add %fs:0x0,%rax
retq
is returning the address of a thread-local variable. %fs:0x0 is the address of the TCB (Thread Control Block), and 1069833(%rip) is the offset from there to the variable, which is known since the variable resides either in the program or on some dynamic library loaded at program's load time (libraries loaded at runtime via dlopen() need some different code).
This is explained in great detail in Ulrich Drepper's TLS document, especially §4.3 and §4.3.6.
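A C function of roughly the following form is the kind of source that tends to compile to the mov / add %fs:0x0 / retq sequence above when built as part of a shared library on x86-64 Linux. This is a sketch assuming a C11 compiler; counter, counter_address and bump_counter are hypothetical names, and the exact instructions depend on the TLS model the compiler selects.

```c
/* One independent copy of this variable exists per thread. */
static _Thread_local int counter;

/* Return the address of the calling thread's copy (thread base + offset). */
int *counter_address(void)
{
    return &counter;
}

/* Increment the calling thread's copy and return the new value. */
int bump_counter(void)
{
    return ++counter;
}
```

Each thread calling counter_address gets a different pointer, because the offset is added to that thread's own base taken from %fs:0x0.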
I'm not sure they've been called segment register since the bad old days of segmented architecture. I believe the proper term is a selector (but I could be wrong).
However, I think you just need to look at the first quadword (64 bits) in the fs area.
The %fs:0x0 bit means the contents of the memory at fs:0. Since you've used the generic add (rather than addl for example), I think it will take the data width from the target %rax.
In terms of getting the actual value, it depends on whether you're in legacy or long mode.
In legacy mode, you'll have to get the fs value and look it up in the GDT (or possibly LDT) in order to get the base address.
In long mode, you'll need to look at the relevant model specific registers. If you're at this point, you've moved beyond my level of expertise unfortunately.

Can we modify the int 0x80 routine?

How does linux 2.6 differ from 2.4?
Can we modify the source kernel?
Can we modify the int 0x80 service routine?
UPDATE:
1. the 0x80 handler is essentially the same between 2.4 and 2.6, although the function called from the handler is called by the 'syscall' instruction handler for x86-64 in 2.6.
2. the 0x80 handler can be modified like the rest of the kernel.
3. You won't break anything by modifying it, unless you remove backwards compatibility. E.g., you can add your own trace or backdoor if you feel so inclined. The other post that says you will break your libs and toolchain if you modify the handler is incorrect. If you break the dispatch algorithm, or modify the dispatch table incorrectly, then you will break things.
3a. As I originally posted, the best way to extend the 0x80 service is to extend the system call handler.
As the kernel source says:
What: The kernel syscall interface
Description:
This interface matches much of the POSIX interface and is based
on it and other Unix based interfaces. It will only be added to
over time, and not have things removed from it.
Note that this interface is different for every architecture
that Linux supports. Please see the architecture-specific
documentation for details on the syscall numbers that are to be
mapped to each syscall.
The system call table entries for i386 are in:
arch/i386/kernel/syscall_table.S
Note that the table is a sequence of pointers, so if you want to maintain a degree of forward compatibility with the kernel maintainers, you'd need to pad the table before placement of your pointer.
The syscall vector number is defined in irq_vectors.h
Then traps.c sets the address of the system_call function via set_system_gate, which places the entry into the interrupt descriptor table. The system_call function itself is in entry.S, and calls the requested pointer from the system call table.
There are a few housekeeping details, which you can see reading the code, but direct modification of the 0x80 interrupt handler is accomplished in entry.S inside the system_call function. In a more sane fashion, you can modify the system call table, inserting your own function without modifying the dispatch mechanism.
In fact, having read the 2.6 source, it says directly that int 0x80 and x86-64 syscall use the same code, so far. So you can make portable changes for x86-32 and x86-64.
END Update
The INT 0x80 method invokes the system call table handler. This matches register arguments to a call table, invoking kernel functions based on the contents of the EAX register. You can easily extend the system call table to add custom kernel API functions.
This may even work with the new syscall code on x86-64, as it uses the system call table, too.
If you alter the current system call table in any manner other than to extend it, you will break all dependent libraries and code, including libc, init, etc.
Here's the current Linux system call table: http://asm.sourceforge.net/syscall.html
It's an architectural overhaul. Everything has changed internally. SMP support is complete, the process scheduler is vastly improved, memory management got an overhaul, and many, many other things.
Yes. It's open-source software. If you do not have a copy of the source, you can get it from your vendor or from kernel.org.
Yes, but it's not advisable because it will break libc, it will break your baselayout, and it will break your toolchain if you change the sequence of existing syscalls, and nearly everything you might think you want to do should be done in userspace when at all possible.
