Understanding Linux x86_64 Syscall Implementation in NASM

Understanding Linux x86_64 Syscall Implementation in NASM - linux

I'm using the linux kernel as the basis for my 'OS' (which is just a game). Obviously, I use system calls to interact with the kernel, e.g. input, sound, graphics, etc. As I don't have a linker on this system to use pre-defined wrapper in <sys/syscall.h>, I use this syscall wrapper for C in NASM asm:
global _syscall
_syscall:
mov rax, rdi
mov rdi, rsi
mov rsi, rdx
mov rdx, rcx
mov r10, r8
mov r8, r9
mov r9, [rsp + 8]
syscall
ret
I understand the registers used based on System V calling convention. However, I don't understand the last line mov r9, [rsp + 8]. I feel like it has something to do with return address, but I'm not sure.

The 7th arg is passed on the stack in x86-64 System V.
This is taking the first C arg and putting it in RAX, then copying the next 6 C args to the 6 arg-passing registers of the kernel's system-call calling convention. (Like functions, but with R10 instead of RCX).
The only reason the craptastic glibc syscall() function exists / is written that way is because there's no way to tell C compilers about a custom calling convention where an arg is also passed in RAX. That wrapper makes it look just like any other C function with a prototype.
It's fine for messing around with new system calls, but as you noted it's inefficient. If you wanted something better in C, use inline asm macros for your ISA, e.g. https://github.com/linux-on-ibm-z/linux-syscall-support/blob/master/linux_syscall_support.h. Inline asm is hard, and historically some syscall1 / syscall2 (per number of args) macros have been missing things like a "memory" clobber to tell the compiler that pointed-to memory could also be an input or output. That github project is safe and has code for various ISAs. (Some missed optimizations, like could use a dummy input operand instead of a full "memory" clobber... But that's irrelevant to asm)
Of course, you can do much better if you're writing in asm:
Just use the syscall instruction directly with args in the right registers (RDI, RSI, RDX, R10, R8, R9) instead of call _syscall with the function-calling convention. That's strictly worse than just inlining the syscall instruction: With syscall you know that registers are unmodified except for RAX (return value) and RCX/R11 (syscall itself uses them to save RIP and RFLAGS before kernel code runs.) And it would take just as much code to get args into registers for a function call as it would for syscall.
If you do want a wrapper function at all (e.g. to cmp rax, -4095 / jae handle_syscall_error afterwards and maybe set errno), use the same calling convention for it as the kernel expects, so the first instruction can be syscall, not all that stupid shuffling of args over by 1.
Functions in asm (that you only need to call from asm) can use whatever calling convention is convenient. It's a good idea to use a good standard one most of the time, but any "obviously special" function can certainly use a special convention.

Related

How to use ptrace in 64 bit process to modify registers in a 32-bit process and make it do a system-call?

I'm working on a program that needs to make a 32 bit process invoke a syscall.
I wish to keep my program architecture independent, but the target will always be 32 bit.
To set the registers I'm using ptrace with PTRACE_SETREGS, which takes regs struct pointer as its data argument.
x86_64 and x86 have different definitions for struct user_regs_struct so I've tried simply using the x86_64 on, which passes the syscall number correctly but none of the arguments, I have verified this by passing 1 (__NR_exit on x86) as the syscall number and 21 as the first argument, but the process only exits with 0.
I have also tried copying the x86 definition for struct user_regs_struct with that only causing segfaults.
Since the two definitions of the struct use completely different data types (unsigned long long int on x86_64 and long int on x86) I doubt its implicitly accessing the right data, but it doesn't appear to be using the same registers as x86_64 does for arguments (rdi, rsi, rdx, r10, r8, and r9).

As luck would have it, I couldn't figure this out for days, but as soon as you ask the question you realize what you should be doing.
You just have to use the x86_64 equivilant of the registers.
So, for example
eax, ebx, ecx, edx, esi, edi, edp
becomes
rax, rbx, rcx, rdx, rsi, rdi, rdp

NASM Syscall write with stack src dont work :/ [duplicate]

int 0x80 on Linux always invokes the 32-bit ABI, regardless of what mode it's called from: args in ebx, ecx, ... and syscall numbers from /usr/include/asm/unistd_32.h. (Or crashes on 64-bit kernels compiled without CONFIG_IA32_EMULATION).
64-bit code should use syscall, with call numbers from /usr/include/asm/unistd_64.h, and args in rdi, rsi, etc. See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64. If your question was marked a duplicate of this, see that link for details on how you should make system calls in 32 or 64-bit code. If you want to understand what exactly happened, keep reading.
(For an example of 32-bit vs. 64-bit sys_write, see Using interrupt 0x80 on 64-bit Linux)
syscall system calls are faster than int 0x80 system calls, so use native 64-bit syscall unless you're writing polyglot machine code that runs the same when executed as 32 or 64 bit. (sysenter always returns in 32-bit mode, so it's not useful from 64-bit userspace, although it is a valid x86-64 instruction.)
Related: The Definitive Guide to Linux System Calls (on x86) for how to make int 0x80 or sysenter 32-bit system calls, or syscall 64-bit system calls, or calling the vDSO for "virtual" system calls like gettimeofday. Plus background on what system calls are all about.
Using int 0x80 makes it possible to write something that will assemble in 32 or 64-bit mode, so it's handy for an exit_group() at the end of a microbenchmark or something.
Current PDFs of the official i386 and x86-64 System V psABI documents that standardize function and syscall calling conventions are linked from https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI.
See the x86 tag wiki for beginner guides, x86 manuals, official documentation, and performance optimization guides / resources.
But since people keep posting questions with code that uses int 0x80 in 64-bit code, or accidentally building 64-bit binaries from source written for 32-bit, I wonder what exactly does happen on current Linux?
Does int 0x80 save/restore all the 64-bit registers? Does it truncate any registers to 32-bit? What happens if you pass pointer args that have non-zero upper halves?
Does it work if you pass it 32-bit pointers?

TL:DR: int 0x80 works when used correctly, as long as any pointers fit in 32 bits (stack pointers don't fit). But beware that strace decodes it wrong unless you have a very recent strace + kernel.
int 0x80 zeros r8-r11 for reasons, and preserves everything else. Use it exactly like you would in 32-bit code, with the 32-bit call numbers. (Or better, don't use it!)
Not all systems even support int 0x80: The Windows Subsystem for Linux version 1 (WSL1) is strictly 64-bit only: int 0x80 doesn't work at all. It's also possible to build Linux kernels without IA-32 emulation either. (No support for 32-bit executables, no support for 32-bit system calls). See this re: making sure your WSL is actually WSL2 (which uses an actual Linux kernel in a VM.)
The details: what's saved/restored, which parts of which regs the kernel uses
int 0x80 uses eax (not the full rax) as the system-call number, dispatching to the same table of function-pointers that 32-bit user-space int 0x80 uses. (These pointers are to sys_whatever implementations or wrappers for the native 64-bit implementation inside the kernel. System calls are really function calls across the user/kernel boundary.)
Only the low 32 bits of arg registers are passed. The upper halves of rbx-rbp are preserved, but ignored by int 0x80 system calls. Note that passing a bad pointer to a system call doesn't result in SIGSEGV; instead the system call returns -EFAULT. If you don't check error return values (with a debugger or tracing tool), it will appear to silently fail.
All registers (except eax of course) are saved/restored (including RFLAGS, and the upper 32 of integer regs), except that r8-r11 are zeroed. r12-r15 are call-preserved in the x86-64 SysV ABI's function calling convention, so the registers that get zeroed by int 0x80 in 64-bit are the call-clobbered subset of the "new" registers that AMD64 added.
This behaviour has been preserved over some internal changes to how register-saving was implemented inside the kernel, and comments in the kernel mention that it's usable from 64-bit, so this ABI is probably stable. (I.e. you can count on r8-r11 being zeroed, and everything else being preserved.)
The return value is sign-extended to fill 64-bit rax. (Linux declares 32-bit sys_ functions as returning signed long.) This means that pointer return values (like from void *mmap()) need to be zero-extended before use in 64-bit addressing modes
Unlike sysenter, it preserves the original value of cs, so it returns to user-space in the same mode that it was called in. (Using sysenter results in the kernel setting cs to $__USER32_CS, which selects a descriptor for a 32-bit code segment.)
Older strace decodes int 0x80 incorrectly for 64-bit processes. It decodes as if the process had used syscall instead of int 0x80. This can be very confusing. e.g. strace prints write(0, NULL, 12 <unfinished ... exit status 1> for eax=1 / int $0x80, which is actually _exit(ebx), not write(rdi, rsi, rdx).
I don't know the exact version where the PTRACE_GET_SYSCALL_INFO feature was added, but Linux kernel 5.5 / strace 5.5 handle it. It misleadingly says the process "runs in 32-bit mode" but does decode correctly. (Example).
int 0x80 works as long as all arguments (including pointers) fit in the low 32 of a register. This is the case for static code and data in the default code model ("small") in the x86-64 SysV ABI. (Section 3.5.1
: all symbols are known to be located in the virtual addresses in the range 0x00000000 to 0x7effffff, so you can do stuff like mov edi, hello (AT&T mov $hello, %edi) to get a pointer into a register with a 5 byte instruction).
But this is not the case for position-independent executables, which many Linux distros now configure gcc to make by default (and they enable ASLR for executables). For example, I compiled a hello.c on Arch Linux, and set a breakpoint at the start of main. The string constant passed to puts was at 0x555555554724, so a 32-bit ABI write system call would not work. (GDB disables ASLR by default, so you always see the same address from run to run, if you run from within GDB.)
Linux puts the stack near the "gap" between the upper and lower ranges of canonical addresses, i.e. with the top of the stack at 2^48-1. (Or somewhere random, with ASLR enabled). So rsp on entry to _start in a typical statically-linked executable is something like 0x7fffffffe550, depending on size of env vars and args. Truncating this pointer to esp does not point to any valid memory, so system calls with pointer inputs will typically return -EFAULT if you try to pass a truncated stack pointer. (And your program will crash if you truncate rsp to esp and then do anything with the stack, e.g. if you built 32-bit asm source as a 64-bit executable.)
How it works in the kernel:
In the Linux source code, arch/x86/entry/entry_64_compat.S defines
ENTRY(entry_INT80_compat). Both 32 and 64-bit processes use the same entry point when they execute int 0x80.
entry_64.S is defines native entry points for a 64-bit kernel, which includes interrupt / fault handlers and syscall native system calls from long mode (aka 64-bit mode) processes.
entry_64_compat.S defines system-call entry-points from compat mode into a 64-bit kernel, plus the special case of int 0x80 in a 64-bit process. (sysenter in a 64-bit process may go to that entry point as well, but it pushes $__USER32_CS, so it will always return in 32-bit mode.) There's a 32-bit version of the syscall instruction, supported on AMD CPUs, and Linux supports it too for fast 32-bit system calls from 32-bit processes.
I guess a possible use-case for int 0x80 in 64-bit mode is if you wanted to use a custom code-segment descriptor that you installed with modify_ldt. int 0x80 pushes segment registers itself for use with iret, and Linux always returns from int 0x80 system calls via iret. The 64-bit syscall entry point sets pt_regs->cs and ->ss to constants, __USER_CS and __USER_DS. (It's normal that SS and DS use the same segment descriptors. Permission differences are done with paging, not segmentation.)
entry_32.S defines entry points into a 32-bit kernel, and is not involved at all.
The int 0x80 entry point in Linux 4.12's entry_64_compat.S:
/*
* 32-bit legacy system call entry.
*
* 32-bit x86 Linux system calls traditionally used the INT $0x80
* instruction. INT $0x80 lands here.
*
* This entry point can be used by 32-bit and 64-bit programs to perform
* 32-bit system calls. Instances of INT $0x80 can be found inline in
* various programs and libraries. It is also used by the vDSO's
* __kernel_vsyscall fallback for hardware that doesn't support a faster
* entry method. Restarted 32-bit system calls also fall back to INT
* $0x80 regardless of what instruction was originally used to do the
* system call.
*
* This is considered a slow path. It is not used by most libc
* implementations on modern hardware except during process startup.
...
*/
ENTRY(entry_INT80_compat)
... (see the github URL for the full source)
The code zero-extends eax into rax, then pushes all the registers onto the kernel stack to form a struct pt_regs. This is where it will restore from when the system call returns. It's in a standard layout for saved user-space registers (for any entry point), so ptrace from other process (like gdb or strace) will read and/or write that memory if they use ptrace while this process is inside a system call. (ptrace modification of registers is one thing that makes return paths complicated for the other entry points. See comments.)
But it pushes $0 instead of r8/r9/r10/r11. (sysenter and AMD syscall32 entry points store zeros for r8-r15.)
I think this zeroing of r8-r11 is to match historical behaviour. Before the Set up full pt_regs for all compat syscalls commit, the entry point only saved the C call-clobbered registers. It dispatched directly from asm with call *ia32_sys_call_table(, %rax, 8), and those functions follow the calling convention, so they preserve rbx, rbp, rsp, and r12-r15. Zeroing r8-r11 instead of leaving them undefined was to avoid info leaks from a 64-bit kernel to 32-bit user-space (which could far jmp to a 64-bit code segment to read anything the kernel left there).
The current implementation (Linux 4.12) dispatches 32-bit-ABI system calls from C, reloading the saved ebx, ecx, etc. from pt_regs. (64-bit native system calls dispatch directly from asm, with only a mov %r10, %rcx needed to account for the small difference in calling convention between functions and syscall. Unfortunately it can't always use sysret, because CPU bugs make it unsafe with non-canonical addresses. It does try to, so the fast-path is pretty damn fast, although syscall itself still takes tens of cycles.)
Anyway, in current Linux, 32-bit syscalls (including int 0x80 from 64-bit) eventually end up indo_syscall_32_irqs_on(struct pt_regs *regs). It dispatches to a function pointer ia32_sys_call_table, with 6 zero-extended args. This maybe avoids needing a wrapper around the 64-bit native syscall function in more cases to preserve that behaviour, so more of the ia32 table entries can be the native system call implementation directly.
Linux 4.12 arch/x86/entry/common.c
if (likely(nr < IA32_NR_syscalls)) {
/*
* It's possible that a 32-bit syscall implementation
* takes a 64-bit parameter but nonetheless assumes that
* the high bits are zero. Make sure we zero-extend all
* of the args.
*/
regs->ax = ia32_sys_call_table[nr](
(unsigned int)regs->bx, (unsigned int)regs->cx,
(unsigned int)regs->dx, (unsigned int)regs->si,
(unsigned int)regs->di, (unsigned int)regs->bp);
}
syscall_return_slowpath(regs);
In older versions of Linux that dispatch 32-bit system calls from asm (like 64-bit still did until 4.151), the int80 entry point itself puts args in the right registers with mov and xchg instructions, using 32-bit registers. It even uses mov %edx,%edx to zero-extend EDX into RDX (because arg3 happen to use the same register in both conventions). code here. This code is duplicated in the sysenter and syscall32 entry points.
Footnote 1: Linux 4.15 (I think) introduced Spectre / Meltdown mitigations, and a major revamp of the entry points that made them them a trampoline for the meltdown case. It also sanitized the incoming registers to avoid user-space values other than actual args being in registers during the call (when some Spectre gadget might run), by storing them, zeroing everything, then calling to a C wrapper that reloads just the right widths of args from the struct saved on entry.
I'm planning to leave this answer describing the much simpler mechanism because the conceptually useful part here is that the kernel side of a syscall involves using EAX or RAX as an index into a table of function pointers, with other incoming register values copied going to the places where the calling convention wants args to go. i.e. syscall is just a way to make a call into the kernel, to its dispatch code.
Simple example / test program:
I wrote a simple Hello World (in NASM syntax) which sets all registers to have non-zero upper halves, then makes two write() system calls with int 0x80, one with a pointer to a string in .rodata (succeeds), the second with a pointer to the stack (fails with -EFAULT).
Then it uses the native 64-bit syscall ABI to write() the chars from the stack (64-bit pointer), and again to exit.
So all of these examples are using the ABIs correctly, except for the 2nd int 0x80 which tries to pass a 64-bit pointer and has it truncated.
If you built it as a position-independent executable, the first one would fail too. (You'd have to use a RIP-relative lea instead of mov to get the address of hello: into a register.)
I used gdb, but use whatever debugger you prefer. Use one that highlights changed registers since the last single-step. gdbgui works well for debugging asm source, but is not great for disassembly. Still, it does have a register pane that works well for integer regs at least, and it worked great on this example.
See the inline ;;; comments describing how register are changed by system calls
global _start
_start:
mov rax, 0x123456789abcdef
mov rbx, rax
mov rcx, rax
mov rdx, rax
mov rsi, rax
mov rdi, rax
mov rbp, rax
mov r8, rax
mov r9, rax
mov r10, rax
mov r11, rax
mov r12, rax
mov r13, rax
mov r14, rax
mov r15, rax
;; 32-bit ABI
mov rax, 0xffffffff00000004 ; high garbage + __NR_write (unistd_32.h)
mov rbx, 0xffffffff00000001 ; high garbage + fd=1
mov rcx, 0xffffffff00000000 + .hello
mov rdx, 0xffffffff00000000 + .hellolen
;std
after_setup: ; set a breakpoint here
int 0x80 ; write(1, hello, hellolen); 32-bit ABI
;; succeeds, writing to stdout
;;; changes to registers: r8-r11 = 0. rax=14 = return value
; ebx still = 1 = STDOUT_FILENO
push 'bye' + (0xa<<(3*8))
mov rcx, rsp ; rcx = 64-bit pointer that won't work if truncated
mov edx, 4
mov eax, 4 ; __NR_write (unistd_32.h)
int 0x80 ; write(ebx=1, ecx=truncated pointer, edx=4); 32-bit
;; fails, nothing printed
;;; changes to registers: rax=-14 = -EFAULT (from /usr/include/asm-generic/errno-base.h)
mov r10, rax ; save return value as exit status
mov r8, r15
mov r9, r15
mov r11, r15 ; make these regs non-zero again
;; 64-bit ABI
mov eax, 1 ; __NR_write (unistd_64.h)
mov edi, 1
mov rsi, rsp
mov edx, 4
syscall ; write(edi=1, rsi='bye\n' on the stack, rdx=4); 64-bit
;; succeeds: writes to stdout and returns 4 in rax
;;; changes to registers: rax=4 = length return value
;;; rcx = 0x400112 = RIP. r11 = 0x302 = eflags with an extra bit set.
;;; (This is not a coincidence, it's how sysret works. But don't depend on it, since iret could leave something else)
mov edi, r10d
;xor edi,edi
mov eax, 60 ; __NR_exit (unistd_64.h)
syscall ; _exit(edi = first int 0x80 result); 64-bit
;; succeeds, exit status = low byte of first int 0x80 result = 14
section .rodata
_start.hello: db "Hello World!", 0xa, 0
_start.hellolen equ $ - _start.hello
Build it into a 64-bit static binary with
yasm -felf64 -Worphan-labels -gdwarf2 abi32-from-64.asm
ld -o abi32-from-64 abi32-from-64.o
Run gdb ./abi32-from-64. In gdb, run set disassembly-flavor intel and layout reg if you don't have that in your ~/.gdbinit already. (GAS .intel_syntax is like MASM, not NASM, but they're close enough that it's easy to read if you like NASM syntax.)
(gdb) set disassembly-flavor intel
(gdb) layout reg
(gdb) b after_setup
(gdb) r
(gdb) si # step instruction
press return to repeat the last command, keep stepping
Press control-L when gdb's TUI mode gets messed up. This happens easily, even when programs don't print to stdout themselves.

syscall WRITE wont write from stack? [duplicate]

TL:DR: int 0x80 works when used correctly, as long as any pointers fit in 32 bits (stack pointers don't fit). But beware that strace decodes it wrong unless you have a very recent strace + kernel.
int 0x80 zeros r8-r11 for reasons, and preserves everything else. Use it exactly like you would in 32-bit code, with the 32-bit call numbers. (Or better, don't use it!)
Not all systems even support int 0x80: The Windows Subsystem for Linux version 1 (WSL1) is strictly 64-bit only: int 0x80 doesn't work at all. It's also possible to build Linux kernels without IA-32 emulation either. (No support for 32-bit executables, no support for 32-bit system calls). See this re: making sure your WSL is actually WSL2 (which uses an actual Linux kernel in a VM.)
The details: what's saved/restored, which parts of which regs the kernel uses
int 0x80 uses eax (not the full rax) as the system-call number, dispatching to the same table of function-pointers that 32-bit user-space int 0x80 uses. (These pointers are to sys_whatever implementations or wrappers for the native 64-bit implementation inside the kernel. System calls are really function calls across the user/kernel boundary.)
Only the low 32 bits of arg registers are passed. The upper halves of rbx-rbp are preserved, but ignored by int 0x80 system calls. Note that passing a bad pointer to a system call doesn't result in SIGSEGV; instead the system call returns -EFAULT. If you don't check error return values (with a debugger or tracing tool), it will appear to silently fail.
All registers (except eax of course) are saved/restored (including RFLAGS, and the upper 32 of integer regs), except that r8-r11 are zeroed. r12-r15 are call-preserved in the x86-64 SysV ABI's function calling convention, so the registers that get zeroed by int 0x80 in 64-bit are the call-clobbered subset of the "new" registers that AMD64 added.
This behaviour has been preserved over some internal changes to how register-saving was implemented inside the kernel, and comments in the kernel mention that it's usable from 64-bit, so this ABI is probably stable. (I.e. you can count on r8-r11 being zeroed, and everything else being preserved.)
The return value is sign-extended to fill 64-bit rax. (Linux declares 32-bit sys_ functions as returning signed long.) This means that pointer return values (like from void *mmap()) need to be zero-extended before use in 64-bit addressing modes
Unlike sysenter, it preserves the original value of cs, so it returns to user-space in the same mode that it was called in. (Using sysenter results in the kernel setting cs to $__USER32_CS, which selects a descriptor for a 32-bit code segment.)
Older strace decodes int 0x80 incorrectly for 64-bit processes. It decodes as if the process had used syscall instead of int 0x80. This can be very confusing. e.g. strace prints write(0, NULL, 12 <unfinished ... exit status 1> for eax=1 / int $0x80, which is actually _exit(ebx), not write(rdi, rsi, rdx).
I don't know the exact version where the PTRACE_GET_SYSCALL_INFO feature was added, but Linux kernel 5.5 / strace 5.5 handle it. It misleadingly says the process "runs in 32-bit mode" but does decode correctly. (Example).
int 0x80 works as long as all arguments (including pointers) fit in the low 32 of a register. This is the case for static code and data in the default code model ("small") in the x86-64 SysV ABI. (Section 3.5.1
: all symbols are known to be located in the virtual addresses in the range 0x00000000 to 0x7effffff, so you can do stuff like mov edi, hello (AT&T mov $hello, %edi) to get a pointer into a register with a 5 byte instruction).
But this is not the case for position-independent executables, which many Linux distros now configure gcc to make by default (and they enable ASLR for executables). For example, I compiled a hello.c on Arch Linux, and set a breakpoint at the start of main. The string constant passed to puts was at 0x555555554724, so a 32-bit ABI write system call would not work. (GDB disables ASLR by default, so you always see the same address from run to run, if you run from within GDB.)
Linux puts the stack near the "gap" between the upper and lower ranges of canonical addresses, i.e. with the top of the stack at 2^48-1. (Or somewhere random, with ASLR enabled). So rsp on entry to _start in a typical statically-linked executable is something like 0x7fffffffe550, depending on size of env vars and args. Truncating this pointer to esp does not point to any valid memory, so system calls with pointer inputs will typically return -EFAULT if you try to pass a truncated stack pointer. (And your program will crash if you truncate rsp to esp and then do anything with the stack, e.g. if you built 32-bit asm source as a 64-bit executable.)
How it works in the kernel:
In the Linux source code, arch/x86/entry/entry_64_compat.S defines
ENTRY(entry_INT80_compat). Both 32 and 64-bit processes use the same entry point when they execute int 0x80.
entry_64.S is defines native entry points for a 64-bit kernel, which includes interrupt / fault handlers and syscall native system calls from long mode (aka 64-bit mode) processes.
entry_64_compat.S defines system-call entry-points from compat mode into a 64-bit kernel, plus the special case of int 0x80 in a 64-bit process. (sysenter in a 64-bit process may go to that entry point as well, but it pushes $__USER32_CS, so it will always return in 32-bit mode.) There's a 32-bit version of the syscall instruction, supported on AMD CPUs, and Linux supports it too for fast 32-bit system calls from 32-bit processes.
I guess a possible use-case for int 0x80 in 64-bit mode is if you wanted to use a custom code-segment descriptor that you installed with modify_ldt. int 0x80 pushes segment registers itself for use with iret, and Linux always returns from int 0x80 system calls via iret. The 64-bit syscall entry point sets pt_regs->cs and ->ss to constants, __USER_CS and __USER_DS. (It's normal that SS and DS use the same segment descriptors. Permission differences are done with paging, not segmentation.)
entry_32.S defines entry points into a 32-bit kernel, and is not involved at all.
The int 0x80 entry point in Linux 4.12's entry_64_compat.S:
/*
* 32-bit legacy system call entry.
*
* 32-bit x86 Linux system calls traditionally used the INT $0x80
* instruction. INT $0x80 lands here.
*
* This entry point can be used by 32-bit and 64-bit programs to perform
* 32-bit system calls. Instances of INT $0x80 can be found inline in
* various programs and libraries. It is also used by the vDSO's
* __kernel_vsyscall fallback for hardware that doesn't support a faster
* entry method. Restarted 32-bit system calls also fall back to INT
* $0x80 regardless of what instruction was originally used to do the
* system call.
*
* This is considered a slow path. It is not used by most libc
* implementations on modern hardware except during process startup.
...
*/
ENTRY(entry_INT80_compat)
... (see the github URL for the full source)
The code zero-extends eax into rax, then pushes all the registers onto the kernel stack to form a struct pt_regs. This is where it will restore from when the system call returns. It's in a standard layout for saved user-space registers (for any entry point), so ptrace from other process (like gdb or strace) will read and/or write that memory if they use ptrace while this process is inside a system call. (ptrace modification of registers is one thing that makes return paths complicated for the other entry points. See comments.)
But it pushes $0 instead of r8/r9/r10/r11. (sysenter and AMD syscall32 entry points store zeros for r8-r15.)
I think this zeroing of r8-r11 is to match historical behaviour. Before the Set up full pt_regs for all compat syscalls commit, the entry point only saved the C call-clobbered registers. It dispatched directly from asm with call *ia32_sys_call_table(, %rax, 8), and those functions follow the calling convention, so they preserve rbx, rbp, rsp, and r12-r15. Zeroing r8-r11 instead of leaving them undefined was to avoid info leaks from a 64-bit kernel to 32-bit user-space (which could far jmp to a 64-bit code segment to read anything the kernel left there).
The current implementation (Linux 4.12) dispatches 32-bit-ABI system calls from C, reloading the saved ebx, ecx, etc. from pt_regs. (64-bit native system calls dispatch directly from asm, with only a mov %r10, %rcx needed to account for the small difference in calling convention between functions and syscall. Unfortunately it can't always use sysret, because CPU bugs make it unsafe with non-canonical addresses. It does try to, so the fast-path is pretty damn fast, although syscall itself still takes tens of cycles.)
Anyway, in current Linux, 32-bit syscalls (including int 0x80 from 64-bit) eventually end up indo_syscall_32_irqs_on(struct pt_regs *regs). It dispatches to a function pointer ia32_sys_call_table, with 6 zero-extended args. This maybe avoids needing a wrapper around the 64-bit native syscall function in more cases to preserve that behaviour, so more of the ia32 table entries can be the native system call implementation directly.
Linux 4.12 arch/x86/entry/common.c
if (likely(nr < IA32_NR_syscalls)) {
/*
* It's possible that a 32-bit syscall implementation
* takes a 64-bit parameter but nonetheless assumes that
* the high bits are zero. Make sure we zero-extend all
* of the args.
*/
regs->ax = ia32_sys_call_table[nr](
(unsigned int)regs->bx, (unsigned int)regs->cx,
(unsigned int)regs->dx, (unsigned int)regs->si,
(unsigned int)regs->di, (unsigned int)regs->bp);
}
syscall_return_slowpath(regs);
In older versions of Linux that dispatch 32-bit system calls from asm (like 64-bit still did until 4.151), the int80 entry point itself puts args in the right registers with mov and xchg instructions, using 32-bit registers. It even uses mov %edx,%edx to zero-extend EDX into RDX (because arg3 happen to use the same register in both conventions). code here. This code is duplicated in the sysenter and syscall32 entry points.
Footnote 1: Linux 4.15 (I think) introduced Spectre / Meltdown mitigations, and a major revamp of the entry points that made them them a trampoline for the meltdown case. It also sanitized the incoming registers to avoid user-space values other than actual args being in registers during the call (when some Spectre gadget might run), by storing them, zeroing everything, then calling to a C wrapper that reloads just the right widths of args from the struct saved on entry.
I'm planning to leave this answer describing the much simpler mechanism because the conceptually useful part here is that the kernel side of a syscall involves using EAX or RAX as an index into a table of function pointers, with other incoming register values copied going to the places where the calling convention wants args to go. i.e. syscall is just a way to make a call into the kernel, to its dispatch code.
Simple example / test program:
I wrote a simple Hello World (in NASM syntax) which sets all registers to have non-zero upper halves, then makes two write() system calls with int 0x80, one with a pointer to a string in .rodata (succeeds), the second with a pointer to the stack (fails with -EFAULT).
Then it uses the native 64-bit syscall ABI to write() the chars from the stack (64-bit pointer), and again to exit.
So all of these examples are using the ABIs correctly, except for the 2nd int 0x80 which tries to pass a 64-bit pointer and has it truncated.
If you built it as a position-independent executable, the first one would fail too. (You'd have to use a RIP-relative lea instead of mov to get the address of hello: into a register.)
I used gdb, but use whatever debugger you prefer. Use one that highlights changed registers since the last single-step. gdbgui works well for debugging asm source, but is not great for disassembly. Still, it does have a register pane that works well for integer regs at least, and it worked great on this example.
See the inline ;;; comments describing how register are changed by system calls
global _start
_start:
mov rax, 0x123456789abcdef
mov rbx, rax
mov rcx, rax
mov rdx, rax
mov rsi, rax
mov rdi, rax
mov rbp, rax
mov r8, rax
mov r9, rax
mov r10, rax
mov r11, rax
mov r12, rax
mov r13, rax
mov r14, rax
mov r15, rax
;; 32-bit ABI
mov rax, 0xffffffff00000004 ; high garbage + __NR_write (unistd_32.h)
mov rbx, 0xffffffff00000001 ; high garbage + fd=1
mov rcx, 0xffffffff00000000 + .hello
mov rdx, 0xffffffff00000000 + .hellolen
;std
after_setup: ; set a breakpoint here
int 0x80 ; write(1, hello, hellolen); 32-bit ABI
;; succeeds, writing to stdout
;;; changes to registers: r8-r11 = 0. rax=14 = return value
; ebx still = 1 = STDOUT_FILENO
push 'bye' + (0xa<<(3*8))
mov rcx, rsp ; rcx = 64-bit pointer that won't work if truncated
mov edx, 4
mov eax, 4 ; __NR_write (unistd_32.h)
int 0x80 ; write(ebx=1, ecx=truncated pointer, edx=4); 32-bit
;; fails, nothing printed
;;; changes to registers: rax=-14 = -EFAULT (from /usr/include/asm-generic/errno-base.h)
mov r10, rax ; save return value as exit status
mov r8, r15
mov r9, r15
mov r11, r15 ; make these regs non-zero again
;; 64-bit ABI
mov eax, 1 ; __NR_write (unistd_64.h)
mov edi, 1
mov rsi, rsp
mov edx, 4
syscall ; write(edi=1, rsi='bye\n' on the stack, rdx=4); 64-bit
;; succeeds: writes to stdout and returns 4 in rax
;;; changes to registers: rax=4 = length return value
;;; rcx = 0x400112 = RIP. r11 = 0x302 = eflags with an extra bit set.
;;; (This is not a coincidence, it's how sysret works. But don't depend on it, since iret could leave something else)
mov edi, r10d
;xor edi,edi
mov eax, 60 ; __NR_exit (unistd_64.h)
syscall ; _exit(edi = first int 0x80 result); 64-bit
;; succeeds, exit status = low byte of first int 0x80 result = 14
section .rodata
_start.hello: db "Hello World!", 0xa, 0
_start.hellolen equ $ - _start.hello
Build it into a 64-bit static binary with
yasm -felf64 -Worphan-labels -gdwarf2 abi32-from-64.asm
ld -o abi32-from-64 abi32-from-64.o
Run gdb ./abi32-from-64. In gdb, run set disassembly-flavor intel and layout reg if you don't have that in your ~/.gdbinit already. (GAS .intel_syntax is like MASM, not NASM, but they're close enough that it's easy to read if you like NASM syntax.)
(gdb) set disassembly-flavor intel
(gdb) layout reg
(gdb) b after_setup
(gdb) r
(gdb) si # step instruction
press return to repeat the last command, keep stepping
Press control-L when gdb's TUI mode gets messed up. This happens easily, even when programs don't print to stdout themselves.

Using sys_write from an x86_64 function that takes only a char* argument?

I'm currently learning assembly and I would need some help with this simple program.
The say_something() function is created by x86_64 assembly (using NASM on 64-bit Linux) and is simply supposed to print out the phrase and the newline.
The main program is written in C and looks like this:
extern void say_something(char *);
int main(void) {
say_something("First line.");
say_something("Second sentence on another line.");
return 0;
}
... and the content of the *.asm file is this:
section .text
global say_something
say_something:
push rbp
mov rbp, rsp
mov rax, 4 ; syscall for write
mov rbx, 1
mov rcx, rdi
int 0x80
pop rbp
ret
But apparently my ASM function is buggy, because when I compile the program and run it, the output is just some junk, whereas it's supposed to print this:
First line.
Second sentence on another line.
What changes should I do on my assembly part to make it work correctly?

You have one major problem here, and two other major problems that might not be visible with this particular caller:
You forgot to set edx to the string length. The system call you're using is
sys_write(int fd, const char *buf, size_t len);, so the len you're passing is whatever garbage your caller left in edx.
If that doesn't make sys_write return -EFAULT without printing anything, then maybe you're printing a bunch of binary garbage before it detects a problem, if you ended up with a very large len? Hmm, write probably checks the whole range before writing anything, so maybe you ended up with a medium-sized value in edx and just printed some binary garbage after your strings. Your strings should be there at the start of your output, though.
Use strace ./my_program to find out, or GDB to look at edx before the call. (strace won't actually work right for int 0x80 from 64-bit code; it decodes as if you were using the native 64-bit syscall ABI).
You clobber the caller's rbx without saving/restoring it, violating the calling convention (the x86-64 System V ABI) where rbx is a call-preserved register. See http://agner.org/optimize/ for a nice doc about calling conventions. This may have no effect; your main probably doesn't compile to code that uses rbx for anything, so the rbx you're destroying is "owned" by the CRT start function that calls main.
Strangely, you save/restore rbp even though you don't use it at all.
Most importantly, you're using the 32-bit int 0x80 ABI from 64-bit code!! This is why I've been saying that write takes its length are in edx, not rdx, because the 32-bit int 0x80 ABI is exactly the same ABI that 32-bit user-space can use, where it only looks at the low 32 bits of registers.
Your code will break with pointers outside the low 32 bits of virtual address space. e.g. if you compiled with gcc -fPIE -pie, even those string literals in the .rodata section will be outside the low 32 bits, and sys_write will probably return -EFAULT without doing anything.
-pie is the default for gcc on many modern Linux distros, so you're "lucky"(?) that your program printed anything at all instead of just having write fail with -EFAULT. You aren't checking error return values, and it doesn't segfault your program the way it would if you tried to read/write from a bad point in user-space, so it fails silently.
As #fuz points out, Why cant i sys_write from a register? shows some example code for using the 64-bit syscall for sys_write, but it hard-codes a length. You're going to need to strlen your string, either with the libc function or with a loop. (Or a slow but compact repe scasb).

Linux 64-abi, calling convention

I'm reading intel manual about calling convention and which register has which purpose. Here is what was specified in the Figure 3.4: Register Usage:
%rax temporary register; with variable arguments
passes information about the number of vector
registers used; 1st return register
But in linux api we use rax to pass the function number. Is it consistent with what was specified in the intel manual? Actually I expected that (according to the manual) we will pass the function number into rdi (it is used for the 1st argument). And so forth...
Can I use rax to pass the first function argument in my hand-written functions? E.g.
mov rax, [array_lenght_ptr]
mov rdi, array_start_ptr
callq _array_sum

That quote is talking about the function-calling convention, which is standardized by x86-64 System V ABI doc.
You're thinking of Linux's system-call calling convention, which is described in an appendix to the ABI doc, but that part isn't normative. Anyway, the system call ABI puts the call number in rax because it's not an arg to the system call. Alternatively, you can think of it like a 0th arg, the same way that variadic function calls pass the number of FP register-args in al. (Fun fact: that makes it possible for the caller to pass even the first FP arg on the stack if they want to.)
But more importantly, because call number in RAX makes a better ABI, and because of tradition: it's what the i386 system-call ABI does, too. And the i386 System V function-call ABI is totally different, using stack args exclusively.
This means system-call wrapper functions can just set eax and run syscall instead of needing to do something like
libc_write_wrapper_for_your_imagined_syscall_convention:
; copy all args to the next slot over
mov r10, rdx ; size_t count
mov rdx, rsi ; void *buf
mov esi, edi ; int fd
mov edi, 1 ; SYS_write
syscall
cmp rax, -4095
jae set_errno
ret
instead of
actual_libc_write_wrapper: ; glibc's actual code I think also checks for pthread cancellation points or something...
mov eax, 1 ; SYS_write
syscall
cmp rax, -4095
jae set_errno
ret
Note the use of r10 instead of rcx because syscall clobbers rcx and r11 with the saved RIP and RFLAGS, so it doesn't have to write any memory with return information, and doesn't force user-space to put it somewhere the kernel can read it (like 32-bit sysenter does).
So the system-calling convention couldn't be the same as the function-calling convention. (Or the function-calling convention would have had to choose different registers.)
For system calls with 4 or more args (or a generic wrapper that works for any system call) you do need a mov r10, rcx, but that's all. (Unlike the 32-bit convention where a wrapper has to load args from the stack, and save/restore ebx because the kernel's poorly-chosen ABI uses it for the first arg.)
Can I use rax to pass the first function argument in my hand-written functions?
Yes, do whatever you want for private helper functions that you don't need to call from C.
Choose arg registers to make things easier for the callers (or for the most important caller), or you'll be using any registers with fixed register choices (like div).
Note which registers are clobbered and which are preserved with a comment. Only bother to save/restore registers that your caller actually needs saving / restoring, and choose which tmp regs you use to minimize push/pop. Avoid a push/pop save/reload of registers that are part of a critical latency path in your caller, if your function is short.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string