Is int 0x80 the only interrupt number used in Linux assembly programming?

Do we use only int 80h in assembly programming to request a service from the Linux kernel?
What are the other interrupt numbers used for on Linux?
I am transitioning from Windows to Linux.

int3 (the debug breakpoint) and int 80h (the old system call interface) are the two software interrupts commonly used on Linux. Hardware interrupts are used by device drivers, but those probably don't concern you.
That said, on 32-bit systems the kernel provides code mapped into each process that can be invoked to perform a system call, and it automatically uses the most appropriate mechanism (syscall, sysenter or int 80h). Since all 64-bit systems support the syscall instruction, that's the one normally used in long mode. Note that 64-bit system call numbers differ from the 32-bit ones.
Finally, you don't typically make system calls from assembly on Linux. You either use the C library or avoid system calls entirely, because they are slow and one of the main uses of assembly is speed. There are exceptions of course, such as security-related code or compiler/language development.
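As a rough illustration of going through the C library instead of hand-written assembly, the sketch below asks for the process ID with libc's generic syscall() function. The helper name pid_via_raw_syscall is made up for this example; syscall() and SYS_getpid come from <unistd.h> and <sys/syscall.h>.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE        /* exposes the syscall() declaration in glibc */
#endif
#include <sys/syscall.h>   /* SYS_getpid */
#include <sys/types.h>     /* pid_t */
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper: ask the kernel for our PID without the getpid()
 * wrapper. libc's syscall() uses whatever entry mechanism it was built
 * to use on this target, so no assembly is needed here. */
static pid_t pid_via_raw_syscall(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

The result should match what the ordinary getpid() wrapper returns, which is usually the simpler call to make.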

Related

How is hardware context switch used/unused in Linux?

The older x86 Intel architecture provided context switching (in the form of the TSS) at the hardware level. But I have read that Linux long ago "abandoned" the hardware context switching functionality because it was less optimised, less flexible, and not available on all architectures.
What confuses me is how software (Linux) can control hardware operations (saving/restoring context). Linux can choose not to use the context set up by hardware, but the hardware context switch would nevertheless happen (making the "optimisation" argument irrelevant).
Also, if Linux is not using the hardware context switch, how can the value of %eip (pointing to the next instruction in the user program) be saved and the kernel stack pointer restored by the kernel (and vice versa)?
I think the kernel would need some support from hardware to save the user program's %eip and switch %esp (from the user stack to the kernel stack) even before the interrupt service routine starts.
If this support is indeed provided by hardware, then how is Linux not using hardware context switches?
Terribly confused!!!

Explain Linux commit message that patches/secures POP SS followed by a #BP interrupt (INT3)

This is in reference to CVE-2018-8897 (which appears related to CVE-2018-1087), described as follows:
A statement in the System Programming Guide of the Intel 64 and IA-32 Architectures Software Developer's Manual (SDM) was mishandled in the development of some or all operating-system kernels, resulting in unexpected behavior for #DB exceptions that are deferred by MOV SS or POP SS, as demonstrated by (for example) privilege escalation in Windows, macOS, some Xen configurations, or FreeBSD, or a Linux kernel crash. The MOV to SS and POP SS instructions inhibit interrupts (including NMIs), data breakpoints, and single step trap exceptions until the instruction boundary following the next instruction (SDM Vol. 3A; section 6.8.3). (The inhibited data breakpoints are those on memory accessed by the MOV to SS or POP to SS instruction itself.) Note that debug exceptions are not inhibited by the interrupt enable (EFLAGS.IF) system flag (SDM Vol. 3A; section 2.3). If the instruction following the MOV to SS or POP to SS instruction is an instruction like SYSCALL, SYSENTER, INT 3, etc. that transfers control to the operating system at CPL < 3, the debug exception is delivered after the transfer to CPL < 3 is complete. OS kernels may not expect this order of events and may therefore experience unexpected behavior when it occurs.
When reading this related git commit to the Linux kernel, I noted that the commit message states:
x86/entry/64: Don't use IST entry for #BP stack
There's nothing IST-worthy about #BP/int3. We don't allow kprobes
in the small handful of places in the kernel that run at CPL0 with
an invalid stack, and 32-bit kernels have used normal interrupt
gates for #BP forever.
Furthermore, we don't allow kprobes in places that have usergs while
in kernel mode, so "paranoid" is also unnecessary.
In light of the vulnerability, I'm trying to understand the last sentence/paragraph in the commit message. I understand that an IST entry refers to one of the (allegedly) "known good" stack pointers in the Interrupt Stack Table that can be used to handle interrupts. I also understand that #BP refers to a breakpoint exception (equivalent to INT3), and that kprobes is the debugging mechanism that is claimed to only run in a few places in the kernel at ring 0 (CPL0) privilege level.
But I'm completely lost in the next part, which may be because "usergs" is a typo and I'm simply missing what was intended:
Furthermore, we don't allow kprobes in places that have usergs while
in kernel mode, so "paranoid" is also unnecessary.
What does this statement mean?
usergs refers to the x86-64 swapgs instruction, which exchanges the current GS base with a value saved in the IA32_KERNEL_GS_BASE MSR, so that the kernel can find the kernel stack from a syscall entry point. Only the cached GS base is swapped; nothing is reloaded from the GDT based on the gs selector value itself. (wrgsbase can change the GS base independently of the GDT/LDT.)
AMD's design is that syscall doesn't change RSP to point to the kernel stack, and doesn't read/write any memory, so syscall itself can be fast. But then you enter the kernel with all registers holding their user-space values. See Why does Windows64 use a different calling convention from all other OSes on x86-64? for some links to mailing list discussions between kernel devs and AMD architects in ~2000, tweaking the design of syscall and swapgs to make it usable before any AMD64 CPUs were sold.
Apparently keeping track of whether GS currently holds the kernel or the user value is tricky for error handling: there's no way to say "I want kernelgs now"; you have to know whether or not to run swapgs in any error-handling path. The only instruction is a swap; there is no instruction that sets GS to one value or the other.
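The bookkeeping problem can be sketched with a toy model in C. This is not kernel code: the struct and function names are invented for illustration, and the two fields only stand in for the active GS base and the saved value that swapgs exchanges it with.

```c
#include <stdint.h>

/* Toy model, not real hardware state: 'active' stands in for the current
 * GS base, 'saved' for the value that swapgs exchanges it with. */
struct gs_model {
    uint64_t active;
    uint64_t saved;
};

/* The only primitive available: exchange the two values. */
static void model_swapgs(struct gs_model *cpu)
{
    uint64_t tmp = cpu->active;
    cpu->active = cpu->saved;
    cpu->saved = tmp;
}

/* Hypothetical entry path that swaps unconditionally. It is only correct
 * if the user value is the active one when it runs. */
static void model_enter_kernel(struct gs_model *cpu)
{
    model_swapgs(cpu);
}
```

Running model_enter_kernel once from "user mode" makes the kernel value active; running it again while the kernel value is already active puts the user value back, which is the kind of mistake an error path has to rule out by knowing where it came from.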
Read comments in arch/x86/entry/entry_64.S e.g. https://github.com/torvalds/linux/blob/9fb71c2f230df44bdd237e9a4457849a3909017d/arch/x86/entry/entry_64.S#L1267 (from current Linux) which mentions usergs, and the next block of comments describes doing a swapgs before jumping to some error handling code with kernel gsbase.
IIRC, in the Linux kernel the kernel GS base points to a per-CPU data area, which the entry code uses to find the kernel stack pointer for the current thread (stored as an absolute address, not relative to gs).
I wouldn't be surprised if this bug basically tricks the kernel into loading the kernel rsp from a user-controlled gsbase, or otherwise screws up the dead-reckoning of swapgs so that it has the wrong gs at some point.

networking system call multiplexing on x86 but not on x64

I was reading an article on how networking related system calls are made on x86 and I saw that the calls were multiplexed through a single system call "socketcall". The reason for this additional level of hierarchy seems to be to conserve system call numbers.
Taking a quick look at x64, this does not seem to be the case. Why is this so? Each register in an x86 processor is 32 bits long and should not have trouble storing bigger values for system call numbers; so what is the reason for socketcall not being implemented on x64?
Pure speculation, but on architectures with a small number of registers, like x86, functions with more than a certain number of parameters cannot efficiently pass all of them directly in registers (for x86 the limit is about 6). For example, sendto and recvfrom take 6 arguments, plus 1 register for the syscall number. At that point it is more efficient to pass a pointer to an array of longs. As for the other calls, which have fewer parameters than the threshold, I am guessing it was a matter of convenience and code sharing between related functions.
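To make the multiplexing concrete, here is a hedged sketch of how a socket() request could be issued either way from C. It assumes sub-call number 1 for socket() (SYS_SOCKET in <linux/net.h>); the helper name open_socket is invented for the example. Targets whose headers define SYS_socketcall take the multiplexed path, the others call SYS_socket directly.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE        /* exposes the syscall() declaration in glibc */
#endif
#include <sys/socket.h>    /* AF_*, SOCK_* constants */
#include <sys/syscall.h>   /* SYS_socketcall or SYS_socket */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: create a socket without the libc socket() wrapper. */
static int open_socket(int domain, int type, int protocol)
{
#ifdef SYS_socketcall
    /* Multiplexed form: one syscall number, a sub-call selector, and a
     * pointer to the arguments packed into an array of longs. */
    unsigned long args[3] = {
        (unsigned long)domain, (unsigned long)type, (unsigned long)protocol
    };
    return (int)syscall(SYS_socketcall, 1 /* assumed value of SYS_SOCKET */, args);
#else
    /* Direct form: socket() has its own syscall number. */
    return (int)syscall(SYS_socket, domain, type, protocol);
#endif
}
```

For example, open_socket(AF_UNIX, SOCK_STREAM, 0) should return a file descriptor that can be closed with close().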

Determine if a linux port supports cmpxchg in hardware

I am writing a Linux kernel patch that uses cmpxchg to speed up a few cases, besides fixing a few semantic issues. However, I've noticed that certain architectures only support xchg and not cmpxchg. How do I determine at compile time whether the architecture the kernel is being compiled for supports cmpxchg in hardware?
How about #ifdef __HAVE_ARCH_CMPXCHG ?

What is better "int 0x80" or "syscall" in 32-bit code on Linux?

I am studying the Linux kernel and found that on the x86_64 architecture the int 0x80 interrupt doesn't work for making system calls (see footnote 1).
For the i386 architecture (32-bit x86 user-space), which is preferable: syscall or int 0x80, and why?
I use Linux kernel version 3.4.
Footnote 1: int 0x80 does work in some cases in 64-bit code, but is never recommended. What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?
syscall is the default way of entering kernel mode on x86-64. This instruction is not available in 32-bit modes of operation on Intel processors.
sysenter is the instruction most frequently used to invoke system calls in 32-bit modes of operation. It is similar to syscall, though a bit more difficult to use, but that is the kernel's concern.
int 0x80 is a legacy way to invoke a system call and should be avoided.
The preferred way to invoke a system call is through the vDSO, a region of memory mapped into each process's address space that lets system calls be made more efficiently (for example, by not entering kernel mode at all in some cases). The vDSO also takes care of handling the syscall or sysenter instructions, which is more involved than the legacy int 0x80 way.
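A small sketch of how a process can see where the vDSO was mapped, assuming a libc that provides getauxval() in <sys/auxv.h> (glibc and musl both do). The helper name vdso_base is made up for the example.

```c
#include <elf.h>        /* AT_SYSINFO_EHDR */
#include <sys/auxv.h>   /* getauxval() */

/* Hypothetical helper: returns the address at which the kernel mapped the
 * vDSO ELF image into this process, or 0 if the auxiliary vector has no
 * such entry. */
static unsigned long vdso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}
```

When the returned address is non-zero, the memory there begins with the usual ELF magic bytes, since the vDSO is an ordinary shared object image that libc resolves symbols from.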
In practice, recent kernels implement a vDSO, notably to optimize system calls dynamically (the kernel sets up the vDSO with the code best suited to the current processor). So you should use the vDSO, and for existing syscalls you had better use the interface provided by the libc.
Notice that, AFAIK, a significant part of the cost of simple syscalls is going from user space to the kernel and back. Hence, for some syscalls (probably gettimeofday, getpid ...) the vDSO might avoid even that (and technically might avoid doing a real syscall). For most syscalls (like open, read, send, mmap ...) the kernel cost of the syscall is large enough to make any improvement to the user-space to kernel-space transition (e.g. using the SYSENTER or SYSCALL machine instructions instead of INT) insignificant.
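The difference between the two paths can be sketched in C by reading the same clock twice: once through libc, which on many Linux targets is served from the vDSO, and once through syscall(), which always enters the kernel. The helper names are invented for the example, and the raw call assumes that struct timespec matches the layout the kernel expects for SYS_clock_gettime, which holds on typical 64-bit targets.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE        /* exposes the syscall() declaration in glibc */
#endif
#include <sys/syscall.h>   /* SYS_clock_gettime */
#include <time.h>          /* clock_gettime(), struct timespec */
#include <unistd.h>        /* syscall() */

/* Through libc: may be answered by the vDSO without entering the kernel. */
static int now_via_libc(struct timespec *ts)
{
    return clock_gettime(CLOCK_MONOTONIC, ts);
}

/* Through syscall(): always a real user/kernel transition. */
static int now_via_kernel(struct timespec *ts)
{
    return (int)syscall(SYS_clock_gettime, CLOCK_MONOTONIC, ts);
}
```

Both helpers return 0 on success and fill in the same kind of timestamp; timing a loop around each one is a simple way to see how much the transition itself costs on a given machine.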
Be aware of this before switching: system call numbers differ between int 0x80 and syscall, e.g. sys_write is 4 with int 0x80 and 1 with syscall.
http://docs.cs.up.ac.za/programming/asm/derick_tut/syscalls.html for 32-bit code using int 0x80
http://blog.rchapman.org/post/36801038863/linux-system-call-table-for-x86-64 for syscall
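As a hedged sketch of the 64-bit convention, the helper below issues write with the syscall instruction through GCC-style inline assembly, using number 1 as noted above. The name raw_write is made up for the example, and on targets other than x86-64 it simply falls back to libc's syscall() with SYS_write.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE        /* exposes the syscall() declaration in glibc */
#endif
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: write() without the libc wrapper. */
static long raw_write(int fd, const void *buf, unsigned long len)
{
#if defined(__x86_64__) && !defined(__ILP32__)
    long ret;
    /* x86-64 syscall convention: number in rax (write is 1), arguments in
     * rdi, rsi and rdx. The instruction itself clobbers rcx and r11. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(1L), "D"((long)fd), "S"(buf), "d"(len)
                      : "rcx", "r11", "memory");
    return ret;   /* on failure this is a negative errno value, not -1 */
#else
    /* Elsewhere, let libc supply the right number and entry mechanism. */
    return syscall(SYS_write, fd, buf, len);
#endif
}
```

A call such as raw_write(1, "hi\n", 3) should return 3 when standard output is writable. The same request through int 0x80 would need number 4 and the 32-bit register convention instead.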

Resources