How does linux 2.6 differ from 2.4?
Can we modify the source kernel?
Can we modify the int 0x80 service routine?
UPDATE:
1. the 0x80 handler is essentially the same between 2.4 and 2.6, although the function called from the handler is called by the 'syscall' instruction handler for x86-64 in 2.6.
2. the 0x80 handler can be modified like the rest of the kernel.
3. You won't break anything by modifying it, unless you remove backwards compatibility. E.g., you can add your own trace or backdoor if you feel so inclined. The other post that says you will break your libs and toolchain if you modify the handler is incorrect. If you break the dispatch algorithm, or modify the dispatch table incorrectly, then you will break things.
3a. As I originally posted, the best way to extend the 0x80 service is to extend the system call handler.
As the kernel source says:
What: The kernel syscall interface
Description:
This interface matches much of the POSIX interface and is based
on it and other Unix based interfaces. It will only be added to
over time, and not have things removed from it.
Note that this interface is different for every architecture
that Linux supports. Please see the architecture-specific
documentation for details on the syscall numbers that are to be
mapped to each syscall.
The system call table entries for i386 are in:
arch/i386/kernel/syscall_table.S
Note that the table is a sequence of pointers, so if you want to maintain a degree of forward compatibility with the kernel maintainers, you'd need to pad the table before placement of your pointer.
The syscall vector number is defined in irq_vectors.h
Then traps.c sets the address of the system_call function via set_system_gate, which places the entry into the interrupt descriptor table. The system_call function itself is in entry.S, and calls the requested pointer from the system call table.
There are a few housekeeping details, which you can see reading the code, but direct modification of the 0x80 interrupt handler is accomplished in entry.S inside the system_call function. In a more sane fashion, you can modify the system call table, inserting your own function without modifying the dispatch mechanism.
In fact, having read the 2.6 source, it says directly that int 0x80 and x86-64 syscall use the same code, so far. So you can make portable changes for x86-32 and x86-64.
END Update
The INT 0x80 method invokes the system call table handler. This matches register arguments to a call table, invoking kernel functions based on the contents of the EAX register. You can easily extend the system call table to add custom kernel API functions.
This may even work with the new syscall code on x86-64, as it uses the system call table, too.
If you alter the current system call table in any manner other than to extend it, you will break all dependent libraries and code, including libc, init, etc.
Here's the current Linux system call table: http://asm.sourceforge.net/syscall.html
It's an architectural overhaul. Everything has changed internally. SMP support is complete, the process scheduler is vastly improved, memory management got an overhaul, and many, many other things.
Yes. It's open-source software. If you do not have a copy of the source, you can get it from your vendor or from kernel.org.
Yes, but it's not advisable because it will break libc, it will break your baselayout, and it will break your toolchain if you change the sequence of existing syscalls, and nearly everything you might think you want to do should be done in userspace when at all possible.
Related
On an x86-64 Intel system that supports syscall and sysret what's the "fastest" system call from 64-bit user code on a vanilla kernel?
In particular, it must be a system call that exercises the syscall/sysret user <-> kernel transition1, but does the least amount of work beyond that. It doesn't even need to do the syscall itself: some type of early error which never dispatches to the specific call on the kernel side is fine, as long as it doesn't go down some slow path because of that.
Such a call could be used to estimate the raw syscall and sysret overhead independent of any work done by the call.
1 In particular, this excludes things that appear to be system calls but are implemented in the VDSO (e.g., clock_gettime) or are cached by the runtime (e.g., getpid).
One that doesn't exist, and therefore returns -ENOSYS quickly.
From arch/x86/entry/entry_64.S:
#if __SYSCALL_MASK == ~0
cmpq $__NR_syscall_max, %rax
#else
andl $__SYSCALL_MASK, %eax
cmpl $__NR_syscall_max, %eax
#endif
ja 1f /* return -ENOSYS (already in pt_regs->ax) */
movq %r10, %rcx
/*
* This call instruction is handled specially in stub_ptregs_64.
* It might end up jumping to the slow path. If it jumps, RAX
* and all argument registers are clobbered.
*/
#ifdef CONFIG_RETPOLINE
movq sys_call_table(, %rax, 8), %rax
call __x86_indirect_thunk_rax
#else
call *sys_call_table(, %rax, 8)
#endif
.Lentry_SYSCALL_64_after_fastpath_call:
movq %rax, RAX(%rsp)
1:
Use an invalid system call number so the dispatching code simply returns with
eax = -ENOSYS instead of dispatching to a system-call handling function at all.
Unless this causes the kernel to use the iret slow path instead of sysret / sysexit. That might explain the measurements showing an invalid number being 17 cycles slower than syscall(SYS_getpid), because glibc error handling (setting errno) probably doesn't explain it. But from my reading of the kernel source, I don't see any reason why it wouldn't still use sysret while returning -ENOSYS.
This answer is for sysenter, not syscall. The question originally said sysenter / sysret (which was weird because sysexit goes with sysenter, while sysret goes with syscall). I answered based on sysenter for a 32-bit process on an x86-64 kernel.
Native 64-bit syscall is handled more efficiently inside the kernel. (Update; with Meltdown / Spectre mitigation patches, it still dispatches via C do_syscall_64 in 4.16-rc2).
My What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? Q&A gives an overview of the kernel side of system-call entry points from compat mode into an x86-64 kernel (entry_64_compat.S). This answer is just taking the relevant parts of that.
The links in that answer and this are to Linux 4.12 sources, which doesn't contain the Meltdown mitigation page-table manipulation, so that will be significant extra overhead.
int 0x80 and sysenter have different entry points. You're looking for entry_SYSENTER_compat. AFAIK, sysenter always goes there, even if you execute it in a 64-bit user-space process. Linux's entry point pushes a constant __USER32_CS as the saved CS value, so it will always return to user-space in 32-bit mode.
After pushing registers to construct a struct pt_regs on the kernel stack, there's a TRACE_IRQS_OFF hook (no idea how many instructions that amounts to), then call do_fast_syscall_32 which is written in C. (Native 64-bit syscall dispatching is done directly from asm, but 32-bit compat system calls are always dispatched through C).
do_syscall_32_irqs_on in arch/x86/entry/common.c is pretty light-weight: just a check if the process is being traced (I think this is how strace can hook system calls via ptrace), then
...
if (likely(nr < IA32_NR_syscalls)) {
regs->ax = ia32_sys_call_table[nr]( ... arg );
}
syscall_return_slowpath(regs);
}
AFAIK, the kernel can use sysexit after this function returns.
So the return path is the same whether or not EAX had a valid system call number, and obviously returning without dispatching at all is the fastest path through that function, especially in a kernel with Spectre mitigation where the indirect branch on the table of function pointers would go through a retpoline and always mispredict.
If you want to really test sysenter/sysexit without all that extra overhead, you'll need to modify Linux to put a much simpler entry point without checking for tracing or pushing / popping all the registers.
You'd probably also want to modify the ABI to pass a return address in a register (like syscall does on its own) instead of saved on the user-space stack which Linux's current sysenter ABI does; it has to get_user() to read the EIP value it should return to.
Of if all this overhead is part of what you want to measure, you're definitely all set with an eax that gives you -ENOSYS; at worst you'll be getting one extra branch miss from the range-check if branch predictors are hot for that branch based on normal 32-bit system calls.
In this benchmark by Brendan Gregg (linked from this blog post which is interesting reading on the topic) close(999) (or some other fd not in use) is recommended.
Some system calls don't even go thru any user->kernel transition, read vdso(7).
I suspect that these VDSO system calls (e.g. time(2), ...) are the fastest. You could claim that there are no "real" system calls.
BTW, you could add a dummy system call to your kernel (e.g. some system call always returning 0, or a hello world system call, see also this) and measure it.
I suspect (without having benchmarked it) that getpid(2) should be a very fast system call, because the only thing it needs to do is fetch some data from the kernel memory. And AFAIK, it is a genuine system call, not using VDSO techniques. And you could use syscall(2) to avoid its caching done by your libc and forcing the genuine system call.
I maintain my position (given in a comment to your initial question): without actual motivation your question does not make any concrete sense. Then I still do think that syscall(2) doing getpid is measuring the typical overhead to make a system call (and I guess you really care about that one). In practice almost all system calls are doing more work that such a getpid (or getppid).
I am implementing a Linux security sandbox for a custom bytecode interpreter through seccomp mode. To minimize as much as possible the attack surface, I want to run it in a completely clean virtual address space. I only need code and data segments plus stack available, but I do not need vsyscall, vdso nor vvar.
Is there any way to disable allocation of this pages for a given process?
Basically, no, you will have to disable vsyscall/vDSO globally if you want the mapping itself to be unavailable. If you only want the program to be unable to call vsyscall/vDSO syscalls, then seccomp will be able to do it. Some caveats though:
See https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt
On x86-64, vsyscall emulation is enabled by default. (vsyscalls are
legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccomp, with a few oddities:
A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to
the vsyscall entry for the given call and not the address after the
'syscall' instruction. Any code which wants to restart the call
should be aware that (a) a ret instruction has been emulated and (b)
trying to resume the syscall will again trigger the standard vsyscall
emulation security checks, making resuming the syscall mostly
pointless.
A return value of SECCOMP_RET_TRACE will signal the tracer as usual,
but the syscall may not be changed to another system call using the
orig_rax register. It may only be changed to -1 order to skip the
currently emulated call. Any other change MAY terminate the process.
The rip value seen by the tracer will be the syscall entry address;
this is different from normal behavior. The tracer MUST NOT modify
rip or rsp. (Do not rely on other changes terminating the process.
They might work. For example, on some kernels, choosing a syscall
that only exists in future kernels will be correctly emulated (by
returning -ENOSYS).
To detect this quirky behavior, check for addr & ~0x0C00 ==
0xFFFFFFFFFF600000. (For SECCOMP_RET_TRACE, use rip. For
SECCOMP_RET_TRAP, use siginfo->si_call_addr.) Do not check any other
condition: future kernels may improve vsyscall emulation and current
kernels in vsyscall=native mode will behave differently, but the
instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these
cases.
Note that modern systems are unlikely to use vsyscalls at all -- they
are a legacy feature and they are considerably slower than standard
syscalls. New code will use the vDSO, and vDSO-issued system calls
are indistinguishable from normal system calls.
So emulated vsyscalls can be confined by seccomp, and vDSOs are likewise confined by seccomp. If you disable gettimeofday(), the confined program will not be able to call that syscall through emulated vsyscall, vDSO, or regular syscall. If you confine them this way with seccomp, you shouldn't have to worry about the attack surface they create.
If you are worried about an attacker exploiting the vDSO mapping itself (which doesn't require calling a syscall), then I don't believe there's a way to disable it on a per-process basis reliably. You can prevent it from being linked in, but it would be hard to prevent a compromised bytecode interpreter from allocating memory and putting it back. You can boot with the vdso=0 kernel parameter which will disable it globally, though, so linking it in would do nothing.
I am reading about VM handling on Linux. Apparently to perform a syscall there's a page at 0xFFFFF000 on x86. called vsyscall page. In the past, the strategy to call a syscall was to use int 0x80. Is this vsyscall page strategy still using int 0x80 under the hood, or is it using a different call strategy (e.g. syscall opcode?). Collateral question: is the int 0x80 method outdated?
If you run ldd on a modern Linux binary, you'll see that it's linked to a dynamic library called linux-vdso.1 (on amd64) or linux-gate.so.1 (on x86), which is located in that vsyscall page. This is a shared library provided by the kernel, mapped into every process's address space, which contains C functions that encapsulate the specifics of how to perform a system call.
The reason for this encapsulation is that the "preferred" way to perform a system call can differ from one machine to another. The interrupt 0x80 method should always work on x86, but recent processors support the sysenter (Intel) or syscall (AMD) instructions, which are much more efficient. You want your programs to use those when available, but you also want the same compiled binary to run on both Intel and AMD (and other) processors, so it shouldn't contain vendor-specific opcodes. The linux-vdso/linux-gate library hides these processor-specific decisions behind a consistent interface.
For more information, see this article.
What is the difference between SYS_exit, sys_exit() and exit()?
What I understand :
The linux kernel provides system calls, which are listed in man 2 syscalls.
There are wrapper functions of those syscalls provided by glibc which have mostly similar names as the syscalls.
My question : In man 2 syscalls, there is no mention of SYS_exit and sys_exit(), for example. What are they?
Note : The syscall exit here is only an example. My question really is : What are SYS_xxx and sys_xxx()?
I'll use exit() as in your example although this applies to all system calls.
The functions of the form sys_exit() are the actual entry points to the kernel routine that implements the function you think of as exit(). These symbols are not even available to user-mode programmers. That is, unless you are hacking the kernel, you cannot link to these functions because their symbols are not available outside the kernel. If I wrote libmsw.a which had a file scope function like
static int msw_func() {}
defined in it, you would have no success trying to link to it because it is not exported in the libmsw symbol table; that is:
cc your_program.c libmsw.a
would yield an error like:
ld: cannot resolve symbol msw_func
because it isn't exported; the same applies for sys_exit() as contained in the kernel.
In order for a user program to get to kernel routines, the syscall(2) interface needs to be used to effect a switch from user-mode to kernel mode. When that mode-switch (somtimes called a trap) occurs a small integer is used to look up the proper kernel routine in a kernel table that maps integers to kernel functions. An entry in the table has the form
{SYS_exit, sys_exit},
Where SYS_exit is an preprocessor macro which is
#define SYS_exit (1)
and has been 1 since before you were born because there hasn't been reason to change it. It also happens to be the first entry in the table of system calls which makes look up a simple array index.
As you note in your question, the proper way for a regular user-mode program to access sys_exit is through the thin wrapper in glibc (or similar core library). The only reason you'd ever need to mess with SYS_exit or sys_exit is if you were writing kernel code.
This is now addressed in man syscall itself,
Roughly speaking, the code belonging to the system call with number __NR_xxx defined in /usr/include/asm/unistd.h can be found in the Linux kernel source in the routine sys_xxx(). (The dispatch table for i386 can be found in /usr/src/linux/arch/i386/kernel/entry.S.) There are many exceptions, however, mostly because older system calls were superseded by newer ones, and this has been treated somewhat unsystematically. On platforms with proprietary operating-system emulation, such as parisc, sparc, sparc64, and alpha, there are many additional system calls; mips64 also contains a full set of 32-bit system calls.
At least now /usr/include/asm/unistd.h is a preprocessor hack that links to either,
/usr/include/asm/unistd_32.h
/usr/include/asm/unistd_x32.h
/usr/include/asm/unistd_64.h
The C function exit() is defined in stdlib.h. Think of this as a high level event driven interface that allows you to register a callback with atexit()
/* Call all functions registered with `atexit' and `on_exit',
in the reverse of the order in which they were registered,
perform stdio cleanup, and terminate program execution with STATUS. */
extern void exit (int __status) __THROW __attribute__ ((__noreturn__));
So essentially the kernel provides an interface (C symbols) called __NR_xxx. Traditionally people want sys_exit() which is defined with a preprocessor macro SYS_exit. This macro creates the sys_exit() function. The exit() function is part of the standard C library stdlib.h and ported to other operating systems that lack the Linux Kernel ABI entirely (there may not be __NR_xxx functions) and potentially don't even have sys_* functions available either (you could write exit() to send the interrupt or use VDSO in Assembly).
From a not so far removed picture of what is going on, could someone expound more on what is the difference between Linux's system calls like read() and write() etc. and writing them in assembly using the x86 INT opcode along with setting up the specified registers?
The actual function read() is a C library wrapper over what is called the 'system call gate' . The C library wrapper is primarily responsible for things like setting errno on failure, as well as mapping between structures used in userspace and those used by the low-level syscall.
The system call gate, in turn, is what actually switches from usermode to kernel mode. This depends on the CPU architecture - on x86, you have two options - one is to use INT 080h after setting up registers with the syscall number and arguments; another is to call into a symbol provided by a library mapped into every executable's address space, with the same register setup. This library then picks between several potential options for user->kernel transitions, including SYSENTER, SYSCALL, or a fallback to INT 080h. Other architectures use yet different methods. In any case, the CPU shifts into kernelspace, where the syscall number is used to lookup the appropriate handler in a big table.
interrupt is not the only way to invoke system call, you use special instructs like sysenter, syscall or simple jump to specific address in protected mode.