What is the difference between SYS_exit, sys_exit() and exit()?
What I understand :
The linux kernel provides system calls, which are listed in man 2 syscalls.
There are wrapper functions of those syscalls provided by glibc which have mostly similar names as the syscalls.
My question : In man 2 syscalls, there is no mention of SYS_exit and sys_exit(), for example. What are they?
Note : The syscall exit here is only an example. My question really is : What are SYS_xxx and sys_xxx()?
I'll use exit() as in your example although this applies to all system calls.
The functions of the form sys_exit() are the actual entry points to the kernel routine that implements the function you think of as exit(). These symbols are not even available to user-mode programmers. That is, unless you are hacking the kernel, you cannot link to these functions because their symbols are not available outside the kernel. If I wrote libmsw.a which had a file scope function like
static int msw_func() {}
defined in it, you would have no success trying to link to it because it is not exported in the libmsw symbol table; that is:
cc your_program.c libmsw.a
would yield an error like:
ld: cannot resolve symbol msw_func
because it isn't exported; the same applies for sys_exit() as contained in the kernel.
In order for a user program to get to kernel routines, the syscall(2) interface needs to be used to effect a switch from user mode to kernel mode. When that mode switch (sometimes called a trap) occurs, a small integer is used to look up the proper kernel routine in a kernel table that maps integers to kernel functions. An entry in the table has the form
{SYS_exit, sys_exit},
where SYS_exit is a preprocessor macro which is
#define SYS_exit (1)
and has been 1 on i386 since before you were born because there hasn't been a reason to change it (x86-64 uses a separate numbering, where it is 60). The number is used directly as an index into the table of system calls, which makes lookup a simple array access.
As you note in your question, the proper way for a regular user-mode program to access sys_exit is through the thin wrapper in glibc (or similar core library). The only reason you'd ever need to mess with SYS_exit or sys_exit is if you were writing kernel code.
This is now addressed in man 2 syscalls itself:
Roughly speaking, the code belonging to the system call with number __NR_xxx defined in /usr/include/asm/unistd.h can be found in the Linux kernel source in the routine sys_xxx(). (The dispatch table for i386 can be found in /usr/src/linux/arch/i386/kernel/entry.S.) There are many exceptions, however, mostly because older system calls were superseded by newer ones, and this has been treated somewhat unsystematically. On platforms with proprietary operating-system emulation, such as parisc, sparc, sparc64, and alpha, there are many additional system calls; mips64 also contains a full set of 32-bit system calls.
At least now /usr/include/asm/unistd.h is a preprocessor switch that includes one of the following, depending on the target ABI:
/usr/include/asm/unistd_32.h
/usr/include/asm/unistd_x32.h
/usr/include/asm/unistd_64.h
The C function exit() is defined in stdlib.h. Think of this as a high level event driven interface that allows you to register a callback with atexit()
/* Call all functions registered with `atexit' and `on_exit',
in the reverse of the order in which they were registered,
perform stdio cleanup, and terminate program execution with STATUS. */
extern void exit (int __status) __THROW __attribute__ ((__noreturn__));
So essentially the kernel exposes its system call numbers as preprocessor constants named __NR_xxx. The C library's sys/syscall.h defines SYS_xxx as aliases for those numbers (SYS_exit is __NR_exit), and sys_exit() is the kernel-side function that implements the call. The exit() function is part of the standard C library (declared in stdlib.h) and is ported to other operating systems that lack the Linux kernel ABI entirely (there may be no __NR_xxx constants) and potentially have no sys_* functions either (you could write exit() to issue the interrupt or use the vDSO in assembly).
Related
In most languages, C included, the stack is used for function calls. That's why you get a "Stack Overflow" error if you are not careful in recursion. (Pun not intended).
If that is true, then what is so special about the asmlinkage GCC directive?
It says, from #kernelnewbies
The asmlinkage tag is one other thing that we should observe about
this simple function. This is a #define for some gcc magic that tells
the compiler that the function should not expect to find any of its
arguments in registers (a common optimization), but only on the CPU's
stack.
I mean I don't think the registers are used in normal function calls.
What is even more strange is when you learn it is implemented using the GCC regparm function attribute on x86.
The documentation of regparm is as follows:
On x86-32 targets, the regparm attribute causes the compiler to pass
arguments number one to number if they are of integral type in
registers EAX, EDX, and ECX instead of on the stack.
This is basically saying the opposite of what asmlinkage is trying to do.
So what happens? Are they on the stack or in the registers?
Where am I going wrong?
The information isn't very clear.
On 32-bit x86, the asmlinkage macro expands to __attribute__((regparm(0))), which basically tells GCC that no parameters should be passed through registers (the 0 is the important part). As of Linux 5.17, x86-32 and IA-64 (Itanium) seem to be the only two architectures redefining this macro, which by default expands to no attribute at all.
So asmlinkage does not by itself mean "parameters are passed on the stack". By default, the normal calling convention is used. This includes x86 64bit, which follows the System V AMD64 ABI calling convention, passing function parameters through RDI, RSI, RDX, RCX, R8, R9, [XYZ]MM0–7.
HOWEVER there is an important clarification to make: even with no special __attribute__ to force the compiler to use the stack for parameters, syscalls in recent kernel versions still take parameters from the stack indirectly through a pointer to a pt_regs structure (holding all the user-space registers saved on the stack on syscall entry). This is achieved through a moderately complex set of macros (SYSCALL_DEFINEx) that does everything transparently.
So technically, although asmlinkage does not change the calling convention, parameters are not passed inside registers as one would think by simply looking at the syscall function signature.
For example, the following syscall:
SYSCALL_DEFINE3(something, int, one, int, two, int, three)
{
// ...
do_something(one, two, three);
// ...
}
Actually becomes (roughly):
asmlinkage long __x64_sys_something(const struct pt_regs *regs)
{
// ...
do_something(regs->di, regs->si, regs->dx);
// ...
}
Which compiles to something like:
/* ... */
mov rdx,QWORD PTR [rdi+0x60]
mov rsi,QWORD PTR [rdi+0x68]
mov rdi,QWORD PTR [rdi+0x70]
call do_something
/* ... */
On i386 and x86-64 at least, asmlinkage means to use the standard calling convention you'd get with no GCC options and no __attribute__. (Like what user-space programs normally use for that target.)
For i386, that means stack args only. For x86-64, it's still the same registers as usual.
For x86-64, there's no difference; the kernel already uses the standard calling convention from the AMD64 System V ABI doc everywhere, because it's well-designed for efficiency, passing the first 6 integer args in registers.
But i386 has more historical baggage, with the standard calling convention (i386 SysV ABI) inefficiently passing all args on the stack. Presumably at some point in ancient history, Linux was compiled by GCC using this convention, and the hand-written asm entry points that called C functions were already using that convention.
So (I'm guessing here), when Linux wanted to switch from gcc -m32 to gcc -m32 -mregparm=3 to build the kernel with a more efficient calling convention, they had a choice to either modify the hand-written asm at the same time to use the new convention, or to force a few specific C functions to still use the traditional calling convention so the hand-written asm could stay the same.
If they'd made the former choice, asmlinkage for i386 would be __attribute__((regparm(3))) to force that convention even if the kernel is compiled a different way.
But instead, they chose to keep the asm the same, and #define asmlinkage __attribute__((regparm(0))) for i386, which indeed is zero register args, using the stack right away.
I don't know if that maybe had any debugging benefit, like in terms of being able to see what args got passed into a C function from asm without the only copy likely getting modified right away.
If -mregparm=3 and the corresponding attribute were new GCC features, Linus probably wanted to keep it possible to build the kernel with older GCC. That would rule out changing the asm to require __attribute__((regparm(3))). The asmlinkage = regparm(0) choice they actually made also has the advantage of not having to modify any asm, which means no correctness concerns, and that can be disentangled from any possible GCC bugs with using the new(?)-at-the-time calling convention.
At this point I think it would be totally possible to modify the asm code that calls asmlinkage functions, and swap it to being regparm(3). But that's a pretty minor thing. And not worth doing now since 32-bit x86 kernels are long since obsolete for almost all use cases. You almost always want a 64-bit kernel even if using a 32-bit user-space.
There might even be an efficiency benefit to stack args if saving the registers at a system-call entry point involved saving them with EBX at the lowest address, where they're already in place to be used as function args. You'd be all set to call *ia32_sys_call_table(,%eax,4). But that isn't actually safe because callees own their stack args and are allowed to write them, even though GCC usually doesn't use the incoming stack arg locations as scratch space. So I doubt Linux would have done this.
Other ISAs cope just fine with asmlinkage passing args in registers, so there's nothing fundamental about stack args that's important for how Linux works. (Except possibly for i386-specific code, but I doubt even that.)
The whole "asmlinkage means to pass args on the stack" is purely an i386 thing.
Most other ISAs that Linux runs on are more recent than 32-bit x86 (and/or are RISC-like with more registers), and have a standard calling convention that's more efficient with modern CPUs, using registers for the first few args. That includes x86-64.
I am reading about VM handling on Linux. Apparently, to perform a syscall there's a page at 0xFFFFF000 on x86, called the vsyscall page. In the past, the strategy to call a syscall was to use int 0x80. Is this vsyscall page strategy still using int 0x80 under the hood, or is it using a different call strategy (e.g. the syscall opcode)? Collateral question: is the int 0x80 method outdated?
If you run ldd on a modern Linux binary, you'll see that it's linked to a dynamic library called linux-vdso.so.1 (on amd64) or linux-gate.so.1 (on x86), which is located in that vsyscall page. This is a shared library provided by the kernel, mapped into every process's address space, which contains C functions that encapsulate the specifics of how to perform a system call.
The reason for this encapsulation is that the "preferred" way to perform a system call can differ from one machine to another. The interrupt 0x80 method should always work on x86, but recent processors support the sysenter (Intel) or syscall (AMD) instructions, which are much more efficient. You want your programs to use those when available, but you also want the same compiled binary to run on both Intel and AMD (and other) processors, so it shouldn't contain vendor-specific opcodes. The linux-vdso/linux-gate library hides these processor-specific decisions behind a consistent interface.
For more information, see this article.
From a not so far removed picture of what is going on, could someone expound more on what is the difference between Linux's system calls like read() and write() etc. and writing them in assembly using the x86 INT opcode along with setting up the specified registers?
The actual function read() is a C library wrapper over what is called the 'system call gate' . The C library wrapper is primarily responsible for things like setting errno on failure, as well as mapping between structures used in userspace and those used by the low-level syscall.
The system call gate, in turn, is what actually switches from user mode to kernel mode. This depends on the CPU architecture. On x86, you have two options: one is to use INT 080h after setting up registers with the syscall number and arguments; another is to call into a symbol provided by a library mapped into every executable's address space, with the same register setup. This library then picks between several potential options for user->kernel transitions, including SYSENTER, SYSCALL, or a fallback to INT 080h. Other architectures use yet different methods. In any case, the CPU shifts into kernel space, where the syscall number is used to look up the appropriate handler in a big table.
An interrupt is not the only way to invoke a system call; you can also use special instructions such as sysenter or syscall, or simply jump to a specific address in protected mode.
I have been playing around with yasm in an attempt to grasp a basic understanding of x86 assembly. From my tests, it seems you call functions from the kernel by setting the EAX register with the number of the function you want. Then, you push the function arguments onto the stack and issue a syscall (0x80) to execute the instruction. This is Mac OS X / BSD style, I know Linux uses registers to hold arguments instead of using the stack. Does this sound right? Is this the basic idea?
I am a little confused because where are the functions documented? How would I know what arguments, and in what order, to push them onto the stack? Should I look in syscall.h for the answers? It seems there would be a specific reference for supported kernel calls other than C headers.
Also, do standard C functions like printf() rely on the kernel's built-in functions for say, writing to stdout? In other words, does the C compiler know what the kernel functions are and is it trying to "figure out" how to take C code and translate it to kernel functions (which the assembler then translates to machine code)?
C code -> C compiler -> kernel calls / asm -> assembler -> machine binary
I'm sure these are really basic questions, but my understanding of everything that happens after the C compiler is rather muddy.
System Call Documentation
Make sure you have the Xcode Developer Tools installed for the UNIX manpages for Mac OS X and then run man 2 intro on the command line. For a list of system calls, you can use syscall.h (which is useful for the system call numbers) or you can run man 2 syscalls. Then to look up each specific system call, you can run man 2 syscall_name, e.g. for read, you can run man 2 read.
UNIX manpages are a historically significant documentation reference for UNIX systems. Pretty much any low-level POSIX function or system call will be documented using them, as well as most commands. Section 2 covers just system calls, and so when you run man 2 pagename, you're asking for the manpage in the system calls section. Section 3 deals with library functions, so you can run man 3 sprintf the next time you want to read about sprintf.
How C Libraries relate to System Calls
As for how C libraries implement their functionality, usually they build everything on top of system calls, especially in UNIX-like operating systems. malloc internally uses mmap() or brk() on a lot of platforms to get hold of the actual memory for your process, and I/O functions will often use buffers with read and write calls. If there's some other mechanism or library providing the needed functionality, they may also choose to use those instead (e.g. some C libraries for DOS may make use of direct BIOS interrupts instead of calling only DOS interrupts, whereas C libraries for Windows might use Win32 API calls).
Often only a subset of the library functions will need system calls or underlying mechanisms to be implemented though, since the remainder can be written in terms of that subset.
To actually know what's going on with your specific implementation, you should investigate what's happening in a debugger (just keep stepping into all the function calls) or browse the source code of the C library you're using.
How your C code using C libraries relates to machine code
In your question you also suggested:
C code -> C compiler -> kernel calls / asm -> assembler -> machine binary
This is combining two very different concepts. Functions and function calls are supported at the machine code and assembly level, so your C code has a very direct mapping to machine code:
C code -> C compiler -> Assembler -> Linker -> Machine Binary
That is, the compiler translates your function calls in C to function calls in Assembly and system calls in C to system calls in Assembly.
However on most platforms, that machine code contains references to shared libraries and functions in those libraries, so your machine code might have a function that calls other functions from a shared library. The OS then loads that shared library's machine code (if it hasn't been loaded yet for something else) and then runs the machine code for the library function. Then if that library function calls system calls via interrupts, the kernel receives the system call request and does low-level operations directly with the hardware or the BIOS.
So in a protected mode OS, your machine code can be seen as doing the following:
                    <----------------------+
                                           |
Function call to -> Other function calls --+
              or -> System calls to -> Direct hardware access (inside kernel)
                                 or -> BIOS calls (inside kernel)
You can, of course, call system calls directly in your program as well, skipping the need for any libraries, but unless you're writing your own library, there's usually very little need to do this. If you want even lower-level access, you have to write kernel-level code such as drivers or kernel subsystems.
The recommended way is not doing INT 0x80 by yourself, but to use the wrapper functions from the stdlib. These are, of course, available for assembly as well.
Concerning printf, this works this way:
printf internally calls fprintf(stdout, ...), which in turn uses the FILE * stdout to write to the file descriptor 1 and does write(1, ...). This calls a small wrapper function to set the proper registers to the arguments and perform the kernel call.
How does linux 2.6 differ from 2.4?
Can we modify the source kernel?
Can we modify the int 0x80 service routine?
UPDATE:
1. The 0x80 handler is essentially the same between 2.4 and 2.6, although in 2.6 the function it calls is also called by the handler for the 'syscall' instruction on x86-64.
2. the 0x80 handler can be modified like the rest of the kernel.
3. You won't break anything by modifying it, unless you remove backwards compatibility. E.g., you can add your own trace or backdoor if you feel so inclined. The other post that says you will break your libs and toolchain if you modify the handler is incorrect. If you break the dispatch algorithm, or modify the dispatch table incorrectly, then you will break things.
3a. As I originally posted, the best way to extend the 0x80 service is to extend the system call handler.
As the kernel source says:
What: The kernel syscall interface
Description:
This interface matches much of the POSIX interface and is based
on it and other Unix based interfaces. It will only be added to
over time, and not have things removed from it.
Note that this interface is different for every architecture
that Linux supports. Please see the architecture-specific
documentation for details on the syscall numbers that are to be
mapped to each syscall.
The system call table entries for i386 are in:
arch/i386/kernel/syscall_table.S
Note that the table is a sequence of pointers, so if you want to maintain a degree of forward compatibility with the kernel maintainers, you'd need to pad the table before placement of your pointer.
The syscall vector number is defined in irq_vectors.h
Then traps.c sets the address of the system_call function via set_system_gate, which places the entry into the interrupt descriptor table. The system_call function itself is in entry.S, and calls the requested pointer from the system call table.
There are a few housekeeping details, which you can see reading the code, but direct modification of the 0x80 interrupt handler is accomplished in entry.S inside the system_call function. In a more sane fashion, you can modify the system call table, inserting your own function without modifying the dispatch mechanism.
In fact, having read the 2.6 source, it says directly that int 0x80 and x86-64 syscall use the same code, so far. So you can make portable changes for x86-32 and x86-64.
END Update
The INT 0x80 method invokes the system call table handler. This matches register arguments to a call table, invoking kernel functions based on the contents of the EAX register. You can easily extend the system call table to add custom kernel API functions.
This may even work with the new syscall code on x86-64, as it uses the system call table, too.
If you alter the current system call table in any manner other than to extend it, you will break all dependent libraries and code, including libc, init, etc.
Here's the current Linux system call table: http://asm.sourceforge.net/syscall.html
It's an architectural overhaul. Everything has changed internally. SMP support is complete, the process scheduler is vastly improved, memory management got an overhaul, and many, many other things.
Yes. It's open-source software. If you do not have a copy of the source, you can get it from your vendor or from kernel.org.
Yes, but it's not advisable: if you change the sequence of existing syscalls, it will break libc, it will break your baselayout, and it will break your toolchain. Nearly everything you might think you want to do should be done in userspace when at all possible.