In user space, the ioctl system call has the following prototype:
int ioctl(int fd, unsigned long cmd, ...);
The prototype stands out in the list of Unix system calls because of the dots, which usually mark the function as having a variable number of arguments. In a real system, however, a system call cannot actually have a variable number of arguments. System calls must have a well-defined prototype, because user programs can access them only through hardware "gates".
So what are these hardware gates? The page numbers are 135 and 136.
Hardware "gates" are specific instructions that allow switching to the kernel's context, usually to let a program request something from the kernel. This might be an instruction like syscall, sysenter, or int 0x80, depending on your system.
I should note that these aren't usually called "hardware gates" in practice, but rather something like "system call instructions."
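To make the point about the dots concrete, here is a minimal sketch, assuming Linux and the FIONREAD request on a pipe. The helper name pending_bytes_after_write is made up for illustration. Although the prototype is variadic, this call passes exactly one extra argument, a pointer the kernel fills in.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: writes `len` bytes into a fresh pipe and asks the
 * kernel how many bytes are waiting to be read. Returns -1 on failure. */
int pending_bytes_after_write(const char *data, size_t len)
{
    int fds[2];
    int pending = -1;

    assert(data != NULL);
    if (pipe(fds) == -1)
        return -1;

    if (write(fds[1], data, len) == (ssize_t)len) {
        /* The "..." carries a single argument here: a pointer to an int. */
        if (ioctl(fds[0], FIONREAD, &pending) == -1)
            pending = -1;
    }

    close(fds[0]);
    close(fds[1]);
    return pending;
}
```

Calling pending_bytes_after_write("hello", 5) should report the five bytes that were just queued.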
In an OS book, when it talks about client-server communication, it says:
Client-server communication is a common pattern in many systems, and so one can ask: how can we improve its performance? One step is to recognize that both the client and the server issue a write immediately followed by a read, to wait for the other side to reply; at the cost of adding a system call, these can be combined to eliminate two kernel crossings per round trip.
I wonder how "issue a write immediately followed by a read" can save 2 kernel crossings per round trip.
A write issues a system call into the kernel, which causes a kernel crossing from user mode to kernel mode. When the write finishes, the OS returns to user code, crossing from kernel mode back to user mode.
Then, read is called, and causes a kernel crossing from user mode to kernel mode, and then it returns to user-code, from kernel mode to user mode.
So what is the saved kernel crossing? Does it mean that when the write finishes, it does not return to user code and user mode, but instead runs read directly in kernel mode?
As far as I understand the OS book, it is a potential optimization. An OS may have a syscall that does the write and the read at once. It could be a hypothetical syscall like int write_read(int fd, char *write_buf, size_t write_len, char *read_buf, size_t *read_len). But there is no such call in the Linux kernel.
Modern kernels do not use interrupts for syscalls, so the optimization would not help much. Moreover, performance-critical modern applications usually use some kind of asynchronous, non-blocking handling, so the proposed optimization would be useless for them anyway. A further problem with that optimization would be error reporting: if something failed, the caller could not easily tell whether the read or the write had failed.
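A rough user-space sketch of the idea, assuming a Unix socketpair and a forked echo "server". write_then_read and enum wr_stage are hypothetical names; this stand-in still makes two system calls (four crossings), whereas the in-kernel version the book describes would make one (two crossings). The extra stage output shows one way the error-reporting problem could be handled.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical: tells the caller which half of the combined call failed. */
enum wr_stage { WR_OK = 0, WR_WRITE_FAILED, WR_READ_FAILED };

/* User-space stand-in for the hypothetical combined call. */
ssize_t write_then_read(int fd, const void *wbuf, size_t wlen,
                        void *rbuf, size_t rlen, enum wr_stage *stage)
{
    assert(stage != NULL);
    if (write(fd, wbuf, wlen) != (ssize_t)wlen) {   /* crossings 1 and 2 */
        *stage = WR_WRITE_FAILED;
        return -1;
    }
    ssize_t n = read(fd, rbuf, rlen);               /* crossings 3 and 4 */
    *stage = (n < 0) ? WR_READ_FAILED : WR_OK;
    return n;
}

/* One echo round trip; returns the number of bytes echoed back, or -1. */
ssize_t round_trip_demo(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {                 /* child: read the request, echo it */
        char buf[16];
        close(sv[0]);
        ssize_t n = read(sv[1], buf, sizeof buf);
        if (n > 0 && write(sv[1], buf, (size_t)n) != n)
            _exit(1);
        _exit(0);
    }

    close(sv[1]);                   /* parent: the client side */
    char reply[16];
    enum wr_stage stage;
    ssize_t n = write_then_read(sv[0], "ping", 4, reply, sizeof reply, &stage);
    close(sv[0]);
    waitpid(pid, NULL, 0);
    return (stage == WR_OK) ? n : -1;
}
```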
What is the difference between SYS_exit, sys_exit() and exit()?
What I understand :
The linux kernel provides system calls, which are listed in man 2 syscalls.
There are wrapper functions of those syscalls provided by glibc which have mostly similar names as the syscalls.
My question : In man 2 syscalls, there is no mention of SYS_exit and sys_exit(), for example. What are they?
Note : The syscall exit here is only an example. My question really is : What are SYS_xxx and sys_xxx()?
I'll use exit() as in your example although this applies to all system calls.
The functions of the form sys_exit() are the actual entry points to the kernel routine that implements the function you think of as exit(). These symbols are not even available to user-mode programmers. That is, unless you are hacking the kernel, you cannot link to these functions because their symbols are not available outside the kernel. If I wrote libmsw.a which had a file scope function like
static int msw_func() {}
defined in it, you would have no success trying to link to it because it is not exported in the libmsw symbol table; that is:
cc your_program.c libmsw.a
would yield an error like:
ld: cannot resolve symbol msw_func
because it isn't exported; the same applies for sys_exit() as contained in the kernel.
In order for a user program to get to kernel routines, the syscall(2) interface needs to be used to effect a switch from user mode to kernel mode. When that mode switch (sometimes called a trap) occurs, a small integer is used to look up the proper kernel routine in a kernel table that maps integers to kernel functions. An entry in the table has the form
{SYS_exit, sys_exit},
where SYS_exit is a preprocessor macro which is
#define SYS_exit (1)
and has been 1 on i386 since before you were born because there has been no reason to change it (other architectures number it differently; on x86-64 it is 60). The number is also the index of the entry in the table of system calls, which makes lookup a simple array index.
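The lookup described above can be sketched in ordinary C. Everything here is a toy: the TOY_* numbers, the handlers, and toy_dispatch are made up for illustration and are not the kernel's real table or code. The sketch stores only function pointers and uses the number itself as the array index.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy call numbers; not the real kernel numbering. */
enum { TOY_restart = 0, TOY_exit = 1, TOY_fork = 2, TOY_NR_CALLS };

#define TOY_ENOSYS 38   /* stand-in for "no such system call" */

typedef long (*toy_handler)(long arg);

static long toy_sys_restart(long arg) { (void)arg; return 0; }
static long toy_sys_exit(long arg)    { return 100 + arg; }
static long toy_sys_fork(long arg)    { (void)arg; return 200; }

/* The call number doubles as the index into the table. */
static const toy_handler toy_call_table[TOY_NR_CALLS] = {
    [TOY_restart] = toy_sys_restart,
    [TOY_exit]    = toy_sys_exit,
    [TOY_fork]    = toy_sys_fork,
};

/* Validate the number, then dispatch with a plain array index. */
long toy_dispatch(long nr, long arg)
{
    if (nr < 0 || nr >= TOY_NR_CALLS || toy_call_table[nr] == NULL)
        return -TOY_ENOSYS;
    return toy_call_table[nr](arg);
}
```

With these toy handlers, toy_dispatch(TOY_exit, 7) reaches toy_sys_exit, while an out-of-range number is rejected before any handler runs.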
As you note in your question, the proper way for a regular user-mode program to access sys_exit is through the thin wrapper in glibc (or similar core library). The only reason you'd ever need to mess with SYS_exit or sys_exit is if you were writing kernel code.
This is now addressed in man syscall itself,
Roughly speaking, the code belonging to the system call with number __NR_xxx defined in /usr/include/asm/unistd.h can be found in the Linux kernel source in the routine sys_xxx(). (The dispatch table for i386 can be found in /usr/src/linux/arch/i386/kernel/entry.S.) There are many exceptions, however, mostly because older system calls were superseded by newer ones, and this has been treated somewhat unsystematically. On platforms with proprietary operating-system emulation, such as parisc, sparc, sparc64, and alpha, there are many additional system calls; mips64 also contains a full set of 32-bit system calls.
At least now /usr/include/asm/unistd.h is a preprocessor hack that links to either,
/usr/include/asm/unistd_32.h
/usr/include/asm/unistd_x32.h
/usr/include/asm/unistd_64.h
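As a quick check of how the names relate in user space, here is a small sketch assuming Linux with glibc, where <sys/syscall.h> provides the SYS_xxx names on top of the __NR_xxx numbers. The two function names are made up for illustration, and getpid is used instead of exit so the program keeps running.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>   /* SYS_xxx names; pulls in the __NR_xxx numbers */
#include <sys/types.h>
#include <unistd.h>        /* syscall(), getpid() */

/* Returns 1 if calling by number and calling the wrapper agree on the PID. */
int getpid_paths_agree(void)
{
    long via_number = syscall(SYS_getpid);
    pid_t via_wrapper = getpid();
    return via_number == (long)via_wrapper;
}

/* Returns 1 if the SYS_ name and the __NR_ name are the same number. */
int sys_and_nr_match(void)
{
    return SYS_getpid == __NR_getpid;
}
```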
The C function exit() is defined in stdlib.h. Think of this as a high level event driven interface that allows you to register a callback with atexit()
/* Call all functions registered with `atexit' and `on_exit',
in the reverse of the order in which they were registered,
perform stdio cleanup, and terminate program execution with STATUS. */
extern void exit (int __status) __THROW __attribute__ ((__noreturn__));
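The reverse-order behaviour described in that comment can be observed with a short sketch. It assumes a POSIX system; atexit_order_demo and the two handlers are hypothetical names. The child process registers two handlers and calls exit(), and the parent reads back the order in which they ran.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1;

static void first_registered(void)  { if (write(report_fd, "1", 1) != 1) _exit(1); }
static void second_registered(void) { if (write(report_fd, "2", 1) != 1) _exit(1); }

/* Stores the order in which the handlers ran in `order`; 0 on success. */
int atexit_order_demo(char order[3])
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        report_fd = fds[1];
        atexit(first_registered);
        atexit(second_registered);
        exit(0);            /* runs the registered handlers, then terminates */
    }

    close(fds[1]);
    ssize_t n = 0, got;
    while (n < 2 && (got = read(fds[0], order + n, (size_t)(2 - n))) > 0)
        n += got;
    order[n] = '\0';
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return (n == 2) ? 0 : -1;
}
```

The handler registered last reports first, so the parent should read "21".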
So essentially the kernel exposes its system call numbers to user space as the preprocessor constants __NR_xxx. The SYS_xxx names are macros for those same numbers, and sys_xxx() is the kernel-internal routine that implements each call. The exit() function is part of the standard C library (declared in stdlib.h) and is ported to other operating systems that lack the Linux kernel ABI entirely (there may not be any __NR_xxx numbers) and potentially don't even have sys_* functions available either (you could write exit() to send the interrupt or use the vDSO in assembly).
In Linux, to create a socket we include the sys/socket.h header file and use the socket() function. The header file is located at /usr/include/sys/socket.h.
extern int socket (int __domain, int __type, int __protocol) __THROW;
Can anyone please tell me where the socket() function is actually implemented?
Thanks.
Actually,
int socket (int __domain, int __type, int __protocol) __THROW
is implemented in glibc,
and glibc calls the kernel function sys_socket, which is implemented in the kernel file net/socket.c.
asmlinkage long sys_socket(int family, int type, int protocol);
socket(2) is a system call. The socket function inside glibc is just a tiny wrapper that makes the real system call.
From an application's point of view, system calls are atomic; in other words, the virtual machine on which your Linux application program is running is the x86 machine (the non-privileged instruction set) augmented with the more than 300 system calls provided by the kernel. See also the Assembly Howto, which explains how a system call can be coded. Read more about the Linux kernel in the syscalls(2) and intro(2) man pages.
The real work about sockets is done inside the kernel, it is the networking subsystem.
Here it is => socket.c.
Usually most of the socket functions, including this one, are just wrappers around system calls (direct calls to the kernel), hence it is all handled by the almighty kernel itself.
Here is the Kernel's implementation: SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol){...}
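To see that the glibc function is only a thin wrapper, here is a hedged sketch that creates a socket both ways. It assumes Linux with glibc; socket_both_ways is a made-up name, and AF_UNIX is used only so the example needs no network. SYS_socket is not defined on every architecture (older i386 headers route socket calls through socketcall), hence the #ifdef.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 0 if both paths hand back a valid descriptor, -1 otherwise. */
int socket_both_ways(void)
{
    /* Path 1: the glibc wrapper declared in sys/socket.h. */
    int via_wrapper = socket(AF_UNIX, SOCK_STREAM, 0);
    if (via_wrapper == -1)
        return -1;

#ifdef SYS_socket
    /* Path 2: ask the kernel for the same service by number. */
    long via_number = syscall(SYS_socket, AF_UNIX, SOCK_STREAM, 0);
#else
    /* Assumption: no direct socket number here, so reuse the wrapper. */
    long via_number = socket(AF_UNIX, SOCK_STREAM, 0);
#endif

    close(via_wrapper);
    if (via_number == -1)
        return -1;
    close((int)via_number);
    return 0;
}
```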
From a not so far removed picture of what is going on, could someone expound more on what is the difference between Linux's system calls like read() and write() etc. and writing them in assembly using the x86 INT opcode along with setting up the specified registers?
The actual function read() is a C library wrapper over what is called the 'system call gate'. The C library wrapper is primarily responsible for things like setting errno on failure, as well as mapping between structures used in userspace and those used by the low-level syscall.
The system call gate, in turn, is what actually switches from usermode to kernel mode. This depends on the CPU architecture - on x86, you have two options - one is to use INT 080h after setting up registers with the syscall number and arguments; another is to call into a symbol provided by a library mapped into every executable's address space, with the same register setup. This library then picks between several potential options for user->kernel transitions, including SYSENTER, SYSCALL, or a fallback to INT 080h. Other architectures use yet different methods. In any case, the CPU shifts into kernelspace, where the syscall number is used to lookup the appropriate handler in a big table.
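The errno handling mentioned above can be made visible with a small sketch. It assumes Linux; on x86-64 it issues the syscall instruction directly through GCC-style inline assembly, and on other targets it falls back to glibc's syscall() and undoes the translation. raw_syscall3 and the two demo functions are hypothetical names. The raw path returns the kernel's negative error code, while the wrapper returns -1 and sets errno.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Issue a three-argument system call and return the kernel's raw result,
 * where a failure is reported as a negative errno value. */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__linux__) && defined(__x86_64__) && !defined(__ILP32__)
    long ret;
    /* x86-64 convention: number in rax, arguments in rdi, rsi, rdx;
     * the syscall instruction clobbers rcx and r11. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                      : "rcx", "r11", "memory");
    return ret;
#else
    /* Fallback (assumption): undo the wrapper's errno translation. */
    long ret = syscall(nr, a1, a2, a3);
    return (ret == -1) ? -errno : ret;
#endif
}

/* read() on an invalid descriptor through the raw path. */
long raw_read_bad_fd(void)
{
    char buf[1];
    return raw_syscall3(SYS_read, -1, (long)buf, (long)sizeof buf);
}

/* The same call through the C library wrapper; returns errno on failure. */
int wrapped_read_bad_fd(void)
{
    char buf[1];
    errno = 0;
    ssize_t n = read(-1, buf, sizeof buf);
    return (n == -1) ? errno : 0;
}
```

Both paths fail with EBADF; they differ only in how the failure reaches the caller.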
An interrupt is not the only way to invoke a system call; you can use special instructions like sysenter or syscall, or simply jump to a specific address in protected mode.
How does linux 2.6 differ from 2.4?
Can we modify the source kernel?
Can we modify the int 0x80 service routine?
UPDATE:
1. the 0x80 handler is essentially the same between 2.4 and 2.6, although the function called from the handler is called by the 'syscall' instruction handler for x86-64 in 2.6.
2. the 0x80 handler can be modified like the rest of the kernel.
3. You won't break anything by modifying it, unless you remove backwards compatibility. E.g., you can add your own trace or backdoor if you feel so inclined. The other post that says you will break your libs and toolchain if you modify the handler is incorrect. If you break the dispatch algorithm, or modify the dispatch table incorrectly, then you will break things.
3a. As I originally posted, the best way to extend the 0x80 service is to extend the system call handler.
As the kernel source says:
What: The kernel syscall interface
Description:
This interface matches much of the POSIX interface and is based
on it and other Unix based interfaces. It will only be added to
over time, and not have things removed from it.
Note that this interface is different for every architecture
that Linux supports. Please see the architecture-specific
documentation for details on the syscall numbers that are to be
mapped to each syscall.
The system call table entries for i386 are in:
arch/i386/kernel/syscall_table.S
Note that the table is a sequence of pointers, so if you want to maintain a degree of forward compatibility with the kernel maintainers, you'd need to pad the table before placement of your pointer.
The syscall vector number is defined in irq_vectors.h
Then traps.c sets the address of the system_call function via set_system_gate, which places the entry into the interrupt descriptor table. The system_call function itself is in entry.S, and calls the requested pointer from the system call table.
There are a few housekeeping details, which you can see reading the code, but direct modification of the 0x80 interrupt handler is accomplished in entry.S inside the system_call function. In a more sane fashion, you can modify the system call table, inserting your own function without modifying the dispatch mechanism.
In fact, having read the 2.6 source, it says directly that int 0x80 and x86-64 syscall use the same code, so far. So you can make portable changes for x86-32 and x86-64.
END Update
The INT 0x80 method invokes the system call table handler. This matches register arguments to a call table, invoking kernel functions based on the contents of the EAX register. You can easily extend the system call table to add custom kernel API functions.
This may even work with the new syscall code on x86-64, as it uses the system call table, too.
If you alter the current system call table in any manner other than to extend it, you will break all dependent libraries and code, including libc, init, etc.
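Why extending is safe and reordering is not can be shown with a toy comparison. The tables below are hypothetical miniatures, not the real Linux table; the sketch counts how many call numbers used by existing binaries would change meaning under each modification.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature tables: position in the array = call number. */
static const char *const original_table[]  = { "restart", "exit", "fork", "read" };
static const char *const extended_table[]  = { "restart", "exit", "fork", "read", "my_call" };
static const char *const reordered_table[] = { "restart", "my_call", "exit", "fork", "read" };

#define COUNT(t) (sizeof (t) / sizeof (t)[0])

/* Count the old call numbers whose meaning differs in the new table. */
static size_t count_broken(const char *const *old_t, size_t old_n,
                           const char *const *new_t)
{
    size_t broken = 0;
    for (size_t nr = 0; nr < old_n; nr++)
        if (strcmp(old_t[nr], new_t[nr]) != 0)
            broken++;
    return broken;
}

/* Appending keeps every existing number pointing at the same call. */
size_t broken_by_extending(void)
{
    return count_broken(original_table, COUNT(original_table), extended_table);
}

/* Inserting in the middle shifts every later call to a new number. */
size_t broken_by_reordering(void)
{
    return count_broken(original_table, COUNT(original_table), reordered_table);
}
```

In this toy setup, appending breaks nothing, while inserting my_call at position 1 changes the meaning of numbers 1, 2, and 3 for every program already compiled against the old numbering.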
Here's the current Linux system call table: http://asm.sourceforge.net/syscall.html
It's an architectural overhaul. Everything has changed internally. SMP support is complete, the process scheduler is vastly improved, memory management got an overhaul, and many, many other things.
Yes. It's open-source software. If you do not have a copy of the source, you can get it from your vendor or from kernel.org.
Yes, but it's not advisable because it will break libc, it will break your baselayout, and it will break your toolchain if you change the sequence of existing syscalls, and nearly everything you might think you want to do should be done in userspace when at all possible.