Kernel 2.6.24.6
I'm writing a syscall interposer. I'd like to interpose mmap, but methinks I need to copy and paste the code up to the actual system call. I can't find that code to paste. Where is it?
Thanks
The mmap(2) C function just performs a syscall. You might find the musl libc code easier to read: its mmap.c file wraps the syscall (perhaps using mmap2). The actual processing of syscalls is done inside the kernel.
The Linux Assembly HOWTO explains how syscalls are actually made. See also the x86-64 ABI spec.
Because the kernel can't actually provide functions in your program, the calls into the kernel (or into the vDSO that the kernel maps into the application) are wrapped by the libc implementation you're using. Most likely, you're using GNU libc, for which you can find the sources here.
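For an interposer, the part of the libc wrapper that matters can be approximated with syscall(2). The following is only a sketch, not the glibc or musl code: raw_anon_mmap is a hypothetical helper name, and it assumes a Linux target where sys/syscall.h defines either SYS_mmap2 (32-bit ports, offset counted in pages) or SYS_mmap.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of anonymous memory by issuing the
 * system call directly instead of going through the libc mmap() wrapper. */
static void *raw_anon_mmap(size_t len)
{
    long prot = PROT_READ | PROT_WRITE;
    long flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef SYS_mmap2
    /* 32-bit ports: the last argument is an offset in pages (0 here). */
    long ret = syscall(SYS_mmap2, NULL, len, prot, flags, (long)-1, 0L);
#else
    /* 64-bit ports: the last argument is an offset in bytes (0 here). */
    long ret = syscall(SYS_mmap, NULL, len, prot, flags, (long)-1, 0L);
#endif
    /* syscall() reports failure as -1 with errno set. */
    return ret == -1 ? MAP_FAILED : (void *)ret;
}
```

An interposer would put its own bookkeeping before and after the syscall() line; the returned region is released with munmap() as usual.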
You can download a kernel source and look at (Kernel Source)/drivers/char/mem.c
There you can find a simple implementation for mmap with the prototype:
static int mmap_mem(struct file *file, struct vm_area_struct *vma)
At some point, mmap() should call remap_pfn_range() to remap kernel memory to user space.
I think this is the simplest implementation of mmap() for a driver, but you can also look at other drivers' code to see how they implement mmap(). Look for the following structure in the driver code:
static const struct file_operations mem_fops = {
    .llseek = memory_lseek,
    .read = read_mem,
    .write = write_mem,
    .mmap = mmap_mem, // Function implementing mmap
    .open = open_mem,
    .get_unmapped_area = get_unmapped_area_mem,
};
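From user space, a driver's .mmap method is reached simply by calling mmap(2) on the device file. As a rough illustration, the sketch below maps /dev/zero, which is served by the same drivers/char/mem.c through its own file_operations table. It assumes /dev/zero exists and is readable and writable; map_dev_zero is a hypothetical helper name.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of /dev/zero privately.
 * Returns MAP_FAILED if the device cannot be opened or mapped. */
static void *map_dev_zero(size_t len)
{
    int fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return MAP_FAILED;

    /* The kernel forwards this call to the .mmap method of the file. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p;
}
```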
I do not quite understand what poll() does, even after searching for it on Google. Is there any documentation for this function, or for all the interfaces in file_operations?
The poll() method in the VFS struct file_operations is (quoting the kernel VFS documentation):
called by the VFS when a process wants to check if there is activity on this file and (optionally) go to sleep until there is activity. Called by the select(2) and poll(2) system calls
Note that the struct contains a pointer to a function with the signature __poll_t (*poll) (struct file *, struct poll_table_struct *);. If I remember correctly, this and most other file_operations methods are defined, and the pointers populated, either by the specific filesystem itself (usually in its file.c or similar) or by the filesystem calling out to the VFS, which populates them with code specific to the underlying block/character device.
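The user-space side of this is the poll(2) system call, which ends up invoking the poll method of whatever file the descriptor refers to. A minimal sketch, using a pipe as the file; readable_now is a hypothetical helper name:

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel, without sleeping (timeout 0),
 * whether `fd` has data to read.
 * Returns 1 if readable, 0 if not, -1 if poll() itself failed. */
static int readable_now(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, 0);
    if (n < 0)
        return -1;
    return n > 0 && (pfd.revents & POLLIN) != 0;
}
```

On an empty pipe this reports "no activity"; after one byte is written to the other end it reports the read end as readable. A non-zero timeout would instead put the process to sleep until there is activity, which is the "optionally go to sleep" part of the quote.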
What is the difference between SYS_exit, sys_exit() and exit()?
What I understand :
The linux kernel provides system calls, which are listed in man 2 syscalls.
There are wrapper functions of those syscalls provided by glibc which have mostly similar names as the syscalls.
My question : In man 2 syscalls, there is no mention of SYS_exit and sys_exit(), for example. What are they?
Note : The syscall exit here is only an example. My question really is : What are SYS_xxx and sys_xxx()?
I'll use exit() as in your example although this applies to all system calls.
The functions of the form sys_exit() are the actual entry points to the kernel routine that implements the function you think of as exit(). These symbols are not even available to user-mode programmers. That is, unless you are hacking the kernel, you cannot link to these functions because their symbols are not available outside the kernel. If I wrote libmsw.a which had a file scope function like
static int msw_func() {}
defined in it, you would have no success trying to link to it because it is not exported in the libmsw symbol table; that is:
cc your_program.c libmsw.a
would yield an error like:
ld: cannot resolve symbol msw_func
because it isn't exported; the same applies for sys_exit() as contained in the kernel.
In order for a user program to get to kernel routines, the syscall(2) interface needs to be used to effect a switch from user mode to kernel mode. When that mode switch (sometimes called a trap) occurs, a small integer is used to look up the proper kernel routine in a kernel table that maps integers to kernel functions. An entry in the table has the form
{SYS_exit, sys_exit},
where SYS_exit is a preprocessor macro, which is
#define SYS_exit (1)
and has been 1 on i386 since before you were born, because there hasn't been a reason to change it (other architectures use their own numbering; on x86-64 it is 60). It sits near the start of the i386 table of system calls, and the lookup is a simple array index.
As you note in your question, the proper way for a regular user-mode program to access sys_exit is through the thin wrapper in glibc (or similar core library). The only reason you'd ever need to mess with SYS_exit or sys_exit is if you were writing kernel code.
This is now addressed in man 2 syscalls itself:
Roughly speaking, the code belonging to the system call with number __NR_xxx defined in /usr/include/asm/unistd.h can be found in the Linux kernel source in the routine sys_xxx(). (The dispatch table for i386 can be found in /usr/src/linux/arch/i386/kernel/entry.S.) There are many exceptions, however, mostly because older system calls were superseded by newer ones, and this has been treated somewhat unsystematically. On platforms with proprietary operating-system emulation, such as parisc, sparc, sparc64, and alpha, there are many additional system calls; mips64 also contains a full set of 32-bit system calls.
At least now, /usr/include/asm/unistd.h is a preprocessor hack that includes one of:
/usr/include/asm/unistd_32.h
/usr/include/asm/unistd_x32.h
/usr/include/asm/unistd_64.h
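The numbering can be observed from user space. The sketch below assumes glibc's sys/syscall.h, which pulls in the __NR_xxx constants and defines the matching SYS_xxx names. It uses getpid rather than exit so that the example keeps running; the helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: invoke the getpid system call by number,
 * bypassing the getpid() wrapper in libc. */
static pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* SYS_exit and __NR_exit are two spellings of the same small integer. */
static int exit_numbers_agree(void)
{
    return SYS_exit == __NR_exit;
}
```

raw_getpid() returns the same value as getpid(), and exit_numbers_agree() returns 1; neither touches sys_getpid() or sys_exit() by name, since those symbols exist only inside the kernel.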
The C function exit() is defined in stdlib.h. Think of this as a high level event driven interface that allows you to register a callback with atexit()
/* Call all functions registered with `atexit' and `on_exit',
in the reverse of the order in which they were registered,
perform stdio cleanup, and terminate program execution with STATUS. */
extern void exit (int __status) __THROW __attribute__ ((__noreturn__));
So essentially the kernel headers provide constants called __NR_xxx, which are the system call numbers. The SYS_xxx macros (SYS_exit, for example) are defined in sys/syscall.h as other names for those numbers, and sys_xxx() (sys_exit(), for example) is the kernel-internal function that implements the call. The exit() function is part of the standard C library, declared in stdlib.h, and is ported to other operating systems that lack the Linux kernel ABI entirely (there may be no __NR_xxx constants) and potentially don't have sys_* functions either (you could write exit() to issue the interrupt yourself, or use the vDSO, in assembly).
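The practical difference between the library-level exit() and going more or less straight to the kernel can be seen with atexit(). This is a sketch under some assumptions: a POSIX system with fork() and pipe(), and _exit(2) standing in for the raw system call. handler_ran and on_exit_handler are hypothetical names.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1;

/* Registered with atexit(); tells the parent that it was called. */
static void on_exit_handler(void)
{
    ssize_t r = write(report_fd, "A", 1);
    (void)r;
}

/* Hypothetical helper: fork a child that registers an atexit() handler and
 * then terminates with exit() (use_library_exit != 0) or _exit().
 * Returns 1 if the handler ran in the child, 0 if not, -1 on setup error. */
static int handler_ran(int use_library_exit)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        report_fd = fds[1];
        atexit(on_exit_handler);
        if (use_library_exit)
            exit(0);  /* C library: runs handlers, then asks the kernel */
        _exit(0);     /* thin wrapper: terminates without running them */
    }

    close(fds[1]);
    char c;
    ssize_t n = read(fds[0], &c, 1); /* 0 means EOF: nothing was written */
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 1;
}
```

handler_ran(1) reports that the handler ran, handler_ran(0) that it did not: the callbacks are a libc feature that the kernel knows nothing about.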
I would like to know where brk is defined in kernel source versions >= 2.6. That is, which C file contains its definition? grep is not revealing much. Also, sbrk is implemented in glibc, correct?
It's in mm/mmap.c. Look for:
SYSCALL_DEFINE1(brk, unsigned long, brk)
The manual page says:
On Linux, sbrk() is implemented as a library function that uses the
brk() system call, and does some internal bookkeeping so that it can
return the old break value.
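To make the manual's description concrete, here is a rough sketch of an sbrk()-like function built on the raw brk system call. It is not the glibc implementation; my_sbrk is a hypothetical name, and it relies on the Linux behaviour that the raw call returns the resulting break (the current one if the request is refused). Moving the break behind the C library's back can confuse malloc(), so this is for illustration only.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical sbrk() look-alike: returns the previous program break,
 * or (void *)-1 with errno set to ENOMEM if the break could not be moved. */
static void *my_sbrk(intptr_t increment)
{
    /* Asking for address 0 is always refused, so the kernel just
     * reports the current break. */
    uintptr_t old_brk = (uintptr_t)syscall(SYS_brk, 0L);
    if (increment == 0)
        return (void *)old_brk;

    uintptr_t wanted = old_brk + (uintptr_t)increment;
    uintptr_t new_brk = (uintptr_t)syscall(SYS_brk, (long)wanted);
    if (new_brk != wanted) {
        errno = ENOMEM;
        return (void *)-1;
    }
    return (void *)old_brk; /* the "old break value" from the man page */
}
```

The real library function remembers the current break in a variable instead of asking the kernel each time; that is the internal bookkeeping the manual page mentions.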
In Linux, to create a socket we include the sys/socket.h header file and use the socket() function. The header file is located at /usr/include/sys/socket.h.
extern int socket (int __domain, int __type, int __protocol) __THROW;
Can anyone please tell me where the socket() function is actually implemented?
Thanks.
Actually,
int socket (int __domain, int __type, int __protocol) __THROW
is implemented in glibc,
and glibc calls the kernel function sys_socket, implemented in the kernel file net/socket.c:
asmlinkage long sys_socket(int family, int type, int protocol);
socket(2) is a system call. The socket function inside glibc is just a tiny wrapper that makes the real system call.
From an application's point of view, system calls are atomic; in other words, the virtual machine on which your Linux application program is running is the x86 machine (the non-privileged instruction set) augmented with the more than 300 system calls provided by the kernel. See also the Assembly HOWTO, which explains how a system call can be coded. Read more about the Linux kernel and the syscalls(2) and intro(2) man pages.
The real work on sockets is done inside the kernel, in the networking subsystem.
Here it is => socket.c.
Usually most of the socket functions, including this one, are just wrappers around system calls (direct calls to the kernel), hence it is all handled by the almighty kernel itself.
Here is the Kernel's implementation: SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol){...}
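How thin the wrapper is can be checked by making the call by number. This is a sketch with some assumptions: SYS_socket is only defined on ports that have a direct socket system call (older 32-bit x86 kernels multiplex it through socketcall(2) instead), so the code falls back to the ordinary wrapper there. raw_socket and socket_type are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: create a socket without the libc socket() wrapper
 * where the port offers a direct system call. Returns -1 on error. */
static int raw_socket(int domain, int type, int protocol)
{
#ifdef SYS_socket
    return (int)syscall(SYS_socket, (long)domain, (long)type, (long)protocol);
#else
    /* No direct system call number on this port: use the wrapper. */
    return socket(domain, type, protocol);
#endif
}

/* Ask the kernel what type of socket `fd` is, or -1 on error. */
static int socket_type(int fd)
{
    int type = -1;
    socklen_t len = sizeof type;
    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) != 0)
        return -1;
    return type;
}
```

A descriptor obtained from raw_socket(AF_UNIX, SOCK_STREAM, 0) behaves like one from socket(): socket_type() reports SOCK_STREAM for it, and close() releases it.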
I can't find this information anywhere. Everywhere I look, I find things referring to how the stack looks once you hit "main" (whatever your entry point is), which would be the program arguments, and environment, but what I'm looking for is how the system sets up the stack to cooperate with the switch_to macro. The first time the task gets switched to, it would need to have EFLAGS, EBP, the registers that GCC saves, and the return address from the schedule() function on the stack pointed to by "tsk->thread->esp", but what I can't figure out is how the kernel sets up this stack, since it lets GCC save the general purpose registers (using the output parameters for inline assembly).
I am referring to x86 PCs only. I am researching the Linux scheduler/process system for my own small kernel I am (attempting) to write, and I can't get my head around what I'm missing. I know I'm missing something since the fact that Slackware is running on my computer is a testament to the fact that the scheduler works :P
EDIT: I seem to have worded this badly. I am looking for information on how the task's kernel stack is set up, not how the task's user stack is set up. More specifically, the stack which tsk->thread->esp points to, and that "switch_to" switches to.
The initial kernel stack for a new process is set in copy_thread(), which is an arch-specific function. The x86 version, for example, starts out like this:
int copy_thread(unsigned long clone_flags, unsigned long sp,
                unsigned long unused,
                struct task_struct *p, struct pt_regs *regs)
{
    struct pt_regs *childregs;
    struct task_struct *tsk;
    int err;

    childregs = task_pt_regs(p);
    *childregs = *regs;
    childregs->ax = 0;
    childregs->sp = sp;

    p->thread.sp = (unsigned long) childregs;
    p->thread.sp0 = (unsigned long) (childregs+1);

    p->thread.ip = (unsigned long) ret_from_fork;
p->thread.sp and p->thread.ip are the new thread's kernel stack pointer and instruction pointer respectively.
Note that it does not place a saved %eflags, %ebp etc there, because when a newly-created thread of execution is first switched to, it starts out executing at ret_from_fork (this is where __switch_to() returns to for a new thread), which means that it doesn't execute the second-half of the switch_to() routine.
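The same idea can be tried in user space, which may help when designing a small kernel. The following is only an analogy, not kernel code: it uses the glibc ucontext functions, where makecontext() fabricates an initial context whose saved instruction pointer is an entry function, much as copy_thread() points thread.ip at ret_from_fork. All names other than the ucontext calls are hypothetical.

```c
#include <ucontext.h>

static ucontext_t sched_ctx, task_ctx;
static char task_stack[64 * 1024];
static int trace[8];
static int trace_len;

/* First code the new context ever runs: it was never "switched out",
 * so it does not resume inside the switch routine. */
static void task_entry(void)
{
    trace[trace_len++] = 2;
    swapcontext(&task_ctx, &sched_ctx); /* switch away, like schedule() */
    trace[trace_len++] = 4;             /* resumed in the middle this time */
}

/* Returns the number of trace entries recorded, or -1 on error. */
static int run_switch_demo(void)
{
    trace_len = 0;
    if (getcontext(&task_ctx) != 0)
        return -1;
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link = &sched_ctx;          /* where to go when entry returns */
    makecontext(&task_ctx, task_entry, 0);  /* fabricate the initial state */

    trace[trace_len++] = 1;
    swapcontext(&sched_ctx, &task_ctx);     /* first switch: lands at entry */
    trace[trace_len++] = 3;
    swapcontext(&sched_ctx, &task_ctx);     /* second switch: resumes task */
    trace[trace_len++] = 5;
    return trace_len;
}
```

The trace comes out as 1, 2, 3, 4, 5: only the first switch into the task starts at the entry point, and every later switch resumes where the task last gave up the CPU.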
The state of the stack at process creation is described in the x86-64 SVR4 ABI supplement (for AMD64, i.e. 64-bit x86-64 machines). The equivalent for 32-bit Intel processors is probably the i386 ABI. I also strongly recommend reading the Assembly HOWTO. And of course, you should perhaps read the relevant Linux kernel file.
Google for "linux stack layout process startup" gives this link: "Startup state of a Linux/i386 ELF binary", which describes the set up that the kernel performs just before transferring control to the libc startup code.