How an actual system call is made? - linux

I have a question about how an actual system call is made. I know that the magic of a system call (like read etc.) is done in the C library, but I don't understand the exact mechanism. My main issues are:
1. The C library routine is in user address space; how, then, can it get the address of the interrupt service routines? Are interrupt service routines predefined (at boot) in physical memory?
2. Even if the ISR is somehow called, how does the address space change? I mean, before we start executing the ISR, how will the 'page table base register' change to point to the kernel's page table? If the 'C' routine does it, how does it know the address of the kernel's page table?
3. How are parameters copied from user space to kernel space?
Please excuse me if my questions are too basic but I am new to this. :)
Thanks
Rohit

On most systems, there's an instruction that user code can execute to raise a "software interrupt" (for example, int on x86 and swi on ARM).
The CPU, executing in user mode, will switch to kernel mode upon seeing one of these instructions, and will jump to a predefined ISR location for that particular interrupt. The interrupt number is typically fixed, and the corresponding ISR is the system call handler for the kernel.
The kernel can inspect the user-mode registers and stack which were present at the time the interrupt was called (in a manner similar to saving all registers on the stack during a context switch), and obtain the system call arguments from there.
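To make the mechanism above concrete, here is a minimal sketch for Linux with glibc. It bypasses the usual getpid() wrapper and goes through the generic syscall() function, which places the call number in the register the kernel expects and executes the trap instruction. The helper name raw_getpid is made up for this illustration.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>  /* SYS_getpid */
#include <unistd.h>       /* syscall(), getpid() */

/* raw_getpid is a hypothetical helper name: it skips the getpid() wrapper
 * and asks the kernel directly through the generic syscall() entry point. */
long raw_getpid(void)
{
    /* syscall() loads the call number (and any arguments) into the
     * registers the kernel expects, then executes the architecture's
     * trap instruction. The kernel reads those registers on entry. */
    return syscall(SYS_getpid);
}
```

Calling raw_getpid() should return the same value as the library's getpid(), since both end up in the same kernel handler.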

OK, I think I found the answer (at least I think so) in "questions about kernel space".
1. The C library routine is in user address space; then how can it get the address of the interrupt service routines? Are interrupt service routines predefined (on boot up) in physical memory?
The ISR location is predefined, as answered by nneonneo above.
2. Even if somehow the ISR routine is called, how does the address space change? I mean, before we start the execution of the ISR, how will the 'page table base register' change to point to the kernel's page table? If the 'C' routine does it, then how does it know the address of the kernel's page table?
There is no change of address space, because the kernel is mapped into the same address space as the user code; only the protection level differs.

Related

Trap instruction: why must the program counter and processor status register be changed atomically?

I came across the following problem on a previous exam from my operating systems class.
Consider an architecture in which the TRAP instruction has the following effects: loading a predefined value into the Processor Status Register (PCR), which contains the user/kernel mode bit, saving the value of the Program Counter (PC) to a special Save PC register, and loading a predefined value into the PC. Explain why loading a new value for the PCR without also changing the PC in the same instruction cycle would be unsafe.
I know that the PCR would be set to kernel mode with memory management off. Is it unsafe because the PC is still in the user program? If so where could it go wrong? If not why is it unsafe? Why would changing the PC first also be unsafe?
Aside: there is no reason to assume that "memory management" is turned "off" by loading the new processor status; in fact, in the CPUs in my experience that would not happen. But that is not relevant to this answer.
We're executing in user mode and a TRAP instruction is fetched. The program counter is then (let's say) pointing to the instruction after TRAP.
Now the processor executes the TRAP. It loads the new processor status, which switches the CPU to kernel mode. Assume this does not in itself inhibit device interrupts.
NOW... a device interrupts. The hardware or software mechanism saves the processor status (=kernel mode) and program counter (=the user-mode address of the instruction after TRAP). The device interrupt service routine does its thing and executes a return from interrupt to restore program counter and processor status. We can't resume "half-way through the TRAP instruction" - the only thing that can happen is that we start to execute the instruction that PC points to, i.e., we're executing the instruction after the TRAP but in kernel mode.
The exact problem depends on the system architecture:
If the kernel address map is a superset of the user address map (typical on OSes where user space is half the total address space) then we're executing user-provided code in kernel mode, which is at least a serious privilege problem, and may cause us to fail by page faulting when we can't handle it.
If the kernel address map doesn't include user space (frequently the case on systems with limited virtual address size) then this is equivalent to taking a wild jump into the kernel.
The summary is that you need both the processor status and program counter to define "where you are in execution", and they both need to be saved/updated together; or in other words, no change of control (such as an interrupt) can be permitted in the middle.
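The argument above can be sketched as a toy model in C. This is not how any real CPU is implemented; the struct, the function names, and the two addresses are all invented for illustration. It only shows that an interrupt landing between the mode change and the PC change leaves the machine in kernel mode at a user address.

```c
#include <stdbool.h>

/* Toy CPU state: just the two things the TRAP instruction must update. */
struct cpu {
    bool kernel_mode;   /* mode bit from the processor status register */
    unsigned pc;        /* program counter */
};

/* Hypothetical addresses chosen only for this illustration. */
enum { USER_PC_AFTER_TRAP = 0x1004, TRAP_HANDLER_PC = 0x8000 };

/* A device interrupt saves the state, runs its handler, and restores the
 * state exactly as it found it. */
void device_interrupt(struct cpu *c)
{
    struct cpu saved = *c;
    /* ... the interrupt service routine would run here ... */
    *c = saved;
}

/* Correct TRAP: both fields change before any interrupt can be taken. */
struct cpu trap_atomic(struct cpu c)
{
    c.kernel_mode = true;
    c.pc = TRAP_HANDLER_PC;
    device_interrupt(&c);   /* an interrupt now sees a consistent state */
    return c;
}

/* Broken TRAP: an interrupt arrives after the mode change but before the
 * PC change. Execution cannot resume mid-instruction, so the PC update is
 * lost and the CPU continues at the saved user PC in kernel mode. */
struct cpu trap_split(struct cpu c)
{
    c.kernel_mode = true;
    device_interrupt(&c);   /* restores kernel mode + user PC */
    return c;
}
```

Starting from user mode at USER_PC_AFTER_TRAP, trap_atomic ends in kernel mode at the handler, while trap_split ends in kernel mode still pointing at user code.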

Linux - Accessing mmap()ed memory from Thread from Userspace in Kernel Space

Mapped this memory in my Thread in Userspace:
b7fd0000-b7fd1000 rwxp 00000000 00:00 0
Thread is running (endless loop)
Made a breakpoint in the Kernel and trying to access it:
Thread 466 received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 3908]
0xc10d4060 in kgdb_breakpoint ()
(gdb) x/01i 0xb7fd0000
0xb7fd0000: Cannot access memory at address 0xb7fd0000
But it is not accessible.
How can I access 0xb7fd0000 from Kernel space? What address will be under?
Is it even possible?
Thanks,
The address the memory will appear under depends on which user space context is currently mapped.
The way this works is, some of the virtual addresses are reserved to the kernel, and these are the same in all contexts. This is why you can set a break point on a kernel address without worrying about which user space process is currently mapped.
For user space, this is not the case. Each time a new process is mapped, the virtual addresses for user space change completely.
This is, likely, an X-Y problem. You're trying to do something, and you think that a kernel level break point is how you want to achieve it.
Taking a guess, you want your kernel driver to do something to communicate with your user space thread. If that's the case, your best bet is to export a character device, and have the userspace open it and mmap from there (rather than just an anonymous mmap). You can then control which memory it receives, and thus also map it to the kernel address space, where pointers are stable.
I'm no expert in kgdb, but it is worth checking that current->pid, at the time the breakpoint is hit, contains the pid of the process you're tracking. Although a location inside kernel space has been hit, the current user space process may not be the one you are interested in. Also, to guard against the possibility of pages being swapped out, it might be safer to lock the mapped pages using mlock.
There are at least two things to pay attention to.
1. When to break. If you simply hit Ctrl-C in the kernel debugger, you don't know what the userspace context is going to be. It could be any userspace process. You want the kernel debugger to pause and give you control when the userspace context refers to the process you are interested in. One way to do this is as follows: if you have the ability to recompile the kernel, add a new system call. Invoke this system call from the userspace process after the region is mmap'd in. Start debugging the kernel after placing a breakpoint on the newly added system call. When the breakpoint is hit, you know for a fact that the userspace context is that of the process you are interested in.
2. Virtual memory late binding. Even if you follow the steps in [1], you will still have trouble accessing the contents of the buffer in userspace if you have not read or written anything at that location. Make sure that after mmap'ing the region, you either read or write to the mmap'd location before invoking your newly added system call.
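The userspace side of the two suggestions above (touch the page after mmap, and optionally lock it) might look like the following sketch. The helper name map_and_touch is made up, and mlock is treated as best effort because it can fail when RLIMIT_MEMLOCK is low.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* map_and_touch is a hypothetical helper: it returns a one-page anonymous
 * mapping that has already been faulted in, or NULL on failure. */
unsigned char *map_and_touch(unsigned char marker)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return NULL;

    unsigned char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* The first write triggers the page fault that actually backs the
     * mapping with a physical page; until then there is nothing for a
     * debugger to read at this address. */
    p[0] = marker;

    /* Best effort: pin the page so it cannot be swapped out. A failure
     * (for example a low RLIMIT_MEMLOCK) is ignored in this sketch. */
    (void)mlock(p, (size_t)page);

    return p;
}
```

After this returns, the process would invoke whatever marker system call the kernel breakpoint is set on.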

Performance Read() and Write() to/from Linux SKB's

Consider a standard Linux system, where there is a userland application and the kernel network stack. I've read that moving frames from user space to kernel space (and vice versa) can be expensive in terms of CPU cycles.
My questions are,
Why? And does moving the frame in one direction (i.e. from user to kernel) have a higher impact?
Also, how do things differ when you move to TAP-based interfaces, as the frame will still be going between user and kernel space? Do the same concerns apply, or is there some form of zero-copy in play?
Addressing questions in-line:
Why? And does moving the frame in one direction (i.e. from user to kernel) have a higher impact?
Moving to/from user/kernel spaces is expensive because the OS has to:
Validate the pointers for the copy operation.
Transfer the actual data.
Incur the usual costs involved in transitioning between user/kernel mode.
There are some exceptions to this, such as if your driver implements a strategy such as "page flipping", which effectively remaps a chunk/page of memory so that it is accessible to a userspace application. This is "close enough" to a zero copy operation.
With respect to copy_to_user/copy_from_user performance, the performance of the two functions is apparently comparable.
Also, how do things differ when you move to TAP-based interfaces, as the frame will still be going between user and kernel space? Do the same concerns apply, or is there some form of zero-copy in play?
With TUN/TAP-based interfaces, the same considerations apply, unless you're using some sort of DMA, page flipping, or similar logic.
Context Switch
Moving frames from user space to kernel space requires a switch from user mode into kernel mode (loosely called a context switch here), which is usually caused by a system call (which invokes the int 0x80 interrupt on 32-bit x86).
An interrupt happens, entering kernel space;
When the interrupt happens, the OS stores all of the register values on the kernel stack of the thread: ds, es, fs, eax, cr3, etc.;
Then it jumps to the IRQ handler, like a function call;
Through some common IRQ execution path, it chooses the next thread to run by some scheduling algorithm;
The runtime info (all the registers) is loaded from the next thread;
Back to user space.
As we can see, a lot of work is done when moving a frame into or out of the kernel, much more than for a simple function call (which just sets ebp, esp, and eip). That is why this operation is relatively time-consuming.
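One rough way to see the difference described above is to time a plain function call against a cheap system call. The sketch below assumes Linux, glibc, and a GCC-compatible compiler (for the noinline attribute); all function names are invented for this example, and the numbers it produces will vary by machine, so none are quoted here.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Deliberately not inlined so the timing loop pays for a real call. */
__attribute__((noinline)) long plain_function(void)
{
    static volatile long counter;
    return ++counter;
}

/* Enters and leaves the kernel once; getpid does almost no work there,
 * so the cost is dominated by the user/kernel transition. */
long trap_into_kernel(void)
{
    return syscall(SYS_getpid);
}

/* Average wall-clock nanoseconds per call of fn over iters iterations.
 * Returns 0.0 if iters is not positive. */
double ns_per_call(long (*fn)(void), long iters)
{
    struct timespec t0, t1;
    volatile long sink = 0;

    if (iters <= 0)
        return 0.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        sink += fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

Comparing ns_per_call(plain_function, n) with ns_per_call(trap_into_kernel, n) for a large n gives a feel for the per-transition overhead, before any data copying is counted.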
Virtual Devices
As a virtual network device, writing to a TAP interface is no different from writing to a /dev/xxx device.
If you write to TAP, the OS is interrupted as described above, then it copies your arguments into the kernel and blocks your current thread (in blocking I/O). The kernel driver thread is notified in some way (e.g. a message queue) to receive the arguments and consume them.
In Android, there exist some zero-copy system calls, and in my demo implementation this is done through address translation between user and kernel. Because the kernel and the user thread do not share the same address space, and the user thread's data may change, we usually copy data into the kernel. So we can avoid the copy if these conditions are met:
the system call must block, i.e. the data won't change;
addresses are translated through the page tables, i.e. the kernel can refer to the right data.
Code
The following code from my demo OS is related to this question, if you are interested in the details:
interrupt handle procedure: do_irq.S, irq_handle.c
system call: syscall.c, ide.c
address translation: MM_util.c

Which part of Linux kernel enforces privilege separation and how?

I want to know how privilege separation is enforced by the kernel and the part of kernel that is responsible for this task.
For example, assume there are two processes running -- one at ring 0 and another at ring 3. How does the kernel keep track of the ring number of each process?
Edit: I know about ring numbers. My question is about the part of kernel (module or something) which performs checks on the processes to find out their privilege level. I believe there might be a component of kernel which would check the ring number of a process.
There is no concept of a ring number of a process.
The kernel is mapped in one area of memory, userspace is mapped in another. On boot the kernel specifies an address where the cpu has to jump to when the syscall instruction is executed. So someone does syscall, the cpu switches to ring0 and jumps to the address as instructed by the kernel. It is now executing kernel code. Then, on return, the cpu switches back to ring3 and resumes execution.
Similar story for other ways of entering the kernel like exceptions.
So, how does the Linux kernel enforce separation? It sets things up so that userspace executes in ring 3. Anything that triggers the CPU to switch to ring 0 also makes it jump to an address configured by the kernel at boot. No code other than kernel code executes in ring 0.
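The effect of this setup can be observed from userspace: code running in ring 3 that touches an address it has no user mapping for is stopped by the hardware, and the kernel delivers SIGSEGV. The sketch below forks a child to attempt the read so the parent survives. The helper name is made up, and using the topmost page of the address space as an address that is never user-accessible is an assumption about mainstream Linux architectures.

```c
#include <signal.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a user-mode read of addr is killed by SIGSEGV, 0 if the
 * read succeeds, and -1 on any other outcome. Hypothetical helper. */
int user_read_faults(uintptr_t addr)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: attempt the read from ring 3. volatile keeps the
         * compiler from optimizing the access away. */
        volatile const unsigned char *p = (volatile const unsigned char *)addr;
        unsigned char byte = *p;
        (void)byte;
        _exit(0);
    }

    /* Parent: see how the child ended. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV)
        return 1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;
    return -1;
}
```

Passing the address of an ordinary local variable should give 0, while passing UINTPTR_MAX - 4095 (the top page, assumed here to be kernel-only) should give 1.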

How does copy_from_user from the Linux kernel work internally?

How exactly does the copy_from_user() function work internally? Does it use any buffers or is there any memory mapping done, considering the fact that kernel does have the privilege to access the user memory space?
The implementation of copy_from_user() is highly dependent on the architecture.
On x86 and x86-64, it simply does a direct read from the userspace address and write to the kernelspace address, while temporarily disabling SMAP (Supervisor Mode Access Prevention) if it is configured. The tricky part of it is that the copy_from_user() code is placed into a special region so that the page fault handler can recognise when a fault occurs within it. A memory protection fault that occurs in copy_from_user() doesn't kill the process like it would if it were triggered by any other process-context code, or panic the kernel like it would if it occurred in interrupt context; it simply resumes execution in a code path which returns -EFAULT to the caller.
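The -EFAULT path described above is visible from userspace. In the sketch below (the helper name is invented), write() is handed a source buffer that lies in a PROT_NONE mapping. The kernel's copy from user memory faults, the fault is fixed up rather than killing the process, and the system call fails with EFAULT.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the errno produced by write() when its source buffer is a user
 * address with no readable mapping, 0 if the write unexpectedly succeeds,
 * or -1 if setup fails. Hypothetical helper. */
int errno_for_bad_write_buffer(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    /* A PROT_NONE page is a plausible-looking user address that faults
     * on any access, including the kernel's copy on our behalf. */
    void *bad = mmap(NULL, 4096, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (bad == MAP_FAILED) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    errno = 0;
    ssize_t n = write(fds[1], bad, 16);
    int err = (n < 0) ? errno : 0;

    munmap(bad, 4096);
    close(fds[0]);
    close(fds[1]);
    return err;
}
```

The process keeps running after the failed call, which is the behaviour the fault fixup mechanism exists to provide.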
Regarding "how about copy_to_user: since the kernel is passing on the kernel space address, how can a user space process access it?"
A user space process can attempt to access any address. However, if the address is not mapped in that process's user space (i.e. in the page tables of that process), or if there is a problem with the access, like a write attempt to a read-only location, then a page fault is generated. Note that at least on x86, every process has all of kernel space mapped in the upper 1 gigabyte of that process's virtual address space, while the lower 3 gigabytes of the 4GB total address space (I'm using the classic 32-bit case here) are used for the process text (i.e. code) and data.
A copy to or from user space is executed by the kernel code that is executing on behalf of the process and actually it's the memory mapping (i.e. page tables) of that process that are in-use during the copy. This takes place while execution is in kernel mode - i.e. privileged/supervisor mode in x86 language.
Assuming the user-space code has passed a legitimate target location (i.e. an address properly mapped in that process's address space) to have data copied to, copy_to_user, run from kernel context, can write to that address/region without problems, and after control returns to the user, user space can read from this location, which the process itself set up to start with.
More interesting details can be found in chapters 9 and 10 of Understanding the Linux Kernel, 3rd Edition, by Daniel P. Bovet and Marco Cesati. In particular, access_ok() is a necessary but not sufficient validity check. The user can still pass addresses that do not belong to the process address space. In this case, a Page Fault exception will occur while the kernel code is executing the copy. The most interesting part is how the kernel page fault handler determines that the page fault in such a case is not due to a bug in the kernel code but rather to a bad address from the user (especially if the kernel code in question is from a loaded kernel module).
The best answer has something wrong: copy_(from|to)_user can't be used in interrupt context, because they may sleep. They can only be used in process context.
The process's page table includes all the information that the kernel needs to access user memory, so the kernel could access a user space address directly if we could be sure the addressed page is in memory. We use the copy_(from|to)_user functions because they check this for us, and if the addressed user space page is not resident, the fault is handled for us.
The copy_from_user() kernel function works with two buffers from different address spaces:
The user-space buffer in user virtual address space.
The kernel-space buffer in kernel virtual address space.
When copy_from_user() is called, data is copied from the user buffer to the kernel buffer.
Part of a character device driver (the write operation) where copy_from_user() is used is given below:
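ssize_t cdev_fops_write(struct file *flip, const char __user *ubuf,
                        size_t count, loff_t *f_pos)
{
    unsigned int kbuf;   /* kernel-side destination buffer */

    if (count < sizeof(kbuf))
        return -EINVAL;
    /* copy_from_user() returns the number of bytes it could NOT copy */
    if (copy_from_user(&kbuf, ubuf, sizeof(kbuf)))
        return -EFAULT;
    printk(KERN_INFO "Data: %u\n", kbuf);
    return sizeof(kbuf);   /* bytes consumed from the user buffer */
}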
