What register state is saved on a context switch in Linux?

Where in Linux would you look to find out what registers are saved on a context switch? I'm wondering, for example, if it is safe to use FP or vector registers in kernel-mode driver code (mostly interested in x86-64 and ARM, but I'm hoping for an architecture-independent answer).

Since no one seems to have answered this, let me venture.
Take a look at the math_state_restore and __unlazy_fpu functions.
You can find them here:
http://www.cs.fsu.edu/~baker/devices/lxr/http/ident?i=math_state_restore
http://www.cs.fsu.edu/~baker/devices/lxr/http/ident?i=__unlazy_fpu
x86-family processors have separate instructions for saving (fnsave) and restoring (frstor) the FPU state, so it looks like the OS is burdened with saving and restoring it.
I presume that unless the FPU has been used by the user-mode process, the Linux context switch will not save it for you.
So you need to do it yourself (in your driver) to be sure. You can use kernel_fpu_begin/end to do that in your driver, but it is generally not a good idea.
http://www.cs.fsu.edu/~baker/devices/lxr/http/ident?i=kernel_fpu_begin
http://www.cs.fsu.edu/~baker/devices/lxr/http/ident?i=kernel_fpu_end
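For illustration, a minimal sketch of the pattern (the header is asm/fpu/api.h on recent kernels and asm/i387.h on older ones; also note that kernel builds normally disable FP code generation, so a real module would need extra compiler flags; the helper name and computation are made up for this example):

#include <linux/kernel.h>
#include <asm/fpu/api.h>  /* kernel_fpu_begin()/kernel_fpu_end(); <asm/i387.h> on older kernels */

/* Hypothetical helper: every FP instruction stays inside the
 * begin/end window, and the result is converted back to an integer
 * before the window closes. Preemption is off in between, so the
 * body must not sleep, fault, or call anything that might.
 * Assumes n > 0. */
static int scaled_average(const int *vals, int n)
{
    int result, i;

    kernel_fpu_begin();             /* saves user FPU state, disables preemption */
    {
        float sum = 0.0f;
        for (i = 0; i < n; i++)
            sum += vals[i];
        result = (int)(sum / n);    /* back to integer before kernel_fpu_end() */
    }
    kernel_fpu_end();               /* restores FPU state, re-enables preemption */

    return result;
}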
Why is it not a good idea? From Linus himself: http://lkml.indiana.edu/hypermail/linux/kernel/0405.3/1620.html
Quoted:
You can do it "safely" on x86 using
kernel_fpu_begin(); ...
kernel_fpu_end();
and make sure that all the FP stuff
is in between those two things, and
that you don't do anything that
might fault or sleep.
The kernel_fpu_xxx() macros make sure
that preemption is turned off etc, so
the above should always be safe.
Even then, of course, using FP in the
kernel assumes that you actually
have an FPU, of course. The in-kernel FP emulation package is
not supposed to work with kernel FP instructions.
Oh, and since the kernel doesn't link
with libc, you can't use anything
even remotely fancy. It all has to be
stuff that gcc can do in-line,
without any function calls.
In other words: the rule is that you
really shouldn't use FP in the
kernel. There are ways to do it, but
they tend to be for some real
special cases, notably for doing
MMX/XMM work. Ie the only "proper" FPU
user is actually the RAID checksumming
MMX stuff.
Linus
In any case, do you really want to rely on Intel's floating point unit? http://en.wikipedia.org/wiki/Pentium_FDIV_bug (just kidding :-)).

Related

Linux System Calls & Kernel Mode

I understand that system calls exist to provide access to capabilities that are disallowed in user space, such as accessing an HDD using the read() system call. I also understand that these are abstracted by a user-mode layer in the form of library calls such as fread(), to provide compatibility across hardware.
So from the application developer's point of view, we have something like:
//library //syscall //k_driver //device_driver
fread() -> read() -> k_read() -> d_read()
My question is: what is stopping me from inlining all the instructions in the fread() and read() functions directly into my program? The instructions are the same, so the CPU should behave in the same way. I have not tried it, but I assume that it does not work for some reason I am missing. Otherwise any application could perform arbitrary kernel-mode operations.
TL;DR: What allows system calls to 'enter' kernel mode that is not copyable by an application?
System calls do not enter the kernel by themselves. More precisely, the read function you call is still, as far as your application is concerned, a library call. What read(2) does internally is invoke the actual system call using an interrupt or the syscall assembly instruction, depending on the CPU architecture and OS.
This is the only way for userland code to get privileged code executed, but it is an indirect way. Userland and kernel code execute in different contexts.
That means you cannot add the kernel source code to your userland code and expect it to do anything useful except crash. In particular, the kernel code has access to the physical memory addresses required to interact with the hardware. Userland code is limited to a virtual memory space that lacks this capability. Also, the instructions userland code is allowed to execute are a subset of the ones the CPU supports. Several I/O-, interrupt- and virtualization-related instructions are examples of prohibited code. They are known as privileged instructions and require being in a lower ring or supervisor mode, depending on the CPU architecture.
You could inline them. You can issue system calls directly through syscall(2), but that soon gets messy. Note that the system call overhead (context switches back and forth, in-kernel checks, ...), not to mention the time the system call itself takes, makes any gain from inlining disappear in the noise (if there is any gain at all; more code means the cache isn't as useful, and performance suffers). Trust the libc/kernel folks to have studied the matter and done the inlining for you behind your back (in the relevant *.h file) if it really is a measurable gain.
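To make the point concrete, here is a small user-space sketch showing both routes; both end up at the same kernel entry point, the wrapper merely hides the trap mechanics:

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    const char msg[] = "hello via write()\n";
    const char msg2[] = "hello via syscall(SYS_write)\n";

    /* the usual libc wrapper... */
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);

    /* ...and the same system call issued "by hand" through syscall(2);
     * the kernel cannot tell the difference */
    syscall(SYS_write, STDOUT_FILENO, msg2, sizeof(msg2) - 1);

    return 0;
}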

Xen binary rewriting method

In full virtualization, what is the CPL of the guest OS? In paravirtualization, the CPL of the guest OS is 1 (ring 1). Is it the same in full virtualization?
I have also heard that some of the x86 privileged instructions are not easily handled, and thus a "binary rewriting" method is required. How does this "binary rewriting" happen? I understand that in virtualization the CPU is not emulated, so how can the hypervisor change the binary instruction codes before the CPU executes them? Do they predict the next instruction in memory and update the memory contents before the CPU gets there? If so, I would think the hypervisor code performing the binary rewriting would need to intercept the CPU before every guest OS instruction executes, which seems absurd.
A specific explanation would be appreciated. Thanks in advance!
If by full virtualization you mean hardware-supported virtualization, then the CPL of the guest is the same as it would be running on bare metal.
Xen never rewrites the binary.
This is something that VMware does (as far as I understand it). To the best of my understanding (though I have never seen the VMware source code), the method consists of runtime patching of code that needs to run differently: typically, replacing an existing op-code with something else, either causing a trap to the hypervisor, or a replacement sequence of code that "does the right thing". As I understand it, the hypervisor "learns" the code by single-stepping through a block and either applies binary patches or marks the section as "clear" (doesn't need changing). The next time that code gets executed, it has already been patched or is known to be clear, so it can run at "full speed".
In Xen, using paravirtualization (ring compression), the code in the OS has been modified to be aware of the virtualized environment, and as such is "trusted" to understand certain things. But the hypervisor will still trap, for example, writes to the page table (otherwise someone could write a malicious kernel module that modifies the page table to map in another guest's memory, or some such).
The HVM method does intercept CERTAIN instructions, but the rest of the code runs at normal full speed, thanks to the hardware support in modern processors, such as SVM in AMD and VMX in Intel processors. ARM has a similar technology in the latest models of their processors, but I'm not sure what its name is.
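As an aside, you can check from user space whether the CPU advertises this hardware support; a sketch using GCC's cpuid.h helper (bit positions per the Intel and AMD manuals):

#include <stdio.h>
#include <cpuid.h>  /* __get_cpuid(), GCC/Clang only */

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Intel VT-x: CPUID leaf 1, ECX bit 5 (VMX) */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 5)))
        printf("VMX (Intel VT-x) supported\n");

    /* AMD-V: CPUID leaf 0x80000001, ECX bit 2 (SVM) */
    if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 2)))
        printf("SVM (AMD-V) supported\n");

    return 0;
}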
I'm not sure if I've answered quite all of your questions; if I've missed something or it's not clear enough, feel free to ask...

pinning a pthread to a single core

I am trying to measure the performance of some library calls. My primary measurement tool is the rdtsc call. After doing some reading I realize that I need to disable preemption and interrupts in order to get the most accurate readings. Can someone help me figure out how to do these? I know that pthreads have a 'set affinity' mechanism. Is that enough to get the job done?
I also read somewhere that I can make calls into the kernel of the sort
preempt_disable()
raw_local_irq_save(...)
Is there any benefit to using one approach over the other? I tried the latter approach and got this error.
error: 'preempt_disable' was not declared in this scope
which can be fixed by including linux/preempt.h but the compiler still complains.
linux/preempt.h: No such file or directory
Obviously I have not done any kernel hacking, and I could not find this file anywhere on my system. I am really hoping I won't have to install a new Linux kernel. :)
Thanks for your input.
Pinning a pthread to a single CPU can be done using pthread_setaffinity_np.
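A minimal sketch (glibc-specific, hence the _np suffix; compile with -pthread; this pins the calling thread to CPU 0):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    cpu_set_t set;
    int err;

    CPU_ZERO(&set);
    CPU_SET(0, &set);               /* allow only CPU 0 */

    /* pthread functions return the error number instead of setting errno */
    err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err != 0) {
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
        return 1;
    }

    printf("pinned to CPU 0\n");
    return 0;
}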
But what you want to achieve in the end is not so simple. I'll explain why.
preempt.h is part of the Linux kernel source, so you need the kernel sources available. In any case, you need to write a kernel module to access it; you cannot use it from user space. You will need to learn how to write a kernel module first. The same is true of preempt_disable and the other interrupt-disabling kernel functions.
Now the point is: pthreads live in user space and your preemption-disabling functions live in kernel space. How do they interact?
Either you write a new system call of your own in which you do the preemption and interrupt disabling and call it from user space, or you resort to other kernel/user-space interfaces such as procfs, sysfs, or ioctl; a kernel-side sketch follows below.
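As a rough sketch of the kernel side, here is a toy module that times a region with preemption and local interrupts off (get_cycles() is the kernel's portable wrapper around TSC-like counters; the timed body is just a placeholder, and real code would expose this via ioctl/sysfs rather than doing the work at load time):

#include <linux/module.h>
#include <linux/init.h>
#include <linux/preempt.h>
#include <linux/irqflags.h>
#include <linux/timex.h>    /* get_cycles() */

static int __init bench_init(void)
{
    unsigned long flags;
    cycles_t t0, t1;

    preempt_disable();          /* no scheduler preemption from here on */
    local_irq_save(flags);      /* mask interrupts on this CPU */

    t0 = get_cycles();
    /* ... code under test would go here ... */
    t1 = get_cycles();

    local_irq_restore(flags);
    preempt_enable();

    pr_info("bench: %llu cycles\n", (unsigned long long)(t1 - t0));
    return 0;
}

static void __exit bench_exit(void)
{
}

module_init(bench_init);
module_exit(bench_exit);
MODULE_LICENSE("GPL");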
But I am really skeptical as to how all this will help you benchmark library functions. You may want to have a look at how performance is typically measured using rdtsc.
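For reference, the common user-space pattern looks roughly like this (x86-specific; __rdtscp waits for earlier instructions to finish, and taking the minimum over many runs filters out interrupts and migrations, especially combined with the affinity pinning above):

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp() on GCC/Clang */

int main(void)
{
    uint64_t best = UINT64_MAX;
    unsigned int aux;

    for (int i = 0; i < 10000; i++) {
        uint64_t t0 = __rdtscp(&aux);
        /* ... call under test would go here ... */
        uint64_t t1 = __rdtscp(&aux);
        if (t1 - t0 < best)
            best = t1 - t0;  /* the minimum filters out interrupt noise */
    }

    printf("min overhead: %llu cycles\n", (unsigned long long)best);
    return 0;
}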

programming my own kernel

I need some directions to start learning about programming my own operating system kernel.
Just for educational purposes.
How can I write my own Kernel?
I would first ask: why did you pick "writing a kernel?" Any answer other than "the idea of implementing my own task structures in memory to be swapped by a scheduler that I write and using memory that is managed by code that I wrote and is protected by abstractions of machine-level atomic instructions and is given I/O access through abstractions that sit atop actual hardware interfaces appeals to me" is probably a bad answer that indicates you haven't done any research whatsoever and are wasting your time.
If you answered similarly to the above, then you have a good starting point and you know what you need to research (that is, you are able to pinpoint to some degree what information you do not know but need to find out).
Either way, I don't think this question is worth asking. In one case, you have done no research of your own to discover if you can actually do this, and in the other case you asked an overly-broad question.
It isn't that hard, but you need to learn about proper resource management and low-level device I/O. If you're targeting a commodity x86 box, then you'll need to learn how the BIOS works and how the disk is structured. For example, the BIOS will read the first block of the disk into memory at a fixed address (0x7C00 on PCs) and then jump to that address. Since there probably won't be enough space in one block to store your kernel, you'll need to write a boot loader to read your kernel off the disk and load it.
Writing a minimal kernel that does some simple multitasking and performs I/O using just the BIOS isn't too difficult, just don't expect to be throwing up any windows and mousing around any time soon. You'll be busy trying to implement a simple file system and getting read() and write() to work.
Maybe you can start by looking into OS/161, which is Harvard's simplified operating system for educational purposes. The OS runs on a simulator, so you don't need a separate machine to run it. I used it for my operating systems course, and it really did help a lot.
Also I think you may really want to consider taking an operating system course if you haven't done so.

What are coding conventions for using floating-point in Linux device drivers?

This is related to the context-switch question above.
I'm not an expert on Linux device drivers or kernel modules, but I've been reading "Linux Device Drivers" [O'Reilly] by Rubini & Corbet and a number of online sources, but I haven't been able to find anything on this specific issue yet.
Is a kernel or driver module ever allowed to use floating-point registers?
If so, who is responsible for saving and restoring their contents?
(Assume x86-64 architecture)
If I understand correctly, whenever a KM is running, it is using a hardware context (or hardware thread or register set -- whatever you want to call it) that has been preempted from some application thread. If you write your KM in C, the compiler will correctly ensure that the general-purpose registers are properly saved and restored (much as in an application), but that doesn't automatically happen with floating-point registers. For that matter, a lot of KMs can't even assume that the processor has any floating-point capability.
Am I correct in guessing that a KM that wants to use floating-point has to carefully save and restore the floating-point state? Are there standard kernel functions for doing this?
Are the coding conventions for this spelled out anywhere? Are they different for SMP and non-SMP drivers? Are they different for older non-preemptive kernels and newer preemptive kernels?
Linus's answer provides this pretty clear quote to use as a guideline:
In other words: the rule is that you really shouldn't use FP in the kernel.
Short answer: kernel code can use floating point if the use is surrounded by kernel_fpu_begin()/kernel_fpu_end(). These functions handle saving and restoring the FPU context. They also call preempt_disable()/preempt_enable(), which means no sleeping, page faults, etc. in the code between those calls. Google the function names for more information.
If I understand correctly, whenever a KM is running, it is using a hardware context (or hardware thread or register set -- whatever you want to call it) that has been preempted from some application thread.
No, a kernel module can run in user context as well (e.g. when userspace invokes a syscall on a device provided by the KM). That has, however, no bearing on the floating-point issue.
If you write your KM in C, the compiler will correctly ensure that the general-purpose registers are properly saved and restored (much as in an application), but that doesn't automatically happen with floating-point registers.
That is not because of the compiler, but because of the kernel context-switching code.
