Why do we need separation between kernel space and user space, when every data structure, every memory resource is managed by the kernel even if we create a process in user space? Why do we need such a big architecture?
All the page tables, context switching etc. are managed by the kernel, then why user space? Can we not use only one space and develop all the things over it?
This question may sound weird, but I really want to understand why it is required. If I were to design a completely new OS and did not make such a division between kernel space and user space, what would the problem be?
Thanks
Your question is not weird at all -- you are absolutely correct in your intuition that kernel space may represent unnecessary overhead. Indeed, it is overhead that many operating systems do not have. Your OS most likely won't have it either.
The rationale for the extra complexity goes something like this:
We want process/task separation.
We want to make sure that one task doing bad things does not affect another one. This requires that the tasks execute in their respective isolated spaces.
We want similar protection for shared hardware controllers.
A task cannot set up its own boundaries; if that were the case, a misbehaving task could purposely break the rules. So we need a third party with higher privileges than the tasks have, and that is your kernel (or privileged) space.
If the cost of the extra kernel space and separation is too high, either in size or in computation overhead, the operating system will not have those features. Many real-time or embedded systems run in a single space with a single (same) privilege level (the privilege levels are sometimes referred to as protection rings) exactly as you described it.
If there is no protection between kernel space and user space or between different user processes then anybody can write some code which will intentionally or accidentally modify either memory in kernel space or memory in the space of another user process. Look into the good old MS-DOS which lacked this protection.
Separating the spaces is not strictly necessary, but if you need to protect OS structures from user code to prevent wrong or bad manipulation of them, then you need some kind of protection. Otherwise such bad manipulations would break the OS's internal coherence, which is usually something you don't want.
The kernel space / user space split is about system protection, to make the system more robust. Kernel space runs in privileged mode and can do things (like directly interacting with hardware/system resources) which user space cannot. All user space interaction with the hardware has to go through kernel space.
So wrongly written user code cannot crash the system or make it unstable. Without this separation, wrongly written user code could crash the whole system.
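A minimal illustration of that protection (the kernel address below is just a made-up example): with the split in place, a user process that tries to touch kernel memory is simply killed with SIGSEGV instead of corrupting the OS.

    /* user-space program poking at a kernel address (illustrative value only) */
    #include <stdio.h>

    int main(void)
    {
        /* on x86-64 Linux, kernel addresses live in the upper half of the
           address space; this particular value is hypothetical */
        volatile unsigned long *kernel_ptr = (unsigned long *)0xffffffff81000000UL;

        *kernel_ptr = 0;   /* the MMU blocks this: the process gets SIGSEGV */
        printf("never reached\n");
        return 0;
    }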
According to this paper: https://doi.org/10.1109/SP.2013.13, memory corruption bugs are one of the oldest problems in computer security. The lack of memory safety and type safety has caused countless bugs, costing billions of dollars and huge efforts to fix.
But the root of C/C++'s memory vulnerability can be traced down to the ISA level. At the ISA level, every instruction can access any memory address without any fine-grained safety check (only coarse-grained checks such as page faults). Sure, we can implement memory safety at a higher software level, as Java (JVM) does, but this comes at a significant performance cost. In a word, we can't have both safety and performance at the same time on existing CPUs.
My question is: why can't we implement the safety at the hardware level? If the CPU had a safe ISA which ensured memory safety by, I don't know, taking over the responsibilities of malloc and free, then maybe we could get rid of the performance penalty of software safety checking. If anyone professional in microelectronics can tell me, is this idea realistic?
Depending on what you mean, it could make it impossible to implement memory-unsafe languages like C in a normal way. e.g. every memory access would have to be to some object that has a known size? I'd guess an operating system for such a machine might have to work around that "feature" by telling it that the entire address space was one large array object. Or else you'd need some mechanism for a read system call to know the proper bounds of the object it's writing to in the copy_to_user() part of its job. And then there's other OS stuff like accessing the same physical page from different virtual pages.
The OP (via asking on Reddit) found the CHERI project which is an attempt at this idea, involving "... revisit fundamental design choices in hardware and software to dramatically improve system security." Changing hardware alone can't work; compilers need to change, too. But they were able to adapt "Clang/LLVM, FreeBSD, FreeRTOS, and applications such as WebKit," so their approach could be practical. (Unlike the hypothetical versions I was imagining when writing other parts of this answer.)
CHERI uses "fine-grained memory protection", and "Language and compiler extensions" to implement memory-safe C and C++, and higher-level languages.
So it's not a drop-in replacement, and it sounds like you have to actively use the features to gain safety. As I argue in the rest of the answer, hardware can't do it alone, and it's highly non-trivial even with software cooperation. It's easy to come up with ways that wouldn't work. :P
For hardware-enforced memory-safety to be possible, hardware would have to know about every object and its size, and be able to cache that structure in a way that allows efficient lookups to find the bounds. Page tables (4k granularity, or larger in more modern ISAs) are already hard enough for hardware to cache efficiently for large programs, and that's without even considering which pointer goes with which object.
Checking a TLB as part of every load and store can be done efficiently, but checking another structure in parallel with that might be problematic. Especially when the ranges don't have power-of-2 sizes and natural alignment, the way pages do, which makes it possible to build a TLB from content-addressable memory that checks for a match against each of several possible values for the high bits. (e.g. a page is 4k in size, always starting at a 4k alignment boundary.)
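To make the power-of-2 point concrete, here is a small sketch (assuming 4 KiB pages) of why page lookups are cheap for hardware: the page number and the offset are fixed bit fields, so a TLB only has to compare the high bits, whereas an arbitrarily sized object would need full range comparisons.

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12                   /* 4 KiB pages */
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)

    int main(void)
    {
        uintptr_t addr = 0x7f12345678abUL;  /* arbitrary example address */

        /* page-sized, naturally aligned regions split the address into two
           fixed bit fields -- no adders or range comparisons needed */
        uintptr_t page_number = addr >> PAGE_SHIFT;
        uintptr_t offset      = addr & (PAGE_SIZE - 1);

        printf("page number: %#lx, offset: %#lx\n",
               (unsigned long)page_number, (unsigned long)offset);

        /* an arbitrary-sized, arbitrarily aligned object has no such split:
           the hardware would need base <= addr < base + size comparisons */
        return 0;
    }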
You mean it may cost too much at hardware level, like the die area?
Die area might not even be the biggest problem, especially these days. It would cost power, and/or cost latency in very important critical paths such as L1d hit load-use latency, even if you could come up with some plausible way for software to make tables that hardware could check, or otherwise solve the other parts of this problem.
Modifying a page-table entry requires invalidating the entry, including TLB shootdown for other cores. If every free (and some malloc) cost inter-core communication to do similar things for object tables, that would be very expensive.
I think inventing a way for software to tell the hardware about objects would be an even bigger problem. malloc and free aren't something you can just build into a CPU where memory addressing works anything like it does on existing CPUs, or like it does in C. Software needs to manage memory; it doesn't make sense to try to build that into a CPU. So then malloc and free (and mmap with file-backed mappings and shared memory...) need a way to tell the CPU about objects. Seems like a mess.
I think at best an ISA could provide more tools software can use to make bounds-checks cheaper. Perhaps some kind of extra semantics on loads/stores, like an extra operand for indexed addressing modes that takes a maximum index?
At least if we want an ISA to work anything like current ones, rather than work like a JVM or a Transmeta Crusoe and internally recompile for some real ISA.
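To illustrate what such a "tool" would be replacing, here is the check software has to do today, written as a hypothetical helper; the idea above is that an addressing mode could take the length as an extra operand and fault instead of branching.

    #include <stdint.h>
    #include <stdlib.h>

    /* hypothetical sketch: what a "load with bound" instruction would have to
       absorb. Today the compiler emits the comparison and branch explicitly;
       the suggestion is an addressing mode that takes `len` as an extra
       operand and raises a fault on violation. */
    static inline uint8_t checked_load(const uint8_t *base, size_t idx, size_t len)
    {
        if (idx >= len)      /* the part we'd like the hardware to do for free */
            abort();         /* a real ISA would raise a fault instead */
        return base[idx];
    }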
Intel's MPX ISA extension to x86 was an attempt to let software set up bound ranges, but it's been mostly abandoned due to lower performance than pure software. Intel even dropped it from their recent CPUs (Not present in 10th Gen CPUs using 10nm lithography, or later.)
This is all just off the top of my head; I haven't searched for any serious proposals for how a system could plausibly work.
I don't think memory safety is something you can easily add after the fact to languages like C that weren't originally designed with it.
Have a look at "Code for malloc and free" on SO. Those functions are very, very far away from even being definable within an instruction set.
I'm trying to find any system functionality that would allow a process to allocate "temporary" memory - i.e. memory that is considered discardable by the process, and can be taken away by the system when memory is needed, but allowing the process to benefit from available memory when possible. In other words, the process tells the system it's OK to sacrifice the block of memory when the process is not using it. Freeing the block is also preferable to swapping it out (it is as expensive, or more expensive, to swap it out than to reconstitute its contents).
Systems (e.g. Linux) have such things in the kernel, like the filesystem page cache. I am looking for something like this, but available to user space.
I understand there are ways to do this from the program, but it's really more of a kernel job to deal with this. To some extent, I'm asking the kernel:
if you need to reduce my, or another process residency, take these temporary pages off first
if you are taking these temporary pages off, don't swap them out, just unmap them
Specifically, I'm interested on a solution that would work on Linux, but would be interested to learn if any exist for any other O/S.
UPDATE
An example of how I expect this to work:
map a page (over swap). No difference to what's available right now.
tell the kernel that the page is "temporary" (for the lack of a better name), meaning that if this page goes away, I don't want it paged in.
tell the kernel that I need the temporary page "back". If the page was unmapped since I marked it "temporary", I am told that happened. If it hasn't, then it starts behaving as a regular page.
Here are the problems with getting this done over the existing MM:
To keep pages from being paged in, I have to map them over nothing (anonymous mappings). But then they can get paged out at any time, without notice. Testing with mincore() doesn't guarantee that the page will still be there by the time mincore() returns. Using mlock() requires elevated privileges.
So, the closest I can get to this is by using mlock(), and anonymous pages. Following the expectations I outlined earlier, it would be:
map an anonymous, locked page. (MAP_ANON|MAP_LOCKED|MAP_NORESERVE). Stamp the page with magic.
to make the page "temporary", unlock the page
when needing the page, lock it again. If the magic is there, it's my data, otherwise it's been lost, and I need to reconstitute it.
However, I don't really need the pages to be locked in RAM while I'm using them. Also, MAP_NORESERVE is problematic if memory is overcommitted.
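For what it's worth, a rough sketch of that mlock()-plus-magic scheme as I described it (error handling omitted, and MAGIC is just an arbitrary marker value):

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <sys/mman.h>

    #define MAGIC 0xDEADBEEFCAFEF00DULL

    /* 1. map an anonymous, locked page and stamp it with magic */
    void *get_page(size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANON | MAP_LOCKED | MAP_NORESERVE,
                       -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        *(uint64_t *)p = MAGIC;          /* marker so we can tell data survived */
        return p;
    }

    /* 2. mark the page "temporary": unlock it so the kernel may reclaim it */
    void make_temporary(void *p, size_t len)
    {
        munlock(p, len);
    }

    /* 3. take the page back: lock it again and check whether the data survived */
    int reclaim(void *p, size_t len)
    {
        mlock(p, len);                   /* page is resident again after this */
        return *(uint64_t *)p == MAGIC;  /* 0 means contents must be rebuilt */
    }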
This is what the VMware ESXi server, aka the Virtual Machine Monitor (VMM) layer, implements. It is used with virtual machines and is a way to reclaim memory from the guests. Virtual machines that have more memory allocated than they are actually using are made to release/free it to the VMM so that it can assign it back to the guests that are in need of it.
This technique of Memory Reclamation is mentioned in this paper: http://www.vmware.com/files/pdf/mem_mgmt_perf_vsphere5.pdf
Along similar lines, you could implement something comparable in your kernel.
I'm not sure I understand exactly what you need. Remember that processes run in virtual memory (their address space is virtual), and that the kernel deals with virtual-to-physical address translation (using the MMU) and with paging. So a page fault can happen at any time. The kernel will choose to page in or page out at arbitrary moments, and will choose which page to evict (only the kernel cares about RAM, and it can page out any physical RAM page at will). Perhaps you want the kernel to tell you when a page has genuinely been discarded. How would the kernel take away temporary memory from your process without your process being notified? The kernel could take away and later give back some RAM... (so you want to know when the given-back memory is fresh)
You might use mmap(2) with MAP_NORESERVE first, then again (on the same memory range) with MAP_FIXED|MAP_PRIVATE. See also mincore(2) and mlock(2)
You can also later use madvise(2) with MADV_DONTNEED or MADV_WILLNEED etc.
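A minimal sketch of those calls used together (sizes are arbitrary; whether the kernel actually drops the pages depends on memory pressure and kernel version):

    #define _GNU_SOURCE
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 16 * 4096;

        /* reserve an address range without committing swap for it */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        /* ... use the memory ... */

        /* tell the kernel we don't need the contents for now; dropped
           anonymous pages read back as zeroes on the next access */
        madvise(p, len, MADV_DONTNEED);

        /* later, hint that we are about to use the range again */
        madvise(p, len, MADV_WILLNEED);

        munmap(p, len);
        return 0;
    }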
Perhaps you want to mmap some device like /dev/null, /dev/full, /dev/zero or (more likely) write your own kernel module providing a similar device.
GNU Hurd has an external pager mechanism... You cannot yet get exactly that on Linux. (Perhaps consider mmap on some FUSE mounted file).
I don't understand what you want to happen when the kernel is paging out your memory, and what you want to happen when the kernel is paging in again such a page because your process is accessing it. Do you want to get a zero-ed page, or a SIGSEGV ?
I am looking to write a PWM driver. I know that there are two ways we can write a driver to control hardware:
User space driver.
Kernel space driver
In general (not considering the PWM driver case), if we have to decide whether to go for a user space or kernel space driver, what factors do we have to take into consideration apart from these?
User space drivers can directly mmap() /dev/mem memory into their virtual address space and need no context switching.
Userspace drivers cannot have interrupt handlers implemented (they have to poll for interrupts).
Userspace drivers cannot perform DMA (as DMA-capable memory can only be allocated from kernel space).
Of the three factors you have listed, only the first one is actually correct. As for the rest — not really. It is possible for user space code to perform DMA operations — no problem with that. There are many hardware appliance companies who employ this technique in their products. It is also possible to have an interrupt-driven user-space application, even when all of the I/O is done with a full kernel bypass. Of course, it is not as easy as simply doing an mmap() on /dev/mem.
You would have to keep a minimal portion of your driver in the kernel — that is needed in order to provide your user space with the bare minimum that it needs from the kernel (because if you think about it, /dev/mem is also backed by a character device driver).
For DMA, it is actually too darn easy — all you have to do is to handle mmap request and map a DMA buffer into the user space. For interrupts — it is a little bit more tricky, the interrupt must be handled by the kernel no matter what, however, the kernel may not do any work and just wake up the process that calls, say, epoll_wait(). Another approach is to deliver a signal to the process as done by DOSEMU, but that is very slow and is not recommended.
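To sketch the "handle mmap and map a DMA buffer" part (the mydrv_* names and the device pointer are placeholders; a real driver needs the usual probe/cleanup and error handling around this):

    #include <linux/module.h>
    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/dma-mapping.h>

    /* allocated earlier in probe() with dma_alloc_coherent() */
    static struct device *mydrv_dev;     /* placeholder device */
    static void *mydrv_buf;              /* CPU address of the DMA buffer */
    static dma_addr_t mydrv_dma_handle;  /* bus address given to the hardware */
    static size_t mydrv_buf_size;

    /* .mmap file operation: hand the coherent DMA buffer to user space */
    static int mydrv_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        size_t len = vma->vm_end - vma->vm_start;

        if (len > mydrv_buf_size)
            return -EINVAL;

        return dma_mmap_coherent(mydrv_dev, vma, mydrv_buf,
                                 mydrv_dma_handle, len);
    }

    static const struct file_operations mydrv_fops = {
        .owner = THIS_MODULE,
        .mmap  = mydrv_mmap,
    };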
As for your actual question, one factor that you should take into consideration is resource sharing. As long as you don't have to share a device across multiple applications and there is nothing that you cannot do in user space — go for the user space. You will probably save tons of time during the development cycle as writing user space code is extremely easy. When, however, two or more applications need to share the device (or its resources), then chances are that you will spend a tremendous amount of time making it possible — just imagine multiple processes forking, crashing, mapping (the same?) memory concurrently etc. And after all, IPC is generally done through the kernel, so if applications would need to start "talking" to each other, the performance might degrade greatly. This is still done in real life for certain performance-critical applications, though, but I don't want to go into those details.
Another factor is the kernel infrastructure. Let's say you want to write a network device driver. That's not a problem to do in user space. However, if you do that then you'd need to write a full network stack too, as it won't be possible to use Linux's default one that lives in the kernel.
I'd say go for user space if it is possible and the amount of effort to make things work is less than writing a kernel driver, keeping in mind that one day it might be necessary to move the code into the kernel. In fact, it is a common practice to have the same code compiled for both user space and kernel space depending on whether some macro is defined, because testing in user space is a lot more pleasant.
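One common shape for that macro trick, as a sketch (the my_* names are placeholders):

    /* sketch of the dual-build pattern: the same .c file compiled either as
       part of a kernel module or as a normal user-space test harness */
    #ifdef __KERNEL__
    #include <linux/slab.h>
    #include <linux/printk.h>
    #define my_alloc(sz)  kmalloc((sz), GFP_KERNEL)
    #define my_free(p)    kfree(p)
    #define my_log(...)   pr_info(__VA_ARGS__)
    #else
    #include <stdio.h>
    #include <stdlib.h>
    #define my_alloc(sz)  malloc(sz)
    #define my_free(p)    free(p)
    #define my_log(...)   printf(__VA_ARGS__)
    #endif

    /* driver logic below uses only my_alloc/my_free/my_log, so it can be
       unit-tested in user space with gdb/valgrind before going into the kernel */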
Another consideration: it is far easier to debug user-space drivers. You can use gdb, valgrind, etc. Heck, you don't even have to write your driver in C.
There's a third option beyond just user space or kernel space drivers: some of both. You can do just the kernel-space-only stuff in a kernel driver and do everything else in user space. You might not even have to write the kernel space driver if you use the Linux UIO driver framework (see https://www.kernel.org/doc/html/latest/driver-api/uio-howto.html).
I've had luck writing a DMA-capable driver almost completely in user space. UIO provides the infrastructure so you can just read/select/epoll on a file to wait on an interrupt.
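A sketch of that wait loop with UIO, assuming the device shows up as /dev/uio0 (the re-arm write at the end is only needed when the underlying driver, e.g. uio_pci_generic, implements the irqcontrol hook):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* assumes the UIO driver exposed the device as /dev/uio0 */
        int fd = open("/dev/uio0", O_RDWR);
        if (fd < 0)
            return 1;

        for (;;) {
            uint32_t irq_count;

            /* blocks until the next interrupt; the value read is the total
               interrupt count so far */
            if (read(fd, &irq_count, sizeof(irq_count)) != sizeof(irq_count))
                break;

            printf("interrupt #%u\n", irq_count);

            /* re-arm the interrupt; only needed when the underlying driver
               implements the irqcontrol hook */
            uint32_t enable = 1;
            write(fd, &enable, sizeof(enable));
        }
        close(fd);
        return 0;
    }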
You should be cognizant of the security implications of programming the DMA descriptors from user space: unless you have some protection in the device itself or an IOMMU, the user space driver can cause the device to read from or write to any address in physical memory.
In order to write to a read-only memory location (an example of such a memory location would be the sys_call_table) from a kernel module, is it sufficient to disable page write protection by clearing bit 16 (the WP bit) of the CR0 register?
Or do we need something more to write to a read-only memory location?
If you disable page write protection, you may break something that depends on it (e.g. any copy-on-write occurring on kernel pages). If you do it that way, you probably want to temporarily disable interrupts/scheduling, so the memory modification looks atomic on that CPU; this also prevents the thread from being moved to a different CPU if you have more than one.
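For illustration, the classic pattern looks roughly like the sketch below (x86 only; note that recent kernels pin CR0.WP and write_cr0() will refuse to clear it, so treat this strictly as a sketch of the old approach):

    #include <linux/preempt.h>
    #include <linux/irqflags.h>
    #include <asm/special_insns.h>   /* read_cr0() / write_cr0() */

    /* sketch only: temporarily clear CR0.WP (bit 16) around one write */
    static void write_ro_kernel_word(unsigned long *addr, unsigned long val)
    {
        unsigned long flags, cr0;

        preempt_disable();            /* stay on this CPU */
        local_irq_save(flags);        /* no interrupts while WP is off */

        cr0 = read_cr0();
        write_cr0(cr0 & ~0x10000UL);  /* clear WP: writes to RO pages allowed */

        *addr = val;                  /* the actual patch */

        write_cr0(cr0);               /* restore WP */
        local_irq_restore(flags);
        preempt_enable();
    }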
I'm not sure that using hard-coded addresses like 0xc12c9e90 is a good idea. I don't know how Linux lays out things in the kernel portion of the address space, but addresses may change from one boot to another either because of dynamic memory allocation or for security reasons (moving things around is useful thing as it reduces the chances of exploitation of kernel bugs).
Some people want to move code from user space to kernel space in Linux for some reason. A lot of times the reason seems to be that the code should have particularly high priority or simply "kernel space is faster".
This seems strange to me. When should I consider writing a kernel module? Is there a set of criteria?
How can I motivate keeping code in user space that (I believe) belong there?
Rule of thumb: try your absolute best to keep your code in user-space. If you don't think you can, spend as much time researching alternatives to kernel code as you would writing the code (ie: a long time), and then try again to implement it in user-space. If you still can't, research more to ensure you're making the right choice, then very cautiously move into the kernel. As others have said, there are very few circumstances that dictate writing kernel modules and debugging kernel code can be quite hellish, so steer clear at all costs.
As far as concrete conditions you should check for when considering writing kernel-mode code, here are a few:
Does it need access to extremely low-level resources, such as interrupts?
Is your code defining a new interface/driver for hardware that cannot be built on top of currently exported functionality?
Does your code require access to data structures or primitives that are not exported out of kernel space?
Are you writing something that will be primarily used by other kernel subsystems, such as a scheduler or VM system (even here it isn't entirely necessary that the subsystem be kernel-mode: Mach has strong support for user-mode virtual memory pagers, so it can definitely be done)?
There are very limited reasons to put stuff into the kernel. If you're writing device drivers it's ok. Any standard application: never.
The drawbacks are huge. Debugging gets harder, errors become more frequent and hard to find. You might compromise security and stability. You might have to adapt to kernel changes more frequently. It becomes impossible to port to other UNIX OSs.
The closest I've ever come to the kernel was a custom filesystem (with mysql in the background) and even for that we used FUSE (where the U stands for userspace).
I'm not sure the question is the right way around. There should be a good reason to move things to kernel space. If there aren't any reasons, don't do it.
For one thing, debugging is made harder, and the effect of bugs is far worse (crash/panic instead of simple coredump).
Basically, I agree with rpj. Code has to be in user-space, unless it's REALLY necessary.
But to address your question: under which conditions?
Some people claim that drivers have to be in the kernel, which is not true. Some drivers are not timing-sensitive; in fact, lots of drivers are like that.
For example, the framer, RTC timer, i2c devices, etc. Those drivers can be easily moved to user space. There are even some file-systems that are written in user-space.
You should move to kernel space where the overhead, e.g. user/kernel context switches, becomes unacceptable for your code to work properly.
But there are lots of ways to deal with this. For example, /dev/mem provides a good way to access physical memory, just as you would from kernel space.
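For example, mapping a peripheral's registers through /dev/mem looks roughly like this (the physical address is a placeholder; you need root, and kernels built with CONFIG_STRICT_DEVMEM restrict which ranges you can map):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const off_t phys_base = 0x40000000;   /* placeholder peripheral address */
        const size_t map_len  = 4096;

        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0)
            return 1;

        /* map one page of physical address space into this process */
        volatile uint32_t *regs = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, phys_base);
        if (regs == MAP_FAILED)
            return 1;

        regs[0] = 0x1;                        /* poke a (hypothetical) register */

        munmap((void *)regs, map_len);
        close(fd);
        return 0;
    }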
When people talk about going to RTOS, I'm usually skeptical.
These days, the processor is so powerful, that most of the time, the real-time aspect becomes negligible.
But even if, let's say, you're dealing with SONET and you need to do protection switching within 50ms (actually even less, since the 50ms constraint applies to the whole ring), you can still do the switching very fast, IF your hardware supports it.
Lots of framers these days give you hardware support that reduces the number of writes you need to do. Your job is basically to respond to the interrupt as quickly as possible. And Linux is not bad at all. The interrupt latency I got was less than 1ms, even with tons of other interrupts running (e.g. IDE, ethernet, etc.).
And if that's still not enough, then maybe your hardware design is wrong. Some things are better left on the hardware. And when I said hardware, I mean ASIC, FPGA, Network Processor, or other advanced logic.
Code running in the kernel accesses memory, peripherals, system functions in ways that are different from userspace code and thus has the ability to be more efficient. Not to mention the reduced security restrictions for kernel code. However, all this usually comes at a cost, such as increasing the possibility of opening the kernel up to security threats, locking up the OS, complicating the debugging, and so forth.
If your people want really high priority, determinism, low latency etc, the right way to go is to use some real-time version of Linux (or other OS).
Also look at the preemptible kernel options etc. Exactly what you should do depends on the requirements, but to put the code in kernel modules is not likely the right solution, unless you are interfacing some hardware directly.
Another reason not to move code into kernel space is that when you use it in production or commercial situations, you will have to publish that code due to the GPL license. A situation that many software companies don't want to get into. :)
As a general rule: think about what you want to do, and if it is something you would see in an operating systems development book or class, then it has a good chance of belonging in the kernel. If not, keep it out of the kernel. If you have a very good reason to break that rule, you will either already have enough knowledge to know it yourself, or you will be working with someone who has that knowledge.
Yes, it might sound harsh, but this is exactly what I mean: if you don't know, then the answer is almost certainly no, don't do it in the kernel. Moving your development to kernel space opens a giant can of worms that you must be sure you can handle.
If you just need lower latency, higher throughput, etc., it is probably cheaper to buy a faster computer than to develop kernel code.
Kernel modules may be faster (due to fewer context switches, less system call overhead, and fewer interruptions), and certainly do run at very high priority. If you want to export a small amount of fairly simple code into kernel space, this might be OK. That is, if a small piece of code is found to be crucial to performance, and is the sort of code that would benefit from being placed in kernel mode, then it may be justified to place it there.
But moving large parts of your program into kernel space should be avoided unless all other options are completely exhausted. Aside from the difficulty of doing so, the performance benefit is not likely to be very large.
If you're asking such a question, then you shouldn't go to the kernel layer. Basically, if you're just wondering, it means you don't need to. The cost of a context switch is so small these days that it rarely matters anyway.