Using user-space processes to assist kernel modules - linux

I'm working on one piece of a very high performance piece of hardware that works under Linux. We'd like to cache some data but we're worried about memory consumption - so the idea is to create a user process to manage the cache. That way, the cache can be in virtual memory, not in kernel space, et cetera.
The question is: what's the best way to do this? My first instinct is to have the kernel module create a character device file, and have a user program that opens that file, then sits on a select statement waiting for commands to arrive on it. But I'm concerned that this might not be optimal. A friend mentioned he knew of a socket-based interface, but when pressed he couldn't provide any details....
Any suggestions?

I think you're looking for the netlink interface. See Why and How to Use Netlink Socket [sic] for more information. Be careful of security issues when talking between the kernel and user space; there was a recent vulnerability when udev neglected to check that messages were coming from the kernel rather than user space.

Related

Why applications cannot access a hardware device directly ? Why we need to switch to kernel space in order to do this?

I wondered why we need to switch to kernel space when we want to access a hardware device. I understand that sometimes, for specific actions such as memory allocation, we need to make system calls in order to switch from user space to kernel space because the operating system needs to organize everything and make a separation between processes and how they use memory and others. But why we can't directly access a hardware device ?
There is no problem in writing your own driver to access the hardware from User Space and plenty of documentation is available. For example, this tutorial at xatlantis seems to be recent and good source.
The reason it has been designed like that is because mainly due to security reasons .Most systems I know about specifically do not allow user programs to do I/O or to access kernel space memory. Such things would lead to wildly insecure systems, because with access to the kernel a user program could change permissions and get access to any data anywhere in the system, and presumably change it.
References:
XATLANTIS
STACKEXCHANGE
A device-driver may choose to provide access from user processes to device registers, device memory, or both. A common method is a device-specific service connected with an mmap() request. Consider a frame-buffer's on-board memory, and efficiency from a user process being able to r/w that space directly. For devices in general, notably there are security considerations and drivers that provide direct access often set limits to processes with sufficient credentials. Files within /dev are usually set with owner/group access permissions similarly limited.

Is it possible to circumvent OS security by not using the supplied System Calls?

I understand that an Operating System forces security policies on users when they use the system and filesystem via the System Calls supplied by stated OS.
Is it possible to circumvent this security by implementing your own hardware instructions instead of making use of the supplied System Call Interface of the OS? Even writing a single bit to a file where you normally have no access to would be enough.
First, for simplicity, I'm considering the OS and Kernel are the same thing.
A CPU can be in different modes when executing code.
Lets say a hypothetical CPU has just two modes of execution (Supervisor and User)
When in Supervisor mode, you are allowed to execute any instructions, and you have full access to the hardware resources.
When in User mode, there is subset of instructions you don't have access to, such has instructions to deal with hardware or change the CPU mode. Trying to execute one of those instructions will cause the OS to be notified your application is misbehaving, and it will be terminated. This notification is done through interrupts. Also, when in User mode, you will only have access to a portion of the memory, so your application can't even touch memory it is not supposed to.
Now, the trick for this to work is that while in Supervisor Mode, you can switch to User Mode, since it's a less privileged mode, but while in User Mode, you can't go back to Supervisor Mode, since the instructions for that are not permitted anymore.
The only way to go back to Supervisor mode is through system calls, or interrupts. That enables the OS to have full control of the hardware.
A possible example how everything fits together for this hypothetical CPU:
The CPU boots in Supervisor mode
Since the CPU starts in Supervisor Mode, the first thing to run has access to the full system. This is the OS.
The OS setups the hardware anyway it wants, memory protections, etc.
The OS launches any application you want after configuring permissions for that application. Launching the application switches to User Mode.
The application is running, and only has access to the resources the OS allowed when launching it. Any access to hardware resources need to go through System Calls.
I've only explained the flow for a single application.
As a bonus to help you understand how this fits together with several applications running, a simplified view of how preemptive multitasking works:
In a real-world situation. The OS will setup an hardware timer before launching any applications.
When this timer expires, it causes the CPU to interrupt whatever it was doing (e.g: Running an application), switch to Supervisor Mode and execute code at a predetermined location, which belongs to the OS and applications don't have access to.
Since we're back into Supervisor Mode and running OS code, the OS now picks the next application to run, setups any required permissions, switches to User Mode and resumes that application.
This timer interrupts are how you get the illusion of multitasking. The OS keeps changing between applications quickly.
The bottom line here is that unless there are bugs in the OS (or the hardware design), the only way an application can go from User Mode to Supervisor Mode is through the OS itself with a System Call.
This is the mechanism I use in my hobby project (a virtual computer) https://github.com/ruifig/G4DevKit.
HW devices are connected to CPU trough bus, and CPU does use to communicate with them in/out instructions to read/write values at I/O ports (not used with current HW too much, in early age of home computers this was the common way), or a part of device memory is "mapped" into CPU address space, and CPU controls the device by writing values at defined locations in that shared memory.
All of this should be not accessible at "user level" context, where common applications are executed by OS (so application trying to write to that shared device memory would crash on illegal memory access, actually that piece of memory is usually not even mapped into user space, ie. not existing from user application point of view). Direct in/out instructions are blocked too at CPU level.
The device is controlled by the driver code, which is either run is specially configured user-level context, which has the particular ports and memory mapped (micro-kernel model, where drivers are not part of kernel, like OS MINIX). This architecture is more robust (crash in driver can't take down kernel, kernel can isolate problematic driver and restart it, or just kill it completely), but the context switches between kernel and user level are a very costly operation, so the throughput of data is hurt a bit.
Or the device drivers code runs on kernel-level (monolithic kernel model like Linux), so any vulnerability in driver code can attack the kernel directly (still not trivial, but lot more easier than trying to get tunnel out of user context trough some kernel bug). But the overall performance of I/O is better (especially with devices like graphics cards or RAID disc clusters, where the data bandwidth goes into GiBs per second). For example this is the reason why early USB drivers are such huge security risk, as they tend to be bugged a lot, so a specially crafted USB device can execute some rogue code from device in kernel-level context.
So, as Hyd already answered, under ordinary circumstances, when everything works as it should, user-level application should be not able to emit single bit outside of it's user sandbox, and suspicious behaviour outside of system calls will be either ignored, or crash the app.
If you find a way to break this rule, it's security vulnerability and those get usually patched ASAP, when the OS vendor gets notified about it.
Although some of the current problems are difficult to patch. For example "row hammering" of current DRAM chips can't be fixed at SW (OS) or CPU (configuration/firmware flash) level at all! Most of the current PC HW is vulnerable to this kind of attack.
Or in mobile world the devices are using the radiochips which are based on legacy designs, with closed source firmware developed years ago, so if you have enough resources to pay for a research on these, it's very likely you would be able to seize any particular device by fake BTS station sending malicious radio signal to the target device.
Etc... it's constant war between vendors with security researchers to patch all vulnerabilities, and hackers to find ideally zero day exploit, or at least picking up users who don't patch their devices/SW fast enough with known bugs.
Not normally. If it is possible it is because of an operating system software error. If the software error is discovered it is fixed fast as it is considered to be a software vulnerability, which equals bad news.
"System" calls execute at a higher processor level than the application: generally kernel mode (but system systems have multiple system level modes).
What you see as a "system" call is actually just a wrapper that sets up registers then triggers a Change Mode Exception of some kind (the method is system specific). The system exception hander dispatches to the appropriate system server.
You cannot just write your own function and do bad things. True, sometimes people find bugs that allow circumventing the system protections. As a general principle, you cannot access devices unless you do it through the system services.

Patch / replace linux kernel in memory

I have ARM-based device with linux on-board. Its very difficult to flash custom kernel for some reasons (uBoot cant load kernel via tftp or something else)
I need to test my custom kernel.
So, idea is - replace kernel in memory. How do you think, is it possible?
Tell me any suggestions please.
Take a look at this link
It's for a project called Ksplice that allows one to patch a running kernel.
At one point this code was open, but Oracle bought it... So they may have closed it up and made it cost money. If that's the case, look around and see if you can find the formerly open code in the wild...

Hardware clock signals implementation in Linux Kernel

I am looking at some pointers for understanding how the Linux kernel implements the setting up of various hardware clocks. This basically relates to working with setting up the various clocks that hardware features like the LCD, UART etc will use. For example when Linux boots how does it handle setting up the clocks for UART or USB. Maybe something like a Clock manager or something.
I am basically trying to implement something similar for a different OS on a new hardware that i am working on. Any help would be really appreciated.
[Edit]
Thanks for the replies and the links. So here is what i have implemented up until now. This should give you an idea of where I'm headed.
I looked up the Hardware Reference Manual for the particular system I'm targeting and wrote some code to monitor/modify the signals/pins of the peripherals I am interested in i.e. turning them ON/OFF from the command line.Now a collection of these clocks/signals together control a peripheral.The HRM would say that if you want to turn on the UART or something then turn on such and such signals/pins. And #BjoernD yes I am using something like a mmap() function to talk to the peripherals.
The meat of my question is that I want to understand the design and implementation of a Clock/Peripheral Manager which uses the utility that I have already written. This Clock/Peripheral Manager would give me the control of enabling/disabling the peripherals I want.Basically this Manager would enable me to make changes in the init code that is right now running. Also during run time processes can call this Manager to turn ON/OFF the devices so that power consumption is optimized. It might not have made perfect sense but I'm myself trying to wrap my head around this.
Now I'm sure something like this would have been implemented in Linux or for that matter any OS for performance issues (nobody would want to waste power by turning on all peripherals at boot time). I want to understand the Software Architecture of it. Reference from any OS would do as of now to atleast get a headstart. Also I am not writing my own OS, there is an OS in place but Im looking more at a board level software aka BSP to work on. But thanks for the OS link anyways, they are really good. Appreciate it.
Thanks!
What you want to achieve is highly specific to a) the platform you are using and b) the device you want to use. For instance, on x86 there are 3 ways to communicate with a device:
Interrupts allow the device to signal the CPU. The OS usually provides mechanisms to register interrupt handlers - functions that are called upon occurrence of an interrupt. In Linux see request_irq() and friends in linux/include/interrupt.h
Memory-mapped I/O is physical memory of the device that the platform's BIOS makes available in the same way you also access plain physical memory - simply by writing to a memory address. What exactly is behind such memory (e.g., network interface config registers or an LCD frame buffer) depends on the device and is usually specified in the device's data sheet.
I/O ports are accessed through a special address space and special instructions (INB/OUTB & co.). Other than that they work similar to I/O memory.
There's a multitude of ways to find out what resources a device provies and where the BIOS mapped them. Some platforms use ACPI tables (google yourself for the 1,000k page spec), PCI provides info on devices in a standardized way through the PCI config space, USB has similar ways of discovering devices attached to the bus, and some devices, e.g., UARTS, are simply specified to be available at a pre-configured I/O range that is fixed for your platform.
As a start for understanding Linux, I'd recommend "Understanding the Linux kernel". For specifics on how Linux handles devices and what is there to write drivers, have a look at Linux Device Drivers. Furthermore, you will need to have a look at the peculiarities of your platform and the device you want to drive.
If you want to start an own OS, a UART is certainly something that will be veeery helpful to print debug output, so you might want to go for this first.
Now that I wrote down all this, it seems that your actual question is: How to get started with Operating System design. This question should be highly valuable for you: What are some resources for getting started in operating system development?
The two big power users in most computers are the CPU and the disks. Both of these have capabilities for power saving in Linux. The CPU clock can be slowed down when the system is not busy, and the disk motors can be stopped when no I/O is happening. For a UART, even if you save all of the power that it uses by turning off its clock, it is still tiny compared to the others because a UART doesn't have much logic in it.
Best ways to save power are
1) more efficient power supply
2) replace rotating disk with SSD
3) Slow down the CPU and memory bus

How to do power save on a ARM-based Embedded Linux system?

I plan to develop a nice little application that will run on an arm-based embedded Linux platform; however, since that platform will be battery-powered, I'm searching for relevant information on how to handle power save.
It is kind of important to get decent battery time.
I think the Linux kernel implemented some support for this, but I can't find any documentation on this subject.
Any input on how to design my program and the system is welcome.
Any input on how the Linux kernel tries to solves this type of problem is also welcome.
Other questions:
How much does the program in user space need to do?
And do you need to modify the kernel?
What kernel system calls or APIs are good to know about?
Update:
It seems like the folks involved with the "Free Electrons" site have produced some nice presentations on this subject.
http://free-electrons.com/services/power-management/
http://free-electrons.com/docs/power
http://free-electrons.com/docs/optimizations
But maybe someone else has even more information on this subject?
Update:
It seems like Adam Shiemke's idea to go look at the MeeGo project may be the best tip so far.
It may be the best battery powered Embedded Linux project out there at this moment.
And Nokia is usually kind of good at this type of thing.
Update:
One has to be careful about Android since it has a "modified" Linux kernel in the bottom, and some of the things the folks at Google have done do not use baseline/normal Linux kernels. I think that some of their power management ideas could be troublesome to reuse for other projects.
I haven't actually done this, but I have experience with the two apart (Linux and embedded power management). There are two main Linux distributions that come to mind when thinking about power management, Android and MeeGo. MeeGo uses (as far as I can tell) an unmodified 2.6 kernel with some extras hanging on. I wasn't able to find a lot on exactly what their power management strategy is, although I suspect more will be coming out about it in the near future as the product approaches maturity.
There is much more information available on Android, however. They run a fairly heavily modified 2.6 kernel. You can see a good bit on the different strategies implemented in http://elinux.org/Android_Power_Management (as well as kernel drama). Some other links:
https://groups.google.com/group/android-kernel/browse_thread/thread/ee356c298276ad00/472613d15af746ea?lnk=raot&pli=1
http://www.ok-labs.com/blog/entry/context-switching-in-context/
I'm sure that you can find more links of this nature. Since both projects are open source, you can grab the kernel code, and probably get further information from people who actually know what they are talking about in forms and groups.
At the driver level, you need to make sure that your drivers can properly handle suspend and shut devices off that are not in use. Most devices aimed at the mobile market offer very fine-grained support to turn individual components off, and to tweak clock settings (remember, power is proportional to clock^2).
Hope this helps.
You can do quite a bit of power-saving without requiring any special support from the OS, assuming you are writing (or at least have the source code for) your application and drivers.
Your drivers need to be able to disable their associated devices and bring them back up without requiring a restart or introducing system instability. If your devices are connected to a PCI/PCIe bus, research which power states they support (D0 - D3) and what your driver needs to do to transition between these low-power modes. If you are selecting hardware devices to use, look for devices that adhere to the PCI Power Management Specification or have similar functionality (such as a sleep mode and a "wake up" interrupt signal).
When your device boots up, every device that has the ability to detect whether it is connected to anything needs to do so. If any ports or buses detect that they are not being used, power them down or put them to sleep. A port running at full power but sitting unused can waste more power than you might think it would. Depending on your particular hardware and use case, it might also be useful to have a background app that monitors device usage, identifies unused/idle resources, and acts appropriately (like a "screen saver" for your hardware).
Your application software should make sure to detect whether hardware devices are powered up before attempting to use them. If you need to access a device that might be placed in a low-power mode, your application needs to be able to handle a potentially lengthy delay in waiting for the device to wake up and respond. Your applications should also be considerate of a device's need to sleep. If you need to send a series of commands to a hardware device, try to buffer them up and send them out all at once instead of spacing them out and requiring multiple wakeup->send->sleep cycles.
Don't be afraid to under-clock your system components slightly. Besides saving power, this can help them run cooler (which requires less power for cooling). I have seen some designs that use a CPU that is more powerful than necessary by a decent margin, which is then under-clocked by as much as 40% (bringing the performance down to the original level but at a fraction of the power cost). Also, don't be afraid to spend power to save power. That is, don't be afraid to use CPU time monitoring hardware devices for opportunities to disable/hibernate them (even if it will cause your CPU to use a bit more power). Most of the time, this tradeoff results in a net power savings.
One of the most important things to think of as a power aware application developer is to avoid unnecessary timers. If possible use interrupt driven solutions instead of polled solutions. If a timer must be used then use as long poll interval as is possible.
For example if something special should be done at a certain room temperature it is unnecessary to check the temperature every 100 ms since temperature in a room changes slowly. A more reasonable polling interval is could be 60 s.
This affects the power consumption in several ways. In Linux the CPUIDLE subsystem takes the CPU (SOC) to as deep power saving state as possible depending on when it predicts the next wakeup to occur. Having a lot of timers in a system will fragment the sleep making it impossible to go to the deeper sleep states for longer periods. A typical deep sleep state for CPUIDLE turns the CPU off but keeps the RAM in self refresh. When a timer triggers the CPU will boot and serve the timer of the application.
It's not actually your topic, but it might come in handy to log your progress: i was looking for testing / measuring my embedded linux system. chris desjardins from this forum recommended me this:
I have successfully used bootchart in the past:
http://elinux.org/Bootchart
Here is a list of other things that may also help:
http://elinux.org/Boot_Time

Resources