Difference between user-space driver and kernel driver [duplicate]

I have been reading "Linux Device Drivers" by Jonathan Corbet. I have some questions that I want to know:
What are the main differences between a user-space driver and a kernel driver?
What are the limitations of both of them?
Why are user-space drivers commonly used and preferred nowadays over kernel drivers?

What are the main differences between a user-space driver and a kernel driver?
User space drivers run in user space. Kernel drivers run in kernel space.
What are the limitations of both of them?
The kernel driver can do anything the kernel can, so you could say it has no limitations. But kernel drivers are much harder to "prove correct" and debug. It's all too easy to introduce race conditions, or to use a kernel function in the wrong context or with the wrong locking. Things will appear to work for a while, but cause problems (including crashing the whole system) down the road. Drivers must also be wary when reading any input (both from the device and from userspace) because invalid data can sometimes cause crashes.
A user-space driver usually needs a small shim in the kernel to do its bidding. Usually, that 'shim' provides a simpler API. For example, the FUSE layer lets people write file systems in any language. They can be mounted, read/written, then unmounted. The shim must also protect the kernel against all invalid input.
User-space drivers have lots of limitations. For example, the kernel reserves some memory for use during emergencies, but that is not available to user space. Under memory pressure, the kernel's OOM killer may kill user-space programs, but it never kills kernel threads. User-space programs may be swapped out, which could leave your device unavailable for several seconds. (Kernel code cannot be swapped out.) Running driver code in user space also requires several context switches per operation, and these cost a lot of CPU time. If your device is a 300 baud modem, nobody will notice. But if it's a gigabit Ethernet card, and every packet has to go to your userspace driver before it gets to the real user, the system will have major bottlenecks.
User space programs are also "harder" to use because you have to install that user-space software, which often has many library dependencies. Kernel modules "just work".
Why are user-space drivers commonly used and preferred nowadays over kernel drivers?
The question is "Does this complexity really need to be in the kernel?"
I used to work for a company that made USB dongles that talked a particular protocol. We could have written a full kernel driver, but instead just wrote our program on top of libusb.
The advantages: The program was portable between Linux, Mac, and Windows. No worrying about our code vs. the GPL.
The disadvantages: If the device needed to send data to the PC and get a response quickly, there was no guarantee that would happen. For example, if we needed a real-time control loop on the PC, it would be harder to have bounded response times. (Maybe not entirely impossible on Linux.)
If there is a way to do it in userspace, I would try that first. Only if there are significant performance bottlenecks, or significant complexity in keeping it in userspace would you move it. Even then, consider the "shim" approach, and/or the "emulator" approach (where your kernel module makes your device look like a serial port or a block device.)
On the other hand, if there are already several kernel modules similar to what you want, then start there.

Related

Access PCI memory BAR with low latency (Linux)

Background:
I have a PCI card, which is basically a clock. It gets the time by GPS and saves the current time in a certain register.
Goal:
I want to read a limited number of registers/bytes (for example the current time) over and over again, with the lowest possible latency. (The clock provides very high precision, and I think I will lose precision as the latency increases.) The operating system is RedHat. The programming language is C/C++. I also want to write to the device memory, where latency is not an issue.
Possible Ways to go:
I see these ways. If you see another, please tell me:
1. Writing a Linux kernel module driver, which creates a character device (or one character device for each register to read). Then a user space application can do a "read" on the /dev/ file(s).
2. DMA
3. mmap the sysfs resourceX file to user space by a user space application (systemcall). (like here for example)
4. Write a Linux kernel module driver which implements a mmap file operation.
Questions:
Which is the way with the lowest latency when it comes to the actual reading of the register? I am aware that mmap causes a lot of overhead in the kernel, but as far as I understand that is only for initialisation.
Is way 3 a legit way to go? It looks like a hack to me. How can I determine the /sys/ path automatically from the application?
Is there a difference between way 3 and 4? I am new to PCI driver programming and I think I didn't really understand how way 4 works. I read this (and other chapters of that book), but maybe you can give me a hint or an example. I would appreciate that.

Method 3 or 4 should work fine. There’s no difference between them with respect to latency. Latency would be on the order of 100 ns.
Method 4 would be needed if you need to initialize the device, or control which applications are allowed to access it, or enforce one reader at a time, etc. Method 3 does seem like a bit of a hack because it skips all of this. But it is simpler if you don’t need such things.
A character device is definitely higher latency, because it requires a kernel transition each time the device is read.
The latency of a DMA method depends entirely on how frequently the device writes the time to memory. It is lower latency for the CPU to access memory than MMIO, but if the device only does DMA once a millisecond, then that would be your latency. Also, that method generates a lot of useless DMA traffic, since the CPU would read the value far less often than it is written.

Adding to @prl's answer...
Method 3 seems perfectly legit to me. That's what it's for. You may want to take a look at the kernel documentation file: https://www.kernel.org/doc/Documentation/filesystems/sysfs-pci.txt
You can also use the /sys filesystem to find your device. First, note the vendor ID and device ID for your clock card (and subsystem vendor / device if necessary), then you can easily walk the /sys/devices hierarchy, looking for a matching device (using the vendor, device, etc. special files). Once you've found it, you presumably know which resourceN file to open from the device's data sheet, then mmap it at the appropriate offset and you're done.
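The lookup described above can be sketched roughly as follows. This is only an illustration: it scans the flat /sys/bus/pci/devices directory rather than walking /sys/devices, and the function name `find_pci_devices` and the IDs in the usage line are hypothetical, not taken from any particular card.

```python
import os

def find_pci_devices(vendor_id, device_id, root="/sys/bus/pci/devices"):
    """Return the sysfs directories of PCI devices matching vendor/device IDs.

    `root` defaults to the flat per-device directory that sysfs exposes;
    it is a parameter so the function can be pointed at another tree.
    """
    matches = []
    if not os.path.isdir(root):
        return matches
    for name in sorted(os.listdir(root)):
        dev_dir = os.path.join(root, name)
        try:
            # Both files contain a hex string such as "0x8086\n".
            with open(os.path.join(dev_dir, "vendor")) as f:
                vendor = int(f.read().strip(), 16)
            with open(os.path.join(dev_dir, "device")) as f:
                device = int(f.read().strip(), 16)
        except (OSError, ValueError):
            continue  # not a device directory, or unreadable: skip it
        if vendor == vendor_id and device == device_id:
            matches.append(dev_dir)
    return matches
```

A call such as `find_pci_devices(0x10ee, 0x7011)` (made-up IDs) would return the matching device directories; the resourceN file to map then lives inside each of them.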
That all assumes that your device is configured and enabled already. Typically a PCI device is not enabled to do anything when the system boots. Some driver needs to claim the device, and initialize / configure it. Once that is done, if the time is accessible just by reading a register or two, you can go with method 3. (I'm not sure: it may be possible for a PCI device to be self-initializing but I've never seen one. I think probably something needs to enable its memory space at the very least. Likely that could be done from user-space if the setup is small enough / simple enough.)
The primary difference with method 4 is that the driver controlling the device would provide support for allowing the area to be mmap'd explicitly. For the user-space application, there is little difference between the two methods aside from the device name used. For method 4, the driver's probably going to provide a symbolic device name /dev/clock0 or something like that for use by the user-space application (and presumably the application then doesn't need to go find the device, it would just know the device file name to open).
From user-space, you will do the mmap operation in much the same way with either method. In method 4, the driver internally supplies the physical address to map -- and possibly the offset -- instead of the generic PCI subsystem doing so, but either way, it's just open + mmap.
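To make the "open + mmap" sequence concrete, here is a minimal sketch. The helper name `read_reg32`, the little-endian 32-bit register layout, and the one-page default mapping length are assumptions for illustration; the real offset and width come from the device's data sheet. For the latency-critical path in the question you would do the same thing in C with a volatile pointer, since Python does not guarantee that the read is issued as a single 32-bit bus access.

```python
import mmap
import os
import struct

def read_reg32(path, reg_offset, map_len=mmap.PAGESIZE):
    """Map `path` read-only and return the 32-bit value at `reg_offset`.

    `path` would be a sysfs resourceN file (method 3) or a driver's
    device node (method 4); the user-space side looks the same.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        with mmap.mmap(fd, map_len, mmap.MAP_SHARED, mmap.PROT_READ) as m:
            # Assumed layout: little-endian unsigned 32-bit register.
            return struct.unpack_from("<I", m, reg_offset)[0]
    finally:
        os.close(fd)
```

In a real polling loop you would keep the mapping open and read from it repeatedly; the mmap setup cost is paid only once.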
Linux driver programming is not terribly difficult, but there's a significant learning curve there if you haven't done it before, so I definitely wouldn't go with method 4 unless there were a real need to do so.

I/O Memory Mapping

I am reviewing the essentials of I/O, and while I think I understand most of what's going on, I'm still confused as to how either physical addresses or separate ports are mapped to individual devices. Does the computer poll the bus on system boot, assigning addresses to devices one by one, or are there fixed addresses that are loaded into memory somewhere? If this is done via the BIOS, how is this memory layout information relayed to the operating system?
Thanks for your help!

(this question has been asked and answered before, you should search first)
depends on the platform, you were not specific
some systems, some peripherals in those systems, are hardcoded by the chip/system designers.
for pci(e), as defined by that, you enumerate the bus(es) searching for attached peripherals, and those peripherals' configuration spaces (which are defined by the peripheral vendor per their needs) indicate how many address windows they need and how big each one is. For an x86 pc, the bios does this enumeration, not the operating system. for other platforms it depends on that platform, it may be the bootloader or operating system. but someone has to take the available space (basically hardcoded for that platform, knowing the platform and what is used already) and divide it up. for x86 it used to be just one gig that was divided up in the 32 bit days, and still happens on some systems, but for 64 bit systems the bioses open that up to 2gig for everything, and can place that in a high address space to avoid ram (ever wonder why your 32 bit system with 4gig of dram only had 3gig usable?). naturally a flat memory space is only an illusion, the windows asked for by the pci peripherals can be small windows into their space, video cards with lots of ram for example. you use the csrs to move the window about, kind of like standing in your house looking out a small window and physically moving side to side to see more stuff through the window, but you only see the size of the window at any one time.
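as an illustration of how a peripheral "indicates how big" a window it needs: the usual PCI sizing probe (not spelled out above) is that the enumerating software writes all ones to a base address register and reads it back, and the bits the device leaves at zero encode the size. a minimal sketch of decoding that readback, covering only 32-bit memory BARs, with a hypothetical function name:

```python
def mem_bar_size(readback):
    """Size in bytes requested by a 32-bit memory BAR.

    `readback` is the value read from the BAR after writing 0xFFFFFFFF to it.
    """
    # The low 4 bits of a memory BAR are type/prefetch flags, not address bits.
    mask = readback & 0xFFFFFFF0
    if mask == 0:
        return 0  # BAR not implemented
    # Invert the writable address bits and add one to get the window size.
    return (~mask & 0xFFFFFFFF) + 1
```

for example, a readback of 0xFFF00000 decodes to a 1 MiB window; the enumerator then picks a free, suitably aligned address and writes it into the BAR.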
same goes for usb, it is enumerated, the busses are searched and the peripherals answer. with usb though it doesn't map into the address space of the host.
how the operating system finds this information is heavily dependent on the type of system. with bios on an x86 there is a known way to get that info, I think you can also get at the same info in dos (yes dos is still heavily used). for non pcie or usb the operating system drivers have to find the peripherals or just know, if the platform is consistent (address of the serial ports in a pc) or have a way of finding them without harming other devices or crashing. there are the cases where the operating system itself did the enumeration. or the bootloader if that is the place where enumeration happened. but each combination of bootloaders and operating systems on top of various platforms may each have their own different solution, no reason to expect them to be the same.
okay you did say bios and have a bios tag, implying x86 systems. the bios does pci/pcie enumeration at boot time, if you don't set up your bios to know that your operating system is 64 bit it may take a gig out of your lower 4Gig space for the pcie devices (and if you set for 64 bit but install a 32 bit operating system, then you are in trouble there for other reasons). I don't remember, but would assume there are bios calls the operating system can use to find out what the bios had done, should not be hard at all to find this information. Anything not discoverable in this way is likely legacy and hardcoded or uses legacy techniques for being discoverable (isa bus style search for a bios across a range of addresses, etc). the pcie/usb vendor and product id information tell the drivers what is there and from that they have hardcoded offsets into those spaces to complete the addresses needed to communicate with the peripherals.

Userspace vs kernel space driver

I am looking to write a PWM driver. I know that there are two ways we can control a hardware driver:
User space driver.
Kernel space driver
In general (not considering the PWM driver case), if we have to decide between a user space driver and a kernel space driver, what factors should we take into consideration apart from these?
User space driver can directly mmap() /dev/mem memory to their virtual address space and need no context switching.
Userspace driver cannot have interrupt handlers implemented (They have to poll for interrupt).
Userspace driver cannot perform DMA (As DMA capable memory can be allocated from kernel space).

Of the three factors you listed, only the first one is actually correct. As for the rest, not really. It is possible for user-space code to perform DMA operations, no problem with that. Many hardware appliance companies employ this technique in their products. It is also possible to have an interrupt-driven user-space application, even when all of the I/O is done with a full kernel bypass. Of course, it is not as easy as simply doing an mmap() on /dev/mem.
You would have to have a minimal portion of your driver in the kernel — that is needed in order to provide your user space with a bare minimum that it needs from the kernel (because if you think about it — /dev/mem is also backed up by a character device driver).
For DMA, it is actually too darn easy — all you have to do is to handle mmap request and map a DMA buffer into the user space. For interrupts — it is a little bit more tricky, the interrupt must be handled by the kernel no matter what, however, the kernel may not do any work and just wake up the process that calls, say, epoll_wait(). Another approach is to deliver a signal to the process as done by DOSEMU, but that is very slow and is not recommended.
As for your actual question, one factor that you should take into consideration is resource sharing. As long as you don't have to share a device across multiple applications and there is nothing that you cannot do in user space, go for user space. You will probably save tons of time during the development cycle, as writing user space code is extremely easy. When, however, two or more applications need to share the device (or its resources), chances are that you will spend a tremendous amount of time making it possible: just imagine multiple processes forking, crashing, mapping (the same?) memory concurrently, etc. And after all, IPC is generally done through the kernel, so if applications need to start "talking" to each other, the performance might degrade greatly. This is still done in real life for certain performance-critical applications, though, but I don't want to go into those details.
Another factor is the kernel infrastructure. Let's say you want to write a network device driver. It's not a problem to do that in user space. However, if you do, you'd need to write a full network stack too, as it won't be possible to use Linux's default one that lives in the kernel.
I'd say go for user space if it is possible and the amount of effort to make things work is less than writing a kernel driver, and keeping in mind that one day it might be necessary to move code into the kernel. In fact, this is a common practice to have the same code being compiled for both user space and kernel space depending on whether some macro is defined or not, because testing in user space is a lot more pleasant.

Another consideration: it is far easier to debug user-space drivers. You can use gdb, valgrind, etc. Heck, you don't even have to write your driver in C.
There's a third option beyond just user space or kernel space drivers: some of both. You can do just the kernel-space-only stuff in a kernel driver and do everything else in user space. You might not even have to write the kernel space driver if you use the Linux UIO driver framework (see https://www.kernel.org/doc/html/latest/driver-api/uio-howto.html).
I've had luck writing a DMA-capable driver almost completely in user space. UIO provides the infrastructure so you can just read/select/epoll on a file to wait on an interrupt.
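As a rough sketch of what "read/select/epoll on a file to wait on an interrupt" looks like: per the UIO howto, a read of exactly 4 bytes from the device node returns the total interrupt count as a 32-bit integer. The helper name `wait_for_interrupt` and the /dev/uio0 path in the usage note are assumptions for illustration.

```python
import os
import select
import struct

def wait_for_interrupt(fd, timeout=None):
    """Wait until `fd` is readable, then return the 32-bit interrupt count.

    Returns None if `timeout` (seconds) expires first.
    """
    readable, _, _ = select.select([fd], [], [], timeout)
    if not readable:
        return None
    data = os.read(fd, 4)  # UIO expects a read of exactly 4 bytes
    if len(data) != 4:
        raise OSError("short read from UIO device")
    # Native byte order, signed 32-bit integer.
    return struct.unpack("=i", data)[0]
```

Typical use would be `fd = os.open("/dev/uio0", os.O_RDONLY)` followed by a loop around `wait_for_interrupt(fd)`; comparing successive counts tells you whether any interrupts were missed.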
You should be cognizant of the security implications of programming the DMA descriptors from user space: unless you have some protection in the device itself or an IOMMU, the user space driver can cause the device to read from or write to any address in physical memory.

Disabling Multithreading during runtime

I am wondering if Intel's processor provides instructions in their instruction set
to turn on and off the multithreading or hyperthreading capability? Basically, I wanna
know if an Operating System can control these feature via instructions somehow?
Thank you so much
Mareike

Most operating systems have a facility for changing a process' CPU affinity, thereby restricting it to a single physical or virtual core. But multithreading is a program architecture, not a CPU facility.
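On Linux, for instance, the affinity facility mentioned above is exposed to Python through `os.sched_setaffinity`. A minimal sketch (the helper name `pin_to_one_cpu` is made up, and this restricts only one process; it does not disable hyperthreading):

```python
import os

def pin_to_one_cpu(pid=0):
    """Restrict process `pid` (0 = the caller) to a single logical CPU.

    Picks the lowest-numbered CPU the process is currently allowed to use
    and returns its number. Linux-only.
    """
    allowed = os.sched_getaffinity(pid)
    cpu = min(allowed)
    os.sched_setaffinity(pid, {cpu})
    return cpu
```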

I think that what you are trying to ask is, "Is there a way to prevent the OS from utilizing hyperthreading and/or multiple cores?"
The answer is, definitely. This isn't governed by a single instruction, and indeed it's not like you can just write a device driver that would automagically disable all of that hardware. Most of this depends on how the kernel configures the interrupt controllers at boot time.
When a machine is first started, there is a designated processor that is used for bootstrapping. It is the responsibility of the OS to configure the multiprocessor hardware accordingly. On PC platforms this would involve reading information about the multiprocessor configuration from in-memory tables provided by the boot firmware. This data would likely conform to either the ACPI or the Intel multiprocessor specifications. The kernel then uses that data to configure the APIC hardware accordingly.

Multithreading and multitasking are not special instructions or modes in the CPU. They're just fancy ways people who write operating systems use interrupts. There is a hardware timer, basically a counter being incremented by a clocking signal, that triggers an interrupt when it overflows. The exact interrupt is platform specific. In the olden days this timer was actually a separate chip/circuit on the motherboard that was simply attached to one of the CPU's interrupt pins. Modern CPUs have this timer built in. So, to turn off multithreading and multitasking the OS can simply disable the interrupt signal.
Alternatively, since it's the OS's job to actually schedule processes/threads, the OS can simply decide to ignore all threads and not run them.
Hyperthreading is a different thing. It sort of allows the OS to see a second virtual CPU that it can execute code on. Never had to deal with the thing directly so I'm not sure how to turn it off (or even if it is possible).

There is no x86 instruction that disables HyperThreading or additional cores. But there are BIOS settings that can turn off these features. Because it is set in the BIOS, it requires rebooting, and generally it's beyond OS control. There is a Windows boot option that limits the number of active cores, but HyperThreading can be turned on/off only by the BIOS. Intel's current HyperThreading implementation doesn't allow turning it on and off dynamically (and that won't be easily implemented in the near future).
I have assumed 'multithreading' in your question means 'hardware multithreading', which is technically identical to HyperThreading. However, if you really meant software-level multithreading (i.e., multitasking), then it's a totally different question. Turning it off is (almost) impossible for modern operating systems, since they support multitasking by default. And the question doesn't really make sense in that case. It could make sense if you wanted to run MS-DOS (in x86 real mode, where only a single task runs).
p.s. Please note that 'multithreading' can be either hardware or software. Also I agree all others' answers such as processor/thread affinity.

What is the ideal & fastest way to communicate between kernel and user space?

I know that information exchange can happen via following interfaces between kernel and user space programs
system calls
ioctls
/proc & /sys
netlink
I want to find out
If I have missed any other interface?
Which one of them is the fastest way to exchange large amounts of data?
(and if there is any document/mail/explanation supporting such a claim that I can refer to)
Which one is the recommended way to communicate? (I think its netlink, but still would love to hear opinions)

The fastest way to exchange vast amounts of data is memory mapping. The mmap call can be used on a device file, and the corresponding kernel driver can then decide to map kernel memory into the user address space. A good example of this is the Video For Linux drivers, and I suppose the frame buffer driver works the same way. For a good explanation of how the V4L2 driver works, you have:
The lwn.net article about streaming I/O
The V4L2 spec itself
You can't beat memory mapping for large amounts of data, because there is no memcpy-like operation involved: the underlying physical memory is effectively shared between kernel and userspace. Of course, as with any shared-memory mechanism, you have to provide some synchronisation so that kernel and userspace don't think they have ownership at the same time.
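The "no copy" property can be seen with a user-space analogy (this is not a kernel driver, just two mappings of the same file standing in for the driver side and the application side; the function name is made up):

```python
import mmap
import tempfile

def shared_mapping_demo():
    """Write through one mapping and read the same bytes through another."""
    with tempfile.TemporaryFile() as f:
        f.truncate(mmap.PAGESIZE)                        # back the mappings with one page
        producer = mmap.mmap(f.fileno(), mmap.PAGESIZE)  # stands in for the driver side
        consumer = mmap.mmap(f.fileno(), mmap.PAGESIZE)  # stands in for the application side
        try:
            producer[0:5] = b"frame"
            # No read()/write() call moved the data: both mappings
            # refer to the same underlying pages.
            return bytes(consumer[0:5])
        finally:
            producer.close()
            consumer.close()
```

Nothing here tells the consumer when the data is ready, which is exactly the synchronisation problem mentioned above; V4L2 handles it by explicitly queuing and dequeuing buffers.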

Shared memory between kernel and userspace is doable.
http://kerneltrap.org/node/14326
For instructions/examples.
You can also use named pipes, which are pretty fast.
All this really depends on what data you are sharing, is it concurrently accessed and what the data is structured like. Calls may be enough for simple data.
Linux kernel /proc FIFO/pipe
Might also help
good luck

You may also consider relay (formerly relayfs):
"Basically relayfs is just a bunch of per-cpu kernel buffers that can be efficiently written into from kernel code. These buffers are represented as files which can be mmap'ed and directly read from in user space. The purpose of this setup is to provide the simplest possible mechanism allowing potentially large amounts of data to be logged in the kernel and 'relayed' to user space."
http://relayfs.sourceforge.net/

You can obviously exchange data with copy_from_user etc., and you can easily set up a character device driver (basically all you have to do is fill in a file_operations structure), but this is by far not the fastest way.
I have no benchmarks, but system calls on modern systems should be the fastest. My reasoning is that this is what has been most optimized. It used to be that to get from user to kernel mode one had to raise a software interrupt, which would go through the interrupt table (an array), locate the handler for the interrupt (0x80), and then switch to kernel mode. This was really slow, and then came the sysenter instruction, which makes this process much faster. Without going into details, sysenter loads the kernel's CS:EIP directly from model-specific registers, so the transition is quite fast.
Shared memory on the contrary requires writing to and reading from memory, which is infinitely more expensive than reading from a register.

Here is a possible compilation of all the interfaces, although in some ways they overlap one another (e.g., sockets and system calls both effectively use system calls):
Procfs
Sysfs
Configfs
Debugfs
Sysctl
devfs (eg, Character Devices)
TCP/UDP Sockets
Netlink Sockets
Ioctl
Kernel System Calls
Signals
Mmap

As for shared memory, I've found that even with NUMA, two threads running on two different cores and communicating through shared memory still have to write to and read from the L3 cache. If you are lucky (both cores in one socket), this is about 2x slower than a syscall; if not (different sockets), it is about 5x or more slower than a syscall. I think the syscall's hardware mechanism helps.
