Access PCI memory BAR with low latency (Linux)

Access PCI memory BAR with low latency (Linux) - linux

Background:
I have a PCI card, which is basically a clock. It gets the time by GPS and saves the current time in a certain register.
Goal:
I want to read a limited number of registers/bytes (for example the current time) over and over again, with the lowest possible latency. (The clock provides very high precision and I think I will loose precision the higher the latency is.). The operating system is RedHat. The programming language is C/C++. I also want to write to the device memory, whereby latency is not an issue.
Possible Ways to go:
I see these ways. If you see another, please tell me:
Writing a Linux kernel module driver, which creates a character device (or one character device for each register to read). Then a user space application can do a "read" on the /dev/ file(s).
DMA
mmap the sysfs resourceX file to user space by a user space application (systemcall). (like here for example)
Write a Linux kernel module driver which implements a mmap file operation.
Questions:
Which is the way with the lowest latency when it comes to the actual reading of the register? I am aware that mmap causes a lot of overhead in the kernel, but as far as I understand that is only for initialisation.
Is way 3 a legit way to go? It looks like a hack to me. How can I determine the /sys/ path automatically from the application?
Is there a difference between way 3 and 4? I am new to PCI driver programming and I think I didn't really understand how way 4 works. I read this (and other chapters of that book), but maybe you can give me a hint or an example. I would appreciate that.

Method 3 or 4 should work fine. There’s no difference between them with respect to latency. Latency would be in the order of 100 ns.
Method 4 would be needed if you need to initialize the device, or control which applications are allowed to access it, or enforce one reader at a time, etc. Method 3 does seem like a bit of a hack because it skips all of this. But it is simpler if you don’t need such things.
A character device is definitely higher latency, because it requires a kernel transition each time the device is read.
The latency of a DMA method depends entirely on how frequently the device writes the time to memory. It is lower latency for the CPU to access memory than MMIO, but if the device only does DMA once a millisecond, then that would be your latency. Also, that method generates a lot of useless DMA traffic, since the CPU would read the value far less often than it is written.

Adding to #prl's answer...
Method 3 seems perfectly legit to me. That's what it's for. You may want to take a look at the kernel documentation file: https://www.kernel.org/doc/Documentation/filesystems/sysfs-pci.txt
You can also use the /sys filesystem to find your device. First, note the vendor ID and device ID for your clock card (and subsystem vendor / device if necessary), then you can easily walk the /sys/devices hierarchy, looking for a matching device (using the vendor, device, etc. special files). Once you've found it, you presumably know which resourceN file to open from the device's data sheet, then mmap it at the appropriate offset and you're done.
That all assumes that your device is configured and enabled already. Typically a PCI device is not enabled to do anything when the system boots. Some driver needs to claim the device, and initialize / configure it. Once that is done, if the time is accessible just by reading a register or two, you can can go with method 3. (I'm not sure: it may be possible for a PCI device to be self-initializing but I've never seen one. I think probably something needs to enable its memory space at the very least. Likely that could be done from user-space if the setup is small enough / simple enough.)
The primary difference with method 4 is that the driver controlling the device would provide support for allowing the area to be mmap'd explicitly. For the user-space application, there is little difference between the two methods aside from the device name used. For method 4, the driver's probably going to provide a symbolic device name /dev/clock0 or something like that for use by the user-space application (and presumably the application then doesn't need to go find the device, it would just know the device file name to open).
From user-space, you will do the mmap operation in much the same way with either method. In method 4, the driver internally supplies the physical address to map -- and possibly the offset -- instead of the generic PCI subsystem doing so, but either way, it's just open + mmap.
Linux driver programming is not terribly difficult, but there's a significant learning curve there if you haven't done it before, so I definitely wouldn't go with method 4 unless there were a real need to do so.

Related

For emulating the Gameboy, why does it matter that the memory is broken up into different areas?

So I'm writing a gameboy emulator, and I'm not 100% sure why other projects took the time to break up the memory into proper categories. I don't know if there is a major technical dilemma I'm missing (maybe handling illegal parameters in instructions?), but it seems like the only thing that matters is that the address given by a write instruction is retrievable by the proper read instruction. So for a sub question, if I'm working under the assumption that the assembly is perfectly legal (meaning nothing is trying to read/write where it can't), can I just make a big array and read and write to it?
Note that this is a conceptual question and that I am aware a big array would be a memory hog, I'm not necessarily looking for the best way to do it, simply trying to learn how it works and why other emulator developers did it the way they did.

You are going to have read only memory, read/write memory and memory mapped I/O (peripherals etc). So you need to decode the address to some extent to break it into the major categories, then for the peripherals you have to emulate all of those so you have to get very detailed in your address decoding.
For the peripherals you will need to detect a read/write to some address which you cannot do by simply landing the writes in an array (two writes of the same value for example make a difference, you cant just scan some array to look for changes you have to trigger on reads and writes and perform the hardware action).
If you wish to be cycle accurate you will also need to know the timings for the rams and roms in order to mimic those, depending on how many banks of each or if timing is dependent on that you will need to decode the address further.
Hardware decodes these addresses to the same level so if you are emulating hardware then you need to...emulate hardware...and do the same amount of address decoding.

I'm going to be gameboy specific here. Look at gameboy's address space map. The address space itself is divided, it's not that emulators do it. Hardware itself operates that way.
Here's some of the regions that can't be implemented as just an array:
0x0000-0x3FFF. First bank of a ROM. It's read-only but not quite. Read the next one
0x4000-0x7FFF. Switchable ROM bank, it's also not quite read-only. Cartridges that don't fit into gameboy's address space contain memory bank controller. ROM will write to some specific read-only ROM regions to actually select which ROM bank is mapped into 0x4000-0x7FFF address range. So you have to detect these writes and then redirect reads into the selected ROM bank.
0xA000-0xBFFF. Switchable RAM bank. Same thing as with switchable ROM banks but now for RAM. Cartridges may contain additional RAM that's being mapped into gameboy's address space. Which bank of the RAM is mapped is controlled, again, by writes to specific read-only regions.
0xFF00-0xFF4B. IO ports. Here you have hardware registers mapped into address space. Gameboy has several hardware components each with it's own registers and even memory (display controller, sound processor, timers etc). To control that hardware ROM reads and writes into the IO ports. You obviously have to detect these writes so you can emulate the hardware they correspond to. It's not just CPU and memory you have to emulate. I would even say that the least part of it and the easy one. For example, it much harder to get display controller and sound channels right. They have complicated logic, bugs and very tricky behaviour that's not documented very well but is crucial to achieve accurate emulation. Wave channel in particular gave me a hard time.

How does an OS kernel define all the inputs and outputs?

I want to know how an OS kernel defines its own inputs and outputs to make the computer run. Of course you need the right hardware for it to work, but how can you simply just make some variable and call it USB_PORT_1 or something? Is it having to do with firmware as well? Assigning arbitrary values do nothing on its own, so there must be something I am missing between the interaction of the hardware and software when you plug in a 1 terabyte HDD into your USB 3.0 Slot that is marked by the kernel as USB3_PORT_0. At this point there is obviously some stuff going on in the firmware, so what is it?
Reason: I'm making one.

To truely understand the interaction between hardware and software you have to understand how things work at a low level. What is a variable? In a programming language, variables may be assigned values that can later be modified, etc. But where is this stored physically in the machine? The truth is that it can be stored in several places. It could be in one of the processor's registers, it could be in RAM, or it could be in a completely difference place entirely.
When the kernel wishes to communicate with hardware, sometimes it may go through what you call firmware but for the most part it doesn't have to. Hardware exposes itself to the kernel in a variety of ways but the simplest way to think about it is as RAM. RAM is accessable with an address, so 0x1000 is a memory address somewhere in RAM. Speaking generallly, There is no reason that any particular address has to be mapped to RAM. Suppose I have a USB controller. I can map some address (lets call it 0xDEADBEEF) to this memory controller. So, if I read from 0xDEADBEEF it might tell me how many devices are connected to the system. Another adjacent address may tell me which port, etc etc. Every device does this differently, so we have device drivers that tell the kernel how the device is accessed, and then the kernel doesn't have to wory about specific memory addresses or anything, it simply abstracts everything into something called "USB3_PORT_0." The kernel and software simply use this to refer to the device, and the device driver translates that into a set of accesses via memory with interrupts, etc.
It is impossible for me to enumerate the number of ways harware and software can interact, however this should give you an idea of how it is done.

Any built-in Linux methods for AXI-burst type devices?

I need to communicate with an FPGA device based on an AXI-burst interface. What are the ways to access such a device through Linux without involving a DMA? Burst is an intrinsic property of the AXI standard, which should typically be triggered automatically when large amounts of data are being transferred. And the bigger problem is the FPGA is designed so as to respond only to burst-type requests over the AXI bus. So this causes serious issues on Linux when the application tries sequential copy. I have already tried memcpy and it doesn't work.

I assume your “FPGA device” is a custom block, memory-mapped over AXI interface to Cortex-A9. I think there are 2 or 3 ways you could make this work.
1) Cacheable mapping. Cache hardware interface does burst-transfer of an entire cache line at a time. You would need to manually clean (after writes) and invalidate (before reads).
2) Non-cacheable mapping, and have an ARM assembly language routine handle the low-level transfer. I think the “Load and Store Multiple registers” instructions can provide what you are looking for.
I had a similar problem where an AXI peripheral (custom memory controller) needed to be accessed with 8-byte transfers from Cortex-A9 processor. The usual ARM instructions, of course, transfer 1, 2, or 4 bytes (byte, halfword, word). Those worked through cacheable mapping, but not through non-cacheable mapping. LDM/STM, 2 words at a time, worked with both mappings.
AHB/AXI transfer modes are implementation dependent, of course. Per your description, you need INCR or WRAP modes rather than SINGLE. But it should not have to be that way. That brings up the third way you could make this work:
3) Talk with your digital hardware designer, make him aware of the software impact of his implementation.
In my opinion, you shouldn’t have to do unusual / custom low-level MMU operations. Linux has high-level methods, you would put standard hooks in your device driver and/or board.c, the main option is whether to go uncached (i.e. COHERENT). Refer to LDD3.

You need to set up the MMU to inform the hardware of the capabilities of your component. You also need to ensure that the entire interconnect supports bursts and isn't doing any conversions (which can happen if there's any ambiguity about your component's capabilities when the interconnect is generated).
To set up the MMU you perform a call like this:
/* shareable device: S=b0 TEX=b000 AP=b11, C=b0, B=b1 = 0xC06*/
Xil_SetTlbAttributes(COMPONENT_BASE_ADDRESS, 0xC06);
attributes are defined as follows (from Zynq Technical Reference Manual):
Encoding Bits Cache Attribute
C B
0 0 Non-cacheable
0 1 Write-back, write-allocate
1 0 Write-through, no write-allocate
1 1 Write-back, no write-allocate
So the above line would set the region to write-back, write allocate, which may give you burst access on write.
See also Xilinx AR#47406 and this forum post.

Userspace vs kernel space driver

I am looking to write a PWM driver. I know that there are two ways we can control a hardware driver:
User space driver.
Kernel space driver
If in general (do not consider a PWM driver case) we have to make a decision whether to go for user space or kernel space driver. Then what factors we have to take into consideration apart from these?
User space driver can directly mmap() /dev/mem memory to their virtual address space and need no context switching.
Userspace driver cannot have interrupt handlers implemented (They have to poll for interrupt).
Userspace driver cannot perform DMA (As DMA capable memory can be allocated from kernel space).

From those three factors that you have listed only the first one is actually correct. As for the rest — not really. It is possible for a user space code to perform DMA operations — no problem with that. There are many hardware appliance companies who employ this technique in their products. It is also possible to have an interrupt driven user-space application, even when all of the I/O is done with a full kernel-bypass. Of course, it is not as easy simply doing an mmap() on /dev/mem.
You would have to have a minimal portion of your driver in the kernel — that is needed in order to provide your user space with a bare minimum that it needs from the kernel (because if you think about it — /dev/mem is also backed up by a character device driver).
For DMA, it is actually too darn easy — all you have to do is to handle mmap request and map a DMA buffer into the user space. For interrupts — it is a little bit more tricky, the interrupt must be handled by the kernel no matter what, however, the kernel may not do any work and just wake up the process that calls, say, epoll_wait(). Another approach is to deliver a signal to the process as done by DOSEMU, but that is very slow and is not recommended.
As for your actual question, one factor that you should take into consideration is resource sharing. As long as you don't have to share a device across multiple applications and there is nothing that you cannot do in user space — go for the user space. You will probably save tons of time during the development cycle as writing user space code is extremely easy. When, however, two or more applications need to share the device (or its resources) then chances are that you will spend tremendous amount of time making it possible — just imagine multiple processes forking, crashing, mapping (the same?) memory concurrently etc. And after all, IPC is generally done through the kernel, so if application would need to start "talking" to each other, the performance might degrade greatly. This is still done in real-life for certain performance-critical applications, though, but I don't want to go into those details.
Another factor is the kernel infrastructure. Let's say you want to write a network device driver. That's not a problem to do it in user space. However, if you do that then you'd need to write a full network stack too as it won't be possible to user Linux's default one that lives in the kernel.
I'd say go for user space if it is possible and the amount of effort to make things work is less than writing a kernel driver, and keeping in mind that one day it might be necessary to move code into the kernel. In fact, this is a common practice to have the same code being compiled for both user space and kernel space depending on whether some macro is defined or not, because testing in user space is a lot more pleasant.

Another consideration: it is far easier to debug user-space drivers. You can use gdb, valgrind, etc. Heck, you don't even have to write your driver in C.
There's a third option beyond just user space or kernel space drivers: some of both. You can do just the kernel-space-only stuff in a kernel driver and do everything else in user space. You might not even have to write the kernel space driver if you use the Linux UIO driver framework (see https://www.kernel.org/doc/html/latest/driver-api/uio-howto.html).
I've had luck writing a DMA-capable driver almost completely in user space. UIO provides the infrastructure so you can just read/select/epoll on a file to wait on an interrupt.
You should be cognizant of the security implications of programming the DMA descriptors from user space: unless you have some protection in the device itself or an IOMMU, the user space driver can cause the device to read from or write to any address in physical memory.

What is the ideal & fastest way to communicate between kernel and user space?

I know that information exchange can happen via following interfaces between kernel and user space programs
system calls
ioctls
/proc & /sys
netlink
I want to find out
If I have missed any other interface?
Which one of them is the fastest way to exchange large amounts of data?
(and if there is any document/mail/explanation supporting such a claim that I can refer to)
Which one is the recommended way to communicate? (I think its netlink, but still would love to hear opinions)

The fastest way to exchange vast amount of data is memory mapping. The mmap call can be used on a device file, and the corresponding kernel driver can then decide to map kernel memory to user address space. A good example of this is the Video For Linux drivers, and I suppose the frame buffer driver works the same way. For an good explanation of how the V4L2 driver works, you have :
The lwn.net article about streaming I/O
The V4L2 spec itself
You can't beat memory mapping for large amount of data, because there is no memcopy like operation involved, the physical underlying memory is effectively shared between kernel and userspace. Of course, like in all shared memory mechanism, you have to provide some synchronisation so that kernel and userspace don't think they have ownership at the same time.

Shared Memory between kernel and usespace is doable.
http://kerneltrap.org/node/14326
For instructions/examples.
You can also use a named pipe which are pretty fast.
All this really depends on what data you are sharing, is it concurrently accessed and what the data is structured like. Calls may be enough for simple data.
Linux kernel /proc FIFO/pipe
Might also help
good luck

You may also consider relay (formerly relayfs):
"Basically relayfs is just a bunch of per-cpu kernel buffers that can be efficiently written into from kernel code. These buffers are represented as files which can be mmap'ed and directly read from in user space. The purpose of this setup is to provide the simplest possible mechanism allowing potentially large amounts of data to be logged in the kernel and 'relayed' to user space."
http://relayfs.sourceforge.net/

You can obviously do shared memory with copy_from_user etc, you can easily set up a character device driver basically all you have to do is make a file_operation structures but this is by far not the fastest way.
I have no benchmarks but system calls on moderns systems should be the fastest. My reasoning is that its what's been most optimized for. It used to be that to get to from user -> kernel one had to create an interrupt, which would then go to the Interrupt table(an array) then locate the interrupt handlex(0x80) and then go to kernel mode. This was really slow, and then came the .sysenter instruction, which basically makes this process really fast. Without going into details, .sysenter reads form a register CS:EIP immediately and the change is quite fast.
Shared memory on the contrary requires writing to and reading from memory, which is infinitely more expensive than reading from a register.

Here is a possible compilation of all the possible interface, although in some ways they overlapped one another (eg, socket and system call are both effectively using system calls):
Procfs
Sysfs
Configfs
Debugfs
Sysctl
devfs (eg, Character Devices)
TCP/UDP Sockets
Netlink Sockets
Ioctl
Kernel System Calls
Signals
Mmap

As for shared memory , I've found that even with NUMA the two thread running on two differrent cores communicate through shared memory still required write/read from L3 cache which if lucky (in one socket)is
about 2X slower than syscall , and if(not on one socket ),is about 5X-UP
slower than syscall,i think syscall's hardware mechanism helped.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string