How does an OS kernel define all the inputs and outputs? - io

I want to know how an OS kernel defines its own inputs and outputs to make the computer run. Of course you need the right hardware for it to work, but how can you simply make some variable and call it USB_PORT_1 or something? Does it have to do with firmware as well? Assigning arbitrary values does nothing on its own, so there must be something I am missing in the interaction between the hardware and software when you plug a 1 terabyte HDD into the USB 3.0 slot that the kernel marks as USB3_PORT_0. At this point there is obviously some stuff going on in the firmware, so what is it?
Reason: I'm making one.

To truly understand the interaction between hardware and software you have to understand how things work at a low level. What is a variable? In a programming language, variables may be assigned values that can later be modified, etc. But where is this stored physically in the machine? The truth is that it can be stored in several places. It could be in one of the processor's registers, it could be in RAM, or it could be in a completely different place entirely.
When the kernel wishes to communicate with hardware, sometimes it may go through what you call firmware, but for the most part it doesn't have to. Hardware exposes itself to the kernel in a variety of ways, but the simplest way to think about it is as RAM. RAM is accessible with an address, so 0x1000 is a memory address somewhere in RAM. Speaking generally, there is no reason that any particular address has to be mapped to RAM. Suppose I have a USB controller. I can map some address (let's call it 0xDEADBEEF) to this USB controller. So, if I read from 0xDEADBEEF it might tell me how many devices are connected to the system. Another adjacent address may tell me which port, and so on. Every device does this differently, so we have device drivers that tell the kernel how the device is accessed, and then the kernel doesn't have to worry about specific memory addresses or anything; it simply abstracts everything into something called "USB3_PORT_0." The kernel and software simply use this to refer to the device, and the device driver translates that into a set of memory accesses, interrupts, etc.
It is impossible for me to enumerate all the ways hardware and software can interact, but this should give you an idea of how it is done.
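To make the "an address does not have to be RAM" idea concrete, here is a small toy model in Python. It is only a sketch: the Bus and FakeUsbController classes, the register layout, and the usb_device_count "driver" function are all made up for illustration, and it reuses the invented 0xDEADBEEF address from above. Real hardware does this routing in address-decoding logic, not in software.

```python
USB_BASE = 0xDEADBEEF  # made-up address from the text, not a real device


class FakeUsbController:
    """Hypothetical device with two read-only 32-bit registers."""

    def __init__(self, connected_ports):
        self.connected_ports = list(connected_ports)

    def read_register(self, offset):
        if offset == 0x0:  # register 0: how many devices are attached
            return len(self.connected_ports)
        if offset == 0x4:  # register 1: lowest port number in use
            return min(self.connected_ports) if self.connected_ports else 0xFFFFFFFF
        return 0  # unimplemented registers read as zero in this toy


class Bus:
    """Routes every access either to RAM or to a device's registers."""

    def __init__(self, ram_size):
        self.ram = bytearray(ram_size)
        self.devices = []  # list of (base, size, device)

    def attach(self, base, size, device):
        self.devices.append((base, size, device))

    def read32(self, address):
        for base, size, device in self.devices:
            if base <= address < base + size:
                return device.read_register(address - base)
        return int.from_bytes(self.ram[address:address + 4], "little")

    def write32(self, address, value):
        # Only RAM is writable in this toy; the fake device is read-only.
        self.ram[address:address + 4] = value.to_bytes(4, "little")


def usb_device_count(bus):
    """The 'driver': hides the magic address behind a meaningful name."""
    return bus.read32(USB_BASE)
```

The caller of usb_device_count never sees 0xDEADBEEF, which is the kind of abstraction a device driver provides to the rest of the kernel.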

Related

Access PCI memory BAR with low latency (Linux)

Background:
I have a PCI card, which is basically a clock. It gets the time by GPS and saves the current time in a certain register.
Goal:
I want to read a limited number of registers/bytes (for example the current time) over and over again, with the lowest possible latency. (The clock provides very high precision and I think I will lose precision the higher the latency is.) The operating system is RedHat. The programming language is C/C++. I also want to write to the device memory, whereby latency is not an issue.
Possible Ways to go:
I see these ways. If you see another, please tell me:
1. Writing a Linux kernel module driver, which creates a character device (or one character device for each register to read). Then a user space application can do a "read" on the /dev/ file(s).
2. DMA
3. mmap the sysfs resourceX file to user space by a user space application (systemcall). (like here for example)
4. Write a Linux kernel module driver which implements a mmap file operation.
Questions:
Which is the way with the lowest latency when it comes to the actual reading of the register? I am aware that mmap causes a lot of overhead in the kernel, but as far as I understand that is only for initialisation.
Is way 3 a legit way to go? It looks like a hack to me. How can I determine the /sys/ path automatically from the application?
Is there a difference between way 3 and 4? I am new to PCI driver programming and I think I didn't really understand how way 4 works. I read this (and other chapters of that book), but maybe you can give me a hint or an example. I would appreciate that.
Method 3 or 4 should work fine. There’s no difference between them with respect to latency. Latency would be on the order of 100 ns.
Method 4 would be needed if you need to initialize the device, or control which applications are allowed to access it, or enforce one reader at a time, etc. Method 3 does seem like a bit of a hack because it skips all of this. But it is simpler if you don’t need such things.
A character device is definitely higher latency, because it requires a kernel transition each time the device is read.
The latency of a DMA method depends entirely on how frequently the device writes the time to memory. It is lower latency for the CPU to access memory than MMIO, but if the device only does DMA once a millisecond, then that would be your latency. Also, that method generates a lot of useless DMA traffic, since the CPU would read the value far less often than it is written.
Adding to @prl's answer...
Method 3 seems perfectly legit to me. That's what it's for. You may want to take a look at the kernel documentation file: https://www.kernel.org/doc/Documentation/filesystems/sysfs-pci.txt
You can also use the /sys filesystem to find your device. First, note the vendor ID and device ID for your clock card (and subsystem vendor / device if necessary), then you can easily walk the /sys/devices hierarchy, looking for a matching device (using the vendor, device, etc. special files). Once you've found it, you presumably know which resourceN file to open from the device's data sheet, then mmap it at the appropriate offset and you're done.
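As a rough sketch of that search, the following Python walks /sys/bus/pci/devices (a flat directory of links into the /sys/devices hierarchy) and compares each device's vendor and device files against the IDs you are looking for. The function names and the example IDs are made up; the root parameter exists only so the code can be tried against a fake directory tree.

```python
import os


def read_hex(path):
    """Read a sysfs attribute such as 'vendor', which holds text like '0x8086'."""
    with open(path) as f:
        return int(f.read().strip(), 16)


def find_pci_devices(vendor_id, device_id, root="/sys/bus/pci/devices"):
    """Return the sysfs directories of all PCI devices matching the two IDs."""
    matches = []
    if not os.path.isdir(root):
        return matches
    for name in sorted(os.listdir(root)):
        dev_dir = os.path.join(root, name)
        try:
            vendor = read_hex(os.path.join(dev_dir, "vendor"))
            device = read_hex(os.path.join(dev_dir, "device"))
        except (OSError, ValueError):
            continue  # skip entries that are missing or unreadable
        if vendor == vendor_id and device == device_id:
            matches.append(dev_dir)
    return matches
```

For a card with (hypothetical) IDs 0x1234:0xabcd, find_pci_devices(0x1234, 0xabcd) would return something like a list containing one path ending in the device's bus address, and the resourceN file lives inside that directory.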
That all assumes that your device is configured and enabled already. Typically a PCI device is not enabled to do anything when the system boots. Some driver needs to claim the device, and initialize / configure it. Once that is done, if the time is accessible just by reading a register or two, you can go with method 3. (I'm not sure: it may be possible for a PCI device to be self-initializing but I've never seen one. I think probably something needs to enable its memory space at the very least. Likely that could be done from user-space if the setup is small enough / simple enough.)
The primary difference with method 4 is that the driver controlling the device would provide support for allowing the area to be mmap'd explicitly. For the user-space application, there is little difference between the two methods aside from the device name used. For method 4, the driver's probably going to provide a symbolic device name /dev/clock0 or something like that for use by the user-space application (and presumably the application then doesn't need to go find the device, it would just know the device file name to open).
From user-space, you will do the mmap operation in much the same way with either method. In method 4, the driver internally supplies the physical address to map -- and possibly the offset -- instead of the generic PCI subsystem doing so, but either way, it's just open + mmap.
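The "open + mmap" step might look roughly like this. It is a sketch in Python purely to show the shape of the calls; for the actual low-latency read loop you would do the same thing in C and read through a volatile pointer, because Python gives no guarantee about the width of the access it performs on the mapping. The function names, the register offset, and the little-endian layout are assumptions, not something taken from the clock card's data sheet.

```python
import mmap
import os
import struct


def map_bar(resource_path):
    """Map a whole resourceN (or driver-provided /dev) file read-only.

    The mapping is set up once; after that, reads need no system call.
    """
    fd = os.open(resource_path, os.O_RDONLY | os.O_SYNC)
    try:
        # length 0 means "map the whole file"
        return mmap.mmap(fd, 0, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed


def read_register32(bar, offset):
    """Read one 32-bit value at a byte offset (assumed little-endian)."""
    return struct.unpack_from("<I", bar, offset)[0]
```

With method 3 the path would be the resourceN file under the device's sysfs directory; with method 4 it would be whatever device node the driver creates. The calling code is otherwise the same, which is the point made above.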
Linux driver programming is not terribly difficult, but there's a significant learning curve there if you haven't done it before, so I definitely wouldn't go with method 4 unless there were a real need to do so.

How u-boot start instruction is found by ROM Code

I am trying to understand ARM Linux Boot Process.
These are the things I understood:
When reset button is pressed in any processor, it jumps to the reset vector or address, in case of ARM it is either 0x00 or 0xFFFF0000.
This location contains the start up code or ROM Code or Boot ROM Code
My query is how this Boot ROM Code gets the address of u-boot first instruction ?
It depends on the SoC, and the scheme used for booting will differ from one SoC to another. It is usually documented in the SoC's reference manual, which describes the various conventions (where to read u-boot from, specific addresses) that the u-boot port for this SoC should follow in order for the code in ROM to be able to load u-boot and ultimately transfer control to it.
This code in the ROM could do something like:
- If pin x is 0, read 64KiB from the first sector of the eMMC into the On Chip Static RAM (OCRAM), then transfer control to the code located at offset 256 of the OCRAM, for example.
- If pin x is 1, configure the UART for 19200 baud, 8 data bits, no parity, 1 stop bit, attempt to read 64KiB from the serial port using the X-MODEM protocol into the OCRAM, then transfer control to the code located at offset 256 of the OCRAM.
The code loaded this way, which is often named the Secondary Program Loader (SPL), would then be responsible for, say, configuring the SDRAM controller, then reading the non-SPL part of u-boot into the beginning of the SDRAM, then jumping to a specific address in the SDRAM. The SPL for a given SoC should be small enough to fit in the SoC On Chip RAM. The ROM code would be the Primary Boot Loader in this scenario.
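The decision the ROM code makes in this made-up scenario can be modelled in a few lines. This is a simulation only: boot_rom, read_emmc and read_uart_xmodem are hypothetical names, the two readers stand in for real eMMC and UART/X-MODEM drivers, and the 64KiB size and offset 256 are the example values from above rather than those of any particular SoC.

```python
OCRAM_SIZE = 64 * 1024  # example size from the scenario above
ENTRY_OFFSET = 256      # example entry point from the scenario above


def boot_rom(pin_x, read_emmc, read_uart_xmodem):
    """Pick a boot source from the strap pin, load the SPL, return where to jump.

    read_emmc and read_uart_xmodem are callables taking a byte count and
    returning the bytes read; they are placeholders for real drivers.
    """
    if pin_x == 0:
        image = read_emmc(OCRAM_SIZE)
    else:
        image = read_uart_xmodem(OCRAM_SIZE)
    if len(image) > OCRAM_SIZE:
        raise ValueError("SPL does not fit in on-chip RAM")
    ocram = bytearray(OCRAM_SIZE)
    ocram[:len(image)] = image
    # A real ROM would now branch to OCRAM base + ENTRY_OFFSET.
    return ocram, ENTRY_OFFSET
```

The important part is that the ROM knows nothing about u-boot itself: it only follows a documented convention (source, size, entry offset) that the SPL build has to match.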
In the case of the TI AM335x Cortex-A8 SoCs for example, section 26.1.6 of the Technical Reference Manual, and more specifically figure 26-10, explains the boot process. Some input pins may be used by the ROM code to direct the boot process - see SYSBOOT Configuration Pins in table 26-7. See The AM335x U-Boot User's Guide for more u-boot specific, AM335x-related information.
ARM doesn't make chips; it makes IP that chip vendors purchase. It is one block in their chip design; usually they have many other blocks: usb controllers (likely purchased ip), pcie controller (likely purchased ip), ddr, ethernet, sata, emmc/sd, etc. Plus glue logic, plus whatever secret sauce they add to make it different and more interesting to the consumer than the competition.
The address space, particularly for the full sized arms, is fairly wide open, so even if they use the same usb ip as everyone else, that doesn't mean it is at the same address as everyone else.
There is no reason to assume that all chips with a cortex-a are centered around the cortex-a; the cortex-a may just be there to boot and manage the real thing the chip was made for. The chips you are asking about are most likely centered around the ARM processor: the purpose of the chip is to make a "CPU" that is ARM based. What we have seen in that market is a need to support various non-volatile storage solutions. Some may wish to be ram heavy and don't care that a slow spi flash is used to get the kernel and root file system over, and at runtime everything is ram based including the file system. Some may wish to support traditional hard drives as well as ram, the file system being on something sata for example, spinning media or ssd. Some may wish to use eMMC, SD, etc. With the very high cost of chip production, it does not make sense to make one chip for each combination; rather, make one chip that supports many combinations. You use several "strap" pins (which are not pins but balls/pads on the BGA) that the customer ties to ground or "high" (whatever the definition of that voltage is) so that when the chip comes out of reset (whichever of the reset pins for that product are documented as sampling the strap pins) those strap pins tell the "processor" (chip as an entity) how you desire it to boot: I want you to first look for an sd card on this spi bus, if nothing is there then look for a sata drive on this interface, and if nothing is there then please fall into the xmodem bootloader on uart0.
This leads into Frant's excellent answer. What IP is in the chip, what possible non-volatile storage is supported and what possible solutions of loading a bootloader if the "chip" itself supports it are very much chip specific not just broadcom does it this way and ti does it another but a specific chip or family of chips within their possibly vast array of products, no reason to assume any two products from a vendor work the same way, you read the documentation for each one you are interested in. Certainly dont assume any two vendors details are remotely the same, it is very likely that they have purchased similar ip for certain technologies (maybe everyone uses the same usb ip for example, or most usb ip conforms to a common set of registers, or maybe not...).
I have not gotten to the arm core, you could in these designs likely change your mind and pull the arm out and put a mips in and sell that as a product...
Now does it make sense to, say, write logic to read a spi flash that loads the contents of that flash into an internal sram, and then for that boot mode put that sram at the arm processor's address zero, then reset the arm? Yes, it is not a horrible idea to do that in logic only. But does it make sense, for example, to have logic dig through a file system of a sata drive to find some bootloader? Maybe not so much. Possible, sure, but maybe your product will be viable longer if you instead put a rom in the product that can be aimed at the arm address zero. The arm boots that; the arm code in that rom reads the straps, makes the decision as to what the boot media is, spins up that peripheral (sata, emmc, spi, etc), wades through the filesystem looking for a filename, copies that file to sram, re-maps the arm address space (not using an mmu but using the logic in the chip) and fakes a reset by branching to address zero. (The rom is mapped in at least two places, address zero and some other address, so that it can branch to the other address space, allowing address zero to be remapped and reused.) That way, if down the road you find a bug, all you have to do is change the image burned into the rom before you deliver the chips, rather than spin the chip to change the transistors and/or wiring of the transistors (pennies and days/weeks vs millions of dollars and months). So you may actually never see or be the code that the arm processor boots into on reset. The reset line to the arm core is something you might never have any physical or software access to.
THEN depending on the myriad of boot options for this or any of the many chip offerings, the next step is very much specific to that chip and possibly that boot mode. You, as owner of all the boot code for that board level product, may have to, per the chip and board design, bring up ddr, bring up pcie, bring up usb. Or perhaps some chip vendor logic/code has done some of that for you (unlikely, but maybe for specific boot cases). Now you have these generic and popular "boot loaders" like u-boot. You as the software designer and implementer may choose to have code that precedes u-boot and does a fair amount of the work, because maybe porting u-boot is a PITA, maybe not. Also note u-boot is in no way required for linux. Booting linux is easy; u-boot is a monstrosity, a beast of its own, and the easiest way to boot linux is to not bother porting u-boot. What u-boot gives you is an already written bootloader. It is an argument that can go either way: is it cheaper to port u-boot or is it cheaper to just roll your own (or port one of the competitors to u-boot)? It depends on the boot options you want. If you want bootp/tftp or anything with a network stack, well, that's a task, although there are off the shelf solutions. If you want to access a file system on some media, well, that is another strong argument to just use u-boot. But if you don't need all of that, then maybe you don't need u-boot.
You have to walk the list of things that need to happen before linux boots, the chips tend to not have enough on chip ram to hold the linux kernel image and the root file system, so you need to get ddr up before linux. you probably need to get pcie up and enumerated and maybe usb I have not looked at that. but ethernet that can be brought up by the linux driver for that peripheral as an example.
The requirements to "boot" linux on arm ports of linux, and probably others, are relatively simple. You copy the linux kernel to some space in memory, ideally aligned or at an agreed offset from an aligned address (say 0x10001000 for example, just pulling that out of the air). You then provide a table of information: how much ram there is, the ascii kernel boot string, and these days the device tree information. You branch to the linux kernel with one of the registers (r2 in the ARM Linux boot convention) pointed at this table (google ATAG arm linux or some such combination of words). That's it: booting linux using a not that old kernel is setting a few dozen bytes in ram, copying the kernel to ram, and branching to it. A few dozen lines of code, no need for the u-boot monstrosity. Now it is more than a few dozen bytes, but it is still a table generated independent of u-boot: place that in ram, place the kernel in ram, set one or more registers to point at the table, branch to the address where the kernel lives, and "booting linux" is complete, or the linux bootloader is complete.
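To give a feel for how small that table is in the older ATAG scheme, here is a sketch that builds one as raw bytes. The tag IDs and field order follow the kernel's ATAG definitions as I understand them (ATAG_CORE, ATAG_MEM, ATAG_CMDLINE, terminated by ATAG_NONE); the helper names, the 4096-byte page size, and the little-endian packing are assumptions for the example. A real bootloader would write these words directly into RAM and, per the kernel's ARM booting documentation, pass the table's address in r2 (with r0 = 0 and the machine type in r1); newer kernels take a device tree blob instead.

```python
import struct

ATAG_NONE = 0x00000000
ATAG_CORE = 0x54410001
ATAG_MEM = 0x54410002
ATAG_CMDLINE = 0x54410009


def make_tag(tag_id, payload):
    """One tag: header (size in 32-bit words, tag id) followed by the payload."""
    if len(payload) % 4:
        raise ValueError("payload must be a whole number of 32-bit words")
    size_words = 2 + len(payload) // 4  # the header itself counts as 2 words
    return struct.pack("<II", size_words, tag_id) + payload


def build_atags(ram_start, ram_size, cmdline):
    """Return the bytes of a minimal ATAG list (little-endian assumed)."""
    core = make_tag(ATAG_CORE, struct.pack("<III", 0, 4096, 0))  # flags, pagesize, rootdev
    mem = make_tag(ATAG_MEM, struct.pack("<II", ram_size, ram_start))
    text = cmdline.encode("ascii") + b"\x00"   # NUL-terminated string
    text += b"\x00" * (-len(text) % 4)         # pad to a word boundary
    cmd = make_tag(ATAG_CMDLINE, text)
    end = struct.pack("<II", 0, ATAG_NONE)     # terminator carries size 0
    return core + mem + cmd + end
```

The whole list for a short command line comes to well under a hundred bytes, which is the point being made above about how little a Linux "bootloader" strictly has to do.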
you still have to port linux which is a task that requires lots of trial and error and eventually years of experience. particularly since linux is a constantly evolving beast in and of itself.
How do you get to u-boot code? you may have some pre-u-boot code you have to write to find u-boot and copy it to ram then branch to it. the chip vendor may have solved this for you and "all you have to do" is put u-boot where they tell you for the media choice, and then u-boot is placed at address zero in the arm memory space for some internal sram, or u-boot is placed at some non-zero address in the arm memory space and some magic (a rom based bootloader in the chip) causes your u-boot code to execute from that address.
One I messed with recently is the ti chip used on various beagle boards, black, green, white, pocket, etc...One of the boot modes it looks at a specific offset on the sd card (not part of a file system yet, a specific logical block if you will or basically specific offset in the sd card address space) for a table, that table includes where in the "processors" address space you want the attached "bootloader" to be copied to, is it compressed, etc. you make your bootloader (roll your own or build a u-boot port) you build the correct table per the documentation with a destination address how much data, possibly a crc/checksum, whatever the docs say. the "chip" magically (probably software but might be pure logic) copies that over and causes the arm to start executing at that address (likely arm software that simply branches there). And that is how YOU get u-boot running on that product line with that boot option.
The SAME product line has other strap options, has other sd-card boot options to get a u-boot loaded and running.
Other products from other vendors have different solutions.
The broadcom chip in the raspberry pi, totally different beast, or at least how it is used. It has a broadcom (invented or purchased) gpu in it, that gpu boots some rom based code that knows how to find its own first stage bootloader on an sd card, that first stage loader does things like initialize DDR, there isnt pcie so that doesnt have to happen and I dont think the gpu cares about usb so that doesnt have to get enumerated either. but it does search out a second stage bootloader of gpu code, which is really an RTOS that it is loading, the code the GPU uses to do its graphics features to offload the burden on the ARM. In addition to that that software also looks for a third file on the flash (and fourth and nth) lets just go with third kernel.img which it copies to ram (the ddr is shared between the gpu and the arm but with different addressing schemes) at an agreed offset (0x8000 if kernel.img is used without config.txt adjustments to that) the gpu then writes a bootstrap program and ATAGs into arms memory at address zero and then releases reset on the ARM core(s). The GPU is the bootloader, with relatively limited options, but for that platform design/solution one media option, a removable sd card, what operating system, etc you run on the arm is whatever is on that sd card.
I think you will find lots of straps driving multiple possible non-volatile storage media peripherals to be the more common solution. Whether one or any of these boot options for a particular SOC can take u-boot (or choose your bootloader or write your own) directly, or if a pre-u-boot program is required for any number of reasons (on chip sram is too small for a full u-boot, let's say, for the sake of argument), is specific to that boot option for that chip from that vendor and is documented somewhere, although if you are not part of the company making the board that signed the NDA with that chip vendor you may not get to see that documentation. And as you may know or will learn, that doesn't mean the documentation is good or makes sense. Some companies or products do a bad job, some do a good job and most are somewhere in between. If you are paying them for parts and have an NDA you at least have the option of getting or buying tech support and can ask direct questions (again the answers may not be useful, depending on the company and their support).
Just because there is an ARM inside means next to nothing. ARM makes processor cores, not chips. Depending on the SOC design it may be easy, or moderately painful but possible, to pull the arm out and put some other purchased ip (like MIPS) or free ip (like risc-v) in there and re-use the rest of your already tested design/glue. Trying to generalize ARM based processors is like trying to generalize automobiles with V-6 engines: if I have a vehicle with a V-6 engine, does that mean the headlight controls are always on the dash to the left of the steering column? Nope!

For emulating the Gameboy, why does it matter that the memory is broken up into different areas?

So I'm writing a gameboy emulator, and I'm not 100% sure why other projects took the time to break up the memory into proper categories. I don't know if there is a major technical dilemma I'm missing (maybe handling illegal parameters in instructions?), but it seems like the only thing that matters is that the address given by a write instruction is retrievable by the proper read instruction. So for a sub question, if I'm working under the assumption that the assembly is perfectly legal (meaning nothing is trying to read/write where it can't), can I just make a big array and read and write to it?
Note that this is a conceptual question and that I am aware a big array would be a memory hog, I'm not necessarily looking for the best way to do it, simply trying to learn how it works and why other emulator developers did it the way they did.
You are going to have read only memory, read/write memory and memory mapped I/O (peripherals etc). So you need to decode the address to some extent to break it into the major categories, then for the peripherals you have to emulate all of those so you have to get very detailed in your address decoding.
For the peripherals you will need to detect a read/write to some address which you cannot do by simply landing the writes in an array (two writes of the same value for example make a difference, you cant just scan some array to look for changes you have to trigger on reads and writes and perform the hardware action).
If you wish to be cycle accurate you will also need to know the timings for the rams and roms in order to mimic those, depending on how many banks of each or if timing is dependent on that you will need to decode the address further.
Hardware decodes these addresses to the same level so if you are emulating hardware then you need to...emulate hardware...and do the same amount of address decoding.
I'm going to be gameboy specific here. Look at gameboy's address space map. The address space itself is divided, it's not that emulators do it. Hardware itself operates that way.
Here's some of the regions that can't be implemented as just an array:
0x0000-0x3FFF. First bank of a ROM. It's read-only but not quite. Read the next one
0x4000-0x7FFF. Switchable ROM bank, it's also not quite read-only. Cartridges that don't fit into gameboy's address space contain a memory bank controller. The game code will write to some specific read-only ROM regions to actually select which ROM bank is mapped into the 0x4000-0x7FFF address range. So you have to detect these writes and then redirect reads into the selected ROM bank.
0xA000-0xBFFF. Switchable RAM bank. Same thing as with switchable ROM banks but now for RAM. Cartridges may contain additional RAM that's being mapped into gameboy's address space. Which bank of the RAM is mapped is controlled, again, by writes to specific read-only regions.
0xFF00-0xFF4B. IO ports. Here you have hardware registers mapped into address space. Gameboy has several hardware components, each with its own registers and even memory (display controller, sound processor, timers etc). To control that hardware, the game code reads and writes the IO ports. You obviously have to detect these accesses so you can emulate the hardware they correspond to. It's not just the CPU and memory you have to emulate. I would even say that's the smallest and easiest part of it. For example, it is much harder to get the display controller and sound channels right. They have complicated logic, bugs and very tricky behaviour that's not documented very well but is crucial for accurate emulation. The wave channel in particular gave me a hard time.
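A stripped-down sketch of what that address decoding can look like in an emulator follows. It is deliberately incomplete: it assumes an MBC1-style cartridge where writes to 0x2000-0x3FFF select the ROM bank, ignores RAM banking and every other controller register, and treats the rest of the address space as one flat array. The GameBoyBus class and the io_write_hook callback are names invented for this example.

```python
ROM_BANK_SIZE = 0x4000


class GameBoyBus:
    """Minimal address decoder: fixed ROM, switchable ROM, IO hook, flat RAM."""

    def __init__(self, rom, io_write_hook=None):
        self.rom = bytes(rom)
        self.num_banks = max(1, len(self.rom) // ROM_BANK_SIZE)
        self.rom_bank = 1                    # bank visible at 0x4000-0x7FFF
        self.memory = bytearray(0x10000)     # simplification: everything else
        self.io_write_hook = io_write_hook   # called on every IO port write

    def read(self, addr):
        if addr < 0x4000:                    # fixed first ROM bank
            return self.rom[addr]
        if addr < 0x8000:                    # switchable ROM bank
            return self.rom[self.rom_bank * ROM_BANK_SIZE + (addr - 0x4000)]
        return self.memory[addr]

    def write(self, addr, value):
        value &= 0xFF
        if 0x2000 <= addr < 0x4000:          # bank select (MBC1-style assumption)
            bank = (value & 0x1F) or 1       # bank 0 is treated as bank 1
            self.rom_bank = bank % self.num_banks
        elif addr < 0x8000:
            pass                             # ROM itself never changes
        elif 0xFF00 <= addr <= 0xFF4B:       # IO ports: store and notify
            self.memory[addr] = value
            if self.io_write_hook:
                self.io_write_hook(addr, value)
        else:
            self.memory[addr] = value
```

Note that writing the same value to an IO port twice triggers the hook twice, which is exactly what a plain array cannot express, and that a write into the ROM area changes what later reads return without changing the ROM data.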

How to work with physical addresses as a network and how does DMA connect to it?

I am working on a network driver and am somewhat confused with the memory management.
On the TX path, I receive an skb. As the lower layer expects to get only physical addresses, I think I need to call *virt_to_phys* and send the return value to the lower layer.
(Does it make sense?)
Now, I know there are the functions *dma_map_single* and *dma_unmap_single*. I am still not sure how they come to the picture here. So the lower layer wants to work with DMA... Does it mean that I need to run the above commands (in the appropriate time) before dispatching the packet to the lower layer?
I'm also not sure I understand the meaning of the description of dma_map_single:
Ensure that any data held in the cache is appropriately discarded or written back.
Would appreciate your help.
The files DMA-API.txt and DMA-API-HOWTO.txt in the Documentation directory of the Linux sources document how to use these functions.
You can't use virt_to_phys() and the like to get DMA addresses. That worked a long time ago in simpler times. Linux supports a wide range of hardware architectures and buses and for many of those, the address space the devices see does not map 1:1 to the physical address space of the CPU. Then there are also IOMMUs that can change that mapping dynamically. All of this necessitates the use of the DMA API.
Physical addresses are not necessarily the same as I/O bus addresses, so you must always use the dma_map_ functions.
On many systems, DMA accesses go only to main memory, which would break if there were a copy of that memory's contents in the CPU cache.
On these systems, the dma_ functions take the appropriate architecture-dependent action to ensure that such conflicts do not happen.
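To illustrate why a CPU physical address is not necessarily what the device should be given, here is a toy model of an IOMMU-style translation. It is not the kernel DMA API: ToyIommu and its methods are invented for this sketch and only mimic the idea that a mapping call hands back a bus address, which the device must use and which stops being valid after the unmap.

```python
class ToyIommu:
    """Hands out device-visible bus addresses unrelated to CPU physical ones."""

    PAGE = 0x1000

    def __init__(self, bus_base=0x80000000):
        self.next_bus_addr = bus_base
        self.table = {}  # bus address -> (physical address, length)

    def map_single(self, phys_addr, length):
        """Create a mapping and return the address the device must use."""
        bus_addr = self.next_bus_addr
        self.table[bus_addr] = (phys_addr, length)
        # Advance by whole pages so mappings never overlap.
        pages = (length + self.PAGE - 1) // self.PAGE
        self.next_bus_addr += pages * self.PAGE
        return bus_addr

    def unmap_single(self, bus_addr):
        del self.table[bus_addr]

    def device_access(self, bus_addr):
        """Translate a device access; unmapped addresses are rejected."""
        for base, (phys, length) in self.table.items():
            if base <= bus_addr < base + length:
                return phys + (bus_addr - base)
        raise PermissionError("device touched an unmapped bus address")
```

In this model, handing the device the physical address directly (the virt_to_phys approach) fails, while the address returned by the mapping call works until it is unmapped. The real dma_map_single() additionally takes care of the cache maintenance described above.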

Userspace vs kernel space driver

I am looking to write a PWM driver. I know that there are two ways we can control a hardware driver:
User space driver.
Kernel space driver
If in general (do not consider a PWM driver case) we have to make a decision whether to go for user space or kernel space driver. Then what factors we have to take into consideration apart from these?
User space driver can directly mmap() /dev/mem memory to their virtual address space and need no context switching.
Userspace driver cannot have interrupt handlers implemented (They have to poll for interrupt).
Userspace driver cannot perform DMA (As DMA capable memory can be allocated from kernel space).
Of the three factors that you have listed, only the first one is actually correct. As for the rest, not really. It is possible for user space code to perform DMA operations; no problem with that. There are many hardware appliance companies who employ this technique in their products. It is also possible to have an interrupt driven user-space application, even when all of the I/O is done with a full kernel-bypass. Of course, it is not as easy as simply doing an mmap() on /dev/mem.
You would have to have a minimal portion of your driver in the kernel — that is needed in order to provide your user space with a bare minimum that it needs from the kernel (because if you think about it — /dev/mem is also backed up by a character device driver).
For DMA, it is actually too darn easy — all you have to do is to handle mmap request and map a DMA buffer into the user space. For interrupts — it is a little bit more tricky, the interrupt must be handled by the kernel no matter what, however, the kernel may not do any work and just wake up the process that calls, say, epoll_wait(). Another approach is to deliver a signal to the process as done by DOSEMU, but that is very slow and is not recommended.
As for your actual question, one factor that you should take into consideration is resource sharing. As long as you don't have to share a device across multiple applications and there is nothing that you cannot do in user space, go for user space. You will probably save tons of time during the development cycle, as writing user space code is extremely easy. When, however, two or more applications need to share the device (or its resources), chances are that you will spend a tremendous amount of time making it possible: just imagine multiple processes forking, crashing, mapping (the same?) memory concurrently etc. And after all, IPC is generally done through the kernel, so if applications need to start "talking" to each other, the performance might degrade greatly. This is still done in real life for certain performance-critical applications, though, but I don't want to go into those details.
Another factor is the kernel infrastructure. Let's say you want to write a network device driver. It's not a problem to do that in user space. However, if you do, then you'd need to write a full network stack too, as it won't be possible to use Linux's default one that lives in the kernel.
I'd say go for user space if it is possible and the amount of effort to make things work is less than writing a kernel driver, and keeping in mind that one day it might be necessary to move code into the kernel. In fact, this is a common practice to have the same code being compiled for both user space and kernel space depending on whether some macro is defined or not, because testing in user space is a lot more pleasant.
Another consideration: it is far easier to debug user-space drivers. You can use gdb, valgrind, etc. Heck, you don't even have to write your driver in C.
There's a third option beyond just user space or kernel space drivers: some of both. You can do just the kernel-space-only stuff in a kernel driver and do everything else in user space. You might not even have to write the kernel space driver if you use the Linux UIO driver framework (see https://www.kernel.org/doc/html/latest/driver-api/uio-howto.html).
I've had luck writing a DMA-capable driver almost completely in user space. UIO provides the infrastructure so you can just read/select/epoll on a file to wait on an interrupt.
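Waiting on a UIO interrupt from user space can be as small as the sketch below. As I understand the UIO documentation, a blocking read of exactly 4 bytes from the device file returns the total interrupt count as a 32-bit integer. The function name and the /dev/uio0 path in the usage note are placeholders, and some UIO drivers also expect a 4-byte write to re-enable the interrupt, which is left out here.

```python
import os
import struct


def wait_for_interrupt(fd):
    """Block until the next interrupt; return the total interrupt count."""
    data = os.read(fd, 4)  # UIO expects reads of exactly 4 bytes
    if len(data) != 4:
        raise OSError("short read from UIO device")
    return struct.unpack("=i", data)[0]  # native-endian signed 32-bit count
```

Typical use would be to open the device once with os.open("/dev/uio0", os.O_RDWR), then call wait_for_interrupt(fd) in a loop and handle the device through its mmap'd registers each time it returns. Because it is an ordinary file descriptor, it can also be handed to select or epoll instead of blocking in read.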
You should be cognizant of the security implications of programming the DMA descriptors from user space: unless you have some protection in the device itself or an IOMMU, the user space driver can cause the device to read from or write to any address in physical memory.

Resources