What does "non-idempotent memory-mapped I/O" mean?

Page 75 of the RISC-V privileged spec mentions the term "non-idempotent memory-mapped I/O". What is non-idempotent memory-mapped I/O? Is it about side effects? What are the design concerns around non-idempotent memory-mapped I/O?

It means writing the same value twice is not the same thing as writing it just once.
For example, an MMIO data register where each write transaction triggers a UART to send the bits out over a serial port. This is unlike a control register where writing the value that's already there has no effect, or a parallel port whose external pins simply reflect the bits in the MMIO register, so writing them again changes nothing. Reads can be non-idempotent too: reading a UART's receive-data register pops a byte from its FIFO, so reading twice is not the same as reading once.
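As a minimal C sketch of the difference (the register addresses and names here are invented for illustration, not taken from any real device):
#include <stdint.h>

/* Hypothetical MMIO registers -- real addresses depend on the SoC. */
#define UART_TX   ((volatile uint8_t  *)0x10000000) /* non-idempotent: each write transmits a byte */
#define UART_BAUD ((volatile uint32_t *)0x10000004) /* idempotent: only the stored value matters */

void demo(void)
{
    *UART_BAUD = 115200;  /* writing 115200 again would change nothing */
    *UART_TX = 'A';       /* sends one 'A' over the serial line */
    *UART_TX = 'A';       /* sends a second 'A' -- writing twice is not the same as writing once */
}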

Related

What are the benefits and micro-ops of the ENQCMD instruction?

ENQCMD and MOVDIR64B are two instructions in Intel DSA.
MOVDIR64B reads 64-bytes from the source memory address and performs a 64-byte direct-store operation to the destination address. The ENQCMD instruction allows software to write commands to enqueue registers, which are special device registers accessed using memory-mapped I/O (MMIO).
My question is - what is the aim of designing those two instructions?
Based on my understanding, setting up the memory-mapped I/O area (the registers) requires OS support, i.e. a device driver. After setting up the MMIO area, we can access it using the write() system call, which is also implemented in the device driver. For generic architectures, Linux provides iowrite64() to write 8 bytes at a time. Hence, if we want to write 64 bytes, we need to call iowrite64() 8 times.
With the help of MOVDIR64B, a new API was created for Intel DSA, __iowrite512(), which writes 64 bytes atomically.
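To make the comparison concrete, the two submission paths would look roughly like this (a Linux-kernel-style sketch; portal is assumed to be an ioremap()'d pointer to the device's MMIO portal, desc a 64-byte-aligned descriptor in ordinary memory, and __iowrite512() the 64-byte helper mentioned above, whose exact signature I'm assuming here):
#include <linux/io.h>

/* Option 1: eight 8-byte UC writes, each one serialized. */
static void submit_with_iowrite64(void __iomem *portal, const u64 *desc)
{
    int i;

    for (i = 0; i < 8; i++)
        iowrite64(desc[i], portal + i * 8);
}

/* Option 2: a single 64-byte direct store (MOVDIR64B under the hood). */
static void submit_with_iowrite512(void __iomem *portal, const void *desc)
{
    __iowrite512(portal, desc);
}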
I agree that the latter is at least more efficient than the former, but I am confused about the time it takes to transfer the data.
Consider the following case: given a device (Intel DSA) that supports MOVDIR64B and ENQCMD, suppose we want to transfer 64 bytes of data from memory to an MMIO register. There are two options: call iowrite64() 8 times (in a loop), or call __iowrite512() once. Will the latter be 8 times faster than the former?
My thought is that the difference is unlikely to be a full 8x, but the latter should still be faster. How much faster would it be, and is it documented anywhere? I do not have Intel DSA hardware, so I am not sure how to test it.
Besides this, what other benefits does ENQCMD have? Will it be broken up into several micro-operations, and if so, which ones?
iowrite64 uses a UC access to MMIO space, so writes are serialized, not pipelined. That is, only one UC write can be in flight at a time from a single CPU thread, and the CPU doesn't continue execution until the MMIO write is complete.
MOVDIR64B has the potential to be faster than even a single iowrite64, because it uses the WC memory type instead of UC (even if the destination address is UC). After the write is issued by the CPU, it can continue execution. Multiple direct stores can be streamed to the device. That means that multiple direct stores can be in flight at one time from a single CPU thread. MOVDIRI also behaves this way.
As far as I know, the time to actually transfer the data to the destination is the same regardless of the size (between 1 and 64 bytes). Of course that is dependent on the width of the data path within the SoC, which could be different for different implementations.
The main advantage of MOVDIR64B is that the descriptor arrives at the device all at once instead of in pieces. The device doesn't have to worry about receiving a partial descriptor or receiving parts of two descriptors interleaved. In fact, Intel DSA ignores writes smaller than 64 bytes to a portal.
To realize the full benefit of streaming writes, the destination address for each MOVDIR64B from a single CPU thread should be different. Each Intel DSA portal is a 4096-byte page, so there are 64 unique addresses within each portal. Descriptor writes from a single CPU can be striped across the 64 addresses. (It doesn't matter whether writes from multiple CPUs use the same address or different addresses, but normally you would not expect multiple CPUs to be using the same dedicated WQ in DSA.)
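As a sketch of that striping (using the _movdir64b() intrinsic from immintrin.h; the portal pointer, slot counter and stride handling are illustrative only, not taken from the DSA driver):
#include <immintrin.h>
#include <stdint.h>

/* Stripe successive 64-byte descriptor writes across the 64 distinct
 * addresses within a 4096-byte portal so they can stream independently. */
static unsigned int next_slot;

static void submit_descriptor(void *portal_base, const void *desc /* 64 bytes, 64-byte aligned */)
{
    void *dst = (uint8_t *)portal_base + (next_slot % 64) * 64;

    _movdir64b(dst, desc); /* one 64-byte direct store, streamed like a WC write */
    next_slot++;
}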
ENQCMD allows the device to respond to software whether it accepted the descriptor or not. This allows multiple applications to use the same shared WQ without risk of a descriptor being lost because the shared WQ is full. Applications can submit descriptors without any driver involvement (after setup), and without any lock or communication between the applications.
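A submission loop built on that property might look like this (a sketch assuming the _enqcmd() intrinsic from immintrin.h; as far as I understand it returns zero when the descriptor was accepted and nonzero when the shared WQ rejected it, but check your compiler's documentation):
#include <immintrin.h>

static int submit_to_shared_wq(void *portal, const void *desc)
{
    int retries = 1000; /* arbitrary retry budget for illustration */

    while (retries--) {
        if (_enqcmd(portal, desc) == 0) /* assumed: zero means the descriptor was accepted */
            return 0;
        _mm_pause(); /* queue full: back off briefly and retry */
    }
    return -1;
}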

How to generate a zero-length read on the PCIe bus using x86-64 and Linux?

In the PCI Express Base specification, section 2.2.5 "First/Last DW Byte Enables Rules", it says a zero-length read can be used as a flush request. However, in the Linux kernel documentation, most examples just use either a 1B or 4B read request:
Bus-Independent Device Accesses
How To Write Linux PCI Drivers
I'm wondering whether the x86-64 architecture is capable of generating an instruction that causes a zero-length read on PCIe, and if it is, whether there is some Linux kernel function that emits that instruction.
The two examples you mentioned involve MMIO accesses or legacy I/O port accesses from the CPU to an I/O device, but the zero-length read implementation note in Section 2.2.5 of the PCIe specification is about accesses initiated by an I/O device. The PCIe spec and the Intel/AMD x86-64 manuals describe different things and use different terms, so I don't understand how the two got confused. No, there is no such thing as a zero-length read in x86.
The code from the first link is the following:
WRT_REG_WORD(&reg->ictrl, 0);
/*
 * The following read will ensure that the above write
 * has been received by the device before we return from this
 * function.
 */
RD_REG_WORD(&reg->ictrl);
There is a 16-bit MMIO write followed by a 16-bit MMIO read to the same address. The memory type of the target location is most probably UC, which ensures that all UC accesses appear on the system bus in program order. This means they reach the PCIe root complex (which is integrated on modern processors) in order. The MMIO write is translated by the processor's I/O unit into a posted write PCIe transaction and the read into a non-posted read PCIe transaction, both using traffic class 0 and with relaxed ordering disabled. According to the transaction ordering rules, such a non-posted read cannot be reordered with any earlier posted write. The overall effect is that by the time the UC read returns its result, the preceding UC write must have already completed at the target I/O device.
The second link you provided also includes an example of MMIO ordering that works exactly the same way. Issuing a read after a posted write is a commonly used technique to determine when the write has completed. A UC read is not a fully serializing operation in x86: if you don't want any later instructions (ones that are not themselves UC accesses) to execute until the read completes, you need to add a fully serializing instruction after the read. The Linux kernel itself defines numerous MMIO barriers for use in different situations.
The second link also mentions that a legacy I/O write doesn't require a following read because "I/O Port space guarantees write transactions reach the PCI device before the CPU can continue." I/O instructions provide more ordering guarantees than UC accesses, but they are still not fully serializing. These guarantees include waiting for previous instructions to commit before executing an I/O instruction and not allowing later instructions to execute until the I/O instruction completes. Combined with the fact that I/O instructions are translated by the I/O controller into PCIe I/O transactions, where an I/O write transaction is non-posted, this ensures that when the next instruction executes, the I/O write is guaranteed to have completed at the target I/O device.
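As a sketch with the generic kernel accessors (base, CTRL_REG and DATA_PORT are made-up names, not from any real driver):
/* MMIO: the write is posted, so read it back to know it has reached the device. */
writel(0, base + CTRL_REG);
(void)readl(base + CTRL_REG); /* returns only after the write has completed at the device */

/* Legacy port I/O: the I/O write transaction is non-posted, so no read-back is needed. */
outw(0, DATA_PORT);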
Zero-length reads can be used by an I/O device to determine that earlier writes have completed at the destination. This is how, for example, an I/O device can ensure that a write has reached the persistence domain on a platform that supports Asynchronous DRAM Refresh (ADR), or that a write has become observable by the device driver.

Memory-mapped I/O - how does the I/O device know a value has changed?

With memory-mapped I/O, how does an I/O device know that a value in memory pertaining to it has changed?
For example, let's say memory address 0 has been dedicated to hold the background color for a VGA device. How does the VGA device know when we change the value in memory[0]? Is the VGA device constantly polling the memory location? Or does the CPU somehow notify the device when it changes the value (and if so how?)?
An example architecture is MIPS. Given that the MIPS instruction set does not have in or out instructions, I don't understand how it could possibly communicate (on change) with the VGA device in the example. Another example is the ARM architecture.
In memory-mapped I/O, performing a memory read/write to the device's memory region will cause the CPU to perform a transaction with the device to fetch/store that value -- either directly through the CPU's memory bus, or through a secondary bus (such as AHB/APB on ARM systems). This memory transaction directly notifies the device that a value is being changed; no separate notification is necessary.
You're assuming that memory-mapped I/O is backed by normal RAM. This is not the case; indeed, these devices may behave in ways that are entirely unlike real memory! For instance, a typical UART or SPI device implementation may have a single data register which can be written to transmit data, or read to retrieve received data. Similarly, it's not uncommon for interrupt registers to have "clear on read" or "write 1 to clear" semantics.
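For instance, a hypothetical interrupt-status register with "write 1 to clear" semantics might be handled like this (the address and bit layout are invented for illustration):
#include <stdint.h>

#define IRQ_STATUS  ((volatile uint32_t *)0x40001010) /* hypothetical MMIO address */
#define IRQ_RX_DONE (1u << 0)

void handle_irq(void)
{
    uint32_t status = *IRQ_STATUS; /* on some devices even this read has side effects */

    if (status & IRQ_RX_DONE) {
        /* ... process the received data ... */
        *IRQ_STATUS = IRQ_RX_DONE; /* writing 1 clears the bit; writing 0 leaves it alone */
    }
}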
For what it's worth: in practice, many framebuffer graphics implementations do actually behave as normal memory. What's different is that the memory is stored in a dual-ported RAM (or a time-multiplexed bus), and the video RAMDAC continuously reads through that memory to transmit its contents to an attached display.
A region of the physical address space that is designated as memory-mapped I/O (MMIO) is not mapped to main memory (system memory); it's mapped to I/O registers which are physically part of the I/O device.
To determine how to handle a memory access (read or write), the processor first checks the type of the region to which the target address belongs. In any MIPS processor, there are at least two types: Uncached and Cached. MMIO regions are always Uncached. An ordinary Uncached memory access request is sent directly to the main memory controller without examining or affecting any of the caches, whereas an Uncached access that targets an MMIO region is sent to an I/O controller, and eventually the request reaches the destination I/O device.
Now exactly how the CPU and the I/O device communicate with each other is completely specified by the I/O device itself. So an I/O device would have a specification that discusses how many I/O registers there are and how each of them should be used. An I/O register could be used to hold status flags, control flags, data to be read or written by the CPU, or some combination thereof. Note that since the I/O registers are physically part of the I/O device, then the I/O device can be designed so that it can detect when any of its registers are being read from or written to and take an action accordingly if required.
An I/O device can send an interrupt to the CPU to inform it that some data is available or maybe it wants attention for whatever reason. The CPU can also frequently poll the I/O device by checking some status flag(s) and then take some action accordingly.
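As a sketch, polling a status flag might look like this on a MIPS- or ARM-class system (the register address and ready bit are hypothetical):
#include <stdint.h>

#define DEV_STATUS ((volatile uint32_t *)0xBF000000) /* hypothetical uncached MMIO status register */
#define DEV_READY  (1u << 0)

/* Busy-wait until the device reports that data is available. */
void wait_for_device(void)
{
    while ((*DEV_STATUS & DEV_READY) == 0)
        ; /* every iteration is a real uncached read that goes out to the device */
}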

How does device mmap work in contrast to I/O port addresses?

I was wondering if Linux sees a difference between mmap-ing a peripheral device's memory and reading/writing to the device via I/O ports. From what I've learned in my assembly class, we pretty much looked at I/O port addressing in the same light as memory addressing. So I suppose I was wondering: if I read/write to the I/O port my device is connected to, is that the same thing as mmap-ing that device's memory?
Thanks
I/O ports are not memory. Some hardware (e.g. graphics cards) is interfaced through the memory bus, not only through the I/O port bus.
For hardware that has a memory interface (that is, hardware the CPU sees as a range of memory), you might use mmap.
The X11 server Xorg very often mmap's the graphics card's memory.
I think the OP is confusing three things:
mmap() is a system call by which application programs ask the kernel to set up page-table (MMU) mappings, e.g. to map a file or device memory into their address space.
Memory mapped I/O is a hardware architecture concept: instead of having separate buses and operations for I/O, some area of the address space is dedicated to I/O operations. (the 68K processor family uses memory mapped I/O, and IBM's AIX too, IIRC).
DMA means that not only the CPU(s), but also the I/O devices can read and write to and from (parts of) physical memory.
The vm_area_struct contains a vm_flags field. In the case of such a special mapping it contains the VM_PFNMAP or VM_IO flags. See the struct vm_area_struct, VM_PFNMAP and VM_IO definitions at LXR.
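A userspace sketch of mmap-ing device memory through a device file (here a made-up /dev/mydevice exposed by some driver; a UIO node or a PCI resource file would be handled similarly):
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/mydevice", O_RDWR | O_SYNC); /* hypothetical device node */
    if (fd < 0)
        return 1;

    /* Ask the kernel to map one page of the device's registers into our address
     * space; the driver's mmap handler sets up the VM_IO / VM_PFNMAP mapping. */
    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED)
        return 1;

    regs[0] = 0x1; /* this store goes to the device's register, not to RAM */

    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}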

What is the difference between DMA and memory-mapped IO?

What is the difference between DMA and memory-mapped IO? They both look similar to me.
Memory-mapped I/O allows the CPU to control hardware by reading and writing specific memory addresses. Usually, this would be used for low-bandwidth operations such as changing control bits.
DMA allows hardware to directly read and write memory without involving the CPU. Usually, this would be used for high-bandwidth operations such as disk I/O or camera video input.
Here is a paper that has a thorough comparison between MMIO and DMA:
Design Guidelines for High Performance RDMA Systems
Since others have already answered the question, I'll just add a little bit of history.
Back in the old days, on x86 (PC) hardware, there was only I/O space and memory space. These were two different address spaces, accessed with different bus protocols and different CPU instructions, but able to talk over the same plug-in card slot.
Most devices used I/O space for both the control interface and the bulk data-transfer interface. The simple way to access data was to execute lots of CPU instructions to transfer data one word at a time from an I/O address to a memory address (sometimes known as "bit-banging.")
There was no support in the ISA bus protocol for devices to move data to host memory autonomously, so a compromise solution was invented: the DMA controller. This was a piece of hardware that sat up by the CPU and initiated transfers to move data from a device's I/O address to memory, or vice versa. Because the I/O address is the same, the DMA controller is doing the exact same operations as a CPU would, but a little more efficiently and allowing the CPU some freedom to keep running in the background (though possibly not for long, since it cannot talk to memory while the transfer is in progress).
Fast-forward to the days of PCI, and the bus protocols got a lot smarter: any device can initiate a transfer. So it's possible for, say, a RAID controller card to move any data it likes to or from the host at any time it likes. This is called "bus master" mode, but for no particular reason people continue to refer to this mode as "DMA" even though the old DMA controller is long gone. Unlike old DMA transfers, there is frequently no corresponding I/O address at all, and the bus master mode is frequently the only interface present on the device, with no CPU "bit-banging" mode at all.
Memory-mapped IO means that the device registers are mapped into the machine's memory space - when those memory regions are read or written by the CPU, it's reading from or writing to the device, rather than real memory. To transfer data from the device to an actual memory buffer, the CPU has to read the data from the memory-mapped device registers and write it to the buffer (and the converse for transferring data to the device).
With a DMA transfer, the device is able to directly transfer data to or from a real memory buffer itself. The CPU tells the device the location of the buffer, and then can perform other work while the device is directly accessing memory.
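A sketch of that second model, with an invented device whose registers take a buffer address, a length and a start bit (real devices use descriptor rings and the kernel DMA API, but the idea is the same):
#include <linux/io.h>
#include <linux/types.h>

/* Hypothetical register offsets within the device's MMIO region. */
#define DMA_ADDR_REG 0x00 /* bus/DMA address of the buffer */
#define DMA_LEN_REG  0x08 /* transfer length in bytes */
#define DMA_CTRL_REG 0x10 /* bit 0 = start */

static void start_dma_transfer(void __iomem *regs, dma_addr_t buf, u32 len)
{
    /* Tell the device where the buffer is and how big it is... */
    writeq(buf, regs + DMA_ADDR_REG);
    writel(len, regs + DMA_LEN_REG);
    /* ...then kick it off; the device moves the data itself and typically raises
     * an interrupt when done, while the CPU is free to do other work. */
    writel(1, regs + DMA_CTRL_REG);
}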
Direct Memory Access (DMA) is a technique to transfer data from I/O to memory and from memory to I/O without the intervention of the CPU. For this purpose, a special chip, called the DMA controller, is used to control all the activity and synchronization of the transfer. As a result, compared to other data transfer techniques, DMA is much faster.
On the other hand, virtual memory acts as a cache between main memory and secondary memory. Data is fetched in advance from secondary memory (the hard disk) into main memory so that it is already available in main memory when needed. It allows us to run more applications on the system than the physical memory alone could support.
