I know that write-combining (WC) writes are buffered and don't reach memory immediately.
But is it necessary for the programmer to flush this memory explicitly before others can access it?
This question came from graphics driver code. For example, the CPU fills a vertex buffer (mapped as WC), but I don't see any flush operation in the code before the GPU accesses it.
Has the architecture (x86) already taken care of this for us? Is there any documentation with more detail about this?
According to Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1 (August 2012 version, but this should not have changed), Section 11.3.1, the buffer must be flushed:
The protocol for evicting the WC buffers is implementation dependent and should not be relied on by software for system memory coherency. When using the WC memory type, software must be sensitive to the fact that the writing of data to system memory is being delayed and must deliberately empty the WC buffers when system memory coherency is required.
If the graphics drivers did not actually flush the write-combining buffers, then they were depending on system-specific timing and/or buffer sizes (implicitly assuming that subsequent WC writes will push earlier ones out of the buffer, which is not architecturally guaranteed). This may work (or appear to work) on existing systems under ordinary workloads, but it is not architecturally guaranteed to work.
Since a broad range of serializing events will flush the write combining buffers, it is quite possible that the flush operation/event is present but not obvious (as an SFENCE would be). From Intel® 64 and IA-32 Architectures Software Developer’s Manual (version 052, September 2014), Volume 3, Section 11.3 Methods of Caching Available:
If the WC buffer is partially filled, the writes may be delayed until the next occurrence of a serializing event; such as, an SFENCE or MFENCE instruction, CPUID execution, a read or write to uncached memory, an interrupt occurrence, or a LOCK instruction execution.
For example, a write to a GPU register (if mapped to uncached memory) would flush the write combining buffer.
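To make that concrete, here is a minimal sketch of a driver-style fill of a WC-mapped buffer followed by an explicit flush. The vertex layout, buffer pointer, and mapping are hypothetical; only the _mm_sfence() call is the architecturally guaranteed part:

```c
#include <stddef.h>
#include <immintrin.h>   /* _mm_sfence() */

struct vertex { float x, y, z; };   /* hypothetical vertex layout */

/* wc_buf points into a vertex buffer that the driver mapped as WC. */
void fill_vertex_buffer(volatile struct vertex *wc_buf,
                        const struct vertex *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        wc_buf[i] = src[i];   /* data may sit in the CPU's WC buffers */

    /* Deliberately empty the WC buffers before telling the GPU to read
       the data, as the SDM requires for coherency. */
    _mm_sfence();
}
```

In real drivers the flush is often implicit: the UC doorbell/register write that kicks off the GPU is itself a serializing event that empties the WC buffers, which matches the list quoted above.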
Related
ENQCMD and MOVDIR64B are two x86 instructions introduced for use with Intel DSA (Data Streaming Accelerator).
MOVDIR64B reads 64-bytes from the source memory address and performs a 64-byte direct-store operation to the destination address. The ENQCMD instruction allows software to write commands to enqueue registers, which are special device registers accessed using memory-mapped I/O (MMIO).
My question is - what is the aim of designing those two instructions?
Based on my understanding, setting up a memory-mapped I/O area (the register) requires OS support, i.e., a device driver. After setting up the MMIO area, we can access it using the write() system call, which is also implemented in the device driver. For general architectures, Linux supports iowrite64(), which writes an 8-byte value at a time. Hence, if we want to write 64 bytes, we need to call iowrite64() 8 times.
With the help of MOVDIR64B, a new API was created for Intel DSA: __iowrite512(), which writes 64 bytes atomically.
I agree that the latter is at least more efficient than the former, but I am confused about the time it takes to transfer the data.
Consider the following case: given a device (Intel DSA) that supports MOVDIR64B and ENQCMD, suppose we want to transfer 64 bytes of data from memory to an MMIO register. There are two options: call iowrite64() 8 times (in a loop), or call __iowrite512() once. Will the latter be 8 times faster than the former?
My guess is that the difference is unlikely to be a full 8x, but that the latter will still be faster. How much faster would it be? Is this documented anywhere? I don't have Intel DSA hardware, so I am not sure how to test it.
Besides, what other benefits does ENQCMD have? Will it be broken up into several micro-operations? If so, what micro-operations make up ENQCMD?
iowrite64 uses a UC access to MMIO space, so writes are serialized, not pipelined. That is, only one UC write can be in flight at a time from a single CPU thread, and the CPU doesn't continue execution until the MMIO write is complete.
MOVDIR64B has the potential to be faster than even a single iowrite64, because it uses the WC memory type instead of UC (even if the destination address is UC). After the write is issued by the CPU, it can continue execution. Multiple direct stores can be streamed to the device. That means that multiple direct stores can be in flight at one time from a single CPU thread. MOVDIRI also behaves this way.
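To make the contrast concrete, here is a hedged sketch of the two submission paths (compile with -mmovdir64b; the portal mapping and descriptor buffer are hypothetical):

```c
#include <stdint.h>
#include <immintrin.h>   /* _movdir64b() */

/* UC path: eight serialized 8-byte MMIO stores; each one must complete
   before the CPU issues the next. */
void submit_uc(volatile uint64_t *portal, const uint64_t desc[8])
{
    for (int i = 0; i < 8; i++)
        portal[i] = desc[i];
}

/* MOVDIR64B path: one posted 64-byte direct store with WC semantics;
   the CPU can continue immediately and stream further stores.
   The destination must be 64-byte aligned. */
void submit_movdir64b(void *portal, const void *desc)
{
    _movdir64b(portal, desc);
}
```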
As far as I know, the time to actually transfer the data to the destination is the same regardless of the size (between 1 and 64 bytes). Of course that is dependent on the width of the data path within the SoC, which could be different for different implementations.
The main advantage of MOVDIR64B is that the descriptor arrives at the device all at once instead of in pieces. The device doesn't have to worry about receiving a partial descriptor or receiving parts of two descriptors interleaved. In fact, Intel DSA ignores writes smaller than 64 bytes to a portal.
To realize the full benefit of streaming writes, the destination address for each MOVDIR64B from a single CPU thread should be different. Each Intel DSA portal is a 4096-byte page, so there are 64 unique addresses within each portal. Descriptor writes from a single CPU can be striped across the 64 addresses. (It doesn't matter whether writes from multiple CPUs use the same address or different addresses, but normally you would not expect multiple CPUs to be using the same dedicated WQ in DSA.)
ENQCMD allows the device to respond to software whether it accepted the descriptor or not. This allows multiple applications to use the same shared WQ without risk of a descriptor being lost because the shared WQ is full. Applications can submit descriptors without any driver involvement (after setup), and without any lock or communication between the applications.
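A sketch of how software might submit to a shared WQ with ENQCMD, assuming the _enqcmd intrinsic (compile with -menqcmd), whose return value reflects ZF: nonzero means the device did not accept the descriptor:

```c
#include <immintrin.h>   /* _enqcmd() */

/* Retry until the shared work queue accepts the 64-byte descriptor.
   A real submitter would bound the retries or back off instead of
   spinning indefinitely. */
void submit_enqcmd(void *portal, const void *desc)
{
    while (_enqcmd(portal, desc))
        ;   /* ZF=1: WQ full, descriptor not accepted; retry */
}
```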
My goal is to read in stale and outdated values of memory without cache-coherence. I have attempted to use prefetchnta to perform a non-temporal load, but it failed to fetch outdated values. I am looking into performing some kind of Streaming Memory-to-Memory Direct-Memory-Access, but am having a little trouble due to the overwhelming amount of background knowledge required to proceed with my current project. Currently I am attempting to mess around with udmabuf but even that is going slowly. It should be noted that ideally I would like to ignore the contents of all CPU caches, including the current CPU.
To explain my reasoning: I am developing software that can be used to prove the correctness of programs written for non-volatile memory. Since the CPU cache is volatile, anything sitting in the write-back cache is lost on power failure, so the arbitrary order in which cache lines are written back to memory needs to be observable.
I would sincerely appreciate it if someone could give me some pointers of how to proceed. I do not mind digging into the Linux kernel, as in fact I am doing that now, nor do I mind modifying it, I just need a little guidance in the right direction.
I haven't played around with this, but my understanding from the docs is that for loads (unlike NT stores) nothing can bypass cache or override the strong ordering of memory types like the normal WB (write-back). And even NT stores evict already-cached data, so they can't break coherence for this or another core that has cached data for the line you're writing.
You can do weakly-ordered loads from WC (write-combining) memory regions (with prefetchnta or SSE4 movntdqa), but they're probably still coherent at the physical address level.
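For reference, an NT load from a WC mapping looks like the sketch below (src must be 16-byte aligned; on ordinary WB memory this behaves as a normal, coherent load):

```c
#include <smmintrin.h>   /* _mm_stream_load_si128() (SSE4.1) */

/* On WC memory this goes through the streaming-load buffer instead of
   polluting the cache, but it is still coherent at the physical
   address level, as described above. */
__m128i nt_load16(const void *src)
{
    return _mm_stream_load_si128((__m128i *)src);
}
```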
As @MargaretBloom commented:

IIRC Intel warns the developer about multiple mapping with different cache types, which may indeed be good in this case.

So maybe you could actually bypass cache coherence with multiple virtual mappings of the same physical page.
I don't know if it's possible to do non-coherent DMA with a PCI / PCIe device, but that might be your only hope for getting actual DRAM contents without going through cache.
Normally (always?) DMA on modern x86 systems is cache-coherent, which is good for performance. To maintain backwards compat with 386 and earlier CPUs without caches, the first x86 CPUs with caches had cache-coherent DMA, not introducing cache-control instructions until later generations, since existing OSes didn't use them. In modern systems, memory controllers are built-in to the CPU. So on Intel CPUs, the system agent can snoop L3 tags to see if a line is cached anywhere on-chip in parallel with sending the request to the memory controller. Or a Xeon can DMA right into L3 cache without data having to bounce through DRAM, good for high bandwidth NICs.
There's an INVD instruction which invalidates all caches without doing write-back first, but I think that includes the shared L3 cache, and probably the private caches of all other cores. So you can't practically use it on a Linux system where other cores are potentially in the middle of doing stuff; you'd potentially corrupt kernel data structures by using it, as well as simulating power failure on a machine with NVDIMMs for the process you were interested in.
Maybe if you somehow offlined all the other CPU cores, and disabled interrupts on the one core that was still up:

- you could wbinvd (write-back+invalidate) to flush all caches,
- then run some code under test,
- then invd and see what made it to DRAM,
- then re-enable interrupts.

Interrupt handlers could end up with some kernel data cached and some in memory, or get device drivers out of sync with hardware, if any interrupts are handled between the wbinvd and the invd.
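For reference, the two privileged instructions as GCC inline asm (a sketch; both fault outside ring 0, so this only makes sense in kernel or freestanding code):

```c
/* WBINVD: write back all modified cache lines, then invalidate the caches. */
static inline void wbinvd(void) { __asm__ volatile ("wbinvd" ::: "memory"); }

/* INVD: invalidate all caches WITHOUT write-back; dirty lines are lost. */
static inline void invd(void)   { __asm__ volatile ("invd"   ::: "memory"); }
```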
Update: someone did actually attempt this:
How to run "invd" instruction with disabled SMP support?
How to explicitly load a structure into L1d cache? Weird results with INVD with CR0.CD = 1 on isolated core with/without hyperthreading - invd worked so well it nuked some of the stores done by printk in the mis-designed attempt to log something about it.
On x86, lock-prefixed instructions such as lock cmpxchg provide barrier semantics in addition to their atomic operation: for normal memory access on write-back memory regions, reads and writes are not re-ordered across lock-prefixed instructions, per section 8.2.2 of Volume 3 of the Intel SDM:
Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.
This section applies only to write-back memory types. In the same list, you find an exception where it notes that weakly ordered stores are not ordered:
- Reads are not reordered with other reads.
- Writes are not reordered with older reads.
- Writes to memory are not reordered with other writes, with the following exceptions:
  - streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
  - string operations (see Section 8.2.4.1).
Note that there is no exception made for non-temporal instructions in any other items in the list, e.g., in the item referring to lock-prefixed instructions.
In various other sections of the guide, it is mentioned that the mfence and/or sfence instructions can be used to order memory when weakly ordered (non-temporal) instructions are used. These sections generally don't mention lock-prefixed instructions as an alternative.
All that leaves me uncertain: do lock-prefixed instructions provide the same full barrier that mfence provides between weakly ordered (non-temporal) instructions on WB memory? The same question applies again but to any type of access on WC memory.
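To make the question concrete, here is the pattern in miniature (a sketch; buf must be 16-byte aligned). The open question is whether a lock-prefixed RMW, e.g. atomic_fetch_add on the flag, could stand in for the sfence:

```c
#include <immintrin.h>
#include <stdatomic.h>

void publish(float *buf, __m128 v, atomic_int *ready)
{
    _mm_stream_ps(buf, v);   /* weakly ordered non-temporal store */

    _mm_sfence();            /* architecturally guaranteed to order the NT
                                store before the flag becomes visible */
    atomic_store_explicit(ready, 1, memory_order_release);
}
```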
On all 64-bit AMD processors, MFENCE is a fully serializing instruction and lock-prefixed instructions are not. However, both serialize all memory accesses, according to the AMD manual, Volume 2, Section 7.4.2:
All previous loads and stores complete to memory or I/O space before a memory access for an I/O, locked or serializing instruction is issued. All loads and stores associated with the I/O and locked instructions complete to memory (no buffered stores) before a load or store from a subsequent instruction is issued.
There are no exceptions or errata related to the serialization properties of these instructions.
It's clear from the Intel manuals and documents that both serialize all stores, with no exceptions or related errata. MFENCE also serializes all loads, with one erratum documented for most processors based on the Skylake, Kaby Lake, and Coffee Lake microarchitectures, which states that MOVNTDQA from WC memory may pass earlier MFENCE instructions. In addition, many processors based on the Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Skylake, Kaby Lake, Coffee Lake, and Silvermont microarchitectures have an erratum that says that MOVNTDQA from WC memory may pass earlier locked instructions. Processors based on the Core, Westmere, Sunny Cove, and Goldmont microarchitectures don't have this erratum.
The quote from Necrolis's answer says that the lock prefix may not serialize load operations that reference weakly ordered memory types on the Pentium 4 processors. My understanding is that this looks like a bug in the Pentium 4 processors and that it doesn't apply to any other processors, although it's worth noting that it's not documented in the specification update documents of the Pentium 4 processors.
Peter Cordes's experiments show that, on Skylake, locked instructions don't seem to block ALU instructions from executing out of order, while MFENCE does serialize ALU instructions (potentially behaving like LFENCE plus the store-buffer flush of a locked instruction). However, I think this is an implementation detail.
Bus locks (via the LOCK opcode prefix) produce a full fence*; however, on WC memory they don't provide the load fence. This is documented in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, Section 8.1.2:
For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.
*See Intel's 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, 8.2.3.9 for an example
Windows 8 and Server 2012 sport Registered I/O (RIO), which allows you to pre-register I/O buffers once and then use those same buffers repeatedly, avoiding the otherwise (apparently) necessary per-I/O-op buffer checks.
RIO also allows completion operations such as polling to be done entirely in user-mode, without making system calls.
(How) is this possible with Linux/Unix?
Starting with Linux Kernel 5.1 there is finally proper support with io_uring.
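For instance, with liburing a buffer is registered once up front and then reused via the fixed-buffer operations, which is the closest analogue to RIO's pre-registered buffers. A minimal sketch (error handling omitted; link with -luring):

```c
#include <liburing.h>

/* Read `len` bytes from fd at offset 0 into a pre-registered buffer. */
int read_with_fixed_buf(int fd, void *buf, unsigned len)
{
    struct io_uring ring;
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    io_uring_queue_init(8, &ring, 0);
    io_uring_register_buffers(&ring, &iov, 1);   /* one-time registration */

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read_fixed(sqe, fd, buf, len, 0, 0 /* registered index */);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);   /* completions can also be reaped by
                                         polling the CQ ring in user space */
    int res = cqe->res;
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return res;
}
```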
It seems netmap is that - and more:
In building netmap, we identified and successfully reduced or removed three main packet processing costs:

- per-packet dynamic memory allocations, removed by preallocating resources;
- system call overheads, amortized over large batches;
- and memory copies, eliminated by sharing buffers and metadata between kernel and userspace, while still protecting access to device registers and other kernel memory areas.
The use and effects of the O_SYNC and O_DIRECT flags are very confusing and appear to vary somewhat among platforms. From the Linux open(2) man page, O_DIRECT provides synchronous I/O, minimizes cache effects, and requires you to handle block-size alignment yourself. O_SYNC just guarantees synchronous I/O. Although both guarantee that data is written into the hard disk's cache, I believe that direct I/O operations are supposed to be faster than plain synchronous I/O, since they bypass the page cache (though FreeBSD's man page for open(2) states that the cache is bypassed when O_SYNC is used).
What exactly are the differences between the O_DIRECT and O_SYNC flags? Some implementations suggest using O_SYNC | O_DIRECT. Why?
O_DIRECT alone only promises that the kernel will avoid copying data from user space to kernel space, and will instead write it directly via DMA (direct memory access), if possible. Data does not go into the page cache. There is no strict guarantee that the call will return only after all data has been transferred.
O_SYNC guarantees that the call will not return before all data has been transferred to the disk (as far as the OS can tell). This still does not guarantee that the data isn't somewhere in the harddisk write cache, but it is as much as the OS can guarantee.
O_DIRECT|O_SYNC is the combination of these, i.e. "DMA + guarantee".
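A minimal sketch of the combination (the 512-byte alignment is an assumption here; the real requirement is file-system and kernel dependent, as the man page text quoted below explains):

```c
#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write `len` bytes (assumed to be a multiple of 512) with
   "DMA + guarantee": the page cache is bypassed, and write() returns
   only once the OS considers the data transferred to the disk. */
ssize_t write_direct_sync(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);
    if (fd < 0)
        return -1;

    void *buf = NULL;
    if (posix_memalign(&buf, 512, len) != 0) {   /* aligned bounce buffer */
        close(fd);
        return -1;
    }
    memcpy(buf, data, len);

    ssize_t n = write(fd, buf, len);
    free(buf);
    close(fd);
    return n;
}
```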
Actually, under Linux 2.6 O_DIRECT is synchronous; see the man page for open(2), which has two sections about it. Under 2.4 it is not guaranteed:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block devices is described in raw(8).
But under 2.6 it is guaranteed; see:
O_DIRECT
The O_DIRECT flag may impose alignment restrictions on the length and address of userspace buffers and the file offset of I/Os. In Linux alignment restrictions vary by file system and kernel version and might be absent entirely. However there is currently no file system-independent interface for an application to discover these restrictions for a given file or file system. Some file systems provide their own interfaces for doing so, for example the XFS_IOC_DIOINFO operation in xfsctl(3).
Under Linux 2.4, transfer sizes, and the alignment of the user buffer and the file offset must all be multiples of the logical block size of the file system. Under Linux 2.6, alignment to 512-byte boundaries suffices.
O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on the heap and statically allocated buffers). Any such I/Os, whether submitted via an asynchronous I/O interface or from another thread in the process, should be completed before fork(2) is called. Failure to do so can result in data corruption and undefined behavior in parent and child processes. This restriction does not apply when the memory buffer for the O_DIRECT I/Os was created using shmat(2) or mmap(2) with the MAP_SHARED flag. Nor does this restriction apply when the memory buffer has been advised as MADV_DONTFORK with madvise(2), ensuring that it will not be available to the child after fork(2).
The O_DIRECT flag was introduced in SGI IRIX, where it has alignment restrictions similar to those of Linux 2.4. IRIX has also a fcntl(2) call to query appropriate alignments, and sizes. FreeBSD 4.x introduced a flag of the same name, but without alignment restrictions.
O_DIRECT support was added under Linux in kernel version 2.4.10. Older Linux kernels simply ignore this flag. Some file systems may not implement the flag and open() will fail with EINVAL if it is used.
Applications should avoid mixing O_DIRECT and normal I/O to the same file, and especially to overlapping byte regions in the same file. Even when the file system correctly handles the coherency issues in this situation, overall I/O throughput is likely to be slower than using either mode alone. Likewise, applications should avoid mixing mmap(2) of files with direct I/O to the same files.
The behaviour of O_DIRECT with NFS will differ from local file systems. Older kernels, or kernels configured in certain ways, may not support this combination. The NFS protocol does not support passing the flag to the server, so O_DIRECT I/O will only bypass the page cache on the client; the server may still cache the I/O. The client asks the server to make the I/O synchronous to preserve the synchronous semantics of O_DIRECT. Some servers will perform poorly under these circumstances, especially if the I/O size is small. Some servers may also be configured to lie to clients about the I/O having reached stable storage; this will avoid the performance penalty at some risk to data integrity in the event of server power failure. The Linux NFS client places no alignment restrictions on O_DIRECT I/O.
In summary, O_DIRECT is a potentially powerful tool that should be used with caution. It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default.
"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances."---Linus
AFAIK, O_DIRECT bypasses the page cache. O_SYNC uses the page cache but syncs it immediately. The page cache is shared between processes, so another process that is working on the same file without the O_DIRECT flag can still read the correct data.
This IBM doc explains the difference rather clearly, I think.
For a file opened in O_DIRECT mode ("direct I/O"), GPFS™ transfers data directly between the user buffer and the file on the disk. Using direct I/O may provide some performance benefits in the following cases:

- The file is accessed at random locations.
- There is no access locality.

Direct transfer between the user buffer and the disk can only happen if all of the following conditions are true:

- The number of bytes transferred is a multiple of 512 bytes.
- The file offset is a multiple of 512 bytes.
- The user memory buffer address is aligned on a 512-byte boundary.

When these conditions are not all true, the operation will still proceed but will be treated more like other normal file I/O, with the O_SYNC flag that flushes the dirty buffer to disk.