What is the difference between Memory and IO bandwidth and how do we measure each one? - io

What is the difference between memory and io bandwidth, and how do you measure each one?
I have so many assumptions; forgive the verbosity of this two-part question.
The inspiration for these questions came from the question "What is the meaning of IB read, IB write, OB read and OB write? They came as output of Intel® PCM while monitoring PCIe bandwidth", where Hadi explains:
DATA_REQ_OF_CPU is NOT used to measure memory bandwidth but i/o bandwidth.
I’m wondering if the difference between mem/io bandwidth is similar to the difference between DMA (direct memory access) and MMIO (memory-mapped IO), or if the bandwidth of both IS io bandwidth?
I’m trying to use this picture to help visualize:
(Hopefully I have this right.) In x86 there are two address spaces: memory and IO. Would IO bandwidth be the measure between the CPU (or DMA controller) and the IO device, and memory bandwidth the measure between the CPU and main memory, with all data in both scenarios running over the memory bus? Just for clarity, do we all agree that the memory bus is the combination of the address and data bus? If so, that part of the image might be a little misleading...
If we can measure IO bandwidth with Intel® Performance Counter Monitor (PCM) by utilizing the pcm-iio program, how would we measure memory bandwidth? Now I’m wondering why the two would differ if everything runs through the same wires? Unless I just have this all wrong. The GitHub page for a lot of this test code is a bit overwhelming: https://github.com/opcm/pcm
Thank you

The DATA_REQ_OF_CPU event cannot be used to measure memory bandwidth for the following reasons:
Not all inbound memory requests from an IIO controller are serviced by a memory controller, because a request could also be serviced by the LLC (or an LLC, in the case of multiple sockets). Note, however, that on Intel processors that don't support DDIO, IO memory read requests may cause speculative read requests to be sent to memory in parallel with the LLC lookup.
The DATA_REQ_OF_CPU event has many subevents. The inbound memory metrics measured by the pcm-iio tool don't include all types of memory requests. Specifically, they don't include atomic memory reads and writes or IOMMU memory requests, which may consume memory bandwidth.
Some subevents count non-memory requests. For example, there are peer-to-peer requests (from one IIO to another).
An IO device may want to access memory on a NUMA node that is different from the node to which it's connected. In this case, it will consume memory bandwidth on a different NUMA node.
Now I realize the statement you quoted is a little ambiguous; I don't remember whether I was talking specifically about the metrics measured by pcm-iio or about the event in general, or whether "memory bandwidth" refers to total memory bandwidth or only the portion consumed by IO devices attached to an IIO. That said, the statement is correct under any of these interpretations, for the reasons mentioned above.
The pcm-iio tool only measures IO bandwidth. Use the pcm-memory tool instead for measuring memory bandwidth; it utilizes the performance events of the IMCs. It appears to me that none of the PCM tools can measure the memory bandwidth consumed by IO devices, which requires using the CBox events.
The main source of information on uncore performance events is the Intel uncore manuals. You'll find nice figures in the Introduction chapters of these manuals that show how the different units of a processor are connected to each other.
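If you want to cross-check pcm-memory without PCM itself, a minimal sketch of the same idea is to read one of the IMC's CAS-count events through Linux perf_event_open. This is not part of PCM: the PMU name uncore_imc_0, the sysfs path, and the 0x0304 event encoding below are platform-dependent assumptions that you have to look up for your own CPU (see /sys/bus/event_source/devices/ and its events/ directory).

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    /* The kernel exports a numeric id for each uncore PMU; this path is an
     * assumption (client parts expose "uncore_imc", servers "uncore_imc_0",
     * "uncore_imc_1", ...). */
    FILE *f = fopen("/sys/bus/event_source/devices/uncore_imc_0/type", "r");
    int type;
    if (!f || fscanf(f, "%d", &type) != 1) { perror("read PMU type"); return 1; }
    fclose(f);

    struct perf_event_attr attr = {0};
    attr.size = sizeof(attr);
    attr.type = type;
    /* Example encoding for a DRAM read-CAS count (event=0x04, umask=0x03);
     * check .../uncore_imc_0/events/ for the right values on your CPU. */
    attr.config = 0x0304;

    /* Uncore events are socket-wide: pid = -1 and a CPU on the target socket. */
    int fd = perf_event_open(&attr, -1, 0, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    sleep(1);                              /* one-second measurement window */

    uint64_t cas;
    if (read(fd, &cas, sizeof(cas)) != sizeof(cas)) { perror("read"); return 1; }
    printf("read bandwidth ~ %.1f MB/s\n", cas * 64 / 1e6);  /* 64 B per CAS */
    close(fd);
    return 0;
}

Running this needs root (or a lowered perf_event_paranoid), and on a system with multiple memory channels or sockets you would sum over all uncore_imc_* PMUs.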

Related

how to achieve best tcpip socket data-rate performance

Packing as much data as possible per tcp packet will obviously decrease the relative weight of overhead. Increasing buffer size increases robustness against peaks of CPU usage.
But what else can be done to achieve highest data rates?
Is increasing the priority of the data reader thread a good idea? If the highest priority was used, could this thread compete for CPU usage with the Network driver and actually harm performance?
Is blocking or non-blocking best in terms of achievable data rates?
At very high data rates, can overflow of the receive buffers be detected as the buffer gets to, say, 90% and trigger a high priority read?
Other techniques for high data-rate over tcpip sockets?
One way would be to use busy polling for getting data from NIC. This can improve the data rate by reducing interrupt overhead. This is done in high performance packet processing frameworks such as DPDK.
Another way would be to avoid copying packets from kernel space to user space. I don't know whether this is possible in your case. Copying can be avoided by mapping kernel memory into user space; copying data to user space is one of the most time-consuming steps in network stacks. Again, this is done in DPDK.
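For the busy-polling point, one option that doesn't require DPDK is the kernel's per-socket busy-poll knob. A rough sketch, with arbitrary example values for the buffer size and polling budget (SO_BUSY_POLL needs Linux 3.11+ and, on older kernels, CAP_NET_ADMIN):

#include <sys/socket.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46     /* value from the kernel's asm-generic/socket.h */
#endif

static int tune_socket(int fd)
{
    int rcvbuf = 4 * 1024 * 1024;   /* 4 MiB receive buffer (example value) */
    int busy_usec = 50;             /* busy-poll the device queue for up to 50 us */

    /* Bigger buffer = more headroom when the reader thread falls behind. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
        return -1;

    /* Poll the NIC for new packets on blocking reads instead of sleeping
     * until the interrupt arrives; trades CPU time for latency/throughput. */
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &busy_usec, sizeof(busy_usec)) < 0)
        return -1;

    return 0;
}

Note that net.core.rmem_max caps what SO_RCVBUF can actually set, so that sysctl may need raising as well.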

PCIe - DMA: Consistent vs. Streaming Memory

Currently I'm adding DMA to my PCIe driver for Linux. As I'm reading through the documentation it makes mention of consistent, or coherent, memory by using the API:
pci_set_consistent_dma_mask(...)
but never really talks about why to use it or what it does. It seems to suggest calling the function as a best practice and for future-proofing. The best I can gather is that consistent DMA memory does not have cache effects and the memory is written between the device (FPGA) and the CPU without any software/driver intervention once set up correctly (assuming I read correctly).
So my questions are:
Assuming a PCIe device does not require consistent memory then why would anyone use it, or in what cases is consistent memory used?
If I use consistent memory then do I not need to implement an interrupt in the PCIe driver for DMA? If true, then how does the userspace code and device know a transfer has occurred?
If I transfer a lot of small packets, ~50 bytes, continuously and on occasion larger packets, ~6 kB, which DMA memory is better: consistent or streaming?
Think about it this way: "Consistent" means it will be automatically coherent between CPU and bus without doing anything to specifically synchronize it. For example - say I have a memory ring for inbound and outbound packets. Its lifespan will be the entire time the system is in use, and I'm going to be checking it all the time. I want this to be always consistent, because if it isn't I would have to (manually) flush or synchronize the caches, and if this were costly and I had to do it every time I touched the ring - it would be a nightmare.
On the other hand - let's take a single data buffer I'm transferring. It's kind of a "one-off" deal. I can let the device transfer it - and maybe it takes many PCI cycles to complete the DMA. And maybe this is inconsistent. That's okay - but when it's done I can flush/sync caches/force consistency. If it takes a tiny bit of extra time to do so - no problem - because I'm just doing it once.
So you might ask "why not make everything consistent?" The answer is that there is generally some level of overhead to make things consistent. Depending on the architecture, this could be significant. So in such cases, there are provisions to allow for inconsistent (streaming) mappings which don't do cache consistency (but require an explicit sync). So allowing an inconsistent transfer could gain you some performance.
Remember too - there are some cases where you would never need any consistency. For example - reading a buffer from a network device to memory, then writing that memory to a disk controller. This data may never be read/used by the CPU at all - so why bother placing any overhead on the CPU cache to track it.
As for your comment about the "interrupt" - this is kind of odd. In a "normal" case - you might have a control structure in consistent memory (like Tx/Rx rings) which you could poll to tell you if the transaction was done. But the actual data transferred would be in a different memory which could be streaming or non-consistent.
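To put that in terms of the Linux DMA API, here is a rough sketch (the struct and function names are made up, not from a real driver): the long-lived ring gets a coherent mapping, while each one-off buffer gets a streaming mapping with an explicit sync point.

#include <linux/dma-mapping.h>

#define MYDEV_RING_BYTES 4096

struct mydev {
	struct device *dev;
	void *ring;              /* descriptor ring, CPU virtual address */
	dma_addr_t ring_dma;     /* bus address the device uses */
};

static int mydev_setup_ring(struct mydev *md)
{
	/* Long-lived, constantly-touched control structure: coherent mapping,
	 * no explicit cache maintenance needed on either side. */
	md->ring = dma_alloc_coherent(md->dev, MYDEV_RING_BYTES,
				      &md->ring_dma, GFP_KERNEL);
	return md->ring ? 0 : -ENOMEM;
}

static void mydev_one_transfer(struct mydev *md, void *buf, size_t len)
{
	/* One-off data buffer: streaming mapping for the duration of a single
	 * transfer; ownership goes back to the CPU only after the unmap. */
	dma_addr_t dma = dma_map_single(md->dev, buf, len, DMA_FROM_DEVICE);

	if (dma_mapping_error(md->dev, dma))
		return;

	/* ... tell the device about "dma", wait for completion ... */

	dma_unmap_single(md->dev, dma, len, DMA_FROM_DEVICE);
	/* Caches are now synced; the CPU can safely read "buf". */
}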
1) Imagine you want to transfer a huge amount of data over PCIe at a high rate. You have to use a scatter/gather list, and you can keep that list in consistent memory so the FPGA can read it very quickly and then do the transfers (see the sketch after this list).
2) Of course you need interrupts; otherwise you have to use polling, which is very slow and unreliable.
3) If you use a larger consistent memory region, you can minimize interrupt/polling overhead, so transfers are faster, but Windows usually doesn't allow you to allocate a large consistent memory region.
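Here is a hypothetical sketch of point 1): map a large, possibly non-contiguous buffer as a scatter/gather list and describe it to the FPGA through a descriptor table kept in coherent memory. The descriptor layout and names are invented for illustration (a real device defines its own format), and error paths are simplified.

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <linux/types.h>
#include <asm/byteorder.h>

struct my_sg_desc {              /* format the FPGA is assumed to parse */
	__le64 addr;
	__le32 len;
	__le32 last;
};

static dma_addr_t build_sg_descs(struct device *dev, struct sg_table *table)
{
	struct scatterlist *sg;
	struct my_sg_desc *descs;
	dma_addr_t descs_dma;
	int i, nents;

	/* Map the buffer's pages for device reads (streaming mapping). */
	nents = dma_map_sg(dev, table->sgl, table->nents, DMA_TO_DEVICE);
	if (!nents)
		return 0;

	/* The descriptor list itself is coherent, so the device always sees
	 * up-to-date entries without explicit cache flushes. */
	descs = dma_alloc_coherent(dev, nents * sizeof(*descs), &descs_dma,
				   GFP_KERNEL);
	if (!descs)
		return 0;

	for_each_sg(table->sgl, sg, nents, i) {
		descs[i].addr = cpu_to_le64(sg_dma_address(sg));
		descs[i].len  = cpu_to_le32(sg_dma_len(sg));
		descs[i].last = cpu_to_le32(i == nents - 1);
	}

	/* descs_dma is what would be written into the device's list-base
	 * register so it can walk the list and start the transfer. */
	return descs_dma;
}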

Why splice with sockets cannot improve performance without DMA?

In Wikipedia's introduction to splice, I found:
When using splice() with sockets, the network controller (NIC) must support DMA.
When the NIC does not support DMA then splice() will not deliver any performance improvement. The reason for this is that each page of the pipe will just fill up to frame size (1460 bytes of the available 4096 bytes per page).
From what I understand, the splice improves performance because:
there's less context switching
it minimizes the number of copies (minimum two DMA copies)
If the NIC does not support DMA, we fall back to a CPU copy. This is still better than normal copies, which have to go through user space.
So, I don't understand why Wikipedia says there's no performance improvements without DMA support in NIC.
Maybe Wikipedia is wrong? That article is already flagged as being light on citations...
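For reference, the pattern under discussion looks roughly like this in user space: moving data from a TCP socket to a file through a pipe with splice(), so the payload never enters a user-space buffer. This is only a sketch; the descriptors sock_fd and file_fd are assumed to be already open and error handling is kept minimal.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Forward everything readable from sock_fd into file_fd via a pipe. */
static int forward(int sock_fd, int file_fd)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    for (;;) {
        /* Socket -> pipe: with a DMA-capable NIC the received pages are just
         * linked into the pipe instead of being copied. */
        ssize_t n = splice(sock_fd, NULL, p[1], NULL, 65536,
                           SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n <= 0)
            break;                          /* EOF or error */

        /* Pipe -> file: drain exactly what we just queued. */
        while (n > 0) {
            ssize_t m = splice(p[0], NULL, file_fd, NULL, (size_t)n,
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (m <= 0) {
                close(p[0]);
                close(p[1]);
                return -1;
            }
            n -= m;
        }
    }

    close(p[0]);
    close(p[1]);
    return 0;
}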

Multicore processor core communication speeds

I would like to find the speed of communication between two cores of a computer.
I'm in the very early stages of planning to massively parallelise a sequential program and I need to think about network communication speeds vs. communication between cores on a single processor.
Ubuntu Linux probably provides some way of seeing this sort of information? I would have thought the speed fluctuates; I just need some average value. I'm basically needing to write something up at the moment and it would be good to talk about these ratios.
Any ideas?
Thanks.
According to this benchmark: http://www.dragonsteelmods.com/index.php?option=com_content&task=view&id=6120&Itemid=38&limit=1&limitstart=4 (Last image on the page)
On an Intel Q6600, inter-core latency is 32 nanoseconds. Network latency is measured in milliseconds, and one millisecond is 1,000,000 nanoseconds. "Good" network latency is considered around or under 100 ms, so given that, inter-core communication is on the order of a million times faster in terms of latency.
Besides latency, though, there's also bandwidth to consider. Again, based on the linked benchmark for that particular configuration, inter-core bandwidth is about 14 GB/sec, whereas according to this: http://www.tomshardware.com/reviews/gigabit-ethernet-bandwidth,2321-3.html, a real-world test of a Gigabit Ethernet connection shows about 35.8 MB/sec. So the difference there is smaller, only on the order of 400 times in bandwidth as opposed to roughly 1,000,000 times in latency. Depending on which is more important to your application, that might change your numbers.
Network latencies are measured in milliseconds for Ethernet ($5-$100/port), or microseconds for specialized MPI hardware like Dolphin or Myrinet (~$1k/port). Inter-core speeds are measured in nanoseconds, as the data is copied from one memory area to another, and then some signal is sent from one CPU to another (the data will be protected from simultaneous access by a mutex or a full-fledged queue).
So, using a back-of-the-napkin calculation, the ratio is about 1:10^6.
Inter-core communication is going to be massively faster. Why?
the network layer imposes a massive overhead in terms of packets, addressing, handling contention, etc.
the physical distances involved have a sizeable impact
Measuring inter-core communication speed would be very difficult, but given the above I think it's a redundant calculation to make.
This is a non-trivial thing to find. The speed of data transfer between two cores depends entirely on the application. It could depend on any (or all) of the following: the speed of register access, the clock speed of the cores, the system bus speed, the latency of your cache, the latency of your memory, and so on. In short, run a benchmark or you'll be guessing in the dark.
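If you do want a rough number rather than a guess, a crude way to measure core-to-core latency yourself is a two-thread ping-pong on a shared flag, pinned to the two cores you care about. Here is a sketch (C11 plus pthreads, compile with -pthread); the core IDs 0 and 1 and the iteration count are just example choices.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000
static _Atomic int flag;          /* shared cache line bounced between cores */

static void pin(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *pong(void *arg)
{
    pin(1);
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load(&flag) != 1) ;   /* wait for ping */
        atomic_store(&flag, 0);             /* send pong */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec a, b;

    pin(0);
    pthread_create(&t, NULL, pong, NULL);

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < ITERS; i++) {
        atomic_store(&flag, 1);             /* send ping */
        while (atomic_load(&flag) != 0) ;   /* wait for pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    pthread_join(t, NULL);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    /* Each iteration is one round trip, i.e. two one-way core-to-core hops. */
    printf("~%.0f ns per one-way hop\n", ns / ITERS / 2);
    return 0;
}

The result will vary with whether the two cores share a cache, sit on different sockets, and so on, which is rather the point of the answer above.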

What is the difference between DMA and memory-mapped IO?

What is the difference between DMA and memory-mapped IO? They both look similar to me.
Memory-mapped I/O allows the CPU to control hardware by reading and writing specific memory addresses. Usually, this would be used for low-bandwidth operations such as changing control bits.
DMA allows hardware to directly read and write memory without involving the CPU. Usually, this would be used for high-bandwidth operations such as disk I/O or camera video input.
Here is a paper that has a thorough comparison between MMIO and DMA:
Design Guidelines for High Performance RDMA Systems
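As a concrete (and entirely hypothetical) illustration of the memory-mapped side, here is what poking a device control register from a Linux PCI driver might look like; the BAR number, register offset and bit value are made up for the example.

#include <linux/io.h>
#include <linux/pci.h>

#define MYDEV_REG_CTRL   0x00    /* assumed control-register offset in BAR 0 */
#define MYDEV_CTRL_START 0x1     /* assumed "start" bit */

static int mydev_start(struct pci_dev *pdev)
{
	void __iomem *regs;
	u32 ctrl;

	/* Map BAR 0 of the device into the kernel's address space. */
	regs = pci_iomap(pdev, 0, 0);
	if (!regs)
		return -ENOMEM;

	/* These reads/writes go across the bus to the device's registers,
	 * not to DRAM - that is the "memory-mapped I/O" part. */
	ctrl = ioread32(regs + MYDEV_REG_CTRL);
	iowrite32(ctrl | MYDEV_CTRL_START, regs + MYDEV_REG_CTRL);

	pci_iounmap(pdev, regs);
	return 0;
}

A DMA transfer, by contrast, would typically use a register write like this only to hand the device the bus address of a buffer (obtained from the DMA API), after which the device moves the data itself.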
Since others have already answered the question, I'll just add a little bit of history.
Back in the old days, on x86 (PC) hardware, there was only I/O space and memory space. These were two different address spaces, accessed with different bus protocol and different CPU instructions, but able to talk over the same plug-in card slot.
Most devices used I/O space for both the control interface and the bulk data-transfer interface. The simple way to access data was to execute lots of CPU instructions to transfer data one word at a time from an I/O address to a memory address (sometimes known as "bit-banging.")
The ISA bus protocol had no support for devices to initiate transfers and move data into host memory on their own. A compromise solution was invented: the DMA controller. This was a piece of hardware that sat up by the CPU and initiated transfers to move data from a device's I/O address to memory, or vice versa. Because the I/O address is the same, the DMA controller does exactly the same operations a CPU would, just a little more efficiently, while allowing the CPU some freedom to keep running in the background (though possibly not for long, since it can't talk to memory while the DMA controller holds the bus).
Fast-forward to the days of PCI, and the bus protocols got a lot smarter: any device can initiate a transfer. So it's possible for, say, a RAID controller card to move any data it likes to or from the host at any time it likes. This is called "bus master" mode, but for no particular reason people continue to refer to this mode as "DMA" even though the old DMA controller is long gone. Unlike old DMA transfers, there is frequently no corresponding I/O address at all, and the bus master mode is frequently the only interface present on the device, with no CPU "bit-banging" mode at all.
Memory-mapped IO means that the device registers are mapped into the machine's memory space - when those memory regions are read or written by the CPU, it's reading from or writing to the device, rather than real memory. To transfer data from the device to an actual memory buffer, the CPU has to read the data from the memory-mapped device registers and write it to the buffer (and the converse for transferring data to the device).
With a DMA transfer, the device is able to directly transfer data to or from a real memory buffer itself. The CPU tells the device the location of the buffer, and then can perform other work while the device is directly accessing memory.
Direct Memory Access (DMA) is a technique to transfer data from I/O to memory and from memory to I/O without the intervention of the CPU. For this purpose, a special chip, called a DMA controller, is used to control all activities and synchronization of the data. As a result, compared to other data transfer techniques, DMA is much faster.
On the other hand, virtual memory acts as a cache between main memory and secondary memory. Data is fetched in advance from secondary memory (the hard disk) into main memory so that it is already available in main memory when needed. It allows us to run more applications on the system than the physical memory alone could support.

Resources