How to achieve the best TCP/IP socket data-rate performance on Linux

Packing as much data as possible into each TCP packet will obviously decrease the relative weight of protocol overhead. Increasing the socket buffer size increases robustness against peaks of CPU usage.
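For the buffer-size point, a minimal sketch in C, assuming an already-created TCP socket fd (the kernel doubles the requested value to allow for bookkeeping and caps it at net.core.rmem_max, so verify the result with getsockopt):

#include <sys/socket.h>

/* Request a larger kernel receive buffer so short CPU stalls in the
 * reader don't cause the window to close or data to back up. */
static int grow_rcvbuf(int fd, int bytes)
{
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}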
But what else can be done to achieve the highest data rates?
Is increasing the priority of the data-reader thread a good idea? If the highest priority were used, could this thread compete for CPU with the network driver and actually harm performance?
Is blocking or non-blocking I/O better in terms of achievable data rates?
At very high data rates, can overflow of the receive buffers be detected as the buffer reaches, say, 90% full, and trigger a high-priority read?
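One way to approximate this from userspace is to poll the queue depth against the configured buffer size. A hedged sketch, assuming a connected TCP socket fd (SIOCINQ reports unread bytes in the receive queue; note SO_RCVBUF includes kernel bookkeeping overhead, so this is a heuristic, not an exact fill level):

#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/sockios.h>

/* Returns 1 when the receive queue looks roughly >= 90% full. */
static int rx_nearly_full(int fd)
{
    int queued = 0, rcvbuf = 0;
    socklen_t len = sizeof(rcvbuf);
    if (ioctl(fd, SIOCINQ, &queued) < 0)
        return 0;
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
    return rcvbuf > 0 && queued > (rcvbuf / 10) * 9;
}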
What other techniques are there for high data rates over TCP/IP sockets?

One way would be to use busy polling to get data from the NIC. This can improve the data rate by reducing interrupt overhead, and it is what high-performance packet-processing frameworks such as DPDK do.
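Without leaving the regular socket API, the kernel exposes a limited form of this via SO_BUSY_POLL (Linux 3.11+; the net.core.busy_read and net.core.busy_poll sysctls are related). A hedged sketch: a blocking receive on this socket will spin in the driver for up to the given number of microseconds before sleeping for an interrupt.

#include <sys/socket.h>

/* Opt a single socket into busy polling for up to 'usec' microseconds. */
static int enable_busy_poll(int fd, int usec)
{
    return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usec, sizeof(usec));
}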
Another way would be to avoid copying packets from kernel space to userspace. I don't know whether this is possible in your case. The copy can be avoided by mapping kernel memory into userspace memory; copying data to userspace is one of the most time-consuming steps in a network stack. Again, this is done in DPDK.
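In the mainline kernel (without DPDK), one mechanism in this direction is PACKET_MMAP: an AF_PACKET socket with a kernel-allocated ring mapped into userspace, so received frames are read without a per-packet copy. A minimal sketch (requires root/CAP_NET_RAW; error handling mostly omitted):

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

    /* Describe the ring: tp_frame_nr must equal
     * tp_block_nr * (tp_block_size / tp_frame_size). */
    struct tpacket_req req = {
        .tp_block_size = 4096,
        .tp_block_nr   = 64,
        .tp_frame_size = 2048,
        .tp_frame_nr   = 128,
    };
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

    /* Map the ring; the kernel writes received frames straight into
     * this memory, so no read()/recv() copy is required. */
    size_t len = (size_t)req.tp_block_size * req.tp_block_nr;
    char *ring = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* Each frame begins with a tpacket_hdr; TP_STATUS_USER means the
     * kernel has handed this frame to userspace. */
    struct tpacket_hdr *hdr = (struct tpacket_hdr *)ring;
    if (hdr->tp_status & TP_STATUS_USER) {
        char *pkt = (char *)hdr + hdr->tp_mac;  /* packet bytes */
        (void)pkt;                              /* ... process ... */
        hdr->tp_status = TP_STATUS_KERNEL;      /* return frame */
    }

    munmap(ring, len);
    close(fd);
    return 0;
}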

Related

What is the difference between Memory and IO bandwidth and how do we measure each one?

I have many assumptions, so forgive the verbosity of this two-part question.
The inspiration for these questions came from "What is the meaning of IB read, IB write, OB read and OB write? They came as output of Intel® PCM while monitoring PCIe bandwidth", where Hadi explains:
DATA_REQ_OF_CPU is NOT used to measure memory bandwidth but i/o bandwidth.
I'm wondering if the difference between memory and I/O bandwidth is similar to the difference between DMA (direct memory access) and MMIO (memory-mapped I/O), or if the bandwidth of both IS I/O bandwidth?
I’m trying to use this picture to help visualize:
(Hopefully I have this right.) In x86 there are two address spaces: memory and I/O. Would I/O bandwidth be the measure between the CPU (or DMA controller) and the I/O device, and memory bandwidth the measure between the CPU and main memory? Does all data in these two scenarios run through the memory bus? Just for clarity, do we all agree that the memory bus is the combination of the address and data buses? If so, that part of the image might be a little misleading...
If we can measure I/O bandwidth with the Intel® Performance Counter Monitor (PCM) pcm-iio program, how would we measure memory bandwidth? Now I'm wondering why they would differ if they run through the same wires. Unless I just have this all wrong. The GitHub page for a lot of this test code is a bit overwhelming: https://github.com/opcm/pcm
Thank you
The DATA_REQ_OF_CPU event cannot be used to measure memory bandwidth, for the following reasons:
Not all inbound memory requests from an IIO controller are serviced by a memory controller because a request could also be serviced by the LLC (or an LLC in case of multiple sockets). Note, however, on Intel processors that don't support DDIO, IO memory read requests may cause speculative read requests to be sent to memory in parallel with the LLC lookup.
The DATA_REQ_OF_CPU event has many subevents. The inbound memory metrics measured by the pcm-iio tool don't include all types of memory requests. Specifically, they don't include atomic memory reads and writes or IOMMU memory requests, which may consume memory bandwidth.
Some subevents count non-memory requests. For example, there are peer-to-peer requests (from one IIO to another).
An IO device may want to access memory on a NUMA node that is different from the node to which it's connected. In this case, it will consume memory bandwidth on a different NUMA node.
Now I realize the statement you quoted is a little ambiguous; I don't remember whether I was talking specifically about the metrics measured by pcm-iio or the event in general, or whether "memory bandwidth" referred to total memory bandwidth or only the portion consumed by I/O devices attached to an IIO. That said, the statement is correct under any of these interpretations, for the reasons mentioned above.
The pcm-iio tool only measures I/O bandwidth. Use the pcm-memory tool instead for measuring memory bandwidth; it utilizes the performance events of the IMCs. It appears to me that none of the PCM tools can measure the memory bandwidth consumed by I/O devices, which requires using the CBox events.
The main source of information on uncore performance events is the Intel uncore manuals. You'll find nice figures in the Introduction chapters of these manuals that show how the different units of a processor are connected to each other.

On which factors (parameters) does bandwidth depend when benchmarking pmem with FIO?

I am doing FIO testing of /dev/pmem for sequential read, using the command:
fio --name=readf --filename=/dev/pmem --iodepth=4 --ioengine=libaio --direct=1 --buffered=0 --groupreprting --timebased --bs=64k --size=10g --rw=read --norandommap --refillbuffers=1 --randrepeat=0 --runtime=300
The question is vague and can be read as "what does your maximum disk bandwidth depend on?"
Speed of the underlying device.
State of the underlying device.
Busyness of the system.
Amount of I/O that can be sent down in tandem.
How I/O is batched together.
Block size chosen (disks generally have an optimal size).
Whether the I/O is sequential or random.
Whether there is other I/O happening to the same disk (e.g. SMART updates).
Whether cache on the device is exhausted.
Whether the device has to do maintenance.
Size of caches.
Amount of I/O put down.
Compressibility of data.
Configuration parameters of the OS.
Configuration parameters of the hardware.
Size of the region I/O is done within.
In the job given, things that stand out are:
The iodepth looks a bit low; try boosting it until you no longer see a benefit.
Setting norandommap is meaningless for a sequential job.
Setting both direct=1 and buffered=0 is redundant (one is just the inverse of the other).
groupreprting and timebased are spelt incorrectly (they should be group_reporting and time_based).
refillbuffers is also spelt wrong (refill_buffers); you might get away with scramble_buffers at lower overhead, but at the risk of less random data.
You might get some benefit from pinning fio to appropriate CPUs.
You might get some benefit from submitting I/O in batches; a cleaned-up job is sketched after this list.
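Putting those fixes together, one possible cleaned-up job might look like this (iodepth=16 is only an arbitrary starting point to tune from, not a recommendation):
fio --name=readf --filename=/dev/pmem --iodepth=16 --ioengine=libaio --direct=1 --group_reporting --time_based --bs=64k --size=10g --rw=read --refill_buffers=1 --randrepeat=0 --runtime=300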

How to test the Linux NAPI feature?

I am trying to test the NAPI functionality in an embedded Linux environment. I used 'pktgen' to generate a large number of packets and tried to verify the interrupt count of my network interface in /proc/interrupts.
I found that the interrupt count is considerably lower than the number of packets generated.
I am also trying to tune the 'netdev_budget' value from 1 to 1000 (the default is 300) so that I can observe a reduction in interrupt count as netdev_budget is increased.
However, increasing netdev_budget doesn't seem to help. The interrupt count is similar to that observed with netdev_budget set to 300.
So here are my queries:
What is the effect of 'netdev_budget' on NAPI?
What other parameters can/should I tune to observe changes in the interrupt count?
Is there any other way to test the NAPI functionality on Linux (apart from directly looking at the network driver code)?
Any help is much appreciated.
Thanks in advance.
I wrote a comprehensive blog post about Linux network tuning which explains everything about monitoring, tuning, and optimizing the Linux network stack (including the NAPI weight). Take a look.
Keep in mind: some drivers do not disable IRQs from the NIC when NAPI starts. They are supposed to, but some simply do not. You can verify this by examining the hard IRQ handler in the driver to see if hard IRQs are being disabled.
Note that hard IRQs are re-enabled in some cases as mentioned in the blog post and below.
As far as your questions:
Increasing netdev_budget increases the number of packets that the NET_RX softirq can process. The number of packets that can be processed is also limited by a time limit (not tunable at the time of writing; newer kernels expose it as netdev_budget_usecs). This is to prevent the NET_RX softirq from eating 100% of CPU usage. If the device does not receive enough packets to process during its time allocation, hardirqs are re-enabled and NAPI is disabled.
You can also try modifying your IRQ coalescing settings for the NIC, if it is supported. See the blog post above for more information on how to do this and what this means, exactly.
You should add monitoring to your /proc/net/softnet_stat file. The fields in this file can help you figure out how many packets are being processed, whether you are running out of time, etc.
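A hedged sketch of such monitoring: the file has one hex-formatted row per CPU, and by my reading of the kernel's softnet_seq_show(), the first three columns are packets processed, packets dropped (backlog full), and time_squeeze (the softirq ran out of budget or time).

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/net/softnet_stat", "r");
    if (!f) { perror("fopen"); return 1; }

    unsigned processed, dropped, squeezed;
    int cpu = 0;
    char line[512];
    while (fgets(line, sizeof(line), f)) {
        /* columns are hexadecimal, one row per online CPU */
        if (sscanf(line, "%x %x %x", &processed, &dropped, &squeezed) == 3)
            printf("cpu%d processed=%u dropped=%u time_squeeze=%u\n",
                   cpu++, processed, dropped, squeezed);
    }
    fclose(f);
    return 0;
}

A rising time_squeeze count is the usual sign that netdev_budget (or the time limit) is being exhausted.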
A question for you to consider, if I may:
Why does your hardirq rate matter? It probably doesn't, directly. The hardirq handler in your NIC driver should do as little work as possible, so its executing often is probably not a problem for your system. If it is, you should carefully measure that, as it seems very unlikely. Nevertheless, you can adjust the IRQ coalescing settings and IRQ CPU affinity to alter the number of hardirqs generated by the NIC and which CPU processes them, respectively.
You should consider whether you are more interested in packet-processing throughput or packet-processing latency. Depending on which is the concern, you can tune your network stack appropriately.
Remember: to completely tune and optimize your Linux networking stack, you have to monitor and tune each component. They are all intertwined and it is difficult (and often insufficient) to monitor and tune just a single aspect of the stack.

PCIe - DMA: Consistent vs. Streaming Memory

Currently I'm adding DMA to my PCIe driver for Linux. As I read through the documentation, it mentions consistent, or coherent, memory and the API:
pci_set_consistent_dma_mask(...)
but it never really explains why to use it or what it does. It seems to recommend calling the function as a best practice and for future-proofing. The best I can gather is that consistent DMA memory has no cache effects: the memory is written between the device (FPGA) and CPU without any software/driver intervention once set up correctly (assuming I read correctly).
So my questions are:
Assuming a PCIe device does not require consistent memory, why would anyone use it, or in what cases is consistent memory used?
If I use consistent memory, do I then not need to implement an interrupt in the PCIe driver for DMA? If so, how do the userspace code and the device know a transfer has occurred?
If I transfer a lot of small packets, ~50 bytes, continuously and on occasion larger packets, ~6 kB, which DMA memory is better: consistent or streaming?
Think about it this way: "consistent" means it will be automatically coherent between the CPU and the bus without you doing anything to explicitly synchronize it. For example, say I have a memory ring for inbound and outbound packets. Its lifespan will be the entire time the system is in use, and I'm going to be checking it all the time. I want this to be always consistent, because if it weren't I would have to (manually) flush or synchronize the caches, and if that were costly and I had to do it every time I touched the ring, it would be a nightmare.
On the other hand, take a single data buffer I'm transferring. It's kind of a "one-off" deal. I can let the device transfer it, and maybe it takes many PCI cycles to complete the DMA. And maybe the mapping is inconsistent during that time. That's okay; when it's done I can flush/sync the caches and force consistency. If it takes a tiny bit of extra time to do so, no problem, because I'm only doing it once.
So you might ask, "why not make everything consistent?" The answer is that there is generally some overhead to making things consistent. Depending on the architecture, this can be significant. For such cases there are provisions for inconsistent (streaming) mappings, which skip cache consistency but require an explicit sync. So allowing an inconsistent transfer can gain you some performance.
Remember too - there are some cases where you would never need any consistency. For example - reading a buffer from a network device to memory, then writing that memory to a disk controller. This data may never be read/used by the CPU at all - so why bother placing any overhead on the CPU cache to track it.
As for your comment about the "interrupt": this is kind of odd. In a "normal" case you might have a control structure in consistent memory (like Tx/Rx rings) which you can poll to tell whether a transaction is done, while the actual data transferred lives in different memory, which could be a streaming (non-consistent) mapping.
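In Linux driver terms, that contrast maps onto dma_alloc_coherent() versus dma_map_single()/dma_unmap_single(). A minimal sketch under that assumption (example_dma_setup, buf, and the sizes are placeholders, not code from any real driver, and error paths are trimmed):

#include <linux/pci.h>
#include <linux/dma-mapping.h>

static int example_dma_setup(struct pci_dev *pdev, void *buf, size_t buf_bytes)
{
    dma_addr_t ring_dma, buf_dma;

    /* Consistent/coherent: a long-lived descriptor ring that CPU and
     * device both touch constantly; no explicit sync calls needed. */
    void *ring = dma_alloc_coherent(&pdev->dev, PAGE_SIZE,
                                    &ring_dma, GFP_KERNEL);
    if (!ring)
        return -ENOMEM;

    /* Streaming: a one-off data buffer; map it, let the device DMA,
     * then unmap (which syncs) before the CPU reads the data. */
    buf_dma = dma_map_single(&pdev->dev, buf, buf_bytes, DMA_FROM_DEVICE);
    if (dma_mapping_error(&pdev->dev, buf_dma))
        return -ENOMEM;
    /* ... tell the device about buf_dma, wait for completion ... */
    dma_unmap_single(&pdev->dev, buf_dma, buf_bytes, DMA_FROM_DEVICE);

    dma_free_coherent(&pdev->dev, PAGE_SIZE, ring, ring_dma);
    return 0;
}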
1) Imagine you want to transfer a huge amount of data over PCIe at a high rate. You have to use a scatter/gather list, and you can place that list in consistent memory so the FPGA can read it very quickly and then perform the transfers.
2) Of course you need interrupts; otherwise you have to poll, which is slow and unreliable.
3) If you use a larger consistent memory region, you can minimize interrupt/polling overhead, so transfers are faster, but Windows usually doesn't allow you to allocate a large consistent buffer.

Multicore processor core communication speeds

I would like to find the speed of communication between two cores of a computer.
I'm in the very early stages of planning to massively parallelise a sequential program, and I need to think about network communication speeds vs. communication between cores on a single processor.
Ubuntu Linux probably provides some way of seeing this sort of information? I would have thought the speed fluctuates; I just need some average value. I'm basically writing something up at the moment, and it would be good to talk about these ratios.
Any ideas?
Thanks.
According to this benchmark: http://www.dragonsteelmods.com/index.php?option=com_content&task=view&id=6120&Itemid=38&limit=1&limitstart=4 (Last image on the page)
On an Intel Q6600, inter-core latency is 32 nanoseconds. Network latency is measured in milliseconds, and one millisecond is 1,000,000 nanoseconds. "Good" network latency is considered around or under 100 ms, so given that, inter-core latency is faster by a factor on the order of a few million.
Besides latency, there's also bandwidth to consider. Again based on the linked benchmark, inter-core bandwidth for that particular configuration is about 14 GB/s, whereas according to this real-world test of a Gigabit Ethernet connection, throughput is about 35.8 MB/s: http://www.tomshardware.com/reviews/gigabit-ethernet-bandwidth,2321-3.html. The difference there is smaller, on the order of 400 times (14 GB/s / 35.8 MB/s ≈ 390) in bandwidth, as opposed to millions of times in latency. Depending on which is more important to your application, that might change your numbers.
Network speeds are measured in milliseconds for Ethernet ($5-$100/port), or microseconds for specialized MPI hardware like Dolphin or Myrinet (~$1k/port). Inter-core speeds are measured in nanoseconds, as the data is copied from one memory area to another and then some signal is sent from one CPU to another (with the data protected from simultaneous access by a mutex or a full-bodied queue).
So, using a back-of-the-napkin calculation, the ratio is about 1:10^6.
Inter-core communication is going to be massively faster. Why?
The network layer imposes a massive overhead in terms of packets, addressing, contention handling, etc.
Physical distance imposes a sizeable penalty.
Measuring inter-core communication speed precisely would be very difficult, but given the above I think it's a redundant calculation to make.
This is a non-trivial thing to measure. The speed of data transfer between two cores depends entirely on the application. It could depend on any (or all) of the speed of register access, the clock speed of the cores, the system bus speed, the latency of your cache, the latency of your memory, etc. In short, run a benchmark or you'll be guessing in the dark.
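If you do want a rough number for your own machine rather than the figures quoted above, a toy ping-pong microbenchmark is one common approach: pin two threads to different cores and time how long a flag takes to bounce between their caches. A sketch under those assumptions (cores 0 and 1 are arbitrary; compile with gcc -O2 -pthread):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000
static atomic_int flag;  /* 0: ping's turn, 1: pong's turn */

static void pin(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *pong(void *arg)
{
    (void)arg;
    pin(1);
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load(&flag) != 1) ;  /* wait for ping */
        atomic_store(&flag, 0);            /* reply */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pin(0);
    pthread_create(&t, NULL, pong, NULL);

    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store(&flag, 1);            /* ping */
        while (atomic_load(&flag) != 0) ;  /* wait for pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    pthread_join(t, NULL);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("~%.0f ns per round trip\n", ns / ROUNDS);
    return 0;
}

Note this measures cache-line ping-pong between the two chosen cores, which is only one definition of "communication speed"; results differ between cores that share a cache and cores that don't.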
