SMP affinity routing doesn't work with GICv2 on ARM - linux

There are 4 CPU cores and one Ethernet card on my Raspberry Pi.
I need interrupts from the NIC to be routed to all 4 CPU cores.
I set the /proc/irq/24/smp_affinity to 0xF (1111), but that doesn't help.
In the sixth column of /proc/interrupts I don't see IO-APIC (which definitely supports* affinity routing) but GICv2 instead. I still can't find any useful info about GICv2 and smp_affinity.
Does GICv2 support SMP affinity routing?
*UPD:
from that post:
The only reason to look at this value is that SMP affinity will only
work for IO-APIC enabled device drivers.

TL;DR - The existence of /proc/irq/24/smp_affinity indicates that your Linux SMP system supports affinity. The text IO-APIC is the type of interrupt controller (typical PC) and it does NOT indicate that the system can handle affinities. On ARM systems a GIC is usually the interrupt controller, although some interrupts can be routed to a 'sub-controller'.
At least mainline supports some affinities, per Kconfig. However, I am not sure what you are trying to do. The interrupt can only run on one CPU, as only one CPU can take the data off the NIC. If a particular CPU is running network code and the rest are used for other purposes, the affinity makes sense.
The data on that core will probably not be in cache, as the NIC buffers are probably DMA buffers and not cacheable. So I am not really sure what you would achieve, or how you expect the interrupts to run on all four CPUs. If you have four NIC interfaces, you can peg each to a CPU. This may be good for power consumption issues.
Specifically, for your case of four CPUs, the affinity mask of 0xf does not restrict the IRQ to any subset of CPUs, and this is the default case. You can cat /proc/irq/24/smp_affinity to see the affinity that is set. Also, the existence of this file would indicate that your Linux SMP system supports affinity. The text IO-APIC is the type of interrupt controller (typical PC) and it does NOT indicate that the system can handle affinities.
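For illustration, here is a minimal sketch (my own, not from the question) that writes the mask for IRQ 24 from C and reads it back; the shell equivalent is just echo f > /proc/irq/24/smp_affinity. It needs root, and whether the written mask actually takes effect is up to the interrupt controller and driver.
#include <stdio.h>

/* Minimal sketch: write an affinity mask for IRQ 24 and read it back.
   Requires root; the IRQ number on your board may differ. */
int main(void)
{
    const char *path = "/proc/irq/24/smp_affinity";
    char buf[64];
    FILE *f = fopen(path, "w");
    if (!f) { perror("open for write"); return 1; }
    fprintf(f, "%x\n", 0xf);   /* allow CPUs 0-3 */
    fclose(f);

    f = fopen(path, "r");
    if (!f) { perror("open for read"); return 1; }
    if (fgets(buf, sizeof buf, f))
        printf("smp_affinity is now: %s", buf);
    fclose(f);
    return 0;
}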
See also:
zero copy vs kernel by-pass
University of Waterloo doc
IRQ-affinity.txt
NOTE This part is speculative and is NOT how any cards I know of work.
The major part that you want is not generally possible. The NIC registers are a single resource. There are multiple registers, and there are set sequences of register reads and writes to perform an operation. If two CPUs were writing (or even reading) the registers at the same time, it would severely confuse the NIC. Often the CPU is not that involved in an interrupt anyway; only some DMA engine needs to be told about a next buffer in the interrupt.
In order for what you want to be useful, you would need a NIC with several register 'banks' that can be used independently. For instance, separate READ/WRITE packet banks are easy to comprehend. However, there may be several banks to write different packets, and then the card would have to manage how to serialize them. Also, the card could do some packet inspection and interrupt different CPUs based on fixed packet values, i.e., a port and IP. This packet matching would generate different interrupt sources, and different CPUs could handle different matches.
This would allow you to route different socket traffic to a particular CPU using a single NIC.
The problem is that making such a card in hardware would be incredibly complex compared to existing cards. It would be more expensive and it would take more power to operate.
If it is standard NIC hardware, there is no gain in rotating CPUs if the original CPU is not busy. If there is non-network activity, it is better to leave other CPUs alone so their caches can be used for a different workload (code/data). So in most cases, it is best just to keep the interrupt on a fixed CPU unless that CPU is busy, and then it may ping-pong between a few CPUs. It would almost never be beneficial to run the interrupt on all CPUs.

I do not believe the GICv2 supports IRQ balancing. Interrupts will always be handled by the same CPU. At least this was the case when I last looked at this, around the 5.1 kernels. The discussion at the time was that this would not be supported because it was not a good idea.
You will see interrupts will always be handled by CPU 0. Use something like ftrace or LTTng to observe what CPU is doing what.
I think via the affinity setting you could prevent the interrupt from running on a CPU, by setting that bit to zero. But this does not balance the IRQ over all CPUs on which it is allowed. It will still always go to the same CPU. But you could make this CPU 1 instead of 0.
So what you can do is put certain interrupts on different CPUs. This would allow something like SDIO and the network to not vie for CPU time on CPU 0 in their interrupt handlers. It's also possible to set the affinity of a userspace process so that it will not run on the same CPU that handles interrupts, and thereby reduce the time that the userspace process can be interrupted.
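As a hedged sketch of that second idea (the CPU numbers are just examples, assuming a 4-CPU system with the IRQs left on CPU 0):
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Sketch: keep this process off CPU 0, where the IRQs land,
   assuming a 4-CPU system like the Raspberry Pi in the question. */
int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);
    CPU_SET(2, &set);
    CPU_SET(3, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... latency-sensitive work here no longer competes with the
       hard IRQ handlers pinned to CPU 0 ... */
    return 0;
}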
So why don't we do IRQ balancing? It ends up not being useful.
Keep in mind that the interrupt handler here is only the "hard" IRQ handler. This usually does not do very much work. It acknowledges the interrupt with the hardware and then triggers a back-end handler, like a work queue, IRQ thread, soft-irq, or tasklet. These don't run in IRQ context and can and will be scheduled to a different CPU or CPUs based on the current workload.
So even if the network interrupt is always routed to the same CPU, the network stack is multi-threaded and runs on all CPUs. Its main work is not done in the hard IRQ handler that runs on one CPU. Again, use ftrace or LTTng to see this.
If the hard IRQ does very little, what is most important is to reduce latency, which is best done by running on the same CPU to improve cache effectiveness. Spreading it out is likely worse for latency and also for the total cost of handling the IRQs.
The hard IRQ handler can only run one instance at a time. So even if it were balanced, it could use just one CPU at any one time. If this were not the case, the handler would be virtually impossible to write without race conditions. If you want to use multiple CPUs at the same time, then don't do the work in a hard IRQ handler; do it in a construct like a workqueue. That is how the network stack works, and the block device layer too.
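To make the idea concrete, here is a rough, hedged sketch of that hand-off pattern in a driver; my_irq_handler, rx_work and the probe-time calls are placeholders, not from any particular driver:
#include <linux/interrupt.h>
#include <linux/workqueue.h>

/* Sketch only: the hard IRQ handler does the bare minimum and hands the
   real work to a workqueue, which the scheduler can run on any CPU. */
static struct work_struct rx_work;

static void rx_work_fn(struct work_struct *work)
{
    /* the heavy lifting: runs in process context, may sleep,
       and may be scheduled on any CPU */
}

static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
    /* acknowledge the device here, then defer */
    schedule_work(&rx_work);
    return IRQ_HANDLED;
}

/* during probe: INIT_WORK(&rx_work, rx_work_fn);
   request_irq(irq_number, my_irq_handler, 0, "mydev", NULL); */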
IRQs aren't balanced, because it's not usually the answer. The answer is to not do the work in IRQ context.

Related

X86 clear interrupt flag instruction `cli` not working in user space?

I try to stop interrupts from user space for a specific isolated core,
so I set CPU affinity:
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(2, &set);
assert(sched_setaffinity(getpid(),sizeof(set),&set)==0);
and use iopl(3) to execute the privileged instructions cli/sti in user space:
iopl(3);
__asm__("cli;");
// busy looping for a while
__asm__("sti;");
and there are two phenomenons I can't explain:
1. cli can't actually stop interrupts (at least not all of them); interrupts such as LOC (Local Timer Interrupt) still come in every now and then.
I notice the latest kernel patches prevent cli in user space (reference), but this result can be reproduced on kernel 4.19.0.
2. AFAIK, cli only clears the interrupt flag of the CPU on which the program is running, but in practice my whole system gets stuck, not responding to my mouse or keyboard.
(2): Many parts of the Linux kernel depend on communicating with other cores, including RCU, which depends on running something on each core (run_on(core) and the like; see https://lwn.net/Articles/262464/). Any kernel code doing that will get stuck when this core doesn't respond to the IPI that other cores send to ask the kernel on this core to switch to a certain task, or perhaps to do TLB shootdowns.
I don't know what exact thing would tend to lead to getting stuck, but I don't find it surprising at all that other parts of the kernel are waiting for something that depends on hearing back from the kernel on this core, and that blocks progress of something involved in getting keyboard/mouse events to an X server and to user-space. (Or even to a text console? That might have more hope, fewer layers of software.)
Or it's always possible that some keyboard or mouse interrupts get distributed to this core, and ignored.
As for (1): did you leave the NMI watchdog enabled, or some other source of NMIs? That could get the kernel running temporarily in a state where (other?) interrupts are enabled.
I use kernel/nmi_watchdog = 0 in /etc/sysctl.d/99-local.conf to free up an extra perf counter, but the default is enabled.
(cli doesn't stop Non-Maskable Interrupts, as you might guess from the name.)
Other than that guess, I don't know why you'd still see occasional LOCal timer interrupts; maybe someone more familiar with modern x86 interrupts would know.

Using loopback for synchronous IPC when using NUMA architecture

(For a Linux platform) Is it feasible (from a performance point of view) to try to communicate (in a synchronous way) via loopback interface between processes on different NUMA nodes?
What about if the processes reside on the same NUMA node?
I know it's possible to memory-bind a process and/or set its CPU affinity to a node (using libnuma). I don't know if this is also true for the network interface.
Later edit: If the loopback interface is just a memory buffer used by the kernel, is there a way to be sure that buffer is on the same NUMA node, so that the two processes can communicate without the cross-node overhead?
Network interfaces don't reside on a node; they're a device - virtual or real - shared across the whole machine. The loopback interface is just a memory buffer somewhere or other, and some kernel code. The code that runs to support that device is likely bouncing round the CPU cores, just like any other thread in the system.
You talk of NUMA nodes, and tagged the question with Linux. Linux doesn't run on pure NUMA architectures, it runs on SMP architectures. Modern CPUs from, say, Intel, AMD, and ARM all synthesise an SMP hardware environment using separate cores, varying degrees of cache / memory interface unification, and high speed serial links between cores or CPUs. Effectively it's not possible for the operating system or software running on top to see the underlying NUMA architecture; it thinks it's running on a classical SMP architecture.
Intel / AMD / everyone else have done this because, back in the day, successful multiple CPU machines really were SMP; they had multiple CPUs all sharing the same memory bus, and had equal access to the RAM at the other end of the bus. Software got written to take advantage of that (Linux, Windows, etc).
Then the CPU manufacturers realised that SMP architectures suck so far as speed improvements are concerned. AMD blinked first, and ditched SMP in favour of Hypertransport, and were successful. Intel persisted with pure SMP for longer, but soon gave up too and started using QPI between CPUs.
But to give the old software (Linux, Windows, etc) backward compatibility, the CPU designers had to create a synthetic SMP hardware environment on top of Hypertransport and QPI. In principle they might have, at that point in time, decided that SMP was dead and delivered us pure NUMA architectures. But that would likely have been commercial suicide; it would have taken coordination of the entire hardware and software industries to agree to go that way, and by then it was already far too late to rewrite everything from scratch.
Things like network sockets (including via the loopback interface), pipes and serial ports are not synchronous. They're stream carriers, and the sender and receiver are not synchronised by the act of transferring data. That is, the sender can write() data and think that that has completed, but the data is in reality still stuck in some network buffer somewhere and hasn't yet made it into the read() that the destination process will have to call to receive the data.
What Linux will do with processes and threads is endeavour to run them all at once, up to the limit of the number of CPU cores in the machine. By and large that will result in your processes running simultaneously on separate cores. I think Linux will also use knowledge of which physical CPU's memory holds the bulk of a process's data, and will try to run the process on that CPU; memory latency will be a tiny bit better that way.
If your processes try to communicate via socket, pipe or similar, it results in data being copied out of one process's memory space into a memory buffer controlled by the kernel (that's what write() is doing under the hood), and then being copied out of that into the receiving process's memory space (that's what read() does). Where that intermediate kernel buffer actually is doesn't really matter because the transactions taking place at the microelectronic level (below the SMP level) are pretty much the same regardless. Memory allocations and processes can be bound to specific CPU cores, but you can't influence whereabouts the kernel puts its memory buffers through which the exchanged data must pass.
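To make the copy path concrete, here is a small sketch; it uses an AF_UNIX socketpair rather than a loopback TCP socket, but the write()-into-kernel-buffer, read()-out-of-kernel-buffer idea is the same:
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch: data written by one end is copied into a kernel buffer, then
   copied again into the reader's buffer. Where that intermediate buffer
   lives is up to the kernel. */
int main(void)
{
    int sv[2];
    char buf[64];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        perror("socketpair");
        return 1;
    }
    write(sv[0], "hello", 5);                  /* user -> kernel buffer */
    ssize_t n = read(sv[1], buf, sizeof buf);  /* kernel buffer -> user */
    printf("received %zd bytes\n", n);
    close(sv[0]);
    close(sv[1]);
    return 0;
}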
Regarding memory and process core affinity - it's really, really hard to do this to any measurable benefit. The OSes are so good nowadays at understanding the behaviour of CPUs that it's almost always best to simply let the OS run your processes on whichever cores it chooses. Companies like Intel make large code contributions to the Linux project, specifically to ensure that Linux does this as well as possible on the latest and greatest chips.
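If you still want to experiment with node binding, here is a minimal sketch using libnuma (node 0 is just an example; build with -lnuma). Note that it binds the process and its own allocations, not the kernel's socket buffers:
#include <numa.h>
#include <stdio.h>

/* Sketch: pin this process and its own allocations to NUMA node 0. */
int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    numa_run_on_node(0);      /* run only on CPUs of node 0 */
    numa_set_preferred(0);    /* prefer allocations from node 0 */

    void *buf = numa_alloc_onnode(1 << 20, 0);  /* 1 MiB on node 0 */
    /* ... use buf for the user-space side of the IPC ... */
    numa_free(buf, 1 << 20);
    return 0;
}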
==EDIT==
Additions in the light of engaging comments!
By "pure NUMA" I really mean systems where one CPU core cannot directly address memory physically attached to another CPU core. Such systems include Transputers, and even the Cell processor found in the Sony PS3. These aren't SMP, there's nothing in the silicon that unifies the separate memories into a single address space, so the question of cache coherency doesn't come into it.
With Transputer systems the only way to access memory attached to another transputer was to have the application software send the data over via a serial link; what made it CSP was that the sending application would not finish sending until the receiving application had read the last byte.
For the Cell processor, there were 8 maths cores each with 256kbyte of RAM. That was the only RAM the maths cores could address. To use them the application had to move data and code into that 256k of RAM, tell the core to run, and then move the results out (possibly back out to RAM, or onto another maths core).
There are some supercomputers today that aren't dissimilar to this. The K machine (Riken, Kobe in Japan) has an awful lot of cores, a very complex on-chip interconnect fabric, and OpenMPI is used by applications to move data around between nodes; nodes cannot directly address memory on other nodes.
The point is that on the PS3 it was up to application software to decide what data was in what memory and when, whereas modern x86 implementations from Intel and AMD make all data in all memories (no matter if they're shared via an L3 cache or are remote at the other end of a hypertransport or QPI link) accessible from any core (that's what SMP means after all).
The all-out performance of code written on the Cell processor was truly astounding for the Watts and transistor count. The trouble was that in a world where programmers are trained in writing for SMP environments, it takes a brain transplant to get to grips with one that isn't.
Newer languages like Rust and Go have reintroduced the concept of communicating sequential processes, which is all one had with Transputers back in the 1980s, early 1990s. CSP is almost ideal for multicore systems as the hardware does not need to implement an SMP environment. In principle this saves an awful lot of silicon.
CSP implemented on top of today's cache coherent SMP chips in languages like C generally involves a thread writing data into a buffer, and that being copied into a buffer belonging to another thread (Rust can do it a little differently because Rust knows about memory ownership, and so can transfer ownership instead of copying memory. I'm not familiar with Go - maybe it can do the same).
At the microelectronic level, copying data from one buffer to another is not really any different from what happens if the data is shared by two cores instead of copied (especially in AMD's hypertransport CPUs, where each has its own memory system). To share data, the remote core has to use hypertransport to request data from another core's memory, plus more traffic to maintain cache coherency. That's about the same amount of hypertransport traffic as if the data were copied from one core to the other, but then there's no subsequent cache coherency traffic.

Programmatically disable CPU core

The known way to disable logical CPUs in Linux is basically echo 0 > /sys/devices/system/cpu/cpu<number>/online. This way, you are only telling the OS to ignore that given (<number>) CPU.
My question goes further: is it possible not only to ignore it, but to turn it off physically, programmatically? I want that CPU to not receive any power, in order to make its energy consumption zero.
I know that it is possible to disable cores from the BIOS (not always), but I want to know whether it is possible to do it from within a program or not.
When you do echo 0 > /sys/devices/system/cpu/cpu<number>/online, what happens next depends on the particular CPU. On ARM embedded systems the kernel will typically disable the clock that drives the particular core PLL so effectively you get what you want.
On Intel x86 systems, you can only disable the interrupts and call the hlt instruction (which the Linux kernel does). This effectively puts the CPU into a power-saving state until it is woken up by another CPU at the user's request. If you have a laptop, you can verify that the power draw indeed goes down when you disable a core by reading the power from /sys/class/power_supply/BAT{0,1}/current_now (or uevent for all values such as voltage) or by using the "powertop" utility.
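As a small sketch of that check (the BAT0 path and the unit reported, typically microamps, vary from machine to machine):
#include <stdio.h>

/* Sketch: sample the battery's reported draw before and after offlining
   a core. Path and units differ between machines. */
int main(void)
{
    FILE *f = fopen("/sys/class/power_supply/BAT0/current_now", "r");
    long ua;
    if (!f) { perror("open current_now"); return 1; }
    if (fscanf(f, "%ld", &ua) == 1)
        printf("current_now: %ld uA\n", ua);
    fclose(f);
    return 0;
}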
For example, here's the call chain for disabling the CPU core in Linux Kernel for Intel CPUs.
https://github.com/torvalds/linux/blob/master/drivers/cpufreq/intel_pstate.c
arch/x86/kernel/smp.c: smp_ops.play_dead = native_play_dead,
arch/x86/kernel/smpboot.c : native_play_dead() -> play_dead_common() -> local_irq_disable()
Before disabling it, cpufreq also sets the CPU to its lowest power consumption level, though this does not seem to be strictly necessary.
intel_pstate_stop_cpu -> intel_cpufreq_stop_cpu -> intel_pstate_set_min_pstate -> intel_pstate_set_pstate -> wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
On Intel x86 there does not seem to be an official way to disable the actual clocks and voltage regulators. Even if there were, it would be specific to the motherboard, so your best bet might be looking into a BIOS such as coreboot.
Hmm, I realized I have no idea about Intel except looking into kernel sources.
In Windows 10 it became possible with the new power management settings CPMINCORES and CPMAXCORES.
Powercfg -setacvalueindex scheme_current sub_processor CPMAXCORES 50
Powercfg -setacvalueindex scheme_current sub_processor CPMINCORES 25
Powercfg -setactive scheme_current
Here 50% of cores are assigned for desired deep sleep, and 25% are forbidden to be parked. Very good in numeric simulations requiring increased clock rate (15% boost on Intel)
You cannot choose which cores to park, but the Windows 10 kernel checks Intel's Comet Lake and newer "preferred" (more power efficient) cores, and starts parking those not preferred.
It is not strict parking, so at high load the kernel can still use these cores, with very low load.
Just in case you are looking for alternatives:
You can get closest to this by using cpufreq governors. Make Linux exclude the CPU, and the power-saving governor will ensure that the core runs at its minimal frequency.
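A minimal sketch of forcing the governor for one core (assuming the cpufreq sysfs interface is present and you have root; cpu2 and powersave are just examples):
#include <stdio.h>

/* Sketch: switch CPU 2 to the powersave governor so an excluded/idle core
   stays at its minimum frequency. Available governors depend on the
   cpufreq driver in use. */
int main(void)
{
    FILE *f = fopen("/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor", "w");
    if (!f) { perror("open scaling_governor"); return 1; }
    fprintf(f, "powersave\n");
    fclose(f);
    return 0;
}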
You can also isolate cpus from the scheduler at kernel boot time.
Add isolcpus=0,1,2 to the kernel boot parameters.
https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html
I know this is an old question but one way to disable the CPU is via grub config.
If you add to end of GRUB_CMDLINE_LINUX in /etc/default/grub (assuming you are using a standard Linux dist, if you are using an appliance the location of the grub config may be different), e.g.:
GRUB_CMDLINE_LINUX="...current config here... maxcpus=2"
Then remake your grub config by running
grub2-mkconfig -o /boot/grub2/grub.cfg (or grub-mkconfig -o /boot/grub2/grub.cfg depending on your installation). Some distros may require nr_cpus instead of maxcpus.
Just some extra info:
If you are running a server with multiple physical CPUs, then disabling one CPU will most likely disable the memory set that is linked to that CPU, therefore it may have an effect on the performance of the server.
Disabling the CPU this way will not prevent your type 1 hypervisor from accessing the CPU (this is based on the Xen hypervisor; I believe it will apply to VMware as well - if anyone can provide confirmation, that would be great). Depending on the VirtualBox setup, it may restrict the amount of CPU you can allocate to VMs unless you are running para-virtualization.
I am unsure, however, if you will have any power savings. Most servers and even desktops these days already control power well, putting to sleep any device not needed for the current load. My concern would be that by reducing the number of CPUs (cores), you will just be moving the load to the remaining CPUs, and due to the need to schedule processor time, potentially having instructions queued, and the effect of having a smaller number of cores available for interrupts (eg: network traffic), it may have a negative effect on power consumption.
AFAIK there is no system call or library function available for this as of now, nor even an ioctl implementation. So apart from creating a new module / system call, there are two ways I can think of:
Using inline assembly, asm(<assembly code>);, where the assembly code is architecture-specific asm code to modify the CPU flag.
Using system() in C (man 3 system), assuming you just want to do it through C.

How to test linux NAPI feature?

I am trying to test the NAPI functionality in an embedded Linux environment. I used 'pktgen' to generate a large number of packets and tried to verify the interrupt count of my network interface at /proc/interrupts.
I found out that interrupt count is comparatively less than the packets generated.
Also I am trying to tune the 'netdev_budget' value from 1 to 1000 (default is 300) so that I can observe the reduction in interrupt count when netdev_budget is increased.
However, increasing the netdev_budget doesn't seem to help. The interrupt count is similar to that observed with netdev_budget set to 300.
So here are my queries:
What is the effect of 'netdev_budget' on NAPI?
What other parameters can/should I tune to observe changes in the interrupt count?
Is there any other way I can test the NAPI functionality on Linux? (Apart from directly looking at the network driver code.)
Any help is much appreciated.
Thanks in advance.
I wrote a comprehensive blog post about Linux network tuning which explains everything about monitoring, tuning, and optimizing the Linux network stack (including the NAPI weight). Take a look.
Keep in mind: some drivers do not disable IRQs from the NIC when NAPI starts. They are supposed to, but some simply do not. You can verify this by examining the hard IRQ handler in the driver to see if hard IRQs are being disabled.
Note that hard IRQs are re-enabled in some cases as mentioned in the blog post and below.
As far as your questions:
Increasing netdev_budget increases the number of packets that the NET_RX softirq can process. The number of packets that can be processed is also limited by a time limit, which is not tunable. This is to prevent the NET_RX softirq from eating 100% of CPU usage. If the device does not receive enough packets to process during its time allocation, hardirqs are re-enabled and NAPI is disabled.
You can also try modifying your IRQ coalescing settings for the NIC, if it is supported. See the blog post above for more information on how to do this and what this means, exactly.
You should add monitoring to your /proc/net/softnet_stat file. The fields in this file can help you figure out how many packets are being processed, whether you are running out of time, etc.
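As a hedged sketch of such monitoring (column meanings can shift between kernel versions, so double-check against your kernel's net/core sources):
#include <stdio.h>

/* Sketch: print the first three softnet_stat columns per CPU - packets
   processed, packets dropped, and time_squeeze (budget/time limit hit).
   The values in the file are hex-encoded. */
int main(void)
{
    FILE *f = fopen("/proc/net/softnet_stat", "r");
    char line[512];
    unsigned int processed, dropped, squeezed;
    int cpu = 0;
    if (!f) { perror("open softnet_stat"); return 1; }
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "%x %x %x", &processed, &dropped, &squeezed) == 3)
            printf("cpu%d: processed=%u dropped=%u time_squeeze=%u\n",
                   cpu, processed, dropped, squeezed);
        cpu++;
    }
    fclose(f);
    return 0;
}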
A question for you to consider, if I may:
Why does your hardirq rate matter? It probably doesn't matter, directly. The hardirq handler in your NIC driver should do as little work as possible, so its executing a lot is probably not a problem for your system. If it is, you should carefully measure that, as it seems very unlikely. Nevertheless, you can adjust IRQ coalescing settings and IRQ CPU affinity to alter the number of hardirqs generated by the NIC and which CPU processes them, respectively.
You should consider whether you are more interested in packet processing throughput or packet processing latency. Depending on which is the concern, you can tune your network stack appropriately.
Remember: to completely tune and optimize your Linux networking stack, you have to monitor and tune each component. They are all intertwined and it is difficult (and often insufficient) to monitor and tune just a single aspect of the stack.

Does pinning a process to a CPU core or an SMP node help reduce cache coherency traffic?

It is possible to pin a process to a specific set of CPU cores using sched_setaffinity() call. The manual page says:
Restricting a process to run on a single CPU also avoids the
performance cost caused by the cache invalidation that occurs when a process
ceases to execute on one CPU and then recommences execution on a different
CPU.
Which is almost an obvious thing (or not?). What is not that obvious to me is this -
Does pinning LWPs to a specific CPU or an SMP node reduce cache coherency bus traffic? For example, since a process is running pinned, other CPUs should not modify its private memory, thus only CPUs that are part of the same SMP node should need to stay cache-coherent.
There should be no CPU socket-to-socket coherency traffic for the pinned process case you describe. Modern Xeon platforms implement snoop filtering in the chipset. The snoop filter indicates when a remote socket cannot have the cache line in question, thus avoiding the need to send cache invalidate messages to that socket.
You can measure this for yourself. Xeon processors implement a large variety of cache statistic counters. You can read the counters in your own code with the rdpmc instruction or just use a product like VTune. FYI, using rdpmc is very precise, but a little tricky since you have to initially set a bit in CR4 to allow using this instruction in user mode.
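As a hedged sketch of the rdpmc read itself (the counter index is hardware- and setup-specific, and the instruction faults unless user-mode access has been enabled, e.g. via CR4.PCE; on Linux see /sys/devices/cpu/rdpmc):
#include <stdint.h>
#include <stdio.h>

/* Sketch: read a performance counter with rdpmc. Which counter index to
   pass, and what event it counts, depends on how the PMU was programmed. */
static inline uint64_t read_pmc(uint32_t counter)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t before = read_pmc(0);  /* counter 0: whatever event is programmed there */
    /* ... code under measurement ... */
    uint64_t after = read_pmc(0);
    printf("delta: %llu\n", (unsigned long long)(after - before));
    return 0;
}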
-- EDIT --
My answer above is outdated for the 55xx series of CPUs which use QPI links. These links interconnect CPU sockets directly without an intervening chipset, as in:
http://ark.intel.com/products/37111/Intel-Xeon-Processor-X5570-%288M-Cache-2_93-GHz-6_40-GTs-Intel-QPI%29
However, since the L3 cache in each CPU is inclusive, snoops over the QPI links only occur when the local L3 cache indicates the line is nowhere in the local socket. Likewise, the remote socket's L3 can quickly respond to a cross-snoop without bothering the cores, assuming the line isn't there either.
So, the inclusive L3 caches should minimize inter-socket coherency overhead; it's just not due to a chipset snoop filter in your case.
If you run on a NUMA system (like, Opteron server or Itanium), it makes sense, but you must be sure to bind a process to the same NUMA node that it allocates memory from. Otherwise, this is an anti-optimization. It should be noted that any NUMA-aware operating system will try to keep execution and memory in the same node anyway, if you don't tell it anything at all, to the best of its abilities (some elderly versions of Windows are rather poor at this, but I wouldn't expect that to be the case with recent Linux).
If you don't run on a NUMA system, binding a process to a particular core is about the most counterproductive thing you can do. The OS will not make processes bounce between CPUs for fun, and if a process must be moved to another CPU, that is not ideal, but the world does not end, either. It happens rarely, and when it does, you will hardly be able to tell.
On the other hand, if the process is bound to a CPU and another CPU is idle, the OS cannot use it... that is 100% available processing power gone down the drain.

Resources