X86 clear interrupt flag instruction `cli` not working in user space? - linux

I try to stop interrupts from user space for a specific isolated core,
so I set CPU affinity:
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(2, &set);
assert(sched_setaffinity(getpid(),sizeof(set),&set)==0);
and useiopl(3) to execute privileged instruction cli/sti in user space:
iopl(3);
__asm__("cli;");
// busy looping for a while
__asm__("sti;");
and there are two phenomenons I can't explain:
1 cli can't actually stop interrupts (at least not all interrupts), and interrupt, such as LOC (Local Timer Interrupt) comes out every now and then;
I notice lasted kernel patches prevent cli in user space (reference) , but this result can be reproduced in kernel 4.19.0.
2 AFAIK, cli only clear interrupt flag of CPU on which the program is running, but in practice, my whole system is stuck, not responding to my mouse or keyboard.

(2): Many parts of the Linux kernel depend on communicating with other cores, including RCU depending on for each core: run_on(core) and stuff like that. (https://lwn.net/Articles/262464/). Any kernel code doing that will get stuck when this core doesn't respond to the IPI that other cores send to ask the kernel on this core to switch to a certain task, or perhaps to do TLB shootdowns.
I don't know what exact thing would tend to lead to getting stuck, but I don't find it surprising at all that other parts of the kernel are waiting for something that depends on hearing back from the kernel on this core, and that blocks progress of something involved in getting keyboard/mouse events to an X server and to user-space. (Or even to a text console? That might have more hope, fewer layers of software.)
Or it's always possible that some keyboard or mouse interrupts get distributed to this core, and ignored.
As for (1): do you leave the NMI watchdog enabled, or other source of NMIs? That could get the kernel running temporarily in a state where (other?) interrupts are enabled.
I use kernel/nmi_watchdog = 0 in /etc/sysctl.d/99-local.conf to free up an extra perf counter, but the default is enabled.
(cli doesn't stop Non-Maskable Interrupts, as you might guess from the name.)
Other than that guess, I don't know why you'd still be occasional LOCal timer interrupts; maybe someone more familiar with modern x86 interrupts would know.

Related

SMP affinity routing doesn't work with GICv2 on ARM

There are 4 CPU cores and one Ethernet card on my Raspberry Pi.
I need interrupts from NIC to be routed to all the 4 CPU cores.
I set the /proc/irq/24/smp_affinity to 0xF (1111), but that doesn't help.
In sixth column of /proc/interrupts I don't see IO-APIC (which definitely supports* affinity routing) but GICv2 instead. Still can't find any useful info about GICv2 and smp_affinity.
Does GICv2 support SMP affinity routing?
*UPD:
from that post:
The only reason to look at this value is that SMP affinity will only
work for IO-APIC enabled device drivers.
TL;DR - The existence of /proc/irq/24/smp_affinity indicates that your Linux SMP system supports affinity. The text IO-APIC is the type of interrupt controller (typical PC) and it does NOT indicate that the system can handle affinities. On ARM systems a GIC is usually the interrupt controller, although some interrupts can be routed to a 'sub-controller'.
At least the mainline is supporting some affinities as per Kconfig. However, I am not sure what you are trying to do. The interrupt can only run on one CPU as only one CPU can take the data off the NIC. If a particular CPU is running network code and the rest are used for other purposes, the affinity makes sense.
The data on that core will probably not be in cache as the NIC buffers are probably DMA and not cacheable. So, I am not really sure what you would achieve or how you would expect the interrupts to run on all four CPUs? If you have four NIC interfaces, you can peg each to a CPU. This may be good for power consumption issues.
Specifically, for your case of four CPUs, the affinity mask of 0xf will disable any affinity and this is the default case. You can cat /proc/irq/24/smp_affinity to see the affinity is set. Also, the existence of this file would indicate that your Linux SMP system supports affinity. The text IO-APIC is the type of interrupt controller (typical PC) and it does NOT indicate that the system can handle affinities.
See also:
zero copy vs kernel by-pass
University of Waterloo doc
IRQ-affinity.txt
NOTE This part is speculative and is NOT how any cards I know of works.
The major part that you want is not generally possible. The NIC registers are a single resource. There are multiple registers and they have general sequences to reading and writing registers to perform an operation. If two CPUs were writing (or even reading) the register at the same time, then it will severely mix up the NIC. Often the CPU is not that involved in an interrupt and only some DMA engine needs to be told about a next buffer in an interrupt.
In order for what you want to be useful, you would need a NIC with several register 'banks' that can be used independently. For instance, just READ/WRITE packet banks is easy to comprehend. However, there may be several banks to write different packets and then the card would have to manage how to serialize them. Also, the card could do some packet inspection and interrupt different CPUs based on fixed packet values. Ie, a port and IP. This packet matching would generate different interrupt sources and different CPUs could handle different matches.
This would allow you to route different socket traffic to a particular CPU using a single NIC.
The problems are to make this card in hardware would be incredible complex compared to existing cards. It would be more expensive and it would take more power to operate.
If it is standard NIC hardware, there is no gain by rotating CPUs if the original CPU is not busy. If there is non-network activity, it is better to leave other CPUs alone so there cache can be use for a different workload (code/data). So in most case, it is best just to keep the interrupt on a fixed CPU unless it is busy and then it may ping-pong between a few CPUs. It would almost never be beneficial to run the interrupt on all CPUs.
I do not believe the the GICv2 supports IRQ balancing. Interrupts will always be handled by the same CPU. At least this was the case when I looked at this last for 5.1 kernels. The discussion at the time was that this would not be supported because it was not a good idea.
You will see interrupts will always be handled by CPU 0. Use something like ftrace or LTTng to observe what CPU is doing what.
I think via the affinity setting you could prevent the interrupt from running on a CPU, by setting that bit to zero. But this does not balance the IRQ over all CPUs on which it is allowed. It will still always go to the same CPU. But you could make this CPU 1 instead of 0.
So what you can do, is to put certain interrupts on different CPUs. This would allow something like SDIO and network to not vie for CPU time from the CPU 0 in their interrupt handlers. It's also possible to set the affinity of a userspace process such that it will not run on the same CPU which will handle interrupts and thereby reduce the time that the userspace process can be interrupted.
So why don't we do IRQ balancing? It ends up not being useful.
Keep in mind that the interrupt handler here is only the "hard" IRQ handler. This usually does not do very much work. It acknowledges the interrupt with the hardware and then triggers a back-end handler, like a work queue, IRQ thread, soft-irq, or tasklet. These don't run in IRQ context and can and will be scheduled to different CPU or CPUs based on the current workload.
So even if the network interrupt is always routed to the same CPU, the network stack is multi-threaded and runs on all CPUs. Its main work is not done in the hard IRQ handler that runs on one CPU. Again, use ftrace or LTTng to see this.
If the hard IRQ does very little, what is most important is to reduce latency, which is best done by running on the same CPU to improve cache effectiveness. Spreading it out is likely worse for latency and also for the total cost of handling the IRQs.
The hard IRQ handler can only run once instance at a time. So even if it was balanced, it could use just one CPU at any one time. If this was not the case, the handler would be virtually impossible to write without race conditions. If you want to use multiple CPUs at the same time, then don't do the work in a hard IRQ handler, do it in a construct like a workqueue. Which is how the network stack works. And the block device layer.
IRQs aren't balanced, because it's not usually the answer. The answer is to not do the work in IRQ context.

Trap instruction: why must the program counter and processor status register be changed atomically?

I came across the following problem on a previous exam from my operating systems class.
Consider an architecture in which the TRAP instruction has two effects: to load a predefined value of the Processor Status Register (PCR), which contains the user/kernel mode bit, saving the value of the Program Counter (PC) to a special Save PC register and loading apredefined value into the PC. Explain why loading a new value for the PCR without also changing the PC in the same instruction cycle would be unsafe.
I know that the PCR would be set to kernel mode with memory management off. Is it unsafe because the PC is still in the user program? If so where could it go wrong? If not why is it unsafe? Why would changing the PC first also be unsafe?
Aside: there is no reason to assume that "memory management" is turned "off" by loading the new processor status; in fact, in the CPUs in my experience that would not happen. But that is not relevant to this answer.
We're executing in user mode and a TRAP instruction is fetched. The program counter is then (let's say) pointing to the instruction after TRAP.
Now the processor executes the TRAP. It loads the new processor status, which switches the CPU to kernel mode. Assume this does not in itself inhibit device interrupts.
NOW... a device interrupts. The hardware or software mechanism saves the processor status (=kernel mode) and program counter (=the user-mode address of the instruction after TRAP). The device interrupt service routine does its thing and executes a return from interrupt to restore program counter and processor status. We can't resume "half-way through the TRAP instruction" - the only thing that can happen is that we start to execute the instruction that PC points to, i.e., we're executing the instruction after the TRAP but in kernel mode.
The exact problem depends on the system architecture:
If the kernel address map is a superset of the user address map (typical on OSes where user space is half the total address space) then we're executing user-provided code in kernel mode, which is at least a serious privilege problem, and may cause us to fail by page faulting when we can't handle it.
If the kernel address map doesn't include user space (frequently the case on systems with limited virtual address size) then this is equivalent to taking wild jump into the kernel.
The summary is that you need both the processor status and program counter to define "where you are in execution", and they both need to be saved/updated together; or in other words, no change of control (such as an interrupt) can be permitted in the middle.

Which part of Linux kernel enforces privilege separation and how?

I want to know how privilege separation is enforced by the kernel and the part of kernel that is responsible for this task.
For example, assume there are two processes running -- one at ring 0 and another at ring 3. How does the kernel keep track of the ring number of each process?
Edit: I know about ring numbers. My question is about the part of kernel (module or something) which performs checks on the processes to find out their privilege level. I believe there might be a component of kernel which would check the ring number of a process.
There is no concept of a ring number of a process.
The kernel is mapped in one area of memory, userspace is mapped in another. On boot the kernel specifies an address where the cpu has to jump to when the syscall instruction is executed. So someone does syscall, the cpu switches to ring0 and jumps to the address as instructed by the kernel. It is now executing kernel code. Then, on return, the cpu switches back to ring3 and resumes execution.
Similar story for other ways of entering the kernel like exceptions.
So, how does linux kernel enforce separation? It sets things up for usersapace to execute in ring3. Anything triggering the cpu to switch to ring0 also makes the jump to an address configured by the kernel on boot. no code other than kernel code executes in ring0

About Linux NMI watchdog

Now I encounter a problem about Linux NMI Watchdog.
I want to use Linux NMI watchdog to detect and recovery OS hang. So, I add "nmi_watchdog=1" into grub.cfg. And then check the /proc/interrupt, NMI were triggered per second. But after I load a module with deadlock (double-acquire spinlock), system were hanged totally, and nothing occurs (never panic!). It looks like that NMI watchdog did not work!
Then I read the Documentation/nmi_watchdog.txt, it says:
Be aware that when using local APIC, the frequency of NMI interrupts
it generates, depends on the system load. The local APIC NMI watchdog,
lacking a better source, uses the "cycles unhalted" event.
What's the "cycles unhalted" event?
It added:
but if your system locks up on anything but the "hlt" processor
instruction, the watchdog will trigger very soon as the "cycles
unhalted" event will happen every clock tick...If it locks up on
"hlt", then you are out of luck -- the event will not happen at all
and the watchdog won't trigger.
Seems like that watchdog won't trigger if processor executes "hlt" instruction, then I search "hlt" in "Intel 64 and IA-32 Architectures Software Developer's Manual, Volumn 2A", it describes it as follow:
Stops instruction execution and places the processor in a HALT state.
An enabled interrupt (including NMI and SMI), a debug exception, the
BINIT# signal, the INIT# signal, or the RESET# signal will resume
execution.
Then I am lost...
My question is:
How does Linux NMI watchdog work?
Who trigger the NMI?
My OS is Ubuntu 10.04 LTS, Linux-2.6.32.21, CPU Pentium 4 Dual-core 3.20 GHz.
I didn't read the whole source code about NMI watchdog(no time), if I couldn't understand how NMI watchdog work, I want use performance monitoring counter interrupt and inter-processor interrupt (be provided by APIC) to send NMI instead of NMI watchdog.
The answer depends on your hardware.
Non-maskable interrupts (NMI) can be triggered 2 ways: 1) when the kernel reaches a halting state that can't be interrupted by another method, and 2) by hardware -- using an NMI button.
On the front of some Dell servers, for example, you will see a small circle with a zig-zag line inside it. This is the NMI symbol. Nearby there is a hole. Insert a pin to trigger the interrupt. If your kernel is built to support it, this will dump a kernel panic trace to the console, then reboot the system.
This can happen very fast. So if you don't have a console attached to save the output to a file, it might look like only a reboot.
As I know, nmi_watchdog would only triggered for non-interruptible hangs. I found an code example by google: http://oslearn.blogspot.in/2011/04/use-nmi-watchdog.html
If your deadlock is not non-interruptiable, you can try enable sysRq to trigger some trace (Alt-printscreen-t) or crash (Alt-printscreen-c) to get more information.

Does pinning a process to a CPU core or an SMP node help reduce cache coherency traffic?

It is possible to pin a process to a specific set of CPU cores using sched_setaffinity() call. The manual page says:
Restricting a process to run on a single CPU also avoids the
performance cost caused by the cache invalidation that occurs when a process
ceases to execute on one CPU and then recommences execution on a different
CPU.
Which is almost an obvious thing (or not?). What is not that obvious to me is this -
Does pinning LWPs to a specific CPU or an SMP node reduces a cache coherency bus traffic? For example, since a process is running pinned, other CPUs should not modify its private memory, thus only CPUs that are part of the same SMP node should stay cache-coherent.
There should be no CPU socket-to-socket coherency traffic for the pinned process case you describe. Modern Xeon platforms implement snoop filtering in the chipset. The snoop filter indicates when a remote socket cannot have the cache line in question, thus avoiding the need to send cache invalidate messages to that socket.
You can measure this for yourself. Xeon processors implement a large variety of cache statistic counters. You can read the counters in your own code with the rdpmc instruction or just use a product like VTune. FYI, using rdpmc is very precise, but a little tricky since you have to initially set a bit in CR4 to allow using this instruction in user mode.
-- EDIT --
My answer above is outdated for the 55xx series of CPUs which use QPI links. These links interconnect CPU sockets directly without an intervening chipset, as in:
http://ark.intel.com/products/37111/Intel-Xeon-Processor-X5570-%288M-Cache-2_93-GHz-6_40-GTs-Intel-QPI%29
However, since the L3 cache in each CPU is inclusive, snoops over the QPI links only occur when the local L3 cache indicates the line is nowhere in the local socket. Likewise, the remote socket's L3 can quickly respond to a cross-snoop without bothering the cores, assuming the line isn't there either.
So, the inclusive L3 caches should minimize inter-socket coherency overhead, it's just not due to a chipset snoop filter in your case.
If you run on a NUMA system (like, Opteron server or Itanium), it makes sense, but you must be sure to bind a process to the same NUMA node that it allocates memory from. Otherwise, this is an anti-optimization. It should be noted that any NUMA-aware operating system will try to keep execution and memory in the same node anyway, if you don't tell it anything at all, to the best of its abilities (some elderly versions of Windows are rather poor at this, but I wouldn't expect that to be the case with recent Linux).
If you don't run on a NUMA system, binding a process to a particular core is the one biggest stupid thing you can do. The OS will not make processes bounce between CPUs for fun, and if a process must be moved to another CPU, then that is not ideal, but the world does not end, either. It happens rarely, and when it does, you will hardly be able to tell.
On the other hand, if the process is bound to a CPU and another CPU is idle, the OS cannot use it... that is 100% available processing power gone down the drain.

Resources