Core consistent tick counter for ARM - multithreading

I'm wondering if there are any counters on an ARM chip that can provide a tick-accurate count and that are synchronized across all cores. That is, if I have a process running on cpu1 and cpu2, and they both read the register, I would like to be able to compare the counters. This would have to work from EL0, so unfortunately SysTick is out. The PMU is per core, and I'm not aware of a method to sync its counters, so I unfortunately cannot use those either. I also need very low latency, so a few register reads would be ideal.

Answering my own question: it appears that the generic timer registers (cntpct, cntvct, etc.) are good for this purpose. These registers still have a per-core enable bit to allow userspace access, but they all link to the same counter, and thus are in sync between the cores.
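As a rough sketch of what reading that counter could look like from user space in C: this assumes AArch64 with GCC or Clang inline assembly, and a kernel that has enabled EL0 access to the virtual counter (which I believe Linux normally does). The helper names `read_counter` and `counter_frequency_hz` are made up for illustration, and the non-ARM branch is only a fallback so the snippet builds elsewhere.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdint.h>
#include <time.h>

/* Read the system-wide counter. On AArch64 this is one MRS of CNTVCT_EL0;
 * on other architectures fall back to CLOCK_MONOTONIC in nanoseconds. */
static inline uint64_t read_counter(void)
{
#if defined(__aarch64__)
    uint64_t v;
    /* The ISB keeps the counter read from being executed ahead of
     * earlier instructions. */
    __asm__ volatile("isb\n\tmrs %0, cntvct_el0" : "=r"(v) : : "memory");
    return v;
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Ticks per second of the value returned by read_counter(). */
static inline uint64_t counter_frequency_hz(void)
{
#if defined(__aarch64__)
    uint64_t f;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(f));
    return f;
#else
    return 1000000000ull; /* the fallback counts nanoseconds */
#endif
}
```

Threads on different cores can each call `read_counter()` and compare the values directly; dividing a difference by `counter_frequency_hz()` converts it to seconds.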

Related

Vtune: Accuracy of Intel sampling drivers when vtune measurement run on a machine running other tasks

I have the latest coffeelake machine which is primarily used as a storage server. The average workload on each core (4 cores) is around 5-10% when running a storage server alone.
I want to run vtune measurements of a workload on this machine using Intel Sampling drivers. However, I'm doubtful whether or not the measurements will be accurate given the storage server application is concurrently running.
But as the intel's documents suggest, the sampling drivers get installed on the Linux kernel, so is it really the case that the measurements will be inaccurate if run concurrently with other applications? In other words, how exactly do the intel sampling drivers work? Are they able to distinguish between the workload process and other processes running on the system?
If VTune works like the Linux perf_events subsystem that perf uses (and that PAPI builds on), it basically saves/restores the HW event counter registers on context switch, along with the regular register state. So events like instructions and uops_retired should be unaffected, and effects on other events will be due to actual impacts, like extra cache misses.
(The basic mechanism for HW performance events is that each logical core has its own programmable perf counters that increment every time some microarchitectural event happens. If one overflows, it raises an interrupt for the driver to collect the count. Or, for perf record-style functionality, perf or VTune would program them to count down so that they trigger an interrupt regularly, and sample the saved user-space RIP at that point. This produces some funky effects on a superscalar out-of-order CPU, like "blaming" the instruction waiting for data rather than the cache-miss load itself. But the key point is that the inside-the-core events are totally per-core. The uncore / L3 cache events count things about shared resources like L3 cache, so they are more easily disturbed by system load.)
Another point is that if you are running something on a CPU core, Linux isn't going to want to schedule other tasks there. So your background load will tend to avoid whichever core your test is running on, leaving it able to use 100% of a single core without a lot of context switches. (Although network / disk interrupts might still be handled on that core.)
So yes, you should be able to fairly accurately measure what's actually happening in your process while it runs on a system that's not totally idle. That might be a bit different from what would happen if it were run on a fully idle system, but probably not much different. Especially if it's single-threaded, or you can limit it to fewer than all of your cores, so there's at least one left for the OS to schedule other tasks onto.
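To illustrate the per-task counting described above, here is a minimal sketch that uses the Linux `perf_event_open` system call rather than VTune's own sampling driver (whose internals I have not checked). It counts user-space instructions retired by the calling thread only, wherever the scheduler runs it. The names `count_user_instructions` and `spin` are hypothetical, and the call may simply fail inside a VM or with a restrictive `perf_event_paranoid` setting.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Count user-space instructions retired by the calling thread while work()
 * runs. Returns the count, or -1 if the counter cannot be opened or read. */
static long long count_user_instructions(void (*work)(void))
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        /* start stopped, enable explicitly below */
    attr.exclude_kernel = 1;  /* ignore kernel-mode instructions */
    attr.exclude_hv = 1;      /* ignore hypervisor-mode instructions */

    /* pid = 0, cpu = -1: follow this thread on whichever CPU it runs. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    ssize_t n = read(fd, &count, sizeof count);
    close(fd);
    return n == (ssize_t)sizeof count ? (long long)count : -1;
}

/* Hypothetical workload, used only for illustration. */
static void spin(void)
{
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 1000000; i++)
        x += i;
}
```

Calling `count_user_instructions(spin)` on an idle and on a lightly loaded machine should give similar numbers, since other processes' instructions are not attributed to this thread.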

What options do I have for running recurring events on a microsecond resolution from a kernel driver?

I want to create a simulation of an actual device on an x86 Linux kernel. Part of this will involve simulating timings as closely as I can. Based on some research, it seems I will need at least microsecond-resolution timing. I understand that on a non-realtime system it won't be possible to get perfect timing, but I don't need perfect, just as close as I can get, perhaps by hacking around with thread scheduling / preemption options.
What I actually want to do is perform an action every interval, i.e. run some code every X µs. I've been trying to research the best ways to do this from a kernel driver, and also whether it's possible to do this reasonably accurately from user mode (keeping the above paragraph in mind). One of the first things that caught my eye was the HPET timer, which can be programmed to generate interrupts based on programmable comparators. Unfortunately, it seems that on many chipsets it has been rather buggy in the past, and there's not much information on using it for anything other than obtaining a timestamp or using it as the main clock source. The Linux kernel provides an HPET driver that in the past seemed to provide both kernel and user mode interfaces, but seems to provide only a barely documented user-mode interface in more recent kernel versions. I've also read about various other kernel functions and interfaces, such as the hrtimer interface and the various delay functions, though I'm having a bit of trouble understanding them and whether they are suited to my purpose.
Given my current use case, what are the best options I have for running recurring events at µs resolution from, say, a kernel driver? Obviously accuracy is probably my biggest criterion, but ease of use would be second.
Well, it's possible to achieve your accuracy in userspace: clock_nanosleep is one ideal option, and it has both relative and absolute modes. Since clock_nanosleep is based on hrtimer in kernel mode, you may want to use hrtimer if you'd like to implement it in kernel space.
However, to make the timer work accurately, there are two important things worth mentioning.
You should set the timer slack of your process (either by writing a nonzero value in nanoseconds to /proc/self/timerslack_ns or via prctl(PR_SET_TIMERSLACK,...)). This value is treated as the 'tolerance' of the timer.
CPU power management also matters here. The CPU has many different C-states, each of which has a different exit latency. So you need to configure your cpuidle module not to use C-states other than C0, e.g. for an Intel CPU you could simply write 1 to /sys/devices/system/cpu/cpu$c/cpuidle/state$s/disable to disable state $s of CPU $c. Or just add idle=poll to your kernel options to keep the CPU active (in C0) while the kernel is idle. NOTE that this significantly increases the power consumption of the computer and makes the cooling fans noisier.
You can get a timer with delays under 10 microseconds if the two things mentioned above are configured correctly. There is a trade-off between latency and power consumption that you have to make.
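A minimal userspace sketch of the approach above, assuming Linux and glibc: it lowers the timer slack with `prctl` and then sleeps to absolute deadlines with `clock_nanosleep`, so that a late wake-up does not shift all later periods. The names `run_periodic`, `timespec_add_ns` and `count_tick` are hypothetical, and the achievable jitter still depends on the C-state and scheduling settings discussed above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/prctl.h>
#include <time.h>

/* Advance an absolute deadline by ns nanoseconds, keeping tv_nsec < 1e9. */
static void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    while (t->tv_nsec >= 1000000000L) {
        t->tv_nsec -= 1000000000L;
        t->tv_sec += 1;
    }
}

/* Call tick(arg) once per period_ns, `iterations` times.
 * Returns 0 on success or an errno value on failure. */
static int run_periodic(long period_ns, int iterations,
                        void (*tick)(void *), void *arg)
{
    struct timespec next;

    /* Ask for 1 ns of timer slack (0 would restore the default).
     * Best effort: a failure here only costs accuracy. */
    (void)prctl(PR_SET_TIMERSLACK, 1UL, 0UL, 0UL, 0UL);

    if (clock_gettime(CLOCK_MONOTONIC, &next) != 0)
        return errno;

    for (int i = 0; i < iterations; i++) {
        int rc;
        timespec_add_ns(&next, period_ns);
        /* Sleep until the absolute deadline; retry if a signal interrupts. */
        do {
            rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        } while (rc == EINTR);
        if (rc != 0)
            return rc;
        tick(arg);
    }
    return 0;
}

/* Hypothetical callback for illustration: counts how often it was called. */
static void count_tick(void *arg)
{
    (*(int *)arg)++;
}
```

For example, `run_periodic(200000L, 5, count_tick, &ticks)` calls the callback five times, 200 µs apart. In a kernel driver the analogous building block would be an hrtimer, which is not shown here.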

SMP affinity routing doesn't work with GICv2 on ARM

There are 4 CPU cores and one Ethernet card on my Raspberry Pi.
I need interrupts from NIC to be routed to all the 4 CPU cores.
I set the /proc/irq/24/smp_affinity to 0xF (1111), but that doesn't help.
In sixth column of /proc/interrupts I don't see IO-APIC (which definitely supports* affinity routing) but GICv2 instead. Still can't find any useful info about GICv2 and smp_affinity.
Does GICv2 support SMP affinity routing?
*UPD:
from that post:
The only reason to look at this value is that SMP affinity will only
work for IO-APIC enabled device drivers.
TL;DR - The existence of /proc/irq/24/smp_affinity indicates that your Linux SMP system supports affinity. The text IO-APIC is the type of interrupt controller (typical PC) and it does NOT indicate that the system can handle affinities. On ARM systems a GIC is usually the interrupt controller, although some interrupts can be routed to a 'sub-controller'.
At least the mainline is supporting some affinities as per Kconfig. However, I am not sure what you are trying to do. The interrupt can only run on one CPU as only one CPU can take the data off the NIC. If a particular CPU is running network code and the rest are used for other purposes, the affinity makes sense.
The data on that core will probably not be in cache as the NIC buffers are probably DMA and not cacheable. So, I am not really sure what you would achieve or how you would expect the interrupts to run on all four CPUs? If you have four NIC interfaces, you can peg each to a CPU. This may be good for power consumption issues.
Specifically, for your case of four CPUs, the affinity mask of 0xf will disable any affinity, and this is the default case. You can cat /proc/irq/24/smp_affinity to see which affinity is set. Also, the existence of this file would indicate that your Linux SMP system supports affinity. The text IO-APIC is the type of interrupt controller (typical PC) and it does NOT indicate that the system can handle affinities.
See also:
zero copy vs kernel by-pass
University of Waterloo doc
IRQ-affinity.txt
NOTE This part is speculative and is NOT how any cards I know of work.
The major part of what you want is not generally possible. The NIC registers are a single resource. There are multiple registers, and an operation is performed through a fixed sequence of register reads and writes. If two CPUs were writing (or even reading) the registers at the same time, it would severely mix up the NIC. Often the CPU is not that involved in an interrupt, and only some DMA engine needs to be told about the next buffer.
In order for what you want to be useful, you would need a NIC with several register 'banks' that can be used independently. For instance, just READ/WRITE packet banks is easy to comprehend. However, there may be several banks to write different packets, and then the card would have to manage how to serialize them. Also, the card could do some packet inspection and interrupt different CPUs based on fixed packet values, i.e. a port and IP address. This packet matching would generate different interrupt sources, and different CPUs could handle different matches.
This would allow you to route different socket traffic to a particular CPU using a single NIC.
The problem is that making this card in hardware would be incredibly complex compared to existing cards. It would be more expensive and it would take more power to operate.
If it is standard NIC hardware, there is no gain from rotating CPUs if the original CPU is not busy. If there is non-network activity, it is better to leave the other CPUs alone so their cache can be used for a different workload (code/data). So in most cases, it is best just to keep the interrupt on a fixed CPU unless it is busy, and then it may ping-pong between a few CPUs. It would almost never be beneficial to run the interrupt on all CPUs.
I do not believe that GICv2 supports IRQ balancing. Interrupts will always be handled by the same CPU. At least this was the case when I last looked at this, for 5.1 kernels. The discussion at the time was that this would not be supported because it was not a good idea.
You will see interrupts will always be handled by CPU 0. Use something like ftrace or LTTng to observe what CPU is doing what.
I think via the affinity setting you could prevent the interrupt from running on a CPU, by setting that bit to zero. But this does not balance the IRQ over all CPUs on which it is allowed. It will still always go to the same CPU. But you could make this CPU 1 instead of 0.
So what you can do is put certain interrupts on different CPUs. This would allow something like SDIO and network to not vie for CPU time on CPU 0 in their interrupt handlers. It's also possible to set the affinity of a userspace process such that it will not run on the same CPU that handles interrupts, thereby reducing the time during which the userspace process can be interrupted.
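As a small sketch of handling the affinity masks involved: the helpers below (hypothetical names) parse the hexadecimal mask that /proc/irq/<n>/smp_affinity shows and format a mask that allows only one CPU, for example to move an interrupt from CPU 0 to CPU 1. This assumes at most 64 CPUs and no comma-separated groups in the mask. Actually writing the result to the file needs root and is not shown.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse a hex mask as printed by /proc/irq/<n>/smp_affinity, e.g. "f".
 * Assumption: at most 64 CPUs, no comma-separated 32-bit groups. */
static unsigned long long parse_affinity_mask(const char *text)
{
    return strtoull(text, NULL, 16);
}

/* Nonzero if the mask lets the interrupt be delivered to `cpu`. */
static int mask_allows_cpu(unsigned long long mask, int cpu)
{
    return cpu >= 0 && cpu < 64 && ((mask >> cpu) & 1ULL) != 0;
}

/* Write into buf the hex mask that allows only `cpu`.
 * Returns 0 on success, -1 on a bad CPU number or too small a buffer. */
static int format_single_cpu_mask(int cpu, char *buf, size_t len)
{
    if (cpu < 0 || cpu >= 64)
        return -1;
    int n = snprintf(buf, len, "%llx", 1ULL << cpu);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

With four CPUs, the mask `f` allows CPUs 0 to 3, and `format_single_cpu_mask(1, ...)` yields `2`, the value one would write to keep the interrupt off CPU 0.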
So why don't we do IRQ balancing? It ends up not being useful.
Keep in mind that the interrupt handler here is only the "hard" IRQ handler. This usually does not do very much work. It acknowledges the interrupt with the hardware and then triggers a back-end handler, like a work queue, IRQ thread, soft-irq, or tasklet. These don't run in IRQ context and can and will be scheduled onto a different CPU or CPUs based on the current workload.
So even if the network interrupt is always routed to the same CPU, the network stack is multi-threaded and runs on all CPUs. Its main work is not done in the hard IRQ handler that runs on one CPU. Again, use ftrace or LTTng to see this.
If the hard IRQ does very little, what is most important is to reduce latency, which is best done by running on the same CPU to improve cache effectiveness. Spreading it out is likely worse for latency and also for the total cost of handling the IRQs.
The hard IRQ handler can only run one instance at a time. So even if it were balanced, it could use just one CPU at any one time. If this were not the case, the handler would be virtually impossible to write without race conditions. If you want to use multiple CPUs at the same time, then don't do the work in a hard IRQ handler; do it in a construct like a workqueue. That is how the network stack works, and the block device layer too.
IRQs aren't balanced, because it's not usually the answer. The answer is to not do the work in IRQ context.

Programmatically disable CPU core

There is a known way to disable logical CPUs in Linux, basically with echo 0 > /sys/devices/system/cpu/cpu<number>/online. This way, you are only telling the OS to ignore the given (<number>) CPU.
My question goes further: is it possible not only to ignore it but to physically turn it off, programmatically? I want that CPU to not receive any power, in order to make its energy consumption zero.
I know that it is possible to disable cores from the BIOS (not always), but I want to know whether it is possible to do it from within a program.
When you do echo 0 > /sys/devices/system/cpu/cpu<number>/online, what happens next depends on the particular CPU. On ARM embedded systems the kernel will typically disable the clock that drives the particular core PLL so effectively you get what you want.
On Intel x86 systems, you can only disable the interrupts and call the hlt instruction (which the Linux kernel does). This effectively puts the CPU into a power-saving state until it is woken up by another CPU at user request. If you have a laptop, you can verify that power draw indeed goes down when you disable the core by reading the power from /sys/class/power_supply/BAT{0,1}/current_now (or uevent for all values such as voltage) or using the "powertop" utility.
For example, here's the call chain for disabling the CPU core in Linux Kernel for Intel CPUs.
https://github.com/torvalds/linux/blob/master/drivers/cpufreq/intel_pstate.c
arch/x86/kernel/smp.c: smp_ops.play_dead = native_play_dead,
arch/x86/kernel/smpboot.c : native_play_dead() -> play_dead_common() -> local_irq_disable()
Before that, CPUFREQ also sets the CPU to the lowest power consumption level before disabling it though this does not seem to be strictly necessary.
intel_pstate_stop_cpu -> intel_cpufreq_stop_cpu -> intel_pstate_set_min_pstate -> intel_pstate_set_pstate -> wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
On Intel X86 there does not seem to be an official way to disable the actual clocks and voltage regulators. Even if there was, it would be specific to the motherboard and thus your closest bet might be looking into BIOS such as coreboot.
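To check from a program which CPUs the kernel currently has online after toggling the sysfs `online` files, one option is to parse /sys/devices/system/cpu/online, which holds a list such as `0-3,5`. The following is a sketch with hypothetical helper names; it only reads state and does not power anything down.

```c
#include <stdio.h>
#include <stdlib.h>

/* Count the CPUs in a sysfs CPU list such as "0-3,5".
 * Returns -1 on malformed input. */
static int count_cpus_in_list(const char *list)
{
    int total = 0;
    const char *p = list;

    while (*p != '\0' && *p != '\n') {
        char *end;
        long lo = strtol(p, &end, 10);
        if (end == p || lo < 0)
            return -1;
        long hi = lo;
        if (*end == '-') {              /* a range "lo-hi" */
            p = end + 1;
            hi = strtol(p, &end, 10);
            if (end == p || hi < lo)
                return -1;
        }
        total += (int)(hi - lo + 1);
        p = end;
        if (*p == ',')
            p++;                        /* next entry */
        else if (*p != '\0' && *p != '\n')
            return -1;                  /* unexpected character */
    }
    return total;
}

/* Read the online list from sysfs and count it; -1 if unavailable. */
static int count_online_cpus(void)
{
    char buf[256];
    FILE *f = fopen("/sys/devices/system/cpu/online", "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(buf, sizeof buf, f);
    fclose(f);
    return ok != NULL ? count_cpus_in_list(buf) : -1;
}
```

After `echo 0 > /sys/devices/system/cpu/cpu<number>/online`, the value returned by `count_online_cpus()` should drop by one.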
Hmm, I realized I have no idea about Intel except looking into kernel sources.
In Windows 10 it became possible with the new power management settings CPMINCORES and CPMAXCORES.
Powercfg -setacvalueindex scheme_current sub_processor CPMAXCORES 50
Powercfg -setacvalueindex scheme_current sub_processor CPMINCORES 25
Powercfg -setactive scheme_current
Here 50% of cores are assigned for desired deep sleep, and 25% are forbidden to be parked. Very good in numeric simulations requiring increased clock rate (15% boost on Intel)
You cannot choose which cores to park, but the Windows 10 kernel checks for the "preferred" (more power-efficient) cores on Intel's Comet Lake and newer, and starts by parking the cores that are not preferred.
It is not a strict parking, so at high load the kernel can use these cores with very low load.
Just in case you are looking for alternatives:
You can get closest to this by using cpufreq governors. Make Linux exclude the CPU, and the power-saving mode will ensure that the core runs at minimal frequency.
You can also isolate cpus from the scheduler at kernel boot time.
Add isolcpus=0,1,2 to the kernel boot parameters.
https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html
I know this is an old question but one way to disable the CPU is via grub config.
If you add to end of GRUB_CMDLINE_LINUX in /etc/default/grub (assuming you are using a standard Linux dist, if you are using an appliance the location of the grub config may be different), e.g.:
GRUB_CMDLINE_LINUX=".......Current config here maxcpus=2"
Then remake your grub config by running
grub2-mkconfig -o /boot/grub2/grub.cfg (or grub-mkconfig -o /boot/grub2/grub.cfg depending on your installation). Some distros may require nr_cpus instead of maxcpus.
Just some extra info:
If you are running a server with multiple physical CPUs, then disabling one CPU will most likely disable the memory that is attached to that CPU, and therefore it may affect the performance of the server.
Disabling the CPU this way will not stop your type 1 hypervisor from accessing the CPU (this is based on the Xen hypervisor; I believe it applies to VMware as well, and confirmation from anyone would be great). Depending on the VirtualBox setup, it may restrict the number of CPUs you can allocate to VMs unless you are running para-virtualization.
I am unsure, however, whether you will see any power savings. Most servers and even desktops these days already control power well, putting to sleep any device not needed for the current load. My concern is that by reducing the number of CPUs (cores) you will just move the load to the remaining CPUs. Because of the need to schedule processor time, the possibility of instructions being queued, and the smaller number of cores available for interrupts (e.g. network traffic), it may even have a negative effect on power consumption.
AFAIK there is no system call or library function available as of now, nor even an ioctl implementation. So apart from creating a new module or system call, there are two ways I can think of:
Using inline assembly, asm(<assembly code>);, where the assembly code is architecture-specific code that modifies the CPU flag.
Using the system() call in C (man 3 system), assuming you just want to do it from C.

How can I get the CPU core number from within a user-space app (Linux, C)?

Presumably there is a library or simple asm blob that can get me the number of the current CPU that I am executing on.
Use sched_getcpu to determine the CPU on which the calling thread is running. See man getcpu (the system call) and man sched_getcpu (a library wrapper). However, note what it says:
The information placed in cpu is only guaranteed to be current at the time of the call: unless the CPU affinity has been fixed using sched_setaffinity(2), the kernel might change the CPU at any time. (Normally this does not happen because the scheduler tries to minimize movements between CPUs to keep caches hot, but it is possible.) The caller must be prepared to handle the situation when cpu and node are no longer the current CPU and node.
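A minimal sketch of calling it, assuming glibc on Linux (the helper names are hypothetical). Since the answer can go stale at any moment, the helper below also re-reads the CPU after running some work and reports whether the two readings agree. That check cannot detect a migration away and back again.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>

/* Run fn() and report whether the thread was on the same CPU before and
 * after. Stores the CPU seen before the call in *cpu_out (-1 on error).
 * Returns 1 if both readings match, 0 otherwise. */
static int ran_on_single_cpu(void (*fn)(void), int *cpu_out)
{
    int before = sched_getcpu();
    fn();
    int after = sched_getcpu();

    if (cpu_out != NULL)
        *cpu_out = before;
    return before >= 0 && before == after;
}

/* Hypothetical placeholder workload. */
static void do_nothing(void)
{
}
```

If the work must really stay on one CPU, fixing the affinity with sched_setaffinity first is the more reliable route.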
You need to do something like:
Call sched_getaffinity and identify the CPU bits
Iterate over the CPUs, doing sched_setaffinity to each one
(I'm not sure if after sched_setaffinity you're guaranteed to be on the CPU, or need to yield explicitly?)
Execute CPUID (asm instruction)... there is a way of getting a unique per-core ID out of one of its outputs (see Intel docs). I vaguely recall it's the "APIC ID".
Build a table (a std::map ?) from APIC IDs to a CPU number or affinity mask or something.
If you did this on your main thread, don't forget to set sched_setaffinity back to all CPUS!
Now you can CPUID again whenever you need to and lookup which core you're on.
But I'd query why you need to do this; normally you want to take control via sched_setaffinity rather than finding out which core you're on (and even that's a pretty rare thing to want/need). (That's why I don't know the crucial detail of what to pull out of CPUID exactly, sorry!)
Update: Just learned about sched_getcpu from litb's response here. Much better! (my Debian/etch libc is too old to have it though).
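For completeness, the CPUID-based steps above could be sketched roughly as follows on x86 with GCC or Clang. The assumptions: the initial APIC ID is taken from CPUID leaf 1, EBX bits 31..24 (only 8 bits, so large systems would need the x2APIC leaf instead), `MAX_CPUS` is an arbitrary illustrative limit, and all function names are hypothetical. On other architectures the helper just returns -1.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

#define MAX_CPUS 256 /* illustrative limit, not a kernel constant */

/* Initial APIC ID of the core executing this code, or -1 if unavailable. */
static int current_apic_id(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return -1;
    return (int)(ebx >> 24); /* CPUID.01H:EBX[31:24] */
#else
    return -1;
#endif
}

/* Fill apic_of_cpu[cpu] for every CPU this thread may run on.
 * Returns the number of CPUs visited, or -1 on error. */
static int build_apic_table(int apic_of_cpu[MAX_CPUS])
{
    cpu_set_t original, one;
    int visited = 0;

    if (sched_getaffinity(0, sizeof original, &original) != 0)
        return -1;

    for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
        apic_of_cpu[cpu] = -1;
        if (!CPU_ISSET(cpu, &original))
            continue;
        CPU_ZERO(&one);
        CPU_SET(cpu, &one);
        if (sched_setaffinity(0, sizeof one, &one) != 0)
            continue;
        apic_of_cpu[cpu] = current_apic_id(); /* now pinned to `cpu` */
        visited++;
    }

    /* Put the original affinity back, as the steps above remind us to. */
    sched_setaffinity(0, sizeof original, &original);
    return visited;
}

/* Look up which CPU number has the given APIC ID, or -1 if unknown. */
static int cpu_from_apic_id(const int apic_of_cpu[MAX_CPUS], int apic_id)
{
    if (apic_id < 0)
        return -1;
    for (int cpu = 0; cpu < MAX_CPUS; cpu++)
        if (apic_of_cpu[cpu] == apic_id)
            return cpu;
    return -1;
}
```

After building the table once, `cpu_from_apic_id(table, current_apic_id())` gives the current core, with the same staleness caveat as sched_getcpu.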
I don't know of anything to get your current core id. With kernel level task/process migration, you wouldn't be guaranteed that it would remain constant for any length of time, unless you were running in some form of real-time mode.
If you want to be on a specific core, you can use the sched_setaffinity() function or the taskset command to launch your program. I believe that these need elevated permissions to work, though. In your program, you could then run sched_getaffinity() to see the mask that was set earlier and use that as a best guess at the core on which you are executing.
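A minimal sketch of pinning from inside the program, assuming glibc on Linux (the helper names are hypothetical). As far as I know, a thread can normally narrow its own affinity without extra privileges; it is changing other users' processes that needs them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>

/* Restrict the calling thread to one CPU. Returns 0 on success, -1 on error. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);
}

/* Lowest-numbered CPU the calling thread is allowed to run on, or -1. */
static int first_allowed_cpu(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof set, &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

Once `pin_to_cpu(n)` has succeeded, the affinity mask contains a single CPU, so reading it back tells you exactly where the thread runs.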
sysconf(_SC_NPROCESSORS_ONLN);
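Note that this call returns how many CPUs are online, not which one the caller is running on. A minimal wrapper, assuming glibc (the function name is made up):

```c
#include <unistd.h>

/* Number of CPUs currently online, or -1 if the value is unavailable. */
static long online_cpu_count(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

The result is mainly useful as an upper bound when sizing per-CPU tables or iterating over candidate CPUs.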
