Evaluating SMI (System Management Interrupt) latency on a Linux-CentOS/Intel machine

I am interested in evaluating the behavior (latency, frequency) of SMI handling on a Linux machine running CentOS that is used for a (very) soft real-time application.
What tools are recommended (hwlatdetect for CentOS?), and what is the best course of action to go about this?
If no good tools are available for CentOS, am I correct to assume that installing a different OS on the same machine should yield the same results, since the underlying hardware/BIOS are the same?
Is there any source for ballpark figures on these parameters?
The machines are x86_64, running CentOS 6.4 (kernel 2.6.32-358.23.2.el6.centos.plus.x86_64).

SMIs can certainly happen during normal operation. My home desktop has a chipset-driven SMI that fires every second and a half. I've also seen servers that take one twice a second due to a BIOS-driven CPU frequency scaling scheme. However, some systems can go long periods of time without an SMI occurring, so it really depends.
Question #1: hwlatdetect is one option for detecting the latency of SMIs occurring on your system. BIOSBITS is another option; it is a bootable CD that can identify whether SMIs are occurring. You can also write your own test by creating a kernel module that spins in a loop and takes timestamps (using RDTSC). If you see a long gap between two timestamp readings, you can consult CPU MSR 0x34 to see whether the SMI counter incremented, which would indicate that an SMI happened.
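If you just want to watch that SMI counter from user space (rather than writing the spinning kernel module), here is a minimal sketch in C that reads MSR 0x34 (MSR_SMI_COUNT on recent Intel CPUs) through the msr driver; it assumes you have run modprobe msr and are root, and the 10-second window is arbitrary:
/* Hedged sketch: count SMIs on CPU 0 over a 10 s window via /dev/cpu/0/msr. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#define MSR_SMI_COUNT 0x34
static uint64_t read_smi_count(int fd)
{
    uint64_t val;
    if (pread(fd, &val, sizeof(val), MSR_SMI_COUNT) != sizeof(val)) {
        perror("pread MSR_SMI_COUNT");   /* older CPUs may not implement this MSR */
        exit(1);
    }
    return val;
}
int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);   /* requires 'modprobe msr' and root */
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }
    uint64_t before = read_smi_count(fd);
    sleep(10);                                   /* observation window */
    uint64_t after = read_smi_count(fd);
    printf("SMIs on CPU0 in 10 s: %llu\n", (unsigned long long)(after - before));
    close(fd);
    return 0;
}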
If you want to generate an SMI, you can write a kernel module that executes an OUT instruction to port 0xB2, e.g. writing a value of 0 to that port. (You can also time this SMI by taking a timestamp just before and just after the write to port 0xB2.)
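A minimal kernel-module sketch of that trigger-and-time idea (assuming the firmware implements a software SMI on the APM command port 0xB2; the value written and the resulting behavior are platform-specific, so treat this as an experiment, not a portable tool):
/* Hedged sketch: trigger a software SMI via port 0xB2 and time it with the TSC. */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <asm/io.h>            /* outb() */
#include <asm/timex.h>         /* get_cycles() */
static int __init smi_trigger_init(void)
{
    unsigned long flags;
    cycles_t before, after;
    local_irq_save(flags);     /* keep ordinary interrupts out of the measurement */
    before = get_cycles();
    outb(0x00, 0xb2);          /* APM command port; value/behaviour is BIOS-specific */
    after = get_cycles();
    local_irq_restore(flags);
    pr_info("smi_trigger: write to port 0xB2 took %llu TSC cycles\n",
            (unsigned long long)(after - before));
    return 0;
}
static void __exit smi_trigger_exit(void) { }
module_init(smi_trigger_init);
module_exit(smi_trigger_exit);
MODULE_LICENSE("GPL");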
Question #2: SMIs operate at a layer below the OS, so which OS you choose shouldn't have any impact.
Question #3: BIOSBITS recommends that SMI latencies be kept under 150 microseconds.

An SMI puts the system into SMM (System Management Mode), which suspends normal kernel execution for as long as the SMI handler runs. SMM is neither the real mode nor the protected mode the kernel normally operates in; instead, the CPU executes special code kept in SMRAM (provided by the BIOS/firmware). To measure its latency you can try to trigger an SMI (it can be software generated) and catch the total time spent in SMM. To accomplish this you will probably want to write a Linux kernel module, because issuing an SMI requires special privileges.
For real-time systems it is best to avoid this sort of interrupt (SMIs) if you can.

You can check whether System Management Interrupts (SMIs) are being serviced with turbostat. For example:
# turbostat sleep 120
[check the SMI column for values greater than 0]
Of course, from that you can also compute an SMI frequency.
Knowing that SMIs are actually happening at a certain rate is important information. But you also want to know how much time is spent in System Management Mode (SMM) during those interrupts. For example, if an SMI interruption is only very short, then it might be irrelevant for your real-time application. On the other hand, if you have hardware with long SMI interruptions you probably want to talk to the vendor, configure the firmware differently (if possible) and/or switch to other hardware with a less intrusive SMM.
The perf tool has a mode that measures how many cycles are spent in SMM during SMIs (using the information provided by certain CPU counters). Example:
# perf stat -a -A --smi-cost -- sleep 120
Performance counter stats for 'system wide':
SMI cycles% SMI#
CPU0 0.0% 0
CPU1 0.0% 0
CPU2 0.0% 0
CPU3 0.0% 0
120.002927948 seconds time elapsed
You can also look at the raw values with:
# perf stat -a -A --smi-cost --metric-only -- sleep 120
From that you can compute how much time an SMI takes on average on your machine (divide the cycle difference by the number of cycles per time unit).
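As a sketch of that arithmetic (all numbers below are placeholders; substitute the cycle delta and SMI count that perf reports for your interval, and your own TSC frequency):
/* Hedged sketch: average SMI duration from perf --smi-cost raw values. */
#include <stdio.h>
int main(void)
{
    double cycles_in_smm = 3.0e7;  /* placeholder: cycles attributed to SMM over the interval */
    double smi_count     = 504.0;  /* placeholder: SMI# over the same interval */
    double tsc_ghz       = 2.5;    /* placeholder: TSC frequency in GHz (cycles per nanosecond) */
    double avg_us = cycles_in_smm / smi_count / (tsc_ghz * 1000.0);  /* cycles per SMI -> microseconds */
    printf("average SMI duration: %.2f microseconds\n", avg_us);
    return 0;
}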
It certainly makes sense to cross check the CPU counter based results with empiric ones.
You can use the Linux Hardware Latency Detector that is integrated in the Linux Kernel. Usage example:
# echo hwlat > /sys/kernel/debug/tracing/current_tracer
# echo 1 > /sys/kernel/debug/tracing/tracing_thresh
# watch -d -n 5 cat /sys/kernel/debug/tracing/tracing_max_latency
# echo "Don't forget to disable it again"
# echo nop > /sys/kernel/debug/tracing/current_tracer
Those tools are available on CentOS/RHEL 7 and should be available on other distributions, as well.
Regarding ballpark figures: recently I came across an HP 2011-ish ProLiant Gen8 Xeon server that fires 504 SMIs per minute. Perf computes a rate of 0.1% in SMM, and based on the counter values the average time spent in an SMI is as high as several microseconds, but the Linux hwlat detector doesn't detect such long interruptions on that system.
That SMI rate matches what HP documents in its Configuring and tuning HPE ProLiant Servers for low-latency applications guide (October 2017):
Disabling System Management Interrupts to the processor provides one of
the greatest benefits to low-latency environments.
Disabling the Processor Power and Utilization Monitoring SMI has the greatest
effect because it generates a processor interrupt eight times a second in G6
and later servers.
(emphasis mine; and that guide also documents other SMI sources)
On a Supermicro board with Intel Atom C3758 and an Intel NUC (i5-4250U) system of mine there are exactly zero SMIs counted.
On an Intel i7-6600U based Dell laptop, the system reports 8 SMIs per minute, but the aperf counter is lower than the (unhalted) cycles counter, which isn't supposed to happen.

Actually, SMI is used for more than just keyboard emulation. Servers use SMIs to report and correct ECC memory errors, ACPI uses SMIs to communicate with the BIOS and perform some tasks, even enabling and disabling ACPI is done through an SMI, and the BIOS often intercepts power state changes through SMIs... there's more; these are just a few examples.

According to the Wikipedia page on System Management Mode, SMI is not used during normal operation, except perhaps to emulate a PS/2 keyboard with a physical USB keyboard.
And most Linux systems are able to drive a genuine USB keyboard without that emulation. You could configure your BIOS to disable it.

Related

VTune: accuracy of Intel sampling drivers when a VTune measurement is run on a machine running other tasks

I have the latest Coffee Lake machine, which is primarily used as a storage server. The average workload on each core (4 cores) is around 5-10% when running the storage server alone.
I want to run VTune measurements of a workload on this machine using the Intel sampling drivers. However, I'm doubtful whether or not the measurements will be accurate, given that the storage server application is running concurrently.
But as Intel's documents suggest, the sampling drivers get installed in the Linux kernel, so is it really the case that the measurements will be inaccurate if run concurrently with other applications? In other words, how exactly do the Intel sampling drivers work? Are they able to distinguish between the workload process and the other processes running on the system?
If VTune is like the Linux perf_events subsystem that perf uses, it basically saves/restores HW event counter registers on context switch, along with the regular register state. So events like instructions and uops_retired should be unaffected, and effects on other events will be due to actual impacts, like extra cache misses.
(The basic mechanism for HW performance events is that each logical core has its own programmable perf counters that increment every time some microarchitectural event happens. If one overflows, it raises an interrupt for the driver to collect the count. Or, for perf record-type functionality, perf or VTune would program them to count down so that they trigger an interrupt regularly, and sample the saved user-space RIP at that point. This produces some funky effects on a superscalar out-of-order CPU, like "blaming" the instruction waiting for data rather than the cache-miss load itself, for example. But the key point is that the inside-the-core events are strictly per-core. The uncore / L3 cache events count activity in shared resources like the L3 cache, so they are more easily disturbed by system load.)
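As a small illustration of the per-task counting that the kernel's perf_events interface does (this is not VTune's driver, just the mechanism it resembles), here is a sketch that counts retired user-space instructions for the calling process only, so other processes running on the machine do not inflate the count:
/* Hedged sketch: count this process's retired instructions with perf_event_open(2). */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);  /* no glibc wrapper */
}
int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_kernel = 1;   /* user-space instructions only */
    attr.exclude_hv = 1;
    int fd = perf_event_open(&attr, 0, -1, -1, 0);  /* pid=0, cpu=-1: follow this task anywhere */
    if (fd < 0) { perror("perf_event_open"); return 1; }
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    volatile uint64_t x = 0;
    for (int i = 0; i < 1000000; i++)               /* the "workload" being measured */
        x += i;
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) { perror("read"); return 1; }
    printf("instructions retired by this task: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
Because the kernel saves and restores the counter along with the task, the reported count barely changes whether the rest of the system is idle or busy.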
Another point is that if you are running something on a CPU core, Linux isn't going to want to schedule other tasks there. So your background load will tend to avoid whichever core your test is running on, leaving it able to use 100% of a single core without a lot of context switches. (Although network / disk interrupts might still be handled on that core.)
So yes, you should be able to fairly accurately measure what's actually happening in your process while it runs on a system that's not totally idle. That might be a bit different from what would happen if it were run on a fully idle system, but probably not much different. Especially if it's single-threaded, or you can limit it to fewer than all of your cores, so there's at least one left for the OS to schedule other tasks onto.

Programmatically disable CPU core

The known way to disable logical CPUs in Linux is basically echo 0 > /sys/devices/system/cpu/cpu<number>/online. This way, you are only telling the OS to ignore that given (<number>) CPU.
My question goes further: is it possible not only to ignore it but to physically turn it off programmatically? I want that CPU to not receive any power, in order to make its energy consumption zero.
I know that it is possible to disable cores from the BIOS (not always), but I want to know whether it is possible to do it from within a program or not.
When you do echo 0 > /sys/devices/system/cpu/cpu<number>/online, what happens next depends on the particular CPU. On ARM embedded systems the kernel will typically disable the clock that drives the particular core's PLL, so effectively you get what you want.
On Intel x86 systems, you can only disable interrupts and call the hlt instruction (which the Linux kernel does). This effectively puts the CPU into a power-saving state until it is woken up by another CPU at the user's request. If you have a laptop, you can verify that the power draw indeed goes down when you disable the core by reading the power from /sys/class/power_supply/BAT{0,1}/current_now (or uevent for all values such as voltage) or by using the powertop utility.
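A hedged sketch of that experiment in C (the sysfs paths are the standard ones mentioned above, but BAT0 and cpu1 are assumptions about your particular machine; run it as root):
/* Hedged sketch: offline cpu1 via sysfs and compare battery draw before/after. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
static long read_long(const char *path)
{
    long v = -1;
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); exit(1); }
    if (fscanf(f, "%ld", &v) != 1) { fprintf(stderr, "parse error: %s\n", path); exit(1); }
    fclose(f);
    return v;
}
static void write_str(const char *path, const char *s)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    fputs(s, f);
    fclose(f);
}
int main(void)
{
    const char *online = "/sys/devices/system/cpu/cpu1/online";       /* assumption: cpu1 */
    const char *draw   = "/sys/class/power_supply/BAT0/current_now";  /* assumption: BAT0, microamps */
    printf("draw with cpu1 online : %ld uA\n", read_long(draw));
    write_str(online, "0");    /* hot-unplug the core; on x86 it ends up in hlt/mwait */
    sleep(5);                  /* give the reading time to settle */
    printf("draw with cpu1 offline: %ld uA\n", read_long(draw));
    write_str(online, "1");    /* bring the core back */
    return 0;
}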
For example, here's the call chain for disabling the CPU core in Linux Kernel for Intel CPUs.
https://github.com/torvalds/linux/blob/master/drivers/cpufreq/intel_pstate.c
arch/x86/kernel/smp.c: smp_ops.play_dead = native_play_dead,
arch/x86/kernel/smpboot.c : native_play_dead() -> play_dead_common() -> local_irq_disable()
Before that, CPUFREQ also sets the CPU to the lowest power consumption level before disabling it though this does not seem to be strictly necessary.
intel_pstate_stop_cpu -> intel_cpufreq_stop_cpu -> intel_pstate_set_min_pstate -> intel_pstate_set_pstate -> wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
On Intel x86 there does not seem to be an official way to disable the actual clocks and voltage regulators. Even if there were, it would be specific to the motherboard, and thus your closest bet might be looking into firmware such as coreboot.
Beyond what can be seen in the kernel sources, I don't know of any further Intel-specific mechanism.
In Windows 10 this became possible with the new power management settings CPMINCORES and CPMAXCORES:
Powercfg -setacvalueindex scheme_current sub_processor CPMAXCORES 50
Powercfg -setacvalueindex scheme_current sub_processor CPMINCORES 25
Powercfg -setactive scheme_current
Here 50% of the cores are assigned for the desired deep sleep, and 25% are forbidden from being parked. This is very good in numeric simulations requiring an increased clock rate (a 15% boost on Intel).
You cannot choose which cores to park, but the Windows 10 kernel checks for Intel's Comet Lake and newer "preferred" (more power-efficient) cores and starts parking those that are not preferred.
It is not strict parking, so under high load the kernel can still use these cores, with very low load.
Just in case you are looking for alternatives:
You can get closest to this by using cpufreq governors. Make Linux exclude the CPU, and the power-saving governor will ensure that the core runs at its minimal frequency.
You can also isolate cpus from the scheduler at kernel boot time.
Add isolcpus=0,1,2 to the kernel boot parameters.
https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html
I know this is an old question, but one way to disable CPUs is via the GRUB config.
If you add maxcpus to the end of GRUB_CMDLINE_LINUX in /etc/default/grub (assuming you are using a standard Linux distribution; if you are using an appliance, the location of the GRUB config may be different), e.g.:
GRUB_CMDLINE_LINUX=".......current config here maxcpus=2"
Then remake your GRUB config by running
grub2-mkconfig -o /boot/grub2/grub.cfg (or grub-mkconfig -o /boot/grub2/grub.cfg, depending on your installation). Some distros may require nr_cpus instead of maxcpus.
Just some extra info:
If you are running a server with multiple physical CPUs, then disabling one CPU will most likely also disable the memory attached to that CPU, so it may have an effect on the performance of the server.
Disabling the CPU this way will not stop a type 1 hypervisor from accessing it (this is based on the Xen hypervisor; I believe it applies to VMware as well, and confirmation from anyone would be great). Depending on the VirtualBox setup, it may restrict the number of CPUs you can allocate to VMs unless you are running paravirtualization.
I am unsure, however, whether you will see any power savings. Most servers, and even desktops these days, already manage power well, putting to sleep any device not needed for the current load. My concern would be that by reducing the number of CPUs (cores) you just move the load onto the remaining ones; between the need to schedule processor time, instructions potentially being queued, and fewer cores being available for interrupts (e.g. network traffic), it may even have a negative effect on power consumption.
AFAIK there is no system call or library function available for this as of now, or even an ioctl implementation. So apart from creating a new module / system call, there are two ways I can think of:
Using inline assembly, asm(<assembly code>);, where the assembly code is architecture-specific code that modifies the relevant CPU flag.
Shelling out from C with system() (man 3 system), assuming you just want to do it from C; a tiny sketch follows.
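A tiny sketch of that second approach (shelling out from C; cpu1 is just an example and you need root):
/* Hedged sketch: offline a logical CPU from C by shelling out, as suggested above. */
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    int rc = system("echo 0 > /sys/devices/system/cpu/cpu1/online");
    if (rc != 0)
        fprintf(stderr, "offlining cpu1 failed (are you root?)\n");
    return rc == 0 ? 0 : 1;
}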

How to measure total boot time for the Linux kernel on an Intel Rangeley board

I am working on an Intel Rangeley board. I want to measure the total time taken to boot the Linux kernel. Is there any possible and proven way to achieve this on an Intel board?
Try using rdtsc. According to the Intel insn ref manual:
The processor monotonically increments the time-stamp counter MSR
every clock cycle and resets it to 0 whenever the processor is reset.
See “Time Stamp Counter” in Chapter 17 of the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 3B, for specific
details of the time stamp counter behavior.
(see the x86 tag wiki for links to manuals)
Normally the TSC is only used for relative measurements between two points in time, or as a timesource. The absolute value is apparently meaningful. It ticks at the CPU's rated clock speed, regardless of the power-saving clock speed it's actually running at.
You might need to make sure you read the TSC from the boot CPU on a multicore system. The other cores might not have started their TSCs until Linux sent them an inter-processor interrupt to start them up. Linux might sync their TSCs to the boot CPU's TSC, since gettimeofday() does use the TSC. IDK, I'm just writing down stuff I'd be sure to check on if I wanted to do this myself.
You may need to take precautions to avoid having the kernel modify the TSC when using it as a timesource. Probably via a boot option that forces Linux to use a different timesource.
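A minimal user-space sketch of the idea, assuming an invariant TSC and a known (or estimated) TSC frequency; it simply converts the raw counter into seconds since reset:
/* Hedged sketch: report the raw TSC as an approximate time since CPU reset. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>                 /* __rdtsc() */
int main(void)
{
    const double tsc_hz = 2.4e9;       /* assumption: substitute your CPU's TSC frequency */
    uint64_t tsc = __rdtsc();
    printf("TSC = %llu cycles ~= %.2f s since reset (assuming %.2f GHz)\n",
           (unsigned long long)tsc, tsc / tsc_hz, tsc_hz / 1e9);
    return 0;
}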

How to find the cause of delay?

A program I'm working on needs to process certain objects upon arrival from network in real-time. The throughput is good, but I have occasional drops in the input queue due to unexpected delays.
My analysis shows that most probably the source of the delay is outside my program; something like another process being scheduled on my process's CPU core (I set the affinity of the process to a certain core) or a hardware interrupt arriving (perhaps a network interrupt).
My problem is I don't know the source of the delay for sure. Is there a tool or a method to find how a CPU core was used exactly during a certain period of time? (Like for example telling me that core 0 was used by process 19494 99.1 percent of the time, process 20001 0.8 percent of the time and process 8110 0.1 percent of the time.)
I use Ubuntu 14.04 Server Edition on an HP server with a Xeon CPU.
It could be CPU, disk speed, network speed, or memory.
Memory usage and CPU load are easy to spot using htop (use the sort option, F6).
HDD speed could be an issue, for example if you use low-energy disks (they slow down when not in use). Do you have a database running on the same system?
Use iotop; it might give a clue.

timing mechanisms in computer systems

I've read this link on Measure time in Linux - getrusage vs clock_gettime vs clock vs gettimeofday?, which provides a great breakdown of the timing functions available in C.
I'm very confused, however, about how the different notions of "time" are maintained by the OS/hardware.
This is a quote from the Linux man pages,
RTCs should not be confused with the system clock, which is a
software clock maintained by the kernel and used to implement
gettimeofday(2) and time(2), as well as setting timestamps on files,
and so on. The system clock reports seconds and microseconds since a
start point, defined to be the POSIX Epoch: 1970-01-01 00:00:00 +0000
(UTC). (One common implementation counts timer interrupts, once per
"jiffy", at a frequency of 100, 250, or 1000 Hz.) That is, it is
supposed to report wall clock time, which RTCs also do.
A key difference between an RTC and the system clock is that RTCs run
even when the system is in a low power state (including "off"), and
the system clock can't. Until it is initialized, the system clock
can only report time since system boot ... not since the POSIX Epoch.
So at boot time, and after resuming from a system low power state,
the system clock will often be set to the current wall clock time
using an RTC. Systems without an RTC need to set the system clock
using another clock, maybe across the network or by entering that
data manually.
The Arch Linux docs indicate that the RTC and system clock are independent after bootup. My questions then are:
What causes the interrupts that increment the system clock?
If wall time = the time interval measured using the system clock, what does process time depend on?
Is any of this related to the CPU frequency, or is that a totally orthogonal time-keeping business?
On Linux, from the application point of view, the time(7) man page gives a good explanation.
Linux also provides the (Linux-specific) timerfd_create(2) and related syscalls.
You should not care about interrupts (they are the kernel's business, and are configured dynamically, e.g. through application timers such as timer_create(2), poll(2) and many other syscalls, and by the scheduler), but only about the application-visible time-related syscalls.
Probably, if some process sets up a timer with a tiny period of e.g. 10 ms, the kernel will increase the frequency of timer interrupts to 100 Hz.
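For instance, here is a sketch of an application-level 10 ms periodic timer using timerfd; this is the application-visible interface, and how the kernel programs the underlying timer hardware is its own business:
/* Hedged sketch: a 10 ms periodic timer via timerfd_create(2), read()-driven. */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/timerfd.h>
int main(void)
{
    int fd = timerfd_create(CLOCK_MONOTONIC, 0);
    if (fd < 0) { perror("timerfd_create"); return 1; }
    struct itimerspec its = {
        .it_interval = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 },  /* 10 ms period */
        .it_value    = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 },  /* first expiry */
    };
    if (timerfd_settime(fd, 0, &its, NULL) < 0) { perror("timerfd_settime"); return 1; }
    for (int i = 0; i < 100; i++) {          /* roughly one second of ticks */
        uint64_t expirations;
        if (read(fd, &expirations, sizeof(expirations)) != sizeof(expirations)) {
            perror("read");
            return 1;
        }
        /* expirations > 1 would mean the process was late and missed ticks */
    }
    printf("received 100 timer expirations\n");
    close(fd);
    return 0;
}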
On recent kernels, you probably want the
CONFIG_HIGH_RES_TIMERS=y
CONFIG_TIMERFD=y
CONFIG_HPET_TIMER=y
CONFIG_PREEMPT=y
options in your kernel's .config file.
BTW, you could run cat /proc/interrupts twice with a 10 second interval. On my laptop with a home-built 3.16 kernel (mostly idle processes, but a Firefox browser and an emacs running), I'm getting 25 interrupts per second. Try also cat /proc/timer_list and cat /proc/timer_stats.
Look also in the Documentation/timers/ directory of a recent (e.g. 3.16) Linux kernel tree.
The kernel probably uses hardware devices like (for PC laptops and desktops) the on-chip HPET, or the TSC, which are much better than the old battery-backed RTC timer. Of course, the details are hardware-specific. So on ARM-based Linux systems (e.g. your Android smartphone) it is different.
