Vtune: Accuracy of Intel sampling drivers when vtune measurement run on a machine running other tasks

Vtune: Accuracy of Intel sampling drivers when vtune measurement run on a machine running other tasks - linux

I have the latest coffeelake machine which is primarily used as a storage server. The average workload on each core (4 cores) is around 5-10% when running a storage server alone.
I want to run vtune measurements of a workload on this machine using Intel Sampling drivers. However, I'm doubtful whether or not the measurements will be accurate given the storage server application is concurrently running.
But as the intel's documents suggest, the sampling drivers get installed on the Linux kernel, so is it really the case that the measurements will be inaccurate if run concurrently with other applications? In other words, how exactly do the intel sampling drivers work? Are they able to distinguish between the workload process and other processes running on the system?

If VTune is like the Linux PAPI subsystem that perf uses, it basically saves/restores HW event counter registers on context switch, along with the regular register state. So events like instructions and uops_retired should be unaffected. And effects on other events will be due to actual impacts, like extra cache misses.
(The basic mechanism for HW performance events are that each logical core has its own programmable perf counters that increment every time some microarchitectural event happens. If one overflows, it raises an interrupt for the driver to collect the count. Or for perf record type of functionality, perf or VTune would program them to count down so trigger an interrupt regularly, and sample the saved user-space RIP at that point. This produces some funky effects on a superscalar out-of-order CPU, like "blaming" the instruction waiting for data, not the cache miss load itself, for example. But the key point is that the inside-the-core events are totally per-core. The uncore / L3 cache events count stuff about shared resources like L3 cache, so are more easily disturbed by system load.)
Another point is that if you are running something on a CPU core, Linux isn't going to want to schedule other tasks there. So your background load will tend to avoid whichever core your test is running on, leaving it able to use 100% of a single core without a lot of context switches. (Although network / disk interrupts might still be handled on that core.)
So yes, you should be able to fairly accurately measure what's actually happening in your process while it runs on a system that's not totally idle. That might be a bit different from what would happen if it were run on a fully idle system, but probably not much different. Especially if it's single-threaded, or you can limit it to fewer than all of your cores, so there's at least one left for the OS to schedule other tasks onto.

Related

Programmatically disable CPU core

It is known the way to disable logical CPUs in Linux, basically with echo 0 > /sys/devices/system/cpu/cpu<number>/online. This way, you are only telling to the OS to ignore that given (<number>) CPU.
My question goes further, is it possible not only to ignore it but to turn it off physically programmatically? I want that CPU to not receive any power, in order to make its energy consumption zero.
I know that it is possible disable cores from the BIOS (not always), but I want to know whether is possible to do it within a certain program or not.

When you do echo 0 > /sys/devices/system/cpu/cpu<number>/online, what happens next depends on the particular CPU. On ARM embedded systems the kernel will typically disable the clock that drives the particular core PLL so effectively you get what you want.
On Intel X86 systems, you can only disable the interrupts and call the hlt instruction (which Linux Kernel does). This effectively puts CPU to the power-saving state until it is woken up by another CPU at user request. If you have a laptop, you can verify that power draw indeed goes down when you disable the core by reading the power from /sys/class/power_supply/BAT{0,1}/current_now (or uevent for all values such as voltage) or using the "powertop" utility.
For example, here's the call chain for disabling the CPU core in Linux Kernel for Intel CPUs.
https://github.com/torvalds/linux/blob/master/drivers/cpufreq/intel_pstate.c
arch/x86/kernel/smp.c: smp_ops.play_dead = native_play_dead,
arch/x86/kernel/smpboot.c : native_play_dead() -> play_dead_common() -> local_irq_disable()
Before that, CPUFREQ also sets the CPU to the lowest power consumption level before disabling it though this does not seem to be strictly necessary.
intel_pstate_stop_cpu -> intel_cpufreq_stop_cpu -> intel_pstate_set_min_pstate -> intel_pstate_set_pstate -> wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
On Intel X86 there does not seem to be an official way to disable the actual clocks and voltage regulators. Even if there was, it would be specific to the motherboard and thus your closest bet might be looking into BIOS such as coreboot.
Hmm, I realized I have no idea about Intel except looking into kernel sources.

In Windows 10 it became possible with new power management commands CPMINCORES CPMAXCORES.
Powercfg -setacvalueindex scheme_current sub_processor CPMAXCORES 50
Powercfg -setacvalueindex scheme_current sub_processor CPMINCORES 25
Powercfg -setactive scheme_current
Here 50% of cores are assigned for desired deep sleep, and 25% are forbidden to be parked. Very good in numeric simulations requiring increased clock rate (15% boost on Intel)
You can not choose which cores to park, but Windows 10 kernel checks Intel's Comet Lake and newer "prefered" (more power efficient) cores, and starts parking those not preferred.
It is not a strict parking, so at high load the kernel can use these cores with very low load.
just in case if you are looking for alternatives

You can get closest to this by using governors like cpufreq. Make Linux exclude the CPU and power saving mode will ensure that the core runs at minimal frequency.

You can also isolate cpus from the scheduler at kernel boot time.
Add isolcpus=0,1,2 to the kernel boot parameters.
https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html

I know this is an old question but one way to disable the CPU is via grub config.
If you add to end of GRUB_CMDLINE_LINUX in /etc/default/grub (assuming you are using a standard Linux dist, if you are using an appliance the location of the grub config may be different), e.g.:
GRUB_CMDLINE_LINUX=".......Current config here **maxcpus**=2"
Then remake you grub config by running
grub2-mkconfig -o /boot/grub2/grub.cfg (or grub-mkconfig -o /boot/grub2/grub.cfg depending on your installation). Some distros may require nr_cpus instead of maxcpus.
Just some extra info:
If you are running a server with Multiple physical CPU then disabling one CPU may will most likely disable the memory set that is linked to that CPU, therefore it may have an effect on the performance of the server
Disabling the CPU this way, will not effect your type 1 hypervisor from accessing the CPU (this is based on xen hypervisor, I believe it will apply to vmware as well, if anyone can provide confirmation would be great). Depending on virtualbox setup, it may restrict the amount of CPU you can allocate to VM's unless you are running para-virtualization.
I am unsure however if you will have any power savings, most servers and even desktops these days, already control the power well, putting to sleep any device not needed for the current load. My concern would be by reducing the number of CPU (cores) then you will just be moving the load to the remaining CPU and due to the need to schedule the processors time, and potentially having instructions queued, and the effect of having a smaller number of cores available for interrupts (eg: network traffic), it may have a negative effect on power consumption.

AFAIK there is no system call or library function available as of now. or even ioctl implementation. So apart from creating new module / system call there are two ways I can think of :
using ASM asm(<assembly code>); where assembly code being architecture specific asm code to modify cpu flag.
system call in c (man 3 system). Assuming you just want to do it through c.

Logging all memory accesses of any executable/process in Linux

I have been looking for a way to log all memory accesses of a process/execution in Linux. I know there have been questions asked on this topic previously here like this
Logging memory access footprint of whole system in Linux
But I wanted to know if there is any non-instrumentation tool that performs this activity. I am not looking for QEMU/ VALGRIND for this purpose since it would be a bit slow and I want as little overhead as possible.
I looked at perf mem and PEBS events like cpu/mem-loads/pp for this purpose but I see that they will collect only sampled data and I actually wanted the trace of all the memory accesses without any sampling.
I wanted to know is there any possibility to collect all memory accesses without wasting too much on overhead by using a tool like QEMU. Is there any possibility to use PERF only but without samples so that I get all the memory access data ?
Is there any other tool out there that I am missing ? Or any other strategy that gives me all memory access data ?

It is just impossible both to have fastest possible run of Spec and all memory accesses (or cache misses) traced in this run (using in-system tracers). Do one run for timing and other run (longer,slower), or even recompiled binary for memory access tracing.
You may start from short and simple program (not the ref inputs of recent SpecCPU, or billion mem accesses in your big programs) and use perf linux tool (perf_events) to find acceptable ratio of memory requests recorded to all memory requests. There is perf mem tool or you may try some PEBS-enabled events of memory subsystem. PEBS is enabled by adding :p and :pp suffix to the perf event specifier perf record -e event:pp, where event is one of PEBS events. Also try pmu-tools ocperf.py for easier intel event name encoding and to find PEBS enabled events.
Try to find the real (maximum) overhead with different recording ratios (1% / 10% / 50%) on the memory performance tests. Check worst case of memory recording overhead at left part on the Arithmetic Intensity scale of [Roofline model](https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/. Typical tests from this part are: STREAM (BLAS1), RandomAccess (GUPS) and memlat are almost SpMV; many real tasks are usually not so left on the scale:
STREAM test (linear access to memory),
RandomAccess (GUPS) test
some memory latency test (memlat of 7z, lat_mem_rd of lmbench).
Do you want to trace every load/store commands or you only want to record requests that missed all (some) caches and were sent to main RAM memory of PC (to L3)?
Why you want no overhead and all memory accesses recorded? It is just impossible as every memory access have tracing of several bytes (the memory address, sometimes: instruction address) to be recorded to the same memory. So, having memory tracing enabled (more than 10% or memory access tracing) clearly will limit available memory bandwidth and the program will run slower. Even 1% tracing can be noted, but it effect (overhead) is smaller.
Your CPU E5-2620 v4 is Broadwell-EP 14nm so it may have also some earlier variant of the Intel PT: https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt https://github.com/01org/processor-trace and especially Andi Kleen's blog on pt: http://halobates.de/blog/p/410 "Cheat sheet for Intel Processor Trace with Linux perf and gdb"
PT support in hardware: Broadwell (5th generation Core, Xeon v4) More overhead. No fine grained timing.
PS: Scholars who study SpecCPU for memory access worked with memory access dumps/traces, and dumps were generated slowly:
http://www.bu.edu/barc2015/abstracts/Karsli_BARC_2015.pdf - LLC misses recorded to offline analysis, no timing was recorded from tracing runs
http://users.ece.utexas.edu/~ljohn/teaching/382m-15/reading/gove.pdf - all load/stores instrumented by writing into additional huge tracing buffer to periodic (rare) online aggregation. Such instrumentation is from 2x slow or slower, especially for memory bandwidth / latency limited core.
http://www.jaleels.org/ajaleel/publications/SPECanalysis.pdf (by Aamer Jaleel of Intel Corporation, VSSAD) - Pin-based instrumentation - program code was modified and instrumented to write memory access metadata into buffer. Such instrumentation is from 2x slow or slower, especially for memory bandwidth / latency limited core. The paper lists and explains instrumentation overhead and Caveats:
Instrumentation Overhead: Instrumentation involves
injecting extra code dynamically or statically into the
target application. The additional code causes an
application to spend extra time in executing the original
application ... Additionally, for multi-threaded
applications, instrumentation can modify the ordering of
instructions executed between different threads of the
application. As a result, IDS with multi-threaded
applications comes at the lack of some fidelity
Lack of Speculation: Instrumentation only observes
instructions executed on the correct path of execution. As
a result, IDS may not be able to support wrong-path ...
User-level Traffic Only: Current binary instrumentation
tools only support user-level instrumentation. Thus,
applications that are kernel intensive are unsuitable for
user-level IDS.

How to find the cause of delay?

A program I'm working on needs to process certain objects upon arrival from network in real-time. The throughput is good, but I have occasional drops in the input queue due to unexpected delays.
My analysis shows that most probably the source of the delay is outside my program; something like another process being scheduled on my process's CPU core (I set the affinity of the process to a certain core) or a hardware interrupt arriving (perhaps a network interrupt).
My problem is I don't know the source of the delay for sure. Is there a tool or a method to find how a CPU core was used exactly during a certain period of time? (Like for example telling me that core 0 was used by process 19494 99.1 percent of the time, process 20001 0.8 percent of the time and process 8110 0.1 percent of the time.)
I use Ubuntu 14.04 Server Edition on an HP server with a Xeon CPU.

could be CPU, diskspeed, networkspeed or memory.
Memory usage and CPU is easy to spot using htop . (use the sort option, F6)
HD speed could be an issue. for example if you use low-energy disks (they slow down when not in use). Do you have a database running on the same system?
use iotop , it might give a clue.

Understanding cpu frequency, thread selection and more

With a 1270v3 and a single thread app I'm at the end of performance but when I watch monitoring tools like atop I don't understand how this whole stuff works. I tried to find a nice article about this sort of topic but they either have been explained in a language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding a single-thread app does only use one thread for all/most of the work. So the performance is defined by the single-thread power of the CPU.
A moment before I wrote this question I played around with CPU-frequency and noticed that although there are only two instances of the app running the usage is shared across all cores.
So I assume that the thread jumps around between these cores.
So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2GHz like it was before besides one that is permanently on 3.5GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, why doesn't it stick to the CPU core with the 100%?
Furthermore as I noticed that only one of the CPU's got scaled up I looked into the help page and found -r which should scale all cores with the performance settings. Unfortunately nothing does change. (Is this a bug in Ubuntu 1404?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works but only scales 2 cores out of 8. 7 -> scaled 4 cores. So I'm wondering, does this not support hyper-threading or is the whole program that buggy?
However as I understand it, the CPU's with the max frequency together with the thread jump around in the monitoring tools as they display the average the usage, which than looks like shared. Did I figure this right?
Would forcing one cpu to 3.5GHz and forcing the app to this core improve performance or is all the stuff I'm wondering about only about avg calculation between the data they show each second.
If so am I right that I should run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter?
Thanks for reading so far, I hope you have a moment to help me understand the whole thing.
Atop example screenshots:
http://i.imgur.com/VFEBvLx.png
http://i.imgur.com/cBKOnJM.png
http://i.imgur.com/bgQfwfU.png

I believe a lot of your confusion has to do with the fuzzy mapping of the capabilities of cpufreq to the actual capabilities of the hardware.
Here’s a description of what is taking place on the HW and in the OS.
A processor is a collection of cores on the same silicon substrate. The cores are what we used to call CPUs with some enhancements. CPUs now have the capability of running multiple HW threads (hyperthreading), each hardware thread being equivalent to one of the old type CPUs. Putting this all together, the 1270v3 is a quad core (if I recall correctly), meaning it has 4 cores on the same silicon substrate. Each core can support two HW threads, each HW thread being equivalent to what the OS calls a CPU (and I’ll call a virtual CPU). So from the OS perspective, the 1270v3 has 8 (virtual) CPUs.
The OS doesn’t see packages, cores or HW threads. It sees CPUs, and to it there appear to be 8 of them.
To further complicate the issue, modern processors have various HW supporting power saving states called P-states, C-states and package C-states. Why do I mention these? It’s because they are intimately associated with the frequency of the processor. And cpufreq professes to provide the user with control over the processor’s frequency.
Now, I’m not familiar with cpufreq outside of reading the manpage and other material on the web. From my reading, it has a lot of idiosyncrasies, so I’ll talk about what it is doing from a broad perspective.
In a general sense, cpufreq has a lot of generic capability that may or may not be supported by the HW or the kernel. This is confusing because it looks like the functionality is there but then things don’t happen as you would expect. For example, cpufreq gives the impression that you can set each CPU’s frequency independently. In reality, on a hyperthreaded system, two “CPUs” are associated with each core and must have the same frequency.
A lot of the functionality that cpufreq is supposed to control is associated with the power efficiency characteristics of the processor, but again, its mapping to the processor’s actual hardware capabilities is incomplete and misleading. Though cpufreq seems to allow you to set max and min frequencies, the processor hardware doesn’t work this way. In modern Intel processors, such as the 1270v3, there are something called P-states. These P-states are frequency-voltage pairs that slow down a processor’s frequency (and drop its voltage) to reduce the processor’s power consumption at the cost of the application taking longer to run. These frequency-voltage pairings aren’t arbitrary though cpufreq gives the impression that they are.
What does this all mean? In addition to the thread migration issues that the commenter mentioned, cpufreq isn’t going to behave the way you expect because it appears to have capability that it actually doesn’t, and the capability that it does actually have maps only roughly to the actual capabilities of the HW and OS.
I embedded some further comments in your text.
With a 1270v3 and a single thread app I'm at the end of performance but when I watch monitoring tools like atop I don't understand how this whole stuff works. I tried to find a nice article about this sort of topic but they either have been explained in a language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding a single-thread app does only use one thread for all/most of the work. [Yes, but this doesn’t mean that the thread is locked to a specific virtual CPU or core.] So the performance is defined by the single-thread power of the CPU. [It’s not that simple. The OS migrates threads around, it has its own maintenance processes, etc] A moment before I wrote this question I played around with CPU-frequency and noticed that although there are only two instances of the app running the usage is shared across all cores. So I assume that the thread jumps around between these cores. So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2GHz like it was before besides one that is permanently on 3.5GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, why doesn't it stick to the CPU core with the 100%? [Since I can’t see what you are observing, I don’t really understand what you are asking.]
Furthermore as I noticed that only one of the CPU's got scaled up I looked into the help page and found -r which should scale all cores with the performance settings. Unfortunately nothing does change. (Is this a bug in Ubuntu 1404?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works but only scales 2 cores out of 8. 7 -> scaled 4 cores. [I haven’t used cpufreq so can’t directly speak to its behavior, but the manpage implies that “-c ” refers to a specific virtual CPU and not the number of virtual CPUs.] So I'm wondering, does this not support hyper-threading or is the whole program that buggy? [Again, I’m not sure from your explanation what you are doing, but the n->n/2 behavior makes sense to me. You can change the frequency of a core but since each core has two hyperthreads/virtual CPUs, two of those virtual CPUs must scale together.]
However as I understand it, the CPU's with the max frequency together with the thread jump around in the monitoring tools as they display the average the usage, which than looks like shared. Did I figure this right? [Again, I’m not sure what you are observing. Both physically and in atop, the CPU designation should not change, meaning CPU001 will always refer to the same virtual CPU. The core with the max frequency shouldn’t physically jump around, though the user thread may. Something to note is that monitoring tools can be pretty heavy users of the CPU. This heavy usage can make figuring out your processor usage difficult if it causes threads to jump around to different virtual CPUs.]
Would forcing one cpu to 3.5GHz and forcing the app to this core improve performance or is all the stuff I'm wondering about only about avg calculation between the data they show each second. [I found a pretty good explanation of atop with a lot of helpful screen shots: http://www.unixmen.com/linux-basics-monitor-system-resources-processes-using-atop/] If so am I right that I should run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter? [It all depends upon what other processes are running on your system. If your system is mostly idle except for your processes, then forcing a core to a certain frequency won’t make a difference. [I’m suspicious of what a “governor” does. The language appears to refer to power-efficiency/performance (“balanced”, “powersave”, “performance”, etc) but the details don’t match the capability of today’s hardware.]
Thanks for reading so far, I hope you have a moment to help me

Does pinning a process to a CPU core or an SMP node help reduce cache coherency traffic?

It is possible to pin a process to a specific set of CPU cores using sched_setaffinity() call. The manual page says:
Restricting a process to run on a single CPU also avoids the
performance cost caused by the cache invalidation that occurs when a process
ceases to execute on one CPU and then recommences execution on a different
CPU.
Which is almost an obvious thing (or not?). What is not that obvious to me is this -
Does pinning LWPs to a specific CPU or an SMP node reduces a cache coherency bus traffic? For example, since a process is running pinned, other CPUs should not modify its private memory, thus only CPUs that are part of the same SMP node should stay cache-coherent.

There should be no CPU socket-to-socket coherency traffic for the pinned process case you describe. Modern Xeon platforms implement snoop filtering in the chipset. The snoop filter indicates when a remote socket cannot have the cache line in question, thus avoiding the need to send cache invalidate messages to that socket.
You can measure this for yourself. Xeon processors implement a large variety of cache statistic counters. You can read the counters in your own code with the rdpmc instruction or just use a product like VTune. FYI, using rdpmc is very precise, but a little tricky since you have to initially set a bit in CR4 to allow using this instruction in user mode.
-- EDIT --
My answer above is outdated for the 55xx series of CPUs which use QPI links. These links interconnect CPU sockets directly without an intervening chipset, as in:
http://ark.intel.com/products/37111/Intel-Xeon-Processor-X5570-%288M-Cache-2_93-GHz-6_40-GTs-Intel-QPI%29
However, since the L3 cache in each CPU is inclusive, snoops over the QPI links only occur when the local L3 cache indicates the line is nowhere in the local socket. Likewise, the remote socket's L3 can quickly respond to a cross-snoop without bothering the cores, assuming the line isn't there either.
So, the inclusive L3 caches should minimize inter-socket coherency overhead, it's just not due to a chipset snoop filter in your case.

If you run on a NUMA system (like, Opteron server or Itanium), it makes sense, but you must be sure to bind a process to the same NUMA node that it allocates memory from. Otherwise, this is an anti-optimization. It should be noted that any NUMA-aware operating system will try to keep execution and memory in the same node anyway, if you don't tell it anything at all, to the best of its abilities (some elderly versions of Windows are rather poor at this, but I wouldn't expect that to be the case with recent Linux).
If you don't run on a NUMA system, binding a process to a particular core is the one biggest stupid thing you can do. The OS will not make processes bounce between CPUs for fun, and if a process must be moved to another CPU, then that is not ideal, but the world does not end, either. It happens rarely, and when it does, you will hardly be able to tell.
On the other hand, if the process is bound to a CPU and another CPU is idle, the OS cannot use it... that is 100% available processing power gone down the drain.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string