Classifying a program as compute intensive based on performance counters - linux

I'm trying to classify a few parallel programs as compute-, memory-, or data-intensive. Can I classify them from the values obtained from performance counters, e.g. via perf? This command gives a couple of values, like the number of page faults, that I think can be used to tell whether a program needs to access memory frequently or not.
Is this approach correct and feasible? If not, can someone guide me in classifying programs into their respective categories?
Cheers,
Kris

Yes, in theory you should be able to do that with perf. I don't think page-fault events are the ones to observe if you want to analyse memory activity. For this purpose, on Intel processors you should use uncore events, which allow you to count memory traffic (reads and writes separately). On my Westmere-EP these counters are UNC_QMC_NORMAL_READS.ANY and UNC_QMC_WRITES_FULL.ANY.
The following article deals exactly with your problem (on Intel processors):
http://spiral.ece.cmu.edu:8080/pub-spiral/pubfile/ispass-2013_177.pdf
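To make that concrete, here is a rough sketch of what the measurement could look like with perf; treat it as an illustration only. The generic cache events below exist on most CPUs, but the uncore/IMC event names are highly CPU-specific (check perf list on your machine; they may be absent or spelled differently), and ./my_program is just a placeholder for whatever you are profiling.
# Generic counters: a high LLC-miss rate per instruction suggests a memory-bound
# program, a low one suggests a compute-bound program.
perf stat -e instructions,cycles,cache-references,cache-misses,LLC-load-misses ./my_program
# On CPUs that expose the integrated memory controller as an uncore PMU, DRAM
# read/write traffic can be counted system-wide while the program runs:
perf stat -a -e uncore_imc/data_reads/,uncore_imc/data_writes/ -- sleep 10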

Related

How to calculate the max number of threads that my kernel runs on in OpenCL

When I read the device info from an OpenCL device, how can I estimate how good its processing capability is?
To add more information, assume that I want to do a very simple task on the pixels of an image. As far as I know (which may not be right!), when I run my kernel on a GPU, OpenCL runs it in parallel on the different processing units of the GPU, and I can think of the kernel as the thread body which runs in parallel.
If this is correct, then for my simple task I need to find the device that has the most processing units, so my kernel runs on them and hence finishes faster. Am I wrong?
How do I find a suitable device based on its processing power?
Counting the number of processors in an OpenCL device is not sufficient to know how it will perform, for many reasons:
Different processors can have very different frequencies (in MHz/GHz)
Different processors can have very different architectures, e.g. out-of-order, superscalar, functions implemented in hardware
Different OpenCL devices have different types of memory available to them, which can affect the overall performance to a large extent
OpenCL devices can be integrated with the main CPU, sit on a discrete peripheral board, or live across a network. The latency and the need to synchronize or copy memory will affect the performance.
Different algorithms favor different architectures, so while one device may be faster than another for one algorithm, the same may not be true for a different algorithm.
I don't recommend using the number of processors as a measure of performance. The best way is to benchmark with a specific algorithm.
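If you still want to look at the raw device properties before benchmarking, here is a quick sketch (this assumes the clinfo utility is installed; the property labels in its output can vary between versions):
clinfo | grep -E 'Device Name|Max compute units|Max clock frequency|Global memory size'
But as argued above, these numbers only bound what the device could do; only a benchmark of your actual kernel tells you which device is fastest for it.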

Understanding CPU frequency, thread selection and more

With a 1270v3 and a single-thread app I'm at the limit of performance, but when I watch monitoring tools like atop I don't understand how this whole thing works. I tried to find a nice article about this sort of topic, but they are either explained in language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding, a single-thread app only uses one thread for all/most of the work. So the performance is defined by the single-thread power of the CPU.
A moment before I wrote this question I played around with the CPU frequency and noticed that although there are only two instances of the app running, the usage is shared across all cores.
So I assume that the threads jump around between these cores.
So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2 GHz, like before, except one that is permanently at 3.5 GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, so why doesn't it stick to the core at 100%?
Furthermore, as I noticed that only one of the CPUs got scaled up, I looked into the help page and found -r, which should scale all cores with the performance setting. Unfortunately nothing changes. (Is this a bug in Ubuntu 14.04?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works, but only scales 2 cores out of 8. 7 -> scaled 4 cores. So I'm wondering, does this not support hyper-threading, or is the whole program that buggy?
However, as I understand it, the CPUs at the max frequency jump around together with the thread in the monitoring tools because they display the average usage, which then looks shared. Did I figure this right?
Would forcing one CPU to 3.5 GHz and forcing the app onto this core improve performance, or is all the stuff I'm wondering about only an averaging artifact of the data they show each second?
If so, am I right that I would run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter?
Thanks for reading so far, I hope you have a moment to help me understand the whole thing.
Atop example screenshots:
http://i.imgur.com/VFEBvLx.png
http://i.imgur.com/cBKOnJM.png
http://i.imgur.com/bgQfwfU.png
I believe a lot of your confusion has to do with the fuzzy mapping of the capabilities of cpufreq to the actual capabilities of the hardware.
Here’s a description of what is taking place on the HW and in the OS.
A processor is a collection of cores on the same silicon substrate. The cores are what we used to call CPUs, with some enhancements. CPUs now have the capability of running multiple HW threads (hyper-threading), each hardware thread being equivalent to one of the old-style CPUs. Putting this all together, the 1270v3 is a quad core (if I recall correctly), meaning it has 4 cores on the same silicon substrate. Each core can support two HW threads, each HW thread being equivalent to what the OS calls a CPU (I'll call it a virtual CPU). So from the OS perspective, the 1270v3 has 8 (virtual) CPUs.
The OS doesn’t see packages, cores or HW threads. It sees CPUs, and to it there appear to be 8 of them.
To further complicate the issue, modern processors have various HW-supported power-saving states called P-states, C-states and package C-states. Why do I mention these? Because they are intimately associated with the frequency of the processor, and cpufreq professes to provide the user with control over the processor's frequency.
Now, I’m not familiar with cpufreq outside of reading the manpage and other material on the web. From my reading, it has a lot of idiosyncrasies, so I’ll talk about what it is doing from a broad perspective.
In a general sense, cpufreq has a lot of generic capability that may or may not be supported by the HW or the kernel. This is confusing because it looks like the functionality is there but then things don’t happen as you would expect. For example, cpufreq gives the impression that you can set each CPU’s frequency independently. In reality, on a hyperthreaded system, two “CPUs” are associated with each core and must have the same frequency.
A lot of the functionality that cpufreq is supposed to control is associated with the power-efficiency characteristics of the processor, but again, its mapping to the processor's actual hardware capabilities is incomplete and misleading. Though cpufreq seems to allow you to set max and min frequencies, the processor hardware doesn't work this way. Modern Intel processors, such as the 1270v3, have something called P-states. These P-states are frequency-voltage pairs that slow down a processor's frequency (and drop its voltage) to reduce the processor's power consumption, at the cost of the application taking longer to run. These frequency-voltage pairings aren't arbitrary, though cpufreq gives the impression that they are.
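If you want to see which of these knobs your machine actually exposes, something like the following is a reasonable sketch (the sysfs files shown exist for the common acpi-cpufreq driver, but not for every cpufreq driver):
cpufreq-info
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor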
What does this all mean? In addition to the thread migration issues that the commenter mentioned, cpufreq isn't going to behave the way you expect, because it appears to have capabilities that it actually doesn't, and the capabilities it does have map only roughly to the actual capabilities of the HW and OS.
I embedded some further comments in your text.
With a 1270v3 and a single-thread app I'm at the limit of performance, but when I watch monitoring tools like atop I don't understand how this whole thing works. I tried to find a nice article about this sort of topic, but they are either explained in language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding, a single-thread app only uses one thread for all/most of the work. [Yes, but this doesn't mean that the thread is locked to a specific virtual CPU or core.] So the performance is defined by the single-thread power of the CPU. [It's not that simple. The OS migrates threads around, it has its own maintenance processes, etc.] A moment before I wrote this question I played around with the CPU frequency and noticed that although there are only two instances of the app running, the usage is shared across all cores. So I assume that the threads jump around between these cores. So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2 GHz, like before, except one that is permanently at 3.5 GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, so why doesn't it stick to the core at 100%? [Since I can't see what you are observing, I don't really understand what you are asking.]
Furthermore, as I noticed that only one of the CPUs got scaled up, I looked into the help page and found -r, which should scale all cores with the performance setting. Unfortunately nothing changes. (Is this a bug in Ubuntu 14.04?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works, but only scales 2 cores out of 8. 7 -> scaled 4 cores. [I haven't used cpufreq, so I can't directly speak to its behavior, but the manpage implies that "-c" refers to a specific virtual CPU and not the number of virtual CPUs.] So I'm wondering, does this not support hyper-threading, or is the whole program that buggy? [Again, I'm not sure from your explanation what you are doing, but the n -> n/2 behavior makes sense to me. You can change the frequency of a core, but since each core has two hyperthreads/virtual CPUs, those two virtual CPUs must scale together.]
However, as I understand it, the CPUs at the max frequency jump around together with the thread in the monitoring tools because they display the average usage, which then looks shared. Did I figure this right? [Again, I'm not sure what you are observing. Both physically and in atop, the CPU designation should not change, meaning CPU001 will always refer to the same virtual CPU. The core with the max frequency shouldn't physically jump around, though the user thread may. Something to note is that monitoring tools can be pretty heavy users of the CPU. This heavy usage can make figuring out your processor usage difficult if it causes threads to jump around to different virtual CPUs.]
Would forcing one CPU to 3.5 GHz and forcing the app onto this core improve performance, or is all the stuff I'm wondering about only an averaging artifact of the data they show each second? [I found a pretty good explanation of atop with a lot of helpful screenshots: http://www.unixmen.com/linux-basics-monitor-system-resources-processes-using-atop/] If so, am I right that I would run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter? [It all depends upon what other processes are running on your system. If your system is mostly idle except for your processes, then forcing a core to a certain frequency won't make a difference.] [I'm suspicious of what a "governor" does. The language appears to refer to power-efficiency/performance ("balanced", "powersave", "performance", etc.) but the details don't match the capabilities of today's hardware.]
Thanks for reading so far, I hope you have a moment to help me understand the whole thing.
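For completeness, if you do want to try the "force one fast core and pin the app to it" experiment, a sketch along these lines should work. The CPU numbers are assumptions: on many 4-core/8-thread parts, virtual CPUs 3 and 7 are the two hyperthreads of one core, but check /sys/devices/system/cpu/cpu*/topology/thread_siblings_list on your box, and ./my_single_thread_app is just a placeholder.
# Put both hyperthreads of one physical core on the performance governor
sudo cpufreq-set -c 3 -g performance
sudo cpufreq-set -c 7 -g performance
# Pin the application to virtual CPU 7 so the scheduler stops migrating it
taskset -c 7 ./my_single_thread_app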

Measure cycles spent in accessing remote cache

How do I measure the cycles spent accessing a shared remote cache, say L3? I need to get this cache-access information both system-wide and per-thread. Are there any specific tool/hardware requirements? Or can I use a formula to get an approximate value for the cycles spent over a time interval?
To get the average latencies (when a single thread is running) to the various caches present on your machine, you can use memory profiling tools such as RMMA for Windows (http://cpu.rightmark.org/products/rmma.shtml) and Lmbench for Linux.
You can also write your own benchmarks based on the ideas used by these tools.
See the answers posted on this StackOverflow question:
measuring latencies of memory
Or Google for how the Lmbench benchmark works.
If you want to find exact latencies for particular memory access patterns, you will need to use a simulator. This way you can trace a memory access as it flows through the memory system. However simulators will not model all the effects that are present in a modern processor or memory system.
If you want to learn how multiple threads affect the average latency to L3, I think the best bet would be to write your own benchmark.
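As a concrete starting point, lmbench's lat_mem_rd utility walks increasing array sizes and prints latency in nanoseconds, so the L1/L2/L3/DRAM plateaus can be read straight off the output (a sketch; the two arguments are the maximum array size in MB and the stride in bytes):
lat_mem_rd 512 128
Multiplying the reported nanoseconds by the core clock frequency (in GHz) gives an approximate cycle count for each level.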

Profiling tool on Linux showing size of parameters passed to invoked functions

Can any of the profiling (performance measurement) tools on Linux (OProfile, perf events, HTC Toolkit) show a call graph with the size of the parameters (data size) passed to invoked functions / procedures?
In other words I'd like to visualize data flow in a profiled application (best together with profile data, i.e. time spent in a function / subroutine).
Note: when arrays (vectors and matrices) are passed to a subroutine / function, I am interested not only in the size of the pointer, but in the size of the whole data.
Beyond my skeptical remarks, I have a growing sense that you really want a metric that represents 'memory bandwidth', correlated to certain parts of your code.
I have a hunch that you might get mileage out of oprofile. The oprof_start GUI has a nice checklist of metrics:
http://oprofile.sourceforge.net/docs/intel-p6-mobile-events.php
(NOTE: these are CPU dependent, see http://oprofile.sourceforge.net/docs/ or op_help for details on your CPU)
These seem interesting:
DATA_MEM_REFS all memory references, cacheable and non-cacheable
L2_LD number of L2 data loads
L2_ST number of L2 data stores
perhaps some more, but I grew tired of reading things outside my area of expertise here
I think that perf supersedes(?) oprofile, and it certainly has the easier interface for (sampling) call-graph profiling, but I guess that if you want access to the actual CPU performance-counter events, oprofile may still be the place to go.
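If you do go the perf route, a hedged example of the equivalent generic cache events (run perf list to see exactly what your CPU supports; ./my_app is a placeholder):
# Count how much memory traffic the program generates overall
perf stat -e instructions,cache-references,cache-misses,LLC-loads,LLC-stores ./my_app
# Or sample with call graphs to see which call paths produce the misses
perf record -g -e cache-misses ./my_app
perf report
Note that neither tool will report the size of the data passed to each function; they only tell you where the memory traffic originates.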

How to measure power consumed by my algorithm?

I have an image processing algorithm running on an ARM Cortex-A8 / Ubuntu 9.01 platform, and I have to measure the power consumed by my algorithm. Does anyone know how to do this? Are any tools available for this?
Strictly speaking your algorithm doesn't consume power.
Presumably you have some hardware which can accurately measure the power usage of the device, so you should just be able to repeatedly run your code (on an otherwise idle device) on various test data sets and measure the cumulative power usage, and compare that with the idle power consumption of the device over the same time; the difference would be the amount of additional juice the device used running your code.
Like any kind of benchmark, you'll need to run it repeatedly in a loop to get accurate data.
As the data may change its performance characteristics, you'll need a corpus of different test data to simulate different use-cases. Talk to your QA team about it.
I think you can try powertop and powerstat: measure once while the system is idle and once while running your program, and the difference might give you the necessary information.
http://www.hecticgeek.com/2012/02/powerstat-power-calculator-ubuntu-linux/
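A minimal sketch of that idea with powerstat (the two numbers are the sample interval in seconds and the sample count; on a board without a battery or RAPL sensor it will have nothing to read, and you would need an external power meter as described in the first answer):
# Baseline: system idle
sudo powerstat 1 120
# Then start the algorithm in another shell and measure again; the difference
# in average power is attributable to the workload
sudo powerstat 1 120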
Thanks--
S Teja
