performance counter events associated with false sharing - multithreading

I am looking at the performance of an OpenMP program, specifically its cache and memory performance.
A while back I found guidelines on how to analyze performance with VTune that mentioned which counters to watch out for, but now I cannot seem to find that manual.
If you know which manual I have in mind, or if you know the counters/events, please let me know. Also, if you have other techniques for analyzing multithreaded memory performance, please share them if you can.
Thanks

Here is an article discussing this topic.
The most common counters to examine are L2 cache misses and branch prediction misses.
Note that, in VS2010, you can use the concurrency visualizer in the new profiling tools to see this directly. It does a great job of helping you analyze this information, directly showing you how your code is laid out, along with misses, blocks, and many other useful details for debugging and profiling concurrent apps.
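As a concrete illustration (my own sketch, not from the article above), the classic false-sharing pattern looks like this: per-thread counters packed into one cache line cause constant invalidations, which show up as inflated cache-miss counters; padding each counter onto its own cache line removes the effect. Build with -fopenmp and compare the two variants under VTune or perf.

    #include <omp.h>
    #include <cstdio>

    // Minimal false-sharing illustration. Adjacent per-thread counters share a
    // cache line, so every increment invalidates the line in the other cores'
    // caches. Assumes a 64-byte cache line and fewer than 64 threads.
    struct Padded { long value; char pad[64 - sizeof(long)]; };

    int main() {
        long packed[64] = {0};   // adjacent longs: false sharing
        Padded padded[64] = {};  // one cache line each: no false sharing

        #pragma omp parallel
        {
            int t = omp_get_thread_num();
            for (long i = 0; i < 10000000; ++i) {
                packed[t]++;              // heavy cache-line ping-pong
                // padded[t].value++;     // swap in to compare the miss counts
            }
        }
        std::printf("threads: %d, packed[0]=%ld, padded[0]=%ld\n",
                    omp_get_max_threads(), packed[0], padded[0].value);
        return 0;
    }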

Related

fine grained multi core thread execution measurement on linux?

I'd like to have detailed insight into the running of a multi-threaded application (written in C++), and see in detail how much time each thread is spending on what CPU core.
I've looked at some tools like perf, or hcprof, but haven't found anything that would accomplish what I'm looking for.
What am I missing?
Any pointers welcome.
Akos
You are not missing too much. Try LTTng; it gives you something like a scope into the execution.
The following link describes well what it does:
https://blog.selectel.com/deep-kernel-introduction-lttng/
LTTng has quite a steep learning curve.
Perf is OK, but it gives you more of an occupancy grid than a time profile. Are you using FlameGraphs?
Please read this post: http://www.brendangregg.com/blog/2015-07-08/choosing-a-linux-tracer.html It will give you a hint as to which tools do what.
One last word of advice: in such systems everything is a problem, from build flags and drivers to coding style for synchronization or memory management. You must be precise and meticulous in your profiling.
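If you only need coarse per-thread numbers without a full tracer, a do-it-yourself option (my own sketch, Linux/glibc only) is to have each thread sample its own CPU time via clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...) and its current core via sched_getcpu(). Note this only samples the core the thread is on at that moment, not a migration history, which is why LTTng is the better answer for detailed insight. The worker and its busy loop below are just placeholders.

    #include <sched.h>    // sched_getcpu (glibc)
    #include <time.h>     // clock_gettime, CLOCK_THREAD_CPUTIME_ID
    #include <cstdio>
    #include <thread>
    #include <vector>

    void worker(int id) {
        volatile double x = 0;
        for (long i = 0; i < 50000000; ++i) x += i * 1e-9;   // placeholder work

        timespec cpu{};
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu);        // this thread's CPU time
        std::printf("thread %d: currently on core %d, cpu time %.3f s\n",
                    id, sched_getcpu(), cpu.tv_sec + cpu.tv_nsec / 1e9);
    }

    int main() {                               // build with: g++ -O2 -pthread
        std::vector<std::thread> threads;
        for (int i = 0; i < 4; ++i) threads.emplace_back(worker, i);
        for (auto& t : threads) t.join();
    }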

Libraries for inprocess perf profiling? (Intel performance counters on Linux)

I want to profile an application's critical path by taking readings of the performance counters at various points along the path.
I came across libperf, which provides a fairly neat C API. However, the last activity was 3 years ago.
I am also aware of PAPI, which is under active development.
Are there other libraries I should be aware of?
Can anyone offer any insight into using one or the other?
Any tutorials / introductions to integrating these into application code?
I've used both PAPI (on Solaris) and perf (on Linux) and found it much easier to record the entire program run and use 'perf annotate' to see how your critical path is doing, rather than trying to measure only the critical path. It's a different approach, but it worked well for me.
Also, as someone mentioned in the comments, there is VTune if you are on x86. I've never used it myself.
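For what it's worth, a minimal PAPI sketch of reading a counter around a specific region looks roughly like this (the preset event is just an example; check with papi_avail that PAPI_L2_TCM exists on your hardware, and link with -lpapi):

    #include <papi.h>
    #include <cstdio>

    int main() {
        int evset = PAPI_NULL;
        long long counts[1] = {0};

        // Initialize the library and build an event set with one preset event.
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_L2_TCM);   // total L2 cache misses

        PAPI_start(evset);
        // ... the critical-path section you want to measure ...
        PAPI_stop(evset, counts);

        std::printf("L2 misses: %lld\n", counts[0]);
        return 0;
    }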
There are several options, among them:
JEvents
libpfc
Intel PCM
likwid
Direct use of perf_event_open (see the sketch after this list)
I cover these in more detail in an answer to a related question.
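To give a flavor of the last option, here is a minimal sketch of reading a hardware counter around a code region via perf_event_open on Linux; the event choice and the trimmed error handling are illustrative only.

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdio>
    #include <cstdint>

    // Thin wrapper: there is no glibc wrapper for this syscall.
    static long perf_event_open(perf_event_attr* attr, pid_t pid, int cpu,
                                int group_fd, unsigned long flags) {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main() {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CACHE_MISSES;  // last-level cache misses
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        // pid = 0, cpu = -1: count for the calling thread on any CPU.
        int fd = perf_event_open(&attr, 0, -1, -1, 0);
        if (fd == -1) { std::perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        // ... critical-path section to be measured goes here ...

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        std::uint64_t count = 0;
        read(fd, &count, sizeof(count));   // no read_format flags: raw value
        std::printf("cache misses: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }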
Have a look at
http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization?page=8
VTune, however, is proprietary; if you want, it can serve the purpose. Try its 30-day trial.
perfmon and perfctr can be used for analysis of parts of a program, but they will have to be included in the code itself.

Measuring the scaling behaviour of multithreaded applications

I am working on an application which targets many-core MIMD architectures (on consumer/desktop computers). I am currently worried about the scaling behaviour of the application. It is designed to be massively parallel and to address next-gen hardware. That's actually my problem: does anyone know of any software to simulate/emulate many-core MIMD processors with >16 cores at the machine-code level? I've already implemented a software-based thread scheduler with the ability to simulate multiple processors using simple timing techniques.
I was curious whether there is any software which could do this kind of simulation at a lower level, preferably at the assembly-language level, to get better results. I want to emphasize once again that I'm only interested in MIMD architectures. I know about OpenCL/CUDA/GPGPU, but that's not what I'm looking for.
Any help is appreciated and thanks in advance for any answers.
You will rarely find all-purpose testing tools that are ALSO able to target very narrow (high-performance) corners, for a simple reason: the overhead of the "general-purpose" object defeats that goal in the first place.
This is especially true with parallelism, where locality and scheduling have a huge impact.
All this to say that I am afraid you will have to write your own testing tool to target your exact usage pattern.
That's the price to pay for relevance.
If you are writing your application in C/C++, I might be able to help you out.
My company has created a tool that will take any C or C++ code and simulate its run-time behavior at bytecode level to find parallelization opportunities. The tool highlights parallelization opportunities and shows how parallel tasks behave.
The only mismatch is that our tool will also provide refactoring recipes to actually get to parallelization, whereas it sounds like you already have that.

threadscope functionality

Can programs be monitored while they are running (possibly piping the event log)? Or is it only possible to view event logs after execution? If the latter is the case, is there a deeper reason with respect to how the Haskell runtime works?
Edit: I don't know much about the runtime tbh, but given dflemstr's response, I was curious about how much and the ways in which performance is degraded by adding the event monitoring runtime option. I recall in RWH they mentioned that the rts has to add cost centres, but I wasn't completely sure about how expensive this sort of thing was.
The direct answer is that, no, it is not possible. And, no, there is no reason for that except that nobody has done the required legwork so far.
I think this would mainly be a matter of
Modifying ghc-events so it supports reading event logs chunk-wise and providing partial results. Maybe porting it over to attoparsec would help?
Threadscope would have to update its internal tree data structures as new data streams in.
Nothing too hard, but somebody would need to do it. I think I heard discussion about adding this feature already... So it might happen eventually.
Edit: And to make it clear, there's no real reason this would have to degrade performance beyond what you get with event log or cost centre profiling already.
If you want to monitor the performance of the application while it is running, you can for instance use the ekg package as described in this blog post. It isn't as detailed as ThreadScope, but it does the job for web services, for example.
To get live information about what the runtime is doing, you can use the dtrace program to capture dynamic events posted by some GHC runtime probes. How this is done is outlined in this wiki page. You can then use this information to put together a more coherent event log.

Tools to visualize multithreaded C++ application call graph, multithreaded code coverage?

I would like to know if there are tools that can
Help visualize the call graph of a large multi-threaded application.
Specifically, I want to see how multiple threads interleave on one core / execute simultaneously on multiple cores.
The tool would ideally identify possible wait/deadlock/race conditions.
Ultimately I want to do code coverage in terms of how threads interact with each other at runtime (a multi-thread-wise code coverage tool) so as to find potential multi-threaded bugs.
I apologize if I haven't explained my question clearly and I would love to provide any details.
The VTune Profiler from Intel can do some of what you ask. From the VTune site:
Locks and Waits: Use the Intel® performance profiling tools to quickly find a common cause of slow performance in parallel programs: waiting too long on a lock while the cores are underutilized during the wait.
Timeline Visualizes Thread Behavior: See when threads are running and waiting, and when transitions occur.
If you were looking for something that is open source/free, then Valgrind has an experimental tool called Helgrind that supposedly finds races in multi-threaded programs. I can't comment on it, I haven't used it.
I should note that I haven't been successful in utilizing these or other profilers for multi-threaded debugging and optimizations, and instead I have developed my own techniques.
For identifying lock contention, my preferred technique is to use an extended Mutex class that records all the operations done on each instance. I do this in a very lightweight way, so that the application's performance doesn't change significantly.
To identify race conditions I find the brute-force approach the best. I just design a test that can be run for an extended period of time, sometimes hours or days, depending on the case. And I always run my test on at least two different platforms (more if I can), since different OSes use different schedulers and that gives you better coverage.
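As an illustration of the "extended Mutex" idea (my own sketch, not the exact class described above), a thin wrapper around std::mutex can count contended acquisitions with relaxed atomics so the overhead stays low:

    #include <mutex>
    #include <atomic>
    #include <cstdint>

    // Counts how often lock() had to wait (i.e. the fast try_lock failed).
    // Names and layout are my own guesses at this kind of lightweight bookkeeping.
    class InstrumentedMutex {
    public:
        void lock() {
            if (!impl_.try_lock()) {          // fast path failed -> contention
                contended_.fetch_add(1, std::memory_order_relaxed);
                impl_.lock();                 // fall back to a blocking lock
            }
            acquisitions_.fetch_add(1, std::memory_order_relaxed);
        }
        void unlock() { impl_.unlock(); }

        std::uint64_t acquisitions() const { return acquisitions_.load(std::memory_order_relaxed); }
        std::uint64_t contended()    const { return contended_.load(std::memory_order_relaxed); }

    private:
        std::mutex impl_;
        std::atomic<std::uint64_t> acquisitions_{0};
        std::atomic<std::uint64_t> contended_{0};
    };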
While I can't help (yet!) on most of your issues, I think our C++ Test Coverage tool could provide you with multithreaded test coverage data pretty easily.
This tool instruments your source code; you compile and run that. You end up with (cheap) instrumentation probes in your code representing various blocks. The instrumentation records which parts of your program execute, nominally as a bit vector with one bit per instrumentation probe. At the end of execution (or whenever you like), this bit vector is dumped out and a viewer will show it to you superimposed on the code.
The trick to getting multithreaded test coverage is to know that we provide you complete control over defining how the instrumentation probes work; they are macros. So rather than using the default macro of essentially
probe[n]=true;
on a boolean array, you can instead implement
probe[n]|=1<<threadid;
on an int array (or something cleverly cheaper by precomputing this value). This likely takes only a few lines of code to implement.
Folks might note this technically has synchronization troubles. That's true, but at most it loses a bit of coverage data, and the odds against it are pretty high. Most people are happy with "pretty good" data rather than perfect. If you insist on perfection, you'll pay a high synchronization price using some atomic update instruction.
We also provide you control over the probe dumping logic; you can revise it to write out thread-specific coverage data (in the tens of lines of custom code range). The test coverage data viewer will then let you see thread-specific coverage (just choose the right coverage vector); it also has a built-in facility for easily computing/displaying intersection/union/diff on coverage vectors, which gives you exactly your relation of coverage-per-thread.
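To make the per-thread probe idea concrete, here is a rough sketch of what such a macro body could expand to; the probe array, thread-id assignment, and the atomic-OR variant (the "perfect" but costlier option mentioned above) are my own illustration, not the tool's actual generated code:

    #include <atomic>
    #include <cstdint>

    // Rough sketch of a per-thread coverage probe (illustrative names only).
    constexpr int PROBE_COUNT = 1024;
    static std::atomic<std::uint32_t> probe[PROBE_COUNT];   // one word per probe

    // Hand each thread a small dense id so it owns one bit per probe word.
    static std::atomic<int> next_thread_id{0};
    thread_local int thread_bit = -1;

    inline void hit_probe(int n) {
        if (thread_bit < 0)
            thread_bit = next_thread_id.fetch_add(1) % 32;
        // A plain "probe[n] |= 1 << thread_bit;" on an int array is cheaper but
        // can occasionally lose an update, as noted above; fetch_or pays the
        // synchronization price for exact data.
        probe[n].fetch_or(1u << thread_bit, std::memory_order_relaxed);
    }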
Concurrency Visualizer (a free add-on to Visual Studio) is a really nice visualizer of parallel threads, including visualization of mutex locks, preemption, and call stacks. https://msdn.microsoft.com/en-us/library/dd537632.aspx
