Fine-grained multi-core thread execution measurement on Linux?

I'd like to have detailed insight into the running of a multi-threaded application (written in C++), and see in detail how much time each thread is spending on what CPU core.
I've looked at tools like perf and hcprof, but haven't found anything that accomplishes what I'm looking for.
What am I missing?
Any pointers welcome.
Akos

You are not missing much. Try LTTng; it gives you something like a scope into the execution.
The following link describes well what it does:
https://blog.selectel.com/deep-kernel-introduction-lttng/
LTTng has a fairly steep learning curve.
perf is OK, but it gives you more of an occupancy grid than a time profile. Are you using flame graphs?
Please read this post, http://www.brendangregg.com/blog/2015-07-08/choosing-a-linux-tracer.html; it gives a good overview of which tools do what.
One last word of advice: in such systems everything is a potential problem, from build flags and drivers to coding style for synchronization and memory management. You must be precise and meticulous in your profiling.

Related

RISC-V and Spike: some very basic questions

I want to emulate various multi-core hardware with RISC-V and Spike, but I am really struggling to find documentation. For instance, I don't even know where a typical RISC-V processor begins execution on reset, and I cannot seem to find this information in the ISA documentation.
Is the answer to look at the Spike sources? Or is there some other pool of documentation I have missed?
What you are asking about is not a part of the user-level ISA, but rather, the Platform Specification.
Unfortunately, such a manual does not yet exist.
Your best bet, particularly as the platform and privileged-level specifications are still under rapid development, is to look at the Spike source code, as it is the "Golden Model".
To answer your question about the boot PC, just see what Spike does:
spike -d hello.riscv
Regarding the bootstrap PC after reset: according to the post linked below, it starts at 0x200.
How can I compile C code to get a bare-metal skeleton of a minimal RISC-V assembly program?
I am still trying to figure out how to get the example in the linked post to work on an up-to-date Rocket.

Libraries for inprocess perf profiling? (Intel performance counters on Linux)

I want to profile an application's critical path by taking readings of the performance counters at various points along the path.
I came across libperf, which provides a fairly neat C API. However, the last activity was 3 years ago.
I am also aware of PAPI, which is under active development.
Are there other libraries I should be aware of?
Can anyone offer any insight into using one or the other?
Any tutorials / introductions to integrating these into application code?
I've used both PAPI (on Solaris) and perf (on Linux) and found it is much easier to record the entire program run and use 'perf annotate' to see how your critical path is doing, rather than trying to measure only the critical path. It's a different approach, but it worked well for me.
Also, as someone mentioned in the comments, there is VTune, if you are x86-based. I've never used it myself.
There are several options, among them:
JEvents
libpfc
Intel PCM
likwid
Direct use of perf_event_open
I cover these in more detail in an answer to a related question.
Have a look at
http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization?page=8
VTune, however, is proprietary; if you want, it can serve the purpose. Try its 30-day trial.
perfmon and perfctr can be used to analyze parts of a program, but they have to be integrated into the code itself.

Measuring the scaling behaviour of multithreaded applications

I am working on an application which supports many-core MIMD architectures (on consumer/desktop computers). I am currently worrying about the scaling behaviour of the application. It is designed to be massively parallel and addresses next-gen hardware. That's actually my problem: does anyone know any software to simulate/emulate many-core MIMD processors with >16 cores at a machine-code level? I've already implemented a software-based thread scheduler with the ability to simulate multiple processors, using simple timing techniques.
I was curious whether there's any software which could do this kind of simulation at a lower level, preferably at the assembly-language level, to get better results. I want to emphasize once again that I'm only interested in MIMD architectures. I know about OpenCL/CUDA/GPGPU, but that's not what I'm looking for.
Any help is appreciated and thanks in advance for any answers.
You will rarely find all-purpose testing tools that are also able to target very narrow (high-performance) corners, for a simple reason: the overhead of the "general-purpose" machinery would defeat that goal in the first place.
This is especially true with parallelism, where locality and scheduling have a huge impact.
All this to say that I am afraid you will have to write your own testing tool to target your exact usage pattern.
That's the price to pay for relevance.
If you are writing your application in C/C++, I might be able to help you out.
My company has created a tool that will take any C or C++ code and simulate its run-time behavior at bytecode level to find parallelization opportunities. The tool highlights parallelization opportunities and shows how parallel tasks behave.
The only mismatch is that our tool will also provide refactoring recipes to actually get to parallelization, whereas it sounds like you already have that.

Which of this program can be multithreaded?

I am a normal user and do not have a strong background in programming.
I have a 64-bit, dual-core machine (Dell Vostro 3400), and I think I can run a multithreaded program on this machine (yes?)
The program that I think could be convert into multithreaded program is this:
http://code.google.com/p/malwarecookbook/source/browse/trunk/3/8/pescanner.py
Is it possible to do so?
If yes, which parts should be edited so that it will work?
Thanks.
Multithreading is not an easy subject.
I suggest you read up on some tutorials, see:
http://www.tutorialspoint.com/python/python_multithreading.htm
http://www.devshed.com/c/a/Python/Basic-Threading-in-Python/
http://www.artfulcode.net/articles/multi-threading-python/
To answer the general part of your question: you can run multithreaded code on any machine newer than, say, 2000.
Your question is too broad though to answer without going into details on the code.
My suggestion
I suggest you try the tutorials first and write some sample programs; ask a specific question, with source code, if you get stuck.
That's the road I'd recommend, rather than taking someone else's code and rewriting it without detailed knowledge of threads.

performance counter events associated with false sharing

I am looking at the performance of OpenMP program, specifically cache and memory performance.
I found some guidelines a while back on how to analyze performance with VTune, which mentioned which counters to watch for. However, I now cannot seem to find the manual.
If you know which manual I have in mind, or if you know the counters/events, please let me know. Also, if you have other techniques for analyzing multithreaded memory performance, please share them if you can.
Thanks
Here is an article discussing this topic.
The most common counters to examine are L2 cache misses and branch prediction misses.
Note that in VS2010 you can use the Concurrency Visualizer in the new profiling tools to see this directly. It does a great job of helping you analyze this information: it directly shows you how your code is laid out, shows you misses and blocks, and provides many other useful tools for debugging and profiling concurrent apps.