Finding "cold spots" with `perf`

Finding "cold spots" with `perf` - multithreading

Running a `perf record`/`perf annotate` cycle shows which assembly instructions are "hot", that is, executed over and over.
However, for certain performance problems we want to know where a function is "cold". This often happens in multithreaded programs, where resource contention, such as mutexes being held for too long, causes underutilization of the hardware.
How can I understand "cold spots" from a perf.data file?

Related

How do I profile multithreading problems?

This is the first time I am trying to profile a multi-threaded program.
I suspect the problem is that it is waiting for something, but I have no clue what: the program never reaches 100% CPU, GPU, RAM or I/O use.
Until recently, I've only worked on projects with single-threading, or where the threads were very simple (example: usually an extra thread just to ensure the UI is not locked while the program works, or once I made a game engine with a separate thread to handle .XM and .IT files music, so that the main thread could do everything, while the other thread in another core could take care of decoding those files).
This program has several threads, and they don't do parallel work on the same tasks; each thread has its own completely separate purpose (for example, one thread is dedicated to handling all sound-related API calls to the OS).
I downloaded the Microsoft performance tools (a blog by an ex-Valve employee explains that they work for this). Although I managed to capture some profiles, I don't really understand what I am seeing; it is only a bunch of pretty graphs to me (except the CPU use graph, which I already knew from sample-based profiling of single-threaded apps). So, how do I find out why the program is waiting on something, or what it is waiting for? How do I find which thread is blocking the others?

I look at it as an alternation between two things:
a) measuring overall time, for which all you need is some kind of timer, and
b) finding speedups, which does not mean measuring, in spite of what a lot of people have been told.
Each time you find a speedup, you time the results and do it again.
That's the alternation.
To find speedups, the method I and many people use is random pausing.
The idea is, you get the program running under a debugger and manually interrupt it, several times.
Each time, you examine the state of every thread, including the call stack.
It is very crude, and it is very effective.
The reason this works is that the only way the program can go faster is if it is doing an activity that you can remove, and if that saves a certain fraction of time, you are at least that likely to see it on every pause.
This works whether it is doing I/O, waiting for something, or computing.
It sees things that profilers do not expose, because they make summaries from which speedups can easily hide.
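The answer describes pausing by hand in a debugger. As a rough, automated approximation for a Python program only, the sketch below takes a few snapshots of every thread's call stack at random moments from inside the process. The function names `snapshot_stacks` and `random_pauses` and the lock-holding workload are hypothetical, chosen for illustration; the point is still to read each snapshot individually rather than to aggregate them.

```python
import random
import sys
import threading
import time
import traceback


def snapshot_stacks():
    """Return {thread_name: [formatted stack lines]} for every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    snapshot = {}
    for ident, frame in sys._current_frames().items():
        name = names.get(ident, f"thread-{ident}")
        snapshot[name] = traceback.format_stack(frame)
    return snapshot


def random_pauses(count=5, max_gap=0.05):
    """Take `count` snapshots separated by random gaps (in seconds)."""
    samples = []
    for _ in range(count):
        time.sleep(random.uniform(0, max_gap))
        samples.append(snapshot_stacks())
    return samples


if __name__ == "__main__":
    lock = threading.Lock()

    def worker():
        # Hypothetical workload: mostly sleeps while holding a shared lock.
        for _ in range(20):
            with lock:
                time.sleep(0.01)

    threads = [threading.Thread(target=worker, name=f"worker-{i}") for i in range(2)]
    for t in threads:
        t.start()
    for i, sample in enumerate(random_pauses()):
        for name, stack in sample.items():
            # The innermost frame is the last entry; show only its location line.
            print(i, name, stack[-1].strip().splitlines()[0])
    for t in threads:
        t.join()
```

If most snapshots show a thread sitting on the same lock acquisition or the same blocking call, that line is a candidate for the kind of removable activity described above.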

The Performance Wizard in the Visual Studio Performance and Diagnostics Hub has a "Resource contention data" profiling mode, which lets you analyze contention among threads, i.e. how the overall performance of a program is affected by threads waiting on other threads. Please refer to this blog post for more details.
PerfView is an extremely powerful profiling tool that lets you analyze the impact of service threads and tasks on the overall performance of the program. A PerfView Tutorial is available.

Consistent use of CPU by Java Process

I am running a Java program which does a heavy load work and needs lots of memory and CPU attention.
I took a snapshot of Task Manager while the program was running, and this is what it looks like:
Clearly this program is making use of all 8 cores available on my machine, but if you look at the CPU usage graph you can see dips, and these dips are consistent across all cores.
My question is: is there some way of avoiding these dips? Can I make sure that all my cores are used consistently, without any dips, and come to rest only after my program has finished?

This looks so familiar. Obviously, your threads are blocking for some reason. Here are my suggestions:
Check to see if you have any thread blocking (synchronization). Thread synchronization is easy to do wrong and can stop computation for extended periods of time.
Make sure you aren't waiting on I/O (file, network, devices, etc). Often the default for network or other I/O is to block.
Don't block on message passing or remote procedure calls.
Use a more sophisticated profiler to get a better look. I use Intel VTune, but then I have access to it. There are other low-level profiling tools that are just as capable but more difficult to use.
Check for other processes that might be using the system. I've had situations where that other process doesn't use the processor (blocks) but doesn't give the context up (doesn't swap out and allow another process to run).
When I say "don't block", I don't mean that you should poll. That's even worse, as it consumes processing without doing anything useful. Restructure your algorithm to hide latency. Use a new algorithm that permits more latency hiding. Find alternate ways of thread synchronization that minimize or eliminate blocking.
My two cents.
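One common form of latency hiding is prefetching: start loading the next item while the current one is being processed. The sketch below shows the idea in Python rather than Java, and the names `process_with_prefetch`, `load` and `compute` are hypothetical. It assumes `load` is I/O-bound (so it really does overlap with `compute`) and that items can be loaded independently.

```python
from concurrent.futures import ThreadPoolExecutor


def process_with_prefetch(keys, load, compute):
    """Overlap load(keys[i + 1]) with compute() on the data for keys[i]."""
    keys = list(keys)
    results = []
    if not keys:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load, keys[0])
        for i in range(len(keys)):
            data = pending.result()  # blocks only if the load is still running
            if i + 1 < len(keys):
                # Start the next load before computing on the current data.
                pending = pool.submit(load, keys[i + 1])
            results.append(compute(data))
    return results
```

With this structure the compute thread waits only when a load takes longer than the computation it overlaps with, instead of waiting for every load in full.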

How reliable is pstack as a profiling tool?

I've been using pstack (called in a loop periodically) as a substitute for a real profiling tool. I've noticed that even though there's more than 85% CPU usage for that pid in top, pstack shows the pid being blocked on I/O more often than being CPU bound.
How's pstack implemented? Is there any reason why pstack would be more susceptible to attaching to the pid when it's actually blocked on I/O?

You say you're calling pstack periodically in a loop, i.e. in a separate process (B) from the one you are profiling (A). If they are running on a single core, then B is more likely to "wake up" when A is blocked.
Regardless, I would trigger pstack manually, on the theory that not many samples are needed. Rather the samples I do get need to be scrutinized, not just lumped together.
In general, it's good to take samples during I/O time as well as CPU time, because both I/O and CPU wastage can make your program slow.
If it somewhat inflates one or the other, that's fairly harmless, assuming your real goal is to precisely identify things to optimize, rather than just get precise measurements of fuzzy things like functions.

Record thread events

Suppose I need to peek at a thread's state at regular intervals and record its state throughout the execution of a program. I wouldn't know how to start thinking about this. Any pointers (pun?)? I'm on Linux, using gcc, pthreads and C, and have access to all the usual Linux tools. Basically, I guess I'm asking how to build a simple profiler for threads that will tell me how long a thread has been in one state or another during the execution of the program.
I want to be able to create graphs like Threadscope does. The X axis is time, the Y axis is core/thread number and the "colors" are state: green means running, orange is garbage collection, and so on. Does this make more sense now?

For a Linux-specific solution, have a look at /proc/<pid>/stat and /proc/<pid>/task/<tid>/stat for process and thread statistics, respectively. See the proc(5) manual page for a full description of all the fields (online at http://man7.org/linux/man-pages/man5/proc.5.html; search for /proc/[pid]/stat). Specifically, at least the fields utime and stime are of interest to you (cutime and cstime, by contrast, count the time of waited-for children). These are monotonically increasing times, so you need to remember the previously measured value to compute the time spent in the process/thread during a given time slice, in order to produce the data for your graphs. (This is how top(1) works.)
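A minimal sketch of such an outside-the-process sampler, written in Python for brevity (the same file reads work from C). It reads the per-thread stat files and extracts the state character (field 3), the user-mode counter utime (field 14) and stime (field 15), using the field numbering from proc(5). The function names are hypothetical, and the code is Linux-only.

```python
import os


def parse_stat_line(line):
    """Parse one /proc/.../stat line into state, utime and stime (clock ticks)."""
    # The comm field is wrapped in parentheses and may itself contain spaces
    # or ")", so split after the last ")" rather than on whitespace alone.
    rest = line[line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state); field n of proc(5) is rest[n - 3].
    return {"state": rest[0], "utime": int(rest[11]), "stime": int(rest[12])}


def read_thread_stats(pid):
    """Return {tid: parsed stat} for every thread of `pid` (Linux only)."""
    stats = {}
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            with open(f"/proc/{pid}/task/{tid}/stat") as f:
                stats[int(tid)] = parse_stat_line(f.read())
        except (FileNotFoundError, ProcessLookupError):
            pass  # the thread exited between listdir() and the read
    return stats


def cpu_seconds_between(before, after):
    """CPU seconds each thread used between two read_thread_stats() snapshots."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    deltas = {}
    for tid, now in after.items():
        # A thread missing from `before` is new: count its time from zero.
        prev = before.get(tid, {"utime": 0, "stime": 0})
        ticks = (now["utime"] - prev["utime"]) + (now["stime"] - prev["stime"])
        deltas[tid] = ticks / ticks_per_second
    return deltas
```

Calling `read_thread_stats(pid)` once per time slice and feeding consecutive snapshots to `cpu_seconds_between` gives the per-slice CPU time per thread; the `state` character (for example R for running, S for sleeping, D for uninterruptible wait) is the kernel's view of the thread at the instant of the read, not an application-level state such as "garbage collection".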
However, having the profiler distinguish different states makes the problem more complicated. How does the profiler know which state the profiled program is in? It seems to me that the profiled program's threads need to signal this to the profiler in some way. You need some kind of tailored solution for this state sharing (unless you can run the different states in different threads and make the distinction this way, which I doubt).
If the state transitions are done in single place (e.g. enter GC and leave GC in your example), then one way would be as follows:
The monitored threads would get the start and end times of the special states by using POSIX function clock_gettime() - with clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tp) you can get the process time and with clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tp) you can get the thread time (both monotonically increasing, again).
The thread could communicate these timings to the profiler program with some kind of IPC.
If the profiler application knows the thread times of entering and leaving a state, then because it knows the thread time values at the change of measuring slices, it can determine how much of the thread time is spent in the reported states within a reporting time slice (and of course here we need to adjust the start time for a state to equal the start of the next reporting time slice).
The time the whole process has spent on a specific state can be calculated by summing up the thread times for that state.
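The steps above can be sketched as follows, in Python for brevity. This is an illustration under assumptions, not a finished design: the in-process `queue.Queue` stands in for whatever IPC channel the real profiler would use, and `state_span` and `cpu_by_state` are hypothetical names. The thread clock is read with `time.clock_gettime(time.CLOCK_THREAD_CPUTIME_ID)`, Python's wrapper around the POSIX call.

```python
import queue
import threading
import time
from contextlib import contextmanager

events = queue.Queue()  # stand-in for the IPC channel to the profiler


@contextmanager
def state_span(state, sink=events):
    """Report the thread-CPU and wall-clock times at which `state` starts and ends."""
    cpu_start = time.clock_gettime(time.CLOCK_THREAD_CPUTIME_ID)
    wall_start = time.monotonic()
    try:
        yield
    finally:
        sink.put({
            "thread": threading.get_ident(),
            "state": state,
            "cpu_start": cpu_start,
            "cpu_end": time.clock_gettime(time.CLOCK_THREAD_CPUTIME_ID),
            "wall_start": wall_start,
            "wall_end": time.monotonic(),
        })


def cpu_by_state(sink=events):
    """Drain the queue and sum thread CPU seconds per state across all threads."""
    totals = {}
    while True:
        try:
            event = sink.get_nowait()
        except queue.Empty:
            return totals
        used = event["cpu_end"] - event["cpu_start"]
        totals[event["state"]] = totals.get(event["state"], 0.0) + used
```

A monitored thread would wrap each special state, for example `with state_span("gc"): ...`, and the profiler side would consume the events. The wall-clock timestamps are recorded alongside the CPU times so that spans can be placed on a time axis and split at reporting-slice boundaries.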
Note that through /proc/<pid>/stat or /proc/<pid>/task/<tid>/stat, the measurement accuracy is not very good (clock ticks, often units of 10ms), but I do not know of another way of getting timing information from outside of the process/thread. The function clock_gettime() gives very accurate times (nominally nanosecond resolution, but note that at least on some MIPS and ARM systems the accuracy is as bad as with the stat files under /proc, because the Linux kernel lacks an accurate timer-reading implementation for these clocks there). You would also need to experiment to make sure these two timing sources really give the same results (by reading both values from the same threads). You can of course read these /proc/.../stat files inside the thread, but the accuracy just is not very good unless you spend a lot of time within a state.

Well, the closest match, for C and GCC, to the profiling info produced by the Haskell compiler and processed by Threadscope is the gprof utility (part of GNU binutils).
For it to work correctly with pthreads you need each thread to trigger some timer initialization function. This can be done without modifying your code with this pthreads wrapper library: http://sam.zoy.org/writings/programming/gprof.html . I haven't dealt with the problem recently, it may be that something has changed and the wrapper isn't needed anymore...
As to GUI to interpret the profiling results, there is kprof (http://kprof.sourceforge.net). Unfortunately, AFAIK it doesn't produce thread duration graphs, for that you'll have to work your own solution with the textual info produced by gprof.
If you are not picky about using the "standard" solution offered by the GCC, you may wanna try this: http://code.google.com/p/gperftools/?redir=1 (didn't try it personally, but heard good opinions).
Good luck!

Take a look at Intel VTune Amplifier XE (formerly … Intel Thread Profiler) to see if it will meet your needs.
This and other Intel Linux development tools are available free for non-commercial use.
In the video Using the Timeline in Intel VTune Amplifier XE showing a timeline of a multi-threaded application, at 9:20 the presenter mentions
"...with the frame API you can programmatically mark certain events or phases in your code. And these marks will appear on the timeline."

I think it will be rather difficult to build a simple profiler, simply because there are many different factors that you have to consider, and system profiling is an inherently complex task, made all the more so when you are profiling a multithreaded application. The best advice I can think of is to look at something that already exists, for example OProfile.
One advantage of OProfile is that it is open source so the source code is available. But beyond this I suspect that asking how to build a profiling application might be beyond the scope of what someone can answer in a SO question, which might be why this question hasn't gotten very many responses. Hopefully looking at some example will help you get started and then perhaps if you have more focused questions you could get some more detailed responses.

Can a multi-threaded program ever be deterministic?

Normally it is said that multi threaded programs are non-deterministic, meaning that if it crashes it will be next to impossible to recreate the error that caused the condition. One doesn't ever really know what thread is going to run next, and when it will be preempted again.
Of course this has to do with the OS thread scheduling algorithm and the fact that one doesn't know what thread is going to be run next, and how long it will effectively run.
Program execution order also plays a role as well, etc...
But what if you had the algorithm used for thread scheduling and what if you could know when what thread is running, could a multi threaded program then become "deterministic", as in, you'll be able to reproduce a crash?

Knowing the algorithm will not actually allow you to predict what will happen when. All kinds of delays that happen in the execution of a program or thread are dependent on environmental conditions such as: available memory, swapping, incoming interrupts, other busy tasks, etc.
If you were to map your multi-threaded program to a sequential execution, and your threads in themselves behave deterministically, then your whole program could be deterministic and 'concurrency' issues could be made reproducible. Of course, at that point they would not be concurrency issues any more.
If you would like to learn more, http://en.wikipedia.org/wiki/Process_calculus is very interesting reading.

My opinion is: technically no (but mathematically yes). You can write a deterministic threading algorithm, but the state of the application will be so hard to predict after any sensible amount of time that you may as well treat it as non-deterministic.

There are some tools (in development) that will try to create race-conditions in a somewhat predictable manner but this is about forward-looking testing, not about reconstructing a 'bug in the wild'.
CHESS is an example.

It would be possible to run a program on a virtual multi-threaded machine where the allocation of virtual cycles to each thread was done via some entirely deterministic process, possibly using a pseudo-random generator (which could be seeded with a constant before each program run). Another, possibly more interesting, possibility would be to have a virtual machine which would alternate between running threads in 'splatter' mode (where almost any variable they touch would have its value become 'unknown' to other threads) and 'cleanup' mode (where results of operations with known operands would be visible and known to other threads). I would expect the situation would probably be somewhat analogous to hardware simulation: if the output of every gate is regarded as "unknown" between its minimum and maximum propagation times, but the simulation works anyway, that's a good indication the design is robust, but there are many useful designs which could not be constructed to work in such simulations (the states would be essentially guaranteed to evolve into a valid combination, though one could not guarantee which one). Still, it might be an interesting avenue of exploration, since large parts of many programs could be written to work correctly even in a 'splatter mode' VM.
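A toy version of the first idea, as a sketch only: Python generators play the role of threads, each `yield` is a point where the virtual scheduler may switch, and a pseudo-random generator seeded with a constant picks who runs next. The names `run_deterministic` and `make_racy_counter` are hypothetical, and a real virtual machine would have to schedule at the instruction level rather than at hand-placed yields.

```python
import random


def run_deterministic(tasks, seed):
    """Interleave generator-based 'threads' in an order fixed by `seed`.

    `tasks` maps a name to a zero-argument generator function; each `yield`
    marks a point where the virtual scheduler may switch threads.
    """
    rng = random.Random(seed)
    running = {name: make() for name, make in tasks.items()}
    trace = []
    while running:
        # sorted() keeps the choice independent of dict insertion order.
        name = rng.choice(sorted(running))
        try:
            next(running[name])
            trace.append(name)
        except StopIteration:
            del running[name]
    return trace


def make_racy_counter():
    """Two 'threads' doing an unsynchronized read-modify-write on shared state."""
    shared = {"n": 0}

    def worker():
        for _ in range(3):
            tmp = shared["n"]      # read
            yield                  # possible preemption between read and write
            shared["n"] = tmp + 1  # write back, possibly losing an update
            yield

    return shared, {"A": worker, "B": worker}


if __name__ == "__main__":
    shared, tasks = make_racy_counter()
    print(run_deterministic(tasks, seed=42), shared["n"])
```

Running it twice with the same seed replays the same interleaving and therefore the same lost updates; changing the seed explores a different schedule.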

I don't think it is practicable. To enforce a specific thread interleaving we would need to place locks on shared variables, forcing the threads to access them in a specific order. This would cause severe performance degradation.
Replaying concurrency bugs is usually handled by record-and-replay systems. Since recording such large amounts of information also degrades performance, the most recent systems do partial logging and later complete the thread interleavings using SMT solving. I believe that the most recent advance in this type of system is Symbiosis (published in this year's PLDI conference). You can find an open source implementation at this URL:
http://www.gsd.inesc-id.pt/~nmachado/software/Symbiosis_Tutorial.html

This is actually a valid requirement in many systems today, which want to execute tasks in parallel but also want some determinism from time to time.
For example, a mobile company would want to process subscription events of multiple users in parallel but would want to execute the events of a single user one at a time.
One solution is, of course, to write everything to be executed on a single thread. Another solution is deterministic threading. I have written a simple library in Java that can be used to achieve the behavior I have described in the above example. Take a look at this: https://github.com/mukulbansal93/deterministic-threading.
Now, having said that, the actual allocation of CPU to a thread or process is in the hands of the OS. So, it is possible that the threads get the CPU cycles in a different order every time you run the same program. So, you cannot achieve the determinism in the order the threads are allocated CPU cycles. However, by delegating tasks effectively amongst threads such that sequential tasks are assigned to a single thread, you can achieve determinism in overall task execution.
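The idea of assigning sequential tasks to a single thread can be sketched as below. This is an illustrative Python sketch, not the API of the Java library linked above: `KeyedExecutor` is a hypothetical name, and it relies on a single-worker `ThreadPoolExecutor` running its tasks in submission order.

```python
from concurrent.futures import ThreadPoolExecutor


class KeyedExecutor:
    """Run tasks with the same key in submission order, different keys in parallel."""

    def __init__(self, lanes=4):
        # One single-threaded executor per lane: each lane runs its tasks FIFO.
        self._lanes = [ThreadPoolExecutor(max_workers=1) for _ in range(lanes)]

    def submit(self, key, fn, *args):
        # The same key always maps to the same lane within one process run.
        lane = self._lanes[hash(key) % len(self._lanes)]
        return lane.submit(fn, *args)

    def shutdown(self):
        for lane in self._lanes:
            lane.shutdown(wait=True)
```

In the subscription example, the user ID would be the key: `executor.submit(user_id, handle_event, event)` keeps each user's events in order while events of users in different lanes run concurrently. The order in which the lanes themselves get CPU time remains up to the OS.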
Also, to answer your question about the simulation of a crash: all modern CPU scheduling algorithms are free from starvation, so each and every thread is guaranteed to get CPU cycles. Now, it is possible that your crash was the result of a certain sequence of threads executing on a single CPU. There is no way to rerun that same execution order, or rather the same CPU cycle allocation order. However, the combination of modern CPU scheduling algorithms being starvation-free and Murphy's law will help you simulate the error if you run your code enough times.
PS, the definition of "enough times" is quite vague and depends on a lot of factors, such as the execution cycles needed by the entire program, the number of threads, etc. Mathematically speaking, a crude way to calculate the probability of reproducing the same error caused by the same execution sequence on a single processor is:
1/Number of ways to execute all atomic operations of all defined threads
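For instance, a program with 2 threads of 2 atomic instructions each can be allocated CPU cycles in 6 different ways on a single processor if each thread's instructions stay in program order (choose which 2 of the 4 slots go to the first thread: C(4,2) = 6). So the probability would be 1/6.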
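A quick way to check such counts, assuming that each thread's own instructions must stay in program order and that every schedule is equally likely (both are simplifying assumptions, and the function names are hypothetical). The first function uses the multinomial formula; the second enumerates the schedules by brute force as a cross-check for small cases.

```python
from itertools import permutations
from math import comb


def count_interleavings(ops_per_thread):
    """Number of ways to interleave threads whose own operations stay in program order."""
    total, remaining = 1, sum(ops_per_thread)
    for ops in ops_per_thread:
        total *= comb(remaining, ops)  # choose which slots this thread occupies
        remaining -= ops
    return total


def enumerate_interleavings(ops_per_thread):
    """Brute-force check: the distinct schedules as tuples of thread indices."""
    labels = [t for t, ops in enumerate(ops_per_thread) for _ in range(ops)]
    return sorted(set(permutations(labels)))


if __name__ == "__main__":
    print(count_interleavings([2, 2]))           # two threads, two instructions each
    print(len(enumerate_interleavings([2, 2])))  # same count by enumeration
```

For two threads of two instructions each, both functions give 6 schedules. The count grows very quickly with the number of instructions, which is why "enough times" is hard to pin down.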

Lots of crashes in multithreaded programs have nothing to do with the multithreading itself (or the associated resource contention).

"Normally it is said that multi threaded programs are non-deterministic, meaning that if it crashes it will be next to impossible to recreate the error that caused the condition."
I disagree with this entirely. Sure, multi-threaded programs are non-deterministic, but so are single-threaded ones, considering user input, message pumps, mouse/keyboard handling, and many other factors. A multi-threaded program usually makes it more difficult to reproduce the error, but definitely not impossible. For whatever reason, program execution is not completely random; there is some sort of repeatability (but not predictability). I can usually reproduce multi-threaded bugs rather quickly in my apps, but then I have lots of verbose logging of the end users' actions in my apps.
As an aside, if you are getting crashes, can't you also get crash logs, with call stack info? That will greatly aid in the debugging process.
