Which part of my MATLAB code is multi-threaded?

Is there a way to do some profiling to check which parts of my code use several threads?
I have some MATLAB code; when I run it, I can see on the system monitor that several cores are used. But I cannot figure out which part of the code is multi-threaded...
I wonder if the profiler, or some other tool, can find this out...

A roundabout way is to search for known multi-threaded functions. Here's the best list I know of for that: http://www.walkingrandomly.com/?p=1894. Also note that operations like a.*b or sin(a) will be multi-threaded if a and/or b are large enough matrices.
You can also use the profiler to reduce your search space. If most of your code runs multi-threaded most of the time, then wherever your code spends the most time is likely where the multi-threading is occurring.

Related

The number of times to run a profiling experiment

I am trying to profile a CUDA application. I have a basic question about performance analysis and workload characterization of HPC programs. Let us say I want to analyse the wall clock time (the end-to-end execution time of a program). How many times should one run the same experiment to account for the variation in the wall clock time measurement?
Thanks.
How many times should one run the same experiment to account for the variation in the wall clock time measurement?
The question statement assumes that there will be a variation in execution time. Had the question been
How many times should one run CUDA code for performance analysis and workload characterization?
then I would have answered
Once.
Let me explain why ... and give you some reasons for disagreeing with me ...
Fundamentally, computers are deterministic and the execution of a program is deterministic. (Though, and see below, some programs can provide an impression of non-determinism but they do so deterministically unless equipped with exotic peripherals.)
So what might be the causes of a difference in execution times between two runs of the same program?
Physics
Do the bits move faster between RAM and CPU as the temperature of the components varies? I haven't a clue but if they do I'm quite sure that within the usual temperature ranges at which computers operate the relative difference is going to be down in the nano- range. I think any other differences arising from the physics of computation are going to be similarly utterly negligible. Only lesson here, perhaps, is don't do performance analysis on a program which only takes a microsecond or two to execute.
Note that I ignore, for the purposes of this answer, the capability of some processors to adjust their clock rates in response to their temperature. This would have some (possibly large) impact on a program's execution time, but all you'd learn is how to use it as a thermometer.
Contention for System Resources
By which I mean matters such as other processes (including the operating system) running on the same CPU / core, other traffic on the memory bus, other processes using I/O, etc. Sure, yes, these may have a major impact on a program's execution time. But what do variations in run times between runs of your program tell you in these cases? They tell you how busy the system was doing other work at the same time, and they make it very difficult to analyse your program's performance.
A lesson here is to run your program on an otherwise quiet machine. Indeed, one of the characteristics of the management of HPC systems in general is that they aim to provide a quiet platform so that user codes get reliable run times.
Another lesson is to avoid including in your measurement of execution time the time taken for operations, such as disk reads and writes or network communications, over which you have no control.
If your program is a heavy user of, say, disks, then you should probably measure I/O rates using one of the standard benchmarking codes for that purpose, to get a clear idea of the potential impact on your program.
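To make the lesson about excluding uncontrollable I/O concrete, here is a minimal POSIX C sketch; load_input and compute are placeholders standing in for the program's phases, not anything from the question:

    #include <stdio.h>
    #include <time.h>

    /* Placeholders standing in for the real program's phases. */
    static void load_input(void) { /* e.g. fread() of the data set: untimed */ }
    static void compute(void)
    {
        volatile double s = 0.0;
        for (long i = 0; i < 100000000L; ++i) s += i * 0.5;  /* the timed work */
    }

    static double elapsed_s(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        struct timespec t0, t1;

        load_input();                         /* untimed: disk speed is not ours to control */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        compute();                            /* timed: only the work we want to characterize */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("compute time: %.6f s\n", elapsed_s(t0, t1));
        return 0;
    }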
Program Features
There may be aspects of your program which can reasonably be expected to produce different times from one run to the next. For example, if your program relies on randomness then different rolls of the dice might have some impact on execution time. (In this case you might want to run the program more than once to see how sensitive it is to the operations of the RNG.)
However, I exclude from this third source of variability the running of the code with different inputs or parameters. If you want to measure the scalability of program execution time wrt input size then you surely will have to run the program a number of times.
In conclusion
There is very little of interest to be learned, about a program, by running it more than once with no differences in the work it is doing from one run to the next.
And yes, in my early days I was guilty of running the same program multiple times to see how the execution time varied. I learned that it didn't, and that's where I got this answer from.
This kind of test demonstrates how well the compiled application interacts with the OS/computing environment where it will be used, as opposed to the efficiency of a specific algorithm or architecture. I do this kind of test by running the application three times in a row after a clean reboot/spinup. I'm looking for any differences caused by the OS loading and caching libraries or runtime environments on the first execution; and I expect the next two runtimes to be similar to each other (and faster than the first one). If they are not, then more investigation is needed.
Two further comments: it is difficult to be certain that you know what libraries and runtimes your application requires, and how a given computing environment will handle them, if you have a complex application with lots of dependencies.
Also, I recommend avoiding specifying the application runtime for a customer, because it is very hard to control the customer's computing environment. Focus on the things you can control in your application: architecture, algorithms, library version.

How do I profile multithreading problems?

This is the first time I am trying to profile a multi-threaded program.
I suspect the problem is that it is waiting for something, but I have no clue what; the program never reaches 100% CPU, GPU, RAM or I/O use.
Until recently, I've only worked on projects that were single-threaded, or where the threads were very simple (for example, an extra thread just to ensure the UI is not locked while the program works; or, once, a game engine with a separate thread to handle .XM and .IT music files, so that the main thread could do everything while the other thread, on another core, took care of decoding those files).
This program has several threads, and they don't do parallel work on the same tasks; each thread has its own completely separate purpose (for example, one thread is dedicated to handling all sound-related API calls to the OS).
I downloaded the Microsoft performance tools; there is a blog by an ex-Valve employee that explains how to use them for this, and although I managed to make some profiles and whatnot, I don't really understand what I am seeing; it is only a bunch of pretty graphs to me (except the CPU use graph, which I already knew from doing sample-based profiling on single-threaded apps). So, how do I find out why the program is waiting on something? How do I find out what it is waiting for? How do I find which thread is blocking the others?
I look at it as an alternation between two things:
a) measuring overall time, for which all you need is some kind of timer, and
b) finding speedups, which does not mean measuring, in spite of what a lot of people have been told.
Each time you find a speedup, you time the results and do it again.
That's the alternation.
To find speedups, the method I and many people use is random pausing.
The idea is, you get the program running under a debugger and manually interrupt it, several times.
Each time, you examine the state of every thread, including the call stack.
It is very crude, and it is very effective.
The reason this works is that the only way the program can go faster is if it is doing an activity that you can remove, and if that saves a certain fraction of time, you are at least that likely to see it on every pause.
This works whether it is doing I/O, waiting for something, or computing.
It sees things that profilers do not expose, because they make summaries from which speedups can easily hide.
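To put a rough number on that "fraction of time" argument: if a removable activity accounts for a fraction f of the run time, the chance that n random pauses all miss it is (1 - f)^n. For example, something costing 30% of the time is missed by 10 pauses only about (0.7)^10 ≈ 3% of the time, so a handful of pauses is usually enough to catch anything worth fixing.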
The Performance Wizard in the Visual Studio Performance and Diagnostics Hub has a "Resource contention data" profiling mode which lets you analyze concurrency contention among threads, i.e. how the overall performance of a program is affected by threads waiting on other threads. Please refer to this blog post for more details.
PerfView is an extremely powerful profiling tool which lets you analyze the impact of service threads and tasks on the overall performance of a program. A PerfView Tutorial is also available.

Should I use "real" or "user+sys" on the time function?

I understand the difference between "real","user" and "sys" when you use the time command on Linux, as explained on this other thread: What do 'real', 'user' and 'sys' mean in the output of time(1)?
Now I am working on a small comparison between the performance of Python, Java and C, and I am wondering which report I should use.
"User+sys" seems to be the more realistic one, but wouldn't this cause problems when comparing C to Java, for instance, cause the JVM knows how to optimize the code for multi-processors/threads while GCC doesn't?
Also, wouldn't "real" be realistic enough if I make sure no other heavy process is running on the background?
The answer will depend on what you mean by "the performance of (Python|Java|C)". In many cases what a user really cares about is the elapsed wall time, corresponding to real. Suppose you write some piece of code in a reasonable way in several languages and one of the languages can automatically parallelize it to use your 4 cores. If this makes the user wait less time for a reply, then I say this is a fair comparison. Of course it is only valid for that particular machine; the results on a single-core machine could be different. If an app causes page faults, then it makes the user wait. For the user it's no help if you say the app took fewer cycles when they have to wait longer.
Whichever way you measure, be sure to repeat the tests multiple times, as there can be a lot of variation between runs. Languages like Java also need a program to run for some time before reaching top speed, due to JIT compilation (but again: if your program is by definition very short and doesn't allow the Java Virtual Machine to warm up, then that's too bad for Java). Testing performance is very tricky, and even experienced developers are prone to misinterpreting results or measuring something other than what they really intended.
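If you want to see both numbers from inside the program rather than from the shell's time built-in, here is a minimal POSIX C sketch; the busy loop is just a stand-in for the workload being compared. With several busy threads, user+sys can legitimately exceed real:

    #include <stdio.h>
    #include <time.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    static double ts_to_s(struct timespec t) { return t.tv_sec + t.tv_nsec / 1e9; }
    static double tv_to_s(struct timeval t)  { return t.tv_sec + t.tv_usec / 1e6; }

    int main(void)
    {
        struct timespec w0, w1;
        struct rusage ru;

        clock_gettime(CLOCK_MONOTONIC, &w0);

        /* stand-in for the workload being compared */
        volatile double s = 0.0;
        for (long i = 0; i < 50000000L; ++i) s += i * 1e-9;

        clock_gettime(CLOCK_MONOTONIC, &w1);
        getrusage(RUSAGE_SELF, &ru);          /* CPU time of the whole process */

        printf("real     : %.3f s\n", ts_to_s(w1) - ts_to_s(w0));
        printf("user+sys : %.3f s\n", tv_to_s(ru.ru_utime) + tv_to_s(ru.ru_stime));
        return 0;
    }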

Record thread events

Suppose I need to peek at a thread's state at regular intervals and record its state along the whole execution of a program. I wouldn't know how to start thinking about this. Any pointers (pun?)? I'm on Linux, using gcc, pthreads and C, and have access to all the usual Linux tools. Basically, I guess I'm asking how to build a simple profiler for threads that will tell me how long a thread has been in one state or another during the execution of the program.
I want to be able to create graphs like Threadscope does. The X axis is time, the Y axis is core/thread number and the "colors" are state: green means running, orange is garbage collection, and so on. Does this make more sense now?
For a Linux-specific solution, you might like to have a look at /proc/<pid>/stat and /proc/<pid>/task/<tid>/stat for process and thread statistics, respectively. Have a look at the proc(5) manual page for a full description of all the fields there (online at http://man7.org/linux/man-pages/man5/proc.5.html - search for /proc/[pid]/stat). Specifically, at least the fields utime and stime are of interest to you. These are monotonically increasing times, so you need to remember the previously measured value to be able to compute the time spent in the process/thread during a given time slice, in order to produce the data for your graphs. (This is how top(1) works.)
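As a rough sketch of reading those two fields from C on Linux (read_thread_times is a hypothetical helper name; for simplicity it only samples the main thread, whose thread id equals the process id):

    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Read utime/stime (clock ticks) of one thread from
       /proc/<pid>/task/<tid>/stat. Parsing starts after the last ')' because
       the comm field may itself contain spaces and parentheses. */
    static int read_thread_times(pid_t pid, pid_t tid,
                                 unsigned long *utime, unsigned long *stime)
    {
        char path[64], buf[1024];
        snprintf(path, sizeof path, "/proc/%d/task/%d/stat", (int)pid, (int)tid);

        FILE *f = fopen(path, "r");
        if (!f) return -1;
        size_t n = fread(buf, 1, sizeof buf - 1, f);
        fclose(f);
        buf[n] = '\0';

        char *p = strrchr(buf, ')');
        if (!p) return -1;
        /* After ')': state (field 3), then ppid..cmajflt (fields 4-13),
           then utime (14) and stime (15) -- see proc(5). */
        if (sscanf(p + 2, "%*c %*d %*d %*d %*d %*d %*u %*lu %*lu %*lu %*lu %lu %lu",
                   utime, stime) != 2)
            return -1;
        return 0;
    }

    int main(void)
    {
        unsigned long ut, st;
        /* For the main thread, the thread id equals the process id. */
        if (read_thread_times(getpid(), getpid(), &ut, &st) == 0)
            printf("utime=%lu stime=%lu ticks (%ld ticks per second)\n",
                   ut, st, sysconf(_SC_CLK_TCK));
        return 0;
    }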
However, having the profiler distinguish different states makes the problem more complicated: how does the profiler know which state the profiled program is in? It seems to me the profiled program's threads need to signal this in some way to the profiler. You need some kind of tailored solution for this state sharing (unless you can run the different states in different threads and make the distinction this way, which I doubt).
If the state transitions are done in a single place (e.g. enter GC and leave GC in your example), then one way would be as follows:
The monitored threads would get the start and end times of the special states by using POSIX function clock_gettime() - with clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tp) you can get the process time and with clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tp) you can get the thread time (both monotonically increasing, again).
The thread could communicate these timings to the profiler program with some kind of IPC.
If the profiler application knows the thread times of entering and leaving a state, then because it knows the thread time values at the change of measuring slices, it can determine how much of the thread time is spent in the reported states within a reporting time slice (and of course here we need to adjust the start time for a state to equal the start of the next reporting time slice).
The time the whole process has spent on a specific state can be calculated by summing up the thread times for that state.
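A minimal C sketch of the monitored-thread side of that scheme (the "GC" label, report_to_profiler and do_state_work are placeholders I invented; the actual IPC to the profiler is left out):

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical record passed from a monitored thread to the profiler;
       how it travels (pipe, socket, shared memory) is left open here. */
    typedef struct {
        const char *state;                 /* e.g. "GC" */
        struct timespec enter, leave;      /* CLOCK_THREAD_CPUTIME_ID values */
    } state_record;

    static void report_to_profiler(const state_record *r)   /* placeholder IPC */
    {
        double d = (r->leave.tv_sec - r->enter.tv_sec)
                 + (r->leave.tv_nsec - r->enter.tv_nsec) / 1e9;
        printf("%s: %.9f s of thread CPU time\n", r->state, d);
    }

    static void do_state_work(void) { /* stand-in for the work done in the state */ }

    static void run_monitored_state(void)
    {
        state_record r = { .state = "GC" };
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &r.enter);   /* entering the state */
        do_state_work();
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &r.leave);   /* leaving the state  */
        report_to_profiler(&r);
    }

    int main(void)
    {
        run_monitored_state();
        return 0;
    }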
Note that through /proc/<pid>/stat or /proc/<pid>/task/<tid>/stat, the measurement accuracy is not very good (clock ticks, often units of 10 ms), but I do not know of another way of getting timing information from outside the process/thread. The function clock_gettime() gives very accurate times (nominally nanosecond accuracy, but note that at least on some MIPS and ARM systems the accuracy is as bad as with the stat files under /proc, because the Linux kernel lacks an accurate timer-reading implementation for these fields). You would also need to do some experimentation to make sure these two timing sources really give the same results (by reading both values from the same threads). You can of course use these /proc/.../stat files inside the thread, but the accuracy just is not very good unless you spend a lot of time within a state.
Well, the direct match to the profiling info produced by the Haskell compiler and processed by Threadscope is, using C and GCC, the gprof utility (it's part of GNU binutils).
For it to work correctly with pthreads you need each thread to trigger some timer initialization function. This can be done without modifying your code with this pthreads wrapper library: http://sam.zoy.org/writings/programming/gprof.html . I haven't dealt with the problem recently, it may be that something has changed and the wrapper isn't needed anymore...
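For reference, the trick that wrapper library relies on looks roughly like the sketch below (the function names here are invented, and the real library applies the idea transparently through LD_PRELOAD):

    #include <errno.h>
    #include <stdlib.h>
    #include <pthread.h>
    #include <sys/time.h>

    /* gprof's sampling is driven by ITIMER_PROF, which new threads may not
       inherit, so each new thread re-arms the timer value captured in the
       thread that created it. */
    typedef struct {
        void *(*fn)(void *);
        void  *arg;
        struct itimerval prof_timer;       /* parent thread's ITIMER_PROF */
    } trampoline_arg;

    static void *trampoline(void *p)
    {
        trampoline_arg a = *(trampoline_arg *)p;
        free(p);
        setitimer(ITIMER_PROF, &a.prof_timer, NULL);   /* re-arm gprof's timer */
        return a.fn(a.arg);
    }

    int profiled_pthread_create(pthread_t *tid, const pthread_attr_t *attr,
                                void *(*fn)(void *), void *arg)
    {
        trampoline_arg *a = malloc(sizeof *a);
        if (!a) return EAGAIN;
        a->fn  = fn;
        a->arg = arg;
        getitimer(ITIMER_PROF, &a->prof_timer);        /* capture parent's settings */
        return pthread_create(tid, attr, trampoline, a);
    }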
As for a GUI to interpret the profiling results, there is kprof (http://kprof.sourceforge.net). Unfortunately, AFAIK it doesn't produce thread duration graphs; for that you'll have to roll your own solution using the textual info produced by gprof.
If you are not picky about using the "standard" solution offered by GCC, you may want to try this: http://code.google.com/p/gperftools/?redir=1 (I didn't try it personally, but have heard good opinions).
Good luck!
Take a look at Intel VTune Amplifier XE (formerly … Intel Thread Profiler) to see if it will meet your needs.
This and other Intel Linux development tools are available free for non-commercial use.
In the video Using the Timeline in Intel VTune Amplifier XE, which shows the timeline of a multi-threaded application, at 9:20 the presenter mentions:
"...with the frame API you can programmatically mark certain events or phases in your code. And these marks will appear on the timeline."
I think it will be rather difficult to build a simple profiler, simply because there are many different factors that you have to consider, and system profiling is an inherently complex task, made all the more so when you are profiling a multithreaded application. The best advice I can think of is to look at something that already exists, for example OProfile.
One advantage of OProfile is that it is open source, so the source code is available. But beyond this I suspect that asking how to build a profiling application might be beyond the scope of what someone can answer in an SO question, which might be why this question hasn't gotten very many responses. Hopefully looking at some examples will help you get started, and then, if you have more focused questions, you could get some more detailed responses.

Threading run time without adding extra lines in program

Is there any threading library which can parse code, find blocks that can be threaded, and add the required threading instructions accordingly?
Also, I want to compare the performance of a multithreaded program with its single-threaded version. For this I would need to monitor the CPU usage (how much each processor is being used). Is there any tool available to do this?
I'd say the decision whether or not a given block of code can be rewritten to be multi-threaded is way too hard for an automated process to make. To make matters worse, multi-threaded code typically accesses resources outside its own scope, such as pulling data over the network, loading large files, waiting for events, executing database queries, etc.; without detailed information about all these external factors, it is impossible to decide where to go multithreaded, simply because not all the required information is in the code.
Also, a lot of code that is multi-threadable in theory will not run faster if multi-threaded, but in fact slow down.
Some compilers (such as recent versions of the Intel compiler and gcc) can automatically parallelize simple loops, but anything beyond that is too complex. On the other hand, there are task libraries that use thread pools, and will automatically scale the number of threads to the available processors, and divide the work between them. Of course, using such a library will require rewriting your code to do so.
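To make "simple loops" concrete, the sketch below shows the shape auto-parallelizers look for; the GCC flag mentioned in the comment is one example, and other compilers have their own switches:

    #include <stddef.h>

    /* The kind of loop an auto-parallelizer can handle: a simple counted
       loop with independent iterations, no I/O and no early exit. With GCC
       this can be split across threads using, for example,
       -O2 -ftree-parallelize-loops=4; loops with carried dependencies or
       calls to unknown functions are generally left alone. */
    void saxpy(size_t n, float a, const float *x, float *y)
    {
        for (size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];   /* iteration i touches only index i */
    }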
Structuring your application to make best use of multithreading is not a simple matter, and requires careful thought about which parts of your application can best make use of it. This is not something that can be automated.
Consider multi-threading as an approach to make full utilization of available resources. This is when it works the best. Consider an application which has multiple modules/areas which are multi-threadable. If all of them are made multi-threaded, the available resources might go down substantially. This could at times be detrimental to the application itself. Thus, multi-threading has to be used very carefully.
As Chris mentioned, there are a lot of profilers that do profiling for a given combination of OS and language.
The first thing you need to do is profile your code in a single thread and see if the areas you think are good candidates for multithreading are actually a problem. It's easy to waste a lot of time multithreading working code only to end up with a buggy mess that's slower than the original implementation if you don't carefully consider the problem first.
