Should I use "real" or "user+sys" on the time function? - linux

I understand the difference between "real","user" and "sys" when you use the time command on Linux, as explained on this other thread: What do 'real', 'user' and 'sys' mean in the output of time(1)?
Now I am working on a small comparison between the performance of Python, Java and C, and I am wondering which report I should use.
"User+sys" seems to be the more realistic one, but wouldn't this cause problems when comparing C to Java, for instance, because the JVM knows how to optimize the code for multiple processors/threads while GCC doesn't?
Also, wouldn't "real" be realistic enough if I make sure no other heavy process is running in the background?

The answer will depend on what you mean by "the performance of (Python|Java|C)". In many cases what a user really cares about is the elapsed wall time, corresponding to real. Suppose you write some piece of code in a reasonable way in several languages and one of the languages can automatically parallelize it to use your 4 cores. If this makes the user wait less time for a reply, then I say this is a fair comparison. Of course, the result is valid only for that particular machine; the results on a single-core machine could be different. If an app causes page faults, then it makes the user wait. For the user it's no help if you say the app took fewer cycles if they have to wait longer.
Whichever way you measure, be sure to repeat the tests multiple times, as there can be lots of variation between runs. Languages like Java also need a program to run for some time before it reaches top speed, due to JIT compilation (but again: if your program is very short by definition and doesn't allow the Java Virtual Machine to warm up, then well it's too bad for Java). Testing performance is very tricky and even experienced developers are prone to misinterpreting results or measuring something other than what they intended.
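As a rough illustration of the two readings, the sketch below runs a child process several times and records both the wall-clock time ("real") and the CPU time charged to child processes ("user+sys"). The helper names measure and repeat, the run count, and the toy workload are assumptions made for illustration. os.times() reports child CPU time on Unix-like systems; on Windows those fields stay at zero.

```python
import os
import statistics
import subprocess
import sys
import time

# Hypothetical CPU-bound workload, executed in a child interpreter.
WORKLOAD = "sum(i * i for i in range(200_000))"


def measure(cmd):
    """Run cmd once and return (real, user_plus_sys) in seconds."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # waits for the child to exit
    real = time.perf_counter() - start
    after = os.times()
    cpu = ((after.children_user - before.children_user)
           + (after.children_system - before.children_system))
    return real, cpu


def repeat(cmd, runs=5):
    """Repeat the measurement and summarise both readings."""
    samples = [measure(cmd) for _ in range(runs)]
    reals = [real for real, _ in samples]
    cpus = [cpu for _, cpu in samples]
    return {
        "real_min": min(reals),
        "real_median": statistics.median(reals),
        "cpu_min": min(cpus),
        "cpu_median": statistics.median(cpus),
    }


if __name__ == "__main__":
    print(repeat([sys.executable, "-c", WORKLOAD]))
```

For a single-threaded child on a quiet machine the two medians should be close; a child that uses several cores can show user+sys larger than real.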

Related

The number of times to run a profiling experiment

I am trying to profile a CUDA application. I have a basic question about performance analysis and workload characterization of HPC programs. Let us say I want to analyse the wall clock time (the end-to-end time of execution of a program). How many times should one run the same experiment to account for the variation in the wall clock time measurement?
Thanks.
How many times should one run the same experiment to account for the variation in the wall clock time measurement?
The question statement assumes that there will be a variation in execution time. Had the question been
How many times should one run CUDA code for performance analysis and workload characterization?
then I would have answered
Once.
Let me explain why ... and give you some reasons for disagreeing with me ...
Fundamentally, computers are deterministic and the execution of a program is deterministic. (Though, and see below, some programs can provide an impression of non-determinism but they do so deterministically unless equipped with exotic peripherals.)
So what might be the causes of a difference in execution times between two runs of the same program?
Physics
Do the bits move faster between RAM and CPU as the temperature of the components varies? I haven't a clue but if they do I'm quite sure that within the usual temperature ranges at which computers operate the relative difference is going to be down in the nano- range. I think any other differences arising from the physics of computation are going to be similarly utterly negligible. Only lesson here, perhaps, is don't do performance analysis on a program which only takes a microsecond or two to execute.
Note that I ignore, for the purposes of this answer, the capability of some processors to adjust their clock rates in response to their temperature. This would have some (possibly large) impact on a program's execution time, but all you'd learn is how to use it as a thermometer.
Contention for System Resources
By which I mean matters such as other processes (including the operating system) running on the same CPU / core, other traffic on the memory bus, other processes using I/O, etc. Sure, yes, these may have a major impact on a program's execution time. But what do variations in run times between runs of your program tell you in these cases? They tell you how busy the system was doing other work at the same time. And make it very difficult to analyse your program's performance.
A lesson here is to run your program on an otherwise quiet machine. Indeed one of the characteristics of the management of HPC systems in general is that they aim to provide a quiet platform to provide a reliable run time to user codes.
Another lesson is to avoid including in your measurement of execution time the time taken for operations, such as disk reads and writes or network communications, over which you have no control.
If your program is a heavy user of, say, disks, then you should probably be measuring i/o rates using one of the standard benchmarking codes for the purpose to get a clear idea of the potential impact on your program.
Program Features
There may be aspects of your program which can reasonably be expected to produce different times from one run to the next. For example, if your program relies on randomness then different rolls of the dice might have some impact on execution time. (In this case you might want to run the program more than once to see how sensitive it is to the operations of the RNG.)
However, I exclude from this third source of variability the running of the code with different inputs or parameters. If you want to measure the scalability of program execution time wrt input size then you surely will have to run the program a number of times.
In conclusion
There is very little of interest to be learned, about a program, by running it more than once with no differences in the work it is doing from one run to the next.
And yes, in my early days I was guilty of running the same program multiple times to see how the execution time varied. I learned that it didn't, and that's where I got this answer from.
This kind of test demonstrates how well the compiled application interacts with the OS/computing environment where it will be used, as opposed to the efficiency of a specific algorithm or architecture. I do this kind of test by running the application three times in a row after a clean reboot/spinup. I'm looking for any differences caused by the OS loading and caching libraries or runtime environments on the first execution; and I expect the next two runtimes to be similar to each other (and faster than the first one). If they are not, then more investigation is needed.
Two further comments: it is difficult to be certain that you know what libraries and runtimes your application requires, and how a given computing environment will handle them, if you have a complex application with lots of dependencies.
Also, I recommend avoiding specifying the application runtime for a customer, because it is very hard to control the customer's computing environment. Focus on the things you can control in your application: architecture, algorithms, library version.
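The three-runs-in-a-row check described above could be scripted roughly as follows. This sketch times a callable inside one process rather than an application on a freshly rebooted machine, so it only shows the shape of the comparison. The function names, the stand-in workload, and the 25% tolerance are assumptions.

```python
import time


def time_runs(func, runs=3):
    """Call func `runs` times in a row and return each wall-clock duration."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def warm_runs_agree(durations, tolerance=0.25):
    """True if every run after the first lies within `tolerance` of their mean."""
    warm = durations[1:]
    mean = sum(warm) / len(warm)
    return all(abs(d - mean) <= tolerance * mean for d in warm)


def workload():
    # Hypothetical stand-in for launching the real application.
    sorted(range(200_000), key=lambda x: -x)


if __name__ == "__main__":
    durations = time_runs(workload)
    print("first run:", durations[0])
    print("later runs:", durations[1:])
    if not warm_runs_agree(durations):
        print("later runs differ noticeably; more investigation is needed")
```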

Multi-core JIT in multithreaded application

I would like to know how ProfileOptimization (also known as Multi-core JIT) works in a multi-threaded application.
The documentation says that ProfileOptimization tracks and records methods that are called during application execution. But what if multiple threads are executing at the same time? In that case the method call order may differ from run to run, so the profile will always be overwritten with new data.
Does that mean that using Multi-core JIT is not efficient in this scenario? Or maybe ProfileOptimization tracks method calls only from the thread that called ProfileOptimization.StartProfile(...)? Or something else?
Could someone explain how ProfileOptimization behaves in such a case?
It isn't very clear why you think threads are a problem, I'll just noodle about the feature for a while. The traditional way the jitter works is by compiling methods just-in-time, a fraction of a second before the method starts running. That's different with the multicore JIT option, it necessarily needs to compile methods earlier so it can take advantage of an extra core running the jitter. Problem is, what method should it compile early? Clearly there is very little gain if it compiles the wrong one, a method that will only be called minutes from the start of the program. Or worse, is never called.
To figure out what methods it should work on, it needs to know ahead of time what method will run. A time machine is not an option of course. It could only guess at this with some degree of accuracy by knowing what happened previously. With the assumption that, when the program runs for the second time, it will call methods in roughly the same order.
So your call to StartProfile() starts recording the names of the methods that get jitted, simply in the order in which they run for the first time and get compiled. That list of method names is stored in a file. Next time you run the program and call StartProfile() again, it now starts using the data in that file to give other cores work to do, pre-compiling the methods in the order in which they appear in the list.
This has pretty decent odds of having the method already compiled before it runs for the first time, incurring no delay, thus improving the warm-start time of your program. It doesn't have to be already compiled; nothing goes wrong if it wasn't compiled yet, because the normal just-in-time compilation that traditionally happened takes care of it. It just isn't as efficient as it could be.
If your program is highly non-deterministic when it starts, having wildly different execution paths through the code from one run to the next, then no, multicore JIT is unlikely to benefit your startup time. The jitter is going to pre-compile the wrong methods. This is very unusual; real programs rarely behave that way when they start up. That doesn't otherwise have anything to do with threads; they are not likely to be particularly less deterministic than your main thread. The opposite, actually: the main thread is expected to interact with the user, who can behave irrationally as humans do, while your workers don't. And it is a general problem with threads that they tend to settle into execution patterns that hide threading race bugs.
Do keep in mind that all of this only matters in the first, give or take, 30 seconds of your program's life. And only matters to warm-start time. The jitter simply stops recording completely when the jitting rate drops too low.
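The record-then-replay idea described above can be modelled in a few lines. This is a toy model written in Python, not the .NET API and not how the CLR implements it: "compiling" a method only means adding its name to a set, and every name here (ToyJit, call, start_profile, first_use_order) is hypothetical.

```python
import threading


class ToyJit:
    """Toy model: 'compiling' a method just adds its name to a cache."""

    def __init__(self):
        self._compiled = set()
        self._lock = threading.Lock()
        self.first_use_order = []  # the "profile" recorded during this run

    def _compile(self, name):
        """Compile name if needed; return True if this call did the work."""
        with self._lock:
            if name in self._compiled:
                return False
            self._compiled.add(name)
            return True

    def call(self, name):
        """Simulate calling a method; return True if it was jitted on demand."""
        jitted_now = self._compile(name)
        with self._lock:
            if name not in self.first_use_order:
                self.first_use_order.append(name)
        return jitted_now

    def start_profile(self, previous_profile):
        """Pre-compile an earlier run's methods, in order, on another thread."""
        def replay():
            for name in previous_profile:
                self._compile(name)

        worker = threading.Thread(target=replay)
        worker.start()
        return worker


if __name__ == "__main__":
    first_run = ToyJit()
    for method in ["Main", "LoadConfig", "Render"]:
        first_run.call(method)

    second_run = ToyJit()
    # Joined only to make the output repeatable; normally the replay races
    # with the program, and a method that is not ready yet is simply
    # compiled on demand.
    second_run.start_profile(first_run.first_use_order).join()
    print([second_run.call(m) for m in ["Main", "Render", "Export"]])
```

In the second run only "Export", which was absent from the recorded list, has to be compiled on demand; a different call order across threads changes which methods are ready early but never breaks anything.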

boost::io_service::strand performance

I am using a boost::io_service to build a thread pool that executes computational jobs in parallel. Some jobs are not allowed to run concurrently, which - I think - is the ideal application of a boost::io_service::strand. As the order in which the sequential jobs are executed does not matter, I am asking, which of the two ways to use the strand I should use:
strand.post(bind(jobA...));
or
io_service.post(strand.wrap(bind(jobA...)))
If I understand the boost docs correctly, the first version will ensure that the jobs are executed in the same order they were posted, whereas the second version does not give any guarantee.
My question is: Which one is faster?
You can use the two methods described above interchangeably and it will result in identical results. I doubt very much that there is any performance difference, but if there is, it's in the overhead of the two function (strand.post vs io_service.post) calls but not in the actual execution of the io_service since they both do the same thing under the hood and have the same path of execution.
I would guess that io_service.post() requires a handful fewer clock cycles, but in the same breath I'm also guessing that such micro-optimizations are as noticeable in your application as interference from solar radiation and the CPU having to re-execute instructions. I don't even know if that's a real phenomenon or not, but it sounded cool when trying to come up with a verbose way of saying, "don't worry about it". If there is in fact a performance difference, please share the benchmarks. *rolls eyes at self*
Personally, I doubt the end performance difference is detectable in your final system, but simplicity combined with functional sufficiency argues for option 1.
It's more comprehensible, and the io_service route gives you no extra functionality. Because you are indirecting through one extra layer, the io_service, it also necessarily adds extra code that must be executed.
The docs for strand::post are clear that using this method already provides the necessary behavioural guarantees at both io_service and strand levels.
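For readers unfamiliar with the concept, the guarantee a strand provides (jobs posted through it never run concurrently, even on a multi-threaded pool) can be modelled briefly. This is a Python sketch of the idea only, not Boost.Asio's implementation; the Strand class and its post method are hypothetical names that merely echo the Boost ones.

```python
import threading
import time
import traceback
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class Strand:
    """Runs posted jobs on a shared pool, but never two at the same time."""

    def __init__(self, pool):
        self._pool = pool
        self._lock = threading.Lock()
        self._queue = deque()
        self._running = False

    def post(self, func, *args):
        with self._lock:
            self._queue.append((func, args))
            if self._running:
                return  # the active drain loop will pick this job up
            self._running = True
        self._pool.submit(self._drain)

    def _drain(self):
        while True:
            with self._lock:
                if not self._queue:
                    self._running = False
                    return
                func, args = self._queue.popleft()
            try:
                func(*args)
            except Exception:
                traceback.print_exc()  # keep draining after a failed job


def max_overlap(job_count=20, workers=4):
    """Post jobs through one strand; return the peak number running at once."""
    active = 0
    peak = 0
    guard = threading.Lock()

    def job():
        nonlocal active, peak
        with guard:
            active += 1
            peak = max(peak, active)
        time.sleep(0.001)
        with guard:
            active -= 1

    with ThreadPoolExecutor(max_workers=workers) as pool:
        strand = Strand(pool)
        for _ in range(job_count):
            strand.post(job)
    return peak


if __name__ == "__main__":
    print("peak concurrency through the strand:", max_overlap())
```

Jobs submitted straight to the pool without the strand would be free to overlap; through the strand the peak stays at one.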

Threading run time without adding extra lines in program

Is there any thread library which can parse through code, find blocks of code which can be threaded, and add the required threading instructions accordingly?
Also, I want to check the performance of a multithreaded program compared to its single-threaded version. For this I would need to monitor the CPU usage (how much each processor is being used). Is there any tool available to do this?
I'd say the decision whether or not a given block of code can be rewritten to be multi-threaded is way too hard for an automated process to make. To make matters worse, multi-threaded code typically accesses resources outside its own scope, such as pulling data over the network, loading large files, waiting for events, executing database queries, etc.; without detailed information about all these external factors, it is impossible to decide where to go multithreaded, simply because not all the required information is in the code.
Also, a lot of code that is multi-threadable in theory will not run faster if multi-threaded, but will in fact slow down.
Some compilers (such as recent versions of the Intel compiler and gcc) can automatically parallelize simple loops, but anything beyond that is too complex. On the other hand, there are task libraries that use thread pools, and will automatically scale the number of threads to the available processors, and divide the work between them. Of course, using such a library will require rewriting your code to do so.
Structuring your application to make best use of multithreading is not a simple matter, and requires careful thought about which parts of your application can best make use of it. This is not something that can be automated.
Consider multi-threading as an approach to make full utilization of available resources. This is when it works the best. Consider an application which has multiple modules/areas which are multi-threadable. If all of them are made multi-threaded, the available resources might go down substantially. This could at times be detrimental to the application itself. Thus, multi-threading has to be used very carefully.
As Chris mentioned, there are a lot of profilers which do profiling for given combination of OS/language.
The first thing you need to do is profile your code in a single thread and see if the areas you think are good candidates for multithreading are actually a problem. It's easy to waste a lot of time multithreading working code only to end up with a buggy mess that's slower than the original implementation if you don't carefully consider the problem first.
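A minimal sketch of that first step, assuming the code in question is Python and using the standard-library cProfile module. The functions cheap_step, expensive_step and pipeline are hypothetical stand-ins for real application code.

```python
import cProfile
import io
import pstats


def cheap_step(n):
    return sum(range(n))


def expensive_step(n):
    total = 0
    for i in range(n):
        total += (i * i) % 7
    return total


def pipeline():
    cheap_step(1_000)
    return expensive_step(200_000)


def profile_report(func, top=5):
    """Profile func in a single thread; return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_report(pipeline))
```

If the report shows that the step you planned to parallelize accounts for only a small share of the cumulative time, threading it is unlikely to pay off.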

What are the tell-tale signs that my code needs to make use of multi-threading?

I am using a third party API which performs what I would assume are expensive operations in terms of time/resources used (image recognition, etc). What tell-tale signs are there that the code under test should be made to use threads to increase performance?
I have a profiler and will be profiling the code I write which will rely on this API.
Thanks
If you have two distinct sequences of events that don't depend on one another, then consider it. If you otherwise have to write bunches of logic just to make sure that two operations aren't getting in each other's way, threading pays off by making the two pieces of code clearer.
If on the other hand you find that, in attempting to make something multithreaded, you have to add gobs of code to communicate results between the threads, because one (or both) can't proceed without some information from the other, that's a good sign that you are trying to make threads where they don't make sense.
One case where it makes sense to go multi-threaded, even when you have to add communication to do it, is when you have one task that needs to stay available for input, and another to do heavy computing. One thread may poll for input from somewhere, blocking when none is available, so that when input is available it is responded to in a timely manner, and feed jobs to another 'worker' thread, so that processing continues at all times, not just when there's input.
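The input-plus-worker arrangement described above might look roughly like this in Python. It is a sketch under simple assumptions: the "input" is a plain list rather than a real input source, the job is squaring a number, and None is used as a hypothetical shutdown signal.

```python
import queue
import threading


def worker(jobs, results):
    """Process jobs until the None sentinel arrives."""
    while True:
        job = jobs.get()  # blocks while no job is available
        if job is None:
            break
        results.append(job * job)  # stand-in for the heavy computation


def run(inputs):
    jobs = queue.Queue()
    results = []
    thread = threading.Thread(target=worker, args=(jobs, results))
    thread.start()
    # This thread plays the role of the input side: it stays free to accept
    # new input and only hands the work over.
    for item in inputs:
        jobs.put(item)
    jobs.put(None)  # tell the worker to stop
    thread.join()
    return results


if __name__ == "__main__":
    print(run([1, 2, 3]))
```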
One other thing to consider, is that even when a job is 'embarrassingly parallel' (i.e., requiring little or no communication between the parallelized parts), there are cases where multithreading may not be worthwhile. If your CPU can assign different threads to different cores, multithreading will give you a speed up, by allowing multiple cores to chew through the work simultaneously. But on a single core processor, or even a multi-core one with an unfortunate OS, having multiple threads will not speed things up, as the one core will still have to get through all the work.
Image processing is often CPU-bound. However, if your image-processing API is already designed to leverage multiple CPUs, multi-threading probably won't help you. The strategy I usually consider for quickly determining whether multi-threading will help is to write a simple program which does the relevant processing over and over again. Then I run it on a set of data, and then run two instances of the process simultaneously, each on half of the data. There is no need to ensure the data is equalized for such a test; if one process runs out, it will just run one instance for anything left. Timing is done via wall-clock time. I mean this literally: pick a large enough data set that it will take at least a full minute to run, but ideally 5 minutes or longer.
If running two copies at the same time improves throughput significantly, multi-threading is probably a good idea. Obviously this strategy is only practical in certain instances and in some cases multi-threading can involve leveraging shared output in ways this trick can't emulate. But, it's an absurdly easy test to run, and rarely requires much, if any, code to be written.
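The test described above could be scripted along these lines. This is a sketch: the worker one-liner stands in for the real processing program, the item count is arbitrary and far smaller than the multi-minute runs the answer recommends, and the names run_instances and compare are assumptions.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for the real processing program: it takes an item
# count on the command line and does CPU-bound work on that many items.
WORKER = "import sys; n = int(sys.argv[1]); print(sum(i * i for i in range(n)))"


def run_instances(sizes):
    """Start one worker per entry in sizes; return wall-clock seconds until all finish."""
    start = time.perf_counter()
    procs = [
        subprocess.Popen([sys.executable, "-c", WORKER, str(n)],
                         stdout=subprocess.DEVNULL)
        for n in sizes
    ]
    for proc in procs:
        proc.wait()
    return time.perf_counter() - start


def compare(total_items):
    """Time one instance on all the data, then two instances on half each."""
    half = total_items // 2
    single = run_instances([total_items])
    paired = run_instances([half, total_items - half])
    return single, paired


if __name__ == "__main__":
    single, paired = compare(20_000_000)
    print(f"one instance:  {single:.2f} s")
    print(f"two instances: {paired:.2f} s")
```

If the paired time is close to half the single time, the work parallelizes well on this machine; if the two are similar, the processing is probably already using all cores or is limited by something else.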

Resources