Debugging slow functions in C programs (built by gcc) - Linux

Having source like this:
void foo() {
    func1();
    if(qqq) {
        func2();
    };
    func3();
    func4();
    for(...) {
        func5();
    }
}
I want to obtain info like this:
void foo() {
     5 ms;  2 times;    func1();
     0 ms;  2 times;    if(qqq) {
     0 ms;  0 times;        func2();
     0 ms;  2 times;    };
    20 ms;  2 times;    func3();
     5 s ;  2 times;    func4();
     0 ms; 60 times;    for(...) {
    30 ms; 60 times;        func5();
     0 ms; 60 times;    }
}
I.e. information about how long, on average, it took to execute each line (real wall-clock time, including time spent waiting in syscalls) and how many times it was executed.
What tools should I use?
I expect the tool to instrument each function to measure its running time; that measurement would then be used by instrumentation inside the calling function, which writes a log file (or accumulates counts in memory and dumps them afterwards).
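(For reference, gcc can emit exactly this kind of per-function entry/exit hook with -finstrument-functions; below is a minimal sketch of the two callbacks such an instrumentation tool could be built on, simply logging wall-clock timestamps to stderr. It is an illustration, not a finished profiler.)
// timing.c - compile everything with: gcc -finstrument-functions foo.c timing.c
// gcc then calls these hooks around every instrumented function; a post-processing
// step can aggregate the timestamps into per-function (or per-line) wall-clock times.
#include <stdio.h>
#include <time.h>

__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *fn, void *call_site)
{
    (void)call_site;
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);   /* wall time, so blocking syscalls are included */
    fprintf(stderr, "enter %p %ld.%09ld\n", fn, (long)ts.tv_sec, ts.tv_nsec);
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *fn, void *call_site)
{
    (void)call_site;
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    fprintf(stderr, "exit  %p %ld.%09ld\n", fn, (long)ts.tv_sec, ts.tv_nsec);
}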

gprof is pretty standard for GNU-built (gcc, g++) programs: http://www.cs.utah.edu/dept/old/texinfo/as/gprof_toc.html
Here is what the output looks like: http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html#SEC5
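The usual workflow, for reference (a sketch; file and program names are placeholders): compile and link with -pg, run the program once to produce gmon.out, then point gprof at the binary and that file. Note that gprof reports CPU time and call counts, so time spent blocked in syscalls will not show up.
/* foo.c - a toy program to profile with gprof.
 *
 *   gcc -pg -o foo foo.c      # -pg instruments at compile time and link time
 *   ./foo                     # writes gmon.out into the current directory
 *   gprof foo gmon.out        # prints the flat profile and the call graph
 */
#include <stdio.h>

static void busy(long n)
{
    volatile long s = 0;
    for (long i = 0; i < n; ++i)
        s += i;                /* burns CPU so it shows up as the hot function */
}

int main(void)
{
    busy(100000000L);
    puts("done");
    return 0;
}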

Take a trial run of Zoom. You won't be disappointed.
P.S. Don't expect instrumentation to do the job. For either line-level or function-level information, a wall-time stack sampler delivers the goods, assuming you don't absolutely need precise invocation counts (which have little relevance to performance).
ADDED: I'm on Windows, so I just ran your code with LTProf. The output looks like this:
void foo(){
  5   func1();
      if(qqq){
  5       func2();
      }
  5   func3();
  5   func4();
      // I made this 16, not 60, so the total time would be 20 sec.
      for(int i = 0; i < 16; i++){
 80       func5();
      }
}
where each func() does a Sleep(1000) and qqq is True, so the whole thing runs for 20 seconds. The numbers on the left are the percent of samples (6,667 samples) that have that line on them. So, for example, a single call to one of the func functions uses 1 second or 5% of the total time. So you can see that the line where func5() is called uses 80% of the total time. (That is, 16 out of the 20 seconds.) All the other lines were on the stack so little, comparatively, that their percents are zero.
I would present the information differently, but this should give a sense of what stack sampling can tell you.

Either Zoom or Intel VTune.

Related

How to execute a while loop precisely every 10 seconds in windows vc++

Please help me run the following loop precisely every 10 seconds in Windows VC++.
Initially it should start at something like, say, 12:12:40:000; it should neglect the milliseconds it takes to do some work (see the comment), and restart the next iteration at 12:12:50:000, and so on every 10 seconds precisely.
#include <windows.h>
#include <sys/timeb.h>

void controlloop()
{
    struct timeb start, end;
    int elapsedtime, sleeptime;

    while(1)
    {
        ftime(&start);
        if(start.time % 10 == 0)
            break;
        else
            Sleep(100);
    }
    while(1)
    {
        ftime(&start);
        if(start.time % 10 == 0)
        {
            // some work here which will roughly take 100 ms
            ftime(&end);
            elapsedtime = (int)(1000.0 * (end.time - start.time) + (end.millitm - start.millitm));
            if(elapsedtime > 10000)
            {
                sleeptime = 0;
            }
            else
            {
                sleeptime = 10000 - elapsedtime;
            }
        }
        Sleep(sleeptime);
    }//1
}
The Sleep approach only guarantees that you sleep at least 10 seconds. After that, your thread merely becomes eligible for scheduling and will be considered again on the next quantum. You are still subject to the priority of any other threads on the system, the number of logical cores, etc. You are also subject to the resolution of the scheduling quantum, which is ~15 ms by default. You can change it with timeBeginPeriod, but that has system-wide power implications.
For more information on Windows scheduling see Microsoft Docs. For more on the power issues, see this blog post.
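If you do accept that cost and raise the timer resolution, the pattern is simply to bracket the timing-sensitive region with timeBeginPeriod/timeEndPeriod. A minimal sketch; the 1 ms period and the Sleep loop are only illustrative:
#include <windows.h>
#pragma comment(lib, "winmm.lib")   // timeBeginPeriod/timeEndPeriod live in winmm

int main()
{
    // Request 1 ms scheduler granularity; this is system-wide, so always pair
    // it with a matching timeEndPeriod when you are done.
    timeBeginPeriod(1);

    for (int i = 0; i < 10; ++i)
        Sleep(10);   // now wakes up much closer to 10 ms than the default ~15 ms quantum

    timeEndPeriod(1);
    return 0;
}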
For Windows the best option is to use the high-frequency performance counter via QueryPerformanceCounter. You use QueryPerformanceFrequency to convert between ticks and seconds.
LARGE_INTEGER qpcFrequency;
QueryPerformanceFrequency(&qpcFrequency);

LARGE_INTEGER startTime;
QueryPerformanceCounter(&startTime);

LARGE_INTEGER tenSeconds;
tenSeconds.QuadPart = startTime.QuadPart + qpcFrequency.QuadPart * 10;

while (true)
{
    LARGE_INTEGER currentTime;
    QueryPerformanceCounter(&currentTime);
    if (currentTime.QuadPart >= tenSeconds.QuadPart)
        break;
}
The timer resolution for QPC is typically close to the cycle speed of your CPU.
If you want to run a thread for as close to 10 seconds as you can while still yielding the processor, use:
LARGE_INTEGER qpcFrequency;
QueryPerformanceFrequency(&qpcFrequency);

LARGE_INTEGER startTime;
QueryPerformanceCounter(&startTime);

LARGE_INTEGER tenSeconds;
tenSeconds.QuadPart = startTime.QuadPart + qpcFrequency.QuadPart * 10;

while (true)
{
    LARGE_INTEGER currentTime;
    QueryPerformanceCounter(&currentTime);
    if (currentTime.QuadPart >= tenSeconds.QuadPart)
    {
        // do a thing
        tenSeconds.QuadPart = currentTime.QuadPart + qpcFrequency.QuadPart * 10;
    }

    SwitchToThread();
}
This is not really the most efficient way to do a periodic timer, but you asked for precision, not efficiency.
If you are using VS 2015 or later, you can use the C++11 type high_resolution_clock, which uses QPC for its implementation. Older versions of Visual C++ used 'file system time', which takes you back to the original resolution problem you had with ftime.
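For completeness, here is a rough sketch of the same periodic loop written against std::chrono (steady_clock is the monotonic clock; on VS 2015+ high_resolution_clock is an alias for it). The 10-second period and the commented-out doWork() call are placeholders:
#include <chrono>
#include <thread>

int main()
{
    using namespace std::chrono;

    // Schedule each deadline relative to the previous one, so the time spent
    // in the work itself does not drift the period.
    auto next = steady_clock::now() + seconds(10);
    while (true)
    {
        std::this_thread::sleep_until(next);
        // doWork();            // roughly 100 ms of work goes here
        next += seconds(10);
    }
    return 0;
}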

Efficient way to iterate collection objects in Groovy 2.x?

What is the best and fastest way to iterate over Collection objects in Groovy? I know there are several Groovy collection utility methods, but they use closures, which are slow.
The final result in your specific case might be different; however, benchmarking 5 different iteration variants available in Groovy shows that the old Java for-each loop is the most efficient one. Take a look at the following example, where we iterate over 100 million elements and calculate their total sum in a very imperative way:
@Grab(group='org.gperfutils', module='gbench', version='0.4.3-groovy-2.4')

import java.util.concurrent.atomic.AtomicLong
import java.util.function.Consumer

def numbers = (1..100_000_000)

def r = benchmark {
    'numbers.each {}' {
        final AtomicLong result = new AtomicLong()
        numbers.each { number -> result.addAndGet(number) }
    }
    'for (int i = 0 ...)' {
        final AtomicLong result = new AtomicLong()
        for (int i = 0; i < numbers.size(); i++) {
            result.addAndGet(numbers[i])
        }
    }
    'for-each' {
        final AtomicLong result = new AtomicLong()
        for (int number : numbers) {
            result.addAndGet(number)
        }
    }
    'stream + closure' {
        final AtomicLong result = new AtomicLong()
        numbers.stream().forEach { number -> result.addAndGet(number) }
    }
    'stream + anonymous class' {
        final AtomicLong result = new AtomicLong()
        numbers.stream().forEach(new Consumer<Integer>() {
            @Override
            void accept(Integer number) {
                result.addAndGet(number)
            }
        })
    }
}

r.prettyPrint()
This is just a simple example where we try to benchmark the cost of iterating over a collection, regardless of the operation executed for each element (all variants perform the same operation, to give the most accurate results). And here are the results (time measurements are expressed in nanoseconds):
Environment
===========
* Groovy: 2.4.12
* JVM: OpenJDK 64-Bit Server VM (25.181-b15, Oracle Corporation)
* JRE: 1.8.0_181
* Total Memory: 236 MB
* Maximum Memory: 3497 MB
* OS: Linux (4.18.9-100.fc27.x86_64, amd64)
Options
=======
* Warm Up: Auto (- 60 sec)
* CPU Time Measurement: On
WARNING: Timed out waiting for "numbers.each {}" to be stable
                                user    system         cpu        real
numbers.each {}           7139971394  11352278  7151323672  7246652176
for (int i = 0 ...)       6349924690   5159703  6355084393  6447856898
for-each                  3449977333    826138  3450803471  3497716359
stream + closure          8199975894    193599  8200169493  8307968464
stream + anonymous class  3599977808   3218956  3603196764  3653224857
Conclusion
Java's for-each is as fast as Stream + anonymous class (Groovy 2.x does not allow using lambda expressions).
The old for (int i = 0; ...) loop is almost twice as slow as for-each - most probably because of the additional effort of retrieving the value at a given index.
Groovy's each method is a little bit faster than the stream + closure variant, and both are more than twice as slow as the fastest one.
It's important to run benchmarks for your specific use case to get the most accurate answer. For instance, the Stream API will most probably be the best choice if other operations (filtering, mapping, etc.) are applied along with the iteration. For a simple pass from the first to the last element of a given collection, the old Java for-each may give the best results, because it adds very little overhead.
Also, the size of the collection matters. For instance, if we take the above example but iterate over 100k elements instead of 100 million, the slowest variant costs 0.82 ms versus 0.38 ms for the fastest. If you build a system where every nanosecond matters, then you have to pick the most efficient solution. But if you build a simple CRUD application, it doesn't matter whether iterating over a collection takes 0.82 or 0.38 milliseconds - the cost of a database connection is at least 50 times bigger, so saving approximately 0.44 milliseconds would not make any impact.
// Results for iterating over 100k elements
Environment
===========
* Groovy: 2.4.12
* JVM: OpenJDK 64-Bit Server VM (25.181-b15, Oracle Corporation)
* JRE: 1.8.0_181
* Total Memory: 236 MB
* Maximum Memory: 3497 MB
* OS: Linux (4.18.9-100.fc27.x86_64, amd64)
Options
=======
* Warm Up: Auto (- 60 sec)
* CPU Time Measurement: On
                            user  system     cpu    real
numbers.each {}           717422       0  717422  722944
for (int i = 0 ...)       593016       0  593016  600860
for-each                  381976       0  381976  387252
stream + closure          811506    5884  817390  827333
stream + anonymous class  408662    1183  409845  416381
UPDATE: Dynamic invocation vs static compilation
There is also one more factor worth taking into account - static compilation. Below you can find the results of the same benchmark run over a 10-million-element collection:
Environment
===========
* Groovy: 2.4.12
* JVM: OpenJDK 64-Bit Server VM (25.181-b15, Oracle Corporation)
* JRE: 1.8.0_181
* Total Memory: 236 MB
* Maximum Memory: 3497 MB
* OS: Linux (4.18.10-100.fc27.x86_64, amd64)
Options
=======
* Warm Up: Auto (- 60 sec)
* CPU Time Measurement: On
                                   user   system        cpu       real
Dynamic each {}               727357070        0  727357070  731017063
Static each {}                141425428   344969  141770397  143447395
Dynamic for-each              369991296   619640  370610936  375825211
Static for-each                92998379    27666   93026045   93904478
Dynamic for (int i = 0; ...)  679991895  1492518  681484413  690961227
Static for (int i = 0; ...)   173188913        0  173188913  175396602
As you can see, turning on static compilation (with the @CompileStatic class annotation, for instance) is a game changer. Of course, Java's for-each is still the most efficient; however, its static variant is almost 4 times faster than the dynamic one. Static Groovy each {} is 5 times faster than the dynamic each {}. And the static for loop is also 4 times faster than the dynamic for loop.
Conclusion - for 10 million elements, static numbers.each {} takes 143 milliseconds while static for-each takes 93 milliseconds for the same collection. It means that for a collection of 100k elements, static numbers.each {} will cost approximately 0.14 ms and static for-each approximately 0.09 ms. Both are very fast, and the real difference only shows up when the size of the collection explodes to 100+ million elements.
Java stream from Java compiled class
And to give you some perspective - here is a Java class with stream().forEach() over 10 million elements, for comparison:
Java stream.forEach() 87271350 160988 87432338 88563305
Just a little bit faster than the statically compiled for-each in the Groovy code.

Exponential performance drop in macOS with timegm(..) due to timezone lock

In my application I experience an exponential slowdown due to concurrent usage of timegm (at the moment around 800 times slower). I uploaded my test code here: http://cpp.sh/363nv
void* testThread(void* args)
{
    time_t tt;
    tm t;
    time(&tt);
    localtime_r(&tt, &t);

    for (int i = 0; i < 5000; i++)
    {
        timegm(&t);
    }
    return nullptr;
}
The performance drop is caused by a timezone lock. Any idea how to bypass this lock? Depending on whether I am misusing timegm(..) or not, I will file a report with Apple. Thanks for any help.
Here are my results:
macOS 10.13.6 (Macbook Pro 2.5 GHz Intel Core i7):
5000 calls of 'timegm' with 1 thread(s) took 0.03634 seconds
5000 calls of 'timegm' with 8 thread(s) took 24.84911 seconds
For comparison, here are the results of an online C++ execution:
Linux:
5000 calls of 'timegm' with 1 thread(s) took 0.00453 seconds
5000 calls of 'timegm' with 8 thread(s) took 0.00588 seconds
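(In case it helps anyone hitting the same lock: since the UTC conversion that timegm performs does not depend on the local timezone at all, one possible workaround is to do it with pure arithmetic, e.g. the well-known days-from-civil algorithm. Below is a sketch of a hypothetical timegm_lockfree; it does no normalization or validation of the input fields.)
#include <ctime>
#include <cstdint>

// Converts a broken-down UTC time to time_t without touching any timezone
// state, so there is nothing for concurrent threads to contend on.
time_t timegm_lockfree(const tm* t)
{
    int y = t->tm_year + 1900;
    unsigned m = t->tm_mon + 1;                                           // 1..12
    unsigned d = t->tm_mday;                                              // 1..31
    y -= m <= 2;
    const int era = (y >= 0 ? y : y - 399) / 400;
    const unsigned yoe = static_cast<unsigned>(y - era * 400);            // [0, 399]
    const unsigned doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1; // [0, 365]
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;           // [0, 146096]
    const int64_t days = era * 146097LL + static_cast<int>(doe) - 719468; // days since 1970-01-01
    return static_cast<time_t>(days * 86400 + t->tm_hour * 3600 + t->tm_min * 60 + t->tm_sec);
}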

Why do I get different results using Parallel.For?

using System.Threading.Tasks;

const int _Total = 1000000;

[ThreadStatic]
static long count = 0;

static void Main(string[] args)
{
    Parallel.For(0, _Total, (i) =>
    {
        count++;
    });
    Console.WriteLine(count);
}
I get a different result every time; can anybody help me and tell me why?
Most likely your "count" variable isn't atomic in any form, so you are getting concurrent modifications that aren't synchronized. Thus, the following sequence of events is possible:
Thread 1 reads "count"
Thread 2 reads "count"
Thread 1 stores value+1
Thread 2 stores value+1
Thus, the "for" loop has done 2 iterations, but the value has only increased by 1. As thread ordering is "random", so will be the result.
Things can get a lot worse, of course:
Thread 1 reads count
Thread 2 increases count 100 times
Thread 1 stores value+1
In that case, all 100 increments done by thread 2 are undone. That can really only happen if the "++" is actually split into at least 2 machine instructions, so it can be interrupted in the middle of the operation. In the one-instruction case, you're only dealing with interleaved hardware threads.
It's a typical race condition scenario.
So, most likely, ThreadStatic is not working here. For this concrete sample, use System.Threading.Interlocked:
void Main()
{
    int total = 1000000;
    int count = 0;
    System.Threading.Tasks.Parallel.For(0, total, (i) =>
    {
        System.Threading.Interlocked.Increment(ref count);
    });
    Console.WriteLine(count);
}
Similar question
C# ThreadStatic + volatile members not working as expected

CPU contention (wait time) for a process in Linux

How can I check how long a process spends waiting for the CPU in a Linux box?
For example, in a loaded system I want to check how long a SQL*Loader (sqlldr) process waits.
It would be useful if there is a command line tool to do this.
I've quickly slapped this together. It prints out the smallest and largest "interferences" from task switching...
#include <sys/time.h>
#include <stdio.h>

double seconds()
{
    timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec + t.tv_usec / 1000000.0;
}

int main()
{
    double min = 999999999, max = 0;
    while (true)
    {
        double c = -(seconds() - seconds());
        if (c < min)
        {
            min = c;
            printf("%f\n", c);
            fflush(stdout);
        }
        if (c > max)
        {
            max = c;
            printf("%f\n", c);
            fflush(stdout);
        }
    }
    return 0;
}
Here's how you should go about measuring it. Have a number of processes, greater than the number of your processors * cores * threading capability, wait (block) on an event that will wake them all up at the same time. One such event is a multicast network packet. Use an instrumentation library like PAPI (or one more suited to your needs) to measure the differences in real and virtual "wakeup" time between your processes. From several iterations of the experiment you can get an estimate of the CPU contention time for your processes. Obviously, it's not going to be at all accurate for multicore processors, but maybe it'll help you.
Cheers.
I had this problem some time back. I ended up using getrusage.
You can get detailed help at:
http://www.opengroup.org/onlinepubs/009695399/functions/getrusage.html
getrusage populates the rusage struct.
Measuring Wait Time with getrusage
You can call getrusage at the beginning of your code and then call it again at the end, or at any appropriate point during execution. You then have initial_rusage and final_rusage. The user time spent by your process is indicated by rusage->ru_utime.tv_sec, and the system time spent by the process is indicated by rusage->ru_stime.tv_sec.
Thus the total user-time spent by the process will be:
user_time = final_rusage.ru_utime.tv_sec - initial_rusage.ru_utime.tv_sec
The total system-time spent by the process will be:
system_time = final_rusage.ru_stime.tv_sec - initial_rusage.ru_stime.tv_sec
If total_time is the time elapsed between the two calls of getrusage then the wait time will be
wait_time = total_time - (user_time + system_time)
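Putting that together, a minimal sketch of the calculation (RUSAGE_SELF; the seconds and microseconds of each timeval are folded together, and do_work() is a placeholder for the code being measured):
#include <sys/resource.h>
#include <sys/time.h>
#include <stdio.h>

static double tv_to_sec(struct timeval tv)
{
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

int main()
{
    struct rusage initial_rusage, final_rusage;
    struct timeval wall_start, wall_end;

    gettimeofday(&wall_start, NULL);
    getrusage(RUSAGE_SELF, &initial_rusage);

    /* do_work();  -- the code you want to measure goes here */

    getrusage(RUSAGE_SELF, &final_rusage);
    gettimeofday(&wall_end, NULL);

    double user_time   = tv_to_sec(final_rusage.ru_utime) - tv_to_sec(initial_rusage.ru_utime);
    double system_time = tv_to_sec(final_rusage.ru_stime) - tv_to_sec(initial_rusage.ru_stime);
    double total_time  = tv_to_sec(wall_end) - tv_to_sec(wall_start);

    /* whatever was not spent on the CPU (user or kernel mode) was spent waiting */
    double wait_time = total_time - (user_time + system_time);
    printf("user %.3f s, system %.3f s, wait %.3f s\n", user_time, system_time, wait_time);
    return 0;
}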
Hope this helps
