What is the precision of the bash builtin time command?

I have a script that measures the execution time of a program using the bash builtin command time.
I am trying to understand the precision of this command: as far as I know it reports times in ms, but it uses the getrusage() function, which returns values in microseconds. However, reading this paper, the real precision is only 10 ms, because getrusage samples the time relying on ticks (= 100 Hz). The paper is really old (it mentions Linux 2.2.14 running on a Pentium 166 MHz with 96 MB of RAM).
Is time still using getrusage() and 100 Hz ticks, or is it more precise on modern systems?
The testing machine is running Linux 2.6.32.
EDIT: This is a slightly modified version (which should also compile on old versions of GCC) of muru's code: modifying the value of the variable 'v' also changes the delay between measurements, which helps in spotting the minimum granularity. A value around 500,000 should trigger changes of 1 ms on a relatively recent CPU (first generations of i5/i7 @ ~2.5 GHz).
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>

void dosomething() {
    long v = 1000000;
    while (v > 0)
        v--;
}

int main()
{
    struct rusage r1, r2;
    long t1, t2, min, max;
    int i;

    printf("t1\tt2\tdiff\n");
    for (i = 0; i < 5; i++) {
        getrusage(RUSAGE_SELF, &r1);
        dosomething();
        getrusage(RUSAGE_SELF, &r2);
        /* user + system time, in microseconds */
        t1 = r1.ru_stime.tv_usec + r1.ru_stime.tv_sec*1000000 + r1.ru_utime.tv_usec + r1.ru_utime.tv_sec*1000000;
        t2 = r2.ru_stime.tv_usec + r2.ru_stime.tv_sec*1000000 + r2.ru_utime.tv_usec + r2.ru_utime.tv_sec*1000000;
        printf("%ld\t%ld\t%ld\n", t1, t2, t2-t1);
        if ((i == 0) || (t2-t1 < min))
            min = t2-t1;
        if ((i == 0) || (t2-t1 > max))
            max = t2-t1;
        dosomething();
    }
    printf("Min = %ldus Max = %ldus\n", min, max);
    return 0;
}
However, the precision is bound to the Linux version: with Linux 3 and above the granularity is on the order of microseconds, while on Linux 2.6.32 it could be around 1 ms, probably depending also on the specific distro. I suppose this difference is related to the use of high-resolution timers (HRT) instead of ticks on recent Linux versions.
In any case, time never reports finer than 1 ms, on recent and not-so-recent machines alike.
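For reference, here is a minimal sketch (not part of the original test) that asks the system directly: sysconf(_SC_CLK_TCK) reports USER_HZ, the unit in which CPU time is exposed to user space (fixed at 100 on Linux, independent of the actual scheduler tick), and clock_getres() reports the resolution the kernel claims for the per-process CPU-time clock (which may be finer than the real accounting granularity on tick-based kernels):
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* link with -lrt on older glibc for clock_getres() */
int main(void)
{
    /* USER_HZ: units of clock_t as seen from user space; always 100 on Linux,
     * so it does not reveal the real scheduler tick (CONFIG_HZ) */
    long hz = sysconf(_SC_CLK_TCK);
    printf("_SC_CLK_TCK = %ld ticks/s\n", hz);

    /* resolution the kernel reports for the per-process CPU-time clock */
    struct timespec res;
    if (clock_getres(CLOCK_PROCESS_CPUTIME_ID, &res) == 0)
        printf("CLOCK_PROCESS_CPUTIME_ID resolution = %ld.%09ld s\n",
               (long)res.tv_sec, res.tv_nsec);
    return 0;
}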

The bash builtin time still uses getrusage(2). On an Ubuntu 14.04 system:
$ bash --version
GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
$ strace -o log bash -c 'time sleep 1'
real 0m1.018s
user 0m0.000s
sys 0m0.001s
$ tail log
getrusage(RUSAGE_SELF, {ru_utime={0, 0}, ru_stime={0, 3242}, ...}) = 0
getrusage(RUSAGE_CHILDREN, {ru_utime={0, 0}, ru_stime={0, 530}, ...}) = 0
write(2, "\n", 1) = 1
write(2, "real\t0m1.018s\n", 14) = 14
write(2, "user\t0m0.000s\n", 14) = 14
write(2, "sys\t0m0.001s\n", 13) = 13
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
exit_group(0) = ?
+++ exited with 0 +++
As the strace output shows, it calls getrusage.
As for precision, the rusage struct that getrusage uses includes timeval objects, and timeval has microsecond precision. From the manpage of getrusage:
ru_utime
This is the total amount of time spent executing in user mode,
expressed in a timeval structure (seconds plus microseconds).
ru_stime
This is the total amount of time spent executing in kernel
mode, expressed in a timeval structure (seconds plus
microseconds).
I think it does better than 10 ms. Take the following example file:
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>

int main()
{
    struct rusage now, then;
    getrusage(RUSAGE_SELF, &then);
    getrusage(RUSAGE_SELF, &now);
    printf("%ld %ld\n",
        then.ru_stime.tv_usec + then.ru_stime.tv_sec*1000000 + then.ru_utime.tv_usec + then.ru_utime.tv_sec*1000000,
        now.ru_stime.tv_usec + now.ru_stime.tv_sec*1000000 + now.ru_utime.tv_usec + now.ru_utime.tv_sec*1000000);
}
Now:
$ make test
cc test.c -o test
$ for ((i=0; i < 5; i++)); do ./test; done
447 448
356 356
348 348
347 347
349 350
The reported differences between successive calls to getrusage are 1 µs or 0 (the minimum). Since it can show a 1 µs difference, the granularity must be at most 1 µs.
If it had a 10 ms tick, the differences would be either zero or at least 10000 µs.

Related

Use linux perf utility to report counters every second like vmstat

There is the perf command-line utility in Linux for accessing hardware performance-monitoring counters; it works using the perf_events kernel subsystem.
perf itself has basically two modes: perf record/perf top to record a sampling profile (a sample is taken, for example, every 100,000th CPU clock cycle or executed instruction), and perf stat to report the total count of cycles/instructions executed for the application (or for the whole system).
Is there a mode of perf that prints a system-wide or per-CPU summary of total counts every second (or every 3, 5, 10 seconds), like vmstat and the sysstat-family tools do (iostat, mpstat, sar -n DEV... as listed in http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html)? For example, with cycles and instructions counters I would get the mean IPC for every second of the system (or of every CPU).
Is there any non-perf tool (in https://perf.wiki.kernel.org/index.php/Tutorial or http://www.brendangregg.com/perf.html) which can get such statistics with the perf_events kernel subsystem? What about system-wide per-process IPC calculation with a resolution of seconds?
perf stat has an "interval-print" option, -I N, where N is an interval in milliseconds; it prints counters every N milliseconds (N >= 10): http://man7.org/linux/man-pages/man1/perf-stat.1.html
-I msecs, --interval-print msecs
Print count deltas every N milliseconds (minimum: 10ms) The
overhead percentage could be high in some cases, for instance
with small, sub 100ms intervals. Use with caution. example: perf
stat -I 1000 -e cycles -a sleep 5
For best results it is usually a good idea to use it with interval
mode like -I 1000, as the bottleneck of workloads can change often.
Results can also be exported in machine-readable form, and with -I the first field is a timestamp:
With -x, perf stat is able to output a not-quite-CSV format output ... optional usec time stamp in fractions of second (with -I xxx)
The periodic printing of vmstat and the sysstat-family tools (iostat, mpstat, etc.) corresponds to perf stat's -I 1000 (every second), for example system-wide (add -A to separate per-CPU counters):
perf stat -a -I 1000
The option is implemented in builtin-stat.c http://lxr.free-electrons.com/source/tools/perf/builtin-stat.c?v=4.8 in the __run_perf_stat function:
531 static int __run_perf_stat(int argc, const char **argv)
532 {
533 int interval = stat_config.interval;
For perf stat -I 1000 with some program argument (forks=1), for example perf stat -I 1000 sleep 10, there is an interval loop (ts is the millisecond interval converted to a struct timespec):
639 enable_counters();
641 if (interval) {
642 while (!waitpid(child_pid, &status, WNOHANG)) {
643 nanosleep(&ts, NULL);
644 process_interval();
645 }
646 }
666 disable_counters();
For the variant of system-wide hardware performance counter monitoring with forks=0 there is another interval loop:
658 enable_counters();
659 while (!done) {
660 nanosleep(&ts, NULL);
661 if (interval)
662 process_interval();
663 }
666 disable_counters();
process_interval() http://lxr.free-electrons.com/source/tools/perf/builtin-stat.c?v=4.8#L347 from the same file uses read_counters(), which loops over the event list and invokes read_counter(), which loops over all known threads and all CPUs and calls the actual reading function:
306 for (thread = 0; thread < nthreads; thread++) {
307 for (cpu = 0; cpu < ncpus; cpu++) {
...
310 count = perf_counts(counter->counts, cpu, thread);
311 if (perf_evsel__read(counter, cpu, thread, count))
312 return -1;
perf_evsel__read is the actual counter read while the program is still running:
1207 int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
1208 struct perf_counts_values *count)
1209 {
1210 memset(count, 0, sizeof(*count));
1211
1212 if (FD(evsel, cpu, thread) < 0)
1213 return -EINVAL;
1214
1215 if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) < 0)
1216 return -errno;
1217
1218 return 0;
1219 }
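As for doing it without the perf tool: the same per-second counting can be done directly against the perf_events kernel subsystem with perf_event_open(2). Below is a minimal, hedged sketch (not part of perf itself): it counts cycles and instructions system-wide but on CPU 0 only, prints the per-second deltas and the resulting IPC, and assumes you have root or a permissive kernel.perf_event_paranoid setting.
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

/* thin wrapper: glibc provides no perf_event_open() function */
static int open_counter(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = config;
    /* pid = -1, cpu = 0: count every process, but only on CPU 0 */
    return syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
}

int main(void)
{
    int cyc_fd = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    int ins_fd = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    if (cyc_fd < 0 || ins_fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    uint64_t prev_cyc = 0, prev_ins = 0;
    for (int i = 0; i < 10; i++) {          /* ten one-second intervals */
        sleep(1);
        uint64_t cyc = 0, ins = 0;
        read(cyc_fd, &cyc, sizeof(cyc));    /* counters are cumulative */
        read(ins_fd, &ins, sizeof(ins));
        printf("cycles=%llu instructions=%llu IPC=%.2f\n",
               (unsigned long long)(cyc - prev_cyc),
               (unsigned long long)(ins - prev_ins),
               cyc > prev_cyc ? (double)(ins - prev_ins) / (cyc - prev_cyc) : 0.0);
        prev_cyc = cyc;
        prev_ins = ins;
    }
    return 0;
}
In practice perf stat -I is still more convenient, since it handles all CPUs, event multiplexing and the output format for you; the sketch only shows what the underlying subsystem provides.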

Crosscompiling from Linux to Windows, but Windows terminal won't stop? (gcc)

quick question;
I'm using Ubuntu as my coding environment, and I am trying to write a C program for Windows for school.
The assignment says I have to do something using the system clock, and I decided to make a quick benchmarking program. Here it is:
#include <stdio.h>
#include <unistd.h>
#include <time.h>
int main () {
int i = 0;
int p = (int) getpid();
int n = 0;
clock_t cstart = clock();
clock_t cend = 0;
for (i=0; i<100000000; i++) {
long f = (((i+9)*99)%4)+(8+i*999);
if (i % p == 0)
printf("i=%d, f=%ld\n", i, f);
}
cend = clock();
printf ("%.3f cpu sec\n", ((double)cend - (double)cstart)* 1.0e-6);
return 0;
}
When I cross compile from Ubuntu to Windows using mingw32, it's fine. However, when I run the program in Windows, two issues happen:
The benchmark runs as expected, and takes roughly 5 seconds, yet the timer says it took 0.03 seconds (this doesn't happen when testing in my Ubuntu VM; if the benchmark takes 5 seconds in real time, the timer will say 5 seconds. So obviously, this is an issue.)
Then, once the program is done, the Windows terminal will close immediately.
How do I make the program stay open so you can look at your time for more than about 10 milliseconds, and how can I make the runtime of the benchmark reflect its score like it does when I test in Ubuntu?
Thanks!
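For reference, a hedged sketch of the timing part (not the original assignment code): clock() should be scaled by CLOCKS_PER_SEC rather than a hard-coded 1e-6, since it is 1,000,000 on POSIX but only 1,000 with the Windows CRT (where clock() also measures wall time since process start), and a final getchar() keeps the console window open until a key is pressed.
#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t cstart = clock();

    /* placeholder busy loop standing in for the benchmark */
    volatile long long sink = 0;
    for (long long i = 0; i < 100000000LL; i++)
        sink += (((i + 9) * 99) % 4) + (8 + i * 999);

    clock_t cend = clock();

    /* divide by CLOCKS_PER_SEC instead of hard-coding 1e-6 */
    printf("%.3f cpu sec\n", (double)(cend - cstart) / CLOCKS_PER_SEC);

    /* keep the console window open */
    printf("Press Enter to exit...\n");
    getchar();
    return 0;
}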

high resource usage program stalls/crashes linux

I have a program that reads about 1000 images and creates a statistical summary of their contents. Each image is processed in its own thread using OpenMP, and I have the thread limit set to match my number of processors.
Until about two weeks ago, the program ran fine. Now, however, if I run the program more than once, my system slows down and eventually freezes up.
In order to troubleshoot, I wrote the simple code listed below that emulates what my program is doing. This code will freeze my system, just as my original program does, after trying to read only a few files at line 35.
I ran the program, successively reverting to an earlier kernel after each failure, and found that it fails with all 3.6 kernels up to version 3.6.8.
However, when I go back to kernel 3.5.6, it works.
test.cc:
1 #include <cstdio>
2 #include <iostream>
3 #include <vector>
4 #include <unistd.h>
5
6 using namespace std;
7
8 int main ()
9 {
10 // number of files
11 const size_t N = 1000;
12 // total system memory
13 const size_t MEM = sysconf (_SC_PHYS_PAGES) * sysconf (_SC_PAGE_SIZE);
14 // file size
15 const size_t SZ = MEM/N;
16
17 // create temp filenames
18 vector<string> fn (N);
19 for (size_t i = 0; i < fn.size (); ++i)
20 fn[i] = string (tmpnam (NULL));
21
22 // write a bunch of files to disk
23 for (size_t i = 0; i < fn.size (); ++i)
24 {
25 vector<char> a (SZ);
26 FILE *fp = fopen (fn[i].c_str (), "wb");
27 fwrite (&a[0], a.size (), 1, fp);
28 clog << fn[i] << " written" << endl;
29 }
30
31 // read a bunch of files from disk
32 #pragma omp parallel for
33 for (size_t i = 0; i < fn.size (); ++i)
34 {
35 vector<char> a (SZ);
36 FILE *fp = fopen (fn[i].c_str (), "rb");
37 fread (&a[0], a.size (), 1, fp);
38 clog << fn[i] << " read" << endl;
39 }
40
41 return 0;
42 }
Makefile:
a:
	g++ -fopenmp -Wall -o test -g test.cc
	./test
My question is: What is different about kernel 3.6 that would cause this program to fail, but does not cause it to fail in version 3.5?
Without going through the code: if you want to set some limits on your processes, have a look at cgroups for limiting resource usage.
As for the freezing - you are trying to read/write GBs of data to disk at once. Given the ~100 MB/s speeds of today's hard drives, I would expect a freeze at the moment the kernel decides to flush the caches to disk - which will probably occur as soon as you try to read a reasonably sized chunk of data from the disk under memory pressure (since you allocated lots of memory, the space for caches is limited).
You can try to mmap() the files or change the kernel I/O scheduler.
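For illustration, a minimal sketch of the mmap() approach for one file (the command-line argument and the byte-sum "work" are placeholders, not part of the original program):
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* map a file instead of fread()ing it into a freshly allocated buffer,
 * so the kernel can serve the data straight from the page cache */
int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* touch the data so the pages are actually faulted in */
    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; ++i)
        sum += (unsigned char)p[i];
    printf("%s: %ld bytes, checksum %lu\n", argv[1], (long)st.st_size, sum);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}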
I haven't looked in depth at your code, but I noticed some practices that seem problematic (at least, I think they are):
First, the critical section inside the OpenMP loop. That is a synchronization point, and putting it in every iteration sounds problematic to me. Since each thread must be sure no other thread has entered it, the overhead that synchronization introduces probably increases with the number of threads.
Second: I am not very used to C++, but I guess that every time vector<char> a (SZ) is executed, memory is allocated (and freed at the end of the block). Excuse me if I am wrong. Since you know the value of SZ beforehand, you had better allocate a vector<vector<char> > with as many elements as there are threads before the parallel region. Then, in the parallel region, each thread would access its own vector<char>, as in the sketch below.
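A minimal C sketch of that idea (the C++ version would keep a vector<vector<char>> sized to the number of threads); the buffer size and the file handling are placeholders, and it must be compiled with -fopenmp:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t SZ = 1 << 20;   /* placeholder buffer size; the test program uses MEM/N */
    const long   N  = 1000;      /* number of files, as in the test program */

    /* one reusable buffer per thread, allocated once before the parallel region */
    int nthreads = omp_get_max_threads();
    char **buf = malloc(nthreads * sizeof *buf);
    for (int t = 0; t < nthreads; ++t)
        buf[t] = malloc(SZ);

    #pragma omp parallel for
    for (long i = 0; i < N; ++i) {
        char *a = buf[omp_get_thread_num()];  /* no per-iteration allocation */
        /* ... fopen()/fread() the i-th file into a ... */
        (void)a;
    }

    for (int t = 0; t < nthreads; ++t)
        free(buf[t]);
    free(buf);
    return 0;
}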

Why is clock_gettime so erratic?

Intro
Section Old Question contains the initial question (Further Investigation and Conclusion have been added since).
Skip to the section Further Investigation below for a detailed comparison of the different timing methods (rdtsc, clock_gettime and QueryThreadCycleTime).
I believe the erratic behaviour of CGT can be attributed to either a buggy kernel or a buggy CPU (see section Conclusion).
The code used for testing is at the bottom of this question (see section Appendix).
Apologies for the length.
Old Question
In short: I am using clock_gettime to measure the execution time of many code segments. I am experiencing very inconsistent measurements between separate runs. The method has an extremely high standard deviation when compared to other methods (see Explanation below).
Question: Is there a reason why clock_gettime would give so inconsistent measurements when compared to other methods? Is there an alternative method with the same resolution that accounts for thread idle time?
Explanation: I am trying to profile a number of small parts of C code. The execution time of each of the code segments is not more than a couple of microseconds. In a single run, each of the code segments will execute some hundreds of times, which produces runs × hundreds of measurements.
I also have to measure only the time the thread actually spends executing (which is why rdtsc is not suitable). I also need a high resolution (which is why times is not suitable).
I've tried the following methods:
rdtsc (on Linux and Windows),
clock_gettime (with 'CLOCK_THREAD_CPUTIME_ID'; on Linux), and
QueryThreadCycleTime (on Windows).
Methodology: The analysis was performed over 25 runs. In each run, each code segment repeats 101 times. Therefore I have 2525 measurements. Then I look at a histogram of the measurements, and also calculate some basic statistics (like the mean, std. dev., median, mode, min, and max).
I do not present how I measured the 'similarity' of the three methods, but this simply involved a basic comparison of the proportion of time spent in each code segment ('proportion' means that the times are normalised). I then look at the pure differences in these proportions. This comparison showed that 'rdtsc', 'QTCT', and 'CGT' all measure the same proportions when averaged over the 25 runs. However, the results below show that 'CGT' has a very large standard deviation. This makes it unusable in my use case.
Results:
A comparison of clock_gettime with rdtsc for the same code segment (25 runs of 101 measurements = 2525 readings):
clock_gettime:
1881 measurements of 11 ns,
595 measurements were (distributed almost normally) between 3369 and 3414 ns,
2 measurements of 11680 ns,
1 measurement of 1506022 ns, and
the rest is between 900 and 5000 ns.
Min: 11 ns
Max: 1506022 ns
Mean: 1471.862 ns
Median: 11 ns
Mode: 11 ns
Stddev: 29991.034
rdtsc (note: no context switches occurred during this run, but if it happens, it usually results in just a single measurement of 30000 ticks or so):
1178 measurements between 274 and 325 ticks,
306 measurements between 326 and 375 ticks,
910 measurements between 376 and 425 ticks,
129 measurements between 426 and 990 ticks,
1 measurement of 1240 ticks, and
1 measurement of 1256 ticks.
Min: 274 ticks
Max: 1256 ticks
Mean: 355.806 ticks
Median: 333 ticks
Mode: 376 ticks
Stddev: 83.896
Discussion:
rdtsc gives very similar results on both Linux and Windows. It has an acceptable standard deviation--it is actually quite consistent/stable. However, it does not account for thread idle time. Therefore, context switches make the measurements erratic (on Windows I have observed this quite often: a code segment with an average of 1000 ticks or so will take ~30000 ticks every now and then--definitely because of pre-emption).
QueryThreadCycleTime gives very consistent measurements--i.e. much lower standard deviation when compared to rdtsc. When no context switches happen, this method is almost identical to rdtsc.
clock_gettime, on the other hand, is producing extremely inconsistent results (not just between runs, but also between measurements). The standard deviations are extreme (when compared to rdtsc).
I hope the statistics are okay. But what could be the reason for such a discrepancy in the measurements between the two methods? Of course, there is caching, CPU/core migration, and other things. But none of this should be responsible for any such differences between 'rdtsc' and 'clock_gettime'. What is going on?
Further Investigation
I have investigated this a bit further. I have done two things:
Measured the overhead of just calling clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t) (see code 1 in Appendix), and
in a plain loop called clock_gettime and stored the readings into an array (see code 2 in Appendix). I measured the delta times (the difference between successive measurement times, which should roughly correspond to the overhead of a clock_gettime call).
I have measured it on two different computers with two different Linux Kernel versions:
CGT:
CPU: Core 2 Duo L9400 @ 1.86GHz
Kernel: Linux 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386
Results:
Estimated clock_gettime overhead: between 690-710 ns
Delta times:
Average: 815.22 ns
Median: 713 ns
Mode: 709 ns
Min: 698 ns
Max: 23359 ns
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
------------------+-----------
697 < x ≤ 800 -> 78111 <-- cached?
800 < x ≤ 1000 -> 16412
1000 < x ≤ 1500 -> 3
1500 < x ≤ 2000 -> 4836 <-- uncached?
2000 < x ≤ 3000 -> 305
3000 < x ≤ 5000 -> 161
5000 < x ≤ 10000 -> 105
10000 < x ≤ 15000 -> 53
15000 < x ≤ 20000 -> 8
20000 < x -> 5
CPU: 4 × Dual Core AMD Opteron Processor 275
Kernel: Linux 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux
Results:
Estimated clock_gettime overhead: between 279-283 ns
Delta times:
Average: 320.00 ns
Median: 1 ns
Mode: 1 ns
Min: 1 ns
Max: 3495529 ns
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
--------------------+-----------
x ≤ 1 -> 86738 <-- cached?
282 < x ≤ 300 -> 13118 <-- uncached?
300 < x ≤ 440 -> 78
2000 < x ≤ 5000 -> 52
5000 < x ≤ 30000 -> 5
3000000 < x -> 8
RDTSC:
Related code rdtsc_delta.c and rdtsc_overhead.c.
CPU: Core 2 Duo L9400 @ 1.86GHz
Kernel: Linux 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386
Results:
Estimated overhead: between 39-42 ticks
Delta times:
Average: 52.46 ticks
Median: 42 ticks
Mode: 42 ticks
Min: 35 ticks
Max: 28700 ticks
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
------------------+-----------
34 < x ≤ 35 -> 16240 <-- cached?
41 < x ≤ 42 -> 63585 <-- uncached? (small difference)
48 < x ≤ 49 -> 19779 <-- uncached?
49 < x ≤ 120 -> 195
3125 < x ≤ 5000 -> 144
5000 < x ≤ 10000 -> 45
10000 < x ≤ 20000 -> 9
20000 < x -> 2
CPU: 4 × Dual Core AMD Opteron Processor 275
Kernel: Linux 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux
Results:
Estimated overhead: between 13.7-17.0 ticks
Delta times:
Average: 35.44 ticks
Median: 16 ticks
Mode: 16 ticks
Min: 14 ticks
Max: 16372 ticks
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
------------------+-----------
13 < x ≤ 14 -> 192
14 < x ≤ 21 -> 78172 <-- cached?
21 < x ≤ 50 -> 10818
50 < x ≤ 103 -> 10624 <-- uncached?
5825 < x ≤ 6500 -> 88
6500 < x ≤ 8000 -> 88
8000 < x ≤ 10000 -> 11
10000 < x ≤ 15000 -> 4
15000 < x ≤ 16372 -> 2
QTCT:
Related code qtct_delta.c and qtct_overhead.c.
CPU: Core 2 6700 @ 2.66GHz
Kernel: Windows 7 64-bit
Results:
Estimated overhead: between 890-940 ticks
Delta times:
Average: 1057.30 ticks
Median: 890 ticks
Mode: 890 ticks
Min: 880 ticks
Max: 29400 ticks
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
------------------+-----------
879 < x ≤ 890 -> 71347 <-- cached?
895 < x ≤ 1469 -> 844
1469 < x ≤ 1600 -> 27613 <-- uncached?
1600 < x ≤ 2000 -> 55
2000 < x ≤ 4000 -> 86
4000 < x ≤ 8000 -> 43
8000 < x ≤ 16000 -> 10
16000 < x -> 1
Conclusion
I believe the answer to my question would be a buggy implementation on my machine (the one with AMD CPUs with an old Linux kernel).
The CGT results of the AMD machine with the old kernel show some extreme readings. If we look at the delta times, we'll see that the most frequent delta is 1 ns. This means that the call to clock_gettime took less than a nanosecond! Moreover, it also produced a number of extraordinarily large deltas (of more than 3000000 ns)! This seems to be erroneous behaviour. (Maybe unaccounted core migrations?)
Remarks:
The overhead of CGT and QTCT is quite big.
It is also difficult to account for their overhead, because CPU caching seems to make quite a big difference.
Maybe sticking to RDTSC, locking the process to one core, and assigning real-time priority is the most accurate way to tell how many cycles a piece of code used...
Appendix
Code 1: clock_gettime_overhead.c
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
/* Compiled & executed with:
gcc clock_gettime_overhead.c -O0 -lrt -o clock_gettime_overhead
./clock_gettime_overhead 100000
*/
int main(int argc, char **args) {
struct timespec tstart, tend, dummy;
int n, N;
N = atoi(args[1]);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tstart);
for (n = 0; n < N; ++n) {
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
}
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tend);
printf("Estimated overhead: %lld ns\n",
((int64_t) tend.tv_sec * 1000000000 + (int64_t) tend.tv_nsec
- ((int64_t) tstart.tv_sec * 1000000000
+ (int64_t) tstart.tv_nsec)) / N / 10);
return 0;
}
Code 2: clock_gettime_delta.c
#include <time.h>
#include <stdio.h>
#include <stdint.h>
/* Compiled & executed with:
gcc clock_gettime_delta.c -O0 -lrt -o clock_gettime_delta
./clock_gettime_delta > results
*/
#define N 100000
int main(int argc, char **args) {
struct timespec sample, results[N];
int n;
for (n = 0; n < N; ++n) {
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &sample);
results[n] = sample;
}
printf("%s\t%s\n", "Absolute time", "Delta");
for (n = 1; n < N; ++n) {
printf("%lld\t%lld\n",
(int64_t) results[n].tv_sec * 1000000000 +
(int64_t)results[n].tv_nsec,
(int64_t) results[n].tv_sec * 1000000000 +
(int64_t) results[n].tv_nsec -
((int64_t) results[n-1].tv_sec * 1000000000 +
(int64_t)results[n-1].tv_nsec));
}
return 0;
}
Code 3: rdtsc.h
static uint64_t rdtsc() {
#if defined(__GNUC__)
# if defined(__i386__)
uint64_t x;
__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
return x;
# elif defined(__x86_64__)
uint32_t hi, lo;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ((uint64_t)lo) | ((uint64_t)hi << 32);
# else
# error Unsupported architecture.
# endif
#elif defined(_MSC_VER)
return __rdtsc();
#else
# error Other compilers not supported...
#endif
}
Code 4: rdtsc_delta.c
#include <stdio.h>
#include <stdint.h>
#include "rdtsc.h"
/* Compiled & executed with:
gcc rdtsc_delta.c -O0 -o rdtsc_delta
./rdtsc_delta > rdtsc_delta_results
Windows:
cl -Od rdtsc_delta.c
rdtsc_delta.exe > windows_rdtsc_delta_results
*/
#define N 100000
int main(int argc, char **args) {
uint64_t results[N];
int n;
for (n = 0; n < N; ++n) {
results[n] = rdtsc();
}
printf("%s\t%s\n", "Absolute time", "Delta");
for (n = 1; n < N; ++n) {
printf("%lld\t%lld\n", results[n], results[n] - results[n-1]);
}
return 0;
}
Code 5: rdtsc_overhead.c
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "rdtsc.h"
/* Compiled & executed with:
gcc rdtsc_overhead.c -O0 -lrt -o rdtsc_overhead
./rdtsc_overhead 1000000 > rdtsc_overhead_results
Windows:
cl -Od rdtsc_overhead.c
rdtsc_overhead.exe 1000000 > windows_rdtsc_overhead_results
*/
int main(int argc, char **args) {
uint64_t tstart, tend, dummy;
int n, N;
N = atoi(args[1]);
tstart = rdtsc();
for (n = 0; n < N; ++n) {
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
}
tend = rdtsc();
printf("%G\n", (double)(tend - tstart)/N/10);
return 0;
}
Code 6: qtct_delta.c
#include <stdio.h>
#include <stdint.h>
#include <Windows.h>
/* Compiled & executed with:
cl -Od qtct_delta.c
qtct_delta.exe > windows_qtct_delta_results
*/
#define N 100000
int main(int argc, char **args) {
uint64_t ticks, results[N];
int n;
for (n = 0; n < N; ++n) {
QueryThreadCycleTime(GetCurrentThread(), &ticks);
results[n] = ticks;
}
printf("%s\t%s\n", "Absolute time", "Delta");
for (n = 1; n < N; ++n) {
printf("%lld\t%lld\n", results[n], results[n] - results[n-1]);
}
return 0;
}
Code 7: qtct_overhead.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <Windows.h>
/* Compiled & executed with:
cl -Od qtct_overhead.c
qtct_overhead.exe 1000000
*/
int main(int argc, char **args) {
uint64_t tstart, tend, ticks;
int n, N;
N = atoi(args[1]);
QueryThreadCycleTime(GetCurrentThread(), &tstart);
for (n = 0; n < N; ++n) {
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
}
QueryThreadCycleTime(GetCurrentThread(), &tend);
printf("%G\n", (double)(tend - tstart)/N/10);
return 0;
}
Well, as CLOCK_THREAD_CPUTIME_ID is implemented using rdtsc, it will likely suffer from the same problems. The manual page for clock_gettime says:
The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks
are realized on many platforms using timers from the CPUs (TSC on
i386, AR.ITC on Itanium). These registers may differ between CPUs and
as a consequence these clocks may return bogus results if a
process is migrated to another CPU.
That sounds like it might explain your problems. Maybe you should lock your process to one CPU to get stable results?
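A hedged sketch of that suggestion: pin the process to one CPU with sched_setaffinity() (and, as the question's own remarks suggest, optionally request real-time priority, which needs privileges) before taking the CLOCK_THREAD_CPUTIME_ID readings. The measured segment is a placeholder.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>

/* link with -lrt on older glibc for clock_gettime() */
int main(void)
{
    /* pin to CPU 0 so the thread-CPU-time clock always reads the same CPU */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* optionally also request real-time priority to reduce preemption
     * (needs root / CAP_SYS_NICE) */
    struct sched_param sp = { .sched_priority = 1 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");

    struct timespec t0, t1;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t0);
    /* ... code segment to measure ... */
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t1);
    printf("%ld ns\n",
           (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec));
    return 0;
}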
When you have a highly skewed distribution that cannot go negative, you're going to see large discrepancies between mean, median, and mode.
The standard deviation is fairly meaningless for such a distribution.
It's usually a good idea to log-transform it.
That will make it "more normal".

Accurately Calculating CPU Utilization in Linux using /proc/stat

There are a number of posts and references on how to get CPU Utilization using statistics in /proc/stat. However, most of them use only four of the 7+ CPU stats (user, nice, system, and idle), ignoring the remaining jiffie CPU counts present in Linux 2.6 (iowait, irq, softirq).
As an example, see Determining CPU utilization.
My question is this: Are the iowait/irq/softirq numbers also counted in one of the first four numbers (user/nice/system/idle)? In other words, does the total jiffie count equal the sum of the first four stats? Or, is the total jiffie count equal to the sum of all 7 stats? If the latter is true, then a CPU utilization formula should take all of the numbers into account, like this:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long double a[7], b[7], loadavg;
    FILE *fp;

    for (;;)
    {
        fp = fopen("/proc/stat", "r");
        fscanf(fp, "%*s %Lf %Lf %Lf %Lf %Lf %Lf %Lf", &a[0], &a[1], &a[2], &a[3], &a[4], &a[5], &a[6]);
        fclose(fp);
        sleep(1);

        fp = fopen("/proc/stat", "r");
        fscanf(fp, "%*s %Lf %Lf %Lf %Lf %Lf %Lf %Lf", &b[0], &b[1], &b[2], &b[3], &b[4], &b[5], &b[6]);
        fclose(fp);

        loadavg = ((b[0]+b[1]+b[2]+b[4]+b[5]+b[6]) - (a[0]+a[1]+a[2]+a[4]+a[5]+a[6]))
                / ((b[0]+b[1]+b[2]+b[3]+b[4]+b[5]+b[6]) - (a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]));
        printf("The current CPU utilization is : %Lf\n", loadavg);
    }
    return 0;
}
I think iowait/irq/softirq are not counted in any of the first four numbers. You can see the comment for irqtime_account_process_tick in the kernel code for more detail:
(for Linux kernel 4.1.1)
2815 * Tick demultiplexing follows the order
2816 * - pending hardirq update <-- this is irq
2817 * - pending softirq update <-- this is softirq
2818 * - user_time
2819 * - idle_time <-- iowait is included in here, discussed below
2820 * - system time
2821 * - check for guest_time
2822 * - else account as system_time
For the idle time handling, see account_idle_time function:
2772 /*
2773 * Account for idle time.
2774 * #cputime: the cpu time spent in idle wait
2775 */
2776 void account_idle_time(cputime_t cputime)
2777 {
2778 u64 *cpustat = kcpustat_this_cpu->cpustat;
2779 struct rq *rq = this_rq();
2780
2781 if (atomic_read(&rq->nr_iowait) > 0)
2782 cpustat[CPUTIME_IOWAIT] += (__force u64) cputime;
2783 else
2784 cpustat[CPUTIME_IDLE] += (__force u64) cputime;
2785 }
If the CPU is idle AND there is some IO pending, it will count the time in CPUTIME_IOWAIT. Otherwise, it is counted in CPUTIME_IDLE.
To conclude, I think the jiffies in irq/softirq should be counted as "busy" for the CPU, because it was actually handling some IRQ or soft IRQ. On the other hand, the jiffies in "iowait" should be counted as "idle" for the CPU, because it was not doing anything but waiting for pending IO to complete.
From busybox, its top implementation does:
static const char fmt[] ALIGN1 = "cp%*s %llu %llu %llu %llu %llu %llu %llu %llu";
int ret;
if (!fgets(line_buf, LINE_BUF_SIZE, fp) || line_buf[0] != 'c' /* not "cpu" */)
return 0;
ret = sscanf(line_buf, fmt,
&p_jif->usr, &p_jif->nic, &p_jif->sys, &p_jif->idle,
&p_jif->iowait, &p_jif->irq, &p_jif->softirq,
&p_jif->steal);
if (ret >= 4) {
p_jif->total = p_jif->usr + p_jif->nic + p_jif->sys + p_jif->idle
+ p_jif->iowait + p_jif->irq + p_jif->softirq + p_jif->steal;
/* procps 2.x does not count iowait as busy time */
p_jif->busy = p_jif->total - p_jif->idle - p_jif->iowait;
}
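Following that convention (iowait treated as idle, as in the procps comment above), here is a small hedged sketch in C: it samples the aggregate "cpu" line of /proc/stat twice, one second apart, and prints the busy percentage over all eight fields, with steal included when the kernel provides it. The helper name read_cpu is made up for the example.
#include <stdio.h>
#include <unistd.h>

/* read the eight aggregate "cpu" fields; steal may be absent on old kernels */
static int read_cpu(unsigned long long v[8])
{
    FILE *fp = fopen("/proc/stat", "r");
    if (!fp) return -1;
    int n = fscanf(fp, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
    fclose(fp);
    return n >= 7 ? 0 : -1;
}

int main(void)
{
    unsigned long long a[8] = {0}, b[8] = {0};
    if (read_cpu(a)) return 1;
    sleep(1);
    if (read_cpu(b)) return 1;

    unsigned long long total = 0;
    for (int i = 0; i < 8; ++i)
        total += b[i] - a[i];
    /* idle + iowait both count as idle time */
    unsigned long long idle = (b[3] - a[3]) + (b[4] - a[4]);

    printf("CPU utilization: %.2f%%\n",
           total ? 100.0 * (total - idle) / total : 0.0);
    return 0;
}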
