Memory Leak in Pango - memory-leaks

I am using the Pango library alongside Cairo, without GTK, in a test application that I am currently compiling on Mac OS X. I have a memory leak that I have traced to this function:
void draw_with_cairo (void)
{
    PangoLayout *layout;
    PangoFontDescription *desc;
    int i;

    cairo_save (cr);
    cairo_scale (cr, 1, -1);
    cairo_translate (cr, 0, -HEIGHT);
    cairo_translate (cr, 400, 300);

    layout = pango_cairo_create_layout (cr);
    pango_layout_set_text (layout, "Test", -1);
    desc = pango_font_description_from_string ("BMitra 32");
    pango_layout_set_font_description (layout, desc);
    pango_font_description_free (desc);

    for (i = 0; i < 12; i++)
    {
        int width, height;
        double angle = iter + (360.0 * i) / 12;
        double red;

        cairo_save (cr);
        red = (1 + cos ((angle - 60) * G_PI / 180.)) / 2;
        cairo_set_source_rgb (cr, red, 0, 1.0 - red);
        cairo_rotate (cr, angle * G_PI / 180.);
        pango_cairo_update_layout (cr, layout);
        pango_layout_get_size (layout, &width, &height);
        cairo_move_to (cr, - ((double)width / PANGO_SCALE) / 2, - 250);
        pango_cairo_show_layout (cr, layout);
        cairo_restore (cr);
    }

    cairo_restore (cr);
    g_object_unref (layout);
}
This routine is called very often, perhaps a hundred times per second. The leak is huge, around 30 MB in 3 seconds, and grows at a constant rate. When I go through this code, it looks fine to me. I have searched for this and found many reports of memory leaks when using Pango in GTK applications, and they all point to a patch in Pango or GTK. I am really puzzled: I can't believe there would be such a bug in a heavily used library like Pango, so I think the problem is in my own code. Any suggestions are appreciated.
This is the vmmap result for Uli's code:
Executing vmmap -resident 25897 | grep TOTAL at beginning of main()
TOTAL 321.3M 126.2M 485
TOTAL 18.0M 200K 1323 173K 0% 2
Executing vmmap -resident 25897 | grep TOTAL after cairo init
TOTAL 331.3M 126.4M 489
TOTAL 27.0M 224K 1327 1155K 4% 6
Executing vmmap -resident 25897 | grep TOTAL after one iteration
TOTAL 383.2M 143.9M 517
TOTAL 37.2M 3368K 18634 3423K 8% 5
Executing vmmap -resident 25897 | grep TOTAL after loop
TOTAL 481.6M 244.1M 514
TOTAL 137.2M 103.7M 151961 66.4M 48% 6
Executing vmmap -resident 25897 | grep TOTAL at end
TOTAL 481.6M 244.1M 520
TOTAL 136.3M 103.1M 151956 65.4M 48% 11
And this is the unfiltered output of the last stage:
Executing vmmap -resident 25751 at end
Process: main [25751]
Path: /PATH/OMITTED/main
Load Address: 0x109b9c000
Identifier: main
Version: ???
Code Type: X86-64
Parent Process: bash [837]
Date/Time: 2016-01-30 23:28:35.866 +0330
Launch Time: 2016-01-30 23:27:35.148 +0330
OS Version: Mac OS X 10.11.2 (15C50)
Report Version: 7
Analysis Tool: /Applications/
Analysis Tool Version: Xcode 7.0.1 (7A1001)
Virtual Memory Map of process 25751 (main)
Output report format: 2.4 -- 64-bit process
VM page size: 4096 bytes
==== Non-writable regions for process 25751
==== Legend
SM=sharing mode:
COW=copy_on_write PRV=private NUL=empty ALI=aliased
SHM=shared ZER=zero_filled S/A=shared_alias
==== Summary for process 25751
ReadOnly portion of Libraries: Total=219.6M resident=112.2M(51%) swapped_out_or_unallocated=107.5M(49%)
Writable regions: Total=155.7M written=5448K(3%) resident=104.1M(67%) swapped_out=0K(0%) unallocated=51.6M(33%)
=========== ======= ======== =======
Activity Tracing 2048K 12K 2
Dispatch continuations 8192K 32K 2
Kernel Alloc Once 8K 8K 3
MALLOC guard page 32K 0K 7
MALLOC metadata 364K 84K 11
MALLOC_LARGE 260K 260K 2 see MALLOC ZONE table below
MALLOC_LARGE (empty) 980K 668K 2 see MALLOC ZONE table below
MALLOC_LARGE metadata 4K 4K 2 see MALLOC ZONE table below
MALLOC_SMALL 32.0M 880K 3 see MALLOC ZONE table below
MALLOC_TINY 104.0M 102.1M 7 see MALLOC ZONE table below
Stack 8264K 60K 3
__DATA 16.7M 13.6M 217
__IMAGE 528K 104K 2
__LINKEDIT 92.4M 22.5M 34
__TEXT 127.2M 89.6M 220
__UNICODE 552K 476K 2
mapped file 32.2M 13.7M 4
shared memory 328K 172K 10
=========== ======= ======== =======
TOTAL 481.6M 244.3M 518
=========== ======= ========= ========= ========= ====== ======
DefaultMallocZone_0x109bd0000 136.3M 103.2M 151952 65.4M 48% 10
GFXMallocZone_0x109bd3000 0K 0K 0 0K 0
=========== ======= ========= ========= ========= ====== ======
TOTAL 136.3M 103.2M 151952 65.4M 48% 10
I have omitted the non-writable regions part because it exceeded Stack Overflow's post length limit.

I don't see any memory leak. The following program prints its memory usage before and after running your function above 100,000 times. Both numbers are the same for me.
#include <cairo.h>
#include <math.h>
#include <pango/pangocairo.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define HEIGHT 500
#define WIDTH 500

void draw_with_cairo (cairo_t *cr)
{
    PangoLayout *layout;
    PangoFontDescription *desc;
    int i;

    cairo_save (cr);
    cairo_scale (cr, 1, -1);
    cairo_translate (cr, 0, -HEIGHT);
    cairo_translate (cr, 400, 300);

    layout = pango_cairo_create_layout (cr);
    pango_layout_set_text (layout, "Test", -1);
    desc = pango_font_description_from_string ("BMitra 32");
    pango_layout_set_font_description (layout, desc);
    pango_font_description_free (desc);

    for (i = 0; i < 12; i++)
    {
        int width, height;
        double angle = i + (360.0 * i) / 12;
        double red;

        cairo_save (cr);
        red = (1 + cos ((angle - 60) * G_PI / 180.)) / 2;
        cairo_set_source_rgb (cr, red, 0, 1.0 - red);
        cairo_rotate (cr, angle * G_PI / 180.);
        pango_cairo_update_layout (cr, layout);
        pango_layout_get_size (layout, &width, &height);
        cairo_move_to (cr, - ((double)width / PANGO_SCALE) / 2, - 250);
        pango_cairo_show_layout (cr, layout);
        cairo_restore (cr);
    }

    cairo_restore (cr);
    g_object_unref (layout);
}

static void print_memory_usage(const char *comment)
{
    char buffer[1024];
    sprintf(buffer, "grep -E VmPeak\\|VmSize /proc/%d/status", getpid());
    printf("Executing %s %s\n", buffer, comment);
    /* run the grep so that the numbers appear right after the message */
    system(buffer);
}

int main()
{
    cairo_surface_t *s;
    cairo_t *cr;
    int i;

    print_memory_usage("at beginning of main()");

    s = cairo_image_surface_create(CAIRO_FORMAT_ARGB32, WIDTH, HEIGHT);
    cr = cairo_create(s);
    print_memory_usage("after cairo init");

    draw_with_cairo(cr);
    print_memory_usage("after one iteration");

    for (i = 0; i < 100 * 1000; i++)
        draw_with_cairo(cr);
    print_memory_usage("after loop");

    cairo_destroy(cr);
    cairo_surface_destroy(s);
    print_memory_usage("at end");

    return 0;
}
Output for me (with no traces of any memory leaks):
Executing grep -E VmPeak\|VmSize /proc/31881/status at beginning of main()
VmPeak: 76660 kB
VmSize: 76660 kB
Executing grep -E VmPeak\|VmSize /proc/31881/status after cairo init
VmPeak: 77640 kB
VmSize: 77640 kB
Executing grep -E VmPeak\|VmSize /proc/31881/status after one iteration
VmPeak: 79520 kB
VmSize: 79520 kB
Executing grep -E VmPeak\|VmSize /proc/31881/status after loop
VmPeak: 79520 kB
VmSize: 79520 kB
Executing grep -E VmPeak\|VmSize /proc/31881/status at end
VmPeak: 79520 kB
VmSize: 78540 kB
P.S.: I tested this on an up-to-date debian testing amd64.
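The /proc/<pid>/status approach above only works on Linux. As a rough sketch for systems without /proc (such as the Mac OS X machine in the question), the peak resident set size can be read with getrusage. The function names peak_rss_kb and print_peak_rss are made up for this illustration; the unit handling relies on ru_maxrss being reported in bytes on macOS and in kilobytes on Linux.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Peak resident set size of the calling process in kilobytes, or -1 on error. */
long peak_rss_kb(void)
{
    struct rusage usage;

    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;
#ifdef __APPLE__
    return usage.ru_maxrss / 1024;   /* macOS reports bytes */
#else
    return usage.ru_maxrss;          /* Linux reports kilobytes */
#endif
}

/* Hypothetical stand-in for print_memory_usage() in the program above. */
void print_peak_rss(const char *comment)
{
    printf("Peak RSS %s: %ld kB\n", comment, peak_rss_kb());
}
```

Because this is a peak value it never goes down, but a number that keeps climbing between "after one iteration" and "after loop" would still point at growth inside the drawing function.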


Need help to detect extra malloc - CS50 Pset5

Valgrind says 0 bytes are lost, but it also reports one fewer free than mallocs.
Because I call malloc in only one place, I'm posting only the relevant segments and not all three files.
When loading a dictionary.txt file into a hash table:
bool load(const char *dictionary)
{
    FILE *dict_file = fopen(dictionary, "r");   // dictionary.c:54
    if (dict_file == NULL)
    {
        return false;
    }
    int key;
    node *n = NULL;
    int mallocs = 0;
    while (1)
    {
        n = malloc(sizeof(node));
        printf("malloced: %i\n", ++mallocs);
        if (fscanf(dict_file, "%s", n->word) == -1)
        {
            // end of file: release the node that was allocated for nothing
            free(n);
            printf("malloc freed\n");
            break;
        }
        key = hash(n->word);
        n->next = table[key];
        table[key] = n;
    }
    return true;
}
And the Unloading part:
bool unload(void)
{
    int deleted = 0;
    node *n;
    for (int i = 0; i < N; i++)
    {
        n = table[i];
        while (n != NULL)
        {
            n = n->next;
            free(table[i]);   // free the head of the bucket, then advance it
            table[i] = n;
            deleted++;
        }
    }
    printf("DELETED: %i", deleted);
    return true;
}
check50 says there are memory leaks, but I can't figure out where.
Command: ./speller dictionaries/small texts/cat.txt
malloced: 1
malloced: 2
malloced: 3
malloced: 4
malloc freed
TIME IN load: 0.03
TIME IN check: 0.00
TIME IN size: 0.00
TIME IN unload: 0.00
==4215== HEAP SUMMARY:
==4215== in use at exit: 552 bytes in 1 blocks
==4215== total heap usage: 9 allocs, 8 frees, 10,544 bytes allocated
==4215== 552 bytes in 1 blocks are still reachable in loss record 1 of 1
==4215== at 0x4C31B0F: malloc (in /usr/lib/valgrind/
==4215== by 0x525AF29: __fopen_internal (iofopen.c:65)
==4215== by 0x525AF29: fopen@@GLIBC_2.2.5 (iofopen.c:89)
==4215== by 0x40114E: load (dictionary.c:54)
==4215== by 0x40095E: main (speller.c:40)
==4215== LEAK SUMMARY:
==4215== definitely lost: 0 bytes in 0 blocks
==4215== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
speller.c has distribution code. I hope the rest of the question is clear and understandable.
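The file opened with fopen (dict_file) is never closed. Call fclose(dict_file) before load returns; see man fclose. That is the one block Valgrind reports as still reachable: it was allocated inside fopen, called from load (dictionary.c:54).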
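A minimal sketch of the pattern, outside of the CS50 code: every successful fopen is matched by an fclose on each path that leaves the function. The function count_words and its 45-character word limit are assumptions made for this example, not part of speller.

```c
#include <stdio.h>

/* Counts whitespace-separated words in a file; returns -1 if it cannot be opened. */
int count_words(const char *path)
{
    FILE *file = fopen(path, "r");
    if (file == NULL)
    {
        return -1;              /* nothing was opened, so nothing to close */
    }

    char word[46];              /* assumed limit: 45 characters plus '\0' */
    int count = 0;
    while (fscanf(file, "%45s", word) == 1)
    {
        count++;
    }

    fclose(file);               /* releases the buffer that fopen allocated */
    return count;
}
```

In load, the equivalent change would be a single fclose(dict_file) after the reading loop, before return true.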

how to get disk read/write bytes per second from /proc in programming on linux?

Purpose: I want to get the kind of information that the iostat command reports.
I already know that /proc/diskstats and /sys/block/sdX/stat contain the number of sectors read and sectors written. If I want read/write bytes per second, are the following formulas right?
Read/write bytes per second:
(sectors read/written (now) - sectors read/written (last)) * 512 bytes / time interval
Read/write operations per second:
(read/write IOs (now) + read/write merges (now) - read/write IOs (last) - read/write merges (last)) / time interval
So if I have a timer that makes the program read those two files every second and then apply the formulas above, will I get the correct values?
TL;DR: a sector is 512 bytes (octets: one sector is 512 bytes, each byte is 8 bits, and every bit is either 0 or 1, never a superposition of the two).
"The standard sector size of 512 bytes for magnetic disks was established ....[dubious – discuss] " (c) wiki
How to check the sector size used for I/O statistics (in /proc) on Linux:
Check how the iostat tool works (it shows kilobytes per second when started as iostat 1). It is part of the sysstat package:
/*
 * Read stats from /proc/diskstats.
 */
void read_diskstats_stat(int curr)
{
    ...
    /* major minor name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq */
    i = sscanf(line, "%u %u %s %lu %lu %lu %lu %lu %lu %lu %u %u %u %u",
               &major, &minor, dev_name,
               &rd_ios, &rd_merges_or_rd_sec, &rd_sec_or_wr_ios, &rd_ticks_or_wr_sec,
               &wr_ios, &wr_merges, &wr_sec, &wr_ticks, &ios_pgr, &tot_ticks, &rq_ticks);
    if (i == 14) {
        ...
        sdev.rd_sectors = rd_sec_or_wr_ios;
        ...
        sdev.wr_sectors = wr_sec;
        ...
    }
    ...
}

/*
 * @fctr Conversion factor.
 */
    ...
    printf(" kB_read/s kB_wrtn/s kB_read kB_wrtn\n");
    *fctr = 2;
    ...
    /* rrq/s wrq/s r/s w/s rsec wsec rqsz qusz await r_await w_await svctm %util */
    ... 4 columns skipped
    cprintf_f(4, 8, 2,
              S_VALUE(ioj->rd_sectors, ioi->rd_sectors, itv) / fctr,
              S_VALUE(ioj->wr_sectors, ioi->wr_sectors, itv) / fctr,
    ...
So iostat reads the sector counts and divides by two to get kilobytes per second (one sector read is 0.5 kB read, two sectors read are 1 kB read, and so on). We can conclude that the sector is always 512 bytes. The documentation states the same, doesn't it?
An internet search for "/proc/diskstats" leads to "I/O statistics fields" by ricklind of IBM (USA):
Field 3 -- # of sectors read
This is the total number of sectors read successfully.
Field 7 -- # of sectors written
This is the total number of sectors written successfully.
There is no information about the sector size here (why?). Is the source code the best documentation? It may be. /proc/diskstats is written by the function diskstats_show in the kernel source file block/genhd.c:
    seq_printf(seqf, "%4d %7d %s %lu %lu %lu "
               "%u %lu %lu %lu %u %u %u %u\n",
               ...
               part_stat_read(hd, sectors[READ]),
               ...
               part_stat_read(hd, sectors[WRITE]),
               ...
The sectors counter is a field of struct disk_stats:
struct disk_stats {
    unsigned long sectors[2]; /* READs and WRITEs */
    ...
};
It is read with part_stat_read and written with __part_stat_add.
The counter is incremented in blk_account_io_completion:
void blk_account_io_completion(struct request *req, unsigned int bytes)
{
    if (blk_do_io_stat(req)) {
        const int rw = rq_data_dir(req);
        struct hd_struct *part;
        int cpu;

        cpu = part_stat_lock();
        part = req->part;
        part_stat_add(cpu, part, sectors[rw], bytes >> 9);
        part_stat_unlock();
    }
}
It uses a hard-coded "bytes >> 9" to compute the sector count from the request size in bytes (why round down?). For a human, rather than a compiler without floating point, that is the same as bytes / 512.
There is also the blk_rq_sectors function (unused here) to get the sector count from a request; it does the same >> 9 conversion from bytes to sectors:
static inline unsigned int blk_rq_bytes(const struct request *rq)
{
    return rq->__data_len;
}
...
static inline unsigned int blk_rq_sectors(const struct request *rq)
{
    return blk_rq_bytes(rq) >> 9;
}
Authors of the FS/VFS subsystem in Linux say, in reply to "Why is SECTOR_SIZE = 512 inside kernel ?" (2015):
#define SECTOR_SHIFT 9
Message by Theodore Ts'o:
It's cast in stone. There are too many places all over the kernel,
especially in a huge number of file systems, which assume that the
sector size is 512 bytes. So above the block layer, the sector size
is always going to be 512.
This is actually better for user space programs using
/proc/diskstats, since they don't need to know whether a particular
underlying hardware is using 512, 4k, (or if the HDD manufacturers
fantasies become true 32k or 64k) sector sizes.
For similar reason, st_blocks in struct size is always in units of 512
bytes. We don't want to force userspace to have to figure out whether
the underlying file system is using 1k, 2k, or 4k. For that reason
the units of st_blocks is always going to be 512 bytes, and this is
hard-coded in the POSIX standard.
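Putting the pieces together, the bytes-per-second formula from the question can be sketched as below. This is only an illustration under the 512-byte sector convention described above: the names disk_sample, parse_diskstats_line and bytes_per_second are invented for the example, and only the first ten fields of a /proc/diskstats line are parsed.

```c
#include <stdio.h>

/* Cumulative counters for one block device, as found in /proc/diskstats. */
struct disk_sample {
    unsigned long long rd_ios;      /* reads completed */
    unsigned long long rd_sectors;  /* sectors read, in 512-byte units */
    unsigned long long wr_ios;      /* writes completed */
    unsigned long long wr_sectors;  /* sectors written, in 512-byte units */
};

/* Parses one /proc/diskstats line; returns 1 on success, 0 otherwise. */
int parse_diskstats_line(const char *line, char name[32], struct disk_sample *out)
{
    unsigned int major, minor;
    unsigned long long rd_merges, rd_ticks, wr_merges;

    int fields = sscanf(line, "%u %u %31s %llu %llu %llu %llu %llu %llu %llu",
                        &major, &minor, name,
                        &out->rd_ios, &rd_merges, &out->rd_sectors, &rd_ticks,
                        &out->wr_ios, &wr_merges, &out->wr_sectors);
    return fields == 10;
}

/* Bytes per second between two cumulative sector counts taken interval_s apart. */
double bytes_per_second(unsigned long long sectors_last,
                        unsigned long long sectors_now,
                        double interval_s)
{
    return (double)(sectors_now - sectors_last) * 512.0 / interval_s;
}
```

A caller would read /proc/diskstats once per interval with fgets, parse the line for the device of interest, and pass the previous and current rd_sectors (or wr_sectors) to bytes_per_second together with the measured interval.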

CUDA performance test

I'm writing a simple CUDA program for a performance test.
It is not a vector calculation, just a simple (parallel) string conversion.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

#define UCHAR unsigned char
#define UINT32 unsigned long int
#define CTX_SIZE sizeof(aes_context)
#define DOCU_SIZE 4096
#define TOTAL 100000
#define BLOCK_SIZE 500

// Each thread handles the document at index threadIdx.x (the block index is not used)
__global__ void TEST_Encode( UCHAR *a_input, UCHAR *a_output )
{
    UCHAR *input;
    UCHAR *output;

    input = &(a_input[threadIdx.x * DOCU_SIZE]);
    output = &(a_output[threadIdx.x * DOCU_SIZE]);

    for ( int i = 0 ; i < 30 ; i++ ) {
        if ( (input[i] >= 'a') && (input[i] <= 'z') ) {
            output[i] = input[i] - 'a' + 'A';
        }
        else {
            output[i] = input[i];
        }
    }
}

int main(int argc, char** argv)
{
    struct cudaDeviceProp xCUDEV;
    cudaGetDeviceProperties(&xCUDEV, 0);

    // Host and device buffers (allocation assumed; sizes match the copies below)
    UCHAR *pH_TXT = (UCHAR*)malloc(DOCU_SIZE * TOTAL);
    UCHAR *pH_ENC = (UCHAR*)malloc(DOCU_SIZE * TOTAL);
    UCHAR *pD_TXT = NULL;
    UCHAR *pD_ENC = NULL;

    // Prepare Source
    memset(pH_TXT, 0x00, DOCU_SIZE * TOTAL);
    for ( int i = 0 ; i < TOTAL ; i++ ) {
        strcpy((char*)pH_TXT + (i * DOCU_SIZE), "hello world, i need an apple.");
    }

    // Allocate vectors in device memory
    cudaMalloc((void**)&pD_TXT, DOCU_SIZE * TOTAL);
    cudaMalloc((void**)&pD_ENC, DOCU_SIZE * TOTAL);

    // Copy vectors from host memory to device memory
    cudaMemcpy(pD_TXT, pH_TXT, DOCU_SIZE * TOTAL, cudaMemcpyHostToDevice);

    // Invoke kernel
    int threadsPerBlock = BLOCK_SIZE;
    int blocksPerGrid = (TOTAL + threadsPerBlock - 1) / threadsPerBlock;
    printf("Total Task is %d\n", TOTAL);
    printf("block size is %d\n", threadsPerBlock);
    printf("repeat cnt is %d\n", blocksPerGrid);
    TEST_Encode<<<blocksPerGrid, threadsPerBlock>>>(pD_TXT, pD_ENC);
    cudaMemcpy(pH_ENC, pD_ENC, DOCU_SIZE * TOTAL, cudaMemcpyDeviceToHost);

    // Free device memory
    if (pD_TXT) cudaFree(pD_TXT);
    if (pD_ENC) cudaFree(pD_ENC);

    free(pH_TXT);
    free(pH_ENC);
    return 0;
}
When I change the BLOCK_SIZE value from 2 to 1000, I get the following durations (from the NVIDIA Visual Profiler):
TOTAL    blocksPerGrid    BLOCK_SIZE    Duration
100000   50000            2             28.22
100000   10000            10            22.223
100000   2000             50            12.3
100000   1000             100           9.624
100000   500              200           10.755
100000   250              400           29.824
100000   200              500           39.67
100000   100              1000          81.268
My GPU is a GeForce GT520 and its maximum threadsPerBlock value is 1024, so I predicted that I would get the best performance with a block size of 1000, but the table above shows a different result.
I can't understand why the duration is not linear, or how I can fix this problem (or how I can find the optimal block size, the one with the minimum duration).
It seems that 2, 10, or 50 threads per block don't utilize the capabilities of the GPU, since it is designed to run many more threads.
Your card has compute capability 2.1.
Maximum number of resident threads per multiprocessor = 1536
Maximum number of threads per block = 1024
Maximum number of resident blocks per multiprocessor = 8
Warp size = 32
There are two issues:
You try to occupy so much register memory per thread that it will definitely be spilled to the slow local memory space as your block size increases.
Perform your tests with multiples of 32, since this is the warp size of your card and many memory operations are optimized for thread counts that are multiples of the warp size.
So if you use around 1024 (1000 in your case) threads per block, 33% of your GPU is idle, since only 1 block can be assigned per SM.
What happens if you use the following 100% occupancy sizes?
128 = 12 blocks -> since only 8 can be resident per sm the block execution is serialized
192 = 8 resident blocks per sm
256 = 6 resident blocks per sm
512 = 3 resident blocks per sm
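The arithmetic behind these numbers can be sketched on the host in plain C. This is a simplified model that uses only the three limits quoted above (1536 resident threads, 8 resident blocks, warp size 32) and ignores register and shared-memory pressure, which can lower the real figure; the function names are made up for the example.

```c
/* Limits for compute capability 2.1, as listed above. */
#define MAX_THREADS_PER_SM 1536
#define MAX_BLOCKS_PER_SM  8
#define WARP_SIZE          32

/* Warps needed by one block; a partial warp still occupies a full warp slot. */
int warps_per_block(int block_size)
{
    return (block_size + WARP_SIZE - 1) / WARP_SIZE;
}

/* Blocks that can be resident on one SM at the same time. */
int resident_blocks_per_sm(int block_size)
{
    int max_warps = MAX_THREADS_PER_SM / WARP_SIZE;   /* 48 warp slots */
    int by_warps = max_warps / warps_per_block(block_size);
    return by_warps < MAX_BLOCKS_PER_SM ? by_warps : MAX_BLOCKS_PER_SM;
}

/* Resident warps as a percentage of the warp slots of one SM. */
double occupancy_percent(int block_size)
{
    int max_warps = MAX_THREADS_PER_SM / WARP_SIZE;
    int resident = resident_blocks_per_sm(block_size) * warps_per_block(block_size);
    return 100.0 * resident / max_warps;
}
```

Under this model a block size of 1000 rounds up to 32 warps, so only one block fits and a third of the warp slots stay empty, which matches the 33% figure above. Block sizes of 192, 256 and 512 fill all 48 slots, while 128 is capped at 8 resident blocks (1024 threads).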

Why is clock_gettime so erratic?

Section Old Question contains the initial question (Further Investigation and Conclusion have been added since).
Skip to the section Further Investigation below for a detailed comparison of the different timing methods (rdtsc, clock_gettime and QueryThreadCycleTime).
I believe the erratic behaviour of CGT can be attributed to either a buggy kernel or a buggy CPU (see section Conclusion).
The code used for testing is at the bottom of this question (see section Appendix).
Apologies for the length.
Old Question
In short: I am using clock_gettime to measure the execution time of many code segments. I am experiencing very inconsistent measurements between separate runs. The method has an extremely high standard deviation when compared to other methods (see Explanation below).
Question: Is there a reason why clock_gettime would give such inconsistent measurements when compared to other methods? Is there an alternative method with the same resolution that accounts for thread idle time?
Explanation: I am trying to profile a number of small parts of C code. The execution time of each of the code segments is not more than a couple of microseconds. In a single run, each of the code segments will execute some hundreds of times, which produces runs × hundreds of measurements.
I also have to measure only the time the thread actually spends executing (which is why rdtsc is not suitable). I also need a high resolution (which is why times is not suitable).
I've tried the following methods:
rdtsc (on Linux and Windows),
clock_gettime (with 'CLOCK_THREAD_CPUTIME_ID'; on Linux), and
QueryThreadCycleTime (on Windows).
Methodology: The analysis was performed on 25 runs. In each run, each code segment repeats 101 times. Therefore I have 2525 measurements. Then I look at a histogram of the measurements, and also calculate some basic statistics (the mean, median, mode, min, and max).
I do not present how I measured the 'similarity' of the three methods, but this simply involved a basic comparison of proportion of times spent in each code segment ('proportion' means that the times are normalised). I then look at the pure differences in these proportions. This comparison showed that all 'rdtsc', 'QTCT', and 'CGT' measure the same proportions when averaged over the 25 runs. However, the results below show that 'CGT' has a very large standard deviation. This makes it unusable in my use case.
A comparison of clock_gettime with rdtsc for the same code segment (25 runs of 101 measurements = 2525 readings):
clock_gettime:
1881 measurements of 11 ns,
595 measurements were (distributed almost normally) between 3369 and 3414 ns,
2 measurements of 11680 ns,
1 measurement of 1506022 ns, and
the rest is between 900 and 5000 ns.
Min: 11 ns
Max: 1506022 ns
Mean: 1471.862 ns
Median: 11 ns
Mode: 11 ns
Stddev: 29991.034
rdtsc (note: no context switches occurred during this run, but if it happens, it usually results in just a single measurement of 30000 ticks or so):
1178 measurements between 274 and 325 ticks,
306 measurements between 326 and 375 ticks,
910 measurements between 376 and 425 ticks,
129 measurements between 426 and 990 ticks,
1 measurement of 1240 ticks, and
1 measurement of 1256 ticks.
Min: 274 ticks
Max: 1256 ticks
Mean: 355.806 ticks
Median: 333 ticks
Mode: 376 ticks
Stddev: 83.896
rdtsc gives very similar results on both Linux and Windows. It has an acceptable standard deviation--it is actually quite consistent/stable. However, it does not account for thread idle time. Therefore, context switches make the measurements erratic (on Windows I have observed this quite often: a code segment with an average of 1000 ticks or so will take ~30000 ticks every now and then--definitely because of pre-emption).
QueryThreadCycleTime gives very consistent measurements--i.e. much lower standard deviation when compared to rdtsc. When no context switches happen, this method is almost identical to rdtsc.
clock_gettime, on the other hand, is producing extremely inconsistent results (not just between runs, but also between measurements). The standard deviations are extreme (when compared to rdtsc).
I hope the statistics are okay. But what could be the reason for such a discrepancy in the measurements between the two methods? Of course, there is caching, CPU/core migration, and other things. But none of this should be responsible for any such differences between 'rdtsc' and 'clock_gettime'. What is going on?
Further Investigation
I have investigated this a bit further. I have done two things:
Measured the overhead of just calling clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t) (see code 1 in Appendix), and
in a plain loop called clock_gettime and stored the readings into an array (see code 2 in Appendix). I measure the delta times (difference in successive measurement times, which should correspond a bit to the overhead of the call of clock_gettime).
I have measured it on two different computers with two different Linux Kernel versions:
CPU: Core 2 Duo L9400 @ 1.86GHz
Kernel: Linux 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386
Estimated clock_gettime overhead: between 690-710 ns
Delta times:
Average: 815.22 ns
Median: 713 ns
Mode: 709 ns
Min: 698 ns
Max: 23359 ns
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
697 < x ≤ 800 -> 78111 <-- cached?
800 < x ≤ 1000 -> 16412
1000 < x ≤ 1500 -> 3
1500 < x ≤ 2000 -> 4836 <-- uncached?
2000 < x ≤ 3000 -> 305
3000 < x ≤ 5000 -> 161
5000 < x ≤ 10000 -> 105
10000 < x ≤ 15000 -> 53
15000 < x ≤ 20000 -> 8
20000 < x -> 5
CPU: 4 × Dual Core AMD Opteron Processor 275
Kernel: Linux 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux
Estimated clock_gettime overhead: between 279-283 ns
Delta times:
Average: 320.00
Median: 1
Mode: 1
Min: 1
Max: 3495529
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
x ≤ 1 -> 86738 <-- cached?
282 < x ≤ 300 -> 13118 <-- uncached?
300 < x ≤ 440 -> 78
2000 < x ≤ 5000 -> 52
5000 < x ≤ 30000 -> 5
3000000 < x -> 8
rdtsc (related code: rdtsc_delta.c and rdtsc_overhead.c):
CPU: Core 2 Duo L9400 @ 1.86GHz
Kernel: Linux 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386
Estimated overhead: between 39-42 ticks
Delta times:
Average: 52.46 ticks
Median: 42 ticks
Mode: 42 ticks
Min: 35 ticks
Max: 28700 ticks
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
34 < x ≤ 35 -> 16240 <-- cached?
41 < x ≤ 42 -> 63585 <-- uncached? (small difference)
48 < x ≤ 49 -> 19779 <-- uncached?
49 < x ≤ 120 -> 195
3125 < x ≤ 5000 -> 144
5000 < x ≤ 10000 -> 45
10000 < x ≤ 20000 -> 9
20000 < x -> 2
CPU: 4 × Dual Core AMD Opteron Processor 275
Kernel: Linux 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux
Estimated overhead: between 13.7-17.0 ticks
Delta times:
Average: 35.44 ticks
Median: 16 ticks
Mode: 16 ticks
Min: 14 ticks
Max: 16372 ticks
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
13 < x ≤ 14 -> 192
14 < x ≤ 21 -> 78172 <-- cached?
21 < x ≤ 50 -> 10818
50 < x ≤ 103 -> 10624 <-- uncached?
5825 < x ≤ 6500 -> 88
6500 < x ≤ 8000 -> 88
8000 < x ≤ 10000 -> 11
10000 < x ≤ 15000 -> 4
15000 < x ≤ 16372 -> 2
QueryThreadCycleTime (related code: qtct_delta.c and qtct_overhead.c):
CPU: Core 2 6700 @ 2.66GHz
Kernel: Windows 7 64-bit
Estimated overhead: between 890-940 ticks
Delta times:
Average: 1057.30 ticks
Median: 890 ticks
Mode: 890 ticks
Min: 880 ticks
Max: 29400 ticks
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
879 < x ≤ 890 -> 71347 <-- cached?
895 < x ≤ 1469 -> 844
1469 < x ≤ 1600 -> 27613 <-- uncached?
1600 < x ≤ 2000 -> 55
2000 < x ≤ 4000 -> 86
4000 < x ≤ 8000 -> 43
8000 < x ≤ 16000 -> 10
16000 < x -> 1
Conclusion
I believe the answer to my question would be a buggy implementation on my machine (the one with AMD CPUs and an old Linux kernel).
The CGT results of the AMD machine with the old kernel show some extreme readings. If we look at the delta times, we'll see that the most frequent delta is 1 ns. This means that the call to clock_gettime took less than a nanosecond! Moreover, it also produced a number of extraordinarily large deltas (of more than 3000000 ns)! This seems to be erroneous behaviour. (Maybe unaccounted core migrations?)
The overhead of CGT and QTCT is quite big.
It is also difficult to account for their overhead, because CPU caching seems to make quite a big difference.
Maybe sticking to RDTSC, locking the process to one core, and assigning real-time priority is the most accurate way to tell how many cycles a piece of code used...
Appendix
Code 1: clock_gettime_overhead.c
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Compiled & executed with:

    gcc clock_gettime_overhead.c -O0 -lrt -o clock_gettime_overhead
    ./clock_gettime_overhead 100000
*/

int main(int argc, char **args) {
    struct timespec tstart, tend, dummy;
    int n, N;
    N = atoi(args[1]);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tstart);
    for (n = 0; n < N; ++n) {
        /* ten back-to-back calls per iteration */
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
    }
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tend);
    printf("Estimated overhead: %lld ns\n",
           (long long) (((int64_t) tend.tv_sec * 1000000000 + (int64_t) tend.tv_nsec
                         - ((int64_t) tstart.tv_sec * 1000000000
                            + (int64_t) tstart.tv_nsec)) / N / 10));
    return 0;
}
Code 2: clock_gettime_delta.c
#include <time.h>
#include <stdio.h>
#include <stdint.h>

/* Compiled & executed with:

    gcc clock_gettime_delta.c -O0 -lrt -o clock_gettime_delta
    ./clock_gettime_delta > results
*/

#define N 100000

int main(int argc, char **args) {
    struct timespec sample, results[N];
    int n;
    for (n = 0; n < N; ++n) {
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &sample);
        results[n] = sample;
    }
    printf("%s\t%s\n", "Absolute time", "Delta");
    for (n = 1; n < N; ++n) {
        /* absolute reading in ns, then the difference to the previous reading */
        printf("%lld\t%lld\n",
               (long long) ((int64_t) results[n].tv_sec * 1000000000 +
                            (int64_t) results[n].tv_nsec),
               (long long) ((int64_t) results[n].tv_sec * 1000000000 +
                            (int64_t) results[n].tv_nsec -
                            ((int64_t) results[n-1].tv_sec * 1000000000 +
                             (int64_t) results[n-1].tv_nsec)));
    }
    return 0;
}
Code 3: rdtsc.h
#include <stdint.h>
#if defined(_MSC_VER)
# include <intrin.h>
#endif

static uint64_t rdtsc() {
#if defined(__GNUC__)
# if defined(__i386__)
    uint64_t x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
# elif defined(__x86_64__)
    uint32_t hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)lo) | ((uint64_t)hi << 32);
# else
# error Unsupported architecture.
# endif
#elif defined(_MSC_VER)
    return __rdtsc();
#else
# error Other compilers not supported...
#endif
}
Code 4: rdtsc_delta.c
#include <stdio.h>
#include <stdint.h>
#include "rdtsc.h"

/* Compiled & executed with:

    gcc rdtsc_delta.c -O0 -o rdtsc_delta
    ./rdtsc_delta > rdtsc_delta_results

    cl -Od rdtsc_delta.c
    rdtsc_delta.exe > windows_rdtsc_delta_results
*/

#define N 100000

int main(int argc, char **args) {
    uint64_t results[N];
    int n;
    for (n = 0; n < N; ++n) {
        results[n] = rdtsc();
    }
    printf("%s\t%s\n", "Absolute time", "Delta");
    for (n = 1; n < N; ++n) {
        printf("%llu\t%llu\n", (unsigned long long) results[n],
               (unsigned long long) (results[n] - results[n-1]));
    }
    return 0;
}
Code 5: rdtsc_overhead.c
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "rdtsc.h"

/* Compiled & executed with:

    gcc rdtsc_overhead.c -O0 -lrt -o rdtsc_overhead
    ./rdtsc_overhead 1000000 > rdtsc_overhead_results

    cl -Od rdtsc_overhead.c
    rdtsc_overhead.exe 1000000 > windows_rdtsc_overhead_results
*/

int main(int argc, char **args) {
    uint64_t tstart, tend, dummy;
    int n, N;
    N = atoi(args[1]);
    tstart = rdtsc();
    for (n = 0; n < N; ++n) {
        /* ten back-to-back reads per iteration */
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
    }
    tend = rdtsc();
    printf("%G\n", (double)(tend - tstart)/N/10);
    return 0;
}
Code 6: qtct_delta.c
#include <stdio.h>
#include <stdint.h>
#include <Windows.h>

/* Compiled & executed with:

    cl -Od qtct_delta.c
    qtct_delta.exe > windows_qtct_delta_results
*/

#define N 100000

int main(int argc, char **args) {
    uint64_t ticks, results[N];
    int n;
    for (n = 0; n < N; ++n) {
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        results[n] = ticks;
    }
    printf("%s\t%s\n", "Absolute time", "Delta");
    for (n = 1; n < N; ++n) {
        printf("%llu\t%llu\n", (unsigned long long) results[n],
               (unsigned long long) (results[n] - results[n-1]));
    }
    return 0;
}
Code 7: qtct_overhead.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <Windows.h>

/* Compiled & executed with:

    cl -Od qtct_overhead.c
    qtct_overhead.exe 1000000
*/

int main(int argc, char **args) {
    uint64_t tstart, tend, ticks;
    int n, N;
    N = atoi(args[1]);
    QueryThreadCycleTime(GetCurrentThread(), &tstart);
    for (n = 0; n < N; ++n) {
        /* ten back-to-back calls per iteration */
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
    }
    QueryThreadCycleTime(GetCurrentThread(), &tend);
    printf("%G\n", (double)(tend - tstart)/N/10);
    return 0;
}
Well, as CLOCK_THREAD_CPUTIME_ID is implemented using rdtsc, it will likely suffer from the same problems. The manual page for clock_gettime says:
The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks
are realized on many platforms using timers from the CPUs (TSC on
i386, AR.ITC on Itanium). These registers may differ between CPUs and
as a consequence these clocks may return bogus results if a
process is migrated to another CPU.
That sounds like it might explain your problems. Maybe you should lock your process to one CPU to get stable results?
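One way to lock the measuring thread to a single CPU on Linux is sched_setaffinity. The following is a sketch only: it is Linux-specific (it needs _GNU_SOURCE), the helper name pin_to_first_allowed_cpu is invented for the example, and picking the first allowed CPU is an arbitrary choice.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pins the calling thread to the first CPU it is currently allowed to run on.
   Returns that CPU number, or -1 on failure. */
int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed, target;

    /* pid 0 means "the calling thread" */
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            CPU_ZERO(&target);
            CPU_SET(cpu, &target);
            if (sched_setaffinity(0, sizeof(target), &target) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;
}
```

Calling this once at the start of main, before the first clock_gettime or rdtsc reading, keeps all later readings on the same core, so differences between per-CPU counters can no longer show up as bogus deltas.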
When you have a highly skewed distribution that cannot go negative, you're going to see large discrepancies between mean, median, and mode.
The standard deviation is fairly meaningless for such a distribution.
It's usually a good idea to log-transform it.
That will make it "more normal".
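To see how strongly a single outlier separates the mean from the median, here is a small sketch. The helper names mean_of and median_of are made up, both assume n > 0, and the sample values in the note below are borrowed from the clock_gettime readings in the question.

```c
#include <stdlib.h>
#include <stddef.h>

/* qsort comparator for doubles in ascending order. */
static int compare_doubles(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n samples (n must be greater than 0). */
double mean_of(const double *samples, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}

/* Median of n samples (n must be greater than 0); sorts the array in place. */
double median_of(double *samples, size_t n)
{
    qsort(samples, n, sizeof samples[0], compare_doubles);
    if (n % 2 == 1)
        return samples[n / 2];
    return (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
}
```

With four readings of 11 ns and one of 1506022 ns, the median stays at 11 ns while the mean jumps to roughly 301213 ns, which is why the mean and standard deviation say little about a distribution this skewed.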

Accurately Calculating CPU Utilization in Linux using /proc/stat

There are a number of posts and references on how to get CPU Utilization using statistics in /proc/stat. However, most of them use only four of the 7+ CPU stats (user, nice, system, and idle), ignoring the remaining jiffie CPU counts present in Linux 2.6 (iowait, irq, softirq).
As an example, see Determining CPU utilization.
My question is this: Are the iowait/irq/softirq numbers also counted in one of the first four numbers (user/nice/system/idle)? In other words, does the total jiffie count equal the sum of the first four stats? Or, is the total jiffie count equal to the sum of all 7 stats? If the latter is true, then a CPU utilization formula should take all of the numbers into account, like this:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long double a[7],b[7],loadavg;
    FILE *fp;

    /* first sample of the seven counters on the aggregate "cpu" line */
    fp = fopen("/proc/stat","r");
    fscanf(fp,"%*s %Lf %Lf %Lf %Lf %Lf %Lf %Lf",&a[0],&a[1],&a[2],&a[3],&a[4],&a[5],&a[6]);
    fclose(fp);
    sleep(1);
    /* second sample, one second later */
    fp = fopen("/proc/stat","r");
    fscanf(fp,"%*s %Lf %Lf %Lf %Lf %Lf %Lf %Lf",&b[0],&b[1],&b[2],&b[3],&b[4],&b[5],&b[6]);
    fclose(fp);
    loadavg = ((b[0]+b[1]+b[2]+b[4]+b[5]+b[6]) - (a[0]+a[1]+a[2]+a[4]+a[5]+a[6]))
            / ((b[0]+b[1]+b[2]+b[3]+b[4]+b[5]+b[6]) - (a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]));
    printf("The current CPU utilization is : %Lf\n",loadavg);
    return 0;
}
I think iowait/irq/softirq are not counted in any of the first four numbers. See the comment on irqtime_account_process_tick in the kernel code for more detail:
(for Linux kernel 4.1.1)
 * Tick demultiplexing follows the order
 * - pending hardirq update <-- this is irq
 * - pending softirq update <-- this is softirq
 * - user_time
 * - idle_time <-- iowait is included in here, discussed below
 * - system time
 *   - check for guest_time
 *   - else account as system_time
For the idle time handling, see the account_idle_time function:
/*
 * Account for idle time.
 * @cputime: the cpu time spent in idle wait
 */
void account_idle_time(cputime_t cputime)
{
    u64 *cpustat = kcpustat_this_cpu->cpustat;
    struct rq *rq = this_rq();

    if (atomic_read(&rq->nr_iowait) > 0)
        cpustat[CPUTIME_IOWAIT] += (__force u64) cputime;
    else
        cpustat[CPUTIME_IDLE] += (__force u64) cputime;
}
If the CPU is idle AND there is some I/O pending, the time is counted in CPUTIME_IOWAIT. Otherwise, it is counted in CPUTIME_IDLE.
To conclude, I think the jiffies in irq/softirq should be counted as "busy" for the CPU, because it was actually handling an IRQ or soft IRQ. On the other hand, the jiffies in "iowait" should be counted as "idle", because the CPU was not doing anything except waiting for pending I/O to complete.
From busybox, the relevant logic in its top is:
    static const char fmt[] ALIGN1 = "cp%*s %llu %llu %llu %llu %llu %llu %llu %llu";
    int ret;

    if (!fgets(line_buf, LINE_BUF_SIZE, fp) || line_buf[0] != 'c' /* not "cpu" */)
        return 0;
    ret = sscanf(line_buf, fmt,
            &p_jif->usr, &p_jif->nic, &p_jif->sys, &p_jif->idle,
            &p_jif->iowait, &p_jif->irq, &p_jif->softirq,
            &p_jif->steal);
    if (ret >= 4) {
        p_jif->total = p_jif->usr + p_jif->nic + p_jif->sys + p_jif->idle
            + p_jif->iowait + p_jif->irq + p_jif->softirq + p_jif->steal;
        /* procps 2.x does not count iowait as busy time */
        p_jif->busy = p_jif->total - p_jif->idle - p_jif->iowait;
    }
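Following the same accounting (idle and iowait count as idle, everything else as busy), the utilization between two samples can be sketched as below. The enum and the function name cpu_utilization are invented for this example; the eight fields are assumed to be in the order in which they appear on the "cpu" line of /proc/stat.

```c
/* Indices of the fields on the "cpu" line of /proc/stat. */
enum { USER, NICE, SYSTEM, IDLE, IOWAIT, IRQ, SOFTIRQ, STEAL, NUM_FIELDS };

/* Fraction (0..1) of busy time between two samples of the cumulative counters.
   Idle and iowait count as idle. Returns -1.0 if no time elapsed. */
double cpu_utilization(const unsigned long long prev[NUM_FIELDS],
                       const unsigned long long curr[NUM_FIELDS])
{
    unsigned long long total = 0;
    unsigned long long idle;

    for (int i = 0; i < NUM_FIELDS; i++)
        total += curr[i] - prev[i];
    idle = (curr[IDLE] - prev[IDLE]) + (curr[IOWAIT] - prev[IOWAIT]);

    if (total == 0)
        return -1.0;
    return (double)(total - idle) / (double)total;
}
```

The two arrays would be filled from /proc/stat a fixed interval apart, for instance with sscanf and eight %llu conversions as in the busybox excerpt, and the result multiplied by 100 for a percentage.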
