Is clock_gettime() adequate for submicrosecond timing? - linux

I need a high-resolution timer for the embedded profiler in the Linux build of our application. Our profiler measures scopes as small as individual functions, so it needs a timer precision of better than 25 nanoseconds.
Previously our implementation used inline assembly and the rdtsc instruction to query the timestamp counter directly from the CPU, but this is problematic and requires frequent recalibration.
So I tried using the clock_gettime function instead to query CLOCK_PROCESS_CPUTIME_ID. The docs allege this gives me nanosecond timing, but I found that the overhead of a single call to clock_gettime() was over 250ns. That makes it impossible to time events 100ns long, and having such high overhead on the timer function seriously drags down app performance, distorting the profiles beyond value. (We have hundreds of thousands of profiling nodes per second.)
Is there a way to call clock_gettime() that has less than ¼μs overhead? Or is there some other way that I can reliably get the timestamp counter with <25ns overhead? Or am I stuck with using rdtsc?
Below is the code I used to time clock_gettime().
// calls gettimeofday() to return wall-clock time in seconds:
extern double Get_FloatTime();
enum { TESTRUNS = 1024*1024*4 };
// time the high-frequency timer against the wall clock
{
double fa = Get_FloatTime();
timespec spec;
clock_getres( CLOCK_PROCESS_CPUTIME_ID, &spec );
printf("CLOCK_PROCESS_CPUTIME_ID resolution: %ld sec %ld nano\n",
spec.tv_sec, spec.tv_nsec );
for ( int i = 0 ; i < TESTRUNS ; ++ i )
{
clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &spec );
}
double fb = Get_FloatTime();
printf( "clock_gettime %d iterations : %.6f msec %.3f microsec / call\n",
TESTRUNS, ( fb - fa ) * 1000.0, (( fb - fa ) * 1000000.0) / TESTRUNS );
}
// and so on for CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_THREAD_CPUTIME_ID.
Results:
CLOCK_PROCESS_CPUTIME_ID resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 3115.784947 msec 0.371 microsec / call
CLOCK_MONOTONIC resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2505.122119 msec 0.299 microsec / call
CLOCK_REALTIME resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2456.186031 msec 0.293 microsec / call
CLOCK_THREAD_CPUTIME_ID resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2956.633930 msec 0.352 microsec / call
This is on a standard Ubuntu kernel. The app is a port of a Windows app (where our rdtsc inline assembly works just fine).
Addendum:
Does x86-64 GCC have some intrinsic equivalent to __rdtsc(), so I can at least avoid inline assembly?

No. You'll have to use platform-specific code to do it. On x86 and x86-64, you can use 'rdtsc' to read the Time Stamp Counter.
Just port the rdtsc assembly you're using.
__inline__ uint64_t rdtsc(void) {
uint32_t lo, hi;
__asm__ __volatile__ ( // serialize
"xorl %%eax,%%eax \n cpuid"
::: "%rax", "%rbx", "%rcx", "%rdx");
/* We cannot use "=A", since this would use %rax on x86_64 and return only the lower 32bits of the TSC */
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return (uint64_t)hi << 32 | lo;
}
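If you do keep the TSC path, a rough calibration sketch (my addition, not part of the original answer) is to measure how many TSC cycles elapse over a known CLOCK_MONOTONIC interval, then use that ratio to convert later cycle deltas to nanoseconds. This assumes a constant/invariant TSC and the rdtsc() helper defined above:
#include <stdint.h>
#include <time.h>

/* Rough calibration sketch: TSC ticks per nanosecond, measured against
   CLOCK_MONOTONIC over ~10 ms of wall time. */
static double tsc_ticks_per_ns(void)
{
    struct timespec t0, t1;
    long long elapsed_ns;
    uint64_t c0, c1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    c0 = rdtsc();
    do {    /* spin until ~10 ms have passed */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        elapsed_ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                   + (t1.tv_nsec - t0.tv_nsec);
    } while (elapsed_ns < 10000000LL);
    c1 = rdtsc();

    return (double)(c1 - c0) / (double)elapsed_ns;
}

/* Usage sketch:
   double tpn = tsc_ticks_per_ns();
   uint64_t start = rdtsc();  ... scope to profile ...  uint64_t end = rdtsc();
   double ns = (double)(end - start) / tpn;  */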

This is what happens when you call the clock_gettime() function.
Based on the clock you choose, it calls the corresponding implementation (this excerpt is from the kernel's vclock_gettime.c file):
int clock_gettime(clockid_t, struct __kernel_old_timespec *)
__attribute__((weak, alias("__vdso_clock_gettime")));
notrace int
__vdso_clock_gettime_stick(clockid_t clock, struct __kernel_old_timespec *ts)
{
struct vvar_data *vvd = get_vvar_data();
switch (clock) {
case CLOCK_REALTIME:
if (unlikely(vvd->vclock_mode == VCLOCK_NONE))
break;
return do_realtime_stick(vvd, ts);
case CLOCK_MONOTONIC:
if (unlikely(vvd->vclock_mode == VCLOCK_NONE))
break;
return do_monotonic_stick(vvd, ts);
case CLOCK_REALTIME_COARSE:
return do_realtime_coarse(vvd, ts);
case CLOCK_MONOTONIC_COARSE:
return do_monotonic_coarse(vvd, ts);
}
/*
* Unknown clock ID ? Fall back to the syscall.
*/
return vdso_fallback_gettime(clock, ts);
}
CLOCK_MONOTONIC is better (though I use CLOCK_MONOTONIC_RAW) since it is not affected by NTP time adjustments.
This is how the do_monotonic_stick is implemented inside kernel:
notrace static __always_inline int do_monotonic_stick(struct vvar_data *vvar,
struct __kernel_old_timespec *ts)
{
unsigned long seq;
u64 ns;
do {
seq = vvar_read_begin(vvar);
ts->tv_sec = vvar->monotonic_time_sec;
ns = vvar->monotonic_time_snsec;
ns += vgetsns_stick(vvar);
ns >>= vvar->clock.shift;
} while (unlikely(vvar_read_retry(vvar, seq)));
ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
ts->tv_nsec = ns;
return 0;
}
And the vgetsns() helper (vgetsns_stick() in the STICK-based variant above), which provides the nanosecond resolution, is implemented as:
notrace static __always_inline u64 vgetsns(struct vvar_data *vvar)
{
u64 v;
u64 cycles;
cycles = vread_tick();
v = (cycles - vvar->clock.cycle_last) & vvar->clock.mask;
return v * vvar->clock.mult;
}
Here the function vread_tick() reads the cycle counter from a CPU register (this particular implementation reads SPARC's %tick register; the x86 vDSO uses rdtsc instead):
notrace static __always_inline u64 vread_tick(void)
{
register unsigned long long ret asm("o4");
__asm__ __volatile__("rd %%tick, %L0\n\t"
"srlx %L0, 32, %H0"
: "=r" (ret));
return ret;
}
A single call to clock_gettime() takes around 20 to 100 nanoseconds. Reading the TSC with rdtsc and converting the cycles to time yourself is generally faster.
I have done some experiments with CLOCK_MONOTONIC_RAW here: Unexpected periodic behaviour of an ultra low latency hard real time multi threaded x86 code
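For reference, a user-space analogue of the mult/shift conversion shown in vgetsns() above might look like the sketch below. This is illustrative only; the mult and shift values are hypothetical and would have to come from calibrating your counter frequency, just as the kernel derives them:
#include <stdint.h>

/* Illustrative fixed-point conversion, mirroring the kernel's
   ns = (cycles * mult) >> shift scheme. */
static inline uint64_t cycles_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
    return (cycles * (uint64_t)mult) >> shift;
}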

I need a high-resolution timer for the embedded profiler in the Linux build of our application. Our profiler measures scopes as small as individual functions, so it needs a timer precision of better than 25 nanoseconds.
Have you considered oprofile or perf? You can use the performance counter hardware on your CPU to get profiling data without adding instrumentation to the code itself. You can see data per function, or even per line of code. The "only" drawback is that it won't measure wall-clock time consumed; it measures CPU time consumed, so it's not appropriate for all investigations.

Give clockid_t CLOCK_MONOTONIC_RAW a try?
CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific)
Similar to CLOCK_MONOTONIC, but provides access to a
raw hardware-based time that is not subject to NTP
adjustments or the incremental adjustments performed by
adjtime(3).
From Man7.org
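A minimal usage sketch (my own illustration, assuming Linux >= 2.6.28) that queries the resolution and reads the clock once:
#include <stdio.h>
#include <time.h>

/* Minimal sketch: query resolution and read CLOCK_MONOTONIC_RAW. */
int main(void)
{
    struct timespec res, now;
    clock_getres(CLOCK_MONOTONIC_RAW, &res);
    clock_gettime(CLOCK_MONOTONIC_RAW, &now);
    printf("resolution: %ld ns, now: %lld.%09ld s\n",
           res.tv_nsec, (long long)now.tv_sec, now.tv_nsec);
    return 0;
}
On older glibc you may need to link with -lrt for clock_gettime().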

It's hard to give a globally applicable answer because the hardware and software implementation will vary widely.
However, yes, most modern platforms will have a suitable clock_gettime call that is implemented purely in user-space using the VDSO mechanism, and will in my experience take 20 to 30 nanoseconds to complete (but see Wojciech's comment below about contention).
Internally, this is using rdtsc or rdtscp for the fine-grained portion of the time-keeping, plus adjustments to keep this in sync with wall-clock time (depending on the clock you choose) and a multiplication to convert from whatever units rdtsc has on your platform to nanoseconds.
Not all of the clocks offered by clock_gettime will implement this fast method, and it's not always obvious which ones do. Usually CLOCK_MONOTONIC is a good option, but you should test this on your own system.
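One simple way to test it (a sketch of my own, not from the answer above) is to take many back-to-back readings and look at the smallest delta, which approximates the per-call cost on the vDSO fast path:
#include <stdio.h>
#include <time.h>

/* Sketch: estimate clock_gettime() overhead from the minimum delta
   between back-to-back CLOCK_MONOTONIC readings. */
int main(void)
{
    long best = 1000000000L;
    for (int i = 0; i < 1000000; i++) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        clock_gettime(CLOCK_MONOTONIC, &b);
        long d = (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
        if (d < best)
            best = d;
    }
    printf("min back-to-back delta: %ld ns\n", best);
    return 0;
}
If the reported minimum is in the tens of nanoseconds, that clock is almost certainly being served from the vDSO rather than via a real syscall.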

You are calling clock_gettime() with a control parameter, which means the API has to branch through an if-else tree to see what kind of time you want. I know you can't avoid that with this call, but see if you can dig into the system code and call what the kernel eventually calls directly. Also, note that your measurement includes the loop overhead (i++ and the conditional branch); a rough correction for that is sketched below.
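For what it's worth, one rough way to correct for the loop overhead (a sketch reusing the asker's Get_FloatTime() and TESTRUNS, so treat it as illustrative only) is to time the same loop with an empty body and subtract:
// Rough correction sketch: time an empty loop of the same length and
// subtract its per-iteration cost from the figures above. The volatile
// counter keeps the compiler from deleting the loop entirely.
double fa = Get_FloatTime();
for ( volatile int i = 0 ; i < TESTRUNS ; ++ i ) { /* empty */ }
double fb = Get_FloatTime();
double loop_overhead_us = (( fb - fa ) * 1000000.0) / TESTRUNS;
// subtract loop_overhead_us from each "microsec / call" figure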

Related

In a busy loop, two consecutive time queries encounter a big time gap

In a busy loop, the code gets the current time twice in a row and computes the interval between the two readings. The interval is supposed to be small, since the code belongs to a SCHED_FIFO process with the highest priority (99) and should not be preempted by other processes while running.
The rdtsc instruction is used to get the current time from userspace, and cpuid is used to serialize instructions.
The code is as follows:
#define RDTSCP(u64) do { \
unsigned int hi, lo; \
__asm__ __volatile__ ("RDTSC\n\t" : "=a" (lo), "=d" (hi)); \
u64 = ((uint64_t)hi << 32) | lo; \
}while(0)
unsigned int usecs = 20; //20us
int tsum = 0;
for(long long ii = 0; ii < OUTER_LOOP; ++ii) {
for (long long jj = 0; jj < INNER_LOOP; ++jj) {
MEM_BAR;
__asm__ __volatile__ ("CPUID");
RDTSCP(begin); // get time
__asm__ __volatile__ ("CPUID");
RDTSCP(end); // get time as well
__asm__ __volatile__ ("CPUID");
MEM_BAR;
temp = end-begin;
if (temp < min) min = temp; // minimum time interval changes
if (temp > max) max = temp, flag=true; // maximum time interval changes
}
if (flag) {
printf("min: %ld\t\tmax: %ld \t(clocks) \n", min, max);
fflush(stdout);
flag=false;
}
usleep(usecs);
}
Observing over several hours, the biggest time interval on my PC (CPU i9-9900K, 3600 MHz) is 578112 clocks, which is about 160 us. This value seems extraordinary to me.
My question is: why is there such a big time interval between these two sequential rdtsc instructions, and what causes it?
The compiler seems to be innocent; the generated assembly is as follows:
0x5555555554fb <main+594> mfence
0x5555555554fe <main+597> cpuid
0x555555555500 <main+599> rdtsc
0x555555555502 <main+601> mov %eax,-0xe0(%rbp)
0x555555555508 <main+607> mov %edx,-0xdc(%rbp)
0x55555555550e <main+613> mov -0xdc(%rbp),%eax
0x555555555514 <main+619> shl $0x20,%rax
0x555555555518 <main+623> mov %rax,%rdx
0x55555555551b <main+626> mov -0xe0(%rbp),%eax
0x555555555521 <main+632> or %rdx,%rax
0x555555555524 <main+635> mov %rax,-0xa0(%rbp)
0x55555555552b <main+642> cpuid
0x55555555552d <main+644> rdtsc
0x55555555552f <main+646> mov %eax,-0xd8(%rbp)
0x555555555535 <main+652> mov %edx,-0xd4(%rbp)
0x55555555553b <main+658> mov -0xd4(%rbp),%eax
0x555555555541 <main+664> shl $0x20,%rax
0x555555555545 <main+668> mov %rax,%rdx
0x555555555548 <main+671> mov -0xd8(%rbp),%eax
0x55555555554e <main+677> or %rdx,%rax
0x555555555551 <main+680> mov %rax,-0x98(%rbp)
0x555555555558 <main+687> cpuid
0x55555555555a <main+689> mfence
Even with the highest scheduling priority, a userspace process may still be preempted by a kernel thread. You can monitor the kernel scheduler with perf sched (https://man7.org/linux/man-pages/man1/perf-sched.1.html) to track kernel scheduling events and figure out which thread is using the core.
To reduce performance variance, reduce the kernel noise on the test core (an example kernel command line is sketched below):
Isolate the test core from the scheduler entirely with the boot option isolcpus.
Prevent the kernel from sending scheduler tick interrupts to it with the boot option nohz_full.
Move RCU callbacks to other cores with the boot option rcu_nocbs.
From my own experience: avoid using the first and the last core (0 and N).
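For example, if core 3 were dedicated to the test, the corresponding kernel command-line additions might look like the line below (illustrative values only; adjust the CPU list to your topology):
isolcpus=3 nohz_full=3 rcu_nocbs=3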
In further research, I conducted several experiments and found that LOC (local timer) interrupts are the cause of the unusually large intervals between the two consecutive time readings. Even a full tickless kernel triggers a LOC interrupt every couple of seconds for system statistics, so it seems such microsecond-scale time jitter is unavoidable on a general-purpose Linux kernel.
What I did to verify this:
Traced the underlying interrupts with the ftrace tool while querying the time in a busy loop, and found that LOC interrupts preempt the user program and cause the time jitter.
Reran the test on a fully isolated CPU core, with the help of the task-isolation kernel patch, which disables most interrupts, including LOC interrupts; with that, the large intervals/jitter disappeared.
References:
A full task-isolation mode for the kernel
ftrace documentation

Linux clock_gettime() elapse spikes?

I'm trying to get high-resolution timestamps on Linux. Using clock_gettime(), as below, I get "spike" elapsed times that look pretty horrible - almost 26 microseconds - while most of the "dt" values are around 30 ns. This is on Linux 2.6.32, Red Hat 4.4.6. 'lscpu' shows CPU MHz=2666.121; I take that to mean each clock tick is about 0.4 ns, so asking for nanosecond resolution didn't seem too unreasonable here.
Output of the program:
1397534268,40823395 1397534268,40827950,dt=4555
1397534268,41233555 1397534268,41236716,dt=3161
1397534268,41389902 1397534268,41392922,dt=3020
1397534268,46488430 1397534268,46491674,dt=3244
1397534268,46531297 1397534268,46534279,dt=2982
1397534268,46823368 1397534268,46849336,dt=25968
1397534268,46915657 1397534268,46918663,dt=3006
1397534268,51488643 1397534268,51491791,dt=3148
1397534268,51530490 1397534268,51533496,dt=3006
1397534268,51823307 1397534268,51826904,dt=3597
1397534268,55823359 1397534268,55827826,dt=4467
1397534268,60531184 1397534268,60534183,dt=2999
1397534268,60823381 1397534268,60844866,dt=21485
1397534268,60913003 1397534268,60915998,dt=2995
1397534268,65823269 1397534268,65827742,dt=4473
1397534268,70823376 1397534268,70835280,dt=11904
1397534268,75823489 1397534268,75828872,dt=5383
1397534268,80823503 1397534268,80859500,dt=35997
1397534268,86823381 1397534268,86831907,dt=8526
Any ideas? thanks
#include <vector>
#include <iostream>
#include <time.h>
long long elapse( const timespec& t1, const timespec& t2 )
{
return ( t2.tv_sec * 1000000000LL + t2.tv_nsec ) -
( t1.tv_sec * 1000000000LL + t1.tv_nsec );
}
int main()
{
const unsigned n=30000;
timespec ts;
std::vector<timespec> t( n );
for( unsigned i=0; i < n; ++i )
{
clock_gettime( CLOCK_REALTIME, &ts );
t[i] = ts;
}
std::vector<long> dt( n );
for( unsigned i=1; i < n; ++i )
{
dt[i] = elapse( t[i-1], t[i] );
if( dt[i] > 1000 )
{
std::cerr <<
t[i-1].tv_sec << ","
<< t[i-1].tv_nsec << " "
<< t[i].tv_sec << ","
<< t[i].tv_nsec
<< ",dt=" << dt[i] << std::endl;
}
else
{
//normally I get dt[i] = approx 30-35 nano secs
}
}
return 0;
}
The numbers you quoted are in the 3 to 30 microsecond range (3,000 to 30,000 nanoseconds). That is too short a time to be a context switch to another thread/process, let the other thread run, and context switch back to your thread. Most likely the core where your process was running was used by the kernel to service an external interrupt (e.g. network card, disk, timer), then returned to running your process.
You can watch the linux interrupt counters (per CPU core and per source) with this command
watch -d -n 0.2 cat /proc/interrupts
The -n 0.2 will cause the command to be issued at 5Hz, the -d flag will highlight what has changed.
The source of the interrupt could also be a TLB shootdown, which results in an IPI (Inter-Processor Interrupt). You can read more about TLB shootdowns here.
If you want to reduce the number of interrupts serviced by the core running your thread/process, you need to set the interrupt affinity. You can learn more about Red Hat Interrupts and IRQ (Interrupt requests) tuning here, and here.
Worth noting is that you are using CLOCK_REALTIME, which isn't guaranteed to be "smooth"; it can jump around as the system clock is "disciplined" to keep accurate time by a service like NTP (Network Time Protocol) or PTP (Precision Time Protocol). For your purposes it is better to use CLOCK_MONOTONIC; you can read more about the difference here. When a clock is "disciplined" it can jump by a "step" - this is unusual and certainly not the cause of the many spikes you see.
Could you check the resolution with clock_getres()?
I suspect what you are measuring here is called "OS Noise". This is often caused by your program getting pre-empted by the operating system. The operating system then performs other work. There are numerous causes, but commonly it is: other runnable tasks, hardware interrupts, or timer events.
The FTQ/FWQ benchmarks were designed to measure this characteristic and the summary contains some further information:
https://asc.llnl.gov/sequoia/benchmarks/FTQ_summary_v1.1.pdf

Reliability of Linux kernel add_timer at resolution of one jiffy?

In the code given below, there is a simple Linux kernel module (driver) which calls a function repeatedly 10 times, using add_timer at a resolution of 1 jiffy (that is, the timer is scheduled to fire at jiffies + 1). Using the bash script rerun.sh, I then obtain timestamps from the printout in syslog and visualize them using gnuplot.
In most cases, I get a syslog output like this:
[ 7103.055787] Init testjiffy: 0 ; HZ: 250 ; 1/HZ (ms): 4
[ 7103.056044] testjiffy_timer_function: runcount 1
[ 7103.060045] testjiffy_timer_function: runcount 2
[ 7103.064052] testjiffy_timer_function: runcount 3
[ 7103.068050] testjiffy_timer_function: runcount 4
[ 7103.072053] testjiffy_timer_function: runcount 5
[ 7103.076036] testjiffy_timer_function: runcount 6
[ 7103.080044] testjiffy_timer_function: runcount 7
[ 7103.084044] testjiffy_timer_function: runcount 8
[ 7103.088060] testjiffy_timer_function: runcount 9
[ 7103.092059] testjiffy_timer_function: runcount 10
[ 7104.095429] Exit testjiffy
... which results in time series and delta histogram plots like these:
This is, essentially, the quality of timing that I'd expect from the code.
However - every once in a while, I get a capture like:
[ 7121.377507] Init testjiffy: 0 ; HZ: 250 ; 1/HZ (ms): 4
[ 7121.380049] testjiffy_timer_function: runcount 1
[ 7121.384062] testjiffy_timer_function: runcount 2
[ 7121.392053] testjiffy_timer_function: runcount 3
[ 7121.396055] testjiffy_timer_function: runcount 4
[ 7121.400068] testjiffy_timer_function: runcount 5
[ 7121.404085] testjiffy_timer_function: runcount 6
[ 7121.408084] testjiffy_timer_function: runcount 7
[ 7121.412072] testjiffy_timer_function: runcount 8
[ 7121.416083] testjiffy_timer_function: runcount 9
[ 7121.420066] testjiffy_timer_function: runcount 10
[ 7122.417325] Exit testjiffy
... which results in a rendering like this:
... and I'm like: "WHOOOOOAAAAAA ... wait a second..." - isn't there a pulse dropped from the sequence? Meaning that add_timer missed a slot, and then fired up the function in the next 4 ms slot?
The interesting thing is that while running these tests, I have nothing but a terminal, a web browser and a text editor started up - so I cannot really see anything running that might hog the OS/kernel; and thus, I really cannot see a reason why the kernel would make such a big miss (of an entire jiffy period). When I read about Linux kernel timing, e.g. "The simplest and least accurate of all timers ... is the timer API", I read that "least accurate" as "don't expect exactly a 4 ms period" (as per this example) - and I don't; I'm fine with the variance shown in the (first) histogram, but I don't expect that a whole period will be missed!?
So my question(s) are:
Is this expected behavior from add_timer at this resolution (that a period can occasionally be missed)?
If so, is there a way to "force" add_timer to fire the function at each 4ms slot, as specified by a jiffy on this platform?
Is it possible that I get a "wrong" timestamp - e.g. the timestamp reflecting when the actual "print" to syslog happened, rather than when the function actually fired?
Note that I'm not looking for a period resolution below what corresponds to a jiffy (in this case, 4ms); nor am I looking to decrease the delta variance when the code works properly. So as I see it, I don't have "high resolution timer" demands, nor "hard real-time" demands - I just want add_timer to fire reliably. Would that be possible on this platform, without resorting to special "real-time" configurations of the kernel?
Bonus question: in rerun.sh below, you'll note two sleeps marked with MUSTHAVE; if either of them is left out or commented, the OS/kernel freezes and requires a hard reboot. And I cannot see why - is it really possible that running rmmod after insmod from bash is so fast that it conflicts with the normal process of module loading/unloading?
Platform info:
$ cat /proc/cpuinfo | grep "processor\|model name\|MHz\|cores"
processor : 0 # (same for 1)
model name : Intel(R) Atom(TM) CPU N450 @ 1.66GHz
cpu MHz : 1000.000
cpu cores : 1
$ echo $(cat /etc/issue ; uname -a)
Ubuntu 11.04 \n \l Linux mypc 2.6.38-16-generic #67-Ubuntu SMP Thu Sep 6 18:00:43 UTC 2012 i686 i686 i386 GNU/Linux
$ echo $(lsb_release -a 2>/dev/null | tr '\n' ' ')
Distributor ID: Ubuntu Description: Ubuntu 11.04 Release: 11.04 Codename: natty
Code:
$ cd /tmp/testjiffy
$ ls
Makefile rerun.sh testjiffy.c
Makefile:
obj-m += testjiffy.o
all:
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
clean:
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
testjiffy.c:
/*
* [http://www.tldp.org/LDP/lkmpg/2.6/html/lkmpg.html#AEN189 The Linux Kernel Module Programming Guide]
*/
#include <linux/module.h> /* Needed by all modules */
#include <linux/kernel.h> /* Needed for KERN_INFO */
#include <linux/init.h> /* Needed for the macros */
#include <linux/jiffies.h>
#include <linux/time.h>
#define MAXRUNS 10
static volatile int runcount = 0;
static struct timer_list my_timer;
static void testjiffy_timer_function(unsigned long data)
{
int tdelay = 100;
runcount++;
if (runcount == 5) {
while (tdelay > 0) { tdelay--; } // small delay
}
printk(KERN_INFO
" %s: runcount %d \n",
__func__, runcount);
if (runcount < MAXRUNS) {
my_timer.expires = jiffies + 1;
add_timer(&my_timer);
}
}
static int __init testjiffy_init(void)
{
printk(KERN_INFO
"Init testjiffy: %d ; HZ: %d ; 1/HZ (ms): %d\n",
runcount, HZ, 1000/HZ);
init_timer(&my_timer);
my_timer.function = testjiffy_timer_function;
//my_timer.data = (unsigned long) runcount;
my_timer.expires = jiffies + 1;
add_timer(&my_timer);
return 0;
}
static void __exit testjiffy_exit(void)
{
printk(KERN_INFO "Exit testjiffy\n");
}
module_init(testjiffy_init);
module_exit(testjiffy_exit);
MODULE_LICENSE("GPL");
rerun.sh:
#!/usr/bin/env bash
set -x
make clean
make
# blank syslog first
sudo bash -c 'echo "0" > /var/log/syslog'
sleep 1 # MUSTHAVE 01!
# reload kernel module/driver
sudo insmod ./testjiffy.ko
sleep 1 # MUSTHAVE 02!
sudo rmmod testjiffy
set +x
# copy & process syslog
max=0;
for ix in _testjiffy_*.syslog; do
aa=${ix#_testjiffy_};
ab=${aa%.syslog} ;
case $ab in
*[!0-9]*) ab=0;; # reset if non-digit obtained; else
*) ab=$(echo $ab | bc);; # remove leading zeroes (else octal)
esac
if (( $ab > $max )) ; then
max=$((ab));
fi;
done;
newm=$( printf "%05d" $(($max+1)) );
PLPROC='chomp $_;
if (!$p) {$p=0;}; if (!$f) {$f=$_;} else {
$a=$_-$f; $d=$a-$p;
print "$a $d\n" ; $p=$a;
};'
set -x
grep "testjiffy" /var/log/syslog | cut -d' ' -f7- > _testjiffy_${newm}.syslog
grep "testjiffy_timer_function" _testjiffy_${newm}.syslog \
| sed 's/\[\(.*\)\].*/\1/' \
| perl -ne "$PLPROC" \
> _testjiffy_${newm}.dat
set +x
cat > _testjiffy_${newm}.gp <<EOF
set terminal pngcairo font 'Arial,10' size 900,500
set output '_testjiffy_${newm}.png'
set style line 1 linetype 1 linewidth 3 pointtype 3 linecolor rgb "red"
set multiplot layout 1,2 title "_testjiffy_${newm}.syslog"
set xtics rotate by -45
set title "Time positions"
set yrange [0:1.5]
set offsets graph 50e-3, 1e-3, 0, 0
plot '_testjiffy_${newm}.dat' using 1:(1.0):xtic(gprintf("%.3se%S",\$1)) notitle with points ls 1, '_testjiffy_${newm}.dat' using 1:(1.0) with impulses ls 1
binwidth=0.05e-3
set boxwidth binwidth
bin(x,width)=width*floor(x/width) + width/2.0
set title "Delta diff histogram"
set style fill solid 0.5
set autoscale xy
set offsets graph 0.1e-3, 0.1e-3, 0.1, 0.1
plot '_testjiffy_${newm}.dat' using (bin(\$2,binwidth)):(1.0) smooth freq with boxes ls 1
unset multiplot
EOF
set -x; gnuplot _testjiffy_${newm}.gp ; set +x
EDIT: Motivated by this comment by @granquet, I tried to obtain scheduler statistics from /proc/schedstat and /proc/sched_debug by using dd through call_usermodehelper; note that most of the time this "skips" (that is, the file for the 7th, or 6th, or Xth run of the function would be missing), but I managed to obtain two complete runs and posted them at https://gist.github.com/anonymous/5709699 (as I noticed gist may be preferred to pastebin on SO), given the output is kinda massive; the *_11* files log a proper run, the *_17* files log a run with a "drop".
Note I also switched to mod_timer_pinned in the module, and it doesn't help much (the gist logs are obtained with the module using this function). These are the changes in testjiffy.c:
#include <linux/kmod.h> // usermode-helper API
...
char fcmd[] = "of=/tmp/testjiffy_sched00";
char *dd1argv[] = { "/bin/dd", "if=/proc/schedstat", "oflag=append", "conv=notrunc", &fcmd[0], NULL };
char *dd2argv[] = { "/bin/dd", "if=/proc/sched_debug", "oflag=append", "conv=notrunc", &fcmd[0], NULL };
static char *envp[] = {
"HOME=/",
"TERM=linux",
"PATH=/sbin:/bin:/usr/sbin:/usr/bin", NULL };
static void testjiffy_timer_function(unsigned long data)
{
int tdelay = 100;
unsigned long tjnow;
runcount++;
if (runcount == 5) {
while (tdelay > 0) { tdelay--; } // small delay
}
printk(KERN_INFO
" %s: runcount %d \n",
__func__, runcount);
if (runcount < MAXRUNS) {
mod_timer_pinned(&my_timer, jiffies + 1);
tjnow = jiffies;
printk(KERN_INFO
" testjiffy expires: %lu - jiffies %lu => %lu / %lu\n",
my_timer.expires, tjnow, my_timer.expires-tjnow, jiffies);
sprintf(fcmd, "of=/tmp/testjiffy_sched%02d", runcount);
call_usermodehelper( dd1argv[0], dd1argv, envp, UMH_NO_WAIT );
call_usermodehelper( dd2argv[0], dd2argv, envp, UMH_NO_WAIT );
}
}
... and this in rerun.sh:
...
set +x
for ix in /tmp/testjiffy_sched*; do
echo $ix | tee -a _testjiffy_${newm}.sched
cat $ix >> _testjiffy_${newm}.sched
done
set -x ; sudo rm /tmp/testjiffy_sched* ; set +x
cat > _testjiffy_${newm}.gp <<EOF
...
I'll use this post for verbose replying.
@CL.: many thanks for the answer. Good to have it confirmed that it's "possible that your timer function gets called at a later jiffy"; by logging the jiffies, I too realized that the timer function gets called at a later time - and other than that, it doesn't do anything "wrong" per se.
Good to know about the timestamps; I wonder if it is possible that the timer function hits at the right time, but the kernel preempts the kernel logging service (I believe it's klogd), so I get a delayed timestamp? However, I'm trying to create a "looped" (or rather, periodic) timer function to write to hardware, and I first noted this "drop" by noticing that the PC doesn't write data at certain intervals on the USB bus; and given that the timestamps confirm that behavior, it's probably not the problem here (I guess).
I have modified the timer function so it fires relative to the scheduled time of the last timer (my_timer.expires) - again via mod_timer_pinned instead of add_timer:
static void testjiffy_timer_function(unsigned long data)
{
int tdelay = 100;
unsigned long tjlast;
unsigned long tjnow;
runcount++;
if (runcount == 5) {
while (tdelay > 0) { tdelay--; } // small delay
}
printk(KERN_INFO
" %s: runcount %d \n",
__func__, runcount);
if (runcount < MAXRUNS) {
tjlast = my_timer.expires;
mod_timer_pinned(&my_timer, tjlast + 1);
tjnow = jiffies;
printk(KERN_INFO
" testjiffy expires: %lu - jiffies %lu => %lu / %lu last: %lu\n",
my_timer.expires, tjnow, my_timer.expires-tjnow, jiffies, tjlast);
}
}
... and the first few tries, it works impeccably - however, eventually, I get this:
[13389.775508] Init testjiffy: 0 ; HZ: 250 ; 1/HZ (ms): 4
[13389.776051] testjiffy_timer_function: runcount 1
[13389.776063] testjiffy expires: 3272445 - jiffies 3272444 => 1 / 3272444 last: 3272444
[13389.780053] testjiffy_timer_function: runcount 2
[13389.780068] testjiffy expires: 3272446 - jiffies 3272445 => 1 / 3272445 last: 3272445
[13389.788054] testjiffy_timer_function: runcount 3
[13389.788073] testjiffy expires: 3272447 - jiffies 3272447 => 0 / 3272447 last: 3272446
[13389.788090] testjiffy_timer_function: runcount 4
[13389.788096] testjiffy expires: 3272448 - jiffies 3272447 => 1 / 3272447 last: 3272447
[13389.792070] testjiffy_timer_function: runcount 5
[13389.792091] testjiffy expires: 3272449 - jiffies 3272448 => 1 / 3272448 last: 3272448
[13389.796044] testjiffy_timer_function: runcount 6
[13389.796062] testjiffy expires: 3272450 - jiffies 3272449 => 1 / 3272449 last: 3272449
[13389.800053] testjiffy_timer_function: runcount 7
[13389.800063] testjiffy expires: 3272451 - jiffies 3272450 => 1 / 3272450 last: 3272450
[13389.804056] testjiffy_timer_function: runcount 8
[13389.804072] testjiffy expires: 3272452 - jiffies 3272451 => 1 / 3272451 last: 3272451
[13389.808045] testjiffy_timer_function: runcount 9
[13389.808057] testjiffy expires: 3272453 - jiffies 3272452 => 1 / 3272452 last: 3272452
[13389.812054] testjiffy_timer_function: runcount 10
[13390.815415] Exit testjiffy
... which renders like so:
... so, basically I have a delay/"drop" at the +8 ms slot (which should be #3272446 jiffies), and then two functions run at the +12 ms slot (which would be #3272447 jiffies); you can even see the label on the plot as "more bolded" because of it. This is better, in the sense that the "drop" sequence is now synchronous with a proper, non-drop sequence (which is as you said: "to avoid that one late timer function shifts all following timer calls") - however, I still miss a beat; and since I have to write bytes to hardware at each beat to keep a sustained, constant transfer rate, this unfortunately doesn't help me much.
As for the other suggestion, to "use ten timers": because of my ultimate goal (writing to hardware using a periodic low-res timer function), I thought at first it doesn't apply - but if nothing else is possible (other than doing some special real-time kernel preparations), then I'll certainly try a scheme where I have 10 (or N) timers (maybe stored in an array) which are fired periodically one after another.
EDIT: just adding leftover relevant comments:
USB transfers are either scheduled in advance (isochronous) or have no timing guarantees (asynchronous). If your device doesn't use isochronous transfers, it's badly misdesigned. – CL. Jun 5 at 10:47
Thanks for the comment, @CL. - "... scheduled in advance (isochronous)..." cleared up a confusion I had. I'm (eventually) targeting an FT232, which only has BULK mode - and as long as the number of bytes per timer hit is low, I can actually "cheat" my way through "streaming" data with add_timer; however, when I transfer an amount of bytes close to consuming the bandwidth, these "misfires" start getting noticeable as drops. So I was interested in testing the limits of that, for which I need a reliably repetitive "timer" function - is there anything else I could try to get a reliable "timer"? – sdaau Jun 5 at 12:27
@sdaau Bulk transfers are not suitable for streaming. You cannot fix shortcomings in the hardware protocol by using another kind of software timer. – CL. Jun 5 at 13:50
... and as my response to @CL.: I'm aware I wouldn't be able to fix the shortcomings; I was more interested in observing them - say, if a kernel function makes a periodic USB write, I could observe the signals on a scope/analyzer, and hopefully see in what sense bulk mode is unsuitable. But first, I'd have to trust that the function can (at least somewhat) reliably repeat at a periodic rate (i.e. "generate" a clock/tick) - and I wasn't aware, until now, that I cannot really trust add_timer at jiffies resolution (as it is capable of relatively easily skipping a whole period). However, it seems that moving to Linux' high-resolution timers (hrtimer) does give me a reliable periodic function in this sense - so I guess that solves my problem (posted in my answer below).
Many thanks for all the comments and answers; they all pointed to things that must be taken into account - but given I'm somewhat of a forever noob, I still needed to do some more reading, before gaining some understanding (I hope a correct one). Also, I couldn't really find anything specific for periodically "ticking" functions - so I'll post a more verbose answer here.
In brief - for a reliable periodic Linux kernel function at a resolution of a jiffy, do not use add_timer (<linux/time.h>), as it may "drop" an entire period; use high-resolution timers (<linux/hrtimer.h>) instead. In more detail:
Is it possible that I get a "wrong" timestamp - ...?
@CL.: The timestamp in the log is the time when that string was printed to the log.
So, maybe it's possible - but it turns out, that's not the problem here:
Is this expected behavior from add_timer at this resolution (that a period can occasionally be missed)?
I guess, it turns out - yes:
If so, is there a way to "force" add_timer to fire the function at each 4ms slot, as specified by a jiffy on this platform?
... and (I guess again), it turns out - no.
Now, the reasons for this are somewhat subtle - and I hope if I didn't get them right, someone will correct me. First of all, the first misconception that I had, was that "a clock is just a clock" (in the sense of: even if it is implemented as computer code) - but that is not quite correct. The kernel basically has to "queue" an "event" somewhere, each time something like add_timer is used; and this request may come from anything really: from any (and all) sort(s) of driver(s), or even possibly userspace.
The problem is that this "queuing" costs - since in addition to the kernel having to handle (the equivalent of) traversing and inserting (and removing) items in an array, it also has to handle timer delays spanning several orders of magnitude (from say milliseconds to maybe 10s of seconds); and the fact that some drivers (like, apparently, those for network protocols) apparently queue a lot of timer events, which are usually cancelled before running - while other types may require a completely different behavior (like in my case - in a periodic function, you expect that most of the time, the event will usually not be cancelled; and you also queue the events one by one). On top of that, the kernel needs to handle this for uniprocessor vs. SMP vs. multiprocessor platforms. Thus, there is a cost-benefit tradeoff involved in implementing timer handling in the kernel.
It turns out, the architecture around jiffies/add_timer is designed to handle the most common devices - and for them, precision at a resolution of a jiffy is not an issue; but this also means that one cannot expect a reliable timer at resolution of a single jiffy with this method. This is also compounded by the fact that the kernel handles these "event queues" by treating them (somewhat) like interrupt service requests (IRQ); and that there are several levels of priority in IRQ handling in the kernel, where higher priority routine can pre-empt a lower priority one (that is: interrupt and suspend a lower priority routine, even if it is being executed at the time - and allow the higher priority routine to go about its business). Or, as previously noted:
@granquet: timers run in soft irq context, which means they have the highest priority and they preempt everything running/runnable on the CPU ... but hardware interrupts are not disabled when servicing a soft irq. So you might (most probable explanation) get a hardware interrupt here and there that preempts your timer ... and thus you get an interrupt that is not serviced at the right time.
@CL.: It is indeed possible that your timer function gets called at a later jiffy than what expires was set to. Possible reasons are scheduling delays, other drivers that disable interrupts for too long (graphics and WLAN drivers are usual culprits), or some crappy BIOS executing SMI code.
I now think so, too - I think this could be an illustration of what happens:
jiffies changes to, say, 10000 (== 40000 ms @ 250 Hz)
Let's say the timer function, (queued by add_timer) is about to start running - but hasn't started running yet
Let's say here, the network card generates (for whatever reason) a hardware interrupt
The hardware interrupt, having a higher priority, triggers the kernel to pre-empt (stop and suspend) the timer function (possibly started by now, and just a few instructions in);
That means the kernel now has to reschedule the timer function to run at a later point - and since one only works with integer operations in the kernel, and the time resolution for this kind of event is in jiffies, the best it can do is reschedule it for jiffies+1 (10001 == 40004 ms @ 250 Hz)
Now the kernel switches the context to the IRQ service routine of the network card driver, and it goes about its business
Let's say the IRQ service routine completes in 200 μs - that means now we're (in "absolute" terms) at 40000.2 ms - however, we are also still at 10000 jiffies
If the kernel now switched the context back to the timer function, it would have completed - without me necessarily noticing the delay;
... however, that will not happen, because the timer function is scheduled for the next jiffy!
So the kernel goes about its business (possibly sleeping) for approximately the next 3.8 ms
jiffies changes to 10001 (== 40004 ms @ 250 Hz)
(the previously rescheduled) timer function runs - and this time completes without interruption
I haven't really done a detailed analysis to see if the sequence of events is exactly as described above, but I'm quite persuaded that it is something close - in other words, a resolution problem - especially since the high-resolution timer approach does not seem to show this behavior. It would indeed be great to obtain a scheduler log and know exactly what happened to cause a pre-empt - but I doubt the roundtrip to userspace, which I attempted in the OP edit in response to @granquet's comment, is the right thing to do.
In any case, going back to this:
Note that I'm not looking for a period resolution below what corresponds to a jiffy (in this case, 4ms); nor am I looking to decrease the delta variance when the code works properly. So as I see it, I don't have "high resolution timer" demands, nor "hard real-time" demands ...
... here was a bad mistake I made - as the analysis above shows, I did have "high resolution" demands! And had I realized that earlier, I may have found relevant reading sooner. Anyways, some relevant docs - even if they don't discuss specifically periodic functions - for me, were:
LDD3: 5.3. Semaphores and Mutexes - (in describing a driver with different demands from here): "no accesses will be made from interrupt handlers or other asynchronous contexts. There are no particular latency (response time) requirements; application programmers understand that I/O requests are not usually satisfied immediately"
Documentation/timers/hrtimers.txt - "The timers.c code is very "tightly coded" around jiffies and 32-bitness assumptions, and has been honed and micro-optimized for a relatively narrow use case (jiffies in a relatively narrow HZ range) for many years - and thus even small extensions to it easily break the wheel concept"
T. Gleixner, D. Niehaus Hrtimers and Beyond: Transforming the Linux Time Subsystems (pdf) - (most detailed, see also diagrams inside) "The Cascading Timer Wheel (CTW), which was implemented in 1997, replaced the original time ordered double linked list to resolve the scalability problem of the linked list's O(N) insertion time... The current approach to timer management in Linux does a good job of satisfying an extremely wide range of requirements, but it cannot provide the quality of service required in some cases precisely because it must satisfy such a wide range of requirements... The timeout related timers are kept in the existing timer wheel and a new subsystem optimized for (high resolution) timer requirements hrtimers was implemented. hrtimers are entirely based on human time (units: nanoseconds)... They are kept in a time sorted, per-CPU list, implemented as a red-black tree."
The high-resolution timer API [LWN.net] - "At its core, the hrtimer mechanism remains the same. Rather than using the "timer wheel" data structure, hrtimers live on a time-sorted linked list, with the next timer to expire being at the head of the list. A separate red/black tree is also used to enable the insertion and removal of timer events without scanning through the list. But while the core remains the same, just about everything else has changed, at least superficially."
Software interrupts and realtime [LWN.net] - "The softirq mechanism is meant to handle processing that is almost — but not quite — as important as the handling of hardware interrupts. Softirqs run at a high priority (though with an interesting exception, described below), but with hardware interrupts enabled. They thus will normally preempt any work except the response to a "real" hardware interrupt... Starting with the 3.0 realtime patch set, though, that capability went away... In response, in 3.6.1-rt1, the handling of softirqs has changed again."
High- (but not too high-) resolution timeouts [LWN.net] - "poll() and epoll_wait() take an integer number of milliseconds; select() takes a struct timeval with microsecond resolution, and ppoll() and pselect() take a struct timespec with nanosecond resolution. They are all the same, though, in that they convert this timeout value to jiffies, with a maximum resolution between one and ten milliseconds. A programmer might program a pselect() call with a 10 nanosecond timeout, but the call may not return until 10 milliseconds later, even in the absence of contention for the CPU. ... It's a useful feature, but it comes at the cost of some significant API changes."
One thing clear from the quotes is that high-resolution timing facilities are still under active development (with API changes) in the kernel - and I was afraid that maybe I'd have to install a special "real-time patch" kernel. Thankfully, high-resolution timers are seemingly available (and working) in my 2.6.38-16 SMP kernel without any special changes. Below is the listing of the modified testjiffy.c kernel module, which now uses high-resolution timers but otherwise keeps the same period as determined by jiffies. For testing, I made it loop 200 times (instead of 10 in the OP); running the rerun.sh script some 20-30 times, this is the worst result I got:
The time sequence is now obviously unreadable, but the histogram can still tell us this: taking 0.00435-0.004 (= 0.004-0.00365) = 350 μs for the max deviation, it represents only 100*(350/4000) = 8.75% of the expected period; which I certainly don't have a problem with. Additionally, I never got a drop (or correspondingly, an entire 2*period = 8 ms delay), or a 0 ms delay - the captures I got, are otherwise of the quality shown on the first image in OP. Now, of course I could run a longer test and see more precisely how reliable it is - but this is all the reliability I'd expect/need to see for this simple case; contrast that to the OP, where I'd get a drop in just 10 loops, with the probability of tossing a coin - every second or third run of the rerun.sh script, I'd get a drop - even in context of low OS resource usage!
Finally, note that the source below should have the problem spotted by @CL. ("Your module is buggy: you must ensure that the timer is not pending before the module is unloaded") fixed, in the context of hrtimer. This seemingly answers my bonus question, as it obviates the need for either of the "MUSTHAVE" sleeps in the rerun.sh script. However, note that as 200 loops @ 4 ms take 0.8 s, the sleep between insmod and rmmod is still needed if we want a full 200-tick capture (otherwise, on my machine, I get only some 7 ticks captured).
Well, hope I got this right now (at least most if it) - if not, corrections are welcome :)
testjiffy(-hr).c
#include <linux/module.h> /* Needed by all modules */
#include <linux/kernel.h> /* Needed for KERN_INFO */
#include <linux/init.h> /* Needed for the macros */
#include <linux/jiffies.h>
#include <linux/time.h>
#define MAXRUNS 200
#include <linux/hrtimer.h>
static volatile int runcount = 0;
//~ static struct timer_list my_timer;
static unsigned long period_ms;
static unsigned long period_ns;
static ktime_t ktime_period_ns;
static struct hrtimer my_hrtimer;
//~ static void testjiffy_timer_function(unsigned long data)
static enum hrtimer_restart testjiffy_timer_function(struct hrtimer *timer)
{
int tdelay = 100;
unsigned long tjnow;
ktime_t kt_now;
int ret_overrun;
runcount++;
if (runcount == 5) {
while (tdelay > 0) { tdelay--; } // small delay
}
printk(KERN_INFO
" %s: runcount %d \n",
__func__, runcount);
if (runcount < MAXRUNS) {
tjnow = jiffies;
kt_now = hrtimer_cb_get_time(&my_hrtimer);
ret_overrun = hrtimer_forward(&my_hrtimer, kt_now, ktime_period_ns);
printk(KERN_INFO
" testjiffy jiffies %lu ; ret: %d ; ktnsec: %lld \n",
tjnow, ret_overrun, ktime_to_ns(kt_now));
return HRTIMER_RESTART;
}
else return HRTIMER_NORESTART;
}
static int __init testjiffy_init(void)
{
struct timespec tp_hr_res;
period_ms = 1000/HZ;
hrtimer_get_res(CLOCK_MONOTONIC, &tp_hr_res);
printk(KERN_INFO
"Init testjiffy: %d ; HZ: %d ; 1/HZ (ms): %ld ; hrres: %lld.%.9ld\n",
runcount, HZ, period_ms, (long long)tp_hr_res.tv_sec, tp_hr_res.tv_nsec );
hrtimer_init(&my_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
my_hrtimer.function = &testjiffy_timer_function;
period_ns = period_ms*( (unsigned long)1E6L );
ktime_period_ns = ktime_set(0,period_ns);
hrtimer_start(&my_hrtimer, ktime_period_ns, HRTIMER_MODE_REL);
return 0;
}
static void __exit testjiffy_exit(void)
{
int ret_cancel = 0;
while( hrtimer_callback_running(&my_hrtimer) ) {
ret_cancel++;
}
if (ret_cancel != 0) {
printk(KERN_INFO " testjiffy Waited for hrtimer callback to finish (%d)\n", ret_cancel);
}
if (hrtimer_active(&my_hrtimer) != 0) {
ret_cancel = hrtimer_cancel(&my_hrtimer);
printk(KERN_INFO " testjiffy active hrtimer cancelled: %d (%d)\n", ret_cancel, runcount);
}
if (hrtimer_is_queued(&my_hrtimer) != 0) {
ret_cancel = hrtimer_cancel(&my_hrtimer);
printk(KERN_INFO " testjiffy queued hrtimer cancelled: %d (%d)\n", ret_cancel, runcount);
}
printk(KERN_INFO "Exit testjiffy\n");
}
module_init(testjiffy_init);
module_exit(testjiffy_exit);
MODULE_LICENSE("GPL");
It is indeed possible that your timer function gets called at a later jiffy than what expires was set to.
Possible reasons are scheduling delays, other drivers that disable interrupts for too long (graphics and WLAN drivers are usual culprits), or some crappy BIOS executing SMI code.
If you want to avoid that one late timer function shifts all following timer calls, you have to schedule the respective next timer not relative to the current time (jiffies), but relative to the scheduled time of the last timer (my_timer.expires).
Alternatively, use ten timers that you all schedule at the beginning at jiffies + 1, 2, 3, …
The timestamp in the log is the time when that string was printed to the log.
Your module is buggy: you must ensure that the timer is not pending before the module is unloaded.
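For the classic timer_list API, a minimal sketch of that cleanup (my illustration, not CL.'s code) is to synchronously cancel the timer in the module exit function:
#include <linux/timer.h>

static void __exit testjiffy_exit(void)
{
    /* Sketch: make sure the timer is neither pending nor running before the
       module text goes away; del_timer_sync() waits for a callback that is
       currently executing on another CPU to finish. If the callback can keep
       re-arming itself indefinitely, pair this with a stop flag that the
       callback checks before calling add_timer()/mod_timer() again. */
    del_timer_sync(&my_timer);
    printk(KERN_INFO "Exit testjiffy\n");
}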

Performance decrease with threaded implementation

I implemented a small program in C to calculate PI using a Monte Carlo method (mainly because of personal interest and training). After having implemented the basic code structure, I added a command-line option allowing to execute the calculations threaded.
I expected major speed ups, but I got disappointed. The command-line synopsis should be clear. The final number of iterations made to approximate PI is the product of the number of -iterations and -threads passed via the command-line. Leaving -threads blank defaults it to 1 thread resulting in execution in the main thread.
The tests below are tested with 80 Million iterations in total.
On Windows 7 64Bit (Intel Core2Duo Machine):
Compiled using Cygwin GCC 4.5.3: gcc-4 pi.c -o pi.exe -O3
On Ubuntu/Linaro 12.04 (8Core AMD):
Compiled using GCC 4.6.3: gcc pi.c -lm -lpthread -O3 -o pi
Performance
On Windows, the threaded version is a few milliseconds faster than the un-threaded one. I expected better performance, to be honest. On Linux, ew! What the heck? Why does it take even 2000% longer? Of course this depends much on the implementation, so here it goes. An excerpt from after the command-line argument parsing is done and the calculation is started:
// Begin computation.
clock_t t_start, t_delta;
double pi = 0;
if (args.threads == 1) {
t_start = clock();
pi = pi_mc(args.iterations);
t_delta = clock() - t_start;
}
else {
pthread_t* threads = malloc(sizeof(pthread_t) * args.threads);
if (!threads) {
return alloc_failed();
}
struct PIThreadData* values = malloc(sizeof(struct PIThreadData) * args.threads);
if (!values) {
free(threads);
return alloc_failed();
}
t_start = clock();
for (i=0; i < args.threads; i++) {
values[i].iterations = args.iterations;
values[i].out = 0.0;
pthread_create(threads + i, NULL, pi_mc_threaded, values + i);
}
for (i=0; i < args.threads; i++) {
pthread_join(threads[i], NULL);
pi += values[i].out;
}
t_delta = clock() - t_start;
free(threads);
threads = NULL;
free(values);
values = NULL;
pi /= (double) args.threads;
}
While pi_mc_threaded() is implemented as:
struct PIThreadData {
int iterations;
double out;
};
void* pi_mc_threaded(void* ptr) {
struct PIThreadData* data = ptr;
data->out = pi_mc(data->iterations);
return NULL; /* result is passed back via data->out; the pthread return value is unused */
}
You can find the full source code at http://pastebin.com/jptBTgwr.
Question
Why is this? Why this extreme difference on Linux? I expected the amount of time taken to calculate to be at least 3/4 of the original time. It would of course be possible that I simply made wrong use of the pthread library. A clarification on how to do it correctly in this case would be very nice.
The problem is that in glibc's implementation, rand() calls __random(), and that
long int
__random ()
{
int32_t retval;
__libc_lock_lock (lock);
(void) __random_r (&unsafe_state, &retval);
__libc_lock_unlock (lock);
return retval;
}
locks around each call to the function __random_r that does the actual work.
Thus, as soon as you have more than one thread using rand(), you make each thread wait for the other(s) on almost every call to rand(). Directly using random_r() with your own buffers in each thread should be much faster.
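A minimal per-thread sketch along those lines might look like the following. Only random_r()/initstate_r() are real glibc calls; the thread_rng wrapper names are my own, and each thread would own one thread_rng seeded differently (e.g. from its index):
#define _GNU_SOURCE            /* exposes random_r()/initstate_r() in glibc */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Per-thread RNG state: a glibc random_data plus its state buffer. */
struct thread_rng {
    struct random_data data;
    char state[64];
};

static void thread_rng_init(struct thread_rng *rng, unsigned int seed)
{
    memset(rng, 0, sizeof *rng);   /* struct random_data must start zeroed */
    initstate_r(seed, rng->state, sizeof rng->state, &rng->data);
}

static double rand_01_r(struct thread_rng *rng)
{
    int32_t r;
    random_r(&rng->data, &r);      /* thread-private state, no global lock */
    return (double)r / (double)RAND_MAX;
}
Each worker thread would then call rand_01_r(&its_own_rng) instead of the globally locked rand().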
Performance and threading is a black art. The answer depends on the specifics of the compiler and libraries used to do threading, how well the kernel handles it, etc. Basically, if your libraries for *nix are not efficient in switching, moving objects around, etc., threading will in fact be slower. This is one of the reasons a lot of us doing thread work now work with the JVM or JVM-like languages. We can trust the runtime JVM's behavior - its overall speed may vary with platform, but it's consistent on that platform. In addition, you may have some hidden wait/race conditions that you uncovered just due to timing that may not show up on Windows.
If you are in a position to change your language, consider Scala or D. Scala is the actor driven model successor to Java, and D, the successor to C. Both languages show their roots -- if you can write in C, D should be no problem. Both languages however, implement the actor model. NO MORE THREAD POOLS, NO MORE RACE CONDITIONS ETC!!!!!!
For comparison, I just tried your app on Windows Vista, compiled with Borland C++, and the 2 thread version performed nearly twice as fast as the single thread.
pi.exe -iterations 20000000 -stats -threads 1
3.141167
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 12.511000 sec
Threads: Main
pi.exe -iterations 10000000 -stats -threads 2
3.142397
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 6.584000 sec
Threads: 2
That's compiled against the thread-safe run-time library. Using the single thread library, both versions run at twice their thread-safe speed.
pi.exe -iterations 20000000 -stats -threads 1
3.141167
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 6.458000 sec
Threads: Main
pi.exe -iterations 10000000 -stats -threads 2
3.141314
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 3.978000 sec
Threads: 2
So the 2 thread version is still twice as fast, but the 1 thread version with the single thread library is actually faster than the 2 thread version on the thread-safe library.
Looking at Borland's rand implementation, they use thread local storage for the seed in the thread-safe implementation, so it's not going to have the same negative impact on threaded code as glibc's lock, but the thread-safe implementation will obviously be slower than the single thread implementation.
The bottom line though, is that your compiler's rand implementation is probably the main performance issue in both cases.
Update
I've just tried replacing your rand_01 calls with inline implementations of Borland's rand function using a local variable for the seed, and the results are consistently twice as fast in the 2 thread case.
The updated code looks like this:
#define MULTIPLIER 0x015a4e35L
#define INCREMENT 1
double pi_mc(int iterations) {
unsigned seed = 1;
long long inner = 0;
long long outer = 0;
int i;
for (i=0; i < iterations; i++) {
seed = MULTIPLIER * seed + INCREMENT;
double x = ((int)(seed >> 16) & 0x7fff) / 32767.0; /* 0x7fff is Borland's RAND_MAX; don't divide by glibc's much larger RAND_MAX */
seed = MULTIPLIER * seed + INCREMENT;
double y = ((int)(seed >> 16) & 0x7fff) / 32767.0;
double d = sqrt(pow(x, 2.0) + pow(y, 2.0));
if (d <= 1.0) {
inner++;
}
else {
outer++;
}
}
return ((double) inner / (double) iterations) * 4;
}
I don't know how good that is as rand implementations go, but it's worth at least trying on Linux to see whether it makes a difference to the performance.

CPU contention (wait time) for a process in Linux

How can I check how long a process spends waiting for the CPU in a Linux box?
For example, in a loaded system I want to check how long a SQL*Loader (sqlldr) process waits.
It would be useful if there is a command line tool to do this.
I've quickly slapped this together. It prints out the smallest and largest "interferences" from task switching...
#include <sys/time.h>
#include <stdio.h>
double seconds()
{
timeval t;
gettimeofday(&t, NULL);
return t.tv_sec + t.tv_usec / 1000000.0;
}
int main()
{
double min = 999999999, max = 0;
while (true)
{
double c = -(seconds() - seconds());
if (c < min)
{
min = c;
printf("%f\n", c);
fflush(stdout);
}
if (c > max)
{
max = c;
printf("%f\n", c);
fflush(stdout);
}
}
return 0;
}
Here's how you should go about measuring it. Have a number of processes - more than your number of processors * cores * hardware threads - wait (block) on an event that will wake them all up at the same time. One such event is a multicast network packet. Use an instrumentation library like PAPI (or one more suited to your needs) to measure the differences in real and virtual "wakeup" time between your processes. From several iterations of the experiment you can get an estimate of the CPU contention time for your processes. Obviously, it's not going to be at all accurate for multicore processors, but maybe it'll help you.
Cheers.
I had this problem some time back. I ended up using getrusage :
You can get detailed help at :
http://www.opengroup.org/onlinepubs/009695399/functions/getrusage.html
getrusage populates the rusage struct.
Measuring Wait Time with getrusage
You can call getrusage at the beginning of your code and then call it again at the end, or at some appropriate point during execution. You then have initial_rusage and final_rusage. The user time spent by your process is indicated by rusage->ru_utime and the system time by rusage->ru_stime (both are struct timeval values; the formulas below use only the tv_sec fields, so include tv_usec as well if you need sub-second precision).
Thus the total user-time spent by the process will be:
user_time = final_rusage.ru_utime.tv_sec - initial_rusage.ru_utime.tv_sec
The total system-time spent by the process will be:
system_time = final_rusage.ru_stime.tv_sec - initial_rusage.ru_stime.tv_sec
If total_time is the time elapsed between the two calls of getrusage then the wait time will be
wait_time = total_time - (user_time + system_time)
Hope this helps
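For completeness, here is a minimal self-contained sketch of that recipe (my illustration; it also uses the tv_usec fields for sub-second precision, and gettimeofday() supplies the wall-clock total_time):
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Sketch: estimate time spent neither in user nor system mode (a rough
   proxy for time spent waiting for the CPU) across a workload. */
static double tv_to_sec(struct timeval tv)
{
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    struct rusage start, end;
    struct timeval wall_start, wall_end;

    gettimeofday(&wall_start, NULL);
    getrusage(RUSAGE_SELF, &start);

    /* ... workload under test goes here ... */

    getrusage(RUSAGE_SELF, &end);
    gettimeofday(&wall_end, NULL);

    double user = tv_to_sec(end.ru_utime) - tv_to_sec(start.ru_utime);
    double sys  = tv_to_sec(end.ru_stime) - tv_to_sec(start.ru_stime);
    double wall = tv_to_sec(wall_end)     - tv_to_sec(wall_start);

    printf("user %.6f  sys %.6f  wait %.6f (sec)\n", user, sys, wall - (user + sys));
    return 0;
}
Note that the "wait" figure lumps together run-queue wait, page faults, and any voluntary sleeping the process does, so interpret it accordingly.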
