Unreliable QueryPerformanceCounter on a KVM Windows Server 2008 R2 guest - linux

I am having some trouble running a KVM Windows Server 2008 R2 x64 guest on an Ubuntu 12.04 x64 host. Specifically, the Win32 call QueryPerformanceCounter seems to periodically produce unreliable results when compared to the system clock. I am running a loop similar to this:
auto zero = tbb::tick_count::now ();
while (true) {
    std::cout << datetime::now ()
              << " delta: " << (tbb::tick_count::now () - zero).seconds ()
              << std::endl;
    zero = tbb::tick_count::now ();
    Sleep (1000);
}
Above, tbb::tick_count is a thin wrapper over QueryPerformanceCounter and datetime::now() uses the system clock. Periodically, say at least once every 3 minutes, the delta comes out at about 42 seconds instead of the expected ~1 second. The system clock is always pretty accurate.
Any ideas on what could be causing this?

From Game Timing and Multicore Processors:
[...] While QueryPerformanceCounter and QueryPerformanceFrequency
typically adjust for multiple processors, bugs in the BIOS or drivers
may result in these routines returning different values as the thread
moves from one processor to another. So, it's best to keep the thread
on a single processor. [...]
The system clock uses a different mechanism to return time. It's generally more reliable at the cost of resolution.
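If you suspect cross-core inconsistencies, a minimal sketch of the single-processor approach (plain Win32 C++ run inside the guest; SetThreadAffinityMask, QueryPerformanceFrequency and Sleep are standard Win32 calls, the rest is illustrative) could look like this:
#include <windows.h>
#include <iostream>

int main()
{
    // Pin the timing thread to CPU 0 so QueryPerformanceCounter is always
    // read on the same core, as the article above suggests.
    SetThreadAffinityMask(GetCurrentThread(), 1);

    LARGE_INTEGER freq, prev, now;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&prev);

    for (;;) {
        Sleep(1000);
        QueryPerformanceCounter(&now);
        double delta = double(now.QuadPart - prev.QuadPart) / double(freq.QuadPart);
        std::cout << "delta: " << delta << " s" << std::endl; // expect roughly 1 s
        prev = now;
    }
}
If the 42-second jumps disappear with the thread pinned, the affinity explanation fits; if they persist, the virtualized time source itself is the more likely suspect.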

Related

Dramatic timing differences between code compiled for linux 3.2.x vs 2.6.x based systems

I have an application that was written, tested, and debugged for a small Linux distribution with a 2.6.x kernel. I recently attempted to migrate the project to a Debian-based distribution with a 3.2.x kernel, and we noticed huge performance decreases. I've done some primitive benchmarking and found differences in usleep() timing, in function call and loop timing, and so on.
I'm not sure what the exact 2.6.x kernel configuration is (e.g. preemption model), and I haven't been able to extract the kernel build configuration - we just have this system as an image that we've been using for our embedded applications. For the 3.2.x kernel I built a configuration with optimizations for our processor, with a "preemptible kernel" configuration, and removed a bunch of clearly unneeded optional modules (things like ham radio device drivers - stuff that stuck out as obviously safe to remove).
Our system is a near-real-time application without hard real-time requirements: we just have to keep certain buffers populated with computed data before it gets consumed. Consumption happens at a fixed rate controlled by hardware, and in practice our CPU load stays around 30% for the most demanding applications, i.e. we have enough performance to keep the buffers populated and spend a fair bit of time waiting for space. We use pthreads and pthread_cond_wait/broadcast to signal buffer states, control thread synchronization, etc.
First, some preamble about the system. There are many polling threads with the pattern:
while (threadRunning)
{
    CheckSomeStuff();
    usleep(polling_interval);
}
And other threads with patterns like:
while (threadRunning)
{
    pthread_cond_wait(stuff_needed_condition, some_mutex); // wait on signal
    doSomeStuffWhenNeeded();
}
That said, we are noticing subtle timing-related issues in the ported application, and algorithms are running a lot "slower" than on the 2.6.x kernel based system.
This simple benchmark is illustrative:
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

static volatile long g_foo;

static void setfoo(long foo)
{
    g_foo = foo;
}

static void printElapsed(struct timeval t1, struct timeval t2, const char* smsg)
{
    double time_elapsed;
    time_elapsed = (t2.tv_sec - t1.tv_sec)*1e6 + (t2.tv_usec - t1.tv_usec);
    printf("%s elapsed: %.8f\n", smsg, time_elapsed);
}

static void benchmarks(long sleeptime)
{
    long i;
    struct timeval t1, t2;

    // test 1
    gettimeofday(&t1, NULL);
    for (i = 0; i < sleeptime; i++)
    {
        usleep(1);
    }
    gettimeofday(&t2, NULL);
    printElapsed(t1, t2, "Loop usleep(1)");

    // test 2
    gettimeofday(&t1, NULL);
    usleep(sleeptime);
    gettimeofday(&t2, NULL);
    printElapsed(t1, t2, "Single sleep");

    // test 3
    gettimeofday(&t1, NULL);
    usleep(1);
    gettimeofday(&t2, NULL);
    printElapsed(t1, t2, "Single 1us sleep");

    // test 4
    gettimeofday(&t1, NULL);
    gettimeofday(&t2, NULL);
    printElapsed(t1, t2, "gettimeofday x 2");

    // test 5
    gettimeofday(&t1, NULL);
    for (i = 0; i < sleeptime; i++)   // originally "i < n"; n is undefined, sleeptime (100,000) is intended
    {
        setfoo(i);
    }
    gettimeofday(&t2, NULL);
    printElapsed(t1, t2, "loop function call");
}
Here are the benchmark results (yes, printing that many decimal places is silly, I know):
Kernel 2.6.x trial 1:
Loop usleep(1) elapsed: 6063979.00000000
Single sleep elapsed: 100071.00000000
Single 1us sleep elapsed: 63.00000000
gettimeofday x 2 elapsed: 1.00000000
loop function call elapsed: 267.00000000
Kernel 2.6.x trial 2:
Loop usleep(1) elapsed: 6059328.00000000
Single sleep elapsed: 100070.00000000
Single 1us sleep elapsed: 63.00000000
gettimeofday x 2 elapsed: 0.00000000
loop function call elapsed: 265.00000000
Kernel 2.6.x trial 3:
Loop usleep(1) elapsed: 6063762.00000000
Single sleep elapsed: 100064.00000000
Single 1us sleep elapsed: 63.00000000
gettimeofday x 2 elapsed: 1.00000000
loop function call elapsed: 266.00000000
kernel 3.2.65 trial 1:
Loop usleep(1) elapsed: 8944631.00000000
Single sleep elapsed: 100106.00000000
Single 1us sleep elapsed: 96.00000000
gettimeofday x 2 elapsed: 2.00000000
loop function call elapsed: 491.00000000
kernel 3.2.65 trial 2:
Loop usleep(1) elapsed: 8891191.00000000
Single sleep elapsed: 100102.00000000
Single 1us sleep elapsed: 94.00000000
gettimeofday x 2 elapsed: 2.00000000
loop function call elapsed: 396.00000000
kernel 3.2.65 trial 3:
Loop usleep(1) elapsed: 8962089.00000000
Single sleep elapsed: 100171.00000000
Single 1us sleep elapsed: 123.00000000
gettimeofday x 2 elapsed: 2.00000000
loop function call elapsed: 407.00000000
There is a huge difference in wall time between builds on Linux with kernel 2.6.x and kernel 3.2.x for 100,000 iterations of a loop that calls usleep(1) (about 9 seconds on 3.2.x vs about 6 seconds on 2.6.x). For the record, I don't think we call "usleep(1);" anywhere in the code base (but as with any huge application, worse things probably exist here and there); nevertheless this is a big difference in behaviour. There are also big differences in the loop that sets a static global variable 100,000 times (about 400 microseconds on 3.2 vs 260 microseconds on 2.6).
I realize there are multiple confounding factors, from glibc to the compiler and its settings to the Linux kernel configuration. What I'm hoping to get from Stack Overflow is some guidance on where to start poking. What would you do if you had to accomplish this migration? What factors would you look at to fix the performance issues we're seeing?
For further info, the two distributions are:
Puppy Linux
- kernel 2.6.35-7 SMP unknown kernel configuration (PREEMPT though, I'm pretty sure)
- glibc 2.6.1
- gcc 4.6.3
Debian wheezy 7.7 (stripped down)
- Linux 3.2.65 custom config from kernel sources
- gcc 4.7.2
- glibc 2.13 (Debian EGLIBC 2.13-38+deb7u6)
You've updated the kernel, compiler and glibc versions.
Don't do that. Update one thing at a time, and measure the effects. Then you'll know which of the updates is actually causing your problems (it's likely not the kernel).
You should be able to update just the kernel, then just the glibc on the old system.

What is the best way to test the performance of a program in Linux?

Suppose I write a program, then make some "optimization" to the code.
In my case I want to test how much the new C++11 feature std::move can improve the performance of a program.
I would like to check whether the "optimization" actually makes sense.
Currently I test it with the following steps:
write the program (without std::move), compile, get binary file m1
optimize it (using std::move), compile, get binary file m2
use the "time" command to compare the time consumed:
time ./m1 ; time ./m2
EDITED:
In order to get a statistically meaningful result, the test needs to be run thousands of times.
Is there a better way to do that, or are there tools that can help with it?
In general, measuring performance with a simple time comparison, e.g. endTime - beginTime, is always a good start for a rough estimate.
Later on you can use a profiler, like Valgrind, to measure how different parts of your program are performing.
With profiling you can measure the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls.
There's also AMD CodeAnalyst if you want more advanced profiling functionality with a GUI. It's free/open source.
There are several tools (among others) that can do profiling for you:
GNU gprof
Google gperftools
Intel VTune Amplifier (part of the Intel XE compiler package)
kernel perf
AMD CodeXL (successor of AMD CodeAnalyst)
Valgrind
Some of them require a specific way of compiling or a specific compiler. Some of them are particularly good at profiling for a given processor architecture (AMD/Intel...).
Since you seem to have access to C++11, and if you just want to measure some timings, you can use std::chrono.
#include <chrono>
#include <iostream>

class high_resolution_timer
{
private:
    typedef std::chrono::high_resolution_clock clock;
    clock::time_point m_time_point;
public:
    high_resolution_timer (void)
        : m_time_point(clock::now()) { }
    void restart (void)
    {
        m_time_point = clock::now();
    }
    template<class Duration>
    Duration stopover (void)
    {
        return std::chrono::duration_cast<Duration>
            (clock::now() - m_time_point);
    }
};

int main (void)
{
    using std::chrono::microseconds;
    high_resolution_timer timer;
    // do stuff here
    microseconds first_result = timer.stopover<microseconds>();
    timer.restart();
    // do other stuff here
    microseconds second_result = timer.stopover<microseconds>();
    std::cout << "First took " << first_result.count() << " x 10^-6;";
    std::cout << " second took " << second_result.count() << " x 10^-6.";
    std::cout << std::endl;
}
But you should be aware that there's usually little point in shaving a few milliseconds off the overall runtime (if your program runs for >= 1 s). Instead, time highly repetitive events in your code (if there are any; at least those that are the bottlenecks). If those improve significantly (and the improvement can be in milli- or microseconds), your overall performance will likely increase too.
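As a concrete illustration of timing a repetitive event, here is a sketch (reusing the high_resolution_timer class above; the std::string/push_back workload is only a placeholder for your real bottleneck) that compares a copying loop with a moving loop and reports an average per operation:
#include <chrono>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

int main()
{
    using std::chrono::microseconds;
    const int iterations = 100000;

    high_resolution_timer timer; // class from the snippet above

    std::vector<std::string> v1;
    for (int i = 0; i < iterations; ++i) {
        std::string s(64, 'x');
        v1.push_back(s);             // copies s into the vector
    }
    microseconds copy_time = timer.stopover<microseconds>();

    timer.restart();
    std::vector<std::string> v2;
    for (int i = 0; i < iterations; ++i) {
        std::string s(64, 'x');
        v2.push_back(std::move(s));  // moves s into the vector
    }
    microseconds move_time = timer.stopover<microseconds>();

    std::cout << "copy: " << copy_time.count() / double(iterations) << " us/op, "
              << "move: " << move_time.count() / double(iterations) << " us/op\n";
}
Averaging over many iterations like this (and repeating the whole run a few times) smooths out scheduler noise much better than a single time ./m1 ; time ./m2 comparison.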

Linux Kernel: udelay() returns too early?

I have a driver which requires microsecond delays. To create this delay, my driver is using the kernel's udelay function. Specifically, there is one call to udelay(90):
iowrite32(data, addr + DATA_OFFSET);
iowrite32(trig, addr + CONTROL_OFFSET);
udelay(30);
trig |= 1;
iowrite32(trig, addr + CONTROL_OFFSET);
udelay(90); // This is the problematic call
We had reliability issues with the device. After a lot of debugging, we traced the problem to the driver resuming before 90 µs had passed. (See "proof" below.)
I am running kernel version 2.6.38-11-generic SMP (Kubuntu 11.04, x86_64) on an Intel Pentium Dual Core (E5700).
As far as I know, the documentation states that udelay will delay execution for at least the specified delay and is uninterruptible. Is there a bug in this version of the kernel, or did I misunderstand something about the use of udelay?
To convince ourselves that the problem was caused by udelay returning too early, we fed a 100kHz clock to one of the I/O ports and implemented our own delay as follows:
// Wait until n falling edges are observed
void clk100_delay(void *addr, u32 n) {
    int i;
    for (i = 0; i < n; i++) {
        u32 prev_clk = ioread32(addr);
        while (1) {
            u32 clk = ioread32(addr);
            if (prev_clk && !clk) {
                break;
            } else {
                prev_clk = clk;
            }
        }
    }
}
...and the driver now works flawlessly.
As a final note, I found a discussion indicating that frequency scaling could cause the *delay() family of functions to misbehave, but that was on an ARM mailing list - I assumed such problems would be non-existent on an x86-based Linux PC.
I don't know of any bug in that version of the kernel (but that doesn't mean that there isn't one).
udelay() isn't "uninterruptible" - it does not disable preemption, so your task can be preempted by a RT task during the delay. However the same is true of your alternate delay implementation, so that is unlikely to be the problem.
Could your actual problem be a DMA coherency / memory ordering issue? Your alternate delay implementation accesses the bus, so this might be hiding the real problem as a side-effect.
The E5700 has X86_FEATURE_CONSTANT_TSC but not X86_FEATURE_NONSTOP_TSC. The TSC is the likely clock source for udelay(). Unless the task is bound to one of the cores with an affinity mask, it may have been preempted and rescheduled to another CPU during the udelay. The TSC might also not be stable during low-power CPU states.
Can you try disabling interrupts or disabling preemption during the udelay? Also, try reading the TSC before and after.
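A minimal instrumentation sketch along those lines (kernel C, dropped into the driver from the question; get_cycles(), local_irq_save() and printk() are standard kernel APIs, and the result is in TSC ticks rather than microseconds):
#include <linux/delay.h>
#include <linux/irqflags.h>
#include <linux/kernel.h>
#include <linux/timex.h>

static void timed_udelay_90(void)
{
    unsigned long flags;
    cycles_t before, after;

    local_irq_save(flags);   /* keep interrupts out of the measured window */
    before = get_cycles();
    udelay(90);
    after = get_cycles();
    local_irq_restore(flags);

    /* On a 3 GHz TSC, 90 us should span roughly 270,000 cycles. */
    printk(KERN_INFO "udelay(90) spanned %llu TSC cycles\n",
           (unsigned long long)(after - before));
}
If the printed span is consistently shorter than expected, that points at the udelay calibration or clock source; if it looks correct while the device still misbehaves, the memory-ordering theory above becomes more plausible.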

How to calculate CPU cycles like QueryPerformanceCounter does

I have the following Win32 code to calculate CPU cycles using QueryPerformanceCounter():
LARGE_INTEGER ltime;
UINT32 cycles;
QueryPerformanceCounter(&ltime);
cycles = (UINT32) ((ltime.QuadPart >> 8) & 0xFFFFFFF);
How do I implement the same on an ARM Cortex-A9 (PandaBoard, OMAP4) running Ubuntu?
Your best bet would probably be to use clock_gettime with either CLOCK_PROCESS_CPUTIME_ID or CLOCK_THREAD_CPUTIME_ID. (see clock_gettime)
That will give you "High-resolution per-process timer from the CPU" and "Thread-specific CPU-time clock", respectively.
Alternatively, you could sum up the values returned by times(), but I guess that would be less precise, as it also depends on the scheduler, whereas the above supposedly reads a performance counter from the CPU when possible.
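A minimal sketch of the clock_gettime() approach (plain C; the busy loop is just a stand-in for the code you want to measure, and older glibc versions need -lrt at link time):
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;
    volatile long sink = 0;
    long i;

    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);

    for (i = 0; i < 10000000; i++)   /* code to be measured */
        sink += i;

    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);

    long long ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                 + (end.tv_nsec - start.tv_nsec);
    printf("thread CPU time: %lld ns\n", ns);
    return 0;
}
Unlike the QueryPerformanceCounter snippet above, this counts only the CPU time consumed by the calling thread, which is usually what you want when comparing code paths.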

How to make ARM9 custom device emulator?

I am working with a 266 MHz ARM9 processor with FPU support and 32 MB RAM, running Linux. I want to emulate it on a PC (I have both Linux and Windows available). I want to profile my cycle counts and run my cross-compiled executables directly in the emulator. Is there an open-source project that makes it easy to build such an emulator, and how much code/effort would I need to add to make a custom emulator from it? It would be great if you could point me to tutorials or other references to get a kick-start.
Thanks & Regards,
Sunny.
Do you want to emulate just the processor or an entire machine?
Emulating a CPU is fairly easy: just define a structure containing all the CPU registers, create an array to simulate RAM, and then emulate like this:
cpu_ticks = 0;  // counter for cpu cycles
while (true) {
    opcode = RAM[CPU.PC++];  // Fetch opcode and increment program counter
    switch (opcode) {
    case 0x12:  // invented opcode for "MOV A,B"
        CPU.A = CPU.B;
        cpu_ticks += 4;  // imagine you need 4 ticks for this operation
        set_cpu_flags_mov();
        break;
    case 0x23:  // invented opcode for "ADD A, #"
        CPU.A += RAM[CPU.PC++];  // get operand from memory
        cpu_ticks += 8;
        set_cpu_flags_add();
        break;
    case 0x45:  // invented opcode for "JP Z, #"
        if (CPU.FLAGS.Z) CPU.PC = RAM[CPU.PC];  // jump to the target stored at PC
        else CPU.PC++;                          // skip the operand and continue
        cpu_ticks += 12;
        set_cpu_flags_jump();
        break;
    ...
    }
    handle_interrupts();
}
Emulating an entire machine is much, much harder... you need to emulate LCD controllers, memory-mapped registers, memory bank controllers, DMA, input devices, sound, I/O... you also probably need a dump of the BIOS and operating system... I don't know the ARM processor well, but if it has pipelines, caches and the like, timing gets more complicated.
If you have all the hardware parts fully documented, there's no problem, but if you need to reverse-engineer or guess how the emulated machine works, you will have a hard time.
Start here: http://infocenter.arm.com/help/index.jsp and download the "Technical Reference Manual" for your processor.
And for general emulation questions: http://www.google.es/search?q=how+to+write+an+emulator
You should take a look at QEMU.
I don't understand, however, why you need a complete emulator.
You can already do a lot of profiling without an emulator. What gains do you expect from having a system emulator?
