How to give an emulation the right speed? [duplicate] - emulation

This question is a duplicate of: CPU Emulation and locking to a specific clock speed
I want to write an emulator for a particularly slow CPU which runs at 600 or so kilohertz. If I were to write an emulator for the CPU in the naïve way (i.e. emulating one instruction at a time without anything else), the emulation would run much faster than 600 kilohertz.
How do I program an emulator to emulate a CPU at the correct speed, regardless of the host's speed? What technique do real-world emulators usually use for this? How do I avoid jitter slowing down the emulation?

On a typical platform, the only available "periodic events" are inaccurate and low-frequency, certainly nothing like 0.6 MHz. But using a "slow" timer (maybe 100 Hz or so) you can "run many short sprints", with enough time "resting" in between that on average you're emulating the right number of cycles per second. Time can usually be measured fairly accurately, so you can emulate exactly the right number of cycles in every "sprint".
At a high level, that could look something like this:
int cycle_budget = 0;
time last_sprint = something;
// on timer fire
cycle_budget += (current_time - last_sprint) * clock_rate;
last_sprint = current_time;
while (cycle_budget >= slowest_instruction)
tick(); // emulates one instruction, subtracts from cycle_budget
There are some obvious variations, for example you can let the budget go negative instead of testing whether there is enough to run a slow instruction. Or you might decode the instruction and then test whether there is enough budget to run it. This all assumes an instruction won't take arbitrarily long, but as far as I know that's never a problem (even the Z80's string instructions actually loop by branching back and re-executing themselves).
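For concreteness, here is a minimal C sketch of that cycle-budget idea (my own illustration, not from the original answer). CLOCK_RATE_HZ, SLOWEST_INSTRUCTION_CYCLES, on_timer_fire() and tick() are assumed names, and tick() here returns the cycle cost of the instruction it just emulated instead of subtracting it itself:
#include <stdint.h>
#include <time.h>

#define CLOCK_RATE_HZ 600000u               /* emulated CPU clock, ~600 kHz */
#define SLOWEST_INSTRUCTION_CYCLES 23       /* assumed worst-case instruction cost */

extern int tick(void);                      /* emulate one instruction, return its cycle cost */

static uint64_t host_time_ns(void)          /* monotonic host time in nanoseconds */
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Call this from a coarse periodic host timer, e.g. ~100 Hz. */
void on_timer_fire(void)
{
    static uint64_t last_sprint_ns;
    static int64_t cycle_budget;

    uint64_t now = host_time_ns();
    if (last_sprint_ns == 0)
        last_sprint_ns = now;               /* first call: nothing owed yet */

    /* Convert elapsed host time into emulated cycles we owe. */
    cycle_budget += (int64_t)((now - last_sprint_ns) * CLOCK_RATE_HZ / 1000000000ull);
    last_sprint_ns = now;

    /* "Sprint": run instructions until the budget is (almost) used up. */
    while (cycle_budget >= SLOWEST_INSTRUCTION_CYCLES)
        cycle_budget -= tick();
}
Hooking on_timer_fire() up to a ~100 Hz host timer (or a frame callback) gives exactly the "sprint and rest" behaviour described above.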

Related

Linux: CPU benchmark requiring longer time and different CPU utilization levels

For my research I need a CPU benchmark to do some experiments on my Ubuntu laptop (Ubuntu 15.10, Memory 7.7 GiB, Intel Core i7-4500U CPU @ 1.80GHz x 4, 64-bit). In an ideal world, I would like to have a benchmark satisfying the following:
The benchmark should be an official one rather than something I created myself, for transparency purposes.
The time needed to execute the benchmark on my laptop should be at least 5 minutes (the more the better).
The benchmark should result in different levels of CPU utilization throughout execution. For example, I don't want a benchmark which permanently keeps the CPU utilization level at around 100% - I want a benchmark which makes the CPU utilization vary over time.
Points 2 and 3 in particular are key for my research. However, I couldn't find any suitable benchmarks so far. The benchmarks I have found include sysbench, CPU Fibonacci, CPU Blowfish, CPU Cryptofish, CPU N-Queens. However, all of them need just a couple of seconds to complete, and the utilization level on my laptop stays constantly at 100%.
Question: Does anyone know about a suitable benchmark for me? I am also happy to hear any other comments/questions you have. Thank you!
To choose a benchmark, you need to know exactly what you're trying to measure. Your question doesn't include that, so there's not much anyone can tell you without taking a wild guess.
If you're trying to measure how well Turbo clock speed works to make a power-limited CPU like your laptop run faster for bursty workloads (e.g. to compare Haswell against Skylake's new and improved power management), you could just run something trivial that's 1 second on, 2 seconds off, and count how many loop iterations it manages.
The duty cycle and cycle length should be benchmark parameters, so you can make plots. For example, with very fast on/off cycles, Skylake's faster-reacting Turbo will ramp up faster and drop back down to minimum power sooner (leaving more headroom in the bank for the next burst).
The speaker in that talk (the lead architect for power management on Intel CPUs) says that Javascript benchmarks are actually bursty enough for Skylake's power management to give a measurable speedup, unlike most other benchmarks which just peg the CPU at 100% the whole time. So maybe have a look at Javascript benchmarks, if you want to use well-known off-the-shelf benchmarks.
If rolling your own, put a loop-carried dependency chain in the loop, preferably with something that's not too variable in latency across microarchitectures. A long chain of integer adds would work, and Fibonacci is a good way to stop the compiler from optimizing it away. Either pick a fixed iteration count that works well for current CPU speeds, or check the clock every 10M iterations.
Or set a timer that will fire after some time, and have it set a flag that you check inside the loop (e.g. set from a signal handler); alarm(2) may be a good choice. Record how many iterations you did in this burst of work.
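A rough sketch of such a roll-your-own bursty benchmark (the burst count, duty cycle and chunk size here are arbitrary placeholders you would want to turn into parameters):
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t stop;                 /* set by SIGALRM to end a burst */
static void on_alarm(int sig) { (void)sig; stop = 1; }

int main(void)
{
    signal(SIGALRM, on_alarm);

    for (int burst = 0; burst < 20; burst++) {     /* 20 on/off cycles */
        uint64_t a = 0, b = 1, iters = 0;
        stop = 0;
        alarm(1);                                  /* 1 second "on" */
        while (!stop) {
            for (int i = 0; i < 10000; i++) {      /* loop-carried dependency chain */
                uint64_t t = a + b;                /* Fibonacci-style, so the */
                a = b;                             /* compiler can't optimize */
                b = t;                             /* it away */
            }
            iters += 10000;
        }
        /* printing b keeps the whole chain live */
        printf("burst %d: %llu iterations (b=%llu)\n",
               burst, (unsigned long long)iters, (unsigned long long)b);
        sleep(2);                                  /* 2 seconds "off" */
    }
    return 0;
}
Plotting iterations per burst against the duty-cycle parameters gives the kind of graphs described above.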

Measuring time in assembly

For several hours now, I have been trying to find a way to measure a time interval within assembly code. What I have seen so far is that I can query the number of CPU cycles, but of course I'd need to know the CPU frequency to translate a number of cycles into time. I have found the rdmsr instruction, but it is a ring 0 instruction, and ring 0 is not somewhere I can put my code.
Some examples I've found call the Windows Query* functions for this, but I am not running on Windows. Is there any way for me to measure a time interval at user level? Any other way to get the frequency, or maybe another clock I can access directly? A one-second-resolution system clock is of course out of the question :)
I spent quite a while working with cycle counters, and eventually came to the (perhaps obvious) conclusion that RDTSC counts cycles, not time. It will never count time because the computer's clock is constantly being ramped up and down by the power management unit. So the cycle counter is extremely precise for measuring cycles, but horribly off by random amounts in real time units. I believe Intel eventually addressed this by locking the cycle counter to a clock that is not affected by the PMU, but I haven't investigated it.
The Windows Query* functions do not actually use the RDTSC cycle counter. I thought they did until I tried to measure really small periods and found it had a 14MHz(?) tick, which turned out to be the PCI data bus clock.
On top of all this, each core has its own cycle counter, so you have to pay attention to which core you are using when executing the RDTSC opcode. And each core has its own PMU.
The best timer you will find in Windows user mode is QueryPerformanceCounter() and QueryPerformanceFrequency().
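For reference, a minimal sketch of using that pair of calls (these are standard Win32 APIs; the timed section here is just a Sleep()):
#include <windows.h>
#include <stdio.h>

int main(void)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);   /* counts per second, fixed at boot */
    QueryPerformanceCounter(&t0);
    Sleep(100);                         /* the code being timed */
    QueryPerformanceCounter(&t1);
    printf("elapsed: %.3f ms\n",
           (double)(t1.QuadPart - t0.QuadPart) * 1000.0 / (double)freq.QuadPart);
    return 0;
}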

Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?

I want to get the CPU cycles at a specific point. I use this function at that point:
static __inline__ unsigned long long rdtsc(void)
{
unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
// broken for 64-bit builds; don't copy this code
return x;
}
(editor's note: "=A" is wrong for x86-64; it picks either RDX or RAX. Only in 32-bit mode will it pick the EDX:EAX output you want. See How to get the CPU cycle count in x86_64 from C++?.)
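Editor's note, continued: a 64-bit-safe variant (one possible fix, not the asker's code) splits the EDX:EAX outputs explicitly, as in the sketch below; in practice the __rdtsc() intrinsic from <x86intrin.h> (GCC/Clang) or <intrin.h> (MSVC) is simpler.
#include <stdint.h>

static inline uint64_t rdtsc64(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));  /* EDX:EAX split explicitly */
    return ((uint64_t)hi << 32) | lo;
}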
The problem is that it returns always an increasing number (in every run). It's as if it is referring to the absolute time.
Am I using the functions incorrectly?
As long as your thread stays on the same CPU core, the RDTSC instruction will keep returning an increasing number until it wraps around. For a 2GHz CPU, this happens after 292 years, so it is not a real issue. You probably won't see it happen. If you expect to live that long, make sure your computer reboots, say, every 50 years.
The problem with RDTSC is that you have no guarantee that it starts at the same point in time on all cores of an elderly multicore CPU, and no guarantee that it starts at the same point in time on all CPUs on an elderly multi-CPU board.
Modern systems usually do not have such problems, but the problem can also be worked around on older systems by setting a thread's affinity so it only runs on one CPU. This is not good for application performance, so one should not generally do it, but for measuring ticks, it's just fine.
(Another "problem" is that many people use RDTSC for measuring time, which is not what it does, but you wrote that you want CPU cycles, so that is fine. If you do use RDTSC to measure time, you may have surprises when power saving or hyperboost or whatever the multitude of frequency-changing techniques are called kicks in. For actual time, the clock_gettime syscall is surprisingly good under Linux.)
I would just write rdtsc inside the asm statement, which works just fine for me and is more readable than some obscure hex code. Assuming it's the correct hex code (and since it doesn't crash and returns an ever-increasing number, it seems so), your code is good.
If you want to measure the number of ticks a piece of code takes, you want a tick difference; just subtract two values of the ever-increasing counter. Something like uint64_t t0 = rdtsc(); ... uint64_t t1 = rdtsc() - t0;
Note that if very accurate measurements isolated from the surrounding code are necessary, you need to serialize, that is, stall the pipeline, prior to calling rdtsc (or use rdtscp, which is only supported on newer processors). The one serializing instruction that can be used at every privilege level is cpuid.
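A sketch of that pattern, assuming GCC/Clang and a CPU that supports RDTSCP (the helper names tsc_begin/tsc_end are mine):
#include <stdint.h>
#include <x86intrin.h>                   /* __rdtsc, __rdtscp (GCC/Clang) */

static inline uint64_t tsc_begin(void)
{
    unsigned int a, b, c, d;
    __asm__ __volatile__ ("cpuid"        /* serialize before the first read */
                          : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
                          : "a"(0));
    return __rdtsc();
}

static inline uint64_t tsc_end(void)
{
    unsigned int aux;
    /* rdtscp waits for all earlier instructions to finish before reading */
    return __rdtscp(&aux);
}
Usage is then uint64_t t0 = tsc_begin(); ... uint64_t ticks = tsc_end() - t0; ideally with the thread pinned to one core.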
In reply to the further question in the comment:
The TSC starts at zero when you turn on the computer (and the BIOS resets all counters on all CPUs to the same value, though some BIOSes a few years ago did not do so reliably).
Thus, from your program's point of view, the counter started "some unknown time in the past", and it always increases with every clock tick the CPU sees. Therefore if you execute the instruction returning that counter now and any time later in a different process, it will return a greater value (unless the CPU was suspended or turned off in between). Different runs of the same program get bigger numbers, because the counter keeps growing. Always.
Now, clock_gettime(CLOCK_PROCESS_CPUTIME_ID) is a different matter. This is the CPU time that the OS has given to the process. It starts at zero when your process starts. A new process starts at zero, too. Thus, two processes running after each other will get very similar or identical numbers, not ever growing ones.
clock_gettime(CLOCK_MONOTONIC_RAW) is closer to how RDTSC works (and on some older systems is implemented with it). It returns a value that ever increases. Nowadays, this is typically a HPET. However, this is really time, and not ticks. If your computer goes into low power state (e.g. running at 1/2 normal frequency), it will still advance at the same pace.
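A minimal sketch of timing an interval with that clock (CLOCK_MONOTONIC_RAW is Linux-specific; error handling omitted):
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC_RAW, &t0);
    /* ... code being timed ... */
    clock_gettime(CLOCK_MONOTONIC_RAW, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("elapsed: %.0f ns\n", ns);
    return 0;
}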
There's lots of confusing and/or wrong information about the TSC out there, so I thought I'd try to clear some of it up.
When Intel first introduced the TSC (in the original Pentium CPUs) it was clearly documented to count cycles (and not time). However, back then CPUs mostly ran at a fixed frequency, so some people ignored the documented behaviour and used it to measure time instead (most notably, Linux kernel developers). Their code broke on later CPUs that don't run at a fixed frequency (due to power management, etc). Around that time other CPU manufacturers (AMD, Cyrix, Transmeta, etc) were confused: some implemented the TSC to measure cycles, some implemented it so that it measured time, and some made it configurable (via an MSR).
Then "multi-chip" systems became more common for servers; and even later multi-core was introduced. This led to minor differences between TSC values on different cores (due to different startup times); but more importantly it also led to major differences between TSC values on different CPUs caused by CPUs running at different speeds (due to power management and/or other factors).
People that were trying to use it wrong from the start (people who used it to measure time and not cycles) complained a lot, and eventually convinced CPU manufacturers to standardise on making the TSC measure time and not cycles.
Of course this was a mess - e.g. it takes a lot of code just to determine what the TSC actually measures if you support all 80x86 CPUs; and different power management technologies (including things like SpeedStep, but also things like sleep states) may affect the TSC in different ways on different CPUs; so AMD introduced a "TSC invariant" flag in CPUID to tell the OS that the TSC can be used to measure time correctly.
All recent Intel and AMD CPUs have been like this for a while now - the TSC counts time and doesn't measure cycles at all. This means that if you want to measure cycles you have to use (model-specific) performance monitoring counters. Unfortunately the performance monitoring counters are an even worse mess (due to their model-specific nature and convoluted configuration).
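On Linux, the usual user-space route to those counters is perf_event_open(2); a rough sketch counting core cycles for the calling thread (minimal error handling) might look like this:
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static long perf_event_open(struct perf_event_attr *attr,
                            pid_t pid, int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;   /* actual core cycles, not time */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);   /* this thread, any CPU */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... code to be measured ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles = 0;
    read(fd, &cycles, sizeof cycles);
    printf("core cycles: %llu\n", (unsigned long long)cycles);
    close(fd);
    return 0;
}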
Good answers already, and Damon already mentioned this in a way in his answer, but I'll add this from the actual x86 manual (Volume 2, page 4-301) entry for RDTSC:
Loads the current value of the processor's time-stamp counter (a 64-bit MSR) into the EDX:EAX registers. The EDX register is loaded with the high-order 32 bits of the MSR and the EAX register is loaded with the low-order 32 bits. (On processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are cleared.)
The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever the processor is reset. See "Time Stamp Counter" in Chapter 17 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B, for specific details of the time stamp counter behavior.

How long does a context switch take in Linux?

I'm curious how many cycles it takes to change contexts in Linux. I'm specifically using an E5405 Xeon (x64), but I'd love to see how it compares to other platforms as well.
There's a free app called LMBench, written by Larry McVoy and friends. It provides a bunch of OS and HW benchmarks.
One of the tests is called lat_ctx and it measures context switch latencies.
Google for lmbench and check for yourself on your own HW. It's the only way to get a number meaningful to you.
Gilad
Run vmstat on your machine while doing something that requires heavy context switching. It doesn't tell you how long the actual switch takes, but it will tell you how many switches you do per second.
Then, you have to estimate how much of each timeslice is spent running actual code, compared to switching context. Maybe 100:1 or something? I don't know. 1000:1?
A machine of mine is now doing roughly 3000 switches per second, i.e. 0.3 ms per timeslice. With a ratio of 100:1 that would mean the actual switch takes 0.003 ms.
But, with multiple cores, threads yielding execution, etc. etc., I wouldn't draw any conclusions from such a guess :)
I've written code that's able to echo (small) UDP packets at 200k packets per second.
That suggests that it's possible to context switch in not more than 2.5 microseconds, with the actual context switch probably taking somewhat less than that.
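If you want a rough number without lmbench, one common do-it-yourself approximation (similar in spirit to lat_ctx) is to bounce a byte between two processes over a pair of pipes and divide the round-trip time; a sketch, with arbitrary constants:
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ROUNDS 100000

int main(void)
{
    int p2c[2], c2p[2];                  /* parent->child and child->parent pipes */
    char buf = 'x';

    if (pipe(p2c) || pipe(c2p)) { perror("pipe"); return 1; }

    if (fork() == 0) {                   /* child: echo every byte back */
        for (int i = 0; i < ROUNDS; i++) {
            read(p2c[0], &buf, 1);
            write(c2p[1], &buf, 1);
        }
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {   /* parent: ping, then wait for the pong */
        write(p2c[1], &buf, 1);
        read(c2p[0], &buf, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    /* each round trip costs at least two switches, plus pipe/syscall overhead */
    printf("~%.0f ns per switch (upper bound)\n", ns / ROUNDS / 2);
    return 0;
}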

Is gettimeofday() guaranteed to be of microsecond resolution?

I am porting a game, that was originally written for the Win32 API, to Linux (well, porting the OS X port of the Win32 port to Linux).
I have implemented QueryPerformanceCounter by giving the uSeconds since the process start up:
// currentTimeVal and startTimeVal are struct timeval globals;
// startTimeVal is filled in with gettimeofday() at process start-up.
BOOL QueryPerformanceCounter(LARGE_INTEGER* performanceCount)
{
    gettimeofday(&currentTimeVal, NULL);
    performanceCount->QuadPart = (currentTimeVal.tv_sec - startTimeVal.tv_sec);
    performanceCount->QuadPart *= (1000 * 1000);
    performanceCount->QuadPart += (currentTimeVal.tv_usec - startTimeVal.tv_usec);
    return true;
}
This, coupled with QueryPerformanceFrequency() giving a constant 1000000 as the frequency, works well on my machine, giving me a 64-bit variable that contains uSeconds since the program's start-up.
So is this portable? I don't want to discover it works differently if the kernel was compiled in a certain way or anything like that. I am fine with it being non-portable to something other than Linux, however.
Maybe. But you have bigger problems. gettimeofday() can result in incorrect timings if there are processes on your system that change the timer (i.e., ntpd). On a "normal" Linux, though, I believe the resolution of gettimeofday() is 10 us. The time it returns can consequently jump forward and backward, depending on the processes running on your system. This effectively makes the answer to your question no.
You should look into clock_gettime(CLOCK_MONOTONIC) for timing intervals. It suffers from far fewer issues due to things like multi-core systems and external clock settings.
Also, look into the clock_getres() function.
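For example, a tiny program that just prints what the kernel claims for the resolution of a couple of clocks (clock_getres() reports the advertised granularity, not a guarantee about accuracy):
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec res;
    clock_getres(CLOCK_MONOTONIC, &res);
    printf("CLOCK_MONOTONIC resolution: %ld s %ld ns\n", (long)res.tv_sec, res.tv_nsec);
    clock_getres(CLOCK_REALTIME, &res);
    printf("CLOCK_REALTIME  resolution: %ld s %ld ns\n", (long)res.tv_sec, res.tv_nsec);
    return 0;
}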
High Resolution, Low Overhead Timing for Intel Processors
If you're on Intel hardware, here's how to read the CPU real-time instruction counter. It will tell you the number of CPU cycles executed since the processor was booted. This is probably the finest-grained counter you can get for performance measurement.
Note that this is the number of CPU cycles. On Linux you can get the CPU speed from /proc/cpuinfo and divide to get the number of seconds. Converting this to a double is quite handy.
When I run this on my box, I get
11867927879484732
11867927879692217
it took this long to call printf: 207485
Here's the Intel developer's guide that gives tons of detail.
#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__ (
        "xorl %%eax, %%eax\n"
        "cpuid\n"              /* serialize: wait for earlier instructions to finish */
        "rdtsc\n"              /* read the time-stamp counter into EDX:EAX */
        : "=a" (lo), "=d" (hi)
        :
        : "%ebx", "%ecx");
    return (uint64_t)hi << 32 | lo;
}

int main(void)
{
    uint64_t x, y;
    x = rdtsc();
    printf("%llu\n", (unsigned long long)x);
    y = rdtsc();
    printf("it took this long to call printf: %llu\n", (unsigned long long)(y - x));
    printf("%llu\n", (unsigned long long)y);
    return 0;
}
@Bernard:
I have to admit, most of your example went straight over my head. It does compile, and seems to work, though. Is this safe for SMP systems or SpeedStep?
That's a good question... I think the code's OK. From a practical standpoint, we use it in my company every day, and we run on a pretty wide array of boxes, everything from 2 to 8 cores. Of course, YMMV, etc, but it seems to be a reliable and low-overhead (because it doesn't make a context switch into system space) method of timing.
Generally how it works is:
Declare the block of code to be assembler (and volatile, so the optimizer will leave it alone).
Execute the CPUID instruction. In addition to getting some CPU information (which we don't do anything with) it synchronizes the CPU's execution buffer so that the timings aren't affected by out-of-order execution.
Execute the rdtsc (read time-stamp) instruction. This fetches the number of machine cycles executed since the processor was reset. This is a 64-bit value, so with current CPU speeds it will wrap around every 194 years or so. Interestingly, in the original Pentium reference, they note it wraps around every 5800 years or so.
The last couple of lines store the values from the registers into the variables hi and lo, and put that into the 64-bit return value.
Specific notes:
Out-of-order execution can cause incorrect results, so we execute the "cpuid" instruction, which in addition to giving you some information about the CPU also synchronizes any out-of-order instruction execution.
Most OSes synchronize the counters on the CPUs when they start, so the answer is good to within a couple of nanoseconds.
The hibernating comment is probably true, but in practice you probably don't care about timings across hibernation boundaries.
Regarding SpeedStep: newer Intel CPUs compensate for the speed changes and return an adjusted count. I did a quick scan over some of the boxes on our network and found only one box that didn't have it: a Pentium 3 running some old database server. (These are Linux boxes, so I checked with: grep constant_tsc /proc/cpuinfo)
I'm not sure about the AMD CPUs; we're primarily an Intel shop, although I know some of our low-level systems gurus did an AMD evaluation.
Hope this satisfies your curiosity; it's an interesting and (IMHO) under-studied area of programming. You know when Jeff and Joel were talking about whether or not a programmer should know C? I was shouting at them, "hey, forget that high-level C stuff... assembler is what you should learn if you want to know what the computer is doing!"
You may be interested in the Linux FAQ for clock_gettime(CLOCK_REALTIME).
Wine is actually using gettimeofday() to implement QueryPerformanceCounter() and it is known to make many Windows games work on Linux and Mac.
It starts at http://source.winehq.org/source/dlls/kernel32/cpu.c#L312
which leads to http://source.winehq.org/source/dlls/ntdll/time.c#L448
So it says microseconds explicitly, but says the resolution of the system clock is unspecified. I suppose resolution in this context means the smallest amount it will ever be incremented by?
The data structure is defined as having microseconds as a unit of measurement, but that doesn't mean that the clock or operating system is actually capable of measuring that finely.
Like other people have suggested, gettimeofday() is bad because setting the time can cause clock skew and throw off your calculation. clock_gettime(CLOCK_MONOTONIC) is what you want, and clock_getres() will tell you the precision of your clock.
The actual resolution of gettimeofday() depends on the hardware architecture. Intel processors as well as SPARC machines offer high resolution timers that measure microseconds. Other hardware architectures fall back to the system’s timer, which is typically set to 100 Hz. In such cases, the time resolution will be less accurate.
I obtained this answer from High Resolution Time Measurement and Timers, Part I
This answer mentions problems with the clock being adjusted. Both your problems guaranteeing tick units and the problems with the time being adjusted are solved in C++11 with the <chrono> library.
The clock std::chrono::steady_clock is guaranteed not to be adjusted, and furthermore it will advance at a constant rate relative to real time, so technologies like SpeedStep must not affect it.
You can get typesafe units by converting to one of the std::chrono::duration specializations, such as std::chrono::microseconds. With this type there's no ambiguity about the units used by the tick value. However, keep in mind that the clock doesn't necessarily have this resolution. You can convert a duration to attoseconds without actually having a clock that accurate.
From my experience, and from what I've read across the internet, the answer is "No," it is not guaranteed. It depends on CPU speed, operating system, flavor of Linux, etc.
Reading the RDTSC is not reliable in SMP systems, since each CPU maintains its own counter and each counter is not guaranteed to be synchronized with respect to another CPU.
I might suggest trying clock_gettime(CLOCK_REALTIME). The POSIX manual indicates that this should be implemented on all compliant systems. It can provide a nanosecond count, but you will probably want to check clock_getres(CLOCK_REALTIME) on your system to see what the actual resolution is.
