Determining Opcode Cycle Count for a CPU

Determining Opcode Cycle Count for a CPU - emulation

I was wondering where would one go about getting CPU opcode cycle counts for various machines. An example of what I'm talking about can be seen at this link:
https://web.archive.org/web/20150217051448/http://www.obelisk.demon.co.uk/6502/reference.html
If you examine the MAME source code, especially under src\emu\cpu, you'll see that most of the CPU models keep a track of the cycle count in a similar way. My question is where does one go about getting this information, or reverse engineering it if its not available? I've never seen any 'official' ASM programmer's guide contain cycle count info. My initial guess is that a small program is thrown into the real hardware's bootrom, and if it contains an opcode equivalent to RDTSC, something like this is done:
RDTSC
//opcode of choosing
RDTSC
But what would you do if such support wasn't available? I know for older hardware the MAME team has no access to anything but the roms, and scattered documentation.

Up through about the Pentium, cycle counts were easy to find for Intel and AMD processors (and most competitors). Starting with the Pentium Pro and AMD K5, however, the CPU went to a dynamic execution model, in which instructions can be executed out of order. In this case, the time taken to execute an instruction depends heavily upon the data it uses, and whether (for example) it depends on data from a previous instruction (in which case, it has to wait for that instruction to complete before it can execute).
There are also constraints on things like how many instructions can be decoded per cycle (e.g. at least one, plus two more as long as they're "simple") and how many can be retired per cycle (usually around three or four).
As a result, on a modern CPU it's almost meaningless to talk about the cycles for a given instruction in isolation. Meaningful results require a stream of instructions, so you look not only at that instruction, but what comes before and after it. An instruction that's a serious bottleneck in one instruction stream might be essentially free in another stream (e.g. if you have one multiplication mixed in with a lot of adds, the multiplication might be almost free -- but if it's surrounded by a lot of other multiplications, it might be relatively expensive).

The accepted RDTSC count should have a serializing instruction to ensure that all previous instructions have retired before getting the count. This adds overhead to the count, but you can simply "count" zero instructions and subtract that value from the measured instructions.
Some pdf manuals that cover this very well.
http://www.agner.org/optimize/#manuals

Related

Why do register renaming, when we can increase the number of registers in the architecture?

In processors, why can't we simply increase the number of registers instead of having a huge reorder buffer and mapping the register for resolving name dependencies?

Lots of reasons.
first, we are often designing micro-architectures to execute programs for an existing architecture. Adding registers would change the architecture. At best, existing binaries would not benefit from the new registers, at worst they won't run at all without some kind of JIT compilation.
there is the problem of encoding. Adding new registers means increasing the number of bit dedicated to encode the registers, probably increasing the instruction size with effects on the cache and elsewhere.
there is the issue of the size of the visible state. Context swapping would have to save all the visible registers. Taking more time. Taking more place (and thus an effect on the cache, thus more time again).
there is the effect that dynamic renaming can be applied at places where static renaming and register allocation is impossible, or at least hard to do; and when they are possible, that takes more instructions thus increasing the cache pressure.
In conclusion there is a sweet spot which is usually considered at 16 or 32 registers for the integer/general purpose case. For floating point and vector registers, there are arguments to consider more registers (ISTR that Fujitsu was at a time using 128 or 256 floating point registers for its own extended SPARC).
Related question on electronics.se.
An additional note, the mill architecture takes another approach to statically scheduled processors and avoid some of the drawbacks, apparently changing the trade-off. But AFAIK, it is not yet know if there will ever be available silicon for it.

Because static scheduling at compile time is hard (software pipelining) and inflexible to variable timings like cache misses. Having the CPU able to find and exploit ILP (Instruction Level Parallelism) in more cases is very useful for hiding latency of cache misses and FP or integer math.
Also, instruction-encoding considerations. For example, Haswell's 168-entry integer register file would need about 8 bits per operand to encode if we had that many architectural registers. vs. 3 or 4 for actual x86 machine code.
Related:
http://www.lighterra.com/papers/modernmicroprocessors/ great intro to CPU design and how smarter CPUs can find more ILP
Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths shows how OoO exec can overlap exec of two dependency chains, unless you block it.
http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ has some specific examples of how much OoO exec can do to hide cache-miss or other latency
this Q&A about how superscalar execution works.

Register identifier encoding space will be a problem. Indeed, many more registers has been tried. For example, SPARC has register windows, 72 to 640 registers of which 32 are visible at one time.
Instead, from Computer Organization And Design: RISC-V Edition.
Smaller is faster. The desire for speed is the reason that RISC-V has 32 registers rather than many more.
BTW, ROB size has to do with the processor being out-of-order, superscalar, rather than renaming and providing lots of general purpose registers.

Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?

I want to get the CPU cycles at a specific point. I use this function at that point:
static __inline__ unsigned long long rdtsc(void)
{
unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
// broken for 64-bit builds; don't copy this code
return x;
}
(editor's note: "=A" is wrong for x86-64; it picks either RDX or RAX. Only in 32-bit mode will it pick the EDX:EAX output you want. See How to get the CPU cycle count in x86_64 from C++?.)
The problem is that it returns always an increasing number (in every run). It's as if it is referring to the absolute time.
Am I using the functions incorrectly?

As long as your thread stays on the same CPU core, the RDTSC instruction will keep returning an increasing number until it wraps around. For a 2GHz CPU, this happens after 292 years, so it is not a real issue. You probably won't see it happen. If you expect to live that long, make sure your computer reboots, say, every 50 years.
The problem with RDTSC is that you have no guarantee that it starts at the same point in time on all cores of an elderly multicore CPU and no guarantee that it starts at the same point in time time on all CPUs on an elderly multi-CPU board.
Modern systems usually do not have such problems, but the problem can also be worked around on older systems by setting a thread's affinity so it only runs on one CPU. This is not good for application performance, so one should not generally do it, but for measuring ticks, it's just fine.
(Another "problem" is that many people use RDTSC for measuring time, which is not what it does, but you wrote that you want CPU cycles, so that is fine. If you do use RDTSC to measure time, you may have surprises when power saving or hyperboost or whatever the multitude of frequency-changing techniques are called kicks in. For actual time, the clock_gettime syscall is surprisingly good under Linux.)
I would just write rdtsc inside the asm statement, which works just fine for me and is more readable than some obscure hex code. Assuming it's the correct hex code (and since it neither crashes and returns an ever-increasing number, it seems so), your code is good.
If you want to measure the number of ticks a piece of code takes, you want a tick difference, you just need to subtract two values of the ever-increasing counter. Something like uint64_t t0 = rdtsc(); ... uint64_t t1 = rdtsc() - t0;
Note that for if very accurate measurements isolated from surrounding code are necessary, you need to serialize, that is stall the pipeline, prior to calling rdtsc (or use rdtscp which is only supported on newer processors). The one serializing instruction that can be used at every privilegue level is cpuid.
In reply to the further question in the comment:
The TSC starts at zero when you turn on the computer (and the BIOS resets all counters on all CPUs to the same value, though some BIOSes a few years ago did not do so reliably).
Thus, from your program's point of view, the counter started "some unknown time in the past", and it always increases with every clock tick the CPU sees. Therefore if you execute the instruction returning that counter now and any time later in a different process, it will return a greater value (unless the CPU was suspended or turned off in between). Different runs of the same program get bigger numbers, because the counter keeps growing. Always.
Now, clock_gettime(CLOCK_PROCESS_CPUTIME_ID) is a different matter. This is the CPU time that the OS has given to the process. It starts at zero when your process starts. A new process starts at zero, too. Thus, two processes running after each other will get very similar or identical numbers, not ever growing ones.
clock_gettime(CLOCK_MONOTONIC_RAW) is closer to how RDTSC works (and on some older systems is implemented with it). It returns a value that ever increases. Nowadays, this is typically a HPET. However, this is really time, and not ticks. If your computer goes into low power state (e.g. running at 1/2 normal frequency), it will still advance at the same pace.

There's lots of confusing and/or wrong information about the TSC out there, so I thought I'd try to clear some of it up.
When Intel first introduced the TSC (in original Pentium CPUs) it was clearly documented to count cycles (and not time). However, back then CPUs mostly ran at a fixed frequency, so some people ignored the documented behaviour and used it to measure time instead (most notably, Linux kernel developers). Their code broke in later CPUs that don't run at a fixed frequency (due to power management, etc). Around that time other CPU manufacturers (AMD, Cyrix, Transmeta, etc) were confused and some implemented TSC to measure cycles and some implemented it so it measured time, and some made it configurable (via. an MSR).
Then "multi-chip" systems became more common for servers; and even later multi-core was introduced. This led to minor differences between TSC values on different cores (due to different startup times); but more importantly it also led to major differences between TSC values on different CPUs caused by CPUs running at different speeds (due to power management and/or other factors).
People that were trying to use it wrong from the start (people who used it to measure time and not cycles) complained a lot, and eventually convinced CPU manufacturers to standardise on making the TSC measure time and not cycles.
Of course this was a mess - e.g. it takes a lot of code just to determine what the TSC actually measures if you support all 80x86 CPUs; and different power management technologies (including things like SpeedStep, but also things like sleep states) may effect TSC in different ways on different CPUs; so AMD introduced a "TSC invariant" flag in CPUID to tell the OS that the TSC can be used to measure time correctly.
All recent Intel and AMD CPUs have been like this for a while now - TSC counts time and doesn't measure cycles at all. This means if you want to measure cycles you had to use (model specific) performance monitoring counters. Unfortunately the performance monitoring counters are an even worse mess (due to their model specific nature and convoluted configuration).

good answers already, and Damon already mentioned this in a way in his answer, but I'll add this from the actual x86 manual (volume 2, 4-301) entry for RDTSC:
Loads the current value of the processor's time-stamp counter (a 64-bit MSR) into the EDX:EAX registers. The EDX register is loaded with the high-order 32 bits of the MSR and the EAX register is loaded with the low-order 32 bits. (On processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are cleared.)
The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever the processor is reset. See "Time Stamp Counter" in Chapter 17 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B, for specific details of the time stamp counter behavior.

What is the lowest time resolution for somewhat accurate measurements of cpu usage?

Some of the things I want to measure are very short,and I can only repeat them so many times if I don't run any of the setup/dispose code in the middle.
note: on linux,reading /proc/stat

Not very portable and you'll have to take great care so it is reliable, but the Time Stamp Counter definitely has the highest resolution available (increases at every CPU tick).
The time stamp counter has, until
recently, been an excellent
high-resolution, low-overhead way of
getting CPU timing information. With
the advent of multi-core/hyperthreaded
CPUs, systems with multiple CPUs, and
"hibernating" operating systems, the
TSC cannot be relied on to provide
accurate results - unless great care
is taken to correct the possible
flaws: rate of tick and whether all
cores (processors) have identical
values in their time-keeping
registers. There is no promise that
the timestamp counters of multiple
CPUs on a single motherboard will be
synchronized. In such cases,
programmers can only get reliable
results by locking their code to a
single CPU. Even then, the CPU speed
may change due to power-saving
measures taken by the OS or BIOS, or
the system may be hibernated and later
resumed (resetting the time stamp
counter). In those latter cases, to
stay relevant, the counter must be
recalibrated periodically (according
to the time resolution your
application requires).
There's some notes there about Linux specific solutions on the page, too:
Under Linux, similar functionality is
provided by reading the value of
CLOCK_MONOTONIC clock using POSIX
clock_gettime function.

Approximate Number of CPU Cycles for Various Operations

I am trying to find a reference for approximately how many CPU cycles various operations require.
I don't need exact numbers (as this is going to vary between CPUs) but I'd like something relatively credible that gives ballpark figures that I could cite in discussion with friends.
As an example, we all know that floating point division takes more CPU cycles than say doing a bitshift.
I'd guess that the difference is that the division is around 100 cycles, where as a shift is 1 but I'm looking for something to cite to back that up.
Can anyone recommend such a resource?

I did a small app to test this. A very approximate app using synthmaker free edition... e is for empty, numbers are very approx cycles
divide|e:115|10
mult|e: 48|10
add|e: 48|10
subs|e: 50|10
compare>|e: 50|10
sin|e:135:10
The readings in the cycle analyser vary wildly from 50 to 100, usually single or double of the expected amount, these are figures that represent averages,the cycle analyzer is a very rough tool, but it gives fair results, a workaround user made exponent coded in ASM that calculates both the exp and the base at audio rate for example is around 800 cycles, so I'd say the above figures are close to at least 50 percent. I thought the divide was way more! It seems about twice as much. If you want the file I made to run in SM free version mail me, I was going to save an exe that is why i did it but you cant save in free version silly me! I am not going to code it from square one in version 1.17 :/
ant.stewart at the place yahoo dotty com.

For x86 processors, see Intel® 64 and IA-32 Architectures Optimization Reference Manual, probably Appendix C.
However, it's not in any way easy to figure out how many cycles an instruction takes to execute on a modern x86 processor, as it depends too much on e.g. accessing data in cache,aligned access, whether branch prediction fails, if there's a stall in the instruction pipeline and quite a lot of other things.

This is going to be hardware-dependent. The best thing to do is to run some benchmarks on the particular hardware you want to test.
A benchmark would go roughly like this:
Run a primitive operation a million times (say, adding two integers)
Record the time it took to run (say, in seconds)
Multiply by the number of cycles your machine executes per second - this will give you the total number of cycles spent.
Divide 1000000 by the number from the previous step - this will give you the number of instructions per cycle. Keep in mind that with pipelining, this could be less than 1.

There is research made by Agner Fog:
Instruction tables
Instruction tables: Lists of instruction latencies, throughputs and
micro-operation breakdowns for Intel, AMD, and VIA CPUs.
Last updated 2021-03-22

Is gettimeofday() guaranteed to be of microsecond resolution?

I am porting a game, that was originally written for the Win32 API, to Linux (well, porting the OS X port of the Win32 port to Linux).
I have implemented QueryPerformanceCounter by giving the uSeconds since the process start up:
BOOL QueryPerformanceCounter(LARGE_INTEGER* performanceCount)
{
gettimeofday(&currentTimeVal, NULL);
performanceCount->QuadPart = (currentTimeVal.tv_sec - startTimeVal.tv_sec);
performanceCount->QuadPart *= (1000 * 1000);
performanceCount->QuadPart += (currentTimeVal.tv_usec - startTimeVal.tv_usec);
return true;
}
This, coupled with QueryPerformanceFrequency() giving a constant 1000000 as the frequency, works well on my machine, giving me a 64-bit variable that contains uSeconds since the program's start-up.
So is this portable? I don't want to discover it works differently if the kernel was compiled in a certain way or anything like that. I am fine with it being non-portable to something other than Linux, however.

Maybe. But you have bigger problems. gettimeofday() can result in incorrect timings if there are processes on your system that change the timer (ie, ntpd). On a "normal" linux, though, I believe the resolution of gettimeofday() is 10us. It can jump forward and backward and time, consequently, based on the processes running on your system. This effectively makes the answer to your question no.
You should look into clock_gettime(CLOCK_MONOTONIC) for timing intervals. It suffers from several less issues due to things like multi-core systems and external clock settings.
Also, look into the clock_getres() function.

High Resolution, Low Overhead Timing for Intel Processors
If you're on Intel hardware, here's how to read the CPU real-time instruction counter. It will tell you the number of CPU cycles executed since the processor was booted. This is probably the finest-grained counter you can get for performance measurement.
Note that this is the number of CPU cycles. On linux you can get the CPU speed from /proc/cpuinfo and divide to get the number of seconds. Converting this to a double is quite handy.
When I run this on my box, I get
11867927879484732
11867927879692217
it took this long to call printf: 207485
Here's the Intel developer's guide that gives tons of detail.
#include <stdio.h>
#include <stdint.h>
inline uint64_t rdtsc() {
uint32_t lo, hi;
__asm__ __volatile__ (
"xorl %%eax, %%eax\n"
"cpuid\n"
"rdtsc\n"
: "=a" (lo), "=d" (hi)
:
: "%ebx", "%ecx");
return (uint64_t)hi << 32 | lo;
}
main()
{
unsigned long long x;
unsigned long long y;
x = rdtsc();
printf("%lld\n",x);
y = rdtsc();
printf("%lld\n",y);
printf("it took this long to call printf: %lld\n",y-x);
}

#Bernard:
I have to admit, most of your example went straight over my head. It does compile, and seems to work, though. Is this safe for SMP systems or SpeedStep?
That's a good question... I think the code's ok.
From a practical standpoint, we use it in my company every day,
and we run on a pretty wide array of boxes, everything from 2-8 cores.
Of course, YMMV, etc, but it seems to be a reliable and low-overhead
(because it doesn't make a context switch into system-space) method
of timing.
Generally how it works is:
declare the block of code to be assembler (and volatile, so the
optimizer will leave it alone).
execute the CPUID instruction. In addition to getting some CPU information
(which we don't do anything with) it synchronizes the CPU's execution buffer
so that the timings aren't affected by out-of-order execution.
execute the rdtsc (read timestamp) execution. This fetches the number of
machine cycles executed since the processor was reset. This is a 64-bit
value, so with current CPU speeds it will wrap around every 194 years or so.
Interestingly, in the original Pentium reference, they note it wraps around every
5800 years or so.
the last couple of lines store the values from the registers into
the variables hi and lo, and put that into the 64-bit return value.
Specific notes:
out-of-order execution can cause incorrect results, so we execute the
"cpuid" instruction which in addition to giving you some information
about the cpu also synchronizes any out-of-order instruction execution.
Most OS's synchronize the counters on the CPUs when they start, so
the answer is good to within a couple of nano-seconds.
The hibernating comment is probably true, but in practice you
probably don't care about timings across hibernation boundaries.
regarding speedstep: Newer Intel CPUs compensate for the speed
changes and returns an adjusted count. I did a quick scan over
some of the boxes on our network and found only one box that
didn't have it: a Pentium 3 running some old database server.
(these are linux boxes, so I checked with: grep constant_tsc /proc/cpuinfo)
I'm not sure about the AMD CPUs, we're primarily an Intel shop,
although I know some of our low-level systems gurus did an
AMD evaluation.
Hope this satisfies your curiosity, it's an interesting and (IMHO)
under-studied area of programming. You know when Jeff and Joel were
talking about whether or not a programmer should know C? I was
shouting at them, "hey forget that high-level C stuff... assembler
is what you should learn if you want to know what the computer is
doing!"

You may be interested in Linux FAQ for clock_gettime(CLOCK_REALTIME)

Wine is actually using gettimeofday() to implement QueryPerformanceCounter() and it is known to make many Windows games work on Linux and Mac.
Starts http://source.winehq.org/source/dlls/kernel32/cpu.c#L312
leads to http://source.winehq.org/source/dlls/ntdll/time.c#L448

So it says microseconds explicitly, but says the resolution of the system clock is unspecified. I suppose resolution in this context means how the smallest amount it will ever be incremented?
The data structure is defined as having microseconds as a unit of measurement, but that doesn't mean that the clock or operating system is actually capable of measuring that finely.
Like other people have suggested, gettimeofday() is bad because setting the time can cause clock skew and throw off your calculation. clock_gettime(CLOCK_MONOTONIC) is what you want, and clock_getres() will tell you the precision of your clock.

The actual resolution of gettimeofday() depends on the hardware architecture. Intel processors as well as SPARC machines offer high resolution timers that measure microseconds. Other hardware architectures fall back to the system’s timer, which is typically set to 100 Hz. In such cases, the time resolution will be less accurate.
I obtained this answer from High Resolution Time Measurement and Timers, Part I

This answer mentions problems with the clock being adjusted. Both your problems guaranteeing tick units and the problems with the time being adjusted are solved in C++11 with the <chrono> library.
The clock std::chrono::steady_clock is guaranteed not to be adjusted, and furthermore it will advance at a constant rate relative to real time, so technologies like SpeedStep must not affect it.
You can get typesafe units by converting to one of the std::chrono::duration specializations, such as std::chrono::microseconds. With this type there's no ambiguity about the units used by the tick value. However, keep in mind that the clock doesn't necessarily have this resolution. You can convert a duration to attoseconds without actually having a clock that accurate.

From my experience, and from what I've read across the internet, the answer is "No," it is not guaranteed. It depends on CPU speed, operating system, flavor of Linux, etc.

Reading the RDTSC is not reliable in SMP systems, since each CPU maintains their own counter and each counter is not guaranteed to by synchronized with respect to another CPU.
I might suggest trying clock_gettime(CLOCK_REALTIME). The posix manual indicates that this should be implemented on all compliant systems. It can provide a nanosecond count, but you probably will want to check clock_getres(CLOCK_REALTIME) on your system to see what the actual resolution is.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string