Determine TSC frequency on Linux - linux

Given an x86 with a constant TSC, which is useful for measuring real time, how can one convert between the "units" of TSC reference cycles and normal human real-time units like nanoseconds using the TSC calibration factor calculated by Linux at boot-time?
That is, one can certainly calculate the TSC frequency in user-land by taking TSC and clock measurements (e.g., with CLOCK_MONOTONIC) at both ends of some interval, but Linux has already made this calculation at boot-time since it internally uses the TSC to help out with time-keeping.
For example, you can see the kernel's result with dmesg | grep tsc:
[ 0.000000] tsc: PIT calibration matches HPET. 2 loops
[ 0.000000] tsc: Detected 3191.922 MHz processor
[ 1.733060] tsc: Refined TSC clocksource calibration: 3192.007 MHz
In a worst-case scenario I guess you could try to grep the result out of dmesg at runtime, but that frankly seems terrible, fragile and all sorts of bad[0].
The advantages of using the kernel-determined calibration time are many:
You don't have to write a TSC calibration routine yourself, and you can be pretty sure the Linux one is best-of-breed.
You automatically pick up new techniques in TSC calibration as new kernels come out using your existing binary (e.g., recently chips started advertising their TSC frequency using cpuid leaf 0x15 so calibration isn't always necessary).
You don't slow down your startup with a TSC calibration.
You use the same TSC calibration value on every run of your process (at least until reboot).
Your TSC frequency is somehow "consistent" with the TSC frequency used by OS time-keeping functions such as gettimeofday and clock_gettime[1].
The kernel is able to do the TSC calibration very early at boot, in kernel mode, free from the scourges of interrupts and other processes, and it can access the underlying hardware timers directly as its calibration source.
It's not all gravy though; some downsides of using Linux's TSC calibration include:
It won't work on every Linux installation (e.g., perhaps those that don't use a tsc clocksource) or on other OSes at all, so you may still be stuck writing a fallback calibration method.
There is some reason to believe that a "recent" calibration may be more accurate than an old one, especially one taken right after boot: the crystal behavior may change, especially as temperatures change, so you may get a more accurate frequency by doing it manually close to the point where you'll use it.
[0] For example: systems may not have dmesg installed, you may not be able to run it as a regular user, the accumulated output may have wrapped around so the lines are no longer present, you may get false positives on your grep, the kernel messages are English prose and subject to change, it may be hard to launch a sub-process, etc., etc.
[1] It is somewhat debatable whether this matters - but if you are mixing rdtsc calls in with code that also uses OS time-keeping, it may increase precision.
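As a concrete illustration of the fallback path mentioned above, here is a minimal user-land calibration sketch: sample rdtsc and CLOCK_MONOTONIC at both ends of an interval and divide. The 200 ms interval and the helper name monotonic_ns are arbitrary choices for illustration; this is not the kernel's own calibration routine.

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc() */

static uint64_t monotonic_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

int main(void)
{
    /* Sample both clocks at the start and end of a sleep interval. */
    uint64_t ns0 = monotonic_ns(), tsc0 = __rdtsc();
    struct timespec wait = { .tv_sec = 0, .tv_nsec = 200 * 1000 * 1000 };
    nanosleep(&wait, NULL);                 /* longer intervals give better estimates */
    uint64_t tsc1 = __rdtsc();
    uint64_t ns1 = monotonic_ns();

    /* Reference cycles per nanosecond == GHz. */
    double ghz = (double)(tsc1 - tsc0) / (double)(ns1 - ns0);
    printf("estimated TSC frequency: %.6f GHz\n", ghz);
    return 0;
}

A longer interval, or taking the minimum over several repetitions, reduces the impact of the clock_gettime/rdtsc sampling overhead at each end.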

Related

How to measure total boot time for the Linux kernel on an Intel Rangeley board

I am working on an Intel Rangeley board. I want to measure the total time taken to boot the Linux kernel. Is there any possible and proven way to achieve this on an Intel board?
Try using rdtsc. According to the Intel insn ref manual:
The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever the processor is reset. See "Time Stamp Counter" in Chapter 17 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B, for specific details of the time stamp counter behavior.
(see the x86 tag wiki for links to manuals)
Normally the TSC is only used for relative measurements between two points in time, or as a timesource. The absolute value is apparently meaningful here, since the counter starts from zero at processor reset. It ticks at the CPU's rated clock speed, regardless of the power-saving clock speed it's actually running at.
You might need to make sure you read the TSC from the boot CPU on a multicore system. The other cores might not have started their TSCs until Linux sent them an inter-processor interrupt to start them up. Linux might sync their TSCs to the boot CPU's TSC, since gettimeofday() does use the TSC. IDK, I'm just writing down stuff I'd be sure to check on if I wanted to do this myself.
You may need to take precautions to avoid having the kernel modify the TSC when using it as a timesource. Probably via a boot option that forces Linux to use a different timesource.

Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?

I want to get the CPU cycles at a specific point. I use this function at that point:
static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    // broken for 64-bit builds; don't copy this code
    return x;
}
(editor's note: "=A" is wrong for x86-64; it picks either RDX or RAX. Only in 32-bit mode will it pick the EDX:EAX output you want. See How to get the CPU cycle count in x86_64 from C++?.)
The problem is that it returns always an increasing number (in every run). It's as if it is referring to the absolute time.
Am I using the functions incorrectly?
As long as your thread stays on the same CPU core, the RDTSC instruction will keep returning an increasing number until it wraps around. For a 2GHz CPU, this happens after 292 years, so it is not a real issue. You probably won't see it happen. If you expect to live that long, make sure your computer reboots, say, every 50 years.
The problem with RDTSC is that you have no guarantee that it starts at the same point in time on all cores of an elderly multicore CPU and no guarantee that it starts at the same point in time on all CPUs on an elderly multi-CPU board.
Modern systems usually do not have such problems, but the problem can also be worked around on older systems by setting a thread's affinity so it only runs on one CPU. This is not good for application performance, so one should not generally do it, but for measuring ticks, it's just fine.
(Another "problem" is that many people use RDTSC for measuring time, which is not what it does, but you wrote that you want CPU cycles, so that is fine. If you do use RDTSC to measure time, you may have surprises when power saving or hyperboost or whatever the multitude of frequency-changing techniques are called kicks in. For actual time, the clock_gettime syscall is surprisingly good under Linux.)
I would just write rdtsc inside the asm statement, which works just fine for me and is more readable than some obscure hex code. Assuming it's the correct hex code (and since it doesn't crash and returns an ever-increasing number, it seems so), your code is good.
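For reference, a sketch of a 64-bit-safe version that writes the mnemonic in the asm statement and combines the two 32-bit halves explicitly instead of relying on "=A" (illustrative only):

#include <stdint.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    /* rdtsc places the low 32 bits of the counter in EAX and the high 32 bits in EDX */
    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}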
If you want to measure the number of ticks a piece of code takes, you want a tick difference; just subtract two values of the ever-increasing counter. Something like uint64_t t0 = rdtsc(); ... uint64_t t1 = rdtsc() - t0;
Note that if very accurate measurements isolated from surrounding code are necessary, you need to serialize, that is, stall the pipeline, prior to calling rdtsc (or use rdtscp, which is only supported on newer processors). The one serializing instruction that can be used at every privilege level is cpuid.
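One common pattern along those lines, sketched here with compiler intrinsics: serialize with cpuid before the first read and use rdtscp followed by lfence for the second. The function names tsc_begin/tsc_end are made up for the example.

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence */

/* cpuid is a full serializing instruction: nothing issued before it can
   still be in flight when the following rdtsc executes. */
static inline uint64_t tsc_begin(void)
{
    unsigned int a, b, c, d;
    __asm__ volatile ("cpuid" : "=a" (a), "=b" (b), "=c" (c), "=d" (d) : "a" (0));
    return __rdtsc();
}

/* rdtscp waits for all earlier instructions to complete before reading the
   counter; the trailing lfence keeps later instructions from starting early. */
static inline uint64_t tsc_end(void)
{
    unsigned int aux;          /* receives IA32_TSC_AUX (identifies the core) */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

Usage is then uint64_t t0 = tsc_begin(); /* code under test */ uint64_t ticks = tsc_end() - t0;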
In reply to the further question in the comment:
The TSC starts at zero when you turn on the computer (and the BIOS resets all counters on all CPUs to the same value, though some BIOSes a few years ago did not do so reliably).
Thus, from your program's point of view, the counter started "some unknown time in the past", and it always increases with every clock tick the CPU sees. Therefore if you execute the instruction returning that counter now and any time later in a different process, it will return a greater value (unless the CPU was suspended or turned off in between). Different runs of the same program get bigger numbers, because the counter keeps growing. Always.
Now, clock_gettime(CLOCK_PROCESS_CPUTIME_ID) is a different matter. This is the CPU time that the OS has given to the process. It starts at zero when your process starts. A new process starts at zero, too. Thus, two processes running after each other will get very similar or identical numbers, not ever growing ones.
clock_gettime(CLOCK_MONOTONIC_RAW) is closer to how RDTSC works (and on some older systems is implemented with it). It returns a value that ever increases. Nowadays, this is typically a HPET. However, this is really time, and not ticks. If your computer goes into low power state (e.g. running at 1/2 normal frequency), it will still advance at the same pace.
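To make the distinction concrete, a small sketch querying both clocks (Linux-specific; CLOCK_MONOTONIC_RAW is not available everywhere):

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec cpu, mono;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu);  /* CPU time used by this process, starts near 0 */
    clock_gettime(CLOCK_MONOTONIC_RAW, &mono);      /* ever-increasing, not slewed by NTP */
    printf("process CPU time: %ld.%09ld s\n", (long)cpu.tv_sec, cpu.tv_nsec);
    printf("monotonic raw:    %ld.%09ld s\n", (long)mono.tv_sec, mono.tv_nsec);
    return 0;
}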
There's lots of confusing and/or wrong information about the TSC out there, so I thought I'd try to clear some of it up.
When Intel first introduced the TSC (in original Pentium CPUs) it was clearly documented to count cycles (and not time). However, back then CPUs mostly ran at a fixed frequency, so some people ignored the documented behaviour and used it to measure time instead (most notably, Linux kernel developers). Their code broke in later CPUs that don't run at a fixed frequency (due to power management, etc). Around that time other CPU manufacturers (AMD, Cyrix, Transmeta, etc) were confused and some implemented TSC to measure cycles and some implemented it so it measured time, and some made it configurable (via an MSR).
Then "multi-chip" systems became more common for servers; and even later multi-core was introduced. This led to minor differences between TSC values on different cores (due to different startup times); but more importantly it also led to major differences between TSC values on different CPUs caused by CPUs running at different speeds (due to power management and/or other factors).
People that were trying to use it wrong from the start (people who used it to measure time and not cycles) complained a lot, and eventually convinced CPU manufacturers to standardise on making the TSC measure time and not cycles.
Of course this was a mess - e.g. it takes a lot of code just to determine what the TSC actually measures if you support all 80x86 CPUs; and different power management technologies (including things like SpeedStep, but also things like sleep states) may affect the TSC in different ways on different CPUs; so AMD introduced a "TSC invariant" flag in CPUID to tell the OS that the TSC can be used to measure time correctly.
All recent Intel and AMD CPUs have been like this for a while now - the TSC counts time and doesn't measure cycles at all. This means that if you want to measure cycles you have to use the (model-specific) performance monitoring counters. Unfortunately the performance monitoring counters are an even worse mess (due to their model-specific nature and convoluted configuration).
Good answers already, and Damon already mentioned this in a way in his answer, but I'll add this from the actual x86 manual (volume 2, page 4-301) entry for RDTSC:
Loads the current value of the processor's time-stamp counter (a 64-bit MSR) into the EDX:EAX registers. The EDX register is loaded with the high-order 32 bits of the MSR and the EAX register is loaded with the low-order 32 bits. (On processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are cleared.)
The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever the processor is reset. See "Time Stamp Counter" in Chapter 17 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B, for specific details of the time stamp counter behavior.

Time Stamp Counter

I am using the time stamp counter in my C++ program by querying the register. However, one problem I encounter is that the function to acquire the time stamp may read it from a different CPU. How can I ensure that my function always acquires the timestamp from the same CPU, or is there any way to synchronize the CPUs? By the way, my program is running on a 4-core server under Fedora 13 64-bit.
Thanks.
Look at the following excerpt from the Intel manual. According to section 16.12, I think the "newer processors" below refers to any processor newer than Pentium 4. You can simultaneously and atomically determine the TSC value and the core ID using the rdtscp instruction if it is supported. I haven't tried it though. Good luck.
Intel 64 and IA-32 Architectures Software Developer's Manual
Volume 3 (3A & 3B): System Programming Guide:
Chapter 16.12.1 Invariant TSC
The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor's support for invariant TSC is indicated by CPUID.80000007H:EDX[8].
The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource.
Intel also has a guide on code execution benchmarking that discusses cpu association with rdtsc - http://download.intel.com/embedded/software/IA/324264.pdf
In my experience, it is wise to avoid TSC altogether, unless you really want to measure individual clock cycles on individual cores/CPUs.
Potential problems with TSC:
Frequency scaling. Counter does not increment linearly with time...
Different clocks on different CPUs/cores (I would not rule out different frequency scaling on different CPUs, or even differently clocked CPUs - though the latter should be rare).
Unsynchronized counters on different CPUs/cores (even if they use the same frequency).
This basically boils down to the fact that you can only use the TSC to measure elapsed CPU cycles (not elapsed time) on a single CPU in a single-threaded application, and only if you force the affinity for the thread.
The preferred alternative is to use system functions. The most portable (on Unix/Mac) is gettimeofday(), which is usually very accurate. A more appropriate function might be clock_gettime(), but check if it is supported on your system first. Under Windows you can safely use QueryPerformanceCounter().
You can use sched_setaffinity() or the cpuset feature, which lets you create a cpuset and assign tasks to the set.
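A minimal sketch of pinning the calling thread with sched_setaffinity (the CPU number passed in is just an example):

#define _GNU_SOURCE          /* for cpu_set_t, CPU_* macros, sched_setaffinity */
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a single CPU so every TSC read comes from the same core. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}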

rdtsc accuracy across CPU cores

I am sending network packets from one thread and receiving replies on a 2nd thread that runs on a different CPU core. My process measures the time between send & receive of each packet (similar to ping). I am using rdtsc for getting high-resolution, low-overhead timing, which is needed by my implementation.
All measurements look reliable. Still, I am worried about rdtsc accuracy across cores, since I've been reading some texts which implied that the TSC is not synced between cores.
I found the following info about the TSC on Wikipedia:
Constant TSC behavior ensures that the duration of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core changes frequency. This is the architectural behavior moving forward for all Intel processors.
Still I am worried about accuracy across cores, and this is my question.
More Info
I run my process on an Intel Nehalem machine.
Operating System is Linux.
The "constant_tsc" cpu flag is set for all the cores.
The X86_FEATURE_CONSTANT_TSC and X86_FEATURE_NONSTOP_TSC bits are set in cpuid (CPUID leaf 0x80000007, EDX bit 8; check the unsynchronized_tsc function of the Linux kernel for more checks).
Intel's Software Developer's Manual vol. 3B, section 16.11.1 Invariant TSC, says the following:
"16.11.1 Invariant TSC
The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor's support for invariant TSC is indicated by CPUID.80000007H:EDX[8].
The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource."
So, if the TSC can be used for wall clock services, the counters are guaranteed to be in sync.
In fact, it seems that cores don't share a TSC; check this thread:
http://software.intel.com/en-us/forums/topic/388964
Summarizing: different cores do not share a TSC, and sometimes the TSCs can get out of synchronization if a core changes to a specific energy state, but it depends on the kind of CPU, so you need to check the Intel documentation. It seems that most operating systems synchronize the TSCs at boot.
I checked the differences between the TSCs on different cores, using an exciter-reactor algorithm, on a Linux Debian machine with a Core i5 processor. The exciter process (on one core) wrote its TSC to a shared variable; when the reactor process detected a change in that variable, it compared the published value with its own TSC. This is an example output of my test program:
TSC ping-pong test result:
TSC cores (exciter-reactor): 0-1
100 records, avrg: 159, range: 105-269
Dispersion: 13
TSC ping-pong test result:
TSC cores (exciter-reactor): 1-0
100 records, avrg: 167, range: 125-410
Dispersion: 13
The reaction time when the exciter CPU is 0 (159 ticks on average) is almost the same as when the exciter CPU is 1 (167 ticks). This indicates that they are pretty well synchronized (perhaps with a few ticks of difference). On other core pairs, results were very similar.
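For reference, here is a rough sketch of the kind of exciter-reactor test described above. It is not the author's original program; the core numbers, the shared-variable layout, and the sleep length are illustrative assumptions. Compile with -pthread.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>

static _Atomic uint64_t shared_tsc = 0;

static void pin_self(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *exciter(void *arg)
{
    (void)arg;
    pin_self(0);                                /* assumption: exciter on core 0 */
    struct timespec d = { 0, 10 * 1000 * 1000 };
    nanosleep(&d, NULL);                        /* let the reactor start spinning */
    atomic_store(&shared_tsc, __rdtsc());       /* publish this core's TSC */
    return NULL;
}

static void *reactor(void *arg)
{
    (void)arg;
    pin_self(1);                                /* assumption: reactor on core 1 */
    uint64_t seen;
    while ((seen = atomic_load(&shared_tsc)) == 0)
        ;                                       /* spin until the store is visible */
    uint64_t delta = __rdtsc() - seen;          /* "reaction time" in ticks */
    printf("reaction time: %llu ticks\n", (unsigned long long)delta);
    return NULL;
}

int main(void)
{
    pthread_t r, e;
    pthread_create(&r, NULL, reactor, NULL);
    pthread_create(&e, NULL, exciter, NULL);
    pthread_join(e, NULL);
    pthread_join(r, NULL);
    return 0;
}

If the TSCs were badly out of sync, the measured delta would come out implausibly large (or wrap around), rather than settling near the cache-line transfer latency as in the output above.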
On the other hand, the rdtscp assembly instruction returns a value indicating the CPU on which the TSC was read. It is not your case, but it can be useful when you want to measure time in a simple code segment and you want to ensure that the process was not moved to another CPU in the middle of the code.
On recent processors you can do it between separate cores of the same package (i.e. a system with just one Core iX processor); you just can't do it across separate packages (processors), because they won't share the reference clock. You can get away with it via CPU affinity (locking relevant threads to specific cores), but then again it would depend on the way your application behaves.
On Linux you can check for constant_tsc in /proc/cpuinfo (e.g., grep constant_tsc /proc/cpuinfo) in order to see if the processor has a single TSC valid for the entire package. The raw register is CPUID.80000007H:EDX[8].
What I read around, but have not yet confirmed programmatically, is that AMD CPUs from revision 11h onwards have the same meaning for this cpuid bit.
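If you want to check that bit from code rather than /proc/cpuinfo, a small sketch using GCC/Clang's cpuid.h (a hypothetical example, not part of the original answer):

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    /* CPUID.80000007H:EDX[8] is the invariant-TSC bit */
    if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx) && (edx & (1u << 8)))
        puts("invariant TSC supported");
    else
        puts("invariant TSC not reported");
    return 0;
}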
On Linux you can use clock_gettime(3) with CLOCK_MONOTONIC_RAW, which gives you nanosecond resolution and is not subject to NTP adjustments (if any happened).
You can set thread affinity using the sched_setaffinity() API in order to run your thread on one CPU core.
I recommend that you don't use rdtsc. Not only is it not portable, it's not reliable and generally won't work - on some systems the TSC does not update uniformly (e.g. if you're using SpeedStep etc). If you want accurate timing information you should set the SO_TIMESTAMP option on the socket and use recvmsg() to get the message with a (microsecond resolution) timestamp.
Moreover, the timestamp you get with SO_TIMESTAMP actually IS the time the kernel got the packet, not when your task happened to notice.
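A hedged sketch of the SO_TIMESTAMP approach: enable the option once on the socket, then pull the kernel receive time out of the ancillary data returned by recvmsg(). The names enable_rx_timestamps and recv_with_timestamp are made up for the example.

#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Call once after creating the socket. */
static int enable_rx_timestamps(int sock)
{
    int on = 1;
    return setsockopt(sock, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on));
}

/* Receive one datagram and return the kernel's receive time in *stamp. */
static ssize_t recv_with_timestamp(int sock, void *buf, size_t len, struct timeval *stamp)
{
    char ctrl[CMSG_SPACE(sizeof(struct timeval))];
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
    };

    ssize_t n = recvmsg(sock, &msg, 0);
    if (n < 0)
        return n;

    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_TIMESTAMP)
            memcpy(stamp, CMSG_DATA(c), sizeof(*stamp));  /* kernel rx time, struct timeval */
    }
    return n;
}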

Time Stamp counter (TSC) when switching between Kernel & User mode

I am wondering if somebody knows some more details about the time stamp counter in Linux when a context switch occurs? Until now I had the opinion that the TSC value just increases by 1 during each clock cycle, independent of whether it is in kernel or in user mode. I measured the performance of an application using the TSC, which yielded a result of 5 million clock cycles. Then I made some changes to the scheduler, which means that a context switch takes considerably longer, e.g. 2 million cycles instead of 500,000 cycles. The funny bit is that when measuring the performance of the original application again, it still takes 5 million cycles... So I am wondering why it did not take considerably longer, as a context switch now takes almost 2 million clock cycles more? (And at least 3 context switches occur during execution of the application.)
Is the time stamp counter somehow deactivated during kernel mode? Or is the content of the TSC saved during context switches? Thanks if someone could point out to me what could be the problem!
As you can read on Wikipedia
With the advent of multi-core/hyperthreaded CPUs, systems with multiple CPUs, and "hibernating" operating systems, the TSC cannot be relied on to provide accurate results. The issue has two components: rate of tick and whether all cores (processors) have identical values in their time-keeping registers. There is no promise that the timestamp counters of multiple CPUs on a single motherboard will be synchronized. In such cases, programmers can only get reliable results by locking their code to a single CPU. Even then, the CPU speed may change due to power-saving measures taken by the OS or BIOS, or the system may be hibernated and later resumed (resetting the time stamp counter). Reliance on the time stamp counter also reduces portability, as other processors may not have a similar feature. Recent Intel processors include a constant rate TSC (identified by the constant_tsc flag in Linux's /proc/cpuinfo). With these processors the TSC reads at the processor's maximum rate regardless of the actual CPU running rate. While this makes time keeping more consistent, it can skew benchmarks, where a certain amount of spin-up time is spent at a lower clock rate before the OS switches the processor to the higher rate. This has the effect of making things seem like they require more processor cycles than they normally would.
I believe the TSC is actually a hardware construct of the processor you're using. I.e., reading the TSC actually uses the RDTSC processor opcode. I don't even think there's a way for the OS to alter its value; it just increases with each tick since the last power reset.
Regarding your modifications to the scheduler, is it possible that you're using a multi-core processor in a way that the OS is not switching out your running process? You might put a call to sched_yield() or sleep(0) in your program to see if your scheduler changes start taking effect.
