Outliers during Performance Evaluation - linux

I am trying to do some performance measurements using Intel's RDTSC, and the
variation I get between test runs is quite odd. In most cases my benchmark in C
needs 3000000 Mio cycles; however, exactly the same execution can in some cases take
5000000, almost double as much. I tried to have no intense workloads running in parallel
so that I get good performance estimations. Does anyone have an idea where these huge timing
variations can come from? I know that interrupts and the like can happen, but I did not expect
such huge variations in timing!
PS.: I am running it on a Pentium processor with Linux running on it.
Thanks for feedback,
John

I think the answer is in:
"I tried to have no intense workloads running in parallel"
You have insufficient control over this in a modern OS.

According to this Wikipedia article, the RDTSC (time stamp counter) cannot be used reliably for benchmarking on multi-core systems. There is no promise that all cores have the same value in the time stamp register.
On Linux, it is better to use the POSIX clock_gettime function.
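As a minimal sketch of the clock_gettime approach (not the asker's actual benchmark), the following times a stand-in workload with CLOCK_MONOTONIC. The names now_ns, workload and time_workload_ns are hypothetical and only serve the illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC reading, converted to nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical stand-in for the code being benchmarked:
 * sums the integers 0 .. n-1. */
static uint64_t workload(uint64_t n)
{
    volatile uint64_t sum = 0;   /* volatile keeps the loop from being removed */
    for (uint64_t i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Elapsed time of one call to workload(n), in nanoseconds. */
static uint64_t time_workload_ns(uint64_t n)
{
    uint64_t start = now_ns();
    workload(n);
    return now_ns() - start;
}
```

Calling time_workload_ns several times with the same n and comparing the results shows how large the run-to-run variation is on a given machine. On older glibc versions you may need to link with -lrt.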

You have to take the cache of most modern processors into account. Maybe another process evicts your program's cache content in the case where you measured the long running time.
As Henk pointed out, a lot happens in a modern OS that you can't control.

Related

Linux: CPU benchmark requiring longer time and different CPU utilization levels

For my research I need a CPU benchmark to do some experiments on my Ubuntu laptop (Ubuntu 15.10, Memory 7.7 GiB, Intel Core i7-4500U CPU @ 1.80GHz x 4, 64-bit). In an ideal world, I would like to have a benchmark satisfying the following:
1. The benchmark should be an official one rather than one I created myself, for transparency purposes.
2. The time needed to execute the benchmark on my laptop should be at least 5 minutes (the more the better).
3. The benchmark should result in different levels of CPU utilization throughout execution. For example, I don't want a benchmark which permanently keeps the CPU utilization level at around 100%; I want one that makes the CPU utilization vary over time.
Especially points 2 and 3 are really key for my research. However, I couldn't find any suitable benchmarks so far. Benchmarks I found so far include: sysbench, CPU Fibonacci, CPU Blowfish, CPU Cryptofish, CPU N-Queens. However, all of them just need a couple of seconds to complete and the utilization level on my laptop is at 100% constantly.
Question: Does anyone know about a suitable benchmark for me? I am also happy to hear any other comments/questions you have. Thank you!
To choose a benchmark, you need to know exactly what you're trying to measure. Your question doesn't include that, so there's not much anyone can tell you without taking a wild guess.
If you're trying to measure how well Turbo clock speed works to make a power-limited CPU like your laptop run faster for bursty workloads (e.g. to compare Haswell against Skylake's new and improved power management), you could just run something trivial that's 1 second on, 2 seconds off, and count how many loop iterations it manages.
The duty cycle and cycle length should be benchmark parameters, so you can make plots. e.g. with very fast on/off cycles, Skylake's faster-reacting Turbo will ramp up faster and drop down to min power faster (leaving more headroom in the bank for the next burst).
The speaker in that talk (the lead architect for power management on Intel CPUs) says that Javascript benchmarks are actually bursty enough for Skylake's power management to give a measurable speedup, unlike most other benchmarks which just peg the CPU at 100% the whole time. So maybe have a look at Javascript benchmarks, if you want to use well-known off-the-shelf benchmarks.
If rolling your own, put a loop-carried dependency chain in the loop, preferably with something that's not too variable in latency across microarchitectures. A long chain of integer adds would work, and Fibonacci is a good way to stop the compiler from optimizing it away. Either pick a fixed iteration count that works well for current CPU speeds, or check the clock every 10M iterations.
Or set a timer that will fire after some time, and have it set a flag that you check inside the loop. (e.g. from a signal handler). Specifically, alarm(2) may be a good choice. Record how many iterations you did in this burst of work.
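A rough sketch of that idea, under my own assumptions: a Fibonacci-style chain of dependent integer adds runs until a SIGALRM handler, armed with alarm(2), sets a flag. The function names run_burst and run_duty_cycle are made up for this example, and the on/off times are whole seconds only because alarm() takes seconds.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdint.h>
#include <unistd.h>

static volatile sig_atomic_t burst_over;

/* SIGALRM handler: only sets the flag that the work loop polls. */
static void on_alarm(int signo)
{
    (void)signo;
    burst_over = 1;
}

/* Run a loop-carried chain of integer adds (Fibonacci, wrapping mod 2^64)
 * for on_seconds and return how many iterations completed.
 * on_seconds must be >= 1, because alarm(0) only cancels a pending alarm. */
static uint64_t run_burst(unsigned on_seconds, uint64_t *fib_out)
{
    if (on_seconds == 0)
        return 0;

    struct sigaction sa = {0};
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);

    uint64_t a = 0, b = 1, iterations = 0;
    burst_over = 0;
    alarm(on_seconds);
    while (!burst_over) {
        uint64_t next = a + b;   /* each add depends on the previous one */
        a = b;
        b = next;
        iterations++;
    }
    *fib_out = b;   /* publish the result so the chain cannot be optimized away */
    return iterations;
}

/* Repeat `cycles` bursts of on_seconds of work, each followed by
 * off_seconds of sleep, storing every burst's iteration count in counts[]. */
static void run_duty_cycle(unsigned on_seconds, unsigned off_seconds,
                           unsigned cycles, uint64_t *counts)
{
    uint64_t sink;
    for (unsigned i = 0; i < cycles; i++) {
        counts[i] = run_burst(on_seconds, &sink);
        sleep(off_seconds);
    }
}
```

For the "1 second on, 2 seconds off" pattern mentioned above, you would call run_duty_cycle(1, 2, cycles, counts) and plot counts[] per burst. Sub-second duty cycles would need a finer timer such as setitimer(2) instead of alarm().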

Moving threads across CPUs with clock_gettime(CLOCK_MONOTONIC)

I've heard people complain that the WinAPI functions QueryPerformanceFrequency() and QueryPerformanceCounter() can behave erratically and unstably when the OS decides to move the calling thread to a new physical CPU.
Does anybody know if clock_gettime(CLOCK_MONOTONIC) suffers from similar issues? Or is it more guaranteed to be stable?
Also, are the worries about QPF/QPC on WinAPI just a thing of the past? Or are they still concerns even today?
OP:
I've heard people complain that the WinAPI functions QueryPerformanceFrequency() and QueryPerformanceCounter() can behave erratically and unstably when the OS decides to move the calling thread to a new physical CPU.
I'm not sure what you mean by "erratic" or "unstable." If you mean drift, or variance between cores, these concerns are probably based on computers that shipped 12-15 years ago (XP and Win2000-based OS). From Microsoft:
QPC is available on Windows XP and Windows 2000 and works well on most systems. However, some hardware systems' BIOS didn't indicate the hardware CPU characteristics correctly (a non-invariant TSC), and some multi-core or multi-processor systems used processors with TSCs that couldn't be synchronized across cores. Systems with flawed firmware that run these versions of Windows might not provide the same QPC reading on different cores if they used the TSC as the basis for QPC.
That has pretty much become a non-issue for most current hardware (see link, below).
OP:
Does anybody know if clock_gettime(CLOCK_MONOTONIC) suffers from similar issues? Or is it more guaranteed to be stable?
Well, any high-frequency clock has to come from somewhere. On a Windows box (since you asked about QPC), where is that value going to come from? Adding an additional layer to any call to QueryPerformanceCounter essentially guarantees less precision, as there will be more CPU instructions in the mix between "now" and "now as reported back to you by the OS" (and, although extremely unlikely, there would be a small increase in the possibility of preemption, contributing further loss of precision).
This applies equally to any Intel-based Linux/BSD/whatever boxes, as they have to run with the same hardware characteristics. In the Intel-architecture world, the highest frequency you'll be able to get is going to be a value based on RDTSC and the operating system's best effort to keep any TSC values as close as possible across cores or processors (at least in the absence of specialized hardware).
This is why any benchmarking should be an average performance expectation based on a large number of data points.
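To make "a large number of data points" concrete, here is a small sketch (on the Linux side, using clock_gettime) that times a function many times and summarizes the samples. The helper names collect_samples, mean_u64, median_u64 and busy_work are hypothetical. Reporting the median next to the mean is my own suggestion, since a few outliers can pull the mean noticeably.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Time fn() n times; store each duration in nanoseconds in out[]. */
static void collect_samples(void (*fn)(void), uint64_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fn();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        int64_t ns = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000
                   + (int64_t)(t1.tv_nsec - t0.tv_nsec);
        out[i] = (uint64_t)ns;
    }
}

/* Arithmetic mean of n samples (n must be > 0). */
static double mean_u64(const uint64_t *samples, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (double)samples[i];
    return sum / (double)n;
}

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Median of n samples (n must be > 0); sorts the array in place.
 * For even n, returns the midpoint of the two middle values, rounded down. */
static uint64_t median_u64(uint64_t *samples, size_t n)
{
    qsort(samples, n, sizeof(samples[0]), cmp_u64);
    if (n % 2)
        return samples[n / 2];
    uint64_t lo = samples[n / 2 - 1], hi = samples[n / 2];
    return lo + (hi - lo) / 2;
}

/* Hypothetical stand-in for the code under test. */
static void busy_work(void)
{
    volatile unsigned sink = 0;
    for (unsigned i = 0; i < 100000; i++)
        sink += i;
}
```

A typical use would be to fill an array with collect_samples(busy_work, samples, n) and then compare mean_u64 with median_u64: a mean well above the median hints at a handful of slow outlier runs.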
Microsoft actually has a pretty good document outlining the implementation characteristics of high-frequency timers, including HPET and ACPI power management timers, and touches briefly on multi-clock and virtualization: http://msdn.microsoft.com/en-us/library/windows/desktop/dn553408(v=vs.85).aspx.
OP:
Also, are the worries about QPF/QPC on WinAPI just a thing of the past? Or are they still concerns even today?
For most of the world, yes. I work on server-based code where microseconds count (high frequency market data), but most people who think a few hundred microseconds are going to make or break their program are kidding themselves. Do you know how long it takes to serve a web page? The CPU itself gets bored with all that waiting, even with thousands of users.

What is the lowest time resolution for somewhat accurate measurements of cpu usage?

Some of the things I want to measure are very short, and I can only repeat them so many times if I don't run any of the setup/dispose code in the middle.
Note: on Linux, reading /proc/stat
Not very portable and you'll have to take great care so it is reliable, but the Time Stamp Counter definitely has the highest resolution available (increases at every CPU tick).
The time stamp counter has, until recently, been an excellent high-resolution, low-overhead way of getting CPU timing information. With the advent of multi-core/hyperthreaded CPUs, systems with multiple CPUs, and "hibernating" operating systems, the TSC cannot be relied on to provide accurate results - unless great care is taken to correct the possible flaws: rate of tick and whether all cores (processors) have identical values in their time-keeping registers. There is no promise that the timestamp counters of multiple CPUs on a single motherboard will be synchronized. In such cases, programmers can only get reliable results by locking their code to a single CPU. Even then, the CPU speed may change due to power-saving measures taken by the OS or BIOS, or the system may be hibernated and later resumed (resetting the time stamp counter). In those latter cases, to stay relevant, the counter must be recalibrated periodically (according to the time resolution your application requires).
The page also has some notes about Linux-specific solutions:
Under Linux, similar functionality is provided by reading the value of CLOCK_MONOTONIC clock using POSIX clock_gettime function.
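The "locking their code to a single CPU" part of the quote can be sketched on Linux with sched_setaffinity. This is a glibc/Linux-specific illustration rather than anything the quoted page prescribes, and pin_to_first_allowed_cpu is a name I made up.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for sched_setaffinity and the CPU_* macros (glibc) */
#endif
#include <sched.h>

/* Pin the calling thread to the first CPU it is currently allowed to run on.
 * Returns that CPU's index, or -1 on failure. */
static int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (sched_setaffinity(0, sizeof(one), &one) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;
}
```

Call it once before taking any timestamps so that every counter read in the measurement comes from the same core. It does not address frequency changes or hibernation, which the quote lists as separate problems.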

Highly concurrent multi-threaded application requires hardware

I am looking for hardware that must run about 256 computationally intensive real-time concurrent tasks around the clock (one multi-threaded C application). Each task takes about 40-50 MFLOPS, so all tasks together require about 10 GFLOPS. CPU-RAM speed is insignificant. All tasks must be managed by a Linux kernel (32-bit, with SMP).
I am looking for a one-mainboard solution with one multi-core CPU (if such a CPU exists). If no such CPU exists, then I need a single multi-socket mainboard solution (with multiple CPUs).
Can you please recommend any professional CPU/mainboard solution which will satisfy these requirements? It is also very important that there are no issues with the Linux kernel (2.6.25). No virtualization, no need for huge RAM or CPU cache. I would also prefer Intel architecture and well-proven stability. I still have doubts that it is feasible at all.
Thank you in advance.
UPDATE:
I think I have found a right answer here and here.
UltraSPARC T2 has 8 cores with 8 threads each. Integrated high-bandwidth memory and IO. The T5140 carries two of them for 128 hardware threads.
The theoretical max raw performance of the 8 floating point units is 11 Giga flops per second (GFlops/s). A huge advantage over other implementations however is that 64 threads can share the units and thus we can achieve an extremely high percentage of theoretical peak. Our experiments have achieved nearly 90% of the 11 Gflop/s. - (http://blogs.oracle.com/deniss/entry/floating_point_performance_on_the)
Rent some Amazon EC2 nodes.
Updated: How about PS3s then? NASA uses them for their simulation engines.
Maybe use CPU+GPU's in commercial servers?
Build it around FPGAs: nowadays, some variants include processors that can run Linux.
Even though you've given us the specs you think you need, we might be able to help you out better if you tell us what the application is intended to accomplish, and how it was implemented.
There may be a better way to split the work up or deal with it rather than your current solution.
Not Intel architecture but these run linux and have 64 cores on a single die.
TILEPro64
Get a bunch of four- or eight-core machines and split the processing across the machines using some sort of grid or clustering software. Maybe have a look at Beowulf.
As you mentioned, 10 GFLOPS isn't exactly to be sneezed at, so in a single machine it'll be expensive. There's also the problem of what to do when the machine breaks: you're unlikely to have a second machine of similar spec available. If you build a cluster using commodity hardware, you're a little more resilient and it's easier to find replacement machines.
MFLOPS and GFLOPS are very poor indicators of how well a program can run on any given CPU. These days, cache footprint is much more important; perhaps branch prediction accuracy as well.
There's almost no way to gauge performance of a given application on different architectures without actually giving it a spin. And even then, you may not get a good idea if you were unlucky enough to unknowingly build with compiler options that ruined your cache footprint, or used a bad threading library, or any of a hundred other things.
I see you'd prefer Intel, but if you need one chip, I will again suggest the Cell processor -
its theoretical peak performance is around 25 GFLOPS - kernel 2.6.25 had support for it already.
You could try a pre-slim PlayStation 3 for experimenting with (that would cost you little) or get yourself a server-based solution at around US$8K - you will have to rewrite and fine-tune your threads to take advantage of the SPU co-processors there, but you could achieve your computational needs without breaking a sweat with a single Cell (1 PPC core + 8 SPUs)
NB: with a PlayStation 3, you'd have only 6 available co-processors - but you don't seem to be on a budget with this project -
So you could at least try IBM's Cell developer kit, which offers an emulator, to see if you can code your solution to run on it.
There are commercially available Cell products, both as stand-alone servers in blade form factor and as PCI Express add-on boards for PC workstations, from Mercury Computer Systems:
http://www.mc.com/microsites/cell/products.aspx?id=6986
Mercury does not list any prices on the site, but the pricing seems to be around the previously mentioned US$8000.00 for these PCI Express cards.
A PlayStation 3 console can be purchased for about US$300.00 and would allow you to prototype your application and check whether it is up to the needed performance. (I got one myself and have Fedora 9 running on it, although I did that as a hobbyist and have not, so far, used it for any calculations. I also put together a 12-machine PlayStation 3 cluster for molecular simulations at the local university. The application they ran did not take advantage of the multimedia SPUs while I was in touch with them. But even so, clocked at 3.2 GHz, they performed better than standard, similarly priced PCs, even considering PS3s are priced 5x higher around here.)

How long does a context switch take in Linux?

I'm curious how many cycles it takes to change contexts in Linux. I'm specifically using an E5405 Xeon (x64), but I'd love to see how it compares to other platforms as well.
There's a free app called LMBench written by Larry McVoy and friends. It provides a bunch of OS & HW benchmarks.
One of the tests is called lat_ctx and it measures context switch latencies.
Google for lmbench and check for yourself on your own HW. It's the only way to get a number meaningful to you.
Gilad
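If you just want a rough feel for the number before installing anything, a pipe ping-pong between two processes gives a ballpark figure. This is only a sketch in the spirit of such tests, not LMBench's actual code, and pingpong_ns_per_round_trip is a made-up name. The result includes pipe and system call overhead. On a multi-core machine the two processes may also run on different cores, in which case you are measuring wake-up latency rather than a switch on one core.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Bounce one byte between parent and child over two pipes.
 * Returns the average nanoseconds per round trip, or -1.0 on error. */
static double pingpong_ns_per_round_trip(int rounds)
{
    int to_child[2], to_parent[2];
    char byte = 'x';

    if (rounds <= 0 || pipe(to_child) != 0 || pipe(to_parent) != 0)
        return -1.0;

    pid_t pid = fork();
    if (pid < 0)
        return -1.0;
    if (pid == 0) {                      /* child: echo every byte back */
        close(to_child[1]);
        close(to_parent[0]);
        while (read(to_child[0], &byte, 1) == 1)
            if (write(to_parent[1], &byte, 1) != 1)
                break;
        _exit(0);
    }

    close(to_child[0]);
    close(to_parent[1]);

    struct timespec t0, t1;
    int ok = 1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < rounds && ok; i++)
        ok = write(to_child[1], &byte, 1) == 1 &&
             read(to_parent[0], &byte, 1) == 1;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(to_child[1]);                  /* EOF makes the child exit */
    close(to_parent[0]);
    waitpid(pid, NULL, 0);
    if (!ok)
        return -1.0;

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / rounds;
}
```

When both processes share one core, each round trip contains at least two context switches, so half the returned value is a loose upper bound on the cost of a single switch.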
Run vmstat on your machine while doing something that requires heavy context switching. It doesn't tell you how long the actual switch takes, but it will tell you how many switches you do per second.
Then you have to estimate how much of each timeslice is spent running actual code, compared to switching context. Maybe 100:1 or something? I don't know. 1000:1?
A machine of mine is now doing roughly 3000 switches per second, i.e. about 0.3 ms per timeslice. With a ratio of 100:1 that would mean the actual switch takes 0.003 ms.
But with multiple cores, threads yielding execution, etc., I wouldn't draw any conclusion from such a guess :)
I've written code that's able to echo (small) UDP packets at 200k packets per second.
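That suggests that it's possible to context switch in not more than 2.5 microseconds (200k packets per second is 5 microseconds per packet, presumably with at least two switches per echoed packet), with the actual context switch probably taking somewhat less than that.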

Resources