How do I ensure accurate results of stress tests? - linux

I have written a script for a project that stress tests the CPU, the VM and the I/O whilst running vmstat, iostat and sar. The scripts all run for 30 seconds. However, my tutor has asked me to ensure that the results are accurate. How can I ever be sure? Surely I just take the machine's word for it after running a few tests? The tests have been run for 60 seconds each, and so have the commands, to try and ensure a fair test, but how can I be sure that they are accurate according to my tutor's concerns? Any ideas?
The systems are server versions of Ubuntu 12.04, Debian 7 and SUSE 11.

There is no way for us to know what your tutor's concerns are, so you should ask him!
"accuracy" usually means that your test results should not be offset by a factor you're not taking into account, like some CPU features being disabled or not used, differences in software configuration, etc.
What is it that you are evaluating, anyway? Evaluating CPU performance is not the same as evaluating a particular hardware system, which is different again if you consider the software as well. Basically, you need to eliminate all differences which are not part of your evaluation, and make sure the rest of the configuration is representative (e.g. installing a modern OS which supports all the features the CPU provides).
And remember that in the end you will always be taking the machine's word for it; there's just no other way. All you can say is that you have considered all the factors you're aware of, and hope that the remaining unknown factors don't have a big influence.

Related

Is there a standard constant *nix benchmark, and if not, how to make a `bogobench`?

Ordinary single-threaded *nix programs can be benchmarked with utils like time, e.g.:
# how long does `seq` take to count to 100,000,000
/usr/bin/time seq 100000000 > /dev/null
Outputs:
1.16user 0.06system 0:01.23elapsed 100%CPU (0avgtext+0avgdata 1944maxresident)k
0inputs+0outputs (0major+80minor)pagefaults 0swaps
...but the numbers returned are always system dependent, which in a sense also measures the user's hardware.
Is there some non-relative benchmarking method or command-line util which would return approximately the same virtual timing numbers on any system (or at least a reasonably large subset of systems)? Just as grep -m1 bogo /proc/cpuinfo returns a rough but stable unit, such a benchmark should also return a similarly stable unit of duration.
Suppose for benchmarking ordinary commands we have a magic util bogobench (where "bogo" is an adjective signifying "a somewhat bogus status", but not necessarily having algorithms in common with BogoMIPs):
bogobench foo bar.data
And we run this on two physically separate systems:
a 1996 Pentium II
a 2015 Xeon
Desired output would be something like:
21 bogo-seconds
So bogobench should return about the same number in both cases, even though it probably would finish in much less time on the 2nd system.
A hardware emulator like qemu might be one approach, but not necessarily the only approach:
Insert the code to benchmark into a wrapper script bogo.sh
Copy bogo.sh to a bootable Linux disk image bootimage.iso, within a directory where bogo.sh will autorun and then promptly shut down the emulator, writing out some form of timing data along the way that can be parsed into bogo-seconds (a sketch of such a wrapper follows these steps).
Run bootimage.iso using one of qemu's more minimal -machine options:
qemu-system-i386 -machine type=isapc bootimage.iso
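For concreteness, the autorun wrapper might look something like this (a sketch only; foo bar.data is the placeholder command from above, and reporting over the serial console assumes the emulator is started with something like -serial stdio so the host can capture it):
#!/bin/sh
# bogo.sh - hypothetical autorun wrapper inside bootimage.iso
start=$(date +%s%N)                  # nanoseconds on the guest's clock
foo bar.data
end=$(date +%s%N)
# write the raw timing to the serial console for the host to parse
echo "bogo-nanoseconds: $((end - start))" > /dev/ttyS0
poweroff -f                          # promptly shut down the emulator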
But I'm not sure how to make qemu use a virtual clock, rather than the host CPU's clock, and qemu itself seems like a heavy tool for a seemingly simple task. (Really MAME or MESS would be more versatile emulators than qemu for such a task -- but I'm not adept with MAME, although MAME currently has some capacity for 80486 PC emulation.)
Online we sometimes compare and contrast timing-based benchmarks made on machine X with ones made on machine Y. Instead, I'd like both user X and user Y to be able to run their benchmarks on a virtual machine Z, with bonus points for emulating X or Y (like MAME) if need be, except with no consideration of X's or Y's real run time (unlike MAME, where emulations are often playable). In this way users could report how programs perform in interesting cases without the programmer having to worry that the results were biased by idiosyncrasies of a user's hardware, such as CPU quirks, background processes hogging resources, etc.
Indeed, even on the user's own hardware, a time-based benchmark can be unreliable, as the user often can't be sure that some background process (or bug, or hardware error like a bad sector, or virus) isn't degrading some aspect of performance. A more virtual benchmark ought to be less susceptible to such influences.
The only sane way I see to implement this is with a cycle-accurate simulator for some kind of hardware design.
AFAIK, no publicly-available cycle-accurate simulators for modern x86 hardware exist, because it's extremely complex and despite a lot of stuff being known about x86 microarchitecture internals (Agner Fog's stuff, Intel's and AMD's own optimization guides, and other stuff in the x86 tag wiki), enough of the behaviour is still a black box full of CPU-design trade-secrets that it's at best possible to simulate something similar. (E.g. branch prediction is definitely one of the most secret but highly important parts).
While it should be possible to come close to simulating Intel Sandybridge or Haswell's actual pipeline and out-of-order core / ROB / RS (at far slower than realtime), nobody has done it that I know of.
But cycle-accurate simulators for other hardware designs do exist: Donald Knuth's MMIX architecture is a clean RISC design that could actually be built in silicon, but currently only exists on paper.
From that link:
Of particular interest is the MMMIX meta-simulator, which is able to do dynamic scheduling of a complex pipeline, allowing superscalar execution with any number of functional units and with many varieties of caching and branch prediction, etc., including a detailed implementation of both hard and soft interrupts.
So you could use this as a reference machine for everyone to run their benchmarks on, and everyone could get comparable results that will tell you how fast something runs on MMIX (after compiling for MMIX with gcc). But not how fast it runs on x86 (presumably also compiling with gcc), which may differ by a significant factor even for two programs that do the same job a different way.
For [fastest-code] challenges over on the Programming puzzles and Code Golf site, #orlp created the GOLF architecture with a simulator that prints timing results, designed for exactly this purpose. It's a toy architecture with stuff like print to stdout by storing to 0xffffffffffffffff, so it's not necessarily going to tell you anything about how fast something will run on any real hardware.
There isn't a full C implementation for GOLF, AFAIK, so you can only really use it with hand-written asm. This is a big difference from MMIX, which optimizing compilers do target.
One practical approach that could (maybe?) be extended to be more accurate over time is to use existing tools to measure some hardware invariant performance metric(s) for the code under test, and then apply a formula to come up with your bogoseconds score.
Unfortunately most easily measurable hardware metrics are not invariant - rather, they depend on the hardware. An obvious one that should be invariant, however, would be "instructions retired". If the code is taking the same code paths every time it is run, the instructions retired count should be the same on all hardware¹.
Then you apply some kind of nominal clock speed (let's say 1 GHz) and nominal CPI (let's say 1.0) to get your bogoseconds - if you measure 15e9 instructions, you output a result of 15 bogoseconds.
The primary flaw here is that the nominal CPI may be way off from the actual CPI! While most programs hover around 1 CPI, it's easy to find examples where they can approach 0.25 or whatever the inverse of the width is, or alternately be 10 or more if there are many lengthy stalls. Of course such extreme programs may be what you'd want to benchmark - and even if not you have the issue that if you are using your benchmark to evaluate code changes, it will ignore any improvements or regressions in CPI and look only at instruction count.
Still, it satisfies your requirement in as much as it effectively emulates a machine that executes exactly 1 instruction every cycle, and maybe it's a reasonable broad-picture approach. It is pretty easy to implement with tools like perf stat -e instructions (like one-liner easy).
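For example, a sketch of that one-liner (foo bar.data is a placeholder workload; the 1 GHz clock and 1.0 CPI are the nominal constants assumed above, not measurements):
# count retired instructions (perf writes CSV to stderr with -x,)
insns=$(perf stat -x, -e instructions ./foo bar.data 2>&1 >/dev/null | awk -F, '/instructions/ {print $1}')
# nominal 1 GHz at 1.0 CPI: 1e9 instructions == 1 bogo-second
echo "$(echo "scale=2; $insns / 1000000000" | bc) bogo-seconds"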
To patch the holes then you could try to make the formula better - let's say you could add in a factor for cache misses to account for that large source of stalls. Unfortunately, how are you going to measure cache-misses in a hardware invariant way? Performance counters won't help - they rely on the behavior and sizes of your local caches. Well, you could use cachegrind to emulate the caches in a machine-independent way. As it turns out, cachegrind even covers branch prediction. So maybe you could plug your instruction count, cache miss and branch miss numbers into a better formula (e.g., use typical L2, L3, RAM latencies, and a typical cost for branch misses).
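A sketch of gathering those machine-independent numbers with cachegrind (the cache geometry is pinned explicitly so every run models the same hypothetical machine; foo bar.data is still a placeholder):
valgrind --tool=cachegrind --cache-sim=yes --branch-sim=yes \
    --I1=32768,8,64 --D1=32768,8,64 --LL=8388608,16,64 \
    ./foo bar.data
# cg_annotate cachegrind.out.<pid> then yields instruction, cache-miss and
# branch-miss totals to plug into the bogoseconds formula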
That's about as far as this simple approach will take you, I think. After that, you might as well just rip apart any of the existing x86² emulators and add your simple machine model right in there. You don't need to be cycle accurate, just pick a nominal width and model it. Probably whatever underlying emulation cachegrind is doing would be a good match, and you get the cache and branch prediction modeling for free.
¹ Of course, this doesn't rule out bugs or inaccuracies in the instruction counting mechanism.
² You didn't tag your question x86 - but I'm going to assume that's your target since you mentioned only Intel chips.

How to prevent compiler & linker from taking over entire machine?

When I compile a large project, the compiler slows down the machine tremendously, virtually freezing it. If I'm lucky, a keystroke in vim takes a few seconds to register. If I'm not, I may as well go for a walk since nothing can be done on my workstation at all.
Is there any way to prevent the compiler and linker from consuming the entire machine? More generally, is it possible to limit a family of processes to a portion of computing resources, such as threads, memory, or disk access bandwidth?
Something like limiting the resources available to the process tree that originates from the shell that runs the build would be ideal.
Most Linux distros have a package called cpulimit. You can use it to limit the CPU usage of the gcc toolchain binaries.
It's mentioned as an answer to this question:
Limiting certain processes to CPU % - Linux
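A rough sketch of the cpulimit approach (cpulimit's -e targets processes by executable name; cc1, cc1plus and ld are the processes gcc actually spawns, and 50 is an arbitrary per-core percentage):
cpulimit -e cc1 -l 50 &
cpulimit -e cc1plus -l 50 &
cpulimit -e ld -l 50 &
make -j4        # start the build as usual; the watchers throttle matching processes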
I'm not an expert on it, but you could try starting the compilation in a dedicated cgroup that has limited resources. I don't know exactly how complicated it is to set up, but it shouldn't be too hard.
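For instance, on a system with systemd this could be a transient scope (a sketch; the CPUQuota and MemoryMax values are arbitrary caps, not recommendations):
# run the whole build tree in its own cgroup, capped at 4 cores' worth of CPU
# time and 8 GiB of RAM (MemoryMax assumes cgroups v2; older setups use MemoryLimit)
systemd-run --scope -p CPUQuota=400% -p MemoryMax=8G make -j8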
You could also try changing the nice value of the process to give it a lower priority, so that it still takes the entire machine when nothing else needs it but is easily bumped by other processes.
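For example (a sketch; ionice's idle class only takes effect with I/O schedulers that support it):
# lowest CPU priority, plus idle I/O priority, for the whole build
nice -n 19 ionice -c 3 make -j"$(nproc)"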

Accurate way of measuring overhead in kernel space

I recently implemented a security mechanism for Linux which hooks into system calls. Now I have to measure the overhead it causes. The project requires comparing the execution time of typical Linux apps with and without the mechanism. By typical Linux apps I mean, for example, gzipping a 1 GB file, running 'find /', and grepping files. The main goal is to show the overhead for different types of tasks: CPU bound, I/O bound, etc.
The question is: how do I organise the tests so that they will be reliable? The first important thing is that my mechanism works only in kernel space, so it is the system time that is relevant to compare. I can use the 'time' command for that, but is it the most accurate way of measuring system time? Another idea is to run those apps in long loops to minimize error. Should the loops then be inside or outside the time command? If they are outside, I will get many results - should I choose the min, max, median or average?
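For example, a sketch of the "loops outside time" variant using GNU time's format string to log only the system time of each run (bigfile is a placeholder workload, and the run count is arbitrary):
for i in $(seq 1 20); do
    /usr/bin/time -a -o systime.log -f "%S" gzip -c bigfile > /dev/null
done
sort -n systime.log | head -1     # e.g. take the minimum of the 20 runs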
Thanks for any suggestions.
I think you want to measure a typical application payload instead (as Ninjajl's comment suggests, compiling the kernel could be a good payload). You probably don't want to measure the overhead inside each syscall itself, or even inside the kernel as a whole.
The reason for this is that most applications spend much more time and resources in user space than in kernel land (i.e. in syscalls), so overhead inside syscalls is a "second-order" effect and probably doesn't matter as much. Of course, there are likely exceptions.
Perhaps the Phoronix Test Suite might be relevant.
You might be interested in oprofile.
See also this answer and this question

/dev/random Extremely Slow?

Some background info: I was looking to run a script on a Red Hat server to read some data from /dev/random and use the Perl unpack() command to convert it to a hex string for use later on (benchmarking database operations). I ran a few "head -1" calls on /dev/random and it seemed to be working out fine, but after calling it a few times, it would just kind of hang. After a few minutes, it would finally output a small block of text, then finish.
I switched to /dev/urandom (I really didn't want to; it's slower and I don't need that quality of randomness) and it worked fine for the first two or three calls, then it too began to hang.
I was wondering if it was the "head" command that was bombing it, so I tried doing some simple I/O using Perl, and it too was hanging.
As a last-ditch effort, I used the "dd" command to dump some data out of it directly to a file instead of to the terminal. All I asked of it was 1 MB of data, but it took 3 minutes to get ~400 bytes before I killed it.
I checked the process lists, CPU and memory were basically untouched. What exactly could cause /dev/random to crap out like this and what can I do to prevent/fix it in the future?
Edit: Thanks for the help guys! It seems that I had random and urandom mixed up. I've got the script up and running now. Looks like I learned something new today. :)
On most Linux systems, /dev/random is powered by actual entropy gathered from the environment. If your system isn't delivering a large amount of data from /dev/random, it likely means that you're not generating enough environmental randomness to power it.
I'm not sure why you think /dev/urandom is "slower" or higher quality. It reuses an internal entropy pool to generate pseudorandomness - making it slightly lower quality - but it doesn't block. Generally, applications that don't require high-level or long-term cryptography can use /dev/urandom reliably.
Try waiting a little while and then reading from /dev/urandom again. It's possible that you've exhausted the internal entropy pool by reading so much from /dev/random, breaking both generators - allowing your system to create more entropy should replenish them.
See Wikipedia for more info about /dev/random and /dev/urandom.
This question is pretty old, but still relevant, so I'm going to give my answer. Many CPUs today come with a built-in hardware random number generator (RNG). Many systems also come with a trusted platform module (TPM) that provides an RNG as well. There are other options that can be purchased, but chances are your computer already has something.
You can use rngd from the rng-utils package on most Linux distros to feed the kernel's entropy pool from those sources. For example, on Fedora 18 all I had to do to enable seeding from the TPM and the CPU RNG (RDRAND instruction) was:
# systemctl enable rngd
# systemctl start rngd
You can compare the speed with and without rngd. It's a good idea to run rngd -v -f from the command line; that will show you the detected entropy sources. Make sure all the modules needed to support your sources are loaded. To use the TPM, it needs to be activated through the tpm-tools. Update: here is a nice howto.
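A quick way to make such a comparison (a sketch; the byte counts are arbitrary, and the read from /dev/random may block for a long time on an entropy-starved machine):
cat /proc/sys/kernel/random/entropy_avail     # how much entropy is currently pooled
time dd if=/dev/random of=/dev/null bs=64 count=16 iflag=fullblock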
BTW, I've read some concerns on the Internet about TPM RNGs often being broken in different ways, but nothing concrete against the RNGs found in Intel, AMD and VIA chips. Using more than one source would be best if you really care about randomness quality.
urandom is good for most use cases (except sometimes during early boot). Most programs nowadays use urandom instead of random. Even openssl does that. See myths about urandom and comparison of random interfaces.
In recent Fedora and RHEL/CentOS, rng-tools also supports jitter entropy. You can use it if you lack hardware options, or if you just trust it more than your hardware.
UPDATE: another option for more entropy is HAVEGED (of questioned quality). On virtual machines there is the kvm/qemu virtio-rng (recommended).
UPDATE 2: Since Linux 5.6, the kernel does its own jitter entropy.
Use /dev/urandom; it's cryptographically secure.
good read: http://www.2uo.de/myths-about-urandom/
"If you are unsure about whether you should use /dev/random or /dev/urandom, then probably you want to use the latter."
When in doubt during early boot about whether enough entropy has been gathered, use the getrandom() system call instead. [1] (available from Linux kernel >= 3.17)
It's the best of both worlds:
it blocks until enough entropy is gathered (only once!),
after that it will never block again.
[1] git kernel commit
If you want more entropy for /dev/random then you'll either need to purchase a hardware RNG or use one of the *_entropyd daemons in order to generate it.
If you are using randomness for testing (not cryptography), then repeatable randomness is better; you can get this from a pseudo-random generator started with a known seed. There is usually a good library function for this in most languages.
It is repeatable, for when you find a problem and are trying to debug it. It also does not eat up entropy. You could seed the pseudo-random generator from /dev/urandom and record the seed in the test log. Perl has a pseudo-random number generator you can use.
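A sketch of that idea (the log file name and value range are arbitrary; the point is that the seed is logged once and every value derived from it is reproducible on a rerun):
seed=$(od -An -N4 -tu4 /dev/urandom | tr -d ' ')      # one cheap read of real randomness
echo "PRNG seed: $seed" >> test.log                   # record it for reruns
perl -e 'srand($ARGV[0]); print int(rand(256)), "\n" for 1..8' "$seed"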
This fixed it for me.
Use new SecureRandom() instead of SecureRandom.getInstanceStrong()
Some more info can be found here :
https://tersesystems.com/blog/2015/12/17/the-right-way-to-use-securerandom/
/dev/random should be pretty fast these days. However, I did notice that on OS X reading small numbers of bytes from /dev/urandom was really slow. The workaround seemed to be to use arc4random instead: https://github.com/crystal-lang/crystal/pull/11974

Using "top" in Linux as semi-permanent instrumentation

I'm trying to find the best way to use 'top' as semi-permanent instrumentation in the development of a box running embedded Linux. (The instrumentation will be removed from the final-test and production releases.)
My first pass is to simply add this to init.d:
top -b -d 15 >/tmp/toploop.out &
This runs top in "batch" mode every 15 seconds. Let's assume that /tmp has plenty of space…
Questions:
Is 15 seconds a good value to choose for general-purpose monitoring?
Other than disk space, how seriously is this perturbing the state of the system?
What other (perhaps better) tools could be used like this?
Look at collectd. It's a very lightweight system monitoring framework coded for performance.
We use sysstat to monitor things like this.
You might find that vmstat and iostat with a delay and no repeat count are a better option.
I suspect 15 seconds would be more than adequate unless you actually want to watch what's happening in real time, but that doesn't appear to be the case here.
As far as load goes, on an idling PIII 900 MHz with 768 MB of RAM running Ubuntu (not sure which version, but not more than a year old), I have top updating every 0.5 seconds and it's about 2% CPU utilization. At 15-second updates, I'm seeing 0.1% CPU utilization.
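A rough equivalent of the original top loop using vmstat and iostat (the file paths and the 15-second interval are just examples):
vmstat 15 >> /tmp/vmstat.out &
iostat -x 15 >> /tmp/iostat.out &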
Depending upon what exactly you want, you could use the output of uptime, free, and ps to get most, if not all, of top's information.
If you are looking for overall load, uptime is probably sufficient. However, if you want specific information about processes, you are adventurous, and you have the /proc filesystem enabled, you may want to write your own tools. The primary benefit in this environment is that you can focus on exactly what you want and minimize the load introduced to the system.
The /proc filesystem gives your application read access to the kernel memory that keeps track of many of the interesting variables. Reading from /proc is one of the lightest ways to get this information, and you may be able to get more information than top provides. I've done this in the past to get the amount of time a process has spent in user and system mode. You can also use it to get the number of file descriptors a process has open, or detailed information about how the networking subsystem is working.
Much of this information is pre-processed by other applications, which you can use if they provide what you need. However, it is rather straightforward to read the raw information yourself. See man proc for more information.
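As a sketch of how little is involved (assumes $PID is the process being watched and that its name contains no spaces, which would otherwise shift the awk fields):
# fields 14, 15 and 24 of /proc/PID/stat are utime, stime and resident pages
awk '{print "utime:", $14, "stime:", $15, "rss pages:", $24}' /proc/$PID/stat
ls /proc/$PID/fd | wc -l      # number of open file descriptors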
Pity you haven't said what you are monitoring for.
You should decide whether 15 seconds is ok or not. Feel free to drop it way lower if you wish (and have a fast HDD)
No worries unless you are running a soft real-time system
Have a look at the tools suggested in the other answers. I'll add another suggestion: iotop, for answering the "who is thrashing the HDD" question.
At work for system monitoring during stress tests we use a tool called nmon.
What I love about nmon is it has the ability to export to XLS and generate beautiful graphs for you.
It generates statistics for the following (an example capture invocation is shown after the list):
Memory Usage
CPU Usage
Network Usage
Disk I/O
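A typical capture invocation looks something like this (a sketch; -f writes to a .nmon file for later graphing, -s is the interval in seconds, -c the number of snapshots, roughly matching the 15-second cadence discussed above):
nmon -f -s 15 -c 240      # about an hour of samples at 15-second intervals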
Good luck :)
