Accurate way of measuring overhead in kernel space - Linux

I recently implemented a security mechanism for Linux which hooks into system calls. Now I have to measure the overhead it causes. The project requires comparing the execution time of typical Linux apps with and without the mechanism. By typical Linux apps I mean things like gzipping a 1 GB file, running 'find /', or grepping files. The main goal is to show the overhead for different types of tasks: CPU-bound, I/O-bound, etc.
The question is: how do I organise the tests so that they are reliable? The first important point is that my mechanism works only in kernel space, so it is the system time that matters. I can use the 'time' command for that, but is it the most accurate way of measuring system time? Another idea is to run those apps in long loops to minimize error. Should the loops then be inside or outside the time command? If they are outside, I will get many results - should I take the min, max, median, or average?
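A sketch of the kind of harness I have in mind (loop outside the time command, one system-time sample per run; the gzip payload and file name are just placeholders):
# collect only system time (GNU time's %S) for N runs of the payload
for i in $(seq 1 10); do
    /usr/bin/time -f "%S" gzip -c testfile.1G > /dev/null
done 2> systime-samples.txt
# then summarise systime-samples.txt (the median is usually the most robust single number)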
Thanks for any suggestions.

I think you want to measure a typical application payload instead (as Ninjajl's comment suggests, compiling the kernel could be a good payload). You probably don't want to measure the overhead inside each syscall itself, or even inside the kernel as a whole.
The reason is that most applications spend much more time and resources in user-space than in kernel-land (i.e. in syscalls), so overhead inside syscalls is a "second-order" effect and probably doesn't matter as much. Of course, there are likely exceptions.
Perhaps the Phoronix Test Suite might be relevant.
You might be interested in OProfile.
See also this answer and this question

Related

Is there a standard constant *nix benchmark, and if not, how to make a `bogobench`?

Ordinary single-threaded *nix programs can be benchmarked with utils like time, e.g.:
# how long does `seq` take to count to 100,000,000
/usr/bin/time seq 100000000 > /dev/null
Outputs:
1.16user 0.06system 0:01.23elapsed 100%CPU (0avgtext+0avgdata 1944maxresident)k
0inputs+0outputs (0major+80minor)pagefaults 0swaps
...but the numbers returned are always system-dependent, which in a sense also measures the user's hardware.
Is there some non-relative benchmarking method or command-line util which would return approximately the same virtual timing numbers on any system, (or at least a reasonably large subset of systems)? Just like grep -m1 bogo /proc/cpuinfo returns a roughly approximate but stable unit, such a benchmark should also return a somewhat similar unit of duration.
Suppose for benchmarking ordinary commands we have a magic util bogobench (where "bogo" is an adjective signifying "a somewhat bogus status", but not necessarily having algorithms in common with BogoMIPs):
bogobench foo bar.data
And we run this on two physically separate systems:
a 1996 Pentium II
a 2015 Xeon
Desired output would be something like:
21 bogo-seconds
So bogobench should return about the same number in both cases, even though it probably would finish in much less time on the 2nd system.
A hardware emulator like qemu might be one approach, but not necessarily the only approach:
Insert the code to benchmark into a wrapper script bogo.sh
Copy bogo.sh to a bootable Linux disk image bootimage.iso, within a directory where bogo.sh would autorun and then promptly shut down the emulator, outputting some form of timing data along the way to parse into bogo-seconds.
Run bootimage.iso using one of qemu's more minimal -machine options:
qemu-system-i386 -machine type=isapc bootimage.iso
But I'm not sure how to make qemu use a virtual clock, rather than the host CPU's clock, and qemu itself seems like a heavy tool for a seemingly simple task. (Really MAME or MESS would be more versatile emulators than qemu for such a task -- but I'm not adept with MAME, although MAME currently has some capacity for 80486 PC emulation.)
Online we sometimes compare and contrast timing-based benchmarks made on machine X with those made on machine Y. I'd instead like both user X and user Y to be able to run their benchmark on a virtual machine Z, with bonus points for emulating X or Y (like MAME) if need be, but with no consideration of X or Y's real run-time (unlike MAME, where emulations are often playable). In this way users could report how programs perform in interesting cases without the programmer having to worry that the results were biased by idiosyncrasies of a user's hardware, such as CPU quirks, background processes hogging resources, etc.
Indeed, even on the user's own hardware, a time-based benchmark can be unreliable, as the user often can't be sure that some background process (or bug, or hardware error like a bad sector, or virus) isn't degrading some aspect of performance. A more virtual benchmark ought to be less susceptible to such influences.
The only sane way I see to implement this is with a cycle-accurate simulator for some kind of hardware design.
AFAIK, no publicly-available cycle-accurate simulators for modern x86 hardware exist, because x86 is extremely complex. Despite a lot being known about x86 microarchitecture internals (Agner Fog's material, Intel's and AMD's own optimization guides, and other resources in the x86 tag wiki), enough of the behaviour is still a black box full of CPU-design trade secrets that, at best, you can simulate something similar. (Branch prediction, for example, is one of the most secret but highly important parts.)
While it should be possible to come close to simulating Intel Sandybridge or Haswell's actual pipeline and out-of-order core / ROB / RS (at far slower than realtime), nobody has done it that I know of.
But cycle-accurate simulators for other hardware designs do exist: Donald Knuth's MMIX architecture is a clean RISC design that could actually be built in silicon, but currently only exists on paper.
From that link:
Of particular interest is the MMMIX meta-simulator, which is able to do dynamic scheduling of a complex pipeline, allowing superscalar execution with any number of functional units and with many varieties of caching and branch prediction, etc., including a detailed implementation of both hard and soft interrupts.
So you could use this as a reference machine for everyone to run their benchmarks on, and everyone could get comparable results that will tell you how fast something runs on MMIX (after compiling for MMIX with gcc). But not how fast it runs on x86 (presumably also compiling with gcc), which may differ by a significant factor even for two programs that do the same job a different way.
For [fastest-code] challenges over on the Programming puzzles and Code Golf site, #orlp created the GOLF architecture with a simulator that prints timing results, designed for exactly this purpose. It's a toy architecture with stuff like print to stdout by storing to 0xffffffffffffffff, so it's not necessarily going to tell you anything about how fast something will run on any real hardware.
There isn't a full C implementation for GOLF, AFAIK, so you can only really use it with hand-written asm. This is a big difference from MMIX, which optimizing compilers do target.
One practical approach that could (maybe?) be extended to be more accurate over time is to use existing tools to measure some hardware invariant performance metric(s) for the code under test, and then apply a formula to come up with your bogoseconds score.
Unfortunately most easily measurable hardware metrics are not invariant - rather, they depend on the hardware. An obvious one that should be invariant, however, would be "instructions retired". If the code is taking the same code paths every time it is run, the instructions retired count should be the same on all hardware [1].
Then you apply some kind of nominal clock speed (let's say 1 GHz) and nominal CPI (let's say 1.0) to get your bogoseconds - if you measure 15e9 instructions, you output a result of 15 bogoseconds.
The primary flaw here is that the nominal CPI may be way off from the actual CPI! While most programs hover around 1 CPI, it's easy to find examples where they can approach 0.25 or whatever the inverse of the width is, or alternately be 10 or more if there are many lengthy stalls. Of course such extreme programs may be what you'd want to benchmark - and even if not you have the issue that if you are using your benchmark to evaluate code changes, it will ignore any improvements or regressions in CPI and look only at instruction count.
Still, it satisfies your requirement inasmuch as it effectively emulates a machine that executes exactly 1 instruction every cycle, and maybe it's a reasonable broad-picture approach. It is pretty easy to implement with tools like perf stat -e instructions (like one-liner easy).
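A rough sketch of that one-liner (./foo stands in for the code under test; the 1 GHz clock and 1.0 CPI are the nominal figures from above):
# count retired instructions, then convert to bogo-seconds at 1 GHz and CPI = 1.0
perf stat -x, -e instructions -o counts.csv -- ./foo
awk -F, '$3 ~ /instructions/ { printf "%.1f bogo-seconds\n", $1 / 1e9 }' counts.csv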
To patch the holes, you could then try to make the formula better - say, add in a factor for cache misses to account for that large source of stalls. Unfortunately, how are you going to measure cache misses in a hardware-invariant way? Performance counters won't help - they depend on the behavior and sizes of your local caches. Well, you could use cachegrind to simulate the caches in a machine-independent way. As it turns out, cachegrind even covers branch prediction. So maybe you could plug your instruction count, cache-miss and branch-miss numbers into a better formula (e.g., using typical L2, L3 and RAM latencies, and a typical cost for branch misses).
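A sketch of the cachegrind step (the output file name is arbitrary; you would feed the totals it reports into whatever cost formula you settle on):
# simulate caches and branch prediction in a machine-independent way
valgrind --tool=cachegrind --branch-sim=yes --cachegrind-out-file=cg.out ./foo
cg_annotate cg.out | head -n 30   # totals for Ir, D1/LL misses, mispredicted branches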
That's about as far as this simple approach will take you, I think. After that, you might as well just rip apart one of the existing x86 [2] emulators and add your simple machine model right in there. You don't need to be cycle-accurate, just pick a nominal width and model it. Whatever underlying emulation cachegrind is using would probably be a good match, and you would get the cache and branch-prediction modeling for free.
[1] Of course, this doesn't rule out bugs or inaccuracies in the instruction-counting mechanism.
[2] You didn't tag your question x86, but I'm going to assume that's your target since you mentioned only Intel chips.

How to prevent compiler & linker from taking over entire machine?

When I compile a large project, the compiler slows down the machine tremendously, virtually freezing it. If I'm lucky, a keystroke in vim takes a few seconds to register. If I'm not, I may as well go for a walk, since nothing can be done on my workstation at all.
Is there any way to prevent compiler and linker from consuming the entire machine? More generally, is it possible to limit a family of processes to a portion of computing resources, such as threads, memory, disk access bandwidth?
Something like limiting the resources available to the process tree that originates from the shell that runs the build would be ideal.
Most Linux distros have a package called cpulimit. You can use this to limit the CPU usage of the gcc toolchain binaries.
It's mentioned as an answer to this question:
Limiting certain processes to CPU % - Linux
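A sketch of how that might look for a build (flag support varies between cpulimit versions, and classic cpulimit throttles only one matching process at a time - some forks add an option to include child processes):
# cap each compiler process at roughly half a core, matching by executable name
cpulimit -l 50 -e cc1 &
cpulimit -l 50 -e cc1plus &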
I'm not an expert on it, but you could try starting the compilation in a specific cgroup that has limited resources. I don't know exactly how complicated it is to set up, but it shouldn't be too hard.
You could also try changing the niceness of the process to give it a lower priority, so that it can still take the entire machine when idle but will easily be preempted by other processes.
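For a concrete starting point, something along these lines may work (a sketch; whether an unprivileged user can set CPU and memory caps this way depends on the distro's cgroup delegation setup):
# run the whole build in a transient cgroup with example resource caps
systemd-run --scope -p CPUQuota=200% -p MemoryMax=4G make -j"$(nproc)"
# or simply deprioritise it so interactive work preempts it
nice -n 19 ionice -c 3 make -j"$(nproc)"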

Limiting the File System Usage Programmatically in Linux

I was assigned to write a system call for the Linux kernel, which oddly enough determines (and reduces) users' maximum transfer amount per minute (for file operations). This system call will be called lim_fs_usage and will take a parameter for the maximum number of bytes all users can access in a minute. In short, I am going to limit the bandwidth of all filesystem operations in Linux. The project also asks for choosing an appropriate method for distributing this restricted resource (file access) among the users, but I think this won't be a big problem.
I did a long, long search and scan but could not find a method for managing filesystem access programmatically. I thought of mapping the hard drive into memory (mmap()) and managing the memory operations, but this turned out to be useless. I also tried to find an API for the virtual file system in order to monitor and limit it, but I could not find one. Any ideas, please... Any help is greatly appreciated. Thank you in advance...
I wonder if you could do this as an IO scheduler implementation.
The main difficulty of doing IO bandwidth limitation under Linux is that, by the time the IO reaches anywhere near the device, the kernel has probably long since forgotten who caused it.
Likewise, you can get on some very tricky ground in determining who is responsible for a given piece of IO:
If a binary is demand-loaded, who owns the IO doing that?
A mapped section of memory (demand-loaded executable or otherwise) might be kicked out of memory because someone else used too much RAM, causing the kernel to evict those pages; paging them back in then places an unfair burden on the other user's quota.
IO operations can be combined, and might come from different users
A write operation might cause an IO sooner or later depending on how the kernel schedules it; a later schedule may mean that fewer IOs need to be done in the long run, as another write gets done to the same block in the interim; writing to an already dirty block in cache does not make it any dirtier.
If you understand all these and more caveats, and still want to, I imagine doing it as an IO scheduler is the way to go.
IO schedulers are pluggable under Linux (2.6) and can be changed dynamically - the kernel waits for all IO on the device (IO scheduler is switchable per block device) to end and then switches to the new one.
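(For reference, this is what the per-device switching looks like from user space; sdX is a placeholder for the block device.)
cat /sys/block/sdX/queue/scheduler               # lists the available schedulers, current one in brackets
echo deadline > /sys/block/sdX/queue/scheduler   # as root: switch this device's IO scheduler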
Since it's urgent, I'll give you an idea off the top of my head without doing any research on its feasibility -- what about inserting a hook to monitor the system calls that deal with filesystem access?
You might end up writing specialised kernel modules to handle the various filesystems (ext3, ext4, etc.), but as a proof of concept you can start with one. Do not forget that root has reserved blocks in memory, process space and disk for its own operations.
Managing memory operations does not sound related to what you're trying to do (but perhaps I am mistaken here).
After a long period of thinking and searching, I decided to use the "hooking" method proposed. I am thinking of creating a new system call which initializes and manages a global variable like hdd_bandwidth_limit. This variable will be used in the modified implementations of the read() and write() system calls (in place of the "count" variable). Then I will decide on the distribution of this resource, which is the real issue. I will probably find out how many users are using the system at a given moment and divide the resource equally among them - a Round-Robin-like distribution. But I am still open to suggestions on this distribution issue. Should it be SJF, FCFS, or Round-Robin? Synchronization is another issue. How can I know whether a user's job is short or long? Or whether they are done with the operation or not?

Microsecond accurate (or better) process timing in Linux

I need a very accurate way to time parts of my program. I could use the regular high-resolution clock for this, but that returns wall-clock time, which is not what I need: I need the time spent running only my process.
I distinctly remember seeing a Linux kernel patch that would allow me to time my processes to nanosecond accuracy, except I forgot to bookmark it and I forgot the name of the patch as well :(.
I remember how it works though:
On every context switch, it will read out the value of a high-resolution clock, and add the delta of the last two values to the process time of the running process. This produces a high-resolution accurate view of the process' actual process time.
The regular process time is kept using the regular clock, which I believe is millisecond-accurate (1000 Hz) - much too coarse for my purposes.
Does anyone know what kernel patch I'm talking about? I also remember it was like a word with a letter before or after it -- something like 'rtimer' or something, but I don't remember exactly.
(Other suggestions are welcome too)
The Completely Fair Scheduler suggested by Marko is not what I was looking for, but it looks promising. The problem I have with it is that the calls I can use to get process time still don't return values that are granular enough.
times() is returning values 21, 22, in milliseconds.
clock() is returning values 21000, 22000, same granularity.
getrusage() is returning values like 210002, 22001 (and such); they look to have a bit better accuracy, but the values look conspicuously the same.
So now the problem I'm probably having is that the kernel has the information I need, I just don't know the system call that will return it.
If you are looking for this level of timing resolution, you are probably trying to do some micro-optimization. If that's the case, you should look at PAPI. Not only does it provide both wall-clock and virtual (process only) timing information, it also provides access to CPU event counters, which can be indispensable when you are trying to improve performance.
http://icl.cs.utk.edu/papi/
See this question for some more info.
Something I've used for such things is gettimeofday(). It provides a structure with seconds and microseconds. Call it before the code, and again after. Then just subtract the two structs using timersub, and you can get the elapsed time from the tv_sec and tv_usec fields of the result.
If you need very small time units for (I assume) testing the speed of your software, I would recommend just running the parts you want to time in a loop millions of times, taking the time before and after the loop, and calculating the average. A nice side effect of doing this (apart from not needing to figure out how to use nanoseconds) is that you get more consistent results, because the random overhead caused by the OS scheduler is averaged out.
Of course, unless your program needs to run millions of times per second, it's probably fast enough if its running time is too short to measure in milliseconds.
I believe CFS (the Completely Fair Scheduler) is what you're looking for.
You can use the High Precision Event Timer (HPET) if you have a fairly recent 2.6 kernel. Check out Documentation/hpet.txt on how to use it. This solution is platform dependent though and I believe it is only available on newer x86 systems. HPET has at least a 10MHz timer so it should fit your requirements easily.
I believe several PowerPC implementations from Freescale support a cycle exact instruction counter as well. I used this a number of years ago to profile highly optimized code but I can't remember what it is called. I believe Freescale has a kernel patch you have to apply in order to access it from user space.
http://allmybrain.com/2008/06/10/timing-cc-code-on-linux/
might be of help to you (directly if you are doing it in C/C++, but I hope it will give you pointers even if you're not)... It claims to provide microsecond accuracy, which just passes your criterion. :)
I think I found the kernel patch I was looking for. Posting it here so I don't forget the link:
http://user.it.uu.se/~mikpe/linux/perfctr/
http://sourceforge.net/projects/perfctr/
Edit: It works for my purposes, though it's not very user-friendly.
Try the CPU's timestamp counter? Wikipedia seems to suggest using clock_gettime().

Using "top" in Linux as semi-permanent instrumentation

I'm trying to find the best way to use 'top' as semi-permanent instrumentation in the development of a box running embedded Linux. (The instrumentation will be removed from the final-test and production releases.)
My first pass is to simply add this to init.d:
top -b -d 15 >/tmp/toploop.out &
This runs top in "batch" mode every 15 seconds. Let's assume that /tmp has plenty of space…
Questions:
Is 15 seconds a good value to choose for general-purpose monitoring?
Other than disk space, how seriously is this perturbing the state of the system?
What other (perhaps better) tools could be used like this?
Look at collectd. It's a very lightweight system-monitoring framework coded for performance.
We use sysstat to monitor things like this.
You might find that vmstat and iostat with a delay and no repeat counter is a better option.
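For example, a capture roughly equivalent to the top-based loop, but lighter (the output file names are arbitrary):
vmstat 15 >> /tmp/vmstat.out &
iostat 15 >> /tmp/iostat.out &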
I suspect 15 seconds would be more than adequate unless you actually want to watch what's happening in real time, but that doesn't appear to be the case here.
As far as load goes, on an idling PIII 900 MHz with 768 MB of RAM running Ubuntu (not sure which version, but not more than a year old) I have top updating every 0.5 seconds and it's about 2% CPU utilization. At 15-second updates, I'm seeing 0.1% CPU utilization.
Depending upon what exactly you want, you could use the output of uptime, free, and ps to get most, if not all, of top's information.
If you are looking for overall load, uptime is probably sufficient. However, if you want specific information about processes, are adventurous, and have the /proc filesystem enabled, you may want to write your own tools. The primary benefit in this environment is that you can focus on exactly what you want and minimize the load introduced to the system.
The proc filesystem gives your application read access to the kernel memory that keeps track of many of the interesting variables. Reading from /proc is one of the lightest ways to get this information. Additionally, you may be able to get more information than top provides. I've done this in the past to get the amount of time a process spent in user and system mode. You can also use it to get the number of file descriptors a process has open, or detailed information about how the network stack is behaving.
Much of this information is pre-processed by other applications, which you can use if they give you the information you need. However, it is rather straightforward to read the raw information. Do a man proc for more information.
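As a tiny illustration of how light this can be (per proc(5), fields 14 and 15 of /proc/<pid>/stat are utime and stime, assuming the command name contains no spaces; $pid is whatever process you are watching):
awk '{ printf "utime=%s stime=%s clock ticks\n", $14, $15 }' "/proc/$pid/stat"
getconf CLK_TCK   # ticks per second, for converting to seconds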
Pity you haven't said what you are monitoring for.
You should decide whether 15 seconds is OK or not. Feel free to drop it way lower if you wish (and you have a fast HDD).
No worries, unless you are running a soft real-time system.
Have a look at the tools suggested in the other answers. I'll add another suggestion: "iotop", for answering the "who is thrashing the HDD" question.
At work, for system monitoring during stress tests, we use a tool called nmon.
What I love about nmon is that it can export to XLS and generate beautiful graphs for you.
It generates statistics for:
Memory Usage
CPU Usage
Network Usage
Disk I/O
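A capture for later graphing might look something like this (flags can differ between nmon builds, so check nmon -h):
nmon -f -s 15 -c 240   # spreadsheet-format output file, one sample every 15 s, 240 samples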
Good luck :)

Resources