I'm using perf_event_open to collect samples. I want to capture every single hit of the point I'm sampling, but perf_event_open is not sampling fast enough. I tried to raise the sample rate using the command below:
echo 10000000 > /proc/sys/kernel/perf_event_max_sample_rate
But it looks like the value I set was too large. After running my code, perf_event_max_sample_rate is changed back to a lower value such as 12500. And when I try even bigger values, for example 20000000, 50000000 and so on, the sampling speed does not increase to match. Is there any way to make perf_event_open sample faster?
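For context, a minimal sketch of how a sampling frequency is requested through perf_event_attr (the freq/sample_freq fields are documented in perf_event_open(2); the event type, the 10 kHz target and the missing ring-buffer handling are placeholders, not necessarily what the asker uses):

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    /* glibc provides no wrapper for perf_event_open, so call it directly. */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;   /* placeholder event */
        attr.freq = 1;                /* interpret sample_freq as samples/sec */
        attr.sample_freq = 10000;     /* requested rate; limited by perf_event_max_sample_rate */
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = perf_event_open(&attr, 0 /* this process */, -1 /* any CPU */, -1, 0);
        if (fd < 0) {
            perror("perf_event_open");
            return 1;
        }
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... mmap the ring buffer and consume samples here ... */
        close(fd);
        return 0;
    }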
It is really not possible to increase the perf_event_max_sample_rate beyond a certain value.
I have tried increasing it above 100,000, say to something like 200,000 or more. Every time I did this, the max sample rate always came back down to something like 146,500 samples/sec or less. If I recall correctly, that was the maximum I could achieve (i.e. 146,500 samples/sec). This will of course depend on the kind of machine you are using, the CPU frequencies, etc. I was working on an Intel Xeon v5 Broadwell CPU.
Zulan makes a good point. To make your understanding clearer: perf sample collection is interrupt-based. Every time the sampling counter overflows, perf raises an NMI (non-maskable interrupt). The handler also measures how long it takes to process the whole interrupt. You can see this in the kernel function below:
perf_event_nmi_handler
Once it has measured the time spent handling the interrupt, it calls another function (passing the interrupt handling time as a parameter) which compares that time against the budget implied by the current perf_event_max_sample_rate. If it finds that the interrupts take long enough, while samples are being generated very frequently, the CPU will obviously not be able to keep up: interrupt work starts getting queued up and you will observe some amount of CPU throttling. In the function below there will always be an attempt to reduce the sampling rate when this happens.
Read the function below to understand:
perf_sample_event_took
Of course, as Zulan suggested, you can try setting perf_cpu_time_max_percent to 0, but you would get roughly the same maximum number of samples from perf while hurting the CPU further. It is not possible to raise the maximum unless you find other means (like tweaking the buffer, if possible).
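To make that concrete, here is a toy user-space illustration of the budget idea; this is not the kernel code (the real implementation keeps a decaying per-CPU average of NMI handling time), and all numbers below are made up:

    #include <stdio.h>

    int main(void)
    {
        /* Assumed figures, for illustration only. */
        unsigned long long requested_rate = 10000000; /* value written to the sysctl, in Hz */
        unsigned long long cpu_max_percent = 25;      /* kernel.perf_cpu_time_max_percent */
        unsigned long long nmi_cost_ns = 2000;        /* measured cost of one sampling NMI */

        /* CPU-time budget per second for sampling, in nanoseconds. */
        unsigned long long budget_ns = 1000000000ULL * cpu_max_percent / 100;

        /* Time the requested rate would actually spend inside NMIs per second. */
        unsigned long long spent_ns = requested_rate * nmi_cost_ns;

        if (spent_ns > budget_ns) {
            /* Throttle to the highest rate that still fits the budget -- this is
             * why the sysctl value you wrote gets replaced by a much smaller one. */
            unsigned long long new_rate = budget_ns / nmi_cost_ns;
            printf("%llu Hz needs %llu ns/s of NMI time; throttling to ~%llu Hz\n",
                   requested_rate, spent_ns, new_rate);
        } else {
            printf("%llu Hz fits within the budget\n", requested_rate);
        }
        return 0;
    }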
This is a mechanism to limit the overhead caused by perf. You can disable it by setting
sysctl -w kernel.perf_cpu_time_max_percent=0
Use at your own risk - the system may stop responding.
https://www.kernel.org/doc/Documentation/sysctl/kernel.txt
perf_cpu_time_max_percent:
Hints to the kernel how much CPU time it should be allowed to use to handle perf sampling events. If the perf subsystem is informed that its samples are exceeding this limit, it will drop its sampling frequency to attempt to reduce its CPU usage.
Some perf sampling happens in NMIs. If these samples unexpectedly take too long to execute, the NMIs can become stacked up next to each other so much that nothing else is allowed to execute.
0: disable the mechanism. Do not monitor or correct perf's sampling rate no matter how much CPU time it takes.
1-100: attempt to throttle perf's sample rate to this percentage of CPU. Note: the kernel calculates an "expected" length of each sample event. 100 here means 100% of that expected length. Even if this is set to 100, you may still see sample throttling if this length is exceeded. Set to 0 if you truly do not care how much CPU is consumed.
I have a quad-core Ubuntu system. Say I see a load average of 60 over the last 15 minutes during peak time; the load average goes up to 150 as well.
This load generally happens only during peak time. Basically, I want to know whether there is any standard formula to derive the number of cores ideally required to handle a given load.
Objective:
If the load is 60, does that mean 60 tasks were in the queue on average at any point in time during the last 15 minutes? Will adding CPUs help me serve requests faster, or save the system from hanging or crashing?
Linux load average (as printed by uptime or top) includes tasks in I/O wait, so it can have very little to do with CPU time that could potentially be used in parallel.
If all the tasks were purely CPU bound, then yes 150 sustained load average would mean that potentially 150 cores could be useful. (But if it's not sustained, then it might just be a temporary long queue that wouldn't get that long if you had better CPU throughput.)
If you're getting crashes, that's a huge problem that isn't explainable by high loads. (Unless it's from the out-of-memory killer kicking in.)
It might help to use vmstat or dstat to see how much CPU time is spent in user/kernel space when your load avg. is building up, or if it's probably mostly I/O.
Or of course you probably know what tasks are running on your machine, and whether one single task is I/O bound or CPU bound on an otherwise-idle machine. I/O throughput usually scales a bit positively with queue depth, except on magnetic hard drives when that turns sequential read/write into seek-heavy workloads.
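If it helps to check this on a live box, here is a small sketch (assuming the Linux /proc/loadavg format) that compares the 15-minute load average with the number of online cores; the caveat above about I/O wait still applies to the ratio it prints:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        double load1, load5, load15;
        FILE *f = fopen("/proc/loadavg", "r");
        if (!f) {
            perror("/proc/loadavg");
            return 1;
        }
        if (fscanf(f, "%lf %lf %lf", &load1, &load5, &load15) != 3) {
            fclose(f);
            fprintf(stderr, "unexpected /proc/loadavg format\n");
            return 1;
        }
        fclose(f);

        long cores = sysconf(_SC_NPROCESSORS_ONLN);
        printf("15-min load average: %.2f on %ld cores\n", load15, cores);
        printf("tasks per core (runnable + uninterruptible I/O wait): %.2f\n",
               load15 / (double)cores);
        return 0;
    }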
For my research I need a CPU benchmark to do some experiments on my Ubuntu laptop (Ubuntu 15.10, 7.7 GiB memory, Intel Core i7-4500U CPU @ 1.80GHz x 4, 64-bit). In an ideal world, I would like to have a benchmark satisfying the following:
The benchmark should be an official one rather than something I created myself, for transparency purposes.
The time needed to execute the benchmark on my laptop should be at least 5 minutes (the more the better).
The benchmark should produce different levels of CPU utilization throughout its execution. For example, I don't want a benchmark which permanently keeps the CPU utilization at around 100% - I want one which makes the CPU utilization vary over time.
Points 2 and 3 especially are really key for my research. However, I couldn't find any suitable benchmark so far. The benchmarks I found include sysbench, CPU Fibonacci, CPU Blowfish, CPU CryptoHash and CPU N-Queens. However, all of them complete in just a couple of seconds and keep the utilization level on my laptop constantly at 100%.
Question: Does anyone know about a suitable benchmark for me? I am also happy to hear any other comments/questions you have. Thank you!
To choose a benchmark, you need to know exactly what you're trying to measure. Your question doesn't include that, so there's not much anyone can tell you without taking a wild guess.
If you're trying to measure how well Turbo clock speed works to make a power-limited CPU like your laptop run faster for bursty workloads (e.g. to compare Haswell against Skylake's new and improved power management), you could just run something trivial that's 1 second on, 2 seconds off, and count how many loop iterations it manages.
The duty cycle and cycle length should be benchmark parameters, so you can make plots. e.g. with very fast on/off cycles, Skylake's faster-reacting Turbo will ramp up faster and drop down to min power faster (leaving more headroom in the bank for the next burst).
The speaker in that talk (the lead architect for power management on Intel CPUs) says that Javascript benchmarks are actually bursty enough for Skylake's power management to give a measurable speedup, unlike most other benchmarks which just peg the CPU at 100% the whole time. So maybe have a look at Javascript benchmarks, if you want to use well-known off-the-shelf benchmarks.
If rolling your own, put a loop-carried dependency chain in the loop, preferably with something that's not too variable in latency across microarchitectures. A long chain of integer adds would work, and Fibonacci is a good way to stop the compiler from optimizing it away. Either pick a fixed iteration count that works well for current CPU speeds, or check the clock every 10M iterations.
Or set a timer that will fire after some time and have it set a flag (e.g. from a signal handler) that you check inside the loop. Specifically, alarm(2) may be a good choice. Record how many iterations you did in this burst of work; a sketch of this approach follows below.
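A rough sketch of that roll-your-own approach, combining the loop-carried dependency chain with an alarm(2)-driven stop flag; the 1-second-on / 2-seconds-off duty cycle and the 100 bursts (about 5 minutes total) are arbitrary placeholders you would want to turn into parameters:

    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t stop;
    static void on_alarm(int sig) { (void)sig; stop = 1; }

    int main(void)
    {
        signal(SIGALRM, on_alarm);

        for (int burst = 0; burst < 100; burst++) {
            uint64_t a = 0, b = 1, iters = 0;
            stop = 0;
            alarm(1);                 /* end the "on" phase after 1 second */
            while (!stop) {
                /* Loop-carried dependency chain (Fibonacci-style adds): each
                 * iteration depends on the previous one, so the compiler
                 * cannot collapse or vectorize the work away. */
                uint64_t t = a + b;
                a = b;
                b = t;
                iters++;
            }
            printf("burst %d: %llu iterations (low bits of result: %llu)\n",
                   burst, (unsigned long long)iters, (unsigned long long)b);
            sleep(2);                 /* "off" phase: let the CPU idle */
        }
        return 0;
    }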
For several hours now I have been trying to find a way to measure time intervals within assembly code. What I have seen so far is that I can query the number of CPU cycles, but of course I'd need to know the CPU frequency to translate a number of cycles into time. I found the rdmsr instruction, but it is a ring-0 instruction, and ring 0 is not somewhere I can put my code.
Some examples I've found call the Windows Query* functions for this, but I am not running on Windows. Is there any way for me to measure a time interval at user level? Any other way to get the frequency, or maybe another clock I can access directly? A system clock with one-second resolution is of course out of the question :)
I spent quite a while working with cycle counters, and eventually came to the (perhaps obvious) conclusion that RDTSC counts cycles, not time. It will never count time because the computer's clock is being constantly ramped up and down by the power management unit. So the cycle counter is extremely precise for measuring cycles, horribly off by random amounts in real time units. I believe Intel eventually addressed this by locking the cycle counter to a clock that is not affected by the PMU, but I haven't investigated it.
The Windows Query* functions do not actually use the RDTSC cycle counter. I thought they did until I tried to measure really small periods and found it had a 14MHz(?) tick, which turned out to be the PCI data bus clock.
On top of all this, each core has its own cycle counter, so you have to pay attention to which core you are running on when executing the RDTSC instruction. And each core has its own PMU.
The best timer you will find in Windows user mode is QueryPerformanceCounter() and QueryPerformanceFrequency().
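For what it's worth, on Linux a user-level version of the "pin to one core, then read the TSC" approach might look like the sketch below (x86 with GCC or Clang assumed; the choice of core 0 and the dummy workload are arbitrary, and the result is still a tick count, not time, unless you calibrate it against a real clock):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>          /* __rdtsc() */

    int main(void)
    {
        /* Pin to a single core so both reads come from the same TSC
         * (older CPUs do not synchronize the counters across cores). */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        uint64_t start = __rdtsc();

        /* ... code under test (placeholder work) ... */
        volatile double x = 0.0;
        for (int i = 0; i < 1000000; i++)
            x += i * 0.5;

        uint64_t end = __rdtsc();

        /* Ticks, not seconds: divide by the TSC frequency (or calibrate
         * against CLOCK_MONOTONIC) to convert to time. */
        printf("elapsed: %llu TSC ticks\n", (unsigned long long)(end - start));
        return 0;
    }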
I setup collectd on my Debian 6 virtual machine for monitoring and performance analysis. Collectd's processes plugin provides statistics about a process' cpu usage, though what units these statistics have is not documented anywhere. It's certainly not jiffies or milliseconds, since the total cpu usage of several processes could go as high as 400,000 (of some unknown unit) per second on a 4 core virtual machine.
By looking at collectd's source code (https://github.com/collectd/collectd/blob/master/src/processes.c, in the ps_read_process function), I figured out this data is read from the /proc/$pid/stat file of the process. The proc man page (http://man7.org/linux/man-pages/man5/proc.5.html) says the CPU usage there is measured in clock ticks.
This is nice, but clock ticks are a little arbitrary for monitoring and performance analysis. I'd like to convert the clock-tick values to something more meaningful, ideally a percentage of total CPU time. How can I do that in a portable way, without just assuming my processor provides 3 GHz worth of clock ticks?
Further inspection of collectd's code revealed that the cpu usage is converted to microseconds.
Also, turns out a similar question was already asked and answered here.
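For completeness, here is a sketch of the tick-to-seconds conversion that avoids hard-coding any rate by using sysconf(_SC_CLK_TCK); the field positions come from proc(5), and /proc/self/stat is used purely as an example target:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        FILE *f = fopen("/proc/self/stat", "r");
        if (!f) {
            perror("/proc/self/stat");
            return 1;
        }
        if (!fgets(buf, sizeof(buf), f)) {
            fclose(f);
            return 1;
        }
        fclose(f);

        /* comm (field 2) may contain spaces, so parse from the last ')'. */
        char *p = strrchr(buf, ')');
        if (!p)
            return 1;
        p += 2;   /* skip ") " -- p now points at field 3 (state) */

        /* utime and stime are fields 14 and 15 in proc(5), i.e. the 11th
         * and 12th tokens after the state character. */
        unsigned long utime, stime;
        char state;
        if (sscanf(p, "%c %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %lu %lu",
                   &state, &utime, &stime) != 3)
            return 1;

        long ticks_per_sec = sysconf(_SC_CLK_TCK);   /* usually 100 on Linux */
        printf("cpu time: %.2f s user + %.2f s system\n",
               (double)utime / ticks_per_sec, (double)stime / ticks_per_sec);
        /* For a percentage, sample twice and divide the delta (in seconds)
         * by the wall-clock interval between the samples. */
        return 0;
    }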
Some of the things I want to measure are very short, and I can only repeat them so many times if I don't run any of the setup/dispose code in the middle.
Note: on Linux, reading /proc/stat
Not very portable and you'll have to take great care so it is reliable, but the Time Stamp Counter definitely has the highest resolution available (increases at every CPU tick).
The time stamp counter has, until recently, been an excellent high-resolution, low-overhead way of getting CPU timing information. With the advent of multi-core/hyperthreaded CPUs, systems with multiple CPUs, and "hibernating" operating systems, the TSC cannot be relied on to provide accurate results - unless great care is taken to correct the possible flaws: rate of tick and whether all cores (processors) have identical values in their time-keeping registers. There is no promise that the timestamp counters of multiple CPUs on a single motherboard will be synchronized. In such cases, programmers can only get reliable results by locking their code to a single CPU. Even then, the CPU speed may change due to power-saving measures taken by the OS or BIOS, or the system may be hibernated and later resumed (resetting the time stamp counter). In those latter cases, to stay relevant, the counter must be recalibrated periodically (according to the time resolution your application requires).
There are also some notes on that page about Linux-specific solutions:
Under Linux, similar functionality is provided by reading the value of the CLOCK_MONOTONIC clock using the POSIX clock_gettime function.
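A minimal example of that Linux approach, timing an interval with clock_gettime(CLOCK_MONOTONIC); the busy loop in the middle is only a placeholder workload (on older glibc you may need to link with -lrt):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);

        /* ... code under test (placeholder) ... */
        volatile unsigned long sink = 0;
        for (unsigned long i = 0; i < 10000000UL; i++)
            sink += i;

        clock_gettime(CLOCK_MONOTONIC, &end);

        double elapsed = (double)(end.tv_sec - start.tv_sec)
                       + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
        printf("elapsed: %.9f s\n", elapsed);
        return 0;
    }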