What is lockstep sampling? - linux

I have seen this term in several posts about profiling applications but I don't understand what it actually means and how it affects profiling results.
I have seen it here for dtrace:
The rate is also increased to 199 Hertz, as capturing kernel stacks is
much less expensive than user-level stacks. The odd numbered rates, 99
and 199, are used to avoid sampling in lockstep with other activity
and producing misleading results.
Here for perf:
-F 99: sample at 99 Hertz (samples per second). I'll sometimes sample faster than this (up to 999 Hertz), but that also costs overhead. 99
Hertz should be negligible. Also, the value '99' and not '100' is to
avoid lockstep sampling, which can produce skewed results.
From what I have seen, all profilers should avoid lockstep sampling because the results can be "skewed" and "misleading", but I don't understand why. I guess this question applies to all profilers, but I am particularly interested in perf on Linux.

Lockstep sampling is when the profiling samples occur at the same frequency as a loop in the application. The result is that the samples tend to land at the same place in the loop, so the profiler reports that operation as the most common one, and as a likely bottleneck.
An analogy would be if you were trying to determine whether a road experiences congestion, and you sample it every 24 hours. That sample is likely to be in lock-step with traffic variation; if it's at 8am or 5pm, it will coincide with rush hour and conclude that the road is extremely busy; if it's at 3am it will conclude that there's practically no traffic at all.
For sampling to be accurate, it needs to avoid this. Ideally, the samples should be much more frequent than any cycles in the application, or at random intervals, so that the chance it occurs in any particular operation is proportional to the amount of time that operation takes. But this is often not feasible, so the next best thing is to use a sampling rate that doesn't coincide with the likely frequency of program cycles. If there are enough cycles in the program, this should ensure that the samples take place at many different offsets from the beginning of each cycle.
To relate this to the above analogy, sampling every 23 hours or at random times each day will cause the samples to eventually encounter all times of the day; every 23-day cycle of samples will include all hours of the day. This produces a more complete picture of the traffic levels. And sampling every hour would provide a complete picture in just a few weeks.
I'm not sure why odd-numbered frequencies are likely to ensure this. It seems to be based on the assumption that the natural frequencies of program operations (and of other periodic activity on the system, such as timer ticks) tend to be round numbers like 100 or 1000, so a slightly-off rate such as 99 Hz drifts relative to them instead of resonating.
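To make this concrete, here is a small, purely illustrative Python sketch (not a real profiler): a simulated program spends 20% of every 10 ms loop iteration in one phase, and we "sample" it at 100 Hz versus 99 Hz. The 10 ms loop period and the 0.5 ms starting offset are arbitrary choices for the illustration.

# Illustrative simulation: a program loop repeats every 10 ms, spending
# 2 ms in phase "A" and 8 ms in phase "B". We "sample" it at 100 Hz
# (in lockstep with the loop) and at 99 Hz, and compare the results.

LOOP_PERIOD = 0.010   # seconds per iteration of the program's inner loop
PHASE_A_LEN = 0.002   # first 2 ms of each iteration is phase A (20% of the time)

def phase_at(t):
    """Which phase is the simulated program in at absolute time t?"""
    return "A" if (t % LOOP_PERIOD) < PHASE_A_LEN else "B"

def sample(freq_hz, duration=100.0, start_offset=0.0005):
    counts = {"A": 0, "B": 0}
    period = 1.0 / freq_hz
    t = start_offset
    while t < duration:
        counts[phase_at(t)] += 1
        t += period
    return counts["A"] / (counts["A"] + counts["B"])

print("true share of phase A: 20%")
print(f"measured at 100 Hz (lockstep): {sample(100):.0%}")  # every sample lands at the same loop offset
print(f"measured at  99 Hz:            {sample(99):.0%}")   # samples drift across the loop, ~20%

At 100 Hz every sample lands at the same offset into the loop, so the profile reports 100% (or 0%) for a phase that actually takes 20% of the time; at 99 Hz the samples drift across the loop and converge on roughly the true 20%.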

Related

WMI vs psutil for benchmarking CPU [closed]

I am using three methods to get the CPU usage of a Windows machine:
psutil with interval of 1 sec:
cpu_usage=psutil.cpu_percent(interval=1)
psutil with interval of 10 sec:
cpu_usage=psutil.cpu_percent(interval=10)
WMI package:
c = wmi.WMI()
query = "SELECT * FROM Win32_PerfFormattedData_PerfOS_Processor where name='_Total'"
cpu_usage = c.query(query)[0].PercentUserTime
There is a difference of approximately 30% between the measurements.
I would like to know the advantages and disadvantages of each approach, and which of them is the most accurate.
To understand CPU measurement variability, you first need to understand Windows clock ticks (which form the basis of the measurements), logical CPUs (each of which is counted as either active or idle in any given tick), and the way the methods you describe perform their calculations.
The default Windows clock frequency is 64 ticks per second, or 15.625 milliseconds per tick. If you were to look at the incrementing tick counters in the WMI table Win32_PerfRawData_PerfOS_Processor (the raw data that the table you reference above draws from in its calculations) you would see them incrementing in multiples of 156250 (100-nanosecond ticks). The large number of nanoseconds gives a false sense of precision, when in reality, each processor can be off by one tick at both the beginning and end of each measurement interval by up to 1/64 of a second.
So the elapsed time for what you think of as one second could be as short as 62/64 seconds or as long as 66/64 seconds, a +/- 3.125% inaccuracy. Not that big a deal, but one would probably prefer a 10-second interval with its +/- 0.3125% range.
This lack of precision is a bigger issue for the numerator, however, as you don't have the full second to work with. The ticks are split between user, system, and idle for each logical processor. Consider, for example, a quad-core system with hyperthreading using 25% CPU, split 12.5% system, 12.5% user, and 75% idle. You have eight logical processors. During the 1/4 of a second that each processor is active, only 16 clock ticks elapse, 8 ticks in system mode and 8 in user mode. But you could be under or over by 2 clock ticks if you picked the wrong start and end times, a range of 6 to 10 ticks for each of the 8 logical processors. So if you added up ticks for the whole system, you'd get from 48 to 80 ticks vs. the expected 64, a potential error of up to +/- 25%. Two independent measurements in identical circumstances could be off from each other by up to 50%. Want to add "user+system" for overall CPU usage? Up to 100% difference in successive measurements. Of course these are worst case extremes, the actual error will be smaller, and if you averaged all the errors after a long time, they'd net out to zero.
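To make that worst-case arithmetic easy to re-run for other configurations, here is a tiny sketch that reproduces the numbers in the example above; the core count, busy fraction, and tick rate are simply the example's assumptions.

# Worst-case error bounds for tick-based CPU accounting, reproducing the
# quad-core-with-hyperthreading example above (illustrative arithmetic only).

TICKS_PER_SEC = 64          # default Windows clock: 64 ticks/second
logical_cpus = 8            # 4 cores x 2 hyperthreads
busy_fraction = 0.125       # 12.5% user time per logical CPU over the interval
interval = 1.0              # seconds

expected_per_cpu = busy_fraction * interval * TICKS_PER_SEC   # 8 ticks
error_per_cpu = 2                                             # +/- 1 tick at each end of the interval

expected_total = expected_per_cpu * logical_cpus              # 64 ticks
low = (expected_per_cpu - error_per_cpu) * logical_cpus       # 48 ticks
high = (expected_per_cpu + error_per_cpu) * logical_cpus      # 80 ticks

print(f"expected {expected_total:.0f} ticks, worst case {low:.0f}..{high:.0f} "
      f"({(high - expected_total) / expected_total:+.0%})")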
Now, regarding the methods in your question:
1 and 2: psutil.cpu_percent calls GetSystemTimes under the hood. This has an advantage over WMI methods in that you don't get the additional error involved in summing results. However, realize that you're getting the "cumulative total time since the computer started" for these values. A system bouncing between 10% and 30% usage will give you a calculation of about 20% every time with little change. In order to measure CPU usage across shorter time frames, you need to save the result from the previous call, and then measure the differences over the elapsed time. (You should also sum the elapsed ticks to use as the elapsed time, to avoid further inaccuracies.) In this case, a longer interval for the calculation will reduce the errors, although you could certainly do a rolling calculation (e.g., poll every 1 second but use a 10-second historical value for the usage).
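As a minimal sketch of that "save the previous reading and diff it" approach, assuming psutil is available as in the question, something along these lines computes usage over an interval you control. Note that psutil.cpu_percent already does this internally; this is only to show the bookkeeping.

import time
import psutil

def cpu_percent_between(prev, cur):
    """Overall CPU usage between two psutil.cpu_times() snapshots,
    computed from the user/system/idle deltas, not the cumulative totals."""
    busy = (cur.user - prev.user) + (cur.system - prev.system)
    idle = cur.idle - prev.idle
    total = busy + idle            # use the counters themselves as the elapsed time
    return 100.0 * busy / total if total else 0.0

prev = psutil.cpu_times()
for _ in range(6):
    time.sleep(10)                 # a longer interval shrinks the relative tick error
    cur = psutil.cpu_times()
    print(f"CPU over the last interval: {cpu_percent_between(prev, cur):.1f}%")
    prev = cur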
3: The WMI _Total instance just sums the individual logical-processor values, so it has that increased lack of precision (worse the more processors you have) compared with the system-wide call used in 1 and 2. Additionally, the "Formatted" data is based on the "Raw" data in a similarly named table, with the elapsed time based on your sample interval. This means the very first sample will give you results similar to the cumulative-since-start usage in the first method, but later samples will either give you unchanged data (if your interval is too short) or the "current" usage since the last sample. On top of this relative lack of control over the calculation with "Formatted" data (which you could overcome by doing the calculation yourself from the "Raw" data), WMI has the disadvantage of COM overhead when querying the performance counters (sometimes very slow for the first query), while the psutil method is a direct native system call and much faster to use repeatedly.
In summary: for total system time, you're best off using the psutil values, but doing your own calculation based on the differences between successive measurements at a polling interval that you choose. Longer intervals give more accurate measurements but less ability to measure spikes in usage.
WMI gives you more insight into per-processor usage, but trades extra overhead for ease of querying. If you really want per-processor numbers, you might want to get them directly from the system performance counters (which is where WMI ultimately gets them).
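If you do go the per-processor route, here is a hedged sketch of doing the calculation yourself from the raw table mentioned above. The property names and the inverted-counter formula (the raw PercentProcessorTime counts idle time, so usage = 1 - delta(counter) / delta(timestamp)) are my recollection of the documented raw-counter semantics, so verify them against the Win32_PerfRawData_PerfOS_Processor documentation before relying on this.

import time
import wmi

# Sketch: per-processor usage from the *raw* counters, so the interval is
# under your control rather than WMI's. Property names and counter semantics
# are assumptions based on the documented raw performance classes.
c = wmi.WMI()

def snapshot():
    return {p.Name: (int(p.PercentProcessorTime), int(p.Timestamp_Sys100NS))
            for p in c.Win32_PerfRawData_PerfOS_Processor()}

prev = snapshot()
time.sleep(10)                       # the interval is yours to choose
cur = snapshot()

for name in sorted(cur):
    (c0, t0), (c1, t1) = prev[name], cur[name]
    usage = (1.0 - (c1 - c0) / (t1 - t0)) * 100.0   # inverted (idle-time) counter
    print(f"CPU {name}: {usage:.1f}%")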

how to change perf_event_open max sample rate

I'm using perf_event_open to collect samples, and I'm trying to capture every hit of the point I'm monitoring, but perf_event_open is not sampling fast enough. I tried to change the sample rate using the command below:
echo 10000000 > /proc/sys/kernel/perf_event_max_sample_rate
But it looks like the value I set was too large: after running my code, perf_event_max_sample_rate is changed back to a lower value such as 12500. And when I try even bigger values, for example 20000000 or 50000000, the sampling speed does not increase to match. Is there any way to make perf_event_open sample faster?
It is really not possible to increase the perf_event_max_sample_rate beyond a certain value.
I have tried increasing it above 100,000, for example to 200,000 or more. Every time I did this, the max sample rate came back down to something like 146,500 samples/sec or less. If I recall correctly, that was the maximum I could achieve (i.e. 146,500 samples/sec). This will of course depend on the kind of machine you are using, the CPU frequencies, etc. I was working on an Intel Xeon v5 Broadwell CPU.
Zulan makes a good point. To make things clearer: perf sample collection is based on interrupts. Every time the sampling counter overflows, perf raises an NMI (non-maskable interrupt). The handler also measures the time it takes to handle the whole interrupt. You can see this in the kernel code below:
perf_event_nmi_handler
Once it has calculated the time taken to handle the interrupt, it calls another function (passing that handling time as a parameter) which compares the time it takes to handle an interrupt against what the current perf_event_max_sample_rate allows. If the interrupts take long enough, and samples are being generated very frequently, the CPU obviously cannot keep up: interrupt work starts queuing up and you observe some amount of CPU throttling. The function below shows how the kernel then keeps trying to reduce the sample rate:
perf_event_sample_took
Of course, as Zulan suggested, you can try setting it to 0, but you would likely still get the same maximum number of samples from perf while putting more strain on the CPU; it is not possible to raise the maximum further unless you find other means (such as tweaking the buffer, if that is possible).
This is a mechanism to limit the overhead caused by perf. You can disable it by setting
sysctl -w kernel.perf_cpu_time_max_percent=0
Use at your own risk - the system may stop responding.
https://www.kernel.org/doc/Documentation/sysctl/kernel.txt
perf_cpu_time_max_percent:
Hints to the kernel how much CPU time it should be allowed to use to
handle perf sampling events. If the perf subsystem is informed that
its samples are exceeding this limit, it will drop its sampling
frequency to attempt to reduce its CPU usage.
Some perf sampling happens in NMIs. If these samples unexpectedly
take too long to execute, the NMIs can become stacked up next to each
other so much that nothing else is allowed to execute.
0: disable the mechanism. Do not monitor or correct perf's
sampling rate no matter how CPU time it takes.
1-100: attempt to throttle perf's sample rate to this percentage of
CPU. Note: the kernel calculates an "expected" length of each
sample event. 100 here means 100% of that expected length. Even
if this is set to 100, you may still see sample throttling if this
length is exceeded. Set to 0 if you truly do not care how much CPU
is consumed.
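For convenience, here is a small Python sketch equivalent to the echo/sysctl commands above: it raises the rate ceiling, disables the throttling mechanism, and reads the values back so you can check whether the kernel has lowered them again after your workload runs. It needs root, and, as the warning above says, use it at your own risk. The ordering (rate first, then the governor) reflects my understanding that some kernels refuse writes to the rate sysctl while throttling is disabled; treat that as an assumption.

# Sketch: adjust the perf sampling sysctls from Python (root required).
# Equivalent to:
#   echo 10000000 > /proc/sys/kernel/perf_event_max_sample_rate
#   sysctl -w kernel.perf_cpu_time_max_percent=0
MAX_RATE = "/proc/sys/kernel/perf_event_max_sample_rate"
CPU_PCT = "/proc/sys/kernel/perf_cpu_time_max_percent"

def read(path):
    with open(path) as f:
        return f.read().strip()

def write(path, value):
    with open(path, "w") as f:
        f.write(str(value))

write(MAX_RATE, 10_000_000)   # raise the ceiling first ...
write(CPU_PCT, 0)             # ... then disable the throttling mechanism
print("perf_event_max_sample_rate:", read(MAX_RATE))
print("perf_cpu_time_max_percent:", read(CPU_PCT))

# Run your perf_event_open workload, then re-read MAX_RATE: if throttling is
# still active, the kernel may have lowered it again (e.g. to ~12500).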

Linux: CPU benchmark requiring longer time and different CPU utilization levels

For my research I need a CPU benchmark to do some experiments on my Ubuntu laptop (Ubuntu 15.10, Memory 7.7 GiB, Intel Core i7-4500U CPU @ 1.80GHz x 4, 64-bit). In an ideal world, I would like to have a benchmark satisfying the following:
The benchmark should be an established, official one rather than something I wrote myself, for transparency purposes.
The time needed to execute the benchmark on my laptop should be at least 5 minutes (the more the better).
The benchmark should result in different levels of CPU throughout execution. For example, I don't want a benchmark which permanently keeps the CPU utilization level at around 100% - so I want a benchmark which will make the CPU utilization vary over time.
Especially points 2 and 3 are really key for my research. However, I couldn't find any suitable benchmarks so far. Benchmarks I found so far include: sysbench, CPU Fibonacci, CPU Blowfish, CPU Cryptofish, CPU N-Queens. However, all of them just need a couple of seconds to complete and the utilization level on my laptop is at 100% constantly.
Question: Does anyone know about a suitable benchmark for me? I am also happy to hear any other comments/questions you have. Thank you!
To choose a benchmark, you need to know exactly what you're trying to measure. Your question doesn't include that, so there's not much anyone can tell you without taking a wild guess.
If you're trying to measure how well Turbo clock speed works to make a power-limited CPU like your laptop run faster for bursty workloads (e.g. to compare Haswell against Skylake's new and improved power management), you could just run something trivial that's 1 second on, 2 seconds off, and count how many loop iterations it manages.
The duty cycle and cycle length should be benchmark parameters, so you can make plots. e.g. with very fast on/off cycles, Skylake's faster-reacting Turbo will ramp up faster and drop down to min power faster (leaving more headroom in the bank for the next burst).
The speaker in that talk (the lead architect for power management on Intel CPUs) says that Javascript benchmarks are actually bursty enough for Skylake's power management to give a measurable speedup, unlike most other benchmarks which just peg the CPU at 100% the whole time. So maybe have a look at Javascript benchmarks, if you want to use well-known off-the-shelf benchmarks.
If rolling your own, put a loop-carried dependency chain in the loop, preferably with something that's not too variable in latency across microarchitectures. A long chain of integer adds would work, and Fibonacci is a good way to stop the compiler from optimizing it away. Either pick a fixed iteration count that works well for current CPU speeds, or check the clock every 10M iterations.
Or set a timer that will fire after some time, and have it set a flag that you check inside the loop (e.g. from a signal handler); specifically, alarm(2) may be a good choice. Record how many iterations you completed in each burst of work.
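Here is a rough Python sketch of that structure, using signal.alarm (Python's wrapper around alarm(2)) and a Fibonacci-style loop-carried dependency chain. The burst length, duty cycle, and chunk size are arbitrary parameters, and in a compiled language you would be measuring the CPU rather than the interpreter; this only shows the shape of the benchmark.

import signal
import time

# Bursty micro-benchmark sketch: run a loop-carried dependency chain for
# 1 second, sleep for 2 seconds, and record how much work each burst did.

stop = False

def on_alarm(signum, frame):
    global stop
    stop = True

signal.signal(signal.SIGALRM, on_alarm)

def burst(seconds=1):
    """Count Fibonacci-style steps completed before the alarm fires."""
    global stop
    stop = False
    signal.alarm(seconds)              # fires SIGALRM after `seconds`
    a, b, iters = 0, 1, 0
    while not stop:
        for _ in range(100_000):       # check the flag only every 100k steps
            a, b = b, (a + b) & 0xFFFFFFFF
        iters += 100_000
    return iters

for cycle in range(10):                # 1 s of work, 2 s idle, repeated
    done = burst(1)
    print(f"burst {cycle}: {done} iterations")
    time.sleep(2)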

Approximate Number of CPU Cycles for Various Operations

I am trying to find a reference for approximately how many CPU cycles various operations require.
I don't need exact numbers (as this is going to vary between CPUs) but I'd like something relatively credible that gives ballpark figures that I could cite in discussion with friends.
As an example, we all know that floating point division takes more CPU cycles than say doing a bitshift.
I'd guess that the difference is that division is around 100 cycles, whereas a shift is 1, but I'm looking for something to cite to back that up.
Can anyone recommend such a resource?
I wrote a small app to test this - a very approximate one, using the SynthMaker free edition. "e" is for empty; the numbers are very approximate cycle counts:
divide|e:115|10
mult|e: 48|10
add|e: 48|10
subs|e: 50|10
compare>|e: 50|10
sin|e:135|10
The readings in the cycle analyser vary wildly, from 50 to 100, usually single or double the expected amount; the figures above are averages. The cycle analyser is a very rough tool, but it gives fair results. As a rough cross-check, a user-made exponent routine coded in ASM, which calculates both the exponent and the base at audio rate, comes in at around 800 cycles, so I'd say the figures above are accurate to within at least 50 percent. I thought the divide would be far more expensive; it seems to be only about twice as much. If you want the file I made to run in the SM free version, mail me. I was going to save an exe (which is why I made it), but you can't save in the free version, silly me, and I am not going to recode it from scratch in version 1.17.
ant.stewart at the place yahoo dotty com.
For x86 processors, see the Intel® 64 and IA-32 Architectures Optimization Reference Manual, probably Appendix C.
However, it's not in any way easy to figure out how many cycles an instruction takes to execute on a modern x86 processor, as it depends too much on things like whether the data is in cache, whether the access is aligned, whether branch prediction fails, whether there's a stall in the instruction pipeline, and quite a lot of other factors.
This is going to be hardware-dependent. The best thing to do is to run some benchmarks on the particular hardware you want to test.
A benchmark would go roughly like this:
Run a primitive operation a million times (say, adding two integers)
Record the time it took to run (say, in seconds)
Multiply by the number of cycles your machine executes per second - this will give you the total number of cycles spent.
Divide the total cycle count from the previous step by 1,000,000 - this will give you the number of cycles per operation. Keep in mind that with pipelining and superscalar execution, this can be less than 1.
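As a sketch of that recipe in Python, where the interpreter overhead dwarfs the operation itself, so treat the result only as a rough upper bound; CPU_HZ is an assumed nominal clock speed that you must supply for your own machine.

import time

CPU_HZ = 3_000_000_000       # assumed clock speed of the machine under test
N = 1_000_000

x = 0
start = time.perf_counter()
for i in range(N):
    x = x + i                # the "primitive operation": an integer add
elapsed = time.perf_counter() - start

total_cycles = elapsed * CPU_HZ
print(f"{elapsed:.4f} s for {N} adds")
print(f"~{total_cycles / N:.1f} cycles per add (interpreter overhead included)")
print(f"~{N / total_cycles:.3f} adds per cycle")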
There is research by Agner Fog:
Instruction tables
Instruction tables: Lists of instruction latencies, throughputs and
micro-operation breakdowns for Intel, AMD, and VIA CPUs.
Last updated 2021-03-22

Comparing CPU speed likely improvements for business hardware upgrade justification

I have a C# console app, a Monte Carlo simulation that is entirely CPU bound; execution time is inversely proportional to the number of dedicated threads/cores available (I keep a 1:1 ratio between cores and threads).
It currently runs daily on:
AMD Opteron 275 @ 2.21 GHz (4 cores)
The app is multithreaded, using 3 threads; the 4th core is reserved for another Process Controller app.
It takes 15 hours per day to run.
I need to estimate as best I can how long the same work would take to run on a system configured with the following CPU's:
http://en.wikipedia.org/wiki/Intel_Nehalem_(microarchitecture)
2 x X5570
2 x X5540
and compare the two cases; I will recode it to use all available threads. I want to justify that we need a server with 2 x X5570 CPUs over the cheaper X5540 (they support 2 CPUs on a single motherboard). This should make 8 cores, 16 threads (that's how the Nehalem chips work, I believe) available to the operating system - so for my app that's 15 threads for the Monte Carlo simulation.
Any ideas how to do this? Is there a website I can go and see benchmark data for all 3 CPUS involved for a single threaded benchmark? I can then extrapolate for my case and number of threads. I have access to the current system to install and run a benchmark on if necessary.
Note the business are also dictating the workload for this app over the next 3 months will increase about 20 times and needs to complete in a 24 hour clock.
Any help much appreciated.
Have also posted this here: http://www.passmark.com/forum/showthread.php?t=2308 hopefully they can better explain their benchmarking so I can effectively get a score per core which would be much more helpful.
Have you considered recreating the algorithm in CUDA? It uses current-day GPUs to speed up calculations like these 10-100 fold. That way you would just need to buy a fast video card.
Finding a single-box server which can scale according to the needs you've described is going to be difficult. I would recommend looking at Sun CoolThreads or other high-thread count servers even if their individual clock speeds are lower. http://www.sun.com/servers/coolthreads/overview/performance.jsp
The T5240 supports 128 threads: http://www.sun.com/servers/coolthreads/t5240/index.xml
Memory and CPU cache bandwidth may be a limiting factor for you if the datasets are as large as they sound. How much time is spent getting data from disk? Would massively increased RAM sizes and caches help?
You might want to step back and see if there is a different algorithm which can provide the same or similar solutions with fewer calculations.
It sounds like you've spent a lot of time optimizing the calculation thread, but is every calculation being performed actually important to the final result?
Is there a way to shortcut calculations anywhere?
Is there a way to identify items which have negligible effects on the end result, and skip those calculations?
Can a lower resolution model be used for early iterations with detail added in progressive iterations?
Monte Carlo algorithms I am familiar with are non-deterministic, and run time would be related to the number of samples; is there any way to optimize the sampling model to limit the number of items examined?
Obviously I don't know what problem domain or data set you are processing, but there may be another approach which can yield equivalent results.
tomshardware.com contains a comprehensive list of CPU benchmarks. However, you can't just divide the scores; you need as close to an apples-to-apples comparison as you can get, and you won't quite get one, because the result depends on the mix of instructions in your workload.
I would guess (please don't take this as official; you need real data for this) that you're probably looking at a 1.5x - 1.75x single-threaded speedup if the work is CPU bound and not highly vectorized.
You also need to take into account that you are:
1) You are using C# and the CLR; unless you've taken steps to prevent it, the GC may kick in and serialize your threads.
2) The Nehalems have hyperthreading, so you won't see a perfect 16x speedup; more likely 8x to 12x, depending on how well optimized your code is. Be optimistic here, though (just don't expect 16x).
3) I don't know how much contention you have; good scaling on 3 threads != good scaling on 16 threads, and there may be dragons here (there usually are).
I would envelope calc this as:
15 hours * 3 threads / 1.5 x = 30 hours of single threaded work time on a nehalem.
30 / 12 = 2.5 hours (best case)
30 / 8 = 3.75 hours (worst case)
implies a parallel run time if there is truly a 20x increase:
2.5 hours * 20 = 50 hours (best case)
3.75 hours * 20 = 75 hours (worst case)
How much have you profiled? Can you squeeze 2x out of the app? One server may be enough, but it likely won't be.
And for gosh sakes, try out the Task Parallel Library in .NET 4.0 or the .NET 3.5 CTP; it's supposed to help with this sort of thing.
-Rick
I'm going to go out on a limb and say that even the dual-socket X5570 will not be able to scale to the workload you envision. You need to distribute your computation across multiple systems. Simple math:
Current Workload
3 cores * 15 real-world-hours = 45 cpu-time-hours
Proposed 20X Workload
45 cpu-time-hours * 20 = 900 cpu-time-hours
900 cpu-time-hours / (20 hours-per-day-per-core) = 45 cores
Thus, you would need the equivalent of 45 2.2GHz Opteron cores to achieve your goal (even after stretching the daily processing window from 15 hours to 20 hours), assuming completely linear scaling of performance. Even if the Nehalem CPUs are 3X faster per thread, you will still be at the outside edge of your performance envelope - with no room to grow. That also assumes hyper-threading will even work for your application.
The best-case estimates I've seen would put the X5570 at perhaps 2X the performance of your existing Opteron.
Source: http://www.dailytech.com/Server+roundup+Intel+Nehalem+Xeon+versus+AMD+Shanghai+Opteron/article15036.htm
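As a back-of-envelope check, here is a small sketch that reproduces the sizing arithmetic above. The per-core speedup factors are assumptions (the answers above suggest very roughly 1.5x to 3x per thread for Nehalem over the 2.2 GHz Opteron); plug in your own measured number.

# Back-of-envelope sizing, reproducing the arithmetic above.
current_cores = 3
current_hours = 15
workload_multiplier = 20
daily_budget_hours = 20          # hours per day each core can realistically run

cpu_hours = current_cores * current_hours            # 45 Opteron-core-hours today
future_cpu_hours = cpu_hours * workload_multiplier   # 900 Opteron-core-hours

for per_core_speedup in (1.0, 2.0, 3.0):             # assumed speedup vs. the Opteron
    needed = future_cpu_hours / per_core_speedup / daily_budget_hours
    print(f"speedup {per_core_speedup:.1f}x -> ~{needed:.1f} cores needed")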
It'd be swinging a big hammer, but perhaps it makes sense to look at some heavy-iron 4-way servers. They are expensive, but at least you could get up to 24 physical cores in a single box. If you've exhausted all other means of optimization (including SIMD), then it's something to consider.
I'd also be wary of other bottlenecks such as memory bandwidth. I don't know the performance characteristics of Monte Carlo simulations, but ramping up one resource might reveal some other bottleneck.
