How to measure lock contention? - statistics

I'm reading http://lse.sourceforge.net/locking/dcache/dcache_lock.html, in which spinlock time for each functions is measured:
SPINLOCKS HOLD WAIT
UTIL CON MEAN( MAX ) MEAN( MAX )(% CPU) TOTAL NOWAIT SPIN RJECT NAME
5.3% 16.5% 0.6us(2787us) 5.0us(3094us)(0.89%) 15069563 83.5% 16.5% 0% dcache_lock
0.01% 10.9% 0.2us( 7.5us) 5.3us( 116us)(0.00%) 119448 89.1% 10.9% 0% d_alloc+0x128
0.04% 14.2% 0.3us( 42us) 6.3us( 925us)(0.02%) 233290 85.8% 14.2% 0% d_delete+0x10
0.00% 3.5% 0.2us( 3.1us) 5.6us( 41us)(0.00%) 5050 96.5% 3.5% 0% d_delete+0x94
I'd like to know where these statistics are from. I tried oprofile, but it seems oprofile cannot measure lock holding and waiting time for a specific lock. And valgrind's drd slows down applications too much, which will make the result less accurate and also consume too much time. mutrace seems good, but as the name points out, I'm afraid it can only trace mutex exclusions.
So is there any other tool, or how to use the tools I mentioned above, to get lock contention statistics?
Thanks for your reply.

Finally I find the performance measuring tool used in the article, which needs to patch kernel .
The introduction page can be found at http://oss.sgi.com/projects/lockmeter/, and the latest kernel patch corresponds to kernel version 2.6.16, which you can download here.

One way to tell is just get it running, pause it, and take a random stackshot of all the threads. Then do it again, several times. Then the fraction of stack samples that terminate in locking code is the percent of time you are after, roughly. It will also tell you which locations the locking is performed in. If you're after accuracy, take more samples. This works in any language or operating system.

Related

memory usage discrepency between pprof and ps

I have been trying to profile the heap usage of a cli tool built with cobra.
The pprof tool is showing like the following,
Flat Flat% Sum% Cum Cum% Name Inlined?
1.58GB 49.98% 49.98% 1.58GB 49.98% os.ReadFile
1.58GB 49.98% 99.95% 1.58GB 50.02% github.com/bytedance/sonic.(*frozenConfig).Unmarshal
0 0.00% 99.95% 3.16GB 100.00% runtime.main
0 0.00% 99.95% 3.16GB 100.00% main.main
0 0.00% 99.95% 3.16GB 100.00% github.com/spf13/cobra.(*Command).execute
0 0.00% 99.95% 3.16GB 100.00% github.com/spf13/cobra.(*Command).ExecuteC
0 0.00% 99.95% 3.16GB 100.00% github.com/spf13/cobra.(*Command).Execute (inline)
0 0.00% 99.95% 3.16GB 100.00% github.com/mirantis/broker/misc.ParseUcpNodesInspect
0 0.00% 99.95% 3.16GB 100.00% github.com/mirantis/broker/cmd.glob..func3
0 0.00% 99.95% 3.16GB 100.00% github.com/mirantis/broker/cmd.getInfos
0 0.00% 99.95% 3.16GB 100.00% github.com/mirantis/broker/cmd.Execute
0 0.00% 99.95% 1.58GB 50.02% github.com/bytedance/sonic.Unmarshal
But ps is sowing at the end it almost consumes 6752.23 Mb(rss).
Also, I am putting the defer profile.Start(profile.MemProfileHeap).Stop() at the last function gets executed. Putting the profiler in the func main doesn't show anything. So I traced through the functions and found the considerable usage of memory at the last one.
My question is, how do I find the missing ~3gb of memory?
There are multiple problems (with your question):
ps (and top etc) show multiple memory readings. The only one of interest is typically called RES or RSS. You don't tell which one it was.
Basically, looking at the reading typically named VIRT is not interesting.
As Volker said, pprof does not measure memory consumption, it measures (in the mode you have run it) memory allocation rate—in the sense of "how much", not "how frequently".
To understand what it means, consider how pprof works.
During profiling, a timer ticks, and on each tick, the profiler sort of snaphots your running program, scans the stacks of all live goroutines and attributes live objects on the heap to the variables contained in the stack frames of those stacks, and each stack frame belongs to an active function.
This means that, if your process will call, say, os.ReadFile—which, by its contract, allocates a slice of bytes long enough to contain the whole contents of the file to be read,—100 times to read 1 GiB file each time, and the profiler's timer will manage to pinpoint each of these 100 calls (it can miss some of the calls as it's sampling), os.ReadFile will be attributed to having had allocated 100 GiB.
But if your program is not written in such a way that it holds each of the slices returned by these calls, but rather does something with those slices and throws them away after processing, the slices from the past calls will likely be already collected by the GC by the time the newer ones are allocated.
While not required by the spec, the two "standard" contemporary implementations of Go—the one originally dubbed "gc", which most people think is the implementation, and the GCC frontend—feature garbage collector which runs in parallel with the flow of your own process; the moments it actually collects the garbage produced by your process are governed by a set of complicated heuristics (start here if interested) which try to balance between spending CPU time for GC and spending RAM for not doing it ;-) , and it means for short-lived processes, the GC might not kick in even a single time, meaning your process will end with all the generated garbage still floating, and all that memory will be reclaimed by the OS in the usual way when the process ends.
When the GC collects garbage, the freed memory is not returned to the OS immediately. Instead, two-staged process is involved:
First, the freed regions are returned to the memory manager which is a part of the Go rutime powering your running program.
This is a sensible thing because in a typical program memory churn is usually high enough and freed memory will likely be quickly allocated back again.
Second, memory pages staying free long enough are marked to let the OS know it can use it for its own needs.
Basically it means that even after the GC frees some memory, you won't see this outside the running Go process as this memory is first retuned to the process' own pool.
Different versions of Go (again, I mean the "gc" implementation) implemented different policies about returning the freed pages to the OS: first they were marked by madvise(2) as MADV_FREE, then as MADV_DONTNEED and then again as MADV_FREE.
If you happen to use a version of Go whose runtime marks freed memory as MADV_DONTNEED, the readings of RSS will be even less sensible because the memory marked that way still counts against the process' RSS even though the OS was hinted it can reclaim that memory when needed.
To recap.
This topic is complex enough and you appear to be drawing certain conclusions too fast ;-)
An update.
I've decided to expand on memory management a bit because I feel like certain bits and pieces may be missing from the big picture of this stuff in your head, and because of this you might find the comments to your question to be moot and dismissive.
The reasoning for the advice to not measure memory consumption of programs written in Go using ps, top and friends is rooted in the fact the memory management implemented in the runtime environments powering programs written in contemporary high-level programming languages is quite far removed from the down-to-the-metal memory management implemented in the OS kernels and the hardware they run on.
Let's consider Linux to have concrete tangible examples.
You certainly can ask the kernel directly to allocate a memory for you: mmap(2) is a syscall which does that.
If you call it with MAP_PRIVATE (and usually also with MAP_ANONYMOUS), the kernel will make sure the page table of your process has one or more new entries for as many pages of memory to contain the contiguous region of as many bytes as you have requested, and return the address of the first page in the sequence.
At this time you might think that the RSS of your process had grown by that number of bytes, but it had not: the memory was "reserved" but not actually allocated; for a memory page to really get allocated, the process had to "touch" any byte within the page—by reading it or writing it: this will generate the so-called "page fault" on the CPU, and the in-kernel handler will ask the hardware to actually allocate a real "hardware" memory page. Only after that the page will actually count against the process' RSS.
OK, that's fun, but you probably can see a problem: it's not too convenient to operate with complete pages (wich can be of different size on different systems; typically it's 4 KiB on systems of the x86 lineage): when you program in a high-level language, you don't think on such a low level about the memory; instead, you expect the running program to somehow materialize "objects" (I do not mean OOP here; just pieces of memory containing values of some language- or user-defined types) as you need them.
These objects may be of any size, most of the time way smaller than a single memory page, and—what is more important,—most of the time you do not even think about how much space these objects are consuming when allocated.
Even when programming in a language like C, which these days is considered to be quite low-level, you're usually accustomed to using memory management functions in the malloc(3) family provided by the standard C library, which allow you to allocate regions of memory of arbitrary size.
A way to solve this sort of problem is to have a higher-level memory manager on top on what the kernel can do for your program, and the fact is, every single general-purpose program written in a high-level language (even C and C++!) is using one: for interpreted languages (such as Perl, Tcl, Python, POSIX shell etc) it is provided by the interpreter; for byte-compiled languages such as Java, it is provided by the process which executes that code (such as JRE for Java); for languages which compile down to machine (CPU) code—such as the "stock" implementation of Go—it is provided by the "runtime" code included into the resulting executable image file or linked into the program dynamically when it's being loaded into the memory for execution.
Such memory managers are usually quite complicated as they have to deal with many complex problems such as memory fragmentation, and they usually have to avoid talking to the kernel as much as possible because syscalls are slow.
The latter requirement naturally means process-level memory managers try to cache the memory they have once taken from the kernel, and are reluctant to release it back.
All this means that, say, in a typical active Go program you might have crazy memory churn — hordes of small objects being allocated and deallocated all the time which has next to no effect on the values of RSS monitored "from the outside" of the process: all this churn is handled by the in-process memory manager and—as in the case of the stock Go implementation—the GC which is naturally tightly integrated with the MM.
Because of that, to have useful actionable idea about what is happening in a long-running production-grade Go program, such program usually provides a set of continuously updated metrics (delivering, collecting and monitoring them is called telemetry). For Go programs, a part of the program tasked with producing these metrics can either make periodic calls to runtime.ReadMemStats and runtime/debug.ReadGCStats or directly use what the runtime/metrics has to offer. Looking at such metrics in a monitoring system such as Zabbix, Graphana etc is quite instructive: you can literally see how the amount of free memory available to the in-process MM increases after each GC cycle while the RSS stays roughly the same.
Also note that you might consider running your Go program with various GC-related debugging settings in a special environment variable GODEBUG described here: basically, you make the Go runtime powering your running program emit detailed information on how the GC is working (also see this).
Hope this will make your curious to make further exploration of these matters ;-)
You might find this to be a good introduction on memory management implemented by the Go runtime—in connection with the kernel and the hardware; recommended read.

How to measure the context switching overhead of a very large program?

I am trying to measure the impact of CPU scheduler on a large AI program (https://github.com/mozilla/DeepSpeech).
By using strace, I can see that it uses a lot of (~200) CPU threads.
I have tried using Linux Perf to measure this, but I have only been able to find the number of context switch events, not the overhead of them.
What I am trying to achieve is the total CPU core-seconds spent on context switching. Since it is a pretty large program, I would prefer non-invasive tools to avoid having to edit the source code of this program.
How can I do this?
Are you sure most of those 200 threads are actually waiting to run at the same time, not waiting for data from a system call? I guess you can tell from perf stat that context-switches are actually pretty high, but part of the question is whether they're high for the threads doing the critical work.
The cost of a context-switch is reflected in cache misses once a thread is running again. (And stopping OoO exec from finding as much ILP right at the interrupt boundary). This cost is more significant than the cost of the kernel code that saves/restores registers. So even if there was a way to measure how much time the CPUs spent in kernel context-switch code (possible with perf record sampling profiler as long as your perf_event_paranoid setting allows recording kernel addresses), that wouldn't be an accurate reflection of the true cost.
Even making a system call has a similar (but lower and more frequent) performance cost from serializing OoO exec, as well as disturbing caches (and TLB). There's a useful characterization of this on real modern CPUs (from 2010) in a paper by Livio & Stumm, especially the graph on the first page of IPC (instructions per cycle) dropping after a system call returns, and taking time to recover: FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. (Conference presentation: https://www.usenix.org/conference/osdi10/flexsc-flexible-system-call-scheduling-exception-less-system-calls)
You might estimate context-switch cost by running the program on a system with enough cores not to need to context-switch much at all (e.g. a big many-core Xeon or Epyc), vs. on fewer cores but with the same CPUs / caches / inter-core latency and so on. So, on the same system with taskset --cpu-list 0-8 ./program to limit how many cores it can use.
Look at the total user-space CPU-seconds used: the amount higher is the extra amount of CPU time needed because of slowdowns from context switched. The wall-clock time will of course be higher when the same work has to compete for fewer cores, but perf stat includes a "task-clock" output which tells you a total time in CPU-milliseconds that threads of your process spent on CPUs. That would be constant for the same amount of work, with perfect scaling to more threads, and/or to the same threads competing for more / fewer cores.
But that would tell you about context-switch overhead on that big system with big caches and higher latency between cores than on a small desktop.

What is considered a high miss rate/low hit rate in caches?

I have been trying to profile some code that I wrote as a small memory test on my machine and by using perf I noticed:
Performance counter stats for './MemBenchmark':
15,980 LLC-loads
8,714 LLC-load-misses # 54.53% of all LL-cache hits
10.002878281 seconds time elapsed
The whole idea of the benchmark is 'stress' the memory so in my books the higher I can make the miss rate the better I think.
EDIT: Is there functionality within Perf that will allow a file to be profiled into different sections? e.g. If main() contains three for loops, is it possible to profile each loop individually to see the number of LLC load misses?
Remember that LLC-loads only counts loads that missed in L1d and L2. As a fraction of total loads (L1-dcache-loads), that's probably a very good hit rate for the cache hierarchy overall (thanks to good locality and/or successful prefetch.)
(Your CPU has a 3-level cache, so the Last Level is the shared L3; the L1 and L2 are private per-core caches. On CPU with only 2 levels of cache, the LLC would be L2.)
Only 9k accesses that had to go all the way to DRAM 10 seconds is very very good.
A low LLC hit rate with such a low total LLC-loads tells you that your workload has good locality for most of its accesses, but the accesses that do miss often have to go all the way to DRAM, and only half of them benefit from having L3 cache at all.
related: Cache friendly offline random read, and see #BeeOnRope's answer on Understanding perf detail when comparing two different implementations of a BFS algorithm where he says the absolute number of LLC misses is what counts for performance.
An algorithm with poor locality will generate a lot of L2 misses, and often a lot of L3 hits (quite possibly with a high L3 hit rate), but also many total L3 misses, so the pipeline is stalled a lot of the time waiting for memory.
What metric could you suggest to measure how my program performs in terms of stressing the memory?
Do you want to know how much total memory traffic your program causes, including prefetches? i.e. what kind of impact it might have on other programs competing for memory bandwidth? offcore_requests.all_requests could tell you how many requests (including L2 prefetches, page walks, and both loads and stores, but not L3 prefetches) make it past L2 to the shared L3 cache, whether or not they hit in shared L3. (Use the ocperf.py wrapper for perf. My Skylake has that event; IDK if your Nehalem will.)
As far as detecting whether your code bottlenecks on memory, LLC-load-misses per second as an absolute measure would be reasonable. Skylake at least has a cycle_activity.stalls_l3_miss to count cycles where no uops executed and there was an outstanding L3 miss. If that's more than a couple % of total cycles, you'd want to look into avoiding those stalls.
(I haven't tried using these events to learn anything myself, they might not be the most useful suggestion. It's hard to know the right question to ask yourself when profiling; there are lots of events you could look at but using them to learn something that helps you figure out how to change your code is hard. It helps a lot to have a good mental picture of how your code uses memory, so you know what to look for. For such a general question, it's hard to say much.)
Is there a way you could suggest that can break down the benchmark file to see which loops are causing the most stress?
You can use perf record -e whatever / perf report -Mintel to do statistical sample-based profiling for any event you want, to see where the hotspots are.
But for cache misses, sometimes the blame lies with some code that looped over an array and evicted lots of valuable data, not the code touching the valuable data that would still be hot.
A loop over a big array might not see many cache misses itself if hardware prefetching does its job.
linux perf: how to interpret and find hotspots. It can be very useful to use stack sampling if you don't know exactly what's slow and fast in your program. Sampling the call stack on each event will show you which function call high up in the call tree is to blame for all the work its callees are doing. Avoiding that call in the first place can be much better than speeding up the functions it calls by a bit.
(Avoid work instead of just doing the same work with better brute force. Careful applications of the maximum brute-force a modern CPU can bring to bear with AVX2 is useful after you've established that you can't avoid doing it in the first place.)

Linux: CPU benchmark requiring longer time and different CPU utilization levels

For my research I need a CPU benchmark to do some experiments on my Ubuntu laptop (Ubuntu 15.10, Memory 7.7 GiB, Intel Core i7-4500U CPU # 1.80HGz x 4, 64bit). In an ideal world, I would like to have a benchmark satisfying the following:
The CPU should be an official benchmark rather than created by my own for transparency purposes.
The time needed to execute the benchmark on my laptop should be at least 5 minutes (the more the better).
The benchmark should result in different levels of CPU throughout execution. For example, I don't want a benchmark which permanently keeps the CPU utilization level at around 100% - so I want a benchmark which will make the CPU utilization vary over time.
Especially points 2 and 3 are really key for my research. However, I couldn't find any suitable benchmarks so far. Benchmarks I found so far include: sysbench, CPU Fibonacci, CPU Blowfish, CPU Cryptofish, CPU N-Queens. However, all of them just need a couple of seconds to complete and the utilization level on my laptop is at 100% constantly.
Question: Does anyone know about a suitable benchmark for me? I am also happy to hear any other comments/questions you have. Thank you!
To choose a benchmark, you need to know exactly what you're trying to measure. Your question doesn't include that, so there's not much anyone can tell you without taking a wild guess.
If you're trying to measure how well Turbo clock speed works to make a power-limited CPU like your laptop run faster for bursty workloads (e.g. to compare Haswell against Skylake's new and improved power management), you could just run something trivial that's 1 second on, 2 seconds off, and count how many loop iterations it manages.
The duty cycle and cycle length should be benchmark parameters, so you can make plots. e.g. with very fast on/off cycles, Skylake's faster-reacting Turbo will ramp up faster and drop down to min power faster (leaving more headroom in the bank for the next burst).
The speaker in that talk (the lead architect for power management on Intel CPUs) says that Javascript benchmarks are actually bursty enough for Skylake's power management to give a measurable speedup, unlike most other benchmarks which just peg the CPU at 100% the whole time. So maybe have a look at Javascript benchmarks, if you want to use well-known off-the-shelf benchmarks.
If rolling your own, put a loop-carried dependency chain in the loop, preferably with something that's not too variable in latency across microarchitectures. A long chain of integer adds would work, and Fibonacci is a good way to stop the compiler from optimizing it away. Either pick a fixed iteration count that works well for current CPU speeds, or check the clock every 10M iterations.
Or set a timer that will fire after some time, and have it set a flag that you check inside the loop. (e.g. from a signal handler). Specifically, alarm(2) may be a good choice. Record how many iterations you did in this burst of work.

Severe multi-threaded memory bottleneck after reaching a specific number of cores

We are testing our software for the first time on a machine with > 12 cores for scalability and we are encountering a nasty drop in performance after the 12th thread is added. After spending a couple days on this, we are stumped regarding what to try next.
The test system is a dual Opteron 6174 (2x12 cores) with 16 GB of memory, Windows Server 2008 R2.
Basically, performance peaks from 10 - 12 threads, then drops off a cliff and is soon performing work at about the same rate it was with about 4 threads. The drop-off is fairly steep and by 16 - 20 threads it reaches bottom in terms of throughput. We have tested both with a single process running multiple threads and as multiple processes running single threads-- the results are pretty much the same. The processing is fairly memory intensive and somewhat disk intensive.
We are fairly certain this is a memory bottleneck, but we don't believe it a cache issue. The evidence is as follows:
CPU usages continues to climb from 50 to 100% when scaling from 12 to 24 threads. If we were having synchronization/deadlock issues, we would have expected CPU usage to top out before reaching 100%.
Testing while copying a large amount of files in the background had very little impact on the processing rates. We think this rules out disk i/o as the bottleneck.
The commit charge is only about 4 GBs, so we should be well below the threshold in which paging would become an issue.
The best data comes from using AMD's CodeAnalyst tool. CodeAnalyst shows the windows kernel goes from taking about 6% of the cpu time with 12 threads to 80-90% of CPU time when using 24 threads. A vast majority of that time is spent in the ExAcquireResourceSharedLite (50%) and KeAcquireInStackQueuedSpinLockAtDpcLevel (46%) functions. Here are the highlights of the kernel's factor change when going from running with 12 threads to running with 24:
Instructions: 5.56 (times more)
Clock cycles: 10.39
Memory operations: 4.58
Cache miss ratio: 0.25 (actual cache miss ratio is 0.1, 4 times smaller than with 12 threads)
Avg cache miss latency: 8.92
Total cache miss latency: 6.69
Mem bank load conflict: 11.32
Mem bank store conflict: 2.73
Mem forwarded: 7.42
We thought this might be evidence of the problem described in this paper, however we found that pinning each worker thread/process to a particular core didn't improve the results at all (if anything, performance got a little worse).
So that's where we're at. Any ideas on the precise cause of this bottleneck or how we might avoid it?
I'm not sure that I understand the issues completely such that I can offer you a solution but from what you've explained I may have some alternative view points which may be of help.
I program in C so what works for me may not be applicable in your case.
Your processors have 12MB of L3 and 6MB of L2 which is big but in my view they're seldom big enough!
You're probably using rdtsc for timing individual sections. When I use it I have a statistics structure into which I send the measurement results from different parts of the executing code. Average, minimum, maximum and number of observations are obvious but also standard deviation has its place in that it can help you decide whether a large maximum value should be researched or not. Standard deviation only needs to be calculated when it needs to be read out: until then it can be stored in its components (n, sum x, sum x^2). Unless you're timing very short sequences you can omit the preceding synchronizing instruction. Make sure you quantifiy the timing overhead, if only to be able to rule it out as insignificant.
When I program multi-threaded I try to make each core's/thread's task as "memory limited" as possible. By memory limited I mean not doing things which requires unnecessary memory access. Unnecessary memory access usually means as much inline code as possible and as litte OS access as possible. To me the OS is a great unknown in terms of how much memory work a call to it will generate so I try to keep calls to it to a minimum. In the same manner but usually to a lesser performance impacting extent I try to avoid calling application functions: if they must be called I'd rather they didn't call a lot of other stuff.
In the same manner I minimize memory allocations: if I need several I add them together into one and then subdivide that one big allocation into smaller ones. This will help later allocations in that they will need to loop through fewer blocks before finding the block returned. I only block initialize when absolutely necessary.
I also try to reduce code size by inlining. When moving/setting small blocks of memory I prefer using intrinsics based on rep movsb and rep stosb rather than calling memcopy/memset which are usually both optimized for larger blocks and not especially limited in size.
I've only recently begun using spinlocks but I implement them such that they become inline (anything is better than calling the OS!). I guess the OS alternative is critical sections and though they are fast local spinlocks are faster. Since they perform additional processing it means that they prevent application processing from being performed during that time. This is the implementation:
inline void spinlock_init (SPINLOCK *slp)
{
slp->lock_part=0;
}
inline char spinlock_failed (SPINLOCK *slp)
{
return (char) __xchg (&slp->lock_part,1);
}
Or more elaborate (but not overly so):
inline char spinlock_failed (SPINLOCK *slp)
{
if (__xchg (&slp->lock_part,1)==1) return 1;
slp->count_part=1;
return 0;
}
And to release
inline void spinlock_leave (SPINLOCK *slp)
{
slp->lock_part=0;
}
Or
inline void spinlock_leave (SPINLOCK *slp)
{
if (slp->count_part==0) __breakpoint ();
if (--slp->count_part==0) slp->lock_part=0;
}
The count part is something I've brought along from embedded (and other programming) where it is used for handling nested interrupts.
I'm also a big fan of IOCPs for their efficiency in handling IO events and threads but your description does not indicate whether your application could use them. In any case you appear to economize on them, which is good.
To address your bullet points:
1) If you have 12 cores at 100% usage and 12 cores idle, then your total CPU usage would be 50%. If your synchronization is spinlock-esque, then your threads would still be saturating their CPUs even while not accomplishing useful work.
2) skipped
3) I agree with your conclusion. In the future, you should know that Perfmon has a counter: Process\Page Faults/sec that can verify this.
4) If you don't have the private symbols for ntoskrnl, CodeAnalyst may not be able to tell you the correct function names in its profile. Rather, it can only point to the nearest function for which it has symbols. Can you get stack traces with the profiles using CodeAnalyst? This could help you determine what operation your threads perform that drives the kernel usage.
Also, my former team at Microsoft has provided a number of tools and guidelines for performance analysis here, including taking stack traces on CPU profiles.

Resources