I have been trying to profile some code that I wrote as a small memory test on my machine and by using perf I noticed:
Performance counter stats for './MemBenchmark':
15,980 LLC-loads
8,714 LLC-load-misses # 54.53% of all LL-cache hits
10.002878281 seconds time elapsed
The whole idea of the benchmark is to 'stress' the memory, so in my book the higher I can push the miss rate, the better.
EDIT: Is there functionality within Perf that will allow a file to be profiled into different sections? e.g. If main() contains three for loops, is it possible to profile each loop individually to see the number of LLC load misses?
Remember that LLC-loads only counts loads that missed in L1d and L2. As a fraction of total loads (L1-dcache-loads), that's probably a very good hit rate for the cache hierarchy overall (thanks to good locality and/or successful prefetch.)
(Your CPU has a 3-level cache, so the Last Level is the shared L3; the L1 and L2 are private per-core caches. On CPUs with only 2 levels of cache, the LLC would be L2.)
Only ~9k accesses that had to go all the way to DRAM in 10 seconds is very, very good.
A low LLC hit rate with such a low total LLC-loads tells you that your workload has good locality for most of its accesses, but the accesses that do miss often have to go all the way to DRAM, and only half of them benefit from having L3 cache at all.
Related: Cache friendly offline random read, and see @BeeOnRope's answer on Understanding perf detail when comparing two different implementations of a BFS algorithm, where he says the absolute number of LLC misses is what counts for performance.
An algorithm with poor locality will generate a lot of L2 misses, and often a lot of L3 hits (quite possibly with a high L3 hit rate), but also many total L3 misses, so the pipeline is stalled a lot of the time waiting for memory.
What metric could you suggest to measure how my program performs in terms of stressing the memory?
Do you want to know how much total memory traffic your program causes, including prefetches? i.e. what kind of impact it might have on other programs competing for memory bandwidth? offcore_requests.all_requests could tell you how many requests (including L2 prefetches, page walks, and both loads and stores, but not L3 prefetches) make it past L2 to the shared L3 cache, whether or not they hit in shared L3. (Use the ocperf.py wrapper for perf. My Skylake has that event; IDK if your Nehalem will.)
As far as detecting whether your code bottlenecks on memory, LLC-load-misses per second as an absolute measure would be reasonable. Skylake at least has a cycle_activity.stalls_l3_miss to count cycles where no uops executed and there was an outstanding L3 miss. If that's more than a couple % of total cycles, you'd want to look into avoiding those stalls.
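For example, a perf stat invocation along these lines could give a first overview (hedged: the exact event names are from Skylake-era Intel, as noted above; they may be absent or spelled differently on your CPU or with older perf/kernel versions):

perf stat -e cycles,cycle_activity.stalls_l3_miss,LLC-loads,LLC-load-misses,offcore_requests.all_requests ./MemBenchmark

If cycle_activity.stalls_l3_miss ends up being more than a few percent of cycles, the workload really is spending significant time stalled waiting for memory.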
(I haven't tried using these events to learn anything myself, they might not be the most useful suggestion. It's hard to know the right question to ask yourself when profiling; there are lots of events you could look at but using them to learn something that helps you figure out how to change your code is hard. It helps a lot to have a good mental picture of how your code uses memory, so you know what to look for. For such a general question, it's hard to say much.)
Is there a way you could suggest that can break down the benchmark file to see which loops are causing the most stress?
You can use perf record -e whatever / perf report -Mintel to do statistical sample-based profiling for any event you want, to see where the hotspots are.
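For example, with the binary from your question and a generic event alias that most Intel CPUs expose:

perf record -e LLC-load-misses ./MemBenchmark
perf report -Mintel

perf annotate can then show which instructions inside a hot loop the samples land on (keeping in mind skid: samples can be attributed a few instructions after the one that actually caused the miss).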
But for cache misses, sometimes the blame lies with some code that looped over an array and evicted lots of valuable data, not the code touching the valuable data that would still be hot.
A loop over a big array might not see many cache misses itself if hardware prefetching does its job.
See also linux perf: how to interpret and find hotspots. It can be very useful to use stack sampling if you don't know exactly what's slow and fast in your program. Sampling the call stack on each event will show you which function call high up in the call tree is to blame for all the work its callees are doing. Avoiding that call in the first place can be much better than speeding up the functions it calls by a bit.
(Avoid work instead of just doing the same work with better brute force. Careful application of the maximum brute force a modern CPU can bring to bear, e.g. with AVX2, is useful after you've established that you can't avoid doing the work in the first place.)
Related
I am currently trying to develop a judging system that measures not only time and memory use but also deeper information such as cache misses, etc., for which I assume hardware counters (via perf) are perfect.
But for the timing part,
I wonder whether using purely the cycle count to determine execution speed is reliable enough.
I'd like to hear about the pros and cons of this decision.
So you're proposing measuring CPU cycles, instead of seconds? Sounds somewhat reasonable.
For some microbenchmarks that's good, and mostly factors out the variations due to CPU frequency changes. (And delays due to interrupts if you count only user-space cycles, if you're microbenching a loop that doesn't make system calls. Only the secondary effects of interrupts are then visible, i.e. serializing the pipeline and perhaps evicting some of your data from cache / TLB.)
But memory (and maybe the L3 cache) stays at a constant speed while the CPU frequency changes, so the relative cost of a cache miss changes: the same response time in nanoseconds is fewer core clock cycles, so out-of-order exec can hide more of it more easily. And available memory bandwidth is higher relative to what a core can use, so HW prefetch has an easier time keeping up.
e.g. at 4.3GHz, a load that missed in L2 cache but hits in L3 on Skylake-server might have a total latency of about 79 core clock cycles. (https://www.7-cpu.com/cpu/Skylake_X.html - i7-7820X (Skylake X), 8 cores).
At an 800MHz idle clock speed, an L2 cache hit is still 14 cycles (because the L2 runs at core speed). But if another core is keeping the L3 cache (and the uncore in general) at a high clock speed, the off-core part of that round-trip request will take many fewer core clock cycles.
e.g. we can make a back-of-the-envelope calculation by assuming that all the extra time for an L3 hit vs. an L2 hit is spent in the uncore, not the core, and takes a fixed number of nanoseconds. Since we have that time in cycles of a 4.3GHz clock, the math works out as 14 + (79-14)*8/43 cycles for an L3 hit at 800MHz = 26 cycles, down from 79.
This rough calculation actually matches up with the 7-cpu.com numbers for the same CPU with a core at 3.6GHz: L3 Cache Latency = 68 cycles. 14 + (79-14)*36/43 = 68.4.
Note that I picked a "server" part because different cores can run at different clock speeds. That's not the case in "client" CPUs like i7-6700k. Uncore (L3, interconnect, etc.) may still be able to vary independently of the cores, e.g. staying high for the GPU. Also, server parts have higher latency outside the core. (e.g. 4GHz Skylake i7-6700k with turbo disabled has L3 latency of only 42 core clock cycles, not 68 or 79.)
See also Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? for why/how L3 and memory latency affect max possible single-core memory bandwidth.
Of course, if you control the CPU frequency by allowing some warm-up, or for tasks that run for more than a trivial amount of time, this isn't a big deal.
(Although do note that Skylake will sometimes lower the clock speed when very memory-bound, which unfortunately hurts bandwidth even more, at the default energy_performance_preference = balance_power, but "balance_performance" or "performance" can avoid that. Slowing down CPU Frequency by imposing memory stress)
Do note that counting only cycles won't remove the cost of context switches (extra cache misses after switching back to this thread, and draining the ROB sucks). Or of competition from other cores for memory bandwidth.
e.g. another thread running on the other logical core of the same physical core will often seriously reduce IPC. Overall throughput usually goes up some, depending on the task, but individual per-thread throughput goes down.
Skylake has a perf event for tracking hyperthreading competition: cpu_clk_thread_unhalted.one_thread_active - IIRC that event count increments at something like 24MHz when your task is running and has the core all to itself. So if you see less than that, you know you had some competition and spent some time with the ROB partitioned and trading front-end cycles with another thread.
So there are a bunch of effects, and it's up to you to decide whether it's useful. Sorting results by core clock cycles sounds reasonable, but you should probably include CPU-seconds (task-clock) and average-frequency in the results to help people spot outliers / glitches.
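As a sketch, a plain perf stat run already gives you everything needed for that (./your_benchmark is just a placeholder):

perf stat -e task-clock,cycles,instructions ./your_benchmark

Average frequency is then simply cycles / task-clock, and comparing it across runs makes throttled or still-at-idle-clocks runs easy to spot as outliers.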
I'm looking for a Linux utility that allows profiling the cache eviction in my program.
Specifically, I'm interested in finding what causes certain cache line(s) to be repeatedly evicted from L2 cache.
Any suggestions?
You have several options at your disposal, some of which are free. Below I'll mostly talk about profiling L2 misses, not necessarily L2 evictions since those are more or less the same thing. Lines get evicted from the L2 because another line is being brought in, and another line is being brought in usually due to an L2 miss [1].
Cachegrind
First, I'd try out cachegrind. This basically runs your binary under a type of lightweight virtual machine which allows it to intercept all memory accesses and subsequently model their effect on the cache. It can pinpoint exactly where cache misses occur, who is responsible for eviction and so on.
It is important to note that cachegrind doesn't actually tell you what's going on with the hardware caches but rather what happens in its cache model. Since the L1 and L2 are simple enough on Intel x86, the cachegrind model should be accurate, except in unusual cases.
Cachegrind can only simulate two cache levels, but modern Intel CPUs have 3 or sometimes 4. That shouldn't be a problem if you are trying to evaluate L2 misses, though. By default cachegrind sets its L1 cache to the detected values of the local L1 cache, and its LL cache to the detected values of the LLC. In your case, you'll want to override that latter decision to reflect the L2 cache, not the LLC. You can find the details in the manual, but this should be correct for recent Intel (Broadwell and earlier):
--LL=262144,8,64
For Skylake client/Kaby Lake and friends you'd want:
--LL=262144,4,64
For Skylake-X server you'll want to look up the new values because the L2 changed.
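Putting it together, an invocation could look like this (the --LL geometry is the 256 KiB, 8-way, 64-byte-line Broadwell-style L2 from above; ./your_program is a placeholder):

valgrind --tool=cachegrind --LL=262144,8,64 ./your_program
cg_annotate cachegrind.out.<pid>

cg_annotate then attributes the simulated misses back to source lines.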
The primary downside of this approach is that you can't be 100% sure that the cache model is an accurate reflection of reality (e.g., it doesn't model things like prefetching or virtual-physical paging). Another downside is that running a process under cachegrind is probably an order of magnitude slower than running it native, but for an investigation outside of "production" this probably isn't an issue.
Perf
You can use the default, included and free profiling tool to learn exactly what's actually going on with your actual hardware: perf.
In particular, you can use perf record combined with perf report or perf annotate to determine where in your program misses are occurring. You can start with something like this:
perf record -e mem_load_retired.l2_miss <your process>
This periodically records where L2 misses occur. You can display the result with perf report, which lets you explore the results interactively. There are lots of other options, such as --call-graph to record the full call graph, which may be useful.
The perf record approach always tells you where in your code something is happening, but it doesn't help you determine what memory was being accessed when the misses occurred. That often doesn't matter: the location in the code often makes it obvious what memory is being accessed. Sometimes, however, that's not the case: you have some code that might access a large region of memory and you want to know the address to figure out why the misses are occurring.
In that case you can use perf mem, which records both the location, in code, of the miss and the address of the miss. This tool isn't as polished as the others, but the source is at least available so you could always make some improvements. I cover this option in some detail in another answer.
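A minimal invocation might be (perf mem relies on precise load-sampling support, so whether it works depends on your hardware and kernel):

perf mem record ./your_program
perf mem report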
The primary disadvantage of perf is that it is less straightforward to use than something like cachegrind. The behavior and available events depends on your hardware and kernel version, and sometimes things like stack traces don't work, etc. You have to be relatively comfortable with the command line to make good use of this tool.
VTune
This tool uses the same underlying performance counters as perf, but uses GUI-based exploration and is perhaps easier to jump into than perf. It takes more of a top-down approach: telling you where the problems are and allowing you to drill down, whereas perf is more about "here's the raw data, figure out what's wrong".
It provides specific analyses like the Memory Access Analysis which might be appropriate for your problem.
The main downside is that this is a paid product, unless you qualify to use it for free. It may be somewhat easier to use than perf but it's still not exactly easy and there is a lot of magic that goes on so if something goes wrong it may be hard to debug.
[1] In some scenarios this might not be true. The main one I can think of is if prefetching into L2 causes most lines to arrive before they would otherwise miss. In that case, the number of L2 replacements might be much higher than the number of L2 misses. This is the kind of thing that cachegrind won't be able to help you with, but perf can: you can compare the number of L2 lines in/replaced to the number of L2 misses and see if they are close. If they aren't, you'll have to play around with other counters to see if prefetching is the cause.
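To make that comparison, something like the following could work (again, these event names are from Skylake-era Intel and are an assumption for your CPU):

perf stat -e l2_rqsts.references,l2_rqsts.miss,l2_lines_in.all ./your_program

If l2_lines_in.all is much higher than l2_rqsts.miss, prefetching is bringing in (and therefore evicting) most of the lines.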
For my research I need a CPU benchmark to do some experiments on my Ubuntu laptop (Ubuntu 15.10, Memory 7.7 GiB, Intel Core i7-4500U CPU @ 1.80GHz x 4, 64-bit). In an ideal world, I would like to have a benchmark satisfying the following:
The benchmark should be an official one rather than something I created myself, for transparency purposes.
The time needed to execute the benchmark on my laptop should be at least 5 minutes (the more the better).
The benchmark should result in different levels of CPU utilization throughout execution. For example, I don't want a benchmark which permanently keeps the CPU utilization at around 100% - I want a benchmark which makes the CPU utilization vary over time.
Especially points 2 and 3 are really key for my research. However, I couldn't find any suitable benchmarks so far. Benchmarks I found so far include: sysbench, CPU Fibonacci, CPU Blowfish, CPU Cryptofish, CPU N-Queens. However, all of them just need a couple of seconds to complete and the utilization level on my laptop is at 100% constantly.
Question: Does anyone know about a suitable benchmark for me? I am also happy to hear any other comments/questions you have. Thank you!
To choose a benchmark, you need to know exactly what you're trying to measure. Your question doesn't include that, so there's not much anyone can tell you without taking a wild guess.
If you're trying to measure how well Turbo clock speed works to make a power-limited CPU like your laptop run faster for bursty workloads (e.g. to compare Haswell against Skylake's new and improved power management), you could just run something trivial that's 1 second on, 2 seconds off, and count how many loop iterations it manages.
The duty cycle and cycle length should be benchmark parameters, so you can make plots. e.g. with very fast on/off cycles, Skylake's faster-reacting Turbo will ramp up faster and drop down to min power faster (leaving more headroom in the bank for the next burst).
The speaker in that talk (the lead architect for power management on Intel CPUs) says that Javascript benchmarks are actually bursty enough for Skylake's power management to give a measurable speedup, unlike most other benchmarks which just peg the CPU at 100% the whole time. So maybe have a look at Javascript benchmarks, if you want to use well-known off-the-shelf benchmarks.
If rolling your own, put a loop-carried dependency chain in the loop, preferably with something that's not too variable in latency across microarchitectures. A long chain of integer adds would work, and Fibonacci is a good way to stop the compiler from optimizing it away. Either pick a fixed iteration count that works well for current CPU speeds, or check the clock every 10M iterations.
Or set a timer that will fire after some time, and have it set a flag (e.g. from a signal handler) that you check inside the loop. Specifically, alarm(2) may be a good choice. Record how many iterations you did in each burst of work.
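A minimal sketch of that idea in C (everything here is illustrative: the 1-second-on / 2-seconds-off duty cycle, the Fibonacci-style add chain, and the alarm(2)-plus-volatile-flag mechanism are just one way to do it):

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t stop;              /* set by the SIGALRM handler */
static void on_alarm(int sig) { (void)sig; stop = 1; }

int main(void)
{
    signal(SIGALRM, on_alarm);

    for (int burst = 0; burst < 10; burst++) {
        uint64_t a = 0, b = 1, iters = 0;
        stop = 0;
        alarm(1);                               /* 1 second "on" */
        while (!stop) {
            /* loop-carried dependency chain: Fibonacci-style adds that the
               compiler can't collapse because the result is printed later */
            uint64_t t = a + b;
            a = b;
            b = t;
            iters++;
        }
        printf("burst %d: %llu iters (checksum %llu)\n",
               burst, (unsigned long long)iters, (unsigned long long)b);
        sleep(2);                               /* 2 seconds "off" */
    }
    return 0;
}

Plotting iterations per burst against the duty-cycle parameters then shows how quickly the power management ramps the clocks up and down.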
How can I measure the cycles spent accessing a shared remote cache, say L3? I need this cache access information both system-wide and per-thread. Are there any specific tool/hardware requirements? Or can I use some formula to get an approximate value of the cycles spent over a time interval?
To get the average latencies (when a single thread is running) to the various caches present on your machine, you can use memory-profiling tools such as RMMA for Windows (http://cpu.rightmark.org/products/rmma.shtml) and Lmbench for Linux.
You can also write your own benchmarks based on the ideas used by these tools.
See the answers posted on this StackOverflow question:
measuring latencies of memory
Or Google for how the Lmbench benchmark works.
If you want to find exact latencies for particular memory access patterns, you will need to use a simulator. This way you can trace a memory access as it flows through the memory system. However simulators will not model all the effects that are present in a modern processor or memory system.
If you want to learn how multiple threads affect the average latency to L3, I think the best bet would be to write your own benchmark.
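If you do write your own, the usual approach is pointer chasing: link the cache lines of a buffer in a shuffled order so every load depends on the previous one, then time the chase. A rough single-threaded sketch in C (the buffer size, the crude rand()-based shuffle, and the clock_gettime timing are all simplifying assumptions; a real version would pin the thread, repeat runs, and sweep buffer sizes):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE   64                        /* cache line size in bytes */
#define BUFSZ  (8 * 1024 * 1024)         /* 8 MiB: bigger than typical L2, roughly L3-sized */
#define NODES  (BUFSZ / LINE)
#define STEPS  (20 * 1000 * 1000)

int main(void)
{
    /* one pointer per cache line, linked in a shuffled order so that
       hardware prefetchers can't predict the next line */
    void **buf = aligned_alloc(LINE, BUFSZ);
    size_t *order = malloc(NODES * sizeof *order);
    if (!buf || !order) return 1;

    for (size_t i = 0; i < NODES; i++) order[i] = i;
    srand(1);
    for (size_t i = NODES - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);        /* crude; assumes RAND_MAX >= NODES */
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < NODES; i++)              /* link all lines into one cycle */
        *(void **)((char *)buf + order[i] * LINE) =
            (char *)buf + order[(i + 1) % NODES] * LINE;

    void *p = buf;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < STEPS; i++)
        p = *(void **)p;                            /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.2f ns per load (last p=%p)\n", ns / STEPS, p);  /* print p so the chase isn't optimized out */
    free(order);
    free(buf);
    return 0;
}

Running several copies of this (or adding bandwidth-hogging threads) while one instance measures latency is a simple way to see how contention from other threads raises the average L3/DRAM latency.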
We are testing our software for the first time on a machine with > 12 cores for scalability and we are encountering a nasty drop in performance after the 12th thread is added. After spending a couple days on this, we are stumped regarding what to try next.
The test system is a dual Opteron 6174 (2x12 cores) with 16 GB of memory, Windows Server 2008 R2.
Basically, performance peaks from 10 - 12 threads, then drops off a cliff and is soon performing work at about the same rate it was with about 4 threads. The drop-off is fairly steep and by 16 - 20 threads it reaches bottom in terms of throughput. We have tested both with a single process running multiple threads and as multiple processes running single threads-- the results are pretty much the same. The processing is fairly memory intensive and somewhat disk intensive.
We are fairly certain this is a memory bottleneck, but we don't believe it a cache issue. The evidence is as follows:
CPU usage continues to climb from 50% to 100% when scaling from 12 to 24 threads. If we were having synchronization/deadlock issues, we would have expected CPU usage to top out before reaching 100%.
Testing while copying a large amount of files in the background had very little impact on the processing rates. We think this rules out disk i/o as the bottleneck.
The commit charge is only about 4 GBs, so we should be well below the threshold in which paging would become an issue.
The best data comes from using AMD's CodeAnalyst tool. CodeAnalyst shows the windows kernel goes from taking about 6% of the cpu time with 12 threads to 80-90% of CPU time when using 24 threads. A vast majority of that time is spent in the ExAcquireResourceSharedLite (50%) and KeAcquireInStackQueuedSpinLockAtDpcLevel (46%) functions. Here are the highlights of the kernel's factor change when going from running with 12 threads to running with 24:
Instructions: 5.56 (times more)
Clock cycles: 10.39
Memory operations: 4.58
Cache miss ratio: 0.25 (actual cache miss ratio is 0.1, 4 times smaller than with 12 threads)
Avg cache miss latency: 8.92
Total cache miss latency: 6.69
Mem bank load conflict: 11.32
Mem bank store conflict: 2.73
Mem forwarded: 7.42
We thought this might be evidence of the problem described in this paper, however we found that pinning each worker thread/process to a particular core didn't improve the results at all (if anything, performance got a little worse).
So that's where we're at. Any ideas on the precise cause of this bottleneck or how we might avoid it?
I'm not sure that I understand the issues completely such that I can offer you a solution but from what you've explained I may have some alternative view points which may be of help.
I program in C so what works for me may not be applicable in your case.
Your processors have 12MB of L3 and 6MB of L2 which is big but in my view they're seldom big enough!
You're probably using rdtsc for timing individual sections. When I use it I have a statistics structure into which I send the measurement results from different parts of the executing code. Average, minimum, maximum and number of observations are obvious, but standard deviation also has its place in that it can help you decide whether a large maximum value should be researched or not. Standard deviation only needs to be calculated when it needs to be read out: until then it can be stored in its components (n, sum x, sum x^2). Unless you're timing very short sequences you can omit the preceding synchronizing instruction. Make sure you quantify the timing overhead, if only to be able to rule it out as insignificant.
When I program multi-threaded I try to make each core's/thread's task as "memory limited" as possible. By memory limited I mean not doing things which require unnecessary memory accesses. Avoiding unnecessary memory access usually means as much inline code as possible and as little OS access as possible. To me the OS is a great unknown in terms of how much memory work a call to it will generate, so I try to keep calls to it to a minimum. In the same manner, but usually to a lesser performance-impacting extent, I try to avoid calling application functions: if they must be called, I'd rather they didn't call a lot of other stuff.
In the same manner I minimize memory allocations: if I need several, I combine them into one and then subdivide that one big allocation into smaller ones. This helps later allocations in that they will need to loop through fewer blocks before finding the block to return. I only initialize blocks when absolutely necessary.
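As an illustration of what I mean by combining allocations, a hypothetical sketch (the POOL name and layout are made up for this example, not taken from my code):

#include <stdlib.h>

/* carve one big malloc into fixed-size pieces instead of many small mallocs */
typedef struct {
    char  *base;     /* the single big allocation */
    size_t piece;    /* size of each sub-block */
    size_t used;     /* how many pieces have been handed out */
    size_t count;    /* total pieces available */
} POOL;

static int pool_init (POOL *p, size_t piece, size_t count)
{
    p->base  = malloc(piece * count);
    p->piece = piece;
    p->used  = 0;
    p->count = count;
    return p->base != NULL;
}

static void *pool_get (POOL *p)          /* O(1): no heap walk, no OS call */
{
    if (p->used == p->count) return NULL;
    return p->base + (p->used++) * p->piece;
}

static void pool_destroy (POOL *p)       /* one free() for everything */
{
    free(p->base);
}

Freeing everything in one call also fits the "keep OS calls to a minimum" point above.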
I also try to reduce code size by inlining. When moving/setting small blocks of memory I prefer using intrinsics based on rep movsb and rep stosb rather than calling memcpy/memset, which are usually optimized for larger blocks and not especially limited in size.
I've only recently begun using spinlocks, but I implement them such that they become inline (anything is better than calling the OS!). I guess the OS alternative is critical sections, and though they are fast, local spinlocks are faster: since critical sections perform additional processing, they prevent application processing from being performed during that time. This is the implementation:
/* SPINLOCK is assumed to be a small struct with an integer lock_part (and, for
   the elaborate variant, a count_part). __xchg is my atomic-exchange intrinsic
   (i.e. it compiles to a locked xchg) and __breakpoint a debug-break intrinsic. */

inline void spinlock_init (SPINLOCK *slp)
{
    slp->lock_part=0;                              /* 0 = unlocked */
}

inline char spinlock_failed (SPINLOCK *slp)
{
    /* atomically set the lock to 1 and return the old value:
       non-zero means another thread already held it */
    return (char) __xchg (&slp->lock_part,1);
}

Or more elaborate (but not overly so):

inline char spinlock_failed (SPINLOCK *slp)
{
    if (__xchg (&slp->lock_part,1)==1) return 1;   /* already taken */
    slp->count_part=1;                             /* we own it: start the nesting count */
    return 0;
}

And to release

inline void spinlock_leave (SPINLOCK *slp)
{
    slp->lock_part=0;
}

Or

inline void spinlock_leave (SPINLOCK *slp)
{
    if (slp->count_part==0) __breakpoint ();       /* releasing a lock we don't hold */
    if (--slp->count_part==0) slp->lock_part=0;    /* only the outermost leave unlocks */
}
The count part is something I've brought along from embedded (and other) programming, where it is used for handling nested interrupts.
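For completeness, a typical acquire/release pattern with the simple variant might look like this (hypothetical usage, not taken verbatim from the code above):

SPINLOCK queue_lock;                     /* shared between threads */

spinlock_init (&queue_lock);

/* acquire: spin until the exchange sees the lock free */
while (spinlock_failed (&queue_lock))
    ;                                    /* a pause instruction here is friendlier to the sibling hyperthread */

/* ... touch the shared structure ... */

spinlock_leave (&queue_lock);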
I'm also a big fan of IOCPs for their efficiency in handling IO events and threads but your description does not indicate whether your application could use them. In any case you appear to economize on them, which is good.
To address your bullet points:
1) If you have 12 cores at 100% usage and 12 cores idle, then your total CPU usage would be 50%. If your synchronization is spinlock-esque, then your threads would still be saturating their CPUs even while not accomplishing useful work.
2) skipped
3) I agree with your conclusion. In the future, you should know that Perfmon has a counter: Process\Page Faults/sec that can verify this.
4) If you don't have the private symbols for ntoskrnl, CodeAnalyst may not be able to tell you the correct function names in its profile. Rather, it can only point to the nearest function for which it has symbols. Can you get stack traces with the profiles using CodeAnalyst? This could help you determine what operation your threads perform that drives the kernel usage.
Also, my former team at Microsoft has provided a number of tools and guidelines for performance analysis here, including taking stack traces on CPU profiles.