Is cycle count itself reliable on program timing? - performance-testing

I am currently trying to develop a judging system that measure not only time and memory use but also more deeper information such as cache misses and etc., which I assume the hardware counters (using perf) are perfect for it.
But for the timing part,
I wonder if using purely the cycle count to determine execution speed is reliable enough?
Hope to know about the pros and cons about this decision.

So you're proposing measuring CPU cycles, instead of seconds? Sounds somewhat reasonable.
For some microbenchmarks that's good, and mostly factors out the variations due to CPU frequency changes. (And delays due to interrupts if you count only user-space cycles, if you're microbenching a loop that doesn't make system calls. Only the secondary effects of interrupts are then visible, i.e. serializing the pipeline and perhaps evicting some of your data from cache / TLB.)
But the memory (and maybe L3 cache) stay at constant speed while CPU frequency changes, so the relative cost of a cache miss changes: The same response time in nanoseconds is fewer core clock cycles, so out-of-order exec can hide more of it more easily. And available memory bandwidth is higher relative to what a core can use. So HW prefetch has an easier time keeping up.
e.g. at 4.3GHz, a load that missed in L2 cache but hits in L3 on Skylake-server might have a total latency of about 79 core clock cycles. (https://www.7-cpu.com/cpu/Skylake_X.html - i7-7820X (Skylake X), 8 cores).
At 800MHz idle clock speed, an L2 cache miss is still 14 cycles (because it runs at core speed). But if another core is keeping the L3 cache (and the uncore in general) at high clock speed, the off-core part of that round-trip request will take many fewer core clock cycles.
e.g. we can make a back-of-the-envelope calculation by assuming that all the extra time for an L3 hit vs. an L2 hit is spent in the uncore, not the core, and takes a fixed number of nanoseconds. Since we have that time in cycles of a 4.3GHz clock, the math works out as 14 + (79-14)*8/43 cycles for an L3 hit at 800MHz = 26 cycles, down from 79.
This rough calculation actually matches up with the 7-cpu.com numbers for the same CPU with a core at 3.6GHz: L3 Cache Latency = 68 cycles. 14 + (79-14)*36/43 = 68.4.
Note that I picked a "server" part because different cores can run at different clock speeds. That's not the case in "client" CPUs like i7-6700k. Uncore (L3, interconnect, etc.) may still be able to vary independently of the cores, e.g. staying high for the GPU. Also, server parts have higher latency outside the core. (e.g. 4GHz Skylake i7-6700k with turbo disabled has L3 latency of only 42 core clock cycles, not 68 or 79.)
See also Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? for why/how L3 and memory latency affect max possible single-core memory bandwidth.
Of course, if you control the CPU frequency by allowing some warm-up, or for tasks that run for more than a trivial amount of time, this isn't a big deal.
(Although do note that Skylake will sometimes lower the clock speed when very memory-bound, which unfortunately hurts bandwidth even more, at the default energy_performance_preference = balance_power, but "balance_performance" or "performance" can avoid that. Slowing down CPU Frequency by imposing memory stress)
Do note that counting only cycles won't remove the cost of context switches (extra cache misses after switching back to this thread, and draining the ROB sucks). Or of competition from other cores for memory bandwidth.
e.g. another thread running on the other logical core of the same physical core will often seriously reduce IPC. Overall throughput usually goes up some, depending on the task, but individual per-thread throughput goes down.
Skylake has a perf event for tracking hyperthreading competition: cpu_clk_thread_unhalted.one_thread_active - IIRC that event count increments at something like 24MHz when your task is running and has the core all to itself. So if you see less than that, you know you had some competition and spent some time with the ROB partitioned and trading front-end cycles with another thread.
So there are a bunch of effects, and it's up to you to decide whether it's useful. Sorting results by core clock cycles sounds reasonable, but you should probably include CPU-seconds (task-clock) and average-frequency in the results to help people spot outliers / glitches.

Related

How to measure the context switching overhead of a very large program?

I am trying to measure the impact of CPU scheduler on a large AI program (https://github.com/mozilla/DeepSpeech).
By using strace, I can see that it uses a lot of (~200) CPU threads.
I have tried using Linux Perf to measure this, but I have only been able to find the number of context switch events, not the overhead of them.
What I am trying to achieve is the total CPU core-seconds spent on context switching. Since it is a pretty large program, I would prefer non-invasive tools to avoid having to edit the source code of this program.
How can I do this?
Are you sure most of those 200 threads are actually waiting to run at the same time, not waiting for data from a system call? I guess you can tell from perf stat that context-switches are actually pretty high, but part of the question is whether they're high for the threads doing the critical work.
The cost of a context-switch is reflected in cache misses once a thread is running again. (And stopping OoO exec from finding as much ILP right at the interrupt boundary). This cost is more significant than the cost of the kernel code that saves/restores registers. So even if there was a way to measure how much time the CPUs spent in kernel context-switch code (possible with perf record sampling profiler as long as your perf_event_paranoid setting allows recording kernel addresses), that wouldn't be an accurate reflection of the true cost.
Even making a system call has a similar (but lower and more frequent) performance cost from serializing OoO exec, as well as disturbing caches (and TLB). There's a useful characterization of this on real modern CPUs (from 2010) in a paper by Livio & Stumm, especially the graph on the first page of IPC (instructions per cycle) dropping after a system call returns, and taking time to recover: FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. (Conference presentation: https://www.usenix.org/conference/osdi10/flexsc-flexible-system-call-scheduling-exception-less-system-calls)
You might estimate context-switch cost by running the program on a system with enough cores not to need to context-switch much at all (e.g. a big many-core Xeon or Epyc), vs. on fewer cores but with the same CPUs / caches / inter-core latency and so on. So, on the same system with taskset --cpu-list 0-8 ./program to limit how many cores it can use.
Look at the total user-space CPU-seconds used: the amount higher is the extra amount of CPU time needed because of slowdowns from context switched. The wall-clock time will of course be higher when the same work has to compete for fewer cores, but perf stat includes a "task-clock" output which tells you a total time in CPU-milliseconds that threads of your process spent on CPUs. That would be constant for the same amount of work, with perfect scaling to more threads, and/or to the same threads competing for more / fewer cores.
But that would tell you about context-switch overhead on that big system with big caches and higher latency between cores than on a small desktop.

Where to find ipc (or cpi) value of Intel processors (say skylake) when diff no of physical and logical cores are used?

I am very new to this field and my question might be too stupid but please help me understand the fundamental here.
I want to know the instruction per cycle (ipc) or clock per instruction (cpi) of recent intel processors such as skylake or cascade lake. And I am also looking for these values when different no of physical cores are used and when hyper threading is used.
I thought spec cpu2017 benchmark results could help me here, but I could not find my ans there. They just compare the total execution time by time taken by some reference machine and gives the ratio. Am I missing something here?
I thought this is one of the very first performance parameters and should be calculated and published by some standard benchmark, but I could not find any. Am I missing something here?
Another related question which comes to my mind (and I think everybody might want to know) is what is the best it can provide using all the cores and threads (least cpi and max ipc)?
Please help me find ipc / cpi value of skylake (any Intel processor) when say maximum (28) cores are used and when hyperthreading is also enabled.
The IPC cost of hyperthreading (or SMT in general on non-Intel CPUs) totally depends on the workload.
If you're already bottlenecked on branch mispredicts, cache misses, or long dependency chains (low ILP), having 2 threads running on the same core leads to minimal interference.
(Partitioning the ROB reduces the ability to find ILP in either thread, though, so again it depends on the details.)
Competitive sharing of uop cache and L1d/L1i / L2 caches also might or might not be a problem, depending on cache footprint.
There is no general answer independent of workload
Some workloads get a major speedup from using HT to double the number of logical cores. Some high-ILP workloads actually do worse because of cache conflicts. (Workloads that can already come close to saturating the front-end at 4 uops per clock on Intel before Icelake, for example).
Agner Fog's microarch guide says a bit about this for some microarchitectures that support hyperthreading. https://agner.org/optimize/
IIRC, some AMD CPUs have higher front-end throughput with hyperthreading, but I think only Bulldozer-family.
Max throughput is not affected by HT, and each core is independent. e.g. 4 uops per clock for a Skylake core. Doubling the number of physical cores always doubles theoretical uops / clock. Obviously not all workloads parallelize efficiently, so running more threads might need more total instructions/uops, and/or create more memory stalls for communication.
Hyperthreading just helps you come closer to that more of the time by letting 2 threads fill each other's "bubbles" from stalls.

What is considered a high miss rate/low hit rate in caches?

I have been trying to profile some code that I wrote as a small memory test on my machine and by using perf I noticed:
Performance counter stats for './MemBenchmark':
15,980 LLC-loads
8,714 LLC-load-misses # 54.53% of all LL-cache hits
10.002878281 seconds time elapsed
The whole idea of the benchmark is 'stress' the memory so in my books the higher I can make the miss rate the better I think.
EDIT: Is there functionality within Perf that will allow a file to be profiled into different sections? e.g. If main() contains three for loops, is it possible to profile each loop individually to see the number of LLC load misses?
Remember that LLC-loads only counts loads that missed in L1d and L2. As a fraction of total loads (L1-dcache-loads), that's probably a very good hit rate for the cache hierarchy overall (thanks to good locality and/or successful prefetch.)
(Your CPU has a 3-level cache, so the Last Level is the shared L3; the L1 and L2 are private per-core caches. On CPU with only 2 levels of cache, the LLC would be L2.)
Only 9k accesses that had to go all the way to DRAM 10 seconds is very very good.
A low LLC hit rate with such a low total LLC-loads tells you that your workload has good locality for most of its accesses, but the accesses that do miss often have to go all the way to DRAM, and only half of them benefit from having L3 cache at all.
related: Cache friendly offline random read, and see #BeeOnRope's answer on Understanding perf detail when comparing two different implementations of a BFS algorithm where he says the absolute number of LLC misses is what counts for performance.
An algorithm with poor locality will generate a lot of L2 misses, and often a lot of L3 hits (quite possibly with a high L3 hit rate), but also many total L3 misses, so the pipeline is stalled a lot of the time waiting for memory.
What metric could you suggest to measure how my program performs in terms of stressing the memory?
Do you want to know how much total memory traffic your program causes, including prefetches? i.e. what kind of impact it might have on other programs competing for memory bandwidth? offcore_requests.all_requests could tell you how many requests (including L2 prefetches, page walks, and both loads and stores, but not L3 prefetches) make it past L2 to the shared L3 cache, whether or not they hit in shared L3. (Use the ocperf.py wrapper for perf. My Skylake has that event; IDK if your Nehalem will.)
As far as detecting whether your code bottlenecks on memory, LLC-load-misses per second as an absolute measure would be reasonable. Skylake at least has a cycle_activity.stalls_l3_miss to count cycles where no uops executed and there was an outstanding L3 miss. If that's more than a couple % of total cycles, you'd want to look into avoiding those stalls.
(I haven't tried using these events to learn anything myself, they might not be the most useful suggestion. It's hard to know the right question to ask yourself when profiling; there are lots of events you could look at but using them to learn something that helps you figure out how to change your code is hard. It helps a lot to have a good mental picture of how your code uses memory, so you know what to look for. For such a general question, it's hard to say much.)
Is there a way you could suggest that can break down the benchmark file to see which loops are causing the most stress?
You can use perf record -e whatever / perf report -Mintel to do statistical sample-based profiling for any event you want, to see where the hotspots are.
But for cache misses, sometimes the blame lies with some code that looped over an array and evicted lots of valuable data, not the code touching the valuable data that would still be hot.
A loop over a big array might not see many cache misses itself if hardware prefetching does its job.
linux perf: how to interpret and find hotspots. It can be very useful to use stack sampling if you don't know exactly what's slow and fast in your program. Sampling the call stack on each event will show you which function call high up in the call tree is to blame for all the work its callees are doing. Avoiding that call in the first place can be much better than speeding up the functions it calls by a bit.
(Avoid work instead of just doing the same work with better brute force. Careful applications of the maximum brute-force a modern CPU can bring to bear with AVX2 is useful after you've established that you can't avoid doing it in the first place.)

Can a hyper-threaded processor core execute two threads at the exact same time?

I'm having a hard time understanding hyper-threading. If the logical core doesn't actually exist, what's the point of using hyper-threading?. The wikipedia article states that:
For each processor core that is physically present, the operating system addresses two virtual (logical) cores and shares the workload between them when possible.
If the two logical cores share the same execution unit, that means one of the threads will have to be put on hold while the other executes, that being said, I don't understand how hyper-threading can be useful, since you're not actually introducing a new execution unit. I can't wrap my head around this
See my answer on a softwareengineering.SE question for some details about how modern CPUs find and exploit instruction-level parallelism (ILP) by running multiple instructions at once. (Including a block diagram of Intel Haswell's pipeline, and links to more CPU microarchitecture details). Also Modern Microprocessors
A 90-Minute Guide!
You have a CPU with lots of execution units and a front-end that can keep them mostly supplied with work to do, but only under good conditions. Stalls like cache misses or branch mispredicts, or just limited parallelism (e.g. a loop that does one long chain of FP additions, bottlenecking on FP latency at one (scalar or SIMD) add per 4 or 5 clocks instead of one or two per clock) will result in throughput of much less than 4 instructions per cycle, and leave execution units idle.
The point of HT (and Simultaneous Multithreading (SMT) in general) is to keep those hungry execution units fed with work to do, even when running code with low ILP or lots of stalls (cache misses / branch mispredicts).
SMT only adds a bit of extra logic to the pipeline so it can keep track of two separate architectural contexts at the same time. So it costs a lot less die area and power than having twice or 4x as many full cores. (Knight's Landing Xeon Phi runs 4 threads per core, mainstream Intel CPUs run 2. Some non-x86 chips run 8 threads per core, aimed at database-server type workloads.) But of course having to divide out-of-order execution resources between logical threads often means the throughput gain is significantly below 2x or 4x, often far below, and for some workloads is negative.
Also related What is the difference between Hyperthreading and Multithreading? Does AMD Zen use either? - AMD's SMT is basically the same as Intel's, just not using the trademark "Hyperthreading" for it. See also other links in my answer there, like https://www.realworldtech.com/nehalem/3/ and especially https://www.realworldtech.com/alpha-ev8-smt/ for an intro with diagrams to what SMT is all about. (Many members of the Alpha EV8 design team was hired by Intel after DEC folded, and went on to implement SMT in Netburst (Pentium 4) which Intel branded Hyperthreading.)
Common misconceptions
Hyperthreading is not just optimized context switching. Simpler designs that switch to the other thread on a cache miss are possible, but HT is more advanced than that. (Switch-on-stall, or round-robin "barrel processor").
With two threads active, the front-end alternates between threads every cycle (in the fetch, decode, and issue/rename stages), but the out-of-order back-end can actually execute uops from both logical cores in the same cycle. The issue/rename stage is 4 uops wide on Intel before Ice Lake.
In pipeline stages that normally alternate, any time one thread is stalled, the other thread gets all the cycles in that stage. HT is much better than just fixed alternating, because one thread can get lots of work done while the other is recovering from a branch mispredict or waiting for a cache miss.
Note that up to 10 or 12 cache misses can be outstanding at once (from L1D cache in Intel CPUs: this is the number of LFB (Line Fill Buffers), and memory requests are pipelined. But if the address for the next load depends on an earlier load (e.g. pointer chasing through a tree or linked list), the CPU doesn't know where to load from and can't keep multiple requests in flight. So it is actually useful for both threads to be waiting on cache misses in parallel.
Some resources are statically partitioned when two threads are active, some are competitively shared. See this pdf of slides for some details. (For more details about how to actually optimize asm for Intel and AMD CPUs, see Agner Fog's microarchitecture PDF.)
When one logical core "sleeps" (i.e. the kernel runs a HLT instruction or whatever MWAIT to enter a deeper sleep), the physical core transitions to single-thread mode and lets the still-active logical core have all the resources (including the full ReOrder Buffer size, and other statically-partitioned resources), so it's ability to find and exploit ILP in the single thread still running increases more than when the other thread is simply stalled on a cache miss.
BTW, some workloads actually run slower with HT. If your working set barely fits in L2 or L1D cache, then running two on the same core will lead to a lot more cache misses. For very well-tuned high-throughput code that can already keep the execution units saturated
(like an optimized matrix multiply in high-performance computing), it can make sense to disable HT. Always benchmark.
On Skylake, I've found that video encoding (with x265 -preset slower, 1080p) is about 15% faster with 8 threads instead of 4, on my quad-core i7-6700k. I didn't actually disable HT for the 4-thread test, but Linux's scheduler is good at not bouncing threads around and running threads on separate physical cores when there are enough to go around. A 15% speedup is pretty good considering that x265 has a lot of hand-written asm and runs very high instructions-per-cycle even when it has a whole core to itself. (Slower presets like I used tend to be more CPU-bound than memory-bound.)

Multicore processor core communication speeds

I would look to find the speed of communication between the two cores of a computer.
I'm in the very early stages of planning to massively parallelise a sequential program and I need to think about network communication speeds vs. communication between cores on a single processor.
Ubuntu linux probably provides some way of seeing this sort of information? I would have thought speed fluctuates.. I just need some average value. I'm basically needing to write something up at the moment and it would be good to talk about these ratios.
Any ideas?
Thanks.
According to this benchmark: http://www.dragonsteelmods.com/index.php?option=com_content&task=view&id=6120&Itemid=38&limit=1&limitstart=4 (Last image on the page)
On an Intel Q6600, inter-core latency is 32 nanoseconds. Network latency is measured in milliseconds which 1,000,000 milliseconds / nanosecond. "Good" network latency is considered around or under 100ms, so given that, the difference is about the order of 1 million times faster for inter-core latency.
Besides latency though there's also Bandwidth to consider. Again based on the linked bookmark, benchmark for that particular configuration, inter-core bandwidth is about 14GB/sec whereas according to this: http://www.tomshardware.com/reviews/gigabit-ethernet-bandwidth,2321-3.html, real-world test of a Gigabit ethernet connection shows about 35.8MB/sec so the difference there is smaller, only on the order of around 500 times faster in terms of bandwidth as opposed to a 1,000,000 times in latency. Depending on which is more important to your application that might change your numbers.
The network speeds are measured in milliseconds for Ethernet ($5-$100/port), or microseconds for specialized MPI hardware like Dolphin on Myrintet (~ $1k/port). Inter-core speeds are measured in nanoseconds, as the data is copied from one memory area to another, and then some signal is sent from one CPU to another (the data will be protected from simultaneous access by a mutex or a full-bodied queue).
So, using a back'o'the'napkin calculation the ratio is 1:10^6.
Inter-core communication is going to be massively faster. Why ?
the network layer imposes a massive overhead in terms of packets, addressing, handling contention etc.
the physical distances will impose a sizeable impact
How you measure inter-core communication speed would be very difficult. But given the above I think it's a redundant calculation to make.
This is a non-trivial thing to find. The speed of data transfer between two cores depends entirely on the application. It could depend on any (or all) of - the speed of register access, the clock speed of the cores, the system bus speed, the latency of your cache, the latency of your memory, etc., etc., etc. In short, run a benchmark or you'll be guessing in the dark.

Resources