How to measure the context switching overhead of a very large program?

How to measure the context switching overhead of a very large program? - multithreading

I am trying to measure the impact of CPU scheduler on a large AI program (https://github.com/mozilla/DeepSpeech).
By using strace, I can see that it uses a lot of (~200) CPU threads.
I have tried using Linux Perf to measure this, but I have only been able to find the number of context switch events, not the overhead of them.
What I am trying to achieve is the total CPU core-seconds spent on context switching. Since it is a pretty large program, I would prefer non-invasive tools to avoid having to edit the source code of this program.
How can I do this?

Are you sure most of those 200 threads are actually waiting to run at the same time, not waiting for data from a system call? I guess you can tell from perf stat that context-switches are actually pretty high, but part of the question is whether they're high for the threads doing the critical work.
The cost of a context-switch is reflected in cache misses once a thread is running again. (And stopping OoO exec from finding as much ILP right at the interrupt boundary). This cost is more significant than the cost of the kernel code that saves/restores registers. So even if there was a way to measure how much time the CPUs spent in kernel context-switch code (possible with perf record sampling profiler as long as your perf_event_paranoid setting allows recording kernel addresses), that wouldn't be an accurate reflection of the true cost.
Even making a system call has a similar (but lower and more frequent) performance cost from serializing OoO exec, as well as disturbing caches (and TLB). There's a useful characterization of this on real modern CPUs (from 2010) in a paper by Livio & Stumm, especially the graph on the first page of IPC (instructions per cycle) dropping after a system call returns, and taking time to recover: FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. (Conference presentation: https://www.usenix.org/conference/osdi10/flexsc-flexible-system-call-scheduling-exception-less-system-calls)
You might estimate context-switch cost by running the program on a system with enough cores not to need to context-switch much at all (e.g. a big many-core Xeon or Epyc), vs. on fewer cores but with the same CPUs / caches / inter-core latency and so on. So, on the same system with taskset --cpu-list 0-8 ./program to limit how many cores it can use.
Look at the total user-space CPU-seconds used: the amount higher is the extra amount of CPU time needed because of slowdowns from context switched. The wall-clock time will of course be higher when the same work has to compete for fewer cores, but perf stat includes a "task-clock" output which tells you a total time in CPU-milliseconds that threads of your process spent on CPUs. That would be constant for the same amount of work, with perfect scaling to more threads, and/or to the same threads competing for more / fewer cores.
But that would tell you about context-switch overhead on that big system with big caches and higher latency between cores than on a small desktop.

Related

Is cycle count itself reliable on program timing?

I am currently trying to develop a judging system that measure not only time and memory use but also more deeper information such as cache misses and etc., which I assume the hardware counters (using perf) are perfect for it.
But for the timing part,
I wonder if using purely the cycle count to determine execution speed is reliable enough?
Hope to know about the pros and cons about this decision.

So you're proposing measuring CPU cycles, instead of seconds? Sounds somewhat reasonable.
For some microbenchmarks that's good, and mostly factors out the variations due to CPU frequency changes. (And delays due to interrupts if you count only user-space cycles, if you're microbenching a loop that doesn't make system calls. Only the secondary effects of interrupts are then visible, i.e. serializing the pipeline and perhaps evicting some of your data from cache / TLB.)
But the memory (and maybe L3 cache) stay at constant speed while CPU frequency changes, so the relative cost of a cache miss changes: The same response time in nanoseconds is fewer core clock cycles, so out-of-order exec can hide more of it more easily. And available memory bandwidth is higher relative to what a core can use. So HW prefetch has an easier time keeping up.
e.g. at 4.3GHz, a load that missed in L2 cache but hits in L3 on Skylake-server might have a total latency of about 79 core clock cycles. (https://www.7-cpu.com/cpu/Skylake_X.html - i7-7820X (Skylake X), 8 cores).
At 800MHz idle clock speed, an L2 cache miss is still 14 cycles (because it runs at core speed). But if another core is keeping the L3 cache (and the uncore in general) at high clock speed, the off-core part of that round-trip request will take many fewer core clock cycles.
e.g. we can make a back-of-the-envelope calculation by assuming that all the extra time for an L3 hit vs. an L2 hit is spent in the uncore, not the core, and takes a fixed number of nanoseconds. Since we have that time in cycles of a 4.3GHz clock, the math works out as 14 + (79-14)*8/43 cycles for an L3 hit at 800MHz = 26 cycles, down from 79.
This rough calculation actually matches up with the 7-cpu.com numbers for the same CPU with a core at 3.6GHz: L3 Cache Latency = 68 cycles. 14 + (79-14)*36/43 = 68.4.
Note that I picked a "server" part because different cores can run at different clock speeds. That's not the case in "client" CPUs like i7-6700k. Uncore (L3, interconnect, etc.) may still be able to vary independently of the cores, e.g. staying high for the GPU. Also, server parts have higher latency outside the core. (e.g. 4GHz Skylake i7-6700k with turbo disabled has L3 latency of only 42 core clock cycles, not 68 or 79.)
See also Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? for why/how L3 and memory latency affect max possible single-core memory bandwidth.
Of course, if you control the CPU frequency by allowing some warm-up, or for tasks that run for more than a trivial amount of time, this isn't a big deal.
(Although do note that Skylake will sometimes lower the clock speed when very memory-bound, which unfortunately hurts bandwidth even more, at the default energy_performance_preference = balance_power, but "balance_performance" or "performance" can avoid that. Slowing down CPU Frequency by imposing memory stress)
Do note that counting only cycles won't remove the cost of context switches (extra cache misses after switching back to this thread, and draining the ROB sucks). Or of competition from other cores for memory bandwidth.
e.g. another thread running on the other logical core of the same physical core will often seriously reduce IPC. Overall throughput usually goes up some, depending on the task, but individual per-thread throughput goes down.
Skylake has a perf event for tracking hyperthreading competition: cpu_clk_thread_unhalted.one_thread_active - IIRC that event count increments at something like 24MHz when your task is running and has the core all to itself. So if you see less than that, you know you had some competition and spent some time with the ROB partitioned and trading front-end cycles with another thread.
So there are a bunch of effects, and it's up to you to decide whether it's useful. Sorting results by core clock cycles sounds reasonable, but you should probably include CPU-seconds (task-clock) and average-frequency in the results to help people spot outliers / glitches.

Where to find ipc (or cpi) value of Intel processors (say skylake) when diff no of physical and logical cores are used?

I am very new to this field and my question might be too stupid but please help me understand the fundamental here.
I want to know the instruction per cycle (ipc) or clock per instruction (cpi) of recent intel processors such as skylake or cascade lake. And I am also looking for these values when different no of physical cores are used and when hyper threading is used.
I thought spec cpu2017 benchmark results could help me here, but I could not find my ans there. They just compare the total execution time by time taken by some reference machine and gives the ratio. Am I missing something here?
I thought this is one of the very first performance parameters and should be calculated and published by some standard benchmark, but I could not find any. Am I missing something here?
Another related question which comes to my mind (and I think everybody might want to know) is what is the best it can provide using all the cores and threads (least cpi and max ipc)?
Please help me find ipc / cpi value of skylake (any Intel processor) when say maximum (28) cores are used and when hyperthreading is also enabled.

The IPC cost of hyperthreading (or SMT in general on non-Intel CPUs) totally depends on the workload.
If you're already bottlenecked on branch mispredicts, cache misses, or long dependency chains (low ILP), having 2 threads running on the same core leads to minimal interference.
(Partitioning the ROB reduces the ability to find ILP in either thread, though, so again it depends on the details.)
Competitive sharing of uop cache and L1d/L1i / L2 caches also might or might not be a problem, depending on cache footprint.
There is no general answer independent of workload
Some workloads get a major speedup from using HT to double the number of logical cores. Some high-ILP workloads actually do worse because of cache conflicts. (Workloads that can already come close to saturating the front-end at 4 uops per clock on Intel before Icelake, for example).
Agner Fog's microarch guide says a bit about this for some microarchitectures that support hyperthreading. https://agner.org/optimize/
IIRC, some AMD CPUs have higher front-end throughput with hyperthreading, but I think only Bulldozer-family.
Max throughput is not affected by HT, and each core is independent. e.g. 4 uops per clock for a Skylake core. Doubling the number of physical cores always doubles theoretical uops / clock. Obviously not all workloads parallelize efficiently, so running more threads might need more total instructions/uops, and/or create more memory stalls for communication.
Hyperthreading just helps you come closer to that more of the time by letting 2 threads fill each other's "bubbles" from stalls.

Can a hyper-threaded processor core execute two threads at the exact same time?

I'm having a hard time understanding hyper-threading. If the logical core doesn't actually exist, what's the point of using hyper-threading?. The wikipedia article states that:
For each processor core that is physically present, the operating system addresses two virtual (logical) cores and shares the workload between them when possible.
If the two logical cores share the same execution unit, that means one of the threads will have to be put on hold while the other executes, that being said, I don't understand how hyper-threading can be useful, since you're not actually introducing a new execution unit. I can't wrap my head around this

See my answer on a softwareengineering.SE question for some details about how modern CPUs find and exploit instruction-level parallelism (ILP) by running multiple instructions at once. (Including a block diagram of Intel Haswell's pipeline, and links to more CPU microarchitecture details). Also Modern Microprocessors
A 90-Minute Guide!
You have a CPU with lots of execution units and a front-end that can keep them mostly supplied with work to do, but only under good conditions. Stalls like cache misses or branch mispredicts, or just limited parallelism (e.g. a loop that does one long chain of FP additions, bottlenecking on FP latency at one (scalar or SIMD) add per 4 or 5 clocks instead of one or two per clock) will result in throughput of much less than 4 instructions per cycle, and leave execution units idle.
The point of HT (and Simultaneous Multithreading (SMT) in general) is to keep those hungry execution units fed with work to do, even when running code with low ILP or lots of stalls (cache misses / branch mispredicts).
SMT only adds a bit of extra logic to the pipeline so it can keep track of two separate architectural contexts at the same time. So it costs a lot less die area and power than having twice or 4x as many full cores. (Knight's Landing Xeon Phi runs 4 threads per core, mainstream Intel CPUs run 2. Some non-x86 chips run 8 threads per core, aimed at database-server type workloads.) But of course having to divide out-of-order execution resources between logical threads often means the throughput gain is significantly below 2x or 4x, often far below, and for some workloads is negative.
Also related What is the difference between Hyperthreading and Multithreading? Does AMD Zen use either? - AMD's SMT is basically the same as Intel's, just not using the trademark "Hyperthreading" for it. See also other links in my answer there, like https://www.realworldtech.com/nehalem/3/ and especially https://www.realworldtech.com/alpha-ev8-smt/ for an intro with diagrams to what SMT is all about. (Many members of the Alpha EV8 design team was hired by Intel after DEC folded, and went on to implement SMT in Netburst (Pentium 4) which Intel branded Hyperthreading.)
Common misconceptions
Hyperthreading is not just optimized context switching. Simpler designs that switch to the other thread on a cache miss are possible, but HT is more advanced than that. (Switch-on-stall, or round-robin "barrel processor").
With two threads active, the front-end alternates between threads every cycle (in the fetch, decode, and issue/rename stages), but the out-of-order back-end can actually execute uops from both logical cores in the same cycle. The issue/rename stage is 4 uops wide on Intel before Ice Lake.
In pipeline stages that normally alternate, any time one thread is stalled, the other thread gets all the cycles in that stage. HT is much better than just fixed alternating, because one thread can get lots of work done while the other is recovering from a branch mispredict or waiting for a cache miss.
Note that up to 10 or 12 cache misses can be outstanding at once (from L1D cache in Intel CPUs: this is the number of LFB (Line Fill Buffers), and memory requests are pipelined. But if the address for the next load depends on an earlier load (e.g. pointer chasing through a tree or linked list), the CPU doesn't know where to load from and can't keep multiple requests in flight. So it is actually useful for both threads to be waiting on cache misses in parallel.
Some resources are statically partitioned when two threads are active, some are competitively shared. See this pdf of slides for some details. (For more details about how to actually optimize asm for Intel and AMD CPUs, see Agner Fog's microarchitecture PDF.)
When one logical core "sleeps" (i.e. the kernel runs a HLT instruction or whatever MWAIT to enter a deeper sleep), the physical core transitions to single-thread mode and lets the still-active logical core have all the resources (including the full ReOrder Buffer size, and other statically-partitioned resources), so it's ability to find and exploit ILP in the single thread still running increases more than when the other thread is simply stalled on a cache miss.
BTW, some workloads actually run slower with HT. If your working set barely fits in L2 or L1D cache, then running two on the same core will lead to a lot more cache misses. For very well-tuned high-throughput code that can already keep the execution units saturated
(like an optimized matrix multiply in high-performance computing), it can make sense to disable HT. Always benchmark.
On Skylake, I've found that video encoding (with x265 -preset slower, 1080p) is about 15% faster with 8 threads instead of 4, on my quad-core i7-6700k. I didn't actually disable HT for the 4-thread test, but Linux's scheduler is good at not bouncing threads around and running threads on separate physical cores when there are enough to go around. A 15% speedup is pretty good considering that x265 has a lot of hand-written asm and runs very high instructions-per-cycle even when it has a whole core to itself. (Slower presets like I used tend to be more CPU-bound than memory-bound.)

Where is the point at which adding additional cores or CPUs doesn’t improve the performance at all?

*Adding a second core or CPU might increase the performance of your parallel program, but it is unlikely to double it. Likewise, a
four-core machine is not going to execute your parallel program four
times as quickly— in part because of the overhead and coordination
described in the previous sections. However, the design of the
computer hardware also limits its ability to scale. You can expect a
significant improvement in performance, but it won’t be 100 percent
per additional core, and there will almost certainly be a point at
which adding additional cores or CPUs doesn’t improve the performance
at all.
*
I read the paragraph above from a book. But I don't get the last sentence.
So, Where is the point at which adding additional cores or CPUs doesn’t improve the performance at all?

If you take a serial program and a parallel version of the same program then the parallel program has to do some operations that the serial program does not, specifically operations concerned with coordinating the operations of the multiple processors. These contribute to what is often called 'parallel overhead' -- additional work that a parallel program has to do. This is one of the factors that makes it difficult to get 2x speed-up on 2 processors, 4x on 4 or 32000x on 32000 processors.
If you examine the code of a parallel program you will often find segments which are serial, that is which only use one processor while the others are idle. There are some (fragments of) algorithms which are not parallelisable, and there are some operations which are often not parallelised but which could be: I/O operations for instance, to parallelise these you need some sort of parallel I/O system. This 'serial fraction' provides an irreducible minimum time for your computation. Amdahl's Law explains this, and that article provides a useful starting point for your further reading.
Even when you do have a program which is well parallelised the scaling (ie the way speed-up changes as the number of processors increases) does not equal 1. For most parallel programs the size of the parallel overhead (or the amount of processor time which is devoted to operations which are only necessary for parallel computing) increases as some function of the number of processors. This often means that adding processors adds parallel overhead and at some point in the scaling of your program and jobs the increase in overhead cancels out (or even reverses) the increase in processor power. The article on Amdahl's Law also covers Gustafson's Law which is relevant here.
I've phrased this all in very general terms, no consideration of current processor and computer architectures; what I am describing are features of parallel computation (as currently understood) not of any particular program or computer.
I flat out disagree with #Daniel Pittman's assertion that these issues are of only theoretical concern. Some of us are working very hard to make our programs scale to very large numbers of processors (1000s). And almost all desktop and office development these days, and most mobile development too, targets multi-processor systems and using all those cores is a major concern.
Finally, to answer your question, at what point does adding processors no longer increase execution speed, now that is an architecture- and program-dependent question. Happily, it is one that is amenable to empirical investigation. Figuring out the scalability of parallel programs, and identifying ways of improving it, are a growing niche within the software engineering 'profession'.

#High Performance Mark is right. This happens when you are trying to solve a fixed size problem in the fastest possible way, so that Amdahl' law applies. It does not (usually) happen when you are trying to solve in a fixed time a problem. In the former case, you are willing to use the same amount of time to solve a problem
whose size is bigger;
whose size is exactly the same as before, but with a greeter accuracy.
In this situation, Gustafson's law applies.
So, let's go back to fixed size problems.
In the speedup formula you can distinguish these components:
Inherently sequential computations: σ(n)
Potentially parallel computations: ϕ(n)
Overhead (Communication operations etc): κ(n,p)
and the speedup for p processors for a problem size n is
Adding processors reduces the computation time but increases the communication time (for message-passing algorithms; it increases the synchronization overhead etcfor shared-memory algorithm); if we continue adding more processors, at some point the communication time increase will be larger than the corresponding computation time decrease.
When this happens, the parallel execution time begins to increase.
Speedup is inversely proportional to execution time, so that its curve begins to decline.
For any fixed problem size, there is an optimum number of processors that minimizes the overall parallel execution time.
Here is how you can compute exactly (analytical solution in closed form) the point at which you get no benefit by adding additional processors (or cores if you prefer).

The answer is, of course, "it depends", but in the current world of shared memory multi-processors the short version is "when traffic coordinating shared memory or other resources consumes all available bus bandwidth and/or CPU time".
That is a very theoretical problem, though. Almost nothing scales well enough to keep taking advantage of more cores at small numbers. Few applications benefit from 4, less from 8, and almost none from 64 cores today - well below any theoretical limitations on performance.

If we're talking x86 that architecture is more or less at its limits. # 3 GHz electricity travels 10 cm (actually somewhat less) per Hz, the die is about 1 cm square, components have to be able to switch states in that single Hz (1/3000000000 of a second). The current manufacturing process (22nm) gives interconnections that are 88 (silicon) atoms wide (I may have misunderstood this). With this in mind you realize that there isn't that much more that can be done with physics here (how narrow can an interconnection be? 10 atoms? 20?). At the other end the manufacturer, to be able to market a device as "higher performing" than its predecessor, adds a core which theoretically doubles the processing power.
"Theoretically" is not actually completely true. Some specially written applications will subdivide a large problem into parts that are small enough to be contained inside a single core and its exclusive caches (L1 & L2). A part is given to the core and it processes for a significant amount of time without accessing the L3 cache or RAM (which it shares with other cores and therefore will be where collisions/bottlenecks will occur). Upon completion it writes its results to RAM and receives a new part of the problem to work on.
If a core spends 99% of its time doing internal processing and 1% reading from and writing to shared memory (L3 cache and RAM) you could have an additional 99 cores doing the same thing because, in the end, the limiting factor will be the number of accesses the shared memory is capable of. Given my example of 99:1 such an application could make efficient use of 100 cores.
With more common programs - office, ie, etc - the extra processing power available will hardly be noticed. Some parts of the programs may have smaller parts written to take advantage of multiple cores and if you know which ones you may notice that those parts of the programs are much faster.
The 3 GHz was used as an example because it works well with the speed of light which is 300000000 meters/sec. I read recently that AMD's latest architecture was able to execute at 5 GHz but this was with special coolers and, even then, it was slower (processed less) than an intel i7 running at a significantly slower frequency.

It heavily depends on your program architecture/design. Adding cores improves parallel processing. If your program is not doing anything in parallel but only sequentially, adding cores would not improve its performance at all. It might improve other things though like framework internal processing (if you're using a framework).
So the more parallel processing is allowed in your program the better it scales with more cores. But if your program has limits on parallel processing (by design or nature of data) it will not scale indefinitely. It takes a lot of effort to make program run on hundreds of cores mainly because of growing overhead, resource locking and required data coordination. The most powerful supercomputers are indeed massively multi-core but writing programs that can utilize them is a significant effort and they can only show their power in an inherently parallel tasks.

Time Stamp counter (TSC) when switching between Kernel & User mode

I am wondering if somebody knows some more details about the time stamp counter in Linux when a context switch occurs? Until now I had the opinion, that the TSC value is just increasing by 1 during each clock cycle, independent if in kernel or in user mode. I measured now the performance of an application using the TSC which yielded a performance result of 5 Mio Clock Cyles. Then, I made some changes to the scheduler which means that a context switch takes considerably longer, i.g. 2 Mio cycles instead of 500.000 cycles. The funny bit is, that when measuring the performance of the original application again it still takes 5 Mio cycles... So I am wondering why it did not take considerably longer as a context switch takes now almost 2 Mio clock cyles more? (and there occur at least 3 context during execution of the application).
Is the time stamp counter somehow deactivated during kernel mode? Or is the content of the TSC saved during contest switches? Thanks if someone could point me out what could be the problem!

As you can read on Wikipedia
With the advent of multi-core/hyperthreaded CPUs, systems with multiple CPUs, and "hibernating" operating systems, the TSC cannot be relied on to provide accurate results. The issue has two components: rate of tick and whether all cores (processors) have identical values in their time-keeping registers. There is no promise that the timestamp counters of multiple CPUs on a single motherboard will be synchronized. In such cases, programmers can only get reliable results by locking their code to a single CPU. Even then, the CPU speed may change due to power-saving measures taken by the OS or BIOS, or the system may be hibernated and later resumed (resetting the time stamp counter). Reliance on the time stamp counter also reduces portability, as other processors may not have a similar feature. Recent Intel processors include a constant rate TSC (identified by the constant_tsc flag in Linux's /proc/cpuinfo). With these processors the TSC reads at the processor's maximum rate regardless of the actual CPU running rate. While this makes time keeping more consistent, it can skew benchmarks, where a certain amount of spin-up time is spent at a lower clock rate before the OS switches the processor to the higher rate. This has the effect of making things seem like they require more processor cycles than they normally would.

I believe the TSC is actually a hardware construct of the processor you're using. IE: reading the TSC actually uses the RDTSC processor opcode. I don't even think there's a way for the OS to alter it's value, it just increases with each tick since the last power reset.
Regarding your modifications to the scheduler, is it possible that you're using a multi-core processor in a way that the OS is not switching out your running process? You might put a call to sched_yield() or sleep(0) in your program to see if your scheduler changes start taking effect.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string