How would a multithreaded program be more energy efficient? - multithreading

In its Energy-Efficient Software Guidelines Intel suggests that programs are designed multithreaded for better energy efficiency.
I don't get it. Suppose I have a quad core processor that can switch off unused cores. Suppose my code is perfectly parallelizeable (synchronization overhead is negligible).
If I use only one core I burn one core for one hour, if I use four cores I burn four cores for 15 minutes - the same amount of core-hours either way. Where's the saving?

I suspect it has to do with a non-linear relation between CPU utilization and power consumption. So if you can spread 100% CPU utilization over 4 CPUs each will have 25% utilization - and say 12% consumption.
This is especially true when dynamic CPU scaling is used according to Wikipedia the power drain of a CPU is P = C(V^2)F. When a CPU is running faster it requires higher voltages - and that 'to the power of 2' becomes crucial. Furthermore the voltage will be a function of F (which means F can be solved for V) giving something like P = C(F^2)F. Thus by spreading the load over 4 CPUs (running at 100% capacity at that frequency) you can mitigate the cost for the same work.
We can make F a function of L (load) at 100% of one core (as it would be in your OS), so:
F = 1000 + L/100 * 500 = 1000 + 5L
p = C((1000 + 5L)^2)(1000 + 5L) = C(1000 + 5L)^3
Now that we can relate load (L) to the power consumption we can see the characteristics of the power consumption given everything on one core:
p = C(1000 + 5L)^3
p = 1000000000 + 15000000L + 75000L^2 + 125L^3
Or spread over 4 cores:
p = 4C(1000 + (5/4)L)^3
p = 4000000000 + 15000000L + 18750.4L^2 + 7.5L^3
Notice the factors in front of the L^2 and L^3.

During that one hour, the one core isn't the only thing you keep running.

You burn 4 times energy with 4 cores but you do 4 times more work too! If, as you said, the synchro is negligible and the work is parallelizable, you'll spend 4 times less time.
Using multiple threads can save energy when you have i/o waits. One thread can wait while other threads can perform other computations; instead of having your application idle.

A CPU is one part of a computer. It has fans, a motherboard, hard drives, graphics card, RAM etc, lets call this the BASE. If your doing scientific computing (i.e., a compute cluster) you are powering many computers. If you are powering 100's of BASE's anyway, why not allow those BASES to have multiple physical CPU's on them so those CPU's can share the resources of the BASE, physical and logical.
Now INTEL's marketing blurb probably also depends on the fact that these days, each CPU wafer contains multiple cores. Powering multiple physical CPU's is different to powering a single physical cpu with multiple cores.
So if amount of work done per unit of power is the benchmark in question, then modern CPU's performing highly parallel tasks then yes you get more bang for your buck, compared with the previous generation of processors. As not only can you get more cores / cpu, it is also common to get BASE's which can take multiple CPU's.
One may easily assert that one top-end system can now house the processing power of 8-16 singl-cpu single-core CPU's of the past (assuming that in this hypothetical case, that on the new system and the older generation system, each core has the same processing power ).

If a program is multithreaded that doesn't mean that it would use more cores. It just means that more tasks are dealt with in the same time so the overall processor time is shorter.

There are 3 reasons, two of which have already been pointed out:
More overall time means that other (non-CPU) components need to run longer, even if the net calculation for the CPU remains the same
More threads mean more things are done at the same time (because stalls are used for something useful), again the overall real time is reduced.
The CPU power consumption for running the same calculations on one core is not the same. Intel CPUs have a built-in clock boosting for single-core usage (I forgot the marketing buzzword for it). A higher clock means dysproportionally more power consumption and dysproportionally more heat, which again requires the fan to spin faster, too.
So in summary, you consume more power with the CPU and more power for cooling the CPU for a longer time, and you run other components for a longer time, too.
As a 4th reason, one could allege (note that this is only an assumption!) that Intel CPUs are hyperthreaded, and since hyperthreaded cores share some resources, running two threads at once is more efficient than running one thread twice as long.

Related

Multi-thread programming with logical threads

The theory of multirhead programing explained is based on the number of cores, but nowdays processors have more logical cores than physical ones. The question is, if a well-implemented parallel algorithm is run on a processor with 4 physical and 8 logical cores, the speedup will be 4 or 8 times (the best case without couting the cost of parallelism and additional staff).
For example below you can see the results of image filtering, having 4 cores and 8 thread CPU. It looks like upper bound is 4 time speed up, but in case of using 8 threads it seems to be the best speed up among the rest
Logical cores are only useful if your code is latency bound. This is the case when there are stalls (eg. cache miss) or when instructions are sequentiallized a lot (eg. a loop of dependent divisions). In such a case, the processor could truly execute 2 threads in parallel on the same physical core (using two logical cores). A relatively good example is a naive matrix transposition. Logical cores do not help a lot on many optimized codes because optimized codes do not stall often and generally expose a lot of instruction level parallelism (eg. due to loop unrolling).
When you are measuring a speed up, it is generally not relevant to use logical cores unless you know that the workload can benefit from them (the ones inherently latency bound) or when the processor is designed so they should be use (on Xeon Phi or POWER processors for example). I expect logical cores not to be useful on an optimized image filtering workload.
Note that logical cores tends to make benchmark results harder to understand.

Where to find ipc (or cpi) value of Intel processors (say skylake) when diff no of physical and logical cores are used?

I am very new to this field and my question might be too stupid but please help me understand the fundamental here.
I want to know the instruction per cycle (ipc) or clock per instruction (cpi) of recent intel processors such as skylake or cascade lake. And I am also looking for these values when different no of physical cores are used and when hyper threading is used.
I thought spec cpu2017 benchmark results could help me here, but I could not find my ans there. They just compare the total execution time by time taken by some reference machine and gives the ratio. Am I missing something here?
I thought this is one of the very first performance parameters and should be calculated and published by some standard benchmark, but I could not find any. Am I missing something here?
Another related question which comes to my mind (and I think everybody might want to know) is what is the best it can provide using all the cores and threads (least cpi and max ipc)?
Please help me find ipc / cpi value of skylake (any Intel processor) when say maximum (28) cores are used and when hyperthreading is also enabled.
The IPC cost of hyperthreading (or SMT in general on non-Intel CPUs) totally depends on the workload.
If you're already bottlenecked on branch mispredicts, cache misses, or long dependency chains (low ILP), having 2 threads running on the same core leads to minimal interference.
(Partitioning the ROB reduces the ability to find ILP in either thread, though, so again it depends on the details.)
Competitive sharing of uop cache and L1d/L1i / L2 caches also might or might not be a problem, depending on cache footprint.
There is no general answer independent of workload
Some workloads get a major speedup from using HT to double the number of logical cores. Some high-ILP workloads actually do worse because of cache conflicts. (Workloads that can already come close to saturating the front-end at 4 uops per clock on Intel before Icelake, for example).
Agner Fog's microarch guide says a bit about this for some microarchitectures that support hyperthreading. https://agner.org/optimize/
IIRC, some AMD CPUs have higher front-end throughput with hyperthreading, but I think only Bulldozer-family.
Max throughput is not affected by HT, and each core is independent. e.g. 4 uops per clock for a Skylake core. Doubling the number of physical cores always doubles theoretical uops / clock. Obviously not all workloads parallelize efficiently, so running more threads might need more total instructions/uops, and/or create more memory stalls for communication.
Hyperthreading just helps you come closer to that more of the time by letting 2 threads fill each other's "bubbles" from stalls.

Can a hyper-threaded processor core execute two threads at the exact same time?

I'm having a hard time understanding hyper-threading. If the logical core doesn't actually exist, what's the point of using hyper-threading?. The wikipedia article states that:
For each processor core that is physically present, the operating system addresses two virtual (logical) cores and shares the workload between them when possible.
If the two logical cores share the same execution unit, that means one of the threads will have to be put on hold while the other executes, that being said, I don't understand how hyper-threading can be useful, since you're not actually introducing a new execution unit. I can't wrap my head around this
See my answer on a softwareengineering.SE question for some details about how modern CPUs find and exploit instruction-level parallelism (ILP) by running multiple instructions at once. (Including a block diagram of Intel Haswell's pipeline, and links to more CPU microarchitecture details). Also Modern Microprocessors
A 90-Minute Guide!
You have a CPU with lots of execution units and a front-end that can keep them mostly supplied with work to do, but only under good conditions. Stalls like cache misses or branch mispredicts, or just limited parallelism (e.g. a loop that does one long chain of FP additions, bottlenecking on FP latency at one (scalar or SIMD) add per 4 or 5 clocks instead of one or two per clock) will result in throughput of much less than 4 instructions per cycle, and leave execution units idle.
The point of HT (and Simultaneous Multithreading (SMT) in general) is to keep those hungry execution units fed with work to do, even when running code with low ILP or lots of stalls (cache misses / branch mispredicts).
SMT only adds a bit of extra logic to the pipeline so it can keep track of two separate architectural contexts at the same time. So it costs a lot less die area and power than having twice or 4x as many full cores. (Knight's Landing Xeon Phi runs 4 threads per core, mainstream Intel CPUs run 2. Some non-x86 chips run 8 threads per core, aimed at database-server type workloads.) But of course having to divide out-of-order execution resources between logical threads often means the throughput gain is significantly below 2x or 4x, often far below, and for some workloads is negative.
Also related What is the difference between Hyperthreading and Multithreading? Does AMD Zen use either? - AMD's SMT is basically the same as Intel's, just not using the trademark "Hyperthreading" for it. See also other links in my answer there, like https://www.realworldtech.com/nehalem/3/ and especially https://www.realworldtech.com/alpha-ev8-smt/ for an intro with diagrams to what SMT is all about. (Many members of the Alpha EV8 design team was hired by Intel after DEC folded, and went on to implement SMT in Netburst (Pentium 4) which Intel branded Hyperthreading.)
Common misconceptions
Hyperthreading is not just optimized context switching. Simpler designs that switch to the other thread on a cache miss are possible, but HT is more advanced than that. (Switch-on-stall, or round-robin "barrel processor").
With two threads active, the front-end alternates between threads every cycle (in the fetch, decode, and issue/rename stages), but the out-of-order back-end can actually execute uops from both logical cores in the same cycle. The issue/rename stage is 4 uops wide on Intel before Ice Lake.
In pipeline stages that normally alternate, any time one thread is stalled, the other thread gets all the cycles in that stage. HT is much better than just fixed alternating, because one thread can get lots of work done while the other is recovering from a branch mispredict or waiting for a cache miss.
Note that up to 10 or 12 cache misses can be outstanding at once (from L1D cache in Intel CPUs: this is the number of LFB (Line Fill Buffers), and memory requests are pipelined. But if the address for the next load depends on an earlier load (e.g. pointer chasing through a tree or linked list), the CPU doesn't know where to load from and can't keep multiple requests in flight. So it is actually useful for both threads to be waiting on cache misses in parallel.
Some resources are statically partitioned when two threads are active, some are competitively shared. See this pdf of slides for some details. (For more details about how to actually optimize asm for Intel and AMD CPUs, see Agner Fog's microarchitecture PDF.)
When one logical core "sleeps" (i.e. the kernel runs a HLT instruction or whatever MWAIT to enter a deeper sleep), the physical core transitions to single-thread mode and lets the still-active logical core have all the resources (including the full ReOrder Buffer size, and other statically-partitioned resources), so it's ability to find and exploit ILP in the single thread still running increases more than when the other thread is simply stalled on a cache miss.
BTW, some workloads actually run slower with HT. If your working set barely fits in L2 or L1D cache, then running two on the same core will lead to a lot more cache misses. For very well-tuned high-throughput code that can already keep the execution units saturated
(like an optimized matrix multiply in high-performance computing), it can make sense to disable HT. Always benchmark.
On Skylake, I've found that video encoding (with x265 -preset slower, 1080p) is about 15% faster with 8 threads instead of 4, on my quad-core i7-6700k. I didn't actually disable HT for the 4-thread test, but Linux's scheduler is good at not bouncing threads around and running threads on separate physical cores when there are enough to go around. A 15% speedup is pretty good considering that x265 has a lot of hand-written asm and runs very high instructions-per-cycle even when it has a whole core to itself. (Slower presets like I used tend to be more CPU-bound than memory-bound.)

Multi-threading in SWI-Prolog provides little improvement

I have a code that carries heavy computations: executing pred(In, Out) let's say 64K times where each execution takes 1-10 seconds.
I want to use multi-threaded (64) machine for speeding up the process.
I use concurrent_maplist for this:
concurrent_maplist(pred, List_of_64K_In, List_of_64K_Out).
I get approx 8 times speed up but not more than that.
I thought the reason was the following notice on concurrent_maplist:
Note that the the overhead of this predicate is considerable and
therefore Goal must be fairly expensive before one reaches a speedup.
To make a goal fairly expensive, I modified the code as:
% putting 1K pred/2 in heavy_pred/2
concurrent_maplist(heavy_pred, List_of_64_List_of_1k_In, List_of_64_List_of_1k_Out).
heavy_pred(List_of_In, List_of_Out) :-
maplist(pred, List_of_In, List_of_Out).
Surprisingly (for me), I do not get further speed up with this change.
I wonder how to get further speed up with multi-threading?
Some additional details:
Architecture: x86_64, AMD, 14.04.1-Ubuntu.
swipl -v: SWI-Prolog version 6.6.4 for amd64.
pred/2 is a theorem prover which takes formulas and tries to prove them.
It uses standard predicates with few non-standard ones: cyclic_term/1, write/1, copy_term/2, etc.
To make your cores to work, you can use threads that can be fired up by your application. Using the very latest swipl version may help you get the latest improvements it offers.
You can also fire up several applications, each running e.g. 8 threads. Result: more cores would get work, albeit the total will take up more memory than running a single application with more threads. Now, you will need to manage the different application instances so that the overall job progresses (and don't repeat work done on other apps). I imagine that this is needed even in the case of a single application with more threads. My answer is not very technically steeped, but would get you to use more cores for a cpu intensive job.
Multi-core CPUs don't scale linearly when the results are merged. I checked whether there is a difference between concurrent_maplist/2 and the new concurrent_and/2. I am using an example that has fairly large work items from here.
I added testing for concurrent_maplist/2:
/* parallel, concurrent_maplist */
count3(N) :-
findall(Y, between(1, 1000, Y), L),
concurrent_maplist(slice, L, R),
aggregate_all(sum(M), member(M, R), N).
Interestingly there is not much difference between concurrent_and/2 and concurrent_maplist/2. Possibly because the number of work items is only 1000, due to the fairly larger work item itself. Nevertheless 1'000'000 numbers were check for primality:
?- current_prolog_flag(cpu_count, X).
X = 8.
/* sequential */
?- time(count(N)).
% 138,647,812 inferences, 9.781 CPU in 9.792 seconds (100% CPU, 14174856 Lips)
N = 78499.
/* parallel, concurrent_and */
?- time(count2(N)).
% 4,450 inferences, 0.000 CPU in 2.458 seconds (0% CPU, Infinite Lips)
N = 78499.
/* parallel, concurrent_maplist */
?- time(count3(N)).
% 23,183 inferences, 0.000 CPU in 2.423 seconds (0% CPU, Infinite Lips)
N = 78499.
Although the machine had 8 logical cores, thanks to hyper threading. The speed-up is close to the number of physical cores, which is only 4. So in case of your machine with 64 logical cores, that is what cpu_count Prolog flag returns, you should check how many physical cores there are.

Where is the point at which adding additional cores or CPUs doesn’t improve the performance at all?

*Adding a second core or CPU might increase the performance of your parallel program, but it is unlikely to double it. Likewise, a
four-core machine is not going to execute your parallel program four
times as quickly— in part because of the overhead and coordination
described in the previous sections. However, the design of the
computer hardware also limits its ability to scale. You can expect a
significant improvement in performance, but it won’t be 100 percent
per additional core, and there will almost certainly be a point at
which adding additional cores or CPUs doesn’t improve the performance
at all.
*
I read the paragraph above from a book. But I don't get the last sentence.
So, Where is the point at which adding additional cores or CPUs doesn’t improve the performance at all?
If you take a serial program and a parallel version of the same program then the parallel program has to do some operations that the serial program does not, specifically operations concerned with coordinating the operations of the multiple processors. These contribute to what is often called 'parallel overhead' -- additional work that a parallel program has to do. This is one of the factors that makes it difficult to get 2x speed-up on 2 processors, 4x on 4 or 32000x on 32000 processors.
If you examine the code of a parallel program you will often find segments which are serial, that is which only use one processor while the others are idle. There are some (fragments of) algorithms which are not parallelisable, and there are some operations which are often not parallelised but which could be: I/O operations for instance, to parallelise these you need some sort of parallel I/O system. This 'serial fraction' provides an irreducible minimum time for your computation. Amdahl's Law explains this, and that article provides a useful starting point for your further reading.
Even when you do have a program which is well parallelised the scaling (ie the way speed-up changes as the number of processors increases) does not equal 1. For most parallel programs the size of the parallel overhead (or the amount of processor time which is devoted to operations which are only necessary for parallel computing) increases as some function of the number of processors. This often means that adding processors adds parallel overhead and at some point in the scaling of your program and jobs the increase in overhead cancels out (or even reverses) the increase in processor power. The article on Amdahl's Law also covers Gustafson's Law which is relevant here.
I've phrased this all in very general terms, no consideration of current processor and computer architectures; what I am describing are features of parallel computation (as currently understood) not of any particular program or computer.
I flat out disagree with #Daniel Pittman's assertion that these issues are of only theoretical concern. Some of us are working very hard to make our programs scale to very large numbers of processors (1000s). And almost all desktop and office development these days, and most mobile development too, targets multi-processor systems and using all those cores is a major concern.
Finally, to answer your question, at what point does adding processors no longer increase execution speed, now that is an architecture- and program-dependent question. Happily, it is one that is amenable to empirical investigation. Figuring out the scalability of parallel programs, and identifying ways of improving it, are a growing niche within the software engineering 'profession'.
#High Performance Mark is right. This happens when you are trying to solve a fixed size problem in the fastest possible way, so that Amdahl' law applies. It does not (usually) happen when you are trying to solve in a fixed time a problem. In the former case, you are willing to use the same amount of time to solve a problem
whose size is bigger;
whose size is exactly the same as before, but with a greeter accuracy.
In this situation, Gustafson's law applies.
So, let's go back to fixed size problems.
In the speedup formula you can distinguish these components:
Inherently sequential computations: σ(n)
Potentially parallel computations: ϕ(n)
Overhead (Communication operations etc): κ(n,p)
and the speedup for p processors for a problem size n is
Adding processors reduces the computation time but increases the communication time (for message-passing algorithms; it increases the synchronization overhead etcfor shared-memory algorithm); if we continue adding more processors, at some point the communication time increase will be larger than the corresponding computation time decrease.
When this happens, the parallel execution time begins to increase.
Speedup is inversely proportional to execution time, so that its curve begins to decline.
For any fixed problem size, there is an optimum number of processors that minimizes the overall parallel execution time.
Here is how you can compute exactly (analytical solution in closed form) the point at which you get no benefit by adding additional processors (or cores if you prefer).
The answer is, of course, "it depends", but in the current world of shared memory multi-processors the short version is "when traffic coordinating shared memory or other resources consumes all available bus bandwidth and/or CPU time".
That is a very theoretical problem, though. Almost nothing scales well enough to keep taking advantage of more cores at small numbers. Few applications benefit from 4, less from 8, and almost none from 64 cores today - well below any theoretical limitations on performance.
If we're talking x86 that architecture is more or less at its limits. # 3 GHz electricity travels 10 cm (actually somewhat less) per Hz, the die is about 1 cm square, components have to be able to switch states in that single Hz (1/3000000000 of a second). The current manufacturing process (22nm) gives interconnections that are 88 (silicon) atoms wide (I may have misunderstood this). With this in mind you realize that there isn't that much more that can be done with physics here (how narrow can an interconnection be? 10 atoms? 20?). At the other end the manufacturer, to be able to market a device as "higher performing" than its predecessor, adds a core which theoretically doubles the processing power.
"Theoretically" is not actually completely true. Some specially written applications will subdivide a large problem into parts that are small enough to be contained inside a single core and its exclusive caches (L1 & L2). A part is given to the core and it processes for a significant amount of time without accessing the L3 cache or RAM (which it shares with other cores and therefore will be where collisions/bottlenecks will occur). Upon completion it writes its results to RAM and receives a new part of the problem to work on.
If a core spends 99% of its time doing internal processing and 1% reading from and writing to shared memory (L3 cache and RAM) you could have an additional 99 cores doing the same thing because, in the end, the limiting factor will be the number of accesses the shared memory is capable of. Given my example of 99:1 such an application could make efficient use of 100 cores.
With more common programs - office, ie, etc - the extra processing power available will hardly be noticed. Some parts of the programs may have smaller parts written to take advantage of multiple cores and if you know which ones you may notice that those parts of the programs are much faster.
The 3 GHz was used as an example because it works well with the speed of light which is 300000000 meters/sec. I read recently that AMD's latest architecture was able to execute at 5 GHz but this was with special coolers and, even then, it was slower (processed less) than an intel i7 running at a significantly slower frequency.
It heavily depends on your program architecture/design. Adding cores improves parallel processing. If your program is not doing anything in parallel but only sequentially, adding cores would not improve its performance at all. It might improve other things though like framework internal processing (if you're using a framework).
So the more parallel processing is allowed in your program the better it scales with more cores. But if your program has limits on parallel processing (by design or nature of data) it will not scale indefinitely. It takes a lot of effort to make program run on hundreds of cores mainly because of growing overhead, resource locking and required data coordination. The most powerful supercomputers are indeed massively multi-core but writing programs that can utilize them is a significant effort and they can only show their power in an inherently parallel tasks.

Resources