Parallel Speedup and Efficiency - multithreading

Just have a quick question.
What is the difference between parallel speedup and efficiency?
Thanks

If you have two workers, in a naive world you should be able to finish a job in half the time. This is a 2x speedup. If the two workers do not interact at all and are working independently, a 2x speedup is theoretically possible. This class of problems is called embarrassingly parallel, and such problems aren't very common.
In this case both workers can keep working just as quickly as the original single worker could, i.e., efficiency is at 100%.
Amdahl's Law: In real-world computing, though, there are always some shared resources, and consequently some contention between the two workers, which means both workers will likely run a bit slower than the original single worker.
Efficiency now becomes a measure of the drop in speed for each worker. Say they're running at 0.9 times the original single worker's speed; the efficiency is now 90%.
The drop in efficiency also means that, in the original amount of time, either worker has only completed 90% of its share of the job. So the actual speedup drops from 2x to 1.8x.

Just adding technical definitions here. Let T_1 be the time required by your application to complete on 1 processor (sequential time) and T_p the time required by your application when executed on p cores. Then, the speedup S is defined as
S = T_1 / T_p
The speedup measures the acceleration you obtain when using p cores.
The corresponding efficiency E is defined as
E = T_1 / (p T_p)
The efficiency measures how well you are utilizing the p cores in parallel.
The maximum theoretical speedup on p cores is p, but you will attain this limit only for embarrassingly parallel apps (no communication and no other overheads). Correspondingly, the maximum efficiency is 1 (or 100%). In practice, a common rule of thumb is that an app should aim to achieve at least an efficiency of 0.7 (or 70%).
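For a concrete (toy) illustration of these definitions, the sketch below times the same compute-bound summation once on one thread and once on p threads, then prints S and E. It assumes a C++ environment; the kernel, the workload size n, and the use of hardware_concurrency() for p are arbitrary choices for the example, not anything prescribed above.

    #include <algorithm>
    #include <chrono>
    #include <cmath>
    #include <iostream>
    #include <thread>
    #include <vector>

    // Compute-bound toy kernel: sum of square roots over [begin, end).
    static double kernel(std::size_t begin, std::size_t end) {
        double s = 0.0;
        for (std::size_t i = begin; i < end; ++i) s += std::sqrt(static_cast<double>(i));
        return s;
    }

    int main() {
        const std::size_t n = 200'000'000;                                 // arbitrary workload size
        const unsigned p = std::max(1u, std::thread::hardware_concurrency());

        auto t0 = std::chrono::steady_clock::now();
        const double serial = kernel(0, n);                                // T_1: one worker does everything
        auto t1 = std::chrono::steady_clock::now();

        std::vector<double> parts(p, 0.0);
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < p; ++i)                                   // T_p: p workers, one chunk each
            workers.emplace_back([&parts, i, n, p] {
                parts[i] = kernel(i * n / p, (i + 1) * n / p);
            });
        for (auto& w : workers) w.join();
        auto t2 = std::chrono::steady_clock::now();

        const double T1 = std::chrono::duration<double>(t1 - t0).count();
        const double Tp = std::chrono::duration<double>(t2 - t1).count();
        std::cout << "check: " << serial << "\n"
                  << "S = " << T1 / Tp << "x, E = " << 100.0 * T1 / (p * Tp) << "%\n";
    }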

One lady can have a baby in 9 months.
If you put 9 ladies on the case, you still can't get a baby in 1 month. That's a speedup of 1x (no speedup at all) and about 11% efficiency.
If you put 9 ladies on the case, you can get 9 babies in 9 months. That is 9x speedup and 100% efficiency.

Related

Is parallelism ever useful for an application that is always using all available CPU cores?

Say I have a large list of integers and I need to iterate through them and double every single one. These are stateless/independent operations so I could easily split the workload across multiple threads/cores.
But what if I need to do this on a high traffic website?
Each request to the server spins up a new thread so all my server cores will always be busy processing incoming requests. In this case, since there are no cores without work, parallel processing would have no positive effect on performance, right?
Or am I missing/misunderstanding something here?
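For concreteness, the independent-doubling workload described in the question could look roughly like the sketch below (C++17 parallel algorithms; the element count is an arbitrary placeholder, and with libstdc++ the parallel policy typically requires TBB to be installed):

    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
        std::vector<int> values(50'000'000, 21);   // stand-in for the "large list of integers"

        // Each element is independent, so the doubling can be spread across cores.
        // The library decides how many worker threads to use; on a machine already
        // saturated with request-handling threads, those workers just compete for
        // the same cores instead of adding throughput.
        std::for_each(std::execution::par_unseq, values.begin(), values.end(),
                      [](int& x) { x *= 2; });
    }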
Q: "... parallel processing would have no positive effect on performance, right?"
A: Given the context defined above, that deduction is right.
Details matter, yet some general principles hold as a rule of thumb. No matter how high or low the "surrounding" (background) workload is, the concurrently scheduled threads are served from the O/S queue of waiting threads in a round-robin-like fashion (ignoring for the moment the slight imbalance that comes from relative priority re-ordering) and mapped onto the available CPU cores. On top of that, active threads may be forced to hop from one CPU core to another for thermal-throttling reasons, which depletes the cache-line hit rate and turns in-cache computing into RAM I/O that is roughly three orders of magnitude slower. These are just a few of the performance-degrading phenomena to account for when considering real-world limits on maximum performance; they reduce the net effect of idealised scheduling, and every additional stream of execution adds its own handling overheads on top, more than you would pay without it.
A simple, interactive GUI tool will say more than any amount of text on this subject. Just tweak the overhead fraction and first test a fully (100%) parallel problem fraction (which will never happen in the real world; even loading the process itself into RAM is a purely serial part of the overall problem under review, isn't it?). Once you have seen the "costs" of the accumulated overheads, move the p-fraction (parallel / (parallel + serial)) from the full 100% down to about 0.999, 0.998, 0.997 ... to see the impact of a smaller p-fraction on the speedup.
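The kind of number such a tool produces can be reproduced with a tiny, overhead-extended Amdahl estimate. This is a simplified sketch; the parameter names and the per-worker overhead model are assumptions for illustration, not a formula taken from the answer above:

    #include <iostream>

    // Overhead-aware Amdahl estimate:
    // p - parallelizable fraction of the original work (0..1)
    // n - number of workers
    // o - add-on overhead per worker (setup, scheduling, data movement),
    //     expressed as a fraction of the original single-worker run time.
    static double speedup(double p, int n, double o) {
        return 1.0 / ((1.0 - p) + p / n + n * o);
    }

    int main() {
        for (double p : {1.0, 0.999, 0.998, 0.997})
            std::cout << "p = " << p << "  S(16) = " << speedup(p, 16, 0.001) << "x\n";
    }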

Multiprocessing: why doesn't a single thread just use more cpu?

I'm learning about multiprocessing and it seems to be applicable in one of two scenarios:
our program is waiting for some I/O, so it makes sense to go do something else while waiting;
we break our program up so that individual parts of it can run "in parallel", in an attempt to take full advantage of the CPU
My confusion is about the second case. I'm probably just lacking in my understanding of how CPUs really work: but if our single-threaded process is only using 1% of the CPU and it therefore makes sense to get more threads going, then why wouldn't we just (somehow?) speed up that single process so that it uses more CPU and finishes faster?
but if our single-threaded process is only using 1% of the CPU and it therefore makes sense to get more threads going, then why wouldn't we just (somehow?) speed up that single process so that it uses more CPU and finishes faster?
We don't know how to. There seem to be fundamental limitations to how fast we can do things that we haven't quite figured out how to get around. So instead, we do more than one thing at a time.
It takes a woman 9 months to make a baby. So if you want lots of babies, you get lots of women. You don't try to get one woman to go faster.
Say you want to raise 7 to the twenty-millionth power and also raise 11 to the twenty-millionth power. The number of steps in each of these two operations can be reduced, but you will reach a limit. Say each operation takes N sequential steps (each requiring the output of the previous step as its input) and the fastest we can do a single step is Q nanoseconds. With one thread, it will take at least 2NQ nanoseconds to perform both operations. With two threads, we can do one step from each of the two operations at the same time, reducing the minimum time to NQ nanoseconds.
That's a big win.
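A minimal sketch of that two-chain example, assuming C++ and std::thread; the modular arithmetic is just a stand-in to keep the numbers small while preserving the "each step needs the previous step" structure:

    #include <cstdint>
    #include <iostream>
    #include <thread>

    // Toy stand-in for "raise b to the twenty-millionth power": repeated modular
    // multiplication, so each step needs the previous step's result (N sequential steps).
    static std::uint64_t power_chain(std::uint64_t base, std::uint64_t steps) {
        std::uint64_t r = 1;
        const std::uint64_t m = 1'000'000'007ULL;   // arbitrary modulus to keep numbers small
        for (std::uint64_t i = 0; i < steps; ++i) r = (r * base) % m;
        return r;
    }

    int main() {
        const std::uint64_t N = 20'000'000;
        std::uint64_t r7 = 0, r11 = 0;

        // Neither chain can be sped up internally (each step waits on the previous one),
        // but the two chains are independent, so they can run at the same time.
        std::thread t7 ([&] { r7  = power_chain(7,  N); });
        std::thread t11([&] { r11 = power_chain(11, N); });
        t7.join();
        t11.join();

        std::cout << r7 << " " << r11 << "\n";
    }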
I might be wrong, but when we split things into threads, we want to make use of the multi-core architecture of our CPUs.
We mostly think of the CPU as a single unit, but you must have heard how an i5 is a quad-core processor, meaning it has 4 cores, while an i3 is a dual-core processor, i.e., it only has two cores.
So 100% aggregate CPU utilization on a quad-core splits into 4 x 25%. There's a difference between concurrency and parallelism. Parallel means each thread runs on a separate core, making full use of it. Now you have 4 people doing one job; a better analogy: there are 4 printers in the office, and 4 people can go ahead and get the copies they want. This is parallelism.
Using that same analogy, let's extend it to just one copier/printer with 4 people wanting copies. What we do is use concurrency: we print only 25% of each requested copy, then switch to the next person, then the next, and then the next, so it takes 4 rounds for all the copies to get printed. Even though we utilized 100% of the copier's capability, our people still had to wait. That waiting time also depends on the length of the document they wanted to print, so we use something like pre-emption: you can only execute/print for a certain amount of time before we start printing for the next person.
Speeding up a single process by allocating it 100% of a core is not a problem (although we also want to run a bunch of other stuff like the GUI, music, system services, etc., so 85% is more realistic), but the execution time becomes roughly a quarter when the work is distributed between 4 cores. Imagine you have to print a 400-page book and have 4 copiers: you use the 4 copiers to print 100 pages each. That will be faster, right?
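To put the book-and-copiers split into code, here is a minimal sketch using std::async; print_pages is a hypothetical stand-in for driving one copier:

    #include <future>
    #include <iostream>
    #include <vector>

    // Hypothetical stand-in for "print pages [first, last) on one copier".
    static int print_pages(int first, int last) {
        // ... drive one copier here ...
        return last - first;                    // pages printed by this copier
    }

    int main() {
        const int pages = 400, copiers = 4, chunk = pages / copiers;

        // One future per copier: each works on its own 100-page slice in parallel.
        std::vector<std::future<int>> jobs;
        for (int c = 0; c < copiers; ++c)
            jobs.push_back(std::async(std::launch::async, print_pages,
                                      c * chunk, (c + 1) * chunk));

        int total = 0;
        for (auto& j : jobs) total += j.get();
        std::cout << "printed " << total << " pages\n";
    }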
I hope I made some sense. Going to sleep.

multi-threading theoretical scenario

I've a multi-threaded application which uses a threadpool of 10 threads. Each thread takes 5 minutes to process input. Is there a law/formula which governs the total time taken to process n inputs?
In other words, is it right to say that every 5 minutes, 10 inputs can be processed, so to process 100 inputs, it will take 50 minutes?
In addition to the computing power (processors/cores) and hardware resource contention (hard disk, I/O competition, etc.), data dependencies should also be considered. For example, if processing each input includes updating data shared by all the other threads, which requires locking (a mutex), then the total throughput will be less than 10 times that of a single thread, even on a multi-core processor with more than 10 cores. The maximum speedup depends on the proportion of time spent in the critical section. If you need a formula, refer to the famous Amdahl's law: en.wikipedia.org/wiki/Amdahl's_law
Not really; you have to consider the total computing power required. If, for example, a thread takes 5 minutes to do the work and the processor is completely consumed during that time, then additional threads will not help you. At the other extreme, if the processor utilization is near zero (all of the time is spent waiting for I/O, for example), then your proposed calculation would work. So you have to consider the actual resources being used by the computation.
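For the idealised case only (enough cores, no I/O contention, no shared locks), the proposed calculation can be written down directly; anything below this number in practice is explained by the effects described in the two answers above:

    #include <iostream>

    // Idealised estimate: assumes the pool really runs pool_size inputs at full
    // speed in parallel. Real throughput will generally be lower.
    static int total_minutes(int inputs, int pool_size, int minutes_per_input) {
        const int batches = (inputs + pool_size - 1) / pool_size;   // ceil(inputs / pool_size)
        return batches * minutes_per_input;
    }

    int main() {
        std::cout << total_minutes(100, 10, 5) << " minutes\n";     // prints 50
    }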

How to divide load between different processors

I am running some parallel code on a machine which has 4 Intel processors and 8 cores on each. I am using TBB. Suppose a given loop (that I parallelize) has X iterations; how should I choose my grain size to ensure the load is evenly divided?
Assume you have N equally powerful CPUs.
If there are no loop-carried dependencies (e.g., nothing in iteration i is used by later iterations), then you can simply run iterations 0..X/N on CPU 1, iterations (X/N)+1..(2*X/N) on CPU 2, and so on, assuming that each iteration takes exactly the same amount of time, or at least an average amount of time that doesn't vary wildly.
If there are loop-carried dependencies, you may have a problem if iteration i depends on all previous iterations. If it only depends on the previous k iterations, you can have CPU 1 do iterations 0..X/N, CPU 2 do iterations (X/N)-k..(2*X/N), wasting some work but allowing CPU 2 to collect the results it needs, and so on for all processors.
If iterations take wildly varying amounts of time, you're better off setting up a worklist containing the iterations and having the CPUs grab iterations from the worklist as they complete previous ones. This way the work is divided up as demand appears. You have to be sure that the time per unit of work grabbed is much larger than the effort to get the work, or you'll get no parallel advantage; one way to do this is to grab a small range of iterations from the worklist, such that the total work in the range significantly exceeds the scheduling overhead.
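A minimal sketch of such a worklist, assuming C++ and an atomic counter as the "grab the next range" mechanism; the total iteration count and chunk size are illustrative values:

    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <thread>
    #include <vector>

    int main() {
        const std::size_t X = 1'000'000;            // total loop iterations
        const std::size_t chunk = 4'096;            // illustrative grain size
        std::atomic<std::size_t> next{0};

        // Each worker atomically claims the next chunk of iterations, so fast
        // workers simply claim more chunks than slow ones. The chunk must be big
        // enough that the work dwarfs the cost of the claim.
        auto worker = [&] {
            for (;;) {
                const std::size_t begin = next.fetch_add(chunk);   // claim a range
                if (begin >= X) break;
                const std::size_t end = std::min(begin + chunk, X);
                for (std::size_t i = begin; i < end; ++i) {
                    // ... body of iteration i ...
                }
            }
        };

        std::vector<std::thread> pool;
        for (unsigned t = 0; t < std::thread::hardware_concurrency(); ++t)
            pool.emplace_back(worker);
        for (auto& t : pool) t.join();
    }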
With TBB, you don't have to select a grain size for parallel_for. In most cases, TBB will dynamically load balance the work pretty well by default. The answer of Ira Baxter correctly describes how you should partition the work across a pool of threads; but TBB already has similar mechanisms in place that do this for you.
ADDED: Of course, manual work partitioning might get better results in complex cases. In that case, though, one would likely need to use TBB tasks, as parallel_for might not provide enough control; for example, in general it is not possible to specify the exact size of a per-thread chunk.
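For reference, a grain-size-free tbb::parallel_for call looks like the sketch below; the doubling loop body is just a placeholder workload:

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <vector>

    int main() {
        std::vector<double> data(1'000'000, 1.0);

        // No explicit grain size: the default auto_partitioner splits the range
        // and balances the chunks across the worker threads on its own.
        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, data.size()),
                          [&](const tbb::blocked_range<std::size_t>& r) {
                              for (std::size_t i = r.begin(); i != r.end(); ++i)
                                  data[i] *= 2.0;
                          });
    }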

Algorithm to optimize # threads used in a calculation

I'm performing an operation, let's call it CalculateSomeData. CalculateSomeData operates in successive "generations", numbered 1..x. The number of generations in the entire run is fixed by the input parameters to CalculateSomeData and is known a priori. A single generation takes anywhere from 30 minutes to 2 hours to complete. Some of that variability is due to the input parameters and cannot be controlled. However, a portion of that variability is due to things like hardware capacities, CPU load from other processes, network bandwidth load, etc. One parameter that can be controlled per-generation is the number of threads that CalculateSomeData uses. Right now that's fixed and likely non-optimal. I'd like to track the time each generation takes and then have some algorithm by which I tweak the number of threads so that each successive generation improves upon the prior generation's calculation time (minimizing time). What approach should I use? How applicable are genetic algorithms? Intuition tells me that the range is going to be fairly tight - maybe 1 to 16 threads on a dual quad-core processor machine.
any pointers, pseudocode, etc. are much appreciated.
How about an evolutionary algorithm?
Start with a guess. 1 thread per CPU core seems good, but depends on the task at hand.
Measure the average time for each task in the generation. Compare it to the time taken by the previous generation. (Assume effectively infinite time and 0 threads for generation 0).
If the most recent generation's tasks averaged a better time than the one before, continue to change the number of threads in the same direction as you did last step (so if the last generation had more threads than the one before it, add a thread for the new generation, but if it had fewer, use one fewer, obviously with a lower limit of 1 thread).
If the most recent generation tasks took longer, on average, than the previous generation, then change the number of threads in the opposite direction (so if increasing the number of threads resulted in worse time, use one fewer thread next time).
As long as the optimal number of threads isn't too close to 1, then you'll probably end up oscillating between 3 values that are all reasonably close to optimal. You may want to explicitly detect this case and lock yourself into the central value, if you have a large number of generations to deal with.
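A minimal sketch of that adjust-in-the-same-direction rule, assuming C++; the struct and member names are illustrative, not from any particular framework:

    #include <algorithm>

    // Keeps changing the thread count in the same direction while generation
    // times improve, and reverses direction when they get worse.
    struct ThreadTuner {
        double last_time   = 1e300;   // "effectively infinite" time for generation 0
        int    direction   = +1;      // +1 = add a thread next, -1 = remove one
        int    max_threads = 16;

        int next(int current_threads, double current_time) {
            if (current_time > last_time)
                direction = -direction;           // got worse: reverse course
            last_time = current_time;
            return std::clamp(current_threads + direction, 1, max_threads);
        }
    };

After each generation you would call next() with the thread count you just used and the measured time, and use the return value for the following generation.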
If the calculations are completely CPU bound the number of threads should be equal to the number of cores on the machine. That way you minimize the number of context switches.
If your calculations involve I/O, network, synchronization or something else that blocks execution you must find the limiting resource and measure the utilization. You need to monitor the utilization and slowly add more threads until the utilization gets close to 100%. You should have as few threads as possible to saturate your limiting resource.
You should divide up your generations into lots of small tasks and put them in a queue. Spawn one thread per core and have each thread grab a task to do, run it to completion, and repeat.
You want lots more tasks than cores to make sure that you don't end up with just one task running at the end of the generation and all other threads idle. This is what is likely to happen if you set #tasks = #threads = #cores as Albin suggests (unless you can ensure that all tasks take precisely the same amount of time).
You also probably don't want more threads than cores. Context switching isn't terribly expensive, but the larger cache footprint that comes with having more than #cores tasks simultaneously active could hurt you (unless your tasks use very little memory).
