Load balancing by matching process peaks and minima - multithreading

Consider a machine running two independent multi threaded process which have a periodic load and an amplitude greater then 50% of the total CPU power. Say a load like 0.6 sin(t)^2 (t is time) for both process. In general of course there could be fluctuations and differing periods.
If the processes are locked in phase with a phase difference of pi/2 the computation is not slowed down. However for certain phase differences the CPU would throttle at maximum load and slow down the computation time.
Are there mechanisms to avoid these overlaps? On which level are they implemented?

Related

Concurrent threads, processes and multiple cores

I'm trying to understand the usage of CPU cores with regard to concurrent threads and processes. Please see the below questions:
Assume I have 2 CPU cores. When there are 2 processes running, each process has only 1 thread. Are the two processes using the 2 cores?
Assume I have 2 CPU cores. When there is 1 process running, which has 2 threads. Are the two threads using the 2 cores?
Assume I have 2 CPU cores. When there are 2 processes running, each process has 2 threads. How are the two cores used by those processes and cores?
How to calculate the maximum real concurrent execution given CPU cores? What are other factor I should take into account?
1,2: Quite likely but not definitely. A portion of the system software determines what runs where. It would be unlikely to choose to keep a process or thread waiting for cpu attention when there is one that is otherwise idle, it isn't absolute.
Most processing involves some sort of transfer to and from a device, network, etc.. Typically this necessitates a period of inactivity waiting for the transfer to complete. During this inactivity, another process / thread can run on that cpu. So, if a given process is 30% cpu time and 70% I/O time, then I can run about 3 of them concurrently on a single cpu without degrading performance.
3,4: Like the paragraph above implies, depending upon the workload, their could be any distribution of the threads among the cpus. If the threads were all compute bound (100% cpu), most operating systems switch between them at a small enough granularity that all remain lively, and large enough that the switching has a minimal impact on them.
This scheduling may take other notions into consideration, such as data affinity. Recently touched bits of data are likely to remain in the cpu cache when a thread has relinquished it. The next time the thread is to be scheduled, it would be best to put it onto that cpu, to retain the effort required to warm the cache for it. It might also think that two threads of one process (address space) are more likely to share data, so should prefer the same cpu.
4: depending upon your system, there are likely to be many performance analysis tools available. Top, on UNIX-inspired systems is a simple tool which gives system wide utilization information, and the simple tool time will show how much time a process spent on a cpu vs real-world time. If you run each of your tasks sequentially, noting the cpu-time that they take, then time them running concurrently, the ratio between these cpu-times indicates the scaling factor of your concurrent app. Note that real-world time can be misleading because of io-overlap.

Context switch: what happens in a worst case scenario?

I want to understand how a certain worst case scenario of context switch happens. Say I have 10 CPU cores running a single process. Everything is CPU intensive, no thread is sleeping (waiting for I/O).
(I am mainly concerned with mainstream modern personal computer architectures and systems, typically x64 with Windows, Linux...)
Correct me if I'm wrong: running 10 CPU/RAM intensive independent threads is most often a near optimal situation. The amount of time spent in context switch is rather negligible. While the system may sometimes decide to re-attribute threads to different cores in a round-robin fashion causing a reset of RAM caches, it has a minor effect and works almost as if each thread was running on a single fixed core.
Only the main RAM bus may be a limitation since all threads share it, but it's not the point I'm interested in here. Reducing the number of threads will not increase the throughput anyway.
Now assume you still have 10 cores but run 1000 threads. The scheduler could theoretically decide to switch rarely (say every second) running 10 threads for a second, then 10 others... and the whole thing would still be close to optimal performance (throughput).
But it does not seem to be the case and it looks like threads are switched intensively causing a strongly suboptimal performance (throughput). Am I right about it? What is the main cause for this suboptimal performance? A few numbers would be nice if you have any idea of orders of magnitude of (for example): switches per second, performance loss caused by switching...
I'm going to answer my own question (after some search).
On windows, the number of context switches can be measured with performance counters: https://technet.microsoft.com/en-us/library/cc938606.aspx
I measured it on my machine (core i7/Windows 10) and the order of magnitude is around 1000/s by core when the number of running threads is more than the number of cores (and these threads are full CPU).
The time needed for a context switch varies quite a bit depending on:
what registers need to be saved
if FPU registers need to be saved
the processor model (of course)
You can read: https://www.quora.com/How-long-does-a-context-switch-take or http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
A slightly pessimistic avg. order of magnitude seems to be 1000 ns. Thus the total time for all context switches on each core is 1ms per second, that is 0.1%.
This does not depend on the number of threads: if you run 100 or 1000 threads, the number of switches does not change. As a conclusion the time spent in context switching is somehow negligible.
This reasoning is correct as long as the threads are pure CPU with only small memory read/write like a few local variables. I ran a test with full CPU threads and the difference between a few and 1000 threads is not noticeable.
But the situation changes when RAM is involved and switches makes CPU (memory) cache less efficient. A worse case is when:
computation can be split into 1000 independent "data" parts
each part of the data fits just into the memory cache (say L1 or L2) of a core
each part needs to be read many times
In this situation, running 10 threads to completion, then ten others... would take full advantage of the cache, while running 1000 threads at a time would causes the cache to be useful only during 1ms.
But if the data of several threads could fit into the cache, or if the threads read common data to some degree, or if each thread reads the data just once, then it is possible that running 1000 threads vs. running 10 threads a hundred times will have similar throughput.
It is more a matter a adapting parallelism to memory access. And it depends very much on the way memory needs to be accessed.
The time spend in context switching is negligible, the time lost because of wrong usage of caches may sometimes be problem, sometimes not, depending on how the memory is accessed and shared.

Parallel Speedup and Efficiency

Just have a quick question.
What is the difference between a parallel speedup and efficiency.
Thanks
If you have two workers, in a naive world, you should be able to finish a job in half the time. This is a 2x speedup. If the two workers do not interact at all and are working independently, a 2x speedup is theoretically possible. This class of problems are called embarassingly-parallel, and aren't very common.
In this case both of the workers can continue working just as quickly as the original worker could i.e., efficiency is at 100%.
Amdahl's Law: In real-world computing though, there are always some shared resources, and consequently some contention between the two workers. Which mean both workers will likely run a bit slower than the original single worker.
Efficiency now becomes a measure of the drop in speed for each worker. Say they're running at 0.9 times the original single worker speed, the efficiency is now 90%
The drop in efficiency also means in the original amount of time, either worker has only completed 90% of their job. So the actual speedup drop from 2x to 1.8x
Just adding technical definitions here. Let T_1 be the time required by your application to complete on 1 processor (sequential time) and T_p the time required by your application when executed on p cores. Then, the speedup S is defined as
S = T_1 / T_p
The speedup measures the acceleration you obtain when using p cores,
The corresponding efficiency E is defined as
E = T_1 / (p T_p)
The efficiency measures how well you are utilizing the p cores in parallel.
The maximum theoretical speedup on p cores is p, but you will attain this limit only for embarrassingly parallel apps (no communication and no other overheads). Correspondingly, the maximum efficiency is 1 (or 100%). In practice, a common rule of thumb is that an app should aim to achieve at least an efficiency of 0.7 (or 70%).
One lady can have a baby in 9 months.
If you put 9 ladies on the case, you can't get a baby in 1 month. That's zero speedup and 11% efficiency.
If you put 9 ladies on the case, you can get 9 babies in 9 months. That is 9x speedup and 100% efficiency.

multi-threading theoretical scenario

I've a multi-threaded application which uses a threadpool of 10 threads. Each thread takes 5 minutes to process input. Is there a law/formula which governs the total time taken to process n inputs?
In other words, is it right to say that every 5 minutes, 10 inputs can be processed, so to process 100 inputs, it will take 50 minutes?
In addition to the computing power (processors/cores) and hardware resource dependencies (hard disk, I/O competition, etc.), the data dependency should also be considered. For example, if the processing of each input includes updating a shared data by all the other threads, which requires locking (mutex), then the total throughput will be less than 10 times, even if it is a multi-core processor with more than 10 cores. The maximum speed-up depends on the proportion of the critical section. If you need a formula, refer to the famous Amdahl's law: en.wikipedia.org/wiki/Amdahl's_law
Not really, you have to consider the total computing power required. If for example a thread takes 5 minutes to do the work, and the processor is completely consumed during that time, then additional threads will not help you. On the other extreme, if the processor utilization is near zero (all of the time is spent waiting for I/O for example), then your proposed calculation would work. So you have to consider the actual resources being used by the computation.

Algorithm to optimize # threads used in a calculation

I'm performing an operation, lets call it CalculateSomeData. CalculateSomeData operates in successive "generations", numbered 1..x. The number of generations in the entire run is fixed by the input parameters to CalculateSomeData and is known a priori. A single generation takes anywhere from 30 minutes to 2 hours to complete. Some of that variability is due to the input parameters and that cannot be controlled. However, a portion of that variability is due to things like hardware capacities, CPU load from other processes, network bandwidth load, etc. One parameter that can be controlled per-generation is the number of threads that CalculateSomeData uses. Right now that's fixed and likely non-optimal. I'd like to track the time each generation takes and then have some algorithm by which I tweak the number of threads so that each successive generation improves upon the prior generation's calculation time (minimizing time). What approach should I use? How applicable are genetic algorithms? Intuition tells me that the range is going to be fairly tight - maybe 1 to 16 threads on a dual quad-core processor machine.
any pointers, pseudocode, etc. are much appreciated.
How about an evolutionary algorithm.
Start with a guess. 1 thread per CPU core seems good, but depends on the task at hand.
Measure the average time for each task in the generation. Compare it to the time taken by the previous generation. (Assume effectively infinite time and 0 threads for generation 0).
If the most recent generation tasks averaged a better time than the one before, continue to change the number of threads in the same direction as you did last step (so if the last generation had more threads than the previous thread, then add a thread for the new generation, but if it had fewer, then use one fewer (obviously with a lower limit of 1 thread).
If the most recent generation tasks took longer, on average, than the previous generation, then change the number of threads in the opposite direction (so if increasing the number of threads resulted in worse time, use one fewer thread next time).
As long as the optimal number of threads isn't too close to 1, then you'll probably end up oscillating between 3 values that are all reasonably close to optimal. You may want to explicitly detect this case and lock yourself into the central value, if you have a large number of generations to deal with.
If the calculations are completely CPU bound the number of threads should be equal to the number of cores on the machine. That way you minimize the number of context switches.
If your calculations involve I/O, network, synchronization or something else that blocks execution you must find the limiting resource and measure the utilization. You need to monitor the utilization and slowly add more threads until the utilization gets close to 100%. You should have as few threads as possible to saturate your limiting resource.
You should divide up your generations into lots of small tasks and put them in a queue. Spawn one thread per core and have each thread grab a task to do, run it to completion, and repeat.
You want lots more tasks than cores to make sure that you don't end up with just one task running at the end of the generation and all other threads idle. This is what is likely to happen if you set #tasks = #threads = #cores as Albin suggests (unless you can ensure that all tasks take precisely the same amount of time).
You also probably don't want more threads than cores. Context switching isn't terribly expensive, but the larger cache footprint that comes with having more than #cores tasks simultaneously active could hurt you (unless your tasks use very little memory).

Resources