I program although I am not a computer scientist, so I would like to check whether I have correctly understood the challenge in splitting a workload. Is the following the correct way to think about it?
Specifically, is the following statement (1) correct?
(1) If A(X_a) + A(X_b) + A(X_c) + ... = B(X_a, X_b, X_c, ...) = Y is an equation being computed,
then whether it can be computed more quickly, from the computer's perspective, by assigning parts of the equation to individual threads running at the same time depends on the following:
if X_m changes when A(X_n) changes, for m not equal to n, then dividing the workload for that particular computation gives less of a performance gain; and if this is true for every combination of m and n in the system, then multithreading offers no performance gain over single threading.
Or, in other words: do I understand correctly that the presence of linked variables reduces the ability to multithread successfully? Because X_b and X_c depend on A(X_a), the process bottlenecks: the other threads know A but must wait for the first thread to produce an output before they have instructions to execute. So the parts of an instruction that is easily broken up cannot be worked on simultaneously, and the computation takes as much time on one thread doing each part of the calculation one after the other as it does on several threads working at once, with another thread summing the results on the fly in the order they complete.
(2) Or is there a way around this bottleneck? For example, if the bottleneck is known in advance, the first thread can start early and store in memory the results of A(X_n) for every n that bottlenecks the operation, and then split the workload efficiently, one A(X_i) to the i-th thread. But to do this, the first thread would have to predict in some way when the calculation B(X_a, X_b, X_c, ...) must be executed BEFORE B(X_a, X_b, X_c, ...) is actually executed; otherwise it would run into the bottleneck.
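For illustration, when the terms really are independent the split is straightforward. A hypothetical Python sketch (the function names are made up for the example, not part of the question):

```python
from concurrent.futures import ThreadPoolExecutor

def A(x):
    # Stand-in for an expensive computation that depends only on its own input.
    return x * x

def B_parallel(inputs):
    # Because each A(X_i) depends only on X_i, the terms can be computed
    # concurrently and summed in whatever order they finish.
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(A, inputs))

print(B_parallel([1, 2, 3, 4]))  # 30, same as computing the terms serially
```

If instead A(X_b) could not start until A(X_a) finished, the pool would degenerate into the sequential chain described above.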
[EDIT: To clarify, in the context of NWP's answer. If the clarification is too long / unclear, please leave a comment, and I'll make a few graphics in LaTeX to shorten the question writeup.]
Suppose the longest path in the program "compute I" is 5 units of time in the example. If you know this longest path, and the running system can anticipate (based on past frequency of execution) when this program "compute I" will be run in the future, subprogram "compute B->E" (which does not depend on anything else but is a proper subset of the longest path of program "compute I") may be executed in advance. The result is stored in memory prior to the user requesting "compute I".
If so, is the max speedup considered to be 9/4? The B->E result is ready, so other threads do not have to wait for it. Or is the max speedup for "compute I" still considered to be 9/5?
The anticipation program run beforehand has a cost, but this cost may be spread over each instance of execution of "compute I". If the anticipation program has 15 steps, but "compute I" is typically run 100 times per execution of the anticipation program, and all steps cost equally, do we simply say the max speedup possible in "compute I" is therefore 9/(5 - 1 + 15/100)?
The possible speedup now appears to depend not only on the number of threads and the longest path, but also on the memory available to store precalculations and on how far in advance another program can anticipate that "compute I" will be run and precalculate proper subprograms of it. Another program "compute X" may have a longest path of the same length as "compute I", yet the system may not be able to anticipate "compute X" as far in advance as "compute I". How do we weight the speedup achieved (i) at the expense of increased memory to store precalculations, and (ii) by the fact that the execution of some programs can be anticipated further in advance than that of others, allowing the bottleneck to be precalculated and the longest path cut down?
But if a longest path can be dynamically cut down by improving predictive precalculation of subprograms and by adding memory for storing precalculated results, can bottlenecks be considered at all as determining the ultimate upper bound on speedup from splitting a computational workload?
From the linked-variables dependency bottleneck perspective (the graph bottleneck perspective), the ultimate upper bound on speedup from multithreading a program "compute I" appears to be determined by its longest subprogram (other subprograms depend on it / wait for it). But from the dynamics perspective, where the whole system is running before and after "compute I" is executed as a part of it, sufficient predictability of the timing of future executions of "compute I", plus the ability to store more and more precalculations of its independent subprograms, can cut the length of all subprograms of "compute I" down to 1 unit, meaning it can possibly achieve a speedup of 9/1 = 9, if sufficient predictability and memory are available.
Which perspective is the correct one for estimating the upper bound on speedup by multithreading? (A program run in a system that has been running a long time with sufficient memory seems to have no limit to multithreading, whereas if the program is looked at by itself, there is a very definite fixed limit to the speedup.)
Or is the question of cutting down the longest path by anticipation and partial precalculation a moot one, because the speedup in that case varies with whether the user's decision to execute a program can be predicted, and so the upper bound on multithreading speedup due to anticipation cannot be known to a program writer or system designer and should be ignored / not relied upon to exist?
I do not quite understand which things depend on what from your description, but I can give you some theory. There is Amdahl's law, which gives you an upper bound on the speedup you can achieve based on how parallelizable a given algorithm is, assuming you have enough processors. If you can parallelize 50% of the calculation you can get a maximum speedup of 2x. 95% parallelization gives you a maximum speedup of 20x. To figure out how much speedup you can get, you need to know how much of your problem can be parallelized. This can be done by drawing a graph of the things you need to do and their dependencies, and figuring out the longest path. Example:
In this example the longest path would be B->E->F->H->I. All blocks are assumed to take the same time to execute. So there are 9 blocks and the longest path is 5 blocks, so your maximum achievable speedup is 9/5 = 1.8x. In practice you need to consider that your computer can only run a limited number of threads in parallel, that some blocks take longer than others, and that there is a cost involved in creating threads and using appropriate locking mechanisms to prevent data races. Those can be added to the graph by giving each block a cost and finding the longest path based on total cost, including the cost of the threading mechanisms. Although this method only gives you an upper bound, it tends to be very humbling. I hope this allows you to draw a graph and find the answer.
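The longest-path bound can be computed mechanically. A minimal Python sketch (the graph literal below is a hypothetical reconstruction consistent with the example's figure, with unit-cost blocks):

```python
from functools import lru_cache

# Hypothetical dependency graph: each block maps to the blocks it depends on.
# All blocks are assumed to cost 1 unit, as in the example.
deps = {
    "A": [], "B": [], "C": [], "D": [],
    "E": ["B"], "F": ["E"], "G": ["A", "C"],
    "H": ["F", "G"], "I": ["H", "D"],
}

@lru_cache(maxsize=None)
def longest_path(block):
    # Length of the longest dependency chain ending at this block.
    return 1 + max((longest_path(d) for d in deps[block]), default=0)

critical = max(longest_path(b) for b in deps)   # 5: B->E->F->H->I
total = len(deps)                               # 9 blocks in total
print(f"max speedup ~ {total}/{critical} = {total / critical:.1f}x")  # 1.8x
```

Attaching a real cost to each block instead of 1 turns this into the weighted critical-path calculation mentioned above.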
EDIT:
I forgot to say that Amdahl's law compares executing the code with a single thread to executing the code with an infinite number of threads with no overhead. If you make the multithreaded version execute different code than the single-threaded version, you are no longer bound by Amdahl's law.
With enough memory and time you can calculate the results for all possible inputs and then just do a lookup based on a given input to find the result. Such a system would get a higher speedup because it does not actually calculate anything, and it is not bound by Amdahl's law. If you manage to optimize B->E to take zero units of time, the longest path becomes 3 and there are only 8 nodes, giving you a maximum speedup of 8/3 = 2.66x, which is better than the 1.8x from before. That is only the speedup possible by multithreading, though: the first version actually takes 4 time units and the second version 3 time units. Optimizing code can give you more speedup than multithreading. The graph can still be useful, though. Assuming you do not run out of cores, the graph can tell you which parts of your program are worth optimizing and which are not. Assuming you do run out of cores, the graph can tell you which paths should be prioritized. In my example I calculate A, B, C and D simultaneously and therefore need a quad-core to make it work. If I move C down in time to execute in parallel with E, and make D run in parallel with H, a dual-core will suffice for the same speedup of 1.8x.
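The precalculate-and-look-up idea is essentially memoization. A minimal Python sketch (the function and the warm-up step are illustrative, not from the answer):

```python
from functools import lru_cache
import time

@lru_cache(maxsize=None)
def compute_B_E(x):
    # Stand-in for the expensive B->E subprogram.
    time.sleep(0.01)       # simulate real work
    return x + 1

# "Anticipation": run the subprogram early and keep the result in memory,
# before "compute I" is actually requested.
compute_B_E(42)

start = time.perf_counter()
result = compute_B_E(42)   # now served from the lookup table, no recalculation
elapsed = time.perf_counter() - start
print(result, elapsed < 0.01)
```

The trade-off the question raises shows up directly here: the cache trades memory (one stored result per anticipated input) for a shorter critical path at request time.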
In computer architecture, Amdahl's law gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved.
S_latency is the theoretical speedup in latency of the execution of the whole task;
s is the speedup in latency of the execution of the part of the task that benefits from the improved system resources;
p is the proportion of the overall execution time that the part benefiting from the improved resources originally occupied.
S_latency = 1 / ((1 - p) + p/s)
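The formula is easy to turn into a small helper for playing with the numbers. An illustrative Python sketch:

```python
def amdahl_speedup(p, s):
    """Theoretical overall speedup when a fraction p of the runtime
    is sped up by a factor s (Amdahl's law)."""
    return 1.0 / ((1.0 - p) + p / s)

# 50% of the task parallelizable, infinitely many cores -> at most 2x
print(amdahl_speedup(0.5, float("inf")))   # 2.0
# 95% parallelizable -> at most ~20x, no matter how many cores you add
print(amdahl_speedup(0.95, float("inf")))
```

Note that as s grows without bound, the speedup saturates at 1/(1-p), which is why the serial fraction dominates.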
It is all theoretical, and it had me thinking: when is it inapplicable, and how accurate is it for estimating CPU performance?
Often when you want to tune some part of a program, you make a microbenchmark to test just that in isolation.
That doesn't always reflect how it will behave when run as part of the full program. (i.e. with other work between executions of the part you're tuning, instead of in a tight loop.)
e.g. if you found sin(x) calculations were expensive and replaced them with a lookup table, that might win in a microbenchmark, because a small-enough table stays hot in cache when called back-to-back with no other work in between. Similarly, microbenchmarks measure performance with branch prediction primed and with no code-cache pressure (which can make loop unrolling look better than it is).
But that just means your estimate of s is wrong for the function as part of the whole program, not that Amdahl's law is inaccurate. That's just a case of using it wrong.
However, this does lead to a real answer to your question:
Speeding up one part of a program in a way that leads to more cache or TLB misses, branch mispredictions, etc. in other parts of the program can make the overall speedup smaller than Amdahl's law predicts, because the law assumes the unimproved parts are unaffected.
If I've got a long-running process that uses, on average, 10% of the CPU as it does its job, and I run two copies of it in parallel, I can expect them to collectively use 20% of the CPU. Or if I run one copy on a different CPU that's twice as fast, I can expect it to use 5% of that CPU.
What I'd like to do is get a handle on the CPU requirements of such a process, but in a reasonably CPU independent way, i.e. not as a percentage.
I'm wondering how meaningful it might be to assign a simple "cycles per second needed" number to such an app, by multiplying its CPU percentage by the bogomips number of the machine where I measured it. That is, a process that uses 10% of the CPU on a machine with a bogomips value of 1000 could be said to require 100,000,000 (bogus) instructions per second.
(Disclaimers: Of course I know that bogomips are bogus, that instructions do not equal cycles, and that cycle and instruction timings are not at all comparable between disparate processor families. I'm looking for rough, linear comparisons here, not precise counts.)
In some more detail: Suppose I've got a system with an assortment of long-running processes running on a possibly CPU-constrained machine. I might want to predict, in advance, whether they'll all run without overloading the CPU. Or I might want to implement checks (simple ones) that no process is using more CPU than it's supposed to. I'm willing to empirically measure the performance of each process in advance, to help me make these predictions and implement these checks. What I'm exploring here is, what's the right unit to measure the performance in?
For example, today I might be running processes A, B, D, E, and H on processor X. I might observe that the percent of CPU used by the processes is 10, 5, 1, 5, and 20%, respectively. 10+5+1+5+20 is 41, and 41 is comfortably less than 100, so I'm fine.
But tomorrow I might want to run processes A, B, C, H, and J on a different processor Y running at half the clock rate. Even if I also know something about the performance of processes C and J, it just seems unnecessarily messy to try to do the math based on percentages, when the CPU (that the percentages are of) is a moving target.
As mentioned, I might also want to assign an explicit CPU "budget" to each long-running process, and for sanity's sake I might want that budget to stay reasonably valid over time. That is, I might want to say that process A is only allowed to use 100,000,000 cycles per second. If it ever uses 150,000,000, something is wrong. But if I move everything to a 2x faster CPU tomorrow, I do not want process A to be able to get away with using twice as much, because I might have in mind to use the extra CPU power for other processes.
Finally, if this has made any sense, and if multiplying by bogomips is not a good way of doing what I'm trying to do, does anyone have any better ideas?
(Oh, and one more disclaimer: the question obviously gets more complicated for multi-processor machines, and multi-core processors. I'll worry about those additional wrinkles later, not today.)
Completely unreasonable. No, seriously. Indeed, removing bogomips was proposed precisely because they're meaningless, but the removal was rejected because it broke some users' functionality: https://lwn.net/Articles/627930/ (search for 'bogomips').
Suppose the time to complete a task on a processor core follows a distribution with mean m and standard deviation s. If the same task runs on n cores, what are the mean and standard deviation of the time it takes to complete the task? (The task is finished when one of the cores finishes it.)
This is more of a statistics question, than anything else. Without information on the distribution function of the time t a single task needs to complete, I could only give you a hint: You need to calculate the distribution function of the minimum of t for n of your tasks, as seen here. Using that you can then calculate the mean and the standard deviation.
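As a concrete illustration of that hint, the minimum's distribution can be estimated numerically. A hypothetical Python sketch, assuming purely for the sake of example that single-core completion time is normally distributed:

```python
import random
import statistics

def min_time_stats(m, s, n, trials=20000, seed=0):
    # Monte Carlo estimate of the completion time when n cores race:
    # the task finishes as soon as the fastest core does, so each trial
    # takes the minimum of n draws from the single-core distribution.
    rng = random.Random(seed)
    mins = [min(rng.gauss(m, s) for _ in range(n)) for _ in range(trials)]
    return statistics.mean(mins), statistics.stdev(mins)

mean1, sd1 = min_time_stats(m=10.0, s=2.0, n=1)
mean4, sd4 = min_time_stats(m=10.0, s=2.0, n=4)
print(mean1, mean4)  # both the mean and the spread shrink as n grows
```

The analytic route is the same idea: if F(t) is the single-core CDF, the minimum of n independent copies has CDF 1 - (1 - F(t))^n, from which the mean and standard deviation follow.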
PS: Is this homework?
EDIT:
Whether - and how much - it's worth to use multiple cores, depends on several things:
What you need to do. If you have to run the same program with different inputs, launching multiple instances makes a lot of sense. It might not cut the overall time down to 1/n, and each experiment will still need at least as much time as before, but the time needed for the whole series will be significantly less.
If on the other hand, you are hoping to run the same task with e.g. a different seed and keep the one that converges the fastest, you will probably gain far less, as estimated by the first part of my answer.
How well you have parallelized your tasks. n completely independent tasks is the ideal scenario; n threads with multiple synchronization points etc. are not going to be nearly as efficient.
How well your hardware can handle multiple tasks. For example if each of these tasks needs a lot of memory, it will probably be faster to use a single core only, than forcing the system to use the swap space/pagefile/whatever your OS calls it by running multiple instances at once.
I'm performing an operation, let's call it CalculateSomeData. CalculateSomeData operates in successive "generations", numbered 1..x. The number of generations in the entire run is fixed by the input parameters to CalculateSomeData and is known a priori. A single generation takes anywhere from 30 minutes to 2 hours to complete. Some of that variability is due to the input parameters and cannot be controlled. However, a portion of that variability is due to things like hardware capacities, CPU load from other processes, network bandwidth load, etc. One parameter that can be controlled per-generation is the number of threads that CalculateSomeData uses. Right now that's fixed and likely non-optimal. I'd like to track the time each generation takes and then have some algorithm by which I tweak the number of threads so that each successive generation improves upon the prior generation's calculation time (minimizing time). What approach should I use? How applicable are genetic algorithms? Intuition tells me that the range is going to be fairly tight - maybe 1 to 16 threads on a dual quad-core processor machine.
any pointers, pseudocode, etc. are much appreciated.
How about an evolutionary algorithm?
Start with a guess. 1 thread per CPU core seems good, but depends on the task at hand.
Measure the average time for each task in the generation. Compare it to the time taken by the previous generation. (Assume effectively infinite time and 0 threads for generation 0).
If the most recent generation's tasks averaged a better time than the one before, continue to change the number of threads in the same direction as you did last step (so if the last generation had more threads than the previous one, add a thread for the new generation; if it had fewer, use one fewer, with an obvious lower limit of 1 thread).
If the most recent generation's tasks took longer, on average, than the previous generation, then change the number of threads in the opposite direction (so if increasing the number of threads resulted in a worse time, use one fewer thread next time).
As long as the optimal number of threads isn't too close to 1, then you'll probably end up oscillating between 3 values that are all reasonably close to optimal. You may want to explicitly detect this case and lock yourself into the central value, if you have a large number of generations to deal with.
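The decision step described above can be sketched in a few lines. A hypothetical Python fragment (the function name and history format are made up for the example):

```python
def next_thread_count(history):
    """Hill-climbing step. history is a list of (threads, seconds) pairs
    for past generations, most recent last; returns the count to try next."""
    if len(history) < 2:
        # Not enough data yet: start at 1 thread, then probe upward.
        return history[-1][0] + 1 if history else 1
    (prev_t, prev_time), (last_t, last_time) = history[-2], history[-1]
    direction = 1 if last_t >= prev_t else -1   # direction of the last change
    if last_time > prev_time:                   # it got worse: reverse course
        direction = -direction
    return max(1, last_t + direction)           # never go below 1 thread

# e.g. 2 threads took 100s, then 3 threads took 80s -> keep adding threads
print(next_thread_count([(2, 100.0), (3, 80.0)]))  # 4
```

Detecting the oscillation between three neighbouring values, as suggested above, would be a small extension: lock onto the middle value once the last few suggestions cycle.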
If the calculations are completely CPU bound the number of threads should be equal to the number of cores on the machine. That way you minimize the number of context switches.
If your calculations involve I/O, network, synchronization or something else that blocks execution you must find the limiting resource and measure the utilization. You need to monitor the utilization and slowly add more threads until the utilization gets close to 100%. You should have as few threads as possible to saturate your limiting resource.
You should divide up your generations into lots of small tasks and put them in a queue. Spawn one thread per core and have each thread grab a task to do, run it to completion, and repeat.
You want lots more tasks than cores to make sure that you don't end up with just one task running at the end of the generation and all other threads idle. This is what is likely to happen if you set #tasks = #threads = #cores as Albin suggests (unless you can ensure that all tasks take precisely the same amount of time).
You also probably don't want more threads than cores. Context switching isn't terribly expensive, but the larger cache footprint that comes with having more than #cores tasks simultaneously active could hurt you (unless your tasks use very little memory).
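The queue-of-small-tasks approach above can be sketched as follows (an illustrative Python example, not the asker's actual code; `process` is a made-up stand-in for one unit of a generation's work):

```python
from concurrent.futures import ThreadPoolExecutor
import os

def process(task):
    # Stand-in for one small unit of a generation's work.
    return task * task

tasks = range(100)             # many more tasks than cores, on purpose
workers = os.cpu_count() or 4  # one thread per core

# Each worker repeatedly grabs the next task from the shared queue until it
# is empty, so no thread sits idle while a long task finishes elsewhere.
with ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(process, tasks))

print(sum(results))  # 328350
```

With only #tasks = #cores, one straggler task would leave every other thread idle at the end of the generation; the oversubscribed queue avoids that.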
I'm working on a parallelization library for the D programming language. Now that I'm pretty happy with the basic primitives (parallel foreach, map, reduce and tasks/futures), I'm starting to think about some higher level parallel algorithms. Among the more obvious candidates for parallelization is sorting.
My first question is, are parallelized versions of sorting algorithms useful in the real world, or are they mostly academic? If they are useful, where are they useful? I personally would seldom use them in my work, simply because I usually peg all of my cores at 100% using a much coarser grained level of parallelism than a single sort() call.
Secondly, it seems like quick sort is almost embarrassingly parallel for large arrays, yet I can't get the near-linear speedups I believe I should be getting. For a quick sort, the only inherently serial part is the first partition. I tried parallelizing a quick sort by, after each partition, sorting the two subarrays in parallel. In simplified pseudocode:
// I tweaked this number a bunch. Anything smaller than this and the
// overhead is larger than the parallelization gains.
const smallestToParallelize = 500;

void quickSort(T)(T[] array) {
    if (array.length < someConstant) {  // cutoff for insertion sort
        insertionSort(array);
        return;
    }
    size_t pivotPosition = partition(array);
    if (array.length >= smallestToParallelize) {
        // Sort left subarray in a task pool thread.
        auto myTask = taskPool.execute(quickSort(array[0..pivotPosition]));
        quickSort(array[pivotPosition + 1..$]);
        myTask.workWait();
    } else {
        // Regular serial quick sort.
        quickSort(array[0..pivotPosition]);
        quickSort(array[pivotPosition + 1..$]);
    }
}
Even for very large arrays, where the time the first partition takes is negligible, I can only get about a 30% speedup on a dual core, compared to a purely serial version of the algorithm. I'm guessing the bottleneck is shared memory access. Any insight on how to eliminate this bottleneck or what else the bottleneck might be?
Edit: My task pool has a fixed number of threads, equal to the number of cores in the system minus 1 (since the main thread also does work). Also, the type of wait I'm using is a work wait, i.e. if the task is started but not finished, the thread calling workWait() steals other jobs off the pool and does them until the one it's waiting on is done. If the task isn't started, it is completed in the current thread. This means that the waiting isn't inefficient. As long as there is work to be done, all threads will be kept busy.
Keep in mind I'm not an expert on parallel sort, and folks make research careers out of parallel sort but...
1) are they useful in the real world.
of course they are, if you need to sort something expensive (like strings or worse) and you aren't pegging all the cores.
think UI code where you need to sort a large dynamic list of strings based on context
think something like a barnes-hut n-bodies sim where you need to sort the particles
2) Quicksort seems like it would give a linear speedup, but it doesn't. The partition step is a sequential bottleneck; you will see this if you profile, and it will tend to cap out at 2-3x on a quad core.
If you want to get good speedups on a smaller system you need to ensure that your per task overheads are really small and ideally you will want to ensure that you don't have too many threads running, i.e. not much more than 2 on a dual core. A thread pool probably isn't the right abstraction.
If you want to get good speedups on a larger system, you'll need to look at scan-based parallel sorts; there are papers on this. Bitonic sort is also quite easy to parallelize, as is merge sort. A parallel radix sort can also be useful; there is one in the PPL (if you aren't averse to Visual Studio 11).
I'm no expert but... here is what I'd look at:
First of all, I've heard that, as a rule of thumb, algorithms that look at small bits of the problem from the start tend to work better as parallel algorithms.
Looking at your implementation, try making the parallel/serial switch go the other way: partition the array and sort in parallel until you have N segments, then go serial. If you are more or less grabbing a new thread for each parallel case, then N should be ~ your core count. OTOH if your thread pool is of fixed size and acts as a queue of short lived delegates, then I'd use N ~ 2+ times your core count (so that cores don't sit idle because one partition finished faster).
Other tweaks:
skip the myTask.wait(); at the local level and rather have a wrapper function that waits on all the tasks.
Make a separate serial implementation of the function that avoids the depth check.
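The "partition into N segments, then sort each serially in parallel" idea might look roughly like this (an illustrative Python sketch rather than D; note that CPython's GIL limits real CPU parallelism for threads, so this demonstrates the structure rather than the speedup):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sort(arr, segments=4):
    # Split the array into `segments` value ranges using sampled pivots,
    # then sort each range serially on its own worker and concatenate.
    if len(arr) < 2 or segments < 2:
        return sorted(arr)
    # Sample every k-th element and take interior values as range boundaries.
    pivots = sorted(arr[::max(1, len(arr) // segments)])[1:segments]
    buckets = [[] for _ in range(segments)]
    for x in arr:
        i = sum(x > p for p in pivots)   # index of the range x falls into
        buckets[i].append(x)
    with ThreadPoolExecutor(max_workers=segments) as pool:
        parts = pool.map(sorted, buckets)
    # Ranges are ordered, so concatenating the sorted buckets is sorted.
    return [x for part in parts for x in part]

data = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
print(parallel_sort(data))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

All the splitting happens up front, so the workers never wait on each other, which is the point of flipping the parallel/serial switch this way.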
"My first question is, are parallelized versions of sorting algorithms useful in the real world" - it depends on the size of the data set that you are working on in the real world. For small data sets the answer is no. For larger data sets it depends not only on the size of the data set but also on the specific architecture of the system.
One of the limiting factors that will prevent the expected increase in performance is the cache layout of the system. If the data can fit in the L1 cache of a core, then there is little to gain by sorting across multiple cores as you incur the penalty of the L1 cache miss between each iteration of the sorting algorithm.
The same reasoning applies to chips that have multiple L2 caches and NUMA (non-uniform memory access) architectures. So the more cores you want to distribute the sorting across, the more the smallestToParallelize constant will need to be increased.
Another limiting factor, which you identified, is shared memory access, or contention over the memory bus. Since the memory bus can only satisfy a certain number of memory accesses per second, having additional cores that do essentially nothing but read and write to main memory will put a lot of stress on the memory system.
The last factor that I should point out is the thread pool itself as it may not be as efficient as you think. Because you have threads that steal and generate work from a shared queue, that queue requires synchronization methods; and depending on how those are implemented, they can cause very long serial sections in your code.
I don't know if answers here are applicable any longer or if my suggestions are applicable to D.
Anyway ...
Assuming that D allows it, there is always the possibility of providing prefetch hints to the caches. The core in question requests that data it will soon (not immediately) need be loaded into a certain cache level. In the ideal case the data will have been fetched by the time the core starts working on it. More likely the prefetch will still be in flight, which at least results in fewer wait states than if the data were fetched "cold."
You'll still be constrained by the overall cache-to-RAM throughput capacity, so you'll need to organize the data such that enough of it sits in the core's exclusive caches for the core to spend a fair amount of time there before having to write back updated data.
The code and data need to be organized according to the concept of cache lines (fetch units of 64 bytes each), the smallest-sized unit in a cache. For two cores, the work then needs to be organized such that the memory system does half as much work per core (assuming 100% scalability) as when only one core was working and the work hadn't been organized; for four cores, a quarter as much, and so on. It's quite a challenge, but by no means impossible; it just depends on how imaginative you are in restructuring the work. As always, there are solutions that cannot be conceived... until someone does just that!
I don't know how WYSIWYG D is compared to C - which I use - but in general I think the process of developing scalable applications is helped by how much the developer can influence the compiler's actual machine code generation. For interpreted languages there is so much memory work going on in the interpreter that you risk being unable to discern improvements from the general "background noise."
I once wrote a multi-threaded shellsort which ran 70% faster on two cores compared to one and 100% on three cores compared to one. Four cores ran slower than three. So I know the dilemmas you face.
I would like to point you to External Sorting[1], which faces similar problems. This class of algorithms is used mostly to cope with large volumes of data, but their main point is that they split up large chunks into smaller and unrelated problems, which are therefore really great to run in parallel. You "only" need to stitch together the partial results afterwards, which is not quite as parallel (but relatively cheap compared to the actual sorting).
An External Merge Sort would also work really well with an unknown number of threads. You just split the workload arbitrarily, and give each chunk of n elements to a thread whenever one is idle, until all your work units are done, at which point you can start joining them up.
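A toy version of that split-then-merge structure (an illustrative Python sketch, in-memory rather than truly external, with a made-up function name):

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def external_style_sort(data, chunk_size=4):
    # Phase 1: sort fixed-size chunks independently -- these are the
    # unrelated subproblems, handed out to whatever threads are idle.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor() as pool:
        runs = list(pool.map(sorted, chunks))
    # Phase 2: the cheap, mostly serial part -- a k-way merge of the runs.
    return list(heapq.merge(*runs))

print(external_style_sort([9, 1, 7, 3, 8, 2, 6, 4, 5, 0]))
```

In a real external sort the chunks would live on disk and chunk_size would be chosen to fit in memory, but the parallel-runs-plus-cheap-merge shape is the same.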
[1] http://en.wikipedia.org/wiki/External_sorting