How accurate is Amdahl's law?

In computer architecture, Amdahl's law gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved.
S_latency is the theoretical speedup in latency of the execution of the whole task;
s is the speedup in latency of the execution of the part of the task that benefits from the improved system resources;
p is the proportion of the whole task's execution time that was spent, before the improvement, in the part that benefits from the improved resources.
S_latency = 1 / [(1 - p) + (p / s)]
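For example, just plugging illustrative numbers into the formula (a quick sketch, nothing more):

#include <cstdio>

// S_latency = 1 / ((1 - p) + p / s)
double amdahl(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

int main() {
    // Hypothetical numbers: 80% of the run time benefits, and that part gets 4x faster.
    std::printf("%.2f\n", amdahl(0.80, 4.0));   // 2.50x overall, not 4x
    // Even with s -> infinity the limit here is 1 / (1 - p) = 5x.
    return 0;
}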
It is all theoretical, which got me thinking: when is it inapplicable, and how accurate is it for estimating CPU performance?

Often when you want to tune some part of a program, you make a microbenchmark to test just that in isolation.
That doesn't always reflect how it will behave when run as part of the full program. (i.e. with other work between executions of the part you're tuning, instead of in a tight loop.)
e.g. if you found sin(x) calculations were expensive and replaced them with a lookup table, that might win in a microbenchmark, because a small-enough table stays hot in cache when called back-to-back with no other work in between. Similarly, microbenchmarks measure performance with the branch predictor primed and with no code-cache pressure (which can make loop unrolling look better than it is).
But that just means your estimate of s is wrong for the function as part of the whole program, not that Amdahl's law is inaccurate. That's just a case of using it wrong.
However, this does lead to a real answer to your question:
Speeding up one part of a program in a way that leads to more cache or TLB misses, branch mispredictions, etc. in other parts of the program does violate Amdahl's law: the "unimproved" 1 - p fraction no longer runs at its original speed, so the formula's assumptions break down.
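For concreteness, here is roughly what such a sin(x) lookup table looks like (a sketch with made-up sizes, not code from the answer). In a tight microbenchmark its table stays resident in cache; interleaved with a real workload it may evict data the rest of the program needs, which is exactly how the effective s changes:

#include <cmath>
#include <vector>

// Hypothetical table-based sine: N precomputed samples over [0, 2*pi),
// indexed by the nearest sample. Accuracy and cache footprint both grow with N.
struct SinTable {
    static constexpr int N = 4096;           // 4096 doubles = 32 KB: hot in a tight loop
    std::vector<double> table;
    SinTable() : table(N) {
        const double twoPi = 6.283185307179586;
        for (int i = 0; i < N; ++i)
            table[i] = std::sin(twoPi * i / N);
    }
    double operator()(double x) const {
        const double twoPi = 6.283185307179586;
        double t = x / twoPi;
        int i = static_cast<int>((t - std::floor(t)) * N);   // wrap x into [0, 2*pi)
        return table[i & (N - 1)];                           // N is a power of two
    }
};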

Related

The number of times to run a profiling experiment

I am trying to profile a CUDA application. I have a basic doubt about performance analysis and workload characterization of HPC programs. Let us say I want to analyse the wall clock time (the end-to-end time of execution of a program). How many times should one run the same experiment to account for the variation in the wall clock time measurement?
Thanks.
How many times should one run the same experiment to account for the
variation in the wall clock time measurement?
The question statement assumes that there will be a variation in execution time. Had the question been
How many times should one run CUDA code for performance analysis and workload characterization?
then I would have answered
Once.
Let me explain why ... and give you some reasons for disagreeing with me ...
Fundamentally, computers are deterministic and the execution of a program is deterministic. (Though, and see below, some programs can provide an impression of non-determinism but they do so deterministically unless equipped with exotic peripherals.)
So what might be the causes of a difference in execution times between two runs of the same program?
Physics
Do the bits move faster between RAM and CPU as the temperature of the components varies? I haven't a clue, but if they do, I'm quite sure that within the usual temperature ranges at which computers operate the relative difference is going to be down in the nano range. I think any other differences arising from the physics of computation are going to be similarly utterly negligible. The only lesson here, perhaps, is: don't do performance analysis on a program which only takes a microsecond or two to execute.
Note that I ignore, for the purposes of this answer, the capability of some processors to adjust their clock rates in response to their temperature. This would have some (possibly large) impact on a program's execution time, but all you'd learn is how to use it as a thermometer.
Contention for System Resources
By which I mean matters such as other processes (including the operating system) running on the same CPU / core, other traffic on the memory bus, other processes using I/O, etc. Sure, yes, these may have a major impact on a program's execution time. But what do variations in run times between runs of your program tell you in these cases? They tell you how busy the system was doing other work at the same time, and they make it very difficult to analyse your program's performance.
A lesson here is to run your program on an otherwise quiet machine. Indeed one of the characteristics of the management of HPC systems in general is that they aim to provide a quiet platform to provide a reliable run time to user codes.
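One practical way to get closer to a quiet machine on Linux is to pin the run to a single core, e.g. with taskset -c 2 ./myprogram, or programmatically (a sketch, assuming Linux/glibc, where g++ defines the CPU_* macros by default):

#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);                                    // core 2, arbitrary choice
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  // 0 = the calling process
        std::perror("sched_setaffinity");
        return 1;
    }
    // ... run the code being measured here ...
    return 0;
}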
Another lesson is to avoid including in your measurement of execution time the time taken for operations, such as disk reads and writes or network communications, over which you have no control.
If your program is a heavy user of, say, disks, then you should probably be measuring i/o rates using one of the standard benchmarking codes for the purpose to get a clear idea of the potential impact on your program.
Program Features
There may be aspects of your program which can reasonably be expected to produce different times from one run to the next. For example, if your program relies on randomness then different rolls of the dice might have some impact on execution time. (In this case you might want to run the program more than once to see how sensitive it is to the operations of the RNG.)
However, I exclude from this third source of variability the running of the code with different inputs or parameters. If you want to measure the scalability of program execution time wrt input size then you surely will have to run the program a number of times.
In conclusion
There is very little of interest to be learned, about a program, by running it more than once with no differences in the work it is doing from one run to the next.
And yes, in my early days I was guilty of running the same program multiple times to see how the execution time varied. I learned that it didn't, and that's where I got this answer from.
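If you want to check that for yourself, a minimal timing harness is enough (a sketch; workload() stands in for whatever fixed-input fragment you are measuring):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Stand-in for the fixed-input program fragment being measured.
void workload() {
    volatile double x = 0;
    for (int i = 0; i < 10000000; ++i) x = x + i * 0.5;
}

int main() {
    const int runs = 5;
    std::vector<double> seconds;
    for (int r = 0; r < runs; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        workload();
        auto t1 = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    std::sort(seconds.begin(), seconds.end());
    // On an otherwise quiet machine the spread should be tiny.
    std::printf("min %.4f s, median %.4f s, max %.4f s\n",
                seconds.front(), seconds[runs / 2], seconds.back());
    return 0;
}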
This kind of test demonstrates how well the compiled application interacts with the OS/computing environment where it will be used, as opposed to the efficiency of a specific algorithm or architecture. I do this kind of test by running the application three times in a row after a clean reboot/spinup. I'm looking for any differences caused by the OS loading and caching libraries or runtime environments on the first execution; and I expect the next two runtimes to be similar to each other (and faster than the first one). If they are not, then more investigation is needed.
Two further comments: it is difficult to be certain that you know what libraries and runtimes your application requires, and how a given computing environment will handle them, if you have a complex application with lots of dependencies.
Also, I recommend avoiding specifying the application runtime for a customer, because it is very hard to control the customer's computing environment. Focus on the things you can control in your application: architecture, algorithms, library version.

Splitting a computational workload: where is it possible or impossible

I program although I am not a computer scientist. Therefore I would like to see if I understood correctly the challenge in splitting a workload. Is this below the correct way to think about it?
Specifically, is the following statement (1) correct?
(1) If A(X_a) + A(X_b) + A(X_c) + ... = B(X_a,X_b,X_c, ...) = Y is an equation that is being computed
whether or not it can be computed more rapidly, from the perspective of the computer, by assigning parts of the equation to individual threads at the same time depends on the following:
if X_m changes when A(X_n) changes, for m not equal to n, then dividing the workload for that particular computation gives less of a performance gain, and if this is true for every combination of m and n in the system, then no performance gain for multithreading over single threading is possible.
Or, in other words, do I understand correctly that the presence of linked variables decreases the ability to multithread successfully? Because X_b and X_c depend on what A(X_a) evaluates to, it bottlenecks the process: the other threads know A but have to wait for the first thread to give an output before they have instructions to execute, so parts of a computation that is otherwise easily broken up cannot be worked on simultaneously, and the computation takes as much time on one thread doing each part of the calculation one after the other as it does on more than one thread working at once, with another thread summing the results in the order they complete.
(2) Or is there a way around the above bottleneck? For example, if this bottleneck is known in advance, the first thread can start early and store in memory the results of A(X_n) for all n that bottleneck the operation, and then split the workload efficiently, one A(X_i) to the i-th thread. But to do this, the first thread would have to predict in some way when the calculation B(X_a,X_b,X_c, ...) must be executed BEFORE B(X_a,X_b,X_c, ...) is actually executed; otherwise it would run into the bottleneck.
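If the terms really are independent (no X_m depends on any A(X_n)), the split is straightforward; a minimal sketch of that case (names are just illustrative):

#include <cstdio>
#include <future>
#include <vector>

// Stand-in for A(X_i); here it depends only on its own input, nothing else.
double A(double x) {
    double acc = 0;
    for (int i = 0; i < 1000000; ++i) acc += x * i;
    return acc;
}

int main() {
    std::vector<double> X = {1.0, 2.0, 3.0, 4.0};        // X_a, X_b, X_c, X_d
    std::vector<std::future<double>> parts;
    for (double x : X)                                    // each term on its own thread
        parts.push_back(std::async(std::launch::async, A, x));
    double Y = 0;
    for (auto &f : parts) Y += f.get();                   // summing is the only serial step
    std::printf("Y = %f\n", Y);
    return 0;
}

If, instead, some X_m is only known after A(X_n) has finished, the corresponding get() has to wait, and that chain of waits is exactly the bottleneck described above.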
[EDIT: To clarify, in context of NWP's answer. If the clarification is too long / unclear, please leave a comment, and I'll make a few graphics in LaTeX to shorten the question writeup.]
Suppose the longest path in the program "compute I" is 5 units of time in the example. If you know this longest path, and the running system can anticipate (based on past frequency of execution) when this program "compute I" will be run in the future, subprogram "compute B->E" (which does not depend on anything else but is a proper subset of the longest path of program "compute I") may be executed in advance. The result is stored in memory prior to the user requesting "compute I".
If so, is the max speedup considered to be 9/4? The B->E result is ready, so the other threads do not have to wait for it. Or is the max speedup for "compute I" still considered to be 9/5?
The anticipation program run before has a cost, but this cost may be spread over each instance of execution of "compute I". If the anticipation program has 15 steps, but the program "compute I" is run typically 100 times per each execution of the anticipation program, and all steps cost equally, do we simply say the max speedup possible in "compute I" is therefore 9/(5 - 1 + 15/100)?
The speedup possible now appears to depend not only on the number of threads and the longest path, but also on the memory available to store precalculations and on how far in advance another program can anticipate that "compute I" will be run and precalculate proper subprograms of it. Another program "compute X" may have a longest path of the same length as "compute I", but the system may not be able to anticipate that "compute X" will be run as far in advance as it can for "compute I". How do we weigh the speedup achieved (i) at the expense of increasing memory to store precalculations, and (ii) from the fact that the execution of some programs can be anticipated further in advance than that of others, allowing the bottleneck to be precalculated and the longest path cut down?
But if a longest path can be dynamically cut down by better predictive precalculation of subprograms and more memory for storing the results of precalculation, can bottlenecks be considered at all as determining the ultimate upper bound to the speedup from splitting a computational workload?
From the linked-variables dependency perspective / graph-bottleneck perspective, the ultimate upper bound on the speedup from multithreading a program "compute I" appears to be determined by its longest subprogram (other subprograms depend on it / wait for it). But from the dynamic perspective, where the whole system is running before and after the program "compute I" is executed as a part of it, sufficient predictability of the timing of future executions of "compute I", plus the ability to store more and more precalculations of its independent subprograms, can cut the length of all subprograms of "compute I" down to 1 unit, meaning it can possibly achieve a speedup of 9/1 = 9 if sufficient predictability and memory are available.
Which perspective is the correct one for estimating the upper bound on speedup by multithreading? (A program run in a system that has been running a long time with sufficient memory seems to have no limit to multithreading, whereas if it is looked at by itself, there is a very definite fixed limit to the speedup.)
Or is the question of cutting down the longest path by anticipation and partial precalculation a moot one, because the speedup in that case varies with the user's decision to execute a program in a predictable way, and so the upper bound on multithreading speedup due to anticipation cannot be known to a program writer or system designer and should be ignored / not relied upon to exist?
I do not quite understand which things depend on what from your description, but I can give you some theory. There is Amdahl's law, which gives you an upper bound on the speedup you can achieve based on how parallelizable a given algorithm is, assuming you have enough processors. If you can parallelize 50% of the calculation you can get a maximum speedup of 2x. 95% parallelization gives you a maximum speedup of 20x. To figure out how much speedup you can get, you need to know how much of your problem can be parallelized. This can be done by drawing a graph of the things you need to do and which depends on what, and figuring out the longest path. Example:
In this example the longest path would be B->E->F->H->I. All blocks are assumed to take the same time to execute. So there are 9 blocks, the longest path is 5 blocks, and your maximum achievable speedup is 9/5 = 1.8x. In practice you need to consider that your computer can only run a limited number of threads in parallel, that some blocks take longer than others, and that there is a cost involved in creating threads and using appropriate locking mechanisms to prevent data races. Those can be added to the graph by giving each block a cost and finding the longest path based on that cost, including the cost of the threading mechanisms. Although this method only gives you an upper bound, it tends to be very humbling. I hope this allows you to draw a graph and find the answer.
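The graph from that example isn't reproduced here, but the bound itself is easy to compute once you have the edges: total blocks divided by the longest path. A sketch (the edge list below is made up to match the description, so treat it as illustrative):

#include <algorithm>
#include <cstdio>
#include <map>
#include <vector>

// Hypothetical edge list consistent with the description: 9 blocks A..I,
// each costing 1 unit, with critical path B->E->F->H->I.
std::map<char, std::vector<char>> succ = {
    {'A', {'F'}}, {'B', {'E'}}, {'C', {'F'}}, {'D', {'H'}}, {'E', {'F'}},
    {'F', {'H'}}, {'G', {'I'}}, {'H', {'I'}}, {'I', {}},
};

std::map<char, int> memo;

// Length (in blocks) of the longest path starting at block n, via memoized DFS.
int longestFrom(char n) {
    auto it = memo.find(n);
    if (it != memo.end()) return it->second;
    int best = 0;
    for (char m : succ.at(n)) best = std::max(best, longestFrom(m));
    return memo[n] = 1 + best;
}

int main() {
    int totalBlocks = static_cast<int>(succ.size());
    int criticalPath = 0;
    for (auto &kv : succ) criticalPath = std::max(criticalPath, longestFrom(kv.first));
    std::printf("speedup bound = %d / %d = %.2fx\n",
                totalBlocks, criticalPath, double(totalBlocks) / criticalPath);  // 9 / 5 = 1.80x
    return 0;
}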
EDIT:
I forgot to say that Amdahl's law compares executing the code with a single thread to executing the code with an infinite number of threads with no overhead. If you make the multithreaded version execute different code than the single-threaded version, you are no longer bound by Amdahl's law.
With enough memory and time you can calculate the results for all possible inputs and then just do a lookup based on a given input to find the result. Such a system would get a higher speedup because it does not actually calculate anything, and it is not bound by Amdahl's law. If you manage to optimize B->E to take zero units of time, the longest path becomes 3 and there are only 8 nodes, giving you a maximum speedup of 8/3 = 2.66x, which is better than the 1.8x of before. That is only the speedup possible from multithreading, though; actually the first version takes 4 time units and the second version 3 time units. Optimizing code can give you more speedup than multithreading. The graph can still be useful, though. Assuming you do not run out of cores, the graph can tell you which parts of your program are worth optimizing and which are not. Assuming you do run out of cores, the graph can tell you which paths should be prioritized. In my example I calculate A, B, C and D simultaneously and therefore need a quad core to make it work. If I move C down in time to execute in parallel with E, and make D run in parallel with H, a dual core will suffice for the same speedup of 1.8x.

HLSL operator/functions cycle count

I'm modeling some algorithms to be run on GPUs. Is there a reference or something as to how many cycles the various intrinsics and calculations take on modern hardware? (NVIDIA 5xx+ series, AMD 6xxx+ series.) I can't seem to find any official word on this, even though there are some mentions of the raised costs of normalization, square root and other functions throughout their documentation. Thanks.
Unfortunately, the cycle count documentation you're looking for either doesn't exist, or (if it does) it probably won't be as useful as you would expect. You're correct that some of the more complex GPU instructions take more time to execute than the simpler ones, but cycle counts are only important when instruction execution time is main performance bottleneck; GPUs are designed such that this is very rarely the case.
The way GPU shader programs achieve such high performance is by running many (potentially thousands) of shader threads in parallel. Each shader thread generally executes no more than a single instruction before being swapped out for a different thread. In perfect conditions, there are enough threads in flight that some of them are always ready to execute their next instruction, so the GPU never has to stall; this hides the latency of any operation executed by a single thread. If the GPU is doing useful work every cycle, then it's as if every shader instruction executes in a single cycle. In this case, the only way to make your program faster is to make it shorter (fewer instructions = fewer cycles of work overall).
Under more realistic conditions, when there isn't enough work to keep the GPU fully loaded, the bottleneck is virtually guaranteed to be memory accesses rather than ALU operations. A single texture fetch can take thousands of cycles to return in the worst case; with unpredictable stalls like that, it's generally not worth worrying about whether sqrt() takes more cycles than dot().
So, the key to maximizing GPU performance isn't to use faster instructions. It's about maximizing occupancy -- that is, making sure there's enough work to keep the GPU sufficiently busy to hide instruction / memory latencies. It's about being smart about your memory accesses, to minimize those agonizing round-trips to DRAM. And sometimes, when you're really lucky, it's about using fewer instructions.
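One way to make "memory is the bottleneck" concrete is a back-of-the-envelope arithmetic-intensity check. All of the numbers below are made up for illustration; substitute your shader's counts and your GPU's specs:

#include <cstdio>

int main() {
    // Hypothetical hardware: 1 TFLOP/s of ALU throughput, 100 GB/s of DRAM bandwidth.
    double gpuFlops = 1.0e12;
    double gpuBytes = 1.0e11;
    double machineBalance = gpuFlops / gpuBytes;       // ~10 ALU ops per byte moved

    // Hypothetical shader: 50 ALU ops and two float4 texture fetches (32 bytes) per pixel.
    double shaderOps   = 50;
    double shaderBytes = 32;
    double intensity   = shaderOps / shaderBytes;

    std::printf("intensity %.2f vs balance %.2f -> %s-bound\n",
                intensity, machineBalance,
                intensity < machineBalance ? "memory" : "ALU");
    return 0;
}

When the shader side of that comparison sits far below the machine balance, shaving a cycle off sqrt() won't show up; reducing the bytes fetched will.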
http://books.google.ee/books?id=5FAWBK9g-wAC&lpg=PA274&ots=UWQi5qznrv&dq=instruction%20slot%20cost%20hlsl&pg=PA210#v=onepage&q=table%20a-8&f=false
This is the closest thing I've found so far; it is outdated (SM3) but I guess better than nothing.
Do operators/functions have a cycle count? I know assembly instructions have cycle counts; that's the low-level time measurement, and it mostly depends on the CPU. Operators and functions are high-level programming constructs, so I don't think they have such a measurement.

Multi-threaded performance and profiling

I have a program that scales badly to multiple threads, although – theoretically – it should scale linearly: it's a calculation that splits into smaller chunks and doesn't need system calls, library calls, locking, etc. Running with four threads is only about twice as fast as running with a single thread (on a quad core system), while I'd expect a number closer to four times as fast.
The run time of the implementations with pthreads, C++0x threads and OpenMP agree.
In order to pinpoint the cause, I tried gprof (useless) and valgrind (I didn't see anything obvious). How can I effectively benchmark what's causing the slowdown? Any generic ideas as to its possible causes?
— Update —
The calculation involves Monte Carlo integration and I noticed that an unreasonable amount of time is spent generating random numbers. While I don't know yet why this happens with four threads, I noticed that the random number generator is not reentrant. When using mutexes, the running time explodes. I'll reimplement this part before checking for other problems.
I did reimplement the sampling classes, which improved performance substantially. The remaining problem was, in fact, contention for the CPU caches (revealed by cachegrind, as Evgeny suspected).
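For reference, the usual fix for a non-reentrant generator (a sketch in C++11, not the poster's actual code) is one independent generator per thread instead of a shared, mutex-protected one:

#include <cstdio>
#include <random>
#include <thread>
#include <vector>

// Each thread owns its own generator, seeded differently, so no locking is needed.
double estimatePi(unsigned seed, int samples) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    int inside = 0;
    for (int i = 0; i < samples; ++i) {
        double x = dist(rng), y = dist(rng);
        if (x * x + y * y <= 1.0) ++inside;
    }
    return 4.0 * inside / samples;
}

int main() {
    const int nThreads = 4, samples = 1000000;
    std::vector<std::thread> threads;
    std::vector<double> results(nThreads);
    for (int t = 0; t < nThreads; ++t)
        threads.emplace_back([t, samples, &results] {
            results[t] = estimatePi(12345u + t, samples);
        });
    for (auto &th : threads) th.join();
    double avg = 0;
    for (double r : results) avg += r / nThreads;
    std::printf("pi ~= %f\n", avg);
    return 0;
}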
You can use oprofile. Or a poor man's pseudo-profiler: run the program under gdb, stop it and look where it is stopped. "valgrind --tool=cachegrind" will show you how efficiently CPU cache is used.
Monte Carlo integration seems to be a very memory-intensive algorithm. Try to estimate how much memory bandwidth is used; it may be the limiting factor for your program's performance. Also, if your system is only 2-core with hyperthreading, it should not work much faster with 4 threads compared with 2 threads.

(When) are parallel sorts practical and how do you write an efficient one?

I'm working on a parallelization library for the D programming language. Now that I'm pretty happy with the basic primitives (parallel foreach, map, reduce and tasks/futures), I'm starting to think about some higher level parallel algorithms. Among the more obvious candidates for parallelization is sorting.
My first question is, are parallelized versions of sorting algorithms useful in the real world, or are they mostly academic? If they are useful, where are they useful? I personally would seldom use them in my work, simply because I usually peg all of my cores at 100% using a much coarser grained level of parallelism than a single sort() call.
Secondly, it seems like quick sort is almost embarrassingly parallel for large arrays, yet I can't get the near-linear speedups I believe I should be getting. For a quick sort, the only inherently serial part is the first partition. I tried parallelizing a quick sort by, after each partition, sorting the two subarrays in parallel. In simplified pseudocode:
// I tweaked this number a bunch. Anything smaller than this and the
// overhead outweighs the parallelization gains.
const smallestToParallelize = 500;

// Cutoff below which plain insertion sort is used (placeholder value).
const someConstant = 32;

void quickSort(T)(T[] array) {
    if (array.length < someConstant) {
        insertionSort(array);
        return;
    }
    size_t pivotPosition = partition(array);
    if (array.length >= smallestToParallelize) {
        // Sort the left subarray in a task pool thread.
        auto myTask = taskPool.execute(quickSort(array[0..pivotPosition]));
        quickSort(array[pivotPosition + 1..$]);
        myTask.workWait();
    } else {
        // Regular serial quick sort.
        quickSort(array[0..pivotPosition]);
        quickSort(array[pivotPosition + 1..$]);
    }
}
Even for very large arrays, where the time the first partition takes is negligible, I can only get about a 30% speedup on a dual core, compared to a purely serial version of the algorithm. I'm guessing the bottleneck is shared memory access. Any insight on how to eliminate this bottleneck or what else the bottleneck might be?
Edit: My task pool has a fixed number of threads, equal to the number of cores in the system minus 1 (since the main thread also does work). Also, the type of wait I'm using is a work wait, i.e. if the task is started but not finished, the thread calling workWait() steals other jobs off the pool and does them until the one it's waiting on is done. If the task isn't started, it is completed in the current thread. This means that the waiting isn't inefficient. As long as there is work to be done, all threads will be kept busy.
Keep in mind I'm not an expert on parallel sort, and folks make research careers out of parallel sort but...
1) Are they useful in the real world?
Of course they are, if you need to sort something expensive (like strings or worse) and you aren't pegging all the cores.
Think UI code where you need to sort a large dynamic list of strings based on context.
Think something like a Barnes-Hut n-body sim where you need to sort the particles.
2) Quicksort seems like it would give a linear speedup, but it doesn't. The partition step is a sequential bottleneck; you will see this if you profile, and it will tend to cap out at 2-3x on a quad core.
If you want to get good speedups on a smaller system you need to ensure that your per task overheads are really small and ideally you will want to ensure that you don't have too many threads running, i.e. not much more than 2 on a dual core. A thread pool probably isn't the right abstraction.
If you want to get good speedups on a larger system you'll need to look at the scan-based parallel sorts; there are papers on this. Bitonic sort is also quite easy to parallelize, as is merge sort. A parallel radix sort can also be useful; there is one in the PPL (if you aren't averse to Visual Studio 11).
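Merge sort is the easy one to sketch, since the halves are independent until the final merges (C++ here purely for illustration, since the poster's library is in D; the cutoff value is arbitrary):

#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Sort the two halves in parallel, then merge; only the merges near the root are serial.
void parallelMergeSort(std::vector<int>& v, std::size_t lo, std::size_t hi) {
    const std::size_t cutoff = 10000;                // below this, threading overhead dominates
    if (hi - lo < cutoff) {
        std::sort(v.begin() + lo, v.begin() + hi);   // serial leaf
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async,
                           parallelMergeSort, std::ref(v), lo, mid);
    parallelMergeSort(v, mid, hi);                   // right half on the current thread
    left.get();
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}

Like the quicksort in the question, this still caps out well short of linear: the final inplace_merge over the whole array is serial and is itself bandwidth-limited.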
I'm no expert but... here is what I'd look at:
First of all, I've heard that, as a rule of thumb, algorithms that look at small bits of a problem from the start tend to work better as parallel algorithms.
Looking at your implementation, try making the parallel/serial switch go the other way: partition the array and sort in parallel until you have N segments, then go serial. If you are more or less grabbing a new thread for each parallel case, then N should be ~ your core count. OTOH if your thread pool is of fixed size and acts as a queue of short lived delegates, then I'd use N ~ 2+ times your core count (so that cores don't sit idle because one partition finished faster).
Other tweaks:
skip the myTask.wait(); at the local level and rather have a wrapper function that waits on all the tasks.
Make a separate serial implementation of the function that avoids the depth check.
"My first question is, are parallelized versions of sorting algorithms useful in the real world" - depends on the size of the data set that you are working on in the real work. For small sets of data the answer is no. For larger data sets it depends not only on the size of the data set but also the specific architecture of the system.
One of the limiting factors that will prevent the expected increase in performance is the cache layout of the system. If the data can fit in the L1 cache of a core, then there is little to gain by sorting across multiple cores as you incur the penalty of the L1 cache miss between each iteration of the sorting algorithm.
The same reasoning applies to chips that have multiple L2 caches and NUMA (non-uniform memory access) architectures. So the more cores you want to distribute the sorting across, the more the smallestToParallelize constant will need to be increased.
Another limiting factor, which you identified, is shared memory access, or contention over the memory bus. Since the memory bus can only satisfy a certain number of memory accesses per second, having additional cores that do essentially nothing but read and write to main memory will put a lot of stress on the memory system.
The last factor that I should point out is the thread pool itself as it may not be as efficient as you think. Because you have threads that steal and generate work from a shared queue, that queue requires synchronization methods; and depending on how those are implemented, they can cause very long serial sections in your code.
I don't know if answers here are applicable any longer or if my suggestions are applicable to D.
Anyway ...
Assuming that D allows it, there is always the possibility of providing prefetch hints to the caches. The core in question requests that data it will soon (not immediately) need be loaded into a certain cache level. In the ideal case the data will have been fetched by the time the core starts working on it. More likely, the prefetch will only be partly under way by then, which at least results in fewer wait states than if the data were fetched "cold."
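As a sketch of what that looks like (in C++ with the GCC/Clang builtin; D has its own intrinsics for this that I won't guess at):

#include <cstddef>

// Ask for data a few iterations ahead so it is (hopefully) in cache when the loop
// reaches it. __builtin_prefetch is only a hint and may be ignored. Note that hardware
// prefetchers already handle a simple sequential walk like this one; software prefetch
// helps most with irregular access patterns.
void scale(double* data, std::size_t n, double factor) {
    const std::size_t ahead = 16;                 // how far ahead to prefetch; tune it
    for (std::size_t i = 0; i < n; ++i) {
        if (i + ahead < n)
            __builtin_prefetch(&data[i + ahead]);
        data[i] *= factor;
    }
}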
You'll still be constrained by the overall cache-to-RAM throughput capacity, so you'll need to organize the data such that enough of it sits in a core's exclusive caches for the core to spend a fair amount of time there before having to write updated data back.
The code and data need to be organized according to the concept of cache lines (fetch units of 64 bytes each), which is the smallest-sized unit in a cache. This means that for two cores, the work needs to be organized such that the memory system does half as much work per core (assuming 100% scalability) as before, when only one core was working and the work hadn't been organized. For four cores, a quarter as much, and so on. It's quite a challenge, but by no means impossible; it just depends on how imaginative you are in restructuring the work. As always, there are solutions that cannot be conceived ... until someone does just that!
I don't know how WYSIWYG D is compared to C - which I use - but in general I think the process of developing scalable applications is helped by how much the developer can influence the compiler's actual machine code generation. For interpreted languages there will be so much memory work going on by the interpreter that you risk not being able to discern improvements from the general "background noise."
I once wrote a multi-threaded shellsort which ran 70% faster on two cores compared to one and 100% on three cores compared to one. Four cores ran slower than three. So I know the dilemmas you face.
I would like to point you to External Sorting[1] which faces similar problems. Usually, this class of algorithms is used mostly to cope with large volumes of data, but their main point is that they split up large chunks into smaller and unrelated problems, which are therefore really great to run in parallel. You "only" need to stitch together the partial results afterwards, which is not quite as parallel (but relatively cheap compared to the actual sorting).
An external merge sort would also work really well with an unknown number of threads. You just split the workload arbitrarily, and give each chunk of n elements to a thread whenever one is idle, until all your work units are done, at which point you can start joining them up.
[1] http://en.wikipedia.org/wiki/External_sorting
