How to divide load between different processors - multithreading

I am running some parallel code on a machine which has 4 intel processors and 8 cores on each .I am using TBB.Suppose a given loop(that I parallelize ) has X iterations how should I choose my grainsize to ensure the load is evenly divided?

Assume you have N equally powerful CPUs.
If there are no loop carried dependencies (e.g, nothing in iteration i is used by following iterations), then you can simply run loop iterations 0..X/N on CPU 1, and iterations (X/N)+1..(2*X/N) on CPU 2, etc, assuming that each iteration takes exactly the same amount of time, or at least an average amount of that doesn't vary wildly.
If there are loop carried
dependencies, you may have a problem if iteration i depends on all previous iterations. If it only dependes on the the previous k iterations, you can have CPU1 do iterations 0..X/N, and CPU2 do iterations X/N-k..(2*X/N), wasting some work but allowing CPU2 to collect the results it needs, etc. for all processors.
If iterations take wildly varying amounts of time, you're better off setting up a worklist containing the iterations,
and have the CPUs grab iterations from the workslist as they complete previous iterations. This way the work is divided up as demand appears. You have to be sure that the time per unit of work grabbed is lots larger than the effort to get the work, or you'll get no parallel advantage; one way to do this is to grab a small range of iterations from the worklist, such that the total work in the range exceeds the scheduling overhead significantly.

With TBB, you don't have to select a grain size for parallel_for. In most cases, TBB will dynamically load balance the work pretty well by default. The answer of Ira Baxter correctly describes how you should partition the work across a pool of threads; but TBB already has similar mechanisms in place that do this for you.
ADDED: Surely manual work partitioning might get better results in complex cases. Though in this case one would likely need to use TBB tasks, as parallel_for might not provide enough control; for example, in general it is not possible to specify the exact size of a per-thread chunk.

Related

Multiprocessing: why doesn't a single thread just use more cpu?

I'm learning about multiprocessing and it seems to be applicable in one of two scenarios:
our program is waitng for some I/O, so it makes sense to go do something else while waiting;
we break our program up so that individual parts of it can run "in parellel", in an attempt to take full advantage of the cpu
My confusion is about the second case. I'm probably just lacking in my understanding of how cpus really work: but if our single thread process is only using 1% of the cpu and it therefore makes sense to get more threads going, then why wouldn't we just (somehow?) speed up that single process so that it uses more cpu and finishes faster?
but if our single thread process is only using 1% of the cpu and it therefore makes sense to get more threads going, then why wouldn't we just (somehow?) speed up that single process so that it uses more cpu and finishes faster?
We don't know how to. There seem to be fundamental limitations to how fast we can do things that we haven't quite figured out how to get around. So instead, we do more than one thing at a time.
It takes a woman 9 months to make a baby. So if you want lots of babies, you get lots of women. You don't try to get one woman to go faster.
Say you want to raise 7 to the twenty-millionth power and also raise 11 to the twenty-millionth power. Each of these two operations can be reduced in the number of steps, but you will reach a limit. Say each operation takes N sequential steps (each requiring the output from the previous step as its input) and the fastest we can do a single step is Q nanoseconds. With one thread, it will take at least 2NQ nanoseconds to perform all the operations. With two threads, can do one step from each of the two operations at the same time, reducing the time minimum to N*Q nanoseconds.
That's a big win.
I might be wrong, but when we split things into threads, we want to make use of multi-core architecture of our CPUs.
We mostly think CPUs being a single unit, but you must've heard about how i5 is a quad-core processor, meaning it has 4 cores-- or 4 cores make a CPU, i3 is a dual core processor-- i.e, it only has two cores.
So the aggregate CPU utilization for quad-core would be 100% split into 4x25. There's a difference b/w concurrency and parallelism. Parallel means each thread runs on a separate core, making full use of it. Now you have 4 people doing one job-- or a better analogy would be there are 4 printers in the office, and 4 people can go ahead and get the copies that they want. This is parallelism.
Using that same analogy let's extend it to just one copier/printer and 4 people want to make copies, what we do is make use concurrency, we print each requested copy but only 25% of it, then we switch to the next person, then the next and then the next, this will take 4 iterations for all the copies to get printed. Even though we utilized 100% of the copier's capability, still our guys had to wait-- this waiting time also depends on what was the length of the document they wanted to print-- so we use something like pre-emption, you can only execute/print for a certain amount of time, before we start printing for the next guy.
Speeding up a single process-- allocating it 100% of the CPU is not a problem [although we want to run bunch of other stuff like GUI, play music, system services etc, but 85% is doable], the execution time becomes 1/4th when it's distributed b/w the CPUs. Imagine you have to print a book, with 4 copiers, book is 400pages long-- you use 4 copiers to print 100pages each. Will be faster right?
I hope I made some sense, Going to sleep.

OpenMP: 20% time in "implicit barrier"?

I'm havening look at some code that uses OpenMP, though I'm not too familiar with it. (The code nor OpenMP.)
When running a profiler against it, I see that the program is supposedly spending about 20% of wall-clock time in an "OMP implicit barrier" function.
Is that typical of OpenMP, or does that (maybe) imply that work load is not distributed evenly among threads?
Thanks
There are implicit barriers at the end of most OpenMP constructs like for (in C/C++) or do (in Fortran), sections and single (however, there's no barrier at the end of the master construct). The nowait clause can be used to disable these implicit barriers if the algorithm allows for the different threads to run desynchronised after the worksharing directive. Another implicit barrier is located at the end of each parallel region as part of the fork/join execution model.
You have correctly guessed that high percentage of implicit barrier wait time usually means that the worksharing is far from optimal. It could be that there are (lots of) large single constructs or it could be that there are parallel loops (for/do constructs) with varying execution time for each iteration. If the imbalance comes from loops with varying computational time in each iteration (canonical example is drawing the Mandelbrot set), then the loop schedule can be changed to dynamic using the schedule(dynamic,chunk) clause, where chunk is the chunk size (>= 1). The smaller the chunk size, the better is the load balanced but there would be higher overhead from the dynamical loop dispatcher. The bigger the chunk size the lower the overhead but more load imbalance would appear. The optimal value often depends on the kind of problem and on the hardware so one has to tweak the value in order to obtain the best performance on the particular system where the code executes.

HLSL operator/functions cycle count

I'm modeling some algorithms to be run on GPU's. Is there a reference or something as to how many cycles the various intrinsics and calculations take on modern hardware? (nvidia 5xx+ series, amd 6xxx+ series) I cant seem to find any official word on this even though there are some mentions of the raised costs of normalization, square root and other functions throughout their documentation.. thanks.
Unfortunately, the cycle count documentation you're looking for either doesn't exist, or (if it does) it probably won't be as useful as you would expect. You're correct that some of the more complex GPU instructions take more time to execute than the simpler ones, but cycle counts are only important when instruction execution time is main performance bottleneck; GPUs are designed such that this is very rarely the case.
The way GPU shader programs achieve such high performance is by running many (potentially thousands) of shader threads in parallel. Each shader thread generally executes no more than a single instruction before being swapped out for a different thread. In perfect conditions, there are enough threads in flight that some of them are always ready to execute their next instruction, so the GPU never has to stall; this hides the latency of any operation executed by a single thread. If the GPU is doing useful work every cycle, then it's as if every shader instruction executes in a single cycle. In this case, the only way to make your program faster is to make it shorter (fewer instructions = fewer cycles of work overall).
Under more realistic conditions, when there isn't enough work to keep the GPU fully loaded, the bottleneck is virtually guaranteed to be memory accesses rather than ALU operations. A single texture fetch can take thousands of cycles to return in the worst case; with unpredictable stalls like that, it's generally not worth worrying about whether sqrt() takes more cycles than dot().
So, the key to maximizing GPU performance isn't to use faster instructions. It's about maximizing occupancy -- that is, making sure there's enough work to keep the GPU sufficiently busy to hide instruction / memory latencies. It's about being smart about your memory accesses, to minimize those agonizing round-trips to DRAM. And sometimes, when you're really lucky, it's about using fewer instructions.
http://books.google.ee/books?id=5FAWBK9g-wAC&lpg=PA274&ots=UWQi5qznrv&dq=instruction%20slot%20cost%20hlsl&pg=PA210#v=onepage&q=table%20a-8&f=false
this is the closest thing i've found so far, it is outdated(sm3) but i guess better than nothing.
does operator/functions have cycle? I know assembly instructions have cycle, that's the low level time measurement, and mostly depends on CPU.since operator and functions are all high level programming stuffs. so I don't think they have such measurement.

Algorithm to optimize # threads used in a calculation

I'm performing an operation, lets call it CalculateSomeData. CalculateSomeData operates in successive "generations", numbered 1..x. The number of generations in the entire run is fixed by the input parameters to CalculateSomeData and is known a priori. A single generation takes anywhere from 30 minutes to 2 hours to complete. Some of that variability is due to the input parameters and that cannot be controlled. However, a portion of that variability is due to things like hardware capacities, CPU load from other processes, network bandwidth load, etc. One parameter that can be controlled per-generation is the number of threads that CalculateSomeData uses. Right now that's fixed and likely non-optimal. I'd like to track the time each generation takes and then have some algorithm by which I tweak the number of threads so that each successive generation improves upon the prior generation's calculation time (minimizing time). What approach should I use? How applicable are genetic algorithms? Intuition tells me that the range is going to be fairly tight - maybe 1 to 16 threads on a dual quad-core processor machine.
any pointers, pseudocode, etc. are much appreciated.
How about an evolutionary algorithm.
Start with a guess. 1 thread per CPU core seems good, but depends on the task at hand.
Measure the average time for each task in the generation. Compare it to the time taken by the previous generation. (Assume effectively infinite time and 0 threads for generation 0).
If the most recent generation tasks averaged a better time than the one before, continue to change the number of threads in the same direction as you did last step (so if the last generation had more threads than the previous thread, then add a thread for the new generation, but if it had fewer, then use one fewer (obviously with a lower limit of 1 thread).
If the most recent generation tasks took longer, on average, than the previous generation, then change the number of threads in the opposite direction (so if increasing the number of threads resulted in worse time, use one fewer thread next time).
As long as the optimal number of threads isn't too close to 1, then you'll probably end up oscillating between 3 values that are all reasonably close to optimal. You may want to explicitly detect this case and lock yourself into the central value, if you have a large number of generations to deal with.
If the calculations are completely CPU bound the number of threads should be equal to the number of cores on the machine. That way you minimize the number of context switches.
If your calculations involve I/O, network, synchronization or something else that blocks execution you must find the limiting resource and measure the utilization. You need to monitor the utilization and slowly add more threads until the utilization gets close to 100%. You should have as few threads as possible to saturate your limiting resource.
You should divide up your generations into lots of small tasks and put them in a queue. Spawn one thread per core and have each thread grab a task to do, run it to completion, and repeat.
You want lots more tasks than cores to make sure that you don't end up with just one task running at the end of the generation and all other threads idle. This is what is likely to happen if you set #tasks = #threads = #cores as Albin suggests (unless you can ensure that all tasks take precisely the same amount of time).
You also probably don't want more threads than cores. Context switching isn't terribly expensive, but the larger cache footprint that comes with having more than #cores tasks simultaneously active could hurt you (unless your tasks use very little memory).

(When) are parallel sorts practical and how do you write an efficient one?

I'm working on a parallelization library for the D programming language. Now that I'm pretty happy with the basic primitives (parallel foreach, map, reduce and tasks/futures), I'm starting to think about some higher level parallel algorithms. Among the more obvious candidates for parallelization is sorting.
My first question is, are parallelized versions of sorting algorithms useful in the real world, or are they mostly academic? If they are useful, where are they useful? I personally would seldom use them in my work, simply because I usually peg all of my cores at 100% using a much coarser grained level of parallelism than a single sort() call.
Secondly, it seems like quick sort is almost embarrassingly parallel for large arrays, yet I can't get the near-linear speedups I believe I should be getting. For a quick sort, the only inherently serial part is the first partition. I tried parallelizing a quick sort by, after each partition, sorting the two subarrays in parallel. In simplified pseudocode:
// I tweaked this number a bunch. Anything smaller than this and the
// overhead is smaller than the parallelization gains.
const smallestToParallelize = 500;
void quickSort(T)(T[] array) {
if(array.length < someConstant) {
insertionSort(array);
return;
}
size_t pivotPosition = partition(array);
if(array.length >= smallestToParallelize) {
// Sort left subarray in a task pool thread.
auto myTask = taskPool.execute(quickSort(array[0..pivotPosition]));
quickSort(array[pivotPosition + 1..$]);
myTask.workWait();
} else {
// Regular serial quick sort.
quickSort(array[0..pivotPosition]);
quickSort(array[pivotPosition + 1..$]);
}
}
Even for very large arrays, where the time the first partition takes is negligible, I can only get about a 30% speedup on a dual core, compared to a purely serial version of the algorithm. I'm guessing the bottleneck is shared memory access. Any insight on how to eliminate this bottleneck or what else the bottleneck might be?
Edit: My task pool has a fixed number of threads, equal to the number of cores in the system minus 1 (since the main thread also does work). Also, the type of wait I'm using is a work wait, i.e. if the task is started but not finished, the thread calling workWait() steals other jobs off the pool and does them until the one it's waiting on is done. If the task isn't started, it is completed in the current thread. This means that the waiting isn't inefficient. As long as there is work to be done, all threads will be kept busy.
Keep in mind I'm not an expert on parallel sort, and folks make research careers out of parallel sort but...
1) are they useful in the real world.
of course they are, if you need to sort something expensive (like strings or worse) and you aren't pegging all the cores.
think UI code where you need to sort a large dynamic list of strings based on context
think something like a barnes-hut n-bodies sim where you need to sort the particles
2) Quicksort seems like it would give a linear speedup, but it isn't. The partition step is a sequential bottleneck, you will see this if you profile and it will tend to cap out at 2-3x on a quad core.
If you want to get good speedups on a smaller system you need to ensure that your per task overheads are really small and ideally you will want to ensure that you don't have too many threads running, i.e. not much more than 2 on a dual core. A thread pool probably isn't the right abstraction.
If you want to get good speedups on a larger system you'll need to look at the scan based parallel sorts, there are papers on this. bitonic sort is also quite easy parallelize as is merge sort. A parallel radix sort can also be useful, there is one in the PPL (if you aren't averse to Visual Studio 11).
I'm no expert but... here is what I'd look at:
First of all, I've heard that as a rule of thumb, algorithms that look at small bits of a problem from the start tends to work better as parallel algorithms.
Looking at your implementation, try making the parallel/serial switch go the other way: partition the array and sort in parallel until you have N segments, then go serial. If you are more or less grabbing a new thread for each parallel case, then N should be ~ your core count. OTOH if your thread pool is of fixed size and acts as a queue of short lived delegates, then I'd use N ~ 2+ times your core count (so that cores don't sit idle because one partition finished faster).
Other tweaks:
skip the myTask.wait(); at the local level and rather have a wrapper function that waits on all the tasks.
Make a separate serial implementation of the function that avoids the depth check.
"My first question is, are parallelized versions of sorting algorithms useful in the real world" - depends on the size of the data set that you are working on in the real work. For small sets of data the answer is no. For larger data sets it depends not only on the size of the data set but also the specific architecture of the system.
One of the limiting factors that will prevent the expected increase in performance is the cache layout of the system. If the data can fit in the L1 cache of a core, then there is little to gain by sorting across multiple cores as you incur the penalty of the L1 cache miss between each iteration of the sorting algorithm.
The same reasoning applies to chips that have multiple L2 caches and NUMA (non-uniform memory access) architectures. So the more cores that you want to distribute the sorting across, the smallestToParallelize constant will need to be increased accordingly.
Another limiting factor which you identified is shared memory access, or contention over the memory bus. Since the memory bus can only satisfy a certain number of memory accesses per second; having additional cores that do essentially nothing but read and write to main memory will put a lot of stress on the memory system.
The last factor that I should point out is the thread pool itself as it may not be as efficient as you think. Because you have threads that steal and generate work from a shared queue, that queue requires synchronization methods; and depending on how those are implemented, they can cause very long serial sections in your code.
I don't know if answers here are applicable any longer or if my suggestions are applicable to D.
Anyway ...
Assuming that D allows it, there is always the possibility of providing prefetch hints to the caches. The core in question requests that data it will soon (not immediately) need be loaded into a certain cache level. In the ideal case the data will have been fetched by the time the core starts working on it. More likely the prefetch process will be more or less on the way which at least will result in less wait states than if the data were fetched "cold."
You'll still be constrained by the overall cache-to-RAM throughput capacity so you'll need to have organized the data such that so much data is in the core's exclusive caches that it can spend a fair amount of time there before having to write updated data.
The code and data need to be organized according to the concept of cache lines (fetch units of 64 bytes each) which is the smallest-sized unit in a cache. This should result in that for two cores the work needs to be organized such that the memory system works half as much per core (assuming 100% scalability) as before when only one core was working and the work hadn't been organized. For four cores a quarter as much and so on. It's quite a challenge but by no means impossible, it just depends on how imaginative you are in restructuring the work. As always, there are solutions that cannot be conceived ... until someone does just that!
I don't know how WYSIWYG D is compared to C - which I use - but in general I think the process of developing scaleable applications is ameliorated by how much the developer can influence the compiler in its actual machine code generation. For interpreted languages there will be so much memory work going on by the interpreter that you risk not being able to discern improvements from the general "background noise."
I once wrote a multi-threaded shellsort which ran 70% faster on two cores compared to one and 100% on three cores compared to one. Four cores ran slower than three. So I know the dilemmas you face.
I would like to point you to External Sorting[1] which faces similar problems. Usually, this class of algorithms is used mostly to cope with large volumes of data, but their main point is that they split up large chunks into smaller and unrelated problems, which are therefore really great to run in parallel. You "only" need to stitch together the partial results afterwards, which is not quite as parallel (but relatively cheap compared to the actual sorting).
An External Merge Sort would also work really well with an unknown amount of threads. You just split the work-load arbitrarily, and give each chunk of n elements to a thread whenever there is one idle, until all your work units are done, at which point you can start joining them up.
[1] http://en.wikipedia.org/wiki/External_sorting

Resources