PRAM models for parallel computing come in three main flavours: EREW, CREW, and CRCW.
I can understand how EREW and CREW can be implemented on a multicore machine. But how would one go about implementing the CRCW model on a multicore CPU? Is it even a practical model, given that concurrent writes are not possible and every basic parallel programming course goes into great detail about race conditions?
Essentially, this means that avoiding race conditions and implementing concurrent writes are two opposing goals.
First up: we know that the PRAM is a theoretical, or abstract, machine, with several simplifications made so that it can be used for analyzing and designing parallel algorithms.
Next, let's talk about the ways in which one may do 'concurrent writes' meaningfully.
Concurrent write memories are usually divided into subclasses, based on how they behave:
Priority CW - Processors have priorities; if multiple concurrent writes to the same location arrive, the write from the processor with the highest priority is committed to memory.
Arbitrary CW - One processor's write is arbitrarily chosen for commit.
Common CW - Multiple concurrent writes to the same location are committed only if the values being written are the same, i.e. all writing processors must agree on the value being written.
Reduction CW - A reduction operator is applied to the multiple values being written, e.g. a summation, where multiple concurrent writes to the same location cause the sum of the written values to be committed to memory.
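To make these four policies concrete, here is a minimal, sequential sketch of a single CRCW memory cell in Java. Everything here (class and method names included) is illustrative, not from any real library:

import java.util.Comparator;
import java.util.List;

enum Policy { PRIORITY, ARBITRARY, COMMON, REDUCTION }

// One CRCW memory cell; commit() resolves one timestep's concurrent writes.
class CrcwCell {
    private final Policy policy;
    private Integer value; // committed value, null until the first commit

    CrcwCell(Policy policy) { this.policy = policy; }

    // Each write is a pair {processorId, valueWritten};
    // a lower processorId means a higher priority.
    void commit(List<int[]> writes) {
        if (writes.isEmpty()) return;
        switch (policy) {
            case PRIORITY:   // highest-priority processor wins
                value = writes.stream()
                              .min(Comparator.comparingInt(w -> w[0]))
                              .get()[1];
                break;
            case ARBITRARY:  // any one write may be chosen; here, the first
                value = writes.get(0)[1];
                break;
            case COMMON:     // commit only if all processors agree
                int v = writes.get(0)[1];
                if (writes.stream().allMatch(w -> w[1] == v)) value = v;
                break;
            case REDUCTION:  // apply a reduction operator (summation here)
                value = writes.stream().mapToInt(w -> w[1]).sum();
                break;
        }
    }

    Integer read() { return value; }
}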
These subclasses lead to some interesting algorithms. Some of the examples I remember from class are:
A CRCW-PRAM where concurrent writes are resolved by summation can sum an arbitrarily large number of integers in a single timestep: there is one processor per integer in the input array, and all processors write their value to the same location. Done.
Imagine a CRCW-PRAM where the memory commits concurrent writes only if the value written by all processors is the same. Now imagine N numbers A[1] ... A[N] whose maximum you need to find. Here's how you'd do it:
Step 1:
N² processors compare each value to each other value and write the result to a 2D array:
parallel_for i in [1,N]
    parallel_for j in [1,N]
        if (A[i] >= A[j])
            B[i,j] = 1
        else
            B[i,j] = 0
So in this 2D array, the row B[i,*] corresponding to the biggest number will be all 1's.
Step 2:
Find the row which has only 1's, and store the corresponding value as the max.
parallel_for i in [1,N]
    M[i] = 1
    parallel_for j in [1,N]
        if (B[i,j] == 0)
            M[i] = 0    // multiple concurrent writes of the *same* value
    if (M[i])
        max = A[i]
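For reference, here's the same algorithm as a sequential Java simulation, in which each loop iteration stands in for one PRAM processor (the method name is mine, purely illustrative):

// Sequential simulation of the Common-CRCW max algorithm above.
static int crcwMax(int[] a) {
    int n = a.length;
    boolean[][] b = new boolean[n][n];
    // Step 1: N^2 "processors" compare every pair (concurrent reads are safe).
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            b[i][j] = a[i] >= a[j];
    // Step 2: for each i, every 0 in row i writes the *same* value (false)
    // to m, which is exactly what the Common-CW rule permits.
    int max = Integer.MIN_VALUE;
    for (int i = 0; i < n; i++) {
        boolean m = true;
        for (int j = 0; j < n; j++)
            if (!b[i][j]) m = false;
        if (m) max = a[i];
    }
    return max;
}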
Finally, is it possible to implement this for real?
Yes, it is possible. Designing, say, a register file, or a memory with associated logic, which has multiple write ports and arbitrates concurrent writes to the same address in a meaningful way (like the ways described above) is possible; you can probably already see that from the subclasses I mentioned. Whether or not it is practical, I cannot say. I can say that in my limited experience with computers (which involves mostly general-purpose hardware, like the Core Duo machine I'm currently sitting in front of), I haven't seen one in practice.
EDIT: I did find a CRCW implementation. The Wikipedia article on PRAM describes a CRCW machine which can find the max of an array in 2 clock cycles (using the same algorithm as the one above). The description is in SystemVerilog and can be implemented on an FPGA.
My problem is a fluid-flow simulation, but I will try to make the question as generic as possible. I have gone through the OpenMP API manual and OpenMP for F95, but as I am only five days into multithreading, I seek your help after being baffled by the smorgasbord of options for optimising the code. I am using an Intel Xeon CPU E5-2630 v4 @ 2.20GHz with one socket and 10 cores in that socket (20 logical CPUs with hyperthreading).
My whole simulation is basically filled with two kinds of nested loops as in (i) and (ii) below.
i) Where an array element (C(I,J,K) and D(I,J,K) below) depends on the previous K-1 grid point, so I can't parallelise the outermost loop, e.g.:
! NX = 256, NY = 209, NZ = 64
DO K = 2, NY-1
  !$OMP PARALLEL DO
  DO J = 1, NZ
    DO I = 1, NX/2+1
      C(I,J,K) = C(I,J,K)/(A(I,J,K)*C(I,J,K-1))
      D(I,J,K) = (D(I,J,K)-D(I,J,K-1))/(C(I,J,K-1))
    END DO
  END DO
  !$OMP END PARALLEL DO
END DO
A(:,:,1:NY) is already calculated in a different subroutine and is hence available as a shared variable to the OpenMP threads.
ii) Where the update variable (A) does not depend on other grid points, so I can parallelise all the loops, like the following:
!$OMP PARALLEL DO
DO K = 1, NY
  DO J = 1, NZ
    DO I = 1, NX
      A(I,J,K) = (B(I,J,K)-B(I,J,K-1))/C(K-1)
    END DO
  END DO
END DO
!$OMP END PARALLEL DO
B(:,:,1:NY) and C(:,:,1:NY) are already calculated in a different subroutine.
Question (a): Do the above nested-loops have a race condition?
Question (b): The output is correct and matches the serial code, but:
b(i): are there any loopholes in the code that could make it work incorrectly in certain situations?
b(ii): can the output be correct with a race condition?
Question (c): Are there any ways to optimise this code further? There are many options in the above-mentioned manuals, but some help pointing me in the right direction would be highly appreciated.
I run the code with:
$ ulimit -s unlimited
$ export OMP_NUM_THREADS=16
$ gfortran -O3 mycode.f90 -fopenmp -o mycode
With 16 threads it takes about 80 time units, while with 6, 10 and 20 threads it takes 105, 101 and 100 time units respectively.
Question (d): I know there could be many reasons for the above, but is there a rule of thumb for choosing the right number of threads (other than the trial and error somewhat implied in answers to this question)?
Question (e): Is ulimit -s unlimited a good option? (Without it I get a "segmentation fault (core dumped)" error.)
Thanks.
(a) You have a race condition only if multiple threads access the same location without synchronization and at least one of those accesses is a write.
The second code snippet does not have a race condition because you only write to each location of A exactly once and never read from it.
Similarly, the first code snippet does not have a race condition as you only read/write from/to each location in the K slices of C and D once and then don't read it again within the same parallel section (because K is fixed within each parallel region). Reading from the K-1 slice is of course not a race.
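To illustrate the rule itself (which is language-agnostic, so it is sketched here in Java rather than Fortran; all names are mine):

class RaceDemo {
    static int shared = 0;
    static int[] disjoint = new int[2];

    public static void main(String[] args) throws InterruptedException {
        // RACE: both threads read-modify-write the same location.
        Thread a = new Thread(() -> { for (int i = 0; i < 100_000; i++) shared++; });
        Thread b = new Thread(() -> { for (int i = 0; i < 100_000; i++) shared++; });

        // NO RACE: each thread writes its own element exactly once,
        // like the A(I,J,K) updates above where every (I,J,K) is written once.
        Thread c = new Thread(() -> disjoint[0] = 1);
        Thread d = new Thread(() -> disjoint[1] = 2);

        a.start(); b.start(); c.start(); d.start();
        a.join(); b.join(); c.join(); d.join();
        System.out.println(shared); // very likely less than 200000
    }
}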
(b)
(bi) Have you looked at the numerics? There seems to be a lot of room for catastrophic cancellation. But that's not threading-related.
(bii) Theoretically, yes. Really depends on how egregious it is. There is no race condition here though.
(c) Profile! Are you memory-bound or CPU-bound (presumably the former)? Do you get a lot of cache misses? Is there false sharing (I doubt it)? Can you rearrange your data and/or loops to improve cache behavior? Are your strided accesses badly aligned? There are many of these kinds of performance gotchas, and it takes time and experience to understand and recognize them. My advice would be to become particularly familiar with the impact of caching; that's at the heart of most performance questions.
(d) If you care about performance, you have to profile and compare. Modern CPUs are so fiendishly complex that you have little chance to predict performance of any but the most trivial snippets. As mentioned in the answers you linked, if you're memory bound then more threads tend to make things worse, but your own results show that performance still improves by having slightly more than one thread per physical core (presumably because the divisions are a bit slow?).
(e) That sounds like you are allocating large arrays on the stack. That itself is probably a bad idea precisely because you tend to run out of stack space. But I'm not familiar with Fortran best practices so I can't help you much there.
I want to see the intrinsic difference between a thread and a long-running go block in Clojure. In particular, I want to figure out which one I should use in my context.
I understand that if one creates a go block, it is scheduled to run on a thread pool whose default size is 8, whereas thread creates a new thread.
In my case, there is an input stream that takes values from somewhere, and each value is taken as an input. Some calculations are performed and the result is put onto a result channel. In short, we have an input and an output channel, and the calculation is done in a loop. To achieve concurrency, I have two choices: either use a go block or use thread.
I wonder what the intrinsic difference between these two is. (We may assume there is no I/O during the calculations.) The sample code looks like the following:
(go-loop []
  (when-let [input (<! input-stream)]
    ... ; calculations here
    (>! result-chan result)
    (recur)))
(thread
  (loop []
    (when-let [input (<!! input-stream)]
      ... ; calculations here
      (put! result-chan result)
      (recur))))
I realize that the number of threads that can run simultaneously is exactly the number of CPU cores. In that case, do go blocks and threads show no difference once I create more than 8 of either?
I might want to simulate the difference in performance on my own laptop, but the production environment is quite different from the simulated one, so I could draw no conclusions that way.
By the way, the calculation is not heavy; if the inputs are not too large, 8,000 loop iterations can run in 1 second.
Another consideration is whether go-block vs thread will have an impact on GC performance.
There are a few things to note here.
Firstly, the pool that clojure.core.async/thread creates threads on is what is known as a cached thread pool: although it will re-use recently used threads, it is essentially unbounded, which of course means it could potentially hog a lot of system resources if left unchecked.
But given that what you're doing inside each asynchronous process is very lightweight, threads seem a little overkill to me. It's also important to take into account the quantity of items you expect to hit the input stream: if this number is large, you could potentially overwhelm core.async's thread pool for go macros, to the point where go blocks are waiting for a thread to become available.
You also didn't mention precisely where you're getting the input values from. Are the inputs some fixed data set that remains constant from the start of the program, or are they continuously fed into the input stream from some source over time?
If it's the former, then I would suggest leaning towards transducers, and I would argue that a CSP model isn't a good fit for your problem, since you aren't modelling communication between separate components of your program; rather, you're just processing data in parallel.
If it's the latter, then I presume you have some other process listening to the result channel and doing something important with those results, in which case your usage of go blocks is perfectly acceptable.
I have a ConcurrentLinkedQueue and I want to split it into two halves and let two separate threads handle each. I have tried using Spliterator but I do not understand how to get the partitioned queues.
ConcurrentLinkedQueue<int[]> q = ... // contains a large number of elements
Spliterator<int[]> p1 = q.spliterator();
Spliterator<int[]> p2 = p1.trySplit();
p1.getQueue(); // <- no such method
p2.getQueue(); // <- no such method
I want to do something like p1.getQueue(), but no such method exists. Please let me know the correct way to do this.
You can't split it in half in general; to split it in half, the queue must have a well-defined size at each point in time. And while CLQ does have a size() method, its documentation is pretty clear that it requires O(n) traversal time, and because this is a concurrent queue, its size might not be accurate at all (it is named concurrent for a reason, after all). The current Spliterator from CLQ splits it in batches, from what I can see.
If you want to split it in half logically and process the elements, then I would suggest moving to some blocking implementation that has a drainTo method; this way you could drain the elements to an ArrayList, for example, which will split much better (half, then half again, and so on).
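A minimal sketch of that approach, assuming a LinkedBlockingQueue as the source and a hypothetical process method standing in for your per-element work:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

class SplitDemo {
    // Hypothetical per-element work; stands in for whatever you do with each int[].
    static void process(int[] e) { /* ... */ }

    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<int[]> q = new LinkedBlockingQueue<>();
        // ... fill q ...

        // Drain a snapshot of the queue into a list: O(n), and it removes
        // the drained elements from q.
        List<int[]> snapshot = new ArrayList<>();
        q.drainTo(snapshot);

        // Now an exact split in half is trivial.
        int mid = snapshot.size() / 2;
        List<int[]> first = snapshot.subList(0, mid);
        List<int[]> second = snapshot.subList(mid, snapshot.size());

        Thread t1 = new Thread(() -> first.forEach(SplitDemo::process));
        Thread t2 = new Thread(() -> second.forEach(SplitDemo::process));
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}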
On a side note, why would you want to do the processing in different threads yourself? This seems very counter-intuitive; the Spliterator is designed to work with parallel streams. Calling trySplit once is probably not even enough, since you would have to call it until it returns null. Either way, doing these things on your own sounds like a very bad idea to me.
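For completeness, the intended way to consume the Spliterator is to let the Streams framework drive the splitting itself (process being the same hypothetical per-element work as above):

q.parallelStream().forEach(e -> process(e)); // CLQ is a Collection, so this works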
I'm preparing a college exam in parallel computing.
The main purpose is to speed up as much as possible a Monte Carlo simulation of electron drift in the Earth's magnetic field.
I've already developed something with two layers of parallelization:
MPI, used to make the code run on several machines
OpenMP, used to run parallel simulations inside a single computer
Now comes the question: I would like to keep the task execution on-demand.
The fastest computer must be able to execute more work than the slower ones.
The problem partitioning is done via a master-worker cycle, so there is no actual struggle in achieving this result.
Since the number of tasks (a block of n electrons to simulate) executed by a worker is not defined in advance, I have two roads to follow:
every thread in every worker has its own RNG, initialized with a randomly generated seed (using a different generation method). Load imbalance across the cluster will change the results, but in this approach the result is as random as possible.
every electron has its own seed, granting reproducibility of the simulation regardless of which worker runs the single task. This requires a better RNG.
Let's poll about this. What's your suggestion?
Have fun
gf
What to poll about here?
Clearly, only approach #2 is feasible. Each source particle starts with its own stable seed, which makes the result reproducible AND debuggable (for lack of a better word).
The well-known Monte Carlo code MCNP5+ uses this scheme to good effect, and it runs on multi-core machines and under MPI. To implement it you'll need an RNG with a fast skip-ahead (a.k.a. leapfrog or discard) feature, and there are quite a few of them. They are based on fast exponentiation; see the paper by F. Brown, "Random Number Generation with Arbitrary Stride", Trans. Am. Nucl. Soc. (Nov. 1994). Basically, skip-ahead is O(log N) with Brown's approach.
The simplest version, which is about the same as the MCNP5 one, is here: https://github.com/Iwan-Zotow/LCG-PLE63
A more complicated (and slower, but higher-quality) RNG is here: http://www.pcg-random.org/
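To give a flavour of the O(log N) skip-ahead, here is a minimal sketch of a 63-bit LCG with Brown-style discard, in Java. The constants are illustrative placeholders rather than a vetted parameter set; see the LCG-PLE63 repository above for real ones:

class SkipAheadLcg {
    static final long MULT = 2806196910506780709L; // illustrative multiplier
    static final long ADD  = 1L;                   // illustrative increment
    static final long MASK = (1L << 63) - 1;       // modulus 2^63 via masking

    long state;

    SkipAheadLcg(long seed) { state = seed & MASK; }

    long next() {
        state = (MULT * state + ADD) & MASK;
        return state;
    }

    // Advance the stream by n steps in O(log n): binary exponentiation of
    // the affine map x -> a*x + c. (g, c) holds the map for the current
    // power of two; (gTotal, cTotal) accumulates the composed map.
    void skip(long n) {
        long g = MULT, c = ADD;       // one-step transform
        long gTotal = 1, cTotal = 0;  // identity transform
        while (n > 0) {
            if ((n & 1) == 1) {       // fold the current power into the total
                gTotal = (gTotal * g) & MASK;
                cTotal = (cTotal * g + c) & MASK;
            }
            c = ((g + 1) * c) & MASK; // square the transform:
            g = (g * g) & MASK;       // T(T(x)) = g^2*x + (g+1)*c
            n >>= 1;
        }
        state = (gTotal * state + cTotal) & MASK;
    }
}

Seeding the k-th particle is then just new SkipAheadLcg(masterSeed) followed by skip(k * strideLength) (both names hypothetical), so the per-particle streams never overlap as long as each particle consumes fewer than strideLength numbers.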
I'm looking for a design pattern that would fit my application design.
My application processes large amounts of data and produces some graphs.
Data processing (fetching from files, CPU-intensive calculations) and graph operations (drawing, updating) are done in separate threads.
The graph can be scrolled; in this case, new portions of data need to be processed.
Because there can be several series on a graph, multiple threads can be spawned (two threads per series: one for the dataset update and one for the graph update).
I don't want to create multiple progress bars. Instead, I'd like to have a single progress bar that reports global progress. At the moment I can think of MVC and Observer/Observable, but it's all a little blurry :) Maybe somebody could point me in the right direction, thanks.
I once spent the best part of a week trying to make a smooth, non-hiccupy progress bar over a very complex algorithm.
The algorithm had 6 different steps. Each step had timing characteristics that were seriously dependent on A) the underlying data being processed, not just the "amount" of data but also the "type" of data, and B) the degree of parallelism: 2 of the steps scaled extremely well with an increasing number of CPUs, 2 steps ran in 2 threads, and 2 steps were effectively single-threaded.
The mix of data effectively had a much larger impact on the execution time of each step than the number of cores did.
The solution that finally cracked it was really quite simple. I made 6 functions that analyzed the data set and tried to predict the actual run-time of each analysis step. The heuristic in each function looked at both the data under analysis and the number of CPUs. Based on run-time data from my own 4-core machine, each function returned the number of milliseconds its step was expected to take on my machine.
f1(..) + f2(..) + f3(..) + f4(..) + f5(..) + f6(..) = total runtime in milliseconds
Now, given this information, you effectively know what percentage of the total execution time each step is supposed to take. If step 1 is supposed to take 40% of the execution time, you basically need to find out how to emit 40 1% events from that algorithm. Say its for-loop processes 100,000 items; you could do:
int stride = numItems / percentageOfTotalForThisStep; // e.g. 100,000 / 40 = 2,500
for (int i = 0; i < numItems; i++) {
    if (i % stride == 0) emitProgressEvent();
    // .. do the actual processing ..
}
This algorithm gave us a silky smooth progress bar that performed flawlessly. Your implementation technology can have different forms of scaling and features available in the progress bar, but the basic way of thinking about the problem is the same.
And yes, it did not really matter that the heuristic reference numbers were worked out on my machine; the only real problem would be if you wanted to change the numbers when running on a different machine. But you still know the ratios (which are the only really important thing here), so you can see how your local hardware runs differently from the one I had.
Now the average SO reader may wonder why on earth someone would spend a week making a smooth progress bar. The feature was requested by the head salesman, and I believe he used it in sales meetings to get contracts. Money talks ;)
In situations with threads or asynchronous processes/tasks like this, I find it helpful to have an abstract type or object in the main thread that represents (and ideally encapsulates) each process. So, for each worker thread, there will presumably be an object (let's call it Operation) in the main thread to manage that worker, and obviously there will be some kind of list-like data structure to hold these Operations.
Where applicable, each Operation provides start/stop methods for its worker and, in some cases such as yours, numeric properties representing the progress and expected total time or work of that particular Operation's task. The units don't necessarily need to be time-based; if you know you'll be performing 6,230 calculations, you can just think of these properties as calculation counts. Furthermore, each task will need some way of updating its owning Operation with its current progress, through whatever mechanism is appropriate (callbacks, closures, event dispatching, or whatever your programming language/threading framework provides).
So while your actual work is being performed off in separate threads, a corresponding Operation object in the "main" thread is continually being updated/notified of its worker's progress. The progress bar can update itself accordingly, mapping the total of the Operations' "expected" times to its total, and the total of the Operations' "progress" times to its current progress, in whatever way makes sense for your progress bar framework.
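As a minimal sketch of the shape this might take (in Java; all names here are illustrative, not from any framework):

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicLong;

// One Operation per worker; the worker reports progress, the UI reads it.
class Operation {
    private final long expectedWork;                   // e.g. 6,230 calculations
    private final AtomicLong completedWork = new AtomicLong();

    Operation(long expectedWork) { this.expectedWork = expectedWork; }

    // Called by the worker (from its own thread) as it makes progress.
    void reportProgress(long unitsDone) { completedWork.addAndGet(unitsDone); }

    long expected()  { return expectedWork; }
    long completed() { return Math.min(completedWork.get(), expectedWork); }
}

// Maps the totals across all Operations onto a single value in [0, 1].
class ProgressAggregator {
    private final List<Operation> operations = new CopyOnWriteArrayList<>();

    void add(Operation op) { operations.add(op); }

    double globalProgress() {
        long expected = 0, completed = 0;
        for (Operation op : operations) {
            expected  += op.expected();
            completed += op.completed();
        }
        return expected == 0 ? 0.0 : (double) completed / expected;
    }
}

The UI thread can then poll globalProgress() on a timer (or be notified through whatever event mechanism you use) and set the bar's value accordingly, so only the main thread ever touches the progress bar itself.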
Obviously there's a ton of other considerations/work that needs to be done in actually implementing this, but I hope this gives you the gist of it.
Multiple progress bars aren't such a bad idea, mind you. Or maybe a complex progress bar that shows several threads running (like download manager programs sometimes have). As long as the UI is intuitive, your users will appreciate the extra data.
When I try to answer design questions like this, I first look at similar or analogous problems in other applications and how they're solved. So I would suggest doing some research: consider other applications that display complex progress (like the download manager example) and try to adapt an existing solution to your application.
Sorry I can't offer more specific design, this is just general advice. :)
Stick with Observer/Observable for this kind of thing. Some object observes the various series processing threads and reports status by updating the summary bar.