I am looking for a concurrent algorithm which would help me in detecting cycles in a directed graph.
I know that the sequential algorithm uses a DFS with colouring; however, I think it will fail in a multithreaded environment. Here is an example directed graph to illustrate:
A->(B, C), B-> (D), D-> (E), C-> (E), E-> (F)
    A
   / \
  B   C
  |   |
  D   |
   \ /
    E
    |
    F
(I hope the above makes it clear. The edges in the graph all point from top to bottom.)
For this graph, the following interleaving is possible when the search runs concurrently.
(The colouring scheme I assumed is: white = unvisited, grey = DFS still in progress, black = DFS finished.)
Thread 1 runs dfs(B), which eventually colours E grey and calls dfs(E) (leading to F). Before this finishes, thread 2 runs dfs(C). It sees that E is grey and reports a cycle, which is obviously wrong.
I checked that Tarjan's algorithm could also be used for cycle detection, but again I do not think it will be correct in a multithreaded environment.
Could somebody please help me out on this?
Thanks.
As Ira states, let each thread use its own colour.
But if you have a fixed number of threads, use a bitmap for each of the colours, with one bit per thread.
As long as your processor supports an atomic bit test-and-set (e.g. LOCK BTS on x86), you won't even need locking, as each thread will be testing and setting a different bit.
If the bit is set, then the item is coloured grey.
PS: If you need more colours, you can use more bits.
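To make the "own colour per thread" idea concrete, here is a minimal sketch. It is not the bitmap version: as an assumption for illustration, each thread simply keeps private grey/black sets, which plays the same role as giving each thread its own bit. The names `GRAPH`, `dfs_finds_cycle` and `any_cycle` are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# The example graph from the question: each key points to the listed vertices.
GRAPH = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": ["E"], "E": ["F"], "F": []}


def dfs_finds_cycle(graph, start):
    """Coloured DFS whose grey/black marks are private to the calling thread."""
    grey, black = set(), set()  # per-thread colours; never shared

    def visit(v):
        if v in grey:       # back edge within *this* thread's DFS: a real cycle
            return True
        if v in black:      # already fully explored by this thread
            return False
        grey.add(v)
        found = any(visit(w) for w in graph[v])
        grey.discard(v)
        black.add(v)
        return found

    return visit(start)


def any_cycle(graph, starts):
    """Run one DFS per start vertex concurrently and combine the answers."""
    with ThreadPoolExecutor(max_workers=len(starts)) as pool:
        return any(pool.map(lambda s: dfs_finds_cycle(graph, s), starts))
```

With private colours, thread 2 never sees thread 1's grey mark on E, so the false positive from the question cannot occur. The price is that threads may explore the same vertices more than once, because the black marks are not shared either.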
For multithreaded cycle detection, it's better to use a variant of Kahn's algorithm (for topological sort) instead of DFS. This uses the facts that:
1) If a directed graph is acyclic, then it has at least one vertex with no in-edges, and at least one vertex with no out-edges;
2) A vertex with no in-edges or no out-edges cannot participate in a cycle; so
3) If you remove a vertex with no in-edges or no out-edges, you're left with a smaller directed graph with the same cycles as the original.
So, to do a parallel cycle detection, you can:
1) First, use a parallel BFS to build a data structure that keeps track of the in-degree and out-degree of each vertex.
2) Then, in parallel, remove vertices with in-degree or out-degree 0. Note that removing a vertex will decrement the in-degrees or out-degrees of adjacent nodes.
3) When you're out of vertices to remove, whatever is left is non-empty exactly when the graph has a cycle: every remaining vertex lies on a cycle or on a path between two cycles. If nothing is left, then the original graph was acyclic.
Both the parallel BFS (step 1) and parallel vertex removal (step 2) are easily accomplished with parallel work queues. In step 1, when you see a vertex for the first time, add a task to the queue that processes adjacent vertices. In step 2, when you decrement a vertex's in-degree or out-degree to 0, add a task to remove it from the graph.
Note that this algorithm detects cycles just as well if you remove only nodes with in-degree 0, or only nodes with out-degree 0, but opportunities for parallelism are somewhat reduced.
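The removal procedure described above can be sketched as follows. This is a sequential sketch only: the `deque` stands in for the parallel work queue, and in a real multithreaded version the degree updates would have to be atomic. The function name and the edge-list input format are assumptions made for this example.

```python
from collections import deque


def remaining_after_peeling(edges, vertices):
    """Repeatedly remove vertices with in-degree 0 or out-degree 0.

    Returns the set of vertices that could not be removed; it is empty
    exactly when the graph is acyclic.
    """
    succ = {v: set() for v in vertices}
    pred = {v: set() for v in vertices}
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)

    alive = set(vertices)
    queue = deque(v for v in vertices if not succ[v] or not pred[v])
    while queue:
        v = queue.popleft()
        if v not in alive:
            continue  # the vertex was queued twice
        alive.discard(v)
        for w in succ[v]:          # removing v lowers the in-degree of its successors
            pred[w].discard(v)
            if w in alive and not pred[w]:
                queue.append(w)
        for u in pred[v]:          # ...and the out-degree of its predecessors
            succ[u].discard(v)
            if u in alive and not succ[u]:
                queue.append(u)
    return alive
```

For the graph in the question this returns an empty set; for a graph such as X->Y->Z->X with an extra edge Z->W it returns X, Y and Z.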
You should easily find distributed deadlock detection algorithms that address the cycle detection problem.
I understand that distributed isn't exactly multithreaded, but you should still find hints there.
Edit : added a restricted solution.
I was asked how to reverse a singly linked list with as many as 7 million nodes, using threads efficiently. Recursion doesn't look feasible with so many nodes, so I opted for divide and conquer: each thread is given a chunk of the linked list, which it reverses by making each node point back to its previous node (keeping references to the current, next and previous nodes), and the reversed chunks from the different threads are joined afterwards. But the interviewer insisted that the size of the linked list is not known, and that it can be done efficiently without finding the size. I couldn't figure it out; how would you go about it?
I like to approach such questions "top-down":
Assume that you already have a class that implements Runnable or extends Thread, from which you can create and run instances. Each instance receives two parameters: a pointer to a Node in the list and the number of nodes to reverse.
Your main thread traverses all 7 million nodes and "marks" the starting points for your threads. Say we have 7 threads; the marked points will be: 1, 1,000,000, 2,000,000, ... Save the marked nodes in an array or whichever data structure you like.
After you have finished marking the starting points, create the threads and give each one of them its starting point and the count 1,000,000.
After all the threads are done, "glue" the chunks together: make each marked starting point point back to the last node of the previous thread's chunk (these last nodes should be saved in another "static" ordered data structure).
Now that we have a plan, all that's left to do is implement a (fairly easy) algorithm that, given a number N and a Node x, reverses the next N nodes (including x) in a singly linked list :)
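For that last step, a minimal sketch of "reverse the next N nodes starting at x" might look like this. The `Node` class and the function names are assumptions made for illustration, and `reverse_in_chunks` is only a sequential stand-in for the threaded plan, there to show how the gluing works.

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def reverse_chunk(x, n):
    """Reverse up to n nodes starting at x (inclusive).

    Returns (new_head, new_tail, rest): the first and last node of the
    reversed chunk, and the first node after it (None at end of list).
    The reversed chunk's tail is left pointing at None so the caller
    can glue chunks together afterwards.
    """
    prev, cur = None, x
    while cur is not None and n > 0:
        nxt = cur.next
        cur.next = prev
        prev, cur = cur, nxt
        n -= 1
    return prev, x, cur


def reverse_in_chunks(head, n):
    """Sequential stand-in for the threaded plan: reverse chunk by chunk, then glue."""
    new_head = None
    while head is not None:
        chunk_head, chunk_tail, head = reverse_chunk(head, n)
        chunk_tail.next = new_head  # glue: point back at the previous chunk
        new_head = chunk_head
    return new_head
```

In the threaded version each worker would call `reverse_chunk` on its own marked node, and the main thread would do the gluing once all workers have joined.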
I am seeking to improve performance by reducing scene graph traversal overhead before each render call. I am not very experienced with multithreaded software design, so after reading a couple of articles on multithreaded rendering I am unsure how to approach this issue:
My rendering engine is completely deterministic and renders frames sequentially, based on the transformation instructions that arrive for each new frame. I currently see the threaded scene graph update routine as something like this:
CPU                                      | GPU           | Frame Number
Update Frame 0 Transforms (spawn thread) | GL RenderCall | Frame 0
Update Frame 1 Transforms (spawn thread) | GL RenderCall | Frame 1
Update Frame 2 Transforms (spawn thread) | GL RenderCall | Frame 2
...
Before the first draw call I start updating the first frame (Frame 1) in a separate thread and proceed with the render call. At the end of that call I start a new thread to update frame 2, check whether the thread for frame 1 is done and, if so, issue the next render call. And so on.
That is how I see this happening. I have 2 questions:
1. Is this the proper (simple) way to design this kind of system?
2. What is the likelihood of render loop stalls because the scene graph update thread hasn't finished in sync with the start of the next render call?
I know some people here will say it depends on the complexity of the specific scene graph tree, but I would like to know how it usually goes in practice and what the major drawbacks of such a design are.
As you probably know, you shouldn't render to a common OpenGL drawable from multiple threads, as this would result in a net slowdown. However, preparing the drawing, a.k.a. the frame setup, is a valid step to parallelize. It always boils down to generating a linear list of objects to draw, in order to maximize throughput and produce a correct result.
Of course the actual generation steps depend on the structure used. But for a multithreaded design it usually boils down to a map and reduce kind of approach. Creating and synchronizing threads has a certain overhead. Luckily those problems are addressed by systems like OpenMP. I also suggest you perform the frame setup phase during the SwapBuffers wait of the preceding frame.
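As a rough sketch of overlapping frame setup with rendering, one long-lived worker can prepare the draw list for the next frame while the current one is drawn. Everything here is a placeholder: `update_transforms` and `render` are hypothetical stand-ins for the scene graph update and the GL render call, and Python is used only to show the control flow.

```python
from concurrent.futures import ThreadPoolExecutor


def update_transforms(frame):
    """Placeholder for the scene graph update; returns a flat draw list."""
    return [f"object-{i}@frame-{frame}" for i in range(3)]


def render(draw_list):
    """Placeholder for the GL render call; here it only reports what it drew."""
    return len(draw_list)


def run(num_frames):
    drawn = []
    with ThreadPoolExecutor(max_workers=1) as pool:  # one long-lived worker
        pending = pool.submit(update_transforms, 0)
        for frame in range(num_frames):
            draw_list = pending.result()  # blocks (stalls) if the update is late
            if frame + 1 < num_frames:
                # Prepare the next frame while this one is being rendered.
                pending = pool.submit(update_transforms, frame + 1)
            drawn.append(render(draw_list))
    return drawn
```

Reusing one worker avoids paying the thread creation cost on every frame, and the `pending.result()` call is exactly where a stall shows up when the update takes longer than the render.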
PRAM models for parallel computing come in three main flavours: EREW, CREW and CRCW.
I can understand how EREW and CREW can be implemented on a multicore machine. But how would one go about implementing the CRCW model on a multicore CPU? Is it even a practical model, given that concurrent writes are not possible and every basic parallel programming course goes into great detail about race conditions?
Essentially, this means that trying to avoid race conditions and trying to implement concurrent writes are two opposing goals.
First up: We know that the PRAM is a theoretical, or abstract machine. There are several simplifications made so that it may be used for analyzing/designing parallel algorithms.
Next, let's talk about the ways in which one may do 'concurrent writes' meaningfully.
Concurrent write memories are usually divided into subclasses, based on how they behave:
Priority based CW - Processors have a priority, and if multiple concurrent writes to the same location arrive, the write from the processor of highest priority gets committed to memory.
Arbitrary CW - One processor's write is arbitrarily chosen for commit.
Common CW - Multiple concurrent writes to the same location are committed only if the values being written are the same. i.e. all writing processors must agree on the value being written.
Reduction CW - A reduction operator is applied on the multiple values being written. e.g. a summation, where multiple concurrent writes to the same location lead to the sum of the values being written to be committed to memory.
These subclasses lead to some interesting algorithms. Some of the examples I remember from class are:
A CRCW-PRAM where the concurrent write is achieved as a summation can sum an arbitrarily large number of integers in a single timestep. There is a processor for each integer in the input array. All processors write their value to the same location. Done.
Imagine a CRCW-PRAM where the memory commits concurrent writes only if the value written by all processors is the same. Now imagine N numbers A[1] ... A[N], whose maximum you need to find. Here's how you'd do it:
Step 1:
N^2 processors compare each value to every other value, and write the results to a 2D array:
    parallel_for i in [1,N]
        parallel_for j in [1,N]
            if (A[i] >= A[j])
                B[i,j] = 1
            else
                B[i,j] = 0
So in this 2D array, the row corresponding to the biggest number will be all 1's.
Step 2:
Find the row which has only 1's, and store the corresponding value as the max:
    parallel_for i in [1,N]
        M[i] = 1
        parallel_for j in [1,N]
            if (B[i,j] == 0)
                M[i] = 0 // multiple concurrent writes of the *same* value
        if M[i]
            max = A[i]
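To check the logic of the two steps, here is a purely sequential simulation of the algorithm. It is only a sketch: ordinary loops stand in for the N^2 processors, and the function name `crcw_max` is made up for this example. The assertion checks the common-CW rule, i.e. that every processor writing to max writes the same value.

```python
def crcw_max(a):
    """Sequentially simulate the two-step common-CRCW maximum algorithm."""
    n = len(a)
    # Step 1: one "processor" per (i, j) pair fills in b[i][j].
    b = [[1 if a[i] >= a[j] else 0 for j in range(n)] for i in range(n)]
    # Step 2: every processor that sees a 0 writes the same value, 0, to m[i].
    m = [1] * n
    for i in range(n):
        for j in range(n):
            if b[i][j] == 0:
                m[i] = 0
    # Every i with m[i] == 1 holds the maximum, so all writers agree on the value.
    winners = {a[i] for i in range(n) if m[i]}
    assert len(winners) <= 1, "common-CW rule violated"
    return winners.pop() if winners else None
```

Note that ties are harmless: if the maximum occurs twice, both entries keep m[i] = 1 and both write the same value.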
Finally, is it possible to implement for real?
Yes, it is possible. Designing, say, a register file, or a memory and associated logic, which has multiple write ports, and which arbitrates concurrent writes to the same address in a meaningful way (like the ways I described above) is possible. You can probably already see that based on the subclasses I mentioned. Whether or not it is practical, I cannot say. I can say that in my limited experience with computers (which involves mostly using general purpose hardware, like the Core Duo machine I'm currently sitting before), I haven't seen one in practice.
EDIT: I did find a CRCW implementation. The wikipedia article on PRAM describes a CRCW machine which can find the max of an array in 2 clock cycles (using the same algorithm as the one above). The description is in SystemVerilog and can be implemented in an FPGA.
I am having trouble understanding the complete step and the incomplete step in greedy scheduling, in multithreaded programming in Cilk.
Here is the power-point presentation for reference.
Cilk ++ Multi-threaded Programming
The part I have trouble understanding is slides 32 - 37.
Can someone please explain, in particular, these two definitions:
Complete step: >= P threads ready to run
Incomplete step: < P threads ready to run
Thanks for your time and help
First, note that "threads" mentioned in the slides are not like OS threads as one may think. Their definition of a thread is given at slide 10: "a maximal sequence of instructions not containing parallel control (spawn, sync, return)". To avoid further confusion, let me call it a task instead.
On slides 32-35, a circle represents a task ("thread"), and edges represent dependencies between tasks. The sentences you ask about are in fact definitions: when P or more tasks are ready to run (so all P processors can be busy doing some work), the situation is called a complete step, while if fewer than P tasks are ready, the situation is called an incomplete step. To simplify the analysis, it is (implicitly) assumed that all tasks contain equal work (of size 1).
Then the theorem on the slide 35 provides an upper bound of time required for a greedy scheduler to run a program. Since all the execution is a sequence of complete and incomplete steps, the execution time is the sum of all steps. Since each complete step performs exactly P work, the number of complete steps cannot be bigger than T1 (total work) divided by P. Then, each incomplete step must execute a task belonging to the critical path (because at every step at least one critical path task must be ready, and incomplete steps execute all ready tasks); so the overall number of incomplete steps does not exceed the span T_inf (critical path length). Thus the sum of T1/P and T_inf gives an upper bound on execution time.
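If it helps, the bound T_P <= T1/P + T_inf can be checked on a small example by simulating a greedy scheduler on unit-size tasks. This is only a sketch for experimenting, not anything from the slides; the function name and the `deps` dictionary format (each task mapped to the tasks it depends on, with every task present as a key) are assumptions made for this example.

```python
def greedy_schedule_steps(deps, p):
    """Simulate a greedy scheduler on unit-time tasks.

    deps maps each task to the tasks it depends on.
    Returns (steps, work, span).
    """
    remaining = {t: set(d) for t, d in deps.items()}
    dependents = {t: [] for t in deps}
    for t, d in remaining.items():
        for u in d:
            dependents[u].append(t)

    # Span: the longest chain of dependent tasks (critical path length).
    depth = {}

    def chain(t):
        if t not in depth:
            depth[t] = 1 + max((chain(u) for u in deps[t]), default=0)
        return depth[t]

    span = max((chain(t) for t in deps), default=0)

    ready = [t for t, d in remaining.items() if not d]
    work, steps = 0, 0
    while ready:
        running, ready = ready[:p], ready[p:]  # greedy: run as many as possible
        steps += 1
        work += len(running)
        for t in running:
            for v in dependents[t]:
                remaining[v].discard(t)
                if not remaining[v]:
                    ready.append(v)
    return steps, work, span
```

For a diamond-shaped graph (a, then b, c, d in parallel, then e) with P = 2 this gives 4 steps, with T1 = 5 and T_inf = 3, so the bound 5/2 + 3 = 5.5 holds. Steps 2 is a complete step (two of the three ready tasks run); the others are incomplete.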
The rest of the slides in the "Scheduling Theory" section are rather straightforward.
I'm looking for a design pattern that would fit my application design.
My application processes large amounts of data and produces some graphs.
Data processing (fetching from files, CPU-intensive calculations) and graph operations (drawing, updating) are done in separate threads.
The graph can be scrolled, in which case new portions of data need to be processed.
Because there can be several series on a graph, multiple threads can be spawned (two threads per series: one for the dataset update and one for the graph update).
I don't want to create multiple progress bars. Instead, I'd like to have a single progress bar that reports global progress. At the moment I can think of MVC and Observer/Observable, but it's a little bit blurry :) Maybe somebody could point me in the right direction. Thanks.
I once spent the best part of a week trying to make a smooth, non-hiccupy progress bar over a very complex algorithm.
The algorithm had 6 different steps. Each step had timing characteristics that were seriously dependent on A) the underlying data being processed, not just the "amount" of data but also the "type" of data and B) 2 of the steps scaled extremely well with increasing number of cpus, 2 steps ran in 2 threads and 2 steps were effectively single-threaded.
The mix of data effectively had a much larger impact on execution time of each step than number of cores.
The solution that finally cracked it was really quite simple. I made 6 functions that analyzed the data set and tried to predict the actual run-time of each analysis step. The heuristic in each function analyzed both the data sets under analysis and the number of cpus. Based on run-time data from my own 4 core machine, each function basically returned the number of milliseconds it was expected to take, on my machine.
f1(..) + f2(..) + f3(..) + f4(..) + f5(..) + f6(..) = total runtime in milliseconds
Now given this information, you can effectively know what percentage of the total execution time each step is supposed to take. Now if you say step1 is supposed to take 40% of the execution time, you basically need to find out how to emit 40 1% events from that algorithm. Say the for-loop is processing 100,000 items, you could probably do:
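int itemsPerEvent = numItems / percentageOfTotalForThisStep;
if (itemsPerEvent < 1) itemsPerEvent = 1; // avoid a modulo by zero when there are fewer items than percent points
for (int i = 0; i < numItems; i++) {
    if (i % itemsPerEvent == 0) emitProgressEvent();
    // ... do the actual processing ...
}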
This algorithm gave us a silky smooth progress bar that performed flawlessly. Your implementation technology can have different forms of scaling and features available in the progress bar, but the basic way of thinking about the problem is the same.
And yes, it did not really matter that the heuristic reference numbers were worked out on my machine - the only real problem is if you want to change the numbers when running on a different machine. But you still know the ratio (which is the only really important thing here), so you can see how your local hardware runs differently from the one I had.
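To illustrate the ratio idea, here is a small sketch of turning the six predicted run-times into whole-percent shares of the progress bar. The function name and the sample numbers are made up, and it assumes at least one prediction is greater than zero.

```python
def progress_shares(predicted_ms):
    """Turn per-step runtime predictions into whole-percent shares summing to 100."""
    total = sum(predicted_ms)
    shares = [int(100 * ms / total) for ms in predicted_ms]
    # Hand the percent points lost to rounding down to the longest step.
    shares[predicted_ms.index(max(predicted_ms))] += 100 - sum(shares)
    return shares
```

Because only the ratios matter, the same shares come out whether the predictions were measured on a fast or a slow machine, as long as the steps scale roughly alike.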
Now the average SO reader may wonder why on earth someone would spend a week making a smooth progress bar. The feature was requested by the head salesman, and I believe he used it in sales meetings to get contracts. Money talks ;)
In situations with threads or asynchronous processes/tasks like this, I find it helpful to have an abstract type or object in the main thread that represents (and ideally encapsulates) each process. So, for each worker thread, there will presumably be an object (let's call it Operation) in the main thread to manage that worker, and obviously there will be some kind of list-like data structure to hold these Operations.
Where applicable, each Operation provides the start/stop methods for its worker, and in some cases - such as yours - numeric properties representing the progress and expected total time or work of that particular Operation's task. The units don't necessarily need to be time-based, if you know you'll be performing 6,230 calculations, you can just think of these properties as calculation counts. Furthermore, each task will need to have some way of updating its owning Operation of its current progress in whatever mechanism is appropriate (callbacks, closures, event dispatching, or whatever mechanism your programming language/threading framework provides).
So while your actual work is being performed off in separate threads, a corresponding Operation object in the "main" thread is continually being updated/notified of its worker's progress. The progress bar can update itself accordingly, mapping the total of the Operations' "expected" times to its total, and the total of the Operations' "progress" times to its current progress, in whatever way makes sense for your progress bar framework.
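A minimal sketch of that arrangement, assuming progress is counted in work units rather than time: the `Operation` class below and the `overall_fraction` helper are illustrative names, and a plain lock stands in for whatever notification mechanism your framework provides.

```python
import threading


class Operation:
    """Main-thread handle for one worker; counts work units, not time."""

    def __init__(self, expected):
        self.expected = expected
        self.done = 0
        self._lock = threading.Lock()

    def advance(self, units=1):
        # Called from the worker thread whenever it finishes some work.
        with self._lock:
            self.done = min(self.expected, self.done + units)

    def snapshot(self):
        with self._lock:
            return self.done, self.expected


def overall_fraction(operations):
    """Global progress in [0, 1]: total done units over total expected units."""
    done = expected = 0
    for op in operations:
        d, e = op.snapshot()
        done += d
        expected += e
    return done / expected if expected else 1.0
```

The single progress bar would then poll `overall_fraction` (or be notified on each `advance`) and never needs to know how many workers exist.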
Obviously there's a ton of other considerations and work that needs to be done in actually implementing this, but I hope this gives you the gist of it.
Multiple progress bars aren't such a bad idea, mind you. Or maybe a complex progress bar that shows several threads running (like download manager programs sometimes have). As long as the UI is intuitive, your users will appreciate the extra data.
When I try to answer such design questions I first try to look at similar or analogous problems in other application, and how they're solved. So I would suggest you do some research by considering other applications that display complex progress (like the download manager example) and try to adapt an existing solution to your application.
Sorry I can't offer more specific design, this is just general advice. :)
Stick with Observer/Observable for this kind of thing. Some object observes the various series processing threads and reports status by updating the summary bar.