How to have multiple threads work on the same graph?


So I'm trying to create a very large Minesweeper game, and I was going to have the whole map be stored in a container, probably a 2D array. In Minesweeper, when you reveal a blank tile, all 8 adjacent tiles are revealed, and if one of those is a blank tile, then its neighbors are revealed too, so the whole thing is recursive.
For a very large map, this can easily escalate so that it takes way too long for the whole algorithm to finish. I want to use threads to reveal the tiles in parallel, but if all threads are accessing the same graph, then I'll have to use locks. Pretty much every thread will just sit there waiting for their turn to access the map when really I want it to be done in parallel.
My question is, how do you multithread algorithms that access a graph e.g. BFS? Do you partition the graph and then have one thread per partition? Or do you just go ahead and use locks? I'm really interested in the general theory so if anyone has links to any good reading material I'd love to take a look at those too (hopefully I can understand it though!).
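One common way to avoid per-tile locks is a level-synchronous BFS: the current frontier is split into chunks, each worker only reads the map and returns the neighbors it found, and a single thread merges the results before the next level starts. The sketch below is a minimal illustration of that structure, not a tuned implementation. The names `reveal`, `expand_chunk` and `mines` are hypothetical, the map is assumed to be a list of rows of 0/1 mine flags, and the start tile is assumed not to be a mine. Note that in standard CPython the GIL prevents these threads from speeding up CPU-bound work, so this shows the organization only.

```python
from concurrent.futures import ThreadPoolExecutor


def neighbors(r, c, rows, cols):
    """Yield the in-bounds coordinates of the up to 8 adjacent tiles."""
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols:
                yield r + dr, c + dc


def count_adjacent_mines(mines, r, c):
    rows, cols = len(mines), len(mines[0])
    return sum(mines[nr][nc] for nr, nc in neighbors(r, c, rows, cols))


def expand_chunk(mines, chunk):
    """Read-only step: collect the neighbors of every blank tile in the chunk."""
    rows, cols = len(mines), len(mines[0])
    found = set()
    for r, c in chunk:
        if count_adjacent_mines(mines, r, c) == 0:  # blank tile
            found.update(neighbors(r, c, rows, cols))
    return found


def reveal(mines, start, workers=4):
    """Return the set of tiles revealed by clicking `start` (assumed mine-free)."""
    revealed = {start}
    frontier = [start]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # Split the frontier; workers never write shared state.
            chunks = [frontier[i::workers] for i in range(workers)]
            results = pool.map(lambda ch: expand_chunk(mines, ch), chunks)
            # Single-threaded merge: the only place `revealed` is modified.
            candidates = set().union(*results)
            frontier = list(candidates - revealed)
            revealed.update(frontier)
    return revealed
```

The map is read-only during expansion and only the merging thread touches `revealed`, so no locks are needed; the cost is one synchronization point per BFS level.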

Related

How Might I organize vertex data in WebGL for a frame-by-frame (very specific) animated program?

I have been working on an animated graphics project with very specific requirements, and after quite a bit of searching and test coding, I have figured that I could take several approaches, but the Khronos and MDN documentation I have been reading coupled with other posts I have seen here don't answer all of my questions regarding my particular project. In the meantime, I have written short test programs (setting infrastructure for testing).
Firstly, I should describe the project:
The main object drawn to the screen is a simple quad surrounded by a black outline (LINE_LOOP or LINES will do, probably, though I have had issues with z-fighting...that will be left for another question). When the user interacts with the program, exactly one new quad is created and immediately drawn, but for a set amount of time its vertices move around until the quad moves to its final destination. (Note that translations won't do.) Random black lines are also drawn, and sometimes those lines also move around.
Once one of the quads reaches its final spot, it never moves again.
A new quad is always atop old quads (closer to the screen). That means that I need to layer the quads and lines from oldest to newest.
*this also means that it would probably be best to assign z-values to each quad and line, even if the graphics are in pixel coordinates and use an orthographic matrix. Would everyone agree with this?
Given these parameters, I have a few options with varying levels of complexity:
1> Take the object-oriented approach and just assign a buffer to each quad, and the same goes for the random lines. --creation and destruction of buffers every frame for the one shape that is moving. I truthfully think that this is a terrible idea that might only work in a higher level library that does heavy optimization underneath. This approach also doesn't take advantage of the fact that almost every quad will stay the same.
[vertices0] ... , [verticesN]
Draw x N (many draws for many small-size buffers)
2> Assign a z-value to each quad, outline, and line (as mentioned above). Allocate a huge vertex buffer and element buffer to store all permanently-in-their-final-positions quads. Resize only in the very unlikely case someone interacts for long enough. Create a second tiny buffer to store the one temporary moving quad and use bufferSubData every frame. When the quad reaches its destination, bufferSubData it into the large buffer and overwrite the small buffer upon creation of the next quad...all on the same frame. The main questions I have here are: is it possible (safe?) to use bufferSubData and draw it on the same frame? Also, would I use DYNAMIC_DRAW on both buffers even though the larger one would see fewer updates?
[permanent vertices ... | uninitialized (keep a count)]
bufferSubData -> [tempVerticesForOneQuad]
Draw 2x
3> Still create the large and small buffers, but instead of using bufferSubData every frame, create a second shader program and add an attribute for the new/moving quad that explicitly sets the vertex positions for the animation (I would pass vertex index attributes). Only draw with the small buffer when the quad is moving. For the frame when the quad reaches its destination, draw both large and small buffer, but then bufferSubData the final coordinates into the large permanent buffer to be used in the next frame.
switchToShaderProgramA();
[permanent vertices...| uninitialized (keep a count)]
switchToShaderProgramB();
[temp vertices] <- shader B accepts indices for each vertex so we can do all animation in the vertex shader
---last frame of movement arrives : bufferSubData into the permanent vertices buffer for when the next quad is created
I get the sense that the third option might be the best, but I would like to learn whether there are some other factors that I did not consider. For example, I am assuming that a program switch, additional attributes, and vertex shader manipulation would be faster than just substituting the buffer values as in 2>. The advantage of approach 3> (I think) is that I can defer the buffer substitution to a time when nothing needs to be drawn.
Still, I am not sure how to work with the randomly-appearing lines. I can't take the "single quad vertex buffer" approach since the number of lines cannot be predicted. Might I also allocate a large buffer for the moving lines? Those also stay after the quad is finished moving, though I don't think that I could use the vertex shader trick because there would be too many attributes to set (as opposed to the 4 for the one quad). I suppose that I could create a large "permanent line data" buffer first, but what to do during the animation is tricky because the lines move. Maybe bufferSubData() + draw on the same frame is not terrible? Or it could be. This is where I need advice.
I understand that this question might not be too specific code-wise, but I don't believe that I would be allowed to show the core of the program. All I have is the typical WebGL boilerplate ready.
I am looking forward to hearing people's thoughts on how I might proceed and whether there are any trade-offs I might have missed when considering the three options above.
Thank you in advance, and please feel free to ask any additional questions if clarification is necessary.

Honestly, for what you're describing, it doesn't sound to me like it matters which you choose. On modern hardware, drawing a few hundred quads and a few thousand lines each frame would not really tax the hardware much.
Having said that, I agree that approach 1 seems very inefficient. Approach 2 sounds perfectly fine. You can safely draw a buffer on the same frame that you uploaded the data. I don't think it matters much whether you use DYNAMIC_DRAW or STATIC_DRAW for the buffer. I tend to think of dynamic buffers as being something you're updating every frame. If you only update it every few seconds or less often, then static is fine. Approach 3 is also fine. Between 2 and 3, I'd say do whichever is easier for you to understand and program.
Likewise, for the lines, I would use a separate buffer. It sounds like that one changes per frame, so I would use DYNAMIC_DRAW for that. Allocating a single large buffer for it and performing a glBufferSubData() per frame is probably a fine strategy. As always, trying it and profiling it will tell you for sure.

Several walkers walking on a grid: How to organize the threads?

My algorithm processes DEMs. A DEM (Digital Elevation Model) is a representation of ground topography where elevation is known at grid nodes.
My problem can be summarized as follows:
Q is a queue containing nodes to visit.
At start, the boundary of the grid is pushed in Q.
while Q is not empty, do
    remove node N from the top of Q
    if N was never visited then do
        consider the 8 neighbors of N
        among them select the unvisited ones
        among them select those with a higher elevation than N's
        push these at Q's tail
        mark N as visited
    done
done
As described, the algorithm will mark as 'visited' every node that can be reached from the border by a continuously ascending path. It is worth noting that the order in which the nodes in the queue are processed is unimportant. Note also that some points may require a tortuous ascending path to be reached from the border. Think, for example, of a cone with a furrow spiraling around it. The ridge of the furrow is such a unique and tortuous path, capable of reaching the top of the cone without ever descending into the furrow.
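For reference, the sequential algorithm described above can be written down directly. This is a minimal sketch, assuming the DEM is a list of rows of elevations; the function name `reachable_by_ascent` is hypothetical.

```python
from collections import deque


def reachable_by_ascent(dem):
    """Mark every node reachable from the border by a strictly ascending path."""
    rows, cols = len(dem), len(dem[0])
    visited = [[False] * cols for _ in range(rows)]
    # At start, the boundary of the grid is pushed in the queue.
    queue = deque(
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if r in (0, rows - 1) or c in (0, cols - 1)
    )
    while queue:
        r, c = queue.popleft()
        if visited[r][c]:
            continue
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                    # Keep unvisited neighbors that are higher than N.
                    if not visited[nr][nc] and dem[nr][nc] > dem[r][c]:
                        queue.append((nr, nc))
        visited[r][c] = True
    return visited
```

A node may be pushed more than once by different neighbors; the visited check on removal makes the duplicates harmless, which is also why the processing order does not matter.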
Anyway, I want to multithread this algorithm. I am still at the first step of wondering which organization of data and threads will give me as little pain as possible when debugging the beast once it is written.
My first thought is to divide the grid into tiles and split the queue into as many queues as there are tiles in the grid. The tiles are piled in a work-list. A few threads scan the work-list and grab any tile where something can be done at the moment.
Working on a specific tile first requires that the tile's queue is not empty. I may also need to lock the neighboring tiles if the walker has to visit a node at the edge of its tile.
I am thinking that when a walker cannot lock a neighboring tile when it needs to, it can skip to the next node in the local queue, or the thread itself can even release the tile to the work-list and look for another tile to work on.
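One possible variation, sketched here as an alternative and not as the design described above, avoids locking neighboring tiles altogether. Each tile owns its queue and its part of the visited grid; when a walker finds a higher neighbor in another tile, it hands the node over to that tile's queue, and the owner performs the visited check later. The sketch runs in rounds (drain all non-empty tiles in parallel, then redistribute the hand-overs in one thread), which also makes termination detection trivial. The names `TILE`, `tile_of`, `drain_tile` and `reachable_tiled` are hypothetical, the DEM is assumed to be a read-only list of rows, and in CPython the GIL means this illustrates the organization only.

```python
from collections import defaultdict, deque
from concurrent.futures import ThreadPoolExecutor

TILE = 2  # hypothetical tile edge length, in nodes


def tile_of(r, c):
    return (r // TILE, c // TILE)


def drain_tile(dem, visited, tile, queue):
    """Process one tile's queue; only this tile's visited cells are written."""
    rows, cols = len(dem), len(dem[0])
    outgoing = defaultdict(list)  # nodes handed over to other tiles
    while queue:
        r, c = queue.popleft()
        if visited[r][c]:
            continue
        visited[r][c] = True
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols \
                        and dem[nr][nc] > dem[r][c]:
                    owner = tile_of(nr, nc)
                    if owner == tile:
                        queue.append((nr, nc))
                    else:
                        outgoing[owner].append((nr, nc))
    return outgoing


def reachable_tiled(dem, workers=4):
    rows, cols = len(dem), len(dem[0])
    visited = [[False] * cols for _ in range(rows)]
    queues = defaultdict(deque)
    for r in range(rows):
        for c in range(cols):
            if r in (0, rows - 1) or c in (0, cols - 1):
                queues[tile_of(r, c)].append((r, c))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while queues:  # stop when no tile has pending work
            batch = list(queues.items())
            queues = defaultdict(deque)
            results = pool.map(
                lambda item: drain_tile(dem, visited, *item), batch
            )
            for outgoing in results:  # single-threaded redistribution
                for owner, nodes in outgoing.items():
                    queues[owner].extend(nodes)
    return visited
```

The trade-off is that a long path crossing many tile borders needs one round per crossing, so the tortuous-ridge case mentioned earlier parallelizes poorly; in exchange there are no locks to debug.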
My experience of multi-threaded programming is good enough to understand that this lovely description is very likely to turn into a nightmare when I debug it. However, I am not experienced enough to evaluate the various ways of programming the algorithm and make a good decision, bearing in mind that I will not be given a month to debug a spaghetti dish.
Thanks for reading :)

Improving simulation performance via concurrency

Consider this sequential procedure on a data structure containing collections (for simplicity, call them lists) of Doubles. For as long as I feel like, do:
Select two different lists from the structure at random
Calculate a statistic based on those lists
Flip a coin based on that statistic
Possibly modify one of the lists, based on the outcome of the coin toss
The goal is to eventually achieve convergence to something, so the 'solution' is linear in the number of iterations. An implementation of this procedure can be seen in the SO question here, and here is an intuitive visualization:
It seems that this procedure could be better performed - that is, convergence could be achieved faster - by using several workers executing concurrently on separate OS threads, ex:
I guess a perfectly-realized implementation of this should be able to achieve a solution in O(n/P) time, for P the number of available compute resources.
Reading up on Haskell concurrency has left my head spinning with terms like MVar, TVar, TChan, acid-state, etc. What seems clear is that a concurrent implementation of this procedure would look very different from the one I linked above. But, the procedure itself seems to essentially be a pretty tame algorithm on what is essentially an in-memory database, which is a problem that I'm sure somebody has come across before.
I'm guessing I will have to use some kind of mutable, concurrent data structure that supports decent random access (that is, to random idle elements) & modification. I am getting a bit lost when I try to piece together all the things that this might require with a view towards improving performance (STM seems dubious, for example).
What data structures, concurrency concepts, etc. are suitable for this kind of task, if the goal is a performance boost over a sequential implementation?

Keep it simple:
forkIO for lightweight, super-cheap threads.
MVar, for fast, thread safe shared memory.
and the appropriate sequence type (probably vector, maybe lists if you only prepend)
a good stats package
and a fast random number source (e.g. mersenne-random-pure64)
You can try the fancier stuff later. For raw performance, keep things simple first: keep the number of locks down (e.g. one per buffer); make sure to compile your code with optimizations and use the threaded runtime (ghc -O2 -threaded) and you should be off to a great start.
RWH (Real World Haskell) has an intro chapter that covers the basics of concurrent Haskell.

Java: Use Disruptor or not?

Hi,
Currently I am developing a program that takes 2 values from an amq queue and performs a series of mathematical calculations on them. A topic has been created on the amq server to which my program subscribes, and it receives messages via callbacks (listeners).
Now whenever a message arrives, the two values are taken out of it and added to the SynchronizedDescriptiveStatistics object. After each addition to the list of values, the whole sequence of calculations is performed all over again (this is actually part of the requirement).
The problem I am facing right now is that, since I am using listeners, sometimes one or more messages are received in the middle of the calculations. Although SynchronizedDescriptiveStatistics takes care of all the thread-related issues itself, it adds all the waiting values to its list of numbers at once when the lock is released. What I need instead is to add one value, perform the calculations on it, then add the second value, and so on.
The solution I came up with is to use job queues in my program (not amq queues). This way, whenever the calculations are over, the program would look for further jobs in the queue and go on accordingly.
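The job-queue idea can be sketched as a single consumer thread that takes one value at a time and reruns the calculations before looking at the next one. This is a minimal illustration in Python and not the Java program in question (there, the analogous pieces would be a BlockingQueue and one consumer thread). The names `on_message`, `consume` and `run_pipeline` are hypothetical, and the mean stands in for the real calculations.

```python
import queue
import statistics
import threading

_STOP = object()  # sentinel telling the consumer to shut down


def on_message(jobs, value):
    """Stand-in for the listener callback: it only enqueues, never computes."""
    jobs.put(value)


def consume(jobs, results):
    """Single consumer: add one value, recompute, then take the next job."""
    values = []
    while True:
        value = jobs.get()
        if value is _STOP:
            break
        values.append(value)
        # Placeholder for the full sequence of calculations.
        results.append(statistics.mean(values))


def run_pipeline(incoming):
    jobs = queue.Queue()
    results = []
    consumer = threading.Thread(target=consume, args=(jobs, results))
    consumer.start()
    for value in incoming:  # simulates messages arriving at any time
        on_message(jobs, value)
    jobs.put(_STOP)
    consumer.join()
    return results
```

For example, `run_pipeline([2.0, 4.0, 6.0])` yields one result per value, each computed on exactly the values received so far, no matter how quickly the messages arrive.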
Since I am also looking for efficiency and speed, I thought the Disruptor framework might be good for this problem, as it is optimized for threaded situations. But I am not sure if it's worth the trouble of implementing Disruptor in my application, because a regular standard queue might be enough for what I am trying to do.
Let me also tell you that there is a lot of data on which the calculations need to be performed, that it will keep on coming, and that all the calculations will need to be performed all over again for each addition of a single value, continuously. So, keeping in mind the efficiency and the huge volume of data, what do you think will be useful in the long run?
Waiting for a reply...
Regards.

I'll give our typical answer to this question: test first, and make your decision based on your results.
Although you talk about efficiency, you don't specifically say that performance is a fundamental requirement. If you have an idea of your performance requirements, you could mock up a simple prototype using queues versus a basic implementation of the Disruptor, and take measurements of the performance of both.
If one comes off substantially better than the other, that's your answer. If, however, one is much more effort to implement, especially if it's also not giving you the efficiency you require, or you don't have any hard performance requirements, then that suggests that solution is not the right one.
Measure first, and decide based on your results.

Typical rendering strategy for many and varied complex objects in directx?

I am learning DirectX. It provides a huge amount of freedom in how to do things, but presumably different strategies perform differently, and it provides little guidance as to what well-performing usage patterns might be.
When using directx is it typical to have to swap in a bunch of new data multiple times on each render?
The most obvious, and probably really inefficient, way to use it would be like this.
Strategy 1
On every single render
Load everything for model 0 (textures included) and render it (IASetVertexBuffers, VSSetShader, PSSetShader, PSSetShaderResources, PSSetConstantBuffers, VSSetConstantBuffers, Draw)
Load everything for model 1 (textures included) and render it (IASetVertexBuffers, VSSetShader, PSSetShader, PSSetShaderResources, PSSetConstantBuffers, VSSetConstantBuffers, Draw)
etc...
I am guessing you can make this partly more efficient if the biggest things to load are given dedicated slots, e.g. if the texture for model 0 is really complicated, don't reload it on each step, just load it into slot 1 and leave it there. Of course, since I'm not sure how many registers of each type are guaranteed to exist in DX11, this is complicated (can anyone point to documentation on that?)
Strategy 2
Choose some texture slots for loading and others for perpetual storage of your most complex textures.
Once only
Load most complicated models, shaders and textures into slots dedicated for perpetual storage
On every single render
Load everything not already present for model 0 using slots you set aside for loading and render it (IASetVertexBuffers, VSSetShader, PSSetShader, PSSetShaderResources, PSSetConstantBuffers, VSSetConstantBuffers, Draw)
Load everything not already present for model 1 using slots you set aside for loading and render it (IASetVertexBuffers, VSSetShader, PSSetShader, PSSetShaderResources, PSSetConstantBuffers, VSSetConstantBuffers, Draw)
etc...
Strategy 3
I have no idea, but the above are probably all wrong because I am really new at this.
What are the standard strategies for efficient rendering on directx (specifically DX11) to make it as efficient as possible?

DirectX manages the resources for you and tries to keep them in video memory as long as it can to optimize performance, but can only do so up to the limit of video memory in the card. There is also overhead in every state change even if the resource is still in video memory.
A general strategy for optimizing this is to minimize the number of state changes during the rendering pass. Commonly this means drawing all polygons that use the same texture in a batch, and all objects using the same vertex buffers in a batch. So generally you would try to draw as many primitives as you can before changing the state to draw more primitives.
This often will make the rendering code a little more complicated and harder to maintain, so you will want to do some profiling to determine how much optimization you are willing to do.
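The batching idea can be illustrated without any graphics API: sort the draw list by the state each draw needs, so that draws sharing a shader and texture end up adjacent. The sketch below is a language-neutral illustration in Python and not DirectX code; the dictionary keys `shader` and `texture` and both function names are hypothetical, and it ignores cases where draw order matters (for example, transparent objects).

```python
def count_state_changes(draws):
    """Count how often the (shader, texture) pair changes along the draw list."""
    changes = 0
    current = None
    for draw in draws:
        state = (draw["shader"], draw["texture"])
        if state != current:
            changes += 1
            current = state
    return changes


def batch_by_state(draws):
    """Reorder draws so that those sharing shader and texture are adjacent."""
    return sorted(draws, key=lambda d: (d["shader"], d["texture"]))


draws = [
    {"shader": "A", "texture": "t1", "mesh": "rock"},
    {"shader": "B", "texture": "t2", "mesh": "tree"},
    {"shader": "A", "texture": "t1", "mesh": "wall"},
    {"shader": "B", "texture": "t2", "mesh": "bush"},
]
```

With this list, the original order needs four state changes while the sorted order needs two; whether that difference matters in practice is what the profiling is for.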
Generally you will get bigger performance gains from higher-level algorithmic changes that are beyond the scope of this question. Some examples would be reducing polygon counts for distant objects and using occlusion queries. A popular saying is "the fastest polygons are the ones you don't draw". Here are a few quick links:
http://msdn.microsoft.com/en-us/library/bb147263%28v=vs.85%29.aspx
http://www.gamasutra.com/view/feature/3243/optimizing_direct3d_applications_.php
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter06.html

Other answers are better answers to the question per se, but by far the most relevant thing I found since asking was this discussion on gamedev.net in which some big title games are profiled for state changes and draw calls.
What comes out of it is that big-name games don't appear to actually worry too much about this, i.e. writing code that addresses this sort of issue can take significant time, and that time probably isn't worth the delay in getting your application finished.
