I am seeking to improve performance by reducing scene graph traversal overhead before each render call. I am not very experienced with multi-threaded software design, so after reading a couple of articles about multi-threaded rendering I am unsure how to approach this issue:
My rendering engine is completely deterministic and renders frames based on incoming transformation instructions in a sequential manner at each new frame. I currently picture the threaded scene graph update routine as something like this:
--------------CPU-------------------------------------|------GPU--------|----Frame Number----|
Update Frame 0 Transforms (spawn thread)               | GL RenderCall   | Frame 0
Update Frame 1 Transforms (spawn thread)               | GL RenderCall   | Frame 1
Update Frame 2 Transforms (spawn thread)               | GL RenderCall   | Frame 2
...
Before the first draw call I start updating the first frame (Frame 1) in a separate thread and proceed with the render call. At the end of that call I start a new thread for the update of frame 2, check if the thread for frame 1 is done, and if it is, I issue the next render call. And so on.
That is how I see this happening. I have two questions:
1. Is this the proper (simple) way to design this kind of system?
2. What is the likelihood of render loop stalls because the scene graph update thread hasn't finished its update in sync with the start of the next render call?
I know some people here will say it depends on the specific complexity of the scene graph tree, but I would like to know how it usually goes in practice and what the major drawbacks of such a design are.
As you probably know, you shouldn't render to a common OpenGL drawable from multiple threads, as this results in a net slowdown. However, preparing the drawing - the frame setup - is a valid step to parallelize. It always boils down to generating a linear list of objects to draw, in order to maximize throughput and produce a correct result.
Of course the actual generation steps depend on the structure used, but for a multithreaded design it usually boils down to a map-and-reduce kind of approach. Creating and synchronizing threads has a certain overhead; luckily, those problems are addressed by systems like OpenMP. I also suggest you perform the frame setup phase during the SwapBuffers wait of the preceding frame.
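To make that concrete, here is a minimal sketch (my own illustration, not part of the original answer) of a map-and-reduce style frame setup: the scene graph has already been flattened into a linear array of draw items, the per-item world matrices are computed in parallel with OpenMP, and a single thread then submits the finished list to OpenGL. DrawItem, frameSetupAndRender and submitDrawCall are hypothetical names.

// Sketch: parallel frame setup (map), single-threaded submission (reduce).
#include <vector>

struct Mat4 { float m[16]; };

static Mat4 multiply(const Mat4& a, const Mat4& b)
{
    Mat4 r{};
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            for (int k = 0; k < 4; ++k)
                r.m[row * 4 + col] += a.m[row * 4 + k] * b.m[k * 4 + col];
    return r;
}

struct DrawItem {
    Mat4 parentWorld;   // filled in during traversal
    Mat4 local;         // node's local transform
    Mat4 world;         // computed in the parallel step below
    int  meshId = 0;
};

static void submitDrawCall(const DrawItem& item)
{
    (void)item;          // the actual glBindVertexArray / glDrawElements calls go here
}

void frameSetupAndRender(std::vector<DrawItem>& items)
{
    // Map: each item's world transform is independent, so this parallelizes cleanly.
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(items.size()); ++i)
        items[i].world = multiply(items[i].parentWorld, items[i].local);

    // Reduce/submit: only this thread ever touches the GL context.
    for (const DrawItem& item : items)
        submitDrawCall(item);
}

Ideally this runs while the previous frame's SwapBuffers is still in flight, as suggested above.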
Related
I switched from OpenGL to Vulkan to take advantage of its multi-threading improvements.
In OpenGL, I was able to load objects (buffers, textures, etc.) into the scene dynamically while rendering by using a waiting system. I loaded all the app-side stuff in a thread, and when it was ready, just before a frame render on the main thread, I sent everything into video memory. That was fine.
With Vulkan, I know I can call some functions from multiple threads without provoking the well-known segfault from OpenGL. But this doesn't work with vkQueueSubmit(). I already know; I tried the naive way. To me it seems logical that you can't bother a queue from multiple threads.
I came up with some ideas, but I don't know which ones are good or bad.
First, I would go the OpenGL way: I would prepare everything I can on the CPU/app side, then just before rendering a frame, I would submit the buffers (on the transfer queue) to video memory. But I feel there is no real improvement over the OpenGL way...
Second, I would try to use the synchronization mechanisms to send buffers from one thread and render from another. But I keep reading that there are many ways to slow everything down by taking irrelevant locks or by using semaphores and fences incorrectly.
So my question is basically: which path should I pick to solve this problem? How can I load a buffer dynamically from another thread while the main thread is rendering, without hurting performance too much? How can Vulkan help?
If you want to stream resources for immediate use (i.e. the main render cannot proceed without them), then you're pretty much going to either block the main thread waiting, or have it spin doing something visually interesting (e.g. an animated loading screen) waiting for the resources to load.
If you want to stream resources while the app is doing real rendering, then the main trick is to load resources asynchronously in the background and only switch to using those resources in the main thread once they are already loaded. If the main thread ever ends up actually blocked on a semaphore then you've probably already started dropping frames, so your "engine" design needs to ensure that never happens. A lot of games use simple low-detail proxy objects as stand-in versions while the high-detail version is loading in the background.
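To make the trick concrete, here is a hedged sketch of my own (independent of GL or Vulkan): the loader thread builds the full resource and only then flips an atomic flag, while the render thread keeps drawing a low-detail proxy until the flag is set, so it never blocks. StreamedResource and selectForRendering are hypothetical names.

// Sketch: background streaming with a ready flag and a proxy stand-in.
#include <atomic>
#include <memory>
#include <thread>

struct Mesh { /* vertex/index buffers, textures, ... */ };

struct StreamedResource {
    std::shared_ptr<Mesh> proxy;              // cheap low-detail stand-in, always available
    std::shared_ptr<Mesh> full;               // filled by the loader thread
    std::atomic<bool>     ready{false};       // flipped only after `full` is completely built
};

// Loader thread: do all the slow work, then publish.
void loadInBackground(StreamedResource& res)
{
    auto mesh = std::make_shared<Mesh>();
    // ... read from disk, decode, upload via the transfer queue ...
    res.full = std::move(mesh);
    res.ready.store(true, std::memory_order_release);   // publish only when everything is in place
}

// Render thread: never waits; uses whichever version is available right now.
const Mesh& selectForRendering(const StreamedResource& res)
{
    if (res.ready.load(std::memory_order_acquire))
        return *res.full;
    return *res.proxy;
}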
None of this is particularly related to the graphics API - both GL and Vulkan need the same macro-scale behavior. Vulkan API features don't particularly help because the major bottlenecks which cause problems here are storage/network/CPU which have nothing to do with the graphics part of the problem.
I decided to trust threads!
At first it seems to work, though I get a lot of:
[MESSAGE:Validation Error: [ UNASSIGNED-Threading-MultipleThreads ] Object 0: handle = 0x56414228bad8, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0x141cb623 | THREADING ERROR : vkQueueSubmit(): object of type VkQueue is simultaneously used in thread 0x7f6b977fe640 and thread 0x7f6bc2bcb740]
But it works!
So, the basic idea is to have a thread for loading objects while the engine is drawing. This thread takes care of creating the UBO for the object's location; then, when the geometry is loaded from RAM, it creates the VBO and IBO (I left materials with image/UBO on hold for now), then creates the graphics pipeline (with layout, descriptor layout, and shaders compiled on the fly with glslang) (the next idea is to reuse pipelines for similar needs), and finally sets a flag to say the object is ready to use. On the other side, my main thread keeps rendering and picks up new objects when they show up as ready.
I think it works because I have a decent video card (GTX 1070) with multiple queues available; I set up one for graphics and another one for transfer.
I'm pretty sure this will crash or go poorly on a GPU with a single queue, and that should be why the validation layers told me these messages.
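For what it's worth, the validation message can be silenced without giving up the loader thread: the Vulkan spec requires external synchronization on the VkQueue passed to vkQueueSubmit, so either each thread must submit to its own queue, or submissions to a shared queue must be serialized. A minimal sketch of the latter (GuardedQueue is a hypothetical wrapper of mine, not part of the original answer):

// Sketch: one mutex per VkQueue, since vkQueueSubmit requires external
// synchronization on the queue (and the fence) it is called with.
#include <mutex>
#include <vulkan/vulkan.h>

struct GuardedQueue {
    VkQueue    queue = VK_NULL_HANDLE;
    std::mutex mutex;

    VkResult submit(uint32_t submitCount, const VkSubmitInfo* pSubmits, VkFence fence)
    {
        std::lock_guard<std::mutex> lock(mutex);   // serialize all submissions to this queue
        return vkQueueSubmit(queue, submitCount, pSubmits, fence);
    }
};

// Usage idea: the render thread and the loader thread each call guardedQueue.submit(...)
// instead of vkQueueSubmit(...) directly; with a dedicated transfer queue, give that queue
// its own GuardedQueue so the two queues never contend with each other.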
I am using SDL2's SDL_Renderer to:
draw various shapes into textures
show those textures in the window
As the first step can be time-consuming due to the number of shapes, I am considering putting it into a separate thread. A texture that is not finished would simply not be shown (I can ensure that in my code), but the application wouldn't have to wait for step 1 to complete.
But I am not sure how to handle the renderer object: currently I am using a single global one for both steps 1 and 2. If I just do 1 and 2 in two threads with the same renderer object, it is going to fail miserably, right? But I am not sure if I can create two separate renderer objects?
The answer to the question SDL2 Multiple renderers? suggests that renderers are tied to specific windows, and I am using just one.
I'm writing an ebook reader app for Windows Store. I'm using Direct2D + DXGI swap chains to render book pages on screen.
My book content is sometimes quite complex (geometry, bitmaps, masks, etc.), so it can take up to 100 ms to render. So I'm trying to do the rendering off-screen to a bitmap in a separate thread, and then just show this bitmap on the main thread.
However, I can't figure out how to do it efficiently.
So far I've tried two approaches:
Use a single ID2D1Factory with the D2D1_FACTORY_TYPE_MULTI_THREADED flag, create an ID2D1BitmapRenderTarget and use it in the background thread for off-screen rendering. (This additionally requires ID2D1Multithread::Enter/Leave around IDXGISwapChain::Present operations.) The problem is that the ID2D1RenderTarget::EndDraw operation in the background thread sometimes takes up to 100 ms, and main thread rendering is blocked for this period due to internal Direct2D locking.
Use a separate ID2D1Factory in the background thread (as described in http://www.sdknews.com/ios/using-direct2d-for-server-side-rendering) and turn off internal Direct2D synchronization. There is no cross-locking between the two threads in this case. Unfortunately, I can't use the resulting bitmap in the main ID2D1Factory directly, because it belongs to a different factory. I have to move the bitmap data to CPU memory, then copy it into GPU memory for the main ID2D1Factory. This operation also introduces significant lag (I believe due to the large memory transfers, but I'm not sure).
Is there a way to do this efficiently?
P.S. All the timings here were measured on an Acer Switch 10 tablet. On a regular Core i7 PC both approaches work without any visible lag.
Ok, I've found a solution.
Basically, all I needed was to modify approach 2 to use DXGI resource sharing between the two DirectX factory sets. I'll skip all the gory details (they can be found here: http://xboxforums.create.msdn.com/forums/t/66208.aspx), but the basic steps are:
Create two sets of DirectX resources: main (used for onscreen rendering) and secondary (for offscreen rendering).
Using the ID3D11Device2 from the main resource set, create a D3D 2D texture with CreateTexture2D, using the D3D11_BIND_RENDER_TARGET, D3D11_BIND_SHADER_RESOURCE, D3D11_RESOURCE_MISC_SHARED_NTHANDLE and D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX flags.
Get a shared handle from it by casting it to IDXGIResource1 and calling CreateSharedHandle on it with DXGI_SHARED_RESOURCE_READ and DXGI_SHARED_RESOURCE_WRITE.
Open this shared texture in the secondary resource set, on the background thread, by calling ID3D11Device2::OpenSharedResource1.
Acquire the keyed mutex of this texture (IDXGIKeyedMutex::AcquireSync), create a render target from it (ID2D1Factory2::CreateDxgiSurfaceRenderTarget), draw on it, and release the mutex (IDXGIKeyedMutex::ReleaseSync).
On the main thread, in the main resource set, acquire the mutex, create a shared bitmap from the texture created in step 2, draw this bitmap, then release the mutex.
Note that the mutex locking is necessary. Skipping it results in cryptic DirectX debug error messages, and in erroneous operation or even crashes.
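For illustration, a heavily condensed sketch of that handoff (my own reconstruction of the steps above; error handling, device creation and the main-thread drawing side are omitted, and the helper names are mine):

// Sketch only: shared D3D11 texture with a keyed mutex, written by a background
// Direct2D render target and later read by the main thread's Direct2D context.
#include <d3d11_2.h>
#include <dxgi1_2.h>
#include <d2d1_2.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Step 2: on the main device, create the shared texture.
ComPtr<ID3D11Texture2D> createSharedTexture(ID3D11Device2* mainDevice, UINT w, UINT h)
{
    D3D11_TEXTURE2D_DESC desc = {};
    desc.Width = w;  desc.Height = h;
    desc.MipLevels = 1;  desc.ArraySize = 1;
    desc.Format = DXGI_FORMAT_B8G8R8A8_UNORM;
    desc.SampleDesc.Count = 1;
    desc.Usage = D3D11_USAGE_DEFAULT;
    desc.BindFlags = D3D11_BIND_RENDER_TARGET | D3D11_BIND_SHADER_RESOURCE;
    desc.MiscFlags = D3D11_RESOURCE_MISC_SHARED_NTHANDLE | D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX;

    ComPtr<ID3D11Texture2D> texture;
    mainDevice->CreateTexture2D(&desc, nullptr, &texture);
    return texture;
}

// Step 3: export an NT handle that the secondary device can open.
HANDLE shareTexture(ID3D11Texture2D* texture)
{
    ComPtr<IDXGIResource1> resource;
    texture->QueryInterface(IID_PPV_ARGS(&resource));
    HANDLE sharedHandle = nullptr;
    resource->CreateSharedHandle(nullptr,
                                 DXGI_SHARED_RESOURCE_READ | DXGI_SHARED_RESOURCE_WRITE,
                                 nullptr, &sharedHandle);
    return sharedHandle;
}

// Steps 4-5: on the background thread, open the texture on the secondary device,
// take the keyed mutex, draw with Direct2D, then release the mutex.
void renderOffscreen(ID3D11Device2* secondaryDevice, ID2D1Factory2* d2dFactory, HANDLE sharedHandle)
{
    ComPtr<ID3D11Texture2D> texture;
    secondaryDevice->OpenSharedResource1(sharedHandle, IID_PPV_ARGS(&texture));

    ComPtr<IDXGIKeyedMutex> keyedMutex;
    texture->QueryInterface(IID_PPV_ARGS(&keyedMutex));
    keyedMutex->AcquireSync(0, INFINITE);

    ComPtr<IDXGISurface> surface;
    texture->QueryInterface(IID_PPV_ARGS(&surface));
    ComPtr<ID2D1RenderTarget> rt;
    d2dFactory->CreateDxgiSurfaceRenderTarget(
        surface.Get(),
        D2D1::RenderTargetProperties(
            D2D1_RENDER_TARGET_TYPE_DEFAULT,
            D2D1::PixelFormat(DXGI_FORMAT_B8G8R8A8_UNORM, D2D1_ALPHA_MODE_PREMULTIPLIED)),
        &rt);

    rt->BeginDraw();
    // ... draw the complex page content here ...
    rt->EndDraw();

    keyedMutex->ReleaseSync(0);   // the main thread acquires the mutex before drawing the bitmap
}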
tl;dr: Render to bitmaps on background thread in software mode. Draw from bitmaps to render target on UI thread in hardware mode.
The best approach I've been able to find so far is to use background threads with software rendering (IWICImagingFactory::CreateBitmap and ID2D1Factory::CreateWicBitmapRenderTarget) and then copy it to a hardware bitmap back on the thread with the hardware render target via ID2D1RenderTarget::CreateBitmapFromWicBitmap. And then blit that using ID2D1RenderTarget::DrawBitmap.
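As a rough sketch of that approach (my own illustration; error handling is omitted and the WIC and Direct2D factories are assumed to exist already):

// Sketch: software-rendered WIC bitmap on a background thread, copied into a
// hardware bitmap on the UI thread for fast DrawBitmap calls.
#include <wincodec.h>
#include <d2d1.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Background thread: render into a CPU-side WIC bitmap.
ComPtr<IWICBitmap> renderToWicBitmap(IWICImagingFactory* wicFactory,
                                     ID2D1Factory* d2dFactory,
                                     UINT width, UINT height)
{
    ComPtr<IWICBitmap> wicBitmap;
    wicFactory->CreateBitmap(width, height, GUID_WICPixelFormat32bppPBGRA,
                             WICBitmapCacheOnDemand, &wicBitmap);

    ComPtr<ID2D1RenderTarget> softwareTarget;
    d2dFactory->CreateWicBitmapRenderTarget(
        wicBitmap.Get(),
        D2D1::RenderTargetProperties(
            D2D1_RENDER_TARGET_TYPE_SOFTWARE,
            D2D1::PixelFormat(DXGI_FORMAT_B8G8R8A8_UNORM, D2D1_ALPHA_MODE_PREMULTIPLIED)),
        &softwareTarget);

    softwareTarget->BeginDraw();
    // ... expensive geometry / outline rendering goes here ...
    softwareTarget->EndDraw();
    return wicBitmap;
}

// UI thread: upload the finished WIC bitmap into the hardware render target and draw it.
void drawOnUiThread(ID2D1RenderTarget* hardwareTarget, IWICBitmap* wicBitmap,
                    const D2D1_RECT_F& destination)
{
    ComPtr<ID2D1Bitmap> gpuBitmap;
    hardwareTarget->CreateBitmapFromWicBitmap(wicBitmap, &gpuBitmap);
    hardwareTarget->DrawBitmap(gpuBitmap.Get(), &destination);
}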
This is how paint.net 4.0 does selection rendering. When you're drawing a selection with the lasso tool, it will use a background thread to draw the selection outline asynchronously (the UI thread does not wait for this to complete). You can end up with a very complicated polygon due to the stroke style and animations. I render it 4 times, where each animation frame has a slightly different offset for the dashed stroke style.
Obviously this rendering can take a while as the polygon becomes more complex (that is, if you keep scribbling for a while). I have a few other special optimizations for when you use the Move Selection tool, which allows you to do transformations (rotate, translate, scale): if the background thread hasn't yet re-rendered the current polygon with the new transform, then I will render the old bitmap (with the current polygon and old transform) with the new transform applied. The selection outline may be distorted (scaling) or clipped (translated outside of the viewable area) while the background thread catches up, but it's a small price to pay for 60fps responsiveness. This optimization works very well because you can't be modifying the polygon and transform of a selection at the same time.
I apologize up front for this long post, but as you can probably see I have been thinking about this for quite some time, and I feel I need some input from other people before my head explodes :-)
I have been experimenting for some time now with various ways of building a game engine which satisfies all the following criteria:
Complete separation of object updating and object rendering
Full determinism
Updating and rendering at individual speeds
No blocking on shared resources
Complete separation of object updating and object rendering
Separation of object updating and object rendering seems to be vital to ensure optimal usage of resources while sending data to the graphics API and swapping buffers.
Even if you want to ensure full parallelism to use multiple cores of a CPU, it seems that this separation must still be managed.
Full determinism
Many game types, and especially multiplayer versions, must ensure full determinism. Otherwise players will experience different states of the same game, effectively breaking the game logic. Determinism is required for game replays as well. And it is useful for other purposes where it is important that each run of a simulation produces the same result every time, given the same starting conditions and inputs.
Updating and rendering at individual speeds
This is really a prerequisite for full determinism, as you cannot have the simulation depend on rendering speed (i.e. the various monitor refresh rates, graphics adapter speeds, etc.). Under optimal conditions the update speed should be set at a certain fixed interval (e.g. 25 updates per second - maybe less depending on the update type), and the rendering speed should be whatever the client's monitor refresh rate / graphics adapter allows.
This implies that a rendering speed higher than the update speed should be allowed. And while that sounds like a waste, there are known tricks to ensure that the added rendering cycles are not wasted (interpolation / extrapolation), which means that faster monitors / adapters are rewarded with a more visually pleasing experience, as they should be.
Rendering speeds lower than the update speed must also be allowed, though, even if this does in fact result in wasted updating cycles - at least the added updating cycles are not all presented to the user. This is, however, necessary to ensure a smooth multiplayer experience even if the rendering in one of the clients slows to a sudden crawl for one reason or another.
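For reference, the usual way to decouple the two rates while staying deterministic is a fixed-timestep loop with interpolation; this is a generic sketch of my own, not code from any of the referenced articles:

// Sketch: fixed update rate, free-running render rate, interpolated presentation.
#include <chrono>

struct State { double x = 0, y = 0; };                       // placeholder simulation state

State interpolate(const State& a, const State& b, double t)
{
    return { a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t };
}

void update(State& s, double dt) { s.x += 1.0 * dt; }        // stand-in for the real simulation
void render(const State& s)      { (void)s; /* draw call */ }

void runLoop(bool& running)
{
    using clock = std::chrono::steady_clock;
    const double dt = 1.0 / 25.0;                            // e.g. 25 updates per second
    double accumulator = 0.0;
    State previous, current;
    auto last = clock::now();

    while (running) {
        auto now = clock::now();
        accumulator += std::chrono::duration<double>(now - last).count();
        last = now;

        while (accumulator >= dt) {                          // run as many fixed steps as needed
            previous = current;
            update(current, dt);
            accumulator -= dt;
        }
        // Render a state between the last two updates; the blend factor is in [0, 1).
        render(interpolate(previous, current, accumulator / dt));
    }
}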
No blocking on shared resources
If the other criteria mentioned above are to be implemented, it must also follow that we cannot allow rendering to wait for updating or vice versa. Of course it is painfully obvious that when two different threads share access to resources and one thread is updating some of those resources, it is impossible to guarantee that blocking will never take place. It is, however, possible to keep this blocking to an absolute minimum - for example when switching pointer references between a queue of updated objects and a queue of previously rendered objects.
So...
My question to all you skilled people in here is: Am I asking for too much?
I have been reading about these various topics on many sites. But it always seems that one part or the other is left out of the suggestions I've seen. And maybe the reason is that you cannot have it all without compromise.
I started this seemingly common quest a long time ago when I was putting my thoughts about it in this thread:
Thoughts about rendering loop strategies
Back then my first naive assumption was that it shouldn't matter if updating and reading happened simultaneously, since the variation in object state was so small that you shouldn't notice if one object was occasionally a step ahead of another.
Now I am somewhat wiser, but still confused at times.
The most promising and detailed description of a method that would allow for all my wishes to come through was this:
http://blog.slapware.eu/game-engine/programming/multithreaded-renderloop-part1/
A three-state model that ensures the renderer can always choose a new queue for rendering without any wait (except perhaps a microsecond while switching pointer references). At the same time the updater can always gain access to the 2 queues required for building the next state tree (1 queue for creating/updating the next state, and 1 queue for reading the previous one - which can be done even while the renderer reads it as well).
I recently found time to make a sample implementation of this, and it works very well, except for two issues.
One is a minor issue of having to deal with multiple references to all involved objects.
The other is more serious (unless I'm just being too needy). It is the fact that extrapolation - as opposed to interpolation - is used to maintain a visually pleasing representation of the states given a fast screen refresh rate. While both methods do the job of showing states deviating from the solidly calculated object states, extrapolation seems to me to produce much more visible artifacts when the predictions fail to represent reality. My position seems to be supported by this:
http://gafferongames.com/networked-physics/snapshots-and-interpolation/
And it is not possible to implement interpolation in the three-state design as far as I can tell, since it requires the renderer to have read access to 2 queues at all times to calculate the intermediate state between two known states.
So I was toying with extending the three-state model suggested on the slapware blog to use interpolation instead of extrapolation - and at the same time trying to simplify the multi-reference structure. While it seems to me to be possible, I am wondering if the price is too high. In order to meet all my goals I would need to have:
2 queues (or states) exclusively held by the renderer (they could be used by another thread for read-only purposes, but never updated or switched during rendering)
1 queue (or state) with the newest updated state, ready to be switched over to the renderer when it is done rendering the current scene
1 queue (or state) with the next frame being built/updated by the updater
1 queue (or state) containing a copy of the frame last built/updated. This is the same state as the one last sent to the renderer, so this queue/state should be accessible both to the updater (for reading the previous state) and to the renderer (for rendering the state).
So that would mean I would need to keep 4 copies of render state at all times to keep this design running smoothly, locklessly and deterministically.
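Roughly, the handoff I picture looks like this (a simplified sketch only; it trades strict lock-freedom for a tiny mutex-guarded pointer swap, and WorldState, StateExchange and drawInterpolated are just placeholder names):

// Sketch: updater publishes complete snapshots; renderer keeps the last two for interpolation.
// The only shared section is the brief pointer swap mentioned above.
#include <memory>
#include <mutex>

struct WorldState { /* full snapshot of renderable object state */ };

class StateExchange {
public:
    // Called by the updater once a state is fully built.
    void publish(std::shared_ptr<const WorldState> state)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        latest_ = std::move(state);
    }

    // Called by the renderer each frame; returns the newest published state.
    std::shared_ptr<const WorldState> acquireLatest()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return latest_;
    }

private:
    std::mutex mutex_;
    std::shared_ptr<const WorldState> latest_;
};

// Renderer side: keep the previous and current snapshots and blend between them,
// so the updater is never blocked beyond the pointer swap above.
void renderFrame(StateExchange& exchange,
                 std::shared_ptr<const WorldState>& previous,
                 std::shared_ptr<const WorldState>& current,
                 double alpha)
{
    auto newest = exchange.acquireLatest();
    if (newest && newest != current) {        // a new state arrived since last frame
        previous = current ? current : newest;
        current = newest;
    }
    if (previous && current) {
        // drawInterpolated(*previous, *current, alpha);   // hypothetical draw helper
    }
}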
I fear that I'm overthinking this. So if any of you have advice to pull me back down to earth, suggestions for what can be improved, critique of the design, or perhaps references to good resources explaining how these goals can be achieved, or why this is or isn't a good idea - please hit me with them :-)
I'm looking for a design pattern that would fit my application design.
My application processes large amounts of data and produces some graphs.
Data processing (fetching from files, CPU-intensive calculations) and graph operations (drawing, updating) are done in separate threads.
Graph can be scrolled - in this case new data portions need to be processed.
Because there can be several series on a graph, multiple threads can be spawned (two threads per series, one for the dataset update and one for the graph update).
I don't want to create multiple progress bars. Instead, I'd like to have a single progress bar that reports the global progress. At the moment I can think of MVC and Observer/Observable, but it's a little bit blurry :) Maybe somebody could point me in the right direction, thanks.
I once spent the best part of a week trying to make a smooth, non-hiccupy progress bar over a very complex algorithm.
The algorithm had 6 different steps. Each step had timing characteristics that were heavily dependent on A) the underlying data being processed - not just the "amount" of data but also the "type" of data - and B) the number of CPUs: 2 of the steps scaled extremely well with an increasing number of CPUs, 2 steps ran in 2 threads, and 2 steps were effectively single-threaded.
The mix of data effectively had a much larger impact on execution time of each step than number of cores.
The solution that finally cracked it was really quite simple. I made 6 functions that analyzed the data set and tried to predict the actual run-time of each analysis step. The heuristic in each function analyzed both the data sets under analysis and the number of CPUs. Based on run-time data from my own 4-core machine, each function basically returned the number of milliseconds it was expected to take, on my machine.
f1(..) + f2(..) + f3(..) + f4(..) + f5(..) + f6(..) = total runtime in milliseconds
Now, given this information, you effectively know what percentage of the total execution time each step is supposed to take. If step 1 is supposed to take 40% of the execution time, you basically need to find out how to emit 40 1% events from that algorithm. Say the for-loop is processing 100,000 items; you could probably do:
for (int i = 0; i < numItems; i++){
    if (i % (numItems / percentageOfTotalForThisStep) == 0) emitProgressEvent();
    .. do the actual processing ..
}
This algorithm gave us a silky smooth progress bar that performed flawlessly. Your implementation technology can have different forms of scaling and features available in the progress bar, but the basic way of thinking about the problem is the same.
And yes, it did not really matter that the heuristic reference numbers were worked out on my machine - the only real problem is if you want to change the numbers when running on a different machine. But you still know the ratio (which is the only really important thing here), so you can see how your local hardware runs differently from the one I had.
Now the average SO reader may wonder why on earth someone would spend a week making a smooth progress bar. The feature was requested by the head salesman, and I believe he used it in sales meetings to get contracts. Money talks ;)
In situations with threads or asynchronous processes/tasks like this, I find it helpful to have an abstract type or object in the main thread that represents (and ideally encapsulates) each process. So, for each worker thread, there will presumably be an object (let's call it Operation) in the main thread to manage that worker, and obviously there will be some kind of list-like data structure to hold these Operations.
Where applicable, each Operation provides the start/stop methods for its worker, and in some cases - such as yours - numeric properties representing the progress and the expected total time or work of that particular Operation's task. The units don't necessarily need to be time-based; if you know you'll be performing 6,230 calculations, you can just think of these properties as calculation counts. Furthermore, each task will need some way of updating its owning Operation about its current progress, through whatever mechanism is appropriate (callbacks, closures, event dispatching, or whatever your programming language/threading framework provides).
So while your actual work is being performed off in separate threads, a corresponding Operation object in the "main" thread is continually being updated/notified of its worker's progress. The progress bar can update itself accordingly, mapping the total of the Operations' "expected" times to its total, and the total of the Operations' "progress" times to its current progress, in whatever way makes sense for your progress bar framework.
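A bare-bones sketch of that shape (names like Operation, expectedWork and reportProgress are just placeholders of mine): each worker reports into its Operation, and the main thread sums expected and completed work across all Operations to drive a single bar.

// Sketch: one Operation per worker thread; the main thread aggregates them into one percentage.
#include <atomic>
#include <memory>
#include <vector>

class Operation {
public:
    explicit Operation(long expected) : expected_(expected) {}

    // Called from the worker thread as it makes progress.
    void reportProgress(long unitsDone) { completed_.store(unitsDone, std::memory_order_relaxed); }

    long expectedWork()  const { return expected_; }
    long completedWork() const { return completed_.load(std::memory_order_relaxed); }

private:
    const long        expected_;
    std::atomic<long> completed_{0};
};

// Main thread: map the totals onto one progress bar value in [0, 1].
double globalProgress(const std::vector<std::shared_ptr<Operation>>& operations)
{
    long expected = 0, completed = 0;
    for (const auto& op : operations) {
        expected  += op->expectedWork();
        completed += op->completedWork();
    }
    return expected > 0 ? static_cast<double>(completed) / expected : 0.0;
}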
Obviously there's a ton of other considerations/work that needs to be done in actually implementing this, but I hope this gives you the gist of it.
Multiple progress bars aren't such a bad idea, mind you. Or maybe a complex progress bar that shows several threads running (like download manager programs sometimes have). As long as the UI is intuitive, your users will appreciate the extra data.
When I try to answer such design questions, I first look at similar or analogous problems in other applications and how they're solved. So I would suggest you do some research by considering other applications that display complex progress (like the download manager example) and try to adapt an existing solution to your application.
Sorry I can't offer more specific design, this is just general advice. :)
Stick with Observer/Observable for this kind of thing. Some object observes the various series processing threads and reports status by updating the summary bar.