Efficient Direct2D multithreading

I'm writing an ebook reader app for the Windows Store. I'm using Direct2D + DXGI swap chains to render book pages on screen.
My book content is sometimes quite complex (geometry, bitmaps, masks, etc.), so it can take up to 100 ms to render. So I'm trying to do the off-screen rendering to a bitmap in a separate thread, and then just show this bitmap on the main thread.
However, I can't figure out how to do this efficiently.
So far I've tried two approaches:
Use a single ID2D1Factory with the D2D1_FACTORY_TYPE_MULTI_THREADED flag, create an ID2D1BitmapRenderTarget, and use it on a background thread for off-screen rendering. (This additionally requires ID2D1Multithread::Enter/Leave around IDXGISwapChain::Present calls; a sketch of that locking is at the end of this question.) The problem is that the ID2D1RenderTarget::EndDraw call on the background thread sometimes takes up to 100 ms, and main-thread rendering is blocked for that period due to internal Direct2D locking.
Use a separate ID2D1Factory on the background thread (as described in http://www.sdknews.com/ios/using-direct2d-for-server-side-rendering) and turn off Direct2D's internal synchronization. There is no cross-locking between the two threads in this case. Unfortunately, I then can't use the resulting bitmap with the main ID2D1Factory directly, because it belongs to a different factory. I have to move the bitmap data to CPU memory, then copy it into GPU memory owned by the main ID2D1Factory. This round trip also introduces significant lag (I believe due to the large memory copies, but I'm not sure).
Is there a way to do this efficiently?
P.S. All the timings here are for an Acer Switch 10 tablet. On a regular Core i7 PC, both approaches work without any visible lag.
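For reference, the Present locking from approach 1 looks roughly like this (a sketch only; device and swap-chain creation are omitted and the names are mine):

// Sketch of approach 1: protect IDXGISwapChain::Present with Direct2D's
// internal lock so it can't race the background thread's EndDraw.
#include <d2d1_1.h>
#include <dxgi1_2.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void PresentWithD2DLock(ID2D1Factory1* d2dFactory, IDXGISwapChain1* swapChain)
{
    ComPtr<ID2D1Multithread> d2dLock;
    if (SUCCEEDED(d2dFactory->QueryInterface(IID_PPV_ARGS(&d2dLock))))
    {
        d2dLock->Enter();              // take Direct2D's internal lock
        swapChain->Present(1, 0);      // now safe w.r.t. background EndDraw
        d2dLock->Leave();
    }
}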

Ok, I've found a solution.
Basically, all I needed was to modify approach 2 to use DXGI resource sharing between the two DirectX factory sets. I'll skip the gory details (they can be found here: http://xboxforums.create.msdn.com/forums/t/66208.aspx), but the basic steps are:
1. Create two sets of DirectX resources: a main set (used for on-screen rendering) and a secondary set (for off-screen rendering).
2. Using the ID3D11Device2 from the main resource set, create a D3D 2D texture with CreateTexture2D, specifying the D3D11_BIND_RENDER_TARGET, D3D11_BIND_SHADER_RESOURCE, D3D11_RESOURCE_MISC_SHARED_NTHANDLE and D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX flags.
3. Get a shared handle from it by casting it to IDXGIResource1 and calling CreateSharedHandle on it with DXGI_SHARED_RESOURCE_READ and DXGI_SHARED_RESOURCE_WRITE.
4. Open this shared texture in the secondary resource set, on the background thread, by calling ID3D11Device2::OpenSharedResource1.
5. Acquire the texture's keyed mutex (IDXGIKeyedMutex::AcquireSync), create a render target on it (ID2D1Factory2::CreateDxgiSurfaceRenderTarget), draw to it, and release the mutex (IDXGIKeyedMutex::ReleaseSync).
6. On the main thread, in the main resource set, acquire the mutex, create a shared bitmap from the texture created in step 2, draw this bitmap, then release the mutex.
Note that the mutex locking is necessary. Skipping it results in cryptic DirectX debug error messages, erroneous output, or even crashes.
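A condensed sketch of that setup (error handling omitted; the function and variable names are mine, and details will vary with your resource sets):

#include <d3d11_2.h>
#include <dxgi1_2.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Main device: create a shareable texture and an NT handle for it (steps 2-3).
ComPtr<ID3D11Texture2D> CreateSharedTexture(ID3D11Device2* mainDevice,
                                            UINT width, UINT height,
                                            HANDLE* outSharedHandle)
{
    D3D11_TEXTURE2D_DESC desc = {};
    desc.Width            = width;
    desc.Height           = height;
    desc.MipLevels        = 1;
    desc.ArraySize        = 1;
    desc.Format           = DXGI_FORMAT_B8G8R8A8_UNORM;
    desc.SampleDesc.Count = 1;
    desc.Usage            = D3D11_USAGE_DEFAULT;
    desc.BindFlags        = D3D11_BIND_RENDER_TARGET | D3D11_BIND_SHADER_RESOURCE;
    desc.MiscFlags        = D3D11_RESOURCE_MISC_SHARED_NTHANDLE |
                            D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX;

    ComPtr<ID3D11Texture2D> texture;
    mainDevice->CreateTexture2D(&desc, nullptr, &texture);

    ComPtr<IDXGIResource1> resource;
    texture.As(&resource);
    resource->CreateSharedHandle(nullptr,
                                 DXGI_SHARED_RESOURCE_READ | DXGI_SHARED_RESOURCE_WRITE,
                                 nullptr, outSharedHandle);
    return texture;
}

// Background device: open the same texture and draw into it under the keyed
// mutex (steps 4-5). The Direct2D part is only sketched in the comment.
void RenderOffscreen(ID3D11Device2* backgroundDevice, HANDLE sharedHandle)
{
    ComPtr<ID3D11Texture2D> texture;
    backgroundDevice->OpenSharedResource1(sharedHandle, IID_PPV_ARGS(&texture));

    ComPtr<IDXGIKeyedMutex> mutex;
    texture.As(&mutex);
    mutex->AcquireSync(0, INFINITE);
    // ... query IDXGISurface from the texture, create a render target with
    // ID2D1Factory2::CreateDxgiSurfaceRenderTarget, and draw the page ...
    mutex->ReleaseSync(0);
}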

tl;dr: Render to bitmaps on background thread in software mode. Draw from bitmaps to render target on UI thread in hardware mode.
The best approach I've found so far is to use background threads with software rendering (IWICImagingFactory::CreateBitmap and ID2D1Factory::CreateWicBitmapRenderTarget), then copy the result to a hardware bitmap back on the thread that owns the hardware render target via ID2D1RenderTarget::CreateBitmapFromWicBitmap, and finally blit that using ID2D1RenderTarget::DrawBitmap.
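A rough sketch of that flow (error handling and WIC factory creation omitted; the function names are mine):

#include <d2d1.h>
#include <d2d1helper.h>
#include <wincodec.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Background thread: software rendering into a WIC bitmap.
ComPtr<IWICBitmap> RenderToWicBitmap(ID2D1Factory* d2dFactory,
                                     IWICImagingFactory* wicFactory,
                                     UINT width, UINT height)
{
    ComPtr<IWICBitmap> wicBitmap;
    wicFactory->CreateBitmap(width, height, GUID_WICPixelFormat32bppPBGRA,
                             WICBitmapCacheOnDemand, &wicBitmap);

    ComPtr<ID2D1RenderTarget> softwareTarget;
    d2dFactory->CreateWicBitmapRenderTarget(
        wicBitmap.Get(), D2D1::RenderTargetProperties(), &softwareTarget);

    softwareTarget->BeginDraw();
    // ... draw the expensive content here ...
    softwareTarget->EndDraw();
    return wicBitmap;
}

// UI thread (between BeginDraw/EndDraw on the hardware target): upload and blit.
void DrawCachedBitmap(ID2D1RenderTarget* hardwareTarget, IWICBitmap* wicBitmap)
{
    ComPtr<ID2D1Bitmap> hardwareBitmap;
    hardwareTarget->CreateBitmapFromWicBitmap(wicBitmap, nullptr, &hardwareBitmap);
    hardwareTarget->DrawBitmap(hardwareBitmap.Get());
}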
This is how paint.net 4.0 does selection rendering. When you're drawing a selection with the lasso tool, it will use a background thread to draw the selection outline asynchronously (the UI thread does not wait for this to complete). You can end up with a very complicated polygon due to the stroke style and animations. I render it 4 times, where each animation frame has a slightly different offset for the dashed stroke style.
Obviously this rendering can take a while as the polygon becomes more complex (that is, if you keep scribbling for a while). I have a few other special optimizations for when you use the Move Selection tool, which lets you apply transformations (rotate, translate, scale): if the background thread hasn't yet re-rendered the current polygon with the new transform, then I render the old bitmap (the current polygon with the old transform) with the new transform applied. The selection outline may be distorted (scaling) or clipped (translated outside the viewable area) while the background thread catches up, but it's a small price to pay for 60 fps responsiveness. This optimization works very well because you can't modify the polygon and the transform of a selection at the same time.

Related

Flickering frames in immediate mode

I am using Vulkan (ash) to render a scene.
The rendering algorithm first does a raytracing pass (using the traditional pipeline, not the NV extensions), then it renders a few frames normally.
On Nvidia it renders fine; on AMD I am experiencing flicker.
The draw structure is:
Raytrace
Render 6 meshes with 6 draw calls
If I render that in immediate mode, I get flickering; if I do so in FIFO mode, I don't. In immediate mode, if I put the thread to sleep between the raytracing call and the mesh calls, I get flickering no matter how long the sleep is; even with a 1-second sleep the flickering occurs. The pattern of the flicker is: 4 calls rendered normally, then 2 calls where the RT image is the only thing rendered, as if it had been issued after the 6 regular mesh calls, even though it's issued before them.
All of this suggests a synchronization bug. The issue is I have no idea what I forgot to do.
Each draw call draws to a swapchain image, and each draw call has this structure:
Set up uniforms, descriptor sets, pipelines...
Create a fence
Submit draw command to queue signaling fence
Wait for fence
Wait for device
Delete fence
I am currently doing that at the end of every single one of the 7 calls. What did I forget to sync? What barrier am I missing?
I am using the passless rendering extension, so I don't have any render passes. I might need an image barrier on the swapchain as a consequence, but I don't know whether I do, and what it would need to be.
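For reference, the per-draw submit/wait pattern listed above looks roughly like this in the C API (the actual code uses ash; command-buffer recording is elided and the names are illustrative):

#include <vulkan/vulkan.h>

// Sketch of the per-draw structure described above: one fence per submission,
// wait on it, then wait for the device, then destroy the fence.
void SubmitAndWait(VkDevice device, VkQueue queue, VkCommandBuffer cmd)
{
    VkFenceCreateInfo fenceInfo{};
    fenceInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;

    VkFence fence = VK_NULL_HANDLE;
    vkCreateFence(device, &fenceInfo, nullptr, &fence);

    VkSubmitInfo submit{};
    submit.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit.commandBufferCount = 1;
    submit.pCommandBuffers    = &cmd;

    vkQueueSubmit(queue, 1, &submit, fence);    // signal the fence on completion
    vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
    vkDeviceWaitIdle(device);                   // the "wait for device" step
    vkDestroyFence(device, fence, nullptr);
}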

Dynamically loading/removing buffers with Vulkan

I switched from OpenGL to Vulkan to take advantage of its multi-threading improvements.
In OpenGL, I was able to dynamically load objects into the scene (buffers, textures, etc.) while rendering by using a waiting system. I loaded all the app-side data in a thread, and when it was ready, just before a frame render on the main thread, I sent everything to video memory. That was fine.
With Vulkan, I know I can call some functions from multiple threads without provoking the well-known segfaults from OpenGL. But this doesn't work with vkQueueSubmit(). I already know; I tried the naive way. It seems logical to me that you can't touch a queue from multiple threads.
I came up with some ideas, but I don't know which ones are good or bad.
First, I could go the OpenGL way: prepare everything I can on the CPU/app side, then just before rendering a frame, submit the buffers (with the transfer queue) to video memory. But I feel there is no real improvement over the OpenGL approach...
Second, I could try to use the synchronization mechanisms to send buffers from one thread and render from another. But I keep reading there are a lot of ways to slow everything down by causing irrelevant locks or by using semaphores and fences incorrectly.
So my question is basically: which path should I pick to solve this problem? How can I load a buffer dynamically from another thread while the main thread is rendering, without hurting performance too much? How can Vulkan help?
If you want to stream resources for immediate use (i.e. the main render cannot proceed without them), then you're pretty much going to either block the main thread waiting, or have it spin doing something visually interesting (e.g. an animated loading screen) waiting for the resources to load.
If you want to stream resources while the app is doing real rendering, then the main trick is to load resources asynchronously in the background and only switch to using those resources on the main thread once they are already loaded. If the main thread ever ends up actually blocked on a semaphore, then you've probably already started dropping frames, so your "engine" design needs to ensure that never happens. A lot of games use simple low-detail proxy objects as stand-ins while the high-detail version is loading in the background.
None of this is particularly related to the graphics API - both GL and Vulkan need the same macro-scale behavior. Vulkan API features don't particularly help because the major bottlenecks which cause problems here are storage/network/CPU which have nothing to do with the graphics part of the problem.
I decided to trust threads!
At first it seems to work, although I get a lot of:
[MESSAGE:Validation Error: [ UNASSIGNED-Threading-MultipleThreads ] Object 0: handle = 0x56414228bad8, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0x141cb623 | THREADING ERROR : vkQueueSubmit(): object of type VkQueue is simultaneously used in thread 0x7f6b977fe640 and thread 0x7f6bc2bcb740]
But it works!
So, the basic idea is to have a thread for loading objects while the engine is drawing. This thread takes care of creating the UBO for the object's location; then, once the geometry is loaded from RAM, it creates the VBO and IBO (I've left materials with images/UBOs on hold for now), then creates the graphics pipeline (with layout, descriptor layout, and shaders compiled with glslang on the fly; the next idea is to reuse pipelines for similar needs), and finally sets a flag to say the object is ready to use. On the other side, my main thread keeps rendering and picks up new objects when they show up as ready.
I think it works because I have a capable video card (GTX 1070) with multiple queues set up: one for graphics and another for transfer.
I'm pretty sure this will crash or go poorly on a GPU with a single queue, and that is probably why the validation layers give me these messages.
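A minimal, API-agnostic sketch of that ready-flag hand-off (the types and names are illustrative, not the poster's actual code):

#include <atomic>
#include <memory>

struct Renderable { /* UBO, VBO/IBO handles, pipeline, ... */ };

struct StreamedObject {
    std::shared_ptr<Renderable> proxy;        // optional cheap stand-in
    std::shared_ptr<Renderable> full;         // filled by the loader thread
    std::atomic<bool>           ready{false}; // flipped once 'full' is usable
};

// Loader thread: build everything, publish it, then raise the flag.
void LoadObject(StreamedObject& obj)
{
    auto r = std::make_shared<Renderable>();
    // ... create UBO, load geometry, create VBO/IBO, build pipeline ...
    obj.full = std::move(r);
    obj.ready.store(true, std::memory_order_release);
}

// Render thread: never blocks, just checks the flag each frame.
const Renderable* PickForThisFrame(const StreamedObject& obj)
{
    if (obj.ready.load(std::memory_order_acquire))
        return obj.full.get();
    return obj.proxy.get();   // may be null: simply skip drawing it
}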

Multiple OpenGL contexts, multiple windows, multithreading, and vsync

I am creating a graphical user interface application using OpenGL in which there can be any number of windows - "multi-document interface" style.
If there were one window, the main loop could look like this:
handle events
draw()
swap buffers (vsync causes this to block until vertical monitor refresh)
However consider the main loop when there are 3 windows:
each window handle events
each window draw()
window 1 swap buffers (block until vsync)
(some time later) window 2 swap buffers (block until vsync)
(some time later) window 3 swap buffers (block until vsync)
Oops... now rendering one frame of the application is happening at 1/3 of the proper framerate.
Workaround: Utility Window
One workaround is to have only one of the windows with vsync turned on, and the rest of them with vsync turned off. Call swapBuffers() on the vsync window first and draw that one, then draw the rest of the windows and swapBuffers() on each one.
This workaround will probably look fine most of the time, but it's not without issues:
it is inelegant to have one window be special
a race condition could still cause screen tearing
some platforms ignore the vsync setting and force it to be on
I read that switching which OpenGL context is bound is an expensive operation and should be avoided.
Workaround: One Thread Per Window
Since there can be one OpenGL context bound per thread, perhaps the answer is to have one thread per window.
I still want the GUI to be single threaded, however, so the main loop for a 3-window situation would look like this:
(for each window)
lock global mutex
handle events
draw()
unlock global mutex
swapBuffers()
Will this work? This other question indicates that it will not:
It turns out that the windows are 'fighting' each other: it looks like
the SwapBuffers calls are synchronized and wait for each other, even
though they are in separate threads. I'm measuring the frame-to-frame
time of each window and with two windows, this drops to 30 fps, with
three to 20 fps, etc.
To investigate this claim I created a simple test program. This program creates N windows and N threads, binds one window per thread, requests each window to have vsync on, and then reports the frame rate. So far the results are as follows:
Linux, X11, 4.4.0 NVIDIA 346.47 (2015-04-13)
frame rate is 60fps no matter how many windows are open.
OSX 10.9.5 (2015-04-13)
frame rate is not capped; swap buffers is not blocking.
Workaround: Only One Context, One Big Framebuffer
Another idea I thought of: have only one OpenGL context, and one big framebuffer, the size of all the windows put together.
Each frame, each window calls glViewport to set their respective rectangle of the framebuffer before drawing.
After all drawing is complete, swapBuffers() on the only OpenGL context.
I'm about to investigate whether this workaround will work or not. Some questions I have are:
Will it be OK to have such a big framebuffer?
Is it OK to call glViewport multiple times every frame?
Will the windowing library API that I am using even allow me to create OpenGL contexts independent of windows?
Wasted space in the framebuffer if the windows are all different sizes?
Camilla Berglund, maintainer of GLFW says:
That's not how glViewport works. It's not
how buffer swapping works either. Each window will have a
framebuffer. You can't make them share one. Buffer swapping is
per window framebuffer and a context can only be bound to a single
window at a time. That is at OS level and not a limitation of
GLFW.
Workaround: Only One Context
This question indicates that this algorithm might work:
Activate OpenGL context on window 1
Draw scene in to window 1
Activate OpenGL context on window 2
Draw scene in to window 2
Activate OpenGL context on window 3
Draw scene in to window 3
For all Windows
SwapBuffers
According to the question asker,
With V-Sync enabled, SwapBuffers will sync to the slowest monitor and
windows on faster monitors will get slowed down.
It looks like they only tested this on Microsoft Windows and it's not clear that this solution will work everywhere.
Also once again many sources tell me that makeContextCurrent() is too slow to have in the draw() routine.
It also looks like this is not spec conformant with EGL. In order to allow another thread to eglSwapBuffers(), you have to eglMakeCurrent(NULL) which means your eglSwapBuffers now is supposed to return EGL_BAD_CONTEXT.
The Question
So, my question is: what's the best way to solve the problem of having a multi-windowed application with vsync on? This seems like a common problem but I have not yet read a satisfying solution for it.
Similar Questions
Similar to this question: Synchronizing multiple OpenGL windows to vsync but I want a platform-agnostic solution - or at least a solution for each platform.
And this question: Using SwapBuffers() with multiple OpenGL canvases and vertical sync? but really this problem has nothing to do with Python.
swap buffers (vsync causes this to block until vertical monitor refresh)
No, it doesn't block. The buffer swap call may return immediately without blocking. What it does, however, is insert a synchronization point so that execution of commands altering the back buffer is delayed until the buffer swap has happened. The OpenGL command queue is of limited length, so once the command queue is full, further OpenGL calls will block the program until more commands can be pushed into the queue.
Also, the buffer swap is not an OpenGL operation. It's a graphics/windowing-system-level operation and happens independently of the OpenGL context. Just look at the buffer swap functions: the only parameter they take is a handle to the drawable (= window). In fact, even if you have multiple OpenGL contexts operating on a single drawable, you swap the buffer only once; and you can do it without an OpenGL context being current on the drawable at all.
So the usual approach is:
' first do all the drawing operations
foreach w in windows:
    foreach ctx in w.contexts:
        ctx.make_current(w)
        do_opengl_stuff()
        glFlush()

' with all the drawing commands issued
' loop over all the windows and issue
' the buffer swaps.
foreach w in windows:
    w.swap_buffers()
Since the buffer swap does not block, you can issue all the buffer swaps for all the windows, without getting delayed by V-Sync. However the next OpenGL drawing command that addresses a back buffer issued for swapping will likely stall.
A workaround for that is to use an FBO into which the actual drawing happens, combined with a loop that blits each FBO to its back buffer just before the swap-buffer loop:
' first do all the drawing operations
foreach w in windows:
    foreach ctx in w.contexts:
        ctx.make_current(w)
        glBindFramebuffer(GL_DRAW_FRAMEBUFFER, ctx.master_fbo)
        do_opengl_stuff()
        glFlush()

' blit the FBOs' renderbuffers to the main back buffer
foreach w in windows:
    foreach ctx in w.contexts:
        ctx.make_current(w)
        glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0)
        blit_renderbuffer_to_backbuffer(ctx.master_renderbuffer)
        glFlush()

' with all the drawing commands issued
' loop over all the windows and issue
' the buffer swaps.
foreach w in windows:
    w.swap_buffers()
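A concrete version of the first (non-FBO) variant, using GLFW purely as an example windowing library (a sketch under the assumption of one context per window; the drawing itself is a placeholder):

#include <GLFW/glfw3.h>
#include <vector>

// Placeholder for the real per-window drawing.
static void drawScene(GLFWwindow* /*window*/)
{
    glClearColor(0.1f, 0.1f, 0.1f, 1.0f);
    glClear(GL_COLOR_BUFFER_BIT);
}

// Issue all drawing for every window first, then issue every swap, so no
// window's rendering waits behind another window's vsync.
void renderAllWindows(const std::vector<GLFWwindow*>& windows)
{
    for (GLFWwindow* win : windows) {
        glfwMakeContextCurrent(win);
        drawScene(win);
        glFlush();               // issue the commands, don't wait for them
    }
    for (GLFWwindow* win : windows) {
        glfwSwapBuffers(win);    // swapping is a window-system operation
    }
}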
Thanks @andrewrk for all this research. I personally do it like this:
Create the first window and its OpenGL context with double buffering.
Enable vsync on this window (swap interval 1).
Create the other windows and attach the first context, with double buffering.
Disable vsync on these other windows (swap interval 0).
For each frame:
For each window, in reverse order (the one with vsync enabled last):
wglMakeCurrent(hdc, commonContext);
draw.
SwapBuffers
This way I achieve vsync, and all windows are paced by that same vsync.
But I encountered a problem without Aero: tearing...
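In WGL terms, the per-frame loop described above is roughly (a sketch; window and context creation, context sharing, and the wglSwapIntervalEXT setup are assumed to have been done as described):

#include <windows.h>
#include <GL/gl.h>
#include <vector>

struct Win { HDC hdc; };   // illustrative wrapper; the vsync-enabled window goes last

void renderFrame(HGLRC commonContext, const std::vector<Win>& windows)
{
    for (const Win& w : windows) {
        wglMakeCurrent(w.hdc, commonContext);
        glClear(GL_COLOR_BUFFER_BIT);  // placeholder for real drawing
        SwapBuffers(w.hdc);            // only the last (vsync) window waits
    }
}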

OpenGL multithreaded scene graph traversal

I am seeking to improve performance by reducing scene graph traversal overhead before each render call. I am not very experienced with multi-threaded software design, so after reading a couple of articles on multi-threaded rendering I am unsure how to approach this issue.
My rendering engine is completely deterministic and renders frames based on incoming transformation instructions, sequentially at each new frame. I currently picture the threaded scene graph update routine as something like this:
--------------CPU------------------------|------GPU------|--Frame Number--|
Update Frame 0 Transforms (spawn thread) | GL RenderCall |    Frame 0     |
Update Frame 1 Transforms (spawn thread) | GL RenderCall |    Frame 1     |
Update Frame 2 Transforms (spawn thread) | GL RenderCall |    Frame 2     |
...
Before the first draw call I start updating the next frame's (Frame 1) transforms in a separate thread and proceed with the render call. At the end of that call I start a new thread for the update of frame 2, check whether the thread for frame 1 is done, and if so, issue the next render call. And so on.
That is how I see this happening. I have 2 questions:
1. Is this a proper (simple) way to design this kind of system?
2. What is the likelihood of render loop stalls because the scene graph update thread hasn't finished its update in sync with the start of the next render call?
I know some people here will say it depends on the specific scene graph's complexity, but I would like to know how it usually goes in practice and what the major drawbacks of such a design are.
As you probably know, you shouldn't render to a common OpenGL drawable from multiple threads, as this results in a net slowdown. However, preparing the drawing, a.k.a. the frame setup, is a valid step to parallelize. It always boils down to generating a linear list of objects to draw, in order to maximize throughput and produce a correct result.
Of course the actual generation steps depend on the structure used, but for a multithreaded design it usually boils down to a map-and-reduce kind of approach. Creating and synchronizing threads has a certain overhead. Luckily those problems are addressed by systems like OpenMP. I also suggest you perform the frame setup phase during the SwapBuffers wait of the preceding frame.
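For instance, the frame setup might be parallelized with OpenMP along these lines (a sketch; the scene graph types and the per-node work are placeholders):

#include <omp.h>
#include <vector>

struct Node     { /* local transform, geometry reference, ... */ };
struct DrawItem { /* resolved world matrix, material, mesh handle, ... */ };

// Pure CPU work, no GL calls: resolve one node into a draw-list entry.
static DrawItem prepareNode(const Node& node)
{
    (void)node;               // placeholder for real transform/LOD resolution
    DrawItem item{};
    return item;
}

// Map step: process the flattened nodes in parallel.
// Reduce step: the ordered vector is the linear list handed to the GL thread.
std::vector<DrawItem> buildDrawList(const std::vector<Node>& nodes)
{
    std::vector<DrawItem> items(nodes.size());
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(nodes.size()); ++i)
        items[i] = prepareNode(nodes[i]);
    return items;
}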

How can I load a texture in a separate thread in cocos2d-x?

I faced the need to use multi-threading to load an additional texture on the fly in order to reduce the memory footprint.
The example case is that I have 10 types of enemy to use in a single level, but the enemies come out type by type. "Type by type" means one type of enemy comes out, the player kills all of its instances, and then it's time to call in another type. The process goes like this until all types have come out, and then the level is complete.
You can see it's better not to load all the enemies' textures at once at the start (each is pretty big, 2048x2048, with lots of animation frames inside, which I need to create at the moment each type of enemy is spawned). So I'm turning to multi-threading to load an additional texture when I need it. But I know that cocos2d-x is not thread-safe. I planned to use the CCSpriteFrameCache class to load a texture from a .plist + .png pair, re-create the animations there, and finally create a CCSprite from them to represent a new enemy instance. If I don't do this on another thread, I might suffer a noticeable lag while the large texture loads.
So how can I load a texture in a separate thread in cocos2d-x, following my goal above? Any idea that avoids thread-safety issues but still accomplishes my goal is also appreciated.
Note: I'm developing on iOS platform.
I found that async loading of images is already available inside cocos2d-x.
You can build the cocos2d-x test project and look at "Texture2DTest", then tap the left arrow to see what async loading looks like.
I have taken a look inside the code.
You can use the addImageAsync method of CCTextureCache to load an additional texture on the fly without interfering with or slowing down other parts, such as an animation that is currently running.
In fact, addImageAsync of CCTextureCache loads the CCTexture2D object for you and hands it back to your callback method. It's then up to you to make use of it.
Please note that CCSpriteFrameCache uses CCTextureCache to load frames, so this applies to my case as well: loading a spritesheet consisting of frames to be used in animation creation. Unfortunately, no async variant is provided for the CCSpriteFrameCache class, so you have to load the texture object manually via CCTextureCache and then plug it into
void CCSpriteFrameCache::addSpriteFramesWithFile(const char *pszPlist, CCTexture2D *pobTexture)
There are 2 files in the test project you can take a look at:
Texture2dTest.cpp
TextureCacheTest.cpp
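Putting it together, the load might look something like this (cocos2d-x 2.x-style API, matching the signature above; "enemy_type2.png" and "enemy_type2.plist" are placeholder asset names):

#include "cocos2d.h"
USING_NS_CC;

class EnemyLoader : public CCObject
{
public:
    // Kick off the async load; this returns immediately, so the current
    // animation keeps running while the texture loads in the background.
    void startLoading()
    {
        CCTextureCache::sharedTextureCache()->addImageAsync(
            "enemy_type2.png", this,
            callfuncO_selector(EnemyLoader::onTextureLoaded));
    }

    // Called back on the main thread with the loaded CCTexture2D.
    void onTextureLoaded(CCObject* pObj)
    {
        CCTexture2D* texture = static_cast<CCTexture2D*>(pObj);
        CCSpriteFrameCache::sharedSpriteFrameCache()->addSpriteFramesWithFile(
            "enemy_type2.plist", texture);
        // Frames are now cached: build the CCAnimation / CCSprite here.
    }
};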
