DirectX produces TDR under Intel when reusing command queue - graphics

I'm working on a DirectX rendering environment and am encountering undefined behaviour on Intel cards.
I'm currently using trivial synchronization, so that the CPU always waits for the GPU to finish after every ExecuteCommandLists().
Everything works perfectly fine, drawing, copying my resources etc.
Since I work with different "passes" (draw passes, memory transfer pass, clear pass), I also have a "present" pass. There I have one command list with just two tasks, which uses a resource barrier to transition the render target from the render-target state to the present state. I use the same command queue that is used for the swap chain and where all the drawing happens. Previous tasks were fully executed and waited for.
When I try to execute the two tasks, it produces a device removal on Intel cards without any further explanation. On cards from other vendors it works perfectly fine.
It definitely has something to do with the command queue: when I remove the resource barrier calls, it still crashes. If I create a new command queue, execution gets through, but it then crashes as well, this time with a useful debug message saying that I need to use the same queue as the swap chain.
What's going wrong here, and how can I get more information about the device removal?
Thanks for your help!

Related

Loading/removing dynamically buffers with Vulkan

I switched to Vulkan from OpenGL to use multi-threading improvements.
In OpenGL, I was able to dynamically load objects into the scene (buffers, textures, etc.) while rendering by using a waiting system. I loaded all the app-side data in a thread; then, when it was ready, just before rendering a frame in the main thread, I sent everything to video memory. That worked fine.
With Vulkan, I know I can call some functions from multiple threads without provoking the well-known segfault from OpenGL. But this doesn't work with vkQueueSubmit(). I already know, because I tried the naive way. To me, it seems logical that you can't use a queue from multiple threads at once.
I came up with some ideas, but I don't know which one is good or bad.
First, I could go the OpenGL way: prepare everything I can on the CPU/app side, then, just before rendering a frame, submit the buffers (with a transfer queue) to video memory. But I feel there is no real improvement over the OpenGL way...
Second, I could use the synchronization mechanisms to send buffers from one thread and render from another. But I keep reading that there are a lot of ways to slow everything down by causing unnecessary locks or by using semaphores and fences incorrectly.
So my question is basically which path to pick to solve this problem. How can I load a buffer dynamically from another thread while the main thread is rendering, without hurting performance too much? How can Vulkan help?
If you want to stream resources for immediate use (i.e. the main render cannot proceed without them), then you're pretty much going to either block the main thread waiting, or have it spin doing something visually interesting (e.g. an animated loading screen) waiting for the resources to load.
If you want to stream resources while the app is doing real rendering, then the main trick is to load resources asynchronously in the background and only switch to using those resources in the main thread once they are already loaded. If the main thread ever ends up actually blocked on a semaphore then you've probably already started dropping frames, so your "engine" design needs to ensure that never happens. A lot of games use simple low-detail proxy objects as stand-in versions while the high-detail version is loading in the background.
None of this is particularly related to the graphics API - both GL and Vulkan need the same macro-scale behavior. Vulkan API features don't particularly help because the major bottlenecks which cause problems here are storage/network/CPU which have nothing to do with the graphics part of the problem.
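Since the pattern described above is independent of the graphics API, it can be sketched with nothing but standard C++. This is only a minimal illustration: `Model` and `StreamedModel` are hypothetical names invented here, the sleep stands in for the real storage/network/CPU work, and a real engine would also have to upload the data to the GPU before flipping the flag.

```cpp
#include <atomic>
#include <chrono>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-in for a renderable object; not a real engine type.
struct Model {
    std::string name;
};

// Holds a low-detail proxy that is always usable, plus a high-detail
// version that a background thread fills in later.
class StreamedModel {
public:
    explicit StreamedModel(Model proxy) : proxy_(std::move(proxy)) {}

    ~StreamedModel() {
        if (loader_.joinable()) loader_.join();
    }

    // Starts loading on a background thread. Assumed to be called at most
    // once per object. The sleep stands in for disk/network/decompression.
    void beginLoad(std::string name) {
        loader_ = std::thread([this, name = std::move(name)] {
            std::this_thread::sleep_for(std::chrono::milliseconds(20));
            full_ = Model{name};
            // Publish: the write to full_ is visible to any thread that
            // later observes ready_ == true with acquire ordering.
            ready_.store(true, std::memory_order_release);
        });
    }

    // Called by the render thread once per frame. Never blocks.
    const Model& current() const {
        return ready_.load(std::memory_order_acquire) ? full_ : proxy_;
    }

    bool ready() const { return ready_.load(std::memory_order_acquire); }

private:
    Model proxy_;
    Model full_;
    std::atomic<bool> ready_{false};
    std::thread loader_;
};
```

The render loop would call `current()` every frame and draw whatever it returns, so it keeps drawing the proxy until the loader publishes the full version and never waits on the loader.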
I decided to trust threads!
At first it seems to work, although I get a lot of:
[MESSAGE:Validation Error: [ UNASSIGNED-Threading-MultipleThreads ] Object 0: handle = 0x56414228bad8, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0x141cb623 | THREADING ERROR : vkQueueSubmit(): object of type VkQueue is simultaneously used in thread 0x7f6b977fe640 and thread 0x7f6bc2bcb740]
But it works!
So, the basic idea is to have a thread for loading objects while the engine is drawing. This thread takes care of creating the UBO for the location of the object; then, when the geometry is loaded from RAM, it creates the VBO and IBO (I left materials with image/UBO on hold for now), then creates the graphics pipeline (with layout, descriptor layout, and shaders compiled with GLSLang on the fly; the next idea is to reuse pipelines for similar needs), and finally sets a flag to say the object is ready to use. Meanwhile, my main thread renders and picks up new objects when they show up as ready.
I think it works because I have a forgiving video card (GTX 1070) with a multiple-queue setup: I had one queue set up for graphics and another one for transfer.
I'm pretty sure this will crash or behave poorly on a GPU with a single queue, and this is probably why the validation layers gave me these messages.
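The validation message above names a single VkQueue handle being used from two threads at the same time. The Vulkan specification lists the queue passed to vkQueueSubmit() as externally synchronized, so the application has to serialize access to it. One common way to do that is a mutex around every submit on a given queue; the sketch below shows the idea with standard C++ only. `FakeQueue`, `LockedQueue` and `submitFromTwoThreads` are hypothetical names, and `submitUnsafe()` merely stands in for the real vkQueueSubmit() call.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a VkQueue plus the work submitted to it.
// In a real renderer submitUnsafe() would be the vkQueueSubmit() call.
struct FakeQueue {
    std::vector<int> submitted;  // not thread-safe on its own
    void submitUnsafe(int batchId) { submitted.push_back(batchId); }
};

// Serializes access to one queue so that only one thread at a time
// is inside the submit call.
class LockedQueue {
public:
    void submit(int batchId) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.submitUnsafe(batchId);
    }

    std::size_t submittedCount() {
        std::lock_guard<std::mutex> lock(mutex_);
        return queue_.submitted.size();
    }

private:
    std::mutex mutex_;
    FakeQueue queue_;
};

// Drives the same queue from a "loader" thread and a "render" thread.
inline std::size_t submitFromTwoThreads(int perThread) {
    LockedQueue queue;
    std::thread loader([&] {
        for (int i = 0; i < perThread; ++i) queue.submit(i);
    });
    std::thread render([&] {
        for (int i = 0; i < perThread; ++i) queue.submit(i);
    });
    loader.join();
    render.join();
    return queue.submittedCount();
}
```

The lock is held only for the duration of the submit itself, so the loader thread can still record command buffers and prepare data in parallel with rendering. An alternative is to give the loader its own transfer queue that no other thread touches, which avoids the lock but needs hardware that exposes such a queue.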

Why does vkAcquireNextImageKHR() never block my thread?

I am using the Vulkan graphics API (via BGFX) to render, and I have been measuring how much (wall-clock) time my calls take.
What I do not understand is that vkAcquireNextImageKHR() is always fast and never blocks, even though I disable the time-out and use a semaphore to wait for presentation.
The presentation is locked to a 60Hz display rate, and I see my main-loop indeed run at 16.6 or 33.3 ms.
Shouldn't I see the wait-time for this display rate show up in the length of the vkAcquireNextImageKHR() call?
The profiler measures this call as 0.2ms or so, and never a substantial part of a frame.
    VkResult result = vkAcquireNextImageKHR(
          m_device
        , m_swapchain
        , UINT64_MAX
        , renderWait
        , VK_NULL_HANDLE
        , &m_backBufferColorIdx
        );
Target hardware is a handheld console.
The whole purpose of Vulkan is to alleviate CPU bottlenecks. Making the CPU stop until the GPU is ready for something would be the opposite of that. Especially if the CPU itself isn't actually going to use the result of this operation.
As such, all the vkAcquireNextImageKHR function does is let you know which image will be made available to you next. This is the minimum that needs to happen in order for you to be able to use that image (for example, by building command buffers that reference the image in some way). However, that image is not yet available to you.
This is why this function requires you to provide a semaphore and/or a fence: so that the process which consumes the image can wait for the image to be made available.
If the process which consumes the image is just a bunch of commands in a command buffer (i.e. something you submit with vkQueueSubmit), you can simply have that batch of work wait on the semaphore given to the acquire operation. That means all of the waiting happens on the GPU, where it belongs.
The fence is there if you (for some reason) want the CPU to be able to wait until the acquire is done. But Vulkan, as an explicit, low-level API, forces you to explicitly say that this is what you want (and it almost never is what you want).

AAssetManager_open deadlocks on Pixel/Pixel2 XL

I have an app in beta testing that is getting ANRs on a Pixel and a Pixel 2 XL, both running 8.1. From the logs I am getting back, it appears that the call to AAssetManager_open is blocked waiting for a mutex to unlock.
From the logs there are no discernible threading conflicts. On one device it happens on the third asset read as the app is loading; all of the reads are separate (and clean up after themselves) but sequential. On the other device the deadlock occurs later. No other threads are touching related code.
I have yet to encounter this issue on another device, so I can't even experiment with it locally to understand it further (I don't have either device). From what Android source I could find, the mutex that is locked doesn't have anything complex about its usage.
The calls are happening on the main thread (hence the ANRs), so I can patch the issue by moving them off there. Ideally, though, I want to fix (or at least understand the cause of) the underlying problem of why the deadlock is happening on these devices in the first place.
So, what I am wondering is if there are any known ways to create a deadlock with AAssetManager_open?
On a side note, the closest thing I have found is a single article that mentions in passing people getting ANRs on AAssetManager_open in the Oreo pre-release builds, but I can't find anything else on that.
Edit: I now have a crash on a different 8.1 device (OnePlus5), so it appears not to be specific to Pixel but to 8.1 in general.
I also moved the AssetManager reads that were on the main thread off it, just in case, and as expected the issue still exists (I just don't get an ANR).
Edit #2: It is specific to 8.1 with AdMob mediation.
The cause of the deadlock may be an invalid pointer to AAssetManager. Please make sure that the pointer you get from AAssetManager_fromJava is still valid (i.e. has not been destroyed by the GC).
link to AAssetManager_fromJava description
Opening an asset may be a blocking operation. Some assets are stored uncompressed, and AAssetManager_open() can return a handle to work with the 'file' in-place. Other assets must be uncompressed to a local cache before they can be used. Some of these files will already be unzipped to your disk, and AAssetManager_open() will return almost instantaneously. Others must be unzipped, and this will compete for CPU with other threads, producing an ANR if you are unlucky.
The bottom line: do your best not to open assets from the UI thread.
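As a rough illustration of keeping the blocking open off the UI thread, the sketch below uses std::async from standard C++. `loadAssetBlocking` is a hypothetical stand-in for whatever wrapper the app has around AAssetManager_open and the subsequent read; the sleep only simulates decompression time. `loadAssetAsync` and `isReady` are likewise names made up for this example.

```cpp
#include <chrono>
#include <future>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-in for a blocking asset read (for example a wrapper
// around AAssetManager_open plus a read). The sleep simulates decompression.
inline std::string loadAssetBlocking(const std::string& path) {
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    return "contents of " + path;
}

// Starts the blocking read on a worker thread and returns immediately.
inline std::future<std::string> loadAssetAsync(std::string path) {
    return std::async(std::launch::async,
                      [p = std::move(path)] { return loadAssetBlocking(p); });
}

// Non-blocking check that the UI thread can call on each iteration.
inline bool isReady(const std::future<std::string>& f) {
    return f.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}
```

The UI thread would call `loadAssetAsync()` once, poll `isReady()` from its normal loop, and call `get()` only after it returns true, so it never sits inside the open call itself. This avoids the ANR but, as the question notes, does not by itself explain a genuine deadlock.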

multithreaded IDirect3DDevice9::CreateDevice freeze

I'm moving my renderer to a different thread.
During this process I'm making two calls to IDirect3D9::CreateDevice:
1. from the 'rendering thread' - in order to create a rendering device and resize it properly
2. from the 'main thread' - here I'm creating a Null device in order to compile shaders etc.
These calls can of course overlap (be made simultaneously), so I'm synchronizing them with a CriticalSection.
The problem is that one of these calls sometimes freezes. DirectX doesn't emit any warnings before that happens, so I suspect an internal deadlock.
I studied the documentation and it's mentioned that all calls that operate on a single device, especially IDirect3D9::CreateDevice, IDirect3DDevice9::TestCooperativeLevel and IDirect3DDevice9::Reset, need to be called from the same thread - but I have that covered.
So what am I missing? Can anyone please tell me?
Thanks,
Paksas
I only have a vague memory of this but:
The docs state "Any call to create, release, or reset the device must be done using the same thread as the window procedure of the focus window."
As I remember things, even if you try and create a device using a NULL HWND, internally Direct3D goes and digs one up for your app anyway.
Therefore one of your threads is surely violating the first point.
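One way to respect a "same thread" rule like the one quoted above is to funnel every create, reset and release call through a single owning thread instead of guarding the calls with a lock. The sketch below shows that idea in standard C++ only. `OwnerThread` is a hypothetical helper; in a real D3D9 application the owning thread would be the one running the focus window's message loop, and posting would more likely go through window messages than through this queue.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Runs every posted task on one dedicated thread, in posting order.
class OwnerThread {
public:
    OwnerThread() : worker_([this] { run(); }) {}

    ~OwnerThread() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    // Posts a task and returns a future for its result.
    template <typename F>
    auto post(F task) -> std::future<decltype(task())> {
        using R = decltype(task());
        auto packaged =
            std::make_shared<std::packaged_task<R()>>(std::move(task));
        std::future<R> result = packaged->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push([packaged] { (*packaged)(); });
        }
        cv_.notify_one();
        return result;
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stop requested, queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so posting never waits on a task
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stop_ = false;
    std::thread worker_;  // declared last so the members above exist first
};
```

Both the render thread and the main thread would then call something like `owner.post([] { /* create or reset the device here */ })` and wait on the returned future, so the calls can no longer overlap and always arrive on the same thread.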

Draw on top of suspended full-screen Direct3D app

Currently, I am able to hook onto Direct3D application and draw custom stuff onto its surface. However, I would like to suspend this application and then draw something else.
Is this even remotely possible? For example, by creating my own Direct3D window on top of that application?
I'm targeting only Windows 7, and the application I want to draw on uses only DirectX 9.
The problem is that I have very little experience with DirectX in general.
Sort of.
You're working with two different elements here. One is quite large but not particularly complex: hooking D3D. The other ("suspending" the app) is simple within that, but you don't quite want what you think you want.
To hook D3D, by the simplest method, you need to intercept the call to Direct3DCreate9 and return your own IDirect3D9, which later creates and returns your own IDirect3DDevice9. This will give you full control over the app's render process.
In order to "suspend" it, you need to wait for the desired trigger, then in your IDirect3DDevice9::Present, call your own event loop. This will, for all intents and purposes, suspend execution of the original app's code, but not the process itself (allowing your code and event loop to process). There will be some limitations of this, and you may not be able to consume window/Windows events (simply), but it will give you full control and effectively pause the original app.
Note, however, that you must intercept and reroute execution in every thread you want to "suspend": the technique only affects a single thread, and you don't want physics or AI crunching on while render and UI are paused.
You need to perform your overlay drawing, whatever that may be, during your loop or your IDirect3DDevice9::Present hook, then call the real device's Present method as needed. If you want to run multiple frames of your overlay, then call the real Present repeatedly before returning from your Present. Tweak as necessary. Rendering here is done pretty much normally (check out general D3D tutorials for that), but there is one major catch: the device's state is unknown and may be incompatible, but must be "untouched" on return. This is handled simply by caching an IDirect3DStateBlock9 created from the device immediately after creating it. In your Present hook, create another state block with the state on entrance, restore the clean state block, run your code, then restore the entrance state block. You can work with any states, from a clean slate, without damaging the device's state (I use this in practice, and it works great).
If you want some rather extensive examples of how this works, I'd suggest checking out the Voodoo Shader project, which has full D3D8 and 9 hooks, including everything needed for overlays [/shameless own-project promotion]. Feel free to reuse any of the concepts, or comment with further questions; this certainly isn't all the details that may be useful to you.
This is a very complex thing to accomplish, as it is very much a hack. The only people you see doing such things are Steam, TeamSpeak, Xfire, Fraps, and a few hard-core devs.
There are kits out on the internet that show you how to inject a DLL into the memory space of the target application to achieve such a feat, as well as other methods such as proxy DLLs.
Proxy DLL:
http://www.codeguru.com/cpp/g-m/directx/directx8/article.php/c11453
Injection:
http://www.progamercity.net/d3d/372-c-directx9-0-hooking-via-detours.html
Good luck, this will take you a while.
