(D3D11) Reading texel on separate thread - direct3d

In D3D10, I load a staging texture onto my GPU memory, then map it in order to access its texel data on the CPU. This is done on a separate thread, not the thread I render with. I just call the device methods, and it works.
In D3D11 I load the staging texture onto my GPU, but to access it (i.e. Map it) I need to use the Context, not the device. Can't use the immediate context, since the immediate context can only be used by a single thread at a time. But I also can't use a deferred context to Read from the texture to the CPU:
"If you call Map on a deferred context, you can only pass D3D11_MAP_WRITE_DISCARD, D3D11_MAP_WRITE_NO_OVERWRITE, or both to the MapType parameter. Other D3D11_MAP-typed values are not supported for a deferred context."
Ok, so what am I supposed to do now? It is common to use textures to store certain data (heightmaps for instance) and you obviously have to be able to access that data for it to be useful. Is there no way for me to do this in a separate thread with D3D11?

You should map the staging texture using the immediate context on the render thread, then use the contents as you wish on your second thread. Even in D3D10, the call to map the texture for read ends up putting a synchronization point in the command buffer (refer to this article), effectively serializing your threads. The D3D11 API makes an effort to discourage hidden performance costs like this.


Can I use a Deferred Context for Instancing?

Apologies in advance for my English.
I read the following here:
Things like map/discard and the like can cause a lot of memory to be consumed; for example, iirc if you 'map' on a deferred context the driver basically creates you a whole new copy/version of what you have just mapped and ties it to that context; when you factor in that the driver will buffer a few frames ahead memory can be eaten very quickly.
An instancing buffer can be large (MAX_INSTANCE_NUM * PER_INSTANCE_DATA).
So I think it is impossible to use a deferred context for instancing.
If I need to render to multiple windows, such as two different scene windows, is there no way other than locking the immediate context? Is it best to use a deferred context only for passes like shadowing, lighting, etc.?

DirectX11 resource Release Multi-Threading

I've read https://learn.microsoft.com/en-us/windows/desktop/direct3d11/overviews-direct3d-11-render-multi-thread-intro
And it states that I can make calls to ID3D11Device from multiple threads (unless D3D11_CREATE_DEVICE_SINGLETHREADED was used), but calls to ID3D11DeviceContext have to be surrounded with a critical section.
I haven't found any information about releasing resources, using their 'Release' method, for resources such as textures, render targets, vertex/index buffers, shaders.
ID3D11Texture2D, ID3D11Texture3D, ID3D11ShaderResourceView, ID3D11RenderTargetView, ID3D11DepthStencilView
ID3D11VertexShader, ID3D11HullShader, ID3D11DomainShader, ID3D11PixelShader.
1) Can I call 'Release' for those resources at any time from any thread without using critical sections while they ARE NOT in use by the render thread's ID3D11DeviceContext?
2) Can I call 'Release' for those resources from other threads even while they ARE in use by ID3D11DeviceContext in the render thread?
Or do I need to surround the Release calls with the same critical section used for accessing ID3D11DeviceContext?
Generally the internal implementation of COM reference counts is done in a thread-safe manner (atomic increments/decrements), so it's safe to call AddRef and Release from multiple threads.
Of course, if the refcount goes to 0 then you have an object destruction so it's important that if you have multiple threads using the same resource, it has the appropriate number of reference counts to keep it live. In Direct3D, object destruction is typically deferred destruction so the actual object cleanup may not happen for a few frames, but you should still keep a non-zero refcount if anyone is referencing it.
Direct3D 11 uses the same rules as Direct3D 10. It uses 'weak references' for the pipeline set methods, so just having a resource set on the device context is not sufficient to increase its reference count. IOW: if you have two threads both rendering with the same resource, then each thread must hold a reference count on the object to keep it 'live' whether or not it's 'actively set' on a device context at any given moment.
It works this way to avoid the overhead of constantly increment/decrementing reference counts every rendering frame. In Direct3D 9 this was happening thousands of times a frame or more.
Also, if the ID3D11Device reaches a zero ref-count, it and all its child objects are released regardless of the individual device-child reference counts.
See Microsoft Docs.
The best answer is to use a smart-pointer like Microsoft::WRL::ComPtr and have each thread using a given resource have its own ComPtr pointing to that resource. That way the only real special-case you'll have is when doing device tear-down (such as responding to a DXGI_ERROR_DEVICE_REMOVED or doing a 'clean exit').
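To make the ownership rule concrete without depending on Windows headers, here is a minimal portable sketch. It is only an analogy: std::shared_ptr stands in for a per-thread ComPtr, and std::weak_ptr stands in for the non-owning binding made by the pipeline set methods. The Resource and PipelineSlot types and both function names are hypothetical and are not part of Direct3D.

```cpp
#include <cassert>
#include <memory>
#include <thread>

// Stand-in for a device child such as a texture (hypothetical type).
struct Resource {
    int texels = 42;
};

// Models a pipeline slot: it observes the resource but does not own it.
struct PipelineSlot {
    std::weak_ptr<Resource> bound;
};

// The worker thread gets its own strong reference, like a per-thread ComPtr.
int use_on_thread(std::shared_ptr<Resource> res) {
    assert(res != nullptr);
    int result = 0;
    // Capturing 'res' by copy gives the thread its own reference.
    std::thread t([res, &result] { result = res->texels; });
    res.reset();  // the caller's reference is gone; the thread's copy keeps the object alive
    t.join();
    return result;
}

// Returns true if a binding alone keeps the resource alive (it does not).
bool binding_keeps_alive() {
    PipelineSlot slot;
    {
        auto res = std::make_shared<Resource>();
        slot.bound = res;  // "set" on the pipeline: no strong reference is taken
    }                      // last strong reference released here
    return !slot.bound.expired();
}
```

Calling use_on_thread(std::make_shared<Resource>()) reads the resource safely even though the caller drops its reference first, while binding_keeps_alive() shows that a non-owning binding is not enough on its own.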

shared context with GLUT on Linux

In my current setup, I have two displays that are being driven by two GPUs. Using GLUT, I create two windows (one per display) and render each one from the main thread by calling glutSetWindow() in the draw call, for each window.
The draw call renders a Texture2D as a sphere (using gluSphere()) but the Texture2D is swapped for another image every few seconds. I have set up an array of 2 Texture2D so I can load the next image while the current Texture2D is shown. This works well as long as everything runs in the main thread.
The problem is that the call to glTexImage2D(), to load the next image, hangs my draw call, so I need to call glTexImage2D() on a different thread. Calling glTexImage2D() on a different thread crashes, as it seems the OpenGL context is not shared. GLUT does not seem to provide a way to share the context, but I should be able to get the context on Linux via glXGetCurrentContext().
My question is: if I get the context via this call, how can I make it a shared context? And would this even work with GLUT? Another option would be to switch to a different library to replace GLUT, like GLFW, but in that case I will lose some handy functions such as gluSphere(). Any recommendation if the context cannot be shared with GLUT, please?
With GLX context sharing is established at context creation; unlike WGL you can't establish that sharing as an afterthought. Since GLUT doesn't have a context sharing feature (FreeGLUT may have one, but I'm not sure about that) this is not going to be straightforward.
"I have two displays that are being driven by two GPUs."
Unless those GPUs are SLi-ed or CrossFire-ed you can't establish context sharing between them.
"The problem is that the call to glTexImage2D(), to load the next image, hangs my draw call, so I need to call glTexImage2D() on a different thread."
If the images are of the same size, use glTexSubImage2D to replace it. Also image data can be loaded asynchronously using pixel buffer objects, using a secondary thread that doesn't even need an OpenGL context!
Outlining the steps:
In the OpenGL context thread:
  - initiate the transfer
  - signal the transfer thread
  - continue with normal drawing operations
In the transfer thread:
  - on signal to start transfer
      - copy data to the mapped buffer
      - signal the OpenGL context thread
In the OpenGL context thread:
  - on signal to complete transfer
      - sync = glFenceSync
      - keep on drawing with the old texture
  - on further iterations of the drawing loop
      - poll sync with glClientWaitSync using a timeout of 0
      - if the wait sync returns signalled, switch to the new texture and delete the old one
      - else keep on drawing with the old texture
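The handshake between the two threads in the steps above can be sketched in portable C++. This is a simulation under stated assumptions, not working OpenGL code: a std::vector stands in for the mapped pixel buffer object, the real GL calls appear only as comments, and the fence polling is described but not executed. The Stage, Transfer, transfer_thread and context_thread names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <thread>
#include <vector>

enum class Stage { Idle, Mapped, Filled };

// Hypothetical state shared between the context thread and the transfer thread.
struct Transfer {
    std::atomic<Stage> stage{Stage::Idle};
    unsigned char* mapped = nullptr;  // in real code: pointer returned by glMapBuffer
    std::size_t size = 0;
};

// Transfer thread: needs no OpenGL context, it only writes into mapped memory.
void transfer_thread(Transfer& t, const std::vector<unsigned char>& image) {
    while (t.stage.load(std::memory_order_acquire) != Stage::Mapped) {
        std::this_thread::yield();  // wait for the "start transfer" signal
    }
    assert(image.size() <= t.size);
    std::memcpy(t.mapped, image.data(), image.size());
    t.stage.store(Stage::Filled, std::memory_order_release);  // signal the context thread
}

// Context thread: returns how many loop iterations used the old texture during the copy.
int context_thread(const std::vector<unsigned char>& image,
                   std::vector<unsigned char>& pbo) {
    Transfer t;
    pbo.assign(image.size(), 0);  // stands in for glBufferData on GL_PIXEL_UNPACK_BUFFER
    t.mapped = pbo.data();        // stands in for glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY)
    t.size = pbo.size();

    std::thread worker(transfer_thread, std::ref(t), std::cref(image));
    t.stage.store(Stage::Mapped, std::memory_order_release);  // signal the transfer thread

    int frames_with_old_texture = 0;
    while (t.stage.load(std::memory_order_acquire) != Stage::Filled) {
        ++frames_with_old_texture;  // keep drawing with the old texture
        std::this_thread::yield();
    }
    // Real code would now call glUnmapBuffer, glTexSubImage2D (sourcing from the PBO)
    // and glFenceSync, then poll glClientWaitSync with a timeout of 0 on later frames
    // before switching to the new texture.
    worker.join();
    return frames_with_old_texture;
}
```

The point of the design is that only the context thread ever touches OpenGL, so no context sharing is required; the transfer thread sees nothing but a raw pointer.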

Why pass parameters through thread function?

When I create a new thread in a program... in its thread handle function, why do I pass variables that I want that thread to use through the thread function prototype as parameters (as a void pointer)? Since threads share the same memory segments (except for stack) as the main program, shouldn't I be able to just use the variables directly instead of passing parameters from main program to new thread?
Well, yes, you could use the variables directly. Maybe. Assuming that they aren't changed by some other thread before your thread starts running.
Also, a big part of passing parameters to functions (including thread functions) is to limit the amount of information the called function has to know about the outside world. If you pass the thread function everything it needs in order to do its work, then you can change the rest of the program with relative impunity and the thread will still continue to work. If, however, you force the thread to know that there is a global list of strings called MyStringList, then you can't change that global list without also affecting the thread.
Information hiding. Encapsulation. Separation of concerns. Etc.
You cannot pass parameters to a thread function in any kind of normal register/stack manner because thread functions are not called by the creating thread. They are given execution directly by the underlying OS, and the APIs that do this copy a fixed number of parameters (usually only one void pointer) to the new and different stack of the new thread.
As Jim says, failure to understand this mechanism often results in disaster. There are numerous questions on SO where the variables that developers hoped would be used by a new thread are RAII'd away before the new thread even starts.
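One way to avoid both problems is to hand the thread sole ownership of everything it needs. The following is a minimal sketch using std::thread rather than a raw OS API; the WorkerArgs struct and the worker and run_worker functions are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <memory>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical payload: everything the thread needs, and nothing else.
struct WorkerArgs {
    std::vector<std::string> names;
};

// The thread function knows only about its arguments, not about any global state.
void worker(std::unique_ptr<WorkerArgs> args, std::promise<std::size_t> result) {
    assert(args != nullptr);
    std::size_t total = 0;
    for (const std::string& n : args->names) {
        total += n.size();
    }
    result.set_value(total);
}

// Starts the worker and returns the total length of all strings it was given.
std::size_t run_worker(std::vector<std::string> names) {
    std::promise<std::size_t> result;
    std::future<std::size_t> done = result.get_future();
    std::thread t;
    {
        auto args = std::make_unique<WorkerArgs>();
        args->names = std::move(names);
        // Ownership moves into the new thread, so nothing the thread uses
        // can be destroyed by this scope ending.
        t = std::thread(worker, std::move(args), std::move(result));
    }
    std::size_t total = done.get();
    t.join();
    return total;
}
```

Because the arguments are moved into the thread, the creating scope can end immediately without invalidating anything, which is exactly the failure mode described above when a thread reads locals of its creator.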

Is there some kind of incompatibility with Boost::thread() and Nvidia CUDA?

I'm developing a generic streaming CUDA kernel execution Framework that allows parallel data copy & execution on the GPU.
Currently I'm calling the cuda kernels within a C++ static function wrapper, so I can call the kernels from a .cpp file (not .cu), like this:
//kernel definition
__global__ void kernelCall_kernel( dataRow* in, dataRow* out, void* additionalData){
    //Do something
}
//kernel handler, so I can compile this .cu and link it with the main project and call it within a .cpp file
extern "C" void kernelCall( dataRow* in, dataRow* out, void* additionalData){
    int blocksize = 256;
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(tableSize/(float)blocksize));
    kernelCall_kernel<<<dimGrid,dimBlock>>>(in, out, additionalData);
}
If I call the handler as a normal function, the data printed is right.
//allocations and definitions of data omitted
//copy data to GPU
//copy data back
//show result:
printTable(result_h,resultSize);// this just iterate and shows the data
But to allow parallel copy and execution of data on the GPU I need to create a thread, so when I call it making a new boost::thread:
//allocations, definitions of data,copy data to GPU omitted
boost::thread* kernelThreadOwner = new boost::thread(kernelCall, data_d,result_d,null);
//Copy data back and print omitted
I just get garbage when printing the result at the end.
Currently I'm just using one thread, for testing purposes, so there should not be much difference between calling it directly and creating a thread. I have no clue why calling the function directly gives the right result, and creating a thread does not. Is this a problem with CUDA & boost? Am I missing something? Thanks in advance.
The problem is that (pre CUDA 4.0) CUDA contexts are tied to the thread in which they were created. When you are using two threads, you have two contexts. The context that the main thread allocates in and reads from is not the same as the context of the thread that runs the kernel. Memory allocations are not portable between contexts. They are effectively separate memory spaces inside the same GPU.
If you want to use threads in this way, you either need to refactor things so that only one thread "talks" to the GPU and communicates with the parent via CPU memory, or use the CUDA context migration API, which allows a context to be moved from one thread to another (via cuCtxPushCurrent and cuCtxPopCurrent). Be aware that context migration isn't free, and there is latency involved, so if you plan to migrate contexts around frequently, you might find it more efficient to change to a different design which preserves context-thread affinity.
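The first option, a single thread that owns all GPU work, might be structured like the sketch below. It is plain C++ with no CUDA calls, so it only shows the threading shape: in a real program the submitted jobs would contain the cudaMalloc, cudaMemcpy and kernel launch calls, all of which would then run on the same thread and therefore in the same context. The GpuThread class and the all_jobs_share_one_thread function are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical helper: one worker thread that executes every submitted job.
class GpuThread {
public:
    GpuThread() : worker_([this] { run(); }) {}

    ~GpuThread() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        cv_.notify_one();
        assert(worker_.joinable());
        worker_.join();  // remaining jobs are drained before the thread exits
    }

    // Queue a job; the returned future carries its result back through CPU memory.
    template <class F>
    auto submit(F job) -> std::future<decltype(job())> {
        using R = decltype(job());
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(job));
        std::future<R> fut = task->get_future();
        {
            std::lock_guard<std::mutex> lock(m_);
            jobs_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return fut;
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stop_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stop requested and queue drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // always runs on the worker thread
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool stop_ = false;
    std::thread worker_;  // declared last so the other members exist before it starts
};

// Demo: two jobs report the same thread id, and it is not the caller's.
bool all_jobs_share_one_thread() {
    GpuThread gpu;
    auto a = gpu.submit([] { return std::this_thread::get_id(); });
    auto b = gpu.submit([] { return std::this_thread::get_id(); });
    std::thread::id ia = a.get();
    std::thread::id ib = b.get();
    return ia == ib && ia != std::this_thread::get_id();
}
```

With this shape the parent thread never touches device memory directly; it only receives host-side results through the futures, which preserves the context-thread affinity mentioned above.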
