DirectX11 resource Release Multi-Threading - multithreading

I've read the documentation at https://learn.microsoft.com/en-us/windows/desktop/direct3d11/overviews-direct3d-11-render-multi-thread-intro
And it states that I can make calls to ID3D11Device from multiple threads (unless D3D11_CREATE_DEVICE_SINGLETHREADED was used), but calls to ID3D11DeviceContext have to be surrounded with a critical section.
I haven't found any information about releasing resources through their 'Release' method, for resources such as textures, render targets, vertex/index buffers, and shaders:
ID3D11Texture2D, ID3D11Texture3D, ID3D11ShaderResourceView, ID3D11RenderTargetView, ID3D11DepthStencilView
ID3D11Buffer.
ID3D11VertexShader, ID3D11HullShader, ID3D11DomainShader, ID3D11PixelShader.
1) Can I call 'Release' for those resources at any time from any thread without using critical sections while they ARE NOT in use by the render thread's ID3D11DeviceContext?
2) Can I call 'Release' for those resources from other threads even while they ARE in use by ID3D11DeviceContext in the render thread?
Or do I need to surround the Release calls with the same critical section used for accessing ID3D11DeviceContext?

Generally the internal implementation of COM reference counts is done in a thread-safe manner (atomic increments/decrements), so it's safe to call AddRef and Release from multiple threads.
Of course, if the refcount goes to 0 then you have an object destruction so it's important that if you have multiple threads using the same resource, it has the appropriate number of reference counts to keep it live. In Direct3D, object destruction is typically deferred destruction so the actual object cleanup may not happen for a few frames, but you should still keep a non-zero refcount if anyone is referencing it.
Direct3D 11 uses the same rules as Direct3D 10. It uses 'weak references' for the pipeline set methods, so just having a resource set on the device context is not sufficient to increase its reference count. IOW: if you have two threads both rendering with the same resource, then each thread must hold a reference count on the object to keep it 'live' whether or not it's 'actively set' on a device context at any given moment.
It works this way to avoid the overhead of constantly incrementing/decrementing reference counts every rendering frame. In Direct3D 9 this was happening thousands of times a frame or more.
Also, if the ID3D11Device reaches a zero ref-count, it and all its child objects are released regardless of the individual device-child reference counts.
See Microsoft Docs.
The best answer is to use a smart pointer like Microsoft::WRL::ComPtr and have each thread that uses a given resource hold its own ComPtr pointing to that resource. That way the only real special case you'll have is device tear-down (such as responding to a DXGI_ERROR_DEVICE_REMOVED or doing a 'clean exit').
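The "one reference per thread" rule can be sketched without any Direct3D headers. The block below is a minimal illustration, not the D3D11 API: FakeResource is a hypothetical stand-in that models only the atomic AddRef/Release behaviour of a COM object, and share_across_threads is an invented helper. In real code, Microsoft::WRL::ComPtr would make the AddRef/Release calls for you.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for a COM object such as ID3D11Buffer.
// Only the reference counting is modelled here.
struct FakeResource {
    std::atomic<unsigned long> refs{1};   // the creator owns the first reference
    std::atomic<bool>* destroyed;         // set once the last reference is dropped

    explicit FakeResource(std::atomic<bool>* flag) : destroyed(flag) {}

    unsigned long AddRef() {
        return refs.fetch_add(1, std::memory_order_relaxed) + 1;
    }

    unsigned long Release() {
        unsigned long left = refs.fetch_sub(1, std::memory_order_acq_rel) - 1;
        if (left == 0) {
            std::atomic<bool>* flag = destroyed;
            delete this;                  // whichever thread drops the last reference destroys
            flag->store(true);
        }
        return left;
    }
};

// Every worker gets its own reference, so the creator may Release at any
// time, from any thread, without an extra lock.
bool share_across_threads(int workers) {
    std::atomic<bool> destroyed{false};
    FakeResource* res = new FakeResource(&destroyed);

    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i) {
        res->AddRef();                    // reference handed over to the worker
        pool.emplace_back([res] {
            // ... use the resource here ...
            res->Release();               // the worker drops its own reference
        });
    }
    res->Release();                       // the creator lets go early
    for (auto& t : pool) t.join();
    return destroyed.load();              // true once the last reference is gone
}

void share_example() {
    assert(share_across_threads(4));
}
```

The object is destroyed exactly once, by whichever thread happens to drop the last reference, which is why no thread needs to know whether the others are finished.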

Related

can pthread_mutexattr_setrobust apply to pthread_rwlock_t?

The robustness of a mutex is very important to my program, since it lets me handle the case where a process dies without releasing the mutex.
But according to the documentation, pthread_mutexattr_setrobust only applies to pthread_mutex_t, not to pthread_rwlock_t. Is there any way to make a pthread_rwlock_t robust? Or is its implementation robust by default?
"according to the documentation, pthread_mutexattr_setrobust only applies to pthread_mutex_t"
More precisely, pthread_mutexattr_setrobust() sets a property of a pthread_mutexattr_t object, and these are used (only) for configuring objects of type pthread_mutex_t. This happens at initialization of the mutex via pthread_mutex_init().
The corresponding initialization function for read/write locks is pthread_rwlock_init(), and its documentation shows that the corresponding attribute object type, accepted by that function, is pthread_rwlockattr_t. Implementations may provide whatever properties they like as extensions, but the only one specified for this type by the current version of POSIX is pshared. Thus no, there is no (portable) robustness option for pthreads read/write locks.
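For comparison, here is a minimal sketch of how the robustness attribute reaches a mutex through pthread_mutexattr_t, written as C++ calling the POSIX C API. The helper names init_robust_mutex and lock_robust are made up for illustration, and the "repair" step is left as a comment because it depends on what the mutex protects. There is no equivalent call to make on a pthread_rwlockattr_t.

```cpp
#include <pthread.h>
#include <cassert>
#include <cerrno>

// Initialise *m as a robust mutex. Returns 0 on success, otherwise an
// errno-style error code from the failing pthread call.
int init_robust_mutex(pthread_mutex_t* m) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;

    rc = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (rc == 0) rc = pthread_mutex_init(m, &attr);

    pthread_mutexattr_destroy(&attr);   // the attribute object is no longer needed
    return rc;
}

// Lock a robust mutex. If the previous owner died while holding it, the
// lock call reports EOWNERDEAD and the caller must mark the mutex
// consistent again before unlocking.
int lock_robust(pthread_mutex_t* m) {
    int rc = pthread_mutex_lock(m);
    if (rc == EOWNERDEAD) {
        // The protected state may be half-updated: repair it here, then:
        rc = pthread_mutex_consistent(m);
    }
    return rc;
}

void robust_mutex_example() {
    pthread_mutex_t m;
    assert(init_robust_mutex(&m) == 0);
    assert(lock_robust(&m) == 0);
    assert(pthread_mutex_unlock(&m) == 0);
    assert(pthread_mutex_destroy(&m) == 0);
}
```

To cover process death, the mutex would also need to live in shared memory and have the pshared attribute set, which this sketch leaves out.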

User defined atomic less than

I've been reading and it seems that std::atomic doesn't support a compare and swap of the less/greater than variant.
I'm using OpenMP and need to safely update a global minimum value.
I was thinking this would be as easy as using a built-in API.
But alas, so instead I'm trying to come up with my own implementation.
I'm primarily concerned with the fact that I don't want to use an omp critical section to do a less than comparison every single time because it may incur significant synchronization overhead for very little gain in most cases.
But in those cases where a new global minimum is potentially found (less often), the synchronization overhead is acceptable. I'm thinking I can implement it using the following method. Hoping for someone to advise.
Use an std::atomic_uint as the global minimum.
Atomically read the value into a thread-local stack variable.
Compare the thread's candidate against that value and, if the candidate is less, attempt to enter a critical section.
Once synchronized, verify that the candidate is still less than the atomic value and update accordingly (the body of the critical section should be cheap, just update a few values).
This is for a homework assignment, so I'm trying to keep the implementation my own. Please don't recommend various libraries to accomplish this. But please do comment on the synchronization overhead that this operation can incur or if it's bad, elaborate on why. Thanks.
What you're looking for would be called fetch_min() if it existed: fetch old value and update the value in memory to min(current, new), exactly like fetch_add but with min().
This operation is not directly supported in hardware on x86, but on machines with LL/SC a compiler could emit slightly more efficient asm for it than it gets from emulating it with a CAS(old, min(old, new)) retry loop.
You can emulate any atomic operation with a CAS retry loop. In practice it usually doesn't have to retry, because the CPU that succeeded at doing a load usually also succeeds at CAS a few cycles later after computing whatever with the load result, so it's efficient.
See Atomic double floating point or SSE/AVX vector load/store on x86_64 for an example of creating a fetch_add for atomic<double> with a CAS retry loop, in terms of compare_exchange_weak and plain + for double. Do that with min and you're all set.
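A minimal sketch of that CAS retry loop, assuming an unsigned minimum as in the question. The name fetch_min and its signature are chosen here by analogy with fetch_add; std::atomic in the question's setting does not provide such a member.

```cpp
#include <atomic>
#include <cassert>

// Emulated fetch_min: atomically sets target to min(target, candidate)
// and returns the value that was observed before any update.
unsigned fetch_min(std::atomic<unsigned>& target, unsigned candidate,
                   std::memory_order order = std::memory_order_seq_cst) {
    unsigned old = target.load(std::memory_order_relaxed);
    // When compare_exchange_weak fails it writes the current value into
    // `old`, so the comparison is redone against fresh data on each retry.
    // The loop ends when the swap succeeds or the candidate is no longer
    // smaller than the stored value.
    while (candidate < old &&
           !target.compare_exchange_weak(old, candidate, order,
                                         std::memory_order_relaxed)) {
    }
    return old;
}

void fetch_min_example() {
    std::atomic<unsigned> m{10};
    assert(fetch_min(m, 7) == 10);   // 7 < 10, so the minimum becomes 7
    assert(m.load() == 7);
    assert(fetch_min(m, 9) == 7);    // 9 is not smaller, nothing is stored
    assert(m.load() == 7);
}
```

When the candidate is not smaller, the function performs only a load, so the common "no new minimum" case never issues an atomic read-modify-write.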
Re: clarification in comments: I think you're saying you have a global minimum, but when you find a new one, you want to update some associated data, too. Your question is confusing because "compare and swap on less/greater than" doesn't help you with that.
I'd recommend using atomic<unsigned> globmin to track the global minimum, so you can read it to decide whether or not to enter the critical section and update related state that goes with that minimum.
Only ever modify globmin while holding the lock (i.e. inside the critical section). Then you can update it + the associated data. It has to be atomic<> so readers that look at just globmin outside the critical section don't have data race UB. Readers that look at the associated extra data must take the lock that protects it and makes sure that updates of globmin + the extra data happen "atomically", from the perspective of readers that obey the lock.
#include <atomic>
#include <climits>
#include <mutex>

// Extradata is whatever payload accompanies the minimum; it is assumed to be defined elsewhere.
// Start at the largest value, otherwise no candidate could ever be lower than a zero-initialized globmin.
static std::atomic<unsigned> globmin{UINT_MAX};
std::mutex globmin_lock;
static struct Extradata globmin_extra;

void new_min_candidate(unsigned newmin, const struct Extradata &newdata)
{
    // light-weight early out check to avoid the critical section
    // No ordering requirement as long as globmin is monotonically decreasing with time
    if (newmin < globmin.load(std::memory_order_relaxed))
    {
        // enter a critical section. Use OpenMP stuff if you want, this is plain ISO C++
        std::lock_guard<std::mutex> lock(globmin_lock);
        // Check globmin again, after we've excluded other threads from modifying it and globmin_extra
        if (newmin < globmin.load(std::memory_order_relaxed)) {
            globmin.store(newmin, std::memory_order_relaxed);
            globmin_extra = newdata;
        }
        // else leave the critical section with no update:
        // another thread raced with us *outside* the critical section
        // release the lock / leave critical section (lock goes out of scope here: RAII)
    }
    // else do nothing
}
std::memory_order_relaxed is sufficient for globmin: there's no ordering required with anything else, just atomicity. We get atomicity / consistency for the associated data from the critical section/lock, not from memory-ordering semantics of loading / storing globmin.
This way the only atomic read-modify-write operation is the locking itself. Everything on globmin is either load or store (much cheaper). The main cost with multiple threads will still be bouncing the cache line around, but once you own a cache line, each atomic RMW is maybe 20x more expensive than a simple store on modern x86 (http://agner.org/optimize/).
With this design, if most candidates aren't lower than globmin, the cache line will stay in the Shared state most of the time, so the globmin.load(std::memory_order_relaxed) outside the critical section can hit in L1D cache. It's just an ordinary load instruction, so it's extremely cheap. (On x86, even seq_cst loads are just ordinary loads, and release stores are just ordinary stores, but seq_cst stores are more expensive. On other architectures where the default ordering is weaker, seq_cst / acquire loads need a barrier.)

(D3D11) Reading texel on separate thread

In D3D10, I load a staging texture onto my GPU memory, then map it in order to access its texel data on the CPU. This is done on a separate thread, not the thread I render with. I just call the device methods, and it works.
In D3D11 I load the staging texture onto my GPU, but to access it (i.e. Map it) I need to use the Context, not the device. Can't use the immediate context, since the immediate context can only be used by a single thread at a time. But I also can't use a deferred context to Read from the texture to the CPU:
"If you call Map on a deferred context, you can only pass D3D11_MAP_WRITE_DISCARD, D3D11_MAP_WRITE_NO_OVERWRITE, or both to the MapType parameter. Other D3D11_MAP-typed values are not supported for a deferred context."
http://msdn.microsoft.com/en-us/library/ff476457.aspx
Ok, so what am I supposed to do now? It is common to use textures to store certain data (heightmaps for instance) and you obviously have to be able to access that data for it to be useful. Is there no way for me to do this in a separate thread with D3D11?
You should map the staging texture using the immediate context on the render thread, then use the contents as you wish on your second thread. Even in D3D10, the call to map the texture for read ends up putting a synchronization point in the command buffer (refer to this article), effectively serializing your threads. The D3D11 API makes an effort to discourage hidden performance costs like this.

Delphi threading - which parts of code need to be protected/synchronized?

So far I thought that any operation done on a "shared" object (common to multiple threads) must be protected with "synchronize", no matter what. Apparently, I was wrong: in the code I've been studying recently there are plenty of classes (thread-safe ones, as the author claims) and only one of them uses a critical section for almost every method.
How do I find out which parts / methods of my code need to be protected with a critical section (or any other mechanism) and which do not?
So far I haven't stumbled upon any interesting explanation / article / blog note, all google results are:
a) examples of synchronization between thread and the GUI. From simple progressbar to most complex, but still the lesson is obvious: each time you access / modify the property of GUI component, do that in "Synchronize". But nothing more.
b) articles explaining Critical Sections, Mutexes etc. Just different approaches to protection/synchronization.
c) Examples of very very simple thread-safe classes (thread safe stack or list) - they all do the same - implement lock / unlock methods which do enter/leave critical section and return the actual stack/list pointer on locking.
Now I'm looking for an explanation of which parts of code should be protected.
It could be in the form of code ;) but please don't provide me with one more "using Synchronize to update progressbar" ... ;)
Thank you!
You are asking for specific answers to a very general question.
Basically, apart from UI operations, you should protect every access to shared memory or a shared resource, to prevent two potentially competing threads from:
reading inconsistent memory
writing memory at the same time
trying to use the same resource at the same time from more than one thread... unless the resource is thread-safe.
Generally, I consider any other operation thread safe, including operations that access non-shared memory or non-shared objects.
For example, consider this object:
type
  TThrdExample = class
  private
    FValue: Integer;
  public
    procedure Inc;
    procedure Dec;
    function Value: Integer;
    procedure ThreadInc;
    procedure ThreadDec;
    function ThreadValue: Integer;
  end;
ThreadVar
  ThreadValue: Integer;
Inc, Dec and Value are methods which operate on the FValue field. The methods are not thread safe until you protect them with some synchronization mechanism. It can be a TMultiReadExclusiveWriteSynchronizer for the Value function and a critical section for the Inc and Dec methods.
The ThreadInc and ThreadDec methods operate on the ThreadValue variable, which is declared as ThreadVar, so I consider them thread safe because the memory they access is not shared between threads: each call from a different thread will access a different memory address.
If you know that, by design, a class should be used only in one thread or inside other synchronization mechanisms, you're free to consider that thread safe by design.
If you want more specific answers, I suggest you try with a more specific question.
Best regards.
EDIT: Someone may say the integer field is a bad example, because you can consider integer operations atomic on Intel/Windows and so there is no need to protect it... but I hope you get the idea.
You misunderstood the TThread.Synchronize method.
The TThread.Synchronize and TThread.Queue methods execute the protected code in the context of the main (GUI) thread. That is why you should use Synchronize or Queue to update GUI controls (like a progress bar): normally only the main thread should access GUI controls.
Critical Sections are different - the protected code is executed in the context of the thread that acquired critical section, and no other thread is permitted to acquire the critical section until the former thread releases it.
You use critical section in case there's a need for a certain set of objects to be updated atomically. This means, they must at all times be either already updated completely or not yet updated at all. They must never be accessible in a transitional state.
For example, with a simple integer reading/writing this is not the case. The operation of reading integer as well as the operation of writing it are atomic already: you cannot read integer in the middle of processor writing it, half-updated. It's either old value or new value, always.
But if you want to increment the integer atomically, you have not one, but three operations you have to do at once: read the old value into a processor register, increment it, and write it back to memory. Each operation is atomic, but the three of them together are not.
One thread might read the old value (say, 200) and increment it by 5 in a register, and at the same time another thread might read the value too (still 200). Then the first thread writes back 205, while the second thread increments its copy of 200 to 203 and writes back 203, overwriting 205. The result of two increments (+5 and +3) should be 208, but it's 203 due to the non-atomicity of the operations.
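The 200 + 5 + 3 scenario can be sketched in C++ (used here instead of Delphi so the example is self-contained; Delphi has comparable interlocked routines). The helper add_concurrently is invented for this illustration. It uses std::atomic's fetch_add, which performs the read, add and write-back as one indivisible step, so neither addition can be lost.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Two threads each add to the same counter. fetch_add is a single atomic
// read-modify-write, so the two updates cannot overwrite one another.
int add_concurrently(int start, int a, int b) {
    std::atomic<int> value{start};
    std::thread t1([&value, a] { value.fetch_add(a); });
    std::thread t2([&value, b] { value.fetch_add(b); });
    t1.join();
    t2.join();
    return value.load();
}

void add_example() {
    // Both increments survive regardless of how the threads interleave.
    assert(add_concurrently(200, 5, 3) == 208);
}
```

With a plain int and `value = value + a` in each thread, the same program would have a data race and could produce 203 or 205, which is the lost update described above.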
So, you use critical sections when:
A variable, set of variables, or any resource is used from several threads and needs to be updated atomically.
It's not atomic by itself (for example, calling a function which is guarded by critical section inside of the function body, is an atomic operation already)
Have a read of this documentation
http://www.eonclash.com/Tutorials/Multithreading/MartinHarvey1.1/ToC.html
If you use messaging to communicate between threads then you can basically ignore synchronisation primitives completely because each thread only accesses its internal structures and the messages themselves. In essence this is far easier and more scalable architecture than using synchronisation primitives.

Design Pattern for multithreaded observers

In a digital signal acquisition system, often data is pushed into an observer in the system by one thread.
example from Wikipedia/Observer_pattern:
foreach (IObserver observer in observers)
    observer.Update(message);
When, for example, a user action on a GUI thread requires the data to stop flowing, you want to break the subject-observer connection, and even dispose of the observer altogether.
One may argue: you should just stop the data source, and wait for a sentinel value to dispose of the connection. But that would incur more latency in the system.
Of course, if the data pumping thread has just asked for the address of the observer, it might find it's sending a message to a destroyed object.
Has someone created an 'official' Design Pattern countering this situation? Shouldn't they?
If you want the data source to always be on the safe side of concurrency, it should have at least one pointer that is always safe for it to use.
So the Observer object should have a lifetime that isn't ended before that of the data source.
This can be done by only adding Observers, but never removing them.
You could have each observer not do the core implementation itself, but have it delegate this task to an ObserverImpl object.
You lock access to this impl object. This is no big deal, it just means the GUI unsubscriber would be blocked for a little while in case the observer is busy using the ObserverImpl object. If GUI responsiveness would be an issue, you can use some kind of concurrent job-queue mechanism with an unsubscription job pushed onto it. ( like PostMessage in Windows )
When unsubscribing, you just substitute the core implementation for a dummy implementation. Again this operation should grab the lock. This would indeed introduce some waiting for the data source, but since it's just a [ lock - pointer swap - unlock ] you could say that this is fast enough for real-time applications.
If you want to avoid stacking Observer objects that just contain a dummy, you have to do some kind of bookkeeping, but this could boil down to something trivial like an object holding a pointer to the Observer object it needs from the list.
Optimization :
If you also keep the implementations ( the real one + the dummy ) alive as long as the Observer itself, you can do this without an actual lock, and use something like InterlockedExchangePointer to swap the pointers.
Worst case scenario : delegating call is going on while pointer is swapped --> no big deal all objects stay alive and delegating can continue. Next delegating call will be to new implementation object. ( Barring any new swaps of course )
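The lock-free variant can be sketched in portable C++, with std::atomic's exchange playing the role of InterlockedExchangePointer. All class names here (ObserverImpl, RealImpl, DummyImpl, Observer) are hypothetical, and the sketch assumes, as the optimization requires, that both implementations live exactly as long as the Observer that owns them.

```cpp
#include <atomic>
#include <cassert>
#include <string>

struct ObserverImpl {
    virtual ~ObserverImpl() = default;
    virtual void update(const std::string& message) = 0;
};

// The real implementation; here it only counts the messages it receives.
struct RealImpl : ObserverImpl {
    std::atomic<int> received{0};
    void update(const std::string&) override { ++received; }
};

// The dummy silently swallows every message.
struct DummyImpl : ObserverImpl {
    void update(const std::string&) override {}
};

// The Observer owns both implementations for its whole lifetime, so
// whichever pointer the data thread loads always refers to a live object.
class Observer {
public:
    Observer() : current_(&real_) {}

    // Called by the data-pumping thread.
    void update(const std::string& message) {
        current_.load(std::memory_order_acquire)->update(message);
    }

    // Called by the GUI thread: a single atomic pointer swap, no lock.
    void unsubscribe() {
        current_.exchange(&dummy_, std::memory_order_acq_rel);
    }

    int received() const { return real_.received.load(); }

private:
    RealImpl real_;
    DummyImpl dummy_;
    std::atomic<ObserverImpl*> current_;
};

void observer_example() {
    Observer o;
    o.update("a");
    o.update("b");
    o.unsubscribe();
    o.update("c");                 // goes to the dummy and is dropped
    assert(o.received() == 2);
}
```

A call that is already in flight when the swap happens still completes on the real implementation, which matches the worst case described above; only later calls reach the dummy.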
You could send a message to all observers informing them the data source is terminating and let the observers remove themselves from the list.
In response to the comment, the implementation of the subject-observer pattern should allow for dynamic addition / removal of observers. In C#, the event system is a subject/observer pattern where observers are added using event += observer and removed using event -= observer.

Resources