Rust WGPU Atomic Texture Operations - rust

TL;DR:
Is it possible to access textures atomically in WGSL?
By "atomically" I mean as specified in the "Atomic Operations" section of the documentation of OpenGL's GL_TEXTURE_*.
If not, will changing to GLSL work in WGPU?
Background:
Hi, recently I have been experimenting with WGPU and WGSL, specifically trying to create a cellular automaton and storing its data in a texture_storage_2d.
My problem is that accessing the texture concurrently causes race conditions that make cells disappear: if two cells try to advance to the same point at the same time, one overwrites the other.
I did some research and couldn't find any solution to my problem in the WGSL spec, but I found something similar in OpenGL and GLSL: atomic operations on textures, described under OpenGL's GL_TEXTURE_*. (AFAIK, atomic operations in WGSL exist only for u32 and i32.)
Is there something like GL_TEXTURE_* in WGSL?
Or is there some alternative that I am not aware of?
And is changing to GLSL (while staying with WGPU) the only solution? Will it even work?

To answer the first part, there are no atomic texture operations in WGSL.

The Solution to the Problem
original reddit discussion
After doing some tests I confirmed two things:
I managed to successfully implement an atomic texture (code below).
When the texture is very large (my tests were on a 2000 X 2000 texture) the race conditions described do not occur. This can probably be explained by bank conflicts but I haven't researched it enough to know for sure.
Code
The following snippet is paraphrased from my original code; it is not tested but should work.
@group(0) @binding(0) var texture: texture_storage_2d<rg32uint, read_write>;
struct Locks {
    locks: array<array<atomic<u32>, 50>, 50>,
};
@group(0) @binding(1) var<storage, read_write> locks: Locks;
fn lock(location: vec2<u32>) -> bool {
    let lock_ptr = &locks.locks[location.y][location.x];
    let original_lock_value = atomicLoad(lock_ptr);
    if (original_lock_value > 0u) {
        return false;
    }
    return atomicAdd(lock_ptr, 1u) == original_lock_value;
}
fn unlock(location: vec2<u32>) {
    atomicStore(&locks.locks[location.y][location.x], 0u);
}
Ideally, I'd use atomicCompareExchangeWeak instead of that somewhat complex logic in lock, but atomicCompareExchangeWeak didn't seem to work on my machine so I created similar logic myself.
Just to clarify, reading from the texture should be possible at any time but writing to the texture at location should be done only if lock(location) returned true.
Don't forget to call unlock after every write and between shader calls to reset the locks :)
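For readers who want to experiment with the locking logic outside a shader, here is a minimal CPU-side sketch in Rust. It is an analogy rather than a translation of the WGSL above: it uses `compare_exchange`, which is the operation the answer says it would ideally have used (atomicCompareExchangeWeak), and the names `CellLocks`, `try_lock` and `count_winners` are hypothetical.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// One lock word per cell: 0 = free, 1 = held.
/// Mirrors the 50 x 50 `Locks` array from the shader (sizes are up to the caller).
pub struct CellLocks {
    width: usize,
    locks: Vec<AtomicU32>,
}

impl CellLocks {
    pub fn new(width: usize, height: usize) -> Self {
        let locks = (0..width * height).map(|_| AtomicU32::new(0)).collect();
        CellLocks { width, locks }
    }

    /// Try to take the lock for cell (x, y) without blocking.
    /// Returns true only for the single caller that flipped 0 -> 1.
    pub fn try_lock(&self, x: usize, y: usize) -> bool {
        self.locks[y * self.width + x]
            .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    /// Release the lock for cell (x, y).
    pub fn unlock(&self, x: usize, y: usize) {
        self.locks[y * self.width + x].store(0, Ordering::Release);
    }
}

/// Hypothetical demo: `threads` workers race for the same cell.
/// Returns how many of them acquired the lock.
pub fn count_winners(threads: usize) -> usize {
    let locks = CellLocks::new(50, 50);
    std::thread::scope(|s| {
        let mut handles = Vec::new();
        for _ in 0..threads {
            handles.push(s.spawn(|| locks.try_lock(3, 7)));
        }
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .filter(|&won| won)
            .count()
    })
}
```

As in the shader version, a caller that gets `false` simply skips its write; nothing spins or blocks, so a losing cell has to be handled by the automaton's own rules.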

Related

How do I test cross-thread queue?

I am not 100% sure that this is an SO-appropriate question, but I guess it
falls under "a specific programming problem". Tips to make it
more SO-friendly are welcome.
A bit of context
In DLang there is no default data sharing between threads; instead we use message passing. As safe and clean as that approach is, it makes it hard to scale our code horizontally. The best example is the multiple-writer, multiple-reader problem, which gets quite complicated when using std.concurrency.
A quite common way to solve that problem is to use an in-memory queue: writers push to that queue, readers pull from it, each thread runs at its own pace, and Bob's your uncle. So, I've decided to implement Queue for DLang myself.
The code
Queue has the following API:
module javaesque.concurrency;
Queue!T queue(T)(){
    // constructor, out of struct itself for implementation sake
}
struct Queue(T){
    // internals not important for the sake of question
    void push(T val){
        // ...
    }
    T pull(){
        // ...
    }
}
And here's a sample app using that:
// import whatever's needed, stdio, concurrency, etc
void runnable(Queue!string q){
    try {
        while (true) {
            writeln(to!string(thisTid)~" "~q.pull());
        }
    } catch (OwnerTerminated ot) {
    }
}
void main(string[] args){
    Queue!string queue = queue!string();
    spawn(&runnable, queue);
    spawn(&runnable, queue);
    for (int i = 0; i < 20; ++i){
        queue.push(to!string(i));
    }
    readln();
}
Question
OK, so how do I test that? While prototyping I just tested it by running that sample app, but now that I've confirmed that the idea itself may work as expected, I want to write some unit tests. But how?
Please keep in mind that I didn't add dlang or related tags to this
question. Even though I've supplied snippets in DLang and the
background is highly D-related, I am looking for general help on
testing this kind of structure, without constraining myself to this
language. Obviously, general answer with DLang-specific addition is
welcome, but the question itself should be treated as
language-agnostic.
Well, the "generic" approach to testing is two-fold:
you focus on the public contract of your constructs, and think up testcases that test each aspect of that contract
you focus on the inner implementation and think of (additional) test cases to get you into specific corner cases
And beyond that: you obviously first test the whole construct in a single-threaded manner. You could also look into something like a same-thread service: you set up your environment to effectively use one thread only.
That might be sufficient for "most" of your code - then you might be fine with some few "integration" tests that actually test one expected end to end scenario (using multiple threads). There you could test for example that your multiple readers receive some expected result in the end.
Finally, from another angle: the key to good unit tests is to write code that can be unit tested easily. You need to be able to actually look at your different units in isolation. But if you provided that code here, this would turn into a code review request (which wouldn't belong here).
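To make the two layers of testing concrete, here is a minimal sketch in Rust rather than D. The `Queue` type below is a hypothetical stand-in for the asker's queue (a mutex-protected deque with a condition variable), not their implementation, and `run_scenario` is an illustrative name. The idea is to check the contract single-threaded first, then run one end-to-end scenario where several writers and readers must together deliver every item exactly once.

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

/// Minimal blocking queue (hypothetical stand-in for the asker's Queue!T).
pub struct Queue<T> {
    items: Mutex<VecDeque<T>>,
    ready: Condvar,
}

impl<T> Queue<T> {
    pub fn new() -> Self {
        Queue { items: Mutex::new(VecDeque::new()), ready: Condvar::new() }
    }

    /// Append an item and wake one waiting reader.
    pub fn push(&self, val: T) {
        self.items.lock().unwrap().push_back(val);
        self.ready.notify_one();
    }

    /// Remove the oldest item, blocking until one is available.
    pub fn pull(&self) -> T {
        let mut items = self.items.lock().unwrap();
        loop {
            if let Some(v) = items.pop_front() {
                return v;
            }
            items = self.ready.wait(items).unwrap();
        }
    }
}

/// Integration-style scenario: each writer pushes a disjoint range of numbers,
/// each reader pulls an equal share. Returns everything received, sorted.
/// Assumes `readers` is non-zero and divides `writers * per_writer`.
pub fn run_scenario(writers: usize, readers: usize, per_writer: usize) -> Vec<usize> {
    let q: Queue<usize> = Queue::new();
    let total = writers * per_writer;
    assert_eq!(total % readers, 0, "equal shares let every reader terminate");
    let share = total / readers;

    let mut received = std::thread::scope(|s| {
        for w in 0..writers {
            let q = &q;
            s.spawn(move || {
                for i in 0..per_writer {
                    q.push(w * per_writer + i);
                }
            });
        }
        let mut reader_handles = Vec::new();
        for _ in 0..readers {
            let q = &q;
            reader_handles.push(s.spawn(move || {
                (0..share).map(|_| q.pull()).collect::<Vec<usize>>()
            }));
        }
        let mut all = Vec::new();
        for h in reader_handles {
            all.extend(h.join().unwrap());
        }
        all
    });
    received.sort_unstable();
    received
}
```

A single-threaded contract check is then just push, push, pull, pull in FIFO order, and the multi-threaded check compares the sorted result of `run_scenario(4, 2, 100)` with the range `0..400`: nothing lost, nothing duplicated. A test like this can pass by luck, so it complements the contract tests rather than replacing them.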

Unreal Engine 4: Adapting ReadPixels() to a multithreaded framework

I am trying to access pixel data and save images from an in-game camera to disk. Initially, the simple approach was to use a render target and subsequently RenderTarget->ReadPixels(), but as the native implementation of ReadPixels() contains a call to FlushRenderingCommands(), it would block the game thread until the image is saved. Being a computationally intensive operation, this was lowering my FPS way too much.
To solve this problem, I am trying to create a dedicated thread that can access the camera as a CaptureComponent, and then follow a similar approach. But as the FlushRenderingCommands() block can only be called from a game thread, I had to rewrite ReadPixels() without that call (in a non-blocking way of sorts, inspired by the tutorial at https://wiki.unrealengine.com/Render_Target_Lookup), but even then I am facing a problem with my in-game FPS being jerky whenever an image is saved (I confirmed this is not because of the actual saving to disk operation, but because of the pixel data access). My rewritten ReadPixels() function is shown below; I was hoping to get some suggestions as to what could be going wrong here. I am not sure if ENQUEUE_UNIQUE_RENDER_COMMAND_ONEPARAMETER can be called from a non-game thread, and if that's part of my problem.
APIPCamera* cam = GameThread->CameraDirector->getCamera(0);
USceneCaptureComponent2D* capture = cam->getCaptureComponent(EPIPCameraType::PIP_CAMERA_TYPE_SCENE, true);
if (capture != nullptr) {
    if (capture->TextureTarget != nullptr) {
        FTextureRenderTargetResource* RenderResource = capture->TextureTarget->GetRenderTargetResource();
        if (RenderResource != nullptr) {
            width = capture->TextureTarget->GetSurfaceWidth();
            height = capture->TextureTarget->GetSurfaceHeight();
            // Read the render target surface data back.
            struct FReadSurfaceContext
            {
                FRenderTarget* SrcRenderTarget;
                TArray<FColor>* OutData;
                FIntRect Rect;
                FReadSurfaceDataFlags Flags;
            };
            bmp.Reset();
            FReadSurfaceContext ReadSurfaceContext =
            {
                RenderResource,
                &bmp,
                FIntRect(0, 0, RenderResource->GetSizeXY().X, RenderResource->GetSizeXY().Y),
                FReadSurfaceDataFlags(RCM_UNorm, CubeFace_MAX)
            };
            ENQUEUE_UNIQUE_RENDER_COMMAND_ONEPARAMETER(
                ReadSurfaceCommand,
                FReadSurfaceContext, Context, ReadSurfaceContext,
                {
                    RHICmdList.ReadSurfaceData(
                        Context.SrcRenderTarget->GetRenderTargetTexture(),
                        Context.Rect,
                        *Context.OutData,
                        Context.Flags
                    );
                });
        }
    }
}
EDIT: One more thing I have noticed is that the stuttering goes away if I disable HDR in my render target settings (but this results in low quality images): so it seems plausible that the size of the image, perhaps, is still blocking one of the core threads because of the way I am implementing it.
It should be possible to call ENQUEUE_UNIQUE_RENDER_COMMAND_ONEPARAMETER from any thread, since underneath it dispatches a task through the Task Graph. You can see this when you analyze the code the macro generates:
if(ShouldExecuteOnRenderThread())
{
    CheckNotBlockedOnRenderThread();
    TGraphTask<EURCMacro_##TypeName>::CreateTask().ConstructAndDispatchWhenReady(ParamValue1);
}
You should be cautious about accessing UObjects (like USceneCaptureComponent2D) from other threads, because they are managed by the garbage collector and owned by the game thread.
(...) but even then I am facing a problem with my in-game FPS being jerky whenever an image is saved
Did you check which thread is causing the FPS drop with the stat unit or stat unitgraph command? You could also use profiling tools to get more detailed insight and make sure there are no other causes of lag.
Edit:
I've found yet another method of accessing pixel data. Try this without actually copying data in a for loop and check if there is any improvement in FPS. This could be a bit faster because there is no pixel manipulation/conversion in between.

Use ArrayBuffer in sequentially executed Threads?

I have two Futures, the second of which starts after the first ended. Both write to the same ArrayBuffer instance, but since they are executed serially (not at the same time), I consider them not acting concurrently.
However, I know there is the @volatile annotation for variables shared among two or more threads (@volatile disables caching).
After the first thread finishes, there might be some caching going on inside the ArrayBuffer instance that makes it impossible for the second thread to see the ArrayBuffer's real state, so I am not sure whether it is safe to use ArrayBuffer this way.
Is it true that caching might be a problem in my situation, and if so, is there a recommended way to make ArrayBuffer use @volatile internally?
It should be fine iff (if-and-only-if) you propagate it [the array] through the future:
val futureA = Future {
  val buf = ArrayBuffer(…)
  update(buf)
  buf
}
val futureB = futureA map {
  buf => moreUpdates(buf); buf
}
futureB foreach println // print the result of the transformations
This is OK from a memory safety point of view because the completion of futureA happens-before the onComplete callback is invoked (virtually all transformations on Future are implemented on top of onComplete), in this case map.
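The same hand-off idea can be sketched in Rust, as an analogy only (this is not the Scala Future API, and `run_stages` is a hypothetical name): the buffer is returned from the first stage and moved into the second, so the second stage can only start once the first has completed and its writes are visible.

```rust
use std::thread;

/// Stage A builds the buffer on its own thread and returns it by value.
/// Stage B starts only after `join` has handed the buffer over.
pub fn run_stages() -> Vec<i32> {
    let stage_a = thread::spawn(|| {
        let mut buf: Vec<i32> = Vec::new();
        buf.extend([1, 2, 3]);
        buf
    });

    // join() establishes the happens-before edge: every write made by
    // stage A is visible to whoever receives the returned buffer.
    let buf = stage_a.join().expect("stage A panicked");

    let stage_b = thread::spawn(move || {
        let mut buf = buf; // the buffer is moved in; stage A can no longer touch it
        buf.push(4);
        buf
    });
    stage_b.join().expect("stage B panicked")
}
```

Because the buffer travels with the result instead of sitting in shared state, no volatile-style annotation is involved: ordering comes from completing one stage before starting the next.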
The problem is not caching, per se, but the fact that an ArrayBuffer is a composite, with several subfields that have to be updated in concert to assure correct operation. You will need to use thread synchronization tools to ensure this.
class ArrayBufferWrapper[T](ab: ArrayBuffer[T]) {
  def add(item: T) = {
    this.synchronized {
      // += appends to the buffer while holding the wrapper's lock
      ab += item
    }
  }
}
By wrapping the ArrayBuffer, the components are properly realized into the current thread, and you ensure thread-safe add operations.
No, it is not safe.
This is exactly the reason why they invented functional programming. If you are using Scala anyway, you might as well take advantage of the paradigm it offers.
Avoid using mutable structures, or, at least, in the rare cases when you have to use them, do not let them escape the local scope. Then you won't ever have to deal with problems like this. They just will not exist anymore.
Tell us more about what you are trying to do, and I am sure someone will suggest a design or two that does not involve two threads mutating the same structure.

multithread search design

Dear community, I would like some help with a small task that should improve the performance of my application.
I have an array of dictionaries in a singleton; each object is an NSDictionary with the keys
code
country
specific
I have to retrieve the country and specific values for a given code from this array.
My first version of the application used a predicate, but I later found a lot of memory leaks and performance issues with this approach. The application was too slow and did not free memory quickly enough; usage grew to around 1 GB and the app crashed.
My second version was a little more complicated. I filled an array in the singleton with objects, one per code, each with the function you can see below.
-(void)codeIsSame:(NSArray *)codeForCheck;
{
    //@synchronized(self) {
    NSString *code = [codeForCheck objectAtIndex:0];
    if ([_code isEqualToString:code])
    {
        code = nil;
        NSUInteger queneNumberInt = [[codeForCheck objectAtIndex:1] intValue];
        NSLog(@"We match code:%@ country:%@ specific:%@ quene:%lu",_code, _country,_specific, queneNumberInt);
        [[ProjectArrays sharedProjectArrays].arrayDictionaryesForCountryCodesResult insertObject:_result atIndex:queneNumberInt];
    }
    code = nil;
    //}
    return;
}
The way to retrieve the necessary values is:
SEL selector = @selector(codeIsSame:);
[[ProjectArrays sharedProjectArrays].myCountrySpecificCodeListWithClass makeObjectsPerformSelector:selector withObject:codePlusQueueNumber];
This version works much better: no memory leaks and very fast, but it is too hard to debug. Sometimes I receive an empty result. I tried to synchronize the thread jobs, but it still does not work reliably. The main problem is that, for some strange reason, sometimes there is no result in my singleton array. I tried to debug it using a different array index for each thread, and found that the class simply missed the answer.
Core Data doesn't allow me to make a copy of the main MOC, so I can't use it in a multithreaded design (lock and unlock is not a good idea; that approach produced too many errors in the lock/unlock part of the code).
Can anybody suggest what I could do better in this case? I need a solution that works reliably and is easy to code and understand.
My current solution uses an NSDictionary whose keys are the codes; under each code I have a dictionary with country/specific. It works fine as well, but it doesn't solve the main task: using Core Data when you need access to the same data from many threads.
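The dictionary-keyed-by-code idea can be sketched in Rust as follows. This is an illustration under assumptions, not the asker's Objective-C code: the type `Entry`, the functions `build_table` and `lookup_all`, and the sample rows used in any call are hypothetical. The point it shows is that a table built once and only read afterwards can be shared across threads without locks, and that returning results by position avoids writing into a shared result array from several threads.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

/// The two values stored under each code.
#[derive(Clone, Debug, PartialEq)]
pub struct Entry {
    pub country: String,
    pub specific: String,
}

/// Build the lookup table once; after this it is only read.
/// Each row is (code, country, specific).
pub fn build_table(rows: &[(&str, &str, &str)]) -> Arc<HashMap<String, Entry>> {
    let map: HashMap<String, Entry> = rows
        .iter()
        .map(|&(code, country, specific)| {
            let entry = Entry {
                country: country.to_string(),
                specific: specific.to_string(),
            };
            (code.to_string(), entry)
        })
        .collect();
    Arc::new(map)
}

/// Look up several codes in parallel, one thread per code.
/// Results come back in input order; a missing code yields None.
pub fn lookup_all(
    table: &Arc<HashMap<String, Entry>>,
    codes: &[&str],
) -> Vec<Option<Entry>> {
    let handles: Vec<_> = codes
        .iter()
        .map(|code| {
            let table = Arc::clone(table); // cheap pointer copy, no data copy
            let code = code.to_string();
            thread::spawn(move || table.get(&code).cloned())
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Spawning one thread per lookup is only for demonstration; the relevant property is that each worker returns its own result instead of inserting into a shared array at a queue index.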

Is this a safe version of double-checked locking?

Slightly modified version of canonical broken double-checked locking from Wikipedia:
class Foo {
    private Helper helper = null;
    public Helper getHelper() {
        if (helper == null) {
            synchronized(this) {
                if (helper == null) {
                    // Create new Helper instance and store reference on
                    // stack so other threads can't see it.
                    Helper myHelper = new Helper();
                    // Atomically publish this instance.
                    atomicSet(helper, myHelper);
                }
            }
        }
        return helper;
    }
}
Does simply making the publishing of the newly created Helper instance atomic make this double checked locking idiom safe, assuming that the underlying atomic ops library works properly? I realize that in Java, one could just use volatile, but even though the example is in pseudo-Java, this is supposed to be a language-agnostic question.
See also:
Double checked locking Article
It entirely depends on the exact memory model of your platform/language.
My rule of thumb: just don't do it. Lock-free (or reduced lock, in this case) programming is hard and shouldn't be attempted unless you're a threading ninja. You should only even contemplate it when you've got profiling proof that you really need it, and in that case you get the absolute best and most recent book on threading for that particular platform and see if it can help you.
I don't think you can answer the question in a language-agnostic fashion without getting away from code completely. It all depends on how synchronized and atomicSet work in your pseudocode.
The answer is language dependent - it comes down to the guarantees provided by atomicSet().
If the construction of myHelper can be reordered to happen after the atomicSet(), then it doesn't matter how the variable is assigned to the shared state.
i.e.
// Create new Helper instance and store reference on
// stack so other threads can't see it.
Helper myHelper = new Helper(); // ALLOCATE MEMORY HERE BUT DON'T INITIALISE
// Atomically publish this instance.
atomicSet(helper, myHelper); // ATOMICALLY POINT UNINITIALISED MEMORY from helper
// other thread gets run at this time and tries to use helper object
// AT THE PROGRAMS LEISURE INITIALISE Helper object.
If this is allowed by the language then the double checking will not work.
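For comparison, here is a sketch of the same idiom in a language whose memory model makes the required guarantees explicit. It is written in Rust and is an assumption-laden illustration, not an answer about Java: `Foo`, `Helper` and `get_helper` mirror the names in the question, while the `constructions` counter is hypothetical instrumentation. The publishing store uses Release ordering and the unlocked first check uses Acquire, which is exactly the kind of guarantee atomicSet would have to provide to rule out the reordering described above.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};
use std::sync::Mutex;

pub struct Helper {
    pub value: u32,
}

pub struct Foo {
    helper: AtomicPtr<Helper>,
    init_lock: Mutex<()>,
    constructions: AtomicUsize, // demo instrumentation only
}

impl Foo {
    pub fn new() -> Self {
        Foo {
            helper: AtomicPtr::new(ptr::null_mut()),
            init_lock: Mutex::new(()),
            constructions: AtomicUsize::new(0),
        }
    }

    pub fn get_helper(&self) -> &Helper {
        // First check, without the lock. Acquire pairs with the Release store below,
        // so a non-null pointer implies the Helper behind it is fully initialised.
        let mut p = self.helper.load(Ordering::Acquire);
        if p.is_null() {
            let _guard = self.init_lock.lock().unwrap();
            // Second check, under the lock; the mutex already orders this load.
            p = self.helper.load(Ordering::Relaxed);
            if p.is_null() {
                self.constructions.fetch_add(1, Ordering::Relaxed);
                // Construct completely first...
                p = Box::into_raw(Box::new(Helper { value: 42 }));
                // ...then publish. Release forbids moving the construction after this store.
                self.helper.store(p, Ordering::Release);
            }
        }
        // SAFETY: p is non-null, points to a live Helper, and is freed only in Drop.
        unsafe { &*p }
    }

    /// How many times a Helper was constructed (expected: at most once).
    pub fn constructions(&self) -> usize {
        self.constructions.load(Ordering::Relaxed)
    }
}

impl Drop for Foo {
    fn drop(&mut self) {
        let p = *self.helper.get_mut();
        if !p.is_null() {
            // SAFETY: p came from Box::into_raw and no reference outlives `self`.
            unsafe { drop(Box::from_raw(p)) };
        }
    }
}
```

In everyday Rust code `std::sync::OnceLock` covers this use case without any unsafe code; the hand-written version is shown only to make the ordering requirements visible.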
Using volatile would not prevent multiple instantiations; using synchronized, however, will prevent multiple instances from being created. With your code, though, it is possible that helper is returned before it has been set up (thread 'A' instantiates it, but before it is set up thread 'B' comes along, sees that helper is non-null, and returns it straight away). To fix that problem, remove the first if (helper == null).
Most likely it is broken, because the problem of a partially constructed object is not addressed.
To all the people worried about a partially constructed object:
As far as I understand, the problem of partially constructed objects is only a problem within constructors. In other words, within a constructor, if an object references itself (including its subclass) or its members, then there are possible issues with partial construction. Otherwise, when a constructor returns, the class is fully constructed.
I think you are confusing partial construction with the different problem of how the compiler optimizes the writes. The compiler can choose to A) allocate the memory for the new Helper object, B) write the address to myHelper (the local stack variable), and then C) invoke any constructor initialization. Anytime after point B and before point C, accessing myHelper would be a problem.
It is this compiler optimization of the writes, not partial construction that the cited papers are concerned with. In the original single-check lock solution, optimized writes can allow multiple threads to see the member variable between points B and C. This implementation avoids the write optimization issue by using a local stack variable.
The main scope of the cited papers is to describe the various problems with the double-check lock solution. However, unless the atomicSet method is also synchronizing against the Foo class, this solution is not a double-check lock solution. It is using multiple locks.
I would say this all comes down to the implementation of the atomic assignment function. The function needs to be truly atomic, it needs to guarantee that processor local memory caches are synchronized, and it needs to do all this at a lower cost than simply always synchronizing the getHelper method.
Based on the cited paper, in Java, it is unlikely to meet all these requirements. Also, something that should be very clear from the paper is that Java's memory model changes frequently. It adapts as the understanding of caching, garbage collection, etc. evolves, as well as adapting to changes in the underlying real processor architecture that the VM runs on.
As a rule of thumb, if you optimize your Java code in a way that depends on the underlying implementation, as opposed to the API, you run the risk of having broken code in the next release of the JVM. (Although, sometimes you will have no choice.)
dsimcha:
If your atomicSet method is real, then I would try sending your question to Doug Lea (along with your atomicSet implementation). I have a feeling he's the kind of guy that would answer. I'm guessing that for Java he will tell you that it's cheaper to always synchronize and to look to optimize somewhere else.

Resources