Unreal Engine 4: Adapting ReadPixels() to a multithreaded framework - multithreading

I am trying to access pixel data and save images from an in-game camera to disk. Initially, the simple approach was to use a render target and then RenderTarget->ReadPixels(), but since the native implementation of ReadPixels() contains a call to FlushRenderingCommands(), it blocks the game thread until the image is saved. Being a computationally intensive operation, this was lowering my FPS far too much.
To solve this, I am trying to create a dedicated thread that can access the camera as a CaptureComponent and then follow a similar approach. But since FlushRenderingCommands() can only be called from the game thread, I had to rewrite ReadPixels() without that call (in a non-blocking way of sorts, inspired by the tutorial at https://wiki.unrealengine.com/Render_Target_Lookup). Even then, my in-game FPS becomes jerky whenever an image is saved (I confirmed this is not caused by the actual saving to disk, but by the pixel-data access). My rewritten ReadPixels() function is below; I was hoping to get some suggestions as to what could be going wrong here. I am not sure whether ENQUEUE_UNIQUE_RENDER_COMMAND_ONEPARAMETER can be called from a non-game thread, and whether that is part of my problem.
APIPCamera* cam = GameThread->CameraDirector->getCamera(0);
USceneCaptureComponent2D* capture = cam->getCaptureComponent(EPIPCameraType::PIP_CAMERA_TYPE_SCENE, true);
if (capture != nullptr) {
    if (capture->TextureTarget != nullptr) {
        FTextureRenderTargetResource* RenderResource = capture->TextureTarget->GetRenderTargetResource();
        if (RenderResource != nullptr) {
            width = capture->TextureTarget->GetSurfaceWidth();
            height = capture->TextureTarget->GetSurfaceHeight();
            // Read the render target surface data back.
            struct FReadSurfaceContext
            {
                FRenderTarget* SrcRenderTarget;
                TArray<FColor>* OutData;
                FIntRect Rect;
                FReadSurfaceDataFlags Flags;
            };
            bmp.Reset();
            FReadSurfaceContext ReadSurfaceContext =
            {
                RenderResource,
                &bmp,
                FIntRect(0, 0, RenderResource->GetSizeXY().X, RenderResource->GetSizeXY().Y),
                FReadSurfaceDataFlags(RCM_UNorm, CubeFace_MAX)
            };
            ENQUEUE_UNIQUE_RENDER_COMMAND_ONEPARAMETER(
                ReadSurfaceCommand,
                FReadSurfaceContext, Context, ReadSurfaceContext,
                {
                    RHICmdList.ReadSurfaceData(
                        Context.SrcRenderTarget->GetRenderTargetTexture(),
                        Context.Rect,
                        *Context.OutData,
                        Context.Flags
                    );
                });
        }
    }
}
EDIT: One more thing I have noticed is that the stuttering goes away if I disable HDR in my render target settings (but this results in low-quality images), so it seems plausible that the sheer size of the image data is still blocking one of the core threads because of the way I am implementing this.

It should be possible to call ENQUEUE_UNIQUE_RENDER_COMMAND_ONEPARAMETER from any thread, since underneath it goes through the Task Graph. You can see this when you analyze the code this macro generates:
if (ShouldExecuteOnRenderThread())
{
    CheckNotBlockedOnRenderThread();
    TGraphTask<EURCMacro_##TypeName>::CreateTask().ConstructAndDispatchWhenReady(ParamValue1);
}
You should be cautious about accessing UObjects (like USceneCaptureComponent2D) from other threads, because these are managed by the Garbage Collector and owned by the game thread.
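One way to keep UObject access on the game thread is to resolve everything that needs the UObject while still on the game thread and hand only plain data to the worker. A minimal sketch of that idea (the FSaveRequest struct and function name are made up for illustration, and it assumes the render target is not destroyed or resized while the worker uses it):
// Game-thread side: resolve everything that requires UObject access up front.
struct FSaveRequest                  // hypothetical plumbing struct
{
    FTextureRenderTargetResource* RenderResource;
    int32 Width;
    int32 Height;
};

FSaveRequest MakeSaveRequestOnGameThread(USceneCaptureComponent2D* Capture)
{
    check(IsInGameThread());         // UObject access stays on the game thread
    FSaveRequest Request;
    Request.RenderResource = Capture->TextureTarget->GameThread_GetRenderTargetResource();
    Request.Width  = Capture->TextureTarget->GetSurfaceWidth();
    Request.Height = Capture->TextureTarget->GetSurfaceHeight();
    return Request;                  // the worker thread only ever sees this plain struct
}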
(...) but even then I am facing a problem with my in-game FPS being jerky whenever an image is saved
Did you check which thread is causing the FPS drop with the stat unit or stat unitgraph console commands? You could also use the profiling tools for more detailed insight and to make sure there are no other causes of the lag.
Edit:
I've found yet another method of accessing the pixel data. Try it without actually copying the data in a for loop and check whether there is any improvement in FPS. It could be a bit faster because there is no pixel manipulation/conversion in between.
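For reference, here is a rough sketch of reading the surface without a per-pixel conversion loop. This is not necessarily the method referred to above; it assumes the render target is a plain 8-bit BGRA (non-HDR) texture and that DestBuffer is a preallocated buffer of height * width * 4 bytes:
struct FLockSurfaceContext
{
    FTexture2DRHIRef Texture;   // from RenderResource->GetRenderTargetTexture()
    uint8* Dest;                // preallocated: height * width * 4 bytes (assumes 8-bit BGRA)
    int32 Width;
    int32 Height;
};

FLockSurfaceContext LockContext = { RenderResource->GetRenderTargetTexture(), DestBuffer, width, height };

ENQUEUE_UNIQUE_RENDER_COMMAND_ONEPARAMETER(
    LockAndCopySurfaceCommand,
    FLockSurfaceContext, Context, LockContext,
    {
        uint32 Stride = 0;
        uint8* Src = (uint8*)RHILockTexture2D(Context.Texture, 0, RLM_ReadOnly, Stride, false);
        for (int32 Row = 0; Row < Context.Height; ++Row)
        {
            // Row-wise memcpy only; no per-pixel FColor conversion.
            FMemory::Memcpy(Context.Dest + Row * Context.Width * 4, Src + Row * Stride, Context.Width * 4);
        }
        RHIUnlockTexture2D(Context.Texture, 0, false);
    });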

Related

Rust WGPU Atomic Texture Operations

TL;DR:
Is it possible to access textures atomically in WGSL?
By atomically, I mean as specified in the "Atomic Operations" section of the documentation for OpenGL's GL_TEXTURE_*.
If not, will changing to GLSL work in WGPU?
Background:
Hi, recently I have been experimenting with WGPU and WGSL, specifically trying to create a cellular automaton and storing its data in a texture_storage_2d.
I was having problems with the fact that accessing the texture asynchronously caused race conditions that made cells disappear (if two cells try to advance to the same point at the same time, they overwrite one another).
I did some research and couldn't find any solution to my problem in the WGSL spec, but I found something similar for OpenGL and GLSL: atomic operations on OpenGL's GL_TEXTURE_* textures (atomics in WGSL exist, AFAIK, only for u32 and i32).
Is there something like GL_TEXTURE_* in WGSL?
Or is there some alternative that I am not aware of?
And is changing to GLSL (while staying with WGPU) the only solution? Will it even work?
To answer the first part, there are no atomic texture operations in WGSL.
The Solution to the Problem
original reddit discussion
After doing some tests I confirmed two things:
I managed to successfully implement an atomic texture (code below).
When the texture is very large (my tests were on a 2000 X 2000 texture) the race conditions described do not occur. This can probably be explained by bank conflicts but I haven't researched it enough to know for sure.
Code
The following snippet is paraphrased from my original code; it is not tested, but it should work.
@group(0) @binding(0) var texture: texture_storage_2d<rg32uint, read_write>;

struct Locks {
    locks: array<array<atomic<u32>, 50>, 50>,
};
@group(0) @binding(1) var<storage, read_write> locks: Locks;

fn lock(location: vec2<u32>) -> bool {
    let lock_ptr = &locks.locks[location.y][location.x];
    let original_lock_value = atomicLoad(lock_ptr);
    if (original_lock_value > 0u) {
        return false;
    }
    return atomicAdd(lock_ptr, 1u) == original_lock_value;
}

fn unlock(location: vec2<u32>) {
    atomicStore(&locks.locks[location.y][location.x], 0u);
}
Ideally, I'd use atomicCompareExchangeWeak instead of that somewhat complex logic in lock, but atomicCompareExchangeWeak didn't seem to work on my machine so I created similar logic myself.
Just to clarify, reading from the texture should be possible at any time but writing to the texture at location should be done only if lock(location) returned true.
Don't forget to call unlock after every write and between shader calls to reset the locks :)

Divide a task into subtasks and assign to thread pool

I am trying to read an image, manipulate the pixel data (Gaussian blur or any other filter), and write the pixels to a new image. Since the images are big (>1 GB and sometimes even >20 GB), I read them one row at a time across the whole width, so reading is done block-wise. Now my work requires a faster mechanism for the whole process. Would a thread pool be an effective solution? I cannot use other libraries for the image processing; we have an engine built for that.
I have referred to a thread-pool sample from CodeProject, and I am reading the image in the thread's Run function, but I am really not sure how it works.
HRESULT hRes = m_ObjPool.Init(10, 100); // spawning the threads

void CThreadObject::Run(CThreadPoolThreadCallback &pool)
{
    // I read and write my image here using a for loop
    for (int i = 0; i < nImageHeight; ++i)
    {
        for (int j = 0; j < nImageWidth; j++)
        {
            Engine.ReadImage(params);
        }
    }
}
What I am trying to figure out is how to hand tasks to the thread pool when the image is segmented into 10 or 100 parts (depending on the image size and block size).
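For illustration, here is a rough sketch of one way to split the image into row blocks and hand each block to a pool. It uses std::async in place of the CodeProject pool; ProcessRowBlock, ProcessImageInBlocks, and the placeholder comments are made-up names, not your engine's API:
#include <algorithm>
#include <future>
#include <vector>

// Hypothetical per-block worker: reads rows [rowBegin, rowEnd), filters them,
// and writes them to the output image.
void ProcessRowBlock(int rowBegin, int rowEnd, int imageWidth)
{
    for (int row = rowBegin; row < rowEnd; ++row)
    {
        // Placeholder: read row `row`, apply the filter, write it to the output image
        // (e.g. via your engine's read/write calls), touching only this block's rows.
        (void)imageWidth;
    }
}

void ProcessImageInBlocks(int imageHeight, int imageWidth, int numBlocks)
{
    const int rowsPerBlock = (imageHeight + numBlocks - 1) / numBlocks;
    std::vector<std::future<void>> tasks;

    for (int block = 0; block < numBlocks; ++block)
    {
        const int rowBegin = block * rowsPerBlock;
        const int rowEnd   = std::min(rowBegin + rowsPerBlock, imageHeight);
        if (rowBegin >= rowEnd)
            break;
        // Each block of rows becomes one independent task.
        tasks.push_back(std::async(std::launch::async, ProcessRowBlock, rowBegin, rowEnd, imageWidth));
    }

    // Wait for all blocks to finish before the output image is considered complete.
    for (auto& task : tasks)
        task.get();
}
With a real pool, each (rowBegin, rowEnd) pair would simply become one queued work item instead of an std::async call.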

Multithreading in DirectX 12

I am having a hard time getting my head around the concept of multithreaded rendering in DX12.
According to MSDN, one must write draw commands into direct command lists (preferably using bundles) and then submit those lists to a command queue.
It is also said that one can have more than one command queue for direct command lists, but it is unclear to me what the purpose of doing so is.
I get the full benefit of multithreading by building command lists in parallel threads, don't I? If so, why would I want to have more than one command queue associated with the device?
I suspect that improper management of command queues can lead to enormous performance trouble in later stages of rendering library development.
The main benefit of DirectX 12 is that execution of commands is almost purely asynchronous: when you call ID3D12CommandQueue::ExecuteCommandLists, it kicks off the work for the commands passed in. This brings up another point, however. A common misconception is that rendering itself is somehow multithreaded now, and that is simply not true; all of that work is still executed on the GPU. What happens on several threads is command list recording, as you create an ID3D12GraphicsCommandList object for each thread that needs one.
An example:
DrawObject DrawObjects[10];
ID3D12CommandQueue* GCommandQueue = ...;
// One command list per recording thread.
ID3D12GraphicsCommandList* clForThread1 = ...;
ID3D12GraphicsCommandList* clForThread2 = ...;

void RenderThread1()
{
    for (int i = 0; i < 5; i++)
        clForThread1->RecordDraw(DrawObjects[i]); // pseudocode for recording draw commands
}

void RenderThread2()
{
    for (int i = 5; i < 10; i++)
        clForThread2->RecordDraw(DrawObjects[i]); // pseudocode for recording draw commands
}

void ExecuteCommands()
{
    ID3D12CommandList* cl[2] = { clForThread1, clForThread2 };
    GCommandQueue->ExecuteCommandLists(2, cl);
    GCommandQueue->Signal(...);
}
This example is a very rough use case, but that is the general idea: you can record the objects of your scene on different threads to remove the CPU overhead of recording the commands.
Another useful thing, however, is that with this setup you can kick off one rendering task and start recording the next.
An example:
void Render()
{
    ID3D12GraphicsCommandList* cl = ...;
    cl->DrawObjectsInTheScene(...);
    // Just send it to the GPU to start rendering all the objects in the scene.
    CommandQueue->Execute(cl);
    // Since the GPU has started rendering the scene, we can record our post
    // processing while the scene is still being rendered on the GPU.
    ID3D12GraphicsCommandList* cl2 = ...;
    cl2->SetBloomPipelineState(...);
    cl2->SetResources(...);
    cl2->DrawOnScreenQuad();
}
The advantage here over DirectX 11 or OpenGL is that those APIs potentially just sit there recording and recording, and may not submit their commands until Present() is called, which forces the CPU to wait and incurs an overhead.
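For completeness, here is a minimal sketch of how the Signal(...) call in the first example is typically paired with a fence so the CPU can wait for a submitted batch to finish on the GPU (standard D3D12 calls; initialization and error handling are reduced to the bare minimum, and the device/queue placeholders follow the style above):
ID3D12Device*       GDevice     = ...; // assumed to exist, as above
ID3D12Fence*        GFence      = nullptr;
UINT64              GFenceValue = 0;
HANDLE              GFenceEvent = nullptr;

void InitSync()
{
    GDevice->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&GFence));
    GFenceEvent = CreateEvent(nullptr, FALSE, FALSE, nullptr);
}

void WaitForGpu()
{
    // Ask the queue to signal the fence once all previously submitted lists have executed.
    const UINT64 valueToWaitFor = ++GFenceValue;
    GCommandQueue->Signal(GFence, valueToWaitFor);

    // Block the CPU only if the GPU has not reached that point yet.
    if (GFence->GetCompletedValue() < valueToWaitFor)
    {
        GFence->SetEventOnCompletion(valueToWaitFor, GFenceEvent);
        WaitForSingleObject(GFenceEvent, INFINITE);
    }
}
One reason to create more than one queue, for example, is to run copy or compute work on a dedicated queue that overlaps with the graphics queue, each synchronised through its own fence.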

Updating the physics engine in a separate thread, is this wise?

I'm 100% new to threading. As a start, I've decided I want to muck around with using it to update my physics in a separate thread. I'm using a third-party physics engine called Farseer; here's what I'm doing:
// class level declarations
System.Threading.Thread thread;
Stopwatch threadUpdate = new Stopwatch();

// In the constructor:
PhysicsEngine()
{
    (...)
    thread = new System.Threading.Thread(
        new System.Threading.ThreadStart(PhysicsThread));
    threadUpdate.Start();
    thread.Start();
}

public void PhysicsThread()
{
    int milliseconds = TimeSpan.FromTicks(111111).Milliseconds;
    while (true)
    {
        if (threadUpdate.Elapsed.Milliseconds > milliseconds)
        {
            world.Step(threadUpdate.Elapsed.Milliseconds / 1000.0f);
            threadUpdate.Stop();
            threadUpdate.Reset();
            threadUpdate.Start();
        }
    }
}
Is this an OK way to update physics, or is there something I should look out for?
In a game you need to synchronise your physics update to the game's frame rate. This is because your rendering and gameplay will depend on the output of your physics engine each frame. And your physics engine will depend on user input and gameplay events each frame.
This means that the only advantage of calculating your physics on a separate thread is that it can run on a separate CPU core to the rest of your game logic and rendering. (Pretty safe for PC these days, and the mobile space is just starting to get dual-core.)
This allows both physics and gameplay/rendering to run concurrently, but the drawback is that you need some mechanism to prevent one thread from modifying data while the other thread is using it. This is generally quite difficult to implement.
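A common mechanism for this is double-buffering the physics output, so each thread only touches its own copy of the data and the threads only synchronise during a brief swap. A minimal sketch of the idea (shown in C++ with made-up type names; the same pattern applies with Farseer in C#):
#include <mutex>

// The physics thread publishes a snapshot of body transforms after each Step();
// the game/render thread only ever reads the most recently published snapshot.
struct PhysicsSnapshot
{
    // e.g. positions/rotations of all bodies copied out of the physics world
};

class SnapshotExchange
{
public:
    // Called by the physics thread after each Step().
    void Publish(const PhysicsSnapshot& snapshot)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mFront = snapshot;               // copy under a very short lock
    }

    // Called by the game/render thread once per frame.
    PhysicsSnapshot Read() const
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return mFront;                   // returns a copy, so no further locking is needed
    }

private:
    mutable std::mutex mMutex;
    PhysicsSnapshot mFront;
};
The same idea also works without copying the whole snapshot by keeping two buffers and swapping pointers under the lock.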
Of course, if your physics isn't dependent on user input - like Angry Birds or The Incredible Machine (i.e. the user presses "play" and the simulation runs) - then it's possible for you to calculate your physics simulation in advance, recording its output for playback. Instead of blocking the main thread, you can move this time-consuming operation to a background thread - which is a well-understood problem. You could even go so far as to start playing your recording back on the main thread before it has finished recording!
There is nothing wrong with your approach in general. Moving time-consuming operations, such as physics engine calculations, to a separate thread is often a good idea. However, I am assuming that your application includes some sort of visual representation of your physics objects in the UI? If so, you are going to run into problems.
The UI controls in Silverlight have thread affinity, i.e. you cannot update their state from within the thread you have created in the above example. In order to update their state you are going to have to invoke via the Dispatcher, e.g. TextBox.Dispatcher.Invoke(...).
Another alternative is to use a Silverlight BackgroundWorker. This is a useful little class that allows you to do time-consuming work. It will move your work onto a background thread, avoiding the need to create your own System.Threading.Thread. It will also provide events that marshal results back onto the UI thread for you.
Much simpler!

multithread search design

Dear community, I would like to understand a little task that should help me improve the performance of my application.
I have an array of dictionaries in a singleton area, with NSDictionary objects and these keys:
code
country
specific
I have to retrieve the country and specific values from this array.
My first version of the application used a predicate, but later I found a lot of memory leaks and performance issues with this approach. The application was too slow and did not free memory quickly enough, climbing to around 1 GB and crashing.
My second version was a little more complicated. I filled the array in the singleton area with one object per code, and used the function you can see below.
- (void)codeIsSame:(NSArray *)codeForCheck
{
    //@synchronized(self) {
    NSString *code = [codeForCheck objectAtIndex:0];
    if ([_code isEqualToString:code])
    {
        code = nil;
        NSUInteger queneNumberInt = [[codeForCheck objectAtIndex:1] intValue];
        NSLog(@"We match code:%@ country:%@ specific:%@ quene:%lu", _code, _country, _specific, queneNumberInt);
        [[ProjectArrays sharedProjectArrays].arrayDictionaryesForCountryCodesResult insertObject:_result atIndex:queneNumberInt];
    }
    code = nil;
    //}
    return;
}
The way to retrieve the necessary values is:
SEL selector = @selector(codeIsSame:);
[[ProjectArrays sharedProjectArrays].myCountrySpecificCodeListWithClass makeObjectsPerformSelector:selector withObject:codePlusQueueNumber];
This version works much better: no memory leaks and it is very quick, but it is too hard to debug. Sometimes I receive an empty result; I tried to synchronize the thread jobs, but it still does not work stably. The main problem with this approach is that, for some strange reason, I sometimes have no result in my singleton array. I tried to debug it using the array index for the different threads, and found that the class simply missed an answer.
Core Data does not let me make a copy of the main MOC, so I cannot use it in a multithreaded design (lock and unlock is not a good idea, and that approach produces too many errors in the lock/unlock part of the code).
Can anybody suggest what I could do better in this case? I need an approach that works stably and is easy to code and understand.
My current solution uses an NSDictionary whose keys are the codes, and under each code I have a dictionary with country/specific. That works fine as well, but it does not solve the main problem: using Core Data when you need access to the same data from many threads.
