Godot: Rendering Terrain In Separate Thread Slower Than Main Thread? - multithreading

I'll admit I'm new to multithreading; I was hoping to dabble in it with my C++ project first, but I've hit a snag in my Godot game project. Rendering the terrain gave me a small lag spike every time new terrain was generated, so I wanted to move it to a separate thread. The only problem is that I can't find good resources on Godot multithreading, so I was simply going off the documentation. I practically copied the design from the documentation, but it ended up making my game slower, and it even lags the main thread, not just the generation thread.
I've done a lot of my own research, and I know SO expects that, so I want to list it here:
The Godot docs only teach you about starting a thread, mutexes, and semaphores.
From what I understand, a mutex locks a resource so that only the thread holding the lock can access it, and only that thread can unlock it. From tests on my machine, constant locking and unlocking doesn't seem to cause much overhead.
A semaphore, from what I understand, is a tool for signaling one thread from another: whereas a mutex can only be locked and unlocked from the same thread, one thread can post a semaphore while another thread waits on that signal. This doesn't seem to cause much overhead either.
Doing some practical experiments, it seems that if I get a handle on a chunk and call its render method, the method doesn't run on that thread, which I assume is the culprit. If that's the case, though, I don't understand why the rendering could be SLOWER than doing it all on the main thread, unless there's overhead to calling a function on an object that was created on the main thread. That confuses me even more: isn't all memory shared between threads? Why would it need to do something extra to call a function?
Using call_deferred seems to make the separate thread slightly faster but heavily slows down the main thread. To be honest, I'm not completely knowledgeable about call_deferred; it seems to call the function during idle time. I experimented with it because of my next research point, which is:
Thread-safe APIs. After reading this, I understand that interacting with the active scene tree isn't thread-safe, which means call_deferred is preferred for interacting with it during idle time. The docs state that it is preferable to construct one scene on a separate thread and then use call_deferred to make a single call to add_child. This seems to get around the thread-safety issue, so that's what I did.
That's the best research I could do, and I hope it shows that I really have tried what I could. I'm sure it's not the best that's possible; it's just the extent of my expertise in research, which is why I came here. (Y'all seem to have expertise far beyond what I can imagine having, haha.)
Taking what I understood from all of this, I created a system where, once an array of indices of positions to generate is written, the main thread posts to a semaphore, which starts the generation thread's algorithm. The thread runs in a while loop that begins with semaphore.wait(), waiting for the signal that the array has been written and is ready. It then goes through the indices and calls the render function for the chunks around each point (I didn't mention that the array holds a Vector2 of the chunk position to render around). For now the only point is the player's position, so the array always has one element. The render function of each chunk builds a Node2D with all the tiles before making a single call to add_child through call_deferred, to get around the thread-safety issues. One issue is that there is one call_deferred per chunk; when I tried to fix that, it wouldn't work at all, which was also weird.
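The design described above (a worker that sleeps on a semaphore, copies the pending work under a mutex, and does the heavy lifting afterwards) can be sketched in plain Python; everything here (the names, the "chunk" tuples) is a hypothetical stand-in for the GDScript version, not Godot API:

```python
import threading

work_items = []    # shared: positions to generate chunks around
results = []       # shared: finished "chunks" (stand-ins)
exit_flag = False  # set by the main thread to stop the worker
mutex = threading.Lock()
semaphore = threading.Semaphore(0)

def worker():
    while True:
        semaphore.acquire()           # like Semaphore.wait() in Godot
        with mutex:
            batch = list(work_items)  # copy under the lock, then release it
            work_items.clear()
            should_exit = exit_flag and not batch
        for pos in batch:             # heavy work happens OUTSIDE the lock
            results.append(("chunk", pos))
        if should_exit:
            break

def queue_work(pos):
    with mutex:
        work_items.append(pos)
    semaphore.release()               # like Semaphore.post() in Godot

thread = threading.Thread(target=worker)
thread.start()
queue_work((0, 0))
queue_work((1, 0))

with mutex:                           # shutdown: set flag, wake worker, join
    exit_flag = True
semaphore.release()
thread.join()
```

The key detail in this sketch is that the lock is held only long enough to copy the work list, so the producer is never blocked while the heavy generation runs.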
So here I am with the code:
GameMap Code (Simplified)
# Made up of MapChunks
var map = {}
var chunk_loaders = [Vector2(0, 0)]
var render_distance = 7    # Must be odd
var chunk_tick_dist = 19   # Must be odd
var noise_list = {"map_noise": null, "foliage_noise": null}
var chunk_gen_thread
var chunk_gen_thread_exit = false
var mutex
var semaphore
var indices_to_generate = []

onready var player = get_node("../Player")

func _ready():
    mutex = Mutex.new()
    semaphore = Semaphore.new()
    chunk_gen_thread = Thread.new()
    chunk_gen_thread.start(self, "chunk_generation_thread")
    generate_noise()
    regen_chunks(get_chunk_from_world(player.position.x, player.position.y), 0)

func _exit_tree():
    mutex.lock()
    chunk_gen_thread_exit = true
    mutex.unlock()
    semaphore.post()
    chunk_gen_thread.wait_to_finish()

func chunk_generation_thread(userData):
    while true:
        semaphore.wait()  # Wait for chunk gen signal
        # Check the exit flag under the mutex
        mutex.lock()
        var should_exit = chunk_gen_thread_exit
        mutex.unlock()
        if should_exit:
            break
        # Regen chunks
        mutex.lock()
        for i in indices_to_generate:
            var lc_pos = Vector2(chunk_loaders[i].x - floor(render_distance / 2), chunk_loaders[i].y - floor(render_distance / 2))
            var upper_lc = lc_pos - Vector2(1, 1)
            for x in render_distance:
                for y in render_distance:
                    var chunk_pos = Vector2(lc_pos.x + x, lc_pos.y + y)
                    var chunk = retrieve_chunk(chunk_pos.x, chunk_pos.y)
                    chunk.rerender_chunk()
            for x in render_distance + 2:
                for y in render_distance + 2:
                    if x != 0 and x != render_distance + 1 and y != 0 and y != render_distance + 1:
                        continue
                    var chunk = Vector2(upper_lc.x + x, upper_lc.y + y)
                    var unrender_chunk = retrieve_chunk(chunk.x, chunk.y)
                    unrender_chunk.unrender()
        mutex.unlock()

func regen_chunks(chunk_position, chunk_loader_index):
    mutex.lock()
    if chunk_loader_index >= chunk_loaders.size():
        chunk_loaders.append(Vector2(0, 0))
    chunk_loaders[chunk_loader_index] = chunk_position
    indices_to_generate = [chunk_loader_index]
    mutex.unlock()
    semaphore.post()

func retrieve_chunk(x, y):
    mutex.lock()
    if !map.has(Vector2(x, y)):
        create_chunk(x, y)
    mutex.unlock()
    return map[Vector2(x, y)]

func create_chunk(x, y):
    var new_chunk = MapChunk.new()
    add_child(new_chunk)
    new_chunk.generate_chunk(x, y)
MapChunk Code (Simplified)
var thread_scene
onready var game_map = get_parent()

func _ready():
    thread_scene = Node2D.new()

func generate_chunk(x, y):
    chunk_position = Vector2(x, y)
    rerender_chunk()

func rerender_chunk():
    if !un_rendered:
        return
    un_rendered = false
    lc_position.x = chunk_position.x * CHUNK_WIDTH
    lc_position.y = chunk_position.y * CHUNK_HEIGHT
    thread_scene.queue_free()
    thread_scene = Node2D.new()
    chunk_map.resize(CHUNK_WIDTH)
    for x in CHUNK_WIDTH:
        chunk_map[x] = []
        chunk_map[x].resize(CHUNK_HEIGHT)
        for y in CHUNK_HEIGHT:
            var cell_value = game_map.get_noise_value("map_noise", lc_position.x + x, lc_position.y + y)
            assign_ground_cell(cell_value, x, y)
    self.call_deferred("add_child", thread_scene)

func unrender():
    if un_rendered:
        return
    un_rendered = true
    for x in CHUNK_WIDTH:
        for y in CHUNK_HEIGHT:
            if chunk_map[x][y].occupying_tile != null:
                chunk_map[x][y].occupying_tile.call_deferred("queue_free")
            chunk_map[x][y].call_deferred("queue_free")

func assign_ground_cell(cell_value, x, y):
    if cell_value < 0.4:
        chunk_map[x][y] = game_map.create_tile("GRASS", lc_position.x + x, lc_position.y + y)
        generate_grass_foliage(x, y)
    elif cell_value < 0.5:
        chunk_map[x][y] = game_map.create_tile("SAND", lc_position.x + x, lc_position.y + y)
    else:
        chunk_map[x][y] = game_map.create_tile("WATER", lc_position.x + x, lc_position.y + y)
    thread_scene.add_child(chunk_map[x][y])

func generate_grass_foliage(x, y):
    var cell_value = game_map.get_noise_value("foliage_noise", lc_position.x + x, lc_position.y + y)
    if cell_value >= 0.4:
        chunk_map[x][y].occupying_tile = game_map.create_tile("TREE", lc_position.x + x, lc_position.y + y)
        chunk_map[x][y].occupying_tile.parent_tile = chunk_map[x][y]
        chunk_map[x][y].occupying_tile.z_index = 3
    elif cell_value >= 0.2 and cell_value < 0.4:
        chunk_map[x][y].occupying_tile = game_map.create_tile("GRASS_BLADE", lc_position.x + x, lc_position.y + y)
        chunk_map[x][y].occupying_tile.parent_tile = chunk_map[x][y]
        chunk_map[x][y].occupying_tile.z_index = 1
    if chunk_map[x][y].occupying_tile != null:
        thread_scene.add_child(chunk_map[x][y].occupying_tile)
KEEP IN MIND
All of this code works fine if it's all on the main thread!
There is nothing wrong with the chunk generation code itself! It works completely fine if I remove the thread.start() call from _ready(). Everything works, except there's about a 0.5-second lag spike every time it's called, which is what I'm trying to get rid of. I'm almost 89% sure this is purely a threading problem. (I'm sure I could improve the chunk generation algorithm too, but I also really want to understand threads.)

Related

How can I include a progress indicator in Octave for parallel computations?

I wrote a function in Octave that uses parcellfun from the parallel package to split calculations up across multiple threads.
Even with multithreading, though, some calculations may take multiple hours to finish, so I would like to include some kind of progress indicator along the way. In the non-parallel version, it was fairly simple to just send the iteration counter to a waitbox object. The parallel version causes some problems.
So far, I have tried to write an extra function that can be called by each parallel child; it uses persistent variables to try to keep information shared between the threads. That function is as follows.
function parallelWaitbox(i, s)
  mlock();
  persistent n = 0;   % Completed calculations
  persistent m = 100; % Total calculations
  persistent l = 0;   % Last percentage done (0:0.01:1)
  persistent h;       % Waitbox handle
  % Send 0 to initialize
  if (0 == i)
    n = 0;
    m = s;
    msg = sprintf("Total Operations: %i\r\n%i%% Complete", m, 0);
    h = waitbar(0, msg);
  endif
  % Send 1 to increment
  if (1 == i)
    n++;
    % Special case: max
    if (n == m)
      msg = sprintf("Total Operations: %i\r\n100%% Complete", m);
      waitbar(1, h, msg);
    else
      p = floor(100*n/m)/100;
      if (p > l)
        msg = sprintf("Total Operations: %i\r\n%i%% Complete", m, p*100);
        waitbar(p, h, msg);
      endif
      l = p;
    endif
  endif
endfunction
It is initialized with a call of parallelWaitbox(0,max) before the parcellfun call, and the parallel function calls parallelWaitbox(1) when it finishes. Unfortunately, because each thread is its own instance of Octave, they don't share this function, even when mlock() is called.
I tried to pass a handle to the parallelWaitbox function to the parallel function, in hopes it would help the different threads access the same version of the function, but it did not work.
I am not sure if passing a handle to the waitbox object would work, but even if it did there is no way to read from the waitbox that I am aware of, so the problem of keeping track of the current state would remain.
I know that I could use a for loop to split my parcellfun call up to 100 chunks, but I'd really rather avoid slowing my processing down. If there's a better way to do this, I'd love to know about it. I am not tied to the waitbox object if there is an alternative.
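One way around the shared-state problem is to invert it: let the parent process own the progress display and consume results as they complete, so the children never need to share anything. Here is a sketch of that idea in Python (a hypothetical stand-in for the Octave setup, with a trivial function in place of the real calculations):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def slow_square(x):
    # hypothetical stand-in for one long-running calculation
    return x * x

def run_with_progress(values):
    values = list(values)
    completed = 0
    results = []
    progress_log = []                      # a real UI would drive a waitbar here
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(slow_square, v) for v in values]
        for fut in as_completed(futures):  # yields each task as it finishes
            results.append(fut.result())
            completed += 1
            progress_log.append(100 * completed // len(values))
    return sorted(results), progress_log

results, log = run_with_progress(range(10))
```

Because the parent sits in the `as_completed` loop, it sees every completion as it happens and can update a single progress widget without any cross-worker state.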

multithreading multiple short tasks in C++ 11 slows down the process?

I'm not really experienced when it comes to multithreading. I have a facial landmark detector that detects 68 landmarks around the facial components. For every single landmark, HoG features around it need to be extracted and appended to the previous landmark's features to create a giant vector before passing it to the regressor.
Currently, all the features are extracted serially, one after another, and I'm trying to extract them in parallel to speed up the process.
Extracting the features around all the landmarks IN SERIAL takes about 2.5 ms on my system. When I try to parallelize it using 68 threads, it takes about 8.5 ms. So it actually slows down the process, and I'm guessing this is probably because of thread initialization time.
The following is the original code in serial
for (int i = 0; i < 68; i++) {  // for each landmark
    fx = shape[i];       // x position
    fy = shape[i + 68];  // y position
    extract_features(image, fx, fy, &features[i]);
}
Now this is what I have done to parallelize it
vector<std::thread> threads;
for (int i = 0; i < 68; i++) {  // for each landmark
    fx = shape[i];       // x position
    fy = shape[i + 68];  // y position
    // capture i (and fx, fy) by value so each thread gets its own copy
    threads.emplace_back(
        [&, image, fx, fy, i]() { extract_features(image, fx, fy, &features[i]); }
    );
}
for (int x = 0; x < 68; x++)
    threads[x].join();
I must be doing something wrong that is slowing down the process instead of speeding it up. My best guess is that initializing a thread the way I'm doing it is more time-consuming than the task itself. If that's the case, is there a way I can initialize the threads ahead of time and just hand them work in the for loop?
I would very much appreciate your help in guiding me through finding the right approach to this project.
Thanks,
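The "initialize the threads already" idea is exactly what a thread pool does: the workers are created once and reused for every small task, so you pay the thread start-up cost once instead of 68 times. A sketch of the pattern in Python (`extract_features_stub` is a made-up stand-in; this illustrates the structure, not C++ performance):

```python
from concurrent.futures import ThreadPoolExecutor

def extract_features_stub(i):
    # hypothetical stand-in for the per-landmark HoG extraction
    return i * 2

landmarks = list(range(68))

# One thread per tiny task pays 68 create/join costs; a pool created once
# amortizes that cost and reuses a few workers for all 68 tasks.
with ThreadPoolExecutor(max_workers=4) as pool:
    features = list(pool.map(extract_features_stub, landmarks))
```

In C++ the same shape applies: keep a fixed set of worker threads alive and feed them tasks from a queue, rather than constructing a `std::thread` per landmark.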

Swift - GCD mutable array multiple threads issue "mutated while being enumerated"

I am currently developing a game where spheres are falling from the sky. While collecting spheres you gain points, and after a certain number of points all spheres accelerate to a new speed.
New spheres are continuously added to an Array (4 Spheres inside each SKNode).
When they are to accelerate I iterate through the array to increase the speed of all of them.
When the spheres have fallen out of the screen I remove them from the Array.
class GameScene: SKScene, SKPhysicsContactDelegate {
    ...
    var allActiveNodes = Array<SKNode>()
    private let concurrentNodesQueue = dispatch_queue_create(
        "com.SphereHunt.allActiveNodesQueue", DISPATCH_QUEUE_CONCURRENT)
    ...
    // 1. This is where the new spheres are added to the Array via a new thread
    func addSpheres(leftSphere: Sphere, middleLeftSphere: Sphere, middleRightSphere: Sphere, rightSphere: Sphere) {
        ...
        dispatch_barrier_async(self.concurrentNodesQueue) {
            self.allActiveNodes.append(containerNode)
            let queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0)
            dispatch_async(queue) {
                // Set the new spheres in motion
                self.runPastAvatar(containerNode)
            }
        }
    }

    // 2. This function starts a thread that will increase the speed of all active spheres
    func increaseSpeed20percent() {
        durationPercentage = durationPercentage * 0.8
        dispatch_sync(self.concurrentNodesQueue) {
            let copyAllActiveNodes = self.allActiveNodes
            let count = copyAllActiveNodes.count
            for index in 0...count-1 {
                let node = copyAllActiveNodes[index]
                node.removeAllActions()
                self.runPastAvatar(node)
            }
        }
    }

    // 3. This method removes the sphere that is no longer on screen from the Array
    func removeLastNode(node: SKNode) {
        dispatch_barrier_async(self.concurrentNodesQueue) {
            self.allActiveNodes.removeAtIndex(0)
            node.removeFromParent()
            println("Removed")
        }
    }
}
I am not sure I have understood GCD correctly; I have tried multiple solutions, and this is the one I was sure would work. I always end up with the same error message:
*** Terminating app due to uncaught exception 'NSGenericException',
reason: '*** Collection <__NSArrayM: 0x17004c9f0> was mutated while being enumerated.'
How do I get the threads to not interfere with each other while handling the array?
I'm not sure if this is the issue, but from the documentation for:
func dispatch_sync(_ queue: dispatch_queue_t,
                   _ block: dispatch_block_t)
Unlike with dispatch_async, no retain is performed on the target queue. Because calls to this function are synchronous, it "borrows" the reference of the caller. Moreover, no Block_copy is performed on the block.
As an optimization, this function invokes the block on the current thread when possible.
I bolded the important part here. Why not run the loop with dispatch_barrier_sync instead?
My problem was that I was using a thread-sleep solution to fire new spheres at a time interval. This was a bad choice, though in my opinion it shouldn't have produced such an error message. I solved it by using NSTimer to fire new spheres at a time interval instead. This gave the game a bit of lag, but it is more robust and won't crash. Next up is figuring out how to use NSTimer without creating such lag in the game!
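The invariant the GCD code is after can be shown in plain Python (hypothetical names, threads instead of dispatch queues): every mutation of the shared list happens under one lock, and readers iterate over a snapshot copy, so the collection is never mutated while being enumerated:

```python
import threading

nodes = []                 # shared mutable list, like allActiveNodes
lock = threading.Lock()

def add_node(n):           # writer (the "barrier" role in the GCD version)
    with lock:
        nodes.append(n)

def remove_first():        # writer
    with lock:
        if nodes:
            nodes.pop(0)

def apply_to_all(action):  # reader: iterate a snapshot, never the live list
    with lock:
        snapshot = list(nodes)
    for n in snapshot:     # safe even if writers run concurrently now
        action(n)

writers = [threading.Thread(target=add_node, args=(i,)) for i in range(20)]
for t in writers:
    t.start()
for t in writers:
    t.join()

applied = []
apply_to_all(applied.append)
```

A concurrent dispatch queue with barrier writes gives you the same guarantee without a lock object, as long as every access, including reads, goes through that one queue.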

Design pattern for asynchronous while loop

I have a function that boils down to:
while (doWork)
{
    config = generateConfigurationForTesting();
    result = executeWork(config);
    doWork = isDone(result);
}
How can I rewrite this for efficient asynchronous execution, assuming all functions are thread-safe and independent of previous iterations, and that it will probably require more iterations than the maximum number of allowable threads?
The problem here is that we don't know how many iterations are required in advance, so we can't make a dispatch_group or use dispatch_apply.
This is my first attempt, but it looks a bit ugly to me because of the arbitrarily chosen values and the sleeping:
int thread_count = 0;
bool doWork = true;
int max_threads = 20; // arbitrarily chosen number
dispatch_queue_t queue =
    dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
while (doWork)
{
    if (thread_count < max_threads)
    {
        dispatch_async(queue, ^{
            Config myconfig = generateConfigurationForTesting();
            Result myresult = executeWork(myconfig);
            dispatch_async(queue, ^{ checkResult(myresult); });
        });
        thread_count++;
    }
    else
        usleep(100); // don't consume too much CPU
}

void checkResult(Result value)
{
    if (value == good) doWork = false;
    thread_count--;
}
Based on your description, it looks like generateConfigurationForTesting is some kind of randomization technique, or otherwise a generator that can produce a near-infinite number of configurations (hence your comment that you don't know ahead of time how many iterations you will need). With that assumption, you are basically stuck with the model you've created, since your executor needs to be limited by some reasonable assumptions about the queue and you don't want to over-generate, as that would just extend the length of the run after you have succeeded in finding value == good measurements.
I would suggest you consider using a queue (or OSAtomicIncrement* and OSAtomicDecrement*) to protect access to thread_count and doWork. As it stands, the thread_count increment and decrement will happen in two different queues (main_queue for the main thread and the default queue for the background task) and thus could simultaneously increment and decrement the thread count. This could lead to an undercount (which would cause more threads to be created than you expect) or an overcount (which would cause you to never complete your task).
Another option to making this look a little nicer would be to have checkResult add new elements into the queue if value!=good. This way, you load up the initial elements of the queue using dispatch_apply( 20, queue, ^{ ... }) and you don't need the thread_count at all. The first 20 will be added using dispatch_apply (or an amount that dispatch_apply feels is appropriate for your configuration) and then each time checkResult is called you can either set doWork=false or add another operation to queue.
dispatch_apply() works for this: just pass ncpu as the number of iterations (apply never uses more than ncpu worker threads) and keep each instance of your worker block running for as long as there is more work to do (i.e., loop back to generateConfigurationForTesting() unless !doWork).
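The bounded "keep trying until a good result" loop discussed above can be sketched in plain Python (hypothetical names, a thread pool standing in for the dispatch queue): a counting semaphore caps the number of in-flight iterations, so the driver blocks instead of sleep-polling, and an Event replaces the unsynchronized doWork flag:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 4
slots = threading.Semaphore(MAX_IN_FLIGHT)  # caps concurrent iterations
done = threading.Event()                    # replaces the doWork flag
lock = threading.Lock()
attempts = 0

def generate_configuration_for_testing():
    # hypothetical generator: just hands out increasing integers
    global attempts
    with lock:
        attempts += 1
        return attempts

def execute_work(config):
    return config  # hypothetical: a "good" result is the value 10

def search():
    with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
        def one_iteration():
            try:
                if not done.is_set():
                    if execute_work(generate_configuration_for_testing()) == 10:
                        done.set()
            finally:
                slots.release()   # free the slot for the next iteration
        while not done.is_set():
            slots.acquire()       # blocks instead of usleep() polling
            if done.is_set():
                slots.release()
                break
            pool.submit(one_iteration)

search()
```

The semaphore plays the role of thread_count, but because acquire/release are atomic there is no risk of the under/over-count described above, and the driver never burns CPU waiting.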

What are Lua coroutines even for? Why doesn't this code work as I expect it?

I'm having trouble understanding this code. I was expecting something similar to threading, where I would get output with random "nooo"s and "yaaaay"s interspersed as both print asynchronously, but instead I discovered that the main thread seems to block on the first call to coroutine.resume() and thus prevents the second coroutine from starting until the first has yielded.
If this is the intended behavior of coroutines, what are they useful for, and how would I achieve the goal I was hoping for? Would I have to implement my own scheduler for these coroutines to operate asynchronously? That seems messy, and I may as well just use functions!
co1 = coroutine.create(function ()
    local i = 1
    while i < 200 do
        print("nooo")
        i = i + 1
    end
    coroutine.yield()
end)

co2 = coroutine.create(function ()
    local i = 1
    while i < 200 do
        print("yaaaay")
        i = i + 1
    end
    coroutine.yield()
end)

coroutine.resume(co1)
coroutine.resume(co2)
Coroutines aren't threads.
Coroutines are like threads that are never actively scheduled. So yes, you are kind of correct that you would have to write your own scheduler to have both coroutines run simultaneously.
However, you are missing the bigger picture when it comes to coroutines. Check out Wikipedia's list of coroutine uses. Here is one concrete example that might guide you in the right direction.
-- level script
-- a volcano erupts every 2 minutes
function level_with_volcano( interface )
    while true do
        wait(seconds(5))
        start_eruption_volcano()
        wait(frames(10))
        s = play("rumble_sound")
        wait( end_of(s) )
        start_camera_shake()
        -- more stuff
        wait(minutes(2))
    end
end
The above script could be written to run iteratively with a switch statement and some clever state variables, but it is much clearer written as a coroutine. The script could also be a thread, but do you really need to dedicate a kernel thread to this simple code? A busy game level could have hundreds of these coroutines running without impacting performance, whereas if each were a thread you might get away with 20-30 before performance started to suffer.
A coroutine lets me write code that stores its state on the stack so that I can stop running it for a while (the wait functions) and start it again where I left off.
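That "stop for a while and pick up exactly where I left off" idea can be sketched with Python generators, which are the closest Python analog to a Lua coroutine (the event tuples here are made up for illustration):

```python
def volcano_script():
    # Each yield suspends the script; a scheduler would resume it later,
    # and execution continues from exactly this point, with all locals intact.
    while True:
        yield ("wait_seconds", 5)
        yield ("erupt",)
        yield ("wait_frames", 10)
        yield ("rumble",)

script = volcano_script()
first_three = [next(script) for _ in range(3)]
```

A scheduler would call `next(script)` each time the requested wait expires, which is exactly the role the `wait`/`continue_script` machinery plays in the Lua code below.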
Since there have been a number of comments asking how to implement the wait function that would make deft_code's example work, I've decided to write a possible implementation. The general idea is that we have a scheduler with a list of coroutines, and the scheduler decides when to return control to the coroutines after they give up control with their wait calls. This is desirable because it makes asynchronous code be readable and easy to reason about.
This is only one possible use of coroutines, they are a more general abstraction tool that can be used for many different purposes (such as writing iterators and generators, writing stateful stream processing objects (for example, multiple stages in a parser), implementing exceptions and continuations, etc.).
First: the scheduler definition:
local function make_scheduler()
    local script_container = {}
    return {
        continue_script = function(frame, script_thread)
            if script_container[frame] == nil then
                script_container[frame] = {}
            end
            table.insert(script_container[frame], script_thread)
        end,
        run = function(frame_number, game_control)
            if script_container[frame_number] ~= nil then
                local i = 1
                -- recheck length every time, to allow a coroutine to resume on
                -- the same frame
                local scripts = script_container[frame_number]
                while i <= #scripts do
                    local success, msg =
                        coroutine.resume(scripts[i], game_control)
                    if not success then error(msg) end
                    i = i + 1
                end
            end
        end
    }
end
Now, initialising the world:
local fps = 60
local frame_number = 1
local scheduler = make_scheduler()

scheduler.continue_script(frame_number, coroutine.create(function(game_control)
    while true do
        -- instead of passing game_control as a parameter, we could
        -- have equivalently put these values in _ENV.
        game_control.wait(game_control.seconds(5))
        game_control.start_eruption_volcano()
        game_control.wait(game_control.frames(10))
        s = game_control.play("rumble_sound")
        game_control.wait( game_control.end_of(s) )
        game_control.start_camera_shake()
        -- more stuff
        game_control.wait(game_control.minutes(2))
    end
end))
The (dummy) interface to the game:
local game_control = {
    seconds = function(num)
        return math.floor(num*fps)
    end,
    minutes = function(num)
        return math.floor(num*fps*60)
    end,
    frames = function(num) return num end,
    end_of = function(sound)
        return sound.start + sound.duration - frame_number
    end,
    wait = function(frames_to_wait_for)
        scheduler.continue_script(
            frame_number + math.floor(frames_to_wait_for),
            coroutine.running())
        coroutine.yield()
    end,
    start_eruption_volcano = function()
        -- obviously in a real game, this could
        -- affect some datastructure in a non-immediate way
        print(frame_number..": The volcano is erupting, BOOM!")
    end,
    start_camera_shake = function()
        print(frame_number..": SHAKY!")
    end,
    play = function(soundname)
        print(frame_number..": Playing: "..soundname)
        return {name = soundname, start = frame_number, duration = 30}
    end
}
And the game loop:
while true do
    scheduler.run(frame_number, game_control)
    frame_number = frame_number + 1
end
To get the interleaved output from the original question, the two coroutines have to hand control back and forth explicitly:
co1 = coroutine.create(
    function()
        for i = 1, 100 do
            print("co1_"..i)
            coroutine.yield(co2)
        end
    end
)

co2 = coroutine.create(
    function()
        for i = 1, 100 do
            print("co2_"..i)
            coroutine.yield(co1)
        end
    end
)

for i = 1, 100 do
    coroutine.resume(co1)
    coroutine.resume(co2)
end
