I am currently developing a game where spheres fall from the sky. You gain points by collecting spheres, and after a certain number of points all spheres accelerate to a higher speed.
New spheres are continuously added to an Array (four spheres inside each SKNode).
When they need to accelerate, I iterate through the array and increase the speed of each of them.
When the spheres have fallen off the screen, I remove them from the Array.
class GameScene: SKScene, SKPhysicsContactDelegate {
    ...
    var allActiveNodes = Array<SKNode>()
    private let concurrentNodesQueue = dispatch_queue_create(
        "com.SphereHunt.allActiveNodesQueue", DISPATCH_QUEUE_CONCURRENT)
    ...

    //1. This is where the new spheres are added to the Array via a new thread
    func addSpheres(leftSphere: Sphere, middleLeftSphere: Sphere, middleRightSphere: Sphere, rightSphere: Sphere){
        ...
        dispatch_barrier_async(self.concurrentNodesQueue){
            self.allActiveNodes.append(containerNode)
            let queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0)
            dispatch_async(queue) {
                //Set the new spheres in motion
                self.runPastAvatar(containerNode)
            }
        }
    }

    //2. This function starts a thread that will increase the speed of all active spheres
    func increaseSpeed20percent(){
        durationPercentage = durationPercentage * 0.8
        dispatch_sync(self.concurrentNodesQueue){
            let copyAllActiveNodes = self.allActiveNodes
            let count = copyAllActiveNodes.count
            for index in 0...count-1{
                let node = copyAllActiveNodes[index]
                node.removeAllActions()
                self.runPastAvatar(node)
            }
        }
    }

    //3. This method removes the sphere that is not in screen anymore from the Array
    func removeLastNode(node: SKNode){
        dispatch_barrier_async(self.concurrentNodesQueue){
            self.allActiveNodes.removeAtIndex(0)
            node.removeFromParent()
            println("Removed")
        }
    }
}
I am not sure I have understood GCD correctly. I have tried multiple solutions, and this is the one I was sure would work. I always end up with the same error message:
*** Terminating app due to uncaught exception 'NSGenericException',
reason: '*** Collection <__NSArrayM: 0x17004c9f0> was mutated while being enumerated.'
How do I get the threads to not interfere with each other while handling the array?
I'm not sure if this is the issue, but from the documentation for:
func dispatch_sync(_ queue: dispatch_queue_t,
_ block: dispatch_block_t)
> Unlike with dispatch_async, no retain is performed on the target queue. Because calls to this function are synchronous, it "borrows" the reference of the caller. Moreover, no Block_copy is performed on the block.
>
> As an optimization, this function **invokes the block on the current thread when possible**.
I bolded the important part here. Why don't you run the loop with dispatch_barrier_sync instead?
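To illustrate the suggestion, here is a minimal sketch of my own (not code from the question or answer). Since I am writing it in C++ rather than Swift, it uses the function-pointer variant dispatch_barrier_sync_f of the same C libdispatch API; the names increaseSpeedWork and increaseSpeed are made up for this example.

#include <dispatch/dispatch.h>

// The work to run exclusively: iterate over the active nodes and restart
// their actions (this corresponds to the loop in increaseSpeed20percent).
static void increaseSpeedWork(void *context)
{
    // ... node.removeAllActions(); runPastAvatar(node); for every node ...
}

void increaseSpeed(dispatch_queue_t concurrentNodesQueue)
{
    // A barrier submitted to the concurrent queue waits for all blocks
    // already running on that queue to finish, runs alone, and only then
    // lets later blocks run, so the array cannot be appended to or have
    // elements removed while the loop executes.
    dispatch_barrier_sync_f(concurrentNodesQueue, nullptr, increaseSpeedWork);
}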
My problem was that I was using a thread-sleep solution to fire new spheres at a fixed interval. This was a bad choice, though in my opinion it should not have produced such an error message. I solved it by using an NSTimer to fire new spheres at a fixed interval instead. This gave the game a bit of lag, but it is more robust and won't crash. Next up is finding out how to use the NSTimer without creating such lag in the game!
Related
Good day. I'm doing an assignment that involves writing a program which solves the Traveling Salesman Problem in parallel using the brute-force method. I managed to write a working version, but it was a lot slower than the sequential version due to numerous memory allocations happening at a high rate, which I attempted to limit with a buffered channel, as in the code below.
SO probably won't allow all the code to fit into the post, so please view the definitions and methods of the data structures on GitHub: https://github.com/telephrag/tsp_go
This version doesn't work at all.
At the end, t contains an empty path, and the traveled distance is still the maximum uint64 value it was set to at the beginning.
As can be seen through the debugger, the salesman with id 1 gets a node added to its path at sm.Path[1], but once <-t.RouteQueue occurs in the same call to travel(), the program stops, even though numerous goroutines are supposedly waiting to write to t.RouteQueue at that moment. The debugger also confirms that the program never reaches the if-block responsible for setting a new shortest path in t.
If we create a sync.WaitGroup for each for-loop, the program crashes with a deadlock.
If I use a sync.WaitGroup but remove everything related to the channel, the program works, but very slowly. You can get this version at the first commit to the main branch of the repository above.
Why does the program end prematurely?
package tsp

func (t *Tsp) Solve() {
    sm := NewSalesman(t)
    sm.Visit(0) // sets bit corresponding to given node in bitmask
    for nextNode := range graph[0] {
        t.RouteQueue <- true // book slot in route queue (buffered channel)
        go t.travel(sm.Copy(t), nextNode)
    }
}

func (t *Tsp) travel(sm *Salesman, node uint64) {
    sm.Distance += graph[sm.TailNode()][node] // increase traveled distance
    if sm.Distance > t.MinDist { // terminate if traveled distance is bigger than current minimal
        <-t.RouteQueue
        return
    }

    sm.Visit(node)
    sm.Count++               // increase amount of nodes traveled
    sm.Path[sm.Count] = node // add node to path

    if sm.Count == t.NodeCount-1 { // stop if t.NodeCount - 1 nodes traveled
        sm.Count++
        sm.Distance += graph[node][0] // return to zero-node

        t.Mu.Lock()
        if t.MinDist > sm.Distance { // set new min distance and path if they are shorter
            t.MinPath = sm.Path
            t.MinDist = sm.Distance
        }
        t.Mu.Unlock()
    }

    <-t.RouteQueue // free the slot for routes

    for nextNode := range graph[node] {
        if !sm.HasVisited(node) {
            t.RouteQueue <- true
            go t.travel(sm.Copy(t), nextNode)
        }
    }
}
I'm new to graphics rendering, and I'm trying to write a Win32 drawing app using D2D and D3D11. I use two overlapped offscreen D2D bitmaps to preserve the content of the canvas; the top-level bitmap is transparent.
Whenever a mouse message is received, a line from the last point to the current point is rendered to the top-level bitmap. Then I draw both bitmaps, in Z-order, to the back buffer of the swap chain and call Present(0, 0).
As you may notice, the Present call is event-driven in my design. If a mouse message is received every 1 ms, then in 1 second I will render 1000 polylines on the top-level bitmap (which is good), but I will also composite the two bitmaps 1000 times and call Present 1000 times (which is really bad, since I only need to present at, let's say, 60 fps). The redundant composition and present calls consume most of the GPU's resources, and eventually Present(0, 0) blocks the UI thread, so the frequency of reported mouse messages drops dramatically.
int OnMouseMove(int x, int y)
{
    // ...
    // update top-level bitmap
    DrawLineTo(topLevelBitmap, x, y);

    // get back buffer
    CComPtr<IDXGISurface> dxgiBackBuffer;
    HRESULT hr = _dxgiSwapChain->GetBuffer(0, IID_PPV_ARGS(&dxgiBackBuffer));

    // Draw two bitmaps to the back buffer
    ClearBackBuffer(dxgiBackBuffer);
    DrawBimap(dxgiBackBuffer, backgroundBitmap);
    DrawBimap(dxgiBackBuffer, topLevelBitmap);

    // Present
    DXGI_PRESENT_PARAMETERS parameters = { 0 };
    _dxgiSwapChain->Present1(0, 0, &parameters);
    // ...
}
I tried to find a callback that could be used to trigger Present calls, like CVDisplayLink/CADisplayLink on macOS/iOS, or a higher-priority timer that could generate reliable callbacks, but failed. (WM_TIMER has a rather low priority, so I didn't even try it.)
Another thought is to create a new thread, call Present in a while loop, and sleep for 16 ms after each Present call. However, I'm not sure whether this is a standard approach, and I also worry about thread safety.
// in UI thread
int OnMouseMove(int x, int y)
{
    // update top-level bitmap
    DrawLineTo(topLevelBitmap, x, y);
}

// in new thread
while (1)
{
    // get back buffer
    CComPtr<IDXGISurface> dxgiBackBuffer;
    HRESULT hr = _dxgiSwapChain->GetBuffer(0, IID_PPV_ARGS(&dxgiBackBuffer));

    // Draw two bitmaps to the back buffer
    ClearBackBuffer(dxgiBackBuffer);
    DrawBimap(dxgiBackBuffer, backgroundBitmap);
    DrawBimap(dxgiBackBuffer, topLevelBitmap);

    // Present
    DXGI_PRESENT_PARAMETERS parameters = { 0 };
    _dxgiSwapChain->Present1(0, 0, &parameters);

    Sleep(16); // Win32 Sleep takes milliseconds
}
So my question is: what is the proper way to separate the off-screen rendering (draw line) from the on-screen rendering (draw bitmaps and present)?
What you should do here is run a non-blocking event loop using PeekMessage and then do your rendering on the same thread, after you've processed all the window events. As for keeping your FPS locked to the monitor's refresh rate, Present's first argument is the sync interval; set it to 1 and Present will block until the monitor is ready to show the next frame.
For example:
while (isOpen)
{
    // Message loop
    MSG message;
    while (PeekMessage(&message, m_WindowHandle, NULL, NULL, PM_REMOVE))
    {
        TranslateMessage(&message);
        DispatchMessage(&message);
    }

    // Render
    DXGI_PRESENT_PARAMETERS parameters = { 0 };
    _dxgiSwapChain->Present1(1, 0, &parameters);
}
Inside your window proc, you should just store the current position of the mouse so that you can then update it once in your render loop.
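To make that concrete, here is a minimal sketch of my own (not from the original answer): the window procedure only records the latest mouse position, and the render loop consumes it once per frame. The names g_mouseX, g_mouseY and g_mouseDirty are made up for this example; DrawLineTo is the helper from the question.

#include <windows.h>
#include <windowsx.h> // GET_X_LPARAM / GET_Y_LPARAM

// Latest mouse position, written by the window proc, read by the render loop.
// No synchronization is needed because the message pump and the rendering
// run on the same thread in this design.
static int  g_mouseX = 0;
static int  g_mouseY = 0;
static bool g_mouseDirty = false;

LRESULT CALLBACK WndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
    switch (msg)
    {
    case WM_MOUSEMOVE:
        g_mouseX = GET_X_LPARAM(lParam);
        g_mouseY = GET_Y_LPARAM(lParam);
        g_mouseDirty = true; // the render loop will draw the new line segment
        return 0;
    default:
        return DefWindowProc(hwnd, msg, wParam, lParam);
    }
}

// Inside the render loop, after the PeekMessage loop:
//     if (g_mouseDirty) { DrawLineTo(topLevelBitmap, g_mouseX, g_mouseY); g_mouseDirty = false; }
//     ...composite the two bitmaps and call Present1(1, 0, &parameters)...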
I've been banging my head against my attempt at a lock-free multiple-producer multiple-consumer ring buffer. The basic idea is to use the innate overflow of the unsigned char and unsigned short types, fix the index type (and hence the buffer size) to one of them, and then you get a free wrap back to the beginning of the ring buffer.
The problem is that my solution doesn't work for multiple producers (it does, however, work for N consumers, and also for single-producer single-consumer).
#include <atomic>

template<typename Element, typename Index = unsigned char> struct RingBuffer
{
    std::atomic<Index> readIndex;
    std::atomic<Index> writeIndex;
    std::atomic<Index> scratchIndex;
    Element elements[1 << (sizeof(Index) * 8)];

    RingBuffer() :
        readIndex(0),
        writeIndex(0),
        scratchIndex(0)
    {
        ;
    }

    bool push(const Element & element)
    {
        while(true)
        {
            const Index currentReadIndex = readIndex.load();
            Index currentWriteIndex = writeIndex.load();
            const Index nextWriteIndex = currentWriteIndex + 1;
            if(nextWriteIndex == currentReadIndex)
            {
                return false;
            }

            if(scratchIndex.compare_exchange_strong(
                currentWriteIndex, nextWriteIndex))
            {
                elements[currentWriteIndex] = element;
                writeIndex = nextWriteIndex;
                return true;
            }
        }
    }

    bool pop(Element & element)
    {
        Index currentReadIndex = readIndex.load();
        while(true)
        {
            const Index currentWriteIndex = writeIndex.load();
            const Index nextReadIndex = currentReadIndex + 1;
            if(currentReadIndex == currentWriteIndex)
            {
                return false;
            }

            element = elements[currentReadIndex];
            if(readIndex.compare_exchange_strong(
                currentReadIndex, nextReadIndex))
            {
                return true;
            }
        }
    }
};
The main idea on the write side was to use a temporary index, scratchIndex, that acts as a pseudo-lock, allowing only one producer at any one time to copy-construct into the elements buffer before updating writeIndex and letting any other producer make progress. Before I am called a heathen for implying my approach is 'lock-free': I realise that this approach isn't exactly lock-free, but in practice (if it worked!) it would be significantly faster than a normal mutex!
I am aware of a (more complex) MPMC ringbuffer solution here http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue, but I am really experimenting with my idea to then compare against that approach and find out where each excels (or indeed whether my approach just flat out fails!).
Things I have tried:
Using compare_exchange_weak
Using more precise std::memory_order values that match the behaviour I want
Adding cache-line padding between the various indices I have
Making elements an array of std::atomic<Element> instead of a plain Element array
I am sure that this boils down to a fundamental segfault in my head as to how to use atomic accesses to get around using mutexes, and I would be entirely grateful to whoever can point out which neurons are drastically misfiring in my head! :)
This is a form of the A-B-A problem. A successful producer looks something like this:
1. load currentReadIndex
2. load currentWriteIndex
3. cmpxchg store scratchIndex = nextWriteIndex
4. store element
5. store writeIndex = nextWriteIndex
If a producer stalls for some reason between steps 2 and 3 for long enough, it is possible for the other producers to produce an entire queue's worth of data and wrap back around to the exact same index so that the compare-exchange in step 3 succeeds (because scratchIndex happens to be equal to currentWriteIndex again).
By itself, that isn't a problem. The stalled producer is perfectly within its rights to increment scratchIndex to lock the queue—even if a magical ABA-detecting cmpxchg rejected the store, the producer would simply try again, reload exactly the same currentWriteIndex, and proceed normally.
The actual problem is the nextWriteIndex == currentReadIndex check between steps 2 and 3. The queue is logically empty if currentReadIndex == currentWriteIndex, so this check exists to make sure that no producer gets so far ahead that it overwrites elements that no consumer has popped yet. It appears to be safe to do this check once at the top, because all the consumers should be "trapped" between the observed currentReadIndex and the observed currentWriteIndex.
Except that another producer can come along and bump up the writeIndex, which frees the consumer from its trap. If a producer stalls between steps 2 and 3, when it wakes up the stored value of readIndex could be absolutely anything.
Here's an example, starting with an empty queue, that shows the problem happening:
1. Producer A runs steps 1 and 2. Both loaded indices are 0. The queue is empty.
2. Producer B interrupts and produces an element.
3. A consumer pops an element. Both indices are 1.
4. Producer B produces 255 more elements. The write index wraps around to 0; the read index is still 1.
5. Producer A awakens from its slumber. It had previously loaded both read and write indices as 0 (empty queue!), so it attempts step 3. Because the other producer coincidentally stopped at index 0, the compare-exchange succeeds and the store proceeds. At completion the producer sets writeIndex = 1; now both stored indices are 1 and the queue is logically empty. A full queue's worth of elements will now be completely ignored.
(I should mention that the only reason I can get away with talking about "stalling" and "waking up" is that all the atomics used are sequentially consistent, so I can pretend that we're in a single-threaded environment.)
Note that the way you are using scratchIndex to guard concurrent writes is essentially a lock; whoever successfully completes the cmpxchg gets exclusive write access to the queue until it releases the lock. The simplest way to fix this failure is to replace scratchIndex with a spinlock; it won't suffer from A-B-A, and a lock is what is effectively happening anyway.
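As a rough illustration of that suggestion (my own sketch, not the answerer's code), push() could look like the following, assuming it lives inside the RingBuffer above and that scratchIndex is replaced by a new std::atomic_flag member named writeLock:

#include <atomic>

// Sketch only: scratchIndex is replaced by a spinlock; the full-queue check
// and the element write now happen while the lock is held, so no other
// producer can interleave with them.
std::atomic_flag writeLock = ATOMIC_FLAG_INIT;

bool push(const Element & element)
{
    // Spin until we own the write side; only one producer proceeds at a time.
    while (writeLock.test_and_set(std::memory_order_acquire))
        ;

    const Index currentReadIndex  = readIndex.load();
    const Index currentWriteIndex = writeIndex.load();
    const Index nextWriteIndex    = currentWriteIndex + 1;

    if (nextWriteIndex == currentReadIndex) // queue is full
    {
        writeLock.clear(std::memory_order_release);
        return false;
    }

    elements[currentWriteIndex] = element;
    writeIndex.store(nextWriteIndex, std::memory_order_release);
    writeLock.clear(std::memory_order_release);
    return true;
}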
bool push(const Element & element)
{
    while(true)
    {
        const Index currentReadIndex = readIndex.load();
        Index currentWriteIndex = writeIndex.load();
        const Index nextWriteIndex = currentWriteIndex + 1;
        if(nextWriteIndex == currentReadIndex)
        {
            return false;
        }

        if(scratchIndex.compare_exchange_strong(
            currentWriteIndex, nextWriteIndex))
        {
            elements[currentWriteIndex] = element;
            // Problem here!
            writeIndex = nextWriteIndex;
            return true;
        }
    }
}
I've marked the problematic spot. Multiple threads can get to the writeIndex = nextWriteIndex at the same time. The data will be written in any order, although each write will be atomic.
This is a problem because you're trying to update two values under the same atomic condition, which is generally not possible with separate atomics. Assuming the rest of your method is fine, one way around this would be to combine scratchIndex and writeIndex into a single value of double the size, for example treating two uint32_t values as a single uint64_t and operating atomically on that.
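A minimal sketch of that packing idea (my own illustration, not code from the answer; it only shows the packing mechanism, not a complete fixed queue, and the names packed, pack and tryClaimSlot are made up here):

#include <atomic>
#include <cstdint>

// Both indices live in one 64-bit atomic: high 32 bits = writeIndex,
// low 32 bits = scratchIndex, so a single CAS observes and updates both.
std::atomic<uint64_t> packed{0};

static uint64_t pack(uint32_t write, uint32_t scratch)
{
    return (uint64_t(write) << 32) | scratch;
}

// Try to claim the next write slot: succeeds only if neither index has
// changed since they were read together as one 64-bit value.
bool tryClaimSlot(uint32_t expectedWrite, uint32_t expectedScratch, uint32_t next)
{
    uint64_t expected = pack(expectedWrite, expectedScratch);
    uint64_t desired  = pack(expectedWrite, next); // bump scratch, keep write
    return packed.compare_exchange_strong(expected, desired);
}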
I need to find information about how the Unified Shader Array accesses GPU memory, so I can get an idea of how to use it effectively. The architecture diagram of my graphics card doesn't show this clearly.
I need to load a big image into GPU memory using C++ AMP and divide it into small pieces (like 4x4 pixels). Every piece should be computed by a different thread. I don't know how the threads share access to the image.
Is there any way of doing it so that the threads don't block each other while accessing the image? Maybe they have their own memory that can be accessed exclusively?
Or maybe access to the unified memory is so fast that I shouldn't care about it (though I doubt that)? This is really important, because I need to compute about 10k subsets for every image.
For C++ AMP you want to load the data that each thread within a tile uses into tile_static memory before starting your convolution calculation. Because each thread accesses pixels which are also read by other threads, this allows you to do a single read for each pixel from (slow) global memory and cache it in (fast) tile_static memory, so that all of the subsequent reads are faster.
You can see an example of tiling for convolution here. The DetectEdgeTiled method loads all the data that it requires and then calls idx.barrier.wait() to ensure all the threads have finished writing data into tile_static memory. Then it executes the edge-detection code, taking advantage of tile_static memory. There are many other examples of this pattern in the samples. Note that the loading code in DetectEdgeTiled is complex only because it must account for the additional halo pixels around the edge of the pixels being written in the current tile; it is essentially an unrolled loop, hence its length.
I'm not sure you are thinking about the problem in quite the right way. There are two levels of partitioning here. To calculate the new value for each pixel, the thread doing that work reads the block of surrounding pixels. In addition, blocks (tiles) of threads load larger blocks of pixel data into tile_static memory. Each thread in the tile then calculates the result for one pixel within the block.
void ApplyEdgeDetectionTiledHelper(const array<ArgbPackedPixel, 2>& srcFrame,
                                   array<ArgbPackedPixel, 2>& destFrame)
{
    tiled_extent<tileSize, tileSize> computeDomain = GetTiledExtent(srcFrame.extent);
    parallel_for_each(computeDomain.tile<tileSize, tileSize>(),
        [=, &srcFrame, &destFrame](tiled_index<tileSize, tileSize> idx) restrict(amp)
    {
        DetectEdgeTiled(idx, srcFrame, destFrame);
    });
}
void DetectEdgeTiled(
    tiled_index<tileSize, tileSize> idx,
    const array<ArgbPackedPixel, 2>& srcFrame,
    array<ArgbPackedPixel, 2>& destFrame) restrict(amp)
{
    const UINT shift = imageBorderWidth / 2;
    const UINT startHeight = 0;
    const UINT startWidth = 0;
    const UINT endHeight = srcFrame.extent[0];
    const UINT endWidth = srcFrame.extent[1];

    tile_static RgbPixel localSrc[tileSize + imageBorderWidth]
                                 [tileSize + imageBorderWidth];

    const UINT global_idxY = idx.global[0];
    const UINT global_idxX = idx.global[1];
    const UINT local_idxY = idx.local[0];
    const UINT local_idxX = idx.local[1];
    const UINT local_idx_tsY = local_idxY + shift;
    const UINT local_idx_tsX = local_idxX + shift;

    // Copy image data to tile_static memory. The if clauses are required to deal with threads that own a
    // pixel close to the edge of the tile and need to copy additional halo data.

    // This pixel
    index<2> gNew = index<2>(global_idxY, global_idxX);
    localSrc[local_idx_tsY][local_idx_tsX] = UnpackPixel(srcFrame[gNew]);

    // Left edge
    if (local_idxX < shift)
    {
        index<2> gNew = index<2>(global_idxY, global_idxX - shift);
        localSrc[local_idx_tsY][local_idx_tsX - shift] = UnpackPixel(srcFrame[gNew]);
    }

    // Right edge
    // Top edge
    // Bottom edge
    // Top Left corner
    // Bottom Left corner
    // Bottom Right corner
    // Top Right corner

    // Synchronize all threads so that none of them start calculation before
    // all data is copied onto the current tile.
    idx.barrier.wait();

    // Make sure that the thread is not referring to a border pixel
    // for which the filter cannot be applied.
    if ((global_idxY >= startHeight + 1 && global_idxY <= endHeight - 1) &&
        (global_idxX >= startWidth + 1 && global_idxX <= endWidth - 1))
    {
        RgbPixel result = Convolution(localSrc, index<2>(local_idx_tsY, local_idx_tsX));
        destFrame[index<2>(global_idxY, global_idxX)] = result;
    }
}
This code was taken from CodePlex and I stripped out a lot of the real implementation to make it clearer.
With regard to sharpneli's answer, you can use texture<> in C++ AMP to achieve the same result as OpenCL images. There is also an example of this on CodePlex.
In this particular case you do not have to worry. Just use OpenCL images. GPUs are extremely good at simply reading images (due to texturing hardware). However, this method requires writing the result into a separate image, because you cannot read from and write to the same image in a single kernel. You should use this if you can perform the computation as a single pass (no need to iterate).
Another way is to access the image as a normal memory buffer, load the parts needed by a wavefront (a group of threads running in sync) into local memory (which is blazingly fast), perform the computation, and write the complete end result back into unified memory afterwards. You should use this approach if you need to read and write values in the same image while computing. If you are not memory bound you can still read the original values from a texture, then iterate in local memory and write the end results into a separate image.
Reads from unified memory are slow only if the pointer is not const * restrict and multiple threads read the same location. In general, if consecutive thread ids read consecutive locations it is rather fast. However, if your threads both read from and write to unified memory, it is going to be slow.
I have a function that boils down to:
while(doWork)
{
    config = generateConfigurationForTesting();
    result = executeWork(config);
    doWork = isDone(result);
}
How can I rewrite this for efficient asynchronous execution, assuming all functions are thread-safe and independent of previous iterations, and that there will probably be more iterations than the maximum number of allowable threads?
The problem here is that we don't know in advance how many iterations are required, so we can't make a dispatch_group or use dispatch_apply.
This is my first attempt, but it looks a bit ugly to me because of the arbitrarily chosen values and the sleeping:
int thread_count = 0;
bool doWork = true;
int max_threads = 20; // arbitrarily chosen number

dispatch_queue_t queue =
    dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

while(doWork)
{
    if(thread_count < max_threads)
    {
        dispatch_async(queue, ^{
            Config myconfig = generateConfigurationForTesting();
            Result myresult = executeWork();
            dispatch_async(queue, checkResult(myresult));
        });
        thread_count++;
    }
    else
        usleep(100); // don't consume too much CPU
}

void checkResult(Result value)
{
    if(value == good) doWork = false;
    thread_count--;
}
Based on your description, it looks like generateConfigurationForTesting is some kind of randomization technique or otherwise a generator which can make a near-infinite number of configurations (hence your comment that you don't know ahead of time how many iterations you will need). With that as an assumption, you are basically stuck with the model that you've created, since your executor needs to be limited by some reasonable assumptions about the queue and you don't want to over-generate, as that would just extend the length of the run after you have succeeded in finding value == good measurements.
I would suggest you consider using a queue (or OSAtomicIncrement* and OSAtomicDecrement*) to protect access to thread_count and doWork. As it stands, the thread_count increment and decrement happen on two different queues (the main queue for the main thread and the default queue for the background task) and thus could increment and decrement the thread count simultaneously. This could lead to an undercount (which would cause more threads to be created than you expect) or an overcount (which would cause you to never complete your task).
Another option for making this look a little nicer would be to have checkResult add new elements into the queue if value != good. This way, you load up the initial elements of the queue using dispatch_apply(20, queue, ^{ ... }) and you don't need thread_count at all. The first 20 will be added using dispatch_apply (or whatever amount is appropriate for your configuration), and then each time checkResult is called you can either set doWork = false or add another operation to the queue.
dispatch_apply() works for this: just pass ncpu as the number of iterations (dispatch_apply never uses more than ncpu worker threads) and keep each instance of your worker block running for as long as there is more work to do (i.e. loop back to generateConfigurationForTesting() while doWork is still true).
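For concreteness, here is a rough sketch of that pattern (my own illustration, not the answerer's code). It uses the function-pointer variant dispatch_apply_f so it compiles as plain C++ without blocks; Config, Result, generateConfigurationForTesting, executeWork and isDone are the placeholder names from the question, and done, worker and runSearch are made up here.

#include <dispatch/dispatch.h>
#include <atomic>
#include <thread>

static std::atomic<bool> done{false};

// Each of the ncpu workers keeps generating and testing configurations
// until any worker reports that a good result has been found.
static void worker(void * /*context*/, size_t /*iteration*/)
{
    while (!done.load())
    {
        Config config = generateConfigurationForTesting();
        Result result = executeWork(config);
        if (isDone(result))
            done.store(true);
    }
}

void runSearch()
{
    size_t ncpu = std::thread::hardware_concurrency();
    if (ncpu == 0) ncpu = 1;

    dispatch_apply_f(ncpu,
                     dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                     nullptr,
                     worker);
    // dispatch_apply_f returns only after all ncpu workers have finished,
    // i.e. after one of them has found a good result.
}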