Why does a quadtree sometimes need a maximum number of items per node? - quadtree

I am working on a computational geometry problem that uses TriangularMeshQuadtree from a C# library. Some of its constructors are declared as follows (from metadata, so I cannot see the implementation details):
constructor 1:
// Summary:
// Constructor to use if you are going to store the objects in x/y space, and there
// is a smallest node size because you don't want the nodes to be smaller than a
// group of pixels.
//
// Parameters:
// xMax:
// eastern border of node coverage.
//
// xMin:
// western border of node coverage.
//
// yMax:
// northern border of node coverage.
//
// yMin:
// southern border of node coverage.
//
// maxItems:
// number of items to hold in a node before splitting itself into four branches and
// redistributing the items into them.
public TriangularMeshQuadtree(double xMax, double xMin, double yMax, double yMin, int maxItems);
constructor 2:
//
// Summary:
// Gets quad tree of a list of triangular surface in the plane with normal of dir
//
// Parameters:
// surfaces:
// A list of triangular surface
//
// dir:
// The normal of plane on which quad tree is projected
//
// maxItemNumber:
// The maximum number of items in each node of quad tree
//
// transformator:
// Coordinate transformator
//
// Returns:
// A quad tree
public static TriangularMeshQuadtree GetQuadTree(List<SubTSurf> surfaces, Vector3 dir, int maxItemNumber, out CoordinateTransformator transformator);
My understanding of a quadtree is that it divides a set of points recursively into 4 sections until every point is unique in one section. I don't understand the definition of maxItems in the above code and how it works with a quadtree.

Your understanding "... until every point is unique in one section" is not quite correct. It describes a very special kind of quadtree that is usually used as an example to explain the concept.
In general, a quadtree can hold many more items per node. This is often done to reduce the number of nodes (if we have more entries per node, we need fewer nodes). The benefits of reducing the node count are:
Reduced memory usage (every node adds some memory overhead)
Usually faster search (every node adds an indirection, which is slow), i.e. a very 'deep' tree is slow to traverse.
maxItems should not be too large, because inside a node the points are usually stored in a linear list. A linear list obviously requires linear search, which slows things down if the list grows too large. In my experience, sensible values for maxItems are between 10 and 100.
Another parameter that is often given is maxDepth. This parameter limits the depth of the tree, which is equal to the number of parents of a given node. The idea is that a bad dataset can result in a tree that is very 'deep', which makes it expensive to traverse. Instead, if a node is at depth == maxDepth, it is prevented from splitting, even if it exceeds maxItems entries.
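To make this concrete, here is a minimal sketch of the usual insert-and-split logic. The real TriangularMeshQuadtree implementation is hidden behind metadata, so every name below is invented; it only illustrates the roles of maxItems and maxDepth:
using System.Collections.Generic;

class QuadtreeNode<T>
{
    private readonly List<(double X, double Y, T Item)> _items = new List<(double X, double Y, T Item)>();
    private QuadtreeNode<T>[] _children;          // null while this node is a leaf
    private readonly double _xMin, _xMax, _yMin, _yMax;
    private readonly int _maxItems, _depth, _maxDepth;

    public QuadtreeNode(double xMin, double xMax, double yMin, double yMax,
                        int maxItems, int depth, int maxDepth)
    {
        _xMin = xMin; _xMax = xMax; _yMin = yMin; _yMax = yMax;
        _maxItems = maxItems; _depth = depth; _maxDepth = maxDepth;
    }

    public void Insert(double x, double y, T item)
    {
        if (_children != null)                    // already split: delegate to a quadrant
        {
            Child(x, y).Insert(x, y, item);
            return;
        }
        _items.Add((x, y, item));
        // Split only when the node overflows AND the depth limit still allows it.
        if (_items.Count > _maxItems && _depth < _maxDepth)
            Split();
    }

    private void Split()
    {
        double xMid = (_xMin + _xMax) / 2, yMid = (_yMin + _yMax) / 2;
        _children = new[]
        {
            new QuadtreeNode<T>(_xMin, xMid, _yMin, yMid, _maxItems, _depth + 1, _maxDepth),
            new QuadtreeNode<T>(xMid, _xMax, _yMin, yMid, _maxItems, _depth + 1, _maxDepth),
            new QuadtreeNode<T>(_xMin, xMid, yMid, _yMax, _maxItems, _depth + 1, _maxDepth),
            new QuadtreeNode<T>(xMid, _xMax, yMid, _yMax, _maxItems, _depth + 1, _maxDepth),
        };
        foreach (var (x, y, item) in _items)      // redistribute the buffered items
            Child(x, y).Insert(x, y, item);
        _items.Clear();
    }

    private QuadtreeNode<T> Child(double x, double y)
    {
        int index = (x < (_xMin + _xMax) / 2 ? 0 : 1)
                  + (y < (_yMin + _yMax) / 2 ? 0 : 2);
        return _children[index];
    }
}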
Having said all the above, there are useful real-world quadtree-type structures that allow at most one entry per quadrant. One example is the PH-Tree (disclaimer: self-advertisement). It uses other techniques to limit the depth to 32 or 64. It takes a while to explain and it wasn't part of the question, so I just reference the documentation here.

Related

OpenCL: What if I have more tasks than available work items?

Let's take an example:
I want a vector dot product computed concurrently (it's not my case, this is only an example), so I have 2 large input vectors and a large output vector of the same size. The available work items are fewer than the sizes of these vectors. How can I compute this dot product in OpenCL if the work items are fewer than the size of the vectors? Is this possible? Or do I just have to use some tricks?
Something like:
for (i = 0; i < n; i++) {
    output[i] = input1[i] * input2[i];
}
with n > available work items
If by "available work items" you mean you're running into the maximum given by CL_DEVICE_MAX_WORK_ITEM_SIZES, you can always enqueue your kernel multiple times for different ranges of the array.
Depending on your actual workload, it may be more sensible to make each work item perform more work though. In the simplest case, you can use the SIMD types such as float4, float8, float16, etc. and operate on large chunks like that in one go. As always though, there is no replacement for trying different approaches and measuring the performance of each.
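In the simplest case that might look like the following kernel sketch, where each work item multiplies 4 elements at once (this assumes the element count is a multiple of 4; a real kernel needs a scalar tail loop for the remainder):
__kernel void mul4(__global const float4 *input1,
                   __global const float4 *input2,
                   __global float4 *output)
{
    size_t i = get_global_id(0);
    output[i] = input1[i] * input2[i];  /* component-wise multiply of 4 floats */
}
You would then launch it with n/4 work items instead of n.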
Divide and conquer the data. If you keep the workgroup size an integer divisor of the global work size, then you have N workgroups in total, of which perhaps k can run at once per kernel launch. So you can launch N/k kernels, each with k*workgroup_size work items and proper addressing of the buffers inside the kernels.
Once you have per-workgroup partial sums of the partial dot products (computed with multiple in-group reduction steps), you can simply sum them on the CPU or on whichever device the data is going to.
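A sketch of such an in-group reduction (this assumes a power-of-two workgroup size, a global size equal to the element count, and a __local scratch buffer sized to the workgroup via clSetKernelArg):
__kernel void dot_partial(__global const float *a,
                          __global const float *b,
                          __global float *partial,  /* one slot per workgroup */
                          __local float *scratch)   /* one slot per work item */
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    scratch[lid] = a[gid] * b[gid];
    barrier(CLK_LOCAL_MEM_FENCE);
    /* tree reduction inside the workgroup */
    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];  /* host sums these afterwards */
}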

Version 3 AudioUnits: minimum frameCount in internalRenderBlock

The example code for creating a version 3 AudioUnit demonstrates how the implementation needs to return a function block for render processing. The block both gets samples from the previous
AudioUnit in the chain via pullInputBlock and supplies the output buffers with the processed samples. It also must provide some output buffers if the unit further down the chain did not. Here is an excerpt of code from an AudioUnit subclass:
- (AUInternalRenderBlock)internalRenderBlock {
    /*
     Capture in locals to avoid ObjC member lookups.
     */
    // Specify captured objects are mutable.
    __block FilterDSPKernel *state = &_kernel;
    __block BufferedInputBus *input = &_inputBus;
    return Block_copy(^AUAudioUnitStatus(
        AudioUnitRenderActionFlags *actionFlags,
        const AudioTimeStamp *timestamp,
        AVAudioFrameCount frameCount,
        NSInteger outputBusNumber,
        AudioBufferList *outputData,
        const AURenderEvent *realtimeEventListHead,
        AURenderPullInputBlock pullInputBlock) {
        ...
    });
}
This is fine if the processing does not require knowing the frameCount before the call to this block, but many applications do require knowing the frameCount beforehand in order to allocate memory, prepare processing parameters, etc.
One way around this would be to accumulate past buffers of output, outputting only frameCount samples on each call to the block, but this only works if there is a known minimum frameCount. The processing must be initialized with a size greater than this frame count in order to work. Is there a way to specify or obtain a minimum value for frameCount, or to force it to be a specific value?
The example code is taken from: https://github.com/WildDylan/appleSample/blob/master/AudioUnitV3ExampleABasicAudioUnitExtensionandHostImplementation/FilterDemoFramework/FilterDemo.mm
Under iOS, an audio unit callback must be able to handle variable frameCounts. You can't force it to be a constant.
Thus any processing that requires a fixed size buffer should be done outside the audio unit callback. You can pass data to a processing thread by using a lock-free circular buffer/fifo or similar structure that does not require memory management in the callback.
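As an illustration, a minimal single-producer/single-consumer ring buffer along those lines might look like the sketch below (names invented; a production version also needs the matching pop on the processing thread). Nothing in the push path locks or allocates, so it is safe to call from the render block:
#include <stdatomic.h>
#include <stddef.h>

#define RB_CAPACITY 4096  /* frames; must be a power of two for the masking below */

typedef struct {
    float buffer[RB_CAPACITY];
    _Atomic size_t head;  /* advanced only by the producer (audio thread)      */
    _Atomic size_t tail;  /* advanced only by the consumer (processing thread) */
} RingBuffer;

/* Returns how many samples were actually stored; drops the rest if full. */
static size_t rb_push(RingBuffer *rb, const float *src, size_t n)
{
    size_t head = atomic_load_explicit(&rb->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&rb->tail, memory_order_acquire);
    size_t space = RB_CAPACITY - (head - tail);
    if (n > space)
        n = space;
    for (size_t i = 0; i < n; i++)
        rb->buffer[(head + i) & (RB_CAPACITY - 1)] = src[i];
    atomic_store_explicit(&rb->head, head + n, memory_order_release);
    return n;
}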
You can suggest that the frameCount be a certain size by setting a preferred buffer duration using the AVAudioSession API. But the OS is free to ignore this, depending on other audio needs in the system (power-saving modes, system sounds, etc.). In my experience, the audio driver will only increase your suggested size, not decrease it (beyond a couple of samples when resampling to a non-power-of-two rate).
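For example, the hint can be given roughly like this (a sketch; read the granted value back instead of assuming the request was honored):
#import <AVFoundation/AVFoundation.h>

NSError *error = nil;
AVAudioSession *session = [AVAudioSession sharedInstance];
// Ask for roughly 512-frame buffers at 44.1 kHz (~11.6 ms).
[session setPreferredIOBufferDuration:512.0 / 44100.0 error:&error];
// What the system actually granted:
NSTimeInterval granted = session.IOBufferDuration;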

Metal - Threads and ThreadGroups

I am learning Metal right now and trying to understand the lines below:
let threadGroupCount = MTLSizeMake(8, 8, 1) ///line 1
let threadGroups = MTLSizeMake(drawable.texture.width / threadGroupCount.width, drawable.texture.height / threadGroupCount.height, 1) ///line 2
command_encoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadGroupCount) ///line 3
For line 1, what do the 3 integers represent? My guess is that they assign the number of threads to be used in the process, but which is which?
What is the difference between line 1 and line 2? My guess again is that it's the difference between threads and thread groups. But I am not sure what the fundamental difference is and when to use which.
When dispatching a grid of work items to a compute kernel, it is your responsibility to divide up the grid into subsets called threadgroups, each of which has a total number of threads (width * height * depth) that is no greater than the maxTotalThreadsPerThreadgroup of the corresponding compute pipeline state.
The threadsPerThreadgroup size indicates the "shape" of each subset of the grid (i.e. the number of threads in each grid dimension). The threadgroupsPerGrid parameter indicates how many threadgroups make up the entire grid. As in your code, this is often the dimensions of a texture divided by the dimensions of the threadgroup size you've chosen.
One performance note: each compute pipeline state has a threadExecutionWidth value that indicates how many threads of a threadgroup will be scheduled and executed together by the GPU. The optimal threadgroup size will thus always be a multiple of threadExecutionWidth. During development, it's perfectly acceptable to just dispatch a small square grid as you're currently doing.
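For example, a common sizing pattern built from those two pipeline properties looks roughly like this (pipeline, commandEncoder, and texture are assumed to already exist):
let w = pipeline.threadExecutionWidth
let h = pipeline.maxTotalThreadsPerThreadgroup / w
let threadsPerThreadgroup = MTLSizeMake(w, h, 1)
let threadgroupsPerGrid = MTLSizeMake(
    (texture.width + w - 1) / w,   // round up so the grid covers the whole texture
    (texture.height + h - 1) / h,
    1)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid,
                                    threadsPerThreadgroup: threadsPerThreadgroup)
Note that rounding up means edge threadgroups may extend past the texture, so the kernel should bounds-check its coordinates.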
The first line gives you the number of threads per group (in this case a two-dimensional 8x8 group), while the second line gives you the number of groups per grid. The dispatchThreadgroups(_:threadsPerThreadgroup:) call on the third line then takes both of these numbers.

Uniform Cost Graph Search opening too many nodes

I'm working on an assignment from an archived AI course from 2014.
The parameter "problem" refers to an object that has different cost functions chosen at run (sometimes it is 1 cost per move; sometimes moves are more expensive depending on which side of the pacman board the moves are done on).
As written below, I get the right behavior but I open more search nodes than expected (about 2x what the assignment expects).
If I make the cost variable negative, I get the right behavior in the 1-unit-cost case AND open a really low number of nodes. But the behavior is the opposite in the cases where moves cost more on a given side of the board.
So basically the question is: does it seem like I'm opening any nodes unnecessarily in the context of a uniform cost search?
def uniformCostSearch(problem):
    """Search the node of least total cost first."""
    def UCS(problem, start):
        q = util.PriorityQueue()
        for node in problem.getSuccessors(start):  # Comes as a tuple ((x, y), 'North', 1)
            q.push([node], node[2])  # Push nodes onto the queue one at a time (so they are paths)
        while not q.isEmpty():
            pathToCheck = q.pop()  # Pops off the lowest-priority path on the queue
            # if pathToCheck in q.heap:
            #     continue
            lastNode = pathToCheck[-1][0]  # Gets the coordinates of the last node in that path
            if problem.isGoalState(lastNode):  # Checks whether those coordinates are the goal
                return pathToCheck  # If goal, returns the path that was checked
            else:  # Else, get the successors of that node and put them on the queue
                for successor in problem.getSuccessors(lastNode):  # Iterates over the successors of the popped path's last node
                    nodesVisited = [edge[0] for edge in pathToCheck]  # Looks at all the edges (node plus its next legal move and cost) and grabs just the coordinates visited
                    if successor[0] not in nodesVisited:  # Checks that the successor is not already in THIS particular path (to avoid cycles)
                        newPath = pathToCheck + [successor]  # Extends the growing path with the new successor
                        cost = problem.getCostOfActions([x[1] for x in newPath])
                        q.update(newPath, cost)  # Pushes the validated new path onto the queue for retrieval later
        return None

    start = problem.getStartState()  # Starting point
    successorList = UCS(problem, start)
    directions = []
    for i in range(len(successorList)):
        directions += successorList[i]
    return directions[1::3]
I figured it out.
Basically, while I was checking that I don't revisit nodes within a given path, I wasn't checking whether I visit nodes that appear in other paths on the queue. I can check that by adding a nodesVisited list that records every node ever expanded and checking THAT for duplicate visits.
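A sketch of that fix, assuming the same problem and util.PriorityQueue API as in the question (the dummy start entry is my own device, so every path entry has the same shape):
def uniformCostSearch(problem):
    """UCS with a global closed set, so a node expanded once is never re-expanded."""
    q = util.PriorityQueue()
    q.push([(problem.getStartState(), None, 0)], 0)  # seed with a dummy start step
    nodesVisited = set()  # every coordinate ever expanded, across ALL paths
    while not q.isEmpty():
        pathToCheck = q.pop()
        lastNode = pathToCheck[-1][0]
        if lastNode in nodesVisited:  # already expanded via a cheaper path: skip
            continue
        nodesVisited.add(lastNode)
        if problem.isGoalState(lastNode):
            return [step[1] for step in pathToCheck[1:]]  # actions only, minus the dummy
        for successor in problem.getSuccessors(lastNode):
            if successor[0] not in nodesVisited:
                newPath = pathToCheck + [successor]
                cost = problem.getCostOfActions([x[1] for x in newPath[1:]])
                q.update(newPath, cost)
    return None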

Unexpected memory usage of List<T>

I always thought that the default constructor for List<T> would initialize a list with a capacity of 4, and that the capacity would be doubled when adding the 5th element, etc...
In my application I create a lot of lists (a tree-like structure where each node can have many children). Some of these nodes won't have any children, and since my application was fast but also using rather a lot of memory, I decided to use the constructor where I can specify the capacity, and set it to 1.
The strange thing now is that the memory usage when I start with a capacity of 1 is about 15% higher than when I use the default constructor. It can't be because of a better fit with 4, since the doubling would go 1, 2, 4. So why this extra increase in memory usage? As an extra test I tried starting with a capacity of 4. Again the memory usage was 15% higher than with no specified capacity.
Now this really isn't a problem, but it bothers me that a pretty simple data structure that I've used for years has some extra logic I didn't know about. Does anyone have an idea of the inner workings of List<T> in this respect?
That's because if you use the default constructor, the internal storage array is set to an empty array, but if you use the constructor with a given capacity, an array of that size is allocated immediately instead of being created on the first call to Add. Since that empty array is a single shared instance, a list that never receives any items costs no array allocation at all, whereas new List<T>(1) allocates a real one-element array per list; that per-array object overhead is the most likely source of your extra 15%.
You can see this using a decompiler like JustDecompile:
public List(int capacity)
{
    if (capacity < 0)
    {
        ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.capacity, ExceptionResource.ArgumentOutOfRange_NeedNonNegNum);
    }
    this._items = new T[capacity];
}

public List()
{
    this._items = List<T>._emptyArray;
}
If you look at the Add function, it calls EnsureCapacity, which enlarges the internal storage array if required. Obviously, if the array is initially the empty array, the first Add will create an array of the default size.
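A quick way to observe this behavior (the exact numbers below are what the .NET runtimes I'm aware of print; other versions may differ slightly):
using System;
using System.Collections.Generic;

class CapacityDemo
{
    static void Main()
    {
        var list = new List<int>();
        Console.WriteLine(list.Capacity);      // 0: backed by the shared empty array
        list.Add(1);
        Console.WriteLine(list.Capacity);      // 4: default size allocated on first Add
        for (int i = 0; i < 4; i++) list.Add(i);
        Console.WriteLine(list.Capacity);      // 8: capacity doubles once 4 is exceeded

        var preSized = new List<int>(1);
        Console.WriteLine(preSized.Capacity);  // 1: a real one-element array per list
    }
}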
