Dynamic hashmap for octree - hashmap

Is there a memory-efficient AND fast way to store unique key:value pairs dynamically in a hash map? Keys are guaranteed to be unique, but the number of them changes frequently. Insertion and deletion have to be fast.
What I have is an octree (not linear/full) containing a signed distance field. The octree is updated often. I'd like to try making it pointerless to save some space.

I'm not sure about using a hashmap, as dynamic structures tend to either lose the O(1) lookup that a hashmap benefits from or allocate extra memory and require reallocation when that memory is used up. You can, however, create an octree where each node only has one pointer.
For example, in C++ one might do something like:
#include <cstdint>
struct node{
    node* next;          // pointer to an array of 8 children, or null
    std::uint8_t shape;  // 8-bit flag indicating which children actually exist
    node():next(nullptr),shape(0){}
};
void initChildren(node& parent,std::uint8_t state=0){
    if(!parent.next){    // only allocate the array if it has not yet been allocated
        parent.next=new node[8];  // the children are allocated in consecutive memory
    }
    parent.shape=state;  // record which of the 8 children are in use
}
Deleting a child node only requires setting a bit in the shape field to 0. I hope this helps you, even if you did ask a while ago.
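If you still want to try the hash-map route from the question, one common way to go pointerless is to key each node by a location code: the root gets code 1, and each child appends its 3-bit octant number to the parent's code. The following is only a minimal sketch of that idea in Rust; the names (HashedOctree, child_code, parent_code, depth) are made up for illustration, and it assumes one f32 distance per node and at most 21 levels so that a code fits in a u64.

```rust
use std::collections::HashMap;

/// Location code of the root. The leading 1 bit marks where the code starts.
const ROOT: u64 = 1;

/// Code of child `octant` (0..8) of the node with code `parent`.
fn child_code(parent: u64, octant: u8) -> u64 {
    debug_assert!(octant < 8);
    (parent << 3) | octant as u64
}

/// Code of the parent node (not meaningful for the root).
fn parent_code(code: u64) -> u64 {
    code >> 3
}

/// Depth of a node: the number of 3-bit groups after the leading marker bit.
fn depth(code: u64) -> u32 {
    debug_assert!(code != 0);
    (63 - code.leading_zeros()) / 3
}

/// Hypothetical pointerless octree: location code -> signed distance.
struct HashedOctree {
    nodes: HashMap<u64, f32>,
}

impl HashedOctree {
    fn new() -> Self {
        HashedOctree { nodes: HashMap::new() }
    }

    fn insert(&mut self, code: u64, distance: f32) {
        self.nodes.insert(code, distance);
    }

    fn remove(&mut self, code: u64) -> Option<f32> {
        self.nodes.remove(&code)
    }

    fn get(&self, code: u64) -> Option<f32> {
        self.nodes.get(&code).copied()
    }
}
```

Parent and child lookups become arithmetic on the key instead of pointer chasing. The trade-off is the one mentioned above: the table occasionally has to grow and rehash, so insertion is only amortized O(1).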

Related

What is uninitialized memory and why isn't it initialized when allocating?

Take this signature of a method of the GlobalAlloc trait:
unsafe fn alloc(&self, layout: Layout) -> *mut u8
and this sentence from the method's documentation:
The allocated block of memory may or may not be initialized.
Suppose that we are going to allocate a chunk of memory for an [i32; 10]. Since an i32 is 4 bytes, our example array needs 40 bytes of storage.
Now, the allocator has found a spot in memory that fits our requirements, some 40-byte region. But what's there? I always read the term garbage data, and assume that it's just old data already stored there by another process, program... etc.
What's uninitialized memory? Just data that is not initialized with zeros or with some default value for the type that we want to store there?
Why isn't memory always initialized before the pointer is returned? Is it too costly? But the memory must be initialized in order to use it properly and not cause UB. Why then doesn't it come already initialized?
When some resource is deallocated, nothing must be pointing to that freed memory. Does that place get zeroed? What really happens when you deallocate a piece of memory?
What's uninitialized memory? Just data that is not initialized with zeros or with some default value for the type that we want to store there?
It's worse than either of those. Reading from uninitialized memory is undefined behavior, as in you can no longer reason about a program which does so. Practically, compilers often optimize assuming that code paths that would trigger undefined behavior are never executed and their code can be removed. Or not, depending on how aggressive the compiler is.
If you could reliably read from the pointer, it would contain arbitrary data. It may be zeroes, it may be old data structures, it may be parts of old data structures. It may even be things like passwords and encryption keys, which is another reason why reading uninitialized memory is problematic.
Why isn't memory always initialized before the pointer is returned? Is it too costly? But the memory must be initialized in order to use it properly and not cause UB. Why then doesn't it come already initialized?
Yes, cost is the issue. The first thing that is typically done after allocating a piece of memory is to write to it. Having the allocator "pre-initialize" the memory is wasteful when the caller is going to overwrite it anyway with the values it wants. This is especially significant with large buffers used for IO or other large storage.
When some resource is deallocated, nothing must be pointing to that freed memory. Does that place get zeroed? What really happens when you deallocate a piece of memory?
It's up to how the memory allocator is implemented. Most don't waste processing power to clear the data that's been deallocated, since it will be overwritten anyway when it's reallocated. Some allocators may write some bookkeeping data to the freed space. GlobalAlloc is an interface to whatever allocator the system comes with, so it can vary depending on the environment.
I always read the term garbage data, and assume that it's just old data already stored there by another process, program... etc.
Worth noting: all modern desktop OSs have memory isolation between processes - your program cannot access the memory of other processes or the kernel (unless you explicitly share it via specialized functionality). The kernel will clear memory before it assigns it to your process, to prevent leaking sensitive data. But you can see old data from your own process, for the reasons described above.
What you are asking about are implementation details that can even vary from run to run. From the perspective of the abstract machine, and thus the optimizer, they don't matter.
Turning contents of uninitialized memory into almost any type (other than MaybeUninit) is immediate undefined behavior.
let mem: *mut u8 = unsafe { alloc(...) };
let x: u8 = unsafe { ptr::read(mem) }; // UB: reads uninitialized memory
if x != x {
    print!("wtf");
}
This may or may not print, crash, or delete the contents of your hard drive, possibly even before reaching that alloc call, because the optimizer may work backwards and eliminate the entire code block once it can prove that all execution paths are UB.
This may happen due to assumptions the optimizer relies on, i.e. even when the underlying allocator is well-behaved. But real systems may also behave non-deterministically. E.g. theoretically on a freshly booted embedded system memory might be in an uninitialized state that doesn't reliably return 0 or 1. Or on Linux madvise(MADV_FREE) can cause allocations to return inconsistent results over time until initialized.
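For contrast, here is a minimal sketch of the well-defined order of operations for the [i32; 10] example from the question: allocate, write every element, and only then read. It uses the free functions std::alloc::alloc and std::alloc::dealloc rather than a custom GlobalAlloc implementation, and the function name sum_of_fresh_array is made up for illustration.

```rust
use std::alloc::{alloc, dealloc, handle_alloc_error, Layout};

/// Allocates room for ten i32 values, initializes them, and sums them.
fn sum_of_fresh_array() -> i32 {
    // 10 * 4 bytes = 40 bytes, aligned for i32.
    let layout = Layout::new::<[i32; 10]>();
    unsafe {
        let raw = alloc(layout) as *mut i32;
        if raw.is_null() {
            handle_alloc_error(layout);
        }
        // Write before any read: after this loop all 40 bytes are initialized.
        for i in 0..10 {
            raw.add(i).write(i as i32);
        }
        // Reading is fine now, because every element has been written.
        let sum: i32 = std::slice::from_raw_parts(raw, 10).iter().sum();
        // Nothing is zeroed here; the allocator just takes the block back.
        dealloc(raw as *mut u8, layout);
        sum
    }
}
```

The allocator never has to pre-fill the block, because the caller's own writes are the initialization. That is the cost argument from above in code form.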

How many ABA tag bits are needed in lock-free data structures?

One popular solution to the ABA problem in lock-free data structures is to tag pointers with an additional monotonically incrementing tag.
struct aba {
void *ptr;
uint32_t tag;
};
However, this approach has a problem. It is really slow and has huge cache problems. I get roughly a 2x speed-up if I ditch the tag field. But is that unsafe?
So my next attempt, for 64-bit platforms, stuffs the tag bits into the ptr field.
struct aba {
uintptr_t __ptr;
};
uint32_t get_tag(struct aba aba) { return aba.__ptr >> 48U; }
But someone told me that only 16 bits for the tag is unsafe. My new plan is to use the alignment of pointers to cache lines to stuff more tag bits in, but I want to know if that'll work.
If that fails, my next plan is to use Linux's MAP_32BIT mmap flag to allocate data, so that I only need 32 bits of pointer space.
How many bits do I need for the ABA tag in lock-free data-structures?
The amount of tag bits that is practically safe can be estimated based on the preemption time and the frequency of pointer modifications.
As a reminder, the ABA problem happens when a thread reads the value it wants to change with compare-and-swap, gets preempted, and when it resumes the actual value of the pointer happens to be equal to what the thread read before. Therefore the compare-and-swap operation may succeed despite data structure modifications possibly done by other threads during the preemption time.
The idea of adding the monotonically incremented tag is to make each modification of the pointer unique. For it to succeed, increments must produce unique tag values during the time when a modifying thread might be preempted; i.e. for guaranteed correctness the tag may not wraparound during the whole preemption time.
Let's assume that preemption lasts a single OS scheduling time slice, which is typically tens to hundreds of milliseconds. The latency of CAS on modern systems is tens to hundreds of nanoseconds. So a rough worst-case estimate is that there might be millions of pointer modifications while a thread is preempted (for example, 100 ms / 10 ns = 10^7, which takes 24 bits to count), and so there should be 20+ bits in the tag in order for it not to wrap around.
In practice it may be possible to make a better estimate for a particular real use case, based on the known frequency of CAS operations. One also needs to estimate the worst-case preemption time more accurately; for example, a low-priority thread preempted by a higher-priority job might end up with a much longer preemption time.
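The estimate above is just a division followed by a bit count, so it is easy to redo for your own numbers. A small sketch, where the function name and the nanosecond parameters are assumptions chosen for illustration:

```rust
/// Smallest tag width (in bits) that cannot wrap around if the pointer is
/// modified once every `cas_ns` nanoseconds during a preemption that lasts
/// `preemption_ns` nanoseconds.
fn tag_bits_needed(preemption_ns: u64, cas_ns: u64) -> u32 {
    // Worst-case number of modifications while the thread is not running.
    let modifications = preemption_ns / cas_ns.max(1);
    // A b-bit tag only repeats after 2^b increments, so we need
    // 2^b > modifications, i.e. b = bit length of `modifications`.
    u64::BITS - modifications.leading_zeros()
}
```

With a 100 ms preemption and a 10 ns CAS this gives 24 bits; with 10 ms and 100 ns it gives 17. Plug in a longer preemption time if low-priority threads can be starved.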
According to the paper
http://web.cecs.pdx.edu/~walpole/class/cs510/papers/11.pdf
Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects (IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 6, June 2004, p. 491) by Maged M. Michael,
the tag should have enough bits to make wraparound impossible in real lock-free scenarios (my reading: if N threads may be running and each may access the structure, you need at least N+1 distinct tag states):
6.1.1 IBM ABA-Prevention Tags
The earliest and simplest lock-free method for node reuse is the tag (update counter) method introduced with the documentation of CAS on the IBM System 370 [11]. It requires associating a tag with each location that is the target of ABA-prone comparison operations. By incrementing the tag when the value of the associated location is written, comparison operations (e.g., CAS) can determine if the location was written since it was last accessed by the same thread, thus preventing the ABA problem. The method requires that the tag contains enough bits to make full wraparound impossible during the execution of any single lock-free attempt. This method is very efficient and allows the immediate reuse of retired nodes.
Depending on your data structure you could be able to steal some extra bits from the pointers. For example if the objects are 64 bytes and always aligned on 64 byte boundaries, the lower 6 bits of each pointer could be used for the tags (but that's probably what you already suggested for your new plan).
Another option would be to use an index into your objects instead of pointers.
In case of contiguous objects that would of course simply be an index into an array or vector. In case of lists or trees with objects allocated on the heap, you could use a custom allocator and use an index into your allocated block(s).
For up to about 16.7M (2^24) objects you would only need 24 bits for the index, leaving 40 bits for the tag.
This would need some (small and fast) extra calculation to get the address, but if the alignment is a power of 2 only a shift and an addition are needed.
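A minimal sketch of the index-plus-tag layout, assuming a 24-bit index and a 40-bit tag packed into one 64-bit word that is updated with a single CAS. All names here (pack, index_of, tag_of, try_update) are made up for illustration, and the sketch only shows the head word, not a full stack or queue.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const INDEX_BITS: u32 = 24;
const INDEX_MASK: u64 = (1 << INDEX_BITS) - 1;
const TAG_MASK: u64 = (1 << (64 - INDEX_BITS)) - 1; // 40 tag bits

/// Packs an object index (low 24 bits) and a tag (high 40 bits) into one word.
fn pack(index: u32, tag: u64) -> u64 {
    debug_assert!(index as u64 <= INDEX_MASK);
    ((tag & TAG_MASK) << INDEX_BITS) | index as u64
}

fn index_of(word: u64) -> u32 {
    (word & INDEX_MASK) as u32
}

fn tag_of(word: u64) -> u64 {
    word >> INDEX_BITS
}

/// Tries to swing `head` from the previously read word `seen` to `new_index`,
/// incrementing the tag so that every successful update produces a new word.
fn try_update(head: &AtomicU64, seen: u64, new_index: u32) -> bool {
    let next_tag = tag_of(seen).wrapping_add(1) & TAG_MASK;
    head.compare_exchange(
        seen,
        pack(new_index, next_tag),
        Ordering::AcqRel,
        Ordering::Acquire,
    )
    .is_ok()
}
```

If the head goes from index 3 to 7 and back to 3, a thread still holding the original word fails its CAS, because the tag is now 2 instead of 0. Turning an index back into an address is then base + index * object_size, or a shift when the size is a power of two.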

Can different threads write to different sections of the same Vec? [duplicate]

I have 10 threads and a Vec of length 100.
Can I have thread 0 work on elements 0-9 (sort them, for example), while thread 1 is working on elements 10-19, etc.?
Or do I have to use a Vec<Vec<>> for this? (Which I would rather avoid, because the elements would no longer be contiguous in memory)
Yes, you can. You asked about the mutable case, but I'll preface by saying that if the Vec is read only (e.g. for a reduction) you can safely send an immutable reference to the specific slice you want in each thread. You can do this by simply using something like &my_vec[idx1..idx2] in a loop.
For the mutable case it's a bit trickier, since the borrow checker is not sophisticated enough to see that index-based borrows of a Vec don't overlap. However, there are a number of methods, notably split_at_mut, that you can call to get these subslices. By far the easiest is the chunks_mut iterator. (Note that there is a matching chunks iterator for the immutable case, so you only need to make minor changes when writing either case.)
Be aware that the chunks and chunks_mut functions take the size of each chunk, not the number of chunks. However, deriving one from the other is fairly straightforward.
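Here is a minimal sketch of the mutable case using chunks_mut together with scoped threads (std::thread::scope, available since Rust 1.63). The function name sort_in_chunks is made up for illustration; the chunk size is derived from the desired number of threads by rounding up.

```rust
use std::thread;

/// Sorts each contiguous chunk of `data` on its own thread.
fn sort_in_chunks(data: &mut [i32], num_threads: usize) {
    if data.is_empty() || num_threads == 0 {
        return;
    }
    // chunks_mut takes the size of each chunk, so derive it from the
    // number of threads, rounding up so that no elements are left over.
    let chunk_size = (data.len() + num_threads - 1) / num_threads;
    thread::scope(|s| {
        // Each chunk is a disjoint &mut [i32], so every thread gets
        // exclusive access to its own part of the same buffer.
        for chunk in data.chunks_mut(chunk_size) {
            s.spawn(move || chunk.sort());
        }
        // All spawned threads are joined when the scope ends.
    });
}
```

With a Vec of length 100 and 10 threads, thread 0 sorts elements 0-9, thread 1 sorts 10-19, and so on, and the data stays contiguous in a single Vec.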
I would like to give a few words of caution with the mutable case, however. If you split the data evenly you may get abysmal performance. The reason is that the CPU doesn't work on individual addresses; instead it works on blocks of memory known as cache lines, which are typically 64 bytes long. If multiple threads write to a single cache line, the line has to bounce between cores through slower levels of the memory hierarchy in order to keep the threads consistent.
Unfortunately, in safe Rust there's no easy way to determine where on a cache line a Vec's buffer starts (the buffer may have been allocated in the middle of a cache line); most of the methods I know of to detect this involve twiddling with the lower bits of the actual pointer address. The easiest way to handle this is to simply add a 64-byte pad of nonsense data between each chunk you want to use. So, for instance, if you have a Vec containing 1000 32-bit floats and 10 threads, you simply add 16 floats with a dummy value (since 32 bits = 4 bytes, 16*4 = 64 = 1 cache line) between each 100 of your "real" floats and ignore the dummies during computation.
This is known as false sharing, and I encourage you to look up other references to learn other methods of dealing with this.
Note that 64 bytes is the usual line size on current x86 processors. If you're compiling for ARM, PowerPC, MIPS, or something else, this value can and will vary.
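The padding scheme described above (100 real floats followed by 16 dummy floats per thread) can be sketched like this. The constants and the function name padded_double are assumptions for illustration, the per-element work is just a doubling, and the 64-byte line size is assumed rather than detected.

```rust
use std::thread;

const REAL: usize = 100; // real floats handled by each thread
const PAD: usize = 16; // 16 * 4 bytes = 64 bytes of dummy floats

/// Doubles every element of `input`, giving each thread a block of REAL
/// floats that is separated from its neighbors by PAD dummy floats.
fn padded_double(input: &[f32], threads: usize) -> Vec<f32> {
    assert_eq!(input.len(), threads * REAL);
    let stride = REAL + PAD;

    // Copy the real data into a padded buffer; the dummies stay at 0.0.
    let mut buf = vec![0.0f32; threads * stride];
    for (t, chunk) in input.chunks(REAL).enumerate() {
        buf[t * stride..t * stride + REAL].copy_from_slice(chunk);
    }

    thread::scope(|s| {
        for chunk in buf.chunks_mut(stride) {
            s.spawn(move || {
                // Touch only the real floats and ignore the padding.
                for x in &mut chunk[..REAL] {
                    *x *= 2.0;
                }
            });
        }
    });

    // Gather the real floats back into a compact Vec.
    buf.chunks(stride)
        .flat_map(|c| c[..REAL].iter().copied())
        .collect()
}
```

Because neighboring blocks of real data are at least 64 bytes apart, no two threads write to the same 64-byte line, wherever the buffer happens to start. Whether the extra copying pays off depends on how much work each thread does per element, so measure before committing to it.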

Stop and copy garbage collector in two phases

When implementing a stop and copy garbage collector as a pair, I need two memory banks (the old one and a free new one). One memory bank consists of the-cars and the-cdrs. So basically, when I make a new address, it is a pointer to the-cars and the-cdrs.
When allocating new memory and I see that I don't have enough space, I start a GC. What this one does is:
switch the memory banks
move: read car and cdr from the old bank, copy to the new bank and put a pointer in the old bank to the new for later.
scan: loops over the memory and copies everything from old to new.
Now the question is: why do I need to scan first and move after? Why can't I do both together?
It sounds like you are going through the really awesome garbage collection assignment where you implement your own collectors (mark and sweep, stop and copy, generational).
General answer to your question: two-pass algorithms typically use less memory than one-pass algorithms, by trading time for space.
More specific answer: in a stop-and-copy collector, you do it in two passes by (1) first copying the live data over to the new semispace, and (2) adjusting internal references in the live data to refer to elements in the new semispace, mapping old memory to new memory.
You must realize that the information necessary to do the mapping isn't magically available: you need memory to keep track of how to redirect the moved memory. And remember: your collector itself is a program, and it must use a bounded, small amount of memory! Keeping a hash table in your collector to do the bookkeeping, for example, would be verboten: it's not playing by the rules. So one thing you need to keep track of is making sure the collector is playing with a limited amount of memory. So that explains why a stop-and-copy collector will reuse the old semispace and write those redirect records there.
With that constraint in mind: it's important to realize that we need to be careful of how we're traversing the live set. Which approach we choose may or may not require additional memory, in some very subtle and surprising ways. In particular, recursion in the general case is not free! Technically, in the first pass we should be using the new semispace not only as the target of our copying, but as a funky representation of the control stack that we use to implement the recursive process that walks through the live data.
Concretely, if we're doing a one-pass approach like this to copy the live set:
;; copy-live-set: number -> void
;; copies the live set starting from memory-location.
Pseudocode:
    to copy-live-set starting at memory-location:
        copy the block at memory-location over to the new semispace, and
            record a redirection record in the old semispace
        for each internal-reference in the block:
            recursively call copy-live-set at the internal-reference if
                it hasn't been copied already
            remap the internal-reference to that new memory location
then you may be surprised to know that we've messed up with memory. The above will break the promise that the collector is making to the runtime environment because the recursion here is not iterative! It will consume control stack space. During the live set traversal, it will consume control stack space proportional to the depth of the structures we're walking across. Ooops.
If you try an alternative approach for walking through the live set, you should eventually see that there's a good way to traverse the whole live set while still guaranteeing bounded, small control stack usage. Hint: consider how graph traversal algorithms can be written as a simple while loop, with an explicit container that holds what to visit next till we exhaust the container. If you squint just right, the intermediate values in the new semispace look awfully like that container.
Once you discover how to traverse the live set in constant control stack space, you'll see that you'll need those two passes to do the complete copy-and-rewrite-internal-references thing. Worrying about these details is messy, but it's important in seeing how garbage collectors actually work. A real collector needs to do something like this, to be concerned about control stack usage, to ensure it uses bounded memory during the collection.
Summary: a two-pass algorithm is a solution that helps us with memory at the cost of some time. But we don't pay much in terms of performance: though we pass through the live set twice, the process is still linear in the size of the live set.
History: see Cheney's Algorithm, and note the title of the seminal paper's emphasis: "A Nonrecursive List Compacting Algorithm". That single highlighted word "Nonrecursive" is the key to what motivates the two-pass approach: it's trying to avoid consuming the control stack. Cheney's paper is an extension of the paper by Fenichel and Yochelson "A LISP Garbage-Collector for Virtual-Memory Computer Systems", in which the authors there proposed basically the recursive, stack-using approach first. To improve the situation, Fenichel and Yochelson then proposed using the non-recursive (but complicated!) Schorr-Waite DFS algorithm to do the traversal. Cheney's approach is an improvement because the traversal is simpler.
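If you would rather see the non-recursive traversal spelled out than work it out from the hint, here is a minimal sketch in the style of Cheney's algorithm, written in Rust rather than Scheme. It assumes a heap of pairs stored as parallel cars/cdrs vectors, like the memory banks in the question. The types and names (Value, Bank, move_value, collect) are made up for illustration, and the new bank is a growable Vec instead of a fixed-size bank with a free pointer.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Value {
    Nil,
    Int(i64),
    Pair(usize),  // index of a pair in the current bank
    Moved(usize), // forwarding record left in the old bank's car slot
}

struct Bank {
    cars: Vec<Value>,
    cdrs: Vec<Value>,
}

/// "Move": copy one pair to the new bank (unless that was already done)
/// and leave a forwarding record behind in the old bank.
fn move_value(v: Value, old: &mut Bank, new: &mut Bank) -> Value {
    match v {
        Value::Pair(i) => {
            if let Value::Moved(j) = old.cars[i] {
                return Value::Pair(j); // already copied: follow the record
            }
            let j = new.cars.len();
            new.cars.push(old.cars[i]);
            new.cdrs.push(old.cdrs[i]);
            old.cars[i] = Value::Moved(j);
            Value::Pair(j)
        }
        other => other, // non-pointers are copied as they are
    }
}

/// "Scan": walk the new bank with a plain while loop. The pairs between
/// `scan` and the end of the new bank are the work list, so no recursion
/// and no extra container are needed.
fn collect(root: Value, old: &mut Bank) -> (Value, Bank) {
    let mut new = Bank { cars: Vec::new(), cdrs: Vec::new() };
    let new_root = move_value(root, old, &mut new);
    let mut scan = 0;
    while scan < new.cars.len() {
        let car = new.cars[scan];
        let moved_car = move_value(car, old, &mut new);
        new.cars[scan] = moved_car;
        let cdr = new.cdrs[scan];
        let moved_cdr = move_value(cdr, old, &mut new);
        new.cdrs[scan] = moved_cdr;
        scan += 1;
    }
    (new_root, new)
}
```

The control stack depth stays constant no matter how deep the live structure is, and cycles are handled by the forwarding records. Garbage pairs are never visited, so they simply do not appear in the new bank.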

Does the GHC garbage collector have any special optimisations for large objects?

Does the GHC garbage collector handle "large" objects specially? Or does it treat them exactly the same as any other object?
Some GC engines put large objects in a separate area, which gets scanned less regularly and possibly has a different collection algorithm (e.g., compacting instead of copying, or maybe even using freelists rather than attempting to defragment). Does GHC do anything like this?
Yes. The GHC heap is not kept in one contiguous stretch of memory; rather, it is organized into blocks.
When an allocated object’s size is above a specific threshold (block_size*8/10, where block_size is 4k, so roughly 3.2k), the block holding the object is marked as large (BF_LARGE). Now, when garbage collection occurs, rather than copy large objects from this block to a new one, the block itself is added to the new generation's set of blocks; this involves fiddling with a linked list (a large object list, to be precise).
Since this means that it may take a while for us to reclaim dead space inside a large block, it does mean that large objects can suffer from fragmentation, as seen in bug 7831. However, this doesn't usually occur until individual allocations hit half of the megablock size, 1M.

Resources