Is it possible to efficiently search a bit-trie for values less than the key? - search

I am currently storing a large number of unsigned 32-bit integers in a bit trie (effectively forming a binary tree with a node for each bit in the 32-bit value.) This is very efficient for fast lookup of exact values.
I now want to be able to search for keys that may or may not be in the trie and find the value for the first key less than or equal to the search key. Is this efficiently possible with a bit trie, or should I use a different data structure?
I am using a trie due to its speed and cache locality, and ideally want to sacrifice neither.
For example, suppose the trie has two keys added:
0x00AABBCC
0x00AABB00
and I an now searching for a key that is not present, 0x00AABB11. I would like to find the first key present in the tree with a value <= the search key, which in this case would be the node for 0x00AABB00.
While I've thought of a possible algorithm for this, I am seeking concrete information on if it is efficiently possible and/or if there are known algorithms for this, which will no doubt be better than my own.

We can think bit trie as a binary search tree. In fact, it is a binary search tree. Take the 32-bit trie for example, suppose left child as 0, right child as 1. For the root, the left subtree is for the numbers less than 0x80000000 and the right subtree is for the numbers no less than 0x80000000, so on and so forth. So you can just use the similar the method to find the largest item not larger than the search key in the binary search tree. Just don't worry about the backtracks, it won't backtrack too much and won't change the search complexity.
When you match fails in the bit trie, just backtrack to find the right-most child of the nearest ancestor of the failed node.

If the data is static--you're not adding or removing items--then I'd take a good look at using a simple array with binary search. You sacrifice cache locality, but that might not be catastrophic. I don't see cache locality as an end in itself, but rather a means of making the data structure fast.
You might get better cache locality by creating a balanced binary tree in an array. Position 0 is the root node, position 1 is left node, position 2 is right node, etc. It's the same structure you'd use for a binary heap. If you're willing to allocate another 4 bytes per node, you could make it a left-threaded binary tree so that if you search for X and end up at the next larger value, following that left thread would give you the next smaller value. All told, though, I don't see where this can outperform the plain array in the general case.
A lot depends on how sparse your data is and what the range is. If you're looking at a few thousand possible values in the range 0 to 4 billion, then the binary search looks pretty attractive. If you're talking about 500 million distinct values, then I'd look at allocating a bit array (500 megabytes) and doing a direct lookup with linear backward scan. That would give you very good cache locality.

A bit trie walks 32 nodes in the best case when the item is found.
A million entries in a red-black tree like std::map or java.util.TreeMap would only require log2(1,000,000) or roughly 20 nodes per query, worst case. And you do not always need to go to the bottom of the tree making average case appealing.
When backtracking to find <= the difference is even more pronounced.
The fewer entries you have, the better the case for a red-black tree
At a minimum, I would compare any solution to a red-black tree.

Related

Why isn't bitwise trie a popular implementation of associative array

I have to store billions of entries with Int64 keys in an ordered map.
If I use a usual BST then each search operation costs log(N) pointer dereferencings (20-30 for millions to billions entries),
however bitwise trie with bitmap reduces this just to Ceil(64/6) = 11 pointer dereferencings.
This comes at a cost of an array for all 64 children in each trie node but I think that applying the usual array list growth strategy to this array and reusing previously allocated but discarded arrays with mitigate some problems with space wastage.
I'm aware a variant of this data structure is called HAMT and used as an effective persistent data structure, but this question is about a usual ordered map like std::map in C++, besides I need no deletions of entries.
However there are a few implementations of this data structure on github.
Why aren't bitwise tries as popular as binary search trees?
Disclaimer: I'm the author of https://en.wikipedia.org/wiki/Bitwise_trie_with_bitmap
Why aren't bitwise tries as popular as binary search trees?
Good question and I don't know the answer. To make bitwise tries more popular is exactly the reason to publish the wikipedia article.
This comes at a cost of an array for all 64 children in each trie node but I think that applying the usual array list growth strategy to this array and reusing previously allocated but discarded arrays with mitigate some problems with space wastage.
Nope: That's exactly where the bitmap comes in: To avoid having an array sized for all 64 possible children.

AVL tree vs Binary Search in an array(only for searching & no insertion or deletion) when number of elementsto be stored is known in advance

Any reason why i should be using AVL tree as opposed to using binary search in an array, when i know the number of elements to be stored in advance and is of fixed size.Also only the search operation is to be performed??
Is there Any other data structure or algo which is better than them both for the search purpose??
Generally, your binary search solution will perform better if you are storing the array in contiguous memory chunk and the AVL tree would use a sparse collection of node objects which would yield poor spatial cache locality.
Depending on the type of data, you can use interpolation search would which provide O(loglogn) performance - though this would increase your worst case time complexity to O(n). This is used if the data is following a distribution that is uniform, so you would base the guess index not on the middle but rather the expected position. Another option would be to have hash table which is normally O(1) get, which with Cuckoo hashing would guarantee O(1) if created properly regardless of collisions.

Why does a HashMap contain a LinkedList instead of an AVL tree?

The instructor in this video explains that hash map implementations usually contain a linked list to chain values in case of collisions. My question is: Why not use something like an AVL tree (that takes O(log n) for insertions, deletions and lookups), instead of a linked list (that has a worst case lookup of O(n))?
I understand that hash functions should be designed such that collisions would be rare. But why not implement AVL trees anyway to optimize those rare cases?
It depends of the language implementing HashMap. I dont think this is a strict rule.
For example in Java:
What your video says is true up to Java 7.
In Java 8, the implementation of HashMap was changed to make use of red-black trees once the bucket grows beyond a certain point.
If your number of elements in the bucket is less than 8, it uses a singly linked list. Once it grows larger than 8 it becomes a tree. And reverts back to a singly linked list once it shrinks back to 6.
Why not just use a tree all the time? I guess this is a tradeoff between memory footprint vs lookup complexity within the bucket. Keep in mind that most hash functions will yield very few collisions, so maintaining a tree for buckets that have a size of 3 or 4 would be much more expensive for no good reason.
For reference, this is the Java 8 impl of an HashMap (and it actually has a quite good explanation about how the whole thing works, and why they chose 8 and 6, as "TREEIFY" and "UNTREEIFY" threshold) :
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/HashMap.java?av=f
And in Java 7:
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/java/util/HashMap.java?av=f

Iterative octree traversal

I am not able to figure out the procedure for iterative octree traversal though I have tried approaching it in the way of binary tree traversal. For my problem, I have octree nodes having child and parent pointers and I would like to iterate and only store the leaf nodes in the stack.
Also, is going for iterative traversal faster than recursive traversal?
It is indeed like binary tree traversal, but you need to store a bit of intermediate information. A recursive algorithm will not be slower per se, but use a bit more stack space for O(log8) recursive calls (about 10 levels for 1 billion elements in the octree).
Iterative algorithms will also need the same amount of space to be efficient, but you can place it into the heap it you are afraid that your stack might overflow.
Recursively you would do (pseudocode):
function traverse_rec (octree):
collect value // if there are values in your intermediate nodes
for child in children:
traverse_rec (child)
The easiest way to arrive at an iterative algorithm is to use a stack or queue for depth first or breath first traversal:
function traverse_iter_dfs(octree):
stack = empty
push_stack(root_node)
while not empty (stack):
node = pop(stack)
collect value(node)
for child in children(node):
push_stack(child)
Replace the stack with a queue and you got breath first search. However, we are storing something in the region of O(7*(log8 N)) nodes which we are yet to traverse. If you think about it, that's the lesser evil though, unless you need to traverse really big trees. The only other way is to use the parent pointers, when you are done in a child, and then you need to select the next sibling, somehow.
If you don't store in advance the index of the current node (in respect to it's siblings) though, you can only search all the nodes of the parent in order to find the next sibling, which essentially doubles the amount of work to be done (for each node you don't just loop through the children but also through the siblings). Also, it looks like you at least need to remember which nodes you visited already, for it is in general undecidable whether to descend farther down or return back up the tree otherwise (prove me wrong somebody).
All in all I would recommend against searching for such a solution.
Depends on what your goal is. Are you trying to find whether a node is visible, if a ray will intersect its bounding box, or if a point is contained in the node?
Let's assume that you are doing the last one, checking if a point is/should be contained in the node. I would add a method to the Octnode that takes a point and checks whether or not it lies within the bounding box of the Octnode. If it does return true, else false, pretty simple. From here, call a drill down method that starts at your head node and check each child, simple "for" loop, to see which Octnode it lies in, it can at most be one.
Here is where your iterative vs recursive algorithm comes into play. If you want iterative, just store the pointer to the current node, and swap this pointer from the head node to the one containing your point. Then just keep drilling down till you reach maximal depth or don't find an Octnode containing it. If you want a recursive solution, then you will call this drill down method on the Octnode that you found the point in.
I wouldn't say that iterative versus recursive has much performance difference in terms of speed, but it could have a difference in terms of memory performance. Each time you recurse you add another call depth onto the stack. If you have a large Octree this could result in a large number of calls, possibly blowing your stack.

Best way to create a disk-based B-tree from a given file?

Do you know a fast algorithm to create a B-tree from an existing (non-sorted) file containing space separated integers. Typically, the size of the file will be orders of magnitude bigger than the available RAM.
You can assume that the B-tree will not be modified afterwards, i.e. it will be only used to index the info in the file (say the file contains comma separated strings).
Moreover, is a B-tree the best idea to use for an index, can you suggest other structures?
It depends on how you want to access your data. If you are using a hashtable, you can only access elements by their primary key in O(1) which is faster than with a tree(log(n))
You cannot select ranges (everything string in between say 10 and 20) which is supported by tree algorithms in Log(n) where as a hash index can result in a full scan O(n). also the constant overhead of hash indexes is usually bigger (which is no factor in theta notation, but it still exists) whereas tree algorithms are usually easier to maintain, grow with data, scale, etc.
Use a hash table if you do not need ordering and a binary tree(balanced) otherwise.

Resources