What data structure supports fast insertion, deletion and search?

I feel like this has to exist, but I just can't think of it. Is there a data structure that can hold a sorted list of values and be searched quickly (in log(N) time, like a sorted array), and that also supports insertion and removal of elements in log(N) or constant time?

This is pretty much the description of a balanced binary search tree, which stores elements in sorted order, allows for O(log n) insertions, deletions, and lookups, and allows for O(n) traversal of all elements.
There are many ways to build a balanced BST - there are red/black trees, AVL trees, scapegoat trees, splay trees, AA trees, treaps, (a, b)-trees, etc. Any of these would solve your problem. Of them, splay trees are probably the easiest to code up, followed by AA-trees and AVL trees.
Hope this helps!
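To make the answer above concrete, here is a minimal treap sketch in Python, one of the balanced-BST variants listed (deletion omitted for brevity; the node layout and function names are illustrative, not from any particular library):

```python
import random

class Node:
    __slots__ = ("key", "prio", "left", "right")
    def __init__(self, key):
        self.key = key
        self.prio = random.random()  # random priority keeps the tree balanced in expectation
        self.left = None
        self.right = None

def rotate_right(t):
    l = t.left
    t.left, l.right = l.right, t
    return l

def rotate_left(t):
    r = t.right
    t.right, r.left = r.left, t
    return r

def insert(root, key):
    # Standard treap insert: BST insert by key, then rotate up by priority (min-heap).
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
        if root.left.prio < root.prio:
            root = rotate_right(root)
    elif key > root.key:
        root.right = insert(root.right, key)
        if root.right.prio < root.prio:
            root = rotate_left(root)
    return root

def search(root, key):
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False

def inorder(root):
    # Yields the keys in sorted order in O(n) total time.
    if root is not None:
        yield from inorder(root.left)
        yield root.key
        yield from inorder(root.right)

root = None
for k in [5, 1, 9, 3, 7]:
    root = insert(root, k)
print(list(inorder(root)))                # [1, 3, 5, 7, 9]
print(search(root, 7), search(root, 4))   # True False
```

Insert and search are expected O(log n) because the random priorities keep the tree balanced with high probability.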

Related

How Python3 sorted list operations compare to a balanced BST?

I am using a sorted list to binary-search values with the built-in bisect module, which gives a lookup time of O(log n). The documentation of bisect points out that inserting with insort() takes O(n) in total, dominated by the cost of inserting into a list. Deletion from a list is likewise O(n).
Is there a way to use a list and have O(log n) insert, delete and lookup? Can I do that with a balanced binary search tree (BST) like a Red-Black Tree? Which Python3 module has a data structure with those properties?
NOTE: I've seen there is a package bintrees on PyPI that has RBTree and AVLTree, but it is abandoned and its documentation points to using the sortedcontainers lib. sortedcontainers, as far as I've seen, doesn't expose these trees for use (they are written in C and are the base for SortedList, SortedDict and SortedSet).
Can I do that with a balanced binary search tree (BST) like a Red-Black Tree?
Yes.
Which Python3 module has a data structure with those properties?
The Python built-in data structure set has O(log n) insert, delete and lookup; more precisely, a tighter bound for all these operations is amortised O(1).
However, sets are not sorted. Sorted equivalents are provided by sortedcontainers.
sortedcontainers as far as I've seen doesn't have these trees for usage (they are written in C and are base for SortedList, SortedDict and SortedSet).
I'm not exactly sure about the implementation details of the library, but SortedSet (or, more generally, SortedDict) is what you want (O(log n) insert, O(log n) delete, O(1) membership lookup) if you require sorted sets. Otherwise, use the built-in set.
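A stdlib-only sketch of the trade-off the question describes: bisect gives O(log n) lookup on a plain sorted list, while insort stays O(n) because of element shifting (the `contains` helper is an illustrative name, not a library function):

```python
import bisect

data = []
for x in [5, 1, 4, 2, 3]:
    bisect.insort(data, x)  # O(n): binary search to find the spot, then list shift

def contains(sorted_list, x):
    # O(log n) membership test on a sorted list
    i = bisect.bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x

print(data)                                  # [1, 2, 3, 4, 5]
print(contains(data, 3), contains(data, 6))  # True False
```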

Merkle Patricia Tree (Ethereum) and Hashtable, which one has faster search speed?

Goal:
I would like to implement a function which takes a sequence of inputs "X1,...,Xn" and outputs a list "Xp,..,Xq" in which all the elements are distinct and sorted.
Requirement:
Every Xi in the sequence "X1,...,Xn" is a 256-bit string.
The sequence "X1,...,Xn" may contain duplicates, i.e. there may exist two elements Xi and Xj such that Xi = Xj.
For duplicated elements in the sequence "X1,...,Xn", only one element is output in the ordered list.
The function should be as fast as possible, and it does not matter how much storage it uses.
The size n of the sequence "X1,...,Xn" is no more than 10,000.
My idea:
I use an Array, initially empty, to store the sequence.
When Xi arrives, I first probe the Hashtable to check whether Xi is already in the array. If it is, I drop it; if not, I add Xi to both the Hashtable and the Array.
After all elements of the sequence "X1,...,Xn" have been processed, I sort the array and output it.
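Since storage is not a concern, the steps above collapse to a few lines of Python; this sketch models the 256-bit inputs as 32-byte strings (the function name and sample data are illustrative):

```python
def dedup_and_sort(xs):
    # Hash-set membership is O(1) on average, so deduplication is O(n);
    # the final sort dominates at O(n log n).
    seen = set()
    out = []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    out.sort()
    return out

# 256-bit inputs modelled as 32-byte strings (hypothetical sample data)
xs = [b"\x02" * 32, b"\x01" * 32, b"\x02" * 32]
print(dedup_and_sort(xs))  # two distinct values, in sorted order
```

For n up to 10,000 this is effectively instantaneous; the hashtable handles the "distinct" requirement and the sort handles the "ordered" one.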
Question:
With a Merkle Patricia Tree (Ethereum) and a Hashtable, which one should I choose?
For Merkle Patricia Tree (Ethereum) and Hashtable, which one has faster search speed?
Or is there a better data structure to satisfy this function?
If you want the fastest lookup, nothing beats a hashtable, but a hashtable is not good for ordering. A Merkle Patricia trie lets us verify data integrity in large datasets; "Patricia" stands for "Practical Algorithm To Retrieve Information Coded In Alphanumeric".
Because blockchains hold sensitive financial transactions, Merkle trees are heavily used in them. In your question you care about data integrity, since the inputs arrive as a sequence and may contain duplicate or similar elements. That sounds like a good use case for a Merkle Patricia trie.

Is it possible to efficiently search a bit-trie for values less than the key?

I am currently storing a large number of unsigned 32-bit integers in a bit trie (effectively forming a binary tree with a node for each bit in the 32-bit value.) This is very efficient for fast lookup of exact values.
I now want to be able to search for keys that may or may not be in the trie and find the value for the first key less than or equal to the search key. Is this efficiently possible with a bit trie, or should I use a different data structure?
I am using a trie due to its speed and cache locality, and ideally want to sacrifice neither.
For example, suppose the trie has two keys added:
0x00AABBCC
0x00AABB00
and I am now searching for a key that is not present, 0x00AABB11. I would like to find the first key present in the tree with a value <= the search key, which in this case would be the node for 0x00AABB00.
While I've thought of a possible algorithm for this, I am seeking concrete information on if it is efficiently possible and/or if there are known algorithms for this, which will no doubt be better than my own.
We can think of a bit trie as a binary search tree; in fact, it is one. Take the 32-bit trie for example, and treat a left child as a 0 bit and a right child as a 1 bit. At the root, the left subtree holds the numbers less than 0x80000000 and the right subtree holds the numbers no less than 0x80000000, and so on down the levels. So you can use the same method you would use in a binary search tree to find the largest item not larger than the search key. Don't worry about the backtracking: it never backtracks far and does not change the search complexity.
When the match fails in the bit trie, backtrack to the nearest ancestor where you skipped a 0-branch, take that 0-child, and then follow right-most children all the way down.
If the data is static--you're not adding or removing items--then I'd take a good look at using a simple array with binary search. You sacrifice cache locality, but that might not be catastrophic. I don't see cache locality as an end in itself, but rather a means of making the data structure fast.
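The "plain array with binary search" approach can be sketched with the stdlib bisect module; `floor_key` is a hypothetical helper name for the predecessor-or-equal query:

```python
import bisect

def floor_key(sorted_keys, x):
    """Return the largest key <= x, or None if no such key exists."""
    i = bisect.bisect_right(sorted_keys, x)
    return sorted_keys[i - 1] if i else None

keys = sorted([0x00AABBCC, 0x00AABB00])
print(hex(floor_key(keys, 0x00AABB11)))  # 0xaabb00
print(floor_key(keys, 0x00000001))       # None
```

This reproduces the example from the question: searching for the absent key 0x00AABB11 yields 0x00AABB00.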
You might get better cache locality by creating a balanced binary tree in an array. Position 0 is the root node, position 1 is left node, position 2 is right node, etc. It's the same structure you'd use for a binary heap. If you're willing to allocate another 4 bytes per node, you could make it a left-threaded binary tree so that if you search for X and end up at the next larger value, following that left thread would give you the next smaller value. All told, though, I don't see where this can outperform the plain array in the general case.
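The implicit array layout described above uses the standard heap index arithmetic, sketched here for reference:

```python
# Implicit binary tree stored in an array (binary-heap layout):
# position 0 is the root, and for a node at index i:
def left(i):
    return 2 * i + 1     # index of the left child

def right(i):
    return 2 * i + 2     # index of the right child

def parent(i):
    return (i - 1) // 2  # index of the parent (i > 0)

print(left(0), right(0), parent(2))  # 1 2 0
```

No child pointers are stored, so the nodes of each level sit contiguously in memory, which is where the cache-locality benefit comes from.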
A lot depends on how sparse your data is and what the range is. If you're looking at a few thousand possible values in the range 0 to 4 billion, then the binary search looks pretty attractive. If you're talking about 500 million distinct values, then I'd look at allocating a bit array (500 megabytes) and doing a direct lookup with linear backward scan. That would give you very good cache locality.
A bit trie walks 32 nodes even in the best case, when the item is found.
A million entries in a red-black tree like std::map or java.util.TreeMap would only require log2(1,000,000), roughly 20, node visits per query in the worst case. And you do not always need to go to the bottom of the tree, which makes the average case appealing.
When backtracking to find <=, the difference is even more pronounced.
The fewer entries you have, the stronger the case for a red-black tree.
At a minimum, I would compare any solution against a red-black tree.

Most efficient way to print an AVL tree of strings?

I'm thinking that an in-order traversal will run in O(n) time. The only thing better than that would be something running in O(log n) time, but I don't see how that could be, considering we have to visit at least n nodes.
Is O(n) the fastest we can do here?
Converting and expanding @C.B.'s comment to an answer:
If you have an AVL tree with n strings in it and you want to print all of them, then you have to do at least Θ(n) total work simply because you have to print out each of the n strings. You can often lower-bound the amount of work required to produce a list or otherwise output a sequence of values simply by counting up how many items are going to be in the list.
We can be even more precise here. Suppose the combined length of all the strings in the tree is L. The time required to print out all the strings in the tree has to be at least Θ(L), since it costs some computational effort to output each individual character. Therefore, we can say that we have to do at least Θ(n + L) work to print out all the strings in the tree.
The bound given here just says that any correct algorithm has to do at least this much work, not that there actually is an algorithm that does this much work. But if you look closely at any of the major tree traversals - inorder, preorder, postorder, level-order - you'll find that they all match this time bound.
Now, one area where you can look for savings is in space complexity. A level-order traversal of the tree might require Ω(n) total space if the tree is perfectly balanced (since it holds a whole layer of the tree in memory and the bottommost layer can have Θ(n) nodes in it), while an inorder, preorder, or postorder traversal would only require O(log n) memory because you only need to store the current access path, which has logarithmic height in an AVL tree.
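As a concrete illustration of that last point, here is an iterative in-order traversal that runs in Θ(n) node visits while keeping only the current access path on an explicit stack, which is O(log n) deep in an AVL tree (the Node class is a minimal stand-in, not any particular AVL implementation):

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def inorder(root):
    # Iterative in-order traversal: Θ(n) time, O(h) extra space,
    # where h is the tree height (Θ(log n) in an AVL tree).
    stack, cur = [], root
    while stack or cur:
        while cur:
            stack.append(cur)   # descend left, remembering the path
            cur = cur.left
        cur = stack.pop()
        yield cur.value         # visit the node, then go right
        cur = cur.right

tree = Node("m", Node("c", Node("a")), Node("x"))
print(list(inorder(tree)))  # ['a', 'c', 'm', 'x']
```

Printing each string as it is yielded gives the Θ(n + L) total cost discussed above.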

efficient functional data structure for finite bijections

I'm looking for a functional data structure that represents finite bijections between two types, that is space-efficient and time-efficient.
For instance, I'd be happy if, considering a bijection f of size n:
extending f with a new pair of elements has complexity O(ln n)
querying f(x) or f^-1(x) has complexity O(ln n)
the internal representation for f is more space efficient than having 2 finite maps (representing f and its inverse)
I am aware of efficient representation of permutations, like this paper, but it does not seem to solve my problem.
Please have a look at my answer to a relatively similar question. The provided code can handle general NxM relations, but it can also be specialized to just bijections (just as you would specialize a binary search tree).
Pasting the answer here for completeness:
The simplest way is to use a pair of unidirectional maps. It has some cost, but you won't get much better (you could get a bit better using dedicated binary trees, but you have a huge complexity cost to pay if you have to implement it yourself). In essence, lookups will be just as fast, but addition and deletion will be twice as slow. Which isn't so bad for a logarithmic operation. Another advantage of this technique is that you can use specialized maps types for the key or value type if you have one available. You won't get as much flexibility with a specific generalist data structure.
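A minimal sketch of the pair-of-maps idea, using Python dicts for brevity (a genuinely functional version would use persistent maps instead; the class and method names are illustrative):

```python
class Bijection:
    """Finite bijection kept as two mutually inverse maps."""

    def __init__(self):
        self._fwd = {}  # a -> b
        self._inv = {}  # b -> a

    def add(self, a, b):
        # Reject any pair that would break bijectivity on either side.
        if a in self._fwd or b in self._inv:
            raise ValueError("pair would break bijectivity")
        self._fwd[a] = b
        self._inv[b] = a

    def apply(self, a):
        return self._fwd[a]     # f(a)

    def inverse(self, b):
        return self._inv[b]     # f^-1(b)

bij = Bijection()
bij.add(1, "a")
bij.add(2, "b")
print(bij.apply(1), bij.inverse("b"))  # a 2
```

As the answer notes, lookups in either direction cost one map lookup, while add pays twice; with balanced-tree maps instead of hash maps, all operations would be the O(ln n) the question asks for, at roughly double the space.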
A different solution is to use a quadtree (instead of considering a NxN relation as a pair of 1xN and Nx1 relations, you see it as a set of elements in the cartesian product (Key*Value) of your types, that is, a spatial plane), but it's not clear to me that the time and memory costs are better than with two maps. I suppose it needs to be tested.
Although it doesn't satisfy your third requirement, bimaps seem like the way to go. (They just make two finite maps, one in each direction, convenient to use.)
