Goal:
I would like to implement a function that takes a sequence of inputs "X1,...,Xn" and outputs a sorted list "Xp,...,Xq" in which all elements are distinct.
Requirement:
Every Xi in the sequence "X1,...,Xn" is a 256-bit string.
The sequence "X1,...,Xn" may contain duplicates, i.e., there may be two elements Xi and Xj with Xi = Xj.
For duplicate elements in the sequence, only one element is output in the ordered list.
The function should be as fast as possible; how much storage it uses does not matter.
The size of the sequence "X1,...,Xn" is n, where n is at most 10,000.
My idea:
I use an array, initially empty, to store the distinct elements.
When Xi is read, I first look it up in a hashtable to check whether it is already in the array. If it is, I simply drop it; if not, I add Xi to both the hashtable and the array.
Once all elements of the sequence "X1,...,Xn" have been read, I sort the array and output it.
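A minimal sketch of this approach in C++ (assuming each 256-bit value is given as a 32-byte std::string; the function name distinctSorted is just for illustration):

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Deduplicate with a hash set, then sort the distinct values.
std::vector<std::string> distinctSorted(const std::vector<std::string>& inputs) {
    std::unordered_set<std::string> seen;       // the "hashtable" from the idea above
    std::vector<std::string> result;            // the "array" of distinct elements
    result.reserve(inputs.size());

    for (const auto& x : inputs) {
        if (seen.insert(x).second) {            // .second is true only the first time x appears
            result.push_back(x);
        }
    }
    std::sort(result.begin(), result.end());    // lexicographic order of the 32-byte values
    return result;
}
```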
Question:
Between a Merkle Patricia Tree (Ethereum) and a hashtable, which one should I choose?
Which of the two has faster search speed?
Or is there a better data structure for this function?
If you want the fastest lookup, nothing beats a hashtable, but a hashtable is not good for ordering. A Merkle Patricia trie allows us to verify data integrity in large datasets; "PATRICIA" stands for "Practical Algorithm To Retrieve Information Coded In Alphanumeric".
Since blockchains hold sensitive financial transactions, Merkle trees are heavily used in blockchains. In your question you are worried about data integrity, because the inputs arrive in order and elements of the sequence may be identical or similar. That sounds like a perfect use case for a Merkle Patricia trie.
Related
I'm trying to learn data structures. If I have n keys and m slots in a hash map, the average size of a linked list starting from a slot would be n/m. Am I correct in thinking this? I'm talking about the average here. Thanks in advance!
As you say, the average size of a single list is generally going to be the table's load factor; but this is assuming that the "Simple Uniform Hashing Assumption" holds with your hash table (more specifically, with its hash function(s) and expected input keys): simply put, we assume that the hash function distributes elements to buckets uniformly, as well as independently of one another.
To expand a little, and in different words:
We assume that if we choose a new item randomly (imagine sampling an item from the probability distribution that characterizes our inputs), then there is an equal chance that the item we end up with will be mapped to any of the m buckets. (A chance of 1/m.)
Furthermore, that this probability is unaffected given the presence (or absence) of any other elements in any of the buckets.
This is helpful because from it we can conclude that the probability of an item landing in any given bucket is always 1/m, regardless of any other circumstances; by linearity of expectation it follows directly that the expected (average) length of a single bucket's list is n/m (we insert n elements into the table, and each one lands in this particular list with probability 1/m).
To see that this is important, we might imagine a case in which it doesn't hold: for instance, if we're facing some kind of "attack" and our inputs are engineered to all hash into the same bucket, or even just with a high probability. In this case SUHA no longer holds, and clearly neither does the link you've asked about between the length of a list and the load factor.
This is part of the reason that it is important to choose a good hash function for your use case: without it, the assumption may not hold which could have a harmful effect on your lookup times.
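As a small illustration (not part of the answer above), here is a C++ simulation with made-up parameters n, m and trials: it models SUHA by assigning each key to a uniformly random bucket and tracks the length of one fixed bucket; the observed average comes out close to n/m.

```cpp
#include <cstddef>
#include <iostream>
#include <random>

int main() {
    const std::size_t n = 1000, m = 128, trials = 2000;   // made-up parameters
    std::mt19937_64 rng(42);
    // Model SUHA directly: every key is equally likely to land in any of the m buckets.
    std::uniform_int_distribution<std::size_t> bucketOf(0, m - 1);

    double total = 0.0;
    for (std::size_t t = 0; t < trials; ++t) {
        std::size_t lengthOfBucket0 = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (bucketOf(rng) == 0) ++lengthOfBucket0;     // this key chains onto bucket 0
        }
        total += lengthOfBucket0;
    }
    std::cout << "observed average length of bucket 0: " << total / trials
              << "  (load factor n/m = " << static_cast<double>(n) / m << ")\n";
}
```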
I have to store billions of entries with Int64 keys in an ordered map.
If I use a usual BST, then each search operation costs log(N) pointer dereferences (20-30 for millions to billions of entries);
however, a bitwise trie with bitmap reduces this to just ceil(64/6) = 11 pointer dereferences.
This comes at the cost of an array for all 64 children in each trie node, but I think that applying the usual array-list growth strategy to this array and reusing previously allocated but discarded arrays will mitigate some of the space wastage.
I'm aware that a variant of this data structure is called a HAMT and is used as an effective persistent data structure, but this question is about a usual ordered map like std::map in C++; besides, I need no deletions of entries.
However, there are only a few implementations of this data structure on GitHub.
Why aren't bitwise tries as popular as binary search trees?
Disclaimer: I'm the author of https://en.wikipedia.org/wiki/Bitwise_trie_with_bitmap
Why aren't bitwise tries as popular as binary search trees?
Good question, and I don't know the answer. Making bitwise tries more popular is exactly the reason I published the Wikipedia article.
This comes at the cost of an array for all 64 children in each trie node, but I think that applying the usual array-list growth strategy to this array and reusing previously allocated but discarded arrays will mitigate some of the space wastage.
Nope: That's exactly where the bitmap comes in: To avoid having an array sized for all 64 possible children.
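For illustration, here is a rough C++ sketch (not taken from the Wikipedia article or any of the GitHub implementations) of how the bitmap avoids the 64-slot child array: the 6-bit chunk value indexes the bitmap, and a popcount of the lower bits gives the position in a compact child vector.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

struct Node {
    std::uint64_t bitmap = 0;                        // bit c set => child for chunk value c exists
    std::vector<std::unique_ptr<Node>> children;     // compact: one entry per set bit

    Node* getChild(unsigned chunk) const {           // chunk in [0, 63]
        const std::uint64_t bit = std::uint64_t{1} << chunk;
        if ((bitmap & bit) == 0) return nullptr;     // no such child
        // Index into the compact vector = number of set bits strictly below `chunk`.
        const int idx = __builtin_popcountll(bitmap & (bit - 1));   // GCC/Clang; std::popcount in C++20
        return children[idx].get();
    }

    Node* getOrAddChild(unsigned chunk) {
        const std::uint64_t bit = std::uint64_t{1} << chunk;
        const int idx = __builtin_popcountll(bitmap & (bit - 1));
        if ((bitmap & bit) == 0) {                   // create the child at its sorted position
            children.insert(children.begin() + idx, std::make_unique<Node>());
            bitmap |= bit;
        }
        return children[idx].get();
    }
};
```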
Is there any reason why I should use an AVL tree as opposed to binary search in an array, when I know the number of elements to be stored in advance and it is fixed? Also, only the search operation will be performed.
Is there any other data structure or algorithm that is better than both of them for searching?
Generally, your binary search solution will perform better: the array is stored in a contiguous chunk of memory, whereas the AVL tree uses a scattered collection of node objects, which yields poor spatial cache locality.
Depending on the type of data, you can use interpolation search, which provides O(log log n) expected performance, though the worst case degrades to O(n). It applies when the data follows a roughly uniform distribution: you base the probe index not on the middle but on the expected position of the key. Another option is a hash table, which normally gives O(1) lookups; with Cuckoo hashing, lookups are guaranteed O(1) regardless of collisions, provided the table is built properly.
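A minimal sketch of interpolation search over a sorted array of Int64 keys, assuming a roughly uniform key distribution (the function name and the use of double for the probe estimate are my own choices):

```cpp
#include <cstdint>
#include <vector>

// Returns the index of `key` in the sorted vector `a`, or -1 if it is not present.
long long interpolationSearch(const std::vector<std::int64_t>& a, std::int64_t key) {
    long long lo = 0, hi = static_cast<long long>(a.size()) - 1;
    while (lo <= hi && key >= a[lo] && key <= a[hi]) {
        if (a[lo] == a[hi]) return (a[lo] == key) ? lo : -1;   // avoid division by zero
        // Guess the position proportionally to where `key` falls within [a[lo], a[hi]].
        double frac = (static_cast<double>(key) - static_cast<double>(a[lo])) /
                      (static_cast<double>(a[hi]) - static_cast<double>(a[lo]));
        long long pos = lo + static_cast<long long>(frac * (hi - lo));
        if (a[pos] == key)      return pos;
        else if (a[pos] < key)  lo = pos + 1;
        else                    hi = pos - 1;
    }
    return -1;                                                 // not found
}
```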
I feel like this has to exist, but I just can't think of it. Is there a data structure that can hold a sorted list of values, be searched quickly (maybe in log(N) time like a sorted array), and also support insertion and removal of elements in log(N) or constant time?
This is pretty much the description of a balanced binary search tree, which stores elements in sorted order, allows for O(log n) insertions, deletions, and lookups, and allows for O(n) traversal of all elements.
There are many ways to build a balanced BST - there are red/black trees, AVL trees, scapegoat trees, splay trees, AA trees, treaps, (a, b)-trees, etc. Any of these would solve your problem. Of them, splay trees are probably the easiest to code up, followed by AA-trees and AVL trees.
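For a quick illustration, std::set in C++ is typically backed by a red-black tree, one of the balanced BSTs listed above, and already gives O(log n) insert, erase and find plus in-order traversal:

```cpp
#include <iostream>
#include <set>

int main() {
    std::set<int> s;
    for (int x : {42, 7, 19, 3, 25}) s.insert(x);   // O(log n) per insertion

    s.erase(19);                                    // O(log n) removal
    bool found = s.find(25) != s.end();             // O(log n) lookup

    std::cout << "25 found: " << found << "\nsorted: ";
    for (int x : s) std::cout << x << ' ';          // in-order traversal: 3 7 25 42
    std::cout << '\n';
}
```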
Hope this helps!
I would like to know the best way to sort a long list of strings with respect to time and space efficiency. I prefer time efficiency over space efficiency.
The strings can be numeric, alphabetic, alphanumeric, etc. I am not interested in the sort behavior, like alphanumeric sort vs. alphabetic sort, just the sort itself.
Some ways that I can think of are below.
Using library code, e.g. the .NET framework's Array.Sort() function. I think the way this works is that the hash codes for the strings are calculated and each string is inserted at the proper position using a binary search.
Using the database (e.g. MS SQL). I have not done this; I do not know how efficient it would be.
Using a prefix tree data structure such as a trie. Sorting requires traversing all the trie nodes using DFS (depth-first search), which is O(|V| + |E|) time. (Searching takes O(l) time, where l is the length of the string to compare.)
Any other ways or data structures?
You say that you have a database, and presumably the strings are stored in the database. Then you should get the database to do the work for you. It may be able to take advantage of an index and therefore not need to actually sort the list, but just read it from the index in sorted order.
If there is no index, the database might still be able to help you, provided you only need to fetch the first k rows for some small constant k, for example 100. When you use ORDER BY together with a TOP (or LIMIT) clause, SQL Server can use a special optimization called TOP N SORT, which runs in linear time instead of O(n log n) time.
If your strings are not in the database already then you should use the features provided by .NET instead. I think it is unlikely you will be able to write custom code that will be much faster than the default sort.
I found this paper, which uses a trie data structure to efficiently sort large sets of strings. I have not looked into it in detail, though.
Radix sort could also be a good option if the strings are not very long, e.g. a list of names.
Let us suppose you have a large list of strings and that the size of the list is n.
Using a comparison-based sorting algorithm like MergeSort, HeapSort or QuickSort will give you a running time of O(d * n * log n),
where n is the size of the list and d is the maximum length of the strings in the list.
We can try to use radix sort in this case. Let b be the base and let d be the length of the maximum string; then we can show that the running time using radix sort is O(d * (n + b)).
Furthermore, if the strings consist of, say, the lower-case English alphabet, the running time is O(d * (n + 26)).
Source: MIT OpenCourseWare algorithms lecture by Prof. Erik Demaine.
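As a rough sketch of the radix-sort option (my own code, not from the lecture), here is an LSD radix sort for ASCII strings with base b = 256, treating positions past a string's end as a 0 key so that shorter strings come first; its running time matches the O(d * (n + b)) bound above.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

void radixSortStrings(std::vector<std::string>& v) {
    const int B = 256;                               // base: one byte per pass
    std::size_t d = 0;
    for (const auto& s : v) d = std::max(d, s.size());

    std::vector<std::string> out(v.size());
    for (std::size_t pos = d; pos-- > 0; ) {         // least significant character position first
        std::size_t count[B + 1] = {0};
        auto key = [&](const std::string& s) {
            // Bytes map to 1..256; a position past the end of the string maps to 0.
            return pos < s.size() ? static_cast<unsigned char>(s[pos]) + 1 : 0;
        };
        for (const auto& s : v) ++count[key(s)];     // counting sort by this character
        std::size_t sum = 0;
        for (int c = 0; c <= B; ++c) { std::size_t t = count[c]; count[c] = sum; sum += t; }
        for (const auto& s : v) out[count[key(s)]++] = s;   // stable placement
        v.swap(out);
    }
}
```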