Optimising dictionary key generation/lookup in python 3 - python-3.x

My program will contain a lot of references to a few large python dictionaries. The dictionaries use as keys fairly long strings (often +100 characters). I need to often check the presence of a key in these dictionaries. And very often it would be for the same string as it passes through the flow of the script.
Checking the presence of a key in a dictionary is O(1). However, producing a hash of a string (which is what a dictionary would do) is O(N) where N is the length of the string. Since I need to do these checks very often for the same string I was wondering if there a way to optimise this hash re-generation? I was thinking along the lines of (pseudo-code follows):
(1) receive a long string as an input
(2) create a short version of the string, e.g. by using MD5 or CRC32
(3) use the short version as a key
Does this make sense?
If yes, what sort of compression/hashing would you suggest?

To be honest, I was over-engineering the problem.
I was trying to mimic the structure of a RDBMS table so was looking for unique primary keys. These were the long strings that I was thinking of hashing. And I was looking for an efficient ways to do the hashing.
But the solution was much simpler - I found the equivalent of an auto-increment PK. In python you can do this with itertools.count(). So, instead of using the original long strings I started using itertools.count(). It is not an ideal solution but it solved the problem.

Related

Why is it not possible to get dictionary values in O(1) time?

Can we write a data structure which will search directly by taking the values in O(1) time?
For example, in this code in python3, we can get morse code by taking the keys and output the values.
morse={'A':'.-','B':'-...','C':'-.-.','D':'-..','E':'.',\
'F':'..-.','G':'--.','H':'....','I':'..','J':'.---',\
'K':'-.-','L':'.-..','M':'--','N':'_.','O':'---',\
'P':'.--.','Q':'--.-','R':'.-.','S':'...','T':'-',\
'U':'..-','V':'...-','W':'.--','X':'-..-','Y':'-.--',\
'Z':'--..','1':'.---','2':'..---','3':'...--','4':'....-',\
'5':'.....','6':'-....','7':'--...','8':'---..','9':'----.',\
'0':'----'}
n=input()
n=''.join(i.upper() for i in n if i!=' ')
for i in n:
print(morse[i],end=' ')
This gives the output:
>>>
S O S
... --- ...
If we want to search by taking the morse code as input and giving the string as output:
>>>
... --- ...
S O S
how do we do that without making another dictionary of morse code?
Please provide the proper reasoning and what are the limitations if any.
Python dictionaries are hashmaps behind the scenes. The keys are hashed to achieve O(1) lookups. The same is not done for values for a few reasons, one of which is the reason #CrakC mentioned: the dict doesn't have to have unique values. Maintaining an automatic reverse lookup would be nonconsistent at best. Another reason could be that fundamental data structures are best kept to a minimum set of operations for predictability reasons.
Hence the correct & common pattern is to maintain a separate dict with key-value pairs reversed if you want to have reverse lookups in O(1). If you cannot do that, you'll have to settle for greater time complexities.
Yes, getting the name of the key from its value in a dictionary is not possible in python. The reason for this is quite obvious. The keys in a dictionary are unique in nature i.e., there cannot be two entries in the dictionary for the same key. But the inverse is not always true. Unique keys might have non-unique values. It should be noted here that the immutable nature of the keys actually defines the structure of the dictionary. Since they are unique in nature, they can be indexed and so fetching the value of a given key executes in O(1) time. The inverse, as explained above, cannot be realized in O(1) time and will always take an average time of O(n). The most important point that you should know here is that python dictionaries are not meant to be used this way.
Further reading: http://stupidpythonideas.blogspot.in/2014/07/reverse-dictionary-lookup-and-more-on.html
Can we write a data structure which will search directly by taking the values in O(1) time?
The answer to that question would be yes, and it's a HasMap or HashTable.
Following your example, what actually happens there is that Python Dictionaries are implemented as HashMap's. From that follows that search complexity is O(1) but, as I understand, your real problem is how to search the key by the value in O(1) too. Well, being dictionaries implemented as hashmaps, if Python provided (I am not 100% sure it doesn't) that reverse searching functionality it wouldn't be O(1) because HashMaps are not designed to provide it.
It can be shown looking at how HashMaps work: you would need a hashing function which would map the key and the value to the same index in the array which, if not impossible, is pretty hard to do.
I guess that your best option is to define de inverse dictionary. It's not that uncommon to sacrifice memory to achieve better times.
As CrakC has correctly stated it is not possible to get the key from the dictionary in O(1) time, you will need to traverse the dictionary once in O(n) time in order to search for the key in the dictionary. As you do not want to create another dictionary this would be your only option.

Comparing long strings by their hashes

Trying to improve the performance of a function that compares strings I decided to compare them by comparing their hashes.
So is there a guarantee if the hash of 2 very long strings are equal to each other then the strings are also equal to each other?
While it's guaranteed that 2 identical strings will give you equal hashes, the other way round is not true : for a given hash, there are always several possible strings which produce the same hash.
This is true due to the PigeonHole principle.
That being said, the chances of 2 different strings producing the same hash can be made infinitesimal, to the point of being considered equivalent to null.
A fairly classical example of such hash is MD5, which has a near perfect 128 bits distribution. Which means that you have one chance in 2^128 that 2 different strings produce the same hash. Well, basically, almost the same as impossible.
In the simple common case where two long strings are to be compared to determine if they are identical or not, a simple compare would be much preferred over a hash, for two reasons. First, as pointed out by #wildplasser, the hash requires that all bytes of both strings must be traversed in order to calculate the two hash values, whereas the simple compare is fast, and only needs to traverse bytes until the first difference is found, which may be much less than the full string length. And second, a simple compare is guaranteed to detect any difference, whereas the hash gives only a high probability that they are identical, as pointed out by #AdamLiss and #Cyan.
There are, however, several interesting cases where the hash comparison can be employed to great advantage. As mentioned by #Cyan if the compare is to be done more than once, or must be stored for later use, then hash may be faster. A case not mentioned by others is if the strings are on different machines connected via a local network or the Internet. Passing a small amount of data between the two machines will generally be much faster. The simplest first check is compare the size of the two, if different, you're done. Else, compute the hash, each on its own machine (assuming you are able to create the process on the remote machine) and again, if different you are done. If the hash values are the same, and if you must have absolute certainty, there is no easy shortcut to that certainty. Using lossless compression on both ends will allow less data to be transferred for comparison. And finally, if the two strings are separated by time, as alluded to by #Cyan, if you want to know if a file has changed since yesterday, and you have stored the hash from yesterday's version, then you can compare today's hash to it.
I hope this will help stimulate some "out of the box" ideas for someone.
I am not sure, if your performance will be improved. Both: building hash + comparing integers and simply comparing strings using equals have same complexity, that lays in O(n), where n is the number of characters.

Comparing string distance based on precomputed hashes

I have a large list (over 200,000) of strings that I'd like to compare to a given string.
The given string is inserted by a user, so it may be slightly incorrect.
What I was hoping to do was create some kind of precomputed hash on each string on adding it to the list. This hash would contain information such as string length, addition of all the characters etc.
My question is, does something like this already exist? Surely there would be something that lets me avoid running Levenshtein distance on every string in the list?
Or maybe there's a third option I haven't thought of yet?
Sounds like you want to use a fuzzy hash of some sort. There are lots of hash functions available that can do things like this. The classic old "SOUNDEX" algorithm might even work.
Another thought - if you estimate that the probability of an incorrect entry is low, then you might actually be fine having a direct hit 99.9% of the time, falling back to SOUNDEX which might catch 90% of the remaining cases and then searching the whole list for the remaining 0.01% of the time.
Also worth checking this discussion:
How to find best fuzzy match for a string in a large string database

Best way to sort a long list of strings

I would like to know the best way to sort a long list of strings wrt the time and space efficiency. I prefer time efficiency over space efficiency.
The strings can be numeric, alpha, alphanumeric etc. I am not interested in the sort behavior like alphanumeric sort v/s alphabetic sort just the sort itself.
Some ways below that I can think of.
Using code ex: .Net framework's Arrays.Sort() function. I think the way this works is that the hashcodes for the strings are calculated and the string is inserted at the proper position using a binary search.
Using the database (ex: MS-sql). I have not done this. I do not know how efficient this would be though.
Using a prefix tree data structure like a trie. Sorting requires traversing all the trieNodes of the trie tree using DFS (depth first search) - O(|V| + |E|) time. (Searching takes O(l) time where l is the length of the string to compare).
Any other ways or data structures?
You say that you have a database, and presumably the strings are stored in the database. Then you should get the database to do the work for you. It may be able to take advantage of an index and therefore not need to actually sort the list, but just read it from the index in sorted order.
If there is no index the database might still be able to help you. If you only fetch the first k rows for some small constant number k, for example 100. When you use ORDER BY with a LIMIT clause it allows SQL Server to use a special optimization called TOP N SORT which runs in linear time instead of O(n log(n)) time.
If your strings are not in the database already then you should use the features provided by .NET instead. I think it is unlikely you will be able to write custom code that will be much faster than the default sort.
I found this paper that uses trie data structure to efficiently sort large sets of strings. I have not looked into it in detail though.
Radix sort could also be good option if strings are not very long e.g. list of names
Let us suppose you have a large list of strings and that the length of the List is N.
Using a comparison based sorting algorithm like MergeSort, HeapSort or Quicksort will give you an
where n is the size of the list and d is the maximum length for all strings in the list.
We can try to use Radix sort in this case. Let b be the base and let d be the length of the maximum string then we can show that the running time using radix sort is .
Furthermore, if the strings are say the lower case English Alphabets the running time is
Source: MIT Opencourse Algorithms lecture by prof. Eric Demaine.

Constant-time hash for strings?

Another question on SO brought up the facilities in some languages to hash strings to give them a fast lookup in a table. Two examples of this are dictionary<> in .NET and the {} storage structure in Python. Other languages certainly support such a mechanism. C++ has its map, LISP has an equivalent, as do most other modern languages.
It was contended in the answers to the question that hash algorithms on strings can be conducted in constant timem with one SO member who has 25 years experience in programming claiming that anything can be hashed in constant time. My personal contention is that this is not true, unless your particular application places a boundary on the string length. This means that some constant K would dictate the maximal length of a string.
I am familiar with the Rabin-Karp algorithm which uses a hashing function for its operation, but this algorithm does not dictate a specific hash function to use, and the one the authors suggested is O(m), where m is the length of the hashed string.
I see some other pages such as this one (http://www.cse.yorku.ca/~oz/hash.html) that display some hash algorithms, but it seems that each of them iterates over the entire length of the string to arrive at its value.
From my comparatively limited reading on the subject, it appears that most associative arrays for string types are actually created using a hashing function that operates with a tree of some sort under the hood. This may be an AVL tree or red/black tree that points to the location of the value element in the key/value pair.
Even with this tree structure, if we are to remain on the order of theta(log(n)), with n being the number of elements in the tree, we need to have a constant-time hash algorithm. Otherwise, we have the additive penalty of iterating over the string. Even though theta(m) would be eclipsed by theta(log(n)) for indexes containing many strings, we cannot ignore it if we are in such a domain that the texts we search against will be very large.
I am aware that suffix trees/arrays and Aho-Corasick can bring the search down to theta(m) for a greater expense in memory, but what I am asking specifically if a constant-time hash method exists for strings of arbitrary lengths as was claimed by the other SO member.
Thanks.
A hash function doesn't have to (and can't) return a unique value for every string.
You could use the first 10 characters to initialize a random number generator and then use that to pull out 100 random characters from the string, and hash that. This would be constant time.
You could also just return the constant value 1. Strictly speaking, this is still a hash function, although not a very useful one.
In general, I believe that any complete string hash must use every character of the string and therefore would need to grow as O(n) for n characters. However I think for practical string hashes you can use approximate hashes that can easily be O(1).
Consider a string hash that always uses Min(n, 20) characters to compute a standard hash. Obviously this grows as O(1) with string size. Will it work reliably? It depends on your domain...
You cannot easily achieve a general constant time hashing algorithm for strings without risking severe cases of hash collisions.
For it to be constant time, you will not be able to access every character in the string. As a simple example, suppose we take the first 6 characters. Then comes someone and tries to hash an array of URLs. The has function will see "http:/" for every single string.
Similar scenarios may occur for other characters selections schemes. You could pick characters pseudo-randomly based on the value of the previous character, but you still run the risk of failing spectacularly if the strings for some reason have the "wrong" pattern and many end up with the same hash value.
You can hope for asymptotically less than linear hashing time if you use ropes instead of strings and have sharing that allows you to skip some computations. But obviously a hash function can not separate inputs that it has not read, so I wouldn't take the "everything can be hashed in constant time" too seriously.
Anything is possible in the compromise between the hash function's quality and the amount of computation it takes, and a hash function over long strings must have collisions anyway.
You have to determine if the strings that are likely to occur in your algorithm will collide too often if the hash function only looks at a prefix.
Although I cannot imagine a fixed-time hash function for unlimited length strings, there is really no need for it.
The idea behind using a hash function is to generate a distribution of the hash values that makes it unlikely that many strings would collide - for the domain under consideration. This key would allow direct access into a data store. These two combined result in a constant time lookup - on average.
If ever such collision occurs, the lookup algorithm falls back on a more flexible lookup sub-strategy.
Certainly this is doable, so long as you ensure all your strings are 'interned', before you pass them to something requiring hashing. Interning is the process of inserting the string into a string table, such that all interned strings with the same value are in fact the same object. Then, you can simply hash the (fixed length) pointer to the interned string, instead of hashing the string itself.
You may be interested in the following mathematical result I came up with last year.
Consider the problem of hashing an infinite number of keys—such as the set of all strings of any length—to the set of numbers in {1,2,…,b}. Random hashing proceeds by first picking at random a hash function h in a family of H functions.
I will show that there is always an infinite number of keys that are certain to collide over all H functions, that is, they always have the same hash value for all hash functions.
Pick any hash function h: there is at least one hash value y such that the set A={s:h(s)=y} is infinite, that is, you have infinitely many strings colliding. Pick any other hash function h‘ and hash the keys in the set A. There is at least one hash value y‘ such that the set A‘={s is in A: h‘(s)=y‘} is infinite, that is, there are infinitely many strings colliding on two hash functions. You can repeat this argument any number of times. Repeat it H times. Then you have an infinite set of strings where all strings collide over all of your H hash functions. CQFD.
Further reading:
Sensible hashing of variable-length strings is impossible
http://lemire.me/blog/archives/2009/10/02/sensible-hashing-of-variable-length-strings-is-impossible/

Resources