Comparing string distance based on precomputed hashes

I have a large list (over 200,000) of strings that I'd like to compare to a given string.
The given string is inserted by a user, so it may be slightly incorrect.
What I was hoping to do was create some kind of precomputed hash for each string when adding it to the list. This hash would contain information such as the string length, the sum of all the characters, etc.
My question is, does something like this already exist? Surely there would be something that lets me avoid running Levenshtein distance on every string in the list?
Or maybe there's a third option I haven't thought of yet?

Sounds like you want to use a fuzzy hash of some sort. There are lots of hash functions available that can do things like this. The classic old "SOUNDEX" algorithm might even work.
Another thought - if you estimate that the probability of an incorrect entry is low, then you might actually be fine having a direct hit 99.9% of the time, falling back to SOUNDEX which might catch 90% of the remaining cases and then searching the whole list for the remaining 0.01% of the time.
Also worth checking this discussion:
How to find best fuzzy match for a string in a large string database
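To make the tiered fallback concrete, here is a minimal Python sketch. soundex() and levenshtein() are placeholders for whichever implementations you pick (hand-rolled or from a library), and words stands for your 200,000-entry list:
# Precompute the lookup structures once, as each string is added to the list.
exact = set(words)                              # for direct hits
by_soundex = {}                                 # fuzzy buckets keyed by sound code
for w in words:
    by_soundex.setdefault(soundex(w), []).append(w)

def find(query, max_dist=2):
    # 1. Direct hit covers the vast majority of queries.
    if query in exact:
        return query
    # 2. Fall back to the SOUNDEX bucket of the query.
    bucket = by_soundex.get(soundex(query), [])
    close = [w for w in bucket if levenshtein(query, w) <= max_dist]
    if close:
        return min(close, key=lambda w: levenshtein(query, w))
    # 3. Last resort: scan the whole list.
    return min(words, key=lambda w: levenshtein(query, w))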

Related

Is there an alternative to SOUNDEX or metaphone that handles names with fewer collisions?

I'm trying to find near duplicates in a large list of names by computing the metaphone key for each string, and then, within each set of possible duplicates, using something like Levenshtein distance to get a more refined estimate of duplicate likelihood. [1]
However, I'm finding that metaphone is heavily determined by the first characters in the strings, and so if I feed it a long list of people's names, I get huge buckets where everyone's name is "Jennifer X" or "Richard Y", but otherwise haven't got much in common.
If I reverse the string before generating the key, the results are more sensible, in that they group by last name, but still I find that the first names aren't particularly similar.
So is there a similar algorithm that samples more of the input string to produce a sound key, perhaps by using a longer key string?
[1] Ideally, I'd compute the string distances directly, but if my list has 10,000 names, that would mean 100,000,000 computations, which is why I'm trying to divide and conquer by sound keying each name first and only checking for similarities within the buckets. But if there's a better way, I'd love to hear about it!
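In code, the bucket-then-refine approach looks roughly like this (a sketch using the jellyfish package for the metaphone key and the edit distance; names stands for the input list, and the threshold of 2 is arbitrary):
import itertools
import jellyfish   # assumed to provide metaphone() and levenshtein_distance()

# 1. Bucket names by a phonetic key instead of comparing all pairs.
buckets = {}
for name in names:
    key = jellyfish.metaphone(name[::-1])       # reversed, so buckets group by last name
    buckets.setdefault(key, []).append(name)

# 2. Refine only within each bucket with an edit-distance check.
likely_duplicates = []
for bucket in buckets.values():
    for a, b in itertools.combinations(bucket, 2):
        if jellyfish.levenshtein_distance(a, b) <= 2:
            likely_duplicates.append((a, b))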
Try eudex.
It's described as "A blazingly fast phonetic reduction/hashing algorithm."
There are many easy ways to use it, as it encodes a word into a 64-bit integer with the most discriminating features towards the MSB. The Hamming distance between two hashes is also useful as a difference metric between words and spellings.
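As a rough illustration of how such a 64-bit phonetic hash would be used, eudex() below stands in for whatever binding you call (the Rust crate via FFI, or a port), so treat it as a placeholder; names is the list from the question:
def hamming(a, b):
    # Number of differing bits between two 64-bit phonetic hashes.
    return bin(a ^ b).count("1")

query = eudex("Jennifer Smith")                 # eudex() is a placeholder, see above
ranked = sorted(names, key=lambda n: hamming(eudex(n), query))
# Because the most discriminating features sit in the high bits, a small
# Hamming distance means the two spellings are phonetically close.
print(ranked[:10])                              # the ten phonetically closest names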

Comparing long strings by their hashes

Trying to improve the performance of a function that compares strings, I decided to compare them by their hashes.
So is there a guarantee that if the hashes of two very long strings are equal, then the strings are also equal to each other?
While it's guaranteed that two identical strings will give you equal hashes, the other way round is not true: for a given hash, there are always several possible strings which produce it.
This is true due to the pigeonhole principle.
That being said, the chances of two different strings producing the same hash can be made infinitesimal, to the point of being effectively zero.
A fairly classical example of such a hash is MD5, which has a near-perfect 128-bit distribution. That means you have one chance in 2^128 that two different strings produce the same hash, which is, for all practical purposes, impossible.
In the simple common case where two long strings are to be compared to determine whether they are identical, a simple compare would be much preferred over a hash, for two reasons. First, as pointed out by @wildplasser, the hash requires that all bytes of both strings be traversed in order to calculate the two hash values, whereas the simple compare is fast and only needs to traverse bytes until the first difference is found, which may be much less than the full string length. Second, a simple compare is guaranteed to detect any difference, whereas the hash gives only a high probability that they are identical, as pointed out by @AdamLiss and @Cyan.
There are, however, several interesting cases where the hash comparison can be employed to great advantage. As mentioned by @Cyan, if the compare is to be done more than once, or the result must be stored for later use, then the hash may be faster. A case not mentioned by others is when the strings are on different machines connected via a local network or the Internet: passing a small amount of data between the two machines will generally be much faster. The simplest first check is to compare the sizes of the two; if they differ, you're done. Otherwise, compute the hash, each on its own machine (assuming you are able to create the process on the remote machine), and again, if they differ, you are done. If the hash values are the same and you must have absolute certainty, there is no easy shortcut to that certainty; using lossless compression on both ends will at least allow less data to be transferred for comparison. And finally, if the two strings are separated by time, as alluded to by @Cyan, for example if you want to know whether a file has changed since yesterday and you have stored the hash of yesterday's version, then you can compare today's hash to it.
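As a toy illustration of the remote-comparison case (a sketch, assuming both sides run Python and can exchange a few bytes):
import hashlib

def fingerprint(data: bytes):
    # Cheap size check first, then a 128-bit MD5 digest.
    return (len(data), hashlib.md5(data).hexdigest())

# Each machine computes fingerprint() on its own copy and sends the small
# tuple across the network instead of the full string.
def probably_equal(fp_a, fp_b):
    return fp_a == fp_b   # equal fingerprints => equal with overwhelming probability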
I hope this will help stimulate some "out of the box" ideas for someone.
I am not sure your performance will be improved. Both approaches, building a hash and then comparing the integers, and simply comparing the strings with equals, have the same complexity, O(n), where n is the number of characters.

Mapping strings

I want to map some strings (words) to numbers, so that the more similar two strings are, the nearer their mapped values. The mapping should also take into account the positional combination of the letters: it should be a function of the letters, their positions (so that, for example, 'pit' and 'tip' map differently), and the number of letters.
Let me give some examples: starter, stater, stapler, startler, tstarter. These words are of the format "(*optional)sta(*opt)*er", where * denotes some sort of variable part, in this case either 't' or 'l' (as in starter and stapler). They should all be mapped individually, without reference to the other words, in such a way that their values do not differ by much. Later, when creating groups, I can assign appropriate ranges of numbers to differentiate the groups.
So while mapping the strings, similar strings should get similar values. There are many words, so comparing them against each other would be too complex; instead I want to map each word to a numeric value independently, put similar strings (which will have similar values) into a group, and then find patterns within those groups by other means.
So, for now, I am looking for existing methods of mapping such that similar strings (I realize I should clarify what 'similar' means in my context) get similar values, and dissimilar strings get clearly different values. Again, I emphasize that the number of strings is huge, and comparing each with every other is practically impossible (or computationally expensive and very slow), so what I want is to devise an algorithm (taking help from existing ones) that maps each word (string) on its own.
Have I made myself clear? Please give me some ideas to start with, and some terms to search and research.
I think I need some type of "bad" hash function to hash strings and then put them into buckets according to that hash value. At least some ideas or algorithm names would help.
Seems like it would be best to use a known algorithm like Levenshtein distance.
This search on StackOverflow reveals this question about finding-groups-of-similar-strings-in-a-large-set-of-strings, which links to an article describing SimHash, which sounds exactly like what you want.
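For illustration, a minimal SimHash over character trigrams (the trigram features and the 64-bit width are assumptions made here, not requirements of the linked article):
import hashlib

def simhash(word, bits=64):
    # Each trigram votes on every bit of the fingerprint.
    votes = [0] * bits
    for i in range(max(1, len(word) - 2)):
        h = int(hashlib.md5(word[i:i+3].encode()).hexdigest(), 16)
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    # Bits with a positive vote become 1; similar words share most bits.
    return sum(1 << b for b in range(bits) if votes[b] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

print(hamming(simhash("starter"), simhash("stater")))   # near-duplicates: usually a small distance
print(hamming(simhash("starter"), simhash("zebra")))    # unrelated words: usually a larger distance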

Why is it called rainbow table?

Does anyone know why it is called a rainbow table? I just remembered that we learned about an attack called a "dictionary attack". Why is it not called a dictionary?
Because it contains the entire "spectrum" of possibilities.
A dictionary attack is a brute-force technique of just trying possibilities. Like this (Python pseudo-code):
mypassworddict = load_candidate_passwords()   # placeholder: e.g. a word list of likely passwords
for password in mypassworddict:
    trypassword(password)
However, a rainbow table works differently, because it's for inverting hashes. A high level overview of a hash is that it has a number of bins:
bin1, bin2, bin3, bin4, bin5, ...
These correspond to binary parts of the output string; that's how the string ends up the length it is. As the hash proceeds, the input affects different bins in different ways. So the first byte of input (or whatever input unit is accepted) affects, say, simplistically, bins 3 and 4. The next input affects 2 and 6. And so on.
A rainbow table is a computation of all the possibilities of a given bin, i.e. all the possible inverses of that bin, for every bin... that's why it ends up so large. If the first bin value is 0x1 then you need to have a lookup list of all the values of bin2 and all the values of bin3 working backwards through the hash, which eventually gives you a value.
Why isn't it called a dictionary attack? Because it isn't.
As I've seen your previous question, let me expand on the detail you're looking for there. A cryptographically secure hash needs to be safe ideally from smallish input sizes up to whole files. To precompute the values of a hash for an entire file would take forever. So a rainbow table is designed on a small well understood subset of outputs, for example the permutations of all the characters a-z over a field of say 10 characters.
This is why password advice for defeating dictionary attacks works here. The more subsets of the whole possible set of inputs you put into your input for the hash, the more a rainbow table needs to contain to search it. The data sizes required end up stupidly big and so does the time to search. So, think about it:
If you have an input that is [a-z] for 5-8 characters, that's not too bad a rainbow table.
If you increase the length to 42 characters, that's a massive rainbow table. Each input affects the hash and so the bins of said hash.
If you throw numbers in to your search requirement [a-z][0-9] you've got even more searching to do.
Likewise [A-Za-z0-9]. Finally, stick in [\w] i.e. any printable character you can think of, and again, you're looking at a massive table.
So, making passwords long and complicated makes rainbow tables start taking Blu-ray-sized discs of data. Then, as per your previous question, you start adding in salting and hash-derived functions, and you make a general solution to hash cracking hard(er).
The goal here is to stay ahead of the computational power available.
Rainbow tables are a variant of the dictionary attack (a pre-computed dictionary attack, to be exact), but they take less space than a full dictionary (at the price of the time needed to find a key in the table). The other end of this time-memory tradeoff is full search (brute-force attack = zero precomputation, a lot of time).
In the rainbow table, the precomputed dictionary of key-ciphertext pairs is compressed into chains. Every step in a chain is done using a different reduction function. And the table has a lot of chains, so it looks like a rainbow.
In the picture, the different reduction functions K1, K2, K3 have colors like in a rainbow:
The table stored in the file contains only the first and last columns, as the middle columns can be recomputed.
I don't know where the name comes from, but the differences are:
A dictionary contains a few selected items (e.g. English words), while a rainbow table contains every possible combination.
A dictionary only contains the input, while the rainbow table contains both the input and the output.
A dictionary is used to test different inputs to see if the output is valid, while a rainbow table is used for a reverse lookup, i.e. to find which input gives a specific output.
Unfortunately some of the statements are not correct. Contrary to what is being posted, rainbow tables DO NOT contain all the possibilities for a given keyspace, at least not the ones generated for use that I've seen. They can be generated to cover 99.9%, but due to the randomness of a hash function there is no guarantee that EVERY plaintext is covered.
Each chain is made up of links or steps, and each step consists of a hashing and a reduction function. If your chain were 100 links long, you would apply that many hash/reduction steps, discarding everything in between except the start and end points.
To find the plaintext for a given hash, you simply perform the reduction/hash step up to the length of your chain. So you run the step once and check against the endpoints; if it's a miss, you repeat... until you have stepped through the entire length of your chain. If there is a match, you regenerate the chain from its start point, and you may then be able to find the plaintext. If after the regeneration it is not correct, this is a false alarm, which happens due to collisions caused by the reduction/hashing functions. Since the table contains many chains, you can do a large lookup against all the chain endpoints at each step; this is essentially where the magic happens, allowing the speed. It also leads to false alarms, but since you only need to regenerate chains which have endpoint matches, you save a lot of time by skipping unnecessary chains.
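A toy sketch of the chain mechanics described above, with MD5 as the hash and a deliberately naive reduction back into six lowercase letters; the chain length, alphabet and seed words are all made up for illustration:
import hashlib
import string

CHARS = string.ascii_lowercase
CHAIN_LEN = 100

def h(plain):
    return hashlib.md5(plain.encode()).hexdigest()

def reduce_step(digest, step, length=6):
    # Naive reduction back into the plaintext space; mixing in the step index
    # makes each link use a different reduction function.
    n = int(digest, 16) + step
    out = []
    for _ in range(length):
        n, r = divmod(n, len(CHARS))
        out.append(CHARS[r])
    return "".join(out)

def build_chain(start):
    plain = start
    for step in range(CHAIN_LEN):
        plain = reduce_step(h(plain), step)
    return start, plain                         # only the two endpoints are stored

def lookup(target_hash, table):
    # table maps chain endpoint -> chain start
    for offset in range(CHAIN_LEN - 1, -1, -1):
        # Assume the target sits at position 'offset'; roll forward to an endpoint.
        candidate = reduce_step(target_hash, offset)
        for step in range(offset + 1, CHAIN_LEN):
            candidate = reduce_step(h(candidate), step)
        if candidate in table:
            # Regenerate the chain to separate a real hit from a false alarm.
            plain = table[candidate]
            for step in range(CHAIN_LEN):
                if h(plain) == target_hash:
                    return plain
                plain = reduce_step(h(plain), step)
    return None

table = {}
for seed in ["qwerty", "banana", "secret"]:     # a real table covers a whole keyspace
    start, end = build_chain(seed)
    table[end] = start
print(lookup(h("banana"), table))               # recovers "banana", since it seeded a chain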
They do not contain dictionaries... well, not the traditional tables; there are variants of rainbow tables which incorporate the use of dictionaries, though.
That's about it. There are many ways in which this process has been optimized, including removing merging/duplicate chains to create perfect tables, and storing them in different packings to save space and loading time.

Getting fuzzy string matches from database very fast

I have a database of ~150,000 words and a pattern (any single word), and I want to get all words from the database whose Damerau-Levenshtein distance to the pattern is less than a given number. I need to do this extremely fast. What algorithm could you suggest? If there's no good algorithm for Damerau-Levenshtein distance, plain Levenshtein distance will be welcome as well.
Thank you for your help.
P.S. I'm not going to use SOUNDEX.
I would start with a SQL function to calculate the Levenshtein distance (in T-SQL or .NET) (yes, I'm an MS person...), with a maximum-distance parameter that causes an early exit.
This function could then be used to compare your input with each string, checking the distance and moving on to the next string if it breaks the threshold.
I was also thinking you could, for example, set the maximum distance to 2, then filter out all words whose length differs by more than 1 while the first letter is also different. With an index this may be slightly quicker.
You could also shortcut by bringing back all strings that are perfect matches first (indexing will speed this up), as computing a Levenshtein distance of 0 for these would otherwise take longer.
Just some thoughts...
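The same early-exit idea as a Python sketch (the row-minimum cutoff is the standard trick; the threshold and the word list are placeholders):
def levenshtein_capped(a, b, max_dist):
    # Quick lower bound: the distance is at least the length difference.
    if abs(len(a) - len(b)) > max_dist:
        return max_dist + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        if min(curr) > max_dist:                # no later cell can get back under the cap
            return max_dist + 1                 # early exit
        prev = curr
    return prev[-1]

# words is the database; only values <= 2 count as genuine matches.
matches = [w for w in words if levenshtein_capped(pattern, w, 2) <= 2]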
I do not think you can calculate this kind of function without actually enumerating all rows.
So the solutions are:
Make it a very fast enumeration (but this doesn't really scale)
Filter initial variants somehow (index by a letter, at least x common letters)
Use an alternative (indexable) algorithm, such as N-grams (however, I do not have details on the result quality of n-grams versus D-L distance); see the sketch below.
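A rough sketch of the n-gram filtering idea: index every word by its trigrams, pull candidates that share enough trigrams with the pattern, and only run the expensive distance on those. Here levenshtein() stands in for whatever Damerau-Levenshtein implementation you use, and words/pattern are the inputs from the question:
from collections import defaultdict

def trigrams(word):
    padded = "  " + word + " "                  # padding so short words still yield grams
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

# Build the index once, up front.
index = defaultdict(set)
for w in words:
    for g in trigrams(w):
        index[g].add(w)

def candidates(pattern, min_shared=2):
    counts = defaultdict(int)
    for g in trigrams(pattern):
        for w in index[g]:
            counts[w] += 1
    return [w for w, c in counts.items() if c >= min_shared]

# Only the surviving candidates get the expensive edit-distance check.
close = [w for w in candidates(pattern) if levenshtein(pattern, w) <= 2]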
A solution off the top of my head might be to store the database in a sorted set (e.g., std::set in C++), as it seems to me that strings sorted lexicographically would compare well. To approximate the position of the given string in the set, use std::upper_bound on the string, then iterate over the set outward from the found position in both directions, computing the distance as you go, and stop when it falls below a certain threshold. I have a feeling that this solution would probably only match strings with the same start character, but if you're using the algorithm for spell-checking, then that restriction is common, or at least unsurprising.
Edit: If you're looking for an optimisation of the algorithm itself, however, this answer is irrelevant.
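A Python analogue of that neighbourhood scan (bisect in place of std::upper_bound; scanning a fixed window on each side is a simplification, since edit distance does not shrink monotonically with lexicographic closeness, and levenshtein() is again a placeholder):
import bisect

sorted_words = sorted(words)                    # the database, sorted once

def nearby_matches(query, max_dist=2, window=500):
    pos = bisect.bisect_left(sorted_words, query)
    lo = max(0, pos - window)
    hi = min(len(sorted_words), pos + window)
    # Only words in the lexicographic neighbourhood are checked.
    return [w for w in sorted_words[lo:hi]
            if levenshtein(query, w) <= max_dist]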
I have used KNIME for string fuzzy matching and got very fast results. It is also very easy to make visual workflows in it. Just install the free edition of KNIME from https://www.knime.org/, then use the "String Distance" and "Similarity Search" nodes to get your results. I have attached a small fuzzy matching sample workflow here (the input data come from the top and the patterns to search for come from the bottom in this case):
I would recommend looking into Ankiro.
I'm not certain that it meets your requirements for precision, but it is fast.
