I'm searching for a "bad" hash function:
I'd like to hash strings and put similar strings in one bucket.
Can you give me a hint where to start my research?
Some methods or algorithm names...
Your problem is not an easy one. Two ideas:
This solution might be overly complicated, but you could try a Fourier transform. Treat your input text as a series of samples of a function and then run a Fourier transform to convert your input to the frequency domain. The low-frequency part is the general gist of the text and the high-frequency part is the tiny changes.
This is somewhat similar to what jpeg compression does: Throw away the details and just leave the important stuff. If you have two almost-identical images and you jpeg compress them greatly, you usually get the same output.
pHash uses a method similar to this.
Again, this is going to be a pretty complicated way to do it.
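To make the Fourier idea a bit more concrete, here is a rough sketch (not pHash itself; `fourier_signature` and the sample/frequency counts are my own choices): treat character codes as samples, keep only the low-frequency magnitudes as a signature, and compare signatures by Euclidean distance. Similar texts should then give nearby signatures, which you could bucket afterwards.

```python
# A rough sketch of the Fourier idea above (not pHash itself): treat character
# codes as samples, keep only the low-frequency magnitudes as a "signature",
# and compare signatures with Euclidean distance. Names and sizes are illustrative.
import numpy as np

def fourier_signature(text: str, n_samples: int = 256, n_low: int = 16) -> np.ndarray:
    codes = np.array([ord(c) for c in text], dtype=float)
    if len(codes) < n_samples:
        codes = np.pad(codes, (0, n_samples - len(codes)))  # zero-pad short texts
    else:
        codes = codes[:n_samples]                            # truncate long texts
    spectrum = np.fft.rfft(codes)                            # frequency domain
    return np.abs(spectrum[:n_low])                          # keep the low frequencies

def signature_distance(a: str, b: str) -> float:
    return float(np.linalg.norm(fourier_signature(a) - fourier_signature(b)))

print(signature_distance("the quick brown fox", "the quick brown fx"))
print(signature_distance("the quick brown fox", "completely different text"))
```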
Second idea: minHash
The idea for minHash is that you pick some markers that are likely to be the same when the inputs are the same. Then you compute a vector for the outputs of all the markers. If two inputs have similar vectors then the inputs are similar.
For example, count how many times the word "the" appears in the text. If it's even, output 0; if it's odd, output 1. Now count how many times the word "math" shows up in the text. Again, 0 for even, 1 for odd. Do that for a lot of words.
Now you process all the texts and each one gives you an output like "011100010101" or whatever. If two texts are similar then they will have similar output strings, differing by just one or two bits. You can use a multi-vantage-point (MVP) tree to search the outputs efficiently.
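A quick sketch of that parity-based signature (the marker words and function names are mine, and this is the trick as described above rather than canonical MinHash):

```python
# Sketch of the parity-based signature described above. Marker words and
# function names are illustrative, not a canonical MinHash implementation.
MARKERS = ["the", "math", "and", "hash", "string", "bucket", "of", "to"]

def signature(text: str) -> str:
    words = text.lower().split()
    bits = []
    for marker in MARKERS:
        count = sum(1 for w in words if w == marker)
        bits.append("1" if count % 2 else "0")   # 0 if even, 1 if odd
    return "".join(bits)

def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

s1 = signature("the cat sat on the mat and did math")
s2 = signature("the cat sat on the mat and did more math")
print(s1, s2, hamming(s1, s2))   # similar texts -> signatures differ in few bits
```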
This, too, might be overkill for your problem.
It depends on what you mean by "similar string".
But if you're looking for such a "bad" one, you'll probably have to build it yourself.
Example:
You can create 10 buckets (0 to 9) and group the strings by their length mod 10.
Or use a strcmp()-like function and group them by their difference from a fixed reference string.
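For instance, a deliberately collision-friendly hash along the first line might look like this sketch:

```python
# Sketch of the "10 buckets by length mod 10" idea above.
from collections import defaultdict

def bad_hash(s: str) -> int:
    return len(s) % 10            # strings of similar length tend to collide

buckets = defaultdict(list)
for word in ["cat", "dog", "mouse", "horse", "elephant", "anteater"]:
    buckets[bad_hash(word)].append(word)

print(dict(buckets))   # {3: ['cat', 'dog'], 5: ['mouse', 'horse'], 8: ['elephant', 'anteater']}
```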
Let's say I have a dictionary (word list) of millions upon millions of words. Given a query word, I want to find the word from that huge list that is most similar.
So let's say my query is elepant, then the result would most likely be elephant.
If my word is fentist, the result will probably be dentist.
Of course assuming both elephant and dentist are present in my initial word list.
What kind of index, data structure or algorithm can I use for this so that the query is fast? Hopefully complexity of O(log N).
What I have: The most naive thing to do is to create a "distance function" (which computes the "distance" between two words, in terms of how different they are) and then in O(n) compare the query with every word in the list, and return the one with the closest distance. But I wouldn't use this because it's slow.
The problem you're describing is a Nearest Neighbor Search (NNS). There are two main methods of solving NNS problems: exact and approximate.
If you need an exact solution, I would recommend a metric tree, such as the M-tree, the MVP-tree, and the BK-tree. These trees take advantage of the triangle inequality to speed up search.
If you're willing to accept an approximate solution, there are much faster algorithms. The current state of the art for approximate methods is Hierarchical Navigable Small World (HNSW). The Non-Metric Space Library (nmslib) provides an efficient implementation of HNSW as well as several other approximate NNS methods.
(You can compute the Levenshtein distance with Hirschberg's algorithm)
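To illustrate how a metric tree exploits the triangle inequality, here is a toy BK-tree sketch over a plain dynamic-programming Levenshtein distance (my own code, not a tuned implementation, and not using Hirschberg's algorithm):

```python
# Minimal BK-tree sketch using plain DP Levenshtein distance (a toy,
# not a production implementation).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self):
        self.root = None                      # (word, {distance: child})

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (word, {})
                return

    def search(self, query: str, max_dist: int):
        results, stack = [], [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = levenshtein(query, word)
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: only children whose edge distance lies in
            # [d - max_dist, d + max_dist] can possibly contain matches.
            for dist, child in children.items():
                if d - max_dist <= dist <= d + max_dist:
                    stack.append(child)
        return results

tree = BKTree()
for w in ["elephant", "dentist", "element", "elegant", "relevant"]:
    tree.add(w)
print(tree.search("elepant", 2))   # includes (1, 'elephant'); order may vary
```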
I made a similar algorithm some time ago.
The idea is to have an array char[255] of characters, where each value is a list of word hashes (word ids) for the words that contain that character.
When you are searching for 'dele....':
search(d) will return an empty list
search(e) will find everything containing the character 'e', including elephant (twice, since it has two 'e's)
search(l) will bring you a new list, and you need to combine this list with the results from the previous step
...
At the end of the input you will have a list; then you can group by word hash and order by count, descending.
Another interesting property: if your input is missing one or more characters, you will just get an empty list somewhere in the middle of the search, and it will not affect the idea.
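A small sketch of that character-index idea (my own simplification: each character maps to the ids of words containing it, and candidates are ranked by how many query characters they account for):

```python
# Sketch of the character-index idea above: map each character to the ids of
# words containing it (with multiplicity), then rank candidates by how many
# query characters they account for. Simplified from the description.
from collections import Counter, defaultdict

words = ["elephant", "dentist", "relevant", "banana"]

index = defaultdict(list)                 # char -> list of word ids (with repeats)
for word_id, word in enumerate(words):
    for ch in word:
        index[ch].append(word_id)

def search(query: str):
    hits = Counter()
    for ch in query:
        hits.update(index.get(ch, []))    # missing characters just add nothing
    # group by word id and order by descending match count
    return [(words[wid], count) for wid, count in hits.most_common()]

print(search("elepant"))   # 'elephant' should rank near the top
```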
My initial algorithm had no ordering, and for every character I stored the word id, line number, and character position.
My main problem was that I wanted to search:
with 'ee' and find 'elephant'
with 'eleant' and find 'elephant'
with 'antph' and find 'elephant'
Every word was actually a line from a file, so it was often very long, and the number of files and lines was big.
I wanted a quick search over directories with more than 1 GB of text files, so even storing the index in memory was a problem. For this idea you need three parts:
a function to fill your cache
a function to find matches character by character from the input
a function to filter and maybe order the results (I didn't use ordering, as I filled my cache in the same order I read the file, and I wanted lines containing the input to come back in that same order)
I hope it makes sense.
General problem
I have a set of short strings, each of a different length, with minimum X > 0 and maximum Y. What is an algorithm that will optimally fit these short strings together to make long strings of length M, where M >> Y? Optimal is defined as the greatest number of long strings with lengths as close to M as possible.
Details
I am writing a tweet creator to practice javascript. I have a list of greetings and a list of account names. I want my program to create tweets such that each tweet has one greeting and the rest of the characters are used for account names. Each tweet has a limit of 140 characters.
Hello! #person1 #acc2 #mygoodfriend3 ...
Of course, each account has a different number of characters. I want each tweet to use up as many of the 140 characters as possible by optimally selecting combinations of account names.
I am pretty certain there is a class of problems / algorithm that is known to solve this problem but I can't remember it.
This kind of problem is called a knapsack problem, and an exact (or exactly optimal) solution is famous for being intractable.
However, there are reasonable approximate solvers, as well as a "pseudo-polynomial time" dynamic-programming algorithm.
I think it is related to the knapsack problem; it is a particular case of the "multiple linear bin packing problem". Both are NP-hard. Here you can find a greedy algorithm for the linear bin packing problem, but the multiple case is much harder. A constraint programming language/library would help to solve these kinds of problems.
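As a rough illustration of a greedy approach to this bin-packing flavour of the problem (in Python rather than the asker's JavaScript, with made-up data): sort the account tags longest first and drop each into the first tweet that still has room, i.e. a first-fit-decreasing heuristic.

```python
# First-fit-decreasing sketch of the greedy bin-packing idea above (Python
# rather than JavaScript, and the data here is made up).
LIMIT = 140
greeting = "Hello!"
accounts = ["#person1", "#acc2", "#mygoodfriend3", "#anotherlongaccountname",
            "#x", "#shorty", "#someone_else", "#yetanotheraccount"]

def pack_tweets(greeting, accounts, limit=LIMIT):
    tweets = []                                  # each tweet: [remaining, [parts]]
    for tag in sorted(accounts, key=len, reverse=True):
        needed = len(tag) + 1                    # +1 for the separating space
        for tweet in tweets:
            if tweet[0] >= needed:               # first tweet with enough room
                tweet[1].append(tag)
                tweet[0] -= needed
                break
        else:                                    # no existing tweet fits: open a new one
            remaining = limit - len(greeting) - needed
            tweets.append([remaining, [greeting, tag]])
    return [" ".join(parts) for _, parts in tweets]

for t in pack_tweets(greeting, accounts):
    print(len(t), t)
```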
I'm trying to find near duplicates in a large list of names by computing the metaphone key for each string, and then, within each set of possible duplicates, use something like Levenshtein distance to get a more refined estimate of duplicate likelihood.[1]
However, I'm finding that metaphone is heavily determined by the first characters in the strings, and so if I feed it a long list of people's names, I get huge buckets where everyone's name is "Jennifer X" or "Richard Y", but otherwise haven't got much in common.
If I reverse the string before generating the key, the results are more sensible, in that they group by last name, but still I find that the first names aren't particularly similar.
So is there a similar algorithm that samples more of the input string to produce a sound key, perhaps by using a longer key string?
[1] Ideally, I'd compute the string distances directly, but if my list has 10,000 names, that would mean 100,000,000 computations, which is why I'm trying to divide and conquer by sound keying each name first and only checking for similarities within the buckets. But if there's a better way, I'd love to hear about it!
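For what it's worth, here is a sketch of that bucket-then-compare strategy. `sound_key` is a deliberately crude stand-in (last name's first letter plus a length class), not Metaphone, so swap in whatever phonetic key you end up using:

```python
# Sketch of the bucket-then-compare strategy from the footnote. `sound_key`
# is a crude placeholder, not Metaphone.
from collections import defaultdict
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def sound_key(name: str) -> str:
    last = name.split()[-1].lower()
    return last[0] + str(len(last) // 3)         # hypothetical, very coarse key

def near_duplicates(names, max_dist=3):
    buckets = defaultdict(list)
    for name in names:
        buckets[sound_key(name)].append(name)
    pairs = []
    for bucket in buckets.values():              # only compare within a bucket
        for a, b in combinations(bucket, 2):
            if levenshtein(a, b) <= max_dist:
                pairs.append((a, b))
    return pairs

print(near_duplicates(["Jennifer Smith", "Jennifer Smyth", "Richard Jones", "Jenifer Smith"]))
```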
Try eudex.
It's described as "A blazingly fast phonetic reduction/hashing algorithm."
There are many easy ways to use it, as it encodes a word into a 64-bit integer with the most discriminating features towards the MSB. The Hamming distance between hashes is also useful as a difference metric between words and spellings.
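For illustration, comparing such 64-bit hashes by Hamming distance is just a popcount of the XOR; the hash values below are made up, and `hamming64` is my own helper, not part of eudex's API:

```python
# Comparing 64-bit phonetic hashes by Hamming distance, as described above.
# The hash values are made up; plug in whatever produces the 64-bit integers.
def hamming64(a: int, b: int) -> int:
    return bin(a ^ b).count("1")      # number of differing bits

h1 = 0x3F00_0000_0000_0A42            # made-up hash values for illustration
h2 = 0x3F00_0000_0000_0A40
print(hamming64(h1, h2))              # 1 -> very similar under this metric
```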
I don't want a direct solution to the problem that's the source of this question, but for reference it's this one (link):
So I take the strings and add them to a suffix array, which is implemented internally as a sorted set; what I obtain then is a lexicographically sorted list of all the suffixes of the two given strings.
S1 = "banana"
S2 = "panama"
SuffixArray.add S1, S2
To make searching for the k-th smallest substring efficient, I preprocess this sorted set to add information about the longest common prefix between each suffix and its predecessor, as well as keeping a cumulative substring count. So I know that any k greater than the cumulative substring count of the last item is an invalid query.
This works really well for small inputs as well as large random inputs within the constraints given in the problem definition, which is at most 50 strings of length 2000. I am able to pass 4 out of the 7 cases and was pretty surprised I didn't get them all.
So I went searching for the bottleneck and it hit me. Given a large number of inputs like these
anananananananana.....ananana
bkbkbkbkbkbkbkbkb.....bkbkbkb
The queries for the k-th smallest substring are still fast, as expected, but not the way I preprocess the sorted set... The way I calculate the longest common prefix between elements of the set is inefficient: it's linear, O(m), per pair. I did the most naïve thing, expecting it to be good enough:
m = anananan
n = anananana
Start at 0 and find the point where `m[i] != n[i]`
It is like this because a suffix and its predecessor might not be related (i.e. they may come from different input strings), so I thought I had no choice but to use brute force.
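In code, that naïve comparison is roughly this (my own rendering of the step described above):

```python
# The naïve O(m) longest-common-prefix computation described above.
def naive_lcp(m: str, n: str) -> int:
    i = 0
    while i < len(m) and i < len(n) and m[i] == n[i]:
        i += 1
    return i

print(naive_lcp("anananan", "anananana"))   # 8
```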
Here is the question, then, and what I ended up reducing the problem to. Given a lexicographically sorted list of suffixes built in the manner I described above (made up of multiple strings):
What is an efficient way of computing the longest common prefix array?
The subquestion would then be, am I completely off the mark in my approach? Please propose further avenues of investigation if that's the case.
Footnote: I do not want to be shown an implemented algorithm, and I don't mind being told to go read such-and-such a book or resource on the subject, as that is what I do anyway while attempting these challenges.
The accepted answer will be something that guides me onto the right path or, failing that, something that teaches me how to solve these types of problems in a broader sense, such as a book or other resource.
READING
I would recommend this tutorial pdf from Stanford.
This tutorial explains a simple O(n log^2 n) algorithm with O(n log n) space to compute the suffix array and a matrix of intermediate results. The matrix of intermediate results can be used to compute the longest common prefix between two suffixes in O(log n).
HINTS
If you wish to try to develop the algorithm yourself, the key is to sort the suffixes based on their prefixes of length 2^k.
From the tutorial:
Let A(i,k) denote the subsequence of A of length 2^k starting at position i.
The position of A(i,k) in the sorted array of A(j,k) subsequences (j=1,n) is kept in P(k,i).
and
Using matrix P, one can iterate down from the biggest k to 0 and check whether A(i,k) = A(j,k). If the two prefixes are equal, a common prefix of length 2^k has been found. We then only have to update i and j, increasing them both by 2^k, and check again whether there are any more common prefixes.
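Here is a compact sketch of that idea, written from the tutorial's description rather than taken from it (the helper names are mine): P[k][i] holds the rank of the length-2^k prefix of the suffix starting at i, and lcp walks k downward exactly as quoted above.

```python
# Prefix-doubling sketch: P[k][i] is the rank of the length-2^k prefix of the
# suffix starting at i; lcp(i, j) then descends over k as described above.
def build_rank_table(s: str):
    n = len(s)
    rank = {c: r for r, c in enumerate(sorted(set(s)))}
    P = [[rank[c] for c in s]]
    k, length = 1, 1
    while length < n:
        # rank each suffix by the pair (rank of first half, rank of second half)
        key = lambda i: (P[k - 1][i], P[k - 1][i + length] if i + length < n else -1)
        order = sorted(range(n), key=key)
        new = [0] * n
        for idx in range(1, n):
            new[order[idx]] = new[order[idx - 1]] + (key(order[idx]) != key(order[idx - 1]))
        P.append(new)
        k, length = k + 1, length * 2
    return P

def lcp(P, n, i, j):
    if i == j:
        return n - i
    result = 0
    for k in range(len(P) - 1, -1, -1):
        if i < n and j < n and P[k][i] == P[k][j]:
            result += 1 << k            # common prefix of length 2^k found
            i += 1 << k
            j += 1 << k
    return result

s = "banana"
P = build_rank_table(s)
print(lcp(P, len(s), 1, 3))   # suffixes "anana" and "ana" share "ana" -> 3
```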
I have a database of ~150'000 words and a pattern (any single word), and I want to get all words from the database whose Damerau-Levenshtein distance to the pattern is less than a given number. I need to do it extremely fast. What algorithm could you suggest? If there's no good algorithm for Damerau-Levenshtein distance, plain Levenshtein distance will be welcome as well.
Thank you for your help.
P.S. I'm not going to use SOUNDEX.
I would start with a SQL function to calculate the Levenshtein distance (in T-SQL or .NET) (yes, I'm an MS person...) with a maximum-distance parameter that would cause an early exit.
This function could then be used to compare your input with each string, checking the distance and moving on to the next one if it exceeds the threshold.
I was also thinking you could, for example, set the maximum distance to 2, then filter out all words whose length differs by more than 1 while the first letter is also different. With an index this may be slightly quicker.
You could also shortcut by bringing back all strings that are exact matches first (indexing will speed this up), as these would otherwise take longer to have their Levenshtein distance of 0 calculated.
Just some thoughts....
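For illustration, here is what the early-exit idea might look like in plain Python (not T-SQL/.NET; the row-minimum cutoff is one simple way to bail out early):

```python
# Levenshtein with an early exit once every cell in the current row exceeds
# the maximum allowed distance (plain Python, not T-SQL/.NET).
def levenshtein_capped(a: str, b: str, max_dist: int):
    if abs(len(a) - len(b)) > max_dist:
        return None                      # lengths alone already rule it out
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        if min(cur) > max_dist:
            return None                  # no way to get back under the cap
        prev = cur
    return prev[-1] if prev[-1] <= max_dist else None

words = ["dentist", "density", "elephant", "dent"]
print([w for w in words if levenshtein_capped("fentist", w, 2) is not None])
```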
I do not think you can calculate this kind of function without actually enumerating all rows.
So the solutions are:
Make it a very fast enumeration (but this doesn't really scale)
Filter the initial candidates somehow (index by a letter, require at least x common letters)
Use an alternative (indexable) algorithm, such as n-grams, sketched below (however, I do not have details on the result quality of n-grams versus D-L distance).
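A sketch of the n-gram option (trigrams here; the padding and the sharing threshold are arbitrary choices of mine): index every word by its trigrams, pull candidates that share enough trigrams with the query, and only run the expensive distance on those.

```python
# Trigram index sketch for the n-gram option above; the sharing threshold is
# arbitrary and only meant to shrink the candidate set before the real
# distance computation.
from collections import defaultdict

def trigrams(word: str):
    padded = f"  {word} "                # padding so short words still get grams
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def build_index(words):
    index = defaultdict(set)
    for word in words:
        for g in trigrams(word):
            index[g].add(word)
    return index

def candidates(query: str, index, min_shared: int = 2):
    counts = defaultdict(int)
    for g in trigrams(query):
        for word in index.get(g, ()):
            counts[word] += 1
    return [w for w, c in counts.items() if c >= min_shared]

index = build_index(["elephant", "dentist", "relevant", "banana"])
print(candidates("elepant", index))      # a small set to run Levenshtein on
```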
A solution off the top of my head might be to store the database in a sorted set (e.g., std::set in C++), as it seems to me that strings sorted lexicographically would compare well. To approximate the position of the given string in the set, use std::upper_bound on the string, then iterate over the set outward from the found position in both directions, computing the distance as you go, and stop when it falls below a certain threshold. I have a feeling that this solution would probably only match strings with the same start character, but if you're using the algorithm for spell-checking, then that restriction is common, or at least unsurprising.
Edit: If you're looking for an optimisation of the algorithm itself, however, this answer is irrelevant.
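Roughly the same idea in Python, with bisect standing in for std::upper_bound and difflib's similarity ratio standing in for the distance function (the fixed window is a crude simplification of "iterate outward until the threshold is crossed"):

```python
# Python rendering of the sorted-set idea above: bisect stands in for
# std::upper_bound, difflib's similarity ratio stands in for the distance
# function, and the fixed window is a crude simplification.
import bisect
import difflib

def nearby(words_sorted, query, min_ratio=0.8, window=50):
    pos = bisect.bisect_left(words_sorted, query)
    lo, hi = max(0, pos - window), min(len(words_sorted), pos + window)
    # scan the neighbourhood of the insertion point, keeping close matches
    return [w for w in words_sorted[lo:hi]
            if difflib.SequenceMatcher(None, query, w).ratio() >= min_ratio]

words = sorted(["dentist", "density", "denote", "elephant", "elegant"])
print(nearby(words, "dentost"))
```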
I have used KNIME for string fuzzy matching and got very fast results. It is also very easy to make visual workflows in it. Just install the free edition of KNIME from https://www.knime.org/ then use the "String Distance" and "Similarity Search" nodes to get your results. I have attached a small fuzzy-matching sample workflow here (the input data comes in from the top and the patterns to search for come in from the bottom in this case):
I would recommend looking into Ankiro.
I'm not certain that it meets your requirements for precision, but it is fast.