Finding similar strings in large datasets

I'm using Levenshtein distance to retrieve similar strings from a list. At the moment the list has just a few thousand items, but we'll need to support at least 100k items.
I'm trying to make this more efficient, and one technique I came up with was to calculate the Levenshtein distance only on strings that are of similar length. I thought about also filtering on the initial character, i.e., if the string to search starts with b then I'll run the calculation only on the strings that start with b. But I'm not sure I can assume this will work all the time.
I was wondering if you all have a better way of getting this done?
Thanks

One way to go would be to hope that a match with small edit distance would have within it a short exact match. If you assume this, then, given the string ABCDEF, retrieve all strings containing ABC, BCD, CDE, or DEF, and compute their edit distances. You may even find that the best match among these is so close that any closer match must have a short match inside it, so you would have found it already. You would have to accept that if you are unlucky you may miss some good matches, or be forced to go through all the possibilities one by one.
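A minimal sketch of that substring-database idea in Python, assuming 3-character substrings and a plain in-memory dictionary (the names build_trigram_index and candidates are made up for illustration):

from collections import defaultdict

def build_trigram_index(strings):
    # map each 3-character substring to the set of stored strings containing it
    index = defaultdict(set)
    for s in strings:
        for i in range(len(s) - 2):
            index[s[i:i + 3]].add(s)
    return index

def candidates(query, index):
    # every stored string that shares at least one 3-character substring with the query
    found = set()
    for i in range(len(query) - 2):
        found |= index.get(query[i:i + 3], set())
    return found

You would then run the edit-distance computation only over candidates(query, index) instead of over the whole list.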
As an alternative to building a database of substrings, you could build a suffix array (http://en.wikipedia.org/wiki/Suffix_array) and LCP array from a string obtained by concatenating all the stored strings, separating them with a marker character not otherwise used. This takes time and space linear in the input size. You would then search for exact matches by looking for strings in the suffix array starting with ABCDEF, BCDEF, CDEF, and DEF.
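A rough sketch of the suffix-array variant, again in Python; the construction here is the naive sort-all-suffixes one (fine for illustrating the idea, but a real implementation would use a proper O(n log n) or linear construction, and the LCP array is omitted). The sample strings are made up:

strings = ["apple", "grapple", "pineapple"]   # hypothetical stored strings
SEP = "\x00"                                  # marker character not otherwise used
text = SEP.join(strings) + SEP

# naive suffix array: all suffix start positions, sorted by suffix text
sa = sorted(range(len(text)), key=lambda i: text[i:])

def find_exact(pattern):
    # binary search for suffixes that start with `pattern`
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern:
        hits.append(sa[lo])
        lo += 1
    return hits

print(find_exact("apple"))   # start positions in `text`; map these back to the stored strings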

Related

Best way to find if there is a one-typo word from list of given words

How would you efficiently solve this problem?
Suppose we were given a list of words [“apple”, “banana”, “mango”]
If we are given a word that is one typo away from a word in the list,
“Dpple”
“Adple”
“Appld”
We output true
If there is more than one typo, we output false.
As an optimization, I've tried storing the list in a hash table keyed by word length, so that for a given input I only look at words with the same number of letters, which reduces the search space. Is there a faster optimization we can make for this problem?
One possible optimisation would be to generate all one-typo words for the given list and put them in a map (or some better string lookup structure). Then look up the given word: if found, output true, else false. The total number of one-typo words is 25*L, where L is the total number of letters in the input list (assuming case does not matter).
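A small sketch of that idea in Python, assuming a typo means a single substituted letter (matching the 25*L count above) and that case is ignored; the helper names are made up for illustration:

import string

words = ["apple", "banana", "mango"]   # the given list

def one_typo_variants(word):
    # every word obtained by substituting exactly one letter (the 25*L variants)
    for i, original in enumerate(word):
        for c in string.ascii_lowercase:
            if c != original:
                yield word[:i] + c + word[i + 1:]

# precompute: every valid word plus all of its one-typo variants
lookup = set(words)
for w in words:
    lookup.update(one_typo_variants(w))

def at_most_one_typo(candidate):
    return candidate.lower() in lookup

print(at_most_one_typo("Dpple"))   # True
print(at_most_one_typo("Addle"))   # False: two letters changed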

Fast way to match strings with typo

I have a huge list of strings (city-names) and I want to find the name of a city even if the user makes a typo.
Example
User types "chcago" and the system finds "Chicago"
Of course I could calculate the Levenshtein distance of the query against every string in the list, but that would be horribly slow.
Is there any efficient way to perform this kind of string-matching?
I think the basic idea is to use Levenshtein distance, but only on a subset of the names. One approach that works if the names are long enough is to use n-grams: store the n-grams and then use more efficient techniques to require that at least x n-grams match. Alas, your example misspelling shares only 2 of Chicago's 5 3-grams (unless you count partials at the beginning and end).
For shorter names, another approach is to store the letters in each name. So, "Chicago" would turn into 6 single-letter "tuples": "c", "h", "i", "a", "g", "o". You would do the same for the name entered and then require that 4 or 5 of them match. This is a fairly simple match operation, so it can go quite fast.
Then, on this reduced set, apply Levenshtein distance to determine what the closest match is.
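A sketch of the letter-overlap filter plus the final Levenshtein pass in Python; the city list, the min_shared threshold, and the helper names are made up for illustration:

from collections import Counter

cities = ["Chicago", "Houston", "Phoenix", "San Diego"]   # hypothetical list

def levenshtein(a, b):
    # plain dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def best_match(query, names, min_shared=4):
    # keep only names sharing at least `min_shared` letters with the query,
    # then run Levenshtein on that reduced set
    q = Counter(query.lower())
    shortlist = [n for n in names
                 if sum((q & Counter(n.lower())).values()) >= min_shared]
    return min(shortlist, key=lambda n: levenshtein(query.lower(), n.lower()), default=None)

print(best_match("chcago", cities))   # Chicago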
You're asking to determine Levenshtein distance without computing Levenshtein distance.
You would have to decide how far a word can deviate before it can no longer be identified, and see whether it would be acceptable to apply a less accurate algorithm. For instance, you could look up commonly swapped or mistyped letters and limit the search to those. Or apply the first/last letter rule from this paper. You could also assume the first few letters are correct, look up the city in a sorted list, and if you don't find it, apply Levenshtein to the words at positions n-1 and n+1, where n is the position where the lookup stopped (or some variant of it).
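A sketch of that sorted-list idea in Python; note that difflib's similarity ratio stands in here for a real Levenshtein check, and the city list and threshold are made up for illustration:

import bisect
import difflib

cities = sorted(["boston", "chicago", "dallas", "denver"])   # hypothetical, already lowercased

def close_enough(a, b, threshold=0.8):
    # stand-in for an edit-distance check; difflib's ratio is not Levenshtein,
    # but it serves the same "are these nearly equal?" purpose in this sketch
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

def lookup_with_neighbours(query, names):
    # assume the first letters are right: binary-search for the query, then test
    # the entries just before and after the insertion point
    q = query.lower()
    i = bisect.bisect_left(names, q)
    if i < len(names) and names[i] == q:
        return names[i]                        # exact hit
    for j in (i - 1, i, i + 1):                # the n-1 / n+1 entries around the lookup
        if 0 <= j < len(names) and close_enough(q, names[j]):
            return names[j]
    return None

print(lookup_with_neighbours("chcago", cities))   # chicago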
There are several ideas, but I don't think there is a single best solution for what you are asking, without more assumptions.
An efficient way to search for fuzzy matches on a text string based on Levenshtein distance (or any other metric that obeys the triangle inequality) is a Levenshtein automaton. It's implemented in the Lucene project (Java) and in particular in the Lucene.net project (C#). This method works fast, but is very complex to implement.

How should I remove all the repeated words and letters of a String?

I am trying to remove every character repeated over 2 times from an extremely long string. So, for example, the word Terrrrrrific becomes Terrific.
Now my question is, how do I filter out repeats that span more than a single character in the same way? For example, if I have Words words words words words I want to filter it down to words words. However, it might be something less sensible, such as abcdabcdabcdabcdabcd, which should become abcdabcd.
I do suspect that I should use a suffix tree, but I'm not sure how to go about the algorithm exactly.
I don't know whether this algorithm is efficient enough for you, but you could do this (sketched in code below):
Choose a block length for the repeats you want to find.
Then, for every start offset from 0 to length-1, go through the string.
Maintain a stack: split the string into disjoint blocks of that length, and push each block only if it differs from the block on top of the stack, so adjacent duplicates collapse.
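A sketch of that idea in Python. One detail: the question's examples keep two adjacent copies (Terrrrrrific keeps two r's, and abcdabcdabcdabcdabcd becomes abcdabcd), so this version keeps up to `keep` copies of a block instead of collapsing to a single one. It also treats the text purely as character blocks, so the Words example still ends up with a leftover copy because the final word has no trailing space; handling token boundaries is left out. The function names and the max_block limit are made up for illustration:

def collapse_runs(s, k, offset, keep=2):
    # one pass: chunk s[offset:] into k-length blocks and keep at most `keep`
    # adjacent identical blocks (the "stack" from the answer above)
    out = [s[:offset]]            # untouched prefix before the first block
    last = None
    run = 0
    i = offset
    while i + k <= len(s):
        block = s[i:i + k]
        if block == last:
            run += 1
            if run <= keep:
                out.append(block)
        else:
            out.append(block)
            last = block
            run = 1
        i += k
    out.append(s[i:])             # leftover tail shorter than k
    return "".join(out)

def collapse_repeats(s, max_block=10, keep=2):
    # try every block length and every start offset, as described above
    for k in range(1, max_block + 1):
        for offset in range(k):
            s = collapse_runs(s, k, offset, keep)
    return s

print(collapse_repeats("Terrrrrrific"))          # Terrific
print(collapse_repeats("abcdabcdabcdabcdabcd"))  # abcdabcd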

Search String in Cell Efficient Way

It's my first post here, so please bear with me :-).
Problem Background:
I have multiple text files of the form:
<ticker>,<date>,<open>,<high>,<low>,<close>,<vol>
A,20120904 0926,37.14,37.14,37.14,37.14,693
.
.
.
ZZ,20120904 1602,1.6,1.6,1.6,1.6,11771
As you might have guessed, it's stock ticks. When I load it into MATLAB, it creates a structure with an array (for the numerical values) and a cell array (for the strings), which is fine at this point as I can work with it.
Problem:
I'd like to find the most efficient way to search the array for a specific symbol (~70K lines). While it's easy to do a naive or binary (halving) search, I don't think these approaches are very useful for multiple files and/or multiple searches to extract the beginning and end indices of a given symbol/string.
I've looked into past posts here and read about Rabin-Karp, Bitap and hash tables, but I'm not sure any of them fully answers my needs.
So far, I'm leaning towards running through the cell once and creating a hash table for each letter (i.e. 'A', 'B', etc.) and then running a naive search, or anything else you might suggest :-). The reason for hashing is that I might use the same file to look up different stock symbols, so I think running through it once and labeling letters will reduce the complexity in the long run.
What are your thoughts on the matter? Am I in the right direction?
I'm using matlab btw.
Thank you
You can store all your tickers in a struct array, with each column as a field. Assuming you have non-empty values, you can do the following:
tickers = [S.tickers];
dates = [S.date];
You can easily run queries against your struct array S to get the indices you want. You can go further and index the tickers by ticker name, by creating an index with the ticker names as keys (the idea is sketched below).
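This isn't MATLAB, but the indexing idea fits in a few lines of Python: one pass over the rows builds a map from ticker symbol to its first and last row index (assuming, as in the sample file, that rows for the same ticker are contiguous), so later lookups cost next to nothing. All names and the sample data here are made up for illustration:

rows = [
    ("A",  "20120904 0926", 37.14),
    ("A",  "20120904 0927", 37.20),
    ("ZZ", "20120904 1602", 1.60),
]

index = {}
for i, (ticker, _date, _price) in enumerate(rows):
    first, _ = index.get(ticker, (i, i))
    index[ticker] = (first, i)       # remember first occurrence, keep updating the last

print(index["A"])    # (0, 1): begin and end row indices for ticker A

In MATLAB, the same lookup table could be a containers.Map keyed by ticker name, as suggested above.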

Comparing strings in MIPS assembly

I have a bunch of strings in an array that I have defined in the data segment. If I were to take 2 of the strings from the array, is it possible to compare them to see which has the greater value in MIPS? How would I do this? Basically, I'm looking to rearrange the strings in alphabetical order.
EDIT: This is less of me trying to get help with a specific problem, and more of just a general question that will help me with my approach to the code. Thanks!
If it were me, I'd create a list of pointers to the strings, that is, a list of the addresses of each string. Then you'd write a subroutine that compares two strings given their pointers. Then, when you need to swap the strings, you simply swap the pointers.
You want to avoid swapping the strings themselves, since they may well be tightly packed, thus you'd have to do a lot of shifting to move the holes of memory around. Pointers are simple to swap. You could swap strings more easily if they were all of a fixed length (or less), then you wouldn't have to worry about moving the memory holes around.
But sorting the pointer list is really the hot tip.
To compare strings, the simplest way is to iterate over the characters of both strings in parallel and subtract one character from the other. If the result is 0, the characters are equal and you move on; if not, the sign of the result tells you which string sorts first (and therefore whether you need to swap). If you run out of either string while they're equal up to that point, the shorter string is less than the longer one. A high-level sketch of both pieces follows.
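A sketch in Python rather than MIPS, just to show the logic you'd translate into a compare subroutine and a sort loop: a character-by-character compare that returns the sign of the difference, and a sort that swaps only the "pointers" (here, list indices), never the strings themselves. The names and sample strings are made up for illustration:

def compare(a, b):
    # subtract corresponding characters; the sign says which string sorts first
    for ca, cb in zip(a, b):
        diff = ord(ca) - ord(cb)
        if diff != 0:
            return diff           # > 0: a sorts after b, < 0: a sorts before b
    return len(a) - len(b)        # equal so far: the shorter string sorts first

strings = ["pear", "apple", "banana"]
ptrs = list(range(len(strings)))          # stand-ins for the string addresses

# simple selection-style sort that swaps pointers only
for i in range(len(ptrs)):
    for j in range(i + 1, len(ptrs)):
        if compare(strings[ptrs[j]], strings[ptrs[i]]) < 0:
            ptrs[i], ptrs[j] = ptrs[j], ptrs[i]

print([strings[p] for p in ptrs])         # ['apple', 'banana', 'pear']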
