Comparing strings for their similarities? - string

I want to count the number of times a certain college course occurs in a list of thousands of entries. The problem is that the course is not always spelled the same way. For example, Computer Engineering can be spelled Computers Engineering. What is a proper, elegant way to test whether two strings are very similar?

I would try to canonicalize the strings using stemming. The idea is to give each string a canonical form; two different strings that represent the same word are very likely to have the same canonical form (for example, Computer and Computers will have the same canonical form, and you will get a match).
The Porter stemming algorithm is often used for canonicalization.
An alternative is to grade the strings by their distance from each other; the suggested Levenshtein distance can help with that, but personally I'd prefer canonicalization.
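As a rough illustration of the canonicalization idea, here is a minimal Python sketch. Note that crude_stem is a hypothetical stand-in that only strips a trailing plural "s"; it is not the Porter algorithm, and a real stemmer from an NLP library would handle far more cases. The function names and sample entries are made up for the example.

```python
import re
from collections import Counter


def crude_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: lowercase and drop a plural 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def canonical_form(title: str) -> str:
    """Reduce a course title to a canonical key built from stemmed words."""
    words = re.findall(r"[A-Za-z]+", title)
    return " ".join(crude_stem(w) for w in words)


def count_courses(entries):
    """Count entries per canonical form, so spelling variants share one bucket."""
    return Counter(canonical_form(entry) for entry in entries)


# Made-up sample data for illustration only.
entries = [
    "Computer Engineering",
    "Computers Engineering",
    "computer  engineering",
    "Physics",
]
counts = count_courses(entries)
print(counts)
```

Entries whose canonical keys differ only slightly could then be compared with an edit distance as a second pass.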

Related

A good word splitter

I have a set of short strings (average length < 12).
The strings are mostly sequences of English words (names, dictionary words, etc.).
However, there is no delimiter between the words. I want to split each string into individual words. I tried Google but didn't find anything.
Is there any standard way to do that? Also, where can I get a dictionary that includes personal names along with other English words?
Please note: The strings might not adhere to grammatical rules of English.
Examples of Strings are given below:
dontdisturb
ilovejane
iamagoodperson
It is a known problem for Twitter content/hashtags, though there is no standard/universally accepted way to solve it. (I would also suggest changing the topic to "hashtag splitter" if it is your problem, then more people would be able to find it.)
The algorithm I would suggest is the one typically used for segmentation of Chinese (which has a very similar issue as you can imagine). Here is the idea:
1. Find all substrings that appear in a dictionary and give them the highest score.
2. Then add sequences accepted by some English heuristic, with a lower score.
3. Finally, throw in individual letters or syllables found in the remainder, with the lowest score.
4. Use the Viterbi algorithm to find the best non-overlapping coverage of the string with the highest score.
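The steps above can be sketched as a small dynamic program in the Viterbi style. This is only an illustration under several assumptions: the tiny word set is made up, the scores (squared length for dictionary words, -1 for a leftover single letter) are arbitrary choices, and step 2 (the English heuristic) is left out entirely.

```python
def segment(text, dictionary, max_word_len=20):
    """Split text into the highest-scoring sequence of non-overlapping pieces."""
    n = len(text)
    best_score = [float("-inf")] * (n + 1)  # best score for text[:i]
    back = [0] * (n + 1)                    # start index of the last piece
    best_score[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            piece = text[start:end]
            if piece in dictionary:
                score = len(piece) ** 2     # dictionary words: highest score
            elif len(piece) == 1:
                score = -1                  # leftover single letters: lowest score
            else:
                continue                    # anything else is not a candidate here
            candidate = best_score[start] + score
            if candidate > best_score[end]:
                best_score[end] = candidate
                back[end] = start
    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    pos = n
    while pos > 0:
        pieces.append(text[back[pos]:pos])
        pos = back[pos]
    return pieces[::-1]


# Made-up miniature dictionary; a real one would hold many thousands of words and names.
WORDS = {"i", "a", "am", "do", "dont", "disturb", "love", "jane", "good", "person", "on"}

for sample in ("dontdisturb", "ilovejane", "iamagoodperson"):
    print(sample, "->", segment(sample, WORDS))
```

Replacing the squared-length score with word frequencies or log-probabilities would be a natural refinement.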

Fuzzy substring search from a list of strings

Okay, I've seen lots of posts about fuzzy string matching, Levenshtein distance, longest common substring, and so on. None of them seems to fit exactly what I want to do. I'm pulling product results from a variety of web services, and from those I can build a big list of names for the product. These names may include a bunch of variable junk. Here are some examples, from SearchUPC:
Apple 60W magsafe adapter L-shape with extension cord
Original Apple 60W Power Adapter (L-shaped Connector) for MacBook MC461LL/A with AC Extension Wall Cord (Bulk Packaging)
Current Apple MagSafe 60W Power Adapter for MacBook MC461LL/A with AC Extension Wall Cord (Bulk Packaging)
Apple 60W MagSafe Power Adapter - Apple Mac Accessories
Apple - MagSafe 60W Power Adapter for MacBook and 13" MacBook Pro
MagSafe - power adapter - 60 Watt
etc. What I'd like to do is pull the common product name (which to my heuristic human eye is obviously Apple 60W MagSafe Power Adapter), but none of the aforementioned methods seem likely to work. My main problem is that I don't know what to search the list of strings for... At first, I was thinking of trying longest common substring, but it seems like that will fail as a bunch of the strings have things out of order, which might yield a product name of power adapter, which is not terribly useful to the user.
Note: the vast majority of the records returned from the SearchUPC API (mostly omitted here) do include the literal string "Apple 60W MagSafe Power Adapter".
I'm implementing this in Objective-C, for iOS, but I'm really interested in the algorithm more than the implementation, so any language is acceptable.
If you want to compare strings but need something more robust than longest common substring with respect to changed order of substrings, you might look into a technique called string tiling. Simplified, the principle is as follows:
1. Find the largest common substring (larger than a minimum length) in both strings.
2. Remove that substring from both strings.
3. Repeat until there are no more common substrings larger than your minimum length.
In practice, the ratio of the remaining (unmatched) string parts to the initial lengths is an excellent indicator of how well the strings match, and the technique is extremely robust against reordering of substrings. You can find the scientific paper by M. Wise describing this technique here. I have implemented the algorithm myself in the past (it's not hard), but apparently free implementations are readily available (for example here). While I have used the algorithm in a variety of fuzzy matching scenarios, I can't vouch for that implementation, which I have never used myself.
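A minimal Python sketch of the simplified tiling loop described above, not of the full algorithm from the paper. It leans on difflib.SequenceMatcher.find_longest_match from the standard library to find each common substring. It also assumes the inputs never contain the control characters \x00 and \x01, which are used as placeholders so that later matches cannot span a removed tile. The function name and min_len default are illustrative.

```python
from difflib import SequenceMatcher


def tile_similarity(a, b, min_len=3):
    """Greedy tiling: repeatedly remove the longest common substring.

    Returns (score, tiles), where score is the matched fraction of the
    combined original length (1.0 = everything matched) and tiles lists
    the common substrings in the order they were found.
    """
    total = len(a) + len(b)
    if total == 0:
        return 1.0, []
    tiles = []
    matched = 0
    while True:
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        m = matcher.find_longest_match(0, len(a), 0, len(b))
        if m.size < min_len:
            break
        tiles.append(a[m.a:m.a + m.size])
        matched += m.size
        # Replace the tile with different placeholders in each string so
        # the two placeholders can never match each other.
        a = a[:m.a] + "\x00" + a[m.a + m.size:]
        b = b[:m.b] + "\x01" + b[m.b + m.size:]
    return 2 * matched / total, tiles


score, tiles = tile_similarity("power adapter magsafe", "magsafe power adapter")
print(round(score, 3), tiles)
```

The returned tiles are the "similar parts"; the score is one minus the unmatched ratio mentioned above.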
String tiling does not in itself solve the problem of finding the largest common product name, but it seems to me that you can do so by slightly modifying the algorithm. You're probably less interested in calculating the match percentage than in keeping the similar parts?
As the previous poster remarked, such fuzzy matching almost always needs some kind of manual validation, as both false positives and false negatives are unavoidable.
It sounds like your approach needs to be twofold. One part is matching records with one another. The other part is pulling the "canonical name" out of those matching records. In your case it is a product, but the same concept applies. I'm not sure how you will go about associating groups of matching records with the standardized product name, but I would suggest trying to pull the important information out of the records and matching that against some resources on the Internet. For instance, for your example, maybe you compare your data to an Apple product list. Or you could have a robot that crawls Google and pulls the top result to try to make the association. Bottom line: if your text is really dirty, you will need to include some human intervention. What I mean is that you may set thresholds for match, no match, and needs review. Good luck.

Need a routine to detect strings that are similar but not identical

I have a list of strings, some of which have been modified since my previous release. Some of the changes are trivial (spacing, off by one word, etc). I would like to detect strings that have only "minor" differences, so that I can try to use the older translations if at all possible.
What do I mean by "minor differences"? I will not know until I start working with the database.
Do you know of any tunable routines that will indicate when two strings are similar but not identical? Any routines that will return a number indicating how different two strings are?
There are many such algorithms. The keyword to search for is fuzzy string matching.
A well-known one is the Levenshtein distance. With it you can calculate the number of "changes" (insertions, deletions, and substitutions of single characters) required to transform one string into another, which gives you an estimate of how similar the strings are.
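For illustration, a minimal Python sketch of the Levenshtein distance using the standard two-row dynamic programming formulation. The similarity and is_minor_change helpers, and the 0.9 threshold, are assumptions added here to show one way of making the result tunable; they are not part of the algorithm itself.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalize the distance to a 0..1 score (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def is_minor_change(old, new, threshold=0.9):
    """Hypothetical tunable test: changed, but still above the threshold."""
    return old != new and similarity(old, new) >= threshold


print(levenshtein("kitten", "sitting"))
print(is_minor_change("Save the  file", "Save the file"))
```

The threshold would need tuning against real data once the kinds of "minor" differences are known.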
See also this question: How to search for similar words for solutions in Delphi.

Most efficient algorithm for comparing a string with a group of strings

I have a scenario where a user can post a number of responses or phrases via a form field. I would like to be able to take the response and determine what they are asking for. For instance if the user types in car, train, bike, jet .... I can assume they are talking about a vehicle, and respond accordingly. I understand that I could use a switch statement or perhaps a regexp as well, however the larger the number of possible responses, the less efficient that computation will be. I'm wondering if there is an efficient algorithm for comparing a string with a group of strings. Any info would be great.
You may want to look into the Aho-Corasick algorithm. If you have a collection of strings that you want to search for, you can spend time linear in their total length preprocessing them, and from that point forward you can find all matches of those strings in a text of length n in O(n) time plus the number of matches reported. In other words, after a small one-time preprocessing cost, you can extremely efficiently scan numerous inputs again and again for those keywords.
Interestingly enough, the algorithm was specifically invented to build a fast index (that is, to look for a lot of different keywords in a huge body of text), and allegedly outperformed other methods by a factor of ten. I think it would work great in your application.
Hope this helps!
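To make the Aho-Corasick idea concrete, here is a compact Python sketch: a trie of the keywords, failure links computed breadth-first, and a single pass over the text. The function names and the list-of-dicts representation are choices made for this example; a production system would more likely use an existing library.

```python
from collections import deque


def build_automaton(keywords):
    """Build the keyword trie plus failure links and per-state outputs."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> state for the longest proper suffix
    output = [[]]  # keywords that end at each state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)
    # Breadth-first pass; depth-1 states keep the root as their failure link.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text, automaton):
    """Return (start_index, keyword) for every keyword occurrence in text."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            matches.append((i - len(word) + 1, word))
    return matches


vehicles = build_automaton(["car", "train", "bike", "jet"])
print(find_all("i took the train and a bike", vehicles))
```

The automaton is built once and can then be reused for every incoming response.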
If you have a large number of "magic" words, I would suggest splitting the query into words, and using a hash-based lookup to check whether the words are recognized.
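A minimal Python sketch of the hash-based lookup, assuming a hypothetical table that maps each recognized word to a category. The words and categories are made up for the example; a dict lookup is constant time on average, so the cost depends on the number of words in the query, not on the size of the table.

```python
import re
from collections import Counter

# Hypothetical word-to-category table; in practice this would be loaded from data.
CATEGORY_BY_WORD = {
    "car": "vehicle",
    "train": "vehicle",
    "bike": "vehicle",
    "jet": "vehicle",
    "apple": "fruit",
    "pear": "fruit",
}


def categorize(query):
    """Split the query into words and count the categories of recognized ones."""
    words = re.findall(r"[a-z]+", query.lower())
    return Counter(CATEGORY_BY_WORD[w] for w in words if w in CATEGORY_BY_WORD)


def best_category(query):
    """Return the most frequent category, or None if nothing was recognized."""
    counts = categorize(query)
    return counts.most_common(1)[0][0] if counts else None


print(best_category("Car, train, bike, jet"))
```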
You can check out the trie data structure. I think it is one of the best solutions for your problem.

Algorithms for splitting personal names in parts

I'm looking for references on separating a name such as "John A. Doe" into parts: first=John, middle=A., last=Doe. In Mexico we have paternal, maternal, first and second given names, and they can be written in different permutations, so the problem is quite complex.
As it depends on the data, we are working with matching software that calculates a score for every word so we can make decisions (it is based on a big database). The input data is not clean: it is imported from some government web pages and is human filtered, so it could contain junk that has to be recognized as well. Any suggestions?
[Edit]
Examples:
name: Javier Abdul Córdoba Gándara
common permutations (or as it may appear in government data referring to the same person):
Córdoba Gándara Javier Abdul
Javier A. Córdoba Gándara
Javier Abdul Córdoba G.
paternal: Córdoba
maternal: Gándara
first given: Javier
second given: Abdul
name: María de la Luz Sánchez Martínez
paternal: Sánchez
maternal: Martínez
first given: María de la Luz
name: Paloma Viridiana Alin Arias Medina
paternal: Arias
maternal: Medina
first given: Paloma
second given: Viridiana Alin
As I said, the meaning of each word depends on the score. One has no way of knowing that Viridiana and Alin are given names if not from the score.
We have a very strong database (80 million records or so) so we can get some use of the scoring system. I am designing some algorithm that uses that but looking for other references.
Unfortunately - and having done quite a bit of this work myself - your ideal algorithm will be very data specific, and you will need to work this out for your particular situation.
Of the total time and effort to develop this algorithm, I'd say the time will be split roughly as follows:
10% for general string manipulation
30% for the specific nature of the data (Mexican name formats, data input quirks)
60% to cater for data quality / lack of quality
And I believe that's quite generous towards the general string manipulation. Of course it depends whether you need quality results for all records, or only the 'clean' records etc, and if you are able to ignore the 'difficult' records it makes it a lot simpler.
Some general tips
If they are not required, remove non alphanumeric / whitespace characters
Split on spaces
Use hyphens / punctuation to identify surnames or family names
Initials (which are generally single letters) are not surnames; i.e. they must be first / middle
Determine the level of confidence with which you have programmatically identified each name (and test this thoroughly). You may find there are subsets of data that contain similar patterns that need to be catered for individually (they may come from different sources etc)
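As a rough sketch of the first few tips (clean, split, flag initials), here is a small Python example. The PARTICLES set and the choice to attach particles such as "de la" to the following word are assumptions added for the Spanish examples in the question; deciding which token is a given name or a surname is left to the scoring step.

```python
import re
import unicodedata

# Assumption: connecting words that should stay attached to the next name part.
PARTICLES = {"de", "del", "la", "las", "los", "y"}


def tokenize_name(raw):
    """Clean a raw name and return (token, kind) pairs, kind being 'initial' or 'word'."""
    # Compose accents so that accented letters count as single word characters.
    text = unicodedata.normalize("NFC", raw)
    # Drop everything that is not a letter, digit, underscore or whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = []
    pending = []  # particles waiting for the word they belong to
    for part in text.split():
        if part.lower() in PARTICLES:
            pending.append(part)
            continue
        kind = "initial" if len(part) == 1 else "word"
        tokens.append((" ".join(pending + [part]), kind))
        pending = []
    if pending:  # trailing particles with nothing to attach to
        tokens.append((" ".join(pending), "word"))
    return tokens


print(tokenize_name("Javier A. Córdoba Gándara"))
print(tokenize_name("María de la Luz Sánchez Martínez"))
```

Note that in the question's data an initial can also abbreviate a surname (as in "Córdoba G."), so the initial flag is only a hint for later scoring.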
You may need to add some natural language processing or machine learning as a check. The problem of identifying author names (e.g. in scientific papers) is difficult, as they can be reported with differing orders, degrees of abbreviation, elisions, etc. If your database is dirty you will end up with ambiguity whatever you do.

Resources