Approximate String Matching Algorithms for names

Approximate String Matching Algorithms for names - string

I'm looking for fuzzy string algorithms for the following example: given a database of existing names, match inputs to either the best-matched name if the match accuracy is higher than the input threshold (say 90%), or NA otherwise
database = [James Bond, Michael Smith]
input
James L Bond->James Bond
JBondL->James Bond
Bond,James->James Bond
BandJamesk->James Bond
Jenny,Bond->N/A
Currently, most algorithms like Levenstein and phonetic based ones like Soundex can't match inverted names like BondJames. So far cosine and Jacquard yield the best results, but I'm looking for more, so that I can choose the best or possibly combine algorithms.

Given your examples, I would consider:
Separating n1 - the name in the input and n2 - a name in the database into words (by delimiters and capital letters): n1 -> {u1,u2,...}, n2 -> {v1,v2,...}
Finding the permutation of the order of words in n2 that minimizes s = sum(L(u, v)) where L is the Levenshtein distance.
Selecting the database entry that minimizes s.
When the number of words in L1 and the number of words in L2 don't match - you should 'penalize' s.

Related

Efficient way to check if string A is contained in string B with at most k errors

Given a string A and a string B (A shorter or the same length as B), I would like to check whether B contains a substring A' such that the Hamming distance between A and A' is at most k.
Does anyone know of an efficient algorithm to do this? Obviously I can just run a sliding window, but this is not feasible for the amount of data I'm working with. The Knuth-Morris-Pratt algorithm (https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm) would work when k=0, but I don't know whether it's modifiable to account for k>0.
Thanks!
Edit: I apparently forgot to clarify, I am looking for a consecutive substring, so for example the substring from position 3 to position 7, without skipping characters. So levenshtein distance is not applicable.

This is what you are looking for : https://en.wikipedia.org/wiki/Levenshtein_distance

If you use the Levenshtein distance and k=1, then you can use the fact that if the length of A is 2n+1 or 2n+2, then either the first or the last n characters of A must be in B.
So you can use strstr to find all places in B where the first or last n characters match exactly and then check the Levenshtein distance.
Special case A = 1 characters: matches everywhere with one error. Special case where A = 2 characters ab: call strchr(a), if it fails call strchr(b).

Is this already a string similarity algorithm?

I'm unfamiliar with string similarity algorithms except for Levenshtein Distance because that's what I'm using and it has turned out to be less than ideal.
So I've kind of got an idea of a recursive algorithm I'd like to implement but I want to know if it exists already so I can leverage other's expertise.
Here's the algorithm by example:
string 1: "Paul Johnson"
string 2: "John Paulson"
Step 1: find all longest matches
Match 1: "Paul"
Match 2: "John"
Match 3: "son"
Match 4: " "
Step 2: Calculate scores for each match with this formula: ((match.len/string.len)*match.len) This allows longer strings to be weighted more at a balanced rate based on the length of the string.
Match 1: (4/12)*4 = 1.333...
Match 2: 1.333...
Match 3: .75
Match 4: .083
Step 3: do steps 1 and 2 on larger scales, (matches of matches.) This I don't have figured out exactly. but my thinking is if "son" comes after "Paul John" and it comes after "John Paul" then that should count for something.
Step 4: sum all the scores that have been calculated.
Scores: 1.333 + 1.333 + .75 + .083333 = 3.4999... (plus whatever scores step 3 produces)
Does this look familiar to anyone? I hope someone else has gone to the trouble of actually making an algorithm along these lines so I don't have to figure it out myself.

What you describe somewhat resembles what the following paper calls the Longest Common Substring (LCS). For a brief description and comparison to other algorithms:
A Comparison of Personal Name Matching
This algorithm [11] repeatedly finds and removes the longest common
sub-string in the two strings compared, up to a minimum lengths
(normally set to 2 or 3).
...
A similarity measure can be calculated by
dividing the total length of the common sub-strings by the minimum,
maximum or average lengths of the two original strings (similar to
Smith-Waterman).
...
this algorithm is
suitable for compound names that have words (like given- and surname)
swapped.

what is the formula of sentiment calculation

what is the actual formula to compute sentiments using sentiment rated lexicon. the lexicon that I am using contains rating between the range -5 to 5. I want to compute sentiment for individual sentences. Either i have to compute average of all sentiment ranked words in sentence or only sum up them.

There are several methods for computing an index from scored sentiment components of sentences. Each is based on comparing positive and negative words, and each has advantages and disadvantages.
For your scale, a measure of the central tendency of the words would be a fair measure, where the denominator is the number of scored words. This is a form of the "relative proportional difference" measure employed below. You would probably not want to divide the total sentiment words' scores by all words, since this makes each sentence's measure strongly affected by non-sentiment terms.
If you do not believe that the 11 point rating you describe is accurate, you could just classify it as positive or negative depending on its sign. Then you could apply the following methods where you have transformed
where each P and N refer to the counts of the Positive and Negative coded sentiment words, and O is the count of all other words (so that the total number of words = P + N + O).
Absolute Proportional Difference. Bounds: [0,1]
Sentiment = (P − N) / (P + N + O)
Disadvantage: A sentence's score is affected by non-sentiment-related content.
Relative Proportional Difference. Bounds: [-1, 1]
Sentiment = (P − N) / (P + N)
Disadvantage: A sentence's score may tend to cluster very strongly near the scale endpoints (because they may contain content primarily or exclusively of either positive or negative).
Logit scale. Bounds: [-infinity, +infinity]
Sentiment = log(P + 0.5) - log(N + 0.5)
This tends to have the smoothest properties and is symmetric around zero. The 0.5 is a smoother to prevent log(0).
For details, please see William Lowe, Kenneth Benoit, Slava Mikhaylov, and Michael Laver. (2011) "Scaling Policy Preferences From Coded Political Texts." Legislative Studies Quarterly 26(1, Feb): 123-155. where we compare their properties for measuring right-left ideology, but everything we discuss also applies to positive-negative sentiment.

you can use R tool for sentiment computation. here is the link you can refer to:
https://sites.google.com/site/miningtwitter/questions/sentiment/analysis

String pre-processing step, to answer further queries in O(1) time

A string is given to you and it contains characters consisting of only 3 characters. Say, x y z.
There will be million queries given to you.
Query format: x z i j
Now in this we need to find all possible different substrings which begins with x and ends in z. i and j denotes the lower and upper bound of the range where the substring must lie. It should not cross this.
My Logic:-
Read the string. Have 3 arrays which will store the count of x y z respectively, for i=0 till strlen
Store the indexes of each characters separately in 3 more arrays. xlocation[], ylocation[], zlocation[]
Now, accordingly to the query, (a b i j) find all the indices of b within the range i and j.
Calculate the answer, for each index of b and sum it to get the result.
Is it possible to pre-process this string before the query? So, like that it takes O(1) time to answer the query.

As the others suggested, you can do this with a divide and conquer algorithm.
Optimal substructure:
If we are given a left half of the string and a right half and we know how many substrings there are in the left half and how many there are in the right half then we can add the two numbers together. We will be undercounting by all the strings that begin in the left and end in the right. This is simply the number of x's in the left substring multiplied by the number of z's in the right substring.
Therefore we can use a recursive algorithm.
This would be a problem however if we tried to solve for everything single i and j combination as the bottom level subproblems would be solved many many times.
You should look into implementing this with a dynamic programming algorithm keeping track of substrings in range i,j, x's in range i,j, and z's in range i,j.

What is the fastest way to sort n strings of length n each?

I have n strings, each of length n. I wish to sort them in ascending order.
The best algorithm I can think of is n^2 log n, which is quick sort. (Comparing two strings takes O(n) time). The challenge is to do it in O(n^2) time. How can I do it?
Also, radix sort methods are not permitted as you do not know the number of letters in the alphabet before hand.

Assume any letter is a to z.
Since no requirement for in-place sorting, create an array of linked list with length 26:
List[] sorted= new List[26]; // here each element is a list, where you can append
For a letter in that string, its sorted position is the difference of ascii: x-'a'.
For example, position for 'c' is 2, which will be put to position as
sorted[2].add('c')
That way, sort one string only take n.
So sort all strings takes n^2.
For example, if you have "zdcbacdca".
z goes to sorted['z'-'a'].add('z'),
d goes to sorted['d'-'a'].add('d'),
....
After sort, one possible result looks like
0 1 2 3 ... 25 <br/>
a b c d ... z <br/>
a b c <br/>
c
Note: the assumption of letter collection decides the length of sorted array.

For small numbers of strings a regular comparison sort will probably be faster than a radix sort here, since radix sort takes time proportional to the number of bits required to store each character. For a 2-byte Unicode encoding, and making some (admittedly dubious) assumptions about equal constant factors, radix sort will only be faster if log2(n) > 16, i.e. when sorting more than about 65,000 strings.
One thing I haven't seen mentioned yet is the fact that a comparison sort of strings can be enhanced by exploiting known common prefixes.
Suppose our strings are S[0], S[1], ..., S[n-1]. Let's consider augmenting mergesort with a Longest Common Prefix (LCP) table. First, instead of moving entire strings around in memory, we will just manipulate lists of indices into a fixed table of strings.
Whenever we merge two sorted lists of string indices X[0], ..., X[k-1] and Y[0], ..., Y[k-1] to produce Z[0], ..., Z[2k-1], we will also be given 2 LCP tables (LCPX[0], ..., LCPX[k-1] for X and LCPY[0], ..., LCPY[k-1] for Y), and we need to produce LCPZ[0], ..., LCPZ[2k-1] too. LCPX[i] gives the length of the longest prefix of X[i] that is also a prefix of X[i-1], and similarly for LCPY and LCPZ.
The first comparison, between S[X[0]] and S[Y[0]], cannot use LCP information and we need a full O(n) character comparisons to determine the outcome. But after that, things speed up.
During this first comparison, between S[X[0]] and S[Y[0]], we can also compute the length of their LCP -- call that L. Set Z[0] to whichever of S[X[0]] and S[Y[0]] compared smaller, and set LCPZ[0] = 0. We will maintain in L the length of the LCP of the most recent comparison. We will also record in M the length of the LCP that the last "comparison loser" shares with the next string from its block: that is, if the most recent comparison, between two strings S[X[i]] and S[Y[j]], determined that S[X[i]] was smaller, then M = LCPX[i+1], otherwise M = LCPY[j+1].
The basic idea is: After the first string comparison in any merge step, every remaining string comparison between S[X[i]] and S[Y[j]] can start at the minimum of L and M, instead of at 0. That's because we know that S[X[i]] and S[Y[j]] must agree on at least this many characters at the start, so we don't need to bother comparing them. As larger and larger blocks of sorted strings are formed, adjacent strings in a block will tend to begin with longer common prefixes, and so these LCP values will become larger, eliminating more and more pointless character comparisons.
After each comparison between S[X[i]] and S[Y[j]], the string index of the "loser" is appended to Z as usual. Calculating the corresponding LCPZ value is easy: if the last 2 losers both came from X, take LCPX[i]; if they both came from Y, take LCPY[j]; and if they came from different blocks, take the previous value of L.
In fact, we can do even better. Suppose the last comparison found that S[X[i]] < S[Y[j]], so that X[i] was the string index most recently appended to Z. If M ( = LCPX[i+1]) > L, then we already know that S[X[i+1]] < S[Y[j]] without even doing any comparisons! That's because to get to our current state, we know that S[X[i]] and S[Y[j]] must have first differed at character position L, and it must have been that the character x in this position in S[X[i]] was less than the character y in this position in S[Y[j]], since we concluded that S[X[i]] < S[Y[j]] -- so if S[X[i+1]] shares at least the first L+1 characters with S[X[i]], it must also contain x at position L, and so it must also compare less than S[Y[j]]. (And of course the situation is symmetrical: if the last comparison found that S[Y[j]] < S[X[i]], just swap the names around.)
I don't know whether this will improve the complexity from O(n^2 log n) to something better, but it ought to help.

You can build a Trie, which will cost O(s*n),
Details:
https://stackoverflow.com/a/13109908

Solving it for all cases should not be possible in better that O(N^2 Log N).
However if there are constraints that can relax the string comparison, it can be optimised.
-If the strings have high repetition rate and are from a finite ordered set. You can use ideas from count sort and use a map to store their count. later, sorting just the map keys should suffice. O(NMLogM) where M is the number of unique strings. You can even directly use TreeMap for this purpose.
-If the strings are not random but the suffixes of some super string this can well be done
O(N Log^2N). http://discuss.codechef.com/questions/21385/a-tutorial-on-suffix-arrays

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Approximate String Matching Algorithms for names - string

Related

Efficient way to check if string A is contained in string B with at most k errors

Is this already a string similarity algorithm?

what is the formula of sentiment calculation

String pre-processing step, to answer further queries in O(1) time

What is the fastest way to sort n strings of length n each?

Categories

Resources