Word-level edit distance of a sentence - string

Is there an algorithm that lets you find the word-level edit distance between 2 sentences?
For example, "A Big Fat Dog" and "The Big House with the Fat Dog" differ by 1 substitution and 3 insertions.

In general, this is called the sequence alignment problem. Actually it does not matter what entities you align - bits, characters, words, or DNA bases - as long as the algorithm works for one type of items it will work for everything else. What matters is whether you want global or local alignment.
Global alignment, which attempts to align every residue in every sequence, is most useful when the sequences are similar and of roughly equal size. A general global alignment technique is the Needleman-Wunsch algorithm, which is based on dynamic programming. When people talk about Levenshtein distance they usually mean global alignment. The algorithm is so straightforward that several people discovered it independently, and sometimes you may come across the Wagner-Fischer algorithm, which is essentially the same thing but is mentioned more often in the context of edit distance between two strings of characters.
Local alignment is more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The Smith-Waterman algorithm is a general local alignment method also based on dynamic programming. It is rarely used in natural language processing and much more often in bioinformatics.
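To make the global/local distinction a bit more concrete, here is a minimal Python sketch of Smith-Waterman style local alignment scoring over word tokens. The function name local_alignment_score and the scoring values (match=2, mismatch=-1, gap=-1) are assumptions chosen for illustration, not values taken from the answer above.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score between two token sequences."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # scores never drop below zero, which lets an alignment restart anywhere
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


left = "yesterday the fat dog barked".split()
right = "we saw the fat dog".split()
print(local_alignment_score(left, right))  # 6: three matching words, 2 points each
```

Unlike a global alignment, the unmatched words around the shared region "the fat dog" do not lower the score.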

You can use the same algorithms that are used for finding edit distance in strings to find edit distances in sentences. You can think of a sentence as a string drawn from an alphabet where each character is a word in the English language (assuming that spaces are used to mark where one "character" starts and the next ends). Any standard algorithm for computing edit distance, such as the standard dynamic programming approach for computing Levenshtein distance, can be adapted to solve this problem.
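As a rough illustration of treating each word as a "character", here is a minimal Python sketch of word-level Levenshtein distance. The function name word_edit_distance and the plain whitespace tokenization are assumptions made for this example.

```python
def word_edit_distance(sentence1, sentence2):
    """Levenshtein distance where the unit of editing is a whitespace-separated word."""
    a = sentence1.split()
    b = sentence2.split()
    # previous[j] = distance between the words of a processed so far
    # and the first j words of b
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        current = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            current.append(min(previous[j] + 1,          # delete word_a
                               current[j - 1] + 1,       # insert word_b
                               previous[j - 1] + cost))  # keep or substitute
        previous = current
    return previous[-1]


print(word_edit_distance("A Big Fat Dog", "The Big House with the Fat Dog"))  # 4
```

The comparison is case-sensitive and keeps punctuation attached to words, so lowercasing or stripping punctuation first may be appropriate depending on the data.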

Check out the AlignedSent class in the Python nltk package. It represents a pair of sentences aligned at the word level.
https://www.nltk.org/api/nltk.align.html

Here is a sample implementation of @templatetypedef's idea in ActionScript (it worked great for me), which calculates the normalized Levenshtein distance (in other words, it gives a value in the range [0, 1]):
private function nlevenshtein(s1:String, s2:String):Number {
    var tokens1:Array = s1.split(" ");
    var tokens2:Array = s2.split(" ");
    const len1:uint = tokens1.length, len2:uint = tokens2.length;
    var i:int;
    var j:int;
    // d[i][j] = edit distance between the first i words of s1 and the first j words of s2
    var d:Vector.<Vector.<uint> > = new Vector.<Vector.<uint> >(len1 + 1);
    for (i = 0; i <= len1; ++i)
        d[i] = new Vector.<uint>(len2 + 1);
    d[0][0] = 0;
    for (i = 1; i <= len1; ++i) d[i][0] = i;
    for (j = 1; j <= len2; ++j) d[0][j] = j;
    for (i = 1; i <= len1; ++i)
        for (j = 1; j <= len2; ++j)
            d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                               d[i - 1][j - 1] + (tokens1[i - 1] == tokens2[j - 1] ? 0 : 1));
    // normalize by the length of the longer sentence so the result lies in [0, 1]
    var nlevenshteinDist:Number = d[len1][len2] / Math.max(len1, len2);
    return nlevenshteinDist;
}
I hope this will help!

The implementation in D is generalized over any range, and thus array. So by splitting your sentences into arrays of strings they can be run through the algorithm and an edit number will be provided.
https://dlang.org/library/std/algorithm/comparison/levenshtein_distance.html

Here is a Java implementation of the edit distance algorithm for sentences, using the dynamic programming approach.
public class EditDistance {
    public int editDistanceDP(String sentence1, String sentence2) {
        String[] s1 = sentence1.split(" ");
        String[] s2 = sentence2.split(" ");
        int[][] solution = new int[s1.length + 1][s2.length + 1];
        for (int i = 0; i <= s2.length; i++) {
            solution[0][i] = i;
        }
        for (int i = 0; i <= s1.length; i++) {
            solution[i][0] = i;
        }
        int m = s1.length;
        int n = s2.length;
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                if (s1[i - 1].equals(s2[j - 1]))
                    solution[i][j] = solution[i - 1][j - 1];
                else
                    solution[i][j] = 1
                            + Math.min(solution[i][j - 1], Math.min(solution[i - 1][j], solution[i - 1][j - 1]));
            }
        }
        return solution[s1.length][s2.length];
    }

    public static void main(String[] args) {
        String sentence1 = "first second third";
        String sentence2 = "second";
        EditDistance ed = new EditDistance();
        System.out.println("Edit Distance: " + ed.editDistanceDP(sentence1, sentence2));
    }
}

Related

interview riddle (string manipulation) - explanation needed

I am studying for an interview and encountered a question with a solution.
I am having a problem with one line in the solution and was hoping someone here could explain it.
The question:
Write a method to replace all spaces in a string with '%20'.
The solution:
public static void ReplaceFun(char[] str, int length) {
    int spaceCount = 0, newLength, i = 0;
    for (i = 0; i < length; i++) {
        if (str[i] == ' ') {
            spaceCount++;
        }
    }
    newLength = length + spaceCount * 2;
    str[newLength] = '\0';
    for (i = length - 1; i >= 0; i--) {
        if (str[i] == ' ') {
            str[newLength - 1] = '0';
            str[newLength - 2] = '2';
            str[newLength - 3] = '%';
            newLength = newLength - 3;
        } else {
            str[newLength - 1] = str[i];
            newLength = newLength - 1;
        }
    }
}
My problem is with line number 9. How can he just set str[newLength] to '\0'? In other words, how can he take over the needed amount of memory without allocating it first?
Isn't he writing past the end of the allocated memory?!
Assuming this is actually meant to be in C (public static is not valid C or C++), they can't, as it's written. They never allocate a new str that is long enough to hold the old string plus the %20 expansion.
I suspect there's an additional part to the question, which is that str is already long enough to hold the expanded %20 data, and that length is the length of the string in str, not counting the zero terminator.
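Under that assumption (the buffer already has spare room at the end), the backward-filling idea can be sketched in Python as follows. The function name replace_spaces, the list-of-characters buffer and the explicit capacity check are hypothetical details added for the example; the capacity check is exactly what the original solution leaves out.

```python
def replace_spaces(buf, length):
    """Replace spaces in buf[:length] with '%20' in place and return the new length.

    buf is a list of single characters that must already have spare capacity.
    """
    space_count = buf[:length].count(" ")
    new_length = length + 2 * space_count
    if new_length > len(buf):
        raise ValueError("buffer too small for the expanded string")
    write = new_length
    # walk backwards so that no unread character is overwritten
    for read in range(length - 1, -1, -1):
        if buf[read] == " ":
            buf[write - 3:write] = ["%", "2", "0"]
            write -= 3
        else:
            buf[write - 1] = buf[read]
            write -= 1
    return new_length


buffer = list("a b c") + [""] * 4  # 5 used characters plus 4 spare slots
end = replace_spaces(buffer, 5)
print("".join(buffer[:end]))  # a%20b%20c
```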
The code compiles, but it's not good code. You are completely correct in your assessment that it writes past the bounds of the initial str[] unless the caller allocated extra room. This could cause some rather unwanted side effects, depending on what gets overwritten.

Asymmetric Levenshtein distance

Given two bit strings, x and y, with x longer than y, I'd like to compute a kind of asymmetric variant of the Levenshtein distance between them. Starting with x, I'd like to know the minimum number of deletions and substitutions it takes to turn x into y.
Can I just use the usual Levenshtein distance for this, or do I need to modify the algorithm somehow? In other words, with the usual set of edits of deletion, substitution, and addition, is it ever beneficial to delete more than the difference in lengths between the two strings and then add some bits back? I suspect the answer is no, but I'm not sure. If I'm wrong, and I do need to modify the definition of Levenshtein distance to disallow additions, how do I do so?
Finally, I would expect intuitively that I'd get the same distance if I started with y (the shorter string) and only allowed additions and substitutions. Is this right? I've got a sense for what these answers are, I just can't prove them.
If I understand you correctly, I think the answer is yes: the Levenshtein edit distance could be different from the result of an algorithm that only allows deletions and substitutions on the longer string. Because of this, you would need to modify the algorithm, or create a different one, to get your limited version.
Consider the two strings "ABCD" and "ACDEF". The Levenshtein distance is 3 (ABCD->ACD->ACDE->ACDEF). If we start with the longer string and limit ourselves to deletions and substitutions, we must use 4 edits (1 deletion and 3 substitutions). The reason is that edit paths which delete from the shorter string to reach the longer string efficiently cannot be reproduced when starting from the longer string, because the complementary insertion operation is not available (since you're disallowing it).
Your last paragraph is true. If the path from shorter to longer uses only insertions and substitutions, then any allowed path can simply be reversed from the longer to the shorter. Substitutions are the same regardless of direction, but the inserts when going from small to large become deletions when reversed.
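The "ABCD" / "ACDEF" example can be checked with a small Python sketch. This is a different formulation from the C# code in this answer: it simply drops the insertion case from the standard recurrence. The function names are assumptions made for the example.

```python
def levenshtein(s, t):
    """Standard edit distance: insertions, deletions and substitutions."""
    previous = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        current = [i]
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def delete_substitute_distance(x, y):
    """Minimum deletions plus substitutions turning x into y, or None if impossible."""
    inf = float("inf")
    # with no characters of x consumed, only the empty prefix of y is reachable
    previous = [0] + [inf] * len(y)
    for i in range(1, len(x) + 1):
        current = [i]  # delete all i characters to reach the empty string
        for j in range(1, len(y) + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            current.append(min(previous[j] + 1,          # delete x[i-1]
                               previous[j - 1] + cost))  # keep or substitute
        previous = current
    return None if previous[-1] == inf else previous[-1]


print(levenshtein("ACDEF", "ABCD"))                 # 3
print(delete_substitute_distance("ACDEF", "ABCD"))  # 4
```

The function returns None when x is shorter than y, since no sequence of deletions and substitutions can lengthen a string.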
I haven't tested this thoroughly, but this modification shows the direction I would take, and appears to work with the values I've tested with it. It's written in C#, and follows the pseudocode in the Wikipedia entry for Levenshtein distance. There are obvious optimizations that can be made, but I refrained from making them so that the changes from the standard algorithm are easier to see. An important observation is that (using your constraints) if the strings are the same length, then substitution is the only operation allowed.
static int LevenshteinDistance(string s, string t) {
    int i, j;
    int m = s.Length;
    int n = t.Length;
    // for all i and j, d[i,j] will hold the Levenshtein distance between
    // the first i characters of s and the first j characters of t;
    // note that d has (m+1)*(n+1) values
    var d = new int[m + 1, n + 1];
    // set each element to zero
    // c# creates array already initialized to zero
    // source prefixes can be transformed into empty string by
    // dropping all characters
    for (i = 0; i <= m; i++) d[i, 0] = i;
    // target prefixes can be reached from empty source prefix
    // by inserting every character
    for (j = 0; j <= n; j++) d[0, j] = j;
    for (j = 1; j <= n; j++) {
        for (i = 1; i <= m; i++) {
            if (s[i - 1] == t[j - 1])
                d[i, j] = d[i - 1, j - 1]; // no operation required
            else {
                int del = d[i - 1, j] + 1; // a deletion
                int ins = d[i, j - 1] + 1; // an insertion
                int sub = d[i - 1, j - 1] + 1; // a substitution
                // the next two lines are the modification I've made
                //int insDel = (i < j) ? ins : del;
                //d[i, j] = (i == j) ? sub : Math.Min(insDel, sub);
                // the following 8 lines are a clearer version of the above 2 lines
                if (i == j) {
                    d[i, j] = sub;
                } else {
                    int insDel;
                    if (i < j) insDel = ins; else insDel = del;
                    // assign the smaller of insDel or sub
                    d[i, j] = Math.Min(insDel, sub);
                }
            }
        }
    }
    return d[m, n];
}

searching for dynamic programming solution

Problem :
There is a stack consisting of N bricks. You and your friend decide to play a game using this stack. In this game, the players alternately remove 1, 2 or 3 bricks from the top, and the numbers on the bricks a player removes are added to that player's score. You have to play in such a way that you obtain the maximum possible score, given that your friend will also play optimally and that you make the first move.
Input Format
The first line will contain an integer T, i.e. the number of test cases. There will be two lines for each test case: the first line will contain a number N, i.e. the number of elements in the stack, and the next line will contain N numbers, i.e. the numbers written on the bricks from top to bottom.
Output Format
For each test case, print a single line containing your maximum score.
I tried recursion, but it didn't work:
int recurse(int length, int sequence[5], int i) {
    if(length - i < 3) {
        int sum = 0;
        for(i; i < length; i++) sum += sequence[i];
        return sum;
    } else {
        int sum1 = 0;
        int sum2 = 0;
        int sum3 = 0;
        sum1 += recurse(length, sequence, i+1);
        sum2 += recurse(length, sequence, i+2);
        sum3 += recurse(length, sequence, i+3);
        return max(max(sum1,sum2),sum3);
    }
}
int main() {
    int sequence[] = {0, 0, 9, 1, 999};
    int length = 5;
    cout << recurse(length, sequence, 0);
    return 0;
}
My approach to solving this problem was as follows:
Both players play optimally.
So, the solution is to be built in a manner that need not take the player into account. This is because both players are going to pick the best choice available to them for any given state of the stack of bricks.
The base cases:
Either player, when left with the last one/two/three bricks, will choose to remove all bricks.
For the sake of convenience, let's assume that the array is actually in reverse order (i.e. a[0] is the value of the bottom-most brick in the stack) (This can easily be incorporated by performing a reverse operation on the array.)
So, the base cases are:
# Base Cases
dp[0] = a[0]
dp[1] = a[0]+a[1]
dp[2] = a[0]+a[1]+a[2]
Building the final solution:
Now, in each iteration, a player has 3 choices.
pick brick (i), or,
pick brick (i and i-1) , or,
pick brick (i,i-1 and i-2)
If the player opted for choice 1, the following would result:
player secures a[i] points from the brick (i) (+a[i])
will not be able to procure the points on the bricks removed by the opponent. This value is stored in dp[i-1] (which the opponent will end up scoring by virtue of this choice made by the player).
will surely procure the points on the bricks not removed by the opponent. (+ Sum of all the bricks up until brick (i-1) not removed by opponent )
A prefix array to store the partial sums of points of bricks can be computed as follows:
# build prefix sum array
pre = [a[0]]
for i in range(1,n):
    pre.append(pre[-1]+a[i])
And, now, if player opted for choice 1, the score would be:
ans1 = a[i] + (pre[i-1] - dp[i-1])
Similarly, for choices 2 and 3. So, we get:
ans1 = a[i]+ (pre[i-1] - dp[i-1]) # if we pick only ith brick
ans2 = a[i]+a[i-1]+(pre[i-2] - dp[i-2]) # pick 2 bricks
ans3 = a[i]+a[i-1]+a[i-2]+(pre[i-3] - dp[i-3]) # pick 3 bricks
Now, each player wants to maximize this value. So, in each iteration, we pick the maximum among ans1, ans2 and ans3.
dp[i] = max(ans1, ans2, ans3)
Now, all we have to do is to iterate from 3 through to n-1 to get the required solution.
Here is the final snippet in python:
a = list(map(int, input().split()))  # one line of brick values, top to bottom
a.reverse()  # so that a[0] is the bottom brick of the stack
n = len(a)
# build prefix sum array
pre = [a[0]]
for i in range(1, n):
    pre.append(pre[-1] + a[i])
dp = [0] * n
# base cases: with at most three bricks left, the player takes them all
for i in range(min(3, n)):
    dp[i] = pre[i]
for i in range(3, n):
    # We can pick brick i, (i,i-1) or (i,i-1,i-2)
    ans1 = a[i] + (pre[i-1] - dp[i-1])  # if we pick only ith brick
    ans2 = a[i] + a[i-1] + (pre[i-2] - dp[i-2])  # pick 2
    ans3 = a[i] + a[i-1] + a[i-2] + (pre[i-3] - dp[i-3])  # pick 3
    # both players maximise this value. Doesn't matter who is playing
    dp[i] = max(ans1, ans2, ans3)
print(dp[n-1])
At first sight your code seems wrong for a couple of reasons:
The player is not taken into account. You taking a brick and your friend taking a brick are not the same thing (you have to maximize your own score; the total over both players is of course always the sum of all the bricks).
It looks like plain recursion with no memoization, and that approach will blow up to exponential running time (you're using the "brute force" approach, enumerating all possible games).
A dynamic programming approach is clearly possible because the best possible continuation of a game doesn't depend on how you reached a certain state. For the state of the game you need:
Who's next to play (you or your friend)
How many bricks are left on the stack
With these two inputs you can compute how much you can collect from that point to the end of the game. There are two cases:
1. It's your turn
You try collecting 1, 2 or 3 bricks and recurse on the next game state, where the opponent has to choose. Of the three results you keep the highest.
2. It's the opponent's turn
You simulate the opponent collecting 1, 2 or 3 bricks and recurse on the next game state, where you have to choose. Of the three results you keep the lowest (because the opponent is trying to maximize his/her result, not yours).
At the very beginning of the function you check whether the same game state has been processed before, and when returning from a computation you store the result. Thanks to this lookup/memoization the search time will not be exponential, but linear in the number of distinct game states (just 2*N where N is the number of bricks).
In Python:
memory = {}
bricks = [0, 0, 9, 1, 999]

def maxResult(my_turn, index):
    # my_turn: True if it is my move; index: position of the current top brick
    key = (my_turn, index)
    if key in memory:
        return memory[key]
    if index == len(bricks):
        result = 0
    elif my_turn:
        # my move: keep the highest total I can still collect
        result = None
        s = 0
        for i in range(index, min(index+3, len(bricks))):
            s += bricks[i]
            x = s + maxResult(False, i+1)
            if result is None or x > result:
                result = x
    else:
        # opponent's move: keep the lowest total left for me
        result = None
        for i in range(index, min(index+3, len(bricks))):
            x = maxResult(True, i+1)
            if result is None or x < result:
                result = x
    memory[key] = result
    return result

print(maxResult(True, 0))
import java.util.*;

public class Solution {
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        int noTest = sc.nextInt();
        for (int i = 0; i < noTest; i++) {
            int noBrick = sc.nextInt();
            ArrayList<Integer> arr = new ArrayList<Integer>();
            for (int j = 0; j < noBrick; j++) {
                arr.add(sc.nextInt());
            }
            // sum[j] = total of the bricks from position j to the bottom
            long[] sum = new long[noBrick];
            sum[noBrick - 1] = arr.get(noBrick - 1);
            for (int j = noBrick - 2; j >= 0; j--) {
                sum[j] = sum[j + 1] + arr.get(j);
            }
            // max[j] = best score for the player to move when bricks j..bottom remain
            long[] max = new long[noBrick];
            if (noBrick >= 1)
                max[noBrick - 1] = arr.get(noBrick - 1);
            // no (int) casts here: they would truncate large long values
            if (noBrick >= 2)
                max[noBrick - 2] = Math.max(arr.get(noBrick - 2), max[noBrick - 1] + arr.get(noBrick - 2));
            if (noBrick >= 3)
                max[noBrick - 3] = Math.max(arr.get(noBrick - 3), max[noBrick - 2] + arr.get(noBrick - 3));
            if (noBrick >= 4) {
                for (int j = noBrick - 4; j >= 0; j--) {
                    long opt1 = arr.get(j) + sum[j + 1] - max[j + 1];
                    long opt2 = arr.get(j) + arr.get(j + 1) + sum[j + 2] - max[j + 2];
                    long opt3 = arr.get(j) + arr.get(j + 1) + arr.get(j + 2) + sum[j + 3] - max[j + 3];
                    max[j] = Math.max(opt1, Math.max(opt2, opt3));
                }
            }
            long cost = max[0];
            System.out.println(cost);
        }
    }
}
I tried this using Java, seems to work alright.
Here is a better solution that I found on the internet, without recursion.
#include <iostream>
#include <fstream>
#include <algorithm>
#define MAXINDEX 10001
using namespace std;

long long maxResult(int a[MAXINDEX], int LENGTH){
    long long prefixSum [MAXINDEX] = {0};
    prefixSum[0] = a[0];
    for(int i = 1; i < LENGTH; i++){
        prefixSum[i] += prefixSum[i-1] + a[i];
    }
    long long dp[MAXINDEX] = {0};
    dp[0] = a[0];
    dp[1] = dp[0] + a[1];
    dp[2] = dp[1] + a[2];
    for(int k = 3; k < LENGTH; k++){
        long long x = prefixSum[k-1] + a[k] - dp[k-1];
        long long y = prefixSum[k-2] + a[k] + a[k-1] - dp[k-2];
        long long z = prefixSum[k-3] + a[k] + a[k-1] + a[k-2] - dp[k-3];
        dp[k] = max(x,max(y,z));
    }
    return dp[LENGTH-1];
}

int main(){
    int cases;
    int bricks[MAXINDEX];
    ifstream fin("test.in");
    fin >> cases;
    for (int i = 0; i < cases; i++){
        long n;
        fin >> n;
        for(int j = 0; j < n; j++) fin >> bricks[j];
        reverse(bricks, bricks+n);
        cout << maxResult(bricks, n)<< endl;
    }
    return 0;
}

O(n) algorithm for constructing suffix table in boyer-moore string matching algorithm

I want to implement the Boyer-Moore algorithm, but I'm stuck on constructing the good suffix table, which I think should be possible in O(n). I only found the O(n^2) algorithm.
So do you guys have a clue for me?
Please don't give me code snippets, I can google it if I want, but I prefer to solve it in my way, I just need a clue.
There is a fast algorithm that uses the prefix function.
The prefix function of a string s is an array p, where p[i] is the length of the longest proper prefix of s[0..i] (0-indexed) that is also a suffix of s[0..i].
It can be calculated in O(n) with the KMP approach, which relies on two facts:
p[i+1] <= p[i] + 1.
For each i, if s[p[i]] == s[i+1], then p[i+1] = p[i] + 1. Otherwise, we should try a shorter candidate length j for which s[0..j-1] == s[i-j+1..i]. Since we want the longest such candidate, we just jump from the current length j to j = p[j-1] and compare again.
The algorithm (C++):
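#include <algorithm>
#include <string>
#include <vector>
using namespace std;

// prefix function of s; assumes s is non-empty
vector<int> prefix (string s)
{
    int n=s.length();
    vector<int> pi(n);
    pi[0]=0;
    for (int i=1; i<n; ++i)
    {
        // j is the length of the border we are trying to extend
        int j = pi[i-1];
        while (j>0 && s[i]!=s[j])
            j=pi[j-1];
        if (s[i]==s[j]) ++j;
        pi[i]=j;
    }
    return pi;
}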
Now we can construct the suffix table:
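int m = pattern.length();
vector<int> suffshift(m + 1);
vector<int> pi = prefix(pattern);
string reversed(pattern.rbegin(), pattern.rend());
vector<int> pi1 = prefix(reversed);
// default shift: align the longest border of the whole pattern
for (int j=0; j<=m; ++j)
{
    suffshift[j] = m - pi[m-1];
}
// the suffix of length pi1[i] occurs again, shifted left by i+1-pi1[i]
for (int i=0; i<m; ++i)
{
    int j = m - pi1[i];
    suffshift[j]=min(suffshift[j], i+1-pi1[i]);
}
suffshift[m] corresponds to the empty suffix, and suffshift[0] to the whole pattern. The complexity is O(n).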
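For anyone who wants to sanity-check such a table, here is a small Python sketch that builds it from the prefix function and compares it with a direct brute-force definition. It assumes 0-based indexing, a non-empty pattern, and the plain (weak) good suffix rule; the names good_suffix_shifts and brute_force_shifts are made up for this example.

```python
def prefix_function(s):
    """pi[i] = length of the longest proper prefix of s[:i+1] that is also its suffix."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        j = pi[i - 1]
        while j > 0 and s[i] != s[j]:
            j = pi[j - 1]
        if s[i] == s[j]:
            j += 1
        pi[i] = j
    return pi


def good_suffix_shifts(pattern):
    """shift[j] = shift to apply once pattern[j:] has matched (non-empty pattern)."""
    m = len(pattern)
    pi = prefix_function(pattern)
    pi_rev = prefix_function(pattern[::-1])
    # default: shift by the smallest period of the pattern
    shift = [m - pi[m - 1]] * (m + 1)
    for i in range(m):
        j = m - pi_rev[i]
        shift[j] = min(shift[j], i + 1 - pi_rev[i])
    return shift


def brute_force_shifts(pattern):
    """Reference: smallest shift d that stays consistent with the matched suffix."""
    m = len(pattern)
    return [next(d for d in range(1, m + 1)
                 if all(pattern[p - d] == pattern[p] for p in range(max(j, d), m)))
            for j in range(m + 1)]


print(good_suffix_shifts("abab"))  # [2, 2, 2, 2, 1]
print(good_suffix_shifts("abab") == brute_force_shifts("abab"))  # True
```

The brute-force version is only meant as a slow reference for testing; the prefix-function version does the same job in linear time.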

Finding similar/related texts algorithms

I searched a lot on Stack Overflow and Google, but I didn't find a good answer for this.
Actually, I'm going to develop a news reader system that crawls and collects news from the web (with a crawler), and then I want to find similar or related news across websites (in order to avoid showing duplicated news on the website).
I think the best live example of this is Google News: it collects news from the web, then categorizes it and finds related news and articles. This is what I want to do.
What's the best algorithm for doing this?
A relatively simple solution is to compute a tf-idf vector (en.wikipedia.org/wiki/Tf*idf) for each document, then use the cosine distance (en.wikipedia.org/wiki/Cosine_similarity) between these vectors as an estimate for semantic distance between articles.
This will probably capture semantic relationships better than Levenshtein distance and is much faster to compute.
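A minimal Python sketch of that idea, using only the standard library, might look like the following. The whitespace tokenization, the smoothed idf formula (one of several common variants) and the sample headlines are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse tf-idf vector (dict mapping term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    document_frequency = Counter()
    for tokens in tokenized:
        document_frequency.update(set(tokens))
    n = len(tokenized)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # smoothed idf: terms that appear in fewer documents get larger weights
            term: count * (math.log((1 + n) / (1 + document_frequency[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


articles = [
    "stocks fall as markets react to rate hike",
    "markets react as stocks fall after rate hike",
    "local team wins the championship game",
]
vectors = tfidf_vectors(articles)
print(cosine_similarity(vectors[0], vectors[1]))  # high: near-duplicate headlines
print(cosine_similarity(vectors[0], vectors[2]))  # 0.0: no shared words
```

For duplicate detection, pairs whose similarity exceeds some threshold would be treated as the same story; a suitable threshold has to be tuned on real data.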
This is one: http://en.wikipedia.org/wiki/Levenshtein_distance
public static SqlInt32 ComputeLevenstheinDistance(SqlString firstString, SqlString secondString)
{
    int n = firstString.Value.Length;
    int m = secondString.Value.Length;
    int[,] d = new int[n + 1,m + 1];
    // Step 1
    if (n == 0)
    {
        return m;
    }
    if (m == 0)
    {
        return n;
    }
    // Step 2
    for (int i = 0; i <= n; d[i, 0] = i++)
    {
    }
    for (int j = 0; j <= m; d[0, j] = j++)
    {
    }
    // Step 3
    for (int i = 1; i <= n; i++)
    {
        //Step 4
        for (int j = 1; j <= m; j++)
        {
            // Step 5
            int cost = (secondString.Value[j - 1] == firstString.Value[i - 1]) ? 0 : 1;
            // Step 6
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
        }
    }
    // Step 7
    return d[n, m];
}
This is handy for the task at hand: http://code.google.com/p/boilerpipe/
Also, if you need to reduce the number of words to analyze, try this: http://ots.codeplex.com/
I have found the OTS VERY useful in sentiment analysis, whereby I can reduce the number of sentences into a small list of common phrases and/or words and calculate the overall sentiment based on this. The same should work for similarity.

Resources