Finding similar/related texts algorithms

I searched a lot on Stack Overflow and Google but didn't find a good answer for this.
I'm going to develop a news reader system that crawls and collects news from the web (with a crawler), and then I want to find similar or related news across websites (in order to avoid showing duplicate news on the website).
I think the best live example of this is Google News: it collects news from the web, then categorizes it and finds related news and articles. This is what I want to do.
What's the best algorithm for doing this?

A relatively simple solution is to compute a tf-idf vector (en.wikipedia.org/wiki/Tf*idf) for each document, then use the cosine distance (en.wikipedia.org/wiki/Cosine_similarity) between these vectors as an estimate for semantic distance between articles.
This will probably capture semantic relationships better than Levenshtein distance and is much faster to compute.
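A minimal sketch of the tf-idf plus cosine approach in Python, using only the standard library. The whitespace tokenizer, the smoothed idf formula, and the function names `tfidf_vectors` and `cosine_similarity` are assumptions made for illustration; a real news pipeline would likely add stop-word removal and stemming.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumption; several variants exist) keeps weights positive.
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "stocks fall as markets react to rate hike",
    "markets react as stocks fall after rate hike",
    "local team wins the championship game",
]
vecs = tfidf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))  # high: near-duplicate headlines
print(cosine_similarity(vecs[0], vecs[2]))  # 0.0: no shared terms
```

Articles whose similarity exceeds some threshold could then be grouped as duplicates; the threshold itself has to be tuned on real data.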

This is one: http://en.wikipedia.org/wiki/Levenshtein_distance
public static SqlInt32 ComputeLevenstheinDistance(SqlString firstString, SqlString secondString)
{
    int n = firstString.Value.Length;
    int m = secondString.Value.Length;
    int[,] d = new int[n + 1, m + 1];
    // Step 1
    if (n == 0)
    {
        return m;
    }
    if (m == 0)
    {
        return n;
    }
    // Step 2
    for (int i = 0; i <= n; i++)
    {
        d[i, 0] = i;
    }
    for (int j = 0; j <= m; j++)
    {
        d[0, j] = j;
    }
    // Step 3
    for (int i = 1; i <= n; i++)
    {
        // Step 4
        for (int j = 1; j <= m; j++)
        {
            // Step 5
            int cost = (secondString.Value[j - 1] == firstString.Value[i - 1]) ? 0 : 1;
            // Step 6
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
        }
    }
    // Step 7
    return d[n, m];
}
This is handy for the task at hand: http://code.google.com/p/boilerpipe/
Also, if you need to reduce the number of words to analyze, try this: http://ots.codeplex.com/
I have found the OTS VERY useful in sentiment analysis, whereby I can reduce the number of sentences into a small list of common phrases and/or words and calculate the overall sentiment based on this. The same should work for similarity.

Related

(DP) Memoization - How to know if it starts from the top or bottom?

It hasn't been long since I started studying for algorithm coding tests, and I find it difficult to see a pattern in memoization.
Here are two problems.
Min Cost Climbing Stairs
You are given an integer array cost where cost[i] is the cost of ith step on a staircase. Once you pay the cost, you can either climb one or two steps.
You can either start from the step with index 0, or the step with index 1.
Return the minimum cost to reach the top of the floor.
Min Cost Climbing Stairs
Recurrence Relation Formula:
minimumCost(i) = min(cost[i - 1] + minimumCost(i - 1), cost[i - 2] + minimumCost(i - 2))
House Robber
You are a professional robber planning to rob houses along a street. Each house has a certain amount of money stashed, the only constraint stopping you from robbing each of them is that adjacent houses have security systems connected and it will automatically contact the police if two adjacent houses were broken into on the same night.
Given an integer array nums representing the amount of money of each house, return the maximum amount of money you can rob tonight without alerting the police.
House Robber
Recurrence Relation Formula:
robFrom(i) = max(robFrom(i + 1), robFrom(i + 2) + nums[i])
So as you can see, the first recurrence is defined in terms of the previous indices, and the second in terms of the next indices.
Because of this, when I write the recursive function, the starting index is different.
Start from n
int rec(int n, vector<int>& cost)
{
if(memo[n] == -1)
{
if(n <= 1)
{
memo[n] = 0;
} else
{
memo[n] = min(rec(n-1, cost) + cost[n-1], rec(n-2, cost) + cost[n-2]);
}
}
return memo[n];
}
int minCostClimbingStairs(vector<int>& cost) {
const int n = cost.size();
memo.assign(n+1,-1);
return rec(n, cost); // Start from n
}
Start from 0
int getrob(int n, vector<int>& nums)
{
if(how_much[n] == -1)
{
if(n >= nums.size())
{
return 0;
} else {
how_much[n] = max(getrob(n + 1, nums), getrob(n + 2, nums) + nums[n]);
}
}
return how_much[n];
}
int rob(vector<int>& nums) {
how_much.assign(nums.size() + 2, -1);
return getrob(0, nums); // Start from 0
}
How can I easily tell which problems need to start from 0 and which from n? Is there a pattern?
Or should I just solve a lot of problems to build intuition?

Your question is valid, but the examples don't really illustrate it. Both problems you shared can be solved either way: 1. starting from the top, or 2. starting from the bottom.
For example, here is a Min Cost Climbing Stairs solution that starts from 0:
int[] dp;
public int minCostClimbingStairs(int[] cost) {
int n = cost.length;
dp = new int[n];
for(int i=0; i<n; i++) {
dp[i] = -1;
}
rec(0, cost);
return Math.min(dp[0], dp[1]);
}
int rec(int in, int[] cost) {
if(in >= cost.length) {
return 0;
} else {
if(dp[in] == -1) {
dp[in] = cost[in] + Math.min(rec(in+1, cost), rec(in+2, cost));
}
return dp[in];
}
}
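The same point can be sketched for House Robber in Python: the recurrence can be written over suffixes (starting the recursion at 0) or over prefixes (starting at n), and both give the same result. The function names `rob_from_start` and `rob_from_end` are made up for this illustration.

```python
from functools import lru_cache


def rob_from_start(nums):
    """rob_from(i): best loot using houses i..end; the answer is rob_from(0)."""
    @lru_cache(maxsize=None)
    def rob_from(i):
        if i >= len(nums):
            return 0
        # Either skip house i, or rob it and skip its neighbour.
        return max(rob_from(i + 1), rob_from(i + 2) + nums[i])
    return rob_from(0)


def rob_from_end(nums):
    """rob_upto(i): best loot using the first i houses; the answer is rob_upto(n)."""
    @lru_cache(maxsize=None)
    def rob_upto(i):
        if i <= 0:
            return 0
        # Either skip the i-th house, or rob it and skip the one before it.
        return max(rob_upto(i - 1), rob_upto(i - 2) + nums[i - 1])
    return rob_upto(len(nums))


print(rob_from_start([2, 7, 9, 3, 1]), rob_from_end([2, 7, 9, 3, 1]))
```

Which index the recursion starts from follows from how the subproblem is defined (suffix or prefix), not from the problem itself.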
However, there is a certain set of problems where this is not easy. Their structure is such that if you iterate in the reverse direction, the computation can get complicated or corrupt later results:
Example: reaching a target sum from numbers in an array, using each index at most once. Reaching 10 with {3, 4, 6, 5, 2}: {4, 6} is one answer, but {6, 2, 2} is not, because it uses index 4 twice.
This is easy when the inner loop runs from the top down (j from M down to a[i]):
int m[M+10];
for (i = 0; i < M+10; i++) m[i] = 0;
m[0] = 1;
for (i = 0; i < n; i++)
    for (j = M; j >= a[i]; j--)
        m[j] |= m[j-a[i]];
If you run the inner loop from the bottom up instead, you will end up using a[i] multiple times, because m[j-a[i]] may already have been updated in the current iteration. You can still do it bottom up if you figure out a way to keep the states from being overwritten, for example by using a queue that stores only the states reached in previous iterations and ignores sums reached in the current one. Or keep a count in m[j] instead of just 1, and only use states whose count is less than the current iteration count. I think the same should hold for all DP.
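A small Python sketch of why the loop direction matters here. The function name `reachable_sums` and the `descending` flag are invented for this illustration, and the values are assumed to be positive integers.

```python
def reachable_sums(values, target, descending=True):
    """Return a list where reachable[s] is True if sum s can be formed."""
    reachable = [False] * (target + 1)
    reachable[0] = True
    for v in values:
        if descending:
            # Top down: reachable[s - v] still refers to the previous items only.
            order = range(target, v - 1, -1)
        else:
            # Bottom up: reachable[s - v] may already include v, so v can repeat.
            order = range(v, target + 1)
        for s in order:
            if reachable[s - v]:
                reachable[s] = True
    return reachable


print(reachable_sums([6, 2], 10)[10])                    # each value used at most once
print(reachable_sums([6, 2], 10, descending=False)[10])  # 6 + 2 + 2 sneaks in
```

With the values {6, 2}, the descending loop reports that 10 is not reachable, while the ascending loop reports that it is, because 2 gets counted twice.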

0/1 Knapsack with Minimum Cost

The famous 0/1 knapsack problem focuses on getting the maximum cost/value in the given Weight (W).
The code for the above is this ::
n = cost_array / weight_array size
INIT :: fill 0th col and 0th row with value 0
for (int i=1; i<=n; i++) {
for (int j=1; j<=W; j++) {
if (weight[i-1] <= j) {
dp[i][j] = Math.max(dp[i-1][j], dp[i-1][j - weight[i-1]] + cost[i-1]);
}else {
dp[i][j] = dp[i-1][j];
}
}
}
Ans :: dp[n][W]
NEW Problem :: So, here we are calculating the maximum cost/value. But what if I want to find the minimum cost/value (it's still a bounded knapsack)?
I think the problem boils down to how I do the INIT step above. I think the loop will remain the same, the only difference being that Math.max becomes Math.min.
I tried the INIT step with Infinity, 0, etc., but I am not able to build the iterative solution.
How can we possibly do that?

Writing this up as an answer, as suggested by @radovix:
Convert every weight to a negative number and run the same algorithm.

Here's an efficient algorithm for your problem written in JS (minimum-cost maximal knapsack packing), with time complexity O(nC) and space complexity O(n+C), instead of the obvious DP-matrix solution with space complexity O(nC).
Given a knapsack instance (I, p, w, C), the goal of the MCMKP (Minimum-Cost Maximal Knapsack Packing) is to find a maximal knapsack packing S⊂I that minimizes the profit of the selected items.
const knapSack = (weights, prices, target, ic) => {
if (weights.length !== prices.length) {
return null;
}
let weightSum = [0],
priceSum = [0];
for (let i = 1; i < weights.length; i++) {
weightSum[i] = weights[i - 1] + weightSum[i - 1];
priceSum[i] = prices[i - 1] + priceSum[i - 1];
}
let dp = [0],
opt = Infinity;
for (let i = 1; i <= target; i++) dp[i] = Infinity;
for (let i = weights.length; i >= 1; i--) {
if (i <= ic) {
const cMax = Math.max(0, target - weightSum[i - 1]),
cMin = Math.max(0, target - weightSum[i - 1] - weights[i - 1] + 1);
let tmp = Infinity;
for (let index = cMin; index <= cMax; index++) {
tmp = Math.min(tmp, dp[index] + priceSum[i - 1]);
}
if (tmp < opt) opt = tmp;
}
for (let j = target; j >= weights[i - 1]; j--) {
dp[j] = Math.min(dp[j], dp[j - weights[i - 1]] + prices[i - 1]);
}
}
return opt;
};
knapSack([1, 1, 2, 3, 4],[3, 4, 6, 2, 1],5,4);
In the code above, we define a critical item as an item whose weight gives a tight upper bound on the smallest possible item left out of any feasible solution.
ic is the index of the critical item, i.e., the index of the first item that exceeds the capacity, assuming all items i ≤ ic are taken as well. opt is the optimal solution.
Observe that the items {1, . . . , i − 1} determine the remaining capacity of the knapsack (given as cMax) that has to be filled using items from {i + 1, . . . , n}, whereas cMin is determined by the fact that the packing has to be maximal and that item i is the smallest one taken out of the solution.

Longest Common Prefix property

I was going through suffix array and its use to compute longest common prefix of two suffixes.
The source says:
"The lcp between two suffixes is the minimum of the lcp's of all pairs of adjacent suffixes between them on the array"
i.e. lcp(x,y)=min{ lcp(x,x+1),lcp(x+1,x+2),.....,lcp(y-1,y) }
where x and y are the two indices in the string at which the two suffixes start.
I am not convinced by this statement; take the string "abca" as an example.
lcp(1,4)=1 (considering 1 based indexing)
but if I apply the above equation then
lcp(1,4)=min{lcp(1,2),lcp(2,3),lcp(3,4)}
and I think lcp(1,2)=0.
so the answer must be 0 according to the equation.
Am I getting it wrong somewhere?

I think the index referred by the source is not the index of the string itself, but index of the sorted suffixes.
a
abca
bca
ca
Hence
lcp(1,2) = lcp(a, abca) = 1
lcp(1,4) = min(lcp(1,2), lcp(2,3), lcp(3,4)) = 0
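A brute-force Python check of this reading, where the indices are ranks in the sorted list of suffixes. The helper names `common_prefix_len` and `check_lcp_property` are made up for this sketch; it compares suffixes directly and is only meant for small strings.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def check_lcp_property(text):
    """Check lcp(x, y) == min of adjacent lcps for every pair of sorted ranks."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    # adjacent[k] = lcp of the suffixes at sorted positions k and k + 1 (0-based).
    adjacent = [common_prefix_len(suffixes[k], suffixes[k + 1])
                for k in range(len(suffixes) - 1)]
    for x in range(len(suffixes)):
        for y in range(x + 1, len(suffixes)):
            if common_prefix_len(suffixes[x], suffixes[y]) != min(adjacent[x:y]):
                return False
    return True


print(sorted("abca"[i:] for i in range(4)))
print(check_lcp_property("abca"), check_lcp_property("banana"))
```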

You can't find LCP of any two suffixes by simply calculating the minimum of the lcp's of all pairs of adjacent suffixes between them on the array.
We can calculate the LCP of any two suffixes (i, j) with the help of the following:
LCP(suffix i, suffix j) = LCP[RMQ(i + 1, j)]
Also note that the formula assumes i < j; evaluating it with the arguments swapped will not necessarily give the same result.
RMQ is Range Minimum Query.
Page 3 of this paper.
Details:
Step 1:
First calculate the LCP of adjacent (consecutive) suffix pairs.
n = length of the string.
suffixArray[] is the suffix array.
void calculateadjacentsuffixes(int n)
{
for (int i=0; i<n; ++i) Rank[suffixArray[i]] = i;
Height[0] = 0;
for (int i=0, h=0; i<n; ++i)
{
if (Rank[i] > 0)
{
int j = suffixArray[Rank[i]-1];
while (i + h < n && j + h < n && str[i+h] == str[j+h])
{
h++;
}
Height[Rank[i]] = h;
if (h > 0) h--;
}
}
}
Note: Height[i] = LCP of (suffix i-1, suffix i), i.e. the Height array contains the LCP of each pair of adjacent suffixes.
Step 2:
Calculate the LCP of any two suffixes i, j using RMQ.
RMQ pre-compute function:
void preprocesses(int N)
{
int i, j;
//initialize M for the intervals with length 1
for (i = 0; i < N; i++)
M[i][0] = i;
//compute values from smaller to bigger intervals
for (j = 1; 1 << j <= N; j++)
{
for (i = 0; i + (1 << j) - 1 < N; i++)
{
if (Height[M[i][j - 1]] < Height[M[i + (1 << (j - 1))][j - 1]])
{
M[i][j] = M[i][j - 1];
}
else
{
M[i][j] = M[i + (1 << (j - 1))][j - 1];
}
}
}
}
Step 3: Calculate LCP between any two Suffixes i,j
int LCP(int i,int j)
{
/*Make sure we send i<j always */
/* By doing this ,it resolve following
suppose ,we send LCP(5,4) then it converts it to LCP(4,5)
*/
if(i>j)
swap(i,j);
/*conformation over*/
if(i==j)
{
return (Length_of_str-suffixArray[i]);
}
else
{
return Height[RMQ(i+1,j)];
//LCP(suffix i,suffix j)=LCPadj[RMQ(i + 1; j)]
//LCPadj=LCP of adjacent suffix =Height.
}
}
Where RMQ function is:
int RMQ(int i,int j)
{
int k=log((double)(j-i+1))/log((double)2);
int vv= j-(1<<k)+1 ;
if(Height[M[i][k]]<=Height[ M[vv][ k] ])
return M[i][k];
else
return M[ vv ][ k];
}
Refer to the Topcoder tutorials for RMQ.
You can check the complete implementation in C++ at my blog.
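For comparison, a compact Python sketch of the same sparse-table idea. Unlike the C++ above, this version stores minimum values rather than indices, and it derives the block size from `int.bit_length()` instead of a floating-point logarithm; both are choices made for this sketch, and the names `build_sparse_table` and `range_min` are hypothetical.

```python
def build_sparse_table(values):
    """table[j][i] holds min(values[i : i + 2**j])."""
    table = [list(values)]
    j = 1
    while (1 << j) <= len(values):
        prev = table[-1]
        half = 1 << (j - 1)
        # A block of length 2**j is the union of two blocks of length 2**(j-1).
        table.append([min(prev[i], prev[i + half])
                      for i in range(len(values) - (1 << j) + 1)])
        j += 1
    return table


def range_min(table, lo, hi):
    """Minimum of values[lo..hi] inclusive, from two overlapping power-of-two blocks."""
    k = (hi - lo + 1).bit_length() - 1
    return min(table[k][lo], table[k][hi - (1 << k) + 1])


# Height array for "abca": sorted suffixes are a, abca, bca, ca.
height = [0, 1, 0, 0]
table = build_sparse_table(height)
print(range_min(table, 1, 1))  # lcp of ranks 0 and 1
print(range_min(table, 1, 3))  # lcp of ranks 0 and 3
```

For ranks i < j, the query `range_min(table, i + 1, j)` plays the role of `Height[RMQ(i+1, j)]`.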

Recurrence equation for dynamic programming

I have a situation that is really similar to the knapsack problem but I just want to confirm that my recurrence equation is the same as the knapsack problem.
We have a maximum of M dollars to invest. We have N different investments, each of which has a cost m(i) and a profit g(i). We want to find the recurrence equation that maximizes the profit.
Here is my answer:
g(i,j) = max{g(i-1,j), g_i + g(i-1,j-m_i)} if j-m_i >= 0
g(i,j) = g(i-1,j) if j-m_i < 0
I hope my explanation is clear.
Thank you and have a nice day!
Bobby

Your recurrence equation is correct. The problem is the same as the traditional knapsack problem. You can also optimize the space complexity. Here is the C++ code.
int dp[M + 10];
int DP() {
    memset(dp, 0, sizeof(dp));
    for (int i = 0; i < N; ++i)
        for (int j = M; j >= m[i]; --j) // pay attention: j goes downward so each investment is used at most once
            dp[j] = max(dp[j], dp[j - m[i]] + g[i]);
    int ret = 0;
    for (int i = 0; i <= M; ++i) ret = max(ret, dp[i]);
    return ret;
}
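The same one-dimensional recurrence as a Python sketch. The function name `max_profit` and the sample numbers are made up for illustration; costs are assumed to be non-negative integers.

```python
def max_profit(budget, costs, profits):
    """best[j] = maximum profit achievable with at most j dollars."""
    best = [0] * (budget + 1)
    for cost, profit in zip(costs, profits):
        # Descending j so that best[j - cost] does not yet include this investment.
        for j in range(budget, cost - 1, -1):
            best[j] = max(best[j], best[j - cost] + profit)
    return best[budget]


print(max_profit(10, [5, 4, 6, 3], [10, 40, 30, 50]))
```

Because the table starts at all zeros, `best[j]` means "at most j dollars", so it never decreases as j grows and `best[budget]` is already the overall maximum.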

Word-level edit distance of a sentence

Is there an algorithm that lets you find the word-level edit distance between two sentences?
For example, "A Big Fat Dog" and "The Big House with the Fat Dog" differ by 1 substitution and 3 insertions.

In general, this is called the sequence alignment problem. It does not matter what entities you align (bits, characters, words, or DNA bases): as long as the algorithm works for one type of item, it will work for everything else. What matters is whether you want global or local alignment.
Global alignment, which attempts to align every residue in every sequence, is most useful when the sequences are similar and of roughly equal size. A general global alignment technique is the Needleman-Wunsch algorithm, which is based on dynamic programming. When people talk about Levenshtein distance they usually mean global alignment. The algorithm is so straightforward that several people discovered it independently, and sometimes you may come across the Wagner-Fischer algorithm, which is essentially the same thing but is mentioned more often in the context of edit distance between two strings of characters.
Local alignment is more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The Smith-Waterman algorithm is a general local alignment method, also based on dynamic programming. It is rarely used in natural language processing and more often in bioinformatics.

You can use the same algorithms that are used for finding edit distance in strings to find edit distances in sentences. You can think of a sentence as a string drawn from an alphabet where each character is a word in the English language (assuming that spaces are used to mark where one "character" starts and the next ends). Any standard algorithm for computing edit distance, such as the standard dynamic programming approach for computing Levenshtein distance, can be adapted to solve this problem.
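A minimal Python sketch of that adaptation, treating each whitespace-separated word as one "character". The function name `word_edit_distance` is made up here, and the comparison is case-sensitive and does not strip punctuation; both are simplifying assumptions.

```python
def word_edit_distance(sentence_a, sentence_b):
    """Levenshtein distance computed over words instead of characters."""
    a = sentence_a.split()
    b = sentence_b.split()
    # prev[j] = distance between the words of a seen so far and the first j words of b.
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[len(b)]


print(word_edit_distance("A Big Fat Dog", "The Big House with the Fat Dog"))
```

Only two rows of the table are kept at a time, since each row depends only on the one before it.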

Check out the AlignedSent class in Python's nltk package. It aligns sentences at the word level.
https://www.nltk.org/api/nltk.align.html

Here is a sample implementation of @templatetypedef's idea in ActionScript (it worked great for me), which calculates the normalized Levenshtein distance (in other words, it gives a value in the range [0..1]):
private function nlevenshtein(s1:String, s2:String):Number {
    var tokens1:Array = s1.split(" ");
    var tokens2:Array = s2.split(" ");
    const len1:uint = tokens1.length, len2:uint = tokens2.length;
    var i:int;
    var j:int;
    // d[i][j] = edit distance between the first i words of s1 and the first j words of s2
    var d:Vector.<Vector.<uint> > = new Vector.<Vector.<uint> >(len1 + 1);
    for (i = 0; i <= len1; ++i)
        d[i] = new Vector.<uint>(len2 + 1);
    d[0][0] = 0;
    for (i = 1; i <= len1; ++i) d[i][0] = i;
    for (i = 1; i <= len2; ++i) d[0][i] = i;
    for (i = 1; i <= len1; ++i)
        for (j = 1; j <= len2; ++j)
            d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                               d[i - 1][j - 1] + (tokens1[i - 1] == tokens2[j - 1] ? 0 : 1));
    // Normalize by the longer sentence so the result falls in [0..1]
    var nlevenshteinDist:Number = (d[len1][len2]) / (Math.max(len1, len2));
    return nlevenshteinDist;
}
I hope this will help!

The implementation in D is generalized over any range, and thus array. So by splitting your sentences into arrays of strings they can be run through the algorithm and an edit number will be provided.
https://dlang.org/library/std/algorithm/comparison/levenshtein_distance.html

Here is the Java implementation of edit distance algorithm for sentences using dynamic programming approach.
public class EditDistance {
public int editDistanceDP(String sentence1, String sentence2) {
String[] s1 = sentence1.split(" ");
String[] s2 = sentence2.split(" ");
int[][] solution = new int[s1.length + 1][s2.length + 1];
for (int i = 0; i <= s2.length; i++) {
solution[0][i] = i;
}
for (int i = 0; i <= s1.length; i++) {
solution[i][0] = i;
}
int m = s1.length;
int n = s2.length;
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (s1[i - 1].equals(s2[j - 1]))
solution[i][j] = solution[i - 1][j - 1];
else
solution[i][j] = 1
+ Math.min(solution[i][j - 1], Math.min(solution[i - 1][j], solution[i - 1][j - 1]));
}
}
return solution[s1.length][s2.length];
}
public static void main(String[] args) {
String sentence1 = "first second third";
String sentence2 = "second";
EditDistance ed = new EditDistance();
System.out.println("Edit Distance: " + ed.editDistanceDP(sentence1, sentence2));
}
}
