Cormen's algorithms book gives an algorithm for finding the longest common subsequence. When filling the table, it starts from the lower right corner instead of the upper left corner. In short, it looks for the longest common subsequence of two sequences starting from the last elements of both sequences instead of the first elements. Is there a brilliant reason behind this (more efficient, etc.)?
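For comparison, here is a minimal sketch (not taken from the book; the function names are made up for illustration) that fills the table in both directions. One version works on prefixes and fills from the upper left, the other works on suffixes and fills from the lower right. Both visit every cell of an (m+1) x (n+1) table exactly once, so under these assumptions neither order is cheaper than the other.

```python
def lcs_length_prefix(x, y):
    """Fill from the upper left: c[i][j] is the LCS length of x[:i] and y[:j]."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[m][n]


def lcs_length_suffix(x, y):
    """Fill from the lower right: c[i][j] is the LCS length of x[i:] and y[j:]."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if x[i] == y[j]:
                c[i][j] = c[i + 1][j + 1] + 1
            else:
                c[i][j] = max(c[i + 1][j], c[i][j + 1])
    return c[0][0]
```

The two functions return the same length for any pair of inputs; only the direction in which the answer is later read back out of the table differs.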
Using word embeddings, I am calculating the similarity distance between two paragraphs, where the distance between two paragraphs is the sum of Euclidean distances between the vectors of two words, one from each paragraph.
The larger this sum, the less similar the two documents are.
How can I assign preferences/weights to certain words while calculating this similarity distance?
It sounds like you've improvised your own paragraph-to-paragraph distance measure based on doing (lots of?) word-to-word distances.
Are you picking the words for each word-to-word comparison randomly, and doing it a lot to find the overall difference?
One naive measure that works better-than-nothing is to average all words in a paragraph to get a single vector for the paragraph. You could conceivably overweight words there quite easily by assigning each word a weight, default 1.0 (for normal average), but larger to overweight words.
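A minimal sketch of that weighted average, assuming the word vectors are available as a plain dict from word to list of floats. The function name, the `weights` dict and the toy two-dimensional vectors are all made up for illustration; real embeddings would come from whatever model you already use.

```python
import math


def weighted_paragraph_vector(words, vectors, weights=None, default_weight=1.0):
    """Weighted average of the vectors of `words`; unknown words are skipped."""
    weights = weights or {}
    total = None
    weight_sum = 0.0
    for word in words:
        if word not in vectors:
            continue
        w = weights.get(word, default_weight)
        vec = vectors[word]
        if total is None:
            total = [0.0] * len(vec)
        for k, value in enumerate(vec):
            total[k] += w * value
        weight_sum += w
    if total is None or weight_sum == 0:
        raise ValueError("no known words with non-zero weight")
    return [t / weight_sum for t in total]


# Toy vectors, purely illustrative.
vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
plain = weighted_paragraph_vector(["cat", "dog"], vectors)
boosted = weighted_paragraph_vector(["cat", "dog"], vectors, weights={"cat": 3.0})
distance = math.dist(plain, boosted)  # Euclidean distance between the two averages
```

With every weight left at 1.0 this reduces to the ordinary average, so the weighting can be introduced gradually.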
Another more sophisticated comparison based on word-vectors is "Word Mover's Distance" - it essentially considers each word to be a "pile of meaning", and then finds the minimal pairwise "moves" to transform one paragraph (as a bag-of-words) to another. (It's available in Python gensim as wmdistance(), and other libraries.) It's quite a bit more expensive to calculate, though, especially as a function of text word count.
I'm working on a variant of the Blokus game where you are a single player and you want to cover all corners of the given board using A* search. I'm trying to figure out what kind of heuristics would be good for this, so that, for example, on an 8x8 board it would finish fast without expanding too many nodes.
I want an admissible heuristic. So far I've ruled out:
Manhattan and Euclidean distances, because in Blokus you need to place pieces diagonally adjacent to other pieces, which doesn't fit Manhattan distance.
Information about the game:
It's a board game played on an n x n table, and you are given pieces with Tetris-like sizes and shapes that you can place on the table.
The rules are: each piece is usable only once, and you start from coordinate (0,0). You can only place a piece diagonally adjacent to another piece. Two pieces cannot be adjacent to each other along an edge, only diagonally.
The task is to finish the game with the lowest score possible (the score is the number of tiles your placed pieces are composed of); you want to leave the board as vacant as possible.
Given a range of numbers, say from [80,240], it is easy to determine how much of that range lies within [100,105]: (105-100)/(240-80) = 5/160 = .03125. Easy.
So now, how much of a Merriam-Webster dictionary lies between umbrella and velvet? Even if we assume uniform distribution of text across the corpus, is there a standard metric for text?
I don't think there is a standard for that. If you had all entries from Merriam-Webster in a sorted array, you could use the first and last positions as the bounds, so you would have a set going from 1 to n. Then you could take the positions of "umbrella" and "velvet", call them x and y, and calculate your range as (y - x + 1) / n.
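A minimal sketch of that position-based calculation, assuming the entries are held in a sorted Python list. The function name and the tiny word list are made up for illustration and stand in for the real dictionary.

```python
from bisect import bisect_left, bisect_right


def fraction_between(sorted_words, low, high):
    """Fraction of entries w with low <= w <= high, i.e. (y - x + 1) / n."""
    x = bisect_left(sorted_words, low)    # index of the first entry >= low
    y = bisect_right(sorted_words, high)  # one past the last entry <= high
    return (y - x) / len(sorted_words)


# Tiny stand-in for the dictionary; it must already be sorted.
entries = ["apple", "tiger", "umbrella", "uncle", "valve", "velvet", "zebra"]
share = fraction_between(entries, "umbrella", "velvet")  # 4 of 7 entries
```

Using binary search means the bounds do not have to be actual entries: any two strings pick out the entries that sort between them.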
That works if you are seeing words as elements of an ordered set, so as to have them behave as real numbers. You are basically dividing the distance between two numbers in a set by the distance between the boundaries of the set. Some forms of algebra deal with them differently - when calculating the Levenshtein distance between any two given words, for example, each word is seen as a vector with as many dimensions as it has characters.
You could define the boundaries of your n-dimensional space by using the longest word in Merriam-Webster (hint: it's "pneumonoultramicroscopicsilicovolcanoconiosis", so your space would have 45 dimensions). However, when considering any A-B pair of words, a third word C of intermediate length may or may not be between those, depending on the operations involved in the transformation from A to B.
You'd have to check every word with a length between that of A and B to see whether it is part of the range between A and B... So it's not a matter of simple calculus, and I don't know whether this would even be feasible with a regular computer nowadays. And that's just considering Merriam-Webster's close to half a million entries.
This paper contains confusion matrices for spelling errors in a noisy channel. It describes how to correct the errors based on conditional properties.
The conditional probability computation is on page 2, left column. In footnote 4, page 2, left column, the authors say: "The chars matrices can be easily replicated, and are therefore omitted from the appendix." I cannot figure out how they can be replicated!
How do I replicate them? Do I need the original corpus? Or did the authors mean they could be recomputed from the material in the paper itself?
Looking at the paper, you just need to calculate them using a corpus, either the same one or one relevant to your application.
In replicating the matrices, note that they implicitly define two different chars matrices: a vector and an n-by-n matrix. For each character x, the vector chars contains a count of the number of times the character x occurred in the corpus. For each character sequence xy, the matrix chars contains a count of the number of times that sequence occurred in the corpus.
chars[x] represents a look-up of x in the vector; chars[x,y] represents a look-up of the sequence xy in the matrix. Note that chars[x] = the sum over chars[x,y] for each value of y.
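A minimal sketch of how the two tables could be counted from raw text, assuming the corpus fits in memory as a single string. The function name is made up for illustration and the sample text is a toy, not the AP corpus. Note that the sum identity above holds for every character except the very last one in the corpus, which has no successor.

```python
from collections import Counter


def char_counts(corpus):
    """Return (chars[x], chars[x, y]) as two Counters built from a text."""
    unigrams = Counter(corpus)                  # chars[x]
    bigrams = Counter(zip(corpus, corpus[1:]))  # chars[x, y], keyed by (x, y)
    return unigrams, bigrams


unigrams, bigrams = char_counts("banana")
# Row sum for 'n': total count of all sequences starting with 'n'.
row_sum_n = sum(count for (x, _), count in bigrams.items() if x == "n")
```

For a real corpus you would probably also decide how to treat case, whitespace and word boundaries before counting; the sketch counts every character as-is.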
Note that their counts are all based on the 1988 AP Newswire corpus (available from the LDC). If you can't use their exact corpus, I don't think it would be unreasonable to use another text from the same genre (i.e. another newswire corpus) and scale your counts such that they fit the original data. That is, the frequency of a given character shouldn't vary too much from one text to another if they're similar enough, so if you've got a corpus of 22 million words of newswire, you could count characters in that text and then double them to approximate their original counts.
I heard about clustering to group similar data. I want to know how it works in the specific case of strings.
I have a table with more than 100,000 different words.
I want to identify the same word with some differences (e.g. house, house!!, hooouse, HoUse, #house, "house", etc.).
What is needed to identify the similarity and group each word in a cluster? Which algorithm is recommended for this?
To understand what clustering is, imagine a geographical map. You can see many distinct objects (such as houses). Some of them are close to each other, and others are far apart. Based on this, you can split all objects into groups (such as cities). Clustering algorithms do exactly this: they allow you to split your data into groups without specifying the group borders in advance.
All clustering algorithms are based on the distance (or likelihood) between two objects. On a geographical map it is the normal distance between two houses; in multidimensional space it may be Euclidean distance (in fact, the distance between two houses on the map is also a Euclidean distance). For string comparison you have to use something different. Two good choices here are Hamming and Levenshtein distance. In your particular case Levenshtein distance is preferable (Hamming distance works only with strings of the same size).
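To make the difference concrete, here is a minimal pure-Python sketch of both distances. The function names are made up for illustration; in practice a library implementation would likely be faster for 100,000 words.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def levenshtein(a, b):
    """Minimal number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]
```

For example, `levenshtein("house", "hooouse")` is 2 (two inserted letters), while `hamming` cannot compare those two strings at all because their lengths differ.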
Now you can use one of the existing clustering algorithms. There are plenty of them, but not all will fit your needs. For example, pure k-means, already mentioned here, will hardly help you, since it requires the number of groups up front, and with a large dictionary of strings that may be 100, 200, 500, 10000 - you just don't know the number. So other algorithms may be more appropriate.
One of them is the expectation maximization algorithm. Its advantage is that it can find the number of clusters automatically. However, in practice it often gives less precise results than other algorithms, so it is normal to use k-means on top of EM, that is, first find the number of clusters and their centers with EM and then use k-means to adjust the result.
Another possible branch of algorithms that may be suitable for your task is hierarchical clustering. The result of cluster analysis in this case is not a set of independent groups, but rather a tree (hierarchy), where several smaller clusters are grouped into one bigger cluster, and all clusters are finally part of one big cluster. In your case it means that all words are similar to each other up to some degree.
There is a package called stringdist that allows for string comparison using several different methods. Copy-pasting from that page:
Hamming distance: Number of positions with different symbols in both strings. Only defined for strings of equal length.
Levenshtein distance: Minimal number of insertions, deletions and replacements needed for transforming string a into string b.
(Full) Damerau-Levenshtein distance: Like Levenshtein distance, but transposition of adjacent symbols is allowed.
Optimal String Alignment / restricted Damerau-Levenshtein distance: Like (full) Damerau-Levenshtein distance but each substring may only be edited once.
Longest Common Substring distance: Minimum number of symbols that have to be removed in both strings until resulting substrings are identical.
q-gram distance: Sum of absolute differences between N-gram vectors of both strings.
Cosine distance: 1 minus the cosine similarity of both N-gram vectors.
Jaccard distance: 1 minus the quotient of shared N-grams and all observed N-grams.
Jaro distance: The Jaro distance is a formula of 4 values and effectively a special case of the Jaro-Winkler distance with p = 0.
Jaro-Winkler distance: This distance is a formula of 5 parameters determined by the two compared strings (A,B,m,t,l) and p chosen from [0, 0.25].
That will give you the distance. You might not need to perform a cluster analysis; perhaps sorting by the string distance itself is sufficient. I have created a script to provide the basic functionality here... feel free to improve it as needed.
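To illustrate two of the N-gram based definitions listed above, here is a rough Python sketch. It is not the stringdist package itself; the function names, the default q = 2 and the handling of strings shorter than q are assumptions made for the example.

```python
from collections import Counter


def qgrams(s, q=2):
    """Count every substring of length q in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(a, b, q=2):
    """Sum of absolute differences between the q-gram count vectors."""
    ca, cb = qgrams(a, q), qgrams(b, q)
    return sum(abs(ca[g] - cb[g]) for g in ca.keys() | cb.keys())


def jaccard_distance(a, b, q=2):
    """1 minus (shared q-grams / all observed q-grams), using sets of q-grams."""
    sa, sb = set(qgrams(a, q)), set(qgrams(b, q))
    if not sa and not sb:
        return 0.0  # assumption: two strings without any q-grams count as identical
    return 1.0 - len(sa & sb) / len(sa | sb)
```

For "house" and "horse" with q = 2, the shared bigrams are "ho" and "se" out of six observed, so the q-gram distance is 4 and the Jaccard distance is 1 - 2/6.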
You can use an algorithm like the Levenshtein distance for the distance calculation and k-means for clustering.
The Levenshtein distance is a string metric for measuring the amount of difference between two sequences.
Do some testing and find a similarity threshold per word that will decide your groups.
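A minimal sketch of threshold-based grouping, under several assumptions: words are normalized first (lowercased, non-letters stripped), similarity comes from `difflib.SequenceMatcher` in the standard library as a stand-in for a Levenshtein-based score, and each word is greedily assigned to the first group whose representative is similar enough. The function names and the 0.8 threshold are made up for illustration and would need tuning on real data.

```python
import re
from difflib import SequenceMatcher


def normalize(word):
    """Lowercase and drop everything that is not a letter a-z."""
    return re.sub(r"[^a-z]", "", word.lower())


def group_words(words, threshold=0.8):
    """Greedy grouping: join the first group whose representative is similar enough."""
    groups = []  # list of (normalized representative, members)
    for word in words:
        key = normalize(word)
        for representative, members in groups:
            if SequenceMatcher(None, representative, key).ratio() >= threshold:
                members.append(word)
                break
        else:
            groups.append((key, [word]))
    return [members for _, members in groups]


sample = ["house", "house!!", "hooouse", "HoUse", "#house", '"house"', "garden"]
groups = group_words(sample)
```

This compares each word against every existing group, so for 100,000 words it may be slow, and the result depends on input order; it is meant as a starting point for the threshold testing suggested above, not as a replacement for a proper clustering algorithm.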
You can use a clustering algorithm called "Affinity Propagation". This algorithm takes as input a similarity matrix, which you can generate by taking the negative of either the Levenshtein distance or a harmonic mean of partial_ratio and token_set_ratio from the fuzzywuzzy library if you are using Python.