I was curious if anyone had a good method of choosing the best matching string from a set of keys. For example, say I have a table with the keys “Hi there”, “Hello”, “Hiya”, “hi”, “Hi”, and “Hey there”, and I want to find the closest match for “Hi”. It should match “Hi” first; if that weren't found, then “hi”, then “Hiya”, and so on: prioritizing perfect matches, then lower/uppercase matches, then whichever key has the fewest character differences or the smallest length difference.
My current method seems unwieldy: first checking for a perfect match, then looping around with string.match and saving whichever candidate has the closest string.len.
If you're not looking for a perfect match only, you need to use some metric as a measure of similarity and then look for the closest match.
As McBarby suggested in his comment, you can use the Levenshtein distance, which is the minimum number of single-character edits necessary to get from string 1 to string 2 (a sketch applying it to your lookup problem follows the list below). Research which metrics are available and which one suits your needs best; of course, you can also define your own metric.
https://en.wikipedia.org/wiki/String_metric lists a number of other string metrics:
Sørensen–Dice coefficient
Block distance or L1 distance or City block distance
Jaro–Winkler distance
Simple matching coefficient (SMC)
Jaccard similarity or Jaccard coefficient or Tanimoto coefficient
Tversky index
Overlap coefficient
Variational distance
Hellinger distance or Bhattacharyya distance
Information radius (Jensen–Shannon divergence)
Skew divergence
Confusion probability
Tau metric, an approximation of the Kullback–Leibler divergence
Fellegi and Sunters metric (SFS)
Maximal matches
Grammar-based distance
TFIDF distance metric
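To make this concrete for the original lookup question, here is a minimal sketch of the Levenshtein-based approach (in Python, although the question itself appears to use Lua's string API; the helper names and the fallback order are just one way to implement the stated priorities):

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def best_match(query, keys):
    if query in keys:                                   # 1. perfect match
        return query
    case_insensitive = [k for k in keys if k.lower() == query.lower()]
    if case_insensitive:                                # 2. match ignoring case
        return case_insensitive[0]
    # 3. otherwise: fewest edits, ties broken by smallest length difference
    return min(keys, key=lambda k: (levenshtein(query.lower(), k.lower()),
                                    abs(len(k) - len(query))))

keys = ["Hi there", "Hello", "Hiya", "hi", "Hi", "Hey there"]
print(best_match("Hi", keys))   # -> "Hi"
```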
Levenshtein distance is a gauge of the distance between two strings. Is there a similar metric for assessing how similar a group of strings are to each other, a kind of mean or group distance?
EDIT: I realize now that my question above could be better worded. Let's say I have a list of strings:
['aaa','aab','aaaa','aacd']
Is there a way to compute the degree to which every string in the above list is similar to each other? Some sort of measurement?
My teacher has given me this set of questions as homework, and I'm not sure I'm understanding it correctly.
The following customers have rated a number of DVDs as shown in the table. Calculate the similarity figures for these customers using the Euclidean distance method. Then, using the similarity figure as a weighting factor, calculate the weighted average scores for each movie.
Which other customer is most similar to Dave?
What is the similarity score for that customer?
Which movie does this scheme recommend for Dave?
What is the weighted rating for that movie?
Now, to my understanding I need to use coordinates to find the Euclidean distance, since the formula is d(x, y) = sqrt((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²).
And according to my teacher the Similarity formula is:
Do I use the coordinates by row or column to calculate the Euclidean?
Do I need to use another formula to calculate the weighted rating?
OR do I need to use the formula both ways (row and column)?
I'm really trying to wrap my head around this.
Am I in the right direction or off completely?
Since you are comparing customers, you will be comparing the columns, not the rows.
Excel has a function SUMXMY2(array_x, array_y) which computes the sum of squared differences between corresponding values in two arrays (e.g. SUMXMY2(DVD_Table[Alice],DVD_Table[Bob])). You can simply take the square root of this to get the Euclidean distance between two customers.
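Outside Excel, the same computation looks roughly like this (a sketch in Python; the customer names and ratings are made up, not the actual table from the homework):

```python
import math

# Hypothetical ratings: customer -> {movie: rating}; substitute the real table.
ratings = {
    "Dave":  {"Movie A": 3, "Movie B": 5, "Movie C": 4},
    "Alice": {"Movie A": 4, "Movie B": 5, "Movie C": 3},
    "Bob":   {"Movie A": 1, "Movie B": 2, "Movie C": 5},
}

def euclidean_distance(c1, c2):
    """Distance over the movies both customers have rated
    (this is the square root of what the SUMXMY2 call computes)."""
    common = set(ratings[c1]) & set(ratings[c2])
    return math.sqrt(sum((ratings[c1][m] - ratings[c2][m]) ** 2 for m in common))

for other in ("Alice", "Bob"):
    print(other, "distance to Dave:", round(euclidean_distance("Dave", other), 3))
```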
It is not clear to me how the weighted ratings are calculated.
Unless you have further Excel or programming questions related to this, you'd do better posting at https://math.stackexchange.com.
I am clustering thousands of documents whose vector components are tf-idf weights, and I use cosine similarity. I did a frequency analysis of words in the clusters to check the differences in top words, but I'm not sure how to measure the similarity numerically for this kind of document collection.
I compute the internal similarity of a cluster as the average similarity of each document to the cluster centroid. (If I averaged over document pairs instead, it would be based on only a small number of pairs.)
External similarity is calculated as the average similarity over all pairs of cluster centroids.
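In code, what I mean is roughly the following (a sketch in Python/numpy; the variable names are only for illustration):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def internal_similarity(doc_vectors, labels, k):
    """Average similarity of each document to its cluster centroid,
    averaged over all k clusters (doc_vectors: 2-D array, labels: 1-D array)."""
    per_cluster = []
    for c in range(k):
        members = doc_vectors[labels == c]
        centroid = members.mean(axis=0)
        per_cluster.append(np.mean([cosine(d, centroid) for d in members]))
    return float(np.mean(per_cluster))

def external_similarity(doc_vectors, labels, k):
    """Average similarity over all pairs of cluster centroids."""
    centroids = [doc_vectors[labels == c].mean(axis=0) for c in range(k)]
    sims = [cosine(centroids[i], centroids[j])
            for i in range(k) for j in range(i + 1, k)]
    return float(np.mean(sims))
```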
Am I computing this correctly? With this approach my internal similarity values average from 0.2 (5 clusters, 2000 documents) to 0.35 (20 clusters, 2000 documents), which is probably caused by the broad range of computer-science topics in the documents. The intra values range from 0.3 to 0.7. Can the results look like that? On the Internet I found various ways of measuring this, and I don't know whether to use one of them rather than the approach I described. I am quite desperate.
Thank you so much for your advice!
Using k-means with anything but squared euclidean is risky. It may stop converging, as the convergence proof relies on both the mean and the distance assignment optimizing the same criterion. K-means minimizes squared deviations, not distances!
For a k-means variant that can handle arbitrary distance functions (and has guaranteed convergence), you will need to look at k-medoids.
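As a rough illustration of the k-medoids idea, a simplified PAM-style loop over a precomputed distance matrix could look like this (a sketch in Python, not a full or optimized implementation):

```python
import numpy as np

def k_medoids(dist, k, max_iter=100, seed=0):
    """dist: (n, n) precomputed distance matrix under any metric
    (e.g. Levenshtein or 1 - cosine); k: number of clusters."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # assign every point to its nearest medoid
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            # the new medoid minimizes the total distance to its cluster members
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):   # converged
            break
        medoids = new_medoids
    return medoids, np.argmin(dist[:, medoids], axis=1)
```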
What are some of the deciding factors to take into consideration when choosing a similarity index?
In what cases is a Euclidean Distance preferred over Pearson and vice versa?
Correlation is unit-independent; if you scale one of the objects by a factor of ten, you will get different Euclidean distances but the same correlation distances. Therefore, correlation metrics are excellent when you want to measure the distance between objects such as genes defined by their expression profiles.
Often, absolute or squared correlation is used as a distance metrics, because we are more interested in the strength of the relationship than in its sign.
However, correlation is only suitable for high-dimensional data; there is hardly any point in calculating it for two- or three-dimensional data points.
Also note that "Pearson distance" is a weighted type of Euclidean distance, and not the "correlation distance" using Pearson correlation coefficient.
It really depends on the application scenario you have at hand. Very briefly: if you are dealing with data where the actual difference in attribute values is important, go with Euclidean distance; if you are looking for trend or shape similarity, go with correlation.

Also note that if you perform z-score normalization on each object, Euclidean distance behaves similarly to the Pearson correlation coefficient. Pearson is not sensitive to linear transformations of the data. There are other correlation coefficients that take into account only the ranks of the values and are thus insensitive to both linear and non-linear transformations.

Note that the usual use of correlation as a dissimilarity is 1 - correlation, which does not respect all the rules for a metric distance.
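The z-score remark can be checked numerically: for vectors standardized to zero mean and unit (population) standard deviation, the squared Euclidean distance equals 2n(1 - r), where r is the Pearson correlation. A small sketch, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=50)
y = 0.7 * x + rng.normal(size=50)      # two correlated vectors

def zscore(v):
    return (v - v.mean()) / v.std()    # population std (ddof=0)

xz, yz = zscore(x), zscore(y)
d2 = np.sum((xz - yz) ** 2)            # squared Euclidean distance
r = np.corrcoef(x, y)[0, 1]            # Pearson correlation

print(d2, 2 * len(x) * (1 - r))        # the two values coincide
```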
There are some studies on which proximity measure to select for a particular application, for instance:
Pablo A. Jaskowiak, Ricardo J. G. B. Campello, Ivan G. Costa Filho, "Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2013.
I have heard about clustering for grouping similar data. I want to know how it works in the specific case of strings.
I have a table with more than 100,000 different words.
I want to identify the same word with some differences (eg.: house, house!!, hooouse, HoUse, #house, "house", etc...).
What is needed to identify such similarity and group the words into clusters? Which algorithm is recommended for this?
To understand what clustering is, imagine a geographical map. You can see many distinct objects (such as houses). Some of them are close to each other, and others are far apart. Based on this, you can split all objects into groups (such as cities). Clustering algorithms do exactly this: they allow you to split your data into groups without specifying the group borders in advance.
All clustering algorithms are based on the distance (or similarity) between two objects. On a geographical map it is the ordinary distance between two houses; in multidimensional space it may be the Euclidean distance (in fact, the distance between two houses on the map is also a Euclidean distance). For string comparison you have to use something different. Two good choices here are the Hamming and Levenshtein distances. In your particular case the Levenshtein distance is preferable (the Hamming distance only works with strings of the same length).
Now you can use one of the existing clustering algorithms. There are plenty of them, but not all fit your needs. For example, pure k-means, already mentioned here, will hardly help you, since it requires the number of groups to be specified up front, and with a large dictionary of strings it may be 100, 200, 500, 10,000 - you just don't know the number. So other algorithms may be more appropriate.
One of them is the expectation-maximization (EM) algorithm. Its advantage is that it can find the number of clusters automatically. However, in practice it often gives less precise results than other algorithms, so it is common to use k-means on top of EM: first find the number of clusters and their centers with EM, and then use k-means to adjust the result.
Another possible branch of algorithms that may be suitable for your task is hierarchical clustering. The result of cluster analysis in this case is not a set of independent groups, but rather a tree (hierarchy), where several smaller clusters are grouped into a bigger one, and all clusters are eventually part of one big cluster. In your case it means that all words are similar to each other up to some degree.
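A minimal sketch of the hierarchical route in Python (assuming SciPy is available; the word list reuses examples from the question plus one extra, and the normalization step and the distance cut-off of 2 are just illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def levenshtein(a, b):
    # textbook dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

words = ["house", "house!!", "hooouse", "HoUse", "#house", "mouse"]
norm = [w.lower().strip('#!"') for w in words]   # cheap normalization first

n = len(norm)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = levenshtein(norm[i], norm[j])

# average-linkage hierarchical clustering; cut the tree at edit distance 2
Z = linkage(squareform(dist), method="average")
labels = fcluster(Z, t=2, criterion="distance")
print(list(zip(words, labels)))
```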
There is a package called stringdist that allows for string comparison using several different methods. Copypasting from that page:
Hamming distance: Number of positions at which the symbols of the two strings differ. Only defined for strings of equal length.
Levenshtein distance: Minimal number of insertions, deletions and replacements needed for transforming string a into string b.
(Full) Damerau-Levenshtein distance: Like Levenshtein distance, but transposition of adjacent symbols is allowed.
Optimal String Alignment / restricted Damerau-Levenshtein distance: Like (full) Damerau-Levenshtein distance but each substring may only be edited once.
Longest Common Substring distance: Minimum number of symbols that have to be removed in both strings until resulting substrings are identical.
q-gram distance: Sum of absolute differences between N-gram vectors of both strings.
Cosine distance: 1 minus the cosine similarity of both N-gram vectors.
Jaccard distance: 1 minus the quotient of shared N-grams and all observed N-grams.
Jaro distance: The Jaro distance is a formula of 4 values and effectively a special case of the Jaro-Winkler distance with p = 0.
Jaro-Winkler distance: This distance is a formula of 5 parameters determined by the two compared strings (A,B,m,t,l) and p chosen from [0, 0.25].
That will give you the distance. You might not need to perform a cluster analysis, perhaps sorting by the string distance itself is sufficient. I have created a script to provide the basic functionality here... feel free to improve it as needed.
You can use an algorithm like the Levenshtein distance for the distance calculation and k-means for clustering.
The Levenshtein distance is a string metric for measuring the amount of difference between two sequences.
Do some testing and find a similarity threshold per word that will decide your groups.
You can use a clustering algorithm called "Affinity Propagation". This algorithm takes an input called a similarity matrix, which you can generate either by taking the negative of the Levenshtein distance, or by using a harmonic mean of partial_ratio and token_set_ratio from the fuzzywuzzy library if you are using Python.
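A rough sketch of that approach in Python, assuming scikit-learn and fuzzywuzzy are installed (the word list is made up, and the harmonic-mean similarity is just one of the two options mentioned above):

```python
import numpy as np
from fuzzywuzzy import fuzz
from sklearn.cluster import AffinityPropagation

words = ["house", "house!!", "hooouse", "HoUse", "#house", "mouse"]  # sample input

def similarity(a, b):
    """Harmonic mean of partial_ratio and token_set_ratio (both on a 0-100 scale)."""
    p, t = fuzz.partial_ratio(a, b), fuzz.token_set_ratio(a, b)
    return 2 * p * t / (p + t) if (p + t) else 0.0

S = np.array([[similarity(a, b) for b in words] for a in words])

ap = AffinityPropagation(affinity="precomputed", random_state=0)
labels = ap.fit_predict(S)
print(list(zip(words, labels)))
```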