Levenshtein Distance in Power Query Using M


I've been using this VBA solution by smirkingman from another similar question for calculating Levenshtein distance between strings. I need to translate it to an M function in Excel Power Query, but I don't have the know-how to do so.
Hoping someone can help me out. The 3 basic transformations between strings used in Levenshtein distance are below. Each counts as 1 step. More steps = greater distance between strings.
Insertion
Deletion
Substitution
I thought I could "cheat" and not use a For loop-type structure as shown in the VBA example, but the test results below show that I need a more robust solution.
let
    result = (s1 as text, s2 as text) as number =>
        List.Max({Text.Length(s1), Text.Length(s2)})
            - List.Count(List.Intersect({Text.ToList(s1), Text.ToList(s2)}))
in
    result
Test Results

s1   | s2    | result        | explanation
pale | pole  | 1             | substitution
dole | sale  | 2             | substitution (x2)
pool | spool | 1             | insertion
two  | one   | 2 (incorrect) | substitution and/or insert/delete (3 steps min) EXPECTED: 3
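
For reference, here is a minimal Python sketch of the standard dynamic-programming recurrence for Levenshtein distance, run on the four test pairs above. It is not M code and not a translation of the VBA solution. It only shows the row-by-row structure that an M port would need to reproduce, possibly with something like List.Accumulate (an assumption, not tested here). The function name is made up for illustration.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    # previous[j] = distance between the processed prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into the empty string costs i deletions
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


for a, b in [("pale", "pole"), ("dole", "sale"), ("pool", "spool"), ("two", "one")]:
    print(a, b, levenshtein(a, b))
```

Unlike the set-intersection shortcut, the recurrence takes character positions into account, which is why it gives 3 for "two" and "one".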

If you're keen to use Power Query to achieve this, then you should check the Fuzzy Matching functionality. See details here and here.
However, to me, this is a 'black box' algorithm, and I am not convinced of its accuracy or efficiency. Plus, Microsoft has not published the code behind this function, so the open-source community cannot investigate it.
The Power BI community seems to think that it implements the Jaccard similarity algorithm, which is somewhat different from the Levenshtein algorithm you are familiar with. See more info here.
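
To make the difference concrete, here is a small Python sketch of Jaccard similarity. It is only an illustration of the general measure and says nothing about how Power Query's fuzzy matching is actually implemented. Comparing sets of character bigrams is an assumption made for the example, and the function names are made up.

```python
def bigrams(text: str) -> set:
    """Set of adjacent character pairs (an assumed tokenization)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: str, b: str) -> float:
    """Size of the intersection divided by size of the union of the bigram sets."""
    set_a, set_b = bigrams(a), bigrams(b)
    if not set_a and not set_b:
        return 1.0  # two inputs without bigrams are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


print(jaccard("pale", "pole"))  # shares only the bigram "le"
print(jaccard("two", "one"))    # shares no bigram at all
```

Jaccard gives a similarity between 0 and 1 based on shared pieces, whereas Levenshtein counts edit steps, so the two can rank the same pairs differently.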
If you're keen to take a deep dive into the internal workings of the string-distance matrix, I implemented this in R some years ago. You can read about this in my blog and my repo on this topic.
I would strongly recommend not using Power Query or VBA for this. There are much better libraries in both R and Python for implementing this methodology.

Related

Alternatives to TF-IDF and Cosine Similarity (comparing documents with different formats)

I've been working on a small, personal project which takes a user's job skills and suggests the most ideal career for them based on those skills. I use a database of job listings to achieve this. At the moment, the code works as follows:
1) Process the text of each job listing to extract skills that are mentioned in the listing
2) For each career (e.g. "Data Analyst"), combine the processed text of the job listings for that career into one document
3) Calculate the TF-IDF of each skill within the career documents
After this, I'm not sure which method I should use to rank careers based on a list of a user's skills. The most popular method that I've seen would be to treat the user's skills as a document as well, then to calculate the TF-IDF for the skill document, and use something like cosine similarity to calculate the similarity between the skill document and each career document.
This doesn't seem like the ideal solution to me, since cosine similarity is best used when comparing two documents of the same format. For that matter, TF-IDF doesn't seem like the appropriate metric to apply to the user's skill list at all. For instance, if a user adds additional skills to their list, the TF for each skill will drop. In reality, I don't care what the frequency of each skill is in the user's skills list -- I just care that they have those skills (and maybe how well they know those skills).
It seems like a better metric would be to do the following:
1) For each skill that the user has, calculate the TF-IDF of that skill in the career documents
2) For each career, sum the TF-IDF results for all of the user's skills
3) Rank careers based on the above sum
Am I thinking along the right lines here? If so, are there any algorithms that work along these lines, but are more sophisticated than a simple sum? Thanks for the help!
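
The three-step ranking proposed above can be sketched in Python as follows. This is a minimal illustration under assumptions: each career document is given as a flat list of extracted skills, TF is the relative frequency within the document, and IDF is log(N / document frequency). The function name and sample data are made up.

```python
import math
from collections import Counter


def rank_careers(career_docs, user_skills):
    """career_docs maps career -> list of skills; returns (career, score) pairs, best first."""
    n_docs = len(career_docs)
    # Document frequency: number of career documents that mention each skill.
    doc_freq = Counter(s for skills in career_docs.values() for s in set(skills))
    scores = {}
    for career, skills in career_docs.items():
        counts = Counter(skills)
        score = 0.0
        for skill in set(user_skills):  # a skill counts once, however often the user lists it
            if counts[skill]:
                tf = counts[skill] / len(skills)
                idf = math.log(n_docs / doc_freq[skill])
                score += tf * idf
        scores[career] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


careers = {  # invented sample data
    "Data Analyst": ["sql", "excel", "sql", "python"],
    "Web Developer": ["javascript", "html", "css", "sql"],
    "Data Scientist": ["python", "statistics", "python", "sql"],
}
for career, score in rank_careers(careers, ["python", "statistics"]):
    print(f"{career}: {score:.3f}")
```

Because the user's skills are treated as a set, adding more skills never lowers the contribution of the ones already listed, which addresses the TF concern raised above.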

The second approach you explained will work, but there are better ways to solve this kind of problem.
First, you should learn a little about language models and move away from the vector space model.
Second, because your problem is similar to expert finding/profiling, you should learn a baseline language-modeling framework to implement a solution.
You can implement A language modeling framework for expert finding with a few changes so that the formulas fit your problem.
Also, reading On the assessment of expertise profiles will give you a better understanding of expert profiling with the framework above.
You can find some good ideas, resources and projects on expert finding/profiling at Balog's blog.

I would take the SSRM [1] approach and expand the query (job documents) using WordNet (extracted database [2]) as the semantic lexicon, so you are not constrained to direct word-vs-word matches. SSRM has its own similarity measure (I believe the paper is open-access; if not, check this: http://blog.veles.rs/document-similarity-computation-models-literature-review/, where many similarity computation models are listed). Alternatively, if your corpus is big enough, you might try LSA/LSI [3,4] (also covered on that page), which needs no external lexicon. But if the text is in English, WordNet's semantic graph is really rich in all directions (hyponyms, synonyms, hypernyms... concepts/synsets).
The bottom line: I would avoid simple VSM/TF-IDF for such a specific domain. I measured a really serious margin for SSRM over TF-IDF/VSM (measured as macro-average F1, 5-class single-label classification, narrow domain).
[1] A. Hliaoutakis, G. Varelas, E. Voutsakis, E.G.M. Petrakis, E. Milios, Information Retrieval by Semantic Similarity, Int. J. Semant. Web Inf. Syst. 2 (2006) 55–73. doi:10.4018/jswis.2006070104.
[2] J.E. Petralba, An extracted database content from WordNet for Natural Language Processing and Word Games, in: 2014 Int. Conf. Asian Lang. Process., 2014: pp. 199–202. doi:10.1109/IALP.2014.6973502.
[3] P.W. Foltz, Latent semantic analysis for text-based research, Behav. Res. Methods, Instruments, Comput. 28 (1996) 197–202. doi:10.3758/BF03204765.
[4] A. Kashyap, L. Han, R. Yus, J. Sleeman, T. Satyapanich, S. Gandhi, T. Finin, Robust semantic text similarity using LSA, machine learning, and linguistic resources, Springer Netherlands, 2016. doi:10.1007/s10579-015-9319-2.

How to detect near duplicate rows in Azure Machine Learning?

I am new to Azure Machine Learning. We are trying to implement a question-similarity algorithm with it. We have a large set of questions and answers, and our objective is to identify whether newly added questions are duplicates, just as Stack Overflow suggests existing questions when you ask a new one. Can we use Azure Machine Learning services to solve this? Can someone guide us in the right direction?

Yes, you can use Azure Machine Learning Studio, and you could use the method Jennifer proposed.
However, I would assume it is much better to run an R script against a database containing all current questions in your experiment and return a similarity metric for each comparison.
Have a look at the following paper for some examples (from simple/basic to more advanced) of how you could do this:
https://www.researchgate.net/publication/4314910_Question_Similarity_Calculation_for_FAQ_Answering
A simple way to start would be to implement a simple "bag of words" comparison. This yields a distance matrix that you could use for clustering or for returning similar questions. The following R code does this; in essence, you build one large input with the new question as the first sentence, followed by all known questions. This method obviously does not take the meaning of the questions into consideration and only triggers on shared word usage.
library(tm)
library(Matrix)
# strings.with.all.questions is expected to hold the new question first, then all known questions
x <- TermDocumentMatrix( Corpus( VectorSource( strings.with.all.questions ) ) )
# turn the term-document triplets into a sparse terms-by-documents matrix
y <- sparseMatrix( i=x$i, j=x$j, x=x$v, dimnames = dimnames(x) )
# cluster the documents on the distance between their term-count vectors
plot( hclust(dist(t(y))) )
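
As a rough Python counterpart of the same bag-of-words idea, here is a sketch that uses only the standard library. It returns the nearest known questions instead of plotting a dendrogram. The tokenizer, the function names and the sample questions are assumptions made for illustration. Euclidean distance on raw counts is used because it is the default of R's dist().

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased word counts; the regular expression is an assumed tokenizer."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def euclidean(a, b):
    """Euclidean distance between two word-count vectors."""
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in a.keys() | b.keys()))


def most_similar(new_question, known_questions, top=3):
    """Known questions sorted by distance to the new one, closest first."""
    new_bag = bag_of_words(new_question)
    scored = [(euclidean(new_bag, bag_of_words(q)), q) for q in known_questions]
    return sorted(scored)[:top]


known = [  # invented sample data
    "How do I reverse a list in Python?",
    "How do I sort a dictionary by value?",
    "What is a lambda function?",
]
for distance, question in most_similar("How to reverse a list in Python", known):
    print(f"{distance:.2f}  {question}")
```

As with the R version, this only reacts to shared words, and raw counts favour questions of similar length.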

Yes, you can definitely do this with Azure Machine Learning. It sounds like you have a clustering problem (you are trying to group together similar questions).
There is a "Clustering: Find similar companies" sample that does a similar thing at https://gallery.cortanaanalytics.com/Experiment/60cf8e46935c4fafbf86f669121a24f0. You can read the description on that page and click the "Open in Studio" button in the right-hand sidebar to actually open the workspace in Azure Machine Learning Studio. In that sample, they are finding similar companies based on the text from the company's Wikipedia article (for example: Microsoft and Apple are similar companies because the word "computer" appears a lot in both articles). Your problem is very similar except you would use the text in your questions to find similar questions and cluster them into groups accordingly.
In k-means clustering, "k" is the number of clusters that you want to form, so this number will probably be pretty big for your specific problem. If you have 500 questions, perhaps start with 250 centroids? But mess around with this number and see what works. For performance reasons, you might want to start with a small dataset for testing and then run all of your data through the model after it seems to be grouping well.
Also, the documentation for K-means clustering is here.

Generating multiple optimal solutions using Excel solver

Is there a way to get all optimal solutions when you are solving some problem with Excel Solver (Simplex LP method)?
If not, what is the best way/add-in to Excel to solve it and convert existing VBA code to use this new way?

Actually, I have found a way to do this with Excel Solver. It is not optimal in terms of run time, but that is not an issue for me.
If you can assign a unique id to each possible solution in some way, which is true in my case, then for each solution you find you can check whether there is another solution with the same value but a different id, as follows:
1) Find the first optimal solution and save its id and result. I will call these origID and origRes.
2) Check whether there is a solution with id < origID and res = origRes.
3) If yes, take the new id as the initial id and repeat step 2 until you can't find any more solutions that satisfy the criteria.
4) After that, do the same thing with the condition id > origID and res = origRes.
5) Once you are sure you have found all solutions with the optimal result origRes, look for the next solution that is worse than origRes. I did this by adding the condition that the new solution must be <= (origRes - 0.01), because I know that all solutions have 2 decimal places.
6) Go to step 2 again.
I know this is not the best way, but I usually do not need more than 100 solutions, and currently I can get them in 2 minutes, which is acceptable for me.

Although this looks easy, it actually is not such an easy question. Even the definition of "all possible optimal solutions" is not clear. There may be infinitely many of them. Asking for "all basic feasible solutions" (i.e. corner points) sounds better. To my knowledge there are no solvers providing this. I also do not know of a really simple technique to enumerate all optimal bases.
One interesting approach is to use a MIP formulation to enumerate all optimal bases:
Sangbum Lee, Chan Phalakornkule, Michael M. Domach, Ignacio E. Grossmann, "Recursive MILP model for finding all the alternate optima in LP models for metabolic networks," Computers and Chemical Engineering 24 (2000) 711-716. (link)

automatic keyword generation evaluation

I have a simple text analyzer that generates keywords for a given input text. Until now I have been doing a manual evaluation of it, i.e., manually selecting keywords of a text and comparing them against the ones generated by the analyzer.
Is there any way in which I can automate this? I tried googling a lot for some free keyword generators which can help in this evaluation but have not found any till now. I will appreciate any suggestions on how to go about this.

Testing keyword generation is a difficult problem. In the past, I have used the following method to evaluate it.
1) Identify the popular association-rule generation methods like Confidence, Jaccard, Lift, Chi-Squared, Mutual Information etc. There are many papers that compare such measures.
2) Implementing these measures is fairly simple. They all involve some simple algebraic expression using one or more of term frequencies, document frequencies and co-occurrence frequencies.
3) Generate related keywords using all of these measures and compute their union. Call this set TOTAL.
4) Compute the intersection of the keywords generated by your algorithm with the above TOTAL-set. When viewed as a fraction (intersection/TOTAL), it is a rough indicator of how powerful your measure is.
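
The last two steps boil down to a few set operations. Here is a minimal Python sketch, assuming the keyword sets produced by the individual measures are already available. Computing the measures themselves is not shown, and the function name and keyword sets are invented for the example.

```python
def coverage(candidate_keywords, reference_keyword_sets):
    """Fraction of the TOTAL set (union of all reference sets) that the candidates hit."""
    total = set().union(*reference_keyword_sets)
    if not total:
        return 0.0
    return len(set(candidate_keywords) & total) / len(total)


# Invented keyword sets, standing in for the output of e.g. Confidence, Lift and Jaccard.
reference_sets = [
    {"python", "pandas", "numpy"},
    {"python", "data"},
    {"pandas", "dataframe"},
]
generated = ["python", "dataframe", "excel"]
print(coverage(generated, reference_sets))
```

Note that the fraction only rewards keywords that appear in TOTAL. It does not penalise extra keywords such as "excel" here, so it is closer to recall than to precision.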

I found an automatic keyword generation evaluation tool, Text Mechanic's Keyword Suggestion Generator, which might help.
It says:
The Text Mechanic's "Keyword Suggestion Generator" will retrieve Google.com auto suggest results* for your entered seed text in an easy to investegate format. Seed text can be a letter, number, word, phrase, related to what you (and others) are querying to find in Google search results.
I believe it can be automated.

How do you implement a "Did you mean"? [duplicate]

Possible Duplicate:
How does the Google “Did you mean?” Algorithm work?
Suppose you have a search system already in your website. How can you implement the "Did you mean:<spell_checked_word>" like Google does in some search queries?

Actually, what Google does is very much non-trivial and, at first, counter-intuitive. They don't do anything like checking against a dictionary; rather, they use statistics to identify "similar" queries that returned more results than your query. The exact algorithm is of course not known.
There are different sub-problems to solve here. As a fundamental basis for everything related to statistical natural language processing, there is one must-have book: Foundations of Statistical Natural Language Processing.
Concretely, to solve the problem of word/query similarity, I have had good results with edit distance, a mathematical measure of string similarity that works surprisingly well. I used to use Levenshtein, but the others may be worth looking into.
Soundex - in my experience - is crap.
Efficiently storing and searching a large dictionary of misspelled words with sub-second retrieval is again non-trivial. Your best bet is to use an existing full-text indexing and retrieval engine (i.e. not your database's one), of which Lucene is currently one of the best and, coincidentally, ported to many platforms.

Google's Dr Norvig has outlined how it works; he even gives a 20ish line Python implementation:
http://googlesystem.blogspot.com/2007/04/simplified-version-of-googles-spell.html
http://www.norvig.com/spell-correct.html
Dr Norvig also discusses the "did you mean" in this excellent talk. Dr Norvig is head of research at Google - when asked how "did you mean" is implemented, his answer is authoritative.
So it's spell-checking, presumably with a dynamic dictionary built from other searches or even actual internet phrases and such. But that's still spell checking.
SOUNDEX and other guesses don't get a look in, people!

Check this article on Wikipedia about the Levenshtein distance. Make sure you take a good look at Possible improvements.

I was pleasantly surprised that someone has asked how to create a state-of-the-art spelling suggestion system for search engines. I have been working on this subject for more than a year for a search engine company, and I can point to information in the public domain on the subject.
As was mentioned in a previous post, Google (and Microsoft and Yahoo!) do not use any predefined dictionary, nor do they employ hordes of linguists who ponder over the possible misspellings of queries. That would be impossible due to the scale of the problem, but also because it is not clear that people could actually correctly identify when and if a query is misspelled.
Instead, there is a simple and rather effective principle that is also valid for all European languages. Get all the unique queries in your search logs and calculate the edit distance between all pairs of queries, assuming that the reference query is the one that has the highest count.
This simple algorithm will work great for many types of queries. If you want to take it to the next level, then I suggest you read the paper by Microsoft Research on that subject. You can find it here.
The paper has a great introduction, but after that you will need to be familiar with concepts such as the Hidden Markov Model.

I would suggest looking at SOUNDEX to find similar words in your database.
You can also access Google's own dictionary by using the Google API spelling suggestion request.

You may want to look at Peter Norvig's "How to Write a Spelling Corrector" article.

I believe Google logs all queries and identifies when someone makes a spelling correction. This correction may then be suggested when others supply the same first query. This will work for any language, in fact any string of any characters.

http://en.wikipedia.org/wiki/N-gram#Google_use_of_N-gram

I think this depends on how big your website is. On our local intranet, which is used by about 500 members of staff, I simply look at the search phrases that returned zero results and enter each such phrase, together with the new suggested search phrase, into a SQL table.
I then call on that table if no search results have been returned. However, this only works if the site is relatively small, and I only do it for the most common search phrases.
You might also want to look at my answer to a similar question:
"Similar Posts" like functionality using MS SQL Server?

If you have industry-specific translations, you will likely need a thesaurus. For example, I worked in the jewelry industry and there were abbreviations in our descriptions such as kt - karat, rd - round, cwt - carat weight... Endeca (the search engine at that job) has a thesaurus that will translate from common misspellings, but it does require manual intervention.

I do it with Lucene's Spell Checker.

Soundex is good for phonetic matches, but works best with people's names (it was originally developed for census data).
Also check out full-text indexing. The syntax is different from Google logic, but it's very quick and can deal with similar language elements.

Soundex and "Porter stemming" (soundex is trivial, not sure about porter stemming).

There's something called aspell that might help:
http://blog.evanweaver.com/files/doc/fauna/raspell/classes/Aspell.html
There's a ruby gem for it, but I don't know how to talk to it from python
http://blog.evanweaver.com/files/doc/fauna/raspell/files/README.html
Here's a quote from the ruby implementation
Usage
Aspell lets you check words and suggest corrections. For example:
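require "raspell"
# create the speller first (the dictionary name "en_US" is an assumption)
speller = Aspell.new("en_US")
string = "my haert wil go on"
string.gsub(/[\w\']+/) do |word|
  if !speller.check(word)
    # word is wrong
    puts "Possible correction for #{word}:"
    puts speller.suggest(word).first
  end
end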
This outputs:
Possible correction for haert:
heart
Possible correction for wil:
Will

Implementing spelling correction for search engines in an effective way is not trivial (you can't just compute the edit/Levenshtein distance to every possible word). A solution based on k-gram indexes is described in Introduction to Information Retrieval (full text available online).
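
To give an idea of what a k-gram index does, here is a minimal Python sketch. It maps every character bigram to the vocabulary words that contain it, so only words sharing bigrams with the query are scored, here by Jaccard overlap of their bigram sets. The "$" boundary marker, the threshold, the function names and the vocabulary are assumptions for illustration. The book covers the full method, including how it combines with edit distance.

```python
from collections import defaultdict


def kgrams(word, k=2):
    """Character k-grams of the word, padded with '$' to mark its boundaries."""
    padded = f"${word}$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


def build_index(vocabulary, k=2):
    """Map each k-gram to the set of vocabulary words containing it."""
    index = defaultdict(set)
    for word in vocabulary:
        for gram in kgrams(word, k):
            index[gram].add(word)
    return index


def candidates(query, index, k=2, min_jaccard=0.3):
    """Words sharing enough k-grams with the query, best match first."""
    query_grams = kgrams(query, k)
    shared = defaultdict(int)
    for gram in query_grams:
        for word in index.get(gram, ()):
            shared[word] += 1  # count the k-grams each word has in common
    scored = []
    for word, common in shared.items():
        union = len(query_grams) + len(kgrams(word, k)) - common
        if common / union >= min_jaccard:
            scored.append((common / union, word))
    return [word for _, word in sorted(scored, reverse=True)]


index = build_index(["heart", "heat", "hear", "earth", "will", "wall"])
print(candidates("haert", index))
```

The short candidate list can then be re-ranked with an edit distance, which is much cheaper than computing the distance against the whole vocabulary.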

You could use n-grams for the comparison: http://en.wikipedia.org/wiki/N-gram
Using the Python ngram module: http://packages.python.org/ngram/index.html
import ngram
G2 = ngram.NGram(["iis7 configure ftp 7.5",
                  "ubunto configre 8.5",
                  "mac configure ftp"])
print("Similarity", "\t", "String")
# each search result is a (string, similarity) pair at or above the threshold
for i in G2.search("iis7 configurftp 7.5", threshold=0.1):
    print(i[1], "\t", i[0])
You get:
Similarity String
0.76 "iis7 configure ftp 7.5"
0.24 "mac configure ftp"
0.19 "ubunto configre 8.5"

Why not use Google's "did you mean" in your code? To see how, look here:
http://narenonit.blogspot.com/2012/08/trick-for-using-googles-did-you-mean.html
