Find which cells have the smallest Levenshtein distance - excel

So, I have this Function which will quickly return the Levenshtein Distance between two Strings:
Function Levenshtein(ByVal string1 As String, ByVal string2 As String) As Long
Dim i As Long, j As Long
Dim string1_length As Long
Dim string2_length As Long
Dim distance() As Long
string1_length = Len(string1)
string2_length = Len(string2)
ReDim distance(string1_length, string2_length)
For i = 0 To string1_length
distance(i, 0) = i
Next
For j = 0 To string2_length
distance(0, j) = j
Next
For i = 1 To string1_length
For j = 1 To string2_length
If Asc(Mid$(string1, i, 1)) = Asc(Mid$(string2, j, 1)) Then
distance(i, j) = distance(i - 1, j - 1)
Else
distance(i, j) = Application.WorksheetFunction.Min _
(distance(i - 1, j) + 1, _
distance(i, j - 1) + 1, _
distance(i - 1, j - 1) + 1)
End If
Next
Next
Levenshtein = distance(string1_length, string2_length)
End Function
I want to perform a fast comparison between all cells in the "A" column and return which ones have a "small" Levenshtein distance. How would I make all these comparisons?

Do you want to find which combinations of strings have small Levenshtein distances, or just how similar or dissimilar each string is to all the other strings overall?
If it is the former, this should work fine:
Lay the strings out as a matrix: list them down column A, then copy them and paste them transposed across a header row (as Dale commented). Fill the grid with, for example, =Levenshtein($A28,B$27), which assumes the headers sit in row 27 and the data starts in row 28. You can then use conditional formatting to highlight the lowest results.
Or, if you want the matching strings themselves returned, you should be able to use this:
=IF(AND(Levenshtein($A28,B$27)>0,Levenshtein($A28,B$27)<=3),$A28&"/"&B$27,"")
Just copy and paste unique values if you want the returned combinations in a single column.
Good Luck.
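If a matrix is impractical for a long column, the same pairwise comparison can be sketched as a macro. This is only a minimal sketch and not part of the original answer: the name ListClosePairs, the threshold MAX_DIST = 3 (borrowed from the formula above), and the output columns C:E are all assumptions, and it relies on the Levenshtein function from the question being in the same project.

```vba
Sub ListClosePairs()
    Const MAX_DIST As Long = 3          ' assumed threshold for a "small" distance
    Dim ws As Worksheet
    Dim vals As Variant
    Dim lastRow As Long, i As Long, j As Long
    Dim outRow As Long, d As Long

    Set ws = ActiveSheet
    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row
    If lastRow < 2 Then Exit Sub        ' nothing to compare

    ' Read column A once into memory; cell-by-cell reads are much slower
    vals = ws.Range("A1:A" & lastRow).Value

    outRow = 1
    For i = 1 To lastRow - 1
        For j = i + 1 To lastRow        ' each pair is compared only once
            d = Levenshtein(CStr(vals(i, 1)), CStr(vals(j, 1)))
            If d > 0 And d <= MAX_DIST Then
                ' Assumed output layout: C = first string, D = second, E = distance
                ws.Cells(outRow, "C").Value = vals(i, 1)
                ws.Cells(outRow, "D").Value = vals(j, 1)
                ws.Cells(outRow, "E").Value = d
                outRow = outRow + 1
            End If
        Next j
    Next i
End Sub
```

Because j starts at i + 1, n strings need n(n-1)/2 calls to Levenshtein, so this will still be slow for many thousands of rows. Note that it overwrites whatever is in columns C:E.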

Related

How to use Excel VLOOKUP for same letters but different word

I need to find the values of words with the same letters but different order of letters. I tried using vlookup but it didn't work, any suggestions how to do that? For example, the letters are the same in the photo, but their order is different and I cannot find the value.
Another alternative LAMBDA() related option using REDUCE():
Formula in E1:
=XLOOKUP("",REDUCE(A$1:A$3,MID(D1,SEQUENCE(LEN(D1)),1),LAMBDA(a,b,SUBSTITUTE(a,b,"",1))),B$1:B$3)
Or; spill the entire range at once wrapping the above in BYROW():
Formula in E1:
=BYROW(D1:D3,LAMBDA(a,XLOOKUP("",REDUCE(A$1:A$3,MID(a,SEQUENCE(LEN(a)),1),LAMBDA(b,c,SUBSTITUTE(b,c,"",1))),B$1:B$3)))
The above would assume lookup values that are of the exact same length. To avoid false positives we should probably concatenate our starting value first:
=XLOOKUP(D1,REDUCE(A$1:A$3&D1,MID(D1,SEQUENCE(LEN(D1)),1),LAMBDA(a,b,SUBSTITUTE(a,b,"",1))),B$1:B$3)
And respectively:
=BYROW(D1:D3,LAMBDA(a,XLOOKUP(a,REDUCE(A$1:A$3&a,MID(a,SEQUENCE(LEN(a)),1),LAMBDA(b,c,SUBSTITUTE(b,c,"",1))),B$1:B$3)))
With Office 365 we can use XLOOKUP with some other dynamic array formula:
=XLOOKUP(CONCAT(SORT(MID(D1,SEQUENCE(,5),1),1,1,TRUE)),BYROW(MID($A$1:$A$3,SEQUENCE(,5),1),LAMBDA(A,CONCAT(SORT(A,1,1,TRUE)))),$B$1:$B$3,"")
This sorts the letters of each word alphabetically on the fly, so words made of the same letters produce the same key and can be matched.
If one does not have Office 365, this will be easier with VBA. Based on an older answer: Excel formula to take string value from cell and sort its characters in alphabetical order
We can use the following UDFs to return the needed sorted values:
Function sortletterarr(rng As Range)
If rng.Columns.Count > 1 Then Exit Function
If rng.Rows.Count <= 1 Then Exit Function
Dim out()
out() = rng.Value
Dim srtArr() As String
Dim i As Long, j As Long, k As Long
Dim a As Long
For a = LBound(out, 1) To UBound(out, 1)
ReDim srtArr(1 To Len(out(a, 1)))
srtArr(1) = Mid(out(a, 1), 1, 1)
For i = 2 To UBound(srtArr)
For j = 1 To UBound(srtArr)
If srtArr(j) = "" Then
srtArr(j) = Mid(out(a, 1), i, 1)
Exit For
ElseIf IIf(Asc(Mid(out(a, 1), i, 1)) > 96, Asc(Mid(out(a, 1), i, 1)) - 32, Asc(Mid(out(a, 1), i, 1))) <= IIf(Asc(srtArr(j)) > 96, Asc(srtArr(j)) - 32, Asc(srtArr(j))) Then
For k = UBound(srtArr) To j + 1 Step -1
srtArr(k) = srtArr(k - 1)
Next k
srtArr(j) = Mid(out(a, 1), i, 1)
Exit For
End If
Next j
Next i
out(a, 1) = Join(srtArr, "")
Next a
sortletterarr = out
End Function
And
Function sortletter(rng As Range)
If rng.Count > 1 Then Exit Function
Dim srtArr() As String
Dim i&, j&, k&
ReDim srtArr(1 To Len(rng))
srtArr(1) = Mid(rng, 1, 1)
For i = 2 To UBound(srtArr)
For j = 1 To UBound(srtArr)
If srtArr(j) = "" Then
srtArr(j) = Mid(rng, i, 1)
Exit For
ElseIf IIf(Asc(Mid(rng, i, 1)) > 96, Asc(Mid(rng, i, 1)) - 32, Asc(Mid(rng, i, 1))) <= IIf(Asc(srtArr(j)) > 96, Asc(srtArr(j)) - 32, Asc(srtArr(j))) Then
For k = UBound(srtArr) To j + 1 Step -1
srtArr(k) = srtArr(k - 1)
Next k
srtArr(j) = Mid(rng, i, 1)
Exit For
End If
Next j
Next i
sortletter = Join(srtArr, "")
End Function
Put both in a standard module in the workbook (not in a worksheet's code module).
Then we can use INDEX/MATCH:
=INDEX($B$1:$B$3,MATCH(sortletter(D1),sortletterarr($A$1:$A$3),0))
And it will return what we want.
This is definitely a primitive approach, but it does work if, as your post indicates, all of your references are of a uniform length, and will work (as an array formula) in versions of Excel from 2007 upwards:
=INDEX($B$1:$B$3,MATCH(SUM(CODE(MID(D1,ROW($1:$5),1))),MMULT(CODE(MID($A$1:$A$3,TRANSPOSE(ROW($1:$5)),1)),ROW($1:$5)/ROW($1:$5)),0))
If your references are not all of the same length (5 in your example) then this formula will not work (even as updated in later versions of Excel, it could compute the same numeric value for 2 entirely different character sequences of unequal length).
(if you opt for the VBA solution in the first answer then this article will show you where to put it)
Edit 2-May-2022
JvdV's feedback has indicated a potential issue with the foregoing so, below is an alternate approach, albeit using a helper column (C):
The formula in cell C1 is
=SUMPRODUCT(LARGE(CODE(MID(A1,ROW($1:$5),1)),ROW($1:$5)),ROW($1:$5)*128)
(which should be copied down for each cell of your data)
and the formula in F1 is
=INDEX($B$1:$B$5,MATCH(SUMPRODUCT(LARGE(CODE(MID(E1,ROW($1:$5),1)),ROW($1:$5)),ROW($1:$5)*128),$C$1:$C$5,0))

VBA function InStr - How to use asterisk (as any other character) in searched phrase?

In Excel, when we search for a phrase, we can put an asterisk * inside it to stand for any other characters. But how do we do that inside a VBA macro? For example, see below:
I want to find secName by searching for the value of firName with an asterisk, but it doesn't work. I suppose the problem is that VBA thinks I want to find a literal * character instead of anything.
Dim firName, secName As String
firName = "Da*"
secName = "Daniel"
search = InStr(1, secName, firName, vbTextCompare)
MsgBox (search)
Is it possible to use asterisk * in the way I described?
You can either do a FuzzySearch like: Matching similar but not exact text strings in Excel VBA projects, …
… or you can use the Levenshtein distance to find out how similar two strings are, which is probably more accurate but needs O(n*m) time to compute. So don't use it on very long strings.
Function Levenshtein(str1 As String, str2 As String) As Long
Dim arrLev As Variant, intLen1 As Long, intLen2 As Long, i As Long
Dim j As Long, arrStr1 As Variant, arrStr2 As Variant, intMini As Long
intLen1 = Len(str1)
ReDim arrStr1(intLen1 + 1)
intLen2 = Len(str2)
ReDim arrStr2(intLen2 + 1)
ReDim arrLev(intLen1 + 1, intLen2 + 1)
arrLev(0, 0) = 0
For i = 1 To intLen1
arrLev(i, 0) = i
arrStr1(i) = Mid(str1, i, 1)
Next i
For j = 1 To intLen2
arrLev(0, j) = j
arrStr2(j) = Mid(str2, j, 1)
Next j
For j = 1 To intLen2
For i = 1 To intLen1
If arrStr1(i) = arrStr2(j) Then
arrLev(i, j) = arrLev(i - 1, j - 1)
Else
intMini = arrLev(i - 1, j) 'deletion
If intMini > arrLev(i, j - 1) Then intMini = arrLev(i, j - 1) 'insertion
If intMini > arrLev(i - 1, j - 1) Then intMini = arrLev(i - 1, j - 1) 'substitution
arrLev(i, j) = intMini + 1
End If
Next i
Next j
Levenshtein = arrLev(intLen1, intLen2)
End Function
The smaller the returned number, the more similar the strings are.
For example:
Debug.Print Levenshtein("OFFICE CLUB, S.A.", "OFFICE CLUB SA") 'returns 3
Debug.Print Levenshtein("OFFICE CLUB, S.A.", "OFFICE CLUB S.A.") 'returns 1
The second pair of strings is more similar than the first.
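For the literal wildcard question, VBA's built-in Like operator is worth a look: InStr treats * as an ordinary character, whereas Like treats it as "any run of characters". The sketch below reuses the variable names from the question; the sub name WildcardDemo and the extra test strings are made up for illustration.

```vba
Sub WildcardDemo()
    Dim firName As String, secName As String
    firName = "Da*"
    secName = "Daniel"

    ' Like matches the pattern against the WHOLE string
    Debug.Print secName Like firName                    ' True
    Debug.Print "Adam" Like firName                     ' False: does not start with "Da"

    ' For a "contains" test, closer to what InStr does, wrap the text in asterisks
    Debug.Print "Adam" Like "*da*"                      ' True

    ' Like is case-sensitive under the default Option Compare Binary,
    ' so normalise the case on both sides if that matters
    Debug.Print LCase$(secName) Like LCase$(firName)    ' True
End Sub
```

Like returns a Boolean rather than a position, so it answers "does it match?" but not "where?".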

Find minimum Levenshtein Distance between one word and an array of thousands

So my users wrote their addresses in a registration form, but a lot of them have typos. I have another list retrieved from the city records with the correct spelling of those addresses. So let's say I have "Brooklny" typed by them and I have the list of correct names: Brooklyn, Manhattan, Bronx, Staten Island, Queens (this is an example, the actual addresses are in Spanish and refer to neighborhoods in Mexico City).
I want to find the edit distance between Brooklny and each of the borough names, and then find the word to which Brooklny has the minimum edit distance.
So edit distance between: Brooklny-Brooklyn is 2, Brooklny-Bronx is 4 and so on. The minimum of course is 2 with Brooklyn.
Imagine that I have Brooklny in cell A1 and Brooklyn, Manhattan, Bronx, Staten Island, Queens each in a cell from B1:B6
I'm doing this in VBA for a user-defined function in Excel, and so far I have this code, but it doesn't work.
Function Minl(ByVal string1 As String, ByVal correctos As Range) As Variant
Dim distancias(3) As Integer
Dim i, minimo As Integer
i = 0
For Each c In correctos.Cells
distancias(i) = Levenshtein(string1, c.Value)
i = i + 1
Next c
Minl = Minrange(distancias)
End Function
Function Levenshtein(ByVal string1 As String, ByVal string2 As String) As Long
Dim i As Long, j As Long
Dim string1_length As Long
Dim string2_length As Long
Dim distance() As Long
string1_length = Len(string1)
string2_length = Len(string2)
ReDim distance(string1_length, string2_length)
For i = 0 To string1_length
distance(i, 0) = i
Next
For j = 0 To string2_length
distance(0, j) = j
Next
For i = 1 To string1_length
For j = 1 To string2_length
If Asc(Mid$(string1, i, 1)) = Asc(Mid$(string2, j, 1)) Then
distance(i, j) = distance(i - 1, j - 1)
Else
distance(i, j) = Application.WorksheetFunction.Min _
(distance(i - 1, j) + 1, _
distance(i, j - 1) + 1, _
distance(i - 1, j - 1) + 1)
End If
Next
Next
Levenshtein = distance(string1_length, string2_length)
End Function
Function Minrange(ParamArray values() As Variant) As Variant
Dim minValue, Value As Variant
minValue = values(0)
For Each Value In values
If Value < minValue Then minValue = Value
Next
Minrange = minValue
End Function
I think the algorithm is right, but I think I might be having trouble with the syntax. The Levenshtein function works, but I'm not sure about the other two.
To get the closest output you could use this:
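Function get_match(ByVal str As String, rng As Range) As String
    Dim itm As Range, outp(0 To 2) As Variant
    ' outp(0) = current distance, outp(1) = best distance so far, outp(2) = best match so far
    outp(1) = 0: outp(2) = ""
    ' Loop over the cells: rng.Text is not an array for a multi-cell range
    For Each itm In rng.Cells
        outp(0) = Levenshtein(CStr(itm.Value), str)
        If outp(0) = 0 Then
            get_match = CStr(itm.Value) ' exact match, stop early
            Exit Function
        ElseIf outp(1) = 0 Or outp(0) < outp(1) Then
            outp(1) = outp(0)
            outp(2) = CStr(itm.Value)
        End If
    Next
    get_match = outp(2) ' return the closest string, not its distance
End Function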
To get the distance later, you could simply run Levenshtein(string, get_match(string, range)).
Still... I'm not exactly sure what you are looking for :/
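If the minimum distance itself is what is wanted, as in the question's Minl, a running minimum avoids both problems in that code: Dim distancias(3) only has room for four values, and passing an array to a ParamArray makes values(0) the whole array rather than its first element. The following is a minimal sketch under those observations; the name MinDistance is an assumption, and it relies on the Levenshtein function from the question.

```vba
Function MinDistance(ByVal string1 As String, ByVal correctos As Range) As Long
    Dim c As Range
    Dim d As Long, best As Long

    best = -1                            ' -1 means "no cell scored yet"
    For Each c In correctos.Cells
        d = Levenshtein(string1, CStr(c.Value))
        If best < 0 Or d < best Then best = d
    Next c

    MinDistance = best
End Function
```

With the example data, =MinDistance(A1,B1:B6) should give 2, the distance the question quotes for Brooklny and Brooklyn. Empty cells in the range are scored as empty strings, so their distance is simply the length of string1.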

Moving characters from one cell to another cause infinite loop

I'm a novice with Excel VBA and what I need to do is take a few cells in a column that contain a name and a phone number and cut the phone number out of that cell and paste it in another column, so I'll have one cell with the name and another with the number.
For some reason my program goes into an infinite loop.
I'm pretty sure it's caused by sLen = sLen - 1 and j = j - 1, but I don't know how to solve it.
Sub loop_macro()
Dim myStr As String
Dim i As Integer
For i = 5 To 15
myStr = movePhone(Cells(i, 2), i)
Cells(i, 5) = myStr
Next i
End Sub
Function movePhone(s As String, rowNum As Integer) As String
Dim retval As String ' This is the return string to be copied to column 5
Dim j As Integer ' Counter for character position.
Dim sLen As Integer
retval = ""
sLen = Len(s)
For j = 1 To sLen
If Mid(s, j, 1) >= "0" And Mid(s, j, 1) <= "9" Then
retval = retval + Mid(s, j, 1)
Cells(rowNum, 2) = Mid(Cells(rowNum, 2), j, 1) ' remove digit from cell
sLen = sLen - 1
j = j - 1
ElseIf Mid(s, j, 1) = "-" Then
retval = retval + Mid(s, j, 1)
Cells(rowNum, 2) = Mid(Cells(rowNum, 2), j, 1) ' remove "-" char from cell
sLen = sLen - 1
j = j - 1
End If
Next
movePhone = retval
End Function
Instead of adjusting j so much just so that you can go left to right, go right to left, like so:
For j = sLen To 1 Step -1
Now you can omit the line j = j - 1 and it will no longer give you this problem.
Always start at the end and work your way to the start when removing things from strings or even data sets; it's much easier than manipulating the variable you are looping on.
In this instance you should consider using RegExp (regular expressions). I have not used them before, but they will find number strings within strings (and much more if you want). Search SO for regexp and you will find plenty of code to help you out.
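The right-to-left idea can be sketched roughly as follows. This is an illustrative rewrite, not the original poster's code: the name SplitPhone and the ByRef nameOnly parameter are assumptions, and it treats every digit and every hyphen as part of the phone number, so a hyphenated name such as "Anne-Marie" would lose its hyphen.

```vba
Function SplitPhone(ByVal s As String, ByRef nameOnly As String) As String
    Dim j As Long
    Dim ch As String, phone As String

    nameOnly = s
    ' Walk from the last character to the first, so removing a character
    ' never shifts the positions that are still to be visited
    For j = Len(s) To 1 Step -1
        ch = Mid$(s, j, 1)
        If (ch >= "0" And ch <= "9") Or ch = "-" Then
            phone = ch & phone           ' prepend to keep the original order
            nameOnly = Left$(nameOnly, j - 1) & Mid$(nameOnly, j + 1)
        End If
    Next j

    nameOnly = Trim$(nameOnly)           ' drop the space left next to the number
    SplitPhone = phone
End Function

Sub loop_macro()
    Dim i As Long
    Dim nameOnly As String
    For i = 5 To 15
        Cells(i, 5).Value = SplitPhone(CStr(Cells(i, 2).Value), nameOnly)
        Cells(i, 2).Value = nameOnly
    Next i
End Sub
```

The loop counter is never modified inside the loop, which is what caused the endless loop in the original, and the worksheet is written to once per row instead of once per character.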

Simple Histogram in VBA?

I have data stored in some column (Say, Column A). The length of Column A is not fixed (depends on previous steps in the code).
I need a histogram for the values in Column A, and have it in the same sheet. I need to take the values in column A, and automatically compute M Bins, then give the plot.
I looked online for a "simple" code, but all codes are really fancy, with tons of details that I don't need, to the extent that I am not even able to use it. (I am a VBA beginner.)
I found the following code that seems to do the job, but I am having trouble even calling the function. Besides, it only does computations but does not make the plot.
Sub Hist(M As Long, arr() As Single)
Dim i As Long, j As Long
Dim Length As Single
ReDim breaks(M) As Single
ReDim freq(M) As Single
For i = 1 To M
freq(i) = 0
Next i
Length = (arr(UBound(arr)) - arr(1)) / M
For i = 1 To M
breaks(i) = arr(1) + Length * i
Next i
For i = 1 To UBound(arr)
If (arr(i) <= breaks(1)) Then freq(1) = freq(1) + 1
If (arr(i) >= breaks(M - 1)) Then freq(M) = freq(M) + 1
For j = 2 To M - 1
If (arr(i) > breaks(j - 1) And arr(i) <= breaks(j)) Then freq(j) = freq(j) + 1
Next j
Next i
For i = 1 To M
Cells(i, 1) = breaks(i)
Cells(i, 2) = freq(i)
Next i
End Sub
And then I try to call it simply by:
Sub TestTrial()
Dim arr() As Variant
Dim M As Double
Dim N As Range
arr = Range("A1:A10").Value
M = 10
Hist(M, arr) ' This does not work. Gives me Error (= Expected)
End Sub
A little late, but I still want to share my solution. I created a Histogram function that can be used as an array formula in the Excel spreadsheet. Note: you must press CTRL+SHIFT+ENTER to enter the formula into your workbook. The inputs are the range of values and the number M of bins for the histogram. The output range must have M rows and two columns: one column for the bin value (the upper edge of each bin) and one column for the bin frequency.
Option Explicit
Option Base 1
Public Function Histogram(arr As Range, M As Long) As Variant
On Error GoTo ErrHandler
Dim val() As Variant
val = arr.Value
Dim i As Long, j As Integer
Dim Length As Single
ReDim breaks(M) As Single
ReDim freq(M) As Integer
Dim min As Single
min = WorksheetFunction.min(val)
Dim max As Single
max = WorksheetFunction.max(val)
Length = (max - min) / M
For i = 1 To M
breaks(i) = min + Length * i
freq(i) = 0
Next i
For i = 1 To UBound(val)
If IsNumeric(val(i, 1)) And Not IsEmpty(val(i, 1)) Then
j = Int((val(i, 1) - min) / Length) + 1
If j > M Then j = M ' the maximum value would otherwise land in bin M + 1
If j < 1 Then j = 1 ' guard against Single rounding of min
freq(j) = freq(j) + 1
End If
Next i
Dim res() As Variant
ReDim res(M, 2)
For i = 1 To M
res(i, 1) = breaks(i)
res(i, 2) = freq(i)
Next i
Histogram = res
ErrHandler:
'Debug.Print Err.Description
End Function
Not 100% sure as to the efficacy of that approach, but:
Remove the parentheses, as you're calling a Sub: Hist M, arr
M is declared as Double but received by the Sub as a Long; this won't work, so declare it in the calling routine as Long
You will need to receive arr() As Variant
Range -> Array produces a 2-dimensional array, so the elements are arr(1, 1) .. arr(n, 1)
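Put together, the calling side might look like the sketch below. ShowFirstAndLast is a hypothetical stand-in for Hist, used only to show the call syntax and the 2-dimensional indexing; Hist itself would still need its arr(i) references changed to arr(i, 1).

```vba
' Hypothetical stand-in for Hist: receives the bin count and the 2-D array
Sub ShowFirstAndLast(ByVal M As Long, ByRef arr As Variant)
    ' A range read into a Variant is always indexed (row, column), starting at 1
    Debug.Print M, arr(1, 1), arr(UBound(arr, 1), 1)
End Sub

Sub TestTrial()
    Dim arr As Variant
    Dim M As Long                        ' Long, to match the parameter type

    arr = Range("A1:A10").Value          ' elements arr(1, 1) .. arr(10, 1)
    M = 10

    ShowFirstAndLast M, arr              ' no parentheses when calling a Sub...
    Call ShowFirstAndLast(M, arr)        ' ...unless the Call keyword is used
End Sub
```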
