I have addresses in columns U and V. I want to check whether they are somewhat similar: if they are, return "Update"; if not, return "Omit".
For example 246 N High street in U and 246 North High St in V would return a value of Update.
246 N High Street in U and 458 Auburn Drive in V would return a value of Omit.
Any ideas?
There are a lot of algorithms for fuzzy matching. One of the easier ones to implement in Excel is the n-gram comparison.
To perform an n-gram match, we break each address into a list of overlapping character sequences of length n. A 2-gram list of your address 246 N High street would look like 24,46,6 , N,N , H,Hi,ig,gh,h , s,st,tr,re,ee,et. We could do the same with a 3-gram: 246,46 ,6 N, N ,N H, Hi,Hig,igh,gh ,h s, st,str,tre,ree,eet.
We do this with both addresses, then we can check each item in the first address's list to see if it appears in the second address's list; count the matches and divide that by the number of items in the first list. That will give you a percentage of how close they are.
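To make the counting concrete, here is a minimal Python sketch of the same idea, handy for checking the numbers outside Excel. The function names are mine, not part of any library, and the comparison is case-sensitive, so "st" and "St" count as different grams.

```python
def ngram_list(text, n):
    """Return every run of n consecutive characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_compare(first, second, n=2):
    """Fraction of first's n-grams that also occur in second (case-sensitive)."""
    grams_first = ngram_list(first, n)
    if not grams_first:  # first is shorter than n, so there is nothing to compare
        return 0.0
    grams_second = set(ngram_list(second, n))
    matches = sum(1 for gram in grams_first if gram in grams_second)
    return matches / len(grams_first)


print(ngram_compare("246 N High street", "246 North High St"))  # 0.5625
print(ngram_compare("246 N High Street", "458 Auburn Drive"))   # 0.0
```

Note that the score is not symmetric: it is measured against the first address's list, so swapping the arguments can give a different number.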
You could get fancy and do this with sheet formulas built on MID() and COUNTIF(), but I think it's easier to just write it out in VBA and make it a UDF.
Function NGramCompare(string1 As String, string2 As String, intGram As Integer) As Double
    'Take in two strings and the n-gram size; return the share of string1's n-grams found in string2
    'Note: the gram lists are comma-delimited, so this assumes the strings contain no commas
    Dim intChar As Integer, intGramMatch As Integer
    Dim ngramList1 As String, ngramList2 As String, nGram As Variant
    Dim nGramArr1 As Variant

    'Nothing to compare if the first string is shorter than one n-gram (also avoids dividing by zero)
    If Len(string1) < intGram Then Exit Function

    'Split the first string into a list of n-grams
    For intChar = 1 To Len(string1) - (intGram - 1)
        If ngramList1 <> "" Then ngramList1 = ngramList1 & ","
        ngramList1 = ngramList1 & Mid(string1, intChar, intGram)
    Next intChar

    'Split the second string into a list of n-grams
    For intChar = 1 To Len(string2) - (intGram - 1)
        If ngramList2 <> "" Then ngramList2 = ngramList2 & ","
        ngramList2 = ngramList2 & Mid(string2, intChar, intGram)
    Next intChar

    'Split ngramList1 into an array through which we can iterate
    nGramArr1 = Split(ngramList1, ",")

    'Iterate through the array and compare values to ngramList2
    For Each nGram In nGramArr1
        If InStr(1, ngramList2, nGram) Then
            'We found a match, add to the counter
            intGramMatch = intGramMatch + 1
        End If
    Next nGram

    'Output the fraction of grams matching
    NGramCompare = intGramMatch / (UBound(nGramArr1) + 1)
End Function
If you've never used a UDF:
Open the Visual Basic Editor (VBE) with Alt+F11
In the VBA Project window, find your workbook and right-click on the name
Choose: Insert>>Module
Double-click the new module in the list to bring up its code window
Paste this function in and save your workbook (as a macro-enabled .xlsm file, otherwise the code is not kept)
Then, assuming address1 is in A1 and address2 is in B1 you can put, in C1:
=NGramCompare(A1, B1, 2)
For your first pair of addresses this returns 0.5625, or about 56%, which seems like a reasonably good match. If you find you are getting too many false positives, you can switch from a 2-gram to a 3-gram by changing that last parameter.
To take it a step further so it will say "Update" or "Omit" you could do:
=If(NGramCompare(A1, B1, 2)>.30, "Update", "Omit")
I set the threshold so that anything above 30% counts as a match, but you can adjust it as necessary. No matter where you set it, you will probably end up with some comparisons that are false positives or false negatives, but that's the way fuzzy matching goes.
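As a rough illustration of how the threshold and letter case interact, here is a small self-contained Python sketch. The names `ngram_score` and `classify` and the `ignore_case` switch are my own additions for illustration; the VBA function above compares case-sensitively.

```python
def ngram_score(first, second, n=2):
    """Fraction of first's n-grams that also occur in second."""
    grams_first = [first[i:i + n] for i in range(len(first) - n + 1)]
    grams_second = {second[i:i + n] for i in range(len(second) - n + 1)}
    if not grams_first:
        return 0.0
    return sum(gram in grams_second for gram in grams_first) / len(grams_first)


def classify(first, second, threshold=0.30, n=2, ignore_case=True):
    """Return "Update" when the score is above the threshold, else "Omit"."""
    if ignore_case:  # assumption: "St" and "st" should count as the same text
        first, second = first.lower(), second.lower()
    return "Update" if ngram_score(first, second, n) > threshold else "Omit"


print(ngram_score("246 n high street", "246 north high st"))  # 0.6875
print(classify("246 N High street", "246 North High St"))     # Update
print(classify("246 N High Street", "458 Auburn Drive"))      # Omit
```

Lower-casing first lifts the first pair from about 56% to about 69%, because " s" and "st" now match. It can also nudge unrelated pairs up slightly, so the threshold may need revisiting if you normalize case.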
Some of the naive approaches are to compare the first few characters:
=LEFT(A1,5)=LEFT(B1,5)
or to replace parts until they match (the replacement text has to be lower case too, because LOWER has already been applied):
=SUBSTITUTE(SUBSTITUTE(LOWER(A2)," street"," st")," north "," n ")=SUBSTITUTE(SUBSTITUTE(LOWER(B2)," street"," st")," north "," n ")
Both will probably turn into a big ugly formula after adjusting for most cases.
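The replace-until-they-match idea is easier to see with a lookup table than with nested SUBSTITUTE calls. Here is a minimal Python sketch; the `ABBREVIATIONS` table is a made-up, deliberately incomplete example, and real address data would need a much longer list.

```python
# Hypothetical abbreviation table: extend it for the spellings in your own data.
ABBREVIATIONS = {
    "street": "st",
    "drive": "dr",
    "avenue": "ave",
    "north": "n",
    "south": "s",
    "east": "e",
    "west": "w",
}


def normalize_address(address):
    """Lower-case the address, drop periods, and abbreviate known words."""
    words = address.lower().replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def same_address(first, second):
    """True when both addresses normalize to exactly the same text."""
    return normalize_address(first) == normalize_address(second)


print(normalize_address("246 North High St"))                  # 246 n high st
print(same_address("246 N High street", "246 North High St"))  # True
print(same_address("246 N High Street", "458 Auburn Drive"))   # False
```

This only catches differences that are in the table, so a typo such as "Hgih" still counts as a mismatch, which is where a fuzzy score helps.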
I want to join a list of words in Excel (not in VBA... with an Excel formula in the worksheet) to the following specifications:
Formula should ignore empty cells.
Formula should concatenate the words with "and" before final item if there is more than one item in the array of cells.
Formula should add "," between items if there are more than two items.
Examples:
A1=dog
A2=cat
A3=bird
A4=fish
Result would be: dog, cat, bird, and fish
A1=dog
A2=cat
A3=(empty cell)
A4=fish
Result would be: dog, cat, and fish
A1=dog
A2=(empty cell)
A3=bird
A4=(empty cell)
Result would be: dog and bird
A1=dog
A2=(empty cell)
A3=(empty cell)
A4=(empty cell)
Result would be: dog
Pretty please? I promise I've searched and searched for the answer.
Edit: Thank you, ExcelArchitect, I got it! This was the first time I'd ever used a custom function. You use it just like any other function in the worksheet! This is so great.
Not to push my luck, but how do I get two cells to concatenate with my result if there is only one word in the result, and two other cells if there is more than one word? Example: If the function you made for me returns just "dog", I'd want it to concatenate a cell with the text (B1) "My favorite thing to wear is a " and then "dog" and then another cell (B2) that says " costume." to make the sentence "My favorite thing to wear is a dog costume." But if it returns more than one animal, it would concatenate two other cells like this: Cell C1 "My favorite things to wear are " and "dog, cat, and bird" and Cell C2 " costumes." so that it would say "My favorite things to wear are dog, cat, and bird costumes."
If you're curious, my data really has nothing to do with animals or costumes. I am writing a program that will score a psychological test and then create an interpretive report from the test scores (I'm a psychologist).
-Mary Anne
Mary Anne:
This would be a great time to use VBA! But if you don't want to, there is a way to accomplish your goal without it.
You have to account for all of the possible outcomes here. With 4 cells that can each be empty or filled, that means 2^4 - 1 = 15 non-empty outcomes: 4 with a single word, 6 pairs, 4 triples, and 1 with all four.
Your equation just has to take into account all 15. It is VERY long and drawn out as a result. As such, if you have more than 4 animals that you'd like to turn into phrases, you should go the VBA route.
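To see where the 15 comes from, and why a general rule scales better than one branch per combination, here is a short Python sketch. `join_with_and` is a hypothetical helper written for this illustration; it applies the three rules from the question to whatever non-empty words it is given.

```python
from itertools import product


def join_with_and(words):
    """Join non-empty words as "a", "a and b", or "a, b, and c"."""
    items = [word for word in words if word.strip()]
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


animals = ["dog", "cat", "bird", "fish"]

# Every filled/empty pattern for four cells, minus the one where all are empty.
patterns = [p for p in product([True, False], repeat=len(animals)) if any(p)]
print(len(patterns))  # 15

for pattern in patterns:
    cells = [animal if keep else "" for animal, keep in zip(animals, pattern)]
    print(join_with_and(cells))
```

With five cells the count grows to 31 and with eight to 255, which is why the nested-IF approach stops being practical quickly.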
In my setup the four words are in A2:A5, and the formula in A7 is the following:
=IF(AND(A2<>"", A3="", A4="", A5=""), A2, IF(AND(A2="", A3<>"", A4="", A5=""), A3, IF(AND(A2="", A3="", A4<>"", A5=""), A4, IF(AND(A2="", A3="", A4="", A5<>""), A5, IF(AND(A2<>"", A3<>"", A4="", A5=""), A2&" and "&A3, IF(AND(A2<>"", A3="", A4<>"", A5=""), A2&" and "&A4, IF(AND(A2<>"", A3="", A4="", A5<>""), A2&" and "&A5, IF(AND(A2="", A3<>"", A4<>"", A5=""),A3&" and "&A4, IF(AND(A2="", A3<>"", A4="", A5<>""), A3&" and "&A5, IF(AND(A2="", A3="", A4<>"", A5<>""),A4&" and "&A5, IF(AND(A2<>"", A3<>"", A4<>"", A5=""), A2&", "&A3&", and "&A4, IF(AND(A2<>"", A3<>"", A4="", A5<>""), A2&", "&A3&", and "&A5, IF(AND(A2<>"", A3="", A4<>"", A5<>""), A2&", "&A4&", and "&A5, IF(AND(A2="", A3<>"", A4<>"", A5<>""), A3&", "&A4&", and "&A5, A2&", "&A3&", "&A4&", and "&A5))))))))))))))
Mary Anne - I'm such a nerd that I had to do this. Here is the VBA solution, and you can have as many names as you want! Paste this code into a new module in the workbook (go to Developer -> Visual Basic, then Insert -> Module, and paste), then you can use it in your worksheet like a regular function. Just give it the range where the names are and you should be good to go! -Matt
Function CreatePhrase(NamesRng As Range) As String
    'Creates a comma-separated phrase given a list of words or names
    Dim Cell As Range
    Dim l As Long
    Dim cp As String

    'Add commas between the values in the cells
    For Each Cell In NamesRng
        If Not IsEmpty(Cell) And Not Cell.Value = "" And Not Cell.Value = " " Then
            cp = cp & Cell.Value & ", "
        End If
    Next Cell

    'Remove trailing comma and space
    If Right(cp, 2) = ", " Then cp = Left(cp, Len(cp) - 2)

    'If there is only one value (no commas) then quit here
    If InStr(1, cp, ",", vbTextCompare) = 0 Then
        CreatePhrase = cp
        Exit Function
    End If

    'Add "and" to the end of the phrase
    For l = 1 To Len(cp)
        If Mid(cp, Len(cp) - l + 1, 1) = "," Then
            cp = Left(cp, Len(cp) - l + 2) & "and" & Right(cp, l - 1)
            Exit For
        End If
    Next l

    'If there are only two words or names (only one comma) then remove the comma
    If InStr(InStr(1, cp, ",", vbTextCompare) + 1, cp, ",", vbTextCompare) = 0 Then
        cp = Left(cp, InStr(1, cp, ",", vbTextCompare) - 1) & Right(cp, Len(cp) - InStr(1, cp, ",", vbTextCompare))
    End If

    CreatePhrase = cp
End Function
Hope that helps!
Matt, via ExcelArchitect.com
VBA is simpler. A formula is quite complicated, since Excel has no native functions allowing concatenation of a range. However, given that you have written that you would have up to eight animals, it is doable with the following formula, which concatenates the contents of A1:A8 according to your rules. You can change the cell references in the formula to suit your layout.
I made one change: I may be wrong, but I believe English rules indicate that the comma preceding the last and should be omitted, so I did so. It could be added in if necessary. EDIT: Further investigation reveals a difference between US and UK rules: US rules are as you requested, UK rules omit the comma before the conjunction. I will modify the formulas and UDF to comply with US conventions.
In the formulas, the modification is to place a comma immediately prior to the and. The change in the UDF is likewise minor.
The formula was built up from a sequence of intermediate formulas. Putting them together so that they refer only to A1:A8, we wind up with this monster:
=SUBSTITUTE(IFERROR(SUBSTITUTE(MID(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(CONCATENATE(",",A1,",",A2,",",A3,",",A4,",",A5,",",A6,",",A7,",",A8,","),",,",","),",,",","),",,",","),2,LEN(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(CONCATENATE(",",A1,",",A2,",",A3,",",A4,",",A5,",",A6,",",A7,",",A8,","),",,",","),",,",","),",,",","))-2),",",",and ",LEN(MID(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(CONCATENATE(",",A1,",",A2,",",A3,",",A4,",",A5,",",A6,",",A7,",",A8,","),",,",","),",,",","),",,",","),2,LEN(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(CONCATENATE(",",A1,",",A2,",",A3,",",A4,",",A5,",",A6,",",A7,",",A8,","),",,",","),",,",","),",,",","))-2))-LEN(SUBSTITUTE(MID(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(CONCATENATE(",",A1,",",A2,",",A3,",",A4,",",A5,",",A6,",",A7,",",A8,","),",,",","),",,",","),",,",","),2,LEN(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(CONCATENATE(",",A1,",",A2,",",A3,",",A4,",",A5,",",A6,",",A7,",",A8,","),",,",","),",,",","),",,",","))-2),",",""))),MID(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(CONCATENATE(",",A1,",",A2,",",A3,",",A4,",",A5,",",A6,",",A7,",",A8,","),",,",","),",,",","),",,",","),2,LEN(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(CONCATENATE(",",A1,",",A2,",",A3,",",A4,",",A5,",",A6,",",A7,",",A8,","),",,",","),",,",","),",,",","))-2)),",",", ")
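Read from the inside out, the formula seems to do four things: wrap the eight cells in commas, collapse runs of commas left behind by empty cells, strip the leading and trailing comma, then turn the last comma into ",and " before spacing out every comma. Here is a Python sketch of those steps as I read them; `replace_nth` and `formula_steps` are names I made up, and the sketch assumes exactly eight cells, as in A1:A8.

```python
def replace_nth(text, old, new, n):
    """Replace only the n-th occurrence of old (1-based); unchanged if n < 1 or absent."""
    if n < 1:
        return text  # Excel's SUBSTITUTE errors here and IFERROR falls back to the input
    start = 0
    for _ in range(n):
        position = text.find(old, start)
        if position == -1:
            return text
        start = position + len(old)
    return text[:position] + new + text[position + len(old):]


def formula_steps(cells):
    """Mirror the worksheet formula for a list of eight cell values."""
    joined = "," + ",".join(cells) + ","      # CONCATENATE(",",A1,",",...,A8,",")
    for _ in range(3):                        # three nested SUBSTITUTE(x,",,",",")
        joined = joined.replace(",,", ",")
    trimmed = joined[1:-1]                    # MID(x, 2, LEN(x) - 2)
    comma_count = trimmed.count(",")          # LEN(x) - LEN(SUBSTITUTE(x, ",", ""))
    with_and = replace_nth(trimmed, ",", ",and ", comma_count)
    return with_and.replace(",", ", ")        # outer SUBSTITUTE(x, ",", ", ")


print(formula_steps(["dog", "cat", "", "fish", "", "", "", ""]))  # dog, cat, and fish
print(formula_steps(["dog", "", "bird", "", "", "", "", ""]))     # dog, and bird
print(formula_steps(["dog", "", "", "", "", "", "", ""]))         # dog
```

If this reading is right, a two-item list keeps the comma before "and" ("dog, and bird"), which differs from the "dog and bird" asked for in the question, so that case may need an extra check.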
Here is a VBA solution which allows for any number of items; it concatenates according to the same rules as above.
Option Explicit

Function ConcatRangeWithAnd(RG As Range, Optional Delim As String = ", ") As String
    'Joins the non-empty cells of RG as "a", "a and b", or "a, b, and c"
    Dim COL As Collection
    Dim C As Range
    Dim S As String
    Dim I As Long

    'Collect the displayed text of every non-empty cell
    Set COL = New Collection
    For Each C In RG
        If Len(C.Text) > 0 Then COL.Add C.Text
    Next C

    Select Case COL.Count
        Case 0
            'Nothing to join: return an empty string
            Exit Function
        Case 1
            ConcatRangeWithAnd = COL(1)
        Case 2
            ConcatRangeWithAnd = COL(1) & " and " & COL(2)
        Case Else
            'Delim after every item but the last, then "and" before the last
            For I = 1 To COL.Count - 1
                S = S & COL(I) & Delim
            Next I
            ConcatRangeWithAnd = S & "and " & COL(COL.Count)
    End Select
End Function
With the new TEXTJOIN function, this can be done very easily.
Step 1: Use the TEXTJOIN function with the ", " delimiter, and set ignore_empty to TRUE. This gives you a comma-separated, concatenated string that skips the blank cells.
Step 2: Count the non-blank entries in the list with the COUNTA function and subtract 1. That is the instance number of the last ", " in the joined string. Floor the value at 1 with the MAX function so that a single-item list does not produce an instance number of 0.
Step 3: Use the SUBSTITUTE function to replace that last ", " (the instance number from Step 2) with " and ".
Putting it all together:
=SUBSTITUTE(TEXTJOIN(", ",TRUE,A1:A14),", "," and ",MAX(1,COUNTA(A1:A14)-1))
Plug in any range you want instead of A1:A14 in the above formula, and you will get a comma-separated list with an "and" before the last word.
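For anyone who wants to check the three steps outside Excel, here is a minimal Python sketch of the same logic. The helper names and the `oxford` switch are mine; with `oxford=False` it mirrors the formula as written, which gives "bird and fish" with no comma before "and".

```python
def replace_nth(text, old, new, n):
    """Replace only the n-th occurrence of old (1-based); unchanged if absent."""
    if n < 1:
        return text
    start = 0
    for _ in range(n):
        position = text.find(old, start)
        if position == -1:
            return text
        start = position + len(old)
    return text[:position] + new + text[position + len(old):]


def textjoin_with_and(cells, oxford=False):
    """Join non-empty cells with ", " and put "and" before the last one."""
    items = [cell for cell in cells if cell != ""]   # Step 1: skip empty cells
    joined = ", ".join(items)
    instance = max(1, len(items) - 1)                # Step 2: which ", " is the last
    # Assumption: keep the comma before "and" only when there are three or more items.
    new = ", and " if oxford and len(items) > 2 else " and "
    return replace_nth(joined, ", ", new, instance)  # Step 3: replace that instance


print(textjoin_with_and(["dog", "cat", "bird", "fish"]))        # dog, cat, bird and fish
print(textjoin_with_and(["dog", "cat", "", "fish"], oxford=True))  # dog, cat, and fish
print(textjoin_with_and(["dog", "", "bird", ""]))               # dog and bird
print(textjoin_with_and(["dog", "", "", ""]))                   # dog
```

In the worksheet, the comma before "and" for three or more items could be had by choosing the replacement text with an IF on the same COUNTA value. One thing to watch: the sketch and the formula both assume no item itself contains ", ", otherwise the instance count is thrown off.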
Regarding duplicates:
Firstly, I really love Matt's solution and I've added this to my collection of custom functions.
What I do miss though is the possibility to remove duplicates from the phrase without removing them from the original range.
As you can't create a virtual range (a range that you can just play with in VBA independently from your source data), the solution would probably involve converting the range to an array, running some deduplication code and then creating the phrase from that.
My solution (albeit inelegant) is just to use the UNIQUE and FILTER functions to get a deduplicated list elsewhere on the spreadsheet (can be hidden if it bothers you) and to use Matt's function on that.
=UNIQUE(FILTER(yourRange,yourRange<>""))
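The convert-to-an-array idea mentioned above is short once written down. Here is a minimal Python sketch of "drop blanks, drop repeats, then build the phrase"; both helper names are hypothetical, and the sketch compares strings exactly, so "Dog" and "dog" count as different items.

```python
def unique_non_empty(cells):
    """Drop empty strings and later repeats, keeping first-seen order."""
    return list(dict.fromkeys(cell for cell in cells if cell != ""))


def join_with_and(items):
    """Join items as "a", "a and b", or "a, b, and c"."""
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


cells = ["dog", "cat", "", "dog", "fish", "cat"]
print(unique_non_empty(cells))                 # ['dog', 'cat', 'fish']
print(join_with_and(unique_non_empty(cells)))  # dog, cat, and fish
```

The source list is left untouched, which is the property the helper-column approach is after.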