Excel Exact Word Matching - excel

Let's say I have "Vegas is great" in cell A1. I want to write a formula that looks for the exact word, "gas" in cells. Vegas ≠ gas, but the only search formula I'm finding:
=ISNUMBER(SEARCH("gas",lower(A1))
returns true. Is there anyway to do do exact matching? I'd ideally like it to be non-case sensitive which I believe is satisfied by wrapping A1 in lower().

I believe to correctly cover cases you have to pad spaces before and after the term "gas" and the search term. This will ensure that gas will be found at the beginning or end of a cell, and also prevent it from being found in the middle of any words. Your post does not indicate whether punctuation can exist in the file, but to accomodate punctuation padding spaces around the search will not work correctly, you would have to include the case of " gas. " " gas! " etc to allow for any punctuation specifically. If you are worried about catching values like "gas.cost" or similar you can use the same padding around the punctuation search.
=Or(ISNUMBER(SEARCH(" gas ", " "&A1&" ")),ISNUMBER(SEARCH(" gas. ", " "&A1&" ")))
Is a basic search that should return the word gas by itself, or "gas." By padding a space after "gas." in the search it will find it as the final word in a sentence, or at the end of a cell.
Edit: Dropped a parentheses.

The Find function is case sensitive. The SEARCH function is not. There is no need for the LOWER function if you are using SEARCH.
SEARCH(<find_text>, <within_text>, [optional]<start_num>)
Wrap both the find_text and within_text in spaces and perform your SEARCH.
        
The formula in B1 is,
=ISNUMBER(SEARCH(" gas ", " "&A1&" "))
Fill down as necessary.

One can also use regular expressions in VBA to accomplish this. In Regular Expressions, "\b" represents a word boundary. A word boundary is defined as the position between a word and a non-word character or the beginning or end of the line. Word characters are [A-Za-z0-9_] (letters, digits, and the underscore). Hence, one can use this UDF. You do need to be aware that words which include non-word characters (e.g. a hyphen) may be treated differently than you expect. And if you are dealing with non-English letters, the Pattern would need to be modified.
But the code is fairly compact.
Option Explicit
Function reFindWord(FindWord As String, SearchText As String, Optional MatchCase As Boolean = False) As Boolean
Dim RE As Object
Dim sPattern As String
Set RE = CreateObject("vbscript.regexp")
sPattern = "\b" & FindWord & "\b"
With RE
.Pattern = sPattern
.ignorecase = Not MatchCase
reFindWord = .test(SearchText)
End With
End Function

I think the only way to cover all possible punctuation surrounding the search word is to create a custom macro function. Use the enhanced split function to tokenize the sentence into an array of words then search the array for a match.
Enhanced split function
https://msdn.microsoft.com/en-us/library/aa155763
How to create custom macro
http://www.wikihow.com/Create-a-User-Defined-Function-in-Microsoft-Excel
Code to create FindEngWord function
Public Function FindEngWord(ByVal TextToSearch As String, ByVal WordToFind As String) As Boolean
Dim WrdArray() As String
Dim text_string As String
Dim isFound As Boolean
isFound = False
text_string = TextToSearch
WrdArray() = Split(text_string)
isFound = False
For i = 0 To UBound(WrdArray)
If LCase(WrdArray(i)) = LCase(WordToFind) Then
isFound = True
End If
Next i
FindEngWord = isFound
End Function
Public Function Split(ByVal InputText As String, _
Optional ByVal Delimiter As String) As Variant
' This function splits the sentence in InputText into
' words and returns a string array of the words. Each
' element of the array contains one word.
' This constant contains punctuation and characters
' that should be filtered from the input string.
Const CHARS = ".!?,;:""'()[]{}"
Dim strReplacedText As String
Dim intIndex As Integer
' Replace tab characters with space characters.
strReplacedText = Trim(Replace(InputText, _
vbTab, " "))
' Filter all specified characters from the string.
For intIndex = 1 To Len(CHARS)
strReplacedText = Trim(Replace(strReplacedText, _
Mid(CHARS, intIndex, 1), " "))
Next intIndex
' Loop until all consecutive space characters are
' replaced by a single space character.
Do While InStr(strReplacedText, " ")
strReplacedText = Replace(strReplacedText, _
" ", " ")
Loop
' Split the sentence into an array of words and return
' the array. If a delimiter is specified, use it.
'MsgBox "String:" & strReplacedText
If Len(Delimiter) = 0 Then
Split = VBA.Split(strReplacedText)
Else
Split = VBA.Split(strReplacedText, Delimiter)
End If
End Function
Can be called from your excel sheet with this.
=FindEngWord(A1,"gas")

I think this will handle all the cases that you are planning to handle:
=OR(ISNUMBER(SEARCH(" gas",LOWER(A1), 1 )), LEFT(A1,3)= "gas")
I added a space before the "gas" in the search. And if the gas was the only word in the cell or the first word in the cell, the right part of this function handles that case.

Related

How to separate a string and counting the number of characters in a string minus the spaces

I need help with my code that displays an input box and the user inputs a name then the code splits the names and counts the names displaying the following:
Sub ParseName()
Dim Name As String
Dim Count As Integer
Dim Cell As Object
Dim n As Integer
Count = 0
Name = InputBox("Enter First Name, Middle Name, and Last Name")
If Name = "" Then
For Each Cell In Selection
n = InStr(1, Cell.Value, Name)
While n <> 0
Count = Count + 1
n = InStr(n + 1, Cell.Value, Name)
Next Cell
MsgBox Count & " Occurrences of " & Name
End If
End Sub
Absolutely. You could split the names, but it probably wouldn't help. VBA's split doesn't allow you to split on single characters, AFAIK, as other languages might if you split them a specific delimiter. So you could just loop through the characters using MID to see if each letter is a space or not.
There is a way without any splitting or looping. You can just replace the space and get the length of the what's left.
Len(Replace(Name, " ", ""))
where REPLACE just replaces one string with another, in this case replacing all the spaces with nothing, and LEN just counts the characters in a string.
Here's your code rewritten to use this method, with the unnecessary code and variables removed. I would also change the Name variable, since I believe that is a reserved word in VBA. It will let you do it, but you're potentially impacting some existing behavior. For this particular purpose, using a standalone function to get the character count is someh
Sub ParseName()
Dim fullName As String, charCount As Integer
fullName = InputBox("Enter First Name, Middle Name, and Last Name")
If fullName <> "" Then
charCount = Len(Replace(fullName, " ", ""))
MsgBox fullName & " has " & charCount & " characters"
End If
End Sub
Bear in mind, however, that there are plenty of other character codes you might not want to count. Tabs, new lines, any of a number of whitespace characters. Non-character symbols. Things of that nature.
Also, this code does not check that the string even contains letters, or that the user input has three names, or that it is in the format First Middle Last.

How to do a proper keyword search

If I search for the term 'tfo' in the cell value 'TFO_xyz' then the result should be TRUE.
If I search for the term 'tfo' in the cell value 'TFO systems' then the result should be TRUE.
If I search for the term 'tfo' in the cell value 'spring TFO' then the result should be TRUE.
BUT if I check 'tfo' in the cell value 'Platform' then I want the result as FALSE
I have used the formula =IF(COUNTIF(A2,"*tfo*"),"TRUE","FALSE"), but this wont give result as FALSE when I check 'tfo' in the word 'Platform'
NOTE:
Platform should be false because tfo is coming in between a word. I'm looking result as True for cell values with just the word tfo like in tfo<‌space>America or TFO_America or <‌space>TFO systems. But I want FALSE result for the words Platform and portfolio because in these two words the term tfo comes in between alphabets.
Try this:
Dim x As Long: x = 1
With Sheet1
Do While x <= .Cells(.Rows.Count, 1).End(xlUp).Row
If VBA.Left(.Cells(x, 1).Value, 3) = "tfo" Or VBA.Right(.Cells(x, 1).Value, 3) = "tfo" Then
.Cells(x, 2).Value = True
End If
x = x + 1
Loop
End With
Try this formula. This assumes that word tfo will be at the beginning or end
Just make sure to place appropriate cell names where i have 'A2' in the formula
=IF(OR(PROPER(LEFT(A2,3))="tfo",PROPER(RIGHT(A2,3))="tfo"),TRUE,FALSE)
Test Cases Below:
My suggestion is to spend sometime to know your data and create a white-list.
Since there is no easy way to properly do fuzzy search in strings.
Function TFO_Search(strText As String) As Boolean
Dim ArryString As Variant
Dim ArryWhitelist As Variant
' Create a White-List Array
ArryWhitelist = Array("TFO_", "TFO ", "_TFO", " TFO", "tfoAmerica")
For Each ArryString In ArryWhitelist
If InStr(UCase(strText), UCase(ArryString)) > 0 Then 'force to UPPER CASE
TFO_Search = True
Exit Function
Else
TFO_Search = False
End If
Next
End Function
I see two dimensions of complexity in your question:
Where does the key word occur in the text (beginning, middle, end)
What are the characters that separate words.
The first one is fixed size, you need to handle three cases. The second one depends on the number of characters you want to accept as delimiters. Below I assumed that you accept space and underscore, however, you may expand this set by inserting more SUBSTITUTE function calls.
In my table, $A2 is the cell in which you search for the keyword, while B$1 contains the keyword.
To standardize the separator character, you need the formula:
B2=SUBSTITUTE($A2,"_"," ")
To check if the string starts with the keyword:
C2=--(LEFT($B2,LEN(B$1)+1)=B$1&" ")
To check if the string ends with the keyword:
D2=--(RIGHT($B2,LEN(B$1)+1)=" "&B$1)
To check if the keyword is in the middle of the string:
E2=--(LEN(SUBSTITUTE(UPPER($B2)," "&UPPER(B$1)&" ",""))<LEN($B2))
To evaluate the above three cases:
F2=--(0<$C2+$D2+$E2)
If you want to use a single cell, combine the formulas into:
G2=--(0<--(LEFT(SUBSTITUTE($A2,"_"," "),LEN(B$1)+1)=B$1&" ")+--(RIGHT(SUBSTITUTE($A2,"_"," "),LEN(B$1)+1)=" "&B$1)+--(LEN(SUBSTITUTE(UPPER(SUBSTITUTE($A2,"_"," "))," "&UPPER(B$1)&" ",""))<LEN(SUBSTITUTE($A2,"_"," "))))
It is not very readable in the end but I don't think there was an easier solution using Formulas only.
Note: If you want to modify the set of characters accepted as delimiters, add more SUBSTITUTE function calls to B2, then copy the Formula of F2 into notepad and replace $C2 with the formula of C2, etc., then replace $B2 with the updated Formula of B2.
Update
Building on the idea in Ron Rosenfelds comment to tigeravatar's answer, the formula can be simplified (the beginning, middle, ending cases can be joined):
=--(LEN(SUBSTITUTE(" "&UPPER($B2)&" "," "&UPPER(B$1)&" ",""))<LEN($B2))
After substituting $B2 with its formula:
=--(LEN(SUBSTITUTE(" "&UPPER(SUBSTITUTE($A2,"_"," "))&" "," "&UPPER(B$1)&" ",""))<LEN(SUBSTITUTE($A2,"_"," ")))
This formula will return true if TFO is at the beginning or end of any given word, or by itself, in the text string. It also checks every word in the text string, so TFO can be at beginning, middle, or end. The formula assumes that if a word starts or ends with TFO, then the result should be TRUE (as is the case for tfoAmerica so same rule would apply to tform), else FALSE.
=OR(ISNUMBER(SEARCH({" tfo","tfo "}," "&SUBSTITUTE(A2,"_"," ")&" ")))
Here are its results:
EDIT:
In the event that the result should only be TRUE if TFO is found by itself, then this version of the formula will suffice:
=ISNUMBER(SEARCH(" tfo "," "&SUBSTITUTE(A2,"_"," ")&" "))
Image showing results of second version:
If you can rely on VBA, then regex is a more flexible solution.
There is a good summary, of how to use them in VBA: How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops
For your keyword search problem I wrote the following:
Option Explicit
' Include: Tools > References > Microsoft VBScript Regular Expressions 5.5 (C:\Windows\SysWOW64\vbscript.dll\3)
Public Function SearchKeyWord(strHay As String, strNail As String, Optional strDelimiters As String = " _,.;/", Optional lngNthOccurrence As Long = 1) As Long ' Returns 1-based index of nth occurrence or 0 if not found
Dim strPattern As String: strPattern = CreatePattern(strNail, strDelimiters)
Dim rgxKeyWord As RegExp: Set rgxKeyWord = CreateRegex(strPattern, True)
Dim mtcResult As MatchCollection: Set mtcResult = rgxKeyWord.Execute(strHay)
If (0 <= lngNthOccurrence - 1) And (lngNthOccurrence - 1 < mtcResult.Count) Then
Dim mthResult As Match: Set mthResult = mtcResult(lngNthOccurrence - 1)
SearchKeyWord = mthResult.FirstIndex + Len(mthResult.SubMatches(0)) + 1
Else
SearchKeyWord = 0
End If
End Function
Private Function CreateRegex(strPattern As String, Optional blnIgnoreCase As Boolean = False, Optional blnMultiLine As Boolean = True, Optional blnGlobal As Boolean = True) As RegExp
Dim rgxResult As RegExp: Set rgxResult = New RegExp
With rgxResult
.Pattern = strPattern
.IgnoreCase = blnIgnoreCase
.MultiLine = blnMultiLine
.Global = blnGlobal
End With
Set CreateRegex = rgxResult
End Function
Private Function CreatePattern(strNail As String, strDelimiters As String) As String
Dim strDelimitersEscaped As String: strDelimitersEscaped = RegexEscape(strDelimiters)
Dim strPattern As String: strPattern = "(^|[" & strDelimitersEscaped & "]+)(" & RegexEscape(strNail) & ")($|[" & strDelimitersEscaped & "]+)"
CreatePattern = strPattern
End Function
Private Function RegexEscape(strOriginal As String) As String
Dim strEscaped As String: strEscaped = vbNullString
Dim i As Long: For i = 1 To Len(strOriginal)
Dim strChar As String: strChar = Mid(strOriginal, i, 1)
Select Case strChar
Case ".", "$", "^", "{", "[", "(", "|", ")", "*", "+", "?", "\"
strEscaped = strEscaped & "\" & strChar
Case Else
strEscaped = strEscaped & strChar
End Select
Next i
RegexEscape = strEscaped
End Function
Once you have the above in a Module, you can insert formulas like the following:
=SearchKeyWord($A1,"tfo")
where A1 contains e.g. "tfo America".
As a third parameter, you may specify, which characters you want to treat as delimiters, by default they are space, underscore, comma, dot, semicolon and slash.
The return value is the position of the nth occurrence of the keyword, where n is the value of the fourth parameter (default: 1), or 0 if not found.
To check if the keyword is present in A1, compare the result to 0, which means not found:
=--(SearchKeyWord($A1,"tfo")<>0)

How to remove all numeric characters separated by white space from an Excel cell?

I need to remove the numeric characters that are separated by white space ONLY in a text string in an Excel cell. For example I have:
johndoe99#mail.com 1 concentr8 on work VARIABLE1 99
I need to get:
johndoe99#mail.com concentr8 on work VARIABLE1
Either formula or VBA script solution is good. Thank you.
I think nomad is right that regex is probably a simpler option. However, I also think that by using the Split() and isNumeric() functions I've come up with a good solution here.
Sub test()
Dim cell As Range
For Each cell In Range("A1:A10") 'adjust as necessary
cell.Value2 = RemoveNumbers(cell.Value2)
Next cell
End Sub
Function RemoveNumbers(ByVal inputString As String) As String
Dim tempSplit As Variant
tempSplit = Split(inputString, " ")
Dim result As String
Dim i As Long
For i = LBound(tempSplit) To UBound(tempSplit)
If Not IsNumeric(tempSplit(i)) Then result = result & " " & tempSplit(i)
Next i
RemoveNumbers = Trim$(result)
End Function
UDF
Function RemNum(cell)
With CreateObject("VBScript.RegExp")
.Global = True: .Pattern = "\s\d+"
RemNum = .Replace(cell, vbNullString)
End With
End Function
Note that in addition to testing for spaces before and after, this also tests for the beginning or end of the string as a delimiter.
You did not indicate the case where the number is the only contents of the string. This routine will remove it but, if you want something else, specify.
Try this:
Function remSepNums(S As String) As String
With CreateObject("vbscript.regexp")
.Global = True
.Pattern = "(?:\s+|^)(?:\d+)(?=\s+|$)"
.MultiLine = True
remSepNums = .Replace(S, "")
End With
End Function
Just for fun, if you have a recent version of Excel (Office 365/2016) you can use the following array formula:
=TEXTJOIN(" ",TRUE,IF(NOT(ISNUMBER(FILTERXML("<t><s>"&SUBSTITUTE(TRIM(A1)," ","</s><s>")&"</s></t>","//s"))),FILTERXML("<t><s>"&SUBSTITUTE(TRIM(A1)," ","</s><s>")&"</s></t>","//s"),""))
FILTERXML can be used to split the string into an array of words, separated by spaces
If any word is not a number, return that word, else return a null string
Then join the segments using the TEXTJOIN function.

Remove words that contain each other and leave the longer one

I'm looking for a macro (preferably a function) that would take cell contents, split it into separate words, compare them to one another and remove the shorter words.
Here's an image of what I want the output to look like (I need the words that are crossed out removed):
I tried to write a macro myself, but it doesn't work 100% properly because it's not taking the last words and sometimes removes what shouldn't be removed. Also, I have to do this on around 50k cells, so a macro takes a lot of time to run, that's why I'd prefer it to be a function. I guess I shouldn't use the replace function, but I couldn't make anything else work.
Sub clean_words_containing_eachother()
Dim sht1 As Worksheet
Dim LastRow As Long
Dim Cell As Range
Dim cell_value As String
Dim word, word2 As Variant
Set sht1 = ActiveSheet
col = InputBox("Which column do you want to clear?")
LastRow = sht1.Cells(sht1.Rows.Count, col).End(xlUp).Row
Let to_clean = col & "2:" & col & LastRow
For i = 2 To LastRow
For Each Cell In sht1.Range(to_clean)
cell_value = Cell.Value
cell_split = Split(cell_value, " ")
For Each word In cell_split
For Each word2 In cell_split
If word <> word2 Then
If InStr(word2, word) > 0 Then
If Len(word) < Len(word2) Then
word = word & " "
Cell = Replace(Cell, word, " ")
ElseIf Len(word) > Len(word2) Then
word2 = word2 & " "
Cell = Replace(Cell, word2, " ")
End If
End If
End If
Next word2
Next word
Next Cell
Next i
End Sub
Assuming that the retention of the third word in your first example is an error, since books is contained later on in notebooks:
5003886 book books bound case casebound not notebook notebooks office oxford sign signature
and also assuming that you would want to remove duplicate identical words, even if they are not contained subsequently in another word, then we can use a Regular Expression.
The regex will:
Capture each word
look-ahead to see if that word exists later on in the string
if it does, remove it
Since VBA regexes cannot also look-behind, we work-around this limitation by running the regex a second time on the reversed string.
Then remove the extra spaces and we are done.
Option Explicit
Function cleanWords(S As String) As String
Dim RE As Object, MC As Object, M As Object
Dim sTemp As String
Set RE = CreateObject("vbscript.regexp")
With RE
.Global = True
.Pattern = "\b(\w+)\b(?=.*\1)"
.ignorecase = True
'replace looking forward
sTemp = .Replace(S, "")
' check in reverse
sTemp = .Replace(StrReverse(sTemp), "")
'return to normal
sTemp = StrReverse(sTemp)
'Remove extraneous spaces
cleanWords = WorksheetFunction.Trim(sTemp)
End With
End Function
Limitations
punctuation will not be removed
a "word" is defined as containing only the characters in the class [_A-Za-z0-9] (letters, digits and the underscore).
if any words might be hyphenated, or contain other non-word characters
in the above, they will be treated as two separate words
if you want it treated as a single word, then we might need to change the regex
General steps:
Write cell to array (already working)
for each element (x), go through each element (y) (already working)
if x is in y AND y is longer that x THEN set x to ""
concat array back into string
write string to cell
String/array manipulations are much faster than operations on cells, so this will give you some increase in performance (depending on the amount of words you need to replace for each cell).
The "last word problem" might be that you dont have a space after the last word within your cells, since you only replace word + " " with " ".

Excel Find alphanumeric string from text

I have a column containing multiple string values, like a sentence.
in that sentence i want to find one or all alphanumeric values of 10 or more characters containing atleast one - , and put the resulting values in another column.
For example:
the column containing sentence is like:
upgrade 15.07.2010, old No: WI82-01062. User moved to No: WI12-01012 02.04.2012 to a 2 user network.
or
Upgrade from lite 7/6/07, old No: PTX7-89C367EC5052-01211
Ideally I want a column with values like WI82-01062, WI12-01012 for the first example, and PTX7-89C367EC5052-01211 for the second example.
May be searching for the - in the string and finding the first occurrence of blank space at both ends would help, but I do not have any clue how to write that in excel term.
Thanks
You could probably use a regex like this (there may be better patterns!):
Function ExtractData(r As Variant) As String
Static oRE As Object
Dim sTemp As String
Dim n As Long
Dim matches
If oRE Is Nothing Then
Set oRE = CreateObject("vbscript.regexp")
With oRE
.Pattern = "[A-Za-z0-9\-]{10,}"
.Global = True
End With
End If
Set matches = oRE.Execute(r)
If matches.Count > 0 Then
For n = 1 To matches.Count
sTemp = sTemp & ", " & matches(n - 1)
Next n
ExtractData = Mid$(sTemp, 3)
End If
End Function

Resources