Regexp Return position VBA - excel

I would like my regexp to return either the value matched by the pattern or its position. Similar to how the InStr function returns the position of a substring within a string, I would like to do the same with patterns. What I have so far just replaces the matches, and I cannot figure out how to have it return a position.
Sub test()
Dim regex As Object
Dim r As Range, rC As Range
Dim firstextract As Long
' cells in column A
Set r = Range("A2:A3")
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = "[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]"
' loop through the cells in column A and execute regex replace
Dim MyArray(10, 10) As Integer
For CntMtg = 1 To 100
For Each rC In r
If rC.Value <> "" Then rC.Value = regex.Replace(rC.Value, "Extract from here")
Next rC
Next
End Sub

If you don't want to replace but just get the position of a hit, use the Execute method. It returns a collection of matches (a MatchCollection). A match has three main properties:
FirstIndex is the zero-based position of the match within your string (add 1 to get an InStr-style position)
Length is the length of the match that was found
Value is the match itself that was found
If you can have more than one match within a string, you need to set the Global property of your regex to True; otherwise the collection will contain at most one match.
The following code uses early binding (as it helps to figure out properties and methods), so add a reference to Microsoft VBScript Regular Expressions 5.5.
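Dim s As String
s = "Due 12/31/2020" ' example input only; replace with your own string
Dim regex As RegExp
Set regex = New RegExp ' the object must be created before it is used
regex.Pattern = "[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]"
regex.Global = True
Dim matches As MatchCollection, match As match
Set matches = regex.Execute(s)
For Each match In matches
' FirstIndex is 0-based; add 1 for an InStr-style position
Debug.Print "Pos: " & match.FirstIndex & " Len: " & match.Length & " - Found: " & match.Value
Next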
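For background, see the documentation on the regular expression object model. Note that this page describes the .NET classes, which are similar to, but not identical with, the VBScript RegExp objects used here: https://learn.microsoft.com/en-us/dotnet/standard/base-types/the-regular-expression-object-model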
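To get an InStr-style result directly, the Execute approach above could be wrapped in a small helper. This is only a minimal sketch: the name RegexInStr is hypothetical (it is not a built-in function), and it uses late binding so that no reference needs to be set.

```vba
' Hypothetical helper (not a built-in): returns the 1-based position of the
' first match of pattern in text, or 0 if there is no match, like InStr.
Function RegexInStr(ByVal text As String, ByVal pattern As String) As Long
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = pattern
    re.Global = False ' only the first match is needed

    Dim matches As Object
    Set matches = re.Execute(text)
    If matches.Count > 0 Then
        RegexInStr = matches(0).FirstIndex + 1 ' FirstIndex is 0-based
    Else
        RegexInStr = 0
    End If
End Function
```

Called as RegexInStr(rC.Value, regex.Pattern) inside the loop from the question, it would give the position instead of replacing the text.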

Related

Remove duplicate string from cell but keep last instance of duplicate

I am using VBA in Excel to remove duplicate strings (whole words) from a cell, but keep the last instance of each duplicate.
Example
hello hi world hello => hi world hello
this is hello my hello world => this is my hello world
I am originally a Python developer, so excuse my lack of VBA syntax. I have edited a piece of code found online with the following logic:
'''
Function RemoveDupeWordsEnd(text As String, Optional delimiter As String = " ") As String
Dim dictionary As Object
Dim x, part, endword
Set dictionary = CreateObject("Scripting.Dictionary")
dictionary.CompareMode = vbTextCompare
For Each x In Split(text, delimiter)
part = Trim(x)
If part <> "" And Not dictionary.exists(part) Then
dictionary.Add part, Nothing
End If
'' COMMENT
'' if the word exists in dictionary remove previous instance and add the latest instance
If part <> "" And dictionary.exists(part) Then
dictionary.Del part, Nothing
endword = part
dictionary.Add endword, Nothing
End If
Next
If dictionary.Count > 0 Then
RemoveDupeWordsEnd = Join(dictionary.keys, delimiter)
Else
RemoveDupeWordsEnd = ""
End If
Set dictionary = Nothing
End Function
'''
Thanks, all help and guidance would be very much appreciated.
Keep the Last Occurrence of Matching Substrings
Option Explicit
Function RemoveDupeWordsEnd( _
ByVal DupeString As String, _
Optional ByVal Delimiter As String = " ") _
As String
Dim dict As Object: Set dict = CreateObject("Scripting.Dictionary")
dict.CompareMode = vbTextCompare
Dim Item As Variant
Dim Word As String
For Each Item In Split(DupeString, Delimiter)
Word = Trim(Item)
If Len(Word) > 0 Then
If dict.Exists(Word) Then
dict.Remove Word
End If
dict(Word) = Empty ' the same as 'dict.Add Word, Empty'
End If
Next Item
If dict.Count > 0 Then RemoveDupeWordsEnd = Join(dict.Keys, Delimiter)
End Function
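An alternative way to get the same result is to walk the words from right to left and keep only the first time each word is seen, which is by definition its last occurrence. The sketch below is an illustration under that assumption, not part of the answer above; the name KeepLastWords is hypothetical.

```vba
' Hypothetical alternative: scan the words backwards and keep each word
' only the first time it is seen (that is its last occurrence).
Function KeepLastWords(ByVal s As String, Optional ByVal delim As String = " ") As String
    Dim seen As Object: Set seen = CreateObject("Scripting.Dictionary")
    seen.CompareMode = vbTextCompare ' case-insensitive, as in the answer above

    Dim parts() As String: parts = Split(s, delim)
    Dim i As Long, w As String, result As String
    For i = UBound(parts) To LBound(parts) Step -1
        w = Trim(parts(i))
        If Len(w) > 0 Then
            If Not seen.Exists(w) Then
                seen.Add w, Empty
                ' Prepend, because the loop runs from the end of the string
                If Len(result) > 0 Then result = delim & result
                result = w & result
            End If
        End If
    Next i
    KeepLastWords = result
End Function
```

For the two examples in the question this gives "hi world hello" and "this is my hello world".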
Use VBA's replace in a while loop that terminates when the occurrences of the string drop below 2. Replace takes an optional argument for the number of matches to replace.
Function keepLast(raw As String, r As String) As String
While (Len(raw) - Len(Replace(raw, r, ""))) / Len(r) > 1
raw = Replace(raw, r, "", , 1)
Wend
keepLast = Trim(Replace(raw, "  ", " "))
End Function
I use Trim and Replace any double spaces with a single space to avoid extraneous white space that is left by the removal of the target string. You could avoid the loop by just counting the number of occurrences and passing that minus 1 straight to replace:
Function keepLast(raw As String, r As String) As String
keepLast = raw
Dim cnt As Integer
cnt = (Len(raw) - Len(Replace(raw, r, ""))) / Len(r)
If cnt < 2 Then Exit Function
raw = Replace(raw, r, "", , cnt - 1)
keepLast = Trim(Replace(raw, "  ", " "))
End Function
Bear in mind that this method is very susceptible to partial matches. If your raw string was "hello that Othello is a good play hello there", then you'll end up with "that Ot is a good play hello there", because the "hello" inside "Othello" is counted and removed as well, which I don't think is exactly what you want. You might use regex to address this, if it's necessary:
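Function keepLast(raw As String, r As String) As String
Dim parser As Object
Set parser = CreateObject("vbscript.regexp")
parser.Pattern = "\b" & r & "\b"
' Count all whole-word matches first (Global must be True for this)
parser.Global = True
Dim cnt As Long: cnt = parser.Execute(raw).Count
' With Global = False, Replace removes only the first match each time;
' with Global = True it would remove every match, including the last one
parser.Global = False
Do While cnt > 1
raw = parser.Replace(raw, "")
cnt = cnt - 1
Loop
keepLast = Trim(Replace(raw, "  ", " "))
End Function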
The regexp object has a property to ignore case, if you need to handle "hello" and "Hello". You would set that like this:
parser.ignoreCase = true
Late to the party, but try:
Function RemoveDups(inp As String) As String
With CreateObject("vbscript.regexp")
.Global = True
.Pattern = "(?:^| )(\S+)(?= |$)(?=.* \1(?: |$))"
RemoveDups = Application.Trim(.Replace(inp, ""))
End With
End Function
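The VBScript regex engine does support the \b word boundary, but \b also treats punctuation such as hyphens and apostrophes as a boundary, and lookbehind is not available, so this pattern delimits words by spaces instead. The idea is to match 1+ non-whitespace characters from and up to a space/start-of-line/end-of-line, and to use a backreference inside a lookahead to check that the same word is repeated later in the string.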
Formula in B1:
=RemoveDups(A1)
Note: This is currently case-sensitive. If you want case-insensitive matching, set the appropriate regex object property by adding .IgnoreCase = True inside the With block.

How to do a proper keyword search

If I search for the term 'tfo' in the cell value 'TFO_xyz' then the result should be TRUE.
If I search for the term 'tfo' in the cell value 'TFO systems' then the result should be TRUE.
If I search for the term 'tfo' in the cell value 'spring TFO' then the result should be TRUE.
BUT if I check 'tfo' in the cell value 'Platform' then I want the result as FALSE
I have used the formula =IF(COUNTIF(A2,"*tfo*"),"TRUE","FALSE"), but this won't give FALSE when I check 'tfo' in the word 'Platform'.
NOTE:
Platform should be FALSE because tfo occurs in the middle of a word. I'm looking for TRUE for cell values that contain tfo as a word of its own, as in tfo<space>America, TFO_America or <space>TFO systems. But I want FALSE for the words Platform and portfolio, because in these two words tfo occurs between other letters.
Try this:
Dim x As Long: x = 1
With Sheet1
Do While x <= .Cells(.Rows.Count, 1).End(xlUp).Row
If VBA.Left(.Cells(x, 1).Value, 3) = "tfo" Or VBA.Right(.Cells(x, 1).Value, 3) = "tfo" Then
.Cells(x, 2).Value = True
End If
x = x + 1
Loop
End With
Try this formula. It assumes that the word tfo will be at the beginning or end of the cell value.
Just make sure to substitute the appropriate cell reference where I have 'A2' in the formula.
=IF(OR(PROPER(LEFT(A2,3))="tfo",PROPER(RIGHT(A2,3))="tfo"),TRUE,FALSE)
My suggestion is to spend some time getting to know your data and create a white-list, since there is no easy way to properly do a fuzzy search in strings.
Function TFO_Search(strText As String) As Boolean
Dim ArryString As Variant
Dim ArryWhitelist As Variant
' Create a White-List Array
ArryWhitelist = Array("TFO_", "TFO ", "_TFO", " TFO", "tfoAmerica")
For Each ArryString In ArryWhitelist
If InStr(UCase(strText), UCase(ArryString)) > 0 Then 'force to UPPER CASE
TFO_Search = True
Exit Function
Else
TFO_Search = False
End If
Next
End Function
I see two dimensions of complexity in your question:
Where does the key word occur in the text (beginning, middle, end)
What are the characters that separate words.
The first one is fixed size, you need to handle three cases. The second one depends on the number of characters you want to accept as delimiters. Below I assumed that you accept space and underscore, however, you may expand this set by inserting more SUBSTITUTE function calls.
In my table, $A2 is the cell in which you search for the keyword, while B$1 contains the keyword.
To standardize the separator character, you need the formula:
B2=SUBSTITUTE($A2,"_"," ")
To check if the string starts with the keyword:
C2=--(LEFT($B2,LEN(B$1)+1)=B$1&" ")
To check if the string ends with the keyword:
D2=--(RIGHT($B2,LEN(B$1)+1)=" "&B$1)
To check if the keyword is in the middle of the string:
E2=--(LEN(SUBSTITUTE(UPPER($B2)," "&UPPER(B$1)&" ",""))<LEN($B2))
To evaluate the above three cases:
F2=--(0<$C2+$D2+$E2)
If you want to use a single cell, combine the formulas into:
G2=--(0<--(LEFT(SUBSTITUTE($A2,"_"," "),LEN(B$1)+1)=B$1&" ")+--(RIGHT(SUBSTITUTE($A2,"_"," "),LEN(B$1)+1)=" "&B$1)+--(LEN(SUBSTITUTE(UPPER(SUBSTITUTE($A2,"_"," "))," "&UPPER(B$1)&" ",""))<LEN(SUBSTITUTE($A2,"_"," "))))
It is not very readable in the end but I don't think there was an easier solution using Formulas only.
Note: If you want to modify the set of characters accepted as delimiters, add more SUBSTITUTE function calls to B2, then copy the Formula of F2 into notepad and replace $C2 with the formula of C2, etc., then replace $B2 with the updated Formula of B2.
Update
Building on the idea in Ron Rosenfeld's comment on tigeravatar's answer, the formula can be simplified (the beginning, middle and end cases can be merged into one):
=--(LEN(SUBSTITUTE(" "&UPPER($B2)&" "," "&UPPER(B$1)&" ",""))<LEN($B2))
After substituting $B2 with its formula:
=--(LEN(SUBSTITUTE(" "&UPPER(SUBSTITUTE($A2,"_"," "))&" "," "&UPPER(B$1)&" ",""))<LEN(SUBSTITUTE($A2,"_"," ")))
This formula will return true if TFO is at the beginning or end of any given word, or by itself, in the text string. It also checks every word in the text string, so TFO can be at beginning, middle, or end. The formula assumes that if a word starts or ends with TFO, then the result should be TRUE (as is the case for tfoAmerica so same rule would apply to tform), else FALSE.
=OR(ISNUMBER(SEARCH({" tfo","tfo "}," "&SUBSTITUTE(A2,"_"," ")&" ")))
EDIT:
In the event that the result should only be TRUE if TFO is found by itself, then this version of the formula will suffice:
=ISNUMBER(SEARCH(" tfo "," "&SUBSTITUTE(A2,"_"," ")&" "))
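The same padding idea as in the second formula (replace underscores with spaces, wrap the text in spaces, then search for the keyword surrounded by spaces) can also be sketched as a small VBA function. This is an illustration only: the name HasWholeWord is hypothetical, and like the formula it treats only space and underscore as delimiters.

```vba
' Hypothetical VBA counterpart of the second formula above.
' Returns True only if keyword occurs as a whole word, where words
' are separated by spaces or underscores. Matching is case-insensitive.
Function HasWholeWord(ByVal cellText As String, ByVal keyword As String) As Boolean
    Dim padded As String
    ' Normalise the delimiter and pad both ends so that a keyword at the
    ' start or end of the text is also surrounded by spaces
    padded = " " & Replace(cellText, "_", " ") & " "
    HasWholeWord = InStr(1, padded, " " & keyword & " ", vbTextCompare) > 0
End Function
```

Used in a cell as =HasWholeWord(A2,"tfo"), it would return TRUE for 'TFO_xyz', 'TFO systems' and 'spring TFO', and FALSE for 'Platform'.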
If you can rely on VBA, then regex is a more flexible solution.
There is a good summary, of how to use them in VBA: How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops
For your keyword search problem I wrote the following:
Option Explicit
' Include: Tools > References > Microsoft VBScript Regular Expressions 5.5 (C:\Windows\SysWOW64\vbscript.dll\3)
Public Function SearchKeyWord(strHay As String, strNail As String, Optional strDelimiters As String = " _,.;/", Optional lngNthOccurrence As Long = 1) As Long ' Returns 1-based index of nth occurrence or 0 if not found
Dim strPattern As String: strPattern = CreatePattern(strNail, strDelimiters)
Dim rgxKeyWord As RegExp: Set rgxKeyWord = CreateRegex(strPattern, True)
Dim mtcResult As MatchCollection: Set mtcResult = rgxKeyWord.Execute(strHay)
If (0 <= lngNthOccurrence - 1) And (lngNthOccurrence - 1 < mtcResult.Count) Then
Dim mthResult As Match: Set mthResult = mtcResult(lngNthOccurrence - 1)
SearchKeyWord = mthResult.FirstIndex + Len(mthResult.SubMatches(0)) + 1
Else
SearchKeyWord = 0
End If
End Function
Private Function CreateRegex(strPattern As String, Optional blnIgnoreCase As Boolean = False, Optional blnMultiLine As Boolean = True, Optional blnGlobal As Boolean = True) As RegExp
Dim rgxResult As RegExp: Set rgxResult = New RegExp
With rgxResult
.Pattern = strPattern
.IgnoreCase = blnIgnoreCase
.MultiLine = blnMultiLine
.Global = blnGlobal
End With
Set CreateRegex = rgxResult
End Function
Private Function CreatePattern(strNail As String, strDelimiters As String) As String
Dim strDelimitersEscaped As String: strDelimitersEscaped = RegexEscape(strDelimiters)
Dim strPattern As String: strPattern = "(^|[" & strDelimitersEscaped & "]+)(" & RegexEscape(strNail) & ")($|[" & strDelimitersEscaped & "]+)"
CreatePattern = strPattern
End Function
Private Function RegexEscape(strOriginal As String) As String
Dim strEscaped As String: strEscaped = vbNullString
Dim i As Long: For i = 1 To Len(strOriginal)
Dim strChar As String: strChar = Mid(strOriginal, i, 1)
Select Case strChar
Case ".", "$", "^", "{", "}", "[", "]", "(", "|", ")", "*", "+", "?", "\", "-" ' "]" and "-" matter because the delimiters are placed inside a character class
strEscaped = strEscaped & "\" & strChar
Case Else
strEscaped = strEscaped & strChar
End Select
Next i
RegexEscape = strEscaped
End Function
Once you have the above in a Module, you can insert formulas like the following:
=SearchKeyWord($A1,"tfo")
where A1 contains e.g. "tfo America".
As a third parameter, you may specify, which characters you want to treat as delimiters, by default they are space, underscore, comma, dot, semicolon and slash.
The return value is the position of the nth occurrence of the keyword, where n is the value of the fourth parameter (default: 1), or 0 if not found.
To check if the keyword is present in A1, compare the result to 0, which means not found:
=--(SearchKeyWord($A1,"tfo")<>0)

Excel VBA - Multiple regex pattern deletion

In Excel VBA I need to perform multiple regular expression matches which then deletes the match from the string while preserving the remainder of the string. I have it working by daisy-chaining two variables, and by not testing the pattern match first since the second match is the remainder of the first.
Consider the following data:
(2.5.3) A. 100% of product will be delivered in 3 days
(2.5.3) B. Capability to deliver product by air.
(2.5.3) C. Support for xyz feature
(2.5.3) D. Vendor is to provide an overview of the network as proposed.
(2.5.3) E. The network should allow CustomerABC to discover their devices.
(2.5.3) F. The use of CustomerABC existing infrastructure should be optimized. CustomerABC's capability will vary.
(2.5.3) G. Describe the number of network devices requiring to run CustomerABC's center.
With this data, I am deleting the outline numbers at the beginning of the string, as well as any references to CustomerABC and anything attached to it (for example a possessive suffix). These references can appear multiple times in the string, at any location, in upper or lower case. I have the regexes working. Here is the code I'm trying:
Function test(Txt As String) As String
Dim regEx As Object
Dim v1 As String
Dim v2 As String
Dim n As String
n = "CustomerABC"
If regEx Is Nothing Then
Set regEx = CreateObject("VBScript.RegExp")
regEx.Global = True
regEx.IgnoreCase = True
End If
If Len(Txt) > 0 Then
With regEx
' The 1st pattern
.Pattern = "^\(?[0-9.]+\)?"
'If Not .Test(Txt) Then Exit Function
v1 = .Replace(Txt, "")
' The 2nd pattern
.Pattern = n + "(\S*)?(\s+)?"
'If Not .Test(Txt) Then Exit Function
v2 = .Replace(v1, "")
' The result
test = Application.Trim(v2)
End With
End If
End Function
Is there a way to make this better, speed things up, and have a variable number of match/deletions?
Thanks in advance.
Like this:
Function test(Txt As String) As String
Static regEx As Object '<< need Static here
Dim rv As String, p, n
n = "CustomerABC"
If regEx Is Nothing Then
Set regEx = CreateObject("VBScript.RegExp")
regEx.Global = True
regEx.IgnoreCase = True
End If
If Len(Txt) > 0 Then
rv = Txt
'looping over an array of patterns
For Each p In Array("^\(?[0-9.]+\)?", n & "(\S*)?(\s+)?")
With regEx
.Pattern = p
rv = .Replace(rv, "")
End With
Next p
End If
test = Application.Trim(rv)
End Function
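To allow a variable number of patterns to be supplied by the caller rather than hard-coded in the array, the same loop could take its patterns through a ParamArray. This is a sketch under that assumption; the name StripPatterns is hypothetical, and Application.Trim requires the code to run inside Excel.

```vba
' Hypothetical variant of the function above: every pattern passed after
' the text is removed in turn, in the order given.
Function StripPatterns(ByVal txt As String, ParamArray patterns() As Variant) As String
    Static re As Object ' Static so the object is created only once
    If re Is Nothing Then
        Set re = CreateObject("VBScript.RegExp")
        re.Global = True
        re.IgnoreCase = True
    End If

    Dim p As Variant
    For Each p In patterns
        re.Pattern = CStr(p)
        txt = re.Replace(txt, "")
    Next p

    ' Application.Trim also collapses runs of inner spaces
    StripPatterns = Application.Trim(txt)
End Function
```

From a worksheet this might be called as =StripPatterns(A1,"^\(?[0-9.]+\)?","CustomerABC(\S*)?(\s+)?"), with further patterns appended as extra arguments.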

Regex pattern to remove certain prefixes in a word from Excel

I am trying to clean up a set of strings in Excel to extract certain words after removing some prefixes and extra characters. Initially I was trying this with FIND, LEFT, MID, etc. Then I came across this helpful post and am now trying my hand at regex.
https://superuser.com/questions/794536/excel-formulas-for-stripping-out-prefix-suffix-around-number
I have used the UDF given there called Remove which takes a regex argument. Now, I am still not able to remove all the items I wanted to remove.
In the attached Excel you can see what I have tried and what the answer I am looking.
Here are the prefixes I want to remove:
The numbers at the beginning surrounded by brackets. Ideally I want these in a separate column.
Any word before a hyphen. There are a number of them, such as 'l-' and 'al-'.
And then the prefixes below:
bi
bil
fa
wa
wal
How do I write a single regex which would remove all the above prefixes?
Here is the UDF I am using:
Function Remove(objCell As Range, strPattern As String)
Dim RegEx As Object
Set RegEx = CreateObject("VBScript.RegExp")
RegEx.Global = True
RegEx.Pattern = strPattern
Remove = RegEx.Replace(objCell.Value, "")
End Function
Here is the link to the XLSM file which contains the data I have:
https://www.dropbox.com/s/et9ee727ompj5fl/Regex%20Trials.xlsm?dl=0
and here is a screenshot to show you what I am looking for:
Not 100% perfect for words, but it should get you started.
Breakdown of the RegEx (\d+\:)+\d+
(\d+\:) finds any pattern that matches the format x: (one or more digits followed by a colon).
The plus after the bracket then tells it that this is a repeating pattern.
Lastly, the \d+ matches the final number in the string, so that the regex will find a pattern that matches x:x:x
The next RegEx (?!l-|al-|a-|wa-|fa-|bi-)[a-z].* is a lot more complex.
First of all, let's look at the [a-z]. This tells it to match any character between a and z. We then want to capture the rest of the word, so by using .* it captures everything from the first match to the end of the string (this includes non a-z characters). However, we don't want it to capture the first part of the string before the hyphen (in most cases), so with ?! we use what's called a negative lookahead. It prevents the match from starting at any position where one of the alternatives inside the brackets follows; | simply means or. The regex engine therefore skips forward until it reaches a letter that does not begin one of those prefixes, and the match starts there.
Go to http://regexr.com/ if you want to have a play around; it is a handy site to learn and test RegEx.
Public Sub test()
Dim rng As Range
Dim matches
Dim c
With Sheet1
Set rng = .Range(.Cells(2, 1), .Cells(.Cells(.Rows.Count, 1).End(xlUp).Row, 1))
End With
For Each c In rng
With c
.Offset(0, 6) = ExecuteRegEx(.Value2, "(\d+\:)+\d+")
.Offset(0, 7) = ExecuteRegEx(.Value2, "(?!l-|al-|a-|wa-|fa-|bi-)[a-z].*")
End With
Next c
End Sub
Public Function ExecuteRegEx(str As String, pattern As String) As String
Dim RegEx As Object
Dim matches
Set RegEx = CreateObject("VBScript.RegExp")
With RegEx
.Global = True
.ignorecase = False
.pattern = pattern
If .test(str) Then
Set matches = .Execute(str)
ExecuteRegEx = matches(0)
Else
ExecuteRegEx = vbNullString
End If
End With
End Function
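To see how the negative lookahead behaves, and where it falls short, the pattern can be tried on a few sample words. The sketch below is an illustration only: the helper name FirstRegexMatch and the three sample words are made up for this example and do not come from the question's workbook.

```vba
' Hypothetical helper: returns the first match of pattern in s,
' or an empty string if there is no match.
Function FirstRegexMatch(ByVal s As String, ByVal pattern As String) As String
    With CreateObject("VBScript.RegExp")
        .Pattern = pattern
        If .Test(s) Then FirstRegexMatch = .Execute(s)(0).Value
    End With
End Function

Sub DemoLookahead()
    Const PAT As String = "(?!l-|al-|a-|wa-|fa-|bi-)[a-z].*"
    ' "al-" and then "l-" are rejected, so the match starts after the hyphen
    Debug.Print FirstRegexMatch("al-kitab", PAT)  ' kitab
    ' "wa-" and then "a-" are rejected in the same way
    Debug.Print FirstRegexMatch("wa-qala", PAT)   ' qala
    ' "bil-" is not in the list, so the whole string is returned unchanged
    Debug.Print FirstRegexMatch("bil-haqq", PAT)  ' bil-haqq
End Sub
```

The last case shows why the list inside the lookahead has to contain every prefix that should be skipped.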
I wouldn't use a regex for this: you can split the cell value and test the result against a defined array of prefixes.
Note: the array values are ordered so that prefixes which are substrings of other prefixes come later in the list.
Public Function RemovePrefix(RngSrc As Range) As String
If RngSrc.Count > 1 Then Exit Function
On Error GoTo ExitFunction
Dim Prefixs() As String: Prefixs = Split("wal,wa',wa,bil,bi,fa", ",")
Dim Arr() As String, i As Long, Temp As String
Arr = Split(RngSrc, "-")
If UBound(Arr) > 0 Then
RemovePrefix = Arr(UBound(Arr))
Exit Function
End If
Arr = Split(RngSrc, " ")
For i = 0 To UBound(Prefixs)
Temp = Arr(UBound(Arr))
If InStr(Temp, Prefixs(i)) = 1 Then
RemovePrefix = Right(Temp, Len(Temp) - Len(Prefixs(i)))
Exit Function
End If
Next i
RemovePrefix = Temp
ExitFunction:
If Err Then RemovePrefix = "Error"
End Function

Returning a numeric value on either side of a dash in a string?

Does anyone know how to return only the numeric value immediately on either side of a dash in a string?
For example, let's say we have the following string "Text, 2-78, 88-100, 101". I'm looking for a way to identify a dash and then return one of the numbers (left or right).
Ultimately I would like to check to see if a given number, let's say 75, falls within any of the ranges noted in the string. Ideally it would see that 75 falls within "2-78".
Any help would be greatly appreciated!
Go to Tools->References and check "Microsoft VBScript Regular Expressions 5.5." Then you can do something like this. (I know this isn't good code, but it's the idea...) Also, this finds all the #-# patterns and prints either the left or right number for all of them (based on whether the boolean "left" is true or false).
Dim str, res As String
str = "Text, 2-78, 88-100, 101"
Dim left As Boolean
left = False
Dim re1 As New RegExp
re1.Pattern = "\d+-\d+"
re1.Global = True
Dim m, n As Match
For Each m In re1.Execute(str)
Dim re2 As New RegExp
re2.Global = False
If left Then
re2.Pattern = "\d+"
Else
re2.Pattern = "-\d+"
End If
For Each n In re2.Execute(m.Value)
res = n.Value
If Not left Then
res = Mid(res, 2, Len(str))
End If
Next
MsgBox res
Next
You can do this many different ways with VBA. Using the Split() function to convert into an array, first using the commas as a delimiter and then using the dash would probably be a way to go.
That said, if you want a quick and dirty way to do this with excel ( from which you could record a macro ) here is what you can do.
Paste your target string into a cell.
Run Text to Columns on it, using the comma as your delimiter.
Copy the row you now have and Paste-Transpose onto a new sheet.
Run Text to Columns again on your transposed column, this time with the dash as your delimiter.
You now have side by side columns of your numbers, which you can compare to your target values as needed.
You may need to use the Trim() functions in there somewhere to remove whitespace, but hopefully the text to columns would leave you with numbers instead of text numbers.
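The Split() approach mentioned at the start of this answer (split on commas first, then on the dash) could look roughly like the following. This is a minimal sketch rather than tested production code; the function name InAnyRangeSplit is hypothetical.

```vba
' Hypothetical sketch of the Split() approach: returns True if value
' lies inside any "low-high" range found in a comma-separated string.
Function InAnyRangeSplit(ByVal text As String, ByVal value As Long) As Boolean
    Dim part As Variant
    Dim bounds() As String
    For Each part In Split(text, ",")
        bounds = Split(Trim(part), "-")
        ' Only items with exactly one dash and numbers on both sides are ranges
        If UBound(bounds) = 1 Then
            If IsNumeric(bounds(0)) And IsNumeric(bounds(1)) Then
                If value >= CLng(bounds(0)) And value <= CLng(bounds(1)) Then
                    InAnyRangeSplit = True
                    Exit Function
                End If
            End If
        End If
    Next part
End Function
```

With the string from the question, InAnyRangeSplit("Text, 2-78, 88-100, 101", 75) finds 75 inside "2-78".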
Ultimately I think there are lots of ways you could approach this sort of problem. It looks like a good way to try and use RegExp. RegExp is not my speciality but I do like to try and use it to answer some Q's here on SO. This code has been tested for your example data and is working properly.
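Something like this, assuming your text is in cell A1 and you're testing a value like 75. The pattern also captures standalone numbers in your string (such as 101) in the match collection. It needs the same reference to Microsoft VBScript Regular Expressions 5.5:
Sub TestRegExp()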
Dim m As Match
Dim testValue As Long
Dim rangeArray As Variant
testValue = 75 'or whatever value you're trying to find
pattern = "[\d]+[-][\d]+\b|[\d]+"
Set re = New RegExp
re.pattern = pattern
re.Global = True
re.IgnoreCase = True 'doesn't really matter since you're looking for numbers
Set allMatches = re.Execute([A1])
For Each m In allMatches
rangeArray = Split(m, "-")
Select Case UBound(rangeArray)
Case 0
If testValue = rangeArray(0) Then
msg = testValue & " = " & m
Else:
msg = testValue & " NOT " & m
End If
Case 1
If testValue >= CLng(rangeArray(0)) And testValue <= CLng(rangeArray(1)) Then
msg = testValue & " is within range: " & m
Else:
msg = testValue & " is not within range: " & m
End If
Case Else
End Select
MsgBox msg, vbInformation
Next
End Sub
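If only a yes/no answer is needed, the two numbers on either side of the dash can be read directly from capture groups through the SubMatches property. The following is a sketch under that assumption, using late binding so no reference is needed; the name InAnyRange is hypothetical.

```vba
' Hypothetical helper: True if value falls inside any "low-high" range
' in text. The two capture groups hold the numbers left and right of the dash.
Function InAnyRange(ByVal text As String, ByVal value As Long) As Boolean
    Dim re As Object: Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "(\d+)-(\d+)"
    re.Global = True ' examine every range, not just the first

    Dim m As Object
    For Each m In re.Execute(text)
        ' SubMatches(0) is the left number, SubMatches(1) the right number
        If value >= CLng(m.SubMatches(0)) And value <= CLng(m.SubMatches(1)) Then
            InAnyRange = True
            Exit Function
        End If
    Next m
End Function
```

For example, InAnyRange("Text, 2-78, 88-100, 101", 75) is True because 75 falls within "2-78".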

Resources