Regex pattern to remove certain prefixes in a word from Excel - excel

I am trying to cleanup a set of strings in Excel to extract certain words after removing some prefixes and extra characters. Initially I was trying this with FIND, LEFT, MID, etc. Then, I came across this helpful post and trying my hand at regex.
https://superuser.com/questions/794536/excel-formulas-for-stripping-out-prefix-suffix-around-number
I have used the UDF given there called Remove which takes a regex argument. Now, I am still not able to remove all the items I wanted to remove.
In the attached Excel you can see what I have tried and what the answer I am looking.
Here are the Prefixes I wanted to remove:
The numbers in the beginning surrounded by brackets - Ideally I want this in a separate column.
Anyword before a hyphen here there are a number of them 'l-', 'al-'
and then these prefixes below.
 bi
 bil
 fa
 wa
 wal
How do I write a single regex which would remove all the above prefixes?
Here is the UDF I am using:
Function Remove(objCell As Range, strPattern As String)
Dim RegEx As Object
Set RegEx = CreateObject("VBScript.RegExp")
RegEx.Global = True
RegEx.Pattern = strPattern
Remove = RegEx.Replace(objCell.Value, "")
End Function
Here is the link to the XLSM file which contains the data I have:
https://www.dropbox.com/s/et9ee727ompj5fl/Regex%20Trials.xlsm?dl=0
and here is a screenshot to show you what I am looking for:

Not 100% perfect for words but should get you started
Breakdown of RegEx (\d+\:)+\d+
(\d+\:) finds any patterns that match the format x:
the plus after the bracket then tells it that this is a repeating pattern.
lastly the \d+ matches the last digit in the string so that the regex will find a pattern that matches x:x:x
The next RegEx (?!l-|al-|a-|wa-|fa-|bi-)[a-z].* is a lot more complex.
First of all lets look at the [a-z]. This tells it to match any character between a and z. We then want to capture the rest of the word so by using .* it captures everything from the first match to the end of the string (this includes non a-z characters). However, we don't want it to capture the first part of the string before the hyphen (in most cases) so by using ?! We use what's called negative look ahead. This looks for anything inside the brackets and ignores those bits. | simply means or. so anything inside that bracket will be ignored from the match.
Go to http://regexr.com/ if you want to have a play around is a handy site to learn/test RegEx
Public Sub test()
Dim rng As Range
Dim matches
Dim c
With Sheet1
Set rng = .Range(.Cells(2, 1), .Cells(.Cells(.Rows.Count, 1).End(xlUp).Row, 1))
End With
For Each c In rng
With c
.Offset(0, 6) = ExecuteRegEx(.Value2, "(\d+\:)+\d+")
.Offset(0, 7) = ExecuteRegEx(.Value2, "(?!l-|al-|a-|wa-|fa-|bi-)[a-z].*")
End With
Next c
End Sub
Public Function ExecuteRegEx(str As String, pattern As String) As String
Dim RegEx As Object
Dim matches
Set RegEx = CreateObject("VBScript.RegExp")
With RegEx
.Global = True
.ignorecase = False
.pattern = pattern
If .test(str) Then
Set matches = .Execute(str)
ExecuteRegEx = matches(0)
Else
ExecuteRegEx = vbNullString
End If
End With
End Function

I wouldn't use a regex for this: you can do some splitting of the cell value and testing of the prefixs against a defined array of prefixs:
Note: the array values are in an order where substrings of other prefixs are later in the list
Public Function RemovePrefix(RngSrc As Range) As String
If RngSrc.Count > 1 Then Exit Function
On Error GoTo ExitFunction
Dim Prefixs() As String: Prefixs = Split("wal,wa',wa,bil,bi,fa", ",")
Dim Arr() As String, i As Long, Temp As String
Arr = Split(RngSrc, "-")
If UBound(Arr) > 0 Then
RemovePrefix = Arr(UBound(Arr))
Exit Function
End If
Arr = Split(RngSrc, " ")
For i = 0 To UBound(Prefixs)
Temp = Arr(UBound(Arr))
If InStr(Temp, Prefixs(i)) = 1 Then
RemovePrefix = Right(Temp, Len(Temp) - Len(Prefixs(i)))
Exit Function
End If
Next i
RemovePrefix = Temp
ExitFunction:
If Err Then RemovePrefix = "Error"
End Function

Related

Extract one specific key and value from a comma separated list in every cell of a column

I'm trying to extract 'manufacturer=acme' from, for example:
attribute1=red,attribute2=medium,manufacturer=acme,attribute4=spherical
from column 'attributes', for which there are 8000+ rows.
I can't use left(), right(), split() functions because the manufacturer attribute doesn't have a fixed number of attributes/characters to the left or right of it and split() only works for one character, not a string.
Is there a way I can achieve this, target the string manufacturer= and remove all text from the left and right starting from its encapsulating commas?
Quick mock-up for looping through a split string (untested):
dim stringToArray as variant: stringToArray = split(target.value, ",")
dim arrayItem as long
for arrayLocation = lbound(stringToArray) to ubound(stringToArray)
if instr(ucase(stringToArray(arrayLocation)), ucase("manufacturer=")) then
dim manufacturerName as string: manufacturerName = right(stringToArray(arrayLocation), len(stringToArray(arrayLocation))-len("manufacturer="))
exit for
end if
next arrayLocation
debug.print manufacturerName
I have, maybe, an overkill solution using RegExp.
Following is a UDF you can use in a formula
Public Function ExtractManufacturerRE(ByRef r As Range) As String
On Error GoTo RETURN_EMPTY_STR
Dim matches As Object
With CreateObject("VBScript.RegExp")
.Pattern = "manufacturer=[^,]+"
.Global = False
Set matches = .Execute(r.Value)
If matches.Count > 0 Then
ExtractManufacturerRE = matches.Item(0).Value
End If
End With
RETURN_EMPTY_STR:
End Function
To be fair, this is sub-optimal, plus it doesn't work on a range but only on a single cell.

VBA code to extract Substring from Main String

I am trying to extract substring from main string. String have not same pattern. Main string is in Column "I". Desired output should be as per column "J". I have to extract substring between "FL" and "WNG".
I have tried to write code put it is not giving proper output. Can you please assist with alternate solution to get desired output using VBA.
Sub Get_Substring()
Range("K2") = Mid(Range("I2"), InStrRev(Range("I2"), "FL") + 1, _
InStrRev(Range("I2"), "WNG") - _
InStrRev(Range("I2"), "FL") - 1)
End Sub
Try the following...
Range("K2") = Mid(Range("I2"), InStrRev(Range("I2"), "FL") + 2, _
InStrRev(Range("I2"), "WNG") - _
InStrRev(Range("I2"), "FL") - 2)
Although, I would make it clear that you want the value for each of the ranges, as follows...
Range("K2").Value = Mid(Range("I2").Value, InStrRev(Range("I2").Value, "FL") + 2, _
InStrRev(Range("I2").Value, "WNG") - _
InStrRev(Range("I2").Value, "FL") - 2)
The next piece of code extracts the necessary string using arrays, too. But it can do it, even if more "WNG" strings exist in the string to be analyzed:
Private Function ExtractString(strTxt As String) As String
Dim arrFL, arrWNG, i As Long
arrFL = Split(strTxt, "FL")
For i = 1 To UBound(arrFL) 'start from the second array element
arrWNG = Split(arrFL(i), "WNG") 'split each first array element by "WNG"
'if the array contains at least a "WNG" string:
If UBound(arrWNG) > 0 Then ExtractString = arrWNG(0): Exit Function 'extract the first array element
Next
End Function
Note: If more pairs "FL" folowed by "WNG" exists, the function can be adapted to return an array, containing all such potential occurrences...
It can be tested using the next testing Sub:
Sub testExtractString()
Dim x As String
x = "John12REGNO02FL02WNGARM01"
'x = "John12WNGREGNO02FL02WNGARM01"
'x = "John12WNGREGNO02FL02WNGARWNGM01"
Debug.Print ExtractString(x)
End Sub
Just uncomment each x definition row...
I'll chuck in a solution based on regex to assure you got the exact substring:
Sub Test()
Dim stringIn As String: stringIn = "John12REGNO02FL02WNGARM01"
Debug.Print (Extract(stringIn))
End Sub
Function Extract(stringIn As String) As String
With CreateObject("vbscript.regexp")
.Pattern = "^.*FL(.*?)WNG"
If .Test(stringIn) = True Then
Extract = .Execute(stringIn)(0).Submatches(0)
Else
Extract = "None Found"
End If
End With
End Function
^ - Start line anchor.
.*FL - 0+ Chars greedy, and therefor untill, the last occurence of "FL".
(.*?) - A capture group with 0+ but lazy characters and therefor upto the nearest occurence of:
WNG - Literally match "WNG".
NOTE, you could make a more strict pattern only catching digits of that's the only type of characters possible, e.g: ^.*FL(\d*)WNG.
Here is an online demo
You can try the following udf:
Public Function FLWNG(s As String) As String
'Purpose: get the substring enclosed by the most right pair of FL..WNG
Dim tmp
tmp = Split(Replace(s, "WNG", "FL"), "FL")
FLWNG = tmp(UBound(tmp) - 1)
End Function
Explanation
Replacing all occurencies of WNG in the original string (s) with FL allows to split the resulting string by the FL delimiter only.
Assuming that the original string has at least one enclosing structure, you get the enclosed content as next to last element, i.e. via tmp(Ubound(tmp)-1).

How to Bold certain text part of a string

I have a string within a cell and I am trying to bold certain parts of that string. I have my code setup so each case is a line within that cell.
The first cell is what I am starting out with, and the one below it is what I am trying to do. Below is my code on what I have so far.
Sub test()
For Each cel In Range("A1:A" & Cells(Rows.Count, 1).End(xlUp).Row)
Dim arr, line As Long, pos As Long, txt, length, dashPos
arr = Split(cel.Value, Chr(10)) ' Spliting cell contents by newline character
pos = 1
For line = 1 To UBound(arr) + 1
txt = arr(line - 1)
length = Len(txt)
'check which line we're on...
Select Case line
Case 4: 'Underline on line 4
cel.Characters(pos, length).Font.Underline = True
Case 5: 'Bold the team players
Case 6: 'Underline on line 6
cel.Characters(pos, length).Font.Underline = True
End Select
pos = pos + Len(txt) + 1 'start position for next line
Next line
Next cel
End Sub
Since you are looking up a certain pattern I thought this could be done through regular expressions since each match in the MatchCollection2 object will have a starting index including the length of the captured pattern. Let's imagine the following sample data:
Now we can apply the following code:
Sub Test()
Dim str As String: str = [A1]
Dim colMatch, objMatch
With CreateObject("vbscript.regexp")
.Global = True
.Pattern = "\S+:(?=.*$)"
If .Test(str) = True Then
Set colMatch = .Execute(str)
For Each objMatch In colMatch
Range("A1").Characters(objMatch.FirstIndex, objMatch.Length).Font.Bold = True
Next
End If
End With
End Sub
The result:
About the regular expression's pattern:
\S+:(?=.*$)
You can see an online demo here and a small breakdown below:
\S+: - 1+ Non-whitespace character up to and including a colon.
(?= - A positive lookahead:
.*$ - 0+ characters other than newline up to the end string anchor.
) - Close positive lookahead.
Note: We need to either forget about the "Newline" property of the regex object or set it's value to FALSE. In the example I gave I simply didn't include it because it will then default to FALSE. If this was set to true the end string anchor won't simply match the end of the whole string but the end of each line (which is what we want to avoid if we don't want to match "Server:").
The answer is in another StackOverflow question here:
excel vba: make part of string bold
Which is similar to this:
Change color of certain characters in a cell
It's roughly:
{{CELL OR CELLS NEEDING BOLD CHARACTERS}}.Characters({{LOCATION, INFO}}).Font
.FontStyle = "Bold"

Excel VBA - Multiple regex pattern deletion

In Excel VBA I need to perform multiple regular expression matches which then deletes the match from the string while preserving the remainder of the string. I have it working by daisy-chaining two variables, and by not testing the pattern match first since the second match is the remainder of the first.
Consider the follow data:
(2.5.3) A. 100% of product will be delivered in 3 days
(2.5.3) B. Capability to deliver product by air.
(2.5.3) C. Support for xyz feature
(2.5.3) D. Vendor is to provide an overview of the network as proposed.
(2.5.3) E. The network should allow CustomerABC to discover their devices.
(2.5.3) F. The use of CustomerABC existing infrastructure should be optimized. CustomerABC's capability will vary.
(2.5.3) G. Describe the number of network devices requiring to run CustomerABC's center.
With this data, I am deleting the outline numbers in the beginning of the string, as well as any references to CustomerABC and any hyphenation that could possibly appear multiple times in the string at any location, with potentially upper and lower case. I have the regex's working. Here is the code I'm trying:
Function test(Txt As String) As String
Dim regEx As Object
Dim v1 As String
Dim v2 As String
Dim n As String
n = "CustomerABC"
If regEx Is Nothing Then
Set regEx = CreateObject("VBScript.RegExp")
regEx.Global = True
regEx.IgnoreCase = True
End If
If Len(Txt) > 0 Then
With regEx
' The 1st pattern
.Pattern = "^\(?[0-9.]+\)?"
'If Not .Test(Txt) Then Exit Function
v1 = .Replace(Txt, "")
' The 2nd pattern
.Pattern = n + "(\S*)?(\s+)?"
'If Not .Test(Txt) Then Exit Function
v2 = .Replace(v1, "")
' The result
test = Application.Trim(v2)
End With
End If
End Function
Is there a way to make this better, speed things up, and have a variable number of match/deletions?
Thanks in advance.
Like this:
Function test(Txt As String) As String
Static regEx As Object '<< need Static here
Dim rv As String, p, n
n = "CustomerABC"
If regEx Is Nothing Then
Set regEx = CreateObject("VBScript.RegExp")
regEx.Global = True
regEx.IgnoreCase = True
End If
If Len(Txt) > 0 Then
rv = Txt
'looping over an array of patterns
For Each p In Array("^\(?[0-9.]+\)?", n & "(\S*)?(\s+)?")
With regEx
.Pattern = p
rv = .Replace(rv, "")
End With
Next p
End If
test = Application.Trim(rv)
End Function

Excel Find alphanumeric string from text

I have a column containing multiple string values, like a sentence.
in that sentence i want to find one or all alphanumeric values of 10 or more characters containing atleast one - , and put the resulting values in another column.
For example:
the column containing sentence is like:
upgrade 15.07.2010, old No: WI82-01062. User moved to No: WI12-01012 02.04.2012 to a 2 user network.
or
Upgrade from lite 7/6/07, old No: PTX7-89C367EC5052-01211
Ideally I want a column with values like WI82-01062, WI12-01012 for the first example, and PTX7-89C367EC5052-01211 for the second example.
May be searching for the - in the string and finding the first occurrence of blank space at both ends would help, but I do not have any clue how to write that in excel term.
Thanks
You could probably use a regex like this (there may be better patterns!):
Function ExtractData(r As Variant) As String
Static oRE As Object
Dim sTemp As String
Dim n As Long
Dim matches
If oRE Is Nothing Then
Set oRE = CreateObject("vbscript.regexp")
With oRE
.Pattern = "[A-Za-z0-9\-]{10,}"
.Global = True
End With
End If
Set matches = oRE.Execute(r)
If matches.Count > 0 Then
For n = 1 To matches.Count
sTemp = sTemp & ", " & matches(n - 1)
Next n
ExtractData = Mid$(sTemp, 3)
End If
End Function

Resources