Get every word ending with dot using Regex/VBA - excel

I am using Excel 2019 and I am trying to extract, from a bunch of messed up text cells, up to 5 words ending with a dot that come after a ].
This is a sample of the text I am trying to parse/clean
`
some text [asred.] ost. |Monday - Ribben (ult.) lot. ac, sino. other maybe long text; collan.
`
I expect to get this:
ost. ult. lot. sino. collan.
I am using this Function found somewhere on the internet which appears to do the job:
`
Public Function RegExtract(Txt As String, Pattern As String) As String
With CreateObject("vbscript.regexp")
'.Global = True
.Pattern = Pattern
If .test(Txt) Then
RegExtract = .Execute(Txt)(0)
Else
RegExtract = "No match found"
End If
End With
End Function
`
and I call it from an empty cell:
=RegExtract(D2; "([\]])(\s\w+[.]){0,5}")
It's the first time I am using regexp, so I might have done terrible things in the eyes of an expert.
So this is my expression: ([\]])(\s\w+[.]){0,5}
Right now it returns only
] ost.
Which is much more than I was expecting to be able to do on my first approach to regex, but:
1) I am not able to get rid of the first ], which is needed to find the place where my useful bits start inside the text block, since \K does not work in Excel. I might "find and replace" it later as a smart barbarian, but I'd like to know the way to do it clean, if any clean way exists :)
2) I don't understand how quantifiers work to get all my "up to 5 occurrences": I was expecting that {0,5} after the second group meant exactly: "repeat the previous group again until the end of the text block (or until you manage to do it 5 times)".
Thank you for your time :)
--Added after JdvD accepted answer for the records--
I am using this pattern to get all the words ending with dot, after the FIRST occurrence of the closing bracket.
^.*?\]|(\w+\.\s?)|.
This one (without the question mark) instead gets all the words ending with dot, after the LAST occurrence of the closing bracket.
^.*\]|(\w+\.\s?)|.
I was also missing something in my RegExtract function: I needed to collect the matches in an array with a For Each loop and then output that array as a string.
I was wrongly assuming that the regex engine already returned all matches as a single string.
The correct RegExtract function to extract EVERY match is the following:
Public Function RegExtract(Txt As String, Pattern As String) As String
Dim rMatch As Object, arrayMatches(), i As Long
With CreateObject("vbscript.regexp")
.Global = True
.Pattern = Pattern
If .Test(Txt) Then
For Each rMatch In .Execute(Txt)
If Not IsEmpty(rMatch.SubMatches(0)) Then
ReDim Preserve arrayMatches(i)
arrayMatches(i) = rMatch.SubMatches(0)
i = i + 1
End If
Next
RegExtract = Join(arrayMatches, " ")
Else
RegExtract = "No match found"
End If
End With
End Function

RegexMatch:
In addition to the answer given by @RonRosenfeld, one could apply what some refer to as 'The Best Regex Trick Ever': first match what you don't want, then match what you do want in a capture group. For example:
^.*\]|(\w+\.)
In short, this means:
^.*\] - Match 0+ (greedy) characters from the start of the string up to the last occurrence of a closing square bracket;
| - Or;
(\w+\.) - Capture group holding 1+ (Greedy) word-characters ending with a dot.
Here is how it could work in an UDF:
Sub Test()
Dim s As String: s = "some text [asred.] ost. |Monday - Ribben (ult.) lot. ac, sino. other maybe long text; collan. "
Debug.Print RegExtract(s, "^.*\]|(\w+\.)")
End Sub
'------
'The above Sub would invoke the below function as an example.
'But you could also invoke this through: `=RegExtract(A1,"^.*\]|(\w+\.)")`
'on your sheet.
'------
Public Function RegExtract(Txt As String, Pattern As String) As String
Dim rMatch As Object, arrayMatches(), i As Long
With CreateObject("vbscript.regexp")
.Global = True
.Pattern = Pattern
If .Test(Txt) Then
For Each rMatch In .Execute(Txt)
If Not IsEmpty(rMatch.SubMatches(0)) Then
ReDim Preserve arrayMatches(i)
arrayMatches(i) = rMatch.SubMatches(0)
i = i + 1
End If
Next
RegExtract = Join(arrayMatches, " ")
Else
RegExtract = "No match found"
End If
End With
End Function
RegexReplace:
Depending on your desired output one could also use a replace function. You'd have to match any remaining character with another alternative for that. For example:
^.*\]|(\w+\.\s?)|.
See an online demo where in short this means that we added another alternative which is simply any single character. A 2nd small addition is that we added the option of an optional space character \s? in the 2nd alternative.
Sub Test()
Dim s As String: s = "some text [asred.] ost. |Monday - Ribben (ult.) lot. ac, sino. other maybe long text; collan. "
Debug.Print RegReplace(s, "^.*\]|(\w+\.\s?)|.", "$1")
End Sub
'------
'There are now 3 parameters to pass to the UDF: String, Pattern and Replacement.
'------
Public Function RegReplace(Txt As String, Pattern As String, Replacement As String) As String
With CreateObject("vbscript.regexp")
.Global = True
.Pattern = Pattern
RegReplace = Trim(.Replace(Txt, Replacement))
End With
End Function
Note that I used Trim() to remove possible trailing spaces.
Both RegexMatch and RegexReplace currently return a single cleaned-up string, but the former also leaves the individual matches in the arrayMatches() variable should you want to process them separately.

There is a method to return all the matches in a string starting after a certain pattern. But I can't recall it at this time.
In the meantime, it seems the simplest would be to remove everything prior to the first ], and then apply Regex to the remainder.
For example:
Option Explicit
Sub findit()
Const str As String = "some text [asred.] ost. |Monday - Ribben (ult.) lot. ac, sino. other maybe long text; collan."
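' Early binding: requires a reference to "Microsoft VBScript Regular Expressions 5.5" (Tools -> References...)
Dim RE As RegExp, MC As MatchCollection, M As Match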
Dim S As String
Dim sOutput As String
S = Mid(str, InStr(str, "]"))
Set RE = New RegExp
With RE
.Pattern = "\w+(?=\.)"
.Global = True
If .Test(S) = True Then
Set MC = .Execute(S)
For Each M In MC
sOutput = sOutput & vbLf & M
Next M
End If
End With
MsgBox Mid(sOutput, 2)
End Sub
You could certainly limit the number of matches to 5 by using a counter instead of the For each loop
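As a minimal sketch of that counter idea: the function below is a hypothetical helper (the name FirstDotWords and the MaxCount parameter are assumptions, not part of the answer). It uses late binding and the pattern \w+\. so that the trailing dot is kept, which differs slightly from the lookahead pattern used above.

```vba
' Hypothetical helper: returns up to MaxCount words that end with a dot,
' looking only at the text after the first "]".
Public Function FirstDotWords(ByVal Txt As String, _
                              Optional ByVal MaxCount As Long = 5) As String
    Dim pos As Long, i As Long
    Dim matches As Object
    Dim result As String

    pos = InStr(Txt, "]")
    If pos = 0 Then Exit Function            ' no bracket: return an empty string

    With CreateObject("VBScript.RegExp")     ' late binding, no reference needed
        .Global = True
        .Pattern = "\w+\."
        Set matches = .Execute(Mid$(Txt, pos + 1))
    End With

    ' Index-based loop so that we can stop once MaxCount is reached
    For i = 0 To matches.Count - 1
        If i >= MaxCount Then Exit For
        result = result & " " & matches(i).Value
    Next i

    FirstDotWords = Mid$(result, 2)          ' drop the leading space
End Function
```

Called on the sample text from the question, this should return ost. ult. lot. sino. collan.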

You can use the following regex
([a-zA-Z]+)\.
Let me explain a little bit.
[a-zA-Z] - a character class that matches a single letter, from a to z or A to Z.
+ - a quantifier that repeats the preceding class one or more times, so the match keeps going until it finds something that is not a letter.
\. - an escaped dot, which matches the literal . at the end of the word.
Here the example.
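A minimal sketch of how this pattern could be tried out in VBA (the Sub name and the shortened sample string are assumptions made for illustration). Because the letters are wrapped in a capture group, SubMatches(0) holds the word without its dot.

```vba
Sub DemoLetterWords()
    Dim m As Object
    Const s As String = "some text [asred.] ost. |Monday - Ribben (ult.) lot."

    With CreateObject("VBScript.RegExp")
        .Global = True                      ' find every match, not just the first
        .Pattern = "([a-zA-Z]+)\."
        For Each m In .Execute(s)
            Debug.Print m.SubMatches(0)     ' the captured letters, without the dot
        Next m
    End With
End Sub
```

Note that this pattern on its own also picks up asred, since it does not skip the text before the closing bracket.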

Related

Remove Certain Characters from a String using UDF

I have a column which contain cells that have some list of alphanumeric number system as follows:
4A(4,5,6,7,8,9); 4B(4,5,7,8); 3A(1,2,3); 3B(1,2,3), 3C(1,2)
On a cell next to it, I use a UDF function to get rid of special characters "(),;" in order to leave the data as
4A456789 4B4578 3A123 3B123 3C12
Function RemoveSpecial(Str As String) As String
Dim SpecialChars As String
Dim i As Long
SpecialChars = "(),;-abcdefghijklmnopqrstuvwxyz"
For i = 1 To Len(SpecialChars)
Str = Replace$(Str, Mid$(SpecialChars, i, 1), "")
Next
RemoveSpecial = Str
End Function
For the most part this works well. However, on certain occasions, the cell would contain an unorthodox pattern such as when a space is included between the 4A and the parenthesized items:
4A (4,5,6,7,8,9);
or when a text appears inside the parenthesis (including two spaces on each side):
4A (4,5, skip 8,9);
or a space appears between the first two characters:
4 A(4,5,6)
How would you fix this so that the random spaces are removed, except where they delimit the actual combinations of data?
One strategy would be to substitute the patterns you want to keep before eliminating the "special" characters, then restore the desired patterns.
From your sample data, it looks like you want to keep a space only if it follows ); or ),
Something like this:
Function RemoveSpecial(Data As Variant) As Variant
Dim SpecialChars As String
Dim KeepStr As Variant, PlaceHolder As Variant, ReplaceStr As Variant
Dim i As Long
Dim DataStr As String
SpecialChars = " (),;-abcdefghijklmnopqrstuvwxyz"
KeepStr = Array("); ", "), ")
PlaceHolder = Array("~0~", "~1~") ' choose a PlaceHolder that won't appear in the data
ReplaceStr = Array(" ", " ")
DataStr = Data
For i = LBound(KeepStr) To UBound(KeepStr)
DataStr = Replace$(DataStr, KeepStr(i), PlaceHolder(i))
Next
For i = 1 To Len(SpecialChars)
DataStr = Replace$(DataStr, Mid$(SpecialChars, i, 1), vbNullString)
Next
For i = LBound(KeepStr) To UBound(KeepStr)
DataStr = Replace$(DataStr, PlaceHolder(i), ReplaceStr(i))
Next
RemoveSpecial = Application.Trim(DataStr)
End Function
Another strategy would be regular expressions (RegEx)
It looks like a regular expression could come in handy here, for example:
Function RemoveSpecial(Str As String) As String
With CreateObject("vbscript.regexp")
.Global = True
.Pattern = "\)[;,]( )|[^A-Z\d]+"
RemoveSpecial = .Replace(Str, "$1")
End With
End Function
I have used the regular expression:
\)[;,]( )|[^A-Z\d]+
The way this works is to apply a form of what some would call "The best regex trick ever!":
\)[;,]( ) - Match a literal closing parenthesis, then either a semicolon or a comma, and capture the space character that follows in the 1st capture group.
| - Or use the following alternation:
[^A-Z\d]+ - Match 1+ characters that are not in the given character class, i.e. anything other than an uppercase letter or a digit.
EDIT:
In case you have values like 4A; or 4A, you can use:
(?:([A-Z])|\))[;,]( )|[^A-Z\d]+
And replace with $1$2. See an online demo.
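A minimal sketch of the edited pattern wrapped in a UDF (the name RemoveSpecial2 is an assumption, chosen so it does not clash with the function above). It relies on a capture group that did not take part in a match being replaced by an empty string.

```vba
Function RemoveSpecial2(ByVal Txt As String) As String
    With CreateObject("VBScript.RegExp")
        .Global = True
        ' Alternative 1 keeps the letter (group 1) and the space (group 2);
        ' alternative 2 matches everything else that should be dropped.
        .Pattern = "(?:([A-Z])|\))[;,]( )|[^A-Z\d]+"
        RemoveSpecial2 = .Replace(Txt, "$1$2")
    End With
End Function
```

For an input such as 4A; 4B(4,5,7,8) this should return 4A 4B4578.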

VBA regex: extract multiple strings between strings within Excel cell with custom function

Within an Excel column I have data such as:
"Audi (ADI), Mercedes (modelx) (MEX), Ferrari super fast, high PS (FEH)"
There are hundreds of models, each described by a name and an abbreviation of three capital letters in brackets.
I need to extract the names only and the abbreviations only to separate cells. I succeeded doing this for the abbreviations by the following module:
Function extrABR(cellRef) As String
Dim RE As Object, MC As Object, M As Object
Dim sTemp As Variant
Const sPat As String = "([A-Z][A-Z][A-Z][A-Z]?)" ' this is my regex to match my string
Set RE = CreateObject("vbscript.regexp")
With RE
.Global = True
.MultiLine = True
.Pattern = sPat
If .Test(cellRef) Then
Set MC = .Execute(cellRef)
For Each M In MC
sTemp = sTemp & ", " & M.SubMatches(0)
Next M
End If
End With
extrABR = Mid(sTemp, 3)
End Function
However, I do not manage to do so for names. I thought of just exchanging the regex by the following regex: (^(.*?)(?= \([A-Z][A-Z][A-Z])|(?<=, )(.*)(?= \([A-Z][A-Z][A-Z])), but VBA does not seem to allow lookbehind.
Any idea?
Correct, lookbehinds are not supported, but they are only necessary when your expected matches overlap. It is not the case here, all your matches are non-overlapping. So, you can again rely on capturing:
(?:^|,)\s*(.*?)(?=\s*\([A-Z]{3,}\))
See the regex demo. Group 1 values are accessed via .Submatches(0).
Details:
(?:^|,) - either start of a string or a comma
\s* - zero or more whitespace chars
(.*?) - Capturing group 1: any zero or more chars other than line break chars as few as possible
(?=\s*\([A-Z]{3,}\)) - a positive lookahead that matches a location that is immediately followed with:
    \s* - zero or more whitespace chars
    \( - a ( char
    [A-Z]{3,} - three or more uppercase chars
    \) - a ) char.
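A minimal sketch of this pattern in a UDF modelled on extrABR from the question (the name extrNames and the "; " separator are assumptions; a semicolon is used because the names themselves may contain commas):

```vba
Function extrNames(ByVal cellRef As String) As String
    Dim m As Object
    Dim sTemp As String

    With CreateObject("VBScript.RegExp")
        .Global = True
        .Pattern = "(?:^|,)\s*(.*?)(?=\s*\([A-Z]{3,}\))"
        For Each m In .Execute(cellRef)
            ' Group 1 holds the name; the abbreviation is only looked ahead at
            sTemp = sTemp & "; " & m.SubMatches(0)
        Next m
    End With

    extrNames = Mid$(sTemp, 3)   ' drop the leading "; "
End Function
```

For the sample row (without the surrounding quotes) this should give Audi; Mercedes (modelx); Ferrari super fast, high PS.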
RE.REPLACE --
Try this function. Anything between parentheses is replaced with "", giving you a string of model names only, which you can then split on commas to get a string array if so desired.
Function ModelNames(cellRef) As String
Dim RE As Object, MC As Object, M As Object
Dim sTemp As Variant, sPat As String
sPat = "\([^)]+\)"
'Or you can use your formula pattern "([A-Z][A-Z][A-Z][A-Z]?)" to get (modelx) in the final output.
Set RE = CreateObject("vbscript.regexp")
With RE
.Global = True
.MultiLine = True
.Pattern = sPat
End With
ModelNames = RE.Replace(cellRef, "")
End Function

How do i get part of string after a special character?

I have a column where I pick up increasing numbering values, and their format is xx_yy
so the first is 1_0, the second 1_1 and so forth; now we are at 23_31.
I want to get the right side of the string, and I am already getting the left side correctly
using
newActionId = Left(lastActionID, (Application.WorksheetFunction.Find("_", lastActionID, 1) - 1))
I wish to do the following (in plain words):
nextSubid = entire string value AFTER the special character "_"
I tried just switching Left to Right, but that didn't go so well. Do you have a suggestion?
You can use Split function to get the relevant text.
Syntax: Split(expression, [ delimiter, [ limit, [ compare ]]])
Option Explicit
Sub Sample()
Dim id As String
Dim beforeSplChr As String
Dim afterSplChr As String
id = "23_31"
beforeSplChr = Split(id, "_")(0)
afterSplChr = Split(id, "_")(1)
Debug.Print beforeSplChr
Debug.Print afterSplChr
End Sub
Another way
Debug.Print Left(id, InStrRev(id, "_") - 1) '<~~ Left Part
Debug.Print Right(id, Len(id) - InStrRev(id, "_")) '<~~ Right Part (Right takes a length, not a position)
Even though Siddharth Rout has given what can probably be considered a better answer here, I felt that this was worth adding:
To get the second part of the string using your original method, you would want to use the Mid function in place of Left, rather than trying to use Right.
Mid(string, start, [ length ])
Returns length characters from string, starting at the start position
If length is omitted, then will return characters from the start position until the end of the string
nextSubid = Mid(lastActionID, Application.WorksheetFunction.Find("_", lastActionID, 1) + 1)
Just for fun (Split is the way to go here), an alternative way using regular expressions:
Sub Test()
Dim str As String: str = "23_31"
With CreateObject("VBScript.RegExp")
.Global = True
.Pattern = "\d+"
Debug.Print .Execute(str)(0) 'Left Part
Debug.Print .Execute(str)(1) 'Right Part
End With
End Sub
Btw, as per my comment, your first value could also be achieved through:
Debug.Print Val(str)
The split function of a string is very useful for this type of query.
For example, in Java:
String s = "23_34";
String left = s.split("_")[0];
String right = s.split("_")[1];
Or you can use a combination of the indexOf and substring methods:
String left = s.substring(0, s.indexOf('_'));
String right = s.substring(s.indexOf('_') + 1);

Replace Whole Words from a Predefined List

I currently have coding which will review an equipment description field, the aim of which to standardize entries. That is - whatever is found in COL A, replace with COL B
I want to post the answer back to a new clean description column (that will work OK, no dramas on that section, but I don't need any messages etc, and this may be doing 100,000+ descriptions at a time, so looking for efficient coding).
However when it applies the Replace function, it also replaces part words, instead of distinct whole words, no matter how I sort the words on the Dictionary tab.
** 99 times out of a hundred there are no preceding or trailing spaces in Col A entries, but there are rare occasions...
Description Examples:
AIR COMPRESSOR
LEVEL GAUGE OIL SEPARATOR GAS COMPRESSOR
PRESS CTRL VV
PRESSURE GAUGE FLAME FRONT
PRESS as part of word becomes PRESSURE, e.g.:
COL A: COL B:
COMPRESSSOR COMPRESSOR
PRESSURE PRESSURE
PRESSURE GAUGE PRESSURE GAUGE
PRESS PRESSURE
AIR COMPRESSOR AIR COMPRESSOR
I think I'm very close to getting this right, but I can't figure out how to adjust to make it run and replace whole words only - I think it is the order of where I have stuff, but not 100% sure, or if something is missing.
I would greatly appreciate your help with this.
Thanks, Wendy
Function CleanUntil(original As String, targetReduction As Integer)
Dim newString As String
newString = original
Dim targetLength As Integer
targetLength = Len(original) - targetReduction
Dim rowCounter As Integer
rowCounter = 2
Dim CleanSheet As Worksheet
Set CleanSheet = ActiveWorkbook.Sheets("Dictionary")
Dim word As String
Dim cleanword As String
' Coding for replacement of WHOLE words - with a regular expression using a pattern with the \b marker (for the word boundary) before and after word
Dim RgExp As Object
Set re = CreateObject("VBScript.RegExp")
With RgExp
.Global = True
'.IgnoreCase = True 'True if search is case insensitive. False otherwise
End With
'Loop through each word until we reach the target length (or other value noted), or run out of clean words to apply
'While Len(newString) > 1 (this line will do ALL descriptions - confirmed)
'While Len(newString) > targetLength (this line will only do to target length)
While Len(newString) > 1
word = CleanSheet.Cells(rowCounter, 1).Value
cleanword = CleanSheet.Cells(rowCounter, 2).Value
RgExp.Pattern = "\b" & word & "\b"
If (word = "") Then
CleanUntil = newString
Exit Function
End If
' TODO: Make sure it is replacing whole words and not just portions of words
' newString = Replace(newString, word, cleanword) ' This line works if no RgExp applied, but finds part words.
newString = RgExp.Replace(newString, word, cleanword)
rowCounter = rowCounter + 1
Wend
' Once word find/replace finished, set close out loop for RgExp Object with word boundaries.
Set RgExp = Nothing
' Finally return the cleaned string as clean as we could get it, based on dictionary
CleanUntil = newString
End Function
NB: I would strongly recommend adding a reference to the Microsoft VBScript Regular Expressions 5.5 library (via Tools -> References...). This will give you strong typing and Intellisense on the RegExp object.
Dim RgExp As New RegExp
If I understand correctly, you can find the entries that need to be replaced using a regular expression; the regular expression only matches entries where the value in A is a complete word.
But when you try to replace with the VBA Replace function, it replaces even partial words in the text. And using the RegExp.Replace method has no effect -- the string always remains the same.
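The cause is the call itself: RegExp.Replace takes two arguments, the source string and the replacement text, and searches for whatever is in the Pattern property, so the word must not be passed again. Note also that the question's code assigns the new object to re rather than RgExp, so RgExp is never set. The capture group ( ) below is optional: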
RgExp.Pattern = "\b(" & word & ")\b"
' ...
newString = RgExp.Replace(newString, cleanword)
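The whole-word replacement above can be wrapped in a small helper. This is only a sketch under stated assumptions: the name ReplaceWholeWord is hypothetical, dictionary entries are trimmed and matched literally (regex metacharacters are escaped first), and the replacement text is assumed not to contain a $ character.

```vba
' Hypothetical helper, not part of the original code.
Function ReplaceWholeWord(ByVal Txt As String, ByVal Word As String, _
                          ByVal CleanWord As String) As String
    Dim escaped As String

    ReplaceWholeWord = Txt
    Word = Trim$(Word)                       ' rare stray spaces in column A
    If Len(Word) = 0 Then Exit Function      ' nothing to search for

    With CreateObject("VBScript.RegExp")
        .Global = True
        ' Step 1: put a backslash in front of every regex metacharacter
        .Pattern = "([\\\^\$\.\|\?\*\+\(\)\[\]\{\}])"
        escaped = .Replace(Word, "\$1")
        ' Step 2: replace the word only where it sits between word boundaries
        .Pattern = "\b" & escaped & "\b"
        ReplaceWholeWord = .Replace(Txt, CleanWord)
    End With
End Function
```

Inside the loop it would be called as newString = ReplaceWholeWord(newString, word, cleanword). With 100,000+ descriptions it is probably worth creating the RegExp object once outside the loop instead of on every call.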
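If you want to exclude the hyphen from the boundary characters, you might be able to use a negated character class that excludes word characters and the hyphen. Such a class consumes the character it matches, so capture the leading one and put it back in the replacement, and use a lookahead for the trailing one:
RgExp.Pattern = "(^|[^\w-])" & word & "(?=[^\w-]|$)"
newString = RgExp.Replace(newString, "$1" & cleanword)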
Reference:
Replace method
Introduction to the VBScript regular expression library

Split a long string by using a complete word

I'm splitting a long string like shown below wherever it finds 'END' keyword:
string_end.Split(New String() {"END"}, StringSplitOptions.None)
This splits the string perfectly into multiple parts wherever it finds 'END'. But the problem arises when a string contains the word 'RECOMMENDED': it is split into 'RECOMM' and 'ED'. I want to split by searching for the whole word, so that words like 'RECOMMENDED' stay as they are. Kindly help.
C# and VB.NET codes would suffice.
You can use Regex.Split to solve this:
Dim rgx As New Regex("\bEND\b")
Dim input As String = "RECOMMENDED AND THE END OF A STRING END"
Dim result() As String = rgx.Split(input)
'Output:
'-----------------------------
'result = {Length=3}
'(0) = "RECOMMENDED AND THE "
'(1) = " OF A STRING "
'(2) = ""
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
source: http://www.regular-expressions.info/wordboundaries.html
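The three boundary positions listed above can be checked with a quick test. The sketch below is written in VBA with the VBScript regex engine, which is an assumption on my part since this thread itself uses .NET; the \b anchor behaves the same way for these samples.

```vba
Sub DemoWordBoundary()
    Dim samples As Variant, s As Variant

    samples = Array("END TEST", "TEST END", "TEST END, HELLO WORLD", "RECOMMENDED")

    With CreateObject("VBScript.RegExp")
        .Pattern = "\bEND\b"
        For Each s In samples
            ' Prints each sample followed by whether END occurs as a whole word
            Debug.Print s & " -> " & .Test(CStr(s))
        Next s
    End With
End Sub
```

The first three samples match (start of string, end of string, and next to a comma), while RECOMMENDED does not, because there END sits between other word characters.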
Why you shouldn't use spaces on a String.Split?
Dim input As String = "RECOMMENDED AND THE END OF A STRING END"
Dim res() As String = input.Split(New String() {" END "}, StringSplitOptions.None)
'Output:
'----------------------------
'res = {Length=2}
'(0) = "RECOMMENDED AND THE"
'(1) = "OF A STRING END"
This code only splits where the word END has a space on both sides. But the word can be next to another character, or at the beginning or end of the string:
END TEST - doesn't work
TEST END - doesn't work
TEST END, HELLO WORLD - doesn't work
...
If you're simply looking for the word END, then simply add a white-space before and after:
string_end.Split(New String() {" END "}, StringSplitOptions.None)
That said, I don't recommend using Split to include or exclude a line.
A cleaner solution would simply be:
For Each line As String In Lines
If line.Contains(" END ") Then
'Do Stuff
End If
Next
You could also use Regex, but that depends on whether you are familiar with it.
Private _pattern As New Regex("\bEND\b", RegexOptions.Compiled)
For Each line As String In Lines
Dim matches As MatchCollection = _pattern.Matches(line)
If matches.Count <> 0 Then
'Do Stuff
End If
Next

Resources