How to output regex group values - Excel

I need the group values instead of the matches.
This is how I tried to get them:
Dim item as Variant, matches As Object, match As Object, subMatch As Variant, subMatches(), row As integer
row = 1
For Each item In arr
    With regex
        .Pattern = "\bopenfile ([^\s]+)|\bopen file ([^\s]+)"
        Set matches = regex.Execute(item)
        For Each match In matches
            For Each subMatch In match.subMatches
                subMatches(i) = match.subMatches(i)
                ActiveSheet.Range("A" & row).Value = subMatches(i)
                row = row + 1
                i = i + 1
            Next subMatch
        Next match
    End With
Next item
This is the text from which the values should be extracted:
Open File file.M_p3_23432e done
Openfile file.M_p4_6432e done
Open File file.M_p3_857432 done
Open File file.M_p4_34892f done
Openfile file.M_p3_781 done
Some help would be great :)
Info: I'm using Excel VBA, in case that is important to know.

You can revamp the regex to match and capture with one capturing group:
\bopen\s?file\s+(\S+)
See the regex demo.
Details:
\b - word boundary
open - a fixed word
\s? - an optional whitespace
file - a fixed word
\s+ - one or more whitespaces
(\S+) - Group 1: one or more non-whitespaces.
Now, the file names are always in SubMatches(0).
Note that the regex must be compiled with the case insensitive option and global (if the string contains multiple matches):
With regex
    .Pattern = "\bopen\s?file\s+(\S+)"
    .IgnoreCase = True
    .Global = True
End With
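As a minimal sketch of how the settings above fit into a complete routine: the function name ExtractFileNames and its parameter are hypothetical, and the regex object is assumed to be a late-bound VBScript.RegExp instance. It collects the Group 1 value of every match.

```vba
Option Explicit

' Hypothetical helper (not from the original post): returns the Group 1
' value (the file name) of every match found in source.
Function ExtractFileNames(ByVal source As String) As Collection
    Dim regex As Object, m As Object
    Dim result As New Collection

    Set regex = CreateObject("VBScript.RegExp")
    With regex
        .Pattern = "\bopen\s?file\s+(\S+)"
        .IgnoreCase = True   ' "Open File" and "Openfile" both match
        .Global = True       ' return all matches, not only the first
    End With

    For Each m In regex.Execute(source)
        result.Add m.SubMatches(0)   ' the single capturing group
    Next m

    Set ExtractFileNames = result
End Function
```

To write the values to column A as in the question, loop over the returned collection and assign each item to ActiveSheet.Range("A" & row).Value, incrementing row each time.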

Related

How to remove line feed?

This script for Outlook returns the desired date but doesn't remove the line feed.
String I want to get the date from:
1_c Anruf/Meldung am (Datum): 04.Mai.2020
With Reg1
    .Pattern = "1_c Anruf\/Meldung am \(Datum\)\s*[:]+\s*(.*)\s+\n"
    .Global = False
End With
If Reg1.Test(olMail.Body) Then
    Set M1 = Reg1.Execute(olMail.Body)
End If
For Each M In M1
    Debug.Print M.SubMatches(0)
    With xExcelApp
        Range("A5").Value = M.SubMatches(0)
    End With
Next M
regex101 selects correctly but the debugger always shows something like "02.12.2020 ". <- containing no whitespace but a line feed.
In Excel the line feed is also visible. Trailing whitespace isn't a problem since I can use TRIM, but the line feed prevents TRIM from working.
Your regex captures the CR character. You can replace . with [^\r\n] to avoid this behavior.
It seems you want to use
1_c Anruf/Meldung am \(Datum\)\s*:+\s*([^\r\n]*\S)
See the regex demo. Note the forward slash does not have to be escaped in the VBA code. Details:
1_c Anruf/Meldung am \(Datum\) - a fixed 1_c Anruf/Meldung am (Datum) string
\s*:+\s* - one or more colons enclosed with zero or more whitespaces
([^\r\n]*\S) - Capturing group 1 (accessed with M.SubMatches(0)) that captures zero or more occurrences of any char other than CR and LF chars and then any non-whitespace char.
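A minimal sketch of the pattern above in a standalone function, for illustration only: the name ExtractCallDate and its body parameter are hypothetical, and the regex is assumed to be a late-bound VBScript.RegExp object. Because the group ends in \S and excludes CR and LF, neither trailing spaces nor the line break end up in the result.

```vba
Option Explicit

' Hypothetical helper: returns the date after "1_c Anruf/Meldung am (Datum):"
' without trailing whitespace or line-break characters.
' Returns an empty string when the label is not found.
Function ExtractCallDate(ByVal body As String) As String
    Dim regex As Object, matches As Object

    Set regex = CreateObject("VBScript.RegExp")
    regex.Pattern = "1_c Anruf/Meldung am \(Datum\)\s*:+\s*([^\r\n]*\S)"
    regex.Global = False   ' only the first occurrence is needed

    Set matches = regex.Execute(body)
    If matches.Count > 0 Then
        ExtractCallDate = matches(0).SubMatches(0)
    End If
End Function
```

In the Outlook script, the result of ExtractCallDate(olMail.Body) could then be assigned to the target cell directly.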

How to match accented characters but not tab

I'm trying to match the company name in this string delimited with tabs.
The table below does not contain tabs when you copy it; I have replaced the tabs with two spaces, which I assume will work fine for testing.
1025164 HERBEX IBERIA, S.L.U. KY01 4600292091
1016379 DRISCOLL´S OF EUROPE B.V. KY01 4600322589
1008809 LANDGARD NORD OBST & GEMÜSE GM KY01 4600347315
1008835 C.A.S.I. : COOPERATIVA PROVINC KY01 4600348112
1019258 SYDGRÖNT EKONOMISK FÖRENING KY02 4600343422
(The second column of the above, between 7 digit number and KY0 above)
In real life the columns are not always in the same order since it's a user preference.
I just took a few examples but names could also include /éèáà()´, pretty much anything (sadly).
I found another question here Concrete Javascript Regex for Accented Characters (Diacritics)
When I use the regex patterns in that thread, example: "\t([A-zÀ-ÿ0-9\s\.\,\_\-\'\&]+)\t" (I know some characters are still missing) to match between two tabs it becomes greedy and matches the whole line.
Is there any pattern that could match any character in a company name between tabs (or two spaces as the example above)?
Instead of returning a matched part, I matched everything and replaced it with the 1st capture group. Hope it helps.
Sub Test()
    Dim str As String: str = "1025164" & vbTab & "HERBEX IBERIA, S.L.U." & vbTab & "KY01" & vbTab & "4600292091"
    With CreateObject("vbscript.regexp")
        .Global = True
        .Pattern = "(?:^|\t)(?:\d+|KY\d+|([^\t]+))(?=\t|$)"
        Debug.Print .Replace(str, "$1")
    End With
End Sub
Have a look at this online demo to test the pattern:
(?:^|\t) - Match either start line anchor or a tab. Unfortunately the VBA-regex object does not support lookbehinds.
(?: - Open a non-capture group to start matching all parts you don't want to capture first:
\d+ - match 1+ digits;
| - Or:
KY\d+ - Match "KY" followed by 1+ digits;
| - Or:
([^\t]+) - nest a capture group to capture 1+ non-tabs.
) - Close non-capture group.
(?=\t|$) - Positive lookahead to assert captured text is followed by either a tab or end-line anchor.
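The breakdown above can be sketched as a reusable function. The name CompanyFromLine is hypothetical, and the sample rows in the usage note are assumptions built from the question's data (ChrW(214) stands for the letter Ö so the code does not depend on the editor's code page). The sketch relies on a non-participating group being replaced by an empty string, as the Sub above already does.

```vba
Option Explicit

' Hypothetical wrapper around the pattern above: every tab-delimited column
' is matched, but only columns that are neither all digits nor "KY" + digits
' are captured in group 1 and survive the replacement.
Function CompanyFromLine(ByVal lineText As String) As String
    With CreateObject("VBScript.RegExp")
        .Global = True
        .Pattern = "(?:^|\t)(?:\d+|KY\d+|([^\t]+))(?=\t|$)"
        CompanyFromLine = .Replace(lineText, "$1")
    End With
End Function
```

Because each column is tested independently, the column order should not matter: CompanyFromLine("KY01" & vbTab & "HERBEX IBERIA, S.L.U." & vbTab & "1025164") returns the same name as the original order does.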
I would take a different approach and use the Split function. The following code assumes that the separator is a tab and that the company name is the first column that is not numeric (only digits) and does not start with 'KY'.
Function getCompanyName(line As String) As String
    Const separator = vbTab ' Replace with " " if you need that.
    Dim tokens() As String, i As Integer
    tokens = Split(line, separator)
    For i = 0 To UBound(tokens)
        ' Skip purely numeric columns and columns that start with "KY"
        If Not IsNumeric(tokens(i)) And Left(tokens(i), 2) <> "KY" Then
            getCompanyName = tokens(i)
            Exit Function
        End If
    Next
End Function

What's the best way to keep regex matches in Excel?

I'm working off of the excellent information provided in "How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops", however I'm running into a wall trying to keep the matched expression, rather than the un-matched portion:
"2022-02-14T13:30:00.000Z" converts to "T13:30:00.000Z" instead of "2022-02-14" when the function is used in a spreadsheet. Listed below is the code, which was taken from "How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops". I thought a negation of strPattern2 would work, but I'm still having issues. Any help is greatly appreciated.
Function simpleCellRegex(Myrange As Range) As String
    Dim regEx As New RegExp
    Dim strPattern As String
    Dim strPattern2 As String
    Dim strInput As String
    Dim strReplace As String
    Dim strOutput As String
    strPattern = "^T{0-9][0-9][:]{0-9][0-9][:]{0-9][0-9][0-9][Z]"
    strPattern2 = "^(19|20)\d\d([- /.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])"
    If strPattern2 <> "" Then
        strInput = Myrange.Value
        strReplace = ""
        With regEx
            .Global = True
            .MultiLine = True
            .IgnoreCase = False
            .Pattern = strPattern2
        End With
        If regEx.test(strInput) Then
            simpleCellRegex = regEx.Replace(strInput, strReplace)
        Else
            simpleCellRegex = "Not matched"
        End If
    End If
End Function
Replace is very powerful, but you need to do two things:
1. Specify all the characters you want to drop. If your regexp is <myregexp>, change it to ^.*?(<myregexp>).*$, assuming you only have one date occurrence in your string. The parentheses are called a 'capturing group' and you can refer to them later as part of your replacement pattern. The ^ at the beginning and the $ at the end ensure that you will only match one occurrence of your pattern even if Global=True. I noticed you were already using a capturing group as a back-reference; you need to add one to the back-reference number because we added a capturing group. Setting up the pattern this way, the entire string will participate in the match and we will use the capturing groups to preserve what we want to keep.
2. Change your strReplace="" to strReplace="$1", indicating you want to replace whatever was matched with the contents of capturing group #1.
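Applied to the date pattern from the question, the two changes might look like the following sketch. The function name KeepDate is hypothetical, and a late-bound VBScript.RegExp object is assumed in place of the early-bound RegExp reference. Note that the back-reference \2 from the question becomes \3, because the new outer group shifts the numbering by one.

```vba
Option Explicit

' Hypothetical variant of simpleCellRegex that keeps the matched date
' instead of removing it.
Function KeepDate(ByVal strInput As String) As String
    Dim regEx As Object
    Set regEx = CreateObject("VBScript.RegExp")

    With regEx
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        ' Group 1 wraps the whole date; the separator group is now group 3.
        .Pattern = "^.*?((19|20)\d\d([- /.])(0[1-9]|1[012])\3(0[1-9]|[12][0-9]|3[01])).*$"
    End With

    If regEx.Test(strInput) Then
        ' The entire string is matched and replaced by the date alone.
        KeepDate = regEx.Replace(strInput, "$1")
    Else
        KeepDate = "Not matched"
    End If
End Function
```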
Here is a screenprint from Excel using my RegexpReplace User Defined Function to process your example with my suggestions:
I had to fix up your time portion regexp because you used curly brackets three times where you meant square, and you left out the seconds part completely. Notice by adjusting where you start and end your capturing group parentheses you can keep or drop the T & Z at either end of the time string.
Also, if your program is being passed system timestamps from a reliable source then they are already well-formed and you don't need those long, long regular expressions to reject March 32. You can code both parts in one as
([-0-9/.]{10,10})T([0-9:.]{12,12})Z
When you want the date part, use $1; when you want the time part, use $2.
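A short sketch of the combined pattern in use, assuming a well-formed timestamp as input. The function name TimestampPart and its groupNo parameter are hypothetical. If the input does not match, Replace returns it unchanged.

```vba
Option Explicit

' Hypothetical helper: groupNo = 1 returns the date part,
' groupNo = 2 returns the time part of a timestamp such as
' "2022-02-14T13:30:00.000Z".
Function TimestampPart(ByVal ts As String, ByVal groupNo As Long) As String
    With CreateObject("VBScript.RegExp")
        .Pattern = "([-0-9/.]{10,10})T([0-9:.]{12,12})Z"
        ' "$1" or "$2" selects which capturing group replaces the match.
        TimestampPart = .Replace(ts, "$" & groupNo)
    End With
End Function
```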

Capturing the submatches after the second instance of a string variable

How would one capture two numeric submatches after the second instance of a string/number?
I have a number that changes from one .txt file to the next. It is captured in a variable called "Total", which I declared as a string. The string contains numbers in the format of 123, 456,789.23 or 123, 456.01. This number appears about 3 times within the .txt file, and I have written a RegEx pattern that is able to capture the first instance of this number and its submatches.
regex.Pattern = Total & "\s*([\d+\.\d*])\s*([\d+\.\d*])\s*"
The .txt file portion I am trying to capture may appear as
123,456,789.38
2.180
251.517
OR
123,456,789.38 2.180 251.517
I want to capture 2.180 and 251.517.
The first instance has the words "Number of: " in front of it, and I tried to make the pattern skip that instance by excluding the ":" before it:
regex.Pattern = "[^:\s]" & Total & "\s*([\d+\.\d*])\s*([\d+\.\d*])\s*"
It still picks up this first instance and the numbers after that first instance. The second instance does not have any defining words before it, just a blank line such as the one below:
123,456,789.38
2.180
251.517
Additional information:
Dim regex As Object: Set regex = CreateObject("vbscript.regexp")
regex.Pattern = Total & "\s*([\d+\.\d*])\s*([\d+\.\d*])\s*"
Dim MCS As Object
Set MCS = regex.Execute(myText)
Dim Total As String: Total = MCS(0).submatches(0)
Dim submatch1 As String: submatch1 = MCS(0).Submatches(1)
Dim submatch2 As String: submatch2 = MCS(0).Submatches(2)
where mytext is the contents of the .txt file entirely as a string.
There are also words and numbers between the different instances of the variable "Total", such as
Number of: 123,456,789.38
Text Here Text Here
Number Here
123,456,789.38
2.180
251.517
I am also not sure how much/many text/numbers there will be between the first two instances of 123,456,789.38, so I am trying to think of how to work this to be flexible.
When I mention second instance, I mean that the number 123,456,789.38 (which is the variable named "Total") appears three times in the document. I want to capture the two submatches that appear after that number. However, since there are three times it appears, I want to capture the two submatches that appear after the second time that 123,456,789.38 pops up.
Link to the text file.
https://regex101.com/r/4hPtY3/6
Output:
submatch1 = 2.180
submatch2 = 251.517
Currently, it is capturing
submatch1 = 97
submatch2 = 5
with the pattern:
regex.Pattern = Replace(Total, ".", "\.") & "\s*([.\d]+)\s*([.\d]+)"
You can use
regex.Pattern = "^(?:[\s\S]*?" & Replace(Total, ".", "\.") & "){2}\D+([.\d]+)\s*([.\d]+)"
Here is the regex demo, where you can see that the two groups are captured as expected.
Details
^ - start of string
(?:[\s\S]*?123,456\.78){2} - two occurrences of any 0+ chars as few as possible and then 123,456.78 string
\D+ - 1 or more non-digit chars
([.\d]+) - Group 1: one or more dots or digits
\s* - 0+ whitespaces
([.\d]+) - Group 2: one or more dots or digits.
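The pattern can be wrapped in a small function, sketched here under stated assumptions: the name NumbersAfterSecond and its parameters are hypothetical, and the regex is a late-bound VBScript.RegExp object. It returns a two-element array with both submatches, or Empty when the total does not occur twice with two numbers after it.

```vba
Option Explicit

' Hypothetical helper: captures the two numbers that follow the
' second occurrence of total inside source.
Function NumbersAfterSecond(ByVal source As String, ByVal total As String) As Variant
    Dim regex As Object, matches As Object

    Set regex = CreateObject("VBScript.RegExp")
    ' {2} consumes everything up to and including the second occurrence;
    ' only the dots in total are escaped, as in the answer above.
    regex.Pattern = "^(?:[\s\S]*?" & Replace(total, ".", "\.") & "){2}\D+([.\d]+)\s*([.\d]+)"

    Set matches = regex.Execute(source)
    If matches.Count > 0 Then
        NumbersAfterSecond = Array(matches(0).SubMatches(0), matches(0).SubMatches(1))
    End If
End Function
```

Check the result with IsEmpty before indexing into it, since a file without a second occurrence yields no match.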

How to count exact text contain in string [Excel]

I already use the formulas below to count cells that contain an exact text, but they still count incorrectly. For example, I would like to count the "ZIKA" test code in the table; the answer should be two. But the formula also counts ZIKA2 as ZIKA. How can I exclude ZIKA2 from the count?
TEST
HS2, CCAL, EGFR, AFB
ZIKA, AG21
PPB, ZIKA2
ZIKA, AG21
I already try these formulas:
=SUMPRODUCT(--(ISNUMBER(FIND("ZIKA",F:F))))
and also
=COUNTIF(F:F,"ZIKA")
You could count the exact ZIKA and its comma-separated variations:
=COUNTIF(F:F,"ZIKA")+COUNTIF(F:F,"ZIKA,*")+COUNTIF(F:F,"*, ZIKA")+COUNTIF(F:F,"*, ZIKA,*")
I assume your data follow this format
xxx, yyy, zzz
space after comma
You may need to split your formula into 3 parts
=COUNTIF(F:F,"ZIKA,*")+COUNTIF(F:F,"*, ZIKA")+COUNTIF(F:F,"ZIKA")
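The first part counts cells that start with ZIKA, the second counts cells that end with ZIKA, and the last counts cells that contain only ZIKA. A ZIKA in the middle of a list (for example "AG21, ZIKA, PPB") is not covered by these three parts; add COUNTIF(F:F,"*, ZIKA,*") for that case.
Try this regex; it may need a helper column. I have not tested it much yet.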
Press ALT + F11 to open VBA editor.
Click Insert -> module and copy paste the code below.
Function Regex(Cell, Search)
    Dim RE As Object, Matches As Object, res As Object
    Set RE = CreateObject("vbscript.regexp")
    RE.Pattern = "(\b" & Search & "\b)"
    RE.Global = True
    RE.IgnoreCase = True
    Set Matches = RE.Execute(Cell)
    ' Join every match into a comma-separated list
    For Each res In Matches
        Regex = Regex & "," & res.Value
    Next res
    Regex = Mid(Regex, 2) ' drop the leading comma
End Function
It returns "ZIKA" if it finds ZIKA in the cell you run it on.
Then you just count the ZIKAs in the helper column.
Updated with new code that lets you change the search term.
Use it with =regex(A1, "ZIKA")
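If a helper column is not wanted, the same word-boundary idea could be folded into a counting function. This is only a sketch: the name CountWholeWord is hypothetical, and values is assumed to be either a VBA array or a bounded worksheet range without error cells. It counts cells, not occurrences, so a cell that contains the word twice counts once.

```vba
Option Explicit

' Hypothetical helper: counts the items in values that contain search
' as a whole word, so "ZIKA2" does not count as "ZIKA".
Function CountWholeWord(values As Variant, ByVal search As String) As Long
    Dim re As Object, v As Variant

    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "\b" & search & "\b"
    re.IgnoreCase = True

    For Each v In values
        If re.Test(CStr(v)) Then CountWholeWord = CountWholeWord + 1
    Next v
End Function
```

On a sheet this would be called as =CountWholeWord(F2:F100, "ZIKA"); passing a whole column such as F:F would loop over every cell in it and is likely to be slow.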

Resources