How to match accented characters but not tab - excel

I'm trying to match the company name in this string delimited with tabs.
The table below does not contain tabs when copied, so I have replaced each tab with two spaces, which I assume will work fine for testing.
1025164 HERBEX IBERIA, S.L.U. KY01 4600292091
1016379 DRISCOLL´S OF EUROPE B.V. KY01 4600322589
1008809 LANDGARD NORD OBST & GEMÜSE GM KY01 4600347315
1008835 C.A.S.I. : COOPERATIVA PROVINC KY01 4600348112
1019258 SYDGRÖNT EKONOMISK FÖRENING KY02 4600343422
(The second column of the above, between 7 digit number and KY0 above)
In real life the columns are not always in the same order since it's a user preference.
I just took a few examples but names could also include /éèáà()´, pretty much anything (sadly).
I found another question here Concrete Javascript Regex for Accented Characters (Diacritics)
When I use the regex patterns in that thread, example: "\t([A-zÀ-ÿ0-9\s\.\,\_\-\'\&]+)\t" (I know some characters are still missing) to match between two tabs it becomes greedy and matches the whole line.
Is there any pattern that could match any character in a company name between tabs (or two spaces as the example above)?

Instead of returning a matched part, I matched everything and replaced it with the 1st capture group. Hope it helps.
Sub Test()
Dim str As String: str = "1025164" & vbTab & "HERBEX IBERIA, S.L.U." & vbTab & "KY01" & vbTab & "4600292091"
With CreateObject("vbscript.regexp")
.Global = True
.Pattern = "(?:^|\t)(?:\d+|KY\d+|([^\t]+))(?=\t|$)"
Debug.Print .Replace(str, "$1")
End With
End Sub
You can test the pattern in an online regex demo. It breaks down as follows:
(?:^|\t) - Match either start line anchor or a tab. Unfortunately the VBA-regex object does not support lookbehinds.
(?: - Open a non-capture group to start matching all parts you don't want to capture first:
\d+ - match 1+ digits;
| - Or:
KY\d+ - Match "KY" followed by 1+ digits;
| - Or:
([^\t]+) - nest a capture group to capture 1+ non-tabs.
) - Close non-capture group.
(?=\t|$) - Positive lookahead to assert captured text is followed by either a tab or end-line anchor.

I would take a different approach using the Split function. The following code assumes that the columns are separated by tabs and that the company name is the first column that is neither numeric (only digits) nor starts with 'KY'.
Function getCompanyName(line As String) As String
Const separator = vbTab ' Replace with " " if you need that.
Dim tokens() As String, i As Integer
tokens = Split(line, separator)
For i = 0 To UBound(tokens)
If Not IsNumeric(tokens(i)) And Left(tokens(i), 2) <> "KY" Then
getCompanyName = tokens(i)
Exit Function
End If
Next
End Function

Related

how to output regex group values

I need the group values instead of the matches.
This is how I tried to get those:
Dim item as Variant, matches As Object, match As Object, subMatch As Variant, subMatches(), row As integer
row = 1
For Each item In arr
With regex
.Pattern = "\bopenfile ([^\s]+)|\bopen file ([^\s]+)"
Set matches = regex.Execute(item)
For Each match In matches
For Each subMatch In match.subMatches
subMatches(i) = match.subMatches(i)
ActiveSheet.Range("A" & row).Value = subMatches(i)
row = row + 1
i = i + 1
Next subMatch
Next match
End With
Next item
This is the text it should be extracted from:
Open File file.M_p3_23432e done
Openfile file.M_p4_6432e done
Open File file.M_p3_857432 done
Open File file.M_p4_34892f done
Openfile file.M_p3_781 done
Info: I'm using Excel VBA, in case that is important to know.
Some help would be great :)
You can revamp the regex to match and capture with one capturing group:
\bopen\s?file\s+(\S+)
See the regex demo.
Details:
\b - word boundary
open - a fixed word
\s? - an optional whitespace
file - a fixed word
\s+ - one or more whitespaces
(\S+) - Group 1: one or more non-whitespaces.
Now, the file names are always in SubMatches(0).
Note that the regex must be compiled with the case insensitive option and global (if the string contains multiple matches):
With regex
.Pattern = "\bopen\s?file\s+(\S+)"
.IgnoreCase = True
.Global = True
End With
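Putting the pieces together, a minimal sketch of the extraction loop might look like the following. The function name ExtractFileNames is made up for illustration, and the regex object is late-bound through CreateObject; adapt the loop body if you want to write straight to a worksheet instead of collecting the values.

```vba
' Hypothetical helper: returns every file name captured by group 1.
Function ExtractFileNames(ByVal source As String) As Collection
    Dim regex As Object, matches As Object, m As Object
    Dim result As New Collection

    Set regex = CreateObject("VBScript.RegExp")
    With regex
        .Pattern = "\bopen\s?file\s+(\S+)"
        .IgnoreCase = True   ' matches "Open File" as well as "Openfile"
        .Global = True       ' find all matches, not just the first
    End With

    Set matches = regex.Execute(source)
    For Each m In matches
        ' There is only one capturing group, so the name is always SubMatches(0).
        result.Add m.SubMatches(0)
    Next m

    Set ExtractFileNames = result
End Function
```

Because the pattern has a single capturing group, there is no need for an inner loop over SubMatches or a separate index variable.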

How to remove line feed?

This script for Outlook returns the desired date but doesn't remove the line feed.
String I want to get the date from:
1_c Anruf/Meldung am (Datum): 04.Mai.2020
With Reg1
.Pattern = "1_c Anruf\/Meldung am \(Datum\)\s*[:]+\s*(.*)\s+\n"
.Global = False
End With
If Reg1.Test(olMail.Body) Then
Set M1 = Reg1.Execute(olMail.Body)
End If
For Each M In M1
Debug.Print M.SubMatches(0)
With xExcelApp
Range("A5").Value = M.SubMatches(0)
End With
Next M
regex101 selects correctly but the debugger always shows something like "02.12.2020 ". <- containing no whitespace but a line feed.
In Excel the line feed is also visible. Also trailing whitespace isn't a problem since I can use TRIM but the line feed doesn't allow it to function.
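Your regex captures the CR character: in the VBA regex engine, . matches any character except a line feed, so the carriage return in front of the line feed can end up in the group. Replace . with [^\r\n] to avoid this behavior.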
It seems you want to use
1_c Anruf/Meldung am \(Datum\)\s*:+\s*([^\r\n]*\S)
See the regex demo. Note the forward slash does not have to be escaped in the VBA code. Details:
1_c Anruf/Meldung am \(Datum\) - a fixed 1_c Anruf/Meldung am (Datum) string
\s*:+\s* - one or more colons enclosed with zero or more whitespaces
([^\r\n]*\S) - Capturing group 1 (accessed with M.SubMatches(0)) that captures zero or more occurrences of any char other than CR and LF chars and then any non-whitespace char.
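As a rough sketch of how the pattern could be used, the function below returns the captured date without any trailing whitespace or line break. ExtractCallDate is a made-up name, and the mail body is passed in as a plain string so the example does not depend on Outlook.

```vba
' Hypothetical helper: returns the date after the "1_c Anruf/Meldung am (Datum)" label.
Function ExtractCallDate(ByVal body As String) As String
    Dim re As Object, matches As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Global = False
    ' [^\r\n]* stays on the current line; the final \S forces the capture
    ' to end on a non-whitespace character, so trailing spaces are dropped.
    re.Pattern = "1_c Anruf/Meldung am \(Datum\)\s*:+\s*([^\r\n]*\S)"

    Set matches = re.Execute(body)
    If matches.Count > 0 Then
        ExtractCallDate = matches(0).SubMatches(0)
    End If
End Function
```

If the label is not found, the function returns an empty string, so the caller can check for that before writing to the sheet.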

What's the best way to keep regex matches in Excel?

I'm working off of the excellent information provided in "How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops". However, I'm running into a wall trying to keep the matched expression rather than the unmatched portion:
"2022-02-14T13:30:00.000Z" converts to "T13:30:00.000Z" instead of "2022-02-14", when the function is used in a spreadsheet. Listed below is the code which was taken from "How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops". I thought a negation of the strPattern2 would work, however I'm still having issues. Any help is greatly appreciated.
Function simpleCellRegex(Myrange As Range) As String
Dim regEx As New RegExp
Dim strPattern As String
Dim strPattern2 As String
Dim strInput As String
Dim strReplace As String
Dim strOutput As String
strPattern = "^T{0-9][0-9][:]{0-9][0-9][:]{0-9][0-9][0-9][Z]"
strPattern2 = "^(19|20)\d\d([- /.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])"
If strPattern2 <> "" Then
strInput = Myrange.Value
strReplace = ""
With regEx
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = strPattern2
End With
If regEx.test(strInput) Then
simpleCellRegex = regEx.Replace(strInput, strReplace)
Else
simpleCellRegex = "Not matched"
End If
End If
End Function
Replace is very powerful, but you need to do two things:
1. Match all the characters you want to drop as well. If your regexp is <myregexp>, change it to ^.*?(<myregexp>).*$, assuming you only have one date occurrence in your string. The parentheses are called a 'capturing group', and you can refer to them later as part of your replacement pattern. The ^ at the beginning and the $ at the end ensure that you will only match one occurrence of your pattern even if Global=True. I noticed you were already using a capturing group as a back-reference, so you need to add one to the back-reference number (\2 becomes \3) because we added a capturing group in front of it. Set up this way, the entire string participates in the match, and the capturing groups preserve what we want to keep.
2. Change your strReplace="" to strReplace="$1", indicating that you want to replace whatever was matched with the contents of capturing group #1.
Here is a screenprint from Excel using my RegexpReplace User Defined Function to process your example with my suggestions:
I had to fix up your time portion regexp because you used curly brackets three times where you meant square, and you left out the seconds part completely. Notice by adjusting where you start and end your capturing group parentheses you can keep or drop the T & Z at either end of the time string.
Also, if your program is being passed system timestamps from a reliable source then they are already well-formed and you don't need those long, long regular expressions to reject March 32. You can code both parts in one as
([-0-9/.]{10,10})T([0-9:.]{12,12})Z and when you want the date part use $1 and when you want the time part use $2.
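A minimal sketch of that combined approach is shown below. TimestampPart is a hypothetical name, and the function takes plain strings rather than a Range so it can be tried from the Immediate window; the ^.*? and .*$ wrappers are added so that the whole input takes part in the match.

```vba
' Hypothetical helper: keeps the date ("$1") or the time ("$2") part of a timestamp.
Function TimestampPart(ByVal strInput As String, ByVal strReplace As String) As String
    Dim regEx As Object
    Set regEx = CreateObject("VBScript.RegExp")

    With regEx
        .Global = False
        .IgnoreCase = False
        ' ^.*? and .*$ swallow everything outside the two captured parts,
        ' so Replace returns only what the chosen group matched.
        .Pattern = "^.*?([-0-9/.]{10,10})T([0-9:.]{12,12})Z.*$"
    End With

    If regEx.Test(strInput) Then
        TimestampPart = regEx.Replace(strInput, strReplace)
    Else
        TimestampPart = "Not matched"
    End If
End Function
```

In a worksheet this could be called as =TimestampPart(A1, "$1") for the date and =TimestampPart(A1, "$2") for the time.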

Capturing the submatches after the second instance of a string variable

How would one capture two numeric submatches after the second instance of a string/number?
I have a # that changes from .txt file to .txt file. It is captured in a variable called "Total" which I declared as a string. The string contains numbers, in the format of 123, 456,789.23 or 123, 456.01. This number appears about 3 times within the .txt file, and I have written a RegEx pattern that is able to capture the first instance of this number and its submatches.
regex.Pattern = Total & "\s*([\d+\.\d*])\s*([\d+\.\d*])\s*"
The .txt file portion I am trying to capture may appear as
123,456,789.38
2.180
251.517
OR
123,456,789.38 2.180 251.517
I want to capture 2.180 and 251.517.
The first instance includes the words "Number of: " in front of it, and I tried to make the pattern avert from the ":" before it by writing:
regex.Pattern = "[^:\s]" & Total & "\s*([\d+\.\d*])\s*([\d+\.\d*])\s*"
It still picks up this first instance and the numbers after that first instance. The second instance does not have any defining words before it, just a blank line such as the one below:
123,456,789.38
2.180
251.517
Additional information:
Dim regex As Object: Set regex = CreateObject("vbscript.regexp")
regex.Pattern = Total & "\s*([\d+\.\d*])\s*([\d+\.\d*])\s*"
Dim MCS As Object
Set MCS = regex.Execute(myText)
Dim Total As String: Total = MCS(0).submatches(0)
Dim submatch1 As String: submatch1 = MCS(0).Submatches(1)
Dim submatch2 As String: submatch2 = MCS(0).Submatches(2)
where mytext is the contents of the .txt file entirely as a string.
There are also words and numbers between the different instances of the variable "Total", such as
Number of: 123,456,789.38
Text Here Text Here
Number Here
123,456,789.38
2.180
251.517
I am also not sure how much/many text/numbers there will be between the first two instances of 123,456,789.38, so I am trying to think of how to work this to be flexible.
When I mention second instance, I mean that the number 123,456,789.38 (which is the variable named "Total") appears three times in the document. I want to capture the two submatches that appear after that number. However, since there are three times it appears, I want to capture the two submatches that appear after the second time that 123,456,789.38 pops up.
Link to the text file.
https://regex101.com/r/4hPtY3/6
Output:
submatch1 = 2.180
submatch2 = 251.517
Currently, it is capturing
submatch1 = 97
submatch2 = 5
with the pattern:
regex.Pattern = Replace(Total, ".", "\.") & "\s*([.\d]+)\s*([.\d]+)"
You can use
regex.Pattern = "^(?:[\s\S]*?" & Replace(Total, ".", "\.") & "){2}\D+([.\d]+)\s*([.\d]+)"
See the regex demo, where the two groups are captured as expected.
Details
^ - start of string
(?:[\s\S]*?123,456\.78){2} - two occurrences of any 0+ chars as few as possible and then 123,456.78 string
\D+ - 1 or more non-digit chars
([.\d]+) - Group 1: one or more dots or digits
\s* - 0+ whitespaces
([.\d]+) - Group 2: one or more dots or digits.
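For illustration, here is a sketch that wraps the pattern in a function. ValuesAfterSecondTotal is a made-up name; myText and Total follow the question. Only the dot in Total is escaped, as in the pattern above, which assumes Total contains no other regex metacharacters.

```vba
' Hypothetical helper: returns the two numbers that follow the second
' occurrence of Total in myText, or Empty if there is no such match.
Function ValuesAfterSecondTotal(ByVal myText As String, ByVal Total As String) As Variant
    Dim regex As Object, MCS As Object

    Set regex = CreateObject("VBScript.RegExp")
    ' MultiLine stays False (the default), so ^ anchors at the start of the text.
    regex.Pattern = "^(?:[\s\S]*?" & Replace(Total, ".", "\.") & "){2}\D+([.\d]+)\s*([.\d]+)"

    Set MCS = regex.Execute(myText)
    If MCS.Count = 0 Then
        ValuesAfterSecondTotal = Empty
    Else
        ValuesAfterSecondTotal = Array(MCS(0).SubMatches(0), MCS(0).SubMatches(1))
    End If
End Function
```

The caller can test the result with IsEmpty before reading the two array elements.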

Split, escaping certain splits

I have a cell that contains multiple questions and answers and is organised like a CSV. So to get all these questions and answers separated a simple split using the comma as the delimiter should separate this easily.
Unfortunately, there are some values that use the comma as the decimal separator. Is there a way to escape the split for those occurrences?
Fortunately, my data can be split using ", " as separator, but if this wouldn't be the case, would there still be a solution besides manually replacing the decimal delimiter from a comma to a dot?
Example:
"Price: 0,09,Quantity: 12,Sold: Yes"
Using Split("Price: 0,09,Quantity: 12,Sold: Yes",",") would yield:
Price: 0
09
Quantity: 12
Sold: Yes
One possibility, given this test data, is to loop through the array after splitting, and whenever there's no : in the string, add this entry to the previous one.
The function that does this might look like this:
Public Function CleanUpSeparator(celldata As String) As String()
    Dim ret() As String
    Dim tmp() As String
    Dim i As Long, j As Long
    tmp = Split(celldata, ",")
    ' Nothing to clean up for an empty input
    If UBound(tmp) < 0 Then
        CleanUpSeparator = tmp
        Exit Function
    End If
    ReDim ret(UBound(tmp))
    j = -1
    For i = 0 To UBound(tmp)
        If InStr(1, tmp(i), ":") > 0 Or j < 0 Then
            ' This piece starts a new entry
            j = j + 1
            ret(j) = tmp(i)
        Else
            ' No colon: put this value on the previous entry, and restore the comma
            ret(j) = ret(j) & "," & tmp(i)
        End If
    Next i
    ReDim Preserve ret(j)
    CleanUpSeparator = ret
End Function
Note that there's room for improvement, for instance by making the separator characters : and , into parameters.
I spent the last 24 hours or so puzzling over what I THINK is a completely analogous problem, so I'll share my solution here. Forgive me if I'm wrong about the applicability of my solution to this question. :-)
My Problem: I have a SharePoint list in which teachers (I'm an elementary school technology specialist) enter end-of-year award certificates for me to print. Teachers can enter multiple students' names for a given award, separating each name using a comma. I have a VBA macro in Access that turns each name into a separate record for mail merging. Okay, I lied. That was more of a story. HERE'S the problem: How can teachers add a student name like Hank Williams, Jr. (note the comma) without having the comma cause "Jr." to be interpreted as a separate student in my macro?
The full contents of the (SharePoint exported to Excel) field "Students" are stored within the macro in a variable called strStudentsBeforeSplit, and this string is eventually split with this statement:
strStudents = Split(strStudentsBeforeSplit, ",", -1, vbTextCompare)
So there's the problem, really. The Split function is using a comma as a separator, but poor student Hank Williams, Jr. has a comma in his name. What to do?
I spent a long time trying to figure out how to escape the comma. If this is possible, I never figured it out.
Lots of forum posts suggested using a different character as the separator. That's okay, I guess, but here's the solution I came up with:
1. Replace only the special commas preceding "Jr" with a different, uncommon character BEFORE the Split function runs.
2. Swap back to the commas after Split runs.
That's really the end of my post, but here are the lines from my macro that accomplish step 1. This may or may not be of interest because it really just deals with the minutiae of making the swap. Note that the code handles several different (mostly wrong) ways my teachers might type the "Jr" part of the name.
'Dealing with the comma before Jr. This will handle ", Jr." and ", Jr" and " Jr." and " Jr".
'Replaces the comma with ~ because commas are used to separate fields in Split function below.
'Will swap ~ back to comma later in UpdateQ_Comma_for_Jr query.
strStudentsBeforeSplit = Replace(strStudentsBeforeSplit, "Jr", "~ Jr.") 'Every Jr gets this treatment regardless of what else is around it.
'Note that because of previous Replace functions a few lines prior, the space between the comma and Jr will have been removed. This adds it back.
strStudentsBeforeSplit = Replace(strStudentsBeforeSplit, ",~ Jr", "~ Jr") 'If teacher had added a comma, strip it.
strStudentsBeforeSplit = Replace(strStudentsBeforeSplit, " ~ Jr", "~ Jr") 'In cases when teacher added Jr but no comma, remove the (now extra)...
'...space that was before Jr.
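For readers who want both steps in one place, here is a simplified sketch of the swap, split, and swap-back idea. It is not the macro described above: SplitStudentNames is a made-up name, it only handles the ", Jr" spelling, and it assumes ~ never appears in a real name.

```vba
' Hypothetical helper: splits a comma-separated list of names while
' keeping the comma in names such as "Hank Williams, Jr.".
Function SplitStudentNames(ByVal strStudentsBeforeSplit As String) As String()
    Const PLACEHOLDER As String = "~"   ' assumed not to occur in any name
    Dim strStudents() As String
    Dim i As Long

    ' Step 1: protect the comma that belongs to the name.
    strStudentsBeforeSplit = Replace(strStudentsBeforeSplit, ", Jr", PLACEHOLDER & " Jr")

    ' Split on the remaining commas, which really are separators.
    strStudents = Split(strStudentsBeforeSplit, ",", -1, vbTextCompare)

    ' Step 2: swap the placeholder back to a comma and trim stray spaces.
    For i = LBound(strStudents) To UBound(strStudents)
        strStudents(i) = Trim$(Replace(strStudents(i), PLACEHOLDER, ","))
    Next i

    SplitStudentNames = strStudents
End Function
```

The same placeholder trick would apply to the decimal-comma case in the question, provided the commas to protect can be identified before splitting.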

Resources