Capturing the submatches after the second instance of a string variable - excel

How would one capture two numeric submatches after the second instance of a string/number?
I have a number that changes from .txt file to .txt file. It is captured in a variable called "Total", which I declared as a String. The string contains a number in a format such as 123,456,789.23 or 123,456.01. This number appears about 3 times within the .txt file, and I have written a RegEx pattern that is able to capture the first instance of this number and its submatches.
regex.Pattern = Total & "\s*([\d+\.\d*])\s*([\d+\.\d*])\s*"
The .txt file portion I am trying to capture may appear as
123,456,789.38
2.180
251.517
OR
123,456,789.38 2.180 251.517
I want to capture 2.180 and 251.517.
The first instance has the words "Number of: " in front of it, and I tried to make the pattern skip that instance by excluding the ":" before it:
regex.Pattern = "[^:\s]" & Total & "\s*([\d+\.\d*])\s*([\d+\.\d*])\s*"
It still picks up this first instance and the numbers after that first instance. The second instance does not have any defining words before it, just a blank line such as the one below:
123,456,789.38
2.180
251.517
Additional information:
Dim regex As Object: Set regex = CreateObject("vbscript.regexp")
regex.Pattern = Total & "\s*([\d+\.\d*])\s*([\d+\.\d*])\s*"
Dim MCS As Object
Set MCS = regex.Execute(myText)
Dim Total As String: Total = MCS(0).submatches(0)
Dim submatch1 As String: submatch1 = MCS(0).Submatches(1)
Dim submatch2 As String: submatch2 = MCS(0).Submatches(2)
where mytext is the contents of the .txt file entirely as a string.
There are also words and numbers between the different instances of the variable "Total", such as
Number of: 123,456,789.38
Text Here Text Here
Number Here
123,456,789.38
2.180
251.517
I am also not sure how much/many text/numbers there will be between the first two instances of 123,456,789.38, so I am trying to think of how to work this to be flexible.
When I mention second instance, I mean that the number 123,456,789.38 (which is the variable named "Total") appears three times in the document. I want to capture the two submatches that appear after that number. However, since there are three times it appears, I want to capture the two submatches that appear after the second time that 123,456,789.38 pops up.
Link to the text file.
https://regex101.com/r/4hPtY3/6
Output:
submatch1 = 2.180
submatch2 = 251.517
Currently, it is capturing
submatch1 = 97
submatch2 = 5
with the pattern:
regex.Pattern = Replace(Total, ".", "\.") & "\s*([.\d]+)\s*([.\d]+)"

You can use
regex.Pattern = "^(?:[\s\S]*?" & Replace(Total, ".", "\.") & "){2}\D+([.\d]+)\s*([.\d]+)"
Here is the regex demo, where you can see that the two groups are captured as expected.
Details
^ - start of string
(?:[\s\S]*?123,456\.78){2} - two occurrences of any 0+ chars as few as possible and then 123,456.78 string
\D+ - 1 or more non-digit chars
([.\d]+) - Group 1: one or more dots or digits
\s* - 0+ whitespaces
([.\d]+) - Group 2: one or more dots or digits.
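To put the pattern in context, here is a minimal sketch of how it might be used. The routine name DemoSecondInstance and the sample text are assumptions made up to mirror the question, not the asker's actual file. Note that this pattern has only two capturing groups, so the values sit in SubMatches(0) and SubMatches(1).

```vba
' Hypothetical demo routine; the sample text imitates the layout in the question.
Sub DemoSecondInstance()
    Dim Total As String
    Total = "123,456,789.38"

    Dim myText As String
    myText = "Number of: 123,456,789.38" & vbCrLf & _
             "Text Here Text Here" & vbCrLf & _
             "Number Here" & vbCrLf & vbCrLf & _
             "123,456,789.38" & vbCrLf & _
             "2.180" & vbCrLf & _
             "251.517"

    Dim regex As Object
    Set regex = CreateObject("VBScript.RegExp")
    ' Escape the dot in Total, then skip ahead to its second occurrence
    regex.Pattern = "^(?:[\s\S]*?" & Replace(Total, ".", "\.") & "){2}\D+([.\d]+)\s*([.\d]+)"

    Dim MCS As Object
    Set MCS = regex.Execute(myText)
    If MCS.Count > 0 Then
        Debug.Print MCS(0).SubMatches(0)   ' 2.180
        Debug.Print MCS(0).SubMatches(1)   ' 251.517
    Else
        Debug.Print "Total does not occur twice in the text"
    End If
End Sub
```

Checking MCS.Count before indexing avoids a run-time error when the number occurs fewer than two times.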

Related

how to output regex group values

I need the group values instead of the matches.
This is how I tried to get those:
Dim item as Variant, matches As Object, match As Object, subMatch As Variant, subMatches(), row As integer
row = 1
For Each item In arr
With regex
.Pattern = "\bopenfile ([^\s]+)|\bopen file ([^\s]+)"
Set matches = regex.Execute(item)
For Each match In matches
For Each subMatch In match.subMatches
subMatches(i) = match.subMatches(i)
ActiveSheet.Range("A" & row).Value = subMatches(i)
row = row + 1
i = i + 1
Next subMatch
Next match
End With
Next item
This is the text from which it should be extracted:
Open File file.M_p3_23432e done
Openfile file.M_p4_6432e done
Open File file.M_p3_857432 done
Open File file.M_p4_34892f done
Openfile file.M_p3_781 done
Info: I'm using Excel VBA, if that is important to know.
Some help would be great :)
You can revamp the regex to match and capture with one capturing group:
\bopen\s?file\s+(\S+)
See the regex demo.
Details:
\b - word boundary
open - a fixed word
\s? - an optional whitespace
file - a fixed word
\s+ - one or more whitespaces
(\S+) - Group 1: one or more non-whitespaces.
Now, the file names are always in SubMatches(0).
Note that the regex must be compiled with the case insensitive option and global (if the string contains multiple matches):
With regex
.Pattern = "\bopen\s?file\s+(\S+)"
.IgnoreCase = True
.Global = True
End With
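Putting it together, the loop from the question could be reduced to something like the sketch below. The routine name ListOpenedFiles and the two-element sample array are assumptions for illustration; the output goes to column A of the active sheet, as in the question.

```vba
' Hypothetical demo; the sample lines are taken from the question.
Sub ListOpenedFiles()
    Dim arr As Variant
    arr = Array("Open File file.M_p3_23432e done", _
                "Openfile file.M_p4_6432e done")

    Dim regex As Object
    Set regex = CreateObject("VBScript.RegExp")
    With regex
        .Pattern = "\bopen\s?file\s+(\S+)"
        .IgnoreCase = True
        .Global = True
    End With

    Dim item As Variant, matches As Object, match As Object
    Dim row As Long
    row = 1
    For Each item In arr
        Set matches = regex.Execute(CStr(item))
        For Each match In matches
            ' One capturing group, so the file name is always SubMatches(0)
            ActiveSheet.Range("A" & row).Value = match.SubMatches(0)
            row = row + 1
        Next match
    Next item
End Sub
```

With a single capturing group there is no need for an inner loop over SubMatches or for a separate index variable.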

How to match accented characters but not tab

I'm trying to match the company name in this string delimited with tabs.
Below table does not have tabs when you copy it, but I have replaced tabs with two spaces, which I assume will work fine for testing.
1025164 HERBEX IBERIA, S.L.U. KY01 4600292091
1016379 DRISCOLL´S OF EUROPE B.V. KY01 4600322589
1008809 LANDGARD NORD OBST & GEMÜSE GM KY01 4600347315
1008835 C.A.S.I. : COOPERATIVA PROVINC KY01 4600348112
1019258 SYDGRÖNT EKONOMISK FÖRENING KY02 4600343422
(The second column of the above, between 7 digit number and KY0 above)
In real life the columns are not always in the same order since it's a user preference.
I just took a few examples but names could also include /éèáà()´, pretty much anything (sadly).
I found another question here Concrete Javascript Regex for Accented Characters (Diacritics)
When I use the regex patterns in that thread, example: "\t([A-zÀ-ÿ0-9\s\.\,\_\-\'\&]+)\t" (I know some characters are still missing) to match between two tabs it becomes greedy and matches the whole line.
Is there any pattern that could match any character in a company name between tabs (or two spaces as the example above)?
Instead of returning a matched part, I matched everything and replaced it with the 1st capture group. Hope it helps.
Sub Test()
Dim str As String: str = "1025164" & vbTab & "HERBEX IBERIA, S.L.U." & vbTab & "KY01" & vbTab & "4600292091"
With CreateObject("vbscript.regexp")
.Global = True
.Pattern = "(?:^|\t)(?:\d+|KY\d+|([^\t]+))(?=\t|$)"
Debug.Print .Replace(str, "$1")
End With
End Sub
Have a look at this online demo to test the pattern:
(?:^|\t) - Match either start line anchor or a tab. Unfortunately the VBA-regex object does not support lookbehinds.
(?: - Open a non-capture group to start matching all parts you don't want to capture first:
\d+ - match 1+ digits;
| - Or:
KY\d+ - Match "KY" followed by 1+ digits;
| - Or:
([^\t]+) - nest a capture group to capture 1+ non-tabs.
) - Close non-capture group.
(?=\t|$) - Positive lookahead to assert captured text is followed by either a tab or end-line anchor.
I would take a different approach using the Split function. The following code assumes that you have tabs as the separator and that the company name is the first column that is not numeric (only digits) and does not start with 'KY'.
Function getCompanyName(line As String) As String
Const separator = vbTab ' Replace with " " if you need that.
Dim tokens() As String, i As Integer
tokens = Split(line, separator)
For i = 0 To UBound(tokens)
If Not IsNumeric(tokens(i)) And Left(tokens(i), 2) <> "KY" Then
getCompanyName = tokens(i)
Exit Function
End If
Next
End Function

Truncating Text To Full Words Based On Character Limit - Excel

I'm working with some data (DataSet#1) which has a text field truncated using some unconventional logic:
If "Service Type Description" is > 60 Characters, Trim the name down to < 60 characters, but only full words
My problem is that I need to format some other data (DataSet#2) in excel to match this logic which is being applied on the back-end on our reporting server (outside my control). No one can seem to find a list of all the potential truncated descriptions either.
Dataset#1 is live and can be re-pulled with updated data at any time, so I need to create a template that allows me to pull in information from the list in DataSet#2 (which currently has the full length descriptions) into any copy of Dataset#1 based on the trimmed Service Type Description in DataSet#1.
Example:
The following is the full product name, and the product name in my DataSet#2:
"FNMA 1025 Small Residential Income Property Appraisal & FNMA 216 Addendum" (73 characters, including spaces)
Simply trimming this text to be <60 characters (59) would yield:
"FNMA 1025 Small Residential Income Property Appraisal & FNM"
However, this same product, in the main data (DataSet#1) is named as follows:
"FNMA 1025 Small Residential Income Property Appraisal & " (56 characters, 8 "words", including &)
The logic on the back-end for DataSet#1 has trimmed the full product name to under 60 characters, but retains only full words (removes the "FNM" partial word).
Ideally I have to be able to take a list that has the full description name - and apply logic in Excel (or VBA) that will yield the same result as the trimmed data from the other dataset - which then allows me to pull information from dataset #2 (full product names) into dataset#1 based on the service type description.
You could use something like so
Function truncate_string(strInput As String, Optional lngChars As Long = 60)
Dim lngCharInstance As Long
lngCharInstance = Len(strInput)
While lngCharInstance > lngChars
lngCharInstance = InStrRev(strInput, " ", _
IIf(lngCharInstance >= Len(strInput), _
Len(strInput), lngCharInstance - 1))
Wend
truncate_string = Mid(strInput, 1, lngCharInstance)
End Function
This would be called like so
truncate_string("FNMA 1025 Small Residential Income Property Appraisal & FNMA 216 Addendum")
and would return as follows
FNMA 1025 Small Residential Income Property Appraisal &
or like so, for 30 chars for example
truncate_string("FNMA 1025 Small Residential Income Property Appraisal & FNMA 216 Addendum",30)
which gives
FNMA 1025 Small Residential
Hope this helps. As there is a loop in there, I'd check for any potential for infinite loops.
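If the loop is a concern, a loop-free variant is possible, since InStrRev accepts a start position and searches backwards from it. The following is only a sketch under that idea; the name TruncateAtWord is made up here, and it assumes maxLen is at least 1.

```vba
' Hypothetical loop-free variant of truncate_string (assumes maxLen >= 1).
Function TruncateAtWord(ByVal s As String, Optional ByVal maxLen As Long = 60) As String
    Dim cut As Long
    If Len(s) <= maxLen Then
        TruncateAtWord = s
    Else
        ' Position of the last space at or before character maxLen (0 if none)
        cut = InStrRev(s, " ", maxLen)
        ' Keep everything up to and including that space; cut = 0 gives ""
        TruncateAtWord = Left$(s, cut)
    End If
End Function
```

For the example product name this returns the same 56 characters, ending in "& " with the trailing space, and with a limit of 30 it returns "FNMA 1025 Small Residential " (28 characters).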
You can use Regular Expressions for this.
Option Explicit
Function trimLength(S As String, Optional Length As Long = 60) As String
Dim RE As Object, MC As Object
Dim sPat As String
sPat = "^.{1," & Length - 1 & "}(?=\s|$)"
If Len(S) > Length Then
Set RE = CreateObject("vbscript.regexp")
With RE
.Pattern = sPat
.MultiLine = True
Set MC = .Execute(S)
trimLength = MC(0)
End With
Else
trimLength = S
End If
End Function
Note that, in accord with your question, we subtract one from the desired length.
Explanation of the Regex
Trim Length to Whole Word
^.{1,59}(?=\s|$)
Options: ^$ match at line breaks
Assert position at the beginning of a line ^
Match any single character that is NOT a line break character .{1,59}
Between one and 59 times, as many times as possible, giving back as needed (greedy) {1,59}
Assert that the regex below can be matched starting at this position (positive lookahead) (?=\s|$)
Match this alternative \s
Match a single character that is a “whitespace character” \s
Or match this alternative $
Assert position at the end of a line $
Created with RegexBuddy

How to count exact text contained in a string [Excel]

I am using the formulas below to count cells that contain an exact text, but they count incorrectly. For example, I would like to count the "ZIKA" test code in the table; the answer should be two. But the formula also counts ZIKA2 as ZIKA. How can I exclude ZIKA2 from the count?
TEST
HS2, CCAL, EGFR, AFB
ZIKA, AG21
PPB, ZIKA2
ZIKA, AG21
I already tried these formulas:
=SUMPRODUCT(--(ISNUMBER(FIND("ZIKA",F:F))))
and also
=COUNTIF(F:F,"ZIKA")
You could count the exact ZIKA and its comma-separated variations:
=COUNTIF(F:F,"ZIKA")+COUNTIF(F:F,"ZIKA,*")+COUNTIF(F:F,"*, ZIKA")+COUNTIF(F:F,"*, ZIKA,*")
I assume your data follow this format
xxx, yyy, zzz
space after comma
You may need to split your formula into 3 parts:
=COUNTIF(F:F,"ZIKA,*")+COUNTIF(F:F,"*, ZIKA")+COUNTIF(F:F,"ZIKA")
The first part counts the cells that start with ZIKA, the second part counts those that end with ZIKA, and the last part counts those that contain only ZIKA.
Try this regex; it may need a helper column. I have not tested it much yet.
Press ALT + F11 to open VBA editor.
Click Insert -> module and copy paste the code below.
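Function Regex(Cell, Search)
' Returns a comma-separated list of the whole-word matches of Search found in Cell.
Dim RE As Object, Matches As Object, res As Object
Set RE = CreateObject("vbscript.regexp")
RE.Pattern = "(\b" & Search & "\b)"
RE.Global = True
RE.IgnoreCase = True
Set Matches = RE.Execute(Cell)
For Each res In Matches
Regex = Regex & "," & res.Value
Next res
Regex = Mid(Regex, 2) ' drop the leading comma
End Function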
It will return "ZIKA" if it finds ZIKA in the cell you run it on.
And then you just count the ZIKAs in the helper column.
Updated with a new code that you can change the search in.
Use it with =regex(A1, "ZIKA")
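If the helper column is unwanted, the same word-boundary idea can be wrapped in a function that counts matching cells directly. This is only a sketch: the name CountWholeWord is made up, and it assumes Search contains no regex metacharacters.

```vba
' Hypothetical UDF: counts the cells in rng that contain Search as a whole word.
' Assumes Search has no regex metacharacters (e.g. "ZIKA").
Function CountWholeWord(ByVal rng As Range, ByVal Search As String) As Long
    Dim RE As Object, area As Range, c As Range
    Set RE = CreateObject("VBScript.RegExp")
    RE.Pattern = "\b" & Search & "\b"
    RE.IgnoreCase = True

    ' Limit whole-column references such as F:F to the used part of the sheet
    Set area = Intersect(rng, rng.Worksheet.UsedRange)
    If area Is Nothing Then Exit Function

    For Each c In area.Cells
        If Not IsError(c.Value) Then
            If RE.Test(CStr(c.Value)) Then CountWholeWord = CountWholeWord + 1
        End If
    Next c
End Function
```

Called as =CountWholeWord(F:F, "ZIKA"), it should give 2 for the sample table, because there is no word boundary between "ZIKA" and the "2" in ZIKA2.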

Split, escaping certain splits

I have a cell that contains multiple questions and answers and is organised like a CSV. So to get all these questions and answers separated a simple split using the comma as the delimiter should separate this easily.
Unfortunately, there are some values that use the comma as the decimal separator. Is there a way to escape the split for those occurrences?
Fortunately, my data can be split using ", " as separator, but if this wouldn't be the case, would there still be a solution besides manually replacing the decimal delimiter from a comma to a dot?
Example:
"Price: 0,09,Quantity: 12,Sold: Yes"
Using Split("Price: 0,09,Quantity: 12,Sold: Yes",",") would yield:
Price: 0
09
Quantity: 12
Sold: Yes
One possibility, given this test data, is to loop through the array after splitting, and whenever there's no : in the string, add this entry to the previous one.
The function that does this might look like this:
Public Function CleanUpSeparator(celldata As String) As String()
Dim ret() As String
Dim tmp() As String
Dim i As Integer, j As Integer
tmp = Split(celldata, ",")
For i = 0 To UBound(tmp)
If i > 0 And InStr(1, tmp(i), ":") < 1 Then
' Put this value on the previous line, and restore the comma
tmp(i - 1) = tmp(i - 1) & "," & tmp(i)
tmp(i) = ""
End If
Next i
j = 0
ReDim ret(j)
For i = 0 To UBound(tmp)
If tmp(i) <> "" Then
ret(j) = tmp(i)
j = j + 1
ReDim Preserve ret(j)
End If
Next i
ReDim Preserve ret(j - 1)
CleanUpSeparator = ret
End Function
Note that there's room for improvement by making the separator characters : and , into parameters, for instance.
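Another option, sketched below, is to protect the decimal commas before splitting and restore them afterwards. The function name SplitKeepDecimalCommas is made up, and the sketch assumes that a comma directly between two digits is always a decimal separator and that the placeholder character "~" never occurs in the data.

```vba
' Hypothetical helper: splits on commas but keeps commas between two digits.
Function SplitKeepDecimalCommas(ByVal celldata As String) As String()
    Const PLACEHOLDER As String = "~"   ' assumed not to occur in the data
    Dim RE As Object, parts() As String, i As Long

    Set RE = CreateObject("VBScript.RegExp")
    RE.Global = True
    ' A comma preceded by a digit and followed by a digit
    RE.Pattern = "(\d),(?=\d)"

    ' Hide the decimal commas, split on the remaining ones, then restore them
    parts = Split(RE.Replace(celldata, "$1" & PLACEHOLDER), ",")
    For i = LBound(parts) To UBound(parts)
        parts(i) = Replace(parts(i), PLACEHOLDER, ",")
    Next i
    SplitKeepDecimalCommas = parts
End Function
```

For "Price: 0,09,Quantity: 12,Sold: Yes" this yields the three entries "Price: 0,09", "Quantity: 12" and "Sold: Yes". It would misfire if a field could legitimately begin with a digit right after a value that ends in a digit.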
I spent the last 24 hours or so puzzling over what I THINK is a completely analogous problem, so I'll share my solution here. Forgive me if I'm wrong about the applicability of my solution to this question. :-)
My Problem: I have a SharePoint list in which teachers (I'm an elementary school technology specialist) enter end-of-year award certificates for me to print. Teachers can enter multiple students' names for a given award, separating each name using a comma. I have a VBA macro in Access that turns each name into a separate record for mail merging. Okay, I lied. That was more of a story. HERE'S the problem: How can teachers add a student name like Hank Williams, Jr. (note the comma) without having the comma cause "Jr." to be interpreted as a separate student in my macro?
The full contents of the (SharePoint exported to Excel) field "Students" are stored within the macro in a variable called strStudentsBeforeSplit, and this string is eventually split with this statement:
strStudents = Split(strStudentsBeforeSplit, ",", -1, vbTextCompare)
So there's the problem, really. The Split function is using a comma as a separator, but poor student Hank Williams, Jr. has a comma in his name. What to do?
I spent a long time trying to figure out how to escape the comma. If this is possible, I never figured it out.
Lots of forum posts suggested using a different character as the separator. That's okay, I guess, but here's the solution I came up with:
Replace only the special commas preceding "Jr" with a different, uncommon character BEFORE the Split function runs.
Swap back to the commas after Split runs.
That's really the end of my post, but here are the lines from my macro that accomplish step 1. This may or may not be of interest because it really just deals with the minutiae of making the swap. Note that the code handles several different (mostly wrong) ways my teachers might type the "Jr" part of the name.
'Dealing with the comma before Jr. This will handle ", Jr." and ", Jr" and " Jr." and " Jr".
'Replaces the comma with ~ because commas are used to separate fields in Split function below.
'Will swap ~ back to comma later in UpdateQ_Comma_for_Jr query.
strStudentsBeforeSplit = Replace(strStudentsBeforeSplit, "Jr", "~ Jr.") 'Every Jr gets this treatment regardless of what else is around it.
'Note that because of previous Replace functions a few lines prior, the space between the comma and Jr will have been removed. This adds it back.
strStudentsBeforeSplit = Replace(strStudentsBeforeSplit, ",~ Jr", "~ Jr") 'If teacher had added a comma, strip it.
strStudentsBeforeSplit = Replace(strStudentsBeforeSplit, " ~ Jr", "~ Jr") 'In cases when teacher added Jr but no comma, remove the (now extra)...
'...space that was before Jr.
