How to remove line feed? - excel

This script for Outlook returns the desired date but doesn't remove the line feed.
String I want to get the date from:
1_c Anruf/Meldung am (Datum): 04.Mai.2020
With Reg1
    .Pattern = "1_c Anruf\/Meldung am \(Datum\)\s*[:]+\s*(.*)\s+\n"
    .Global = False
End With
If Reg1.Test(olMail.Body) Then
    Set M1 = Reg1.Execute(olMail.Body)
End If
For Each M In M1
    Debug.Print M.SubMatches(0)
    With xExcelApp
        .Range("A5").Value = M.SubMatches(0)
    End With
Next M
regex101 selects the date correctly, but the debugger always shows something like "02.12.2020 ". <- the trailing character is not a space but a line feed.
The line feed is also visible in Excel. Trailing whitespace wouldn't be a problem since I can use TRIM, but TRIM does not remove the line feed.

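Your regex captures the CR character: in VBScript regular expressions, . matches any character except a line feed, so a carriage return can end up in the capture group. Replace . with [^\r\n] to avoid this behavior.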
It seems you want to use
1_c Anruf/Meldung am \(Datum\)\s*:+\s*([^\r\n]*\S)
See the regex demo. Note the forward slash does not have to be escaped in the VBA code. Details:
1_c Anruf/Meldung am \(Datum\) - a fixed 1_c Anruf/Meldung am (Datum) string
\s*:+\s* - one or more colons enclosed with zero or more whitespaces
([^\r\n]*\S) - Capturing group 1 (accessed with M.SubMatches(0)) that captures zero or more occurrences of any char other than CR and LF chars and then any non-whitespace char.
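A minimal sketch of how this pattern might be wrapped in a reusable helper, assuming late binding via CreateObject. The function name ExtractCallDate is hypothetical and not part of the original script:

```vba
' Hypothetical helper: returns the text after "1_c Anruf/Meldung am (Datum):"
' without CR/LF or trailing whitespace, or "" when the label is not found.
Function ExtractCallDate(ByVal body As String) As String
    Dim re As Object, ms As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = False
    ' [^\r\n]* stays on the current line; the final \S drops trailing blanks.
    re.Pattern = "1_c Anruf/Meldung am \(Datum\)\s*:+\s*([^\r\n]*\S)"
    Set ms = re.Execute(body)
    If ms.Count > 0 Then ExtractCallDate = ms(0).SubMatches(0)
End Function
```

In the Outlook script this could be called as ExtractCallDate(olMail.Body) and the result written to the target cell.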

Related

how to output regex group values

I need the group values instead of the matches.
This is how i tried to get those:
Dim item As Variant, matches As Object, match As Object, subMatch As Variant, subMatches(), row As Integer
row = 1
For Each item In arr
    With regex
        .Pattern = "\bopenfile ([^\s]+)|\bopen file ([^\s]+)"
        Set matches = regex.Execute(item)
        For Each match In matches
            For Each subMatch In match.subMatches
                subMatches(i) = match.subMatches(i)
                ActiveSheet.Range("A" & row).Value = subMatches(i)
                row = row + 1
                i = i + 1
            Next subMatch
        Next match
    End With
Next item
This is the text from where it should be extracted:
Open File file.M_p3_23432e done
Openfile file.M_p4_6432e done
Open File file.M_p3_857432 done
Open File file.M_p4_34892f done
Openfile file.M_p3_781 done
Info: I'm using Excel VBA, if that is important to know.
Some help would be great :)
You can revamp the regex to match and capture with one capturing group:
\bopen\s?file\s+(\S+)
See the regex demo.
Details:
\b - word boundary
open - a fixed word
\s? - an optional whitespace
file - a fixed word
\s+ - one or more whitespaces
(\S+) - Group 1: one or more non-whitespaces.
Now, the file names are always in SubMatches(0).
Note that the regex must have the IgnoreCase option enabled, and the Global option as well if the string contains multiple matches:
With regex
.Pattern = "\bopen\s?file\s+(\S+)"
.IgnoreCase = True
.Global = True
End With
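Putting the pieces together, a minimal sketch of the extraction loop might look like this. The function name ExtractFileNames and the use of a Collection are assumptions for illustration, not part of the original code:

```vba
' Hypothetical helper: collects the file name (Group 1) of every
' "open file" / "openfile" occurrence in the given text.
Function ExtractFileNames(ByVal source As String) As Collection
    Dim re As Object, m As Object
    Dim result As New Collection
    Set re = CreateObject("VBScript.RegExp")
    With re
        .Pattern = "\bopen\s?file\s+(\S+)"
        .IgnoreCase = True   ' matches "Open File" as well as "Openfile"
        .Global = True       ' find every occurrence, not only the first
    End With
    For Each m In re.Execute(source)
        result.Add m.SubMatches(0)   ' the file name is always in SubMatches(0)
    Next m
    Set ExtractFileNames = result
End Function
```

Each item of the returned collection could then be written to column A, one row per file name.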

Replace non-printable characters with " (Inch sign) VBA Excel

I need to replace non-printable characters with " (Inch sign).
I tried Excel's CLEAN function and other UDFs, but they only remove the characters rather than replace them.
Note: non-printable characters are highlighted in blue in the photo above, and their positions within the cells are random.
This is a sample string file: Link
The expected correct output should be 12"x14" LPG . OUTLET OCT-SEP# process
Thanks in advance for any useful comments and answers.
As per my comment, you can try:
=SUBSTITUTE(A1,CHAR(25)&CHAR(25),CHAR(34))
Or in VBA:
[A1].Value = Replace([A1].Value, Chr(25) & Chr(25), Chr(34))
Where [A1] is the obvious placeholder for the range-object you would want to use with proper and absolute referencing.
With the newest Microsoft 365 functions, we could also use:
=TEXTJOIN(CHAR(34),,TEXTSPLIT(A1,CHAR(25)))
You can use Regular Expressions within a UDF to create a flexible method to replace "bad" characters, when you don't know exactly what they are.
In the UDF below, I show two pattern options, but others are possible.
The first replaces all characters with a character code >127;
the second replaces all characters with a character code >255.
Option Explicit
Function ReplaceBadChars(str As String, replWith As String) As String
    Dim RE As Object
    Set RE = CreateObject("Vbscript.Regexp")
    With RE
        .Pattern = "[\u0080-\uFFFF]" 'to replace all characters with code >127 or
        '.Pattern = "[\u0100-\uFFFF]" 'to replace all characters with code >255
        .Global = True
        ReplaceBadChars = .Replace(str, replWith)
    End With
End Function
On the worksheet you can use, for example:
=ReplaceBadChars(A1,"""")
Or you could use it in a macro if you wanted to process a column of data without adding an extra column.
Note: I am uncertain whether there might be an efficiency difference when using a smaller negated character class (e.g. [^\x00-\x7F]) instead of the character class I showed in the code. But if, as written, execution seems slow, I'd try this change.
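For the macro variant mentioned above, a rough sketch could look like the following. It assumes the unwanted characters all have codes above 127; the procedure name ReplaceBadCharsInRange is hypothetical:

```vba
' Hypothetical macro helper: replaces every character with a code > 127
' in each constant text cell of rng, writing the result back in place.
Sub ReplaceBadCharsInRange(ByVal rng As Range, ByVal replWith As String)
    Dim re As Object, cell As Range
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "[^\x00-\x7F]"   ' anything outside the 7-bit ASCII range
    re.Global = True
    For Each cell In rng.Cells
        ' Skip formulas, numbers, errors and empty cells.
        If Not cell.HasFormula Then
            If VarType(cell.Value) = vbString Then
                cell.Value = re.Replace(cell.Value, replWith)
            End If
        End If
    Next cell
End Sub
```

A call such as ReplaceBadCharsInRange ActiveSheet.Range("A1:A100"), Chr(34) would process a column without adding an extra one. Note that a $ in replWith has a special meaning in the replacement string.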
You can try this :
Cells.Replace What:="[The character to replace]", Replacement:=""""

How to match accented characters but not tab

I'm trying to match the company name in this string delimited with tabs.
Below table does not have tabs when you copy it, but I have replaced tabs with two spaces, which I assume will work fine for testing.
1025164 HERBEX IBERIA, S.L.U. KY01 4600292091
1016379 DRISCOLL´S OF EUROPE B.V. KY01 4600322589
1008809 LANDGARD NORD OBST & GEMÜSE GM KY01 4600347315
1008835 C.A.S.I. : COOPERATIVA PROVINC KY01 4600348112
1019258 SYDGRÖNT EKONOMISK FÖRENING KY02 4600343422
(The second column of the above, between 7 digit number and KY0 above)
In real life the columns are not always in the same order since it's a user preference.
I just took a few examples but names could also include /éèáà()´, pretty much anything (sadly).
I found another question here Concrete Javascript Regex for Accented Characters (Diacritics)
When I use the regex patterns in that thread, example: "\t([A-zÀ-ÿ0-9\s\.\,\_\-\'\&]+)\t" (I know some characters are still missing) to match between two tabs it becomes greedy and matches the whole line.
Is there any pattern that could match any character in a company name between tabs (or two spaces as the example above)?
Instead of returning a matched part, I matched everything and replaced it with the 1st capture group. Hope it helps.
Sub Test()
    Dim str As String: str = "1025164" & vbTab & "HERBEX IBERIA, S.L.U." & vbTab & "KY01" & vbTab & "4600292091"
    With CreateObject("vbscript.regexp")
        .Global = True
        .Pattern = "(?:^|\t)(?:\d+|KY\d+|([^\t]+))(?=\t|$)"
        Debug.Print .Replace(str, "$1")
    End With
End Sub
Have a look at this online demo to test the pattern:
(?:^|\t) - Match either start line anchor or a tab. Unfortunately the VBA-regex object does not support lookbehinds.
(?: - Open a non-capture group to start matching all parts you don't want to capture first:
\d+ - match 1+ digits;
| - Or:
KY\d+ - Match "KY" followed by 1+ digits;
| - Or:
([^\t]+) - nest a capture group to capture 1+ non-tabs.
) - Close non-capture group.
(?=\t|$) - Positive lookahead to assert captured text is followed by either a tab or end-line anchor.
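Because the pattern classifies each tab-delimited field on its own, it should not depend on column order. A small sketch that wraps it in a function (the name CompanyNameFromLine is hypothetical) makes this easy to check with reordered columns:

```vba
' Hypothetical wrapper: blanks out all-digit and KY-code fields and keeps
' the remaining field, which is assumed to be the company name.
Function CompanyNameFromLine(ByVal tabLine As String) As String
    With CreateObject("VBScript.RegExp")
        .Global = True
        .Pattern = "(?:^|\t)(?:\d+|KY\d+|([^\t]+))(?=\t|$)"
        ' Every field is matched; only the captured one survives as $1.
        CompanyNameFromLine = .Replace(tabLine, "$1")
    End With
End Function
```

If a line contained more than one non-numeric, non-KY field, the captured fields would be concatenated, so this sketch assumes exactly one such field per line.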
I would take a different approach using the Split function. The following code assumes that you have tabs as separators and that the company name is the first column that is not numeric and does not start with 'KY'.
Function getCompanyName(line As String) As String
    Const separator = vbTab ' Replace with two spaces if your data uses those instead.
    Dim tokens() As String, i As Integer
    tokens = Split(line, separator)
    For i = 0 To UBound(tokens)
        ' Skip numeric columns and columns that start with "KY".
        If Not IsNumeric(tokens(i)) And Left(tokens(i), 2) <> "KY" Then
            getCompanyName = tokens(i)
            Exit Function
        End If
    Next
End Function

What's the best way to keep regex matches in Excel?

I'm working off of the excellent information provided in "How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops", however I'm running into a wall trying to keep the matched expression, rather than the un-matched portion:
"2022-02-14T13:30:00.000Z" converts to "T13:30:00.000Z" instead of "2022-02-14" when the function is used in a spreadsheet. Listed below is the code, which was taken from "How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops". I thought a negation of strPattern2 would work, but I'm still having issues. Any help is greatly appreciated.
Function simpleCellRegex(Myrange As Range) As String
    Dim regEx As New RegExp
    Dim strPattern As String
    Dim strPattern2 As String
    Dim strInput As String
    Dim strReplace As String
    Dim strOutput As String
    strPattern = "^T{0-9][0-9][:]{0-9][0-9][:]{0-9][0-9][0-9][Z]"
    strPattern2 = "^(19|20)\d\d([- /.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])"
    If strPattern2 <> "" Then
        strInput = Myrange.Value
        strReplace = ""
        With regEx
            .Global = True
            .MultiLine = True
            .IgnoreCase = False
            .Pattern = strPattern2
        End With
        If regEx.test(strInput) Then
            simpleCellRegex = regEx.Replace(strInput, strReplace)
        Else
            simpleCellRegex = "Not matched"
        End If
    End If
End Function
Replace is very powerful, but you need to do two things:
Specify all the characters you want to drop, if your regexp is <myregexp>, then change it to ^.*?(<myregexp>).*$ assuming you only have one date occurrence in your string. The parentheses are called a 'capturing group' and you can refer to them later as part of your replacement pattern. The ^ at the beginning and the $ at the end ensure that you will only match one occurrence of your pattern even if Global=True. I noticed you were already using a capturing group as a back-reference - you need to add one to the back-reference number because we added a capturing group. Setting up the pattern this way, the entire string will participate in the match and we will use the capturing groups to preserve what we want to keep.
Change your strReplace="" to strReplace="$1", indicating you want to replace whatever was matched with the contents of capturing group #1.
Here is a screenprint from Excel using my RegexpReplace User Defined Function to process your example with my suggestions:
I had to fix up your time portion regexp because you used curly brackets three times where you meant square, and you left out the seconds part completely. Notice by adjusting where you start and end your capturing group parentheses you can keep or drop the T & Z at either end of the time string.
Also, if your program is being passed system timestamps from a reliable source then they are already well-formed and you don't need those long, long regular expressions to reject March 32. You can code both parts in one as
([-0-9/.]{10,10})T([0-9:.]{12,12})Z and when you want the date part use $1 and when you want the time part use $2.
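As a rough sketch of both suggestions combined, the function below anchors the short pattern with ^.*? and .*$ so the whole string takes part in the match, then keeps only the requested group. The name KeepTimestampPart and the late-bound CreateObject call are assumptions for illustration:

```vba
' Hypothetical UDF: groupNo = 1 keeps the date part, groupNo = 2 the time part.
' Assumes well-formed timestamps such as 2022-02-14T13:30:00.000Z.
Function KeepTimestampPart(ByVal s As String, ByVal groupNo As Long) As String
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = False
    ' ^.*? and .*$ consume everything around the timestamp so that
    ' replacing the match with $1 or $2 leaves only the captured part.
    re.Pattern = "^.*?([-0-9/.]{10})T([0-9:.]{12})Z.*$"
    If re.Test(s) Then
        KeepTimestampPart = re.Replace(s, "$" & groupNo)
    Else
        KeepTimestampPart = "Not matched"
    End If
End Function
```

On a worksheet this could be used as =KeepTimestampPart(A1,1) for the date and =KeepTimestampPart(A1,2) for the time.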

Capturing the submatches after the second instance of a string variable

How would one capture two numeric submatches after the second instance of a string/number?
I have a number that changes from .txt file to .txt file. It is captured in a variable called "Total", which I declared as a string. The string contains a number in a format such as 123,456,789.23 or 123,456.01. This number appears about 3 times within the .txt file, and I have written a RegEx pattern that is able to capture the first instance of this number and its submatches.
regex.Pattern = Total & "\s*([\d+\.\d*])\s*([\d+\.\d*])\s*"
The .txt file portion I am trying to capture may appear as
123,456,789.38
2.180
251.517
OR
123,456,789.38 2.180 251.517
I want to capture 2.180 and 251.517.
The first instance has the words "Number of: " in front of it, and I tried to make the pattern skip that instance by excluding a preceding ":":
regex.Pattern = "[^:\s]" & Total & "\s*([\d+\.\d*])\s*([\d+\.\d*])\s*"
It still picks up this first instance and the numbers after that first instance. The second instance does not have any defining words before it, just a blank line such as the one below:
123,456,789.38
2.180
251.517
Additional information:
Dim regex As Object: Set regex = CreateObject("vbscript.regexp")
regex.Pattern = Total & "\s*([\d+\.\d*])\s*([\d+\.\d*])\s*"
Dim MCS As Object
Set MCS = regex.Execute(myText)
Dim Total As String: Total = MCS(0).submatches(0)
Dim submatch1 As String: submatch1 = MCS(0).Submatches(1)
Dim submatch2 As String: submatch2 = MCS(0).Submatches(2)
where mytext is the contents of the .txt file entirely as a string.
There are also words and numbers between the different instances of the variable "Total", such as
Number of: 123,456,789.38
Text Here Text Here
Number Here
123,456,789.38
2.180
251.517
I am also not sure how much/many text/numbers there will be between the first two instances of 123,456,789.38, so I am trying to think of how to work this to be flexible.
When I mention second instance, I mean that the number 123,456,789.38 (which is the variable named "Total") appears three times in the document. I want to capture the two submatches that appear after that number. However, since there are three times it appears, I want to capture the two submatches that appear after the second time that 123,456,789.38 pops up.
Link to the text file.
https://regex101.com/r/4hPtY3/6
Output:
submatch1 = 2.180
submatch2 = 251.517
Currently, it is capturing
submatch1 = 97
submatch2 = 5
with the pattern:
regex.Pattern = Replace(Total, ".", "\.") & "\s*([.\d]+)\s*([.\d]+)"
You can use
regex.Pattern = "^(?:[\s\S]*?" & Replace(Total, ".", "\.") & "){2}\D+([.\d]+)\s*([.\d]+)"
See the regex demo, where the two groups are captured as expected.
Details
^ - start of string
(?:[\s\S]*?123,456\.78){2} - two occurrences of any 0+ chars as few as possible and then 123,456.78 string
\D+ - 1 or more non-digit chars
([.\d]+) - Group 1: one or more dots or digits
\s* - 0+ whitespaces
([.\d]+) - Group 2: one or more dots or digits.
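A minimal sketch of how this pattern might be used, assuming the whole file content is already in a string. The function name ValuesAfterSecondTotal and its parameter names are hypothetical:

```vba
' Hypothetical helper: returns the two numbers that follow the second
' occurrence of total in source, or an empty array when there is no match.
Function ValuesAfterSecondTotal(ByVal source As String, ByVal total As String) As Variant
    Dim re As Object, ms As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = False
    ' Escape the decimal point so it is matched literally, then skip
    ' lazily to the second occurrence of the total from the string start.
    re.Pattern = "^(?:[\s\S]*?" & Replace(total, ".", "\.") & "){2}\D+([.\d]+)\s*([.\d]+)"
    Set ms = re.Execute(source)
    If ms.Count > 0 Then
        ValuesAfterSecondTotal = Array(ms(0).SubMatches(0), ms(0).SubMatches(1))
    Else
        ValuesAfterSecondTotal = Array()
    End If
End Function
```

Only the dot is escaped here, which is enough for totals made of digits, commas and a decimal point; other regex metacharacters in total would need escaping too.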
