What do you understand by this RegEx? - excel

I'm working with VBA and trying to split a string into three columns. Almost all strings look like Company Name 3567782 Agent Name.pdf
With this pattern I want to match all the text before a space and digits (1st group), the digits (2nd group) and all the text after the space and before the .pdf (3rd group).
strPattern = "^(.+)\n(\d{4,10})\n(.+).pdf"
I recall that whitespace in Python is \s, but I saw that in VBA it is \n.
Can you help me find the right pattern for what I'm looking for?

As I put in my comment, I use the https://regex101.com site. There are others but I find this one the most helpful to me.
When I put in your regex
^(.+)\n(\d{4,10})\n(.+).pdf
and test string
Company Name 3567782 Agent Name.pdf
the first thing I notice is that the regex does not match the test string (see right side under MATCH INFORMATION).
Here are a couple things that I saw:
\n is newline, not space. In regex, space is " ".
Your last "." in ".pdf" is not treated as a literal period; it is a token that matches any character. To match a literal period, you need \.
If we change those two things it returns three groups that seem to match what you are looking for.
^(.+) (\d{4,10}) (.+)\.pdf
It looks like for the digits, you are looking for between 4 and 10 digits. If that's correct, it looks like your regex is good. You could put in a handful of example strings into the TEST STRING area and make sure that it works in all cases.
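To check the corrected pattern against several strings at once, a minimal sketch in Python can help. This assumes Python's re module rather than the VBA regex engine, so treat it as an illustration of the pattern only; the function name split_filename and the second sample string are made up for this example.

```python
import re

# Corrected pattern from the answer: literal spaces and an escaped dot.
# Python's re module is used for illustration only; it is not the VBA engine.
PATTERN = re.compile(r"^(.+) (\d{4,10}) (.+)\.pdf")


def split_filename(name):
    """Return (company, number, agent), or None when the name does not match."""
    match = PATTERN.match(name)
    return match.groups() if match else None


if __name__ == "__main__":
    # Hypothetical file names for illustration.
    for sample in ["Company Name 3567782 Agent Name.pdf", "No Digits Here.pdf"]:
        print(sample, "->", split_filename(sample))
```

Names without a run of 4 to 10 digits between two spaces return None, which is one way to spot the strings that do not follow the usual layout.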

I'd use either of these:
(?:(?:([a-zA-Z]+\.?)|(\d+)))
This captures runs of letters (a-z, A-Z) greedily, with an optional trailing . to allow for the .pdf, or captures runs of digits.
This version excludes the space ([ ] or \s).
or keep the search structured so you can control what goes in and out of each column
^(\w+\s\w+)|(\d+)|(\w+\s\w+\.\w+$)
\b or ^ - word boundary or start of string
(\w+\s\w+) - 1st capture \w+ - any alpha numeric char greedily, followed by 1 x space (use \s* or \s+ for more), followed again by alpha numeric greedily
|(\d+) - alternation - \d+ - capture just digits
|(\w+\s\w+\.\w+$) - similar to the 1st group, but allows for the '.' of .pdf and anchors to the end of the string ($).
You could optionally build the '.' into the 1st group like my top answer, but for neatness and better control I prefer the 2nd.
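As a rough illustration of how the structured alternation behaves, here is a sketch in Python (an assumption; the answer targets VBA, whose regex engine may differ in details). Each alternative has exactly one capture group, so the helper, a hypothetical name, picks whichever group matched.

```python
import re

# The structured alternation from the answer, unchanged.
ALTERNATION = re.compile(r"^(\w+\s\w+)|(\d+)|(\w+\s\w+\.\w+$)")


def split_by_alternation(text):
    """Collect the text captured by whichever alternative matched, in order."""
    # lastindex is the number of the single group that took part in the match.
    return [m.group(m.lastindex) for m in ALTERNATION.finditer(text)]


if __name__ == "__main__":
    print(split_by_alternation("Company Name 3567782 Agent Name.pdf"))
```

Note that the third piece still includes the .pdf extension, and that \w+\s\w+ only covers names made of exactly two words.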

Related

Mark words in notepad++ including dash (-)

I would like to mark in Notepad++ the sql scripts in a text log. The sql files have this format in the text:
AAAAAAAA.BBBBBBBBBBB.sql
So what I execute is this sentence in search menu:
\w*.sql
I expect to get BBBBBBBBBBB.sql. The problem is that some script names contain dashes (-), and when that happens I don't get the whole name, just the part after the last dash.
For example, in:
AAAAAAAA.BBBBB-CCCCCCC.sql
I would like to get BBBBB-CCCCCCC.sql, but I just get CCCCCCC.sql
Is there any possible formula to get them?
If the match cannot start or end with a hyphen:
\w+(?:-\w+)*\.sql
\w+ Match 1+ word characters
(?:-\w+)* Optionally match - and 1+ word characters
\.sql Match .sql
See a regex demo.
Note that in your pattern the \w* can also match 0 occurrences and that the . can match any character if it is not escaped.
Another option is to use a character class that matches either - or a word character, but this would also allow hyphens anywhere, as in --a--.sql
[\w-]+\.sql
See another regex demo.
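The difference between the two patterns can be sketched in Python. This assumes Python's re module, not Notepad++'s own regex engine, although both patterns use only constructs the two share; the constant names are made up for this example.

```python
import re

# Pattern that cannot start or end with a hyphen.
STRICT = re.compile(r"\w+(?:-\w+)*\.sql")
# Character-class variant that accepts hyphens anywhere.
LOOSE = re.compile(r"[\w-]+\.sql")

if __name__ == "__main__":
    for name in ["AAAAAAAA.BBBBBBBBBBB.sql", "AAAAAAAA.BBBBB-CCCCCCC.sql", "--a--.sql"]:
        print(name, STRICT.findall(name), LOOSE.findall(name))
```

Both patterns return BBBBB-CCCCCCC.sql for the dashed name; only the character-class variant also accepts --a--.sql.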

Excel Remove only last characters if they match

I've been trying a few different ways to search and replace in Excel to remove the last couple of characters.
For instance in one column I have product name S
I want to remove the " S" only.
I have tried some IF formulas as well and have not had much luck. I'm assuming there is a simple regex that can be used for the search and replace, e.g. " S/", that would only replace if those are the last characters with nothing after them.
Try using the SUBSTITUTE function to replace the letters you want to remove with a unique character, word, or space that does not appear anywhere else in the workbook, depending on which part of the string you're trying to remove and what format you're trying to keep.
Then use Find and Replace (Ctrl+F) to replace that word with a blank (space) character.
see how to use SUBSTITUTE function here:
https://exceljet.net/excel-functions/excel-substitute-function
Since you are only interested in the end of the string, I don't think you need regex or anything too sophisticated.
If I understand correctly, you want the original string (product name S) up to but not including something that appears at the end ( S). This means that in your example, you want the 12 leftmost characters: the length of the original string (14) minus the length of the pattern (2), which gives you product name. If the original string does not end with the pattern, you want the original string.
Therefore, I suggest the following:
=IF(RIGHT("original string",LEN("pattern"))="pattern",
LEFT("original string",LEN("original string")-LEN("pattern")),
"original string")
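As a hedged illustration (not the answer's original examples), the logic of the formula can be sketched in Python. The function name strip_suffix is made up for this sketch, and the comparison here is case-sensitive, whereas Excel's = comparison of text ignores case.

```python
def strip_suffix(text, suffix):
    """Drop suffix from the end of text if present; otherwise return text unchanged."""
    # Mirrors IF(RIGHT(text, LEN(suffix)) = suffix, LEFT(text, LEN(text) - LEN(suffix)), text).
    if suffix and text.endswith(suffix):
        return text[: len(text) - len(suffix)]
    return text


if __name__ == "__main__":
    for sample in ["product name S", "product names", "S product"]:
        print(repr(sample), "->", repr(strip_suffix(sample, " S")))
```

Only the first sample ends with " S", so only that one is shortened, to "product name".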

VIM line count in status bar with thousands separator?

Is it possible to display the line count in the VIM status bar with thousands separators, preferably custom thousands separators?
Example:
set statusline=%L
should lead to "1,234,567" instead of "1234567".
I've found a way but it looks a bit crazy:
set statusline=%{substitute(line('$')\,'\\d\\zs\\ze\\%(\\d\\d\\d\\)\\+$'\,'\,'\,'g')}
The first round of backslashes is just for set (I have to escape , and \ itself).
What I'm actually setting the option to is this string:
%{substitute(line('$'),'\d\zs\ze\%(\d\d\d\)\+$',',','g')}
As a format string, this line contains one formatting code, which is %{...}. Everything in ... is evaluated as an expression and the result substituted back in.
The expression I'm evaluating is (spaces added (if I had added them to the real code, I would've had to escape them for set again, forcing yet more backslashes)):
substitute(line('$'), '\d\zs\ze\%(\d\d\d\)\+$', ',', 'g')
This is a call to the substitute function. The arguments are the source string, the regex, the replacement string, and a list of flags.
The string we're starting with is line('$'). This call returns the number of lines in the current buffer (or rather the number of the last line in the buffer). This is what %L normally shows.
The search pattern we're looking for is \d(\d\d\d)+$ (special vim craziness removed), i.e. a digit followed by 1 or more groups of 3 digits, followed by the end of the string. Grouping is spelled \%( \) in vim, and "1 or more" is \+, which gives us \d\%(\d\d\d\)\+$. The last bit of magic is \zs\ze. \zs sets the start of the matched string; \ze sets the end. This works as if everything before \zs were a look-behind pattern and everything after \ze were a look-ahead pattern.
What this amounts to is: We're looking for every position in the source string that is preceded by a digit and followed by exactly N digits (where N is a multiple of 3). This works like starting at the right and going left, skipping 3 digits each time. These are the positions where we need to insert a comma.
That's what the replacement string is: ',' (a comma). Because we're matching a string of length 0, we're effectively inserting into the source string (by replacing '' with ',').
Finally, the g flag says to do this with all matches, not just the first one.
TL;DR:
line('$') gives us the number of lines
substitute(..., '\d\zs\ze\%(\d\d\d\)\+$', ',', 'g') adds commas where we want them
%{ } lets us embed arbitrary expressions into statusline
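The same zero-width idea can be sketched outside Vim. The following assumes Python's re module, where the \zs/\ze pair is written as an explicit look-behind and look-ahead; the function name and the separator parameter are illustrative only.

```python
import re


def group_thousands(number, separator=","):
    """Insert separator at every position preceded by a digit and followed by 3n digits."""
    digits = str(number)
    # (?<=\d)        a digit before the position (the part before \zs in Vim)
    # (?=(?:\d{3})+$) one or more groups of 3 digits up to the end (the part after \ze)
    return re.sub(r"(?<=\d)(?=(?:\d{3})+$)", lambda _m: separator, digits)


if __name__ == "__main__":
    print(group_thousands(1234567))
    print(group_thousands(1234567, "'"))
```

Because the match has length 0, nothing is removed from the string; the separator is inserted at each matching position.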

^[:blank:] does not match dot in sed

I have an input as follows:
INa.aa................... October 2010 after its previous U.S.-based owners failed to pay debts
My goal is to put brackets around every word starting with letter i/I. So I issued a command:
sed 's/\<i[^[:blank:]]*\>/(&)/gi' input_data
Which returned this output:
(INa.aa)................... October 2010 after (its) previous U.S.-based owners failed to pay debts
What I don't get is, why doesn't the [^[:blank:]]* also include the dots after INa.aa?
Thank you for any suggestions.
You use the \> "end of word" escape. A word boundary is defined as
the character to the left is a "word" character and the character to the right is a "non-word" character, or vice-versa
in the manual (referring to \b). In the case of \>, the "vice-versa" does not apply.
What is a "word" character?
A "word" character is any letter or digit or the underscore character.
And "non-word" characters are all the others. You expect the boundary between your periods and a blank to match \>, but it doesn't: both the period and the blank are non-word characters. The word boundary is between the last a and the first period.
The period between the a's is also surrounded by word boundaries, but because there aren't any blanks involved, it's part of the match.
If you want to match everything up to the next blank, you can just skip the \> in your regex.
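The effect of the trailing boundary can be reproduced with a small sketch in Python. This is an approximation: Python has no \< or \>, so \b is used instead, and [^ \t] stands in for [^[:blank:]]. On this input the result is the same as with sed.

```python
import re

LINE = ("INa.aa................... October 2010 after its previous "
        "U.S.-based owners failed to pay debts")

# With a trailing boundary the match must end right after a word character.
WITH_BOUNDARY = re.compile(r"\bi[^ \t]*\b", re.IGNORECASE)
# Without it, the match runs up to the next blank.
WITHOUT_BOUNDARY = re.compile(r"\bi[^ \t]*", re.IGNORECASE)


def bracket(pattern, text):
    """Wrap every match of pattern in parentheses, like sed's (&)."""
    return pattern.sub(lambda m: "(" + m.group(0) + ")", text)


if __name__ == "__main__":
    print(bracket(WITH_BOUNDARY, LINE))
    print(bracket(WITHOUT_BOUNDARY, LINE))
```

With the boundary, the engine backtracks from the end of the dots to the last position that follows a word character, which is right after INa.aa. Without it, the dots stay inside the brackets.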

Tell vim to add commas to a number, e.g. change 31415926 to 31,415,926

I have a very large number (a couple hundred digits long), and I'd like to use vim to add commas to the number in the appropriate manner, i.e. after each group of three digits, moving from right to left. How can I do this efficiently?
Taken from here
Substitute command that adds commas in the right spot.
:%s/\(\d\)\(\(\d\d\d\)\+\d\@!\)\@=/\1,/g
This uses a zero-width lookahead to match any digit that is followed by one or more groups of three digits which are not in turn followed by another digit (that is, exactly 3n digits remain to its right).
The digits that match are marked with ^ below. Each one is then replaced by itself followed by a comma.
31415926
 ^  ^
Which replaces to
31,415,926
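For comparison, the same lookahead can be written in Python's regex syntax. This is a sketch that assumes Python's re module, not Vim; (?=...) plays the role of Vim's positive lookahead and (?!\d) the role of the negative one, and the names are made up for this example.

```python
import re

# A digit followed by one or more groups of three digits that are not
# followed by yet another digit.
COMMA_SPOT = re.compile(r"(\d)(?=(?:\d{3})+(?!\d))")


def add_commas(text):
    """Put a comma after every digit that has exactly 3n digits to its right."""
    return COMMA_SPOT.sub(r"\1,", text)


if __name__ == "__main__":
    print(add_commas("31415926"))
    print(add_commas("x 1234 y 12"))
```

Since the lookahead stops at the first non-digit rather than at the end of the line, numbers embedded in other text are handled as well.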
A friend of mine suggests using the printf program: ciw<C-r>=system("printf \"%'d\" ".shellescape(@"))<CR>.
This is one way of doing it:
s/\d\{-1,}\ze\(\d\{3}\)\+\>/&,/g
Notes:
\{-1,} says to match at least 1, but in a non-greedy way (Vim doesn't seem to support the usual \+\? syntax; also, for quantifiers, you only need to escape the opening curly brace)
\ze ends the match at this point: the pattern after it must still match, but it is not stored in & (similar to a positive look-ahead)
\(\d\{3}\)\+\> matches groups of 3 digits that end at a word/non-word boundary (a word character here means alphanumeric or underscore).
Alternatively, you can use \s for space/tab, or \D for non-digit instead of \>, whichever fits your needs better
The way I did it was to record a macro that adds a single comma and then invoke the macro a whole bunch of times, like qahhi,<ESC>hq@a@a@a@a…
