I want to parse a table line using regex.
Input
|---|---|---|
|---|---|---|
So far I've come up with this regex:
/^(?<indent>\s*)\|(?<cell>-+|)/g
Regex101 Link: https://regex101.com/r/wzMYxd/1
But this regex is incomplete.
This only finds the first cell (---|), but I want to find all the following cells too, even when they have a different number of dashes (e.g. ----|).
Question: Can we catch the following cells with the same pattern using the regex?
ExpectedOutput: groups with array of matched cells: ["---|", "----|", "---|"]
Note: the number of - characters per cell is not fixed.
How about first verifying that the line matches the pattern:
^[ \t]*\|(?:-+\|)+$
See this demo at regex101 - If it matches, extract the stuff:
^(?<indent>[\t ]*)\||(?<cell>-+)\|
Another demo at regex101 (explanation on the right side)
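The two-step idea above (validate the whole line, then extract the cells) can be sketched in Python. This is only an illustration under a few assumptions: the function name parse_separator_row is made up, and the extraction pattern keeps the trailing | in each cell so the result matches the expected output from the question.

```python
import re

# Step 1: optional indent, a pipe, then one or more "dashes + pipe" cells.
ROW = re.compile(r"^[ \t]*\|(?:-+\|)+$")
# Step 2: each cell is a run of dashes closed by a pipe.
CELL = re.compile(r"-+\|")

def parse_separator_row(line):
    """Return the list of cells, or None if the line is not a separator row."""
    if ROW.match(line) is None:
        return None
    return CELL.findall(line)

print(parse_separator_row("|---|----|---|"))  # ['---|', '----|', '---|']
print(parse_separator_row("|---|oops|"))      # None
```

Validating first means the extraction pattern can stay simple, because it only ever runs on lines already known to be well formed.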
With just one regex, perhaps by using the sticky flag y and a lookahead for validation:
/^(?<indent>[ \t]*)\|(?=(?:-+\|)+$)|(?!^)(?<cell>-+)\|/gy
One more demo at regex101
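The lookahead checks once, right after the first |, whether the rest of the string matches the pattern. If this first match fails, the second alternative cannot match either: (?!^) keeps it from matching at the start, and the y flag requires every match to begin exactly where the previous one ended (matches are "glued" to each other).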
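Python's re module has no sticky flag, but the "glued" behaviour can be approximated by calling Pattern.match(string, pos) in a loop, since match() only succeeds at exactly pos. The sketch below is an emulation under that assumption, not the JavaScript code from the answer; the function name sticky_cells is hypothetical and the named groups use Python's (?P<name>...) syntax.

```python
import re

STICKY = re.compile(
    r"^(?P<indent>[ \t]*)\|(?=(?:-+\|)+$)"  # first match: indent, pipe, validate the rest
    r"|(?!^)(?P<cell>-+)\|"                 # later matches: one cell each
)

def sticky_cells(line):
    """Collect cells by matching repeatedly, each match starting where the last ended."""
    cells = []
    pos = 0
    while pos < len(line):
        m = STICKY.match(line, pos)  # anchored at pos, like the y flag
        if m is None:
            break  # the chain is broken, so nothing further is accepted
        if m.group("cell") is not None:
            cells.append(m.group("cell"))
        pos = m.end()
    return cells

print(sticky_cells("|---|----|---|"))  # ['---', '----', '---']
print(sticky_cells("|---|oops|"))      # []
```

For the invalid line the lookahead rejects the first match, and because the second alternative is barred from position 0, no cell is ever collected.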
I would like to mark in Notepad++ the sql scripts in a text log. The sql files have this format in the text:
AAAAAAAA.BBBBBBBBBBB.sql
So I run this expression from the Search menu:
\w*.sql
With this I expect to get BBBBBBBBBBB.sql. The problem is that some script names contain dashes (-), and when that happens I don't get the whole name, just the part after the last dash.
For example, in:
AAAAAAAA.BBBBB-CCCCCCC.sql
I would like to get BBBBB-CCCCCCC.sql, but I just get CCCCCCC.sql
Is there a pattern that would match them?
If the match must not start or end with a hyphen:
\w+(?:-\w+)*\.sql
\w+ Match 1+ word characters
(?:-\w+)* Optionally repeat - followed by 1+ word characters
\.sql Match .sql
See a regex demo.
Note that in your pattern the \w* can also match 0 occurrences and that the . can match any character if it is not escaped.
Another option could be a character class that matches either - or a word character, but this would also allow mixed forms such as --a--.sql
[\w-]+\.sql
See another regex demo.
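To compare the three patterns outside Notepad++, here is a small Python sketch. It assumes Python's re treats these particular patterns the same way Notepad++ does, and the sample log line is invented for illustration.

```python
import re

log = "run AAAAAAAA.BBBBB-CCCCCCC.sql then AAAAAAAA.BBBBBBBBBBB.sql"

# Original pattern: \w does not match "-", so the name is cut at the dash.
original = re.findall(r"\w*.sql", log)
# Stricter pattern: word characters, optionally joined by single hyphens.
strict = re.findall(r"\w+(?:-\w+)*\.sql", log)
# Character class: simpler, but also accepts odd names such as "--a--.sql".
loose = re.findall(r"[\w-]+\.sql", "--a--.sql")

print(original)  # ['CCCCCCC.sql', 'BBBBBBBBBBB.sql']
print(strict)    # ['BBBBB-CCCCCCC.sql', 'BBBBBBBBBBB.sql']
print(loose)     # ['--a--.sql']
```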
I have checked the documentation: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.regexp_replace.html
But cannot for the life of me figure out why this part
r'(\d+)'
Leads to changing
'100-200'
to
'-----'
Anyone with good documentation on that? I believe the \d section looks for 0-9 but that's about as far as I get. I don't understand in which sequence you need to do what either.
\d matches a digit, i.e. 0-9, and + matches the previous token between one and unlimited times, as many times as possible, giving back as needed.
The column has the value 100-200. The regex matches 100 as a whole (\d alone would match each digit of 100 separately, but + extends the match to the full run of digits), so 100 is replaced by --. In the same way, 200 is replaced by --. The hyphen between them is not matched and stays as it is, so the final column value is -----.
Parentheses create a capturing group, which can be referenced later by its index, starting at 1.
Let's say we want to extract only 1st matched value in a column then in spark we can use regexp_extract as shown below:
from pyspark.sql.functions import regexp_extract
df.select(regexp_extract('column', r'(\d+)', 1))  # 1 is the group index
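The same behaviour can be tried without a Spark session. The sketch below uses Python's re module as a stand-in, assuming it handles this simple pattern the same way as the Java regex engine behind Spark; the replacement string "--" is inferred from the result in the question.

```python
import re

value = "100-200"

# Each maximal run of digits is one match, so exactly two replacements happen.
replaced = re.sub(r"(\d+)", "--", value)
print(replaced)  # -----

# Group 1 of the first match, roughly what regexp_extract(..., 1) returns.
first = re.search(r"(\d+)", value).group(1)
print(first)  # 100

# Without the +, every digit is its own match: six replacements instead of two.
per_digit = re.sub(r"\d", "--", value)
print(len(per_digit))  # 13
```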
In Python, the prefix r before a string literal marks it as a raw string. For example, '\n' is a newline, whereas r'\n' is two characters: a backslash \ followed by n.
If you want the regex engine to receive \n and you don't use the r prefix, you have to escape the backslash like this: "\\n".
You can practice and test regexes on this website, where you get a real-time explanation of what is happening in the background. You can also go through this simple cheatsheet.
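A few lines of Python make the difference between raw and regular string literals concrete. This is only a sketch of the point above.

```python
import re

plain = "\n"   # one character: a newline
raw = r"\n"    # two characters: a backslash and the letter n

print(len(plain))  # 1
print(len(raw))    # 2

# Doubling the backslash in a regular literal gives the same two characters.
same = ("\\n" == raw)
print(same)  # True

# The regex engine reads backslash + n as "match a newline".
found = re.search(raw, "a\nb") is not None
print(found)  # True
```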
I'm working with VBA and trying to split a string into three columns. Almost all strings look like Company Name 3567782 Agent Name.pdf
With this pattern I want to match all the text before a space and digits (1st group), the digits (2nd group) and all the text after the space and before the .pdf (3rd group).
strPattern = "^(.+)\n(\d{4,10})\n(.+).pdf"
I recall that spaces in Python are \s, but I saw that in VBA they are \n.
Can you help me find the right pattern for what I'm looking for?
As I put in my comment, I use the https://regex101.com site. There are others but I find this one the most helpful to me.
When I put in your regex
^(.+)\n(\d{4,10})\n(.+).pdf
and test string
Company Name 3567782 Agent Name.pdf
the first thing I notice is that the regex does not match the test string (see right side under MATCH INFORMATION).
Here are a couple things that I saw:
\n is newline, not space. In regex, space is " ".
Your last "." in ".pdf" is not registering as a literal period, it's a token that matches any character. To match a literal period, you need \.
If we change those two things it returns three groups that seem to match what you are looking for.
^(.+) (\d{4,10}) (.+)\.pdf
It looks like for the digits, you are looking for between 4 and 10 digits. If that's correct, it looks like your regex is good. You could put in a handful of example strings into the TEST STRING area and make sure that it works in all cases.
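As a quick check outside regex101, the corrected pattern can be tried in Python. This is a sketch only: the question is about VBA, and the helper name split_filename is hypothetical.

```python
import re

PATTERN = re.compile(r"^(.+) (\d{4,10}) (.+)\.pdf")

def split_filename(name):
    """Return (company, number, agent), or None if the name does not match."""
    m = PATTERN.match(name)
    return m.groups() if m else None

print(split_filename("Company Name 3567782 Agent Name.pdf"))
# ('Company Name', '3567782', 'Agent Name')
```

Because the period is escaped, a name such as "A 1234 B_pdf" is rejected, whereas the unescaped . would have accepted it.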
I'd use either of these:
(?:(?:([a-zA-Z]+\.?)|(\d+)))
captures letters (a-z, A-Z) greedily with an optional trailing . to allow for the .pdf, or captures digits
this version excludes the space ([ ] or \s)
or keep the search structured so you can control what goes in and out of each column
^(\w+\s\w+)|(\d+)|(\w+\s\w+\.\w+$)
\b or ^ - word boundary or start of string
(\w+\s\w+) - 1st capture: \w+ greedily matches word characters (letters, digits, underscore), followed by one whitespace character (use \s* or \s+ to allow more), followed again by greedy word characters
|(\d+) - alternation: \d+ captures just the digits
|(\w+\s\w+\.\w+$) - similar to the 1st group, but allows for the '.' of .pdf and is anchored to the end of the string with $.
You could optionally build the '.' into the 1st group as in my first pattern, but for neatness and better control I prefer the 2nd.
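With an alternation like this, each match fills exactly one of the three groups, so the columns have to be collected across several matches. A Python sketch of that idea follows; the helper name split_columns is hypothetical, and like the pattern itself it assumes two-word company and agent names.

```python
import re

PARTS = re.compile(r"^(\w+\s\w+)|(\d+)|(\w+\s\w+\.\w+$)")

def split_columns(name):
    """Fill three columns from successive matches of the alternation."""
    columns = [None, None, None]
    for m in PARTS.finditer(name):
        # Only one group takes part in each match; lastindex is its number.
        columns[m.lastindex - 1] = m.group(m.lastindex)
    return columns

print(split_columns("Company Name 3567782 Agent Name.pdf"))
# ['Company Name', '3567782', 'Agent Name.pdf']
```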
I am trying to extract a cell reference from a formula expression (in VBA) so that I can replace it with another cell reference. E.g., I have the following formulae in different cells: "=AL82+L8+L82", "=L8+L82" and "=AL82+L8". I have to change "L8" in each formula to "L9". I am new to regex and tried the following pattern:
"(?=[^A-Z])([L8])(?=[^0-9])"
However only 8 is changed to L9. Please assist me with the error.
Thanks
You can capture either plus or an equals sign in a capturing group.
Then match L8 and assert, using a negative lookahead, that the 8 is not directly followed by a digit.
In the replacement use group 1 followed by L9: $1L9
([+=])L8(?!\d)
See a regex demo
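The pattern and replacement can be checked against the three formulae from the question. The sketch below uses Python rather than VBA, so the replacement is written \g<1>L9 instead of $1L9; the function name bump_l8 is made up for illustration.

```python
import re

L8 = re.compile(r"([+=])L8(?!\d)")

def bump_l8(formula):
    """Replace the cell L8 with L9, leaving L82 and AL82 untouched."""
    # \g<1> re-inserts the captured + or = in front of the new reference.
    return L8.sub(r"\g<1>L9", formula)

for f in ["=AL82+L8+L82", "=L8+L82", "=AL82+L8"]:
    print(bump_l8(f))
# =AL82+L9+L82
# =L9+L82
# =AL82+L9
```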
I would like to isolate all operands from a formula (in the form of a string) by taking out the arithmetic operators so take out: "+","-","/","*","**2"
the formula string is something like:
"y=A+B1*options+B2*items**2+B3*factor+B4"
However: I can manage for most arithmetic operators, except for the exponents "**2" part. It has to be a wildcard search or so (not positional), because the whole formula might change in future and also might have another exponent (eg **5 or **54)
What would be the easiest way to strip "**?" out of the formula where ? can be any number?
To match the pattern you want, use the regex string r"\*\*\d+"
Breakdown:
r"" denotes a raw string literal in Python, which passes backslashes through to the regex engine unchanged (see the re module for more info)
\* matches a single * character - because the * is a special character in regex, we escape it with the \
\d matches a digit
+ matches the previous pattern at least once greedily: this means it will try to find at least one digit, then keep finding digits until it can find no more. So, it will match **2, **44382, and so on
As for stripping the pattern from the equation, you can do re.sub(pattern, "", equation) - replacing all instances of the pattern with nothing
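Putting the pieces together, a minimal sketch might look like the following. The second step, splitting on the remaining operators and the equals sign, is an assumption about what "isolating the operands" means in the question, and the function name operands is hypothetical.

```python
import re

def operands(equation):
    """Strip exponents such as **2 or **54, then split on the remaining operators."""
    without_exponents = re.sub(r"\*\*\d+", "", equation)
    # Split on =, +, -, * and /, dropping any empty pieces.
    return [part for part in re.split(r"[=+\-*/]", without_exponents) if part]

formula = "y=A+B1*options+B2*items**2+B3*factor+B4"
print(re.sub(r"\*\*\d+", "", formula))
# y=A+B1*options+B2*items+B3*factor+B4
print(operands(formula))
# ['y', 'A', 'B1', 'options', 'B2', 'items', 'B3', 'factor', 'B4']
```

The exponent has to be removed before splitting, otherwise the 2 in **2 would show up as an operand.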