Why does the + quantifier inside a lookahead in my regular expression seem to have no effect?
Regular expression lookahead pattern in Python 3.
Details are as follows.
My goal is to match the dot and any number of adjacent digits to its left, so that I can extract the part that is not matched. For example:
"Contents156.html" -> "Contents" ;
"PingHang12Report_ipad1_1269.html" ->"PingHang12Report_ipad1_" ;
But the pattern doesn't seem to work, apparently because "Lookaround Is Atomic". What should I do?
You are using a lookahead, (?=...), which "matches next but doesn’t consume any of the string". Your .* is greedy, so it consumes as much as it can (for Contents156.html that includes the first two digits), and the lookahead then only needs one remaining digit followed by a dot to succeed. Text matched inside (?=...) does not appear in the final result.
If you need a non-greedy match for the .* part, use .*? instead.
re.findall(r'.*?(?=\d+\.)', 'PingHang12Report_ipad1_1269.html')
# => ['PingHang12Report_ipad1_', '', '', '', '']
You can then just take the first element.
Another way to do this is with two capture groups:
re.findall(r'(.*?)(\d+\..*)', 'PingHang12Report_ipad1_1269.html')
# => [('PingHang12Report_ipad1_', '1269.html')]
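The difference between the greedy and non-greedy forms can be checked directly. This is only a minimal sketch: the greedy pattern .*(?=\d+\.) is an assumption about what the question used (the original pattern is not shown), and the helper name strip_trailing_number is made up for illustration.

```python
import re

# Assumed form of the pattern from the question (greedy .*).
GREEDY = re.compile(r'.*(?=\d+\.)')
# Non-greedy variant suggested in the answer.
LAZY = re.compile(r'.*?(?=\d+\.)')


def strip_trailing_number(filename):
    """Return the text before the first run of digits directly followed by a dot.

    Hypothetical helper: returns the name unchanged if nothing matches.
    """
    match = LAZY.match(filename)
    return match.group(0) if match else filename


for name in ('Contents156.html', 'PingHang12Report_ipad1_1269.html'):
    # Left: what the greedy pattern keeps. Right: what the lazy pattern keeps.
    print(GREEDY.match(name).group(0), '|', strip_trailing_number(name))
```

Using match instead of findall avoids the trailing empty strings, since only the first match at the start of the string is taken.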
Related
Please suggest a wildcard for the Firstjson list below.
Firstjson = { p10_7_8 , p10_7_2 , p10_7_3 , p10_7_4 }
I tried the wildcard p10.7.* on the Secondjson list below and it worked. But when I tried p10_7_* on the Firstjson list above, it did not work.
Secondjson = { p10.7.8 , p10.7.2 , p10.7.3 , p10.7.4 }
You are attempting to use wildcard syntax, but Groovy expects regular expression syntax for its pattern matching.
What went wrong with your attempt:
Attempt #1: p10.7.*
A regular expression of . matches any single character and .* matches 0 or more characters. This means:
p10{exactly one character of any kind here}7{zero or more characters of any kind here}
You didn't realize it, but the . character in your first attempt was acting like a single-character wildcard too. This might match with p10x7abcdefg for example. It also does match p10.7.8 though. But be careful, it also matches p10.78, because the .* expression at the end of your pattern will happily match any sequence of characters, thus any and all characters following p10.7 are accepted.
Attempt #2: p10_7_*
_ matches only a literal underscore. But _* means to match zero or more underscores. It does not mean to match characters of any kind. So p10_7_* matches things like p10_7_______. Literally:
p10_7{zero or more underscores here}
What you can do instead:
You probably want a regular expression like p10_7_\d+
This will match things like p10_7_3 or p10_7_422. It works by matching the literal text p10_7_ followed by one or more digits where a digit is 0 through 9. \d matches any digit, and + means to match one or more of the preceding thing. Literally:
p10_7_{one or more digits here}
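The three patterns can be compared side by side. The sketch below uses Python's re module rather than Groovy, and it assumes the whole name has to match (re.fullmatch); how the pattern is applied in the original Groovy code is not shown, so treat that as an assumption. The names list is illustrative.

```python
import re

# Illustrative names: the two lists from the question plus the edge cases
# mentioned in the answer.
names = ['p10_7_8', 'p10_7_2', 'p10.7.8', 'p10.78',
         'p10x7abcdefg', 'p10_7_____', 'p10_7_422']

patterns = [r'p10.7.*', r'p10_7_*', r'p10_7_\d+']

results = {}
for pattern in patterns:
    # Keep only the names that the pattern matches from start to end.
    results[pattern] = [n for n in names if re.fullmatch(pattern, n)]
    print(pattern, '->', results[pattern])
```

Under that assumption, p10.7.* accepts every name in the list, p10_7_* accepts only the name that ends in a run of underscores, and p10_7_\d+ accepts exactly the names that end in digits.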
I'm using python's re module to grab all instances of values between the
opening and closing parenthesis.
i.e. (A)way(Of)testing(This)
would produce a list:
['A', 'Of', 'This']
I took a look at 1 and 2.
This is my code:
import re
sentence = "(A)way(Of)testing(This)is running (it)"
res = re.compile(r".*\(([a-zA-Z0-9|^)])\).*", re.S)
for s in re.findall(res, sentence):
    print(s)
What I get from this is:
it
Then I realized I was only capturing just one character, so I used
res = re.compile(r".*\(([a-zA-Z0-9-|^)]*)\).*", re.S)
But I still get it
I've always struggled with regex. My understanding of my search string
is as follows:
.* (any character)
\( (escapes the opening parenthesis)
( (starts the grouping)
[a-zA-Z0-9-|^)]* (set of characters allowed : a-z, A-Z, 0-9, - *EXCEPT the ")" )
) (closes the grouping)
\) (escapes the closing parenthesis)
.* (anything else)
So in theory it should go through sentence and once it encounters a (,
it should copy the contents up until it encounters a ), at which point it should
store that into one group. It then proceeds through the sentence.
I even used the following:
res = re.compile(r".*\(([a-z|A-Z|0-9|-|^)]*)\).*", re.S)
But it still returns an it.
Any help greatly appreciated,
Thanks
You can shorten the pattern by dropping the .* parts and the ^ and ) inside the brackets, and use only the character class.
The .* parts are greedy, so the pattern matches the whole sentence in a single match. Since the pattern contains only one capture group, you get only one value: the last parenthesized one.
In your explanation of [a-zA-Z0-9-|^)]*, the |^) inside the character class does not exclude the ). It just adds the literal characters |, ^ and ) to the set of allowed characters.
If you want a negated character class, the ^ must come first, as in [^...]. That is not necessary here, as you can specify what you want to match instead of what you don't want to match.
\(([a-zA-Z0-9-]*)\)
The pattern matches:
\( Match (
( Capture group 1
[a-zA-Z0-9-]* Optionally repeat matching one of the listed ranges a-zA-Z0-9 or -
) Close group 1
\) Match )
You don't need the re.S as there is no dot in the pattern that should match a newline.
import re
sentence = "(A)way(Of)testing(This)is running (it)"
res = re.compile(r"\(([a-zA-Z0-9-]*)\)")
print(re.findall(res, sentence))
Output
['A', 'Of', 'This', 'it']
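To see why the original attempt returned only it, here is a small sketch that runs a simplified version of the question's pattern next to two alternatives. The negated-class pattern \(([^)]*)\) is not from the answer above; it is shown only as an illustration of where the ^ has to go.

```python
import re

sentence = "(A)way(Of)testing(This)is running (it)"

# Greedy .* on both sides: the whole sentence is consumed by one match,
# so findall reports only the last parenthesized value.
greedy = re.findall(r".*\(([a-zA-Z0-9-]*)\).*", sentence, re.S)

# Inside a character class, | ^ ) are plain literals unless ^ comes first.
literals = re.findall(r"[|^)]", "a|b^c)d")

# Negated class: capture everything up to the next ")".
negated = re.findall(r"\(([^)]*)\)", sentence)

print(greedy)
print(literals)
print(negated)
```

The negated form also accepts spaces or punctuation between the parentheses, which may or may not be what you want; the explicit class [a-zA-Z0-9-] is stricter.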
I have a string in the following format
----------some text-------------
How do I extract "some text" without the hyphens?
I have tried with this but it matches the whole string
(?<=-).*(?=-)
The pattern matches the whole line except the first and last hyphen: the assertions only require a single hyphen on each side, and the . also matches a hyphen.
You can keep using the assertions, and match any char except a hyphen using [^-]+ which is a negated character class.
(?<=-)[^-]+(?=-)
Note: if you also want to prevent matching a newline you can use (?<=-)[^-\r\n]+(?=-)
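A short sketch of both lookaround patterns in Python, using the string from the question. re.search is used for the first pattern so that only the first match is inspected.

```python
import re

s = "----------some text-------------"

# Original attempt: "." also matches "-", so only the outermost
# hyphens are left out of the match.
too_wide = re.search(r'(?<=-).*(?=-)', s).group(0)

# Negated character class: the match itself cannot contain a hyphen.
exact = re.findall(r'(?<=-)[^-]+(?=-)', s)

print(repr(too_wide))
print(exact)
```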
With your shown samples, please try the following regex.
^-+([^-]*)-+$
Explanation: ^-+ matches one or more hyphens at the start, the single capture group ([^-]*) captures everything up to the next hyphen, and -+$ matches the remaining hyphens up to the end of the string.
You are using Python, right? Then you do not need regex.
Use
s = "----------some text-------------"
print(s.strip("-"))
Results: some text.
With regex, the same can be achieved using
re.sub(r'^-+|-+$', '', s)
EXPLANATION
^ the beginning of the string
-+ '-' (1 or more times, matching as many as possible)
| OR
-+ '-' (1 or more times, matching as many as possible)
$ the end of the string, or just before an optional trailing \n
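The approaches from these answers behave the same on the sample string but differ when the text itself contains a hyphen. The sketch below compares them; the second sample string is made up for illustration.

```python
import re

# First sample is from the question; the second is a hypothetical
# input with a hyphen inside the text.
samples = ["----------some text-------------", "--some-text--"]

rows = []
for s in samples:
    stripped = s.strip("-")                  # plain string method
    substituted = re.sub(r'^-+|-+$', '', s)  # remove leading/trailing runs
    m = re.match(r'^-+([^-]*)-+$', s)        # capture group, no inner hyphens
    captured = m.group(1) if m else None
    rows.append((stripped, substituted, captured))
    print(stripped, substituted, captured, sep=" | ")
```

strip and re.sub keep inner hyphens, while the capture-group pattern does not match at all when the text contains one, so which one fits depends on the input you expect.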
I need to find mentions of compensations in emails. I am new to regex. Please see below the approach I am using.
sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr"
My python code to find this,
import re
PATTERN = r'((\$|\£) [0-9]*)|((\$|\£)[0-9]*)'
print(re.findall(PATTERN,sample_text))
The matches I am getting are
[('', '', '$115', '$'), ('', '', '$55', '$'), ('', '', '$60', '$')]
Expected match
["$115k/yr","$55/hr","$60/hr"]
Also, the $ sign can be written as USD. How do I handle this in the same regex?
You can use
[$£]\d+[^.\s]*
[$£] Match either $ or £
\d+ Match 1+ digits
[^.\s]* Repeat 0+ times matching any char except . or a whitespace
import re
sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr"
PATTERN = r'[$£]\d+[^.\s]*'
print(re.findall(PATTERN,sample_text))
Output
['$115k/yr', '$55/hr', '$60/hr']
If there must be a / present, you might also use
[$£]\d+[^\s/.]*/\w+
You can have something like:
[$£]\d+[^.]+
>>> PATTERN = r'[$£]\d+[^.]+'
>>> print(re.findall(PATTERN,sample_text))
['$115k/yr', '$55/hr', '$60/hr']
[$£] matches "$" or a "£"
\d+ matches one or more digits
[^.]+ matches one or more characters that are not a "." (this includes whitespace)
The parentheses in your regex cause the engine to report the contents of each parenthesized subexpression. You can use non-grouping parentheses (?:...) to avoid this, but of course, your expression can be rephrased to not have any parentheses at all:
PATTERN = r'[$£]\s*\d+'
Notice also how I changed the last quantifier to a + -- your attempt would also find isolated currency symbols with no numbers after them.
To point out the hopefully obvious, \s matches whitespace and \s* matches an arbitrary run of whitespace, including none at all; and \d matches a digit.
If you want to allow some text after the extracted match, add something like (?:/\w+)? to allow for a slash and one single word token as an optional expression after the main match. (Maybe adorn that with \s* on both sides of the slash, too.)
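The effect of the parentheses on findall, and one possible way to accept USD as well, can be sketched as follows. The USD pattern and the second sample string are assumptions made for illustration; none of the answers above gives a pattern for USD.

```python
import re

sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr"

# Capturing groups: findall returns one tuple of group contents per match.
with_groups = re.findall(r'((\$|£)[0-9]+)', sample_text)

# Non-capturing group: findall returns each whole match as a string.
without_groups = re.findall(r'(?:\$|£)[0-9]+', sample_text)

# Assumed extension: "$", "£" or the literal text "USD", optional
# whitespace, digits, an optional suffix such as "k", then an
# optional "/word" part.
USD_PATTERN = r'(?:[$£]|USD)\s*\d+\w*(?:/\w+)?'
usd_text = "Rate – USD 115k/yr. or $55/hr. - USD60/hr"  # hypothetical input
with_usd = re.findall(USD_PATTERN, usd_text)

print(with_groups)
print(without_groups)
print(with_usd)
```

The alternation (?:[$£]|USD) is needed because a character class can only list single characters, not the three-letter word USD.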
I have two scenarios to match. The value should contain exactly 16 hex digits.
The pattern should allow A-F, a-f, 0-9, and, in the first case, '-'.
AC-DE-48-23-45-67-AB-CD
ACDE48234567ABCD
I have tried r'^([0-9A-Fa-f]{16})$|(([0-9A-Fa-f]{2}\-){7}[0-9A-Fa-f]{2})$', which works fine. I am looking for a better expression.
You can simplify the regex by considering the string to be a group of two hex digits followed by an optional -, followed by 6 similar groups (i.e. if the first group had a -, the subsequent ones must too), followed by a group of 2 hex digits:
^[0-9A-Fa-f]{2}(-?)([0-9A-Fa-f]{2}\1){6}[0-9A-Fa-f]{2}$
Use of the re.I flag allows you to remove the a-f from the character classes:
^[0-9A-F]{2}(-?)([0-9A-F]{2}\1){6}[0-9A-F]{2}$
You can also simplify slightly further by replacing 0-9 by \d in the character classes (although personally I find 0-9 easier to read):
^[\dA-F]{2}(-?)([\dA-F]{2}\1){6}[\dA-F]{2}$
Sample python code:
import re
strs = ['AC-DE-48-23-45-67-AB-CD',
        'ACDE48234567ABCD',
        'AC-DE48-23-45-67-AB-CD',
        'ACDE48234567ABC',
        'ACDE48234567ABCDE']
for s in strs:
    print(s + (' matched' if re.match(r'^[0-9A-F]{2}(-?)([0-9A-F]{2}\1){6}[0-9A-F]{2}$', s, re.I) else ' didn\'t match'))
Output
AC-DE-48-23-45-67-AB-CD matched
ACDE48234567ABCD matched
AC-DE48-23-45-67-AB-CD didn't match
ACDE48234567ABC didn't match
ACDE48234567ABCDE didn't match