Regex to find compensations in text - python-3.x

I need to find mentions of compensations in emails. I am new to regex. Please see below the approach I am using.
sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr"
My python code to find this,
import re
PATTERN = r'((\$|\£) [0-9]*)|((\$|\£)[0-9]*)'
print(re.findall(PATTERN,sample_text))
The matches I am getting are
[('', '', '$115', '$'), ('', '', '$55', '$'), ('', '', '$60', '$')]
Expected match
["$115k/yr","$55/hr","$60/hr"]
Also, the $ sign can be written as USD. How do I handle this in the same regex?

You can use
[$£]\d+[^.\s]*
[$£] Match either $ or £
\d+ Match 1+ digits
[^.\s]* Repeat 0+ times matching any char except . or a whitespace
Regex demo
import re
sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr"
PATTERN = r'[$£]\d+[^.\s]*'
print(re.findall(PATTERN,sample_text))
Output
['$115k/yr', '$55/hr', '$60/hr']
If there must be a / present, you might also use
[$£]\d+[^\s/.]*/\w+
Regex demo
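The question also asks about amounts written with USD instead of $. The patterns above do not cover that, so here is a minimal sketch, assuming USD may be followed by at most one space before the digits. The sample text with USD and £ is made up for illustration:

```python
import re

# Hypothetical sample: the original text with one amount rewritten using USD.
sample_text = "Rate – USD 115k/yr. or $55/hr. - £60/hr"

# (?:[$£]|USD) is a non-capturing alternation, so findall still returns
# whole matches; \s? allows one optional space after the currency marker.
PATTERN = r'(?:[$£]|USD)\s?\d+[^.\s]*'

matches = re.findall(PATTERN, sample_text)
print(matches)  # ['USD 115k/yr', '$55/hr', '£60/hr']
```

If other spellings (for example lowercase usd) can occur, re.IGNORECASE could be passed as a flag, but that is an assumption about the input.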

You can have something like:
[$£]\d+[^.]+
>>> PATTERN = r'[$£]\d+[^.]+'
>>> print(re.findall(PATTERN,sample_text))
['$115k/yr', '$55/hr', '$60/hr']
[$£] matches "$" or a "£"
\d+ matches one or more digits
[^.]+ matches everything that's not a "."
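One caveat with [^.]+ that may be worth checking against your own data: it also consumes spaces, so two amounts that are not separated by a period end up in a single match. A small sketch on a made-up string without periods, compared with the class [^.\s]* that also excludes whitespace:

```python
import re

# Hypothetical text: no period between the two amounts.
text = "Pay $55/hr - $60/hr"

# [^.]+ keeps going until a period or the end of the string.
greedy = re.findall(r'[$£]\d+[^.]+', text)
# [^.\s]* additionally stops at whitespace.
bounded = re.findall(r'[$£]\d+[^.\s]*', text)

print(greedy)   # ['$55/hr - $60/hr']
print(bounded)  # ['$55/hr', '$60/hr']
```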

The parentheses in your regex cause the engine to report the contents of each parenthesized subexpression. You can use non-grouping parentheses (?:...) to avoid this, but of course, your expression can be rephrased to not have any parentheses at all:
PATTERN = r'[$£]\s*\d+'
Notice also how I changed the last quantifier to a + -- your attempt would also find isolated currency symbols with no numbers after them.
To point out the hopefully obvious, \s matches whitespace and \s* matches an arbitrary run of whitespace, including none at all; and \d matches a digit.
If you want to allow some text after the extracted match, add something like (?:/\w+)? to allow for a slash and one single word token as an optional expression after the main match. (Maybe adorn that with \s* on both sides of the slash, too.)
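A short sketch of both suggestions on the sample text from the question. The \w* after the digits is an assumption added here so that a suffix such as "k" is included; it is not part of the pattern above:

```python
import re

sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr"

# No capturing groups, so findall returns whole matches as plain strings.
basic = re.findall(r'[$£]\s*\d+', sample_text)
print(basic)  # ['$115', '$55', '$60']

# Assumption: \w* picks up a unit suffix such as "k". The optional
# non-capturing group then allows a slash and one word token,
# with optional whitespace on both sides of the slash.
extended = re.findall(r'[$£]\s*\d+\w*(?:\s*/\s*\w+)?', sample_text)
print(extended)  # ['$115k/yr', '$55/hr', '$60/hr']
```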

Related

How to match a wildcard for strings?

Please suggest a wildcard for the Firstjson list below.
Firstjson = { p10_7_8 , p10_7_2 , p10_7_3 , p10_7_4 }
I tried the wildcard p10.7.* on the Secondjson list below and it worked. But when I tried p10_7_* on the Firstjson list above, it did not work.
Secondjson = { p10.7.8 , p10.7.2 , p10.7.3 , p10.7.4 }
You are attempting to use wildcard syntax, but Groovy expects regular expression syntax for its pattern matching.
What went wrong with your attempt:
Attempt #1: p10.7.*
A regular expression of . matches any single character and .* matches 0 or more characters. This means:
p10{exactly one character of any kind here}7{zero or more characters of any kind here}
You didn't realize it, but the . character in your first attempt was acting like a single-character wildcard too. This might match with p10x7abcdefg for example. It also does match p10.7.8 though. But be careful, it also matches p10.78, because the .* expression at the end of your pattern will happily match any sequence of characters, thus any and all characters following p10.7 are accepted.
Attempt #2: p10_7_*
_ matches only a literal underscore. But _* means to match zero or more underscores. It does not mean to match characters of any kind. So p10_7_* matches things like p10_7_______. Literally:
p10_7{zero or more underscores here}
What you can do instead:
You probably want a regular expression like p10_7_\d+
This will match things like p10_7_3 or p10_7_422. It works by matching the literal text p10_7_ followed by one or more digits where a digit is 0 through 9. \d matches any digit, and + means to match one or more of the preceding thing. Literally:
p10_7_{one or more digits here}
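The three patterns can be compared side by side. The question appears to use Groovy, but the constructs involved (., *, _, \d, +) behave the same way in Python's re module, which is used for this sketch. The extra test names are made up for illustration:

```python
import re

# Names from the question plus a few hypothetical edge cases.
names = ["p10_7_8", "p10_7_422", "p10_7_______",
         "p10.7.8", "p10x7abcdefg", "p10.78"]
patterns = [r"p10.7.*", r"p10_7_*", r"p10_7_\d+"]

# fullmatch requires the whole name to match the pattern.
results = {p: [n for n in names if re.fullmatch(p, n)] for p in patterns}

for pattern, matched in results.items():
    print(pattern, "->", matched)
# p10.7.*   matches all six names
# p10_7_*   matches only 'p10_7_______'
# p10_7_\d+ matches 'p10_7_8' and 'p10_7_422'
```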

Regular expression to extract text between multiple hyphens

I have a string in the following format
----------some text-------------
How do I extract "some text" without the hyphens?
I have tried with this but it matches the whole string
(?<=-).*(?=-)
The pattern matches the whole line except the first and last hyphen: the assertions on the left and right each require only a single hyphen, and . also matches a hyphen.
You can keep using the assertions, and match any char except a hyphen using [^-]+ which is a negated character class.
(?<=-)[^-]+(?=-)
See a regex demo.
Note: if you also want to prevent matching a newline you can use (?<=-)[^-\r\n]+(?=-)
With your shown samples please try following regex.
^-+([^-]*)-+$
Online regex demo
Explanation: ^-+ matches one or more dashes at the start, ([^-]*) is the single capturing group holding everything up to the next dash, and -+$ matches the remaining dashes through the end of the string.
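A quick Python sketch comparing the pattern from the question with the two regex suggestions above, using re.search on the sample string:

```python
import re

s = "----------some text-------------"

# Pattern from the question: . also matches hyphens, so only the
# first and last hyphen are left out of the match.
greedy = re.search(r'(?<=-).*(?=-)', s).group()

# Negated character class between the same lookarounds.
negated = re.search(r'(?<=-)[^-]+(?=-)', s).group()

# Anchored pattern with a single capturing group.
captured = re.search(r'^-+([^-]*)-+$', s).group(1)

print(greedy == s[1:-1])  # True
print(negated)            # some text
print(captured)           # some text
```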
You are using Python, right? Then you do not need regex.
Use
s = "----------some text-------------"
print(s.strip("-"))
Results: some text.
See Python proof.
With regex, the same can be achieved using
re.sub(r'^-+|-+$', '', s)
See regex proof.
EXPLANATION
^     the beginning of the string
-+    '-' (1 or more times, matching the most amount possible)
|     OR
-+    '-' (1 or more times, matching the most amount possible)
$     before an optional \n, and the end of the string

Check that both `lookbehind` conditions are satisfied in `RegEx`

I'm trying to check if a username is preceded either by RT # or by RT# by using lookbehind mechanism paired with conditionals, as explained in this tutorial.
The regex and the example are shown in Example 1:
Example 1
import re
text = 'RT #u1, #u2, u3, #u4, rt #u5:, #u3.#u1^, rt#u3'
mt_regex = r'(?i)(?<!RT )&(?<!RT)#(\w+)'
mt_pat = re.compile(mt_regex)
re.findall(mt_pat, text)
which outputs [] (empty list), while the desired output should be:
['u2', 'u4', 'u3', 'u1']
What am I missing? Thanks in advance.
The other answer is obviously correct and rightfully accepted, but here is what I think might work for you without negative lookbehinds. The benefit is that you are not limited to the single space character using \s*:
(?i)(?:^|[,.])\s*#(\w+)
See the online demo
(?i) - Turn off case sensitivity. Note you can also use re.IGNORECASE.
(?:^|[,.]) - Non-capture group to match start of string or literal comma/dot.
\s* - Zero or more spaces.
# - Literally match "#".
(\w+) - Capture group matching one or more word characters; for ASCII input \w is short for [A-Za-z0-9_].
This prints ['u2', 'u4', 'u3', 'u1']
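A runnable sketch of this pattern on the text from the question. The second input, with two spaces after RT, is made up to illustrate the point about not being limited to a single space; it is compared against the question's lookbehind pattern with the stray & removed:

```python
import re

text = 'RT #u1, #u2, u3, #u4, rt #u5:, #u3.#u1^, rt#u3'

# Delimiter-based pattern from this answer.
delim_pat = re.compile(r'(?:^|[,.])\s*#(\w+)', re.IGNORECASE)
delim_matches = delim_pat.findall(text)
print(delim_matches)  # ['u2', 'u4', 'u3', 'u1']

# Hypothetical input with two spaces after RT.
text2 = 'RT  #u6, #u7'
look_pat = re.compile(r'(?<!RT )(?<!RT)#(\w+)', re.IGNORECASE)
look_two_spaces = look_pat.findall(text2)
delim_two_spaces = delim_pat.findall(text2)
print(look_two_spaces)   # ['u6', 'u7']  (fixed-width lookbehind misses "RT  ")
print(delim_two_spaces)  # ['u7']
```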
If we break down your regex:
r"(?i)(?<!RT )&(?<!RT)#(\w+)"
(?i) match the remainder of the pattern, case insensitive match
(?<!RT ) negative lookbehind: asserts that 'RT ' does not match
& matches the character '&' literally
(?<!RT) negative lookbehind: asserts that 'RT' does not match
# matches the character '#' literally
(\w+) capturing group: matches [a-zA-Z0-9_] between one and unlimited times
The & character is what prevents your regex from matching. Without it:
import re
text = "RT #u1, #u2, u3, #u4, rt #u5:, #u3.#u1^, rt#u3"
mt_regex = r"(?i)(?<!RT )(?<!RT)#(\w+)"
mt_pat = re.compile(mt_regex)
print(re.findall(mt_pat, text))
# ['u2', 'u4', 'u3', 'u1']
See this regex here

The correct way to identify a regular expression of the sort [variableName].add(

I'm looking for a clean way to identify occurrences of [variableName] followed by the exact string .add(.
A variable name is a string which contains one or more characters from a-z, A-Z, 0-9 and an underscore.
One more thing is that it cannot start with any of the characters from 0-9, but I don't mind ignoring this condition because there are no such cases in the text that I need to parse anyway.
I've been following several tutorials, but the farthest I got was finding all occurrences of what I've referred to above as "variableName":
import re
txt = "The _rain() in+ Spain5"
x = re.split("[^a-zA-Z0-9_]+", txt)
print(x)
What is the right way to do it?
You may use
re.findall(r'\w+(?=\.add\()', txt, flags=re.ASCII)
The regex matches:
\w+ - 1+ word chars (due to re.ASCII, it only matches [A-Za-z0-9_] chars)
(?=\.add\() - a positive lookahead that matches a location immediately followed with .add( substring.
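The txt in the question contains no .add( call, so here is a sketch on a made-up string. The stricter variant that rejects names starting with a digit is an assumption added for illustration, based on the rule stated in the question:

```python
import re

# Hypothetical sample text; the names are made up for illustration.
txt = "items.add(x); my_list2.add(y); foo.append(z); 9lives.add(q)"

# Pattern from the answer: any run of word characters before ".add(".
loose = re.findall(r'\w+(?=\.add\()', txt, flags=re.ASCII)
print(loose)  # ['items', 'my_list2', '9lives']

# Assumed stricter variant: \b plus [A-Za-z_] rejects names that
# start with a digit.
strict = re.findall(r'\b[A-Za-z_]\w*(?=\.add\()', txt, flags=re.ASCII)
print(strict)  # ['items', 'my_list2']
```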

How to write a better regex in python?

I have two scenarios to match. Length should be exactly 16.
The pattern should contain A-F, a-f, 0-9, and '-' in the first case.
AC-DE-48-23-45-67-AB-CD
ACDE48234567ABCD
I have tried r'^([0-9A-Fa-f]{16})$|(([0-9A-Fa-f]{2}\-){7}[0-9A-Fa-f]{2})$', which works fine. I am looking for a better expression.
You can simplify the regex by considering the string to be a group of two hex digits followed by an optional -, followed by 6 similar groups (i.e. if the first group had a -, the subsequent ones must too), followed by a group of 2 hex digits:
^[0-9A-Fa-f]{2}(-?)([0-9A-Fa-f]{2}\1){6}[0-9A-Fa-f]{2}$
Use of the re.I flag allows you to remove the a-f from the character classes:
^[0-9A-F]{2}(-?)([0-9A-F]{2}\1){6}[0-9A-F]{2}$
You can also simplify slightly further by replacing 0-9 by \d in the character classes (although personally I find 0-9 easier to read):
^[\dA-F]{2}(-?)([\dA-F]{2}\1){6}[\dA-F]{2}$
Demo on regex101
Sample python code:
import re
strs = ['AC-DE-48-23-45-67-AB-CD',
        'ACDE48234567ABCD',
        'AC-DE48-23-45-67-AB-CD',
        'ACDE48234567ABC',
        'ACDE48234567ABCDE']
for s in strs:
    print(s + (' matched' if re.match(r'^[0-9A-F]{2}(-?)([0-9A-F]{2}\1){6}[0-9A-F]{2}$', s, re.I) else ' didn\'t match'))
Output
AC-DE-48-23-45-67-AB-CD matched
ACDE48234567ABCD matched
AC-DE48-23-45-67-AB-CD didn't match
ACDE48234567ABC didn't match
ACDE48234567ABCDE didn't match
