Check that both `lookbehind` conditions are satisfied in `RegEx` - python-3.x

I'm trying to check whether a username is preceded by either RT # or RT#, using the lookbehind mechanism paired with conditionals, as explained in this tutorial.
The regex and the example are shown in Example 1:
Example 1
import re
text = 'RT #u1, #u2, u3, #u4, rt #u5:, #u3.#u1^, rt#u3'
mt_regex = r'(?i)(?<!RT )&(?<!RT)#(\w+)'
mt_pat = re.compile(mt_regex)
re.findall(mt_pat, text)
which outputs [] (empty list), while the desired output should be:
['u2', 'u4', 'u3', 'u1']
What am I missing? Thanks in advance.

The other answer is obviously correct and rightfully accepted, but here is what I think might work for you without negative lookbehinds. The benefit is that you are not limited to the single space character using \s*:
(?i)(?:^|[,.])\s*#(\w+)
See the online demo
(?i) - Turn off case sensitivity. Note you can also use re.IGNORECASE.
(?:^|[,.]) - Non-capture group to match start of string or literal comma/dot.
\s* - Zero or more spaces.
# - Literally match "#".
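(\w+) - Open capture group and match one or more word characters; \w is short for [A-Za-z0-9_] in ASCII mode, and by default in Python 3 it also matches Unicode letters and digits.
This prints ['u2', 'u4', 'u3', 'u1']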
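For reference, here is a minimal sketch that runs this pattern against the sample text from the question. The variable names are only illustrative.

```python
import re

text = 'RT #u1, #u2, u3, #u4, rt #u5:, #u3.#u1^, rt#u3'

# A mention counts only when '#' follows the start of the string, a comma
# or a dot, optionally separated by whitespace.
alt_pat = re.compile(r'(?i)(?:^|[,.])\s*#(\w+)')

mentions = alt_pat.findall(text)
print(mentions)
```

Keep in mind that this is an allow-list of what may precede #: anything other than the start of the string, a comma or a dot rules a mention out, not just RT.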

If we break down your regex:
r"(?i)(?<!RT )&(?<!RT)#(\w+)"
(?i) - match the remainder of the pattern case-insensitively
(?<!RT ) - negative lookbehind: asserts that the text immediately before the current position is not 'RT '
& - matches the character '&' literally
(?<!RT) - negative lookbehind: asserts that the text immediately before the current position is not 'RT'
# - matches the character '#' literally
(\w+) - capturing group: matches word characters ([a-zA-Z0-9_]) between one and unlimited times
The & character is what prevents your regex from matching: the sample text never contains an & in front of #. Without it, the pattern works:
import re
text = "RT #u1, #u2, u3, #u4, rt #u5:, #u3.#u1^, rt#u3"
mt_regex = r"(?i)(?<!RT )(?<!RT)#(\w+)"
mt_pat = re.compile(mt_regex)
print(re.findall(mt_pat, text))
# ['u2', 'u4', 'u3', 'u1']
See this regex here
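To illustrate why no explicit "and" operator is needed, the sketch below compares each lookbehind on its own with the two placed back to back. Consecutive lookarounds are all checked at the same position, so the combined pattern only matches where both conditions hold. The variable names are illustrative.

```python
import re

text = "RT #u1, #u2, u3, #u4, rt #u5:, #u3.#u1^, rt#u3"

# Only rejects "RT " (with a space) before '#'; "rt#u3" still gets through.
only_rt_space = re.findall(r"(?i)(?<!RT )#(\w+)", text)

# Only rejects "RT" directly before '#'; "RT #u1" and "rt #u5" still get through.
only_rt = re.findall(r"(?i)(?<!RT)#(\w+)", text)

# Both lookbehinds are evaluated at the same position, so both must succeed.
both = re.findall(r"(?i)(?<!RT )(?<!RT)#(\w+)", text)

print(only_rt_space)
print(only_rt)
print(both)
```

Two separate lookbehinds are needed here because Python's re module requires lookbehind patterns to have a fixed width, so a single (?<!RT\s*) is rejected.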

Related

Replace characters other than A-Za-z0-9 and decimal values with space using regex

I want to keep alphanumeric characters and also the decimal numbers present in my text string and replace all other characters with space.
For alphanumeric characters, I can use
import re

def clean_up(text):
    return re.sub(r"[^A-Za-z0-9]", " ", text)
But this will replace all . whether they are between two digits or a fullstop or at random locations. I just want to keep the . if they come between two digits.
I thought of [^((A-Za-z0-9)|(\d\.\d))], but it doesn't seem to work.
You can match and capture the patterns you need to keep and just match any char otherwise. Then, using the lambda expression as the replacement argument, you can either replace with the captured substring or a space.
The patterns are:
[+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)? - matches any number
[^\W_] - matches any alphanumeric, Unicode included
. - matches any char (with re.S or re.DOTALL).
The solution looks like
pattern = re.compile(r'([+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?|[^\W_])|.', re.DOTALL)
def clean_up(text):
    return pattern.sub(lambda x: x.group(1) or " ", text)
See the online demo:
import re
pattern = re.compile(r'([+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?|[^\W_])|.', re.DOTALL)
def clean_up(text):
    return pattern.sub(lambda x: x.group(1) or " ", text)
print(clean_up("+1.2E02 ANT01-TEXT_HERE!"))
Output: +1.2E02 ANT01 TEXT HERE
[^A-Za-z0-9](?!\d)
You can use a negated character class with a negative lookahead.
[...] is a character set, which matches the literal characters inside it, so [^((A-Za-z0-9)|(\d\.\d))] means: match any single character that is not in A-Za-z0-9 and is not a literal (, |, . or ).
You may try [^a-zA-Z0-9.]|(?<!\d)\.|\.(?!\d):
[^a-zA-Z0-9.] matches any character that is not an ASCII letter, a digit or a .
(?<!\d)\. matches a . that is not preceded by a digit
\.(?!\d) matches a . that is not followed by a digit
Test run:
print(re.sub(r"[^a-zA-Z0-9.]|(?<!\d)\.|\.(?!\d)", " ", ".aBc.1e2#3.4F5$6.gHi."))
Output:
aBc 1e2 3.4F5 6 gHi
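To see why the character-class attempt from the question does not behave like a grouped alternation, here is a small sketch. The sample string is made up for illustration. Inside [...], the parentheses, the pipe and the dot are plain characters, so the negated class simply leaves them alone.

```python
import re

# The attempt from the question: everything between [^ and ] is one set of characters.
attempt = r"[^((A-Za-z0-9)|(\d\.\d))]"

# Hypothetical sample containing the characters that end up inside the set.
sample = "a(b)|c.d#e 1.5"

result = re.sub(attempt, " ", sample)
print(result)
```

Only the # (and the space, which is replaced by a space) is affected; every (, ), | and . survives, whether or not it sits between digits.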

Regular expression to extract text between multiple hyphens

I have a string in the following format
----------some text-------------
How do I extract "some text" without the hyphens?
I have tried with this but it matches the whole string
(?<=-).*(?=-)
The pattern matches the whole line except the first and last hyphen: the assertions on the left and right each require only a single adjacent hyphen, and since the . also matches a hyphen, .* consumes all the others.
You can keep using the assertions, and match any char except a hyphen using [^-]+ which is a negated character class.
(?<=-)[^-]+(?=-)
See a regex demo.
Note: if you also want to prevent matching a newline you can use (?<=-)[^-\r\n]+(?=-)
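A quick sketch comparing the original pattern with the negated character class on the sample string (the variable names are illustrative):

```python
import re

s = "----------some text-------------"

# Original attempt: '.' also matches hyphens, so only the outermost
# hyphen on each side is left out of the match.
greedy = re.search(r"(?<=-).*(?=-)", s).group()

# Negated character class: the match cannot contain any hyphen.
fixed = re.findall(r"(?<=-)[^-]+(?=-)", s)

print(greedy)
print(fixed)
```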
With your shown samples, please try the following regex.
^-+([^-]*)-+$
Online regex demo
Explanation: ^-+ matches one or more dashes at the start of the string; ([^-]*) is the one and only capturing group and captures everything up to the next dash; -+$ then matches all the remaining dashes up to the end of the string.
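In Python, the captured text can be read from group 1. A minimal sketch, assuming the whole string has the shape shown in the question (the second input is a made-up counterexample):

```python
import re

pattern = r"^-+([^-]*)-+$"

s = "----------some text-------------"
m = re.match(pattern, s)
inner = m.group(1) if m else None
print(inner)

# Because the group excludes hyphens, a hyphen inside the text means no match at all.
no_match = re.match(pattern, "--a-b--")
print(no_match)
```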
You are using Python, right? Then you do not need regex.
Use
s = "----------some text-------------"
print(s.strip("-"))
Results: some text.
See Python proof.
With regex, the same can be achieved using
re.sub(r'^-+|-+$', '', s)
See regex proof.
EXPLANATION
^ - the beginning of the string
-+ - '-' (1 or more times, matching the most amount possible)
| - OR
-+ - '-' (1 or more times, matching the most amount possible)
$ - before an optional \n, and the end of the string
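The two approaches agree on the sample string, but the "before an optional \n" behaviour of $ makes them differ when the input ends with a newline. A small sketch of that edge case, using a made-up second input:

```python
import re

s = "----------some text-------------"
assert s.strip("-") == re.sub(r'^-+|-+$', '', s) == "some text"

# Hypothetical input with a trailing newline.
with_newline = "--a--\n"

# str.strip only removes hyphens from the very ends; the final character is '\n'.
stripped = with_newline.strip("-")

# '$' also matches just before a trailing newline, so the hyphens are removed.
substituted = re.sub(r'^-+|-+$', '', with_newline)

print(repr(stripped))
print(repr(substituted))
```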

Regex to find compensations in text

I need to find mentions of compensations in emails. I am new to regex. Please see below the approach I am using.
sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr"
My Python code to find this:
import re
PATTERN = r'((\$|\£) [0-9]*)|((\$|\£)[0-9]*)'
print(re.findall(PATTERN,sample_text))
The matches I am getting are
[('', '', '$115', '$'), ('', '', '$55', '$'), ('', '', '$60', '$')]
Expected match
["$115k/yr","$55/hr","$60/hr"]
Also, the $ sign can be written as USD. How do I handle this in the same regex?
You can use
[$£]\d+[^.\s]*
[$£] Match either $ or £
\d+ Match 1+ digits
[^.\s]* Repeat 0+ times matching any char except . or a whitespace
Regex demo
import re
sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr"
PATTERN = r'[$£]\d+[^.\s]*'
print(re.findall(PATTERN,sample_text))
Output
['$115k/yr', '$55/hr', '$60/hr']
If there must be a / present, you might also use
[$£]\d+[^\s/.]*/\w+
Regex demo
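The question also asks about amounts written with USD. One possible extension of the pattern above is sketched here. It assumes that USD is written in upper case directly before the number, with at most one whitespace character in between; the sample string is made up.

```python
import re

# Hypothetical sample mixing both notations.
sample = "Rate – USD 115k/yr. or $55/hr. - USD60/hr"

# (?:...) is non-capturing, so findall still returns the whole match.
usd_pattern = r'(?:[$£]|USD\s?)\d+[^.\s]*'

amounts = re.findall(usd_pattern, sample)
print(amounts)
```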
You can have something like:
[$£]\d+[^.]+
>>> PATTERN = r'[$£]\d+[^.]+'
>>> print(re.findall(PATTERN,sample_text))
['$115k/yr', '$55/hr', '$60/hr']
[$£] matches "$" or a "£"
\d+ matches one or more digits
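[^.]+ matches one or more characters that are not a "." (including spaces, so this relies on each amount being followed by a "." or by the end of the string)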
The parentheses in your regex cause re.findall to report the contents of each parenthesized subexpression instead of the whole match. You can use non-capturing parentheses (?:...) to avoid this, but of course, your expression can be rephrased to not have any parentheses at all:
PATTERN = r'[$£]\s*\d+'
Notice also how I changed the last quantifier to a + -- your attempt would also find isolated currency symbols with no numbers after them.
To point out the hopefully obvious, \s matches whitespace and \s* matches an arbitrary run of whitespace, including none at all; and \d matches a digit.
If you want to allow some text after the extracted match, add something like (?:/\w+)? to allow for a slash and one single word token as an optional expression after the main match. (Maybe adorn that with \s* on both sides of the slash, too.)
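The difference between capturing and non-capturing parentheses in re.findall can be sketched as follows. The last pattern is only one way to read the suggestion about an optional suffix; the extra \w* (to absorb the "k" in "$115k") is an assumption of this sketch, not part of the answer.

```python
import re

sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr"

# One capturing group: findall returns only what the group matched.
with_group = re.findall(r'(\$|£)\s*[0-9]+', sample_text)

# Non-capturing group: findall returns the whole match.
without_group = re.findall(r'(?:\$|£)\s*[0-9]+', sample_text)

# No parentheses at all, as suggested above.
no_parens = re.findall(r'[$£]\s*\d+', sample_text)

# Optional "/word" suffix with optional whitespace around the slash.
with_suffix = re.findall(r'[$£]\s*\d+\w*(?:\s*/\s*\w+)?', sample_text)

print(with_group, without_group, no_parens, with_suffix, sep="\n")
```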

The correct way to identify a regular expression of the sort [variableName].add(

I'm looking for a clean way to identify occurrences of [variableName] followed by the exact string .add(.
A variable name is a string which contains one or more characters from a-z, A-Z, 0-9 and an underscore.
One more thing is that it cannot start with any of the characters from 0-9, but I don't mind ignoring this condition because there are no such cases in the text that I need to parse anyway.
I've been following several tutorials, but the farthest I got was finding all occurrences of what I've referred to above as "variableName":
import re
txt = "The _rain() in+ Spain5"
x = re.split("[^a-zA-Z0-9_]+", txt)
print(x)
What is the right way to do it?
You may use
re.findall(r'\w+(?=\.add\()', txt, flags=re.ASCII)
The regex matches:
\w+ - 1+ word chars (due to re.ASCII, it only matches [A-Za-z0-9_] chars)
(?=\.add\() - a positive lookahead that matches a location immediately followed by the .add( substring.
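A short sketch of this on a made-up input (the strings and variable names are hypothetical). The second pattern is an optional variant that also enforces the "cannot start with a digit" rule mentioned in the question.

```python
import re

# Hypothetical input; only names directly followed by ".add(" should be reported.
txt = "my_list.add(1); other.remove(2); set2.add(x); obj.address"

names = re.findall(r'\w+(?=\.add\()', txt, flags=re.ASCII)
print(names)

# Variant: the name must start at a word boundary with a letter or underscore.
strict_names = re.findall(r'\b[A-Za-z_]\w*(?=\.add\()',
                          "9lives.add(1); cats.add(2)", flags=re.ASCII)
print(strict_names)
```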

How to write a better regex in python?

I have two scenarios to match. Length should be exactly 16.
Pattern should contain A-F, a-f, 0-9 and '-' in the 1st case.
AC-DE-48-23-45-67-AB-CD
ACDE48234567ABCD
I have tried r'^([0-9A-Fa-f]{16})$|(([0-9A-Fa-f]{2}\-){7}[0-9A-Fa-f]{2})$', which works fine. I'm looking for a better expression.
You can simplify the regex by considering the string to be a group of two hex digits followed by an optional -, followed by 6 similar groups (i.e. if the first group had a -, the subsequent ones must too), followed by a group of 2 hex digits:
^[0-9A-Fa-f]{2}(-?)([0-9A-Fa-f]{2}\1){6}[0-9A-Fa-f]{2}$
Use of the re.I flag allows you to remove the a-f from the character classes:
^[0-9A-F]{2}(-?)([0-9A-F]{2}\1){6}[0-9A-F]{2}$
You can also simplify slightly further by replacing 0-9 by \d in the character classes (although personally I find 0-9 easier to read):
^[\dA-F]{2}(-?)([\dA-F]{2}\1){6}[\dA-F]{2}$
Demo on regex101
Sample python code:
import re
strs = ['AC-DE-48-23-45-67-AB-CD',
        'ACDE48234567ABCD',
        'AC-DE48-23-45-67-AB-CD',
        'ACDE48234567ABC',
        'ACDE48234567ABCDE']
for s in strs:
    print(s + (' matched' if re.match(r'^[0-9A-F]{2}(-?)([0-9A-F]{2}\1){6}[0-9A-F]{2}$', s, re.I) else ' didn\'t match'))
Output
AC-DE-48-23-45-67-AB-CD matched
ACDE48234567ABCD matched
AC-DE48-23-45-67-AB-CD didn't match
ACDE48234567ABC didn't match
ACDE48234567ABCDE didn't match
