How to write a better regex in python? - python-3.x

I have two scenarios to match . Length should be exactly 16.
Pattern should contain A-F,a-f,0-9 and '-' in 1st case.
AC-DE-48-23-45-67-AB-CD
ACDE48234567ABCD
I have tried with r'^([0-9A-Fa-f]{16})$|(([0-9A-Fa-f]{2}\-){7}[0-9A-Fa-f]{2})$'this , which is working fine . Looking for better expression .

You can simplify the regex by considering the string to be a group of two hex digits followed by an optional -, followed by 6 similar groups (i.e. if the first group had a -, the subsequent ones must too), followed by a group of 2 hex digits:
^[0-9A-Fa-f]{2}(-?)([0-9A-Fa-f]{2}\1){6}[0-9A-Fa-f]{2}$
Use of the re.I flag allows you to remove the a-f from the character classes:
^[0-9A-F]{2}(-?)([0-9A-F]{2}\1){6}[0-9A-F]{2}$
You can also simplify slightly further by replacing 0-9 by \d in the character classes (although personally I find 0-9 easier to read):
^[\dA-F]{2}(-?)([\dA-F]{2}\1){6}[\dA-F]{2}$
Demo on regex101
Sample python code:
import re
strs = ['AC-DE-48-23-45-67-AB-CD',
'ACDE48234567ABCD',
'AC-DE48-23-45-67-AB-CD',
'ACDE48234567ABC',
'ACDE48234567ABCDE']
for s in strs:
print(s + (' matched' if re.match(r'^[0-9A-F]{2}(-?)([0-9A-F]{2}\1){6}[0-9A-F]{2}$', s, re.I) else ' didn\'t match'))
Output
AC-DE-48-23-45-67-AB-CD matched
ACDE48234567ABCD matched
AC-DE48-23-45-67-AB-CD didn't match
ACDE48234567ABC didn't match
ACDE48234567ABCDE didn't match

Related

Python regex findall did not respond

I already did a research and find out about catastrophic backtracking, but I can't figure out if it is the case.
I have a small script:
import re
if __name__ == '__main__':
name = 'vuejs-complete-guide-vue-course.vue.test'
print( name )
extractedDomain = re.findall(r'([A-Za-z0-9\-\_]+){1,63}.([A-Za-z0-9\-\_]+){1,63}$', name)
print( extractedDomain )
This regex does not finalize and I don't understand why.
But if the name be:
name = 'vue-course.vue.test'
Then it works.
Someone can help me?
The issue is catastrophic backtracking due to the nested quantifiers (the quantifier + for the character class and the outer group {1,63})
Your string contains a dot, which can only be matched by the . in your pattern (as the . can match any character)
As your string contains 2 dots which it can not match, it will still try to explore all the paths.
Ending for example the string on a dot like vuejs-complete. can also become problematic as there should be at least a single char other than a dot following.
Looking at the pattern that you tried and the example string, you can repeat the character class 1-63 times, followed by repeating a group 1 or more times starting with a dot.
Note to escape the dot to match it literally.
^[A-Za-z0-9_-]{1,63}(?:\.[A-Za-z0-9_-]{1,63})+$
Explanation
^ Start ofs tring
[A-Za-z0-9_-]{1,63} Repeat the character class 1-63 times
(?: Non capture group to repeat as a whole part
\.[A-Za-z0-9_-]{1,63} Match . and repeat the character class 1-63 times
)+ Close the group and repeat 1+ times
$ End of string
Regex demo

Regex: Match between delimiters (a letter and a special character) in a string to form new sub-strings

I was working on a certain problem where I have form new sub-strings from a main string.
For e.g.
in_string=ste5ts01,s02,s03
The expected output strings are ste5ts01, ste5ts02, ste5ts03
There could be comma(,) or forward-slash (/) as the separator and in this case the delimiters are the letter s and ,
The pattern I have created so far:
pattern = r"([^\s,/]+)(?<num>\d+)([,/])(?<num>\d+)(?:\2(?<num>\d+))*(?!\S)"
The issue is, I am not able to figure out how to give the letter 's' as one of the delimiters.
Any help will be much appreciated!
You might use an approach using the PyPi regex module and named capture groups which are available in the captures:
=(?<prefix>s\w+)(?<num>s\d+)(?:,(?<num>s\d+))+
Explanation
= Match literally
(?<prefix>s\w+) Match s and 1+ word chars in group prefix
(?<num>s\d+) Capture group num match s and 1+ digits
(?:,(?<num>s\d+))+ Repeat 1+ times matching , and capture s followed by 1+ digits in group num
Example
import regex as re
pattern = r"=(?<prefix>s\w+)(?<num>s\d+)(?:,(?<num>s\d+))+"
s="in_string=ste5ts01,s02,s03"
matches = re.finditer(pattern, s)
for _, m in enumerate(matches, start=1):
print(','.join([m.group("prefix") + c for c in m.captures("num")]))
Output
ste5ts01,ste5ts02,ste5ts03

Regular expression to extract text between multiple hyphens

I have a string in the following format
----------some text-------------
How do I extract "some text" without the hyphens?
I have tried with this but it matches the whole string
(?<=-).*(?=-)
The pattern matches the whole line except the first and last hyphen due to the assertions on the left and right and the . also matches a hyphen.
You can keep using the assertions, and match any char except a hyphen using [^-]+ which is a negated character class.
(?<=-)[^-]+(?=-)
See a regex demo.
Note: if you also want to prevent matching a newline you can use (?<=-)[^-\r\n]+(?=-)
With your shown samples please try following regex.
^-+([^-]*)-+$
Online regex demo
Explanation: Matching all dashes from starting 1 or more occurrences then creating one and only capturing group which has everything matched till next - comes, then matching all dashes till last of value.
You are using Python, right? Then you do not need regex.
Use
s = "----------some text-------------"
print(s.strip("-"))
Results: some text.
See Python proof.
With regex, the same can be achieved using
re.sub(r'^-+|-+$', '', s)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
-+ '-' (1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
-+ '-' (1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string

Regex to find compensations in text

I need to find mentions of compensations in emails. I am new to regex. Please see below the approach I am using.
sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr"
My python code to find this,
impor re
PATTERN = r'((\$|\£) [0-9]*)|((\$|\£)[0-9]*)'
print(re.findall(PATTERN,sample_text))
The matches I am getting is
[('', '', '$115', '$'), ('', '', '$55', '$'), ('', '', '$60', '$')]
Expected match
["$115k/yr","$55/hr","$60/hr"]
Also the $ sign can be written as USD. How do I handle this in the same regex.
You can use
[$£]\d+[^.\s]*
[$£] Match either $ or £
\d+ Match 1+ digits
[^.\s]* Repeat 0+ times matching any char except . or a whitespace
Regex demo
import re
sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr"
PATTERN = r'[$£]\d+[^.\s]*'
print(re.findall(PATTERN,sample_text))
Output
['$115k/yr', '$55/hr', '$60/hr']
If there must be a / present, you might also use
[$£]\d+[^\s/.]*/\w+
Regex demo
You can have something like:
[$£]\d+[^.]+
>>> PATTERN = '[$£]\d+[^.]+
>>> print(re.findall(PATTERN,sample_text))
['$115k/yr', '$55/hr', '$60/hr']
[$£] matches "$" or a "£"
\d+ matches one or more digits
[^.]+ matches everything that's not a "."
The parentheses in your regex cause the engine to report the contents of each parenthesized subexpression. You can use non-grouping parentheses (?:...) to avoid this, but of course, your expression can be rephrased to not have any parentheses at all:
PATTERN = r'[$£]\s*\d+'
Notice also how I changed the last quantifier to a + -- your attempt would also find isolated currency symbols with no numbers after them.
To point out the hopefully obvious, \s matches whitespace and \s* matches an arbitrary run of whitespace, including none at all; and \d matches a digit.
If you want to allow some text after the extracted match, add something like (?:/\w+)? to allow for a slash and one single word token as an optional expression after the main match. (Maybe adorn that with \s* on both sides of the slash, too.)

The correct way to identify a regular expression of the sort [variableName].add(

I'm looking for a clean way to identify occurrences of [variableName] followed by the exact string .add(.
A variable name is a string which contains one or more characters from a-z, A-Z, 0-9 and an underscore.
One more thing is that it cannot start with any of the characters from 0-9, but I don't mind ignoring this condition because there are no such cases in the text that I need to parse anyway.
I've been following several tutorials, but the farthest I got was finding all occurrences of what I've referred to above as "variableName":
import re
txt = "The _rain() in+ Spain5"
x = re.split("[^a-zA-Z0-9_]+", txt)
print(x)
What is the right way to do it?
You may use
re.findall(r'\w+(?=\.add\()', txt, flags=re.ASCII)
The regex matches:
\w+ - 1+ word chars (due to re.ASCII, it only matches [A-Za-z0-9_] chars)
(?=\.add\() - a positive lookahead that matches a location immediately followed with .add( substring.

Resources