Python regex findall did not respond - python-3.x

I already did some research and found out about catastrophic backtracking, but I can't figure out whether that is the case here.
I have a small script:
import re
if __name__ == '__main__':
    name = 'vuejs-complete-guide-vue-course.vue.test'
    print( name )
    extractedDomain = re.findall(r'([A-Za-z0-9\-\_]+){1,63}.([A-Za-z0-9\-\_]+){1,63}$', name)
    print( extractedDomain )
This regex never finishes and I don't understand why.
But if the name is:
name = 'vue-course.vue.test'
Then it works.
Can someone help me?

The issue is catastrophic backtracking caused by the nested quantifiers: the + on the character class sits inside a group that is itself repeated with {1,63}.
Your string contains dots, and a dot can only be matched by the unescaped . in your pattern (which matches any character), because the character classes do not include it.
Because the string contains 2 dots and the pattern allows only one, the attempt that starts at the beginning of the string cannot succeed, but the engine still explores all the paths before giving up on it.
A string ending in a dot, such as vuejs-complete., can also be problematic, as at least one character other than a dot must follow it.
Looking at the pattern that you tried and the example string, you can repeat the character class 1-63 times, followed by a group that starts with a dot and is repeated 1 or more times.
Note that the dot must be escaped to match it literally.
^[A-Za-z0-9_-]{1,63}(?:\.[A-Za-z0-9_-]{1,63})+$
Explanation
^ Start of string
[A-Za-z0-9_-]{1,63} Repeat the character class 1-63 times
(?: Non capture group to repeat as a whole part
\.[A-Za-z0-9_-]{1,63} Match . and repeat the character class 1-63 times
)+ Close the group and repeat 1+ times
$ End of string
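As a quick check, the pattern above can be tried against the names from the question. This is only a minimal sketch: the helper name is_valid_name and the two extra test strings are assumptions made for illustration.

```python
import re

# Each label is matched by a single quantified character class, so there is
# no nested quantifier for the engine to backtrack through.
NAME_RE = re.compile(r'^[A-Za-z0-9_-]{1,63}(?:\.[A-Za-z0-9_-]{1,63})+$')

def is_valid_name(name):
    """Return True if name consists of 2+ dot-separated labels of 1-63 chars."""
    return NAME_RE.match(name) is not None

for name in ('vuejs-complete-guide-vue-course.vue.test',
             'vue-course.vue.test',
             'vuejs-complete.',   # trailing dot: nothing follows it
             'nodots'):           # no dot at all
    print(name, is_valid_name(name))
```

Once a name validates, name.split('.') gives the individual labels if they are needed separately.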

Related

Regex: Match between delimiters (a letter and a special character) in a string to form new sub-strings

I was working on a problem where I have to form new sub-strings from a main string.
For example:
in_string=ste5ts01,s02,s03
The expected output strings are ste5ts01, ste5ts02, ste5ts03
The separator could be a comma (,) or a forward slash (/), and in this case the delimiters are the letter s and ,
The pattern I have created so far:
pattern = r"([^\s,/]+)(?<num>\d+)([,/])(?<num>\d+)(?:\2(?<num>\d+))*(?!\S)"
The issue is that I am not able to figure out how to use the letter 's' as one of the delimiters.
Any help will be much appreciated!
You could use the PyPI regex module, which lets you reuse a named capture group and read all of its matches through captures():
=(?<prefix>s\w+)(?<num>s\d+)(?:,(?<num>s\d+))+
Explanation
= Match literally
(?<prefix>s\w+) Match s and 1+ word chars in group prefix
(?<num>s\d+) Capture group num match s and 1+ digits
(?:,(?<num>s\d+))+ Repeat 1+ times matching , and capture s followed by 1+ digits in group num
Example
import regex as re
pattern = r"=(?<prefix>s\w+)(?<num>s\d+)(?:,(?<num>s\d+))+"
s="in_string=ste5ts01,s02,s03"
matches = re.finditer(pattern, s)
for _, m in enumerate(matches, start=1):
    print(','.join([m.group("prefix") + c for c in m.captures("num")]))
Output
ste5ts01,ste5ts02,ste5ts03
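If installing the third-party regex module is not an option, a similar result can be sketched with the standard re module by capturing the whole separated list in one group and splitting it afterwards. The function name expand_substrings and the group name nums are assumptions made for illustration; the , and / separators come from the question.

```python
import re

# prefix: 's' followed by word chars; nums: 's<digits>' items separated by , or /
PATTERN = re.compile(r"=(?P<prefix>s\w+)(?P<nums>s\d+(?:[,/]s\d+)+)")

def expand_substrings(in_string):
    """Return prefix + each 's<digits>' item, or [] if nothing matches."""
    m = PATTERN.search(in_string)
    if m is None:
        return []
    # The standard re module keeps only the last repetition of a group,
    # so the list is captured as one string and split here instead.
    return [m.group("prefix") + item
            for item in re.split(r"[,/]", m.group("nums"))]

print(expand_substrings("in_string=ste5ts01,s02,s03"))
```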

using re to grab all instances of values between parenthesis

I'm using python's re module to grab all instances of values between the
opening and closing parenthesis.
i.e. (A)way(Of)testing(This)
would produce a list:
['A', 'Of', 'This']
I took a look at 1 and 2.
This is my code:
import re
sentence = "(A)way(Of)testing(This)is running (it)"
res = re.compile(r".*\(([a-zA-Z0-9|^)])\).*", re.S)
for s in re.findall(res, sentence):
    print(s)
What I get from this is:
it
Then I realized I was only capturing just one character, so I used
res = re.compile(r".*\(([a-zA-Z0-9-|^)]*)\).*", re.S)
But I still get it
I've always struggled with regex. My understanding of my search string
is as follows:
.* (any character)
\( (escapes the opening parenthesis)
( (starts the grouping)
[a-zA-Z0-9-|^)]* (set of characters allowed : a-z, A-Z, 0-9, - *EXCEPT the ")" )
) (closes the grouping)
\) (escapes the closing parenthesis)
.* (anything else)
So in theory it should go through sentence and once it encounters a (,
it should copy the contents up until it encounters a ), at which point it should
store that into one group. It then proceeds through the sentence.
I even used the following:
res = re.compile(r".*\(([a-z|A-Z|0-9|-|^)]*)\).*", re.S)
But it still returns an it.
Any help greatly appreciated,
Thanks
You can shorten the pattern by dropping the .* and the ^ and ), and use only the character class.
The leading .* is greedy and matches any character, so a single match covers the whole string; as the part between parentheses occurs only once in the pattern, you capture only 1 group.
In your explanation of [a-zA-Z0-9-|^)]*, the |^) inside the character class does not rule out the ). It simply matches a literal |, ^ or ) character as well.
If you want to use a negated character class, the ^ must be at the very start of the class, like [^ but that is not necessary here, as you can specify what you want to match instead of what you don't want to match.
\(([a-zA-Z0-9-]*)\)
The pattern matches:
\( Match (
( Capture group 1
[a-zA-Z0-9-]* Optionally repeat matching one of the listed ranges a-zA-Z0-9 or -
) Close group 1
\) Match )
You don't need the re.S as there is no dot in the pattern that should match a newline.
import re
sentence = "(A)way(Of)testing(This)is running (it)"
res = re.compile(r"\(([a-zA-Z0-9-]*)\)")
print(re.findall(res, sentence))
Output
['A', 'Of', 'This', 'it']
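For comparison, the negated character class mentioned above could look like the sketch below, which captures anything that is not a parenthesis instead of listing the allowed characters. The variable name any_content and the second test string are assumptions made for illustration.

```python
import re

sentence = "(A)way(Of)testing(This)is running (it)"

# [^()]* matches any run of characters that are not parentheses, so each
# match stops at the first closing parenthesis after the opening one.
any_content = re.compile(r"\(([^()]*)\)")

print(any_content.findall(sentence))
print(any_content.findall("(a b)(c_d)()"))
```

This variant also accepts spaces, underscores and empty parentheses, so the explicit character class remains the stricter choice.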

Python - Replacing repeated consonants with other values in a string

I want to write a function that, given a string, returns a new string in which occurrences of a sequence of the same consonant with 2 or more elements are replaced with the same sequence except the first consonant - which should be replaced with the character 'm'.
The explanation was probably very confusing, so here are some examples:
"hello world" should return "hemlo world"
"Hannibal" should return "Hamnibal"
"error" should return "emror"
"although" should return "although" (returns the same string because none of the characters are repeated in a sequence)
"bbb" should return "mbb"
I looked into using regex but wasn't able to achieve what I wanted. Any help is appreciated.
Thank you in advance!
Regex is probably the best tool for the job here. The 'correct' expression is
import re
test = """
hello world
Hannibal
error
although
bbb
"""
output = re.sub(r'(.)\1+', lambda g:f'm{g.group(0)[1:]}', test)
# '''
# hemlo world
# Hamnibal
# emror
# although
# mbb
# '''
The only really complicated part of this is the lambda that we give as an argument. re.sub() can accept a function as its replacement argument - it gets passed a match object (which we call .group(0) on to get the full match, i.e. all of the repeated letters) and should return a string with which to replace whatever was matched. Here, we use it to output the character 'm' followed by the second character onwards of the match, in an f-string.
The regex itself is pretty straightforward as well. Any character (.), then the same character (\1) again one or more times (+). If you wanted just word characters (letters, digits and underscore, i.e. not to replace duplicate whitespace characters), you could use (\w) instead of (.)
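Since the question asks about consonants only, the same idea can be narrowed with a character class. The sketch below is one possible variant, not the answer's own code: it assumes that "consonant" means any ASCII letter other than a, e, i, o, u (so y counts as a consonant), and the function name is made up for illustration.

```python
import re

# Assumption: consonants are the ASCII letters except a, e, i, o, u.
# re.IGNORECASE also makes the backreference \1 case-insensitive.
REPEATED_CONSONANT = re.compile(r'([b-df-hj-np-tv-z])\1+', re.IGNORECASE)

def replace_repeated_consonants(text):
    """Replace the first letter of each run of 2+ identical consonants with 'm'."""
    return REPEATED_CONSONANT.sub(lambda m: 'm' + m.group(0)[1:], text)

for word in ('hello world', 'Hannibal', 'error', 'although', 'bbb', 'book'):
    print(word, '->', replace_repeated_consonants(word))
```

With this class, a repeated vowel such as the oo in "book" is left alone, whereas (.)\1+ would replace it.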

Regular expression to capture n lines of text between two regex patterns

Need help with a regular expression to grab exactly n lines of text between two regex matches. For example, I need 17 lines of text and I used the example below, which does not work.
Please see sample code below:
import re
match_string = re.search(r'^.*MDC_IDC_RAW_MARKER((.*?\r?\n){17})Stored_EGM_Trigger.*\n', t, re.DOTALL).group()
value1 = re.search(r'value="(\d+)"', match_string).group(1)
value2 = re.search(r'value="(\d+\.\d+)"', match_string).group(1)
print(match_string)
print(value1)
print(value2)
I put a sample string here, because SO does not allow long code strings:
https://hastebin.com/aqowusijuc.xml
You are getting false positives because you are using the re.DOTALL flag, which allows the . character to match newline characters. That is, when matching ((.*?\r?\n){17}), the . can consume many extra newline characters just to satisfy the required count of 17. The \r? is also superfluous. Starting the regex with ^.* is superfluous too, because it forces the search to start from the beginning and then tells the engine to skip as many characters as necessary to find MDC_IDC_RAW_MARKER, which re.search already does on its own. So, a simplified and correct regex would be:
match_string = re.search(r'MDC_IDC_RAW_MARKER.*\n((.*\n){17})Stored_EGM_Trigger.*\n', t).group()
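The same technique can be tried on a small made-up input. In the sketch below the function name lines_between, the START and END markers and the sample text are all assumptions made for illustration; it also assumes that the end marker begins its own line, as in the regex above.

```python
import re

def lines_between(text, start, end, n):
    """Return the n lines between the start-marker line and the end-marker
    line, or None if the markers are not exactly n lines apart."""
    # Without re.DOTALL, each .*\n consumes exactly one line.
    pattern = re.compile(
        rf'{re.escape(start)}.*\n((?:.*\n){{{n}}}){re.escape(end)}'
    )
    m = pattern.search(text)
    return m.group(1).splitlines() if m else None

sample = "header\nSTART a\nl1\nl2\nl3\nEND b\ntail\n"
print(lines_between(sample, 'START', 'END', 3))
print(lines_between(sample, 'START', 'END', 2))
```

Because the line count is exact, asking for 2 lines when 3 lie between the markers gives no match rather than a partial one.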

How to write a better regex in python?

I have two scenarios to match. The length should be exactly 16.
The pattern should contain A-F, a-f, 0-9, and '-' in the 1st case.
AC-DE-48-23-45-67-AB-CD
ACDE48234567ABCD
I have tried r'^([0-9A-Fa-f]{16})$|(([0-9A-Fa-f]{2}\-){7}[0-9A-Fa-f]{2})$', which is working fine. I am looking for a better expression.
You can simplify the regex by considering the string to be a group of two hex digits followed by an optional -, followed by 6 similar groups (i.e. if the first group had a -, the subsequent ones must too), followed by a group of 2 hex digits:
^[0-9A-Fa-f]{2}(-?)([0-9A-Fa-f]{2}\1){6}[0-9A-Fa-f]{2}$
Use of the re.I flag allows you to remove the a-f from the character classes:
^[0-9A-F]{2}(-?)([0-9A-F]{2}\1){6}[0-9A-F]{2}$
You can also simplify slightly further by replacing 0-9 by \d in the character classes (although personally I find 0-9 easier to read):
^[\dA-F]{2}(-?)([\dA-F]{2}\1){6}[\dA-F]{2}$
Sample python code:
import re
strs = ['AC-DE-48-23-45-67-AB-CD',
'ACDE48234567ABCD',
'AC-DE48-23-45-67-AB-CD',
'ACDE48234567ABC',
'ACDE48234567ABCDE']
for s in strs:
    print(s + (' matched' if re.match(r'^[0-9A-F]{2}(-?)([0-9A-F]{2}\1){6}[0-9A-F]{2}$', s, re.I) else ' didn\'t match'))
Output
AC-DE-48-23-45-67-AB-CD matched
ACDE48234567ABCD matched
AC-DE48-23-45-67-AB-CD didn't match
ACDE48234567ABC didn't match
ACDE48234567ABCDE didn't match

Resources