Match multiple instances of pattern in parenthesis - python-3.x

For the following string, is it possible for regex to return the comma delimited matches within the square brackets?
"root.path.definition[id=1234,test=blah,scope=A,B,C,D]"
Expected output would be:
["id=1234", "test=blah", "scope=A,B,C,D"]
The closest I have gotten so far is the following:
(?<=\[)(.*?)(?=\])
But this will only return one match for everything within the square brackets.

One option is to use the re module and first get the part between the square brackets using a capturing group and a negated character class.
\[([^][]*)]
That part will match:
\[ Match [ char
([^][]*) Capture group 1, match 0+ times any char other than [ and ]
] Match ] char
Then get the separate parts by matching the key value pairs separated by a comma.
\w+=.*?(?=,\w+=|$)
That part will match:
\w+ Match 1+ word characters
= Match literally
.*?(?=,\w+=|$) Match as few chars as possible until you encounter either a comma followed by 1+ word characters and =, or the end of the string
For example
import re
s = "root.path.definition[id=1234,test=blah,scope=A,B,C,D]"
m = re.search(r"\[([^][]*)]", s)
if m:
    print(re.findall(r"\w+=.*?(?=,\w+=|$)", m.group(1)))
Output
['id=1234', 'test=blah', 'scope=A,B,C,D']
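If the pairs are needed as keys and values rather than as whole strings, the same two-step approach can be sketched with two capture groups. The helper name parse_bracket_pairs is hypothetical and not part of the original answer; it assumes every key is unique.

```python
import re

def parse_bracket_pairs(s):
    """Hypothetical helper: return the key=value pairs inside the first [...] as a dict."""
    m = re.search(r"\[([^][]*)]", s)
    if not m:
        return {}
    # With two capture groups, findall returns (key, value) tuples.
    pairs = re.findall(r"(\w+)=(.*?)(?=,\w+=|$)", m.group(1))
    return dict(pairs)

parsed = parse_bracket_pairs("root.path.definition[id=1234,test=blah,scope=A,B,C,D]")
print(parsed)
```

A string without square brackets yields an empty dict in this sketch.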
If you can use the third-party regex module (PyPI), which supports variable-length lookbehind, another option is to match the keys and values directly, using lookarounds to assert the surrounding [ and ]
(?<=\[[^][]*)\w+=.*?(?=,\w+=|])(?=[^][]*])
For example
import regex
s = "root.path.definition[id=1234,test=blah,scope=A,B,C,D]"
print(regex.findall(r"(?<=\[[^][]*)\w+=.*?(?=,\w+=|])(?=[^][]*])", s))
Output
['id=1234', 'test=blah', 'scope=A,B,C,D']
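The single-pattern variant relies on a lookbehind of variable length, which is why it needs the regex module. As a small check, the sketch below tries to compile the same pattern with the standard re module; the variable name stdlib_accepts is only illustrative.

```python
import re

pattern = r"(?<=\[[^][]*)\w+=.*?(?=,\w+=|])(?=[^][]*])"

try:
    re.compile(pattern)
    stdlib_accepts = True
except re.error as exc:
    # The standard re module only allows fixed-width lookbehind.
    stdlib_accepts = False
    print("re rejected the pattern:", exc)
```

With the standard library only, the two-step approach shown earlier is the one to use.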

Related

How do I find/count the number of variables in a string using Python

Here is an example of the string:
Hi {{1}},
The status of your leave application has changed,
Leaves: {{2}}
Status: {{3}}
See you soon back at office by Management.
Expected Result:
Variables Count = 3
I tried Python's count() with if/else, but I'm looking for a more sustainable solution.
You can use regular expressions:
import re
PATTERN = re.compile(r'\{\{\d+\}\}', re.DOTALL)
def count_vars(text: str) -> int:
    return sum(1 for _ in PATTERN.finditer(text))
PATTERN holds the compiled regular expression. It matches every substring that consists of one or more digits (\d+) enclosed in double curly brackets (\{\{ and \}\}). Curly brackets are special characters in regular expressions, so we escape them with \. The re.DOTALL flag only changes how . behaves (it lets . match newlines as well); this pattern contains no ., so the flag has no effect here and can be omitted. The finditer method iterates over all non-overlapping matches in the text, and we simply count them.
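A minimal usage sketch with the template from the question follows. The capture group around \d+ and the count_distinct_vars helper are additions made here for illustration, on the assumption that a repeated placeholder such as {{1}} might need to be counted only once.

```python
import re

# Capture the number so that findall returns just the digits.
PATTERN = re.compile(r"\{\{(\d+)\}\}")

def count_vars(text):
    """Count every placeholder occurrence."""
    return len(PATTERN.findall(text))

def count_distinct_vars(text):
    """Hypothetical variant: count each placeholder number once."""
    return len(set(PATTERN.findall(text)))

template = """Hi {{1}},
The status of your leave application has changed,
Leaves: {{2}}
Status: {{3}}
See you soon back at office by Management."""

print("Variables Count =", count_vars(template))
```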

Python regex findall did not respond

I have already done some research and found out about catastrophic backtracking, but I can't figure out whether that is the case here.
I have a small script:
import re
if __name__ == '__main__':
    name = 'vuejs-complete-guide-vue-course.vue.test'
    print(name)
    extractedDomain = re.findall(r'([A-Za-z0-9\-\_]+){1,63}.([A-Za-z0-9\-\_]+){1,63}$', name)
    print(extractedDomain)
This regex never finishes and I don't understand why.
But if the name is:
name = 'vue-course.vue.test'
Then it works.
Can someone help me?
The issue is catastrophic backtracking caused by the nested quantifiers: the + on the character class combined with the {1,63} on the group around it.
The character classes cannot match a dot, so a dot in your string can only be matched by the unescaped . in your pattern (which matches any character).
Your string contains 2 dots but the pattern has only one ., so a match cannot succeed from the start of the string, and the engine still tries every way of splitting the text between the inner + and the outer {1,63} before giving up.
For the same reason, a string that ends in a dot, such as vuejs-complete., can also become problematic, as at least one character other than a dot has to follow.
Looking at the pattern that you tried and the example string, you can repeat the character class 1-63 times, followed by a group that starts with a dot and is repeated 1 or more times.
Note that the dot has to be escaped to match it literally.
^[A-Za-z0-9_-]{1,63}(?:\.[A-Za-z0-9_-]{1,63})+$
Explanation
^ Start of string
[A-Za-z0-9_-]{1,63} Repeat the character class 1-63 times
(?: Non capture group to repeat as a whole part
\.[A-Za-z0-9_-]{1,63} Match . and repeat the character class 1-63 times
)+ Close the group and repeat 1+ times
$ End of string
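As a quick check, the pattern above can be run against both names from the question and a trailing-dot case. The name DOMAIN_RE and the extra test strings are assumptions made for this sketch.

```python
import re

# Each label is a single bounded character class, so there are no nested quantifiers.
DOMAIN_RE = re.compile(r"^[A-Za-z0-9_-]{1,63}(?:\.[A-Za-z0-9_-]{1,63})+$")

names = [
    "vuejs-complete-guide-vue-course.vue.test",
    "vue-course.vue.test",
    "vuejs-complete.",  # trailing dot: no label after it
    "nodots",           # at least one dot-separated part is required
]

results = {name: bool(DOMAIN_RE.match(name)) for name in names}
for name, ok in results.items():
    print(name, "->", ok)
```

Both names from the question match, and the two malformed strings are rejected without any noticeable delay.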

Regex: Match between delimiters (a letter and a special character) in a string to form new sub-strings

I was working on a problem where I have to form new sub-strings from a main string.
For e.g.
in_string=ste5ts01,s02,s03
The expected output strings are ste5ts01, ste5ts02, ste5ts03
The separator could be a comma (,) or a forward slash (/); in this case the delimiters are the letter s and the comma.
The pattern I have created so far:
pattern = r"([^\s,/]+)(?<num>\d+)([,/])(?<num>\d+)(?:\2(?<num>\d+))*(?!\S)"
The issue is, I am not able to figure out how to give the letter 's' as one of the delimiters.
Any help will be much appreciated!
You could use the PyPI regex module, which lets you reuse a named capture group and read all of its repeated values through captures():
=(?<prefix>s\w+)(?<num>s\d+)(?:,(?<num>s\d+))+
Explanation
= Match literally
(?<prefix>s\w+) Match s and 1+ word chars in group prefix
(?<num>s\d+) Capture group num match s and 1+ digits
(?:,(?<num>s\d+))+ Repeat 1+ times matching , and capture s followed by 1+ digits in group num
Example
import regex as re
pattern = r"=(?<prefix>s\w+)(?<num>s\d+)(?:,(?<num>s\d+))+"
s="in_string=ste5ts01,s02,s03"
matches = re.finditer(pattern, s)
for m in matches:
    print(','.join([m.group("prefix") + c for c in m.captures("num")]))
Output
ste5ts01,ste5ts02,ste5ts03
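If the regex module is not available, a similar result can be sketched with the standard re module by capturing the whole list in one group and splitting it afterwards. The function name expand is hypothetical, and unlike the pattern in the question this sketch accepts a mix of , and / separators.

```python
import re

def expand(s):
    """Hypothetical helper: prepend the shared prefix to every s<digits> item."""
    # Group 1: the prefix (backtracks until an s<digits> item can follow).
    # Group 2: the first item plus one or more separator-delimited items.
    m = re.search(r"=(s\w+)(s\d+(?:[,/]s\d+)+)", s)
    if not m:
        return []
    prefix, items = m.group(1), m.group(2)
    return [prefix + item for item in re.split(r"[,/]", items)]

expanded = expand("in_string=ste5ts01,s02,s03")
print(",".join(expanded))
```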

Replace matched substring using re sub

Is there a way to replace the matched pattern substring using a single re.sub() line?
I would like to avoid applying a string replace method to the current re.sub() output.
Input = "/J&L/LK/Tac1_1/shareloc.pdf"
Current output using re.sub("[^0-9_]", "", input): "1_1"
Desired output in a single re.sub use: "1.1"
According to the documentation, re.sub is defined as
re.sub(pattern, repl, string, count=0, flags=0)
If repl is a function, it is called for every non-overlapping occurrence of pattern.
That said, if you pass a lambda function, you can keep the code on one line. Also remember that the whole matched text is available from the match object as x[0] (equivalent to x.group(0)).
I removed _ from the regex to reach the desired output.
import re
txt = "/J&L/LK/Tac1_1/shareloc.pdf"
# Compare with ==; `is` tests identity and is not reliable for strings.
x = re.sub("[^0-9]", lambda x: '.' if x[0] == '_' else '', txt)
print(x)
There is no way to use a string replacement pattern in Python re.sub to replace with one of two possible strings, as Python re.sub has no conditional replacement construct. So, either use a callable as the replacement argument or use another workaround.
It looks like you only expect one match of <DIGITS>_<DIGITS> in the input string. In this case, you can use
import re
text = "/J&L/LK/Tac1_1/shareloc.pdf"
print( re.sub(r'^.*?(\d+)_(\d+).*', r'\1.\2', text, flags=re.S) )
# => 1.1
Details:
^ - start of string
.*? - zero or more chars as few as possible
(\d+) - Group 1: one or more digits
_ - a _ char
(\d+) - Group 2: one or more digits
.* - zero or more chars as many as possible.
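One property of this approach worth checking is what happens when the input has no <DIGITS>_<DIGITS> part: re.sub then returns the input unchanged instead of an empty string. A minimal sketch follows; the function name digits_with_dot and the second input are made up for illustration.

```python
import re

def digits_with_dot(text):
    """Hypothetical wrapper around the single re.sub call shown above."""
    return re.sub(r'^.*?(\d+)_(\d+).*', r'\1.\2', text, flags=re.S)

matched = digits_with_dot("/J&L/LK/Tac1_1/shareloc.pdf")
unmatched = digits_with_dot("/no/match/here.pdf")
print(matched)
print(unmatched)
```

If an unchanged result is not acceptable, the caller would need to test for a match first.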

How to write a better regex in python?

I have two scenarios to match. The length should be exactly 16 (not counting the '-' separators).
The pattern should allow A-F, a-f, 0-9 and, in the first case, '-'.
AC-DE-48-23-45-67-AB-CD
ACDE48234567ABCD
I have tried r'^([0-9A-Fa-f]{16})$|(([0-9A-Fa-f]{2}\-){7}[0-9A-Fa-f]{2})$', which is working fine. I am looking for a better expression.
You can simplify the regex by considering the string to be a group of two hex digits followed by an optional -, followed by 6 similar groups (i.e. if the first group had a -, the subsequent ones must too), followed by a group of 2 hex digits:
^[0-9A-Fa-f]{2}(-?)([0-9A-Fa-f]{2}\1){6}[0-9A-Fa-f]{2}$
Use of the re.I flag allows you to remove the a-f from the character classes:
^[0-9A-F]{2}(-?)([0-9A-F]{2}\1){6}[0-9A-F]{2}$
You can also simplify slightly further by replacing 0-9 by \d in the character classes (although personally I find 0-9 easier to read):
^[\dA-F]{2}(-?)([\dA-F]{2}\1){6}[\dA-F]{2}$
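One caveat about the \d variant: for str patterns in Python 3, \d is Unicode-aware and also matches digits outside 0-9 unless re.ASCII is set, so it is not strictly equivalent to 0-9. The sketch below illustrates this with an Arabic-Indic digit chosen as an example.

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

# \d on a str pattern matches any Unicode decimal digit by default.
unicode_match = bool(re.fullmatch(r"\d", arabic_three))
# re.ASCII restricts \d to 0-9.
ascii_match = bool(re.fullmatch(r"\d", arabic_three, re.ASCII))
# An explicit range never matches non-ASCII digits.
range_match = bool(re.fullmatch(r"[0-9]", arabic_three))

print(unicode_match, ascii_match, range_match)
```

For strict hex validation, keeping 0-9 (or adding re.ASCII) avoids accepting such characters.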
Sample python code:
import re
strs = ['AC-DE-48-23-45-67-AB-CD',
        'ACDE48234567ABCD',
        'AC-DE48-23-45-67-AB-CD',
        'ACDE48234567ABC',
        'ACDE48234567ABCDE']
for s in strs:
    print(s + (' matched' if re.match(r'^[0-9A-F]{2}(-?)([0-9A-F]{2}\1){6}[0-9A-F]{2}$', s, re.I) else ' didn\'t match'))
Output
AC-DE-48-23-45-67-AB-CD matched
ACDE48234567ABCD matched
AC-DE48-23-45-67-AB-CD didn't match
ACDE48234567ABC didn't match
ACDE48234567ABCDE didn't match
