using re to grab all instances of values between parenthesis

I'm using Python's re module to grab all instances of values between the
opening and closing parentheses.
For example, (A)way(Of)testing(This)
would produce a list:
['A', 'Of', 'This']
I took a look at 1 and 2.
This is my code:
import re
sentence = "(A)way(Of)testing(This)is running (it)"
res = re.compile(r".*\(([a-zA-Z0-9|^)])\).*", re.S)
for s in re.findall(res, sentence):
    print(s)
What I get from this is:
it
Then I realized I was capturing only one character, so I used
res = re.compile(r".*\(([a-zA-Z0-9-|^)]*)\).*", re.S)
But I still get it
I've always struggled with regex. My understanding of my search string
is as follows:
.* (any character)
\( (escapes the opening parenthesis)
( (starts the grouping)
[a-zA-Z0-9-|^)]* (set of characters allowed: a-z, A-Z, 0-9, - EXCEPT the ")" )
) (closes the grouping)
\) (escapes the closing parenthesis)
.* (anything else)
So in theory it should go through sentence and once it encounters a (,
it should copy the contents up until it encounters a ), at which point it should
store that into one group. It then proceeds through the sentence.
I even used the following:
res = re.compile(r".*\(([a-z|A-Z|0-9|-|^)]*)\).*", re.S)
But it still returns an it.
Any help greatly appreciated,
Thanks

You can shorten the pattern: drop the .* on both sides and the |^) inside the character class, and keep only the character class between the escaped parentheses.
The leading .* is greedy, so it consumes as much of the string as it can and backtracks only far enough to match the last parenthesized part. The whole string is then a single match, so findall returns only one captured group.
In your explanation of [a-zA-Z0-9-|^)]*, the |^) does not exclude the ) character. Inside a character class these are literals, so it simply also matches a |, ^ or ) char.
If you want to use a negated character class, the ^ has to be at the start of the character class, like [^...]. That is not necessary here, as you can specify what you want to match instead of what you don't want to match.
\(([a-zA-Z0-9-]*)\)
The pattern matches:
\( Match (
( Capture group 1
[a-zA-Z0-9-]* Optionally repeat matching one of the listed ranges a-zA-Z0-9 or -
) Close group 1
\) Match )
You don't need the re.S as there is no dot in the pattern that should match a newline.
import re
sentence = "(A)way(Of)testing(This)is running (it)"
res = re.compile(r"\(([a-zA-Z0-9-]*)\)")
print(re.findall(res, sentence))
Output
['A', 'Of', 'This', 'it']
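To see the effect of the greedy .* directly, here is a minimal sketch that runs the same character class with and without the surrounding .*. The last pattern uses a negated character class, [^()]*, which is an alternative not used in the answer above and assumes the parentheses are never nested. The variable names are only illustrative.

```python
import re

sentence = "(A)way(Of)testing(This)is running (it)"

# With .* on both sides the whole string is one single match, and the greedy
# leading .* leaves only the last parenthesized part for the capture group.
greedy = re.findall(r".*\(([a-zA-Z0-9-]*)\).*", sentence, re.S)

# Without the .* every parenthesized part is a separate match.
listed = re.findall(r"\(([a-zA-Z0-9-]*)\)", sentence)

# Alternative (assumption: no nested parentheses): accept anything that is
# not a parenthesis between the delimiters.
negated = re.findall(r"\(([^()]*)\)", sentence)

print(greedy)   # ['it']
print(listed)   # ['A', 'Of', 'This', 'it']
print(negated)  # ['A', 'Of', 'This', 'it']
```

The negated variant is the more permissive of the two: it would also return values containing spaces or punctuation, which may or may not be what you want.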

Related

Python - how to find string and remove string plus next x characters

I have the following string:
mystr = '(string_to_delete_20221012_11-36) keep this (string_to_delete_20221016_22-22) keep this (string_to_delete_20221017_20-55) keep this'
I wish to delete all the entries (string_to_deletexxxxxxxxxxxxxxx) (including the trailing space)
I sort of need pseudo code as follows:
If you find a string (string_to_delete then replace that string and the timestamp, closing parenthesis and trailing space with null e.g. delete the string (string_to_delete_20221012_11-36)
I would use a list comprehension but given that not all strings are contained inside parenthesis I cannot see what I could use to create the list via a string.split().
Is this something that needs regular expressions?

It seemed like a good place to use a regex:
import re
pattern = r'\(string_to_delete_.*?\)\s*'
mystr = '(string_to_delete_20221012_11-36) keep this (string_to_delete_20221016_22-22) keep this (string_to_delete_20221017_20-55) keep this'
for match in re.findall(pattern, mystr):
    mystr = mystr.replace(match, '', 1)  # replace the first occurrence of the matched string with an empty string
print(mystr)
results with:
>> keep this keep this keep this
Brief regex breakdown of \(string_to_delete_.*?\)\s*
\( match a left parenthesis (the escape is needed)
string_to_delete_ match the literal string string_to_delete_
.*? lazily match zero or more characters, as few as possible
\) match the closing parenthesis
\s* match zero or more whitespace characters after that
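The findall-then-replace loop can probably be collapsed into a single re.sub call, which replaces every match of the pattern in one pass. This is a sketch of that alternative using the same pattern and sample string; the name cleaned is only illustrative.

```python
import re

mystr = ('(string_to_delete_20221012_11-36) keep this '
         '(string_to_delete_20221016_22-22) keep this '
         '(string_to_delete_20221017_20-55) keep this')

# Replace every "(string_to_delete_...)" chunk, together with the
# whitespace that follows it, with an empty string.
cleaned = re.sub(r'\(string_to_delete_.*?\)\s*', '', mystr)

print(cleaned)  # keep this keep this keep this
```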

Python regex findall did not respond

I already did some research and found out about catastrophic backtracking, but I can't figure out whether that is the case here.
I have a small script:
import re
if __name__ == '__main__':
    name = 'vuejs-complete-guide-vue-course.vue.test'
    print( name )
    extractedDomain = re.findall(r'([A-Za-z0-9\-\_]+){1,63}.([A-Za-z0-9\-\_]+){1,63}$', name)
    print( extractedDomain )
This regex never finishes and I don't understand why.
But if the name is:
name = 'vue-course.vue.test'
then it works.
Can someone help me?

The issue is catastrophic backtracking caused by the nested quantifiers: the + on the character class sits inside a group that is itself repeated with {1,63}.
The character classes cannot match a dot, so a dot in your string can only be matched by the unescaped . in your pattern (which matches any character).
As your string contains 2 dots and the pattern has only one place that can match a dot, a match from the start of the string cannot succeed, but the engine will still try to explore all the ways of splitting the leading characters between the inner + and the outer {1,63}.
Ending the string on a dot, for example vuejs-complete. can also become problematic, as there should be at least a single char other than a dot following.
Looking at the pattern that you tried and the example string, you can repeat the character class 1-63 times, followed by a group that starts with a dot and is repeated 1 or more times.
Note that the dot has to be escaped to match it literally.
^[A-Za-z0-9_-]{1,63}(?:\.[A-Za-z0-9_-]{1,63})+$
Explanation
^ Start of string
[A-Za-z0-9_-]{1,63} Repeat the character class 1-63 times
(?: Non capture group to repeat as a whole part
\.[A-Za-z0-9_-]{1,63} Match . and repeat the character class 1-63 times
)+ Close the group and repeat 1+ times
$ End of string
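A minimal sketch of how the pattern above could be used from Python. The names DOMAIN, names and results are made up for this example, and the two extra test strings are assumptions added to show the failing cases.

```python
import re

# Each label is 1-63 allowed characters; at least one ".label" must follow.
DOMAIN = re.compile(r'^[A-Za-z0-9_-]{1,63}(?:\.[A-Za-z0-9_-]{1,63})+$')

names = [
    'vuejs-complete-guide-vue-course.vue.test',  # the string that hung before
    'vue-course.vue.test',
    'vuejs-complete.',                           # trailing dot: no label after it
    'nodots',                                    # no dot at all
]

# There are no nested quantifiers over the same characters, so every
# check returns quickly, whether it matches or not.
results = {name: bool(DOMAIN.match(name)) for name in names}

for name, ok in results.items():
    print(name, ok)
```

The first two names match and the last two do not. Note that this validates the whole string instead of extracting the last two parts, which is a different result shape than the findall call in the question.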

Regex to find compensations in text

I need to find mentions of compensations in emails. I am new to regex. Please see below the approach I am using.
sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr"
My python code to find this,
import re
PATTERN = r'((\$|\£) [0-9]*)|((\$|\£)[0-9]*)'
print(re.findall(PATTERN,sample_text))
The matches I am getting are
[('', '', '$115', '$'), ('', '', '$55', '$'), ('', '', '$60', '$')]
Expected match
["$115k/yr","$55/hr","$60/hr"]
Also, the $ sign can be written as USD. How do I handle this in the same regex?

You can use
[$£]\d+[^.\s]*
[$£] Match either $ or £
\d+ Match 1+ digits
[^.\s]* Repeat 0+ times matching any char except . or a whitespace
import re
sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr"
PATTERN = r'[$£]\d+[^.\s]*'
print(re.findall(PATTERN,sample_text))
Output
['$115k/yr', '$55/hr', '$60/hr']
If there must be a / present, you might also use
[$£]\d+[^\s/.]*/\w+

You can have something like:
[$£]\d+[^.]+
>>> PATTERN = r'[$£]\d+[^.]+'
>>> print(re.findall(PATTERN,sample_text))
['$115k/yr', '$55/hr', '$60/hr']
[$£] matches "$" or a "£"
\d+ matches one or more digits
[^.]+ matches everything that's not a "."

The parentheses in your regex are capturing groups, so findall reports the contents of each parenthesized subexpression instead of the whole match. You can use non-capturing parentheses (?:...) to avoid this, but of course, your expression can be rephrased to not have any parentheses at all:
PATTERN = r'[$£]\s*\d+'
Notice also how I changed the last quantifier to a + -- your attempt would also find isolated currency symbols with no numbers after them.
To point out the hopefully obvious, \s matches whitespace and \s* matches an arbitrary run of whitespace, including none at all; and \d matches a digit.
If you want to allow some text after the extracted match, add something like (?:/\w+)? to allow for a slash and one single word token as an optional expression after the main match. (Maybe adorn that with \s* on both sides of the slash, too.)
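Putting these suggestions together, one possible sketch also covers the USD spelling asked about in the question. Treating USD as a plain alternative to the currency symbols, and allowing a unit suffix such as k directly after the digits via \w*, are assumptions of this sketch. The trailing "previously USD 50/hr" in the sample text is invented here for illustration and is not part of the original email text.

```python
import re

# Original sample plus a made-up "USD" mention for illustration.
sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr, previously USD 50/hr"

# (?:[$£]|USD)      currency symbol or the literal text USD (non-capturing)
# \s*\d+\w*         optional whitespace, the digits, an optional suffix like "k"
# (?:\s*/\s*\w+)?   optionally a slash and one word, e.g. "/yr" or "/hr"
PATTERN = r'(?:[$£]|USD)\s*\d+\w*(?:\s*/\s*\w+)?'

# Only non-capturing groups are used, so findall returns whole matches.
matches = re.findall(PATTERN, sample_text)
print(matches)  # ['$115k/yr', '$55/hr', '$60/hr', 'USD 50/hr']
```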

Check that both `lookbehind` conditions are satisfied in `RegEx`

I'm trying to check if a username is preceded either by RT # or by RT# by using lookbehind mechanism paired with conditionals, as explained in this tutorial.
The regex and the example are shown in Example 1:
Example 1
import re
text = 'RT #u1, #u2, u3, #u4, rt #u5:, #u3.#u1^, rt#u3'
mt_regex = r'(?i)(?<!RT )&(?<!RT)#(\w+)'
mt_pat = re.compile(mt_regex)
re.findall(mt_pat, text)
which outputs [] (empty list), while the desired output should be:
['u2', 'u4', 'u3', 'u1']
What am I missing? Thanks in advance.

The other answer is obviously correct and rightfully accepted, but here is what I think might work for you without negative lookbehinds. The benefit is that, by using \s*, you are not limited to a single space character:
(?i)(?:^|[,.])\s*#(\w+)
(?i) - Turn off case sensitivity. Note you can also use re.IGNORECASE.
(?:^|[,.]) - Non-capture group to match start of string or literal comma/dot.
\s* - Zero or more whitespace characters.
# - Literally match "#".
(\w+) - Capture group matching one or more word characters; for ASCII input \w is short for [A-Za-z0-9_].
This prints ['u2', 'u4', 'u3', 'u1']
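For completeness, a short sketch of running this pattern against the text from the question. It passes re.IGNORECASE in place of the inline (?i), as mentioned above; the variable names are only illustrative.

```python
import re

text = 'RT #u1, #u2, u3, #u4, rt #u5:, #u3.#u1^, rt#u3'

# A "#name" only counts when it follows the start of the string, a comma
# or a dot, with optional whitespace in between.
pattern = re.compile(r'(?:^|[,.])\s*#(\w+)', re.IGNORECASE)

usernames = pattern.findall(text)
print(usernames)  # ['u2', 'u4', 'u3', 'u1']
```

This approach whitelists what may precede the #, so a username after any other separator (for example a semicolon) would not be found unless that character is added to the [,.] class.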

If we break down your regex:
r"(?i)(?<!RT )&(?<!RT)#(\w+)"
(?i) match the remainder of the pattern, case insensitive match
(?<!RT ) negative lookbehind
asserts that 'RT ' does not match
& matches the character '&' literally
(?<!RT) negative lookbehind
asserts that 'RT' does not match
# matches the character '#' literally
(\w+) Capturing Group
matches [a-zA-Z0-9_] between one and unlimited times
You have the & character that is preventing your regex matching:
import re
text = "RT #u1, #u2, u3, #u4, rt #u5:, #u3.#u1^, rt#u3"
mt_regex = r"(?i)(?<!RT )(?<!RT)#(\w+)"
mt_pat = re.compile(mt_regex)
print(re.findall(mt_pat, text))
# ['u2', 'u4', 'u3', 'u1']

Match multiple instances of pattern in parenthesis

For the following string, is it possible for regex to return the comma delimited matches within the square brackets?
"root.path.definition[id=1234,test=blah,scope=A,B,C,D]"
Expected output would be:
["id=1234", "test=blah", "scope=A,B,C,D"]
The closest I have gotten so far is the following:
(?<=\[)(.*?)(?=\])
But this will only return one match for everything within the square brackets.

One option is to use the re module and first get the part between the square brackets using a capturing group and a negated character class.
\[([^][]*)]
That part will match:
\[ Match [ char
([^][]*) Capture group 1, match 0+ times any char other than [ and ]
] Match ] char
Then get the separate parts by matching the key value pairs separated by a comma.
\w+=.*?(?=,\w+=|$)
That part will match:
\w+ Match 1+ word characters
= Match literally
.*?(?=,\w+=|$) Match as few chars as possible until you encounter either a comma followed by 1+ word characters and =, or the end of the string
For example
import re
s = "root.path.definition[id=1234,test=blah,scope=A,B,C,D]"
m = re.search(r"\[([^][]*)]", s)
if m:
    print(re.findall(r"\w+=.*?(?=,\w+=|$)", m.group(1)))
Output
['id=1234', 'test=blah', 'scope=A,B,C,D']
If you can make use of the regex module, this might also be an option matching the keys and values using lookarounds to assert the [ and ]
(?<=\[[^][]*)\w+=.*?(?=,\w+=|])(?=[^][]*])
For example
import regex
s = "root.path.definition[id=1234,test=blah,scope=A,B,C,D]"
print(regex.findall(r"(?<=\[[^][]*)\w+=.*?(?=,\w+=|])(?=[^][]*])", s))
Output
['id=1234', 'test=blah', 'scope=A,B,C,D']
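As a further variation with the standard re module, the second step could split instead of match: split the bracket contents on every comma that is directly followed by a key and an = sign. This is a different technique from the findall approach above, sketched under the assumption that keys consist of word characters only; the name pairs is illustrative.

```python
import re

s = "root.path.definition[id=1234,test=blah,scope=A,B,C,D]"

# Step 1: capture everything between the square brackets.
m = re.search(r"\[([^][]*)]", s)

# Step 2: split only on commas that start a new key=value pair. The
# lookahead (?=\w+=) leaves the commas inside "A,B,C,D" untouched.
pairs = re.split(r",(?=\w+=)", m.group(1)) if m else []

print(pairs)  # ['id=1234', 'test=blah', 'scope=A,B,C,D']
```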
