Regular expressions using Python3 with greedy modifier not working - python-3.x

import re
s="facebook.com/https://www.facebook.com/test/"
re.findall("facebook\.com/[^?\"\'>&\*\n\r\\\ <]+?", s)
I only want as a result "facebook.com/test/" ... but I'm getting as a result --
facebook.com/h
facebook.com/t
What's wrong with my RE? I applied the "?" at the end of the expression thinking this would stop greediness, but it's being treated as 0 or 1 expression.
If I remove the "?" I get:
facebook.com/https://www.facebook.com/test/

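The ? after + is not being treated as "0 or 1"; +? really is a non-greedy quantifier. It matches as few characters as it can, and since nothing follows it in the pattern, one character is enough, which is why you get facebook.com/h and facebook.com/t. Non-greediness also only works forwards, not backwards: once the first instance of facebook.com/ matches, it will not be discarded unless the rest of the pattern fails to match, even with a non-greedy quantifier.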
To match the last instance of facebook.com/ you can use a negative lookahead pattern instead:
facebook\.com/(?!.*facebook\.com/)[^?\"\'>&\*\n\r\\\ <]+
Demo: https://replit.com/#blhsing/WordyAgitatedCallbacks
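To make the difference concrete, here is a small sketch that runs the lazy, greedy and negative-lookahead variants against the string from the question. The names CHARS, lazy, greedy and last_only are illustrative only; the character class is the one from the question, rewritten as a raw string.

```python
import re

s = "facebook.com/https://www.facebook.com/test/"

# Character class from the question, as a raw string: anything except
# ? " ' > & * newline, carriage return, backslash, space and <
CHARS = r"""[^?"'>&*\n\r\\ <]"""

# Lazy: +? stops after the minimum of one character.
lazy = re.findall(r"facebook\.com/" + CHARS + "+?", s)

# Greedy: + consumes every allowed character, starting at the FIRST match.
greedy = re.findall(r"facebook\.com/" + CHARS + "+", s)

# Negative lookahead: skip any facebook.com/ that has another one after it.
last_only = re.findall(r"facebook\.com/(?!.*facebook\.com/)" + CHARS + "+", s)

print(lazy)       # ['facebook.com/h', 'facebook.com/t']
print(greedy)     # ['facebook.com/https://www.facebook.com/test/']
print(last_only)  # ['facebook.com/test/']
```

Note that .* inside the lookahead does not cross newlines by default, so on multi-line input this only finds the last facebook.com/ on each line.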

Related

How do I find/count the number of variables in a string using Python

Here is an example of the string:
Hi {{1}},
The status of your leave application has changed,
Leaves: {{2}}
Status: {{3}}
See you soon back at office by Management.
Expected Result:
Variables Count = 3
I tried Python's count() with if/else, but I'm looking for a more sustainable solution.
You can use regular expressions:
import re
PATTERN = re.compile(r'\{\{\d+\}\}', re.DOTALL)
def count_vars(text: str) -> int:
    return sum(1 for _ in PATTERN.finditer(text))
PATTERN defines the regular expression. It matches one or more digits (\d+) enclosed in double curly brackets (\{\{ and \}\}). Curly brackets are special characters in regular expressions, so each one is escaped with \. The re.DOTALL flag only changes what . matches, and this pattern contains no ., so the flag has no effect here and can be dropped; placeholders are found on every line either way. The finditer method iterates over all matches in the text and we simply count them.
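As a quick check, the sketch below runs the same pattern (without re.DOTALL) on the sample text from the question. count_distinct_vars is a hypothetical extra helper, added on the assumption that a placeholder used twice may need to be counted only once.

```python
import re

# Same pattern as above; no flag is needed because the pattern contains no "."
PATTERN = re.compile(r"\{\{\d+\}\}")

def count_vars(text: str) -> int:
    """Count every {{n}} placeholder, including repeats."""
    return len(PATTERN.findall(text))

def count_distinct_vars(text: str) -> int:
    """Count distinct placeholders, so {{1}} used twice counts once."""
    return len(set(PATTERN.findall(text)))

template = """Hi {{1}},
The status of your leave application has changed,
Leaves: {{2}}
Status: {{3}}
See you soon back at office by Management."""

print(count_vars(template))                          # 3
print(count_vars(template + " Bye {{1}}"))           # 4
print(count_distinct_vars(template + " Bye {{1}}"))  # 3
```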

How to match a wildcard for strings?

Please suggest a wildcard for the Firstjson list below:
Firstjson = { p10_7_8 , p10_7_2 , p10_7_3 , p10_7_4 }
I tried the wildcard p10.7.* for the Secondjson list below and it worked. But when I tried p10_7_* for the Firstjson list above, it did not work.
Secondjson = { p10.7.8 , p10.7.2 , p10.7.3 , p10.7.4 }
You are attempting to use wildcard syntax, but Groovy expects regular expression syntax for its pattern matching.
What went wrong with your attempt:
Attempt #1: p10.7.*
A regular expression of . matches any single character and .* matches 0 or more characters. This means:
p10{exactly one character of any kind here}7{zero or more characters of any kind here}
You didn't realize it, but the . character in your first attempt was acting like a single-character wildcard too. This might match with p10x7abcdefg for example. It also does match p10.7.8 though. But be careful, it also matches p10.78, because the .* expression at the end of your pattern will happily match any sequence of characters, thus any and all characters following p10.7 are accepted.
Attempt #2: p10_7_*
_ matches only a literal underscore. But _* means to match zero or more underscores. It does not mean to match characters of any kind. So p10_7_* matches things like p10_7_______. Literally:
p10_7{zero or more underscores here}
What you can do instead:
You probably want a regular expression like p10_7_\d+
This will match things like p10_7_3 or p10_7_422. It works by matching the literal text p10_7_ followed by one or more digits where a digit is 0 through 9. \d matches any digit, and + means to match one or more of the preceding thing. Literally:
p10_7_{one or more digits here}
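The three patterns can be compared side by side. The sketch below uses Python's re.fullmatch as a stand-in for the Groovy matching, on the assumption that the pattern has to cover the whole name; the names list and the full_matches helper are made up for illustration.

```python
import re

names = ["p10_7_8", "p10_7_2", "p10_7_422", "p10.7.8",
         "p10.78", "p10x7abcdefg", "p10_7____"]

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [name for name in candidates if re.fullmatch(pattern, name)]

# "." is a one-character wildcard and ".*" accepts anything: every name matches.
print(full_matches(r"p10.7.*", names))

# "_*" means zero or more underscores: only 'p10_7____' matches.
print(full_matches(r"p10_7_*", names))

# Literal prefix plus one or more digits:
# ['p10_7_8', 'p10_7_2', 'p10_7_422']
print(full_matches(r"p10_7_\d+", names))

# Escaping the dots gives the strict equivalent for the dotted names:
# ['p10.7.8']
print(full_matches(r"p10\.7\.\d+", names))
```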

pass regex group to function for substituting [duplicate]

I have a string S = '02143' and a list A = ['a','b','c','d','e']. I want to replace all those digits in 'S' with their corresponding element in list A.
For example, replace 0 with A[0], 2 with A[2] and so on. Final output should be S = 'acbed'.
I tried:
S = re.sub(r'([0-9])', A[int(r'\g<1>')], S)
However this gives an error ValueError: invalid literal for int() with base 10: '\\g<1>'. I guess it is considering backreference '\g<1>' as a string. How can I solve this especially using re.sub and capture-groups, else alternatively?
The reason re.sub(r'([0-9])',A[int(r'\g<1>')],S) does not work is that \g<1> (an unambiguous way of writing the first backreference, otherwise written as \1) only works inside the replacement string pattern. If you pass it to another function, that function just "sees" the literal string \g<1>, since the re module has no chance to evaluate it at that point. The re engine only resolves the backreference during a match, but A[int(r'\g<1>')] is evaluated before the re engine attempts to find a match.
That is why it is made possible to use callback methods inside re.sub as the replacement argument: you may pass the matched group values to any external methods for advanced manipulation.
See the re documentation:
re.sub(pattern, repl, string, count=0, flags=0)
If repl is a function, it is called for every non-overlapping
occurrence of pattern. The function takes a single match object
argument, and returns the replacement string.
Use
import re
S = '02143'
A = ['a','b','c','d','e']
print(re.sub(r'[0-9]',lambda x: A[int(x.group())],S))
See the Python demo
Note you do not need to capture the whole pattern with parentheses, you can access the whole match with x.group().
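The same idea can be written with a named callback instead of a lambda, which is sometimes easier to read and debug. The sketch also includes a regex-free alternative, since the question asked for one; digit_to_letter, with_callback and without_regex are illustrative names.

```python
import re

S = "02143"
A = ["a", "b", "c", "d", "e"]

def digit_to_letter(match):
    """Called by re.sub once per match; returns the replacement text."""
    return A[int(match.group())]

# Regex version: the callback receives a match object for each digit.
with_callback = re.sub(r"[0-9]", digit_to_letter, S)

# Regex-free version: works here because S consists only of digits.
without_regex = "".join(A[int(ch)] for ch in S)

print(with_callback)  # acbed
print(without_regex)  # acbed
```

Both versions raise IndexError if a digit is larger than the last index of A, so real input may need a bounds check.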

python Using variable in re.search source.error("bad escape %s" % escape, len(escape)) [duplicate]

I want to use input from a user as a regex pattern for a search over some text. It works, but how can I handle cases where the user enters characters that have a meaning in regex?
For example, the user wants to search for Word (s): the regex engine will take the (s) as a group. I want it to be treated as the string "(s)". I could run replace on the user input and replace ( with \( and ) with \), but then I would need to do that for every possible regex symbol.
Do you know a better way?
Use the re.escape() function for this:
4.2.3 re Module Contents
escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
A simplistic example: search for any occurrence of the provided string, optionally followed by 's', and return the match object.
import re
def simplistic_plural(word, text):
    word_or_plural = re.escape(word) + 's?'
    # re.search scans the whole text; re.match would only try the start
    return re.search(word_or_plural, text)
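Applied to the Word (s) case from the question, the effect of escaping looks roughly like this. find_literal and the sample text are hypothetical names and data used only for this sketch.

```python
import re

def find_literal(user_input, text):
    """Search text for user_input taken literally, not as a regex."""
    return re.search(re.escape(user_input), text)

text = "Please define Word (s) before use."

# Unescaped, "(s)" is a group, so the pattern looks for "Word s".
unescaped = re.search("Word (s)", text)

# Escaped, the parentheses are matched as ordinary characters.
escaped = find_literal("Word (s)", text)

print(unescaped)        # None
print(escaped.group())  # Word (s)
```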
You can use re.escape():
re.escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
>>> import re
>>> re.escape('^a.*$')
'\\^a\\.\\*\\$'
If you are using a Python version < 3.7, this will escape non-alphanumerics that are not part of regular expression syntax as well.
If you are using a Python version < 3.7 but >= 3.3, this will escape non-alphanumerics that are not part of regular expression syntax, except for specifically underscore (_).
Unfortunately, re.escape() is not suited for the replacement string. The output below is from a Python version older than 3.3, where _ was still escaped:
>>> re.sub('a', re.escape('_'), 'aa')
'\\_\\_'
A solution is to put the replacement in a lambda:
>>> re.sub('a', lambda _: '_', 'aa')
'__'
because the return value of the lambda is treated by re.sub() as a literal string.
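A slightly larger sketch of the same point, using a replacement text that contains backslashes. The Windows-style path is invented for illustration. Doubling the backslashes by hand is shown as a second option.

```python
import re

# Contains \n and \1, both of which are special in a replacement template.
replacement = r"C:\new\1"

# Option 1: a function's return value is inserted as-is.
literal_via_function = re.sub("PATH", lambda m: replacement, "dir=PATH")

# Option 2: double every backslash so the template yields single ones.
literal_via_doubling = re.sub("PATH", replacement.replace("\\", "\\\\"), "dir=PATH")

print(literal_via_function)  # dir=C:\new\1
print(literal_via_doubling)  # dir=C:\new\1

# Passing `replacement` directly as the template would not work here:
# \n would become a newline, and \1 refers to a group the pattern lacks.
```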
Usually, escaping the string that you feed into a regex makes the regex treat those characters literally. Remember that you usually type strings into your computer and the computer inserts the specific characters. When you see \n in your editor, it's not really a new line until a parser decides it is; it's two characters. In an ordinary string literal, Python's parser turns those two characters into a single newline character, which print then displays as a line break. If you write r"\n" instead, Python keeps exactly the two characters you typed (as far as I understand).
To complicate things further, there is another syntax/grammar going on with regexes. The regex parser interprets the strings it receives differently than Python's print would. I believe this is why we are recommended to pass raw strings like r"(\n+)", so that the regex receives what you actually typed. However, the regex will receive a parenthesis and won't match it as a literal parenthesis unless you tell it to explicitly, using the regex's own syntax rules. For that you need r"(\fun \( x : nat \) :)": here the first parens won't be matched, since without backslashes they form a capture group, but the second pair will be matched as literal parens.
Thus we usually call re.escape(text) to escape the things we want to be interpreted literally, i.e. things that the regex parser would otherwise treat specially, e.g. parens (spaces get escaped too). For example, code I have in my app:
# re.escape backslash-escapes regex metacharacters so the text matches as an arbitrary literal string. I think it is here to separate the literal text from the regex we insert in the next line.
__ppt = re.escape(_ppt)  # so that e.g. parentheses are matched literally instead of being treated as a group
e.g. see these strings:
_ppt
Out[4]: '(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
__ppt
Out[5]: '\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'
print(rf'{_ppt=}')
_ppt='(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
print(rf'{__ppt=}')
__ppt='\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'
The double backslashes are only a display artifact: both the Out[...] echo and the {name=} form of an f-string show the repr of the string, and repr writes every backslash as \\. The string itself contains a single backslash before each escaped character, and that is what the regex receives.
To match a literal backslash, the regex itself needs two backslashes: r"\\" as a raw string, or "\\\\" (4 backslashes) as an ordinary string literal.
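The display question can be checked directly by comparing print, repr and len on a small escaped string. This is a minimal sketch; the input "(s)" and the variable names are chosen only for illustration.

```python
import re

escaped = re.escape("(s)")

print(escaped)        # the actual characters, one backslash each: \(s\)
print(repr(escaped))  # repr doubles each backslash: '\\(s\\)'
print(len(escaped))   # 5 characters: \ ( s \ )

# A regex that matches one literal backslash: two backslashes in the
# pattern, written as a raw string (or "\\\\" as an ordinary literal).
backslash_pattern = r"\\"
found = re.findall(backslash_pattern, escaped)
print(len(found))     # 2, one for each backslash in the escaped string
```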

About python re

I used an online regex tool, and it shows the right results.
But when I use Python's re package, the results are different.
pattern = re.compile(u'(?<=slot).*?(?=(}]}}]|$))')
result = pattern.findall(data)
print(result)
I want to get the string that begins with 'slot' and ends with '}]}}]'.
What your regular expression
u'(?<=slot).*?(?=(}]}}]|$))'
does is match every sequence .*?, in a non-greedy way, such that the sequence is preceded by slot and followed either by the symbols }]}}] or by the end of the text $.
In the following example string, the match for this RegExp is the text between slot and }]}}], that is " Spider's Lullabyes ":
"Crazy Train slot Spider's Lullabyes }]}}]"
To achieve what you want you can use the following pattern:
pattern = re.compile(u'(?<=^slot).*(?=}]}}]$)')
The caret (^) ensures that slot is searched for at the very beginning of the text, and the lookbehind (?<=...) keeps slot out of the result. After that, .* matches everything up to the final }]}}], and the lookahead (?=...) keeps }]}}] out of the result.
For the following text, the match is again " Spider's Lullabyes ":
"slot Spider's Lullabyes }]}}]"

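One likely reason the Python result differs from an online tester (an assumption, since the question does not show the actual output) is that re.findall returns the contents of the capturing groups when the pattern has any, rather than the whole match. The sketch below compares the original pattern with a variant that uses a non-capturing group; the variable names are illustrative.

```python
import re

data = "Crazy Train slot Spider's Lullabyes }]}}]"

# Original pattern: the (...) inside the lookahead is a capturing group,
# so findall reports what that group captured, not the text before it.
with_group = re.findall(r"(?<=slot).*?(?=(}]}}]|$))", data)

# Same pattern with a non-capturing group (?:...): findall now returns
# the whole match.
without_group = re.findall(r"(?<=slot).*?(?=(?:}]}}]|$))", data)

# With the original pattern, the whole match is still available as group(0).
whole_match = re.search(r"(?<=slot).*?(?=(}]}}]|$))", data).group(0)

print(with_group)     # ['}]}}]']
print(without_group)  # [" Spider's Lullabyes "]
print(whole_match)
```

Online testers typically highlight the whole match, which would explain why the tool appeared to give the right result.
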
Resources