How to search for A (any lengths any character) * in Python 3 - python-3.x

I have a string with letters and occasionally *. I am trying to write regular expressions in Python 3 to get anything in the string that starts with A and ends with *, things in between can be any letter but not *. Help is much appreciated.

Try using the regex pattern \bA\w*\*(?!\S):
inp = "An example is an A* and also an Abc*"
matches = re.findall(r'\bA\w*\*(?!\S)', inp)
print(matches) # ['A*', 'Abc*']

There is a simpler pattern one can use with python regex.
r"(A.+?\*)"
For more on "Greedy vs Non-Greedy in Python": https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy
An example:
import re
string_to_test = "This is Atest1* but not this* but Atest2* done *a asdf"
for x in re.findall(r"(A.+?\*)", string_to_test):
print(x)
This will produce:
Atest1*
Atest2*
Note that depending on if the string "A*" is valid or not you might need the pattern r"(A.*?\*)"

Related

How do i find/count number of variable in string using Python

Here is example of string
Hi {{1}},
The status of your leave application has changed,
Leaves: {{2}}
Status: {{3}}
See you soon back at office by Management.
Expected Result:
Variables Count = 3
i tried python count() using if/else, but i'm looking for sustainable solution.
You can use regular expressions:
import re
PATTERN = re.compile(r'\{\{\d+\}\}', re.DOTALL)
def count_vars(text: str) -> int:
return sum(1 for _ in PATTERN.finditer(text))
PATTERN defines the regular expression. The regular expression matches all strings that contain at least one digit (\d+) within a pair of curly brackets (\{\{\}\}). Curly brackets are special characters in regular expressions, so we must add \. re.DOTALL makes sure that we don't skip over new lines (\n). The finditer method iterates over all matches in the text and we simply count them.

Replace matched susbtring using re sub

Is there a way to replace the matched pattern substring using a single re.sub() line?.
What I would like to avoid is using a string replace method to the current re.sub() output.
Input = "/J&L/LK/Tac1_1/shareloc.pdf"
Current output using re.sub("[^0-9_]", "", input): "1_1"
Desired output in a single re.sub use: "1.1"
According to the documentation, re.sub is defined as
re.sub(pattern, repl, string, count=0, flags=0)
If repl is a function, it is called for every non-overlapping occurrence of pattern.
This said, if you pass a lambda function, you can remain the code in one line. Furthermore, remember that the matched characters can be accessed easier to an individual group by: x[0].
I removed _ from the regex to reach the desired output.
txt = "/J&L/LK/Tac1_1/shareloc.pdf"
x = re.sub("[^0-9]", lambda x: '.' if x[0] is '_' else '', txt)
print(x)
There is no way to use a string replacement pattern in Python re.sub to replace with two possible strings, as there is no conditional replacement construct support in Python re.sub. So, using a callable as the replacement argument or use other work-arounds.
It looks like you only expect one match of <DIGITS>_<DIGITS> in the input string. In this case, you can use
import re
text = "/J&L/LK/Tac1_1/shareloc.pdf"
print( re.sub(r'^.*?(\d+)_(\d+).*', r'\1.\2', text, flags=re.S) )
# => 1.1
See the Python demo. See the regex demo. Details:
^ - start of string
.*? - zero or more chars as few as possible
(\d+) - Group 1: one or more digits
_ - a _ char
(\d+) - Group 2: one or more digits
.* - zero or more chars as many as possible.

python Using variable in re.search source.error("bad escape %s" % escape, len(escape)) [duplicate]

I want to use input from a user as a regex pattern for a search over some text. It works, but how I can handle cases where user puts characters that have meaning in regex?
For example, the user wants to search for Word (s): regex engine will take the (s) as a group. I want it to treat it like a string "(s)" . I can run replace on user input and replace the ( with \( and the ) with \) but the problem is I will need to do replace for every possible regex symbol.
Do you know some better way ?
Use the re.escape() function for this:
4.2.3 re Module Contents
escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
A simplistic example, search any occurence of the provided string optionally followed by 's', and return the match object.
def simplistic_plural(word, text):
word_or_plural = re.escape(word) + 's?'
return re.match(word_or_plural, text)
You can use re.escape():
re.escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
>>> import re
>>> re.escape('^a.*$')
'\\^a\\.\\*\\$'
If you are using a Python version < 3.7, this will escape non-alphanumerics that are not part of regular expression syntax as well.
If you are using a Python version < 3.7 but >= 3.3, this will escape non-alphanumerics that are not part of regular expression syntax, except for specifically underscore (_).
Unfortunately, re.escape() is not suited for the replacement string:
>>> re.sub('a', re.escape('_'), 'aa')
'\\_\\_'
A solution is to put the replacement in a lambda:
>>> re.sub('a', lambda _: '_', 'aa')
'__'
because the return value of the lambda is treated by re.sub() as a literal string.
Usually escaping the string that you feed into a regex is such that the regex considers those characters literally. Remember usually you type strings into your compuer and the computer insert the specific characters. When you see in your editor \n it's not really a new line until the parser decides it is. It's two characters. Once you pass it through python's print will display it and thus parse it as a new a line but in the text you see in the editor it's likely just the char for backslash followed by n. If you do \r"\n" then python will always interpret it as the raw thing you typed in (as far as I understand). To complicate things further there is another syntax/grammar going on with regexes. The regex parser will interpret the strings it's receives differently than python's print would. I believe this is why we are recommended to pass raw strings like r"(\n+) -- so that the regex receives what you actually typed. However, the regex will receive a parenthesis and won't match it as a literal parenthesis unless you tell it to explicitly using the regex's own syntax rules. For that you need r"(\fun \( x : nat \) :)" here the first parens won't be matched since it's a capture group due to lack of backslashes but the second one will be matched as literal parens.
Thus we usually do re.escape(regex) to escape things we want to be interpreted literally i.e. things that would be usually ignored by the regex paraser e.g. parens, spaces etc. will be escaped. e.g. code I have in my app:
# escapes non-alphanumeric to help match arbitrary literal string, I think the reason this is here is to help differentiate the things escaped from the regex we are inserting in the next line and the literal things we wanted escaped.
__ppt = re.escape(_ppt) # used for e.g. parenthesis ( are not interpreted as was to group this but literally
e.g. see these strings:
_ppt
Out[4]: '(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
__ppt
Out[5]: '\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'
print(rf'{_ppt=}')
_ppt='(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
print(rf'{__ppt=}')
__ppt='\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'
the double backslashes I believe are there so that the regex receives a literal backslash.
btw, I am surprised it printed double backslashes instead of a single one. If anyone can comment on that it would be appreciated. I'm also curious how to match literal backslashes now in the regex. I assume it's 4 backslashes but I honestly expected only 2 would have been needed due to the raw string r construct.

return only chars from the string in python

I am looking to extract only chars from the given string. but my query is doing exactly opposite
s= "A man, a plan, a canal: Panama"
newS = ''.join(re.findall("[^a-zA-Z]*", s))
print(newS) // my o/p: , , :
expected o/p string is:
"A man a plan a canal Panama"
Your regular expression is inverting the match - that's what the caret symbol (^) does inside square brackets (negated character class). You first need to remove that.
Next, you should be matching a sequence of one or more characters (+) rather than zero or more characters (*) -- using * will match the empty string, which you don't want in this case.
Finally your join should join with a space to get the intended output, rather than an empty string -- which won't retain the spaces between the words.
newS = ' '.join(re.findall(r'[a-zA-Z]+', s))
Though not essential in this case, its advised to use raw strings for regular expressions (r). More in this post.
Full working code:
import re
s = 'A man, a plan, a canal: Panama'
newS = ' '.join(re.findall(r'[a-zA-Z]+', s))
print(newS)

Regular expression to capture n lines of text between two regex patterns

Need help with a regular expression to grab exactly n lines of text between two regex matches. For example, I need 17 lines of text and I used the example below, which does not work. I
Please see sample code below:
import re
match_string = re.search(r'^.*MDC_IDC_RAW_MARKER((.*?\r?\n){17})Stored_EGM_Trigger.*\n'), t, re.DOTALL).group()
value1 = re.search(r'value="(\d+)"', match_string).group(1)
value2 = re.search(r'value="(\d+\.\d+)"', match_string).group(1)
print(match_string)
print(value1)
print(value2)
I added a sample string to here, because SO does not allow long code string:
https://hastebin.com/aqowusijuc.xml
You are getting false positives because you are using the re.DOTALL flag, which allows the . character to match newline characters. That is, when you are matching ((.*?\r?\n){17}), the . could eat up many extra newline characters just to satisfy your required count of 17. You also now realize that the \r is superfluous. Also, starting your regex with ^.*? is superfluous because you are forcing the search to start from the beginning but then saying that the search engine should skip as many characters as necessary to find MDC_IDC_RAW_MARKER. So, a simplified and correct regex would be:
match_string = re.search(r'MDC_IDC_RAW_MARKER.*\n((.*\n){17})Stored_EGM_Trigger.*\n', t)
Regex Demo

Resources