Using the OR (|) function in regex [duplicate] - python-3.x

The source string is:
# Python 3.4.3
s = r'abc123d, hello 3.1415926, this is my book'
and here is my pattern:
pattern = r'-?[0-9]+(\\.[0-9]*)?|-?\\.[0-9]+'
re.search gives me the correct result:
m = re.search(pattern, s)
print(m) # output: <_sre.SRE_Match object; span=(3, 6), match='123'>
but re.findall just returns a list of empty strings:
L = re.findall(pattern, s)
print(L) # output: ['', '', '']
Why can't re.findall give me the expected list?
['123', '3.1415926']

There are two things to note here:
re.findall returns captured texts if the regex pattern contains capturing groups in it
the r'\\.' part in your pattern matches two consecutive chars, \ and any char other than a newline.
See findall reference:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
Note that to make re.findall return just match values, you may usually
remove redundant capturing groups (e.g. (a(b)c) -> abc)
convert all capturing groups into non-capturing (that is, replace ( with (?:) unless there are backreferences that refer to the group values in the pattern (then see below)
use re.finditer instead ([x.group() for x in re.finditer(pattern, s)])
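A minimal sketch of the options listed above, reusing the string from the question with the backslashes in the pattern already corrected. The variable names are illustrative only:

```python
import re

s = 'abc123d, hello 3.1415926, this is my book'

# Pattern with a capturing group: findall returns the group's text only.
with_group = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'
# Same pattern with the group made non-capturing.
without_group = r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+'

captured = re.findall(with_group, s)
whole = re.findall(without_group, s)
# finditer yields match objects, so group() gives the full match
# even when the pattern keeps its capturing group.
via_finditer = [m.group() for m in re.finditer(with_group, s)]

print(captured)      # ['', '.1415926']
print(whole)         # ['123', '3.1415926']
print(via_finditer)  # ['123', '3.1415926']
```

The finditer variant is the one to reach for when the group has to stay a capturing group, for example because a backreference uses it.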
In your case, findall returned all captured texts that were empty because you have \\ within r'' string literal that tried to match a literal \.
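To see what the doubled backslash does inside a raw string, here is a small sketch. The sample string is made up for illustration:

```python
import re

doubled = r'\\.'  # regex: a literal backslash followed by any character
single = r'\.'    # regex: a literal dot

sample = r'a\b 3.14'  # contains one backslash and one dot

print(re.findall(doubled, sample))  # ['\\b']
print(re.findall(single, sample))   # ['.']
```

Only the single-backslash form finds the decimal point, which is why the optional group in the original pattern never matched the fractional part.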
To match the numbers, you need to use
-?\d*\.?\d+
The regex matches:
-? - Optional minus sign
\d* - Optional digits
\.? - Optional decimal separator
\d+ - 1 or more digits.
Here is a demo:
import re
s = r'abc123d, hello 3.1415926, this is my book'
pattern = r'-?\d*\.?\d+'
L = re.findall(pattern, s)
print(L)

import re
s = r'abc123d, hello 3.1415926, this is my book'
print(re.findall(r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+', s))
You don't need to escape the backslash twice when you are using a raw string.
Output: ['123', '3.1415926']
Also, the return type will be a list of strings. If you want integers and floats instead, use map:
import re, ast
s = r'abc123d, hello 3.1415926, this is my book'
print(list(map(ast.literal_eval, re.findall(r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+', s))))
Output: [123, 3.1415926]

Just to explain why search seemed to return what you wanted while findall didn't:
search returns an SRE_Match object that holds information such as:
string : the string that was passed to the search function.
re : the REGEX object used in the search function.
groups() : tuple of the strings captured by the capturing groups inside the REGEX.
group(index): retrieves the string captured by a group, using index > 0.
group(0) : returns the string matched by the whole REGEX.
search stops when it finds the first match, builds the SRE_Match object, and returns it. Check this code:
import re
s = r'abc123d'
pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'
m = re.search(pattern, s)
print(m.string) # 'abc123d'
print(m.group(0)) # REGEX matched 123
print(m.groups()) # the only group, (\.[0-9]*), did not take part in the match, so this returns (None,)
s = ', hello 3.1415926, this is my book'
m2 = re.search(pattern, s)
print(m2.string) # ', hello 3.1415926, this is my book'
print(m2.group(0)) # REGEX matched 3.1415926
print(m2.groups()) # the group captured this part: '.1415926'
findall behaves differently because it doesn't stop at the first match; it keeps extracting until the end of the text. However, if the REGEX contains at least one capturing group, findall doesn't return the matched strings but the strings captured by the capturing groups:
import re
s = r'abc123d , hello 3.1415926, this is my book'
pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'
m = re.findall(pattern, s)
print(m) # ['', '.1415926']
The first element comes from the first match, '123', where the capturing group did not take part, so findall reports ''. The second element comes from the second match, '3.1415926', where the capturing group captured '.1415926'.
If you want findall to return the matched strings, make all capturing groups () in your REGEX non-capturing groups (?:):
import re
s = r'abc123d , hello 3.1415926, this is my book'
pattern = r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+'
m = re.findall(pattern, s)
print(m) # ['123', '3.1415926']

Related

Regular Expression to remove substring having at least 5 Uppercases

I have a Python list and I want a regular expression that removes any substring containing at least 5 uppercase letters, and another regex that removes the part of the string from '?' up to ':'.
INPUT : list = ['helLo/aPPle/BuTTeRfLY:Missed', 'bliss/ScIENCEs/brew?Dyna=skjdk:Nest', 'Self/NESTeDsd/hello/MiSSInG:Good']
Output : list = ['helLo/aPPle/:Missed', 'bliss//brew:Nest', 'Self//hello/:Good']
Here we build two regexes:
(\w*[A-Z]\w*){5,} - matches a run of word characters containing at least 5 uppercase letters
\?.*(?=:) - matches a substring that starts with ? and runs up to, but does not include, the last :
If a string matches the regex pattern, replace the matched text with '' and update the value in the list.
import re
reg = r'(\w*[A-Z]\w*){5,}|\?.*(?=:)'
input_list = ["helLo/aPPle/BuTTeRfLY:Missed","bliss/ScIENCEs/brew?Dyna=skjdk:Nest","Self/NESTeDsd/hello/MiSSInG:Good"]
for data in input_list:
    match = re.finditer(reg, data)
    if match:
        for match_word in match:
            print(match_word)
            if match_word.group() in data:
                # the match has at least 5 uppercase letters (or is a ?...: span), so replace it with ''
                final_str = data.replace(str(match_word.group()), '')
                # find index of data
                index = input_list.index(data)
                # replace the old value in the list with the new one
                input_list[index] = data = final_str
print(input_list)
Output: ['helLo/aPPle/:Missed', 'bliss//brew:Nest', 'Self//hello/:Good']
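Since every match is simply deleted, the loop can probably be condensed with re.sub. This is a sketch rather than a drop-in replacement: it reuses the pattern from the answer unchanged, and the name cleaned is only illustrative.

```python
import re

reg = r'(\w*[A-Z]\w*){5,}|\?.*(?=:)'
input_list = [
    "helLo/aPPle/BuTTeRfLY:Missed",
    "bliss/ScIENCEs/brew?Dyna=skjdk:Nest",
    "Self/NESTeDsd/hello/MiSSInG:Good",
]

# re.sub removes every non-overlapping match of the pattern in each string.
cleaned = [re.sub(reg, '', data) for data in input_list]
print(cleaned)
```

This builds a new list instead of modifying input_list in place, which avoids the index lookup inside the loop.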

find better way to find the text in string contains multi same signs

I have the text below, in which each field (its text and length) between "|" varies over time; only the number of "|" is fixed. I can retrieve the info I want ("XYZGM"), but is there a better way to do it?
"{#BATCH|ABCDEF|01|12|1||XYZGM|210401113439|online|ATGHDGV03|QGH83826|RevA|||"
The code I currently use:
text="{#BATCH|ABCDEF|01|12|1||XYZGM|210401113439|online|ATGHDGV03|QGH83826|RevA|||"
# get text from 6th position to 7th position of "|"
pos_count=0
z=0
for i in range(z,len(text)):
    pos=text.find('|', z, len(text))
    if pos>0:
        pos_count+=1
        z=pos+1
        if pos_count==6:
            x=pos+1
        if pos_count==7:
            y=pos
            break
print("X: {}, Y: {}".format(x,y))
result=text[x:y]
print(result)
and the result is : "XYZGM"
Another option could be using a pattern:
^{#(?:[^|]*\|){6}([^|]+)
^ Start of string
{# Match {#
(?:[^|]*\|){6} Repeat 6 times any char except | then match |
([^|]+) Capture group 1, match 1+ times any char except |
import re
pattern = r"^{#(?:[^|]*\|){6}([^|]+)"
s = "{#BATCH|ABCDEF|01|12|1||XYZGM|210401113439|online|ATGHDGV03|QGH83826|RevA|||"
match = re.match(pattern, s)
if match:
    print(match.group(1))
Output
XYZGM
There is no need for a regex:
text="{#BATCH|ABCDEF|01|12|1||XYZGM|210401113439|online|ATGHDGV03|QGH83826|RevA|||"
if text.startswith("{#"):
    print(text[2:].split("|")[6])
Make sure the text starts with {#, split the rest on |, and take the element at index 6 (the seventh field).
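If the record may be malformed, the same split-based idea can be wrapped in a small helper that returns None instead of raising. This is only a sketch: the function name field and its prefix and sep parameters are made up here, not part of the original answer.

```python
def field(text, index, prefix="{#", sep="|"):
    """Return the field at zero-based `index`, or None if it is missing."""
    if not text.startswith(prefix):
        return None
    parts = text[len(prefix):].split(sep)
    # Guard against records with fewer fields than expected.
    return parts[index] if 0 <= index < len(parts) else None

text = "{#BATCH|ABCDEF|01|12|1||XYZGM|210401113439|online|ATGHDGV03|QGH83826|RevA|||"
print(field(text, 6))  # XYZGM
```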

Python regex multiple matches occurrences between two strings

I have a multi-line string with my start/end magic strings ("X" and "Y"). I'm trying to capture all occurrences but I'm experiencing some issues.
Here is the code
import re
testString = '''AAAAAXBBBBBYCCCCCXDDDDDYEEEEEEXFFF
FFFYGGG
'''
pattern = re.compile(r'(.*)X(.*)Y(.*)', re.MULTILINE)
match = re.search(pattern, testString)
print(match.group(1)) # output: AAAAAXBBBBBYCCCCC
print(match.group(2)) # output: DDDDD
print(match.group(3)) # output: EEEEEEXFFF
Basically, I'm trying to capture all occurrences of the following (And I have to maintain text order):
Text before the magic start string (e.g.: AAAAA, CCCCC, EEEEEE)
Text between start/end magic strings (e.g.: BBBBB, DDDDD, FFF\nFFF)
Text after the magic end string (e.g.: CCCCC, GGG)
So I'm trying to print the following output: (what's in between brackets below is just a comment)
AAAAA (before magic string)
BBBBB (between magic strings)
CCCCC (before/after magic strings, it does not matter. Just the order matters.)
DDDDD (after magic string)
And so on. Printing them in that order would solve the issue. (Then I can pass each to other functions, ...etc.)
The code works nicely when the text is as simple as for example "AAXBBYCC", but with complicated strings I'm losing control.
Any ideas or alternative ways to do this?
You could match any character except X or Y in group 1, then match X, and do the same in group 2 followed by Y. The "after the magic string" part can be captured in a lookahead with a third group, so it is not consumed and is still available as group 1 of the next match.
A negated character class such as [^XY] also matches a newline, which is what lets it match the FFF\nFFF part.
([^XY]+)X([^XY]+)Y(?=([^XY]+))
([^XY]+)X Capture group 1, match 1+ times any char except X or Y, then match X
([^XY]+)Y Capture group 2, match 1+ times any char except X or Y, then match Y
(?= Positive lookahead, assert what is directly to the right is
([^XY]+) Capture group 3, match 1+ times any char except X or Y
) Close lookahead
import re
regex = r"([^XY]+)X([^XY]+)Y(?=([^XY]+))"
s = ("AAAAAXBBBBBYCCCCCXDDDDDYEEEEEEXFFF\n"
"FFFYGGG")
matches = re.findall(regex, s)
print(matches)
Output
[('AAAAA', 'BBBBB', 'CCCCC'), ('CCCCC', 'DDDDD', 'EEEEEE'), ('EEEEEE', 'FFF\nFFF', 'GGG')]
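Because group 3 of one tuple reappears as group 1 of the next, the tuples need a little flattening before they can be printed in the order the question asks for. One possible way, sketched with the illustrative name pieces:

```python
import re

regex = r"([^XY]+)X([^XY]+)Y(?=([^XY]+))"
s = ("AAAAAXBBBBBYCCCCCXDDDDDYEEEEEEXFFF\n"
     "FFFYGGG")

matches = re.findall(regex, s)

pieces = []
for before, between, _after in matches:
    # Skip the lookahead group here: it shows up again as the
    # "before" part of the following match.
    pieces.extend([before, between])
if matches:
    # Only the last match's trailing text has not been seen yet.
    pieces.append(matches[-1][2])

for piece in pieces:
    print(piece)
```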
Since it doesn't matter whether before or after start or end, it is as simple as:
import re
o = re.split("X|Y", testString)
print(*o, sep='\n')
Can't you just use:
pattern = re.compile(r'[^XY]+')
match = re.findall(pattern, testString)
print(match)
# ['AAAAA', 'BBBBB', 'CCCCC', 'DDDDD', 'EEEEEE', 'FFF\nFFF', 'GGG\n']

how to get values using regex in python

Here is my sample code:
import re
string = '[P-123,SHA-123]'
pattern = re.compile(r"^\[(?P<curve>).*\]$", re.MULTILINE | re.IGNORECASE)
result = pattern.search(string)
print(result)
Expected output:
P-123
If you want to match that data format:
^\[(?P<curve>[A-Z]-\d+),[A-Z]+-\d+]\Z
Explanation
^ Start of string
\[ Match [
(?P<curve> Named capture group curve
[A-Z]-\d+ Match a single uppercase char, - and 1+ digits
) Close group
,[A-Z]+-\d+ Match a comma, 1+ uppercase chars, - and 1+ digits
] Match ]
\Z End of string (or use $ if a newline after is allowed)
The value is in named capturing group curve. You could also use re.match instead of re.search as you are looking for a single group in the whole string.
Example code
import re
string = '[P-123,SHA-123]'
pattern = re.compile(r"\[(?P<curve>[A-Z]-\d+),[A-Z]+-\d+]\Z", re.MULTILINE | re.IGNORECASE)
result = pattern.match(string)
print(result.group("curve"))
Output
P-123
import re
string = '[P-123,SHA-123]'
pattern = re.compile(r"(P.\d*)", re.MULTILINE | re.IGNORECASE)
result = pattern.search(string)
print(result[1])
You can try the regex \W([A-Z]-[0-9]*), which extracts a capital letter followed by - and then digits:
import re
string = '[P-123,SHA-123]'
pattern = re.compile(r"\W([A-Z]-[0-9]*)", re.MULTILINE | re.IGNORECASE)
result = pattern.search(string).group(1)
print(result)
Output
P-123
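If the second value is wanted as well, a second named group can sit next to curve. This is a sketch only: the group name digest is invented here for illustration, and fullmatch is used so the whole string has to fit the format.

```python
import re

string = '[P-123,SHA-123]'
# "digest" is a made-up name for the second value.
pattern = re.compile(r"\[(?P<curve>[A-Z]-\d+),(?P<digest>[A-Z]+-\d+)\]")

result = pattern.fullmatch(string)
# groupdict() maps each group name to the text it captured.
values = result.groupdict() if result else {}
print(values.get("curve"))
print(values.get("digest"))
```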

Find and append characters of a String by matching with a List in Python 2.7

There is a list containing character sequences such as the ones below:
seq_list = ['C','CA','CAF','CMMVF','E','CMM','CMMF','CMMFF',...]
and a string can be defined as below:
a_str = 'CAFCMMVFCMMECMMFFCCAF'
The problem is to match the longest character sequence from seq_list in a_str, working from left to right iteratively, and to append a '|' character after each sequence found.
For example,
a_str begins with 'C', but the actual character sequence is 'CAF' because 'CAF' is a longer sequence than 'C',
so the result should look like this:
a_str = 'CAF|CMMVFCMMECMMFFCCAF' #actual sequence match
'C|AFCMMVFCMMECMMFFCCAF' #false sequence match
After the '|' has been appended, the remaining string is a_str_r = 'CMMVFCMMECMMFFCCAF'. The process then starts over, matching the longest sequence from the list, until the end of the string is reached. The final result should be:
a_str = 'CAF|CMMVF|CMM|E|CMMFF|C|CAF|'
This was one of my attempts at the problem, but I still couldn't get it right:
a_str_r = []
for each in seq_list:
    for i in a_str:
        if each in i:
            a_str_r.append(i+'|')
return a_str_r
You want to search for the leftmost longest match. That is a natural fit for a regular expression search.
import re
seq_list = ['C','CA','CAF','CMMVF','E','CMM','CMMF','CMMFF']
# Sort to put longer match strings before shorter ones
sseq_list = sorted(seq_list, key=lambda a: len(a), reverse=True)
# Turn list into a regular expression string
sseq_re = '|'.join(sseq_list)
# Compile regular expression string
rx = re.compile(sseq_re)
# Put pipe characters between the matches
print('|'.join(rx.findall('CAFCMMVFCMMECMMFFCCAF')))
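The question's expected result also ends with a trailing '|'. A possible variation on the same sorted-alternation idea is sketched below, written for Python 3; the function name segment is made up, and re.escape is added as a precaution in case a sequence ever contains regex metacharacters.

```python
import re

def segment(text, sequences):
    # Longer alternatives go first so the regex engine prefers them.
    ordered = sorted(sequences, key=len, reverse=True)
    rx = re.compile('|'.join(re.escape(seq) for seq in ordered))
    # Append a pipe after every match, including the last one.
    return ''.join(m.group() + '|' for m in rx.finditer(text))

seq_list = ['C', 'CA', 'CAF', 'CMMVF', 'E', 'CMM', 'CMMF', 'CMMFF']
result = segment('CAFCMMVFCMMECMMFFCCAF', seq_list)
print(result)  # CAF|CMMVF|CMM|E|CMMFF|C|CAF|
```

Note that characters not covered by any sequence are silently skipped by finditer, so this assumes the input consists entirely of known sequences.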
