Regex: possibly two patterns found in one text - python-3.x

I have a specific pattern but the text to be processed can change randomly.
The text I am currently trying to filter with a regex (re.findall, Python 3.9.13) is as follows:
"ABC9,10.11A5:6,7:8.10BC1"
I am using the following regex expression: r"([ABC]{1,})(([0-9]{1,}[,.:]{0,}){1,})"
The current result is:
[("ABC", "9,10.11", "11"), ("A", "5:6,7:8.10", "10"), ("BC", "1", "1")]
What I am looking for as result should be:
[("ABC", "9,10.11"), ("A", "5:6,7:8.10"), ("BC", "1")]
I don't understand why the last number in the second part is always repeated again.
Please help.

I presume you are using re.findall, since that returns the contents of all capture groups in its output. In your case the last number repetition is due to the capture group around [0-9]{1,}[,.:]{0,}. Making that a non-capturing group resolves the issue:
([ABC]{1,})((?:[0-9]{1,}[,.:]{0,}){1,})
In python:
re.findall(r"([ABC]{1,})((?:[0-9]{1,}[,.:]{0,}){1,})", s)
# [('ABC', '9,10.11'), ('A', '5:6,7:8.10'), ('BC', '1')]
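As a side-by-side check, here is a minimal sketch that runs both patterns on the string from the question. It uses the shorthand quantifiers + and *, which are equivalent to {1,} and {0,}; the variable names are only illustrative.

```python
import re

s = "ABC9,10.11A5:6,7:8.10BC1"

# Inner group is capturing: findall also returns its last repetition.
capturing = re.findall(r"([ABC]+)(([0-9]+[,.:]*)+)", s)

# Inner group is non-capturing (?:...): only the two outer groups are returned.
non_capturing = re.findall(r"([ABC]+)((?:[0-9]+[,.:]*)+)", s)

print(capturing)
print(non_capturing)
```

A repeated capture group only keeps the text of its final repetition, which is why the extra element is always the last number.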

Related

Text analysis: Match long list of keywords with character strings in a text variable in dataframe

I am trying to match a long list of keywords (all German municipalities) saved as a dataframe (df_2) with another variable from a second dataframe (df_1) that contains character strings (long descriptive text).
As output, I am trying to extract the names of municipalities mentioned in df_1 that are given in df_2, if possible, including a count how many times the name was mentioned in df_1.
So far I have tried grep(), but the pattern built from df_2 exceeds my memory (it contains close to 13,000 values); sapply(), which runs into the same memory problem (using only part of df_2 takes up more than 6 GB); str_extract(); and the quanteda package, using df_2 as a "dictionary". But I am getting nowhere.
This is a replicated sample with the part of my code that is working (although I only get the whole string as output, not just the name of the municipality, so it isn't very useful yet).
Is there a way to write a function that tells df_1 to add a new variable and copy the value of df_2[1] whenever there is a match in the character string for df_2[i] in df_1$description[i]?
I think this may be the only option that doesn't overload the memory.
### sample for testing
name <- c("A", "B", "C", "D")
date <- c("1999-03-02", "1999-04-02", "1999-05-02", "1999-06-02")
event <- c("occurrence1", "occurrence2", "occurrence3", "occurrence4")
description <- c("this is a sample text and München that is also a sample text Berlin",
                 "this is a sample text and Detmold that is also a sample text Berlin and Berlin",
                 "this is a sample text and Darmstadt that is also a sample text Magdeburg and Halle",
                 "this is a sample text and München that is also a sample text Berlin")
df_1 <- cbind(name, date, event, description)
df_1 <- as.data.frame(df_1)
locations <- c("München", "Berlin", "Darmstadt", "Magdeburg", "Detmold", "Halle")
df_2 <- as.data.frame(locations)
##sample code for generating output
pattern_sample <- paste(df_2$locations, collapse="|")
result_sample <- grep(pattern_sample, df_1$description, value=TRUE)
result_sample

Python regex capture group containing nested non-capturing group

I'm trying to capture string-parts Abbb, Abb, Ab, A, C###, C#, C, etc. into one group and whatever follows (anything that's not b, #) into a separate group.
I'm using this regex:
sample = "Cbb-7" # for testing purposes
re.search(r"([A-G](?:#*|b*))(.*?)", sample).groups()
which results in:
('C', '')
while I'm expecting:
('Cbb', '-7').
When I make the follow-up capture group greedy, (.*):
re.search(r"([A-G](?:#*|b*))(.*)", sample).groups()
I get the result:
('C', 'bb-7'). (I still would need: ('Cbb','-7'))
Moving the optionality of b and # out of the non-capturing group seems to help:
re.search(r"([A-G](?:#+|b+)?)(.*)", sample).groups()
results in:
('Cbb', '-7') Still wondering why!
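A likely explanation, sketched below: Python's re tries the alternatives of a group from left to right and keeps the first one that lets the overall match succeed. In (?:#*|b*) the first alternative, #*, always succeeds by matching the empty string, so b* is never tried. With #+ and b+ each alternative has to consume at least one character, so #+ fails on "Cbb-7" and the engine moves on to b+. The variable names are only illustrative.

```python
import re

sample = "Cbb-7"

# '#*' matches the empty string first, so 'b*' is never attempted.
star_inside = re.search(r"([A-G](?:#*|b*))(.*)", sample).groups()

# '#+' cannot match at 'b', so the engine falls through to 'b+'.
plus_inside = re.search(r"([A-G](?:#+|b+)?)(.*)", sample).groups()

# A lazy (.*?) at the end of the pattern is satisfied by matching nothing.
lazy_tail = re.search(r"([A-G](?:#*|b*))(.*?)", sample).groups()

print(star_inside)  # ('C', 'bb-7')
print(plus_inside)  # ('Cbb', '-7')
print(lazy_tail)    # ('C', '')
```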

How do I print exact sentence by filtering using regular expression in Python

I'm new to regular expression and got stuck with the code below.
import re
s = "5. Consider the task in Figure 8.11, which are balanced in fig 99.2"
output = re.findall((r'[A-Z][a-z]*'), s)[0]
output2 = re.findall(r'\b[^A-Z\s\d]+\b', s)
mixing = " ".join(str(x) for x in output2)
finalmix = output+" " + mixing
print(finalmix)
Here I'm trying to print "Consider the task in Figure 8.11, which are balanced in fig 99.2" from the given string s as a single sentence. I joined the two outputs at the end to build the sentence. But this gets confusing, because "Figure 8.11" and "fig 99.2" are not printed: I have no regex for them, since I cannot work out which pattern to use and how to combine it with the rest.
I'm probably using the wrong approach to extract the sentence from the string s. I'd be glad if anyone could help me fix the code or suggest an alternative approach, as this code looks absurd.
This is the output I get:
Consider the task in . which are balanced in .
To capture all bulleted items, I would use:
import re
s = "5. Consider the task in Figure 8.11, which are balanced in fig 99.2"
items = re.findall(r'\d+\.(?!\d)(.*?)(?=\d+\.(?!\d)|$)', s, flags=re.DOTALL)
print(items)
This prints:
['Consider the task in Figure 8.11, which are balanced in fig 99.2']
Here is an explanation of the regex pattern:
\d+\. match a bulleted number
(?!\d) which is NOT followed by another number
(.*?) match and capture all content, across newlines, until hitting
(?=\d+\.(?!\d)|$) another number bullet OR the end of the input
@TimBiegeleisen's answer works, but it is somewhat verbose, because re.findall requires repeating the bullet pattern both at the start and in the lookahead at the end.
For the purpose of finding strings between repeating patterns (bullet points in this case) it may be simpler to use re.split instead. Slice the resulting list to discard the first item since we don't need what comes before the first bullet point:
re.split(r'\d+\.(?!\d)\s*', s)[1:]
This returns:
['Consider the task in Figure 8.11, which are balanced in fig 99.2']
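To see how the re.split approach behaves with more than one bullet, here is a small sketch. The second bullet ("6. Compare it with Figure 8.12") is made up for illustration and is not part of the original question; the strip() call only removes the whitespace left before the next bullet.

```python
import re

# Hypothetical input: the first bullet is from the question, the second is invented.
text = ("5. Consider the task in Figure 8.11, which are balanced in fig 99.2 "
        "6. Compare it with Figure 8.12")

# Split on "number + period" that is not followed by another digit,
# then drop the empty piece that precedes the first bullet.
items = [part.strip() for part in re.split(r'\d+\.(?!\d)\s*', text)[1:]]

print(items)
```

Decimal numbers such as 8.11 are left alone because the negative lookahead (?!\d) rejects a period that is followed by a digit.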

Python: Print entire line of string match and not cut off after the period

See bottom for the solution I came up with.
Hopefully this is an easy question for you. I am trying to match a string against a list and print only the matching entry. I was successful using re, but the match is cut off after the period: the span reported by re is (0, 10), while the full entry spans (0, 14). So I would like to learn how to make it print the entire entry, or learn another way to match a variable string against a list and print exactly that entry. My original attempts printed everything containing TESTPR, three entries in total; of the two I do not want, one has a 1 in front and the other has an additional R at the end. Here is my current match code:
# OLD, see below
for element in catalog:
    z = re.match(r"((TESTPRR )\w+)", element)
    if z:
        print(z.group())
Output: TESTPR 105
It should show:
Wanted output: TESTPT 105.465
It will go up to 3 decimal places after the period and no more. I am currently taking a Python class to learn Python and love it so far, but this one has me stumped as I am just now learning about re and matching by reading as we have not gotten to that yet in class.
I am open to learning a different way to search for and match a string and print just that string. For my first attempt that prints 3 results was this:
catalog = [ long list pulled from API then code here to make it a nice column]
prod = 'TESTPR'
print ([s for s in catalog if prod in s])
When I add a space at the end of prod, I can get rid of the match with the extra character at the end, but I cannot use the same trick for the match that has an extra character at the front. This applies to the code above, not to the re match code. Thanks!
Answer below!
Since you are interested in learning about ways to match strings and solve your problem: try fuzzywuzzy.
In your case you could try:
from fuzzywuzzy import process
catalog = [long list pulled from API then code here to make it a nice column]
prod = "TESTPR"
hit = process.extractOne(prod, catalog, score_cutoff=75)  # adjust the cutoff to control how close the match must be
print(hit[0])  # hit will be something like ("TESTPT 105.465", 75)
Output: TESTPT 105.465
For information on different ways of using fuzzywuzzy, check out this link.
You can use different scorers, such as:
fuzz.partial_ratio
fuzz.ratio
fuzz.token_sort_ratio
fuzz.token_set_ratio
These require: from fuzzywuzzy import fuzz
I kept at it with re.match and found a regex that prints the entire match without cutting off the numbers after the period.
My original match, as you can see above, was re.match("((TESTPRR )\w+)", element). Some of the parentheses were unneeded, and I needed to add a few more expressions; now it prints the correct match. See above for the old code and below for the new code that works.
# New code: replaced \w+ with \w*\d*[.,]?\d*$
for element in catalog:
    z = re.match(r"STRING\w*\d*[.,]?\d*$", element)
    if z:
        print(z.group())
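The original match stopped at the period because \w matches only letters, digits and the underscore, so \w+ cannot cross a ".". A stricter variant is sketched below, assuming entries have the form "name, space, number with up to three decimals". The catalog contents are hypothetical stand-ins for the API data, and re.fullmatch is swapped in for re.match so that an extra leading or trailing character rejects the entry.

```python
import re

# Hypothetical catalog entries; the real list comes from an API in the question.
catalog = ["1TESTPR 200.1", "TESTPR 105.465", "TESTPRR 300.25"]
prod = "TESTPR"

# Product name, one space, digits, then optionally a period and 1 to 3 decimals.
pattern = re.compile(re.escape(prod) + r" \d+(?:\.\d{1,3})?")

# fullmatch requires the whole entry to fit the pattern.
matches = [entry for entry in catalog if pattern.fullmatch(entry)]

print(matches)  # ['TESTPR 105.465']
```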

How to separate amino acid, number and amino acid string?

Right now I have a column of amino acid mutation strings.
The column contains values such as A59M, T133G, K2*, G1927? and a bare ?.
I tried to use re to split the one column into three columns, dropping the entries that are only ? but keeping G1927?.
import re
AA_mut = AA_mut.replace('p.','')
m = re.search(r'^(\w+)(\d+)(\S+)$',AA_mut)
But I got
(A5,9,M; T13,3,G;....)
Please give me some advice.
Thanks
\w matches letters and digits (and the underscore) in Perl, and it does the same in Python.
You might try being more explicit. Is that a single capital letter at the front? If so, maybe you want something like
^([A-Z])(\d+)(\D+)$
In Perl:
print join ("<>", m/^([A-Z])(\d+)(\D+)$/) while <DATA>;
__DATA__
A59M
T133G
K2*
G1927?
?
prints
A<>59<>M
T<>133<>G
K<>2<>*
G<>1927<>?
Assuming you have:
data = ["A59M", "T133G", "K2*", "G1927?", "?"]
You can extract it using:
out = [(s[0], s[1:-1], s[-1]) for s in data if len(s) > 2]
This gives me:
out == [('A', '59', 'M'), ('T', '133', 'G'),
        ('K', '2', '*'), ('G', '1927', '?')]
import re
AA_mut = AA_mut.replace('p.','')
m = re.search(r'^(\w)(\d+)(\S+)$',AA_mut)
This is what I used to solve my problem. The original \w+ is greedy, so it left only one digit for \d+ and one character for \S+. Once I removed the "+", \w takes only the first letter and leaves the rest for the other groups.
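For comparison, here is a Python sketch of the more explicit pattern suggested in the Perl answer, applied to the sample list from the slicing answer. It assumes every wanted entry starts with a single capital letter; the names pattern and parsed are only illustrative.

```python
import re

data = ["A59M", "T133G", "K2*", "G1927?", "?"]

# One capital letter, then the position digits, then the non-digit remainder.
pattern = re.compile(r"^([A-Z])(\d+)(\D+)$")

parsed = []
for s in data:
    m = pattern.match(s.replace("p.", ""))
    if m:  # a bare "?" does not match and is skipped
        parsed.append(m.groups())

print(parsed)
```

Because the first group is a single character class with no quantifier, it cannot swallow digits the way the greedy \w+ did.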
