Python regex capture group containing nested non-capturing group - python-regex

I'm trying to capture string-parts Abbb, Abb, Ab, A, C###, C#, C, etc. into one group and whatever follows (anything that's not b, #) into a separate group.
I'm using this regex:
sample = "Cbb-7" # for testing purposes
re.search(r"([A-G](?:#*|b*))(.*?)", sample).groups()
which results in:
('C', '')
while I'm expecting:
('Cbb', '-7').
When I make the follow-up capture group greedy, (.*):
re.search(r"([A-G](?:#*|b*))(.*)", sample).groups()
I get the result:
('C', 'bb-7'). (I still would need: ('Cbb','-7'))

Moving the optionality of b and # out of the non-capturing group seems to help:
re.search(r"([A-G](?:#+|b+)?)(.*)", sample).groups()
results in:
('Cbb', '-7') Still wondering why!
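A likely explanation, sketched below: Python's regex alternation is ordered and takes the first branch that succeeds. In (?:#*|b*) the branch #* always succeeds, because it can match the empty string, so b* is never tried and the first group stops at 'C'. With the lazy (.*?) the second group then matches as little as possible, which is nothing. Using + inside the group forces each branch to consume at least one character, so #+ fails on 'b' and the engine falls through to b+. The names EMPTY_FIRST, NON_EMPTY and split_note are made up for this illustration.

```python
import re

# '#*' can match the empty string, so this alternation never reaches 'b*'.
EMPTY_FIRST = re.compile(r"([A-G](?:#*|b*))(.*)")

# '#+' and 'b+' must each consume a character; the trailing '?' keeps
# the accidental optional.
NON_EMPTY = re.compile(r"([A-G](?:#+|b+)?)(.*)")


def split_note(pattern, text):
    """Return the (note, rest) groups, or None if the pattern does not match."""
    match = pattern.search(text)
    return match.groups() if match else None


print(split_note(EMPTY_FIRST, "Cbb-7"))  # ('C', 'bb-7')
print(split_note(NON_EMPTY, "Cbb-7"))    # ('Cbb', '-7')
print(split_note(NON_EMPTY, "C##m"))     # ('C##', 'm')
```

Swapping the branches to (?:b*|#*) would only move the problem: 'Cbb' would work, but 'C##' would stop at 'C'.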

Related

Regex: possibly two patterns found in one text

I have a specific pattern but the text to be processed can change randomly.
The text I am currently trying to filter with a regex (Python re.findall, Python v3.9.13) is as follows:
"ABC9,10.11A5:6,7:8.10BC1"
I am using the following regex expression: r"([ABC]{1,})(([0-9]{1,}[,.:]{0,}){1,})"
The current result is:
[("ABC", "9,10.11", "11"), ("A", "5:6,7:8.10", "10"), ("BC", "1", "1")]
What I am looking for as result should be:
[("ABC", "9,10.11"), ("A", "5:6,7:8.10"), ("BC", "1")]
I don't understand why the last number in the second part is always repeated again.
Please help.
I presume you are using re.findall, since that returns the contents of all capture groups in its output. In your case the last number repetition is due to the capture group around [0-9]{1,}[,.:]{0,}. Making that a non-capturing group resolves the issue:
([ABC]{1,})((?:[0-9]{1,}[,.:]{0,}){1,})
In python:
re.findall(r"([ABC]{1,})((?:[0-9]{1,}[,.:]{0,}){1,})", s)
# [('ABC', '9,10.11'), ('A', '5:6,7:8.10'), ('BC', '1')]
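To see where the repeated number comes from, the two patterns can be compared side by side. This is a small sketch using the shorter quantifiers + and *, which are equivalent to {1,} and {0,}; the variable names are only for illustration. A capture group that sits under a quantifier keeps just the text of its last iteration, and findall reports it as a third tuple element.

```python
import re

s = "ABC9,10.11A5:6,7:8.10BC1"

# Inner group is capturing: it is repeated, so it keeps only its last iteration.
capturing = r"([ABC]+)(([0-9]+[,.:]*)+)"
# Inner group is non-capturing: only the two outer groups are reported.
non_capturing = r"([ABC]+)((?:[0-9]+[,.:]*)+)"

with_inner = re.findall(capturing, s)
without_inner = re.findall(non_capturing, s)

print(with_inner)     # three elements per tuple, the last one is the final number
print(without_inner)  # two elements per tuple
```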

Why does re.findall behave differently from re.search?

Scenario 1: Works as expected
>>> output = 'addr:10.0.2.15'
>>> regnew = re.search(r'addr:(([0-9]+\.){3}[0-9]+)',output)
>>> print(regnew)
<re.Match object; span=(0, 14), match='addr:10.0.2.15'>
>>> print(regnew.group(1))
10.0.2.15
Scenario 2: Works as expected
>>> regnew = re.findall(r'addr:([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)',output)
>>> print(regnew)
['10.0.2.15']
Scenario 3: Does not work as expected. Why is the output not ['10.0.2.15']?
>>> regnew = re.findall(r'addr:([0-9]+\.){3}[0-9]+',output)
>>> print(regnew)
['2.']
Your regex is not correct for what you want:
import re
output = 'addr:10.0.2.15'
regnew = re.findall(r'addr:((?:[0-9]+\.){3}[0-9]+)', output)
print(regnew)
Notice what changed: the full IP address is wrapped in parentheses, and '?:' is added to the group for the first part of the address. '?:' marks a non-capturing group. As stated in the docs, findall() returns a list of the captured groups when the pattern contains any, which is why '(?:[0-9]+\.)' should be non-capturing and the whole address should sit in a single capturing group. The dot is escaped so that it matches a literal '.' rather than any character.
The difference here between findall and everything else is that findall returns capture groups by default (if any are present) instead of the entire matched expression.
A quick fix would be to simply change your repeated group to a noncapturing group, so findall will return the full match rather than the last result in your capture group.
addr:(?:[0-9]+\.){3}[0-9]+
That will of course include addr: in your match. To get just the IP address, wrap both the pattern and quantifier in a capture group.
addr:((?:[0-9]+\.){3}[0-9]+)
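The three cases can be checked directly. The sketch below only uses re.findall and re.finditer on the string from the question; the variable names are illustrative. With no capture group findall returns whole matches, with one group it returns that group, and a repeated group reports only its last iteration. re.finditer is one way to get at the whole match regardless of groups.

```python
import re

output = 'addr:10.0.2.15'

# No capturing group: findall returns the entire match.
no_group = re.findall(r'addr:(?:[0-9]+\.){3}[0-9]+', output)

# One capturing group around the address: findall returns just that group.
one_group = re.findall(r'addr:((?:[0-9]+\.){3}[0-9]+)', output)

# Repeated capturing group: only the last iteration ('2.') survives.
repeated_group = re.findall(r'addr:([0-9]+\.){3}[0-9]+', output)

# finditer yields match objects, so group(0) is always the whole match.
full_matches = [m.group(0) for m in re.finditer(r'addr:([0-9]+\.){3}[0-9]+', output)]

print(no_group, one_group, repeated_group, full_matches)
```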

Check if a set of characters is contained in a string?

There is a pool of letters (chosen randomly), and you want to make a word with these letters. I found some code that can help me with this, but if the word has, for example, two L's and the pool has only one, I'd like the program to detect that.
If I understand this correctly, you will also need a list of all valid words in whichever language you are using.
Assuming you have this, then one strategy for solving this problem could be to generate a key for every word in the dictionary that is a sorted list of the letters in that word. You could then group all words in the dictionary by these keys.
Then the task of finding out if a valid word can be constructed from a given list of random characters would be easy and fast.
Here is a simple implementation of what I am suggesting:
list_of_all_valid_words = ['this', 'pot', 'is', 'not', 'on', 'top']

def make_key(word):
    # Anagrams share the same sorted-letter key.
    return "".join(sorted(word))

lookup_dictionary = {}
for word in list_of_all_valid_words:
    lookup_dictionary.setdefault(make_key(word), set()).add(word)

def words_from_chars(s):
    # Sorted so that the output order is deterministic.
    return sorted(lookup_dictionary.get(make_key(s), set()))

print(words_from_chars('xyz'))
print(words_from_chars('htsi'))
print(words_from_chars('otp'))
Output:
[]
['this']
['pot', 'top']
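The lookup above only finds words that use every letter of the pool exactly once. For the case in the question, where a word needs two L's and the pool has only one, a different and simpler check may be enough: compare letter counts with collections.Counter. This is a sketch of that alternative, not part of the answer above, and the function names can_make and missing_letters are made up for illustration.

```python
from collections import Counter


def can_make(word, pool):
    """True if every letter of word is available in pool often enough."""
    needed = Counter(word)
    available = Counter(pool)
    # A Counter returns 0 for letters that are absent from the pool.
    return all(available[letter] >= count for letter, count in needed.items())


def missing_letters(word, pool):
    """Letters (with counts) that word needs but pool cannot supply."""
    # Counter subtraction drops zero and negative counts.
    return dict(Counter(word) - Counter(pool))


print(can_make("hello", "olleh"))        # True
print(can_make("hello", "ohel"))         # False, only one 'l' in the pool
print(missing_letters("hello", "ohel"))  # {'l': 1}
```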

How to separate amino acid, number and amino acid string?

Right now, I have an amino acid string.
The amino acid mutation column looks like this: A59M, T133G, K2*, G1927?, and entries that are just ?.
So I tried to use re to split the one column into three columns, removing the entries that are only ? but keeping G1927?.
import re
AA_mut = AA_mut.replace('p.','')
m = re.search(r'^(\w+)(\d+)(\S+)$',AA_mut)
But, I got
(A5,9,M; T13,3,M;....)
Please give me some advice.
Thanks
\w matches letters and digits in perl. It looks to me like it's doing the same thing in python.
You might try being more explicit. Is that a single, capital letter on the front? If so maybe you want something like
^([A-Z])(\d+)(\D+)$
In perl:
print join ("<>", m/^([A-Z])(\d+)(\D+)$/) while <DATA>;
__DATA__
A59M
T133G
K2*
G1927?
?
prints
A<>59<>M
T<>133<>G
K<>2<>*
G<>1927<>?
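The same pattern works unchanged with Python's re module. A minimal sketch, assuming the values are plain strings in a list; MUTATION and split_mutation are names invented for this example:

```python
import re

# One capital letter, one or more digits, then one or more non-digits.
MUTATION = re.compile(r'^([A-Z])(\d+)(\D+)$')


def split_mutation(text):
    """Return (from, position, to) or None if text is not a mutation."""
    match = MUTATION.match(text)
    return match.groups() if match else None


data = ["A59M", "T133G", "K2*", "G1927?", "?"]
# The lone '?' does not match and is dropped.
parsed = [parts for parts in map(split_mutation, data) if parts is not None]

for parts in parsed:
    print("<>".join(parts))
```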
Assuming you have:
data = ["A59M", "T133G", "K2*", "G1927?", "?"]
You can extract it using:
out = [(s[0], s[1:-1], s[-1]) for s in data if len(s) > 2]
This gives me:
out == [('A', '59', 'M'), ('T', '133', 'G'),
('K', '2', '*'), ('G', '1927', '?')]
import re
AA_mut = AA_mut.replace('p.','')
m = re.search(r'^(\w)(\d+)(\S+)$',AA_mut)
This is the one I used to solve my problem. The original \w+ is greedy and gives back only as much as it must, which leaves a single digit for \d+ and a single character for \S+. Once I removed the "+", \w takes only the first letter and leaves the rest to the other groups.
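The difference is easy to reproduce on one value. In this short sketch the variable names are only illustrative: \w also matches digits, so \w+ first swallows 'A59M' and then backtracks one character at a time until \d+ and \S+ can each match a single character.

```python
import re

# Greedy \w+ keeps as much as it can and leaves one character each
# for \d+ and \S+.
greedy = re.search(r'^(\w+)(\d+)(\S+)$', 'A59M').groups()

# A single \w takes only the leading letter, so \d+ gets all the digits.
single = re.search(r'^(\w)(\d+)(\S+)$', 'A59M').groups()

print(greedy)  # ('A5', '9', 'M')
print(single)  # ('A', '59', 'M')
```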

Generate sensible strings using a pattern

I have a table of strings (about 100,000) in following format:
pattern , string
e.g. -
*l*ph*nt , elephant
c*mp*t*r , computer
s*v* , save
s*nn] , sunny
]*rr] , worry
To simplify, assume a * denotes a vowel, a consonant stands unchanged and ] denotes either a 'y' or a 'w' (say, for instance, semi-vowels/round-vowels in phonology).
Given a pattern, what is the best way to generate the possible sensible strings? A sensible string is defined as a string having each of its consecutive two-letter substrings, that were not specified in the pattern, inside the data-set.
e.g. -
h*ll* --> hallo, hello, holla ...
'hallo' is sensible because 'ha', 'al' and 'lo' can be seen in the data-set, in the words 'have', 'also' and 'low'. The two letters 'll' are not considered because they were specified in the pattern.
What are the simple and efficient ways to do this?
Are there any libraries/frameworks for achieving this?
I've no specific language in mind but prefer to use java for this program.
This is particularly well suited to Python itertools, set and re operations:
import re
import itertools

VOWELS = 'aeiou'
SEMI_VOWELS = 'wy'
DATASET = '/usr/share/dict/words'
SENSIBLES = set()

def digraphs(word, digraph=r'..'):
    '''
    Return the two-letter substrings of word that match digraph.

    >>> sorted(digraphs('bar'))
    ['ar', 'ba']
    '''
    # findall is non-overlapping, so scan a second time from offset 1
    base = re.findall(digraph, word)
    base.extend(re.findall(digraph, word[1:]))
    return set(base)

def expand(pattern, wildcard, elements):
    '''
    Replace every wildcard in pattern with each of the elements.

    >>> expand('h?', '?', 'aeiou')
    ['ha', 'he', 'hi', 'ho', 'hu']
    '''
    tokens = pattern.split(wildcard)
    results = set()
    # product rather than permutations, so a letter may repeat (as in 'elephant')
    for fill in itertools.product(elements, repeat=len(tokens) - 1):
        pieces = [t + f for t, f in zip(tokens, fill)] + [tokens[-1]]
        results.add(''.join(pieces))
    return sorted(results)

def enum(pattern):
    # digraphs fully specified by the pattern are exempt from the check
    not_sensible = digraphs(pattern, r'[^*\]]{2}')
    for p in expand(pattern, '*', VOWELS):
        for q in expand(p, ']', SEMI_VOWELS):
            if (digraphs(q) - not_sensible).issubset(SENSIBLES):
                print(q)

## Init the data-set (may be long...)
## you may want to pre-compute this
## and adapt it to your data-set.
with open(DATASET, 'r') as words:
    for word in words:
        for digraph in digraphs(word.rstrip()):
            SENSIBLES.add(digraph)

enum('*l*ph*nt')
enum('s*nn]')
enum('h*ll*')
As there aren't many possibilities for two-letter substrings, you can go through your dataset and generate a table that contains the count for every two-letter substring, so the table will look something like this:
ee 1024 times
su 567 times
...
xy 45 times
xz 0 times
The table will be small as you'll only have about 26*26 = 676 values to store.
You have to do this only once for your dataset (or update the table every time it changes if the dataset is dynamic) and can use the table for evaluating possible strings. For your example, add the values for 'ha', 'al' and 'lo' to get a "score" for the string 'hallo'. After that, choose the string(s) with the highest score(s).
Note that the scoring can be improved by checking longer substrings, for example three letters, but this will also result in larger tables.
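A rough sketch of this count table, written in Python with collections.Counter. The four-word list stands in for the real dataset and is purely hypothetical, as are the names bigram_counts and score. The skip argument is an added assumption that leaves out pairs already fixed by the pattern, such as 'll' in h*ll*.

```python
from collections import Counter


def bigram_counts(words):
    """Count every two-letter substring across all words."""
    counts = Counter()
    for word in words:
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += 1
    return counts


def score(candidate, counts, skip=()):
    """Sum the table counts of the candidate's two-letter substrings."""
    pairs = (candidate[i:i + 2] for i in range(len(candidate) - 1))
    return sum(counts[pair] for pair in pairs if pair not in skip)


# Hypothetical miniature dataset, only for illustration.
words = ['have', 'also', 'low', 'hello']
counts = bigram_counts(words)

for candidate in ('hallo', 'hello', 'hullo'):
    print(candidate, score(candidate, counts, skip={'ll'}))
```

On this tiny dataset 'hallo' and 'hello' both score 4 and 'hullo' scores 2, so the first two would be preferred.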
