How can I search a pattern and extract the value behind it - python-3.x

I am a newbie in Python. I am trying to pull data (XXXX) out of a text using the pattern PDB:XXXX. The XXXX varies, but it is exactly what I want.
Since the data all contain PDB:, I use re.findall() to search for this pattern. But this only gave me a list of PDB:. How can I get it to include the XXXX?
This is my code:
text = 'blah...........
PDB:AAAA
blah...........
blah...........
PDB:BBBB'
etc.
r = re.findall("PDB:",text)
and the output gave me:
['PDB:', 'PDB:']
My desired output should be something like
['AAAA', 'BBBB']

You need to use """ to quote multi-line strings in Python. Also, to get a specific subset of the matched pattern, you need to use capture groups (the parentheses in my regular expression below).
import re
text = """blah...........
PDB:AAAA
blah...........
blah...........
PDB:BBBB"""
results = re.findall(r"PDB:(.*)", text)
print(results)  # ['AAAA', 'BBBB']
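Note that .* captures everything up to the end of the line, so any text following the identifier on the same line would be included. If the identifiers are always exactly four letters or digits (an assumption; the question only shows AAAA and BBBB), a tighter pattern might look like this sketch:

```python
import re

text = """blah...........
PDB:AAAA trailing text
blah...........
PDB:BBBB"""

# Assumption: an identifier is exactly four letters or digits.
# \b stops the match at the end of the identifier.
pdb_ids = re.findall(r"PDB:([A-Za-z0-9]{4})\b", text)
print(pdb_ids)
```

With .* instead, the first result would be 'AAAA trailing text'.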

Related

I'm looking for a way to extract strings from a text file using specific criteria

I have a text file containing random strings. I want to use specific criteria to extract the strings that match them.
Example text :
B311-SG-1700-ASJND83-ANSDN762
BAKSJD873-JAN-1293
Example criteria :
All the strings that contain characters separated by hyphens this way : XXX-XX-XXXX
Output : 'B311-SG-1700'
I tried creating a function, but I can't work out how to express criteria for strings specifically and how to apply them.
Based on your comment here is a python script that might do what you want (I'm not that familiar with python).
import re
p = re.compile(r'\b(.{4}-.{2}-.{4})')
results = p.findall('B111-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293\nB211-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293 B311-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293')
print(results)
Output:
['B111-SG-1700', 'B211-SG-1700', 'B311-SG-1700']
You can read a file as a string like this:
with open("file.txt", "r") as text_file:
    data = text_file.read()
Then use findall on that. Depending on the size of the file, it might require a bit more work (e.g. reading line by line).
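For the line-by-line case, here is a minimal sketch. The helper name find_codes and the temporary file.txt are made up for the demonstration. Since . never matches a newline, matching per line should give the same results as matching the whole file at once.

```python
import os
import re
import tempfile

pattern = re.compile(r"\b(.{4}-.{2}-.{4})")

def find_codes(path):
    """Collect matches one line at a time instead of loading the whole file."""
    matches = []
    with open(path, "r") as handle:
        for line in handle:
            matches.extend(pattern.findall(line))
    return matches

# Demo: write a throwaway file.txt into a temporary directory, then scan it.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "file.txt")
    with open(path, "w") as handle:
        handle.write("B111-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293\n")
        handle.write("B211-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293\n")
    codes = find_codes(path)

print(codes)
```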
You can use the re module to extract the pattern from the text:
import re
text = """\
B311-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293
BAKSJD873-JAN-1293 B312-SG-1700-ASJND83-ANSDN762"""
for m in re.findall(r"\b.{4}-.{2}-.{4}", text):
    print(m)
Prints:
B311-SG-1700
B312-SG-1700
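One caveat with both patterns above: . matches any character except a newline, including spaces and hyphens, so unrelated text can match too. If the segments only ever contain letters and digits (an assumption; the question does not say), a stricter character class is safer. The extra "x y z-ab-cdef" text below is invented to show the difference.

```python
import re

text = "B311-SG-1700-ASJND83-ANSDN762 x y z-ab-cdef"

# Loose: "." also matches the spaces in " y z", so a second, bogus match appears.
loose = re.findall(r"\b.{4}-.{2}-.{4}", text)

# Strict: every segment must consist of ASCII letters or digits only.
seg = r"[A-Za-z0-9]"
strict = re.findall(rf"\b{seg}{{4}}-{seg}{{2}}-{seg}{{4}}\b", text)

print(loose)
print(strict)
```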

Get number from string in Python

I have a string, I have to get digits only from that string.
url = "www.mylocalurl.com/edit/1987"
Now from that string, I need to get 1987 only.
I have been trying this approach,
id = [int(i) for i in url.split() if i.isdigit()]
But I am getting [] list only.
You can use regex and get the digit alone in the list.
import re
url = "www.mylocalurl.com/edit/1987"
digit = re.findall(r'\d+', url)
output:
['1987']
Replace all non-digits with blank (effectively "deleting" them):
import re
num = re.sub(r'\D', '', url)
You aren't getting anything because by default the .split() method splits a sentence up where there are spaces. Since you are trying to split a hyperlink that has no spaces, it is not splitting anything up. What you can do is called a capture using regex. For example:
import re
url = "www.mylocalurl.com/edit/1987"
regex = r'(\d+)'
numbers = re.search(regex, url)
captured = numbers.groups()[0]
If you do not know what regular expressions are, the code is basically saying: using the regex string r'(\d+)', which means capture one or more consecutive digits, search through the url. Then captured holds the first captured group, which is the string '1987'.
If you don't want to use this, then you can use your .split() method, but this time split using / as the separator. For example, url.split('/') returns a list whose last element is '1987'.
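A short sketch of the split-based route, reusing the list comprehension from the question with the separator changed:

```python
url = "www.mylocalurl.com/edit/1987"

# Split on "/" rather than on whitespace, then keep the purely numeric parts.
ids = [int(part) for part in url.split("/") if part.isdigit()]
print(ids)

# If the number is always the final path segment, rsplit returns it directly
# (as a string, so convert with int() if a number is needed).
last_segment = url.rsplit("/", 1)[-1]
print(last_segment)
```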

How to get demangled function name using regex

I have a list of mangled function names like _Z6__comp7StudentS_ and
_Z4SortiSt6vectorI7StudentSaIS0_EE. I read the wiki and found out that mangling follows a defined structure: _Z is the mangled-symbol prefix, followed by a number and then the function name of that length.
So I wanted to retrieve that function name using regex. The closest I have come is _Z(?:\d)(?<function_name>[a-z_A-Z]){\1}. But referring to \1 won't work because it is a string, right? Is there a single-regex solution to this?
You can use two capture groups: group 1 holds the length, and you slice that many characters from the start of group 2:
import re
pattern = r"_Z(\d+)([a-z_A-Z]+)"
s = "_Z4SortiSt6vectorI7StudentSaIS0_EE"
m = re.search(pattern, s)
if m:
    print(m.group(2)[0: int(m.group(1))])
Output
Sort
Using _Z6__comp7StudentS_ will return __comp
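The character class [a-z_A-Z]+ stops at the first digit, so a function name that itself contains digits would be cut short. A variant that only uses the regex to read the length and then slices the original string avoids that. This is a sketch for simple, non-nested names only; the helper name function_name and the sample symbol _Z2f1v are made up for illustration.

```python
import re

def function_name(symbol):
    """Return the length-prefixed name after _Z, or None if it does not fit."""
    m = re.match(r"_Z(\d+)", symbol)
    if m is None:
        return None
    length = int(m.group(1))
    start = m.end()  # index of the first character after the digits
    name = symbol[start:start + length]
    # Reject symbols that are shorter than the declared length.
    return name if len(name) == length else None

for symbol in ["_Z6__comp7StudentS_", "_Z4SortiSt6vectorI7StudentSaIS0_EE", "_Z2f1v"]:
    print(symbol, "->", function_name(symbol))
```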

How to get the content after a string using regex in python

I am having a string as follows:
A5697[2:10] = {ravi, rageev, raghav, smith};
I want the content after "A5697[2:10] =". So, my output should be:
{ravi, rageev, raghav, smith};
This is my code:
print(re.search(r'(?<=A\d+\[.*\] =\s).*', line).group())
But, this is giving error:
sre_constants.error: look-behind requires fixed-width pattern
Can anyone help to solve this issue? I would prefer to use regex.
You can try re.sub, as shown below. Since you have given only one data point, I am assuming all the other data points follow a similar pattern.
import re
text = "A5697[2:10] = {ravi, rageev, raghav, smith}"
re.sub(r'(A\d+\[\d+:\d+\]\s+=\s+)(.+)', r'\2', text)
returns,
'{ravi, rageev, raghav, smith}'
re.sub replaces the entire match with the contents of the second capturing group, which captures everything after '= '.
Simply replace the bits you don't want:
print(re.sub(r'A\d[^=]*= *', '', line))
See demo here: https://rextester.com/NSG17655
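The error in the question comes from the re module requiring look-behind patterns to have a fixed width, which \d+ and .* do not. Another way around it, sketched here, is to drop the look-behind and put the wanted part in a capture group instead:

```python
import re

line = "A5697[2:10] = {ravi, rageev, raghav, smith};"

# Match the prefix normally and capture only what follows "= ".
m = re.search(r"A\d+\[[^\]]*\]\s*=\s*(.*)", line)
content = m.group(1) if m else None
print(content)
```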

How to filter only text in a line?

I have many lines like these:
_ÙÓ´Immediate Transformation With Vee_ÙÓ´
‰ÛÏThe Real Pernell Stacks‰Û
I want to get something like this:
Immediate Transformation With Vee
The Real Pernell Stacks
I tried this:
for t in test:
    t.isalpha()
but characters like this Ó count as well
So I also thought that I can create a list of English words, a space and punctuation marks and delete all the elements from the line that are not in this list, but I do not think that this is the right option, since the line can contain not only English words and that's fine.
Using Regex.
Ex:
import re
data = """_ÙÓ´Immediate Transformation With Vee_ÙÓ´
‰ÛÏThe Real Pernell Stacks‰Û"""
for line in data.splitlines(keepends=False):
    print(re.sub(r"[^A-Za-z\s]", "", line))
Output:
Immediate Transformation With Vee
The Real Pernell Stacks
Use re.split (s is the line to clean):
import re
result = ' '.join(re.split(r'[^A-Za-z]', s))
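One thing to watch with the re.split version: every removed character leaves an empty string in the list, so the joined result picks up extra spaces at the ends. Collecting the runs of letters with re.findall sidesteps that. This is a sketch that, like the answers above, keeps ASCII letters only, so it would also strip legitimate non-English letters.

```python
import re

lines = [
    "_ÙÓ´Immediate Transformation With Vee_ÙÓ´",
    "‰ÛÏThe Real Pernell Stacks‰Û",
]

# Keep each run of ASCII letters and join the runs with single spaces.
cleaned = [" ".join(re.findall(r"[A-Za-z]+", line)) for line in lines]
for item in cleaned:
    print(item)
```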

Resources