I'm looking for a way to extract strings from a text file using specific criterias - string

I have a text file containing random strings. I want to use specific criterias to extract the strings that match these criterias.
Example text :
B311-SG-1700-ASJND83-ANSDN762
BAKSJD873-JAN-1293
Example criteria :
All the strings that contains characters seperated by hyphens this way : XXX-XX-XXXX
Output : 'B311-SG-1700'
I tried creating a function but I can't seem to know how to use criterias for string specifically and how to apply them.

Based on your comment here is a python script that might do what you want (I'm not that familiar with python).
import re
p = re.compile(r'\b(.{4}-.{2}-.{4})')
results = p.findall('B111-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293\nB211-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293 B311-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293')
print(results)
Output:
['B111-SG-1700', 'B211-SG-1700', 'B311-SG-1700']
You can read a file as a string like this
text_file = open("file.txt", "r")
data = text_file.read()
And use findall over that. Depending on the size of the file it might require a bit more work (e.g. reading line by line for example

You can use re module to extract the pattern from text:
import re
text = """\
B311-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293
BAKSJD873-JAN-1293 B312-SG-1700-ASJND83-ANSDN762"""
for m in re.findall(r"\b.{4}-.{2}-.{4}", text):
print(m)
Prints:
B311-SG-1700
B312-SG-1700

Related

Replace semicolon with the word present in the same line in python

I want my data like <Prags>87654321</Prags>,
<Cookie>2476157</Cookie> <Guddu>98765</Guddu>
My data is like <Prags>87654321;
Replace the semicolon with the first word of the sentence.
Given your information, I suppose you have either a file or multiline string with information that you want to process. If your multiline string looks like this:
data = """
<Prags>87654321;
<Cookie>87654321;
<Guddu>87654321;
<Prags>87654321;
<Prags>87654321;"""
I would split this string into individual lines and extract the tag using re as follows:
# extracting lines
lines = data.splitlines()
# this function looks for tags
# if there are no tags, it returns empty string
def find_tag(line):
try:
return re.match("<[^>]*>", line).group()
except AttributeError:
return ""
# we then iterate over lines and process them
for line in lines:
line = line.replace(";", find_tag(line))
print(line)
Do not forget to import re package with import re.

F string is adding new line

I am trying to make a name generator. I am using F string to concatenate the first and the last names. But instead of getting them together, I am getting them in a new line.
print(f"Random Name Generated is:\n{random.choice(firstname_list)}{random.choice(surname_list)}")
This give the output as:
Random Name Generated is:
Yung
heady
Instead of:
Random Name Generated is:
Yung heady
Can someone please explain why so?
The code seems right, perhaps could be of newlines (\n) characters in element of list.
Check the strings of lists.
import random
if __name__ == '__main__':
firstname_list = ["yung1", "yung2", "yung3"]
surname_list = ["heady1", "heady2", "heady3"]
firstname_list = [name.replace('\n', '') for name in firstname_list]
print(f"Random Name Generated is:\n{random.choice(firstname_list)} {random.choice(surname_list)}")
Output:
Random Name Generated is:
yung3 heady2
Since I had pulled these values from UTF-8 encoded .txt file, the readlines() did convert the names to list elements but they had a hidden '\xa0\n' in it.
This caused this particular printing problem. Using .strip() helped to remove the spaces.
print(f"Random Name Generated is:\n{random.choice(firstname_list).strip()} {random.choice(surname_list).strip()}")

How to find a substring in a line from a text file and add that line or the characters after the searched string into a list using Python?

I have a MIB dataset which is around 10k lines. I want to find a certain string (for eg: "SNMPv2-MIB::sysORID") in the text file and add the whole line into a list. I am using Jupyter Notebooks for running the code.
I used the below code to search the search string and it print the searched string along with the next two strings.
basic = open('mibdata.txt')
file = basic.read()
city_name = re.search(r"SNMPv2-MIB::sysORID(?:[^a-zA-Z'-]+[a-zA-Z'-]+) {1,2}", file)
city_name = city_name.group()
print(city_name)
Sample lines in file:
SNMPv2-MIB::sysORID.10 = OID: NOTIFICATION-LOG-MIB::notificationLogMIB
SNMPv2-MIB::sysORDescr.1 = STRING: The MIB for Message Processing and Dispatching.
The output expected is
SNMPv2-MIB::sysORID.10 = OID: NOTIFICATION-LOG-MIB::notificationLogMIB
but i get only
SNMPv2-MIB::sysORID.10 = OID: NOTIFICATION-LOG-MIB
The problem with changing the number of string after the searched strings is that the number of strings in each line is different and i cannot specify a constant. Instead i want to use '\n' as a delimiter but I could not find one such post.
P.S. Any other solution is also welcome
EDIT
You can read all lines one by one of the file and look for a certain Regex that matches the case.
r(NMPv2-MIB::sysORID).* finds the encounter of the string in the parenthesis and then matches everything followed after.
import re
basic = open('file.txt')
entries = map(lambda x : re.search(r"(SNMPv2-MIB::sys).*",x).group() if re.search(r"(SNMPv2-MIB::sys).*",x) is not None else "", basic.readlines())
non_empty_entries = list(filter(lambda x : x is not "", entries))
print(non_empty_entries)
If you are not comfortable with Lambdas, what the above script does is
taking the text from the file, splits it into lines and checks all lines individually for a regex match.
Entries is a list of all lines where the match was encountered.
EDIT vol2
Now when the regex doesn't match it will add an empty string and after we filter them out.

How can I search a pattern and extract the value behind it

I am a newbee in python. I am trying to pull data (XXXX) out from a text with a pattern PDB:XXXX. The XXXX varies, but it is exactly what I want.
Since the data all contain PDB:, I use re.findall() to search and get this pattern. But this only gave me a list of PDB:. How can I get it to include the XXXX???
this is my code:
text = 'blah...........
PDB:AAAA
blah...........
blah...........
PDB:BBBB'
etc.
r = re.findall("PDB:",text)
and the output gave me:
['PDB:', 'PDB:']
My desired output should be something like
['AAAA', 'BBBB']
You need to use """ to quote multi-line strings in Python. Also, to get a specific subset of the matched pattern, you need to use capture groups (the parentheses in my regular expression below).
import re
text = """blah...........
PDB:AAAA
blah...........
blah...........
PDB:BBBB"""
results = re.findall(r"PDB:(.*)", text)
print results #['AAAA', 'BBBB']

How to decode a text file by extracting alphabet characters and listing them into a message?

So we were given an assignment to create a code that would sort through a long message filled with special characters (ie. [,{,%,$,*) with only a few alphabet characters throughout the entire thing to make a special message.
I've been searching on this site for a while and haven't found anything specific enough that would work.
I put the text file into a pastebin if you want to see it
https://pastebin.com/48BTWB3B
Anywho, this is what I've come up with for code so far
code = open('code.txt', 'r')
lettersList = code.readlines()
lettersList.sort()
for letters in lettersList:
print(letters)
It prints the code.txt out but into short lists, essentially cutting it into smaller pieces. I want it to find and sort out the alphabet characters into a list and print the decoded message.
This is something you can do pretty easily with regex.
import re
with open('code.txt', 'r') as filehandle:
contents = filehandle.read()
letters = re.findall("[a-zA-Z]+", contents)
if you want to condense the list into a single string, you can use a join:
single_str = ''.join(letters)

Resources