Building a mad libs program in Python using regular expressions - python-3.x

I'm a new Python programmer working through the book Automate the Boring Stuff with Python. One of the end-of-chapter projects is to build a mad libs program. Based on what has been introduced so far, I think that the author intends for me to use regular expressions.
Here is my code:
#! python3
#
# madlibs.py - reads a text file and lets the user add their own text
# anywhere the words ADJECTIVE, NOUN, ADVERB, or VERB appear in the text
# file.
import sys, re, copy
# open text file, save text to variable
if len(sys.argv) == 2:
    print('Opening text file...')
    textSource = open(sys.argv[1])
    textContent = textSource.read()
    textSource.close()
else:
    print('Usage: madlibs.py <textSource>')
# locate instances of keywords
keywordRegex = re.compile(r'ADJECTIVE|NOUN|ADVERB|VERB', re.I)
matches = keywordRegex.findall(textContent)
# prompt user to replace keywords with their own input
answers = copy.copy(matches)
for i in range(len(answers)):
    answers[i] = input()
# create a new text file with the end result
for i in range(len(matches)):
    findMatch = re.compile(matches[i])
    textContent = findMatch.sub(answers[i], textContent)
print(textContent)
textEdited = open('madlibbed.txt', 'w')
textEdited.write(textContent)
textEdited.close()
The input I'm using for textSource is a text file that reads:
This is the test source file. It has the keyword ADJECTIVE in it, as well as the keyword NOUN. Also, it has another instance of NOUN and then one of ADVERB.
My problem is that the findMatch.sub method is replacing both instances of NOUN at once. I understand that this is how the sub() method works, but I'm having trouble thinking of a simple way to work around it. How can I design this program so that it targets and replaces only one keyword at a time? I don't want every NOUN to be replaced with the same word, but rather with different words, in the order the user types them.

All you need is to pass the keyword argument count to sub(), so that it replaces no more occurrences than you set:
textContent = findMatch.sub(answers[i], textContent, count=1)
for more details, see https://docs.python.org/3/library/re.html#re.sub
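Applied to the loop in the question, a minimal sketch (the text is the question's sample; the answers list is made up to stand in for interactive input()):

```python
import re

# Sample text from the question
textContent = ('This is the test source file. It has the keyword ADJECTIVE '
               'in it, as well as the keyword NOUN. Also, it has another '
               'instance of NOUN and then one of ADVERB.')

keywordRegex = re.compile(r'ADJECTIVE|NOUN|ADVERB|VERB', re.I)
matches = keywordRegex.findall(textContent)

# Hypothetical answers standing in for user input
answers = ['silly', 'cat', 'dog', 'quickly']

for keyword, answer in zip(matches, answers):
    # count=1 replaces only the leftmost remaining occurrence,
    # so the second NOUN survives until its own turn in the loop
    textContent = re.sub(keyword, answer, textContent, count=1)

print(textContent)
```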

thodnev's answer works; however, you are sometimes better off tokenizing the string first and then building a new string from the parts.
If your string is:
textContent = 'This is the test source file. It has the keyword ADJECTIVE in it, as well as the keyword NOUN. Also, it has another instance of NOUN and then one of ADVERB.'
then you can use a re.finditer to do this:
for it in re.finditer(r'ADJECTIVE|NOUN|ADVERB|VERB', textContent):
    print(it.span(), it.group())
gives
(49, 58) ADJECTIVE
(89, 93) NOUN
(128, 132) NOUN
(149, 155) ADVERB
You can use this span information with string slicing to build a new string the way you want.
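For example, a sketch that stitches the string back together from the spans (the replacement words here are made up):

```python
import re

textContent = ('This is the test source file. It has the keyword ADJECTIVE '
               'in it, as well as the keyword NOUN. Also, it has another '
               'instance of NOUN and then one of ADVERB.')
# Made-up replacements, one per match in order of appearance
replacements = iter(['silly', 'cat', 'dog', 'quickly'])

parts = []
last = 0
for it in re.finditer(r'ADJECTIVE|NOUN|ADVERB|VERB', textContent):
    start, end = it.span()
    parts.append(textContent[last:start])  # untouched text before the keyword
    parts.append(next(replacements))       # the user's word for this keyword
    last = end
parts.append(textContent[last:])           # whatever follows the last keyword
result = ''.join(parts)
print(result)
```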

Related

I'm looking for a way to extract strings from a text file using specific criteria

I have a text file containing random strings. I want to use specific criteria to extract the strings that match them.
Example text :
B311-SG-1700-ASJND83-ANSDN762
BAKSJD873-JAN-1293
Example criteria :
All the strings that contain characters separated by hyphens this way: XXXX-XX-XXXX
Output : 'B311-SG-1700'
I tried creating a function, but I can't seem to figure out how to define such criteria for strings or how to apply them.
Based on your comment, here is a Python script that might do what you want (I'm not that familiar with Python).
import re
p = re.compile(r'\b(.{4}-.{2}-.{4})')
results = p.findall('B111-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293\nB211-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293 B311-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293')
print(results)
Output:
['B111-SG-1700', 'B211-SG-1700', 'B311-SG-1700']
You can read a file as a string like this
text_file = open("file.txt", "r")
data = text_file.read()
And use findall over that. Depending on the size of the file, it might require a bit more work (e.g. reading line by line).
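Putting the file read and the findall together, a sketch (it writes a small example file first so it can run on its own; in practice file.txt already exists):

```python
import re

# Write a small example file first so the snippet is self-contained
with open('file.txt', 'w') as f:
    f.write('B311-SG-1700-ASJND83-ANSDN762\nBAKSJD873-JAN-1293\n')

text_file = open('file.txt', 'r')
data = text_file.read()
text_file.close()

# Same pattern as above: 4 chars, hyphen, 2 chars, hyphen, 4 chars
results = re.findall(r'\b(.{4}-.{2}-.{4})', data)
print(results)
```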
You can use re module to extract the pattern from text:
import re
text = """\
B311-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293
BAKSJD873-JAN-1293 B312-SG-1700-ASJND83-ANSDN762"""
for m in re.findall(r"\b.{4}-.{2}-.{4}", text):
    print(m)
Prints:
B311-SG-1700
B312-SG-1700

Automating The Boring Stuff With Python - Chapter 8 - Exercise - Regex Search

I'm trying to complete the exercise for Chapter 8, which takes a user-supplied regular expression and uses it to search each string in each text file in a folder.
I keep getting the error:
AttributeError: 'NoneType' object has no attribute 'group'
The code is here:
import os, glob, re
os.chdir("C:\Automating The Boring Stuff With Python\Chapter 8 - \
Reading and Writing Files\Practice Projects\RegexSearchTextFiles")
userRegex = re.compile(input('Enter your Regex expression :'))
for textFile in glob.glob("*.txt"):
    currentFile = open(textFile) #open the text file and assign it to a file object
    textCurrentFile = currentFile.read() #read the contents of the text file and assign to a variable
    print(textCurrentFile)
    #print(type(textCurrentFile))
    searchedText = userRegex.search(textCurrentFile)
    searchedText.group()
When I try this individually in the IDLE shell it works:
textCurrentFile = "What is life like for those left behind when the last foreign troops flew out of Afghanistan? Four people from cities and provinces around the country told the BBC they had lost basic freedoms and were struggling to survive."
>>> userRegex = re.compile(input('Enter the your Regex expression :'))
Enter the your Regex expression :troops
>>> searchedText = userRegex.search(textCurrentFile)
>>> searchedText.group()
'troops'
But I can't seem to make it work in the code when I run it. I'm really confused.
Thanks
Since you are just looping across all .txt files, there could be files that don't have the word "troops" in them. To prove this, don't call .group(); just perform:
print(textFile, textCurrentFile, searchedText)
If you see that searchedText is None, then that means the contents of textFile (which is textCurrentFile) don't contain the word "troops".
You could either:
Add the word troops to all .txt files.
Only select the target .txt files, not all of them.
Check first if the match is found before accessing .group():
print(searchedText.group() if searchedText else None)
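A sketch of the loop with the None check in place (the file contents here are made-up stand-ins for the real .txt files):

```python
import re

userRegex = re.compile('troops')   # stands in for the user-typed pattern

# Made-up stand-ins for the contents of the .txt files in the folder
files = {
    'news.txt': 'the last foreign troops flew out of Afghanistan',
    'notes.txt': 'this file does not mention the keyword at all',
}

for textFile, textCurrentFile in files.items():
    searchedText = userRegex.search(textCurrentFile)
    # search() returns None on no match, so guard before calling .group()
    print(textFile, searchedText.group() if searchedText else None)
```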

How to remove duplicate sentences from paragraph using NLTK?

I have a huge document with many repeated sentences (footer text, hyperlinks with alphanumeric characters), and I need to get rid of those repeated hyperlinks and footer text. I have tried the code below but unfortunately couldn't succeed. Please review and help.
corpus = "We use file handling methods in python to remove duplicate lines in python text file or function. The text file or function has to be in the same directory as the python program file. Following code is one way of removing duplicates in a text file bar.txt and the output is stored in foo.txt. These files should be in the same directory as the python script file, else it won’t work.Now, we should crop our big image to extract small images with amounts.In terms of topic modelling, the composites are documents and the parts are words and/or phrases (phrases n words in length are referred to as n-grams).We use file handling methods in python to remove duplicate lines in python text file or function.As an example I will use some image of a bill, saved in the pdf format. From this bill I want to extract some amounts.All our wrappers, except of textract, can’t work with the pdf format, so we should transform our pdf file to the image (jpg). We will use wand for this.Now, we should crop our big image to extract small images with amounts."
from nltk.tokenize import sent_tokenize
sentences_with_dups = []
for sentence in corpus:
    words = sentence.sent_tokenize(corpus)
    if len(set(words)) != len(words):
        sentences_with_dups.append(sentence)
        print(sentences_with_dups)
    else:
        print('No duplciates found')
Error message for the above code :
AttributeError: 'str' object has no attribute 'sent_tokenize'
Desired Output :
Duplicates = ['We use file handling methods in python to remove duplicate lines in python text file or function.','Now, we should crop our big image to extract small images with amounts.']
Cleaned_corpus = {removed duplicates from corpus}
First of all, the example you provided has missing spaces between the last period of one sentence and the start of the next in several places, so I cleaned it up.
Then you can do:
from nltk.tokenize import sent_tokenize

corpus = "......"
sentences = sent_tokenize(corpus)
duplicates = list(set([s for s in sentences if sentences.count(s) > 1]))
cleaned = list(set(sentences))
The above will lose the original order. If you care about the order, you can do the following to preserve it:
duplicates = []
cleaned = []
for s in sentences:
    if s in cleaned:
        if s in duplicates:
            continue
        else:
            duplicates.append(s)
    else:
        cleaned.append(s)
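The list membership tests above rescan the lists for every sentence; for a large corpus, a set makes the same idea linear overall. A sketch (the function name is mine):

```python
def split_dupes(sentences):
    # seen makes the membership test O(1) instead of scanning a list
    seen = set()
    duplicates, cleaned = [], []
    for s in sentences:
        if s in seen:
            if s not in duplicates:
                duplicates.append(s)
        else:
            seen.add(s)
            cleaned.append(s)
    return duplicates, cleaned

dups, clean = split_dupes(['a', 'b', 'a', 'c', 'a', 'b'])
print(dups)   # ['a', 'b']
print(clean)  # ['a', 'b', 'c']
```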

How can I search a pattern and extract the value behind it

I am a newbie in Python. I am trying to pull data (XXXX) out of a text with the pattern PDB:XXXX. The XXXX varies, but it is exactly what I want.
Since the data all contain PDB:, I use re.findall() to search for this pattern. But this only gave me a list of PDB:. How can I get it to include the XXXX?
this is my code:
text = 'blah...........
PDB:AAAA
blah...........
blah...........
PDB:BBBB'
etc.
r = re.findall("PDB:",text)
and the output gave me:
['PDB:', 'PDB:']
My desired output should be something like
['AAAA', 'BBBB']
You need to use """ to quote multi-line strings in Python. Also, to get a specific subset of the matched pattern, you need to use capture groups (the parentheses in my regular expression below).
import re
text = """blah...........
PDB:AAAA
blah...........
blah...........
PDB:BBBB"""
results = re.findall(r"PDB:(.*)", text)
print(results)  # ['AAAA', 'BBBB']
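If other text can follow the identifier on the same line, a stricter capture group such as (\w+) stops at the first non-word character instead of grabbing the rest of the line:

```python
import re

text = "see PDB:AAAA and also PDB:BBBB in the same line"
# \w+ captures letters/digits/underscore and stops at the first space
print(re.findall(r"PDB:(\w+)", text))
```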

I want to extract sentences containing a drug and gene name from 10,000 articles

I want to extract sentences containing a drug name and a gene name from 10,000 articles. My code is:
import re
import glob
import fnmatch
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
flist= glob.glob ("C:/Users/Emma Belladona/Desktop/drug working/*.txt")
print (flist)
for txt in flist:
    #print (txt)
    fr = open (txt, "r")
    tmp = fr.read().strip()
    a = (sent_tokenize(tmp))
    b = (word_tokenize(tmp))
    for c, value in enumerate(a, 1):
        if value.find("SLC22A1") != -1 and value.find("Metformin"):
            print ("Result", value)
            re.findall("\w+\s?[gene]+", a)
        else:
            if value.find("Metformin") != -1 and value.find("SLC22A1"):
                print ("Results", value)
            if value.find("SLC29B2") != -1 and value.find("Metformin"):
                print ("Result", value)
I want to extract sentences that have a gene and drug name from the whole body of an article. For example: "Metformin decreased logarithmically converted SLC22A1 excretion (from 1.58±0.47 to 1.00±0.52, p=0.001)." "In conclusion, we could not demonstrate striking associations of the studied polymorphisms of SLC22A1, ACE, AGTR1, and ADD1 with antidiabetic responses to metformin in this well-controlled study."
This code returns a lot of sentences, i.e. if just one of the above words appears in a sentence, it gets printed out!
Please help me fix the code for this.
You don't show your real code, but the code you have now has at least one mistake that would lead to lots of spurious output. It's on this line:
re.findall("\w+\s?[gene]+", a)
This regexp does not match strings containing gene, as you clearly intended. It matches (almost) any string that contains one of the letters g, e or n.
This cannot be your real code, since a is a list and you would get an error on this line-- plus you ignore the results of the findall()! Sort out your question so it reflects reality. If your problem is still not solved, edit your question and include at least one sentence that is part of the output but you do NOT want to be seeing.
When you do this:
if value.find("SLC22A1") != -1 and value.find("Metformin"):
You're testing for "SLC22A1" in the string and "Metformin" not at the start of the string (the second part is probably not what you want).
You probably wanted this:
if value.find("SLC22A1") != -1 and value.find("Metformin") != -1:
The find method is error-prone because of its return value, and you don't care about the position anyway, so you'd be better off with in.
To test for 2 words in a sentence case-insensitively, lowercase it once and do it like this:
vlow = value.lower()
if "slc22a1" in vlow and "metformin" in vlow:
I'd take a different approach:
Read in the text file
Split the text file into sentences. Check out https://stackoverflow.com/a/28093215/223543 for a hand-rolled approach to do this. Or you could use the nltk.tokenize.punkt module. (Edited after Alexis pointed me in the right direction in the comments below.)
Check if I find your key terms in each sentence and print if I do.
As long as your text files are well formatted, this should work.
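A sketch of those steps, using a simple regex split as a rough stand-in for a real sentence tokenizer (the sample text is made up; file reading is replaced by an in-line string):

```python
import re

# Made-up article text standing in for the contents of one file
text = ('Metformin decreased SLC22A1 excretion. This sentence mentions '
        'neither term. No association of SLC22A1 with metformin response '
        'was demonstrated.')

# Step 2: naive sentence split on ., ! or ? followed by whitespace
sentences = re.split(r'(?<=[.!?])\s+', text)

# Step 3: keep sentences that mention both the gene and the drug
hits = [s for s in sentences
        if 'slc22a1' in s.lower() and 'metformin' in s.lower()]
print(hits)
```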
