Automating The Boring Stuff With Python - Chapter 8 - Exercise - Regex Search - python-3.x

I'm trying to complete the Chapter 8 exercise, which takes a user-supplied regular expression and uses it to search each string in each text file in a folder.
I keep getting the error:
AttributeError: 'NoneType' object has no attribute 'group'
The code is here:
import os, glob, re
os.chdir("C:\Automating The Boring Stuff With Python\Chapter 8 - \
Reading and Writing Files\Practice Projects\RegexSearchTextFiles")
userRegex = re.compile(input('Enter your Regex expression :'))
for textFile in glob.glob("*.txt"):
    currentFile = open(textFile) #open the text file and assign it to a file object
    textCurrentFile = currentFile.read() #read the contents of the text file and assign to a variable
    print(textCurrentFile)
    #print(type(textCurrentFile))
    searchedText = userRegex.search(textCurrentFile)
    searchedText.group()
When I try this individually in the IDLE shell it works:
textCurrentFile = "What is life like for those left behind when the last foreign troops flew out of Afghanistan? Four people from cities and provinces around the country told the BBC they had lost basic freedoms and were struggling to survive."
>>> userRegex = re.compile(input('Enter the your Regex expression :'))
Enter the your Regex expression :troops
>>> searchedText = userRegex.search(textCurrentFile)
>>> searchedText.group()
'troops'
But I can't seem to make it work in the code when I run it. I'm really confused.
Thanks

Since you are looping over all the .txt files, some of them may not contain the word "troops". To check this, don't call .group(); just run:
print(textFile, textCurrentFile, searchedText)
If searchedText is None, then the contents of textFile (stored in textCurrentFile) don't contain the word "troops".
You could do one of the following:
Add the word troops to all the .txt files.
Select only the target .txt files, not all of them.
Check that a match was found before accessing .group():
print(searchedText.group() if searchedText else None)
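The third option can be folded into the loop from the question. What follows is a minimal sketch rather than the book's reference solution: the function name search_text_files, the dictionary return value, and the utf-8 encoding are all assumptions made for illustration.

```python
import glob
import os
import re


def search_text_files(pattern, folder):
    """Return {filename: first matched text} for the .txt files that match.

    Hypothetical helper: the name and the return shape are illustrative only.
    """
    regex = re.compile(pattern)
    results = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        # Assumes the files are utf-8 encoded.
        with open(path, encoding="utf-8") as text_file:
            match = regex.search(text_file.read())
        # search() returns None when nothing matches, so test before .group().
        if match is not None:
            results[os.path.basename(path)] = match.group()
    return results
```

Files without a match are simply left out of the result, so .group() is never called on None.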

Related

How do I perform a regular expression on multiple .txt files in a folder (Python)?

I'm trying to open 32 .txt files, extract some text from them (using RegEx) and then save them as individual files again (later in the project I'm hoping to collate them). I've tested the RegEx on a single file and it seems to work:
import os
import re
os.chdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation planning\Manual scrape\Finished years proper')
with open('1988.txt') as txtfile:
    text = txtfile.read()
#print(len(text)) #sentences in text
start = r'Body\n\n\n'
docs = re.findall(start, text)
print('Found the start of %s documents.' % len(docs))
end = r'Load-Date:'
docs = re.findall(end, text)  # count the end markers before reporting them
print('Found the end of %s documents.' % len(docs))
regex = start+r'(.+?)'+end
articles = re.findall(regex, text, re.S)
print('You have now parsed the 154 articles so only the body of content remains. All metadata has been removed.')
print('Here is an example of a parsed article:', articles[0])
Now I want to do exactly the same thing for all the .txt files in that folder, but I can't figure out how. I've been playing around with for loops, with little success. Currently I have this:
import os
import re
finished_years_proper= os.listdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Manual scrape\Finished years proper')
os.chdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Manual scrape\Finished years proper')
print('There are %s .txt files in this folder.' % len(finished_years_proper))
if i.endswith(".txt"):
    with open(finished_years_proper + i, 'r') as all_years:
        for line in all_years:
            start = r'Body\n\n\n'
            docs = re.findall(start, all_years)
            end = r'Load-Date:'
            docs = re.findall(end, all_years)
            regex = start+r'(.+?)'+end
            articles = re.findall(regex, all_years, re.S)
However, I'm returning a type error:
File "C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Method\Python\untitled1.py", line 15, in <module>
with open(finished_years_proper + i, 'r') as all_years:
TypeError: can only concatenate list (not "str") to list
I'm unsure how to proceed... I've seen on other forums that I should convert something into a string, but I'm not sure what to convert or even if this is the right way to proceed. Any help with this would be really appreciated!
After working Benedictanjw's answer into my code, this is what I ended up with:
all_years= []
for fyp in finished_years_proper: #fyp is each text file in folder
    with open(fyp, 'r') as year:
        for line in year: #line is each element in each text file in folder
            start = r'Body\n\n\n'
            docs = re.findall(start, line)
            end = r'Load-Date:'
            docs = re.findall(end, line)
            regex = start+r'(.+?)'+end
            articles = re.findall(regex, line, re.S)
            all_years.append(articles) #append strings to reflect RegEx
parsed_documents= all_years.append(articles)
print(parsed_documents) #returns None. Apparently this is okay.
Does the 'None' mean that the parsing of each file is successful (as in it emulates the result I had when I tested the RegEx on a single file)? And if so, how can I visualise my output without returning None. Many thanks in advance!!
The error occurs because finished_years_proper is a list, and in your line:
with open(finished_years_proper + i, 'r') as all_years:
you are trying to concatenate the string i to that list. I presume you had accidentally defined i elsewhere as a string. You probably want to do something like:
all_years = []
for fyp in finished_years_proper:
    with open(fyp, 'r') as year:
        for line in year:
            ... # your regex search on year
            all_years.append(xxx)
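On the follow-up question: list.append() changes the list in place and returns None, so parsed_documents is None no matter what was parsed; printing all_years itself shows the results. Note also that the pattern spans several lines (Body, three newlines, then text up to Load-Date:), so it cannot match when it is applied to one line at a time. A minimal sketch that reads each file whole, assuming utf-8 files; extract_articles and ARTICLE_RE are hypothetical names, not part of the original code:

```python
import re
from pathlib import Path

# Same pattern as in the question: the text between 'Body\n\n\n' and 'Load-Date:'.
ARTICLE_RE = re.compile(r'Body\n\n\n(.+?)Load-Date:', re.S)


def extract_articles(folder):
    """Return the article bodies found in every .txt file in folder."""
    all_years = []
    for path in sorted(Path(folder).glob('*.txt')):
        # Read the whole file: the pattern crosses line boundaries.
        text = path.read_text(encoding='utf-8')  # encoding is an assumption
        # extend() adds the matches in place; like append(), it returns None.
        all_years.extend(ARTICLE_RE.findall(text))
    return all_years
```

Calling print(len(extract_articles(folder))) would then report how many articles were parsed across all the files.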

Why the output of "open" function doesn't allow me to attribute index?

I started learning to program in Python 3 and I am doing a project that reads the content of a text file and tells you how many words are in the file. I always want to challenge myself, so I tried to add the name of the file to the output message, so that in the future I can build a GUI for it and so on.
The error that I get is : AttributeError: '_io.TextIOWrapper' object has no attribute 'index'
Here is my code:
# Open text file
document = open("text2.txt", "r+")
# Reads the text file and splits it into arrays
text_split = document.read().split()
# Count the words
words = len(text_split)
# Display the counted words
document_name = document[document.index("name=")]
output = "In the file {} there are {} words.".format(document_name, words)
print (output)
I decided to take @Jean-François Fabre's advice and abandoned the idea of also outputting the name of the file (for now).
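For reference, the error arises because a file object cannot be searched or indexed like a string. File objects returned by open() do carry a name attribute holding the path that was passed in. A minimal sketch under that observation; the function name count_words and the utf-8 encoding are assumptions for illustration:

```python
def count_words(path):
    """Return a message with the file's name and its word count."""
    # Assumes a utf-8 text file; adjust the encoding if yours differs.
    with open(path, "r", encoding="utf-8") as document:
        # split() with no arguments splits on any run of whitespace.
        words = len(document.read().split())
        # document.name is the path that was passed to open().
        return "In the file {} there are {} words.".format(document.name, words)
```

For example, print(count_words("text2.txt")) would produce the message the question was aiming for.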

read_csv naming the resulted dataframes

I want to read some CSVs from a given directory and name the resulting dataframes after the CSV files.
I wrote the code below, but I am aware that it is not the right syntax.
Besides, I get the error:
TypeError: 'str' object does not support item assignment
My code :
import os
for element in os.listdir('.'):
    element[:-4] = read_csv(element)
Thank you for your help
You can do that by tampering with the global scope as follows:
import os
import pandas as pd
for i in os.listdir('.'):
    globals()[i] = pd.read_csv(i)
But that's very ugly and, as @JonClements pointed out, it won't work if the filename doesn't follow Python's variable naming rules. As a reminder, the variable naming rules are:
Variable names must start with a letter or an underscore, such as:
_underscore
underscore_
The remainder of your variable name may consist of letters, numbers and underscores.
password1
n00b
un_der_scores
check this link for more explanation.
The best way is to create a dictionary:
import os
import pandas as pd
d = {}
for i in os.listdir('.'):
    d[i] = pd.read_csv(i)
Then you can access any dataframe you want as follows: d['file1.csv']
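A variation on the dictionary approach keys each entry by the file name without its extension, which is closer to what the question's element[:-4] was reaching for, and skips files that are not CSVs. This is only a sketch: load_tables is a hypothetical helper, and the reader argument stands in for pd.read_csv so that the snippet does not depend on pandas being installed.

```python
from pathlib import Path


def load_tables(folder, reader):
    """Map each CSV file's stem (name without '.csv') to reader(path).

    reader is any callable that accepts a path, for example pandas.read_csv.
    """
    return {
        path.stem: reader(path)
        for path in sorted(Path(folder).glob("*.csv"))
    }
```

With pandas imported as pd, frames = load_tables('.', pd.read_csv) would let you write frames['file1'] instead of d['file1.csv'].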

Building a mad libs program in Python using regular expressions

I'm a new Python programmer working through the book Automate the Boring Stuff with Python. One of the end-of-chapter projects is to build a mad libs program. Based on what has been introduced so far, I think that the author intends for me to use regular expressions.
Here is my code:
#! python3
#
# madlibs.py - reads a text file and lets the user add their own text
# anywhere the words ADJECTIVE, NOUN, ADVERB, or VERB appear in the text
# file.
import sys, re, copy
# open text file, save text to variable
if len(sys.argv) == 2:
    print('Opening text file...')
    textSource = open(sys.argv[1])
    textContent = textSource.read()
    textSource.close()
else:
    print('Usage: madlibs.py <textSource>')
    sys.exit()  # stop here; textContent is undefined without an input file
# locate instances of keywords
keywordRegex = re.compile(r'ADJECTIVE|NOUN|ADVERB|VERB', re.I)
matches = keywordRegex.findall(textContent)
# prompt user to replace keywords with their own input
answers = copy.copy(matches)
for i in range(len(answers)):
    answers[i] = input()
# create a new text file with the end result
for i in range(len(matches)):
    findMatch = re.compile(matches[i])
    textContent = findMatch.sub(answers[i], textContent)
print(textContent)
textEdited = open('madlibbed.txt', 'w')
textEdited.write(textContent)
textEdited.close()
The input I'm using for textSource is a text file that reads:
This is the test source file. It has the keyword ADJECTIVE in it, as well as the keyword NOUN. Also, it has another instance of NOUN and then one of ADVERB.
My problem is that the findMatch.sub method is replacing both of the instances of NOUN at once. I understand that this is how the sub() method works, but I'm having trouble thinking of a simple way to work around it. How can I design this program so that it only targets and replaces one keyword at a time? I don't want all NOUNS to be replaced with the same word, but rather different words respective to the order in which the user types them.
All you need to do is pass the keyword argument count to sub, so that it replaces no more occurrences than you specify.
textContent = findMatch.sub(answers[i], textContent, count=1)
For more details, see https://docs.python.org/3/library/re.html#re.sub
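To illustrate the effect of count=1, here is a small sketch in which each call to sub() replaces only the first remaining occurrence, so successive answers land on successive keywords. The helper name replace_in_order is hypothetical and not part of the original program.

```python
import re


def replace_in_order(text, keyword, answers):
    """Replace successive occurrences of keyword with successive answers."""
    pattern = re.compile(re.escape(keyword))
    for answer in answers:
        # count=1 limits this call to the first occurrence still in the text.
        text = pattern.sub(answer, text, count=1)
    return text
```

For instance, replace_in_order('A NOUN and another NOUN.', 'NOUN', ['cat', 'dog']) returns 'A cat and another dog.'. Two caveats of this approach: an answer that itself contains the keyword would be picked up by the next call, and backslashes in an answer are interpreted by sub() as escapes.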
thodnev's answer works; however, you are sometimes better off tokenizing the string first and then building a new string from the parts.
If your string is:
textContent = 'This is the test source file. It has the keyword ADJECTIVE in it, as well as the keyword NOUN. Also, it has another instance of NOUN and then one of ADVERB.'
then you can use re.finditer to do this:
for it in re.finditer(r'ADJECTIVE|NOUN|ADVERB|VERB', textContent):
    print(it.span(), it.group())
gives
(49, 58) ADJECTIVE
(89, 93) NOUN
(128, 132) NOUN
(149, 155) ADVERB
You can use these spans with string slicing to build a new string the way you want.
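One way to use the spans is to copy the text between matches unchanged and drop one answer into each match position. The following is a sketch under the assumption that the answers have already been collected in order; fill_madlib and KEYWORD_RE are illustrative names only.

```python
import re

KEYWORD_RE = re.compile(r'ADJECTIVE|NOUN|ADVERB|VERB')


def fill_madlib(text, answers):
    """Replace each keyword in text with the next answer, in order."""
    parts = []
    last_end = 0
    answer_iter = iter(answers)
    for match in KEYWORD_RE.finditer(text):
        start, end = match.span()
        parts.append(text[last_end:start])  # text before this keyword
        parts.append(next(answer_iter))     # the user's replacement
        last_end = end
    parts.append(text[last_end:])           # whatever follows the last keyword
    return ''.join(parts)
```

Because the original text is scanned only once, an answer that happens to contain a keyword is never replaced again. If fewer answers than keywords are supplied, next() raises StopIteration.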

Comparing input against .txt and receiving error

I'm trying to compare a user's input with the contents of a .txt file, but they are never equal. The .txt file contains the number 12. When I print what I read from the .txt file, I get
<_io.TextIOWrapper name='text.txt' encoding='cp1252'>
my code is
import vlc
a = input("test ")
rflist = open("text.txt", "r")
print(a)
print(rflist)
if rflist == a:
    p = vlc.MediaPlayer('What Sarah Said.mp3')
    p.play()
else:
    print('no')
So am I doing something wrong with my open(), or is it something else entirely?
To print the contents of the file instead of the file object, try
print(rflist.read())
instead of
print(rflist)
A file object is not the text contained in the file itself, but rather a wrapper object that facilitates operations on the file, like reading its contents or closing it.
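The same point applies to the comparison in the question: rflist == a compares a file object with a string, which is never equal. A minimal sketch of comparing the file's contents instead, with surrounding whitespace stripped because a text file often ends with a newline; the function name file_matches_input and the utf-8 encoding are assumptions for illustration.

```python
def file_matches_input(user_text, path):
    """Return True if the file's contents equal user_text, ignoring outer whitespace."""
    # Assumes a utf-8 text file; adjust the encoding if yours differs.
    with open(path, "r", encoding="utf-8") as rflist:
        contents = rflist.read()
    # strip() removes the trailing newline that editors usually add.
    return contents.strip() == user_text.strip()
```

In the original program, the condition would then read if file_matches_input(a, "text.txt"): before starting playback.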
rflist.read() or rflist.readline() is correct.
Read section 7.2 of the documentation.
Dive Into Python is a fantastic book for getting started with Python. Take a look at it and you won't be able to put it down.

Resources