How do I perform a regular expression on multiple .txt files in a folder (Python)? - python-3.x

I'm trying to open 32 .txt files, extract some text from each of them using a regular expression, and then save the results as individual files again (later in the project I hope to collate them). I've tested the regex on a single file and it seems to work:
import os
import re
os.chdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation planning\Manual scrape\Finished years proper')
with open('1988.txt') as txtfile:
    text = txtfile.read()
#print(len(text)) #sentences in text
start = r'Body\n\n\n'
docs = re.findall(start, text)
print('Found the start of %s documents.' % len(docs))
end = r'Load-Date:'
docs = re.findall(end, text)
print('Found the end of %s documents.' % len(docs))
regex = start+r'(.+?)'+end
articles = re.findall(regex, text, re.S)
print('You have now parsed the 154 articles so only the body of content remains. All metadata has been removed.')
print('Here is an example of a parsed article:', articles[0])
Now I want to do exactly the same thing for all the .txt files in that folder, but I can't figure out how. I've been playing around with for loops, with little success. Currently I have this:
import os
import re
finished_years_proper= os.listdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Manual scrape\Finished years proper')
os.chdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Manual scrape\Finished years proper')
print('There are %s .txt files in this folder.' % len(finished_years_proper))
if i.endswith(".txt"):
    with open(finished_years_proper + i, 'r') as all_years:
        for line in all_years:
            start = r'Body\n\n\n'
            docs = re.findall(start, all_years)
            end = r'Load-Date:'
            docs = re.findall(end, all_years)
            regex = start+r'(.+?)'+end
            articles = re.findall(regex, all_years, re.S)
However, I'm returning a type error:
  File "C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Method\Python\untitled1.py", line 15, in <module>
    with open(finished_years_proper + i, 'r') as all_years:
TypeError: can only concatenate list (not "str") to list
I'm unsure how to proceed... I've seen on other forums that I should convert something into a string, but I'm not sure what to convert or even if this is the right way to proceed. Any help with this would be really appreciated!
After working Benedictanjw's suggestion into my code, this is what I ended up with:
all_years= []
for fyp in finished_years_proper: #fyp is each text file in folder
    with open(fyp, 'r') as year:
        for line in year: #line is each element in each text file in folder
            start = r'Body\n\n\n'
            docs = re.findall(start, line)
            end = r'Load-Date:'
            docs = re.findall(end, line)
            regex = start+r'(.+?)'+end
            articles = re.findall(regex, line, re.S)
            all_years.append(articles) #append strings to reflect RegEx
parsed_documents= all_years.append(articles)
print(parsed_documents) #returns None. Apparently this is okay.
Does the 'None' mean that each file was parsed successfully (that is, that it reproduces the result I got when I tested the regex on a single file)? And if so, how can I view my output instead of getting None? Many thanks in advance!

The error occurs because finished_years_proper is a list, and in your line:
with open(finished_years_proper + i, 'r') as all_years:
you are trying to concatenate the string i to that list. I presume you had accidentally defined i elsewhere as a string. You probably want to do something like:
all_years = []
for fyp in finished_years_proper:
    with open(fyp, 'r') as year:
        for line in year:
            ... # your regex search on year
            all_years.append(xxx)
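Two details in the follow-up code are worth spelling out. First, list.append always returns None, so parsed_documents is None whether or not the parsing worked; the results live in all_years. Second, the pattern spans several lines (from 'Body\n\n\n' to 'Load-Date:'), so matching it against one line at a time is unlikely to find anything. Below is a minimal sketch that reads each file as a whole instead. It assumes the same markers as the question; the names extract_articles and parse_folder are hypothetical.

```python
import os
import re

# Same markers as in the question; re.S lets '.' match newlines too.
ARTICLE_RE = re.compile(r'Body\n\n\n(.+?)Load-Date:', re.S)

def extract_articles(text):
    """Return the body of every article found in one file's text."""
    return ARTICLE_RE.findall(text)

def parse_folder(folder):
    """Map each .txt filename in folder to its list of article bodies."""
    parsed = {}
    for name in sorted(os.listdir(folder)):
        if not name.endswith('.txt'):
            continue  # ignore anything that is not a text file
        # Join folder and filename so the current directory does not matter.
        with open(os.path.join(folder, name), 'r') as year_file:
            parsed[name] = extract_articles(year_file.read())
    return parsed
```

Printing len(articles) for each entry of the returned dictionary gives a quick check that every file produced the expected number of articles.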

Related

How to read many files whose names follow a specific format in Python

I am a little confused about how to read all the lines in many files whose names run from "datalog.txt.98" to "datalog.txt.120".
This is my code:
import json
file = "datalog.txt."
i = 97
for line in file:
    i+=1
    f = open (line + str (i),'r')
    for row in f:
        print (row)
Here, you will find an example of one line in one of those files:
I really need your help.
I suggest using a loop to open the files, building each file name from the shared prefix and a number.
To better understand this project, I would recommend researching the following topics:
for loops,
String manipulation,
Opening a file and reading its content,
List manipulation,
String parsing.
This is one of my favourite beginner guides.
To generate the integers at the end of the file names, I would look into Python for loops.
I think this is what you are trying to do:
# create a list to store all your file content
files_content = []
# the prefix is of type string
filename_prefix = "datalog.txt."
# loop from 98 to 120 inclusive (range stops before 121)
for i in range(98, 121):
    # make the filename variable with the prefix and
    # the integer i which you need to convert to a string type
    filename = filename_prefix + str(i)
    # open the file and read all the lines into a variable
    with open(filename) as f:
        content = f.readlines()
    # append the file content to the files_content list
    files_content.append(content)
To strip the surrounding whitespace from each line, add the following line inside the loop, just before the append:
    content = [x.strip() for x in content]
    files_content.append(content)
Here's an example of printing out files_content:
for file in files_content:
    print(file)
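The same idea can be wrapped in a small function so the number range is not hard-coded. This is only a sketch: the name read_numbered_files and its parameters are hypothetical, and skipping missing files is an assumption about what is wanted when the numbering has gaps.

```python
import os

def read_numbered_files(prefix, first, last):
    """Read prefix + str(n) for every n from first to last (inclusive).

    Returns a dict mapping each filename that exists to its stripped lines.
    """
    contents = {}
    for n in range(first, last + 1):
        filename = prefix + str(n)
        if not os.path.exists(filename):
            continue  # skip gaps in the numbering instead of crashing
        with open(filename) as f:
            contents[filename] = [line.strip() for line in f]
    return contents
```

For the files in the question, the call would be read_numbered_files("datalog.txt.", 98, 120).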

Trying to pull a twitter handle from a text file

I am trying to extract a set of alphanumeric characters from a text file.
Below are some lines from the file. I want to extract each '#' together with the word that follows it.
im trying to pull #bob from a file.
this is a #line in the #file
#bob is a wierdo
The code below is what I have so far.
def getAllPeople(fileName):
    #give empty list
    allPeople=[]
    #open TweetsFile.txt
    with open(fileName, 'r') as f1:
        lines=f1.readlines()
        #split all words into strings
        for word in lines:
            char = word.split("#")
            print(char)
    #close the file
    f1.close()
What I am trying to get is:
['#bob','#line','#file', '#bob']
If you do not want to use re, take Andrew's suggestion:
mentions = list(filter(lambda x: x.startswith('#'), tweet.split()))
Otherwise, see the marked duplicate.
Since you apparently cannot use filter or lambda, a list comprehension does the same thing:
mentions = [w for w in tweet.split() if w.startswith('#')]
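For completeness, here is a sketch of the re-based alternative applied to a whole file. It assumes a handle is a '#' followed by letters, digits or underscores; get_all_people is a hypothetical snake_case version of the asker's function name.

```python
import re

def get_all_people(file_name):
    """Return every '#word' token in the file, in order of appearance."""
    with open(file_name, 'r') as f:
        text = f.read()
    # '#' followed by one or more letters, digits or underscores
    return re.findall(r'#\w+', text)
```

Unlike the split-based versions, this also picks up a handle that is directly followed by punctuation, such as "#bob," or "#bob.", without including the punctuation.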

How do I print out results on a separate line after converting them from a set to a string?

I am currently trying to compare two text files to see whether they have any words in common.
The text files are as follows:
ENGLISH.TXT
circle
table
year
competition
FRENCH.TXT
bien
competition
merci
air
table
My current code gets them to print, and I've removed all the unnecessary curly brackets and so on, but I can't get them to print on separate lines.
List = open("english.txt").readlines()
List2 = open("french.txt").readlines()
anb = set(List) & set(List2)
anb = str(anb)
anb = (str(anb)[1:-1])
anb = anb.replace("'","")
anb = anb.replace(",","")
anb = anb.replace('\\n',"")
print(anb)
The output is expected to separate both results onto new lines.
Currently Happening:
Competition Table
Expected:
Competition
Table
Thanks in advance!
- Xphoon
Hi, I'd suggest trying two things as good practice:
1) Use "with" for opening files:
with open('english.txt', 'r') as englishfile, open('french.txt', 'r') as frenchfile:
    ##your python operations for the file
2) Try using f-strings if you're on Python 3.6 or later:
print(f"Hello\nWorld!")
File read using "open()" vs "with open()"
This post explains very well why to use the "with" statement :)
In addition, if you want to print variables with f-strings, do it like this:
print(f"{variable[index]}\n{variable2[index2]}")
This should print the two values (for example Hello and World!) on separate lines.
Here is one solution including converting between sets and lists:
with open('english.txt', 'r') as englishfile, open('french.txt', 'r') as frenchfile:
    english_words = englishfile.readlines()
    english_words = [word.strip('\n') for word in english_words]
    french_words = frenchfile.readlines()
    french_words = [word.strip('\n') for word in french_words]
anb = set(english_words) & set(french_words)
anb_list = [item for item in anb]
for item in anb_list:
    print(item)
Here is another solution that keeps the words in lists:
with open('english.txt', 'r') as englishfile, open('french.txt', 'r') as frenchfile:
    english_words = englishfile.readlines()
    english_words = [word.strip('\n') for word in english_words]
    french_words = frenchfile.readlines()
    french_words = [word.strip('\n') for word in french_words]
for english_word in english_words:
    for french_word in french_words:
        if english_word == french_word:
            print(english_word)
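A more compact variant of the set-based solution is sketched below. The function name common_words is hypothetical, and sorting the result is an assumption made so that the output order is predictable (sets have no guaranteed order).

```python
def common_words(path_a, path_b):
    """Return the words that appear in both files, sorted alphabetically."""
    with open(path_a) as file_a, open(path_b) as file_b:
        # strip() removes the trailing newline; blank lines are skipped
        words_a = {line.strip() for line in file_a if line.strip()}
        words_b = {line.strip() for line in file_b if line.strip()}
    return sorted(words_a & words_b)

# Example use: one word per line, no brackets or quotes to clean up.
# print('\n'.join(common_words('english.txt', 'french.txt')))
```

Joining the list with '\n' prints each common word on its own line, which avoids converting the set to a string and stripping characters from it.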

How do I replace the 4th item in a list that is in a file that starts with a particular string?

I need to search for a name in a file and, in the line starting with that name, replace the fourth item in the comma-separated list. I have begun trying to program this with the following code, but I have not got it to work.
with open("SampleFile.txt", "r") as f:
    newline=[]
    for word in f.line():
        newline.append(word.replace(str(String1), str(String2)))
with open("SampleFile.txt", "w") as f:
    for line in newline :
        f.writelines(line)
#this piece of code replaced every occurrence of String1 with String 2
f = open("SampleFile.txt", "r")
for line in f:
    if line.startswith(Name):
        if line.contains(String1):
            newline = line.replace(str(String1), str(String2))
#this came up with a syntax error
It would help people answer your question if you gave some dummy data. I suggest you back up your data: you can save the edited data to a new file, or copy the old file to a backup folder before working on the data (think about using "from shutil import copyfile" and then "copyfile(src, dst)"). Otherwise a mistake could easily ruin your data with no easy way to restore it.
You can't safely replace the string with "newline = line.replace(str(String1), str(String2))", because that replaces every occurrence in the line. Think about "strong" as your search term and a line like "Armstrong,Paul,strong,44": if you replace "strong" with "weak" you would get "Armweak,Paul,weak,44".
I hope the following code helps you:
filename = "SampleFile.txt"
filename_new = filename.replace(".", "_new.")
search_term = "Smith"
with open(filename) as src, open(filename_new, 'w') as dst:
    for line in src:
        if line.startswith(search_term):
            items = line.split(",")
            items[4-1] = items[4-1].replace("old", "new")
            line = ",".join(items)
        dst.write(line)
If you work with a csv-file you should have a look at the csv module.
PS: My files contain the following data (the filenames in the first row are not part of the files!):
SampleFile.txt           SampleFile_new.txt
Adams,George,m,old,34    Adams,George,m,old,34
Adams,Tracy,f,old,32     Adams,Tracy,f,old,32
Smith,John,m,old,53      Smith,John,m,new,53
Man,Emily,w,old,44       Man,Emily,w,old,44
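As a rough illustration of the csv module route mentioned above, the sketch below does the same replacement with csv.reader and csv.writer. The function name replace_fourth_field and its parameters are hypothetical. It matches the first field exactly rather than using startswith, which is a deliberate change in behaviour: a search for "Smith" will not also hit "Smithson".

```python
import csv

def replace_fourth_field(src_path, dst_path, name, old, new):
    """Copy src_path to dst_path, editing the 4th field of rows whose 1st field is name."""
    # newline='' is what the csv module documentation recommends for file objects
    with open(src_path, newline='') as src, open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # only touch rows that belong to `name` and actually have a 4th field
            if row and row[0] == name and len(row) >= 4:
                row[3] = row[3].replace(old, new)
            writer.writerow(row)
```

For the sample data, replace_fourth_field("SampleFile.txt", "SampleFile_new.txt", "Smith", "old", "new") changes only the Smith row and leaves the original file untouched.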

Python changing file name

My application lets the user export its results. It exports text files named Exp_Text_1, Exp_Text_2, etc. If a file with the same name already exists on the Desktop, I want the numbering to continue upwards from that number. For example, if a file named Exp_Text_3 is already on the Desktop, then I want the new file to be named Exp_Text_4.
This is my code:
if len(str(self.Output_Box.get("1.0", "end"))) == 1:
    self.User_Line_Text.set("Nothing to export!")
else:
    import os.path
    self.txt_file_num = self.txt_file_num + 1
    file_name = os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt" + "_" + str(self.txt_file_num) + ".txt")
    file = open(file_name, "a")
    file.write(self.Output_Box.get("1.0", "end"))
    file.close()
    self.User_Line_Text.set("A text file has been exported to Desktop!")
You likely want os.path.exists:
>>> import os
>>> help(os.path.exists)
Help on function exists in module genericpath:
exists(path)
Test whether a path exists. Returns False for broken symbolic links
A very basic example would be to create a file name containing a formatting mark, so the number can be inserted for each check:
import os
name_to_format = os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt_{}.txt")
#the "{}" is a formatting mark so we can do file_name.format(num)
num = 1
while os.path.exists(name_to_format.format(num)):
    num+=1
new_file_name = name_to_format.format(num)
This checks each filename in turn, starting with Exp_Txt_1.txt, then Exp_Txt_2.txt, etc., until it finds one that does not exist.
However the format mark may cause a problem if curly brackets {} are part of the rest of the path, so it may be preferable to do something like this:
import os
def get_file_name(num):
    return os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt_" + str(num) + ".txt")
num = 1
while os.path.exists(get_file_name(num)):
    num+=1
new_file_name = get_file_name(num)
EDIT: Why don't we need the get_file_name function in the first example?
First off, if you are unfamiliar with str.format you may want to look at Python doc - common string operations and/or this simple example:
text = "Hello {}, my name is {}."
x = text.format("Kotropoulos","Tadhg")
print(x)
print(text)
The path string is figured out with this line:
name_to_format = os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt_{}.txt")
But it has {} in place of the desired number (since we don't know what the number should be at this point). So if the path was, for example:
name_to_format = "/Users/Tadhg/Desktop/Exp_Txt_{}.txt"
then we can insert a number with:
print(name_to_format.format(1))
print(name_to_format.format(2))
This does not change name_to_format: str objects are immutable, so .format returns a new string without modifying name_to_format. However, we would run into a problem if our path was something like one of these:
name_to_format = "/Users/Bob{Cat}/Desktop/Exp_Txt_{}.txt"
#or
name_to_format = "/Users/Bobcat{}/Desktop/Exp_Txt_{}.txt"
#or
name_to_format = "/Users/Smiley{:/Desktop/Exp_Txt_{}.txt"
Since the formatting mark we want to use is no longer the only pair of curly brackets, we can get a variety of errors:
KeyError: 'Cat'
IndexError: tuple index out of range
ValueError: unmatched '{' in format spec
So you only want to rely on str.format when you know it is safe to use. Hope this helps, have fun coding!
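If you would still like to use a template, one option is to escape any literal braces in the directory part first: str.format treats a doubled brace ({{ or }}) as a literal single brace. The sketch below illustrates this; make_name_template and its parameters are hypothetical names, not part of the original answer.

```python
import os

def make_name_template(directory, stem="Exp_Txt_", ext=".txt"):
    """Build a str.format template with exactly one live '{}' placeholder."""
    # Double every brace in the directory so format() emits it literally.
    safe_dir = directory.replace("{", "{{").replace("}", "}}")
    return os.path.join(safe_dir, stem + "{}" + ext)

def next_free_name(directory):
    """Return the first numbered file name in directory that does not exist yet."""
    template = make_name_template(directory)
    num = 1
    while os.path.exists(template.format(num)):
        num += 1
    return template.format(num)
```

With this, a directory such as "/Users/Bob{Cat}/Desktop" no longer raises KeyError, because the only placeholder left for format to fill is the one in front of the extension.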
