a python program that searches text files and prints out mutual lines - python-3.x

I am trying to write a Python program that takes n text files, where each file contains names, one name per line, like this:
Steve
Mark
Sarah
What the program should do is print out only the names that exist in all of the input files.
I am new to programming, so I don't really know how to implement this idea, but I thought of recursion. Still, the program seems to run in an infinite loop, and I'm not sure what the problem is. Is the implementation wrong? If so, do you have a better idea of how to implement it?
import sys

arguments = sys.argv[1:]
files = {}
file = iter(arguments)
for number in range(len(sys.argv[1:])):
    files[number] = open(next(file))

def close_files():
    for num in files:
        files[num].close()

def start_next_file(line, files, orderOfFile):
    print('starting next file')
    if orderOfFile < len(files):  # to avoid IndexError
        for line_searched in files[orderOfFile]:
            if line_searched.strip():
                line_searched = line_searched[:-1]
                print('searched line = ' + line_searched)
                print('searched compared to = ' + line)
                if line_searched == line:
                    # good, now see if that name exists in the other files as well
                    start_next_file(line, files, orderOfFile + 1)
    elif orderOfFile >= len(files):  # when you finish searching all the files
        print('got ya ' + line)  # print the name that exists in all the files
        for file in files:
            # to make sure the cursor is at the beginning of the read files
            # so we can loop through them again
            files[file].seek(0)

def start_find_match(files):
    orderOfFile = 0
    for line in files[orderOfFile]:
        # for each name in the file see if it exists in all other files
        if line.strip():
            line = line[:-1]
            print('starting line = ' + line)
            start_next_file(line, files, orderOfFile + 1)

start_find_match(files)
close_files()

I'm not sure how to fix your code exactly, but here's one conceptual way to think about it.
listdir gets all the files in the directory as a list, which we narrow to only .txt files. Next, open, read, split on newlines, and lower-case each file to build a list of names, so files ends up as a list of lists. Last, find the intersection across all the lists using some set logic.
import os

folder = [f for f in os.listdir() if f[-4:] == '.txt']
files = []
for file in folder:
    with open(file) as f:
        files.append([name.lower() for name in f.read().splitlines()])

result = set.intersection(*map(set, files))
Example:
#file1.txt
john
smith
mary
sue
pretesh
ashton
olaf
Elsa
#file2.txt
David
Lorenzo
Cassy
Grant
elsa
Felica
Salvador
Candance
Fidel
olaf
Tammi
Pasquale
#file3.txt
Jaleesa
Domenic
Shala
Berry
Pamelia
Kenneth
Georgina
Olaf
Kenton
Milly
Morgan
elsa
Returns:
{'olaf', 'elsa'}
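If you want to keep the approach from the question of passing the file names as command-line arguments, the same set-intersection idea carries over; a minimal sketch, assuming one name per line and that blank lines should be ignored:

import sys

sets_of_names = []
for path in sys.argv[1:]:
    with open(path) as f:
        # one name per line; skip blank lines
        sets_of_names.append({line.strip() for line in f if line.strip()})

if sets_of_names:
    for name in set.intersection(*sets_of_names):
        print(name)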

Related

How to separate lines of data read from a textfile? Customers with their orders

I have this data in a text file. (It doesn't have the spacing I added for clarity.)
I am using Python3:
orders = open('orders.txt', 'r')
lines = orders.readlines()
I need to loop through the lines variable that contains all the lines of the data and separate the CO lines as I've spaced them.
CO are customers and the lines below each CO are the orders that customer placed.
The CO lines tell us how many lines of orders exist if you look at indexes [7-9] of the CO string.
I illustrate this below.
CO77812002D10212020 <---(002)
125^LO917^11212020. <----line 1
235^IL993^11252020 <----line 2
CO77812002S10212020
125^LO917^11212020
235^IL993^11252020
CO95307005D06092019 <---(005)
194^AF977^06292019 <---line 1
72^L223^07142019 <---line 2
370^IL993^08022019 <---line 3
258^Y337^07072019 <---line 4
253^O261^06182019 <---line 5
CO30950003D06012019
139^LM485^06272019
113^N669^06192019
249^P530^07112019
CO37501001D05252020
479^IL993^06162020
I have thought of a brute force way of doing this but it won't work against much larger datasets.
Any help would be greatly appreciated!
You can use fileinput (source) to "simultaneously" read and modify your file. In fact, the in-place functionality it offers for modifying a file while parsing it is implemented through a second backup file. Specifically, as stated here:
Optional in-place filtering: if the keyword argument inplace=True is passed to fileinput.input() or to the FileInput constructor, the file is moved to a backup file and standard output is directed to the input file (...) by default, the extension is '.bak' and it is deleted when the output file is closed.
Therefore, you can format your file as specified this way:
import fileinput

with fileinput.input(files=['orders.txt'], inplace=True) as orders_file:
    for line in orders_file:
        if line[:2] == 'CO':  # Detect customer line
            orders_counter = 0
            num_of_orders = int(line[7:10])  # Extract number of orders
        else:
            orders_counter += 1
            # If the last order for a specific customer has been reached,
            # append a '\n' character to format it as desired
            if orders_counter == num_of_orders:
                line += '\n'
        # Since standard output is redirected to the file, print writes in the file
        print(line, end='')
Note: this assumes that the file with the orders is formatted exactly the way you specified:
CO...
(order_1)
(order_2)
...
(order_i)
CO...
(order_1)
...
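Once the blank-line separators are in place, reading the file back into per-customer blocks is straightforward; a minimal sketch, assuming the reformatted orders.txt produced by the snippet above:

with open('orders.txt') as f:
    # each block is one CO header followed by its order lines
    blocks = [block.splitlines() for block in f.read().split('\n\n') if block.strip()]

for block in blocks:
    print(block[0], '->', len(block) - 1, 'orders')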
This did what I was hoping to get done!
tot_customers = []
with open("orders.txt", "r") as a_file:
    customer = []
    for line in a_file:
        stripped_line = line.strip()
        if stripped_line[:2] == "CO":
            customer.append(stripped_line)
            print("customers: ", customer)
            orders_counter = 0
            num_of_orders = int(stripped_line[7:10])
        else:
            customer.append(stripped_line)
            orders_counter += 1
            if orders_counter == num_of_orders:
                tot_customers.append(customer)
                customer = []
                orders_counter = 0
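For what it's worth, each element of tot_customers then holds one customer record (the CO header followed by its order lines), so the result can be inspected with something like this (illustrative only):

for customer in tot_customers:
    header, orders = customer[0], customer[1:]
    print(header, 'has', len(orders), 'orders')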

How do I perform a regular expression on multiple .txt files in a folder (Python)?

I'm trying to open up 32 .txt files, extract some text from them (using RegEx), and then save them as individual files again (later on in the project I'm hoping to collate them together). I've tested the RegEx on a single file and it seems to work:
import os
import re

os.chdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation planning\Manual scrape\Finished years proper')

with open('1988.txt') as txtfile:
    text = txtfile.read()

#print(len(text)) #sentences in text
start = r'Body\n\n\n'
docs = re.findall(start, text)
print('Found the start of %s documents.' % len(docs))
end = r'Load-Date:'
print('Found the end of %s documents.' % len(docs))
docs = re.findall(end, text)
regex = start + r'(.+?)' + end
articles = re.findall(regex, text, re.S)
print('You have now parsed the 154 articles so only the body of content remains. All metadata has been removed.')
print('Here is an example of a parsed article:', articles[0])
Now I want to perform the exact same thing on all my .txt files in that folder, but I can't figure out how to. I've been playing around with For loops but with little success. Currently I have this:
import os
import re

finished_years_proper = os.listdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Manual scrape\Finished years proper')
os.chdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Manual scrape\Finished years proper')
print('There are %s .txt files in this folder.' % len(finished_years_proper))

if i.endswith(".txt"):
    with open(finished_years_proper + i, 'r') as all_years:
        for line in all_years:
            start = r'Body\n\n\n'
            docs = re.findall(start, all_years)
            end = r'Load-Date:'
            docs = re.findall(end, all_years)
            regex = start + r'(.+?)' + end
            articles = re.findall(regex, all_years, re.S)
However, I'm getting a type error:
File "C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Method\Python\untitled1.py", line 15, in <module>
with open(finished_years_proper + i, 'r') as all_years:
TypeError: can only concatenate list (not "str") to list
I'm unsure how to proceed... I've seen on other forums that I should convert something into a string, but I'm not sure what to convert or even if this is the right way to proceed. Any help with this would be really appreciated!
After taking Benedictanjw's suggestion into my code, this is what I ended up with:
all_years = []
for fyp in finished_years_proper:  # fyp is each text file in folder
    with open(fyp, 'r') as year:
        for line in year:  # line is each element in each text file in folder
            start = r'Body\n\n\n'
            docs = re.findall(start, line)
            end = r'Load-Date:'
            docs = re.findall(end, line)
            regex = start + r'(.+?)' + end
            articles = re.findall(regex, line, re.S)
            all_years.append(articles)  # append strings to reflect RegEx
parsed_documents = all_years.append(articles)
print(parsed_documents)  # returns None. Apparently this is okay.
Does the 'None' mean that the parsing of each file was successful (as in, it emulates the result I had when I tested the RegEx on a single file)? And if so, how can I visualise my output without returning None? Many thanks in advance!!
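On the None question: list.append modifies the list in place and always returns None, so assigning its result to parsed_documents is what prints None; the parsed articles are already inside all_years. A small illustration:

all_years = []
result = all_years.append(['article text'])
print(result)     # None - append returns nothing
print(all_years)  # [['article text']] - the data lives in the list itself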
The problem arises because finished_years_proper is a list, and in your line:
with open(finished_years_proper + i, 'r') as all_years:
you are trying to concatenate i with that list. I presume you had accidentally defined i elsewhere as a string. I guess you probably want to do something like:
all_years = []
for fyp in finished_years_proper:
    with open(fyp, 'r') as year:
        for line in year:
            ...  # your regex search on year
        all_years.append(xxx)
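A slightly fuller sketch of that loop, reading each file in one go rather than line by line (the Body ... Load-Date: pattern spans multiple lines, so it cannot match against a single line), with results kept per file; the dictionary and variable names here are just illustrative:

import re

pattern = re.compile(r'Body\n\n\n(.+?)Load-Date:', re.S)
articles_by_file = {}
for fyp in finished_years_proper:
    if fyp.endswith('.txt'):
        with open(fyp, 'r') as year:
            text = year.read()
        # every article body found in this file
        articles_by_file[fyp] = pattern.findall(text)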

I'm trying to 'shuffle' a folder of music and there is an error where random.choice() keeps choosing things that it is supposed to have removed

I'm trying to make a Python script that renames files randomly from a list, and I used numbers.remove(place) on it, but it keeps choosing values that are supposed to have been removed.
I used to just use random.randint, but now I have moved to choosing from a list and then removing the chosen value from that list, yet it seems to keep choosing already-chosen values.
'''python
from os import chdir, listdir, rename
from random import choice

def main():
    chdir('C:\\Users\\user\\Desktop\\Folders\\Music')
    for f in listdir():
        if f.endswith('.mp4'):
            numbers = [str(x) for x in range(0, 100)]
            had = []
            print(f'numbers = {numbers}')
            place = choice(numbers)
            print(f'place = {place}')
            numbers.remove(place)
            print(f'numbers = {numbers}')
            while place in had:
                input('Place has been had.')
                place = choice(numbers)
            had.append(place)
            name = place + '.mp4'
            print(f'name = {name}')
            print(f'\n\nRenaming {f} to {name}.\n\n')
            try:
                rename(f, name)
            except FileExistsError:
                pass

if __name__ == '__main__':
    main()
'''
It should randomly number the files without choosing the same value for a file twice, but it keeps doing exactly that and I have no idea why.
When you call listdir() the first time, that's the same list that you're iterating over the entire time. Yes, you're changing the contents of the directory, but python doesn't really care about that because you only asked for the contents of the directory at a specific point in time - before you began modifying it.
I would do this in two separate steps:
import os
import random

# get the current list of files in the directory
dirlist = os.listdir()

# choose a new name for each file
to_rename = zip(
    dirlist,
    [f'{num}.mp4' for num in random.sample(range(100), len(dirlist))]
)

# actually rename each file
for oldname, newname in to_rename:
    try:
        os.rename(oldname, newname)
    except FileExistsError:
        pass
This method is more concise than the one you're using. First, I use random.sample() on the iterable range(100) to generate non-overlapping numbers from that range (without having to do the extra step of using had like you're doing now). I generate exactly as many as I need, and then use the built-in zip() function to bind together the original filenames with these new numbers.
Then, I do the rename() operations all at once.
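One small adjustment you may want, since the original script only renamed .mp4 files: filter the listing before zipping, for example:

import os
import random

dirlist = [f for f in os.listdir() if f.endswith('.mp4')]
to_rename = zip(
    dirlist,
    [f'{num}.mp4' for num in random.sample(range(100), len(dirlist))]
)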

how do i manipulate the path name so it doesn't print out the entire name

I'm new to programming. I need to index three separate txt files and do a search from an input. When I do a print it gives me the entire path name; I would like to print just the txt file name.
I've tried using os.list in the function.
import os
import time
import string
import os.path
import sys

word_occurrences = {}

def index_text_file(txt_filename, ind_filename, delimiter_chars=",.;:!?"):
    try:
        txt_fil = open(txt_filename, "r")
        fileString = txt_fil.read()
        for word in fileString.split():
            if word in word_occurrences:
                word_occurrences[word] += 1
            else:
                word_occurrences[word] = 1
        word_keys = word_occurrences.keys()
        print("{} unique words found in".format(len(word_keys)), txt_filename)
        word_keys = word_occurrences.keys()
        sorted(word_keys)
    except IOError as ioe:  # if the file can't be opened
        sys.stderr.write("Caught IOError:" + repr(ioe) + "/n")
        sys.exit(1)

index_text_file("/Users/z007881/Documents/ABooks_search/CODE/booksearch/book3.txt", "/Users/z007881/Documents/ABooks_search/CODE/booksearch/book3.idx")
SyntaxError: invalid syntax
(base) 8c85908188d1:CODE z007881$ python3 indexed.py
9395 unique words found in /Users/z007881/Documents/ABooks_search/CODE/booksearch/book3.txt
I would like it to say: 9395 unique words found in book3.txt
One way to do it would be to split the path on the directory separator / and pick the last element:
file_name = txt_filename.split("/")[-1]
# ...
# Then:
print("{} unique words found in".format(len(word_keys)), file_name)
# I would prefer using an f-string, unless your Python version is too old:
print(f"{len(word_keys)} unique words found in {file_name}")
I strongly advise changing the name of txt_filename to something less misleading like txt_filepath, since it does not contain a file name but a whole path (including, but not limited to, the file name).
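An alternative worth mentioning: os.path.basename returns just the final component of a path using the platform's separator, so you don't have to split on "/" yourself (reusing the txt_filename and word_keys names from the code above):

import os.path

file_name = os.path.basename(txt_filename)
print(f"{len(word_keys)} unique words found in {file_name}")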

Search different words between two files

I have two files (txt), for example FILE A and FILE B, and I want to find all the words in FILE A that do not exist in FILE B.
For example, if file A is:
HIS HOUSE IS VERY SMALL
and file B is
HIS DOG IS VERY NICE
I want to write a program that shows me that HOUSE is not in file B.
I thought of using the split command and looping over the files, but since I don't know Python well, can anyone help me, or tell me if there is another command that can do this?
Maybe there is a better solution, but the one below will solve your problem.
import re

def is_letter(s):
    return re.match('[a-z]|[A-Z]$', s)

def words_only(s):
    # replace every non-letter character with a space, then normalise
    for i, x in enumerate(s):
        if not is_letter(x):
            s = s[:i] + s[i:].replace(x, ' ')
    s = re.sub(r'\s+', ' ', s).strip().upper().split(' ')
    return s

file_a = words_only(open('file_a.txt', 'r').read())
file_b = words_only(open('file_b.txt', 'r').read())

for x in file_a:
    if x not in file_b:
        print(x)
file_a.txt
HIS DOG IS VERY SMALL.
His wife is very nice.
file_b.txt
HIS DOG IS VERY NICE.
He is very ugly.
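If lower-casing and stripping punctuation is all the normalisation you need, a shorter route is to turn each file into a set of words and take the difference; a minimal sketch, assuming the same file names as above:

import re

def word_set(path):
    with open(path) as f:
        # keep only alphabetic words, case-insensitively
        return set(re.findall(r'[a-z]+', f.read().lower()))

# words that appear in file A but never in file B
for word in word_set('file_a.txt') - word_set('file_b.txt'):
    print(word.upper())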
