Python 3.x outputting a text file with names of files that contain a list of words

I have approximately 160,000 text files in a directory. My first objective is to create a list of files that contain at least one item from a list of about 50 keywords. My current code is:
import os

ngwrds = [list of words]
for filename in os.listdir(os.getcwd()):
    with open(filename, 'r') as searchfile:
        for line in searchfile:
            if any(x in line for x in ngwrds):
                with open("keyword.txt", 'a') as out:
                    out.write(filename + '\n')
This works, but it writes out duplicate filenames. Ideally, I would like the loop to stop once it hits the first keyword, write the file name to 'keyword.txt', and move on to the next file in the directory. Any thoughts on how to do this?

To expand on #strubbly's comment: you would simply add a break in the second for loop.
with open(filename, 'r') as searchfile:
    for line in searchfile:
        if any(x in line for x in ngwrds):
            with open("keyword.txt", 'a') as out:
                out.write(filename + '\n')
            break
What does break do? From the Python 3 docs:
The break statement, like in C, breaks out of the smallest enclosing for or while loop.
For more information on break, see the control flow documentation: https://docs.python.org/3/tutorial/controlflow.html
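An alternative worth mentioning (a minimal sketch, not from the original answer) is to open keyword.txt once outside the directory loop instead of reopening it in append mode for every match; the break still moves on to the next file after the first hit. The keyword list below is a placeholder.

import os

ngwrds = ["keyword1", "keyword2"]  # placeholder for the ~50 keywords

# Open the output file once; 'w' starts a fresh list on each run.
with open("keyword.txt", "w") as out:
    for filename in os.listdir(os.getcwd()):
        if filename == "keyword.txt":  # skip the output file itself
            continue
        with open(filename, "r") as searchfile:
            for line in searchfile:
                if any(x in line for x in ngwrds):
                    out.write(filename + "\n")
                    break  # first match found; move on to the next file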

Related

How to add to the beginning of each line of a large file (>100GB) the index of that line with Python?

some_file.txt: (before)
one
two
three
four
five
...
How can I effectively modify large file in Python?
with open("some_file.txt", "r+") as file:
for idx, line in enumerate(file.readlines()):
file.writeline(f'{idx} {line}') # something like this
some_file.txt: (after)
1 one
2 two
3 three
4 four
5 five
...
Don't try to load your entire file in memory, because the file may be too large for that. Instead, read line by line:
with open('input.txt') as inp, open('output.txt', 'w') as out:
    idx = 1
    for line in inp:
        out.write(f'{idx} {line}')
        idx += 1
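As a variant of the same line-by-line approach, enumerate() can replace the manual counter (a sketch assuming the same input.txt/output.txt names):

with open('input.txt') as inp, open('output.txt', 'w') as out:
    for idx, line in enumerate(inp, 1):   # start counting at 1
        out.write(f'{idx} {line}')        # line normally already ends with '\n'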
You can't insert into the middle of a file without re-writing it. This is an operating system thing, not a Python thing.
Use pathlib for path manipulation. Rename the original file. Then copy it to a new file, adding the line numbers as you go. Keep the old file until you verify the new file is correct.
Open files are iterable, so you can use enumerate() on them directly without having to use readlines() first. The second argument to enumerate() is the number to start the count with. So the loop below will number the lines starting with 1.
from pathlib import Path

target = Path("some_file.txt")

# rename the original file with a ".old" suffix
original = target.rename(target.with_suffix(".old"))

with original.open("r") as source, target.open("w") as sink:
    for line_no, line in enumerate(source, 1):
        sink.write(f'{line_no} {line}')  # line already ends with '\n'

Split text in text file into lines

I need to split a text file into lines.
I imported the text file into Python, but print(readline()) prints the whole file.
with open('laxdaela_saga.en.txt', 'r+') as f:
    for line in f.readlines():
        print(line)
I eventually need to count unique words in the text file and other stats, but one step is to divide into lines. This is the step I'm dealing with.
You can use Python's split() function. It splits the given string into a list based on a separator.
In your case, the separator would be the newline \n,
so split('\n') should do it.
Try this
with open('laxdaela_saga.en.txt', 'r+') as f:
    for line in f.readlines():
        x = line.split()  # with no argument, split() breaks the line into words on whitespace
        print(x)
Hope this helps.
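Since the question also mentions counting unique words later on, here is a minimal sketch of that step using collections.Counter, assuming plain whitespace tokenization (punctuation and case are not normalized):

from collections import Counter

counts = Counter()
with open('laxdaela_saga.en.txt', 'r') as f:
    for line in f:
        counts.update(line.split())   # tally each whitespace-separated word

print("unique words:", len(counts))
print(counts.most_common(10))         # the ten most frequent words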

How can I delete every second line in a very big text file?

I have a very big text file and I want to delete every second line. How can I do it in an effective way?
I have written a code like this:
_file = open("merged_DGM.txt", "r")
text = _file.readlines()
for i, j in enumerate(text):
    if i % 2 == 0:
        del text[i]
_file.close()

_file = open("half_DGM.txt", "w")
for i in text:
    _file.write(i)
_file.close()
It works for small text files, but for big files it loads the whole text into memory, and after 10 minutes it still had not finished.
Any suggestions would be appreciated.
The file object returned by open inherits from io.IOBase and can be iterated. By iterating directly over the file, you avoid loading the whole file into memory at once.
with open("merged_DGM.txt", "r") as in_file and open("half_DGM.txt", "w") as out_file:
for index, line in enumerate(in_file):
if index % 2:
out_file.write(line)
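An equivalent sketch with itertools.islice keeps every second line (indices 1, 3, 5, ...) without an explicit modulo test; the file names are the same as above:

from itertools import islice

with open("merged_DGM.txt", "r") as in_file, open("half_DGM.txt", "w") as out_file:
    out_file.writelines(islice(in_file, 1, None, 2))  # lines at indices 1, 3, 5, ...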

Search text file for word from list then output word that matched in Python 3.x

I have been searching a large directory of text files for files that match a list of words. How do I have Python output the word from the list that matches?
This is what I have so far. It writes the file name every time one of the words from the list is found. I want to add the matching word to the line with the file name so I have the file name and 1 matched word each time. How do I do that?
ngwrds = ['words'...]
for filename in os.listdir(os.getcwd()):
    with open(filename, 'r') as searchfile:
        for line in searchfile:
            if any(x in line for x in ngwrds):
                with open("keyword.txt", 'a') as out:
                    out.write(filename + '\n')
The input is a long text file; a line might read like this:
The company reported depreciation of $1.20.
If one of the search words from the list was depreciation, then the output file would look like this:
filename depreciation
Thank you.
I am not sure what out is, and I can't run your code from where I am, but you could try something like this:
ngwrds = ['words'...]
for filename in os.listdir(os.getcwd()):
    with open(filename, 'r') as searchfile:
        for line in searchfile:
            line = line.strip().split(" ")
            for word in line:
                if word in ngwrds:
                    out.write(filename + " " + word)
strip gets rid of whitespace on either end of line. split returns a list of the words in line.
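Since out is left undefined in the snippet above, here is a minimal self-contained sketch with the output file opened once and a newline added per match; the keyword list is a placeholder, and subdirectories or binary files are not handled:

import os

ngwrds = ['depreciation', 'amortization']  # placeholder keyword list

with open("keyword.txt", "a") as out:
    for filename in os.listdir(os.getcwd()):
        if filename == "keyword.txt":      # skip the output file itself
            continue
        with open(filename, 'r') as searchfile:
            for line in searchfile:
                for word in line.strip().split(" "):
                    if word in ngwrds:
                        out.write(filename + " " + word + "\n")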

python3 opening files and reading lines

Can you explain what is going on in this code? I don't seem to understand
how you can open the file and read it line by line instead of all of the sentences at the same time in a for loop. Thanks
Let's say I have these sentences in a document file:
cat:dog:mice
cat1:dog1:mice1
cat2:dog2:mice2
cat3:dog3:mice3
Here is the code:
from sys import argv

filename = input("Please enter the name of a file: ")
f = open(filename, 'r')

d1ct = dict()
print("Number of times each animal visited each station:")
print("Animal Id Station 1 Station 2")

for line in f:
    if '\n' == line[-1]:
        line = line[:-1]
    (AnimalId, Timestamp, StationId,) = line.split(':')
    key = (AnimalId, StationId,)
    if key not in d1ct:
        d1ct[key] = 0
    d1ct[key] += 1
The magic is at:
for line in f:
    if '\n' == line[-1]:
        line = line[:-1]
Python file objects are special in that they can be iterated over in a for loop. On each iteration, the loop retrieves the next line of the file. Because each line includes its trailing character, which could be a newline, it's often useful to check for and remove it.
As Moshe wrote, open file objects can be iterated. However, they are not of the file type in Python 3.x (as they were in Python 2.x). If the file object is opened in text mode, the unit of iteration is one text line, including the trailing \n.
You can use line = line.rstrip() to remove the \n plus any trailing whitespace.
If you want to read the content of the file at once (into a multiline string), you can use content = f.read().
There is a minor bug in the code: the open file should always be closed. That means calling f.close() after the for loop, or wrapping the open in the newer with construct, which closes the file for you -- I suggest getting used to the latter approach.
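For illustration, here is a sketch of the same counting loop rewritten with a with block (which closes the file automatically) and rstrip(), following the question's field layout:

filename = input("Please enter the name of a file: ")

print("Number of times each animal visited each station:")
print("Animal Id Station 1 Station 2")

d1ct = {}
with open(filename, 'r') as f:          # closed automatically at the end of the block
    for line in f:
        line = line.rstrip()            # drop the '\n' and any trailing whitespace
        AnimalId, Timestamp, StationId = line.split(':')
        key = (AnimalId, StationId)
        d1ct[key] = d1ct.get(key, 0) + 1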
