Scan through large text file using contents of another text file - python-3.x

Hello, I am very new to coding and I am writing a small Python script, but I am stuck. The goal is to compare the contents of log.txt to the contents of LargeFile.txt, and to store every line of log.txt that does not match any line of LargeFile.txt in outfile.txt. With the code below, however, I only get the first line of log.txt repeated over and over in outfile.txt.
logfile = open('log1.txt', 'r')          # This file is 8 KB
keywordlist = open('LargeFile.txt', 'r') # This file is 1.4 GB
outfile = open('outfile.txt', 'w')

loglines = [n for n in logfile]
keywords = [n for n in keywordlist]

for line in loglines:
    for word in keywords:
        if line not in word:
            outfile.write(line)

outfile.close()

So conceptually you're trying to check whether any line of your 1+ GB file occurs in your 8 KB file.
This means one of the files needs to be loaded into RAM, and the smaller file is the natural choice. The other file can be read sequentially and does not need to be loaded in full.
We need
a list of lines from the smaller file
an index of those lines for quick look-ups (we'll use a dict for this)
a loop that runs through the large file and checks each line against the index, making note of every matching line it finds
a loop that outputs the original lines and uses the index to determine whether they are unique or not.
The sample below prints the complete output to the console. Write it to a file as needed.
with open('log1.txt', 'r') as f:
    log_lines = list(f)

index = {line: [] for line in log_lines}

with open('LargeFile.txt', 'r') as f:
    for line_num, line in enumerate(f, 1):
        if line in index:
            index[line].append(line_num)

for line in log_lines:
    if len(index[line]) == 0:
        print(f'{line} -> unique')
    else:
        print(f'{line} -> found {len(index[line])}x')
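If you need the result in a file rather than on the console, a minimal variant of the final loop (assuming the same log_lines and index built above) could look like this:
# Write only the unique lines to outfile.txt instead of printing a report
with open('outfile.txt', 'w') as out:
    for line in log_lines:
        if not index[line]:  # the line never appeared in LargeFile.txt
            out.write(line)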

Related

How to add to the beginning of each line of a large file (>100GB) the index of that line with Python?

some_file.txt: (before)
one
two
three
four
five
...
How can I effectively modify a large file in Python?
with open("some_file.txt", "r+") as file:
for idx, line in enumerate(file.readlines()):
file.writeline(f'{idx} {line}') # something like this
some_file.txt: (after)
1 one
2 two
3 three
4 four
5 five
...
Don't try to load your entire file in memory, because the file may be too large for that. Instead, read line by line:
with open('input.txt') as inp, open('output.txt', 'w') as out:
    idx = 1
    for line in inp:
        out.write(f'{idx} {line}')
        idx += 1
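A sketch of the same loop that lets enumerate() keep the counter for you (same assumed file names):
# Equivalent version; enumerate(inp, 1) numbers lines starting at 1
with open('input.txt') as inp, open('output.txt', 'w') as out:
    for idx, line in enumerate(inp, 1):
        out.write(f'{idx} {line}')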
You can't insert into the middle of a file without re-writing it. This is an operating system thing, not a Python thing.
Use pathlib for path manipulation. Rename the original file. Then copy it to a new file, adding the line numbers as you go. Keep the old file until you verify the new file is correct.
Open files are iterable, so you can use enumerate() on them directly without having to use readlines() first. The second argument to enumerate() is the number to start the count with. So the loop below will number the lines starting with 1.
from pathlib import Path

target = Path("some_file.txt")

# rename the file with ".old" suffix
original = target.rename(target.with_suffix(".old"))

with original.open("r") as source, target.open("w") as sink:
    for line_no, line in enumerate(source, 1):
        sink.write(f'{line_no} {line}')
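As a hedged follow-up, one simple way to verify before dropping the backup might be to compare line counts; adapt the check to whatever "correct" means for your data:
# Minimal sanity check: the numbered copy should have as many lines as the original
with original.open("r") as old, target.open("r") as new:
    if sum(1 for _ in old) == sum(1 for _ in new):
        original.unlink()  # remove the ".old" backup only after the check passes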

Splitting up a large file into smaller files at specific points

I know this question has been asked several times, but those solutions really don't help me here. I have a really big file (almost 5 GB) to read, get the data from, and give to my neural network, and I have to read it line by line. At first I loaded the entire file into memory using the .readlines() function, but it obviously resulted in an out-of-memory issue. Next, instead of loading the entire file into memory, I read it line by line, but it still hasn't worked. So now I am thinking of splitting my file into smaller files and then reading each of those files. The file format is such that for each sequence I have a header starting with '>' followed by the sequence, for example:
>seq1
acgtccgttagggtjhtttttttttt
tttsggggggtattttttttt
>seq2
accggattttttstttttttttaasftttttttt
stttttttttttttttttttttttsttattattat
tttttttttttttttt
>seq3
aa
.
.
.
>seqN
bbbbaatatattatatatatattatatat
tatatattatatatattatatatattatat
tatattatatattatatatattatatatatta
tatatatatattatatatatatatattatatat
tatatatattatatattatattatatatattata
tatatattatatattatatatattatatatatta
So now I want to split my file, which has 12,700,000 sequences, into smaller files such that each file keeps every '>' header together with its correct corresponding sequence. How can I achieve this in Python without running into memory issues? Insights would be appreciated.
I was able to do this with 12,700,000 randomized lines of 1-20 random characters each, though the size of my file was far less than 5 GB (roughly 300 MB, likely due to formatting). All of that said, you can try this:
x = 0
y = 1
string = ""
cycle = "Seq1"

with open(f"{FILEPATH}/main.txt", "r") as file:
    for line in file:
        if line[0] == ">":
            if x % 5000 == 0 and x != 0:
                with open(f"{FILEPATH}/Sequence Files/Starting{cycle}.txt", "a") as newfile:
                    newfile.writelines(string)
                cycle = f"Seq{y*5000+1}"
                y += 1
                string = ""
            string += line
            x += 1
        if line[0] != ">":
            string += line

with open(f"{FILEPATH}/Sequence Files/Starting{cycle}.txt", "a") as newfile:
    newfile.writelines(string)
This will read the file line by line, append the first 5000 sequences to a string, write the string to a new file, and repeat for the rest of the original file. It will also name each file after the first sequence it contains.
The line that reads if x % 5000 == 0 and x != 0: defines the number of sequences within each file, and the line cycle = f"Seq{y*5000+1}" builds the next filename. You can adjust the 5000 in both if you change your mind about how many sequences per file (you're creating 2,540 new files this way).
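A minimal alternative sketch of the same idea, assuming the input path main.txt and a hypothetical output naming scheme, which rolls over to a new output file every 5,000 sequences without buffering records in a string:
chunk_size = 5000   # sequences per output file
seq_count = 0
out = None

with open("main.txt") as infile:             # input path is an assumption
    for line in infile:
        if line.startswith(">"):
            if seq_count % chunk_size == 0:  # time to start a new chunk
                if out is not None:
                    out.close()
                # hypothetical naming: first sequence number in the chunk
                out = open(f"StartingSeq{seq_count + 1}.txt", "w")
            seq_count += 1
        out.write(line)  # assumes the file begins with a '>' header line

if out is not None:
    out.close()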

How can I delete every second line in a very big text file?

I have a very big text file and I want to delete every second line. How can I do it in an effective way?
I have written a code like this:
_file = open("merged_DGM.txt", "r")
text = _file.readlines()
for i, j in enumerate(text):
if i % 2 == 0:
del text[i]
_file.close()
_file = open("half_DGM.txt", "w")
for i in text:
_file.write(i)
_file.close()
It works for small text files, but for big files it loads the whole text into the variable, and after 10 minutes it still had not solved the problem.
Any suggestions would be appreciated.
The file object returned by open inherits from io.IOBase and can be iterated. By iterating directly over the file you avoid loading the whole file into memory at once.
with open("merged_DGM.txt", "r") as in_file and open("half_DGM.txt", "w") as out_file:
for index, line in enumerate(in_file):
if index % 2:
out_file.write(line)
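A hedged alternative using itertools.islice, which does the every-second-line selection for you (keeping lines 2, 4, 6, ... just like the loop above):
from itertools import islice

with open("merged_DGM.txt") as in_file, open("half_DGM.txt", "w") as out_file:
    # islice(in_file, 1, None, 2) yields the lines at 0-based indices 1, 3, 5, ...
    out_file.writelines(islice(in_file, 1, None, 2))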

How to print a file containing a list

So basically I have a list in a file and I only want to print the lines containing an A.
Here is a small part of the list
E5341,21/09/2015,C102,440,E,0
E5342,21/09/2015,C103,290,A,290
E5343,21/09/2015,C104,730,N,0
E5344,22/09/2015,C105,180,A,180
E5345,22/09/2015,C106,815,A,400
So I only want to print the lines containing A.
Sorry, I'm still new at Python.
I gave it a try using one "print" to print the whole line but ended up failing; I guess I will always suck at Python.
You just have to:
open file
read lines
for each line, split at ","
for each line, if the 5th part of the split str is equal to "A", print the line
Code:
filepath = 'file.txt'

with open(filepath, 'r') as f:
    lines = f.readlines()

for line in lines:
    if line.split(',')[4] == "A":
        print(line)
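Since the file is comma-separated, a sketch using the standard csv module (same assumed file.txt) does the splitting for you and also copes with quoted fields:
import csv

with open('file.txt', newline='') as f:
    for row in csv.reader(f):
        if row[4] == "A":          # 5th column
            print(','.join(row))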

python3 opening files and reading lines

Can you explain what is going on in this code? I don't seem to understand
how a for loop can open the file and read it line by line instead of reading all of the sentences at the same time. Thanks.
Let's say I have these sentences in a document file:
cat:dog:mice
cat1:dog1:mice1
cat2:dog2:mice2
cat3:dog3:mice3
Here is the code:
from sys import argv

filename = input("Please enter the name of a file: ")
f = open(filename, 'r')
d1ct = dict()

print("Number of times each animal visited each station:")
print("Animal Id Station 1 Station 2")

for line in f:
    if '\n' == line[-1]:
        line = line[:-1]
    (AnimalId, Timestamp, StationId,) = line.split(':')
    key = (AnimalId, StationId,)
    if key not in d1ct:
        d1ct[key] = 0
    d1ct[key] += 1
The magic is at:
for line in f:
    if '\n' == line[-1]:
        line = line[:-1]
Python file objects are special in that they can be iterated over in a for loop. On each iteration, the loop retrieves the next line of the file. Because each line keeps its trailing character, which could be a newline, it's often useful to check for and remove it.
As Moshe wrote, open file objects can be iterated. However, they are not of the file type in Python 3.x (as they were in Python 2.x). If the file object is opened in text mode, the unit of iteration is one text line, including the \n.
You can use line = line.rstrip() to remove the \n plus any trailing whitespace.
If you want to read the content of the file at once (into a multiline string), you can use content = f.read().
There is a minor bug in the code: the open file should always be closed. That means calling f.close() after the for loop. Or you can wrap the open in the newer with construct, which will close the file for you; I suggest getting used to the latter approach.
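As a hedged illustration of that suggestion, the question's loop rewritten with the with construct might look like this:
# Same counting loop, but the file is closed automatically on exit
filename = input("Please enter the name of a file: ")
d1ct = {}

with open(filename, 'r') as f:
    for line in f:
        line = line.rstrip('\n')
        (AnimalId, Timestamp, StationId) = line.split(':')
        key = (AnimalId, StationId)
        if key not in d1ct:
            d1ct[key] = 0
        d1ct[key] += 1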
