Splitting up a large file into smaller files at specific points - python-3.x

I know this question has been asked several times, but those solutions don't help me here. I have a really big file (almost 5 GB) that I need to read line by line to feed data to my neural network. At first I loaded the entire file into memory with .readlines(), which predictably ran out of memory. Next, instead of loading the whole file, I read it line by line, but that still hasn't worked. So now I am thinking of splitting the file into smaller files and reading each of those in turn. In the file format, each sequence has a header line starting with '>' followed by the sequence itself, for example:
>seq1
acgtccgttagggtjhtttttttttt
tttsggggggtattttttttt
>seq2
accggattttttstttttttttaasftttttttt
stttttttttttttttttttttttsttattattat
tttttttttttttttt
>seq3
aa
.
.
.
>seqN
bbbbaatatattatatatatattatatat
tatatattatatatattatatatattatat
tatattatatattatatatattatatatatta
tatatatatattatatatatatatattatatat
tatatatattatatattatattatatatattata
tatatattatatattatatatattatatatatta
So now I want to split my file, which has 12,700,000 sequences, into smaller files such that each header starting with '>' stays with its corresponding sequence. How can I achieve this in Python without running into memory issues? Insights would be appreciated.

I was able to do this with 12,700,000 randomized lines of 1-20 random characters each, although my file was far smaller than 5 GB (roughly 300 MB), likely due to the format. With that said, you can try this:
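# FILEPATH is a placeholder: set it to the directory that holds main.txt.
# The "Sequence Files" subdirectory must already exist.
x = 0            # number of headers seen so far
y = 1            # index used to build the next output filename
string = ""      # sequences collected for the current output file
cycle = "Seq1"   # label of the first sequence in the current output file
with open(f"{FILEPATH}/main.txt", "r") as file:
    for line in file:
        if line[0] == ">":
            # Every 5000 headers, flush the collected sequences to their own file
            if x % 5000 == 0 and x != 0:
                with open(f"{FILEPATH}/Sequence Files/Starting{cycle}.txt", "a") as newfile:
                    newfile.write(string)
                cycle = f"Seq{y*5000+1}"
                y += 1
                string = ""
            string += line
            x += 1
        if line[0] != ">":
            string += line
# Write whatever is left after the last full batch
with open(f"{FILEPATH}/Sequence Files/Starting{cycle}.txt", "a") as newfile:
    newfile.write(string)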
This reads the file line by line, appends the first 5000 sequences to a string, writes the string to a new file, and repeats for the rest of the original file. Each output file is named after the position of the first sequence it contains.
The line if x % 5000 == 0 and x != 0: sets the number of sequences per file, and the line cycle = f"Seq{y*5000+1}" builds the next filename. Change the 5000 in both places if you want a different number of sequences per file (with 5000 you create 2,540 new files).
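A variation on the approach above, as a minimal sketch: instead of building each chunk in one large string, it keeps a single output file open and writes every line as soon as it is read, so memory use stays at roughly one line. The function name split_fasta, its parameters, and the chunk_NNNNN.txt naming are assumptions made for illustration and are not part of the original answer.

```python
import os


def split_fasta(src_path, out_dir, seqs_per_file=5000):
    """Split a '>'-headed sequence file into chunks of seqs_per_file records.

    Returns the list of chunk paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    seq_count = 0
    try:
        with open(src_path, "r") as src:
            for line in src:
                if line.startswith(">"):
                    # Start a new chunk file every seqs_per_file headers.
                    if seq_count % seqs_per_file == 0:
                        if out is not None:
                            out.close()
                        # Hypothetical naming scheme: chunk_00001.txt, chunk_00002.txt, ...
                        path = os.path.join(out_dir, f"chunk_{len(written) + 1:05d}.txt")
                        out = open(path, "w")
                        written.append(path)
                    seq_count += 1
                # Lines before the first header belong to no record; skip them.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Because a header always opens a new record, a sequence that spans several lines can never be cut in half between two chunk files. With the default of 5000, a file of 12,700,000 sequences would again yield 2,540 chunks.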

Related

Is there a function from terminal that removes repetition and concatenates the output on the same line?

With this input
x 1
x 2
x 3
y 1
y 2
y 3
I'd like to have this output
x 1;2;3
y 1;2;3
Thank you in advance,
Simone
If by terminal you mean something natively built in, you might be out of luck. However, you could run a Python file from the terminal that does what you want and more. If a standalone file isn't possible, you can always run Python in REPL mode for purely terminal usage.
If you have Python installed, typing "py" starts the REPL, where you could set up the processing by hand. If you can use a file, something like the code below should take the input text and print the formatted text to the terminal.
same_starts = {}
# parse each line in the file and get the starting and trailing data for sorting
with open("data.txt", "r") as file:
    for line in file:
        # remove trailing/leading whitespace and newlines
        line_norm = line.strip()
        # split the data at the spaces in the line;
        # lines with formatting errors are skipped
        try:
            data_split = line_norm.split(' ')
            start = data_split[0]
            end = data_split[1]
        except IndexError:
            continue
        # check if dictionary same_starts already has this start
        if same_starts.get(start):
            same_starts[start].append(end)
        else:
            # add new list with first element being this ending
            same_starts[start] = [end]
# format the final data into the needed output
final_output = ""
for key in same_starts:
    # join the endings with ';' so there is no trailing separator
    final_output += key + ' ' + ";".join(same_starts[key]) + '\n'
print(final_output)
NOTE: final_output holds the text in its final formatting.
Assuming you have Python installed, run this file from the folder where it is stored, with a text file called "data.txt" in the same folder containing the values you want processed. Then run "py FILE_NAME.ex", replacing FILE_NAME.ex with the exact name of the Python file, extension included.
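The grouping step described above can also be sketched as a small function that works on any iterable of lines. The name group_by_first_field and the sample list are assumptions for illustration; the sample mirrors the input shown in the question.

```python
def group_by_first_field(lines):
    """Group the second field of each line under its first field, keeping input order."""
    groups = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        groups.setdefault(parts[0], []).append(parts[1])
    # One output row per distinct first field, values joined with ';'
    return [f"{key} {';'.join(values)}" for key, values in groups.items()]


sample = ["x 1", "x 2", "x 3", "y 1", "y 2", "y 3"]
for row in group_by_first_field(sample):
    print(row)
# prints:
# x 1;2;3
# y 1;2;3
```

Passing an open file object or sys.stdin in place of sample would let the same function process real input, for example at the end of a shell pipeline.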

Replacing a float number in txt file

First, I would like to say that I am a newbie in Python.
I will try to explain my problem as best I can.
The main aim of the code is to read, modify and copy a txt file.
To do that, I would like to split the problem into three steps.
1 - Copy the first N lines into a new txt file (CopyFile), exactly as they are in the original file (OrigFile).
2 - Access a specific line where I want to replace one float number with another, and append this line to CopyFile.
3 - Copy the rest of OrigFile, from the line in point 2 to the end of the file.
So far I have been able to do step 1 with the following code:
with open("OrigFile.txt") as myfile:
    head = [next(myfile) for x in range(10)]  # read first 10 lines of txt file
copy = open("CopyFile.txt", "w")  # create a txt file named CopyFile.txt
copy.write("".join(head))  # convert list into str
copy.close()  # close txt file
For the second step, my idea is to go directly to the line I am interested in and find the float number I would like to change. Code:
import linecache
import re
line11 = linecache.getline("OrigFile.txt", 11)  # open the file and go directly to line 11
FltNmb = re.findall(r"\d+\.\d+", line11)  # regular expression to identify float numbers
My problem comes when I need to replace FltNmb with a new number, given that the replacement has to happen inside line11. How could I achieve that?
Open both files and write each line sequentially while keeping a line counter.
On line 11, replace the float number; all other lines are written without modification:
import re
with open("CopyFile.txt", "w") as newfile:
    with open("OrigFile.txt") as myfile:
        for linecounter, line in enumerate(myfile, start=1):
            if linecounter == 11:
                # replace the first float on the line; "<new number>" is a placeholder
                newline = re.sub(r"\d+\.\d+", "<new number>", line, count=1)
                newfile.write(newline)
            else:
                newfile.write(line)
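The same idea can be wrapped in a reusable function, shown here as a sketch rather than a definitive implementation. The name copy_with_replaced_float and its parameters are assumptions; the pattern is the one from the question, so it does not match negative numbers or exponent notation.

```python
import re

FLOAT_RE = re.compile(r"\d+\.\d+")  # same pattern as in the question


def copy_with_replaced_float(src_path, dst_path, line_no, new_value):
    """Copy src_path to dst_path, replacing the first float on line line_no (1-based)."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for number, line in enumerate(src, start=1):
            if number == line_no:
                # count=1 limits the substitution to the first match on the line
                line = FLOAT_RE.sub(str(new_value), line, count=1)
            dst.write(line)
```

For the case in the question, the call would be copy_with_replaced_float("OrigFile.txt", "CopyFile.txt", 11, new_number), which covers all three steps in a single pass over the file.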

Reading text files and calculate the mean length of every 3rd word

How do I open a text file (containing 5 lines) and write a program that calculates the mean length of the third word of each line, over all lines in the file? (A word is defined as a group of characters surrounded by spaces and/or a line ending.)
I suggest reading "Reading and writing Files in Python", since what you are asking is a pretty basic question and there are many resources out there. Just search :]
But not to leave you empty-handed...
# mean_word.py
with open('file.txt') as data_file:
    # Split each line on whitespace into a list of words
    word_lists = [line.split() for line in data_file]
# Keep the third word of every line that has at least three words
third_words = [words[2] for words in word_lists if len(words) >= 3]
# Mean length of those third words (assumes at least one line has three words)
mean_word_len = sum(len(word) for word in third_words) / len(third_words)

Load only a specific line

I have a data file containing over a million lines consisting of 16 integer numbers (it doesn't really matter) and I need to process the lines in Octave. Obviously, it's impossible to load the whole file. How can I load only a specific line?
I have thought of two possibilities:
1 - I have missed something in the docs of Simple I/O
2 - I should convert the file to CSV and use some of the csvread features
If you want to iterate through the file, line-by-line, you can open the file and then use fscanf to parse out each line.
fid = fopen(filename);
while true
    % Read the next 16 integers
    data = fscanf(fid, '%d', 16);
    % Go until we can't read anymore
    if isempty(data)
        break
    end
end
fclose(fid);
If you want each line as a string, you can instead use fgetl to get each line
fid = fopen(filename);
% Get the first line
line = fgetl(fid);
% fgetl returns -1 (not a string) at end of file
while ischar(line)
    % Do thing
    % Get the next line
    line = fgetl(fid);
end
fclose(fid);

python3 opening files and reading lines

Can you explain what is going on in this code? I don't understand how opening the file and using a for loop reads it line by line rather than all of the sentences at once. Thanks.
Let's say I have these sentences in a document file:
cat:dog:mice
cat1:dog1:mice1
cat2:dog2:mice2
cat3:dog3:mice3
Here is the code:
from sys import argv
filename = input("Please enter the name of a file: ")
f = open(filename,'r')
d1ct = dict()
print("Number of times each animal visited each station:")
print("Animal Id Station 1 Station 2")
for line in f:
    if '\n' == line[-1]:
        line = line[:-1]
    (AnimalId, Timestamp, StationId,) = line.split(':')
    key = (AnimalId,StationId,)
    if key not in d1ct:
        d1ct[key] = 0
    d1ct[key] += 1
The magic is at:
for line in f:
    if '\n' == line[-1]:
        line = line[:-1]
Python file objects are special in that they can be iterated over in a for loop. On each iteration, it retrieves the next line of the file. Because it includes the last character in the line, which could be a newline, it's often useful to check and remove the last character.
As Moshe wrote, open file objects can be iterated. However, they are not of the file type in Python 3.x (as they were in Python 2.x). If the file object is opened in text mode, the unit of iteration is one text line including the \n.
You can use line = line.rstrip() to remove the \n plus any trailing whitespace.
If you want to read the content of the file at once (into a multiline string), you can use content = f.read().
There is a minor bug in the code: the open file should always be closed. That means calling f.close() after the for loop. Alternatively, you can wrap the open in the newer with construct, which closes the file for you. I suggest getting used to the latter approach.
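As a sketch of the with construct and rstrip() mentioned above, the counting loop from the question could look like this. The function name count_visits is an assumption for illustration, and the three-field colon-separated layout is taken from the sample lines in the question.

```python
def count_visits(filename):
    """Count how often each (animal, station) pair occurs in a colon-separated file."""
    counts = {}
    # The with block closes the file even if an error occurs inside it.
    with open(filename, "r") as f:
        for line in f:  # one line per iteration, newline included
            line = line.rstrip()
            if not line:
                continue  # skip blank lines
            animal_id, timestamp, station_id = line.split(":")
            key = (animal_id, station_id)
            counts[key] = counts.get(key, 0) + 1
    return counts
```

No explicit f.close() is needed here, because leaving the with block closes the file.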
