Load only a specific line - io

I have a data file containing over a million lines consisting of 16 integer numbers (it doesn't really matter) and I need to process the lines in Octave. Obviously, it's impossible to load the whole file. How can I load only a specific line?
I have thought of two possibilities:
I have missed something in docs of Simple I/O
I should convert the file to be a CSV and use some of the csvread features

If you want to iterate through the file line by line, you can open the file and then use fscanf to parse each line.
fid = fopen(filename);
while true
    % Read the next 16 integers
    data = fscanf(fid, '%d', 16);
    % Stop when nothing more can be read
    if isempty(data)
        break
    end
end
fclose(fid);
If you want each line as a string, you can use fgetl instead:
fid = fopen(filename);
% Get the first line
line = fgetl(fid);
% fgetl returns -1 (not a string) at end of file
while ischar(line)
    % Do thing
    % Get the next line
    line = fgetl(fid);
end
fclose(fid);
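For comparison, the same idea (skip ahead without holding the whole file in memory) can be sketched in Python rather than Octave. The function names read_line and parse_ints are made up for this illustration, and the sketch assumes the integers on each line are separated by whitespace.

```python
from itertools import islice


def read_line(path, lineno):
    """Return line `lineno` (1-based) without its newline, or None if the file is shorter."""
    with open(path) as f:
        # islice consumes the preceding lines lazily; only one line is kept at a time
        line = next(islice(f, lineno - 1, lineno), None)
    return None if line is None else line.rstrip("\n")


def parse_ints(line):
    """Split a whitespace-separated line into a list of integers."""
    return [int(tok) for tok in line.split()]
```

The file is still scanned from the start up to the requested line, so this saves memory, not reading time.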

Related

Splitting up a large file into smaller files at specific points

I know this question has been asked several times, but those solutions don't really help me here. I have a really big file (almost 5 GB) to read; I need to extract the data and feed it to my neural network, reading line by line. At first I loaded the entire file into memory using the .readlines() function, but that obviously resulted in an out-of-memory error. Next, instead of loading the entire file into memory, I read it line by line, but that still didn't work. So now I am thinking of splitting my file into smaller files and then reading each of those. In the file format, each sequence has a header starting with '>' followed by the sequence itself, for example:
>seq1
acgtccgttagggtjhtttttttttt
tttsggggggtattttttttt
>seq2
accggattttttstttttttttaasftttttttt
stttttttttttttttttttttttsttattattat
tttttttttttttttt
>seq3
aa
.
.
.
>seqN
bbbbaatatattatatatatattatatat
tatatattatatatattatatatattatat
tatattatatattatatatattatatatatta
tatatatatattatatatatatatattatatat
tatatatattatatattatattatatatattata
tatatattatatattatatatattatatatatta
So now I want to split my file, which has 12,700,000 sequences, into smaller files such that each header starting with '>' stays with its corresponding sequence. How can I achieve this in Python without running into memory issues? Insights would be appreciated.
I was able to do this with 12,700,000 randomized lines with 1-20 random characters in each line, though my file was far smaller than 5 GB (roughly 300 MB), likely due to the format. All of that said, you can try this:
# FILEPATH is the directory containing main.txt; the "Sequence Files" subdirectory must already exist
x = 0
y = 1
string = ""
cycle = "Seq1"
with open(f"{FILEPATH}/main.txt", "r") as file:
    for line in file:
        if line[0] == ">":
            if x % 5000 == 0 and x != 0:
                with open(f"{FILEPATH}/Sequence Files/Starting{cycle}.txt", "a") as newfile:
                    newfile.writelines(string)
                cycle = f"Seq{y*5000+1}"
                y += 1
                string = ""
            string += line
            x += 1
        if line[0] != ">":
            string += line
# Write whatever is left after the last full batch
with open(f"{FILEPATH}/Sequence Files/Starting{cycle}.txt", "a") as newfile:
    newfile.writelines(string)
This will read the file line by line, append the first 5000 sequences to a string, write the string to a new file, and repeat for the rest of the original file. Each new file is named after the first sequence it contains.
The line that reads if x % 5000 == 0 and x != 0: defines the number of sequences within each file, and the line cycle = f"Seq{y*5000+1}" creates the next filename. You can adjust the 5000 in both places if you change your mind about how many sequences go in each file (you're creating 2,540 new files this way).
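A variant of the same batching idea writes each line straight to the current output file instead of building up a string first, so memory use stays at roughly one line. This is only a sketch: the function name split_sequences and the part00001.txt naming scheme are invented here, not taken from the answer above.

```python
import os


def split_sequences(src_path, out_dir, per_file=5000):
    """Split a '>'-headed sequence file into files of at most `per_file` records.

    Returns the list of file paths written, in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    count = 0  # number of headers seen so far
    try:
        with open(src_path) as src:
            for line in src:
                if line.startswith(">"):
                    if count % per_file == 0:
                        # Every per_file headers, close the current part and start a new one
                        if out is not None:
                            out.close()
                        name = os.path.join(out_dir, f"part{count // per_file + 1:05d}.txt")
                        out = open(name, "w")
                        written.append(name)
                    count += 1
                if out is not None:  # lines before the first header are skipped
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Because a new file is only started when a header line is seen, a sequence is never cut in half across two files.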

Replacing a float number in txt file

Firstly, I would like to say that I am a newbie in Python.
I will try to explain my problem as best I can.
The main aim of the code is to be able to read, modify and copy a txt file.
In order to do that I would like to split the problem up in three different steps.
1 - Copy the first N lines into a new txt file (CopyFile), exactly as they are in the original file (OrigFile)
2 - Access a specific line where I want to replace one float number with another, and append this line to CopyFile.
3 - Copy the rest of OrigFile, from the line in step 2 to the end of the file.
So far I have been able to do step 1 with the following code:
with open("OrigFile.txt") as myfile:
    head = [next(myfile) for x in range(10)]  # read first 10 lines of txt file
copy = open("CopyFile.txt", "w")  # create a txt file named CopyFile.txt
copy.write("".join(head))  # convert list into str
copy.close()  # close txt file
For the second step, my idea is to access the line I am interested in directly and identify the float number I would like to change. Code:
import linecache
import re
line11 = linecache.getline("OrigFile.txt", 11)  # open the file and access line 11 directly
FltNmb = re.findall(r"\d+\.\d+", line11)  # regular expression to identify float numbers
My problem comes when I need to replace FltNmb with a new value, given that I have to change it inside line11. How could I achieve that?
Open both files and write each line sequentially while incrementing a line counter.
Replace the float number only on line 11; all other lines are written without modification:
import re
with open("CopyFile.txt", "w") as newfile:
    with open("OrigFile.txt") as myfile:
        linecounter = 1
        for line in myfile:
            if linecounter == 11:
                # "^" only matches a float at the very start of the line; drop it to match anywhere
                newline = re.sub(r"^(\d+\.\d+)", "<new number>", line)
                newfile.write(newline)
            else:
                newfile.write(line)
            linecounter += 1
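The same copy-and-replace loop can be wrapped in a small function. This is a sketch under a few assumptions: the helper name replace_float_on_line is invented, the pattern is the question's \d+\.\d+ without the ^ anchor, and only the first float on the target line is replaced.

```python
import re

FLOAT_RE = re.compile(r"\d+\.\d+")


def replace_float_on_line(src_path, dst_path, lineno, new_value):
    """Copy src_path to dst_path, replacing the first float on line `lineno` (1-based)."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for n, line in enumerate(src, start=1):
            if n == lineno:
                # A function as replacement avoids backslash/group interpretation in new_value
                line = FLOAT_RE.sub(lambda m: str(new_value), line, count=1)
            dst.write(line)
```

Using enumerate(src, start=1) removes the need to maintain the line counter by hand.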

Python 3 going through a file until EOF. File is not just a set of similar lines needing processing

The answers to questions of the type "How do I do "while not eof(file)""
do not quite cover my issue.
I have a file with a format like
header block
data
another header block
more data (with arbitrary number of data lines in each data block)
...
I do not know how many header-data sets there are.
I have successfully read the first header block and then a set of data, using loops that look for the blank line at the end of the data block.
I can't just use the "for each line in openfile" type of approach, as I need to read the header-data blocks one at a time and then process them.
How can I detect the last header-data block?
My current approach is to use a try/except construction and wait for the exception. Not terribly elegant.
It's hard to answer without seeing any of your code...
But my guess is that you are reading the file with fp.read():
fp = open("a.txt")
while True:
    data = fp.read()
Instead:
Always pass the length of the data you expect.
Check whether the chunk you read is an empty string, not None.
For example:
fp = open("a.txt")
while True:
    header = fp.read(headerSize)  # headerSize: the known length of a header
    if header == '':
        # End of file
        break
    dataSize = read_dataSize_from_header(header)  # placeholder: get the data length from the header
    data = fp.read(dataSize)
    if data == '':
        # Error reading file
        raise IOError('Error reading file')
    process_your_data(data)
This is some time later, but I am posting this for others who find this question.
The following script, suitably adjusted, will read a file and deliver lines until EOF.
"""
Script to read a file until the EOF
"""
def get_all_lines(the_file):
    for line in the_file:
        if line.endswith('\n'):
            line = line[:-1]
        yield line
line_counter = 1
data_in = open('OAall.txt')
for line in get_all_lines(data_in):
    print(line)
    print(line_counter)
    line_counter += 1
data_in.close()
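If the blocks are separated by blank lines, as the question describes for the end of each data block, the generator idea can be extended to deliver one block at a time. The sketch below assumes that any run of consecutive non-blank lines forms one block; the name read_blocks is made up for this example.

```python
from itertools import groupby


def read_blocks(lines):
    """Yield each run of consecutive non-blank lines as a list of strings."""
    stripped = (line.rstrip("\n") for line in lines)
    # groupby alternates between runs of blank and non-blank lines
    for is_blank, group in groupby(stripped, key=lambda s: s.strip() == ""):
        if not is_blank:
            yield list(group)
```

The loop simply ends when the file is exhausted, so the last block needs no special detection and no try/except.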

How do you read a specific line in a text file in Lua

I need to read a specific line from a text file in Lua. I know how to open the file:
filename = "hallo.txt"
fp = io.open( filename, "r" )
but I don't know how to read a specific line from that file.
How can I do that?
If you have to do it several times, then read the whole file into memory, storing the lines in a table.
If you only have to do this once, try something like this:
local n = 0
for l in io.lines(filename) do
    n = n + 1
    if n == lineno then process(l); break end
end
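For the repeated-access case, an alternative to keeping every line in memory is to store only the byte offset of each line and seek to it on demand. This swaps in a different technique from the table of lines suggested above, and it is sketched in Python rather than Lua; build_line_index and get_line are hypothetical names, and UTF-8 is an assumed default encoding.

```python
def build_line_index(path):
    """Return a list where entry i is the byte offset at which line i+1 starts."""
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)  # binary mode, so len() counts bytes
    return offsets


def get_line(path, offsets, lineno, encoding="utf-8"):
    """Return line `lineno` (1-based) by seeking directly to its stored offset."""
    with open(path, "rb") as f:
        f.seek(offsets[lineno - 1])
        return f.readline().decode(encoding).rstrip("\r\n")
```

The index costs one integer per line, and it has to be rebuilt whenever the file changes.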

python3 opening files and reading lines

Can you explain what is going on in this code? I don't understand
how a for loop can open the file and read it line by line instead of reading all of the sentences at once. Thanks.
Let's say I have these sentences in a document file:
cat:dog:mice
cat1:dog1:mice1
cat2:dog2:mice2
cat3:dog3:mice3
Here is the code:
from sys import argv
filename = input("Please enter the name of a file: ")
f = open(filename,'r')
d1ct = dict()
print("Number of times each animal visited each station:")
print("Animal Id Station 1 Station 2")
for line in f:
    if '\n' == line[-1]:
        line = line[:-1]
    (AnimalId, Timestamp, StationId,) = line.split(':')
    key = (AnimalId,StationId,)
    if key not in d1ct:
        d1ct[key] = 0
    d1ct[key] += 1
The magic is at:
for line in f:
    if '\n' == line[-1]:
        line = line[:-1]
Python file objects are special in that they can be iterated over in a for loop. Each iteration retrieves the next line of the file. Because each line still includes its last character, which could be a newline, it's often useful to check for it and remove it.
As Moshe wrote, open file objects can be iterated. Only, they are not of the file type in Python 3.x (as they were in Python 2.x). If the file object is opened in text mode, then the unit of iteration is one text line including the \n.
You can use line = line.rstrip() to remove the \n along with any other trailing whitespace.
If you want to read the content of the file at once (into a multiline string), you can use content = f.read().
There is a minor bug in the code: the open file should always be closed. That means calling f.close() after the for loop. Alternatively, you can wrap the open call in the newer with construct, which closes the file for you. I suggest getting used to the latter approach.
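Putting those suggestions together, the counting loop could look roughly like this. It is a sketch only: the function name count_visits is invented, collections.Counter stands in for the hand-maintained dict, and every non-empty line is assumed to have exactly three colon-separated fields.

```python
from collections import Counter


def count_visits(path):
    """Count how often each (animal id, station id) pair appears in the file."""
    counts = Counter()
    with open(path) as f:  # the file is closed automatically when the block ends
        for line in f:
            line = line.rstrip()  # drop the newline and any trailing whitespace
            if not line:
                continue  # skip empty lines
            animal_id, timestamp, station_id = line.split(":")
            counts[(animal_id, station_id)] += 1
    return counts
```

A Counter returns 0 for missing keys, so the "if key not in d1ct" check is no longer needed.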
