I think I am constantly improving my previous question. Basically, I need to chunk up a large text (csv) file and send the pieces to a multiprocessing.Pool. To do that, I think I need an iterable object whose lines can be iterated over.
(see how to multiprocess large text files in python?)
Now I realize that the file object itself (an _io.TextIOWrapper) returned by opening a text file is iterable line by line, so perhaps my chunking code (included below now, sorry for leaving it out previously) could chunk it directly, if only it could get its length. But if the file is iterable, why can't I simply ask for its length (in lines, not bytes)?
Thanks!
import itertools

def chunks(l, n):
    """Yield successive chunks of size `n` from the iterable `l`."""
    l_c = iter(l)
    while True:
        x = tuple(itertools.islice(l_c, n))
        if not x:
            return
        yield x
The reason files are iterable is that they are read serially. The length of a file in lines cannot be known unless the whole file is read; its length in bytes is no indicator of how many lines it contains.
The problem is that, if the file is gigabytes long, you might not want to read it twice if you can help it.
That is why it is better not to know the length, and why you should treat a data file as an iterable rather than as a collection/vector/array that has a length.
Your chunking code should be able to deal directly with the file object itself, without knowing its length.
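As a rough sketch (assuming Python 3), the chunks() generator above can be fed the open file directly and each chunk of lines handed to a worker; process_chunk and the 10000-line chunk size below are placeholders of my own, not from your code:

import multiprocessing

def process_chunk(lines):
    # hypothetical worker: here it just counts the lines it was given
    return len(lines)

if __name__ == '__main__':
    with open('large.csv') as f, multiprocessing.Pool() as pool:
        # chunks() is the generator above; each chunk is a picklable
        # tuple of up to 10000 lines, handed to the workers as it is produced
        total = sum(pool.imap(process_chunk, chunks(f, 10000)))
    print(total)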
However, if you want to know the number of lines before fully processing the file, your two options are:
buffer the whole file into a list of lines first, then pass these lines to your chunker, or
read the file twice, the first time discarding all the data and just counting the lines (see the example below).
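For example, the second option can be a single streaming pass that keeps nothing in memory (the file name is a placeholder):

with open('large.csv') as f:
    num_lines = sum(1 for _ in f)   # count lines without storing them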
I am reading and writing huge files, using unformatted form and stream access.
During the run, I open the same file multiple times and read only the portions of the file that I need at that moment. I use huge files in order to avoid writing too many smaller files to the hard disk. I don't read these huge files all at once because they are too large and I would run into memory problems.
In order to read only portions of the files, I do the following. Let's say that I have written the array A(1:10) to a file "data.dat", and that I need to read it twice into an array B(1:5). This is what I do:
real, dimension(5) :: B
integer :: fu, myposition

! first read: start at the beginning of the file
open(newunit=fu, file="data.dat", status="old", form='unformatted', access='stream')
read(fu, POS=1) B
inquire(unit=fu, POS=myposition)   ! record where the next read should start
close(fu)

[....]

! second read: resume from the recorded position
open(newunit=fu, file="data.dat", status="old", form='unformatted', access='stream')
read(fu, POS=myposition) B
inquire(unit=fu, POS=myposition)
close(fu)
[...]
My questions are:
Is this approach correct?
When the files are very big, the inquire(fu,POS=myposition) goes wrong
because the position no longer fits in a default integer (indeed, I get negative values).
Should I simply declare myposition with a larger integer kind?
Or is there a better way to do what I am trying to do?
In other words, is the appearance of such huge integers a sign that I am using a very clumsy approach?
P.S.
To be more quantitative, this is the order of magnitude: I have thousands of files of around 10 GB each.
I have a rather large file (32 GB) which is an image of an SD card, created using dd.
I suspected that the file was empty (i.e. filled with the null byte \x00) starting from a certain point.
I checked this using Python in the following way (where f is an open file handle with the cursor at the last position at which I could find data):
for i in xrange(512):
    if set(f.read(64*1048576)) != set(['\x00']):
        print i
        break
This worked well (in fact it revealed some data at the very end of the image), but took >9 minutes.
Has anyone got a better way to do this? There must be a much faster way, I'm sure, but I cannot think of one.
Looking at a guide about memory buffers in Python here, I suspected that the comparison itself was the issue. In most dynamically typed languages memory copies are not very obvious, despite being a killer for performance.
In this case, as Oded R. established, reading into a preallocated buffer and comparing the result with a previously prepared null-filled buffer is much more efficient.
size = 512
data = bytearray(size)   # reusable read buffer
cmp = bytearray(size)    # stays filled with null bytes; used as the reference
And when reading:
f = open(FILENAME, 'rb')
f.readinto(data)   # fills data in place without allocating a new bytes object
Two things need to be taken into account:
The sizes of the compared buffers should be equal, but comparing bigger buffers should be faster up to some point (I would expect memory fragmentation to be the main limit).
The last read may return fewer bytes than the buffer size; reading the file into the prepared buffer keeps the trailing zeroes where we want them.
Here the comparison of the two buffers is quick, there is no attempt to cast the bytes to strings (which we don't need), and since we reuse the same memory all the time, the garbage collector doesn't have much work to do either... :)
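Putting these pieces together, a minimal end-to-end sketch might look like this (assuming Python 3; the file name and the 1 MiB buffer size are placeholders, and the reference buffer is called zeroes here rather than cmp to avoid shadowing the old built-in):

size = 1024 * 1024                      # buffer size, placeholder value
data = bytearray(size)                  # reusable read buffer
zeroes = bytearray(size)                # reference buffer, all null bytes

with open('sdcard.img', 'rb') as f:     # placeholder file name
    offset = 0
    while True:
        n = f.readinto(data)            # fill the buffer in place
        if n == 0:                      # end of file reached
            break
        if data[:n] != zeroes[:n]:      # slicing handles a short final read
            print('non-zero data found near offset', offset)
            break
        offset += n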
I am trying to read a fastq formatted text file using scikit-bio.
Given that it is a fairly large file, performing operations is quite slow.
Ultimately, I am attempting to dereplicate the fastq file into a dictionary:
import skbio

f = 'Undetermined_S0_L001_I1_001.fastq'
seqs = skbio.io.read(f, format='fastq')

seq_dic = {}
for seq in seqs:
    seq = str(seq)
    if seq in seq_dic.keys():
        seq_dic[seq] += 1
    else:
        seq_dic[seq] = 1
Most of the time here is spent reading the file:
%%time
f = 'Undetermined_S0_L001_I1_001.fastq'
seqs = skbio.io.read(f, format='fastq')
for seq in itertools.islice(seqs, 100000):
    seq

CPU times: user 46.2 s, sys: 334 ms, total: 46.5 s
Wall time: 47.8 s
My understanding is that not verifying the sequences would improve run time, however that does not appear to be the case:
%%time
f = 'Undetermined_S0_L001_I1_001.fastq'
seqs = skbio.io.read(f, format='fastq', verify=False, variant='illumina1.8')
for seq in itertools.islice(seqs, 100000):
    seq

CPU times: user 47 s, sys: 369 ms, total: 47.4 s
Wall time: 48.9 s
So my question is, first why isn't verify=False improving run time and second is there a faster way using scikit-bio to read sequences?
first why isn't verify=False improving run time
verify=False is a parameter accepted by scikit-bio's I/O API generally; it is not specific to a particular file format. verify=False tells scikit-bio not to invoke the file format's sniffer to double-check that the file is in the format specified by the user. From the docs [1]:
verify : bool, optional
When True, will double check the format if provided.
So verify=False doesn't turn off sequence data validation; it turns off file format sniffer verification. You will have minimal performance gains with verify=False.
seqs = skbio.io.read(f, format='fastq', verify=False, variant='illumina1.8') will produce a generator of skbio.Sequence objects. Sequence alphabet validation is not performed because skbio.Sequence does not have an alphabet, so that isn't where your performance bottleneck is. Note that if you want to read your FASTQ file into a specific type of biological sequence (DNA, RNA, or Protein), you can pass constructor=skbio.DNA (for example). This will perform alphabet validation for the relevant sequence type, and that currently cannot be toggled off when reading. Since you're having performance issues, I don't recommend passing constructor as that will only slow things down even more.
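For reference, that would look roughly like the snippet below; note that this makes reading slower, not faster, because alphabet validation then runs for every record:

import skbio

f = 'Undetermined_S0_L001_I1_001.fastq'
# yields skbio.DNA objects instead of skbio.Sequence; validates the DNA alphabet
seqs = skbio.io.read(f, format='fastq', variant='illumina1.8', constructor=skbio.DNA)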
and second is there a faster way using scikit-bio to read sequences?
There isn't a faster way to read FASTQ files with scikit-bio. There is an issue [2] exploring ideas that could speed this up, but those ideas haven't been implemented.
scikit-bio is slow at reading FASTQ files because it supports reading sequence data and quality scores that may span multiple lines. This complicates the reading logic and has a performance hit. FASTQ files with multi-line data are not common anymore, however; Illumina used to produce these files but they now prefer/recommend writing FASTQ records that are exactly four lines (sequence header, sequence data, quality header, quality scores). In fact, this is how scikit-bio writes FASTQ data. With this simpler record format, it is much faster and easier to read a FASTQ file.

scikit-bio is also slow at reading FASTQ files because it decodes and validates the quality scores. It also stores sequence data and quality scores in a skbio.Sequence object, which has performance overhead.
In your case, you don't need the quality scores decoded, and you likely have a FASTQ file with simple four-line records. Here's a Python 3 compatible generator that reads a FASTQ file and yields sequence data as Python strings:
import itertools

def read_fastq_seqs(filepath):
    with open(filepath, 'r') as fh:
        for seq_header, seq, qual_header, qual in itertools.zip_longest(*[fh] * 4):
            if any(line is None for line in (seq_header, seq, qual_header, qual)):
                raise Exception(
                    "Number of lines in FASTQ file must be multiple of four "
                    "(i.e., each record must be exactly four lines long).")
            if not seq_header.startswith('@'):
                raise Exception("Invalid FASTQ sequence header: %r" % seq_header)
            if qual_header != '+\n':
                raise Exception("Invalid FASTQ quality header: %r" % qual_header)
            if qual == '\n':
                raise Exception("FASTQ record is missing quality scores.")
            yield seq.rstrip('\n')
You can remove the validation checks here if you're certain your file is a valid FASTQ file containing records that are exactly four lines long.
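For example, your dereplication could then be driven by this generator directly (note the membership test is on the dict itself; see the note below):

seq_dic = {}
for seq in read_fastq_seqs('Undetermined_S0_L001_I1_001.fastq'):
    if seq in seq_dic:
        seq_dic[seq] += 1
    else:
        seq_dic[seq] = 1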
This isn't directly related to your question, but I wanted to note a possible inefficiency in your counter logic: seq in seq_dic.keys() builds and scans a list of keys on every iteration in Python 2 (and creates a needless view object in Python 3). Testing membership on the dictionary itself is enough:

if seq in seq_dic:  # calling .keys() is unnecessary
    seq_dic[seq] += 1
else:
    seq_dic[seq] = 1
[1] http://scikit-bio.org/docs/latest/generated/skbio.io.registry.read.html
[2] https://github.com/biocore/scikit-bio/issues/907
If you want to count the number of occurrences of each unique sequence in a fastq file, I would suggest trying out the Bank parser from pyGATB, together with the convenient Counter class from the collections module of the standard library.
#!/usr/bin/env python3
from collections import Counter
from gatb import Bank
# (gunzipping is done transparently)
seq_counter = Counter(seq.sequence for seq in Bank("SRR077487_2.filt.fastq.gz"))
This is quite efficient for a python module (according to my benchmarks in Bioinformatics SE: https://bioinformatics.stackexchange.com/a/380/292).
This counter should behave like your seq_dic, plus some convenient methods.
For instance, if I want to print the 10 most frequent sequences, together with their counts:
print(*seq_counter.most_common(10), sep="\n")
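The counter also makes a couple of summary figures trivial to get, for instance:

print(len(seq_counter))            # number of distinct sequences
print(sum(seq_counter.values()))   # total number of reads counted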
I have a large binary file (~ GB size) generated by a Fortran 90 program. I want to modify something in the head part of the file. The structure of the file is very complicated and contains many different variables, which I want to avoid going into. After reading and re-writing the head, is it possible to "copy and paste" the remainder of the file without knowing its detailed structure? Or even better, can I avoid re-writing the whole file altogether and just make changes to the original file in place? (Not sure if it matters, but the length of the header will be changed.)
Since you are changing the length of the header, I think that you have to write a new, revised file. You can avoid having to "understand" the records after the header by opening the file with stream access and just reading bytes (or perhaps four-byte words, if the file is a multiple of four bytes) until you reach EOF, and copying them to the new file. But if the file was originally created with sequential access and you want to access it that way in the future, you will have to handle the record-length information for the header record(s), including altering the value(s) to be consistent with the changed length of the record(s). This record-length information is typically a four-byte integer at the beginning and end of each record, but it depends on the compiler.
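As a rough illustration of that copy step (sketched here in Python rather than Fortran, assuming the common convention of little-endian four-byte record markers around each sequential record and a single header record; the file names and header bytes are placeholders):

import shutil
import struct

with open('data_old.bin', 'rb') as src, open('data_new.bin', 'wb') as dst:
    # read the old header record: 4-byte length marker, payload, trailing marker
    (old_len,) = struct.unpack('<i', src.read(4))
    src.read(old_len)            # discard the old header payload
    src.read(4)                  # discard the trailing length marker

    # write the revised header as a new sequential record
    new_header = b'...revised header bytes...'   # placeholder content
    marker = struct.pack('<i', len(new_header))
    dst.write(marker + new_header + marker)

    # copy the rest of the file byte-for-byte, without interpreting it
    shutil.copyfileobj(src, dst)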
I have a text file of 3 GB size (a FASTA file with DNA sequences). It contains about 50 million lines of differing length, though most lines are 70 characters wide. I want to extract a string from this file, given two character indices. The difficult part is that newlines must not be counted as characters.
For good speed, I want to use seek() to reach the beginning of the string and start reading, but I need the offset in bytes for that.
My current approach is to write a new file, with all the newlines removed, but that takes another 3GB on disk. I want to find a solution which requires less disk space.
Using a dictionary mapping each character count to a file offset is not practicable either, because there would be one key for every byte, therefore using at least 16 bytes * 3 billion characters = 48 GB.
I think I need a data structure which lets me retrieve the number of newline characters that come before a character with a given index; then I can add that number to the character index to obtain the file offset in bytes.
The SAMtools fai index was designed for just this purpose. It produces a very small, compact index file with enough information to quickly seek to any point of any record inside the FASTA file, as long as the file is properly formatted.
You can create the index with the samtools faidx command.
You can then use other programs in the SAMtools package to pull out subsequences or alignments very quickly using the index.
See http://www.htslib.org/doc/samtools.html for usage.
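If you want to stay in Python, the arithmetic behind a .fai-style index can be sketched by hand for a record whose sequence lines share a fixed width (your description suggests 70 characters); the function names, the seq_start_offset parameter (byte offset where the record's sequence data begins) and the line width below are assumptions of mine:

def char_index_to_byte_offset(char_index, seq_start_offset, line_width=70):
    """Map an index into the concatenated sequence (newlines excluded)
    to a byte offset in the file, assuming fixed-width sequence lines."""
    full_lines_before = char_index // line_width        # each contributes one newline
    return seq_start_offset + char_index + full_lines_before

def read_substring(path, seq_start_offset, start, end, line_width=70):
    """Read characters [start, end) of the sequence, skipping newlines."""
    with open(path, 'rb') as f:
        f.seek(char_index_to_byte_offset(start, seq_start_offset, line_width))
        needed = end - start
        chunk = f.read(needed + needed // line_width + 2)   # generous over-read
        return chunk.replace(b'\n', b'')[:needed].decode('ascii')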