Fastest way to read a fastq with scikit-bio - python-3.x

I am trying to read a fastq formatted text file using scikit-bio.
Given that it is a fairly large file, performing operations is quite slow.
Ultimately, I am attempting to dereplicate the fastq file into a dictionary:
f = 'Undetermined_S0_L001_I1_001.fastq'
seqs = skbio.io.read(f, format='fastq')
seq_dic = {}
for seq in seqs:
    seq = str(seq)
    if seq in seq_dic.keys():
        seq_dic[seq] += 1
    else:
        seq_dic[seq] = 0
Most of the time here is spent reading the file:
%%time
f = 'Undetermined_S0_L001_I1_001.fastq'
seqs = skbio.io.read(f, format='fastq')
for seq in itertools.islice(seqs, 100000):
    seq
CPU times: user 46.2 s, sys: 334 ms, total: 46.5 s
Wall time: 47.8 s
My understanding is that skipping verification of the sequences would improve the run time; however, that does not appear to be the case:
%%time
f = 'Undetermined_S0_L001_I1_001.fastq'
seqs = skbio.io.read(f, format='fastq', verify=False, variant='illumina1.8')
for seq in itertools.islice(seqs, 100000):
    seq
CPU times: user 47 s, sys: 369 ms, total: 47.4 s
Wall time: 48.9 s
So my question is: first, why isn't verify=False improving the run time, and second, is there a faster way to read sequences using scikit-bio?

first why isn't verify=False improving run time
verify=False is a parameter accepted by scikit-bio's I/O API generally. It is not specific to a particular file format. verify=False tells scikit-bio to not invoke the file format's sniffer to double-check that the file is in the format specified by the user. From the docs [1]:
verify : bool, optional
When True, will double check the format if provided.
So verify=False doesn't turn off sequence data validation; it turns off file format sniffer verification. You will have minimal performance gains with verify=False.
seqs = skbio.io.read(f, format='fastq', verify=False, variant='illumina1.8') will produce a generator of skbio.Sequence objects. Sequence alphabet validation is not performed because skbio.Sequence does not have an alphabet, so that isn't where your performance bottleneck is. Note that if you want to read your FASTQ file into a specific type of biological sequence (DNA, RNA, or Protein), you can pass constructor=skbio.DNA (for example). This will perform alphabet validation for the relevant sequence type, and that currently cannot be toggled off when reading. Since you're having performance issues, I don't recommend passing constructor as that will only slow things down even more.
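For completeness, here is a sketch of what the two call styles look like side by side, reusing the file name from the question (as noted above, the constructor version performs alphabet validation and will only be slower):
import skbio

f = 'Undetermined_S0_L001_I1_001.fastq'

# Generator of generic skbio.Sequence objects -- no alphabet validation is performed:
seqs = skbio.io.read(f, format='fastq', verify=False, variant='illumina1.8')

# Generator of skbio.DNA objects -- alphabet validation is performed and cannot be toggled off:
dna_seqs = skbio.io.read(f, format='fastq', variant='illumina1.8', constructor=skbio.DNA)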
and second is there a faster way using scikit-bio to read sequences?
There isn't a faster way to read FASTQ files with scikit-bio. There is an issue [2] exploring ideas that could speed this up, but those ideas haven't been implemented.
scikit-bio is slow at reading FASTQ files because it supports sequence data and quality scores that may span multiple lines. This complicates the reading logic and incurs a performance hit. FASTQ files with multi-line data are not common anymore, however; Illumina used to produce these files, but they now prefer/recommend writing FASTQ records that are exactly four lines long (sequence header, sequence data, quality header, quality scores). In fact, this is how scikit-bio writes FASTQ data. With this simpler record format, it is much faster and easier to read a FASTQ file. scikit-bio is also slow at reading FASTQ files because it decodes and validates the quality scores, and it stores sequence data and quality scores in a skbio.Sequence object, which has performance overhead.
In your case, you don't need the quality scores decoded, and you likely have a FASTQ file with simple four-line records. Here's a Python 3 compatible generator that reads a FASTQ file and yields sequence data as Python strings:
import itertools

def read_fastq_seqs(filepath):
    with open(filepath, 'r') as fh:
        for seq_header, seq, qual_header, qual in itertools.zip_longest(*[fh] * 4):
            if any(line is None for line in (seq_header, seq, qual_header, qual)):
                raise Exception(
                    "Number of lines in FASTQ file must be multiple of four "
                    "(i.e., each record must be exactly four lines long).")
            if not seq_header.startswith('@'):
                raise Exception("Invalid FASTQ sequence header: %r" % seq_header)
            if qual_header != '+\n':
                raise Exception("Invalid FASTQ quality header: %r" % qual_header)
            if qual == '\n':
                raise Exception("FASTQ record is missing quality scores.")
            yield seq.rstrip('\n')
You can remove the validation checks here if you're certain your file is a valid FASTQ file containing records that are exactly four lines long.
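For example, a sketch of how this generator slots into the dereplication loop from the question (using the corrected counting logic discussed below):
f = 'Undetermined_S0_L001_I1_001.fastq'

seq_dic = {}
for seq in read_fastq_seqs(f):
    if seq in seq_dic:
        seq_dic[seq] += 1
    else:
        seq_dic[seq] = 1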
This isn't related to your question, but I wanted to note that you may have a bug in your counter logic. When you see a sequence for the first time, your counter is set to zero instead of 1. I think the logic should be:
if seq in seq_dic:  # calling .keys() is not necessary
    seq_dic[seq] += 1
else:
    seq_dic[seq] = 1
[1] http://scikit-bio.org/docs/latest/generated/skbio.io.registry.read.html
[2] https://github.com/biocore/scikit-bio/issues/907

If you want to count the number of occurrences of each unique sequence in a fastq file, I would suggest trying out the Bank parser from pyGATB, together with the convenient Counter object from the collections module of the standard library.
#!/usr/bin/env python3
from collections import Counter
from gatb import Bank
# (gunzipping is done transparently)
seq_counter = Counter(seq.sequence for seq in Bank("SRR077487_2.filt.fastq.gz"))
This is quite efficient for a Python module (according to my benchmarks on Bioinformatics SE: https://bioinformatics.stackexchange.com/a/380/292).
This counter should behave like your seq_dic, plus some convenient methods.
For instance, if I want to print the 10 most frequent sequences, together with their counts:
print(*seq_counter.most_common(10), sep="\n")

Related

Reading large number of files(20k+) in python memory Error

I am trying to read a large number of files (20k+) from my computer using Python, but I keep getting a MemoryError (details below), although I have 16 GB of RAM, of which 8 GB or more is free all the time, and the size of all the files together is just 270 MB. I have tried many different solutions, like pandas read_csv(), reading in chunks using open(file_path).read(100), and many others, but I am unable to read the files. I have to create a corpus of words after reading the files into the list. Below is my code so far. Any help will be highly appreciated.
import os
import pandas as pd
collectionPath = r"C:\Users\Asghar Nazir\OneDrive - Higher Education Commission\MSDS\S1\IR\assignment\ACL txt"
listOfFilesInCollection = os.listdir(collectionPath)
def wordList(file):
    list_of_words_from_file = []
    for line in file:
        for word in line.split():
            list_of_words_from_file.append(word)
    return list_of_words_from_file

list_of_file_word_Lists = {}
file = []
for file_name in listOfFilesInCollection:
    filePath = collectionPath + "\\" + file_name
    with open(filePath) as f:
        for line in f:
            file.append(line)
    list_of_file_word_Lists[file_name] = wordList(file)

print(list_of_file_word_Lists)
The error that I get:
Traceback (most recent call last):
  File "C:/Users/Asghar Nazir/PycharmProjects/pythonProject/IRAssignment1/init.py", line 25, in <module>
    list_of_file_word_Lists[file_name]=wordList(file)
  File "C:/Users/Asghar Nazir/PycharmProjects/pythonProject/IRAssignment1/init.py", line 14, in wordList
    list_of_words_from_file.append(word)
MemoryError
You probably want to move file = [] to the beginning of the loop, because you're currently adding the lines of each new file you open without first removing the lines of all the previous files.
Then, there are very likely more efficient approaches depending on what you're trying to achieve. If the order of words doesn't matter, then maybe using a dict or a collections.Counter instead of a list can help to avoid duplication of identical strings. If neither the order nor the frequency of words matters, then maybe using a set instead is going to be even better.
Finally, since it's likely you'll find most words in several files, try to store each of them only once in memory. That way, you'll be able to scale way higher than a mere 20k files: there's plenty of space in 16 GiB of RAM.
Keep in mind that Python has lots of fixed overheads and hidden costs: inefficient data structures can cost way more than you would expect.
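As a rough sketch of the Counter-based suggestion (directory path taken from the question; swap the Counter for a set if frequencies don't matter):
import os
from collections import Counter

collectionPath = r"C:\Users\Asghar Nazir\OneDrive - Higher Education Commission\MSDS\S1\IR\assignment\ACL txt"

word_counts = Counter()                       # word -> number of occurrences across all files
for file_name in os.listdir(collectionPath):
    with open(os.path.join(collectionPath, file_name)) as f:
        for line in f:
            word_counts.update(line.split())  # each distinct word is stored only once as a key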
It is hard to tell why your memory problems arise without knowing the content of your files. Maybe it is enough to make your code more efficient. For example, the split() function can handle multiple lines itself, so you don't need a loop for that. And using list comprehensions is generally a good idea in Python.
The following code should return what you want, and I don't see a reason why you should run out of memory using it. Besides that, Arkanosis' hint about the importance of data types is very valid. It depends on what you want to achieve with all those words.
from pathlib import Path

def word_list_from_file(path):
    with open(path, 'rt') as f:
        list_words = f.read().split()
    return list_words

path_dir = Path(r"C:\Users\Asghar Nazir\OneDrive - Higher Education Commission\MSDS\S1\IR\assignment\ACL txt")

dict_file_content = {
    str(path.name): word_list_from_file(path)
    for path in path_dir.rglob("*.txt")
}
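For instance, a quick sanity check of what the comprehension produced (purely illustrative):
total_words = sum(len(words) for words in dict_file_content.values())
print(f"{len(dict_file_content)} files, {total_words} words in total")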
P.S.: I'm not sure how the pathlib module behaves on Windows, but from what I've read, this code snippet is platform-independent.

Fastest way to read many inputs in PyPy3 and what is BytesIO doing here?

Recently I was working on a problem that required me to read many, many lines of numbers (around 500,000).
Early on, I found that using input() was way too slow. Using stdin.readline() was much better. However, it still was not fast enough. I found that using the following code:
import io, os
input = io.BytesIO(os.read(0,os.fstat(0).st_size)).readline
and using input() in this manner improved the runtime. However, I don't actually understand how this code works. Reading the documentation for os.read, the 0 in os.read(0, os.fstat(0).st_size) describes the file we are reading from. What file is 0 describing? Also, fstat describes the status of the file we are reading from, but apparently its st_size is used as the maximum number of bytes we are reading?
The code works but I want to understand what it is doing and why it is faster. Any help is appreciated.
0 is the file descriptor for standard input. os.fstat(0).st_size will tell Python how many bytes are currently waiting in the standard input buffer. Then os.read(0, ...) will read that many bytes in bulk, again from standard input, producing a bytestring.
(As an additional note, 1 is the file descriptor of standard output, and 2 is standard error.)
Here's a demo:
echo "five" | python3 -c "import os; print(os.stat(0).st_size)"
# => 5
Python found four single-byte characters and a newline in the standard input buffer, and reported five bytes waiting to be read.
Bytestrings are not very convenient to work with if you want text — for one thing, they don't really understand the concept of "lines" — so BytesIO fakes an input stream with the passed bytestring, allowing you to readline from it. I am not 100% sure why this is faster, but my guesses are:
Normal read is likely done character-wise, so that one can detect a line break and stop without reading too much; bulk read is more efficient (and finding newlines post-facto in memory is pretty fast)
There is no encoding processing done this way
os.read takes a file descriptor and a size: os.read(fd, size). Setting size to the number of bytes left in fd causes everything else to come rushing at you like a tsunami. There are also "standard file descriptors": 0 = stdin, 1 = stdout, 2 = stderr.
Code deconstruction:
import io, os             # Utilities

input = (                 # Replace the input built-in
    io.BytesIO(           # Create a fake file
        os.read(          # Read data from a file descriptor
            0,            # stdin
            os.fstat(0)   # Information about stdin
            .st_size      # Bytes left in the file
        )
    ).readline            # When called, gets a line of the file
)
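To see the pattern in use, here is a minimal sketch (assuming the input starts with a count n followed by n numbers, one per line; int() happily parses the raw bytes returned by readline):
import io, os

# Slurp everything waiting on stdin (fd 0) in one bulk read,
# then serve it back line by line from memory.
input = io.BytesIO(os.read(0, os.fstat(0).st_size)).readline

n = int(input())
values = [int(input()) for _ in range(n)]
print(sum(values))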

how to overcome this case?

I have a program that reads a CSV file line by line; for each line, I process some values, store them in a list, and then append that list to a master list, so in the end I have a list of lists.
Since I test on small files, it's OK. But when it comes to a large file, I need to predict the machine's memory behaviour; to do so, I need to know how much memory processing 200,000,000 (for example) lines will require.
I've tried something, but then I encountered this problem, which I've simplified below.
from sys import getsizeof
# Code 1
print("{} bytes".format(getsizeof([['3657f9f6395fc3ef8fe640','table_source','field_source']*4])))
# 72 bytes
###### different from ######
# Code 2
print("{} bytes".format(getsizeof([['3657f9f6395fc3ef8fe640','table_source','field_source'],
['3657f9f6395fc3ef8fe640','table_source','field_source'],
['3657f9f6395fc3ef8fe640','table_source','field_source'],
['3657f9f6395fc3ef8fe640','table_source','field_source']])))
# 96 bytes
# when checking the unique identity
> [id(['3657f9f6395fc3ef8fe640','table_source','field_source'])*4]
[10301192613920]
> [ id(['3657f9f6395fc3ef8fe640','table_source','field_source']),
id(['3657f9f6395fc3ef8fe640','table_source','field_source']),
id(['3657f9f6395fc3ef8fe640','table_source','field_source']),
id(['3657f9f6395fc3ef8fe640','table_source','field_source']) ]
[2575297689160, 2575297689160, 2575297689160, 2575297689160]
The size in bytes is different, as if there is some kind of indexing of the object/value in memory when you build the list the way Code 1 does. How can I overcome this and make my approximate estimation?

Julia - Parallelism for Reading a Large file

In Julia v1.1, assume that I have a very large text file (30 GB) and I want to use parallelism (multi-threading) to read eachline; how can I do that?
This code is an attempt to do this, after checking Julia's documentation on multi-threading, but it's not working at all:
open("pathtofile", "r") do file
# Count number of lines in file
seekend(file)
fileSize = position(file)
seekstart(file)
# skip nseekchars first characters of file
seek(file, nseekchars)
# progress bar, because it's a HUGE file
p = Progress(fileSize, 1, "Reading file...", 40)
Threads.#threads for ln in eachline(file)
# do something on ln
u, v = map(x->parse(UInt32, x), split(ln))
.... # other interesting things
update!(p, position(file))
end
end
Note 1: you need using ProgressMeter (I want my code to show a progress bar while reading the file in parallel).
Note 2: nseekchars is an Int, the number of characters I want to skip at the beginning of my file.
Note 3: the code works, but without parallelism, if I leave out the Threads.@threads macro next to the for loop.
For the maximum I/O performance:
Parallelize the hardware - that is, use disk arrays rather than a single drive. Try searching for RAID performance for many excellent explanations (or ask a separate question).
Use the Julia memory mapping mechanism
using Mmap

s = open("my_file.txt", "r")
a = Mmap.mmap(s)
Once you have the memory mapping, do the processing in parallel. Beware of false sharing between threads (whether it matters depends on your actual scenario).

why does an iterable object have no length in Python?

I think I am constantly improving my previous question. Basically, I need to chunk up a large text (csv) file to send the pieces to a multiprocessing.Pool. To do so, I think I need an iterable object where the lines can be iterated over.
(see how to multiprocess large text files in python?)
Now I've realized that the file object itself (an _io.TextIOWrapper type) is iterable line by line after you open a text file, so perhaps my chunking code (now below; sorry for omitting it previously) could chunk it, if it could get its length? But if it's iterable, why can't I simply ask for its length (in lines, not bytes)?
Thanks!
def chunks(l, n):
    """Divide a list of nodes `l` in `n` chunks"""
    l_c = iter(l)
    while 1:
        x = tuple(itertools.islice(l_c, n))
        if not x:
            return
        yield x
The reason files are iterable is that they are read in series. The length of a file, in lines, can't be calculated unless the whole file is processed. (The file's length in bytes is no indicator of how many lines it has.)
The problem is that, if the file were Gigabytes long, you might not want to read it twice if it could be helped.
That is why it is better to not know the length; that is why one should deal with data files as an Iterable rather than a collection/vector/array that has a length.
Your chunking code should be able to deal directly with the file object itself, without knowing its length.
However, if you wanted to know the number of lines before processing the file fully, your two options are:
buffer the whole file into an array of lines first, then pass these lines to your chunker
read it twice over, the first time discarding all the data and just counting the lines
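For illustration, here is a minimal sketch of feeding the chunker directly into a multiprocessing.Pool without ever knowing the file's length (the file name and the per-chunk function are placeholders):
import itertools
from multiprocessing import Pool

def chunks(iterable, n):
    """Yield tuples of up to n lines from any iterable, including an open file."""
    it = iter(iterable)
    while True:
        x = tuple(itertools.islice(it, n))
        if not x:
            return
        yield x

def process_chunk(lines):
    # placeholder: do the real per-chunk work here
    return len(lines)

if __name__ == '__main__':
    with open('large_file.csv') as f, Pool() as pool:
        total = sum(pool.imap(process_chunk, chunks(f, 10000)))
    print(total)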
