how to overcome this case? - python-3.x

I have a program that reads a CSV file line by line; for each line I process some values, store them in a list, and append that list to an outer list, so in the end I have a list of lists.
Since I test on small files this works fine, but for a large file I need to predict the machine's memory behaviour; to do so I need to estimate how much memory processing, say, 200,000,000 lines will require.
I've tried something, but I ran into the problem below (simplified).
from sys import getsizeof
# Code 1
print("{} bytes".format(getsizeof([['3657f9f6395fc3ef8fe640','table_source','field_source']*4])))
# 72 bytes
###### different from ######
# Code 2
print("{} bytes".format(getsizeof([['3657f9f6395fc3ef8fe640','table_source','field_source'],
['3657f9f6395fc3ef8fe640','table_source','field_source'],
['3657f9f6395fc3ef8fe640','table_source','field_source'],
['3657f9f6395fc3ef8fe640','table_source','field_source']])))
# 96 bytes
# when checking the unique identities
>>> [id(['3657f9f6395fc3ef8fe640','table_source','field_source'])*4]
[10301192613920]
>>> [id(['3657f9f6395fc3ef8fe640','table_source','field_source']),
...  id(['3657f9f6395fc3ef8fe640','table_source','field_source']),
...  id(['3657f9f6395fc3ef8fe640','table_source','field_source']),
...  id(['3657f9f6395fc3ef8fe640','table_source','field_source'])]
[2575297689160, 2575297689160, 2575297689160, 2575297689160]
The size in bytes differs, as if there were some kind of indexing of the object/value in memory when you build the list the way Code 1 does. How can I overcome this and make my approximate estimation?
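For context on the numbers above: getsizeof() is shallow, so it reports only the outer list's own size (its header plus one pointer per element) and ignores the nested lists and strings; Code 1 builds a single inner list of 12 elements (one pointer in the outer list), while Code 2 builds four inner lists of 3 elements (four pointers). A minimal sketch, assuming CPython, with a hypothetical deep_sizeof() helper that follows references to approximate the total footprint:

from sys import getsizeof

row = ['3657f9f6395fc3ef8fe640', 'table_source', 'field_source']

code1 = [row * 4]                        # one inner list of 12 elements
code2 = [list(row) for _ in range(4)]    # four inner lists of 3 elements

print(getsizeof(code1))   # outer list with 1 element slot (shallow size only)
print(getsizeof(code2))   # outer list with 4 element slots (shallow size only)

# Hypothetical helper (not from the question): follow references and count
# each distinct object once to approximate the "deep" size.
def deep_sizeof(obj, seen=None):
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = getsizeof(obj)
    if isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    elif isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen)
                    for k, v in obj.items())
    return size

print(deep_sizeof(code2))  # a better basis for extrapolating to 200,000,000 rows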

Related

Reading large number of files(20k+) in python memory Error

I am trying to read a large number of files (20k+) from my computer using Python, but I keep getting a MemoryError (details below), although I have 16 GB of RAM of which 8 GB or more is free at all times, and the total size of all the files is just 270 MB. I have tried many different solutions, such as pandas read_csv(), reading in chunks with open(file_path).read(100), and many others, but I am still unable to read the files. I have to create a corpus of words after reading the files into the list. Below is my code so far. Any help will be highly appreciated.
import os
import pandas as pd

collectionPath = r"C:\Users\Asghar Nazir\OneDrive - Higher Education Commission\MSDS\S1\IR\assignment\ACL txt"
listOfFilesInCollection = os.listdir(collectionPath)

def wordList(file):
    list_of_words_from_file = []
    for line in file:
        for word in line.split():
            list_of_words_from_file.append(word)
    return list_of_words_from_file

list_of_file_word_Lists = {}
file = []
for file_name in listOfFilesInCollection:
    filePath = collectionPath + "\\" + file_name
    with open(filePath) as f:
        for line in f:
            file.append(line)
    list_of_file_word_Lists[file_name] = wordList(file)
print(list_of_file_word_Lists)
The error that I get
Traceback (most recent call last):
  File "C:/Users/Asghar Nazir/PycharmProjects/pythonProject/IRAssignment1/init.py", line 25, in <module>
    list_of_file_word_Lists[file_name]=wordList(file)
  File "C:/Users/Asghar Nazir/PycharmProjects/pythonProject/IRAssignment1/init.py", line 14, in wordList
    list_of_words_from_file.append(word)
MemoryError
You probably want to move the file=[] to the beginning of the loop, because you're currently adding the lines of each new file you open without first removing the lines of all the previous files.
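A minimal sketch of that fix, reusing the question's names (wordList, collectionPath and listOfFilesInCollection are defined in the code above):

list_of_file_word_Lists = {}
for file_name in listOfFilesInCollection:
    file = []                                   # reset for every document
    filePath = collectionPath + "\\" + file_name
    with open(filePath) as f:
        for line in f:
            file.append(line)
    list_of_file_word_Lists[file_name] = wordList(file)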
Then, there are very likely more efficient approaches depending on what you're trying to achieve. If the order of words doesn't matter, then using a dict or a collections.Counter instead of a list can help avoid duplication of identical strings. If neither the order nor the frequency of words matters, then using a set instead will be even better.
Finally, since it's likely you'll find most words in several files, try to store each of them only once in memory. That way, you'll be able to scale way higher than a mere 20k files: there's plenty of space in 16 GiB of RAM.
Keep in mind that Python has lots of fixed overheads and hidden costs: inefficient data structures can cost way more than you would expect.
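To make the Counter suggestion concrete, a hedged sketch reusing the question's setup (one Counter per file keeps word frequencies without storing every duplicate word in a list):

import os
from collections import Counter

collectionPath = r"C:\Users\Asghar Nazir\OneDrive - Higher Education Commission\MSDS\S1\IR\assignment\ACL txt"
listOfFilesInCollection = os.listdir(collectionPath)

word_counts = {}
for file_name in listOfFilesInCollection:
    with open(os.path.join(collectionPath, file_name)) as f:
        word_counts[file_name] = Counter(f.read().split())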
It is hard to tell why your memory problems arise without knowing the content of your files. Maybe it is enough to make your code more efficient. For example, the split() function can handle multiple lines itself, so you don't need a loop for that, and using a comprehension is usually a good idea in Python.
The following code should return what you want, and I don't see a reason why you should run out of memory using it. Besides that, Arkanosis' hint about the importance of data types is very valid. It depends on what you want to achieve with all those words.
from pathlib import Path

def word_list_from_file(path):
    with open(path, 'rt') as f:
        list_words = f.read().split()
    return list_words

path_dir = Path(r"C:\Users\Asghar Nazir\OneDrive - Higher Education Commission\MSDS\S1\IR\assignment\ACL txt")

dict_file_content = {
    str(path.name): word_list_from_file(path)
    for path in path_dir.rglob("*.txt")
}
P.S.: I'm not sure how the pathlib module behaves on Windows, but from what I read, this code snippet is platform independent.

Fastest way to read many inputs in PyPy3 and what is BytesIO doing here?

Recently I was working on a problem that required me to read many many lines of numbers (around 500,000).
Early on, I found that using input() was way too slow. Using stdin.readline() was much better. However, it still was not fast enough. I found that using the following code:
import io, os
input = io.BytesIO(os.read(0,os.fstat(0).st_size)).readline
and using input() in this manner improved the runtime. However, I don't actually understand how this code works. Reading the documentation for os.read, the 0 in os.read(0, os.fstat(0).st_size) describes the file we are reading from. What file is 0 describing? Also, fstat describes the status of the file we are reading from, but apparently its result is used here as the maximum number of bytes to read?
The code works but I want to understand what it is doing and why it is faster. Any help is appreciated.
0 is the file descriptor for standard input. os.fstat(0).st_size will tell Python how many bytes are currently waiting in the standard input buffer. Then os.read(0, ...) will read that many bytes in bulk, again from standard input, producing a bytestring.
(As an additional note, 1 is the file descriptor of standard output, and 2 is standard error.)
Here's a demo:
echo "five" | python3 -c "import os; print(os.stat(0).st_size)"
# => 5
Python found four single-byte characters and a newline in the standard input buffer, and reported five bytes waiting to be read.
Bytestrings are not very convenient to work with if you want text — for one thing, they don't really understand the concept of "lines" — so BytesIO fakes an input stream with the passed bytestring, allowing you to readline from it. I am not 100% sure why this is faster, but my guesses are:
Normal read is likely done character-wise, so that one can detect a line break and stop without reading too much; bulk read is more efficient (and finding newlines post-facto in memory is pretty fast)
There is no encoding processing done this way
os.read has a signature I'll describe as (fd, size). Setting size to the number of bytes left in fd causes everything else to come rushing at you like a tsunami. There are also standard file descriptors: 0 = stdin, 1 = stdout, 2 = stderr.
Code deconstruction:
import io, os                     # utilities
input = (                         # replace the input built-in
    io.BytesIO(                   # wrap the bytes in a fake in-memory file
        os.read(                  # read data from a file descriptor
            0,                    # 0 = stdin
            os.fstat(0)           # information about stdin
            .st_size              # bytes left in the file
        )
    ).readline                    # when called, gets one line of the "file"
)
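A short usage sketch of the idiom, assuming a simple input format (a count on the first line, then that many integers, one per line); this is an illustration, not code from the question:

import io, os

input = io.BytesIO(os.read(0, os.fstat(0).st_size)).readline  # bulk-read stdin once

n = int(input())                             # first line: how many numbers follow
total = sum(int(input()) for _ in range(n))  # remaining lines: one integer each
print(total)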

Fastest way to read a fastq with scikit-bio

I am trying to read a fastq formatted text file using scikit-bio.
Given that it is a fairly large file, performing operations is quite slow.
Ultimately, I am attempting to dereplicate the fastq file into a dictionary:
import skbio

f = 'Undetermined_S0_L001_I1_001.fastq'
seqs = skbio.io.read(f, format='fastq')

seq_dic = {}
for seq in seqs:
    seq = str(seq)
    if seq in seq_dic.keys():
        seq_dic[seq] += 1
    else:
        seq_dic[seq] = 1
Most of the time here is used during the reading of the file:
%%time
f = 'Undetermined_S0_L001_I1_001.fastq'
seqs = skbio.io.read(f, format='fastq')
for seq in itertools.islice(seqs, 100000):
    seq
CPU times: user 46.2 s, sys: 334 ms, total: 46.5 s
Wall time: 47.8 s
My understanding is that not verifying the sequences would improve the run time; however, that does not appear to be the case:
%%time
f = 'Undetermined_S0_L001_I1_001.fastq'
seqs = skbio.io.read(f, format='fastq', verify=False, variant='illumina1.8')
for seq in itertools.islice(seqs, 100000):
    seq
CPU times: user 47 s, sys: 369 ms, total: 47.4 s
Wall time: 48.9 s
So my question is: first, why isn't verify=False improving the run time, and second, is there a faster way to read sequences using scikit-bio?
first why isn't verify=False improving run time
verify=False is a parameter accepted by scikit-bio's I/O API generally. It is not specific to a particular file format. verify=False tells scikit-bio to not invoke the file format's sniffer to double-check that the file is in the format specified by the user. From the docs [1]:
verify : bool, optional
When True, will double check the format if provided.
So verify=False doesn't turn off sequence data validation; it turns off file format sniffer verification. You will have minimal performance gains with verify=False.
seqs = skbio.io.read(f, format='fastq', verify=False, variant='illumina1.8') will produce a generator of skbio.Sequence objects. Sequence alphabet validation is not performed because skbio.Sequence does not have an alphabet, so that isn't where your performance bottleneck is. Note that if you want to read your FASTQ file into a specific type of biological sequence (DNA, RNA, or Protein), you can pass constructor=skbio.DNA (for example). This will perform alphabet validation for the relevant sequence type, and that currently cannot be toggled off when reading. Since you're having performance issues, I don't recommend passing constructor as that will only slow things down even more.
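For completeness, a hedged sketch of that constructor option as described above (expect it to be slower, since alphabet validation is performed):

import skbio

seqs = skbio.io.read('Undetermined_S0_L001_I1_001.fastq', format='fastq',
                     variant='illumina1.8', constructor=skbio.DNA)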
and second is there a faster way using scikit-bio to read sequences?
There isn't a faster way to read FASTQ files with scikit-bio. There is an issue [2] exploring ideas that could speed this up, but those ideas haven't been implemented.
scikit-bio is slow at reading FASTQ files because it supports sequence data and quality scores that may span multiple lines. This complicates the reading logic and hurts performance. However, FASTQ files with multi-line data are not common anymore; Illumina used to produce such files, but they now prefer/recommend writing FASTQ records that are exactly four lines (sequence header, sequence data, quality header, quality scores). In fact, this is how scikit-bio writes FASTQ data. With this simpler record format, it is much faster and easier to read a FASTQ file. scikit-bio is also slow because it decodes and validates the quality scores, and because it stores sequence data and quality scores in a skbio.Sequence object, which has its own overhead.
In your case, you don't need the quality scores decoded, and you likely have a FASTQ file with simple four-line records. Here's a Python 3 compatible generator that reads a FASTQ file and yields sequence data as Python strings:
import itertools

def read_fastq_seqs(filepath):
    with open(filepath, 'r') as fh:
        for seq_header, seq, qual_header, qual in itertools.zip_longest(*[fh] * 4):
            if any(line is None for line in (seq_header, seq, qual_header, qual)):
                raise Exception(
                    "Number of lines in FASTQ file must be multiple of four "
                    "(i.e., each record must be exactly four lines long).")
            if not seq_header.startswith('@'):
                raise Exception("Invalid FASTQ sequence header: %r" % seq_header)
            if qual_header != '+\n':
                raise Exception("Invalid FASTQ quality header: %r" % qual_header)
            if qual == '\n':
                raise Exception("FASTQ record is missing quality scores.")
            yield seq.rstrip('\n')
You can remove the validation checks here if you're certain your file is a valid FASTQ file containing records that are exactly four lines long.
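A hedged usage sketch: since the generator yields plain strings, the question's seq_dic can be built in a single pass with collections.Counter:

from collections import Counter

seq_dic = Counter(read_fastq_seqs('Undetermined_S0_L001_I1_001.fastq'))
print(seq_dic.most_common(5))   # the five most frequent sequences, for illustration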
This isn't related to your question, but I wanted to note that you don't need to call .keys() when testing membership; checking against the dict directly is simpler and slightly faster. I think the logic should be:
if seq in seq_dic:  # calling .keys() is unnecessary
    seq_dic[seq] += 1
else:
    seq_dic[seq] = 1
[1] http://scikit-bio.org/docs/latest/generated/skbio.io.registry.read.html
[2] https://github.com/biocore/scikit-bio/issues/907
If you want to count the number of occurrences of each unique sequence in a fastq file, I would suggest trying out the Bank parser of pyGATB, together with the convenient Counter object from the collections module of the standard library.
#!/usr/bin/env python3
from collections import Counter
from gatb import Bank
# (gunzipping is done transparently)
seq_counter = Counter(seq.sequence for seq in Bank("SRR077487_2.filt.fastq.gz"))
This is quite efficient for a python module (according to my benchmarks in Bioinformatics SE: https://bioinformatics.stackexchange.com/a/380/292).
This counter should behave like your seq_dic, plus some convenient methods.
For instance, if I want to print the 10 most frequent sequences, together with their counts:
print(*seq_counter.most_common(10), sep="\n")

How to convert fixed size dimension to unlimited in a netcdf file

I'm downloading daily 600MB netcdf-4 files that have this structure:
netcdf myfile {
dimensions:
        time_counter = 18 ;
        depth = 50 ;
        latitude = 361 ;
        longitude = 601 ;
variables:
        salinity
        temp, etc
I'm looking for a better way to convert the time_counter dimension from a fixed size (18) to an unlimited dimension.
I found a way of doing it with the netcdf commands and sed. Like this:
ncdump myfile.nc | sed -e "s#^.time_counter = 18 ;#time_counter = UNLIMITED ; // (currently 18)#" | ncgen -o myfileunlimited.nc
which worked for me for small files, but when dumping a 600 MB netcdf file it takes too much memory and time.
Does somebody know another method for accomplishing this?
Your answers are very insightful. I'm not really looking for a way to improve this ncdump-sed-ncgen method; I know that dumping a 600 MB netcdf file uses almost 5 times more space in a text file (CDL representation), and modifying some header text and then generating the netcdf file again doesn't feel very efficient.
I read the latest NCO commands documentation and found an option specific to ncks, "--mk_rec_dmn". ncks mainly extracts and writes or appends data to a new netcdf file, so this seems the better approach: extract all the data of myfile.nc and write it with a new record dimension (unlimited dimension), which is what "--mk_rec_dmn" does, then replace the old file.
ncks --mk_rec_dmn time_counter myfile.nc -o myfileunlimited.nc ; mv myfileunlimited.nc myfile.nc
The opposite operation (record dimension to fixed-size) would be:
ncks --fix_rec_dmn time_counter myfile.nc -o myfilefixedsize.nc ; mv myfilefixedsize.nc myfile.nc
The shell pipeline can only be marginally improved by making the sed step only modify the beginning of the file and pass everything else through, but the expression you have is very cheap to process and will not make a dent in the time spent.
The core problem is likely that you're spending a lot of time in ncdump formatting the file information into textual data, and in ncgen parsing textual data into a NetCDF file format again.
As the route through dump+gen is about as slow as it is shown, that leaves using NetCDF functionality to do the conversion of your data files.
If you're lucky, there may be tools that operate directly on your data files to do changes or conversions. If not, you may have to write them yourself with the NetCDF libraries.
If you're extremely unlucky, you may have to drop down to the HDF5 level: NetCDF-4 files are HDF5 files with some extra metadata. In particular, the length of the dimensions is stored in the _netcdf_dim_info dataset in group _netCDF (or so the documentation tells me).
It may be possible to modify the information there to turn the current length of the time_counter dimension into the value for UNLIMITED (which is the number 0), but if you do this, you really need to verify the integrity of the resulting file, as the documentation neatly puts it:
"Note that modifying these files with HDF5 will almost certainly make them unreadable to netCDF-4."
As a side note, if this process is important to your group, it may be worth looking into what hardware could do the task faster. On my Bulldozer system, the process of converting a 78 megabyte file takes 20 seconds, using around 500 MB memory for ncgen working set (1 GB virtual) and 12 MB memory for ncdump working set (111 MB virtual), each task taking up the better part of a core.
Any decent disk should read/sink your files in 10 seconds or so, memory doesn't matter as long as you don't swap, so CPU is probably your primary concern if you take the dump+gen route.
If concurrent memory use is a big concern, you can trade disk space for memory by saving the intermediate result from sed onto disk, which will likely take up to 1.5 gigabytes or so.
You can use the xarray python package's to_netcdf() method, and optimise memory usage by using Dask.
You just need to pass the names of the dimensions to make unlimited to the unlimited_dims argument and use chunks to split the data. For instance:
import xarray as xr
ds = xr.open_dataset('myfile.nc', chunks={'time_counter': 18})
ds.to_netcdf('myfileunlimited.nc', unlimited_dims={'time_counter':True})
There is a nice summary of combining Dask and xarray linked here.

why does an iterable object have no length in Python?

I think I am constantly improving my previous question. Basically, I need to chunk up a large text (csv) file to send the pieces to a multiprocess.Pool. To do so, I think I need an iterable object where the lines can be iterated over.
(see how to multiprocess large text files in python?)
Now I realize that the file object itself (or an _io.TextIOWrapper type) you get after opening a text file is iterable line by line, so perhaps my chunking code (now below, sorry for missing it previously) could chunk it, if it could get its length? But if it's iterable, why can't I simply ask for its length (in lines, not bytes)?
Thanks!
import itertools

def chunks(l, n):
    """Yield successive chunks of `n` items from the iterable `l`"""
    l_c = iter(l)
    while 1:
        x = tuple(itertools.islice(l_c, n))
        if not x:
            return
        yield x
The reason files are iterable is that they are read serially. The length of a file in lines can't be calculated unless the whole file is processed. (The file's length in bytes is no indicator of how many lines it has.)
The problem is that, if the file were gigabytes long, you might not want to read it twice if it can be helped.
That is why it is better not to know the length; that is why one should deal with data files as an iterable rather than a collection/vector/array that has a length.
Your chunking code should be able to deal directly with the file object itself, without knowing its length.
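A hedged usage sketch: pass the open file object straight to chunks(); islice() simply stops when the file is exhausted, so no line count is needed (process() is a hypothetical per-chunk worker, not part of the question):

with open('large.csv') as f:
    for chunk in chunks(f, 10000):   # tuples of up to 10,000 lines each
        process(chunk)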
However, if you want to know the number of lines before processing fully, your two options are:
buffer the whole file into a list of lines first, then pass those lines to your chunker;
read the file twice over, the first time discarding all the data and just counting the lines (see the sketch below).
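A minimal sketch of the second option, a throwaway pass that only counts lines:

with open('large.csv') as f:
    n_lines = sum(1 for _ in f)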
