Reading a large number of files (20k+) in Python: MemoryError - python-3.x

I am trying to read a large number of files (20k+) from my computer using Python, but I keep getting a MemoryError (details below), although I have 16 GB of RAM, of which 8 GB or more is free at all times, and the combined size of all the files is only about 270 MB. I have tried many different approaches, such as pandas read_csv(), reading in chunks with open(file_path).read(100), and several others, but I am still unable to read the files. I have to create a corpus of words after reading the files into the list. Below is my code so far. Any help will be highly appreciated.
import os
import pandas as pd

collectionPath = r"C:\Users\Asghar Nazir\OneDrive - Higher Education Commission\MSDS\S1\IR\assignment\ACL txt"
listOfFilesInCollection = os.listdir(collectionPath)

def wordList(file):
    list_of_words_from_file = []
    for line in file:
        for word in line.split():
            list_of_words_from_file.append(word)
    return list_of_words_from_file

list_of_file_word_Lists = {}
file = []
for file_name in listOfFilesInCollection:
    filePath = collectionPath + "\\" + file_name
    with open(filePath) as f:
        for line in f:
            file.append(line)
    list_of_file_word_Lists[file_name] = wordList(file)

print(list_of_file_word_Lists)
The error that I get:
Traceback (most recent call last):
  File "C:/Users/Asghar Nazir/PycharmProjects/pythonProject/IRAssignment1/init.py", line 25, in <module>
    list_of_file_word_Lists[file_name] = wordList(file)
  File "C:/Users/Asghar Nazir/PycharmProjects/pythonProject/IRAssignment1/init.py", line 14, in wordList
    list_of_words_from_file.append(word)
MemoryError

You probably want to move the file = [] to the beginning of the loop, because you're currently adding the lines of each new file you open without first removing the lines of all the previous files.
Then, there are very likely more efficient approaches depending on what you're trying to achieve. If the order of words doesn't matter, then using a dict or a collections.Counter instead of a list can help avoid duplication of identical strings. If neither the order nor the frequency of words matters, then using a set instead is going to be even better.
Finally, since it's likely you'll find most words in several files, try to store each of them only once in memory. That way, you'll be able to scale way higher than a mere 20k files: there's plenty of space in 16 GiB of RAM.
Keep in mind that Python has lots of fixed overheads and hidden costs: inefficient data structures can cost way more than you would expect.
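For illustration, here is a minimal sketch of that fix, building the word list per file instead of accumulating lines across files, and using collections.Counter to store each distinct word only once per file (the path comes from the question; the Counter-based structure and the variable names are assumptions about what the corpus needs):
import os
from collections import Counter

collectionPath = r"C:\Users\Asghar Nazir\OneDrive - Higher Education Commission\MSDS\S1\IR\assignment\ACL txt"

word_counts_per_file = {}
for file_name in os.listdir(collectionPath):
    file_path = os.path.join(collectionPath, file_name)
    with open(file_path) as f:
        # Read and split this file only; nothing accumulates across files.
        words = f.read().split()
    # Counter keeps one entry per distinct word, with its frequency.
    word_counts_per_file[file_name] = Counter(words)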

It is hard to tell why your memory problems arise without knowing the content of your files. Maybe it is enough to make your code more efficient. For example, the split() function can handle multiple lines itself, so you don't need a loop for that. And using a comprehension is usually a good idea in Python.
The following code should return what you want, and I don't see a reason why you should run out of memory using it. Besides that, Arkanosis' hint about the importance of data types is very valid. It depends on what you want to achieve with all those words.
from pathlib import Path

def word_list_from_file(path):
    with open(path, 'rt') as f:
        list_words = f.read().split()
    return list_words

path_dir = Path(r"C:\Users\Asghar Nazir\OneDrive - Higher Education Commission\MSDS\S1\IR\assignment\ACL txt")

dict_file_content = {
    str(path.name): word_list_from_file(path)
    for path in path_dir.rglob("*.txt")
}
P.S.: I'm not sure how the pathlib module behaves on Windows, but from what I read, this code snippet should be platform independent.

Related

How do I increase speed of XML retrieval and parsing in S3 Jupyter Notebook (SageMaker Studio)?

I am managing data for a computer vision project and am looking for a fast way to search and manipulate all the files in a given directory. I have a working solution but am only able to process maybe 10-20 files per second. I am new to Jupyter Notebooks so am looking for recommendations on increasing the efficiency of the attached code.
Current code is as follows:
car_count = 0
label_dict = {}
purge_list = []
for each_src in source_keys:
    pages = paginator.paginate(Bucket=src_bucket, Prefix=each_src)
    for page in pages:
        for obj in page['Contents']:
            fpath = obj['Key']
            fname = fpath.split('/')[-1]
            if fname == '':
                continue
            copy_source = {
                'Bucket': src_bucket,
                'Key': fpath
            }
            if fname.endswith('.xml'):
                obj = s3.Object(src_bucket, fpath)
                data = obj.get()['Body'].read()
                root = ET.fromstring(data)
                for box in root.findall('object'):
                    name = box.find('name').text
                    if name in label_dict:
                        label_dict[name] += 1
                    else:
                        label_dict[name] = 1
                    if name not in allowed_labels:
                        purge_list.append(fpath)
            print(f'Labels: {label_dict}', end='\r')

print(f'\nTotal Images files: {i}, Total XML files: {j}', end='\r')
#print(f'\nLabels: {label_dict}')
print(f'\nPURGE LIST: ({len(purge_list)} files)')
Possible solutions:
Multithreading - I have done threading in normal Python 3.x; is it common to multithread within a notebook?
Read less of the file - Currently I read in the whole file; I'm not sure if this is a large bog-down point, but reading less may increase speed.
Jupyter usually has a lot of overhead - and your code has three levels of nested for loops. In the Python world, the fewer the for loops the better - also, working with binary data is almost always faster. So, a number of suggestions:
restructure your for loops, and use some specialized lib from PyPI for file/FS handling
change language? use a bash script
multithreading is a way indeed (see the sketch after this list)
caching: use redis or other fast data structures to "read in" the data
golang is comparatively easy to jump to from Python and also has good multithreading support - my 2 cents: it's worth a try at least
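To make the multithreading suggestion concrete, here is a hedged sketch using concurrent.futures.ThreadPoolExecutor to fetch and parse the XML objects concurrently. It reuses s3, src_bucket, and allowed_labels from the question and assumes the XML keys have already been collected from the paginator into xml_keys; the helper name parse_xml_key and the worker count are illustrative, not part of the original code:
import xml.etree.ElementTree as ET
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def parse_xml_key(key):
    # Download one XML object and return its key plus the label names it contains.
    body = s3.Object(src_bucket, key).get()['Body'].read()
    root = ET.fromstring(body)
    return key, [box.find('name').text for box in root.findall('object')]

label_dict = Counter()
purge_list = []
# S3 GETs are network-bound, so threads help despite the GIL.
with ThreadPoolExecutor(max_workers=16) as pool:
    for key, names in pool.map(parse_xml_key, xml_keys):
        label_dict.update(names)
        if any(name not in allowed_labels for name in names):
            purge_list.append(key)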

List comprehension with multiple expressions and multiple conditions

Earlier, this code was written to iterate over a file of IP addresses using nested loops and the ipaddress module. Now I would like to optimize it using a list comprehension.
import ipaddress

tmp1 = open('Path\\Test-ip.txt', 'r+')
tmp2 = open('Path\\-ip.txt', 'w+')
tmp1 = tmp1.readlines()
for i in tmp1:
    i = "".join(i.split())
    i = ipaddress.ip_network(i, False)
    for j in tmp1:
        j = "".join(j.split())
        j = ipaddress.ip_network(j, False)
        if j != i:
            if ipaddress.IPv4Network(j).supernet_of(i):
                tmp2.write(str(i))
                tmp2.write('\n')
#Using List Comprehension
tmp1=open('Path\\Test-ip.txt', 'r+')
tmp2=open('Path\\Result-ip.txt', 'w+')
tmp1=tmp1.readlines()
tmp3=[ (("").join(i.split())) (("").join(j.split())) (ipaddress.ip_network(i)) (ipaddress.ip_network(j)) (tmp2.(write(str(i)))) for i in tmp1 for j in tmp1 if i!=j if ipaddress.IPv4Network(j).supernet_of(i)]
While a comprehension is generally faster than a for loop, that only holds if you are not doing slow things inside the loop.
And (again in general) file I/O is much slower than doing calculations.
Note that you should always measure before trying to optimize. The bottleneck might not be where you think it is.
Run your script with python -m cProfile -s tottime yourscript.py your arguments and observe the output. That will tell you where the program is spending its time.
On my website, I have a list of articles detailing the profiling of real-world Python programs. You can find the first in the list here. The others are referenced at the bottom of that page.
If file I/O turns out to be the bottleneck, using memory-mapped files (the mmap module) will probably improve things.
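For what it's worth, here is a hedged sketch of how the same logic could be written with the slow work kept out of the comprehension: parse every network once up front, let the comprehension do only the pairwise supernet test, and write the output in a single call at the end (the file names follow the question; skipping equal networks this way is an assumption about the intended behaviour):
import ipaddress

with open('Path\\Test-ip.txt') as src:
    # Parse each address exactly once, outside the comprehension.
    networks = [ipaddress.ip_network("".join(line.split()), False) for line in src]

# Only the pairwise check lives in the comprehension; no file I/O here.
subnets = [str(i) for i in networks for j in networks
           if i != j and j.supernet_of(i)]

with open('Path\\Result-ip.txt', 'w') as dst:
    dst.write("\n".join(subnets) + "\n")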

pandas.read_csv gives memory error despite comparatively small dimensions

I am trying to load this CSV file into a pandas data frame using
import pandas as pd
filename = '2016-2018_wave-IV.csv'
df = pd.read_csv(filename)
However, despite my PC being not super slow (8 GB RAM, 64-bit Python) and the file being somewhat but not extraordinarily large (< 33 MB), loading the file takes more than 10 minutes. It is my understanding that this shouldn't take nearly that long, and I would like to figure out what's behind it.
(As suggested in similar questions, I have tried using the chunksize and usecols parameters (EDIT: and also low_memory), yet without success; so I believe this is not a duplicate but has more to do with the file or the setup.)
Could someone give me a pointer? Many thanks. :)
I was testing the file which you shared, and the problem is that this CSV file has leading and trailing double quotes on every line (so pandas thinks that the whole line is one column). They have to be removed before processing, for example by using sed on Linux, by processing and re-saving the file in Python, or simply by replacing all the double quotes in a text editor.
To summarize and expand the answer by @Hubert Dudek:
The issue was with the file; not only did it include "s at the start of every line but also within the lines themselves. After I fixed the former, the latter caused the column attribution to be messed up.
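For reference, a minimal sketch of the in-Python clean-up described above: strip the outer double quotes from each line and hand the cleaned text to pandas via a StringIO buffer (the file name comes from the question; stripping only the outermost quotes is an assumption about how the file is malformed):
import io
import pandas as pd

filename = '2016-2018_wave-IV.csv'

with open(filename, encoding='utf-8') as f:
    # Drop the double quotes wrapping each line; quotes inside the line are left alone.
    cleaned = "\n".join(line.rstrip("\n").strip('"') for line in f)

df = pd.read_csv(io.StringIO(cleaned))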

What is a quick way to check if file contents are null?

I have a rather large file (32 GB) which is an image of an SD card, created using dd.
I suspected that the file is empty (i.e. filled with the null byte \x00) starting from a certain point.
I checked this using Python in the following way (where f is an open file handle with the cursor at the last position I could find data at):
for i in xrange(512):
    if set(f.read(64*1048576)) != set(['\x00']):
        print i
        break
This worked well (in fact it revealed some data at the very end of the image), but took >9 minutes.
Has anyone got a better way to do this? There must be a much faster way, I'm sure, but cannot think of one.
Looking at a guide about memory buffers in Python here, I suspected that the comparison itself was the issue. In most non-typed languages, memory copies are not very obvious despite being a killer for performance.
In this case, as Oded R. established, filling a buffer with readinto and comparing the result with a previously prepared null-filled buffer is much more efficient.
size = 512
data = bytearray(size)
cmp = bytearray(size)
And when reading:
f = open(FILENAME, 'rb')
f.readinto(data)
Two things that need to be taken into account are:
The size of the compared buffers should be equal, but comparing bigger buffers should be faster up to a point (I would expect memory fragmentation to be the main limit).
The last buffer may not be the same size; reading the file into the prepared buffer will keep the trailing zeroes where we want them.
Here the comparison of the two buffers will be quick, there will be no attempt to cast the bytes to strings (which we don't need), and since we reuse the same memory all the time, the garbage collector won't have much work either... :)
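Putting those pieces together, here is a minimal end-to-end sketch along the same lines; the chunk size, the function name first_nonzero_offset, and returning a byte offset are illustrative choices rather than part of the original answer:
def first_nonzero_offset(path, chunk_size=64 * 1024 * 1024):
    # Reuse one read buffer and compare it against a pre-built zero buffer,
    # so no per-chunk objects are created and no bytes are cast to strings.
    data = bytearray(chunk_size)
    zeroes = bytearray(chunk_size)
    offset = 0
    with open(path, 'rb') as f:
        while True:
            n = f.readinto(data)
            if n == 0:
                return None  # reached EOF without finding non-zero data
            # A short final read leaves trailing zeroes in data, so the
            # full-buffer comparison still behaves correctly.
            if data != zeroes:
                return offset
            offset += n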

Fastest way to read a fastq with scikit-bio

I am trying to read a fastq formatted text file using scikit-bio.
Given that it is a fairly large file, performing operations is quite slow.
Ultimately, I am attempting to dereplicate the fastq file into a dictionary:
f = 'Undetermined_S0_L001_I1_001.fastq'
seqs = skbio.io.read(f, format='fastq')
seq_dic = {}
for seq in seqs:
    seq = str(seq)
    if seq in seq_dic.keys():
        seq_dic[seq] += 1
    else:
        seq_dic[seq] = 1
Most of the time here is spent reading the file:
%%time
f = 'Undetermined_S0_L001_I1_001.fastq'
seqs = skbio.io.read(f, format='fastq')
for seq in itertools.islice(seqs, 100000):
    seq
CPU times: user 46.2 s, sys: 334 ms, total: 46.5 s
Wall time: 47.8 s
My understanding is that not verifying the sequences would improve the run time; however, that does not appear to be the case:
%%time
f = 'Undetermined_S0_L001_I1_001.fastq'
seqs = skbio.io.read(f, format='fastq', verify=False, variant='illumina1.8')
for seq in itertools.islice(seqs, 100000):
    seq
CPU times: user 47 s, sys: 369 ms, total: 47.4 s
Wall time: 48.9 s
So my questions are: first, why isn't verify=False improving the run time, and second, is there a faster way to read sequences using scikit-bio?
first why isn't verify=False improving run time
verify=False is a parameter accepted by scikit-bio's I/O API generally. It is not specific to a particular file format. verify=False tells scikit-bio to not invoke the file format's sniffer to double-check that the file is in the format specified by the user. From the docs [1]:
verify : bool, optional
When True, will double check the format if provided.
So verify=False doesn't turn off sequence data validation; it turns off file format sniffer verification. You will have minimal performance gains with verify=False.
seqs = skbio.io.read(f, format='fastq', verify=False, variant='illumina1.8') will produce a generator of skbio.Sequence objects. Sequence alphabet validation is not performed because skbio.Sequence does not have an alphabet, so that isn't where your performance bottleneck is. Note that if you want to read your FASTQ file into a specific type of biological sequence (DNA, RNA, or Protein), you can pass constructor=skbio.DNA (for example). This will perform alphabet validation for the relevant sequence type, and that currently cannot be toggled off when reading. Since you're having performance issues, I don't recommend passing constructor as that will only slow things down even more.
and second is there a faster way using scikit-bio to read sequences?
There isn't a faster way to read FASTQ files with scikit-bio. There is an issue [2] exploring ideas that could speed this up, but those ideas haven't been implemented.
scikit-bio is slow at reading FASTQ files because it supports reading sequence data and quality scores that may span multiple lines. This complicates the reading logic and has a performance hit. FASTQ files with multi-line data are not common anymore however; Illumina used to produce these files but they now prefer/recommend writing FASTQ records that are exactly four lines (sequence header, sequence data, quality header, quality scores). In fact, this is how scikit-bio writes FASTQ data. With this simpler record format, it is much faster and easier to read a FASTQ file. scikit-bio is also slow at reading FASTQ files because it decodes and validates the quality scores. It also stores sequence data and quality scores in a skbio.Sequence object, which has performance overhead.
In your case, you don't need the quality scores decoded, and you likely have a FASTQ file with simple four-line records. Here's a Python 3 compatible generator that reads a FASTQ file and yields sequence data as Python strings:
import itertools

def read_fastq_seqs(filepath):
    with open(filepath, 'r') as fh:
        for seq_header, seq, qual_header, qual in itertools.zip_longest(*[fh] * 4):
            if any(line is None for line in (seq_header, seq, qual_header, qual)):
                raise Exception(
                    "Number of lines in FASTQ file must be multiple of four "
                    "(i.e., each record must be exactly four lines long).")
            if not seq_header.startswith('@'):
                raise Exception("Invalid FASTQ sequence header: %r" % seq_header)
            if qual_header != '+\n':
                raise Exception("Invalid FASTQ quality header: %r" % qual_header)
            if qual == '\n':
                raise Exception("FASTQ record is missing quality scores.")
            yield seq.rstrip('\n')
You can remove the validation checks here if you're certain your file is a valid FASTQ file containing records that are exactly four lines long.
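As a usage sketch, the generator above plugs straight into the dereplication from the question; using collections.Counter here is an illustrative choice that does the same bookkeeping as seq_dic:
from collections import Counter

# Count how many times each read occurs in the file.
seq_counts = Counter(read_fastq_seqs('Undetermined_S0_L001_I1_001.fastq'))
print(seq_counts.most_common(5))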
This isn't related to your question, but I wanted to note that you may have a bug in your counter logic. When you see a sequence for the first time, your counter is set to zero instead of 1. I think the logic should be:
if seq in seq_dic:  # calling .keys() is not necessary
    seq_dic[seq] += 1
else:
    seq_dic[seq] = 1
[1] http://scikit-bio.org/docs/latest/generated/skbio.io.registry.read.html
[2] https://github.com/biocore/scikit-bio/issues/907
If you want to count the number of occurrences of each unique sequence in a fastq file, I would suggest trying out the Bank parser of pyGATB, together with the convenient Counter object from the collections module of the standard library.
#!/usr/bin/env python3
from collections import Counter
from gatb import Bank
# (gunzipping is done transparently)
seq_counter = Counter(seq.sequence for seq in Bank("SRR077487_2.filt.fastq.gz"))
This is quite efficient for a python module (according to my benchmarks in Bioinformatics SE: https://bioinformatics.stackexchange.com/a/380/292).
This counter should behave like your seq_dic, plus some convenient methods.
For instance, if I want to print the 10 most frequent sequences, together with their counts:
print(*seq_counter.most_common(10), sep="\n")
