MemoryError with large data frame in deep learning - python-3.x

Preamble
Hi all,
I'm trying to build a geometric deep learning model using the StellarGraph package. With a smaller data set it works well, but unfortunately it doesn't scale to a larger one. Information on the machine, environment, data used, and resulting error is presented below.
Machine specification:
CPU: Intel core i5-8350U
RAM: 8GB DDR4
SWAP: 4 GB + 4 GB (Divided into two swapfiles in different SSD)
SSD: 250 GB + 250 GB (2280 and 2242 NVMe)
Environment:
Linux 5.3.11_1 64-bit
Python 3.6.9
Data used (sizes in bytes, acquired from sys.getsizeof()):
Sparse block diagonal matrix (shape: 158,950 x 158,950; size: 56)
Dense feature matrix (shape: 158,950 x 14,450; size: 9,537,152)
Modules:
networkx 2.3
numpy 1.15.4
pandas 0.25.3
scipy 1.1.0
scikit-learn 0.21.3
stellargraph 0.8.2
tensorflow 1.14.0
Problem description
I aim to build a geometric deep learning model to categorize subjects based on adjacency matrices acquired from resting-state functional MRI. Each adjacency matrix covers 55 regions of interest, giving a 55x55 matrix per subject. For the deep learning model I used the spectral graph convolutional network from StellarGraph, which takes a graph object and node features as its input. I created the graph object from a sparse block-diagonal matrix obtained by combining the adjacency matrices of all subjects. The node features (5 characteristic values per node) are likewise arranged into a dense block-diagonal matrix.
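Roughly, the construction looks like this (a toy-scale sketch with made-up values, not my actual pipeline):
import numpy as np
import scipy.sparse as sp

# Toy-scale sketch: three subjects, each with a 55x55 adjacency matrix
# and a 55x5 node-feature block, combined block-diagonally.
adjacencies = [sp.random(55, 55, density=0.2, format='csr') for _ in range(3)]
features = [np.random.rand(55, 5) for _ in range(3)]

combined_adjacency = sp.block_diag(adjacencies, format='csr')  # (165, 165) here
combined_features = sp.block_diag(features, format='csr')      # (165, 15) here

print(combined_adjacency.shape, combined_features.shape)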
Previously, I built the model using a subset of the sample (around 170 subjects). It ran perfectly, and I thought I'd be able to do the same with the larger data set. Unfortunately, using the same code I get a MemoryError when constructing the StellarGraph object. The code and error are presented in the following section.
Code and error
# Imports
import scipy.io as sio
import pandas as pd
import networkx as nx
from stellargraph import StellarGraph
from stellargraph.mapper import FullBatchNodeGenerator

# Data parsing
data = sio.mmread('_data/sparse.mtx')
feature = sio.mmread('_data/sparse-feature.mtx')
feature = pd.DataFrame.sparse.from_spmatrix(feature)
# Create graph object
g = nx.from_scipy_sparse_matrix(data)
# Create StellarGraph object and its generator
gs = StellarGraph(g, node_features=feature) # MemoryError
generator = FullBatchNodeGenerator(gs)
I'm sorry for not providing the sparse.mtx and sparse-feature.mtx files, for confidentiality reasons, but I hope the description of the data shape and size above helps you understand their structure. Using the above code, Python gives me the following error:
>>> gs = StellarGraph(g, node_features=feature) # MemoryError
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/lam/.local/lib/python3.6/site-packages/stellargraph/core/graph.py", line 786, in __init__
super().__init__(incoming_graph_data, **attr)
File "/home/lam/.local/lib/python3.6/site-packages/stellargraph/core/graph.py", line 381, in __init__
node_features, type_for_node, node_types, dtype
File "/home/lam/.local/lib/python3.6/site-packages/stellargraph/core/graph.py", line 216, in _convert_from_node_data
{node_type: data}, node_type_map, node_types, dtype
File "/home/lam/.local/lib/python3.6/site-packages/stellargraph/core/graph.py", line 182, in _convert_from_node_data
data_arr = arr.values.astype(dtype)
File "/home/lam/.local/lib/python3.6/site-packages/pandas/core/generic.py", line 5443, in values
return self._data.as_array(transpose=self._AXIS_REVERSED)
File "/home/lam/.local/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 822, in as_array
arr = mgr._interleave()
File "/home/lam/.local/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 840, in _interleave
result = np.empty(self.shape, dtype=dtype)
MemoryError
While monitoring memory consumption, I observed that RAM usage only reached about 55% of total capacity, and swap was not used at all. While running the code I only used a TTY + tmux session with vim, top, and Python running, and I made sure no other memory-hogging processes were running in the background. So I'm fairly certain the memory bottleneck comes from the Python process itself.
What I have tried
To reduce memory consumption, I tried using dask to manage the dense feature data frame. Unfortunately, StellarGraph only accepts a pandas array, pandas data frame, dictionary, tuple, or other iterable as its input.
Other than dask, I also tried passing a sparse matrix directly (since almost 80% of my data set is zero-valued anyway). However, this gave me a TypeError, since StellarGraph cannot take a sparse matrix as its node_features.
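For reference, the sparse attempt was essentially this (a sketch, not my exact code):
feature_sparse = sio.mmread('_data/sparse-feature.mtx')  # scipy sparse (COO) matrix
gs = StellarGraph(g, node_features=feature_sparse)       # TypeError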
I've also read several suggestions for managing large data sets, which mostly boil down to parsing the data into the Python session iteratively. However, I couldn't find any documentation in StellarGraph for such a method.
The other option would be to use a computer with better hardware, which to my regret I can't do due to limited funding. I'm a student and can't afford a better machine for now.
Potential solutions
Upgrading the RAM. I'll try salvaging RAM from other computers, but the maximum I currently have access to is 16 GB, and I'm not sure it will be enough.
Using a smaller chunk of the feature data set. I did manage to get by with this, but the model's accuracy was really bad (around 50%).
Questions
Why does Python only use 55% of my total RAM, without dynamic swap allocation?
How should I effectively manage large data frame?
How do I handle MemoryError when creating a StellarGraph object?
How much RAM do I actually need? Would 32GB suffice?

Python itself works fine; it's an implementation problem in StellarGraph.
I don't think StellarGraph supports such huge matrices so far.
File "/home/lam/.local/lib/python3.6/site-packages/stellargraph/core/graph.py", line 182, in _convert_from_node_data
data_arr = arr.values.astype(dtype)
From the beginning of your code up to this point, all of your data is stored as sparse arrays, which doesn't take much memory. Here, arr is a DataFrame whose columns are pandas.SparseArray. This line converts the data structure to a regular numpy array, which blows up the memory usage.
import numpy as np
a = np.empty((158950,14450),float)
print(a.nbytes/2**30)
17.112698405981064
An empty numpy array of that shape actually takes 17 GB of memory. On my 16 GB computer I can initialize three arrays like that one (most likely because the pages aren't actually committed until they're written), but I get a MemoryError if I try to create more than three. And I can't initialize a 158,950 x 158,950 numpy array at all.
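You can reproduce the effect at a smaller scale; this is just an illustration of what that line does, with toy shapes and dtype:
import pandas as pd
import scipy.sparse as sp

# Toy-scale version of the conversion in graph.py line 182: sparse in, dense out.
spm = sp.random(1000, 500, density=0.01, format='csr')
feature = pd.DataFrame.sparse.from_spmatrix(spm)

print(feature.memory_usage(deep=True).sum())  # small: only the non-zeros are stored

dense = feature.values.astype('float64')      # densifies the whole frame
print(dense.nbytes)                           # 1000 * 500 * 8 bytes, zeros included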

Related

NetCDF uses twice the memory when reading part of data. Why? How to rectify it?

I have 2 fairly large datasets (~47GB each) stored in netCDF files. The datasets have three dimensions: time, s, and s1. The first dataset is of shape (3000,2088,1000) and the second is of shape (1566,160000,25). Both datasets are equal in size; the only difference is their shape. Since my RAM size is only 32GB, I am accessing the data in blocks.
For the first dataset, when I read the first ~12GB chunk of data, the code uses almost twice that amount of memory, whereas for the second it uses just the size of the chunk (~12GB). Why is this happening, and how do I stop the code from using more than necessary?
Not using more memory than necessary is very important for my code, because my algorithm's efficiency hinges on every line of code using just enough memory and no more. Also, because of this weird behaviour, my system starts swapping like crazy. I'm on a Linux system, if that information is useful, running Python 3.7.3 with netCDF 4.6.2.
This is how I am accessing the datasets:
from netCDF4 import Dataset
dat = Dataset('dataset1.nc')
dat1 = Dataset('dataset2.nc')
chunk1 = dat.variables['data'][0:750] #~12GB worth of data uses ~24GB RAM memory
chunk2 = dat1.variables['data'][0:392] #~12GB worth of data uses ~12GB RAM memory
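For what it's worth, this is roughly how I check the overhead (psutil here is just for illustration; I watch the resident memory of the process around the read):
import psutil
from netCDF4 import Dataset

proc = psutil.Process()

dat = Dataset('dataset1.nc')
rss_before = proc.memory_info().rss
chunk1 = dat.variables['data'][0:750]                       # ~12GB slice
rss_after = proc.memory_info().rss

print((rss_after - rss_before) / 2**30, 'GiB held')         # ~24 for dataset1, as described above
print(chunk1.nbytes / 2**30, 'GiB actually in the slice')   # ~12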

Why the Python memory error using shutil.copyfileobj?

I created an in-memory file and then tried to save it as a file:
import shutil
import pandas as pd
from io import StringIO

# various calculations

with open(s_outfile, "w") as outfile:
    # make a header row
    outfile.write('npi,NPImatched,lookalike_npi,domain,dist,rank\n')

stream_out = StringIO()
for i in big_iterator:
    # more calculations, creating dataframe df_info
    df_info.to_csv(stream_out, index=False, header=False)

with open(s_outfile, 'a', newline='\n') as file:
    stream_out.seek(0)
    shutil.copyfileobj(stream_out, file)
stream_out.close()
The point of writing to the StringIO object inside the loop was to speed up df_info.to_csv(), which worked OK (though less dramatically than I expected). But when I tried to copy the in-memory object to a file with shutil.copyfileobj(), I got a MemoryError with essentially no further information.
It's a large-ish situation; the loop runs about 1M times and the output data should have had a size of about 6GB. This was running on a GCP Linux compute instance with (I think) about 15GB RAM, although of course less than that (and perhaps less than the size of the in-memory data object) was free at the time.
But why would I get a memory error? Isn't shutil.copyfileobj() all about copying incrementally, using memory safely, and avoiding excessive memory consumption? I see now that it has an optional buffer size parameter, but as far as I can see, it defaults to something much smaller than the scale I'm working at with this data.
Would you expect the error to be avoided if I simply set the buffer size to something moderate like 64KB? Is my whole approach wrong-headed? It takes long enough to get the in-memory data established that I can't test things willy-nilly. Thanks in advance.
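Concretely, I mean something like this (a sketch; s_outfile and stream_out are as above):
import shutil

# Same copy as above, but with an explicit 64 KB buffer instead of the small default
with open(s_outfile, 'a', newline='\n') as file:
    stream_out.seek(0)
    shutil.copyfileobj(stream_out, file, length=64 * 1024)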

computer crashes when doing calculations on big datasets on python using Dask

I am unable to do calculations on large datasets using python-Dask. My computer crashes.
I have a computer with 4GB of RAM running Debian Linux. I am trying to load some files from a Kaggle competition (the Elo Merchant competition). When I try to load the data and get the shape of the dask dataframe, the computer crashes.
I am running the code on my laptop only. I chose dask because it can handle large datasets, and I would also like to know whether Dask can move computations to my hard disk if the data doesn't fit in memory. If so, do I need to activate anything, or does dask do it automatically? If I need to do it manually, how do I do it? A pointer to a tutorial on this would also be great.
I have a 250GB solid-state drive as my hard disk, so there would be space for a large dataset to fit on disk.
Please help me on this regard. My code is as below.
Thank you
Michael
import dask.dataframe as dd
from dask.distributed import Client
from sklearn.externals.joblib import parallel_backend
client = Client(processes=False)
merchant = dd.read_csv('/home/michael/Elo_Merchant/merchants.csv')
new_merchant_transactions = dd.read_csv('/home/michael/Elo_Merchant/new_merchant_transactions.csv')
historical_transactions = dd.read_csv('/home/michael/Elo_Merchant/historical_transactions.csv')
train = dd.read_csv('/home/michael/Elo_Merchant/train.csv')
test = dd.read_csv('/home/michael/Elo_Merchant/test.csv')
merchant.head()
merchant.compute().shape
merchant_headers = merchant.columns.values.tolist()
for c in range(len(merchant_headers)):
    print(merchant_headers[c])
    print('--------------------')
    print("{}".format(merchant[merchant_headers[c]].value_counts().compute()) + '\n')
    print("Number of NaN values {}".format(merchant[merchant_headers[c]].isnull().sum().compute()) + '\n')
historical_transactions.head()
historical_transactions.compute().shape #after computing for a few minutes computer restarts.
I expected the code to run, give me the shape of the dask dataframe, and then run the rest of the code (not shown here since it is not relevant).
I found a way to get it.
Here it is:
print("(%s,%s)" % (historical_transactions.index.count().compute(),len(historical_transactions.columns)))
The first output value is the rows and the second output value is the number of columns.
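Wrapped up as a small helper (the name dask_shape is just for readability):
def dask_shape(ddf):
    """Return (n_rows, n_columns) of a dask DataFrame without materializing the whole frame."""
    return (ddf.index.count().compute(), len(ddf.columns))

print(dask_shape(historical_transactions))
print(dask_shape(merchant))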
Thanks
Michael

Size on disk of a partly filled HDF5 dataset

I'm reading the book Python and HDF5 (O'Reilly) which has a section on empty datasets and the size they take on disk:
import numpy as np
import h5py
f = h5py.File("testfile.hdf5")
dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32)
f.flush()
# Size on disk is 1KB
dset[0:1024] = np.arange(1024)
f.flush()
# Size on disk is 4GB
After filling part (first 1024 entries) of the dataset with values, I expected the file to grow, but not to 4GB. It's essentially the same size as when I do:
dset[...] = np.arange(1024**3)
The book states that the file size on disk should be around 66KB. Could anyone explain what the reason is for the sudden size increase?
Version info:
Python 3.6.1 (OSX)
h5py 2.7.0
If you open your file in HDFView you can see that chunking is off. This means that the array is stored in one contiguous block in the file and cannot be resized, so all 4 GB must be allocated in the file.
If you create your data set with chunking enabled, the dataset is divided up into regularly-sized pieces which are stored haphazardly on disk, and indexed using a B-tree. In that case only the chunks that have (at least one element of) data are allocated on disk. If you create your dataset as follows the file will be much smaller:
dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32, chunks=True)
The chunks=True lets h5py determine the size of the chunks automatically. You can also set the chunk size explicitly; for example, to set it to 16384 floats (= 64 KiB), use:
dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32, chunks=(2**14,) )
The best chunk size depends on the reading and writing patterns of your applications. Note that:
Chunking has performance implications. It's recommended to keep the total size of your chunks between 10 KiB and 1 MiB, larger for larger datasets. Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk.
See http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage
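If you want to check this yourself without writing a 4 GB file, a scaled-down version of the same experiment (1024**2 elements instead of 1024**3) shows the difference:
import os
import numpy as np
import h5py

N = 1024**2  # scaled down so the test files stay small

with h5py.File("contiguous.hdf5", "w") as f:
    dset = f.create_dataset("big dataset", (N,), dtype=np.float32)  # contiguous layout
    dset[0:1024] = np.arange(1024)

with h5py.File("chunked.hdf5", "w") as f:
    dset = f.create_dataset("big dataset", (N,), dtype=np.float32, chunks=(2**14,))
    dset[0:1024] = np.arange(1024)

print(os.path.getsize("contiguous.hdf5"))  # ~4 MB: the whole array is allocated once any element is written
print(os.path.getsize("chunked.hdf5"))     # only tens of KB: just the touched chunk plus metadata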

memory efficient data structures in python

I have a large number of identical dictionaries (identically structured: same keys, different values), which leads to two different memory problems:
dictionaries over-allocate as they grow, so each dictionary could be using up to twice the memory it actually needs.
dictionaries need to record their labels, so each dictionary stores its own keys, which is a significant amount of memory.
What is a good way that I can share the labels (so each label is not stored in the object), and compress the memory?
I can offer the following solution to the problem, based on the recordclass library:
pip install recordclass
>>> import sys
>>> from recordclass import make_dataclass
For a given set of labels you create a class:
>>> DataCls = make_dataclass('DataCls', 'first second third')
>>> data = DataCls(first="red", second="green", third="blue")
>>> print(data)
DataCls(first="red", second="green", third="blue")
>>> print('Memory size:', sys.getsizeof(data), 'bytes')
Memory size: 40 bytes
It's fast and takes minimal memory; it's suitable for creating millions of instances.
The downside: it's a C extension and not in the standard library. But it's available on PyPI.
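To illustrate the memory point, compare it with a plain dict holding the same data (exact numbers vary by Python version):
import sys
from recordclass import make_dataclass

DataCls = make_dataclass('DataCls', 'first second third')

as_dict = {'first': 'red', 'second': 'green', 'third': 'blue'}
as_rec = DataCls(first='red', second='green', third='blue')

print(sys.getsizeof(as_dict))  # the dict carries its own keys: a couple hundred bytes
print(sys.getsizeof(as_rec))   # the instance stores only the values: ~40 bytes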
Addition: since recordclass 0.15 there is a fast_new option for faster instance creation:
>>> DataCls = make_dataclass('DataCls', 'first second third', fast_new=True)
If you don't need keyword arguments, instance creation is then about twice as fast.
P.S.: the author of the recordclass library is here.

Resources