Best way to save many tensors of different shapes? - io

I would like to store thousands to millions of tensors with different shapes to disk. The goal is to use them as a time series dataset. The dataset will probably not fit into memory and I will have to load samples or ranges of samples from disk.
What is the best way to accomplish this while keeping storage and access time low?

The easiest way to save anything to disk is with pickle:
import pickle
import torch

a = torch.rand(3, 4, 5)

# save
with open('filename.pickle', 'wb') as handle:
    pickle.dump(a, handle)

# open
with open('filename.pickle', 'rb') as handle:
    b = pickle.load(handle)
You can also save tensors with PyTorch directly, but torch.save is just a thin wrapper around pickle.
import torch
x = torch.tensor([0, 1, 2, 3, 4])
torch.save(x, 'tensor.pt')
If you want to save multiple tensors in one file, you can wrap them in a dictionary:
import torch
x = torch.tensor([0, 1, 2, 3, 4])
a = torch.rand(2,3,4,5)
b = torch.zeros(37)
torch.save({"a": a, "b":b, "x", x}, 'tensors.pt')

h5py lets you store many tensors in the same file without having to fit the entire file contents into memory: datasets are written directly to disk and loaded only when you ask for them. It also supports slicing at both read and write time, so you can load or save a slice of a tensor without ever bringing the whole tensor into memory.
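A minimal sketch of that workflow (assuming the tensors are converted to NumPy arrays first, since HDF5 stores plain arrays rather than torch tensors):
import h5py
import torch

a = torch.rand(1000, 3, 32)

# write: each tensor becomes a named dataset inside a single HDF5 file
with h5py.File('tensors.h5', 'w') as f:
    f.create_dataset('a', data=a.numpy())

# read: slice the dataset on disk; only rows 10..19 are actually loaded
with h5py.File('tensors.h5', 'r') as f:
    chunk = torch.from_numpy(f['a'][10:20])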

Related

How can I work on a large dataset without having to use Pyspark?

I'm trying to work on a dataset with 510,000 rows and 636 columns. I loaded it into a dataframe using the dask dataframe method, but the entries can't be displayed. When I try to get the shape, it results in delays. Is there a way for me to analyze the whole dataset without using big data technologies like Pyspark?
from dask import dataframe
import requests
import zipfile
import os
import pandas as pd

if not os.path.exists('pisa2012.zip'):
    r = requests.get('https://s3.amazonaws.com/udacity-hosted-downloads/ud507/pisa2012.csv.zip', allow_redirects=True)
    open('pisa2012.zip', 'wb').write(r.content)

if not os.path.exists('pisa2012.csv'):
    with zipfile.ZipFile('pisa2012.zip', 'r') as zip_ref:
        zip_ref.extractall('./')

df_pisa = dataframe.read_csv('pisa2012.csv')
df_pisa.shape  # Output: (Delayed('int-e9d8366d-1b9e-4f8e-a83a-1d4cac510621'), 636)
Firstly, Spark, Dask and Vaex are all "big data" technologies.
"it results in delays"
If you read the documentation, you will see that Dask is lazy and only performs operations on demand; you have to explicitly ask for the result. The reason is that just getting the shape requires reading all the data, but the data will not be held in memory. That is the whole point, and it is the feature that lets you work with bigger-than-memory data (otherwise, just use pandas).
This works (the row count is the only lazy part; the column count is already a concrete integer):
df_pisa.shape[0].compute()
But better, figure out what you actually want to do with the data; I assume you are not just after the shape. You can pass multiple operations/delayed objects to a single dask.compute() call so they are evaluated together, without repeating expensive tasks like reading and parsing the file.
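For example, a rough sketch of computing several results in one go (the column name is just a placeholder):
import dask
from dask import dataframe

df_pisa = dataframe.read_csv('pisa2012.csv')

# both results are evaluated together in a single pass over the file
nrows, mean_val = dask.compute(df_pisa.shape[0], df_pisa['some_numeric_column'].mean())
print(nrows, mean_val)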

How to load pickle files with a PyTorch dataloader in a memory-efficient way?

I currently load data with torch.load() because it is saved as a pickle. Pickle can only load everything into memory at once. The dimensions of the data are [2000, 3, 32, 32].
Can I write a dataloader that loads the data incrementally? I have limited CPU memory, and loading it all at once would be too much.
I give an example:
import torch
from torch.utils.data import DataLoader

data = torch.load('clean_data.pkl')
test_loader = DataLoader(data, batch_size=32, shuffle=True)

result = []
for img, label in test_loader:
    # do something
    result.append([img.cuda()])
torch.save(result, 'result.pt')
Well, when I write a data loader, I also need to use torch.load. By my understanding, the data loader would also open the pickle file all at once, right? So I would have no memory advantage.
What can I do to load just one file or batch after another, instead of the whole pickle at once?
I have found a similar thread, here: https://discuss.pytorch.org/t/loading-pickle-files-with-pytorch-dataloader/129405
https://localcoder.org/how-to-load-pickle-file-in-chunks
How does one create a data set in pytorch and save it into a file to later be used?
I am grateful for any help. Thanks

Can we share memory between workers in a Pytorch DataLoader?

My dataset depends on a 3GB tensor. This tensor could be either on the CPU or the GPU. The bottleneck of my code is the data loading and preprocessing, but I can't add more than a few workers without exhausting my RAM.
This seems silly to me: why does each worker receive a copy of the 3GB tensor when it is exactly the same across all workers?
Is there any way to let the workers access a single shared copy of this tensor?
Thanks,
The PyTorch documentation explicitly mentions this issue of DataLoader workers duplicating the underlying dataset (at least on Windows and macOS, as I understand it).
In general, you should not eagerly load your whole dataset into memory, precisely because of this issue. The dataset should be lazily loaded, i.e. samples should only be loaded when they are accessed in the __getitem__ method.
If your whole dataset is stored on disk as a monolithic tensor, you could fragment it into individual samples and save them into a folder, for instance.
You could then define your dataset as:
import torch
from torch.utils.data import Dataset, DataLoader
from glob import glob
from os.path import abspath

class MyDataset(Dataset):
    def __init__(self, folder: str):
        # Retrieve all tensor file names
        folder = abspath(folder)
        self.files = glob(f"{folder}/*.pt")

    def __getitem__(self, index: int):
        # Load tensors on demand
        return torch.load(self.files[index])

    def __len__(self) -> int:
        return len(self.files)
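The fragmentation step itself can be a one-off script along these lines (the input file name and output folder are placeholders):
import os
import torch

data = torch.load('monolithic_dataset.pt')  # hypothetical monolithic tensor, shape (num_samples, ...)

os.makedirs('samples', exist_ok=True)
# write each sample to its own .pt file so MyDataset above can load them lazily
for i, sample in enumerate(data):
    torch.save(sample, f"samples/{i:06d}.pt")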
Another solution is to memory-map the dataset. This is what HuggingFace Datasets does for huge datasets; take a look at their documentation. Memory-mapping avoids loading the whole dataset into RAM and also allows it to be shared across multiple processes without copies.
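A minimal sketch of the memory-mapping idea with numpy.memmap (shape, dtype and file name are placeholders):
import numpy as np
import torch

# create the disk-backed array once and fill it
mm = np.memmap('features.dat', dtype='float32', mode='w+', shape=(100_000, 3, 32, 32))
mm[0] = np.random.rand(3, 32, 32)
mm.flush()

# in each worker: open read-only; pages are loaded on access and the
# OS page cache is shared between processes, so nothing is duplicated
mm = np.memmap('features.dat', dtype='float32', mode='r', shape=(100_000, 3, 32, 32))
sample = torch.from_numpy(np.array(mm[0]))  # copy a single sample into a tensor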
Ray may be an interesting option for you. Check out ray training datasets!
Additionally, you can use
data_id = ray.put(data)
to place your data in Ray's shared object store, and
data = ray.get(data_id)
to retrieve the same data elsewhere without copying it between functions.
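A rough sketch of that pattern (assuming Ray is installed; the array below stands in for the 3GB tensor):
import numpy as np
import ray

ray.init()

# put the large array into Ray's shared object store once
big_array = np.random.rand(1000, 1000).astype(np.float32)
ref = ray.put(big_array)

@ray.remote
def use_shared(arr, row):
    # the argument is de-referenced from the store; numpy arrays are read zero-copy
    return float(arr[row].sum())

# pass the object reference to the tasks instead of the data itself
results = ray.get([use_shared.remote(ref, i) for i in range(4)])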

Flow a huge number of images from memory to a Keras generator

I am trying to train a Keras model with a very large number of images and labels. I want to use model.fit_generator and somehow flow the input images and labels from memory, because we prepare all the data in memory after each image is loaded. The problem is that we have plenty of large images that we clip into smaller patches and feed to the model. We need a for loop inside a while loop.
Something like this:
while True:
    for file in files:  # let's say that there are 500 files (images)
        image = ReadImage(file)
        X = prepareImage(image)  # here it is cut and prepared into a specific shape
        Y = labels
        yield X[batch_start:batch_end], Y[batch_start:batch_end]
After it yields the last batch for the first image, we need to load the next image in the for loop, prepare the data and yield again within the same epoch. For the second epoch we need all the images again. The problem is that we prepare everything in memory: from one image we create millions of training samples, and then move on to the next image. We cannot write all the data to disk and use flow_from_directory, since that would require far too much disk space.
Any hint?
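A minimal sketch of such a generator, yielding fixed-size batches from each prepared image before moving on to the next one (ReadImage, prepareImage and labels are the hypothetical helpers from the question):
def patch_generator(files, labels, batch_size):
    # loop forever so Keras can keep drawing batches across epochs
    while True:
        for file in files:
            image = ReadImage(file)   # hypothetical loader
            X = prepareImage(image)   # patches clipped from this one image
            Y = labels
            # yield this image's patches batch by batch within the same epoch
            for start in range(0, len(X), batch_size):
                yield X[start:start + batch_size], Y[start:start + batch_size]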

MemoryError with large data frame in deep learning

Preamble
Hi all,
I'm trying to make a geometric deep learning model using the StellarGraph package. With a smaller data set it works well, but unfortunately it does not scale to a larger data set. Information on the machine, environment, data used and the resulting error is presented below.
Machine specification:
CPU: Intel core i5-8350U
RAM: 8GB DDR4
SWAP: 4 GB + 4 GB (Divided into two swapfiles in different SSD)
SSD: 250 GB + 250 GB (2280 and 2242 NVMe)
Environment:
Linux 5.3.11_1 64-bit
Python 3.6.9
Used data (size acquired from sys.getsizeof()):
Sparse block diagonal matrix (shape: 158,950 x 158,950; size: 56)
Dense feature matrix (shape: 158,950 x 14,450; size: 9,537,152)
Modules:
networkx 2.3
numpy 1.15.4
pandas 0.25.3
scipy 1.1.0
scikit-learn 0.21.3
stellargraph 0.8.2
tensorflow 1.14.0
Problem description
I aim to create a geometric deep learning model to categorize subjects based on adjacency matrices acquired from resting-state functional MRI. The adjacency matrix assumes 55 regions of interest, resulting in 55x55 matrices for all subjects. In constructing the deep learning model, I used the spectral graph convolutional network model from StellarGraph, which takes a graph object and nodal features as its input. I created the graph object from a sparse block diagonal matrix obtained by combining the adjacency matrices of all subjects. The nodal features are the characteristics of each node (one node has 5 characteristic values), constructed into a dense block diagonal matrix.
Previously, I made the model using a subset of the population sample (around 170 subjects). It ran perfectly, and I thought I'd be able to do the same using the larger data set. Unfortunately, using the same code I got a MemoryError when constructing the StellarGraph object. Code and error are presented in the following section.
Code and error
# Data parsing with scipy.io as sio and pandas as pd
import scipy.io as sio
import pandas as pd
import networkx as nx
from stellargraph import StellarGraph
from stellargraph.mapper import FullBatchNodeGenerator

data = sio.mmread('_data/sparse.mtx')
feature = sio.mmread('_data/sparse-feature.mtx')
feature = pd.DataFrame.sparse.from_spmatrix(feature)

# Create graph object using networkx as nx
g = nx.from_scipy_sparse_matrix(data)

# Create StellarGraph object and its generator
gs = StellarGraph(g, node_features=feature)  # MemoryError
generator = FullBatchNodeGenerator(gs)
I'm sorry for not providing the sparse.mtx and sparse-feature.mtx files, for confidentiality reasons, but I hope the description of data shape and size above helps you understand their structure. Using the above code, Python gave me the following error:
>>> gs = StellarGraph(g, node_features=feature) # MemoryError
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/lam/.local/lib/python3.6/site-packages/stellargraph/core/graph.py", line 786, in __init__
super().__init__(incoming_graph_data, **attr)
File "/home/lam/.local/lib/python3.6/site-packages/stellargraph/core/graph.py", line 381, in __init__
node_features, type_for_node, node_types, dtype
File "/home/lam/.local/lib/python3.6/site-packages/stellargraph/core/graph.py", line 216, in _convert_from_node_data
{node_type: data}, node_type_map, node_types, dtype
File "/home/lam/.local/lib/python3.6/site-packages/stellargraph/core/graph.py", line 182, in _convert_from_node_data
data_arr = arr.values.astype(dtype)
File "/home/lam/.local/lib/python3.6/site-packages/pandas/core/generic.py", line 5443, in values
return self._data.as_array(transpose=self._AXIS_REVERSED)
File "/home/lam/.local/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 822, in as_array
arr = mgr._interleave()
File "/home/lam/.local/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 840, in _interleave
result = np.empty(self.shape, dtype=dtype)
MemoryError
While monitoring memory consumption, I observed that the RAM only reached about 55% of its total capacity, and the swap was not used at all. While running the code, I only used TTY + tmux with only vim, top and a python session running. Moreover, I made sure no other memory-hogging processes were running in the background. So I'm fairly certain the memory bottleneck is caused by Python.
What I have tried
To reduce the memory consumption, I tried to use dask to manage the dense feature data frame. Unfortunately, the StellarGraph constructor only accepts a pandas array, pandas data frame, dictionary, tuple, or other iterable as its input.
Other than dask, I also tried using a sparse matrix (since almost 80% of my data set is zero-valued anyway). However, it gave me a TypeError, since StellarGraph cannot take a sparse matrix as its node_features.
I've also read several solutions for managing large data sets, which (mostly) suggest iteratively parsing the data into the python session. However, I couldn't find any documentation in StellarGraph on such a method.
The other option would be using a computer with better hardware, which, to my regret, I can't do due to limited funding. I'm a student and can't afford a better machine for now.
Potential solutions
Upgrading the RAM. I'll try salvaging RAM from other computers, but the maximum I currently have access to would be 16 GB. I'm not sure it will be enough.
Use a smaller chunk of the feature data set. I managed to get by with this solution, but the model's accuracy was really bad (50-ish %).
Questions
Why does Python use only 55% of my total RAM, without dynamic swap allocation?
How should I effectively manage large data frame?
How do I handle MemoryError when creating a StellarGraph object?
How much RAM do I actually need? Would 32GB suffice?
Python works fine. It's an implementation problem in StellarGraph.
I don't think StellarGraph currently supports matrices this large.
File "/home/lam/.local/lib/python3.6/site-packages/stellargraph/core/graph.py", line 182, in _convert_from_node_data
data_arr = arr.values.astype(dtype)
Up to this point in your code, all of your data is stored as sparse arrays, which don't take much memory. Here, arr is a DataFrame whose columns are pandas.SparseArray. This line of code converts the data structure to a normal dense numpy array, which blows up the memory usage.
import numpy as np

a = np.empty((158950, 14450), float)
print(a.nbytes / 2**30)  # 17.112698405981064
An empty numpy array of this shape actually takes about 17 GiB of memory. I can initialize 3 arrays like that one on my 16 GB computer, presumably because np.empty does not commit the memory until it is written, and I get a memory error if I try to create more than 3. And I can't initialize a 158,950 x 158,950 numpy array at all.
