PyTorch DataLoader with multiple data sources - python-3.x

I am trying to define my own dataset for the PyTorch DataLoader, but I am not sure how to load multiple data sources:
My current code:
class MultipleSourceDataSet(Dataset):
    def __init__(self, json_file, root_dir, transform=None):
        with open(root_dir + 'block0.json') as f:
            self.result = torch.Tensor(json.load(f))
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.result[0])

    def __getitem__(self):
        None
The data source is 50 blocks under root_dir = ~/Documents/blocks/
I split the data into blocks and avoided combining them directly beforehand, since this is a very big dataset.
How can I load them into a single dataloader?

For a DataLoader you need a single Dataset; your problem is that you have multiple 'json' files and you only know how to create a Dataset from each 'json' separately.
What you can do in this case is to use ConcatDataset, which contains all the single-'json' datasets you create:
import os
import torch.utils.data as data

class SingleJsonDataset(data.Dataset):
    pass  # implement a single-json dataset here (a sketch follows below)

list_of_datasets = []
for j in os.listdir(root_dir):
    if not j.endswith('.json'):
        continue  # skip non-json files
    list_of_datasets.append(SingleJsonDataset(json_file=j, root_dir=root_dir, transform=None))

# once all single-json datasets are created, you can concat them into a single one:
multiple_json_dataset = data.ConcatDataset(list_of_datasets)
Now you can feed the concatenated dataset into data.DataLoader.
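For completeness, here is a minimal sketch of what the single-json dataset could look like, assuming each block file holds a JSON list of numeric samples (the exact parsing depends on your file layout):
import json
import os
import torch
import torch.utils.data as data

class SingleJsonDataset(data.Dataset):
    def __init__(self, json_file, root_dir, transform=None):
        # load one block into memory; adjust the parsing to your actual schema
        with open(os.path.join(root_dir, json_file)) as f:
            self.samples = torch.tensor(json.load(f))
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        sample = self.samples[index]
        if self.transform is not None:
            sample = self.transform(sample)
        return sample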

I should revise my question into 2 different sub-questions:
1. How to deal with large datasets in PyTorch to avoid memory errors
2. If I separate a large dataset into small chunks, how can I load multiple mini-datasets?
For question 1:
The PyTorch DataLoader can prevent this issue by loading the data in mini-batches; a short sketch follows below. Here you can find further explanations.
For question 2:
Please refer to Shai's answer above.
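To illustrate question 1, here is a minimal sketch, assuming the ConcatDataset built in the answer above (batch_size and num_workers are illustrative values):
from torch.utils.data import DataLoader

loader = DataLoader(
    multiple_json_dataset,   # the ConcatDataset built in the answer above
    batch_size=32,           # only one mini-batch is materialised at a time
    shuffle=True,
    num_workers=2,           # workers fetch samples lazily via __getitem__
)

for batch in loader:
    ...  # train on one mini-batch at a time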

Related

How to create a custom parallel corpus for machine translation with recent versions of pytorch and torchtext?

I am trying to train a model for NMT on a custom dataset. I found a great tutorial on YouTube, along with the accompanying repo, but it uses an old version of PyTorch and torchtext. More recent versions of torchtext have removed the Field and BucketIterator classes.
I looked for more recent tutorials. The closest thing I could find was this medium post (again with the accompanying code) which worked with a custom dataset for text classification. I tried to replicate the code with my problem and got this far:
from os import PathLike
from torch.utils.data import Dataset
from torchtext.vocab import Vocab
import pandas as pd
from .create_vocab import tokenizer
class ParallelCorpus(Dataset):
    """A parallel corpus for training a machine translation model"""

    def __init__(self,
                 corpus_path: str | PathLike,
                 source_vocab: Vocab,
                 target_vocab: Vocab
                 ):
        super().__init__()
        self.corpus = pd.read_csv(corpus_path)
        self.source_vocab = source_vocab
        self.target_vocab = target_vocab

    def __len__(self):
        return len(self.corpus)

    def __getitem__(self, index: int):
        source_sentence = self.corpus.iloc[index, 0]
        source = [self.source_vocab["<sos>"]]
        source.extend(
            self.source_vocab.lookup_indices(tokenizer(source_sentence))
        )
        source.append(self.source_vocab["<eos>"])

        target_sentence = self.corpus.iloc[index, 1]
        target = [self.target_vocab["<sos>"]]
        target.extend(
            self.target_vocab.lookup_indices(tokenizer(target_sentence))
        )
        target.append(self.target_vocab["<eos>"])

        return source, target
My question is: is this the correct way to implement parallel corpora for PyTorch? And where can I find more information about this, since the documentation wasn't much help?
Thank you in advance and sorry if this is against the rules.
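Not a full answer, but one practical detail a dataset like this usually needs is a collate function that pads the variable-length pairs before batching. A minimal sketch, assuming both vocabularies contain a "<pad>" token (the batch size and the names corpus, source_vocab, target_vocab are placeholders for your own objects):
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate_pairs(batch, src_pad_idx, tgt_pad_idx):
    # batch is a list of (source, target) index lists returned by ParallelCorpus
    sources = [torch.tensor(src) for src, _ in batch]
    targets = [torch.tensor(tgt) for _, tgt in batch]
    sources = pad_sequence(sources, batch_first=True, padding_value=src_pad_idx)
    targets = pad_sequence(targets, batch_first=True, padding_value=tgt_pad_idx)
    return sources, targets

loader = DataLoader(
    corpus,  # a ParallelCorpus instance
    batch_size=32,
    shuffle=True,
    collate_fn=lambda b: collate_pairs(b, source_vocab["<pad>"], target_vocab["<pad>"]),
)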

HuggingFace: Streaming dataset from local dir using custom data_loader and data_collator

I have a custom data_loader and data_collator that I am using for training a Transformer model with the HuggingFace API. The data_loader also does the mapping of the dataset, where tokenization is done as well.
My data_loader script is a class that inherits from datasets.GeneratorBasedBuilder and therefore contains a _generate_examples function to yield samples.
Upon starting the training, it caches the whole dataset (only once per system), then starts the training. I can reuse that cache on the local system, but I can't use the cached .arrow file on any other system, so the caching process restarts there. I want to avoid caching by using the streaming feature. My current code looks like:
from datasets import load_dataset

dataset = load_dataset("/../my_data_loader.py", streaming=True)
train_dataset = dataset["train"]
train_dataset = train_dataset.map(.....)
data_collator = MyDataCollator(......)
...
...
trainer = Trainer(model=model, args=training_arg, train_dataset=train_dataset, data_collator=data_collator)
Note: I don't know where I would have to implement the __len__ and __iter__ functions on my side.
Using datasets version 1.12 or above, we can stream a dataset (without caching) by setting streaming=True as follows.
dataset = load_dataset("/../my_data_loader.py", streaming=True)
In this case, the dataset will be an IterableDataset, hence the mapping is also a little different. Say the following script was used in caching mode:
train_dataset = datasets["train"]
train_dataset = train_dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=remove_columns,
    num_proc=preprocessing_num_workers,
    load_from_cache_file=not overwrite_cache,
)
Then, after turning on streaming, you have to convert the dataset format and change the parameters of the mapping function as well.
train_dataset = datasets["train"]
train_dataset = train_dataset.with_format("torch")
train_dataset = train_dataset.map(
    tokenize_and_align_labels,
    batched=True,
)
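Regarding the note about __len__ and __iter__: a streaming dataset is iterated lazily, so the Trainer cannot infer its length. A minimal sketch, assuming the usual Trainer setup (output_dir, model, data_collator and the step/batch values are placeholders):
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    max_steps=10000,                  # required: an IterableDataset has no __len__
    per_device_train_batch_size=16,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,      # the streaming, mapped dataset from above
    data_collator=data_collator,
)
trainer.train()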

How to use tf.data.Dataset.interleave to subsample from multi dataset objects in tf2?

I tried to replicate the solution posted here with tf.data.Dataset.interleave, but I am not quite sure how to apply the interleave method to already created dataset objects.
Here is the code:
import tensorflow as tf
import numpy as np
# preparing data
train, test = tf.keras.datasets.fashion_mnist.load_data()
images, labels = train
images = images/255
dataset = tf.data.Dataset.from_tensor_slices((images, labels))
class0 = lambda features, label: label == 0
class1 = lambda features, label: label == 1
class2 = lambda features, label: label == 2

ds_0 = dataset.filter(class0)
ds_1 = dataset.filter(class1)
ds_2 = dataset.filter(class2)
I want to create a dataset that samples equally from ds_0, ds_1, and ds_2. What should I pass as map_func?
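Not an interleave answer as such, but a hedged sketch of a closely related route: interleave's map_func has to build a new dataset from each input element, which is awkward when ds_0, ds_1 and ds_2 already exist as Python objects. The *_from_datasets helpers take the already-created datasets directly (they live on tf.data.Dataset in TF 2.7+ and under tf.data.experimental in earlier TF2 releases):
# Uniform random sampling from the three per-class datasets.
balanced = tf.data.Dataset.sample_from_datasets(
    [ds_0.repeat(), ds_1.repeat(), ds_2.repeat()],
    weights=[1/3, 1/3, 1/3],
)

# Deterministic round-robin: 0, 1, 2, 0, 1, 2, ...
choice = tf.data.Dataset.range(3).repeat()
balanced_round_robin = tf.data.Dataset.choose_from_datasets(
    [ds_0.repeat(), ds_1.repeat(), ds_2.repeat()],
    choice,
)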

Pickle fit-object

I wrote a class where some data are fitted. Since the fitting takes very long when lots of data have to be fitted, I want to save the fit-object of this class so I do not have to repeat the fitting when I want to use the fitted data later. Using pickle, I get the following error calling the save method on an object:
AttributeError: Can't pickle local object 'ConstantModel.__init__.<locals>.constant'
I only have this problem when I pickle the fitted data; pickling works if I save the object before fitting.
Is there a way to pickle fitted data or is there a nice workaround?
import pickle
import lmfit

class pattern:
    def fitting(self):
        mod_total = lmfit.models.ConstantModel()
        pars_total = mod_total.guess(self.y, x=self.x)
        self.fit = mod_total.fit(self.y, pars_total, x=self.x)

    def save(self, path):
        with open(path, 'wb') as filehandler:
            pickle.dump(self, filehandler)
I found a solution for this problem: using dill instead of pickle works (as I want it to).
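A minimal sketch of the dill-based workaround (the load helper is illustrative): dill can serialize the locally defined function that lmfit's ConstantModel creates, which the standard pickle cannot.
import dill

class pattern:
    # ... fitting() as above ...

    def save(self, path):
        with open(path, 'wb') as filehandler:
            dill.dump(self, filehandler)   # dill handles the nested function pickle chokes on

    @staticmethod
    def load(path):
        with open(path, 'rb') as filehandler:
            return dill.load(filehandler)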

How to implement a sklearn transformer with sklearn.impute.SimpleImputer that returns a pandas DataFrame

I want to implement a custom transformer with the sklearn imputer, e.g., sklearn.impute.SimpleImputer.
The output should be a DataFrame.
I have the following code, but I am not sure if it is correct:
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.impute import SimpleImputer

class DFSimpleImputer(TransformerMixin):
    def __init__(self, *args, **kwargs):
        self.imp = SimpleImputer(*args, **kwargs)

    def fit(self, X, y=None, **fit_params):
        self.imp.fit(X)
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Ximp = self.imp.transform(X)
        Xfilled = pd.DataFrame(Ximp, index=X.index, columns=X.columns)
        return Xfilled
Yes, the code above works and returns a DataFrame. The question you need to ask is why you need a DataFrame when building transformers (yes, it adds labels for easy reading). Maybe a plain NumPy array is better, since you may encounter a sparse matrix, and a DataFrame would eat up all your RAM.
My code above actually works per the tests: it fits the dataset, then transforms it, and the end result is converted into a DataFrame. The drawback, though, is that the default NumPy array returned by SimpleImputer would be better than a DataFrame if we end up with a sparse matrix. So converting the array into a DataFrame is good for research purposes, but in production I might switch back to the plain array.
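For illustration, a minimal usage sketch (the column names and values are made up): the wrapper drops into a scikit-learn Pipeline like any other transformer, and the output keeps the original index and column labels.
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 2.0]})

pipe = Pipeline([("impute", DFSimpleImputer(strategy="mean"))])
filled = pipe.fit_transform(df)
print(type(filled))   # <class 'pandas.core.frame.DataFrame'>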
