PyTorch Dataset / Dataloader from random source - pytorch

I have a source of random (non-deterministic, non-repeatable) data, that I'd like to wrap in Dataset and Dataloader for PyTorch training. How can I do this?
__len__ is not defined, as the source is infinite (with possible repition).
__getitem__ is not defined, as the source is non-deterministic.

When defining a custom dataset class, you'd ordinarily subclass torch.utils.data.Dataset and define __len__() and __getitem__().
However, for cases where you want sequential but not random access, you can use an iterable-style dataset. To do this, you instead subclass torch.utils.data.IterableDataset and define __iter__(). Whatever is returned by __iter__() should be a proper iterator; it should maintain state (if necessary) and define __next__() to obtain the next item in the sequence. __next__() should raise StopIteration when there's nothing left to read. In your case with an infinite dataset, it never needs to do this.
Here's an example:
import torch
class MyInfiniteIterator:
def __next__(self):
return torch.randn(10)
class MyInfiniteDataset(torch.utils.data.IterableDataset):
def __iter__(self):
return MyInfiniteIterator()
dataset = MyInfiniteDataset()
dataloader = torch.utils.data.DataLoader(dataset, batch_size = 32)
for batch in dataloader:
# ... Do some stuff here ...
# ...
# if some_condition:
# break

Related

Sklearn Pipeline: One feature automatically missed out

I created a Custom Classifier(Dummy Classifier). Below is definition. I also added some print statements & global variables to capture values
class FeaturePassThroughClassifier(ClassifierMixin):
def __init__(self):
pass
def fit(self, X, y):
global test_arr1
self.classes_ = np.unique(y)
test_arr1 = X
print("1:", X.shape)
return self
def predict(self, X):
global test_arr2
test_arr2 = X
print("2:", X.shape)
return X
def predict_proba(self, X):
global test_arr3
test_arr3 = X
print("3:", X.shape)
return X
Below is Stacking Classifier definition where the above defined CustomClassifier is one of base classifier. There are 3 more base classifiers (these are fitted estimators). Goal is to get input training set variables as is (which will come out from CustomClassifier) + prediction from base_classifier2, base_classifier3, base_classifier4. These features will act as input to meta classifier.
model = StackingClassifier(estimators=[
('select_features', Pipeline(steps = [("model_feature_selector", ColumnTransformer([('feature_list', 'passthrough', X_train.columns)])),
('base(dummy)_classifier1', FeaturePassThroughClassifier())])),
('base_classifier2', base_classifier2),
('base_classifier3', base_classifier3),
('base_classifier4', base_classifier4)
],
final_estimator = Pipeline(memory=None,
steps=[
('save_base_estimator_output_data', FunctionTransformer(save_base_estimator_output_data, validate=False)), ('final_model', RandomForestClassifier())
], verbose=True), passthrough = False, **stack_method = 'predict_proba'**)
Below is o/p on fitting the model. There are 230 variables:
Here is the problem: There are 230 variables but CustomClassifier o/p is showing only 229 which is strange. We can clearly see from print statements above that 230 variables get passed through CustomClassifier.
I need to use stack_method = "predict_proba". I am not sure what's going wrong here. The code works fine when stack_method = "predict".
Since this is a binary classifier, the classifier class expects you to add two probability columns in the output matrix - one for probability for class label "1" and another for "0".
In the output, it has dropped one of these since both are not required, hence, 230 columns get reduced to 229. Add a dummy column to solve your problem.
In the Notes section of the documentation:
When predict_proba is used by each estimator (i.e. most of the time for stack_method='auto' or specifically for stack_method='predict_proba'), The first column predicted by each estimator will be dropped in the case of a binary classification problem.
Here's the code that eliminates the first column.
You could add a sacrificial first column in your custom estimator's predict_proba, or switch to decision_function (which will cause differences depending on your real base estimators), or use the passthrough option instead of the custom estimator (doing feature selection in the final_estimator object instead).
Both the above solutions are on point. This is how I implemented the workaround with dummy column:
Declare a custom transformer whose output is the column that gets dropped due reasons explained above:
class add_dummy_column(BaseEstimator, TransformerMixin):
def __init__(self, key):
self.key = key
def fit(self, X, y=None):
return self
def transform(self, X):
print(type(X))
return X[[self.key]]
Do a feature union where above customer transformer + column transformer are called to create final dataframe. This will duplicate the column that gets dropped. Below is altered definition for defining Stacking classifier with FeatureUnion:
model = StackingClassifier(estimators=[
('select_features', Pipeline(steps = [('featureunion', FeatureUnion([('add_dummy_column_to_input_dataframe', add_dummy_column(key='FEATURE_THAT_GETS_DROPPED')),
("model_feature_selector", ColumnTransformer([('feature_list', 'passthrough', X_train.columns)]))])),
('base(dummy)_classifier1', FeaturePassThroughClassifier())])),
('base_classifier2', base_classifier2),
('base_classifier3', base_classifier3),
('base_classifier4', base_classifier4)
],
final_estimator = Pipeline(memory=None,
steps=[
('save_base_estimator_output_data', FunctionTransformer(save_base_estimator_output_data, validate=False)), ('final_model', RandomForestClassifier())
], verbose=True), passthrough = False, **stack_method = 'predict_proba'**)

TypeError: __init__() takes from 1 to 2 positional arguments but 4 were given (Text Analysis with Python)

I was trying to follow https://github.com/foxbook/atap/blob/master/snippets/ch04/loader.py but having the below error:
TypeError: init() takes from 1 to 2 positional arguments but 4 were given
Any thoughts on how to resolve the error
I was able to make changes and run PickledCorpusReader but Corpus Loader is giving some issues as shared below.
from sklearn.model_selection import KFold
class CorpusLoader(object):
"""
The corpus loader knows how to deal with an NLTK corpus at the top of a
pipeline by simply taking as input a corpus to read from. It exposes both
the data and the labels and can be set up to do cross-validation.
If a number of folds is passed in for cross-validation, then the loader
is smart about how to access data for train/test splits. Otherwise it will
simply yield all documents in the corpus.
"""
def __init__(self, corpus, folds=None, shuffle=True):
self.n_docs = len(corpus.fileids())
self.corpus = corpus
self.folds = folds
self.shuffle = True
if folds is not None:
# Generate the KFold cross validation for the loader.
self.folds = KFold(self.n_docs, folds, shuffle)
#property
def n_folds(self):
"""
Returns the number of folds if it exists; 0 otherwise.
"""
if self.folds is None: return 0
return self.folds.n_folds
def fileids(self, fold=None, train=False, test=False):
"""
Returns a listing of the documents filtering to retreive specific
data from the folds/splits. If no fold, train, or test is specified
then the method will return all fileids.
If a fold is specified (should be an integer between 0 and folds),
then the loader will return documents from that fold. Further, train
or test must be specified to split the fold correctly.
"""
if fold is None:
# If no fold is specified, return all the fileids.
return self.corpus.fileids()
# Otherwise, identify the fold specifically and get the train/test idx
for fold_idx, (train_idx, test_idx) in enumerate(self.folds):
if fold_idx == fold: break
else:
# We have discovered no correct fold.
raise ValueError(
"{} is not a fold, specify an integer less than {}".format(
fold, self.folds.n_folds
)
)
# Now determine if we're in train or test mode.
if not (test or train) or (test and train):
raise ValueError(
"Please specify either train or test flag"
)
# Select only the indices to filter upon.
indices = train_idx if train else test_idx
return [
fileid for doc_idx, fileid in enumerate(self.corpus.fileids())
if doc_idx in indices
]
def labels(self, fold=None, train=False, test=False):
"""
Fit will load a list of the labels from the corpus categories.
If a fold is specified (should be an integer between 0 and folds),
then the loader will return documents from that fold. Further, train
or test must be specified to split the fold correctly.
"""
return [
self.corpus.categories(fileids=fileid)[0]
for fileid in self.fileids(fold, train, test)
]
def documents(self, fold=None, train=False, test=False):
"""
A generator of documents being streamed from disk. Each document is
a list of paragraphs, which are a list of sentences, which in turn is
a list of tuples of (token, tag) pairs. All preprocessing is done by
NLTK and the CorpusReader object this object wraps.
If a fold is specified (should be an integer between 0 and folds),
then the loader will return documents from that fold. Further, train
or test must be specified to split the fold correctly. This method
allows us to maintain the generator properties of document reads.
"""
for fileid in self.fileids(fold, train, test):
yield list(self.corpus.tagged(fileids=fileid))
if __name__ == '__main__':
from reader import PickledCorpusReader
corpus4 = PickledCorpusReader(nomi,r'.+\.txt')
loader = CorpusLoader(corpus, folds=12)
for fid in loader.fileids(0):
print(fid)
I had an error with KFold as well. Specifying the keyword-arguments explicitly, like so:
self.folds = KFold(n_splits=folds, shuffle=shuffle)
resolved it for me. I do not know why, though.
Hope this helps
I had this error as well, and I solve it by the following code:
KFold(n_splits=2, random_state=None, shuffle=False)
based on this link:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

Subclass of PyTorch DataLoader for changing batch output

I'm interested in a way of applying a transform to a batch generated by a PyTorch DataLoader class. My minimal example is something like this:
class CustomLoader(torch.utils.data.DataLoader):
def __iter__(self):
result = super().__iter__()
return some_function(result)
But this errors since the DataLoader.__iter()__ returns _MultiProcessingDataLoaderIter or _SingleProcessingDataLoaderIter. Weirdly though, directly returning the output does return a Tensor, so any explanation there would be greatly appreciated!
I understand that in general, transform to data should be done in the subclassed Dataset class. However, in my case the data is tabular and the transform is via numpy, and doing it on a sample-wise basis is much slower (5x) than doing it on an entire batch, since surely these operations are vectorized under the hood.
I know I can do something simple like
for X, y in loader:
X = some_function(X)
But I'd also like to use the DataLoader with pytorch-lightning, so this isn't an option.
What is the proper way to subclass PyTorch Dataloaders?
__iter__() is a generator. You will need to yield the result instead of returning it. You can read more about generators here
Regarding your problem to apply a transform to a batch, you can create a custom Dataset instead of DataLoader and then apply the transforms.
class MyDataset(Dataset):
def __init__(self, transforms=None):
super().__init__()
self.data = ... # define your data here
self.transforms = transforms
def __getitem__(self, idx):
x = self.data[idx]
if self.transforms: x = self.transforms(x)
return x
# use your `MyDataset` class for creating your dataloader
dataloader = DataLoader(MyDataset(transforms = CustomTransforms(), batch_size=4)
You can use this dataloader with PyTorch Lightning Trainer as well.
If you are using PyTorch Lightning, I would suggest you to join our Slack channel and ask questions on Github Discussions as well.
Thanks :)
EDIT: (Add transforms to Batch)
If you are using PyTorch Lightning then I would recommend to use LightningDataModule which provides on_before_batch_transfer hook that can be used to apply transforms on a batch ;)
Here is an example:
def on_before_batch_transfer(self, batch, dataloader_idx):
batch['x'] = transforms(batch['x'])
return batch
Checkout the documentation for more

Does a `DataLoader` created from `ConcatDataset` create a batch from a different files, or a single file?

I am working with multiple files, and multiple training samples in each file. I will use ConcatDataset as described here:
https://discuss.pytorch.org/t/dataloaders-multiple-files-and-multiple-rows-per-column-with-lazy-evaluation/11769/7
I need to have negative samples in addition to my true samples, and I need my negative samples to be randomly selected from all the training data files. So, I am wondering, would the returned batch samples just be a random consecutive chuck from a single file, or would be batch span across multiple random indexes across all the datafiles?
If there are more details needed about what I am trying to do exactly, it's because I am trying to train over a TPU with Pytorch XLA.
Normally for negative samples, I would just use a 2nd DataSet and DataLoader, however, I am trying to train over TPUs with Pytorch XLA (alpha was just released a few days ago https://github.com/pytorch/xla ), and to do that I need to send my DataLoader to a torch_xla.distributed.data_parallel.DataParallel object, like model_parallel(train_loop_fn, train_loader) which can be seen in these example notebooks
https://github.com/pytorch/xla/blob/master/contrib/colab/resnet18-training-xrt-1-15.ipynb
https://github.com/pytorch/xla/blob/master/contrib/colab/mnist-training-xrt-1-15.ipynb
So, I am now limited to a single DataLoader, which will need to handle both the true samples, and negative samples that need to be randomly selected from all my files.
ConcatDataset is a custom class that is subclassed from torch.utils.data.Dataset. Let's take a look at one example.
class ConcatDataset(torch.utils.data.Dataset):
def __init__(self, *datasets):
self.datasets = datasets
def __getitem__(self, i):
return tuple(d[i] for d in self.datasets)
def __len__(self):
return min(len(d) for d in self.datasets)
train_loader = torch.utils.data.DataLoader(
ConcatDataset(
dataset1,
dataset2
),
batch_size=args.batch_size,
shuffle=True,
num_workers=args.workers,
pin_memory=True)
for i, (input, target) in enumerate(train_loader):
...
Here, two datasets namely dataset1 (a list of examples) and dataset2 are combined to form a single training dataset. The __getitem__ function returns one example from the dataset and will be used by the BatchSampler to form the training mini-batches.
Would the returned batch samples just be a random consecutive chuck from a single file, or would be batch span across multiple random indexes across all the datafiles?
Since you have combined all your data files to form one dataset, now it depends on what BatchSampler do you use to sample mini-batches. There are several samplers implemented in PyTorch, for example, RandomSampler, SequentialSampler, SubsetRandomSampler, WeightedRandomSampler. See their usage in the documentation.
You can have your custom BatchSampler too as follows.
class MyBatchSampler(Sampler):
def __init__(self, *params):
# write your code here
def __iter__(self):
# write your code here
# return an iterable
def __len__(self):
# return the size of the dataset
The __iter__ function should return an iterable of mini-batches. You can implement your logic of forming mini-batches in this function.
To randomly sample negative examples for training, one alternative could be to pick negative examples for each positive example in the __init__ function of the ConcatDataset class.

How to use multiprocessing in PyTorch?

I'm trying to use PyTorch with complex loss function. In order to accelerate the code, I hope that I can use the PyTorch multiprocessing package.
The first trial, I put 10x1 features into the NN and get 10x4 output.
After that, I want to pass 10x4 parameters into a function to do some calculation. (The calculation will be complex in the future.)
After calculating, the function will return a 10x1 array in total. This array will be set as NN_energy and calculate loss function.
Besides, I also want to know if there is another method to create a backward-able array to store the NN_energy array, instead of using
NN_energy = net(Data_in)[0:10,0]
Thanks a lot.
Full Code:
import torch
import numpy as np
from torch.autograd import Variable
from torch import multiprocessing
def func(msg,BOP):
ans = (BOP[msg][0]+BOP[msg][1]/BOP[msg][2])*BOP[msg][3]
return ans
class Net(torch.nn.Module):
def __init__(self, n_feature, n_hidden_1, n_hidden_2, n_output):
super(Net, self).__init__()
self.hidden_1 = torch.nn.Linear(n_feature , n_hidden_1) # hidden layer
self.hidden_2 = torch.nn.Linear(n_hidden_1, n_hidden_2) # hidden layer
self.predict = torch.nn.Linear(n_hidden_2, n_output ) # output layer
def forward(self, x):
x = torch.tanh(self.hidden_1(x)) # activation function for hidden layer
x = torch.tanh(self.hidden_2(x)) # activation function for hidden layer
x = self.predict(x) # linear output
return x
if __name__ == '__main__': # apply_async
Data_in = Variable( torch.from_numpy( np.asarray(list(range( 0,10))).reshape(10,1) ).float() )
Ground_truth = Variable( torch.from_numpy( np.asarray(list(range(20,30))).reshape(10,1) ).float() )
net = Net( n_feature=1 , n_hidden_1=15 , n_hidden_2=15 , n_output=4 ) # define the network
optimizer = torch.optim.Rprop( net.parameters() )
loss_func = torch.nn.MSELoss() # this is for regression mean squared loss
NN_output = net(Data_in)
args = range(0,10)
pool = multiprocessing.Pool()
return_data = pool.map( func, zip(args, NN_output) )
pool.close()
pool.join()
NN_energy = net(Data_in)[0:10,0]
for i in range(0,10):
NN_energy[i] = return_data[i]
loss = torch.sqrt( loss_func( NN_energy , Ground_truth ) ) # must be (1. nn output, 2. target)
print(loss)
Error messages:
File
"C:\ProgramData\Anaconda3\lib\site-packages\torch\multiprocessing\reductions.py",
line 126, in reduce_tensor
raise RuntimeError("Cowardly refusing to serialize non-leaf tensor which requires_grad, "
RuntimeError: Cowardly refusing to serialize non-leaf tensor which
requires_grad, since autograd does not support crossing process
boundaries. If you just want to transfer the data, call detach() on
the tensor before serializing (e.g., putting it on the queue).
First of all, Torch Variable API is deprecated since a very long time, just don't use it.
Next, torch.from_numpy( np.asarray(list(range( 0,10))).reshape(10,1) ).float() is wrong at many levels: np.asarray of list is useless since a copy will be performed anyway, and np.array takes list as input by design. Then, np.arange is available to return a range as numpy array, and it is also available on Torch. Next, specifying both dimension for reshape is useless and error prone, you could simply do reshape((-1, 1)), or even better unsqueeze(-1).
Here is the simplified expression torch.arange(10, dtype=torch.float32, requires_grad=True).unsqueeze(-1).
Using multiprocessing pool is a bad practice if using batch processing is possible. It will be both way more efficient and readable. Indeed, performing N small algebraic operations in parallel is always slower and a larger single algebraic operation, and even more on GPU. More importantly, computing the gradient is not supported by multiprocessing, hence the error that you get. Yet, this is partially true, because it is supports for tensors on cpu since 1.6.0. Have a lok, to the official release changelog.
Could you post a more representative example of what func method could be to make sure you really need it ?
NB: Distributed autograd as you are looking is now available in Pytorch as an experimental feature available in beta since 1.6.0. Have a look to the official documentation.

Resources