Why does PyTorch create another repo called TorchData for similar/new Dataset and DataLoader classes instead of adding them to the existing PyTorch repo? What's the difference between Dataset and DataPipe? Thanks.
TorchData is a library of common modular data loading primitives for easily constructing flexible and performant data pipelines.
It aims to provide composable Iterable-style and Map-style building blocks called DataPipes that work well out of the box with PyTorch's DataLoader. It contains functionality to reproduce many datasets in TorchVision and TorchText, including loading, parsing, caching, and several other utilities (e.g. hash checking).
DataPipe is simply a renaming and repurposing of the PyTorch Dataset for composed usage. A DataPipe takes in some access function over Python data structures, __iter__ for IterDataPipes and __getitem__ for MapDataPipes, and returns a new access function with a slight transformation applied.
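For illustration, here is a minimal sketch of chaining IterDataPipes (assuming the torchdata package is installed; the values are made up):

from torchdata.datapipes.iter import IterableWrapper

# Wrap a plain Python iterable in a DataPipe.
source_dp = IterableWrapper(range(10))

# Each functional call returns a new DataPipe with a transformation applied.
doubled_dp = source_dp.map(lambda x: x * 2)
filtered_dp = doubled_dp.filter(lambda x: x % 3 == 0)

print(list(filtered_dp))  # [0, 6, 12, 18]

The resulting DataPipe can be passed directly to torch.utils.data.DataLoader.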
I am working with some classes of the Charades Dataset https://prior.allenai.org/projects/charades to detect indoor actions.
The structure of my dataset is as follows:
Where:
c025, c137 and c142 are actions;
XR436 contains frames resulting from splitting a video in which users perform action c025, and the same holds for X3803, ... There is a total of 250 folders.
RI495 contains frames resulting from splitting a video in which users perform action c137, and the same holds for DI402, ... There is a total of 40 folders.
TUCK3 contains frames resulting from splitting a video in which users perform action c142, and the same holds for the rest. There is a total of 260 folders.
As you can see, the instances of class c137 are quite unbalanced compared to classes c025 and c142. Thus, I would like to increase the number of instances of this class using data augmentation. The idea is to create twin folders with certain transformations applied. For example, creating folder A4DID as a twin of RI495 with equalization applied to each of the frames, A4456 as a twin of RI495 in grayscale, ARTI3 as a twin of DI402 with rotation applied to the frames, etc. The pattern of transformations can be the same for every folder or not. I am just interested in augmenting the number of instances.
Do you know how to proceed? I am using Pytorch and I tried with torchvision.transforms and DataLoader from torch.utils.data but I have not achieved the result that I am looking for. Any idea on how to proceed?
PS: Undersampling c025 and c142 is not an option, because the classifier is not able to learn well with such a limited number of examples.
Thank you in advance
A few thoughts:
Standard practice is to apply transforms dynamically; that is, each time a data example is loaded, a composed or sequential set of transform operations is applied with random parameter settings. Thus, each time the datum is loaded, the resulting x (inputs) are different. This can be achieved by defining a stack of transforms to apply to each data example as it is loaded in a PyTorch Dataset object (see here). This provides data augmentation.
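For example, a minimal sketch of such a dataset; FrameDataset and the samples list are hypothetical stand-ins for your folder structure, and RandomEqualize requires a reasonably recent torchvision (>= 0.9):

import torch
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image

class FrameDataset(Dataset):
    # samples is a list of (image_path, class_index) tuples.
    def __init__(self, samples, transform=None):
        self.samples = samples
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        image = Image.open(path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)  # random parameters drawn per call
        return image, label

# Random parameters are re-drawn on every load, so each epoch sees new variants.
train_transform = transforms.Compose([
    transforms.RandomRotation(15),
    transforms.RandomGrayscale(p=0.1),
    transforms.RandomEqualize(p=0.25),
    transforms.ToTensor(),
])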
Class imbalance is a somewhat different issue, and is generally solved by either a) oversampling (acceptable if you use the above transform solution, because the oversampled examples will have different transforms applied) or b) over-weighting these examples in the loss calculation. Of course, neither approach can account for the risk of receiving an out-of-distribution test example, a risk that grows the fewer and less diverse examples you have for a given class. The former can be achieved by defining a custom Sampler object that yields examples from your dataset in a class-balanced manner (or by using PyTorch's built-in WeightedRandomSampler). The latter can be achieved by passing weights to the loss function (many PyTorch loss functions, such as CrossEntropyLoss, already support weights).
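Both options in a minimal sketch, using the class counts from the question (the labels tensor is a hypothetical stand-in for your real per-example class indices):

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

class_counts = torch.tensor([250.0, 40.0, 260.0])  # c025, c137, c142
labels = torch.randint(0, 3, (550,))  # stand-in for the real labels

# a) Oversampling: weight each example inversely to its class frequency.
sample_weights = (1.0 / class_counts)[labels]
sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels),
                                replacement=True)
# loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# b) Loss weighting: pass per-class weights (here inverse frequency) to the loss.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)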
I have a model, exported from pytorch, I'll call main_model.onnx. It has an input node I'll call main_input that expects a list of integers. I can load this in onnxruntime and send a list of ints and it works great.
I made another ONNX model I'll call pre_model.onnx with input pre_input and output pre_output. This preprocesses some text so input is the text, and pre_output is a list of ints, exactly as main_model.onnx needs for input.
My goal here is to use the Python onnx.helper tools to create one uber-model that accepts text as input, runs it through my pre_model.onnx, possibly through some connector node (Identity, maybe?), and then through main_model.onnx, all in one big combined.onnx model.
I have tried using pre_model.graph.node + Identity connector + main_model.graph.node as the nodes in a new graph, but the parameters exported from PyTorch are lost this way. Is there a way to keep all those parameters around and export one even larger combined ONNX model?
This is possible to achieve, albeit a bit tricky. You can explore the Python APIs offered by ONNX (https://github.com/onnx/onnx/blob/master/docs/PythonAPIOverview.md). These allow you to load models into memory, and you'll have to "compose" your combined model using the exposed APIs: combine both GraphProto messages into one (this is easier said than done; you'll have to ensure that you don't violate the ONNX spec while doing this), then store the new GraphProto in a new ModelProto, and you have your combined model. I would also run it through the ONNX checker on completion to ensure the model is valid after creation.
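If your onnx version is recent enough (roughly 1.10+), the onnx.compose helper implements this GraphProto merging for you; a rough sketch, assuming the input/output names from the question and compatible opset versions in both models:

import onnx
from onnx import compose

pre_model = onnx.load("pre_model.onnx")
main_model = onnx.load("main_model.onnx")

# io_map wires pre_model's output to main_model's input,
# so no Identity connector node is needed.
combined = compose.merge_models(
    pre_model, main_model,
    io_map=[("pre_output", "main_input")],
)

onnx.checker.check_model(combined)  # validate against the ONNX spec
onnx.save(combined, "combined.onnx")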
If you have static-size inputs, the sclblonnx package is an easy solution for merging ONNX models. However, it does not support dynamic-size inputs.
For dynamic-size inputs, one solution would be to write your own code using the ONNX API, as stated earlier.
Another solution would be to convert the two ONNX models to a framework (TensorFlow or PyTorch) using tools like onnx-tensorflow or onnx2pytorch, pass the outputs of one network as inputs of the other, and export the whole network back to ONNX format.
I have a bunch of tensor operations (matmul, transpose, etc..) I would like to run on a large dataset.
Since they are still matrix operations, and since I am using Keras generators to load the data batches, it would make sense to use GPUs to compute them.
Now, I've searched for a while and I can't seem to find the correct way to use Keras for parallel GPU operations, using generators, outside of the standard Model object interface.
Does anyone know how to do it? Thanks!
I am using the fastai library (fast.ai) to train an image classifier. The model created by fastai is actually a pytorch model.
type(model)
<class 'torch.nn.modules.container.Sequential'>
Now, I want to use this model from pytorch for inference. Here is my code so far:
import torch

torch.save(model, "./torch_model_v1")
the_model = torch.load("./torch_model_v1")
the_model.eval()  # shows the entire network architecture
Based on the example shown here: http://pytorch.org/tutorials/beginner/data_loading_tutorial.html#sphx-glr-beginner-data-loading-tutorial-py, I understand that I need to write my own data loading class that overrides some of the functions in the Dataset class. But what is not clear to me is which transformations I need to apply at test time. In particular, how do I normalize the images at test time?
Another question: is my approach of saving and loading the model in PyTorch fine? I read in the tutorial here: http://pytorch.org/docs/master/notes/serialization.html that the approach I have used is not recommended, but the reason is not clear.
Just to clarify: the_model.eval() not only prints the architecture; it also sets the model to evaluation mode, which changes the behavior of layers such as dropout and batch norm.
In particular, how do I normalize the images at test time?
It depends on the model you have. For instance, for torchvision's pretrained models, you have to normalize the inputs with the ImageNet statistics they were trained with.
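For instance, the usual test-time pipeline for torchvision's pretrained models uses only deterministic transforms plus the ImageNet statistics:

from torchvision import transforms

test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),  # scales pixel values to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

Note that random augmentations belong in training only; at test time the pipeline should be deterministic.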
Regarding how to save / load models: torch.save/torch.load "saves/loads an object to a disk file."
So, if you save the_model, it will save the entire model object, including its architecture definition and some other internal aspects. If you save the_model.state_dict(), it will save a dictionary containing only the model state (i.e. parameters and buffers). Saving the entire model can break the code in various ways, so the preferred method is to save and load only the model state. However, I'm not sure whether the fast.ai "model file" is actually a full model or just the state of a model. You have to check this so you can load it correctly.
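A minimal sketch of the state-dict approach; build_model() is a hypothetical stand-in for however the architecture was originally constructed:

import torch

# Preferred: save only the parameters and buffers.
torch.save(the_model.state_dict(), "./torch_model_v1.pt")

# Loading requires re-creating the architecture first.
model = build_model()
model.load_state_dict(torch.load("./torch_model_v1.pt"))
model.eval()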
I am struggling with the following points:
When should bcolz be used instead of Keras' data generator? It looks like the Keras model has APIs that accept an array in batches or a data generator.
Is there a performance improvement when using bcolz with the fit() API over using a data generator with fit_generator()?
Finally, there's a fastai post mentioning dask (this post).
Is dask better than bcolz?
Thanks!
Keras' data generator flow_from_directory(directory) takes in PNG, JPG, BMP or PPM images only. Of course you could extend it, but bcolz is a quick fix, which is why bcolz is perfect for pre-computed convolution features: save those features as bcolz arrays and load them in batches for fit_generator().
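As an illustration, a rough sketch of saving pre-computed features to bcolz and streaming batches to fit_generator (the feature array and the rootdir name are made-up stand-ins):

import bcolz
import numpy as np

# Save pre-computed convolution features as a compressed on-disk array.
features = np.random.rand(1000, 7, 7, 512).astype(np.float32)  # stand-in data
carr = bcolz.carray(features, rootdir="conv_features.bc", mode="w")
carr.flush()

# Reopen lazily and yield (x, y) batches forever, as Keras generators expect.
def bcolz_batches(rootdir, labels, batch_size=64):
    data = bcolz.open(rootdir)
    while True:
        for i in range(0, len(data), batch_size):
            yield data[i:i + batch_size], labels[i:i + batch_size]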
fit_generator() with a data generator (which could be a bcolz data generator) would be quicker than fit() on a plain bcolz array.
Is dask better than bcolz? Dask isn't strictly an alternative to bcolz; dask can work with bcolz arrays. On tasks with huge datasets it can provide a speedup because it has great support for parallelism. bcolz is a nice compressed data container, and I'd suggest using dask on top of bcolz if you need that speedup.
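For instance, a rough sketch of putting dask on top of an on-disk bcolz array (reusing the hypothetical conv_features.bc directory from above):

import bcolz
import dask.array as da

carr = bcolz.open("conv_features.bc")
# Wrap the carray in a dask array; chunks are processed in parallel.
darr = da.from_array(carr, chunks=(256,) + carr.shape[1:])
feature_mean = darr.mean(axis=0).compute()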