Increasing instances of a class with Data Augmentation - pytorch

I am working with some classes of the Charades Dataset https://prior.allenai.org/projects/charades to detect indoor actions.
The structure of my dataset is as follows (one folder of frames per video, grouped by action), where:
c025, c137 and c142 are actions;
XR436 contains the frames obtained by splitting a video in which users perform action c025, and the same goes for X3803, ... There are 250 such folders in total.
RI495 contains the frames obtained by splitting a video in which users perform action c137, and the same goes for DI402, ... There are 40 such folders in total.
TUCK3 contains the frames obtained by splitting a video in which users perform action c142, and the same goes for the rest. There are 260 such folders in total.
As you can see, class c137 is quite unbalanced with respect to classes c025 and c142. Thus, I would like to increase the number of instances of this class using data augmentation. The idea is to create twin folders with certain transformations applied: for example, folder A4DID as a twin of RI495 with equalization applied to each of the frames, A4456 as a twin of RI495 in grayscale, ARTI3 as a twin of DI402 with a rotation applied to the frames, etc. The pattern of transformations can be the same for every folder or not; I am just interested in augmenting the number of instances.
I am using PyTorch and I have tried torchvision.transforms and DataLoader from torch.utils.data, but I have not achieved the result I am looking for. Any idea on how to proceed?
PS: Undersampling c025 and c142 is not an option, because the classifier is not able to learn well from such a limited number of examples.
Thank you in advance

A few thoughts:
Standard practice is to use transforms dynamically; that is, each time a data example is loaded, a composed or sequential set of transform operations is applied with random parameter settings. Thus, each time the datum is loaded, the resulting x (inputs) are different. This can be achieved by defining a stack of transforms to apply to each data example as it is loaded in a PyTorch dataset object (see here). This helps provide data augmentation.
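For instance, a minimal sketch of that idea, assuming one image file per frame and a list of (frame_path, class_index) pairs built from the folder structure described above (the FrameFolderDataset name and the particular transforms are illustrative, not prescribed):

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

# Random-parameter transforms: a different flip/rotation/grayscale decision on every load.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.RandomGrayscale(p=0.1),
    transforms.ToTensor(),
])

class FrameFolderDataset(Dataset):
    def __init__(self, samples, transform=None):
        # samples: list of (frame_path, class_index) pairs, e.g. one entry per frame
        # found under the c025/c137/c142 folders
        self.samples = samples
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        img = Image.open(path).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)  # new random parameters every time the frame is loaded
        return img, label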
Class imbalance is a somewhat different issue, and is generally solved by either a) oversampling (this is acceptable if using the above transform solution, because the oversampled examples will have different transforms applied) or b) over-weighting of these examples in the loss calculation. Of course, neither approach can account for the risk of receiving an out-of-distribution test example, which is higher the fewer and less diverse examples you have for a given class. The former can be achieved by defining a custom Sampler object that yields examples from your dataset in a class-balanced manner. The latter can be achieved by passing weights to the loss function (many PyTorch loss functions such as CrossEntropyLoss already support weights).
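A rough sketch of both options, continuing the dataset sketch above (dataset is an instance of FrameFolderDataset; the batch size and weighting scheme are placeholders):

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# class index of every example in the dataset (here read from FrameFolderDataset.samples)
labels = torch.tensor([label for _, label in dataset.samples])
class_counts = torch.bincount(labels).float()

# a) oversampling: rare-class examples are drawn more often; combined with the random
#    transforms above, each repetition of an example looks slightly different.
sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# b) over-weighting in the loss: errors on rare classes contribute more to the gradient.
class_weights = class_counts.sum() / class_counts
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)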

Related

Is there a built in sklearn cross validation function for only validating against higher indices?

I'd like to use GridSearchCV, but with the condition that the lowest index of the data in the validation set be greater than the largest in the training set. The reason is that the data is ordered in time, and future data gives unfair insight that would inflate the score. There's some discussion of this:
If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result. However, the opposite may be true if the samples are not independently and identically distributed. For example, if samples correspond to news articles, and are ordered by their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation score: it will be tested on samples that are artificially similar (close in time) to training samples.
but it's not clear to me whether any of the splitting methods listed can accomplish what I'm looking for. It seems that I can define an iterable of indices and pass that into cv, but in that case it's not clear how many I should define (does it always use all of them? do different tests get different indices?).
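For what it's worth, here is a minimal sketch of both routes in scikit-learn, assuming the rows of X are already in time order (the estimator and parameter grid are placeholders):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.arange(100, dtype=float)

# TimeSeriesSplit yields folds where every validation index is larger than every
# training index, which is exactly the "no future data in training" constraint.
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=TimeSeriesSplit(n_splits=5))
search.fit(X, y)

# Alternatively, cv accepts an explicit iterable of (train_indices, test_indices)
# pairs, so you decide how many splits there are and which rows each one uses.
manual_splits = [(np.arange(0, 60), np.arange(60, 80)),
                 (np.arange(0, 80), np.arange(80, 100))]
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=manual_splits)
search.fit(X, y)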

How to handle shared data between samples and batches in Keras

I'm using Keras for time series prediction and I want to create a model based on the self-attention mechanism that does not use any RNNs. For each sample we look at the last x timesteps of samples to predict the next sample.
In other words I want to feed the network (num_batches, num_samples, timesteps, features) and get (num_batches, predictions).
There is one problem with this.
There is a lot of unnecessary duplication of data: sample n has basically the same timesteps and features as sample n+1, only shifted one step to the left.
How would you handle this, assuming your dataset is very large?
I am not very familiar with this, but if your issue is "I have too much replicated data", I think you can solve it by devising a generator for your data and then passing the generator as input to the Keras/TensorFlow fit function (the TensorFlow API specification states that it supports generators as input); a sketch of such a generator follows below.
If your question is related to the logic behind the model, I do not see the issue. It is as if you have a sliding window: for each window you predict one value, and then you move the window by a certain amount (in your case, one). Could you elaborate a little more on your concern?
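To illustrate the generator idea from the first paragraph: a minimal sketch with tf.keras.utils.Sequence, where windows are sliced out of the single shared array only when a batch is requested, so the shifted copies are never stored all at once (the SlidingWindowSequence name, shapes and window length are assumptions, not from the question):

import numpy as np
import tensorflow as tf

class SlidingWindowSequence(tf.keras.utils.Sequence):
    def __init__(self, series, targets, window, batch_size):
        # series: (num_samples, features) array; targets: value to predict per timestep
        self.series, self.targets = series, targets
        self.window, self.batch_size = window, batch_size
        self.starts = np.arange(len(series) - window)  # one window per start index

    def __len__(self):
        return int(np.ceil(len(self.starts) / self.batch_size))

    def __getitem__(self, idx):
        starts = self.starts[idx * self.batch_size:(idx + 1) * self.batch_size]
        x = np.stack([self.series[s:s + self.window] for s in starts])
        y = np.stack([self.targets[s + self.window] for s in starts])
        return x, y

# model.fit(SlidingWindowSequence(series, targets, window=32, batch_size=64), epochs=10)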

What is the difference between scale and fit in sklearn?

I am new to data science, and while going through one of the Kaggle blogs I saw that the user was applying both scale and fit to the dataset. I tried to understand the difference by going through the documentation but was not able to.
It's hard to understand the source of your confusion without any code. Inside the link you provided, the data is first scaled with sklearn.preprocessing.scale() and then fit to a sklearn.ensemble.GradientBoostingRegressor.
So the scaling operation transforms data such that all the features are represented on the same scale, and the fitting operation trains the model with the said data.
From your question it sounds like you thought these two operations were mutually exclusive, or somehow equivalent, but they are actually logical consecutive steps.
In general, before a model is trained, the data is preprocessed in some way (with scale() in this case), and then the model is trained. In sklearn, the .fit() methods are for training (fitting functions/models to the data).
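To make the two steps concrete, here is a minimal end-to-end sketch on toy data (not the dataset from the blog you linked):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import scale

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

X_scaled = scale(X)  # preprocessing: each feature rescaled to zero mean and unit variance
model = GradientBoostingRegressor().fit(X_scaled, y)  # training: fit the model to the scaled data
predictions = model.predict(X_scaled)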
Hope it makes sense!
Scale is a data normalization technique, used when different features take values in very different ranges, e.g. one feature has values ranging from 1 to 10 while another has values ranging from 1000 to 10000.
Whereas fit is the function that actually starts your model training.
Scaling is a conversion of the data, a method used to normalize the range of independent variables or features. The fit method is the training step.

Using SVM to perform classification on multi-dimensional time series datasets

I would like to use scikit-learn's svm.SVC() estimator to perform classification tasks on multi-dimensional time series - that is, on time series where the points in the series take values in R^d, where d > 1.
The issue with doing this is that svm.SVC() will only take ndarray objects of dimension at most 2, whereas the dimension of such a dataset would be 3. Specifically, the shape of a given dataset would be (n_samples, n_features, d).
Is there a workaround available? One simple solution would just be to reshape the dataset so that it is 2-dimensional; however, I imagine this would lead to the classifier not learning from the dataset properly.
Without any further knowledge about the data, reshaping is the best you can do. Feature engineering is a very manual art that depends heavily on domain knowledge.
As a rule of thumb: if you don't really know anything about the data throw in the raw data and see if it works. If you have an idea what properties of the data may be beneficial for classification, try to work it in a feature.
Say we want to classify swiping patterns on a touch screen. This closely resembles your data: We acquired many time series of such patterns by recording the 2D position every few milliseconds.
In the raw data, each time series is characterized by n_timepoints * 2 features. We can use that directly for classification. If we have additional knowledge we can use that to create additional/alternative features.
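As a toy sketch of using the flattened raw data directly (random numbers standing in for real recordings):

import numpy as np
from sklearn.svm import SVC

n_samples, n_timepoints, d = 100, 50, 2
X = np.random.randn(n_samples, n_timepoints, d)   # (n_samples, n_features, d) as in the question
y = np.random.randint(0, 2, size=n_samples)

# Flatten each series into one row of n_timepoints * d raw features for svm.SVC().
X_flat = X.reshape(n_samples, n_timepoints * d)
clf = SVC().fit(X_flat, y)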
Let's assume we want to distinguish between zig-zag and wavy patterns. In that case smoothness (however that is defined) may be a very informative feature that we can add as a further column to the raw data.
On the other hand, if we want to distinguish between slow and fast patterns, the instantaneous velocity may be a good feature. However, the velocity can be computed as a simple difference along the time axis. Even linear classifiers can model this easily so it may turn out that such features, although good in principle, do not improve classification of raw data.
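Continuing the toy sketch above, such a hand-crafted feature could be appended as extra columns, e.g. a per-series mean speed computed from the first difference along the time axis (again purely illustrative):

velocity = np.diff(X, axis=1)                               # (n_samples, n_timepoints - 1, d)
mean_speed = np.linalg.norm(velocity, axis=2).mean(axis=1)  # one scalar per series
X_with_features = np.hstack([X_flat, mean_speed[:, None]])
clf = SVC().fit(X_with_features, y)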
If you have lots and lots and lots and lots of data (say an internet full of good examples), deep learning neural networks can automatically learn features to some extent, but let's say this is rather advanced. In the end, most practical applications come down to trial and error. See what features you can come up with and try them out in practice. And beware the overfitting gremlin.

Balancing an imbalanced dataset with the Keras image generator

The Keras ImageDataGenerator can be used to "Generate batches of tensor image data with real-time data augmentation".
The tutorial here demonstrates how a small but balanced dataset can be augmented using the ImageDataGenerator. Is there an easy way to use this generator to augment a heavily unbalanced dataset, such that the resulting, generated dataset is balanced?
This would not be a standard approach to deal with unbalanced data. Nor do I think it would be really justified - you would be significantly changing the distributions of your classes, where the smaller class is now much less variable. The larger class would have rich variation, the smaller would be many similar images with small affine transforms. They would live on a much smaller region in image space than the majority class.
The more standard approaches would be:
the class_weight argument in model.fit, which you can use to make the model learn more from the minority class.
reducing the size of the majority class.
accepting the imbalance. Deep learning can cope with this, it just needs lots more data (the solution to everything, really).
The first two options are really kind of hacks, which may harm your ability to cope with real world (imbalanced) data. Neither really solves the problem of low variability, which is inherent in having too little data. If application to a real world dataset after model training isn't a concern and you just want good results on the data you have, then these options are fine (and much easier than making generators for a single class).
The third option is the right way to go if you have enough data (as an example, the recent paper from Google about detecting diabetic retinopathy achieved high accuracy in a dataset where positive cases were between 10% and 30%).
If you truly want to generate a variety of augmented images for one class over another, it would probably be easiest to do it in pre-processing. Take the images of the minority class and generate some augmented versions, and just call it all part of your data. Like I say, this is all pretty hacky.
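If you do go the pre-processing route, a rough sketch with ImageDataGenerator could look like the following; the directory paths, image size and transform parameters are placeholders, and the output directory must already exist:

from keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(rotation_range=15, width_shift_range=0.1,
                               height_shift_range=0.1, horizontal_flip=True)

flow = augmenter.flow_from_directory(
    'data/minority_only',               # contains a single sub-folder holding the rare class's images
    target_size=(224, 224),
    batch_size=32,
    save_to_dir='data/minority_augmented',
    save_format='jpeg')

# Each batch drawn from the generator is augmented and written to save_to_dir;
# the saved images can then be added to the training set like any other data.
for _ in range(10):
    next(flow)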
You can use this strategy to calculate weights based on the imbalance:
from sklearn.utils import class_weight
import numpy as np

# Compute one weight per class, inversely proportional to its frequency in the
# training data (newer sklearn versions require the keyword arguments).
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_generator.classes),
    y=train_generator.classes)
train_class_weights = dict(enumerate(class_weights))

# Pass the weights to training so that errors on the rare classes cost more
# (in recent Keras versions, model.fit accepts generators directly).
model.fit_generator(..., class_weight=train_class_weights)
This answer was inspired by "Is it possible to automatically infer the class_weight from flow_from_directory in Keras?"
