Does using an image transform significantly slow down training? - pytorch

I see image transforms used quite often by many deep learning researchers, and they seem to be treated as if they cost no GPU or CPU cycles at all.
Example:
from torchvision import datasets, transforms

transformations = transforms.Compose([
    transforms.Resize(255),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
train_set = datasets.ImageFolder(data_dir + "/train", transform=transformations)
In this specific case, wouldn't it be far better to process the images upfront and save them out for future use in some other format? I see this done sometimes, but extremely rarely.
Or am I wrong, and transforms on a GPU are just so fast that it's not worth the extra code or hassle?

It really depends on how you set up the DataLoader. Generally, the transforms are performed on the CPU, and the transformed data is then moved to the GPU. With num_workers > 0, PyTorch DataLoaders run the transforms in background worker processes, and the prefetch_factor argument controls how many batches each worker prepares ahead of time, so the data pipeline runs in parallel with the GPU computing the model. That being said, with fixed transforms like the ones you have here, pre-computing the entire dataset and saving it before training could also be a valid strategy.
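A minimal sketch of that setup (the batch size and worker count are arbitrary; train_set is the dataset from the question):

from torch.utils.data import DataLoader

# Worker processes run the CPU-side transforms while the GPU trains on the
# previous batch; prefetch_factor batches are prepared ahead of time per
# worker (it only applies when num_workers > 0).
train_loader = DataLoader(
    train_set,
    batch_size=64,
    shuffle=True,
    num_workers=4,
    prefetch_factor=2,
    pin_memory=True,  # speeds up host-to-GPU copies
)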

Related

PyTorch training with batches of different lengths?

Is it possible to train a model with batches of unequal length during an epoch? I am new to PyTorch.
If you take a look at the DataLoader documentation, you'll see the drop_last parameter: when the dataset size is not divisible by the batch size, the last batch has a different size unless you drop it. So basically the answer is yes, it is possible, it happens often, and it does not affect the training of a neural network (too much).
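For instance (a toy sketch; my_dataset stands in for any map-style dataset):

from torch.utils.data import DataLoader

# With 1000 samples and batch_size=64, the last batch only has 1000 % 64 = 40
# samples; setting drop_last=True would discard it instead.
loader = DataLoader(my_dataset, batch_size=64, shuffle=True, drop_last=False)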
However, you must be a bit careful: some PyTorch layers deal poorly with very small batch sizes. For example, if you have BatchNorm layers and you get a batch of size 1, you'll get errors because batchnorm at some point divides by len(batch) - 1. More generally, training a network that has batchnorms generally requires batches of significant size, say at least 16 (the literature generally aims for 32 or 64). So if you happen to have variable-size batches, take the time to check whether your layers have requirements in terms of batch size for optimal training and convergence. But except in particular cases, your network will train anyway, no worries.
As for how to make batches with custom sizes, I suggest you look at and take inspiration from the PyTorch implementation of DataLoader and Sampler. You may want to implement something similar to BatchSampler and use the batch_sampler argument of DataLoader.
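A rough sketch of that idea (the fixed list of batch sizes is purely illustrative; batch_sampler and Sampler are the actual PyTorch hooks):

import torch
from torch.utils.data import DataLoader, Sampler

class VariableBatchSampler(Sampler):
    """Yields lists of indices; each list becomes one batch."""
    def __init__(self, dataset_len, batch_sizes):
        self.dataset_len = dataset_len
        self.batch_sizes = batch_sizes  # e.g. [32, 32, 16], chosen by you

    def __iter__(self):
        indices = torch.randperm(self.dataset_len).tolist()
        start = 0
        for size in self.batch_sizes:
            yield indices[start:start + size]
            start += size

    def __len__(self):
        return len(self.batch_sizes)

# loader = DataLoader(my_dataset, batch_sampler=VariableBatchSampler(len(my_dataset), [32, 32, 16]))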

What are some ways to speed up data loading on large sparse arrays (~1 million x 1 million, density ~0.0001) in Pytorch?

I am working on a binary classification problem. I have ~1.5 million data points, and the dimensionality of the feature space is 1 million. This dataset is stored as a sparse array, with a density of ~0.0001. For this post, I'll limit the scope by assuming that the model is a shallow feedforward neural network, and also that the dimensionality has already been optimized (so it cannot be reduced below 1 million). Naive approaches to creating mini-batches out of this data to feed into the network take a lot of time (as an example, a basic approach of creating a TensorDataset (map style) from a torch.sparse.FloatTensor representation of the input array and wrapping a DataLoader around it means ~20s to get a mini-batch of 32 to the network, as opposed to, say, ~0.1s to perform the actual training). I am looking for ways to speed this up.
What I've tried
I first figured that reading from such a large sparse array in every iteration of the DataLoader was computationally intensive, so I broke this sparse array down into smaller sparse arrays.
For the DataLoader to read from these multiple sparse arrays iteratively, I replaced the map-style dataset I had inside the DataLoader with an IterableDataset, and streamed the smaller sparse arrays into it like so:
import torch
from itertools import chain
from scipy import sparse

class SparseIterDataset(torch.utils.data.IterableDataset):
    def __init__(self, fpaths):
        super().__init__()
        self.fpaths = fpaths

    def read_from_file(self, fpath):
        # load one sparse shard, densify it, and yield one sample at a time
        data = sparse.load_npz(fpath).toarray()
        for d in data:
            yield torch.Tensor(d)

    def get_stream(self, fpaths):
        return chain.from_iterable(map(self.read_from_file, fpaths))

    def __iter__(self):
        return self.get_stream(self.fpaths)
With this approach, I was able to bring the time down from the naive base case of ~20s to ~0.2s per mini-batch of 32. However, given that my dataset has ~1.5 million samples, this still implies a lot of time spent on even one pass through the dataset. (As a comparison, even though it's slightly apples to oranges, running logistic regression in scikit-learn on the original sparse array takes about ~6s per iteration through the whole dataset. With PyTorch, with the approach I just outlined, it would take ~3000s just to load all the mini-batches in an epoch.)
One thing I am aware of but have yet to try is multi-process data loading by setting the num_workers argument of the DataLoader. I believe this has its own catches in the case of iterable-style datasets, though. And even a 10x speedup would still mean ~300s per epoch spent loading mini-batches. I feel I'm being inordinately slow! Are there any other approaches, improvements or best practices you could suggest?
Your dataset in un-sparsified form would be 1.5M x 1M x 1 byte = 1.5TB as uint8, or 1.5M x 1M x 4 bytes = 6TB as float32. Simply reading 6TB from memory to the CPU could take 5-10 minutes on a modern machine (depending on the architecture), and transfer speeds from CPU to GPU would be a bit slower than that (an NVIDIA V100 on PCIe has a theoretical 32 GB/s).
Approaches:
Benchmark everything individually, e.g. in Jupyter:
data = sparse.load_npz(fpath)
dense = data.toarray()

%timeit sparse.load_npz(fpath)   # load the sparse shard
%timeit data.toarray()           # un-sparsify for comparison
%timeit torch.tensor(dense)      # probably about the same as the line above
Also print out the shapes and datatypes of everything to make sure they are as expected. I haven't tried running your code, but I am pretty sure that (a) sparse.load_npz is extremely fast and unlikely to be a bottleneck, and (b) converting the data to a regular torch tensor produces a dense tensor and is also quite slow here.
Use torch.sparse. I think torch sparse tensors can be used as regular tensors in most cases. You'd have to do some data prep to convert from scipy.sparse to torch.sparse:
A sparse tensor is represented as a pair of dense tensors: a tensor of values and a 2D tensor of indices. A sparse tensor can be constructed by providing these two tensors, as well as the size of the sparse tensor.
You mention torch.sparse.FloatTensor, but I'm pretty sure you're not actually making sparse tensors in your code: there is no reason to expect one to be constructed simply by passing a scipy.sparse array to a regular tensor constructor, since that's not how they're usually made.
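For reference, a minimal sketch of such a conversion, assuming a scipy.sparse matrix mat (e.g. one shard loaded with sparse.load_npz):

import numpy as np
import torch
from scipy import sparse

def scipy_to_torch_sparse(mat):
    # build a torch sparse COO tensor from the scipy COO indices and values
    coo = mat.tocoo()
    indices = torch.from_numpy(np.vstack((coo.row, coo.col))).long()
    values = torch.from_numpy(coo.data).float()
    return torch.sparse_coo_tensor(indices, values, size=coo.shape)

# sparse_batch = scipy_to_torch_sparse(sparse.load_npz(fpath))
# dense_on_gpu = sparse_batch.to("cuda").to_dense()  # densify only on the GPU, if needed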
If you figure out a good way to do this, I recommend you post it as a project or gist on GitHub, it would be quite useful.
If torch.sparse doesn't work out, think of other ways to either convert the data to dense only on the GPU, or avoid converting it entirely.
See also:
https://towardsdatascience.com/sparse-matrices-in-pytorch-be8ecaccae6
https://github.com/rusty1s/pytorch_sparse

Is there any support for BiPlots when using PCA in spark.ml?

I have used k-means and PCA to attempt to visualise high-dimensional k-means clusters in two dimensions, but have lost the meaning of the clusters in 2D.
Is there any way to project the features onto the 2D plot to recover some interpretability?
Any non-linear dimensionality reduction method might work better (these are also called "manifold learning" methods; see e.g. sklearn's suite). t-SNE is generally quite popular for this.
However, these do not take your cluster labels into account. If you wanted to do that (although generally you do not), you could add a penalty to the manifold learning technique that forces same-cluster points to be close together, for example.
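A minimal sklearn sketch, assuming the features have already been collected into a local array X and labels holds the k-means cluster assignments (both names are placeholders):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project the high-dimensional features to 2D and colour the points by cluster.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=5, cmap="tab10")
plt.title("t-SNE projection coloured by k-means cluster")
plt.show()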

balancing an imbalanced dataset with keras image generator

The keras ImageDataGenerator can be used to "Generate batches of tensor image data with real-time data augmentation".
The tutorial here demonstrates how a small but balanced dataset can be augmented using the ImageDataGenerator. Is there an easy way to use this generator to augment a heavily unbalanced dataset, such that the resulting, generated dataset is balanced?
This would not be a standard approach to dealing with unbalanced data, nor do I think it would really be justified: you would be significantly changing the distributions of your classes, and the smaller class would now be much less variable. The larger class would have rich variation, while the smaller would consist of many similar images with small affine transforms, living in a much smaller region of image space than the majority class.
The more standard approaches would be:
the class_weight argument of model.fit, which you can use to make the model learn more from the minority class.
reducing the size of the majority class.
accepting the imbalance. Deep learning can cope with this, it just needs lots more data (the solution to everything, really).
The first two options are really kind of hacks, which may harm your ability to cope with real world (imbalanced) data. Neither really solves the problem of low variability, which is inherent in having too little data. If application to a real world dataset after model training isn't a concern and you just want good results on the data you have, then these options are fine (and much easier than making generators for a single class).
The third option is the right way to go if you have enough data (as an example, the recent paper from Google about detecting diabetic retinopathy achieved high accuracy in a dataset where positive cases were between 10% and 30%).
If you truly want to generate a variety of augmented images for one class over another, it would probably be easiest to do it in pre-processing. Take the images of the minority class and generate some augmented versions, and just call it all part of your data. Like I say, this is all pretty hacky.
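A rough sketch of that pre-processing idea (the array name, output directory, and loop count are made up; ImageDataGenerator and flow are the actual Keras API):

from keras.preprocessing.image import ImageDataGenerator

# minority_images: (n, height, width, channels) array of the under-represented class
augmenter = ImageDataGenerator(rotation_range=15,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               horizontal_flip=True)

flow = augmenter.flow(minority_images, batch_size=32,
                      save_to_dir="data/train/minority",
                      save_prefix="aug", save_format="png")

# Each call to next() writes one augmented batch of 32 images to disk.
for _ in range(10):  # pick the count so the classes end up roughly balanced
    next(flow)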
You can use this strategy to calculate weights based on the imbalance:
from sklearn.utils import class_weight
import numpy as np

class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_generator.classes),
    y=train_generator.classes)
train_class_weights = dict(enumerate(class_weights))

model.fit_generator(..., class_weight=train_class_weights)
This answer was inspired by Is it possible to automatically infer the class_weight from flow_from_directory in Keras?

Scikit SVM gives very poor accuracy for STL-10 dataset

I am using a Scikit-learn SVM to train a model on the STL-10 dataset, which contains 5000 training images (10 pre-defined folds). So I have a 5000 x 96 x 96 x 3 dataset for training and test purposes. I used the following code to train it and measure the accuracy on the test set (80% / 20% split). The final result was 0.323 accuracy. How can I increase the accuracy of the SVM?
This is the STL-10 dataset.
def train_and_evaluate(clf, train_x, train_y):
    clf.fit(train_x, train_y)

# make a 2D array, as fit() only accepts 2D input
nsamples, nx, ny, nz = images.shape
reshaped_train_dataset = images.reshape((nsamples, nx * ny * nz))

X_train, X_test, Y_train, Y_test = train_test_split(
    reshaped_train_dataset, read_labels(LABEL_PATH),
    test_size=0.20, random_state=33)

train_and_evaluate(my_svc, X_train, Y_train)
print(metrics.accuracy_score(Y_test, my_svc.predict(X_test)))
So it seems you are using a raw SVM directly on the images. That is usually not a good idea (it is rather bad, actually).
I will describe the classic image-classification pipeline that was popular over the last decades! Keep in mind that the highest-performing approaches right now might use deep neural networks to combine some of these steps (a very different approach, and a lot of research in recent years!).
First step:
Preprocessing is needed!
Normalize mean and variance (I would not expect your dataset to be already normalized)
Optional: histogram-equalization
Second step:
Feature extraction -> you should learn some features from these images. There are many approaches, including:
(Kernel-)PCA
(Kernel-)LDA
Dictionary-learning
Matrix-factorization
Local binary patterns
... (just test with LDA initially)
Third:
SVM for classification
Again, there might be a normalization step needed before this, and as mentioned in the comments by David Batista, there might be some parameter tuning needed (especially for the kernel SVM).
It is also not clear whether using color information is wise here. For simpler approaches I would expect black-and-white images to be superior (you lose information, but tuning your pipeline is more robust; high-performance approaches will of course use color information).
See here for a random tutorial describing a similar problem. While I don't know if it's good work, you can immediately recognize the processing pipeline mentioned above (preprocessing, feature extraction, classifier learning)!
Edit:
Why preprocessing? Some algorithms assume centered samples with unit variance, so normalization is needed. This is (at least) very important for PCA, LDA and SVMs.
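A minimal sklearn sketch of such a pipeline, reusing X_train/X_test from the question; PCA stands in for the feature-extraction step and the grid values are purely illustrative:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# normalize -> extract features -> classify, with a small grid search over the RBF-SVM parameters
pipe = Pipeline([
    ("scale", StandardScaler()),      # zero mean, unit variance
    ("pca", PCA(n_components=100)),   # simple stand-in for the feature-extraction step
    ("svm", SVC(kernel="rbf")),
])

param_grid = {"svm__C": [1, 10, 100], "svm__gamma": ["scale", 1e-3, 1e-4]}
search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1)
search.fit(X_train, Y_train)
print(search.best_params_, search.score(X_test, Y_test))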
