Imbalanced Image Dataset using Pytorch - pytorch

I am trying to balance my image dataset using WeightedRandomSampler but after loading data using Dataloader, I am unable to split the dataset to train and test. Could anyone please guide me in this regard?

You should split your Dataset (e.g., using data.random_split) not your DataLoader. The split should be agnostic to the way you sample/process your training data. Only after you have a training split of the data you can apply WeightedRandomSampler to it.

Related

How to build a customized dataset from MNIST in pytorch

I am importing MNIST dataset as train_data_MNIST = torchvision.datasets.MNIST(root=path+"MNIST", train=True,transform=transforms, download=True)and I am trying to make a smaller dataset from MNIST, let's say the first 10,000 images and corresponding labels. I know this can be handled with torch.utils.data.Subset. But what I want is a torchvision.datasets object (if I directly apply torch.utils.data.Subset to the train_data_MNIST that I list above, the result is an object from torch.utils.data.Subset class).
Is there any possible way such that I can use a fraction of the original MNIST dataset to create a new dataset (not subset)?
Thanks in advance.
What about modifying data and targets directly? For example:
dataset = torchvision.datasets.MNIST(root=path+"MNIST", train=True,transform=transforms, download=True)
dataset.data = dataset.data[:10000]
dataset.targets = dataset.targets[:10000]

Train Spacy model with larger-than-RAM dataset

I asked this question to better understand some of the nuances between training Spacy models with DocBins serialized to disk, versus loading Example instances via custom data loading function. The goal was to train a Spacy NER model with more data that can fit into RAM (or at least some way to avoid loading the entire file into RAM). Though the custom data loader seemed like one specific way to accomplish this, I am writing this question to ask more generally:
How can one train a Spacy model without loading the entire training data set file during training?
Your only options are using a custom data loader or setting max_epochs = -1. See the docs.

which is the most suitable method for training among model.fit(), model.train_on_batch(), model.fit_generator()

I have a training dataset of 600 images with (512*512*1) resolution categorized into 2 classes(300 images per class). Using some augmentation techniques I have increased the dataset to 10000 images. After having following preprocessing steps
all_images=np.array(all_images)/255.0
all_images=all_images.astype('float16')
all_images=all_images.reshape(-1,512,512,1)
saved these images to H5 file.
I am using an AlexNet architecture for classification purpose with 3 convolutional, 3 overlap max-pool layers.
I want to know which of the following cases will be best for training using Google Colab where memory size is limited to 12GB.
1. model.fit(x,y,validation_split=0.2)
# For this I have to load all data into memory and then applying an AlexNet to data will simply cause Resource-Exhaust error.
2. model.train_on_batch(x,y)
# For this I have written a script which randomly loads the data batch-wise from H5 file into the memory and train on that data. I am confused by the property of train_on_batch() i.e single gradient update. Do this will affect my training procedure or will it be same as model.fit().
3. model.fit_generator()
# giving the original directory of images to its data_generator function which automatically augments the data and then train using model.fit_generator(). I haven't tried this yet.
Please guide me which will be the best among these methods in my case. I have read many answers Here, Here, and Here about model.fit(), model.train_on_batch() and model.fit_generator() but I am still confused.
model.fit - suitable if you load the data as numpy-array and train without augmentation.
model.fit_generator - if your dataset is too big to fit in the memory or\and you want to apply augmentation on the fly.
model.train_on_batch - less common, usually used when training more than one model at a time (GAN for example)

Train, Save and Load a Tensorflow Model

Refer to this to train a GAN model for MNIST dataset, I want to save a model and restore it for further prediction. After having some understanding of Saving and Importing a Tensorflow Model I am able to save and restore some variables of inputs and outputs but for this network I am able to save the model only after some specific iterations and not able to predict some output.
Did you refer to this guide? It explains very clearly how to load and save tensorflow models in all possible formats.
If you are new to ML, I'd recommend you give Keras a try first, which is much easier to use. See https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model, pretty much you can use:
model.save('my_model.h5')
to save your model to disk.
model = load_model('my_model.h5')
to load your model and make prediction

Train and predict using SVM theory

I have implemented character recognition using a library
but I still don't get how SVM theory works in training and prediction process, I just understand SVM is only finding the hyperplane
E.g., suppose I have a training image as follows
image from google, number zero
How do we find hyperplane for each training data like above?
How is the prediction process is done?
How can the SVM classify the data based on those hyperplane?
Thank you very much if you can help me
You can use opencv and python.Opencv has implemented svm and you can use it by function call.
SVM is machine leraning model for data classification.We can use SVM to classify images.the steps are
you must have a training dataset(a dataset of images whose labels are known)
Extract features [features are color,shape,hog,surf,sift etc..] from that images and store that,also store the assosiated labels
then train svm using these datas
Now you can use svm to predict labels of unkonwn images
this link will help you
First, It is a non linear separable problem you have to implement kernel SVM which projects them into higher dimensional space where it becomes linearly separable. You can use sklearn library to achieve the above.

Resources