How to build a customized dataset from MNIST in PyTorch

I am importing the MNIST dataset as train_data_MNIST = torchvision.datasets.MNIST(root=path+"MNIST", train=True, transform=transforms, download=True) and I am trying to make a smaller dataset from MNIST, say the first 10,000 images and their corresponding labels. I know this can be handled with torch.utils.data.Subset. But what I want is a torchvision.datasets object (if I directly apply torch.utils.data.Subset to the train_data_MNIST listed above, the result is an object of the torch.utils.data.Subset class).
Is there any possible way such that I can use a fraction of the original MNIST dataset to create a new dataset (not subset)?
Thanks in advance.

What about modifying data and targets directly? For example:
dataset = torchvision.datasets.MNIST(root=path+"MNIST", train=True,transform=transforms, download=True)
dataset.data = dataset.data[:10000]
dataset.targets = dataset.targets[:10000]
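If it helps, a quick sanity check (a sketch; it assumes the snippet above has already run and that transform includes ToTensor) shows the truncated dataset still behaves like a regular torchvision dataset and can be wrapped in a DataLoader:

from torch.utils.data import DataLoader

print(len(dataset))  # 10000
loader = DataLoader(dataset, batch_size=64, shuffle=True)
images, labels = next(iter(loader))
print(images.shape, labels.shape)  # e.g. torch.Size([64, 1, 28, 28]) torch.Size([64])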

Related

Imbalanced Image Dataset using Pytorch

I am trying to balance my image dataset using WeightedRandomSampler, but after loading the data with a DataLoader I am unable to split the dataset into train and test sets. Could anyone please guide me in this regard?
You should split your Dataset (e.g., using data.random_split), not your DataLoader. The split should be agnostic to the way you sample/process your training data. Only after you have a training split of the data can you apply WeightedRandomSampler to it, as in the sketch below.
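A minimal sketch of that order of operations (the dataset name, split ratio, and the assumption that the dataset exposes a .targets attribute, as ImageFolder does, are placeholders):

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler, random_split

# full_dataset is a placeholder for your image Dataset (e.g. an ImageFolder).
train_len = int(0.8 * len(full_dataset))
train_set, test_set = random_split(full_dataset, [train_len, len(full_dataset) - train_len])

# Per-sample weights computed from the labels of the training split only.
train_labels = torch.tensor([full_dataset.targets[i] for i in train_set.indices])
class_counts = torch.bincount(train_labels)
sample_weights = 1.0 / class_counts[train_labels].float()

sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights), replacement=True)
train_loader = DataLoader(train_set, batch_size=32, sampler=sampler)
test_loader = DataLoader(test_set, batch_size=32)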

Reshape Images from ImageDataGenerator

from tensorflow.keras.preprocessing.image import ImageDataGenerator
Can we also reshape images with the image data generator's flow_from_directory method?
For example, we have color images in 10 classes in 10 folders, and we are providing the path to that directory, say train:
gen = ImageDataGenerator(rescale=1./255, width_shift_range=0.05, height_shift_range=0.05)
train_imgs = gen.flow_from_directory(
    '/content/data/train',
    target_size=(10, 10),
    batch_size=1,
    class_mode='categorical')
Now my model takes input of shape 300, and I want to define training data from train_imgs, whose images are 10x10x3.
Is there any library, method, or option available to convert this data generator into a matrix whose columns are the flattened image vectors?
Generally the best option in these cases is to add a Reshape layer to the start of your model: layers.Reshape((300,), input_shape=(10,10,3)). You can also do layers.Reshape((-1,), input_shape=(10,10,3)), and it will automatically figure out the correct output length. A sketch of how this fits into a model follows.
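A minimal sketch of such a model (the Dense layers and compile settings are placeholders, not part of the original answer):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Reshape((300,), input_shape=(10, 10, 3)),  # flattens each 10x10x3 image to a length-300 vector
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(train_imgs, epochs=5)  # train_imgs is the generator defined above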

Loading Training Images using Keras

To train a model using Keras, should I load all the images I have into an array to create something like
x_train, y_train
Or is there a better way to read the images on the fly while training? I am not looking for the ImageDataGenerator class, since my output is an array of points, not classes based on directory names.
I managed to get my data CSV file to contain the array of points and the image file name in 9 columns as follows:
x1 x2 ..... x8 Image_file_name
You can use this data with ImageDataGenerator. You incorrectly assume that it needs folders for classes, but that only applies to flow_from_directory. The flow_from_dataframe method allows you to load data from a Pandas dataframe, for example:
idg = ImageDataGenerator(...)
df = pd.read_csv('your_data.csv')
generator = idg.flow_from_dataframe(dataframe=df, directory='image folder',
                                    x_col='filename_column',
                                    y_col=['col1', 'col2', ..., 'coln'],
                                    class_mode='other')
This generator will read data from the dataframe, load the image file in directory whose name is given by the value of x_col, and use the corresponding row to build the targets, which in this case will be a numpy array of the values of the columns in y_col. More information about this method can be found in the Keras documentation.
Loading the entire data set into memory as an array is not a great idea because the memory consumption could get out of control, so you should use a generator. ImageDataGenerator and flow_from_dataframe are a great way of loading images in Keras. Since you don't want to use ImageDataGenerator (can you mention why?), you can create your own generator function that loads chunks of images into memory; a rough sketch of such a generator follows below. If you load your data with a generator, make sure you use the fit_generator and predict_generator functions.
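A rough sketch of such a hand-rolled generator (the target column names follow the 9-column CSV described above; the image size, batch size, and directory layout are assumptions):

import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.image import load_img, img_to_array

def csv_generator(csv_path, image_dir, batch_size=32, target_size=(224, 224)):
    df = pd.read_csv(csv_path)
    target_cols = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8']
    while True:  # Keras expects the generator to loop indefinitely
        df = df.sample(frac=1)  # reshuffle each epoch
        for start in range(0, len(df), batch_size):
            batch = df.iloc[start:start + batch_size]
            images = np.stack([
                img_to_array(load_img(f"{image_dir}/{name}", target_size=target_size)) / 255.0
                for name in batch['Image_file_name']
            ])
            targets = batch[target_cols].to_numpy(dtype=np.float32)
            yield images, targets

# model.fit_generator(csv_generator('your_data.csv', 'image folder'), steps_per_epoch=..., epochs=...)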
To load unlabeled data you can do the following hack:
datagen = ImageDataGenerator()
test_data = datagen.flow_from_directory('.', classes=['directory_where_images_are_stored'])
For more information check out link [1].
[1] https://kylewbanks.com/blog/loading-unlabeled-images-with-imagedatagenerator-flowfromdirectory-keras

How do you treat a new sample after training a model using sklearn preprocessing scale?

Assume I have a dataset X and labels Y for a supervised machine learning task.
Assume X has 10 features and 1,000 samples, and I believe it is appropriate to scale my data using sklearn.preprocessing.scale. I perform this operation and train my model.
I now wish to use the trained model on new data, so I collect a new sample of the 10 features of X and wish to use my trained model to classify this sample.
Is there an easy way to apply the same scaling that was performed on X before training my model to this single new sample, before attempting classification?
If not, is the only solution to have retained a copy of X before scaling and to add my new sample to this data and then scale this dataset and attempt classification on the new sample after it has been scaled via this process?
Use the class API instead of the function API, e.g. preprocessing.MinMaxScaler or preprocessing.StandardScaler:
http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
The function scale provides a quick and easy way to perform this operation on a single array-like dataset.
The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set.
Let's say you have the training dataset training_dataset and you did the following to scale it:
x__feature_scaler = MinMaxScaler(feature_range = (0, 1))
training_scaled_dataset = x__feature_scaler.fit_transform(training_dataset)
Use the same instance of MinMaxScaler to scale the new dataset. If your new dataset is new_dataset, do the following:
new_scaled_dataset = x__feature_scaler.transform(new_dataset)
That way you will scale your new dataset to the same scale as your training dataset.
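If training and prediction happen in separate runs, a common pattern is to persist the fitted scaler and load it back at prediction time. A minimal sketch (the filename is an assumption, and using joblib here is a choice not made in the original answer):

import joblib

# After fitting on the training data, save the scaler alongside the model...
joblib.dump(x__feature_scaler, 'scaler.joblib')

# ...and at prediction time load it back and transform the incoming sample(s).
scaler = joblib.load('scaler.joblib')
new_scaled_dataset = scaler.transform(new_dataset)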

Training Keras model with Dask Array is very slow

I want to use Dask to read a large dataset and feed it to a Keras model. The data consists of audio files, and I am using a custom function to read them. I have tried to apply delayed to this function and I collect all of the files in a dask array, as:
x = da.stack([da.from_delayed(delayed(get_item_data)(fp, sr, mono, post_processing, data_shape), shape=data_shape, dtype=np.float32) for fp in df['path']])
(See the source)
To train the Keras model, I compute X and Y as above and pass them to the fit function.
However, the training is very slow. I have tried changing the chunksize and it is still very slow.
Could you tell me if I am doing something wrong when creating the array? Or any good practices for it?
Thanks
As far as I know, Keras doesn't have any built-in support for dask arrays, so I'm not sure what will happen when you provide a dask array directly to Keras functions. My guess is that it will automatically convert the dask array into a (possibly very large) numpy array. A sketch of how to make that conversion explicit follows.
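A hedged sketch of making the conversion explicit (the model, the labels y, and the epoch count are assumptions; x is the dask array built above): either materialise the whole array up front, or stream it block by block so only one chunk is ever converted to numpy at a time.

import numpy as np

# Option 1: explicit one-off conversion, if the data fits in memory.
x_np = x.compute()
# model.fit(x_np, y, ...)

# Option 2: stream blocks through train_on_batch.
num_epochs = 5  # placeholder
chunk_bounds = np.cumsum((0,) + x.chunks[0])  # boundaries of the blocks along axis 0
for epoch in range(num_epochs):
    for i in range(x.numblocks[0]):
        start, stop = chunk_bounds[i], chunk_bounds[i + 1]
        x_batch = x[start:stop].compute()  # only this block becomes a numpy array
        y_batch = y[start:stop]            # assuming y is an in-memory numpy array
        model.train_on_batch(x_batch, y_batch)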
