How do I pass two ndarrays (data and labels) to a PyTorch DataLoader?

I have training features in an ndarray of shape (100, 400, 3), that is, 100 RGB images of 20x20 pixels, and labels in an ndarray of shape (100,). Do I need to combine them into one dataset, or how else can I pass them to a PyTorch DataLoader so that I can iterate over images and labels later?
What I've tried so far:
from torchvision import transforms
# turn the ndarray of features and labels into tensors
transform = transforms.Compose([transforms.ToPILImage(),
                                transforms.ToTensor()])

As @Shai mentioned, DataLoader requires its input to be a Dataset or an instance of one of its subclasses. One of the simplest subclasses is TensorDataset, which you can build from tensors converted from your ndarrays.
import torch
import numpy as np
import torch.utils as utils
train_x = torch.Tensor(np.random.randn(100,400,3))
train_y = torch.Tensor(np.random.randint(0,2,100))
dataset = utils.data.TensorDataset(train_x, train_y)
dataloader = utils.data.DataLoader(dataset)

You can convert your data/label ndarrays to torch.tensor and use torch.utils.data.TensorDataset to create a dataset that iterates over your examples.
Once you have a dataset, you can wrap a DataLoader around it to be used for training.
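A minimal sketch of this approach, assuming float features and integer class labels with the shapes from the question. The variable names, the batch size of 10, and the reshape to channels-first (3, 20, 20) images are illustrative assumptions, not part of the original answers.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data with the shapes from the question (hypothetical values).
features = np.random.randn(100, 400, 3).astype(np.float32)
labels = np.random.randint(0, 2, size=100)

# Reshape each flat 400-pixel image to 20x20 and move channels first,
# the (N, C, H, W) layout that PyTorch convolution layers expect.
images = features.reshape(100, 20, 20, 3).transpose(0, 3, 1, 2)
x = torch.from_numpy(np.ascontiguousarray(images))
y = torch.from_numpy(labels).long()  # integer class indices

dataset = TensorDataset(x, y)
loader = DataLoader(dataset, batch_size=10, shuffle=True)

for batch_images, batch_labels in loader:
    # batch_images has shape (10, 3, 20, 20), batch_labels has shape (10,)
    pass
```

Each iteration yields one (images, labels) pair, so a training step can go where the `pass` is.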

Related

How to Preprocess 'Cats vs Dogs' Tensorflow Datasets in order to deal with it in CNN?

I have a problem with preprocessing the TensorFlow 'cats vs dogs' dataset.
I loaded data like this:
dataset, info = tfds.load(name='cats_vs_dogs', split=tfds.Split.TRAIN, with_info=True)
Then, I'd like to define preprocess function like this:
def preprocess(features):
Then, I'd like to use this preprocess function like this:
train_dataset = dataset.map(preprocess).batch(32)
where train_dataset is the train set that I would use in fitting my model.
However, I have no idea how to preprocess my loaded data. Specifically, I don't even know what sort of data type dataset is.
Please help me to solve this problem. Thank You
To learn more, see the documentation on TensorFlow datasets and input pipelines. To prepare the data, you can use this function:
def preprocess(features):
    print(features['image'], features['label'])
    image = tf.image.resize(features['image'], [224,224])
    image = tf.divide(image, 255)
    print(image)
    label = features['label']
    print(label)
    return image, tf.cast(label, tf.float32)
Hope this helps. By the way, don't use softmax in your model; use sigmoid instead.
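A small numeric illustration of why sigmoid is the safer choice here, assuming the model ends in a single output unit (which the float label cast above suggests, though the answer does not say so). The helper functions and the logit values are made up for the example.

```python
import numpy as np

def softmax(logits):
    # Normalise along the last axis; subtract the max for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical raw outputs of a single-unit final layer for three images.
logits = np.array([[-2.0], [0.0], [3.0]])

softmax_out = softmax(logits)  # every row normalises to 1.0, whatever the logit
sigmoid_out = sigmoid(logits)  # values between 0 and 1, usable as P(label == 1)
```

With only one unit, softmax has nothing to normalise against, so it carries no information about the class; sigmoid does.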
First, you need to study the dataset.
Then you can preprocess the dataset so that all images have the same size. Here, 200x200 images are used.
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
X = []
img= load_img(file_path, target_size=(200, 200))
img= img_to_array(img)
X.append(img)
You need to do this for every image before training and testing.
Then you can split the dataset into two parts, training data and testing data.
A 70%/30% split is a common choice. For this you can use:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.30)
Thank you!

How to use Tensorflow 2 Dataset API with Keras?

This question has been answered for Tensorflow 1, eg: How to Properly Combine TensorFlow's Dataset API and Keras?, but this answer hasn't helped for my use case.
Below is an example of a model with three float32 inputs and one float32 output. I have a large amount of data that doesn't all fit into memory at once, so it's split into separate files. I'm trying to use the Dataset API to train a model by bringing in a portion of the training data at once.
import tensorflow as tf
import tensorflow.keras.layers as layers
import numpy as np
# Create TF model of a given architecture (number of hidden layers, layersize, #outputs, activation function)
def create_model(h=2, l=64, activation='relu'):
    model = tf.keras.Sequential([
        layers.Dense(l, activation=activation, input_shape=(3,), name='input_layer'),
        *[layers.Dense(l, activation=activation) for _ in range(h)],
        layers.Dense(1, activation='linear', name='output_layer')])
    return model
# Load data (3 X variables, 1 Y variable) split into 5 files
# (for this example, just create a list 5 numpy arrays)
list_of_training_datasets = [np.random.rand(10,4).astype(np.float32) for _ in range(5)]
validation_dataset = np.random.rand(30,4).astype(np.float32)
def data_generator():
    for data in list_of_training_datasets:
        x_data = data[:, 0:3]
        y_data = data[:, 3:4]
        yield((x_data,y_data))
# prepare model
model = create_model(h=2,l=64,activation='relu')
model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam())
# load dataset
dataset = tf.data.Dataset.from_generator(data_generator,(np.float32,np.float32))
# fit model
model.fit(dataset, epochs=100, validation_data=(validation_dataset[:,0:3],validation_dataset[:,3:4]))
Running this, I get the error:
ValueError: Cannot take the length of shape with unknown rank.
Does anyone know how to get this working? I would also like to be able to use the batch dimension, to load two data files at a time, for example.
You need to specify the shapes of your dataset along with the returned data types, like this:
dataset = tf.data.Dataset.from_generator(data_generator,
(np.float32,np.float32),
((None, 3), (None, 1)))
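To also get control over the batch dimension, as asked in the question, one option is to flatten the per-file arrays into single rows and re-batch them. A rough sketch, assuming TensorFlow 2.4 or newer (for the output_signature argument) and the same five 10-row arrays as in the question; the batch size of 20, i.e. two files at a time, is an illustrative choice.

```python
import numpy as np
import tensorflow as tf

# Same stand-in data as in the question: 5 "files" of 10 rows, 3 inputs + 1 target.
list_of_training_datasets = [np.random.rand(10, 4).astype(np.float32) for _ in range(5)]

def data_generator():
    for data in list_of_training_datasets:
        yield data[:, 0:3], data[:, 3:4]

dataset = tf.data.Dataset.from_generator(
    data_generator,
    output_signature=(
        tf.TensorSpec(shape=(None, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(None, 1), dtype=tf.float32),
    ),
)

# Split each file into single rows, then regroup 20 rows (two files) per batch.
rebatched = dataset.unbatch().batch(20)

# Collect the (x, y) shapes of every batch to see how the 50 rows are grouped.
batch_shapes = [(x.shape.as_list(), y.shape.as_list()) for x, y in rebatched]
```

The 50 rows come out as two batches of 20 and a final batch of 10; a dataset like `rebatched` can be passed to model.fit directly.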
The following works, but I don't know if this is the most efficient.
As far as I understand, if your training dataset is split into 10 pieces, you should set steps_per_epoch=10, so that each epoch steps through all the data once. dataset.repeat() is needed because the dataset iterator is "used up" after the first epoch; .repeat() ensures that the iterator is created again once it has been exhausted.
import numpy as np
import tensorflow.keras.layers as layers
import tensorflow as tf
# Create TF model of a given architecture (number of hidden layers, layersize, #outputs, activation function)
def create_model(h=2, l=64, activation='relu'):
    model = tf.keras.Sequential([
        layers.Dense(l, activation=activation, input_shape=(3,), name='input_layer'),
        *[layers.Dense(l, activation=activation) for _ in range(h)],
        layers.Dense(1, activation='linear', name='output_layer')])
    return model
# Load data (3 X variables, 1 Y variable) split into 5 files
# (for this example, just create a list 5 numpy arrays)
list_of_training_datasets = [np.random.rand(10,4).astype(np.float32) for _ in range(5)]
steps_per_epoch = len(list_of_training_datasets)
validation_dataset = np.random.rand(30,4).astype(np.float32)
def data_generator():
    for data in list_of_training_datasets:
        x_data = data[:, 0:3]
        y_data = data[:, 3:4]
        yield((x_data,y_data))
# prepare model
model = create_model(h=2,l=64,activation='relu')
model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam())
# load dataset
dataset = tf.data.Dataset.from_generator(data_generator,output_types=(np.float32,np.float32),
output_shapes=(tf.TensorShape([None,3]), tf.TensorShape([None,1]))).repeat()
# fit model
model.fit(dataset.as_numpy_iterator(), epochs=10,steps_per_epoch=steps_per_epoch,
validation_data=(validation_dataset[:,0:3],validation_dataset[:,3:4]))

Image classification using the ResNet50 model pretrained on ImageNet with my custom labels

I am working on a multi-class image classification problem.
I am using the ResNet50 model ( https://keras.io/applications/#classify-imagenet-classes-with-resnet50 ) with weights pretrained on ImageNet, using Keras.
I get output labels for the images I pass to the model.
But now I have image data and label data from my own dataset.
When I pass the images to the ResNet50 model, it returns the ImageNet labels it was trained on. I want the output to be my own labels, which are already in my dataset, instead of the ImageNet labels.
How do I fine-tune the labels of a ResNet50 model pretrained on ImageNet in Keras?
I have tried the ResNet50 model on its own and it works fine. But how do I change the output to my own labels instead of the pretrained ImageNet labels?
from keras.applications.resnet50 import ResNet50
from keras.preprocessing import image
from keras.applications.resnet50 import preprocess_input, decode_predictions
import numpy as np
import os
model = ResNet50(weights='imagenet')
path='/Users/resnet-sample/'
img_path=os.listdir(path)
count=0
for i in img_path:
    img = image.load_img(path+i, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    preds = model.predict(x)
    print('Predicted:', decode_predictions(preds, top=1)[0], i)
    count=count+1
print(preds)
Example:
I have an elephant image in JPG format, labelled 'elephant' in my dataset.
When I pass this image to the ResNet50 model with the pretrained ImageNet weights, the output I receive is 'African-Elephant' (the ImageNet label).
So instead of getting the ImageNet label as output, I want the model to output 'elephant', the label in my dataset.
I am not sure how to fine-tune the last layers so that the output is my labels instead of the ImageNet labels.
Please help me with this.
Thanks,
Srknt73
The weights argument should be either None (random initialization), 'imagenet' (pre-training on ImageNet), or the path to a weights file to be loaded. It does not take a label file, so to get your own labels you need to give it the path to weights from a model trained on your dataset.
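One common way to obtain such a model is to keep the pretrained convolutional base and train a new output layer sized for your own classes. The following is only a sketch under assumptions: it uses tf.keras rather than standalone Keras, the function name build_finetune_model is hypothetical, and the optimizer and loss are illustrative choices that expect integer class indices as labels.

```python
import tensorflow as tf

def build_finetune_model(num_classes, weights="imagenet"):
    # Load the convolutional base without the 1000-class ImageNet classifier.
    base = tf.keras.applications.ResNet50(
        weights=weights, include_top=False, pooling="avg",
        input_shape=(224, 224, 3))
    base.trainable = False  # freeze the pretrained features at first

    # New classification head sized for the custom label set.
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(base.output)
    model = tf.keras.Model(inputs=base.input, outputs=outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

After training on your own images, a prediction is mapped back to your label with something like class_names[np.argmax(preds[0])], where class_names is a list you define yourself in the same order as the class indices used for training; decode_predictions only knows the ImageNet labels.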

data augmentation in Keras for large datasets

I'm using Keras to train a model for image classification and am working with ~50k images. Each image has three channels and is 150x150 pixels. I have to use floats to store the images because of the minute differences in image intensities between the three channels. I'm using a GPU for training, but I do not have a lot of memory on my graphics card, nor do I have the money to upgrade my GPU. I also have to augment my dataset because my training images do not cover all the possible rotations and translations in my testing dataset.
I have written my own generator that splits the input images and labels into chunks before feeding it to Keras' data augmentation routine and model.fit(). Below is my code:
from __future__ import print_function
from keras.preprocessing.image import ImageDataGenerator
from keras.utils import np_utils
from keras.callbacks import Callback
from keras.callbacks import ModelCheckpoint
from keras.callbacks import ReduceLROnPlateau
from keras.callbacks import CSVLogger
from keras.callbacks import EarlyStopping, TensorBoard, LearningRateScheduler
from keras.optimizers import SGD, Adam, RMSprop
from keras import backend as K
import tensorflow as tf
from sklearn.model_selection import train_test_split
import numpy as np
import math
import myCNN # my own convolutional neural network
def myBatchGenerator(X_train_large, y_train_large, chunk_size):
    number_of_images = len(y_train_large)
    while True:
        batch_start = 0
        batch_end = chunk_size
        while batch_start < number_of_images:
            limit = min(batch_end, number_of_images)
            X = X_train_large[batch_start:limit,:,:,:]
            y = y_train_large[batch_start:limit,:]
            yield(X,y)
            batch_start += chunk_size
            batch_end += chunk_size
if __name__ == '__main__':
    input_image_shape = (150,150,3)
    # read input images and labels
    # X_train_large is an array of type float16
    # y_train_large is an array of size number of images x number of classes
    X_train_large, y_train_large = myFunctionToReadTrainingImagesAndLabels()
    # validation images: about 5000 images
    X_validation_large, y_validation_large = myFunctionToReadValidationImagesAndLabels()
    # create a stratified sample from the large training set. use 100 samples from each class
    y_train_large_vectors = [np.where(r == 1)[0][0] for r in y_train_large]
    unique, counts = np.unique(y_train_large_vectors, return_counts=True)
    X_train_sample = np.empty((12000, 150, 150, 3))
    y_train_sample = np.empty((12000, 12))
    for idx in range(num_classes):
        start_idx_for_sample = 100*idx
        end_idx_for_sample = start_idx_for_sample+99
        start_idx_for_large = np.max(counts)*idx
        end_idx_for_large = start_idx_for_large+99
        X_train_sample[start_idx_for_sample:end_idx_for_sample,:,:,:] = X_train_large[start_idx_for_large:end_idx_for_large,:,:,:]
        y_train_sample[start_idx_for_sample:end_idx_for_sample,:] = y_train_large[start_idx_for_large:end_idx_for_large,:]
    # define augmentation needed for image data generator
    train_datagen = ImageDataGenerator(featurewise_center=False,
                                       samplewise_center=False,
                                       featurewise_std_normalization=False,
                                       samplewise_std_normalization=False,
                                       zca_whitening=False,
                                       rotation_range=90,
                                       width_shift_range=0.1,
                                       height_shift_range=0.1,
                                       horizontal_flip=True,
                                       vertical_flip=True)
    train_datagen.fit(X_train_sample)
    # load my model
    model = myCNN.build_model(input_image_shape)
    sgd = SGD(lr=0.05,decay=10e-4,momentum=0.9)
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
    for e in range(number_of_epochs):
        print('*********************epoch',e)
        # get 1000 images at a time from the input image set
        for X_train, y_train in myBatchGenerator(X_train_large, y_train_large,chunk_size=1000):
            # split it into batches of 32 images/labels and augment on the fly
            for X_batch, y_batch in train_datagen.flow(X_train_large,y_train_large,batch_size=32):
                # train
                model.fit(X_batch,y_batch,validation_data=(X_validation_large,y_validation_large))
    model.save('myCNN_trained_on_largedataset.h5')
In short,
1. I create a stratified sample of my input images to use for the image data generator.
2. I split my input images into chunks of 1000 images and feed those 1000 images to the model in batches of 32.
So, I'm training my model on 32 images at a time, augmenting them on the fly, and validating the model on ~5000 images.
I'm still running my model, but each batch of 32 images currently takes 30 seconds to process. This translates to many hours for just one epoch. I'm missing something here.
I've tested my CNN code on a smaller dataset and it works, so I know the problem is neither my function to read input images nor my CNN. I think it is how I am splitting my data into chunks and batching it, but I cannot figure out where I went wrong. Can you please guide me?
Thanks in advance for your time
Why don't you use flow_from_directory() from the ImageDataGenerator class? It is built into Keras and handles problems like yours easily.
Specifically, flow_from_directory draws batches directly from a directory, and you can perform data augmentation on the fly.
There are also a couple of examples I can suggest:
Building powerful image classification models using very little data. It is a Keras blog post about a problem like yours, and very easy to read.
cifar10_cnn_tfaugment2d.py. A more advanced ad-hoc solution in TensorFlow that defines a specific augmentation layer. Very interesting, though!
I think that's enough to get your network running ;).
I hope this helps, good luck!
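For reference, the chunk-then-batch scheme described in the question can be sketched in plain NumPy as follows. This is only an illustration of the iteration pattern: the function names iter_chunks and iter_batches and the small stand-in arrays are hypothetical, and no augmentation or training is performed. In this sketch both loops are finite and the inner loop slices the current chunk, so one pass over the outer loop corresponds to one epoch.

```python
import numpy as np

def iter_chunks(X, y, chunk_size):
    # Yield consecutive slices of at most chunk_size samples, one pass only.
    for start in range(0, len(y), chunk_size):
        yield X[start:start + chunk_size], y[start:start + chunk_size]

def iter_batches(X, y, batch_size):
    # Yield consecutive mini-batches from a single chunk, one pass only.
    for start in range(0, len(y), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

# Small hypothetical stand-in: 2500 "images" of 4 features, 12 one-hot classes.
X_large = np.zeros((2500, 4), dtype=np.float16)
y_large = np.eye(12)[np.arange(2500) % 12]

batch_sizes = []
for X_chunk, y_chunk in iter_chunks(X_large, y_large, chunk_size=1000):
    for X_batch, y_batch in iter_batches(X_chunk, y_chunk, batch_size=32):
        batch_sizes.append(len(y_batch))  # a training step would go here
```

Counting the batches this way also gives the number of steps in one epoch, which is useful when a generator is handed to Keras together with steps_per_epoch.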

sample_weight parameter shape error in scikit-learn GridSearchCV

Passing the sample_weight parameter to GridSearchCV raises an error due to an incorrect shape. My suspicion is that cross-validation does not split sample_weight in step with the dataset.
First part: Using sample_weight as a model parameter works beautifully
Let's consider a simple example, first without GridSearch:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
dataURL = 'https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sinusoidal_data.csv'
x = pd.read_csv(dataURL, usecols=["x"]).x
y = pd.read_csv(dataURL, usecols=["y"]).y
occurrences = pd.read_csv(dataURL, usecols=["Occurrences"]).Occurrences
my_sample_weights = (1 - occurrences/10000)**3
my_sample_weights contains the importance that I assign to each observation in x, y, as the following picture shows. The points of the sinusoidal curve get higher weights than those forming the background noise.
plt.scatter(x, y, c=my_sample_weights>0.9, cmap="cool")
Let's train a neural network, first without using the information contained in my_sample_weights:
def make_model(number_of_hidden_neurons=1):
    model = Sequential()
    model.add(Dense(number_of_hidden_neurons, input_shape=(1,), activation='tanh'))
    model.add(Dense(1, activation='linear'))
    model.compile(optimizer='sgd', loss='mse')
    return model
net_Not_using_sample_weight = make_model(number_of_hidden_neurons=6)
net_Not_using_sample_weight.fit(x,y, epochs=1000)
plt.scatter(x, y, )
plt.scatter(x, net_Not_using_sample_weight.predict(x), c="green")
As the following picture shows, the neural network tries to fit the shape of the sinusoid, but the background noise prevents a good fit.
Now, using the information in my_sample_weights, the quality of the prediction is much better.
Second part: Using sample_weight as a GridSearchCV parameter raises an error
my_Regressor = KerasRegressor(make_model)
validator = GridSearchCV(my_Regressor,
param_grid={'number_of_hidden_neurons': range(4, 5),
'epochs': [500],
},
fit_params={'sample_weight': [ my_sample_weights ]},
n_jobs=1,
)
validator.fit(x, y)
Trying to pass the sample_weights as a parameter gives the following error:
...
ValueError: Found a sample_weight array with shape (1000,) for an input with shape (666, 1). sample_weight cannot be broadcast.
It seems that the sample_weight vector has not been split in a similar manner to the input array.
For what it's worth:
import sklearn
print(sklearn.__version__)
0.18.1
import keras
print(keras.__version__)
2.0.5
The problem is that, by default, GridSearchCV uses 3-fold cross-validation unless explicitly stated otherwise. This means that 2/3 of the data points are used for training and 1/3 for validation, which matches the error message. The fit_params array of length 1000 doesn't match the number of examples used for training (666). Adjust the size and the code will run.
my_sample_weights = np.random.uniform(size=666)
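To see where the 666 comes from, here is a small NumPy sketch of a contiguous, unshuffled 3-fold split of 1000 samples. It is a stand-in written for illustration, not scikit-learn's own code, and the weight values are made up. It also shows the alternative to resizing the array: slicing the weights with the same training indices as the data.

```python
import numpy as np

n_samples, n_folds = 1000, 3
indices = np.arange(n_samples)
weights = np.linspace(0.0, 1.0, n_samples)  # hypothetical per-sample weights

# Contiguous folds; the first n_samples % n_folds folds get one extra sample.
fold_sizes = np.full(n_folds, n_samples // n_folds)
fold_sizes[: n_samples % n_folds] += 1

train_sizes = []
start = 0
for size in fold_sizes:
    # Everything outside the current validation fold is used for training.
    train_idx = np.concatenate([indices[:start], indices[start + size:]])
    # Slice the weights with the same indices as the training rows, so that
    # both arrays have the same length in every fold.
    fold_weights = weights[train_idx]
    train_sizes.append((len(train_idx), len(fold_weights)))
    start += size
```

The first fold trains on 666 samples and the other two on 667, so a fixed-length weight vector can match at most some of the folds; the weights have to be split per fold along with x and y.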
We developed PipeGraph, an extension to Scikit-Learn Pipeline that allows you to get intermediate data, build graph like workflows, and in particular, solve this problem (see the examples in the gallery at http://mcasl.github.io/PipeGraph )
