Balancing dataset - conv-neural-network

I have a dataset organized as a main folder containing 5 subfolders, one subfolder per class, so 5 classes in total. Each subfolder contains the images for that class.
This is how the images are distributed across the classes:
0 - 1805
2 - 999
1 - 370
4 - 295
3 - 193
You can see that it is heavily imbalanced. I want to balance it, e.g. to around 295 images in each class.
train_data_gen = ImageDataGenerator(rescale=1./255,
                                    validation_split=train_val_split)

train_generator = train_data_gen.flow_from_directory(
    directory='/kaggle/input/traindata/train',
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical',
    subset='training')

validation_generator = train_data_gen.flow_from_directory(
    directory='/kaggle/input/traindata/train',
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical',
    subset='validation')
Found 2931 images belonging to 5 classes.
Found 731 images belonging to 5 classes.
Can anyone help me out? I want a solution for balancing the dataset that still works with the train and validation generators.
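One route that keeps both generators exactly as they are is to compensate for the imbalance with per-class weights instead of discarding images. A minimal sketch, assuming the train_generator and validation_generator defined above and a compiled model named model (the scikit-learn helper is my choice, not part of the original post):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# train_generator.classes holds the integer label of every training image
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(train_generator.classes),
                               y=train_generator.classes)
class_weight_dict = dict(enumerate(weights))

# Rare classes now contribute more to the loss, which offsets the imbalance
model.fit(train_generator,
          validation_data=validation_generator,
          epochs=10,
          class_weight=class_weight_dict)

If you really need roughly 295 images per class on disk, the alternative is to undersample the file lists per subfolder before building the generators, since flow_from_directory itself has no balancing option.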

Related

RAM crashed while using ImageDataGenerator in Google Colab

I have no coding experience and am new to Python.
Task: use a CNN for binary image classification.
Problem: memory error.
The data is confidential; an image example is pasted: [example image][1]
There are two classes of images: 294 images for class 'e' and 5057 images for class 'l'. Since the dataset is imbalanced, the original plan was to set batch_size=500 in datagen.flow_from_directory for each class, so that in every batch the whole dataset of class 'e' plus 500 images of class 'l' would be fed to the model. However, Google Colab keeps crashing out of RAM. batch_size was downgraded to 50, and it still failed.
# x = image data; y = labels; bs = batch_size
bs = 50

def generate_batch_data_random(x, y, bs):
    ylen = len(y)
    loopcount = ylen // bs
    while True:
        # randint is inclusive on both ends, so stop at loopcount - 1 to stay in range
        i = random.randint(0, loopcount - 1)
        yield x[i * bs:(i + 1) * bs], y[i * bs:(i + 1) * bs]

def train_and_validate_model(model, x, y):
    (trainX, testX, trainY, testY) = train_test_split(x, y, test_size=0.25, random_state=6)
    trainY = to_categorical(trainY, num_classes=2)
    testY = to_categorical(testY, num_classes=2)
    logger = CSVLogger(kfold_train_and_validate, append=True)
    H = model.fit_generator(generator=generate_batch_data_random(trainX, trainY, bs),
                            steps_per_epoch=len(trainX) // bs,
                            epochs=10,
                            validation_data=generate_batch_data_random(testX, testY, bs),
                            validation_steps=len(testX) // bs,
                            callbacks=[checkpoint])
    return H, testX, testY
The idea was to use ImageDataGenerator to save memory. Cross-validation splits seemed more appropriate than fixed training and validation groups, so the dataset structure was built around the image classes (one folder of images per class), not around training and validation groups. The plan was to use ImageDataGenerator to send images in batches, then use k-fold to split each batch into training and validation groups.
path = '/content/drive/MyDrive/er_lr/erlr_vs_er'
datagen = ImageDataGenerator(rescale=1./255)

data_e = datagen.flow_from_directory(directory=path,
                                     target_size=(128, 128),
                                     classes='e',
                                     batch_size=50,
                                     class_mode='categorical')
x_e, y_e = next(data_e)

data_l = datagen.flow_from_directory(directory=path,
                                     classes='l',
                                     target_size=(128, 128),
                                     batch_size=50,
                                     class_mode='categorical')
x_l, y_l = next(data_l)

for i in range(0, len(y_e)):
    y_e[i] = 0
for j in range(0, len(y_l)):
    y_l[j] = 1

x = []
y = []
x.extend(np.array(data_e)[0][0])
x.extend(np.array(data_l)[0][0])
y.extend(np.array(y_e))
y.extend(np.array(y_l))

seed = 10
np.random.seed(seed)
filepath = '/content/drive/MyDrive/er_lr/hdf5/my_best_model.epoch{epoch:02d}-loss{val_loss:.2f}.hdf5'
fold = 1
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

for train, test in kfold.split(x, y):
    model = create_model()
    checkpoint = keras.callbacks.ModelCheckpoint(filepath,
                                                 monitor='val_loss', save_weights_only=True, verbose=1,
                                                 save_best_only=True, save_freq='epoch', period=1)
    H, validationX, validationY = train_and_validate_model(model, x[train], y[train])
    training_ACCs.append(H.history['accuracy'])
    training_loses.append(H.history['loss'])
    val_ACCs.append(H.history['val_accuracy'])
    val_loses.append(H.history['val_loss'])
    labels_test_cat = to_categorical(y[test], num_classes=2)
    scores = model.evaluate(x[test], labels_test_cat, verbose=0)
    fold = fold + 1
It crashed in Google Colab repeatedly due to running out of RAM. A batch_size of 50 and an image shape of (128, 128, 3) do not seem particularly large.
Any thoughts?
[1]: https://i.stack.imgur.com/Lp1H9.png
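A minimal sketch of a lower-memory variant of this plan, under the assumption that path contains one subfolder per class as described above: flow_from_directory can stream batches straight into model.fit instead of materializing the x and y lists first. This simplification skips the k-fold split; create_model is the asker's own function, assumed unchanged.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

path = '/content/drive/MyDrive/er_lr/erlr_vs_er'

# validation_split reserves a held-out fraction without a second copy of the data
datagen = ImageDataGenerator(rescale=1./255, validation_split=0.25)

train_gen = datagen.flow_from_directory(path, target_size=(128, 128), batch_size=50,
                                        class_mode='categorical', subset='training')
val_gen = datagen.flow_from_directory(path, target_size=(128, 128), batch_size=50,
                                      class_mode='categorical', subset='validation')

# Only one batch of 50 decoded images lives in RAM at any time
model = create_model()
model.fit(train_gen, validation_data=val_gen, epochs=10)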

Normalize MNIST in PyTorch

I am trying to normalize the MNIST dataset in PyTorch 1.9 and Python 3.8 so that pixel values lie in the range [0, 1], using the following code (batch_size = 32).
# Specify path to MNIST dataset-
path_to_data = "path_to_dataset"

# Define transformation(s) to be applied to dataset-
transforms_MNIST = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize(mean = (0.1307,), std = (0.3081,))
    ]
)

# Load MNIST dataset-
train_dataset = torchvision.datasets.MNIST(
    # root = './data', train = True,
    root = path_to_data + "data", train = True,
    transform = transforms_MNIST, download = True
)
test_dataset = torchvision.datasets.MNIST(
    # root = './data', train = False,
    root = path_to_data + "data", train = False,
    transform = transforms_MNIST
)

# Create training and testing dataloaders-
train_loader = torch.utils.data.DataLoader(
    dataset = train_dataset, batch_size = batch_size,
    shuffle = True
)
test_loader = torch.utils.data.DataLoader(
    dataset = test_dataset, batch_size = batch_size,
    shuffle = False
)

print(f"Sizes of train_dataset: {len(train_dataset)} and test_dataset: {len(test_dataset)}")
print(f"Sizes of train_loader: {len(train_loader)} and test_loader: {len(test_loader)}")
# Sizes of train_dataset: 60000 and test_dataset: 10000
# Sizes of train_loader: 1875 and test_loader: 313

# Sanity check-
print(f"train_dataset: min pixel value = {train_dataset.data.min().numpy():.3f} &"
      f" max pixel value = {train_dataset.data.max().numpy():.3f}")
# train_dataset: min pixel value = 0.000 & max pixel value = 255.000
print(f"test_dataset: min pixel value = {test_dataset.data.min().numpy():.3f} &"
      f" max pixel value = {test_dataset.data.max().numpy():.3f}")
# test_dataset: min pixel value = 0.000 & max pixel value = 255.000

print(f"len(train_loader) = {len(train_loader)} & len(test_loader) = {len(test_loader)}")
# len(train_loader) = 1875 & len(test_loader) = 313

# Sanity check-
len(train_dataset) / batch_size, len(test_dataset) / batch_size
# (1875.0, 312.5)

# Get some random batch of training images & labels-
images, labels = next(iter(train_loader))

# You get x images due to the specified batch size-
print(f"images.shape: {images.shape} & labels.shape: {labels.shape}")
# images.shape: torch.Size([32, 1, 28, 28]) & labels.shape: torch.Size([32])

# Get min and max values for normalized pixels in mini-batch-
images.min(), images.max()
# (tensor(-0.4242), tensor(2.8215))
The min and max for 'images' should be between 0 and 1; instead they are -0.4242 and 2.8215. What is going wrong?
This happens because Normalize applies what is actually known as standardization: output = (input - mean) / std.
The [0, 1] scaling you want is already performed by ToTensor when the image is loaded, so you can simply comment out (or remove) the Normalize transform.
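A minimal sketch of that suggestion, reusing the same dataset setup (paths shortened for brevity): with Normalize removed, ToTensor alone already maps the uint8 pixel values into [0, 1].

import torch
import torchvision
from torchvision import transforms

# ToTensor converts uint8 [0, 255] images to float tensors in [0, 1]
transforms_MNIST = transforms.Compose([transforms.ToTensor()])

train_dataset = torchvision.datasets.MNIST(root="data", train=True,
                                           transform=transforms_MNIST, download=True)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)

images, labels = next(iter(train_loader))
print(images.min(), images.max())   # expected: tensor(0.) tensor(1.)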

How to determine the optimal number of "Steps" and "Batch Size" for test dataset in Keras ImageDataGenerator?

I have trained an image similarity network. The network is designed to distinguish between similar/dissimilar pairs of images, where a pair consists of a camera image and its corresponding sketch image.
The test dataset contains 4 image directories (camera_positive, sketch_positive, camera_negative, sketch_negative).
I am facing a problem while evaluating the performance of the network on this test dataset.
As the test dataset is too large to fit into memory, I decided to use the Keras ImageDataGenerator and implemented the following code. For this small demonstration, each directory contains 20 images, so there are 80 images in total and 40 predictions.
Since ImageDataGenerator offers the option to save the generated images, I used the save_to_dir parameter, as can be seen in the following code, to verify that everything works correctly.
Each directory contains 20 images, so I expected that after running the predictions the same 20 images would be saved to the specified directories.
However, after running the code, it generates 31 images in each folder instead of 20!
I played around with different step sizes, but none gave the correct result.
What is wrong with this code? Please suggest!
import os
import numpy as np
from keras.models import load_model
from keras.preprocessing.image import ImageDataGenerator

batch_size = 1
image_size = 224
class_mode = None

"""
c_pos/neg: camera positive/negative images
s_pos/neg: sketch positive/negative images
"""
c_pos = r"testing\c_pos"
c_neg = r"testing\c_neg"
s_pos = r"testing\s_pos"
s_neg = r"testing\s_neg"

datagen_constructor = ImageDataGenerator()

def initialize_generator(generator, c_pos, s_pos, c_neg, s_neg):
    camera_pos = generator.flow_from_directory(
        c_pos,
        target_size=(image_size, image_size),
        color_mode="rgb",
        batch_size=batch_size,
        class_mode=class_mode,
        shuffle=False,
        seed=7,
        save_to_dir='results/c_pos',
        save_format='jpeg',
        save_prefix='CPOS'
    )
    sketch_pos = generator.flow_from_directory(
        s_pos,
        target_size=(image_size, image_size),
        color_mode="rgb",
        batch_size=batch_size,
        class_mode=class_mode,
        shuffle=False,
        seed=7,
        save_to_dir='results/s_pos',
        save_format='jpeg',
        save_prefix='SPOS'
    )
    camera_neg = generator.flow_from_directory(
        c_neg,
        target_size=(image_size, image_size),
        color_mode="rgb",
        batch_size=batch_size,
        class_mode=class_mode,
        shuffle=False,
        seed=7,
        save_to_dir='results/c_neg',
        save_format='jpeg',
        save_prefix='CNEG'
    )
    sketch_neg = generator.flow_from_directory(
        s_neg,
        target_size=(image_size, image_size),
        color_mode="rgb",
        batch_size=batch_size,
        class_mode=class_mode,
        shuffle=False,
        seed=7,
        save_to_dir='results/s_neg',
        save_format='jpeg',
        save_prefix='SNEG'
    )
    while True:
        camerapos = np.expand_dims(camera_pos.next(), axis=0)
        sketchpos = np.expand_dims(sketch_pos.next(), axis=0)
        cameraneg = np.expand_dims(camera_neg.next(), axis=0)
        sketchneg = np.expand_dims(sketch_neg.next(), axis=0)
        camera = np.concatenate((camerapos[0], cameraneg[0]))
        sketch = np.concatenate((sketchpos[0], sketchneg[0]))
        camera = np.asarray(list(camera), dtype=np.float32)
        sketch = np.asarray(list(sketch), dtype=np.float32)
        yield [camera, sketch]

test_datagen = initialize_generator(datagen_constructor, c_pos, s_pos, c_neg, s_neg)

# Load pre-trained model
model = load_model("model.h")

# Evaluating network performance on test dataset
predict = model.predict_generator(test_datagen, steps=20)
You could manually iterate through each folder and make a prediction for every image like this:
import os
import cv2
import numpy as np
from keras.models import load_model

model = load_model("model.h")
image_paths = [image.path for image in os.scandir(path_to_my_folder)]
for image_path in image_paths:
    image = cv2.imread(image_path)
    # Add the batch dimension: Keras predicts on batches, here a batch of size 1
    image_to_predict = np.expand_dims(image, axis=0)
    prediction = model.predict(image_to_predict)
Then, you could compare each prediction with the ground truth label you know it belongs to.
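As a general note, not specific to this network: the number of steps for a flow_from_directory generator does not have to be guessed; it can be read off the generator's samples attribute so that every file is visited exactly once. A minimal sketch, reusing the c_pos folder from the question:

import math
from keras.preprocessing.image import ImageDataGenerator

gen = ImageDataGenerator().flow_from_directory(c_pos,
                                               target_size=(224, 224),
                                               batch_size=1,
                                               class_mode=None,
                                               shuffle=False)

# gen.samples is the number of files found; with batch_size images per step,
# this many steps covers each image exactly once
steps = math.ceil(gen.samples / gen.batch_size)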

Unable to load the saved model in the browser using Tensorflowjs

I am trying to build a rice classifier using transfer learning on an edge device, following the tutorial at https://github.com/ADLsourceCode/TensorflowJS
My sample data is at https://www.dropbox.com/s/esirpr6q1lsdsms/ricetransfer1.zip?dl=0
I saved the rice classification model locally using the code below and kept it in the folder TensorflowJS/Mobilenet_VGG16_Keras_To_TensorflowJS/static/ along with the VGG and MobileNet models, but I am not able to load the rice model with TensorFlow.js in the browser.
If I save the VGG model on my local system and load it with TensorFlow.js in the browser, it works fine.
# Base variables
import os

base_dir = 'ricetransfer1/'
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')
test_dir = os.path.join(base_dir, 'test')
train_cats_dir = os.path.join(train_dir, 'KN')
train_dogs_dir = os.path.join(train_dir, 'DM')

train_size, validation_size, test_size = 90, 28, 26
#train_size, validation_size, test_size = 20, 23, 14
img_width, img_height = 224, 224  # Default input size for VGG16

# Instantiate convolutional base
from keras.applications import VGG16
import tensorflowjs as tfjs
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

img_width, img_height = 224, 224  # Default input size for VGG16
conv_base = VGG16(weights='imagenet',
                  include_top=False,
                  input_shape=(img_width, img_height, 3))  # 3 = number of channels in RGB pictures

# Saving the VGG model to run it locally
tfjs.converters.save_keras_model(conv_base, '/TensorflowJS/Mobilenet_VGG16_Keras_To_TensorflowJS/static/vgg')

# Check architecture
conv_base.summary()

# Extract features
import os, shutil
from keras.preprocessing.image import ImageDataGenerator
import numpy as np

train_size, validation_size, test_size = 90, 28, 25
datagen = ImageDataGenerator(rescale=1./255)
batch_size = 1
#train_dir = "ricetransfer1/train"
#validation_dir = "ricetransfer1/validation"
#test_dir = "ricetransfer1/test"
#indices = np.random.choice(range(len(X_train)))

def extract_features(directory, sample_count):
    #sample_count = X_train.ravel()
    features = np.zeros(shape=(sample_count, 7, 7, 512))  # Must equal the output shape of the convolutional base
    labels = np.zeros(shape=(sample_count))
    # Preprocess data
    generator = datagen.flow_from_directory(directory,
                                            target_size=(img_width, img_height),
                                            batch_size=batch_size,
                                            class_mode='binary')
    # Pass data through convolutional base
    i = 0
    for inputs_batch, labels_batch in generator:
        features_batch = conv_base.predict(inputs_batch)
        features[i * batch_size: (i + 1) * batch_size] = features_batch
        labels[i * batch_size: (i + 1) * batch_size] = labels_batch
        i += 1
        if i * batch_size >= sample_count:
            break
    return features, labels

train_features, train_labels = extract_features(train_dir, train_size)  # Agrees with our small dataset size
validation_features, validation_labels = extract_features(validation_dir, validation_size)
test_features, test_labels = extract_features(test_dir, test_size)

# Define model
from keras import models
from keras import layers
from keras import optimizers

epochs = 2

ricemodel = models.Sequential()
ricemodel.add(layers.Flatten(input_shape=(7, 7, 512)))
ricemodel.add(layers.Dense(256, activation='relu', input_dim=(7*7*512)))
ricemodel.add(layers.Dropout(0.5))
ricemodel.add(layers.Dense(1, activation='sigmoid'))
ricemodel.summary()

# Compile model
ricemodel.compile(optimizer=optimizers.Adam(),
                  loss='binary_crossentropy',
                  metrics=['acc'])

# Train model
import os
history = ricemodel.fit(train_features, train_labels,
                        epochs=epochs,
                        batch_size=batch_size,
                        validation_data=(validation_features, validation_labels))

## Saving the rice classification model to run it locally
tfjs.converters.save_keras_model(ricemodel, '/TensorflowJS/Mobilenet_VGG16_Keras_To_TensorflowJS/static/rice/')
I think there is some mistake in the rice model. How can I solve this issue?
The expected output is to run the rice classification in the browser using TensorFlow.js.
I think you might be getting this error because of an older version of the tfjs script. Update it to the newer version
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@0.13.5"></script>
in your HTML page, but a new error may then arise due to a different image size.
I suggest opening the developer tools in the browser to see the exact error; that is how it worked in this case.

CNN-Divide images into training/validation/testing

I'm trying to divide my images (a dataset of bunnies and dogs) into x_train, x_val, y_train, y_val, and a test set.
The following is what I did:
I placed the photos of each class (dogs/bunnies) in separate folders inside two folders: training and testing.
Training directory-> Bunny directory -> bunny images
Training directory-> Puppy directory -> puppy images
Testing directory-> Bunny directory -> bunny images
Testing directory-> Puppy directory -> puppy images
I used the following code to get the images from the folders:
training_data = train_datagen.flow_from_directory('./images/train',
                                                  target_size=(28, 28),
                                                  batch_size=86,
                                                  class_mode='binary',
                                                  color_mode='rgb',
                                                  classes=None)

test_data = test_datagen.flow_from_directory('./images/test',
                                             target_size=(28, 28),
                                             batch_size=86,
                                             class_mode='binary',
                                             color_mode='rgb',
                                             classes=None)
Which gives me the following output:
Found 152 images belonging to 2 classes.
Found 23 images belonging to 2 classes.
Question 1: I wasn't sure how to define my labels here (y_val / y_train), or whether I even need to (but it appears that most models have y_val / y_train).
Question 2: I tried to run
x_train, x_val = train_test_split(training_data, test_size=0.1)
in order to at least split my training data into training and validation sets. But when I tried to run my model, it gave me the following error:
history = classifier.fit_generator(x_train,
                                   steps_per_epoch=(8000 / 86),
                                   epochs=2,
                                   validation_data=x_val,
                                   validation_steps=8000 / 86,
                                   callbacks=[learning_rate_reduction])
ValueError: validation_data should be a tuple (val_x, val_y, val_sample_weight) or (val_x, val_y).
Found: [(array([[[[0.5058095 , 0.46913707, 0.42369673],...
Question 1:
In my experience, there are no strict constraints on naming the x and y variables. For example, in this kernel a person uses y_train and y_test as label names, and here a person uses train_Y. The only rule is to give names that show what the variable is about.
Question 2:
I would recommend using the validation_split parameter of ImageDataGenerator (doc) to set the fraction of images reserved for validation, and then the subset parameter of flow_from_directory (doc) to define train_generator and validation_generator variables. (I want to point out that flow_from_directory returns a generator, not data.)
So your code would look like:
data_generator = ImageDataGenerator(
    validation_split=0.2,
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
)

train_generator = data_generator.flow_from_directory(
    './images/train',
    target_size=(28, 28),
    batch_size=86,
    class_mode='binary',
    color_mode='rgb',
    classes=None,
    subset='training'
)

validation_generator = data_generator.flow_from_directory(
    './images/train',
    target_size=(28, 28),
    batch_size=86,
    class_mode='binary',
    color_mode='rgb',
    classes=None,
    subset='validation'
)

history = classifier.fit_generator(
    train_generator,
    steps_per_epoch=(8000 / 86),
    epochs=2,
    validation_data=validation_generator,
    validation_steps=8000 / 86,
    callbacks=[learning_rate_reduction]
)
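One small caveat on the code above (my addition, not part of the original answer): with only 152 training images, hard-coding 8000 / 86 overstates the number of available batches. The step counts can instead be read off the generators, which both expose samples and batch_size attributes:

# Derive step counts from what flow_from_directory actually found
steps_per_epoch = max(1, train_generator.samples // train_generator.batch_size)
validation_steps = max(1, validation_generator.samples // validation_generator.batch_size)

history = classifier.fit_generator(
    train_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=2,
    validation_data=validation_generator,
    validation_steps=validation_steps,
    callbacks=[learning_rate_reduction]
)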