What is the point of data augmentation? - keras

The code below is based on François Chollet's work. He uses it to show that when the training set is small (2000 images), data augmentation improves classification performance on the validation set (which is true!).
My questions are:
If the model.fit_generator call uses steps_per_epoch = 2000 // batch_size, are we using 2000 images per epoch?
If yes, what is the point of data augmentation if the augmented sample seen per epoch is the same size as the original dataset?
from tensorflow.keras.preprocessing.image import ImageDataGenerator

batch_size = 32

# Training data: rescaling plus random augmentation
train_datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)
train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(150, 150),
    batch_size=batch_size,
    class_mode='binary')

# Validation data: rescaling only, no augmentation
validation_datagen = ImageDataGenerator(rescale=1./255)
validation_generator = validation_datagen.flow_from_directory(
    validation_dir,
    target_size=(150, 150),
    batch_size=20,
    class_mode='binary')

# Training and validation
history = model.fit_generator(train_generator,
                              steps_per_epoch=2000 // batch_size,
                              epochs=100,
                              validation_data=validation_generator,
                              validation_steps=500 // batch_size)

The code you posted is quite dated, but it serves the purpose for this explanation.
You do have 2000 images, and you use all of them in each epoch, but the number of steps performed in that epoch is 2000 // batch_size, because the network's weights are updated once per batch of batch_size images. With batch_size = 32 that is 2000 // 32 = 62 steps, i.e. 62 * 32 = 1984 images per epoch.
At the same time, think of augmentation as enrichment at run time. Augmentation does not create new examples that are physically stored on your drive; the transformations are applied when a batch is loaded into memory. Out of a batch of batch_size elements, some are modified (augmented) before being fed into the network. Each augmentation has an associated probability, i.e. there is an N% chance (which you can also set manually) that an image is subjected to that specific transformation.
This means that as training progresses and the number of epochs grows, your network gets to see far more distinct images than the initial 2000.
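To make the run-time nature of augmentation concrete, here is a minimal, self-contained sketch (it uses a small synthetic array instead of the question's train_dir, which is an assumption for illustration) showing that the same underlying images produce different tensors on every pass:
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

x = np.random.rand(8, 150, 150, 3)            # stand-in for 8 images on disk
datagen = ImageDataGenerator(shear_range=0.2,
                             zoom_range=0.2,
                             horizontal_flip=True)
flow = datagen.flow(x, batch_size=8, shuffle=False)

pass_1 = next(flow)    # the 8 images as the network sees them in "epoch" 1
pass_2 = next(flow)    # the same 8 images again, but transformed differently

print(np.allclose(pass_1, pass_2))   # almost certainly False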

In the snippet you provided, steps_per_epoch = 2000 // batch_size essentially means that the model sees the whole set of 2000 images during an epoch, but many of them are replaced by augmented counterparts, chosen either according to a probability you specify or at random.
For example, consider a dogs-vs-cats classifier whose dataset contains only right-facing dogs and left-facing cats. Without augmentation (horizontal flipping), the model might learn that every left-facing animal is a cat, which leads to wrong predictions when it is given an image of a left-facing dog.
Augmentation (specifically horizontal_flip) randomly flips the images of cats and dogs, enabling the model to reach a better solution and making it more robust.
Augmentation happens on the fly; no new images are written to disk. A quick way to convince yourself of this is to plot a few augmented batches, as in the sketch below.
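As a quick illustration, here is a minimal sketch (using a synthetic two-tone image rather than real cat/dog photos) that visualises what horizontal_flip does at load time:
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.image import ImageDataGenerator

img = np.zeros((1, 150, 150, 3), dtype="float32")
img[0, :, :75, 0] = 1.0                        # colour the left half so flips are obvious

flow = ImageDataGenerator(horizontal_flip=True).flow(img, batch_size=1, shuffle=False)

fig, axes = plt.subplots(1, 4, figsize=(10, 3))
for ax in axes:                                # each draw flips with 50% probability
    ax.imshow(next(flow)[0])
    ax.axis("off")
plt.show()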

Related

Predicting single image using Tensorflow not being accurate

I'm trying to build a CNN model to classify an image, but whenever training finishes and I feed it a single image (even one from the training dataset), it always misclassifies it.
Please take a look at the code I wrote below.
Thank you in advance.
First, I declared an Image Data Generator for both my training and testing sets:
train_datagen = ImageDataGenerator(rescale=1./255, rotation_range=20, horizontal_flip=True,
                                   validation_split=0.3)
test_datagen = ImageDataGenerator(rescale=1./255, validation_split=0.3)
Then, I used the flow_from_directory() function to load the images:
train_generator = train_datagen.flow_from_directory(
    data_dir,
    shuffle=False,
    subset='training',
    target_size=(224, 224),
    class_mode='categorical'
)
test_generator = test_datagen.flow_from_directory(
    data_dir,
    shuffle=False,
    subset='validation',
    target_size=(224, 224),
    class_mode='categorical'
)
I then loaded a pretrained model and added a few layers to build my model:
pretrained_model = VGG16(weights="imagenet", include_top=False,
                         input_tensor=input_shape)
pretrained_model.trainable = False
model = tf.keras.Sequential([
    pretrained_model,
    Flatten(name="flatten"),
    Dense(3, activation="softmax")
])
I then trained the model :
INIT_LR = 3e-4
EPOCHS = 15
opt = Adam(lr=INIT_LR)
model.compile(loss="categorical_crossentropy", optimizer='Adam', metrics=["accuracy"])
H = model.fit(
    train_generator,
    validation_data=test_generator,
    epochs=EPOCHS,
    verbose=1)
Then came the part to predict a single image:
I chose an image that was part of the training set, and I even overfitted the model to make sure the predictions would be correct, but it gave me wrong results for every image I fed to the model.
I tried the following ways:
image = image.load_img(url,target_size = (224, 224))
img = tf.keras.preprocessing.image.img_to_array(image)
img = np.array([img])
img = img.astype('float32') / 255.
img = tf.keras.applications.vgg16.preprocess_input(img)
This didn't work
image = cv2.imread(url)
image = cv2.normalize(image, None,beta=255, dtype=cv2.CV_32F)
image = cv2.resize(image, (224, 224))
image = np.expand_dims(image, axis=0)
This also didn't work. I tried many other ways to predict a single image, but none of them worked.
In the end, the only thing that worked was creating an ImageDataGenerator with flow_from_directory for that single image, but I believe that's not how it should be done.
The call tf.keras.applications.vgg16.preprocess_input(img) does not scale pixel values to the range -1 to +1; it converts the image from RGB to BGR and subtracts the ImageNet channel means, assuming the original pixel values are in the range 0 to 255. In the previous line of code,
img = img.astype('float32') / 255.
you already rescaled the pixels, so applying both operations double-preprocesses the image. The key rule is that prediction-time preprocessing must match training-time preprocessing: your training generator only applied rescale = 1./255, so at prediction time rescale by 1/255 and drop the preprocess_input call. To predict a single image you also need to add a batch dimension with
img = np.expand_dims(img, axis=0)
In your second attempt, be aware that cv2.imread loads images as BGR. If your model was trained on RGB images, your predictions will be wrong. Use the code below to convert the image to RGB:
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
As a side note, for models that were trained on inputs scaled to the range -1 to +1, you can use the function below instead of dividing by 255:
def scalar(img):
    return img / 127.5 - 1
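Putting the above together, here is a minimal sketch of a prediction pipeline that mirrors the training preprocessing (rescale by 1/255 only); it assumes the model and url variables from the question:
import numpy as np
import tensorflow as tf

img = tf.keras.preprocessing.image.load_img(url, target_size=(224, 224))   # RGB PIL image
img = tf.keras.preprocessing.image.img_to_array(img)                       # float32, 0-255
img = img / 255.0                              # same rescale=1./255 as the training generator
img = np.expand_dims(img, axis=0)              # add the batch dimension -> (1, 224, 224, 3)

probs = model.predict(img)                     # shape (1, 3) for the 3-class softmax
print(np.argmax(probs, axis=1)[0], probs)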
This answer could be one starting point:
Resnet50 produces different prediction when image loading and resizing is done with OpenCV
These are possible differences (short gist):
RGB vs BGR (OpenCV loads BGR)
The interpolation method used (INTER_LINEAR vs INTER_NEAREST).
img_to_array() transforms the data type into float32 rather than uint8 which is obtained by default when loading with OpenCV.
tf.keras.applications.vgg16.preprocess_input(img): this preprocessing can differ from the preprocessing you applied above. Note also that if you did not preprocess the training data in this particular way (with preprocess_input()), poor results on test images are expected, because the preprocessing at training time and at prediction time must match.
Hope these observations shed some light.
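For completeness, here is a minimal sketch (same assumptions as above: the question's model and url) of an OpenCV loading path aligned with the Keras/PIL one on the points listed: RGB order, interpolation, and float32 values with the same 1/255 rescaling used during training:
import cv2
import numpy as np

image = cv2.imread(url)                                   # BGR, uint8
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)            # match RGB training data
image = cv2.resize(image, (224, 224),
                   interpolation=cv2.INTER_NEAREST)       # keras load_img defaults to nearest
image = image.astype("float32") / 255.0                   # same rescaling as training
image = np.expand_dims(image, axis=0)

print(model.predict(image))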

Training on big datasets separately - keras

I am working on a Keras denoising neural network that denoises high-dimensional X-ray images. The idea is to train on some datasets (e.g. 1, 2, 3), and then start a new training run on other datasets (e.g. 4, 5, 6) with the weights initialized from the previous training. Implementation-wise it works; however, the weights produced by the last rotation perform well only on the datasets used in that rotation, and the same goes for every other rotation.
In other words, the weights resulting from training on datasets 4, 5, 6 do not give results on an image from dataset 1 as good as the weights that were trained on datasets 1, 2, 3, which is not what I intend.
The idea is that the weights should be tuned to work well on all datasets, since training on the whole dataset at once does not fit into memory.
I tried other solutions, such as a custom generator that reads images from disk and trains in batches, but that is very slow because it depends on factors like disk I/O and the time complexity of the processing functions inside the custom Keras generator.
Below is the code that shows what I am doing. I have 12 datasets, separated into 4 checkpoints. For each checkpoint the data is loaded, training runs, and the final model path is appended to an array; the next rotation loads the weights from the previous rotation and continues.
EPOCHES = 150
NUM_CHKPTS = 4
weights = []
for chk in range(1, NUM_CHKPTS + 1):
    log_dir = os.path.join(os.getcwd(), 'resnet_checkpts_' + str(EPOCHES) + "_tl2_chkpt" + str(chk))
    if not os.path.isdir(log_dir):
        os.makedirs(log_dir)
    else:
        print('Training log directory already exists # {}.'.format(log_dir))
    tb_output = TensorBoard(log_dir=log_dir, histogram_freq=1)
    print("Loading Data From CHKPT #" + str(chk))
    h5f = h5py.File('C:\\autoencoder\\datasets\\mix\\chk' + str(chk) + '.h5', 'r')
    org_patch = h5f['train_data'][:]
    noisy_patch = h5f['train_noisy'][:]
    h5f.close()
    input_patch, test_patch, noisy_patch, test_noisy_patch = train_test_split(org_patch, noisy_patch, train_size=0.8, shuffle=True)
    print("Reshaping")
    train_data = np.array([np.reshape(input_patch[i], (52, 52, 1)) for i in range(input_patch.shape[0])], dtype=np.float32)
    train_noisy_data = np.array([np.reshape(noisy_patch[i], (52, 52, 1)) for i in range(noisy_patch.shape[0])], dtype=np.float32)
    test_data = np.array([np.reshape(test_patch[i], (52, 52, 1)) for i in range(test_patch.shape[0])], dtype=np.float32)
    test_noisy_data = np.array([np.reshape(test_noisy_patch[i], (52, 52, 1)) for i in range(test_noisy_patch.shape[0])], dtype=np.float32)
    print('Number of training samples are:', train_data.shape[0])
    print('Number of test samples are:', test_data.shape[0])
    # IN = np.ones((len(XTRAINFILES), 52, 52, 1 ))
    if chk == 1:
        print("Generating the Model For The First Time..")
        autoencoder_model = model_autoencoder(train_noisy_data)
        print("Done!")
    else:
        autoencoder_model = load_model(weights[chk-2])
    checkpt_path = log_dir + r"\\cp-{epoch:04d}.ckpt"
    checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpt_path, verbose=0, save_weights_only=True, save_freq='epoch')
    optimizer = tf.keras.optimizers.Adam(lr=0.0001)
    autoencoder_model.compile(loss='mse', optimizer=optimizer)
    autoencoder_model.fit(train_noisy_data, train_data,
                          batch_size=128,
                          epochs=EPOCHES, shuffle=True, verbose=1,
                          validation_data=(test_noisy_data, test_data),
                          callbacks=[tb_output, checkpoint_callback])
    weight_dir = log_dir + '\\model_resnet_new_OL' + str(EPOCHES) + 'epochs.h5'
    weights.append(weight_dir)
    autoencoder_model.save(weight_dir)  # Saved model name includes the number of epochs.
TensorBoard graphs; rotations 1, 2, 3, 4 from top to bottom:
Your model forgets the previous datasets as it trains on a new one (catastrophic forgetting).
In reinforcement learning, when games are used to train a deep reinforcement learning (DRL) agent, a replay memory is used: it collects data from different rounds of the game, because each round produces different data, and a random sample of that memory is drawn for each training update. That way the DRL model learns to play different rounds without forgetting the previous ones.
You can try to create a single dataset by taking random samples from each of your datasets.
When you train the model on a new rotation, make sure data from all previous rotations is also present in the current one (see the sketch after this answer).
In transfer learning, when you train a model on a new dataset, you freeze the earlier layers so the model does not forget its previous training. You are not doing transfer learning here, but the effect is the same: once you start training on the second dataset, what was learned from the first is slowly overwritten in the weights.
You can try freezing the initial layers so that they are not updated while extracting features, assuming all of the datasets contain similar images; that way the model will forget less of its previous training, as in transfer learning. But some forgetting will still occur when you train on a new dataset.
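As a minimal sketch of the replay idea (file paths and array names follow the question; the number of samples taken per file is an assumption), you could mix a random slice of every checkpoint file into each rotation so that older data keeps being revisited:
import h5py
import numpy as np

def sample_chunk(path, n):
    # Load a random subset of n clean/noisy patch pairs from one checkpoint file.
    with h5py.File(path, 'r') as h5f:
        total = h5f['train_data'].shape[0]
        idx = np.sort(np.random.choice(total, size=min(n, total), replace=False))
        return h5f['train_data'][idx], h5f['train_noisy'][idx]

clean_parts, noisy_parts = [], []
for chk in range(1, NUM_CHKPTS + 1):
    path = 'C:\\autoencoder\\datasets\\mix\\chk' + str(chk) + '.h5'
    clean, noisy = sample_chunk(path, n=5000)     # samples per file: an assumption to tune
    clean_parts.append(clean)
    noisy_parts.append(noisy)

org_patch = np.concatenate(clean_parts)
noisy_patch = np.concatenate(noisy_parts)         # then split and reshape as in the question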

Anomalies have similar error values to normal data

I have inertial measurement unit (IMU) data for which I am building an anomaly detection autoencoder neural net. I have about 5k training samples, of which I use 10% for validation, and about 50 samples (though I can make more) to test anomaly detection. The dataset has 12 IMU features. I train for about 10,000 epochs and reach a mean squared reconstruction error (MSE) of about 0.004 on the training data. After training, I compute the MSE on the test data and get values very similar to those on the training data (0.003), and I do not know why!
I am making my test set by slicing 50 samples from the overall data (not part of X_train) and changing one of the features to all zeros. I have also tried adding noise to one of the features as well as making multiple features zero.
np.random.seed(404)
np.random.shuffle(all_imu_data)
norm_imu_data = all_imu_data[:len_slice]
anom_imu_data = all_imu_data[len_slice:]
anom_imu_data[:,6] = 0
scaler = MinMaxScaler()
norm_data = scaler.fit_transform(norm_imu_data)
anom_data = scaler.transform(anom_imu_data)
X_train = pd.DataFrame(norm_data)
X_test = pd.DataFrame(anom_data)
I have tried many different network sizes, varying the number of hidden layers and the number of nodes per layer. As an example, here is a [12-7-4-7-12] topology:
input_dim = num_features
input_layer = Input(shape=(input_dim, ))
encoder = Dense(int(7), activation="tanh", activity_regularizer=regularizers.l1(10e-5))(input_layer)
encoder = Dense(int(4), activation="tanh")(encoder)
decoder = Dense(int(7), activation="tanh")(encoder)
decoder = Dense(int(input_dim), activation="tanh")(decoder)
autoencoder = Model(inputs=input_layer, outputs=decoder)
autoencoder.compile(optimizer='adam', loss='mse', metrics=['mse'])
history = autoencoder.fit(X_train, X_train,
                          epochs=nb_epoch,
                          batch_size=batch_size,
                          shuffle=True,
                          validation_split=0.1,
                          verbose=1,
                          callbacks=[checkpointer, tensorboard]).history
pred_train = autoencoder.predict(X_train)
pred_test = autoencoder.predict(X_test)
mse_train = np.mean(np.power(X_train - pred_train, 2), axis=1)
mse_test = np.mean(np.power(X_test - pred_test, 2), axis=1)
print('MSE mean() - X_train:', np.mean(mse_train))
print('MSE mean() - X_test:', np.mean(mse_test))
After doing this, I get MSE mean numbers of 0.004 for Train and 0.003 for Test. Therefore, I cannot select a good threshold for anomalous data, as there are a lot of normal points that have larger MSE scores than the 'anomalous' data.
Any thoughts as to why this network is unable to detect these anomalies?
This is completely normal. You train your autoencoder on a subsample of your whole data, so anomalies also contaminate the training set. The purpose of the autoencoder is to reconstruct its training data as closely as possible, and it does so for the anomalies too; it is a very powerful model, so if it sees anomalies during training it will learn to reconstruct them easily.
You should first remove the most anomalous points (say the top 5%) with another anomaly detection algorithm (for example Isolation Forest) and train the autoencoder only on the remaining, outlier-free subsample, as in the sketch below.
After that, you can find your outliers easily.
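A minimal sketch of that pre-filtering step, assuming the norm_data array from the question and a contamination rate of about 5% (an assumption you would tune):
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.05, random_state=42)   # assume ~5% contamination
flags = iso.fit_predict(norm_data)                           # +1 = inlier, -1 = outlier

clean_train = norm_data[flags == 1]                          # fit the autoencoder on this
print('removed', int((flags == -1).sum()), 'suspected outliers')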

Nan loss in keras with triplet loss

I'm trying to learn an embedding for Paris6k images by combining VGG and Adrian Ung's triplet loss. The problem is that after a small number of iterations, within the first epoch, the loss becomes NaN, and then the accuracy and validation accuracy jump to 1.
I've already tried lowering the learning rate, increasing the batch size (only to 16 because of memory), changing the optimizer (Adam and RMSprop), checking whether there are any None values in my dataset, changing the data format from 'float32' to 'float64', adding a small bias to the data, and simplifying the model.
Here is my code:
base_model = VGG16(include_top = False, input_shape = (512, 384, 3))
input_images = base_model.input
input_labels = Input(shape=(1,), name='input_label')
embeddings = Flatten()(base_model.output)
labels_plus_embeddings = concatenate([input_labels, embeddings])
model = Model(inputs=[input_images, input_labels], outputs=labels_plus_embeddings)
batch_size = 16
epochs = 2
embedding_size = 64
opt = Adam(lr=0.0001)
model.compile(loss=tl.triplet_loss_adapted_from_tf, optimizer=opt, metrics=['accuracy'])
label_list = np.vstack(label_list)
x_train = image_list[:2500]
x_val = image_list[2500:]
y_train = label_list[:2500]
y_val = label_list[2500:]
dummy_gt_train = np.zeros((len(x_train), embedding_size + 1))
dummy_gt_val = np.zeros((len(x_val), embedding_size + 1))
H = model.fit(
    x=[x_train, y_train],
    y=dummy_gt_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=([x_val, y_val], dummy_gt_val),
    callbacks=callbacks_list)
There are 3366 images, with values scaled to the range [0, 1].
The network is fit against dummy targets because it learns embeddings directly from the images: images of the same class should be a small distance apart, images of different classes a large distance apart, and the real class label is fed in as part of the input rather than as the target.
I've noticed that when I previously used an incorrect class division (keeping images that should have been discarded), I did not have the NaN loss problem.
What should I try to do?
Thanks in advance and sorry for my english.
In some cases a seemingly random NaN loss is caused by the data itself: if there are no positive pairs in a batch, you will get a NaN loss.
As you can see in Adrian Ung's notebook (or in the TensorFlow Addons triplet loss; it is the same code):
semi_hard_triplet_loss_distance = math_ops.truediv(
    math_ops.reduce_sum(
        math_ops.maximum(
            math_ops.multiply(loss_mat, mask_positives), 0.0)),
    num_positives,
    name='triplet_semihard_loss')
There is a division by the number of positive pairs (num_positives), which can produce a NaN when that number is zero.
I suggest you inspect your data pipeline to make sure there is at least one positive pair in each of your batches. (For example, you can adapt some of the code in triplet_loss_adapted_from_tf to compute num_positives for a batch and check that it is greater than 0; a rough offline check is sketched below.)
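A rough offline check along those lines might look like the sketch below; it assumes integer class labels and the y_train and batch_size variables from the question, and since model.fit shuffles by default it only approximates the batches the model actually sees:
import numpy as np

def has_positive_pair(labels):
    # True if at least one class appears twice or more in the batch.
    _, counts = np.unique(labels, return_counts=True)
    return bool(np.any(counts >= 2))

for start in range(0, len(y_train), batch_size):
    batch_labels = np.asarray(y_train[start:start + batch_size]).ravel()
    if not has_positive_pair(batch_labels):
        print('batch starting at index', start, 'has no positive pair -> NaN risk')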
Try increasing your batch size. It happened to me as well: as mentioned in the previous answer, the network could not find any positive pairs (num_positives was zero). I had 250 classes and was initially getting a NaN loss; after increasing the batch size to 128/256 the issue disappeared.
I saw that Paris6k has 12 or 15 classes. Increase your batch size to 32, and if you run out of GPU memory, try a model with fewer parameters. EfficientNetB0 is a good starting point: it has 5.3M parameters, compared to 138M for VGG16 (see the sketch below).
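A minimal sketch of swapping the backbone for EfficientNetB0 as suggested above; the 224x224 input size, the global-average-pooling head, and the 64-dimensional Dense embedding are assumptions, not the question's exact architecture:
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D, Input, concatenate
from tensorflow.keras.models import Model

base_model = EfficientNetB0(include_top=False, input_shape=(224, 224, 3))
input_labels = Input(shape=(1,), name='input_label')

embeddings = GlobalAveragePooling2D()(base_model.output)   # pool instead of Flatten
embeddings = Dense(64)(embeddings)                         # 64-d embedding, as in the question
labels_plus_embeddings = concatenate([input_labels, embeddings])

model = Model(inputs=[base_model.input, input_labels], outputs=labels_plus_embeddings)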
I have implemented a package for triplet generation so that every batch is guaranteed to include positive pairs. It is compatible with TF/Keras only.
https://github.com/ma7555/kerasgen (Disclaimer: I am the owner)

ImageDataGenerator performs worse

I built a neural network with and without ImageDataGenerator (IDG). Without it, the network works fine. With the IDG, both the accuracy and the validation accuracy scores are really bad, so I think I am doing something wrong.
I wanted to use the IDG to see what augmentation could do for my neural network, but even when I remove all of the augmentation it still performs badly.
Here is my code for the IDG:
image_size = 224
train_datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)
train_generator = train_datagen.flow_from_directory('images',
                                                    target_size=(image_size, image_size),
                                                    batch_size=10,
                                                    class_mode='categorical',
                                                    subset='training')
validation_generator = train_datagen.flow_from_directory('images',
                                                         target_size=(image_size, image_size),
                                                         batch_size=10,
                                                         class_mode='categorical',
                                                         subset='training')
When I fit it I use this code:
chat = model.fit_generator(train_generator,
                           steps_per_epoch=train_generator.samples // 10,
                           validation_data=validation_generator,
                           validation_steps=validation_generator.samples // 10,
                           epochs=10)
Am I doing something wrong? Does the IDG perform some operation on the images that I don't see, but that changes them in a way that hurts training?
When I plot my images, I don't see anything strange.
Hope someone can give me some tips!
When you say that the performance is worse with data augmentation, are you comparing both on the same dataset?
A common mistake is to compare the accuracy of a model trained with data augmentation and evaluated on the augmented data against a model trained without augmentation and evaluated on the regular data.
Keep in mind that augmented data is harder for the model to fit. So even if the reported accuracy isn't as high as before, it may actually be higher when both models are evaluated on the same regular (un-augmented) dataset, as in the sketch below.
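A minimal sketch of such a fair comparison, reusing the generator settings from the question; model_aug and model_plain are hypothetical names for the two trained models:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

eval_datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)
eval_generator = eval_datagen.flow_from_directory('images',
                                                  target_size=(image_size, image_size),
                                                  batch_size=10,
                                                  class_mode='categorical',
                                                  subset='validation',
                                                  shuffle=False)

# Evaluate both models on the same un-augmented data
print('with augmentation   :', model_aug.evaluate(eval_generator))
print('without augmentation:', model_plain.evaluate(eval_generator))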

Resources