I have read this tutorial on using albumentations with a Keras Sequence. The code is as follows:
from tensorflow.python.keras.utils.data_utils import Sequence
import numpy as np

class CIFAR10Sequence(Sequence):
    def __init__(self, x_set, y_set, batch_size, augmentations):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        self.augment = augmentations

    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return np.stack([
            self.augment(image=x)["image"] for x in batch_x
        ], axis=0), np.array(batch_y)
The thing is, I don't understand how it augments the data (i.e. provides more samples). The way I see it, it just transforms the samples already in the dataset rather than generating new ones.
Following the tutorial you linked, you can see that the author defines AUGMENTATIONS_TRAIN and AUGMENTATIONS_TEST objects which perform the actual augmentation.
Then these objects are passed to the sequence generator above:
train_gen = CIFAR10Sequence(x_train, y_train, hparams.train_batch_size, augmentations=AUGMENTATIONS_TRAIN)
so that calling self.augment actually augments every image in the batch:
self.augment(image=x)["image"] for x in batch_x
And yes, augmentation doesn't mean creating new objects, but applying random transformations to existing ones to create 'artificial' samples which are somewhat different from the originals.
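For reference, here is a minimal sketch of what such a pipeline could look like; the exact transforms are my assumption, not necessarily the tutorial's:

import albumentations as A

# A Compose pipeline is callable: aug(image=img) returns a dict whose
# "image" key holds the randomly transformed array, which is exactly how
# self.augment(image=x)["image"] is used in __getitem__ above.
AUGMENTATIONS_TRAIN = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
])
AUGMENTATIONS_TEST = A.Compose([])  # no-op pipeline for evaluation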
I am studying tf.keras.utils.Sequence on TensorFlow 2.4.1. I used the example code from the Sequence API documentation (https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence) and fine-tuned it by adding an on_epoch_end function to adaptively change the batch_size value on every epoch.
from skimage.io import imread
from skimage.transform import resize
import tensorflow
import numpy as np
import random
import math

# Here, `x_set` is a list of paths to the images
# and `y_set` are the associated classes.
class CIFAR10Sequence(tensorflow.keras.utils.Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(len(self.x) / self.batch_size)

    def on_epoch_end(self):
        print(self.batch_size)
        self.batch_size = int(random.randint(10, 100))

    def __getitem__(self, idx):
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return np.array([
            resize(imread(file_name), (200, 200))
            for file_name in batch_x]), np.array(batch_y)
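(For what it's worth, the Sequence itself does report a new length once on_epoch_end has run; here is a quick check with dummy file names, before any training is involved:)

# Sanity check: __len__ does track the new batch_size.
seq = CIFAR10Sequence(['img_%d.png' % i for i in range(1000)],
                      list(range(1000)), batch_size=32)
print(len(seq))     # ceil(1000 / 32) = 32 batches
seq.on_epoch_end()  # re-draws batch_size in [10, 100]
print(len(seq))     # ceil(1000 / new batch_size)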
However, in practice, the number of steps per epoch, which I expected to change with the number of batches, remains unchanged. In fact, TensorFlow returns a WARNING informing me that it ran out of data, and stops the training immediately. This problem happens whenever the initial batch_size is smaller than the current self.batch_size.
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches
Here is my guess: TensorFlow did adapt the batch size after every epoch, but somehow the model was still keeping the initial value. This problem never happened in Keras version 1. So far, I have no clue how to solve it.
Edit 1: The amount of training data is much larger than the number of batches.
I faced exactly this problem. In my case, I update the data after every epoch (the number of samples increases), yet I notice that the number of batches in each epoch stays the same, although it should depend on the number of samples. My guess is that __len__ is called once during initialization and not re-evaluated after each epoch.
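A workaround consistent with that guess, sketched under the assumption that the step count is only computed when fit() starts (model, x_set, y_set, and num_epochs stand in for your own objects):

# Run one epoch at a time with a fresh Sequence, so that len(sequence)
# is re-evaluated with the new batch size / new data on every epoch.
for epoch in range(num_epochs):
    seq = CIFAR10Sequence(x_set, y_set, batch_size=random.randint(10, 100))
    model.fit(seq, initial_epoch=epoch, epochs=epoch + 1, verbose=1)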
I know that ImageDataGenerator generates one randomly augmented image for each input image. Now I would like to generate two augmented images for each input image:
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')

train_ds = datagen.flow_from_directory('/home/train/')
To explain further: I would like to apply two distinct augmentation functions to the same image, i.e., if we sample 5 images, we end up with 2 × 5 = 10 augmented observations in the batch.
So how can I proceed?
I would recommend creating a custom data generator that inherits from tf.keras.utils.Sequence. There are a number of ways to go about this, but this should be along the lines of what you are looking for:
import math
import numpy as np
import tensorflow as tf

class double_aug_generator(tf.keras.utils.Sequence):
    def __init__(self, x, y, batch_size, aug_params1, aug_params2):
        self.x, self.y = x, y
        self.batch_size = batch_size
        self.datagen = tf.keras.preprocessing.image.ImageDataGenerator(**aug_params1)
        # dictionary of parameters for the second augmentation
        self.aug_params2 = aug_params2

    def __len__(self):
        return math.ceil(len(self.x) / self.batch_size)

    def load(self, file_names):
        # load and return raw images however you like
        raise NotImplementedError

    def __getitem__(self, idx):
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        # load images
        batch_x = self.load(batch_x)
        # apply the first (random) augmentation; random_transform works on
        # one image at a time
        batch_x = np.stack([self.datagen.random_transform(img) for img in batch_x])
        # apply the second; apply_transform also expects a single 3D image
        batch_x = np.stack([self.datagen.apply_transform(img, self.aug_params2)
                            for img in batch_x])
        return batch_x, np.array(batch_y)
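Note that the generator above applies the two augmentations in sequence, so each batch still holds batch_size images. If the goal is literally 2 × 5 = 10 observations per batch, here is a sketch of a variant, subclassing the generator above and reusing the same hypothetical load helper:

# A variant that returns 2 * batch_size observations per batch
# by concatenating two independently augmented copies of the same images.
class double_batch_generator(double_aug_generator):
    def __getitem__(self, idx):
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = np.array(self.y[idx * self.batch_size:(idx + 1) * self.batch_size])
        images = self.load(batch_x)
        # first copy: random transforms drawn from aug_params1
        aug1 = np.stack([self.datagen.random_transform(img) for img in images])
        # second copy: the fixed transform described by aug_params2
        aug2 = np.stack([self.datagen.apply_transform(img, self.aug_params2)
                         for img in images])
        # duplicate the labels so they still line up with the images
        return np.concatenate([aug1, aug2]), np.concatenate([batch_y, batch_y])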
I'm experiencing many errors/problems with Keras generators and multi-processing.
I have used:
history = model.fit(training_generator,
                    steps_per_epoch=trainingSetSize // batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=validationGenerator,
                    validation_steps=validationSetSize // batch_size,
                    callbacks=callbacks,
                    use_multiprocessing=True,
                    workers=nb_workers,
                    max_queue_size=2 * nb_workers)
to launch the training.
My generator yields batches of (batch_size, 64, 64, 2) tensors. One problem is that I see the following warning/error message:
Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
even though I set steps_per_epoch = X_size // batch_size.
Also, by adding a print() inside the generator, I noticed that at the end of an epoch it generates an "empty" tensor of shape (0, 64, 64, 2)...
Any ideas, proposals, comments, or answers?
This is the generator's code:
import threading
import h5py
from tensorflow.keras.utils import Sequence

class custom_gen(Sequence):
    def __init__(self, fn, datasetSize, batch_size, trainingsetName, trainingsetTargetsName):
        self.fn = fn
        self.datasetSize = datasetSize
        self.batch_size = batch_size
        self.trainingsetTargetsName = trainingsetTargetsName
        self.trainingsetName = trainingsetName
        self.lock = threading.Lock()

    # compulsory method: the total number of batches the generator must produce
    def __len__(self):
        return self.datasetSize // self.batch_size

    # the idx argument is supplied by Keras itself
    def __getitem__(self, idx):
        with self.lock:
            # open lazily so every worker gets its own handle, and close
            # the file again once the slice has been read
            with h5py.File(self.fn, 'r') as f:
                X = f[self.trainingsetName][idx * self.batch_size:(idx + 1) * self.batch_size]
                Y = f[self.trainingsetTargetsName][idx * self.batch_size:(idx + 1) * self.batch_size]
            print('Generated X.shape=' + str(X.shape))
            print('Generated Y.shape=' + str(Y.shape))
            return X, Y
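One thing worth checking (a suggestion on my part, not something from this thread): when a Sequence is passed to model.fit, Keras can infer the step counts from len(generator), so an explicit steps_per_epoch that disagrees with __len__ is a classic source of this "ran out of data" behavior. A sketch of the same call without the explicit step counts:

# Let Keras derive steps_per_epoch and validation_steps from __len__,
# so the requested step count can never exceed what the Sequence provides.
history = model.fit(training_generator,
                    epochs=epochs,
                    verbose=1,
                    validation_data=validationGenerator,
                    callbacks=callbacks,
                    use_multiprocessing=True,
                    workers=nb_workers,
                    max_queue_size=2 * nb_workers)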
When implementing a custom layer in Keras, I need to know the real value of batch_size; my shape is (?, 20).
Questions:
1. What is the best way to change (?, 20) to (batch_size, 20)?
I have looked into this, but it cannot be adapted to my problem.
I can pass the batch_size to this layer; in that case, I need to reshape (?, 20) to (batch_size, 20). How can I do that?
2. Is that the best approach, or is there a built-in function that can get the real batch_size while building and running the model?
This is my layer:
from scipy.stats import entropy
from keras.engine import Layer
import keras.backend as K
import numpy as np

class measure(Layer):
    def __init__(self, beta, **kwargs):
        self.beta = beta
        self.uses_learning_phase = True
        self.supports_masking = True
        super(measure, self).__init__(**kwargs)

    def call(self, x):
        return K.in_train_phase(self.rev_entropy(x, self.beta), x)

    def get_config(self):
        config = {'beta': self.beta}
        base_config = super(measure, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

    def rev_entropy(self, x, beta):
        entropy_p_t_w = np.apply_along_axis(entropy, 1, x)
        con = (beta / (1 + entropy_p_t_w)) ** 1.5
        new_f_w_t = x * (con.reshape(con.shape[0], 1))
        norm_const = 1e-30 + np.sum(new_f_w_t, axis=0)
        for t in range(norm_const.shape[0]):
            new_f_w_t[:, t] /= norm_const[t]
        return new_f_w_t
And here is where I call this layer:
encoded = measure(beta=0.08)(encoded)
I am also using fit_generator, if that helps at all:

autoencoder.fit_generator(train_gen, steps_per_epoch=num_train_steps, epochs=NUM_EPOCHS,
                          validation_data=test_gen, validation_steps=num_test_steps,
                          callbacks=[checkpoint])
The dimension of the x passed to the layer is (?, 20), and that's why I cannot do my calculation.
Thanks:)
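For reference, a minimal sketch (not from this thread) of the difference between the static and the dynamic batch size in Keras; the Lambda layer is only there to make the run-time value visible:

import numpy as np
import keras.backend as K
from keras.layers import Input, Lambda
from keras.models import Model

# While the graph is built the static shape is (None, 20), but K.shape(x)[0]
# resolves to the actual batch size once real data flows through the model.
inp = Input(shape=(20,))
scaled = Lambda(lambda x: x * K.cast(K.shape(x)[0], K.floatx()))(inp)
model = Model(inp, scaled)
print(K.int_shape(inp))                       # (None, 20): batch dim unknown
print(model.predict(np.ones((4, 20)))[0, 0])  # 4.0: batch size at run time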
I am trying to fit a model using a large image dataset. I have 14 GB of RAM, and the dataset has a size of 40 GB. I tried to use fit_generator, but I ended up with a method that does not delete the loaded batches after using them.
If there is any way to solve the problem, or any resources on it, please point me to them.
Thanks.
The generator code is:
import numpy as np
import pandas as pd
from tensorflow.keras.utils import Sequence

class Data_Generator(Sequence):
    def __init__(self, image_filenames, labels, batch_size):
        self.image_filenames, self.labels = image_filenames, labels
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.image_filenames) / float(self.batch_size)))

    # split the ground-truth DataFrame into one array per label column
    def __format_labels__(self, gd_truth):
        cols = gd_truth.columns
        y = []
        for col in cols:
            y.append(gd_truth[col].values)
        return y

    def __getitem__(self, idx):
        batch_x = self.image_filenames[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]
        gd_truth = pd.DataFrame(data=batch_y, columns=self.labels.columns)
        # read_image is defined elsewhere
        return np.array([read_image(file_name) for file_name in batch_x]), self.__format_labels__(gd_truth)
Then I created two generators, for the training and validation images:

my_training_batch_generator = Data_Generator(training_filenames, trainTargets, batch_size)
my_validation_batch_generator = Data_Generator(validation_filenames, valTargets, batch_size)
The fit_generator call is as follows:
num_epochs = 10
model.fit_generator(generator=my_training_batch_generator,
                    steps_per_epoch=(num_training_samples // batch_size),
                    epochs=num_epochs,
                    verbose=1,
                    validation_data=my_validation_batch_generator,
                    validation_steps=(num_validation_samples // batch_size),
                    max_queue_size=16)
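For what it's worth (my own suggestion, not from this thread): with a Sequence, only the queued batches live in RAM at any moment, so the input pipeline's memory use is bounded by max_queue_size times the batch footprint rather than by the 40 GB on disk. If RAM still fills up, shrinking the queue is an easy first experiment:

# Hypothetical tweak: a smaller queue bounds how many decoded batches
# can sit in memory at once; 16 queued batches of large images add up.
model.fit_generator(generator=my_training_batch_generator,
                    steps_per_epoch=(num_training_samples // batch_size),
                    epochs=num_epochs,
                    verbose=1,
                    validation_data=my_validation_batch_generator,
                    validation_steps=(num_validation_samples // batch_size),
                    max_queue_size=4)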