Keras: load images batch wise for large dataset - keras

Is it possible in Keras to load only one batch into memory at a time? I have a 40 GB dataset of images.
If the dataset were small I could use ImageDataGenerator to generate batches, but because the dataset is so large I can't load all the images into memory.
Is there any method in Keras that does something similar to the following TensorFlow code?
path_queue = tf.train.string_input_producer(input_paths, shuffle= False)
paths, contents = reader.read(path_queue)
inputs = decode(contents)
input_batch = tf.train.batch([inputs], batch_size=2)
I am using this method to serialize inputs in tensorflow but I don't know how to achieve this task in Keras.

Keras has the method fit_generator() in its models. It accepts a python generator or a keras Sequence as input.
You can create a simple generator like this:
fileList = listOfFiles
def imageLoader(files, batch_size):
    L = len(files)
    #this line is just to make the generator infinite, keras needs that
    while True:
        batch_start = 0
        batch_end = batch_size
        while batch_start < L:
            limit = min(batch_end, L)
            X = someMethodToLoadImages(files[batch_start:limit])
            Y = someMethodToLoadTargets(files[batch_start:limit])
            yield (X,Y) #a tuple with two numpy arrays with batch_size samples
            batch_start += batch_size
            batch_end += batch_size
And fit like this:
model.fit_generator(imageLoader(fileList,batch_size),steps_per_epoch=..., epochs=..., ...)
Normally, you pass to steps_per_epoch the number of batches you will take from the generator.
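As a small sketch of that rule, assuming each epoch should cover every file once: the number of steps is the file count divided by the batch size, rounded up so that the final, shorter batch is included. The helper name steps_for and the sample numbers are only illustrative.

```python
import math

def steps_for(num_samples, batch_size):
    """Number of batches needed to see every sample once (the last batch may be short)."""
    return math.ceil(num_samples / batch_size)

# Illustrative numbers only: 1000 files in batches of 32
steps_per_epoch = steps_for(1000, 32)
print(steps_per_epoch)  # 32 (31 full batches plus one batch of 8)
```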
You can also implement your own Keras Sequence. It's a little more work, but the Keras documentation recommends it if you are going to load data with multiple threads or processes.
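A minimal sketch of such a Sequence, under a few assumptions: the class name FileBatchSequence and the two loader callables are hypothetical stand-ins for whatever reads your images and targets, and the fallback base class exists only so the snippet runs without TensorFlow installed. The two methods Keras relies on are __len__ (batches per epoch) and __getitem__ (one batch by index).

```python
import math
import numpy as np

try:
    # Real base class when TensorFlow/Keras is installed
    from tensorflow.keras.utils import Sequence
except ImportError:
    # Fallback so the sketch still runs without TensorFlow
    Sequence = object


class FileBatchSequence(Sequence):
    """Returns one batch per index; only that batch is loaded into memory."""

    def __init__(self, files, batch_size, load_images, load_targets):
        super().__init__()
        self.files = list(files)
        self.batch_size = batch_size
        self.load_images = load_images    # hypothetical: list of paths -> numpy array
        self.load_targets = load_targets  # hypothetical: list of paths -> numpy array

    def __len__(self):
        # Number of batches per epoch, counting a final short batch
        return math.ceil(len(self.files) / self.batch_size)

    def __getitem__(self, idx):
        chunk = self.files[idx * self.batch_size:(idx + 1) * self.batch_size]
        return self.load_images(chunk), self.load_targets(chunk)


# Stand-in loaders for demonstration; real ones would read image files from disk
def fake_images(paths):
    return np.zeros((len(paths), 8, 8, 3), dtype=np.float32)

def fake_targets(paths):
    return np.zeros((len(paths), 1), dtype=np.float32)

seq = FileBatchSequence([f"img_{i}.png" for i in range(10)], 4, fake_images, fake_targets)
print(len(seq))         # 3
print(seq[2][0].shape)  # (2, 8, 8, 3)
```

Because each batch is addressed by index rather than produced by a running generator, Keras can request batches in any order, which is what makes this form easier to parallelise.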

Related

How to use Tensorflow 2 Dataset API with Keras?

This question has been answered for Tensorflow 1, eg: How to Properly Combine TensorFlow's Dataset API and Keras?, but this answer hasn't helped for my use case.
Below is an example of a model with three float32 inputs and one float32 output. I have a large amount of data that doesn't all fit into memory at once, so it's split into separate files. I'm trying to use the Dataset API to train a model by bringing in a portion of the training data at once.
import tensorflow as tf
import tensorflow.keras.layers as layers
import numpy as np
# Create TF model of a given architecture (number of hidden layers, layersize, #outputs, activation function)
def create_model(h=2, l=64, activation='relu'):
    model = tf.keras.Sequential([
        layers.Dense(l, activation=activation, input_shape=(3,), name='input_layer'),
        *[layers.Dense(l, activation=activation) for _ in range(h)],
        layers.Dense(1, activation='linear', name='output_layer')])
    return model
# Load data (3 X variables, 1 Y variable) split into 5 files
# (for this example, just create a list 5 numpy arrays)
list_of_training_datasets = [np.random.rand(10,4).astype(np.float32) for _ in range(5)]
validation_dataset = np.random.rand(30,4).astype(np.float32)
def data_generator():
    for data in list_of_training_datasets:
        x_data = data[:, 0:3]
        y_data = data[:, 3:4]
        yield((x_data,y_data))
# prepare model
model = create_model(h=2,l=64,activation='relu')
model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam())
# load dataset
dataset = tf.data.Dataset.from_generator(data_generator,(np.float32,np.float32))
# fit model
model.fit(dataset, epochs=100, validation_data=(validation_dataset[:,0:3],validation_dataset[:,3:4]))
Running this, I get the error:
ValueError: Cannot take the length of shape with unknown rank.
Does anyone know how to get this working? I would also like to be able to use the batch dimension, to load two data files at a time, for example.
You need to specify the shapes of your dataset along with the return data types, like this:
dataset = tf.data.Dataset.from_generator(data_generator,
                                         (np.float32,np.float32),
                                         ((None, 3), (None, 1)))
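In more recent TensorFlow 2 releases the same information can be passed through the output_signature argument, which bundles dtype and shape into tf.TensorSpec objects. The following is a sketch of that variant using the toy data from the question; it assumes a TensorFlow version that supports output_signature (2.4 or later).

```python
import numpy as np
import tensorflow as tf

# Same toy data as in the question: 5 "files" with 3 inputs and 1 target per row
list_of_training_datasets = [np.random.rand(10, 4).astype(np.float32) for _ in range(5)]

def data_generator():
    for data in list_of_training_datasets:
        yield data[:, 0:3], data[:, 3:4]

dataset = tf.data.Dataset.from_generator(
    data_generator,
    output_signature=(
        tf.TensorSpec(shape=(None, 3), dtype=tf.float32),  # x: any number of rows, 3 columns
        tf.TensorSpec(shape=(None, 1), dtype=tf.float32),  # y: any number of rows, 1 column
    ),
)

# One pass over the dataset yields one element per file
n_elements = 0
for x, y in dataset:
    n_elements += 1
print(n_elements)  # 5
```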
The following works, but I don't know if it is the most efficient approach.
As far as I understand, if your training dataset is split into 10 pieces, you should set steps_per_epoch=10, which ensures that each epoch steps through all the data once. dataset.repeat() is needed because the dataset iterator is "used up" after the first epoch; .repeat() ensures that the iterator is created again once it has been exhausted.
import numpy as np
import tensorflow.keras.layers as layers
import tensorflow as tf
# Create TF model of a given architecture (number of hidden layers, layersize, #outputs, activation function)
def create_model(h=2, l=64, activation='relu'):
    model = tf.keras.Sequential([
        layers.Dense(l, activation=activation, input_shape=(3,), name='input_layer'),
        *[layers.Dense(l, activation=activation) for _ in range(h)],
        layers.Dense(1, activation='linear', name='output_layer')])
    return model
# Load data (3 X variables, 1 Y variable) split into 5 files
# (for this example, just create a list 5 numpy arrays)
list_of_training_datasets = [np.random.rand(10,4).astype(np.float32) for _ in range(5)]
steps_per_epoch = len(list_of_training_datasets)
validation_dataset = np.random.rand(30,4).astype(np.float32)
def data_generator():
    for data in list_of_training_datasets:
        x_data = data[:, 0:3]
        y_data = data[:, 3:4]
        yield((x_data,y_data))
# prepare model
model = create_model(h=2,l=64,activation='relu')
model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam())
# load dataset
dataset = tf.data.Dataset.from_generator(data_generator,output_types=(np.float32,np.float32),
    output_shapes=(tf.TensorShape([None,3]), tf.TensorShape([None,1]))).repeat()
# fit model
model.fit(dataset.as_numpy_iterator(), epochs=10,steps_per_epoch=steps_per_epoch,
          validation_data=(validation_dataset[:,0:3],validation_dataset[:,3:4]))
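For the follow-up about loading two data files at a time, one possible approach (a sketch, not part of either answer) is to do the grouping inside the Python generator, before the data reaches tf.data. The names grouped_data_generator and files_per_step are made up for illustration. The None in the declared shapes already allows the number of rows to vary from step to step.

```python
import numpy as np

# Same toy data layout as above: 5 "files", each with 3 inputs and 1 target per row
list_of_training_datasets = [np.random.rand(10, 4).astype(np.float32) for _ in range(5)]

def grouped_data_generator(files_per_step=2):
    """Yield (x, y) built from `files_per_step` consecutive files at a time."""
    for start in range(0, len(list_of_training_datasets), files_per_step):
        group = list_of_training_datasets[start:start + files_per_step]
        data = np.concatenate(group, axis=0)
        yield data[:, 0:3], data[:, 3:4]

shapes = [(x.shape, y.shape) for x, y in grouped_data_generator()]
print(shapes)
# [((20, 3), (20, 1)), ((20, 3), (20, 1)), ((10, 3), (10, 1))]
```

With five files grouped in pairs, one pass through the data takes three steps, so steps_per_epoch would be 3 rather than 5.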

Keras fit vs. fit_generator extra samples

I have training data and validation data stacked up in two tensors. At first, I trained a NN using the keras.model.fit() function. For my purposes, I wish to move to keras.model.fit_generator(). I built a generator and noticed that the number of samples is not a multiple of the batch size.
My implementation to overcome this:
indices = np.arange(len(dataset))# generate indices of len(dataset)
num_of_steps = int(np.ceil(len(dataset)/batch_size)) #number of steps per epoch
extra = num_of_steps *batch_size-len(dataset)#find the size of extra samples needed to complete the next multiplication of batch_size
additional = np.random.randint(len(dataset),size = extra )#complete with random samples
indices = np.append(indices ,additional )
After shuffling the indices at each epoch, I simply iterate over them in batch-sized steps and pull the corresponding data and labels.
I am observing a degradation in the performance of the model. When training with fit() I get 0.99 training accuracy and 0.93 validation accuracy, while with fit_generator() I get 0.95 and 0.9 respectively. Note that this is consistent and not a single experiment. I thought it might be because fit() handles the extra samples differently. Is my implementation reasonable? How does fit() handle datasets whose size is not a multiple of the batch size?
Sharing the full generator code:
def generator(self,batch_size,train):
    """
    Generates batches of samples
    :return:
    """
    while 1:
        nb_of_steps=0
        if(train):
            nb_of_steps = self._num_of_steps_train
            indices = np.arange(len(self._x_train))
            additional = np.random.randint(len(self._x_train), size=self._num_of_steps_train*batch_size-len(self._x_train))
        else:
            nb_of_steps = self._num_of_steps_test
            indices = np.arange(len(self._x_test))
            additional = np.random.randint(len(self._x_test), size=self._num_of_steps_test*batch_size-len(self._x_test))
        indices = np.append(indices,additional)
        np.random.shuffle(indices)
        # print(indices.shape)
        # print(nb_of_steps)
        for i in range(nb_of_steps):
            batch_indices=indices[i:i+batch_size]
            if(train):
                feat = self._x_train[batch_indices]
                label = self._y_train[batch_indices]
            else:
                feat = self._x_test[batch_indices]
                label = self._y_test[batch_indices]
            feat = np.expand_dims(feat,axis=1)
            # print(feat.shape)
            # print(label.shape)
            yield feat, label
It looks like you can simplify the generator significantly!
The number of steps and the index array can be set outside the loop, as they do not change. Moreover, it looks like batch_indices does not go through the entire dataset: the slice indices[i:i+batch_size] moves forward by one index per step instead of by batch_size. Finally, if your data fits in memory you might not need a generator at all, but I will leave this to your judgement.
def generator(self, batch_size, train):
    nb_of_steps = 0
    if (train):
        nb_of_steps = self._num_of_steps_train
        indices = np.arange(len(self._x_train)) #len of entire dataset
    else:
        nb_of_steps = self._num_of_steps_test
        indices = np.arange(len(self._x_test))
    while 1:
        np.random.shuffle(indices)
        for i in range(nb_of_steps):
            start_idx = i*batch_size
            end_idx = min(i*batch_size+batch_size, len(indices))
            batch_indices=indices[start_idx : end_idx]
            if(train):
                feat = self._x_train[batch_indices]
                label = self._y_train[batch_indices]
            else:
                feat = self._x_test[batch_indices]
                label = self._y_test[batch_indices]
            feat = np.expand_dims(feat,axis=1)
            yield feat, label
For a more robust generator, consider creating a class for your dataset that subclasses keras.utils.Sequence. It adds a few extra lines of code, but it is well supported by Keras.
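To see how much of the data each slicing scheme reaches, here is a small NumPy check with made-up sizes (10 samples, batch size 4). It compares the slice from the question, indices[i:i+batch_size], with the slice from the answer, and also records the batch lengths produced when the last batch is simply allowed to be shorter instead of being padded with random extra samples.

```python
import numpy as np

n_samples, batch_size = 10, 4
n_steps = int(np.ceil(n_samples / batch_size))  # 3 steps per epoch
indices = np.arange(n_samples)

# Slice used in the question: the start only advances by 1 per step
seen_question = set()
for i in range(n_steps):
    seen_question.update(indices[i:i + batch_size].tolist())

# Slice used in the answer: the start advances by batch_size per step
seen_answer = set()
batch_lengths = []
for i in range(n_steps):
    batch = indices[i * batch_size:(i + 1) * batch_size]
    seen_answer.update(batch.tolist())
    batch_lengths.append(len(batch))

print(len(seen_question))  # 6
print(len(seen_answer))    # 10
print(batch_lengths)       # [4, 4, 2]
```

With the original slice only 6 of the 10 samples are visited per epoch, which may explain part of the accuracy gap. Shuffling the indices changes which samples are visited but not how many.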

How to shuffle data at each epoch using tf.data API in TensorFlow 2.0?

I am getting my hands dirty using TensorFlow 2.0 to train my model. The new iteration feature in the tf.data API is pretty awesome. However, when I executed the following code, I found that, unlike the iteration feature in torch.utils.data.DataLoader, it did not shuffle the data automatically at each epoch. How do I achieve that using TF 2.0?
import numpy as np
import tensorflow as tf
def sample_data():
    ...
data = sample_data()
NUM_EPOCHS = 10
BATCH_SIZE = 128
# Subsample the data
mask = range(int(data.shape[0]*0.8), data.shape[0])
data_val = data[mask]
mask = range(int(data.shape[0]*0.8))
data_train = data[mask]
train_dset = tf.data.Dataset.from_tensor_slices(data_train).\
    shuffle(buffer_size=10000).\
    repeat(1).batch(BATCH_SIZE)
val_dset = tf.data.Dataset.from_tensor_slices(data_val).\
    batch(BATCH_SIZE)
loss_metric = tf.keras.metrics.Mean(name='train_loss')
optimizer = tf.keras.optimizers.Adam(0.001)
#tf.function
def train_step(inputs):
    ...
for epoch in range(NUM_EPOCHS):
    # Reset the metrics
    loss_metric.reset_states()
    for inputs in train_dset:
        train_step(inputs)
    ...
The batch needs to be reshuffled:
train_dset = tf.data.Dataset.from_tensor_slices(data_train).\
    repeat(1).batch(BATCH_SIZE)
train_dset = train_dset.shuffle(buffer_size=buffer_size)
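Note that shuffling after batch() only permutes whole batches; the samples inside each batch stay together. An alternative sketch, which is not the approach of the answer above, shuffles individual samples before batching and sets reshuffle_each_iteration=True explicitly so that every pass over the dataset draws a new order. It assumes TF 2.x eager execution, and the tiny integer range is a stand-in for the real training array.

```python
import tensorflow as tf

BATCH_SIZE = 4
data_train = tf.range(10)  # stand-in for the real training array

train_dset = (
    tf.data.Dataset.from_tensor_slices(data_train)
    .shuffle(buffer_size=10, reshuffle_each_iteration=True)  # buffer covers all 10 elements
    .batch(BATCH_SIZE)
)

epochs = []
for epoch in range(2):
    # Each `for` loop over the dataset starts a fresh iteration, hence a fresh shuffle
    order = [int(x) for batch in train_dset for x in batch.numpy()]
    epochs.append(order)
    print(epoch, order)
```

The buffer size matters: a buffer smaller than the dataset only shuffles locally, so for a full shuffle it should be at least as large as the number of samples.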

Avoid feed_dict mechanism in static graph in tensorflow

I am trying to implement a model for generating/reconstructing samples (Variational autoencoder). During test time, I would like to be able to make the model generate new samples by feeding it a latent variable, but that requires changing the inputs to a part of the computational graph.
I could use a feed_dict to "dynamically" do that, since I cannot directly change a static graph, but I want to avoid the overhead of exchanging data between the GPU and the system RAM.
As it stands I feed the data using Iterators.
def make_mnist_dataset(batch_size, shuffle=True, include_labels=True):
    """Loads the MNIST data set and returns the relevant
    iterator along with its initialization operations.
    """
    # load the data
    train, test = tf.keras.datasets.mnist.load_data()
    # binarize and reshape the data sets
    temp_train = train[0]
    temp_train = (temp_train > 0.5).astype(np.float32).reshape(temp_train.shape[0], 784)
    train = (temp_train, train[1])
    temp_test = test[0]
    temp_test = (temp_test > 0.5).astype(np.float32).reshape(temp_test.shape[0], 784)
    test = (temp_test, test[1])
    # prepare Dataset objects
    if include_labels:
        train_set = tf.data.Dataset.from_tensor_slices(train).repeat().batch(batch_size)
        test_set = tf.data.Dataset.from_tensor_slices(test).repeat(1).batch(batch_size)
    else:
        train_set = tf.data.Dataset.from_tensor_slices(train[0]).repeat().batch(batch_size)
        test_set = tf.data.Dataset.from_tensor_slices(test[0]).repeat(1).batch(batch_size)
    if shuffle:
        train_set = train_set.shuffle(buffer_size=int(0.5*train[0].shape[0]),
                                      seed=123)
    # make the iterator
    iter = tf.data.Iterator.from_structure(train_set.output_types,
                                           train_set.output_shapes)
    data = iter.get_next()
    # create initialization ops
    train_init = iter.make_initializer(train_set)
    test_init = iter.make_initializer(test_set)
    return train_init, test_init, data
And here's the code snippet where the data being iterated over is being fed to the graph:
train_init, test_init, next_batch = make_mnist_dataset(batch_size, include_labels=True)
ops = build_graph(next_batch[0], next_batch[1], learning_rate, is_training,
                  latent_dim, tau, batch_size, inf_layers, gen_layers)
Is there any way to "switch" from an Iterator object to a different input source during test time, without resorting to feed_dict?

How to implement the loss function of the paper 'Semantic Image Inpainting with Deep Generative Models' in Keras

I have trained a GAN on the celebA dataset and then separated G and D. I pick one image from the celebA training set, say yTrue, and I want to find the closest image to yTrue that G can generate, say yPred. So the loss at the output of G is ||yTrue - yPred||_2^{2}, and I minimize it with respect to the generator input (a latent variable drawn from a normal distribution). The code below gives good results. The problem is that I also want to add the prior loss, log(1 - D(G(z))), to the first line, but I don't see how to do it because D is no longer connected to G. If I directly add K.mean(K.log(1 - D.predict(G.output))) to the first line, D.predict returns a NumPy array rather than a tensor, which is not allowed.
loss = K.mean(K.square(yTrue - gf.output))
grad = K.gradients(loss,[gf.input])[0]
fn = K.function([gf.input], [grad])
generator_input = np.random.normal(0,1,[1,100])
for i in range(5000):
    grads = fn([generator_input])
    generator_input -= grads[0]*.01
recovered = gf.predict(generator_input)
In Keras, you use the final output of the model to create loss functions. You then have to train the full network against that loss (train G+D joined as a single model).
In the loss function, you will have y_true and y_pred, and you use them to compare:
PS: if the MSE is not meant to take the output of the discriminator, please describe your question in more detail.
import keras.backend as K
def customLoss(yTrue,yPred):
    mse = K.mean(K.square(yTrue-yPred))
    prior = K.mean(K.log(1-yPred))
    return mse + prior
Pass this function when compiling the model
discriminator.compile(loss=customLoss,optimizer=.....)
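To make the two terms concrete, here is a NumPy sketch of the same arithmetic outside Keras, with the generator output and the discriminator output kept as separate inputs as in the question's formulation. It assumes the discriminator outputs probabilities between 0 and 1. The function name inpainting_loss, the clipping constant eps and the weight lam are assumptions added here, not part of the answer; lam=1.0 reproduces the plain sum mse + prior.

```python
import numpy as np

def inpainting_loss(y_true, g_out, d_out, lam=1.0, eps=1e-7):
    """Context term (MSE between images) plus weighted prior term log(1 - D(G(z)))."""
    context = np.mean(np.square(y_true - g_out))
    # Clip so that log(1 - d) stays finite when the discriminator outputs exactly 1
    d = np.clip(d_out, eps, 1.0 - eps)
    prior = np.mean(np.log(1.0 - d))
    return context + lam * prior

# Toy values: target image of zeros, generated image of 0.5s, discriminator score 0.5
y_true = np.zeros((1, 4))
g_out = np.full((1, 4), 0.5)
d_out = np.array([[0.5]])
loss = inpainting_loss(y_true, g_out, d_out)
print(round(float(loss), 4))  # -0.4431
```

Inside Keras the same expression would have to be built from tensors (for example by calling the discriminator model on the generator output) rather than from predict(), which returns NumPy arrays.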

Resources