Keras fit vs. fit_generator extra smaples - keras

I have training data and validation data stacked up in two tensors. At first, I ran a NN using keras.model.fit() function. for my purposes, I wish to move to keras.model.fit_generator(). I build a generator and I have noticed the number of samples is not a multiplication of the batch size.
My implementation to overcome this:
indices = np.arange(len(dataset))# generate indices of len(dataset)
num_of_steps = int(np.ceil(len(dataset)/batch_size)) #number of steps per epoch
extra = num_of_steps *batch_size-len(dataset)#find the size of extra samples needed to complete the next multiplication of batch_size
additional = np.random.randint(len(dataset),size = extra )#complete with random samples
indices = np.append(indices ,additional )
After randomizing the indices at each epoch I simply iterate this in batches skips and pool the correct data and labels.
I am observing a degradation in the performance of the model. When training with fit() I get 0.99 training accuracy and 0.93 validation accuracy while with fit_generator() I am getting 0.95 and 0.9 respectively. note, this is consistent and not a single experiment. I thought it might be due to fit() handling the extra samples required differently. Is my implementation reasonable? how does fit() handles datasets of a size different from a batch_size multiplication?
Sharing the full generator code:
def generator(self,batch_size,train):
"""
Generates batches of samples
:return:
"""
while 1:
nb_of_steps=0
if(train):
nb_of_steps = self._num_of_steps_train
indices = np.arange(len(self._x_train))
additional = np.random.randint(len(self._x_train), size=self._num_of_steps_train*batch_size-len(self._x_train))
else:
nb_of_steps = self._num_of_steps_test
indices = np.arange(len(self._x_test))
additional = np.random.randint(len(self._x_test), size=self._num_of_steps_test*batch_size-len(self._x_test))
indices = np.append(indices,additional)
np.random.shuffle(indices)
# print(indices.shape)
# print(nb_of_steps)
for i in range(nb_of_steps):
batch_indices=indices[i:i+batch_size]
if(train):
feat = self._x_train[batch_indices]
label = self._y_train[batch_indices]
else:
feat = self._x_test[batch_indices]
label = self._y_test[batch_indices]
feat = np.expand_dims(feat,axis=1)
# print(feat.shape)
# print(label.shape)
yield feat, label

It looks like you can simplify the generator significantly!
The number of steps etc can be set outside the loop as they do not really change. Moreover, it looks like the batch_indices is not going through the entire dataset. Finally, if your data fits in memory you might not need a generator at all, but will leave this to your judgement.
def generator(self, batch_size, train):
nb_of_steps = 0
if (train):
nb_of_steps = self._num_of_steps_train
indices = np.arange(len(self._x_train)) #len of entire dataset
else:
nb_of_steps = self._num_of_steps_test
indices = np.arange(len(self._x_test))
while 1:
np.random.shuffle(indices)
for i in range(nb_of_steps):
start_idx = i*batch_size
end_idx = min(i*batch_size+batch_size, len(indices))
batch_indices=indices[start_idx : end_idx]
if(train):
feat = self._x_train[batch_indices]
label = self._y_train[batch_indices]
else:
feat = self._x_test[batch_indices]
label = self._y_test[batch_indices]
feat = np.expand_dims(feat,axis=1)
yield feat, label
For a more robust generator consider creating a class for your set using the keras.utils.Sequence class. It will add a few extra lines of code, but it is certainly working with keras.

Related

How to use PyTorch WeightedRandomSampler on Subsets of Subsets?

I want to split my dataset into three subsets for training, validation and testing. My original approach was to start with a first split and then do a second split on one of the two resulting Subsets. In addition, I wanted to apply the WeightedRandomSampler in the DataLoader to the training dataset created in the second step for balancing, because the two classes are represented differently in the dataset. If I try this on a normal subset (e.g. generated by the random_split function), it also works as expected, and the distribution of the two classes in the batches is approximately at 50% each. However, if I now use my training dataset as described above, which is a Subset of a Subset, the sampler does not seem to be applied: The distribution of the classes is the same as the original distribution in the overall dataset. Because this is probably difficult to capture purely textually, a minimal reproducible example follows. In the meantime I have a workaround, but I would still like to know if and how I could apply my original approach, because I would like to focus on PyTorch on-board resources in order not to depend on too many other packages.
from random import shuffle
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler, random_split
from tqdm import tqdm
# create an imbalanced dataset with 2 classes
x = [0] * 8000
x = x + [1] * 2000
shuffle(x)
dataset = TensorDataset(torch.Tensor(x))
# split it into a train, validation and test set
trainval_ds, test_ds = random_split(dataset, [0.8, 0.2])
# training and validation sets are subsets of subsets (the test set is just a normal subset)
train_ds, val_ds = random_split(trainval_ds, [0.8, 0.2])
# calculate the weights, use them to create a sampler and a dataloader for the training set
y = [x[i] for i in train_ds.indices]
class_counts = np.bincount(y)
labels_weights = 1. / class_counts
print("Number of data points in the two classes and their weights: ")
print(class_counts)
print(labels_weights)
weights = labels_weights[y]
sampler = WeightedRandomSampler(weights, len(weights))
train_dl = DataLoader(train_ds, batch_size=100, num_workers=10, pin_memory=True, sampler=sampler)
# calculate the ratio between the two classes presented during training
arr_batch = []
l_idx = 0
for batch in tqdm(train_dl, desc='Train batches'):
label_ids = [t.to('cpu').item() for t in batch[l_idx]]
arr_batch = arr_batch + label_ids
print("Ratio of the two classes during training: " + str(sum(arr_batch) / len(arr_batch)))
# the output is in the range of the original distribution
# this happens with all subsets of subsets
# the sampler is not working?
# calculate the weights, use them to create a sampler and a dataloader for the test set
# (just for presentation purposes, normally you would not do that)
y = [x[i] for i in test_ds.indices]
class_counts = np.bincount(y)
labels_weights = 1. / class_counts
print("Number of data points in the two classes and their weights: ")
print(class_counts)
print(labels_weights)
weights = labels_weights[y]
sampler = WeightedRandomSampler(weights, len(weights))
test_dl = DataLoader(test_ds, batch_size=100, num_workers=10, pin_memory=True, sampler=sampler)
# calculate the ratio between the two classes presented during testing
arr_batch = []
l_idx = 0
for batch in tqdm(test_dl, desc='Test batches'):
label_ids = [t.to('cpu').item() for t in batch[l_idx]]
arr_batch = arr_batch + label_ids
print("Ratio of the two classes during testing: " + str(sum(arr_batch) / len(arr_batch)))
# the output is close to the preferred 0.5

calculate Entropy for each class of the test set to measure uncertainty on pytorch

I am trying to calculate Entropy of each class of the dataset for an image classification task to measure uncertainty on pytorch,using the MC Dropout method and the solution proposed in this link
Measuring uncertainty using MC Dropout on pytorch
First,I have calculated the mean of each class per batch across different forward passes (class_mean_batch) and then for all the testloader (classes_mean) and then did some transformations to get (total_mean) to use it for calculating Entropy as shown in the code below
def mcdropout_test(batch_size,n_classes,model,T):
#set non-dropout layers to eval mode
model.eval()
#set dropout layers to train mode
enable_dropout(model)
softmax = nn.Softmax(dim=1)
classes_mean = []
for images,labels in testloader:
images = images.to(device)
labels = labels.to(device)
classes_mean_batch = []
with torch.no_grad():
output_list = []
#getting outputs for T forward passes
for i in range(T):
output = model(images)
output = softmax(output)
output_list.append(torch.unsqueeze(output, 0))
concat_output = torch.cat(output_list,0)
# getting mean of each class per batch across multiple MCD forward passes
for i in range (n_classes):
mean = torch.mean(concat_output[:, : , i])
classes_mean_batch.append(mean)
# getting mean of each class for the testloader
classes_mean.append(torch.stack(classes_mean_batch))
total_mean = []
concat_classes_mean = torch.stack(classes_mean)
for i in range (n_classes):
concat_classes = concat_classes_mean[: , i]
total_mean.append(concat_classes)
total_mean = torch.stack(total_mean)
total_mean = np.asarray(total_mean.cpu())
epsilon = sys.float_info.min
# Calculating entropy across multiple MCD forward passes
entropy = (- np.sum(total_mean*np.log(total_mean + epsilon), axis=-1)).tolist()
for i in range(n_classes):
print(f'The uncertainty of class {i+1} is {entropy[i]:.4f}')
Can anyone please correct or confirm the implementation i have used to calculate Entropy of each class.

Truncated backpropagation in PyTorch (code check)

I am trying to implement truncated backpropagation through time in PyTorch, for the simple case where K1=K2. I have an implementation below that produces reasonable output, but I just want to make sure it is correct. When I look online for PyTorch examples of TBTT, they do inconsistent things around detaching the hidden state and zeroing out the gradient, and the ordering of these operations. Please let me know if I have made a mistake.
In the code below, H maintains the current hidden state, and model(weights, H, x) outputs the prediction and the new hidden state.
while i < NUM_STEPS:
# Grab x, y for ith datapoint
x = data[i]
target = true_output[i]
# Run model
output, new_hidden = model(weights, H, x)
H = new_hidden
# Update running error
error += (output - target)**2
if (i+1) % K == 0:
# Backpropagate
error.backward()
opt.step()
opt.zero_grad()
error = 0
H = H.detach()
i += 1
So the idea of your code is to isolate the last variables after each Kth step. Yes, your implementation is absolutely correct and this answer confirms that.
# truncated to the last K timesteps
while i < NUM_STEPS:
out = model(out)
if (i+1) % K == 0:
out.backward()
out.detach()
out.backward()
You can also follow this example for your reference.
import torch
from ignite.engine import Engine, EventEnum, _prepare_batch
from ignite.utils import apply_to_tensor
class Tbptt_Events(EventEnum):
"""Aditional tbptt events.
Additional events for truncated backpropagation throught time dedicated
trainer.
"""
TIME_ITERATION_STARTED = "time_iteration_started"
TIME_ITERATION_COMPLETED = "time_iteration_completed"
def _detach_hidden(hidden):
"""Cut backpropagation graph.
Auxillary function to cut the backpropagation graph by detaching the hidden
vector.
"""
return apply_to_tensor(hidden, torch.Tensor.detach)
def create_supervised_tbptt_trainer(
model, optimizer, loss_fn, tbtt_step, dim=0, device=None, non_blocking=False, prepare_batch=_prepare_batch
):
"""Create a trainer for truncated backprop through time supervised models.
Training recurrent model on long sequences is computationally intensive as
it requires to process the whole sequence before getting a gradient.
However, when the training loss is computed over many outputs
(`X to many <https://karpathy.github.io/2015/05/21/rnn-effectiveness/>`_),
there is an opportunity to compute a gradient over a subsequence. This is
known as
`truncated backpropagation through time <https://machinelearningmastery.com/
gentle-introduction-backpropagation-time/>`_.
This supervised trainer apply gradient optimization step every `tbtt_step`
time steps of the sequence, while backpropagating through the same
`tbtt_step` time steps.
Args:
model (`torch.nn.Module`): the model to train.
optimizer (`torch.optim.Optimizer`): the optimizer to use.
loss_fn (torch.nn loss function): the loss function to use.
tbtt_step (int): the length of time chunks (last one may be smaller).
dim (int): axis representing the time dimension.
device (str, optional): device type specification (default: None).
Applies to batches.
non_blocking (bool, optional): if True and this copy is between CPU and GPU,
the copy may occur asynchronously with respect to the host. For other cases,
this argument has no effect.
prepare_batch (callable, optional): function that receives `batch`, `device`,
`non_blocking` and outputs tuple of tensors `(batch_x, batch_y)`.
.. warning::
The internal use of `device` has changed.
`device` will now *only* be used to move the input data to the correct device.
The `model` should be moved by the user before creating an optimizer.
For more information see:
* `PyTorch Documentation <https://pytorch.org/docs/stable/optim.html#constructing-it>`_
* `PyTorch's Explanation <https://github.com/pytorch/pytorch/issues/7844#issuecomment-503713840>`_
Returns:
Engine: a trainer engine with supervised update function.
"""
def _update(engine, batch):
loss_list = []
hidden = None
x, y = batch
for batch_t in zip(x.split(tbtt_step, dim=dim), y.split(tbtt_step, dim=dim)):
x_t, y_t = prepare_batch(batch_t, device=device, non_blocking=non_blocking)
# Fire event for start of iteration
engine.fire_event(Tbptt_Events.TIME_ITERATION_STARTED)
# Forward, backward and
model.train()
optimizer.zero_grad()
if hidden is None:
y_pred_t, hidden = model(x_t)
else:
hidden = _detach_hidden(hidden)
y_pred_t, hidden = model(x_t, hidden)
loss_t = loss_fn(y_pred_t, y_t)
loss_t.backward()
optimizer.step()
# Setting state of engine for consistent behaviour
engine.state.output = loss_t.item()
loss_list.append(loss_t.item())
# Fire event for end of iteration
engine.fire_event(Tbptt_Events.TIME_ITERATION_COMPLETED)
# return average loss over the time splits
return sum(loss_list) / len(loss_list)
engine = Engine(_update)
engine.register_events(*Tbptt_Events)
return engine

How to fix a 'OutOfRangeError: End of sequence' error when training a CNN with tensorflow?

I am trying to train a CNN using my own dataset. I've been using tfrecord files and the tf.data.TFRecordDataset API to handle my dataset. It works fine for my training dataset. But when I tried to batch my validation dataset, the error of 'OutOfRangeError: End of sequence' raised. After browsing through the Internet, I thought the problem was caused by the batch size of the validation set, which I set to 32 in the first place. But after I changed it to 2, the code ran for like 9 epochs and the error raised again.
I used an input function to handle the dataset, the code goes below:
def input_fn(is_training, filenames, batch_size, num_epochs=1, num_parallel_reads=1):
dataset = tf.data.TFRecordDataset(filenames,num_parallel_reads=num_parallel_reads)
if is_training:
dataset = dataset.shuffle(buffer_size=1500)
dataset = dataset.map(parse_record)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(batch_size)
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()
features, labels = iterator.get_next()
return features, labels
and for the training set, "batch_size" is set to 128 and "num_epochs" set to None which means keep repeating for infinite time. For the validation set, "batch_size" is set to 32(later set to 2, still didn't work) and the "num_epochs" set to 1 since I only want to go through the validation set one time.
I can assure that the validation set contains enough data for the epochs. Because I've tried the codes below and it didn't raise any errors:
with tf.Session() as sess:
features, labels = input_fn(False, valid_list, 32, 1, 1)
for i in range(450):
sess.run([features, labels])
print(labels.shape)
In the code above, when I changed the number 450 to 500 or anything larger, it would raise the 'OutOfRangeError'. That can confirm that my validation dataset contains enough data for 450 iterations with a batch size of 32.
I've tried to use a smaller batch size(i.e., 2) for the validation set, but still having the same error.
I can get the code running with the "num_epochs" set to "None" in the input_fn for validation set, but that does not seem to be how the validation works. Any help, please?
This behaviour is normal. From the Tensorflow documentation:
If the iterator reaches the end of the dataset, executing the Iterator.get_next() operation will raise a tf.errors.OutOfRangeError. After this point the iterator will be in an unusable state, and you must initialize it again if you want to use it further.
The reason why the error is not raised when you set dataset.repeat(None) is because the dataset is never exhausted since it is repeated indefinitely.
To solve your issue, you should change your code to this:
n_steps = 450
...
with tf.Session() as sess:
# Training
features, labels = input_fn(True, training_list, 32, 1, 1)
for step in range(n_steps):
sess.run([features, labels])
...
...
# Validation
features, labels = input_fn(False, valid_list, 32, 1, 1)
try:
sess.run([features, labels])
...
except tf.errors.OutOfRangeError:
print("End of dataset") # ==> "End of dataset"
You can also make a few changes to your input_fn to run the evaluation at every epoch:
def input_fn(is_training, filenames, batch_size, num_epochs=1, num_parallel_reads=1):
dataset = tf.data.TFRecordDataset(filenames,num_parallel_reads=num_parallel_reads)
if is_training:
dataset = dataset.shuffle(buffer_size=1500)
dataset = dataset.map(parse_record)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(batch_size)
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_initializable_iterator()
return iterator
n_epochs = 10
freq_eval = 1
training_iterator = input_fn(True, training_list, 32, 1, 1)
training_features, training_labels = training_iterator.get_next()
val_iterator = input_fn(False, valid_list, 32, 1, 1)
val_features, val_labels = val_iterator.get_next()
with tf.Session() as sess:
# Training
sess.run(training_iterator.initializer)
for epoch in range(n_epochs):
try:
sess.run([training_features, training_labels])
except tf.errors.OutOfRangeError:
pass
# Validation
if (epoch+1) % freq_eval == 0:
sess.run(val_iterator.initializer)
try:
sess.run([val_features, val_labels])
except tf.errors.OutOfRangeError:
pass
I advise you to have a close look to this official guide if you want to have a better understanding of what is happening under the hood.

Avoid feed_dict mechanism in static graph in tensorflow

I am trying to implement a model for generating/reconstructing samples (Variational autoencoder). During test time, I would like to be able to make the model generate new samples by feeding it a latent variable, but that requires changing the inputs to a part of the computational graph.
I could use a feed_dict to "dynamically" do that, since I cannot directly change a static graph, but I want to avoid the overhead of exchanging data between the GPU and the system RAM.
As it stands I feed the data using Iterators.
def make_mnist_dataset(batch_size, shuffle=True, include_labels=True):
"""Loads the MNIST data set and returns the relevant
iterator along with its initialization operations.
"""
# load the data
train, test = tf.keras.datasets.mnist.load_data()
# binarize and reshape the data sets
temp_train = train[0]
temp_train = (temp_train > 0.5).astype(np.float32).reshape(temp_train.shape[0], 784)
train = (temp_train, train[1])
temp_test = test[0]
temp_test = (temp_test > 0.5).astype(np.float32).reshape(temp_test.shape[0], 784)
test = (temp_test, test[1])
# prepare Dataset objects
if include_labels:
train_set = tf.data.Dataset.from_tensor_slices(train).repeat().batch(batch_size)
test_set = tf.data.Dataset.from_tensor_slices(test).repeat(1).batch(batch_size)
else:
train_set = tf.data.Dataset.from_tensor_slices(train[0]).repeat().batch(batch_size)
test_set = tf.data.Dataset.from_tensor_slices(test[0]).repeat(1).batch(batch_size)
if shuffle:
train_set = train_set.shuffle(buffer_size=int(0.5*train[0].shape[0]),
seed=123)
# make the iterator
iter = tf.data.Iterator.from_structure(train_set.output_types,
train_set.output_shapes)
data = iter.get_next()
# create initialization ops
train_init = iter.make_initializer(train_set)
test_init = iter.make_initializer(test_set)
return train_init, test_init, data
And here's the code snippet where the data being iterated over is being fed to the graph:
train_init, test_init, next_batch = make_mnist_dataset(batch_size, include_labels=True)
ops = build_graph(next_batch[0], next_batch[1], learning_rate, is_training,
latent_dim, tau, batch_size, inf_layers, gen_layers)
Is there any way to "switch" from an Iterator object to a different input source during test time, without resorting to feed_dict?

Resources