How to use PyTorch WeightedRandomSampler on Subsets of Subsets? - pytorch

I want to split my dataset into three subsets for training, validation and testing. My original approach was to start with a first split and then do a second split on one of the two resulting Subsets. In addition, I wanted to apply a WeightedRandomSampler in the DataLoader for the training dataset created in the second step, to balance it, because the two classes are unevenly represented in the dataset.

If I try this on a normal Subset (e.g. one generated directly by the random_split function), it works as expected and the distribution of the two classes in the batches is approximately 50% each. However, if I use my training dataset as described above, which is a Subset of a Subset, the sampler does not seem to be applied: the distribution of the classes is the same as the original distribution in the overall dataset.

Because this is probably difficult to capture purely in text, a minimal reproducible example follows. In the meantime I have a workaround, but I would still like to know if and how I could apply my original approach, because I would prefer to stick with PyTorch's built-in facilities rather than depend on too many other packages.
from random import shuffle
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler, random_split
from tqdm import tqdm
# create an imbalanced dataset with 2 classes
x = [0] * 8000
x = x + [1] * 2000
shuffle(x)
dataset = TensorDataset(torch.Tensor(x))
# split it into a train, validation and test set
trainval_ds, test_ds = random_split(dataset, [0.8, 0.2])
# training and validation sets are subsets of subsets (the test set is just a normal subset)
train_ds, val_ds = random_split(trainval_ds, [0.8, 0.2])
# calculate the weights, use them to create a sampler and a dataloader for the training set
y = [x[i] for i in train_ds.indices]
class_counts = np.bincount(y)
labels_weights = 1. / class_counts
print("Number of data points in the two classes and their weights: ")
print(class_counts)
print(labels_weights)
weights = labels_weights[y]
sampler = WeightedRandomSampler(weights, len(weights))
train_dl = DataLoader(train_ds, batch_size=100, num_workers=10, pin_memory=True, sampler=sampler)
# calculate the ratio between the two classes presented during training
arr_batch = []
l_idx = 0
for batch in tqdm(train_dl, desc='Train batches'):
    label_ids = [t.to('cpu').item() for t in batch[l_idx]]
    arr_batch = arr_batch + label_ids
print("Ratio of the two classes during training: " + str(sum(arr_batch) / len(arr_batch)))
# the output is in the range of the original distribution
# this happens with all subsets of subsets
# the sampler is not working?
# calculate the weights, use them to create a sampler and a dataloader for the test set
# (just for presentation purposes, normally you would not do that)
y = [x[i] for i in test_ds.indices]
class_counts = np.bincount(y)
labels_weights = 1. / class_counts
print("Number of data points in the two classes and their weights: ")
print(class_counts)
print(labels_weights)
weights = labels_weights[y]
sampler = WeightedRandomSampler(weights, len(weights))
test_dl = DataLoader(test_ds, batch_size=100, num_workers=10, pin_memory=True, sampler=sampler)
# calculate the ratio between the two classes presented during testing
arr_batch = []
l_idx = 0
for batch in tqdm(test_dl, desc='Test batches'):
    label_ids = [t.to('cpu').item() for t in batch[l_idx]]
    arr_batch = arr_batch + label_ids
print("Ratio of the two classes during testing: " + str(sum(arr_batch) / len(arr_batch)))
# the output is close to the preferred 0.5
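For context on the indexing involved: torch.utils.data.Subset keeps a .dataset reference and an .indices list into that parent, so the indices of a Subset of a Subset point into the intermediate split rather than into the base dataset. A hedged sketch of composing the two levels back to base-dataset positions (purely illustrative, not the workaround mentioned above):
# Illustrative only: map nested Subset indices back to the base dataset.
# train_ds.indices index into trainval_ds, and trainval_ds.indices index into `dataset`.
base_indices = [train_ds.dataset.indices[i] for i in train_ds.indices]
y_train = [x[i] for i in base_indices]  # labels in the order the sampler sees train_ds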

Related

Results from PyTorch tutorial using Google Colab not matching results in PyCharm

I'm following a tutorial on YouTube for PyTorch which uses torch.manual_seed to ensure results are the same, or at least in the same ballpark.
Admittedly I'm no expert, but on running the code in chapter 2 of the tutorial, the resulting graph from my model seems way off from what it should be.
I've tried going through the code line by line, but over the last 3 days I can't seem to find any differences between my code and the code used in the tutorial (other than variable names, which I changed for clarity so I'm not just mindlessly copying).
I work a pretty busy menial job with variable work days, so I don't get a lot of time off, but I've spent 10 'off days' across a month trying to solve this and I just can't see it. Genuinely, any help would be appreciated, even if it's just someone telling me I've made an error without saying what it is; I just want to know whether I've done anything wrong at all.
Here's a link to the doc file for the tutorial if that helps:
https://www.learnpytorch.io/02_pytorch_classification/#31-going-from-raw-model-outputs-to-predicted-labels-logits-prediction-probabilities-prediction-labels
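(As a general aside, not a diagnosis of the difference described above: a minimal sketch of the seeding calls that usually have to match between two environments before runs can be expected to line up.)
import random
import numpy as np
import torch
SEED = 42  # illustrative value
random.seed(SEED)                  # Python's built-in RNG
np.random.seed(SEED)               # NumPy (sklearn's make_circles also takes random_state)
torch.manual_seed(SEED)            # PyTorch CPU RNG (weight init, shuffling, dropout)
torch.cuda.manual_seed_all(SEED)   # PyTorch GPU RNGs, if CUDA is used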
Here's my code:
from sklearn.datasets import make_circles
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split as tts
import torch
from torch import nn
from helper_functions import plot_predictions, plot_decision_boundary
import os
from pathlib import Path
# Generate 1000 samples
sample_number = 1000
# Create circles
Features, labels = make_circles(sample_number,
noise=0.03, # <- adds a little noise to the dots
random_state=42) # <- sets random state seed for consistency
# View the first 5 values for both parameters
print(f'First 5 Features (X):\n{Features[:5]}\n'
f'First 5 labels (y):\n{labels[:5]}')
# Make a DataFrame of circle data
circles = pd.DataFrame({"inputType1": Features[:, 0], # <- everything in the 0th index is type 1
"inputType2": Features[:, 1], # <- everything in the 1st index is type 2
"output": labels
})
print(f'Created dataframe:\n'
f'{circles.head(10)}')
# Check the different labels
print(f'Number of value per class:\n'
f'{circles.output.value_counts()}')
# Visualise the dataframe
plt.scatter(x=Features[:, 0],
y=Features[:, 1],
c=labels,
cmap=plt.cm.RdYlBu)
# Display plot
# plt.show()
# Check the shapes of the features and labels
# ML deals with numerical representation
# Ensuring the input and output shapes are compatible is crucial
print(f'Circle shapes: {Features.shape, labels.shape}')
# View the first example of features and labels
Features_samples = Features[0]
labels_samples = labels[0]
print(f'Values for one sample of X: {Features_samples} and the same for y: {labels_samples}')
print(f'Shapes for one sample of X: {Features_samples.shape}'
f'\nand the same for y: {labels_samples.shape}')
# ^ Features dataset has 1000 samples with two feature classes
# ^ labels dataset has 1000 samples with no feature classes since it's a scalar
# Turning datasets into tensors
Features = torch.from_numpy(Features).type(torch.float)
labels = torch.from_numpy(labels).type(torch.float)
# View the first five samples
print(f'First 5 Features:\n'
f'{Features[:5]}\n'
f'First 5 labels:\n'
f'{labels[:5]}\n')
# Split data into train and test sets
input_data_train, input_data_test, model_output_train, model_output_test = tts(Features,
labels,
test_size=0.2,
random_state=42)
# Check that splits follow this pattern:
# 80% train, 20% test
print(f'Number of samples for input train:\n'
f'{len(input_data_train)}\n'
f'Number of samples for input test:\n'
f'{len(input_data_test)}\n'
f'Number of samples for output train:\n'
f'{len(model_output_train)}\n'
f'Number of samples for output test:\n'
f'{len(model_output_test)}\n')
# Begin building learning model
# Make device-agnostic code
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f'Learning model processing on: {device}\n')
"""# Assign parameters to the device
input_data_train = input_data_train.to(device)
input_data_test = input_data_test.to(device)
model_output_train = model_output_train.to(device)
model_output_test = model_output_test.to(device)"""
# 1. Construct a model class that subclasses nn.Module
class CircleClassificationModel(nn.Module):
    def __init__(self):
        super().__init__()
        # 2. Create 2 nn.Linear layers for handling Feature and labels, input and output shapes
        self.layer_1 = nn.Linear(in_features=2, out_features=5)  # <- inputs 2 Features, outputs 5
        self.layer_2 = nn.Linear(in_features=5, out_features=1)  # <- inputs 5 Features, outputs 1 label
    # 3. Define a forward method containing the forward pass computation
    def forward(self, x):
        # Return the output of layer_2, a single feature, which is the same shape as the label
        return self.layer_2(self.layer_1(x))
        # Computation goes through layer_1 first
        # The output of layer_1 goes through layer_2
# 4. Create an instance of the model and send it to target device
classification_model_0 = CircleClassificationModel().to(device)
# Display model parameters
print(f'Model parameters for self defined model:\n'
f'{classification_model_0}\n')
# The above code can be written more succinctly using nn.Sequential
# Implements two layers of nn.Linear()
# Which calls the following equation
# y = ( x * (Weights).transposed ) + bias
classification_model_0 = nn.Sequential(
nn.Linear(in_features=2, out_features=5),
nn.Linear(in_features=5, out_features=1)
).to(device)
# Display model parameters
print(f'Model (nn.Sequential) parameters:\n'
f'{classification_model_0}\n\n')
# Make predictions with the model
untrained_predictions = classification_model_0(input_data_test.to(device))
print(f'Length of predictions: {len(untrained_predictions)}, Shape: {untrained_predictions.shape}')
print(f'Length of test samples: {len(model_output_test)}, Shape: {model_output_test.shape}')
print(f'\nFirst 10 predictions:\n'
f'{untrained_predictions[:10]}')
print(f'\nFirst 10 test labels:\n'
f'{model_output_test[:10]}')
# Create a loss function
# Unlike the regression model, the classification model uses a different loss type
# Binary Cross Entropy will be used for this task
# torch.nn.BCELoss() - measures the BCE between the target(labels) and the input(features)
# Another version may be used:
# torch.nn.BCEWithLogitsLoss() - same, except it has a built-in Sigmoid function
# loss_fn = nn.BCELoss() # <- BCELoss = no sigmoid built-in
loss_function = nn.BCEWithLogitsLoss() # <- BCEWithLogitsLoss = sigmoid built-in
# Create an optimiser
optimiser = torch.optim.SGD(params=classification_model_0.parameters(),
lr=0.1)
# Calculate accuracy (a classification metric)
# This acts as an evaluation metric
# Offers perspective into how the model is going
# The loss function measures how wrong the model is, but
# Evaluation metrics measure how right it is
# Accuracy will be the first metric to be utilised
# Accuracy can be measured by dividing the total number of correct predictions
# By the total number of overall predictions
def accuracy_function(label_actual, label_predicted):
    # Calculates where 2 tensors are equal
    correct = torch.eq(label_actual, label_predicted).sum().item()
    accuracy = (correct / len(label_predicted)) * 100
    return accuracy
# View the first 5 results of the forward pass on test data
# labels_logits represents the output of the forward pass method above
# Which utilises two nn.Linear() layers
labels_logits = classification_model_0(input_data_test.to(device))[:5]
print(f'First 5 outputs of the forward pass:\n'
f'{labels_logits}')
# Use the sigmoid function on the model labels_logits
# Turns the output of the forward pass into prediction probabilities
# Measures the odds the model classifies a data point into one class or the other
# In the case of this problem the classes are either 0 or 1
# It uses the logic:
# If labels_prediction_probabilities >= 0.5 then assign the label class (1)
# If labels_prediction_probabilities < 0.5 then assign the label class (0)
labels_prediction_probabilities = torch.sigmoid(labels_logits)
print(f'Output of the sigmoid-ed forward pass:\n'
f'{labels_prediction_probabilities}')
# Find the predicted labels (round the prediction probabilities as well)
label_predictions = torch.round(labels_prediction_probabilities)
# In full
labels_predictions_classes = \
torch.round(torch.sigmoid(classification_model_0(input_data_test.to(device))[:5]))
# Check for equality
print(torch.eq(label_predictions.squeeze(), labels_predictions_classes.squeeze()))
# Get rid of the extra dimensions
label_predictions.squeeze()
# Display model predictions
print(f'Model Predictions:\n'
f'{label_predictions}')
# Display test labels for comparison with model predictions
print(f'\nFirst five test data:\n'
f'{model_output_test[:5]}')
# Building the training loop
torch.manual_seed(42)
# Set the number of epochs
epochs = 100
# Process data on the target devices
input_data_train, model_output_train = input_data_train.to(device),\
model_output_train.to(device)
input_data_test, model_output_test = input_data_test.to(device),\
model_output_test.to(device)
# Build the training and evaluation loop
for epoch in range(epochs):
    # Training
    classification_model_0.train()
    # todo: Do the Forward pass
    # Model outputs raw labels_logits
    train_labels_logits = classification_model_0(input_data_train).squeeze()
    # ^ .squeeze() removes the extra dimensions, won't work if model and data on diff devices
    train_label_prediction = torch.round(torch.sigmoid(train_labels_logits))
    # ^ turns logits -> prediction probabilities -> prediction label classes
    # todo: Calculate the loss
    # 2. Calculates loss/accuracy
    """ train_loss = loss_function(torch.sigmoid(train_labels_logits),
                                   model_output_train)  # <- nn.BCELoss needs torch.sigmoid() """
    train_loss = loss_function(train_labels_logits,
                               model_output_train)
    train_accuracy = accuracy_function(label_actual=model_output_train,
                                       label_predicted=train_label_prediction)
    # todo: Optimiser zero grad
    optimiser.zero_grad()
    # todo: Loss backwards
    train_loss.backward()
    # todo: optimiser step step step
    optimiser.step()
    # Testing
    # todo: evaluate the model
    classification_model_0.eval()
    with torch.inference_mode():
        # todo: Do the forward pass
        test_logits = classification_model_0(input_data_test).squeeze()
        test_predictions = torch.round(torch.sigmoid(test_logits))
        # todo: calculate the loss
        test_loss = loss_function(test_logits,
                                  model_output_test)
        test_accuracy = accuracy_function(label_actual=model_output_test,
                                          label_predicted=test_predictions)
    # todo: print model statistics every 10 epochs
    if epoch % 10 == 0:
        print(f'Epoch: {epoch} | Loss: {train_loss:.5f}, | Train Accuracy: {train_accuracy:.2f}% | '
              f'Test Loss: {test_loss:.5f}, | Test accuracy: {test_accuracy:.2f}%')
# Plot decision boundary for training and test sets
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("My Train")
plot_decision_boundary(classification_model_0, input_data_train, model_output_train)
plt.subplot(1, 2, 2)
plt.title("My Test")
plot_decision_boundary(classification_model_0, input_data_test, model_output_test)
plt.show()
AND HERE'S THE TUTORIAL CODE:
from sklearn.datasets import make_circles
# Make 1000 samples
n_samples = 1000
# Create circles
X, y = make_circles(n_samples,
noise=0.03, # a little bit of noise to the dots
random_state=42) # keep random state so we get the same values
print(f"First 5 X features:\n{X[:5]}")
print(f"\nFirst 5 y labels:\n{y[:5]}")
# Make DataFrame of circle data
import pandas as pd
circles = pd.DataFrame({"X1": X[:, 0],
"X2": X[:, 1],
"label": y
})
circles.head(10)
# Check different labels
circles.label.value_counts()
# Visualize with a plot
import matplotlib.pyplot as plt
plt.scatter(x=X[:, 0],
y=X[:, 1],
c=y,
cmap=plt.cm.RdYlBu);
# Check the shapes of our features and labels
X.shape, y.shape
# View the first example of features and labels
X_sample = X[0]
y_sample = y[0]
print(f"Values for one sample of X: {X_sample} and the same for y: {y_sample}")
print(f"Shapes for one sample of X: {X_sample.shape} and the same for y: {y_sample.shape}")
# Turn data into tensors
# Otherwise this causes issues with computations later on
import torch
X = torch.from_numpy(X).type(torch.float)
y = torch.from_numpy(y).type(torch.float)
# View the first five samples
X[:5], y[:5]
# Split data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.2, # 20% test, 80% train
random_state=42) # make the random split reproducible
len(X_train), len(X_test), len(y_train), len(y_test)
# Standard PyTorch imports
import torch
from torch import nn
# Make device agnostic code
device = "cuda" if torch.cuda.is_available() else "cpu"
device
# 1. Construct a model class that subclasses nn.Module
class CircleModelV0(nn.Module):
    def __init__(self):
        super().__init__()
        # 2. Create 2 nn.Linear layers capable of handling X and y input and output shapes
        self.layer_1 = nn.Linear(in_features=2, out_features=5)  # takes in 2 features (X), produces 5 features
        self.layer_2 = nn.Linear(in_features=5, out_features=1)  # takes in 5 features, produces 1 feature (y)
    # 3. Define a forward method containing the forward pass computation
    def forward(self, x):
        # Return the output of layer_2, a single feature, the same shape as y
        return self.layer_2(self.layer_1(x))  # computation goes through layer_1 first, then the output of layer_1 goes through layer_2
# 4. Create an instance of the model and send it to target device
model_0 = CircleModelV0().to(device)
model_0
# Replicate CircleModelV0 with nn.Sequential
model_0 = nn.Sequential(
nn.Linear(in_features=2, out_features=5),
nn.Linear(in_features=5, out_features=1)
).to(device)
model_0
# Make predictions with the model
untrained_preds = model_0(X_test.to(device))
print(f"Length of predictions: {len(untrained_preds)}, Shape: {untrained_preds.shape}")
print(f"Length of test samples: {len(y_test)}, Shape: {y_test.shape}")
print(f"\nFirst 10 predictions:\n{untrained_preds[:10]}")
print(f"\nFirst 10 test labels:\n{y_test[:10]}")
# Create a loss function
# loss_fn = nn.BCELoss() # BCELoss = no sigmoid built-in
loss_fn = nn.BCEWithLogitsLoss() # BCEWithLogitsLoss = sigmoid built-in
# Create an optimizer
optimizer = torch.optim.SGD(params=model_0.parameters(),
lr=0.1)
# Calculate accuracy (a classification metric)
def accuracy_fn(y_true, y_pred):
    correct = torch.eq(y_true, y_pred).sum().item()  # torch.eq() calculates where two tensors are equal
    acc = (correct / len(y_pred)) * 100
    return acc
# View the first 5 outputs of the forward pass on the test data
y_logits = model_0(X_test.to(device))[:5]
y_logits
# Use sigmoid on model logits
y_pred_probs = torch.sigmoid(y_logits)
y_pred_probs
# Find the predicted labels (round the prediction probabilities)
y_preds = torch.round(y_pred_probs)
# In full
y_pred_labels = torch.round(torch.sigmoid(model_0(X_test.to(device))[:5]))
# Check for equality
print(torch.eq(y_preds.squeeze(), y_pred_labels.squeeze()))
# Get rid of extra dimension
y_preds.squeeze()
y_test[:5]
torch.manual_seed(42)
# Set the number of epochs
epochs = 100
# Put data to target device
X_train, y_train = X_train.to(device), y_train.to(device)
X_test, y_test = X_test.to(device), y_test.to(device)
# Build training and evaluation loop
for epoch in range(epochs):
    ### Training
    model_0.train()
    # 1. Forward pass (model outputs raw logits)
    y_logits = model_0(X_train).squeeze()  # squeeze to remove extra `1` dimensions, this won't work unless model and data are on same device
    y_pred = torch.round(torch.sigmoid(y_logits))  # turn logits -> pred probs -> pred labels
    # 2. Calculate loss/accuracy
    # loss = loss_fn(torch.sigmoid(y_logits),  # Using nn.BCELoss you need torch.sigmoid()
    #                y_train)
    loss = loss_fn(y_logits,  # Using nn.BCEWithLogitsLoss works with raw logits
                   y_train)
    acc = accuracy_fn(y_true=y_train,
                      y_pred=y_pred)
    # 3. Optimizer zero grad
    optimizer.zero_grad()
    # 4. Loss backwards
    loss.backward()
    # 5. Optimizer step
    optimizer.step()
    ### Testing
    model_0.eval()
    with torch.inference_mode():
        # 1. Forward pass
        test_logits = model_0(X_test).squeeze()
        test_pred = torch.round(torch.sigmoid(test_logits))
        # 2. Calculate loss/accuracy
        test_loss = loss_fn(test_logits,
                            y_test)
        test_acc = accuracy_fn(y_true=y_test,
                               y_pred=test_pred)
    # Print out what's happening every 10 epochs
    if epoch % 10 == 0:
        print(f"Epoch: {epoch} | Loss: {loss:.5f}, Accuracy: {acc:.2f}% | Test loss: {test_loss:.5f}, Test acc: {test_acc:.2f}%")
from helper_functions import plot_predictions, plot_decision_boundary
# Plot decision boundaries for training and test sets
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("Tut Train")
plot_decision_boundary(model_0, X_train, y_train)
plt.subplot(1, 2, 2)
plt.title("Tut Test")
plot_decision_boundary(model_0, X_test, y_test)

Calculate entropy for each class of the test set to measure uncertainty in PyTorch

I am trying to calculate the entropy of each class of the dataset for an image classification task to measure uncertainty in PyTorch, using the MC Dropout method and the solution proposed in this link:
Measuring uncertainty using MC Dropout on pytorch
First, I calculated the mean of each class per batch across the different forward passes (classes_mean_batch), then for the whole testloader (classes_mean), and then did some transformations to get (total_mean), which I use to calculate the entropy as shown in the code below:
def mcdropout_test(batch_size, n_classes, model, T):
    # set non-dropout layers to eval mode
    model.eval()
    # set dropout layers to train mode
    enable_dropout(model)
    softmax = nn.Softmax(dim=1)
    classes_mean = []
    for images, labels in testloader:
        images = images.to(device)
        labels = labels.to(device)
        classes_mean_batch = []
        with torch.no_grad():
            output_list = []
            # getting outputs for T forward passes
            for i in range(T):
                output = model(images)
                output = softmax(output)
                output_list.append(torch.unsqueeze(output, 0))
        concat_output = torch.cat(output_list, 0)
        # getting mean of each class per batch across multiple MCD forward passes
        for i in range(n_classes):
            mean = torch.mean(concat_output[:, :, i])
            classes_mean_batch.append(mean)
        # getting mean of each class for the testloader
        classes_mean.append(torch.stack(classes_mean_batch))
    total_mean = []
    concat_classes_mean = torch.stack(classes_mean)
    for i in range(n_classes):
        concat_classes = concat_classes_mean[:, i]
        total_mean.append(concat_classes)
    total_mean = torch.stack(total_mean)
    total_mean = np.asarray(total_mean.cpu())
    epsilon = sys.float_info.min
    # Calculating entropy across multiple MCD forward passes
    entropy = (- np.sum(total_mean * np.log(total_mean + epsilon), axis=-1)).tolist()
    for i in range(n_classes):
        print(f'The uncertainty of class {i+1} is {entropy[i]:.4f}')
Can anyone please correct or confirm the implementation I have used to calculate the entropy of each class?
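(As a point of reference only, not a correction of the code above: the usual predictive entropy over the MC Dropout mean probabilities is H = -sum_c p_c * log(p_c), computed per sample from the mean over the T passes. A small standalone sketch with made-up shapes, which then averages the per-sample entropy over each true class:)
import numpy as np
T, N, C = 10, 200, 3                                    # hypothetical: T MC passes, N samples, C classes
probs = np.random.dirichlet(np.ones(C), size=(T, N))    # stand-in for the stacked softmax outputs, shape (T, N, C)
labels = np.random.randint(0, C, size=N)                # stand-in for the true labels
mean_probs = probs.mean(axis=0)                         # predictive mean per sample, shape (N, C)
eps = np.finfo(float).tiny
entropy = -np.sum(mean_probs * np.log(mean_probs + eps), axis=-1)  # one entropy value per sample
for c in range(C):
    print(f'Mean predictive entropy of class {c+1}: {entropy[labels == c].mean():.4f}')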

How to use Tensorflow 2 Dataset API with Keras?

This question has been answered for Tensorflow 1, eg: How to Properly Combine TensorFlow's Dataset API and Keras?, but this answer hasn't helped for my use case.
Below is an example of a model with three float32 inputs and one float32 output. I have a large amount of data that doesn't all fit into memory at once, so it's split into separate files. I'm trying to use the Dataset API to train a model by bringing in a portion of the training data at once.
import tensorflow as tf
import tensorflow.keras.layers as layers
import numpy as np
# Create TF model of a given architecture (number of hidden layers, layersize, #outputs, activation function)
def create_model(h=2, l=64, activation='relu'):
    model = tf.keras.Sequential([
        layers.Dense(l, activation=activation, input_shape=(3,), name='input_layer'),
        *[layers.Dense(l, activation=activation) for _ in range(h)],
        layers.Dense(1, activation='linear', name='output_layer')])
    return model
# Load data (3 X variables, 1 Y variable) split into 5 files
# (for this example, just create a list 5 numpy arrays)
list_of_training_datasets = [np.random.rand(10,4).astype(np.float32) for _ in range(5)]
validation_dataset = np.random.rand(30,4).astype(np.float32)
def data_generator():
    for data in list_of_training_datasets:
        x_data = data[:, 0:3]
        y_data = data[:, 3:4]
        yield (x_data, y_data)
# prepare model
model = create_model(h=2,l=64,activation='relu')
model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam())
# load dataset
dataset = tf.data.Dataset.from_generator(data_generator,(np.float32,np.float32))
# fit model
model.fit(dataset, epochs=100, validation_data=(validation_dataset[:,0:3],validation_dataset[:,3:4]))
Running this, I get the error:
ValueError: Cannot take the length of shape with unknown rank.
Does anyone know how to get this working? I would also like to be able to use the batch dimension, to load two data files at a time, for example.
You need to specify the shapes of your dataset along with the return data types, like this:
dataset = tf.data.Dataset.from_generator(data_generator,
                                         (np.float32, np.float32),
                                         ((None, 3), (None, 1)))
The following works, but I don't know if it is the most efficient approach.
As far as I understand, if your training dataset is split into 10 pieces, you should set steps_per_epoch=10, which ensures that each epoch steps through all the data once. dataset.repeat() is needed because the dataset iterator is "used up" after the first epoch; .repeat() ensures that the iterator gets created again after being exhausted.
import numpy as np
import tensorflow.keras.layers as layers
import tensorflow as tf
# Create TF model of a given architecture (number of hidden layers, layersize, #outputs, activation function)
def create_model(h=2, l=64, activation='relu'):
    model = tf.keras.Sequential([
        layers.Dense(l, activation=activation, input_shape=(3,), name='input_layer'),
        *[layers.Dense(l, activation=activation) for _ in range(h)],
        layers.Dense(1, activation='linear', name='output_layer')])
    return model
# Load data (3 X variables, 1 Y variable) split into 5 files
# (for this example, just create a list 5 numpy arrays)
list_of_training_datasets = [np.random.rand(10,4).astype(np.float32) for _ in range(5)]
steps_per_epoch = len(list_of_training_datasets)
validation_dataset = np.random.rand(30,4).astype(np.float32)
def data_generator():
    for data in list_of_training_datasets:
        x_data = data[:, 0:3]
        y_data = data[:, 3:4]
        yield (x_data, y_data)
# prepare model
model = create_model(h=2,l=64,activation='relu')
model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam())
# load dataset
dataset = tf.data.Dataset.from_generator(data_generator, output_types=(np.float32, np.float32),
                                         output_shapes=(tf.TensorShape([None, 3]), tf.TensorShape([None, 1]))).repeat()
# fit model
model.fit(dataset.as_numpy_iterator(), epochs=10, steps_per_epoch=steps_per_epoch,
          validation_data=(validation_dataset[:, 0:3], validation_dataset[:, 3:4]))

Keras fit vs. fit_generator extra samples

I have training data and validation data stacked up in two tensors. At first, I trained a NN using the keras.model.fit() function. For my purposes, I wish to move to keras.model.fit_generator(). I built a generator and noticed that the number of samples is not a multiple of the batch size.
My implementation to overcome this:
indices = np.arange(len(dataset))  # generate indices of len(dataset)
num_of_steps = int(np.ceil(len(dataset)/batch_size))  # number of steps per epoch
extra = num_of_steps * batch_size - len(dataset)  # number of extra samples needed to reach the next multiple of batch_size
additional = np.random.randint(len(dataset), size=extra)  # complete with random samples
indices = np.append(indices, additional)
After randomizing the indices at each epoch, I simply iterate over them in batch-sized steps and pull the corresponding data and labels.
I am observing a degradation in the performance of the model. When training with fit() I get 0.99 training accuracy and 0.93 validation accuracy, while with fit_generator() I am getting 0.95 and 0.9 respectively. Note that this is consistent and not a single experiment. I thought it might be due to fit() handling the extra samples differently. Is my implementation reasonable? How does fit() handle datasets whose size is not a multiple of the batch_size?
Sharing the full generator code:
def generator(self, batch_size, train):
    """
    Generates batches of samples
    :return:
    """
    while 1:
        nb_of_steps = 0
        if (train):
            nb_of_steps = self._num_of_steps_train
            indices = np.arange(len(self._x_train))
            additional = np.random.randint(len(self._x_train), size=self._num_of_steps_train*batch_size-len(self._x_train))
        else:
            nb_of_steps = self._num_of_steps_test
            indices = np.arange(len(self._x_test))
            additional = np.random.randint(len(self._x_test), size=self._num_of_steps_test*batch_size-len(self._x_test))
        indices = np.append(indices, additional)
        np.random.shuffle(indices)
        # print(indices.shape)
        # print(nb_of_steps)
        for i in range(nb_of_steps):
            batch_indices = indices[i:i+batch_size]
            if (train):
                feat = self._x_train[batch_indices]
                label = self._y_train[batch_indices]
            else:
                feat = self._x_test[batch_indices]
                label = self._y_test[batch_indices]
            feat = np.expand_dims(feat, axis=1)
            # print(feat.shape)
            # print(label.shape)
            yield feat, label
It looks like you can simplify the generator significantly!
The number of steps etc. can be set outside the loop as they do not really change. Moreover, it looks like batch_indices is not stepping through the entire dataset. Finally, if your data fits in memory you might not need a generator at all, but I will leave this to your judgement.
def generator(self, batch_size, train):
    nb_of_steps = 0
    if (train):
        nb_of_steps = self._num_of_steps_train
        indices = np.arange(len(self._x_train))  # len of entire dataset
    else:
        nb_of_steps = self._num_of_steps_test
        indices = np.arange(len(self._x_test))
    while 1:
        np.random.shuffle(indices)
        for i in range(nb_of_steps):
            start_idx = i*batch_size
            end_idx = min(i*batch_size+batch_size, len(indices))
            batch_indices = indices[start_idx:end_idx]
            if (train):
                feat = self._x_train[batch_indices]
                label = self._y_train[batch_indices]
            else:
                feat = self._x_test[batch_indices]
                label = self._y_test[batch_indices]
            feat = np.expand_dims(feat, axis=1)
            yield feat, label
For a more robust generator, consider creating a class for your dataset using the keras.utils.Sequence class. It will add a few extra lines of code, but it is guaranteed to work with Keras; a minimal sketch follows below.
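A minimal sketch of that Sequence-based approach (hypothetical class and variable names, assuming the features and labels fit in memory as numpy arrays):
import numpy as np
from tensorflow import keras  # or `import keras`, depending on your setup

class BatchSequence(keras.utils.Sequence):
    def __init__(self, x, y, batch_size):
        self.x, self.y, self.batch_size = x, y, batch_size
    def __len__(self):
        # Number of batches per epoch; the final batch is simply smaller
        # when the dataset size is not a multiple of batch_size.
        return int(np.ceil(len(self.x) / self.batch_size))
    def __getitem__(self, idx):
        start = idx * self.batch_size
        end = start + self.batch_size
        return self.x[start:end], self.y[start:end]

# usage (hypothetical): model.fit_generator(BatchSequence(x_train, y_train, batch_size=32), ...)
Because Keras knows the length of a Sequence, it can also infer the number of steps per epoch, so steps_per_epoch does not need to be computed by hand.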

Signal-to-signal prediction using RNN and Keras

I am trying to reproduce the nice work here and adapt it so that it reads real data from a file.
I started by generating random signals (instead of using the generating methods provided in the above link). Unfortunately, I could not generate signals in the proper shape that the model can accept.
Here is the code:
import numpy as np
import keras
from keras.utils import plot_model
input_sequence_length = 15 # Length of the sequence used by the encoder
target_sequence_length = 15 # Length of the sequence predicted by the decoder
import random
def getModel():  # Define an input sequence.
    learning_rate = 0.01
    num_input_features = 1
    lambda_regulariser = 0.000001  # Will not be used if regulariser is None
    regulariser = None  # Possible regulariser: keras.regularizers.l2(lambda_regulariser)
    layers = [35, 35]
    num_output_features = 1
    decay = 0  # Learning rate decay
    loss = "mse"  # Other loss functions are possible, see Keras documentation.
    optimiser = keras.optimizers.Adam(lr=learning_rate, decay=decay)  # Other possible optimiser "sgd" (Stochastic Gradient Descent)
    encoder_inputs = keras.layers.Input(shape=(None, num_input_features))
    # Create a list of RNN Cells, these are then concatenated into a single layer
    # with the RNN layer.
    encoder_cells = []
    for hidden_neurons in layers:
        encoder_cells.append(keras.layers.GRUCell(hidden_neurons, kernel_regularizer=regulariser, recurrent_regularizer=regulariser, bias_regularizer=regulariser))
    encoder = keras.layers.RNN(encoder_cells, return_state=True)
    encoder_outputs_and_states = encoder(encoder_inputs)
    # Discard encoder outputs and only keep the states.
    # The outputs are of no interest to us, the encoder's
    # job is to create a state describing the input sequence.
    encoder_states = encoder_outputs_and_states[1:]
    # The decoder input will be set to zero (see random_sine function of the utils module).
    # Do not worry about the input size being 1, I will explain that in the next cell.
    decoder_inputs = keras.layers.Input(shape=(None, 1))
    decoder_cells = []
    for hidden_neurons in layers:
        decoder_cells.append(keras.layers.GRUCell(hidden_neurons,
                                                  kernel_regularizer=regulariser,
                                                  recurrent_regularizer=regulariser,
                                                  bias_regularizer=regulariser))
    decoder = keras.layers.RNN(decoder_cells, return_sequences=True, return_state=True)
    # Set the initial state of the decoder to be the output state of the encoder.
    # This is the fundamental part of the encoder-decoder.
    decoder_outputs_and_states = decoder(decoder_inputs, initial_state=encoder_states)
    # Only select the output of the decoder (not the states)
    decoder_outputs = decoder_outputs_and_states[0]
    # Apply a dense layer with linear activation to set output to correct dimension
    # and scale (tanh is default activation for GRU in Keras, our output sine function can be larger than 1)
    decoder_dense = keras.layers.Dense(num_output_features,
                                       activation='linear',
                                       kernel_regularizer=regulariser,
                                       bias_regularizer=regulariser)
    decoder_outputs = decoder_dense(decoder_outputs)
    # Create a model using the functional API provided by Keras.
    # The functional API is great, it gives an amazing amount of freedom in architecture of your NN.
    # A read worth your time: https://keras.io/getting-started/functional-api-guide/
    model = keras.models.Model(inputs=[encoder_inputs, decoder_inputs], outputs=decoder_outputs)
    model.compile(optimizer=optimiser, loss=loss)
    print(model.summary())
    return model
def getXY():
    X, y = list(), list()
    for _ in range(100):
        x = [random.random() for _ in range(input_sequence_length)]
        y = [random.random() for _ in range(target_sequence_length)]
        X.append([x, [0 for _ in range(input_sequence_length)]])
        y.append(y)
    return np.array(X), np.array(y)
X,y = getXY()
print(X,y)
model = getModel()
model.fit(X,y)
The error message i got is:
ValueError: Error when checking model input: the list of Numpy arrays
that you are passing to your model is not the size the model expected.
Expected to see 2 array(s), but instead got the following list of 1
arrays:
what is the correct shape of the input data for the model?
If you read the source of your inspiration carefully, you will find that he talks about the "decoder_input" data.
He talks about the "teacher forcing" technique, which consists of feeding the decoder with some delayed data. But he also says that it didn't really work well in his case, so he sets the initial input of the decoder to a bunch of zeros, as this line shows:
decoder_input = np.zeros((decoder_output.shape[0], decoder_output.shape[1], 1))
In his design of the auto-encoder, there are two separate models that have different inputs, which he then ties together by passing the RNN states from one to the other.
I can see that you have tried to do the same thing, but you have appended np.array([x_encoder, x_decoder]) where you should have done [np.array(x_encoder), np.array(x_decoder)]. Each input to the network should be a numpy array that you put in a list of inputs, not one big numpy array.
I also found a typo in your code: you are appending y to itself, where you should instead create a Y variable.
def getXY():
    X_encoder, X_decoder, Y = list(), list(), list()
    for _ in range(100):
        x_encoder = [random.random() for _ in range(input_sequence_length)]
        # the decoder input is a sequence of 0's, same length as the target sequence
        x_decoder = [0] * target_sequence_length
        y = [random.random() for _ in range(target_sequence_length)]
        X_encoder.append(x_encoder)
        # Not really optimal but will work
        X_decoder.append(x_decoder)
        Y.append(y)
    return [np.array(X_encoder), np.array(X_decoder)], np.array(Y)
Now when you do:
X, Y = getXY()
you receive X, which is a list of 2 numpy arrays (as your model requires), and Y, which is a single numpy array.
I hope this helps.
EDIT
Indeed, in the code that generates the dataset, you can see that they build 3-dimensional numpy arrays for the input. RNNs need 3-dimensional inputs :-)
The following code should address the shape issue:
def getXY():
    X_encoder, X_decoder, Y = list(), list(), list()
    for _ in range(100):
        x_encoder = [random.random() for _ in range(input_sequence_length)]
        # the decoder input is a sequence of 0's, same length as the target sequence
        x_decoder = [0] * target_sequence_length
        y = [random.random() for _ in range(target_sequence_length)]
        X_encoder.append(x_encoder)
        # Not really optimal but will work
        X_decoder.append(x_decoder)
        Y.append(y)
    # Make them numpy arrays
    X_encoder = np.array(X_encoder)
    X_decoder = np.array(X_decoder)
    Y = np.array(Y)
    # Make them 3-dimensional arrays (with the third dimension being of size 1),
    # just like the 1D vector [1, 2] can become the 2D vector [[1, 2]]
    X_encoder = np.expand_dims(X_encoder, axis=2)
    X_decoder = np.expand_dims(X_decoder, axis=2)
    Y = np.expand_dims(Y, axis=2)
    return [X_encoder, X_decoder], Y
