I am new to pytorch and are trying to implement a feed forward neural network to classify the mnist data set. I have some problems when trying to use cross-validation. My data has the following shapes:
x_train:
torch.Size([45000, 784]) and
y_train: torch.Size([45000])
I tried to use KFold from sklearn.
kfold =KFold(n_splits=10)
Here is the first part of my train method where I'm dividing the data into folds:
for train_index, test_index in kfold.split(x_train, y_train):
x_train_fold = x_train[train_index]
x_test_fold = x_test[test_index]
y_train_fold = y_train[train_index]
y_test_fold = y_test[test_index]
print(x_train_fold.shape)
for epoch in range(epochs):
...
The indices for the y_train_fold variable is right, it's simply:
[ 0 1 2 ... 4497 4498 4499], but it's not for x_train_fold, which is [ 4500 4501 4502 ... 44997 44998 44999]. And the same goes for the test folds.
For the first iteration I want the varibale x_train_fold to be the first 4500 pictures, in other words to have the shape torch.Size([4500, 784]), but it has the shape torch.Size([40500, 784])
Any tips on how to get this right?
I think you're confused!
Ignore the second dimension for a while, When you've 45000 points, and you use 10 fold cross-validation, what's the size of each fold? 45000/10 i.e. 4500.
It means that each of your fold will contain 4500 data points, and one of those fold will be used for testing, and the remaining for training i.e.
For testing: one fold => 4500 data points => size: 4500
For training: remaining folds => 45000-4500 data points => size: 45000-4500=40500
Thus, for first iteration, the first 4500 data points (corresponding to indices) will be used for testing and the rest for training. (Check below image)
Given your data is x_train: torch.Size([45000, 784]) and y_train: torch.Size([45000]), this is how your code should look like:
for train_index, test_index in kfold.split(x_train, y_train):
print(train_index, test_index)
x_train_fold = x_train[train_index]
y_train_fold = y_train[train_index]
x_test_fold = x_train[test_index]
y_test_fold = y_train[test_index]
print(x_train_fold.shape, y_train_fold.shape)
print(x_test_fold.shape, y_test_fold.shape)
break
[ 4500 4501 4502 ... 44997 44998 44999] [ 0 1 2 ... 4497 4498 4499]
torch.Size([40500, 784]) torch.Size([40500])
torch.Size([4500, 784]) torch.Size([4500])
So, when you say
I want the variable x_train_fold to be the first 4500 picture... shape torch.Size([4500, 784]).
you're wrong. this size corresonds to x_test_fold. In the first iteration, based on 10 folds, x_train_fold will have 40500 points, thus its size is supposed to be torch.Size([40500, 784]).
Think I have it right now, but I feel the code is a bit messy, with 3 nested loops. Is there any simpler way to it or is this approach okay?
Here's my code for the training with cross validation:
def train(network, epochs, save_Model = False):
total_acc = 0
for fold, (train_index, test_index) in enumerate(kfold.split(x_train, y_train)):
### Dividing data into folds
x_train_fold = x_train[train_index]
x_test_fold = x_train[test_index]
y_train_fold = y_train[train_index]
y_test_fold = y_train[test_index]
train = torch.utils.data.TensorDataset(x_train_fold, y_train_fold)
test = torch.utils.data.TensorDataset(x_test_fold, y_test_fold)
train_loader = torch.utils.data.DataLoader(train, batch_size = batch_size, shuffle = False)
test_loader = torch.utils.data.DataLoader(test, batch_size = batch_size, shuffle = False)
for epoch in range(epochs):
print('\nEpoch {} / {} \nFold number {} / {}'.format(epoch + 1, epochs, fold + 1 , kfold.get_n_splits()))
correct = 0
network.train()
for batch_index, (x_batch, y_batch) in enumerate(train_loader):
optimizer.zero_grad()
out = network(x_batch)
loss = loss_f(out, y_batch)
loss.backward()
optimizer.step()
pred = torch.max(out.data, dim=1)[1]
correct += (pred == y_batch).sum()
if (batch_index + 1) % 32 == 0:
print('[{}/{} ({:.0f}%)]\tLoss: {:.6f}\t Accuracy:{:.3f}%'.format(
(batch_index + 1)*len(x_batch), len(train_loader.dataset),
100.*batch_index / len(train_loader), loss.data, float(correct*100) / float(batch_size*(batch_index+1))))
total_acc += float(correct*100) / float(batch_size*(batch_index+1))
total_acc = (total_acc / kfold.get_n_splits())
print('\n\nTotal accuracy cross validation: {:.3f}%'.format(total_acc))
You messed with indices.
x_train = x[train_index]
x_test = x[test_index]
y_train = y[train_index]
y_test = y[test_index]
x_fold = x_train[train_index]
y_fold = y_train[test_index]
It should be:
x_fold = x_train[train_index]
y_fold = y_train[train_index]
Though all the above answers provide a good example of how to split the dataset, I am curious about the way to implement the K-fold cross-validation. K-fold aims to estimate the skill of a machine learning model on unseen data. To use a limited sample to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model. (See the concept and explanation in Wikipedia https://en.wikipedia.org/wiki/Cross-validation_(statistics)) Therefore, it is necessary to initialize the parameters of your to-be-trained model at the beginning of each fold. Otherwise, your model will see every sample in the dataset after K-fold and there is no such thing as validation (all are training samples).
Related
I am performing word sense disambiguation and have created my own vocabulary of the top 300k most common English words. My model is very simple where each word in the sentences (their respective index value) is passed through an embedding layer which embeds the word and average the resulting embedding. The averaged embedding is then sent through a linear layer, as shown in the model below.
class TestingClassifier(nn.Module):
def __init__(self, vocabSize, features, embeddingDim):
super(TestingClassifier, self).__init__()
self.embeddings = nn.Embedding(vocabSize, embeddingDim)
self.linear = nn.Linear(features, 2)
self.sigmoid = nn.Sigmoid()
def forward(self, inputs):
embeds = self.embeddings(inputs)
avged = torch.mean(embeds, dim=-1)
output = self.linear(avged)
output = self.sigmoid(output)
return output
I am running BCELoss as loss function and SGD as optimizer. My problem is that my loss barely decreases as training goes on, almost as if it converges with a very high loss. I have tried different learning rates (0.0001, 0.001, 0.01 and 0.1) but I get the same issue.
My training function is as follows:
def train_model(model,
optimizer,
lossFunction,
batchSize,
epochs,
isRnnModel,
trainDataLoader,
validDataLoader,
earlyStop = False,
maxPatience = 1
):
validationAcc = []
patienceCounter = 0
stopTraining = False
model.train()
# Train network
for epoch in range(epochs):
losses = []
if(stopTraining):
break
for inputs, labels in tqdm(trainDataLoader, position=0, leave=True):
optimizer.zero_grad()
# Predict and calculate loss
prediction = model(inputs)
loss = lossFunction(prediction, labels)
losses.append(loss)
# Backward propagation
loss.backward()
# Readjust weights
optimizer.step()
print(sum(losses) / len(losses))
curValidAcc = check_accuracy(validDataLoader, model, isRnnModel) # Check accuracy on validation set
curTrainAcc = check_accuracy(trainDataLoader, model, isRnnModel)
print("Epoch", epoch + 1, "Training accuracy", curTrainAcc, "Validation accuracy:", curValidAcc)
# Control early stopping
if(earlyStop):
if(patienceCounter == 0):
if(len(validationAcc) > 0 and curValidAcc < validationAcc[-1]):
benchmark = validationAcc[-1]
patienceCounter += 1
print("Patience counter", patienceCounter)
elif(patienceCounter == maxPatience):
print("EARLY STOP. Patience level:", patienceCounter)
stopTraining = True
else:
if(curValidAcc < benchmark):
patienceCounter += 1
print("Patience counter", patienceCounter)
else:
benchmark = curValidAcc
patienceCounter = 0
validationAcc.append(curValidAcc)
Batch size is 32 (training set contains 8000 rows), vocabulary size is 300k, embedding dimension is 24. I have tried adding more linear layers to the network, but it makes no difference. The prediction accuracy on the training and validation sets stays at around 50% (which is horrible) even after many epochs of training. Any help is much appreciated!
I am getting acquainted with Tensorflow-Probability and here I am running into a problem. During training, the model returns nan as the loss (possibly meaning a huge loss that causes overflowing). Since the functional form of the synthetic data is not overly complicated and the ratio of data points to parameters is not frightening at first glance at least I wonder what is the problem and how it could be corrected.
The code is the following --accompanied by some possibly helpful images:
# Create and plot 5000 data points
x_train = np.linspace(-1, 2, 5000)[:, np.newaxis]
y_train = np.power(x_train, 3) + 0.1*(2+x_train)*np.random.randn(5000)[:, np.newaxis]
plt.scatter(x_train, y_train, alpha=0.1)
plt.show()
# Define the prior weight distribution -- all N(0, 1) -- and not trainable
def prior(kernel_size, bias_size, dtype = None):
n = kernel_size + bias_size
prior_model = Sequential([
tfpl.DistributionLambda(
lambda t: tfd.MultivariateNormalDiag(loc = tf.zeros(n) , scale_diag = tf.ones(n)
))
])
return(prior_model)
# Define variational posterior weight distribution -- multivariate Gaussian
def posterior(kernel_size, bias_size, dtype = None):
n = kernel_size + bias_size
posterior_model = Sequential([
tfpl.VariableLayer(tfpl.MultivariateNormalTriL.params_size(n) , dtype = dtype), # The parameters of the model are declared Variables that are trainable
tfpl.MultivariateNormalTriL(n) # The posterior function will return to the Variational layer that will call it a MultivariateNormalTril object that will have as many dimensions
# as the parameters of the Variational Dense Layer. That means that each parameter will be generated by a distinct Normal Gaussian shifted and scaled
# by a mu and sigma learned from the data, independently of all the other weights. The output of this Variablelayer will become the input to the
# MultivariateNormalTriL object.
# The shape of the VariableLayer object will be defined by the number of paramaters needed to create the MultivariateNormalTriL object given
# that it will live in a Space of n dimensions (event_size = n). This number is returned by the tfpl.MultivariateNormalTriL.params_size(n)
])
return(posterior_model)
x_in = Input(shape = (1,))
x = tfpl.DenseVariational(units= 2**4,
make_prior_fn=prior,
make_posterior_fn=posterior,
kl_weight=1/x_train.shape[0],
activation='relu')(x_in)
x = tfpl.DenseVariational(units= 2**4,
make_prior_fn=prior,
make_posterior_fn=posterior,
kl_weight=1/x_train.shape[0],
activation='relu')(x)
x = tfpl.DenseVariational(units=tfpl.IndependentNormal.params_size(1),
make_prior_fn=prior,
make_posterior_fn=posterior,
kl_weight=1/x_train.shape[0])(x)
y_out = tfpl.IndependentNormal(1)(x)
model = Model(inputs = x_in, outputs = y_out)
def nll(y_true, y_pred):
return -y_pred.log_prob(y_true)
model.compile(loss=nll, optimizer= 'Adam')
model.summary()
Train the model
history = model.fit(x_train1, y_train1, epochs=500)
The problem seems to be in the loss function: negative log-likelihood of the independent normal distribution without any specified location and scale leads to the untamed variance which leads to the blowing up the final loss value. Since you're experimenting with the variational layers, you must be interested in the estimation of the epistemic uncertainty, to that end, I'd recommend to apply the constant variance.
I tried to make a couple of slight changes to your code within the following lines:
first of all, the final output y_out comes directly from the final variational layer without any IndpendnetNormal distribution layer:
y_out = tfpl.DenseVariational(units=1,
make_prior_fn=prior,
make_posterior_fn=posterior,
kl_weight=1/x_train.shape[0])(x)
second, the loss function now contains the necessary calculations with the normal distribution you need but with the static variance in order to avoid the blowing up of the loss during training:
def nll(y_true, y_pred):
dist = tfp.distributions.Normal(loc=y_pred, scale=1.0)
return tf.reduce_sum(-dist.log_prob(y_true))
then the model is compiled and trained in the same way as before:
model.compile(loss=nll, optimizer= 'Adam')
history = model.fit(x_train, y_train, epochs=3000)
and finally let's sample 100 different predictions from the trained model and plot these values to visualize the epistemic uncertainty of the model:
predicted = [model(x_train) for _ in range(100)]
for i, res in enumerate(predicted):
plt.plot(x_train, res , alpha=0.1)
plt.scatter(x_train, y_train, alpha=0.1)
plt.show()
After 3000 epochs the result looks like this (with the reduced number of training points to 3000 instead of 5000 to speed-up the training):
The model has 38,589 trainable parameters but you have only 5,000 points as data; so, effective training is impossible with so many parameters.
I am trying to train a recurrent model in Keras containing an LSTM for regression purposes.
I would like to use the model online and, as far as I understood, I need to train a stateful LSTM.
Since the model has to output a sequence of values, I hope it computes the loss on each of the expected output vector.
However, I fear my code is not working this way and I would be grateful if anyone would help me to understand if I am doing right or if there is some better approach.
The input to the model is a sequence of 128-dimensional vectors. Each sequence in the training set has a different lenght.
At each time, the model should output a vector of 3 elements.
I am trying to train and compare two models:
A) a simple LSTM with 128 inputs and 3 outputs;
B) a simple LSTM with 128 inputs and 100 outputs + a dense layer with 3 outputs;
For model A) I wrote the following code:
# Model
model = Sequential()
model.add(LSTM(3, batch_input_shape=(1, None, 128), return_sequences=True, activation = "linear", stateful = True))`
model.compile(loss='mean_squared_error', optimizer=Adam())
# Training
for i in range(n_epoch):
for j in np.random.permutation(n_sequences):
X = data[j] # j-th sequences
X = X[np.newaxis, ...] # X has size 1 x NTimes x 128
Y = dataY[j] # Y has size NTimes x 3
history = model.fit(X, Y, epochs=1, batch_size=1, verbose=0, shuffle=False)
model.reset_states()
With this code, model A) seems to train fine because the output sequence approaches the ground-truth sequence on the training set.
However, I wonder if the loss is really computed by considering all NTimes output vectors.
For model B), I could not find any way to get the entire output sequence due to the dense layer. Hence, I wrote:
# Model
model = Sequential()
model.add(LSTM(100, batch_input_shape=(1, None, 128), , stateful = True))
model.add(Dense(3, activation="linear"))
model.compile(loss='mean_squared_error', optimizer=Adam())
# Training
for i in range(n_epoch):
for j in np.random.permutation(n_sequences):
X = data[j] #j-th sequence
X = X[np.newaxis, ...] # X has size 1 x NTimes x 128
Y = dataY[j] # Y has size NTimes x 3
for h in range(X.shape[1]):
x = X[0,h,:]
x = x[np.newaxis, np.newaxis, ...] # h-th vector in j-th sequence
y = Y[h,:]
y = y[np.newaxis, ...]
loss += model.train_on_batch(x,y)
model.reset_states() #After the end of the sequence
With this code, model B) does not train fine. It seems to me the training does not converge and loss values increase and decrease cyclically
I have also tried to use as Y only the last vector and them calling the fit function on the Whole training sequence X, but no improvements.
Any idea? Thank you!
If you want to still have three outputs per step of your sequence, you need to TimeDistribute your Dense layer like so:
model.add(TimeDistributed(Dense(3, activation="linear")))
This applies the dense layer to each timestep independently.
See https://keras.io/layers/wrappers/#timedistributed
I am doing a time-series forecast with a LSTM NN and Keras. As input features there are two variables (precipitation and temperature) and the one target to be predicted is groundwater-level.
It seems to be working quite all right, though there is a serious offset between the actual data and the output (see image).
Now I've read that this is can be a classic sign of the network not working, as it seems to be mimicing the output and
what the model is actually doing is that when predicting the value at
time ât+1â, it simply uses the value at time âtâ as its prediction https://towardsdatascience.com/how-not-to-use-machine-learning-for-time-series-forecasting-avoiding-the-pitfalls-19f9d7adf424
However, this is not actually possible in my case, as the target-values are not used as input variable. I am using a multi variate time-series with two features, independent of the output feature.
Also, the predicted values are not offset in future (t+1) but rather seem to lag behind (t-1).
Does anyone know what could cause this problem?
This is the complete code of my network:
# Split in Input and Output Data
x_1 = data[['MeanT']].values
x_2 = data[['Precip']].values
y = data[['Z_424A_6857']].values
# Scale Data
x = np.hstack([x_1, x_2])
scaler = MinMaxScaler(feature_range=(0, 1))
x = scaler.fit_transform(x)
scaler_out = MinMaxScaler(feature_range=(0, 1))
y = scaler_out.fit_transform(y)
# Reshape Data
x_1, x_2, y = H.create2feature_data(x_1, x_2, y, window)
train_size = int(len(x_1) * .8)
test_size = int(len(x_1)) # * .5
x_1 = np.expand_dims(x_1, 2) # 3D tensor with shape (batch_size, timesteps, input_dim) // (nr. of samples, nr. of timesteps, nr. of features)
x_2 = np.expand_dims(x_2, 2)
y = np.expand_dims(y, 1)
# Split Training Data
x_1_train = x_1[:train_size]
x_2_train = x_2[:train_size]
y_train = y[:train_size]
# Split Test Data
x_1_test = x_1[train_size:test_size]
x_2_test = x_2[train_size:test_size]
y_test = y[train_size:test_size]
# Define Model Input Sets
inputA = Input(shape=(window, 1))
inputB = Input(shape=(window, 1))
# Build Model Branch 1
branch_1 = layers.GRU(16, activation=act, dropout=0, return_sequences=False, stateful=False, batch_input_shape=(batch, 30, 1))(inputA)
branch_1 = layers.Dense(8, activation=act)(branch_1)
#branch_1 = layers.Dropout(0.2)(branch_1)
branch_1 = Model(inputs=inputA, outputs=branch_1)
# Build Model Branch 2
branch_2 = layers.GRU(16, activation=act, dropout=0, return_sequences=False, stateful=False, batch_input_shape=(batch, 30, 1))(inputB)
branch_2 = layers.Dense(8, activation=act)(branch_2)
#branch_2 = layers.Dropout(0.2)(branch_2)
branch_2 = Model(inputs=inputB, outputs=branch_2)
# Combine Model Branches
combined = layers.concatenate([branch_1.output, branch_2.output])
# apply a FC layer and then a regression prediction on the combined outputs
comb = layers.Dense(6, activation=act)(combined)
comb = layers.Dense(1, activation="linear")(comb)
# Accept the inputs of the two branches and then output a single value
model = Model(inputs=[branch_1.input, branch_2.input], outputs=comb)
model.compile(loss='mse', optimizer='adam', metrics=['mse', H.r2_score])
model.summary()
# Training
model.fit([x_1_train, x_2_train], y_train, epochs=epoch, batch_size=batch, validation_split=0.2, callbacks=[tensorboard])
model.reset_states()
# Evaluation
print('Train evaluation')
print(model.evaluate([x_1_train, x_2_train], y_train))
print('Test evaluation')
print(model.evaluate([x_1_test, x_2_test], y_test))
# Predictions
predictions_train = model.predict([x_1_train, x_2_train])
predictions_test = model.predict([x_1_test, x_2_test])
predictions_train = np.reshape(predictions_train, (-1,1))
predictions_test = np.reshape(predictions_test, (-1,1))
# Reverse Scaling
predictions_train = scaler_out.inverse_transform(predictions_train)
predictions_test = scaler_out.inverse_transform(predictions_test)
# Plot results
plt.figure(figsize=(15, 6))
plt.plot(orig_data, color='blue', label='True GWL')
plt.plot(range(train_size), predictions_train, color='red', label='Predicted GWL (Training)')
plt.plot(range(train_size, test_size), predictions_test, color='green', label='Predicted GWL (Test)')
plt.title('GWL Prediction')
plt.xlabel('Day')
plt.ylabel('GWL')
plt.legend()
plt.show()
I am using a batch size of 30 timesteps, a lookback of 90 timesteps, with a total data size of around 7500 time steps.
Any help would be greatly appreciated :-) Thank you!
Probably my answer is not relevant two years later, but I had a similar issue when experimenting with LSTM encoder-decoder model. I solved my problem by scaling the input data in the range -1 .. 1 instead of 0 .. 1 as in your example.
Trying to work with the framework provided in the course Stanford cs231n, given the code below.
I can see the accuracy getting better and the net is trained however after the training process and checking the results on the validation set, how would I go to input one image into the model and see its prediction?
I have searched around and couldn't find some built in predict function in tensorflow as there is in keras.
Initializing the net and its parameters
# clear old variables
tf.reset_default_graph()
# setup input (e.g. the data that changes every batch)
# The first dim is None, and gets sets automatically based on batch size fed in
X = tf.placeholder(tf.float32, [None, 30, 30, 1])
y = tf.placeholder(tf.int64, [None])
is_training = tf.placeholder(tf.bool)
def simple_model(X,y):
# define our weights (e.g. init_two_layer_convnet)
# setup variables
Wconv1 = tf.get_variable("Wconv1", shape=[7, 7, 1, 32]) # Filter of size 7x7 with depth of 3. No. of filters is 32
bconv1 = tf.get_variable("bconv1", shape=[32])
W1 = tf.get_variable("W1", shape=[4608, 360]) # 5408 is 13x13x32 where 13x13 is the output of 7x7 filter on 32x32 image with padding of 2.
b1 = tf.get_variable("b1", shape=[360])
# define our graph (e.g. two_layer_convnet)
a1 = tf.nn.conv2d(X, Wconv1, strides=[1,2,2,1], padding='VALID') + bconv1
h1 = tf.nn.relu(a1)
h1_flat = tf.reshape(h1,[-1,4608])
y_out = tf.matmul(h1_flat,W1) + b1
return y_out
y_out = simple_model(X,y)
# define our loss
total_loss = tf.losses.hinge_loss(tf.one_hot(y,360),logits=y_out)
mean_loss = tf.reduce_mean(total_loss)
# define our optimizer
optimizer = tf.train.AdamOptimizer(5e-4) # select optimizer and set learning rate
train_step = optimizer.minimize(mean_loss)
Function for evaluating the model whether for training or validation and plots the results:
def run_model(session, predict, loss_val, Xd, yd,
epochs=1, batch_size=64, print_every=100,
training=None, plot_losses=False):
# Have tensorflow compute accuracy
correct_prediction = tf.equal(tf.argmax(predict,1), y)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
# shuffle indicies
train_indicies = np.arange(Xd.shape[0])
np.random.shuffle(train_indicies)
training_now = training is not None
# setting up variables we want to compute and optimize
# if we have a training function, add that to things we compute
variables = [mean_loss,correct_prediction,accuracy]
if training_now:
variables[-1] = training
# counter
iter_cnt = 0
for e in range(epochs):
# keep track of losses and accuracy
correct = 0
losses = []
# make sure we iterate over the dataset once
for i in range(int(math.ceil(Xd.shape[0]/batch_size))):
# generate indicies for the batch
start_idx = (i*batch_size)%Xd.shape[0]
idx = train_indicies[start_idx:start_idx+batch_size]
# create a feed dictionary for this batch
feed_dict = {X: Xd[idx,:],
y: yd[idx],
is_training: training_now }
# get batch size
actual_batch_size = yd[idx].shape[0]
# have tensorflow compute loss and correct predictions
# and (if given) perform a training step
loss, corr, _ = session.run(variables,feed_dict=feed_dict)
# aggregate performance stats
losses.append(loss*actual_batch_size)
correct += np.sum(corr)
# print every now and then
if training_now and (iter_cnt % print_every) == 0:
print("Iteration {0}: with minibatch training loss = {1:.3g} and accuracy of {2:.2g}"\
.format(iter_cnt,loss,np.sum(corr)/actual_batch_size))
iter_cnt += 1
total_correct = correct/Xd.shape[0]
total_loss = np.sum(losses)/Xd.shape[0]
print("Epoch {2}, Overall loss = {0:.3g} and accuracy of {1:.3g}"\
.format(total_loss,total_correct,e+1))
if plot_losses:
plt.plot(losses)
plt.grid(True)
plt.title('Epoch {} Loss'.format(e+1))
plt.xlabel('minibatch number')
plt.ylabel('minibatch loss')
plt.show()
return total_loss,total_correct
The functions calls that trains the model
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
print('Training')
run_model(sess,y_out,mean_loss,x_train,y_train,1,64,100,train_step,True)
print('Validation')
run_model(sess,y_out,mean_loss,x_val,y_val,1,64)
You do not need to go far, you simply pass your new (test) feature matrix X_test into your network and perform a forward pass - the output layer is the prediction. So the code is something like this
session.run(y_out, feed_dict={X: X_test})