PyTorch "modified by inplace operation" error - pytorch

The error message indicates that [torch.cuda.FloatTensor [256, 1, 4, 4]] is at version 2; expected version 1 instead, and execution breaks on d_loss.backward() — i.e., the backward call on my Discriminator.
UPDATE: Okay, I tracked it down to an optimizer.step() for my Generator that was happening before running .backward() on my Discriminator.
UPDATE 2: So once I got the model running on PyTorch 1.5 (by moving G's optimizer to after the d_loss.backward() call, as above), I noticed that losses were suddenly much higher during training. I let the model run for a few epochs and the images were basically noise. So, out of curiosity I switched back to my PyTorch 1.4 environment and ran the original for a few epochs, and the images were good again. It's a ClusterGAN that I'm training — so not the standard routine — and I'm wondering why this change is so detrimental to the output. Also, how can I get the model to run in PyTorch 1.5 without the degradation in performance? Presumably I have to keep the optimizer update where it was originally (right after ge_loss.backward(retain_graph=True)), but somehow avoid the error PyTorch 1.5 reports when we hit d_loss.backward() later in the code. I suppose I have to clone() something, but I'm not clear what... ?
[...]
# main training block
for epoch in range(n_epochs):
for i, (imgs, itruth_label) in enumerate(dataloader):
iter_count += 1
# Ensure generator/encoder are trainable
generator.train()
encoder.train()
# Zero gradients for models
generator.zero_grad()
encoder.zero_grad()
discriminator.zero_grad()
# Configure input
real_imgs = Variable(imgs.type(Tensor))
# ---------------------------
# Train Generator + Encoder
# ---------------------------
optimizer_GE.zero_grad()
# Sample random latent variables
zn, zc, zc_idx = sample_z(shape=imgs.shape[0],
latent_dim=latent_dim,
n_c=n_c)
# Generate a batch of images
gen_imgs = generator(zn, zc)
# Discriminator output from real and generated samples
D_gen = discriminator(gen_imgs)
D_real = discriminator(real_imgs)
# Step for Generator & Encoder, n_skip_iter times less than for discriminator
did_update = False
if (i % n_skip_iter == 0):
# Encode the generated images
enc_gen_zn, enc_gen_zc, enc_gen_zc_logits = encoder(gen_imgs)
# Calculate losses for z_n, z_c
zn_loss = mse_loss(enc_gen_zn, zn)
zc_loss = xe_loss(enc_gen_zc_logits, zc_idx)
# additional top-k step (from Sinha et al, 2020)
if top_k <= D_gen.size()[0]:
top_k_gen = torch.topk(D_gen, top_k, 0)
else:
top_k_gen = torch.topk(D_gen, D_gen.size()[0], 0)
# Check requested metric
if wass_metric:
# Wasserstein GAN loss
ge_loss = torch.mean(top_k_gen[0]) + betan * zn_loss + betac * zc_loss
else:
# Vanilla GAN loss
valid = Variable(Tensor(gen_imgs.size(0), 1).fill_(1.0), requires_grad=False)
v_loss = bce_loss(D_gen, valid)
ge_loss = v_loss + betan * zn_loss + betac * zc_loss
ge_loss.backward(retain_graph=True)
# ---- ORIGINAL OPTIMIZER UPDATE ---- #
optimizer_GE.step()
scheduler.step(epoch + i / iters)
did_update = True
# ---------------------
# Train Discriminator
# ---------------------
optimizer_D.zero_grad()
# Measure discriminator's ability to classify real from generated samples
if wass_metric:
# Gradient penalty term
grad_penalty = calc_gradient_penalty(discriminator, real_imgs, gen_imgs)
# Wasserstein GAN loss w/gradient penalty
d_loss = torch.mean(D_real) - torch.mean(D_gen) + grad_penalty
else:
# Vanilla GAN loss
fake = Variable(Tensor(gen_imgs.size(0), 1).fill_(0.0), requires_grad=False)
real_loss = bce_loss(D_real, valid)
fake_loss = bce_loss(D_gen, fake)
d_loss = (real_loss + fake_loss) / 2
d_loss.backward()
# --- REVISED OPTIMIZER UPDATE FOR PyTorch 1.5 ------ #
# if did_update:
# optimizer_GE.step()
optimizer_D.step()
# scheduler.step(epoch + i / iters)
[...]

If I understand correctly, the error occurs at the second time you call .backward().
The problem is caused by calling .backward() on D_gen and D_real twice.
I don't know exactly what you're doing with this model, but I guess you don't have to backward and update the parameters of Discriminators while training Generators, right?
So, try this:
1.set the requires_grad of D.parameters() to False in the Train Generator + Encoder stage
2.set the requires_grad of D.parameters() to True in the Train Discriminator stage

Related

Implementation of Gradient Accumulation with SAM Optimizer together with Mixed-Precision

I'm trying to use Sharpness-Aware Minimization (SAM) optimizer in my code, using the already built Pytorch code from here. Then, I would also like to use gradient accumulation, but I have no idea how to make this works properly. Using the proposed idea in one of the closed issue for mixed-precision:
def train(
args, model, device, train_loader, optimizer, first_step_scaler, second_step_scaler, epoch
):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
enable_running_stats(model)
# First forward step
with autocast():
output = model(data)
loss = F.nll_loss(output, target)
first_step_scaler.scale(loss).backward()
# We unscale manually for two reasons: (1) SAM's first-step adds the gradient
# to weights directly. So gradient must be unscaled; (2) unscale_ checks if any
# gradient is inf and updates optimizer_state["found_inf_per_device"] accordingly.
# We use optimizer_state["found_inf_per_device"] to decide whether to apply
# SAM's first-step or not.
first_step_scaler.unscale_(optimizer)
optimizer_state = first_step_scaler._per_optimizer_states[id(optimizer)]
# Check if any gradients are inf/nan
inf_grad_cnt = sum(v.item() for v in optimizer_state["found_inf_per_device"].values())
if inf_grad_cnt == 0:
# if valid graident, apply sam_first_step
optimizer.first_step(zero_grad=True, mixed_precision=True)
sam_first_step_applied = True
else:
# if invalid graident, skip sam and revert to single optimization step
optimizer.zero_grad()
sam_first_step_applied = False
# Update the scaler with no impact on the model (weights or gradient). This update step
# resets the optimizer_state["found_inf_per_device"]. So, it is applied after computing
# inf_grad_cnt. Note that zero_grad() has no impact on the update() operation,
# because update() leverage optimizer_state["found_inf_per_device"]
first_step_scaler.update()
disable_running_stats(model)
# Second forward step
with autocast():
output = model(data)
loss = F.nll_loss(output, target)
second_step_scaler.scale(loss).backward()
if sam_first_step_applied:
# If sam_first_step was applied, apply the 2nd step
optimizer.second_step(mixed_precision=True)
second_step_scaler.step(optimizer)
I tried something like this:
def train(
args, model, device, train_loader, optimizer, first_step_scaler, second_step_scaler, epoch, gradient_acc=2
):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
data, target = data.to(device), target.to(device)
enable_running_stats(model)
# First forward step
with autocast():
output = model(data)
loss = F.nll_loss(output, target)
loss = loss / gradient_acc
first_step_scaler.scale(loss).backward()
# We unscale manually for two reasons: (1) SAM's first-step adds the gradient
# to weights directly. So gradient must be unscaled; (2) unscale_ checks if any
# gradient is inf and updates optimizer_state["found_inf_per_device"] accordingly.
# We use optimizer_state["found_inf_per_device"] to decide whether to apply
# SAM's first-step or not.
first_step_scaler.unscale_(optimizer)
optimizer_state = first_step_scaler._per_optimizer_states[id(optimizer)]
# Check if any gradients are inf/nan
inf_grad_cnt = sum(v.item() for v in optimizer_state["found_inf_per_device"].values())
if inf_grad_cnt == 0:
# if valid graident, apply sam_first_step
optimizer.first_step(zero_grad=True, mixed_precision=True)
sam_first_step_applied = True
else:
# if invalid graident, skip sam and revert to single optimization step
optimizer.zero_grad()
sam_first_step_applied = False
# Update the scaler with no impact on the model (weights or gradient). This update step
# resets the optimizer_state["found_inf_per_device"]. So, it is applied after computing
# inf_grad_cnt. Note that zero_grad() has no impact on the update() operation,
# because update() leverage optimizer_state["found_inf_per_device"]
first_step_scaler.update()
disable_running_stats(model)
# Second forward step
with autocast():
output = model(data)
loss = F.nll_loss(output, target)
loss = loss / gradient_acc
second_step_scaler.scale(loss).backward()
if sam_first_step_applied:
# If sam_first_step was applied, apply the 2nd step
optimizer.second_step(mixed_precision=True)
if not (batch_idx + 1) % gradient_acc != 0:
second_step_scaler.step(optimizer)
second_step_scaler.update()
optimizer.zero_grad()
But I noticed this makes my loss increasing rather than decreasing, anyone have any idea how to improvise this?

Unable to use existing code working with base transformers on 'large' models

My Python code works OK for base transformer models, but when I attempt to use 'large' models, or roberta models I receive error mesages. The most common message I print below.
Epoch 1 / 40
RuntimeError Traceback (most recent call last)
in ()
12
13 #train model
---> 14 train_loss, _ = fine_tune()
15 # WE DON'T CARE ABOUT THE SECOND ITEM THE MODEL OUTPUTS (total_preds)
16 # We onlt want the average loss values here 'avg_loss'
5 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in linear(input, weight, bias)
1688 if input.dim() == 2 and bias is not None:
1689 # fused op is marginally faster
-> 1690 ret = torch.addmm(bias, input, weight.t())
1691 else:
1692 output = input.matmul(weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0
I am guessing there is some kind of a mismatch between matrices(Tensors) such that an operation cannot occur. If I can better understand the issue, I can better address the necessary changes to my code. Her is the fine tuning function I am using...
def fine_tune():
model.train()
total_loss, total_accuracy = 0, 0
empty list to save model predictions
total_preds=[]
iterate over batches
for step,batch in enumerate(train_dataloader):
# progress update after every 50 batches.
if step % 50 == 0 and not step == 0:
print(' Batch {:>5,} of {:>5,}.'.format(step, len(train_dataloader)))
# push the batch to gpu
batch = [r.to(device) for r in batch]
sent_id, mask, labels = batch
# clear previously calculated gradients
model.zero_grad()
# get model predictions for the current batch
preds = model(sent_id, mask)
# compute the loss between actual and predicted values
loss = cross_entropy(preds, labels)
# add on to the total loss
total_loss = total_loss + loss.item()
# backward pass to calculate the gradients
loss.backward()
# clip the the gradients to 1.0. It helps in preventing the exploding gradient problem
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# update parameters
optimizer.step()
# model predictions are stored on GPU. So, push it to CPU
preds=preds.detach().cpu().numpy()
# Length of preds is the same as the batch size
# append the model predictions
total_preds.append(preds)
compute the training loss of the epoch
avg_loss = total_loss / len(train_dataloader)
reshape the predictions in form of (number of samples, no. of classes)
total_preds = np.concatenate(total_preds, axis=0)
return avg_loss, total_preds
regards, Mark
I wrote a print statement to reveal the size of the input from the pre-trained model. This revealed that true size, namely 1024, rather than the default hard-code value of 768 in the program I have modified. An easy fix once I understood the problem. The moral of the story for me is, when a YouTuber ( a good one actually!) says " all transformers have an output dimension of 768" don't take that necessarily as gospel!

Implementing Batch for Image Segmentation

I wrote a Python 3.5 script for doing street segmentation. Since I'm new in Image Segementation, I did not use predefined dataloaders from pytorch, instead I wrote them by my self (for better understanding). Until now I only use a batch size of 1. Now I want to generalize this for arbitrary batch sizes.
This is a snippet of my Dataloader:
def augment_data(batch_size):
# [...] defining some paths and data transformation (including ToTensor() function)
# The images are named by numbers (Frame numbers), this allows me to find the correct label image for a given input image.
all_input_image_paths = {int(elem.split('\\')[-1].split('.')[0]) : elem for idx, elem in enumerate(glob.glob(input_dir + "*"))}
all_label_image_paths = {int(elem.split('\\')[-1].split('.')[0]) : elem for idx, elem in enumerate(glob.glob(label_dir + "*"))}
dataloader = {"train":[], "val":[]}
all_samples = []
img_counter = 0
for key, value in all_input_image_paths.items():
input_img = Image.open(all_input_image_paths[key])
label_img = Image.open(all_label_image_paths[key])
# Here I use my own augmentation function which crops the input and label on the same position and do other things.
# We get a list of new augmented data
augmented_images = generate_augmented_images(input_img, label_img)
for elem in augmented_images:
input_as_tensor = data_transforms['norm'](elem[0])
label_as_tensor = data_transforms['val'](elem[1])
input_as_tensor.unsqueeze_(0)
label_as_tensor.unsqueeze_(0)
is_training_data = random.uniform(0.0, 1.0)
if is_training_data <= 0.7:
dataloader["train"].append([input_as_tensor, label_as_tensor])
else:
dataloader["val"].append([input_as_tensor, label_as_tensor])
img_counter += 1
shuffle(dataloader["train"])
shuffle(dataloader["val"])
dataloader_batched = {"train":[], "val":[]}
# Here I group my data to a given batch size
for elem in dataloader["train"]:
batch = []
for i in range(batch_size):
batch.append(elem)
dataloader_batched["train"].append(batch)
for elem in dataloader["val"]:
batch = []
for i in range(batch_size):
batch.append(elem)
dataloader_batched["val"].append(batch)
return dataloader_batched
This is a snippet of my training method with batch size 1:
while epoch <= num_epochs:
# Each epoch has a training and validation phase
for phase in ['train', 'val']:
if phase == 'train':
scheduler.step(3)
model.train() # Set model to training mode
else:
model.eval() # Set model to evaluate mode
running_loss = 0.0
counter = 0
# Iterate over data.
for inputs, labels in dataloaders[phase]:
counter += 1
max_num = len(dataloaders[phase])
inputs = inputs.to(device)
labels = labels.to(device)
# zero the parameter gradients
optimizer.zero_grad()
# forward
# track history if only in train
with torch.set_grad_enabled(phase == 'train'):
outputs = model(inputs)
loss = criterion(outputs, labels)
# backward + optimize only if in training phase
if phase == 'train':
loss.backward()
optimizer.step()
# statistics
running_loss += loss.item() * inputs.size(0)
epoch_loss = running_loss / dataset_sizes[phase]
If I execute this, I get of course the error:
for inputs, labels in dataloaders[phase]:
ValueError: not enough values to unpack (expected 2, got 1)
I understand why, because now I have a list of images and not only an input and label image as before. So guessed I need a second for loop which iterates over these batches. So I tried this:
# Iterate over data.
for elem in dataloaders[phase]:
for inputs, labels in elem:
counter += 1
max_num = len(dataloaders[phase])
inputs = inputs.to(device)
labels = labels.to(device)
# zero the parameter gradients
optimizer.zero_grad()
# forward
# track history if only in train
with torch.set_grad_enabled(phase == 'train'):
outputs = model(inputs)
# _, preds = torch.max(outputs, 1)
loss = criterion(outputs, labels)
# backward + optimize only if in training phase
if phase == 'train':
loss.backward()
optimizer.step()
But for me it looks like the optimization step (back-prop) is only applied on the last image of the batch. Is that true? And if so, how can I fix this? I guess if I indent the with-Block, then I get again a batch size 1 optimization.
Thanks in advance
But for me it looks like the optimization step (back-prop) is only applied on the last image of the batch.
It should not apply only based on the last image. It should apply based on the batch size.
If you set bs=2 and it should apply to the batch of two images.
Optimization step actually will update the params of your network. Backprop is a fancy name for PyTorch autograd system that computes the first order gradients.

TF | How to predict from CNN after training is done

Trying to work with the framework provided in the course Stanford cs231n, given the code below.
I can see the accuracy getting better and the net is trained however after the training process and checking the results on the validation set, how would I go to input one image into the model and see its prediction?
I have searched around and couldn't find some built in predict function in tensorflow as there is in keras.
Initializing the net and its parameters
# clear old variables
tf.reset_default_graph()
# setup input (e.g. the data that changes every batch)
# The first dim is None, and gets sets automatically based on batch size fed in
X = tf.placeholder(tf.float32, [None, 30, 30, 1])
y = tf.placeholder(tf.int64, [None])
is_training = tf.placeholder(tf.bool)
def simple_model(X,y):
# define our weights (e.g. init_two_layer_convnet)
# setup variables
Wconv1 = tf.get_variable("Wconv1", shape=[7, 7, 1, 32]) # Filter of size 7x7 with depth of 3. No. of filters is 32
bconv1 = tf.get_variable("bconv1", shape=[32])
W1 = tf.get_variable("W1", shape=[4608, 360]) # 5408 is 13x13x32 where 13x13 is the output of 7x7 filter on 32x32 image with padding of 2.
b1 = tf.get_variable("b1", shape=[360])
# define our graph (e.g. two_layer_convnet)
a1 = tf.nn.conv2d(X, Wconv1, strides=[1,2,2,1], padding='VALID') + bconv1
h1 = tf.nn.relu(a1)
h1_flat = tf.reshape(h1,[-1,4608])
y_out = tf.matmul(h1_flat,W1) + b1
return y_out
y_out = simple_model(X,y)
# define our loss
total_loss = tf.losses.hinge_loss(tf.one_hot(y,360),logits=y_out)
mean_loss = tf.reduce_mean(total_loss)
# define our optimizer
optimizer = tf.train.AdamOptimizer(5e-4) # select optimizer and set learning rate
train_step = optimizer.minimize(mean_loss)
Function for evaluating the model whether for training or validation and plots the results:
def run_model(session, predict, loss_val, Xd, yd,
epochs=1, batch_size=64, print_every=100,
training=None, plot_losses=False):
# Have tensorflow compute accuracy
correct_prediction = tf.equal(tf.argmax(predict,1), y)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
# shuffle indicies
train_indicies = np.arange(Xd.shape[0])
np.random.shuffle(train_indicies)
training_now = training is not None
# setting up variables we want to compute and optimize
# if we have a training function, add that to things we compute
variables = [mean_loss,correct_prediction,accuracy]
if training_now:
variables[-1] = training
# counter
iter_cnt = 0
for e in range(epochs):
# keep track of losses and accuracy
correct = 0
losses = []
# make sure we iterate over the dataset once
for i in range(int(math.ceil(Xd.shape[0]/batch_size))):
# generate indicies for the batch
start_idx = (i*batch_size)%Xd.shape[0]
idx = train_indicies[start_idx:start_idx+batch_size]
# create a feed dictionary for this batch
feed_dict = {X: Xd[idx,:],
y: yd[idx],
is_training: training_now }
# get batch size
actual_batch_size = yd[idx].shape[0]
# have tensorflow compute loss and correct predictions
# and (if given) perform a training step
loss, corr, _ = session.run(variables,feed_dict=feed_dict)
# aggregate performance stats
losses.append(loss*actual_batch_size)
correct += np.sum(corr)
# print every now and then
if training_now and (iter_cnt % print_every) == 0:
print("Iteration {0}: with minibatch training loss = {1:.3g} and accuracy of {2:.2g}"\
.format(iter_cnt,loss,np.sum(corr)/actual_batch_size))
iter_cnt += 1
total_correct = correct/Xd.shape[0]
total_loss = np.sum(losses)/Xd.shape[0]
print("Epoch {2}, Overall loss = {0:.3g} and accuracy of {1:.3g}"\
.format(total_loss,total_correct,e+1))
if plot_losses:
plt.plot(losses)
plt.grid(True)
plt.title('Epoch {} Loss'.format(e+1))
plt.xlabel('minibatch number')
plt.ylabel('minibatch loss')
plt.show()
return total_loss,total_correct
The functions calls that trains the model
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
print('Training')
run_model(sess,y_out,mean_loss,x_train,y_train,1,64,100,train_step,True)
print('Validation')
run_model(sess,y_out,mean_loss,x_val,y_val,1,64)
You do not need to go far, you simply pass your new (test) feature matrix X_test into your network and perform a forward pass - the output layer is the prediction. So the code is something like this
session.run(y_out, feed_dict={X: X_test})

Udacity Deep Learning, Assignment 3, Part 3: Tensorflow dropout function

I am now on assignment 3 of the Udacity Deep Learning class. I have most of it completed and it's working but I noticed that problem 3, which is about using 'dropout' with tensorflow, seems to degrade my performance rather than improve it.
So I think I'm doing something wrong. I'll put my full code here. If someone can explain to me how to properly use dropout, I'd appreciate it. (Or confirm I'm using it correctly and it's just not helping in this case.). It drops accuracy from over 94% (without dropout) down to 91.5%. If you aren't using L2 regularization, the degradation is even larger.
def create_nn(dataset, weights_hidden, biases_hidden, weights_out, biases_out):
# Original layer
logits = tf.add(tf.matmul(tf_train_dataset, weights_hidden), biases_hidden)
# Drop Out layer 1
logits = tf.nn.dropout(logits, 0.5)
# Hidden Relu layer
logits = tf.nn.relu(logits)
# Drop Out layer 2
logits = tf.nn.dropout(logits, 0.5)
# Output: Connect hidden layer to a node for each class
logits = tf.add(tf.matmul(logits, weights_out), biases_out)
return logits
# Create model
batch_size = 128
hidden_layer_size = 1024
beta = 1e-3
graph = tf.Graph()
with graph.as_default():
# Input data. For the training data, we use a placeholder that will be fed
# at run time with a training minibatch.
tf_train_dataset = tf.placeholder(tf.float32,
shape=(batch_size, image_size * image_size))
tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
tf_valid_dataset = tf.constant(valid_dataset)
tf_test_dataset = tf.constant(test_dataset)
# Variables.
weights_hidden = tf.Variable(
#tf.truncated_normal([image_size * image_size, num_labels]))
tf.truncated_normal([image_size * image_size, hidden_layer_size]))
#biases = tf.Variable(tf.zeros([num_labels]))
biases_hidden = tf.Variable(tf.zeros([hidden_layer_size]))
weights_out = tf.Variable(tf.truncated_normal([hidden_layer_size, num_labels]))
biases_out = tf.Variable(tf.zeros([num_labels]))
# Training computation.
#logits = tf.matmul(tf_train_dataset, weights_out) + biases_out
logits = create_nn(tf_train_dataset, weights_hidden, biases_hidden, weights_out, biases_out)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
loss += beta * (tf.nn.l2_loss(weights_hidden) + tf.nn.l2_loss(weights_out))
# Optimizer.
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
# Predictions for the training, validation, and test data.
train_prediction = tf.nn.softmax(logits)
#valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights_out) + biases_out)
#test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights_out) + biases_out)
valid_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset, weights_hidden) + biases_hidden), weights_out) + biases_out)
test_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, weights_hidden) + biases_hidden), weights_out) + biases_out)
num_steps = 10000
with tf.Session(graph=graph) as session:
tf.global_variables_initializer().run()
print("Initialized")
for step in range(num_steps):
# Pick an offset within the training data, which has been randomized.
# Note: we could use better randomization across epochs.
offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
#offset = (step * batch_size) % (3*128 - batch_size)
#print(offset)
# Generate a minibatch.
batch_data = train_dataset[offset:(offset + batch_size), :]
batch_labels = train_labels[offset:(offset + batch_size), :]
# Prepare a dictionary telling the session where to feed the minibatch.
# The key of the dictionary is the placeholder node of the graph to be fed,
# and the value is the numpy array to feed to it.
feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
_, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
if (step % 500 == 0):
print("Minibatch loss at step %d: %f" % (step, l))
print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels))
print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
You would need to turn off dropout during inference. It may not be obvious at first, but the fact that dropout is hardcoded in the NN architecture means it will affect the test data during inference. You can avoid this by creating a placeholder keep_prob, rather than providing the value 0.5 directly. For example:
keep_prob = tf.placeholder(tf.float32)
logits = tf.nn.dropout(logits, keep_prob)
To turn on dropout during training, set the keep_prob value to 0.5:
feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob: 0.5}
During inference/evaluation, you should be able to do something like this to set keep_prob to 1.0 in eval:
accuracy.eval(feed_dict={x: test_prediction, y_: test_labels, keep_prob: 1.0}
EDIT:
Since the issue does not seem to be that dropout is used at inference, the next culprit would be that the dropout is too high for this network size. You can potentially try decreasing the dropout to 20% (i.e. keep_prob=0.8), or increasing the size of the network to give the model an opportunity to learn the representations.
I actually gave it a try with your code, and I'm getting around ~93.5% with 20% dropout with this network size. I have added some additional resources below, including the original Dropout paper to help clarify the intuition behind it, and expands on more tips when using dropout such as increasing the learning rate.
References:
Deep MNIST for Experts: has an example on the above (dropout on/off) using MNIST
Dropout Regularization in Deep Learning Models With Keras
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
2 things I think can cause the problem.
First of all I would not recommend using dropout in first layer (that too 50%, use lower, in range 10-25% if you have to)) as when you use such a high dropout even higher level features are not learnt and propagated to deeper layers. Also try a range of dropouts from 10% to 50% and see how accuracy changes. There is no way to know beforehand what value will work
Secondly, you do not usually use dropout at inference. To fix that pass in keep_prob parameter of dropout as a placeholder and set it to 1 when inferencing.
Also, if the accuracy values you state are training accuracy then there may not even be much of a problem in first place as dropout will usually decrease training accuracy by small amounts as you are not overfitting, its the test/validation accuracy that needs to be closely monitored

Resources