How to understand a periodicity in the training loss using a pre-trained PyTorch model?

I'm using a pre-trained model from PyTorch (ResNet-18, 34, 50) to classify images. During training, a strange periodicity appears in the training loss, as you can see in the image below. Has anybody already had a similar issue? To deal with overfitting, I'm using data augmentation in the preprocessing.
When using SGD as the optimizer with the following parameters, we obtain this sort of graph:
criterion: NLLLoss()
learning rate: 0.0001
epoch: 40
print every: 40 iterations
We also tried Adam and AdaBound as optimizers, but the same periodicity was observed.
Thanks in advance for your answer!
Here is the code:
def train_classifier():
    start = 0
    stop = 0
    start = timeit.default_timer()
    epochs = 40
    steps = 0
    print_every = 40
    model.to('cuda')
    epo = []
    train = []
    valid = []
    acc_valid = []
    for e in range(epochs):
        print('Currently running epoch', e, ':')
        model.train()
        running_loss = 0
        for images, labels in iter(train_loader):
            steps += 1
            images, labels = images.to('cuda'), labels.to('cuda')
            optimizer.zero_grad()
            output = model.forward(images)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            if steps % print_every == 0:
                model.eval()
                # Turn off gradients for validation, saves memory and computations
                with torch.no_grad():
                    validation_loss, accuracy = validation(model, val_loader, criterion)
                print("Epoch: {}/{}.. ".format(e+1, epochs),
                      "Training Loss: {:.3f}.. ".format(running_loss/print_every),
                      "Validation Loss: {:.3f}.. ".format(validation_loss/len(val_loader)),
                      "Validation Accuracy: {:.3f}".format(accuracy/len(val_loader)))
                stop = timeit.default_timer()
                print('Time: ', stop - start)
                acc_valid.append(accuracy/len(val_loader))
                train.append(running_loss/print_every)
                valid.append(validation_loss/len(val_loader))
                epo.append(e+1)
                running_loss = 0
                model.train()
    return train, epo, valid, acc_valid
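One detail in the loop above can produce a periodic pattern on its own: running_loss is reset to 0 at the start of every epoch while steps keeps counting across epochs, yet the logged value is always divided by print_every. If the number of batches per epoch is not a multiple of print_every, the first log after an epoch boundary averages fewer batches and shows a dip. The sketch below is a pure-Python simulation with a constant per-batch loss. The function name and batches_per_epoch=100 are assumptions for illustration (the question does not give the loader length), and this may not be the only cause.

```python
def logged_losses(batches_per_epoch, epochs, print_every, per_batch_loss=1.0):
    """Replay the logging logic of the question with a constant loss per batch."""
    steps = 0
    logged = []
    for _ in range(epochs):
        running_loss = 0.0  # reset at every epoch start, as in the question
        for _ in range(batches_per_epoch):
            steps += 1      # never reset, so it runs across epoch boundaries
            running_loss += per_batch_loss
            if steps % print_every == 0:
                # Always divided by print_every, even if fewer batches were summed
                logged.append(running_loss / print_every)
                running_loss = 0.0
    return logged


print(logged_losses(batches_per_epoch=100, epochs=3, print_every=40))
# [1.0, 1.0, 0.5, 1.0, 1.0, 1.0, 1.0]
```

Even though the true loss is constant here, the third logged value drops to 0.5 because it only covers the 20 batches since the epoch started. Dividing by the number of batches actually accumulated, or not resetting running_loss at the epoch start, would remove this particular artifact.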

Related

Pytorch1.6 What is the actual learning rate during training?

I'd like to know the actual learning rate during training, here is my code.
learning_rate = 0.001
optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[1, 2], gamma=0.1)
def train(epoch):
    train_loss = 0
    for batch_idx, (input, target) in enumerate(train_loader):
        predict_label = net(input)
        loss = criterion(predict_label, target)
        train_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(optimizer.param_groups[0]['lr'])
    scheduler.step()
    print(scheduler.state_dict()['_last_lr'])
    print(optimizer.param_groups[0]['lr'])
The output is 0.001, 0.0001, 0.0001. So what is the actual lr during optimizer.step(): 0.001 or 0.0001? Thanks.
The important part is here:
for batch_idx, (input, target) in enumerate(train_loader):
    ...
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(optimizer.param_groups[0]['lr'])  #### CURRENT LEARNING RATE
scheduler.step()  # step the learning rate schedule
print(scheduler.state_dict()['_last_lr'])  #### NEW LEARNING RATE
print(optimizer.param_groups[0]['lr'])  #### NEW LEARNING RATE
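Because you step the scheduler after the epoch, every optimizer.step() in the first epoch uses the initial value of 0.001. The 0.0001 printed after scheduler.step() is the rate that the next epoch will use. If you run for multiple epochs, the rate continues to be annealed at each milestone.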
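To make the timing concrete, here is a small pure-Python sketch of the rule a MultiStepLR-style schedule follows: the base rate is multiplied by gamma once for every milestone already reached. The helper multistep_lr is hypothetical, written only for illustration, and is not PyTorch's implementation; the numbers are the ones from the question.

```python
from bisect import bisect_right


def multistep_lr(base_lr, milestones, gamma, epoch):
    """Learning rate in effect during `epoch` (0-based) under a MultiStepLR-style rule."""
    reached = bisect_right(sorted(milestones), epoch)  # milestones <= epoch
    return base_lr * gamma ** reached


for epoch in range(4):
    print(epoch, multistep_lr(0.001, [1, 2], 0.1, epoch))
```

Under this rule epoch 0 trains at 0.001, epoch 1 at roughly 0.0001, and every epoch from 2 onward at roughly 0.00001.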

Fine-Tuning DistilBertForSequenceClassification: Is not learning, why is loss not changing? Weights not updated?

I am relatively new to PyTorch and Huggingface transformers and experimented with DistilBertForSequenceClassification on this Kaggle dataset.
import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
import torch.optim as optim
import torch.nn as nn
from transformers import get_linear_schedule_with_warmup
n_epochs = 5 # or whatever
batch_size = 32 # or whatever
bert_distil = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
#bert_distil.classifier = nn.Sequential(nn.Linear(in_features=768, out_features=1), nn.Sigmoid())
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(bert_distil.parameters(), lr=0.1)
X_train = []
Y_train = []
for row in train_df.iterrows():
    seq = tokenizer.encode(preprocess_text(row[1]['text']), add_special_tokens=True, pad_to_max_length=True)
    X_train.append(torch.tensor(seq).unsqueeze(0))
    Y_train.append(torch.tensor([row[1]['target']]).unsqueeze(0))
X_train = torch.cat(X_train)
Y_train = torch.cat(Y_train)
running_loss = 0.0
bert_distil.cuda()
bert_distil.train(True)
for epoch in range(n_epochs):
    permutation = torch.randperm(len(X_train))
    j = 0
    for i in range(0, len(X_train), batch_size):
        optimizer.zero_grad()
        indices = permutation[i:i+batch_size]
        batch_x, batch_y = X_train[indices], Y_train[indices]
        batch_x.cuda()
        batch_y.cuda()
        outputs = bert_distil.forward(batch_x.cuda())
        loss = criterion(outputs[0], batch_y.squeeze().cuda())
        loss.requires_grad = True
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        j += 1
        if j == 20:
            #print(outputs[0])
            print('[%d, %5d] running loss: %.3f loss: %.3f ' %
                  (epoch + 1, i*1, running_loss / 20, loss.item()))
            running_loss = 0.0
            j = 0
[1, 608] running loss: 0.689 loss: 0.687
[1, 1248] running loss: 0.693 loss: 0.694
[1, 1888] running loss: 0.693 loss: 0.683
[1, 2528] running loss: 0.689 loss: 0.701
[1, 3168] running loss: 0.690 loss: 0.684
[1, 3808] running loss: 0.689 loss: 0.688
[1, 4448] running loss: 0.689 loss: 0.692 etc...
Regardless of what I tried, the loss never decreased (or even increased), nor did the predictions get better. It seems to me that I forgot something, so that the weights are not actually updated. Does someone have an idea?
What I tried:
Different loss functions:
BCE
CrossEntropy
even MSE loss
One-hot encoding vs. a single-neuron output
Different learning rates and optimizers
I even changed all the targets to one single label, but even then the network didn't converge.
Looking at the running loss and the minibatch loss is easily misleading. You should look at the epoch loss, because every epoch's loss is computed over the same inputs.
Besides, there are some problems in your code. After fixing all of them, the behavior is as expected: the loss slowly decreases after each epoch, and the model can also overfit a small minibatch. Please look at the code; the changes include using model(x) instead of model.forward(x), calling cuda() only once, a smaller learning rate, etc.
Tuning and fine-tuning ML models is difficult work.
n_epochs = 5
batch_size = 1
bert_distil = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(bert_distil.parameters(), lr=1e-3)
X_train = []
Y_train = []
for row in train_df.iterrows():
    seq = tokenizer.encode(row[1]['text'], add_special_tokens=True, pad_to_max_length=True)[:100]
    X_train.append(torch.tensor(seq).unsqueeze(0))
    Y_train.append(torch.tensor([row[1]['target']]))
X_train = torch.cat(X_train)
Y_train = torch.cat(Y_train)
running_loss = 0.0
bert_distil.cuda()
bert_distil.train(True)
for epoch in range(n_epochs):
    permutation = torch.randperm(len(X_train))
    for i in range(0, len(X_train), batch_size):
        optimizer.zero_grad()
        indices = permutation[i:i+batch_size]
        batch_x, batch_y = X_train[indices].cuda(), Y_train[indices].cuda()
        outputs = bert_distil(batch_x)
        loss = criterion(outputs[0], batch_y)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print('[%d] epoch loss: %.3f' %
          (epoch + 1, running_loss / len(X_train) * batch_size))
    running_loss = 0.0
Output:
[1] epoch loss: 0.695
[2] epoch loss: 0.690
[3] epoch loss: 0.687
[4] epoch loss: 0.685
[5] epoch loss: 0.684
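A quick sanity check on the numbers in this thread, assuming the task has two classes (the default num_labels for DistilBertForSequenceClassification, and consistent with the commented-out single-output head in the question): a classifier that assigns equal probability to every class has a cross-entropy of ln(2). The helper name below is made up for illustration.

```python
import math


def uniform_cross_entropy(num_classes):
    """Cross-entropy of a prediction that gives every class the same probability."""
    return -math.log(1.0 / num_classes)


chance_loss = uniform_cross_entropy(2)
print(round(chance_loss, 3))  # 0.693
```

Losses that hover around 0.69 therefore suggest that the model is still predicting at chance level, and a slow drift from 0.695 to 0.684 is only a small step away from it.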
I would highlight two possible reasons for your "stable" results:
I agree that the learning rate is surely too high, which prevents the model from making any significant updates.
More importantly, according to state-of-the-art papers, fine-tuning has only a marginal effect on the core NLP abilities of Transformers. For example, the paper says that fine-tuning applies only very small weight changes. Citing it: "Finetuning barely affects accuracy on NEL, COREF and REL indicating that those tasks are already sufficiently covered by pre-training". Several papers suggest that fine-tuning for classification tasks is basically a waste of time. Thus, considering that DistilBERT is a student model of BERT, maybe you won't get better results. Try pre-training with your data first. Generally, pre-training has a more significant impact.
I had a similar problem when I tried to use xxxForSequenceClassification to fine-tune on my downstream task.
In the end, I changed xxxForSequenceClassification to xxxModel and added Dropout - FC - Softmax. Magically, that solved it: the loss decreased as expected.
I'm still trying to find out why.
Hope it helps.
FYI, transformers version: 3.5.0
Maybe the poor performance is due to gradients being applied to the BERT backbone. Validate it like so:
print([p.requires_grad for p in bert_distil.distilbert.parameters()])
As an alternative solution, try freezing the weights of your trained model:
for param in bert_distil.distilbert.parameters():
    param.requires_grad = False
As you are trying to optimize the weights of a trained model while fine-tuning on your data, you face the issues described, among other sources, in the ULMFiT paper (https://arxiv.org/abs/1801.06146).

CNN Training Runs without error, but it does not display the results

I am new to PyTorch, and I am trying to train my model (a CNN) using the following code.
The program runs fine, but it does not display this Epoch/Step/Loss/Accuracy part:
print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Accuracy: {:.2f}%'
It is as if the condition (i+1) % 100 == 0 were never true.
Training part:
iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(dataloaders['train']):
        images = Variable(images)
        labels = Variable(labels)
        # Clear the gradients
        optimizer.zero_grad()
        # Forward propagation
        outputs = model(images)
        # Calculating loss with softmax to obtain cross entropy loss
        loss = criterion(outputs, labels)
        # Backward prop
        loss.backward()
        # Updating gradients
        optimizer.step()
        iter += 1
        # Total number of labels
        total = labels.size(0)
        # Obtaining predictions from max value
        _, predicted = torch.max(outputs.data, 1)
        # Calculate the number of correct answers
        correct = (predicted == labels).sum().item()
        # Print loss and accuracy
        if (i+1) % 100 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Accuracy: {:.2f}%'
                  .format(epoch + 1, num_epochs, i + 1, len(dataloaders['train']), loss.item(),
                          (correct / total) * 100))
Full Code:
https://pastebin.com/dshNmhRL
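One possible explanation, assuming the training loader yields fewer than 100 batches per epoch (the question does not state its length): i restarts at 0 every epoch, so (i+1) never reaches 100 and the print is never triggered. The helper below is a hypothetical pure-Python check of that condition, with made-up batch counts.

```python
def steps_that_print(num_batches, print_every=100):
    """Return the 1-based batch numbers at which (i + 1) % print_every == 0 holds."""
    return [i + 1 for i in range(num_batches) if (i + 1) % print_every == 0]


print(steps_that_print(60))   # []
print(steps_that_print(250))  # [100, 200]
```

If len(dataloaders['train']) turns out to be below 100, printing every few batches, or testing the running iter counter instead of i, would be one way to see output.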

How to add BatchNormalization with SWA (stochastic weight averaging)?

I am a beginner in deep learning and PyTorch.
I don't understand how to use BatchNormalization when using SWA.
pytorch.org says in https://pytorch.org/blog/stochastic-weight-averaging-in-pytorch/:
Note that the SWA averages of the weights are never used to make
predictions during training, and so the batch normalization layers do
not have the activation statistics computed after you reset the
weights of your model with opt.swap_swa_sgd()
Does this mean it is suitable to add a BatchNormalization layer after using SWA?
# it means, in my idea
#for example
opt = torchcontrib.optim.SWA(base_opt)
for i in range(100):
    opt.zero_grad()
    loss_fn(model(input), target).backward()
    opt.step()
    if i > 10 and i % 5 == 0:
        opt.update_swa()
opt.swap_swa_sgd()
#save model once
torch.save(model,"swa_model.pt")
#model_load
saved_model=torch.load("swa_model.pt")
#it means adding BatchNormalization layer??
model2=saved_model
model2.add_module("Batch1",nn.BatchNorm1d(10))
# decay learning_rate more
learning_rate=0.005
optimizer = torch.optim.SGD(model2.parameters(), lr=learning_rate)
# train model again
for epoch in range(num_epochs):
    loss = train(train_loader)
    val_loss, val_acc = valid(test_loader)
I appreciate your reply.
Following your advice,
I tried to build an example model that adds optimizer.bn_update():
# add optimizer.bn_update() to model
criterion = nn.CrossEntropyLoss()
learning_rate=0.01
base_opt = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer = SWA(base_opt, swa_start=10, swa_freq=5, swa_lr=0.05)
def train(train_loader):
    #mode:train
    model.train()
    running_loss = 0
    for batch_idx, (images, labels) in enumerate(train_loader):
        optimizer.zero_grad()
        outputs = model(images)
        #loss
        loss = criterion(outputs, labels)
        running_loss += loss.item()
        loss.backward()
        optimizer.step()
    optimizer.swap_swa_sgd()
    train_loss = running_loss / len(train_loader)
    return train_loss
def valid(test_loader):
    model.eval()
    running_loss = 0
    correct = 0
    total = 0
    #torch.no_grad
    with torch.no_grad():
        for batch_idx, (images, labels) in enumerate(test_loader):
            outputs = model(images)
            loss = criterion(outputs, labels)
            running_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
    val_loss = running_loss / len(test_loader)
    val_acc = float(correct) / total
    return val_loss, val_acc
num_epochs=30
loss_list = []
val_loss_list = []
val_acc_list = []
for epoch in range(num_epochs):
    loss = train(train_loader)
    val_loss, val_acc = valid(test_loader)
    optimizer.bn_update(train_loader, model)
    print('epoch %d, loss: %.4f val_loss: %.4f val_acc: %.4f'
          % (epoch, loss, val_loss, val_acc))
    # logging
    loss_list.append(loss)
    val_loss_list.append(val_loss)
    val_acc_list.append(val_acc)
# optimizer.bn_update()
optimizer.bn_update(train_loader, model)
# go on evaluating model,,,
What the documentation is telling you is that SWA computes averages of the weights, but those averaged weights are not used for prediction during training, so the batch normalization layers never see them. This means the layers have not computed activation statistics for the averaged weights (they never had the chance), which matters because the averaged weights are the ones used for actual prediction (i.e. not during training).
This means they assume you already have batch normalization layers in your model and want to train it using SWA. This is not entirely straightforward, for the reasons above.
One approach is given as follows:
To compute the activation statistics you can just make a forward pass on your training data using the SWA model once the training is finished.
Or you can use their helper class:
In the SWA class we provide a helper function opt.bn_update(train_loader, model). It updates the activation statistics for every batch normalization layer in the model by making a forward pass on the train_loader data loader. You only need to call this function once in the end of training.
If you are using PyTorch's DataLoader class, you can simply pass the model (after training) and the training loader to the bn_update function, which updates all batch normalization statistics for you. You only need to call this function once, at the end of training.
Steps to proceed:
Train your model that includes batch normalization layers using SWA
After your model has finished training, call opt.bn_update(train_loader, model) using your training data and providing your trained model
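Why the extra forward pass is needed can be seen in a toy pure-Python illustration. This is not torchcontrib's bn_update, only a sketch of the idea: the statistic a normalization layer records depends on the weights that produced the activations, so a statistic recorded under the SGD weights no longer matches once the averaged weights are swapped in. All names and numbers are made up.

```python
from statistics import fmean


def activation_mean(weight, inputs):
    """Mean activation of a toy one-weight linear layer, y = weight * x."""
    return fmean(weight * x for x in inputs)


inputs = [1.0, 2.0, 3.0, 4.0]      # stand-in for the training data
sgd_weights = [2.0, 4.0]           # hypothetical weights visited by SGD
swa_weight = fmean(sgd_weights)    # averaged weight, in the spirit of SWA

# Statistic recorded while training with the last SGD weight
stale_mean = activation_mean(sgd_weights[-1], inputs)
# Statistic recomputed by a forward pass with the averaged weight
fresh_mean = activation_mean(swa_weight, inputs)

print(stale_mean, fresh_mean)
```

The two values differ, which is the mismatch that a single pass over the training data with the averaged weights is meant to remove.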
I tried to compare the results before and after calling optimizer.bn_update() on MNIST data,
as follows:
# using Mnist Data
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
# to compare Test Data Accuracy
X_train_=X_train[0:50000]
y_train_=y_train[0:50000]
# to validate for test data
X_train_ToCompare=X_train[50000:60000]
y_train_ToCompare=y_train[50000:60000]
print(X_train_.shape)
print(y_train_.shape)
print(X_train_ToCompare.shape)
print(y_train_ToCompare.shape)
#(50000, 784)
#(50000,)
#(10000, 784)
#(10000,)
# like keras,simple MLP model
from torch import nn
model = nn.Sequential()
model.add_module('fc1', nn.Linear(784, 1000))
model.add_module('relu1', nn.ReLU())
model.add_module('fc2', nn.Linear(1000, 1000))
model.add_module('relu2', nn.ReLU())
model.add_module('fc3', nn.Linear(1000, 10))
print(model)
# using GPU
model.cuda()
criterion = nn.CrossEntropyLoss()
learning_rate=0.01
base_opt = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer = SWA(base_opt, swa_start=10, swa_freq=5, swa_lr=0.05)
def train(train_loader):
    model.train()
    running_loss = 0
    for batch_idx, (images, labels) in enumerate(train_loader):
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        running_loss += loss.item()
        loss.backward()
        optimizer.step()
    optimizer.swap_swa_sgd()
    train_loss = running_loss / len(train_loader)
    return train_loss
def valid(test_loader):
    model.eval()
    running_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_idx, (images, labels) in enumerate(test_loader):
            outputs = model(images)
            loss = criterion(outputs, labels)
            running_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
    val_loss = running_loss / len(test_loader)
    val_acc = float(correct) / total
    return val_loss, val_acc
num_epochs=30
loss_list = []
val_loss_list = []
val_acc_list = []
for epoch in range(num_epochs):
    loss = train(train_loader)
    val_loss, val_acc = valid(test_loader)
    optimizer.bn_update(train_loader, model)
    print('epoch %d, loss: %.4f val_loss: %.4f val_acc: %.4f'
          % (epoch, loss, val_loss, val_acc))
    # logging
    loss_list.append(loss)
    val_loss_list.append(val_loss)
    val_acc_list.append(val_acc)
# output:
# epoch 0, loss: 0.7832 val_loss: 0.5381 val_acc: 0.8866
# ...
# epoch 29, loss: 0.0502 val_loss: 0.0758 val_acc: 0.9772
#evaluate model
# attempt to evaluate model before optimizer.bn_update()
# using X_train_toCompare for test data
model.eval()
predicted_list=[]
for i in range(len(X_train_ToCompare)):
    temp_predicted=model(torch.cuda.FloatTensor(X_train_ToCompare[i]))
    _,y_predicted=torch.max(temp_predicted,0)
    predicted_list.append(int(y_predicted))
sum(predicted_list==y_train_ToCompare)
# test accuracy 9757/10000
#after using:optimizer.bn_update
model.train()
optimizer.bn_update(train_loader, model)
# evaluate model
model.eval()
predicted_list_afterBatchNorm=[]
for i in range(len(X_train_ToCompare)):
    temp_predicted=model(torch.cuda.FloatTensor(X_train_ToCompare[i]))
    _,y_predicted=torch.max(temp_predicted,0)
    predicted_list_afterBatchNorm.append(int(y_predicted))
sum(predicted_list_afterBatchNorm==y_train_ToCompare)
# test accuracy 9778/10000
I don't know if this is the correct way to validate...
Using the optimizer.bn_update() method, I confirmed that test accuracy often improves.
However, in some runs test accuracy decreased; I think this is because the MLP model is
simple and the number of training epochs is not enough.
More testing is needed.
Thank you for your reply.

Adding second hidden layer in Tensorflow breaks loss calculation

I am working on assignment three of the Udacity Deep Learning course. I have a working neural network with one hidden layer. However, when I add a second one, the loss becomes nan.
This is the graph code:
num_nodes_layer_1 = 1024
num_nodes_layer_2 = 128
num_inputs = 28 * 28
num_labels = 10
batch_size = 128
graph = tf.Graph()
with graph.as_default():
    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, num_inputs))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    # variables
    # hidden layer 1
    hidden_weights_1 = tf.Variable(tf.truncated_normal([num_inputs, num_nodes_layer_1]))
    hidden_biases_1 = tf.Variable(tf.zeros([num_nodes_layer_1]))
    # hidden layer 2
    hidden_weights_2 = tf.Variable(tf.truncated_normal([num_nodes_layer_1, num_nodes_layer_2]))
    hidden_biases_2 = tf.Variable(tf.zeros([num_nodes_layer_2]))
    # linear layer
    weights = tf.Variable(tf.truncated_normal([num_nodes_layer_2, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))
    # Training computation.
    y1 = tf.nn.relu(tf.matmul(tf_train_dataset, hidden_weights_1) + hidden_biases_1)
    y2 = tf.nn.relu(tf.matmul(y1, hidden_weights_2) + hidden_biases_2)
    logits = tf.matmul(y2, weights) + biases
    # Calc loss
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits_v2(labels=tf_train_labels, logits=logits))
    # Optimizer.
    # We are going to find the minimum of this loss using gradient descent.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
    # Predictions for the training, validation, and test data.
    # These are not part of training, but merely here so that we can report
    # accuracy figures as we train.
    train_prediction = tf.nn.softmax(logits)
    y1_valid = tf.nn.relu(tf.matmul(tf_valid_dataset, hidden_weights_1) + hidden_biases_1)
    y2_valid = tf.nn.relu(tf.matmul(y1_valid, hidden_weights_2) + hidden_biases_2)
    valid_prediction = tf.nn.softmax(tf.matmul(y2_valid, weights) + biases)
    y1_test = tf.nn.relu(tf.matmul(tf_test_dataset, hidden_weights_1) + hidden_biases_1)
    y2_test = tf.nn.relu(tf.matmul(y1_test, hidden_weights_2) + hidden_biases_2)
    test_prediction = tf.nn.softmax(tf.matmul(y2_test, weights) + biases)
It does not raise an error, but after the first step the loss is printed as nan and the network doesn't learn.
Initialized
Minibatch loss at step 0: 2133.468750
Minibatch accuracy: 8.6%
Validation accuracy: 10.0%
Minibatch loss at step 400: nan
Minibatch accuracy: 9.4%
Validation accuracy: 10.0%
Minibatch loss at step 800: nan
Minibatch accuracy: 11.7%
Validation accuracy: 10.0%
Minibatch loss at step 1200: nan
Minibatch accuracy: 4.7%
Validation accuracy: 10.0%
Minibatch loss at step 1600: nan
Minibatch accuracy: 7.8%
Validation accuracy: 10.0%
Minibatch loss at step 2000: nan
Minibatch accuracy: 6.2%
Validation accuracy: 10.0%
Test accuracy: 10.0%
When I remove the second layer it trains and I get an accuracy of about 85%. With a second layer I would expect the score to be between 80% and 90%.
Am I using the wrong optimizer? Is it just something stupid I missed?
This is the session code:
num_steps = 2001
with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {
            tf_train_dataset : batch_data,
            tf_train_labels : batch_labels,
        }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 400 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels))
    acc = accuracy(test_prediction.eval(), test_labels)
    print("Test accuracy: %.1f%%" % acc)
Your learning rate of 0.5 is too high; set it to 0.05 and it will converge.
Minibatch loss at step 0: 1506.469238
Minibatch loss at step 400: 7796.088867
Minibatch loss at step 800: 9893.363281
Minibatch loss at step 1200: 5089.553711
Minibatch loss at step 1600: 6148.481445
Minibatch loss at step 2000: 5257.598145
Minibatch loss at step 2400: 1716.116455
Minibatch loss at step 2800: 1600.826538
Minibatch loss at step 3200: 941.884766
Minibatch loss at step 3600: 1033.936768
Minibatch loss at step 4000: 1808.775757
Minibatch loss at step 4400: 113.909866
Minibatch loss at step 4800: 49.800560
Minibatch loss at step 5200: 20.392700
Minibatch loss at step 5600: 6.253595
Minibatch loss at step 6000: 4.372780
Minibatch loss at step 6400: 6.862935
Minibatch loss at step 6800: 6.951239
Minibatch loss at step 7200: 3.528607
Minibatch loss at step 7600: 2.968611
Minibatch loss at step 8000: 3.164592
...
Minibatch loss at step 19200: 2.141401
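The effect of the step size can be reproduced on a toy problem. The sketch below is plain Python, not the network from the question: it runs gradient descent on f(w) = 3 * w**2, a function chosen only so that 0.5 overshoots and 0.05 does not. In the real network the same kind of overshoot, combined with a very large initial loss, can push values to overflow and show up as nan.

```python
def gradient_descent(learning_rate, steps=50, w=1.0):
    """Minimise f(w) = 3 * w**2 (gradient 6 * w) with plain gradient descent."""
    for _ in range(steps):
        w -= learning_rate * 6.0 * w
    return w


diverged = gradient_descent(0.5)    # each step multiplies w by -2
converged = gradient_descent(0.05)  # each step multiplies w by 0.7

print(abs(diverged) > 1e6, abs(converged) < 1e-3)  # True True
```

With the larger rate every update overshoots the minimum and the iterate grows without bound, while the smaller rate shrinks it toward zero.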
Also a couple of pointers:
tf_train_dataset and tf_train_labels should be tf.placeholders with None as the first dimension (for example shape [None, 784] for the inputs and [None, 10] for the labels). The None dimension allows you to vary the batch size during training, instead of being limited to a fixed size such as 128.
Instead of using tf_valid_dataset and tf_test_dataset as tf.constant, just pass your validation and test datasets in the respective feed_dicts. This will allow you to get rid of the extra ops at the end of your graph for validation and test accuracy.
I'd recommend sampling a separate batch of validation and test data rather than using the same batch of data for each iteration of checking the val/test accuracy.

Resources