PyTorch: Speed up data loading

I am using densenet121 to do cat/dog detection on a Kaggle dataset. I enabled cuda and it appears that training is very fast. However, the data loading (or perhaps processing) appears to be very slow. Are there some ways to speed it up? I tried to play with the batch size, but that didn't provide much help. I also changed num_workers from 0 to some positive numbers. Going from 0 to 2 reduces loading time by perhaps 1/3; increasing it further has no additional effect. Are there other ways I can speed up data loading?
This is my rough code (I am focused on learning, so it's not very organized):
import matplotlib.pyplot as plt
import torch
from torch import nn
from torch import optim
import torch.nn.functional as F
from torchvision import datasets, transforms, models
data_dir = 'Cat_Dog_data'
train_transforms = transforms.Compose([transforms.RandomRotation(30),
                                       transforms.RandomResizedCrop(224),
                                       transforms.RandomHorizontalFlip(),
                                       transforms.ToTensor(),
                                       transforms.Normalize([0.5, 0.5, 0.5],
                                                            [0.5, 0.5, 0.5])])
test_transforms = transforms.Compose([transforms.Resize(255),
                                      transforms.CenterCrop(224),
                                      transforms.ToTensor()])
# Pass transforms in here, then run the next cell to see how the transforms look
train_data = datasets.ImageFolder(data_dir + '/train',
                                  transform=train_transforms)
test_data = datasets.ImageFolder(data_dir + '/test', transform=test_transforms)
trainloader = torch.utils.data.DataLoader(train_data, batch_size=64,
                                          num_workers=16, shuffle=True,
                                          pin_memory=True)
testloader = torch.utils.data.DataLoader(test_data, batch_size=64,
                                         num_workers=16)
model = models.densenet121(pretrained=True)
# Freeze parameters so we don't backprop through them
for param in model.parameters():
    param.requires_grad = False
from collections import OrderedDict
classifier = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(1024, 500)),
    ('relu', nn.ReLU()),
    ('fc2', nn.Linear(500, 2)),
    ('output', nn.LogSoftmax(dim=1))
]))
model.classifier = classifier
model.cuda()
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)
epochs = 30
steps = 0
import time
device = torch.device('cuda:0')
train_losses, test_losses = [], []
for e in range(epochs):
    running_loss = 0
    count = 0
    total_start = time.time()

    for images, labels in trainloader:
        start = time.time()
        images = images.cuda()
        labels = labels.cuda()

        optimizer.zero_grad()
        log_ps = model(images)
        loss = criterion(log_ps, labels)
        loss.backward()
        optimizer.step()

        elapsed = time.time() - start
        if count % 20 == 0:
            print("Optimized elapsed: ", elapsed, "count:", count)
            print("Total elapsed ", time.time() - total_start)
            total_start = time.time()
        count += 1
        running_loss += loss.item()
    else:
        test_loss = 0
        accuracy = 0
        for images, labels in testloader:
            images = images.cuda()
            labels = labels.cuda()
            with torch.no_grad():
                model.eval()
                log_ps = model(images)
                test_loss += criterion(log_ps, labels)
                ps = torch.exp(log_ps)
                top_p, top_class = ps.topk(1, dim=1)
                compare = top_class == labels.view(*top_class.shape)
                accuracy += compare.type(torch.FloatTensor).mean()
        model.train()
        train_losses.append(running_loss / len(trainloader))
        test_losses.append(test_loss / len(testloader))
        print("Epoch: {}/{}.. ".format(e + 1, epochs),
              "Training Loss: {:.3f}.. ".format(running_loss / len(trainloader)),
              "Test Loss: {:.3f}.. ".format(test_loss / len(testloader)),
              "Test Accuracy: {:.3f}".format(accuracy / len(testloader)))

torchvision 0.8.0 or greater
Actually, torchvision now supports batched and GPU transformations (these operate on torch.Tensors instead of PIL images), so one should use that as an initial improvement.
See here for more info about this release. Also, those transforms act as torch.nn.Module, hence they can be used inside a model, for example:
import torch
from torchvision import transforms as T

transforms = torch.nn.Sequential(
    T.RandomCrop(224),
    T.RandomHorizontalFlip(p=0.3),
    T.ConvertImageDtype(torch.float),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
)
Furthermore, those operations could be JIT-compiled, possibly improving the performance even further.
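For instance, the module above can be scripted and applied to a whole uint8 batch that already lives on the GPU (a minimal sketch; the random tensor stands in for images decoded with torchvision.io, and the speedup will depend on your pipeline):
# `transforms` is the torch.nn.Sequential defined above
scripted_transforms = torch.jit.script(transforms)

# stand-in for a batch of decoded uint8 images already moved to the GPU
images = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8, device='cuda')
augmented = scripted_transforms(images)  # augmentation runs batched, on the GPU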
torchvision < 0.8.0 (original answer)
Increasing batch_size won't help, as torchvision applies the transform to a single image at a time while it is loaded from your disk.
There are a couple of ways one could speed up data loading, with increasing levels of difficulty:
Improve image loading times
Load & normalize images and cache in RAM (or on disk)
Produce transformations and save them to disk
Apply non-cacheable transforms (rotations, flips, crops) in a batched manner
Prefetching
1. Improve image loading
Easy improvements can be gained by installing Pillow-SIMD instead of the original pillow. It is a drop-in replacement and could be faster (or so it is claimed, at least for Resize, which you are using).
Alternatively, you could create your own data loading and processing with OpenCV (some say it's faster) or check albumentations (though I can't tell you whether those will improve performance; it might be a lot of time spent for no gain except the learning experience).
2. Load & normalize images & cache
You can use Python's LRU Cache functionality to cache some outputs.
You can also use torchdata, which acts almost exactly like PyTorch's torch.utils.data.Dataset but allows caching to disk or in RAM (or mixed modes) with a simple cache() call on torchdata.Dataset (see the github repository, disclaimer: I'm the author).
Remember: you have to load and normalize the images, cache them, and only after that use RandomRotation, RandomResizedCrop and RandomHorizontalFlip (as those change each time they are run).
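For illustration, here is a minimal sketch of that idea using functools.lru_cache (it assumes the decoded images fit in RAM and reuses the Cat_Dog_data layout from the question; note that with num_workers > 0 each worker process keeps its own copy of the cache):
import functools
from torch.utils.data import Dataset
from torchvision import datasets, transforms

deterministic = transforms.Resize(256)                    # expensive decode/resize result gets cached
augment = transforms.Compose([transforms.RandomRotation(30),
                              transforms.RandomResizedCrop(224),
                              transforms.RandomHorizontalFlip(),
                              transforms.ToTensor(),
                              transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])])

class CachedCatDog(Dataset):
    def __init__(self, root):
        self.folder = datasets.ImageFolder(root)          # no transform here

    @functools.lru_cache(maxsize=None)                    # cache decoded + resized images in RAM
    def _load(self, index):
        image, label = self.folder[index]
        return deterministic(image), label

    def __getitem__(self, index):
        image, label = self._load(index)
        return augment(image), label                      # random transforms re-run every epoch

    def __len__(self):
        return len(self.folder)

train_data = CachedCatDog('Cat_Dog_data/train')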
3. Produce transformations and save them to disk
You would have to perform a lot of transformations on images, save them to disk and use this enhanced dataset afterwards. Once again that could be done with torchdata, but it's really wasteful when it comes to I/O and hard drive space, and a very inelegant solution. Furthermore it's "static", so the data would only last you for X epochs; it wouldn't be an "infinite" generator with augmentations.
4. Batched transformations
torchvision does not support it, so you would have to write those functions on your own. See this issue for justification. AFAIK no other 3rd party library provides it either. For large batches it should speed things up, but the implementation is an open question I think (correct me if I'm wrong).
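As an illustration of what such a hand-rolled batched transform might look like (a sketch, not an existing torchvision function):
import torch

def batched_random_hflip(images, p=0.5):
    """Randomly flip each image of a (N, C, H, W) batch along the width axis."""
    flip = torch.rand(images.shape[0], device=images.device) < p
    out = images.clone()
    out[flip] = out[flip].flip(-1)
    return out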
5. Prefetch
IMO this would be the hardest to implement (though it's a really good project idea, come to think of it). Basically, you load data for the next iteration while your model trains. torch.utils.data.DataLoader does provide it, though there are some concerns (like workers pausing after their data has been loaded). You can read the PyTorch thread about it (not sure about it, as I didn't verify it on my own). Also, a lot of valuable insight is provided by this comment and this blog post (though I'm not sure how up to date those are).
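For reference, newer PyTorch releases expose prefetching knobs directly on DataLoader (a sketch; prefetch_factor and persistent_workers do not exist in older versions, and the best values are workload-dependent):
from torch.utils.data import DataLoader

trainloader = DataLoader(train_data, batch_size=64, shuffle=True,
                         num_workers=4, pin_memory=True,
                         prefetch_factor=4,          # batches each worker prepares in advance
                         persistent_workers=True)    # keep workers alive between epochs

for images, labels in trainloader:
    # with pin_memory=True, non_blocking=True lets the host-to-GPU copy overlap with compute
    images = images.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    ...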
All in all, to substantially improve data loading you would need to get your hands quite dirty (or maybe there are libraries doing some of this for PyTorch; if so, I would love to know about them).
Also remember to profile your changes, see torch.utils.bottleneck.
EDIT: The DALI project might be worth checking out, though AFAIK it has some problems with RAM usage growing linearly with the number of epochs.

Related

How to handle objective convergence, early stopping and learning rate adjustments when using partial_fit method of Scikit SGDClassifier?

I am using the partial_fit function from SGDClassifier with log loss to do online learning, as I have a large dataset that cannot fit in memory, as follows:
cls = SGDClassifier(loss='log', learning_rate='adaptive', eta0=0.1, penalty='l2', alpha=0.0001)
for batch in training_generator:
    cls.partial_fit(batch)

predictions = []
for batch in test_data:
    probs = cls.predict_proba(batch)
    predictions += list(probs)
In the documentation of the partial_fit function it is stated:
Internally, this method uses max_iter = 1. Therefore, it is not guaranteed that a minimum of the cost function is reached after calling it once. Matters such as objective convergence, early stopping, and learning rate adjustments should be handled by the user.
Questions:
Does max_iter = 1 mean I would need to loop over partial_fit myself, as many times as needed, for each batch of data, as follows?
for batch in training_generator:
    for _ in range(num_of_iteration):
        cls.partial_fit(batch)
Does that statement in the documentation mean I would need to compute the log_loss (learning curve) myself for the validation data in each training iteration and decide when to stop the training? For example, the code below:
for batch in training_generator:
    cls.partial_fit(batch)

    predictions = []
    for batch in training_generator:
        probs = cls.predict_proba(batch)
        predictions += list(probs)
    training_loss = log_loss(y_true, predictions)

    predictions = []
    for batch in validation_generator:
        probs = cls.predict_proba(batch)
        predictions += list(probs)
    val_loss = log_loss(y_true, predictions)

    # Pseudocode
    # If val_loss does not decrease after n iterations by some value, then stop training
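For what it's worth, here is a minimal sketch of that pattern (the generators come from the question; patience, the classes array, and the loop structure are illustrative assumptions, not part of scikit-learn's API):
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

cls = SGDClassifier(loss='log', learning_rate='adaptive', eta0=0.1, penalty='l2', alpha=0.0001)
classes = np.array([0, 1])                         # partial_fit needs all classes on the first call
best_val_loss, patience, bad_rounds = np.inf, 3, 0

for X_batch, y_batch in training_generator:        # user-supplied batch generator
    cls.partial_fit(X_batch, y_batch, classes=classes)

    val_true, val_probs = [], []
    for X_val, y_val in validation_generator:      # user-supplied validation batches
        val_probs.append(cls.predict_proba(X_val))
        val_true.append(y_val)
    val_loss = log_loss(np.concatenate(val_true), np.vstack(val_probs))

    if val_loss < best_val_loss:
        best_val_loss, bad_rounds = val_loss, 0
    else:
        bad_rounds += 1
        if bad_rounds >= patience:                 # stop after `patience` rounds without improvement
            break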
If I have large validation and training datasets, can I use representative subsets of the validation and training data (i.e. with the same distribution of classes as the full dataset) to compute the loss?
Assuming the validation loss keeps decreasing and all the data in the training_generator has been consumed: should I shuffle the training_generator data and run the training loop again?
# Pseudocode
while True:
    Run training loop
    If val_loss does not decrease after n iterations by some value, then stop training (break while loop)
Finish training loop
The documentation says that the learning_rate adjustment should also be done by the user. Does that mean the learning_rate='adaptive' argument to the SGDClassifier has no effect when using partial_fit? If yes, how can the learning_rate be adjusted?

Weighted random sampler - oversample or undersample?

Problem
I am training a deep learning model in PyTorch for binary classification, and I have a dataset containing unbalanced class proportions. My minority class makes up about 10% of the given observations. To avoid the model learning to just predict the majority class, I want to use the WeightedRandomSampler from torch.utils.data in my DataLoader.
Let's say I have 1000 observations (900 in class 0, 100 in class 1), and a batch size of 100 for my dataloader.
Without weighted random sampling, I would expect each training epoch to consist of 10 batches.
Questions
Will only 10 batches be sampled per epoch when using this sampler - and consequently, would the model 'miss' a large portion of the majority class during each epoch, since the minority class is now overrepresented in the training batches?
Will using the sampler result in more than 10 batches being sampled per epoch (meaning the same minority class observations may appear many times, and also that training would slow down)?
A small snippet of code to use WeightedRandomSampler
First, define the function:
def make_weights_for_balanced_classes(images, nclasses):
    n_images = len(images)
    count_per_class = [0] * nclasses
    for _, image_class in images:
        count_per_class[image_class] += 1
    weight_per_class = [0.] * nclasses
    for i in range(nclasses):
        weight_per_class[i] = float(n_images) / float(count_per_class[i])
    weights = [0] * n_images
    for idx, (image, image_class) in enumerate(images):
        weights[idx] = weight_per_class[image_class]
    return weights
And after this, use it in the following way:
import torch
dataset_train = datasets.ImageFolder(traindir)

# For an unbalanced dataset we create a weighted sampler
weights = make_weights_for_balanced_classes(dataset_train.imgs, len(dataset_train.classes))
weights = torch.DoubleTensor(weights)
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(weights))

# Note: shuffle must be left off when a sampler is supplied,
# otherwise DataLoader raises an error
train_loader = torch.utils.data.DataLoader(dataset_train, batch_size=args.batch_size,
                                           sampler=sampler, num_workers=args.workers,
                                           pin_memory=True)
It depends on what you're after; check the torch.utils.data.WeightedRandomSampler documentation for details.
There is an argument num_samples which allows you to specify how many samples will actually be drawn when the Dataset is combined with torch.utils.data.DataLoader (assuming you weighted them correctly):
If you set it to len(dataset) you will get the first case
If you set it to 1800 (in your case) you will get the second case
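A small sketch of the two settings (the 900/100 split is taken from the question; dataset stands for your Dataset object):
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# one weight per observation, inversely proportional to its class frequency
targets = torch.cat([torch.zeros(900, dtype=torch.long), torch.ones(100, dtype=torch.long)])
class_counts = torch.bincount(targets)                      # tensor([900, 100])
weights = (1.0 / class_counts.double())[targets]

# first case: 1000 draws per epoch -> still 10 batches of 100
sampler_a = WeightedRandomSampler(weights, num_samples=len(weights))

# second case: 1800 draws per epoch -> 18 batches; minority samples repeat more often
sampler_b = WeightedRandomSampler(weights, num_samples=1800)

loader = DataLoader(dataset, batch_size=100, sampler=sampler_b)  # dataset: your Dataset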
Will only 10 batches be sampled per epoch when using this sampler - and consequently, would the model 'miss' a large portion of the majority class during each epoch [...]
Yes, but new samples will be returned after this epoch passes
Will using the sampler result in more than 10 batches being sampled per epoch (meaning the same minority class observations may appear many times, and also that training would slow down)?
Training would not slow down; each epoch would take longer, but convergence should be approximately the same (as fewer epochs will be necessary due to more data in each).

Trying to accumulate gradients in Pytorch, but getting RuntimeError when calling loss.backward

I'm trying to train a model in Pytorch, and I'd like to have a batch size of 8, but due to memory limitations, I can only have a batch size of at most 4. I've looked all around and read a lot about accumulating gradients, and it seems like the solution to my problem.
However, I seem to have trouble implementing it. Every time I run the code I get RuntimeError: Trying to backward through the graph a second time. I don't understand why since my code looks like all these other examples I've seen (unless I'm just missing something major):
https://stackoverflow.com/a/62076913/1227353
https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/20
One caveat is that the labels for my images are all different sizes, so I can't send the output batch and the label batch into the loss function; I have to iterate over them together. This is what an epoch looks like (it's been pared down for the sake of brevity):
# labels_batch contains labels of different sizes
for batch_idx, (inputs_batch, labels_batch) in enumerate(dataloader):
    outputs_batch = model(inputs_batch)

    # have to do this because labels can't be stacked into a tensor
    for output, label in zip(outputs_batch, labels_batch):
        output_scaled = interpolate(...)  # make output match label size
        loss = train_criterion(output_scaled, label) / (BATCH_SIZE * 2)
        loss.backward()

    if batch_idx % 2 == 1:
        optimizer.step()
        optimizer.zero_grad()
Is there something I'm missing? If I do the following I also get an error:
# labels_batch contains labels of different sizes
for batch_idx, (inputs_batch, labels_batch) in enumerate(dataloader):
    outputs_batch = model(inputs_batch)

    # CHANGE: we're gonna accumulate losses manually
    batch_loss = 0

    # have to do this because labels can't be stacked into a tensor
    for output, label in zip(outputs_batch, labels_batch):
        output_scaled = interpolate(...)  # make output match label size
        loss = train_criterion(output_scaled, label) / (BATCH_SIZE * 2)
        batch_loss += loss  # CHANGE: accumulate!

    # CHANGE: do backprop outside for loop
    batch_loss.backward()

    if batch_idx % 2 == 1:
        optimizer.step()
        optimizer.zero_grad()
The error I get in this case is RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn. This happens when the next epoch starts though... (INCORRECT, SEE EDIT BELOW)
How can I train my model with gradient accumulation? Or am I doomed to train with a batch size of 4 or less?
Oh and as a side question, does the location of where I put loss.backward() affect what I need to normalize the loss by? Or is it always normalized by BATCH_SIZE * 2?
EDIT:
The second code segment was getting an error due to the fact that I was doing torch.set_grad_enabled(phase == 'train') but I had forgotten to wrap the call to batch_loss.backward() with an if phase == 'train'... my bad
So now the second segment of code seems to work and do gradient accumulation, but why doesn't the first bit of code work? It feels equivalent to setting BATCH_SIZE to 1. Furthermore, I'm creating a new loss object each time, so shouldn't the calls to backward() operate on different graphs entirely?
It seems you have two issues here: you said you couldn't have batch_size=8 because of memory limitations, but later state that your labels are not of the same size. The latter seems much more important than the former. Anyway, I will try to answer your questions as best I can.
How can I train my model with gradient accumulation? Or am I doomed to train with a batch size of 4 or less?
You want to call .backward() on every loop cycle otherwise the batch will have no effect on the training. You can then call step() and zero_grad() only when batch_idx % 2 is True (i.e. for every other batch).
Here's an example which accumulates the gradient, not the loss:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

model = nn.Linear(10, 3)
optim = torch.optim.SGD(model.parameters(), lr=0.1)

ds = TensorDataset(torch.rand(100, 10), torch.rand(100, 3))
dl = DataLoader(ds, batch_size=4)

for i, (x, y) in enumerate(dl):
    y_hat = model(x)
    loss = F.l1_loss(y_hat, y) / 2
    loss.backward()

    if i % 2:
        optim.step()
        optim.zero_grad()
Note this approach is different from accumulating the loss and back-propagating only once all batches (or part of the batches) have gone through the network. In the example above we backpropagate every 4 data points and update the model every 8 data points.
Oh and as a side question, does the location of where I put loss.backward() affect what I need to normalize the loss by? Or is it always normalized by BATCH_SIZE * 2?
Usually torch's built-in losses have reduction='mean' set as default. This means the loss gets averaged over all batch elements that contributed to calculating the loss. So this will depend on your loss implementation.
However, if you are using gradient accumulation, then yes, you will need to divide your loss by the number of accumulation steps (here loss = F.l1_loss(y_hat, y) / 2), since your gradients will be accumulated twice.
To read more about this, I recommend taking a look at this other SO post.

Nan loss in keras with triplet loss

I'm trying to learn an embedding for Paris6k images combining VGG and Adrian Ung's triplet loss. The problem is that after a small number of iterations, in the first epoch, the loss becomes NaN, and then the accuracy and validation accuracy grow to 1.
I've already tried lowering the learning rate, increasing the batch size (only to 16 because of memory), changing the optimizer (Adam and RMSprop), checking if there are None values in my dataset, changing the data format from 'float32' to 'float64', adding a little bias to the data, and simplifying the model.
Here is my code:
base_model = VGG16(include_top = False, input_shape = (512, 384, 3))
input_images = base_model.input
input_labels = Input(shape=(1,), name='input_label')
embeddings = Flatten()(base_model.output)
labels_plus_embeddings = concatenate([input_labels, embeddings])
model = Model(inputs=[input_images, input_labels], outputs=labels_plus_embeddings)
batch_size = 16
epochs = 2
embedding_size = 64
opt = Adam(lr=0.0001)
model.compile(loss=tl.triplet_loss_adapted_from_tf, optimizer=opt, metrics=['accuracy'])
label_list = np.vstack(label_list)
x_train = image_list[:2500]
x_val = image_list[2500:]
y_train = label_list[:2500]
y_val = label_list[2500:]
dummy_gt_train = np.zeros((len(x_train), embedding_size + 1))
dummy_gt_val = np.zeros((len(x_val), embedding_size + 1))
H = model.fit(
    x=[x_train, y_train],
    y=dummy_gt_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=([x_val, y_val], dummy_gt_val),
    callbacks=callbacks_list)
There are 3366 images, with values scaled to the range [0, 1].
The network takes dummy ground-truth values because it tries to learn embeddings such that images of the same class have a small distance while images of different classes have large distances; the real class labels are passed in as part of the training input.
I've noticed that when I was previously making an incorrect class division (and keeping images that should have been discarded), I didn't have the NaN loss problem.
What should I try to do?
Thanks in advance and sorry for my english.
In some cases, the random NaN loss can be caused by your data: if there are no positive pairs in your batch, you will get a NaN loss.
As you can see in Adrian Ung's notebook (or in the tensorflow addons triplet loss; it's the same code):
semi_hard_triplet_loss_distance = math_ops.truediv(
    math_ops.reduce_sum(
        math_ops.maximum(
            math_ops.multiply(loss_mat, mask_positives), 0.0)),
    num_positives,
    name='triplet_semihard_loss')
There is a division by the number of positive pairs (num_positives), which can lead to NaN.
I suggest you try to inspect your data pipeline in order to ensure there is at least one positive pair in each of your batches. (You can for example adapt some of the code in the triplet_loss_adapted_from_tf to get the num_positives of your batch, and check if it is greater than 0).
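For instance, a quick NumPy check along those lines (a sketch; batch_labels stands for the label vector that goes into the loss for one batch):
import numpy as np

def count_positive_pairs(batch_labels):
    """Number of (anchor, positive) pairs, i.e. same label at different indices."""
    labels = np.asarray(batch_labels).reshape(-1)
    same = labels[:, None] == labels[None, :]   # pairwise label-equality matrix
    np.fill_diagonal(same, False)               # drop the trivial (i, i) pairs
    return int(same.sum())

# e.g. resample or skip batches where count_positive_pairs(batch_labels) == 0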
Try increasing your batch size. It happened to me also. As mentioned in the previous answer, the network is unable to find any positive pairs (num_positives is zero). I had 250 classes and was getting a NaN loss initially; I increased the batch size to 128/256 and then there was no issue.
I saw that Paris6k has 15 classes or 12 classes. Increase your batch size to 32, and if you run out of GPU memory you can try a model with fewer parameters. You could start with EfficientNet-B0, which has 5.3M parameters compared to VGG16's 138M.
I have implemented a package for triplet generation so that every batch is guaranteed to include positive pairs. It is compatible with TF/Keras only.
https://github.com/ma7555/kerasgen (Disclaimer: I am the owner)

Reporting accuracy and loss issues with MonitoredTrainingSession

I am performing transfer learning on InceptionV3 for a dataset of 5 types of flowers. All layers are frozen except the output layer. My implementation is heavily based on the Cifar10 tutorial from Tensorflow, and the input dataset is formatted in the same way as Cifar10.
I have added a MonitoredTrainingSession (like in the tutorial) to report the accuracy and loss after a certain number of steps. Below is the section of the code for the MonitoredTrainingSession (almost identical to the tutorial):
class _LoggerHook(tf.train.SessionRunHook):

    def begin(self):
        self._step = -1
        self._start_time = time.time()

    def before_run(self, run_context):
        self._step += 1
        return tf.train.SessionRunArgs([loss, accuracy])

    def after_run(self, run_context, run_values):
        if self._step % LOG_FREQUENCY == 0:
            current_time = time.time()
            duration = current_time - self._start_time
            self._start_time = current_time

            loss_value = run_values.results[0]
            acc = run_values.results[1]

            examples_per_sec = LOG_FREQUENCY / duration
            sec_per_batch = duration / LOG_FREQUENCY

            format_str = ('%s: step %d, loss = %.2f, acc = %.2f (%.1f examples/sec; %.3f sec/batch)')
            print(format_str % (datetime.now(), self._step, loss_value, acc,
                                examples_per_sec, sec_per_batch))
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

if MODE == 'train':
    file_writer = tf.summary.FileWriter(LOGDIR, tf.get_default_graph())
    with tf.train.MonitoredTrainingSession(
            save_checkpoint_secs=70,
            checkpoint_dir=LOGDIR,
            hooks=[tf.train.StopAtStepHook(last_step=NUM_EPOCHS*NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN),
                   tf.train.NanTensorHook(loss),
                   _LoggerHook()],
            config=config) as mon_sess:
        original_saver.restore(mon_sess, INCEPTION_V3_CHECKPOINT)
        print("Proceeding to training stage")

        while not mon_sess.should_stop():
            mon_sess.run(train_op, feed_dict={training: True})
            print('acc: %f' % mon_sess.run(accuracy, feed_dict={training: False}))
            print('loss: %f' % mon_sess.run(loss, feed_dict={training: False}))
When the two lines printing the accuracy and loss under mon_sess.run(train_op... are removed, the loss and accuracy printed from after_run, after training for surprisingly only 20 minutes, report that the model is performing very well on the training set and that the loss is decreasing. Even the moving-average loss was reporting great results. It eventually approaches greater than 90% accuracy for multiple random batches.
After the training session had been reporting high accuracy for a while, I stopped the training session, restored the model, and ran it on random batches from the same training set. It performed poorly, achieving only between 50% and 85% accuracy. I confirmed it was restored properly because it did perform better than a model with an untrained output layer.
I then went back to training again from the last checkpoint. The accuracy was initially low, but after about 10 mini-batch runs the accuracy went back above 90%. I then repeated the process, but this time added the two lines for evaluating the loss and accuracy after the training operation. Those two evaluations reported that the model was having issues converging and was performing poorly, while the evaluations via before_run and after_run now only occasionally showed high accuracy and low loss (the results jumped around). But after_run still sometimes reported 100% accuracy (the fact that it is no longer consistent is, I think, because after_run is also being called for mon_sess.run(accuracy...) and mon_sess.run(loss...)).
Why would the results reported from MonitoredTrainingSession be indicating the model is performing well when it really isn't? Aren't the two operations in SessionRunArgs being fed with the same mini batch as train_op, indicating model performance on the batch before gradient update?
Here is the code I used for restoring and testing the model (based on the Cifar10 tutorial):
elif MODE == 'test':
    init = tf.global_variables_initializer()
    ckpt = tf.train.get_checkpoint_state(LOGDIR)
    if ckpt and ckpt.model_checkpoint_path:
        with tf.Session(config=config) as sess:
            init.run()
            saver = tf.train.Saver()
            print(ckpt.model_checkpoint_path)
            saver.restore(sess, ckpt.model_checkpoint_path)
            global_step = tf.contrib.framework.get_or_create_global_step()

            coord = tf.train.Coordinator()
            threads = []
            try:
                for qr in tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS):
                    threads.extend(qr.create_threads(sess, coord=coord, daemon=True, start=True))
                print('model restored')
                i = 0
                num_iter = 4 * NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN / BATCH_SIZE
                print(num_iter)
                while not coord.should_stop() and i < num_iter:
                    print("loss: %.2f," % loss.eval(feed_dict={training: False}), end="")
                    print("acc: %.2f" % accuracy.eval(feed_dict={training: False}))
                    i += 1
            except Exception as e:
                print(e)
                coord.request_stop(e)

            coord.request_stop()
            coord.join(threads, stop_grace_period_secs=10)
Update:
So I was able to fix the issue. However, I am not sure why it worked. In the arg_scope for the Inception model I was passing an is_training boolean placeholder for the batch norm and dropout used by Inception. However, when I removed the placeholder and just set the is_training keyword to True, the accuracy on the training set when the model was restored was extremely high. This was the same model checkpoint that previously performed poorly. When I trained it I always had the is_training placeholder set to True. Having is_training set to True while testing means batch norm is now using the sample mean and variance.
Why would telling Batch Norm to now use the sample average and sample standard deviation like it does during training increase the accuracy?
This would also mean that the dropout layer is dropping units and that the model's accuracy during testing on both the training set and test set is higher with the dropout layer enabled.
Update 2
I went through the tensorflow slim inceptionv3 model code that the arg_scope in the code above is referencing. I removed the final dropout layer after the Avg pool 8x8 and the accuracy remained at around 99%. However, when I set is_training to False only for the batch norm layers, the accuracy dropped back to around 70%. Here is the arg_scope from slim\nets\inception_v3.py and my modification.
with variable_scope.variable_scope(
        scope, 'InceptionV3', [inputs, num_classes], reuse=reuse) as scope:
    with arg_scope(
            [layers_lib.batch_norm], is_training=False):  # layers_lib.dropout], is_training=is_training):
        net, end_points = inception_v3_base(
            inputs,
            scope=scope,
            min_depth=min_depth,
            depth_multiplier=depth_multiplier)
I tried this both with the dropout layer removed and with the dropout layer kept, passing is_training=True to the dropout layer.
(Summarizing from dylan7's debugging in the question's comments)
Batch norm relies on variables to save the summary statistics it normalizes with. These are only updated when is_training is True through an UPDATE_OPS collection (see the batch_norm documentation). If these update ops don't get run (or the variables are overwritten), there may be transient "reasonable" statistics based on each batch which get lost when is_training is False (testing data is not, and should not be, used to inform batch_norm summary statistics).
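In TF1 graph mode, the usual way to make sure those updates run is to add the UPDATE_OPS collection as a control dependency of the train op (a sketch; total_loss and the optimizer here are placeholders, not taken from the question's code):
import tensorflow as tf  # TF 1.x graph mode

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)      # batch_norm moving-average updates
with tf.control_dependencies(update_ops):
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(total_loss)

# at test time, build the graph with is_training=False so batch_norm uses the
# accumulated moving statistics instead of the per-batch statistics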
