Pytorch inference OOM after some batches

I am trying to run inference with a GPT2-like model on a large dataset (26k samples). To speed it up I would like to process the data in batches, but when I do, I get a CUDA out-of-memory (OOM) error after some batches. It seems strange to me that the error only appears after several batches, because I would expect memory use to be more or less constant from batch to batch.
This is my code:
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
sentences = ["<START_TOK>" + s + "<END_TOK>" + tokenizer.eos_token for s in sentences]
inputs = tokenizer(sentences, return_tensors="pt", padding=True, max_length=1024, truncation=True)
device = torch.device("cuda:0")
inputs = inputs.to(device)
model = model.to(device)
model.eval()
res = []
with torch.no_grad():
    output_sequences = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=1024,
        pad_token_id=tokenizer.eos_token_id,
        no_repeat_ngram_size=2,
        do_sample=True,
        top_k=100,
        top_p=0.9,
        temperature=0.85
    )
    output_sequences = output_sequences.cpu()  # not really sure this is useful, just tried, but the problem remained
for i in range(len(sentences)):
    res.append(tokenizer.decode(output_sequences[i]))
model.train()
return res
What could be the problem?
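One thing worth checking, offered as an assumption rather than a confirmed diagnosis: with padding=True each batch is padded to its longest sequence, so the size of the input tensors is not necessarily constant across batches. The sketch below uses plain Python and made-up token counts to show how the padded size can vary, and how sorting by length confines the long sentences to fewer batches. `batches` and `padded_tokens` are hypothetical helpers, not part of any library.

```python
def batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def padded_tokens(batch_lengths):
    """Tokens held by a batch once every row is padded to the longest row."""
    return len(batch_lengths) * max(batch_lengths)


# Made-up token counts per sentence, for illustration only.
lengths = [12, 900, 15, 20, 18, 870, 25, 30]

unsorted_sizes = [padded_tokens(b) for b in batches(lengths, 2)]
sorted_sizes = [padded_tokens(b) for b in batches(sorted(lengths), 2)]

print(unsorted_sizes)  # batches that contain one long sentence are far larger
print(sorted_sizes)    # sorting by length groups the long sentences together
```

If the batch that fails is one of the large ones, a smaller batch size or length-sorted batches may be worth trying.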

Related

Pytorch low gpu util after first epoch

Hi, I'm training my PyTorch model on a remote server.
All jobs are managed by Slurm.
My problem is that training becomes extremely slow after the first epoch.
I checked the GPU utilization.
During the first epoch, utilization looked like the image below.
I can see that the GPU was being utilized.
But from the second epoch on, the utilization percentage is almost zero.
My data loading code looks like this
class img2selfie_dataset(Dataset):
    def __init__(self, path, transform, csv_file, cap_vec):
        self.path = path
        self.transformer = transform
        self.images = [path + item for item in list(csv_file['file_name'])]
        self.smiles_list = cap_vec
    def __getitem__(self, idx):
        img = Image.open(self.images[idx])
        img = self.transformer(img)
        label = self.smiles_list[idx]
        label = torch.Tensor(label)
        return img, label.type(torch.LongTensor)
    def __len__(self):
        return len(self.images)
My dataloader is defined like this
train_data_set = img2selfie_dataset(train_path, preprocess, train_dataset, train_cap_vec)
train_loader = DataLoader(train_data_set, batch_size = 256, num_workers = 2, pin_memory = True)
val_data_set = img2selfie_dataset(train_path, preprocess, val_dataset, val_cap_vec)
val_loader = DataLoader(val_data_set, batch_size = 256, num_workers = 2, pin_memory = True)
My training loop is defined like this
train_loss = []
valid_loss = []
epochs = 20
best_loss = 1e5
for epoch in range(1, epochs + 1):
    print('Epoch {}/{}'.format(epoch, epochs))
    print('-' * 10)
    epoch_train_loss, epoch_valid_loss = train(encoder_model, transformer_decoder, train_loader, val_loader, criterion, optimizer)
    train_loss.append(epoch_train_loss)
    valid_loss.append(epoch_valid_loss)
    if len(valid_loss) > 1:
        if valid_loss[-1] < best_loss:
            print(f"valid loss on this {epoch} is better than previous one, saving model.....")
            torch.save(encoder_model.state_dict(), 'model/encoder_model.pickle')
            torch.save(transformer_decoder.state_dict(), 'model/decoder_model.pickle')
            best_loss = valid_loss[-1]
            print(best_loss)
    print(f'Epoch : [{epoch}] Train Loss : [{train_loss[-1]:.5f}], Valid Loss : [{valid_loss[-1]:.5f}]')
In my opinion, if this problem came from my code, utilization would not have reached 100% during the first epoch.

I fixed this issue by moving my training data onto a local drive.
The policy of my remote server (a school server) was to store personal data on a NAS.
File I/O from the NAS put a heavy load on the network.
It was also affected by other users' file I/O on the NAS.
After I moved the training data off the NAS onto the local drive, everything was fine.
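A rough way to confirm this kind of bottleneck is to time how long the training loop waits for each batch. The sketch below is an illustration under assumptions, not the poster's code: `timed_batches` is a hypothetical helper that wraps any iterable (such as a DataLoader) and records the wait before each item, and `slow_source` merely simulates slow storage with `time.sleep`.

```python
import time


def timed_batches(loader, wait_times):
    """Yield items from loader, appending the seconds spent waiting for each one."""
    iterator = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)
        except StopIteration:
            return
        wait_times.append(time.perf_counter() - start)
        yield batch


def slow_source(n_batches, delay):
    """Stand-in for a data loader reading from slow storage (hypothetical)."""
    for i in range(n_batches):
        time.sleep(delay)  # simulated file I/O latency
        yield i


waits = []
for batch in timed_batches(slow_source(3, 0.05), waits):
    pass  # the training step would go here

print(f"mean wait per batch: {sum(waits) / len(waits):.3f} s")
```

If the wait per batch dominates the time of a training step, the GPU is being starved by data loading rather than limited by the model.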

problem overfitting model VGG16 small dataset

Screenshot of the problem
I have two classes, each containing an equal number of pictures
Train
360 training pictures for class one
360 training pictures for class two
Test
90 test pictures for class one
90 test pictures for class two
My code
def load_split(basePath, csvPath):
    data = []
    labels = []
    rows = open(csvPath).read().strip().split("\n")[1:]
    random.shuffle(rows)
    for (i, row) in enumerate(rows):
        if i > 0:
            print("[INFO] processed {} total images".format(i))
        (label, imagePath) = row.strip().split(",")[-2:]
        imagePath = os.path.sep.join([basePath, imagePath])
        image = io.imread(imagePath)
        image = transform.resize(image, (224, 224))
        image = exposure.equalize_adapthist(image, clip_limit=0.1)
        data.append(image)
        labels.append(int(label))
    data = np.array(data)
    labels = np.array(labels)
    return (data, labels)
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required=True,
help="path to input GTSRB")
ap.add_argument("-m", "--model", required=True,
help="path to output model")
ap.add_argument("-p", "--plot", type=str, default="plot.png",
help="path to training history plot")
args = vars(ap.parse_args())
NUM_EPOCHS = 30
INIT_LR = 1e-4
BS = 64
labelNames = open("signnames.csv").read().strip().split("\n")[1:]
labelNames = [l.split(",")[1] for l in labelNames]
trainPath = os.path.sep.join([args["dataset"], "Train.csv"])
testPath = os.path.sep.join([args["dataset"], "Test.csv"])
print("[INFO] loading training and testing data...")
(trainX, trainY) = load_split(args["dataset"], trainPath)
(testX, testY) = load_split(args["dataset"], testPath)
trainX = (trainX-np.mean(trainX))/np.std(trainX)
testX = (testX-np.mean(testX))/np.std(testX)
numLabels = len(np.unique(trainY))
trainY = to_categorical(trainY, numLabels)
testY = to_categorical(testY, numLabels)
classTotals = trainY.sum(axis=0)
classWeight = classTotals.max() / classTotals
aug = ImageDataGenerator(
rotation_range=10,
zoom_range=0.15,
width_shift_range=0.1,
height_shift_range=0.1,
shear_range=0.15,
horizontal_flip=False,
vertical_flip=False,
fill_mode="nearest")
print("[INFO] compiling model...")
opt = Adam(lr=INIT_LR)
model = TrafficSignNet.build(width=224, height=224, depth=3,
classes=numLabels)
model.compile(loss="categorical_crossentropy", optimizer=opt,
metrics=["accuracy"])
print("[INFO] training network...")
H = model.fit_generator(
aug.flow(trainX, trainY, batch_size=BS),
validation_data=(testX, testY),
steps_per_epoch=trainX.shape[0] // BS,
epochs=NUM_EPOCHS,
class_weight=classWeight,
verbose=1)
print("[INFO] evaluating network...")
predictions = model.predict(testX, batch_size=BS)
print(classification_report(testY.argmax(axis=1),
predictions.argmax(axis=1), target_names=labelNames))
print("[INFO] serializing network to '{}'...".format(args["model"]))
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
model.save_weights("model.h5")
print("Saved model to disk")

For some reason I could not access your training script, so I can only go by your screenshot.
I think your problem is not overfitting: your model is actually not learning, since the loss does not decrease while the accuracy stays at the same level. Please double-check your network, learning rate (most important), and preprocessing. Also, if you are not loading the pretrained VGG16 weights, consider doing so; if you already are, try training without them.
Or just post your code and I can take a look when I get a chance.
Update:
Based on your code, you did not make any internal changes to VGG16, which makes this easier to debug.
Check your training and test sets and make sure the classes are evenly distributed.
Print out the labels (Y test and train) and carefully check that they are correct.
Try standardizing the X train and test data instead of dividing by 255: x = (x - mean) / std
Try a learning rate of 0.0001 (I have found it generally works well for VGG16-based networks).
Keep it simple at first. Don't use learning-rate decay; just try standard Adam first.
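The data checks above can be sketched in a few lines. This is a minimal illustration in plain Python with made-up values: `train_labels`, `train_pixels` and the helper `standardize` are hypothetical names, and reusing the training-set statistics for the test set is a common convention rather than something stated above. With NumPy arrays the same idea applies via `np.mean` and `np.std`.

```python
from collections import Counter
from statistics import fmean, pstdev

# Made-up labels and pixel intensities, for illustration only.
train_labels = [0, 1, 0, 1, 1, 0]
test_labels = [1, 0]
train_pixels = [0.0, 64.0, 128.0, 255.0]
test_pixels = [32.0, 200.0]

# 1) Class balance: each class should appear about equally often.
print("train:", Counter(train_labels))
print("test: ", Counter(test_labels))


def standardize(values, mean, std):
    """Apply x = (x - mean) / std to every value."""
    return [(v - mean) / std for v in values]


# 2) Standardize with statistics computed on the training data only.
mean, std = fmean(train_pixels), pstdev(train_pixels)
train_std = standardize(train_pixels, mean, std)
test_std = standardize(test_pixels, mean, std)

print(round(fmean(train_std), 6))   # close to 0 after standardization
print(round(pstdev(train_std), 6))  # close to 1 after standardization
```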
Best,

Tensorflow -- Iterating over training and validation sequentially

I have been going through TensorFlow's Dataset API so that I can easily feed different datasets to an RNN model.
I got everything working by following the few blog posts available together with the docs on the TensorFlow website. My working example did the following:
--- Train for X epochs on a training dataset -> validate on a validation dataset after all the training has concluded.
However, I have been unable to get the following to work:
--- Train for X epochs on a training dataset -> validate the model on a validation dataset after each epoch (a bit like what Keras does)
The problem comes from the following piece of code:
train_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE, drop_remainder=True).repeat()
val_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE_VAL, drop_remainder=True).repeat()
itr = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
train_init_op = itr.make_initializer(train_dataset)
validation_init_op = itr.make_initializer(val_dataset)
When I create the iterator with from_structure, I need to specify an output shape. Obviously, the output shapes of the training and validation datasets differ, since they have different batch sizes. However, validation_init_op throws the following error, which seems counterintuitive because validation sets often have a different batch size:
TypeError: Expected output shapes compatible with (TensorShape([Dimension(256), Dimension(11), Dimension(74)]), TensorShape([Dimension(256), Dimension(3)])) but got dataset with output shapes (TensorShape([Dimension(28), Dimension(11), Dimension(74)]), TensorShape([Dimension(28), Dimension(3)])).
I want to use this second approach so that I can evaluate my model and see the usual training and validation curves side by side, to work out how to improve it (early stopping, etc.). The first, simpler approach does not give me any of this.
So, the question is: am I doing something wrong? Does my second approach have to be tackled differently? I can think of creating two iterators, but I don't know if that is the right approach. Also, this answer by #MatthewScarpino points to a feedable iterator, because switching between reinitializable iterators makes them start over from the beginning; however, the error above is not related to that part of the code. Maybe the reinitializable iterator is not intended to be used with a different batch size for the validation set, but rather to iterate over the validation set only once after training, whatever its size, without setting the batch size in the .batch() method?
Any help is very much appreciated.
Full code for reference:
N_TIMESTEPS_X = xt.shape[0] ## The stack number
BATCH_SIZE = 256
#N_OBSERVATIONS = xt.shape[1]
N_FEATURES = xt.shape[2]
N_OUTPUTS = yt.shape[1]
N_NEURONS_LSTM = 128 ## Number of units in the LSTMCell
N_EPOCHS = 350
LEARNING_RATE = 0.001
### Define the placeholders and gather the data.
xt = xt.transpose([1,0,2])
xval = xval.transpose([1,0,2])
train_data = (xt, yt)
validation_data = (xval, yval)
N_BATCHES = train_data[0].shape[0] // BATCH_SIZE
print('The number of batches is: {}'.format(N_BATCHES))
BATCH_SIZE_VAL = validation_data[0].shape[0] // N_BATCHES
print('The validation batch size is: {}'.format(BATCH_SIZE_VAL))
## We define the placeholders as a trick so that we do not break into memory problems, associated with feeding the data directly.
'''As an alternative, you can define the Dataset in terms of tf.placeholder() tensors, and feed the NumPy arrays when you initialize an Iterator over the dataset.'''
batch_size = tf.placeholder(tf.int64)
x = tf.placeholder(tf.float32, shape=[None, N_TIMESTEPS_X, N_FEATURES], name='XPlaceholder')
y = tf.placeholder(tf.float32, shape=[None, N_OUTPUTS], name='YPlaceholder')
# Creating the two different dataset objects.
train_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE, drop_remainder=True).repeat()
val_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE_VAL, drop_remainder=True).repeat()
# Creating the Iterator type that permits to switch between datasets.
itr = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
train_init_op = itr.make_initializer(train_dataset)
validation_init_op = itr.make_initializer(val_dataset)
next_features, next_labels = itr.get_next()

After investigating the best way to do this, I arrived at this final implementation, which works well on my end. It is surely not the best possible. To maintain the iterator state, I used a feedable iterator.
AIM: This code is intended for when you want to train and validate at the same time, preserving the state of each iterator (i.e. validating with the newest model parameters). In addition, the code saves the model and other things, such as information about the hyperparameters and summaries for visualizing training and validation in TensorBoard.
Also, don't get confused: you don't need a different batch size for the training set and the validation set. This was a misconception of mine. The batch sizes must be the same, and you have to deal with the different number of batches by simply passing when no more batches are left. This is required in order to create the iterator, since both datasets must have the same data type and shape.
Hope it helps others. Just ignore the code that does not relate to your objectives. Many thanks to #kvish for all the help and time.
Code:
def RNNmodelTF(xt, yt, xval, yval, xtest, ytest):
    N_TIMESTEPS_X = xt.shape[0] ## The stack number
    BATCH_SIZE = 256
    #N_OBSERVATIONS = xt.shape[1]
    N_FEATURES = xt.shape[2]
    N_OUTPUTS = yt.shape[1]
    N_NEURONS_LSTM = 128 ## Number of units in the LSTMCell
    N_EPOCHS = 350
    LEARNING_RATE = 0.001
    ### Define the placeholders and gather the data.
    xt = xt.transpose([1,0,2])
    xval = xval.transpose([1,0,2])
    train_data = (xt, yt)
    validation_data = (xval, yval)
    N_BATCHES = train_data[0].shape[0] // BATCH_SIZE
    ## We define the placeholders as a trick so that we do not break into memory problems, associated with feeding the data directly.
    '''As an alternative, you can define the Dataset in terms of tf.placeholder() tensors, and feed the NumPy arrays when you initialize an Iterator over the dataset.'''
    batch_size = tf.placeholder(tf.int64)
    x = tf.placeholder(tf.float32, shape=[None, N_TIMESTEPS_X, N_FEATURES], name='XPlaceholder')
    y = tf.placeholder(tf.float32, shape=[None, N_OUTPUTS], name='YPlaceholder')
    # Creating the two different dataset objects.
    train_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE, drop_remainder=True).repeat()
    val_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE, drop_remainder=True).repeat()
    #################### Creating the Iterator type that permits to switch between datasets.
    handle = tf.placeholder(tf.string, shape = [])
    iterator = tf.data.Iterator.from_string_handle(handle, train_dataset.output_types, train_dataset.output_shapes)
    next_features, next_labels = iterator.get_next()
    train_val_iterator = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
    train_iterator = train_val_iterator.make_initializer(train_dataset)
    val_iterator = train_val_iterator.make_initializer(val_dataset)
    ###########################
    ### Create the graph
    cellType = tf.nn.rnn_cell.LSTMCell(num_units=N_NEURONS_LSTM, name='LSTMCell')
    inputs = tf.unstack(next_features, axis=1)
    '''inputs: A length T list of inputs, each a Tensor of shape [batch_size, input_size]'''
    RNNOutputs, _ = tf.nn.static_rnn(cell=cellType, inputs=inputs, dtype=tf.float32)
    out_weights = tf.get_variable("out_weights", shape=[N_NEURONS_LSTM, N_OUTPUTS], dtype=tf.float32, initializer=tf.contrib.layers.xavier_initializer())
    out_bias = tf.get_variable("out_bias", shape=[N_OUTPUTS], dtype=tf.float32, initializer=tf.zeros_initializer())
    predictionsLayer = tf.matmul(RNNOutputs[-1], out_weights) + out_bias
    ### Define the cost function, that will be optimized by the optimizer.
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=predictionsLayer, labels=next_labels, name='Softmax_plus_Cross_Entropy'))
    optimizer_type = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE, name='AdamOptimizer')
    optimizer = optimizer_type.minimize(cost)
    ### Model evaluation
    correctPrediction = tf.equal(tf.argmax(predictionsLayer,1), tf.argmax(next_labels,1))
    accuracy = tf.reduce_mean(tf.cast(correctPrediction,tf.float32))
    confusionMatrix1 = tf.confusion_matrix(tf.argmax(next_labels,1), tf.argmax(predictionsLayer,1), num_classes=3, name='ConfMatrix')
    ## Saving variables so that we can restore them afterwards.
    saver = tf.train.Saver()
    save_dir = '/media/SecondDiskHDD/8classModels/DLmodels/tfModels/{}_{}'.format(cellType.__class__.__name__, datetime.now().strftime("%Y%m%d%H%M%S"))
    #save_dir = '/home/Desktop/tfModels/{}_{}'.format(cellType.__class__.__name__, datetime.now().strftime("%Y%m%d%H%M%S"))
    os.mkdir(save_dir)
    varDict = {'nTimeSteps': N_TIMESTEPS_X, 'BatchSize': BATCH_SIZE, 'nFeatures': N_FEATURES,
               'nNeuronsLSTM': N_NEURONS_LSTM, 'nEpochs': N_EPOCHS,
               'learningRate': LEARNING_RATE, 'optimizerType': optimizer_type.__class__.__name__}
    varDicSavingTxt = save_dir + '/varDict.txt'
    modelFilesDir = save_dir + '/modelFiles'
    os.mkdir(modelFilesDir)
    logDir = save_dir + '/TBoardLogs'
    os.mkdir(logDir)
    acc_summary = tf.summary.scalar('Accuracy', accuracy)
    loss_summary = tf.summary.scalar('Cost_CrossEntropy', cost)
    summary_merged = tf.summary.merge_all()
    with open(varDicSavingTxt, 'w') as outfile:
        outfile.write(repr(varDict))
    with tf.Session() as sess:
        tf.set_random_seed(2)
        sess.run(tf.global_variables_initializer())
        train_writer = tf.summary.FileWriter(logDir + '/train', sess.graph)
        validation_writer = tf.summary.FileWriter(logDir + '/validation')
        # initialise iterator with data
        train_val_string = sess.run(train_val_iterator.string_handle())
        cm1Total = None
        cm2Total = None
        print('¡Training starts!')
        for epoch in range(N_EPOCHS):
            batchAccList = []
            batchAccListVal = []
            tot_loss_train = 0
            tot_loss_validation = 0
            for batch in range(N_BATCHES):
                sess.run(train_iterator, feed_dict = {x : train_data[0], y: train_data[1], batch_size: BATCH_SIZE})
                optimizer_output, loss_value, summary, accBatch, cm1 = sess.run([optimizer, cost, summary_merged, accuracy, confusionMatrix1], feed_dict = {handle: train_val_string})
                npArrayPred = predictionsLayer.eval(feed_dict= {handle: train_val_string})
                predLabEnc = np.apply_along_axis(thresholdSet, 1, npArrayPred, value=0.5)
                npArrayLab = next_labels.eval(feed_dict= {handle: train_val_string})
                labLabEnc = np.argmax(npArrayLab, 1)
                cm2 = confusion_matrix(labLabEnc, predLabEnc)
                tot_loss_train += loss_value
                batchAccList.append(accBatch)
                try:
                    sess.run(val_iterator, feed_dict = {x: validation_data[0], y: validation_data[1], batch_size: BATCH_SIZE})
                    valLoss, valAcc, summary_val = sess.run([cost, accuracy, summary_merged], feed_dict = {handle: train_val_string})
                    tot_loss_validation += valLoss
                    batchAccListVal.append(valAcc)
                except tf.errors.OutOfRangeError:
                    pass
                if cm1Total is None and cm2Total is None:
                    cm1Total = cm1
                    cm2Total = cm2
                else:
                    cm1Total += cm1
                    cm2Total += cm2
                if batch % 10 == 0:
                    train_writer.add_summary(summary, batch)
                    validation_writer.add_summary(summary_val, batch)
            epochAcc = tf.reduce_mean(batchAccList)
            sess.run(train_iterator, feed_dict = {x : train_data[0], y: train_data[1], batch_size: BATCH_SIZE})
            epochAcc_num = sess.run(epochAcc, feed_dict = {handle: train_val_string})
            epochAccVal = tf.reduce_mean(batchAccListVal)
            sess.run(val_iterator, feed_dict = {x: validation_data[0], y: validation_data[1], batch_size: BATCH_SIZE})
            epochAcc_num_Val = sess.run(epochAccVal, feed_dict = {handle: train_val_string})
            if epoch%10 == 0:
                print("Epoch: {}, Loss: {:.4f}, Accuracy: {:.3f}".format(epoch, tot_loss_train / N_BATCHES, epochAcc_num))
                print('Validation Loss: {:.4f}, Validation Accuracy: {:.3f}'.format(tot_loss_validation / N_BATCHES, epochAcc_num_Val))
        cmLogFile1 = save_dir + '/cm1File.txt'
        with open(cmLogFile1, 'w') as outfile:
            outfile.write(repr(cm1Total))
        cmLogFile2 = save_dir + '/cm2File.txt'
        with open(cmLogFile2, 'w') as outfile:
            outfile.write(repr(cm2Total))
        saver.save(sess, modelFilesDir + '/model.ckpt')

batching huge data in tensorflow

I am trying to perform binary classification using the code/tutorial from
https://github.com/eisenjulian/nlp_estimator_tutorial/blob/master/nlp_estimators.py
print("Loading data...")
(x_train_variable, y_train), (x_test_variable, y_test) = imdb.load_data(num_words=vocab_size)
print(len(y_train), "train sequences")
print(len(y_test), "test sequences")
print("Pad sequences (samples x time)")
x_train = sequence.pad_sequences(x_train_variable,
maxlen=sentence_size,
padding='post',
value=0)
x_test = sequence.pad_sequences(x_test_variable,
maxlen=sentence_size,
padding='post',
value=0)
print("x_train shape:", x_train.shape)
print("x_test shape:", x_test.shape)
def train_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_train, x_len_train, y_train))
    dataset = dataset.shuffle(buffer_size=len(x_train_variable))
    dataset = dataset.batch(100)
    dataset = dataset.map(parser)
    dataset = dataset.repeat()
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()
def eval_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_test, x_len_test, y_test))
    dataset = dataset.batch(100)
    dataset = dataset.map(parser)
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()
def cnn_model_fn(features, labels, mode, params):
    input_layer = tf.contrib.layers.embed_sequence(
        features['x'], vocab_size, embedding_size,
        initializer=params['embedding_initializer'])
    training = mode == tf.estimator.ModeKeys.TRAIN
    dropout_emb = tf.layers.dropout(inputs=input_layer,
                                    rate=0.2,
                                    training=training)
    conv = tf.layers.conv1d(
        inputs=dropout_emb,
        filters=32,
        kernel_size=3,
        padding="same",
        activation=tf.nn.relu)
    # Global Max Pooling
    pool = tf.reduce_max(input_tensor=conv, axis=1)
    hidden = tf.layers.dense(inputs=pool, units=250, activation=tf.nn.relu)
    dropout_hidden = tf.layers.dropout(inputs=hidden,
                                       rate=0.2,
                                       training=training)
    logits = tf.layers.dense(inputs=dropout_hidden, units=1)
    # This will be None when predicting
    if labels is not None:
        labels = tf.reshape(labels, [-1, 1])
    optimizer = tf.train.AdamOptimizer()
    def _train_op_fn(loss):
        return optimizer.minimize(
            loss=loss,
            global_step=tf.train.get_global_step())
    return head.create_estimator_spec(
        features=features,
        labels=labels,
        mode=mode,
        logits=logits,
        train_op_fn=_train_op_fn)
cnn_classifier = tf.estimator.Estimator(model_fn=cnn_model_fn,
model_dir=os.path.join(model_dir, 'cnn'),
params=params)
train_and_evaluate(cnn_classifier)
The example here loads data from IMDB movie reviews. I have my own text dataset, which is approximately 2 GB in size. In this example, the line
(x_train_variable, y_train), (x_test_variable, y_test) = imdb.load_data(num_words=vocab_size) tries to load the whole dataset into memory. If I try to do the same, I run out of memory. How can I restructure this logic to read data in batches from disk?

You want to change the dataset = tf.data.Dataset.from_tensor_slices((x_train, x_len_train, y_train)) line. There are lots of ways of creating a dataset. from_tensor_slices is the easiest, but it won't work on its own if you can't load the entire dataset into memory.
The best way depends on how you have the data stored, or how you want to store and manipulate it. The simplest in my opinion, with very little downside (unless running on multiple GPUs), is to have the original dataset just provide indices into the data, and to write a normal numpy function for loading the ith example.
dataset = tf.data.Dataset.from_tensor_slices(tf.range(epoch_size))
def tf_map_fn(i):
    def np_map_fn(i):
        return load_ith_example(i)
    inp1, inp2 = tf.py_func(np_map_fn, (i,), Tout=(tf.float32, tf.float32), stateful=False)
    # other preprocessing/data augmentation goes here.
    # unbatched sizes
    inp1.set_shape(shape1)
    inp2.set_shape(shape2)
    return inp1, inp2
dataset = dataset.repeat().shuffle(epoch_size).map(tf_map_fn, 8)
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(1) # start loading data as GPU trains on previous batch
inp1, inp2 = dataset.make_one_shot_iterator().get_next()
Here I assume your outputs are float32 tensors (Tout=...). set_shape calls aren't strictly necessary, but if you know the shape it'll do better error checks.
So long as your preprocessing doesn't take longer than your network to run, this should run just as fast as any other method on a single GPU machine.
The other obvious way is to convert your data to tfrecords, but that'll take up more space on disk and is more of a pain to manage if you ask me.
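For a text dataset, the same idea of loading examples on demand can also be expressed as a plain Python generator that reads the file lazily. The sketch below is only an illustration under assumptions: it supposes one example per line in a file, and `read_batches` is a hypothetical helper. A generator like this could be wrapped with `tf.data.Dataset.from_generator`, though that wiring is not shown here.

```python
import itertools
import os
import tempfile


def read_batches(path, batch_size):
    """Yield lists of up to batch_size stripped lines without loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        while True:
            batch = [line.rstrip("\n") for line in itertools.islice(handle, batch_size)]
            if not batch:
                return
            yield batch


# Tiny demo file standing in for a large dataset on disk.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join(f"review {i}" for i in range(5)))
    demo_path = tmp.name

batch_sizes = [len(batch) for batch in read_batches(demo_path, 2)]
print(batch_sizes)  # [2, 2, 1]
os.remove(demo_path)
```

Only one batch of lines is held in memory at a time, so the file size no longer bounds what can be processed.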

how to calculate the accuracy on the whole train dataset and val dataset respectively

Hello, I'm new to TensorBoard and tf.metrics.accuracy().
(I am Chinese, so my English may not be very good; I will do my best to describe my question.)
For convenience, I only show the key code.
My problem is with saving the train and val accuracy to TensorBoard every epoch. My dataset is large, so I process it in batches.
What I have done so far:
1) Get Dataset
Now, I use
train_iterator = train_dataset.make_initializable_iterator()
val_iterator = val_dataset.make_initializable_iterator()
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(handle, train_iterator.output_types,
train_iterator.output_shapes)
image_batch, label_batch = iterator.get_next()
to get the dataset, and I can switch between the two datasets within tf.Session() using
sess.run([train_iterator.initializer, accuracy_vars_initializer])
and
sess.run([val_iterator.initializer, accuracy_vars_initializer])
2) calculate accuracy
with tf.name_scope("meters"):
    accuracy, accuracy_op = tf.metrics.accuracy(labels=label_batch,
                                                predictions=tf.argmax(tf.nn.softmax(logits_batch), -1),
                                                name="accuracy")
    accuracy_value_ = tf.placeholder(tf.float32, shape=())
    accuracy_summary = tf.summary.scalar('accuracy', accuracy_value_)
accuracy_vars = tf.get_collection(tf.GraphKeys.LOCAL_VARIABLES, scope="meters/accuracy")
accuracy_vars_initializer = tf.variables_initializer(var_list=accuracy_vars)
I use
sess.run(accuracy_vars_initializer)
before each pass over the whole train/val dataset to reset the internal counters of tf.metrics.accuracy()
Then I want to use
for epoch_i in range(FLAGS.epoch):
    sess.run([train_iterator.initializer, accuracy_vars_initializer])
    train_loss_avg, train_acc_avg = [], []
    while True:
        try:
            _, loss_value, step, acc_value, acc_op_value, summary = sess.run(
                [train_op, loss, global_step, accuracy, accuracy_op, merged],
                feed_dict={handle: train_iterator_handle,
                           accuracy_value_: np.average(train_acc_avg),
                           loss_value_: np.average(train_loss_avg)})
            train_acc_avg.append(acc_value)
            train_loss_avg.append(loss_value)
        except tf.errors.OutOfRangeError:
            train_writer.add_summary(summary, global_step=step)
            saver.save(sess, os.path.join(FLAGS.model_dir, "fcn8.ckpt"), global_step)
            print("train dataset finished")
            break
    sess.run([val_iterator.initializer, accuracy_vars_initializer])
    val_loss_avg = []
    while True:
        try:
            loss_value, acc_value, acc_op_value, summary = sess.run(
                [loss, accuracy, accuracy_op, merged], feed_dict={handle: val_iterator_handle,
                                                                  accuracy_value_: acc_op_value,
                                                                  loss_value_: np.average(val_loss_avg)})
            print("Epoch[%d],val batch loss = %g,acc = %g." % (epoch_i, loss_value, acc_value))
            val_loss_avg.append(loss_value)
        except tf.errors.OutOfRangeError:
            val_writer.add_summary(summary, global_step=step)
            print("val dataset finished")
            break
train_writer.close()
val_writer.close()
to achieve my goal.
The accuracy calculation I used before was simply
feed_dict={handle: train_iterator_handle,
accuracy_value_: accuracy_op,
loss_value_: np.average(train_loss_avg)})
But both the old and the new method produce a horizontal accuracy line in TensorBoard. I have revised my code many times, but the problem still exists.
Can anyone help me find the reason? And is there a better, more standard way to structure my code? It is too complicated right now.
Many thanks for any help.
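As a point of reference for what "accuracy over the whole dataset" means, here is a framework-free sketch. It assumes, as the TensorFlow 1.x documentation describes, that tf.metrics.accuracy keeps two running local variables (a total and a count) that its update op accumulates and that re-running their initializer zeroes. `RunningAccuracy` and `batch_accuracy` are hypothetical stand-ins for that behaviour, not TensorFlow code, and the batches are made up.

```python
class RunningAccuracy:
    """Accumulate correct/seen counts across batches, like a streaming metric."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Plays the role of re-running the metric's local-variable initializer.
        self.correct = 0
        self.seen = 0

    def update(self, labels, predictions):
        # Plays the role of the update op: fold one batch into the counters.
        self.correct += sum(int(y == p) for y, p in zip(labels, predictions))
        self.seen += len(labels)

    def value(self):
        return self.correct / self.seen if self.seen else 0.0


def batch_accuracy(labels, predictions):
    """Accuracy of a single batch on its own."""
    return sum(int(y == p) for y, p in zip(labels, predictions)) / len(labels)


# Made-up (labels, predictions) batches; the last batch is smaller.
batches = [([1, 0, 1, 1], [1, 0, 0, 1]), ([0, 0, 1, 1], [0, 1, 1, 1]), ([1], [0])]

metric = RunningAccuracy()
per_batch = []
for labels, predictions in batches:
    metric.update(labels, predictions)
    per_batch.append(batch_accuracy(labels, predictions))

whole_dataset_accuracy = metric.value()
mean_of_batches = sum(per_batch) / len(per_batch)
print(whole_dataset_accuracy)  # 6 correct out of 9 examples
print(mean_of_batches)         # differs because the last batch is smaller
```

Calling reset() before each pass over the train or val set, and reading value() once at the end of the pass, gives one number per epoch for each dataset.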
