Something goes wrong when TensorFlow runs in a thread - multithreading

I'm writing a multi-threaded face recognition program, using Keras as the high-level model API with TensorFlow as the backend. The code is below:
class FaceRecognizerTrainThread(QThread):
    def run(self):
        print("[INFO] Loading images...")
        images, org_labels, face_classes = FaceRecognizer.load_train_file(self.train_file)

        print("[INFO] Compiling Model...")
        opt = SGD(lr=0.01)
        face_recognizer = LeNet.build(width=Const.FACE_SIZE[0], height=Const.FACE_SIZE[1],
                                      depth=Const.FACE_IMAGE_DEPTH, classes=face_classes,
                                      weightsPath=None)
        face_recognizer.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])

        images = np.array(images)[:, np.newaxis, :, :] / 255.0
        labels = np_utils.to_categorical(org_labels, face_classes)

        print("[INFO] Training model...")
        try:
            face_recognizer.fit(images, labels, epochs=50, verbose=2, batch_size=10)
        except Exception as e:
            print(e)
        print("[INFO] Training model done...")

        save_name = "data/CNN_" + time.strftime("%Y%m%d%H%M%S", time.localtime()) + ".hdf5"
        if save_name:
            face_recognizer.save(save_name)
        self.signal_train_end.emit(save_name)
Everything is OK when running it in normal mode, but when I run it in a QThread and it reaches
face_recognizer.fit(images, labels, epochs=50, verbose=2, batch_size=10)
it gives me the error:
Cannot interpret feed_dict key as Tensor: Tensor Tensor("conv2d_1_input:0", shape=(?, 1, 30, 30), dtype=float32) is not an element of this graph.
How can I fix it? Any suggestion is welcome, thank you very much!

TensorFlow lets you define a tf.Graph(), create a tf.Session() with that graph, and then run the operations defined in the graph. When you do it this way, each QThread tries to create its own TF graph, which is why you get the error that the tensor is not an element of this graph. I don't see your feed_dict code, so I would assume it probably runs in a main thread that your other threads do not see. Including your feed_dict in each thread might make it work, but it's hard to conclude without looking at your full code.
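One common workaround (a rough sketch using the question's own names, not a guaranteed fix) is to build or load the model in the main thread, keep a reference to the graph that owns it, and re-enter that graph inside the worker thread before calling fit or predict:
import tensorflow as tf

# Main thread: build and compile the model once, and remember the graph that owns it.
face_recognizer = LeNet.build(width=Const.FACE_SIZE[0], height=Const.FACE_SIZE[1],
                              depth=Const.FACE_IMAGE_DEPTH, classes=face_classes,
                              weightsPath=None)
face_recognizer.compile(loss="categorical_crossentropy", optimizer=SGD(lr=0.01),
                        metrics=["accuracy"])
main_graph = tf.get_default_graph()

class FaceRecognizerTrainThread(QThread):
    def run(self):
        # Re-enter the main thread's graph; otherwise this QThread sees a different
        # default graph and the fed tensors are "not elements of this graph".
        with main_graph.as_default():
            face_recognizer.fit(images, labels, epochs=50, verbose=2, batch_size=10)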
Replicating models in Keras and Tensorflow for a multi-threaded setting might help you.
To solve your problem, you should be using something similar to this post. Code reproduced from that post:
# Thread body: loop until the coordinator indicates a stop was requested.
# If some condition becomes true, ask the coordinator to stop.
def MyLoop(coord):
    while not coord.should_stop():
        ...do something...
        if ...some condition...:
            coord.request_stop()

# Main thread: create a coordinator.
coord = tf.train.Coordinator()

# Create 10 threads that run 'MyLoop()'
threads = [threading.Thread(target=MyLoop, args=(coord,)) for i in xrange(10)]

# Start the threads and wait for all of them to stop.
for t in threads:
    t.start()
coord.join(threads)
It is also worth reading about inter_op and intra_op parallelism here.
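For reference, a minimal sketch (TF 1.x) of setting those two thread-pool sizes; the numbers are arbitrary:
import tensorflow as tf

config = tf.ConfigProto(
    inter_op_parallelism_threads=2,   # how many independent ops may run concurrently
    intra_op_parallelism_threads=4)   # threads used inside a single op (e.g. a matmul)
sess = tf.Session(config=config)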

Related

Train two models iteratively with PyTorch

I want to train two cascaded networks, e.g. X->Z->Y, with Z=net1(X) and Y=net2(Z).
I want to optimize the parameters of these two networks iteratively, i.e., for fixed parameters of net1, first train the parameters of net2 using the MSE(predY, Y) loss until convergence; then use the converged MSE loss to train one iteration of net1, and so on.
So I define two optimizers, one for each network. My training code is below:
net1 = SimpleLinearF()
opt1 = torch.optim.Adam(net1.parameters(), lr=0.01)
loss_func = nn.MSELoss()

for itera1 in range(num_iters1 + 1):
    predZ = net1(X)

    net2 = SimpleLinearF()
    opt2 = torch.optim.Adam(net2.parameters(), lr=0.01)
    for itera2 in range(num_iters2 + 1):
        predY = net2(predZ)
        loss = loss_func(predY, Y)
        if itera2 % (num_iters2 // 2) == 0:
            print('iteration: {:d}, loss: {:.7f}'.format(int(itera2), float(loss)))
        loss.backward(retain_graph=True)
        opt2.step()
        opt2.zero_grad()

    loss.backward()
    opt1.step()
    opt1.zero_grad()
However, I encounter the following error:
RuntimeError: one of the variables needed for gradient computation has been modified by an
inplace operation: [torch.FloatTensor [1, 1]], which is output 0 of AsStridedBackward0, is at
version 502; expected version 501 instead. Hint: enable anomaly detection to find the
operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Does anyone know why this error occurs? How should I solve this problem? Many thanks.
I found the answer to my question after some searching on the PyTorch computation graph.
Just removing retain_graph=True and adding a .detach() to predZ in net2(predZ) solves this error.
The detach operation cuts net1 out of the computation graph of net2/opt2.
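A minimal sketch of that change, using the same names as the question (only the inner loop is shown):
for itera2 in range(num_iters2 + 1):
    # Detaching predZ cuts net1 out of net2's computation graph, so the repeated
    # backward passes here no longer touch net1's graph and retain_graph=True
    # is no longer needed.
    predY = net2(predZ.detach())
    loss = loss_func(predY, Y)
    loss.backward()
    opt2.step()
    opt2.zero_grad()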

Running Tensorflow model test data after closing session

I have a ConvNet I am trying to replicate (not my original code) that was able to run the test dataset through the trained model only when I trained and tested in the same sitting. I tweaked only a few lines of the code to make it run test data after that sitting, so I am not sure what might be going on. I noticed that "logits_out" was a dataflow edge rather than a node in TensorBoard. Is it that, because edges aren't saved in checkpoints automatically, and it is not intentionally saved as a node or in any other form in the original code, it can't be called after the first sitting closes?
This is the general structure of the training phase:
tf.reset_default_graph()
graph = tf.Graph()

with graph.as_default():
    with tf.name_scope('1st_pool'):
        # first layer
    # subsequent layers

with graph.as_default():
    # flattening, dropout, optimization, etc...
    # some summary.scalar for loss analyses
    logits_out = tf.layers.dense(flat, 1)  # flat is the flattened array
    saved_1 = tf.train.Saver()
    trained_event = tf.summary.FileWriter('./CNN/train', graph=graph)
    test_event = tf.summary.FileWriter('./CNN/test', graph=graph)
    merged = tf.summary.merge_all()

with tf.Session(graph=graph) as sess:
    # training and "validating"
    sess.run(tf.global_variables_initializer())
    # running train summaries
    if step == test_round:
        # running test summaries
    saved_1.save(sess, './CNN/model_1.ckpt')
(EDITED: code pasted incorrectly)
This code ran successfully during the continuous sitting with graph still open:
with tf.Session(graph=graph) as sess:
    saved_1.restore(sess, tf.train.latest_checkpoint('./CNN'))
    #
    pred = sess.run(logits_out, feed_dict={some inputs for placeholders})
    #
I tweaked pretty much only 2 lines (shown below) to load the meta files into a new graph on the next day, but it gave the error "name 'logits_out' is not defined" when I tried to run it in a separate sitting (in fact, other variables I tried to sess.run gave the same error):
with tf.Session(graph=tf.get_default_graph()) as sess:
    saved_1 = tf.train.import_meta_graph('./CNN/model_1.ckpt.meta')
    saved_1.restore(sess, tf.train.latest_checkpoint('./CNN'))
    pred = sess.run(logits_out, feed_dict={some inputs for placeholders})
    #
EDITED: I'm thinking it might be because I am missing a scope, or misunderstanding how TensorFlow names things, after restoring the session/graph the next day, but I can't see how: the only thing that had been named was the pool.
I was able to run data through the model today simply by re-creating the graph, i.e. by running this section of code again:
tf.reset_default_graph()
graph = tf.Graph()

with graph.as_default():
    with tf.name_scope('1st_pool'):
        # first layer
    # subsequent layers

with graph.as_default():
    # flattening, dropout, optimization, etc...
    # some summary.scalar for loss analyses
    logits_out = tf.layers.dense(flat, 1)  # flat is the flattened array
    saved_1 = tf.train.Saver()
    trained_event = tf.summary.FileWriter('./CNN/train', graph=graph)
    test_event = tf.summary.FileWriter('./CNN/test', graph=graph)
    merged = tf.summary.merge_all()

with tf.Session(graph=graph) as sess:
    # training and "validating"
    sess.run(tf.global_variables_initializer())
    # running train summaries
    if step == test_round:
        # running test summaries
    saved_1.save(sess, './CNN/model_1.ckpt')
and then running the code without the 2 edited lines:
with tf.Session(graph=graph) as sess:
    saved_1.restore(sess, tf.train.latest_checkpoint('./CNN'))
    #
    pred = sess.run(logits_out, feed_dict={some inputs for placeholders})
    #
So the gist of this entire post on SO was that I did not have to use tf.train.import_meta_graph. But what I don't understand is: what is the use of tf.train.import_meta_graph? I thought it imports the graph and its metadata saved in the ".meta" file so that I could avoid having to rebuild the graph from the source code?
(note: I will remove this postscript question once I figure it out)
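For what it's worth, a minimal sketch of how tf.train.import_meta_graph is typically used: it rebuilds the graph structure from the .meta file, but it does not recreate the Python variable logits_out, so tensors (and placeholders) have to be looked up by their graph names. The tensor names below are assumptions; the real ones can be listed from the imported graph.
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    # Rebuilds the saved graph structure; no Python variables are recreated.
    saver = tf.train.import_meta_graph('./CNN/model_1.ckpt.meta')

with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('./CNN'))
    # Assumed names; list them with [n.name for n in graph.as_graph_def().node]
    logits_out = graph.get_tensor_by_name('dense/BiasAdd:0')
    input_x = graph.get_tensor_by_name('input_x:0')
    pred = sess.run(logits_out, feed_dict={input_x: test_images})  # test_images: your data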

tf.keras.Model.load_weights() caught ResourceExhaustedError

I have two ipynb files: train.ipynb and predict.ipynb. I trained a model with fit_generator (with batch size 64) in train.ipynb and caught a ResourceExhaustedError when I tried to load the weights in predict.ipynb.
I'm using Keras inside TensorFlow v1.9 and the TensorFlow Docker image.
# train.ipynb
def network():
    # [ A normal model ]
    return model

model = network()
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit_generator(seq, shuffle=True,
                    epochs=10, verbose=1)

# save the model and weights after training
with open('model.json', 'w') as json_file:
    json_file.write(model.to_json())
model.save_weights('model.h5')

clear_session()  # tried to clear the session here

# saved both successfully
# model.h5 (131 MB)
After saving successfully, I can load it back inside train.ipynb. However, when I do the same thing in predict.ipynb, an error is raised.
# train.ipynb
with open('model.json', 'r') as json_file:
    test_model = model_from_json(json_file.read())
test_model.load_weights('model.h5')
# No error here

# predict.ipynb
with open('model.json', 'r') as json_file:
    test_model = model_from_json(json_file.read())
test_model.load_weights('model.h5')
# Got the following error
ResourceExhaustedError: OOM when allocating tensor with shape[28224,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Any help is appreciated!
Are you running both notebooks simultaneously? Your GPU is out of memory. Try nvidia-smi in the command line to check on your GPU's resource usage, although be aware that TensorFlow by default occupies all available GPU memory. keras.backend.clear_session() could be of help as well.
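As a rough sketch (assuming TF 1.x with tf.keras, as in the question), predict.ipynb can clear any old session and ask TensorFlow to grow GPU memory on demand instead of grabbing it all at once:
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.models import model_from_json

K.clear_session()                        # drop any previously created graph/session

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # allocate GPU memory lazily instead of all at once
K.set_session(tf.Session(config=config))

with open('model.json', 'r') as json_file:
    test_model = model_from_json(json_file.read())
test_model.load_weights('model.h5')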

What's the best way to evaluate test error during training?

I have a neural network I'm training with TensorFlow. At each iteration, I can compute the training cost that is passed to the optimizer. Pseudo-code of my implementation is:
def defineNetworkStructure():  # layers
    ...

def feedForward():
    ...

def defineCost():
    ...

def defineOptimizer():
    opt = ...

def train(train_X, train_Y, ...):
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for i in range(N):
            _, ith_cost = sess.run([opt, cost], feed_dict={X: train_X, Y: train_Y})
            print("Cost at {} is {}".format(i, ith_cost))
Now, inside the loop, I'd like to insert something like:
ith_cost = sess.run([opt, cost], feed_dict={X:test_X, Y:test_Y})
Note: test_X and test_Y instead of train_X and train_Y.
However, if I do so, I'll modify the value of the TensorFlow variable cost and consequently (but I'm not sure), I'll influence the optimization process.
What is the best way to achieve this task in tensorflow?
The thing you've missed here is that you shouldn't run opt on test_X and test_Y.
Just doing sess.run(cost, feed_dict={X: test_X, Y: test_Y}) will output your test loss and in no way affects the training or optimization process.
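A minimal sketch of that inside the training loop, reusing the names from the pseudo-code above:
for i in range(N):
    _, ith_cost = sess.run([opt, cost], feed_dict={X: train_X, Y: train_Y})
    # No `opt` in this call, so no weights are updated when evaluating the test set.
    test_cost = sess.run(cost, feed_dict={X: test_X, Y: test_Y})
    print("Step {}: train cost {:.6f}, test cost {:.6f}".format(i, ith_cost, test_cost))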

Retraining the last layer of Inception-v3 significantly slows down classification

In an attempt at transfer learning over Inception-v3 with TF and Python 3.5, I've tested two approaches:
1- retraining the last layer, as shown here: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/image_retraining
2- Apply linear SVM on top of inception-V3 bottlenecks as demonstrated here: https://www.kernix.com/blog/image-classification-with-a-pre-trained-deep-neural-network_p11
I expected them to have a similar runtime for the classification phase, since the critical part - the bottleneck extraction - is identical. In practice, though, the retrained network is about 8x slower when running classification.
My question is whether anyone has an idea of the reason for this.
Some code snippets:
SVM on top (the faster):
def getTensors():
    graph_def = tf.GraphDef()
    f = open('classify_image_graph_def.pb', 'rb')
    graph_def.ParseFromString(f.read())
    tensorBottleneck, tensorsResizedImage = tf.import_graph_def(
        graph_def, name='', return_elements=['pool_3/_reshape:0', 'Mul:0'])
    return tensorBottleneck, tensorsResizedImage

def calc_bottlenecks(imgFile, tensorBottleneck, tensorsResizedImage):
    """ - read, decode and resize to get <resizedImage> - """
    bottleneckValues = sess.run(tensorBottleneck, {tensorsResizedImage: resizedImage})
    return np.squeeze(bottleneckValues)
This takes about 0.5 sec on my (Windows) laptop while the SVM part takes no time.
Retraining the last layer (this is harder to summarize since the code is longer):
def loadGraph(pbFile):
    with tf.gfile.FastGFile(pbFile, 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name='')
    with tf.Session() as sess:
        softmaxTensor = sess.graph.get_tensor_by_name('final_result:0')

def labelImage(imageFile, softmaxTensor):
    with tf.Session() as sess:
        input_layer_name = 'DecodeJpeg/contents:0'
        predictions, = sess.run(softmaxTensor, {input_layer_name: image_data})
'pbFile' is the file saved by the retrainer, which is supposed to have identical topology and weights to 'classify_image_graph_def.pb', excluding the classification layer. This takes about 4 seconds to run (on the same laptop, excluding the loading).
Any idea for the performance gap?
Thanks!
Solved. The problem was creating a new tf.Session() for every image. Storing the session when reading the graph and reusing it brought the runtime back to what was expected.
def loadGraph(pbFile):
    ...
    # Create the session without a `with` block so it stays open after returning.
    sessToStore = tf.Session()
    softmaxTensor = sessToStore.graph.get_tensor_by_name('final_result:0')
    return softmaxTensor, sessToStore

def labelImage(imageFile, softmaxTensor, sessToStore):
    input_layer_name = 'DecodeJpeg/contents:0'
    predictions, = sessToStore.run(softmaxTensor, {input_layer_name: image_data})
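A rough usage sketch of that fix (the file name and image list below are assumptions): load the graph and the session once, then reuse them for every image.
softmaxTensor, sessToStore = loadGraph('retrained_graph.pb')   # assumed file name
for imageFile in imageFiles:                                   # assumed list of image paths
    labelImage(imageFile, softmaxTensor, sessToStore)
sessToStore.close()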
