tf.keras.Model.load_weights() caught ResourceExhaustedError - python-3.x

I have two ipynb files: train.ipynb and predict.ipynb. I trained a model with fit_generator (batch size 64) in train.ipynb and caught a ResourceExhaustedError when I tried to load the weights in predict.ipynb.
I'm using Keras inside TensorFlow v1.9 with the TensorFlow Docker image.
# train.ipynb
def network():
    # [A normal model]
    return model

model = network()
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit_generator(seq, shuffle=True,
                    epochs=10, verbose=1)

# save the model and weights after training
with open('model.json', 'w') as json_file:
    json_file.write(model.to_json())
model.save_weights('model.h5')
clear_session()  # tried to clear the session here

# saved both successfully
# model.h5 (131 MB)
After saving successfully, I can load the weights back inside train.ipynb. However, when I do the same thing in predict.ipynb, an error is raised.
# train.ipynb
with open('model.json', 'r') as json_file:
    test_model = model_from_json(json_file.read())
test_model.load_weights('model.h5')
# No error here
# predict.ipynb
with open('model.json', 'r') as json_file:
    test_model = model_from_json(json_file.read())
test_model.load_weights('model.h5')
# Got the following error
ResourceExhaustedError: OOM when allocating tensor with shape[28224,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Any help is appreciated!

Are you running both notebooks simultaneously? Your GPU is out of memory. Try nvidia-smi in the command line to check on your GPU's resource usage, although be aware that TensorFlow by default occupies all available GPU memory. keras.backend.clear_session() could be of help as well.
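For example, a minimal sketch (assuming TF 1.x with tf.keras, as in the question) of letting TensorFlow allocate GPU memory on demand instead of reserving it all up front:

import tensorflow as tf

# Grow GPU memory usage as needed rather than claiming the whole GPU at startup.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
tf.keras.backend.set_session(tf.Session(config=config))

Run this before building or loading the model in each notebook, so the second notebook only requests the memory it actually needs.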

Related

input problem of tensorflow model while doing audio classification

Env:
tensorflow version: 2.4rc
numpy version: 1.19.5
os: macOS (Apple M1) / Ubuntu 16.04
description:
I'm new to TensorFlow and I am doing audio classification with LSTM and CNN models, but both models seem to hit the same problem while training. The training process gets stuck because of incorrect inputs, and I can't figure out how to fix it. Take the LSTM as an example:
model information:
input_shape = (SEGMENT_DUR, N_MEL_BANDS)
model = Sequential()
model.add(Input(input_shape))
model.add(LSTM(96, return_sequences=True, name='lstm-96', input_shape=input_shape))
model.add(LSTM(218, name='lstm-218'))
model.add(Dense(n_classes, activation='softmax', name='prediction'))
model.summary()
image link: model summary
training setup:
early_stopping = EarlyStopping(monitor='val_loss', patience=EARLY_STOPPING_EPOCH)
save_clb = ModelCheckpoint(
    "{weights_basepath}/{model_path}/".format(
        weights_basepath=MODEL_WEIGHT_BASEPATH,
        model_path=self.model_module.BASE_NAME) +
    "epoch.{epoch:02d}-val_loss.{val_loss:.3f}-acc.{val_accuracy:.3f}" + "-{key}.hdf5".format(
        key=self.model_module.MODEL_KEY),
    monitor='val_loss',
    save_best_only=True)
lrs = LearningRateScheduler(lambda epoch_n: self.init_lr / (2 ** (epoch_n // SGD_LR_REDUCE)))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# train_input, train_val = self.load_data(self.X_train), self.y_train
history = model.fit(
    self._batch_generator(self.X_train, self.y_train, batch_size=BATCH_SIZE),
    validation_data=self._batch_generator(self.X_val, self.y_val, batch_size=BATCH_SIZE),
    batch_size=BATCH_SIZE,
    epochs=MAX_EPOCH_NUM,
    steps_per_epoch=2,
    verbose=1,
    callbacks=[save_clb, early_stopping, lrs]
    # workers=1,
)
The self-defined batch generator yields batches of (input, label) pairs of size batch_size; a sketch of this shape of generator is shown below.
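A simplified, illustrative sketch (not the real implementation) of a generator that loops forever and yields (inputs, labels) arrays, which is what model.fit expects from a generator:

import numpy as np

def _batch_generator(X, y, batch_size):
    # Illustrative only: shuffle, then yield (inputs, labels) batches indefinitely.
    n = len(X)
    while True:
        idx = np.random.permutation(n)
        for start in range(0, n - batch_size + 1, batch_size):
            batch = idx[start:start + batch_size]
            yield X[batch], y[batch]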
result
image link: run result
It seems that nothing happens after the first batch, and when I traced into the training details I found the input shapes are unclear:
image link: shape problem
These warnings are no big deal; they appear in simple, successful demos as well. For convenience, here is the link to the relevant docs: model.fit. I really need HELP! Thanks in advance!

TensorFlow model prediction fails when run right after model training

I'm having trouble with my model prediction. The training works fine, but afterwards my program fails while predicting with the trained model. When I rerun my code, the training is skipped because it's already done, and the prediction then works fine as it's supposed to. On Google I find this error only in connection with model training, so I guess those solutions don't work for me. I think the reason for my error is that my video RAM is not entirely freed after model training. That's why I tried the following, without success.
tf.keras.backend.clear_session()
tf.compat.v1.reset_default_graph()
K.clear_session()
Error code:
prediction = model.predict(x)[:, 0]#.flatten() # flatten was needed now
File "/home/max/PycharmProjects/Masterthesis/venv3-8-12/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/max/PycharmProjects/Masterthesis/venv3-8-12/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 106, in convert_to_eager_tensor
return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
Do you have any ideas on how to solve the problem?
My Setup:
Python: 3.8.12
Tensorflow-gpu: 2.7.0
System: Manjaro Linux
Cuda: 11.5
GPU: NVIDIA GeForce GTX 980 Ti
My Code:
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout
import tensorflow as tf
import h5py
import keras.backend as K
def loss_function(y_true, y_pred):
alpha = K.std(y_pred) / K.std(y_true)
beta = K.sum(y_pred) / K.sum(y_true)
error = K.sqrt( + K.square(1 - alpha) + K.square(1 - beta))
return error
i = Input(shape=(171, 11))
x = LSTM(100, return_sequences=True)(i)
x = LSTM(50)(x)
x = Dropout(0.1)(x)
out = Dense(1)(x)
model = Model(i, out)
model.compile(
loss=loss_function,
optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
with h5py.File("db.hdf5", 'r') as db_:
r = model.fit(
db_["X_train"][...],
db_["Y_train"][...],
epochs=1,
batch_size=64,
verbose=1,
shuffle=True)
model.save("model.h5")
model = load_model("model.h5", compile=False)
with h5py.File("db.hdf5", 'r') as db:
x = db["X_val"][...]
y = db["Y_val"][...].flatten()
prediction = model.predict(x)[:, 0].flatten()
I found the solution to my problem. Since I'm using a custom loss function, I somehow needed to specify the custom loss function when loading the model again.
I accomplished this by modifying this line
model = load_model("model.h5", compile=False)
to this one
model = load_model("model.h5", custom_objects={"loss_function": loss_function})

Output predicted image Tensorflow Lite

I am trying to figure out how I can save a predicted mask (output) from a TensorFlow model that has been converted to a TF Lite model on my PC. Any tips or ideas on how I can visualize it or save the predicted mask as a .png image? I have tried using the TensorFlow Lite inference guide from https://www.tensorflow.org/lite/guide/inference#load_and_run_a_model_in_python without success.
The output now is as follows:
[ 1 512 512 3]
[[[[9.7955531e-01 2.0444747e-02]
[9.9987805e-01 1.2197520e-04]
[9.9978799e-01 2.1196880e-04]
.......
.......
[9.9997246e-01 2.7536058e-05]
[9.9997437e-01 2.5645388e-05]
[1.9125430e-03 9.9808747e-01]]]]
Any help is greatly appreciated.
Many thanks
## Load the TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path="tflite_model.tflite")
print(interpreter.get_input_details())
print(interpreter.get_output_details())
print(interpreter.get_tensor_details())
interpreter.allocate_tensors()
## Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
## Test the model on input data.
input_shape = input_details[0]['shape']
print(input_shape)
## Use same image as Keras model
input_data = np.array(Xall, dtype=np.float32)
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
## The function `get_tensor()` returns a copy of the tensor data.
## Use `tensor()` in order to get a pointer to the tensor.
output_data = interpreter.get_tensor(output_details[0]['index'])
print(output_data)
output_data.shape
It depends on the meaning of your model output. Then, use image libraries like cv2 or PIL to draw the mask.
For example, the first row:
9.7955531e-01 2.0444747e-02
You need to figure out what they correspond to. With the limited information, it is hard to guess from the context.
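For instance, if the output turns out to be per-pixel class probabilities of shape (1, 512, 512, 2) (an assumption, not confirmed by the question), a minimal sketch for saving a binary mask would be:

import numpy as np
import cv2

# Assumes output_data has shape (1, H, W, 2): pick the most likely class per pixel
# and scale it to 0/255 so the mask is visible as a grayscale PNG.
mask = np.argmax(output_data[0], axis=-1).astype(np.uint8) * 255
cv2.imwrite("predicted_mask.png", mask)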

Getting Errors while running elmo embeddings in google colab

I am extracting features through ELMo. Train and test are text data. I am getting errors while executing in Google Colab. I have checked previous Stack Overflow questions but could not resolve the issue. Exact code with pointers would be helpful.
import tensorflow as tf
import tensorflow_hub as hub

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

def elmo_vectors(x):
    embeddings = elmo(x.tolist(), signature="default", as_dict=True)["elmo"]
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        # return average of ELMo features
        return sess.run(tf.reduce_mean(embeddings, 1))

list_train = [train[i:i+100] for i in range(0, train.shape[0], 100)]
list_test = [test[i:i+100] for i in range(0, test.shape[0], 100)]

# Extract ELMo embeddings
elmo_train = [elmo_vectors(x['clean_tweet']) for x in list_train]
elmo_test = [elmo_vectors(x['clean_tweet']) for x in list_test]
I am getting the following errors:
UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node module_apply_default_1/bilm/CNN_2/Conv2D_6 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_hub/native_module.py:517) ]]
[[node Mean (defined at :8) ]]
I tried this just now on colab.research.google.com in Python 3 runtimes, with and without GPU, and the following adaptation of your code runs:
import tensorflow as tf
import tensorflow_hub as hub

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

def elmo_vectors(x):
    embeddings = elmo(x,  # Note plain x here.
                      signature="default", as_dict=True)["elmo"]
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        # return average of ELMo features
        return sess.run(tf.reduce_mean(embeddings, 1))

elmo_vectors(["Hello world"])
I get the output:
array([[ 0.45319763, -0.99154925, -0.26539633, ..., -0.13455263,
0.48878008, 0.31264588]], dtype=float32)
I believe this is not a TF Hub problem.
This happened to me as well. I think it is due to the limited RAM we get in the Google Colab free tier. I had to remove a few convolution layers and reduce the batch size in order to run it. Also, I was not able to run it on the GPU.
So, I guess you can consider using Google Colab Pro.
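For example (an illustrative change, not a guaranteed fix), shrinking the chunk size fed to elmo_vectors lowers the peak memory of each sess.run call:

chunk = 25  # illustrative value, smaller than the original 100
list_train = [train[i:i + chunk] for i in range(0, train.shape[0], chunk)]
elmo_train = [elmo_vectors(x['clean_tweet']) for x in list_train]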

Something goes wrong when running TensorFlow in a thread

I'm writing a multi-threaded face recognition program, using Keras as the high-level API with TensorFlow as the backend. The code is as below:
class FaceRecognizerTrainThread(QThread):
    def run(self):
        print("[INFO] Loading images...")
        images, org_labels, face_classes = FaceRecognizer.load_train_file(self.train_file)

        print("[INFO] Compiling Model...")
        opt = SGD(lr=0.01)
        face_recognizer = LeNet.build(width=Const.FACE_SIZE[0], height=Const.FACE_SIZE[1],
                                      depth=Const.FACE_IMAGE_DEPTH,
                                      classes=face_classes, weightsPath=None)
        face_recognizer.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])

        images = np.array(images)[:, np.newaxis, :, :] / 255.0
        labels = np_utils.to_categorical(org_labels, face_classes)

        print("[INFO] Training model...")
        try:
            face_recognizer.fit(images, labels, epochs=50, verbose=2, batch_size=10)
        except Exception as e:
            print(e)
        print("[INFO] Training model done...")

        save_name = "data/CNN_" + time.strftime("%Y%m%d%H%M%S", time.localtime()) + ".hdf5"
        if save_name:
            face_recognizer.save(save_name)
        self.signal_train_end.emit(save_name)
Everything is OK when running it in normal mode, but when I run it in a QThread and it gets to
face_recognizer.fit(images, labels, epochs=50, verbose=2, batch_size=10)
it gives me the error:
Cannot interpret feed_dict key as Tensor: Tensor Tensor("conv2d_1_input:0", shape=(?, 1, 30, 30), dtype=float32) is not an element of this graph.
How can I fix it? Any suggestion is welcome, thank you very much!
TensorFlow lets you define a tf.Graph(), create a tf.Session() with that graph, and then run the operations defined in the graph. When you do it this way, each QThread tries to create its own TF graph, which is why you get the error that the tensor is not an element of this graph. I don't see your feed_dict code, so I assume it probably runs in the main thread, which your other threads do not see. Including your feed_dict in each thread might make it work, but it's hard to conclude without looking at your full code.
Replicating models in Keras and Tensorflow for a multi-threaded setting might help you.
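As a concrete illustration of the "same graph and session" point, a minimal sketch (TF 1.x with standalone Keras; the names are illustrative, not taken from the original code):

import tensorflow as tf
from keras import backend as K

# Capture the graph and session created in the main thread...
main_graph = tf.get_default_graph()
main_session = K.get_session()

def run_training():
    # ...and make them current inside the worker thread before using the model.
    with main_graph.as_default():
        K.set_session(main_session)
        # build, fit, and save the Keras model here
        pass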
To solve your problem, you should be using something similar to this post. Code reproduced from that post:
# Thread body: loop until the coordinator indicates a stop was requested.
# If some condition becomes true, ask the coordinator to stop.
def MyLoop(coord):
    while not coord.should_stop():
        ...do something...
        if ...some condition...:
            coord.request_stop()

# Main thread: create a coordinator.
coord = tf.train.Coordinator()

# Create 10 threads that run 'MyLoop()'
threads = [threading.Thread(target=MyLoop, args=(coord,)) for i in xrange(10)]

# Start the threads and wait for all of them to stop.
for t in threads:
    t.start()
coord.join(threads)
It is also worth reading about inter_op and intra_op parallelism here.
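For reference, a minimal TF 1.x sketch of setting those options (the thread counts are illustrative):

import tensorflow as tf

# Limit TensorFlow's internal thread pools; tune the numbers for your machine.
config = tf.ConfigProto(intra_op_parallelism_threads=2,
                        inter_op_parallelism_threads=2)
sess = tf.Session(config=config)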
