I have my model written and trained in Keras. I'm trying to use it for inference in production. I receive SQS "task" messages containing a tuple of (path_in, path_out).
I can obviously use:
BATCH_SIZE = 10

batch_messages = []
while True:
    # block until a full batch of task messages has been read from SQS
    while len(batch_messages) < BATCH_SIZE:
        msg = sqs.read_message()
        batch_messages.append(msg)
    assert len(batch_messages) == BATCH_SIZE
    # read all images, run one batched predict on the GPU, then write the results
    batch = np.array([read_image(msg.path_in) for msg in batch_messages])
    output_batch = model.predict(batch)
    for i in range(BATCH_SIZE):
        write_output(output_batch[i], path=batch_messages[i].path_out)
    batch_messages = []
The problem with that is that most of the time is spent reading from SQS, loading the images from disk, and writing the results back at the end, so the GPU sits idle during all of that I/O.
I'm aware of Keras' Sequence, but I'm not sure whether it is intended for this case as well, i.e. for inference rather than training.
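Something like the following is what I have in mind for the Sequence route (a rough, untested sketch, reusing read_image and BATCH_SIZE from above; depending on the Keras version this would go through model.predict or model.predict_generator):

from tensorflow import keras
import numpy as np

class SQSImageSequence(keras.utils.Sequence):
    # Assumes the messages for one chunk of work have already been pulled from SQS.
    def __init__(self, messages, batch_size):
        self.messages = messages
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.messages) / self.batch_size))

    def __getitem__(self, idx):
        msgs = self.messages[idx * self.batch_size:(idx + 1) * self.batch_size]
        return np.array([read_image(m.path_in) for m in msgs])

# workers > 0 lets Keras read upcoming batches in the background
# while the GPU is busy with the current one.
outputs = model.predict(SQSImageSequence(batch_messages, BATCH_SIZE), workers=4)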
I would suggest using the TensorFlow Serving solution, as it implements a server-side batching strategy that optimizes inference speed and GPU utilization. Also, if you'd like to speed up your pipeline further, you could convert the model into a TensorRT model, which optimizes the model's operations for a specific GPU (and does a lot more).
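For illustration, here is a rough sketch of how the SQS loop from the question could hand batches to a TensorFlow Serving instance over its REST API, so that batching and GPU scheduling happen on the server (the model name my_model and port 8501 are placeholders, and sqs, read_image and write_output are the helpers from the question):

import json
import numpy as np
import requests

SERVING_URL = "http://localhost:8501/v1/models/my_model:predict"  # placeholder host/model name

def predict_batch(images):
    # TensorFlow Serving's REST predict API takes a JSON body with an "instances" list
    # and returns a "predictions" list of the same length.
    payload = {"instances": np.stack(images).tolist()}
    response = requests.post(SERVING_URL, data=json.dumps(payload))
    response.raise_for_status()
    return response.json()["predictions"]

batch_messages = []
while True:
    while len(batch_messages) < BATCH_SIZE:
        batch_messages.append(sqs.read_message())
    images = [read_image(msg.path_in) for msg in batch_messages]
    for msg, pred in zip(batch_messages, predict_batch(images)):
        write_output(np.array(pred), path=msg.path_out)
    batch_messages = []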
Related
I'm using the pytorch-forecasting library (which is based on pytorch-lightning) for running a TFT model for time series forecasting. My training routine is split into three phases: first I perform HPO using optuna, then a training+validation run, and finally a retraining on the full data (no validation).
Currently, both training+validation and retraining start from fresh models trained from scratch, so the runtime is quite high. I'm therefore trying to reduce the runtime of the whole routine by leveraging incremental training: loading the checkpointed model from phase 2 and retraining it for a smaller number of epochs in phase 3.
I have a method fit_model() which is used in both training/validation and retraining, but with different args. The core of fit_model() looks something like the following:
def fit_model(self, **kwargs):
    ...
    to_retrain = kwargs.get('to_retrain', False)
    ckpt_path = kwargs.get('ckpt_path', None)

    trainer = self._get_trainer(cluster_id, gpu_id, to_retrain)  # returns a pl.Trainer object
    tft_lightning_module = self._prepare_for_training(cluster_id, to_retrain)

    train_dtloaders = ...
    val_dtloaders = ...

    if not to_retrain:
        trainer.fit(
            tft_lightning_module,
            train_dataloaders=train_dtloaders,
            val_dataloaders=val_dtloaders
        )
    else:
        trainer.fit(
            tft_lightning_module,
            train_dataloaders=train_dtloaders,
            val_dataloaders=val_dtloaders,
            ckpt_path=ckpt_path
        )

    best_model_path = trainer.checkpoint_callback.best_model_path
    return best_model_path
When I call the above method in my retraining phase, I can see in the log that it loads the checkpointed model:
Restored all states from the checkpoint file at /tft/incremental_training/tft_training_20230206/171049/lightning_logs_3/lightning_logs/version_0/checkpoints/epoch=4-step=5.ckpt
But unfortunately, no further training happens in phase 3. If I look at the best_model_path returned by the method, it still points to the old checkpoint from the training/validation phase, not to one from the retraining phase. How can I resolve this issue?
I'm using the following libraries
pytorch-lightning==1.6.5
pytorch-forecasting==0.9.0
I finally got it working. The key thing to keep in mind is not to use the same number of epochs for training and retraining. If we train for x epochs and intend to retrain for y more epochs, then max_epochs has to be set to x + y, not y, during retraining.
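As a rough sketch (using the names from the snippet in the question; the actual Trainer construction is hidden inside _get_trainer, so only the max_epochs bookkeeping is shown):

import pytorch_lightning as pl

x = 5  # epochs already completed in the training+validation phase
y = 3  # additional epochs wanted in the retraining phase

# max_epochs must cover the epochs already recorded in the checkpoint;
# with max_epochs=y, Lightning restores the state and immediately considers training finished.
trainer = pl.Trainer(max_epochs=x + y)

trainer.fit(
    tft_lightning_module,
    train_dataloaders=train_dtloaders,
    val_dataloaders=val_dtloaders,
    ckpt_path=ckpt_path,  # checkpoint produced by the training+validation phase
)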
I am experimenting with the Tensorflow model optimization library and am trying to reduce the size of the SavedModel that is running in a production cluster with the goal of reducing operating costs while keeping as much performance as possible.
A few things I've read suggested I should try out pruning weights in the model.
I've tried it and so far have gotten very mixed results.
Here is the code for the model I am trying to prune.
import tensorflow as tf
import tensorflow_model_optimization as tfmot
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

n = 300000  # input vector dimension, it's not exactly 300k but it's close
code_dimension = 512

inputs = Input(shape=(n,))
outputs = Dense(code_dimension, activation='relu')(inputs)
outputs = Dense(code_dimension, activation='relu')(outputs)
outputs = Dense(n, activation='softmax')(outputs)

model = Model(inputs, outputs)
model.compile("adam", "cosine_similarity")
model.fit(training_data_generator, epochs=10, validation_data=validation_data_generator)
model.save("base_model.pb")

# model pruning starts here
pruning_schedule = tfmot.sparsity.keras.ConstantSparsity(
    target_sparsity=0.95, begin_step=0, end_step=-1, frequency=100
)
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]

model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=pruning_schedule)
model_for_pruning.compile(optimizer="adam", loss="cosine_similarity")
model_for_pruning.fit(training_data_generator, validation_data=validation_data_generator, epochs=2, callbacks=callbacks)

print(f"Mean model sparsity post-pruning: {mean_model_sparsity(model_for_pruning): .4f}")

# strip pruning not to carry around those extra parameters
model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning)
model_for_export.save("pruned_model.pb")
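(mean_model_sparsity is just a small helper that reports the fraction of zero-valued weights; a minimal version of it looks roughly like this:)

import numpy as np

def mean_model_sparsity(model):
    # average, over all weight tensors, of the fraction of entries that are exactly zero
    sparsities = [np.mean(w == 0.0) for w in model.get_weights()]
    return float(np.mean(sparsities))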
Here's the problem: when I set code_dimension to 32 or 64, the pruned_model.pb file ends up about 2-3 times smaller than base_model.pb after pruning and saving. However, when I use a code_dimension of 256 or 512, my pruned model is actually bigger than the base model.
I have a script that runs this and each time I run it I do a full reset of my environment.
Has anyone who used the TensorFlow model optimization library ever experienced this?
When setting up a data input pipeline to Tensorflow (web cam images), a large amount of time is spent loading the data from the system RAM to the GPU memory.
I am trying to feed a constant stream of images (1024x1024) through my object detection network. I'm currently using a V100 on AWS to perform inference.
The first attempt was with a simple feed dict operation.
# Get layers
img_input_tensor = sess.graph.get_tensor_by_name('import/input_image:0')
img_anchors_input_tensor = sess.graph.get_tensor_by_name('import/input_anchors:0')
img_meta_input_tensor = sess.graph.get_tensor_by_name('import/input_image_meta:0')
detections_input_tensor = sess.graph.get_tensor_by_name('import/output_detections:0')
detections = sess.run(detections_input_tensor,
                      feed_dict={img_input_tensor: molded_image,
                                 img_meta_input_tensor: image_meta,
                                 img_anchors_input_tensor: image_anchor})
This produced inference times around 0.06 ms per image.
However, after reading the Tensorflow manual I noticed that the tf.data API was recommended for loading data for inference.
# setup data input
data = tf.data.Dataset.from_tensors((img_input_tensor, img_meta_input_tensor, img_anchors_input_tensor, detections_input_tensor))
iterator = data.make_initializable_iterator() # create the iterator
next_batch = iterator.get_next()
# load data
sess.run(iterator.initializer,
         feed_dict={img_input_tensor: molded_image,
                    img_meta_input_tensor: image_meta,
                    img_anchors_input_tensor: image_anchor})
# inference
detections = sess.run([next_batch])[0][3]
This sped up inference to 0.01 ms, but loading the data took 0.1 ms. Overall, this iterator method is significantly slower than the 'slower' feed_dict method. Is there something I can do to speed up the loading process?
Here is a great guide on data pipeline optimization. I personally find the .prefetch method the easiest way to boost your input pipeline. However, the article provides much more advanced techniques.
However, if your input data is not in TFRecords but is fed by your own code, you will have to implement the described techniques (buffering, parallel preprocessing, interleaved operations) yourself; a minimal sketch follows below.
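For example, a minimal TF1-style pipeline that pulls frames from a Python generator, preprocesses them in parallel, and prefetches batches while the GPU is busy could look roughly like this (get_frame and preprocess stand in for your own camera-capture and preprocessing code):

import tensorflow as tf

def frame_generator():
    while True:
        yield get_frame()  # placeholder: returns one 1024x1024x3 uint8 frame

dataset = (
    tf.data.Dataset.from_generator(frame_generator,
                                   output_types=tf.uint8,
                                   output_shapes=(1024, 1024, 3))
    .map(preprocess, num_parallel_calls=4)  # placeholder preprocessing fn
    .batch(1)
    .prefetch(2)  # keep a couple of batches ready ahead of the GPU
)

next_batch = dataset.make_one_shot_iterator().get_next()
# Build (or re-wire) the detection graph so that it consumes next_batch directly,
# instead of feeding placeholders through feed_dict on every step.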
I am using Tensorflow's Object Detection API to detect cars. It should detect the cars as one class "car".
I followed sentdex's following series:
https://pythonprogramming.net/introduction-use-tensorflow-object-detection-api-tutorial/
System information:
OS - Ubuntu 18.04 LTS
GPU - Nvidia 940M (VRAM : 2GB)
Tensorflow : 1.10
Python - 3.6
CPU - Intel i5
Problem:
The training process runs fine. In order to know when the model converges and when I should stop training, I observe the per-step loss in the terminal where the training is running, and also watch the Total Loss graph in TensorBoard by running the following command in another terminal:
$tensorboard --logdir="training"
But even after training for 60k steps, the loss fluctuates between 1.2 and 2.1. If I stop the training and export the inference graph from the last checkpoint (saved in the training/ folder), it detects cars in some cases and gives false positives in others.
I also tried running eval.py like below,
python3 eval.py --logtostderr --pipeline_config_path=training/ssd_mobilenet_v1_pets.config --checkpoint_dir=training/ --eval_dir=eval/
but it gives an error indicating that the GPU runs out of memory when this script is run alongside train.py.
So I stop the training to make sure the GPU is free and then run eval.py, but it creates only one eval point in the eval/ folder. Why?
Also, how do I tell from the Precision graphs in TensorBoard that the training should be stopped?
I could also post screenshots if anyone wants.
Should I keep training until the loss stays around 1 on average?
Thanks.
PS: Added the Total Loss graph below, up to 66k steps.
PS2: After 2 days of training (and still running), this is the total loss graph below.
Usually, one uses a separate set of data to measure the error and generalisation abilities of the model. So, one would have the following sets of data to train and evaluate a model:
Training set: The data used to train the model.
Validation set: A separate set of data which will be used to measure the error during training. The data of this set is not used to perform any weight updates.
Test set: This set is used to measure the model's performance after the training.
In your case, you would have to define a separate set of data, the validation set, and run an evaluation on it repeatedly after a fixed number of batches/steps, logging the error or accuracy. What usually happens is that the error on that data decreases at the beginning of training and starts to increase again at a certain point. So it's important to keep track of that error and to generate a checkpoint whenever it reaches a new minimum. The checkpoint with the lowest error on your validation data is the one you want to use. This technique is called early stopping.
The reason the error increases after a certain point during training is called overfitting. It means the model is losing its ability to generalize to unseen data.
Edit:
Here's an example of a training loop with an early stopping procedure:
for step in range(1, _MAX_ITER):
    if step % _TEST_ITER == 0:
        sample_count = 0
        while True:
            try:
                test_data = sess.run(test_batch)
                test_loss, summary = self._model.loss(sess, test_data[0], self._assign_target(test_data), self._merged_summary)
                sess.run(self._increment_loss_opt, feed_dict={self._current_loss_pl: test_loss})
                sample_count += 1
            except tf.errors.OutOfRangeError:
                score = sess.run(self._avg_batch_loss, feed_dict={self._batch_count_pl: sample_count})
                best_score = sess.run(self._best_loss)
                if score < best_score:
                    '''
                    Save your model here...
                    '''
                break  # validation set exhausted, go back to training
I'm using the Denny Britz implementation of Yoon Kim's CNN for sentiment analysis in a slightly modified form (I added the word2vec approach, so that the weight matrix is not calculated from scratch).
With small datasets (like 10MB) it works fine, but if I try to train on datasets of size >50MB (still not very large) my GPU runs out of memory and throws the following error: http://pastebin.com/AMfYkpXZ
The GPU is a GeForce GTX 1080 with 8 GB of memory.
I worked out that the error comes from the dev step/evaluation step:
def dev_step(x_batch, y_batch, writer=None):
    """
    Evaluates model on a dev set
    """
    feed_dict = {
        cnn.input_x: x_batch,
        cnn.input_y: y_batch,
        cnn.dropout_keep_prob: 1.0
    }
    step, summaries, loss, accuracy = sess.run(
        [global_step, dev_summary_op, cnn.loss, cnn.accuracy],
        feed_dict)
    time_str = datetime.datetime.now().isoformat()
    print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
    if writer:
        writer.add_summary(summaries, step)
To be precise, it comes from the sess.run([global_step, dev_summary_op, cnn.loss, cnn.accuracy], feed_dict) line.
If I comment out the whole dev_step, the training runs without throwing errors.
Do you have an idea why this error occurs and how I can fix it? Thanks in advance!
Edit:
The whole code is available at: https://gist.github.com/pexmar/7b3d074825eeaf5a336fce406d8e9bae
Check the size of the batches you're passing in to dev_step compared to the size of batches you're running in train_step. You may have to evaluate the test set (which I guess is the same as dev?) in batches as well.
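For example, a rough way to evaluate the dev set in chunks and average the results (a sketch using the variable names from the question; the batch size of 64 is arbitrary, and the summary writer is omitted for brevity):

import numpy as np

def dev_step_batched(x_dev, y_dev, batch_size=64):
    # Evaluate the dev set in small batches so it never has to fit on the GPU at once.
    losses, accuracies = [], []
    for start in range(0, len(x_dev), batch_size):
        feed_dict = {
            cnn.input_x: x_dev[start:start + batch_size],
            cnn.input_y: y_dev[start:start + batch_size],
            cnn.dropout_keep_prob: 1.0
        }
        step, loss, accuracy = sess.run(
            [global_step, cnn.loss, cnn.accuracy], feed_dict)
        losses.append(loss)
        accuracies.append(accuracy)
    print("step {}: dev loss {:g}, dev acc {:g}".format(
        step, np.mean(losses), np.mean(accuracies)))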