Running Keras LSTM models on multiple GPUs (TensorFlow backend) - keras

I have a Keras LSTM model and want to run it on multiple GPUs for a speed improvement, but I have a few points of confusion:
1- I found that to really get good speed on a GPU I should define my network using the CuDNNLSTM layer and not the normal LSTM layer. To use multiple GPUs, I looked at the Keras documentation and wanted to use the multi_gpu_model() function to build a distributed model. However, the sample scripts recommend defining the model on the CPU for easy weight sharing, but my CuDNNLSTM model cannot be placed on the CPU, and an LSTM model will not benefit from the GPU-specific enhancements. What is the correct approach?
2- So I tried many configurations, including:
Group 1 (using the normal, non-CuDNN LSTM layers): placing the model on the CPU with no copies on the GPUs; placing the model on the CPU and then using multi_gpu_model to create GPU copies; placing the model on the default GPU with no copies on the other GPU; placing the model on the default GPU and then using multi_gpu_model to create two GPU copies.
Group 2 (using the CuDNNLSTM layer, so the model cannot be placed on the CPU): defining a single model (which TensorFlow places on the default GPU); using multi_gpu_model to create two GPU copies.
In all cases, data parallelism (using multi_gpu_model) resulted in slower execution. I didn't change anything else in my code, input data pipeline, or batch sizes. What am I doing wrong?
3- In general, should I use only the CuDNN-type layers to get fast computation on GPUs when I am programming at the high level of the Keras API?
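For reference, the pattern from the Keras documentation mentioned in point 1 looks roughly like this (a minimal sketch assuming Keras 2.x on the TensorFlow 1.x backend; the layer sizes and input shape are placeholders):
import tensorflow as tf
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.utils import multi_gpu_model

# Build the template model on the CPU so its weights live in host memory
with tf.device('/cpu:0'):
    model = Sequential([
        LSTM(128, input_shape=(100, 32)),
        Dense(1),
    ])

# Replicate it on two GPUs; each replica processes a slice of every batch
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(loss='mse', optimizer='adam')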

Related

PyTorch: pack/optimize many models into one GPU

Does PyTorch offer any facilities or wrappers to pack several instances of a model into a single GPU and jointly optimize them (e.g., each with different optimizer settings, or with different training instances)?
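As far as I know there is no dedicated wrapper for this, but the basic pattern is to place each instance on the same device and give each one its own optimizer; a minimal sketch (plain PyTorch, with a placeholder linear model and random data, not an official facility):
import torch
import torch.nn as nn

device = torch.device('cuda:0')
# Two instances of the same architecture on one GPU, each with different optimizer settings
models = [nn.Linear(16, 1).to(device) for _ in range(2)]
optimizers = [torch.optim.SGD(m.parameters(), lr=lr) for m, lr in zip(models, [0.1, 0.01])]
loss_fn = nn.MSELoss()

x = torch.randn(32, 16, device=device)  # dummy batch
y = torch.randn(32, 1, device=device)

for model, opt in zip(models, optimizers):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()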

Is there any way to speed up the prediction process for TensorFlow Lattice?

I built my own model with the Keras premade models in TensorFlow Lattice using Python 3.7 and saved the trained model. However, when I use the trained model for prediction, predicting each data point takes on the order of milliseconds, which seems very slow. Is there any way to speed up the prediction process for tfl?
There are multiple ways to improve speed, but they may involve a tradeoff with prediction accuracy. I think the three most promising options are:
Reduce the number of features
Reduce the number of lattices per feature
Use an ensemble of lattice models where every lattice model only gets a subset of the features, and then average the predictions of the different models (as described here)
As the lattice model is a standard Keras model, I recommend trying OpenVINO. It optimizes your model by converting it to the Intermediate Representation (IR), performing graph pruning, and fusing some operations into others while preserving accuracy. Then it uses vectorization at runtime. OpenVINO is optimized for Intel hardware, but it should work with any CPU.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way is to use pip. Alternatively, you can use this tool to find the best installation option for your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert the HDF5 model, so you have to save it as SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (change the data_type flag). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. If you care about latency, I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement. If you care about throughput, change the value to THROUGHPUT or CUMULATIVE_THROUGHPUT.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT":"LATENCY"})
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

Distributed training on CPU's multiple cores in Keras

According to what I have read in different posts and blogs, and from my limited trials, it looks like both TensorFlow and Keras use as many CPU cores as are available on an individual machine. That makes sense (and is indeed very nice), but does it mean that Keras does distributed training across multiple cores? This is what I want to do: I have a moderate model but a large dataset and want to distribute the learning of different batches between cores. I don't have much knowledge of parallel and distributed processing, but I would guess that distributed learning requires some additional handling of data, gradient calculation, and aggregation on top of basic multithreading/multitasking. Does Keras do such a thing automatically when using different cores of the same CPU on an individual machine? How can I check it?
And if I want to go further and use multiple computers for distributed training, what everybody refers to is https://www.tensorflow.org/deploy/distributed .
But it is a bit complicated and doesn't mention anything about Keras specifically. Is there any other source that specifically discusses distributed training of Keras models over TensorFlow?
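For what it's worth, the tf.distribute API that later replaced the page linked above is the Keras-level way to do data-parallel training; a minimal sketch of the multi-machine case (recent TensorFlow 2.x, placeholder model, the fit call is commented out because it needs a real dataset and a TF_CONFIG per worker):
import tensorflow as tf

# Each worker runs this same script; the TF_CONFIG environment variable tells it who its peers are.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer='adam', loss='mse')

# model.fit(dataset, epochs=5)  # gradients are aggregated across workers automatically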

Training one model with several GPUs

How can you program Keras or TensorFlow to partition training across multiple GPUs? Let's say you are on an Amazon EC2 instance that has 8 GPUs and you want to use all of them to train faster, but your code is written for a single CPU or GPU.
Yes, you can run Keras models on multiple GPUs. This is only possible with the TensorFlow backend for the time being, because the Theano feature is still rather new. We are looking at adding support for multi-GPU in Theano in the near future (it should be fairly straightforward).
With the TensorFlow backend, you can achieve this the same way as you would in pure TensorFlow: by using the with tf.device(d) scope when defining Keras layers.
Originally from here
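A minimal sketch of that device-scope pattern (the two-GPU split of layers shown here is illustrative, assuming Keras 2.x on the TensorFlow 1.x backend and two visible GPUs):
import tensorflow as tf
from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(32,))
with tf.device('/gpu:0'):
    hidden = Dense(64, activation='relu')(inputs)  # first part of the graph on GPU 0
with tf.device('/gpu:1'):
    outputs = Dense(1)(hidden)                     # second part on GPU 1

model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='mse')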

Train CNN-based object detection models on CPU only

We have CNN-based object detection models such as Fast R-CNN, Faster R-CNN, YOLO (You Only Look Once), and SSD (Single Shot Detector).
I have tried running Faster R-CNN using Caffe, but the backward pass is not implemented for CPU mode. Is there any CNN-based model that I can train on a CPU?
Any help will be appreciated.
Faster R-CNN layers on CPU: https://github.com/neuleaf/faster-rcnn-cpu
SSD's original implementation already supports CPU training.
