I'm trying to figure out which is the best way to run multiple pre-trained models for inference in TensorFlow (in the context of single-machine execution). I read several questions about this, but I'm still a bit confused.
For example, suppose we want to run 2 deep networks in 2 different processes/threads.
To my understanding, I can do this by:
running the two models within the same session (i.e., two processes/threads that share a single session), or
running two different sessions, one for each network/process, or
using TensorFlow Serving.
If I run this on a platform consisting of only a multi-core CPU, I imagine that the difference (from an execution point of view) is that with a single session there is a single intra-op thread pool and a single inter-op thread pool, whereas in the second case they are distinct.
How does TensorFlow Serving modify the way in which the networks are executed on CPUs?
To my understanding, there is a benefit in using it when TensorFlow is used with GPUs, since it groups individual inference requests for joint execution.
In heterogeneous architectures, does TF Serving adopt a different graph partitioning among devices with respect to the one used in training (i.e., the one described in the TensorFlow White Paper)?
Among these three possibilities, which one is actually adopted in production environments?
If there are specific documents that I missed which describe these characteristics of TensorFlow, please point me to them.
You likely don't want the same device placement used for serving as you did for training.
In fact, when you export/save your TensorFlow model for serving, you can specify 'clear_devices=True' (or similar).
We prefer TF Serving over the other 2 methods mentioned above, and we experiment with CPU/GPU/TPU for each and every new model that we build.
And while we always have an intuition about the best hardware, we always test our models in live production to learn the model's empirical behavior.
We're also experimenting with different runtimes like TensorRT from Nvidia (GPU) and different post-training optimization techniques like TensorFlow Lite (tfcompile).
We use PipelineAI for our TensorFlow model experimentation, optimization, and cluster deployments:
https://github.com/PipelineAI/pipeline
According to what I have read in different posts and blogs, and my limited trials, it looks like both TensorFlow and Keras use as many CPU cores as are available on an individual machine. That makes sense (and is indeed very nice), but does it mean that Keras will do distributed training across multiple cores? This is what I want to do: I have a moderate model but a large dataset, and I want to distribute the learning of different batches between cores. I don't have much knowledge of parallel and distributed processing, but I guess that distributed learning requires some additional handling of data and of gradient calculation and aggregation on top of basic multithreading/multitasking. Does Keras do such a thing automatically when using different cores of the same CPU in an individual machine? How can I check it?
And if I want to go further and use multiple computers for distributed training, what everybody refers to is https://www.tensorflow.org/deploy/distributed.
But it is a bit complicated and doesn't mention anything specifically about Keras. Is there any other source that specifically discusses distributed training of Keras models over TensorFlow?
I'm not an expert in distributed systems or CUDA. But there is one really interesting feature that PyTorch supports: nn.DataParallel and nn.DistributedDataParallel. How are they actually implemented? How do they separate common embeddings and synchronize data?
Here is a basic example of DataParallel.
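    import numpy as np
    import torch
    import torch.nn as nn

    class Model(nn.Module):
        def __init__(self):
            # nn.Module.__init__ takes no keyword arguments;
            # submodules are registered by assigning them as attributes.
            super().__init__()
            self.embedding = nn.Embedding(1000, 10)
            self.rnn = nn.Linear(10, 10)

        def forward(self, x):
            x = self.embedding(x)
            x = self.rnn(x)
            return x

    # DataParallel expects the wrapped module's parameters on the first GPU.
    model = nn.DataParallel(Model().cuda())
    x = torch.from_numpy(np.array([1, 2, 3, 4, 5, 6], dtype=np.int64)).cuda()
    # Call the wrapper itself rather than .forward() so the input is scattered.
    out = model(x).cpu()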
PyTorch can split the input and send them to many GPUs and merge the results back.
How does it manage embeddings and synchronization for a parallel model or a distributed model?
I wandered around PyTorch's code but it's very hard to know how the fundamentals work.
That's a great question.
PyTorch's DataParallel paradigm is actually quite simple and the implementation is open-sourced here. Note that this paradigm is not recommended today, as it bottlenecks at the master GPU and is not efficient in data transfer.
This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the batch
dimension (other objects will be copied once per device). In the forward
pass, the module is replicated on each device, and each replica handles a
portion of the input. During the backwards pass, gradients from each replica
are summed into the original module.
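The reason summing per-replica gradients is valid can be sketched with a toy example. This is a minimal illustration in plain Python, not PyTorch's implementation: the one-parameter model, the helper names (chunk, grad_sum, data_parallel_grad) and the data are all made up for the sketch. It shows that splitting a batch into chunks, computing the gradient on each chunk with a copy of the same weight, and summing the results gives the same gradient as processing the whole batch at once.

```python
# Toy illustration only: a one-parameter linear model y = w * x with squared
# error, written in plain Python. No PyTorch calls are used here.

def chunk(batch, num_replicas):
    """Split a batch into contiguous chunks, one per replica (the last may be shorter)."""
    size = -(-len(batch) // num_replicas)  # ceiling division
    return [batch[i:i + size] for i in range(0, len(batch), size)]

def grad_sum(w, samples):
    """Gradient w.r.t. w of the summed loss 0.5 * (w * x - y)**2 over the samples."""
    return sum((w * x - y) * x for x, y in samples)

def data_parallel_grad(w, batch, num_replicas):
    """Each replica sees the same w and one chunk; replica gradients are summed."""
    replica_grads = [grad_sum(w, part) for part in chunk(batch, num_replicas)]
    return sum(replica_grads)

batch = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0), (5.0, 10.5)]
w = 0.5
full = grad_sum(w, batch)                              # whole batch on one device
split = data_parallel_grad(w, batch, num_replicas=2)   # two simulated replicas
print(full, split)
```

Because the loss is a sum over samples, the gradient is too, which is why the replicas never need to see each other's data during the forward pass.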
As for DistributedDataParallel, that's trickier. This is currently the more advanced approach and it is quite efficient (see here).
This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the batch
dimension. The module is replicated on each machine and each device, and
each such replica handles a portion of the input. During the backwards
pass, gradients from each node are averaged.
There are several approaches to averaging the gradients from each node. I would recommend this paper [1] to get a real sense of how things work. Generally speaking, transferring data from one GPU to another costs bandwidth and time, and we want that part to be really efficient. So one possible approach is to connect the GPUs pairwise in a circle with a really fast protocol, and to pass only part of the gradients from one to another, such that in total we transfer less data, more efficiently, and all the nodes get all the gradients (or at least their average). There will still be a master GPU in that situation, or at least a master process, but now there is no bottleneck on any GPU; they all share the same amount of data (up to...).
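The "circle" idea can be sketched as a ring all-reduce. The following is a single-process simulation in plain Python under stated assumptions: the function name ring_allreduce_mean and the list-of-lists representation are made up for illustration, and nothing here claims to be the protocol PyTorch actually uses. Each simulated node passes one chunk per step to its neighbor, first accumulating sums and then circulating the finished chunks.

```python
def ring_allreduce_mean(grads):
    """Average equal-length gradient vectors held by nodes arranged in a ring."""
    n = len(grads)
    length = len(grads[0])
    # Chunk c covers indices bounds[c] up to (not including) bounds[c + 1].
    bounds = [c * length // n for c in range(n + 1)]
    data = [list(g) for g in grads]  # each node's working buffer

    def exchange(chunk_to_send, combine):
        # All nodes send at the same time, so collect every message first.
        messages = []
        for i in range(n):
            c = chunk_to_send(i)
            lo, hi = bounds[c], bounds[c + 1]
            messages.append(((i + 1) % n, lo, data[i][lo:hi]))
        for dest, lo, payload in messages:
            for k, value in enumerate(payload):
                data[dest][lo + k] = combine(data[dest][lo + k], value)

    # Phase 1 (reduce-scatter): after n - 1 steps, node i holds the complete
    # sum for chunk (i + 1) % n.
    for s in range(n - 1):
        exchange(lambda i, s=s: (i - s) % n, lambda old, new: old + new)
    # Phase 2 (all-gather): pass the finished chunks around, overwriting
    # the partial values the other nodes still hold.
    for s in range(n - 1):
        exchange(lambda i, s=s: (i + 1 - s) % n, lambda old, new: new)

    return [[v / n for v in node] for node in data]

grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
result = ring_allreduce_mean(grads)
print(result[0])
```

In this scheme every node sends 2 * (n - 1) chunks of roughly length / n values each, so the traffic per node stays close to twice the gradient size no matter how many nodes join the ring, and no single node has to receive everything.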
Now this can be further optimized if we don't wait for all the batches to finish computing and instead do a time-sharing thing where each node sends its portion as soon as it is ready. Don't quote me on the details, but it turns out that if we don't wait for everything to end, and do the averaging as soon as we can, it might also speed up the gradient averaging.
Please refer to literature for more information about that area as it is still developing (as of today).
PS 1: Distributed training usually works better on machines that are set up for that task, e.g. AWS deep learning instances that implement those protocols in HW.
PS 2: Disclaimer: I really don't know which protocol the PyTorch devs chose to implement, or what is chosen according to what. I work with distributed training and prefer to follow PyTorch best practices without trying to outsmart them. I recommend that you do the same unless you are really into researching this area.
References:
[1] Distributed Training of Deep Learning Models: A Taxonomic Perspective
Approaches to ML parallelism with PyTorch
DataParallel & DistributedDataParallel
Model parallel https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
See Will switching GPU device affect the gradient in PyTorch back propagation?
I want to implement the Asynchronous Advantage Actor Critic (A3C) model for reinforcement learning in my local machine (1 CPU, 1 cuda compatible GPU). In this algorithm, several "learner" networks interact with copies of an environment and update a central model periodically.
I've seen implementations that create n "worker" networks and one "global" network inside the same graph and use threading to run these. In these approaches, the global net is updated by applying gradients to the trainable parameters with a "global" scope.
However, I recently read a bit about distributed TensorFlow and now I'm a bit confused. Would it be easier/faster/better to implement this using the distributed TensorFlow API? In the documentation and talks they always make explicit mention of using it in multi-device environments. I don't know if it's overkill to use it in a local async algorithm.
I would also like to ask, is there a way to batch the gradients calculated by every worker to be applied together after n steps?
After implementing both, I found that using threading is simpler than the distributed tensorflow API; however, it also runs slower. The more CPU cores you use, the faster distributed tensorflow becomes compared to threads.
However, this only holds for asynchronous training. If the available CPU cores are limited and you want to make use of a GPU, you might want to use synchronous training with multiple workers instead, like OpenAI does in their A2C implementation. There, only the environments are parallelized (through multiprocessing) and tensorflow uses the GPU without any graph parallelization. OpenAI reported that their results were better with synchronous training than with A3C.
edit:
Here are some more details:
The problem with distributed tensorflow for A3C is that you need to run multiple tensorflow forward passes (to get the actions during the n steps) before you call the learning step. However, since you learn asynchronously, the other workers will change your network during those n steps. So your policy will change during the n steps and the learning step will happen with the wrong weights. Distributed tensorflow will not prevent that. Therefore you need a global and a local network in distributed tensorflow as well, which makes the implementation no easier than one with threading (and for threading you don't have to learn how to make distributed tensorflow work). Runtime-wise, on 8 CPU cores or fewer there will be no large difference.
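The local/global split, and batching the gradients of n steps into a single update, can be sketched with plain threads. This is a toy under stated assumptions: it uses no TensorFlow and no real environment, the "network" is a single float, the "loss" is 0.5 * (w - goal)**2, and all names (GlobalModel, worker, train) are made up for illustration.

```python
import threading

class GlobalModel:
    """Shared parameter plus a lock; stands in for the 'global' network."""
    def __init__(self, weight):
        self.weight = weight
        self.updates = 0
        self.lock = threading.Lock()

    def snapshot(self):
        with self.lock:
            return self.weight

    def apply(self, grad, lr):
        with self.lock:
            self.weight -= lr * grad
            self.updates += 1

def worker(global_model, goal, rounds, n_steps, lr):
    for _ in range(rounds):
        # Sync the local copy once per rollout; it stays frozen for n steps,
        # so updates from other workers cannot change it mid-rollout.
        local_weight = global_model.snapshot()
        accumulated = 0.0
        for _ in range(n_steps):
            accumulated += local_weight - goal  # gradient of 0.5 * (w - goal)**2
        # One batched update after n steps instead of n separate updates.
        global_model.apply(accumulated / n_steps, lr)

def train(num_workers, rounds=50, n_steps=5, lr=0.25, goal=3.0):
    model = GlobalModel(weight=0.0)
    threads = [
        threading.Thread(target=worker, args=(model, goal, rounds, n_steps, lr))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return model

single = train(num_workers=1)
print(single.weight, single.updates)
```

With several workers the gradients are computed against snapshots that may already be stale when they are applied, which is the usual trade-off of asynchronous updates; the frozen local copy only guarantees that each rollout is internally consistent.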
Can anyone tell me what the advantage is of forcing all variables to reside on the CPU, as is done in the tensorflow inception v3 code (here) or in the cifar10 code (here)? Do the variables not need to reside on the GPUs too, for executing the forward or backward computation?
Some people have observed that putting variables on GPU:0 in cifar10 makes things faster, https://github.com/tensorflow/tensorflow/issues/4881
It makes sense to keep parameters on CPU when you don't have P2P transfer capability between your GPUs.
I have a CNN model. Requests to use this model, for example to classify a picture, arrive about once a second.
I would like to collect the requests as new unsupervised data and keep training my model.
My question is: how can I handle the training task and the classification task effectively?
I will explain why it becomes a problem:
Every training step takes a long time (at least several seconds), uses the GPU, and is not interruptible. So if my classification tasks use the GPU too, I cannot respond to the requests in time. I would like to run the classification tasks on the CPU, but it looks like Theano does not support two different config.device settings in one process.
Multi-process is not acceptable, because my memory is limited and Theano uses too much of it.
Any help or advice would be appreciated.
You could build two separate copies of the same CNN, one on the CPU and one on the GPU. I think this could be done under either the old GPU backend or the new one, but in different ways. Some ideas:
Under the old backend:
Load Theano with device=cpu. Build your inference function and compile it. Then call theano.sandbox.cuda.use('gpu'), build a new copy of your inference function, and take gradients of that one to make any training functions. Now the inference function should execute on the CPU, and the training should happen on the GPU. (I've never done this on purpose, but I had it happen to me by accident!)
Under the new backend:
As far as I know, you have to tell Theano about any GPUs right when importing, not later. In this case, you could use THEANO_FLAGS="contexts=dev0->cuda0", which doesn't force using one device over another. Then build the inference version of your function like normal, and for the training version, again put all the shared variables on the GPU, and the input variables to any of your training functions should also be GPU variables (e.g. input_var_1.transfer('dev0')). When all your functions are compiled, look at the programs using theano.printing.debugprint(function) to see what's on GPU vs CPU. (When compiling the CPU functions, it might give a warning that it cannot infer the context, and as far as I've seen, that lands it on the CPU...not sure if this behavior is safe to depend on.)
In either case, this depends on your GPU-based functions NOT RETURNING ANYTHING TO THE CPU (make sure the output variables are GPU ones). This should allow the training function to run concurrently with your inference function, and later you copy what you need to the CPU. For example, after you take a training step, just copy the new values over to your inference network parameters.
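The "copy the new values over after a training step" pattern can be sketched independently of Theano. The following is a plain-Python toy using threads; it makes no Theano calls, and every name in it (InferenceParams, classify, training_loop) is hypothetical. The trainer keeps its own parameter list and publishes a copy after each step, while requests are answered from the published copy.

```python
import threading

class InferenceParams:
    """Parameter copy used for serving; swapped wholesale after a training step."""
    def __init__(self, values):
        self._values = list(values)
        self._lock = threading.Lock()

    def publish(self, new_values):
        fresh = list(new_values)   # copy outside the lock
        with self._lock:
            self._values = fresh   # the swap is quick, so requests barely wait

    def read(self):
        with self._lock:
            return self._values

def classify(params, x):
    """Hypothetical inference function: thresholded dot product."""
    w = params.read()
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= 0 else 0

def training_loop(train_values, served, steps):
    """Hypothetical trainer: each step nudges the weights, then publishes a copy."""
    for _ in range(steps):
        train_values[:] = [v + 0.1 for v in train_values]  # placeholder for a real update
        served.publish(train_values)

served = InferenceParams([-1.0, -1.0])
train_values = [-1.0, -1.0]
before = classify(served, [1.0, 1.0])
trainer = threading.Thread(target=training_loop, args=(train_values, served, 30))
trainer.start()
during = classify(served, [1.0, 1.0])  # requests are answered while training runs
trainer.join()
after = classify(served, [1.0, 1.0])
print(before, during, after)
```

The design choice is that inference never touches the training parameters directly, so a long training step cannot block a request; the request only ever waits for the short swap.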
Let us hear what you come up with!