PyTorch batch normalization in distributed training - pytorch

I'm wondering how distributed PyTorch handles batch norm. When I add a batch norm layer, will the PyTorch engine use the same allreduce call to sync the data across nodes, or does the batch norm only happen on the local node?

It behaves similarly to DataParallel (check the first Warning box in the DataParallel docs). It will compute the normalization statistics separately for each node (or, more precisely, for each GPU). It will not sync the running estimates of those statistics either, but it will keep the values from one of the GPUs in the end. So, assuming the examples are distributed across your cluster randomly, your BatchNorm will work roughly as expected, except that its estimates of the normalization factors will have higher variance due to the smaller effective sample size.
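If you do want the batch statistics synchronized across processes, PyTorch also provides torch.nn.SyncBatchNorm. A minimal sketch of converting a model before the usual DistributedDataParallel wrapping, assuming the process group has already been initialized elsewhere:

import torch
import torch.nn as nn

# Toy model with an ordinary BatchNorm layer: under DistributedDataParallel
# its statistics would be computed per process (per GPU).
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

# Replace every BatchNorm*d with SyncBatchNorm so the batch statistics are
# all-reduced across the processes in the default process group.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# The usual DDP wrapping would then follow, e.g. (requires an initialized
# process group and one GPU per process):
# model = nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[local_rank])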

Related

What does model.eval() do for a batch normalization layer?

Why does the testing data use the mean and variance of all the training data? To keep the distribution consistent? What is the difference in the BN layer between model.train() and model.eval()?
It fixes the mean and variance computed during the training phase by keeping estimates of them in running_mean and running_var. See the PyTorch documentation.
As noted there, the implementation is based on the description in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. By drawing on the whole training data, one gets (given similar train/test data) a better estimate of the mean/variance for the (unseen) test set.
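To make this concrete, here is a small illustration (the data and numbers are made up; the point is only how the running estimates behave in train() versus eval() mode):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)

# In train() mode each forward pass normalizes with the current batch's own
# mean/variance and nudges running_mean / running_var towards them.
bn.train()
for _ in range(100):
    bn(torch.randn(32, 4) * 2.0 + 5.0)   # data with mean ~5 and std ~2
print(bn.running_mean)   # drifts towards ~5
print(bn.running_var)    # drifts towards ~4

# In eval() mode the stored running estimates are used instead of batch
# statistics, so the output no longer depends on the rest of the batch.
bn.eval()
out = bn(torch.randn(8, 4) * 2.0 + 5.0)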
Also similar questions have been asked here: What does model.eval() do?

PyTorch training with batches of different lengths?

Is it possible to train a model with batches that have unequal lengths during an epoch? I am new to PyTorch.
If you take a look at the DataLoader documentation, you'll see a drop_last parameter, which exists because when the dataset size is not divisible by the batch size, the last batch has a different size. So basically the answer is yes, it is possible, it happens often, and it does not affect (too much) the training of a neural network.
However, you must be a bit careful: some PyTorch layers deal poorly with very small batch sizes. For example, if you happen to have BatchNorm layers and you get a batch of size 1, you'll get errors, because batchnorm cannot estimate a variance from a single sample (at some point it divides by len(batch) - 1). More generally, training a network that has batchnorms generally requires batches of significant size, say at least 16 (the literature generally aims for 32 or 64). So if you happen to have variable-size batches, take the time to check whether your layers have requirements in terms of batch size for optimal training and convergence. But except in particular cases, your network will train anyway, no worries.
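For illustration, the batch-of-one failure mode looks like this (the exact error type and message may vary between PyTorch versions):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(8)
bn.train()   # in eval() mode this would work, since running stats are used

try:
    bn(torch.randn(1, 8))   # a single sample: no variance can be estimated
except (ValueError, RuntimeError) as e:
    print(e)   # e.g. "Expected more than 1 value per channel when training ..."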
As for how to make your batches with custom sizes, I suggest you look at, and take inspiration from, the PyTorch implementation of DataLoader and Sampler. You may want to implement something similar to BatchSampler and use the batch_sampler argument of DataLoader, as sketched below.
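As a rough sketch of that last suggestion: the batch_sampler argument only needs an iterable of index lists, so variable-size batches can be expressed directly (the sizes below are arbitrary, purely for illustration):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset of 10 samples.
dataset = TensorDataset(torch.arange(10, dtype=torch.float32).unsqueeze(1))

# Each inner list is one batch of indices; sizes 3, 1, 4 and 2 here.
batches = [[0, 1, 2], [3], [4, 5, 6, 7], [8, 9]]

loader = DataLoader(dataset, batch_sampler=batches)
for (x,) in loader:
    print(x.shape)   # torch.Size([3, 1]), torch.Size([1, 1]), ...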

How is a minibatch processed by the GPU in PyTorch?

I am trying to understand how PyTorch actually performs a forward pass over a minibatch. When a minibatch is processed by a network, is each example in the minibatch (e.g. each image) sent forwards individually, one after the other? Or are all examples in the minibatch sent forwards at the same time?
When an example is sent forwards through a network, the additional memory requirement is the activations at each layer. And as long as the network does not take up the entire GPU, then it seems that multiple instantiations of these activations could be stored at the same time. Each instantiation could then be used to store the activations for one example in the minibatch. And therefore, multiple examples could be sent through the network simultaneously. However, I'm unsure whether this is actually done in practice.
I have done some simple experiments, and the time for a forward pass is roughly proportional to the minibatch size. This suggests that the examples are sent through one after the other. If so, then why is it that people say that training is faster when the minibatch size is larger? It seems that the processing time for an entire epoch would not be dependent on the minibatch size.
I am trying to understand how PyTorch actually performs a forward pass over a minibatch. When a minibatch is processed by a network, is each example in the minibatch (e.g. each image) sent forwards individually, one after the other? Or are all examples in the minibatch sent forwards at the same time?
All at the same time. To do so, it relies on batch processing, broadcasting, element-wise vectorization for non-linear operations (basically, a highly optimized for-loop, sometimes run in parallel) and matrix linear algebra. The latter is much more efficient than a for-loop, since it can leverage hardware components designed for parallel linear algebra (this is true for both CPU and GPU, but GPUs are especially well suited for it).
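A tiny illustration of that principle (this is not PyTorch internals, just the idea that one batched operation replaces a Python loop over examples):

import torch

X = torch.randn(64, 128)   # a minibatch of 64 examples
W = torch.randn(128, 10)   # the weight of a linear layer

# One batched matrix multiply over the whole minibatch ...
batched = X @ W

# ... gives the same result as processing the examples one by one,
# but runs as a single optimized kernel instead of 64 small ones.
looped = torch.stack([x @ W for x in X])

print(torch.allclose(batched, looped, atol=1e-5))   # True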
Each instantiation could then be used to store the activations for one example in the minibatch. And therefore, multiple examples could be sent through the network simultaneously. However, I'm unsure whether this is actually done in practice.
This is not how it works; torch keeps track of "operations", each of which has a backward function used to compute the gradient of the outputs with respect to the inputs. It is designed to support batch processing and vectorization, so that processing a bunch of samples is done at once, in a single backward pass.
I have done some simple experiments, and the time for a forward pass is roughly proportional to the minibatch size.
This is not true. It may be because you are already eating up 100% of the available resources (CPU or GPU), or because you are not doing the profiling properly (which is not so easy to do). If you post an example, one can try to help you on this point.
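For reference, a rough way to time the forward pass is sketched below; CUDA kernels launch asynchronously, so without torch.cuda.synchronize() the measured times can be misleading. The model and sizes here are arbitrary placeholders:

import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)

def time_forward(batch_size, iters=50):
    x = torch.randn(batch_size, 1024, device=device)
    with torch.no_grad():
        for _ in range(5):            # warm-up (CUDA init, memory allocation)
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for queued kernels before timing
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

for bs in (1, 8, 64, 512):
    print(bs, time_forward(bs))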

validation loss higher only for some tasks

I am training a multi-task network, and it seems that the validation loss is higher than the training loss only for some tasks; for the others, the network seems to converge pretty well. For one task in particular, the validation loss is much higher than the training one, and it affects the average. I added some data augmentation, normalization, dropout, batch norm, etc. to avoid overfitting in general. How can I handle this one single task?
I recommend focusing on that single task. Try to study the residuals (for regression) or the errors (in case you are working with classification). Without more info on the problem I cannot help you more than that!

How do PyTorch's parallel and distributed methods work?

I'm not an expert in distributed systems or CUDA. But there is one really interesting feature that PyTorch supports: nn.DataParallel and nn.DistributedDataParallel. How are they actually implemented? How do they separate common embeddings and synchronize data?
Here is a basic example of DataParallel.
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(1000, 10)
        self.rnn = nn.Linear(10, 10)

    def forward(self, x):
        x = self.embedding(x)
        x = self.rnn(x)
        return x

# DataParallel splits the input batch across the visible GPUs,
# runs a replica of the model on each, and gathers the results.
model = nn.DataParallel(Model()).cuda()
out = model(torch.tensor([1, 2, 3, 4, 5, 6], dtype=torch.int64).cuda()).cpu()
PyTorch can split the input, send the chunks to multiple GPUs, and merge the results back.
How does it manage embeddings and synchronization for a parallel model or a distributed model?
I wandered around PyTorch's code but it's very hard to know how the fundamentals work.
That's a great question.
The PyTorch DataParallel paradigm is actually quite simple and the implementation is open-sourced here. Note that this paradigm is not recommended today, as it bottlenecks at the master GPU and is not efficient in data transfer.
This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the batch
dimension (other objects will be copied once per device). In the forward
pass, the module is replicated on each device, and each replica handles a
portion of the input. During the backwards pass, gradients from each replica
are summed into the original module.
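For intuition, the steps described in that docstring map onto the scatter / replicate / parallel_apply / gather primitives exposed in torch.nn.parallel. A hand-rolled sketch of the same forward-pass flow (assuming at least one visible CUDA device) might look like this:

import torch
import torch.nn as nn
from torch.nn.parallel import replicate, scatter, parallel_apply, gather

device_ids = list(range(torch.cuda.device_count()))   # e.g. [0, 1]
module = nn.Linear(10, 10).cuda(device_ids[0])
inputs = torch.randn(8, 10).cuda(device_ids[0])

# 1. Chunk the batch along dim 0 and send one chunk to each device.
chunks = scatter(inputs, device_ids)
# 2. Copy the module onto every device that received a chunk.
replicas = replicate(module, device_ids[:len(chunks)])
# 3. Run each replica on its chunk (in parallel threads).
outputs = parallel_apply(replicas, chunks)
# 4. Gather the per-device outputs back onto a single device.
result = gather(outputs, target_device=device_ids[0])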
As for DistributedDataParallel, that's more tricky. It is currently the more advanced approach and it is quite efficient (see here).
This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the batch
dimension. The module is replicated on each machine and each device, and
each such replica handles a portion of the input. During the backwards
pass, gradients from each node are averaged.
There are several approaches to averaging the gradients from each node. I would recommend this paper to get a real sense of how things work. Generally speaking, there is a trade-off in transferring the data from one GPU to another, regarding bandwidth and speed, and we want that part to be really efficient. So one possible approach is to connect each pair of GPUs with a really fast protocol in a ring, and to pass only part of the gradients from one to another, so that in total we transfer less data, more efficiently, and all the nodes end up with all the gradients (or at least their average). There will still be a master GPU in that situation, or at least a master process, but now there is no bottleneck on any single GPU; they all share the same amount of data (up to...).
Now this can be further optimized if we don't wait for all the batches to finish computing and instead do a time-sharing scheme where each node sends its portion as soon as it is ready. Don't hold me to the details, but it turns out that if we don't wait for everything to end and do the averaging as soon as we can, it may also speed up the gradient averaging.
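To make "averaging the gradients from each node" concrete: what DDP does in a highly optimized, bucketed, overlapping way is conceptually equivalent to the following sketch, assuming torch.distributed has already been initialized in every process:

import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    # Sum every parameter's gradient across all processes, then divide by the
    # world size so each replica ends up holding the same averaged gradient.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# Typical use inside each process's training loop:
#   loss.backward()
#   average_gradients(model)
#   optimizer.step()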
Please refer to the literature for more information about this area, as it is still developing (as of today).
PS 1: Usually this kind of distributed training works better on machines that are set up for the task, e.g. AWS deep learning instances that implement those protocols in hardware.
PS 2: Disclaimer: I really don't know which protocol the PyTorch devs chose to implement, or what is chosen according to what. I work with distributed training and prefer to follow PyTorch best practices without trying to outsmart them. I recommend you do the same unless you are really into researching this area.
References:
[1] Distributed Training of Deep Learning Models: A Taxonomic Perspective
Approach to ml parallelism with Pytorch
DataParallel & DistributedDataParallel
Model parallel https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
See Will switching GPU device affect the gradient in PyTorch back propagation?
