Why is convolution in cuDNN non-deterministic? - pytorch

The PyTorch documentation says that, when using cuDNN as the backend for a convolution, one has to set two options to make the implementation deterministic: torch.backends.cudnn.deterministic = True and torch.backends.cudnn.benchmark = False. Is this because of the way the weights are initialized?

When torch.backends.cudnn.deterministic is set to True, cuDNN uses deterministic algorithms for these operations, meaning that given the same input and parameters, the output will always be the same. This is useful when you need reproducible results, for example when debugging or when comparing different model architectures. It is not related to weight initialization: the non-determinism comes from the convolution algorithms themselves, some of which (e.g. those that use atomic additions) can accumulate floating-point results in a different order from run to run.
However, using deterministic algorithms can come at a cost in performance, as some of the optimizations that make cuDNN fast are not compatible with determinism. Setting torch.backends.cudnn.deterministic to True may therefore result in slower training.
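For reference, the two settings mentioned in the question can be applied like this (a minimal sketch; newer PyTorch releases additionally offer torch.use_deterministic_algorithms(True), which covers more operators than the cuDNN flags alone):

import torch

# Force cuDNN to pick deterministic convolution algorithms and disable the
# benchmarking auto-tuner; this trades some speed for reproducible outputs.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False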

Related

CV=integer vs predefined splits in GridSearchCV

What's the difference between setting cv=some integer and cv=PredefinedSplit(test_fold=your_test_fold)?
Is there any advantage of one over the other? Does cv=some integer set the splits randomly?
Specifying an integer will produce unshuffled k-fold cross-validation, as described in the documentation for sklearn.model_selection.KFold. Shuffling before splitting may or may not be preferred: if your data is sorted, shuffling is necessary to randomize the distribution of samples, while if the samples are correlated due to spatial or temporal sampling effects, shuffling may give an optimistic view of performance.
I would avoid using PredefinedSplit unless you have a very good reason to predefine your splits. Other CV generators can probably meet your needs, like StratifiedKFold if you want to maintain your class distribution, for example.
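As a rough illustration of the difference (the estimator, parameter grid, and fold assignment below are arbitrary placeholders), the two ways of specifying cv look like this:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

X, y = make_classification(n_samples=100, random_state=0)

# cv=5: GridSearchCV generates unshuffled 5-fold splits itself.
search_kfold = GridSearchCV(LogisticRegression(max_iter=1000),
                            param_grid={"C": [0.1, 1.0]}, cv=5)
search_kfold.fit(X, y)

# PredefinedSplit: samples with test_fold == -1 are always in training;
# the others are assigned to the test fold whose index you give them.
test_fold = np.full(100, -1)
test_fold[80:] = 0  # last 20 samples form the single test fold
search_predef = GridSearchCV(LogisticRegression(max_iter=1000),
                             param_grid={"C": [0.1, 1.0]},
                             cv=PredefinedSplit(test_fold=test_fold))
search_predef.fit(X, y)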

Can I speed up inference in PyTorch using autocast (automatic mixed precision)?

The docs (see also this) for autocast in PyTorch only discuss training. Does it speed things up if I also use autocast for inference?
Yes, it can (though it may not in some cases).
You are processing data at lower precision (e.g. float16 instead of float32), so your program has to read and move less data.
This can help with cache locality and with hardware specialized for low precision (e.g. tensor cores when using CUDA).
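A minimal sketch of what this looks like at inference time (the model and input here are placeholders, and a CUDA device plus a PyTorch version where torch.autocast accepts device_type are assumed):

import torch

model = torch.nn.Linear(512, 512).cuda().eval()   # placeholder model
x = torch.randn(32, 512, device="cuda")           # placeholder batch

# Run the forward pass in mixed precision; no GradScaler is needed because
# no gradients are computed during inference.
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)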

The result is not fixed after setting random seed in pytorch

import random

import numpy as np
import torch

def setup_seed(seed):
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)           # seeds the CPU RNG
    torch.cuda.manual_seed_all(seed)  # seeds all GPU RNGs
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = True
I set the random seed when running the code, but I cannot get a fixed result with PyTorch. Besides, I use batch norm in my code, and I call model.eval() when evaluating and testing. I cannot figure out the reason for that.
I think the line torch.backends.cudnn.benchmark = True is causing the problem. It enables the cuDNN auto-tuner, which benchmarks several algorithms and picks the fastest one for your configuration. For example, the forward convolution can be implemented with one of these algorithms:
CUDNN_CONVOLUTION_FWD_ALGO_GEMM,
CUDNN_CONVOLUTION_FWD_ALGO_FFT,
CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING,
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM,
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM,
CUDNN_CONVOLUTION_FWD_ALGO_DIRECT,
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD,
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED,
Several of these algorithms come without reproducibility guarantees.
So use torch.backends.cudnn.benchmark = False for deterministic outputs (this may slow down execution).
There are also some PyTorch functions that cannot be made deterministic; refer to this doc.
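Putting that together, a corrected version of the helper from the question might look like this (sketch only; on newer PyTorch versions you can additionally call torch.use_deterministic_algorithms(True) for broader coverage):

import random

import numpy as np
import torch

def setup_seed(seed):
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)           # seeds the CPU RNG
    torch.cuda.manual_seed_all(seed)  # seeds all GPU RNGs
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # disable the auto-tuner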

What kinds of optimization are used in PyTorch methods?

I'm using PyTorch to implement an intense sequence of matrix operations, using methods such as torch.mm or torch.dot. I was wondering if PyTorch uses multithreading or other optimization mechanisms to speed up the process. I am not using a GPU. I would appreciate it if you could tell me how fast these methods are and whether I need to take any action to help the process.
PyTorch uses an efficient BLAS implementation and multithreading (OpenMP, if I'm not wrong) to parallelize such operations across multiple cores. Some performance is lost to Python itself: since it is an interpreted language, no significant compiler-like optimization can be done. You can use the jit module to speed up the "wrapper" code around the matrix multiplies, but for anything more than very small matrices this overhead is probably negligible.
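If you want to check or control how many threads those kernels use, PyTorch exposes this directly (the thread count below is just an example):

import torch

print(torch.get_num_threads())  # threads used for intra-op parallelism
torch.set_num_threads(4)        # e.g. pin the work to 4 cores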
One big improvement you may be able to get manually, but which PyTorch doesn't apply automatically, is to properly order the matrix multiplies. As you probably know, depending on the matrix shapes, a product ABCD can be much cheaper computed as A(B(CD)) than as (AB)(CD), and so on.
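A quick way to see the effect is to time the two associativity orders on deliberately skewed shapes (the sizes here are arbitrary; only the relative cost matters):

import time
import torch

A = torch.randn(1000, 1000)
B = torch.randn(1000, 1000)
v = torch.randn(1000, 1)

def timed(fn, repeats=20):
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

# (A B) v first forms a full 1000x1000 product (~10^9 multiply-adds)...
slow = timed(lambda: (A @ B) @ v)
# ...while A (B v) only does two matrix-vector products (~2*10^6 multiply-adds).
fast = timed(lambda: A @ (B @ v))
print(f"(AB)v: {slow:.5f}s per call, A(Bv): {fast:.5f}s per call")

Recent PyTorch versions also provide torch.linalg.multi_dot, which chooses an efficient multiplication order for a chain of matrices automatically.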

How do PyTorch's parallel and distributed methods work?

I'm not an expert in distributed systems or CUDA, but there is one really interesting pair of features that PyTorch supports: nn.DataParallel and nn.DistributedDataParallel. How are they actually implemented? How do they separate common embeddings and synchronize data?
Here is a basic example of DataParallel.
import numpy as np
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(1000, 10)
        self.rnn = nn.Linear(10, 10)

    def forward(self, x):
        x = self.embedding(x)
        x = self.rnn(x)
        return x

model = nn.DataParallel(Model().cuda())
output = model(torch.from_numpy(np.array([1, 2, 3, 4, 5, 6], dtype=np.int64)).cuda()).cpu()
PyTorch can split the input, send the chunks to multiple GPUs, and merge the results back.
How does it manage embeddings and synchronization for a parallel model or a distributed model?
I wandered around PyTorch's code but it's very hard to know how the fundamentals work.
That's a great question.
The PyTorch DataParallel paradigm is actually quite simple and the implementation is open-sourced here. Note that this paradigm is not recommended today, as it bottlenecks on the master GPU and is not efficient in data transfer.
This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the batch
dimension (other objects will be copied once per device). In the forward
pass, the module is replicated on each device, and each replica handles a
portion of the input. During the backwards pass, gradients from each replica
are summed into the original module.
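The docstring above maps onto a few primitives in torch.nn.parallel. A rough sketch of one forward pass (data_parallel_forward is just an illustrative name, and error handling, kwargs, and autograd details are omitted; the real DataParallel is more involved):

import torch
from torch.nn.parallel import gather, parallel_apply, replicate, scatter

def data_parallel_forward(module, inp, device_ids, output_device=0):
    # The module is assumed to already live on device_ids[0].
    # 1. Split the batch into one chunk per device.
    inputs = scatter(inp, device_ids)
    # 2. Copy the module (and its parameters) onto every device.
    replicas = replicate(module, device_ids[:len(inputs)])
    # 3. Run each replica on its own chunk, in parallel.
    outputs = parallel_apply(replicas, inputs)
    # 4. Collect the partial outputs back on the output device.
    return gather(outputs, output_device)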
As for DistributedDataParallel, that's trickier. It is currently the more advanced approach and it is quite efficient (see here).
This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the batch
dimension. The module is replicated on each machine and each device, and
each such replica handles a portion of the input. During the backwards
pass, gradients from each node are averaged.
There are several approaches to averaging the gradients from each node. I would recommend this paper [1] to get a real sense of how things work. Generally speaking, there is a trade-off in transferring data from one GPU to another, in terms of bandwidth and speed, and we want that part to be really efficient. So one possible approach is to connect each pair of GPUs with a really fast link in a ring, and to pass only part of the gradients from one to the next, so that in total we transfer less data more efficiently and all the nodes end up with all the gradients (or at least their average). There will still be a master GPU, or at least a master process, in that situation, but now no single GPU is a bottleneck; they all transfer the same amount of data (up to...).
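As an illustration of the averaging step only (this is not PyTorch's actual DDP implementation, which overlaps communication with the backward pass; it assumes torch.distributed has already been initialized, e.g. with dist.init_process_group), the collective it boils down to is an all-reduce over each gradient tensor:

import torch.distributed as dist

def average_gradients(model):
    # Call this after loss.backward() in every participating process.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient across all processes, then divide to average.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size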
This can be optimized further if we don't wait for all the nodes to finish computing and instead do a kind of time-sharing, where each node sends its portion as soon as it is ready. Don't hold me to the details, but it turns out that if we don't wait for everything to finish and instead average as soon as we can, it can also speed up the gradient averaging.
Please refer to the literature for more information about this area, as it is still developing (as of today).
PS 1: Usually this kind of distributed training works better on machines that are set up for the task, e.g. AWS deep learning instances that implement those protocols in hardware.
PS 2: Disclaimer: I really don't know which protocol the PyTorch devs chose to implement or how it is chosen. I work with distributed training and prefer to follow PyTorch best practices without trying to outsmart them. I recommend you do the same, unless you are really into researching this area.
References:
[1] Distributed Training of Deep Learning Models: A Taxonomic Perspective
[2] Approach to ml parallelism with Pytorch
[3] DataParallel & DistributedDataParallel
[4] Model parallel: https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
[5] See also: Will switching GPU device affect the gradient in PyTorch back propagation?
