Difference between `tf.nn.batch_normalization` and `tf.layers.batch_normalization` [duplicate] - python-3.x

In TensorFlow 1.4, I found two functions that do batch normalization and they look the same:
tf.layers.batch_normalization (link)
tf.contrib.layers.batch_norm (link)
Which function should I use? Which one is more stable?

Just to add to the list, there are several more ways to do batch norm in TensorFlow:
tf.nn.batch_normalization is a low-level op. The caller is responsible for handling the mean and variance tensors themselves.
tf.nn.fused_batch_norm is another low-level op, similar to the previous one. The difference is that it's optimized for 4D input tensors, which is the usual case in convolutional neural networks. tf.nn.batch_normalization accepts tensors of any rank greater than 1.
tf.layers.batch_normalization is a high-level wrapper over the previous ops. The biggest difference is that it takes care of creating and managing the running mean and variance tensors, and calls a fast fused op when possible. Usually, this should be the default choice for you; see the usage sketch after this list.
tf.contrib.layers.batch_norm is the early implementation of batch norm, from before it graduated to the core API (i.e., tf.layers). Using it is not recommended because it may be dropped in future releases.
tf.nn.batch_norm_with_global_normalization is another deprecated op. Currently, it delegates the call to tf.nn.batch_normalization, but it is likely to be dropped in the future.
Finally, there's also the Keras layer keras.layers.BatchNormalization, which in the case of the TensorFlow backend invokes tf.nn.batch_normalization.
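For reference, a minimal sketch of the usual TF 1.x pattern for tf.layers.batch_normalization; the placeholder shapes, layer sizes, and learning rate are made up, but the explicit UPDATE_OPS dependency is the part people usually miss, because without it the moving statistics used at inference are never updated:
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 32, 32, 3])
training = tf.placeholder(tf.bool, name="is_training")

h = tf.layers.conv2d(x, filters=16, kernel_size=3, padding="same")
h = tf.layers.batch_normalization(h, training=training)  # uses the fused op when possible
h = tf.nn.relu(h)
loss = tf.reduce_mean(tf.square(h))

# The moving mean/variance updates live in UPDATE_OPS and must run with the train op.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)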

As shown in the docs, tf.contrib is a contribution module containing volatile or experimental code. Once a function matures, it is moved out of this module and into the core API. Both versions currently exist only to stay compatible with historical code.
So the former, tf.layers.batch_normalization, is the recommended one.

Related

_th_addr_out not supported on CPUType for ComplexFloat

I am trying to use a customized loss function for my NN. I've implemented all operations in torch and I have complex numbers among my data.
I get this error while training the NN:
RuntimeError: _th_addr_out not supported on CPUType for ComplexFloat
Do you know any possible solution to deal with it?
Well, it seems complex autograd in PyTorch is currently in a prototype state, and the backward functionality for some functions is not yet included.
For example, torch.sign, which is used in the backward computation of torch.abs, is not defined for complex tensors; the same goes for torch.mv. So I debugged my code line by line to find the functions which are not supported, and replaced them with customized functions :)
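Not my exact code, but a sketch of the kind of replacement I mean, assuming a build where complex tensors support autograd at all; the backward uses the common convention of propagating grad * z / |z| for the absolute value:
import torch

class ComplexAbs(torch.autograd.Function):
    """Absolute value of a complex tensor with a hand-written backward."""

    @staticmethod
    def forward(ctx, z):
        ctx.save_for_backward(z)
        return z.abs()

    @staticmethod
    def backward(ctx, grad_output):
        (z,) = ctx.saved_tensors
        # Propagate grad * z / |z|; the clamp avoids division by zero at the origin.
        return grad_output * z / z.abs().clamp_min(1e-12)

z = torch.randn(4, dtype=torch.cfloat, requires_grad=True)
loss = ComplexAbs.apply(z).sum()
loss.backward()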
Hopefully a lot more functions will be included in the next release of PyTorch.

Do ReLU1 in PyTorch

I want to use ReLU1 non-linear activation. ReLU1 is linear in [0,1] but clamps values less than 0 to 0 and clamps values more than 1 to 1.
It will be used only for the last layer of my deep net in PyTorch, which has a really high-resolution output of 2048x4096. Since the code has to be highly optimized in terms of speed and memory, I do not know which of the following will be the best implementation.
Following are the two implementations I can think of for the tensor x:
x.clamp_(min=0.0, max=1.0)
For this, I am unable to see the source code given in its docs, so I do not know if it's the best choice. I would prefer an in-place operation, since backpropagation can happen through it.
The second alternative I have is to use torch.nn.functional.hardtanh_(x, min_val=0.0, max_val=1.0). This is definitely an in-place function, and the source code says that it calls the C++ function torch._C._nn.hardtanh(input, min_val, max_val), so I think it will be fast.
Please suggest which is the most efficient implementation, and another one if possible.
Thank you
Without trying it, my guess is that clamp and hardtanh will have the same speed, and it will be hard to do this operation any faster if you optimize it in isolation. The arithmetic is trivial so this operation will be bottlenecked by GPU memory bandwidth. To run faster, you'd want to fuse this operation with the operation that produced x. If you don't want to write a custom kernel for the combined operation, you can try using TorchScript.
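A rough sketch of what I mean (the Conv2d is just a hypothetical stand-in for whatever layer actually produces x, and the spatial size is shrunk so it runs quickly):
import torch
import torch.nn as nn

class Relu1Head(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(16, 1, kernel_size=3, padding=1)

    def forward(self, x):
        # ReLU1: clamp the producing layer's output into [0, 1]
        return self.conv(x).clamp(min=0.0, max=1.0)

head = torch.jit.script(Relu1Head())      # gives the JIT a chance to fuse the element-wise tail
out = head(torch.randn(1, 16, 256, 512))  # the real output would be 2048x4096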

What kinds of optimization are used in PyTorch methods?

I'm using PyTorch to implement an intense sequence of matrix operations, using methods such as torch.mm or torch.dot. I was wondering if PyTorch uses multithreading or other optimization mechanisms to speed up the process. I am not utilizing a GPU. I'd appreciate it if you could inform me of how fast these methods are and whether I need to take any action to help the process.
PyTorch uses an efficient BLAS implementation and multithreading (OpenMP, if I'm not wrong) to parallelize such operations across multiple cores. Some performance loss comes from Python itself: since it is an interpreted language, no significant compiler-like optimization can be done. You can use the jit module to speed up the "wrapper" code around the matrix multiplies, but for anything more than very small matrices this cost is probably negligible.
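If you want to check or pin the CPU thread count yourself, the public torch calls below are enough (the value 4 is just an example):
import torch

print(torch.get_num_threads())  # threads used for intra-op parallelism (e.g. torch.mm)
torch.set_num_threads(4)        # pin it explicitly if you need to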
One big improvement you may be able to get manually, but which PyTorch doesn't apply automatically, is to properly order the matrix multiplies. As you probably know, depending on matrix shapes, a multiplication ABCD may have different performance computed as A(B(CD)) than if computed as (AB)(CD), etc.
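To make that concrete, an illustrative comparison with made-up shapes, where forming the small intermediate first is dramatically cheaper:
import time
import torch

A = torch.randn(1000, 1000)
B = torch.randn(1000, 1000)
v = torch.randn(1000, 1)

t0 = time.perf_counter()
left = (A @ B) @ v    # builds a 1000x1000 intermediate: ~10^9 multiply-adds
t1 = time.perf_counter()
right = A @ (B @ v)   # builds a 1000x1 intermediate: ~2*10^6 multiply-adds
t2 = time.perf_counter()

print(f"(A @ B) @ v: {t1 - t0:.4f}s   A @ (B @ v): {t2 - t1:.4f}s")
print((left - right).abs().max())  # same result up to floating-point error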

How do PyTorch's parallel method and distributed method work?

I'm not an expert in distributed systems and CUDA. But there is one really interesting feature that PyTorch supports: nn.DataParallel and nn.DistributedDataParallel. How are they actually implemented? How do they separate common embeddings and synchronize data?
Here is a basic example of DataParallel.
import numpy as np
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(1000, 10)
        self.rnn = nn.Linear(10, 10)

    def forward(self, x):
        x = self.embedding(x)
        x = self.rnn(x)
        return x

# DataParallel replicates the module on all visible GPUs and splits the batch along dim 0
model = nn.DataParallel(Model().cuda())
out = model(torch.from_numpy(np.array([1, 2, 3, 4, 5, 6], dtype=np.int64)).cuda()).cpu()
PyTorch can split the input, send the chunks to many GPUs, and merge the results back.
How does it manage embeddings and synchronization for a parallel model or a distributed model?
I wandered around PyTorch's code but it's very hard to know how the fundamentals work.
That's a great question.
The PyTorch DataParallel paradigm is actually quite simple and the implementation is open-sourced here. Note that this paradigm is not recommended today, as it bottlenecks at the master GPU and is not efficient in data transfer.
This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the batch
dimension (other objects will be copied once per device). In the forward
pass, the module is replicated on each device, and each replica handles a
portion of the input. During the backwards pass, gradients from each replica
are summed into the original module.
As for DistributedDataParallel, that's more tricky. This is currently the more advanced approach, and it is quite efficient (see here).
This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the batch
dimension. The module is replicated on each machine and each device, and
each such replica handles a portion of the input. During the backwards
pass, gradients from each node are averaged.
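For a sense of what that looks like in user code, here is a minimal sketch of a typical DistributedDataParallel setup; it assumes one process per GPU launched by a tool such as torchrun that sets RANK/WORLD_SIZE/LOCAL_RANK in the environment, and the model and tensor shapes are placeholders:
import os
import torch
import torch.distributed as dist
import torch.nn as nn

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(10, 10).cuda(local_rank)
ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

# Each process works on its own shard of the data; backward() all-reduces
# (averages) the gradients across processes before optimizer.step().
x = torch.randn(8, 10).cuda(local_rank)
loss = ddp_model(x).sum()
loss.backward()
optimizer.step()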
There are several approaches to averaging the gradients from each node. I would recommend this paper [1] to get a real sense of how things work. Generally speaking, there is a trade-off in transferring the data from one GPU to another, regarding bandwidth and speed, and we want that part to be really efficient. So one possible approach is to connect each pair of GPUs with a really fast protocol in a ring, and to pass only part of the gradients from one to another, such that in total we transfer less data, more efficiently, and all the nodes get all the gradients (or at least their average). There will still be a master GPU in that situation, or at least a master process, but now there is no bottleneck on any single GPU; they all share the same amount of data (up to...).
Now this can be further optimized if we don't wait for all the batches to finish computing and instead do a time-sharing thing where each node sends its portion when it's ready. Don't quote me on the details, but it turns out that if we don't wait for everything to end, and do the averaging as soon as we can, it might also speed up the gradient averaging.
Please refer to the literature for more information about that area, as it is still developing (as of today).
PS 1: Usually this kind of distributed training works better on machines that are set up for that task, e.g. AWS deep learning instances that implement those protocols in hardware.
PS 2: Disclaimer: I really don't know which protocol the PyTorch devs chose to implement, or what is chosen according to what. I work with distributed training and prefer to follow the PyTorch best practices without trying to outsmart them. I recommend you do the same, unless you are really into researching this area.
References:
[1] Distributed Training of Deep Learning Models: A Taxonomic Perspective
Approach to ml parallelism with Pytorch
DataParallel & DistributedDataParallel
Model parallel https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
See also: Will switching GPU device affect the gradient in PyTorch back propagation?

Are foundationdb layers interoperable?

I just started looking at FoundationDB and I have some trouble understanding how the layers work.
Are foundationdb layers interoperable?
If I add data using SQL, can I then query that data using the graph layer?
How does that conversion/mapping work?
Regards Oskar
Short answer regarding the SQL layer: not yet.
Longer answer:
The FoundationDB storage engine maintains a mapping from bytes to bytes, with no additional encoding or structure imposed on top of that. This being the case, interoperability between layers is certainly possible, and in some cases may be a design goal.
A common set of encodings used by many layers is provided by the Tuple Layer (https://foundationdb.com/documentation/data-modeling.html#tuples), so higher-level layers using the Tuple Layer will, for instance, pack identical primitive values to identical strings of bytes. For true interoperability between two layers, however, each layer will have to understand the logic by which the other represents its higher-level data structures in terms of Tuples.
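To illustrate the Tuple Layer point, a small sketch using the FoundationDB Python binding (the key contents are made up, and the API version should match your installed client):
import fdb
fdb.api_version(630)  # assumption: pick the API version your client supports

# Two layers that both encode keys through the Tuple Layer produce
# byte-identical encodings for the same primitive values...
key_a = fdb.tuple.pack((u"users", 42, u"name"))
key_b = fdb.tuple.pack((u"users", 42, u"name"))
assert key_a == key_b
print(fdb.tuple.unpack(key_a))

# ...but each layer still defines its own mapping from its data model onto
# tuples, so true interoperability requires understanding that mapping too.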
As for the SQL layer, interoperability with other data model layers released by FoundationDB is definitely a medium-term goal, but it doesn't happen automatically in the current alpha version.
