To create RNN cells, there are classes like GRUCell and LSTMCell, which can then be used to build RNN layers. There are also two other classes, CudnnGRU and CudnnLSTM, which can be used directly to create RNN layers. The documentation says the latter classes have a cuDNN implementation. Why should I use, or not use, these cuDNN-backed classes over the classical RNN implementations when building an RNN model?
In short: CudnnGRU and CudnnLSTM can (and must) be run on a GPU, whereas the normal RNN implementations do not use cuDNN. So if you have tensorflow-gpu, the cuDNN implementation of the RNN cells will run faster.
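As a rough illustration, here is a minimal sketch of swapping between the two layer types in Keras (assuming standalone Keras 2.x with the tensorflow-gpu backend, where CuDNNLSTM is available; the layer sizes and input shape are made up):

```python
from keras.models import Sequential
from keras.layers import LSTM, CuDNNLSTM, Dense

def build_model(use_cudnn=True):
    model = Sequential()
    if use_cudnn:
        # cuDNN-backed layer: GPU-only, typically much faster to train.
        model.add(CuDNNLSTM(128, input_shape=(100, 32)))
    else:
        # Generic implementation: runs on CPU or GPU, but without cuDNN.
        model.add(LSTM(128, input_shape=(100, 32)))
    model.add(Dense(10, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model
```

Note that CuDNNLSTM exposes fewer options than LSTM (e.g. no custom activations or recurrent dropout), so the swap is only a drop-in replacement for the default configuration.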
CuDNNLSTM and CuDNNGRU are the fast implementations backed by cuDNN. Both can only be run on the GPU, with the TensorFlow backend. cuDNN is a GPU-accelerated library of primitives for deep neural networks.
cuDNN provides highly tuned implementations of standard routines such as forward and backward convolution, pooling, normalization, and activation layers. cuDNN is part of the NVIDIA Deep Learning SDK.
The cuDNN highlights include:
- Up to 3x faster training of ResNet-50 and GNMT on Tesla V100 vs. Tesla P100
- Improved NHWC support for pooling and strided convolution
- Improved performance for common workloads such as ResNet-50 and SSD, as batchnorm now supports the NHWC data layout with an added option to fuse batchnorm with Add and ReLU operations
Related
I am looking for advice on the fastest possible implementation of a convolution algorithm for CNN inference (not training).
The convolutional neural networks, with architectures such as AlexNet, MobileNet, ResNet, etc., will run on an embedded ARM device (A72, A53, A35) and possibly on an embedded GPU as well.
I understand there are various implementations out there, and NN frameworks with various approaches such as direct convolution, unrolling-based convolution (im2col), FFT-based, or Winograd, but my primary focus is executing CNNs under the performance constraints of an embedded device.
If anybody has experience and can recommend a convolution implementation for CPU, as well as a parallel implementation, or can point to a research paper or open-source implementation, I would very much appreciate it.
If this is still relevant: I found a small framework for inference of pre-trained neural networks on CPU. It uses the Simd Library to accelerate its work. The library has very fast (single-threaded) implementations of convolution, pooling, ReLU, and many other network layers for CPU (x86 and ARM). The CNN convolution includes Winograd's method.
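To make the unrolling-based (im2col) approach mentioned in the question a bit more concrete, here is a minimal NumPy sketch for a single image with stride 1 and no padding; it is purely illustrative and nowhere near an optimized embedded implementation:

```python
import numpy as np

def im2col(x, kh, kw):
    # x has layout (channels, height, width); each output column is one patch.
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, idx] = x[:, i:i + kh, j:j + kw].ravel()
            idx += 1
    return cols, out_h, out_w

def conv2d_im2col(x, weights):
    # weights has shape (num_filters, channels, kh, kw).
    f, c, kh, kw = weights.shape
    cols, out_h, out_w = im2col(x, kh, kw)
    # After unrolling, the convolution is a single matrix multiplication (GEMM).
    out = weights.reshape(f, -1) @ cols
    return out.reshape(f, out_h, out_w)

x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(16, 3, 3, 3).astype(np.float32)
print(conv2d_im2col(x, w).shape)  # (16, 6, 6)
```

The appeal of this formulation on CPUs is that the heavy lifting is delegated to a well-tuned GEMM routine, at the cost of extra memory for the unrolled patches.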
I have a GTX 1080 and an RTX 2080. I want to train using both, but since the RTX can handle FP16 twice as fast, I'd like to set it up so that the training is multi-GPU and the RTX handles the FP16 layers and the GTX handles the FP32 layers.
Is this possible under tensorflow, pytorch, or keras?
Tensorflow
In TF, it is possible to specify for each layer the device on which it should be executed (CPU, GPU, or a specific GPU if you have multiple GPUs). This is done with a with tf.device('device_name') statement (you need to provide a meaningful device_name). See the Using multiple GPUs section.
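For example, a minimal TF 1.x graph-mode sketch of that statement (the device strings and layer sizes here are illustrative, not required values):

```python
import tensorflow as tf

inputs = tf.placeholder(tf.float32, shape=(None, 784))

# Pin the hidden layer to one GPU and the output layer to another.
with tf.device('/gpu:0'):
    hidden = tf.layers.dense(inputs, 256, activation=tf.nn.relu)

with tf.device('/gpu:1'):
    logits = tf.layers.dense(hidden, 10)
```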
Keras
Since this is possible in TF, it means you can also use it in Keras, if you use TF as the Keras backend (Keras is just a high-level neural networks API).
Note that Keras also has a multi_gpu_model() function, but that only replicates the whole model across multiple GPUs; you cannot specify which layer to put on a specific GPU.
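For completeness, a short sketch of that helper (assuming Keras 2.x, where keras.utils.multi_gpu_model is available, a machine with at least two GPUs, and an already-built Keras model called model):

```python
from keras.utils import multi_gpu_model

# Replicates the whole model on 2 GPUs and splits each batch between them;
# it does not let you place individual layers on specific GPUs.
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(loss='categorical_crossentropy', optimizer='adam')
```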
As the title states, does Keras (with the TensorFlow backend) normalize the kernel weights compared to, e.g., plain TensorFlow? For example, if two identical networks are implemented in Keras and TensorFlow respectively, will the kernel weights differ?
If you use TensorFlow as the backend of Keras, there is no reason for the implementations to differ.
You can check for yourself here: https://github.com/keras-team/keras/tree/master/keras/layers
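One simple way to check this yourself (a sketch assuming standalone Keras; the layer sizes are arbitrary) is to initialize a kernel with a known constant and confirm that get_weights() returns exactly the initializer's output, i.e. no extra normalization is applied:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.initializers import Constant

# Build a layer whose kernel is initialized to a known constant value.
model = Sequential([Dense(4, input_shape=(3,), kernel_initializer=Constant(0.5))])
kernel, bias = model.layers[0].get_weights()
print(np.allclose(kernel, 0.5))  # True: the stored kernel is the raw initializer output
```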
How can you program Keras or TensorFlow to partition training across multiple GPUs? Let's say you are on an Amazon EC2 instance that has 8 GPUs and you want to use all of them to train faster, but your code is written for a single CPU or GPU.
Yes, you can run Keras models on multiple GPUs. For the time being, this is only possible with the TensorFlow backend, because the Theano feature is still rather new. We are looking at adding multi-GPU support in Theano in the near future (it should be fairly straightforward).
With the TensorFlow backend, you can achieve this the same way as you would in pure TensorFlow: by using the with tf.device(d) scope when defining Keras layers.
Originally from here
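A sketch of that pattern, with Keras layers defined inside tf.device scopes (TF 1.x with the TensorFlow backend; device strings and layer sizes are illustrative):

```python
import tensorflow as tf
from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(784,))

# Each scope places the ops created inside it on the named device.
with tf.device('/gpu:0'):
    x = Dense(256, activation='relu')(inputs)
with tf.device('/gpu:1'):
    outputs = Dense(10, activation='softmax')(x)

model = Model(inputs, outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```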
I want to know whether the filters' weights in, for example, a 2D convolution layer in Keras are shared along the spatial dimensions by default. If yes, is there any way to have non-shared weights?
I found that LocallyConnected2D does what I am looking for.
The LocallyConnected2D layer works similarly to the Conv2D layer, except that weights are unshared, that is, a different set of filters is applied at each different patch of the input.
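A small sketch contrasting the two layers by parameter count (standalone Keras; the input shape and filter settings are arbitrary): Conv2D reuses one kernel at every spatial position, so its weight count is independent of the image size, while LocallyConnected2D learns a separate kernel per output position.

```python
from keras.layers import Input, Conv2D, LocallyConnected2D
from keras.models import Model

inp = Input(shape=(32, 32, 3))
conv = Conv2D(16, (3, 3))               # one kernel reused at every spatial position
local = LocallyConnected2D(16, (3, 3))  # a separate kernel per spatial position
m = Model(inp, [conv(inp), local(inp)])

print(conv.count_params())   # 448    = (3*3*3 + 1) * 16, independent of image size
print(local.count_params())  # 403200 = (3*3*3*16 + 16) * 30 * 30 output positions
```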
I'm not clear on what you're asking, but:
The weights within a single convolutional layer are shared. That is, the filters use the same weights at each stride.
However, the weights between two convolutional layers are not shared by default in Keras.
There is no getting around shared weights in the filters within a conv layer, since the execution of the convolution is offloaded to C++ libraries.
See this answer for further reference, in particular:
The implementation of tf.nn.conv2d() is written in C++, which invokes optimized code using either Eigen (on CPU) or the cuDNN library (on GPU). You can find the implementation here.