This layer is not documented very well and I'm having a bit of trouble figuring out exactly how to use it.
I'm trying something like:
from keras.layers import Input, Convolution2D, Deconvolution2D

input_img = Input(shape=(1, h, w))
x = Convolution2D(16, 7, 7, activation='relu', border_mode='valid')(input_img)
d = Deconvolution2D(1, 7, 7, (None, 1, 2*h, 2*w))
x = d(x)
but when I check d.output_shape, I get the original shape of the image instead of twice that size (which is what I was expecting).
Any help will be greatly appreciated!
Short answer: you need to add subsample=(2,2) to Deconvolution2D if you wish the output to truly be twice as large as the input.
Longer answer: Deconvolution2D is severely undocumented and you have to go through its code to understand how to use it.
First, you must understand how the deconvolution layer works (skip this if you already know the details). Deconvolution, unlike what its name suggests, is simply applying the back-propagation (gradient calculation) step of a standard convolution layer to the input of the deconvolution layer. The "kernel size" of the deconvolution layer is actually the kernel size of the virtual convolution layer in that backprop step.
Given the size of a convolution kernel and its stride, it is straightforward to compute the output shape of the convolution layer (assuming no padding, it's (input - kernel) // stride + 1), but the reverse is not true. In fact, there can be more than one possible input shape that matches a given output shape of the convolution layer (because integer division isn't invertible). This means that for a deconvolution layer, the output shape cannot be determined directly from the input shape (which is implicitly known), the kernel size and the stride alone - this is why we need to specify the output shape when we initialize the layer. Of course, because of the way the deconvolution layer is defined, for some input shapes you'll get holes in its output which are undefined, and if we forbid these cases then we actually can deduce the output shape.
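For instance, here is a tiny plain-Python illustration (no framework assumed) of why the forward formula cannot be inverted uniquely:
kernel, stride = 3, 2
for inp in range(7, 11):
    # out = (in - kernel) // stride + 1 for a convolution with no padding
    print(inp, '->', (inp - kernel) // stride + 1)
# prints: 7 -> 3, 8 -> 3, 9 -> 4, 10 -> 4
Inputs of length 7 and 8 both produce an output of length 3, so a deconvolution layer that receives a length-3 input cannot know which original length it should reconstruct - hence the explicit output_shape.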
Back to Keras and how the above is implemented. Confusingly, the output_shape parameter is not actually used to determine the output shape of the layer; instead, Keras tries to deduce it from the input shape, the kernel size and the stride, assuming that only valid output_shapes are supplied (though this is not checked in the code). The output_shape itself is only used as an input to the backprop step. Thus, you must also specify the stride parameter (subsample in Keras) in order to get the desired result (which, in principle, Keras could have determined from the given input shape, output shape and kernel size).
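Putting it together, here is a minimal sketch of the fixed layer (assuming the Keras 1.x functional API, Theano-style 'channels_first' ordering, and hypothetical values for h and w; note I switch to border_mode='same' so the output really is exactly twice the input size):
from keras.layers import Input, Convolution2D, Deconvolution2D

h, w = 32, 32  # hypothetical image size
input_img = Input(shape=(1, h, w))
x = Convolution2D(16, 7, 7, activation='relu', border_mode='same')(input_img)
# subsample=(2, 2) is the stride of the virtual forward convolution,
# so the deconvolution upsamples by a factor of 2
d = Deconvolution2D(1, 7, 7, output_shape=(None, 1, 2 * h, 2 * w),
                    subsample=(2, 2), border_mode='same')
x = d(x)
print(d.output_shape)  # expected: (None, 1, 64, 64)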
Related
I am trying to understand how a nn.conv1d processes an input for a specific example related to audio processing in a WaveNet model.
I have input data of shape (1,1,8820), which passes through an input layer (1,16,1), to output a shape of (1,16,8820).
That part I understand, because you can just multiply the two matrices. The next layer is a conv1d, kernel size=3, input channels=16, output channels=16, so the state dict shows a matrix with shape (16,16,3) for the weights. When the input of (1,16,8820) goes through that layer, the result is another (1,16,8820).
What multiplication steps occur within the layer to apply the weights to the audio data? In other words, if I wanted to apply the layer (forward calculations only) using only the input matrix, the state_dict matrix, and numpy, how would I do that?
This example uses the nn.conv1d layer from PyTorch. Also, if the same layer had dilation=2, how would that change the operations?
A convolution is a specific type of "sliding window operation": that is, applying the same function/operation on overlapping sliding windows of the input.
In your example, you treat each window of 3 overlapping temporal samples (each with 16 channels) as an input to 16 filters. Therefore, you have a weight matrix of 3x16x16.
You can think of it as "unfolding" the (1, 16, 8820) signal into (1, 16*3, 8820) sliding windows, then multiplying by a 16*3 x 16 weight matrix to get an output of shape (1, 16, 8820).
Padding, dilation and strides affect the way the "sliding windows" are formed.
See nn.Unfold for more information.
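If you want to reproduce the forward pass with numpy only, here is a rough sketch (assuming stride=1 and padding=1 so the length stays 8820; the real layer's padding setting in your model may differ). The same function also shows what dilation=2 changes: the 3 kernel taps are spread two samples apart instead of being adjacent, and the padding has to grow accordingly to keep the same output length:
import numpy as np

def conv1d_forward(x, w, b, dilation=1, padding=1):
    # x: (batch, in_channels, length), w: (out_channels, in_channels, kernel), b: (out_channels,)
    batch, in_ch, length = x.shape
    out_ch, _, k = w.shape
    xp = np.pad(x, ((0, 0), (0, 0), (padding, padding)))  # zero-pad the temporal axis
    span = dilation * (k - 1) + 1  # effective width of the (possibly dilated) kernel
    out_len = xp.shape[-1] - span + 1
    out = np.zeros((batch, out_ch, out_len))
    for t in range(out_len):
        window = xp[:, :, t:t + span:dilation]  # the (batch, in_ch, k) samples under the kernel
        out[:, :, t] = np.einsum('bik,oik->bo', window, w) + b  # multiply-accumulate over channels and taps
    return out

x = np.random.randn(1, 16, 8820)
w = np.random.randn(16, 16, 3)  # stand-in for the weight matrix from the state_dict
b = np.random.randn(16)         # stand-in for the bias
y = conv1d_forward(x, w, b)                          # shape (1, 16, 8820)
y2 = conv1d_forward(x, w, b, dilation=2, padding=2)  # also (1, 16, 8820)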
I'm trying to build an autoencoder-like network but I don't know how to specify the output shape of the network. I have an input of size m x n and a paired expected output of size p x q. I've seen
Calculate the Output size in Convolution layer
Getting the output shape of deconvolution layer using tf.nn.conv2d_transpose in tensorflow
but is there a way to force an output shape without having to work out a bunch of math for every input shape?
I really don't think there is a way to do this (though I will be happy to learn otherwise), since the output shape of a conv layer of any kind is the result of a mathematical operation (convolution) with several parameters. Because of that, the resulting shape has to be one of the possible shapes, determined by the input tensor and the parameters (stride, kernel size and so on).
This is in contrast to a dense (fully-connected) layer, where you can get any shape you want as long as it's a single number (4, 60 or 5000 - but not (60, 60)).
One small trick that can sometimes help in this kind of situation is to get the shape of the previous layer and print it, so you know what parameters you need for the next layer and can make sure your calculations are correct:
import keras.backend as K
from keras.layers import Conv2D

# assuming x is the output of some previous layer
x = Conv2D(32, (3, 3))(x)  # or any other layer
shape = K.int_shape(x)     # e.g. (None, 26, 26, 32)
print(shape)
x = Conv2D(16, (3, 3))(x)
I am not sure I understand exactly how the Keras version of LSTM works.
Let's say I have a vector of len=20 as input and I specify keras.layers.LSTM(units=10).
So in this example, does the network finish after processing 50% of the input, or does it process the rest from the start (I mean from the first cell)?
Units are never related to the input size.
Units are related only to the output size (units = output features or channels).
An LSTM layer will always process the entire data and optionally return either the "same length (all steps)" or "no length (only last step)".
In terms of shapes:
You must have an input tensor with shape (batch, len=20, input_features).
And it will output:
For return_sequences=False: (batch, output_features=10) - no length
For return_sequences=True: (batch, len=20, output_features=10) - same length
Output features is always equal to units.
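You can check this yourself with a small sketch (assuming the Keras functional API and a hypothetical input_features=8):
import keras.backend as K
from keras.layers import Input, LSTM

inp = Input(shape=(20, 8))                  # (batch, len=20, input_features=8)
last = LSTM(10)(inp)                        # return_sequences=False (the default)
seq = LSTM(10, return_sequences=True)(inp)
print(K.int_shape(last))  # (None, 10)      - only the last step
print(K.int_shape(seq))   # (None, 20, 10)  - all 20 steps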
See a fuller explanation of Keras LSTM layers here: Understanding Keras LSTMs
As far as I know, the input (given by the input_shape tuple) enters the network through the convolutional blocks.
So if we want to change the input shape, modifying the convolutions would make sense.
Why do we need include_top=False and to remove the fully connected layers at the end?
On the other hand, if we have a different number of classes, Keras has an option to change the softmax layer using no_of_classes.
I know that I am the one missing something here. Please help me.
Example: For Inception Resnet V2
input_shape: optional shape tuple, only to be specified if include_top
is False (otherwise the input shape has to be (299, 299, 3) (with
'channels_last' data format) or (3, 299, 299) (with 'channels_first'
data format). It should have exactly 3 inputs channels, and width and
height should be no smaller than 139. E.g. (150, 150, 3) would be one
valid value.
include_top: whether to include the fully-connected layer at the top
of the network.
https://keras.io/applications/#inceptionresnetv2
This is simply because the fully connected layers at the end can only take fixed-size inputs, and that size was determined in advance by the input shape and all the processing in the convolutional layers. Any change to the input shape changes the shape of the input to the fully connected layers, making the weights incompatible (the matrix sizes don't match and cannot be applied).
This problem is specific to fully connected layers. If you use another layer for classification, such as global average pooling, you would not have this problem.
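For illustration, the usual pattern looks roughly like this (a sketch assuming keras.applications, a hypothetical num_classes, and the (150, 150, 3) input size mentioned in the docs above):
from keras.applications import InceptionResNetV2
from keras.layers import GlobalAveragePooling2D, Dense
from keras.models import Model

num_classes = 5  # hypothetical number of classes
base = InceptionResNetV2(include_top=False, weights='imagenet',
                         input_shape=(150, 150, 3))
x = GlobalAveragePooling2D()(base.output)  # head no longer depends on the spatial size
out = Dense(num_classes, activation='softmax')(x)
model = Model(base.input, out)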
Consider an input layer in keras as:
model.add(layers.Dense(32, input_shape=(784,)))
What this says is that the input is a 2D tensor where axis=0 (the batch dimension) is not specified while axis=1 is 784. Axis=0 can take any value.
My question is: isn't this style confusing?
Ideally, should it not be
input_shape=(?,784)
This would reflect that axis=0 is a wildcard while axis=1 should be 784.
Is there any particular reason why it is this way? Am I missing something here?
The consistency in this case is between the sizes of the layers and the size of the input. In general, the shapes are assumed to represent the nature of the data; in that sense, the batch dimension is not part of the data itself, but rather how you group it for training or evaluation. So, in your code snippet, it is quite clear that you have inputs with 784 features and a first layer producing a vector of 32 features. If you want to explicitly include the batch dimension, you can use instead batch_input_shape=(None, 784) (this is sometimes necessary, for example if you want to give batches of a fixed size but with an additional time dimension of unknown size). This is explained in the Sequential model guide, but also matches the documentation of the Input layer, where you can give a shape or batch_shape parameter (analogous to input_shape or batch_input_shape).
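To see that the two spellings describe the same thing, here is a small sketch (assuming the Keras Sequential API):
from keras.models import Sequential
from keras.layers import Dense

m1 = Sequential()
m1.add(Dense(32, input_shape=(784,)))             # batch dimension left out
m2 = Sequential()
m2.add(Dense(32, batch_input_shape=(None, 784)))  # batch dimension written explicitly
print(m1.layers[0].input_shape)  # (None, 784)
print(m2.layers[0].input_shape)  # (None, 784) - the same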