What does the `None` in `keras.summary()`'s output shape mean?

What is the meaning of the (None, 100) in Output Shape?
Is this ("None") the number of samples or the hidden dimension?

None means this dimension is variable.
The first dimension in a Keras model is always the batch size. You don't need a fixed batch size except in very specific cases (for instance, when working with stateful=True LSTM layers).
That's why this dimension is often ignored when you define your model. For instance, when you define input_shape=(100,200), actually you're ignoring the batch size and defining the shape of "each sample". Internally the shape will be (None, 100, 200), allowing a variable batch size, each sample in the batch having the shape (100,200).
The batch size is then set when you call the fit or predict methods.
Other None dimensions:
Not only the batch dimension can be None, but many others as well.
For instance, in a 2D convolutional network, where the expected input is (batchSize, height, width, channels), you can have shapes like (None, None, None, 3), allowing variable image sizes.
In recurrent networks and in 1D convolutions, you can also make the length/timesteps dimension variable, with shapes like (None, None, featuresOrChannels).
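To see why the batch dimension can stay unspecified, here is a minimal NumPy sketch (not Keras itself). It assumes a hypothetical dense layer acting on the last axis of samples shaped (100, 200); the names `W`, `b` and `dense` are illustrative only. The weights depend on the per-sample shape, so the same layer accepts any batch size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dense layer acting on the last axis (200 features -> 32 units).
# Its weights depend only on the per-sample shape, never on the batch size.
W = rng.standard_normal((200, 32))
b = np.zeros(32)

def dense(x):
    """Apply the layer to an array of shape (batch, 100, 200)."""
    return x @ W + b

# The leading (batch) axis is free: any size works with the same weights.
for batch_size in (1, 8, 64):
    x = rng.standard_normal((batch_size, 100, 200))
    print(x.shape, "->", dense(x).shape)
```

The output shapes differ only in their first entry, which is the role None plays in the summary.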

Yes, None in the summary means that the batch (mini-batch) dimension is dynamic.
This is why you can use any batch size with your model.
The summary() method of a TF Keras model calls the Keras print_summary() utility internally.


How to run multi-gpu for a part of model?

I'm trying to train on multimodal data, where each sample consists of text tokens of shape (64, 512) and an image of shape (3, 256, 256).
Due to memory constraints, I am using a single sample as my batch,
while running the text tokens in parallel on 4 GPUs, passing (16, 512) to each GPU through the encoder (i.e. BERT).
I then want the average embedding of all of them (i.e. (64, 768) -> (1, 768)). With this average embedding, the model proceeds on one GPU to compute its relationship with the image.
My question is: how can I run only part of a model on multiple GPUs?
Thank you in advance!

How would I apply a nn.conv1d manually, given an input matrix and weight matrix?

I am trying to understand how a nn.conv1d processes an input for a specific example related to audio processing in a WaveNet model.
I have input data of shape (1,1,8820), which passes through an input layer (1,16,1), to output a shape of (1,16,8820).
That part I understand, because you can just multiply the two matrices. The next layer is a conv1d, kernel size=3, input channels=16, output channels=16, so the state dict shows a matrix with shape (16,16,3) for the weights. When the input of (1,16,8820) goes through that layer, the result is another (1,16,8820).
What multiplication steps occur within the layer to apply the weights to the audio data? In other words, if I wanted to apply the layer (forward calculations only) using only the input matrix, the state_dict matrix, and numpy, how would I do that?
This example uses the nn.Conv1d layer from PyTorch. Also, if the same layer had dilation=2, how would that change the operations?
A convolution is a specific type of "sliding window operation": it applies the same function to overlapping sliding windows of the input.
In your example, each window of 3 overlapping temporal samples (each with 16 channels) is the input to 16 filters. That gives 3x16x16 weights, which the state dict stores with shape (16, 16, 3), i.e. (output channels, input channels, kernel size).
You can think of it as "unfolding" the (1, 16, 8820) signal into (1, 16*3, 8820) sliding windows, then multiplying by a (16*3) x 16 weight matrix to get an output of shape (1, 16, 8820).
Padding, dilation and strides affect the way the "sliding windows" are formed.
See nn.Unfold for more information.
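The unfold-then-multiply view can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the PyTorch implementation: stride 1, a single group, symmetric zero padding (a WaveNet may instead pad causally), and weights in the (out_channels, in_channels, kernel_size) layout found in the state dict. The helper names `unfold1d` and `conv1d` are hypothetical.

```python
import numpy as np

def unfold1d(x, kernel_size, padding=0, dilation=1):
    """Turn (batch, channels, length) into sliding windows of shape
    (batch, channels * kernel_size, out_length). Assumes stride 1."""
    batch, channels, length = x.shape
    x_pad = np.pad(x, ((0, 0), (0, 0), (padding, padding)))
    out_len = length + 2 * padding - dilation * (kernel_size - 1)
    # Tap j of the kernel sees the padded input shifted by j * dilation.
    taps = [x_pad[:, :, j * dilation : j * dilation + out_len]
            for j in range(kernel_size)]
    # (batch, channels, kernel_size, out_len) -> merge channel and tap axes.
    return np.stack(taps, axis=2).reshape(batch, channels * kernel_size, out_len)

def conv1d(x, weight, bias=None, padding=0, dilation=1):
    """Forward pass only. weight has shape (out_channels, in_channels, kernel_size)."""
    c_out, c_in, k = weight.shape
    windows = unfold1d(x, k, padding, dilation)   # (batch, c_in * k, out_len)
    w_mat = weight.reshape(c_out, c_in * k)       # same (channel, tap) ordering
    out = w_mat @ windows                         # (batch, c_out, out_len)
    if bias is not None:
        out = out + bias[None, :, None]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 16, 8820))
w = rng.standard_normal((16, 16, 3))
print(conv1d(x, w, padding=1).shape)               # length preserved
print(conv1d(x, w, padding=2, dilation=2).shape)   # dilation 2 needs padding 2
```

With dilation=2 the three taps of each window are taken two samples apart instead of adjacent, so each window spans 5 samples; the matrix multiplication itself is unchanged.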

Why do we need to include_top=False if we need to change the input_shape?

As far as I know, the input tuple enters the network at the convolution blocks.
So if we want to change the input shape, modifying the convolutions would make sense.
Why do we need to set include_top=False and remove the fully connected layers at the end?
On the other hand, if we have a different number of classes, Keras has an option to change the softmax layer using no_of_classes.
I know that I am the one missing something here. Please help me
Example: For Inception Resnet V2
input_shape: optional shape tuple, only to be specified if include_top is False (otherwise the input shape has to be (299, 299, 3) (with 'channels_last' data format) or (3, 299, 299) (with 'channels_first' data format)). It should have exactly 3 input channels, and width and height should be no smaller than 139. E.g. (150, 150, 3) would be one valid value.
include_top: whether to include the fully-connected layer at the top of the network.
This is simply because the fully connected layers at the end can only take inputs of a fixed size, which is determined by the input shape and all the processing in the convolutional layers. Any change to the input shape changes the shape of the input to the fully connected layers, making the pretrained weights incompatible (the matrix sizes no longer match, so the weights cannot be applied).
This problem is specific to fully connected layers. If you use another layer for classification, such as global average pooling, you do not have this problem.
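The mismatch can be reproduced with a small NumPy sketch. It is not Keras code, and all sizes and names (`flatten_head`, `gap_head`, 8 channels, 10 classes, 5x5 and 7x7 feature maps) are made up for illustration. A flatten-plus-dense head is tied to one feature-map size, while a head built on global average pooling only depends on the channel count.

```python
import numpy as np

rng = np.random.default_rng(0)
channels, classes = 8, 10  # illustrative sizes, not taken from any real model

def flatten_head(features, weights):
    """Flatten (batch, height, width, channels) and apply a dense layer."""
    return features.reshape(features.shape[0], -1) @ weights

def gap_head(features, weights):
    """Average over height and width, then apply a dense layer."""
    return features.mean(axis=(1, 2)) @ weights

w_flat = rng.standard_normal((5 * 5 * channels, classes))  # sized for 5x5 maps
w_gap = rng.standard_normal((channels, classes))           # independent of map size

small = rng.standard_normal((2, 5, 5, channels))  # maps from the original input size
large = rng.standard_normal((2, 7, 7, channels))  # maps after enlarging the input

print(flatten_head(small, w_flat).shape)
print(gap_head(small, w_gap).shape, gap_head(large, w_gap).shape)

try:
    flatten_head(large, w_flat)
    flatten_ok = True
except ValueError:
    # 7 * 7 * 8 flattened features cannot be multiplied by a (200, 10) matrix.
    flatten_ok = False
print("flatten head accepts the larger maps:", flatten_ok)
```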

Keras : order in which dimensions of an input tensor is specified

Consider an input layer in keras as:
model.add(layers.Dense(32, input_shape=(784,)))
What this says is that the input is a 2D tensor where axis=0 (the batch dimension) is not specified while axis=1 is 784. Axis=0 can take any value.
My question is: isn't this style confusing?
Ideally, should it not be
This would reflect that axis=0 is a wildcard while axis=1 should be 784.
Is there any particular reason why it is so? Am I missing something here?
The consistency in this case is between the sizes of the layers and the size of the input. In general, the shapes are assumed to represent the nature of the data; in that sense, the batch dimension is not part of the data itself, but rather how you group it for training or evaluation. So, in your code snippet, it is quite clear that you have inputs with 784 features and a first layer producing a vector of 32 features. If you want to explicitly include the batch dimension, you can use instead batch_input_shape=(None, 784) (this is sometimes necessary, for example if you want to give batches of a fixed size but with an additional time dimension of unknown size). This is explained in the Sequential model guide, but also matches the documentation of the Input layer, where you can give a shape or batch_shape parameter (analogous to input_shape or batch_input_shape).
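The relation between the two conventions can be written down in plain Python. The helpers `to_batch_shape` and `accepts` below are hypothetical and only mimic the bookkeeping; they are not part of Keras. The sketch assumes that None means "any size" on that axis.

```python
def to_batch_shape(input_shape, batch_size=None):
    """Prepend the batch axis to a per-sample shape; None means 'any size'."""
    return (batch_size,) + tuple(input_shape)

def accepts(batch_shape, array_shape):
    """Check a concrete array shape against a shape where None matches anything."""
    return len(batch_shape) == len(array_shape) and all(
        want is None or want == got
        for want, got in zip(batch_shape, array_shape)
    )

# input_shape=(784,) corresponds to batch_input_shape=(None, 784).
shape = to_batch_shape((784,))
print(shape)
print(accepts(shape, (32, 784)), accepts(shape, (32, 100)))

# Fixed batch size of 16 with a time dimension of unknown length.
fixed = to_batch_shape((None, 8), batch_size=16)
print(fixed, accepts(fixed, (16, 50, 8)), accepts(fixed, (4, 50, 8)))
```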

Deconvolution2D layer in keras

This layer is not documented very well yet and I'm having a bit of trouble figuring out exactly how to use it.
I'm trying something like:
input_img = Input(shape=(1, h, w))
x = Convolution2D(16, 7, 7, activation='relu', border_mode='valid')(input_img)
d = Deconvolution2D(1, 7, 7, (None, 1, 2*h, 2*w))
x = d(x)
but when I try to write d.output_shape, I get the original shape of the image instead of twice that size (which is what I was expecting).
Any help will be greatly appreciated!
Short answer: you need to add subsample=(2,2) to Deconvolution2D if you wish the output to truly be twice as large as the input.
Longer answer: Deconvolution2D is severely undocumented and you have to go through its code to understand how to use it.
First, you must understand how the deconvolution layer works (skip this if you already know all the details). Deconvolution, unlike what its name suggests, simply applies the back-propagation step (the gradient calculation) of a standard convolution layer to the input of the deconvolution layer. The "kernel size" of the deconvolution layer is actually the kernel size of the virtual convolution layer of that backprop step. Given the size of a convolution kernel and its stride, it is straightforward to compute the output shape of the convolution layer (assuming no padding, it's (input - kernel) // stride + 1), but the reverse is not true. In fact, more than one input shape can match a given output shape of the convolution layer (because integer division isn't invertible). This means that for a deconvolution layer, the output shape cannot be determined from the input shape (which is implicitly known), kernel size and stride alone, which is why we need to know the output shape when we initialize the layer. Of course, because of the way the deconvolution layer is defined, for some input shapes you'll get holes in its output which are undefined, and if we forbid these cases then we actually can deduce the output shape.
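The non-invertibility is easy to check numerically. The sketch below uses the no-padding formula quoted above; the function names and the kernel/stride values are chosen for illustration only.

```python
def conv_output_size(input_size, kernel, stride=1):
    """Output length of a convolution with no padding."""
    return (input_size - kernel) // stride + 1

def inputs_matching(output_size, kernel, stride=1):
    """All input lengths that a no-padding convolution maps to output_size."""
    smallest = (output_size - 1) * stride + kernel
    return [n for n in range(smallest, smallest + stride)
            if conv_output_size(n, kernel, stride) == output_size]

# With stride 1 the mapping is one-to-one.
print(inputs_matching(4, kernel=7, stride=1))  # [10]

# With stride 2, two different input lengths give the same output length,
# so the output size alone does not identify the input size.
print(inputs_matching(4, kernel=7, stride=2))  # [13, 14]
```

The smallest matching input, (output - 1) * stride + kernel, is the one without "holes"; forbidding the others is what makes the deconvolution output shape deducible.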
Back to Keras and how the above is implemented. Confusingly, the output_shape parameter is not actually used to determine the output shape of the layer; instead, Keras tries to deduce it from the input, the kernel size and the stride, assuming only valid output_shapes are supplied (though the code does not check this). The output_shape itself is only used as input to the backprop step. Thus, you must also specify the stride parameter (subsample in Keras) in order to get the desired result (which Keras could have determined from the given input shape, output shape and kernel size).
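As a rough check of the shape arithmetic, the following sketch applies the no-padding relation (input - 1) * stride + kernel to the layers in the question. It is an assumption-laden illustration: it does not model Keras border modes, `h = 32` is an arbitrary example size, and `deconv_output_size` is a hypothetical helper rather than a Keras function.

```python
def deconv_output_size(input_size, kernel, stride=1):
    """Hole-free output length of a no-padding transposed convolution."""
    return (input_size - 1) * stride + kernel

h = 32                      # arbitrary example image height
conv_out = h - 7 + 1        # 7x7 'valid' convolution: 26

# Stride 1 merely undoes the shrinkage of the convolution, which is why
# the reported output shape equals the original image size.
print(deconv_output_size(conv_out, kernel=7, stride=1))  # 32

# Stride 2 roughly doubles the layer's input under this no-padding model.
print(deconv_output_size(conv_out, kernel=7, stride=2))  # 57
```

Whether a specific target such as 2*h is reachable exactly also depends on the border mode, which this sketch leaves out.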
