Operations made by Conv2D in PyTorch

According to the PyTorch documentation of Conv2D(c_in, c_out) (the other parameters are irrelevant for this question):
c_in is the number of channels of the input image.
c_out is the number of channels produced by the convolution layer.
What I don't understand is how many kernels/filters there are. In many posts I have seen that c_out is indeed the number of kernels, and that would mean that if I have an input image of 3 channels and I set c_out=10, the output would be 30 channels, but in reality I get 10 channels.

A convolution operation is performed by "sliding" a kernel over the input and computing the correlation between the kernel and the corresponding input window. This computation yields a single scalar for each "window".
Therefore, if you have an input of shape c_in×h×w, a single kernel has c_in×k_h×k_w parameters. Sliding this kernel over the input yields an output of shape 1×h×w (assuming proper padding). As you can see, the number of input channels, c_in, does not affect the output shape; it only affects the size of each kernel. Consequently, if you want c_out output channels, you need c_out filters, each of shape c_in×k_h×k_w.
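A quick way to check this in PyTorch is to inspect the weight tensor of a Conv2d layer; the snippet below uses the numbers from the question (3 input channels, 10 output channels) and an arbitrary 5×5 kernel:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5, padding=2)

# One filter per output channel, each spanning all input channels:
print(conv.weight.shape)  # torch.Size([10, 3, 5, 5]) -> (c_out, c_in, k_h, k_w)

x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)      # torch.Size([1, 10, 32, 32]) -> 10 output channels, not 30
```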

Related

Is using causal convolution / padding equivalent to shifting the outputs back?

I'm having some trouble understanding the purpose of causal convolutions. Suppose I'm doing time-series classification using a convolutional network with 1 layer and kernel=2, stride=1, dilation=0. Isn't it the same thing as shifting my output back by 1?
For larger networks, it would be a little more involved to take the parameters of all the layers into account to get the resulting receptive field and do a proper output shift. To me it seems that, if there is some leak, you could always account for it by shifting the output back.
For example, if at time step $t_2$ a non-causal CNN sees $x_0, x_1, x_2, x_3, x_4$, then you'd use the target associated with $t_4$, i.e. $y_4$.
Edit: I've seen diagrams for causal CNNs where all the arrows are right-aligned. I get that it's meant to illustrate that $y_t$ aligns to $x_t$, but couldn't you just as easily draw them like this:
[figure: non-causal CNN drawn with the arrows right-aligned]
The point of causal convolutions is not to see 'future' data. This is important in real-time sequential analysis because we won't have access to new information before it happens, whereas in training we typically do (since we have the whole training sequence). Therefore, a causal convolution at timestep t uses inputs from t-(k-1) to t (where t = current timestep and k = kernel size), rather than a typical convolution, which uses inputs from t-k//2 to t+k//2. This can be imagined as a one-sided kernel: instead of the target pixel/sample sitting in the centre of the kernel, it is now the rightmost (going from left to right) element of the kernel.
Using your example, if the top orange dot in the following picture is $t_n$, then $t_n$ has a receptive field stemming from $t_{n-4}$ to $t_n$, due to its kernel size of 2 and 4 layers.
Compare that to a non-causal convolution (ignore the dilated convolution on the right), where the receptive field stems from $t_{n-3}$ to $t_{n+3}$ because the kernel is double-sided.
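Here is a minimal PyTorch sketch of the idea, assuming a single 1D layer with the kernel size k=2 from the question; padding only on the left makes the convolution causal, so no output step looks ahead:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

k = 2                       # kernel size from the question
conv = nn.Conv1d(1, 1, kernel_size=k, bias=False)

x = torch.randn(1, 1, 10)   # (batch, channels, time)

# Causal: pad k-1 zeros on the left only, so the output at step t
# depends only on x[t-(k-1)] .. x[t] and never on future samples.
y = conv(F.pad(x, (k - 1, 0)))
print(y.shape)              # torch.Size([1, 1, 10]) -> same length, no look-ahead
```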

How would I apply a nn.conv1d manually, given an input matrix and weight matrix?

I am trying to understand how a nn.conv1d processes an input for a specific example related to audio processing in a WaveNet model.
I have input data of shape (1,1,8820), which passes through an input layer (1,16,1), to output a shape of (1,16,8820).
That part I understand, because you can just multiply the two matrices. The next layer is a conv1d, kernel size=3, input channels=16, output channels=16, so the state dict shows a matrix with shape (16,16,3) for the weights. When the input of (1,16,8820) goes through that layer, the result is another (1,16,8820).
What multiplication steps occur within the layer to apply the weights to the audio data? In other words, if I wanted to apply the layer (forward calculations only) using only the input matrix, the state_dict matrix, and numpy, how would I do that?
This example is using the nn.conv1d layer from Pytorch. Also, if the same layer had a dilation=2, how would that change the operations?
A convolution is a specific type of "sliding window operation": that is, applying the same function/operation on overlapping sliding windows of the input.
In your example, you treat each window of 3 overlapping temporal samples (each with 16 channels) as the input to 16 filters. Therefore, you have a weight tensor of shape 16×16×3 (out_channels × in_channels × kernel_size), matching the (16, 16, 3) you see in the state dict.
You can think of it as "unfolding" the (1, 16, 8820) signal into (1, 16*3, 8820) sliding windows, then multiplying by a 16*3 × 16 weight matrix to get an output of shape (1, 16, 8820).
Padding, dilation and strides affect the way the "sliding windows" are formed.
See nn.Unfold for more information.
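As a sketch of that unfolding, the PyTorch snippet below reproduces nn.Conv1d's forward pass with a single matrix multiplication, assuming padding=1 and no bias (the question does not state these, but they keep the output length at 8820):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 16, 8820)                       # (batch, channels, time)
conv = nn.Conv1d(16, 16, kernel_size=3, padding=1, bias=False)
ref = conv(x)                                      # reference output from PyTorch

# Manual forward pass: pad, unfold into sliding windows, then one matmul.
xp = F.pad(x, (1, 1))                              # (1, 16, 8822)
windows = xp.unfold(dimension=2, size=3, step=1)   # (1, 16, 8820, 3)
windows = windows.permute(0, 2, 1, 3).reshape(1, 8820, 16 * 3)  # (1, 8820, 48)
w = conv.weight.reshape(16, 16 * 3)                # (out_channels, in_channels * kernel)
manual = (windows @ w.t()).permute(0, 2, 1)        # back to (1, 16, 8820)

print(torch.allclose(ref, manual, atol=1e-5))      # True
```

With dilation=2, the only change is that the 3 samples in each window are taken 2 steps apart, and the padding needed to keep the length at 8820 grows to 2.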

LSTM input longer than output

I am not sure if I understand how exactly Keras version of LSTM works.
Let's say I have a vector of len=20 as input and I specify keras.layers.LSTM(units=10).
So in this example, does the network finish after processing 50% of the input, or does it process the rest from the start (I mean from the first cell)?
Units are never related to the input size.
Units are related only to the output size (units = output features or channels).
An LSTM layer will always process the entire data and optionally return either the "same length (all steps)" or "no length (only last step)".
In terms of shapes
You must have an input tensor with shape (batch, len=20, input_features).
And it will output:
For return_sequences=False: (batch, output_features=10) - no length
For return_sequences=True: (batch, len=20, output_features=10) - same length
The number of output features is always equal to units.
See a full explanation of LSTM layers here: Understanding Keras LSTMs
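A quick shape check in tf.keras illustrates this; the batch size of 4 and the 3 input features per step are arbitrary, while len=20 and units=10 come from the question:

```python
import tensorflow as tf

x = tf.random.normal((4, 20, 3))  # (batch, len=20, input_features=3)

last_only = tf.keras.layers.LSTM(units=10)(x)                         # no length
all_steps = tf.keras.layers.LSTM(units=10, return_sequences=True)(x)  # same length

print(last_only.shape)  # (4, 10)      -> only the last step
print(all_steps.shape)  # (4, 20, 10)  -> all 20 steps, 10 output features each
```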

Arbitrary length inputs for CNNs in sequential learning

In An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling, the authors state that TCN networks, a specific type of 1D CNNs applied to sequential data, "can also take in inputs of arbitrary lengths by sliding the 1D convolutional kernels", just like Recurrent Nets. I am asking myself how this can be done.
For an RNN, it is straightforward: the same function is applied as many times as there are input steps. However, for CNNs (or any feed-forward NN in general), one must prespecify the number of input neurons. So the only way I can see TCNs dealing with arbitrary-length inputs is by specifying a fixed-length input neuron space and then adding zero padding to the arbitrary-length inputs.
Am I correct in my understanding?
If you have a fully convolutional neural network, there is no reason to have a fully specified input shape. You definitely need a fixed rank, and the last (channel) dimension should stay the same, but otherwise you can specify an input shape that, in TensorFlow, would look like Input((None, 10)) in the case of 1D CNNs.
Indeed, the shape of the convolution kernel doesn't depend on the length of the input in the temporal dimension (though it typically does depend on the last, channel dimension in convolutional neural networks), and you can apply it to any input of the same rank (and same last dimension).
For example, let's say you are applying a single 1D convolution with a kernel that sums 2 neighbouring elements (kernel = [1, 1]). This operation can be applied to an input of any length, as long as it is 1D.
However, when confronted with a sequence-to-label task that requires further operations in the stack, such as a fully-connected layer, the inputs must be of fixed length (or must be made so through zero padding).
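Here is a minimal sketch of the same idea in PyTorch (the answer's Input((None, 10)) is Keras notation; the 10 input channels are kept, the layer widths are arbitrary), applying one fully-convolutional stack to inputs of several lengths:

```python
import torch
import torch.nn as nn

# No dense layer anywhere, so nothing fixes the temporal length.
net = nn.Sequential(
    nn.Conv1d(10, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(32, 1, kernel_size=3, padding=1),
)

for length in (50, 200, 977):
    x = torch.randn(1, 10, length)  # (batch, channels, time)
    print(net(x).shape)             # (1, 1, length) -> output length follows the input
```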

Keras LSTM: first argument

In Keras, if you want to add an LSTM layer with 10 units, you use model.add(LSTM(10)). I've heard that number 10 referred to as the number of hidden units here and as the number of output units (line 863 of the Keras code here).
My question is, are those two things the same? Is the dimensionality of the output the same as the number of hidden units? I've read a few tutorials (like this one and this one), but none of them state this explicitly.
The answers seem to refer to multi-layer perceptrons (MLPs), in which the hidden layer can be of a different size and often is. For LSTMs, the hidden dimension is the same as the output dimension by construction:
The hidden state h is the output for a given timestep, and the cell state c is bound to the same hidden size by the element-wise multiplications. Adding the terms that compute the gates requires that both the input kernel W and the recurrent kernel U map to that same dimension. This is certainly the case for the Keras LSTM as well, and is why you only provide a single units argument.
To get a good intuition for why this makes sense, remember that the LSTM's job is to encode a sequence into a vector (maybe a gross oversimplification, but it's all we need). The size of that vector is specified by hidden_units, and the output is:
$\underbrace{(1 \times \text{input\_dim})}_{\text{seq vector}} \times \underbrace{(\text{input\_dim} \times \text{hidden\_units})}_{\text{RNN weights}}$,
which has shape $1 \times \text{hidden\_units}$ (a row vector representing the encoding of your input sequence). And thus, the names in this case are used synonymously.
Of course, RNNs require more than one multiplication, and Keras implements RNNs as a sequence of matrix-matrix multiplications rather than the vector-matrix multiplication shown above.
The number of hidden units is not the same as the number of output units.
The number 10 controls the dimension of the output hidden state (the source code for the LSTM constructor method can be found here; 10 specifies the units argument). In one of the tutorials you have linked to (colah's blog), the units argument would control the dimension of the vectors $h_{t-1}$, $h_t$, and $h_{t+1}$ in the RNN image.
If you want to control the number of LSTM blocks in your network, you need to specify this as an input into the LSTM layer. The input shape to the layer is (nb_samples, timesteps, input_dim) (see the Keras documentation). timesteps controls how many LSTM blocks your network contains. Referring to the tutorial on colah's blog again, in the RNN image, timesteps would control how many green blocks the network contains.
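A quick way to see that units sets both the hidden and the output size is to inspect the layer's weights after building it; this assumes the current tf.keras API, with the 20 timesteps and units=10 from the question and an arbitrary input dimension of 4:

```python
import tensorflow as tf

layer = tf.keras.layers.LSTM(10)
layer.build((None, 20, 4))  # (batch, timesteps=20, input_dim=4)

w_input, w_recurrent, bias = layer.get_weights()
print(w_input.shape)      # (4, 40)  -> (input_dim, 4 * units), one block per gate
print(w_recurrent.shape)  # (10, 40) -> (units, 4 * units): hidden size equals units
print(bias.shape)         # (40,)
```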
