Keras Dot Axes questions

I am trying to use Keras Dot and have the following errors.
Could you explain what I am doing wrong?
import numpy as np
from keras.layers import Input, dot
from keras.models import Model

x1 = Input(shape=(2,4))
x2 = Input(shape=(4,))
y1 = dot([x1, x2], axes=(2, 1))
modelA = Model(inputs=[x1, x2], outputs=y1)
a1 = np.arange(16).reshape(2, 2, 4)
a2 = np.array([1, 2, 3, 4])
modelA.predict([a1, a2])
---->
ValueError: Error when checking : expected input_40 to have shape (None, 4) but
got array with shape (4, 1)

I am new to Keras too, and the following is what I figured out after playing around with the Dot operation.
Firstly, the shape parameter of an Input layer does NOT include the batch size. In your code, x2 = Input(shape=(4,)), so x2 expects input data of shape (None, 4) (None refers to the batch size), but a2 = np.array([1,2,3,4]) has shape (4,) (which Keras reads as (4, 1) here), hence the error message.
To get rid of the error you need to add the batch_size dimension to a2.
But then there is another problem: according to the Dot documentation, I think x1 and x2 should have the same batch size:
if applied to a list of two tensors a and b of shape (batch_size, n), the output will be a tensor of shape (batch_size, 1) where each entry i will be the dot product between a[i] and b[i].
So I manually matched the batch sizes of a1 and a2: the batch size of a1 is 2, so a2 needs to be np.array([[1,2,3,4],[1,2,3,4]]).
Now you get the desired result:
[[ 20. 60.]
[100. 140.]]
A few more words for beginners like me: the shape of x1 is (batch_size, 2, 4) and the shape of x2 is (batch_size, 4), so at first glance they look incompatible. This is where the axes parameter comes into play. In the OP's code, axes=(2,1) means: dot x1's axis 2 (0-indexed; the axis of length 4) with x2's axis 1 (also of length 4). So it computes [0,1,2,3]·[1,2,3,4] = 20, [4,5,6,7]·[1,2,3,4] = 60, ...
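Putting it all together, here is a runnable version of the corrected snippet (a minimal sketch; the imports assume the standalone keras package, adjust to tensorflow.keras if needed):
import numpy as np
from keras.layers import Input, dot
from keras.models import Model

x1 = Input(shape=(2, 4))
x2 = Input(shape=(4,))
y1 = dot([x1, x2], axes=(2, 1))
modelA = Model(inputs=[x1, x2], outputs=y1)

a1 = np.arange(16).reshape(2, 2, 4)
a2 = np.array([[1, 2, 3, 4], [1, 2, 3, 4]])  # batch size matched to a1
print(modelA.predict([a1, a2]))
# [[ 20.  60.]
#  [100. 140.]]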

Related

Difference in the order of applying linear decoder and average pooling for sequence models

I am working with sequence modelling in PyTorch and trying to determine whether the order of the pooling and linear decoding layers matters. Given a sequence of shape (Batch, SeqLen, dim_model) that I want to transform into (Batch, dim_output), I need a pooling layer to reduce the second dimension (SeqLen) and an affine transformation that maps dim_model to dim_output. Assume Batch = 16, SeqLen = 6000, dim_model = 32, dim_output = 5; we have the following input:
import torch
pooler = lambda x: x.mean(dim=1)
decoder = torch.nn.Linear(32, 5)
x = torch.randn(16, 6000, 32)
Would this:
y = decoder(pooler(x))
Be the same as:
y = pooler(decoder(x))
The normalized difference between both outputs suggests that they are close:
torch.norm(decoder(pooler(x)) - pooler(decoder(x)))
output:
tensor(6.5412e-08, grad_fn=<CopyBackwards>)
But can one say they are equivalent? Are the gradients computed in the same way?
I am interested in the case of using an arbitrary pooling layer; this includes, for instance, the "last" pooler:
pooler = lambda x: x[:,-1]
torch.norm(decoder(pooler(x)) - pooler(decoder(x)))
output:
tensor(0., grad_fn=<CopyBackwards>)
A linear layer computes x -> Ax + b for some matrix A and vector b.
If you have a bunch of inputs x1, x2, ..., xn, then A[(x1+...+xn)/n] = (Ax1 + ... + Axn)/n (and the bias is likewise preserved, since the mean of n copies of b is b), so for mean pooling, applying pooling first and then the linear layer gives (up to floating point errors) the same value as applying the linear layer first and then pooling.
For "last pooling", the result is the same because it doesn't matter whether you apply A to every element and then afterwards only pick the final one, or if you pick the final one, and apply A to it.
However, for plenty of other operations, the result would in general not be the same. Take max pooling: if x1 = (1, 0, 0), x2 = (0, 1, 0), x3 = (0, 0, 1), and A = ((1, 1, 1)), then Ax1 = Ax2 = Ax3 = (1), so applying max pooling after the linear layer just gives you (1), but max pooling applied to x1, x2, x3 gives you (1, 1, 1), and A(1, 1, 1) = (3).
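To see the max-pooling case concretely, here is a quick sketch reusing the shapes from the question (max_pooler is an illustrative name):
import torch

decoder = torch.nn.Linear(32, 5)
x = torch.randn(16, 6000, 32)

# elementwise max over the sequence dimension
max_pooler = lambda t: t.max(dim=1).values

# the two orders generally disagree for max pooling:
print(torch.norm(decoder(max_pooler(x)) - max_pooler(decoder(x))))
# a value well above floating-point tolerance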

How does calculation in a GRU layer take place

So I want to understand exactly how the outputs and hidden state of a GRU cell are calculated.
I obtained the pre-trained model from here and the GRU layer has been defined as nn.GRU(96, 96, bias=True).
I looked at the PyTorch documentation and confirmed the dimensions of the weights and biases as:
weight_ih_l0: (288, 96)
weight_hh_l0: (288, 96)
bias_ih_l0: (288)
bias_hh_l0: (288)
My input size and output size are (1000, 8, 96). I understand that there are 1000 tensors, each of size (8, 96). The hidden state is (1, 8, 96), which is one tensor of size (8, 96).
I have also printed the variable batch_first and found it to be False. This means that:
Sequence length: L=1000
Batch size: B=8
Input size: Hin=96
Now, going by the equations in the documentation, for the reset gate I need to multiply the weight by the input x. But my weights are 2-dimensional and my input has three dimensions.
Here is what I've tried: I took the first (8, 96) matrix from my input and multiplied it with the transpose of my weight matrix:
Input (8, 96) x Weight (96, 288) = (8, 288)
Then I add the bias by replicating the (288,) vector eight times to give (8, 288). This would give the size of r(t) as (8, 288). Similarly, z(t) would also be (8, 288).
This r(t) is used in n(t); since the Hadamard product is used, both matrices being multiplied have to be the same size, that is (8, 288). This implies that n(t) is also (8, 288).
Finally, h(t) is a Hadamard product and matrix addition, which would give the size of h(t) as (8, 288), which is wrong.
Where am I going wrong in this process?
TLDR; This confusion comes from the fact that each weight tensor of the layer is the concatenation of the parameters of the three gates (reset, update, and new), which is why its leading dimension is 3*hidden_size = 288.
- nn.GRU layer weight/bias layout
You can take a closer look at what's inside the torch.nn.GRU layer implementation by peeking at its weights and biases.
>>> gru = nn.GRU(input_size=96, hidden_size=96, num_layers=1)
First the parameters of the GRU layer:
>>> gru._all_weights
[['weight_ih_l0', 'weight_hh_l0', 'bias_ih_l0', 'bias_hh_l0']]
You can look at gru.state_dict() to get the dictionary of weights of the layer.
We have two weights and two biases, _ih stands for 'input-hidden' and _hh stands for 'hidden-hidden'.
For more efficient computation the parameters have been concatenated together, as the documentation page clearly explains (| means concatenation). In this particular example num_layers=1 and k=0:
~GRU.weight_ih_l[k] – the learnable input-hidden weights of the layer (W_ir | W_iz | W_in), of shape (3*hidden_size, input_size).
~GRU.weight_hh_l[k] – the learnable hidden-hidden weights of the layer (W_hr | W_hz | W_hn), of shape (3*hidden_size, hidden_size).
~GRU.bias_ih_l[k] – the learnable input-hidden bias of the layer (b_ir | b_iz | b_in), of shape (3*hidden_size).
~GRU.bias_hh_l[k] – the learnable hidden-hidden bias of the layer (b_hr | b_hz | b_hn), of shape (3*hidden_size).
For further inspection we can get those split up with the following code:
>>> H_in = H_out = 96  # input and hidden sizes coincide for this layer
>>> W_ih, W_hh, b_ih, b_hh = gru._flat_weights
>>> W_ir, W_iz, W_in = W_ih.split(H_out)  # split the three gate blocks along dim 0
>>> W_hr, W_hz, W_hn = W_hh.split(H_out)
>>> b_ir, b_iz, b_in = b_ih.split(H_out)
>>> b_hr, b_hz, b_hn = b_hh.split(H_out)
Now we have the 12 tensor parameters sorted out.
- Expressions
The four expressions for a GRU layer, r_t, z_t, n_t, and h_t, are computed at each timestep.
The first operation is r_t = σ(W_ir @ x_t + b_ir + W_hr @ h + b_hr). I use the @ sign to designate the matrix multiplication operator (__matmul__). Remember that W_ir is shaped (hidden_size, input_size), while x_t contains the element at step t of the x sequence: the tensor x_t = x[t] is shaped (N=batch_size, H_in=input_size). At this point, it's simply a matrix multiplication between the input x[t] and the transposed weight matrix. The resulting tensor r is shaped (N, hidden_size=H_out):
>>> (x[t] @ W_ir.T).shape
torch.Size([8, 96])
The same is true for all other weight multiplication operations performed. As a result, you end up with an output tensor shaped (N, H_out=hidden_size).
In the following expressions h is the tensor containing the hidden state of the previous step for each element in the batch, i.e. shaped (N, hidden_size=H_out), since num_layers=1, i.e. there's a single hidden layer.
>>> r_t = torch.sigmoid(x[t] @ W_ir.T + b_ir + h @ W_hr.T + b_hr)
>>> r_t.shape
torch.Size([8, 96])
>>> z_t = torch.sigmoid(x[t] @ W_iz.T + b_iz + h @ W_hz.T + b_hz)
>>> z_t.shape
torch.Size([8, 96])
The output of the layer is the concatenation of the computed h tensors at
consecutive timesteps t (between 0 and L-1).
- Demonstration
Here is a minimal example of an nn.GRU forward pass computed manually:
Parameter   Description        Value
H_in        feature size       3
H_out       hidden size        2
L           sequence length    3
N           batch size         1
k           number of layers   1
Setup:
import torch
from torch import nn

H_in, H_out, L, N, k = 3, 2, 3, 1, 1  # values from the table above

gru = nn.GRU(input_size=H_in, hidden_size=H_out, num_layers=k)
W_ih, W_hh, b_ih, b_hh = gru._flat_weights
W_ir, W_iz, W_in = W_ih.split(H_out)
W_hr, W_hz, W_hn = W_hh.split(H_out)
b_ir, b_iz, b_in = b_ih.split(H_out)
b_hr, b_hz, b_hn = b_hh.split(H_out)
Random input:
x = torch.rand(L, N, H_in)
Inference loop:
output = []
h = torch.zeros(1, N, H_out)
for t in range(L):
    r = torch.sigmoid(x[t] @ W_ir.T + b_ir + h @ W_hr.T + b_hr)
    z = torch.sigmoid(x[t] @ W_iz.T + b_iz + h @ W_hz.T + b_hz)
    n = torch.tanh(x[t] @ W_in.T + b_in + r * (h @ W_hn.T + b_hn))
    h = (1 - z) * n + z * h
    output.append(h)
The final output is given by stacking the tensors h at consecutive timesteps:
>>> torch.vstack(output)
tensor([[[0.1086, 0.0362]],
[[0.2150, 0.0108]],
[[0.3020, 0.0352]]], grad_fn=<CatBackward>)
In this case the output shape is (L, N, H_out), i.e. (3, 1, 2).
You can compare this with the result of output, _ = gru(x).
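A quick sanity check, continuing the snippet above: the manual loop starts from the same zero hidden state that nn.GRU uses by default, so the two outputs should match up to floating-point tolerance.
>>> out_ref, _ = gru(x)
>>> torch.allclose(torch.vstack(output), out_ref)
True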

How can I reshape 1D np array into 3D?

I have my 699 training features stored in the array X.
X.shape
(699,)
Each element, however, has shape (1292, 13).
For instance:
X[0].shape
(1292, 13)
How can I reshape it correctly to input into a CNN?
In order to feed them into a Keras Conv2D, for example, you must provide a specific input_shape.
So your number of samples is 699 and each sample has shape (1292, 13, 1).
The last dimension (1) is the number of channels, so for grayscale images (or similar single-channel data) you put 1, and for colour you put 3.
So something like this:
input_shape = (len(X), X[0].shape[0], X[0].shape[1], 1)
tf.keras.layers.Conv2D(2, 3, activation='relu', input_shape=input_shape[1:])
X = np.stack(X) did the job, as mentioned by @hpaulj:
X.shape
(419, 1292, 13, 1)
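For reference, here is a minimal sketch of the stack-plus-channel-dimension step (synthetic data standing in for the real X, each element assumed to have shape (1292, 13)):
import numpy as np

samples = [np.random.rand(1292, 13) for _ in range(699)]  # stand-in for X
X = np.stack(samples)     # -> (699, 1292, 13)
X = X[..., np.newaxis]    # add the channel axis -> (699, 1292, 13, 1)
print(X.shape)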

Weird output for weights/filters in CNN

My task is to visualize the plotted weights of a CNN layer. When I pass the parameters filters=32 and kernel_size=(3, 3), I expect .get_weights() (which extracts the weights and biases) to return 32 matrices, each of size 3x3, but instead I get a weirdly nested output.
The output is as follows:
a = model.layers[0].get_weights()
a[0][0][0]
array([[ 2.87332404e-02, -2.80513391e-02,
**... 32 values ...**,
-1.55516148e-01, -1.26494586e-01, -1.36454999e-01,
1.61165968e-02, 7.63138831e-02],
[-5.21791205e-02, 3.13560963e-02, **... 32 values ...**,
-7.63987377e-02, 7.28923678e-02, 8.98564830e-02,
-3.02852653e-02, 4.07049060e-02],
[-7.04478994e-02, 1.33816227e-02,
**... 32 values ...**, -1.99537817e-02,
-1.67200342e-01, 1.15980692e-02]], dtype=float32)
I want to know why I am getting this kind of output, and how I can get the weights in the expected shape. Thanks in advance.
Weights in a neural network are values that represent the connection strength between input nodes and output nodes (or nodes in the next layer).
A Conv2D layer's weights usually have shape (H, W, I, O), where:
H is kernel height
W is kernel width
I is number of input channels
O is number of output channels
Conv2D weights can be interpreted as the connection strength between a patch of the input channels and the nodes of an output filter/feature map. This way you have weights of shape (H, W) between each input channel and each output channel. It should be noted that the weights are shared among different patches of the same channel.
Consider a convolution of an (8, 8, 1) input with a (2, 2) kernel that produces an (8, 8, 1) output. The weights of this layer have shape (2, 2, 1, 1).
The same input can be used to produce 2 feature maps using 2 (2, 2) filters. In that case the shape of the weights would be (2, 2, 1, 2).
Hope this will clarify how to interpret the shape of convolutional layers.
The shape of the kernel weights from a Conv2D layer is (kernel_size[0], kernel_size[1], n_input_channels, filters). So in your case
a = model.layers[0].get_weights()
print(a[0].shape)
# should print (3,3,z,32) if your input has shape (x, y, z)
If you want to print the weights from one of the filters, you can do
a[0][:,:,:,0]
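As an illustration, here is a minimal sketch of extracting the individual kernels (the toy model and the (28, 28, 1) input shape are assumptions for the example):
import numpy as np
import tensorflow as tf

# toy model: 32 filters of size 3x3 over a single-channel input
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
])

kernels, biases = model.layers[0].get_weights()
print(kernels.shape)   # (3, 3, 1, 32): (kernel_h, kernel_w, in_channels, filters)

# move the filter axis to the front: 32 separate (3, 3, 1) kernels
filters = np.transpose(kernels, (3, 0, 1, 2))
print(filters.shape)   # (32, 3, 3, 1)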

Multiply 3 matrices in a Keras custom layer

I would like to create a custom Keras layer that calculates the product between 2 input matrices and 1 weight matrix (a diagonal matrix): x W y
x = Input((8,200)) # (?,8,200)
y = Input((10,200)) # (?,10,200)
W # weight matrix defined with Keras, shape (200,)
I want the output matrix that computes x W y, with shape (?, 8, 10).
I tried:
K.dot(x*W, K.transpose(y)) # raises a dimension error
K.dot(x*W, Permute((2,1))(y)) # (?, 8, ?, 10)
Without the first dimension (batch size) I can see how to do it, but with it I'm a little lost.
You can use K.batch_dot, which is made for this purpose.
K.batch_dot(x*W, K.permute_dimensions(y, (0,2,1)), axes=[2, 1]) # (?, 8, 10)
will do the trick.
You can specify the axis along which to take the dot product in a Keras Dot layer. The following code shows how to multiply your inputs x and y. If you want to add a weight matrix W you can do that in a similar way (by first multiplying x and W).
x = Input((8,200)) # (?,8,200)
y = Input((10,200)) # (?,10,200)
output = keras.layers.Dot(axes=-1)([x, y]) # (?,8,10)
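For completeness, here is a minimal sketch of wrapping the weighted product in a custom layer with a trainable diagonal weight (the class name DiagonalBilinear and the 'ones' initializer are illustrative assumptions):
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as K

class DiagonalBilinear(keras.layers.Layer):
    # computes x diag(w) y^T for x: (?, 8, 200) and y: (?, 10, 200)
    def build(self, input_shape):
        dim = input_shape[0][-1]  # 200
        self.w = self.add_weight(name='w', shape=(dim,),
                                 initializer='ones', trainable=True)

    def call(self, inputs):
        x, y = inputs
        # scale x by the diagonal, then batch-dot with y transposed -> (?, 8, 10)
        return K.batch_dot(x * self.w, K.permute_dimensions(y, (0, 2, 1)))

x = keras.Input((8, 200))
y = keras.Input((10, 200))
output = DiagonalBilinear()([x, y])  # (?, 8, 10)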
