How does calculation in a GRU layer take place - pytorch

So I want to understand exactly how the outputs and hidden state of a GRU cell are calculated.
I obtained the pre-trained model from here and the GRU layer has been defined as nn.GRU(96, 96, bias=True).
I looked at the PyTorch documentation and confirmed the dimensions of the weights and biases as:
weight_ih_l0: (288, 96)
weight_hh_l0: (288, 96)
bias_ih_l0: (288)
bias_hh_l0: (288)
My input size and output size are (1000, 8, 96). I understand that there are 1000 tensors, each of size (8, 96). The hidden state is (1, 8, 96), which is one tensor of size (8, 96).
I have also printed the variable batch_first and found it to be False. This means that:
Sequence length: L=1000
Batch size: B=8
Input size: Hin=96
Now going by the equations from the documentation, for the reset gate, I need to multiply the weight by the input x. But my weights are 2-dimensions and my input has three dimensions.
Here is what I've tried, I took the first (8, 96) matrix from my input and multiplied it with the transpose of my weight matrix:
Input (8, 96) x Weight (96, 288) = (8, 288)
Then I add the bias by replicating the (288) eight times to give (8, 288). This would give the size of r(t) as (8, 288). Similarly, z(t) would also be (8, 288).
This r(t) is used in n(t). Since a Hadamard product is used, both matrices being multiplied have to be the same size, that is (8, 288). This implies that n(t) is also (8, 288).
Finally, h(t) is a Hadamard product and a matrix addition, which would give the size of h(t) as (8, 288), which is wrong.
Where am I going wrong in this process?

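TL;DR: The confusion comes from the fact that each weight tensor of the layer is the concatenation of the weights of the three gates (r, z, n): weight_ih_l0 holds the three input-hidden matrices and weight_hh_l0 the three hidden-hidden matrices.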
- nn.GRU layer weight/bias layout
You can take a closer look at what's inside the GRU layer implementation torch.nn.GRU by peeking at the weights and biases.
>>> gru = nn.GRU(input_size=96, hidden_size=96, num_layers=1)
First the parameters of the GRU layer:
>>> gru._all_weights
[['weight_ih_l0', 'weight_hh_l0', 'bias_ih_l0', 'bias_hh_l0']]
You can look at gru.state_dict() to get the dictionary of weights of the layer.
We have two weights and two biases, _ih stands for 'input-hidden' and _hh stands for 'hidden-hidden'.
For more efficient computation the parameters have been concatenated together, as the documentation page clearly explains (| means concatenation). In this particular example num_layers=1 and k=0:
~GRU.weight_ih_l[k] – the learnable input-hidden weights of the layer (W_ir | W_iz | W_in), of shape (3*hidden_size, input_size).
~GRU.weight_hh_l[k] – the learnable hidden-hidden weights of the layer (W_hr | W_hz | W_hn), of shape (3*hidden_size, hidden_size).
~GRU.bias_ih_l[k] – the learnable input-hidden bias of the layer (b_ir | b_iz | b_in), of shape (3*hidden_size).
~GRU.bias_hh_l[k] – the learnable hidden-hidden bias of the layer (b_hr | b_hz | b_hn), of shape (3*hidden_size).
For further inspection we can get those split up with the following code:
>>> W_ih, W_hh, b_ih, b_hh = gru._flat_weights
>>> W_ir, W_iz, W_in = W_ih.split(gru.hidden_size)
>>> W_hr, W_hz, W_hn = W_hh.split(gru.hidden_size)
>>> b_ir, b_iz, b_in = b_ih.split(gru.hidden_size)
>>> b_hr, b_hz, b_hn = b_hh.split(gru.hidden_size)
Now we have the 12 tensor parameters sorted out.
- Expressions
The four expressions for a GRU layer: r_t, z_t, n_t, and h_t, are computed at each timestep.
The first operation is r_t = σ(W_ir@x_t + b_ir + W_hr@h + b_hr). I use the @ sign to designate the matrix multiplication operator (__matmul__). Remember W_ir is shaped (hidden_size, input_size=H_in), which is why it appears transposed in the code, while x_t contains the element at step t from the x sequence. Tensor x_t = x[t] is shaped as (N=batch_size, H_in=input_size). At this point, it's simply a matrix multiplication between the input x[t] and the transposed weight matrix. The resulting tensor r is shaped (N, hidden_size):
>>> (x[t]@W_ir.T).shape
(8, 96)
The same is true for all other weight multiplication operations performed. As a result, you end up with an output tensor shaped (N, H_out=hidden_size).
In the following expressions h is the tensor containing the hidden state of the previous step for each element in the batch, i.e. shaped (N, hidden_size=H_out), since num_layers=1, i.e. there's a single hidden layer.
>>> r_t = torch.sigmoid(x[t]@W_ir.T + b_ir + h@W_hr.T + b_hr)
>>> r_t.shape
(8, 96)
>>> z_t = torch.sigmoid(x[t]@W_iz.T + b_iz + h@W_hz.T + b_hz)
>>> z_t.shape
(8, 96)
The output of the layer is the concatenation of the computed h tensors at consecutive timesteps t (between 0 and L-1).
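To make the shapes concrete for the dimensions in the question (batch of 8, input and hidden size 96), here is a minimal NumPy sketch of a single timestep, including the n_t and h_t expressions. The weights are random stand-ins with the documented shapes (an assumption for illustration, not the pre-trained model), and the helper sigmoid is defined only for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
B, H_in, H = 8, 96, 96                      # batch size, input size, hidden size

# Random stand-ins for weight_ih_l0, weight_hh_l0, bias_ih_l0, bias_hh_l0.
W_ih = rng.standard_normal((3 * H, H_in)) * 0.1   # (288, 96)
W_hh = rng.standard_normal((3 * H, H)) * 0.1      # (288, 96)
b_ih = rng.standard_normal(3 * H) * 0.1           # (288,)
b_hh = rng.standard_normal(3 * H) * 0.1           # (288,)

# Undo the concatenation: three (96, 96) blocks per weight, three (96,) blocks per bias.
W_ir, W_iz, W_in = np.split(W_ih, 3, axis=0)
W_hr, W_hz, W_hn = np.split(W_hh, 3, axis=0)
b_ir, b_iz, b_in = np.split(b_ih, 3)
b_hr, b_hz, b_hn = np.split(b_hh, 3)

x_t = rng.standard_normal((B, H_in))        # one timestep of the input, x[t]
h = np.zeros((B, H))                        # previous hidden state

r = sigmoid(x_t @ W_ir.T + b_ir + h @ W_hr.T + b_hr)        # reset gate
z = sigmoid(x_t @ W_iz.T + b_iz + h @ W_hz.T + b_hz)        # update gate
n = np.tanh(x_t @ W_in.T + b_in + r * (h @ W_hn.T + b_hn))  # candidate state
h_next = (1 - z) * n + z * h                                # new hidden state

print(r.shape, z.shape, n.shape, h_next.shape)
```

Every gate uses only its own (96, 96) block, so each intermediate tensor and the new hidden state stay at (8, 96) instead of (8, 288).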
- Demonstration
Here is a minimal example of an nn.GRU inference manually computed:
Parameters   Description        Values
H_in         feature size       3
H_out        hidden size        2
L            sequence length    3
N            batch size         1
k            number of layers   1
Setup:
import torch
from torch import nn

H_in, H_out, L, N, k = 3, 2, 3, 1, 1  # values from the table above
gru = nn.GRU(input_size=H_in, hidden_size=H_out, num_layers=k)
W_ih, W_hh, b_ih, b_hh = gru._flat_weights
W_ir, W_iz, W_in = W_ih.split(H_out)
W_hr, W_hz, W_hn = W_hh.split(H_out)
b_ir, b_iz, b_in = b_ih.split(H_out)
b_hr, b_hz, b_hn = b_hh.split(H_out)
Random input:
x = torch.rand(L, N, H_in)
Inference loop:
output = []
h = torch.zeros(1, N, H_out)  # initial hidden state
for t in range(L):
    r = torch.sigmoid(x[t]@W_ir.T + b_ir + h@W_hr.T + b_hr)  # reset gate
    z = torch.sigmoid(x[t]@W_iz.T + b_iz + h@W_hz.T + b_hz)  # update gate
    n = torch.tanh(x[t]@W_in.T + b_in + r*(h@W_hn.T + b_hn))  # candidate state
    h = (1-z)*n + z*h  # new hidden state
    output.append(h)
The final output is given by stacking the tensors h at consecutive timesteps:
>>> torch.vstack(output)
tensor([[[0.1086, 0.0362]],
[[0.2150, 0.0108]],
[[0.3020, 0.0352]]], grad_fn=<CatBackward>)
In this case the output shape is (L, N, H_out), i.e. (3, 1, 2).
You can compare this with output, _ = gru(x).
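That comparison can also be automated. The following is a sketch under a few assumptions of mine: a fixed seed, the public attributes weight_ih_l0, weight_hh_l0, bias_ih_l0 and bias_hh_l0 instead of _flat_weights, and a tolerance of 1e-6 for float32 rounding.

```python
import torch
from torch import nn

torch.manual_seed(0)
H_in, H_out, L, N = 3, 2, 3, 1

gru = nn.GRU(input_size=H_in, hidden_size=H_out, num_layers=1)

# Split the concatenated parameters into their per-gate blocks (r, z, n).
W_ir, W_iz, W_in = gru.weight_ih_l0.split(H_out)
W_hr, W_hz, W_hn = gru.weight_hh_l0.split(H_out)
b_ir, b_iz, b_in = gru.bias_ih_l0.split(H_out)
b_hr, b_hz, b_hn = gru.bias_hh_l0.split(H_out)

x = torch.rand(L, N, H_in)
h = torch.zeros(N, H_out)          # initial hidden state, one row per batch element

steps = []
with torch.no_grad():
    for t in range(L):
        r = torch.sigmoid(x[t] @ W_ir.T + b_ir + h @ W_hr.T + b_hr)
        z = torch.sigmoid(x[t] @ W_iz.T + b_iz + h @ W_hz.T + b_hz)
        n = torch.tanh(x[t] @ W_in.T + b_in + r * (h @ W_hn.T + b_hn))
        h = (1 - z) * n + z * h
        steps.append(h)
    manual = torch.stack(steps)    # (L, N, H_out)
    reference, h_n = gru(x)        # output of the built-in layer

matches = torch.allclose(manual, reference, atol=1e-6)
print(manual.shape, matches)
```

If the manual loop follows the documented equations, matches should be True, and the last manual step should equal h_n[0].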

Related

keras input not matching output confusion

I am trying to implement a silly learn to rank example. Essentially, I have 2 descriptions of a location, size and number of bathrooms. I want to "combine" them to create a score. Then I wish to compare the scores for the "best". I will always be comparing 3 locations at a time.
The neural network I expect to do this:
# 3 locations with 2 descriptions.
rinputs = Input(shape=(3, 2), name ='inputlayer')
# take my 3 expected inputs, split them
split = Lambda( lambda x: tf.split(x,num_or_size_splits=3,axis=1))(rinputs)
input_one_tensor = split[0]
input_two_tensor = split[1]
input_three_tensor = split[2]
# combine each set of location elements into 1 "score"
layer2 = Dense(1, name = 'Layer2', use_bias = True, activation = 'sigmoid') # 60 was better than 100
layer2a = layer2(input_one_tensor)
layer2b = layer2(input_two_tensor)
layer2c = layer2(input_three_tensor)
concatLayer = Concatenate(name = 'ConcatLayer2')([layer2a,layer2b, layer2c])
# softmax my score to get "best selection"
softmaxLayer = Dense(3, activation='softmax', name = 'softmax', use_bias = False)
softmaxLayer = softmaxLayer(concatLayer)
model = Model(inputs=rinputs, outputs=softmaxLayer)
model.compile(loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.Adam(),metrics=['accuracy'])
I now create my test data:
loc1 = [1, 5]
loc2 = [4, 1]
loc3 = [6, 7]
# create two entries for my trial run
inputs = np.asarray([[loc1, loc2, loc3], [loc3,loc3,loc1]]).reshape(2,3,2)
ytrue = np.asarray([[1, 0, 0], [0, 0, 1]]).reshape(2,3)
model.fit(inputs, ytrue,verbose=True,)
But then I get the following error about my outputs, which I don't understand.
File "/.virtualenvs/python310/lib/python3.10/site-packages/keras/losses.py", line 1990, in categorical_crossentropy
return backend.categorical_crossentropy(
File "/.virtualenvs/python310/lib/python3.10/site-packages/keras/backend.py", line 5529, in categorical_crossentropy
target.shape.assert_is_compatible_with(output.shape)
ValueError: Shapes (None, 3) and (None, 1, 3) are incompatible
I don't entirely understand why the shapes don't match. I expect my softmax layer to output 3 numbers that sum to 1 and can be compared to my ytrue.
Any insights appreciated.
Just from the model architecture itself, it seems that Layer2 needs to be fed two-dimensional data.
One may use a Reshape/Flatten layer to fix it.
By reshaping the output of the Lambda layer from (None, 1, 2) to (None, 2), the final output's shape should become compatible too: (None, 3).
Additional notes:
As an example borrowed (with some modifications) from the TensorFlow website, let's assume we want to split an unbatched tensor of shape (3, 2) into 3 smaller tensors. Without a batch dimension the split runs along axis=0, which corresponds to axis=1 once the batch dimension is added:
x = tf.Variable(tf.random.uniform([3, 2], -1, 1))
s0, s1, s2 = tf.split(x, num_or_size_splits=3, axis=0)
Each of the splits s0, s1 and s2 has shape (1, 2), i.e. a 2D tensor consistent with the tensor it is derived from, and not a vector of shape (2,). In the context of your problem, for a batch, that would be (None, 1, 2).
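The same effect can be reproduced without building a model. This is a NumPy analogue I am adding for illustration (np.split with an integer count divides the axis into equal parts, like the tf.split call in the question), using the toy batch from the question. In Keras the corresponding fix would be the Reshape or Flatten layer mentioned above.

```python
import numpy as np

# Same toy batch as in the question: 2 samples, 3 locations, 2 features each.
loc1, loc2, loc3 = [1, 5], [4, 1], [6, 7]
inputs = np.asarray([[loc1, loc2, loc3], [loc3, loc3, loc1]])   # (2, 3, 2)

# Splitting along axis=1 keeps that axis, now with length 1.
pieces = np.split(inputs, 3, axis=1)
print([p.shape for p in pieces])        # [(2, 1, 2), (2, 1, 2), (2, 1, 2)]

# Dropping the length-1 axis gives the 2D (batch, features) layout,
# which a Dense(1) layer would map to (batch, 1) instead of (batch, 1, 1).
squeezed = [p.reshape(-1, 2) for p in pieces]
print([s.shape for s in squeezed])      # [(2, 2), (2, 2), (2, 2)]
```

With 2D pieces, concatenating the three per-location scores gives (batch, 3), which is the shape ytrue has.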

Implementing Dual Encoder LSTM in Keras with Tensorflow backend

Dual Encoder LSTM
I want to implement this model in the TensorFlow Keras API. I am confused about how to implement the sigmoid(CMR) function in Keras. How do I merge the outputs of both LSTMs and compute the above function?
RNN here means LSTM
C and R are sentences encoded into a fixed dimension by the two LSTMs. Then they are passed through a function sigmoid(CMR). We can assume that R and C are both 256-dimensional vectors and M is a 256 * 256 matrix. The matrix M is learned during training.
Assuming you only consider the final output of the LSTMs and not the whole sequence, the shape of the output of each LSTM model would be (batch_size, 256).
Now, we have the following vectors and their shapes:
C: (batch_size, 256)
R: (batch_size, 256)
M: (256, 256).
The simplest case is for batch_size = 1. Then,
C: (1, 256)
R: (1, 256)
So, mathematically, C^T M R would in practice be computed as C M R^T, and give you a tensor of shape (1, 1), which holds a single value and can be reshaped as needed.
In code, this is straightforward:
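import tensorflow as tf

def compute_cmr(c, m, r):
    r = tf.transpose(r, [1, 0])    # (1, 256) -> (256, 1)
    output = tf.matmul(c, m)       # (1, 256) @ (256, 256) -> (1, 256)
    output = tf.matmul(output, r)  # (1, 256) @ (256, 1) -> (1, 1)
    return output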
However, if your batch_size is greater than 1, things can get tricky. My approach (using eager execution) is to unstack along the batch axis, process each example individually, then restack. It may not be the most efficient way, but it works and the time overhead is usually negligible.
Here's how you can do it:
def compute_cmr(c, m, r):
    outputs = []
    c_list = tf.unstack(c, axis=0)  # batch_size tensors of shape (256,)
    r_list = tf.unstack(r, axis=0)
    for batch_number in range(len(c_list)):
        r_i = tf.expand_dims(r_list[batch_number], axis=1)  # (256, 1)
        c_i = tf.expand_dims(c_list[batch_number], axis=0)  # (1, 256)
        output = tf.matmul(c_i, m)
        output = tf.matmul(output, r_i)  # (1, 1)
        outputs.append(output)
    return tf.stack(outputs, axis=0)  # (batch_size, 1, 1)
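As an alternative to the loop, the whole batch can be handled by one einsum contraction. The sketch below uses NumPy with random stand-in data (the batch size of 4 and the scaling of M are my assumptions for illustration) and checks the one-call version against a per-example loop. TensorFlow offers tf.einsum with the same style of subscript notation, though I have not benchmarked it against the unstack approach.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, dim = 4, 256
C = rng.standard_normal((batch_size, dim))     # encoded contexts
R = rng.standard_normal((batch_size, dim))     # encoded responses
M = rng.standard_normal((dim, dim)) * 0.01     # stand-in for the learned matrix

# Per-example loop, mirroring the unstack / matmul / stack approach.
looped = np.stack([C[i] @ M @ R[i] for i in range(batch_size)])

# Same computation in one call: sum over i and j of C[b, i] * M[i, j] * R[b, j].
batched = np.einsum("bi,ij,bj->b", C, M, R)

scores = 1.0 / (1.0 + np.exp(-batched))        # sigmoid(CMR), one score per pair
print(batched.shape, np.allclose(looped, batched))
```

The result has one scalar per batch element, shape (batch_size,), rather than the (batch_size, 1, 1) produced by stacking (1, 1) matrices.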

Weird output for weights/filters in CNN

My task is to visualize the weights of a CNN layer. I passed the parameters filters = 32 and kernel_size = (3, 3), so I expect the .get_weights() function (which extracts weights and biases) to return 32 matrices, each of size 3x3, but I am getting a very weird nested output.
The output is as follows:
a = model.layers[0].get_weights()
a[0][0][0]
array([[ 2.87332404e-02, -2.80513391e-02,
**... 32 values ...**,
-1.55516148e-01, -1.26494586e-01, -1.36454999e-01,
1.61165968e-02, 7.63138831e-02],
[-5.21791205e-02, 3.13560963e-02, **... 32 values ...**,
-7.63987377e-02, 7.28923678e-02, 8.98564830e-02,
-3.02852653e-02, 4.07049060e-02],
[-7.04478994e-02, 1.33816227e-02,
**... 32 values ...**, -1.99537817e-02,
-1.67200342e-01, 1.15980692e-02]], dtype=float32)
I want to know why I am getting this type of weird output and how I can get the weights in the right shape. Thanks in advance.
Weights in a neural network are values that represent the connection strength between input nodes and output nodes (or nodes in the next layer).
A Conv2D layer's weights usually have shape (H, W, I, O), where:
H is kernel height
W is kernel width
I is number of input channels
O is number of output channels
Conv2D weights can be interpreted as the connection strength between a patch of the input channels and the nodes in an output filter/feature map. This way you have weights of shape (H, W) between each input channel and each output channel. Note that the weights are shared among different patches of the same channel.
Consider the following convolution of an (8, 8, 1) input with a (2, 2) kernel, producing an (8, 8, 1) output. The weights of this layer have shape (2, 2, 1, 1).
The same input can be used to produce 2 feature maps using 2 (2, 2) filters as follows. Now the shape of the weights would be (2, 2, 1, 2).
Hope this will clarify how to interpret the shape of convolutional layers.
The shape of the kernel weights from a Conv2D layer is (kernel_size[0], kernel_size[1], n_input_channels, filters). So in your case
a = model.layers[0].get_weights()
print(a[0].shape)
# should print (3,3,z,32) if your input has shape (x, y, z)
If you want to print the weights from one of the filters, you can do
a[0][:,:,:,0]
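To see how the nested output in the question relates to the per-filter view, here is a small NumPy sketch. The array kernel is a random stand-in for a[0], and the 3 input channels are my assumption (consistent with the three rows printed in the question).

```python
import numpy as np

# Stand-in for model.layers[0].get_weights()[0]:
# (kernel_height, kernel_width, input_channels, filters).
kernel = np.random.default_rng(0).standard_normal((3, 3, 3, 32))

first_filter = kernel[:, :, :, 0]                 # (3, 3, 3): one 3x3 slice per input channel
per_filter = np.transpose(kernel, (3, 2, 0, 1))   # (32, 3, 3, 3): filters first, then channels
one_matrix = per_filter[0, 0]                     # (3, 3): filter 0, input channel 0

print(kernel[0][0].shape)   # (3, 32), the kind of nested array printed in the question
print(first_filter.shape, per_filter.shape, one_matrix.shape)
```

Indexing a[0][0][0] fixes only the kernel row and column, which is why it shows a (channels, filters) block instead of a 3x3 matrix. Moving the filter axis to the front gives the "32 filters, each made of 3x3 matrices" layout the question expects.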

Keras Dot Axes questions

I am trying to use Keras Dot and have the following errors.
Could you explain what I am doing wrong?
x1 = Input(shape=(2,4))
x2 = Input(shape=(4,))
y1 = dot([x1,x2], axes = (2,1))
modelA = Model(inputs=[x1, x2], outputs=y1)
a1 = np.arange(16).reshape(2,2,4)
a2 = np.array( [1,2,3,4] )
modelA.predict([a1,a2])
---->
ValueError: Error when checking : expected input_40 to have shape (None, 4) but
got array with shape (4, 1)
I am new to Keras, too. And the following is what I figured out after playing around with the Dot operation.
Firstly, the shape parameter of the Input layer does NOT include the batch size. In your code, x2 = Input(shape=(4,)), so x2 expects input data of shape (None, 4) (None refers to the batch size), but a2 is np.array([1,2,3,4]), whose shape is (4,), which Keras treats as (4, 1), hence the error message.
To get rid of the error you need to add the batch_size dimension to a2.
But then there is another problem, according to the doc of Dot, I think x1 and x2 should have the same batch size:
if applied to a list of two tensors a and b of shape (batch_size, n), the output will be a tensor of shape (batch_size, 1) where each entry i will be the dot product between a[i] and b[i].
So I manually match the batch size of a1 and a2, and the batch size of a1 is 2, so a2 needs to be np.array([[1,2,3,4],[1,2,3,4]])
Now you get your desired result:
[[ 20. 60.]
[100. 140.]]
A few more words for beginners like me: the shape of x1 is (batch_size, 2, 4) and the shape of x2 is (batch_size, 4), so it seems that they are not compatible. This is where the 'axes' parameter comes into play. In OP's code, axes=(2,1) means to dot x1's axis 2 (the 3rd axis, of length 4) with x2's axis 1 (the 2nd axis, also of length 4). So it will be [0,1,2,3]dot[1,2,3,4]=20, [4,5,6,7]dot[1,2,3,4]=60 ...
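The arithmetic can be checked outside Keras. This NumPy sketch is my own illustration (np.einsum stands in for the Dot layer here): it contracts axis 2 of a1 with axis 1 of a2 separately for each batch entry.

```python
import numpy as np

a1 = np.arange(16).reshape(2, 2, 4)            # (batch_size=2, 2, 4)
a2 = np.array([[1, 2, 3, 4], [1, 2, 3, 4]])    # (batch_size=2, 4)

# For each batch entry b and row i: sum over j of a1[b, i, j] * a2[b, j].
result = np.einsum("bij,bj->bi", a1, a2)
print(result)
# [[ 20  60]
#  [100 140]]
```

The batch axis b appears in both operands and in the output, which is why a1 and a2 need the same batch size.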

multi-level feature fusion in tensorflow

I want to know how I can combine two layers with different spatial sizes in TensorFlow.
For example:
batch_size = 3
input1 = tf.ones([batch_size, 32, 32, 3], tf.float32)
input2 = tf.ones([batch_size, 16, 16, 3], tf.float32)
filt1 = tf.constant(0.1, shape = [3,3,3,64])
filt1_1 = tf.constant(0.1, shape = [1,1,64,64])
filt2 = tf.constant(0.1, shape = [3,3,3,128])
filt2_2 = tf.constant(0.1, shape = [1,1,128,128])
#first layer
conv1 = tf.nn.conv2d(input1, filt1, [1,2,2,1], "SAME")
pool1 = tf.nn.max_pool(conv1, [1,2,2,1],[1,2,2,1], "SAME")
conv1_1 = tf.nn.conv2d(pool1, filt1_1, [1,2,2,1], "SAME")
deconv1 = tf.nn.conv2d_transpose(conv1_1, filt1_1, pool1.get_shape().as_list(), [1,2,2,1], "SAME")
#second layer
conv2 = tf.nn.conv2d(input2, filt2, [1,2,2,1], "SAME")
pool2 = tf.nn.max_pool(conv2, [1,2,2,1],[1,2,2,1], "SAME")
conv2_2 = tf.nn.conv2d(pool2, filt2_2, [1,2,2,1], "SAME")
deconv2 = tf.nn.conv2d_transpose(conv2_2, filt2_2, pool2.get_shape().as_list(), [1,2,2,1], "SAME")
The deconv1 shape is [3, 8, 8, 64] and the deconv2 shape is [3, 4, 4, 128]. Here I cannot use tf.concat to combine deconv1 and deconv2. So how can I do this?
Edit
This is an image of the architecture that I tried to implement; it is related to this paper:
vii. He, W., Zhang, X. Y., Yin, F., & Liu, C. L. (2017). Deep Direct
Regression for Multi-Oriented Scene Text Detection. arXiv preprint
arXiv:1703.08289
I checked the paper you pointed to. Consider the input image to this network to have size H x W (height and width); I wrote the size of the output image beside each layer. Now look at the bottom-most layer, whose input arrows I circled. This layer has two inputs: the first from the previous layer, which has shape H/2 x W/2, and the second from the first pooling layer, which also has size H/2 x W/2. These two inputs are merged together (not concatenated, but added together according to the paper) and go into the last Upsample layer, which outputs an image of size H x W.
The other Upsample layers have inputs of the same kind. As you can see, all merging operations have matching shapes. Also, the number of filters for all merging layers is 128, which is consistent with the others.
You can also use concat instead of merging, but be careful: it results in a larger number of filters. That is, merging (adding) two tensors of shape H/2 x W/2 x 128 results in the same shape H/2 x W/2 x 128, but concatenating two tensors of shape H/2 x W/2 x 128 on the last axis results in H/2 x W/2 x 256.
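The shape bookkeeping can be sketched in NumPy with the tensor sizes from the question. The nearest-neighbour upsampling via repeat is my assumption for bringing the smaller map to the larger spatial size; the paper's Upsample layer may work differently.

```python
import numpy as np

batch = 3
deconv1 = np.ones((batch, 8, 8, 64), dtype=np.float32)
deconv2 = np.ones((batch, 4, 4, 128), dtype=np.float32)

# Nearest-neighbour upsampling by 2 along height and width:
# (3, 4, 4, 128) -> (3, 8, 8, 128).
deconv2_up = deconv2.repeat(2, axis=1).repeat(2, axis=2)

# Concatenation only needs matching batch/height/width; channels add up: 64 + 128 = 192.
fused_concat = np.concatenate([deconv1, deconv2_up], axis=-1)

# Element-wise addition needs identical shapes, including the channel count.
other = np.ones((batch, 8, 8, 128), dtype=np.float32)
fused_add = other + deconv2_up

print(fused_concat.shape, fused_add.shape)   # (3, 8, 8, 192) (3, 8, 8, 128)
```

Either way, the spatial sizes have to be matched first; only the handling of the channel axis differs between adding and concatenating.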
I tried to guide you as much as possible, hope that was useful.

Resources