How to use transforms.FiveCrop() outside the DataLoader? - pytorch

I have a dataloader that returns a batch of shape torch.Size([bs, c, h, w]) where bs=4, c=1,and (h, w=128). Now I want to apply some custom transformations to the returned batch. Note that I can not apply transformations in the Dataloader as I need to feed the returned batch as is to one network and a transformed one to another network.
More specifically, I want to apply the following transformations to the returned batch:
1. CenterCrop(100)
2. FiveCrop(16)
3. Resize(128)
4. ToTensor()
5. Normalize([0.5], [0.5])
I have created a function to achieve the following task as follows:
# DataLoader code
#
#
orig_img = next(iter(DataLoader))
patches = get_patches(orig_img)
def get_patches(orig_img):
# orig_img.shape = torch.Size([4, 1, 128, 128])
images = [TF.to_pil_image(x) for x in orig_img.cpu()]
resized_imgs = []
for img in images:
img = transforms.CenterCrop(100)(img)
five_crop = transforms.FiveCrop(64)(img)
f_crops = transforms.Lambda(lambda crops: torch.stack([transforms.Normalize([0.5], [0.5])(transforms.ToTensor()(transforms.Resize(128)(crop))) for crop in crops]))(five_crop)
resized_imgs.append(f_crops)
return resized_imgs
The problem right now is that when I get the resized_imgs list, every tensor inside it looses the batch size dimension i.e. resized_imgs[0].shape = torch.Size([ncrops, c, h, w]) (4d) whereas, I expect the shape to be torch.Size([bs, ncrops, c, h, w]) (5d).

Your data loader will return a tensor of shape (bs, c, h, w). Therefore orig_img is shaped the same way and iterating through it will provide you with a tensor img shaped as (c, h, w). Applying FiveCrop will create an additional dimension such that five_crop is shaped (5, c, h, w). Then f_crops will be shaped (5, c, 128, 128). Finally, the tensor is appended with the others in resized_imgs (the list containing the different patched images). All in all resized_imgs contains bs elements since orig_img.size(0) = bs, and each element is a tensor shaped (5, c, 128, 128) (five patches per image) as we've described above.
Another way of writing this function would be:
def get_patches(orig_img):
# orig_img.shape = (4, 1, 128, 128)
img_t = T.Compose([T.ToPILImage(),
T.CenterCrop(100),
T.FiveCrop(64)])
patch_t = T.Compose([T.Resize(128),
T.ToTensor(),
T.Normalize([0.5], [0.5])])
resized_imgs = []
for img in orig_img:
five_crop = img_t(img)
f_crops = torch.stack(list(map(patch_t, five_crop)))
resized_imgs.append(f_crops)
return torch.stack(resized_imgs)
The last line will stack all image patches into a single tensor of shape (bs, 5, c, 128, 128).

Related

How does calculation in a GRU layer take place

So I want to understand exactly how the outputs and hidden state of a GRU cell are calculated.
I obtained the pre-trained model from here and the GRU layer has been defined as nn.GRU(96, 96, bias=True).
I looked at the the PyTorch Documentation and confirmed the dimensions of the weights and bias as:
weight_ih_l0: (288, 96)
weight_hh_l0: (288, 96)
bias_ih_l0: (288)
bias_hh_l0: (288)
My input size and output size are (1000, 8, 96). I understand that there are 1000 tensors, each of size (8, 96). The hidden state is (1, 8, 96), which is one tensor of size (8, 96).
I have also printed the variable batch_first and found it to be False. This means that:
Sequence length: L=1000
Batch size: B=8
Input size: Hin=96
Now going by the equations from the documentation, for the reset gate, I need to multiply the weight by the input x. But my weights are 2-dimensions and my input has three dimensions.
Here is what I've tried, I took the first (8, 96) matrix from my input and multiplied it with the transpose of my weight matrix:
Input (8, 96) x Weight (96, 288) = (8, 288)
Then I add the bias by replicating the (288) eight times to give (8, 288). This would give the size of r(t) as (8, 288). Similarly, z(t) would also be (8, 288).
This r(t) is used in n(t), since Hadamard product is used, both the matrices being multiplied have to be the same size that is (8, 288). This implies that n(t) is also (8, 288).
Finally, h(t) is the Hadamard produce and matrix addition, which would give the size of h(t) as (8, 288) which is wrong.
Where am I going wrong in this process?
TLDR; This confusion comes from the fact that the weights of the layer are the concatenation of input_hidden and hidden-hidden respectively.
- nn.GRU layer weight/bias layout
You can take a closer look at what's inside the GRU layer implementation torch.nn.GRU by peaking through the weights and biases.
>>> gru = nn.GRU(input_size=96, hidden_size=96, num_layers=1)
First the parameters of the GRU layer:
>>> gru._all_weights
[['weight_ih_l0', 'weight_hh_l0', 'bias_ih_l0', 'bias_hh_l0']]
You can look at gru.state_dict() to get the dictionary of weights of the layer.
We have two weights and two biases, _ih stands for 'input-hidden' and _hh stands for 'hidden-hidden'.
For more efficient computation the parameters have been concatenated together, as the documentation page clearly explains (| means concatenation). In this particular example num_layers=1 and k=0:
~GRU.weight_ih_l[k] – the learnable input-hidden weights of the layer (W_ir | W_iz | W_in), of shape (3*hidden_size, input_size).
~GRU.weight_hh_l[k] – the learnable hidden-hidden weights of the layer (W_hr | W_hz | W_hn), of shape (3*hidden_size, hidden_size).
~GRU.bias_ih_l[k] – the learnable input-hidden bias of the layer (b_ir | b_iz | b_in), of shape (3*hidden_size).
~GRU.bias_hh_l[k] – the learnable hidden-hidden bias of the (b_hr | b_hz | b_hn).
For further inspection we can get those split up with the following code:
>>> W_ih, W_hh, b_ih, b_hh = gru._flat_weights
>>> W_ir, W_iz, W_in = W_ih.split(H_in)
>>> W_hr, W_hz, W_hn = W_hh.split(H_in)
>>> b_ir, b_iz, b_in = b_ih.split(H_in)
>>> b_hr, b_hz, b_hn = b_hh.split(H_in)
Now we have the 12 tensor parameters sorted out.
- Expressions
The four expressions for a GRU layer: r_t, z_t, n_t, and h_t, are computed at each timestep.
The first operation is r_t = σ(W_ir#x_t + b_ir + W_hr#h + b_hr). I used the # sign to designate the matrix multiplication operator (__matmul__). Remember W_ir is shaped (H_in=input_size, hidden_size) while x_t contains the element at step t from the x sequence. Tensor x_t = x[t] is shaped as (N=batch_size, H_in=input_size). At this point, it's simply a matrix multiplication between the input x[t] and the weight matrix. The resulting tensor r is shaped (N, hidden_size=H_in):
>>> (x[t]#W_ir.T).shape
(8, 96)
The same is true for all other weight multiplication operations performed. As a result, you end up with an output tensor shaped (N, H_out=hidden_size).
In the following expressions h is the tensor containing the hidden state of the previous step for each element in the batch, i.e. shaped (N, hidden_size=H_out), since num_layers=1, i.e. there's a single hidden layer.
>>> r_t = torch.sigmoid(x[t]#W_ir.T + b_ir + h#W_hr.T + b_hr)
>>> r_t.shape
(8, 96)
>>> z_t = torch.sigmoid(x[t]#W_iz.T + b_iz + h#W_hz.T + b_hz)
>>> z_t.shape
(8, 96)
The output of the layer is the concatenation of the computed h tensors at
consecutive timesteps t (between 0 and L-1).
- Demonstration
Here is a minimal example of an nn.GRU inference manually computed:
Parameters
Description
Values
H_in
feature size
3
H_out
hidden size
2
L
sequence length
3
N
batch size
1
k
number of layers
1
Setup:
gru = nn.GRU(input_size=H_in, hidden_size=H_out, num_layers=k)
W_ih, W_hh, b_ih, b_hh = gru._flat_weights
W_ir, W_iz, W_in = W_ih.split(H_out)
W_hr, W_hz, W_hn = W_hh.split(H_out)
b_ir, b_iz, b_in = b_ih.split(H_out)
b_hr, b_hz, b_hn = b_hh.split(H_out)
Random input:
x = torch.rand(L, N, H_in)
Inference loop:
output = []
h = torch.zeros(1, N, H_out)
for t in range(L):
r = torch.sigmoid(x[t]#W_ir.T + b_ir + h#W_hr.T + b_hr)
z = torch.sigmoid(x[t]#W_iz.T + b_iz + h#W_hz.T + b_hz)
n = torch.tanh(x[t]#W_in.T + b_in + r*(h#W_hn.T + b_hn))
h = (1-z)*n + z*h
output.append(h)
The final output is given by the stacking the tensors h at consecutive timesteps:
>>> torch.vstack(output)
tensor([[[0.1086, 0.0362]],
[[0.2150, 0.0108]],
[[0.3020, 0.0352]]], grad_fn=<CatBackward>)
In this case the output shape is (L, N, H_out), i.e. (3, 1, 2).
Which you can compare with output, _ = gru(x).

RuntimeError: Given groups=3, weight of size 12 64 3 768, expected input[32, 12, 30, 768] to have 192 channels, but got 12 channels instead

I started working with Pytorch recently so my understanding of it isn't quite strong. I previously had a 1 layer CNN but wanted to extend it to 2 layers, but the input and output channels have been throwing errors I can seem to decipher. Why does it expect 192 channels? Can someone give me a pointer to help me understand this better? I have seen several related problems on here, but I don't understand those solutions either.
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel, BertTokenizer
import math
from transformers import AdamW, get_linear_schedule_with_warmup
def pad_sents(sents, pad_token): # Pad list of sentences according to the longest sentence in the batch.
sents_padded = []
max_len = max(len(s) for s in sents)
for s in sents:
padded = [pad_token] * max_len
padded[:len(s)] = s
sents_padded.append(padded)
return sents_padded
def sents_to_tensor(tokenizer, sents, device):
tokens_list = [tokenizer.tokenize(str(sent)) for sent in sents]
sents_lengths = [len(tokens) for tokens in tokens_list]
tokens_list_padded = pad_sents(tokens_list, '[PAD]')
sents_lengths = torch.tensor(sents_lengths, device=device)
masks = []
for tokens in tokens_list_padded:
mask = [0 if token == '[PAD]' else 1 for token in tokens]
masks.append(mask)
masks_tensor = torch.tensor(masks, dtype=torch.long, device=device)
tokens_id_list = [tokenizer.convert_tokens_to_ids(tokens) for tokens in tokens_list_padded]
sents_tensor = torch.tensor(tokens_id_list, dtype=torch.long, device=device)
return sents_tensor, masks_tensor, sents_lengths
class ConvModel(nn.Module):
def __init__(self, device, dropout_rate, n_class, out_channel=16):
super(ConvModel, self).__init__()
self.bert_config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
self.dropout_rate = dropout_rate
self.n_class = n_class
self.out_channel = out_channel
self.bert = BertModel.from_pretrained('bert-base-uncased', config=self.bert_config)
self.out_channels = self.bert.config.num_hidden_layers * self.out_channel
self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', config=self.bert_config)
self.conv = nn.Conv2d(in_channels=self.bert.config.num_hidden_layers,
out_channels=self.out_channels,
kernel_size=(3, self.bert.config.hidden_size),
groups=self.bert.config.num_hidden_layers)
self.conv1 = nn.Conv2d(in_channels=self.out_channels,
out_channels=48,
kernel_size=(3, self.bert.config.hidden_size),
groups=self.bert.config.num_hidden_layers)
self.hidden_to_softmax = nn.Linear(self.out_channels, self.n_class, bias=True)
self.dropout = nn.Dropout(p=self.dropout_rate)
self.device = device
def forward(self, sents):
sents_tensor, masks_tensor, sents_lengths = sents_to_tensor(self.tokenizer, sents, self.device)
encoded_layers = self.bert(input_ids=sents_tensor, attention_mask=masks_tensor)
hidden_encoded_layer = encoded_layers[2]
hidden_encoded_layer = hidden_encoded_layer[0]
hidden_encoded_layer = torch.unsqueeze(hidden_encoded_layer, dim=1)
hidden_encoded_layer = hidden_encoded_layer.repeat(1, 12, 1, 1)
conv_out = self.conv(hidden_encoded_layer) # (batch_size, channel_out, some_length, 1)
conv_out = self.conv1(conv_out)
conv_out = torch.squeeze(conv_out, dim=3) # (batch_size, channel_out, some_length)
conv_out, _ = torch.max(conv_out, dim=2) # (batch_size, channel_out)
pre_softmax = self.hidden_to_softmax(conv_out)
return pre_softmax
def batch_iter(data, batch_size, shuffle=False, bert=None):
batch_num = math.ceil(data.shape[0] / batch_size)
index_array = list(range(data.shape[0]))
if shuffle:
data = data.sample(frac=1)
for i in range(batch_num):
indices = index_array[i * batch_size: (i + 1) * batch_size]
examples = data.iloc[indices]
sents = list(examples.train_BERT_tweet)
targets = list(examples.train_label.values)
yield sents, targets # list[list[str]] if not bert else list[str], list[int]
def train():
label_name = ['Yes', 'Maybe', 'No']
device = torch.device("cpu")
df_train = pd.read_csv('trainn.csv') # , index_col=0)
train_label = dict(df_train.train_label.value_counts())
label_max = float(max(train_label.values()))
train_label_weight = torch.tensor([label_max / train_label[i] for i in range(len(train_label))], device=device)
model = ConvModel(device=device, dropout_rate=0.2, n_class=len(label_name))
optimizer = AdamW(model.parameters(), lr=1e-3, correct_bias=False)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=100, num_training_steps=1000) # changed the last 2 arguments to old ones
model = model.to(device)
model.train()
cn_loss = torch.nn.CrossEntropyLoss(weight=train_label_weight, reduction='mean')
train_batch_size = 16
for epoch in range(1):
for sents, targets in batch_iter(df_train, batch_size=train_batch_size, shuffle=True): # for each epoch
optimizer.zero_grad()
pre_softmax = model(sents)
loss = cn_loss(pre_softmax, torch.tensor(targets, dtype=torch.long, device=device))
loss.backward()
optimizer.step()
scheduler.step()
TrainingModel = train()
Here's a snippet of data https://github.com/Kosisochi/DataSnippet
It seems that the original version of the code you had in this question behaved differently. The final version of the code you have here gives me a different error from what you posted, more specifically - this:
RuntimeError: Calculated padded input size per channel: (20 x 1). Kernel size: (3 x 768). Kernel size can't be greater than actual input size
I apologize if I misunderstood the situation, but it seems to me that your understanding of what exactly nn.Conv2d layer does is not 100% clear and that is the main source of your struggle. I interpret the part "detailed explanation on 2 layer CNN in Pytorch" you requested as an ask to explain in detail on how that layer works and I hope that after this is done there will be no problem applying it 1 time, 2 times or more.
You can find all the documentation about the layer here, but let me give you a recap which hopefully will help to understand more the errors you're getting.
First of all nn.Conv2d inputs are 4-d tensors of the shape (BatchSize, ChannelsIn, Height, Width) and outputs are 4-d tensors of the shape (BatchSize, ChannelsOut, HeightOut, WidthOut). The simplest way to think about nn.Conv2d is of something applied to 2d images with pixel grid of size Height x Width and having ChannelsIn different colors or features per pixel. Even if your inputs have nothing to do with actual images the behavior of the layer is still the same. Simplest situation is when the nn.Conv2d is not using padding (as in your code). In that case the kernel_size=(kernel_height, kernel_width) argument specifies the rectangle which you can imagine sweeping through Height x Width rectangle of your inputs and producing one pixel for each valid position. Without padding the coordinate of the rectangle's point can be any pair of indicies (x, y) with x between 0 and Height - kernel_height and y between 0 and Width - kernel_width. Thus the output will look like a 2d image of size (Height - kernel_height + 1) x (Width - kernel_width + 1) and will have as many output channels as specified to nn.Conv2d constructor, so the output tensor will be of shape (BatchSize, ChannelsOut, Height - kernel_height + 1, Width - kernel_width + 1).
The parameter groups is not affecting how shapes are changed by the layer - it is only controlling which input channels are used as inputs for the output channels (groups=1 means that every input channel is used as input for every output channel, otherwise input and output channels are divided into corresponding number of groups and only input channels from group i are used as inputs for the output channels from group i).
Now in your current version of the code you have BatchSize = 16 and the output of pre-trained model is (BatchSize, DynamicSize, 768) with DynamicSize depending on the input, e.g. 22. You then introduce additional dimension as axis 1 with unsqueeze and repeat the values along that dimension transforming the tensor of shape (16, 22, 768) into (16, 12, 22, 768). Effectively you are using the output of the pre-trained model as 12-channel (with each channel having same values as others) 2-d images here of size (22, 768), where 22 is not fixed (depends on the batch). Then you apply a nn.Conv2d with kernel size (3, 768) - which means that there is no "wiggle room" for width and output 2-d images will be of size (20, 1) and since your layer has 192 channels final size of the output of first convolution layer has shape (16, 192, 20, 1). Then you try to apply second layer of convolution on top of that with kernel size (3, 768) again, but since your 2-d "image" is now just (20 x 1) there is no valid position to fit (3, 768) kernel rectangle inside a rectangle (20 x 1) which leads to the error message Kernel size can't be greater than actual input size.
Hope this explanation helps. Now to the choices you have to avoid the issue:
(a) is to add padding in such a way that the size of the output is not changing comparing to input (I won't go into details here,
because I don't think this is what you need)
(b) Use smaller kernel on both first and/or second convolutions (e.g. if you don't change first convolution the only valid width for
the second kernel would be 1).
(c) Looking at what you're trying to do my guess is that you actually don't want to use 2d convolution, you want 1d convolution (on the sequence) with every position described by 768 values. When you're using one convolution layer with 768 width kernel (and same 768 width input) you're effectively doing exactly same thing as 1d convolution with 768 input channels, but then if you try to apply second one you have a problem. You can specify kernel width as 1 for the next layer(s) and that will work for you, but a more correct way would be to transpose pre-trained model's output tensor by switching the last dimensions - getting shape (16, 768, DynamicSize) from (16, DynamicSize, 768) and then apply nn.Conv1d layer with 768 input channels and arbitrary ChannelsOut as output channels and 1d kernel_size=3 (meaning you look at 3 consecutive elements of the sequence for convolution). If you do that than without padding input shape of (16, 768, DynamicSize) will become (16, ChannelsOut, DynamicSize-2), and after you apply second Conv1d with e.g. the same settings as first one you'll get a tensor of shape (16, ChannelsOut, DynamicSize-4), etc. (each time the 1d length will shrink by kernel_size-1). You can always change number of channels/kernel_size for each subsequent convolution layer too.

Implementing Dual Encoder LSTM in Keras with Tensorflow backend

Dual Encoder LSTM
I want to implement this model in TensorFlow Keras API. I am confused about how to implement the sigmoid(CMR) function in Keras. How to merge the output of both LSTM's an compute the above function ?
RNN here means LSTM
C and R are sentences encoded into a fixed dimension by the two LSTM's. Then they are passed through a function sigmoid(CMR). We can assume that R and C are both 256 dimensional matrices and M is a 256 * 256 matrix. The matrix M is learned during training.
Assuming you only consider the final output of the LSTMs and not the whole sequence, the shape of the output of each LSTM model would be (batch_size, 256).
Now, we have the following vectors and their shapes:
C: (batch_size, 256)
R: (batch_size, 256)
M: (256, 256).
The simplest case is for batch_size = 1. Then,
C: (1, 256)
R: (1, 256)
So, mathematically, CTMR would practically be CMRT, and give you a vector of shape (1, 1), which can be represented by any number of dimensions.
In code, this is straightforward:
def compute_cmr(c, m, r):
r = tf.transpose(r, [1, 0])
output = tf.matmul(c, m)
output = tf.matmul(output, r)
return output
However, if your batch_size is greater than 1, things can get tricky. My approach (using eager execution) is to unstack along the batch axis, process individually, then restack. It may not be the most efficient way, but it works flawlessly and the time overhead usually is negligible.
Here's how you can do it:
def compute_cmr(c, m, r):
outputs = []
c_list = tf.unstack(c, axis=0)
r_list = tf.unstack(r, axis=0)
for batch_number in range(len(c_list)):
r = tf.expand_dims(r_list[batch_number], axis=1)
c = tf.expand_dims(c_list[batch_number], axis=0)
output = tf.matmul(c, m)
output = tf.matmul(output, r)
outputs.append(output)
return tf.stack(outputs, axis=0)

Change values inside a Pytorch 3D tensor

I have a 224x224 binary image in a tensor (1, 224, 224), with 0-pixels representing background a 1-pixels representing foreground.
I want to reshape it in a tensor (2, 224, 224), such as the first "layer" gt[0] has 1-pixels where there were 0-pixels in the original image and viceversa. This way one layer should show 1s where there is background and the other one will have 1s on the foreground (basically I need to have two complementary binary images in this tensor).
This is my code:
# gt is a tensor (1, 224, 224)
gt = gt.expand((2, 224, 224))
backgr = gt[0]
foregr = gt[1]
backgr[backgr == 0] = 2 # swap all 0s in 1s and viceversa
backgr[backgr == 1] = 0
backgr[backgr == 2] = 1
gt[0] = backgr
print(gt[0])
print(gt[1])
The problem is both layers are modified with this code and I can't figure out how to keep one of the two constant and change only gt[0].
Found a solution!
gt = gt.repeat(2, 1, 1)

multi-level feature fusion in tensorflow

I want to know how can I combine two layers with different spatial space in Tensorflow.
for example::
batch_size = 3
input1 = tf.ones([batch_size, 32, 32, 3], tf.float32)
input2 = tf.ones([batch_size, 16, 16, 3], tf.float32)
filt1 = tf.constant(0.1, shape = [3,3,3,64])
filt1_1 = tf.constant(0.1, shape = [1,1,64,64])
filt2 = tf.constant(0.1, shape = [3,3,3,128])
filt2_2 = tf.constant(0.1, shape = [1,1,128,128])
#first layer
conv1 = tf.nn.conv2d(input1, filt1, [1,2,2,1], "SAME")
pool1 = tf.nn.max_pool(conv1, [1,2,2,1],[1,2,2,1], "SAME")
conv1_1 = tf.nn.conv2d(pool1, filt1_1, [1,2,2,1], "SAME")
deconv1 = tf.nn.conv2d_transpose(conv1_1, filt1_1, pool1.get_shape().as_list(), [1,2,2,1], "SAME")
#seconda Layer
conv2 = tf.nn.conv2d(input2, filt2, [1,2,2,1], "SAME")
pool2 = tf.nn.max_pool(conv2, [1,2,2,1],[1,2,2,1], "SAME")
conv2_2 = tf.nn.conv2d(pool2, filt2_2, [1,2,2,1], "SAME")
deconv2 = tf.nn.conv2d_transpose(conv2_2, filt2_2, pool2.get_shape().as_list(), [1,2,2,1], "SAME")
The deconv1 shape is [3, 8, 8, 64] and the deconv2 shape is [3, 4, 4, 128]. Here I cannot use the tf.concat to combine the deconv1 and deconv2. So how can I do this???
Edit
This is image for the architecture that I tried to implement:: it is releated to this paper::
vii. He, W., Zhang, X. Y., Yin, F., & Liu, C. L. (2017). Deep Direct
Regression for Multi-Oriented Scene Text Detection. arXiv preprint
arXiv:1703.08289
I checked the paper you point and there is it, consider the input image to this network has size H x W (height and width), I write the size of the output image on the side of each layer. Now look at the most bottom layer which I circle the input arrows to that layer, let's check it. This layer has two input, the first from the previous layer which has shape H/2 x W/2 and the second from the first pooling layer which also has size H/2 x W/2. These two inputs are merged together (not concatenation, but added together based on paper) and goes into the last Upsample layer, which output image of size H x W.
The other Upsample layers also have the same inputs. As you can see all merging operations have the match shapes. Also, the filter number for all merging layers is 128 which has consistency with others.
You can also use concat instead of merging, but it results in a larger filter number, be careful about that. i.e. merging two matrices with shapes H/2 x W/2 x 128 results in the same shape H/2 x W/2 x 128, but concat two matrices on the last axis, with shapes H/2 x W/2 x 128 results in H/2 x W/2 x 256.
I tried to guide you as much as possible, hope that was useful.

Resources