Yesterday I came across an exercise together with its solution.
The text:
Your code will take an input tensor input with shape (n, iC, H, W) and a kernel kernel with shape (oC, iC, kH, kW). It then needs to apply a 2D convolution over input, using kernel as the kernel tensor with no bias, a stride of 1, no dilation, no grouping, and no padding, and store the result in out. Both input and kernel have dtype torch.float32.
The solution:
#set-up code
import random
import torch
n = random.randint(2, 6)
iC = random.randint(2, 6)
oC = random.randint(2, 6)
H = random.randint(10, 20)
W = random.randint(10, 20)
kH = random.randint(2, 6)
kW = random.randint(2, 6)
input = torch.rand(n, iC, H, W, dtype=torch.float32)
kernel = torch.rand(oC, iC, kH, kW, dtype=torch.float32)
#solution code
oH, oW = H-(kH-1), W-(kW-1)
out = torch.zeros((n, oC, oH, oW), dtype=torch.float32)
for i in range(oH):
    for j in range(oW):
        inp = input.unsqueeze(1)[:, :, :, i:i+kH, j:j+kW]  # shape inp => (n, 1, iC, kH, kW)
        ker = kernel.unsqueeze(0)                          # shape ker => (1, oC, iC, kH, kW)
        out[:, :, i, j] = (inp*ker).sum((-1, -2, -3))      # ??
My question is:
Why do we apply unsqueeze() in this manner?
I know how unsqueeze() works, but I can't figure out what problem we solve with this unsqueeze().
Just for a visual reference of the convolution:
Thanks!
So a bit of elaboration on the comment I made.
First, if we unroll this ALL the way, we can represent this convolution operation with the following septuple nested for-loop.
out = torch.zeros((n, oC, oH, oW), dtype=torch.float32)
for out_i in range(oH):
    for out_j in range(oW):
        for b_idx in range(n):
            for out_ch in range(oC):
                for in_ch in range(iC):
                    for ker_i in range(kH):
                        for ker_j in range(kW):
                            out[b_idx, out_ch, out_i, out_j] += \
                                input[b_idx, in_ch, out_i + ker_i, out_j + ker_j] \
                                * kernel[out_ch, in_ch, ker_i, ker_j]
Of course this is probably going to be pretty slow. Instead we can aggregate the inner three for-loops into a single operation that takes a kH-by-kW slice of the input tensor spanning all the input channels and multiplies it by the slice of the kernel at the desired output channel.
out = torch.zeros((n, oC, oH, oW), dtype=torch.float32)
for out_i in range(oH):
    for out_j in range(oW):
        for b_idx in range(n):
            for out_ch in range(oC):
                # input_slice -> [iC, kH, kW]
                input_slice = input[b_idx, :, out_i:out_i+kH, out_j:out_j+kW]
                # kernel_slice -> [iC, kH, kW]
                kernel_slice = kernel[out_ch, :, :, :]
                out[b_idx, out_ch, out_i, out_j] = (input_slice * kernel_slice).sum()
Observe that in this latest version the input_slice is taken from a single batch index (b_idx) and the kernel_slice is taken from a single output channel (out_ch). We compute all combinations of b_idx and out_ch to fill the output. When we see this type of pattern then broadcasting should come to mind.
First, if we just took the input slice over all the batches (e.g. input[:, :, out_i:out_i + kH, out_j:out_j + kW]) then this would be an [n, iC, kH, kW] tensor. And since the kernel has shape [oC, iC, kH, kW], these can't be broadcast together because they don't agree in the first dimension. To deal with this we need to insert some unitary dimensions so that the shapes agree wherever both have non-unitary dimensions.
Since we want the output of the broadcasted product to be reduced and stored in out which has shape [n, oC, ...], then we want to insert unitary dimensions as follows:
out = torch.zeros((n, oC, oH, oW), dtype=torch.float32)
for out_i in range(oH):
    for out_j in range(oW):
        # input_slice -> [n, 1, iC, kH, kW]
        input_slice = input[:, :, out_i:out_i+kH, out_j:out_j+kW].unsqueeze(1)
        # kernel_slice -> [1, oC, iC, kH, kW]
        kernel_slice = kernel[:, :, :, :].unsqueeze(0)
        # broadcasted shape [n, 1, ...] times shape [1, oC, ...] -> [n, oC, ...]
        # therefore prod_slice -> [n, oC, iC, kH, kW]
        prod_slice = input_slice * kernel_slice
        # sum over the last three dimensions, producing reduced_slice -> [n, oC]
        reduced_slice = prod_slice.sum((-1, -2, -3))
        out[:, :, out_i, out_j] = reduced_slice
Note that we could have achieved a valid broadcast by using .unsqueeze(0) on the input slice and .unsqueeze(1) on the kernel slice. However, this would have resulted in reduced_slice being shape [oC, n] instead of [n, oC] which would have been the transpose of what we wanted to store in out.
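As a quick sanity check (my addition, not part of the original answer), the result of this broadcast-based loop can be compared against torch.nn.functional.conv2d, which performs exactly the convolution described in the exercise (stride 1, no padding, no dilation, no groups, no bias):
import torch.nn.functional as F
reference = F.conv2d(input, kernel)  # same setup: stride=1, no padding/dilation/groups/bias
print(torch.allclose(out, reference, atol=1e-5))  # expected: True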
Related
Suppose I now have the following code to calculate source-target attention for two variables, x and y:
import math
from typing import Optional

import numpy as np
import torch
import torch.nn as nn

class MultiHeadedAttention(nn.Module):
    """Multi-Head Attention layer

    :param int n_head: the number of heads
    :param int n_feat: the number of features
    :param float dropout_rate: dropout rate
    """

    def __init__(self, n_head: int, n_feat: int, dropout_rate: float):
        super(MultiHeadedAttention, self).__init__()
        assert n_feat % n_head == 0
        self.d_k = n_feat // n_head
        self.h = n_head
        self.linear_q = nn.Linear(n_feat, n_feat)
        self.linear_k = nn.Linear(n_feat, n_feat)
        self.linear_v = nn.Linear(n_feat, n_feat)
        self.linear_out = nn.Linear(n_feat, n_feat)
        self.dropout = nn.Dropout(p=dropout_rate)

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        """Compute 'Scaled Dot Product Attention'

        :param torch.Tensor query: (batch, x_len, size)
        :param torch.Tensor key: (batch, y_len, size)
        :param torch.Tensor value: (batch, y_len, size)
        :param torch.Tensor mask: (batch, x_len, y_len)
        :return torch.Tensor: attended and transformed `value` (batch, x_len, depth)
            weighted by the query-dot-key attention (batch, head, x_len, y_len)
        """
        n_batch = query.size(0)
        q = self.linear_q(query).view(n_batch, -1, self.h, self.d_k)
        k = self.linear_k(key).view(n_batch, -1, self.h, self.d_k)
        v = self.linear_v(value).view(n_batch, -1, self.h, self.d_k)
        q = q.transpose(1, 2)  # (batch, head, x_len, d_k)
        k = k.transpose(1, 2)  # (batch, head, y_len, d_k)
        v = v.transpose(1, 2)  # (batch, head, y_len, d_k)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(
            self.d_k
        )  # (batch, head, x_len, y_len)
        if mask is not None:
            mask = mask.unsqueeze(1).eq(0)  # (batch, 1, x_len, y_len)
            mask = mask.to(device=scores.device)
            scores = scores.masked_fill_(mask, -np.inf)
            attn = torch.softmax(scores, dim=-1).masked_fill(
                mask, 0.0
            )  # (batch, head, x_len, y_len)
        else:
            attn = torch.softmax(scores, dim=-1)  # (batch, head, x_len, y_len)
        p_attn = self.dropout(attn)
        x = torch.matmul(p_attn, v)  # (batch, head, x_len, d_k)
        x = (
            x.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k)
        )  # (batch, x_len, depth)
        return self.linear_out(x)  # (batch, x_len, depth)
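Just to pin down the shapes involved, here is a minimal made-up usage example (sizes are arbitrary):
mha = MultiHeadedAttention(n_head=4, n_feat=64, dropout_rate=0.1)
x = torch.randn(8, 20, 64)   # (batch, x_len, size)
y = torch.randn(8, 30, 64)   # (batch, y_len, size)
out = mha(x, y, y)           # (batch, x_len, depth) = (8, 20, 64)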
So this class calculates attention for a batch of B pairs (x, y)_i and gives an output of dim (batch, x_len, depth). So far so good.
The question is: what if I wanted to extend this class to calculate NOT ONLY (x1, y1), (x2, y2), ..., but also all combinations of x and y within the batch, i.e. (x1, y2), (x1, y3), ..., so that I get an output of dim (batch, batch, x_len, depth) WITHOUT LOOPING. How would you implement this? Any recommendation, suggestion, or example is appreciated.
EDITED
I just came up with an idea which does the desired job at the expense of extra memory use. Simply copy x and y along the batch dimension so that they represent all pairs of x_i and y_j. Specifically:
b = torch.tensor(list(range(batch_size)))
comb = torch.cartesian_prod(b, b)
x = x[comb[:, 0], :, :]
y = y[comb[:, 1], :, :]
and then after the calculation, view or reshape the first dimension and it will return output which is of dim=(batch_size, batch_size, x_len, depth).
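For concreteness, a minimal sketch of that idea (attn_layer is a hypothetical instance of the MultiHeadedAttention above; x is (batch, x_len, size) and y is (batch, y_len, size)):
b = x.size(0)
comb = torch.cartesian_prod(torch.arange(b), torch.arange(b))  # all (i, j) index pairs
x_rep = x[comb[:, 0]]                            # (b*b, x_len, size)
y_rep = y[comb[:, 1]]                            # (b*b, y_len, size)
out = attn_layer(x_rep, y_rep, y_rep)            # (b*b, x_len, depth)
out = out.view(b, b, out.size(1), out.size(2))   # (b, b, x_len, depth)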
I have tested this with a toy example and am fairly sure it does the job.
However, unfortunately, in my case it runs out of CUDA memory.
What would you do in this situation? Should I give up on parallelism and just use a loop to make it work?
If I understand you correctly, you might want to check out torch.cdist, which is a torch implementation of pairwise distances, similar to scipy.spatial.distance.cdist. You might have to do some tweaking of your tensor dimensions, as described in the torch.cdist documentation.
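For reference, a tiny illustration of torch.cdist's pairwise behavior (shapes made up):
import torch
a = torch.randn(2, 5, 16)   # (batch, 5 vectors, dim 16)
b = torch.randn(2, 7, 16)   # (batch, 7 vectors, dim 16)
d = torch.cdist(a, b)       # (2, 5, 7): distance between every pair within each batch
print(d.shape)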
I need to compute the torch.nn.CrossEntropyLoss on sequences.
The output tensor y_est has shape: [batch_size, sequence_length, embedding_dim]. The values are scores over embedding_dim classes, laid out like one-hot vectors (y_est is not binary, however).
The target tensor y has shape: [batch_size, sequence_length] and contains the integer index of the correct class in the range [0, embedding_dim).
If I compute the loss on the two tensors with the shapes described above, I get an error [1].
What I would like to do is described by the loop at [2]: for each sequence in the batch, I want the sum of the losses computed on each element of the sequence.
After reading the documentation of torch.nn.CrossEntropyLoss I came up with solution [3], which seems to compute exactly what I want: the losses computed at points [2] and [3] are equal.
However, since .permute(...) returns a view of the original tensor, I am afraid it might mess up the backward propagation of the loss. Somewhere (I do not remember where, sorry) I have read that views should not be used in computing the loss.
Is my solution correct?
import torch
batch_size = 5
seq_len = 10
emb_dim = 100
y_est = torch.randn( (batch_size, seq_len, emb_dim))
y = torch.randint(0, emb_dim, (batch_size, seq_len) )
print("y_est, batch x seq x emb:", y_est.shape)
print("y, batch x seq", y.shape)
loss_fn = torch.nn.CrossEntropyLoss(reduction="none")
# [1]
# loss = loss_fn(y_est, y)
# error:
# RuntimeError: Expected target size [5, 100], got [5, 10]
# [2]
loss = 0
for i in range(y_est.shape[1]):
    loss += loss_fn(y_est[:, i, :], y[:, i]).sum()
print(loss)
# [3]
y_est_2 = torch.permute(y_est, (0, 2, 1))
print("y_est_2", y_est_2.shape)
loss2 = loss_fn(y_est_2, y).sum()
print(loss2)
whose output is:
y_est, batch x seq x emb: torch.Size([5, 10, 100])
y, batch x seq torch.Size([5, 10])
tensor(253.9994)
y_est_2 torch.Size([5, 100, 10])
tensor(253.9994)
Is the solution correct (also for what concerns the backward pass)? Is there a better way?
If y_est are probabilities and you really want to compute the error/loss of a categorical output at each timestep/element of a sequence, then y and y_est have to have the same shape. To do so, the categories/classes of y can be expanded to the same dim as y_est with one-hot encoding:
import torch
batch_size = 5
seq_len = 10
emb_dim = 100
y_est = torch.randn( (batch_size, seq_len, emb_dim))
y = torch.randint(0, emb_dim, (batch_size, seq_len) )
y = torch.nn.functional.one_hot(y, num_classes=emb_dim).type(torch.float)
loss_fn = torch.nn.CrossEntropyLoss()
loss = loss_fn(y_est, y)
print(loss)
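As a side note on the original concern about permute() and the backward pass: permute() returns a differentiable view, so the loss in [3] backpropagates to y_est exactly like the explicit loop in [2]. A quick check (my addition, reusing the shapes from the question):
y_est = torch.randn(5, 10, 100, requires_grad=True)
y = torch.randint(0, 100, (5, 10))
loss = torch.nn.CrossEntropyLoss(reduction="none")(y_est.permute(0, 2, 1), y).sum()
loss.backward()
print(y_est.grad.shape)  # torch.Size([5, 10, 100]) -- the gradient reaches the original tensor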
I am developing code to use the pre-trained GPT2 model for a machine translation task. My data's word-to-id dictionary has length 91, and I developed the following code for my model:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from transformers.models.gpt2.modeling_gpt2 import GPT2Model

# data preparation code
def batch_sequences(x, y, env):
    """
    Take as input a list of n sequences (torch.LongTensor vectors) and return
    a tensor of size (slen, n) where slen is the length of the longest
    sentence, and a vector lengths containing the length of each sentence.
    """
    lengths_x = torch.LongTensor([len(s) + 2 for s in x])
    lengths_y = torch.LongTensor([len(s) + 2 for s in y])
    max_length = max(lengths_x.max().item(), lengths_y.max().item())
    sent_x = torch.LongTensor(
        max_length, lengths_x.size(0)).fill_(env.pad_index)
    sent_y = torch.LongTensor(
        max_length, lengths_y.size(0)).fill_(env.pad_index)
    assert lengths_x.min().item() > 2
    assert lengths_y.min().item() > 2

    sent_x[0] = env.eos_index
    for i, s in enumerate(x):
        sent_x[1:lengths_x[i] - 1, i].copy_(s)
        sent_x[lengths_x[i] - 1, i] = env.eos_index

    sent_y[0] = env.eos_index
    for i, s in enumerate(y):
        sent_y[1:lengths_y[i] - 1, i].copy_(s)
        sent_y[lengths_y[i] - 1, i] = env.eos_index

    return sent_x, sent_y, max_length

def collate_fn(elements):
    """
    Collate samples into a batch.
    """
    x, y = zip(*elements)
    x = [torch.LongTensor([env.word2id[w]
                           for w in seq if w in env.word2id]) for seq in x]
    y = [torch.LongTensor([env.word2id[w]
                           for w in seq if w in env.word2id]) for seq in y]
    x, y, length = batch_sequences(x, y, env)
    return (x, length), (y, length), torch.LongTensor(nb_ops)

loader = DataLoader(data, batch_size=1, shuffle=False, collate_fn=collate_fn)
gpt2 = GPT2Model.from_pretrained('gpt2')
in_layer = nn.Embedding(len(env.word2id), 768)
out_layer = nn.Linear(768, len(env.word2id))

parameters = list(gpt2.parameters()) + list(in_layer.parameters()) + list(out_layer.parameters())
optimizer = torch.optim.Adam(parameters)
loss_fn = nn.CrossEntropyLoss()

for layer in (gpt2, in_layer, out_layer):
    layer.train()

accuracies = list()
n_epochs = 5
for i in range(n_epochs):
    for (x, x_len), (y, y_len) in loader:
        x = x.to(device=device)
        y = y.to(device=device)

        embeddings = in_layer(x.reshape(1, -1))
        hidden_state = gpt2(inputs_embeds=embeddings).last_hidden_state[:, :]
        logits = out_layer(hidden_state)[0]

        loss = loss_fn(logits, y.reshape(-1))
        accuracies.append(
            (logits.argmax(dim=-1) == y.reshape(-1)).float().mean().item())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if len(accuracies) % 500 == 0:
            accuracy = sum(accuracies[-50:]) / len(accuracies[-50:])
            print(f'Samples: {len(accuracies)}, Accuracy: {accuracy}')
This code works pretty well when the batch size is 1. But it is so slow. I wanted to increase the batch size from 1 to 32, but I get some dimension compatibility problems. How can I increase the batch size without errors?
My data consists of pairs of sentences: the first one is a sentence in the source language and the second one is its translation in the target language.
For example, assume that x.shape is (batch_size, 12), meaning we have batch_size sentences of length 12 as input, and y.shape is also (batch_size, 12) (the translations). We also have a word-to-id dictionary of length 90 that maps each word in a sentence to its index.
This problem can be solved using padding. We need two special symbols:
code 0 in inputs (x) will denote "blank" tokens that should not be translated.
code -100 in outputs (y) will denote "blank" tokens that should not participate in the loss calculation; nn.CrossEntropyLoss() ignores this value through its ignore_index argument (-100 is in fact its default).
The batch of size 3 could look like this:
x:
[[1, 2, 3, 0, 0],
[ 4, 5, 6, 7, 8],
[ 9, 8, 0, 0, 0]]
y:
[[1, 2, 3, -100, -100],
[ 4, 5, 6, 7, 8],
[ 9, 8, -100, -100, -100]]
You could generate it with code such as:
def pad_sequences(batch, pad_value=0):
    n = max(len(v) for v in batch)
    return torch.tensor([v + [pad_value] * (n - len(v)) for v in batch])
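For illustration, a minimal sketch of how such a padded batch could then be fed to the loss (the logits tensor and the vocabulary size of 91 are made-up stand-ins; -100 is also the default value of ignore_index):
import torch
import torch.nn as nn
x = pad_sequences([[1, 2, 3], [4, 5, 6, 7, 8], [9, 8]], pad_value=0)
y = pad_sequences([[1, 2, 3], [4, 5, 6, 7, 8], [9, 8]], pad_value=-100)
logits = torch.randn(x.size(0), x.size(1), 91)    # dummy model output (batch, seq_len, vocab)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # explicit for clarity
loss = loss_fn(logits.reshape(-1, logits.size(-1)), y.reshape(-1))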
However, I feel there is an issue with your problem statement. If you perform machine translation, then your inputs and outputs can have different lengths, but your architecture only allows x and y to have the same length. If you want to support x and y of different lengths, I would suggest using a seq2seq architecture such as T5 instead.
Another issue is that GPT is autoregressive, so if y is completely aligned with x, then we cannot use the suffix of x while generating the left part of y. So if you wish your x and y to be perfectly aligned, but still would like to use the full information about x when generating y, I would recommend using a bidirectional encoder such as BERT.
It's going to be a long post, sorry in advance...
I'm working on a denoising algorithm and my goal is to:
Use PyTorch to design / train the model
Convert the PyTorch model into a CoreML model
The denoising algorithm consists of the following 3 parts:
A "down-sampling" + noise level map
A regular convnet
An "up-sampling"
The first part is quite simple in its idea, but not so easy to explain. Suppose, for instance, we are given an input color image and an input value "sigma" that represents the standard deviation of the image noise.
The "down-sampling" part is in fact a space-to-depth. In short, for a given channel and for a subset of 2x2 pixels, the space-to-depth creates a single pixel composed of 4 channels. The number of channels is multiplied by 4 while the height and width are divided by 2. The data is simply reorganized.
The noise level map consists of creating 3 channels containing the standard deviation value, so that the convnet knows how to properly denoise the input image.
This will perhaps be clearer with some code:
def downsample_and_noise_map(input, sigma):
    # Input tensor size (batch, channels, height, width)
    in_n, in_c, in_h, in_w = input.size()
    # Output tensor size
    out_h = in_h // 2
    out_w = in_w // 2
    sigma_c = in_c      # nb of channels of the standard deviation tensor
    image_c = in_c * 4  # nb of channels of the image tensor
    # Standard deviation tensor
    output_sigma = sigma.view(1, 1, 1, 1).repeat(in_n, sigma_c, out_h, out_w)
    # Image tensor
    output_image = torch.zeros((in_n, image_c, out_h, out_w))
    output_image[:, 0::4, :, :] = input[:, :, 0::2, 0::2]
    output_image[:, 1::4, :, :] = input[:, :, 0::2, 1::2]
    output_image[:, 2::4, :, :] = input[:, :, 1::2, 0::2]
    output_image[:, 3::4, :, :] = input[:, :, 1::2, 1::2]
    # Concatenate standard deviation and image tensors
    return torch.cat((output_sigma, output_image), dim=1)
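Side note (my addition, not in the original post): on recent PyTorch versions (>= 1.8), the image part of this reordering is exactly a pixel unshuffle with factor 2, which may be worth knowing even if it was not available when this model was written:
import torch
import torch.nn.functional as F
x = torch.rand(1, 3, 100, 100)
assert torch.equal(
    F.pixel_unshuffle(x, 2),                                # (1, 12, 50, 50)
    downsample_and_noise_map(x, torch.tensor(0.0))[:, 3:],  # drop the 3 sigma channels
)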
This function is then called as the first step in the model's forward function:
def forward(self, x, sigma):
    x = downsample_and_noise_map(x, sigma)
    x = self.convnet(x)
    x = upsample(x)
    return x
Let's consider an input tensor of size 1x3x100x100 (PyTorch standard: batch, channels, height, width) and a sigma value of 0.1. The output tensor has the following properties:
Tensor's shape is 1x15x50x50
Tensor's values for channels 0, 1 and 2 are all equal to sigma = 0.1
Tensor's values for channels 3, 4, 5, 6 are composed of the input image values of channel 0
Tensor's values for channels 7, 8, 9, 10 are composed of the input image values of channel 1
Tensor's values for channels 11, 12, 13, 14 are composed of the input image values of channel 2
If this code is not clear enough, I can post an even more naive version.
The up-sampling part is the reciprocal function of the downsampling one.
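For completeness, here is a hedged sketch of what that reciprocal could look like given the channel layout described above (my reconstruction, not the author's actual code):
def upsample(input):
    # input: (batch, 5*C, H/2, W/2) = C sigma channels followed by 4*C image channels,
    # as produced by downsample_and_noise_map above
    in_n, in_c, in_h, in_w = input.size()
    out_c = in_c // 5
    image = input[:, out_c:, :, :]  # drop the sigma channels
    output = torch.zeros((in_n, out_c, in_h * 2, in_w * 2))
    output[:, :, 0::2, 0::2] = image[:, 0::4, :, :]
    output[:, :, 0::2, 1::2] = image[:, 1::4, :, :]
    output[:, :, 1::2, 0::2] = image[:, 2::4, :, :]
    output[:, :, 1::2, 1::2] = image[:, 3::4, :, :]
    return output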
I was able to use this function for training and testing in PyTorch.
Then, I tried to convert the model to CoreML with ONNX as an intermediate step.
The conversion to ONNX generated "TracerWarning". Conversion from ONNX to CoreML failed (TypeError: 1.0 has type numpy.float64, but expected one of: int, long). The problem came from the down-sampling + noise level map (and from up-sampling too).
When I removed the down-sampling + noise level map and up-sampling layers, I was able to convert to ONNX and to CoreML very easily since only a simple convnet remained. This means I have a solution to my problem: implement these 2 layers using 2 shaders on the mobile side. But I'm not satisfied with this solution as I want my model to contain all layers ^^
Before considering writing a post here, I crawled the Internet to find an answer and was able to write a better version of the previous function using reshape and permute. This version removed all the ONNX warnings, but the CoreML conversion still failed...
def downsample_and_noise_map(input, sigma):
    # Input image size
    in_n, in_c, in_h, in_w = input.size()
    # Output tensor size
    out_n = in_n
    out_h = in_h // 2
    out_w = in_w // 2
    # Create standard deviation tensor
    output_sigma = sigma.view(out_n, 1, 1, 1).repeat(out_n, in_c, out_h, out_w)
    # Split RGB channels
    channels_rgb = torch.split(input, 1, dim=1)
    # Reshape (space-to-depth) each image channel
    channels_reshaped = []
    for channel in channels_rgb:
        channel = channel.reshape(1, out_h, 2, out_w, 2)
        channel = channel.permute(2, 4, 0, 1, 3)
        channel = channel.reshape(1, 4, out_h, out_w)
        channels_reshaped.append(channel)
    # Concatenate all reshaped image channels together
    output_image = torch.cat(channels_reshaped, dim=1)
    # Concatenate standard deviation and image tensors
    output = torch.cat([output_sigma, output_image], dim=1)
    return output
So here are (some of) my questions:
What is the preferred PyTorch way to implement a function such as downsample_and_noise_map within a model?
Same question, but when the conversion to ONNX and then to CoreML is part of the equation?
Is PyTorch -> ONNX -> CoreML still the best path to deploy the model for iOS production?
Thanks for your help (and your patience) ^^
Disclaimer: I'm not familiar with CoreML or deploying to iOS, but I do have experience deploying PyTorch models in TensorRT and OpenVINO via ONNX.
The main issue I've faced when deploying to other frameworks is that operations like slicing and repeating tensors tend to have limited support there. Often we can construct equivalent conv or transpose-conv operations which achieve the desired behavior.
In order to ensure we don't export the logic used to construct the conv weights I've separated the weight initialization from the application of the weights. This makes the ONNX export much more straightforward since all it sees is some constant tensors being applied.
class DownsampleAndNoiseMap():
    def __init__(self):
        self.initialized = False
        self.weight = None
        self.zeros = None

    def init_weights(self, input):
        with torch.no_grad():
            in_n, in_c, in_h, in_w = input.size()
            out_h = int(in_h // 2)
            out_w = int(in_w // 2)
            sigma_c = in_c
            image_c = in_c * 4
            # conv weights used for downsampling
            self.weight = torch.zeros(image_c, in_c, 2, 2).to(input)
            for c in range(in_c):
                self.weight[4 * c, c, 0, 0] = 1
                self.weight[4 * c + 1, c, 0, 1] = 1
                self.weight[4 * c + 2, c, 1, 0] = 1
                self.weight[4 * c + 3, c, 1, 1] = 1
            # zeros used to replace repeat
            self.zeros = torch.zeros(in_n, sigma_c, out_h, out_w).to(input)
        self.initialized = True

    def __call__(self, input, sigma):
        assert self.initialized
        output_sigma = self.zeros + sigma
        output_image = torch.nn.functional.conv2d(input, self.weight, stride=2)
        return torch.cat((output_sigma, output_image), dim=1)

class Upsample():
    def __init__(self):
        self.initialized = False
        self.weight = None

    def init_weights(self, input):
        with torch.no_grad():
            in_n, in_c, in_h, in_w = input.size()
            image_c = in_c * 4
            self.weight = torch.zeros(in_c + image_c, in_c, 2, 2).to(input)
            for c in range(in_c):
                self.weight[in_c + 4 * c, c, 0, 0] = 1
                self.weight[in_c + 4 * c + 1, c, 0, 1] = 1
                self.weight[in_c + 4 * c + 2, c, 1, 0] = 1
                self.weight[in_c + 4 * c + 3, c, 1, 1] = 1
        self.initialized = True

    def __call__(self, input):
        assert self.initialized
        return torch.nn.functional.conv_transpose2d(input, self.weight, stride=2)
I made the assumption that upsample was the reciprocal of downsample in the sense that x == upsample(downsample_and_noise_map(x, sigma)) (correct me if I'm wrong in this assumption). I also verified that my version of downsample agrees with yours.
# consistency checking code
x = torch.randn(1, 3, 100, 100)
sigma = torch.randn(1)
# OP downsampling
y1 = downsample_and_noise_map(x, sigma)
ds = DownsampleAndNoiseMap()
ds.init_weights(x)
y2 = ds(x, sigma)
print('downsample diff:', torch.sum(torch.abs(y1 - y2)).item())
us = Upsample()
us.init_weights(x)
x_recov = us(ds(x, sigma))
print('recovery error:', torch.sum(torch.abs(x - x_recov)).item())
which results in
downsample diff: 0.0
recovery error: 0.0
Exporting to ONNX
When exporting we need to invoke init_weights for the new classes before using torch.onnx.export. For example
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.downsample = DownsampleAndNoiseMap()
        self.upsample = Upsample()
        self.convnet = lambda x: x  # placeholder

    def init_weights(self, x):
        self.downsample.init_weights(x)
        self.upsample.init_weights(x)

    def forward(self, x, sigma):
        x = self.downsample(x, sigma)
        x = self.convnet(x)
        x = self.upsample(x)
        return x

x = torch.randn(1, 3, 100, 100)
sigma = torch.randn(1)

model = Model()
# ... load state dict here
model.init_weights(x)

torch.onnx.export(model, (x, sigma), 'deploy.onnx', verbose=True,
                  input_names=["input", "sigma"], output_names=["output"])
which gives the ONNX graph
graph(%input : Float(1, 3, 100, 100)
      %sigma : Float(1)) {
  %2 : Float(1, 3, 50, 50) = onnx::Constant[value=<Tensor>](), scope: Model
  %3 : Float(1, 3, 50, 50) = onnx::Add(%2, %sigma), scope: Model
  %4 : Float(12, 3, 2, 2) = onnx::Constant[value=<Tensor>](), scope: Model
  %5 : Float(1, 12, 50, 50) = onnx::Conv[dilations=[1, 1], group=1, kernel_shape=[2, 2], pads=[0, 0, 0, 0], strides=[2, 2]](%input, %4), scope: Model
  %6 : Float(1, 15, 50, 50) = onnx::Concat[axis=1](%3, %5), scope: Model
  %7 : Float(15, 3, 2, 2) = onnx::Constant[value=<Tensor>](), scope: Model
  %output : Float(1, 3, 100, 100) = onnx::ConvTranspose[dilations=[1, 1], group=1, kernel_shape=[2, 2], pads=[0, 0, 0, 0], strides=[2, 2]](%6, %7), scope: Model
  return (%output);
}
As for the last question about the recommended way to deploy on iOS I can't answer that since I don't have experience in that area.
I want to implement character-level embedding.
This is the usual word embedding.
Word Embedding
Input: [ [‘who’, ‘is’, ‘this’] ]
-> [ [3, 8, 2] ] # (batch_size, sentence_len)
-> // Embedding(Input)
# (batch_size, seq_len, embedding_dim)
This is what I want to do.
Character Embedding
Input: [ [ [‘w’, ‘h’, ‘o’, 0], [‘i’, ‘s’, 0, 0], [‘t’, ‘h’, ‘i’, ‘s’] ] ]
-> [ [ [2, 3, 9, 0], [ 11, 4, 0, 0], [21, 10, 8, 9] ] ] # (batch_size, sentence_len, word_len)
-> // Embedding(Input) # (batch_size, sentence_len, word_len, embedding_dim)
-> // sum each character embeddings # (batch_size, sentence_len, embedding_dim)
The final output shape is the same as the word embedding's, because I want to concatenate them later.
Although I tried it, I am not sure how to implement a 3-D embedding. Do you know how to handle such data?
def forward(self, x):
    print('x', x.size())  # (N, seq_len, word_len)
    bs = x.size(0)
    seq_len = x.size(1)
    word_len = x.size(2)
    embd_list = []
    for i, elm in enumerate(x):
        tmp = torch.zeros(1, word_len, self.embd_size)
        for chars in elm:
            tmp = torch.add(tmp, 1.0, self.embedding(chars.unsqueeze(0)))
The above code raises an error because the output of self.embedding is a Variable.
TypeError: torch.add received an invalid combination of arguments - got (torch.FloatTensor, float, Variable), but expected one of:
* (torch.FloatTensor source, float value)
* (torch.FloatTensor source, torch.FloatTensor other)
* (torch.FloatTensor source, torch.SparseFloatTensor other)
* (torch.FloatTensor source, float value, torch.FloatTensor other)
didn't match because some of the arguments have invalid types: (torch.FloatTensor, float, Variable)
* (torch.FloatTensor source, float value, torch.SparseFloatTensor other)
didn't match because some of the arguments have invalid types: (torch.FloatTensor, float, Variable)
Update
I could do this, but the for loops are not efficient for batches. Do you know a more efficient way?
def forward(self, x):
    print('x', x.size())  # (N, seq_len, word_len)
    bs = x.size(0)
    seq_len = x.size(1)
    word_len = x.size(2)
    embd = Variable(torch.zeros(bs, seq_len, self.embd_size))
    for i, elm in enumerate(x):          # every sample
        for j, chars in enumerate(elm):  # every sentence. [ [‘w’, ‘h’, ‘o’, 0], [‘i’, ‘s’, 0, 0], [‘t’, ‘h’, ‘i’, ‘s’] ]
            chars_embd = self.embedding(chars.unsqueeze(0))  # (N, word_len, embd_size) [‘w’,‘h’,‘o’,0]
            chars_embd = torch.sum(chars_embd, 1)            # (N, embd_size). sum each char's embedding
            embd[i, j] = chars_embd[0]                       # set char_embd as word-like embedding
    x = embd  # (N, seq_len, embd_dim)
Update2
This is my final code. Thank you, Wasi Ahmad!
def forward(self, x):
    # x: (N, seq_len, word_len)
    input_shape = x.size()
    bs = x.size(0)
    seq_len = x.size(1)
    word_len = x.size(2)
    x = x.view(-1, word_len)      # (N*seq_len, word_len)
    x = self.embedding(x)         # (N*seq_len, word_len, embd_size)
    x = x.view(*input_shape, -1)  # (N, seq_len, word_len, embd_size)
    x = x.sum(2)                  # (N, seq_len, embd_size)
    return x
I am assuming you have a 3d tensor of shape BxSxW where:
B = Batch size
S = Sentence length
W = Word length
And you have declared embedding layer as follows.
self.embedding = nn.Embedding(dict_size, emsize)
Where:
dict_size = No. of unique characters in the training corpus
emsize = Expected size of embeddings
So, now you need to convert the 3d tensor of shape BxSxW to a 2d tensor of shape BSxW and give it to the embedding layer.
emb = self.embedding(input_rep.view(-1, input_rep.size(2)))
The shape of emb will be BSxWxE where E is the embedding size. You can convert the resulting 3d tensor to a 4d tensor as follows.
emb = emb.view(*input_rep.size(), -1)
The final shape of emb will be BxSxWxE which is what you are expecting.
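To make that concrete, here is a small self-contained check with made-up sizes:
import torch
import torch.nn as nn
B, S, W, dict_size, emsize = 2, 3, 4, 30, 8
input_rep = torch.randint(0, dict_size, (B, S, W))
embedding = nn.Embedding(dict_size, emsize)
emb = embedding(input_rep.view(-1, input_rep.size(2)))  # (B*S, W, E)
emb = emb.view(*input_rep.size(), -1)                   # (B, S, W, E)
word_emb = emb.sum(2)                                   # (B, S, E), as in the question's Update2
print(word_emb.shape)                                   # torch.Size([2, 3, 8])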
What you are looking for is implemented in allennlp's TimeDistributed layer.
Here is a demonstration:
from allennlp.modules.time_distributed import TimeDistributed
batch_size = 16
sent_len = 30
word_len = 5
Consider an input sentence (already index-encoded and padded):
sentence = torch.randint(0, char_vocab_size, (batch_size, sent_len, word_len))  # suppose this is your data
Define a char embedding layer:
char_embedding = torch.nn.Embedding(char_vocab_size, char_emd_dim, padding_idx=char_pad_idx)
Wrap it!
embedding_sentence = TimeDistributed(char_embedding)(sentence) # shape: batch_size, sent_len, word_len, char_emb_dim
embedding_sentence has shape batch_size, sent_len, word_len, char_emb_dim
Actually, you can easily redefine a module in PyTorch to do this.
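For instance, here is a minimal sketch of such a wrapper, along the lines of allennlp's TimeDistributed (my own simplified version, not the library code):
import torch
import torch.nn as nn

class TimeDistributed(nn.Module):
    def __init__(self, module):
        super().__init__()
        self.module = module

    def forward(self, x):
        # x: (batch, time, *rest); fold batch and time together,
        # apply the wrapped module, then unfold again
        b, t = x.size(0), x.size(1)
        out = self.module(x.reshape(b * t, *x.shape[2:]))
        return out.reshape(b, t, *out.shape[1:])
Wrapping the char_embedding above in this module reproduces the (batch_size, sent_len, word_len, char_emb_dim) output.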