Masking and Instance Normalization in PyTorch - pytorch

Assume I have a PyTorch tensor, arranged as shape [N, C, L] where N is the batch size, C is the number of channels or features, and L is the length. In this case, if one wishes to perform instance normalization, one does something like:
N = 20
C = 100
L = 40
m = nn.InstanceNorm1d(C, affine=True)
input = torch.randn(N, C, L)
output = m(input)
This will perform a normalization in the L-wise dimension for each N*C = 2000 slices of data, subtracting 2000 means, scaling by 2000 standard deviations, and re-scaling by 100 learnable weight and bias parameters (one per channel). The unspoken assumption here is that all of these values exist and are meaningful.
But I have a situation where, for the slice N=1, I would like to exclude all data after (say) L=35. For the slice N=2 (say) all the data are valid. For the slice N=3, exclude all data after L=30, etc. This mimics data which are one dimensional time sequences, having multiple features, but which are not the same length.
How can I perform an instance norm on such data, get correct statistics, and maintain differentiability/AutoGrad information in PyTorch?
Update: While maintaining GPU performance, or at least not killing it dead.
I cannot...
...Mask with zero values, as this destroys the computer means and variances giving erroneous results
...Mask with np.nan or np.inf, as PyTorch tensors do not ignore such values, but treat them as errors. They are sticky, and lead to garbage results. PyTorch currently lacks the equivalent of np.nanmean and np.nanvar.
...Permute or transpose to an amenable arrangement of data; no such approach gives me what I need
...Use a pack_padded_sequence; instance normalization does not operate on that data structure, and one cannot import data into that structure as far as I know. Also, data re-arrangement would still be necessary, see 3 above.
Am I missing an approach which would give me what I need? Or perhaps am I missing a method of data re-arrangement which would allow 3 or 4 above to work?
This is an issue faced by recurrent neural networks all the time, hence the pack_padded_sequence functionality, but it isn't quite applicable here.

I don't think this is directly possible to implement using the existing InstanceNorm1d, the easiest way would probably be implementing it yourself from scratch. I did a quick implementation that should work. To make it a little bit more general this module requires a boolean mask (a boolean tensor of the same size as the input) that specifies which elements should be considered when passing through the instance norm.
import torch
class MaskedInstanceNorm1d(torch.nn.Module):
def __init__(self, num_features, eps=1e-6, momentum=0.1, affine=True, track_running_stats=False):
super().__init__()
self.num_features = num_features
self.eps = eps
self.momentum = momentum
self.affine = affine
self.track_running_stats = track_running_stats
self.gamma = None
self.beta = None
if self.affine:
self.gamma = torch.nn.Parameter(torch.ones((1, self.num_features, 1), requires_grad=True))
self.beta = torch.nn.Parameter(torch.zeros((1, self.num_features, 1), requires_grad=True))
self.running_mean = None
self.running_variance = None
if self.affine:
self.running_mean = torch.zeros((1, self.num_features, 1), requires_grad=True)
self.running_variance = torch.zeros((1, self.num_features, 1), requires_grad=True)
def forward(self, x, mask):
mean = torch.zeros((1, self.num_features, 1), requires_grad=False)
variance = torch.ones((1, self.num_features, 1), requires_grad=False)
# compute masked mean and variance of batch
for c in range(self.num_features):
if mask[:, c, :].any():
mean[0, c, 0] = x[:, c, :][mask[:, c, :]].mean()
variance[0, c, 0] = (x[:, c, :][mask[:, c, :]] - mean[0, c, 0]).pow(2).mean()
# update running mean and variance
if self.training and self.track_running_stats:
for c in range(self.num_features):
if mask[:, c, :].any():
self.running_mean[0, c, 0] = (1-self.momentum) * self.running_mean[0, c, 0] \
+ self.momentum * mean[0, c, 0]
self.running_variance[0, c, 0] = (1-self.momentum) * self.running_variance[0, c, 0] \
+ self.momentum * variance[0, c, 0]
# compute output
x = (x - mean)/(self.eps + variance).sqrt()
if self.affine:
x = x * self.gamma + self.beta
return x

Related

imlementation of binomial coefficient in Google JAX

trying to implement custom MLE for binomial distribution (for learning purpose) stuck with implantation of binomial coefficient in google JAX . there is no analog for scipy.special.binom() implemented.
what shall i use instead ?
The binomial coefficient for general real-valued inputs can be computed in terms of the gamma function, which is available in JAX via jax.scipy.special.gammaln. Here's one way you could define it:
def binom(x, y):
return jnp.exp(gammaln(x + 1) - gammaln(y + 1) - gammaln(x - y + 1))
Here is a (sequential) integer implementation using JAX.
def binom_int_seq(x : int, y : int):
def scan_body(carry, values):
n, d = values
carry = (carry*n)//d
return carry, None
y = max(y, x-y)
nd = jnp.concatenate(
(jnp.arange(y+2, x+1, dtype = 'u8')[:,None],
jnp.arange(2, x-y+1, dtype = 'u8')[:,None],),
axis = 1
)
bc, *_ = jax.lax.scan(scan_body, jnp.array(y+1, dtype = 'u8'), nd)
return bc
binom_int_seq_jit = jax.jit(binom_int_seq, static_argnums = (0, 1))
which gives
x, y = 60, 31
bc_ref = sp.special.comb(x, y, exact=True)
# 114449595062769120
binom_int_seq(x, y)-bc_ref
# DeviceArray(0, dtype=uint64)
# Using above logarithmic gamma function based implementation
binom(x, y)-bc_ref
# DeviceArray(496., dtype=float64, weak_type=True)
Keep in mind the binom_int_seq implementation is only correct if
(x-max(x-y, y))*sp.special.comb(x, y, exact=True) < jnp.iinfo(jnp.uint64).max
Unlike the real-valued version, the error will be sudden and catastrophic if this condition is not satisfied.
There may be other ways to increase this constraint, such as running cancellations based upon prime factorisation, without resorting to larger unsigned integers (/arbitrary precision).
A monoidal version could be implemented which computes the binomial coefficient numerator and denominator reductions then integer divides, but this places stricter constraints on the maximum arguments.

Pytorch: Custom thresholding activation function - gradient

I created an activation function class Threshold that should operate on one-hot-encoded image tensors.
The function performs min-max feature scaling on each channel followed by thresholding.
class Threshold(nn.Module):
def __init__(self, threshold=.5):
super().__init__()
if threshold < 0.0 or threshold > 1.0:
raise ValueError("Threshold value must be in [0,1]")
else:
self.threshold = threshold
def min_max_fscale(self, input):
r"""
applies min max feature scaling to input. Each channel is treated individually.
input is assumed to be N x C x H x W (one-hot-encoded prediction)
"""
for i in range(input.shape[0]):
# N
for j in range(input.shape[1]):
# C
min = torch.min(input[i][j])
max = torch.max(input[i][j])
input[i][j] = (input[i][j] - min) / (max - min)
return input
def forward(self, input):
assert (len(input.shape) == 4), f"input has wrong number of dims. Must have dim = 4 but has dim {input.shape}"
input = self.min_max_fscale(input)
return (input >= self.threshold) * 1.0
When I use the function I get the following error, since the gradients are not calculated automatically I assume.
Variable._execution_engine.run_backward(RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
I already had a look at How to properly update the weights in PyTorch? but could not get a clue how to apply it to my case.
How is it possible to calculate the gradients for this function?
Thanks for your help.
The issue is you are manipulating and overwriting elements, this time of operation can't be tracked by autograd. Instead, you should stick with built-in functions. You example is not that tricky to tackle: you are looking to retrieve the minimum and maximum values along input.shape[0] x input.shape[1]. Then you will scale your whole tensor in one go i.e. in vectorized form. No for loops involved!
One way to compute min/max along multiple axes is to flatten those:
>>> x_f = x.flatten(2)
Then, find the min-max on the flattened axis while retaining all shapes:
>>> x_min = x_f.min(axis=-1, keepdim=True).values
>>> x_max = x_f.max(axis=-1, keepdim=True).values
The resulting min_max_fscale function would look something like:
class Threshold(nn.Module):
def min_max_fscale(self, x):
r"""
Applies min max feature scaling to input. Each channel is treated individually.
Input is assumed to be N x C x H x W (one-hot-encoded prediction)
"""
x_f = x.flatten(2)
x_min, x_max = x_f.min(-1, True).values, x_f.max(-1, True).values
x_f = (x_f - x_min) / (x_max - x_min)
return x_f.reshape_as(x)
Important note:
You would notice that you can now backpropagate on min_max_fscale... but not on forward. This is because you are applying a boolean condition which is not a differentiable operation.

Inverse transform

I am not able to find an efficient way to give batch input to this function and return the batch output. I want to do this during the training of my neural network.
Inverse_Norm = transforms.Normalize(
mean = [-m/s for m, s in zip(mean, std)],
std = [1/s for s in std]
)
inverse_norm_input = Inverse_Norm(input)
Assuming a tensor of shape (B, C, ...) wheremean and std are iterables of length C then you can use broadcasting semantics to operate across a batch tensor. For example
import torch
def batch_inverse_normalize(x, mean, std):
# represent mean and std to 1, C, 1, ... tensors for broadcasting
reshape_shape = [1, -1] + ([1] * (len(x.shape) - 2))
mean = torch.tensor(mean, device=x.device, dtype=x.dtype).reshape(*reshape_shape)
std = torch.tensor(std, device=x.device, dtype=x.dtype).reshape(*reshape_shape)
return x * std + mean

Vectorized implementation of field-aware factorization

I would like to implement the field-aware factorization model (FFM) in a vectorized way. In FFM, a prediction is made by the following equation
where w are the embeddings that depend on the feature and the field of the other feature. For more info, see equation (4) in FFM.
To do so, I have defined the following parameter:
import torch
W = torch.nn.Parameter(torch.Tensor(n_features, n_fields, n_factors), requires_grad=True)
Now, given an input x of size (batch_size, n_features), I want to be able to compute the previous equation. Here is my current (non-vectorized) implementation:
total_inter = torch.zeros(x.shape[0])
for i in range(n_features):
for j in range(i + 1, n_features):
temp1 = torch.mm(
x[:, i].unsqueeze(1),
W[i, feature2field[j], :].unsqueeze(0))
temp2 = torch.mm(
x[:, j].unsqueeze(1),
W[j, feature2field[i], :].unsqueeze(0))
total_inter += torch.sum(temp1 * temp2, dim=1)
Unsurprisingly, this implementation is horribly slow since n_features can easily be as large as 1000! Note however that most of the entries of x are 0. All inputs are appreciated!
Edit:
If it can help in any ways, here are some implementations of this model in PyTorch:
pytorch-fm
ctr_model_zoo
Unfortunately, I cannot figure out exactly how they have done it.
Additional update:
I can now obtain the product of x and W in a more efficient way by doing:
temp = torch.einsum('ij, jkl -> ijkl', x, W)
Thus, my loop is now:
total_inter = torch.zeros(x.shape[0])
for i in range(n_features):
for j in range(i + 1, n_features):
temp1 = temp[:, i, feature2field[j], :]
temp2 = temp[:, j, feature2field[i], :]
total_inter += 0.5 * torch.sum(temp1 * temp2, dim=1)
It is however still too long since this loop goes over for about 500 000 iterations.
Something that could potentially help you speed up the multiplication is using pytorch sparse tensors.
Also something that might work would be the following:
Create n arrays, one for each feature i that would hold its corresponding field factors in each row. e.g. for feature i = 0
[ W[0, feature2field[0], :],
W[0, feature2field[1], :],
W[0, feature2field[n], :]]
Then calculate the multiplication of those arrays, lets call them F, with X
R[i] = F[i] * X
So each element in R would hold the result of the multiplication, an array, of the F[i] with X.
Next you would multiply each R[i] with its transpose
R[i] = R[i] * R[i].T
Now you can do the summation in a loop like before
for i in range(n_features):
total_inter += torch.sum(R[i], dim=1)
Please take this with a grain of salt as i haven't tested it. In any case i think that it will point you in the right direction.
One problem that might occur is in the transpose multiplication in which each element will also be multiplied with itself and then be added in the sum. I don't think it will affect the classifier but in any case you can make the elements in the diagonal of the transpose and above 0 (including the diagonal).
Also although minor nevertheless please move the 1st unsqueeze operation outside of the nested for loop.
I hope it helps.

Implementing word dropout in pytorch

I want to add word dropout to my network so that I can have sufficient training examples for training the embedding of the "unk" token. As far as I'm aware, this is standard practice. Let's assume the index of the unk token is 0, and the index for padding is 1 (we can switch them if that's more convenient).
This is a simple CNN network which implements word dropout the way I would have expected it to work:
class Classifier(nn.Module):
def __init__(self, params):
super(Classifier, self).__init__()
self.params = params
self.word_dropout = nn.Dropout(params["word_dropout"])
self.pad = torch.nn.ConstantPad1d(max(params["window_sizes"])-1, 1)
self.embedding = nn.Embedding(params["vocab_size"], params["word_dim"], padding_idx=1)
self.convs = nn.ModuleList([nn.Conv1d(1, params["feature_num"], params["word_dim"] * window_size, stride=params["word_dim"], bias=False) for window_size in params["window_sizes"]])
self.dropout = nn.Dropout(params["dropout"])
self.fc = nn.Linear(params["feature_num"] * len(params["window_sizes"]), params["num_classes"])
def forward(self, x, l):
x = self.word_dropout(x)
x = self.pad(x)
embedded_x = self.embedding(x)
embedded_x = embedded_x.view(-1, 1, x.size()[1] * self.params["word_dim"]) # [batch_size, 1, seq_len * word_dim]
features = [F.relu(conv(embedded_x)) for conv in self.convs]
pooled = [F.max_pool1d(feat, feat.size()[2]).view(-1, params["feature_num"]) for feat in features]
pooled = torch.cat(pooled, 1)
pooled = self.dropout(pooled)
logit = self.fc(pooled)
return logit
Don't mind the padding - pytorch doesn't have an easy way of using non zero padding in CNNs, much less trainable non-zero padding, so I'm doing it manually. Dropout also doesn't allow me to use non zero dropout, and I want to separate the padding token from the unk token. I'm keeping it in my example because it's the reason for this question's existence.
This doesn't work because dropout wants Float Tensors so that it can scale them properly, while my input is Long Tensors that don't need to be scaled.
Is there an easy way of doing this in pytorch? I essentially want to use LongTensor-friendly dropout (bonus: better if it will let me specify a dropout constant that isn't 0, so that I could use zero padding).
Actually I would do it outside of your model, before converting your input into a LongTensor.
This would look like this:
import random
def add_unk(input_token_id, p):
#random.random() gives you a value between 0 and 1
#to avoid switching your padding to 0 we add 'input_token_id > 1'
if random.random() < p and input_token_id > 1:
return 0
else:
return input_token_id
#than you have your input token_id
#for this example I take just a random number, lets say 127
input_token_id = 127
#let p be your probability for UNK
p = 0.01
your_input_tensor = torch.LongTensor([add_unk(input_token_id, p)])
Edit:
So there are two options which come to my mind which are actually GPU-friendly. In general both solutions should be much more efficient.
Option one - Doing computation directly in forward():
If you're not using torch.utils and don't have plans using it later this is probably the way to go.
Instead of doing the computation before we just do it in the forward() method of main PyTorch class. However I see no (simple) way doing this in torch 0.3.1., so you would need to upgrade to version 0.4.0:
So imagine x is your input vector:
>>> x = torch.tensor(range(10))
>>> x
tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
probs is a vector containing uniform probabilities for dropout so we can check later agains our probability for dropout:
>>> probs = torch.empty(10).uniform_(0, 1)
>>> probs
tensor([ 0.9793, 0.1742, 0.0904, 0.8735, 0.4774, 0.2329, 0.0074,
0.5398, 0.4681, 0.5314])
Now we apply the dropout probabilities probs on our input x:
>>> torch.where(probs > 0.2, x, torch.zeros(10, dtype=torch.int64))
tensor([ 0, 0, 0, 3, 4, 5, 0, 7, 8, 9])
Note: To see some effect I chose a dropout probability of 0.2 here. I reality you probably want it to be smaller.
You can pick for this any token / id you like, here is an example with 42 as unknown token id:
>>> unk_token = 42
>>> torch.where(probs > 0.2, x, torch.empty(10, dtype=torch.int64).fill_(unk_token))
tensor([ 0, 42, 42, 3, 4, 5, 42, 7, 8, 9])
torch.where comes with PyTorch 0.4.0:
https://pytorch.org/docs/master/torch.html#torch.where
I don't know about the shapes of your network, but your forward() should look something like this then (when using mini-batching you need to flatten the input before applying dropout):
def forward_train(self, x, l):
# probabilities
probs = torch.empty(x.size(0)).uniform_(0, 1)
# applying word dropout
x = torch.where(probs > 0.02, x, torch.zeros(x.size(0), dtype=torch.int64))
# continue like before ...
x = self.pad(x)
embedded_x = self.embedding(x)
embedded_x = embedded_x.view(-1, 1, x.size()[1] * self.params["word_dim"]) # [batch_size, 1, seq_len * word_dim]
features = [F.relu(conv(embedded_x)) for conv in self.convs]
pooled = [F.max_pool1d(feat, feat.size()[2]).view(-1, params["feature_num"]) for feat in features]
pooled = torch.cat(pooled, 1)
pooled = self.dropout(pooled)
logit = self.fc(pooled)
return logit
Note: I named the function forward_train() so you should use another forward() without dropout for evaluation / predicting. But you could also use some if conditions with train().
Option two: using torch.utils.data.Dataset:
If you're using Dataset provided by torch.utils it is very easy to do this kind of pre-processing efficiently. Dataset uses strong multi-processing acceleration by default so the the code sample above just has to be executed in the __getitem__ method of your Dataset class.
This could look like this:
def __getitem__(self, index):
'Generates one sample of data'
# Select sample
ID = self.input_tokens[index]
# Load data and get label
# using add ink_unk function from code above
X = torch.LongTensor(add_unk(ID, p=0.01))
y = self.targets[index]
return X, y
This is a bit out of context and doesn't look very elegant but I think you get the idea. According to this blog post of Shervine Amidi at Stanford it should be no problem to do more complex pre-processing steps in this function:
Since our code [Dataset is meant] is designed to be multicore-friendly, note that you
can do more complex operations instead (e.g. computations from source
files) without worrying that data generation becomes a bottleneck in
the training process.
The linked blog post - "A detailed example of how to generate your data in parallel with PyTorch" - provides also a good guide for implementing the data generation with Dataset and DataLoader.
I guess you'll prefer option one - only two lines and it should be very efficient. :)
Good luck!

Resources