Implementing word dropout in PyTorch

I want to add word dropout to my network so that I can have sufficient training examples for training the embedding of the "unk" token. As far as I'm aware, this is standard practice. Let's assume the index of the unk token is 0, and the index for padding is 1 (we can switch them if that's more convenient).
This is a simple CNN network which implements word dropout the way I would have expected it to work:
class Classifier(nn.Module):
    def __init__(self, params):
        super(Classifier, self).__init__()
        self.params = params
        self.word_dropout = nn.Dropout(params["word_dropout"])
        self.pad = torch.nn.ConstantPad1d(max(params["window_sizes"]) - 1, 1)
        self.embedding = nn.Embedding(params["vocab_size"], params["word_dim"], padding_idx=1)
        self.convs = nn.ModuleList([nn.Conv1d(1, params["feature_num"], params["word_dim"] * window_size,
                                              stride=params["word_dim"], bias=False)
                                    for window_size in params["window_sizes"]])
        self.dropout = nn.Dropout(params["dropout"])
        self.fc = nn.Linear(params["feature_num"] * len(params["window_sizes"]), params["num_classes"])

    def forward(self, x, l):
        x = self.word_dropout(x)
        x = self.pad(x)
        embedded_x = self.embedding(x)
        embedded_x = embedded_x.view(-1, 1, x.size()[1] * self.params["word_dim"])  # [batch_size, 1, seq_len * word_dim]
        features = [F.relu(conv(embedded_x)) for conv in self.convs]
        pooled = [F.max_pool1d(feat, feat.size()[2]).view(-1, self.params["feature_num"]) for feat in features]
        pooled = torch.cat(pooled, 1)
        pooled = self.dropout(pooled)
        logit = self.fc(pooled)
        return logit
Don't mind the padding - PyTorch doesn't have an easy way of using non-zero padding in CNNs, much less trainable non-zero padding, so I'm doing it manually. Dropout also doesn't let me fill dropped positions with a non-zero value, and I want to keep the padding token separate from the unk token. I'm keeping it in my example because it's the reason this question exists.
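For what it's worth, the manual padding itself does work on the index tensor (at least in recent PyTorch versions) - a tiny sanity check with made-up indices:

import torch
import torch.nn as nn

pad = nn.ConstantPad1d(2, 1)   # pad 2 positions on each side with the padding index 1
x = torch.tensor([[5, 7, 3]])  # [batch, seq_len] LongTensor of token ids
print(pad(x))                  # tensor([[1, 1, 5, 7, 3, 1, 1]])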
The word dropout, on the other hand, doesn't work, because nn.Dropout expects FloatTensors (so that it can rescale them), while my input is a LongTensor of indices that doesn't need rescaling.
Is there an easy way of doing this in PyTorch? I essentially want LongTensor-friendly dropout (bonus: even better if it lets me specify a fill constant that isn't 0, so that I could go back to using 0 for padding).

Actually I would do it outside of your model, before converting your input into a LongTensor.
This would look like this:
import random
import torch

def add_unk(input_token_id, p):
    # random.random() gives you a value between 0 and 1
    # to avoid switching your padding to 0 we add 'input_token_id > 1'
    if random.random() < p and input_token_id > 1:
        return 0
    else:
        return input_token_id

# then you have your input token_id
# for this example I take just a random number, let's say 127
input_token_id = 127

# let p be your probability for UNK
p = 0.01

your_input_tensor = torch.LongTensor([add_unk(input_token_id, p)])
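Applied to a whole (made-up) sequence of token ids before building the tensor, that could look like:

token_ids = [5, 1, 127, 42, 8]                     # made-up ids; 1 is the padding index
dropped = [add_unk(t, p=0.1) for t in token_ids]   # padding stays untouched thanks to the 'input_token_id > 1' check
your_input_tensor = torch.LongTensor(dropped)      # shape [seq_len]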
Edit:
So there are two options that come to mind which are actually GPU-friendly. In general, both should be much more efficient than the per-token function above.
Option one - Doing computation directly in forward():
If you're not using torch.utils.data and don't plan to use it later, this is probably the way to go.
Instead of doing the computation beforehand, we just do it in the forward() method of your PyTorch module. However, I see no (simple) way of doing this in torch 0.3.1, so you would need to upgrade to version 0.4.0:
So imagine x is your input vector:
>>> x = torch.tensor(range(10))
>>> x
tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
probs is a vector of uniform random values that we can later check against our dropout probability:
>>> probs = torch.empty(10).uniform_(0, 1)
>>> probs
tensor([ 0.9793, 0.1742, 0.0904, 0.8735, 0.4774, 0.2329, 0.0074,
0.5398, 0.4681, 0.5314])
Now we apply the dropout probabilities probs on our input x:
>>> torch.where(probs > 0.2, x, torch.zeros(10, dtype=torch.int64))
tensor([ 0, 0, 0, 3, 4, 5, 0, 7, 8, 9])
Note: To see some effect I chose a dropout probability of 0.2 here. In reality you probably want it to be smaller.
You can pick any token / id you like for this; here is an example with 42 as the unknown token id:
>>> unk_token = 42
>>> torch.where(probs > 0.2, x, torch.empty(10, dtype=torch.int64).fill_(unk_token))
tensor([ 0, 42, 42, 3, 4, 5, 42, 7, 8, 9])
torch.where comes with PyTorch 0.4.0:
https://pytorch.org/docs/master/torch.html#torch.where
I don't know about the shapes of your network, but your forward() should look something like this then (when using mini-batching you need to flatten the input before applying dropout):
def forward_train(self, x, l):
    # probabilities
    probs = torch.empty(x.size(0)).uniform_(0, 1)
    # applying word dropout
    x = torch.where(probs > 0.02, x, torch.zeros(x.size(0), dtype=torch.int64))
    # continue like before ...
    x = self.pad(x)
    embedded_x = self.embedding(x)
    embedded_x = embedded_x.view(-1, 1, x.size()[1] * self.params["word_dim"])  # [batch_size, 1, seq_len * word_dim]
    features = [F.relu(conv(embedded_x)) for conv in self.convs]
    pooled = [F.max_pool1d(feat, feat.size()[2]).view(-1, self.params["feature_num"]) for feat in features]
    pooled = torch.cat(pooled, 1)
    pooled = self.dropout(pooled)
    logit = self.fc(pooled)
    return logit
Note: I named the function forward_train(), so you should use another forward() without dropout for evaluation / prediction. But you could also branch on self.training inside a single forward(), as sketched below.
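A rough sketch of what such a single forward() could look like, keeping your padding index 1 untouched and using 0 as the unk index (assuming x is a [batch, seq_len] LongTensor):

def forward(self, x, l):
    if self.training:
        probs = torch.empty_like(x, dtype=torch.float).uniform_(0, 1)
        # replace with unk (index 0) with probability word_dropout, but never touch padding (index 1)
        drop = (probs < self.params["word_dropout"]) & (x != 1)
        x = torch.where(drop, torch.zeros_like(x), x)
    x = self.pad(x)
    # ... continue exactly like before ...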
Option two: using torch.utils.data.Dataset:
If you're using the Dataset class provided by torch.utils.data, it is very easy to do this kind of pre-processing efficiently. Combined with a DataLoader you get multi-process data loading, so the code sample above just has to be executed in the __getitem__ method of your Dataset class.
This could look like this:
def __getitem__(self, index):
    'Generates one sample of data'
    # Select sample
    ID = self.input_tokens[index]
    # Load data and get label
    # using the add_unk function from the code above
    X = torch.LongTensor([add_unk(ID, p=0.01)])
    y = self.targets[index]
    return X, y
This is a bit out of context and doesn't look very elegant, but I think you get the idea. According to this blog post by Shervine Amidi at Stanford, it should be no problem to do more complex pre-processing steps in this function:
Since our code [Dataset is meant] is designed to be multicore-friendly, note that you can do more complex operations instead (e.g. computations from source files) without worrying that data generation becomes a bottleneck in the training process.
The linked blog post - "A detailed example of how to generate your data in parallel with PyTorch" - also provides a good guide for implementing the data generation with Dataset and DataLoader.
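Wiring such a Dataset into a DataLoader then gives you batching and multi-process loading almost for free (a sketch - the Dataset class name here is a placeholder):

from torch.utils.data import DataLoader

train_set = MyTokenDataset(input_tokens, targets)   # hypothetical Dataset using the __getitem__ above
train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

for X_batch, y_batch in train_loader:
    pass  # forward / backward as usual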
I guess you'll prefer option one - only two lines and it should be very efficient. :)
Good luck!

Related

Backtransforming a PyTorch Tensor

I have trained a WGAN on the CelebA dataset in PyTorch following this youtube video. Since I do this on Google Cloud Platform where TensorBoard is not available, I save one figure of images generated by the GAN every epoch to see how the GAN is actually doing.
Now, the saved PDF files look something like this: generated images. Unfortunately, this is not really readable, and I suspect it has to do with the preprocessing I do:
trafo = transforms.Compose(
    [transforms.Resize(size=(64, 64)),
     transforms.ToTensor(),
     transforms.Normalize(mean=(0.5,), std=(0.5,))])
Is there any way to kind of undo this transformation when I save the image?
Currently, I save the image every epoch as follows:
visualization = torchvision.utils.make_grid(
    tensor=gen(fixed_noise),
    nrow=8,
    normalize=False)
plt.savefig("generated_WGAN_" + datetime.now().strftime("%Y%m%d-%H%M%S") + ".pdf")
Also, I should probably mention that in the Jupyter notebook, I get the following warning:
"Clipping input data to the valid range for imshow with RGB data ([0..1]) for floats or [0..255] for integers)."
The torchvision.transforms.Normalize function is usually used to standardize data (make mean(data)=0 and std(data)=1), while the normalize option on torchvision.utils.make_grid is used to normalize the data to [0,1] given a range. So there is no need to implement a function to fix this.
If True, shift the image to the range (0, 1), by the min and max values specified by range. Default: False.
Here you are looking to normalize between 0 and 1. Given a tensor x:
torchvision.utils.make_grid(x, nrow=8, normalize=True, range=(x.min(), x.max()))
Here are some usage examples provided by PyTorch's documentation.
Back to your original question, I should mention that torchvision.transforms.Normalize(mean=0.5, std=0.5) doesn't transform your data such that it has mean=0.5 and std=0.5... Neither will it standardize it to mean=0, std=1. You have to measure the mean and std of your dataset.
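Measuring them could look roughly like this (a sketch, assuming a dataset of equally sized images that yields (image, label) pairs after ToTensor()):

imgs = torch.stack([img for img, _ in dataset])   # (N, C, H, W)
mean = imgs.mean(dim=(0, 2, 3))                   # per-channel mean
std = imgs.std(dim=(0, 2, 3))                     # per-channel std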
torchvision.transforms.Normalize simply performs a shift-scale operation. To undo that, just unscale-unshift with the same values:
>>> import torch
>>> import torchvision.transforms as T
>>> mean, std = 0.5, 0.5  # the values used in the original transform
>>> x = torch.rand(64, 3, 100, 100)*torch.rand(64, 1, 1, 1)
>>> x.mean(), x.std()
(tensor(0.2536), tensor(0.2175))
>>> t = T.Normalize(mean, std)
>>> t_inv = lambda x: x*std + mean
>>> x_after = t(x)
>>> x_after.mean(), x_after.std()
(tensor(-0.4928), tensor(0.4350))
>>> x_before = t_inv(x_after)
>>> x_before.mean(), x_before.std()
(tensor(0.2536), tensor(0.2175))
It seems like your output pixel values are in range [-1, 1] (please verify this).
Therefore, when you save the images, the negative part is being clipped (as the warning message you got suggests).
Try:
visualization = torchvision.utils.make_grid(
    tensor=torch.clamp(gen(fixed_noise), -1, 1) * 0.5 + 0.5,  # from [-1, 1] -> [0, 1]
    nrow=8,
    normalize=False)
plt.savefig("generated_WGAN_" + datetime.now().strftime("%Y%m%d-%H%M%S") + ".pdf")

Masking and Instance Normalization in PyTorch

Assume I have a PyTorch tensor, arranged as shape [N, C, L] where N is the batch size, C is the number of channels or features, and L is the length. In this case, if one wishes to perform instance normalization, one does something like:
N = 20
C = 100
L = 40
m = nn.InstanceNorm1d(C, affine=True)
input = torch.randn(N, C, L)
output = m(input)
This will perform a normalization in the L-wise dimension for each N*C = 2000 slices of data, subtracting 2000 means, scaling by 2000 standard deviations, and re-scaling by 100 learnable weight and bias parameters (one per channel). The unspoken assumption here is that all of these values exist and are meaningful.
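In other words, the result is equivalent to a manual per-(N, C) normalization over the L dimension (a quick check, reusing input and output from above):

mean = input.mean(dim=2, keepdim=True)                 # [N, C, 1]
var = input.var(dim=2, unbiased=False, keepdim=True)   # [N, C, 1]
manual = (input - mean) / torch.sqrt(var + m.eps)      # affine weight=1, bias=0 at init
print(torch.allclose(output, manual, atol=1e-5))       # True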
But I have a situation where, for the slice N=1, I would like to exclude all data after (say) L=35. For the slice N=2 (say) all the data are valid. For the slice N=3, exclude all data after L=30, etc. This mimics data which are one dimensional time sequences, having multiple features, but which are not the same length.
How can I perform an instance norm on such data, get correct statistics, and maintain differentiability/AutoGrad information in PyTorch?
Update: While maintaining GPU performance, or at least not killing it dead.
I cannot:
1. Mask with zero values, as this destroys the computed means and variances, giving erroneous results.
2. Mask with np.nan or np.inf, as PyTorch tensors do not ignore such values, but treat them as errors. They are sticky, and lead to garbage results. PyTorch currently lacks the equivalent of np.nanmean and np.nanvar.
3. Permute or transpose to an amenable arrangement of data; no such approach gives me what I need.
4. Use a pack_padded_sequence; instance normalization does not operate on that data structure, and one cannot import data into that structure as far as I know. Also, data re-arrangement would still be necessary, see 3 above.
Am I missing an approach which would give me what I need? Or perhaps am I missing a method of data re-arrangement which would allow 3 or 4 above to work?
This is an issue faced by recurrent neural networks all the time, hence the pack_padded_sequence functionality, but it isn't quite applicable here.
I don't think this is directly possible to implement using the existing InstanceNorm1d; the easiest way is probably to implement it yourself from scratch. I did a quick implementation that should work. To make it a little bit more general, this module requires a boolean mask (a boolean tensor of the same size as the input) that specifies which elements should be considered when passing through the instance norm.
import torch


class MaskedInstanceNorm1d(torch.nn.Module):
    def __init__(self, num_features, eps=1e-6, momentum=0.1, affine=True, track_running_stats=False):
        super().__init__()
        self.num_features = num_features
        self.eps = eps
        self.momentum = momentum
        self.affine = affine
        self.track_running_stats = track_running_stats

        self.gamma = None
        self.beta = None
        if self.affine:
            self.gamma = torch.nn.Parameter(torch.ones((1, self.num_features, 1), requires_grad=True))
            self.beta = torch.nn.Parameter(torch.zeros((1, self.num_features, 1), requires_grad=True))

        # running statistics are only kept when track_running_stats=True
        self.running_mean = None
        self.running_variance = None
        if self.track_running_stats:
            self.running_mean = torch.zeros((1, self.num_features, 1), requires_grad=False)
            self.running_variance = torch.zeros((1, self.num_features, 1), requires_grad=False)

    def forward(self, x, mask):
        mean = torch.zeros((1, self.num_features, 1), requires_grad=False)
        variance = torch.ones((1, self.num_features, 1), requires_grad=False)

        # compute masked mean and variance of batch
        for c in range(self.num_features):
            if mask[:, c, :].any():
                mean[0, c, 0] = x[:, c, :][mask[:, c, :]].mean()
                variance[0, c, 0] = (x[:, c, :][mask[:, c, :]] - mean[0, c, 0]).pow(2).mean()

        # update running mean and variance
        if self.training and self.track_running_stats:
            for c in range(self.num_features):
                if mask[:, c, :].any():
                    self.running_mean[0, c, 0] = (1 - self.momentum) * self.running_mean[0, c, 0] \
                        + self.momentum * mean[0, c, 0]
                    self.running_variance[0, c, 0] = (1 - self.momentum) * self.running_variance[0, c, 0] \
                        + self.momentum * variance[0, c, 0]

        # compute output
        x = (x - mean) / (self.eps + variance).sqrt()
        if self.affine:
            x = x * self.gamma + self.beta
        return x
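A quick (made-up) usage example, deriving the boolean mask from per-sequence valid lengths:

N, C, L = 4, 8, 10
x = torch.randn(N, C, L)
lengths = torch.tensor([7, 10, 5, 9])                # valid length of each sequence
mask = torch.arange(L)[None, :] < lengths[:, None]   # [N, L] boolean
mask = mask[:, None, :].expand(N, C, L)              # [N, C, L] boolean

norm = MaskedInstanceNorm1d(num_features=C)
out = norm(x, mask)
print(out.shape)                                     # torch.Size([4, 8, 10])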

Keras layer for slicing image data into sliding windows

I have a set of images, all of varying widths, but with fixed height set to 100 pixels and 3 channels of depth. My task is to classify if each vertical line in the image is interesting or not. To do that, I look at the line in context of its 10 predecessor and successor lines. Imagine the algorithm sweeping from left to right of the image, detecting vertical lines containing points of interest.
My first attempt at doing this was to manually cut out these sliding windows using numpy before feeding the data into the Keras model. Like this:
# Pad left and right
s = np.repeat(D[:1], 10, axis=0)
e = np.repeat(D[-1:], 10, axis=0)
# D now has shape (w + 20, 100, 3)
D = np.concatenate((s, D, e))
# Sliding windows creation trick from SO question
idx = np.arange(21)[None, :] + np.arange(len(D) - 20)[:, None]
windows = D[idx]
Then all windows and all ground truth 0/1 values for all vertical lines in all images would be concatenated into two very long arrays.
I have verified that this works, in principle. I fed each window to a Keras layer looking like this:
Conv2D(20, (5, 5), input_shape = (21, 100, 3), padding = 'valid', ...)
But the windowing causes the memory usage to increase 21 times, so doing it this way becomes impractical. My scenario seems very common in machine learning, though, so there must be some standard method in Keras to do this efficiently? E.g. I would like to feed Keras my raw image data of shape (w, 100, 3), tell it what the sliding window size is, and let it figure out the rest. I have looked at some sample code, but I'm an ML noob so I don't get it.
Unfortunately this isn't an easy problem, because it can involve using a variable-sized input for your Keras model. While I think it is possible to do this with proper use of placeholders, that's certainly no place for a beginner to start. Your other option is a data generator. As with many computationally intensive tasks, there is a trade-off between compute speed and memory requirements: using a generator is more compute-heavy, and the generation is done entirely on your CPU (no GPU acceleration), but it won't make the memory blow up.
The point of a data generator is that it will apply the operation to images one at a time to produce the batch, then train on that batch, then free up the memory - so you only end up keeping one batch worth of data in memory at any time. Unfortunately if you have a time consuming generation then this can seriously affect performance.
The generator will be a Python generator (using the 'yield' keyword) and is expected to produce a single batch of data. Keras is very good at handling arbitrary batch sizes, so you can always make one image yield one batch, especially to start.
Here is the keras page on fit_generator - I warn you, this starts to become a lot of work very quickly, consider buying more memory:
https://keras.io/models/model/#fit_generator
Fine I'll do it for you :P
import numpy as np
import pandas as pd
import keras
from keras.models import Model, model_from_json
from keras.layers import Dense, Concatenate, Multiply, Add, Subtract, Input, Dropout, Lambda, Conv1D, Flatten
from tensorflow.python.client import device_lib

# check for my gpu
print(device_lib.list_local_devices())

# make some fake image data
# 1000 random widths
data_widths = np.floor(np.random.random(1000)*100) + 1  # +1 so no image has zero width

# producing 1000 random images with dimensions w x 100 x 3
# and a vector of which vertical lines are interesting
# I assume your data looks like this
images = []
interesting = []
for w in data_widths:
    images.append(np.random.random([int(w), 100, 3]))
    interesting.append(np.random.random(int(w)) > 0.5)

# this is a generator
def image_generator(images, interesting):
    num = 0
    while num < len(images):
        windows = None
        truth = None

        D = images[num]
        # this should look familiar
        # Pad left and right
        s = np.repeat(D[:1], 10, axis=0)
        e = np.repeat(D[-1:], 10, axis=0)
        # D now has shape (w + 20, 100, 3)
        D = np.concatenate((s, D, e))
        # Sliding windows creation trick from SO question
        idx = np.arange(21)[None, :] + np.arange(len(D) - 20)[:, None]
        windows = D[idx]
        truth = np.expand_dims(1*interesting[num], axis=1)
        yield (windows, truth)
        num += 1
        # the generator MUST loop
        if num == len(images):
            num = 0

# basic model - replace with your own
input_layer = Input(shape=(21, 100, 3), name="input_node")
fc = Flatten()(input_layer)
fc = Dense(100, activation='relu', name="fc1")(fc)
fc = Dense(50, activation='relu', name="fc2")(fc)
fc = Dense(10, activation='relu', name="fc3")(fc)
output_layer = Dense(1, activation='sigmoid', name="output")(fc)

model = Model(input_layer, output_layer)
model.compile(optimizer="adam", loss='binary_crossentropy')
model.summary()

# and training
training_history = model.fit_generator(image_generator(images, interesting),
                                       epochs=5,
                                       initial_epoch=0,
                                       steps_per_epoch=len(images),
                                       verbose=1)

Problems using poly kernel in GridSearchCV and SVM classifier

I am trying to do a grid search using a SVM classifier.
Consider my data and target that have been parsed from file and input to numpy arrays.
I then preprocess them.
# Transform the data to have zero mean and unit variance.
zeroMeanUnitVarianceScaler = preprocessing.StandardScaler().fit(data)
scaledData = zeroMeanUnitVarianceScaler.transform(data)

# Transform the target to have range [-1, 1].
scaledTarget = np.empty([161L,], dtype=int)
for i in range(len(target)):
    if(target[i] == 'Malignant'):
        scaledTarget[i] = 1
    if(target[i] == 'Benign'):
        scaledTarget[i] = -1
I now try to set up my grid and fit the scaled data to targets.
# Generate parameters for parameter grid.
CValues = np.logspace(-3, 3, 7)
GammaValues = np.logspace(-3, 3, 7)
kernelValues = ('poly', 'sigmoid')
# kernelValues = ('linear', 'rbf', 'sigmoid')
degreeValues = np.array([0, 1, 2, 3, 4])
coef0Values = np.logspace(-3, 3, 7)

# Generate the parameter grid.
paramGrid = dict(C=CValues, gamma=GammaValues, kernel=kernelValues,
                 coef0=coef0Values)

# Create and train a SVM classifier using the parameter grid and with
# stratified shuffle split.
stratifiedShuffleSplit = StratifiedShuffleSplit(n_splits=10, test_size=0.25,
                                                train_size=None, random_state=0)
clf = GridSearchCV(estimator=svm.SVC(), param_grid=paramGrid,
                   cv=stratifiedShuffleSplit, n_jobs=1)
clf.fit(scaledData, scaledTarget)
If I uncomment the line kernelValues = ('linear', 'rbf', 'sigmoid'), then the code runs in approximately 50 seconds on my 16 GB i7-4950 3.6 GHz machine running windows 10.
However, if I try to run the code as is with 'poly' as a possible kernel value, then the code hangs forever. For example, I ran it yesterday overnight and it did not return anything when I got back in the office today.
Interestingly enough, if I try to create a SVM classifier with a poly kernel, it returns a result immediately
clf = svm.SVC(kernel='poly',degree=2)
clf.fit(data, target)
It only hangs when I run it through the GridSearchCV code above. I have not tried other cv methods to see if that changes anything.
Is this a bug in scikit-learn? Am I doing things properly? On a side note, is my method of doing grid search / cross-validation using GridSearchCV and StratifiedShuffleSplit sensible? It seems to me the most brute-force (i.e. time-consuming) but robust method.
Thank you!

LSTMLayer produces NaN values even before training it

I'm currently trying to construct an LSTM network with Lasagne to predict the next step of noisy sequences. I first trained a stack of 2 LSTM layers for a while, but had to use an abysmally small learning rate (1e-6) because of divergence issues (that ultimately produced NaN values). The results were kind of disappointing, as the network produced smooth, out-of-phase versions of the input.
I then came to the conclusion I should use better parameter initialization than what is given by default. The goal was to start from a network that just mimics identity, since for a strongly auto-correlated signal it should be a good first estimate of the next step (x(t) ~ x(t+1)), and to sprinkle a bit of noise on top of it.
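In other words, with all weight matrices near zero and the gate biases pushed to ±G, the gates saturate and the cell just passes leaky_rectify(x(t)) through. A quick numpy sanity check of that reasoning (made-up numbers, not part of the actual code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

G = 10.
x_t = np.array([0.3, 1.2, 0.05])                 # made-up (positive) input at time t

i_t = sigmoid(G)    # input gate:  weights ~0, bias +G -> ~1
f_t = sigmoid(-G)   # forget gate: weights ~0, bias -G -> ~0
o_t = sigmoid(G)    # output gate: weights ~0, bias +G -> ~1
g_t = np.where(x_t > 0, x_t, 0.01 * x_t)         # leaky_rectify(I.dot(x_t)) ~ x_t

c_t = f_t * 0. + i_t * g_t                       # previous cell state is (almost) forgotten
h_t = o_t * np.where(c_t > 0, c_t, 0.01 * c_t)   # ~ x_t for positive inputs

print(i_t, f_t, o_t)   # ~0.99995, ~4.5e-05, ~0.99995
print(h_t - x_t)       # all close to 0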
import theano, numpy, lasagne
from theano import tensor as T
from lasagne.layers.recurrent import LSTMLayer, InputLayer, Gate
from lasagne.layers import DropoutLayer
from lasagne.nonlinearities import sigmoid, tanh, leaky_rectify
from lasagne.layers import get_output
from lasagne.init import GlorotNormal, Normal, Constant

floatX = 'float32'

# function to create a lstm that ~ propagate the input from start to finish off the bat
# should be a good start for a predictive lstm with high one-step autocorrelation
def create_identity_lstm(input, shape, orig_inp=None, noiselvl=0.01, G=10., mask_input=None):
    inp, out = shape
    # orig_inp is used to limit the number of units that are actually used to pass the input information from one layer to the other - the rest of the units should produce ~ 0 activation.
    if orig_inp is None:
        orig_inp = inp

    # input gate
    inputgate = Gate(
        W_in=GlorotNormal(noiselvl),
        W_hid=GlorotNormal(noiselvl),
        W_cell=Normal(noiselvl),
        b=Constant(0.),
        nonlinearity=sigmoid
    )
    # forget gate
    forgetgate = Gate(
        W_in=GlorotNormal(noiselvl),
        W_hid=GlorotNormal(noiselvl),
        W_cell=Normal(noiselvl),
        b=Constant(0.),
        nonlinearity=sigmoid
    )
    # cell gate
    cell = Gate(
        W_in=GlorotNormal(noiselvl),
        W_hid=GlorotNormal(noiselvl),
        W_cell=None,
        b=Constant(0.),
        nonlinearity=leaky_rectify
    )
    # output gate
    outputgate = Gate(
        W_in=GlorotNormal(noiselvl),
        W_hid=GlorotNormal(noiselvl),
        W_cell=Normal(noiselvl),
        b=Constant(0.),
        nonlinearity=sigmoid
    )

    lstm = LSTMLayer(input, out, ingate=inputgate, forgetgate=forgetgate, cell=cell,
                     outgate=outputgate, nonlinearity=leaky_rectify, mask_input=mask_input)

    # change matrices and biases
    # ingate - should return ~1 (matrices = 0, big bias)
    b_i = lstm.b_ingate.get_value()
    b_i[:orig_inp] += G
    lstm.b_ingate.set_value(b_i)

    # forgetgate - should return 0 (matrices = 0, big negative bias)
    b_f = lstm.b_forgetgate.get_value()
    b_f[:orig_inp] -= G
    b_f[orig_inp:] += G  # to help learning future features, I preserve a large bias on "unused" units to help it remember stuff
    lstm.b_forgetgate.set_value(b_f)

    # cell - should return x(t) (W_xc = identity, rest is 0)
    W_xc = lstm.W_in_to_cell.get_value()
    for i in xrange(orig_inp):
        W_xc[i, i] += 1.
    lstm.W_in_to_cell.set_value(W_xc)

    # outgate - should return 1 (same as ingate)
    b_o = lstm.b_outgate.get_value()
    b_o[:orig_inp] += G
    lstm.b_outgate.set_value(b_o)

    # done
    return lstm
I then use this lstm generation code to generate the following network:
# layers
#input + dropout
input = InputLayer((None, None, 7), name='input')
mask = InputLayer((None, None), name='mask')
drop1 = DropoutLayer(input, p=0.33)
#lstm1 + dropout
lstm1 = create_identity_lstm(drop1, (7, 1024), mask_input=mask)
drop2 = DropoutLayer(lstm1, p=0.33)
#lstm2 + dropout
lstm2 = create_identity_lstm(drop2, (1024, 128), orig_inp=7, mask_input=mask)
drop3 = DropoutLayer(lstm2, p=0.33)
#lstm3
lstm3 = create_identity_lstm(drop3, (128, 7), orig_inp=7, mask_input=mask)
# symbolic variables and prediction
x = input.input_var
ma = mask.input_var
ma_reshape = ma.dimshuffle((0,1,'x'))
yhat = get_output(lstm3, deterministic=False)
yhat_det = get_output(lstm3, deterministic=True)
y = T.ftensor3('y')
predict = theano.function([x, ma], yhat_det)
Problem is, even without any training, this network produces garbage values and sometimes even a bunch of NaNs, right from the very first LSTM layer:
X = numpy.random.random((5, 10000, 7)).astype('float32')
Masks = numpy.ones(X.shape[:2], dtype='float32')
hid1 = get_output(lstm1, determistic=True)
get_hid1 = theano.function([x, ma], hid1)
h1 = get_hid1(X, Masks)
print numpy.isnan(h1).sum(axis=1).sum(axis=1)
array([6379520, 6367232, 6377472, 6376448, 6378496])
# even the first output value is garbage!
print h1[:,0,0] - X[:,0,0]
array([-0.03898358, -0.10118812, 0.34877831, -0.02509735, 0.36689138], dtype=float32)
I don't get why; I checked each matrix and their values are fine, just as I wanted them to be. I even tried to recreate each gate activation and the resulting hidden activations using the actual numpy arrays, and they reproduce the input just fine. What did I do wrong there??
