Theano function with updates produces NaN output - theano

My code looks like this:
output = lasagne.layers.get_output(output_layer)
loss = function(output) * target
loss = -(loss.sum())
params = lasagne.layers.get_all_params(output_layer)
updates = lasagne.updates.sgd(loss,params,learning_rate=0.00001)
train_fn = theano.function([input,target], loss, updates=updates,allow_input_downcast=True)
validate_fn = theano.function([input,target], loss, allow_input_downcast=True)
Here output_layer is a CNN network, and function is defined as follows:
def function(X):
    squared_euclidean_distances = (X ** 2).sum(1).reshape((X.shape[0], 1)) + (X ** 2).sum(1).reshape((1, X.shape[0])) - 2 * X.dot(X.T)
    dist = 1 / (1 + squared_euclidean_distances)
    Pij = dist / dist.sum(0)
    return Pij
target is a sparse matrix where target(i, j) = 1 if output_layer(i) and output_layer(j) belong to the same class, and target(i, j) = 0 otherwise.
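For illustration only (this is not part of the original code, and the label values are made up), such a target matrix could be built from a label vector like this:

import numpy as np

labels = np.array([0, 1, 0, 2])  # hypothetical class labels
target_matrix = (labels[:, None] == labels[None, :]).astype('float32')
# target_matrix[i, j] == 1.0 exactly when samples i and j share a class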
When digging into the code, I found that the error comes from a conv layer in the CNN network and is raised by a true_div op.
Clearly, the only difference between train_fn and validate_fn is the updates parameter.
However, when I print the output of train_fn and validate_fn with the same dummy input, the output of validate_fn makes sense, but the train_fn output is NaN.
I think the output is produced before the updates are applied to the parameters. Is anything wrong?
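One way to narrow down which op first produces the NaN is to compile the training function with Theano's NanGuardMode. This is a debugging sketch, not part of the original code:

from theano.compile.nanguardmode import NanGuardMode

# Raise an error at the first op whose output contains NaN or Inf,
# instead of letting it propagate silently to the final loss
train_fn_debug = theano.function(
    [input, target], loss, updates=updates,
    allow_input_downcast=True,
    mode=NanGuardMode(nan_is_error=True, inf_is_error=True, big_is_error=True)
)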

Related

Logistic regression cost function returning nan

I learnt logistic regression recently and wanted to practice it. I am currently using this dataset from Kaggle. I tried to define a cost function in this manner (I made all the necessary imports):
# Defining the hypothesis
sigmoid = lambda x: 1 / (1 + np.exp(-x))
predict = lambda trainset, parameters: sigmoid(trainset @ parameters)

# Defining the cost
def cost(theta):
    #print(X.shape, y.shape, theta.shape)
    preds = predict(X, theta.T)
    errors = (-y * np.log(preds)) - ((1 - y) * np.log(1 - preds))
    return np.mean(errors)

theta = []
for i in range(13):
    theta.append(1)
theta = np.array([theta])
cost(theta)
and when I run this cell I get:
/opt/venv/lib/python3.7/site-packages/ipykernel_launcher.py:9: RuntimeWarning: divide by zero encountered in log
if __name__ == '__main__':
/opt/venv/lib/python3.7/site-packages/ipykernel_launcher.py:9: RuntimeWarning: invalid value encountered in multiply
if __name__ == '__main__':
nan
When I searched online, I got the advice to normalise the data and then try it. So this is how I did it:
df = pd.read_csv("/home/jovyan/work/heart.csv")
df.head()
# The dataset is 303x14 in size (using df.shape)
length = df.shape[0]
# Output vector
y = df['target'].values
y = np.array([y]).T
# We name trainingset as X for convenience
trainingset = df.drop(['target'], axis = 1)
#trainingset = df.insert(0, 'bias', 1)
minmax_normal_trainset = (trainingset - trainingset.min())/(trainingset.max() - trainingset.min())
X = trainingset.values
I really don't know where the division-by-zero error is occurring or how to fix it. If I made any mistakes in this implementation, please correct me. I am sorry if this has been asked before, but all I could find was the tip to normalise the data. Thanks in advance!
np.log(0) triggers a divide-by-zero warning and evaluates to -inf. So it's this part that's causing the problems:
errors = (-y * np.log(preds)) - ((1 - y) * np.log(1 - preds))
               #############              #################
preds can be 0 or 1 when the absolute value of x is greater than 709 (because of floating point math, at least on my machine), which is why normalizing x to be between 0 and 1 solves the problem.
EDIT:
You may want to normalize to a larger range than (0, 1) - your sigmoid function as currently set is pretty much linear in that range. Maybe use:
minmax_normal_trainset = c * (trainingset - trainingset.mean()) / trainingset.std()
And tune c for better convergence.
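Another common guard (a minimal sketch, not from the original answer; safe_cost and eps are names introduced here) is to clip the predictions away from 0 and 1 before taking logs, so a saturated sigmoid cannot produce log(0):

import numpy as np

def safe_cost(theta, X, y, eps=1e-12):
    # For large |x| the sigmoid saturates to exactly 0.0 or 1.0 in float64,
    # so clip the predictions before taking the log
    preds = 1 / (1 + np.exp(-(X @ theta.T)))
    preds = np.clip(preds, eps, 1 - eps)
    errors = (-y * np.log(preds)) - ((1 - y) * np.log(1 - preds))
    return np.mean(errors)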

Pytorch datatype/dimension confusion TypeError: 'Tensor' object is not callable

This piece of code was originally written in NumPy, and I'm trying to utilise GPU computation by rewriting it in PyTorch, but as I'm new to PyTorch a lot of problems have occurred. Firstly, I'm confused by the dimensions of the tensors. Sometimes, after operating on the tensors, only transposing the tensor fixes the problem; is there any way I can stop doing .t()? The major problem is that in the line ar = torch.stack ... the error "TypeError: 'Tensor' object is not callable" occurs. Any suggestion/correction would be appreciated. Thxxx
def vec_datastr(vector):
    vector = vector.float()
    # Find the indices corresponding to non-zero entries
    index = torch.nonzero(vector)
    index = index.t()
    # Compute probability
    prob = vector ** 2
    if torch.sum(prob) == 0:
        prob = 0
    else:
        prob = prob / torch.sum(prob)
    d = depth(vector)
    CumProb = torch.ones((2**d - len(prob.t()), 1), device='cuda')
    cp = torch.cumsum(prob, dim=0)
    cp = cp.reshape((len(cp.t()), 1))
    CumProb = torch.cat((cp, CumProb), 0)
    vector = vector.t()
    prob = prob.t()
    ar = torch.stack((index, vector([index, 1]), prob([index, 1]), CumProb([index, 1])))  # Problems occur here
    ar = ar.reshape((len(index), 4))
    # Store the data as a 4-dimensional array
    output = dict()
    output = {'index': ar[:, 0], 'value': ar[:, 1], 'prob': ar[:, 2], 'CumProb': ar[:, 3]}
    return output
ar = torch.stack(
(index, vector([index, 1]), prob([index, 1]), CumProb([index, 1]))
) # Problems occur here
vector is of type torch.Tensor. It has no __call__ defined. You are calling it (vector([index, 1])) when you should index it directly, like this: vector[index, 1]. The same goes for prob and CumProb.
Somehow, you do it correctly for ar with ar[:, 0], so it might just be a typo.
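A minimal, self-contained illustration of the difference (with made-up values, not taken from the question):

import torch

v = torch.tensor([[10., 1.], [0., 2.], [3., 4.]])
idx = torch.tensor([0, 2])

print(v[idx, 1])  # indexing with brackets works: tensor([1., 4.])
# v(idx)          # calling with parentheses raises:
#                 # TypeError: 'Tensor' object is not callable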

use of base_score parameter in R for XGBoost multiclass problem

I'm trying to understand how XGBoost works for a multiclass problem. I have used the iris dataset to predict which species an input belongs to based on its characteristics, and computed the results in R.
The code is below:
test <- as.data.frame(iris)
test$y <- ifelse(test$Species == "setosa", 0,
          (ifelse(test$Species == "versicolor", 1,
          (ifelse(test$Species == "virginica", 2, 3)))))
x_iris <- test[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
y_iris <- test[, "y"]
iris_model <- xgboost(data = data.matrix(x_iris), label = y_iris, eta = 0.1, base_score = 0.5, nround = 1,
                      subsample = 1, colsample_bytree = 1, num_class = 3, max_depth = 4, lambda = 0,
                      eval_metric = "mlogloss", objective = "multi:softprob")
xgb.plot.tree(model = iris_model, feature_names = colnames(x_iris))
I tried to manually compute the results and compare the gain and cover values with the R output. I have noticed a couple of things:
The initial probability is always 1/(number of classes), irrespective of what we provide in the 'base_score' parameter in R. The 'base_score' actually gets added at the end, to the final log-odds value, and it matches the R output when we run the predict function to get the log of odds. In the case of binary classification, the 'base_score' parameter is used as the initial probability for the model.
predict(iris_model, data.matrix(x_iris), reshape = TRUE, outputmargin = FALSE)
The loss function is (2.0f * p * (1.0f - p) * wt) for multiclass problems and (p * (1.0f - p) * wt) for binary problems.
There is an explanation of the loss function in the GitHub repo at https://github.com/dmlc/xgboost/issues/638, but no info on why base_score gets added at the end.
Is it because the algorithm in R was designed this way, or does the XGBoost multiclass algorithm work like this in general?
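For what it's worth, here is a minimal sketch (in Python rather than R, with the same hyperparameters assumed) of one way to check empirically whether base_score is simply added to the raw margin in the multiclass case:

import xgboost as xgb
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)

def raw_margin(base_score):
    params = {'objective': 'multi:softprob', 'num_class': 3, 'eta': 0.1,
              'max_depth': 4, 'lambda': 0, 'base_score': base_score}
    bst = xgb.train(params, dtrain, num_boost_round=1)
    # output_margin=True returns the raw (pre-softmax) scores
    return bst.predict(dtrain, output_margin=True)

# If base_score is added to the final margin, the two outputs should differ
# by a constant offset of 0.4 in every entry.
print(raw_margin(0.5) - raw_margin(0.1))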

Tensorflow Dataset API: Gradient is "None"?

I've got problems with the Tensorflow Dataset API.
I'd like to pass some per-sample parameters, but I am unable to optimize them.
sample_data = tf.placeholder(...)
design = tf.placeholder(...)
mixture_prob = tf.Variable(..., shape=[num_mixtures, num_samples])
# transpose to get 'num_samples' to axis 0:
mixture_log_prob_t = tf.transpose(tf.log(mixture_prob, name="mixture_log_prob"))
assert mixture_log_prob_t.shape == [num_samples, num_mixtures]
Here is the cause of my problem:
I've got some sample data together with a design matrix.
Also, each sample has got 'num_mixtures' parameters which I'd like to optimize.
data = tf.data.Dataset.from_tensor_slices((
    sample_data,
    design,
    mixture_log_prob_t
))
data = data.repeat()
data = data.shuffle(batch_size * 4)
data = data.apply(tf.contrib.data.batch_and_drop_remainder(batch_size))
iterator = data.make_initializable_iterator()
batch_sample_data, batch_design, batch_mixture_log_prob = iterator.get_next()
batch_mixture_log_prob = tf.transpose(batch_mixture_log_prob)
Now, when running "optimizer.gradient()" I get "None" for this parameter:
>>> model.gradient
[(None, <tf.Variable 'mixture_prob/logit_prob:0' shape=(2, 2000) dtype=float32_ref>), ...]
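One commonly suggested workaround, sketched here under the assumption that the gradient path is cut because iterator.get_next() is not differentiable back to the variable (this is not from the original post): pass per-sample indices through the Dataset and look the variable up inside the graph with tf.gather.

# Pass indices, not the variable itself, through the pipeline
sample_idx = tf.range(num_samples, dtype=tf.int64)
data = tf.data.Dataset.from_tensor_slices((sample_data, design, sample_idx))
data = data.repeat()
data = data.shuffle(batch_size * 4)
data = data.apply(tf.contrib.data.batch_and_drop_remainder(batch_size))
iterator = data.make_initializable_iterator()
batch_sample_data, batch_design, batch_idx = iterator.get_next()

# Differentiable lookup: gradients can now flow back into mixture_prob
batch_mixture_log_prob = tf.transpose(tf.gather(mixture_log_prob_t, batch_idx))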

LSTMLayer produces NaN values even before training it

I'm currently trying to construct an LSTM network with Lasagne to predict the next step of noisy sequences. I first trained a stack of two LSTM layers for a while, but had to use an abysmally small learning rate (1e-6) because of divergence issues (which ultimately produced NaN values). The results were somewhat disappointing, as the network produced smooth, out-of-phase versions of the input.
I then came to the conclusion that I should use better parameter initialization than the default. The goal was to start from a network that just mimics the identity, since for a strongly auto-correlated signal it should be a good first estimate of the next step (x(t) ~ x(t+1)), and then to sprinkle a bit of noise on top of it.
import theano, numpy, lasagne
from theano import tensor as T
from lasagne.layers.recurrent import LSTMLayer, InputLayer, Gate
from lasagne.layers import DropoutLayer
from lasagne.nonlinearities import sigmoid, tanh, leaky_rectify
from lasagne.layers import get_output
from lasagne.init import GlorotNormal, Normal, Constant
floatX = 'float32'
# function to create a lstm that ~ propagate the input from start to finish off the bat
# should be a good start for a predictive lstm with high one-step autocorrelation
def create_identity_lstm(input, shape, orig_inp=None, noiselvl=0.01, G=10., mask_input=None):
    inp, out = shape
    # orig_inp is used to limit the number of units that are actually used to pass the input
    # information from one layer to the other - the rest of the units should produce ~ 0 activation.
    if orig_inp is None:
        orig_inp = inp
    # input gate
    inputgate = Gate(
        W_in=GlorotNormal(noiselvl),
        W_hid=GlorotNormal(noiselvl),
        W_cell=Normal(noiselvl),
        b=Constant(0.),
        nonlinearity=sigmoid
    )
    # forget gate
    forgetgate = Gate(
        W_in=GlorotNormal(noiselvl),
        W_hid=GlorotNormal(noiselvl),
        W_cell=Normal(noiselvl),
        b=Constant(0.),
        nonlinearity=sigmoid
    )
    # cell gate
    cell = Gate(
        W_in=GlorotNormal(noiselvl),
        W_hid=GlorotNormal(noiselvl),
        W_cell=None,
        b=Constant(0.),
        nonlinearity=leaky_rectify
    )
    # output gate
    outputgate = Gate(
        W_in=GlorotNormal(noiselvl),
        W_hid=GlorotNormal(noiselvl),
        W_cell=Normal(noiselvl),
        b=Constant(0.),
        nonlinearity=sigmoid
    )
    lstm = LSTMLayer(input, out, ingate=inputgate, forgetgate=forgetgate, cell=cell,
                     outgate=outputgate, nonlinearity=leaky_rectify, mask_input=mask_input)
    # change matrices and biases
    # ingate - should return ~1 (matrices = 0, big bias)
    b_i = lstm.b_ingate.get_value()
    b_i[:orig_inp] += G
    lstm.b_ingate.set_value(b_i)
    # forgetgate - should return 0 (matrices = 0, big negative bias)
    b_f = lstm.b_forgetgate.get_value()
    b_f[:orig_inp] -= G
    b_f[orig_inp:] += G  # to help learning future features, I preserve a large bias on "unused" units to help it remember stuff
    lstm.b_forgetgate.set_value(b_f)
    # cell - should return x(t) (W_xc = identity, rest is 0)
    W_xc = lstm.W_in_to_cell.get_value()
    for i in xrange(orig_inp):
        W_xc[i, i] += 1.
    lstm.W_in_to_cell.set_value(W_xc)
    # outgate - should return 1 (same as ingate)
    b_o = lstm.b_outgate.get_value()
    b_o[:orig_inp] += G
    lstm.b_outgate.set_value(b_o)
    # done
    return lstm
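For reference, here is a minimal NumPy sketch (not part of the original code) of the single-step behaviour this initialization is aiming for, under the assumption that the gates saturate at ~1/~0 thanks to the large bias G and that W_in_to_cell acts as the identity on the first orig_inp units:

import numpy as np

def leaky_rectify(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

x_t = np.random.random(7).astype('float32')    # current input
c_prev = np.zeros(7, dtype='float32')          # previous cell state

i_t = 1.0                                      # input gate  ~ sigmoid(+G)
f_t = 0.0                                      # forget gate ~ sigmoid(-G)
c_t = f_t * c_prev + i_t * leaky_rectify(x_t)  # cell ~ x(t) for non-negative input
o_t = 1.0                                      # output gate ~ sigmoid(+G)
h_t = o_t * leaky_rectify(c_t)                 # hidden output ~ x(t)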
I then use this create_identity_lstm function to generate the following network:
# layers
#input + dropout
input = InputLayer((None, None, 7), name='input')
mask = InputLayer((None, None), name='mask')
drop1 = DropoutLayer(input, p=0.33)
#lstm1 + dropout
lstm1 = create_identity_lstm(drop1, (7, 1024), mask_input=mask)
drop2 = DropoutLayer(lstm1, p=0.33)
#lstm2 + dropout
lstm2 = create_identity_lstm(drop2, (1024, 128), orig_inp=7, mask_input=mask)
drop3 = DropoutLayer(lstm2, p=0.33)
#lstm3
lstm3 = create_identity_lstm(drop3, (128, 7), orig_inp=7, mask_input=mask)
# symbolic variables and prediction
x = input.input_var
ma = mask.input_var
ma_reshape = ma.dimshuffle((0,1,'x'))
yhat = get_output(lstm3, deterministic=False)
yhat_det = get_output(lstm3, deterministic=True)
y = T.ftensor3('y')
predict = theano.function([x, ma], yhat_det)
Problem is, even without any training, this network produces garbage values and sometimes even a bunch of NaNs, right from the very first LSTM layer:
X = numpy.random.random((5, 10000, 7)).astype('float32')
Masks = numpy.ones(X.shape[:2], dtype='float32')
hid1 = get_output(lstm1, deterministic=True)
get_hid1 = theano.function([x, ma], hid1)
h1 = get_hid1(X, Masks)
print numpy.isnan(h1).sum(axis=1).sum(axis=1)
array([6379520, 6367232, 6377472, 6376448, 6378496])
# even the first output value is garbage!
print h1[:,0,0] - X[:,0,0]
array([-0.03898358, -0.10118812, 0.34877831, -0.02509735, 0.36689138], dtype=float32)
I don't get why; I checked each matrix and their values are fine, exactly as I wanted them to be. I even tried to recreate each gate activation and the resulting hidden activations using the actual NumPy arrays, and they reproduce the input just fine. What did I do wrong here??
