Keras Metric strange behavior

I am trying to define a Keras metric that returns a rounded value. (Normally K.round() can't be used in the loss since it's not differentiable, but I think it can be used in a metric.)
However, despite using K.round(), the metric consistently has decimal places, giving me values like 2.0812 and similar. It is worth mentioning that my y_true are three-value float lists like [1., 3., 5.], [2., 5., 1.] and so on.
To try and understand what is happening I defined a very simple metric.
def simplemetric(y_true, y_pred):
    return y_true
I was expecting this to give me an error, but it returns values around 2.089 that vary slightly on each epoch but not with batch size (which is 128).
I then tried a different metric.
def simplemetric(y_true, y_pred):
    return K.round(K.sum(y_true))
This gives me values around 800.142 that vary slightly on each epoch but not with batch number.
As a final test I tried:
def simplemetric(y_true, y_pred):
    return y_true * 0 + 10.0
Which gives me the expected value of 10.0 on every epoch.
So what is happening in the previous cases? Why can't I get a whole number, and what is the meaning of the ~2. and ~800 that Keras is somehow calculating out of lists like [1., 2., 5.]?
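One way to narrow this down (a standalone sketch, not part of the original question) is to evaluate the simple metric directly with the backend, outside of model.fit(). There it returns y_true unchanged, which suggests the fractional values come from Keras reducing the metric tensor to a scalar per batch (and averaging across batches) during training, rather than from the metric function itself:
import numpy as np
import tensorflow.keras.backend as K

def simplemetric(y_true, y_pred):
    return y_true

y_true = K.constant(np.array([[1., 3., 5.], [2., 5., 1.]]))
y_pred = K.zeros((2, 3))
print(K.eval(simplemetric(y_true, y_pred)))  # the raw y_true values, no averaging
print(np.mean(K.eval(y_true)))               # 2.833..., the kind of scalar Keras would report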

Related

Can't use a combination of gradients for multiple loss functions of a multi-output Keras model

I am doing time-series forecasting in Keras with a CNN and the EHR dataset. The goal is to predict both which molecule to give to the patient and the time until the next patient visit. I have to implement a bi-objective gradient descent based on this paper. The algorithm to implement is here (end of page 7, beginning of page 8):
The model I chose is this one:
With time-series of length 3 as input (corresponding to 3 consecutive visits for a client)
And 2 outputs:
the ATC code (the code of the molecule to predict)
the time to wait until the next visit (in categories of months: 0, 1, 2, 3, and 4 for >= 4)
Both outputs use the SparseCategoricalCrossentropy loss function.
When I start to implement the first operation, gs - gl, I get this error:
Some values in my gradients are None and I don't know why. My optimizer is defined as follows when compiling my model: optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3).
Also, when I try some operations on the gradients to see how things work, I have another problem: only one input is taken into account, which will pose a problem later because I have to consider each loss function separately:
With this code, I have this output message: WARNING:tensorflow:Gradients do not exist for variables ['outputWaitTime/kernel:0', 'outputWaitTime/bias:0'] when minimizing the loss.
EPOCHS = 1
for epoch in range(EPOCHS):
    with tf.GradientTape() as ATCTape, tf.GradientTape() as WTTape:
        predictions = model(xTrain, training=False)
        ATCLoss = loss(yTrain[:, :, 0], predictions[ATC_CODE])
        WTLoss = loss(yTrain[:, :, 1], predictions[WAIT_TIME])
    ATCGrads = ATCTape.gradient(ATCLoss, model.trainable_variables)
    WTGrads = WTTape.gradient(WTLoss, model.trainable_variables)
    grads = ATCGrads + WTGrads
    model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
With this code, it's okay, but both losses are combined into one, whereas I need to consider each loss separately:
EPOCHS = 1
for epoch in range(EPOCHS):
    with tf.GradientTape() as tape:
        predictions = model(xTrain, training=False)
        ATCLoss = loss(yTrain[:, :, 0], predictions[ATC_CODE])
        WTLoss = loss(yTrain[:, :, 1], predictions[WAIT_TIME])
        lossValue = ATCLoss + WTLoss
    grads = tape.gradient(lossValue, model.trainable_variables)
    model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
I need help to understand why I have all of those problems.
The notebook containing all the code is here: https://colab.research.google.com/drive/1b6UorAAEddNKFQCxaK1Wsuj09U645KhU?usp=sharing
The implementation begins in the part Model Creation
The reason you get None in ATCGrads and WTGrads is that each loss is computed with respect to a different output, outputATC or outputWaitTime. If an output's value is not used to calculate a loss, then there are no gradients of that loss with respect to that output, hence you get None gradients for that output layer. That is also why you get WARNING:tensorflow:Gradients do not exist for variables ['outputWaitTime/kernel:0', 'outputWaitTime/bias:0'] when minimizing the loss: those gradients do not exist with respect to each individual loss. If you combine the losses into one, then both outputs are used to calculate the loss, so there is no WARNING.
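If the warning itself is the only concern when applying a single loss's gradients, one common workaround (a sketch, not part of the original answer, reusing the variable names from the question) is to drop the None entries before calling apply_gradients:
# Apply only the ATC loss gradients; variables this loss does not reach have
# None gradients and are skipped.
atc_pairs = [(g, v) for g, v in zip(ATCGrads, model.trainable_variables)
             if g is not None]
model.optimizer.apply_gradients(atc_pairs)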
Coming back to the element-wise subtraction gs - gl: you could first convert every None to 0. before subtracting. Note that you cannot use tf.math.subtract(gs, gl) directly, because it requires the shapes of all inputs to match, so:
import tensorflow as tf
gs = [tf.constant([1., 2.]), tf.constant(3.), None]
gl = [tf.constant([3., 4.]), None, tf.constant(4.)]
to_zero = lambda i : 0. if i is None else i
gs = list(map(to_zero, gs))
gl = list(map(to_zero, gl))
sub = [s_i - l_i for s_i, l_i in zip(gs, gl)]
print(sub)
Outputs:
[<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-2., -2.], dtype=float32)>,
<tf.Tensor: shape=(), dtype=float32, numpy=3.0>,
<tf.Tensor: shape=(), dtype=float32, numpy=-4.0>]
Also be aware that tape.gradient() returns a list or nested structure of Tensors (or IndexedSlices, or None), one for each element in sources, with the same structure as sources. Adding two lists with [1, 2] + [3, 4] in Python will not give you [4, 6] as it would for NumPy arrays; instead it concatenates the two lists and gives you [1, 2, 3, 4].
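If the intent of grads = ATCGrads + WTGrads in the question was to add the two gradient lists element-wise before applying them, the same zip-based pattern works. A minimal sketch (not part of the original answer, reusing the variable names from the question; pairs where both gradients are None are dropped):
combined = []
for g_atc, g_wt, var in zip(ATCGrads, WTGrads, model.trainable_variables):
    if g_atc is None and g_wt is None:
        continue  # this variable is reached by neither loss
    g = g_wt if g_atc is None else g_atc if g_wt is None else g_atc + g_wt
    combined.append((g, var))
model.optimizer.apply_gradients(combined)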

The loss is NaN when I use a loss function defined with torch.nn.functional.mse_loss

The loss is always NaN when I use the following loss function:
def Myloss1(source, target):
    loss = torch.nn.functional.mse_loss(source, target, reduction="none")
    return torch.sum(loss).sqrt()
...
loss = Myloss1(s, t)
loss.backward()
But when I use the following loss function, the training becomes normal:
def Myloss2(source, target):
    diff = target - source
    loss = torch.norm(diff)
    return loss
...
loss = Myloss2(s, t)
loss.backward()
Why can't I use Myloss1 to train? Aren't Myloss1 and Myloss2 equivalent?
Please help me, thank you very much!
Myloss1 and Myloss2 are indeed supposed to be equivalent; they at least return the same values for all the tensors I have tried them on.
About the NaN, let's first try to find out when it happens. The only possible culprit here is the sqrt, which is not differentiable at 0. And indeed:
y = torch.randn(2,3)
x = y.clone()
x.requires_grad_(True)
Myloss1(x,y).backward()
print(x.grad.data)
>>> [[nan, nan, nan], [nan, nan, nan]]
On the other hand :
Myloss2(x,y).backward()
print(x.grad.data)
>>> [[-0., -0., -0.],[-0., -0., -0.]]
Of the two results, only the first is mathematically "accurate": computing the derivative of the square root at 0 yields a division by 0. That is why the sqrt is generally avoided when training neural networks. You should use:
good_loss = torch.nn.MSELoss(reduction='mean') # or ='sum' if you prefer
This function is differentiable everywhere, so you won't have any more trouble.
As to why your Myloss2 yields a different gradient, it is related to its implementation; it was extensively discussed here. Basically, people complained about the NaNs, so the library was changed to modify this behavior, while acknowledging that there is no mathematically correct answer, since the derivative is not defined at 0.
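For completeness, a minimal sketch of the suggested replacement in a training step (the tensor names here are illustrative, not taken from the original post):
import torch

criterion = torch.nn.MSELoss(reduction='mean')  # or reduction='sum'

source = torch.randn(2, 3, requires_grad=True)
target = torch.randn(2, 3)

loss = criterion(source, target)
loss.backward()
print(source.grad)  # finite gradients everywhere, no sqrt involved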

predict_proba() Logistic Regression when predicting a single value

I want to use Logistic Regression to predict a class (-1 or +1) given a data set which I split as follows (only a single entry is to be predicted in the test set):
x_train, x_test = loc_indep[:-1], loc_indep[-1:]
y_train, y_test = loc_target[:-1], loc_target[-1:]
Then I use the following to train the model:
regr = LogisticRegression()
regr.fit(x_train, y_train)
predictions = regr.predict(x_test)
probabilities = regr.predict_proba(x_test)
print(probabilities) # prints probabilities
Given the above, the probabilities always prints either [1. 0.] or [0. 1.], meaning that either class +1 or class -1 are picked with the probability 100%. Why is this the case? I expected that the probabilities sum to 1, but that the model picks, say, class +1 with probability 54%.
Your code seems to be correct, so this means you have a super accurate model (which makes me suspect that something is wrong...). I recommend checking your training data: maybe you have, by mistake, some variable that explains too much (for example, a copy of the output).
Also try to output the train and test accuracy. If the train accuracy is 100% and the test accuracy is much lower, you are overfitting, and you will have to change some hyperparameters to avoid it.
To conclude, try to understand your data: maybe it is simply very easy to differentiate the two classes, and that is why you obtain such a good model.
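A quick way to check the train and test accuracy mentioned above, as a sketch reusing the loc_indep/loc_target names from the question and a proper held-out split instead of a single test row:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    loc_indep, loc_target, test_size=0.2, random_state=0)

regr = LogisticRegression()
regr.fit(x_train, y_train)
print("train accuracy:", regr.score(x_train, y_train))
print("test accuracy: ", regr.score(x_test, y_test))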

keras and shape of input and losses

In all code examples for Keras I see that the input shape is passed directly, and it is assumed that the batch size is the first dimension, e.g.:
model = Sequential()
model.add(Dense(32, input_shape=(16,)))
# now the model will take as input arrays of shape (*, 16)
# and output arrays of shape (*, 32)
However when it comes to custom losses I see that the last axis (axis=-1) is used.
def loss(y_true, y_pred):
    return K.mean(K.square(y_pred - y_true), axis=-1)
When writing the loss, should one think of y_true and y_pred as batches or as single samples?
I'm assuming it's the former, but if that's the case I can't understand why it specifies the last axis.
In your custom loss function, you treat y_true and y_pred as batches which is also the case for the returned value of the function. If you only calculate one loss for your network, you could also get rid of the specified axis, since you only want a single value for your loss in the end.
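A quick way to see this in isolation (a standalone sketch using the tf.keras backend, not part of the original answer):
import numpy as np
import tensorflow.keras.backend as K

def loss(y_true, y_pred):
    # reduces over the last (feature) axis, leaving one value per sample
    return K.mean(K.square(y_pred - y_true), axis=-1)

y_true = K.constant(np.zeros((4, 16)))  # a batch of 4 samples with 16 features
y_pred = K.constant(np.ones((4, 16)))
print(K.eval(loss(y_true, y_pred)).shape)  # (4,): one loss value per sample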
But if you have multiple outputs in your network and you want to calculate the total loss, where each output might use its own loss functions, things begin to change.
Please check out: https://github.com/keras-team/keras/blob/master/keras/engine/training.py#L658
where the function to calculate the total loss, _prepare_total_loss, is called.
In this function, the following code is executed:
output_loss = loss_fn(y_true, y_pred, sample_weight=sample_weight)
which returns the loss for a single output of your network. This is also where your custom loss function gets called. If there are multiple outputs, all of them are calculated, weighted and added to the total loss: total_loss += loss_weight * output_loss
In the end, _prepare_total_loss returns K.mean(total_loss). So in the simplest case, if your custom loss function returned a vector with its length equal to the batch size, and there is only one output with loss in your network, the final loss will be the mean of the output-vector returned by your custom loss.
But in case of multiple outputs and multiple losses, you first want to calculate the loss vector of a batch for each output and therefore loss function, take their weighted sum and then calculate the final loss by taking the mean of the resulting vector.
If your loss functions returned a single loss value each instead of a batch-sized vector, the final loss would be the mean of several per-output mean losses, which differs from the mean loss over the whole batch.

Scikit-Learn GridSearch custom scoring function

I need to perform kernel PCA on a dataset of dimension (5000, 26421) to get a lower-dimensional representation. To choose the number of components (say k), I am reducing the data, reconstructing it in the original space, and computing the mean squared error between the reconstructed and original data for different values of k.
I came across sklearn's grid search functionality and want to use it for the above parameter estimation. Since there is no score function for kernel PCA, I have implemented a custom scoring function and am passing it to GridSearchCV.
from sklearn.decomposition.kernel_pca import KernelPCA
from sklearn.model_selection import GridSearchCV
import numpy as np
import math

def scorer(clf, X):
    Y1 = clf.inverse_transform(X)
    error = math.sqrt(np.mean((X - Y1)**2))
    return error
param_grid = [
    {'degree': [1, 10], 'kernel': ['poly'], 'n_components': [100, 400, 100]},
    {'gamma': [0.001, 0.0001], 'kernel': ['rbf'], 'n_components': [100, 400, 100]},
]
kpca = KernelPCA(fit_inverse_transform=True, n_jobs=30)
clf = GridSearchCV(estimator=kpca, param_grid=param_grid, scoring=scorer)
clf.fit(X)
However, it results in the below error:
/usr/lib64/python2.7/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X=array([[ 2., 2., 1., ..., 0., 0., 0.], ..., 0., 1., ..., 0., 0., 0.]], dtype=float32), Y=array([[-0.05904257, -0.02796719, 0.00919842, ..., 0.00148251, -0.00311711]], dtype=float32), precomputed=False, dtype=<type 'numpy.float32'>)
    117                              "for %d indexed." %
    118                              (X.shape[0], X.shape[1], Y.shape[0]))
    119     elif X.shape[1] != Y.shape[1]:
    120         raise ValueError("Incompatible dimension for X and Y matrices: "
    121                          "X.shape[1] == %d while Y.shape[1] == %d" % (
--> 122                          X.shape[1], Y.shape[1]))
        X.shape = (1667, 26421)
        Y.shape = (112, 100)
    123
    124     return X, Y
    125
    126
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 26421 while Y.shape[1] == 100
Can someone point out what exactly am I doing wrong?
The signature of the scoring function is incorrect. You only need to pass the true and predicted values. So this is how you declare your custom scoring function:
def my_scorer(y_true, y_predicted):
    error = math.sqrt(np.mean((y_true - y_predicted)**2))
    return error
Then you can use the make_scorer function in sklearn to pass it to GridSearchCV. Be sure to set the greater_is_better attribute accordingly:
Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object will sign-flip the outcome of the score_func.
I am assuming you are calculating an error, so this attribute should be set to False, since the lower the error, the better:
from sklearn.metrics import make_scorer
my_func = make_scorer(my_scorer, greater_is_better=False)
Then you pass it to GridSearchCV:
GridSearchCV(estimator=my_clf, param_grid=param_grid, scoring=my_func)
Where my_clf is your classifier.
One more thing: I don't think GridSearchCV is exactly what you are looking for. It basically expects data in the form of train and test splits, but here you only want to transform your input data. You should use a Pipeline in sklearn; look at the example mentioned here of combining PCA and GridSearchCV.
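A rough sketch of that Pipeline route, modelled on the scikit-learn PCA-plus-GridSearchCV example (the downstream classifier and parameter values here are illustrative, not taken from the question, and this route needs labels y, unlike the purely unsupervised reconstruction setup above):
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('reduce_dim', PCA()), ('clf', LogisticRegression())])
param_grid = {'reduce_dim__n_components': [100, 200, 400]}
search = GridSearchCV(pipe, param_grid, cv=3)
# search.fit(X, y)  # y is required for the downstream classifier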
