Ways to limit the output of an NN regression problem to a certain range (i.e. I want my NN to always predict output values between -20 and +30) - keras

I am training an NN for a regression problem, so the output layer has a linear activation function. The NN output is supposed to be between -20 and 30. My NN performs well most of the time. However, sometimes it gives output greater than 30, which is not desirable for my system. Does anyone know an activation function that can provide this kind of restriction on the output, or any suggestions on modifying the linear activation function for my application?
I am using Keras with the TensorFlow backend for this application.

What you can do is activate your last layer with a sigmoid, so the result is between 0 and 1, and then add a custom layer that rescales it to the desired range:
def get_range(input, maxx, minn):
    # The sigmoid output lies in (0, 1); rescale it linearly to (minn, maxx)
    return (maxx - minn) * input + minn
and then add this to your network:
out = layers.Lambda(get_range, arguments={'maxx': 30, 'minn': -20})(sigmoid_output)
The output will then lie between 'minn' and 'maxx'.
UPDATE
If you want to clip the raw predictions without rescaling them, do this instead:
from keras import backend as K
def clip(input, maxx, minn):
    # Clamp every value into [minn, maxx]
    return K.clip(input, minn, maxx)
out = layers.Lambda(clip, arguments={'maxx': 30, 'minn': -20})(linear_output)  # applied to the raw linear output rather than a sigmoid
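For context, here is a minimal sketch of how such a Lambda layer could be wired into a small model (the input dimension, hidden sizes and variable names are placeholders, not taken from the question):
import tensorflow as tf
from tensorflow.keras import layers

def clip(input, maxx, minn):
    # Clamp every value into [minn, maxx]
    return tf.keras.backend.clip(input, minn, maxx)

inputs = layers.Input(shape=(10,))              # placeholder input dimension
x = layers.Dense(64, activation='relu')(inputs)
linear_output = layers.Dense(1)(x)              # linear regression head
out = layers.Lambda(clip, arguments={'maxx': 30, 'minn': -20})(linear_output)

model = tf.keras.Model(inputs, out)
model.compile(optimizer='adam', loss='mse')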

What you should do is normalize your target outputs to the range [-1, 1] or [0, 1], use a tanh (for [-1, 1]) or sigmoid (for [0, 1]) activation at the output, and train the model on the normalized data.
Then you can denormalize the predictions back to the original range at inference time.
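A minimal sketch of that round trip, assuming the [-20, 30] range from the question and a sigmoid output (the helper names are just illustrative):
import numpy as np

Y_MIN, Y_MAX = -20.0, 30.0

def normalize(y):
    # Map targets from [Y_MIN, Y_MAX] to [0, 1] so a sigmoid output can fit them
    return (y - Y_MIN) / (Y_MAX - Y_MIN)

def denormalize(y_norm):
    # Map sigmoid outputs in [0, 1] back to the original target range
    return y_norm * (Y_MAX - Y_MIN) + Y_MIN

y_train = np.array([-5.0, 12.0, 29.5])
y_train_norm = normalize(y_train)        # use these as training targets
preds_norm = np.array([0.1, 0.5, 0.99])  # what the sigmoid head would produce
preds = denormalize(preds_norm)          # back in [-20, 30]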

Related

Output of the model depends on the shape of the weights tensor

I want to train the model to sum its three inputs, so it is as simple as possible.
First, the weights are initialized randomly. This produces a bad error estimate (approx. 0.5).
Then I initialize the weights with zeros. There are two options:
the shape of the weights tensor is [1, 3]
the shape of the weights tensor is [3]
When I choose the 1st option the model still performs badly and can't learn this simple formula.
When I choose the 2nd option it works perfectly, with an error of 10e-12.
Why does the result depend on the shape of the weights? Why do I need to initialize the model with zeros to solve this simple problem?
import torch
from torch.nn import Sequential as Seq, Linear as Lin
from torch.optim.lr_scheduler import ReduceLROnPlateau

X = torch.rand((1024, 3))
y = (X[:,0] + X[:,1] + X[:,2])

m = Seq(Lin(3, 1, bias=False))
# 1st option
m[0].weight = torch.nn.parameter.Parameter(torch.tensor([[0, 0, 0]], dtype=torch.float))
# 2nd option
#m[0].weight = torch.nn.parameter.Parameter(torch.tensor([0, 0, 0], dtype=torch.float))

optim = torch.optim.SGD(m.parameters(), lr=10e-2)
scheduler = ReduceLROnPlateau(optim, 'min', factor=0.5, patience=20, verbose=True)
mse = torch.nn.MSELoss()

for epoch in range(500):
    optim.zero_grad()
    out = m(X)
    loss = mse(out, y)
    loss.backward()
    optim.step()
    if epoch % 20 == 0:
        print(loss.item())
    scheduler.step(loss)
The first option doesn't learn because it runs into broadcasting: out.shape is (1024, 1) while the corresponding target y has shape (1024,). MSELoss, as expected, computes the mean of the tensor (out - y)^2, which in this case has shape (1024, 1024), clearly the wrong objective for this task. With the second option, the tensor (out - y)^2 has shape (1024,), and its mean corresponds to the actual MSE. The default approach, without explicitly changing the weight shape (through option 1 or 2), would work if you set the target shape to (1024, 1), for example with y = y.unsqueeze(-1) right after defining y.
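A minimal sketch of that fix on top of the code above, keeping the default [1, 3] weight shape and just reshaping the target (hyperparameters copied loosely from the question):
import torch
from torch.nn import Sequential as Seq, Linear as Lin

X = torch.rand((1024, 3))
y = (X[:, 0] + X[:, 1] + X[:, 2]).unsqueeze(-1)  # shape (1024, 1), matching the model output

m = Seq(Lin(3, 1, bias=False))                   # default weight shape [1, 3]
optim = torch.optim.SGD(m.parameters(), lr=1e-1)
mse = torch.nn.MSELoss()

for epoch in range(500):
    optim.zero_grad()
    loss = mse(m(X), y)                          # (1024, 1) vs (1024, 1): no broadcasting surprise
    loss.backward()
    optim.step()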

Can't use combination of gradiants for multiple losses functions of a multi-output keras model

I am doing time-series forecasting in Keras with a CNN and an EHR dataset. The goal is to predict both which molecule to give to the patient and the time until the next patient visit. I have to implement a bi-objective gradient descent based on this paper. The algorithm to implement is described there (end of page 7, beginning of page 8).
The model I chose is this one:
With time series of length 3 as input (corresponding to 3 consecutive visits for a client)
And 2 outputs:
the ATC code (the code of the molecule to predict)
the time to wait until the next visit (in categories of months: 0, 1, 2, 3, 4 for >=4)
Both outputs use the SparseCategoricalCrossentropy loss function.
When I start to implement the first operation, gs - gl, I get this error:
Some values in my gradients are None and I don't know why. My optimizer is defined as follows: optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3) when compiling my model.
Also, when I try some operations on the gradients to see how things work, I have another problem: only one output is taken into account, which will pose a problem later because I have to consider each loss function separately:
With this code, I get this output message: WARNING:tensorflow:Gradients do not exist for variables ['outputWaitTime/kernel:0', 'outputWaitTime/bias:0'] when minimizing the loss.
EPOCHS = 1
for epoch in range(EPOCHS):
    with tf.GradientTape() as ATCTape, tf.GradientTape() as WTTape:
        predictions = model(xTrain, training=False)
        ATCLoss = loss(yTrain[:,:,0], predictions[ATC_CODE])
        WTLoss = loss(yTrain[:,:,1], predictions[WAIT_TIME])
    ATCGrads = ATCTape.gradient(ATCLoss, model.trainable_variables)
    WTGrads = WTTape.gradient(WTLoss, model.trainable_variables)
    grads = ATCGrads + WTGrads
    model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
With this code, it's okay, but both losses are combined into one, whereas I need to consider both losses separately
EPOCHS = 1
for epoch in range(EPOCHS):
    with tf.GradientTape() as tape:
        predictions = model(xTrain, training=False)
        ATCLoss = loss(yTrain[:,:,0], predictions[ATC_CODE])
        WTLoss = loss(yTrain[:,:,1], predictions[WAIT_TIME])
        lossValue = ATCLoss + WTLoss
    grads = tape.gradient(lossValue, model.trainable_variables)
    model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
I need help to understand why I have all of those problems.
The notebook containing all the code is here: https://colab.research.google.com/drive/1b6UorAAEddNKFQCxaK1Wsuj09U645KhU?usp=sharing
The implementation begins in the part Model Creation
The reason you get None in ATCGrads and WTGrads is that each gradient's corresponding loss is taken w.r.t. a different output, outputATC or outputWaitTime. If an output's value is not used to calculate a loss, then there are no gradients w.r.t. that output, hence you get None gradients for that output layer. That is also why you get WARNING:tensorflow:Gradients do not exist for variables ['outputWaitTime/kernel:0', 'outputWaitTime/bias:0'] when minimizing the loss: you don't have those gradients w.r.t. each individual loss. If you combine the losses into one, then both outputs are used to calculate the loss, and there is no WARNING.
So if you want to do a list element-wise subtraction, you could first convert None to 0. before subtracting. You cannot use tf.math.subtract(gs, gl) directly because it requires the shapes of all inputs to match, so:
import tensorflow as tf
gs = [tf.constant([1., 2.]), tf.constant(3.), None]
gl = [tf.constant([3., 4.]), None, tf.constant(4.)]
to_zero = lambda i : 0. if i is None else i
gs = list(map(to_zero, gs))
gl = list(map(to_zero, gl))
sub = [s_i - l_i for s_i, l_i in zip(gs, gl)]
print(sub)
Output:
[<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-2., -2.], dtype=float32)>,
<tf.Tensor: shape=(), dtype=float32, numpy=3.0>,
<tf.Tensor: shape=(), dtype=float32, numpy=-4.0>]
Also, beware that tape.gradient() returns a list or nested structure of Tensors (or IndexedSlices, or None), one for each element in sources, with the same structure as sources. Adding two lists in Python with [1, 2] + [3, 4] will not give you [4, 6] as it would with NumPy arrays; instead it concatenates the two lists and gives you [1, 2, 3, 4].
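If the goal is to actually sum the two gradient lists element-wise before apply_gradients (rather than concatenating them with +), a sketch continuing from the question's first training loop could look like this, using the same None-to-zero idea:
# ATCGrads and WTGrads come from the question's first loop and are aligned
# with model.trainable_variables; entries may be None where a loss does not
# depend on a variable.
combined = []
for g_atc, g_wt, var in zip(ATCGrads, WTGrads, model.trainable_variables):
    g_atc = tf.zeros_like(var) if g_atc is None else g_atc  # missing gradient -> zeros
    g_wt = tf.zeros_like(var) if g_wt is None else g_wt
    combined.append(g_atc + g_wt)                           # element-wise sum, not list concatenation
model.optimizer.apply_gradients(zip(combined, model.trainable_variables))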

The derivative of Softmax outputs really large shapes

I am creating a basic neural network, and also my first, for handwritten digit recognition, without any framework (like TensorFlow or PyTorch), using the backpropagation algorithm.
My NN has 784 inputs and 10 outputs, so for the last layer I have to use softmax.
Because of some memory errors, my images currently have shape (300, 784) and my labels shape (300, 10).
After that I compute the loss with categorical cross-entropy.
Now we are getting to my problem. In backpropagation, I need to manually compute the first derivative of the activation function. I am doing it like this:
dAl = -(np.divide(Y, Al) - np.divide(1 - Y, 1 - Al))
#Y = test labels
#Al - Activation value from my last layer
And after that my backpropagation can start; the last layer is softmax.
def SoftmaxDerivative(dA, Z):
    # Z is the output of np.dot(A_prev, W) + b,
    # where A_prev is the activation value from the previous layer,
    # W is the weight and b is the bias
    # dA is the derivative of the activation function value
    x = activation_functions.softmax(dA)
    s = x.reshape(-1,1)
    dZ = np.diagflat(s) - np.dot(s, s.T)
    return dZ
1. Is this function working properly?
In the end, I would like to compute the derivatives of the weights and biases, so I am using this:
dW = (1/m)*np.dot(dZ, A_prev.T)
#m is A_prev.shape[1] -> 10
db = (1/m)*np.sum(dZ, axis = 1, keepdims = True)
BUT it fails on dW, because dZ.shape is (3000, 3000) (compared to A_prev.shape, which is (300, 10)).
From this I assume that there are only 3 possible outcomes:
My Softmax backward is wrong
dW is wrong
I have some other bug completely somewhere else
Any help would be really appreciated!
I faced the same problem recently. I'm not sure but maybe this question will help you: Softmax derivative in NumPy approaches 0 (implementation)
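As an aside (not from the linked question), the usual way to avoid the huge Jacobian is to combine softmax with categorical cross-entropy, in which case the gradient with respect to the pre-activations simplifies to the softmax output minus the one-hot labels. A minimal sketch under the question's convention Z = np.dot(A_prev, W) + b, with 300 samples and 10 classes:
import numpy as np

def softmax(Z):
    # Row-wise softmax with max-subtraction for numerical stability
    e = np.exp(Z - Z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Z: pre-activations of the last layer, shape (300, 10)
# Y: one-hot labels, shape (300, 10)
# A_prev: activations of the previous layer, shape (300, n_prev)
def softmax_crossentropy_backward(Z, Y, A_prev):
    m = Z.shape[0]                                     # number of samples in the batch
    dZ = softmax(Z) - Y                                # shape (300, 10), no (3000, 3000) Jacobian needed
    dW = (1 / m) * np.dot(A_prev.T, dZ)                # shape (n_prev, 10)
    db = (1 / m) * np.sum(dZ, axis=0, keepdims=True)   # shape (1, 10)
    return dZ, dW, db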

Pytorch Categorical Cross Entropy loss function behaviour

I have a question regarding the computation made by the categorical cross-entropy loss in PyTorch.
I have made this simple code snippet, and because I use the argmax of the output tensor as the targets, I cannot understand why the loss is still high.
import torch
import torch.nn as nn
ce_loss = nn.CrossEntropyLoss()
output = torch.randn(3, 5, requires_grad=True)
targets = torch.argmax(output, dim=1)
loss = ce_loss(output, targets)
print(loss)
Thanks for the help understanding it.
Best regards
Jerome
So here is sample data from your code, with the output, labels and loss having the following values:
outputs = tensor([[ 0.5968, -0.8249,  1.5018,  2.7888, -0.6125],
                  [-1.1534, -0.4921,  1.0688,  0.2241, -0.0257],
                  [ 0.3747,  0.8957,  0.0816,  0.0745,  0.2695]], requires_grad=True)
labels = tensor([3, 2, 1])
loss = tensor(0.7354, grad_fn=<NllLossBackward>)
So let's examine the values.
If you compute the softmax of your logits (outputs), using something like torch.softmax(outputs, dim=1), you will get
probs = tensor([[0.0771, 0.0186, 0.1907, 0.6906, 0.0230],
                [0.0520, 0.1008, 0.4801, 0.2063, 0.1607],
                [0.1972, 0.3321, 0.1471, 0.1461, 0.1775]], grad_fn=<SoftmaxBackward>)
So these are your prediction probabilities.
Now cross-entropy loss is nothing but a combination of softmax and negative log-likelihood loss. Hence, your loss can simply be computed as
loss = (torch.log(1/probs[0,3]) + torch.log(1/probs[1,2]) + torch.log(1/probs[2,1])) / 3
which is the average of the negative log of the probabilities of your true labels. The above expression evaluates to 0.7354, which is the value returned by the nn.CrossEntropyLoss module.
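A short sketch to double-check that equivalence (log-softmax plus negative log-likelihood versus CrossEntropyLoss) on the same kind of tensors, just as a sanity check:
import torch
import torch.nn.functional as F

output = torch.randn(3, 5, requires_grad=True)
targets = torch.argmax(output, dim=1)

# nn.CrossEntropyLoss == log_softmax followed by negative log-likelihood
loss_ce = F.cross_entropy(output, targets)
loss_manual = F.nll_loss(F.log_softmax(output, dim=1), targets)

print(torch.allclose(loss_ce, loss_manual))  # True
# The loss is not zero even though targets = argmax(output): for random logits
# the softmax probability of the argmax class is well below 1, so -log(p) > 0.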

tflearn DNN gives zero loss

I am using pandas to extract my data. To get an idea of my data I replicated an example dataset...
data = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
which yields a dataset of shape=(100,4)...
A B C D
0 75 38 81 58
1 36 92 80 79
2 22 40 19 3
... ...
I am using tflearn, so I will need a target label as well. I created a target label by extracting one of the columns from data and then dropping it from the data variable (I also converted everything to numpy arrays)...
# Target label used for training
labels = np.array(data['A'].values, dtype=np.float32)
# Reshape target label from (100,) to (100, 1)
labels = np.reshape(labels, (-1, 1))
# Data for training minus the target label.
data = np.array(data.drop('A', axis=1).values, dtype=np.float32)
Then I take the data and the labels and feed it into the DNN...
# Deep Neural Network.
net = tflearn.input_data(shape=[None, 3])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 1, activation='softmax')
net = tflearn.regression(net)
# Define model.
model = tflearn.DNN(net)
model.fit(data, labels, n_epoch=10, batch_size=16, show_metric=True)
This seems like it should work, but the output I get is as follows...
Notice that the loss remains at 0, so I am definitely doing something wrong. I don't really know what form my data should be in. How can I get my training to work?
Your actual targets are in the range 0 to 100, while the softmax activation in the outermost layer outputs values in the range [0, 1]. You need to fix that. Also, the default loss for tflearn.regression is categorical cross-entropy, which is meant for classification problems and makes no sense in your scenario; you should try an L2 loss. The reason you are getting zero loss in this setting is that a softmax over a single output unit always produces 1, so if you plug that into the categorical cross-entropy formula, loss = -sum_i t[i] * log(o[i]), where t[i] denotes the actual probabilities (which don't make sense in your problem) and o[i] the predicted probabilities, the loss is indeed zero because log(1) = 0 regardless of the targets.
Here is more reasoning about why the default choice of loss function is not suitable for your case.
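A minimal sketch of those two fixes on the question's network, switching the head to a linear activation and the loss to mean square (tflearn exposes this objective as 'mean_square'; everything else is kept from the question):
import tflearn

# Deep Neural Network with a regression-appropriate head and loss.
net = tflearn.input_data(shape=[None, 3])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 1, activation='linear')  # no softmax on a single regression output
net = tflearn.regression(net, loss='mean_square')           # L2 loss instead of categorical cross-entropy

model = tflearn.DNN(net)
model.fit(data, labels, n_epoch=10, batch_size=16, show_metric=True)  # data, labels as prepared above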

Resources