For my classification problem I want to use a loss function typically used for regression, such as Mean Absolute Error.
Consider that y_pred and y_true are one-hot encoded, but for MAE I need them as real numbers.
In this first case I get the error: ValueError: No gradients provided for any variable
import tensorflow as tf
from tensorflow.keras import backend as K

def AgeAccuracyRegularity(y_pred, y_true):
    mae_func = tf.keras.losses.MeanAbsoluteError()
    # argmax turns the one-hot vectors into class indices; +1 maps index to age
    y_pred_ages = K.cast(K.argmax(y_pred, axis=-1) + 1, dtype='float32')
    y_true_ages = K.cast(K.argmax(y_true, axis=-1) + 1, dtype='float32')
    res = mae_func(y_pred_ages, y_true_ages)
    return res
But if I manipulate the result in a meaningless way, like this,
def AgeAccuracyRegularity(y_pred, y_true):
    mae_func = tf.keras.losses.MeanAbsoluteError()
    y_pred_ages = K.cast(K.argmax(y_pred, axis=-1) + 1, dtype='float32')
    y_true_ages = K.cast(K.argmax(y_true, axis=-1) + 1, dtype='float32')
    res = mae_func(y_pred_ages, y_true_ages)
    mae = mae_func(y_pred, y_true)
    return res - mae + mae
it works. I checked the output of the classifier and both mae and res inside the custom loss function, and they have the same size and type.
For a classification problem you should use a classification loss such as categorical_crossentropy (or sparse_categorical_crossentropy for integer labels); MAE is intended for regression problems. The first version fails because argmax is not differentiable, so no gradient can flow from the loss back to the weights. The second version only appears to work because the -mae+mae term gives TensorFlow a differentiable path to the variables, while the contribution of res to the gradient is still zero.
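If you want to keep the MAE-on-ages idea in a differentiable form, one option is to replace argmax with the expected age under the predicted distribution. A minimal sketch, assuming y_pred is a softmax output and class i encodes age i+1 (matching the "+1" in the question):

import tensorflow as tf

def age_mae(y_true, y_pred):
    # ages 1..N as a float vector, matching the +1 convention above
    n_classes = tf.shape(y_pred)[-1]
    ages = tf.cast(tf.range(1, n_classes + 1), tf.float32)
    # expected age under the predicted distribution -- differentiable, unlike argmax
    pred_age = tf.reduce_sum(y_pred * ages, axis=-1)
    true_age = tf.reduce_sum(tf.cast(y_true, tf.float32) * ages, axis=-1)
    return tf.reduce_mean(tf.abs(pred_age - true_age))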
I'm currently working on a distributed federated learning infrastructure and am trying to implement it in PyTorch. For this I also need federated averaging, which averages the parameters retrieved from all the nodes and then passes them to the next training round.
The gathering of the parameters looks like this:
def RPC_get_parameters(data, model):
    """
    Get parameters from nodes
    """
    with torch.no_grad():
        for parameters in model.parameters():
            # store parameters in dict
            return {"parameters": parameters}
The averaging function which happens at the central server looks like this:
# stores results from RPC_get_parameters() in results
results = client.get_results(task_id=task.get("id"))

# averaging of returned parameters
global_sum = 0
global_count = 0
for output in results:
    global_sum += output["parameters"]
    global_count += len(global_sum)

averaged_parameters = global_sum / global_count

new_params = {'averaged_parameters': averaged_parameters}
Now my question is: how do you update all the parameters (tensors) in PyTorch from this? I tried a few things and they usually returned errors like ValueError: can't optimize a non-leaf Tensor when inserting new_params into the optimizer where model.parameters() usually goes, i.e. optimizerD = optim.SGD(new_params, lr=0.01, momentum=0.5). So how do I actually update the model so that it uses the averaged parameters?
Thank you!
https://github.com/simontkl/torch-vantage6/blob/fed_avg-w/local_dp/v6-ppsdg-py/master.py
I think the most convenient way to work with parameters (outside the SGD context) is using the state_dict of the model.
from collections import OrderedDict

new_params = OrderedDict()
n = len(clients)  # number of clients
for client_model in clients:
    sd = client_model.state_dict()  # current parameters of one client
    for k, v in sd.items():
        # accumulate a running average: each client contributes v / n
        new_params[k] = new_params.get(k, 0) + v / n
After that, new_params is a state_dict (you can load it using .load_state_dict) containing the average of the clients' weights.
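For example, to make the server's model use the averaged weights (global_model here is a hypothetical model with the same architecture as the clients):

global_model.load_state_dict(new_params)

From then on, global_model.parameters() can be passed to an optimizer as usual for the next training round.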
I am trying to do a seq2seq prediction. For this, I have an LSTM layer followed by a fully connected layer. I employ teacher forcing during the training phase and would like to skip it (I may be wrong here) during the testing phase. I have not found a direct way of doing this, so I have taken the approach shown below.
def forward(self, inputs, future=0, teacher_force_ratio=0.2, target=None):
    outputs = []
    for idx in range(future):
        rnn_out, _ = self.rnn(inputs)
        output = self.fc1(rnn_out)
        outputs.append(output)
        if self.teacher_training:
            # feed the ground truth back in with probability teacher_force_ratio
            new_input = output if np.random.random() >= teacher_force_ratio else target[idx]
        else:
            new_input = output
        inputs = new_input
    return torch.stack(outputs)
I use a bool variable teacher_training to check if Teacher training is needed or not. Is this correct? If yes, is there a better way to do this? Thanks.
In PyTorch, every class that extends nn.Module has a boolean attribute called training. So instead of a custom teacher_training flag we can simply use self.training. This attribute is set automatically depending on the model's mode (model.train() and model.eval()).
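A minimal sketch of the forward pass from the question rewritten to use the built-in flag (self.rnn, self.fc1, and the decoding scheme are taken from the question as-is):

import numpy as np
import torch

def forward(self, inputs, future=0, teacher_force_ratio=0.2, target=None):
    outputs = []
    for idx in range(future):
        rnn_out, _ = self.rnn(inputs)
        output = self.fc1(rnn_out)
        outputs.append(output)
        # self.training is True after model.train() and False after model.eval()
        if self.training and np.random.random() < teacher_force_ratio:
            inputs = target[idx]  # teacher forcing
        else:
            inputs = output       # free-running decoding
    return torch.stack(outputs)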
I did a lot of searching and am still unable to figure out how to write a custom loss function with multiple outputs that interact.
I have a neural network defined as:
def NeuralNetwork():
    inLayer = Input((2,))
    layers = [Dense(numNeuronsPerLayer, activation='relu')(inLayer)]
    for i in range(10):
        hiddenLyr = Dense(5, activation='tanh', name="layer" + str(i + 1))(layers[i])
        layers.append(hiddenLyr)
    out_u = Dense(1, activation='linear', name="out_u")(layers[i])
    out_k = Dense(1, activation='linear', name="out_k")(layers[i])
    outLayer = Concatenate(axis=-1)([out_u, out_k])
    model = Model(inputs=[inLayer], outputs=outLayer)
    return model
I am now trying to define a custom loss function as follows:
def computeLoss(true, prediction):
    u_pred = prediction[:, 0]
    k_pred = prediction[:, 1]
    loss = f(u_pred) * k_pred
    return loss
Where f(u_pred) is some manipulation of u_pred. The code seems to work correctly and produce correct results when I use only u_pred (i.e., a single output from the neural network). However, the moment I try to include the second output k_pred and slice my prediction tensor in the loss function, I start getting wrong results. I feel I am doing something wrong in handling multiple outputs in Keras but am not sure where my mistake lies. Any help on how I may proceed is welcome.
I figured out that plain indexing (i.e., [:,0] or [:,1]) did not slice the prediction tensor correctly in my case; the operation did not seem to work. Instead, use the built-in TensorFlow function tf.split, as detailed in https://www.tensorflow.org/api_docs/python/tf/split?version=stable
So the code that worked was:
(u_pred, k_pred) = tf.split(prediction, num_or_size_splits=2, axis=1)
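Put back into the loss function, that gives the following sketch (f is still the problem-specific manipulation of u_pred from the question):

import tensorflow as tf

def computeLoss(true, prediction):
    # split the concatenated (u, k) output back into its two columns
    u_pred, k_pred = tf.split(prediction, num_or_size_splits=2, axis=1)
    return f(u_pred) * k_pred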
I am working on a denoising autoencoder for audio, feeding raw time-series audio to the network and receiving time-series audio back as output. The mean_squared_error loss returns values of shape (batch_size, audio_sequence_length), which (I hope I understood correctly) Keras processes further internally, computing the mean over time bins and batches to reach the final scalar loss used for backprop.
My current efforts are focused on creating a custom loss function that uses signal power instead of the error of individual samples, returning values of shape (batch_size,). The model compiles nicely but returns only NaN loss at training time. Trying to predict anything with such a model results in output vectors consisting of NaN as well.
This is the loss function:
def SI_SNR(yTrue, yPred):
    yTarget = K.batch_dot(yTrue, yPred, axes=0)
    yTarget = K.batch_dot(yTrue, yTarget, axes=None)
    yNorm = K.batch_dot(yTrue, yTrue, axes=0)
    yTarget = yTarget / yNorm
    eNoise = yPred - yTarget
    losses = -(10. * K.log(K.batch_dot(yTarget, yTarget, axes=0) /
                           K.batch_dot(eNoise, eNoise, axes=0)) / K.log(10.))
    return K.reshape(losses, [-1])
When using the function on actual numbers (either a subset of the training data or randomly filled arrays) I do get non-NaN results:
x=K.variable(np.random.rand(8,1024,1))
y=K.variable(np.random.rand(8,1024,1))
K.eval(SI_SNR(y,x))
Is the training behavior due to the shape of the loss or is there perhaps some other problem with the internal structure of the loss function?
To answer my own question: the output shape of the cost was not the issue. I tested this hypothesis using a different (dummy) loss:
def meanMSE(yTrue, yPred):
    return K.mean(mean_squared_error(yTrue, yPred), axis=1)
If yPred is a vector of zeros, the previous cost function runs into division-by-zero issues. Using K.clip and modifying the function slightly resolves the problem:
def SDR(yTrue, yPred):
    return (K.batch_dot(yPred, yPred, axes=1) /
            K.clip(K.square(K.batch_dot(yPred, yTrue, axes=1)), 1e-7, 1e12))
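As a usage sketch (model stands for the autoencoder from the question), the clipped loss can be passed to compile like any built-in loss:

model.compile(optimizer='adam', loss=SDR)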
To be clear, by weights I mean the entries in the matrices (Ws) of the affine transformation in a node of a neural net.
I start with categorical_crossentropy as my loss function, and I want to add an additional term to penalize negative weights.
To this end I want to introduce a term of the form
theano.tensor.sum(theano.tensor.exp(-10 * ws))
Where "ws" are the weights.
If I follow the source code of categorical_crossentropy:
if true_dist.ndim == coding_dist.ndim:
    return -tensor.sum(true_dist * tensor.log(coding_dist),
                       axis=coding_dist.ndim - 1)
elif true_dist.ndim == coding_dist.ndim - 1:
    return crossentropy_categorical_1hot(coding_dist, true_dist)
else:
    raise TypeError('rank mismatch between coding and true distributions')
it seems I should update the return of crossentropy_categorical_1hot (third line from the bottom) to read
crossentropy_categorical_1hot(coding_dist, true_dist) + theano.tensor.sum(theano.tensor.exp(-10 * ws))
and change the declaration of the function to
my_categorical_crossentropy(coding_dist, true_dist, ws)
When calling my_categorical_crossentropy I write
loss = my_categorical_crossentropy(net_output, true_output, l_layers[1].W)
with, for a start, l_layers[1].W being the weights coming from the first layer of my neural net.
With those updates, I go on writing:
loss = aggregate(loss, mode='mean')
updates = sgd(loss, all_params, learning_rate=0.005)
train = theano.function([l_input.input_var, true_output], loss, updates=updates)
[...]
This passes the compiler and everything runs smoothly; the training of the network completes. However, for some reason the additional term theano.tensor.sum(theano.tensor.exp(-10 * ws)) is ignored: it seems not to affect the loss value.
I tried looking into the Theano documentation, but so far I could not figure out what might be wrong. The weights l_layers[1].W are shared variables, so I could not pass them as
train = theano.function([l_input.input_var, true_output, l_layers[1].W], loss, updates=updates)
Any comments are welcome. Thanks!
Solution
Although I did not find out why my original approach did not work, adding the penalty term outside categorical_crossentropy, as suggested in the comments, did solve the problem:
loss = aggregate(categorical_crossentropy(net_output, true_output) + theano.tensor.sum(theano.tensor.exp(-10 * l_layers[1].W)), mode='mean')
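For completeness, a sketch of the full training setup with the penalty added outside the crossentropy, assuming the Lasagne-style helpers (aggregate, sgd, categorical_crossentropy) and the variables (net_output, true_output, l_layers, all_params, l_input) from the question:

import theano
import theano.tensor as T
from lasagne.objectives import categorical_crossentropy, aggregate
from lasagne.updates import sgd

# exp(-10 * w) is near zero for positive weights and grows rapidly for negative ones
penalty = T.sum(T.exp(-10 * l_layers[1].W))
loss = aggregate(categorical_crossentropy(net_output, true_output), mode='mean') + penalty
updates = sgd(loss, all_params, learning_rate=0.005)
train = theano.function([l_input.input_var, true_output], loss, updates=updates)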