Why doesn't this simple neural network converge for XOR? - python-3.x

The code for the network below works okay, but it's too slow. This site implies that the network should get 99% accuracy after 100 epochs with a learning rate of 0.2, while my network never gets past 97% even after 1900 epochs.
Epoch 0, Inputs [0 0], Outputs [-0.83054376], Targets [0]
Epoch 100, Inputs [0 1], Outputs [ 0.72563824], Targets [1]
Epoch 200, Inputs [1 0], Outputs [ 0.87570863], Targets [1]
Epoch 300, Inputs [0 1], Outputs [ 0.90996706], Targets [1]
Epoch 400, Inputs [1 1], Outputs [ 0.00204791], Targets [0]
Epoch 500, Inputs [0 1], Outputs [ 0.93396672], Targets [1]
Epoch 600, Inputs [0 0], Outputs [ 0.00006375], Targets [0]
Epoch 700, Inputs [0 1], Outputs [ 0.94778227], Targets [1]
Epoch 800, Inputs [1 1], Outputs [-0.00149935], Targets [0]
Epoch 900, Inputs [0 0], Outputs [-0.00122716], Targets [0]
Epoch 1000, Inputs [0 0], Outputs [ 0.00457281], Targets [0]
Epoch 1100, Inputs [0 1], Outputs [ 0.95921556], Targets [1]
Epoch 1200, Inputs [0 1], Outputs [ 0.96001748], Targets [1]
Epoch 1300, Inputs [1 0], Outputs [ 0.96071742], Targets [1]
Epoch 1400, Inputs [1 1], Outputs [ 0.00110912], Targets [0]
Epoch 1500, Inputs [0 0], Outputs [-0.00012382], Targets [0]
Epoch 1600, Inputs [1 0], Outputs [ 0.9640324], Targets [1]
Epoch 1700, Inputs [1 0], Outputs [ 0.96431516], Targets [1]
Epoch 1800, Inputs [0 1], Outputs [ 0.97004973], Targets [1]
Epoch 1900, Inputs [1 0], Outputs [ 0.96616225], Targets [1]
The dataset I'm using is:
0 0 0
1 0 1
0 1 1
1 1 1
The training set is read using a function in a helper file, but that isn't relevant to the network.
import numpy as np
import helper

FILE_NAME = 'data.txt'
EPOCHS = 2000
TESTING_FREQ = 5
LEARNING_RATE = 0.2

INPUT_SIZE = 2
HIDDEN_LAYERS = [5]
OUTPUT_SIZE = 1


class Classifier:
    def __init__(self, layer_sizes):
        np.set_printoptions(suppress=True)

        self.activ = helper.tanh
        self.dactiv = helper.dtanh

        network = list()
        for i in range(1, len(layer_sizes)):
            layer = dict()
            layer['weights'] = np.random.randn(layer_sizes[i], layer_sizes[i-1])
            layer['biases'] = np.random.randn(layer_sizes[i])
            network.append(layer)

        self.network = network

    def forward_propagate(self, x):
        for i in range(0, len(self.network)):
            self.network[i]['outputs'] = self.network[i]['weights'].dot(x) + self.network[i]['biases']
            if i != len(self.network)-1:
                self.network[i]['outputs'] = x = self.activ(self.network[i]['outputs'])
            else:
                self.network[i]['outputs'] = self.activ(self.network[i]['outputs'])
        return self.network[-1]['outputs']

    def backpropagate_error(self, x, targets):
        self.forward_propagate(x)
        self.network[-1]['deltas'] = (self.network[-1]['outputs'] - targets) * self.dactiv(self.network[-1]['outputs'])
        for i in reversed(range(len(self.network)-1)):
            self.network[i]['deltas'] = self.network[i+1]['deltas'].dot(self.network[i+1]['weights'] * self.dactiv(self.network[i]['outputs']))

    def adjust_weights(self, inputs, learning_rate):
        self.network[0]['weights'] -= learning_rate * np.atleast_2d(self.network[0]['deltas']).T.dot(np.atleast_2d(inputs))
        self.network[0]['biases'] -= learning_rate * self.network[0]['deltas']
        for i in range(1, len(self.network)):
            self.network[i]['weights'] -= learning_rate * np.atleast_2d(self.network[i]['deltas']).T.dot(np.atleast_2d(self.network[i-1]['outputs']))
            self.network[i]['biases'] -= learning_rate * self.network[i]['deltas']

    def train(self, inputs, targets, epochs, testfreq, lrate):
        for epoch in range(epochs):
            i = np.random.randint(0, len(inputs))
            if epoch % testfreq == 0:
                predictions = self.forward_propagate(inputs[i])
                print('Epoch %s, Inputs %s, Outputs %s, Targets %s' % (epoch, inputs[i], predictions, targets[i]))
            self.backpropagate_error(inputs[i], targets[i])
            self.adjust_weights(inputs[i], lrate)


inputs, outputs = helper.readInput(FILE_NAME, INPUT_SIZE, OUTPUT_SIZE)
print('Input data: {0}'.format(inputs))
print('Output targets: {0}\n'.format(outputs))

np.random.seed(1)
nn = Classifier([INPUT_SIZE] + HIDDEN_LAYERS + [OUTPUT_SIZE])
nn.train(inputs, outputs, EPOCHS, TESTING_FREQ, LEARNING_RATE)

The main bug is that you are doing the forward pass only 20% of the time, i.e. when epoch % testfreq == 0:
for epoch in range(epochs):
    i = np.random.randint(0, len(inputs))
    if epoch % testfreq == 0:
        predictions = self.forward_propagate(inputs[i])
        print('Epoch %s, Inputs %s, Outputs %s, Targets %s' % (epoch, inputs[i], predictions, targets[i]))
    self.backpropagate_error(inputs[i], targets[i])
    self.adjust_weights(inputs[i], lrate)
When I take predictions = self.forward_propagate(inputs[i]) out of the if, I get much better results faster:
Epoch 100, Inputs [0 1], Outputs [ 0.80317447], Targets 1
Epoch 105, Inputs [1 1], Outputs [ 0.96340466], Targets 1
Epoch 110, Inputs [1 1], Outputs [ 0.96057278], Targets 1
Epoch 115, Inputs [1 0], Outputs [ 0.87960599], Targets 1
Epoch 120, Inputs [1 1], Outputs [ 0.97725825], Targets 1
Epoch 125, Inputs [1 0], Outputs [ 0.89433666], Targets 1
Epoch 130, Inputs [0 0], Outputs [ 0.03539024], Targets 0
Epoch 135, Inputs [0 1], Outputs [ 0.92888141], Targets 1
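For reference, here is a minimal sketch of the train loop with the forward pass hoisted out of the if, which is the change described above (nothing else in the class needs to change):
def train(self, inputs, targets, epochs, testfreq, lrate):
    for epoch in range(epochs):
        i = np.random.randint(0, len(inputs))
        # forward pass on every iteration, not only when logging
        predictions = self.forward_propagate(inputs[i])
        if epoch % testfreq == 0:
            print('Epoch %s, Inputs %s, Outputs %s, Targets %s' % (epoch, inputs[i], predictions, targets[i]))
        self.backpropagate_error(inputs[i], targets[i])
        self.adjust_weights(inputs[i], lrate)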
Also, note that the term epoch usually means a single pass over all of your training data (4 samples in your case), so you are actually doing 4 times fewer epochs.
Update
I didn't pay enough attention to the details and, as a result, missed a few subtle but important points:
the training data in the question represents OR, not XOR, so my results above are for learning the OR operation;
the backward pass executes a forward pass as well (so it's not a bug, rather a surprising implementation detail).
Knowing this, I've updated the data and checked the script once again. Running the training for 10000 iterations gave ~0.001 average error, so the model is learning, just not as fast as it could.
A simple neural network (without an embedded normalization mechanism) is pretty sensitive to particular hyperparameters, such as the weight initialization and the learning rate. I tried various values manually, and here's what I got:
# slightly bigger learning rate
LEARNING_RATE = 0.3
...
# slightly bigger init variation of weights
layer['weights'] = np.random.randn(layer_sizes[i], layer_sizes[i-1]) * 2.0
This gives the following performance:
...
Epoch 960, Inputs [1 1], Outputs [ 0.01392014], Targets 0
Epoch 970, Inputs [0 0], Outputs [ 0.04342895], Targets 0
Epoch 980, Inputs [1 0], Outputs [ 0.96471654], Targets 1
Epoch 990, Inputs [1 1], Outputs [ 0.00084511], Targets 0
Epoch 1000, Inputs [0 0], Outputs [ 0.01585915], Targets 0
Epoch 1010, Inputs [1 1], Outputs [-0.004097], Targets 0
Epoch 1020, Inputs [1 1], Outputs [ 0.01898956], Targets 0
Epoch 1030, Inputs [0 0], Outputs [ 0.01254217], Targets 0
Epoch 1040, Inputs [1 1], Outputs [ 0.01429213], Targets 0
Epoch 1050, Inputs [0 1], Outputs [ 0.98293925], Targets 1
...
Epoch 1920, Inputs [1 1], Outputs [-0.00043072], Targets 0
Epoch 1930, Inputs [0 1], Outputs [ 0.98544288], Targets 1
Epoch 1940, Inputs [1 0], Outputs [ 0.97682002], Targets 1
Epoch 1950, Inputs [1 0], Outputs [ 0.97684186], Targets 1
Epoch 1960, Inputs [0 0], Outputs [-0.00141565], Targets 0
Epoch 1970, Inputs [0 0], Outputs [-0.00097559], Targets 0
Epoch 1980, Inputs [0 1], Outputs [ 0.98548381], Targets 1
Epoch 1990, Inputs [1 0], Outputs [ 0.97721286], Targets 1
The average accuracy is close to 98.5% after 1000 iterations and 99.1% after 2000 iterations. That's a bit slower than promised, but good enough. I'm sure it can be tuned further, but that's not the goal of this toy exercise. After all, tanh is not the best activation function, and classification problems are better solved with a cross-entropy loss (rather than an L2 loss). So I wouldn't worry too much about the performance of this particular network and would move on to logistic regression, which will definitely learn faster.
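As a side note on the cross-entropy remark, here is a hedged sketch (mine, not part of the code above) of why it helps: with a sigmoid output unit and binary cross-entropy, the output-layer delta reduces to output - target, so the dactiv factor that makes a saturated tanh + L2 output learn slowly simply disappears.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical drop-in for the last-layer delta in backpropagate_error,
# assuming the output layer used sigmoid + binary cross-entropy instead of tanh + L2:
# the derivative of the activation cancels against the loss gradient.
def output_delta_crossentropy(outputs, targets):
    return outputs - targets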

Related

TorchMetrics MultiClass accuracy for semantic segmentation

Let's use the following example for a semantic segmentation problem using TorchMetrics, where we predict tensors of shape (batch_size, classes, height, width):
# shape: (1, 3, 2, 2) => (batch_size, classes, height, width)
mask_multiclass_pred = torch.tensor(
    [[
        [
            # predictions for first class per pixel
            [0.85, 0.4],
            [0.4, 0.3],
        ],
        [
            # predictions for second class per pixel
            [0, 0.8],
            [0, 1],
        ],
        [
            # predictions for third class per pixel
            [0.8, 0.6],
            [0.7, 0.3],
        ]
    ]],
    dtype=torch.float32
)
Obviously, if we reduce this to the actual predicted classes as an index tensor:
reduced_pred = torch.argmax(mask_multiclass_pred, dim=1)
reduced_pred = torch.where(torch.amax(mask_multiclass_pred, dim=1) >= 0.5, reduced_pred, -1)
We get:
# shape: (1, 2, 2) => (batch_size, height, width)
tensor([[[0, 1],
         [2, 1]]])
...for the predictions.
Let's suppose the following is our ground truth for the labels, with shape (batch_size, height, width). The MulticlassAccuracy documentation suggests the targets should be (N, ...), i.e. batch_size plus extra dimensions, which in semantic segmentation are height & width:
# shape: (1, 2, 2) => (batch_size, height, width)
# as suggested by TorchMetrics, targets should be (N, ...) where ... are the extra dimensions, in this case 2D => class per pixel
mask_multiclass_gt = torch.tensor(
    [
        [
            # class 0, 1, or 2 per pixel => (2, 2) shape for mask
            [0, 1],
            [0, 2],
        ],
    ],
    dtype=torch.int
)
Now, if we calculate the MulticlassAccuracy:
seg_acc_cls = MulticlassAccuracy(num_classes=3, top_k=1, average="none", multidim_average="global")
seg_acc_cls(mask_multiclass_pred, mask_multiclass_gt)
We get the following result:
# shape (3,) => one accuracy per class (3 classes)
tensor([0.5000, 1.0000, 0.0000])
Why is this the output?
For example, shouldn't the first class be 0.75 instead of 0.5? Because for the default threshold of 0.5 our reduced predictions for the first class would be:
[0, 1] => [True, False]
[2, 1] => [False, False]
And obviously then we have 1 TP, 2 TN, and 1 FN. So we should have (1+2)/4?!
Likewise, the second class would be:
[0, 1] => [False, True]
[2, 1] => [False, True]
So again, we have 1 TP, but also 1 FP (lower right), and then 2 TN, which again should be (1 TP + 2TN)/4 = 0.75 and not 1.0.
For the 3rd class we would get these reduced predictions:
[0, 1] => [False, False]
[2, 1] => [True, False]
Which should be 0 TP (only lower right was True), 1 FP (lower left), and 2 TN should be 2/4 => 0.5.
Seems like you're having mostly a definitional issue here. Multiclass classification accuracy (at least as defined in this package) is simply the per-class recall, i.e. TP / (TP + FN). True negatives are not taken into account in the scoring, or else sparse classes would have their accuracy dominated almost entirely by true negatives and would be fairly insensitive to the actual performance (TP and FN). For this metric, false positives do not directly impact accuracy (although, since it is a multiclass and not a multilabel problem, each pixel can have only one class, meaning that a FP in one class indirectly causes a FN in another class, so FPs are still reflected in the score).
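To make the arithmetic concrete, here is a small sketch (mine, not from the package) that reproduces the [0.5, 1.0, 0.0] result by computing per-class recall, TP / (TP + FN), directly on the argmax predictions from above:
import torch

pred_classes = torch.tensor([[0, 1],
                             [2, 1]])  # argmax of mask_multiclass_pred per pixel
gt = torch.tensor([[0, 1],
                   [0, 2]])            # ground-truth class per pixel

for c in range(3):
    tp = ((pred_classes == c) & (gt == c)).sum()
    fn = ((pred_classes != c) & (gt == c)).sum()
    print(c, (tp / (tp + fn)).item())
# 0 -> 0.5 (one of the two class-0 pixels is recovered)
# 1 -> 1.0 (the single class-1 pixel is recovered)
# 2 -> 0.0 (the single class-2 pixel is missed)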
Personally, I find these multi-class / multi-label classification tasks, especially for segmentation, to be complex enough and the metric definitions variable enough that I generally just re-implement them myself so I know exactly what I'm calculating.

Confusion matrix as the metric for the optimization in a machine learning regression problem

I am training a model to segment an image to predict the degree of damage (ranging from 0: no damage, to 5: severe damage) for each pixel of an image. I have approached it this way:
def simple_loss(pred, mask):  # regression case
    pred = torch.sigmoid(pred)
    return (F.mse_loss(pred, mask, reduce='none')).mean()

def structure_loss(pred, mask):  # binary case: damaged vs undamaged
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduce='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    pred = torch.sigmoid(pred)
    inter = ((pred * mask) * weit).sum(dim=(2, 3))
    union = ((pred + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)

    return (wbce + wiou).mean()
The binary case yields IoU > 0.6, but the regression model is inaccurate. My dataset is imbalanced (100:1), with the majority of the pixels belonging to the undamaged class, so the optimization is driven towards accurate prediction of undamaged pixels.
The confusion matrix in the (1..5) region shows no correlation between the label and the predicted value.
I cannot balance the set, because the undamaged region next to the damaged area is informative to humans trained to examine the damage.
How can I modify the loss function to assign a higher cost to regression errors regarding the degree of damage?
We can encode irrelevant pixels with -1. Then modify the loss function to ignore irrelevant classes this way:
from keras import backend as K

def masked_mse(mask_value):
    def f(y_true, y_pred):
        mask_true = K.cast(K.not_equal(y_true, mask_value), K.floatx())
        masked_squared_error = K.square(mask_true * (y_true - y_pred))
        masked_mse = K.sum(masked_squared_error, axis=-1) / K.sum(mask_true, axis=-1)
        return masked_mse
    f.__name__ = 'Masked MSE (mask_value={})'.format(mask_value)
    return f

y_pred = K.constant([[ 1, 1, 1, 1],
                     [ 1, 1, 1, 3],
                     [ 1, 1, 1, 3],
                     [ 1, 1, 1, 3],
                     [ 1, 1, 1, 3],
                     [ 1, 1, 1, 3]])
y_true = K.constant([[ 1, 1, 1, 1],
                     [ 1, 1, 1, 1],
                     [-1, 1, 1, 1],
                     [-1,-1, 1, 1],
                     [-1,-1,-1, 1],
                     [-1,-1,-1,-1]])

true = K.eval(y_true)
pred = K.eval(y_pred)
loss = K.eval(masked_mse(-1)(y_true, y_pred))

for i in range(true.shape[0]):
    print(true[i], pred[i], loss[i], sep='\t')
# e.g. the 4th row: [-1. -1.  1.  1.]    [ 1.  1.  1.  3.]    2.0
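Since the question's training code is PyTorch, here is a minimal sketch of the same idea (mine, under the same assumption that irrelevant pixels are encoded as -1) as a masked MSE in PyTorch:
import torch

def masked_mse(pred, target, mask_value=-1.0):
    # 1 for pixels that should contribute to the loss, 0 for "irrelevant" (-1) pixels
    mask = (target != mask_value).float()
    squared_error = mask * (pred - target) ** 2
    # average only over the relevant pixels (clamp guards against an all-masked batch)
    return squared_error.sum() / mask.sum().clamp(min=1.0)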

How to adjust the batch data by the amount of labels in PyTorch

I have made n-grams / doc-ids for document classification,
import torch
from torch.utils.data import TensorDataset

def create_dataset(tok_docs, vocab, n):
    n_grams = []
    document_ids = []
    for i, doc in enumerate(tok_docs):
        for n_gram in [doc[0][i:i+n] for i in range(len(doc[0]) - 1)]:
            n_grams.append(n_gram)
            document_ids.append(i)
    return n_grams, document_ids

def create_pytorch_datasets(n_grams, doc_ids):
    n_grams_tensor = torch.tensor(n_grams)
    doc_ids_tensor = torch.tensor(doc_ids)
    full_dataset = TensorDataset(n_grams_tensor, doc_ids_tensor)
    return full_dataset
create_dataset returns a pair of (n_grams, document_ids) like below:
n_grams, doc_ids = create_dataset( ... )
train_data = create_pytorch_datasets(n_grams, doc_ids)
>>> train_data[0:100]
(tensor([[2076,  517,   54, 3647, 1182, 7086],
         [ 517,   54, 3647, 1182, 7086, 1149],
         ...
        ]),
 tensor([0, 0, 0, 0, 0, ..., 3, 3, 3]))

train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
The first tensor contains the n-grams and the second contains the doc_ids.
But, depending on the length of the documents, the amount of training data per label varies: if one document is very long, the training data contains a large number of pairs with its label.
I think this can cause overfitting, because the model will tend to classify inputs as belonging to the long documents.
So I want to draw input batches in such a way that the labels (doc_ids) are sampled uniformly. How can I fix this in the code above?
P.S.
Given train_data like below, I want each item to be sampled with a probability like this:
n-grams doc_ids
([1, 2, 3, 4], 1) ====> 0.33
([1, 3, 5, 7], 2) ====> 0.33
([2, 3, 4, 5], 3) ====> 0.33 * 0.25
([3, 5, 2, 5], 3) ====> 0.33 * 0.25
([6, 3, 4, 5], 3) ====> 0.33 * 0.25
([2, 3, 1, 5], 3) ====> 0.33 * 0.25
In PyTorch you can specify a sampler or a batch_sampler for the DataLoader to change how datapoints are sampled.
docs on the dataloader:
https://pytorch.org/docs/stable/data.html#data-loading-order-and-sampler
documentation on the sampler: https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler
For instance, you can use WeightedRandomSampler to assign a weight to every datapoint; the weight can be the inverse of the document length.
I would make the following modifications in the code:
from torch.utils.data import DataLoader, WeightedRandomSampler

def create_dataset(tok_docs, vocab, n):
    n_grams = []
    document_ids = []
    weights = []  # << list of weights for sampling
    for i, doc in enumerate(tok_docs):
        for n_gram in [doc[0][i:i+n] for i in range(len(doc[0]) - 1)]:
            n_grams.append(n_gram)
            document_ids.append(i)
            weights.append(1 / len(doc[0]))  # << n-grams of long documents are sampled less often
    return n_grams, document_ids, weights

sampler = WeightedRandomSampler(weights, 1, replacement=True)  # << create the sampler

train_loader = DataLoader(train_data, batch_size=batch_size,
                          shuffle=False, sampler=sampler)  # << include the sampler in the dataloader
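For intuition, here is a small self-contained sketch (with made-up data, not from the question) showing that these inverse-length weights make the sampler draw both documents about equally often, even though one of them contributes three times as many n-grams:
import torch
from collections import Counter
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

# hypothetical toy setup: doc 0 contributes 6 n-grams, doc 1 contributes 2
doc_ids = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])
data = torch.arange(len(doc_ids))
weights = [1 / 6] * 6 + [1 / 2] * 2  # inverse document length, as above

dataset = TensorDataset(data, doc_ids)
sampler = WeightedRandomSampler(weights, num_samples=1000, replacement=True)
loader = DataLoader(dataset, batch_size=50, sampler=sampler)

counts = Counter()
for _, labels in loader:
    counts.update(labels.tolist())
print(counts)  # roughly 500 draws for doc 0 and 500 for doc 1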

How do I mask a loss function in Keras with the TensorFlow backend?

I am trying to implement a sequence-to-sequence task using LSTM by Keras with the TensorFlow backend. The inputs are English sentences with variable lengths. To construct a dataset with 2-D shape [batch_number, max_sentence_length], I add EOF at the end of the line and pad each sentence with enough placeholders, e.g. #. And then each character in the sentence is transformed into a one-hot vector, so that the dataset has 3-D shape [batch_number, max_sentence_length, character_number]. After LSTM encoder and decoder layers, softmax cross-entropy between output and target is computed.
To eliminate the padding effect in model training, masking could be used on input and loss function. Mask input in Keras can be done by using layers.core.Masking. In TensorFlow, masking on loss function can be done as follows: custom masked loss function in TensorFlow.
However, I can't find a way to do this in Keras, since a user-defined loss function in Keras only accepts the parameters y_true and y_pred. So how can I pass the true sequence_lengths to the loss function and apply the mask?
Besides, I found a function _weighted_masked_objective(fn) in \keras\engine\training.py. Its definition is
Adds support for masking and sample-weighting to an objective function.
But it seems that the function can only accept fn(y_true, y_pred). Is there a way to use this function to solve my problem?
To be specific, I modify the example of Yu-Yang.
from keras.models import Model
from keras.layers import Input, Masking, LSTM, Dense, RepeatVector, TimeDistributed, Activation
import numpy as np
from numpy.random import seed as random_seed
random_seed(123)

max_sentence_length = 5
character_number = 3 # valid character 'a, b' and placeholder '#'

input_tensor = Input(shape=(max_sentence_length, character_number))
masked_input = Masking(mask_value=0)(input_tensor)
encoder_output = LSTM(10, return_sequences=False)(masked_input)
repeat_output = RepeatVector(max_sentence_length)(encoder_output)
decoder_output = LSTM(10, return_sequences=True)(repeat_output)
output = Dense(3, activation='softmax')(decoder_output)

model = Model(input_tensor, output)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

X = np.array([[[0, 0, 0], [0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]],
              [[0, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]]])
y_true = np.array([[[0, 0, 1], [0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 0]], # the batch is ['##abb','#babb'], padding '#'
                   [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]]])

y_pred = model.predict(X)
print('y_pred:', y_pred)
print('y_true:', y_true)
print('model.evaluate:', model.evaluate(X, y_true))

# See if the loss computed by model.evaluate() is equal to the masked loss
import tensorflow as tf
logits = tf.constant(y_pred, dtype=tf.float32)
target = tf.constant(y_true, dtype=tf.float32)
cross_entropy = tf.reduce_mean(-tf.reduce_sum(target * tf.log(logits), axis=2))
losses = -tf.reduce_sum(target * tf.log(logits), axis=2)
sequence_lengths = tf.constant([3, 4])
mask = tf.reverse(tf.sequence_mask(sequence_lengths, maxlen=max_sentence_length), [0, 1])
losses = tf.boolean_mask(losses, mask)
masked_loss = tf.reduce_mean(losses)

with tf.Session() as sess:
    c_e = sess.run(cross_entropy)
    m_c_e = sess.run(masked_loss)
    print("tf unmasked_loss:", c_e)
    print("tf masked_loss:", m_c_e)
The output in Keras and TensorFlow are compared as follows:
As shown above, masking is disabled after some kinds of layers. So how can I mask the loss function in Keras when those layers are added?
If there's a mask in your model, it'll be propagated layer-by-layer and eventually applied to the loss. So if you're padding and masking the sequences correctly, the loss on the padding placeholders will be ignored.
Some Details:
It's a bit involved to explain the whole process, so I'll just break it down to several steps:
In compile(), the mask is collected by calling compute_mask() and applied to the loss(es) (irrelevant lines are ignored for clarity).
weighted_losses = [_weighted_masked_objective(fn) for fn in loss_functions]

# Prepare output masks.
masks = self.compute_mask(self.inputs, mask=None)
if masks is None:
    masks = [None for _ in self.outputs]
if not isinstance(masks, list):
    masks = [masks]

# Compute total loss.
total_loss = None
with K.name_scope('loss'):
    for i in range(len(self.outputs)):
        y_true = self.targets[i]
        y_pred = self.outputs[i]
        weighted_loss = weighted_losses[i]
        sample_weight = sample_weights[i]
        mask = masks[i]
        with K.name_scope(self.output_names[i] + '_loss'):
            output_loss = weighted_loss(y_true, y_pred,
                                        sample_weight, mask)
Inside Model.compute_mask(), run_internal_graph() is called.
Inside run_internal_graph(), the masks in the model are propagated layer-by-layer, from the model's inputs to its outputs, by calling Layer.compute_mask() for each layer iteratively.
So if you're using a Masking layer in your model, you shouldn't worry about the loss on the padding placeholders. The loss on those entries will be masked out as you've probably already seen inside _weighted_masked_objective().
A Small Example:
max_sentence_length = 5
character_number = 2

input_tensor = Input(shape=(max_sentence_length, character_number))
masked_input = Masking(mask_value=0)(input_tensor)
output = LSTM(3, return_sequences=True)(masked_input)
model = Model(input_tensor, output)
model.compile(loss='mae', optimizer='adam')

X = np.array([[[0, 0], [0, 0], [1, 0], [0, 1], [0, 1]],
              [[0, 0], [0, 1], [1, 0], [0, 1], [0, 1]]])
y_true = np.ones((2, max_sentence_length, 3))
y_pred = model.predict(X)
print(y_pred)
[[[ 0.          0.          0.        ]
  [ 0.          0.          0.        ]
  [-0.11980877  0.05803877  0.07880752]
  [-0.00429189  0.13382857  0.19167568]
  [ 0.06817091  0.19093043  0.26219055]]

 [[ 0.          0.          0.        ]
  [ 0.0651961   0.10283815  0.12413475]
  [-0.04420842  0.137494    0.13727818]
  [ 0.04479844  0.17440712  0.24715884]
  [ 0.11117355  0.21645413  0.30220413]]]
# See if the loss computed by model.evaluate() is equal to the masked loss
unmasked_loss = np.abs(1 - y_pred).mean()
masked_loss = np.abs(1 - y_pred[y_pred != 0]).mean()
print(model.evaluate(X, y_true))
0.881977558136
print(masked_loss)
0.881978
print(unmasked_loss)
0.917384
As can be seen from this example, the loss on the masked part (the zeroes in y_pred) is ignored, and the output of model.evaluate() is equal to masked_loss.
EDIT:
If there's a recurrent layer with return_sequences=False, the mask stops propagating (i.e., the returned mask is None). In RNN.compute_mask():
def compute_mask(self, inputs, mask):
    if isinstance(mask, list):
        mask = mask[0]
    output_mask = mask if self.return_sequences else None
    if self.return_state:
        state_mask = [None for _ in self.states]
        return [output_mask] + state_mask
    else:
        return output_mask
In your case, if I understand correctly, you want a mask that's based on y_true, and whenever the value of y_true is [0, 0, 1] (the one-hot encoding of "#") you want the loss to be masked. If so, you need to mask the loss values in a somewhat similar way to Daniel's answer.
The main difference is the final average. The average should be taken over the number of unmasked values, which is just K.sum(mask). And also, y_true can be compared to the one-hot encoded vector [0, 0, 1] directly.
def get_loss(mask_value):
    mask_value = K.variable(mask_value)
    def masked_categorical_crossentropy(y_true, y_pred):
        # find out which timesteps in `y_true` are not the padding character '#'
        mask = K.all(K.equal(y_true, mask_value), axis=-1)
        mask = 1 - K.cast(mask, K.floatx())

        # multiply categorical_crossentropy with the mask
        loss = K.categorical_crossentropy(y_true, y_pred) * mask

        # take average w.r.t. the number of unmasked entries
        return K.sum(loss) / K.sum(mask)
    return masked_categorical_crossentropy

masked_categorical_crossentropy = get_loss(np.array([0, 0, 1]))
model = Model(input_tensor, output)
model.compile(loss=masked_categorical_crossentropy, optimizer='adam')
The output of the above code then shows that the loss is computed only on the unmasked values:
model.evaluate: 1.08339476585
tf unmasked_loss: 1.08989
tf masked_loss: 1.08339
The value is different from yours because I've changed the axis argument in tf.reverse from [0,1] to [1].
If you're not using masks as in Yu-Yang's answer, you can try this.
If your target data Y has the length dimension and is padded with the mask value, you can:
import keras.backend as K

def custom_loss(yTrue, yPred):
    # find which values in yTrue (target) are the mask value
    isMask = K.equal(yTrue, maskValue)  # true for all mask values

    # since y is shaped as (batch, length, features), we need all features to be mask values
    isMask = K.all(isMask, axis=-1)  # the entire output vector must be true
    # this second line is only necessary if the output features are more than 1

    # transform to float (0 or 1) and invert
    isMask = K.cast(isMask, dtype=K.floatx())
    isMask = 1 - isMask  # now mask values are zero, and others are 1

    # multiply this by the inputs:
    # maybe you might need K.expand_dims(isMask) to add the extra dimension removed by K.all
    yTrue = yTrue * isMask
    yPred = yPred * isMask

    return someLossFunction(yTrue, yPred)
If you have padding only for the input data, or if Y has no length, you can have your own mask outside the function:
masks = [
[1,1,1,1,1,1,0,0,0],
[1,1,1,1,0,0,0,0,0],
[1,1,1,1,1,1,1,1,0]
]
#shape (samples, length). If it fails, make it (samples, length, 1).
import keras.backend as K
masks = K.constant(masks)
Since masks depend on your input data, you can use your mask value to know where to put zeros, such as:
masks = np.array((X_train == maskValue).all(), dtype='float64')
masks = 1 - masks
#here too, if you have a problem with dimensions in the multiplications below
#expand masks dimensions by adding a last dimension = 1.
And make your function take the masks from outside (you must recreate the loss function if you change the input data):
def customLoss(yTrue, yPred):
    yTrue = masks * yTrue
    yPred = masks * yPred
    return someLossFunction(yTrue, yPred)
Does anyone know if keras automatically masks the loss function??
Since it provides a Masking layer and says nothing about the outputs, maybe it does it automatically?
I took both answers and improvised a way to handle multiple timesteps, single missing target values, and the loss for an LSTM (or other RecurrentNN) with return_sequences=True.
Daniel's answer would not suffice for multiple targets, due to isMask = K.all(isMask, axis=-1). Removing this aggregation probably made the function non-differentiable. I do not know for sure, since I never ran the pure function and cannot tell whether it is able to fit a model.
I fused Yu-Yang's and Daniel's answers together and it worked.
from tensorflow.keras.layers import Layer, Input, LSTM, Dense, TimeDistributed
from tensorflow.keras import Model, Sequential
import tensorflow.keras.backend as K
import numpy as np

mask_Value = -2

def get_loss(mask_value):
    mask_value = K.variable(mask_value)
    def masked_loss(yTrue, yPred):
        # find which values in yTrue (target) are the mask value
        isMask = K.equal(yTrue, mask_Value)  # true for all mask values

        # transform to float (0 or 1) and invert
        isMask = K.cast(isMask, dtype=K.floatx())
        isMask = 1 - isMask  # now mask values are zero, and others are 1

        # multiply this by the inputs:
        # maybe you might need K.expand_dims(isMask) to add the extra dimension removed by K.all
        yTrue = yTrue * isMask
        yPred = yPred * isMask

        # perform a root mean square error, whereas the mean is in respect to the mask
        mean_loss = K.sum(K.square(yPred - yTrue)) / K.sum(isMask)
        loss = K.sqrt(mean_loss)

        return loss
        # RootMeanSquaredError()(yTrue,yPred)
    return masked_loss

# define timeseries data
n_sample = 10
timesteps = 5
feat_inp = 2
feat_out = 2

X = np.random.uniform(0, 1, (n_sample, timesteps, feat_inp))
y = np.random.uniform(0, 1, (n_sample, timesteps, feat_out))

# define model
model = Sequential()
model.add(LSTM(50, activation='relu', return_sequences=True, input_shape=(timesteps, feat_inp)))
model.add(Dense(feat_out))
model.compile(optimizer='adam', loss=get_loss(mask_Value))
model.summary()

# %%
model.fit(X, y, epochs=50, verbose=0)
Note that Yu-Yang's answer does not appear to work on Tensorflow Keras 2.7.0
Surprisingly, model.evaluate does not compute masked_loss or unmasked_loss. Instead, it assumes that the loss from all masked input steps is zero (but still includes those steps in the mean() calculation). This means that every masked timestep actually reduces the calculated error!
#%% Yu-Yang's example
# https://stackoverflow.com/a/47060797/3580080

import tensorflow as tf
import tensorflow.keras as keras
import numpy as np

# Fix the random seed for repeatable results
np.random.seed(5)
tf.random.set_seed(5)

max_sentence_length = 5
character_number = 2

input_tensor = keras.Input(shape=(max_sentence_length, character_number))
masked_input = keras.layers.Masking(mask_value=0)(input_tensor)
output = keras.layers.LSTM(3, return_sequences=True)(masked_input)
model = keras.Model(input_tensor, output)
model.compile(loss='mae', optimizer='adam')

X = np.array([[[0, 0], [0, 0], [1, 0], [0, 1], [0, 1]],
              [[0, 0], [0, 1], [1, 0], [0, 1], [0, 1]]])
y_true = np.ones((2, max_sentence_length, 3))
y_pred = model.predict(X)
print(y_pred)

# See if the loss computed by model.evaluate() is equal to the masked loss
unmasked_loss = np.abs(1 - y_pred).mean()
masked_loss = np.abs(1 - y_pred[y_pred != 0]).mean()

print(f"model.evaluate= {model.evaluate(X, y_true)}")
print(f"masked loss= {masked_loss}")
print(f"unmasked loss= {unmasked_loss}")
Prints:
[[[ 0.          0.          0.        ]
  [ 0.          0.          0.        ]
  [ 0.05340272 -0.06415359 -0.11803789]
  [ 0.08775083  0.00600774 -0.10454659]
  [ 0.11212641  0.07632366 -0.04133942]]

 [[ 0.          0.          0.        ]
  [ 0.05394626  0.08956442  0.03843312]
  [ 0.09092357 -0.02743799 -0.10386454]
  [ 0.10791279  0.04083341 -0.08820333]
  [ 0.12459432  0.09971555 -0.02882453]]]
1/1 [==============================] - 1s 658ms/step - loss: 0.6865
model.evaluate= 0.6864957213401794
masked loss= 0.9807082414627075
unmasked loss= 0.986495852470398
(This is intended as a comment rather than an answer).
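As a quick check of that claim (my own arithmetic, reusing the y_pred printed by the snippet above): summing the absolute error over the unmasked entries only, but dividing by the total number of entries, reproduces the model.evaluate figure.
# masked timesteps contribute zero loss but still count towards the mean
zero_filled_mean = np.abs(1 - y_pred[y_pred != 0]).sum() / y_pred.size
print(zero_filled_mean)  # ~0.6865, matching model.evaluate above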

How to calculate F1-micro score using lasagne

import theano.tensor as T
import numpy as np
from nolearn.lasagne import NeuralNet

def multilabel_objective(predictions, targets):
    epsilon = np.float32(1.0e-6)
    one = np.float32(1.0)
    pred = T.clip(predictions, epsilon, one - epsilon)
    return -T.sum(targets * T.log(pred) + (one - targets) * T.log(one - pred), axis=1)

net = NeuralNet(
    # your other parameters here (layers, update, max_epochs...)
    # here are the one you're interested in:
    objective_loss_function=multilabel_objective,
    custom_score=("validation score", lambda x, y: np.mean(np.abs(x - y)))
)
I found this code online and wanted to test it. It did work; the results include training loss, test loss, validation score, duration, and so on.
But how can I get the F1-micro score? Also, if I was trying to import scikit-learn to calculate the F1 after adding the following code:
data = data.astype(np.float32)
classes = classes.astype(np.float32)
net.fit(data, classes)
score = cross_validation.cross_val_score(net, data, classes, scoring='f1', cv=10)
print score
I got this error:
ValueError: Can't handle mix of multilabel-indicator and continuous-multioutput
How to implement F1-micro calculation based on above code?
Suppose your true labels on the test set are y_true (shape: (n_samples, n_classes), composed only of 0s and 1s), and your test observations are X_test (shape: (n_samples, n_features)).
Then you get your net predicted values on the test set by y_test = net.predict(X_test).
If you are doing multiclass classification:
Since in your network you have set regression to False, this should be composed of 0s and 1s only, too.
You can compute the micro averaged f1 score with:
from sklearn.metrics import f1_score
f1_score(y_true, y_pred, average='micro')
Small code sample to illustrate this (with dummy data, use your actual y_test and y_true):
from sklearn.metrics import f1_score
import numpy as np
y_true = np.array([[0, 0, 1], [0, 1, 0], [0, 0, 1], [0, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1], [0, 0, 1]])
t = f1_score(y_true, y_pred, average='micro')
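For reference (my own arithmetic, not part of the original answer): this dummy data yields 3 true positives, 2 false positives and 2 false negatives overall, so micro precision and recall are both 3/5 and t evaluates to 0.6.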
If you are doing multilabel classification:
You are not outputting a matrix of 0 and 1, but a matrix of probabilities. y_pred[i, j] is the probability that observation i belongs to the class j.
You need to define a threshold value, above which you will say an observation belongs to a given class. Then you can attribute labels accordingly and proceed just the same as in the previous case.
thresh = 0.8  # choose your own value
y_test_binary = np.where(y_test > thresh, 1, 0)
# creates an array with 1 where y_test > thresh, 0 elsewhere
f1_score(y_true, y_test_binary, average='micro')
