Tensorflow: Custom loss function to encourage/restrict false positives/negatives - keras

I am looking for a way to encourage/restrict false positives/negatives. I have not been able to find a solution that I can get working, most likely due to my lack of experience.
I found this post: Custom loss function in Keras to penalize false negatives, which has pretty much the same purpose, but I cannot get the answer to work.
First I kept on getting this error:
AttributeError: 'Tensor' object has no attribute '_numpy'
After some searching I found that this could be worked around with "model.run_eagerly = True", but that produces a different error:
No gradients provided for any variable
I would rather not use "model.run_eagerly = True", since I am not entirely sure what it does; evidently it means I have to do something about how the gradients are calculated. I have therefore altered the code, but I keep getting "No gradients provided for any variable".
I have made an example of this:
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
class Custom_loss_Class(tf.keras.losses.Loss):
    def __init__(self, recall_weight = 0.5, spec_weight = 0.5, name="custom"):
        super().__init__(name=name)
        self.recall_weight = recall_weight
        self.spec_weight = spec_weight

    def call(self, y_true, y_pred):
        y_pred = tf.math.round(y_pred)
        y_true = tf.math.round(y_true)
        TN = tf.dtypes.cast(tf.math.logical_and(tf.math.equal(y_true, 0), tf.math.equal(y_pred, 0)), tf.float32)
        TP = tf.dtypes.cast(tf.math.logical_and(tf.math.equal(y_true, 1), tf.math.equal(y_pred, 1)), tf.float32)
        FP = tf.dtypes.cast(tf.math.logical_and(tf.math.equal(y_true, 0), tf.math.equal(y_pred, 1)), tf.float32)
        FN = tf.dtypes.cast(tf.math.logical_and(tf.math.equal(y_true, 1), tf.math.equal(y_pred, 0)), tf.float32)
        TN = tf.reduce_sum(TN)
        TP = tf.reduce_sum(TP)
        FP = tf.reduce_sum(FP)
        FN = tf.reduce_sum(FN)
        specificity = TN / (TN + FP + K.epsilon())
        recall = TP / (TP + FN + K.epsilon())
        loss = tf.Variable(1- (self.recall_weight * recall + self.spec_weight * specificity))
        return loss
data = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(data.data, data.target, test_size=0.3)
N, D = X_train.shape
scalar = StandardScaler()
X_train = scalar.fit_transform(X_train)
X_test = scalar.transform(X_test)
i = Input(shape=(D,))
x = Dense(64, activation="relu")(i)
x = Dense(1, activation="sigmoid")(x)
model = Model(i, x)
model.compile(optimizer="adam",
              loss=Custom_loss_Class(recall_weight=0.1, spec_weight=0.9),
              metrics=["accuracy"])
r = model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=10)
I am aware that I can use class_weight to give one class more weight, but that is not really what I want. In this example I would like to heavily restrict false negatives, so that a patient with cancer does not get a negative prediction, while at the same time not getting too many false positives. It seems like a common use case, so I hope someone has made a good function for it.
Edit:
I am aware that this most likely is due to my loss function not being differentiable, but I do not know how to change that so I get the described functionality.
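One common way around the non-differentiability mentioned in the edit is to drop the rounding, count "soft" true/false positives and negatives directly from the predicted probabilities, and return the resulting tensor instead of wrapping it in tf.Variable. The following is a minimal sketch of that idea, not a tested replacement for the linked answer: the class name SoftRecallSpecificityLoss and the constant EPS are made up for illustration, and it assumes a single sigmoid output with 0/1 labels.

```python
import tensorflow as tf

EPS = 1e-7  # hypothetical small constant to avoid division by zero


class SoftRecallSpecificityLoss(tf.keras.losses.Loss):
    """1 - (weighted soft recall + weighted soft specificity)."""

    def __init__(self, recall_weight=0.5, spec_weight=0.5, name="soft_recall_spec"):
        super().__init__(name=name)
        self.recall_weight = recall_weight
        self.spec_weight = spec_weight

    def call(self, y_true, y_pred):
        # Flatten both inputs so (batch,) labels and (batch, 1) outputs line up.
        y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
        y_pred = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
        # No rounding: probabilities act as fractional counts, so gradients exist.
        tp = tf.reduce_sum(y_true * y_pred)
        fn = tf.reduce_sum(y_true * (1.0 - y_pred))
        tn = tf.reduce_sum((1.0 - y_true) * (1.0 - y_pred))
        fp = tf.reduce_sum((1.0 - y_true) * y_pred)
        recall = tp / (tp + fn + EPS)
        specificity = tn / (tn + fp + EPS)
        # Return the tensor itself; wrapping it in tf.Variable cuts the gradient path.
        return 1.0 - (self.recall_weight * recall + self.spec_weight * specificity)


# Quick check that a gradient reaches the predictions.
labels = tf.constant([0.0, 1.0, 0.0, 1.0])
probs = tf.Variable([0.2, 0.8, 0.6, 0.4])
loss_fn = SoftRecallSpecificityLoss(recall_weight=0.5, spec_weight=0.5)
with tf.GradientTape() as tape:
    loss_value = loss_fn.call(labels, probs)
grads = tape.gradient(loss_value, probs)
print(loss_value.numpy(), grads is not None)
```

It would be passed to compile as loss=SoftRecallSpecificityLoss(recall_weight=..., spec_weight=...); a larger recall_weight pushes harder against false negatives, a larger spec_weight against false positives. Whether this trains well on a given data set still has to be checked.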

Related

Implementing and tuning a simple CNN for 3D data using Keras Conv3D

I'm trying to implement a 3D CNN using Keras. However, I am having some difficulties understanding some details in the results obtained and further enhancing the accuracy.
The data that I am trying to analyze has the shape {64(d1)x64(d2)x38(d3)}, where d1 and d2 are the length and width of the image (64x64 pixels) and d3 is the time dimension. In other words, I have 38 images. The channel parameter is set to 1, as my data are actually raw data and not really color images.
My data consist of 219 samples, hence 219x64x64x38. They are divided into training and validation sets with 20% for validation. In addition, I have a fixed 147 additional data for testing.
Below is my code that works fine. It creates a txt file that saves the results for the different combination of parameters in my network (grid search). Here in this code, I only consider tuning 2 parameters: the number of filters and lambda for L2 regularizer. I fixed the dropout and the kernel size for the filters. However, later I considered their variations.
I also tried to set the seed value so that I have some sort of reproducibility (I don't think that I have achieved this task).
My question is that:
Given the architecture and code below, the training accuracy converges towards 1 for every combination of parameters (which is good). However, the validation accuracy is most of the time around 80% +/- 4% (rarely below 70%), regardless of the hyper-parameter combination. The test accuracy behaves similarly. How can I raise this accuracy above 90%?
As far as I know, a gap between the training and validation/test accuracy is a result of overfitting. However, in my model I am adding dropout and L2 regularizers and also changing the size of my network, which should somehow reduce this gap (but it does not).
Is there anything else I can do besides modifying my input data? Does adding more layers help? Or is there maybe a pre-trained 3D CNN like in the case of 2D CNN (e.g., AlexNet)? Should I try ConvLSTM? Is this the limit of this architecture?
Thank you :)
import numpy as np
import tensorflow as tf
import keras
from keras.models import Sequential
from keras.layers import Conv3D, MaxPooling3D, Dense, Flatten, Activation
from keras.utils import to_categorical
from keras.regularizers import l2
from keras.layers import Dropout
from keras.utils import multi_gpu_model
import scipy.io as sio
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from keras.callbacks import ReduceLROnPlateau
tf.set_random_seed(1234)
def normalize_minmax(X_train):
    """
    Normalize to [0,1]
    """
    from sklearn import preprocessing
    min_max_scaler = preprocessing.MinMaxScaler()
    X_minmax_train = min_max_scaler.fit_transform(X_train)
    return X_minmax_train

# generate and prepare the dataset
def get_data():
    # Load and prepare the data
    X_data = sio.loadmat('./X_train')['X_train']
    Y_data = sio.loadmat('./Y_train')['targets_train']
    X_test = sio.loadmat('./X_test')['X_test']
    Y_test = sio.loadmat('./Y_test')['targets_test']
    return X_data, Y_data, X_test, Y_test
def get_model(X_train, Y_train, X_validation, Y_validation, F1_nb, F2_nb, F3_nb, kernel_size_1, kernel_size_2, kernel_size_3, l2_lambda, learning_rate, reduce_lr, dropout_conv1, dropout_conv2, dropout_conv3, dropout_dense, no_epochs):
    no_classes = 5
    sample_shape = (64, 64, 38, 1)
    batch_size = 32
    dropout_seed = 30
    conv_seed = 20
    # Create the model
    model = Sequential()
    model.add(Conv3D(F1_nb, kernel_size=kernel_size_1, kernel_regularizer=l2(l2_lambda), padding='same', kernel_initializer='glorot_uniform', input_shape=sample_shape))
    model.add(Activation('selu'))
    model.add(MaxPooling3D(pool_size=(2,2,2)))
    model.add(Dropout(dropout_conv1, seed=conv_seed))
    model.add(Conv3D(F2_nb, kernel_size=kernel_size_2, kernel_regularizer=l2(l2_lambda), padding='same', kernel_initializer='glorot_uniform'))
    model.add(Activation('selu'))
    model.add(MaxPooling3D(pool_size=(2,2,2)))
    model.add(Dropout(dropout_conv2, seed=conv_seed))
    model.add(Conv3D(F3_nb, kernel_size=kernel_size_3, kernel_regularizer=l2(l2_lambda), padding='same', kernel_initializer='glorot_uniform'))
    model.add(Activation('selu'))
    model.add(MaxPooling3D(pool_size=(2,2,2)))
    model.add(Dropout(dropout_conv3, seed=conv_seed))
    model.add(Flatten())
    model.add(Dense(512, kernel_regularizer=l2(l2_lambda), kernel_initializer='glorot_uniform'))
    model.add(Activation('selu'))
    model.add(Dropout(dropout_dense, seed=dropout_seed))
    model.add(Dense(no_classes, activation='softmax'))
    model = multi_gpu_model(model, gpus = 2)
    # Compile the model
    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer=keras.optimizers.Adam(lr=learning_rate),
                  metrics=['accuracy'])
    # Train the model.
    history = model.fit(X_train, Y_train, batch_size=batch_size, epochs=no_epochs, validation_data=(X_validation, Y_validation),callbacks=[reduce_lr])
    return model, history
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=0.0001)
learning_rate = 0.001
no_epochs = 100
X_data, Y_data, X_test, Y_test = get_data()
# Normalize the train/val data
for i in range(X_data.shape[0]):
    for j in range(X_data.shape[3]):
        X_data[i,:,:,j] = normalize_minmax(X_data[i,:,:,j])
X_data = np.expand_dims(X_data, axis=4)
# Normalize the test data
for i in range(X_test.shape[0]):
    for j in range(X_test.shape[3]):
        X_test[i,:,:,j] = normalize_minmax(X_test[i,:,:,j])
X_test = np.expand_dims(X_test, axis=4)
# Shuffle the training data
# fix random seed for reproducibility
seedValue = 40
permutation = np.random.RandomState(seed=seedValue).permutation(len(X_data))
X_data = X_data[permutation]
Y_data = Y_data[permutation]
Y_data = np.squeeze(Y_data)
Y_test = np.squeeze(Y_test)
#Split between train and validation (20%). Here I did not use the classical validation_split=0.2 just to make sure that the data is the same for the different architectures I am using.
X_train = X_data[0:175,:,:,:,:]
Y_train = Y_data[0:175]
# Start at 175 (not 176) so that sample 175 is not silently dropped
X_validation = X_data[175:,:,:,:,:]
Y_validation = Y_data[175:]
Y_train = to_categorical(Y_train,num_classes=5).astype(np.integer)
Y_validation = to_categorical(Y_validation,num_classes=5).astype(np.integer)
Y_test = to_categorical(Y_test,num_classes=5).astype(np.integer)
l2_lambda_list = [(1*pow(10,-4)),(2*pow(10,-4)),
(3*pow(10,-4)),
(4*pow(10,-4)),
(5*pow(10,-4)),(6*pow(10,-4)),
(7*pow(10,-4)),
(8*pow(10,-4)),(9*pow(10,-4)),(10*pow(10,-4))
]
filters_nb = [(128,64,64),(128,64,32),(128,64,16),(128,64,8),(128,32,32),(128,32,16),(128,32,8),(128,16,8),(128,8,8),
(64,64,32),(64,64,16),(64,64,8),(64,32,32),(64,32,16),(64,32,8),(64,16,16),(64,16,8),(64,8,8),
(32,32,16),(32,32,8),(32,16,16),(32,16,8),(32,8,8),
(16,16,16),(16,16,8),(16,8,8)
]
DropOut = [(0.25,0.25,0.25,0.5),
(0,0,0,0.1),(0,0,0,0.2),(0,0,0,0.3),(0,0,0,0.4),(0,0,0,0.5),
(0.1,0.1,0.1,0),(0.2,0.2,0.2,0),(0.3,0.3,0.3,0),(0.4,0.4,0.4,0),(0.5,0.5,0.5,0),
(0.1,0.1,0.1,0.1),(0.1,0.1,0.1,0.2),(0.1,0.1,0.1,0.3),(0.1,0.1,0.1,0.4),(0.1,0.1,0.1,0.5),
(0.15,0.15,0.15,0.1),(0.15,0.15,0.15,0.2),(0.15,0.15,0.15,0.3),(0.15,0.15,0.15,0.4),(0.15,0.15,0.15,0.5),
(0.2,0.2,0.2,0.1),(0.2,0.2,0.2,0.2),(0.2,0.2,0.2,0.3),(0.2,0.2,0.2,0.4),(0.2,0.2,0.2,0.5),
(0.25,0.25,0.25,0.1),(0.25,0.25,0.25,0.2),(0.25,0.25,0.25,0.3),(0.25,0.25,0.25,0.4),(0.25,0.25,0.25,0.5),
(0.3,0.3,0.3,0.1),(0.3,0.3,0.3,0.2),(0.3,0.3,0.3,0.3),(0.3,0.3,0.3,0.4),(0.3,0.3,0.3,0.5),
(0.35,0.35,0.35,0.1),(0.35,0.35,0.35,0.2),(0.35,0.35,0.35,0.3),(0.35,0.35,0.35,0.4),(0.35,0.35,0.35,0.5)
]
kernel_size = [(3,3,3),
(2,3,3),(2,3,4),(2,3,5),(2,3,6),(2,3,7),(2,3,8),(2,3,9),(2,3,10),(2,3,11),(2,3,12),(2,3,13),(2,3,14),(2,3,15),
(3,3,3),(3,3,4),(3,3,5),(3,3,6),(3,3,7),(3,3,8),(3,3,9),(3,3,10),(3,3,11),(3,3,12),(3,3,13),(3,3,14),(3,3,15),
(3,4,3),(3,4,4),(3,4,5),(3,4,6),(3,4,7),(3,4,8),(3,4,9),(3,4,10),(3,4,11),(3,4,12),(3,4,13),(3,4,14),(3,4,15),
]
for l in range(len(l2_lambda_list)):
    l2_lambda = l2_lambda_list[l]
    f = open("My Results.txt", "a")
    lambda_Str = str(l2_lambda)
    f.write("---------------------------------------\n")
    f.write("lambda = "+f"{lambda_Str}\n")
    f.write("---------------------------------------\n")
    for i in range(len(filters_nb)):
        F1_nb = filters_nb[i][0]
        F2_nb = filters_nb[i][1]
        F3_nb = filters_nb[i][2]
        kernel_size_1 = kernel_size[0]
        kernel_size_2 = kernel_size_1
        kernel_size_3 = kernel_size_1
        dropout_conv1 = DropOut[0][0]
        dropout_conv2 = DropOut[0][1]
        dropout_conv3 = DropOut[0][2]
        dropout_dense = DropOut[0][3]
        # fit model
        model, history = get_model(X_train, Y_train, X_validation, Y_validation, F1_nb, F2_nb, F3_nb, kernel_size_1, kernel_size_2, kernel_size_3, l2_lambda, learning_rate, reduce_lr, dropout_conv1, dropout_conv2, dropout_conv3, dropout_dense, no_epochs)
        # Evaluate metrics
        predictions = model.predict(X_test)
        out = np.argmax(predictions, axis=1)
        Y_test = sio.loadmat('./Y_test')['targets_test']
        Y_test = np.squeeze(Y_test)
        loss = history.history['loss'][no_epochs-1]
        acc = history.history['acc'][no_epochs-1]
        val_loss = history.history['val_loss'][no_epochs-1]
        val_acc = history.history['val_acc'][no_epochs-1]
        # accuracy: (tp + tn) / (p + n)
        accuracy = accuracy_score(Y_test, out)
        # f1: 2 tp / (2 tp + fp + fn)
        f1 = f1_score(Y_test, out,average='macro')
        a = str(filters_nb[i][0]) + ',' + str(filters_nb[i][1]) + ',' + str(filters_nb[i][2]) + ': ' + str('f1-metric: ') + str('%f' % f1) + str(' | loss: ') + str('%f' % loss) + str(' | acc: ') + str('%f' % acc) + str(' | val_loss: ') + str('%f' % val_loss) + str(' | val_acc: ') + str('%f' % val_acc) + str(' | test_acc: ') + str('%f' % accuracy)
        f.write(f"{a}\n")
    f.close()

Why is the loss from model.evaluate() different from the loss I compute myself from model.predict()?

I am running a neural network with Keras. Here is my code:
import numpy as np
from keras import Model
from keras.models import Sequential
from keras.layers import Dense
from keras import backend as K
def mean_squared_error(y_true, y_pred):
    return K.mean(K.square(y_pred - y_true),axis=-1)
np.random.seed(1)
Train_X = np.random.randint(low=0,high=100,size = (50,5))
Train_Y = np.matmul(Train_X,np.arange(10).reshape(5,2))+np.random.randint(low=0,high=10,size=(50,2))
Test_X = np.random.randint(low=0,high=100,size = (10,5))
Test_Y = np.matmul(Test_X,np.arange(10).reshape(5,2))+np.random.randint(low=0,high=10,size=(10,2))
model = Sequential()
model.add(Dense(4,activation = 'relu'))
model.add(Dense(2,activation='relu'))
model.add(Dense(2,activation='relu'))
model.add(Dense(2))
model.compile(loss=mean_squared_error, optimizer='adam', metrics=['mae'])
history = model.fit(Train_X, Train_Y, epochs=100, batch_size=5,validation_data = (Test_X, Test_Y))
loss1 = model.evaluate(Test_X,Test_Y)
loss2 = history.history['val_loss'][99]
y_pred = model.predict(Test_X)
y_true = Test_Y
loss3 = np.mean(np.square(y_pred-y_true))
I find that loss1 is the same as loss2 but different from loss3, which confuses me. Could someone tell me why?
This is possibly due to different dtypes for Test_Y and y_pred. Keras tries to automatically take care of dtype mismatches for you, so it is possible that Test_Y is a float64 and y_pred is a float32. If that is indeed the case, try converting one of their dtypes for the loss3 calculation and see if the values match.
y_pred = model.predict(Test_X)
y_true = Test_Y.astype(np.float32)
loss3 = np.mean(np.square(y_pred-y_true))
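A quick way to check how much of the gap the dtype explanation can account for is to compute the same MSE once with mixed dtypes and once with both arrays cast to float32. The NumPy sketch below uses made-up arrays (y_true64, y_pred32) in place of Test_Y and the output of model.predict; it only shows how the result dtype follows the inputs and does not reproduce the numbers from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for Test_Y (float64) and model.predict(Test_X) (float32).
y_true64 = rng.integers(0, 1000, size=(10, 2)).astype(np.float64)
y_pred32 = (y_true64 + rng.normal(size=(10, 2))).astype(np.float32)


def mse(y_true, y_pred):
    """Mean squared error over all elements."""
    return np.mean(np.square(y_pred - y_true))


loss_mixed = mse(y_true64, y_pred32)                   # promoted to float64
loss_f32 = mse(y_true64.astype(np.float32), y_pred32)  # stays in float32

print(loss_mixed.dtype, loss_f32.dtype)
print(loss_mixed, loss_f32)
```

If the two values agree to several digits, a remaining large gap between loss3 and model.evaluate() would have to come from something other than the dtype.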

Evaluate and Predict functions in Keras are not giving the same statistics

I have a very simple DNN and a given data set. However, the standard deviations of the error that I get from "evaluate" and from "predict" are different. The mean error seems OK, but the stdev from predict is always larger than the stdev from evaluate. Why do these differences happen, and how can I fix them?
Raw data is here for download
from keras.models import Sequential
from keras.layers import Dense, Activation
import keras.backend as K
from keras import optimizers
import pickle
import numpy as np
with open('.\\dump','rb') as f:
    xTr = pickle.load(f)
    yTr = pickle.load(f)
    muX = pickle.load(f)
    stdX = pickle.load(f)
    muY = pickle.load(f)
    stdY = pickle.load(f)

def mean_pred(y_true, y_pred):
    y_true = y_true*stdY + muY
    y_pred = y_pred*stdY + muY
    return K.mean(y_pred - y_true)

def std_pred(y_true, y_pred):
    y_true = y_true*stdY + muY
    y_pred = y_pred*stdY + muY
    return K.std(y_pred - y_true)
model = Sequential()
model.add(Dense(256, input_shape=(100,)))
model.add(Activation('tanh'))
model.add(Dense(1))
adam = optimizers.adam(lr=0.0001)
model.compile(optimizer=adam,loss='mse', metrics=[mean_pred, std_pred])
model.fit(xTr, yTr.reshape(-1,1), epochs = 5, batch_size = 128, verbose=0, shuffle=True)
score = model.evaluate(xTr, yTr.reshape(-1,1), verbose=0)
pred = model.predict(xTr, verbose=0)
print(score) #mse, mean, stdev of error
errArr = []
for i,y in enumerate(yTr):
    errArr.append((pred[i][0] - y)*stdY)
e = np.asarray(errArr)
print(e.mean(), e.std()) #mean, stdev of error
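Finally found the reason: evaluate does not compute the metrics over all samples at once, even if batch_size is set to None. It works through the data in batches (32 samples by default) and averages the per-batch metric values, so the reported stdev is an average of per-batch standard deviations rather than the standard deviation over the whole set. After setting batch_size = 1000 (the number of samples in my data set), I got the same mean and standard deviation of the error.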
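The effect can be reproduced without Keras. The NumPy sketch below assumes that evaluate averages a metric such as std_pred over batches, weighted by batch length; the helper batch_averaged_std, the batch size of 128, and the random errors are made up for illustration. Because the variation between batch means is lost, the batch-averaged value cannot exceed the standard deviation over all samples, which matches the observation that the stdev from predict is always the larger one.

```python
import numpy as np


def batch_averaged_std(errors, batch_size):
    """Average of per-batch standard deviations, weighted by batch length."""
    total = 0.0
    for start in range(0, len(errors), batch_size):
        batch = errors[start:start + batch_size]
        total += batch.std() * len(batch)
    return total / len(errors)


rng = np.random.default_rng(0)
errors = rng.normal(loc=0.0, scale=1.0, size=1000)  # made-up prediction errors

std_all = errors.std()                            # what the manual predict-based check computes
std_batched = batch_averaged_std(errors, 128)     # batch-wise average
std_one_batch = batch_averaged_std(errors, 1000)  # one batch covering every sample

print(std_all, std_batched, std_one_batch)
```

With a single batch that covers every sample, the two computations coincide, which is why setting batch_size to the data set size makes the numbers match.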

Python different results for manual and cross_val_score prediction

I have a question about implementing KFold and cross_val_score.
My goal is to calculate the mean_squared_error, and for this purpose I used the following code:
from sklearn import linear_model
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
x = np.random.random((10000,20))
y = np.random.random((10000,1))
x_train = x[7000:]
y_train = y[7000:]
x_test = x[:7000]
y_test = y[:7000]
Model = linear_model.LinearRegression()
Model.fit(x_train,y_train)
y_predicted = Model.predict(x_test)
MSE = mean_squared_error(y_test,y_predicted)
print(MSE)
kfold = KFold(n_splits = 100, random_state = None, shuffle = False)
results = cross_val_score(Model,x,y,cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())
I think everything is fine here; I got the following results:
Results: 0.0828856459279 and -0.083069435946
But when I try the same on another example (data from Kaggle House Prices), it does not work properly, or at least I think so.
import pandas as pd
train = pd.read_csv('train.csv')
# Insert missing values...
...
train = pd.get_dummies(train)
y = train['SalePrice']
train = train.drop(['SalePrice'], axis = 1)
x_train = train[:1000].values.reshape(-1,339)
y_train = y[:1000].values.reshape(-1,1)
y_train_normal = np.log(y_train)
x_test = train[1000:].values.reshape(-1,339)
y_test = y[1000:].values.reshape(-1,1)
Model = linear_model.LinearRegression()
Model.fit(x_train,y_train_normal)
y_predicted = Model.predict(x_test)
y_predicted_transform = np.exp(y_predicted)
MSE = mean_squared_error(y_test, y_predicted_transform)
print(MSE)
kfold = KFold(n_splits = 10, random_state = None, shuffle = False)
results = cross_val_score(Model,train,y, cv = kfold, scoring = "neg_mean_squared_error")
print(results.mean())
Here I get the following results: 0.912874946869 and -6.16986926564e+16
Apparently, the mean_squared_error calculated 'manually' is not the same as the one calculated with the help of KFold.
Where did I make a mistake?
The discrepancy is because, in contrast to your first approach (training/test set), in your CV approach you use the unnormalized y data for fitting the regression, hence your huge MSE. To get comparable results, you should do the following:
y_normal = np.log(y)
y_test_normal = np.log(y_test)
MSE = mean_squared_error(y_test_normal, y_predicted) # NOT y_predicted_transform
results = cross_val_score(Model, train, y_normal, cv = kfold, scoring = "neg_mean_squared_error")
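To illustrate the point about using the same target scale in both approaches, here is a small self-contained sketch on synthetic data (the arrays X, log_y, and y are made up and merely stand in for the house-price features and SalePrice). Both the hold-out and the cross-validated estimate are computed on log(y), so the two numbers become comparable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the house-price data: log(y) is linear in X plus noise.
X = rng.normal(size=(500, 5))
log_y = X @ np.array([0.5, -0.3, 0.2, 0.1, 0.4]) + 12.0 + 0.1 * rng.normal(size=500)
y = np.exp(log_y)

model = LinearRegression()

# Hold-out estimate: fit on log(y) and score against log(y).
model.fit(X[:400], np.log(y[:400]))
mse_holdout = mean_squared_error(np.log(y[400:]), model.predict(X[400:]))

# Cross-validated estimate: pass the same log-transformed target.
kfold = KFold(n_splits=5, shuffle=False)
scores = cross_val_score(model, X, np.log(y), cv=kfold,
                         scoring="neg_mean_squared_error")
mse_cv = -scores.mean()

print(mse_holdout, mse_cv)
```

Note that cross_val_score returns negated MSE values for this scoring string, hence the sign flip before comparing.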

How to calculate F1 Macro in Keras?

I've tried to use the code that Keras provided before these metrics were removed. Here's the code:
def precision(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def recall(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def fbeta_score(y_true, y_pred, beta=1):
    if beta < 0:
        raise ValueError('The lowest choosable beta is zero (only precision).')
    # If there are no true positives, fix the F score at 0 like sklearn.
    if K.sum(K.round(K.clip(y_true, 0, 1))) == 0:
        return 0
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    bb = beta ** 2
    fbeta_score = (1 + bb) * (p * r) / (bb * p + r + K.epsilon())
    return fbeta_score

def fmeasure(y_true, y_pred):
    return fbeta_score(y_true, y_pred, beta=1)
From what I saw, the formula looks correct. But when I used it as a metric during training, I got exactly equal values for val_accuracy, val_precision, val_recall, and val_fmeasure. I suppose this could happen even if the formula is correct, but it seems unlikely. Is there an explanation for this?
Since Keras 2.0, the metrics f1, precision, and recall have been removed. The solution is to use a custom metric function:
from keras import backend as K

def f1(y_true, y_pred):
    def recall(y_true, y_pred):
        """Recall metric.
        Only computes a batch-wise average of recall.
        Computes the recall, a metric for multi-label classification of
        how many relevant items are selected.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

    def precision(y_true, y_pred):
        """Precision metric.
        Only computes a batch-wise average of precision.
        Computes the precision, a metric for multi-label classification of
        how many selected items are relevant.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision

    precision = precision(y_true, y_pred)
    recall = recall(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))

model.compile(loss='binary_crossentropy',
              optimizer= "adam",
              metrics=[f1])
The return line of this function
return 2*((precision*recall)/(precision+recall+K.epsilon()))
was modified by adding the constant epsilon, in order to avoid division by 0. Thus NaN will not be computed.
Using a Keras metric function is not the right way to calculate F1, AUC, or similar metrics.
The reason is that the metric function is called on each batch during validation, and Keras then averages the per-batch results. That average is not the correct F1 score.
That's the reason why the F1 score was removed from the metric functions in Keras. See here:
https://github.com/keras-team/keras/commit/a56b1a55182acf061b1eb2e2c86b48193a0e88f7
https://github.com/keras-team/keras/issues/5794
The right way to do this is to use a custom callback function in a way like this:
https://github.com/PhilipMay/mltb#module-keras
https://medium.com/@thongonary/how-to-compute-f1-score-for-each-epoch-in-keras-a1acd17715a2
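To see concretely why an average of batch-wise F1 values differs from the F1 score over the whole set, consider the toy example below. The labels, the batch size of 4, and the helper f1_binary are made up for illustration.

```python
import numpy as np


def f1_binary(y_true, y_pred):
    """F1 for the positive class from hard 0/1 labels; 0.0 if undefined."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


y_true = np.array([1, 1, 0, 0, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 0])

# F1 over all eight samples at once.
f1_global = f1_binary(y_true, y_pred)

# F1 per batch of four samples, then averaged (what a batch-wise metric reports).
f1_batches = [f1_binary(y_true[i:i + 4], y_pred[i:i + 4]) for i in (0, 4)]
f1_batch_mean = float(np.mean(f1_batches))

print(f1_global)      # 2*2 / (2*2 + 1 + 1) = 0.666...
print(f1_batch_mean)  # (1.0 + 0.0) / 2 = 0.5
```

The first batch is classified perfectly and the second has no true positives, so the batch average lands at 0.5 even though the F1 over all samples is about 0.67.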
This is a streaming custom f1_score metric that I made using subclassing. It works for TensorFlow 2.0 beta, but I haven't tried it on other versions. It keeps track of true positives, predicted positives, and all possible positives throughout the whole epoch and then calculates the F1 score at the end of the epoch. I think the other answers only give the F1 score for each batch, which isn't really the best metric when we want the F1 score over all the data.
I got a raw unedited copy of Aurélien Geron's new book Hands-On Machine Learning with Scikit-Learn & Tensorflow 2.0 and highly recommend it. This is how I learned to write this custom F1 metric using subclasses. It's hands down the most comprehensive TensorFlow book I've ever seen. TensorFlow is seriously a pain in the butt to learn, and this guy lays down the coding groundwork to learn a lot.
FYI: In the metrics list, I had to include the parentheses in F1_score() or else it wouldn't work.
pip install tensorflow==2.0.0-beta1
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
import numpy as np

def create_f1():
    def f1_function(y_true, y_pred):
        y_pred_binary = tf.where(y_pred>=0.5, 1., 0.)
        tp = tf.reduce_sum(y_true * y_pred_binary)
        predicted_positives = tf.reduce_sum(y_pred_binary)
        possible_positives = tf.reduce_sum(y_true)
        return tp, predicted_positives, possible_positives
    return f1_function

class F1_score(keras.metrics.Metric):
    def __init__(self, **kwargs):
        super().__init__(**kwargs) # handles base args (e.g., dtype)
        self.f1_function = create_f1()
        self.tp_count = self.add_weight("tp_count", initializer="zeros")
        self.all_predicted_positives = self.add_weight('all_predicted_positives', initializer='zeros')
        self.all_possible_positives = self.add_weight('all_possible_positives', initializer='zeros')

    def update_state(self, y_true, y_pred,sample_weight=None):
        tp, predicted_positives, possible_positives = self.f1_function(y_true, y_pred)
        self.tp_count.assign_add(tp)
        self.all_predicted_positives.assign_add(predicted_positives)
        self.all_possible_positives.assign_add(possible_positives)

    def result(self):
        precision = self.tp_count / self.all_predicted_positives
        recall = self.tp_count / self.all_possible_positives
        f1 = 2*(precision*recall)/(precision+recall)
        return f1

X = np.random.random(size=(1000, 10))
Y = np.random.randint(0, 2, size=(1000,))
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
model = keras.models.Sequential([
    keras.layers.Dense(5, input_shape=[X.shape[1], ]),
    keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='SGD', metrics=[F1_score()])
history = model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test))
As @Diesche mentioned, the main problem with implementing f1_score this way is that it is called at every batch step, which leads to confusing results more than anything else.
I've been struggling some time with this issue but eventually worked my way around the problem by using a callback: at the end of an epoch the callback predicts on the data (in this case I chose to only apply it to my validation data) with the new model parameters and gives you coherent metrics evaluated on the whole epoch.
I'm using tensorflow-gpu (1.14.0) on python3
import numpy as np
from tensorflow.python.keras.models import Sequential, Model
from sklearn.metrics import f1_score, precision_score, recall_score
from tensorflow.keras.callbacks import Callback
from tensorflow.python.keras import optimizers

optimizer = optimizers.SGD(lr=0.0001, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=['accuracy'])
model.summary()

class Metrics(Callback):
    def __init__(self, model, valid_data, true_outputs):
        super().__init__()
        self.model=model
        self.valid_data=valid_data #the validation data I'm getting metrics on
        self.true_outputs=true_outputs #the ground truth of my validation data
        self.steps=len(self.valid_data)

    def on_epoch_end(self, epoch, logs=None):
        gen=generator(self.valid_data) #generator yielding the validation data
        val_predict = (np.asarray(self.model.predict(gen, batch_size=1, verbose=0, steps=self.steps)))
        # from_proba_to_output turns the probabilities into the 0/1 labels
        # that sklearn's metric functions expect
        val_predict=from_proba_to_output(val_predict, 0.5)
        _val_f1 = f1_score(self.true_outputs, val_predict)
        _val_precision = precision_score(self.true_outputs, val_predict)
        _val_recall = recall_score(self.true_outputs, val_predict)
        print ("val_f1: ", _val_f1, " val_precision: ", _val_precision, " _val_recall: ", _val_recall)
The function from_proba_to_output goes as follows:
def from_proba_to_output(probabilities, threshold):
    outputs = np.copy(probabilities)
    for i in range(len(outputs)):
        if (float(outputs[i])) > threshold:
            outputs[i] = int(1)
        else:
            outputs[i] = int(0)
    return np.array(outputs)
I then train my model by referencing this metrics class in the callbacks part of fit_generator. I did not detail the implementation of my train_generator and valid_generator as these data generators are specific to the classification problem at hand and posting them would only bring confusion.
model.fit_generator(
    train_generator, epochs=nbr_epochs, verbose=1, validation_data=valid_generator,
    callbacks=[Metrics(model, valid_data, true_outputs)])
As @Pedia said in his comment above, on_epoch_end, as stated in github.com/fchollet/keras/issues/5400, is the best approach.
I also suggest this workaround:
install the keras_metrics package by ybubnov
call model.fit(epochs=1, ...) inside a for loop, taking advantage of the precision/recall metrics that are output after every epoch
Something like this:
for mini_batch in range(epochs):
    model_hist = model.fit(X_train, Y_train, batch_size=batch_size, epochs=1,
                           verbose=2, validation_data=(X_val, Y_val))
    precision = model_hist.history['val_precision'][0]
    recall = model_hist.history['val_recall'][0]
    f_score = (2.0 * precision * recall) / (precision + recall)
    print('F1-SCORE {}'.format(f_score))
