F1 metric and LeaveOneOut validation strategy in scikit-learn - scikit-learn

I want to use GridSearchCV to find the optimal n_neighbors parameter of KNeighborsClassifier
I want to use 'f1_score' metrics AND 'leave one out' strategy.
But this code
clf = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 2, 3]}, cv=LeaveOneOut(), scoring='f1')
clf.fit(x_train, y_train)
produces this warning:
UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no true nor predicted samples. Use `zero_division` parameter to control this behavior.
I want to compute f1 score not of each fold of cross validation (it is not possible to compute f1 score of the only one test example), but to compute f1 score based on the whole iteration set with n_neighbors = n.
Is it possible using GridSearchCV?

Not sure if this functionality is directly available in Scikit-Learn, but you can implement the following function to get the desired outcome.
In particular, we will make a dummy scorer which just returns the predicted class instead of computing any score using the ground-truth and the prediction. In this way we can access the predictions of each hyperparameters combination on the different examples in the LOO cv.
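from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

def get_pred(y_true, y_predicted):
    # Dummy "metric": return the prediction itself instead of a score
    return y_predicted

get_pred_scorer = make_scorer(get_pred)
clf = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': [1, 2, 3]},
    cv=LeaveOneOut(),
    refit=False,
    scoring=get_pred_scorer
)
clf.fit(X_train, y_train)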
The problem with this approach is that certain results available in the cv_results_ dictionary (and in certain attributes of GridSearchCV) won't have any meaning, but that probably is not a problem. We should just remember to put refit=False, since GridSearchCV doesn't have a way to determine the best model.
Now we can access the predictions through cv_results_ and just use f1_score to compute the metric for each hyperparams configuration.
def print_params_f1_scores(clf, y_true):
    y_preds = []  # will contain the predictions of each params combination
    results = clf.cv_results_
    params = results["params"]  # all params combinations
    for j in range(len(params)):  # for each combination
        y_preds.append([])
        for i in range(clf.n_splits_):  # for each split (sample in loo)
            prediction_of_j_on_i = results[f"split{i}_test_score"][j]
            y_preds[j].append(prediction_of_j_on_i)
    # show the f1-scores of each combination
    for j in range(len(y_preds)):
        score = f1_score(y_true, y_preds[j])
        print(f"KNeighborsClassifier with {params[j]} obtained f1-score of {score}")

print_params_f1_scores(clf, y_train)
The function prints the following output:
KNeighborsClassifier with {'n_neighbors': 1} obtained f1-score of 0.94
KNeighborsClassifier with {'n_neighbors': 2} obtained f1-score of 0.94
KNeighborsClassifier with {'n_neighbors': 3} obtained f1-score of 0.92
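An alternative that avoids the dummy scorer, offered as a minimal sketch: `cross_val_predict` collects the out-of-fold prediction for every left-out sample, so a single F1 score can be computed over the pooled predictions for each `n_neighbors`. The synthetic data from `make_classification` and the names `pooled_f1` and `best_k` are assumptions for illustration; they stand in for your own `X_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train (assumption, not the asker's data)
X_train, y_train = make_classification(n_samples=60, n_features=5, random_state=0)

pooled_f1 = {}
for k in [1, 2, 3]:
    # One prediction per sample, each made by a model that never saw that sample
    y_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=k),
                               X_train, y_train, cv=LeaveOneOut())
    # A single F1 over all leave-one-out predictions
    pooled_f1[k] = f1_score(y_train, y_pred)

best_k = max(pooled_f1, key=pooled_f1.get)
print(pooled_f1, best_k)
```

This skips GridSearchCV entirely, so there is no `cv_results_` to misread; the trade-off is that the loop over parameter values is written by hand.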

Related

Why is the ROC_AUC from cross_val_score so much higher than manually using a StratifiedKFold with metrics.roc_auc_score for an XGB classifier?

Method 1 - StratifiedKFold cross validation
skf = StratifiedKFold(n_splits=5, shuffle=False)
roc_aucs_temp = []
for i, (train_index, test_index) in enumerate(skf.split(X_train_xgb, y_train_xgb)):
    X_train_fold, X_test_fold = X_train_xgb.iloc[train_index], X_train_xgb.iloc[test_index]
    y_train_fold, y_test_fold = y_train_xgb[train_index], y_train_xgb[test_index]
    xgb_temp.fit(X_train_fold, y_train_fold)
    y_pred=model.predict(X_test_fold)
    roc_aucs_temp.append(metrics.roc_auc_score(y_test_fold, y_pred))
print(roc_aucs_temp)
[0.8622474747474748, 0.8497474747474747, 0.9045918367346939, 0.8670918367346939, 0.879591836734694]
Method 2 CrossValScore
# this uses the same CV object as method 1
print(cross_val_score(xgb, X_train_xgb, y_train_xgb, cv=skf, scoring='roc_auc'))
[0.9614899 0.94861111 0.96045918 0.97270408 0.96977041]
I might be misunderstanding the functionality of cross_val_score, but from my understanding it creates K folds of training and test data. It then trains the model on K-1 folds, and tests on 1 fold, repeatedly. It should be around the same accuracy as manually creating K Folds with StratifiedKFold. Why isn't it?
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
The documentation for roc_auc_score indicates its second argument is the label scores rather than the predicted labels. Like they show in their example, you probably want something like model.predict_proba(X_test_fold)[:, 1] instead of model.predict(X_test_fold).
cross_val_score with roc_auc is evaluating it that way, and that is why you are seeing the difference.
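To make the difference concrete, here is a small sketch on hypothetical toy values (the arrays below are invented for illustration, not taken from the question). It scores the same predictions twice: once as probabilities and once as hard labels thresholded at 0.5.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical toy data: true labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1, 1]
y_proba = [0.1, 0.6, 0.4, 0.7, 0.9]
# Hard labels obtained by thresholding at 0.5, as predict() would return
y_label = [int(p >= 0.5) for p in y_proba]

auc_from_proba = roc_auc_score(y_true, y_proba)  # ranks all five scores
auc_from_label = roc_auc_score(y_true, y_label)  # only two distinct values, many ties
print(auc_from_proba, auc_from_label)
```

With hard labels, every positive/negative pair that shares a label counts as a tie, so the ranking information that ROC AUC measures is mostly lost. On this toy data the two values come out as 5/6 and 3.5/6.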

Different result roc_auc_score and plot_roc_curve

I am training a RandomForestClassifier (sklearn) to predict credit card fraud. When I test the model and check the ROC AUC score, I get different values from roc_auc_score and plot_roc_curve: roc_auc_score gives me around 0.89, while plot_roc_curve computes an AUC of 0.96. Why is that?
The labels are all 0 or 1, and the predictions are 0 or 1 as well.
Code:
clf = RandomForestClassifier(random_state =42)
clf.fit(X_train, y_train[target].values)
pred_test = clf.predict(X_test)
print(roc_auc_score(y_test, pred_test))
clf_disp = plot_roc_curve(clf, X_test, y_test)
plt.show()
Output of the code (the roc_auc_score is just above the graph).
You are feeding the prediction classes instead of prediction probabilities to roc_auc_score.
From Documentation:
y_score: array-like of shape (n_samples,) or (n_samples, n_classes)
Target scores. In the binary and multilabel cases, these can be either probability estimates or non-thresholded decision values (as returned by decision_function on some classifiers).
Change your code to:
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train[target].values)
y_score = clf.predict_proba(X_test)
print(roc_auc_score(y_test, y_score[:, 1]))
The ROC Curve and the roc_auc_score take the prediction probabilities as input, but as I can see from your code you are providing the prediction labels. You need to fix that.
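As a rough check of this explanation, the sketch below builds the ROC curve from predicted probabilities with `roc_curve` and integrates it with `auc`, then compares the result with `roc_auc_score` on the same probabilities and on hard labels. The synthetic, imbalanced dataset is an assumption standing in for the fraud data. Note also that `plot_roc_curve` was removed in scikit-learn 1.2; `RocCurveDisplay.from_estimator` is its replacement.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the fraud data (assumption)
X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)
y_score = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# Area under the curve built from the probabilities, as a ROC plot would use
fpr, tpr, _ = roc_curve(y_test, y_score)
auc_from_curve = auc(fpr, tpr)
auc_from_scores = roc_auc_score(y_test, y_score)
auc_from_labels = roc_auc_score(y_test, clf.predict(X_test))
print(auc_from_curve, auc_from_scores, auc_from_labels)
```

The first two numbers agree because both are computed from the probabilities; the third is computed from hard labels and will generally differ.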

Micro F1 score in Scikit-Learn with Class imbalance

I have some class imbalance and a simple baseline classifier that assigns the majority class to every sample:
from sklearn.metrics import precision_score, recall_score, confusion_matrix
y_true = [0,0,0,1]
y_pred = [0,0,0,0]
confusion_matrix(y_true, y_pred)
This yields
[[3, 0],
[1, 0]]
This means TP=3, FP=1, FN=0.
So far, so good. Now I want to calculate the micro average of precision and recall.
precision_score(y_true, y_pred, average='micro') # yields 0.75
recall_score(y_true, y_pred, average='micro') # yields 0.75
I am Ok with the precision, but why is recall not 1.0? How can they ever be the same in this example, given that FP > 0 and FN == 0? I know it must have to do with the micro averaging, but I can't wrap my head around this one.
Yes, it's because of micro-averaging. See the documentation here to see how it's calculated:
Note that if all labels are included, “micro”-averaging in a
multiclass setting will produce precision, recall and f-score that are all
identical to accuracy.
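As you can see in the above linked page, micro-averaged precision and recall are defined as $P(y, \hat{y})$ and $R(y, \hat{y})$, where $y$ is the set of true (sample, label) pairs and $\hat{y}$ is the set of predicted (sample, label) pairs, and for two sets $A$ and $B$:
$P(A, B) = \frac{|A \cap B|}{|B|}$ and $R(A, B) = \frac{|A \cap B|}{|A|}$
With exactly one label per sample, $|y| = |\hat{y}|$ (one pair per sample), so both ratios have the same denominator and are equal.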
So in your case, Recall-micro will be calculated as
R = number of correct predictions / total predictions = 3/4 = 0.75
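The arithmetic can be checked by hand. The sketch below (plain Python, no scikit-learn) pools true positives, false positives and false negatives over both classes, which is what micro-averaging does; the variable names are illustrative.

```python
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]

tp = fp = fn = 0
for label in sorted(set(y_true) | set(y_pred)):
    # Treat each class in turn as the "positive" class and pool the counts
    tp += sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp += sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn += sum(t == label and p != label for t, p in zip(y_true, y_pred))

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
print(tp, fp, fn, micro_precision, micro_recall)
```

The single misclassified sample counts once as a false positive (for class 0) and once as a false negative (for class 1), so the pooled FP and FN are always equal in a single-label problem, and micro precision and recall coincide.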

scikit-learn LogisticRegressionCV: best coefficients

I am trying to understand how the best coefficients are calculated in a logistic regression cross-validation, where the "refit" parameter is True.
If I understand the docs correctly, the best coefficients are the result of first determining the best regularization parameter "C", i.e., the value of C that has the highest average score over all folds. Then, the best coefficients are simply the coefficients that were calculated on the fold that has the highest score for the best C. I assume that if the maximum score is achieved by several folds, the coefficients of these folds would be averaged to give the best coefficients (I didn't see anything on how this case is handled in the docs).
To test my understanding, I determined the best coefficients in two different ways:
1. directly from the coef_ attribute of the fitted model, and
2. from the coefs_paths attribute, which contains the path of the coefficients obtained during cross-validating across each fold and then across each C.
The results I get from 1. and 2. are similar but not identical, so I was hoping someone could point out what I am doing wrong here.
Thanks!
An example to demonstrate the issue:
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Set parameters
n_folds = 10
C_values = [0.001, 0.01, 0.05, 0.1, 1., 100.]
# Load and preprocess data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_train)
# Fit model
clf = LogisticRegressionCV(Cs=C_values, cv=n_folds, penalty='l1',
                           refit=True, scoring='roc_auc',
                           solver='liblinear', random_state=0,
                           fit_intercept=False)
clf.fit(X_train_scaled, y_train)
########################
# Get and plot coefficients using method 1
########################
coefs1 = clf.coef_
coefs1_series = pd.Series(coefs1.ravel(), index=cancer['feature_names'])
coefs1_series.sort_values().plot(kind="barh")
########################
# Get and plot coefficients using method 2
########################
# mean of scores of class "1"
scores = clf.scores_[1]
mean_scores = np.mean(scores, axis=0)
# Get index of the C that has the highest average score across all folds
best_C_idx = np.where(mean_scores==np.max(mean_scores))[0][0]
# Get index (here: indices) of the folds with highest scores for the
# best C
best_folds_idx = np.where(scores[:, best_C_idx]==np.max(scores[:, best_C_idx]))[0]
paths = clf.coefs_paths_[1] # has shape (n_folds, len(C_values), n_features)
coefs2 = np.squeeze(paths[best_folds_idx, best_C_idx, :])
coefs2 = np.mean(coefs2, axis=0)
coefs2_series = pd.Series(coefs2.ravel(), index=cancer['feature_names'])
coefs2_series.sort_values().plot(kind="barh")
I think this article answers your question: https://orvindemsy.medium.com/understanding-grid-search-randomized-cvs-refit-true-120d783a5e94.
The key point is the refit parameter of LogisticRegressionCV.
According to sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html)
refit : bool, default=True
If set to True, the scores are averaged across all folds, and the coefs and the C that corresponds to the best score is taken, and a final refit is done using these parameters. Otherwise the coefs, intercepts and C that correspond to the best scores across folds are averaged.
Best.
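Read this way, with refit=True the final coefficients come from one more fit on all the training data, so they need not match any per-fold entry of coefs_paths_. That would explain why the two methods in the question give similar but not identical values. The sketch below mimics the documented steps by hand with plain `LogisticRegression`; it is an illustration under simplifying assumptions (default L2 penalty, five folds, a shorter `C_values` list, scaling applied before splitting), not a reproduction of the internals of `LogisticRegressionCV`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
C_values = [0.01, 0.1, 1.0, 10.0]
folds = list(StratifiedKFold(n_splits=5).split(X, y))

# scores[i, j]: ROC AUC of fold i with C_values[j]
scores = np.zeros((len(folds), len(C_values)))
for i, (train, test) in enumerate(folds):
    for j, C in enumerate(C_values):
        model = LogisticRegression(C=C, max_iter=1000).fit(X[train], y[train])
        scores[i, j] = roc_auc_score(y[test], model.decision_function(X[test]))

# Step 1: average over folds and pick the best C
best_C = C_values[int(np.argmax(scores.mean(axis=0)))]
# Step 2: refit once on ALL the data with that C
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X, y)
final_coef = final_model.coef_
print(best_C, final_coef.shape)
```

Here `final_coef` is fitted on every sample, whereas each per-fold model saw only four fifths of the data, so exact agreement with any fold's coefficients should not be expected.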

Shouldn't a SVM binary classifier understand the threshold from the training set?

I'm very confused about SVM classifiers and I'm sorry if I'll sound stupid.
I'm using the Spark library for java http://spark.apache.org/docs/latest/mllib-linear-methods.html, the first example from the Linear Support Vector Machines paragraph. On this training set:
1 1:10
1 1:9
1 1:9
1 1:9
0 1:1
1 1:8
1 1:8
0 1:2
0 1:2
0 1:3
the prediction on values: 8, 2 and 1 are all positive (1). Given the training set, I would expect them to be positive, negative, negative. It gives negative only on 0 or negative values. I read that the standard threshold is "positive" if the prediction is a positive double, "negative" if it's negative, and I've seen that there is a method to manually set the threshold. But isn't this the exact reason I need a binary classifier for? I mean, if I know in advance what the threshold is I can distinguish between positive and negative values, so why bother training a classifier?
UPDATE:
Using this python code from a different library:
X = [[10], [9], [9], [9], [1], [8], [8], [2], [2], [3]]
y = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
import numpy as np

# we convert our list of lists in numpy arrays
X = np.array(X)
y = np.array(y)
# we compute the general accuracy of the system - we need more "false questions" to continue the study
accuracy = []

# we do 5-fold cross-validation - to be sure to test all possible combination of training and test
kf_total = StratifiedKFold(n_splits=5, shuffle=True)
for train, test in kf_total.split(X, y):
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    print(X_train)
    clf = SVC().fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("the classifier says: ", y_pred)
    print("reality is: ", y_test)
    print(accuracy_score(y_test, y_pred))
    print("")
    accuracy.append(accuracy_score(y_test, y_pred))
print(sum(accuracy) / len(accuracy))
the results are correct:
######
1 [0]
######
2 [0]
######
8 [1]
So I think it's possible for a SVM classifier to understand the threshold by itself; how can I do the same with the spark library?
SOLVED: I solved the issue changing the example to this:
SVMWithSGD std = new SVMWithSGD();
std.setIntercept(true);
final SVMModel model = std.run(training.rdd());
From this:
final SVMModel model = SVMWithSGD.train(training.rdd(), numIterations);
The standard value for "intercept" is false, which is what I needed to be true.
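A short sketch of why the intercept matters here, using hypothetical weights rather than anything Spark actually learned: with a single positive-valued feature and no intercept, the decision value is weight * x, whose sign is the same for every x > 0. That matches the behaviour described above, where only 0 or negative inputs were classified as negative.

```python
def predict(x, weight, intercept=0.0):
    # Linear decision rule: positive class when weight * x + intercept > 0
    return 1 if weight * x + intercept > 0 else 0

inputs = [8, 2, 1]

# Hypothetical weight with no intercept: every positive x gets the same class
without_intercept = [predict(x, weight=0.5) for x in inputs]

# Hypothetical weight and intercept: the boundary sits at x = 5.5
with_intercept = [predict(x, weight=1.0, intercept=-5.5) for x in inputs]

print(without_intercept)  # [1, 1, 1]
print(with_intercept)     # [1, 0, 0]
```

Without an intercept the decision boundary is pinned to x = 0; the intercept is what lets the model move it to somewhere between 3 and 8.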
If you search for probability calibration you will find some research on a related matter (recalibrating the outputs to return better scores).
If your problem is a binary classification problem, you can calculate the slope of the cost by assigning values to the true/false positive/negative outcomes, multiplied by the class ratio. You can then form a line with that slope and find where it touches the ROC curve at only one point; that point is, in some sense, an optimal threshold for your problem.
A threshold is a single value that separates the classes.

Resources