Sklearn logistic regression - adjust cutoff point - python-3.x

I have a logistic regression model trying to predict one of two classes: A or B.
My model's accuracy when predicting A is ~85%.
Model's accuracy when predicting B is ~50%.
Prediction of B is not important; however, prediction of A is very important.
My goal is to maximize the accuracy when predicting A. Is there any way to adjust the default decision threshold when determining the class?
from sklearn.linear_model import LogisticRegression
import numpy as np
classifier = LogisticRegression(penalty='l2', solver='saga', multi_class='ovr')
classifier.fit(np.float64(X_train), np.float64(y_train))
Thanks!
RB

As mentioned in the comments, the procedure of selecting a threshold is done after training. You can find the threshold that maximizes a utility function of your choice, for example:
import numpy as np
from sklearn import metrics

preds = classifier.predict_proba(test_data)
fpr, tpr, thresholds = metrics.roc_curve(test_y, preds[:, 1])
print(thresholds)

accuracy_ls = []
for thres in thresholds:
    y_pred = np.where(preds[:, 1] > thres, 1, 0)
    # Apply the desired utility function to y_pred, for example accuracy.
    accuracy_ls.append(metrics.accuracy_score(test_y, y_pred, normalize=True))
After that, choose the threshold that maximizes the chosen utility function; in your case, choose the threshold that maximizes the accuracy for the class you care about (class A, encoded as 1 in y_pred).
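For completeness, a minimal sketch (not part of the original answer; X_new stands for a hypothetical array of new samples) of how the selected threshold could then be applied instead of the default 0.5 cutoff used by predict():
# Pick the threshold that maximizes the chosen utility function (here: accuracy).
best_thres = thresholds[np.argmax(accuracy_ls)]
# Apply it manually to new samples (X_new is a placeholder for your own data).
new_probs = classifier.predict_proba(np.float64(X_new))[:, 1]
new_labels = np.where(new_probs > best_thres, 1, 0)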

Related

Threshold does not work on numpy array for accuracy metric

I am trying to implement logistic regression from scratch using numpy. I wrote a class with the following methods to implement logistic regression for a binary classification problem and to score it based on BCE loss or Accuracy.
def accuracy(self, true_labels, predictions):
    """
    This method implements the accuracy score. Where the accuracy is the number
    of correct predictions our model has.
    args:
        true_labels: vector of shape (1, m) that contains the class labels where,
            m is the number of samples in the batch.
        predictions: vector of shape (1, m) that contains the model predictions.
    """
    counter = 0
    for y_true, y_pred in zip(true_labels, predictions):
        if y_true == y_pred:
            counter += 1
    return counter / len(true_labels)
def train(self, score='loss'):
    """
    This function trains the logistic regression model and updates the
    parameters based on the Batch Gradient Descent algorithm.
    The function prints the training loss and validation loss on every epoch.
    args:
        X: input features with shape (num_features, m) or (num_features) for a
            singular sample, where m is the size of the dataset.
        Y: gold class labels of shape (1, m) or (1) for a singular sample.
    """
    train_scores = []
    dev_scores = []
    for i in range(self.epochs):
        # Perform forward and backward propagation & get the training predictions.
        training_predictions = self.propagation(self.X_train, self.Y_train)
        # Get the predictions of the validation data.
        dev_predictions = self.predict(self.X_dev, self.Y_dev)
        # Calculate the scores of the predictions.
        if score == 'loss':
            train_score = self.loss_function(training_predictions, self.Y_train)
            dev_score = self.loss_function(dev_predictions, self.Y_dev)
        elif score == 'accuracy':
            train_score = self.accuracy((training_predictions==+1).squeeze(), self.Y_train)
            dev_score = self.accuracy((dev_predictions==+1).squeeze(), self.Y_dev)
        train_scores.append(train_score)
        dev_scores.append(dev_score)
    plot_training_and_validation(train_scores, dev_scores, self.epochs, score=score)
After testing the code with the following input
model = LogisticRegression(num_features=X_train.shape[0],
                           Learning_rate=0.01,
                           Lambda=0.001,
                           epochs=500,
                           X_train=X_train,
                           Y_train=Y_train,
                           X_dev=X_dev,
                           Y_dev=Y_dev,
                           normalize=False,
                           regularize=False)
model.train(score='loss')
I get the following results.
However, when I swap the scoring metric from loss to accuracy, as follows: model.train(score='accuracy'), I get the following result:
I have removed normalization and regularization to make sure I am using a simple implementation of logistic regression.
Note that I use an external method to visualize the training/validation score over time in the LogisticRegression.train() method.
The trick you are using to create your predictions before passing them into the accuracy method is wrong. You are using (dev_predictions==+1).
Your problem statement is a logistic regression model, which generates a value between 0 and 1. Most of the time, the values will NOT be exactly equal to +1.
So essentially, every time you are passing a bunch of False or 0 values to the accuracy function. I bet that if you check, the proportion of samples in your datasets having the label False or 0 would be:
exactly 51.7 % in the validation dataset
exactly 56.2 % in the training dataset.
To fix this, you can use an in-between threshold like 0.5 to generate your labels. So use something like dev_predictions > 0.5.
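As a toy illustration of that fix (assumed values, not from the original answer):
import numpy as np

# Sigmoid outputs are floats in (0, 1); comparing them to +1 gives all False.
dev_predictions = np.array([[0.2, 0.7, 0.9, 0.4]])
print(dev_predictions == +1)                      # [[False False False False]]
# A 0.5 cutoff turns them into usable hard labels instead.
dev_labels = np.where(dev_predictions > 0.5, 1, 0).squeeze()
print(dev_labels)                                 # [0 1 1 0]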

Hyperparameter optimization in Gaussian Process in scikit-learn

Are the hyper-parameters in the Gaussian Process Regressor optimized during fitting in scikit-learn?
In the page
https://scikit-learn.org/stable/modules/gaussian_process.html
it is said:
"The hyperparameters of the kernel are optimized during fitting of GaussianProcessRegressor by maximizing the log-marginal-likelihood (LML) based on the passed optimizer"
So it is not required, for instance, to optimize them by using grid search?
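To see the quoted behaviour directly, here is a minimal sketch (not from the original post; the 1-D toy data and the RBF kernel are assumptions) showing that fit() changes the kernel's length scale by maximizing the log-marginal-likelihood:
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy 1-D regression data.
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = np.sin(X).ravel()

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), random_state=0)
gpr.fit(X, y)

print(gpr.kernel)    # initial kernel: RBF(length_scale=1)
print(gpr.kernel_)   # fitted kernel: length_scale optimized via the LML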
A hyperparameter is something that you need to specify; usually the best way to choose it is within a pipeline (a series of steps) in which you try many hyperparameter values and keep the best one. Here is an example of trying different hyperparameters for a k-NN classifier, in which you give a list of values (n_neighbors for k-NN) in order to see which one works best. Hope it helps you!
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    # Set up a k-NN classifier with k neighbors
    knn = KNeighborsClassifier(n_neighbors=k)
    # Fit the classifier to the training data
    knn.fit(X_train, y_train)
    # Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)
    # Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label='Testing Accuracy')
plt.plot(neighbors, train_accuracy, label='Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.show()
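The same kind of search can also be expressed with GridSearchCV (a sketch under the assumption that the X_train and y_train used above are available):
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Cross-validated grid search over the same n_neighbors values.
param_grid = {'n_neighbors': list(range(1, 9))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)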

What is the accuracy of a clustering algorithm?

I have a set of points that I have clustered using a clustering algorithm (k-means in this case). I also know the ground-truth labels and I want to measure how accurate my clustering is. What I need is to find the actual accuracy. The problem, of course, is that the labels given by the clustering do not match the ordering of the original one.
Is there a way to measure this accuracy? The intuitive idea would be to compute the score of the confusion matrix of every combination of labels, and only keep the maximum. Is there a function that does this?
I have also evaluated my results using the Rand score and the adjusted Rand score. How close are these two measures to the actual accuracy?
Thanks!
First of all, what does "The problem, of course, is that the labels given by the clustering do not match the ordering of the original one." mean?
If you know the ground truth labels then you can re-arrange them to match the order of the X matrix and in that way, the Kmeans labels will be in accordance with the true labels after prediction.
In this situation, I suggest the following.
If you have the ground truth labels and you want to see how accurate your model is, then you need metrics such as the Rand index or the mutual information between the predicted and true labels. You can do that in a cross-validation scheme and see whether the model can correctly predict the classes/labels. The goodness of the predictions can then be assessed using metrics like the Rand index.
In summary:
Define a KMeans model, use cross-validation, and in each iteration estimate the Rand index (or mutual information) between the cluster assignments and the true labels. Repeat that for all iterations and finally take the mean of the Rand index scores. If this score is high, the model is good.
Full example:
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut
import numpy as np

# some data
data = load_iris()
X = data.data
y = data.target  # ground truth labels

loo = LeaveOneOut()
rand_index_scores = []
for train_index, test_index in loo.split(X):  # LOOCV here
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # the model
    kmeans = KMeans(n_clusters=3, random_state=0)
    kmeans.fit(X_train)  # fit using training data
    predicted_labels = kmeans.predict(X_test)  # predict using test data
    rand_index_scores.append(adjusted_rand_score(y_test, predicted_labels))  # calculate goodness of predicted labels

print(np.mean(rand_index_scores))
Since clustering is an unsupervised learning problem, you have specific metrics for it: https://scikit-learn.org/stable/modules/classes.html#clustering-metrics
You can refer to the discussion in the scikit-learn User Guide to have an idea of the differences between the different metrics for clustering: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
For instance, the adjusted Rand index compares pairs of points and checks whether pairs that have the same label in the ground truth also have the same label in the predictions. Unlike accuracy, it does not require strict label equality.
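A tiny illustration of that difference (toy labels assumed, not from the original answer):
from sklearn.metrics import adjusted_rand_score, accuracy_score

# Same partition, but the cluster names are swapped.
y_true = [0, 0, 1, 1]
y_pred = [1, 1, 0, 0]
print(adjusted_rand_score(y_true, y_pred))  # 1.0 -- label names do not matter
print(accuracy_score(y_true, y_pred))       # 0.0 -- strict label equality fails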
You can use sklearn.metrics.accuracy_score as documented in the link below:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
An example can be seen in the link below:
sklearn: calculating accuracy score of k-means on the test data set
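For the "try every combination of labels on the confusion matrix" idea from the question, here is a minimal sketch (not from the original answers; it assumes SciPy is available) that matches cluster labels to true labels with the Hungarian algorithm and then computes an accuracy:
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def clustering_accuracy(y_true, y_pred):
    # Confusion matrix between true classes (rows) and cluster labels (columns).
    cm = confusion_matrix(y_true, y_pred)
    # Find the label matching that maximizes the total matched count.
    row_ind, col_ind = linear_sum_assignment(-cm)
    return cm[row_ind, col_ind].sum() / cm.sum()

# Hypothetical example: clusters are a permutation of the true classes.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])
print(clustering_accuracy(y_true, y_pred))  # 1.0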

How can I use ensemble balancing results in base algorithms

I have a large imbalanced data set and I will apply under-sampling + boosting and sampling + bagging
to balance my data.
My question is: how can I apply base algorithms such as KNN to the result of the balancing?
For example:
from imblearn.ensemble import RUSBoostClassifier
from sklearn.metrics import balanced_accuracy_score

rusboost = RUSBoostClassifier(random_state=0)
rusboost.fit(X_train, y_train)
y_pred = rusboost.predict(X_test)
print(balanced_accuracy_score(y_test, y_pred))
How can I take the result from this code as labels and features and feed it into another training step?
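A hedged sketch (assuming imbalanced-learn and the X_train/X_test/y_train/y_test from above) of one way to do this: resample the training data first, then fit a base estimator such as KNN on the balanced features and labels:
from imblearn.under_sampling import RandomUnderSampler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import balanced_accuracy_score

# Under-sample the majority class to obtain a balanced training set.
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

# Train any base algorithm (here KNN) on the balanced data.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_resampled, y_resampled)
print(balanced_accuracy_score(y_test, knn.predict(X_test)))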

Binary classification (logistic regression) predict wrong label with high accuracy

I have a problem where a binary logistic regression classifier (using scikit-learn, Python 2.7) is predicting the wrong/opposite class with high accuracy. That is, after fitting the model, the predicted scores and predicted probabilities for each class are very consistent but always for the wrong class. I cannot share the data, but some pseudo-code of my approach is:
X = np.vstack((cond_1, cond_2))  # shape of X = 200*51102
y = np.concatenate([np.zeros(len(cond_1)), np.ones(len(cond_2))])
scls = []
clfs = []
scores = []
for train, test in cv.split(X, y):
    clf = LogisticRegression(C=1)
    scl = StandardScaler()
    scl.fit(X[train])
    X_train = scl.transform(X[train])
    scls.append(scl)
    X_test = scl.transform(X[test])
    clf.fit(X_train, y[train])
    y_pred = clf.predict(X_test)
    scores.append(roc_auc_score(y[test], y_pred))
The roc_auc scores have a mean of 0.065% and a standard deviation of 0.05%, so something seems to be going on, but what? I have plotted the features and they seem to be normally distributed. I also looked at the probabilities from predict_proba and they are mostly above 80% for the wrong class/label.
Any ideas what is going on and/or how to properly diagnose the problem?
I apologise for not being able to ask a more precise question, but I'm lacking the vocabulary.
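One small check worth running in such a situation (a sketch with assumed toy data, not from the original post): the column order of predict_proba follows clf.classes_, so it is easy to confirm which column is being read as the positive class:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: class 1 corresponds to larger feature values.
X_toy = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(C=1).fit(X_toy, y_toy)
print(clf.classes_)              # [0 1] -- column order of predict_proba
print(clf.predict_proba(X_toy))  # column 0 = P(y=0), column 1 = P(y=1)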
