How to obtain AUC-ROC instead of accuracy during cross-validation? - python-3.x

I'm performing classification on a dataset and I'm using cross-validation for modelling. Cross-validation reports accuracy for each fold, but since the classes are imbalanced, accuracy is not the right measure. I want to get AUC-ROC instead of accuracy.

cross_val_score supports a large number of scoring options; the exhaustive list is below.
['accuracy', 'recall_samples', 'f1_macro', 'adjusted_rand_score',
'recall_weighted', 'precision_weighted', 'recall_macro',
'homogeneity_score', 'neg_mean_squared_log_error', 'recall_micro',
'f1', 'neg_log_loss', 'roc_auc', 'average_precision', 'f1_weighted',
'r2', 'precision_macro', 'explained_variance', 'v_measure_score',
'neg_mean_absolute_error', 'completeness_score',
'fowlkes_mallows_score', 'f1_micro', 'precision_samples',
'mutual_info_score', 'neg_mean_squared_error', 'balanced_accuracy',
'neg_median_absolute_error', 'precision_micro',
'normalized_mutual_info_score', 'adjusted_mutual_info_score',
'precision', 'f1_samples', 'brier_score_loss', 'recall']
Here is an example showing how to use 'roc_auc'.
>>> from sklearn import datasets, linear_model
>>> from sklearn.model_selection import cross_val_score
>>> import numpy as np
>>> X, y = datasets.load_breast_cancer(return_X_y=True)
>>> model = linear_model.SGDClassifier(max_iter=50, random_state=7)
>>> print(cross_val_score(model, X, y, cv=5, scoring = 'roc_auc'))
[0.96382429 0.96996124 0.95573441 0.96646546 0.91113347]
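cross_val_score returns one metric per fold; if you want several metrics from the same folds, cross_validate from sklearn.model_selection accepts a list of scorer names and returns a dict of per-fold scores. A minimal sketch, continuing the session above:
>>> from sklearn.model_selection import cross_validate
>>> scores = cross_validate(model, X, y, cv=5, scoring=['accuracy', 'roc_auc'])
>>> print(scores['test_roc_auc'])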

Related

Why is my sklearn linear regression model producing perfect predictions?

I'm trying to do multiple linear regression with sklearn, and I have performed the following steps. However, when it comes to predicting y_pred using the trained model, I am getting a perfect r^2 = 1.0. Does anyone know why this is the case / what's going wrong with my code?
Also, sorry, I'm new to this site, so I'm not fully up to speed with the formatting/etiquette of questions!
import numpy as np
import pandas as pd
# Import and subset data
ml_data_all = pd.read_excel('C:/Users/User/Documents/RSEM/STADM/Coursework/Crime_SF/Machine_learning_collated_data.xlsx')
ml_data_1218 = ml_data_all[ml_data_all['Year'] >= 2012]
ml_data_1218.drop(columns=['Pop_MOE',
                           'Pop_density_MOE',
                           'Age_median_MOE',
                           'Sex_ratio_MOE',
                           'Income_median_household_MOE',
                           'Pop_total_pov_status_determ_MOE',
                           'Pop_total_50percent_pov_MOE',
                           'Pop_total_125percent_pov_MOE',
                           'Poverty_percent_below_MOE',
                           'Total_labourforceMOE',
                           'Unemployed_total_MOE',
                           'Unemployed_total_male_MOE'], inplace=True)
# Taking care of missing data
# Delete rows containing any NaNs
ml_data_1218.dropna(axis=0,
                    how='any',
                    inplace=True)
# DATA PREPROCESSING
# Defining X and y
X = ml_data_1218.drop(columns=['Year']).values
y = ml_data_1218['Burglaries '].values
# Encoding categorical data
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
transformer = ColumnTransformer(transformers=[("cat", OneHotEncoder(), [0])], remainder='passthrough')
X = transformer.fit_transform(X)
X.toarray()
X = pd.DataFrame.sparse.from_spmatrix(X)
# Split into Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train.iloc[:,149:] = sc_X.fit_transform(X_train.iloc[:,149:])
X_test.iloc[:,149:] = sc_X.transform(X_test.iloc[:,149:])
# Fitting multiple linear regression to training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting test set results
y_pred = regressor.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
So it turns out it was a silly mistake in the end: I forgot to drop the dependent variable (Burglaries) from the X columns, hence why the linear regression model was making perfect predictions. Now it's working (r2 = 0.56). Thanks everyone!
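In code, that fix is the one-line change below to the X definition (using the column names from the snippet above; note the trailing space in 'Burglaries '):
# Drop the dependent variable (and Year) from the features so it can't leak in
X = ml_data_1218.drop(columns=['Year', 'Burglaries ']).values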
With regression, it's often a good idea to run a correlation matrix across all of your variables (IVs and the DV). Regression likes parsimony, so removing IVs that are functionally the same (and keeping just one in the model) is better for the R^2 value (i.e. model fit). Also, if something is correlated at .97 or higher with the DV, it is basically a substitute for the DV, and all the other data is most likely superfluous.
When reading your issue (before I saw your answer), I was thinking: either this person has outrageous correlation issues, or the DV is also in the prediction data.
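A minimal sketch of that correlation check, assuming the ml_data_1218 frame from the question (the numeric_only flag, which needs a reasonably recent pandas, skips any non-numeric columns):
# Correlate every numeric column with the DV; anything near +/-1 is a red flag
corr_with_dv = ml_data_1218.corr(numeric_only=True)['Burglaries ']
print(corr_with_dv.drop('Burglaries ').sort_values())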

Why do my ML models have horrible accuracy?

I am new to machine learning and I am building my first model independently. I have a dataset that evaluates cars: it contains features for price, safety and luxury, and classifies each car as good, very good, acceptable or unacceptable. I converted all the non-numeric columns into numeric ones, trained the model and predicted on a test set. However, my predictions are awful; I used LinearRegression and r2_score outputs 0.05, which is practically 0. I have tried a few different models and all have been giving me horrible predictions and accuracy.
What am I doing wrong? I have seen tutorials and read articles with similar methodology, yet they end up with 0.92 accuracy and I'm getting 0.05. How do you build a good model for your data, and how do you know which model to use?
Code:
import numpy as np
import pandas as pd
from sklearn import preprocessing, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class value']
df = pd.read_csv('car.data.txt', index_col=False, names=columns)
for col in df.columns.values:
    try:
        if df[col].astype(int):
            pass
    except ValueError:
        enc = preprocessing.LabelEncoder()
        enc.fit(df[col])
        df[col] = enc.transform(df[col])
#Split the data
class_y = df.pop('class value')
x_train, x_test, y_train, y_test = train_test_split(df, class_y, test_size=0.2, random_state=0)
#Make the model
regression_model = linear_model.LinearRegression()
regression_model = regression_model.fit(x_train, y_train)
#Predict the test data
y_pred = regression_model.predict(x_test)
score = r2_score(y_test, y_pred)
You should not use linear regression, which is meant for predicting continuous values, not categorical ones. In your case, what you are trying to predict is categorical; technically, each rating is a class.
I would suggest trying logistic regression or another classification method such as Naive Bayes, an SVM, or a decision tree classifier instead.
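A minimal sketch of that switch, reusing the x_train/x_test split from the question:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# A classifier predicts one of the four class labels directly
clf = LogisticRegression(max_iter=1000)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print(accuracy_score(y_test, y_pred))  # accuracy is meaningful for class labels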

Generating a Learning Curve for Logistic Regression

I have a dataset of 260 microscopic images. I want to generate a learning curve for the logistic regression algorithm, but I am getting the error "'module' object is not iterable". Perhaps I don't understand something basic, as I'm a beginner who is just learning Python.
from sklearn.cross_validation import train_test_split
from imutils import paths
from scipy import misc
import numpy as np
import argparse
import imutils
import cv2
import os
from matplotlib import pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.model_selection import cross_val_score
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(50, 80, 110)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds,
          - :term:`CV splitter`,
          - an iterable yielding (train, test) splits as arrays of indices.
        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` is used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.
        Refer to the :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : int or None, optional (default=None)
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    train_sizes : array-like, shape (n_ticks,), dtype float or int
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the dtype is float, it is regarded as a
        fraction of the maximum size of the training set (that is determined
        by the selected validation method), i.e. it has to be within (0, 1].
        Otherwise it is interpreted as absolute sizes of the training sets.
        Note that for classification the number of samples usually has to
        be big enough to contain at least one sample from each class.
        (default: np.linspace(0.1, 1.0, 5))
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
# training with logistic regression
clfLR = LogisticRegression(random_state=0, solver='lbfgs')
clfLR.fit(trainFeat, trainLabels)
acc = clfLR.score(testFeat, testLabels)
print("accuracy of Logistic regression ", acc)
I am facing this problem only when I want to plot the curve; the rest of the code works fine.
# plotting the curve
estimator = LogisticRegression()
train_sizes, train_scores, valid_scores = plot_learning_curve(
    estimator, 'logistic learning curve ', trainFeat, trainLabels,
    cv=5, n_jobs=4, train_sizes=[50, 80, 110])
print(train_sizes)
plt.show()
Try running the code in the Jupyter online IDE. It plots automatically if you add a "%matplotlib" line to the import section.
If you want to keep working in this IDE, share your error message and maybe I can help you. You are probably missing one of the imports, or it might be a Python 2/3 problem.
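For what it's worth, the error in the question is consistent with the code as posted: plot_learning_curve returns the plt module, so unpacking its result into three names raises "'module' object is not iterable". A sketch of the call without unpacking, reusing trainFeat/trainLabels from above:
# plot_learning_curve returns plt, so don't tuple-unpack its result
plot_learning_curve(LogisticRegression(random_state=0, solver='lbfgs'),
                    'logistic learning curve', trainFeat, trainLabels,
                    cv=5, n_jobs=4, train_sizes=[50, 80, 110])
plt.show()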

Trying a custom computation of grid.best_score_ (obtained with GridSearchCV)

I'm trying to recompute the grid.best_score_ I obtained on my own data, without success... So I tried it on a conventional dataset, but with no more success. Here is the code:
from sklearn import datasets
from sklearn import linear_model
from sklearn.cross_validation import ShuffleSplit
from sklearn import grid_search
from sklearn.metrics import r2_score
import numpy as np
lr = linear_model.LinearRegression()
boston = datasets.load_boston()
target = boston.target
param_grid = {'fit_intercept':[False]}
cv = ShuffleSplit(target.size, n_iter=5, test_size=0.30, random_state=0)
grid = grid_search.GridSearchCV(lr, param_grid, cv=cv)
grid.fit(boston.data, target)
# got CV score computed by GridSearchCV:
print(grid.best_score_)
# 0.677708680059
# now try a custom computation of the CV score
cv_scores = []
for (train, test) in cv:
    y_true = target[test]
    y_pred = grid.best_estimator_.predict(boston.data[test, :])
    cv_scores.append(r2_score(y_true, y_pred))
print(np.mean(cv_scores))
# 0.703865991851
I can't see why they differ. GridSearchCV is supposed to use the scorer from LinearRegression, which is the r2 score. Maybe the way I compute the CV score is not the one used to compute best_score_... I'm asking here before digging into the GridSearchCV code.
Unless refit=False is passed to the GridSearchCV constructor, the winning estimator is refit on the entire dataset at the end of fit. best_score_ is the winning configuration's average score across the cross-validation splits, while best_estimator_ is an estimator with that configuration fit on all the data. Since best_estimator_ has already seen your test folds during that final refit, scoring it on them is optimistic, which is why your custom computation comes out higher. To reproduce best_score_, refit on each training split:
lr2 = linear_model.LinearRegression(fit_intercept=False)
scores2 = [lr2.fit(boston.data[train, :], target[train]).score(boston.data[test, :], target[test])
           for train, test in cv]
print(np.mean(scores2))
This will print 0.67770868005943297, matching best_score_.

sklearn: Evaluating LinearSVC's AUC

I know that one would evaluate the AUC of sklearn.svm.SVC by passing probability=True to the constructor and having the SVM predict probabilities, but I'm not sure how to evaluate sklearn.svm.LinearSVC's AUC. Does anyone have any idea how?
I'd like to use LinearSVC over SVC because LinearSVC seems to train faster on data with many attributes.
You can use the CalibratedClassifierCV class to extract probabilities from a LinearSVC. Here is an example.
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2]  # using only two features
y = iris.target  # 3 classes: 0, 1, 2

linear_svc = LinearSVC()  # the base estimator

# The calibrated classifier, which can produce probabilistic predictions
calibrated_svc = CalibratedClassifierCV(linear_svc,
                                        method='sigmoid',  # Platt scaling; see the docs for other methods
                                        cv=3)
calibrated_svc.fit(X, y)

# predict
prediction_data = [[2.3, 5],
                   [4, 7]]
predicted_probs = calibrated_svc.predict_proba(prediction_data)  # important: use predict_proba
print(predicted_probs)
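The iris example above has three classes, so it demonstrates the probability extraction rather than the AUC itself (roc_auc_score would need a multiclass strategy there). A minimal sketch of the full AUC evaluation on a binary dataset, with breast_cancer as a stand-in:
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hold out a test set on a binary problem
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Calibrate LinearSVC so it exposes predict_proba
clf = CalibratedClassifierCV(LinearSVC(), method='sigmoid', cv=3)
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
print(roc_auc_score(y_test, probs))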
It looks like it's not possible to get probabilities out of LinearSVC directly:
https://github.com/scikit-learn/scikit-learn/issues/4820
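Worth noting, though: ROC AUC only needs a ranking of the samples, not probabilities, so for a binary problem you can also pass LinearSVC's decision_function output straight to roc_auc_score (reusing the split from the sketch above):
# decision_function scores rank the samples, which is all AUC needs
svc = LinearSVC().fit(X_train, y_train)
print(roc_auc_score(y_test, svc.decision_function(X_test)))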
