sklearn.metrics r2_score negative - scikit-learn

I can't understand r2_score in sklearn.metrics, which seems to return meaningless values. I followed all the "similar questions" proposed by stackoverflow (some of which elude to wrong argument sequence, which is why I include both orders below), but I'm still lost:
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import r2_score
data = [[0.70940504,0.81604095],
[0.69506565,0.78922145],
[0.66527803,0.72174502],
[0.75251691,0.74893098],
[0.72517034,0.73999503],
[0.68269306,0.72230534],
[0.75251691,0.77163700],
[0.78954422,0.81163350],
[0.83077994,0.94561242],
[0.74107290,0.75122162]]
df = pd.DataFrame(data)
x = df[0].to_numpy().reshape(-1,1)
y = df[1].to_numpy()
print("r2 = ", r2_score(y, x))
print("r2 (wrong order) = ", r2_score(x, y))
lreg = linear_model.LinearRegression()
lreg.fit(x, y)
y_pred = lreg.predict(x)
print("predicted values: ", y_pred)
print("slope = ", lreg.coef_)
print("intercept = ", lreg.intercept_)
print("score = ", lreg.score(x, y))
returns
r2 = 0.01488309898850404 # surprise!!
r2 (wrong order) = -0.7313385423077101 # even more of a surprise!!
predicted values: [0.75664194 0.74219177 0.71217403 0.80008687 0.77252903 0.7297236 0.80008687 0.83740023 0.87895451 0.78855445]
slope = [1.00772544]
intercept = 0.04175643677503682
score = 0.5778168671193278
Plotting data and predicted values in Excel show that the linear_model return values make sense (orange dots fall on Excel trend line), but r2_score return values do not (in both argument sequences):

Your model explains nearly 60% of the target variance, which is much better than the average predictor (which would explain 0).
Why does your single feature explain less? Mainly because of the intercept in this case: r2_score(y, x + 0.042) would work nearly as good.
In a simplified way, you may think of R2 as 1 - (mean_squared_error(y, y_pred) / y.var()). Not being centered around the target mean inflates the sum of squared residuals inevitably, resulting in a poor R2.

Related

calculate adjusted R2 using GridSearchCV

I am trying to use GridSearchCV with multiple scoring metrics, one of which, the adjusted R2. The latter, as far I am concerned, is not implemented in scikit-learn. I would like to confirm whether my approach is the correct one to implement the adjusted R2.
Using the scores implemented in scikit-learn (in the example below MAE and R2), I can do something like shown below (in this dummy example I am ignoring good practices, like feature scaling and a suitable number of iterations for SVR):
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score, mean_absolute_error
#generate input
X = np.random.normal(75, 10, (1000, 2))
y = np.random.normal(200, 20, 1000)
#perform grid search
params = {"degree": [2, 3], "max_iter": [10]}
grid = GridSearchCV(SVR(), param_grid=params,
scoring={"MAE": "neg_mean_absolute_error", "R2": "r2"}, refit="R2")
grid.fit(X, y)
The example above will report the MAE and R2 for each cross-validated partition and will refit the best parameters based on the best R2. Following this example, I have attempted to do the same using a custom scorer:
def adj_r2(true, pred, p=2):
'''p is the number of independent variables and n is the sample size'''
n = true.size
return 1 - ((1 - r2_score(true, pred)) * (n - 1))/(n-p-1)
scorer=make_scorer(adj_r2)
grid = GridSearchCV(SVR(), param_grid=params,
scoring={"MAE": "neg_mean_absolute_error", "adj R2": scorer}, refit="adj R2")
grid.fit(X, y)
#print(grid.cv_results_)
The code above appears to generate values for the "adj R2" scorer. I have two questions:
Is the approach used above technically correct coding-wise?
If the approach is correct, how can I define p (number of independent variables) in a dynamic way? As you can see, I had to force a default when defining the function, but I would like to be able to define p in GridSearchCV.
Firstly, adjusted R2 score is not available in sklearn so far because the API of scoring functions just takes y_true and y_pred. Hence, measuring the dimensions of X is out of question.
We can do a work around for SearchCVs.
The scorer needs to have a signature of (estimator, X, y). This has been delivered in the make_scorer here.
I have provided a more simplified version of that here for wrapping the r2 scorer.
def adj_r2(estimator, X, y_true):
n, p = X.shape
pred = estimator.predict(X)
return 1 - ((1 - r2_score(y_true, pred)) * (n - 1))/(n-p-1)
grid = GridSearchCV(SVR(), param_grid=params,
scoring={"MAE": "neg_mean_absolute_error",
"adj R2": adj_r2}, refit="adj R2")
grid.fit(X, y)

Problem with negative numbers in sklearn.feature_selection.SelectKBest feautre scoring module

I was trying auto feature engineering and selecting, so for that, I used the Boston house price dataset available in sklearn.
from sklearn.datasets import load_boston
import pandas as pd
data = load_boston()
x = data.data
y= data.target
y = pd.DataFrame(y)
Then I implemented the feature transformation library on the dataset.
import autofeat as af
clf = af.AutoFeatRegressor()
df = clf.fit_transform(x,y)
df = pd.DataFrame(df)
After this, I implemented another function to find the score of each feature in relation to the label.
from sklearn.feature_selection import SelectKBest, chi2
X_new = SelectKBest(chi2, k=20)
X_new_done = X_new.fit_transform(df,y)
dfscores = pd.DataFrame(X_new.scores_)
dfcolumns = pd.DataFrame(X_new_done.columns)
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']
print(featureScores.nlargest(10,'Score'))
This gave error as following.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-b0fa1556bdef> in <module>()
1 from sklearn.feature_selection import SelectKBest, chi2
2 X_new = SelectKBest(chi2, k=20)
----> 3 X_new_done = X_new.fit_transform(df,y)
4 dfscores = pd.DataFrame(X_new.scores_)
5 dfcolumns = pd.DataFrame(X_new_done.columns)
ValueError: Input X must be non-negative.
I had a few negative numbers in my dataset. So how can I overcome this problem?
Note:- df has now transformations of y, its only having transformations of x.
You have a feature with all negative values:
df['exp(x005)*log(x000)']
returns
0 -3630.638503
1 -2212.931477
2 -4751.790753
3 -3754.508972
4 -3395.387438
...
501 -2022.382877
502 -1407.856591
503 -2998.638158
504 -1973.273347
505 -1267.482741
Name: exp(x005)*log(x000), Length: 506, dtype: float64
Quoting another answer (https://stackoverflow.com/a/46608239/5025009):
The error message Input X must be non-negative says it all: Pearson's chi square test (goodness of fit) does not apply to negative values. It's logical because the chi square test assumes frequencies distribution and a frequency can't be a negative number. Consequently, sklearn.feature_selection.chi2 asserts the input is non-negative.
In many cases, it may be quite safe to simply shift each feature to make it all positive, or even normalize to [0, 1] interval as suggested by EdChum.
If data transformation is for some reason not possible (e.g. a negative value is an important factor), you should pick another statistic to score your features:
sklearn.feature_selection.f_regression computes ANOVA f-value
sklearn.feature_selection.mutual_info_classif computes the mutual information
Since the whole point of this procedure is to prepare the features for another method, it's not a big deal to pick anyone, the end result usually the same or very close.

Fewer than expected purity scores in PCA analysis

I'm trying to plot the line graph of purity scores against the captured variances in PCA. The goal is to plot the line graph of purity scores against the captured variances of 89% and 99% only. In my code when the components/dimensions are 2 it captures 89% of variance and and when components/dimensions are 4 it captures 99% of variance.
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
df = pd.read_csv("clustering.csv")
X10_df = df.drop("Class",axis = 1) #feature matrix
Y10_df = df["Class"] #Target vector
X10_df = np.array(X10_df)
Y10_df = np.array(Y10_df)
scaler = StandardScaler() # Standardizing the data
df_std = scaler.fit_transform(X10_df)
pca = PCA()
pca.fit(df_std)
purity = []
n_comp = range(2,5)
for k in n_comp :
pca = PCA(n_components = k)
pca.fit(df_std)
pca.transform(df_std)
scores_pca = pca.transform(df_std)
kmeans_pca = KMeans(n_clusters=3, init ='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y12 = kmeans_pca.fit_predict(scores_pca)
purity13 = purity_score(Y10_df, pred_y12)
purity.append(purity13)
Below function calculates the purity score :
def purity_score(y_true, y_pred):
contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)
However, while I have four variance scores, I only have three purity scores. I expected to have four purity scores so that I could create a plot of the variance vs purity.
Why there are only three purity scores?
Here is the link to my dataset file : https://gofile.io/d/3CgFTi
This is simply because when you using for loop with a range, the last number in the range is ignored. So in a range(2,5), it will go 2, 3, 4 and then quite the loop. Please read on for loop in Python.

gradient descendent coust increass by each iteraction in linear regression with one feature

Hi I am learning some machine learning algorithms and for the sake of understanding I was trying to implement a linear regression algorithm with one feature using as cost function the Residual sum of squares for gradient descent method as bellow:
My pseudocode:
while not converge
w <- w - step*gradient
python code
Linear.py
import math
import numpy as num
def get_regression_predictions(input_feature, intercept, slope):
predicted_output = [intercept + xi*slope for xi in input_feature]
return(predicted_output)
def rss(input_feature, output, intercept,slope):
return sum( [ ( output.iloc[i] - (intercept + slope*input_feature.iloc[i]) )**2 for i in range(len(output))])
def train(input_feature,output,intercept,slope):
file = open("train.csv","w")
file.write("ID,intercept,slope,RSS\n")
i =0
while True:
print("RSS:",rss(input_feature, output, intercept,slope))
file.write(str(i)+","+str(intercept)+","+str(slope)+","+str(rss(input_feature, output, intercept,slope))+"\n")
i+=1
gradient = [derivative(input_feature, output, intercept,slope,n) for n in range(0,2) ]
step = 0.05
intercept -= step*gradient[0]
slope-= step*gradient[1]
return intercept,slope
def derivative(input_feature, output, intercept,slope,n):
if n==0:
return sum( [ -2*(output.iloc[i] - (intercept + slope*input_feature.iloc[i])) for i in range(0,len(output))] )
return sum( [ -2*(output.iloc[i] - (intercept + slope*input_feature.iloc[i]))*input_feature.iloc[i] for i in range(0,len(output))] )
With the main program:
import Linear as lin
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
df = pd.read_csv("test2.csv")
train = df
lin.train(train["X"],train["Y"], 0, 0)
The test2.csv:
X,Y
0,1
1,3
2,7
3,13
4,21
I resisted the value of rss on a file and noticed that the value of rss became worst at each iteration as follows:
ID,intercept,slope,RSS
0,0,0,669
1,4.5,14.0,3585.25
2,-7.25,-18.5,19714.3125
3,19.375,58.25,108855.953125
Mathematically I think it doesn't make any sense I review my own code many times I think it is correct, I am doing something else wrong?
If your cost isn't decreasing, that's usually a sign you're overshooting with your gradient descent approach, meaning too large of a step size.
A smaller step size can help. You can also look into methods for variable step sizes, which can change each iteration to get you nice convergence properties and speed; usually, these methods change the step size with some proportionality to the gradient. Of course, the specifics depend on each problem.

scikit learn: how to check coefficients significance

i tried to do a LR with SKLearn for a rather large dataset with ~600 dummy and only few interval variables (and 300 K lines in my dataset) and the resulting confusion matrix looks suspicious. I wanted to check the significance of the returned coefficients and ANOVA but I cannot find how to access it. Is it possible at all? And what is the best strategy for data that contains lots of dummy variables? Thanks a lot!
Scikit-learn deliberately does not support statistical inference. If you want out-of-the-box coefficients significance tests (and much more), you can use Logit estimator from Statsmodels. This package mimics interface glm models in R, so you could find it familiar.
If you still want to stick to scikit-learn LogisticRegression, you can use asymtotic approximation to distribution of maximum likelihiood estimates. Precisely, for a vector of maximum likelihood estimates theta, its variance-covariance matrix can be estimated as inverse(H), where H is the Hessian matrix of log-likelihood at theta. This is exactly what the function below does:
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression
def logit_pvalue(model, x):
""" Calculate z-scores for scikit-learn LogisticRegression.
parameters:
model: fitted sklearn.linear_model.LogisticRegression with intercept and large C
x: matrix on which the model was fit
This function uses asymtptics for maximum likelihood estimates.
"""
p = model.predict_proba(x)
n = len(p)
m = len(model.coef_[0]) + 1
coefs = np.concatenate([model.intercept_, model.coef_[0]])
x_full = np.matrix(np.insert(np.array(x), 0, 1, axis = 1))
ans = np.zeros((m, m))
for i in range(n):
ans = ans + np.dot(np.transpose(x_full[i, :]), x_full[i, :]) * p[i,1] * p[i, 0]
vcov = np.linalg.inv(np.matrix(ans))
se = np.sqrt(np.diag(vcov))
t = coefs/se
p = (1 - norm.cdf(abs(t))) * 2
return p
# test p-values
x = np.arange(10)[:, np.newaxis]
y = np.array([0,0,0,1,0,0,1,1,1,1])
model = LogisticRegression(C=1e30).fit(x, y)
print(logit_pvalue(model, x))
# compare with statsmodels
import statsmodels.api as sm
sm_model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(sm_model.pvalues)
sm_model.summary()
The outputs of print() are identical, and they happen to be coefficient p-values.
[ 0.11413093 0.08779978]
[ 0.11413093 0.08779979]
sm_model.summary() also prints a nicely formatted HTML summary.

Resources