calculate adjusted R2 using GridSearchCV - python-3.x

I am trying to use GridSearchCV with multiple scoring metrics, one of which, the adjusted R2. The latter, as far I am concerned, is not implemented in scikit-learn. I would like to confirm whether my approach is the correct one to implement the adjusted R2.
Using the scores implemented in scikit-learn (in the example below MAE and R2), I can do something like shown below (in this dummy example I am ignoring good practices, like feature scaling and a suitable number of iterations for SVR):
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score, mean_absolute_error
#generate input
X = np.random.normal(75, 10, (1000, 2))
y = np.random.normal(200, 20, 1000)
#perform grid search
params = {"degree": [2, 3], "max_iter": [10]}
grid = GridSearchCV(SVR(), param_grid=params,
scoring={"MAE": "neg_mean_absolute_error", "R2": "r2"}, refit="R2")
grid.fit(X, y)
The example above will report the MAE and R2 for each cross-validated partition and will refit the best parameters based on the best R2. Following this example, I have attempted to do the same using a custom scorer:
def adj_r2(true, pred, p=2):
'''p is the number of independent variables and n is the sample size'''
n = true.size
return 1 - ((1 - r2_score(true, pred)) * (n - 1))/(n-p-1)
scorer=make_scorer(adj_r2)
grid = GridSearchCV(SVR(), param_grid=params,
scoring={"MAE": "neg_mean_absolute_error", "adj R2": scorer}, refit="adj R2")
grid.fit(X, y)
#print(grid.cv_results_)
The code above appears to generate values for the "adj R2" scorer. I have two questions:
Is the approach used above technically correct coding-wise?
If the approach is correct, how can I define p (number of independent variables) in a dynamic way? As you can see, I had to force a default when defining the function, but I would like to be able to define p in GridSearchCV.

Firstly, adjusted R2 score is not available in sklearn so far because the API of scoring functions just takes y_true and y_pred. Hence, measuring the dimensions of X is out of question.
We can do a work around for SearchCVs.
The scorer needs to have a signature of (estimator, X, y). This has been delivered in the make_scorer here.
I have provided a more simplified version of that here for wrapping the r2 scorer.
def adj_r2(estimator, X, y_true):
n, p = X.shape
pred = estimator.predict(X)
return 1 - ((1 - r2_score(y_true, pred)) * (n - 1))/(n-p-1)
grid = GridSearchCV(SVR(), param_grid=params,
scoring={"MAE": "neg_mean_absolute_error",
"adj R2": adj_r2}, refit="adj R2")
grid.fit(X, y)

Related

Pytorch bincount with gradient

I am trying to get gradient from sum of some indexes of an array using bincount. However, pytorch does not implement the gradient. This can be implemented by a loop and torch.sum but it is too slow. Is it possible to do this efficiently in pytorch (maybe einsum or index_add)? Of course, we can loop over indexes and add one by one, however that would increase the computational graph size significantly and is very low performance.
import torch
from torch import autograd
import numpy as np
tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)
inds = tt([1, 5, 7, 1], False).long()
y = tt(np.arange(4) + 0.1).float()
sum_y_section = torch.bincount(inds, y * y, minlength=8)
#sum_y_section = torch.sum(y * y)
grad = autograd.grad(sum_y_section, y, create_graph=True, allow_unused=False)
print("sum_y_section", sum_y_section)
print("grad", grad)
We can use a new feature in Pytorch V1.11 called scatter_reduce.
bincount = lambda inds, arr: torch.scatter_reduce(arr, 0, inds, reduce="sum")
I’d try to use a hook to manipulate the gradient in a custom way

What is the best way to find a function to closely approximate this data?

I am working with Python and linear regression, but can't seem to find a way to generate an accurate function. The following graph was generated from a 1000 element list of values.
I have tried Skicit-learn, but I can't get it to actually learn and improve the estimate.
Ideally, the function will closely mirror the graph. The graph itself is blatantly sinusoidal, so I imagine that this might be straightforward.
here is an example for the RandomForestRegressor It's based on a tutorial I did to learn, so intellectual property might belong to somebody else. If anybody knows the proper reference, please comment/edit!
I think this fits your data - however I'd like to add that this creates/trains a random forest model, not a function in the sense of a physical description of the process that generates the data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
rng = np.random.RandomState(42)
x = 10 * rng.rand(200)
def model(x, sigma=0.3):
fast_oscillation = np.sin(5 * x)
slow_oscillation = np.sin(0.5 * x)
noise = sigma * rng.randn(len(x))
return slow_oscillation + fast_oscillation + noise
y = model(x)
forest = RandomForestRegressor(200)
forest.fit(x[:, None], y)
xfit = np.linspace(0, 10, 1000)
yfit = forest.predict(xfit[:, None])
ytrue = model(xfit, sigma=0)
plt.errorbar(x, y, 0.3, fmt='o', alpha=0.6)
plt.plot(xfit, yfit, '-r')
plt.plot(xfit, ytrue, '-k', alpha=0.5)

pytorch: how to directly find gradient w.r.t. loss

In theano, it was very easy to get the gradient of some variable w.r.t. a given loss:
loss = f(x, w)
dl_dw = tt.grad(loss, wrt=w)
I get that pytorch goes by a different paradigm, where you'd do something like:
loss = f(x, w)
loss.backwards()
dl_dw = w.grad
The thing is I might not want to do a full backwards propagation through the graph - just along the path needed to get to w.
I know you can define Variables with requires_grad=False if you don't want to backpropagate through them. But then you have to decide that at the time of variable-creation (and the requires_grad=False property is attached to the variable, rather than the call which gets the gradient, which seems odd).
My Question is is there some way to backpropagate on demand (i.e. only backpropagate along the path needed to compute dl_dw, as you would in theano)?
It turns out that this is reallyy easy. Just use torch.autograd.grad
Example:
import torch
import numpy as np
from torch.autograd import grad
x = torch.autograd.Variable(torch.from_numpy(np.random.randn(5, 4)))
w = torch.autograd.Variable(torch.from_numpy(np.random.randn(4, 3)), requires_grad=True)
y = torch.autograd.Variable(torch.from_numpy(np.random.randn(5, 3)))
loss = ((x.mm(w) - y)**2).sum()
(d_loss_d_w, ) = grad(loss, w)
assert np.allclose(d_loss_d_w.data.numpy(), (x.transpose(0, 1).mm(x.mm(w)-y)*2).data.numpy())
Thanks to JerryLin for answering the question here.

Role of class_weight in loss functions for linearSVC and LogisticRegression

I am trying to figure out what exactly the loss function formula is and how I can manually calculate it when class_weight='auto' in case of svm.svc, svm.linearSVC and linear_model.LogisticRegression.
For balanced data, say you have a trained classifier: clf_c. Logistic loss should be (am I correct?):
def logistic_loss(x,y,w,b,b0):
'''
x: nxp data matrix where n is number of data points and p is number of features.
y: nx1 vector of true labels (-1 or 1).
w: nx1 vector of weights (vector of 1./n for balanced data).
b: px1 vector of feature weights.
b0: intercept.
'''
s = y
if 0 in np.unique(y):
print 'yes'
s = 2. * y - 1
l = np.dot(w, np.log(1 + np.exp(-s * (np.dot(x, np.squeeze(b)) + b0))))
return l
I realized that logisticRegression has predict_log_proba() which gives you exactly that when data is balanced:
b, b0 = clf_c.coef_, clf_c.intercept_
w = np.ones(len(y))/len(y)
-(clf_c.predict_log_proba(x[xrange(len(x)), np.floor((y+1)/2).astype(np.int8)]).mean() == logistic_loss(x,y,w,b,b0)
Note, np.floor((y+1)/2).astype(np.int8) simply maps y=(-1,1) to y=(0,1).
But this does not work when data is imbalanced.
What's more, you expect the classifier (here, logisticRegression) to perform similarly (in terms of loss function value) when data in balance and class_weight=None versus when data is imbalanced and class_weight='auto'. I need to have a way to calculate the loss function (without the regularization term) for both scenarios and compare them.
In short, what does class_weight = 'auto' exactly mean? Does it mean class_weight = {-1 : (y==1).sum()/(y==-1).sum() , 1 : 1.} or rather class_weight = {-1 : 1./(y==-1).sum() , 1 : 1./(y==1).sum()}?
Any help is much much appreciated. I tried going through the source code, but I am not a programmer and I am stuck.
Thanks a lot in advance.
class_weight heuristics
I am a bit puzzled by your first proposition for the class_weight='auto' heuristic, as:
class_weight = {-1 : (y == 1).sum() / (y == -1).sum(),
1 : 1.}
is the same as your second proposition if we normalize it so that the weights sum to one.
Anyway to understand what class_weight="auto" does, see this question:
what is the difference between class weight = none and auto in svm scikit learn.
I am copying it here for later comparison:
This means that each class you have (in classes) gets a weight equal
to 1 divided by the number of times that class appears in your data
(y), so classes that appear more often will get lower weights. This is
then further divided by the mean of all the inverse class frequencies.
Note how this is not completely obvious ;).
This heuristic is deprecated and will be removed in 0.18. It will be replaced by another heuristic, class_weight='balanced'.
The 'balanced' heuristic weighs classes proportionally to the inverse of their frequency.
From the docs:
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data:
n_samples / (n_classes * np.bincount(y)).
np.bincount(y) is an array with the element i being the count of class i samples.
Here's a bit of code to compare the two:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import compute_class_weight
n_classes = 3
n_samples = 1000
X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=10,
n_classes=n_classes, weights=[0.05, 0.4, 0.55])
print("Count of samples per class: ", np.bincount(y))
balanced_weights = n_samples /(n_classes * np.bincount(y))
# Equivalent to the following, using version 0.17+:
# compute_class_weight("balanced", [0, 1, 2], y)
print("Balanced weights: ", balanced_weights)
print("'auto' weights: ", compute_class_weight("auto", [0, 1, 2], y))
Output:
Count of samples per class: [ 57 396 547]
Balanced weights: [ 5.84795322 0.84175084 0.60938452]
'auto' weights: [ 2.40356854 0.3459682 0.25046327]
The loss functions
Now the real question is: how are these weights used to train the classifier?
I don't have a thorough answer here unfortunately.
For SVC and linearSVC the docstring is pretty clear
Set the parameter C of class i to class_weight[i]*C for SVC.
So high weights mean less regularization for the class and a higher incentive for the svm to classify it properly.
I do not know how they work with logistic regression. I'll try to look into it but most of the code is in liblinear or libsvm and I'm not too familiar with those.
However, note that the weights in class_weight do not influence directly methods such as predict_proba. They change its ouput because the classifier optimizes a different loss function.
Not sure this is clear, so here's a snippet to explain what I mean (you need to run the first one for the imports and variable definition):
lr = LogisticRegression(class_weight="auto")
lr.fit(X, y)
# We get some probabilities...
print(lr.predict_proba(X))
new_lr = LogisticRegression(class_weight={0: 100, 1: 1, 2: 1})
new_lr.fit(X, y)
# We get different probabilities...
print(new_lr.predict_proba(X))
# Let's cheat a bit and hand-modify our new classifier.
new_lr.intercept_ = lr.intercept_.copy()
new_lr.coef_ = lr.coef_.copy()
# Now we get the SAME probabilities.
np.testing.assert_array_equal(new_lr.predict_proba(X), lr.predict_proba(X))
Hope this helps.

scikit learn: how to check coefficients significance

i tried to do a LR with SKLearn for a rather large dataset with ~600 dummy and only few interval variables (and 300 K lines in my dataset) and the resulting confusion matrix looks suspicious. I wanted to check the significance of the returned coefficients and ANOVA but I cannot find how to access it. Is it possible at all? And what is the best strategy for data that contains lots of dummy variables? Thanks a lot!
Scikit-learn deliberately does not support statistical inference. If you want out-of-the-box coefficients significance tests (and much more), you can use Logit estimator from Statsmodels. This package mimics interface glm models in R, so you could find it familiar.
If you still want to stick to scikit-learn LogisticRegression, you can use asymtotic approximation to distribution of maximum likelihiood estimates. Precisely, for a vector of maximum likelihood estimates theta, its variance-covariance matrix can be estimated as inverse(H), where H is the Hessian matrix of log-likelihood at theta. This is exactly what the function below does:
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression
def logit_pvalue(model, x):
""" Calculate z-scores for scikit-learn LogisticRegression.
parameters:
model: fitted sklearn.linear_model.LogisticRegression with intercept and large C
x: matrix on which the model was fit
This function uses asymtptics for maximum likelihood estimates.
"""
p = model.predict_proba(x)
n = len(p)
m = len(model.coef_[0]) + 1
coefs = np.concatenate([model.intercept_, model.coef_[0]])
x_full = np.matrix(np.insert(np.array(x), 0, 1, axis = 1))
ans = np.zeros((m, m))
for i in range(n):
ans = ans + np.dot(np.transpose(x_full[i, :]), x_full[i, :]) * p[i,1] * p[i, 0]
vcov = np.linalg.inv(np.matrix(ans))
se = np.sqrt(np.diag(vcov))
t = coefs/se
p = (1 - norm.cdf(abs(t))) * 2
return p
# test p-values
x = np.arange(10)[:, np.newaxis]
y = np.array([0,0,0,1,0,0,1,1,1,1])
model = LogisticRegression(C=1e30).fit(x, y)
print(logit_pvalue(model, x))
# compare with statsmodels
import statsmodels.api as sm
sm_model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(sm_model.pvalues)
sm_model.summary()
The outputs of print() are identical, and they happen to be coefficient p-values.
[ 0.11413093 0.08779978]
[ 0.11413093 0.08779979]
sm_model.summary() also prints a nicely formatted HTML summary.

Resources