r in stats.linregress compared to r-squared in statsmodels - python-3.x

I'm working on a program to investigate the correlation between magnitude and redshift for some quasars, and I'm using statsmodels and scipy.stats.linregress to compute the statistics of the data; statsmodels to compute r-squared (among other parameters), and stats.linregress to compute r (among others).
Some example output is:
W1 r-squared: 0.855715
W1 r-value : 0.414026
W2 r-squared: 0.861169
W2 r-value : 0.517381
W3 r-squared: 0.874051
W3 r-value : 0.418523
W4 r-squared: 0.856747
W4 r-value : 0.294094
Visual minus WISE r-squared: 0.87366
Visual minus WISE r-value : -0.521463
My question is, why do the r and r-squared values not match
(i.e. for the W1 band, 0.414026**2 != 0.855715)?
The code for my computation function is as follows:
def computeStats(x, y, yName):
from scipy import stats
import statsmodels.api as sm
# Compute model parameters
model = sm.OLS(y, x, missing= 'drop')
results = model.fit()
# Mask NaN values in both axes
mask = ~np.isnan(y) & ~np.isnan(x)
# Compute fit parameters
params = stats.linregress(x[mask], y[mask])
fit = params[0]*x + params[1]
fitEquation = '$(%s)=(%.4g \pm %.4g) \\times redshift+%.4g$'%(yName,
params[0], # slope
params[4], # stderr in slope
params[1]) # y-intercept
print('%s r-squared: %g'%(name, arrayresults.rsquared))
print('%s r-value : %g'%(name, arrayparams[2]))
return results, params, fit, fitEquation
Am I interpreting the statistics incorrectly? Or do the two modules compute the regressions using different methods?

By default, OLS in statsmodels does not include the constant term (i.e. the intercept) in the linear equation. (The constant term corresponds to a column of ones in the design matrix.)
To match linregress, create model like this:
model = sm.OLS(y, sm.add_constant(x), missing= 'drop')

Related

How to statistically compare the intercept and slope of two different linear regression models in Python?

I have two series of data as below. I want to create an OLS linear regression model for df1 and another OLS linear regression model for df2. And then statistically test if the y-intercepts of these two linear regression models are statistically different (p<0.05), and also test if the slopes of these two linear regression models are statistically different (p<0.05). I did the following
import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
np.inf == float('inf')
data1 = [1, 3, 45, 0, 25, 13, 43]
data2 = [1, 1, 1, 1, 1, 1, 1]
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
fig, ax = plt.subplots()
df1.plot(figsize=(20, 10), linewidth=5, fontsize=18, ax=ax, kind='line')
df2.plot(figsize=(20, 10), linewidth=5, fontsize=18, ax=ax, kind='line')
plt.show()
model1 = sm.OLS(df1, df1.index)
model2 = sm.OLS(df2, df2.index)
results1 = model1.fit()
results2 = model2.fit()
print(results1.summary())
print(results2.summary())
Results #1
OLS Regression Results
=======================================================================================
Dep. Variable: 0 R-squared (uncentered): 0.625
Model: OLS Adj. R-squared (uncentered): 0.563
Method: Least Squares F-statistic: 10.02
Date: Mon, 01 Mar 2021 Prob (F-statistic): 0.0194
Time: 20:34:34 Log-Likelihood: -29.262
No. Observations: 7 AIC: 60.52
Df Residuals: 6 BIC: 60.47
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 5.6703 1.791 3.165 0.019 1.287 10.054
==============================================================================
Omnibus: nan Durbin-Watson: 2.956
Prob(Omnibus): nan Jarque-Bera (JB): 0.769
Skew: 0.811 Prob(JB): 0.681
Kurtosis: 2.943 Cond. No. 1.00
==============================================================================
Results #2
OLS Regression Results
=======================================================================================
Dep. Variable: 0 R-squared (uncentered): 0.692
Model: OLS Adj. R-squared (uncentered): 0.641
Method: Least Squares F-statistic: 13.50
Date: Mon, 01 Mar 2021 Prob (F-statistic): 0.0104
Time: 20:39:14 Log-Likelihood: -5.8073
No. Observations: 7 AIC: 13.61
Df Residuals: 6 BIC: 13.56
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 0.2308 0.063 3.674 0.010 0.077 0.384
==============================================================================
Omnibus: nan Durbin-Watson: 0.148
Prob(Omnibus): nan Jarque-Bera (JB): 0.456
Skew: 0.000 Prob(JB): 0.796
Kurtosis: 1.750 Cond. No. 1.00
==============================================================================
This is as far I have got, but I think something is wrong. Neither of these regression outcome seems to show the y-intercept. Also, I expect the coef in results #2 to be 0 since I expect the slope to be 0 when all the values are 1, but the result shows 0.2308. Any suggestions or guiding material will be greatly appreciated.
In statsmodels an OLS model does not fit an intercept by default (see the docs).
exog array_like
A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.
The documentation on the exog argument of the OLS constructor suggests using this feature of the tools module in order to add an intercept to the data.
To perform a hypothesis test on the values of the coefficients this question provides some guidance. This unfortunately only works if the variances of the residual errors is the same.
We can start by looking at whether the residuals of each distribution have the same variance (using Levine's test) and ignore coefficients of the regression model for now.
import numpy as np
import pandas as pd
from scipy.stats import levene
from statsmodels.tools import add_constant
from statsmodels.formula.api import ols ## use formula api to make the tests easier
np.inf == float('inf')
data1 = [1, 3, 45, 0, 25, 13, 43]
data2 = [1, 1, 1, 1, 1, 1, 1]
df1 = add_constant(pd.DataFrame(data1)) ## add a constant column so we fit an intercept
df1 = df1.reset_index() ## just doing this to make the index a column of the data frame
df1 = df1.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
df2 = add_constant(pd.DataFrame(data2)) ## this does nothing because the y column is already a constant
df2 = df2.reset_index()
df2 = df2.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
formula1 = 'y ~ x + const' ## define formulae
formula2 = 'y ~ x'
model1 = ols(formula1, df1).fit()
model2 = ols(formula2, df2).fit()
print(levene(model1.resid, model2.resid))
The output of the levene test looks like this:
LeveneResult(statistic=7.317386741297884, pvalue=0.019129208414097015)
So we can reject the null hypothesis that the residual distributions have the same variance at alpha=0.05.
There is no point to testing the linear regression coefficients now because the residuals don't have don't have the same distributions. It is important to remember that in a regression problem it doesn't make sense to compare the regression coefficients independent of the data they are fit on. The distribution of the regression coefficients depends on the distribution of the data.
Lets see what happens when we try the proposed test anyways. Combining the instructions above with this method from the OLS package yields the following code:
## stack the data and addd the indicator variable as described in:
## stackexchange question:
df1['c'] = 1 ## add indicator variable that tags the first groups of points
df_all = df1.append(df2, ignore_index=True).drop('const', axis=1)
df_all = df_all.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
df_all = df_all.fillna(0) ## a bunch of the values are missing in the indicator columns after stacking
df_all['int'] = df_all['x'] * df_all['c'] # construct the interaction column
print(df_all) ## look a the data
formula = 'y ~ x + c + int' ## define the linear model using the formula api
result = ols(formula, df_all).fit()
hypotheses = '(c = 0), (int = 0)'
f_test = result.f_test(hypotheses)
print(f_test)
The result of the f-test looks like this:
<F test: F=array([[4.01995453]]), p=0.05233934453138028, df_denom=10, df_num=2>
The result of the f-test means that we just barely fail to reject any of the null hypotheses specified in the hypotheses variable namely that the coefficient of the indicator variable 'c' and interaction term 'int' are zero.
From this example it is clear that the f test on the regression coefficients is not very powerful if the residuals do not have the same variance.
Note that the given example has so few points it is hard for the statistical tests to clearly distinguish the two cases even though to the human eye they are very different. This is because even though the statistical tests are designed to make few assumptions about the data but those assumption get better the more data you have. When testing statistical methods to see if they accord with your expectations it is often best to start by constructing large samples with little noise and then see how well the methods work as your data sets get smaller and noisier.
For the sake of completeness I will construct an example where the Levene test will fail to distinguish the two regression models but f test will succeed to do so. The idea is to compare the regression of a noisy data set with its reverse. The distribution of residual errors will be the same but the relationship between the variables will be very different. Note that this would not work reversing the noisy dataset given in the previous example because the data is so noisy the f test cannot distinguish between the positive and negative slope.
import numpy as np
import pandas as pd
from scipy.stats import levene
from statsmodels.tools import add_constant
from statsmodels.formula.api import ols ## use formula api to make the tests easier
n_samples = 6
noise = np.random.randn(n_samples) * 5
data1 = np.linspace(0, 30, n_samples) + noise
data2 = data1[::-1] ## reverse the time series
df1 = add_constant(pd.DataFrame(data1)) ## add a constant column so we fit an intercept
df1 = df1.reset_index() ## just doing this to make the index a column of the data frame
df1 = df1.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
df2 = add_constant(pd.DataFrame(data2)) ## this does nothing because the y column is already a constant
df2 = df2.reset_index()
df2 = df2.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
formula1 = 'y ~ x + const' ## define formulae
formula2 = 'y ~ x'
model1 = ols(formula1, df1).fit()
model2 = ols(formula2, df2).fit()
print(levene(model1.resid, model2.resid))
## stack the data and addd the indicator variable as described in:
## stackexchange question:
df1['c'] = 1 ## add indicator variable that tags the first groups of points
df_all = df1.append(df2, ignore_index=True).drop('const', axis=1)
df_all = df_all.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
df_all = df_all.fillna(0) ## a bunch of the values are missing in the indicator columns after stacking
df_all['int'] = df_all['x'] * df_all['c'] # construct the interaction column
print(df_all) ## look a the data
formula = 'y ~ x + c + int' ## define the linear model using the formula api
result = ols(formula, df_all).fit()
hypotheses = '(c = 0), (int = 0)'
f_test = result.f_test(hypotheses)
print(f_test)
The result of Levene test and the f test follow:
LeveneResult(statistic=5.451203655948632e-31, pvalue=1.0)
<F test: F=array([[10.62788052]]), p=0.005591319998324387, df_denom=8, df_num=2>
A final note since we are doing multiple comparisons on this data and stopping if we get a significant result, i.e. if the Levene test rejects the null we quit, if it doesn't then we do the f test, this is a stepwise hypothesis test and we are actually inflating our false positive error rate. We should correct our p-values for multiple comparisons before we report our results. Note that the f test is already doing this for the hypotheses we test about the regression coefficients. I am a bit fuzzy on the underlying assumptions of these testing procedures so I am not 100% sure that you are better off making the following correction but keep it in mind in case you feel you are getting false positives too often.
from statsmodels.sandbox.stats.multicomp import multipletests
print(multipletests([1, .005591], .05)) ## correct out pvalues given that we did two comparisons
The output looks like this:
(array([False, True]), array([1. , 0.01115074]), 0.025320565519103666, 0.025)
This means we rejected the second null hypothesis under the correction and that the corrected p-values looks like [1., 0.011150]. The last two values are corrections to your significance level under two different correction methods.
I hope this helps anyone trying to do this type of work. If anyone has anything to add I would welcome comments. This isn't my area of expertise so I could be making some mistakes.

Plotting residuals of masked values with `statsmodels`

I'm using statsmodels.api to compute the statistical parameters for an OLS fit between two variables:
def computeStats(x, y, yName):
'''
Takes as an argument an array, and a string for the array name.
Uses Ordinary Least Squares to compute the statistical parameters for the
array against log(z), and determines the equation for the line of best fit.
Returns the results summary, residuals, statistical parameters in a list, and the
best fit equation.
'''
# Mask NaN values in both axes
mask = ~np.isnan(y) & ~np.isnan(x)
# Compute model parameters
model = sm.OLS(y, sm.add_constant(x), missing= 'drop')
results = model.fit()
residuals = results.resid
# Compute fit parameters
params = stats.linregress(x[mask], y[mask])
fit = params[0]*x + params[1]
fitEquation = '$(%s)=(%.4g \pm %.4g) \\times redshift+%.4g$'%(yName,
params[0], # slope
params[4], # stderr in slope
params[1]) # y-intercept
return results, residuals, params, fit, fitEquation
The second part of the function (using stats.linregress) plays nicely with the masked values, but statsmodels does not. When I try to plot the residuals against the x values with plt.scatter(x, resids), the dimensions do not match:
ValueError: x and y must be the same size
because there are 29007 x-values, and 11763 residuals (that's how many y-values made it through the masking process). I tried changing the model variable to
model = sm.OLS(y[mask], sm.add_constant(x[mask]), missing= 'drop')
but this had no effect.
How can I scatter-plot the residuals against the x-values they match with?
Hi #jim421616 Since statsmodels dropped few missing values, you should use the model's exog variable to plot the scatter as shown.
plt.scatter(model.model.exog[:,1], model.resid)
For reference a complete dummy example
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
#generate data
x = np.random.rand(1000)
y =np.sin( x*25)+0.1*np.random.rand(1000)
# Make some as NAN
y[np.random.choice(np.arange(1000), size=100)]= np.nan
x[np.random.choice(np.arange(1000), size=80)]= np.nan
# fit model
model = sm.OLS(y, sm.add_constant(x) ,missing='drop').fit()
print model.summary()
# plot
plt.scatter(model.model.exog[:,1], model.resid)
plt.show()

After normalize data, using regression anlaysis how to predict y?

I have Normalize my data and apply regression analysis to predict yield(y).
but my predicted output also gives in normalized (in 0 to 1)
I want my predicted answer in my correct data numbers,not in 0 to 1.
Data:
Total_yield(y) Rain(x)
64799.30 720.1
77232.40 382.9
88487.70 1198.2
77338.20 341.4
145602.05 406.4
67680.50 325.8
84536.20 791.8
99854.00 748.6
65939.90 1552.6
61622.80 1357.7
66439.60 344.3
Next,I have normalize data using this code :
from sklearn.preprocessing import Normalizer
import pandas
import numpy
dataframe = pandas.read_csv('/home/desktop/yield.csv')
array = dataframe.values
X = array[:,0:2]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
print(normalizedX)
Total_yield Rain
0 0.999904 0.013858
1 0.999782 0.020872
2 0.999960 0.008924
3 0.999967 0.008092
4 0.999966 0.008199
5 0.999972 0.007481
6 0.999915 0.013026
7 0.999942 0.010758
8 0.999946 0.010414
9 0.999984 0.005627
10 0.999967 0.008167
Next, I use this normalize value to calculate R-sqaure using following code :
array=normalizedX
data = pandas.DataFrame(array,columns=['Total_yield','Rain'])
import statsmodels.formula.api as smf
lm = smf.ols(formula='Total_yield ~ Rain', data=data).fit()
lm.summary()
Output :
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: Total_yield R-squared: 0.752
Model: OLS Adj. R-squared: 0.752
Method: Least Squares F-statistic: 1066.
Date: Thu, 09 Feb 2017 Prob (F-statistic): 2.16e-108
Time: 14:21:21 Log-Likelihood: 941.53
No. Observations: 353 AIC: -1879.
Df Residuals: 351 BIC: -1871.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 1.0116 0.001 948.719 0.000 1.009 1.014
Rain -0.3013 0.009 -32.647 0.000 -0.319 -0.283
==============================================================================
Omnibus: 408.798 Durbin-Watson: 1.741
Prob(Omnibus): 0.000 Jarque-Bera (JB): 40636.533
Skew: -4.955 Prob(JB): 0.00
Kurtosis: 54.620 Cond. No. 10.3
==============================================================================
Now, R-square = 0.75 ,
regression model : y = b0 + b1 *x
Yield = b0 + b1 * Rain
Yield = intercept + coefficient for Rain * Rain
Now when I use my data value for Rain data then it will gives this answer :
Yield = 1.0116 + ( -0.3013 * 720.1(mm)) = -215.95
-215.95yield is wrong,
And when I use normalize value for rain data then predicted yield comes in normalize value in between 0 to 1.
I want predict if rainfall will be 720.1 mm then how many yield will be there?
If anyone help me how to get predicted yield ? I want to compare Predicted yield vs given yield.
First, you should not use Normalizer in this case. It doesn't normalize across features. It does it along rows. You may not want it.
Use MinMaxScaler or RobustScaler to scale each feature. See the preprocessing docs for more details.
Second, these classes have a inverse_transform() function which can convert the predicted y value back to original units.
x = np.asarray([720.1,382.9,1198.2,341.4,406.4,325.8,
791.8,748.6,1552.6,1357.7,344.3]).reshape(-1,1)
y = np.asarray([64799.30,77232.40,88487.70,77338.20,145602.05,67680.50,
84536.20,99854.00,65939.90,61622.80,66439.60]).reshape(-1,1)
scalerx = RobustScaler()
x_scaled = scalerx.fit_transform(x)
scalery = RobustScaler()
y_scaled = scalery.fit_transform(y)
Call your statsmodel.OLS on these scaled data.
While predicting, first transform your test data:
x_scaled_test = scalerx.transform([720.1])
Apply your regression model on this value and get the result. This result of y will be according to the scaled data.
Yield_scaled = b0 + b1 * x_scaled_test
So inverse transform it to get data in original units.
Yield_original = scalery.inverse_transform(Yield_scaled)
But in my opinion, this linear model will not give much accuracy, because when I plotted your data, this is the result.
This data will not be fitted with linear models. Use other techniques, or get more data.

Role of class_weight in loss functions for linearSVC and LogisticRegression

I am trying to figure out what exactly the loss function formula is and how I can manually calculate it when class_weight='auto' in case of svm.svc, svm.linearSVC and linear_model.LogisticRegression.
For balanced data, say you have a trained classifier: clf_c. Logistic loss should be (am I correct?):
def logistic_loss(x,y,w,b,b0):
'''
x: nxp data matrix where n is number of data points and p is number of features.
y: nx1 vector of true labels (-1 or 1).
w: nx1 vector of weights (vector of 1./n for balanced data).
b: px1 vector of feature weights.
b0: intercept.
'''
s = y
if 0 in np.unique(y):
print 'yes'
s = 2. * y - 1
l = np.dot(w, np.log(1 + np.exp(-s * (np.dot(x, np.squeeze(b)) + b0))))
return l
I realized that logisticRegression has predict_log_proba() which gives you exactly that when data is balanced:
b, b0 = clf_c.coef_, clf_c.intercept_
w = np.ones(len(y))/len(y)
-(clf_c.predict_log_proba(x[xrange(len(x)), np.floor((y+1)/2).astype(np.int8)]).mean() == logistic_loss(x,y,w,b,b0)
Note, np.floor((y+1)/2).astype(np.int8) simply maps y=(-1,1) to y=(0,1).
But this does not work when data is imbalanced.
What's more, you expect the classifier (here, logisticRegression) to perform similarly (in terms of loss function value) when data in balance and class_weight=None versus when data is imbalanced and class_weight='auto'. I need to have a way to calculate the loss function (without the regularization term) for both scenarios and compare them.
In short, what does class_weight = 'auto' exactly mean? Does it mean class_weight = {-1 : (y==1).sum()/(y==-1).sum() , 1 : 1.} or rather class_weight = {-1 : 1./(y==-1).sum() , 1 : 1./(y==1).sum()}?
Any help is much much appreciated. I tried going through the source code, but I am not a programmer and I am stuck.
Thanks a lot in advance.
class_weight heuristics
I am a bit puzzled by your first proposition for the class_weight='auto' heuristic, as:
class_weight = {-1 : (y == 1).sum() / (y == -1).sum(),
1 : 1.}
is the same as your second proposition if we normalize it so that the weights sum to one.
Anyway to understand what class_weight="auto" does, see this question:
what is the difference between class weight = none and auto in svm scikit learn.
I am copying it here for later comparison:
This means that each class you have (in classes) gets a weight equal
to 1 divided by the number of times that class appears in your data
(y), so classes that appear more often will get lower weights. This is
then further divided by the mean of all the inverse class frequencies.
Note how this is not completely obvious ;).
This heuristic is deprecated and will be removed in 0.18. It will be replaced by another heuristic, class_weight='balanced'.
The 'balanced' heuristic weighs classes proportionally to the inverse of their frequency.
From the docs:
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data:
n_samples / (n_classes * np.bincount(y)).
np.bincount(y) is an array with the element i being the count of class i samples.
Here's a bit of code to compare the two:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import compute_class_weight
n_classes = 3
n_samples = 1000
X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=10,
n_classes=n_classes, weights=[0.05, 0.4, 0.55])
print("Count of samples per class: ", np.bincount(y))
balanced_weights = n_samples /(n_classes * np.bincount(y))
# Equivalent to the following, using version 0.17+:
# compute_class_weight("balanced", [0, 1, 2], y)
print("Balanced weights: ", balanced_weights)
print("'auto' weights: ", compute_class_weight("auto", [0, 1, 2], y))
Output:
Count of samples per class: [ 57 396 547]
Balanced weights: [ 5.84795322 0.84175084 0.60938452]
'auto' weights: [ 2.40356854 0.3459682 0.25046327]
The loss functions
Now the real question is: how are these weights used to train the classifier?
I don't have a thorough answer here unfortunately.
For SVC and linearSVC the docstring is pretty clear
Set the parameter C of class i to class_weight[i]*C for SVC.
So high weights mean less regularization for the class and a higher incentive for the svm to classify it properly.
I do not know how they work with logistic regression. I'll try to look into it but most of the code is in liblinear or libsvm and I'm not too familiar with those.
However, note that the weights in class_weight do not influence directly methods such as predict_proba. They change its ouput because the classifier optimizes a different loss function.
Not sure this is clear, so here's a snippet to explain what I mean (you need to run the first one for the imports and variable definition):
lr = LogisticRegression(class_weight="auto")
lr.fit(X, y)
# We get some probabilities...
print(lr.predict_proba(X))
new_lr = LogisticRegression(class_weight={0: 100, 1: 1, 2: 1})
new_lr.fit(X, y)
# We get different probabilities...
print(new_lr.predict_proba(X))
# Let's cheat a bit and hand-modify our new classifier.
new_lr.intercept_ = lr.intercept_.copy()
new_lr.coef_ = lr.coef_.copy()
# Now we get the SAME probabilities.
np.testing.assert_array_equal(new_lr.predict_proba(X), lr.predict_proba(X))
Hope this helps.

scikit learn: how to check coefficients significance

i tried to do a LR with SKLearn for a rather large dataset with ~600 dummy and only few interval variables (and 300 K lines in my dataset) and the resulting confusion matrix looks suspicious. I wanted to check the significance of the returned coefficients and ANOVA but I cannot find how to access it. Is it possible at all? And what is the best strategy for data that contains lots of dummy variables? Thanks a lot!
Scikit-learn deliberately does not support statistical inference. If you want out-of-the-box coefficients significance tests (and much more), you can use Logit estimator from Statsmodels. This package mimics interface glm models in R, so you could find it familiar.
If you still want to stick to scikit-learn LogisticRegression, you can use asymtotic approximation to distribution of maximum likelihiood estimates. Precisely, for a vector of maximum likelihood estimates theta, its variance-covariance matrix can be estimated as inverse(H), where H is the Hessian matrix of log-likelihood at theta. This is exactly what the function below does:
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression
def logit_pvalue(model, x):
""" Calculate z-scores for scikit-learn LogisticRegression.
parameters:
model: fitted sklearn.linear_model.LogisticRegression with intercept and large C
x: matrix on which the model was fit
This function uses asymtptics for maximum likelihood estimates.
"""
p = model.predict_proba(x)
n = len(p)
m = len(model.coef_[0]) + 1
coefs = np.concatenate([model.intercept_, model.coef_[0]])
x_full = np.matrix(np.insert(np.array(x), 0, 1, axis = 1))
ans = np.zeros((m, m))
for i in range(n):
ans = ans + np.dot(np.transpose(x_full[i, :]), x_full[i, :]) * p[i,1] * p[i, 0]
vcov = np.linalg.inv(np.matrix(ans))
se = np.sqrt(np.diag(vcov))
t = coefs/se
p = (1 - norm.cdf(abs(t))) * 2
return p
# test p-values
x = np.arange(10)[:, np.newaxis]
y = np.array([0,0,0,1,0,0,1,1,1,1])
model = LogisticRegression(C=1e30).fit(x, y)
print(logit_pvalue(model, x))
# compare with statsmodels
import statsmodels.api as sm
sm_model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(sm_model.pvalues)
sm_model.summary()
The outputs of print() are identical, and they happen to be coefficient p-values.
[ 0.11413093 0.08779978]
[ 0.11413093 0.08779979]
sm_model.summary() also prints a nicely formatted HTML summary.

Resources