I'm using statsmodels to do simple and multiple linear regression, and I'm getting bad R^2 values from the summary. The coefficients look correct, but I get an R^2 of 1.000, which is impossible for my data. I graphed it in Excel and I should be getting around 0.93, not 1.
I'm using a mask to filter the data sent into the model and I'm wondering if that could be the issue, but to me the data looks fine. I am fairly new to Python and statsmodels, so maybe I'm missing something here.
import statsmodels.api as sm
for i, df in enumerate(fallwy_xy):  # Iterate through list of dataframes
    if len(df.index) > 0:  # Check if frame is empty or not
        mask3 = (df['fnu'] >= low)  # Mask data below 'low' variable
        valid3 = df[mask3]
        if len(valid3) > 0:  # Check if there is data in range of mask3
            X = valid3[['logfnu', 'logdischarge']]
            y = valid3[['logssc']]
            estm = sm.OLS(y, X).fit()
            X = valid3[['logfnu']]
            y = valid3[['logssc']]
            ests = sm.OLS(y, X).fit()
I finally found out what was going on. Statsmodels by default does not incorporate a constant into its OLS regression equation; you have to add it explicitly with
X = sm.add_constant(X)
The constant matters because without it, statsmodels calculates R-squared differently: it uses the uncentered version, which compares the residuals to the raw sum of squares of y rather than to the sum of squares around the mean of y. If you add a constant, R-squared is calculated the way most people expect, which is the centered version. Excel does not change the way it calculates R-squared depending on whether a constant is included, which is why the no-constant R-squared reported by statsmodels was so different from Excel's. The OLS regression summary from statsmodels actually points out when it uses the uncentered, no-constant calculation by labelling the value R-squared (uncentered): in the summary table. The links below helped me figure this out.
add hasconstant indicator for R-squared and df calculations
Same model coeffs, different R^2 with statsmodels OLS and sci-kit learn linearregression
Warning : Rod Made a Mistake!
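To make the centered versus uncentered distinction concrete, here is a minimal sketch on synthetic data. It uses plain NumPy least squares rather than statsmodels so the arithmetic is visible; the data, the offset of 5.0, and the helper name ols_r2 are assumptions made up for illustration and are not taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a large offset, so leaving out the intercept matters
x = rng.uniform(1.0, 3.0, size=200)
y = 5.0 + 0.8 * x + rng.normal(scale=0.2, size=200)


def ols_r2(X, y):
    """Fit OLS by least squares and return (centered R^2, uncentered R^2)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ssr = np.sum(resid ** 2)
    centered = 1.0 - ssr / np.sum((y - y.mean()) ** 2)  # relative to the mean of y
    uncentered = 1.0 - ssr / np.sum(y ** 2)             # relative to zero
    return centered, uncentered


X_no_const = x[:, None]                           # slope only, no intercept
X_const = np.column_stack([np.ones_like(x), x])   # intercept + slope

r2_nc_centered, r2_nc_uncentered = ols_r2(X_no_const, y)
r2_c_centered, r2_c_uncentered = ols_r2(X_const, y)

print("no constant:   centered =", r2_nc_centered, " uncentered =", r2_nc_uncentered)
print("with constant: centered =", r2_c_centered, " uncentered =", r2_c_uncentered)
```

Because the uncentered denominator includes the squared mean of y, it is much larger than the centered one whenever y sits far from zero, so the uncentered R-squared of a no-constant fit can look close to 1 even when the fit is poor.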
I used log-transformed data (dependent variable = count) in my generalised additive model (using mgcv) and tried to plot the response by using "trans=plogis" as for logistic GAMs, but the results don't seem right. Am I forgetting something here? When I used linear models for my data first, I plotted the least-squares means. Any idea how I could plot the output of my GAMs in a more interpretable way than on the log scale?
Cheers
Are you running a logistic regression for count data? Logistic regression is normally a binary variable or a proportion of binary outcomes.
That being said, the real question here is that you want to backtransform a variable that was fit on the log scale back to the original scale for plotting. That can be easily done using the itsadug package. I've simulated some silly data here just to show the code required.
With itsadug, you can visually inspect many aspects of GAM models. I'd encourage you to look at this: https://cran.r-project.org/web/packages/itsadug/vignettes/inspect.html
The transform argument of plot_smooth() can also be used with custom functions written in R. This can be useful if you have both centred and logged a dependent variable.
library(mgcv)
library(itsadug)
# Setting seed so it's reproducible
set.seed(123)
# Generating 50 samples from a uniform distribution
x <- runif(50, min = 20, max = 50)
# Taking the sin of x to create a dependent variable
y <- sin(x)
# Binding them to a dataframe
d <- data.frame(x, y)
# Logging the dependent variable after adding a constant to prevent negative values
d$log_y <- log(d$y + 1)
# Fitting a GAM to the transformed dependent variable
model_fit <- gam(log_y ~ s(x),
                 data = d)
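# Using the plot_smooth function from itsadug to backtransform to the original y scale
# (exp undoes the log, and subtracting 1 undoes the constant added before logging)
plot_smooth(model_fit,
            view = "x",
            transform = function(v) exp(v) - 1)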
You can specify the trans function for back-transforming as trans = function(x){exp(coef(gam)[1]+x)}, where gam is your fitted model and coef(gam)[1] is the intercept.
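The arithmetic behind adding the intercept before exponentiating can be sketched in a few lines. This is only an illustration with made-up numbers, written in Python rather than R: intercept stands in for coef(gam)[1], smooth stands in for the centered smooth term, and c is a hypothetical constant added before logging.

```python
import numpy as np

# Hypothetical values standing in for a model fitted on the log scale
intercept = 1.2                                  # plays the role of coef(gam)[1]
smooth = np.array([-0.4, -0.1, 0.0, 0.3, 0.5])   # centered smooth term s(x)

# Exponentiating only the centered smooth gives multiplicative effects around 1
relative_effect = np.exp(smooth)

# Adding the intercept first gives values on the original response scale
response = np.exp(intercept + smooth)

# If the response was logged as log(y + c), the inverse also subtracts c
c = 1.0
response_shifted = np.exp(intercept + smooth) - c

print(relative_effect)
print(response)
print(response_shifted)
```

The two versions differ only by the constant factor exp(intercept), which is why a plot of the back-transformed smooth alone has the right shape but the wrong level.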
I wanted to get beta coefficients in my panel data random effects regression model in Stata. But then I noticed that the option "beta" is not allowed in the xtreg command.
That made me wonder whether it is wrong to want standardised coefficients in a random effects model.
My model looks something like this:
xtreg y x##z, re
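You can manually get standardized coefficients by standardizing your variables to mean 0 and standard deviation 1 before the command:
foreach v of varlist x y z {
    qui sum `v'
    replace `v' = (`v'-`r(mean)') / `r(sd)'
}
* After standardizing, x and z hold non-integer values, so the interaction needs the c. prefix
xtreg y c.x##c.z, re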
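As a rough check on what standardizing does, the sketch below compares a slope fitted on z-scored variables with the raw slope rescaled by the ratio of standard deviations. It is a plain OLS illustration on synthetic data in Python, not a random-effects model, and the helper names zscore and slope are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data; the true slope of 0.5 is an arbitrary choice
x = rng.normal(loc=10.0, scale=4.0, size=500)
y = 2.0 + 0.5 * x + rng.normal(scale=3.0, size=500)


def zscore(v):
    """Standardize to mean 0 and sample standard deviation 1."""
    return (v - v.mean()) / v.std(ddof=1)


def slope(x, y):
    """Slope of a simple OLS regression of y on x with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]


b_raw = slope(x, y)                                  # slope in original units
b_std = slope(zscore(x), zscore(y))                  # slope after standardizing
b_rescaled = b_raw * x.std(ddof=1) / y.std(ddof=1)   # raw slope rescaled by sd(x)/sd(y)

print(b_raw, b_std, b_rescaled)
```

With a single regressor, the standardized slope also equals the correlation between x and y, which gives a quick sanity check on the result.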
I have a binary classification problem for detecting AO/Non-AO images, using PyTorch for this purpose.
First, I load the data using the ImageFolder utility.
The class-to-label mapping in dataset.class_to_idx is {'AO': 0, 'Non-AO': 1}.
So, my 'positive class' AO, is assigned a label 0, and my 'negative class' Non-AO is assigned a label 1.
Then I train and validate the model without any issues.
When I come to testing, I need to calculate some metrics on the test data.
Here is where I am confused.
[Method A]
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
[Method B]
# because 0 is my actual 'positive' class for this problem
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=0)
roc_auc = auc(fpr, tpr)
Now, this second curve is basically the mirror of the first one along the diagonal, right?
And I think, that it can't be the correct curve, because I checked the accuracy of the model by directly comparing y_true and y_pred to get the following accuracies.
Apart from this, here is what my confusion matrix looks like.
So, my first question is, am I doing something wrong? What is the significance of the curve from Method B? Can I say that Method A gives me the correct ROC curve for my classification task? If not, then how do I proceed for getting the correct curve?
What does true positive or true negative or any of the other terms signify for my confusion matrix? Does the matrix consider 0 : AO as negative and 1 : Non-AO as positive (I think so, yes) or the vice versa?
If 0 is indeed being considered as negative, when I actually want 0 to be considered as positive, how can I change the matrix to reflect that (because I am using the matrix later to calculate other metrics like specificity, sensitivity, etc.)?
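The mirroring between the two methods can be reproduced with a small sketch. It assumes that y_score is the model's score for label 1 (Non-AO); the labels, the scores, and the helper auc_rank are made up for illustration, and the helper is a rank-based AUC written in NumPy, not scikit-learn's implementation.

```python
import numpy as np


def auc_rank(y_true, y_score, pos_label=1):
    """AUC as the probability that a random positive outscores a random negative."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == pos_label]
    neg = y_score[y_true != pos_label]
    diff = pos[:, None] - neg[None, :]          # all positive/negative score pairs
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
# Hypothetical model output, interpreted as P(label == 1), i.e. P(Non-AO)
p_label1 = np.array([0.1, 0.3, 0.2, 0.8, 0.7, 0.4, 0.6, 0.9])

auc_a = auc_rank(y_true, p_label1, pos_label=1)        # like Method A
auc_b = auc_rank(y_true, p_label1, pos_label=0)        # like Method B, same scores
auc_c = auc_rank(y_true, 1.0 - p_label1, pos_label=0)  # positive class 0 with its own score

print(auc_a, auc_b, auc_c)
```

Under that assumption, switching only pos_label swaps the roles of the true and false positive rates, so the curve is reflected across the diagonal and the area becomes one minus the original. Treating label 0 as positive only gives a comparable curve when the score passed in is also the score for label 0.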
I've fit a SARIMAX model using statsmodels as follows
mod = sm.tsa.statespace.SARIMAX(ratingCountsRSint,order=(2,0,0),seasonal_order=(1,0,0,52),enforce_stationarity=False,enforce_invertibility=False, freq='W')
results = mod.fit()
print(results.summary().tables[1])
In the results table I have a coefficient ar.S.L52 that shows as 0.0163. When I try to retrieve the coefficient using
seasonalAR=results.polynomial_seasonal_ar[52]
I get -0.0163. I'm wondering why the sign has turned around. The same thing happens with polynomial_ar. In the documentation it says that polynomial_seasonal_ar gives the "array containing seasonal autoregressive lag polynomial coefficients". I would have guessed that I should get exactly the same as in the summary table. Could someone clarify how that comes about and whether the actual coefficient of the lag is positive or negative?
I'll use an AR(1) model as an example, but the same principle applies to a seasonal model.
We typically write the AR(1) model as:
y_t = \phi_1 y_{t-1} + \varepsilon_t
The parameter estimated by Statsmodels is \phi_1, and that is what is presented in the summary table.
When writing the AR(1) model in lag-polynomial form, we usually write it like:
\phi(L) y_t = \varepsilon_t
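where \phi(L) = 1 - \phi_1 L, and L is the lag operator. The coefficients of this lag polynomial are (1, -\phi_1). These coefficients are what the polynomial attributes in the results object contain, which is why polynomial_seasonal_ar[52] has the opposite sign from ar.S.L52 in the summary table. The actual coefficient on the lag is the one shown in the summary table.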
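The sign convention can be checked numerically with a short sketch. It simulates an AR(1) series with NumPy only, using a made-up value phi_1 = 0.5, and applies the lag polynomial by hand; it does not call statsmodels, so the variable names here are illustrative assumptions rather than attributes of the results object.

```python
import numpy as np

phi_1 = 0.5                      # AR(1) parameter, as it would appear in a summary table
rng = np.random.default_rng(42)
eps = rng.normal(size=300)       # innovations

# Simulate y_t = phi_1 * y_{t-1} + eps_t
y = np.zeros_like(eps)
y[0] = eps[0]
for t in range(1, len(eps)):
    y[t] = phi_1 * y[t - 1] + eps[t]

# Lag polynomial phi(L) = 1 - phi_1 * L, coefficients ordered by power of L
ar_poly = np.array([1.0, -phi_1])

# Applying phi(L) to y gives y_t - phi_1 * y_{t-1}, which recovers the innovations
recovered = ar_poly[0] * y[1:] + ar_poly[1] * y[:-1]

# The parameter is the negative of the polynomial coefficient at the same lag
param_from_poly = -ar_poly[1]

print(np.allclose(recovered, eps[1:]), param_from_poly)
```

The same relationship holds lag by lag for the seasonal polynomial, so negating the polynomial coefficient at lag 52 gives back the parameter reported in the table.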
I have a TimeArray type dataset, and I would like to perform a linear regression. However, it appears that Julia does not currently support regression methods for TimeArray types.
I can download the data as a DataFrame instead of a TimeArray and use the GLM package, but the TimeArray timestamp is quite useful for other analyses later on. I would like to perform a linear regression directly on the TimeArray dataset.
Edit 1: A simple example is given below:
using TimeSeries
dates = collect(Date(1999,1,1):Date(1999,1,31))
# Dependent variable
y = TimeArray(dates, rand(length(dates)))
# Explanatory variables
x1 = TimeArray(dates, rand(length(dates))) # Explanatory variable 1
x2 = TimeArray(dates, rand(length(dates))) # Explanatory variable 2
x = rename(merge(x1,x2), ["x1", "x2"]) # Merge x1 and x2 into a single TimeArray
# Linear regression
coefs = linreg(x, y) # Yields a method error since linreg does not support the TimeArray type.
Has anyone found a solution or workaround for this problem?
The TimeArray type seems to have a .values field you can use to obtain the values associated with the array in the right order. So you can perform your linear regression with:
coefs = linreg(x.values,y.values)