I would like to restrict (drop) the non-significant parameters in a VAR model, say a VAR(2). How can I do this in Python 3.x?
This question has already been asked for R, but not for Python:
How to fit a restricted VAR model in R?
Can you please help me figure this out?
Another way is to call the R functions from Python.
First, import the required packages into Python:
import rpy2.robjects as robjects, pandas as pd, numpy as np
from rpy2.robjects import r
from rpy2.robjects.numpy2ri import numpy2ri
from rpy2.robjects.packages import importr
Then load the required R package from Python:
r('library("vars")')
Next, suppose your data is stored in a file called here.Rdata in the working directory:
# Load the data in R through Python; this creates a variable B
r('load("here.Rdata")')
# Change the name of the variable to y
r('y=B')
# Run a normal VAR model
r("t=VAR(y, p=5, type='const')")
# Restrict it
r('t1=restrict(t, method = "ser", thresh = 2.0, resmat = NULL)')
# Then find the summary statistics
r('s=summary(t1)')
# Save the output to a text file called myfile.txt
r('capture.output(s, file = "myfile.txt")')
# Open it and print it in Python
with open('myfile.txt', 'r') as f:
    file_contents = f.read()
print(file_contents)
The output was as follows:
VAR Estimation Results:
=========================
Endogenous variables: y1, y2
Deterministic variables: const
Sample size: 103
Log Likelihood: 83.772
Roots of the characteristic polynomial:
0.5334 0.3785 0.3785 0 0 0 0 0 0 0
Call:
VAR(y = y, p = 5, type = "const")
Estimation results for equation y1:
===================================
y1 = y1.l1 + y2.l1 + y1.l2
Estimate Std. Error t value Pr(>|t|)
y1.l1 0.26938 0.11306 2.383 0.01908 *
y2.l1 -0.21767 0.07725 -2.818 0.00583 **
y1.l2 0.24068 0.10116 2.379 0.01925 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.146 on 100 degrees of freedom
Multiple R-Squared: 0.1047, Adjusted R-squared: 0.07786
F-statistic: 3.899 on 3 and 100 DF, p-value: 0.01111
Estimation results for equation y2:
===================================
y2 = y1.l1 + y2.l1 + const
Estimate Std. Error t value Pr(>|t|)
y1.l1 0.73199 0.16065 4.557 1.47e-05 ***
y2.l1 -0.31753 0.10836 -2.930 0.004196 **
const 0.08039 0.02165 3.713 0.000338 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2082 on 100 degrees of freedom
Multiple R-Squared: 0.251, Adjusted R-squared: 0.2286
F-statistic: 11.17 on 3 and 100 DF, p-value: 2.189e-06
Covariance matrix of residuals:
y1 y2
y1 0.02243 0.01573
y2 0.01573 0.04711
Correlation matrix of residuals:
y1 y2
y1 1.0000 0.4838
y2 0.4838 1.0000
To my knowledge, there is no Python package that can apply this kind of restriction. However, the R package vars can, so I called it from Python through the rpy2 interface.
This is straightforward, but a few steps are needed before Python can call R functions (and packages such as vars).
A summary using a different package may be found in this link
However, for the sake of illustration, here is my code.
First, import vars into Python:
# Import vars
Rvars = importr("vars", lib_loc = "C:/Users/Rami Chehab/Documents/R/win-library/3.3")
Then fit a VAR model to your data (or DataFrame), here called data:
t=Rvars.VAR(data,p=2, type='const')
Afterwards, apply another function from the vars package, restrict, which removes the non-significant parameters:
# Let us try restricting it
t1=Rvars.restrict(t,method = "ser")
Finally, to inspect the results, call R's built-in summary function:
# Calling built-in functions from R
Rsummary = robjects.r['summary']
Now print the outcome:
print(Rsummary(t1))
This gives:
Estimation results for equation Ireland.Real.Bond:
==================================================
Ireland.Real.Bond = Ireland.Real.Bond.l1 + Ireland.Real.Equity.l1 + Ireland.Real.Bond.l2
Estimate Std. Error t value Pr(>|t|)
Ireland.Real.Bond.l1 0.26926 0.11139 2.417 0.01739 *
Ireland.Real.Equity.l1 -0.21706 0.07618 -2.849 0.00529 **
Ireland.Real.Bond.l2 0.23929 0.09979 2.398 0.01829 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.41 on 103 degrees of freedom
Multiple R-Squared: 0.1041, Adjusted R-squared: 0.07799
F-statistic: 3.989 on 3 and 103 DF, p-value: 0.009862
Estimation results for equation Ireland.Real.Equity:
====================================================
Ireland.Real.Equity = Ireland.Real.Bond.l1 + Ireland.Real.Equity.l1 + const
Estimate Std. Error t value Pr(>|t|)
Ireland.Real.Bond.l1 0.7253 0.1585 4.575 1.33e-05 ***
Ireland.Real.Equity.l1 -0.3112 0.1068 -2.914 0.004380 **
const 7.7494 2.1057 3.680 0.000373 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.58 on 103 degrees of freedom
Multiple R-Squared: 0.2462, Adjusted R-squared: 0.2243
F-statistic: 11.21 on 3 and 103 DF, p-value: 1.984e-06
I do not understand very well the logic behind sklearn function train_test_split and StratifiedKFold for obtaining balanced splits according to multiple "columns" and not only according to the target distribution. I know the previous sentence is a bit obscure so I hope the following code helps.
import numpy as np
import pandas as pd
import random
n_samples = 100
prob = 0.2
pos = int(n_samples * prob)
neg = n_samples - pos
target = [1] * pos + [0] * neg
cat = ["a"] * 50 + ["b"] * 50
random.shuffle(target)
random.shuffle(cat)
ds = pd.DataFrame()
ds["target"] = target
ds["cat"] = cat
ds["f1"] = np.random.random(size=(n_samples,))
ds["f2"] = np.random.random(size=(n_samples,))
print(ds.head())
This is a 100-example dataset. The target distribution is governed by prob, so in this case we have 20% positive examples. There is a binary categorical column cat, perfectly balanced. The output of the previous code is:
target cat f1 f2
0 0 a 0.970585 0.134268
1 0 a 0.410689 0.225524
2 0 a 0.638111 0.273830
3 0 b 0.594726 0.579668
4 0 a 0.737440 0.667996
With train_test_split(), we can stratify on both target and cat and then study the frequencies:
from sklearn.model_selection import train_test_split, StratifiedKFold
# with train_test_split
training, valid = train_test_split(range(n_samples),
test_size=20,
stratify=ds[["target", "cat"]])
print("---")
print("* training")
print(ds.loc[training, ["target", "cat"]].value_counts() / len(training)) # balanced
print("* validation")
print(ds.loc[valid, ["target", "cat"]].value_counts() / len(valid)) # balanced
we get this:
* dataset
0 0.8
1 0.2
Name: target, dtype: float64
target cat
0 a 0.4
b 0.4
1 a 0.1
b 0.1
dtype: float64
---
* training
target cat
0 a 0.4
b 0.4
1 a 0.1
b 0.1
dtype: float64
* validation
target cat
0 a 0.4
b 0.4
1 a 0.1
b 0.1
dtype: float64
It is perfectly stratified.
Now with StratifiedKFold:
# with stratified k-fold
skf = StratifiedKFold(n_splits=5)
try:
    for train, valid in skf.split(X=range(len(ds)), y=ds[["target", "cat"]]):
        pass
except:
    print("! does not work")
for train, valid in skf.split(X=range(len(ds)), y=ds.target):
    print("happily iterating")
output:
! does not work
happily iterating
happily iterating
happily iterating
happily iterating
happily iterating
How do I obtain what I got with train_test_split with StratifiedKFold? I know there might be data distributions not allowing such stratifications in k-fold cross validation, but I cannot understand why train_test_split accepts two or more columns and the other method does not.
This doesn't seem readily possible currently.
Multilabel isn't exactly what you're looking for, but related. That's been asked here before, and was an Issue on sklearn's github (not sure why it got closed).
As a bit of a hack, you should be able to just combine your two columns into a new one with ordered pairs, and stratify on that?
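A minimal sketch of that hack, assuming a dataset shaped like the one in the question: the two columns are joined into a single label (the name strata is mine) and StratifiedKFold stratifies on it. The toy data is built deterministically so that every combination is divisible by the number of folds.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy data: 40/40/10/10 rows for (0,a), (0,b), (1,a), (1,b).
ds = pd.DataFrame({
    "target": [1] * 10 + [0] * 40 + [1] * 10 + [0] * 40,
    "cat": ["a"] * 50 + ["b"] * 50,
})

# Combine both columns into one label such as "1_a" and stratify on that.
strata = ds["target"].astype(str) + "_" + ds["cat"]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = []
for train_idx, valid_idx in skf.split(np.zeros(len(ds)), strata):
    counts = strata.iloc[valid_idx].value_counts().sort_index()
    fold_counts.append(counts.to_dict())
    print(counts / len(valid_idx))
```

If some combination has fewer members than n_splits, scikit-learn can only warn and spread them unevenly, so perfectly balanced folds are not always achievable.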
I have two series of data, as below. I want to create one OLS linear regression model for df1 and another for df2, then test whether the y-intercepts of these two models are statistically different (p<0.05), and likewise whether their slopes are statistically different (p<0.05). I did the following:
import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
np.inf == float('inf')
data1 = [1, 3, 45, 0, 25, 13, 43]
data2 = [1, 1, 1, 1, 1, 1, 1]
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
fig, ax = plt.subplots()
df1.plot(figsize=(20, 10), linewidth=5, fontsize=18, ax=ax, kind='line')
df2.plot(figsize=(20, 10), linewidth=5, fontsize=18, ax=ax, kind='line')
plt.show()
model1 = sm.OLS(df1, df1.index)
model2 = sm.OLS(df2, df2.index)
results1 = model1.fit()
results2 = model2.fit()
print(results1.summary())
print(results2.summary())
Results #1
OLS Regression Results
=======================================================================================
Dep. Variable: 0 R-squared (uncentered): 0.625
Model: OLS Adj. R-squared (uncentered): 0.563
Method: Least Squares F-statistic: 10.02
Date: Mon, 01 Mar 2021 Prob (F-statistic): 0.0194
Time: 20:34:34 Log-Likelihood: -29.262
No. Observations: 7 AIC: 60.52
Df Residuals: 6 BIC: 60.47
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 5.6703 1.791 3.165 0.019 1.287 10.054
==============================================================================
Omnibus: nan Durbin-Watson: 2.956
Prob(Omnibus): nan Jarque-Bera (JB): 0.769
Skew: 0.811 Prob(JB): 0.681
Kurtosis: 2.943 Cond. No. 1.00
==============================================================================
Results #2
OLS Regression Results
=======================================================================================
Dep. Variable: 0 R-squared (uncentered): 0.692
Model: OLS Adj. R-squared (uncentered): 0.641
Method: Least Squares F-statistic: 13.50
Date: Mon, 01 Mar 2021 Prob (F-statistic): 0.0104
Time: 20:39:14 Log-Likelihood: -5.8073
No. Observations: 7 AIC: 13.61
Df Residuals: 6 BIC: 13.56
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 0.2308 0.063 3.674 0.010 0.077 0.384
==============================================================================
Omnibus: nan Durbin-Watson: 0.148
Prob(Omnibus): nan Jarque-Bera (JB): 0.456
Skew: 0.000 Prob(JB): 0.796
Kurtosis: 1.750 Cond. No. 1.00
==============================================================================
This is as far as I have got, but I think something is wrong. Neither of these regression outcomes seems to show the y-intercept. Also, I expect the coef in results #2 to be 0, since the slope should be 0 when all the values are 1, but the result shows 0.2308. Any suggestions or guiding material will be greatly appreciated.
In statsmodels an OLS model does not fit an intercept by default (see the docs).
exog array_like
A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.
The documentation on the exog argument of the OLS constructor suggests using this feature of the tools module in order to add an intercept to the data.
To perform a hypothesis test on the values of the coefficients, this question provides some guidance. Unfortunately, that approach only works if the variances of the residual errors are the same.
We can start by checking whether the residuals of the two models have the same variance (using Levene's test), ignoring the coefficients of the regression models for now.
import numpy as np
import pandas as pd
from scipy.stats import levene
from statsmodels.tools import add_constant
from statsmodels.formula.api import ols ## use formula api to make the tests easier
np.inf == float('inf')
data1 = [1, 3, 45, 0, 25, 13, 43]
data2 = [1, 1, 1, 1, 1, 1, 1]
df1 = add_constant(pd.DataFrame(data1)) ## add a constant column so we fit an intercept
df1 = df1.reset_index() ## just doing this to make the index a column of the data frame
df1 = df1.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
df2 = add_constant(pd.DataFrame(data2)) ## this does nothing because the y column is already a constant
df2 = df2.reset_index()
df2 = df2.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
formula1 = 'y ~ x + const' ## define formulae
formula2 = 'y ~ x'
model1 = ols(formula1, df1).fit()
model2 = ols(formula2, df2).fit()
print(levene(model1.resid, model2.resid))
The output of the levene test looks like this:
LeveneResult(statistic=7.317386741297884, pvalue=0.019129208414097015)
So we can reject the null hypothesis that the residual distributions have the same variance at alpha=0.05.
There is no point in testing the linear regression coefficients now, because the residuals don't have the same distribution. It is important to remember that in a regression problem it doesn't make sense to compare the regression coefficients independently of the data they are fit on. The distribution of the regression coefficients depends on the distribution of the data.
Let's see what happens when we try the proposed test anyway. Combining the instructions above with this method from the OLS package yields the following code:
## stack the data and add the indicator variable as described in:
## stackexchange question:
df1['c'] = 1 ## add indicator variable that tags the first group of points
## DataFrame.append was removed in pandas 2.0, so stack with pd.concat instead
df_all = pd.concat([df1, df2], ignore_index=True).drop('const', axis=1)
df_all = df_all.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
df_all = df_all.fillna(0) ## a bunch of the values are missing in the indicator columns after stacking
df_all['int'] = df_all['x'] * df_all['c'] # construct the interaction column
print(df_all) ## look at the data
formula = 'y ~ x + c + int' ## define the linear model using the formula api
result = ols(formula, df_all).fit()
hypotheses = '(c = 0), (int = 0)'
f_test = result.f_test(hypotheses)
print(f_test)
The result of the f-test looks like this:
<F test: F=array([[4.01995453]]), p=0.05233934453138028, df_denom=10, df_num=2>
The result of the f-test means that we just barely fail to reject the joint null hypothesis specified in the hypotheses variable, namely that the coefficients of the indicator variable 'c' and the interaction term 'int' are both zero.
From this example it is clear that the f test on the regression coefficients is not very powerful if the residuals do not have the same variance.
Note that the given example has so few points that it is hard for the statistical tests to clearly distinguish the two cases, even though to the human eye they are very different. This is because, although the statistical tests are designed to make few assumptions about the data, those assumptions hold better the more data you have. When testing statistical methods to see if they accord with your expectations, it is often best to start by constructing large samples with little noise and then see how well the methods work as your data sets get smaller and noisier.
For the sake of completeness I will construct an example where the Levene test fails to distinguish the two regression models but the f test succeeds. The idea is to compare the regression of a noisy data set with that of its reverse. The distribution of residual errors will be the same, but the relationship between the variables will be very different. Note that this would not work by reversing the noisy dataset given in the previous example, because that data is so noisy that the f test cannot distinguish between the positive and negative slope.
import numpy as np
import pandas as pd
from scipy.stats import levene
from statsmodels.tools import add_constant
from statsmodels.formula.api import ols ## use formula api to make the tests easier
n_samples = 6
noise = np.random.randn(n_samples) * 5
data1 = np.linspace(0, 30, n_samples) + noise
data2 = data1[::-1] ## reverse the time series
df1 = add_constant(pd.DataFrame(data1)) ## add a constant column so we fit an intercept
df1 = df1.reset_index() ## just doing this to make the index a column of the data frame
df1 = df1.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
df2 = add_constant(pd.DataFrame(data2)) ## this does nothing because the y column is already a constant
df2 = df2.reset_index()
df2 = df2.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
formula1 = 'y ~ x + const' ## define formulae
formula2 = 'y ~ x'
model1 = ols(formula1, df1).fit()
model2 = ols(formula2, df2).fit()
print(levene(model1.resid, model2.resid))
## stack the data and add the indicator variable as described in:
## stackexchange question:
df1['c'] = 1 ## add indicator variable that tags the first group of points
## DataFrame.append was removed in pandas 2.0, so stack with pd.concat instead
df_all = pd.concat([df1, df2], ignore_index=True).drop('const', axis=1)
df_all = df_all.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
df_all = df_all.fillna(0) ## a bunch of the values are missing in the indicator columns after stacking
df_all['int'] = df_all['x'] * df_all['c'] # construct the interaction column
print(df_all) ## look at the data
formula = 'y ~ x + c + int' ## define the linear model using the formula api
result = ols(formula, df_all).fit()
hypotheses = '(c = 0), (int = 0)'
f_test = result.f_test(hypotheses)
print(f_test)
The result of Levene test and the f test follow:
LeveneResult(statistic=5.451203655948632e-31, pvalue=1.0)
<F test: F=array([[10.62788052]]), p=0.005591319998324387, df_denom=8, df_num=2>
A final note: we are doing multiple comparisons on this data and stopping if we get a significant result (if the Levene test rejects the null we quit; if it doesn't, we do the f test). This is a stepwise hypothesis test, and it inflates our false positive error rate, so we should correct our p-values for multiple comparisons before we report our results. Note that the f test already does this for the hypotheses we test about the regression coefficients. I am a bit fuzzy on the underlying assumptions of these testing procedures, so I am not 100% sure that you are better off making the following correction, but keep it in mind in case you feel you are getting false positives too often.
from statsmodels.stats.multitest import multipletests
print(multipletests([1, .005591], .05)) ## correct our p-values given that we did two comparisons
The output looks like this:
(array([False, True]), array([1. , 0.01115074]), 0.025320565519103666, 0.025)
This means we rejected the second null hypothesis under the correction, and that the corrected p-values are [1., 0.011150]. The last two values are the significance levels adjusted for two comparisons under the Sidak and Bonferroni methods, respectively.
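To make the four returned values easier to read, the same call can be unpacked into names. This is a sketch that relies on multipletests using its default method (Holm-Sidak, 'hs') and returning the Sidak-corrected alpha followed by the Bonferroni-corrected alpha; the variable names are mine.

```python
from statsmodels.stats.multitest import multipletests

pvals = [1, 0.005591]  # Levene p-value, then the f test p-value from above

reject, p_corrected, alpha_sidak, alpha_bonf = multipletests(pvals, alpha=0.05)

print("reject null:", reject)
print("corrected p-values:", p_corrected)
print("Sidak alpha:", alpha_sidak)
print("Bonferroni alpha:", alpha_bonf)
```

With two tests, the smaller p-value is corrected to 1 - (1 - p)^2, which is where 0.01115 comes from.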
I hope this helps anyone trying to do this type of work. If anyone has anything to add I would welcome comments. This isn't my area of expertise so I could be making some mistakes.
I'm trying to implement vectorized logistic regression on the Iris dataset. This is the implementation from Andrew Ng's YouTube series on deep learning. My best predictions using this method have reached 81% accuracy, while sklearn's implementation achieves 100% with completely different values for the coefficients and bias. Also, I can't seem to get proper outputs from my cost function. I suspect the issue is in computing the gradients of the cost function with respect to the weights and bias, though in the course he provides all of the necessary equations (unless something is left out that appears only in the actual exercise, which I don't have access to). My code is as follows.
n = 4
m = 150
y = y.reshape(1, 150)
X = X.reshape(4, 150)
W = np.zeros((4, 1))
b = np.zeros((1,1))
for epoch in range(1000):
    Z = np.dot(W.T, X) + b
    A = sigmoid(Z) # 1/(1 + e **(-Z))
    J = -1/m * np.sum(y * np.log(A) + (1-y) * (1 - np.log(A))) #cost function
    dz = A - y
    dw = 1/m * np.dot(X, dz.T)
    db = np.sum(dz)
    W = W - 0.01 * dw
    b = b - 0.01 * db
    if epoch % 100 == 0:
        print(J)
My output looks something like this.
-1.6126604413879289
-1.6185960074767125
-1.6242504226045396
-1.6296400635926438
-1.6347800862216104
-1.6396845400653066
-1.6443664703028427
-1.648838008214648
-1.653110451818512
-1.6571943378913891
W and b values are:
array([[-0.68262679, -1.56816916, 0.12043066, 1.13296948]])
array([[0.53087131]])
Whereas sklearn outputs:
(array([[ 0.41498833, 1.46129739, -2.26214118, -1.0290951 ]]),
array([0.26560617]))
I understand sklearn uses L2 regularization but even when turned off it's still far from the correct values. Any help would be appreciated. Thanks
You are likely getting strange results because you are trying to use logistic regression where y is not a binary choice. Categorizing the iris data is a multiclass problem, y can be one of three values:
> np.unique(iris.target)
> array([0, 1, 2])
The cross entropy cost function expects y to either be one or zero. One way to handle this is the one vs all method.
You can check each class by making y a boolean of whether the iris is in that class or not. For example, here you can make y a data set of either class 1 or not:
y = (iris.target == 1).astype(int)
With that your cost function and gradient descent should work, but you'll need to run it multiple times and pick the best score for each example. Andrew Ng's class talks about this method.
EDIT:
It's not clear what data you are starting with. When I do this, I don't reshape the inputs, so you should double check that all your multiplications deliver the shapes you want. One thing I notice that's a little odd is the last term in your cost function. I generally do this:
cost = -1/m * np.sum(Y*np.log(A) + (1-Y) * np.log(1-A))
not:
-1/m * np.sum(y * np.log(A) + (1-y) * (1 - np.log(A)))
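To see why that last term matters, here is a tiny check on made-up activations and labels (the numbers are purely illustrative). The cross-entropy form can never be negative because both logarithms are of values in (0, 1), whereas the version with (1 - np.log(A)) can go negative, which matches the negative costs printed in the question.

```python
import numpy as np

A = np.array([0.9, 0.2, 0.6])  # hypothetical sigmoid outputs
y = np.array([1, 0, 1])        # hypothetical binary labels
m = len(y)

# Cross-entropy: log(1 - A) in the second term.
cost_correct = -1/m * np.sum(y * np.log(A) + (1 - y) * np.log(1 - A))

# The form from the question: (1 - log(A)) in the second term.
cost_question = -1/m * np.sum(y * np.log(A) + (1 - y) * (1 - np.log(A)))

print(cost_correct, cost_question)
```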
Here's code that converges for me using the dataset from sklearn:
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
# Iris is a multiclass problem. Here, just calculate the probability that
# the class is `iris_class`
iris_class = 0
Y = np.expand_dims((iris.target == iris_class).astype(int), axis=1)
# Y is now a data set of booleans indicating whether the sample is or isn't a member of iris_class
# initialize w and b
W = np.random.randn(4, 1)
b = np.random.randn(1, 1)
a = 0.1 # learning rate
m = Y.shape[0] # number of samples
def sigmoid(Z):
    return 1/(1 + np.exp(-Z))
for i in range(1000):
    Z = np.dot(X ,W) + b
    A = sigmoid(Z)
    dz = A - Y
    dw = 1/m * np.dot(X.T, dz)
    db = np.mean(dz)
    W -= a * dw
    b -= a * db
    cost = -1/m * np.sum(Y*np.log(A) + (1-Y) * np.log(1-A))
    if i%100 == 0:
        print(cost)
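The one vs all idea mentioned earlier can be sketched by repeating this training once per class and then picking, for each sample, the class whose classifier gives the highest probability. The helper name train_binary is hypothetical, and standardizing the features and starting from zero weights are my own choices to keep plain gradient descent well behaved; they are not part of the code above.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
# Assumption: standardize each feature so one learning rate suits all of them.
X = (iris.data - iris.data.mean(axis=0)) / iris.data.std(axis=0)
y = iris.target
classes = np.unique(y)


def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))


def train_binary(X, Y, lr=0.1, n_iter=2000):
    """Gradient descent for one binary classifier; Y has shape (m, 1)."""
    W = np.zeros((X.shape[1], 1))
    b = 0.0
    for _ in range(n_iter):
        dz = sigmoid(X @ W + b) - Y
        W -= lr * (X.T @ dz) / len(Y)
        b -= lr * dz.mean()
    return W, b


# One column of probabilities per class ("is class k" versus "is not").
columns = []
for k in classes:
    Y_k = (y == k).astype(float).reshape(-1, 1)
    W_k, b_k = train_binary(X, Y_k)
    columns.append(sigmoid(X @ W_k + b_k).ravel())
probs = np.column_stack(columns)

# Pick the class whose classifier is most confident for each sample.
preds = classes[np.argmax(probs, axis=1)]
accuracy = (preds == y).mean()
print(accuracy)
```

The accuracy here is measured on the training data only, so it says nothing about generalization.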
I'm a novice at statistical analysis and was looking into using statsmodels. As part of my research I ran into the following set of examples.
The section "OLS non-linear curve but linear in parameters" is confusing the heck out of me. The example follows:
np.random.seed(9876789)
nsample = 50
sig = 0.5
x = np.linspace(0, 20, nsample)
X = np.column_stack((x, np.sin(x), (x-5)**2, np.ones(nsample)))
beta = [0.5, 0.5, -0.02, 5.]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size = nsample)
res = sm.OLS(y, X).fit()
print(res.summary())
This shows the following summary results:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.933
Model: OLS Adj. R-squared: 0.928
Method: Least Squares F-statistic: 211.8
Date: Tue, 28 Feb 2017 Prob (F-statistic): 6.30e-27
Time: 21:33:30 Log-Likelihood: -34.438
No. Observations: 50 AIC: 76.88
Df Residuals: 46 BIC: 84.52
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 0.4687 0.026 17.751 0.000 0.416 0.522
x2 0.4836 0.104 4.659 0.000 0.275 0.693
x3 -0.0174 0.002 -7.507 0.000 -0.022 -0.013
const 5.2058 0.171 30.405 0.000 4.861 5.550
==============================================================================
Omnibus: 0.655 Durbin-Watson: 2.896
Prob(Omnibus): 0.721 Jarque-Bera (JB): 0.360
Skew: 0.207 Prob(JB): 0.835
Kurtosis: 3.026 Cond. No. 221.
==============================================================================
When you plot all of this you get:
Plot of Data, Fit and True Line
What's confusing me is that I can't figure out how the fit comes out of the coefficients shown in the summary table. My understanding is that these coefficients for the linear fit should correspond to an equation in the format X1 * x^3 + X2 * x^2 + X3 * x + Const, but that does not result in the curve seen. My next thought was that it might have been deriving the equation from the columns of the X matrix, therefore something like X1 * x + X2 * sin(x) + X3 * (x-5)^2 + Const. That also does not work.
What does seem to work is a polynomial fit with a degree of around 10. I found that using np.polyfit(x, y, 10). (The Coefficients of which are not similar to the OLS ones, plus there's 6 more)
So my question is: what equation is OLS using to produce the predicted values? How do the coefficients relate to it? When no equation is specified (assuming it's using something different from normal polynomial equations), how does it determine what to use, or what the best fit is?
One Note, I've figured out that I can force it to do what I was expecting by changing the x values used to a different matrix via np.vander()
X = np.vander(x, 4)
This produces results in line with what I was expecting and np.polyfit.
I have normalized my data and applied regression analysis to predict yield (y),
but my predicted output is also normalized (between 0 and 1).
I want my predicted answer in the original units of my data, not between 0 and 1.
Data:
Total_yield(y) Rain(x)
64799.30 720.1
77232.40 382.9
88487.70 1198.2
77338.20 341.4
145602.05 406.4
67680.50 325.8
84536.20 791.8
99854.00 748.6
65939.90 1552.6
61622.80 1357.7
66439.60 344.3
Next, I normalized the data using this code:
from sklearn.preprocessing import Normalizer
import pandas
import numpy
dataframe = pandas.read_csv('/home/desktop/yield.csv')
array = dataframe.values
X = array[:,0:2]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
print(normalizedX)
Total_yield Rain
0 0.999904 0.013858
1 0.999782 0.020872
2 0.999960 0.008924
3 0.999967 0.008092
4 0.999966 0.008199
5 0.999972 0.007481
6 0.999915 0.013026
7 0.999942 0.010758
8 0.999946 0.010414
9 0.999984 0.005627
10 0.999967 0.008167
Next, I used these normalized values to calculate R-squared with the following code:
array=normalizedX
data = pandas.DataFrame(array,columns=['Total_yield','Rain'])
import statsmodels.formula.api as smf
lm = smf.ols(formula='Total_yield ~ Rain', data=data).fit()
lm.summary()
Output :
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: Total_yield R-squared: 0.752
Model: OLS Adj. R-squared: 0.752
Method: Least Squares F-statistic: 1066.
Date: Thu, 09 Feb 2017 Prob (F-statistic): 2.16e-108
Time: 14:21:21 Log-Likelihood: 941.53
No. Observations: 353 AIC: -1879.
Df Residuals: 351 BIC: -1871.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 1.0116 0.001 948.719 0.000 1.009 1.014
Rain -0.3013 0.009 -32.647 0.000 -0.319 -0.283
==============================================================================
Omnibus: 408.798 Durbin-Watson: 1.741
Prob(Omnibus): 0.000 Jarque-Bera (JB): 40636.533
Skew: -4.955 Prob(JB): 0.00
Kurtosis: 54.620 Cond. No. 10.3
==============================================================================
Now, R-squared = 0.75.
Regression model: y = b0 + b1 * x
Yield = b0 + b1 * Rain
Yield = intercept + (coefficient for Rain) * Rain
When I plug in my raw Rain value, I get:
Yield = 1.0116 + (-0.3013 * 720.1 (mm)) = -215.95
A yield of -215.95 is clearly wrong.
And when I use the normalized value for Rain, the predicted yield comes out normalized, between 0 and 1.
I want to predict: if rainfall is 720.1 mm, what will the yield be?
Can anyone help me get the predicted yield in the original units? I want to compare the predicted yield with the given yield.
First, you should not use Normalizer in this case. It does not normalize each feature (column); it rescales each sample (row) to unit norm, which is probably not what you want.
Use MinMaxScaler or RobustScaler to scale each feature. See the preprocessing docs for more details.
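The difference is easy to see on the first three rows of the question's data. In this small sketch, Normalizer makes every row have length 1, so the large yield value dominates and ends up close to 1 in every row, while MinMaxScaler maps each column to the range 0 to 1 independently.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Columns: Total_yield, Rain (first three rows from the question).
data = np.array([[64799.30, 720.1],
                 [77232.40, 382.9],
                 [88487.70, 1198.2]])

row_normed = Normalizer().fit_transform(data)    # each ROW scaled to unit norm
col_scaled = MinMaxScaler().fit_transform(data)  # each COLUMN scaled to [0, 1]

print(row_normed)
print(col_scaled)
```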
Second, these classes have an inverse_transform() method, which converts the predicted y value back to the original units.
import numpy as np
from sklearn.preprocessing import RobustScaler
x = np.asarray([720.1,382.9,1198.2,341.4,406.4,325.8,
791.8,748.6,1552.6,1357.7,344.3]).reshape(-1,1)
y = np.asarray([64799.30,77232.40,88487.70,77338.20,145602.05,67680.50,
84536.20,99854.00,65939.90,61622.80,66439.60]).reshape(-1,1)
scalerx = RobustScaler()
x_scaled = scalerx.fit_transform(x)
scalery = RobustScaler()
y_scaled = scalery.fit_transform(y)
Fit your statsmodels OLS model on these scaled data.
When predicting, first transform your test data (note the 2-D input, one row per sample):
x_scaled_test = scalerx.transform([[720.1]])
Apply your regression model to this value and get the result. This value of y will be in scaled units:
Yield_scaled = b0 + b1 * x_scaled_test
So inverse transform it to get data in original units.
Yield_original = scalery.inverse_transform(Yield_scaled)
But in my opinion, this linear model will not give much accuracy, because when I plotted your data, this is the result.
This data will not be fitted with linear models. Use other techniques, or get more data.