I need to find the quadratic equation term of a graph I have plotted in R.
When I do this in excel, the term appears in a text box on the chart but I'm unsure how to move this to a cell for subsequent use (to apply to values requiring calibrating) or indeed how to ask for it in R. If it is summonable in R, is it saveable as an object to do future calculations with?
This seems like it should be a straightforward request in R, but I can't find any similar questions. Many thanks in advance for any help anyone can provide on this.
All the answers provide aspects of what you appear to want to do, but none thus far brings it all together. Let's consider Tom Liptrot's example:
fit <- lm(speed ~ dist + I(dist^2), cars)
This gives us a fitted linear model with a quadratic in the variable dist. We extract the model coefficients using the coef() extractor function:
> coef(fit)
(Intercept) dist I(dist^2)
5.143960960 0.327454437 -0.001528367
So your fitted equation (subject to rounding because of printing) is:
\hat{speed} = 5.143960960 + (0.327454437 * dist) + (-0.001528367 * dist^2)
(where \hat{speed} is the fitted values of the response, speed).
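For example, plugging in dist = 21 gives 5.143960960 + (0.327454437 * 21) + (-0.001528367 * 21^2) ≈ 11.3465, and we will get exactly this value from a helper function and from predict() below.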
If you want to apply this fitted equation to some data, then we can write our own function to do it:
myfun <- function(newdist, model) {
coefs <- coef(model)
res <- coefs[1] + (coefs[2] * newdist) + (coefs[3] * newdist^2)
return(res)
}
We can apply this function like this:
> myfun(c(21,3,4,5,78,34,23,54), fit)
[1] 11.346494 6.112569 6.429325 6.743024 21.386822 14.510619 11.866907
[8] 18.369782
for some new values of distance (dist), which is what you appear to want to do from the Q. However, in R we don't normally do things like this, because why should the user have to know how to form fitted or predicted values from all the different types of model that can be fitted in R?
In R, we use standard methods and extractor functions. In this case, if you want to apply the "equation", that Excel displays, to all your data to get the fitted values of this regression, in R we would use the fitted() function:
> fitted(fit)
1 2 3 4 5 6 7 8
5.792756 8.265669 6.429325 11.608229 9.991970 8.265669 10.542950 12.624600
9 10 11 12 13 14 15 16
14.510619 10.268988 13.114445 9.428763 11.081703 12.122528 13.114445 12.624600
17 18 19 20 21 22 23 24
14.510619 14.510619 16.972840 12.624600 14.951557 19.289106 21.558767 11.081703
25 26 27 28 29 30 31 32
12.624600 18.369782 14.057455 15.796751 14.057455 15.796751 17.695765 16.201008
33 34 35 36 37 38 39 40
18.688450 21.202650 21.865976 14.951557 16.972840 20.343693 14.057455 17.340416
41 42 43 44 45 46 47 48
18.038887 18.688450 19.840853 20.098387 18.369782 20.576773 22.333670 22.378377
49 50
22.430008 21.93513
If you want to apply your model equation to some new data values not used to fit the model, then we need to get predictions from the model. This is done using the predict() function. Using the distances I plugged into myfun above, this is how we'd do it in a more R-centric fashion:
> newDists <- data.frame(dist = c(21,3,4,5,78,34,23,54))
> newDists
dist
1 21
2 3
3 4
4 5
5 78
6 34
7 23
8 54
> predict(fit, newdata = newDists)
1 2 3 4 5 6 7 8
11.346494 6.112569 6.429325 6.743024 21.386822 14.510619 11.866907 18.369782
First up we create a new data frame with a component named "dist", containing the new distances we want to get predictions for from our model. It is important to note that we include in this data frame a variable that has the same name as the variable used when we created our fitted model. This new data frame must contain all the variables used to fit the model, but in this case we only have one variable, dist. Note also that we don't need to include anything about dist^2. R will handle that for us.
Then we use the predict() function, giving it our fitted model and providing the new data frame just created as argument 'newdata', giving us our new predicted values, which match the ones we did by hand earlier.
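As an aside, if you also want something akin to error bands around the calibration curve, predict() for lm models can return interval estimates too. A quick sketch using the objects created above:
predict(fit, newdata = newDists, interval = "confidence")  # fitted values plus lower/upper confidence bounds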
Something I glossed over is that predict() and fitted() are really a whole group of functions. There are versions for lm() models, for glm() models etc. They are known as generic functions, with methods (versions if you like) for several different types of object. You the user generally only need to remember to use fitted() or predict() etc whilst R takes care of using the correct method for the type of fitted model you provide it. Here are some of the methods available in base R for the fitted() generic function:
> methods(fitted)
[1] fitted.default* fitted.isoreg* fitted.nls*
[4] fitted.smooth.spline*
Non-visible functions are asterisked
You will possibly get more than this depending on what other packages you have loaded. The * just means you can't refer to those functions directly; you have to use fitted() and R works out which of them to use. Note there isn't a method for lm() objects: this type of object doesn't need a special method, so the default method gets used and is suitable.
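If you are curious which methods do exist specifically for lm objects, you can ask R directly (output not shown here):
methods(class = "lm")
This lists lm-specific methods such as predict.lm and summary.lm; consistent with the above, fitted() simply falls back to its default method for lm fits.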
You can add a quadratic term in the formula in lm() to get the fit you are after. You need to use I() around the term you want to square, as in the example below:
plot(speed ~ dist, cars)
fit1 = lm(speed ~ dist, cars) #fits a linear model
abline(fit1) #puts line on plot
fit2 = lm(speed ~ I(dist^2) + dist, cars) #fits a model with a quadratic term
fit2line = predict(fit2, data.frame(dist = -10:130))
lines(-10:130 ,fit2line, col=2) #puts line on plot
To get the coefficients from this use:
coef(fit2)
I don't think it is possible in Excel, as it only provides functions to get coefficients for a linear regression (SLOPE, INTERCEPT, LINEST) or for an exponential one (GROWTH, LOGEST), though you may have more luck using Visual Basic.
As for R you can extract model coefficients using the coef function:
mdl <- lm(y ~ poly(x,2,raw=T))
coef(mdl) # all coefficients
coef(mdl)[3] # only the 2nd order coefficient
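A minimal self-contained sketch showing how these coefficients relate to predictions (x and y below are made-up stand-ins for your own data):
set.seed(1)
x <- 1:20
y <- 3 + 0.5 * x - 0.02 * x^2 + rnorm(20, sd = 0.1)   # toy data with a known quadratic trend
mdl <- lm(y ~ poly(x, 2, raw = TRUE))
coef(mdl)                                             # intercept, linear and quadratic coefficients
predict(mdl, newdata = data.frame(x = c(2.5, 7.5)))   # apply the fitted equation to new x values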
I guess you mean that you plot X vs Y values in Excel or R, and in Excel use the "Add trendline" functionality. In R, you can use the lm function to fit a linear function to your data, and this also gives you the "r squared" term (see examples in the linked page).
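For instance, assuming x and y are the vectors you plotted, something like this gives you both the coefficients and the R-squared that Excel prints next to the trendline:
fit <- lm(y ~ x)
coef(fit)                # intercept and slope, like INTERCEPT()/SLOPE() in Excel
summary(fit)$r.squared   # the R^2 shown with the Excel trendline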
Related
I'm plotting this dataset and making a logarithmic fit, but, for some reason, the fit seems to be badly wrong: at some point I got a good enough fit, but then I re-plotted and got that bad fit again. At the very beginning there was a 0.0 0.0076 data point, but I changed that to 0.001 0.0076 to avoid the asymptote.
I'm using this for the fit (not exactly this file for the image above, but I'm now testing with this one and I get that bad fit as well):
f(x) = a*log(k*x + b)
fit = fit f(x) 'R_B/R_B.txt' via a, k, b
And the output is this
Also, sometimes it says 7 iterations were used, as in the screenshot above, other times only 1, and when it did the "correct" fit it took around 35 iterations and got a = 32, if I remember correctly.
Edit: here is the good one again; the plot I got is this one. And again, I re-plotted and got that weird fit. It's curious that when the good fit is about to be shown and the 0.0 0.0076 point is present, gnuplot says "Undefined value during function evaluation", but that message does not appear when I'm getting the bad one.
Do you know why I keep getting this inconsistency? Thanks for your help.
As I already mentioned in comments, the method of fitting antiderivatives is much better than fitting derivatives, because the numerical computation of derivatives is strongly scattered when the data is even slightly scattered.
The principle of the method of fitting an integral equation (obtained from the original equation to be fitted) is explained in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales . The application to the case of y=a.ln(c.x+b) is shown below.
Numerical calculus: (figure with the numerical computation of the parameters, not reproduced here)
In order to get an even better result (according to some specified criterion of fit), one can use the above values of the parameters as initial values for an iterative method of nonlinear regression implemented in some convenient software.
NOTE : The integral equation used in the present case follows from the method in the paper linked above (equation not reproduced here).
NOTE : In the figure above one can compare the result of the method of fitting an integral equation with the result of the method of fitting with derivatives.
Acknowledgements: Alex Sveshnikov did very good work in applying the method of regression with derivatives. This allows an interesting and enlightening comparison. If the goal is only to compute approximate values of the parameters to be used in nonlinear regression software, both methods are quite equivalent. Nevertheless, the method with the integral equation appears preferable in the case of scattered data.
UPDATE (After Alex Sveshnikov updated his answer)
The figure below was drawn using Alex Sveshnikov's result with a further iterative method of fitting.
The two curves are almost indistinguishable. This shows that (in the present case) the method of fitting the integral equation is almost sufficient without further treatment.
Of course this is not always so satisfying; here it is due to the low scatter of the data.
In ADDITION, an answer to a question raised in comments by CosmeticMichu:
The problem here is that the fit algorithm starts with "wrong" approximations for the parameters a, k, and b, so during the minimization it finds a local minimum, not the global one. You can improve the result if you provide the algorithm with starting values which are close to the optimal ones. For example, let's start with the following parameters:
gnuplot> a=47.5087
gnuplot> k=0.226
gnuplot> b=1.0016
gnuplot> f(x)=a*log(k*x+b)
gnuplot> fit f(x) 'R_B.txt' via a,k,b
....
....
....
After 40 iterations the fit converged.
final sum of squares of residuals : 16.2185
rel. change during last iteration : -7.6943e-06
degrees of freedom (FIT_NDF) : 18
rms of residuals (FIT_STDFIT) = sqrt(WSSR/ndf) : 0.949225
variance of residuals (reduced chisquare) = WSSR/ndf : 0.901027
Final set of parameters Asymptotic Standard Error
======================= ==========================
a = 35.0415 +/- 2.302 (6.57%)
k = 0.372381 +/- 0.0461 (12.38%)
b = 1.07012 +/- 0.02016 (1.884%)
correlation matrix of the fit parameters:
a k b
a 1.000
k -0.994 1.000
b 0.467 -0.531 1.000
The resulting plot is
Now the question is how you can find "good" initial approximations for your parameters. Well, you start with
y = a*log(k*x + b)
If you differentiate this equation you get
y' = a*k/(k*x + b)
or
1/(a*k) = 1/(y'*(k*x + b))
The left-hand side of this equation is some constant 'C', so the expression on the right-hand side should be equal to this constant as well:
1/(y'*(k*x + b)) = C, that is, 1/y' = C*k*x + C*b
In other words, the reciprocal of the derivative of your data should be approximated by a linear function. So, from your data x[i], y[i] you can construct the reciprocal derivatives x[i], (x[i+1]-x[i])/(y[i+1]-y[i]) and make a linear fit of these data:
The fit gives the following values:
C*k = 0.0236179
C*b = 0.106268
Now, we need to find the values for a and C. Let's say that we want the resulting graph to pass close to the starting and the ending point of our dataset. That means that we want
a*log(k*x1 + b) = y1
a*log(k*xn + b) = yn
Thus,
a*log((C*k*x1 + C*b)/C) = a*log(C*k*x1 + C*b) - a*log(C) = y1
a*log((C*k*xn + C*b)/C) = a*log(C*k*xn + C*b) - a*log(C) = yn
By subtracting the equations we get the value for a:
a = (yn-y1)/log((C*k*xn + C*b)/(C*k*x1 + C*b)) = 47.51
Then,
log(k*x1+b) = y1/a
k*x1+b = exp(y1/a)
C*k*x1+C*b = C*exp(y1/a)
From this we can calculate C:
C = (C*k*x1+C*b)/exp(y1/a)
and finally find k and b:
k=0.226
b=1.0016
These are the values used above for finding the better fit.
UPDATE
You can automate the process described above with the following script:
# Name of the file with the data
data='R_B.txt'
# The coordinates of the last data point
xn=NaN
yn=NaN
# The temporary coordinates of a data point used to calculate a derivative
x0=NaN
y0=NaN
linearFit(x)=Ck*x+Cb
fit linearFit(x) data using (xn=$1,dx=$1-x0,x0=$1,$1):(yn=$2,dy=$2-y0,y0=$2,dx/dy) via Ck, Cb
# The coordinates of the first data point
x1=NaN
y1=NaN
plot data using (x1=$1):(y1=$2) every ::0::0
a=(yn-y1)/log((Ck*xn+Cb)/(Ck*x1+Cb))
C=(Ck*x1+Cb)/exp(y1/a)
k=Ck/C
b=Cb/C
f(x)=a*log(k*x+b)
fit f(x) data via a,k,b
plot data, f(x)
pause -1
So I am trying to do a toy example where I know the factors in advance and I want to back them out using FactorAnalysis or PCA in scikit-learn.
Let's say I have defined 4 random X factors and 10 Y dependent variables:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA, FactorAnalysis

# number of obs
N=10000
n_factors=4
n_variables=10
# 4 Random Factors ~N(0,1)
X=np.random.normal(size=(N,n_factors))
# Loadings for 10 Y dependent variables
loadings=pd.DataFrame(np.round(np.random.normal(0,2,size=(n_factors,n_variables)),2))
# Y without unique variance
Y_hat=X.dot(loadings)
There is no random noise here so if I run the PCA it will show that 4 factors explain all the variance as one would expect:
pca=PCA(n_components=n_factors)
pca.fit(Y_hat)
np.cumsum(pca.explained_variance_ratio_)
array([0.47940185, 0.78828548, 0.93573719, 1. ])
So far so good. In the next step I ran the FA and reconstituted Y from the calculated loadings and factor scores:
fa=FactorAnalysis(n_components=n_factors, random_state=0,rotation=None)
X_fa = fa.fit_transform(Y_hat)
loadings_fa=pd.DataFrame(fa.components_)
Y_hat_fa=X_fa.dot(loadings_fa)+np.mean(Y_hat,axis=0)
print((Y_hat_fa-Y_hat).max())
print((Y_hat_fa-Y_hat).min())
6.039613253960852e-13
-5.577760475716786e-13
So the original variables and the reconstituted variables from FA match almost exactly.
However, the loadings don't match at all, and neither do the factors:
loadings_fa-loadings
0 1 2 3 4 5 6 7 8 9
0 1.70402 -3.37357 3.62861 -0.85049 -6.10061 11.63636 3.06843 -6.89921 4.17525 3.90106
1 -1.38336 5.00735 0.04610 1.50830 0.84080 -0.44424 -1.52718 3.53620 3.06496 7.13725
2 1.63517 -1.95932 2.71208 -2.34872 -2.10633 4.50955 3.45529 -1.44261 0.03151 0.37575
3 0.27463 3.89216 2.00659 -2.18016 1.99597 -1.85738 2.34128 6.40504 -0.55935 4.13107
From quick calculations the factors from FA are not even well correlated with the original factors.
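For what it's worth, those quick calculations could look something like this (a sketch using the arrays defined above):
# rows 0-3 of the correlation matrix are the original factors, rows 4-7 the FA scores;
# the off-diagonal block is the cross-correlation between them
corr = np.corrcoef(X.T, X_fa.T)[:n_factors, n_factors:]
print(np.round(corr, 2))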
I am looking for a good theoretical explanation of why I can't back out the original factors and loadings; I am not necessarily looking for a code example.
I want to create a matrix with random numbers in J programming language when the required shape is derived from other variables.
I could create such a matrix with ? 3 5 $ 0 if I specify its shape using literal integers. But I am struggling to find a way to create such a matrix when the shape is # y and # x instead of the 3 and 5 shown in the above example.
I have tried ? 0 $~ # y, # x and it has not worked.
I think I need some way to apply # over a list of variables and return a list of numbers which should be placed after $~, somewhat like map functionality of other languages. Is there a way to do this?
I think that ?#:$ is what you are looking for
3 5 ?#:$ 0
0.031974 0.272734 0.792653 0.439747 0.136448
0.332198 0.00904103 0.7896 0.78304 0.682833
0.27289 0.855249 0.0922516 0.185466 0.257876
The general structure for this is x u#:v y <-> (u (x v y)) where u and v are the verbs and the arguments are x and y.
Hope this helps.
Rereading your question it looks as if you want the shape to be based on the number of items in the arguments. Here I would use # to count the items in each argument, then use , to create the left argument for $&0 and apply ? to the result.
3 4 5 (?#:($&0 #:,))&# 5 3 3 4 5
0.179395 0.456545 0.805514 0.471521 0.0967092
0.942029 0.30713 0.228288 0.693909 0.338689
0.632752 0.618275 0.100224 0.959804 0.517927
Is this closer to what you had in mind?
And as is often the case, I thought of another approach overnight:
3 4 5 ?#0:"0/ 1 2 3 4 5
0.271366 0.291846 0.0493541 0.72488 0.47988
0.50287 0.980205 0.58541 0.778901 0.0755205
0.0114588 0.523955 0.535905 0.5333 0.984908
I am trying to fit a linear model and my dataset is normalized so that each feature is divided by the maximum possible value, so the values range from 0 to 1. Now I came to know from my previous post Linear Regression vs Closed form Ordinary least squares in Python that linear regression in scikit-learn produces the same result as closed-form OLS when the fit_intercept parameter is set to False. I am not quite getting how fit_intercept works.
For any linear problem, if y is the predicted value.
y(w, x) = w_0 + w_1 x_1 + ... + w_p x_p
Across the module, the vector w = (w_1, ..., w_p) is denoted as coef_ and w_0 as intercept_
In closed-form OLS we also have a bias value for w_0, and we introduce the vector X_0 = [1...1] before computing the dot product, then solve using matrix multiplication and the inverse.
w = np.dot(X.T, X)
w1 = np.dot(np.linalg.pinv(w), np.dot(X.T, Y))
When fit_intercept is True, scikit-learn linear regression solves the problem where y is the predicted value:
y(w, x) = w_0 + w_1 x_1 + ... + w_p x_p + b, where b is the intercept term.
How does using fit_intercept in a model differ, and when should one set it to True/False? I was trying to look at the source code, and it seems like the coefficients are normalized by some scale.
if self.fit_intercept:
self.coef_ = self.coef_ / X_scale
self.intercept_ = y_offset - np.dot(X_offset, self.coef_.T)
else:
self.intercept_ = 0
What does this scaling do exactly? I want to interpret the coefficients in both approaches (LinearRegression, closed-form OLS), but since just setting fit_intercept to True/False gives different results for LinearRegression, I can't quite decide on the intuition behind them. Which one is better and why?
Let's take a step back and consider the following sentence you said:
since just setting fit_intercept True/False gives different result for Linear Regression
That is not entirely true. It may or may not be different, and it depends entirely on your data. It would help to understand what goes into the calculation of regression weights. I mean this somewhat literally: what does your input (x) data look like?
Understanding your input data, and understanding why it matters, will help you realize why you sometimes get different results, and why at other times the results are the same.
Data setup
Let's set up some test data:
import numpy as np
from sklearn.linear_model import LinearRegression
np.random.seed(1243)
x = np.random.randint(0,100,size=10)
y = np.random.randint(0,100,size=10)
Our x and y variables look like this:
X Y
51 29
3 73
7 77
98 29
29 80
90 37
49 9
42 53
8 17
65 35
No-intercept model
Recall that the calculation of regression weights has a closed form solution, which we can obtain using the normal equations:
w = (X^T X)^(-1) X^T y
Using this method, we get a single regression coefficient because we only have 1 predictor variable:
x = x.reshape(-1,1)
w = np.dot(x.T, x)
w1 = np.dot(np.linalg.pinv(w), np.dot(x.T, y))
print(w1)
[ 0.53297593]
Now, let's look at scikit-learn when we set fit_intercept = False:
clf = LinearRegression(fit_intercept=False)
print(clf.fit(x, y).coef_)
[ 0.53297593]
What happens when we set fit_intercept = True instead?
clf = LinearRegression(fit_intercept=True)
print(clf.fit(x, y).coef_)
[-0.35535884]
It would seem that setting fit_intercept to True and False gives different answers, and that the "correct" answer occurs only when we set it to False, but this is not entirely correct...
Intercept model
At this point we have to consider what our input data actually is. In the models above, our data matrix (also called a feature matrix, or design matrix in statistics) is just a single vector containing our x values. The y variable is not included in the design matrix. If we want to add an intercept to our model, one common approach is to add a column of 1's to the design matrix, so x becomes:
x_vals = x.flatten()
x = np.zeros((10, 2))
x[:,0] = 1
x[:,1] = x_vals
intercept x
0 1.0 51.0
1 1.0 3.0
2 1.0 7.0
3 1.0 98.0
4 1.0 29.0
5 1.0 90.0
6 1.0 49.0
7 1.0 42.0
8 1.0 8.0
9 1.0 65.0
Now, when we use this as our design matrix, we can try the closed form solution again:
w = np.dot(x.T, x)
w1 = np.dot(np.linalg.pinv(w), np.dot(x.T, y))
print(w1)
[ 59.60686058 -0.35535884]
Notice 2 things:
We now have 2 coefficients. The first is our intercept and the second is the regression coefficient for the x predictor variable
The coefficient for x matches the coefficient from the scikit-learn output above when we set fit_intercept = True
So in the scikit-learn models above, why was there a difference between True and False? Because in one case no intercept was modeled, while in the other case the underlying model included an intercept, which is confirmed when you manually add an intercept term/column when solving the normal equations.
If you were to use this new design matrix in scikit-learn, it doesn't matter whether you set True or False for fit_intercept: the coefficient for the predictor variable will not change (the intercept value will be different due to centering, but that's irrelevant for this discussion):
clf = LinearRegression(fit_intercept=False)
print(clf.fit(x, y).coef_)
[ 59.60686058 -0.35535884]
clf = LinearRegression(fit_intercept=True)
print(clf.fit(x, y).coef_)
[ 0. -0.35535884]
Summing up
The output (i.e. the coefficient values) you get will be entirely dependent on the matrix that you input into these calculations (whether it's the normal equations, scikit-learn, or anything else).
How does it differ to use fit_intercept in a model and when should one set it to True/False
If your design matrix does not contain a 1's column, then normal equations and scikit-learn (fit_intercept = False) will give you the same answer (as you noted). However, if you set the parameter to True, the answer you get will actually be the same as normal equations if you calculated that with a 1's column.
When should you set True/False? As the name suggests, you set False when you don't want to include an intercept in your model. You set True when you do want an intercept, with the understanding that the coefficient values will change, but will match the normal equations approach when your data includes a 1's column
So True/False doesn't actually give you different results (compared to normal equations) when considering the same underlying model. The difference you observe is because you're looking at two different statistical models (one with an intercept term, and one without). The reason the fit_intercept parameter exists is so you can create an intercept model without the hassle of manually adding that 1's column. It effectively allows you to toggle between the two underlying statistical models.
Without going into the details of the mathematical formulation, when fit_intercept is set to False the estimator deliberately sets the intercept to zero, and this in turn affects the other regressors, as the 'responsibility' for the error reduction falls onto those factors. As a result, the results could be very different in the two cases if the fit is sensitive to the presence of an intercept term. The scaling shifts the origin, thereby allowing the same closed-form solution to serve both the intercept and intercept-free models.
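To illustrate that shift of origin, here is a rough sketch of what fit_intercept=True effectively does internally (ignoring sample weights and the old normalize option; the data here is made up):
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((20, 3))
y = rng.random(20)

# centre X and y, solve without an intercept, then recover the intercept from the means
X_offset, y_offset = X.mean(axis=0), y.mean()
coef, *_ = np.linalg.lstsq(X - X_offset, y - y_offset, rcond=None)
intercept = y_offset - X_offset @ coef

clf = LinearRegression(fit_intercept=True).fit(X, y)
print(np.allclose(coef, clf.coef_), np.isclose(intercept, clf.intercept_))  # should print: True True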
I'm currently trying to impute the missing data through a Gaussian mixture model.
My reference paper is from here:
http://mlg.eng.cam.ac.uk/zoubin/papers/nips93.pdf
I currently focus on a bivariate dataset with 2 Gaussian components.
This is the code to define the weight for each Gaussian component:
library(mvtnorm)           # assumed: provides dmvnorm()
myData = faithful[,1:2];   # the data matrix (pi1, pi2, m1, m2, Sigma1, Sigma2, N and no are initialised elsewhere)
for (i in (1:N)) {
prob1 = pi1*dmvnorm(na.exclude(myData[,1:2]),m1,Sigma1); # probabilities of sample points under model 1
prob2 = pi2*dmvnorm(na.exclude(myData[,1:2]),m2,Sigma2); # same for model 2
Z<-rbinom(no,1,prob1/(prob1 + prob2 )) # Z is the latent variable assigning each data point to a particular component
pi1<-rbeta(1,sum(Z)+1/2,no-sum(Z)+1/2)
if (pi1>1/2) {
pi1<-1-pi1
Z<-1-Z
}
}
This is my code to define the missing values:
> whichMissXY<-myData[ which(is.na(myData$waiting)),1:2]
> whichMissXY
eruptions waiting
11 1.833 NA
12 3.917 NA
13 4.200 NA
14 1.750 NA
15 4.700 NA
16 2.167 NA
17 1.750 NA
18 4.800 NA
19 1.600 NA
20 4.250 NA
My problem is how to impute the missing data in the "waiting" variable based on the particular component each point belongs to.
This code is my first attempt at imputing the missing data using conditional mean imputation. I know it is definitely the wrong way: the outcome would not be tied to the particular component and would produce outliers.
miss.B2 <- which(is.na(myData$waiting))
for (i in miss.B2) {
myData[i, "waiting"] <- m1[2] + ((rho * sqrt(Sigma1[2,2]/Sigma1[1,1])) * (myData[i, "eruptions"] - m1[1] ) + rnorm(1,0,Sigma1[2,2]))
#print(miss.B[i,])
}
I would appreciate it if someone could give any advice on how to improve the imputation technique so that it works with the latent/hidden variable in a Gaussian mixture model.
Thank you in advance
This is a solution for one type of covariance structure.
devtools::install_github("alexwhitworth/emclustr")
library(emclustr)
data(faithful)
set.seed(23414L)
ff <- apply(faithful, 2, function(j) {
na_idx <- sample.int(length(j), 50, replace=F)
j[na_idx] <- NA
return(j)
})
ff2 <- em_clust_mvn_miss(ff, nclust=2)
# hmm... seems I don't return the imputed values.
# note to self to update the code
plot(faithful, col= ff2$mix_est)
And the parameter outputs
$it
[1] 27
$clust_prop
[1] 0.3955708 0.6044292
$clust_params
$clust_params[[1]]
$clust_params[[1]]$mu
[1] 2.146797 54.833431
$clust_params[[1]]$sigma
[1] 13.41944
$clust_params[[2]]
$clust_params[[2]]$mu
[1] 4.317408 80.398192
$clust_params[[2]]$sigma
[1] 13.71741
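Until the imputed values are returned by the package, here is a rough sketch of component-based imputation. It assumes you have, for each row with a missing value, a component assignment (your latent Z) and that component's 2x2 mean/covariance in your own notation (e.g. m1, Sigma1), and it draws waiting | eruptions from that component's conditional normal rather than plugging in a global mean:
# conditional distribution of waiting given eruptions for one bivariate-normal component
impute_waiting <- function(eruptions, m, S) {
  mu_cond  <- m[2] + S[2, 1] / S[1, 1] * (eruptions - m[1])   # conditional mean
  var_cond <- S[2, 2] - S[2, 1]^2 / S[1, 1]                   # conditional variance
  rnorm(length(eruptions), mu_cond, sqrt(var_cond))           # draw from the conditional, not just its mean
}
# e.g. for the missing rows assigned to component 1:
# myData[miss.B2, "waiting"] <- impute_waiting(myData[miss.B2, "eruptions"], m1, Sigma1)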