Excel Polynomial Regression with multiple variables

I have seen a lot of tutorials online on how to use polynomial regression in Excel and how to do multiple regression, but none that explain how to deal with multiple variables AND polynomial terms at the same time.
In my spreadsheet, the left columns contain all my variables X1, X2, X3, X4 (say they are features of a car), and Y1 is the price of the car I am looking for.
I have about 5000 rows of data that I generated by running a model with various values of X1, X2, X3, X4, and I am looking to fit a regression so that I can get a good estimate of the model's output without having to run it (saving me valuable computing time).
So far I've managed to do multiple linear regression using the Data Analysis pack in Excel, just using X1, X2, X3, X4. I noticed, however, that the regression looks very messy and inaccurate in places, because my variables X1, X2, X3, X4 affect my output Y1 non-linearly.
I had a look online, and to add polynomials to the mix, tutorials suggest adding an X^2 column. But when I do that (see the right part of the chart), my regression is much, much worse than when I use linear fits.
I know that polynomials can over-fit the data, but I thought that using a quadratic form was safe, since the regression would only have to return a coefficient of 0 to ignore any excess polynomial orders.
Any help would be very welcome.
For info, I get an adjusted R^2 of 0.91 for linear fits and 0.66 when I add a few X^2 columns.
So far this is the best regression I can get (black line is 1:1):
As you can see, I would like to improve the fit for the bottom-left and top-right parts of the curve.
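In case it helps to see the mechanics outside Excel, here is a minimal sketch of the same idea with NumPy and synthetic data (the column layout and coefficients are made up): add squared and pairwise-product columns to the design matrix and fit everything with ordinary least squares. In Excel this corresponds to creating helper columns for X1^2 ... X4^2 and the products Xi*Xj before running the Data Analysis regression.

# Minimal sketch, not the poster's workbook: polynomial regression with
# several variables = linear regression on an expanded set of columns.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(5000, 4))                     # stand-in for X1..X4
y = 3*X[:, 0]**2 + 2*X[:, 1]*X[:, 2] - X[:, 3] + rng.normal(0, 0.1, 5000)

# Design matrix: intercept, linear, squared and interaction terms.
cols = [np.ones(len(X))] + [X[:, i] for i in range(4)]
cols += [X[:, i]**2 for i in range(4)]
cols += [X[:, i]*X[:, j] for i in range(4) for j in range(i + 1, 4)]
A = np.column_stack(cols)

coef, *_ = np.linalg.lstsq(A, y, rcond=None)              # ordinary least squares
pred = A @ coef
ss_res = np.sum((y - pred)**2)
ss_tot = np.sum((y - y.mean())**2)
print("R^2:", 1 - ss_res/ss_tot)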

Related

How do I analyze the change in the relationship between two variables?

I'm working on a simple project in which I'm trying to describe the relationship between two positively correlated variables and determine if that relationship is changing over time, and if so, to what degree. I feel like this is something people probably do pretty often, but maybe I'm just not using the correct terminology, because Google isn't helping me very much.
I've plotted the variables on a scatter plot and know how to determine the correlation coefficient and plot a linear regression. I thought this may be a good first step because the linear regression tells me what I can expect y to be for a given x value. This means I can quantify how "far away" each data point is from the regression line (I think this is called the squared error?). Now I'd like to see what the error looks like for each data point over time. For example, if I have 100 data points and the most recent 20 are much farther away from where the regression line/function says they should be, maybe I could say that the relationship between the variables is showing signs of changing? Does that make any sense at all, or am I way off base?
I have a suspicion that there is a much simpler way to do this and/or that I'm going about it in the wrong way. I'd appreciate any guidance you can offer!
I can suggest two strands of literature that study changing relationships over time. Typing these names into Google should provide you with a large number of references, so I'll stick to concise descriptions.
(1) Structural break modelling. As the name suggests, this assumes that there has been a sudden change in parameters (e.g. a correlation coefficient). This is applicable if there has been a policy change, a change in measurement device, etc. The estimation approach is indeed very close to the procedure you suggest. Namely, you would estimate the squared error (or some other measure of fit) on the full sample and the two sub-samples (before and after the break). If the gains in fit are large when dividing the sample, then you would favour the model with the break and use different coefficients before and after the structural change.
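As a rough illustration of that comparison (not a full structural-break test such as a Chow test), here is a sketch with synthetic data and a break point assumed to be known in advance:

# Compare the fit of one regression on the full sample with the combined
# fit of two regressions estimated before/after a candidate break point.
import numpy as np

def sse(x, y):
    # Sum of squared residuals of a simple linear regression of y on x.
    A = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ coef)**2)

rng = np.random.default_rng(1)
x = rng.normal(size=200)
slope = np.where(np.arange(200) < 100, 1.0, 2.5)          # sudden change at t = 100
y = slope*x + rng.normal(0, 0.3, 200)

sse_full = sse(x, y)
sse_split = sse(x[:100], y[:100]) + sse(x[100:], y[100:])
print("SSE, full sample:", round(sse_full, 1))
print("SSE, two halves :", round(sse_split, 1))           # much smaller => break favoured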
(2) Time-varying coefficient models. This approach is more subtle as coefficients will now evolve more slowly over time. These changes can originate from the time evolution of some observed variables or they can be modeled through some unobserved latent process. In the latter case the estimation typically involves the use of state-space models (and thus the Kalman filter or some more advanced filtering techniques).
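A proper state-space / Kalman-filter treatment is beyond a short snippet, but as a crude stand-in, a rolling-window regression gives a feel for how a slowly drifting coefficient can be tracked; the window length and data below are made up:

# Rolling-window estimate of the slope: if it drifts systematically over
# time, that is evidence the relationship between x and y is changing.
import numpy as np

rng = np.random.default_rng(2)
n, window = 500, 60
x = rng.normal(size=n)
true_slope = np.linspace(0.5, 2.0, n)                     # slowly evolving coefficient
y = true_slope*x + rng.normal(0, 0.3, n)

slopes = []
for start in range(n - window):
    xs, ys = x[start:start + window], y[start:start + window]
    A = np.column_stack([np.ones(window), xs])
    coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
    slopes.append(coef[1])

print("slope in first window:", round(slopes[0], 2))
print("slope in last window :", round(slopes[-1], 2))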
I hope this helps!

Can I tell Excel to keep an automated quadratic model y >= 0, without trying to run a manual model?

I am trying to create a quadratic model in Excel with tennis ranks data.
When I run the automatic trendline function, it gives me a model with negative y values, which obviously cannot occur for ranks.
How do I tell Excel to keep model y-values >=0?
Thank you!
See the screenshot below.
There are several advantages to understanding the formulation / construction of the quadratic trend. For instance, replicating the 'automatic' trendline using LINEST, as follows, gives the user additional control over the individual terms and can highlight any graphical errors:
=$L$3+$K$3*D3+$J$3*D3^2+$I$3*D3^3
This demonstrates a cubic regression (white dots), which coincides with Excel's 'automatic' trend line.
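For readers who want to sanity-check the same idea outside Excel, here is a sketch with made-up data standing in for the ranks: fit the cubic explicitly so every coefficient is visible, then evaluate it term by term, mirroring the cell formula above (LINEST reports coefficients in reverse column order, so with x, x^2, x^3 columns the cubic term comes first, which is presumably the $I$3:$L$3 layout used above):

# Fit a cubic and rebuild the fitted values by hand from the coefficients.
import numpy as np

rng = np.random.default_rng(3)
x = np.arange(1, 51, dtype=float)
y = 0.02*x**3 - 1.5*x**2 + 20*x + 100 + rng.normal(0, 20, x.size)

c3, c2, c1, c0 = np.polyfit(x, y, deg=3)                  # highest order first
fitted = c0 + c1*x + c2*x**2 + c3*x**3                    # same shape as the cell formula
print("coefficients:", round(c3, 4), round(c2, 3), round(c1, 2), round(c0, 1))
print("max |fitted - polyval|:", np.max(np.abs(fitted - np.polyval([c3, c2, c1, c0], x))))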
Summary of possible issues
There are several potential issues and remedies you can consider, depending on the goodness of fit, the data in question, etc. A non-exhaustive list of issues you may be encountering includes the following:
Issue | Resolve
1) Overfitting | Reduce the number of terms (e.g. order = 2 instead of 3, etc.)
2) Wrong fit | Attempt lognormal
3) Negative left | Set the intercept
4) Graphical error | Use a scatter chart; sort x values (ascending)
5) Outliers | Various: exclude/adjust, fit a separate curve (Extreme Value Theory), manually adjust polynomial terms, noting the reduction in goodness of fit, etc.
1) Overfitting
Trendline options: reduce the order per screenshot:
2) Lognormal | Other
Transform the data or consider other fits/curves (you can also place the y and x axes on a lognormal scale, which automatically removes negatives, although consider outliers and the impact on R-squared / goodness of fit); a sketch of this idea appears after this list.
3) Negative left
In certain circumstances, a negative left-hand portion of the curve can be removed by setting the intercept to an appropriate value.
4) Graphical error
It's often easier to use a scatter chart, with the x-values sorted ascending as described (the fitted parameters may be affected otherwise).
5) Outliers
It may be the case that you're fitting to one or two outliers. Consider reducing the complexity/number of terms, or adjusting/omitting outliers suitably. There is an entire branch of statistics that deals with the distribution of extreme values/outliers (Extreme Value Theory - beyond the scope of the present answer).
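To make item 2 concrete, here is a sketch (with made-up data standing in for the ranks, which are always >= 1) of fitting the quadratic to log(y) instead of y; back-transforming with exp() guarantees the predictions stay positive:

# Quadratic fit in log space: exp() of the fitted values is always > 0.
import numpy as np

rng = np.random.default_rng(4)
x = np.arange(1, 41, dtype=float)
rank = np.exp(4 - 0.08*x + rng.normal(0, 0.15, x.size))   # positive "rank" series

coef = np.polyfit(x, np.log(rank), deg=2)                 # quadratic in log(y)
pred = np.exp(np.polyval(coef, x))                        # back-transform
print("minimum predicted rank:", round(pred.min(), 2))    # never negative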
Other remarks:
Rounding of the coefficients shown in the automatic trend-line label can lead to inaccuracies (and human error) when replicating the 'automatic' trend-line displayed on the chart, which is another reason to prefer LINEST / the exact formulation.

How can r-squared be negative when the correlation between prediction and truth is positive?

I am trying to understand how the R-squared (and also the explained variance) metric can be negative (thus indicating non-existent forecasting power) when, at the same time, the correlation between prediction and truth (as well as the slope in a linear regression of truth on prediction) is positive.
R Squared can be negative in a rare scenario.
R squared = 1 - (SSR/SST)
Here, SST stands for the Total Sum of Squares, which is nothing but how much the observed data points vary around the mean of the target variable. The mean acts as the baseline "regression line" here.
SST = Sum (Square (Each data point - Mean of the target variable))
For example,
If we want to build a regression model to predict the height of a student with weight as the independent variable, then a possible prediction without much effort is to calculate the mean height of all current students and use it as the prediction.
In the above diagram, the red line is the "regression line", which is nothing but the mean of all heights. This mean is calculated without much effort and can be considered one of the worst methods of prediction, with poor accuracy. In the diagram itself we can see that the prediction is nowhere near the original data points.
Now consider SSR.
SSR stands for the Sum of Squared Residuals. These residuals are calculated from the model we build with our chosen approach (linear regression, Bayesian regression, polynomial regression, or any other). If we use a sophisticated approach rather than a naive one like the mean, then our accuracy will obviously increase.
SSR = Sum (Square (Each data point - Each corresponding data point in the regression line))
In the above diagram, let's say the blue line indicates a sophisticated model built with extensive mathematical analysis. We can see that it obviously has higher accuracy than the red line.
Now, coming back to the formula:
R squared = 1 - (SSR/SST)
Here,
SST will be a large number, because it comes from a very poor model (the red line).
SSR will be a small number, because it comes from the best model we developed after much mathematical analysis (the blue line).
So SSR/SST will be a very small number (it gets smaller whenever SSR decreases).
So 1 - (SSR/SST) will be a large number.
So we can infer that the higher R squared is, the better the model.
This is the generic case, but it does not carry over directly to the many situations where multiple independent variables are present. In the example we had only one independent variable and one target variable, but in real cases we may have hundreds of independent variables for a single dependent variable. The actual problem is that, out of those hundreds of independent variables:
Some variables will have very high correlation with target variable.
Some variables will have very small correlation with target variable.
Also some independent variables will have no correlation at all.
So, R squared is calculated under the assumption that the average line of the target (a horizontal line, perpendicular to the y axis) is the worst fit a model can have in the riskiest case. SST is the squared difference between this average line and the original data points. Similarly, SSR is the squared difference between the predicted data points (from the model plane) and the original data points.
SSR/SST gives a ratio of how bad the model's residuals are relative to this worst case. If your model builds a plane that is even somewhat better than the worst case, then in the vast majority of cases SSR < SST, which makes R squared positive when you substitute it into the equation.
But what if SSR > SST? This means that your regression plane is worse than the mean line. In this case, R squared will obviously be negative, but this happens only in a small minority of cases.
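A small sketch of the exact scenario in the question: predictions that are strongly positively correlated with the truth, but so badly scaled/biased that SSR > SST, giving a negative R squared (the numbers are synthetic):

# Positive correlation does not guarantee positive R squared.
import numpy as np

rng = np.random.default_rng(5)
truth = rng.normal(size=200)
pred = 5*truth + 3 + rng.normal(0, 0.1, 200)              # correlated but mis-scaled and biased

ssr = np.sum((truth - pred)**2)                           # residuals of the predictions
sst = np.sum((truth - truth.mean())**2)                   # residuals of the mean baseline
print("correlation:", round(np.corrcoef(truth, pred)[0, 1], 3))   # close to +1
print("R squared  :", round(1 - ssr/sst, 1))                      # strongly negative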
This answer was originally written by me on Quora:
https://qr.ae/pNsLU8
https://qr.ae/pNsLUr

Issues with OLS Regression - highly similar X and Intercept coefficients

I'm estimating a linear OLS regression using some software, and I have three variables: Y (dependent), X1 (independent), and Intercept (a column of "1"s I manually created). I created Intercept because this particular software doesn't have a function to add a constant term.
The coefficients of X and Intercept are almost perfectly inverse (e.g. Intercept coefficient = 1.5 and X coefficient = -1.51). Both Y and X are columns of very small percentage changes (e.g. 0.0001). I've tried adding some other independent variables and quickly run into multicollinearity issues, though I'm not sure if that's simply because the variables are highly similar.
I'm not very experienced with stats; are these coefficients a dead giveaway of statistical issues with the regression? Any advice is much appreciated, thank you!

Is linear regression the same thing as ordinary least squares in SPSS?

I want to use a linear regression model, but I want to use ordinary least squares, which I think is a type of linear regression. The software I use is SPSS. It only offers linear regression, partial least squares and 2-stage least squares. I have no idea which one is ordinary least squares (OLS).
Yes. Although 'linear regression' refers to any approach for modelling the relationship between one or more independent variables and a dependent variable, OLS is the method used to fit a simple linear regression to a set of data.
Linear regression is a broad term that just says we are finding a relationship between the dependent and independent variable(s), no matter which technique we use.
OLS is just one technique for doing linear regression.
Let's say:
error (e) = observed value - predicted value
Observed values: the blue dots in the picture.
Predicted values: the points on the line (vertically below the observed values).
The vertical lines represent 'e'. We square them, add them up, and get the total error, and we try to reduce this total error.
For OLS, as the name says (ordinary least squares), we minimise the sum of all e^2, i.e. we try to make the squared error as small as possible.
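Here is a minimal sketch (in Python rather than SPSS, with synthetic data) of that least-squares idea: the intercept and slope returned by an OLS solver are exactly the values that minimise the sum of the squared errors e:

# OLS: choose intercept and slope to minimise sum of (observed - predicted)^2.
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=100)
y = 2.0*x + 1.0 + rng.normal(0, 0.5, 100)

A = np.column_stack([np.ones(len(x)), x])
(intercept, slope), *_ = np.linalg.lstsq(A, y, rcond=None)  # OLS solution

e = y - (intercept + slope*x)                               # observed - predicted
print("intercept, slope:", round(intercept, 2), round(slope, 2))
print("sum of squared errors:", round(np.sum(e**2), 2))
# Perturbing the coefficients away from the OLS solution only increases this sum.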
