Issues with OLS Regression - highly similar X and Intercept coefficients

I'm estimating a linear OLS regression in a software package, and I have three variables: Y (dependent), X (independent), and Intercept (a column of 1s that I created manually). I created Intercept because this particular software has no option to add a constant term.
The coefficients of X and Intercept are almost exact opposites (e.g. Intercept coefficient = 1.5 and X coefficient = -1.51). Both Y and X are columns of very small percentage changes (e.g. 0.0001). I've tried adding other independent variables and quickly ran into multicollinearity issues, though I'm not sure whether that's simply because the variables are highly similar.
I'm not very experienced with stats. Are these coefficients a dead giveaway of statistical issues with the regression? Any advice is much appreciated, thank you!
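For reference, a minimal sketch of what a manually created intercept column does, assuming NumPy is available. The data below is made up (small percentage-change values like those described), and the names x, y, X are only illustrative. It fits OLS with an explicit column of ones and compares the result with np.polyfit, which adds the constant term itself.

```python
import numpy as np

# Hypothetical data: small percentage changes, as described in the question.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.0001, size=200)
y = 0.00005 + 0.8 * x + rng.normal(0.0, 0.00002, size=200)

# Design matrix with a manually created column of ones as the intercept.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

# np.polyfit includes the constant term on its own; both fits should agree.
slope_ref, intercept_ref = np.polyfit(x, y, 1)

print("manual ones column:", intercept, slope)
print("np.polyfit        :", intercept_ref, slope_ref)
```

If the two fits agree on your own data, the manual column of ones is doing its job, and the odd coefficients would have to come from the data rather than from how the constant was added.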

Related

Can I tell Excel to keep an automated quadratic model y>= 0, without trying to run a manual model?

I am trying to create a quadratic model in Excel with tennis ranks data.
When running the automatic model trendline function it gives me a model with negative y values, which can obviously not occur for ranks.
How do I tell Excel to keep model y-values >=0?
Thank you!

Refer to the screenshot.
There are several advantages to understanding the formulation / construct of the quadratic trend. For instance, replicating the 'automatic trend' using 'linest' as follows provides the user with additional control over the individual terms, and can highlight any graphical errors:
=$L$3+$K$3*D3+$J$3*D3^2+$I$3*D3^3
This demonstrates a cubic regression (white dots), which coincides with Excel's 'automatic' trend line.
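The same replication can be sketched outside Excel. The snippet below is only an illustration, assuming NumPy and made-up rank data. np.polyfit returns coefficients from the highest power down, which matches the order in which the formula above reads the cubic, quadratic, linear and constant terms from $I$3 to $L$3.

```python
import numpy as np

# Hypothetical data standing in for the tennis ranks in the question.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([120, 95, 70, 52, 40, 31, 25, 22], dtype=float)

# Cubic least-squares fit; coefficients come back highest power first.
coeffs = np.polyfit(x, y, 3)
c3, c2, c1, c0 = coeffs

# Same structure as the Excel formula: constant + linear + quadratic + cubic.
y_fit = c0 + c1 * x + c2 * x**2 + c3 * x**3

print("coefficients (x^3, x^2, x, 1):", coeffs)
print("fitted values:", np.round(y_fit, 2))
print("any negative fitted value:", bool((y_fit < 0).any()))
```

Having the terms spelled out like this makes it easy to check where the fitted curve goes negative before deciding on one of the remedies.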
Summary of possible issues
There are several potential issues | remedies you can consider, depending on the goodness of fit, data in question, etc. A non-exhaustive list of issues you may be encountering includes the following:
Issue | Resolve
1) Overfitting | Reduce # terms (e.g. order = 2 instead of 3 etc.)
2) Wrong fit | Attempt Lognormal
3) Negative left | Set Intercept
4) Graphical error | Use scatter chart, sort x values (ascending)
5) Outliers | Various: exclude/adjust, fit separate curve (Extreme Value Theory), manually adjust polynomial terms noting reduction to goodness of fit etc.
1) Overfitting
Trendline options: reduce the order per screenshot:
2) Lognormal | Other
Transform / consider other fits/curves (you can also place y and x axes on lognormal scale which will automatically remove negatives, although consider outliers and impact upon R-squared / goodness of fit).
3) Negative left
In certain circumstances, a negative left may be removed by setting the intercept to an appropriate value.
4) Graphical error
It's often easier to use a scatter chart, with x-values ordered per description (regression parameters may be affected otherwise).
5) Outliers
It may be the case that you're fitting to 1 or 2 outliers. Consider reducing complexity/number of terms, or adjusting / omitting outliers suitably. There is an entire branch of statistics that deals with the distribution of extreme values/outliers (Extreme Value Theory, beyond the scope of the present answer).
Other remarks:
Rounding errors in the automatic trend-line function can lead to inaccuracies, as can human error in replicating the 'automatic trend-line' equation displayed on the chart, which suggests that linest / an exact formulation is preferable.

How can r-squared be negative when the correlation between prediction and truth is positive?

I'm trying to understand how the r-squared (and also explained variance) metrics can be negative (thus indicating non-existent forecasting power) when at the same time the correlation between prediction and truth, as well as the slope of a linear regression of truth on prediction, is positive.

R Squared can be negative in a rare scenario.
R squared = 1 – (SSR/SST)
Here, SST stands for the Total Sum of Squares, which measures how much the data points vary around the mean of the target variable. The mean plays the role of the baseline regression line here.
SST = Sum (Square (Each data point- Mean of the target variable))
For example,
If we want to build a regression model to predict the height of a student with weight as the independent variable, then one possible prediction that takes little effort is to calculate the mean height of all current students and use it as the prediction.
In the above diagram, the red line is the regression line, which is nothing but the mean of all heights. This mean is calculated without much effort and can be considered one of the worst methods of prediction, with poor accuracy. In the diagram itself we can see that the prediction is nowhere near the original data points.
Now come to SSR,
SSR stands for Sum of Squared Residuals. The residuals are calculated from the model that we build with our mathematical approach (linear regression, Bayesian regression, polynomial regression or any other approach). If we use a sophisticated approach rather than a naive approach like the mean, our accuracy will usually increase.
SSR = Sum (Square (Each data point - Each corresponding data point in the regression line))
In the above diagram, let's say that the blue line indicates a sophisticated model built with extensive mathematical analysis. We can see that it has clearly higher accuracy than the red line.
Now come to the formula,
R Squared = 1- (SSR/SST)
Here,
SST will be a large number because it comes from a very poor model (red line).
SSR will be a small number because it comes from the best model we developed after much mathematical analysis (blue line).
So, SSR/SST will be a very small number (it becomes smaller whenever SSR decreases).
So, 1 - (SSR/SST) will be close to 1.
So we can infer that the higher R Squared is, the better the model fits.
This is the generic case, but it does not carry over so neatly when many independent variables are present. In the example, we had only one independent variable and one target variable, but in a real case we may have hundreds of independent variables for a single dependent variable. The actual problem is that, out of those hundreds of independent variables:
Some variables will have very high correlation with target variable.
Some variables will have very small correlation with target variable.
Also some independent variables will have no correlation at all.
So, R Squared is calculated on the assumption that the average line of the target, which is a horizontal line (perpendicular to the y axis), is the worst fit a reasonable model should have. SST is the sum of squared differences between this average line and the original data points. Similarly, SSR is the sum of squared differences between the data points predicted by the model plane and the original data points.
SSR/SST is a ratio showing how bad SSR is relative to SST. If your model builds a plane that is at least somewhat better than this baseline, then SSR < SST, and R Squared is positive when you substitute into the equation.
But what if SSR > SST? This means that your regression plane is worse than the mean line (SST). In this case, R Squared will be negative. This cannot happen for an OLS fit with an intercept scored on its own training data, but it can happen when predictions are scored on new data or the model has no intercept, even if the predictions are positively correlated with the truth.
This answer was originally written by me on Quora:
https://qr.ae/pNsLU8
https://qr.ae/pNsLUr
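To connect this back to the question, here is a small illustrative sketch (not part of the original answer), assuming NumPy and made-up numbers. It uses the same R Squared = 1 - (SSR/SST) definition and shows predictions that are perfectly correlated with the truth but shifted or rescaled, so that SSR > SST and R Squared is negative.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R Squared = 1 - SSR/SST, with SST taken around the mean of y_true."""
    ssr = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ssr / sst

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Case 1: predictions move with the truth but carry a constant offset.
y_shifted = y_true + 10.0
# Case 2: predictions move with the truth but on the wrong scale.
y_scaled = 3.0 * y_true

for name, y_pred in [("shifted", y_shifted), ("scaled", y_scaled)]:
    corr = np.corrcoef(y_true, y_pred)[0, 1]
    print(name, "correlation:", corr, "R squared:", r_squared(y_true, y_pred))
```

Correlation ignores offset and scale, while R Squared penalises both, which is one way the two metrics can disagree in sign.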

Excel Polynomial Regression with multiple variables

I saw a lot of tutorials online on how to do polynomial regression and multiple regression in Excel, but none that explain how to deal with multiple variables AND polynomial terms at the same time.
The left columns contain all my variables X1, X2, X3, X4 (say they are features of a car), and Y1 is the price of the car I am looking for.
I have about 5000 lines of data that I got from running a model with various values of X1, X2, X3, X4, and I am looking to fit a regression so that I can get a best estimate of my model without having to run it (saving me valuable computing time).
So far I've managed to do multiple linear regression using the Data Analysis pack in Excel, just by using X1, X2, X3, X4. I noticed, however, that the regression looks very messy and inaccurate in places, which is due to the fact that my variables X1, X2, X3, X4 affect my output Y1 non-linearly.
I had a look online, and to add polynomials to the mix, tutorials suggest adding an X^2 column. But when I do that (see right part of the chart) my regression is much, much worse than when I use linear fits.
I know that polynomials can over-fit the data, but I thought that using a quadratic form was safe, since the regression would only have to return a coefficient of 0 to ignore any excess polynomial orders.
Any help would be very welcome.
For info, I get an adjusted R^2 of 0.91 for linear fits and 0.66 when I add a few X^2 columns.
So far this is the best regression I can get (black line is 1:1):
As you can see I would like to increase the fit for the bottom left part and top right parts of the curve.
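A minimal sketch of the "add X^2 columns" approach with several variables, assuming NumPy and synthetic data (the coefficients and noise level below are made up, not taken from the question). With ordinary least squares, adding columns to a model that already has an intercept cannot lower the plain in-sample R^2, so a large drop may point to a setup problem (for example a mis-selected input range) rather than to the polynomial terms themselves. Adjusted R^2 can fall slightly when useless columns are added, but usually not by much with thousands of rows.

```python
import numpy as np

# Synthetic stand-in for the 5000-row data set: four inputs, one output.
rng = np.random.default_rng(1)
n = 500
X = rng.uniform(0.0, 2.0, size=(n, 4))
y = (1.0 + 2.0 * X[:, 0] - X[:, 1] ** 2
     + 0.5 * X[:, 2] * X[:, 3] + rng.normal(0.0, 0.1, n))

def fit_r2(design, target):
    """Least-squares fit; returns the plain in-sample R^2."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)

ones = np.ones((n, 1))
linear = np.hstack([ones, X])             # intercept + X1..X4
quadratic = np.hstack([ones, X, X ** 2])  # same, plus X1^2..X4^2

r2_linear = fit_r2(linear, y)
r2_quadratic = fit_r2(quadratic, y)
print("R^2 linear   :", r2_linear)
print("R^2 quadratic:", r2_quadratic)
```

Cross terms such as X3*X4 can be added as further columns in the same way if the variables interact.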

standard error of addition, subtraction, multiplication and ratio

Let's say I have two random variables, x and y, each with n observations. I've used a forecasting method to estimate x_{n+1} and y_{n+1}, and I also have the standard error for both x_{n+1} and y_{n+1}. My question is: what are the formulas for the standard errors of x_{n+1} + y_{n+1}, x_{n+1} - y_{n+1}, x_{n+1} * y_{n+1} and x_{n+1} / y_{n+1}, so that I can calculate prediction intervals for the four combinations? Any thoughts would be much appreciated. Thanks.

Well, the general topic you need to look at is called "change of variables" in mathematical statistics.
The density function for a sum of random variables is the convolution of the individual densities (but only if the variables are independent). Likewise for the difference. In special cases, that convolution is easy to find. For example, for Gaussian variables the density of the sum is also a Gaussian.
For product and quotient, there aren't any simple results, except in special cases. For those, you might as well compute the result directly, maybe by sampling or other numerical methods.
If your variables x and y are not independent, that complicates the situation. But even then, I think sampling is straightforward.
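The sampling approach can be sketched as follows. This is only an illustration, assuming NumPy, independent forecast errors, and Gaussian forecast distributions. The point forecasts and standard errors are made-up numbers. For the sum and difference of independent variables the exact standard error is sqrt(se_x^2 + se_y^2), which gives a check on the simulation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical forecasts and their standard errors.
x_hat, se_x = 10.0, 0.5
y_hat, se_y = 4.0, 0.3

# Assumption: independent Gaussian forecast errors.
n_draws = 200_000
x = rng.normal(x_hat, se_x, n_draws)
y = rng.normal(y_hat, se_y, n_draws)

combos = {"sum": x + y, "difference": x - y, "product": x * y, "ratio": x / y}

results = {}
for name, values in combos.items():
    se = values.std(ddof=1)                         # simulated standard error
    lo, hi = np.percentile(values, [2.5, 97.5])     # 95% prediction interval
    results[name] = (se, lo, hi)
    print(f"{name:>10}: se={se:.4f}, 95% interval=({lo:.3f}, {hi:.3f})")

# Exact value for the sum (and difference) under independence.
se_sum_exact = np.hypot(se_x, se_y)
print("exact se of sum:", se_sum_exact)
```

If x and y are correlated, the draws would need to come from a joint distribution instead (for example a bivariate normal with the estimated covariance), and the ratio is only well behaved when y stays away from zero.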

statistical cosinor analysis

Hey, I am trying to run a cosinor analysis in Statistica but am at a loss as to how to do so. I need to calculate the MESOR, AMPLITUDE, and ACROPHASE of circadian rhythm data.
http://www.wepapers.com/Papers/73565/Cosinor_analysis_of_accident_risk_using__SPSS%27s_regression_procedures.ppt
This link shows how to do it, with the formulas and such, but it has not helped me much. Does anyone know the code for it, either in Statistica or SPSS?
I really need to get this done because it is for an important paper.

I don't have SPSS or Statistica, so I can't tell you the exact "push-this-button" kind of steps, but perhaps this will help.
Cosinor analysis is fitting a cosine (or sine) curve with a known period. The main idea is that the non-linear problem of fitting a cosine function can be reduced to a problem that is linear in its parameters if the period is known. I will assume that your period T=24 hours.
You should already have two variables: Time at which the measurement is taken, and Value of the measurement (these, of course, might be called something else).
Now create two new variables: SinTime = sin(2 x pi x Time / 24) and CosTime = cos(2 x pi x Time / 24). This is described on p.11 of the presentation you linked (x is multiplication). Use pi = 3.14159 if the exact value is not built in.
Run multiple linear regression with Value as outcome and SinTime and CosTime as two predictors. You should get estimates of their coefficients, which we will call A and B. If the model is written as Value = MESOR + AMPLITUDE x cos(2 x pi x Time / 24 + ACROPHASE), then A is the CosTime coefficient and B is the SinTime coefficient.
The intercept term of the regression model is the MESOR.
The AMPLITUDE is sqrt(A^2 + B^2) [square root of A squared plus B squared]
The ACROPHASE is arctan(- B / A), where arctan is the inverse function of tan. Plain arctan only returns angles between -pi/2 and pi/2, so check the signs of A and -B to place the angle in the correct quadrant. The last two formulas are from p.14 of the presentation.
The regression model should also give you an R-squared value to see how well the 24 hour circadian pattern fits the data, and an overall p-value that tests for the presence of a circadian component with period 24 hrs.
One can get standard errors on amplitude and phase using standard error propagation formulas, but that is not included in the presentation.
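The steps above can be sketched in Python as a cross-check on whatever Statistica or SPSS reports. This is a minimal illustration, assuming NumPy, a 24 hour period, and simulated data. It also assumes the convention Value = MESOR + AMPLITUDE x cos(2 x pi x Time / 24 + ACROPHASE), under which A is the CosTime coefficient and B is the SinTime coefficient. The variable names are made up for the example.

```python
import numpy as np

period = 24.0  # assumed known period in hours

# Simulated measurements: one every 2 hours over 3 days (hypothetical values).
rng = np.random.default_rng(7)
time = np.arange(0.0, 72.0, 2.0)
true_mesor, true_amp, true_phase = 50.0, 8.0, -1.0
value = (true_mesor
         + true_amp * np.cos(2 * np.pi * time / period + true_phase)
         + rng.normal(0.0, 0.5, time.size))

# The two new predictor variables.
cos_time = np.cos(2 * np.pi * time / period)
sin_time = np.sin(2 * np.pi * time / period)

# Multiple linear regression: intercept, CosTime (A), SinTime (B).
design = np.column_stack([np.ones_like(time), cos_time, sin_time])
(mesor, a, b), *_ = np.linalg.lstsq(design, value, rcond=None)

amplitude = np.hypot(a, b)        # sqrt(A^2 + B^2)
acrophase = np.arctan2(-b, a)     # arctan(-B / A) with quadrant handled

print("MESOR:", mesor)
print("AMPLITUDE:", amplitude)
print("ACROPHASE (radians):", acrophase)
```

np.arctan2 is used instead of a plain arctan so that the quadrant is resolved automatically from the signs of A and -B.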
