
Multiple Linear Regression: a Significant ANOVA but NO significant coefficient predictors?
I ran a multiple regression with 2 IVs to predict a dependent variable. All assumptions were met and the ANOVA has a significant result, but the coefficient table suggests that none of the predictors are significant.
What does this mean, and how should I interpret this result?
(Used SPSS.)

This almost certainly means the two predictors are substantially correlated with each other. The REGRESSION procedure in SPSS Statistics has a variety of collinearity diagnostics to aid in detection of more complicated situations involving collinearity, but in this case simply correlating the two predictors should establish the basic point.
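If the data are available outside SPSS, here is a minimal sketch of that basic check in Python, using made-up predictors x1 and x2 rather than the asker's data:
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # nearly a copy of x1

# Pearson correlation between the two predictors
r = np.corrcoef(x1, x2)[0, 1]
print("correlation:", r)
# With two predictors the VIF is 1 / (1 - r^2); values above ~10 signal trouble
print("VIF:", 1 / (1 - r**2))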

Related

What to do if a significant factor in univariate linear regression becomes insignificant in multivariate analysis?

I am studying the impact of several factors (sleeping time, studying time, anxiety degree, depression degree, ...) on students' final exam marks.
When I did the univariate linear regression analyses, all models were significant (with final exam mark as the dependent variable), although some had a small R^2.
Then I put all the predictor factors into one multiple linear regression model. The result was that most of the predictors were insignificant, with the exception of study time, which was significant and had a large R^2 in both the univariate and multivariate analyses.
How should I explain this in my paper? Is it okay to have this result, or should I search for another model?
It sounds like you have highly correlated predictors. This gives you a very unstable model, one where small changes in a few observations could produce large changes in the regression coefficients.
You should try various models that use subsets of your predictors, and select a final model that has a significant overall F statistic and significant t statistics for the included predictors.
In your paper, you could explain that anxiety score and depression score were too highly correlated to allow them both into the model and you’ve selected the best model that doesn’t contain both of these scores.
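A tiny simulation sketch illustrates the effect, assuming hypothetical anxiety and depression scores rather than your data:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
anxiety = rng.normal(size=n)
depression = anxiety + rng.normal(scale=0.1, size=n)  # highly correlated
mark = anxiety + rng.normal(size=n)

# Each predictor looks significant on its own...
print(sm.OLS(mark, sm.add_constant(anxiety)).fit().pvalues)
print(sm.OLS(mark, sm.add_constant(depression)).fit().pvalues)

# ...but together the coefficient variances inflate and both may lose significance
X = sm.add_constant(np.column_stack([anxiety, depression]))
print(sm.OLS(mark, X).fit().pvalues)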

Perfect Separation on linear model

There are lots of posts here about the "Perfect Separation Error" in statsmodels when running a logistic regression. But I'm not doing logistic regression; I'm doing a GLM with frequency weights and a Gaussian distribution, so basically OLS.
All of my independent variables are categorical, with lots of categories, so it's a high-dimensional, binary-coded feature set.
But I'm very frequently getting the PerfectSeparationError from statsmodels.
I'm running many, many models, and I think I'm getting this error when my data is too thin for that many variables. However, with freq_weights, in theory I actually have many more observations than the dataframe holds, because each observation should be multiplied by its frequency.
Any guidance on how to proceed?
import statsmodels.api as sm

# GLM with Gaussian family and frequency weights (effectively weighted OLS)
reg = sm.GLM(dep, Indies, family=sm.families.Gaussian(), freq_weights=freq)
Error: <class 'statsmodels.tools.sm_exceptions.PerfectSeparationError'>
The check is on perfect prediction and is applied independently of the family.
Currently, there is no workaround when using IRLS. Using scipy optimizers, e.g. method="bfgs", avoids the perfect prediction/separation check.
https://github.com/statsmodels/statsmodels/issues/2680
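A rough sketch of that workaround, with made-up arrays standing in for the question's dep, Indies and freq:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
Indies = sm.add_constant(rng.integers(0, 2, size=(200, 3)).astype(float))
dep = Indies @ np.array([1.0, 0.5, -0.5, 0.25]) + rng.normal(size=200)
freq = np.ones(200)

model = sm.GLM(dep, Indies, family=sm.families.Gaussian(), freq_weights=freq)
# A scipy-based optimizer bypasses the IRLS perfect-prediction check
result = model.fit(method="bfgs")
print(result.params)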
Perfect separation is only defined for the binary case, i.e. family binomial in GLM, and could be extended to other discrete models.
However, there can be other problems with inference if the residual variance is zero, i.e. we have a perfect fit.
Here is an issue with perfect prediction in OLS
https://github.com/statsmodels/statsmodels/issues/1459

Sensitivity vs Positive Predictive Value - which is best?

I am trying to build a model on a class-imbalanced dataset (binary - 1's: 25% and 0's: 75%). I have tried classification algorithms and ensemble techniques. I am a bit confused about the concepts below, as I am more interested in predicting more 1's.
1. Should I give preference to Sensitivity or to Positive Predictive Value?
Some ensemble techniques give at most 45% sensitivity with a low Positive Predictive Value,
and some give 62% Positive Predictive Value with low Sensitivity.
2. My dataset has around 450K observations and 250 features. After a power test I took 10K observations by simple random sampling. When selecting variable importances using ensemble techniques, the features differ from those selected when I tried with 150K observations. Based on my intuition and domain knowledge, I feel the features that came up as important in the 150K-observation sample are more relevant. What is the best practice?
3. Last, can I use the variable importances generated by RF in other ensemble techniques to predict the accuracy?
Can you please help me out, as I am a bit confused about which to choose?
Whether to prefer Sensitivity or Positive Predictive Value depends on the ultimate goal of your analysis. The difference between these two values is nicely explained here: https://onlinecourses.science.psu.edu/stat507/node/71/
Altogether, these are two measures that look at the results from two different perspectives. Sensitivity gives you the probability that the test will find the condition among those who actually have it. Positive predictive value gives you the probability that someone who tests positive actually has the condition.
Accuracy depends on the outcome of your classification: it is defined as (true positives + true negatives)/(total), not on the variable importances generated by RF.
Also, it is possible to compensate for the imbalances in the dataset, see https://stats.stackexchange.com/questions/264798/random-forest-unbalanced-dataset-for-training-test
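In scikit-learn terms, sensitivity is recall and positive predictive value is precision; a minimal check on toy labels (not your data) looks like this:
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

# 3 true positives out of 4 actual positives
print("sensitivity (recall):", recall_score(y_true, y_pred))   # 0.75
# 3 true positives out of 5 predicted positives
print("PPV (precision):", precision_score(y_true, y_pred))     # 0.6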

Improving linear regression model by taking absolute value of predicted output?

I have a particular classification problem that I was able to improve using Python's abs() function. I am still somewhat new when it comes to machine learning, and I wanted to know if what I am doing is actually "allowed," so to speak, for improving a regression problem. The following lines describe my method:
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict

lr = linear_model.LinearRegression()
predicted = abs(cross_val_predict(lr, features, labels_postop_IS, cv=10))
I attempted this solution because linear regression can sometimes produce negative predicted values, even though in my particular case these predictions should never be negative, as they are a physical quantity.
Using the abs() function, my predictions produce a better fit for the data.
Is this allowed?
Why would it not be "allowed"? If you want to make certain statistical statements (like a 95% CI, for example) you need to be careful. However, most ML practitioners do not care too much about the underlying statistical assumptions and just want a black-box model that can be evaluated based on accuracy or some other performance metric. So basically everything is allowed in ML; you just have to be careful not to overfit. A more sensible solution to your problem might be to use a function that truncates at 0, like f(x) = x if x > 0 else 0, as sketched below. This way large negative values don't suddenly become large positive ones.
On a side note, you should probably try some other models as well, with more parameters, like an SVR with a non-linear kernel. The thing is, obviously, that an LR fits a line, and if this line is not parallel to your x-axis (thinking in the single-variable case) it will inevitably produce negative values at some point. That's one reason why it is often advised not to use LRs for predictions outside the fitted data.
A straight line y = a + bx will predict negative y for some x unless b = 0 and a ≥ 0. Using a logarithmic scale seems a natural solution to fix this.
In the case of linear regression, there is no restriction on your outputs.
If your data are non-negative (as in your case, where the values are physical quantities and cannot be negative), you could model them using a generalized linear model (GLM) with a log link function. With a Poisson family, this is known as Poisson regression, which is helpful for modeling discrete non-negative counts. The Poisson distribution is parameterized by a single value λ, which is both the expected value and the variance of the distribution.
I cannot say your approach is wrong, but a better way is to go towards the above method.
This amounts to fitting a linear model to the logarithm of the expected response, so the predictions can never be negative.
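A minimal sketch in statsmodels (assuming a recent version where the link class is sm.families.links.Log, and using hypothetical positive-valued data in place of the asker's):
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = np.exp(X @ np.array([0.5, 1.0, -0.3]) + rng.normal(scale=0.1, size=100))

# A Gaussian GLM with a log link keeps fitted values positive;
# swap in sm.families.Poisson() for discrete counts
model = sm.GLM(y, X, family=sm.families.Gaussian(link=sm.families.links.Log()))
print(model.fit().predict(X)[:5])  # strictly positive predictions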

Comparing a Poisson Regression to a Logistic Regression

I have data with an associated binary outcome variable. Naturally, I ran a logistic regression in order to see parameter estimates and odds ratios. I was curious, though, to change this data from a binary outcome to count data. Then I ran a Poisson regression (and a negative binomial regression) on the count data.
I have no idea how to compare these different models, though; all the comparisons I see seem to be concerned only with nested models.
How would you go about deciding on the best model to use in this situation?
Essentially, both models will be roughly equal. What really matters is your objective: what you really want to predict. If you want to determine whether cases are good or bad (1 or 0), then go for logistic regression. If you are really interested in how many events the cases will produce (counts), then do Poisson.
In other words, the only difference between these two models is the logistic transformation and the fact that logistic regression tries to minimize the misclassification error (-2 log likelihood). To put it simply, even if you run a linear regression (OLS) on the binary outcome, you should not see big differences from your logistic model, apart from the fact that the results may not be between 0 and 1 (e.g. the area under the ROC curve will be similar to the logistic model's).
To sum up, don't worry about which of these two models is better; they should be roughly the same in the way they capture your features' information. Just think about what makes more sense to optimize: counts or probabilities. The answer might have been different if you were considering non-linear models (e.g. random forests or neural networks), but the two you are considering are both (almost) linear, so don't worry about it.
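A quick sanity check of the AUC claim, on synthetic data rather than the asker's:
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=500) > 0).astype(int)

# Rank-based metrics such as AUC are typically very close for the two models
auc_logit = roc_auc_score(y, LogisticRegression().fit(X, y).predict_proba(X)[:, 1])
auc_ols = roc_auc_score(y, LinearRegression().fit(X, y).predict(X))
print(auc_logit, auc_ols)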
One thing to consider is the sample design. If you are using a case-control study, then logistic regression is the way to go because of its logit link function, rather than the log link used in Poisson regression. This is because, where there is oversampling of cases, such as in a case-control study, the odds ratio remains unbiased.
