Suppose I have three independent variables, one of which has a non-linear (exponential) relationship with the dependent variable, while the other two are linearly related to it. In such a case, what would be the best approach for running a regression analysis?
For context, I have already tried transforming the non-linear independent variable.
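One common approach (a minimal sketch in Python with scikit-learn; the variable names, coefficients, and simulated data are invented purely for illustration) is to transform only the non-linear predictor so that every term enters the model linearly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(1, 5, n)                 # exponentially related to y
x2 = rng.normal(size=n)                   # linearly related to y
x3 = rng.normal(size=n)                   # linearly related to y
y = np.exp(x1) + 2 * x2 - x3 + rng.normal(scale=0.5, size=n)

# Transform only the non-linear predictor: enter exp(x1) as a feature
# while keeping x2 and x3 untransformed, so ordinary least squares applies.
X = np.column_stack([np.exp(x1), x2, x3])
model = LinearRegression().fit(X, y)
```

If instead the relationship is of the form y ≈ exp(b·x1), taking the log of the dependent variable (rather than transforming the predictor) may linearize it; which transformation is appropriate depends on which form of "exponential" actually holds in the data.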
I'm trying to model outcomes of 2 binomial variables that I don't believe to be independent of one another. I had the idea of nesting the outcomes of one variable within the other like this:
but I'm unsure of how to go about this in SAS. Otherwise, is it acceptable to use multinomial regression and make dummy variables to compare groups without nesting? The two variables appear related, but there are other variables we're interested in testing for significance.
I want to impute missing values of an independent variable, say X1. The other independent variables are only weakly related to X1, but the dependent variable is strongly related to it.
I wish to use sklearn's IterativeImputer with estimators such as a KNN regressor or ExtraTreesRegressor (similar to missForest in R).
https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py
Can I use the dependent variable, in addition to the independent variables, to impute values of X1? Will this introduce too much variance in my model? If this isn't recommended, how should X1 be treated? Deleting X1 is not an option, and I fear that if I impute X1's missing values using only the other IVs, the imputed values would not be even moderately accurate.
Thanks
I don't know anything about the software packages you are referring to, but imputing variables while ignoring their relations with the dependent variable is generally a bad idea. Doing so implicitly assumes there is no relation between these variables, and consequently the correlations between the dependent variable and the imputed values will be biased towards 0.
Graham (2009) writes about this:
"The truth is that all variables in the analysis model must be included in the imputation model. The fear is that including the DV in the imputation model might lead to bias in estimating the important relationships (e.g., the regression coefficient of a program variable predicting the DV). However, the opposite actually happens. When the DV is included in the model, all relevant parameter estimates are unbiased, but excluding the DV from the imputation model for the IVs and covariates can be shown to produce biased estimates."
Hope this helps. To summarize:
Can I use dependent variable in addition to independent variables to impute values of X1?
Yes, you can, and most of the literature I've read suggests you definitely should.
Will this introduce too much variance in my model?
No, it shouldn't (why do you assume it would introduce more variance, and variance in what, exactly?). If anything, it should reduce bias in the estimated covariances/correlations of the variables.
For an excellent article about imputation see:
Graham (2009). Missing data analysis: making it work in the real world. Annual Review of Psychology, 60, 549-576.
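To make this concrete, here is a minimal sketch with sklearn's IterativeImputer (an experimental API that must be explicitly enabled), where the matrix passed to the imputer deliberately includes the dependent variable; the simulated data and coefficients are invented for illustration:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the API)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
n = 300
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.1 * x2 + rng.normal(size=n)        # X1 only weakly related to the other IVs
y = 2.0 * x1 + 0.5 * x2 + 0.5 * x3 + rng.normal(scale=0.5, size=n)  # y strongly related to X1

data = np.column_stack([x1, x2, x3, y])
data[rng.random(n) < 0.2, 0] = np.nan     # make ~20% of X1 missing

# The key point: the matrix handed to the imputer includes the DV (last column),
# so the strong X1–y relation is used when filling in X1.
imputer = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=50, random_state=0),
    random_state=0,
)
completed = imputer.fit_transform(data)
```

After imputation you would fit the analysis model on the first three columns of `completed` against the last, as usual.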
So I am trying to explain my strictly bounded outcome variable (a percentage) with some predictors, both categorical and numerical. I have read quite a bit about the topic, but I am still confused by some of the arguments. The purpose of my regression is explanation, not prediction.
What are the consequences of running a linear regression on a strictly bounded outcome variable?
A linear regression does not have a bounded output. It's a linear transformation of the input, so if the input is twice as large, the output will be twice as large. That way, it will always be possible to find an input that exceeds the boundaries of the output.
You can apply a sigmoid function to the output of the linear regression (this is called "logistic regression"), but that models a binary variable and gives you the probability of it being 1. In your case the variable isn't binary; it can take any value between 0 and 1. For that problem you can use beta regression, which models a bounded outcome between 0 and 1.
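A quick numerical illustration of the first point (pure NumPy, with simulated data): an OLS line fitted to a bounded outcome will happily extrapolate outside [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
# A percentage-like outcome that saturates near 0 and 1 (logistic-shaped)
y = 1 / (1 + np.exp(-(x - 5))) + rng.normal(scale=0.02, size=x.size)
y = np.clip(y, 0, 1)

slope, intercept = np.polyfit(x, y, 1)   # ordinary least squares line
pred_at_15 = slope * 15 + intercept      # extrapolate beyond the observed range

# The linear fit predicts a "percentage" well above 1 here, which is
# exactly the consequence of ignoring the bounds.
```

This is harmless near the middle of the data but breaks down toward and beyond the boundaries, which is why a bounded model (logistic for binary data, beta regression for proportions) is preferred.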
I've been searching the web for ways to conduct multivariate regressions in Excel, and I've seen that the Analysis ToolPak can get the job done.
However, it seems that the Analysis ToolPak can handle multiple linear regression (one dependent variable Y modeled from several independent variables x1, x2, ..., xn) but not multivariate linear regression (more than one dependent variable Y1, ..., Ym, each modeled from the same set of predictors).
Is there a way to conduct multivariate regressions in Excel or should I start looking for other programs like R?
Thanks in advance
Is this what you are looking for?
http://smallbusiness.chron.com/run-multivariate-regression-excel-42353.html
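If you do end up moving beyond Excel, multivariate (multi-output) linear regression is a one-liner in, for example, Python's scikit-learn, which fits all responses at once; the data below is simulated just to show the shape of the problem:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # three independent variables
B = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [0.0, 3.0]])                         # true coefficients, one column per response
Y = X @ B + rng.normal(scale=0.1, size=(100, 2))   # two dependent variables

model = LinearRegression().fit(X, Y)               # fits both responses at once
# model.coef_ has shape (2, 3): one row of coefficients per dependent variable
```

R (e.g. `lm(cbind(y1, y2) ~ x1 + x2 + x3)`) handles the same problem natively as well.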
How should one decide between using a linear regression model or non-linear regression model?
My goal is to predict Y.
In the case of a simple x-and-y dataset, I can easily decide which regression model to use by looking at a scatter plot.
But in the multivariate case, with x1, x2, ..., xn and y, how can I decide which regression model to use? That is, how do I decide between a simple linear model and non-linear models such as quadratic or cubic ones?
Is there any technique, statistical approach, or graphical plot to help infer which regression model should be used? Please advise.
That is a pretty complex question.
You start visually: if the data satisfy the assumptions of the classical linear model (e.g., normally distributed errors), you use a linear model. I normally start by making a scatter-plot matrix to observe the relationships. If it is obvious that a relationship is non-linear, you use a non-linear model. A lot of the time I simply inspect visually, assuming the number of factors is not too large.
For example, a clearly curved (say, exponential or S-shaped) trend in such a plot would call for a non-linear model.
However, if you want to use data mining (and computationally demanding methods), I suggest starting with stepwise regression. You first set a model-evaluation criterion, for example R^2. You start with an empty model and sequentially add predictors (or transformations of them) until the criterion is "maximized". However, adding a new predictor almost always increases R^2, which is a form of over-fitting.
The solution is to split the data into training and testing sets. You build the model on the training set and evaluate the mean error on the testing set. The best model is the one that minimizes the mean error on the testing set.
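The procedure above can be sketched as follows (Python; greedy forward selection scored on a held-out test set rather than on in-sample R^2, with made-up simulated data in which one predictor acts quadratically):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 5))
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)  # one non-linear term

# Candidate predictors: the original variables plus their squares
candidates = {f"x{i}": X[:, i] for i in range(5)}
candidates.update({f"x{i}^2": X[:, i] ** 2 for i in range(5)})

train = np.arange(n) < 200            # simple train/test split
test = ~train

selected, best_err = [], np.inf
improved = True
while improved:
    improved = False
    for name, col in candidates.items():
        if name in selected:
            continue
        feats = np.column_stack([candidates[f] for f in selected] + [col])
        fit = LinearRegression().fit(feats[train], y[train])
        err = mean_squared_error(y[test], fit.predict(feats[test]))
        if err < best_err:
            best_err, best_name, improved = err, name, True
    if improved:
        selected.append(best_name)
```

The selection correctly picks up the linear term in x0 and the quadratic term in x1. With many candidate predictors, cross-validation over several splits is more stable than a single train/test split.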
If your data is sparse, try integrating ridge or lasso regression into the model evaluation.
Again, this is a rather complex question, and the answer also depends on whether you are building a descriptive or an explanatory model.