I'm wondering if there is a way to run many single-covariate logistic regressions automatically. I want to do this for every variable because of missing values: I wanted to fit a multiple logistic regression, but I have too many missing values for that. I don't want to compute a logistic regression for each variable in my database by hand; is there an automatic way?
Thank you very much!
You can code it using SPSS syntax.
For example:
LOGISTIC REGRESSION VARIABLES F2B16C              /* dependent variable */
  /METHOD=BSTEP                                   /* backward stepwise: all variables in, then see what can be backed out */
  XRACE BYSES2 BYTXMSTD F1RGPP2 F1STEXP XHiMath   /* independent variables */
  /CONTRAST (XRACE)=INDICATOR(6)                  /* creates the dummy variables, with category 6 as the base case */
  /CONTRAST (F1RGPP2)=INDICATOR(6)
  /CONTRAST (F1STEXP)=INDICATOR(6)
  /CONTRAST (XHiMath)=INDICATOR(5)
  /PRINT=GOODFIT CORR ITER(1)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
If you do that, you can also tell it to keep records with missing values where appropriate by adding /MISSING=INCLUDE.
If anyone knows of a good explanation of the implications of including the missing values, I'd love to hear it.
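Outside SPSS, the same one-predictor-at-a-time loop is easy to script. Here is a minimal sketch in Python with statsmodels; the file name and the outcome column "y" are hypothetical stand-ins. Because each univariate fit drops only the rows that are missing for its own predictor, this also sidesteps the listwise deletion problem that breaks the full multiple regression:
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mydata.csv")   # hypothetical file name
outcome = "y"                    # hypothetical binary 0/1 outcome column
predictors = [c for c in df.columns if c != outcome]

results = []
for var in predictors:
    sub = df[[outcome, var]].dropna()            # each fit drops only its own missing rows
    X = sm.add_constant(sub[var])                # intercept plus the single covariate
    fit = sm.Logit(sub[outcome], X).fit(disp=0)
    results.append({"variable": var,
                    "coef": fit.params[var],
                    "p_value": fit.pvalues[var],
                    "n": len(sub)})

print(pd.DataFrame(results).sort_values("p_value"))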
I am relatively new to statistics and am struggling with the normality assumption.
I understand that parametric tests are underpinned by the assumption that the data are normally distributed, but there seem to be lots of papers and articles providing conflicting information.
Some articles say that independent variables need to be normally distributed, and that this may require a transformation (log, square root, etc.). Others say that in linear modelling there are no assumptions about the distribution of the independent variables.
I am trying to create a multiple regression model to predict highest pain scores on hospital admissions:
DV: numeric pain score (0 = no pain to 5 = intense pain; discrete dependent variable).
IVs: age (continuous), weight (continuous), sex (nominal), deprivation status (ordinal), race (nominal).
Can someone help clear up the following for me?
Before fitting a model, do I need to check whether my independent variables are normally distributed? If so, why? Does this only apply to continuous variables (e.g. age and weight in my model)?
If age is positively skewed, would a transformation (e.g. log, square root) be appropriate, and why? Is it best to do this before or after fitting a model? I assume I am trying to get close to a linear relationship between my DV and IVs.
As part of its output, SPSS provides plots of the standardised residuals against predicted values, as well as normal P-P plots of the standardised residuals. Are these plots all that is needed to check the normality assumption after fitting a model?
Many Thanks in advance!
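For context, here is the kind of post-fit check I mean, sketched in Python with statsmodels on made-up data: residuals against predicted values, plus a normal Q-Q plot of the residuals (a close cousin of SPSS's P-P plot):
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Made-up stand-ins for the admission data described above.
rng = np.random.default_rng(0)
n = 300
age = rng.normal(55, 15, n)
weight = rng.normal(75, 12, n)
X = sm.add_constant(np.column_stack([age, weight]))
pain = 1 + 0.03 * age + 0.01 * weight + rng.normal(0, 1, n)   # treated as continuous here

model = sm.OLS(pain, X).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(model.fittedvalues, model.resid, s=10)   # residuals vs predicted: look for fanning or curvature
ax1.axhline(0, color="grey")
ax1.set(xlabel="Predicted values", ylabel="Residuals")
sm.qqplot(model.resid, line="45", fit=True, ax=ax2)  # normal Q-Q plot of the residuals
plt.show()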
I'm trying to model the outcomes of two binomial variables that I don't believe to be independent of one another. I had the idea of nesting the outcomes of one variable within the other, but I'm unsure of how to go about this in SAS. Otherwise, is it acceptable to use multinomial regression and make dummy variables to compare groups without nesting? The two variables appear related, but there are other variables we're interested in testing for significance.
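For reference, here is the multinomial idea sketched in Python with statsmodels on made-up data: cross the two binary outcomes into a single four-level variable and fit a multinomial model against the covariates of interest:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)                                        # hypothetical covariate of interest
a = rng.binomial(1, 0.5, n)                                   # first binary outcome
b = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * a + 0.5 * x))))   # second outcome, related to the first

combined = 2 * a + b                                          # four levels: 00, 01, 10, 11
fit = sm.MNLogit(combined, sm.add_constant(x)).fit(disp=0)
print(fit.summary())                                          # one coefficient set per non-reference level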
I want to impute missing values of an independent variable, say X1. The other independent variables are only weakly related to X1; however, the dependent variable is strongly related to X1.
I wish to use sklearn's IterativeImputer with estimators such as a KNN regressor or ExtraTreesRegressor (similar to missForest in R):
https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py
Can I use the dependent variable, in addition to the independent variables, to impute values of X1? Will this introduce too much variance into my model? If this isn't recommended, how should X1 be treated? Deleting X1 is not an option, and I fear that if I impute the missing values of X1 using only the other IVs, the imputed values will not be even moderately accurate.
Thanks
I don't know anything about the software packages you are referring to, but imputing variables while ignoring their relations with the dependent variable is generally a bad idea: it assumes that there is no relation between these variables, and consequently correlations between the dependent variable and the imputed values will be biased towards 0.
Graham (2009) writes about this:
"The truth is that all variables in the analysis model must be
included in the imputation model. The fear is that including the DV in
the imputation model might lead to bias in estimating the important
relationships (e.g., the regression coefficient of a program variable
predicting the DV). However, the opposite actually happens. When the DV is included in the model, all relevant parameter estimates are unbiased, but excluding the DV from the imputation model for the IVs and covariates can be shown to produce biased estimates."
Hope this helps. To summarize:
Can I use dependent variable in addition to independent variables to impute values of X1?
Yes, you can, and most of the literature I've read suggests you definitely should.
Will this introduce too much variance in my model?
No, it shouldn't (why do you assume this would introduce more variance? And variance in what exactly?). It should reduce bias in the estimated covariance/correlation of the variables.
For an excellent article about imputation see:
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549–576.
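To make this concrete, here is a minimal sketch with sklearn's IterativeImputer where the DV is simply stacked next to the IVs in the matrix handed to the imputer. The data are made up, and for a real analysis you would typically prefer proper multiple imputation over a single pass:
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, activates IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(2)
n = 400
x1 = rng.normal(size=n)
x2 = 0.2 * x1 + rng.normal(size=n)                  # other IV, only weakly related to X1
y = 2.0 * x1 + 0.3 * x2 + rng.normal(0, 0.5, n)     # DV strongly related to X1

x1_obs = x1.copy()
x1_obs[rng.random(n) < 0.3] = np.nan                # knock out about 30% of X1

# The key step: the DV is part of the matrix handed to the imputer.
data = np.column_stack([x1_obs, x2, y])
imp = IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=50, random_state=0),
                       random_state=0)
x1_imputed = imp.fit_transform(data)[:, 0]          # use this X1 (with x2) in the analysis model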
Is there a way to get the significance level of each coefficient after fitting a logistic regression model on training data?
I have been trying to find a way but could not figure it out myself.
I think I may be able to get the significance level of each feature if I run a chi-squared test, but first of all I am not sure whether I can run the test on all the features together, and secondly my data are numeric, so whether it would give the right result is a question as well.
Right now I am running the modelling part using statsmodels and scikit-learn, but I certainly want to know how I can get these results from PySpark ML or MLlib itself.
If anyone can shed some light, it will be helpful.
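For what it's worth, this is how I currently pull them from statsmodels (made-up data); the PySpark equivalent of this is what I'm after:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))                       # made-up features
y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([1.0, 0.0, -0.5]))))

fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(fit.summary())    # z statistics and p-values for every coefficient
print(fit.pvalues)      # or grab the p-values directly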
I use only MLlib. I think that when you train a model you can use the toPMML method to export your model in PMML format (an XML file); you can then parse the XML file to get the feature weights. Here is an example:
https://spark.apache.org/docs/2.0.2/mllib-pmml-model-export.html
Hope that will help
I've been searching the web for ways to conduct multivariate regressions in Excel, and I've seen that the Analysis ToolPak can get the job done.
However, it seems that the Analysis ToolPak can handle multivariable linear regression but not multivariate linear regression (the latter meaning that there may be more than one dependent variable Y1, ..., Yn, the former that a single dependent variable is modelled with multiple independent variables, Y = x1 + x2 + ... + xn).
Is there a way to conduct multivariate regressions in Excel or should I start looking for other programs like R?
Thanks in advance
Is this what you are looking for?
http://smallbusiness.chron.com/run-multivariate-regression-excel-42353.html
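And if you do end up moving beyond Excel: multivariate (multiple-DV) linear regression is nearly a one-liner in, for example, Python's scikit-learn, which accepts a two-dimensional Y and fits all responses at once (made-up data):
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))                       # three independent variables
B = np.array([[1.0, -0.5],
              [0.3,  0.7],
              [0.0,  2.0]])                         # true coefficients, one column per DV
Y = X @ B + rng.normal(0, 0.1, size=(100, 2))       # two dependent variables Y1, Y2

model = LinearRegression().fit(X, Y)                # one fit covers both responses
print(model.coef_)                                  # shape (2, 3): one row of coefficients per DV
print(model.intercept_)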