I am running an ordinal logistic regression. My problem is that SAS won't let me specify which value of the dependent categorical variable to use as the reference.
My code looks like:
proc surveylogistic data=mydata;
weight mywgt;
strata mystrata;
domain mydomain;
class depvar (ref="myref") indvar1 (ref="myref1") indvar2 (ref="myref2") /param=ref;
model depvar (order=internal)=indvar1 indvar2;
title 'my model';
run;
In the class statement I specify that I want "myref" to be the reference for the dependent var, which means that when I look at the parameter estimates for the intercepts, the value "myref" should be omitted. When I look at the response profile, SAS correctly orders the categories of my dependent var, but no matter what I put in the class or model statement, I keep getting the highest value as the reference for my dependent var.
Does anyone know how I can specify the reference for my dependent var? It occurred to me that I could recode the categories so that the one I want as the reference has the highest value, but then they wouldn't be ordered correctly, so an ordinal logistic regression would be inappropriate.
Thanks
Use the event= option to specify the reference in the dependent variable:
model depvar(event='myref')=indvar1 indvar2;
I discovered that ordinal logistic regressions don't have a reference group for the dependent variable; only multinomial logistic regressions do, which is why I couldn't do it.
I'm using spatstat to run some mppm models and would like to be able to calculate standard errors for the predictions, as in predict.ppm. I could of course use predict.ppm on each point process individually, but I'm wondering if this is invalid for any reason, or if there is a better way of doing so.
This is not yet implemented as an option in predict.mppm. (It is on our long list of things to do. I will move it closer to the top of the list.)
However, it is available by applying predict.ppm to each element of subfits(model), where model was the original fitted model of class mppm. Something like:
m <- mppm(......)
fits <- subfits(m)
Y <- lapply(fits, predict, se=TRUE)
Just to clarify, fits[[i]] is a point process model, of class ppm, for the data in row i of the data hyperframe, implied by the big model m. The parameter estimates and variance estimates in fits[[i]] are based on information from the entire hyperframe. This is not the same as fitting a separate model of class ppm to the data in each row of the hyperframe and calculating predictions and standard errors for those fits.
I tuned a RandomForest with GroupKFold (to prevent data leakage because some rows came from the same group).
I get a best-fit model, but when I go to make a prediction on the test data, it says that it needs the group feature.
Does that make sense? It's odd that the group feature is coming up as one of the most important features as well.
I'm just wondering if there is something I could be doing wrong.
Thanks
A search on the scikit-learn GitHub repo does not reveal a single instance of the string "group feature" or "group_feature" or anything similar, so I will go ahead and assume you have in your data set a feature called "group" that the prediction model requires as input in order to produce an output.
Remember that a prediction model is basically a function that takes an input (the "predictor" variable) and returns an output (the "predicted" variable). If a variable called "group" was defined as input for your prediction model, then it makes sense that scikit-learn would request it.
Does the group appear as a column in the training set? If so, remove it and re-train. It looks like you are just using it to generate the splits. If it isn't part of the input data you need at prediction time, it shouldn't be in the training set.
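For illustration, a minimal sketch of that setup (the column names "group" and "target" and the file name are hypothetical): the group labels are passed to GroupKFold only for splitting and never enter the feature matrix.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

df = pd.read_csv("mydata.csv")               # hypothetical data set
groups = df["group"]                          # used only to build the folds
X = df.drop(columns=["group", "target"])      # the group column is NOT a feature
y = df["target"]

cv = GroupKFold(n_splits=5)
rf = RandomForestClassifier(random_state=0)

# rows from the same group never end up in both the training and validation fold
scores = cross_val_score(rf, X, y, groups=groups, cv=cv)

# final fit; new data only needs the same feature columns as X, no "group" column
rf.fit(X, y)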
I want to impute missing values of an independent variable, say X1. The other independent variables are only weakly related to X1, but the dependent variable is strongly related to it.
I wish to use sklearn's IterativeImputer with estimators such as a KNN regressor or ExtraTreesRegressor (similar to missForest in R).
https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py
Can I use the dependent variable in addition to the independent variables to impute values of X1? Will this introduce too much variance in my model? If this isn't recommended, how should X1 be treated? Deleting X1 is not an option, and I fear that if I impute X1's missing values using only the other IVs, the imputed values won't be even moderately accurate.
Thanks
I don't know anything about the software packages that you are referring to, but imputing variables while ignoring relations with the dependent variable is generally a bad idea. Doing so implicitly assumes that there is no relation between these variables, and consequently correlations between the dependent variable and the imputed values will be biased towards 0.
Graham (2009) writes about this:
"The truth is that all variables in the analysis model must be
included in the imputation model. The fear is that including the DV in
the imputation model might lead to bias in estimating the important
relationships (e.g., the regression coefficient of a program variable
predicting the DV). However, the opposite actually happens. When the DV is included in the model, all relevant parameter estimates are unbiased, but excluding the DV from the imputation model for the IVs and covariates can be shown to produce biased estimates."
Hope this helps. To summarize:
Can I use the dependent variable in addition to the independent variables to impute values of X1?
Yes, you can, and most of the literature I've read suggests you definitely should.
Will this introduce too much variance in my model?
No, it shouldn't (why do you assume this would introduce more variance? And variance in what exactly?). It should reduce bias in the estimated covariance/correlation of the variables.
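For what it's worth, here is a minimal sketch of what that could look like with sklearn's IterativeImputer (the column names X1, X2, X3, y and the file name are hypothetical; the key point is that the dependent variable is included in the imputation model):
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

df = pd.read_csv("mydata.csv")          # hypothetical data with missings in X1
impute_cols = ["X1", "X2", "X3", "y"]   # the DV "y" goes into the imputation model

imputer = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
df[impute_cols] = imputer.fit_transform(df[impute_cols])

# the analysis model is then fit on the imputed X's and the original y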
For an excellent article about imputation see:
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549–576.
I know that, thanks to scikit-learn, we can easily calculate the BIC or score for a Gaussian mixture model, as shown below.
clf.fit(data)
bic=clf.bic(data)
score=clf.score(data)
http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html
But my question is: how can I calculate the BIC or score WITHOUT using the fit method, when I already have the weights, means, covariances, and data?
I could set, for example,
clf = mixture.GaussianMixture(n_components=3, covariance_type='full')
clf.weights_=weights_list
clf.means_=means_list
clf.covariances_=covariances_list
or
clf.weights_init=weights_list
clf.means_init=means_list
clf.precisions_init =np.linalg.inv(covariances_list)
but when I try to get the BIC,
bic=clf.bic(data)
I get an error message saying
sklearn.exceptions.NotFittedError: This GaussianMixture instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
I don't want to run fit, because it would change the given weights, means, and covariances.
What can I do?
Thanks
You need to set three attributes to pass the check_is_fitted test: weights_, means_, and precisions_cholesky_. You are already setting weights_ and means_ correctly, and to calculate precisions_cholesky_ you need the covariances_, which you do have.
So just calculate it with this helper function:
from sklearn.mixture.gaussian_mixture import _compute_precision_cholesky
precisions_cholesky = _compute_precision_cholesky(covariances_list, 'full')
Change the "full" to appropriate covariance type and then set the result to clf using
clf.precisions_cholesky_ = precisions_cholesky
Make sure the shapes of all these variables correspond correctly to your data.
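Putting it together, a minimal sketch might look like the following. Note that the private helper's import path has changed across scikit-learn versions (sklearn.mixture.gaussian_mixture in older releases, sklearn.mixture._gaussian_mixture in newer ones), and that data, weights_list, means_list, and covariances_list are assumed to already exist with shapes matching 3 components and covariance_type='full'.
import numpy as np
from sklearn import mixture
from sklearn.mixture._gaussian_mixture import _compute_precision_cholesky  # older versions: sklearn.mixture.gaussian_mixture

# assumed given: data (n_samples, n_features), weights_list (3,),
# means_list (3, n_features), covariances_list (3, n_features, n_features)
clf = mixture.GaussianMixture(n_components=3, covariance_type='full')
clf.weights_ = np.asarray(weights_list)
clf.means_ = np.asarray(means_list)
clf.covariances_ = np.asarray(covariances_list)
clf.precisions_cholesky_ = _compute_precision_cholesky(clf.covariances_, 'full')

# fit is never called, so the supplied parameters stay untouched
bic = clf.bic(data)
score = clf.score(data)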
I'm wondering if there is a way to get many single-covariate logistic regressions. I want to do it for all my variables because of the missing values: I wanted to fit a multiple logistic regression, but I have too many missing values. I don't want to set up a separate logistic regression by hand for each variable in my DB; is there any automatic way?
Thank you very much!
You can code it using SPSS syntax.
For example:
LOGISTIC REGRESSION VARIABLES F2B16C -- Dependent variable
/METHOD=BSTEP -- Backwards step - all variables in then see what could be backed out
XRACE BYSES2 BYTXMSTD F1RGPP2 F1STEXP XHiMath -- Independent variables
/contrast (xrace)=indicator(6) -- creates the dummy variables with #6 as the base case
/contrast (F1Rgpp2)=indicator(6)
/contrast (f1stexp)=indicator(6)
/contrast (XHiMath)=indicator(5)
/PRINT=GOODFIT CORR ITER(1)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
If you do that, you can also tell it to keep records with missing values where appropriate:
add /MISSING=INCLUDE
If anyone knows of a good explanation of the implications of including the missing values, I'd love to hear it.