Zero-Inflated Binomial Regression in SPSS

I am running a zero-inflated binomial regression (ZINB) in SPSS using the R Essentials extension. Unfortunately, the ZINB output in SPSS only gives me the estimate, standard error, z-score, and p-value. Is there an easy way to get the odds ratio and 95% confidence interval for these ZINB analyses in SPSS?
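For reference, the ratios and their Wald 95% confidence intervals can be recovered by hand from the estimate and standard error that SPSS does print: exponentiate the coefficient and the endpoints of estimate ± 1.96·SE. For the zero-inflation (logit) part of the model the result is an odds ratio; for the count part it is a rate ratio. A minimal Python sketch with made-up numbers (not from any real output):

import numpy as np

def exp_ci(estimate, se, z=1.96):
    # Exponentiate a coefficient and its Wald 95% confidence interval.
    # For the zero-inflation (logit) part this gives an odds ratio;
    # for the count part it gives a rate ratio.
    return np.exp(estimate), np.exp(estimate - z * se), np.exp(estimate + z * se)

# Hypothetical coefficient and standard error copied from the SPSS table
print(exp_ci(estimate=0.42, se=0.15))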

Related

Perfect separation on a linear model

There are lots of posts here about the "perfect separation error" in statsmodels when running a logistic regression. But I'm not doing logistic regression; I'm doing a GLM with frequency weights and a Gaussian distribution, so basically OLS.
All of my independent variables are categorical with lots of categories, so it is a high-dimensional, binary-coded feature set.
But I'm very frequently getting PerfectSeparationError from statsmodels.
I'm running many, many models. I think I'm getting this error when my data is too thin for that many variables. However, with frequency weights I should in theory have many more observations than the dataframe holds, because each row counts as many observations as its frequency.
Any guidance on how to proceed?
import statsmodels.api as sm
reg = sm.GLM(dep, Indies, family=sm.families.Gaussian(), freq_weights=freq).fit()
Error: <class 'statsmodels.tools.sm_exceptions.PerfectSeparationError'>
The check is for perfect prediction and is applied independently of the family.
Currently, there is no workaround when using IRLS. Using the scipy optimizers, e.g. method="bfgs", avoids the perfect prediction/separation check.
https://github.com/statsmodels/statsmodels/issues/2680
Perfect separation is only defined for the binary case, i.e. family binomial in GLM, and could be extended to other discrete models.
However, there can be other problems with inference if the residual variance is zero, i.e. we have a perfect fit.
Here is an issue with perfect prediction in OLS
https://github.com/statsmodels/statsmodels/issues/1459
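A sketch of that workaround, using synthetic stand-ins for the dep, Indies, and freq objects from the question (the data here are made up purely for illustration):

import numpy as np
import statsmodels.api as sm

# Synthetic stand-ins for the question's dep / Indies / freq objects
rng = np.random.default_rng(0)
Indies = sm.add_constant(rng.integers(0, 2, size=(200, 3)).astype(float))
dep = Indies @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(scale=0.1, size=200)
freq = rng.integers(1, 5, size=200).astype(float)

model = sm.GLM(dep, Indies, family=sm.families.Gaussian(), freq_weights=freq)

# The default IRLS fit runs the perfect-prediction check; a scipy
# gradient optimizer such as "bfgs" bypasses it, as noted above.
res = model.fit(method="bfgs")
print(res.params)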

Sensitivity vs. Positive Predictive Value - which is best?

I am trying to build a model on a class-imbalanced dataset (binary: 25% 1's and 75% 0's). I have tried classification algorithms and ensemble techniques. I am a bit confused about the two points below, as I am mostly interested in predicting more 1's.
1. Should I give preference to sensitivity or to positive predictive value? Some ensemble techniques give at most 45% sensitivity with a low positive predictive value, and some give 62% positive predictive value with low sensitivity.
2. My dataset has around 450K observations and 250 features. After a power test I took 10K observations by simple random sampling. When selecting variable importance with ensemble techniques, the features differ from those I got when I tried with 150K observations. My intuition and domain knowledge suggest that the features that came up as important in the 150K-observation sample are more relevant. What is the best practice?
3. Lastly, can I use the variable importance generated by RF in other ensemble techniques to predict the accuracy?
Can you please help me out, as I am a bit confused about which way to go.
The preference between sensitivity and positive predictive value depends on the ultimate goal of your analysis. The difference between the two is nicely explained here: https://onlinecourses.science.psu.edu/stat507/node/71/
Altogether, these are two measures that look at the results from two different perspectives. Sensitivity is the probability that the test flags the "condition" among those who actually have it. Positive predictive value is the probability that someone who tests positive actually has the condition, and it therefore depends on the prevalence of the condition among those being tested.
Accuracy depends on the outcome of your classification: it is defined as (true positives + true negatives)/(total); it is not computed from the variable importances generated by RF.
Also, it is possible to compensate for the imbalance in the dataset; see https://stats.stackexchange.com/questions/264798/random-forest-unbalanced-dataset-for-training-test
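As a rough illustration of how both metrics come out of the same confusion matrix (toy labels, scikit-learn assumed):

import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and predictions for an imbalanced binary problem
y_true = np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # recall: share of actual 1's that are caught
ppv = tp / (tp + fp)           # precision: share of predicted 1's that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(sensitivity, ppv, accuracy)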

Does SPSS adjust chi squared tests for weighted samples?

In SPSS, you can adjust summary statistics for stratified samples using Weight by ..., and it then allows you to run a chi-squared test. I found a lot of examples of people running a chi-squared test this way, but nobody mentions whether SPSS actually accounts for the weighting in the chi-square calculation.
Does standard SPSS adjust the chi-squared test of independence for the stratified sample after using Weight by ...?
The R "survey" package uses Rao-Scott correction. Similarily, SAS has a Rao-Scott Chi-square test.
I'm aware of the SPSS complex surveys extension - while I'd be curious how well that one works, here I'm specifically interested in whether base SPSS does this correctly.
Thanks a lot for your help!
The answer is no. Base SPSS does not adjust the standard errors for stratified samples.
All calculations in SPSS on weighted samples that use the standard error (p-values, confidence intervals, ...) will be wrong without the Complex Surveys extension.
Confirmed by IBM, UCLA, PARE, and Thomas Lumley (all sources recommended by Thomas Lumley).
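One way to see why the weighting itself is not enough: Pearson's chi-square scales with the total count, so a crosstab built on weighted (inflated) Ns produces an inflated statistic unless a design-based correction such as Rao-Scott is applied. A toy Python illustration with made-up counts (scipy assumed):

import numpy as np
from scipy.stats import chi2_contingency

# A small 2x2 table of unweighted counts (made-up numbers)
table = np.array([[30, 20],
                  [25, 35]])

# The same table with every cell inflated by a weight of 3, mimicking
# what a naive weighted crosstab does to the cell counts
weighted = 3 * table

chi2_raw, p_raw, _, _ = chi2_contingency(table, correction=False)
chi2_w, p_w, _, _ = chi2_contingency(weighted, correction=False)
print(chi2_raw, p_raw)   # statistic on the real sample size
print(chi2_w, p_w)       # three times larger, much smaller p-value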

Modelling probabilities in a regularized (logistic?) regression model in python

I would like to fit a regression model to probabilities. I am aware that linear regression is often used for this purpose, but I have several probabilities at or near 0.0 and 1.0 and would like to fit a regression model whose output is constrained to lie between 0.0 and 1.0. I also want to be able to specify a regularization norm and strength for the model, and ideally do this in Python (though an R implementation would be helpful as well).
All the logistic regression packages I've found seem suited only for classification, whereas this is a regression problem (albeit one where I want to use the logit link function). I use scikit-learn for my classification and regression needs, so if this regression model can be implemented in scikit-learn, that would be fantastic (it seemed to me that this is not possible), but I'd be happy with any solution in Python and/or R.
The question has two parts, penalized estimation and fractional or proportion data as the dependent variable. I have worked on each separately but never tried the combination.
Penalization
Statsmodels has had L1-regularized Logit, and other discrete models like Poisson, for some time. In recent months there has been a lot of effort to support more penalization, but it is not in statsmodels yet. Elastic net for linear models and the Generalized Linear Model (GLM) is in a pull request and will be merged soon. More penalized GLMs, such as L2 penalization for GAMs and splines or SCAD penalization, will follow over the next months, based on pull requests that still need work.
Two examples of the current L1 fit_regularized for Logit are here:
Difference in SGD classifier results and statsmodels results for logistic with L1, and https://github.com/statsmodels/statsmodels/blob/master/statsmodels/examples/l1_demo/short_demo.py
Note that the penalization weight alpha can be a vector, with zeros for coefficients (such as the constant) that should not be penalized.
http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.fit_regularized.html
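A minimal sketch of that usage on synthetic data (the data and the alpha vector are made up for illustration):

import numpy as np
import statsmodels.api as sm

# Toy binary data; the first column added by add_constant is the intercept
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(500, 4)))
p = 1 / (1 + np.exp(-(0.5 + X[:, 1] - 2 * X[:, 2])))
y = (rng.uniform(size=500) < p).astype(float)

# Per-coefficient penalty weights: 0 for the constant so it is not penalized
alpha = np.array([0.0, 1.0, 1.0, 1.0, 1.0])
res = sm.Logit(y, X).fit_regularized(method="l1", alpha=alpha, disp=False)
print(res.params)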
Fractional models
Binary and binomial models in statsmodels do not impose that the dependent variable is binary and work as long as the dependent variable is in the [0,1] interval.
Fractions or proportions can be estimated with Logit as a quasi-maximum likelihood estimator. The estimates are consistent if the mean function (logistic, cumulative normal, or a similar link function) is correctly specified, but we should use a robust sandwich covariance for proper inference. Robust standard errors can be obtained in statsmodels through the fit keyword cov_type='HC0'.
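A minimal sketch of the quasi-ML fractional logit on synthetic proportion data (GLM with a Binomial family is used here; Logit with the same cov_type works similarly):

import numpy as np
import statsmodels.api as sm

# Toy fractional outcome in [0, 1] (e.g. a proportion), not binary
rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(300, 2)))
mu = 1 / (1 + np.exp(-(0.3 + 0.8 * X[:, 1] - 0.5 * X[:, 2])))
y = np.clip(mu + rng.normal(scale=0.1, size=300), 0, 1)

# Quasi-ML fractional logit: the Binomial family accepts y in [0, 1];
# the sandwich covariance (HC0) gives the robust standard errors
res = sm.GLM(y, X, family=sm.families.Binomial()).fit(cov_type="HC0")
print(res.summary())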
The best documentation is Stata's http://www.stata.com/manuals14/rfracreg.pdf and the references therein. I went through those references before Stata had fracreg, and the approach works correctly with at least Logit and Probit, which were my test cases. (I can't find my scripts or test cases right now.)
The bad news for inference is that robust covariance matrices have not been added to fit_regularized, so the correct sandwich covariance is not directly available. The standard covariance matrix and standard errors of the parameter estimates are derived under the assumption that the model, i.e. the likelihood function, is correctly specified, which will not be the case if the data are fractions and not binary.
Besides using quasi-maximum likelihood with binary models, it is also possible to use a likelihood that is defined for fractional data in (0, 1). A popular model is beta regression, which is also waiting in a pull request for statsmodels and is expected to be merged within the next few months.

Test error lower than training error

Would appreciate your input on this. I am constructing a regression model with the help of genetic programming.
If my RMSE on the test data is (much) lower than my RMSE on the training data, with a 1:5 test:train split of the data, should I be worried?
The test data are drawn randomly without replacement from a set of 24 data points. The model was built with a genetic programming technique, so the number of features, the modelling framework, etc. vary as I minimize the training RMSE regularized by the number of nodes in the GP tree.
Is the model underfitted? Or should I have minimized MSE instead of RMSE? (I thought it would make no difference, since MSE is non-negative and the minimum of MSE coincides with the minimum of RMSE, assuming the optimizer is good enough to find the minimum.)
Thanks
So your model is trained on 20 of the 24 data points and tested on the 4 remaining points?
To me it sounds like you need (much) more data, so that you can have larger training and test sets. I'm not surprised by the discrepancy on your test set: a model fitted on so few points cannot learn much, and four test points give a very noisy error estimate. As a rule of thumb, for machine learning you can never have enough data. Is it possible to gather a larger dataset?
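As a rough illustration of how unstable a 4-point test set is, the sketch below repeats a 20/4 split many times on synthetic data of the same size (an ordinary linear model stands in for the GP model; all numbers are made up):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 24 synthetic points, roughly the size of the dataset in the question
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 3))
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(scale=0.5, size=24)

test_rmse = []
for seed in range(200):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=4, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    test_rmse.append(mean_squared_error(y_te, model.predict(X_te)) ** 0.5)

# The spread across splits shows how little a single 4-point RMSE means
print(np.min(test_rmse), np.median(test_rmse), np.max(test_rmse))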
