p-values of scikit-learn LogisticRegressionCV? - scikit-learn

I'm using scikit-learn's LogisticRegressionCV. Looks like the coefs_ field are the logistic regression coefficients. Is there any way to get p-values, z-values, or some measure of uncertainty for each feature? (For example, as discussed here in R.)

Unfortunately, scikit-learn does not have any such methods for the logistic regression (nor for the linear regression as a matter of fact). I found this which might be of interest for you, but honestly, I would try to stick to R for such tasks if you can.
https://datascience.stackexchange.com/questions/15398/how-to-get-p-value-and-confident-interval-in-logisticregression-with-sklearn

Related

automatic classification model selecion

i want to know is there any method by which the computer can decide which classification model to use ( Decision trees, logistic regression, KNN, etc. ) by just looking at the training data.
even just the math will be extremely helpful.
I am going to be writing this in python 3, so if there's any built method in scikit-learn or tensorflow for this purpose,it would be of great help.
This scikit learn tool kit solves it :
https://automl.github.io/auto-sklearn/stable/index.html

What does the CV stand for in sklearn.linear_model.LogisticRegressionCV?

scikit-learn has two logistic regression functions:
sklearn.linear_model.LogisticRegression
sklearn.linear_model.LogisticRegressionCV
I'm just curious what the CV stands for in the second one. The only acronym I know in ML that matches "CV" is cross-validation, but I'm guessing that's not it, since that would be achieved in scikit-learn with a wrapper function, not as part of the logistic regression function itself (I think).
You are right in guessing that the latter allows the user to perform cross validation. The user can pass the number of folds as an argument cv of the function to perform k-fold cross-validation (default is 10 folds with StratifiedKFold).
I would recommend reading the documentation for the functions LogisticRegression and LogisticRegressionCV
Yes, it's cross-validation. Excerpt from the docs:
For the grid of Cs values (that are set by default to be ten values in a logarithmic scale between 1e-4 and 1e4), the best hyperparameter is selected by the cross-validator StratifiedKFold, but it can be changed using the cv parameter.
The point here is the following:
yes: sklearn has general model-selection wrappers providing CV-functionality for all those classifiers/regressors
but: when the classifier/regressor is known/fixed a-priori (to some extent) or sometimes even some CV-model, one can gain advantages using these facts with specialized code bound to one classifier/regressor resulting in improved performance!
Typically:
CV already embedded in optimization-algorithm
Efficient warm-starting (instead of full re-optimization after just the change of one parameter like alpha)
It seems, at least the latter idea is used in sklearn's LogisticRegressionCV, as seen in this excerpt:
In the case of newton-cg and lbfgs solvers, we warm start along the path i.e guess the initial coefficients of the present fit to be the coefficients got after convergence in the previous fit, so it is supposed to be faster for high-dimensional dense data.
May I also refer you to this section in scikit-learn documentation which I beleive explains it well:
Some models can fit data for a range of values of some parameter
almost as efficiently as fitting the estimator for a single value of
the parameter. This feature can be leveraged to perform a more
efficient cross-validation used for model selection of this parameter.
The most common parameter amenable to this strategy is the parameter
encoding the strength of the regularizer. In this case we say that we
compute the regularization path of the estimator.
And logistic regression is one such model. That's why scikit-learn has the dedicated LogisticRegressionCV class that does this.
There are some things left out on other answers, e.g. about gridsearch functionality. See the docs:
cross-validation estimator
An estimator that has built-in cross-validation capabilities to automatically select the best hyper-parameters (see the User Guide). Some example of cross-validation estimators are ElasticNetCV and LogisticRegressionCV. Cross-validation estimators are named EstimatorCV and tend to be roughly equivalent to GridSearchCV(Estimator(), ...). The advantage of using a cross-validation estimator over the canonical estimator class along with grid search is that they can take advantage of warm-starting by reusing precomputed results in the previous steps of the cross-validation process. This generally leads to speed improvements. An exception is the RidgeCV class, which can instead perform efficient Leave-One-Out CV.
https://scikit-learn.org/stable/glossary.html#term-cross-validation-estimator
https://github.com/amueller/talks_odt/blob/master/2015/nyc-open-data-2015-andvanced-sklearn.pdf

Logistic regression overfits even using cross validation in sklearn?

I am implementing a logistic regression model using sklearn, for a text classification competition on Kaggle.
When I use unigram, there are 23,617 features. The best mean_test_score Cross validation search (sklearn's GridSearchCV) gives me is similar to the score I got from Kaggle, using the best model.
There are 1,046,524 features if I use bigram. GridSearchCV gives me a better mean_test_score compared to unigram, but using this new model I got a much much lower score on Kaggle.
I guess the reason might be overfitting, since I have too many features. I have tried to set the GridSearchCV using 5-fold, or even 2-fold, but the scores are still inconsistent.
Does it really indicate my second model is overfitting, even in the validation stage? If so, how can I tune the regularization term for my logistic model using sklearn? Any suggestions are appreciated!
Assuming you are using sklearn. You could try looking into using the tuning parameters max_df, min_df, and max_features. Throwing these into a GridSearch may take a long time but you will likely get some interesting results back. I know these features are implemented in the sklearn.feature_extraction.text.TfidfVectorizer, but I am sure they use them elsewhere as well. Essentially the idea is that including too many grams can lead to overfitting, same thing with having too many grams with low or high document frequencies.

Sklearn: Difference between using OneVsRestClassifier and build each classifier individually

As far as I know, multi-label problem can be solved with one-vs-all scheme, for which Scikit-learn implements OneVsRestClassifier as a wrapper on classifier such as svm.SVC. I am wondering how would it be different if I literally train, say we have a multi-label problem with n classes, n individual binary classifiers for each label and thereby evaluate them separately.
I know it is like a "manual" way of implementing one-vs-all rather than using the wrapper, but are two ways actually different? If so, how are they different, like in execution time or performance of classifier(s)?
There would be no difference. For multi-label classification, sklearn one-versus-rest implements binary relevance which is what you have described.

Comparing a Poisson Regression to a logistic Regression

I have data which has an associated binary outcome variable. Naturally I ran a logistic regression in order to see parameter estimates and odds ratios. I was curious though, to change this data from a binary outcome to count data. Then I ran a poisson regression (and negative binomial regression) on the count data.
I have no idea of how to compare these different models though, all comparisons I see seem to only be concerned with nested models.
How would you go about deciding on the best model to use in this situation?
Essentially both models will be roughly equal. What really matters is what is your objective- what you really want to predict. If you want to determine how many of cases are good or bad (1 or 0), then you go for logistic regression. If you are really interested on how much the cases are going to do (counts) then do poisson.
In other words, the only difference between these two models is the logistic transformation and the fact that logistic regression tries to minimize the misclassification error (-2 log likelihood) .To put it simply, even if you run a linear regression (OLS) on the binary outcome, you should not see big differences from your logistic model apart from the fact that the results may not be between 0 and 1 (e.g. the Area under the RoC curve will be similar to the logistic model) .
To sum up, don't worry about which of these two models is better, they should be roughly the same in the way the capture your features' information. Just think what makes more sense to optimize, counts or probabilties. The answer might have been different if you were considering non-linear models (e.g random forests or neural networks etc), but the two you are considering are both (almost) linear- so don't worry about it.
One thing to consider is the sample design. If you are using a case-control study, then logistic regression is the way to go because of its logit link function, rather than log of ratios as in Poisson regression. This is because, where there is an oversampling of cases such as in case-control study, odds ratio is unbiased.

Resources