I would like to fit a regression model to probabilities. I am aware that linear regression is often used for this purpose, but I have several probabilities at or near 0.0 and 1.0 and would like to fit a regression model where the output is constrained to lie between 0.0 and 1.0. I want to be able to specify a regularization norm and strength for the model and ideally do this in python (but an R implementation would be helpful as well). All the logistic regression packages I've found seem to be only suited for classification whereas this is a regression problem (albeit one where I want to use the logit link function). I use scikits-learn for my classification and regression needs so if this regression model can be implemented in scikits-learn, that would be fantastic (it seemed to me that this is not possible), but I'd be happy about any solution in python and/or R.
The question has two issues, penalized estimation and fractional or proportions data as dependent variable. I worked on each separately but never tried the combination.
Penalization
Statsmodels has had L1 regularized Logit and other discrete models like Poisson for some time. In recent months there has been a lot of effort to support more penalization but it is not in statsmodels yet. Elastic net for linear and Generalized Linear Model (GLM) is in a pull request and will be merged soon. More penalized GLM like L2 penalization for GAM and splines or SCAD penalization will follow over the next months based on pull requests that still need work.
Two examples for the current L1 fit_regularized for Logit are here
Difference in SGD classifier results and statsmodels results for logistic with l1 and https://github.com/statsmodels/statsmodels/blob/master/statsmodels/examples/l1_demo/short_demo.py
Note, the penalization weight alpha can be a vector with zeros for coefficients like the constant if they should not be penalized.
http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.fit_regularized.html
Fractional models
Binary and binomial models in statsmodels do not impose that the dependent variable is binary and work as long as the dependent variable is in the [0,1] interval.
Fractions or proportions can be estimated with Logit as Quasi-maximum likelihood estimator. The estimates are consistent if the mean function, logistic, cumulative normal or similar link function, is correctly specified but we should use robust sandwich covariance for proper inference. Robust standard errors can be obtained in statsmodels through a fit keyword cov_type='HC0'.
Best documentation is for Stata http://www.stata.com/manuals14/rfracreg.pdf and the references therein. I went through those references before Stata had fracreg, and it works correctly with at least Logit and Probit which were my test cases. (I don't find my scripts or test cases right now.)
The bad news for inference is that robust covariance matrices have not been added to fit_regularized, so the correct sandwich covariance is not directly available. The standard covariance matrix and standard errors of the parameter estimates are derived under the assumption that the model, i.e. the likelihood function, is correctly specified, which will not be the case if the data are fractions and not binary.
Besides using Quasi-Maximum Likelihood with binary models, it is also possible to use a likelihood that is defined for fractional data in (0, 1). A popular model is Beta regression, which is also waiting in a pull request for statsmodels and is expected to be merged within the next months.
Related
There are lots of posts here about the "Perfect Separation Error" in statsmodels when running a logisitc regression. But I'm not doing logistic regression. I'm doing GLM with frequency weights and gaussian distribution. So basically OLS.
All of my independent variables are categorical with lots of categories. So high dimensional binary coded feature set.
But I'm very frequently getting the "perfectseperationerror" from statsmodels
I'm running many many models. I think I'm getting this error when my data is too thin for that many variables. However, With freq weights, in theory, I actually have many more features then the dataframe holds because the observations should be multiplied by the freq.
Any guidance on how to proceed?
reg = sm.GLM(dep, Indies, freq_weights = freq)
<p>Error: class 'statsmodels.tools.sm_exceptions.PerfectSeparationError'>
The check is on perfect prediction and is used independently of the family.
Currently, there is now workaround when using irls. Using scipy optimizers, e.g. method="bfgs", avoids the perfect prediction/separation check.
https://github.com/statsmodels/statsmodels/issues/2680
Perfect separation is only defined for the binary case, i.e. family binomial in GLM, and could be extended to other discrete models.
However, there can be other problems with inference if the residual variance is zero, i.e. we have a perfect fit.
Here is an issue with perfect prediction in OLS
https://github.com/statsmodels/statsmodels/issues/1459
I have a particular classification problem that I was able to improve using Python's abs() function. I am still somewhat new when it comes to machine learning, and I wanted to know if what I am doing is actually "allowed," so to speak, for improving a regression problem. The following line describes my method:
lr = linear_model.LinearRegression()
predicted = abs(cross_val_predict(lr, features, labels_postop_IS, cv=10))
I attempted this solution because linear regression can sometimes produce negative predictions values, even though my particular case, these predictions should never be negative, as they are a physical quantity.
Using the abs() function, my predictions produce a better fit for the data.
Is this allowed?
Why would it not be "allowed". I mean if you want to make certain statistical statements (like a 95% CI e.g.) you need to be careful. However, most ML practitioners do not care too much about underlying statistical assumptions and just want a blackbox model that can be evaluated based on accuracy or some other performance metric. So basically everything is allowed in ML, you just have to be careful not to overfit. Maybe a more sensible solution to your problem would be to use a function that truncates at 0 like f(x) = x if x > 0 else 0. This way larger negative values don't suddenly become large positive ones.
On a side note, you should probably try some other models as well with more parameters like a SVR with a non-linear kernel. The thing is obviously that a LR fits a line, and if this line is not parallel to your x-axis (thinking in the single variable case) it will inevitably lead to negative values at some point on the line. That's one reason for why it is often advised not to use LRs for predictions outside the "fitted" data.
A straight line y=a+bx will predict negative y for some x unless a>0 and b=0. Using logarithmic scale seems natural solution to fix this.
In the case of linear regression, there is no restriction on your outputs.
If your data is non-negative (as in your case the values are physical quantities and cannot be negative), you could model using a generalized linear model (GLM) with a log link function. This is known as Poisson regression and is helpful for modeling discrete non-negative counts such as the problem you described. The Poisson distribution is parameterized by a single value λ, which describes both the expected value and the variance of the distribution.
I cannot say your approach is wrong but a better way is to go towards the above method.
This results in an approach that you are attempting to fit a linear model to the log of your observations.
I am trying to lean SVC classifier of SVM model in sklearn. I have learned to use it on various datasets and even applied gridsearch to improve the results but I have not yet understood some parameters like C, gamma.
If anyone can give me simple but detail explanation of each parameter, it would be great.
Since we are trying to minimize some objective function, we can add some 'size' measure of the coefficient vector itself to the function. C is essentially the inverse of the weight on that 'regularization' term. Decreasing C will prevent overfitting by forcing the coefficients to be sparse or small, depending on the penalty. Increasing C too much will promote underfitting.
Gamma is a parameter for the RBF kernel. Increasing gamma allows for a more complex decision boundary (which can lead to overfitting, but can also improve results--it depends on the data).
This scikit-learn tutorial graphically shows the effect of changing both hyperparameters.
How should one decide between using a linear regression model or non-linear regression model?
My goal is to predict Y.
In case of simple x and y dataset I could easily decide which regression model should be used by plotting a scatter plot.
In case of multi-variant like x1,x2,...,xn and y. How can I decide which regression model has to be used? That is, How will I decide about going with simple linear model or non linear models such as quadric, cubic etc.
Is there any technique or statistical approach or graphical plots to infer and decide which regression model has to be used? Please advise.
That is a pretty complex question.
You start visually first: if the data is normally distributed, and satisfy conditions for classical linear model, you use linear model. I normally start by making a scatter plot matrix to observe the relationships. If it is obvious that the relationship is non linear then you use non-linear model. But, a lot of times, I visually inspect, assuming that the number of factors are just not too many.
For example, this would be a non linear model:
However, if you want to use data mining (and computationally demanding methods), I suggest starting with stepwise regression. What you do is set a model evaluation criteria first: could be R^2 for example. You start a model with nothing and sequentially add predictors or permutations of them until your model evaluation criteria is "maximized". However, adding new predictor almost always increases R^2, a type of over-fitting.
The solution is to split the data into training and testing. You should make model based on the training and evaluate the mean error on testing. The best model will be the one that that minimized mean error on the testing set.
If your data is sparse, try integrating ridge or lasso regression in model evaluation.
Again, this is a kind of a complex question. The answer also kind of depends on whether you are building descriptive or explanatory model.
I have data which has an associated binary outcome variable. Naturally I ran a logistic regression in order to see parameter estimates and odds ratios. I was curious though, to change this data from a binary outcome to count data. Then I ran a poisson regression (and negative binomial regression) on the count data.
I have no idea of how to compare these different models though, all comparisons I see seem to only be concerned with nested models.
How would you go about deciding on the best model to use in this situation?
Essentially both models will be roughly equal. What really matters is what is your objective- what you really want to predict. If you want to determine how many of cases are good or bad (1 or 0), then you go for logistic regression. If you are really interested on how much the cases are going to do (counts) then do poisson.
In other words, the only difference between these two models is the logistic transformation and the fact that logistic regression tries to minimize the misclassification error (-2 log likelihood) .To put it simply, even if you run a linear regression (OLS) on the binary outcome, you should not see big differences from your logistic model apart from the fact that the results may not be between 0 and 1 (e.g. the Area under the RoC curve will be similar to the logistic model) .
To sum up, don't worry about which of these two models is better, they should be roughly the same in the way the capture your features' information. Just think what makes more sense to optimize, counts or probabilties. The answer might have been different if you were considering non-linear models (e.g random forests or neural networks etc), but the two you are considering are both (almost) linear- so don't worry about it.
One thing to consider is the sample design. If you are using a case-control study, then logistic regression is the way to go because of its logit link function, rather than log of ratios as in Poisson regression. This is because, where there is an oversampling of cases such as in case-control study, odds ratio is unbiased.