We have the following methods to develop a linear regression model:
1. Ordinary Least Squares
2. Linear Algebra
3. Gradient Descent
How do I choose between these methods? Can anyone please clarify the pros and cons of each?
My understanding is that linear algebra is used to implement Ordinary Least Squares (OLS), so (1) and (2) in your question are effectively the same thing. OLS can only be used to fit equations that are linear in the coefficients and cannot directly be used for non-linear equations. Gradient descent is one of the ways to fit non-linear equations, but it requires good starting parameters from which to begin the descent in error space.
I invite the more experienced statisticians on this list to comment on my small summary.
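To make the comparison concrete, here is a minimal sketch (simulated data, hand-picked learning rate and iteration count, all of which are my own assumptions) that fits the same linear model once with a linear-algebra least-squares solve and once with plain batch gradient descent. On a well-conditioned problem like this one, both should land on essentially the same coefficients; the closed-form solve just gets there in one step.

```python
import numpy as np

# Simulated data (an assumption for illustration): intercept plus two features.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + 0.1 * rng.normal(size=n)

# OLS via linear algebra: least-squares solution of X @ beta = y.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Batch gradient descent on the mean squared error.
beta_gd = np.zeros(X.shape[1])
learning_rate = 0.1  # hand-picked; too large a value would diverge
for _ in range(2000):
    gradient = 2.0 / n * X.T @ (X @ beta_gd - y)
    beta_gd -= learning_rate * gradient

print("least squares   :", beta_ols)
print("gradient descent:", beta_gd)
```

The gradient descent version needs a step size and a stopping rule, but it only ever touches X through matrix-vector products, which is why it tends to be preferred when the data are too large for a direct solve.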
I've tried to implement lasso regression with coordinate descent. At a later stage the objective function will also include the first derivative of the function. All derivatives are computed by an automatic differentiation tool. As a first step I've tried to implement the lasso with simple cyclic coordinate descent, without including the derivative.
In a small example with 4 features and ~100 samples the algorithm converges to the right solution. But on my real dataset, my solution and the solution of the lasso regression from scikit-learn are different. Furthermore, scikit-learn's algorithm converges a lot faster. I've used the default settings in the scikit-learn setup.
My question is: What is the difference between the default scikit-learn algorithm for lasso regression and simple coordinate descent? Is there a paper which describes the implemented algorithm?
BR
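For reference, a minimal sketch of cyclic coordinate descent with soft-thresholding is below. It is not scikit-learn's implementation; the function names and the simulated data are hypothetical. It does assume the objective scaling documented for scikit-learn's Lasso, (1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1, so a different scaling of alpha in a hand-written version is one plausible source of mismatched solutions. Note also that scikit-learn fits an intercept by default, whereas this sketch assumes centered data and no intercept.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink a scalar toward zero by `threshold`."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, max_iter=1000, tol=1e-8):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic updates."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    residual = y - X @ w  # updated incrementally, never recomputed in full
    col_sq_norm = (X ** 2).sum(axis=0) / n_samples
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(n_features):
            if col_sq_norm[j] == 0.0:
                continue  # an all-zero column cannot reduce the loss
            w_old = w[j]
            # Correlation of feature j with the residual that excludes feature j.
            rho = X[:, j] @ residual / n_samples + col_sq_norm[j] * w_old
            w[j] = soft_threshold(rho, alpha) / col_sq_norm[j]
            if w[j] != w_old:
                residual -= X[:, j] * (w[j] - w_old)
                max_change = max(max_change, abs(w[j] - w_old))
        if max_change < tol:
            break
    return w

# Simulated, centered data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X -= X.mean(axis=0)
y = X @ np.array([3.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=100)
y -= y.mean()
alpha = 0.1
w = lasso_coordinate_descent(X, y, alpha)
print(w)
```

Keeping the residual up to date instead of recomputing `y - X @ w` for every coordinate is the main thing that makes each sweep cheap; a version that recomputes it will give the same answer but run noticeably slower.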
I am confused about how GPyTorch calculates the gradients with respect to the parameters of the model. For instance, let's say I am using ExactGP with a Gaussian likelihood, an RBF kernel, and a constant mean, and using MLE (maximum likelihood estimation) to find the parameters of the model (mean, kernel parameters, and noise). One way to calculate the gradient with respect to the parameters is the analytical gradient, which means taking the derivative of the negative log-likelihood with respect to each parameter and working out the equation for each derivative. Another way is to use the automatic differentiation provided by PyTorch.
The GPyTorch authors mention in their paper "GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration" that they use the analytical gradient, or at least this is what I understood from reading the paper. Am I correct? Also, I couldn't find the code where they implement the analytical gradient.
Could anyone help me understand this better, please?
The "automatic differentiation provided by PyTorch" does compute the analytic gradient (via back-propagation, note that there is no finite differencing or anything like that involved) - it just does so automatically.
https://github.com/cornellius-gp/gpytorch/discussions/1949#discussioncomment-2384471
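For context, the closed-form gradient the question refers to is the standard one for an exact GP with Gaussian noise. The notation below is my own assumption, not taken from the paper: K_theta is the kernel matrix plus the noise variance times the identity, mu is the constant mean vector, and theta_j is any kernel or noise hyperparameter.

```latex
\log p(y \mid X, \theta)
  = -\tfrac{1}{2} (y - \mu)^\top K_\theta^{-1} (y - \mu)
    - \tfrac{1}{2} \log \lvert K_\theta \rvert
    - \tfrac{n}{2} \log 2\pi

\frac{\partial}{\partial \theta_j} \log p(y \mid X, \theta)
  = \tfrac{1}{2} (y - \mu)^\top K_\theta^{-1}
      \frac{\partial K_\theta}{\partial \theta_j}
      K_\theta^{-1} (y - \mu)
    - \tfrac{1}{2} \operatorname{tr}\!\left(
        K_\theta^{-1} \frac{\partial K_\theta}{\partial \theta_j}
      \right)
```

The gradient of the negative log-likelihood is just the negation of the second expression. Back-propagation through the first expression yields the same quantity, which is the sense in which the two approaches agree.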
I would like to fit a regression model to probabilities. I am aware that linear regression is often used for this purpose, but I have several probabilities at or near 0.0 and 1.0 and would like to fit a regression model where the output is constrained to lie between 0.0 and 1.0. I want to be able to specify a regularization norm and strength for the model and ideally do this in python (but an R implementation would be helpful as well). All the logistic regression packages I've found seem to be only suited for classification whereas this is a regression problem (albeit one where I want to use the logit link function). I use scikits-learn for my classification and regression needs so if this regression model can be implemented in scikits-learn, that would be fantastic (it seemed to me that this is not possible), but I'd be happy about any solution in python and/or R.
The question has two issues: penalized estimation, and fractional or proportions data as the dependent variable. I have worked on each separately but never tried the combination.
Penalization
Statsmodels has had L1 regularized Logit and other discrete models like Poisson for some time. In recent months there has been a lot of effort to support more penalization but it is not in statsmodels yet. Elastic net for linear and Generalized Linear Model (GLM) is in a pull request and will be merged soon. More penalized GLM like L2 penalization for GAM and splines or SCAD penalization will follow over the next months based on pull requests that still need work.
Two examples for the current L1 fit_regularized for Logit are the question "Difference in SGD classifier results and statsmodels results for logistic with l1" and https://github.com/statsmodels/statsmodels/blob/master/statsmodels/examples/l1_demo/short_demo.py
Note, the penalization weight alpha can be a vector with zeros for coefficients like the constant if they should not be penalized.
http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.fit_regularized.html
Fractional models
Binary and binomial models in statsmodels do not impose that the dependent variable is binary and work as long as the dependent variable is in the [0,1] interval.
Fractions or proportions can be estimated with Logit as a quasi-maximum likelihood estimator. The estimates are consistent if the mean function (logistic, cumulative normal, or a similar link function) is correctly specified, but we should use a robust sandwich covariance for proper inference. Robust standard errors can be obtained in statsmodels through the fit keyword cov_type='HC0'.
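To show what the quasi-maximum likelihood fit and the HC0 sandwich covariance amount to, here is a hand-rolled NumPy sketch. It is not statsmodels code: the function name, the Newton iteration, and the simulated Beta-distributed fractions are all assumptions made for illustration. In statsmodels itself the fit keyword mentioned above is the way to request the robust covariance.

```python
import numpy as np

def fit_fractional_logit(X, y, n_iter=50, tol=1e-10):
    """Logit quasi-MLE for y in [0, 1] via Newton's method.

    Returns the coefficients, the naive (model-based) covariance,
    and the HC0 sandwich covariance.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (y - mu)
        information = X.T @ (X * (mu * (1.0 - mu))[:, None])
        step = np.linalg.solve(information, score)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    bread = np.linalg.inv(X.T @ (X * (mu * (1.0 - mu))[:, None]))
    meat = X.T @ (X * ((y - mu) ** 2)[:, None])  # outer products of the scores
    return beta, bread, bread @ meat @ bread

# Simulated fractions with a logistic mean (an assumption for illustration).
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([0.5, 1.0])
mean = 1.0 / (1.0 + np.exp(-X @ true_beta))
precision = 10.0
y = rng.beta(mean * precision, (1.0 - mean) * precision)

beta_hat, cov_naive, cov_hc0 = fit_fractional_logit(X, y)
se_naive = np.sqrt(np.diag(cov_naive))
se_robust = np.sqrt(np.diag(cov_hc0))
print(beta_hat, se_naive, se_robust)
```

In this simulation the fractions vary less around their mean than binary outcomes would, so the naive standard errors overstate the uncertainty and the sandwich ones come out smaller. The direction of the discrepancy depends on the data, which is the reason for not trusting the naive covariance here.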
The best documentation is Stata's, http://www.stata.com/manuals14/rfracreg.pdf, and the references therein. I went through those references before Stata had fracreg, and it works correctly with at least Logit and Probit, which were my test cases. (I can't find my scripts or test cases right now.)
The bad news for inference is that robust covariance matrices have not been added to fit_regularized, so the correct sandwich covariance is not directly available. The standard covariance matrix and standard errors of the parameter estimates are derived under the assumption that the model, i.e. the likelihood function, is correctly specified, which will not be the case if the data are fractions and not binary.
Besides using Quasi-Maximum Likelihood with binary models, it is also possible to use a likelihood that is defined for fractional data in (0, 1). A popular model is Beta regression, which is also waiting in a pull request for statsmodels and is expected to be merged within the next months.
How should one decide between using a linear regression model or non-linear regression model?
My goal is to predict Y.
For a simple dataset with a single x and y, I can easily decide which regression model to use by looking at a scatter plot.
In the multivariate case, with x1, x2, ..., xn and y, how can I decide which regression model to use? That is, how do I decide between a simple linear model and non-linear models such as quadratic, cubic, etc.?
Is there any technique or statistical approach or graphical plots to infer and decide which regression model has to be used? Please advise.
That is a pretty complex question.
You start visually: if the data are normally distributed and satisfy the conditions for the classical linear model, you use a linear model. I normally start by making a scatter plot matrix to observe the relationships. If it is obvious that a relationship is non-linear, then you use a non-linear model. Visual inspection is what I do most of the time, provided the number of factors is not too large.
For example, this would be a non linear model:
However, if you want to use data mining (and computationally demanding methods), I suggest starting with stepwise regression. You first set a model evaluation criterion, for example R^2. You start with an empty model and sequentially add predictors or combinations of them until your evaluation criterion is "maximized". However, adding a new predictor almost always increases R^2, which leads to over-fitting.
The solution is to split the data into training and testing sets. You should build the model on the training set and evaluate the mean error on the testing set. The best model will be the one that minimizes the mean error on the testing set.
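The train/test idea can be sketched in a few lines. Everything here is an assumption for illustration: a single simulated predictor with a quadratic relationship, a 70/30 split, and polynomial degrees 1 to 5 as the candidate models. With several predictors the loop would run over candidate feature sets instead of degrees, but the selection rule is the same.

```python
import numpy as np

# Simulated data with a truly quadratic relationship (an assumption).
rng = np.random.default_rng(1)
n = 300
x = rng.uniform(-3.0, 3.0, size=n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=1.0, size=n)

# Random 70/30 split into training and testing sets.
indices = rng.permutation(n)
train_idx, test_idx = indices[:210], indices[210:]

# Fit each candidate on the training set, score it on the testing set.
test_mse = {}
for degree in range(1, 6):
    coefficients = np.polyfit(x[train_idx], y[train_idx], deg=degree)
    predictions = np.polyval(coefficients, x[test_idx])
    test_mse[degree] = float(np.mean((y[test_idx] - predictions) ** 2))

best_degree = min(test_mse, key=test_mse.get)
for degree, mse in test_mse.items():
    print(f"degree {degree}: test MSE = {mse:.3f}")
print("selected degree:", best_degree)
```

The linear fit should stand out with a much larger test error than the rest. The higher degrees will typically score close to the quadratic, so in practice one would usually prefer the simplest model among those with near-identical test error.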
If your data is sparse, try integrating ridge or lasso regression in model evaluation.
Again, this is a fairly complex question. The answer also depends on whether you are building a descriptive or an explanatory model.
In the Java version of LIBLINEAR there is a class called 'SolverType' that lets one choose the loss function to optimize, for example 'SolverType.L2LOSS_SVM_DUAL'. Is there any way to use a user-defined loss function?
The short answer is no.
The "loss function" defines the optimization problem; in fact, this parameter switches the model between (among others):
linear regression
logistic regression
support vector machine
While the first two are quite similar, the third requires completely different machinery and much more complex methods to solve. In particular, one can define fairly arbitrary functions that fall into the "linear models" category but are unsolvable (or solvable only by very complex techniques).
On the other hand, if the function is very simple, i.e. it is differentiable and unconstrained (optimization is performed over the whole parameter space), then (assuming you know the analytical form of the derivatives) you can plug it into any steepest descent implementation (there are dozens of such solvers available).
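As an illustration of that last point, here is a minimal sketch of a linear model trained with a user-chosen smooth loss by plain steepest descent. It has nothing to do with LIBLINEAR's internals: the log-cosh loss, the function names, the step size, and the simulated data are all assumptions picked for the example. The only thing the solver needs from the loss is its gradient.

```python
import numpy as np

def log_cosh_loss(X, y, w):
    """Mean of log(cosh(residual)), written in a numerically stable form."""
    a = np.abs(X @ w - y)
    return float(np.mean(a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)))

def log_cosh_gradient(X, y, w):
    """Analytical gradient: d/dr log(cosh(r)) = tanh(r)."""
    return X.T @ np.tanh(X @ w - y) / len(y)

def steepest_descent(gradient_fn, w0, learning_rate=0.5, n_iter=5000):
    """Fixed-step steepest descent over the whole parameter space."""
    w = w0.copy()
    for _ in range(n_iter):
        w -= learning_rate * gradient_fn(w)
    return w

# Simulated data (an assumption for illustration): intercept plus two features.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_w = np.array([0.5, -2.0, 1.0])
y = X @ true_w + 0.1 * rng.normal(size=n)

w_hat = steepest_descent(lambda w: log_cosh_gradient(X, y, w), np.zeros(3))
print(w_hat, log_cosh_loss(X, y, w_hat))
```

Swapping in a different differentiable loss only means replacing the two loss functions; the descent loop stays untouched. A non-differentiable or constrained loss (hinge loss with its dual constraints, for example) would not fit this template, which is the distinction drawn above.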
SVM is formulated as a QP problem:
minimize (1/2) * ||w||^2 subject to
y * (w'x) >= 1 for all (x, y) in the training dataset
This is the primal (hard-margin) form of the problem, and the objective is to minimize the (squared) L2 norm of the weight vector w.
If you change the objective ||w||, then it is no longer an SVM. However, you can change the weight of training examples. You can find a tutorial here:
http://scikit-learn.org/stable/modules/svm.html#unbalanced-problems