Does Amazon SageMaker have a built-in polynomial regression algorithm? - amazon

I am exploring Amazon SageMaker and need to know whether it has a built-in polynomial regression algorithm.

Polynomial regression can be implemented with linear regression: add the powers x^2, x^3, x^4, and so on as extra feature columns in the training data, then fit an ordinary linear model on the expanded features.
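To make the feature-expansion idea concrete, here is a minimal sketch using NumPy. The helper names polynomial_features and fit_linear are made up for this illustration, and the data is synthetic; it does not use any SageMaker API.

```python
import numpy as np

def polynomial_features(x, degree):
    """Return the columns [x, x^2, ..., x^degree] for a 1-D input array."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** d for d in range(1, degree + 1)])

def fit_linear(X, y):
    """Ordinary least squares with an intercept column prepended."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef  # [intercept, w_1, ..., w_degree]

# Synthetic, noise-free data from a known cubic: y = 1 + 2x - 3x^2 + 0.5x^3
x = np.linspace(-2.0, 2.0, 50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.5 * x**3

# The model is linear in the expanded features, so plain least squares fits it.
coef = fit_linear(polynomial_features(x, degree=3), y)
```

The same expanded columns could be handed to any linear regression trainer as ordinary features; the trainer does not need to know they are powers of one variable.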

Check out the SageMaker documentation. You might be especially interested in the linear learner algorithm:
For input, you give the model labeled examples (x, y). x is a high-dimensional vector and y is a numeric label.
...
Continuous objectives, such as mean square error, cross entropy loss, absolute error.

Related

Lasso with Coordinate Descent in Scikit-Learn

I've tried to implement lasso regression with coordinate descent. Later on, the objective function will also include the first derivative of the function. All derivatives are computed by an automatic differentiation tool. As a first step, I've implemented the lasso with simple cyclic coordinate descent, without including the derivative.
In a small example with 4 features and ~100 samples the algorithm converges to the right solution. But on my real dataset, my solution and the solution of the lasso regression from scikit-learn are different. Furthermore, scikit-learn's algorithm converges a lot faster. I've used the default settings in the scikit-learn setup.
My question is: What is the difference between the default scikit-learn algorithm for lasso regression and simple coordinate descent? Is there a paper which describes the implemented algorithm?
BR

Is there any place in scikit-learn Lasso/Quantile Regression source code that L1 regularization is applied?

I could not find where the Manhattan distance of weights is calculated and multiplied with alpha (L1 reg. coefficient) in the Lasso Regression and the Quantile Regression source code of scikit-learn.
I was trying to implement Lasso Regression and Quantile Regression w/ NumPy and compare results w/ scikit-learn models.
I don't believe the loss function (including the regularization penalty) is ever explicitly calculated, no.
Instead, the loss function is optimized by coordinate descent, and so we only ever need to actually calculate derivatives of the loss function. That happens in the enet_coordinate_descent function (or relatives), and I think the relevant bit is here.
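As an illustration of where alpha enters in coordinate descent, here is a minimal NumPy sketch of cyclic coordinate descent for the objective (1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1. It is not the scikit-learn source: the function names lasso_cd and soft_threshold are made up for this example, there is no intercept, and it runs a fixed number of sweeps instead of using a convergence check. Note that the penalty is never evaluated; alpha only appears as the threshold in the per-coordinate update.

```python
import numpy as np

def soft_threshold(z, t):
    """Closed-form minimizer of 0.5 * (w - z)**2 + t * |w| over w."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Cyclic coordinate descent for (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    residual = np.array(y, dtype=float)      # equals y - X @ w while w == 0
    col_sq = (X ** 2).sum(axis=0) / n        # per-feature curvature
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                     # an all-zero column keeps w[j] = 0
            residual += X[:, j] * w[j]       # take feature j out of the fit
            rho = X[:, j] @ residual / n     # correlation with the partial residual
            w[j] = soft_threshold(rho, alpha) / col_sq[j]   # alpha is applied here
            residual -= X[:, j] * w[j]       # put feature j back in
    return w

# Synthetic demo (hypothetical data): only the first two features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=100)
w_hat = lasso_cd(X, y, alpha=0.1)
```

A quick sanity check of the update: for any alpha at or above max|X^T y| / n_samples, every coordinate is thresholded to zero on the first sweep and stays there.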

Does Gpytorch use Analytic gradient or Automatic differentiation for training?

I am confused about how GPyTorch calculates the gradients with respect to the parameters of the model. For instance, let's say I am using ExactGP with a Gaussian likelihood, an RBF kernel, and a constant mean, and using MLE (maximum likelihood estimation) to find the parameters of the model (mean, kernel parameters, and noise). One way to calculate the gradient w.r.t. the parameters of the model is the analytical gradient, which means taking the derivative of the negative log-likelihood with respect to each parameter and deriving an equation for each derivative. Another way is to use the automatic differentiation provided by PyTorch.
The GPyTorch authors mention in their paper "GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration" that they are using the analytical gradient, or at least this is what I understood from reading the paper. Am I correct? Also, I couldn't find the code where they have implemented the analytical gradient.
Could anyone help me understand this better, please?
The "automatic differentiation provided by PyTorch" does compute the analytic gradient (via back-propagation, note that there is no finite differencing or anything like that involved) - it just does so automatically.
https://github.com/cornellius-gp/gpytorch/discussions/1949#discussioncomment-2384471
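To illustrate the point that automatic differentiation yields the exact derivative rather than a numerical approximation, here is a toy reverse-mode autodiff in plain Python. It is only a sketch: the Var class and the exp and backward helpers are invented for this example and are not how PyTorch or GPyTorch are implemented. The test function f(theta) = theta * exp(-0.5 * d2 * theta) is likewise an arbitrary choice, loosely shaped like a kernel term.

```python
import math

class Var:
    """A scalar that remembers its inputs and the local derivatives w.r.t. them."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # tuple of (parent Var, d self / d parent)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __radd__ = __add__
    __rmul__ = __mul__

def exp(v):
    e = math.exp(v.value)
    return Var(e, ((v, e),))        # d exp(v) / d v = exp(v)

def backward(out):
    """Back-propagation: apply the chain rule in reverse topological order."""
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)
    visit(out)
    out.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad

d2 = 2.0
def f_plain(t):
    return t * math.exp(-0.5 * d2 * t)

theta = Var(0.7)
f = theta * exp((-0.5 * d2) * theta)
backward(f)

autodiff_grad = theta.grad
# Hand-derived: f'(theta) = exp(-0.5 * d2 * theta) * (1 - 0.5 * d2 * theta)
analytic_grad = math.exp(-0.5 * d2 * 0.7) * (1.0 - 0.5 * d2 * 0.7)
# Central finite difference, shown only for contrast; it is an approximation.
h = 1e-5
finite_diff_grad = (f_plain(0.7 + h) - f_plain(0.7 - h)) / (2.0 * h)
```

Here autodiff_grad agrees with the hand-derived formula up to floating-point rounding, because back-propagation applies the same chain rule one would apply by hand. The finite-difference value is close but carries truncation error that depends on h.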

Modelling probabilities in a regularized (logistic?) regression model in python

I would like to fit a regression model to probabilities. I am aware that linear regression is often used for this purpose, but I have several probabilities at or near 0.0 and 1.0 and would like to fit a regression model where the output is constrained to lie between 0.0 and 1.0. I want to be able to specify a regularization norm and strength for the model and ideally do this in python (but an R implementation would be helpful as well). All the logistic regression packages I've found seem to be only suited for classification whereas this is a regression problem (albeit one where I want to use the logit link function). I use scikits-learn for my classification and regression needs so if this regression model can be implemented in scikits-learn, that would be fantastic (it seemed to me that this is not possible), but I'd be happy about any solution in python and/or R.
The question has two issues, penalized estimation and fractional or proportions data as dependent variable. I worked on each separately but never tried the combination.
Penalization
Statsmodels has had L1 regularized Logit and other discrete models like Poisson for some time. In recent months there has been a lot of effort to support more penalization but it is not in statsmodels yet. Elastic net for linear and Generalized Linear Model (GLM) is in a pull request and will be merged soon. More penalized GLM like L2 penalization for GAM and splines or SCAD penalization will follow over the next months based on pull requests that still need work.
Two examples for the current L1 fit_regularized for Logit are the question "Difference in SGD classifier results and statsmodels results for logistic with l1" and the demo script at https://github.com/statsmodels/statsmodels/blob/master/statsmodels/examples/l1_demo/short_demo.py
Note, the penalization weight alpha can be a vector with zeros for coefficients like the constant if they should not be penalized.
http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.fit_regularized.html
Fractional models
Binary and binomial models in statsmodels do not impose that the dependent variable is binary and work as long as the dependent variable is in the [0,1] interval.
Fractions or proportions can be estimated with Logit as Quasi-maximum likelihood estimator. The estimates are consistent if the mean function, logistic, cumulative normal or similar link function, is correctly specified but we should use robust sandwich covariance for proper inference. Robust standard errors can be obtained in statsmodels through a fit keyword cov_type='HC0'.
The best documentation is for Stata, http://www.stata.com/manuals14/rfracreg.pdf, and the references therein. I went through those references before Stata had fracreg, and the approach works correctly with at least Logit and Probit, which were my test cases. (I can't find my scripts or test cases right now.)
The bad news for inference is that robust covariance matrices have not been added to fit_regularized, so the correct sandwich covariance is not directly available. The standard covariance matrix and standard errors of the parameter estimates are derived under the assumption that the model, i.e. the likelihood function, is correctly specified, which will not be the case if the data are fractions and not binary.
Besides using Quasi-Maximum Likelihood with binary models, it is also possible to use a likelihood that is defined for fractional data in (0, 1). A popular model is Beta regression, which is also waiting in a pull request for statsmodels and is expected to be merged within the next months.
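The quasi-maximum-likelihood idea described above can be sketched in plain NumPy. This is not the statsmodels API: fit_fractional_logit is a hypothetical helper, and it uses an L2 (ridge) penalty rather than the L1 penalty of fit_regularized, because ridge keeps the Newton update simple. The intercept is left unpenalized, in the spirit of the note about setting alpha to zero for the constant. No standard errors are computed, so the sandwich covariance discussed above is out of scope here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_fractional_logit(X, y, penalty=0.0, n_iter=50):
    """Newton iterations on the Bernoulli quasi-log-likelihood with a ridge penalty.

    X must already contain a leading column of ones; that intercept is not
    penalized. y may be any fraction in [0, 1], not only 0 or 1.
    """
    n, p = X.shape
    pen = np.full(p, float(penalty))
    pen[0] = 0.0                                    # do not shrink the constant
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = sigmoid(X @ beta)                      # fitted means in (0, 1)
        grad = X.T @ (mu - y) / n + pen * beta
        hess = (X * (mu * (1.0 - mu))[:, None]).T @ X / n + np.diag(pen)
        beta -= np.linalg.solve(hess, grad)
    return beta

# Synthetic fractions generated from a known logistic mean (hypothetical data).
x = np.linspace(-3.0, 3.0, 200)
X = np.column_stack([np.ones_like(x), x])
y = sigmoid(0.5 - 1.2 * x)

beta_unpenalized = fit_fractional_logit(X, y)
beta_ridge = fit_fractional_logit(X, y, penalty=0.5)
predictions = sigmoid(X @ beta_ridge)               # always strictly inside (0, 1)
```

Because the mean function is a logistic curve, predictions stay between 0 and 1 even when many observed fractions sit at or near the boundaries, which is the constraint the question asks for.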

Regarding Probability Estimates predicted by LIBSVM

I am attempting 3-class classification using an SVM classifier. How do we interpret the probability estimates predicted by LIBSVM? Is it based on the perpendicular distance of the instance from the maximal margin hyperplane?
Kindly throw some light on the interpretation of the probability estimates predicted by the LIBSVM classifier. Parameters C and gamma are first tuned, and then probability estimates are output by using the -b option for both training and testing.
Multiclass SVM is always decomposed into several binary classifiers (typically a set of one vs all classifiers). Any binary SVM classifier's decision function outputs a (signed) distance to the separating hyperplane. In short, an SVM maps the input domain to a one-dimensional real number (the decision value). The predicted label is determined by the sign of the decision value. The most common technique to obtain probabilistic output from SVM models is through so-called Platt scaling (paper of LIBSVM authors).
Is it based on perpendicular distance of the instance from the maximal margin hyperplane?
Yes. Any classifier that outputs such a one-dimensional real value can be post-processed to yield probabilities, by calibrating a logistic function on the decision values of the classifier. This is the exact same approach as in standard logistic regression.
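A minimal sketch of this calibration step, assuming binary labels and NumPy: it fits the two parameters A and B of P(y = 1 | f) = 1 / (1 + exp(A * f + B)) to decision values f by Newton's method. The names fit_platt and platt_probability are made up for this example, the decision values are simulated rather than taken from a trained SVM, and the sketch omits refinements such as the smoothed targets from Platt's paper, so it should not be read as LIBSVM's actual code.

```python
import numpy as np

def fit_platt(decision_values, labels, n_iter=25):
    """Fit A, B in P(y=1 | f) = 1 / (1 + exp(A*f + B)) by maximum likelihood."""
    f = np.asarray(decision_values, dtype=float)
    t = np.asarray(labels, dtype=float)             # 1 = positive, 0 = negative
    Z = np.column_stack([f, np.ones_like(f)])
    theta = np.zeros(2)                             # theta = (A, B)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(Z @ theta))
        grad = Z.T @ (t - p)                        # gradient of the negative log-likelihood
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z
        theta -= np.linalg.solve(hess, grad)        # Newton step
    return theta

def platt_probability(decision_values, A, B):
    f = np.asarray(decision_values, dtype=float)
    return 1.0 / (1.0 + np.exp(A * f + B))

# Simulated decision values with noisy labels (hypothetical data): larger
# signed distances are more likely to belong to the positive class.
rng = np.random.default_rng(0)
f_train = rng.normal(size=400)
y_train = (rng.random(400) < 1.0 / (1.0 + np.exp(-2.0 * f_train))).astype(float)

A, B = fit_platt(f_train, y_train)
probs = platt_probability(np.array([-2.0, 0.0, 2.0]), A, B)
```

With this parameterization A comes out negative when larger decision values indicate the positive class, so the calibrated probability increases monotonically with the signed distance to the hyperplane.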
An SVM performs binary classification. To handle more than two classes, LIBSVM trains one binary classifier per pair of classes (one-vs-one). What you get when you invoke -b is the probability estimate built on top of this decomposition, which you can find explained here.
