Is elastic net equivalent in scikit-learn and glmnet? - scikit-learn

In particular, the glmnet docs imply it fits a "Generalised Linear Model" of the Gaussian family for regression, while the scikit-learn docs imply no such thing (i.e., it looks like a plain linear regression, not a generalised one). But I'm not sure about this.

In the documentation you link to, there is an optimization problem which shows exactly what is optimized in GLMnet:
1/(2N) * sum_i (y_i - beta_0 - x_i^T beta)^2 + lambda * [(1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1]
Now take a look here, where you will find the same formula written as the minimization of a squared Euclidean norm plus the penalty terms. Note that the docs omit the intercept w_0, the equivalent of beta_0, but the code does estimate it.
Please also note that glmnet's lambda becomes alpha in scikit-learn, and glmnet's alpha becomes rho (named l1_ratio in recent scikit-learn versions)...
The "Gaussian family" aspect probably refers to the fact that an L2-loss is used, which corresponds to assuming that the noise is additive Gaussian.

Related

Why is the standard deviation based on the squared difference of an observation from the mean?

I am learning statistics, and have some basic yet core questions on SD:
s = the sample (the collection of observations)
n = total number of observations
xi = ith observation
μ = arithmetic mean of all observations
σ = the usual definition of SD, i.e. ((1/(n-1))*sum([(xi-μ)**2 for xi in s]))**(1/2) in Python lingo
f = frequency of an observation value
I do understand that (1/n)*sum([xi-μ for xi in s]) would be useless (= 0), but would not (1/n)*sum([abs(xi-μ) for xi in s]) have been a measure of variation?
Why stop at power of 1 or 2? Would ((1/(n-1))*sum([abs((xi-μ)**3) for xi in s]))**(1/3) or ((1/(n-1))*sum([(xi-μ)**4 for xi in s]))**(1/4) and so on have made any sense?
My notion of squaring is that it 'amplifies' the measure of variation from the arithmetic mean, while the simple absolute difference is, notionally, more of a linear scale. Would it not amplify it even more if I cubed it (and took the absolute value, of course) or raised it to the fourth power?
I do agree that, computationally, cubes and fourth powers would have been more expensive. But by the same argument, absolute values would have been less expensive... So why squares?
Why is the Normal Distribution like it is, i.e. f = (1/(σ*math.sqrt(2*pi)))*e**((-1/2)*((xi-μ)/σ)**2)?
What impact would it have on the normal distribution formula above if I calculated SD as described in (1) and (2) above?
Is it only a matter of our 'getting used to the squares', it could well have been linear, cubed or quad, and we would have trained our minds likewise?
(I may not have been 100% accurate in my number of opening and closing brackets above, but you will get the idea.)
So, if you are looking for an index of dispersion, you actually don't have to use the standard deviation. You can indeed report the mean absolute deviation, the summary statistic you suggested. You merely need to be aware of how each summary statistic behaves; for example, the SD assigns more weight to outlying observations. You should also consider how each one can be interpreted. For example, with a normal distribution, we know how much of the distribution lies within ±2 SD of the mean. For some discussion of the mean absolute deviation (and other measures of average absolute deviation, such as the median absolute deviation) and their uses, see here.
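To make the outlier point concrete, here is a small made-up example comparing the two summary statistics on the same data with and without an outlier:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_out = np.array([1.0, 2.0, 3.0, 4.0, 50.0])  # same data with one outlier

def mean_abs_dev(a):
    # Mean absolute deviation from the mean.
    return np.mean(np.abs(a - a.mean()))

for data in (x, x_out):
    print(f"SD = {data.std(ddof=1):6.2f}   mean abs. deviation = {mean_abs_dev(data):6.2f}")
# The SD reacts far more strongly to the single outlier, because squaring
# gives large deviations disproportionate weight.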
Beyond its use as a measure of spread though, SD is related to variance and this is related to some of the other reasons it's popular, because the variance has some nice mathematical properties. A mathematician or statistician would be able to provide a more informed answer here, but squared difference is a smooth function and is differentiable everywhere, allowing one to analytically identify a minimum, which helps when fitting functions to data using least squares estimation. For more detail and for a comparison with least absolute deviations see here. Another major area where variance shines is that it can be easily decomposed and summed, which is useful for example in ANOVA and regression models generally. See here for a discussion.
As to your questions about raising to higher powers, they actually do have uses in statistics! In general, the mean (the first moment), the variance (the second central moment, whose square root is the standard deviation), skewness (related to the third power) and kurtosis (related to the fourth power) are all moments of a distribution. Taking differences raised to those powers and standardizing them provides useful information about the shape of a distribution. The video I linked provides some easy intuition.
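As a rough illustration of those higher-order moments (the sample and the scipy calls are my own choice, not from the original answer):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=10_000)  # a right-skewed sample

mu = sample.mean()
sigma = sample.std()  # population-style SD, to match scipy's default

# Standardized third and fourth central moments, computed by hand:
skew_manual = np.mean(((sample - mu) / sigma) ** 3)
kurt_manual = np.mean(((sample - mu) / sigma) ** 4) - 3  # excess kurtosis

print(skew_manual, stats.skew(sample))      # asymmetry of the distribution
print(kurt_manual, stats.kurtosis(sample))  # heaviness of the tails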
For some other answers and a larger discussion of why the SD is so popular, see here.
Regarding the relationship of sigma and the normal distribution, sigma is simply a parameter that stretches the standard normal distribution, just as the mean changes its location. This is simply a result of the way the standard normal distribution (a normal distribution with mean = 0 and SD = variance = 1) is mathematically defined, and note that all normal distributions can be derived from the standard normal distribution. This answer illustrates this. Now, you can parameterize a normal distribution in other ways as well, but I believe you do need to provide sigma in some form, whether as the SD or as the precision. I don't think you can parameterize a normal distribution using just the mean and the mean absolute difference. Now, a deeper question is why normal distributions are so incredibly useful in representing widely different phenomena and crop up everywhere. I think this is related to the Central Limit Theorem, but I do not understand the proofs of the theorem well enough to comment further.
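The stretching-and-shifting point is easy to verify numerically; a tiny sketch (the values here are arbitrary):

import numpy as np
from scipy import stats

mu, sigma = 3.0, 2.0
x = np.linspace(-5.0, 11.0, 9)

# Density of N(mu, sigma^2) obtained by shifting and scaling the standard normal:
pdf_from_standard = stats.norm.pdf((x - mu) / sigma) / sigma
pdf_direct = stats.norm.pdf(x, loc=mu, scale=sigma)

print(np.allclose(pdf_from_standard, pdf_direct))  # True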

Is the definition of hyperparameter C in SVR opposite to the corresponding C in SVM?

I just realized that support vector machines can also be used for regression, thanks to this nice article. However, I am quite confused about the definition of the hyperparameter C.
I am well aware of the slack variables \xi_i associated with each data point and the hyperparameter C in classification SVM. There, the objective function is
\min_{w, b} \frac{\|w\|^2}{2} + C\sum_{i=1}^N \xi_i, such that
y_i (w \cdot x_i + b) \ge 1 - \xi_i and \xi_i \ge 0.
In SVM, the larger C is, the larger the penalty, and hence the soft-margin SVM reduces to the hard-margin SVM as C goes to infinity. (Sorry for the raw LaTeX code; I remember LaTeX being supported, but that does not seem to be the case.)
From the linked article, the objective function and the constraints are as follows
I think the equations also imply that the larger C is, the larger the penalty. However, the author of the article claims the opposite.
I noticed that someone asked the author the same question at the end of the article but there has been no response.
I guessed there might be a typo in the equation, so I looked for support from other references, and I found that SVR in Python (scikit-learn) uses the convention that the strength of regularization is inversely proportional to C. I tried to check the source code of SVR but couldn't find the formula. Can someone help resolve this? Thanks!
C is a regularization parameter, which in that convention means it sits in front of the squared norm of w rather than in front of the relaxation (slack) terms, so that C may effectively play the role of 1/C in the formulation above.
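If it helps to see the effect empirically, here is a small sketch using scikit-learn's SVR with a linear kernel (the toy data and the C values are my own, not from the article): with small C the weight is shrunk toward zero (strong regularization), while with large C the fit tracks the data more closely, which matches the "inversely proportional" wording in the scikit-learn docs.

import numpy as np
from sklearn.svm import SVR

# Toy 1-D regression data (made up for illustration): y ≈ 2x + noise.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
y = 2.0 * X.ravel() + rng.normal(scale=1.0, size=80)

for C in (0.001, 0.01, 1.0):
    model = SVR(kernel="linear", C=C, epsilon=0.1).fit(X, y)
    resid = np.mean(np.abs(model.predict(X) - y))
    print(f"C={C:>6}: |w| = {abs(model.coef_[0, 0]):.3f}, "
          f"mean abs. training residual = {resid:.3f}")
# Small C: weight shrunk toward zero (underfit); large C: fit follows the data.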

What is the difference between Freidman mse and mse?

I'm looking into GradientBoostingClassifier in sklearn. Then I found that there are three kinds of criteria: friedman_mse, mse, and mae.
The description provided by sklearn is:
The function to measure the quality of a split. Supported criteria are “friedman_mse” for the mean squared error with improvement score by Friedman, “mse” for mean squared error, and “mae” for the mean absolute error. The default value of “friedman_mse” is generally the best as it can provide a better approximation in some cases.
I can't understand what the difference is. Can anyone explain? Thanks!
I've provided a full answer at this link, since it is more convenient to write TeX there. In short, it comes down to the fact that this splitting criterion allows us to base the decision not only on how close we are to the desired outcome (which is what plain MSE does), but also on the probabilities of the desired k-th class that we are going to find in region l or in region r (by considering a global weight w1*w2 / (w1 + w2)). I strongly recommend you check the above link for a full explanation.
According to the scikit-learn source code, the main difference between these two criteria is the impurity-improvement method. The MSE / FriedmanMSE criterion calculates an impurity for the current node and tries to reduce (improve) it; the smaller the impurity, the better.
Mean squared error impurity criterion.
proxy_improvement = total_left_sum**2 / w_l + total_right_sum**2 / w_r
source
On the other side, the FriedmanMSE criterion uses the following to measure the improvement:
diff = w_r * total_left_sum - w_l * total_right_sum
improvement = diff**2 / (w_r * w_l)
Note: w_r (the right weight) multiplies the total left sum, and vice versa.
You can simplify the above equations with the cleaner notation provided in Friedman's paper itself (eq. 35), which says:
improvement = (w_l * w_r) / (w_l + w_r) * (mean_left - mean_right) ^ 2
where w_l and w_r are the corresponding sums of sample weights for the left and right parts.
source
To give meaning to the left and right keywords, imagine the node's samples laid out in an array (e.g. samples[start:end]); left then means the elements on the left side of the candidate split within the current node.
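To make the difference concrete, here is a small sketch (plain Python/NumPy, not the actual Cython code from scikit-learn) that computes a variance-reduction style improvement and Friedman's improvement score (eq. 35) for one candidate split of a toy target array:

import numpy as np

def variance_reduction(y_left, y_right):
    # Weighted decrease in node variance: the idea behind the plain MSE criterion.
    y = np.concatenate([y_left, y_right])
    w_l, w_r, n = len(y_left), len(y_right), len(y)
    return y.var() - (w_l * y_left.var() + w_r * y_right.var()) / n

def friedman_improvement(y_left, y_right):
    # Friedman's improvement score: w_l*w_r/(w_l + w_r) * (mean_l - mean_r)^2.
    w_l, w_r = len(y_left), len(y_right)
    return w_l * w_r / (w_l + w_r) * (y_left.mean() - y_right.mean()) ** 2

y = np.array([0.2, 0.1, 0.4, 1.9, 2.1, 2.0])  # toy targets, split in the middle
print(variance_reduction(y[:3], y[3:]))
print(friedman_improvement(y[:3], y[3:]))

Both scores favour splits that separate the targets well; the Friedman score additionally weights the squared difference of the child means by w_l*w_r/(w_l + w_r).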

fitting for offset in a patsy model

Using patsy, I understand how to turn intercepts on or off. But I haven't managed to get horizontal offsets. For instance, I would like to be able to fit, in essence
y = alpha + beta * abs(x_opt - x_obs)
with x_opt free in the fit. I tried to write this like so:
y ~ 1 + np.abs(y - x)
using a constant column for y. But within the np.abs() parentheses, patsy "turns off," and y - x is just interpreted as a number. If I shift y to 1 or 20, I get different answers.
A similar question applies to, e.g., np.pow(1-x, 2) or a sine wave. Being able to fit for the x offset would be extremely helpful. Is this possible? Or is this precisely what is meant when people say patsy doesn't do non-linear models?
patsy and most of statsmodels only handle models that are linear in parameters. Or more precisely, models where the design matrix and estimated parameters are combined in a linear way, x * beta.
Polynomials and splines are nonlinear in the underlying variables but have a linear representation in terms of basis function and are therefore linear in parameters.
The only non-linearities in the models that are currently implemented in statsmodels are predefined nonlinearities like link functions in GLM or discrete models, shape parameters in models like NegativeBinomial, or covariances in mixed models and GEE.
The best Python package for nonlinear least squares is currently lmfit https://pypi.python.org/pypi/lmfit/
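For this specific model, a minimal sketch with scipy.optimize.curve_fit also works (the data, starting values, and parameter names here are mine); lmfit offers a similar, more convenient interface on top of the same idea.

import numpy as np
from scipy.optimize import curve_fit

# Model with the horizontal offset x_opt as a free parameter.
def model(x, alpha, beta, x_opt):
    return alpha + beta * np.abs(x - x_opt)

# Synthetic data (made up): true alpha=1, beta=2, x_opt=0.5, plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 200)
y = model(x, 1.0, 2.0, 0.5) + rng.normal(scale=0.2, size=x.size)

# Nonlinear least squares; p0 gives starting values for (alpha, beta, x_opt).
popt, pcov = curve_fit(model, x, y, p0=[0.0, 1.0, 0.0])
print(popt)  # should be close to [1.0, 2.0, 0.5]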

Quadratic Programming and quasi newton method BFGS

Yesterday, I posted a question about the general concept of an SVM primal form implementation:
Support Vector Machine Primal Form Implementation
and "lejlot" helped me out to understand that what I am solving is a QP problem.
But I still don't understand how my objective function can be expressed as QP problem
(http://en.wikipedia.org/wiki/Support_vector_machine#Primal_form)
Also I don't understand how QP and Quasi-Newton method are related
All I know is Quasi-Newton method will SOLVE my QP problem which supposedly formulated from
my objective function (which I don't see the connection)
Can anyone walk me through this please??
For SVMs, the goal is to find a classifier. This problem can be expressed in terms of a function that you are trying to minimize.
Let's first consider the Newton iteration. Newton iteration is a numerical method to find a solution to a problem of the form F(x) = 0.
Instead of solving it analytically, we can solve it numerically by the following iteration:
x^(k+1) = x^k - DF(x^k)^(-1) * F(x^k)
Here x^(k+1) is the (k+1)-th iterate, DF(x^k)^(-1) is the inverse of the Jacobian of F evaluated at x^k, and x^k is the k-th iterate.
This update runs as long as we make progress in terms of the step size (delta x) or until the function value approaches 0 to a good degree. The termination criterion can be chosen accordingly.
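A minimal one-dimensional sketch of that iteration (my own toy example; in n dimensions the division by the derivative becomes a linear solve with the Jacobian):

def newton(F, dF, x0, tol=1e-10, max_iter=50):
    # 1-D Newton iteration for F(x) = 0.
    x = x0
    for _ in range(max_iter):
        step = F(x) / dF(x)
        x = x - step
        if abs(step) < tol:  # terminate when the step size gets small
            break
    return x

# Example: solve x**2 - 2 = 0, i.e. find sqrt(2).
print(newton(lambda x: x**2 - 2, lambda x: 2*x, x0=1.0))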
Now consider minimizing a function F, i.e. solving the problem F'(x) = 0. If we formulate the Newton iteration for that, we get
x^(k+1) = x^k - HF(x^k)^(-1) * DF(x^k)
where HF(x^k)^(-1) is the inverse of the Hessian matrix and DF(x^k) is the gradient of the function F. Note that we are working in n dimensions, so we cannot simply divide; we have to multiply by the inverse of the matrix.
Now we face some problems: in each step, we have to compute the Hessian matrix at the updated x, which is very expensive. We also have to solve a system of linear equations, namely HF(x^k) * y = DF(x^k) (equivalently, y = HF(x^k)^(-1) * DF(x^k)).
So instead of computing the Hessian in every iteration, we start off with an initial guess of the Hessian (for example the identity matrix) and perform rank-one updates after each iteration. For the exact formulas, have a look here.
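In practice you rarely code BFGS yourself. A small sketch of minimizing a smooth convex quadratic with SciPy's BFGS implementation (the objective here is my own toy example, not the SVM objective):

import numpy as np
from scipy.optimize import minimize

# A simple smooth convex objective: f(x) = 0.5 * x^T A x - b^T x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

res = minimize(f, x0=np.zeros(2), jac=grad, method="BFGS")
print(res.x)                   # BFGS solution
print(np.linalg.solve(A, b))   # exact minimizer A^(-1) b, for comparison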
So how does this link to SVMs?
When you look at the function you are trying to minimize, you can formulate a primal problem, which you can then reformulate as a dual Lagrangian problem; this dual problem is convex and can be solved numerically. It is all well documented in the article, so I will not try to reproduce the formulas here in lower quality.
But the idea is the following: If you have a dual problem, you can solve it numerically. There are multiple solvers available. In the link you posted, they recommend coordinate descent, which solves the optimization problem for one coordinate at a time. Or you can use subgradient descent. Another method is to use L-BFGS. It is really well explained in this paper.
Another popular algorithm for solving problems like this is ADMM (alternating direction method of multipliers). In order to use ADMM you would have to reformulate the given problem into an equivalent problem that gives the same solution but has the correct format for ADMM. For that, I suggest reading Boyd's notes on ADMM.
In general: first understand the function you are trying to minimize, and then choose the numerical method that is best suited. In this case, subgradient descent and coordinate descent are the best suited, as stated in the Wikipedia link.
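As a very rough illustration of the subgradient-descent option, here is a sketch on the unconstrained hinge-loss form of the linear SVM primal, 0.5*||w||^2 + C*sum_i max(0, 1 - y_i*(w.x_i + b)), with made-up separable data and a fixed step size (a real implementation would use a decreasing step size and a proper stopping rule):

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 1.0, (50, 2)), rng.normal(-2.0, 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
C, lr = 1.0, 0.01

w, b = np.zeros(2), 0.0
for _ in range(1000):
    margins = y * (X @ w + b)
    active = margins < 1                                  # points violating the margin
    grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)  # subgradient in w
    grad_b = -C * y[active].sum()                         # subgradient in b
    w -= lr * grad_w
    b -= lr * grad_b

print("training accuracy:", np.mean(np.sign(X @ w + b) == y))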

Resources