What are the advantages of L2-SVM? - svm

I don't clearly understand the advantages of L2-SVM compared with L1-SVM. L2-SVM is said to perform better than L1-SVM, but I don't understand why. Can someone explain?

L2-SVM is differentiable and penalizes points which violate the margin quadratically rather than linearly, so large violations cost more than under L1-SVM (violations smaller than 1 cost less).
From http://deeplearning.net/wp-content/uploads/2013/03/dlsvm.pdf
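To make the comparison concrete, here is a minimal sketch of the two losses as functions of the margin y * f(x). The function names are made up for illustration and are not taken from the linked paper.

```python
def hinge(margin):
    """L1-SVM loss: max(0, 1 - margin), where margin = y * f(x)."""
    return max(0.0, 1.0 - margin)


def squared_hinge(margin):
    """L2-SVM loss: max(0, 1 - margin) squared."""
    return max(0.0, 1.0 - margin) ** 2


def hinge_grad(margin):
    # Subgradient w.r.t. the margin: jumps from -1 to 0 at margin == 1.
    return -1.0 if margin < 1.0 else 0.0


def squared_hinge_grad(margin):
    # Derivative w.r.t. the margin: continuous, shrinks to 0 as margin -> 1.
    return -2.0 * max(0.0, 1.0 - margin)


for m in (2.0, 1.0, 0.5, 0.0, -1.0):
    print(m, hinge(m), squared_hinge(m), hinge_grad(m), squared_hinge_grad(m))
```

The hinge gradient is discontinuous at the margin boundary, while the squared-hinge gradient goes to zero smoothly, which is what "differentiable" refers to here.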

Related

Feature scaling and its effect on various algorithms

Despite going through lots of similar questions related to this, I still cannot understand why some algorithms are susceptible to it while others are not.
So far I have found that SVM and K-means are susceptible to feature scaling while Linear Regression and Decision Trees are not. Can somebody please explain why, either in general or in relation to these 4 algorithms?
As I am a beginner, please explain this in layman's terms.
One reason I can think of off-hand is that SVM and K-means, at least with a basic configuration, use an L2 distance metric. An L1 or L2 distance metric between two points will give different results if you double delta-x or delta-y, for example.
With Linear Regression, you fit a linear transform to best describe the data by effectively transforming the coordinate system before taking a measurement. Since the optimal model is the same no matter the coordinate system of the data, pretty much by definition, your result will be invariant to any linear transform including feature scaling.
With Decision Trees, you typically look for rules of the form x < N, where the only detail that matters is how many items pass or fail the given threshold test - you pass this into your entropy function. Because this rule format does not depend on dimension scale, since there is no continuous distance metric, we again have invariance.
Somewhat different reasons for each, but I hope that helps.
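A small sketch of the distance and threshold arguments, using made-up data (height and weight of two people plus a query point). Converting height from metres to centimetres changes which point is nearest under L2 distance, but a threshold rule splits the data the same way once the threshold is rescaled too.

```python
import math


def nearest(query, points):
    """Index of the point closest to query under Euclidean (L2) distance."""
    return min(range(len(points)), key=lambda i: math.dist(query, points[i]))


def scale(point, factors):
    """Multiply each coordinate by its own factor."""
    return tuple(v * f for v, f in zip(point, factors))


def split(values, threshold):
    """Decision-tree style rule: which values satisfy x < threshold."""
    return [v < threshold for v in values]


# Hypothetical data: (height in metres, weight in kg).
points = [(1.60, 80.0), (1.90, 62.0)]
query = (1.62, 64.0)

# Distance-based: the nearest neighbour depends on the units of each feature.
factors = (100.0, 1.0)  # metres -> centimetres, weight unchanged
before = nearest(query, points)
after = nearest(scale(query, factors), [scale(p, factors) for p in points])
print("nearest before scaling:", before, "after scaling:", after)

# Threshold-based: the split is the same once the threshold is rescaled too.
heights = [p[0] for p in points] + [query[0]]
split_m = split(heights, 1.75)
split_cm = split([h * 100.0 for h in heights], 175.0)
print("same split:", split_m == split_cm)
```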

Does sklearn.linear_model.LogisticRegression always converge to best solution?

When using this code, I've noticed that it converges unbelievably quickly (small
fraction of one second), even when the model and/or the data is very large. I
suspect that in some cases I am not getting anything close to the best solution,
but this is hard to prove. It would be nice to have the option for some type of
global optimizer such as the basin hopping algorithm, even if this consumed 100
to 1,000 times as much CPU. Does anyone have any thoughts on this subject?
This is a very complex question and this answer might be incomplete, but it should give you some hints (as your question also indicates some knowledge gaps):
(1) First, I disagree with the desire for "some type of global optimizer such as the basin hopping algorithm, even if this consumed 100 to 1,000 times as much CPU". This does not help in most cases (in the ML world), as the differences are so subtle and the optimization error will often be negligible compared to the other errors (model power; empirical risk).
Read "Stochastic Gradient Descent Tricks" (Bottou) for an overview (and the error components!).
He even gives a very important reason to use fast approximate algorithms (not necessarily a good fit in your case if 1000x training time is not a problem): approximate optimization can achieve better expected risk because more training examples can be processed during the allowed time.
(2) Basin-hopping is one of those highly heuristic tools of global optimization (looking for global minima instead of local minima) without any guarantees at all (touching NP-hardness and co.). It's the last algorithm you want to use here (see point (3))!
(3) The problem of logistic regression is a convex optimization problem!
Any local minimum is also a global minimum, which follows from convexity (I'm ignoring details like strict convexity and uniqueness of solutions)!
Therefore you will always use something tuned for convex optimization! And never basin-hopping!
(4) There are different solvers and each supports different variants of the problem (different regularization and co.). We don't know exactly what you are optimizing, but of course these solvers behave differently with regard to convergence:
Take the following comments with a grain of salt:
liblinear: uses a coordinate-descent algorithm according to sklearn's docs, which means convergence is highly dependent on the data
whether accurate convergence is achieved depends solely on the exact implementation (liblinear is high-quality)
as it's a first-order method, I would call the general accuracy medium
sag/saga: seem to have a better convergence theory (I did not check it much), but again: convergence depends on your data, as mentioned in sklearn's docs, and whether solutions are accurate depends highly on the implementation details
as these are first-order methods: general accuracy medium
newton-cg: an inexact Newton method
in general much more robust in terms of convergence, as line searches replace heuristics or constant learning rates (line searches are costly in first-order optimization)
second-order method with inexact core: expected accuracy: medium-high
lbfgs: quasi-Newton method
again, in general much more robust in terms of convergence, like newton-cg
second-order method: expected accuracy: medium-high
Of course, second-order methods are hurt more by large-scale data (even complexity-wise) and, as mentioned, not all solvers support every logreg optimization problem supported in sklearn.
I hope you get the idea how complex this question is (because of highly complex solver-internals).
Most important things:
LogReg is convex -> use solvers tuned for unconstrained convex optimization
If you want medium-high accuracy: use the second-order methods available and do many iterations (it's a parameter)
If you want high accuracy: use second-order methods which are even more conservative/careful (no Hessian approximation, no inverse-Hessian approximation, no truncation...):
e.g. any off-the-shelf solver from convex optimization
Open-source: cvxopt, ecos and co.
Commercial: Mosek
(but you need to formulate the model yourself in their frameworks or some wrapper; probably some examples for classic logistic-regression available)
As expected: some methods will get very slow with much data.
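The convexity argument in point (3) can be checked on a toy problem. The sketch below is plain gradient descent on an L2-regularized logistic loss, written from scratch; it is not what any sklearn solver does internally, and the data, penalty strength and step size are all made up. Two very different starting points end up at the same minimizer, which is why a global optimizer such as basin-hopping has nothing to add here.

```python
import math

# Hypothetical toy data: one feature, labels in {0, 1}, not linearly separable.
X = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
Y = [0, 0, 1, 0, 1, 1]
LAMBDA = 0.1  # assumed L2 penalty on the weight (bias is not penalized)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient(w, b):
    """Gradient of mean logistic loss + (LAMBDA / 2) * w**2."""
    gw, gb = LAMBDA * w, 0.0
    for x, t in zip(X, Y):
        err = sigmoid(w * x + b) - t
        gw += err * x / len(X)
        gb += err / len(X)
    return gw, gb


def fit(w, b, lr=0.5, steps=5000):
    """Plain gradient descent from the given starting point."""
    for _ in range(steps):
        gw, gb = gradient(w, b)
        w -= lr * gw
        b -= lr * gb
    return w, b


w1, b1 = fit(-5.0, 3.0)
w2, b2 = fit(8.0, -4.0)
print("start A:", w1, b1)
print("start B:", w2, b2)
```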

What does a 'tractable' distribution mean?

For example, in generative adversarial networks, we often hear that inference is easy because the conditional distribution of x given the latent variable z is 'tractable'.
Also, I read somewhere that Boltzmann machines and variational autoencoders are used where the posterior distribution is not tractable, so some sort of approximation needs to be applied.
Could anyone tell me what 'tractable' means, as a rigorous definition? Or could anyone explain, for any of the examples I gave above, what tractable means exactly in that context?
First of all, let's define what tractable and intractable problems are (Reference: http://www.cs.ucc.ie/~dgb/courses/toc/handout29.pdf).
Tractable Problem: a problem that is solvable by a polynomial-time algorithm. The upper bound is polynomial.
Intractable Problem: a problem that cannot be solved by a polynomial-time algorithm. The lower bound is exponential.
From this perspective, a tractable distribution is one for which the probability at any given point can be calculated in polynomial time.
If a distribution has a closed-form expression, its probability can definitely be calculated in polynomial time, which, in the world of academia, means the distribution is tractable. Intractable distributions take exponential time or more, which usually means that, with existing computational resources, we can never calculate the probability at a given point in a relatively "short" time (any time longer than polynomial time is long...).
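As a rough illustration of the difference, the sketch below contrasts a closed-form Gaussian density with the normalizing constant of a small, fully connected binary energy-based model in the spirit of a Boltzmann machine. The energy function and the weights are assumptions made up for this example. The Gaussian needs a fixed number of operations per point, while the brute-force normalizer sums over all 2**n states.

```python
import itertools
import math


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Closed-form density: a fixed number of operations per evaluation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def partition_function(weights):
    """Brute-force normalizer Z = sum over all 2**n binary states of exp(-E(s)),
    with the assumed energy E(s) = -sum_{i<j} weights[i][j] * s_i * s_j.
    Returns Z and the number of states that had to be visited."""
    n = len(weights)
    z, terms = 0.0, 0
    for state in itertools.product((0, 1), repeat=n):
        energy = -sum(weights[i][j] * state[i] * state[j]
                      for i in range(n) for j in range(i + 1, n))
        z += math.exp(-energy)
        terms += 1
    return z, terms


print("gaussian_pdf(0) =", gaussian_pdf(0.0))
for n in (2, 4, 8, 12):
    w = [[0.5] * n for _ in range(n)]  # made-up coupling weights
    z, terms = partition_function(w)
    print("n =", n, "states visited =", terms)
```

The number of states doubles with every added unit, so computing the exact probability of a single configuration this way is out of reach already for a few dozen units.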

How does the Needleman Wunsch algorithm compare to brute force?

I'm wondering how you can quantify the results of the Needleman-Wunsch algorithm (typically used for aligning nucleotide/protein sequences).
Consider some fixed scoring scheme and two sequences of varying length S1 and S2. Say we calculate every possible alignment of S1 and S2 by brute force, and the highest scoring alignment has a score x. And of course, this has considerably higher complexity than the Needleman-Wunsch approach.
When using the Needleman-Wunsch algorithm to find a sequence alignment, say that it has a score y.
Consider r to be the score generated via Needleman-Wunsch for two random sequences R1 and R2.
How does x compare to y? Is y always greater than r for two sequences of known homology?
In general, I do understand that we use the Needleman-Wunsch algorithm to significantly speed up sequence alignment (vs a brute-force approach), but don't understand the cost in accuracy (if any) that comes with it. I had a go at reading the original paper (Needleman & Wunsch, 1970) but am still left with this question.
Needleman-Wunsch always produces an optimal answer - it's much faster than brute force and doesn't sacrifice accuracy in the process. The key insight it uses is that it's not actually necessary to generate all possible alignments, since most of them contain bad sub-alignments and couldn't possibly be optimal. The Needleman-Wunsch algorithm works by instead slowly building up optimal alignments for fragments of the original strands and then slowly growing those smaller alignments into larger alignments using the guarantee that any optimal alignment must contain an optimal alignment for a slightly smaller case.
I think your question boils down to whether dynamic programming finds the optimal solution, i.e., guarantees that y >= x. For a discussion of this I would refer to people who are likely smarter than me:
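A minimal sketch comparing the two approaches on short sequences. The scoring scheme (match +1, mismatch -1, linear gap penalty -1) and the example sequences are assumptions chosen for illustration. The brute-force version explores every alignment recursively without reusing any sub-result; the dynamic-programming version fills the usual score table.

```python
MATCH, MISMATCH, GAP = 1, -1, -1  # assumed scoring scheme


def brute_force(a, b):
    """Best score over every possible alignment (exponential time)."""
    if not a:
        return GAP * len(b)
    if not b:
        return GAP * len(a)
    sub = MATCH if a[0] == b[0] else MISMATCH
    return max(
        sub + brute_force(a[1:], b[1:]),  # align a[0] with b[0]
        GAP + brute_force(a[1:], b),      # a[0] against a gap
        GAP + brute_force(a, b[1:]),      # b[0] against a gap
    )


def needleman_wunsch(a, b):
    """Optimal global alignment score via dynamic programming, O(len(a) * len(b))."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = GAP * i
    for j in range(1, cols):
        score[0][j] = GAP * j
    for i in range(1, rows):
        for j in range(1, cols):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # extend with a match/mismatch
                score[i - 1][j] + GAP,      # gap in b
                score[i][j - 1] + GAP,      # gap in a
            )
    return score[-1][-1]


s1, s2 = "GATTACA", "GCATGCU"
x = brute_force(s1, s2)
y = needleman_wunsch(s1, s2)
print("brute force:", x, "Needleman-Wunsch:", y)
```

Both functions take the maximum over the same set of alignments, so the scores agree; only the amount of repeated work differs.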
https://cs.stackexchange.com/questions/23599/how-is-dynamic-programming-different-from-brute-force
Basically, it says that dynamic programming produces the optimal result, i.e., the same as brute force, but only for problems that satisfy the Bellman principle of optimality.
According to the Wikipedia page for Needleman-Wunsch, the problem does satisfy the Bellman principle of optimality:
https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm
Specifically:
The Needleman–Wunsch algorithm is still widely used for optimal global
alignment, particularly when the quality of the global alignment is of
the utmost importance. However, the algorithm is expensive with
respect to time and space, proportional to the product of the length
of two sequences and hence is not suitable for long sequences.
There is also mention of optimality elsewhere in the same Wikipedia page.

k-means with ellipsoids

I have n points in R^3 that I want to cover with k ellipsoids or cylinders (I don't really care; whichever is easier). I want to approximately minimize the union of the volumes. Let's say n is tens of thousands and k is a handful. Development time (i.e. simplicity) is more important than runtime.
Obviously I can run k-means and use perfect balls for my ellipsoids. Or I can run k-means, then use minimum enclosing ellipsoids per cluster rather than covering with balls, though in the worst case that's no better. I've seen talk of handling anisotropy with k-means but the links I saw seemed to think I had a tensor in hand; I don't, I just know the data will be a union of ellipsoids. Any suggestions?
[Edit: There's a couple votes for fitting a mixture of multivariate Gaussians, which seems like a viable thing to try. Firing up an EM code to do that won't minimize the volume of the union, but of course k-means doesn't minimize volume either.]
So you likely know k-means is NP-hard, and this problem is even more general (harder). Because you want ellipsoids, it might make a lot of sense to fit a mixture of k multivariate Gaussian distributions. You would probably want to try to find a maximum likelihood solution, which is a non-convex optimization, but at least it's easy to formulate and there is likely code available.
Other than that, you're likely to have to write your own heuristic search algorithm from scratch, which is a huge undertaking.
I did something similar with multivariate Gaussians using this method. The authors use kurtosis as the split measure, and I found it to be a satisfactory method for my application, clustering points obtained from a laser range finder (i.e. computer vision).
If the ellipsoids can overlap a lot,
then methods like k-means that try to assign points to single clusters
won't work very well.
Part of each ellipsoid has to fit the surface of your object,
but the rest may be inside it, don't-cares.
That is, covering algorithms
seem to me quite different from clustering / splitting algorithms;
unions are not splits.
Gaussian mixtures with lots of overlaps?
No idea, but see the picture and code on Numerical Recipes p. 845.
Coverings are hard even in 2d, see
find-near-minimal-covering-set-of-discs-on-a-2-d-plane.
