Kernel SVM primal with Stochastic Gradient Descent

In short: I am currently reading Online Learning with Kernels (http://books.nips.cc/papers/files/nips14/AA33.pdf) for fun and I can't figure out how he got to equation 8 from equations 6 and 7.
The idea is: We want to minimize a risk function
$R_{\mathrm{stoch}}[f,t] := c(x_t, y_t, f(x_t)) + \lambda\Omega[f]$
If we want to apply the representer theorem to f, writing it as
$f(x)=\sum_i \alpha_i k(x,x_i),$
how can we get to the STOCHASTIC gradient descent update?

The set of functions k(xi, ·) spans H, and since f is in H, f can be written as a linear combination of these "kernel functions".
So, pretending the set of k(xi, ·) forms a basis of H, it is clear that if a linear combination on the left-hand side equals one on the right-hand side, their coefficients must be equal as well (it is a well-known fact from linear algebra that equality of vectors implies equality of their coefficients in the same basis).
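To make the update concrete, here is a minimal sketch of the resulting coefficient update (the NORMA-style scheme the paper derives), assuming a hinge loss c(x, y, f(x)) = max(0, 1 - y f(x)), an RBF kernel, learning rate eta, and regularizer lam; all names are illustrative, not the paper's code:

```python
import numpy as np

def rbf(x, xi, gamma=1.0):
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def norma_step(alphas, xs, x_t, y_t, eta=0.1, lam=0.01):
    """One stochastic update: shrink all old coefficients by (1 - eta*lam)
    (the gradient of the regularizer), then add a coefficient for the new
    point if the hinge loss is active there."""
    f_xt = sum(a * rbf(x_t, xi) for a, xi in zip(alphas, xs))
    alphas = [(1.0 - eta * lam) * a for a in alphas]
    alphas.append(eta * y_t if y_t * f_xt < 1.0 else 0.0)
    xs.append(x_t)
    return alphas, xs
```

Matching coefficients on both sides of the functional update is exactly the basis-equality argument above: the shrink acts on every existing alpha_i, and the loss term contributes only a new coefficient on k(x_t, ·).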

Related

Using Gradient Descent on Exponentially Modified Gaussian

I am trying to implement a gradient descent method to fit an EMG as shown in this paper. The paper describes two versions of the EMG equation:
(1)
$$F(t)= \frac{h\,\sigma}{\tau}\sqrt{\frac{\pi}{2}}\, e^{\frac{\mu-t}{\tau}+\frac{\sigma^2}{2\tau^2}}\cdot \operatorname{erfc}\!\left(\frac{1}{\sqrt{2}}\left(\frac{\mu-t}{\sigma}+\frac{\sigma}{\tau}\right)\right)$$
and
(2)
$$F(t)= h\, e^{-\frac{(\mu-t)^2}{2\sigma^2}}\cdot \frac{\sigma}{\tau}\sqrt{\frac{\pi}{2}} \cdot \operatorname{erfcx}\!\left(\frac{1}{\sqrt{2}}\left(\frac{\mu-t}{\sigma}+\frac{\sigma}{\tau}\right)\right)$$
The paper defines z as:
$$z = \frac{1}{\sqrt{2}}\left(\frac{\mu-t}{\sigma}+\frac{\sigma}{\tau}\right)$$
If z is negative, equation (1) is used; otherwise equation (2) is used, to prevent the function from blowing up. I'm implementing a batch gradient algorithm following the outline from this site, bearing in mind that my objective function is a bit different. I'm setting my theta as the following:
$$\theta = [\mu, h, \sigma, \tau]$$
So my gradient update formula is the following:
$$\theta_j^{k+1} = \theta_j^{k} - \frac{\alpha}{m}\sum_{i=1}^{m} (F(t_i) - y_i)\frac{\partial F(t_i)}{\partial \theta_j^{k}}$$
I've used Wolfram Alpha to determine all the partial derivatives. My total sample count is >4000, so I'm using batches of around 300 samples to speed up the process. I've found that I have to tune the initial parameters very precisely to get the best results, otherwise the gradient will just blow up.
The paper also discusses finding the time coordinate of the EMG peak, though I'm not sure how this is useful in the gradient descent algorithm. It is found with the following:
$$t_0 = \mu + y\,\sigma\sqrt{2}-\frac{\sigma^2}{\tau}$$
where y satisfies:
$$\operatorname{erfcx}(y) = \frac{\tau}{\sigma}\sqrt{\frac{2}{\pi}}$$
So my main questions are:
Am I setting up the gradient descent incorrectly for a non-linear equation?
How does the determination of the time coordinate help with the algorithm?
Please let me know if there is any additional information I can provide. The algorithm itself is straightforward, and I didn't want to clog up the post with my function definitions for each gradient.
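For reference, here is a minimal sketch of the piecewise evaluation and one batch update, using scipy's erfc/erfcx and a central finite-difference gradient as a stand-in for the analytic partials (parameter names mirror the theta above; step sizes and learning rate are assumptions):

```python
import numpy as np
from scipy.special import erfc, erfcx

def emg(t, mu, h, sigma, tau):
    """Piecewise EMG: equation (1) for z < 0, equation (2) otherwise."""
    z = ((mu - t) / sigma + sigma / tau) / np.sqrt(2.0)
    with np.errstate(over='ignore'):  # the discarded branch may overflow
        f1 = (h * sigma / tau) * np.sqrt(np.pi / 2.0) \
            * np.exp((mu - t) / tau + sigma**2 / (2.0 * tau**2)) * erfc(z)
        f2 = h * np.exp(-(mu - t)**2 / (2.0 * sigma**2)) \
            * (sigma / tau) * np.sqrt(np.pi / 2.0) * erfcx(z)
    return np.where(z < 0.0, f1, f2)

def batch_step(theta, t, y, alpha=1e-4, eps=1e-6):
    """One gradient-descent update on a batch, using central finite
    differences in place of the analytic partial derivatives."""
    residual = emg(t, *theta) - y
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        tp, tm = theta.copy(), theta.copy()
        tp[j] += eps
        tm[j] -= eps
        dF = (emg(t, *tp) - emg(t, *tm)) / (2.0 * eps)
        grad[j] = np.mean(residual * dF)
    return theta - alpha * grad
```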

Pytorch function name demystification: gels for least squares estimation

What does "gels" stand for in Pytorch?
It solves least squares, but what does the name stand for?
It is hard to get comfortable with a function without understanding its name, and it is surprising that these names are not explained in the documentation.
gels is actually a function from LAPACK (Linear Algebra PACKage); the GE prefix stands for GEneral matrix and LS for Least Squares, meaning that it works on general matrices:
General matrix
A general real or complex m by n matrix is represented by a real or complex matrix of size (m, n).
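As a usage sketch: in recent PyTorch releases this functionality lives in torch.linalg.lstsq, which exposes the LAPACK driver name directly (the exact API is version-dependent; older releases used torch.gels):

```python
import torch

# General (non-square) design matrix and right-hand side.
A = torch.tensor([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0]])
b = torch.tensor([[1.0], [2.0], [2.0]])

# Minimizes ||A x - b||_2; driver='gels' selects the LAPACK routine the
# old torch.gels was named after (requires A to have full rank).
x = torch.linalg.lstsq(A, b, driver='gels').solution
print(x)
```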

Sympy Geometric Algebra: switching between both covariant and contravariant forms

This question regards making sympy's geometric algebra module use both covariant and contravariant vector forms to make the output much more compact. So far I am able to use one or the other, but not both together. It may be that I don't know the maths well enough, and the answer is in the documentation after all.
Some background:
I have a system of equations that I want to solve in a complicated non-orthogonal coordinate system. The metric tensor elements of this coordinate system are known, but their expressions are unwieldy, so I'd like to keep them hidden and simply use g_ij, the square root of its determinant J, and the inverse metric g^ij. Also it's useful to describe vectors, V, in either their contravariant or their covariant forms,
V = ∑ V^i e_i = ∑ V_i e^i,
and transform between them where necessary.
Here e^i = ∇u^(i), where u^(i) is the ith coordinate, and e_i = ∂R/∂u^(i).
This notation is the same as that used in this invaluable text, which I cannot recommend highly enough. Specifically, chapter 2 will be useful for this question.
There are many curl and divergence operations in the system of equations I'm trying to solve. The divergence is most simply expressed with the contravariant components of a vector, and the curl with the covariant ones:
∇·V = (1/J) ∑ ∂/∂u^(i) (J V^i),
∇ × V = ∑ (ε^ijk/J) (∂V_j/∂u^(i)) e_k,
where ε^ijk is the Levi-Civita symbol. I would consider this question answered if I could print the above two equations using sympy's geometric algebra module.
How does one configure sympy's geometric algebra module to express calculations in this manner i.e. using covariant and contravariant vector expressions in order to hide away the complicated nature of the coordinate system?
Maybe there is an alternative toolbox that does exactly this?
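Not a full answer, but here is a minimal sketch of the index raising/lowering the question describes, done with plain sympy matrices and abstract metric symbols (this sidesteps the geometric algebra module entirely, so treat it only as a starting point; all names are illustrative):

```python
import sympy as sp

# Abstract symmetric metric g_ij of a non-orthogonal 3D coordinate system;
# the entries stay unevaluated symbols, keeping their unwieldy expressions hidden.
g = sp.Matrix(3, 3, lambda i, j: sp.Symbol(f'g_{min(i, j) + 1}{max(i, j) + 1}'))
g_inv = g.inv()            # contravariant metric g^ij
J = sp.sqrt(g.det())       # Jacobian, the square root of det(g_ij)

# Covariant components V_i of a vector V.
V_cov = sp.Matrix(3, 1, lambda i, _: sp.Symbol(f'V_{i + 1}'))

# Raise the index: V^i = g^ij V_j, converting between the two forms.
V_contra = sp.simplify(g_inv * V_cov)
```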

Quadratic Programming and quasi newton method BFGS

Yesterday, I posted a question about general concept of SVM Primal Form Implementation:
Support Vector Machine Primal Form Implementation
and "lejlot" helped me out to understand that what I am solving is a QP problem.
But I still don't understand how my objective function can be expressed as a QP problem
(http://en.wikipedia.org/wiki/Support_vector_machine#Primal_form)
Also, I don't understand how QP and quasi-Newton methods are related.
All I know is that a quasi-Newton method will solve my QP problem, which is supposedly formulated from
my objective function (I don't see the connection).
Can anyone walk me through this please?
For SVM's, the goal is to find a classifier. This problem can be expressed in terms of a function that you are trying to minimize.
Let's first consider the Newton iteration. Newton iteration is a numerical method to find a solution to a problem of the form F(x) = 0.
Instead of solving it analytically, we can solve it numerically by the following iteration:
x^(k+1) = x^k - DF(x^k)^-1 * F(x^k)
Here x^(k+1) is the (k+1)-th iterate, and DF(x^k)^-1 is the inverse of the Jacobian of F evaluated at the k-th iterate x^k.
This update runs as long as we make progress in terms of step size (Δx) or until the function value approaches 0 to a good degree. The termination criterion can be chosen accordingly.
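As an illustration, here is a minimal sketch of this iteration for a scalar problem (the function, starting point, and tolerance are arbitrary assumptions):

```python
def newton(f, df, x, tol=1e-10, max_iter=50):
    """Newton iteration for f(x) = 0 in one dimension, where the
    'Jacobian inverse' is simply 1/f'(x)."""
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:   # terminate on step size, as described above
            break
    return x

print(newton(lambda x: x**2 - 2.0, lambda x: 2.0 * x, x=1.0))  # ~sqrt(2)
```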
Now consider solving the problem F'(x) = 0. If we formulate the Newton iteration for that, we get
x^(k+1) = x^k - HF(x^k)^-1 * DF(x^k)
where HF(x)^-1 is the inverse of the Hessian matrix and DF(x) the gradient of the function F. Note that we are working in n-dimensional analysis and cannot just take a quotient; we have to take the inverse of the matrix.
Now we are facing some problems: In each step, we have to calculate the Hessian matrix for the updated x, which is very inefficient. We also have to solve a system of linear equations, namely y = HF(x)^-1 * DF(x) or HF(x)*y = DF(x).
So instead of computing the Hessian in every iteration, we start off with an initial guess of the Hessian (maybe the identity matrix) and perform low-rank updates after each iterate (BFGS uses a rank-two update). For the exact formulas, have a look here.
So how does this link to SVM's?
When you look at the function you are trying to minimize, you can formulate a primal problem, which you can then reformulate as a dual Lagrangian problem, which is convex and can be solved numerically. It is all well documented in the article, so I will not reproduce the formulas here at lower quality.
But the idea is the following: If you have a dual problem, you can solve it numerically. There are multiple solvers available. In the link you posted, they recommend coordinate descent, which solves the optimization problem for one coordinate at a time. Or you can use subgradient descent. Another method is to use L-BFGS. It is really well explained in this paper.
Another popular algorithm for solving problems like that is ADMM (alternating direction method of multipliers). In order to use ADMM, you would have to reformulate the given problem into an equivalent problem that gives the same solution but has the correct format for ADMM. For that I suggest reading Boyd's notes on ADMM.
In general: First, understand the function you are trying to minimize and then choose the numerical method that is most suited. In this case, subgradient descent and coordinate descent are most suited, as stated in the Wikipedia link.
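To tie the pieces together, here is a minimal sketch of minimizing a smoothed SVM primal (squared hinge loss, so the objective is differentiable) with a quasi-Newton method; the toy data and regularization constant C are assumptions, and scipy's BFGS maintains the inverse-Hessian approximation via the low-rank updates described above:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0.0, 1.0, -1.0)  # toy labels
C = 1.0

def primal(w):
    """Smoothed SVM primal: L2 regularizer plus squared hinge loss."""
    margins = np.maximum(1.0 - y * (X @ w), 0.0)
    return 0.5 * w @ w + C * np.mean(margins ** 2)

res = minimize(primal, x0=np.zeros(2), method='BFGS')
print(res.x, res.fun)
```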

How can I know whether a radiosity linear system can be solved using an iterative method?

That is, I want to check whether the iterative solution of the linear system derived from a radiosity problem converges.
I also want to know: is there any book/paper giving a proof of convergence for the radiosity problem?
Thanks.
I assume you're solving (I - rho*F) B = E for the radiosities B (based on the Wikipedia article).
Gauss-Seidel and Jacobi iteration methods are both guaranteed to converge if the matrix is diagonally dominant (Gauss-Seidel is also guaranteed to converge if the matrix is symmetric and positive definite).
The rows of the F matrix (view factors) sum to 1, so if rho (reflectivity) is < 1, which physically it should be, the matrix I - rho*F will be strictly diagonally dominant.
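As an illustration, here is a minimal Jacobi-iteration sketch on a made-up 3-patch system (F, rho, and E are toy values, not from any real scene); it converges because I - rho*F is strictly diagonally dominant:

```python
import numpy as np

F = np.array([[0.0, 0.6, 0.4],
              [0.3, 0.0, 0.7],
              [0.5, 0.5, 0.0]])   # view factors, rows sum to 1
rho = np.array([0.5, 0.7, 0.3])  # reflectivities, all < 1
E = np.array([1.0, 0.0, 0.2])    # emission

A = np.eye(3) - rho[:, None] * F  # strictly diagonally dominant
B = np.zeros(3)
for _ in range(200):
    # Jacobi: solve each equation for its diagonal unknown, using the
    # previous iterate for the off-diagonal terms.
    B_new = (E - (A - np.diag(np.diag(A))) @ B) / np.diag(A)
    if np.max(np.abs(B_new - B)) < 1e-12:
        break
    B = B_new
print(B)
```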
