I am trying to implement a gradient descent method to fit an exponentially modified Gaussian (EMG) as described in this paper. The paper gives two versions of the EMG equation:
(1)
$$F(t)= \frac{h\cdot\sigma}{\tau}\sqrt{\frac{\pi}{2}}\,\exp\!\left(\frac{\mu-t}{\tau}+\frac{\sigma^2}{2\tau^2}\right)\cdot \operatorname{erfc}\!\left(\frac{1}{\sqrt{2}}\left(\frac{\mu-t}{\sigma}+\frac{\sigma}{\tau}\right)\right)$$
and
(2)
$$F(t)= h\cdot e^{\frac{-(\mu-t)^2}{2\sigma^2}}\cdot \frac{\sigma}{\tau}\sqrt{\frac{\pi}{2}} \cdot \operatorname{erfcx}\!\left(\frac{1}{\sqrt{2}}\left(\frac{\mu-t}{\sigma}+\frac{\sigma}{\tau}\right)\right)$$
The paper defines z as:
$$z = \frac{1}{\sqrt{2}}\left(\frac{\mu-t}{\sigma}+\frac{\sigma}{\tau}\right)$$
If z is negative, equation (1) is used; otherwise equation (2) is used, which prevents the function from blowing up. I'm implementing a batch gradient descent algorithm following the outline from this site, keeping in mind that my objective function is a bit different. I'm setting my theta to the following:
$$\theta = [\mu, h, \sigma, \tau]$$
So my gradient update formula is the following:
$$\theta_j^{k+1} = \theta_j^{k} - \frac{\alpha}{m}\sum_{i=1}^{m} \left(F(t_i) - y_i\right)\frac{\partial F(t_i)}{\partial \theta_j^{k}}$$
I've used Wolfram Alpha to determine all of the partial derivatives. My total sample count is >4000, so I'm using a batch of around 300 samples to speed up the process. I've found that I have to tune the initial parameters very precisely to get good results; otherwise the gradient just blows up.
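For reference, here is a minimal sketch of how I evaluate F(t) with the z-based switch and apply one batch update (using scipy.special.erfc/erfcx; the per-parameter derivative functions are only placeholders, since I didn't want to post all of them):

```python
import numpy as np
from scipy.special import erfc, erfcx

def emg(t, mu, h, sigma, tau):
    """Evaluate the EMG, switching between eq. (1) and eq. (2) based on the sign of z."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    z = ((mu - t) / sigma + sigma / tau) / np.sqrt(2.0)
    out = np.empty_like(t)
    neg = z < 0
    # Eq. (1): safe for z < 0, where the exponential stays moderate.
    out[neg] = (h * sigma / tau) * np.sqrt(np.pi / 2.0) \
        * np.exp((mu - t[neg]) / tau + sigma**2 / (2.0 * tau**2)) * erfc(z[neg])
    # Eq. (2): uses the scaled complementary error function erfcx to avoid overflow for z >= 0.
    out[~neg] = h * np.exp(-(mu - t[~neg])**2 / (2.0 * sigma**2)) \
        * (sigma / tau) * np.sqrt(np.pi / 2.0) * erfcx(z[~neg])
    return out

def batch_step(theta, t_batch, y_batch, grads, alpha=1e-4):
    """One batch gradient-descent update. `grads` stands in for my four partial-derivative
    functions dF/dmu, dF/dh, dF/dsigma, dF/dtau (not shown here)."""
    mu, h, sigma, tau = theta
    residual = emg(t_batch, mu, h, sigma, tau) - y_batch
    m = len(t_batch)
    return np.array([
        p - (alpha / m) * np.sum(residual * g(t_batch, mu, h, sigma, tau))
        for p, g in zip(theta, grads)
    ])
```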
The paper also discusses finding the time coordinate of the EMG peak, although I'm not sure how it is useful in the gradient descent algorithm. It is found with the following:
$$t_0 = \mu + y \cdot \sigma \cdot \sqrt{2}-\frac{\sigma^2}{\tau}$$
where y is:
$$\operatorname{erfcx}(y) = \frac{\tau}{\sigma}\sqrt{\frac{2}{\pi}}$$
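For completeness, a minimal sketch of how that solve could be done numerically (assuming scipy is available; the bracket-widening for the root find is just one way to make it robust):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import erfcx

def emg_peak_time(mu, sigma, tau):
    """Solve erfcx(y) = (tau/sigma)*sqrt(2/pi) for y, then plug y into the t_0 formula."""
    target = (tau / sigma) * np.sqrt(2.0 / np.pi)
    lo, hi = -1.0, 1.0
    # erfcx is strictly decreasing, so widen the bracket until it straddles the target value.
    while erfcx(lo) < target:
        lo *= 2.0
    while erfcx(hi) > target:
        hi *= 2.0
    y = brentq(lambda v: erfcx(v) - target, lo, hi)
    return mu + y * sigma * np.sqrt(2.0) - sigma**2 / tau

# Hypothetical parameter values, just to show the call.
print(emg_peak_time(mu=5.0, sigma=1.0, tau=2.0))
```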
So my main questions are:
Am I setting up the gradient descent incorrectly for a non-linear equation?
How does the determination of the time coordinate help with the algorithm?
Please let me know if there is any additional information I can provide. The algorithm itself is easy, and I didn't want to clog up the post with my function definitions for each gradient.
Related
I have implemented multivariate linear regression, where the parameters theta0 (intercept), theta1, and theta2 are optimized by minimizing the MSE loss, with the step size chosen by line search in gradient descent. How do I visually illustrate the mathematical property that the directions of steepest descent (negative gradients) of successive steps are orthogonal? I'm trying to generate a contour map similar to this image, but with respect to 2 parameters instead of 1 (if that's not possible, 2 separate plots would also be great).
Also, I originally wanted to perform multivariate linear regression with 4 features, but ultimately decided to use only the 2 most strongly correlated ones (after comparing their PCC) so that I could plot a graph. I'm not aware of any way to plot 4-dimensional data; does anyone know if this is possible and how?
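A minimal, self-contained sketch of the setup described above (synthetic data, two slope parameters and no intercept so the loss surface is two-dimensional; the contour-plus-path plot at the end is roughly the kind of figure in question):

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy data: y ~ theta1*x1 + theta2*x2 (no intercept, so the MSE surface is 2-D).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=100)

def mse(theta):
    r = X @ theta - y
    return (r @ r) / len(y)

def grad(theta):
    return 2.0 * X.T @ (X @ theta - y) / len(y)

# Gradient descent with exact line search (alpha minimizes the MSE along -grad),
# which is what makes consecutive step directions orthogonal.
theta = np.array([-4.0, 4.0])
path = [theta.copy()]
for _ in range(10):
    g = grad(theta)
    d = -g
    alpha = (g @ g) / (2.0 * (d @ (X.T @ (X @ d))) / len(y))  # closed form for a quadratic loss
    theta = theta + alpha * d
    path.append(theta.copy())
path = np.array(path)

# Contour map of the loss over the two parameters, with the zig-zag descent path on top.
t1, t2 = np.meshgrid(np.linspace(-5, 5, 200), np.linspace(-5, 5, 200))
Z = np.array([[mse(np.array([a, b])) for a, b in zip(r1, r2)] for r1, r2 in zip(t1, t2)])
plt.contour(t1, t2, Z, levels=30)
plt.plot(path[:, 0], path[:, 1], "o-")   # successive segments meet at right angles
plt.xlabel("theta1")
plt.ylabel("theta2")
plt.show()
```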
I was working with one dataset and found the curve to be sigmoidal. I have fitted the curve and got the equation A2+((A1-A2)/1+exp((x-x0)/dx)), where:
x0 : Mid point of the curve
dx : slope of the curve
I need to find the slope and midpoint in order to give a generalized equation. Any suggestions?
You should be able to simplify the modeling of the sigmoid with a function of the following form:
The source includes code in R showing how to fit your data to the sigmoid curve, which you can adapt to whatever language you're writing in. The source also notes the following form:
You can adapt the linked R code to solve for that form as well. The nice thing about these general functional forms is that you can derive the slope from them analytically. Also, note that the midpoint of the sigmoid is simply the inflection point, i.e. where the second derivative with respect to x is 0 (it changes sign from negative to positive or vice versa).
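If you are not working in R, the same four-parameter logistic can be fit directly; a minimal sketch using scipy.optimize.curve_fit (the data below is made up, substitute your own x and y arrays):

```python
import numpy as np
from scipy.optimize import curve_fit

def fpl(x, A1, A2, x0, dx):
    """Four-parameter logistic: asymptotes A1/A2, midpoint x0, scale dx."""
    return A2 + (A1 - A2) / (1.0 + np.exp((x - x0) / dx))

# Hypothetical data standing in for your measurements.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = fpl(x, 1.0, 9.0, 5.0, 0.8) + rng.normal(0, 0.1, x.size)

p0 = [y.min(), y.max(), np.median(x), 1.0]     # rough starting values
(A1, A2, x0, dx), _ = curve_fit(fpl, x, y, p0=p0)
print("midpoint x0 =", x0, "scale dx =", dx)
# The slope at the midpoint follows from the derivative of fpl: (A2 - A1) / (4 * dx).
```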
Assuming your equation is a misprint of
A2+(A1-A2)/(1+exp((x-x0)/dx))
then your graph does not reflect zero residual, since in your graph the upper shoulder is sharper than the lower shoulder.
Likely the problem is your starting values. Try using the native R function SSfpl, as in
nls(y ~ SSfpl(x,A2,A1,x0,dx))
Yesterday, I posted a question about general concept of SVM Primal Form Implementation:
Support Vector Machine Primal Form Implementation
and "lejlot" helped me out to understand that what I am solving is a QP problem.
But I still don't understand how my objective function can be expressed as a QP problem.
(http://en.wikipedia.org/wiki/Support_vector_machine#Primal_form)
Also, I don't understand how QP and the Quasi-Newton method are related.
All I know is that a Quasi-Newton method will SOLVE my QP problem, which is supposedly formulated from
my objective function (and I don't see the connection).
Can anyone walk me through this please??
For SVM's, the goal is to find a classifier. This problem can be expressed in terms of a function that you are trying to minimize.
Let's first consider the Newton iteration. Newton iteration is a numerical method to find a solution to a problem of the form F(x) = 0.
Instead of solving it analytically we can solve it numerically by the following iteration:
x^k+1 = x^k - DF(x^k)^-1 * F(x^k)
Here x^k+1 is the (k+1)th iterate, DF(x^k)^-1 is the inverse of the Jacobian of F evaluated at x^k, and x^k is the kth iterate.
This update runs as long as we keep making progress in terms of the step size (delta x), or until the function value is close enough to 0; the termination criterion can be chosen accordingly.
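For illustration, a minimal sketch of that iteration (my own toy helper, with the explicit inverse replaced by a linear solve):

```python
import numpy as np

def newton(F, DF, x0, tol=1e-10, max_iter=50):
    """Newton iteration for F(x) = 0; DF(x) returns the Jacobian of F at x."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    for _ in range(max_iter):
        step = np.linalg.solve(np.atleast_2d(DF(x)), F(x))   # solve DF(x) * step = F(x)
        x = x - step
        if np.linalg.norm(step) < tol:                       # terminate on a tiny step size
            break
    return x

# Example: root of x^2 - 2, which converges to sqrt(2).
print(newton(lambda x: x**2 - 2, lambda x: np.array([[2.0 * x[0]]]), x0=1.0))
```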
Now consider solving the problem F'(x) = 0, i.e. finding a stationary point of F. If we formulate the Newton iteration for that, we get
x^k+1 = x^k - HF(x^k)^-1 * DF(x^k)
Where HF(x)^-1 is the inverse of the Hessian matrix of F and DF(x) is the gradient of F. Note that we are working in n dimensions, so we cannot just divide; we have to take the inverse of the matrix.
Now we are facing some problems: In each step, we have to calculate the Hessian matrix for the updated x, which is very inefficient. We also have to solve a system of linear equations, namely y = HF(x)^-1 * DF(x) or HF(x)*y = DF(x).
So instead of computing the Hessian in every iteration, we start off with an initial guess of the Hessian (maybe the identity matrix) and perform rank one updates after each iterate. For the exact formulas have a look here.
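Since the text above mentions rank-one updates, here is a minimal sketch of the idea using the symmetric rank-one (SR1) update (BFGS/L-BFGS, which use a rank-two update, are what you would typically reach for in practice; this toy version is only meant to show the structure):

```python
import numpy as np

def sr1_minimize(grad, x0, iters=100):
    """Quasi-Newton sketch: keep an approximation B of the Hessian and
    refresh it with the symmetric rank-one (SR1) formula after every step."""
    x = np.asarray(x0, dtype=float)
    B = np.eye(len(x))                      # initial Hessian guess: the identity
    g = grad(x)
    for _ in range(iters):
        step = np.linalg.solve(B, g)        # solve B * step = grad instead of inverting B
        x_new = x - step
        g_new = grad(x_new)
        s, yv = x_new - x, g_new - g
        r = yv - B @ s
        denom = r @ s
        if abs(denom) > 1e-12:              # skip the update when it would be unstable
            B = B + np.outer(r, r) / denom
        x, g = x_new, g_new
        if np.linalg.norm(g) < 1e-10:
            break
    return x

# Example: minimize 0.5*x'Ax - b'x, whose gradient is Ax - b (solution [0.2, 0.4]).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(sr1_minimize(lambda x: A @ x - b, x0=np.zeros(2)))
```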
So how does this link to SVM's?
When you look at the function you are trying to minimize, you can formulate a primal problem, which you can then reformulate as a dual Lagrangian problem; this is convex and can be solved numerically. It is all well documented in the article, so I will not try to reproduce the formulas here in poorer quality.
But the idea is the following: If you have a dual problem, you can solve it numerically. There are multiple solvers available. In the link you posted, they recommend coordinate descent, which solves the optimization problem for one coordinate at a time. Or you can use subgradient descent. Another method is to use L-BFGS. It is really well explained in this paper.
Another popular algorithm for solving problems like this is ADMM (alternating direction method of multipliers). In order to use ADMM you would have to reformulate the given problem into an equivalent problem that gives the same solution but has the correct format for ADMM. For that I suggest reading Boyd's notes on ADMM.
In general: First, understand the function you are trying to minimize and then choose the numerical method that is most suited. In this case, subgradient descent and coordinate descent are most suited, as stated in the Wikipedia link.
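As an illustration of the subgradient route, here is a minimal sketch of subgradient descent on a linear SVM primal (hinge loss plus L2 regularization); the data and hyperparameters are made up:

```python
import numpy as np

def svm_subgradient(X, y, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on the primal objective:
       lam/2 * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i * (w.x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                      # points that violate the margin
        # Subgradient of the hinge term (zero for points with margin >= 1).
        gw = lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        gb = -y[active].sum() / n
        w, b = w - lr * gw, b - lr * gb
    return w, b

# Tiny linearly separable example; labels must be +/-1.
X = np.array([[1.0, 1.0], [2.0, 0.5], [3.0, 3.5], [4.0, 4.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
print(svm_subgradient(X, y))
```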
I am writing a ray tracer and I wish to fire rays from a point p into a hemisphere above that point according to some distribution.
1) I have derived a method to uniformly sample within a solid angle (defined by theta) above p:
phi = 2*pi*X_1
alpha = arccos (1-(1-cos(theta))*X_2)
x = sin(alpha)*cos(phi)
y = sin(alpha)*sin(phi)
z = -cos(alpha)
Where X_1 and X_2 are uniform random numbers in [0, 1).
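The same sampling as a minimal Python sketch (theta being the cone half-angle):

```python
import math
import random

def sample_cone(theta):
    """Uniformly sample a direction within the solid angle of half-angle theta (around -z)."""
    x1, x2 = random.random(), random.random()
    phi = 2.0 * math.pi * x1
    alpha = math.acos(1.0 - (1.0 - math.cos(theta)) * x2)
    return (math.sin(alpha) * math.cos(phi),
            math.sin(alpha) * math.sin(phi),
            -math.cos(alpha))
```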
That works and I'm pretty happy with that. But my question is: what happens if I do not want a uniform distribution?
I have used the algorithm on page 27 from here and I can draw samples from a piecewise arbitrary distribution. However if I simply say:
alpha = arccos(1 - (1 - cos(theta)) * B_1)
Where B is a random number generated from an arbitrary distribution.
It doesn't behave nicely... What am I doing wrong? Thanks in advance; I really need help on this.
Additional:
Perhaps I am asking a leading question. Taking a step back:
Is there a way to generate points on a hemisphere according to an arbitrary distribution? I have a method for uniformly sampling a hemisphere and one for cosine-weighted hemisphere sampling (pp. 663-669, pbrt.org).
With a uniform distribution, you can just average the sample results and obtain the correct result. This is equivalent to dividing each sample result by the sample probability density function (PDF); for a uniform distribution the PDF is the same constant for every sample, so it reduces to averaging the results.
With an arbitrary distribution, you still have to divide each sample result by the sample PDF, but now the PDF depends on the arbitrary distribution you are using. I assume your error is here.
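To make that concrete, a minimal sketch of what "divide by the PDF" looks like, using cosine-weighted hemisphere sampling as the non-uniform distribution (my own example integrand, with +z as "up" here):

```python
import numpy as np

rng = np.random.default_rng(1)

def integrand(d):
    # Example integrand over the hemisphere: cos^2 of the polar angle.
    return d[2] ** 2

def sample_cosine_weighted():
    """Draw a cosine-weighted hemisphere direction and return it with its PDF, cos(alpha)/pi."""
    u1, u2 = rng.random(), rng.random()
    phi = 2.0 * np.pi * u1
    alpha = np.arccos(np.sqrt(1.0 - u2))        # so that cos(alpha) = sqrt(1 - u2)
    d = np.array([np.sin(alpha) * np.cos(phi),
                  np.sin(alpha) * np.sin(phi),
                  np.cos(alpha)])
    return d, np.cos(alpha) / np.pi

# Each sample contribution is divided by its PDF before averaging; a plain average would be biased.
n = 50_000
total = 0.0
for _ in range(n):
    d, pdf = sample_cosine_weighted()
    total += integrand(d) / pdf
print(total / n)    # approaches the analytic value 2*pi/3 ~ 2.094
```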
I apologise in advance for how basic this question is, but I am stuck. I am trying to solve this question:
I can do parts i)-iv) but I am stuck on v). I know that to calculate the margin y, you do
y=2/||W||
and I know that W is the normal to the hyperplane, I just don't know how to calculate it. Is it always
W = [1; 1]?
Similarly, for the bias, W^T * x + b = 0:
how do I find the value of x from the data points? Thank you for your help.
Consider building an SVM over the (very little) data set shown in the picture. For an example like this, the maximum margin weight vector will be parallel to the shortest line connecting points of the two classes, that is, the line between (1, 1) and (2, 3), giving a weight vector of (1, 2). The optimal decision surface is orthogonal to that line and intersects it at the halfway point. Therefore, it passes through (1.5, 2). So, the SVM decision boundary is:
x_1 + 2 x_2 - 5.5 = 0
Working algebraically, with the standard constraint that y_i(w^T x_i + b) >= 1, we seek to minimize ||w||. This happens when this constraint is satisfied with equality by the two support vectors. Further we know that the solution is w = (a, 2a) for some a. So we have that:
a + 2a + b = -1
2a + 6a + b = +1
Therefore a = 2/5 and b = -11/5, so w = (2/5, 4/5). The optimal hyperplane is given by
w = (2/5, 4/5) and b = -11/5.
The margin is 2/||w|| = 2/sqrt(4/25 + 16/25) = sqrt(5).
This answer can be confirmed geometrically by examining the picture.
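A quick numerical check of those numbers (assuming the two support vectors read off the picture are (1, 1) with label -1 and (2, 3) with label +1):

```python
import numpy as np

# Assumed support vectors from the picture: (1, 1) labelled -1 and (2, 3) labelled +1.
sv = np.array([[1.0, 1.0], [2.0, 3.0]])
labels = np.array([-1.0, 1.0])

w = np.array([2.0 / 5.0, 4.0 / 5.0])    # a = 2/5, so w = (a, 2a)
b = -11.0 / 5.0

# Both support vectors should satisfy the margin constraint with equality: y_i * (w.x_i + b) = 1.
print(labels * (sv @ w + b))             # -> [1. 1.]

# Margin = 2 / ||w||.
print(2.0 / np.linalg.norm(w))           # -> sqrt(5) ~ 2.236
```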