lsqcurvefit when expecting small coefficients - rounding

I've generated a plot of the attenuation seen in an electrical trace up to an angular frequency of 14e10 rad/s. The ydata ranges from approximately 1-10 Np/m. I'm trying to generate a fit of the form
y = A*sqrt(x) + B*x + C*x^2.
I expect A to be around 10^-6, B to be around 10^-11, and C to be around 10^-23. However, the smallest coefficient lsqcurvefit will return is 10^-7. It will also only return a nonzero coefficient for A, while returning 0 for B and C. The fit actually looks really good, but the physics indicates that B and C should not be 0.
Here is how I'm calling the function
% measurement estimate
x_alpha = [1e-6 1e-11 1e-23];
lb = [1e-7, 1e-13, 1e-25];
ub = [1e-3, 1e-6, 1e-15];
x_alpha = lsqcurvefit(@modelfun, x_alpha, omega, alpha_t, lb, ub)
Here is the model function
function [ yhat ] = modelfun( x, xdata )
yhat = x(1)*xdata.^.5 + x(2)*xdata + x(3)*xdata.^2;
end
Is it possible to get lsqcurvefit to return such small coefficients? Is the error due to rounding, or is it something else? Are there any ways I can change the tolerances to see a fit closer to what I expect?
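One common remedy (a sketch, not from the original post; the scale vector and tolerance values below are illustrative assumptions) is to rescale the unknowns so that the solver works with numbers of order one, then undo the scaling afterwards; tightening the tolerances via optimoptions can also help:

% Sketch: rescale the coefficients to order one before fitting (assumed scales)
scale = [1e-6 1e-11 1e-23];        % expected orders of magnitude of A, B, C
modelfun_s = @(p, xdata) (p(1)*scale(1))*xdata.^0.5 ...
                       + (p(2)*scale(2))*xdata ...
                       + (p(3)*scale(3))*xdata.^2;
p0   = [1 1 1];                    % scaled initial guess
lb_s = lb ./ scale;                % original bounds, rescaled
ub_s = ub ./ scale;
opts = optimoptions('lsqcurvefit', ...
    'FunctionTolerance', 1e-12, 'StepTolerance', 1e-12);
p = lsqcurvefit(modelfun_s, p0, omega, alpha_t, lb_s, ub_s, opts);
x_alpha = p .* scale;              % convert back to physical units

With all scaled parameters near 1, the solver's default tolerances no longer treat the small coefficients as negligible.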

Found a Stack Overflow page that seems to address this issue: "fit using lsqcurvefit".

Related

Why does my fit of a logarithmic function look so wrong?

I'm plotting this dataset and making a logarithmic fit, but for some reason the fit comes out badly wrong. At some point I got a good enough fit, but then I re-plotted and the bad fit was back. At the very beginning of the data there was a line 0.0 0.0076, but I changed it to 0.001 0.0076 to avoid the asymptote.
I'm using this for the fit (not exactly this one for the image above, but I'm now testing with this one and it produces that bad fit as well):
f(x) = a*log(k*x + b)
fit f(x) 'R_B/R_B.txt' via a, k, b
And the output is this
Also, sometimes it says 7 iterations were done, as in the screenshot above; other times only 1; and when it produced the "correct" fit it did something like 35 iterations and got a = 32, if I remember correctly.
Edit: here is the good one again; the plot I got is this one. And again, I re-plotted and got that weird fit. It's curious that when the 0.0 0.0076 line is present and the good fit is about to appear, gnuplot says "Undefined value during function evaluation", but that message is not shown when I get the bad fit.
Do you know why I keep getting this inconsistency? Thanks for your help.
As I already mentioned in the comments, the method of fitting antiderivatives is much better than fitting derivatives, because the numerical computation of derivatives is strongly scattered when the data is even slightly scattered.
The principle of the method of fitting an integral equation (obtained from the original equation to be fitted) is explained in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales. The application to the case of y = a*ln(c*x+b) is shown below.
Numerical calculus: [figure with the numerical results of the integral-equation fit]
In order to get an even better result (according to some specified criterion of fitting), one can use the above values of the parameters as initial values for an iterative method of nonlinear regression implemented in some convenient software.
NOTE: The integral equation used in the present case is given in the figure above.
NOTE: In the figure above one can compare the result of the method of fitting an integral equation with the result of the method of fitting with derivatives.
Acknowledgements: Alex Sveshnikov did very good work in applying the method of regression with derivatives. This allows an interesting and enlightening comparison. If the goal is only to compute approximate values of the parameters to be used in nonlinear regression software, both methods are quite equivalent. Nevertheless, the method with the integral equation appears preferable in case of scattered data.
UPDATE (After Alex Sveshnikov updated his answer)
The figure below was drawn using Alex Sveshnikov's result with a further iterative method of fitting.
The two curves are almost indistinguishable. This shows that (in the present case) the method of fitting the integral equation is almost sufficient without further treatment.
Of course it is not always so satisfactory; here the close agreement is due to the low scatter of the data.
In ADDITION, an answer to a question raised in comments by CosmeticMichu:
The problem here is that the fit algorithm starts with "wrong" approximations for the parameters a, k, and b, so during the minimization it finds a local minimum, not the global one. You can improve the result if you provide the algorithm with starting values that are close to the optimal ones. For example, let's start with the following parameters:
gnuplot> a=47.5087
gnuplot> k=0.226
gnuplot> b=1.0016
gnuplot> f(x)=a*log(k*x+b)
gnuplot> fit f(x) 'R_B.txt' via a,k,b
....
....
....
After 40 iterations the fit converged.
final sum of squares of residuals : 16.2185
rel. change during last iteration : -7.6943e-06
degrees of freedom (FIT_NDF) : 18
rms of residuals (FIT_STDFIT) = sqrt(WSSR/ndf) : 0.949225
variance of residuals (reduced chisquare) = WSSR/ndf : 0.901027
Final set of parameters            Asymptotic Standard Error
=======================            ==========================
a               = 35.0415          +/- 2.302        (6.57%)
k               = 0.372381         +/- 0.0461       (12.38%)
b               = 1.07012          +/- 0.02016      (1.884%)

correlation matrix of the fit parameters:
                a      k      b
a               1.000
k              -0.994  1.000
b               0.467 -0.531  1.000
The resulting plot is
Now the question is how you can find "good" initial approximations for your parameters. Well, you start with
y = a*log(k*x + b)
If you differentiate this equation you get
dy/dx = a*k/(k*x + b)
or
1/(a*k) = 1/((dy/dx)*(k*x + b))
The left-hand side of this equation is some constant 'C', so the expression on the right-hand side should be equal to this constant as well:
1/(dy/dx) = C*(k*x + b) = (C*k)*x + (C*b)
In other words, the reciprocal of the derivative of your data should be approximated by a linear function. So, from your data x[i], y[i] you can construct the reciprocal derivatives x[i], (x[i+1]-x[i])/(y[i+1]-y[i]) and fit a linear function to these data:
The fit gives the following values:
C*k = 0.0236179
C*b = 0.106268
Now we need to find the values of a and C. Let's say that we want the resulting graph to pass close to the starting and ending points of our dataset. That means that we want
a*log(k*x1 + b) = y1
a*log(k*xn + b) = yn
Thus,
a*log((C*k*x1 + C*b)/C) = a*log(C*k*x1 + C*b) - a*log(C) = y1
a*log((C*k*xn + C*b)/C) = a*log(C*k*xn + C*b) - a*log(C) = yn
By subtracting the equations we get the value for a:
a = (yn-y1)/log((C*k*xn + C*b)/(C*k*x1 + C*b)) = 47.51
Then,
log(k*x1+b) = y1/a
k*x1+b = exp(y1/a)
C*k*x1+C*b = C*exp(y1/a)
From this we can calculate C:
C = (C*k*x1+C*b)/exp(y1/a)
and finally find k = (C*k)/C and b = (C*b)/C:
k=0.226
b=1.0016
These are the values used above for finding the better fit.
UPDATE
You can automate the process described above with the following script:
# Name of the file with the data
data='R_B.txt'
# The coordinates of the last data point
xn=NaN
yn=NaN
# The temporary coordinates of a data point used to calculate a derivative
x0=NaN
y0=NaN
linearFit(x)=Ck*x+Cb
fit linearFit(x) data using (xn=$1,dx=$1-x0,x0=$1,$1):(yn=$2,dy=$2-y0,y0=$2,dx/dy) via Ck, Cb
# The coordinates of the first data point
x1=NaN
y1=NaN
plot data using (x1=$1):(y1=$2) every ::0::0
a=(yn-y1)/log((Ck*xn+Cb)/(Ck*x1+Cb))
C=(Ck*x1+Cb)/exp(y1/a)
k=Ck/C
b=Cb/C
f(x)=a*log(k*x+b)
fit f(x) data via a,k,b
plot data, f(x)
pause -1

Solving vector second order differential equation while indexing into an array

I'm attempting to solve the differential equation:
m(t) = M(x)x'' + C(x, x') + B x'
where x and x' are vectors with 2 entries representing the angles and angular velocities in a dynamical system. M(x) is a 2x2 matrix that is a function of the components of x, C is a 2x1 vector that is a function of x and x', and B is a 2x2 matrix of constants. m(t) is a 2x1001 array containing the torques applied to each of the two joints at the 1001 time steps, and I would like to calculate the evolution of the angles over those 1001 time steps.
I've transformed it to standard form such that:
x'' = M(x)^-1 (m(t) - C(x, x') - B x')
Then substituting y_1 = x and y_2 = x' gives the first-order system of equations:
y_1' = y_2
y_2' = M(y_1)^-1 (m(t) - C(y_1, y_2) - B y_2)
(I've used theta and phi in my code for x and y)
def joint_angles(theta_array, t, torques, B):
    phi_1 = np.array([theta_array[0], theta_array[1]])
    phi_2 = np.array([theta_array[2], theta_array[3]])

    def M_func(phi):
        M = np.array([[a_1 + 2.*a_2*np.cos(phi[1]), a_3 + a_2*np.cos(phi[1])],
                      [a_3 + a_2*np.cos(phi[1]), a_3]])
        return np.linalg.inv(M)

    def C_func(phi, phi_dot):
        return a_2 * np.sin(phi[1]) * np.array([-phi_dot[1] * (2.*phi_dot[0] + phi_dot[1]), phi_dot[0]**2])

    dphi_2dt = M_func(phi_1) @ (torques[:, t] - C_func(phi_1, phi_2) - B @ phi_2)
    return dphi_2dt, phi_2

t = np.linspace(0, 1, 1001)
initial = theta_init[0], theta_init[1], dtheta_init[0], dtheta_init[1]
x = odeint(joint_angles, initial, t, args=(torque_array, B))
I get the error that I cannot index into torques using the t array, which makes perfect sense; however, I am not sure how to have it use the current value of the torques at each time step.
I also tried putting the odeint command in a for loop, evaluating it one time step at a time and using the solution as the initial conditions for the next loop, but the function simply returned the initial conditions, meaning every loop was identical. This leads me to suspect I've made a mistake in my implementation of the standard form, but I can't work out what it is. It would be preferable, however, not to have to call the odeint solver in a for loop, and rather do it all in one call.
If helpful, my initial conditions and constant values are:
theta_init = np.array([10*np.pi/180, 143.54*np.pi/180])
dtheta_init = np.array([0, 0])
L_1 = 0.3
L_2 = 0.33
I_1 = 0.025
I_2 = 0.045
M_1 = 1.4
M_2 = 1.0
D_2 = 0.16
a_1 = I_1+I_2+M_2*(L_1**2)
a_2 = M_2*L_1*D_2
a_3 = I_2
Thanks for helping!
The solver uses internal step sizes that are adapted to the problem. The given time list is just a list of points at which the internal solution is interpolated to produce output samples. The internal and external time lists are in no way related; the internal list depends only on the given tolerances.
There is no natural relation between array indices and sample times.
Translating a given time into an index and constructing a sample value from the surrounding table entries is called interpolation (by a piecewise polynomial function).
Torque as a physical phenomenon is at least continuous, so a piecewise linear interpolant is the easiest way to turn the given table of function values into an actual continuous function. Of course, one also needs the time array.
So use numpy.interp or the more advanced routines of scipy.interpolate to define a torque function that can be evaluated at the arbitrary times demanded by the solver and its integration method.
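Here is a minimal sketch of that approach (the zero torque table and the damping matrix B are placeholders; note also that with the state packed as (angles, angular velocities), the derivative must be returned in the order (x', x''), unlike in the code above):

import numpy as np
from scipy.interpolate import interp1d
from scipy.integrate import odeint

# Constants from the question
L_1, I_1, I_2, M_2, D_2 = 0.3, 0.025, 0.045, 1.0, 0.16
a_1 = I_1 + I_2 + M_2*L_1**2
a_2 = M_2*L_1*D_2
a_3 = I_2
B = 0.05*np.eye(2)                          # placeholder damping matrix

t_grid = np.linspace(0, 1, 1001)            # times at which the torques are tabulated
torque_array = np.zeros((2, t_grid.size))   # placeholder 2x1001 torque table
torque = interp1d(t_grid, torque_array, axis=1, fill_value="extrapolate")

def joint_angles(y, t, torque, B):
    phi, dphi = y[:2], y[2:]                # y = (angles, angular velocities)
    M = np.array([[a_1 + 2.*a_2*np.cos(phi[1]), a_3 + a_2*np.cos(phi[1])],
                  [a_3 + a_2*np.cos(phi[1]), a_3]])
    C = a_2*np.sin(phi[1])*np.array([-dphi[1]*(2.*dphi[0] + dphi[1]), dphi[0]**2])
    ddphi = np.linalg.solve(M, torque(t) - C - B @ dphi)  # x'' from the standard form
    return np.concatenate([dphi, ddphi])    # y' = (x', x'')

y0 = [10*np.pi/180, 143.54*np.pi/180, 0.0, 0.0]
sol = odeint(joint_angles, y0, t_grid, args=(torque, B))

Because torque is now a continuous function of time, the solver can evaluate it at whatever internal time points it chooses.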

How to calculate the standard deviation from a histogram? (Python, Matplotlib)

Let's say I have a data set and used matplotlib to draw a histogram of said data set.
n, bins, patches = plt.hist(data, normed=1)
How do I calculate the standard deviation, using the n and bins values that hist() returns? I'm currently doing this to calculate the mean:
s = 0
for i in range(len(n)):
    s += n[i] * ((bins[i] + bins[i+1]) / 2)
mean = s / numpy.sum(n)
which seems to work fine as I get pretty accurate results. However, if I try to calculate the standard deviation like this:
t = 0
for i in range(len(n)):
    t += (bins[i] - mean)**2
std = np.sqrt(t / numpy.sum(n))
my results are way off from what numpy.std(data) returns. Replacing the left bin limits with the central point of each bin doesn't change this either. I have the feeling that the problem is that the n and bins values don't actually contain any information on how the individual data points are distributed within each bin, but the assignment I'm working on clearly demands that I use them to calculate the standard deviation.
You haven't weighted the contribution of each bin with n[i]. Change the increment of t to
t += n[i]*(bins[i] - mean)**2
By the way, you can simplify (and speed up) your calculation by using numpy.average with the weights argument.
Here's an example. First, generate some data to work with. We'll compute the sample mean, variance and standard deviation of the input before computing the histogram.
In [54]: x = np.random.normal(loc=10, scale=2, size=1000)
In [55]: x.mean()
Out[55]: 9.9760798903061847
In [56]: x.var()
Out[56]: 3.7673459904902025
In [57]: x.std()
Out[57]: 1.9409652213499866
I'll use numpy.histogram to compute the histogram:
In [58]: n, bins = np.histogram(x)
mids is the midpoints of the bins; it has the same length as n:
In [59]: mids = 0.5*(bins[1:] + bins[:-1])
The estimate of the mean is the weighted average of mids:
In [60]: mean = np.average(mids, weights=n)
In [61]: mean
Out[61]: 9.9763028267760312
In this case, it is pretty close to the mean of the original data.
The estimated variance is the weighted average of the squared difference from the mean:
In [62]: var = np.average((mids - mean)**2, weights=n)
In [63]: var
Out[63]: 3.8715035807387328
In [64]: np.sqrt(var)
Out[64]: 1.9676136767004677
That estimate is within 2% of the actual sample standard deviation.
The following answer is equivalent to Warren Weckesser's, but may be more familiar to those who prefer to think of the mean as an expected value:
counts, bins = np.histogram(x)
mids = 0.5*(bins[1:] + bins[:-1])
probs = counts / np.sum(counts)
mean = np.sum(probs * mids)
sd = np.sqrt(np.sum(probs * (mids - mean)**2))
Do take note that in certain contexts you may want the unbiased sample variance, where the weighted sum of squared deviations is divided by N-1 rather than N.
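As a minimal sketch (treating the counts as frequency weights), that correction only changes the divisor:

N = np.sum(counts)
var_biased   = np.sum(counts * (mids - mean)**2) / N        # divides by N
var_unbiased = np.sum(counts * (mids - mean)**2) / (N - 1)  # divides by N-1
sd_unbiased  = np.sqrt(var_unbiased)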

Fitting exponent with gnuplot

I am trying to fit the data beneath to the form below. I am most interested in 'c' (I know that c ≈ 1/8 and b ≈ 3) but would like to extract all of these values from the data.
Formula:
y = a*(x-b)**c
Values.txt:
# "values.txt"
2.000000e+00 6.058411e-04
2.200000e+00 5.335520e-04
2.400000e+00 3.509583e-03
2.600000e+00 1.655943e-03
2.800000e+00 1.995418e-03
3.000000e+00 9.437851e-04
3.200000e+00 5.516159e-04
3.400000e+00 6.765981e-04
3.600000e+00 3.860859e-04
3.800000e+00 2.942881e-04
4.000000e+00 5.039975e-04
4.200000e+00 3.962199e-04
4.400000e+00 4.659717e-04
4.600000e+00 2.892683e-04
4.800000e+00 2.248839e-04
5.000000e+00 2.536980e-04
I have tried using the following commands in gnuplot; however, I am not getting meaningful results:
f(x) = a*(x-b)**c
b = 3
c = 1/8
fit f(x) "values.txt" via a,b,c
Does anyone know the best way to extract these values? I would rather not provide initial guesses for 'b' and 'c' if possible.
Thanks,
J
The main problem with your fitting function is finding b. You can express your equation as a linear function in log(x-b), after which the fitting is trivial:
b = 3
f(x) = c0 + c1 * x
fit f(x) "values.txt" using (log($1-b)):(log($2)) via c0, c1
a = exp(c0)
c = c1
As you see, you need to provide b but do not need initial guesses for the other parameters because it's a trivial linear fit.
Now, I would suggest that you provide a series of values of b and check how good the fit is for each one. gnuplot gives you the error in each fitted parameter, so you can plot the overall error (error_c0 + error_c1) as a function of b and find the b for which the error is minimal. Around the optimum, the curve of error_c0 + error_c1 vs b should be roughly quadratic, with its minimum at b_opt. Then run the fit as in the code above with b = b_opt and obtain a and c, as in the sketch below.
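A possible sketch of that scan (the output file name bscan.txt and the b range are illustrative; fit silently skips points where log($1-b) is undefined):

set fit errorvariables
set fit quiet
g(x) = c0 + c1*x
set print "bscan.txt"                 # record (b, total parameter error)
do for [i=0:40] {
    b = 2.5 + 0.025*i                 # scan b around the expected value 3
    fit g(x) "values.txt" using (log($1-b)):(log($2)) via c0, c1
    print b, c0_err + c1_err
}
set print
plot "bscan.txt" using 1:2 with linespoints   # pick b at the minimum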

How to find variability of a set of Cartesian Points (xyz) or fitting/distance to 3D line and/or plane?

So I was looking at this question:
Matlab - Standard Deviation of Cartesian Points
Which basically answers my question, except the problem is that I have xyz, not xy, so I don't think Ax=b would work in this case.
I have, say, 10 Cartesian points, and I want to be able to find the standard deviation of these points. Now, I don't want the standard deviation of each of X, Y and Z separately (that would give three numbers); I just want one number.
This can be done using MATLAB or excel.
To better understand what I'm doing: I have a desired point (1,2,3), and I recorded (1.1,2.1,2.9), (1.2,1.9,3.1), and so on. I want to be able to find the variability of all the recorded points.
I'm open to any other suggestions.
If you do the same thing as in the other answer you linked, it should work.
x_vals = xyz(:,1);
y_vals = xyz(:,2);
z_vals = xyz(:,3);
then make A with 3 columns,
A = [x_vals y_vals ones(size(x_vals))];
and
b = z_vals;
Then
sol=A\b;
m = sol(1);
n = sol(2);
c = sol(3);
and then
errs = (m*x_vals + n*y_vals + c) - z_vals;
After that you can use errs just as in the linked question.
Randomly clustered data
If your data is not expected to lie near a line or a plane, just compute the distance of each point to the centroid:
xyz_bar = mean(xyz);
M = bsxfun(@minus,xyz,xyz_bar);
d = sqrt(sum(M.^2,2)); % distances to centroid
Then you can compute variability any way you like. For example, standard deviation and RMS error:
std(d)
sqrt(mean(d.^2))
Data about a 3D line
If the data points are expected to lie roughly along the path of a line, with some deviation from it, you might look at the distance to a best-fit line. First, fit a 3D line to your points. One way is using the following parametric form of a 3D line:
x = a*t + x0
y = b*t + y0
z = c*t + z0
Generate some test data, with noise:
abc = [2 3 1]; xyz0 = [6 12 3];
t = 0:0.1:10;
xyz = bsxfun(@plus,bsxfun(@times,abc,t.'),xyz0) + 0.5*randn(numel(t),3);
plot3(xyz(:,1),xyz(:,2),xyz(:,3),'*') % to visualize
Estimate the 3D line parameters:
xyz_bar = mean(xyz) % centroid is on the line
M = bsxfun(@minus,xyz,xyz_bar); % remove mean
[~,S,V] = svd(M,0)
abc_est = V(:,1).'
abc/norm(abc) % compare actual slope coefficients
Distance from points to a 3D line:
pointCentroidSeg = bsxfun(@minus,xyz_bar,xyz);
pointCross = cross(pointCentroidSeg, repmat(abc_est,size(xyz,1),1));
errs = sqrt(sum(pointCross.^2,2))
Now you have the distance from each point to the fit line ("error" of each point). You can compute the mean, RMS, standard deviation, etc.:
>> std(errs)
ans =
0.3232
>> sqrt(mean(errs.^2))
ans =
0.7017
Data about a 3D plane
See David's answer.
