I have an issue with the curve-fitting process in Gnuplot. I have data whose time axis starts at 0.5024. I want to use a linear sin/cos combination to fit a value M over time (M = a + b*sin(wt) + c*cos(wt)). For further processing I only need the c value.
My code is
f(x)=a+b*sin(w*x)+c*cos(w*x)
fit f(x) "data.dat" using 1:2 via a,b,c,w
the asymptotic standard error is 66% for parameter c, which seems quite high. I suspect that it has to do with the fact that the time starts at 0.5024 instead of 0. What I could do, of course, is
fit f(x) "data.dat" using ($1-0.5024):2 via a,b,c,w
with an asymptotic error of about 10%, which is way lower. The question is: can I do that? Does my new fit with the time offset still represent the original curve? Any other ideas?
Thanks in advance for your help :-)
It's a bit difficult to answer this without having seen your data, but your observation is typical.
The problem is an effect of the fit itself, or rather of your formula. Let me explain it using an example data set. (Well, this will get a bit off-topic...)
A statistics excursus
The data follows the function f(x)=x, and all y-values have been shifted by Gaussian random numbers. In addition, the data lies in the x-range [600:800].
You can now simply apply a linear fit f(x)=m*x+b. According to Gaussian error propagation, the error is df(x)=sqrt((dm*x)²+(db)²). So you can plot the data, the linear function, and the error margin f(x) +/- df(x).
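For reference, a minimal gnuplot sketch of this demo (the file name "demo.dat" and the noise amplitude 20 are made up for illustration):

set fit errorvariables                     # let fit define m_err and b_err
# create the demo data: f(x)=x plus Gaussian noise, x in [600:800]
set samples 200
set table "demo.dat"
plot [600:800] x + 20*invnorm(rand(0)) notitle
unset table

f(x) = m*x + b
m = 1; b = 0
fit f(x) "demo.dat" via m, b

df(x) = sqrt((m_err*x)**2 + b_err**2)      # Gaussian error propagation
plot "demo.dat" w p, f(x) w l, f(x)+df(x) w l, f(x)-df(x) w l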
Here is the result:
The parameters:
m = 0.981822 +/- 0.1212 (12.34%)
b = 0.974375 +/- 85.13 (8737%)
The correlation matrix:
        m       b
m       1.000
b      -0.997   1.000
You may notice three things:
The error for b is very large!
The error margin is small at x=0, but increases with x. Shouldn't it be smallest where the data is, i.e. at x=700?
The correlation between m and b is -0.997, which is near the maximum (absolute) value of 1.
The third point can be understood from the plot: if you increase the slope m, the y-offset decreases, too. The two parameters are highly correlated, and an error in one of them is distributed to the other!
From statistics you may know that a linear regression function always goes through the center of gravity (cog) of your data. So, let's shift the data so that the cog is at the origin (it would be enough to shift it so that the cog is on the y-axis, but I shifted it fully).
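In gnuplot the shift can be applied on the fly, without touching the file (a sketch; stats needs gnuplot >= 4.6):

stats "demo.dat" using 1:2 nooutput        # provides STATS_mean_x and STATS_mean_y
fit f(x) "demo.dat" using ($1 - STATS_mean_x):($2 - STATS_mean_y) via m, b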
Result:
m = 1.0465 +/- 0.1211 (11.57%)
b = -12.0611 +/- 7.027 (58.26%)
Correlation:
        m       b
m       1.000
b      -0.000   1.000
Compared to the first plot, the value and error of m are almost the same, but the very large error of b is much smaller now. The reason is that m and b are no longer correlated, so a (tiny) variation of m does not cause a (very big) variation of b. It is also nice to see that the error margin has shrunk a lot.
Here is a last plot with the original data, the first fit function, and the back-shifted version of the function fitted to the shifted data:
About your fit function:
First, there is a big correlation problem: b and c are extremely correlated, as both together define the phase and amplitude of your oscillation. It would help a lot to use another, equivalent function:
f(x)=a+N*sin(w*x+p)
Here, you have phase and amplitude separated. You can still calculate your c from the fit results (see the sketch below), and I guess the error on it will be much better.
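A sketch of how to recover c (and b) from such a fit, using the identity N*sin(w*x+p) = N*cos(p)*sin(w*x) + N*sin(p)*cos(w*x):

f(x) = a + N*sin(w*x + p)
fit f(x) "data.dat" using 1:2 via a, N, w, p   # sensible start values still needed, especially for w
c = N*sin(p)       # coefficient of cos(w*x)
b = N*cos(p)       # coefficient of sin(w*x)
print "c = ", c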
As in my example, if the data is far away from the y-axis, a small variation of w will have a big impact on p. So I would suggest shifting your data so that its cog is on the y-axis, to get almost rid of this.
Is this shift allowed?
Yes. You do not alter the data; you simply change your coordinate system to get better errors. Also, the fit function should describe the data, so it should be most accurate in the range where your data is. In my first plot, the highest accuracy is at the y-axis, not where the data is.
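In your case (using the file and variable names from the question) that could look like:

x0 = 0.5024
f(x) = a + b*sin(w*x) + c*cos(w*x)
fit f(x) "data.dat" using ($1 - x0):2 via a, b, c, w
# f describes the shifted data; back-shift it when plotting over the raw data
plot "data.dat" using 1:2, f(x - x0) title "back-shifted fit"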
Important
You should always note which tricks you applied. Otherwise, someone may check your results, fit the data without the tricks, see the red curve instead of your green one, and accuse you of cheating...
Whether you can do that or not depends on whether the curve you're fitting represents the physical phenomena you're studying and is consistent with the physical model you need to comply with. My suggestion is that you provide those and ask this question again in a physics forum (or chemistry, biology, etc., depending on your field).
I have a problem fitting an exponential function
f(x) = A*exp(-b*x)*sin(2*pi*x/T + phi) + S
data
It kept coming out as a straight line; then I tried giving it some values for A, b, T, phi, S and it became something closer to the data, but still a poor fit.
Multidimensional fitting is very non-trivial and algorithms often fail on it. Try to help the algorithm by giving a better initial guess. You can also try to fit the variables one by one, e.g., the average S first, then the period, then these two together, etc.
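A sketch of such a staged fit for this model (starting values are illustrative, taken from the guesses below):

f(x) = A*exp(-b*x)*sin(2.*pi*x/T + phi) + S
A = 40.; b = 1./500.; T = 400.; phi = 1.; S = 170.
fit f(x) "data.txt" via S                 # the average level first
fit f(x) "data.txt" via S, T              # then the period together with it
fit f(x) "data.txt" via A, b, T, phi, S   # finally all parameters at once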
Please also show how you tried to fit the function and which version of Gnuplot you used. If the 3rd column consists of 0s and you provide it as error values for fit in Gnuplot v4, the fit fails completely.
On this given set of data, using a bad guess, the fit fails. But a better guess can succeed:
f(x)=A*exp(-b*x)*sin(2.*pi*x/T+phi)+S
A = 40.
b = 1/500.
T = 400.
phi = 1.
S = 170.
f_bad_guess(x) = 40. * exp(-x/500.) * sin(2.*pi*x/150+3.) + 170.
f_good_guess(x) = 40. * exp(-x/500.) * sin(2.*pi*x/400+1.) + 170.
fit f(x) "data.txt" via A,b,T,phi,S
p "data.txt" t "data", f(x) t "fitted function", f_good_guess(x) t "good initial guess set manually", f_bad_guess(x) t "bad initial guess set manually"
Non-linear regression is an iterative calculation that starts from "guessed" initial values of the parameters. Especially when the model involves sinusoidal functions, the key point is to start with guessed values close enough to the correct ones, which are not known.
Probably your difficulty is guessing good enough values (or the software's difficulty in trying some good enough initial values).
A non-conventional method which is not iterative and which doesn't need initial values is explained in this paper: https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales. The application of this method to the present case is shown below:
If more accuracy is wanted, one has to run a non-linear regression (with whatever software is available). Using the numerical values of the parameters found above as initial values increases the chances of good convergence.
I have the datafile:
10.0000 -330.12684910
15.0000 -332.85109334
20.0000 -333.85785274
25.0000 -334.18315783
30.0000 -334.28078907
35.0000 -334.30486903
40.0000 -334.30824069
45.0000 -334.30847874
50.0000 -334.30940105
55.0000 -334.31091085
60.0000 -334.31217217
The commands I used to fit this were
f(x) = a+b*exp(c*x)
fit f(x) datafile via a, b, c
I didn't get the negative exponential that I expected; then, just to see how a hyperbola would fit, I tried
f(x) = a+b/x
fit f(x) datafile via a, b
but decided to do this:
f(x) = a+b*exp(-c*x)
fit f(x) datafile via a, b, c
and it worked. I continued doing fits, but at some point it started to report the error "undefined value during function evaluation".
I restarted the session and deleted the fit.log file; I thought it was a gnuplot bug, but since then I always receive the undefined-value error. I've been reading about similar issues. It could be the a, b, c seeds, but I have introduced values very similar to the ones I had that time it fitted well, and it didn't work. I'm thinking the problem might be chaotic, or that I'm doing something wrong.
Thank you for your method. I understand gnuplot uses a non-linear least-squares method for fitting.
I found that one solution is to use the model y=a+b*exp(-c*c*x); I also found better initial values and it worked. Anyway, I have this other dataset:
2 -878.11598213
6 -878.08846509
10 -878.08105262
19 -878.07882425
28 -878.07793702
44 -878.07755010
60 -878.07738151
85 -878.07729504
110 -878.07725107
And gnuplot fit does the job, but really badly. Instead I used your method; here I show a comparison:
gnuplot fit and jjaquelin comparison
It is way better.
I don't know the algorithm used by gnuplot. Probably an iterative method starting from guessed values of the parameters. The difficulty might come from unsuitable initial values and/or from non-convergence of the process.
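For example, rough initial values can often be read directly off the data before calling fit. A sketch for the first data file above (a from the tail, c and b from the first two points):

f(x) = a + b*exp(-c*x)
a = -334.31      # the value the data levels off to
c = 0.21         # from -log((y(15)-a)/(y(10)-a))/5
b = 34.          # from (y(10)-a)*exp(c*10)
fit f(x) datafile via a, b, c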
From my own calculation the result is close to the values below. The method, which is not iterative and doesn't require initial guesses, is explained in the paper: https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
FOR INFORMATION:
The linearisation of the regression is obtained thanks to the integral equation to which the function to be fitted is a solution. For the model used here, y = a + b*exp(-c*x), one has y' = -c*(y - a), and integrating from x1 to x gives

y(x) - y(x1) = c*a*(x - x1) - c * ∫[x1,x] y(t) dt

which is linear in the combined coefficients c*a and -c, so they can be obtained by an ordinary linear regression, with the integral computed numerically from the data.
The paper referenced above is mainly written in French. It is partially translated in: https://scikit-guess.readthedocs.io/en/latest/appendices/references.html
I'm having trouble fitting the following data in gnuplot 4.4
0.0007629768 -0.1256279199 0.0698209297
0.0007565689 0.5667065856 0.0988522507
0.00071274 1.3109126758 0.7766233743
f1(x) = -a1 * x + b1
a1 = 28000
fit f1(x) "56demo.csv" using 1:2:3 via a1, b1
plot "56demo.csv" using 1:2:3 with yerrorbars title "56%", \
f1(x) notitle
This converges to values of a1 and b1 which are higher than I would like.
Several similar tests converge to values in the range in which they should be, but for some reason these don't.
Specifically, I'd like to have
a1 = 28000, approximately.
I'm looking for some way to steer the fit towards a different local minimum. I've tried making the fit limit smaller, but I haven't had much luck that way.
Is it possible to set an upper limit to the values of a1 and b1? That is one way I'd like to try.
Thanks
The most common method of fitting is the chi-square (χ²) method. Chi-square is the expression

χ² = Σ_i ( (y_i - f(x_i)) / σ_i )²

where x_i, y_i and σ_i are the data points with error in y, and f(x) is a model function which describes your data. This function has some parameters, and the goal is to find those values of the parameters for which this expression has a global minimum. A program like gnuplot will try several sets of values for these parameters, to find the one set for which χ² is minimal.
In general, several things can go wrong, and this usually means that the algorithm has found a local minimum, not the global one. This happens, for example, when the initial values for the parameters are bad. It helps to estimate the initial values as well as possible.
Another problem is when the algorithm uses too big steps between the sets of parameter values. This happens, for example, if you have a very narrow peak on top of a broader peak. Usually, you will end up with a parameter set describing a sum of two identical peaks, which describes the broad peak well and ignores the narrow one. Again, a good set of initial values will help. You may also first keep the peak positions fixed (i.e. leave them out of the via-list in gnuplot) and fit all other parameters, and then fit all parameters in a second command.
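For example (a sketch with a hypothetical two-peak model; all names and starting positions are made up for illustration):

f(x) = a1*exp(-(x-p1)**2/w1**2) + a2*exp(-(x-p2)**2/w2**2)
p1 = 5.0; p2 = 5.2                               # peak positions estimated from the plot
fit f(x) "peaks.dat" via a1, w1, a2, w2          # positions held fixed
fit f(x) "peaks.dat" via a1, w1, p1, a2, w2, p2  # then fit everything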
But if f(x) is a linear function, these problems do not exist!
You can replace f(x) by m*x+b and do the math. The result is that χ² is a parabola in parameter space, with a single, unique minimum, which can even be calculated explicitly.
So, if gnuplot gives you a set of parameters for that data, this result is absolutely correct, even if you don't like that result.
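In fact, for a straight line you do not even need fit: gnuplot's stats command (available from version 4.6, so not in your 4.4) computes the unique least-squares line directly, although it ignores the error column:

stats "56demo.csv" using 1:2 nooutput
print STATS_slope, STATS_intercept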
I tried to fit this plot; as you can see, the fit is not so good for the data.
My code is:
clear
reset
set terminal pngcairo size 1000,600 enhanced font 'Verdana,10'
set output 'LocalEnergyStepZoom.png'
set ylabel '{/Symbol D}H/H_0'
set xlabel 'n_{step}'
set format y '%.2e'
set xrange [*:*]
set yrange [1e-16:*]
f(x) = a*x**b
fit f(x) "revErrEnergyGfortCaotic.txt" via a,b
set logscale
plot 'revErrEnergyGfortCaotic.txt' w p,\
'revErrEnergyGfortRegular.txt' w p,\
f(x) w l lc rgb "black" lw 3
exit
So the question is: what mistake am I making here? Because I suppose that in a log-log plane a fit of the form I put in the code should represent the data very well.
Thanks a lot
Finally I was able to solve the problem using the suggestion in the answer of Christop, modified just a bit.
I found the approximate slope of the function (something near -4); keeping this parameter fixed, I fitted the curve via a only. Having found a, I fixed it and fitted only b. After that, using the output as the starting solution for a full fit, I found the best fit.
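In gnuplot commands, the procedure looks roughly like this:

f(x) = a*x**b
b = -4.                                           # approximate slope from the log-log plot
fit f(x) "revErrEnergyGfortCaotic.txt" via a      # b held fixed
fit f(x) "revErrEnergyGfortCaotic.txt" via b      # a held fixed at the value just found
fit f(x) "revErrEnergyGfortCaotic.txt" via a, b   # final fit from these starting values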
You must find appropriate starting values to get a correct fit, because that kind of fitting doesn't have one global solution.
If you don't define a and b, both are set to 1 which might be too far away. Try using
a = 100
b = -3
for a better start. Maybe you need to tweak those values a bit more; I couldn't check because I don't have the data file.
Also, you might want to restrict the region of the fitting to the part above 10:
fit [10:] f(x) "revErrEnergyGfortCaotic.txt" via a,b
Of course, only if it is appropriate.
This is a common issue in data analysis, and I'm not certain if there's a nice Gnuplot way to solve it.
The issue is that the penalty functions in standard fitting routines are typically the sum of squares of errors, and try as you might, if your data have a lot of dynamic range, the errors for the smallest y-values come out to essentially zero from the point of view of the algorithm.
I recently taught a course to students where they needed to fit such data. Lots of them beat their (matlab) fitting routines into submission by choosing very stringent convergence criteria, but even this did not help too much.
What you really need to do, if you want to fit this power-law tail well, is to convert the data into log-log form and run a linear regression on that log-log representation.
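A sketch of that in gnuplot, assuming the data file from the question and strictly positive y values:

g(x) = m*x + q
fit g(x) "revErrEnergyGfortCaotic.txt" using (log($1)):(log($2)) via m, q
a = exp(q); b = m          # recover the parameters of f(x) = a*x**b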
The main problem here is that the residual errors of the function values of the higher x are very small compared to the residuals at lower x values. After all, you almost span 20 orders of magnitude on the y axis.
Just weight the y values with 1/y**2, or even better: if you have the standard deviations of your data points weight the values with 1/std**2. Then the fit should converge much much better.
In gnuplot, weighting is done via a third data column, which is interpreted as the standard deviation s of each y value (the weight is then 1/s²). So, to weight with 1/y², pass the y value itself as the third column:
fit f(x) 'data' using 1:2:($2) via ...
Or you can follow Raman Shah's advice, convert the data to a log-log representation, and do a linear regression.
You need to use weights for your fit (currently low values are not considered as important) and a better starting guess (via "pars_file.pars").
here is what I want to do (preferably with Matlab):
Basically I have several traces of cars driving through an intersection. Each one is noisy, so I want to take the mean over all measurements to get a better approximation of the real route. In other words, I am looking for a way to approximate the curve which has the smallest distance to all of the measured traces (in a least-squares sense).
At first glance, this is quite similar to what can be achieved with spap2 of the Curve Fitting Toolbox (there is a good example in the Least-Squares Approximation section here).
But this algorithm has a major drawback: it assumes a function (with exactly one y(x) for every x), but what I want is a curve in 2D (which may have several y values for one x). This leads to problems when cars turn right or left by more than 90 degrees.
Furthermore, it minimizes the vertical offsets and not the perpendicular offsets (according to the definition on Wolfram).
Has anybody an idea how to solve this problem? I thought of using a B-spline and changing the number of knots and the degree until I reach a certain fitting quality, but I can't find a way to solve this analytically or with the functions provided by the Curve Fitting Toolbox. Is there a way to solve this without numerical optimization?
mbeckish is right. In order to get sufficient flexibility in the curve shape, you must use a parametric curve representation (x(t), y(t)) instead of an explicit representation y(x). See Parametric equation.
Given n successive points on the curve, assign them their true times if you know them, or just the integers 0..n-1 if you don't. Then call spap2 twice, with vectors T, X and T, Y instead of X, Y. Now for arbitrary t you get a point (x, y) on the curve.
This won't give you a true least squares solution, but should be good enough for your needs.