Consider the data file with two columns and two rows:
3869. 1602.
3882. 9913.
I'd like to fit a line using gnuplot
gnuplot> f(x) = a * x + b
gnuplot> fit f(x) './data.txt' u 1:2 via a, b
Iteration 0
WSSR : 3.43474e+07 delta(WSSR)/WSSR : 0
delta(WSSR) : 0 limit for stopping : 1e-05
lambda : 2740.4
initial set of free parameter values
a = 1.7524
b = -1026.99
/
Iteration 1
WSSR : 3.43474e+07 delta(WSSR)/WSSR : -1.49847e-12
delta(WSSR) : -5.14686e-05 limit for stopping : 1e-05
lambda : 274.04
resultant parameter values
a = 1.7524
b = -1026.99
After 1 iterations the fit converged.
final sum of squares of residuals : 3.43474e+07
rel. change during last iteration : -1.49847e-12
Exactly as many data points as there are parameters.
In this degenerate case, all errors are zero by definition.
Final set of parameters
=======================
a = 1.7524
b = -1026.99
gnuplot>
which gives wrong values for fit parameters. Why is this happening? My gnuplot version is Version 4.4 patchlevel 0.
It looks to me that the curve-fitting function is struggling to find the true parameters. This could be associated with the magnitude of your data points and/or trying to fit a line with two parameters to only two data points.
In any case, doing the calculation of a and b in Excel or equivalent yields:
a= 577.769
b = -2233787
If you give gnuplot a good guess at what they should be, e.g. a=500 and b=-2233700 and repeat the procedure, it should successfully find the correct solution:
Final set of parameters
=======================
a = 577.769
b = -2.23379e+06
Of course, if you're fitting two points to a two-parameter straight line, it's much easier to calculate the values of a and b by hand:
a = (9113-1602) / (3882-3869)
b = 1602 - a * 3869
Gnuplot uses a non-linear method to determine the parameters of your function f with respect to a certain error value: limit for stopping : 1e-05.
If you change that error value your function will be exactly fit. The error value can be specified with the FIT_LIMIT variable like so:
FIT_LIMIT = 1e-8
With this setting your points will be exactly matched after 12 iterations. (At least on my machine^^)
Related
Say I need to fit some data to a parabola, and then perform some calculations involving the correlation matrix elements of the fit parameters: is there a way to use these parameters directly in gnuplot after the fit converges? Are they stored in some variable like the error estimates?.
I quote the explicit problem I'm having. All of this is written to a plot.gp text file and ran with gnuplot plot.gp.
I include set fit errorbariables at the beginning, and then proceed with:
f(x)=a+b*x+c*x*x
fit f(x) 'file.dat' u 1:2:3 yerrors via a,b,c
Once the fit is done, I can use the values of a,b,c and their errors a_err, b_err and c_err directly in the plot.gp script; my question is: can I do the same with the correlation matrix of the parameters?
The problem is that the matrix is printed to terminal once the script finishes to run:
correlation matrix of the fit parameters:
a b e
a 1.000
b 0.910 1.000
c -0.956 -0.987 1.000
Are the entries of the matrix stores in some variable (like a_err, b_err) that I can access after the fit is done but before the script ends?
I think the command you are looking for is
set fit covariancevariables
If the `covariancevariables` option is turned on, the covariances between
final parameters will be saved to user-defined variables. The variable name
for a certain parameter combination is formed by prepending "FIT_COV_" to
the name of the first parameter and combining the two parameter names by
"_". For example given the parameters "a" and "b" the covariance variable is
named "FIT_COV_a_b".
Edit: I certainly missed gnuplot's intended way via option covariancevariables (apparently available since gnuplot 5.0). Ethan's answer is the way to go. I nevertheless leave my answer, with some modifications it might maybe be useful to extract something else from the fit output.
Maybe I missed it, but I am not aware that you can directly store the elements of the correlation matrix into variables, however, you can do it with some workaround.
You can set the output file for your fit results (check help set fit). The shortest output will be created with the option results. The results will be written to this file (actually, appended if the file already exists).
Example:
After 5 iterations the fit converged.
final sum of squares of residuals : 0.45
rel. change during last iteration : -3.96255e-10
degrees of freedom (FIT_NDF) : 1
rms of residuals (FIT_STDFIT) = sqrt(WSSR/ndf) : 0.67082
variance of residuals (reduced chisquare) = WSSR/ndf : 0.45
Final set of parameters Asymptotic Standard Error
======================= ==========================
a = 1.75 +/- 0.3354 (19.17%)
b = -2.65 +/- 1.704 (64.29%)
c = 1.75 +/- 1.867 (106.7%)
correlation matrix of the fit parameters:
a b c
a 1.000
b -0.984 1.000
c 0.898 -0.955 1.000
Now, you can read this file back into a datablock (check gnuplot: load datafile 1:1 into datablock) and extract the values from the last lines (here: 3), check help word and check real.
Script:
### get fit correlation matrix into variables
reset session
$Data <<EOD
1 1
2 3
3 10
4 19
EOD
f(x) = a*x**2 + b*x + c
myFitFILE = "SO71788523_fit.dat"
set fit results logfile myFitFILE
fit f(x) $Data u 1:2 via a,b,c
set key top left
set grid x,y
# load file 1:1 into datablock
FileToDatablock(f,d) = GPVAL_SYSNAME[1:7] eq "Windows" ? \
sprintf('< echo %s ^<^<EOD & type "%s"',d,f) : \
sprintf('< echo "\%s <<EOD" & cat "%s"',d,f) # Linux/MacOS
load FileToDatablock(myFitFILE,'$FIT')
# extract parameters into variables
N = 3 # number of parameters
getValue(p1,p2) = real(word($FIT[|$FIT|-N+p1],p2+1)) # extract value as floating point number
aa = getValue(1,1)
ba = getValue(2,1)
bb = getValue(2,2)
ca = getValue(3,1)
cb = getValue(3,2)
cc = getValue(3,3)
set label 1 at graph 0.1,graph 0.8 \
sprintf("Correlation matrix:\naa: %g\nba: %g\nbb: %g\nca: %g\ncb: %g\ncc: %g",aa,ba,bb,ca,cb,cc)
plot $Data u 1:2 w lp pt 7 lc "red", \
f(x) w l lc "blue" title sprintf("fit: a=%g, b=%g, c=%g",a,b,c)
### end of script
Result:
I'm plotting this dataset and making a logarithmic fit, but, for some reason, the fit seems to be strongly wrong, at some point I got a good enough fit, but then I re ploted and there were that bad fit. At the very beginning there were a 0.0 0.0076 but I changed that to 0.001 0.0076 to avoid the asymptote.
I'm using (not exactly this one for the image above but now I'm testing with this one and there is that bad fit as well) this for the fit
f(x) = a*log(k*x + b)
fit = fit f(x) 'R_B/R_B.txt' via a, k, b
And the output is this
Also, sometimes it says 7 iterations were as is the case shown in the screenshot above, others only 1, and when it did the "correct" fit, it did like 35 iterations or something and got a = 32 if I remember correctly
Edit: here is again the good one, the plot I got is this one. And again, I re ploted and get that weird fit. It's curious that if there is the 0.0 0.0076 when the good fit it's about to be shown, gnuplot says "Undefined value during function evaluation", but that message is not shown when I'm getting the bad one.
Do you know why do I keep getting this inconsistence? Thanks for your help
As I already mentioned in comments the method of fitting antiderivatives is much better than fitting derivatives because the numerical calculus of derivatives is strongly scattered when the data is slightly scatered.
The principle of the method of fitting an integral equation (obtained from the original equation to be fitted) is explained in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales . The application to the case of y=a.ln(c.x+b) is shown below.
Numerical calculus :
In order to get even better result (according to some specified criteria of fitting) one can use the above values of the parameters as initial values for iterarive method of nonlinear regression implemented in some convenient software.
NOTE : The integral equation used in the present case is :
NOTE : On the above figure one can compare the result with the method of fitting an integral equation to the result with the method of fitting with derivatives.
Acknowledgements : Alex Sveshnikov did a very good work in applying the method of regression with derivatives. This allows an interesting and enlightening comparison. If the goal is only to compute approximative values of parameters to be used in nonlinear regression software both methods are quite equivalent. Nevertheless the method with integral equation appears preferable in case of scattered data.
UPDATE (After Alex Sveshnikov updated his answer)
The figure below was drawn in using the Alex Sveshnikov's result with further iterative method of fitting.
The two curves are almost indistinguishable. This shows that (in the present case) the method of fitting the integral equation is almost sufficient without further treatment.
Of course this not always so satisfying. This is due to the low scatter of the data.
In ADDITION , answer to a question raised in comments by CosmeticMichu :
The problem here is that the fit algorithm starts with "wrong" approximations for parameters a, k, and b, so during the minimalization it finds a local minimum, not the global one. You can improve the result if you provide the algorithm with starting values, which are close to the optimal ones. For example, let's start with the following parameters:
gnuplot> a=47.5087
gnuplot> k=0.226
gnuplot> b=1.0016
gnuplot> f(x)=a*log(k*x+b)
gnuplot> fit f(x) 'R_B.txt' via a,k,b
....
....
....
After 40 iterations the fit converged.
final sum of squares of residuals : 16.2185
rel. change during last iteration : -7.6943e-06
degrees of freedom (FIT_NDF) : 18
rms of residuals (FIT_STDFIT) = sqrt(WSSR/ndf) : 0.949225
variance of residuals (reduced chisquare) = WSSR/ndf : 0.901027
Final set of parameters Asymptotic Standard Error
======================= ==========================
a = 35.0415 +/- 2.302 (6.57%)
k = 0.372381 +/- 0.0461 (12.38%)
b = 1.07012 +/- 0.02016 (1.884%)
correlation matrix of the fit parameters:
a k b
a 1.000
k -0.994 1.000
b 0.467 -0.531 1.000
The resulting plot is
Now the question is how you can find "good" initial approximations for your parameters? Well, you start with
If you differentiate this equation you get
or
The left-hand side of this equation is some constant 'C', so the expression in the right-hand side should be equal to this constant as well:
In other words, the reciprocal of the derivative of your data should be approximated by a linear function. So, from your data x[i], y[i] you can construct the reciprocal derivatives x[i], (x[i+1]-x[i])/(y[i+1]-y[i]) and the linear fit of these data:
The fit gives the following values:
C*k = 0.0236179
C*b = 0.106268
Now, we need to find the values for a, and C. Let's say, that we want the resulting graph to pass close to the starting and the ending point of our dataset. That means, that we want
a*log(k*x1 + b) = y1
a*log(k*xn + b) = yn
Thus,
a*log((C*k*x1 + C*b)/C) = a*log(C*k*x1 + C*b) - a*log(C) = y1
a*log((C*k*xn + C*b)/C) = a*log(C*k*xn + C*b) - a*log(C) = yn
By subtracting the equations we get the value for a:
a = (yn-y1)/log((C*k*xn + C*b)/(C*k*x1 + C*b)) = 47.51
Then,
log(k*x1+b) = y1/a
k*x1+b = exp(y1/a)
C*k*x1+C*b = C*exp(y1/a)
From this we can calculate C:
C = (C*k*x1+C*b)/exp(y1/a)
and finally find the k and b:
k=0.226
b=1.0016
These are the values used above for finding the better fit.
UPDATE
You can automate the process described above with the following script:
# Name of the file with the data
data='R_B.txt'
# The coordinates of the last data point
xn=NaN
yn=NaN
# The temporary coordinates of a data point used to calculate a derivative
x0=NaN
y0=NaN
linearFit(x)=Ck*x+Cb
fit linearFit(x) data using (xn=$1,dx=$1-x0,x0=$1,$1):(yn=$2,dy=$2-y0,y0=$2,dx/dy) via Ck, Cb
# The coordinates of the first data point
x1=NaN
y1=NaN
plot data using (x1=$1):(y1=$2) every ::0::0
a=(yn-y1)/log((Ck*xn+Cb)/(Ck*x1+Cb))
C=(Ck*x1+Cb)/exp(y1/a)
k=Ck/C
b=Cb/C
f(x)=a*log(k*x+b)
fit f(x) data via a,k,b
plot data, f(x)
pause -1
I'm using the following gnuplot script to plot a linear fit:
#!/usr/bin/gnuplot
set term cairolatex
set output "linear_fit.tex"
c = 299792458.
x(x) = c / x
y(x) = x
h(x) = a * x + b
fit h(x) "linear_fit.dat" u (x($1)):(y($2)) via a,b
plot "linear_fit.dat" u (x($1)):(y($2)) w points title "", \
(h(x)) with lines linecolor rgb "black" title "Linear Fit"
However, after the iterations converge, b is always 1.0: https://dpaste.de/ozReq/
How can I get gnuplot to adjust b as well as a?
Update: Repeating the fit command a few hundred times with alternating via a/via b does give pretty good results, but that just can't be how it's supposed to be done.
Update 2: Here's the data in linear_fit.dat:
# lambda, V
360e-9 1.119
360e-9 1.148
360e-9 1.145
400e-9 0.949
400e-9 0.993
400e-9 0.971
440e-9 0.883
440e-9 0.875
440e-9 0.863
490e-9 0.737
490e-9 0.728
490e-9 0.755
540e-9 0.575
540e-9 0.571
540e-9 0.592
590e-9 0.457
590e-9 0.455
590e-9 0.482
I think your troubles stem from the fact that your x-values are very large (on the order of 10e14).
If you do not provide gnuplot with an initial guess for a and b, it will assume a=1 and b=1 as starting points for the fit. However, this is a poor initial guess:
Please note the log scale on both the x- and y-axis.
From the gnuplot documentation:
fit may, and often will get "lost" if started far from a solution, where SSR is large and changing slowly as the parameters are varied, or it may reach a numerically unstable region (e.g., too large a number causing a floating point overflow) which results in an "undefined value" message or gnuplot halting.
To improve the chances of finding the global optimum, you should set the starting values at least roughly in the vicinity of the solution, e.g., within an order of magnitude, if possible. The closer your starting values are to the solution, the less chance of stopping at another minimum. One way to find starting values is to plot data and the fitting function on the same graph and change parameter values and replot until reasonable similarity is reached. The same plot is also useful to check whether the fit stopped at a minimum with a poor fit.
In your case, such starting values could be:
a = 1e-15
b = -0.5
I obtained these values by eye-balling your range of values.
With those starting values, the linear fit results in:
Final set of parameters Asymptotic Standard Error
======================= ==========================
a = 1.97355e-015 +/- 6.237e-017 (3.161%)
b = -0.5 +/- 0.04153 (8.306%)
Which looks like this:
You can play with the control setting of fit (such as setting FIT_LIMIT = 1.e-35) or the starting values to achieve a better fit than this.
EDIT
While I still have not been able to coax gnuplot into modifying both parameters a, b at the same time, I found an alternate approach using R. I am aware that there are many other (scripting) languages that can perform a linear fit and this question was about gnuplot. However, the required effort with R appeared to be minimal.
Here's an example, which, when saved as linear_fit.R and called with
R CMD BATCH linear_fit.R
will provide the two coefficients of the linear fit, that gnuplot failed to provide.
y <- c(1.119, 1.148, 1.145, 0.949, 0.993, 0.971, 0.883, 0.875, 0.863,
0.737, 0.728, 0.755, 0.575, 0.571, 0.592, 0.457, 0.455, 0.482)
x <- c(3.60E-007, 3.60E-007, 3.60E-007, 4.00E-007, 4.00E-007,
4.00E-007, 4.40E-007, 4.40E-007, 4.40E-007, 4.90E-007,
4.90E-007, 4.90E-007, 5.40E-007, 5.40E-007, 5.40E-007,
5.90E-007, 5.90E-007, 5.90E-007)
c = 299792458.
x <- c/x
lm.out <- lm(y ~ x)
svg("linear_fit.svg")
plot(x,y)
abline(lm.out,col="red")
summary(lm.out)
You will end up with an svg-file that contains the plot and a linear_fit.Rout text file. In there you'll find the following coefficients:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.429e-01 4.012e-02 -13.53 3.55e-10 ***
x 2.037e-15 6.026e-17 33.80 2.61e-16 ***
So, in the terminology of the original question, we obtain:
a = 2.037e-15
b = -5.429e-01
These values are very close to the values you quoted from alternating the fit.
In case the comments get purged, these questions were identified as related:
What is gnuplot's internal representation of floating point numbers?
Gnuplot behaves oddly in polynomial fit. Why is that?
I keep having the w = 0 in Givens(); error message when I try to use gnuplot built-in curve fitting feature.
What I do is trying to fit experimental data to a certain mathematical model in gnuplot.
I define the model function s(x):
gnuplot> z(x)=(x-mu)/be
gnuplot> s(x)=(k/be)*exp(-z(x)-exp(-z(x)))
Then I plot the actual data and the model function to get an initial guess for the model parameters:
Then I adjust the initial guess:
gnuplot> k=2.6; mu=-8.8;
gnuplot> replot
To obtain a pretty fine picture:
Then I try to precisely fit the curve:
gnuplot> fit s(x) '701_707_TRACtdetq.log30.hist1.txt' u 2:6 via k,be,mu
And what I get is the single iteration and a error message:
Iteration 0
WSSR : 3.85695 delta(WSSR)/WSSR : 0
delta(WSSR) : 0 limit for stopping : 1e-05
lambda : 0.223951
initial set of free parameter values
k = 2.6
be = 1
mu = -8.8
/
Iteration 1
WSSR : 0.0720502 delta(WSSR)/WSSR : -52.5315
delta(WSSR) : -3.7849 limit for stopping : 1e-05
lambda : 0.0223951
resultant parameter values
k = 2.03996
be = 0.777868
mu = -8.87082
w = 0 in Givens(); Cjj = 3.37383e-196, Cij = 2.54469e-192
And the curve pretty fit:
What does that error means and how would I get the fit process going?
What I'm just about to say might seem strange but it works!
When I run into the 'w = 0 in Givens()' error I use:
gnuplot> set xrange [a,b]
where 'a' and 'b' are chosen to window the 'most interesting' parts. If you now do the fitting command that you have:
gnuplot> fit s(x) '701_707_TRACtdetq.log30.hist1.txt' u 2:6 via k,be,mu
You might find that your fit now converges. I'm not sure why 'set range' affects the fitting algorithm but it does! In your example, I might let:
a = -12
b = -2
The error message w = 0 in Givens(); seems to be related to inability of fit to perform the next iteration of fit parameters estimation. The error message is accompanied by the values of a certain matrix C[][] that is related to the direction of the next step of the fit iterations. Those values are usually very small, like in the example, Cjj = 3.37383e-196, Cij = 2.54469e-192. This means that the fit process has converged to a state where every other local set of fit parameters are less optimal than the current (local state extreme), but the current residuals are above the convergence limit, in this case delta(WSSR) : -3.7849 limit for stopping : 1e-05. This happens when the data to be fitted exhibits a disturbance (at approximately x=-13 in this case) that yields significant delta despite the perfect fit.
Long story short: the error usually happens when the fit is fine but the delta is still high.
I often use Octave to create data that I can plot from my lab results. That data is then fitted with some function in gnuplot:
f1(x) = a * exp(-x*g);
fit f1(x) "c_1.dat" using 1:2:3 via a,g
That creates a fit.log:
*******************************************************************************
Tue May 8 19:13:39 2012
FIT: data read from "e_schwach.dat" using 1:2:3
format = x:z:s
#datapoints = 16
function used for fitting: schwach(x)
fitted parameters initialized with current variable values
Iteration 0
WSSR : 12198.7 delta(WSSR)/WSSR : 0
delta(WSSR) : 0 limit for stopping : 1e-05
lambda : 14.2423
initial set of free parameter values
mu2 = 1
omega2 = 1
Q2 = 1
After 70 iterations the fit converged.
final sum of squares of residuals : 46.0269
rel. change during last iteration : -2.66463e-06
degrees of freedom (FIT_NDF) : 13
rms of residuals (FIT_STDFIT) = sqrt(WSSR/ndf) : 1.88163
variance of residuals (reduced chisquare) = WSSR/ndf : 3.54053
Final set of parameters Asymptotic Standard Error
======================= ==========================
mu2 = 0.120774 +/- 0.003851 (3.188%)
omega2 = 0.531482 +/- 0.0006112 (0.115%)
Q2 = 17.6593 +/- 0.7416 (4.199%)
correlation matrix of the fit parameters:
mu2 omega2 Q2
mu2 1.000
omega2 -0.139 1.000
Q2 -0.915 0.117 1.000
Is there some way to get the parameters and their error back into Octave? I mean I can write a Python program that parses that, but I hoped to avoid that.
Update
This question is not applicable to me any more, since I use Python and matplotlib for my lab work now, and it can does all this from a single program. I leave this question open in case somebody else has the same problem.
I don't know much about the gnuplot-Octave interface, but what can make your (parsing) life easier is you can:
set fit errorvariables
fit a*x+g via a,g
set print "fit_parameters.txt"
print a,a_err
print g,g_err
set print
Now your variables and their respective errors are in the file "fit_parameters.txt" with
no parsing needed from python.
from the documentation on fit:
If gnuplot was built with this option, and you activated it using set
fit errorvariables, the error for each fitted parameter will be
stored in a variable named like the parameter, but with _err
appended. Thus the errors can be used as input for further
computations.