I am trying to fit this plot; as you can see, the fit is not very good for the data.
My code is:
clear
reset
set terminal pngcairo size 1000,600 enhanced font 'Verdana,10'
set output 'LocalEnergyStepZoom.png'
set ylabel '{/Symbol D}H/H_0'
set xlabel 'n_{step}'
set format y '%.2e'
set xrange [*:*]
set yrange [1e-16:*]
f(x) = a*x**b
fit f(x) "revErrEnergyGfortCaotic.txt" via a,b
set logscale
plot 'revErrEnergyGfortCaotic.txt' w p,\
'revErrEnergyGfortRegular.txt' w p,\
f(x) w l lc rgb "black" lw 3
exit
So the question is: what mistake am I making here? I assumed that in a log-log plane a fit of the form I put in the code should represent the data very well.
Thanks a lot
Finally I was able to solve the problem using the suggestion in Christoph's answer, modified just a bit.
I found the approximate slope of the function (something near -4). Keeping that parameter fixed, I first fitted the curve with only a; having found it, I fixed a and varied only b. Then, using that output as the starting solution for the fit, I found the best fit.
You must find appropriate starting values to get a correct fit, because that kind of fitting problem doesn't have a single global solution.
If you don't define a and b, both are set to 1 which might be too far away. Try using
a = 100
b = -3
for a better start. Maybe you need to tweak those values a bit more; I couldn't check, because I don't have the data file.
Also, you might want to restrict the region of the fitting to the part above 10:
fit [10:] f(x) "revErrEnergyGfortCaotic.txt" via a,b
Of course, only if that is appropriate.
This is a common issue in data analysis, and I'm not certain if there's a nice Gnuplot way to solve it.
The issue is that the penalty functions in standard fitting routines are typically the sum of squares of errors, and try as you might, if your data have a lot of dynamic range, the errors for the smallest y-values come out to essentially zero from the point of view of the algorithm.
I recently taught a course to students where they needed to fit such data. Lots of them beat their (matlab) fitting routines into submission by choosing very stringent convergence criteria, but even this did not help too much.
What you really need to do, if you want to fit this power-law tail well, is to convert the data into log-log form and run a linear regression on that log-log representation.
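That log-log regression is easy to do outside gnuplot as well. Here is a minimal numpy sketch; the original data file isn't available, so synthetic power-law data (illustrative values a = 100, b = -4) stands in for it:

```python
import numpy as np

# Synthetic stand-in for the unavailable data file: y = a * x**b
# with a = 100, b = -4, plus multiplicative (log-normal) noise
rng = np.random.default_rng(1)
x = np.logspace(1, 4, 40)                       # x from 10 to 10000
y = 100.0 * x**-4.0 * rng.lognormal(0.0, 0.05, x.size)

# Linear regression in log-log space: log y = log a + b * log x
b_fit, log_a = np.polyfit(np.log(x), np.log(y), 1)
a_fit = np.exp(log_a)
# b_fit comes out close to -4 and a_fit close to 100, even though
# y spans about 12 orders of magnitude
```

In log-log space every decade of y contributes equally to the residuals, which is exactly what the unweighted nonlinear fit cannot deliver.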
The main problem here is that the residuals of the function values at higher x are very small compared to the residuals at lower x values. After all, you span almost 20 orders of magnitude on the y axis.
Just weight the y values with 1/y**2, or even better: if you have the standard deviations of your data points weight the values with 1/std**2. Then the fit should converge much much better.
In gnuplot weighting is done using a third data column:
fit f(x) 'data' using 1:2:(1/$2**2) via ...
Or you can use Raman Shah's advice and linearize the y axis and do a linear regression.
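For comparison, the same weighting trick can be expressed with scipy (an assumption here: scipy's curve_fit minimises sum(((y - f(x))/sigma)**2), so passing sigma = y reproduces the 1/$2**2 weights; the data is synthetic since the original file isn't available):

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic power-law data with ~5% relative noise
rng = np.random.default_rng(2)
x = np.logspace(1, 4, 40)
y = 100.0 * x**-4.0 * (1.0 + rng.normal(0.0, 0.05, x.size))

def f(x, a, b):
    return a * x**b

# Starting guess from a log-log linear regression (see the other answer)
b0, log_a0 = np.polyfit(np.log(x), np.log(y), 1)

# sigma=y weights each point by 1/y**2, i.e. relative residuals
popt, _ = curve_fit(f, x, y, p0=(np.exp(log_a0), b0), sigma=y)
a_fit, b_fit = popt
```

With the relative weighting, the small-y tail constrains the fit just as strongly as the large-y points.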
You need to use weights for your fit (currently, low values are not considered as important) and a better starting guess (via "pars_file.pars").
I have the datafile:
10.0000 -330.12684910
15.0000 -332.85109334
20.0000 -333.85785274
25.0000 -334.18315783
30.0000 -334.28078907
35.0000 -334.30486903
40.0000 -334.30824069
45.0000 -334.30847874
50.0000 -334.30940105
55.0000 -334.31091085
60.0000 -334.31217217
The commands I used to fit this,
f(x) = a+b*exp(c*x)
fit f(x) datafile via a, b, c
didn't give the negative exponential I expected, so just to see how a hyperbola fitted I tried
f(x) = a+b/x
fit f(x) datafile via a, b
but decided to do this:
f(x) = a+b*exp(-c*x)
fit f(x) datafile via a, b, c
and it worked. I continued doing fits, but at some point it started to report the error "undefined value during function evaluation".
I restarted the session and deleted the fit.log file; I thought it was a gnuplot bug, but since then I always get the undefined-value error. I've been reading about similar issues. It could be the a, b, c seeds, but I have introduced values very similar to the ones I used the time it fitted well, and it didn't work. I'm wondering whether the problem is chaotic or I'm doing something wrong.
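For reference, the fit itself is well behaved once the seeds are close. Here is the posted datafile fitted outside gnuplot with scipy's curve_fit (a sketch of the same least-squares problem; the seeds are read off the data, not taken from the original session):

```python
import numpy as np
from scipy.optimize import curve_fit

# The datafile from the question
x = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60], float)
y = np.array([-330.12684910, -332.85109334, -333.85785274, -334.18315783,
              -334.28078907, -334.30486903, -334.30824069, -334.30847874,
              -334.30940105, -334.31091085, -334.31217217])

def f(x, a, b, c):
    return a + b * np.exp(-c * x)

# Seeds read off the data: a is roughly the tail value; at x=10 the
# offset y[0] - a is about 4.18 and shrinks by ~0.35 per 5 units,
# which suggests c around 0.2 and b around 30
p0 = (-334.3, 30.0, 0.2)
popt, _ = curve_fit(f, x, y, p0=p0)
a, b, c = popt
```

With seeds this close to the data, the optimiser never wanders into a region where exp overflows, which is one common cause of the "undefined value" error.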
Thank you for your method. I understand gnuplot uses a nonlinear least-squares method for fitting.
I found one solution is to use the model y=a+b*exp(-c*c*x); I also found better initial values and it worked. Anyway, I have this other dataset:
2 -878.11598213
6 -878.08846509
10 -878.08105262
19 -878.07882425
28 -878.07793702
44 -878.07755010
60 -878.07738151
85 -878.07729504
110 -878.07725107
And gnuplot's fit does the job, but really badly; instead I used your method. Here I show a comparison:
[comparison plot: gnuplot fit vs. JJacquelin's method]
It is way better.
I don't know the algorithm used by gnuplot; probably an iterative method starting from guessed values of the parameters. The difficulty might come from inconvenient initial values and/or from non-convergence of the process.
From my own calculation the result is close to the values below. The method, which is not iterative and doesn't require an initial guess, is explained in the paper: https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
FOR INFORMATION:
The linearisation of the regression is obtained thanks to an integral equation of which the function to be fitted is a solution.
The paper referenced above is mainly written in French. It is partially translated in: https://scikit-guess.readthedocs.io/en/latest/appendices/references.html
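For the curious, the core of the integral-equation idea can be sketched in a few lines of numpy for y = a + b*exp(c*x). Since y' = c*(y - a), integrating from x_1 gives y_k - y_1 = -a*c*(x_k - x_1) + c*S_k, where S_k is the cumulative trapezoidal integral of y; c then falls out of an ordinary linear regression, and a, b from a second one. This is a simplified reading of the paper, not its verbatim algorithm, demonstrated on synthetic data:

```python
import numpy as np

def fit_exp(x, y):
    """Non-iterative fit of y = a + b*exp(c*x) via integral-equation linearisation."""
    # S_k: cumulative trapezoidal integral of y from x[0] to x[k]
    S = np.concatenate(([0.0], np.cumsum(0.5 * (y[1:] + y[:-1]) * np.diff(x))))
    # Regress y_k - y_1 = A*(x_k - x_1) + B*S_k, where A = -a*c and B = c
    M = np.column_stack((x - x[0], S))
    A, B = np.linalg.lstsq(M, y - y[0], rcond=None)[0]
    c = B
    # With c known, y = a + b*exp(c*x) is linear in a and b
    E = np.column_stack((np.ones_like(x), np.exp(c * x)))
    a, b = np.linalg.lstsq(E, y, rcond=None)[0]
    return a, b, c

x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 3.0 * np.exp(-0.5 * x)       # exact data: a=1, b=3, c=-0.5
a, b, c = fit_exp(x, y)                # recovered without any starting guess
```

No iteration and no seeds: the estimates can then be handed to gnuplot's fit as starting values if a final nonlinear refinement is wanted.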
In gnuplot, you can create a histogram like
binwidth=#whatever#
set boxwidth binwidth
bin(x,width)=width*round(x/width)
plot "gaussian.data" u (bin($1,binwidth)):(1.0/10000) smooth freq w boxes
Here, I am interested in a probability histogram, hence the 1.0/10000.
I have spent a lot of time reading the gnuplot documentation on using, and what I understand is that I am telling gnuplot to plot data from gaussian.data using certain values for x and y. In fact, when I open the data file associated with the plot command (obtained by making a temporary file), I see that the y values are 1/10000, as expected. But then the x and y values change; it seems like there's something dynamic about it. I don't quite understand this behavior of using. Could anyone please guide me?
In case anyone else would like further explanation.
http://psy.swansea.ac.uk/staff/carter/gnuplot/gnuplot_frequency.htm
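In case it helps, here is what that using spec plus smooth freq effectively computes, mimicked in numpy (assuming gaussian.data holds 10000 samples, as the 1.0/10000 suggests): using maps every sample to its bin centre paired with the constant weight 1/N, and smooth freq sums the weights over identical x-values, so each box height is the fraction of samples in that bin.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(0.0, 1.0, 10000)        # stand-in for gaussian.data
binwidth = 0.5

# The 'using' stage: one (bin centre, 1/N) pair per sample
centres = binwidth * np.round(data / binwidth)

# The 'smooth freq' stage: sum the 1/N weights over equal bin centres
uniq, counts = np.unique(centres, return_counts=True)
heights = counts / 10000.0                # probability per bin
# heights sums to 1, and the tallest box sits at the bin centred on 0
```

That is the "dynamic" part: the y values really are written as 1/10000 each, and it is smooth freq that afterwards accumulates them per x-value.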
I'm having trouble fitting the following data in gnuplot 4.4
0.0007629768 -0.1256279199 0.0698209297
0.0007565689 0.5667065856 0.0988522507
0.00071274 1.3109126758 0.7766233743
f1(x) = -a1 * x + b1
a1 = 28000
fit f1(x) "56demo.csv" using 1:2:3 via a1, b1
plot "56demo.csv" using 1:2:3 with yerrorbars title "56%", \
f1(x) notitle
This converges to values of a1 and b1 which are higher than I would like.
Several similar tests converge to values in the range in which they should be, but for some reason these don't.
Specifically, I'd like to have
a1 = 28000, approximately.
I'm looking for some way to reach the local minimum I expect. I've tried making the fit limit smaller, but I haven't had much luck that way.
Is it possible to set an upper limit to the values of a1 and b1? That is one way I'd like to try.
Thanks
The most common method of fitting is the chi-square (χ²) method. Chi-square is the expression
χ² = Σ_i ((y_i − f(x_i)) / σ_i)²
where x_i, y_i and σ_i are the data points with error in y, and f(x) is a model function which describes your data. This function has some parameters, and the goal is to find those values of the parameters for which this expression has a global minimum. A program like gnuplot will try several sets of values for these parameters, to find the one set for which χ² is minimal.
In general, several things can go wrong, which usually means that the algorithm has found a local minimum, not the global one. This happens, for example, when the initial values for the parameters are bad. It helps to estimate the initial values as well as possible.
Another problem is when the algorithm uses steps that are too big between the sets of parameter values. This often happens, for example, if you have a very narrow peak on top of a broader peak. Usually, you will end up with a parameter set describing a sum of two identical peaks, which describes the broad peak well and ignores the narrow one. Again, a good set of initial values will help. You may also first keep the peak positions fixed (i.e. not in the via-list in gnuplot) and fit all other parameters, and then fit all parameters in a second command.
But if f(x) is a linear function, these problems do not exist!
You can replace f(x) by m*x+b and do the math. The result is that χ² is a paraboloid in the parameter space with a single, unique minimum, which can also be calculated explicitly.
So, if gnuplot gives you a set of parameters for that data, this result is absolutely correct, even if you don't like that result.
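That explicit calculation is just the weighted normal equations. A small numpy sketch (with illustrative data) showing that the linear χ² has a unique, closed-form minimum, no starting values required:

```python
import numpy as np

# Illustrative data: y = 2x + 1 with Gaussian errors of sigma = 0.5
rng = np.random.default_rng(4)
x = np.linspace(0.0, 10.0, 30)
sigma = np.full_like(x, 0.5)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, x.size)

# chi^2 = sum(((y - (m*x + b)) / sigma)**2) is a paraboloid in (m, b);
# setting its gradient to zero gives the normal equations A^T W A p = A^T W y
A = np.column_stack((x, np.ones_like(x)))
W = np.diag(1.0 / sigma**2)
m, b = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
# m comes out near 2 and b near 1: the unique global minimum,
# found without any iteration
```

Because the solution is obtained by solving a linear system rather than iterating, there is no local minimum to get stuck in, which is the point made above.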
I have an issue with curve fitting process using Gnuplot. I have data with the time starting at 0.5024. I want to use a linear sin/cos combo to fit a value M over time (M=a+bsin(wt)+ccos(wt)). For further processing I only need the c value.
My code is
f(x)=a+b*sin(w*x)+c*cos(w*x)
fit f(x) "data.dat" using 1:2 via a,b,c,w
The asymptotic standard error is 66% for parameter c, which seems quite high. I suspect it has to do with the fact that the time starts at 0.5024 instead of 0. What I could do, of course, is
fit f(x) "data.dat" using ($1-0.5024):2 via a,b,c,w
with an asymptotic error of about 10%, which is way lower. The question is: can I do that? Does my new fit with the time offset still represent the original curve? Any other ideas?
Thanks in advance for your help :-)
It's a bit difficult to answer this without having seen your data, but your observation is typical.
The problem is an effect of the fit itself, or even of your formula. Let me explain it using an example data set. (Well, this will become off-topic...)
A statistics excursus
The data follows the function f(x)=x, and all y-values have been shifted by Gaussian random numbers. In addition, the data is in the x-range [600:800].
You can now simply apply a linear fit f(x)=m*x+b. According to Gaussian error propagation, the error is df(x)=sqrt((dm*x)²+(db)²). So, you can plot the data, the linear function and the error margin f(x) +/- df(x).
Here is the result:
The parameters:
m = 0.981822 +/- 0.1212 (12.34%)
b = 0.974375 +/- 85.13 (8737%)
The correlation matrix:
m b
m 1.000
b -0.997 1.000
You may notice three things:
The error for b is very large!
The error margin is small at x=0, but increases with x. Shouldn't it be smallest where the data is, i.e. at x=700?
The correlation between m and b is -0.997, which is near the maximum (absolute) value of 1.
The third point can be understood from the plot: if you increase the slope m, the y-offset decreases, too. Both parameters are strongly correlated, and an error in one of them propagates to the other!
From statistics you may know that a linear regression line always goes through the center of gravity (cog) of your data. So, let's shift the data so that the cog is at the origin (it would be enough to shift it so that the cog is on the y-axis, but I shifted it completely).
Result:
m = 1.0465 +/- 0.1211 (11.57%)
b = -12.0611 +/- 7.027 (58.26%)
Correlation:
m b
m 1.000
b -0.000 1.000
Compared to the first plot, the value and error of m are almost the same, but the very large error of b is much smaller now. The reason is that m and b are no longer correlated, so a (tiny) variation of m does not produce a (very big) variation of b. It is also nice to see that the error margin has shrunk a lot.
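The decorrelation is easy to reproduce numerically: numpy's polyfit can return the parameter covariance matrix, from which the m-b correlation follows (synthetic data in the same spirit as the example, x in [600:800] and y = x plus Gaussian noise):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(600.0, 800.0, 100)
y = x + rng.normal(0.0, 30.0, x.size)

def line_fit(x, y):
    """Linear fit returning slope, offset and their correlation coefficient."""
    (m, b), C = np.polyfit(x, y, 1, cov=True)
    return m, b, C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])

m1, b1, r1 = line_fit(x, y)             # r1 near -0.997: strongly correlated
m2, b2, r2 = line_fit(x - x.mean(), y)  # cog on the y-axis: r2 near 0
```

Shifting the cog onto the y-axis makes the two fit parameters orthogonal, which is exactly what the correlation matrices above show.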
Here is a last plot with the original data, the first fit function and the "back-shifted function for the shifted data":
About your fit function:
First, there is a big correlation problem: b and c are extremely correlated, as both together define the phase and amplitude of your oscillation. It would help a lot to use another, equivalent function:
f(x)=a+N*sin(w*x+p)
Here, you have phase and amplitude separated. You can still calculate your c from the fit results, and I guess the error on it will be much better.
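The two forms are related by the harmonic-addition identity: b*sin(w*x) + c*cos(w*x) = N*sin(w*x + p) with N = sqrt(b² + c²) and p = atan2(c, b), so c = N*sin(p). A quick numerical check with arbitrary example values:

```python
import numpy as np

a, b, c, w = 1.0, 2.0, -0.5, 3.0      # arbitrary example parameters
N = np.hypot(b, c)                    # amplitude
p = np.arctan2(c, b)                  # phase

x = np.linspace(0.0, 10.0, 200)
f1 = a + b * np.sin(w * x) + c * np.cos(w * x)
f2 = a + N * np.sin(w * x + p)
# f1 and f2 agree to machine precision, and c is recovered as N*sin(p)
```

So after fitting the better-conditioned (N, p) form, c and its error can be propagated back from the fit results.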
As in my example, if the data is far away from the y-axis, a small variation of w will have a big impact on p. So, I would suggest shifting your data so that its cog is on the y-axis, to get almost rid of this effect.
Is this shift allowed?
Yes. You do not alter the data, you simply change your coordinate system to get better errors. Also, the fit function should describe the data, so it should be very accurate in the range where your data is. In my first plot, the highest accuracy is at the y-axis, not where the data is.
Important
You should always note which tricks you applied. Otherwise, someone may check your results, fit the data without the tricks, see the red curve instead of your green one, and accuse you of cheating...
Whether you can do that or not depends on whether the curve you're fitting to represents the physical phenomena you're studying and is consistent with the physical model you need to comply with. My suggestion is that you provide those and ask this question again in a physics forum (or chemistry, biology, etc., depending on your field).
I have been wondering about this for a while, and it might already be implemented in gnuplot but I haven't been able to find info online.
When you have a data file, it is possible to exchange the axes and assign the "dummy variable", say x, (in gnuplot's help terminology) to the vertical axis:
plot "data" u 1:2 # x goes to horizontal axis, standard
plot "data" u 2:1 # x goes to vertical axis, exchanged axes
However, when you have a function, you need to resort to a parametric plot to do this. Imagine you want to plot x = y² (as opposed to y = x²); then (as far as I know) you need to do:
set parametric
plot t**2,t
which works nicely in this case. I think however that a more flexible approach would be desirable, something like
plot x**2 axes y1x1 # this doesn't work!
Is something like the above implemented, or is there an easy way to use y as dummy variable without the need to set parametric?
So here is another ugly, but gnuplot-only variant: Use the special filename '+' to generate a dynamic data set for plotting:
plot '+' using ($1**2):1
The development version contains a new feature, which allows you to use dummy variables instead of column numbers for plotting with '+':
plot sample [y=-10:10] '+' using (y**2):(y)
I guess that's what comes closest to your request.
From what I have seen, parametric plots are pretty common in order to achieve your needs.
If you really hate parametric plots and have no fear of a VERY ugly solution, I can give you my method...
My trick is to use a data file filled with a sequence of numbers. To fit your example, let's make a file sq with a sequence of reals from -10 to 10 :
seq -10 .5 10 > sq
And then you can do the magic you want using gnuplot :
plot 'sq' u ($1**2):($1)
And if you use Linux you can also put the command directly in the plot command:
plot '< seq -10 .5 10' u ($1**2):($1)
I want to add that I'm not proud of this solution and I'd love the "axis y1x1" functionality too.
As far as I know there is no way to simply invert or exchange the axes in gnuplot when plotting a function.
The reason comes from the way functions are plotted in the normal plotting mode. There is a set of points at even intervals along the x axis which are sampled (frequency set by set samples) and the function value computed. This only allows for well-behaved functions; one y-value per x-value.