Normalizing histogram bins in gnuplot

I'm trying to plot a histogram whose bins are normalized by the number of elements in the bin.
I'm using the following
binwidth=5
bin(x,width)=width*floor(x/width) + width/2.0
plot 'file' using (bin($2, binwidth)):($4) smooth freq with boxes
to get a basic histogram, but I want the value of each bin to be divided by the size of the bin. How can I go about this in gnuplot, or using external tools if necessary?

In gnuplot 4.4, functions gained a new property: they can evaluate multiple successive expressions and then return a value (see gnuplot tricks). This means that you can actually calculate the number of points, n, within the gnuplot script without having to know it in advance. The following code runs on a file, "out.dat", containing one column: a list of n samples from a normal distribution:
binwidth = 0.1
set boxwidth binwidth
sum = 0
s(x) = ((sum=sum+1), 0)
bin(x, width) = width*floor(x/width) + width/2.0
plot "out.dat" u ($1):(s($1))
plot "out.dat" u (bin($1, binwidth)):(1.0/(binwidth*sum)) smooth freq w boxes
The first plot statement reads through the datafile and increments sum once for each point, plotting a zero.
The second plot statement actually uses the value of sum to normalise the histogram.

In gnuplot 4.6, you can count the number of points with the stats command, which is faster than a dummy plot. You do not need the s(x)=((sum=sum+1),0) trick; instead, read the count from the variable STATS_records after running stats 'out.dat' u 1.
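For example, a minimal sketch of the stats variant (same "out.dat" as above):
binwidth = 0.1
set boxwidth binwidth
bin(x, width) = width*floor(x/width) + width/2.0
stats "out.dat" u 1 nooutput    # fills STATS_records (and min/max etc.) without plotting
n = STATS_records
plot "out.dat" u (bin($1, binwidth)):(1.0/(binwidth*n)) smooth freq w boxes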

Here is how I would do it, with n=500 random Gaussian variates generated from R with the following command:
Rscript -e 'cat(rnorm(500), sep="\n")' > rnd.dat
I use much the same idea as yours for defining a normalized histogram, where y is defined as 1/(binwidth * n), except that I use int instead of floor and I didn't recenter at the bin value. In short, this is a quick adaptation of the smooth.dem demo script, and a similar approach is described in Janert's textbook, Gnuplot in Action (Chapter 13, p. 257, freely available). You can replace my sample data file with the random-points file available in the demo folder that ships with Gnuplot. Note that we need to specify the number of points, as Gnuplot has no counting facilities for records in a file.
bw1=0.1
bw2=0.3
n=500
bin(x,width)=width*int(x/width)
set xrange [-3:3]
set yrange [0:1]
tstr(n)=sprintf("Binwidth = %1.1f\n", n)
set multiplot layout 1,2
set boxwidth bw1
plot 'rnd.dat' using (bin($1,bw1)):(1./(bw1*n)) smooth frequency with boxes t tstr(bw1)
set boxwidth bw2
plot 'rnd.dat' using (bin($1,bw2)):(1./(bw2*n)) smooth frequency with boxes t tstr(bw2)
Here is the result, with two bin widths:
Besides, this really is a rough approach to histograms, and more elaborate solutions are readily available in R. Indeed, the problem is how to define a good bin width, and this issue has already been discussed on stats.stackexchange.com: implementing the Freedman-Diaconis binning rule should not be too difficult, although you'll need to compute the inter-quartile range.
Here is how R would proceed with the same data set, first with the default option (the Sturges rule; in this particular case it won't make a difference) and then with equally spaced bins like the ones used above.
The R code that was used is given below:
rnd <- scan("rnd.dat") # read the samples generated above
par(mfrow=c(1,2), las=1)
hist(rnd, main="Sturges", xlab="", ylab="", prob=TRUE)
hist(rnd, breaks=seq(-3.5,3.5,by=.1), main="Binwidth = 0.1",
xlab="", ylab="", prob=TRUE)
You can even look at how R does its job, by inspecting the values returned when calling hist():
> str(hist(rnd, plot=FALSE))
List of 7
$ breaks : num [1:14] -3.5 -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 ...
$ counts : int [1:13] 1 1 12 20 49 79 108 87 71 43 ...
$ intensities: num [1:13] 0.004 0.004 0.048 0.08 0.196 0.316 0.432 0.348 0.284 0.172 ...
$ density : num [1:13] 0.004 0.004 0.048 0.08 0.196 0.316 0.432 0.348 0.284 0.172 ...
$ mids : num [1:13] -3.25 -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 ...
$ xname : chr "rnd"
$ equidist : logi TRUE
- attr(*, "class")= chr "histogram"
All that to say that you can use R's results to process your data with Gnuplot if you like (although I would recommend using R directly :-).
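If you do go that route, gnuplot only needs the binned output. For instance, assuming you had R write the midpoints and densities to a two-column file, say with write.table(cbind(h$mids, h$density), "hist.dat", row.names=FALSE, col.names=FALSE), the plot is then a one-liner (hist.dat is a hypothetical name here):
set boxwidth 0.1    # match the bin width used on the R side
plot "hist.dat" using 1:2 with boxes notitle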

Another way of counting the number of data points in a file is by using a system command. This proves useful if you are plotting multiple files, and you don't know the number of points beforehand. I used:
countpoints(file) = system( sprintf("grep -v '^#' %s| wc -l", file) )
file1count = countpoints (file1)
file2count = countpoints (file2)
file3count = countpoints (file3)
...
The countpoints function avoids counting lines that start with '#'. You would then use the already mentioned functions to plot the normalized histogram.
Here's a complete example:
n=100
xmin=-50.
xmax=50.
binwidth=(xmax-xmin)/n
bin(x,width)=width*floor(x/width)+width/2.0
countpoints(file) = system( sprintf("grep -v '^#' %s| wc -l", file) )
file1count = countpoints (file1)
file2count = countpoints (file2)
file3count = countpoints (file3)
plot file1 using (bin(($1),binwidth)):(1.0/(binwidth*file1count)) smooth freq with boxes,\
file2 using (bin(($1),binwidth)):(1.0/(binwidth*file2count)) smooth freq with boxes,\
file3 using (bin(($1),binwidth)):(1.0/(binwidth*file3count)) smooth freq with boxes
...

Simply
plot 'file' using (bin($2, binwidth)):($4/$4) smooth freq with boxes
($4/$4 evaluates to 1 for each point, so smooth freq then counts the points per bin.)

Related

Using the correlation matrix after a fit in Gnuplot

Say I need to fit some data to a parabola, and then perform some calculations involving the correlation matrix elements of the fit parameters: is there a way to use these elements directly in gnuplot after the fit converges? Are they stored in some variable, like the error estimates?
I quote the explicit problem I'm having. All of this is written to a plot.gp text file and ran with gnuplot plot.gp.
I include set fit errorvariables at the beginning, and then proceed with:
f(x)=a+b*x+c*x*x
fit f(x) 'file.dat' u 1:2:3 yerrors via a,b,c
Once the fit is done, I can use the values of a,b,c and their errors a_err, b_err and c_err directly in the plot.gp script; my question is: can I do the same with the correlation matrix of the parameters?
The problem is that the matrix is printed to the terminal once the script finishes running:
correlation matrix of the fit parameters:
a b c
a 1.000
b 0.910 1.000
c -0.956 -0.987 1.000
Are the entries of the matrix stored in some variable (like a_err, b_err) that I can access after the fit is done but before the script ends?
I think the command you are looking for is
set fit covariancevariables
If the `covariancevariables` option is turned on, the covariances between
final parameters will be saved to user-defined variables. The variable name
for a certain parameter combination is formed by prepending "FIT_COV_" to
the name of the first parameter and combining the two parameter names by
"_". For example given the parameters "a" and "b" the covariance variable is
named "FIT_COV_a_b".
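For example, a minimal sketch applying this to the parabola fit from the question (the division by a_err and b_err, provided by set fit errorvariables, recovers the printed correlation coefficient from the stored covariance):
set fit errorvariables       # provides a_err, b_err, c_err after the fit
set fit covariancevariables  # provides FIT_COV_a_b, FIT_COV_a_c, FIT_COV_b_c
f(x) = a + b*x + c*x*x
fit f(x) 'file.dat' u 1:2:3 yerrors via a,b,c
print FIT_COV_a_b                   # covariance of a and b
print FIT_COV_a_b/(a_err*b_err)     # corresponding entry of the correlation matrix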
Edit: I certainly missed gnuplot's intended way via the option covariancevariables (apparently available since gnuplot 5.0). Ethan's answer is the way to go. I nevertheless leave my answer; with some modifications it might still be useful for extracting something else from the fit output.
Maybe I missed it, but I am not aware that you can directly store the elements of the correlation matrix in variables; however, you can do it with a workaround.
You can set an output file for your fit results (check help set fit). The shortest output is created with the option results. The results will be written to this file (actually appended, if the file already exists).
Example:
After 5 iterations the fit converged.
final sum of squares of residuals : 0.45
rel. change during last iteration : -3.96255e-10
degrees of freedom (FIT_NDF) : 1
rms of residuals (FIT_STDFIT) = sqrt(WSSR/ndf) : 0.67082
variance of residuals (reduced chisquare) = WSSR/ndf : 0.45
Final set of parameters Asymptotic Standard Error
======================= ==========================
a = 1.75 +/- 0.3354 (19.17%)
b = -2.65 +/- 1.704 (64.29%)
c = 1.75 +/- 1.867 (106.7%)
correlation matrix of the fit parameters:
a b c
a 1.000
b -0.984 1.000
c 0.898 -0.955 1.000
Now you can read this file back into a datablock (check "gnuplot: load datafile 1:1 into datablock") and extract the values from the last lines (here: 3); check help word and help real.
Script:
### get fit correlation matrix into variables
reset session
$Data <<EOD
1 1
2 3
3 10
4 19
EOD
f(x) = a*x**2 + b*x + c
myFitFILE = "SO71788523_fit.dat"
set fit results logfile myFitFILE
fit f(x) $Data u 1:2 via a,b,c
set key top left
set grid x,y
# load file 1:1 into datablock
FileToDatablock(f,d) = GPVAL_SYSNAME[1:7] eq "Windows" ? \
sprintf('< echo %s ^<^<EOD & type "%s"',d,f) : \
sprintf('< echo "\%s <<EOD" ; cat "%s"',d,f) # Linux/MacOS
load FileToDatablock(myFitFILE,'$FIT')
# extract parameters into variables
N = 3 # number of parameters
getValue(p1,p2) = real(word($FIT[|$FIT|-N+p1],p2+1)) # extract value as floating point number
aa = getValue(1,1)
ba = getValue(2,1)
bb = getValue(2,2)
ca = getValue(3,1)
cb = getValue(3,2)
cc = getValue(3,3)
set label 1 at graph 0.1,graph 0.8 \
sprintf("Correlation matrix:\naa: %g\nba: %g\nbb: %g\nca: %g\ncb: %g\ncc: %g",aa,ba,bb,ca,cb,cc)
plot $Data u 1:2 w lp pt 7 lc "red", \
f(x) w l lc "blue" title sprintf("fit: a=%g, b=%g, c=%g",a,b,c)
### end of script
Result:

Reading right ascension/declination coordinates in gnuplot

I have a two column file with right ascension/declination coordinates:
18:42:21.8 -23:04:52
20:55:00.8 -17:23:19
I can read the first column by specifying the data as 'timefmt', but it seems there is no way to do a similar reading for angular data. I could, of course, delete the :'s and plot ($1+$2/60+$3/3600), but I wonder if there is a more elegant way.
You can define a function that does the job for you, which might be a bit more convenient and shorter in the plot command.
Convert your hours, minutes, seconds or degrees, minutes, seconds into seconds via strptime() or timecolumn(). In the gnuplot console, check help strptime, help timecolumn and help time_specifiers. Use %tH:%tM:%tS, not %H:%M:%S.
However, you have to be careful how gnuplot interprets negative times:
if your input time is, for example, -00:17:56.7, gnuplot will interpret it as +00:17:56.7, which is not what you expect. Apparently -00 is equal to +00, and hence the 17 is interpreted as positive although you intended it to be negative. A workaround for this special case is the following:
Create a function myTimeSign(s) that returns -1 if the hours are 0 and the first character of the time string is '-', and 1 otherwise.
myTimeSign(s) = strptime("%tH",s)==0 && s[1:1] eq '-' ? -1 : 1
Multiply this with your time. This works as a workaround here, but not in general.
Update:
This has been reported as bug (https://sourceforge.net/p/gnuplot/bugs/2245/) and is already fixed in the development version of gnuplot.
Code:
### time / angle conversion
reset session
set size square
set object 1 rect from graph 0,0 to graph 1,1 fc rgb "black"
$Orion <<EOD
05:55:10.29 +07:24:25.3 0.42 Betelgeuse
05:14:32.27 -08:12:05.9 0.18 Rigel
05:25:07.87 +06:20:59.0 1.64 Bellatrix
05:32:00.40 -00:17:56.7 2.20 Mintaka
05:36:12.81 -01:12:06.9 1.69 Alnilam
05:40:45.52 -01:56:33.3 1.88 Alnitak
05:47:45.39 -09:40:10.6 2.07 Saiph
05:35:08.28 +09:56:03.0 3.47 Meissa
EOD
myTimeFmt = "%tH:%tM:%tS"
RA(n) = timecolumn(n,myTimeFmt)
myTimeSign(s) = strptime("%tH",s)==0 && s[1:1] eq '-' ? -1 : 1 # returns -1 if hours are -00
Dec(n) = timecolumn(n,myTimeFmt)*myTimeSign(strcol(n))
set xrange[strptime(myTimeFmt,"06:12"):strptime(myTimeFmt,"05:00")] reverse
set format x "%H^h%M^m" time
set yrange[strptime(myTimeFmt,"-12:00"):strptime(myTimeFmt,"+12:00")]
set format y "%tH°%tM'" time
set tics out
plot $Orion u (RA(1)):(Dec(2)):(-log10($3)+1.5) w p pt 7 ps var lc rgb "yellow" notitle
### end of code
Result:

Plotting Average curve for points in gnuplot

[Current]
I am importing a text file in which the first column has simulation time (0~150) the second column has the delay (0.01~0.02).
1.000000 0.010007
1.000000 0.010010
2.000000 0.010013
2.000000 0.010016
.
.
.
149.000000 0.010045
149.000000 0.010048
150.000000 0.010052
150.000000 0.010055
which gives me the plot:
[Desired]
I need to plot an average line on it, as shown with the red line in the following image:
Here is a gnuplot only solution with sample data:
set table "test.data"
set samples 1000
plot rand(0)+sin(x)
unset table
You should check the gnuplot demo page for a running average. I'm going to generalize this demo in terms of dynamically building the functions. This makes it much easier to change the number of points included in the average.
This is the script:
# number of points in moving average
n = 50
# initialize the variables
do for [i=1:n] {
eval(sprintf("back%d=0", i))
}
# build shift function (back_n = back_n-1, ..., back1=x)
shift = "("
do for [i=n:2:-1] {
shift = sprintf("%sback%d = back%d, ", shift, i, i-1)
}
shift = shift."back1 = x)"
# uncomment the next line for a check
# print shift
# build sum function (back1 + ... + backn)
sum = "(back1"
do for [i=2:n] {
sum = sprintf("%s+back%d", sum, i)
}
sum = sum.")"
# uncomment the next line for a check
# print sum
# define the functions like in the gnuplot demo
# use macro expansion for turning the strings into real functions
samples(x) = $0 > (n-1) ? n : ($0+1)
avg_n(x) = (shift_n(x), @sum/samples($0))
shift_n(x) = @shift
# the final plot command looks quite simple
set terminal pngcairo
set output "moving_average.png"
plot "test.data" using 1:2 w l notitle, \
"test.data" using 1:(avg_n($2)) w l lc rgb "red" lw 3 title "avg\\_".n
This is the result:
The average lags quite a bit behind the datapoints as expected from the algorithm. Maybe 50 points are too many. Alternatively, one could think about implementing a centered moving average, but this is beyond the scope of this question.
And, I also think that you are more flexible with an external program :)
Here's some replacement code for the top answer, which makes this also work for 1000+ points and much, much faster. It probably only works in gnuplot 5.2 and later.
# number of points in moving average
n = 5000
array A[n]
samples(x) = $0 > (n-1) ? n : int($0+1)
mod(x) = int(x) % n
avg_n(x) = (A[mod($0)+1]=x, (sum [i=1:samples($0)] A[i]) / samples($0))
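Since these definitions are a drop-in replacement, the plot command from that answer stays the same (assuming the same "test.data" file):
plot "test.data" using 1:2 w l notitle, \
"test.data" using 1:(avg_n($2)) w l lc rgb "red" lw 3 title "avg\\_".n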
Edit
The updated question is about a moving average.
You can do this in a limited way with gnuplot alone, according to this demo.
But in my opinion, it would be more flexible to pre-process your data using a programming language like python or ruby and add an extra column for whatever kind of moving average you require.
The original answer is preserved below:
You can use fit. It seems you want to fit to a constant function. Like this:
f(x) = c
fit f(x) 'S1_delay_120_LT100_LU15_MU5.txt' using 1:2 every 5 via c
Then you can plot them both.
plot 'S1_delay_120_LT100_LU15_MU5.txt' using 1:2 every 5, \
f(x) with lines
Note that this technique can be used with arbitrary functions, not just constant or linear ones.
I wanted to comment on Franky_GT's answer, but somehow Stack Overflow didn't let me.
However, Franky_GT, your answer works great!
A note for people plotting .xvg files (e.g. after analysis of MD simulations): if you don't add the following line:
set datafile commentschars "##&"
Franky_GT's moving average code will result in this error:
unknown type in imag()
I hope this is of use to anyone.
For gnuplot >=5.2, probably the most efficient solution is using an array, like @Franky_GT's solution above.
However, it uses the pseudocolumn 0 (see help pseudocolumns). In case you have some empty lines in your data, $0 will be reset to 0, which eventually might mess up your average.
This solution uses an index t to count up the data lines, and a second array X[] in case a centered moving average is desired. Data points don't have to be equidistant in x.
At the beginning there are not enough data points for a centered average of N points, so for the x-value it uses every second point and the others are NaN; that's why set datafile missing NaN is necessary to plot a connected line at the beginning.
Code:
### moving average over N points
reset session
# create some test data
set print $Data
y = 0
do for [i=1:5000] {
print sprintf("%g %g", i, y=y+rand(0)*2-1)
}
set print
# average over N values
N = 250
array Avg[N]
array X[N]
MovAvg(col) = (Avg[(t-1)%N+1]=column(col), n = t<N ? t : N, t=t+1, (sum [i=1:n] Avg[i])/n)
MovAvgCenterX(col) = (X[(t-1)%N+1]=column(col), n = t<N ? t%2 ? NaN : (t+1)/2 : ((t+1)-N/2)%N+1, n==n ? X[n] : NaN) # be aware: gnuplot does integer division here
set datafile missing NaN
plot $Data u 1:2 w l ti "Data", \
t=1 '' u 1:(MovAvg(2)) w l lc rgb "red" ti sprintf("Moving average over %d",N), \
t=1 '' u (MovAvgCenterX(1)):(MovAvg(2)) w l lw 2 lc rgb "green" ti sprintf("Moving average centered over %d",N)
### end of code
Result:

Discrete heat map with GNUPLOT

I'm trying to make something like a heat map with GNUPLOT, but I need my palette to take discrete colors for defined values.
I mean, my data file has three columns, for example:
x y value
0.0 0.0 10
0.0 0.5 2
0.0 1.0 2
0.5 1.0 10
1.0 0.0 -1
1.0 1.0 -1
I need each point to have one color depending on its value. A traditional heat map blends points, making regions of continuous color, but I need it in discrete form.
If your data forms a "matrix", i.e., there are M x-samples, N y-samples, and you have the data for all MxN points, then probably the easiest solution is to use
plot ... w rgbimage u 1:2:(r($3)):(g($3)):(b($3))
and supply the r,g,b values as three additional columns computed from the value column, as sketched below.
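For example, a minimal sketch of that variant (the channel functions r, g, b here are hypothetical; you would define them to map your value range onto 0-255):
r(v) = v > 0 ? 255 : 64     # hypothetical mapping: positive values red
g(v) = 0
b(v) = v < 0 ? 255 : 0      # negative values blue
plot 'data' u 1:2:(r($3)):(g($3)):(b($3)) w rgbimage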
However, if your data is "sparse" (only some of the samples are available as shown in your question) and there are not many points, one might be tempted to generate the elementary squares forming the plot manually. To this end, one could proceed as:
set terminal png enhanced
set output 'plot.png'
#custom value -> color mapping
rgb(r, g, b) = 65536 * int(r) + 256 * int(g) + int(b)
fn(val) = rgb(100 + val*10, 0, 0)
#square size
delta = 0.5
set xr [-delta/2:1+delta/2]
set yr [-delta/2:1+delta/2]
set xtics 0,delta/2,1 out nomirror
set ytics 0,delta/2,1 out nomirror
set format x "%.2f"
set format y "%.2f"
set size ratio 1
unset key
fName="test.dat"
load sprintf("<gawk -v d=%f -f parse.awk %s", delta, fName)
plot fName u 1:2:3 w labels tc rgb 'white'
This script assumes the presence of auxiliary gawk script parse.awk in the same directory:
{
printf "set object rectangle from %f,%f to %f,%f fc rgb fn(%d) fs solid\n",
$1-d/2, $2-d/2, $1+d/2, $2+d/2, $3
}
This script accepts the required square size (-v d=%f in the invocation of gawk) and generates, for each point, a statement that draws the corresponding square. These statements are subsequently executed by the load command.
Mapping of the colors is done via the function fn defined in the main Gnuplot script. It takes the passed value and generates a rgb value which is then used with fc rgb in the rectangle specification.
Together, this then produces:
This might do what you want, after some fiddling:
set view map
set style fill transparent solid noborder
splot 'data' u 1:2:3:(100+200*$3) pt 5 lc rgbcolor var ps 14
The pt 5 will plot a square (at least in the x11 terminal) at each point in the datafile, colored according to a transformation of the last column.

Histogram using gnuplot?

I know how to create a histogram (just use "with boxes") in gnuplot if my .dat file already has properly binned data. Is there a way to take a list of numbers and have gnuplot provide a histogram based on ranges and bin sizes the user provides?
Yes, and it's quick and simple, though very well hidden:
binwidth=5
bin(x,width)=width*floor(x/width)
plot 'datafile' using (bin($1,binwidth)):(1.0) smooth freq with boxes
Check out help smooth freq to see why the above makes a histogram.
To deal with ranges, just set the xrange variable.
I have a couple of corrections/additions to Born2Smile's very useful answer:
Empty bins caused the box for the adjacent bin to incorrectly extend into their space; avoid this using set boxwidth binwidth.
In Born2Smile's version, bins are rendered as centered on their lower bound. Strictly, they ought to extend from the lower bound to the upper bound. This can be corrected by modifying the bin function: bin(x,width)=width*floor(x/width) + width/2.0. Both fixes are combined in the sketch below.
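Putting both corrections together gives, for the same datafile:
binwidth=5
set boxwidth binwidth
bin(x,width)=width*floor(x/width) + width/2.0
plot 'datafile' using (bin($1,binwidth)):(1.0) smooth freq with boxes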
Be very careful: all of the answers on this page are implicitly taking the decision of where the binning starts - the left-hand edge of the left-most bin, if you like - out of the user's hands. If the user is combining any of these functions for binning data with his/her own decision about where binning starts (as is done on the blog which is linked to above) the functions above are all incorrect. With an arbitrary starting point for binning 'Min', the correct function is:
bin(x) = width*(floor((x-Min)/width)+0.5) + Min
You can see why this is correct sequentially (it helps to draw a few bins and a point somewhere in one of them). Subtract Min from your data point to see how far into the binning range it is. Then divide by binwidth so that you're effectively working in units of 'bins'. Then 'floor' the result to go to the left-hand edge of that bin, add 0.5 to go to the middle of the bin, multiply by the width so that you're no longer working in units of bins but in an absolute scale again, then finally add back on the Min offset you subtracted at the start.
Consider this function in action:
Min = 0.25 # where binning starts
Max = 2.25 # where binning ends
n = 2 # the number of bins
width = (Max-Min)/n # binwidth; evaluates to 1.0
bin(x) = width*(floor((x-Min)/width)+0.5) + Min
e.g. the value 1.1 truly falls in the left bin:
this function correctly maps it to the centre of the left bin (0.75);
Born2Smile's answer, bin(x)=width*floor(x/width), incorrectly maps it to 1;
mas90's answer, bin(x)=width*floor(x/width) + width/2.0, incorrectly maps it to 1.5.
Born2Smile's answer is only correct if the bin boundaries occur at (n+0.5)*binwidth (where n runs over integers). mas90's answer is only correct if the bin boundaries occur at n*binwidth.
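A quick check in gnuplot confirms these numbers (the helper names below are just for this comparison):
Min = 0.25
width = 1.0
bin_chrisw(x) = width*(floor((x-Min)/width)+0.5) + Min
bin_b2s(x)    = width*floor(x/width)
bin_mas90(x)  = width*floor(x/width) + width/2.0
print bin_chrisw(1.1), bin_b2s(1.1), bin_mas90(1.1)   # 0.75, 1.0, 1.5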
Do you want to plot a graph like this one?
yes? Then you can have a look at my blog article: http://gnuplot-surprising.blogspot.com/2011/09/statistic-analysis-and-histogram.html
Key lines from the code:
n=100 #number of intervals
max=3. #max value
min=-3. #min value
width=(max-min)/n #interval width
#function used to map a value to the intervals
hist(x,width)=width*floor(x/width)+width/2.0
set boxwidth width*0.9
set style fill solid 0.5 # fill style
#count and plot
plot "data.dat" u (hist($1,width)):(1.0) smooth freq w boxes lc rgb"green" notitle
As usual, Gnuplot is a fantastic tool for plotting sweet-looking graphs, and it can be made to perform all sorts of calculations. However, it is intended to plot data rather than to serve as a calculator, and it is often easier to use an external programme (e.g. Octave) for the more "complicated" calculations, save the results in a file, then use Gnuplot to produce the graph. For the above problem, check out the "hist" function in Octave using [freq,bins]=hist(data), then plot this in Gnuplot using
set style histogram rowstacked gap 0
set style fill solid 0.5 border lt -1
plot "./data.dat" smooth freq with boxes
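Alternatively, if you have Octave write the bin centres and counts as two columns to a file (the name hist.dat below is hypothetical), the data is already binned, so a plain boxes plot without smooth freq is enough:
set boxwidth 0.9 relative
set style fill solid 0.5 border lt -1
plot "hist.dat" using 1:2 with boxes notitle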
I have found this discussion extremely useful, but I have experienced some "rounding off" problems.
More precisely, using a binwidth of 0.05, I have noticed that, with the techniques presented above, data points which read 0.1 and 0.15 fall in the same bin. This (obviously unwanted) behaviour is most likely due to the "floor" function.
Hereafter is my small contribution to try to circumvent this.
bin(x,width,n)=x<=n*width? width*(n-1) + 0.5*width:bin(x,width,n+1)
binwidth = 0.05
set boxwidth binwidth
plot "data.dat" u (bin($1,binwidth,1)):(1.0) smooth freq with boxes
This recursive method is for x >= 0; one could generalise it with more conditional statements to handle arbitrary data.
We do not need to use a recursive method; it may be slow. My solution is to use a user-defined function rint instead of the intrinsic function int or floor.
rint(x)=(x-int(x)>0.9999)?int(x)+1:int(x)
This function will give rint(0.0003/0.0001)=3, while int(0.0003/0.0001)=floor(0.0003/0.0001)=2.
Why? Please look at Perl int function and padding zeros
I have a little modification to Born2Smile's solution.
I know it doesn't make much sense, but you may want it just in case. If your data is integer and you need a float bin size (maybe for comparison with another data set, or to plot density on a finer grid), you will need to add a random number between 0 and 1 inside floor. Otherwise, there will be spikes due to round-off error. floor(x/width+0.5) will not do, because it will create a pattern that is not true to the original data.
binwidth=0.3
bin(x,width)=width*floor(x/width+rand(0))
With respect to binning functions, I didn't expect the results that the functions offered so far produce. Namely, if my binwidth is 0.001, these functions were centering the bins on 0.0005 points, whereas I feel it's more intuitive to have the bins centered on 0.001 boundaries.
In other words, I'd like to have
Bin 0.001 contains data from 0.0005 to 0.0014
Bin 0.002 contains data from 0.0015 to 0.0024
...
The binning function I came up with is
my_bin(x,width) = width*(floor(x/width+0.5))
Here's a script to compare some of the offered bin functions to this one:
rint(x) = (x-int(x)>0.9999)?int(x)+1:int(x)
bin(x,width) = width*rint(x/width) + width/2.0
binc(x,width) = width*(int(x/width)+0.5)
mitar_bin(x,width) = width*floor(x/width) + width/2.0
my_bin(x,width) = width*(floor(x/width+0.5))
binwidth = 0.001
data_list = "-0.1386 -0.1383 -0.1375 -0.0015 -0.0005 0.0005 0.0015 0.1375 0.1383 0.1386"
my_line = sprintf("%7s %7s %7s %7s %7s","data","bin()","binc()","mitar()","my_bin()")
print my_line
do for [i in data_list] {
iN = i + 0
my_line = sprintf("%+.4f %+.4f %+.4f %+.4f %+.4f",iN,bin(iN,binwidth),binc(iN,binwidth),mitar_bin(iN,binwidth),my_bin(iN,binwidth))
print my_line
}
and here's the output
data bin() binc() mitar() my_bin()
-0.1386 -0.1375 -0.1375 -0.1385 -0.1390
-0.1383 -0.1375 -0.1375 -0.1385 -0.1380
-0.1375 -0.1365 -0.1365 -0.1375 -0.1380
-0.0015 -0.0005 -0.0005 -0.0015 -0.0010
-0.0005 +0.0005 +0.0005 -0.0005 +0.0000
+0.0005 +0.0005 +0.0005 +0.0005 +0.0010
+0.0015 +0.0015 +0.0015 +0.0015 +0.0020
+0.1375 +0.1375 +0.1375 +0.1375 +0.1380
+0.1383 +0.1385 +0.1385 +0.1385 +0.1380
+0.1386 +0.1385 +0.1385 +0.1385 +0.1390
Different numbers of bins on the same dataset can reveal different features of the data.
Unfortunately, there is no universal best method for determining the number of bins.
One of the more powerful methods is the Freedman–Diaconis rule, which automatically determines the number of bins from statistics of the given dataset, among many other alternatives.
Accordingly, the following can be used to apply the Freedman–Diaconis rule in a gnuplot script:
Say you have a file containing a single column of samples, samplesFile:
# samples
0.12345
1.23232
...
The following (which is based on ChrisW's answer) may be embedded into an existing gnuplot script:
...
## preceding gnuplot commands
...
#
samples = "samplesFile"
stats samples nooutput
N = floor(STATS_records)
samplesMin = STATS_min
samplesMax = STATS_max
# Freedman–Diaconis formula for bin-width size estimation
lowQuartile = STATS_lo_quartile
upQuartile = STATS_up_quartile
IQR = upQuartile - lowQuartile
width = 2*IQR/(N**(1.0/3.0))
bin(x) = width*(floor((x-samplesMin)/width)+0.5) + samplesMin
plot \
samples u (bin($1)):(1.0/(N*width)) t "Output" w l lw 1 smooth freq
