Histogram using gnuplot? - gnuplot

I know how to create a histogram (just use "with boxes") in gnuplot if my .dat file already has properly binned data. Is there a way to take a list of numbers and have gnuplot provide a histogram based on ranges and bin sizes the user provides?

Yes, and it's quick and simple, though very hidden:
binwidth=5
bin(x,width)=width*floor(x/width)
plot 'datafile' using (bin($1,binwidth)):(1.0) smooth freq with boxes
Check out help smooth freq to see why the above makes a histogram.
To deal with ranges, just set the xrange variable.
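For example, to count only the values between 0 and 100 (a sketch; points outside the range are dropped before the frequency count):
set xrange [0:100]
plot 'datafile' using (bin($1,binwidth)):(1.0) smooth freq with boxes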

I have a couple of corrections/additions to Born2Smile's very useful answer:
Empty bins caused the box for the adjacent bin to incorrectly extend into its space; avoid this using set boxwidth binwidth
In Born2Smile's version, bins are rendered as centered on their lower bound. Strictly, they ought to extend from the lower bound to the upper bound. This can be corrected by modifying the bin function (both fixes are combined in the snippet below): bin(x,width)=width*floor(x/width) + width/2.0
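Putting the two corrections together, a minimal sketch (assuming the same one-column datafile as above):
binwidth=5
set boxwidth binwidth
bin(x,width)=width*floor(x/width) + width/2.0
plot 'datafile' using (bin($1,binwidth)):(1.0) smooth freq with boxes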

Be very careful: all of the answers on this page are implicitly taking the decision of where the binning starts - the left-hand edge of the left-most bin, if you like - out of the user's hands. If the user combines any of these binning functions with his/her own decision about where binning starts (as is done on the blog which is linked to above), the functions above are all incorrect. With an arbitrary starting point for binning, Min, the correct function is:
bin(x) = width*(floor((x-Min)/width)+0.5) + Min
You can see why this is correct sequentially (it helps to draw a few bins and a point somewhere in one of them). Subtract Min from your data point to see how far into the binning range it is. Then divide by binwidth so that you're effectively working in units of 'bins'. Then 'floor' the result to go to the left-hand edge of that bin, add 0.5 to go to the middle of the bin, multiply by the width so that you're no longer working in units of bins but in an absolute scale again, then finally add back on the Min offset you subtracted at the start.
Consider this function in action:
Min = 0.25 # where binning starts
Max = 2.25 # where binning ends
n = 2 # the number of bins
width = (Max-Min)/n # binwidth; evaluates to 1.0
bin(x) = width*(floor((x-Min)/width)+0.5) + Min
e.g. the value 1.1 truly falls in the left bin:
this function correctly maps it to the centre of the left bin (0.75);
Born2Smile's answer, bin(x)=width*floor(x/width), incorrectly maps it to 1;
mas90's answer, bin(x)=width*floor(x/width) + width/2.0, incorrectly maps it to 1.5.
Born2Smile's answer is only correct if the bin boundaries occur at (n+0.5)*binwidth (where n runs over integers). mas90's answer is only correct if the bin boundaries occur at n*binwidth.
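You can check these mappings directly at the gnuplot prompt (a minimal sketch of the example above):
Min = 0.25 ; Max = 2.25 ; n = 2
width = (Max-Min)/n
bin(x) = width*(floor((x-Min)/width)+0.5) + Min
print bin(1.1)    # 0.75, the centre of the left bin [0.25:1.25]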

Do you want to plot a graph like this one?
Yes? Then you can have a look at my blog article: http://gnuplot-surprising.blogspot.com/2011/09/statistic-analysis-and-histogram.html
Key lines from the code:
n=100 #number of intervals
max=3. #max value
min=-3. #min value
width=(max-min)/n #interval width
#function used to map a value to the intervals
hist(x,width)=width*floor(x/width)+width/2.0
set boxwidth width*0.9
set style fill solid 0.5 # fill style
#count and plot
plot "data.dat" u (hist($1,width)):(1.0) smooth freq w boxes lc rgb"green" notitle

As usual, Gnuplot is a fantastic tool for plotting sweet-looking graphs, and it can be made to perform all sorts of calculations. However, it is intended to plot data rather than to serve as a calculator, and it is often easier to use an external programme (e.g. Octave) to do the more "complicated" calculations, save the data in a file, and then use Gnuplot to produce the graph. For the above problem, check out the "hist" function in Octave using [freq,bins]=hist(data), then plot this in Gnuplot using
set style histogram rowstacked gap 0
set style fill solid 0.5 border lt -1
plot "./data.dat" smooth freq with boxes

I have found this discussion extremely useful, but I have experienced some "rounding off" problems.
More precisely, using a binwidth of 0.05, I have noticed that, with the techniques presented here above, data points which read 0.1 and 0.15 fall in the same bin. This (obviously unwanted behaviour) is most likely due to the "floor" function.
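You can reproduce the effect at the gnuplot prompt (a sketch; exact results depend on the platform's floating-point rounding):
binwidth = 0.05
bin(x,width)=width*floor(x/width)
print bin(0.10, binwidth)   # 0.1
print bin(0.15, binwidth)   # also 0.1, because 0.15/0.05 evaluates to 2.999...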
Hereafter is my small contribution to try to circumvent this.
bin(x,width,n) = x<=n*width ? width*(n-1) + 0.5*width : bin(x,width,n+1)
binwidth = 0.05
set boxwidth binwidth
plot "data.dat" u (bin($1,binwidth,1)):(1.0) smooth freq with boxes
This recursive method is for x >= 0; one could generalise it with more conditional statements to handle arbitrary data.

We do not need to use a recursive method; it may be slow. My solution uses a user-defined function rint instead of the intrinsic functions int or floor.
rint(x)=(x-int(x)>0.9999)?int(x)+1:int(x)
This function will give rint(0.0003/0.0001)=3, while int(0.0003/0.0001)=floor(0.0003/0.0001)=2.
Why? Please look at Perl int function and padding zeros
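As a sketch of how rint slots into the binning recipes above (the print lines reproduce the values quoted):
rint(x)=(x-int(x)>0.9999)?int(x)+1:int(x)
bin(x,width)=width*rint(x/width) + width/2.0
print int(0.0003/0.0001)    # 2: the quotient evaluates to 2.999... in floating point
print rint(0.0003/0.0001)   # 3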

I have a little modification to Born2Smile's solution.
I know this doesn't make much sense at first, but you may want it just in case: if your data is integer-valued and you need a float bin size (maybe for comparison with another set of data, or to plot density on a finer grid), you will need to add a random number between 0 and 1 inside floor. Otherwise, there will be spikes due to rounding error. floor(x/width+0.5) will not do, because it will create a pattern that is not true to the original data.
binwidth=0.3
bin(x,width)=width*floor(x/width+rand(0))
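Plugged into the usual recipe, a minimal sketch (assuming data.dat holds one integer column):
set boxwidth binwidth
plot 'data.dat' using (bin($1,binwidth)):(1.0) smooth freq with boxes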

With respect to binning functions, the functions offered so far did not give the result I expected. Namely, if my binwidth is 0.001, these functions were centering the bins on 0.0005 points, whereas I feel it's more intuitive to have the bins centered on 0.001 boundaries.
In other words, I'd like to have
Bin 0.001 contain data from 0.0005 to 0.0014
Bin 0.002 contain data from 0.0015 to 0.0024
...
The binning function I came up with is
my_bin(x,width) = width*(floor(x/width+0.5))
Here's a script to compare some of the offered bin functions to this one:
rint(x) = (x-int(x)>0.9999)?int(x)+1:int(x)
bin(x,width) = width*rint(x/width) + width/2.0
binc(x,width) = width*(int(x/width)+0.5)
mitar_bin(x,width) = width*floor(x/width) + width/2.0
my_bin(x,width) = width*(floor(x/width+0.5))
binwidth = 0.001
data_list = "-0.1386 -0.1383 -0.1375 -0.0015 -0.0005 0.0005 0.0015 0.1375 0.1383 0.1386"
my_line = sprintf("%7s %7s %7s %7s %7s","data","bin()","binc()","mitar()","my_bin()")
print my_line
do for [i in data_list] {
iN = i + 0
my_line = sprintf("%+.4f %+.4f %+.4f %+.4f %+.4f",iN,bin(iN,binwidth),binc(iN,binwidth),mitar_bin(iN,binwidth),my_bin(iN,binwidth))
print my_line
}
and here's the output
data bin() binc() mitar() my_bin()
-0.1386 -0.1375 -0.1375 -0.1385 -0.1390
-0.1383 -0.1375 -0.1375 -0.1385 -0.1380
-0.1375 -0.1365 -0.1365 -0.1375 -0.1380
-0.0015 -0.0005 -0.0005 -0.0015 -0.0010
-0.0005 +0.0005 +0.0005 -0.0005 +0.0000
+0.0005 +0.0005 +0.0005 +0.0005 +0.0010
+0.0015 +0.0015 +0.0015 +0.0015 +0.0020
+0.1375 +0.1375 +0.1375 +0.1375 +0.1380
+0.1383 +0.1385 +0.1385 +0.1385 +0.1380
+0.1386 +0.1385 +0.1385 +0.1385 +0.1390

Different numbers of bins on the same dataset can reveal different features of the data.
Unfortunately, there is no universal best method for determining the number of bins.
One powerful method, among many alternatives, is the Freedman–Diaconis rule, which determines the bin width (and hence the number of bins) from statistics of a given dataset.
Accordingly, the following can be used to utilise the Freedman–Diaconis rule in a gnuplot script:
Say you have a file containing a single column of samples, samplesFile:
# samples
0.12345
1.23232
...
The following (which is based on ChrisW's answer) may be embedded into an existing gnuplot script:
...
## preceding gnuplot commands
...
#
samples="$samplesFile"
stats samples nooutput
N = floor(STATS_records)
samplesMin = STATS_min
samplesMax = STATS_max
# Freedman–Diaconis formula for bin-width size estimation
lowQuartile = STATS_lo_quartile
upQuartile = STATS_up_quartile
IQR = upQuartile - lowQuartile
width = 2*IQR/(N**(1.0/3.0))
bin(x) = width*(floor((x-samplesMin)/width)+0.5) + samplesMin
plot \
samples u (bin(\$1)):(1.0/(N*width)) t "Output" w l lw 1 smooth freq

Related

Removing vertical lines due to sudden jumps in gnuplot

I am trying to plot a function that contains discontinuities in gnuplot. As a result, gnuplot automatically draws a vertical line connecting the jump discontinuities. I would like to remove this line. I have looked around and found two solutions, neither of which worked: one was to use smooth unique when plotting, and the other was to define the function in a conditional form and remove the discontinuity manually. The first solution simply made no visible change to the plot. The second seemed to move the location of the jump discontinuity to the left or right, not get rid of the vertical line. Please note that I would like to plot with lines. I know with points works, but I do not wish to plot with points.
set sample 10000
N=50
l1(x)=2*cosh(1/x)
l2(x)=2*sinh(1/x)
Z(x)=l1(x)**N+l2(x)**N
e(x)=(-1/Z(x))*(l2(x)*l1(x)**(N-1)+l1(x)*l2(x)**(N-1))
plot e(x)
Produces:
If all you need to do is to remove the vertical line at the singularity you could use conditional plotting:
plot (x<0 ? 1/x : 1/0) w l ls 1, (x>0 ? 1/x : 1/0) w l ls 1
However, your function is more complicated: it cannot be numerically evaluated in a region around 0:
set grid
set xrange [-0.3:0.3]
plot e(x) with linespoints
If Mathematica is to be trusted, the function e(x) goes to 1 and -1 as x approaches 0 from the left and the right, respectively. However, you see in the picture above that gnuplot fails to properly evaluate the function already at x=0.1. print e(0.1) gives -0.0, and print e(0.05) already gives NaN. In this region the numerator and denominator of the function e(x) get too large to be handled with floating point numbers.
You can either exclude this region using conditional plotting,
plot (x<-0.15 ? e(x) : 1/0) w l ls 1, (x>0.15 ? e(x) : 1/0) w l ls 1
or you have to rewrite the function e(x) so you avoid extremely large values in its evaluation (if that is possible). Alternatively you can use a software package that can switch to higher precision, such as Mathematica.
You can redefine your function e(x) to avoid calculations of large exponentials like
e(x) = -(l2(x)/l1(x) + (l2(x)/l1(x))**(N-1))/(1 + (l2(x)/l1(x))**N)
Now you always calculate l2(x)/l1(x) before taking the power.
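To see why the two forms agree, divide the numerator and denominator of the original e(x) by l1(x)**N and write r = l2(x)/l1(x) = tanh(1/x):
e(x) = -(l2*l1**(N-1) + l1*l2**(N-1)) / (l1**N + l2**N)
     = -(r + r**(N-1)) / (1 + r**N)
Since |tanh(1/x)| < 1 for all x != 0, every intermediate power now stays bounded.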
For your high sampling rate of 10000, this still gives some undefined points near the singularity, so there is no connecting line. For lower sampling rates of e.g. 1000 you would also see a line crossing zero. To avoid that, you can use an odd sampling rate:
set sample 1001
N=50
l1(x)=2*cosh(1/x)
l2(x)=2*sinh(1/x)
Z(x)=l1(x)**N+l2(x)**N
e(x) = -(l2(x)/l1(x) + (l2(x)/l1(x))**(N-1))/(1 + (l2(x)/l1(x))**N)
set autoscale yfix
set offsets 0,0,0.05,0.05
plot e(x) with lines
Late answer... but you can use the same principle as
here:
How to remove line between "jumping" values, in gnuplot?
or here:
Avoid connection of points when there is empty data
Just find the condition for where you want the line to be interrupted.
The condition in this case would be for example:
If two successive values y0 and y1 have different signs, make the line color fully transparent according to the color scheme 0xaarrggbb, e.g. 0xff123456. It actually doesn't matter what comes after the 0xff, because an alpha value of 0xff means fully transparent.
Script:
### remove connected "jump" in curve
reset session
N=50
l1(x)=2*cosh(1/x)
l2(x)=2*sinh(1/x)
Z(x)=l1(x)**N+l2(x)**N
e(x)=(-1/Z(x))*(l2(x)*l1(x)**(N-1)+l1(x)*l2(x)**(N-1))
set key noautotitle
set grid x,y
plot y1=NaN '+' u 1:(y0=y1, y1=e(x)):(sgn(y0)!=sgn(y1)?0xff123456:0xff0000) w l lc rgb var
### end of code
Result: (identical independent of the number of samples)

Plot filledcurve with variable transparency

I want to plot a function in gnuplot with filledcurve, but having a variable transparency.
To be clear, consider the following example:
http://www.gnuplot.info/demo_canvas_4.7/rgba_lines.html
In this case, the arrow angle (an integer variable) is used as parameter to define the alpha channel. Is it possible to do the same using a function instead?
PS: I am aware that I'll probably need to convert to an integer, as the alpha channel is specified in bits...
One simple approach to achieve your goal is plotting several highly transparent (e.g. alpha 0xef) scaled areas on top of each other. For this, the areas need to be easily scalable from a center point, e.g. like circles, ellipses, squares, triangles, etc.
In order to tune the results, you need to play with:
N: number of areas you are stacking. If N is small you will see steps; if N is large, plotting will be slow (and the resulting graph may be large in the case of vector formats)
Alpha: transparency of objects. In order to fade it out the value should be high, e.g. >0xee.
R(): function of scaling radius of areas, e.g. (real(i)/N) would be linear
This question could also be of interest: Gnuplot: transparency of data points when using palette
Script: (works with gnuplot>=5.0.0)
### circles with variable radial transparency
reset session
# create some random test data
set table $Data
set samples 60
# x,y,r,color
plot '+' u (rand(0)*100):(rand(0)*100):(rand(0)*10+1):(int(rand(0)*0xffffff)) w table
unset table
set style fill solid 1.0 noborder
set samples 100
set key noautotitle
set angle degrees
N = 50
Alpha = 0xef
R(col,n) = column(col)*(real(n)/N)**2
myColor(col) = (Alpha<<24) + int(column(col))
plot for [n=1:N] $Data u 1:2:(R(3,n)):(myColor(4)) w circle lc rgb var
### end of script
Result:

Making eye-diagram with gnuplot

I would like to plot 1000+ curves and display their eye diagram with gnuplot.
Example of eye-diagram example with matlab: http://www.mathworks.fr/fr/help/comm/ref/commscope.eyediagram.html
I can already plot the curves using the script below:
gnuplot> plot for [col=1:1000] 'input_dataset1.txt' using 0:col with lines linecolor rgb("#0000ff")
Result: output_image.png
My problem is that when two lines intersect, the intersection has the same color as the lines. The eye diagram should display areas with lots of intersections in a different color.
I haven't found any example of such diagrams made with gnuplot.
Playing with line transparency didn't work: the intersection of two semi-transparent lines is the same color as the lines.
Any ideas?
Thanks,
I worked out a gnuplot-only way to do this, it involves a bit of work and you'll probably have to fine tune the details for your particular problem.
As an example I generated a data file containing values for the function exp(x) and its Taylor expansions from order zero (T^(0)[exp(x)] = 1) to order 3 (T^(3)[exp(x)] = 1 + x + x**2/2. + x**3/6.). This kind of data is suited to this problem because you will have a high data density around the origin, where all the approximations converge to the exact value, and lower data density away from it. It can be generated like this with gnuplot:
set xrange [0:1]
set table
set output "| grep -v '^$' > data"
plot exp(x), 1, 1+x, 1+x+x**2/2., 1+x+x**2/2.+x**3/6.
unset table ; unset output
Note I'm formatting the output so my data file has no blank lines, otherwise gnuplot treats fields separated by blank lines as different data blocks and this eventually messes up the histograms below. This data looks like this (plot "data"):
Now, I create a 2D histogram with this data. It would be extremely helpful if gnuplot offered this feature, but it doesn't, so the task gets a bit tricky. What I will do is create several 1D histograms. For more info on how to generate the latter, check this.
The first thing is to figure out the widths of your bins along x and y, xwidth and ywidth, within which the data points are counted. That is, we divide the data space into a grid whose elements measure xwidth by ywidth, and each element is assigned a number equal to the number of data points contained within it. The smaller these elements, the better resolution your graph will have, but also the more data points you'll need for it to look good. For my data above, this could be something like
xwidth = 0.02
ywidth = 0.05
Now we declare a function to define our 1D bins (details):
bin(x,width)=width*floor(x/width)+width/2.0
and define the number of bins along each direction. Because the xrange for my data is [0:1] and my yrange is [1:2.8], the number of bins would be 50 and 36, respectively. I could use Nx = xrange / xwidth but that would lead to a float Nx and I want an integer. To be safe I do:
Nx = 50
Ny = 36
It might make more sense to define these values the other way around: calculate xwidth as xrange / Nx, in which case you should not have problems with integer/float.
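In gnuplot syntax, that alternative would read (a sketch using the ranges quoted above):
Nx = 50 ; Ny = 36
xwidth = (1.0 - 0.0)/Nx    # xrange is [0:1]
ywidth = (2.8 - 1.0)/Ny    # yrange is [1:2.8]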
Now I generate the 1D histograms along y, looping over x values:
set output "| grep -v 'u\\|^$' | sed 's/#/\\n#/g' > data2"
set table
plot for [i=0:(Nx-1)] "./data" using \
(bin($2,ywidth)):( i*xwidth <= $1 && (i+1.)*xwidth > $1 ? 1.0 : 0.0) \
smooth freq
unset table ; unset output
Now data2 contains Nx blocks of data, each of them being a scan along y with Ny data points. The value of these data points is the number of data entries in the original data file. As it is, data2 contains 2D data (y, color), which I need to remap to 3D. The x value is given by the data block position, accessible with the every option in gnuplot. To plot this 3-dimensionally I do:
set output "| grep -v 'u\\|^$' | sed 's/#/\\n#/g' > data3"
set table
splot for [i=0:(Nx-1)] "./data2" every :::i::i using \
((i+0.5)*xwidth):1:2
unset table ; unset output
This data3 can now be plotted as a color map:
plot "./data3" with image
which looks like this:
Had I used higher quality data (i.e. with higher resolution) the graph would look nicer. With 2x resolution along each direction, the same looks like below:

reduce datapoints when using logscale in gnuplot

I have a large set of data points from x = 1 to x = 10e13 (step size is fixed to about 3e8).
When I try to plot them using a logscale I certainly get an incredibly high point density towards the end. Of course this affects my output plots, since postscript and svg files (holding each and every data point) get really big.
Is there a way to tell gnuplot to decrease the data density dynamically?
Sample data here. It shows a straight line on a logarithmic x-axis.
Usually, for this kind of plot, one can use a filter function which selects the desired points and discards all others (i.e. sets their value to 1/0).
Something like:
plot 'sample.dat' using (filter($1) ? $1 : 1/0):2
Now you must define an appropriate filter function to change the data density. Here is a proposal, with pseudo-data, although you may well find a better one which doesn't show this typical logarithmic pattern:
set logscale x
reduce(x) = x/(10**(floor(log10(x))))
filterfunc(x) = abs(log10(sc)+(log10(x) - floor(log10(x))) - log10(floor(sc*reduce(x))))
filter(x) = filterfunc(x) < 1e-5 ? x : 1/0
set multiplot layout 1,2
sc = 1
plot 'sample.dat' using (filter($1)):2 notitle
sc = 10
replot
The variable sc allows you to change the density. The result (with 4.6.5) is:
I did some work inspired by Christoph's answer and was able to get equal spacing in log scale. I made a filter: if you have numbers in sequence, you can simply take the greatest-integer part and then find the point nearest to it in log scale by comparing the fractional part. Precision is tuned by precision_parameter here.
precision_parameter=100
function(x)=(-floor(precision_parameter*log10(x))+(precision_parameter*log10(x)))
Now filter by using the filter function defined below
density_parameter = 3.5
filter(x) = (function(x) < 1/(log10(x))**density_parameter && function(x-1) > 1/(log10(x))**density_parameter) ? x : 1/0
set datafile missing "NaN"
The last line helps when plotting with linespoints. I used x and x-1 assuming the x data is in arithmetic progression with a common difference of 1; change this according to your data. Then just replace x by filter(x) in the plot command.
plot 'sample_data.dat' u (filter($1)):2 w lp

Linear Fit does not adjust b independently from a

I'm using the following gnuplot script to plot a linear fit:
#!/usr/bin/gnuplot
set term cairolatex
set output "linear_fit.tex"
c = 299792458.
x(x) = c / x
y(x) = x
h(x) = a * x + b
fit h(x) "linear_fit.dat" u (x($1)):(y($2)) via a,b
plot "linear_fit.dat" u (x($1)):(y($2)) w points title "", \
(h(x)) with lines linecolor rgb "black" title "Linear Fit"
However, after the iterations converge, b is always 1.0: https://dpaste.de/ozReq/
How can I get gnuplot to adjust b as well as a?
Update: Repeating the fit command a few hundred times with alternating via a/via b does give pretty good results, but that just can't be how it's supposed to be done.
Update 2: Here's the data in linear_fit.dat:
# lambda, V
360e-9 1.119
360e-9 1.148
360e-9 1.145
400e-9 0.949
400e-9 0.993
400e-9 0.971
440e-9 0.883
440e-9 0.875
440e-9 0.863
490e-9 0.737
490e-9 0.728
490e-9 0.755
540e-9 0.575
540e-9 0.571
540e-9 0.592
590e-9 0.457
590e-9 0.455
590e-9 0.482
I think your troubles stem from the fact that your x-values are very large (on the order of 10^14).
If you do not provide gnuplot with an initial guess for a and b, it will assume a=1 and b=1 as starting points for the fit. However, this is a poor initial guess:
Please note the log scale on both the x- and y-axis.
From the gnuplot documentation:
fit may, and often will get "lost" if started far from a solution, where SSR is large and changing slowly as the parameters are varied, or it may reach a numerically unstable region (e.g., too large a number causing a floating point overflow) which results in an "undefined value" message or gnuplot halting.
To improve the chances of finding the global optimum, you should set the starting values at least roughly in the vicinity of the solution, e.g., within an order of magnitude, if possible. The closer your starting values are to the solution, the less chance of stopping at another minimum. One way to find starting values is to plot data and the fitting function on the same graph and change parameter values and replot until reasonable similarity is reached. The same plot is also useful to check whether the fit stopped at a minimum with a poor fit.
In your case, such starting values could be:
a = 1e-15
b = -0.5
I obtained these values by eye-balling your range of values.
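In script form, the only change to the script from the question is seeding the parameters before the same fit call:
a = 1e-15
b = -0.5
fit h(x) "linear_fit.dat" u (x($1)):(y($2)) via a,b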
With those starting values, the linear fit results in:
Final set of parameters Asymptotic Standard Error
======================= ==========================
a = 1.97355e-015 +/- 6.237e-017 (3.161%)
b = -0.5 +/- 0.04153 (8.306%)
Which looks like this:
You can play with the control setting of fit (such as setting FIT_LIMIT = 1.e-35) or the starting values to achieve a better fit than this.
EDIT
While I still have not been able to coax gnuplot into modifying both parameters a, b at the same time, I found an alternate approach using R. I am aware that there are many other (scripting) languages that can perform a linear fit and this question was about gnuplot. However, the required effort with R appeared to be minimal.
Here's an example, which, when saved as linear_fit.R and called with
R CMD BATCH linear_fit.R
will provide the two coefficients of the linear fit that gnuplot failed to provide.
y <- c(1.119, 1.148, 1.145, 0.949, 0.993, 0.971, 0.883, 0.875, 0.863,
0.737, 0.728, 0.755, 0.575, 0.571, 0.592, 0.457, 0.455, 0.482)
x <- c(3.60E-007, 3.60E-007, 3.60E-007, 4.00E-007, 4.00E-007,
4.00E-007, 4.40E-007, 4.40E-007, 4.40E-007, 4.90E-007,
4.90E-007, 4.90E-007, 5.40E-007, 5.40E-007, 5.40E-007,
5.90E-007, 5.90E-007, 5.90E-007)
c = 299792458.
x <- c/x
lm.out <- lm(y ~ x)
svg("linear_fit.svg")
plot(x,y)
abline(lm.out,col="red")
summary(lm.out)
You will end up with an svg-file that contains the plot and a linear_fit.Rout text file. In there you'll find the following coefficients:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.429e-01 4.012e-02 -13.53 3.55e-10 ***
x 2.037e-15 6.026e-17 33.80 2.61e-16 ***
So, in the terminology of the original question, we obtain:
a = 2.037e-15
b = -5.429e-01
These values are very close to the values you quoted from alternating the fit.
In case the comments get purged, these questions were identified as related:
What is gnuplot's internal representation of floating point numbers?
Gnuplot behaves oddly in polynomial fit. Why is that?
