gnuplot removing outliers when plotting a data file - statistics

I need to plot a data file of 2 columns using gnuplot; I think a scatter plot is what I need. My understanding of gnuplot goes as far as:
plot "first_click" using 2:1
The first and last three lines of my data look as follows:
1 612856
3 3840538
5 5240597
.
.
.
139845 1
141101 1
141584 1
I am expecting my scatter plot to show a logarithmic trend; however, my data (like most data) has tons of outliers. So I need to do one of three things:
1. Automatically "zoom" to where most of the data is.
2. Automatically prune outliers.
3. Provide a predicate for each of the columns to manually prune the data, and perhaps predicates that can take both columns of an entry in scope, e.g., !(column1 > x) && !(column2 == 1)
Precision is not a concern.
At this stage I prefer 1 and 2, but I'd like to see if option 3 is possible as well since I am a programmer and not a statistician.

You could also try
plot "first_click" using 2:1 smooth bezier with lines
This has the side effect of not showing most outliers.

gnuplot should automatically zoom to fit the data plotted (if not, you can use set autoscale to restore auto-zoom on both axes). If the outliers are pruned prior to plotting then your first requirement would already be met.
Number two and three could be achieved by modifying your plot command as follows:
plot "first_click" using ($2 != 1 ? $2 : 1/0):($1 < x ? $1 : 1/0)
This plots only values for which the second column is not equal to 1 and the first column is less than x, where x is the value at which you want to start pruning outliers. 1/0 is a way of telling gnuplot the point is invalid, so it won't be plotted.
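For options 1 and 2, one possible approach (a sketch, not part of the answer above) is to let gnuplot's stats command compute quartiles and keep only points inside fences placed 1.5 times the interquartile range beyond the quartiles. The fence factor k and the helper name inside are assumptions chosen for illustration; the file name and column order follow the question.

```gnuplot
# Sketch: prune outliers with quartile-based fences (stats needs gnuplot 4.6 or later).
# Assumption: "first_click" holds two numeric columns, as in the question.
k = 1.5   # fence factor; a common but arbitrary choice

# Quartiles of column 2 (plotted on the x axis)
stats "first_click" using 2 nooutput
x_lo = STATS_lo_quartile - k * (STATS_up_quartile - STATS_lo_quartile)
x_hi = STATS_up_quartile + k * (STATS_up_quartile - STATS_lo_quartile)

# Quartiles of column 1 (plotted on the y axis)
stats "first_click" using 1 nooutput
y_lo = STATS_lo_quartile - k * (STATS_up_quartile - STATS_lo_quartile)
y_hi = STATS_up_quartile + k * (STATS_up_quartile - STATS_lo_quartile)

# Hypothetical helper: true when v lies inside [lo, hi]
inside(v, lo, hi) = (v >= lo && v <= hi)

# Points outside either fence evaluate to 1/0 (undefined) and are skipped
plot "first_click" using (inside($2, x_lo, x_hi) && inside($1, y_lo, y_hi) ? $2 : 1/0):1 with points
```

To zoom without discarding any points, setting xrange to [x_lo:x_hi] and yrange to [y_lo:y_hi] before a plain plot "first_click" using 2:1 should have a similar visual effect.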

Gnuplot 3D bar graph from data files

I have a gnuplot script that produces bar graphs like this:
The input data is in files that have a number of columns, each column ultimately contributes to a cluster in the chart (2 clusters shown in the example). Each file contributes to a bar in the chart (there are 9 in the example). Each file may have a large number of rows.
The script takes the input data files and, using the stats command, produces new files containing one row per column of the original files. Each row contains a mean, min and max value for its source column.
These new files are then used to plot the bar chart with error bars. Each file represents one bar and each row contributes to one cluster. The plot code is as follows:
plot for [f in FILES] f.'.stats' using 2:3:4 title columnhead(1), \
'' using (0):xticlabels(1) with lines
Now I have a second set of files that produces another similar bar chart. I would like to combine these charts into one so there will be two rows of 3-D bars, one in front of the other (rendered with a 3-D style, the new 'z' axis representing the two data sets, i.e. the two sets of FILES).
Here is an example to illustrate the look I'm after (obviously not made with gnuplot!):
Can I do this with gnuplot?
I have read the user manual and the Gnuplot In Action book but haven't found anything that would indicate this is possible.
gnuplot version 5.3 (the development branch) adds a 3D barchart variant; see the 3D boxes demo. However, rendering the boxes in 3D unfortunately depends on features that were not present in earlier gnuplot release versions, so I cannot offer a work-around for the current one (5.2.4). Also, the new 3D variant does not show error bars, although I think one could construct a plot command that would add them.
I produced a 3D bar chart using the development 5.3 version (git checkout). Here is my splot command:
splot for [c = 1:ncats] for [f = 1:nfiles] \
word(cat_files[c],f).'.stats' \
using (f+column(0)*(nfiles+2)):(scale_y(c)):2 \
with boxes \
title (c==1 ? columnhead(1) : '')
The input data is in a set of 'stats' files as described in the question. To draw the plot, I separated the input FILES into categories - two (ncats) sets of files held in the array cat_files, each containing the same number of files (nfiles).
The categories equate to positions on the y-axis (rows) and the individual files equate to positions on the x-axis (bars). Rows in each file equate to clusters of bars, and the value in each row is the bar height, which is plotted on the Z axis. The Z axis was the Y axis in the 2D model. The nasty expressions are to position the bars on the x and y axes as I explain below.
I had a lot of difficulty getting this to work but I think that the result looks good:
The problems, which I cover below are:
matching colours between chart 'rows' of the y-axis
bar dimensions - making square bars is very hit-and-miss, hence my scale_y function.
x-axis label orientation
repeated items in the key, hence conditional expression for title.
no clustering support, hence nasty positioning expressions
What I have is brittle: it works on my Linux system but relies heavily on shell helpers. But it works. Hopefully this information helps others or can be taken as feedback to improve gnuplot to make it even more awesome!
Colours
To get the colours in each data set to line up, I set linetype cycle nfiles and hope gnuplot defines sufficient colours.
The reason for doing this is to reset the colour assignment between file sets (categories on the y-axis) so that the same bar in different file sets has the same colour. By explicitly setting it to cycle after the known number of files (chart bars) I ensured the colours matched.
Bar dimensions
The bar dimensions (boxwidth and boxdepth) are relative to the axis ranges and it's therefore difficult to make them square.
If a bar rests on the extreme of the y axis (lower or upper) then it is cut in half vertically (its visible box depth is half the defined boxdepth value).
I had to play with scaling the y axis so that my two category sets were displayed near each other. The default behaviour displayed a range from 1 to 2 in steps of 0.2 and placed the two plots at 1 and 2, making them appear far apart.
I tried set ytics to no effect. I ended up scaling the y value.
scale_y(y) = 0.1 * y - 0.05
set yrange [0:1]
set boxdepth (0.8 / clusters)
All the numbers are fudge factors. clusters is the number of clusters (rows in files). The numbers I have maintain a square appearance with my test data (I have data to display up to 5 clusters).
I had to start the x axis at 0.5 otherwise the first bar would appear too far in (if x starts at 0) or vertically half-cut off (if x starts at 1).
set xrange [0.5:*]
Axis labels
I replaced the automatic tick marks with custom labels. On the Y axis:
set ytics ()
set for [c = 1:ncats] ytics add (word(CATS,c) scale_y(c) )
Similarly for the x axis. First, where there is 1 cluster I label each category
set xtics ()
set for [f = 1:nfiles] xtics add (label(word(cat_files[1],f)) f)
Or where there are multiple clusters, I label the clusters:
set xtics ()
set for [c = 2:(clusters+1)] xtics add (cell(f,c,1) (nfiles/2)+2+((c-2)*nfiles))
Here, cell is a shell helper that returns the value from file f at row c position 1. The horrible formula is a hack to position the label along the axis in the middle of the cluster. I also use shell helpers to get the number of clusters. I could not find a way in gnuplot to query rows and columns. Note that previously (when 2D plotting) I would have used xticlabels(1) to label the clustered x-axis.
I wanted to turn the x labels to run perpendicular to the axis but this doesn't seem possible. I also wanted to tweak their positions with 'right' alignment but couldn't make that work either.
Key labels
An entry is added into the key for each bar plotted. Because these are repeated within each category, they get duplicated in the key. I made it add them only once by using a conditional, changing from
title columnhead(1)
to
title (c==1 ? columnhead(1) : '')
I only show the key when there is more than one cluster.
Clustering
The 2D plot was clustered. I had difficulty making a clustered appearance in 3D. If I run the plot on clustered data then they overlay (they have the same Y values). To overcome this I used a formula to shift latter clusters along the x-axis and add a gap between them. So instead of a simple value for x:
... using (f):(scale_y(c)):2 ...
I have a formula:
... using (f+column(0)*(nfiles+2)):(scale_y(c)):2 ...
where f is the file number (i.e. the bar number), column(0) is the cluster number, nfiles is the number of files (i.e. the number of bars, or cluster size), and 2 is the separator gap.
Incidentally, whilst doing this I discovered that ($0) doesn't work in gnuplot 5.3, you have to use column(0) instead ($0 works in 5.2.4).
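To see what the positioning formula does, here is a small sketch that prints the x position of every bar. The values nfiles = 3 and two clusters, and the helper name xpos, are assumptions made up for illustration; in the real plot the cluster number comes from column(0).

```gnuplot
# Sketch: print the x position of each bar for 3 files and 2 clusters.
nfiles = 3   # assumed number of files (bars per cluster)
gap = 2      # separator gap, as in the formula above

# Hypothetical helper mirroring f + column(0) * (nfiles + 2)
xpos(f, cluster) = f + cluster * (nfiles + gap)

do for [cluster = 0:1] {
    do for [f = 1:nfiles] {
        print sprintf("cluster %d, file %d -> x = %d", cluster, f, xpos(f, cluster))
    }
}
```

With these values the first cluster lands on x = 1, 2, 3 and the second on x = 6, 7, 8, leaving positions 4 and 5 empty as the gap.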
I used the Arch Linux AUR package to build which gave me a package gnuplot-git-5.3r20180810.10527-1-x86_64.pkg.tar.xz.
An example plot with one cluster.
An example plot with three clusters and a key legend.
There are probably better ways to do the things I've done here. Being relatively new to gnuplot, I would be interested in any ways to improve upon this solution.
(I can't figure out how to format text in a comment, so I'll provide this as a separate answer)
Matching color: This is more reliably done by providing the color in a separate field of the using spec. From the help text:
splot with boxes requires at least 3 columns of input data. Additional
input columns may be used to provide information such as box width or
fill color.
3 columns: x y z
4 columns: x y z [x_width or color]
5 columns: x y z x_width color
The last column is used as a color only if the splot command specifies a
variable color mode. Examples
splot 'blue_boxes.dat' using 1:2:3 fc "blue"
splot 'rgb_boxes.dat' using 1:2:3:4 fc rgb variable
splot 'category_boxes.dat' using 1:2:3:4:5 lc variable
In the first example all boxes are blue and have the width previously set
by set boxwidth. In the second example the box width is still taken from
set boxwidth because the 4th column is interpreted as a 24-bit RGB color.
The third example command reads box width from column 4 and interprets the
value in column 5 as an integer linetype from which the color is derived.
Half-depth boxes at each end: This was an autoscaling bug (now fixed)

draw lines from a given point to all points in a file in gnuplot

I want to draw lines from the origin (0,0) to all points whose coordinates are given in a file, using gnuplot. For example, if the file contains data as:
1,1
1,2
Then I want straight lines from (0,0) to (1,1) and from (0,0) to (1,2). Since I have a lot of points, I can't do it manually for each point in the file. How can I accomplish this?
One simple way to accomplish this would be to plot using vectors, but setting the origin as (0,0) for all points, then removing the vector heads:
plot "datafile" using (0):(0):1:2 with vectors nohead
which results in:
More info here. By the way, if your input file looks exactly like the one you posted:
1,1
1,2
You'll need to add set datafile separator ',' before plotting. Hope it helps!
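Putting the pieces of this answer together, a self-contained sketch could look as follows. It uses an inline datablock (available from gnuplot 5.0) in place of the file; the name $points is an assumption for illustration.

```gnuplot
# Sketch: draw a line from the origin to every point, using the question's sample data.
# $points is a hypothetical inline datablock standing in for the data file.
$points << EOD
1,1
1,2
EOD

set datafile separator ','   # the sample data is comma-separated

# Vector columns are x:y:dx:dy; starting at (0,0) makes dx,dy the point itself.
# nohead suppresses the arrow heads so only straight lines remain.
plot $points using (0):(0):1:2 with vectors nohead notitle
```

Replacing $points with the quoted file name should give the same result for a real data file.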
A possible way is to use a plot for loop over the block index. If you insert two blank lines between the coordinates in your file, they are treated as different blocks, so that you can write
plot for [j=0:N-1] 'data.dat' index j u 1:2 with lines
where N is the number of points (block indices start at 0). However, in this way you need to add the origin to each block in your file, i.e. in the form
#your data file
0 0
1 1


0 0
1 2
I don't know how many points you have or if you have to perform this on many files. With few points you could modify the file by hand; otherwise I would suggest preparing a script (in bash, for example, with sed or similar tools).

Gnuplot - Plot data on another abscissa by interpolation

Good evening,
I have a problem with Gnuplot. I tried to sum up my problem to make the comprehension easier.
What I have : 2 sets of data, the first one is my experimental data, about 20 points, the second one is my numerical data, about 300 points. But the two sets don't have the same abscissa.
What I want to have : I want my numerical data to be interpolated onto the x-experimental abscissa.
I know it is possible to do that with Xmgrace (paragraph Interpolation at http://plasma-gate.weizmann.ac.il/Xmgr/doc/trans.html#interp) but with Gnuplot ?
What I want to have in addition : is it possible, then, to subtract the y-experimental data from my y-numerical data at the x-experimental abscissa points ?
Thank you in advance for your answer,
zackalucard
You cannot interpolate the ordinate values of one set to the abscissa values of the other. gnuplot has no mechanism for that.
You can however plot both datasets using one of the smoothing algorithms (check "help smooth") with common abscissa values (which might (be made to) coincide with the original values of one set.)
set table "data1.tmp"
plot dataf1 smooth cspline
set xrange [GPVAL_X_MIN:GPVAL_X_MAX] # fix xrange settings
set table "data2.tmp"
plot dataf2 smooth cspline
unset table
Now you have the interpolated data in two temporary files, and only need to combine them into one:
system("paste data1.tmp data2.tmp > correlation.dat") # unixoid "paste" command
plot "correlation.dat" using 2:5 # each table line holds x, y and a type flag, so the second y is column 5
(If you have a sensible fit function for both datasets, the whole thing becomes much easier : plot dataf1 using (fit1($1)):(fit2($1)))
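For the subtraction asked about in the question, a hedged sketch working on the pasted file might look like this. It assumes that both tables were sampled at the same abscissa values, that each line written by set table holds x, y and a type flag (so the pasted file has six columns), and that data1.tmp came from the experimental set and data2.tmp from the numerical set.

```gnuplot
# Sketch: plot numerical minus experimental values from the pasted table files.
# Assumed column layout of correlation.dat: x1 y1 flag x2 y2 flag
# Assumed roles: columns 1-3 experimental, columns 4-6 numerical.
plot "correlation.dat" using 1:($5 - $2) with linespoints title "numerical - experimental"
```

If the two tables were not sampled on the same x grid, the difference is not meaningful, so it is worth comparing columns 1 and 4 first.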
You can use smoothing, this should do the trick
plot "DATA" smooth csplines
(csplines is just one option; there are others, e.g. bezier)
But I don't think you can automatically determine the intersection of the smoothed curves. You can use the mouse to determine the intersection visually, or alternatively fit some functions f(x) and g(x) to your curves and solve f(x)=g(x) analytically.

Comparative histogram of two data files, one with frequency, the other with boxes

I have two sets of data, and aim to make a comparative histogram from them. However, one is two-column data (x and its frequency), while the second is one-column unsorted data from which gnuplot should derive the frequencies. I want a continuous histogram, but whatever I find on the web has gaps.
How should I do this?
I tried using the following script
binwidth=5
bin(x,width)=width*floor(x/width)
plot'data1.txt' with boxes, 'data2.txt' using (bin($1,binwidth)):(1.0) smooth freq with boxes
with the data files data1.txt:
1 3
5 1
7 1
and the second data file data2.txt:
1
1
1
5
7
This doesn't give the expected result.
Use the smooth frequency option, which makes the data monotonic in x; points with the same x-value are replaced by a single point having the summed y-values. So, if you use the first column as x-values and 1 as y-value you get the count:
plot 'secondfile.dat' using 1:(1) smooth frequency with linespoints
The smoothing is almost independent of the plotting style, so you can use points, lines, boxes etc.
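Applied to the two files from the question, a sketch of a gap-free comparative histogram could look like this. Centring each bin, setting boxwidth to the bin width, and the transparent fill are assumptions added for illustration (transparency depends on the terminal); the file names and binwidth come from the question.

```gnuplot
# Sketch: comparative histogram of pre-counted and raw data with adjacent boxes.
binwidth = 5
# Map each x to the centre of its bin so boxes are drawn over the bin itself
bin(x, width) = width * floor(x / width) + width / 2.0

set boxwidth binwidth                  # box width equals bin width: no gaps between bins
set style fill transparent solid 0.5   # let overlapping boxes show through (terminal-dependent)

# data1.txt: column 2 already holds the frequency, so sum it per bin
# data2.txt: each row counts once, so sum 1.0 per bin
plot 'data1.txt' using (bin($1, binwidth)):2 smooth frequency with boxes title 'data1', \
     'data2.txt' using (bin($1, binwidth)):(1.0) smooth frequency with boxes title 'data2'
```

Both files are binned with the same function so that their boxes line up and can be compared directly.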

Plotting GNUPlot graph after computation

I have a data file that lists hits and misses for a certain cache system. Following is the data file format
time hits misses
1 12 2
2 34 8
3 67 13
...
To plot a 2D graph in GNUPlot for time vs hits, the command would be:
plot "data.dat" using 1:2 with lines
Now I want to plot a graph of time vs hit-ratio. For this, can I do some computation on the columns, like:
plot "data.dat" using 1:2/ (2 + 3) with lines
Here 1, 2, 3 represent the column number.
Any reference to these kind of graph plotting will also be appreciated.
Thanks in advance.
What you have is almost correct. You need to use $ symbols to indicate the column in the calculation:
plot "data.dat" using 1:($2/($2 + $3))
Since you are using $n to refer to the column numbers, you now are able to use n to refer to the number itself. For example,
plot "data.dat" using 1:(2 * $2)
will double the value in the second column.
In general, you can even plot C functions like log and cos of a given column. For example:
plot "data.dat" u 1:(exp($2))
Note the parens on the outside of the argument that uses the value of a particular column.
See here for more info.
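As a slightly more defensive sketch of the hit-ratio plot, the computation can be wrapped in a function that skips rows where hits and misses are both zero. The function name ratio is an assumption for illustration; the file name and column order follow the question.

```gnuplot
# Sketch: hit ratio per time step, skipping rows with no accesses at all.
# Hypothetical helper: returns 1/0 (undefined, so not plotted) when h + m is zero.
ratio(h, m) = (h + m) > 0 ? h / (h + m) : 1/0

# Column 1 is time, column 2 is hits, column 3 is misses
plot "data.dat" using 1:(ratio($2, $3)) with lines title "hit ratio"
```

Column values read with $n are floating point, so the division here is not truncated the way a division of two integer constants would be.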
