Remove duplicated outliers in gnuplot boxplot [duplicate] - statistics

I have a large set of data points. I am trying to plot them as a boxplot, but some of the outliers have exactly the same value and are drawn in a row next to each other. I found "How to set the horizontal distance between outliers in gnuplot boxplot", but it doesn't help much, as that is apparently not possible.
Is it possible to group the outliers together, print one point and then print a number in brackets beside it to indicate how many points there are? I think this would make it more readable in a graph.
For information, I have three boxplots per x value, and six x values in one graph. I am using gnuplot 5 and have already played around with the pointsize, but reducing it further no longer shrinks the distance.
I hope you can help!
Edit:
set terminal pdf
set output 'dat.pdf'
file0 = 'dat1.dat'
file1 = 'dat2.dat'
file2 = 'dat3.dat'
set pointsize 0.2
set notitle
set xlabel 'X'
set ylabel 'Y'
header = system('head -1 '.file0);
N = words(header)
set xtics ('' 1)
set for [i=1:N] xtics add (word(header, i) i)
set style data boxplot
plot file0 using (1-0.25):1:(0.2) with boxplot lw 2 lc rgb '#8B0000' fs pattern 16 title 'A', \
     file1 using (1):1:(0.2) with boxplot lw 2 lc rgb '#00008B' fs pattern 4 title 'B', \
     file2 using (1+0.25):1:(0.2) with boxplot lw 2 lc rgb '#006400' fs pattern 5 title 'C', \
     for [i=2:N] file0 using (i-0.25):i:(0.2) with boxplot lw 2 lc rgb '#8B0000' fs pattern 16 notitle, \
     for [i=2:N] file1 using (i):i:(0.2) with boxplot lw 2 lc rgb '#00008B' fs pattern 4 notitle, \
     for [i=2:N] file2 using (i+0.25):i:(0.2) with boxplot lw 2 lc rgb '#006400' fs pattern 5 notitle
What is the best way to implement it with this code already in place?

There is no option to have this done automatically. The steps required to do this manually in gnuplot are:
(In the following I assume that the data file data.dat has only a single column.)
Analyze your data with stats to determine the boundaries for the outliers:
stats 'data.dat' using 1
range = 1.5 # (this is the default value of the `set style boxplot range` value)
lower_limit = STATS_lo_quartile - range*(STATS_up_quartile - STATS_lo_quartile)
upper_limit = STATS_up_quartile + range*(STATS_up_quartile - STATS_lo_quartile)
Count only the outliers and write them to a temporary file:
set table 'tmp.dat'
plot 'data.dat' using 1:($1 > upper_limit || $1 < lower_limit ? 1 : 0) smooth frequency
unset table
Plot the boxplot without the outliers, and the outliers with the labels plotting style:
set style boxplot nooutliers
plot 'data.dat' using (1):1 with boxplot,\
'tmp.dat' using (1):($2 > 0 ? $1 : 1/0):(sprintf('(%d)', int($2))) with labels offset 1,0 left point pt 7
And this needs to be done for every single boxplot.
Disclaimer: In principle this procedure should work, but without example data I couldn't test it.
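The counting logic behind the steps above can also be checked outside gnuplot. The following is only a minimal Python sketch, not part of the original answer: the names quartiles, count_outliers and sample are made up for illustration, and the quartiles are computed by simple linear interpolation, which may differ slightly from the method gnuplot uses, so borderline values could be classified differently.

```python
from collections import Counter

def quartiles(values):
    """Return (Q1, Q3) using linear interpolation between sorted values.

    Assumption: gnuplot may use a slightly different quartile definition.
    """
    data = sorted(values)

    def quantile(q):
        pos = q * (len(data) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(data) - 1)
        return data[lo] + (pos - lo) * (data[hi] - data[lo])

    return quantile(0.25), quantile(0.75)

def count_outliers(values, whisker_range=1.5):
    """Map each distinct outlier value to the number of times it occurs."""
    q1, q3 = quartiles(values)
    lower = q1 - whisker_range * (q3 - q1)
    upper = q3 + whisker_range * (q3 - q1)
    return Counter(v for v in values if v < lower or v > upper)

# Hypothetical single-column data with repeated outliers.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20, 20, 20, -15]
for value, count in sorted(count_outliers(sample).items()):
    # One line per distinct outlier: the value and its count in brackets.
    print(f"{value} ({count})")
```

Each printed line corresponds to one label in the gnuplot version: a single point at the outlier value with its multiplicity in brackets.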

Related

gnuplot - intersection of two plots

I am using gnuplot to plot data from two separate csv files (found in this link: https://drive.google.com/open?id=0B2Iv8dfU4fTUZGV6X1Bvb3c4TWs) with a different number of rows which generates the following graph.
These data seem to have no common timestamp (the first column) in both csv files and yet gnuplot seems to fit the plotting as shown above.
Here is the gnuplot script that I use to generate my plot.
# ###### GNU Plot
set style data lines
set terminal postscript eps enhanced color "Times" 20
set output "output.eps"
set title "Actual vs. Estimated Comparison"
set style line 99 linetype 1 linecolor rgb "#999999" lw 2
#set border 1 back ls 11
set key right top
set key box linestyle 50
set key width -2
set xrange [0:10]
set key spacing 1.2
#set nokey
set grid xtics ytics mytics
#set size 2
#set size ratio 0.4
#show timestamp
set xlabel "Time [Seconds]"
set ylabel "Segments"
set style line 1 lc rgb "#ff0000" lt 1 pi 0 pt 4 lw 4 ps 0
plot "estimated.csv" using ($1):2 with lines title "Estimated", "actual.csv" using ($1):2 with lines title "Actual";
Is there any way to print out (write to a file) the values at the intersections of these plots while ignoring the peaks above the green plot? I have also tried an SQL join query, but it doesn't print anything, for the same reason I explained above.
PS: If the blue line doesn't touch the green line (i.e. if it is way below the green line), I want to take the values of the closest green line so that it will be a one-to-one correspondence (or very close) with the actual dataset.
Perhaps one could somehow force Gnuplot to reinterpolate both data sets on a fine grid, save this auxiliary data and then compare it row by row. However, I think that it's indeed much more practical to delegate this task to an external tool.
It is certainly not the most efficient way to do it, but a "lazy approach" is to read the data points, interpret each dataset as a LineString (a collection of line segments, which is essentially equivalent to assuming linear interpolation between data points), and then calculate the intersection points. In Python, a script to do this might look like this:
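#!/usr/bin/env python
import sys
import numpy as np
from shapely.geometry import LineString
#-------------------------------------------------------------------------------
def load_data(fname):
    # Treat the comma-separated (x, y) rows as the vertices of one polyline.
    return LineString(np.genfromtxt(fname, delimiter = ','))
#-------------------------------------------------------------------------------
lines = list(map(load_data, sys.argv[1:]))
result = lines[0].intersection(lines[1])
# A multi-part result exposes its parts via .geoms; a single geometry does not.
for g in getattr(result, 'geoms', [result]):
    # Skip overlapping segments and empty results; keep only crossing points.
    if g.geom_type != 'Point' or g.is_empty:
        continue
    print('%f,%f' % (g.x, g.y))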
Then in Gnuplot, one can invoke it directly:
set terminal pngcairo
set output 'fig.png'
set datafile separator comma
set yr [0:700]
set xr [0:10]
set xtics 0,2,10
set ytics 0,100,700
set grid
set xlabel "Time [seconds]"
set ylabel "Segments"
plot \
'estimated.csv' w l lc rgb 'dark-blue' t 'Estimated', \
'actual.csv' w l lc rgb 'green' t 'Actual', \
'<python filter.py estimated.csv actual.csv' w p lc rgb 'red' ps 0.5 pt 7 t ''
which gives:
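If shapely is not available, the same linear-interpolation idea can be sketched in plain Python. This is an illustrative sketch only, not the script used for the figure: segment_intersection, polyline_intersections and the two small example polylines are hypothetical, and the pairwise loop is quadratic in the number of segments, so it is only suitable for modest data sizes.

```python
def segment_intersection(p1, p2, q1, q2):
    """Return the crossing point of segments p1-p2 and q1-q2, or None."""
    (x1, y1), (x2, y2) = p1, p2
    (x3, y3), (x4, y4) = q1, q2
    denom = (x2 - x1) * (y4 - y3) - (y2 - y1) * (x4 - x3)
    if denom == 0:
        # Parallel or collinear segments have no single crossing point.
        return None
    t = ((x3 - x1) * (y4 - y3) - (y3 - y1) * (x4 - x3)) / denom
    u = ((x3 - x1) * (y2 - y1) - (y3 - y1) * (x2 - x1)) / denom
    if 0 <= t <= 1 and 0 <= u <= 1:
        return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
    return None

def polyline_intersections(a, b):
    """Collect crossing points between every segment pair of polylines a and b."""
    points = []
    for p1, p2 in zip(a, a[1:]):
        for q1, q2 in zip(b, b[1:]):
            hit = segment_intersection(p1, p2, q1, q2)
            # Avoid reporting a shared vertex twice.
            if hit is not None and hit not in points:
                points.append(hit)
    return points

# Hypothetical stand-ins for the rows of estimated.csv and actual.csv.
estimated = [(0, 0), (2, 2), (4, 0)]
actual = [(0, 1), (4, 1)]
for x, y in polyline_intersections(estimated, actual):
    print('%f,%f' % (x, y))
```

The output uses the same "x,y" format as the shapely script, so it could be piped into gnuplot in the same way.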

Avoid connection of points when there is empty data

I am trying to make a line chart using Gnuplot. I need to get something like the following, but with one exception:
In the example above you can see a straight line which joins two separate points across empty data. It is the one that crosses the '2016-09-27 00:00:00' x tick. I would like an empty space there instead of that straight line. How could I achieve this?
This is the current code:
set xdata time
set terminal pngcairo enhanced font "arial,10" fontscale 1.0 size 900, 350
set output filename
set key off
set timefmt '"%Y-%m-%d %H:%M:%S"'
set format x "%Y-%m-%d %H:%M"
set xtics rotate by -80
set mxtics 10
set datafile missing "-"
set style line 1 lt 2 lc rgb 'blue' lw 1
set style line 2 lt 2 lc rgb 'green' lw 1
set style line 3 lt 2 lc rgb 'red' lw 1
plot \
    fuente using 1:2 ls 1 with lines, \
    fuente using 1:3 ls 2 with lines, \
    fuente using 1:4 ls 3 with lines
Three options:
In the data file, put an empty line where the gap is. This results in exactly what you want, but would also affect the other data from that file.
Use every to plot only a portion of the data and plot it twice, once up to the gap and once from the gap onwards. Suppose that the gap occurs between the 42nd and 43rd data points (every counts points from 0, so these are points 41 and 42); then you could use:
plot \
    fuente using 1:2 ls 1 every ::::41 with lines, \
    fuente using 1:2 ls 1 every ::42 with lines, \
    fuente using 1:3 ls 2 with lines, \
    fuente using 1:4 ls 3 with lines
(The every statement takes up to six arguments separated by colons but you can leave them empty for default values. The fifth argument is the end point, the third is the starting point.)
If you use - for missing data in your file (as indicated by your set datafile missing "-"), you have to modify your using statement for this to be effective:
plot \
    fuente using 1:($2) ls 1 with lines, \
    fuente using 1:3 ls 2 with lines, \
    fuente using 1:4 ls 3 with lines
Of course, you can always change your data and, e.g., insert empty lines (as @Wrzlprmft suggested) where data is missing, which will interrupt your line.
With large datasets and a lot of "breaks" this would be painful if you have to do it manually.
I would say that there is a solution without changing your data.
Let me ask: "What do you consider as missing data?"
My assumption would be: you have e.g. a data logger which takes values every 10 minutes.
If for some reason the logger did not take some data there will be a "gap" of missing data.
Now, you can define what you consider as a gap, e.g. >1 hour of no data would be a gap.
Hence, you simply compare two consecutive time values t0 and t1, and if the difference is larger than your gap, you change the line color from whatever color it is to transparent (according to the scheme 0xaarrggbb). Check help linecolor variable and help colorspec.
Script:
### don't show line in missing data gaps
reset session
myFmt = "%Y-%m-%d %H:%M"
# create some random test data
set print $Data
tStart = "2016-09-27"
tEnd = "2016-10-10"
t0 = strptime(myFmt,tStart)
t1 = strptime(myFmt,tEnd)
y0 = 100
do for [t=t0:t0+(t1-t0)*0.2:600] { print sprintf("%s %g",strftime(myFmt,t),y0=y0+(rand(0)-0.5)) }
do for [t=t0+(t1-t0)*0.3:t0+(t1-t0)*0.5:600] { print sprintf("%s %g",strftime(myFmt,t),y0=y0+(rand(0)-0.5)) }
do for [t=t0+(t1-t0)*0.8:t0+(t1-t0):600] { print sprintf("%s %g",strftime(myFmt,t),y0=y0+(rand(0)-0.5)) }
set print
set format x "%d.%m." timedate
gap = 3600 # 1 hour
myColor(tCol,color) = (t0=t1, t1=timecolumn(tCol,myFmt), t1-t0>gap ? 0xff123456 : color)
set multiplot layout 2,1
plot $Data u (timecolumn(1,myFmt)):3 w l lc rgb 0xff0000 ti "data as is"
plot t1=NaN $Data u (timecolumn(1,myFmt)):3:(myColor(1,0x0000ff)) w l lc rgb var ti "with removed gaps"
unset multiplot
### end of script
Result:
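The gap rule used above (a step of more than one hour between consecutive timestamps counts as missing data) can also be applied before plotting. The following Python sketch is an alternative to the transparent-color trick, not part of the original answer: split_at_gaps and the four sample rows are hypothetical, and the script simply writes a blank line wherever a gap is detected so that gnuplot's with lines breaks the curve there.

```python
from datetime import datetime, timedelta

def split_at_gaps(samples, max_gap=timedelta(hours=1)):
    """Split (timestamp, value) pairs into runs with no step longer than max_gap."""
    runs = []
    for stamp, value in samples:
        if runs and stamp - runs[-1][-1][0] <= max_gap:
            runs[-1].append((stamp, value))   # continue the current run
        else:
            runs.append([(stamp, value)])     # first sample or gap: new run
    return runs

fmt = "%Y-%m-%d %H:%M"
# Hypothetical logger output with a gap of almost three hours.
raw = [("2016-09-27 00:00", 100.0), ("2016-09-27 00:10", 100.2),
       ("2016-09-27 03:00", 99.7), ("2016-09-27 03:10", 99.9)]
samples = [(datetime.strptime(t, fmt), y) for t, y in raw]

# A blank line between runs interrupts the curve when plotted with lines.
for i, run in enumerate(split_at_gaps(samples)):
    if i:
        print()
    for stamp, value in run:
        print(stamp.strftime(fmt), value)
```

This leaves the original data file untouched and automates the "insert empty lines" option for large datasets with many breaks.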

Gnuplot plotting wrong lines and some strange values as well

I am using gnuplot to postprocess some calculations that I have done, and I am having a hard time getting gnuplot to select the right lines: it outputs some strange values and I do not know where they come from.
The first 200 points of the results start in line 3 and stop in 202 but that is not working when I use every ::3::202.
Does anyone have any suggestions of what I am doing wrong?
Gnuplot image:
Datafile
set terminal pngcairo transparent nocrop enhanced size 3200,2400 font "arial,40"
set output "Mast41_voltage_muffe.png"
set key right
set samples 500, 500
set xzeroaxis ls 1 lt 8 lw 3
set style line 12 lc rgb '#808080' lt 0 lw 1
set style line 13 lt 0 lw 3
set grid back ls 12
set decimalsign '.'
set datafile separator whitespace
set ylabel "Spenna [pu]"
set xlabel "Timi [s]"
plot "mrunout_01.out" every ::3::202 using 2:3 title '5 ohm' with lines lw 3 linecolor rgb '#D0006E',\
"mrunout_01.out" every ::203::402 using 2:3 title '10 ohm' with lines lw 3 linecolor rgb '#015DD4',\
"mrunout_01.out" every ::403::602 using 2:3 title '15 ohm' with lines lw 3 linecolor rgb '#F80419',\
"mrunout_01.out" every ::603::802 using 2:3 title '20 ohm' with lines lw 3 linecolor rgb '#07826A'
unset output
unset zeroaxis
unset terminal
every refers to the actual plottable points. In your case, you have to skip 2 lines and the bunch of data at the end of your datafile.
Since you know the actual lines you need to plot, I would pre-parse the file with an external tool such as sed.
Then you can omit the every and your plot line becomes:
plot "< sed -n '3,202p' mrunout_01.out" using 2:3 title '5 ohm' with lp lw 3 linecolor rgb '#D0006E'
With your datafile as it is, gnuplot has problems reading it. It can't even run stats on it:
stats 'mrunout_01.out'
bad data on line 1 of file mrunout_01.out
There is no need for external tools; you can do this with gnuplot alone.
Your data has the advantage of being regular: every 200 points are plotted in a different color.
And the data you want to plot is separated by one empty line from some additional data at the end of the file, which you don't want to plot.
So, you simply address the 4th set of 200 lines in the 0th block via every ::600:0:799:0.
From help every:
Syntax:
plot 'file' every {<point_incr>}
{:{<block_incr>}
{:{<start_point>}
{:{<start_block>}
{:{<end_point>}
{:<end_block>}}}}}
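To make the six fields concrete, here is a small Python sketch that mimics the selection rule described in help every. It is an illustration only: the function every, its keyword names and the example blocks are made up here, and it ignores how gnuplot itself parses a file into blocks (comment lines and blank-line separators). All point and block indices start at 0.

```python
def every(blocks, point_incr=1, block_incr=1,
          start_point=0, start_block=0, end_point=None, end_block=None):
    """Select points the way gnuplot's `every` does (all indices start at 0)."""
    selected = []
    last_block = len(blocks) - 1 if end_block is None else end_block
    for b in range(start_block, min(last_block, len(blocks) - 1) + 1, block_incr):
        block = blocks[b]
        last_point = len(block) - 1 if end_point is None else end_point
        # end_point is inclusive, so the slice stops one position after it.
        stop = min(last_point, len(block) - 1) + 1
        selected.extend(block[start_point:stop:point_incr])
    return selected

# Hypothetical layout: one block of 800 points, then a trailing block of extra data.
blocks = [list(range(800)), ["extra"] * 5]

# Equivalent of `every ::600:0:799:0`: points 600..799 of block 0 only.
fourth = every(blocks, start_point=600, start_block=0, end_point=799, end_block=0)
print(len(fourth), fourth[0], fourth[-1])
```

Restricting both start_block and end_block to 0 is what keeps the additional data after the empty line out of the plot.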
Comments:
you can skip two lines at the beginning of the files with skip 2
you can plot your curves in a loop plot for [i=1:4] ...
you can define your color myColor(n) via index n from a string "#D0006E #015DD4 #F80419 #07826A"
you can define the legend myTitle(n) also from a list "5 10 15 20"
Script: (tested with gnuplot 5.0.0, version at the time of OP's question)
### plot parts of a file in a loop
reset session
FILE = "SO36103041.dat"
myColor(n) = word("#D0006E #015DD4 #F80419 #07826A",n)
myTitle(n) = word("5 10 15 20",n)
set xlabel "Timi [s]"
set ylabel "Spenna [pu]"
set yrange[0:30]
plot for [i=1:4] FILE u 2:3 skip 2 every ::((i-1)*200):0:(200*i-1):0 \
w l lw 3 lc rgb myColor(i) ti myTitle(i)
### end of script
Result:

Gnuplot read line style from data file column

I'd like to draw an impulse graph from a text file that looks like this:
II 5 0 0 288.40 1.3033e+14
II 6 0 0 289.60 1.5621e+14
II 1 4 0 302.70 3.0084e+13
II 2 4 0 303.40 4.0230e+13
II 1 5 1 304.40 3.4089e+13
The plot conceptually should be plot "datafile.dat" using 5:6 w impulses ls $2.
Basically, given a previously defined set of line styles, I'd like to input the line style number from column 2 for every couple of plotted points from column 5 and 6.
Also I'd like to create a text box, for every plotted point, taking strings from the first four columns.
Does somebody know if that's possible?
To use the data from column two as line style use set style increment user and linecolor variable:
set style increment user
plot "datafile.dat" using 5:6:2 with impulses lc var
In order to place a label, use the labels plotting style:
plot "datafile.dat" using 5:6:1 with labels offset 0,1
Putting everything together, you have:
set style increment user
set for [i=1:6] style line i lt i
set yrange [0:*]
set offsets 0,0,graph 0.1,0
plot "datafile.dat" using 5:6:2 with impulses lc var, "" using 5:6:1 with labels offset 0,1
The result with 4.6.3 is:
Thanks for the helpful answer above. It almost solved my problem.
I'm actually trying to use a column from my data file to specify a linestyle (dots, squares, triangles, whatever, as long as it's user-defined), and not a linecolor. Is there any way to do that?
This line works: I get points with different colors (specified in column 4), but the point style is the same.
plot "$file" u 1:2:4 w p notitle lc var, "" using 1:2:3 with labels offset 0,1 notitle
Replacing lc with ls after defining my own styles doesn't work (ls can't take variable as an option).
I can live without different linestyles, but it would be much prettier.
You only have to replace the line set for [i=1:6] style line i lt i with set for [i=1:6] style line i lt i pt %, where % can be any point type you want.

setting multiple labels at the top of the x-axis

Following the answer I got to my earlier post, "drawing vertical lines in between bezier curves", I have been trying to label the segments separated by the dotted lines. I used x2label, but found that if I set it multiple times, each call replaces the previous label even though the labels are positioned in different places. Below is the script:
set term x11 persist
set title "Animation curves"
set xlabel "Time (secs.)"
set ylabel "Parameter"
set x2label "Phoneme1" offset -35
set pointsize 2
set key off
set style line 2 lt 0 lc 1 lw 2
plot [0.04:0.15] "curve.dat" u 1:2 smooth csplines ls 1, "" u 1:($2-0.2):(0):(0.3) w vectors nohead ls 2, \
"curve.dat" u 1:2 with points
The output is the following.
I want to label the segments Phoneme1, Phoneme2, and so on, on top of each segment. How would I do that? Also, it was suggested in my earlier post that I play with the line "" u 1:($2-0.2):(0):(0.3) w vectors nohead ls 2 to get vertical lines running from top to bottom, but that did not work either. How do I get the lines to run from the top margin to the bottom? Thank you.
The vertical lines
The vertical lines can be made to span the whole plot by setting the yrange to an explicit value. Otherwise gnuplot's autoscaling leaves some space between the ends of the lines and the axes. You could choose the values
set yrange [0.3:1.2]
Then you simply modify the vector using directions like so:
"" u 1:(0.3):(0):(1.2) w vectors nohead ls 2
(see below for the complete script)
The labeling of the sections
A quick way of doing this with your set of data would be this:
set key off
set style line 2 lt 0 lc 1 lw 2
set yrange [0.3:1.2]
plot [0.04:0.15] "Data.csv" u 1:2 smooth csplines ls 1, \
"" u 1:(0.3):(0):(1.2) w vectors nohead ls 2, \
"" u ($1+0.005):(1):(sprintf("P %d", $0)) w labels
However, this will probably not look the way you want it to look. You could think of modifying your data file to also include some information about the labeling like:
#x-value y-value x-label y-label label
0.06 0.694821399177 0.65 0.1 Phoneme1
0.07 0.543022222222 0.75 0.1 Phoneme2
Then the labels line would simply look like:
"" u 3:4:5 w labels
The complete plot then looks like this:
