gnuplot: using a logarithmic axis for a histogram

I have a data file that I am creating a histogram from.
The data file is :
-0.1 0 0 JANE
1 1 1 BILL
2 2 1 BILL
1 3 1 BILL
6 4 0 JANE
35 5 0 JANE
9 6 1 BILL
4 7 1 BILL
24 8 1 BILL
28 9 1 BILL
9 10 0 JANE
16 11 1 BILL
4 12 0 JANE
45 13 1 BILL
My gnuplot script is :
file='test.txt'
binwidth=10
bin(x,width)=width*floor(x/width)
set boxwidth 1
plot file using (bin($1,binwidth)):(1.0) smooth freq with boxes, \
file using (1+(bin($2,binwidth))):(1.0) smooth freq with boxes
I would like to plot this data on a logscale in y. However there are some 0 values (because some of the bins are empty) that cannot be handled by set logscale y. I get the error Warning: empty y range [1:1], adjusting to [0.99:1.01].
According to gnuplot's help, "The frequency option makes the data monotonic in x; points with the same x-value are replaced by a single point having the summed y-values."
How can I take the log10() of the summed y-values computed by smooth freq with boxes?

There are at least two things that you could do. One is to use a linear axis between 0 and 1 and then use the logarithmic one as explained in this answer. The other one is to plot to a table first and then set the log scale ignoring the points with zero value.
With a normal linear axis and your code (plus set yrange [0:11]) your data looks like this:
Now let's plot to a table, then set the log scale, then plot while ignoring the zero values:
file='test.txt'
binwidth=10
bin(x,width)=width*floor(x/width)
set table "data"
plot file using (bin($1,binwidth)):(1.0) smooth freq, \
file using (1+(bin($2,binwidth))):(1.0) smooth freq
unset table
set boxwidth 1
set logscale y
set yrange [0.1:11]
plot "data" index 0 using ($1):($2 == 0 ? 1/0 : $2) with boxes lc 1, \
"data" index 1 using ($1):($2 == 0 ? 1/0 : $2) with boxes lc 2
set table sometimes generates some undesirable points in the plot, which you can see at x = 0. To get rid of them you can use "< grep -v u data" instead of "data".
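To check the numbers outside gnuplot, here is a minimal Python sketch (not part of the gnuplot solution; the names bin_value, freq and log_counts are made up for illustration). It mimics what bin() combined with smooth freq computes for the first column of test.txt, and takes log10 only where the count is positive. Bins that never occur simply have no entry.

```python
import math
from collections import Counter

# First column of test.txt from the question
values = [-0.1, 1, 2, 1, 6, 35, 9, 4, 24, 28, 9, 16, 4, 45]
binwidth = 10

def bin_value(x, width):
    # Same formula as the gnuplot function bin(x,width)
    return width * math.floor(x / width)

# "smooth freq" with a y-value of 1.0 per row amounts to counting rows per x
freq = Counter(bin_value(v, binwidth) for v in values)

# Take log10 only where the count is positive, like the ternary in the plot command
log_counts = {x: math.log10(n) for x, n in sorted(freq.items()) if n > 0}

for x in sorted(freq):
    print(x, freq[x], round(log_counts[x], 3))
```

The counts agree with what gnuplot writes to the table for the first curve (1, 8, 1, 2, 1, 1 for the bins -10 to 40).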

Related

Plot HTTP Status Codes Grouped by Days

I have a stream of timestamped HTTP status codes:
2021-02-09T10:54:00 200 50
2021-02-09T10:57:00 200 35
2021-02-09T11:00:00 200 50
2021-02-09T11:03:00 500 150
2021-02-09T11:06:00 500 350
2021-02-09T11:09:00 500 450
2021-02-09T11:12:00 500 1000
2021-02-09T11:15:00 404 35
2021-02-09T11:18:00 404 50
2021-02-09T11:21:00 200 50
2021-02-09T11:24:00 200 35
2021-02-09T11:27:00 200 50
2021-02-09T11:30:00 200 50
I already managed to set up gnuplot to group the days:
set xdata time
set ydata time
set format y "%H:%M"
set timefmt "%Y-%m-%dT%H:%M:%S"
set xrange ["2021-02-08T00:00:00":"2021-02-14T23:59:59"]
plot 'availability.csv' using (timecolumn(1,"%Y-%m-%d")):(timecolumn(1,"%H-%M")):2…
I already found a lot of samples like summing over the day (boxes/ histogram) or marking the point in time per day (point). But none of them match my goal of availability over time.
My goal is to have a bar per day binned to 15min blocks. Each block should be colored according to the max status code, e.g. HTTP.500=red, HTTP.404=yellow, HTTP.200=green (only these 3, no teapot/redirect/spooky ones, and the colors as a sort of traffic light). Y-axis is the hour of the day, x-axis is the day.
Am I on the right track, is this possible at all with gnuplot?
What does the using clause look like?
How is binning to 15min intervals merged into the second column?
How to color the specific codes? (It is not like a heatmap calculating color from frequency)
I would start with something like the following.
timecolumn(1,"%H-%M") does not extract hour and minute from timestrings like "2021-02-08T12:34:56". As far as I know, first we have to extract the 12:34 part and then convert this to hours and minutes:
strptime("%H:%M", strcol(1)[12:17])
timestamps are internally stored as seconds, so binning into 15 minute (= 900 second) bins can be done with integer division: int(<seconds>)/900*900.0
A gnuplot command like plot "a.dat" using 1:(<expression>, value) evaluates expression and plots value. This is used to "manually" select the max value within a bin: the script goes through all points within a bin and remembers the max value. Please read help ternary. I use the ternary operator twice: once for checking the bin and once for checking the max value
for color, please read help set palette
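The binning and the "max per bin" steps can be checked outside gnuplot with a small Python sketch. This is only an illustration under my own assumptions: the helper day_and_bin and the dictionary max_status are hypothetical names, and the rows are the sample data from the question.

```python
from datetime import datetime

# Sample rows from the question: timestamp, status code, third column
rows = [
    "2021-02-09T10:54:00 200 50",
    "2021-02-09T10:57:00 200 35",
    "2021-02-09T11:00:00 200 50",
    "2021-02-09T11:03:00 500 150",
    "2021-02-09T11:06:00 500 350",
    "2021-02-09T11:09:00 500 450",
    "2021-02-09T11:12:00 500 1000",
    "2021-02-09T11:15:00 404 35",
    "2021-02-09T11:18:00 404 50",
    "2021-02-09T11:21:00 200 50",
    "2021-02-09T11:24:00 200 35",
    "2021-02-09T11:27:00 200 50",
    "2021-02-09T11:30:00 200 50",
]

BIN_SECONDS = 900  # 15 minutes

def day_and_bin(stamp):
    # Split a timestamp into its date and the start of its 15-minute bin
    t = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
    seconds = t.hour * 3600 + t.minute * 60 + t.second
    return t.date().isoformat(), seconds // BIN_SECONDS * BIN_SECONDS

# Remember the largest status code seen in each (day, bin)
max_status = {}
for line in rows:
    stamp, status, _ = line.split()
    key = day_and_bin(stamp)
    max_status[key] = max(max_status.get(key, 0), int(status))

for (day, start), status in sorted(max_status.items()):
    print(day, f"{start // 3600:02d}:{start % 3600 // 60:02d}", status)
```

Unlike this sketch, the gnuplot approach keeps only a running maximum while rows of the same bin follow each other, so it relies on the file being sorted by time.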
This is the complete script:
set xdata time
set ydata time
set format y "%H:%M"
set timefmt "%Y-%m-%dT%H:%M:%S"
set xrange ["2021-02-08T00:00:00":"2021-02-14T23:59:59"]
set palette defined (200 "green", 400 "yellow", 500 "red")
unset colorbox
bin = 0
bin_before = 0
max_value = 0
plot 'availability.csv' using \
(timecolumn(1,"%Y-%m-%d")):\
(bin = (int(strptime("%H:%M", strcol(1)[12:17]))/900*900), bin):\
(y = $2, bin == bin_before ? (y>max_value ? max_value = y : max_value = max_value) \
: (max_value = y, bin_before = bin), max_value ) \
linecolor palette pt 5 ps 2 notitle
This is the result:
I think we are not finished, one should add a legend, and it might be interesting to check the possibilities with splot and pm3d.
Interesting challenge. My suggestion would be the following. It's probably not the easiest, but I would say the result looks reasonable. It uses the plotting style with boxxyerror (see help boxxyerror).
From your question, I gather that you want 15-minute bins and want to display only the color of the maximum status in each interval. Why not show a histogram of the different states for each interval? For example, if an interval contains the HTTP states 2x 200, 1x 404 and 2x 500, then the horizontal bar for this interval is split into 40% green, 20% yellow and 40% red.
What the following code basically does:
creating some random test data (just for illustration)
binning of the data using smooth freq (check help smooth), adding a small offset of 1, 2 or 3 seconds for the 3 different states.
do some table rearrangements
create the final table with the x,y positions of the boxes and corresponding to the relative contribution of each status within the binning interval.
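The last step, turning per-status counts into relative start and end positions inside a bar, can be sketched in Python as follows. This is an illustration only: the function segments and the dictionary bins are hypothetical names, and the counts are those of the two example bins (12:30 and 12:45) used in this answer.

```python
# Counts per status inside one 15-minute bin
bins = {
    "12:30": {200: 4, 404: 1},
    "12:45": {200: 3, 404: 1, 500: 1},
}

def segments(counts):
    # Return (status, start, end) with start/end as fractions of the bin total
    total = sum(counts.values())
    running = 0
    result = []
    for status in sorted(counts):
        start = running / total
        running += counts[status]
        result.append((status, start, running / total))
    return result

for label, counts in bins.items():
    for status, start, end in segments(counts):
        print(label, status, start, end)
```

For the 2x 200, 1x 404, 2x 500 example this gives the 40% / 20% / 40% split described above.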
In order to get a better understanding:
Example data of datablock $Data:
2021-02-10T12:30:00 200 407
2021-02-10T12:33:00 200 922
2021-02-10T12:36:00 404 615
2021-02-10T12:39:00 200 689
2021-02-10T12:42:00 200 628
2021-02-10T12:45:00 500 10
2021-02-10T12:48:00 200 185
2021-02-10T12:51:00 200 2
2021-02-10T12:54:00 404 743
2021-02-10T12:57:00 200 618
Example data of datablock $Histo3:
1612960200 5 i
1612960201 4 i
1612960202 1 i
1612961100 5 i
1612961101 3 i
1612961102 1 i
1612961103 1 i
Example data of datablock $Histo4:
NaN 0 nan 12:30 0
2021-02-10 0 0.8 12:30 1
2021-02-10 0.8 1 12:30 2
NaN 0 nan 12:45 0
2021-02-10 0 0.6 12:45 1
2021-02-10 0.6 0.8 12:45 2
2021-02-10 0.8 1 12:45 3
The code can certainly be optimized. So, look at it as a starting point...
Code:
### status overview as date/time dependent histograms
reset session
# general settings
myDateFmt = "%Y-%m-%d" # date only format
myTimeFmt = "%H:%M:%S" # time only format
myDateTimeFmt = myDateFmt."T".myTimeFmt # datetime format
SecPerDay = 24*3600 # seconds per day
myStatusList = "200 404 500" # possible states
myColorList = "0x00ff00 0xffff00 0xff0000" # green, yellow, red
# create some random test data
set print $Data
myTime = time(0) # now
myRandomStatus(x) = x<0.70 ? 1 : x<0.95 ? 2 : 3 # random status
myInterval = 3 # interval in minutes
do for [i=1:5000] {
myTime = myTime + myInterval*60
myStatus = word(myStatusList,myRandomStatus(rand(0))) # random status
myValue = int(rand(0)*1000) # random value 0-999
print sprintf("%s %s %g",strftime("%Y-%m-%dT%H:%M:00",myTime),myStatus,myValue)
}
set print
# functions
myStatusNo(col) = column(col)==200 ? 1 : column(col)==404 ? 2 : 3
myColor(i) = int(i) ? int(word(myColorList,int(i))) : 1
myDayTime(t) = tm_hour(t)*3600 + tm_min(t)*60 + tm_sec(t)
# binning
BinWidthSec = 900 # in seconds 900 sec = 15 min
BinTime(col) = floor(myDayTime(timecolumn(col,myDateTimeFmt))/BinWidthSec)*BinWidthSec
set table $Histo1
set format x "%.0f"
plot $Data u (timecolumn(1,myDateFmt)+BinTime(1)):(1) smooth freq
plot $Data u (timecolumn(1,myDateFmt)+BinTime(1)+myStatusNo(2)):(1) smooth freq
set table $Histo2
plot $Histo1 u (sprintf("%.0f",$1)):2 w table # remove empty lines etc.
set table $Histo3
set format x "%.0f"
plot $Histo2 u 1:2 smooth freq # sort the events by time
unset table
# create final table
myX(col1,col2) = int(column(col1))%4==0 ? (Sum=0.0, Total=column(col2),"NaN") : \
strftime(myDateFmt,column(col1))
myXRelStart(col1,col2) = Sum/Total
myXRelEnd(col1,col2) = int(column(col1))%4==0 ? NaN : (Sum=Sum+column(col2), Sum/Total)
BinTimeT(col) = strftime("%H:%M",column(col))
set table $Histo4
plot $Histo3 u (sprintf("% 10s % 5g % 5g % 7s % 3d", \
myX(1,2), myXRelStart(1,2), myXRelEnd(1,2), BinTimeT(1), tm_sec($1))) w table
unset table
# plot settings
set format x "%d.%m." timedate
set format y "%H:%M" timedate
set style fill transparent solid 0.5 noborder
set yrange [0:SecPerDay]
set tics out
set key out title "HTTP status"
plot $Histo4 u (timecolumn(1,myDateFmt)+($3+$2)/2*SecPerDay) : \
(timecolumn(4,myTimeFmt)+BinWidthSec/2) : \
(($3-$2)/2*SecPerDay) : (BinWidthSec/2.):(myColor($5)) \
w boxxy lc rgb var notitle, \
for [i=1:3] keyentry w boxes lc rgb myColor(i) title word(myStatusList,i)
### end of code
Result:

How to omit values in a threshold when plotting

I am trying to plot data using gnuplot. My data is in a matrix, i.e.:
-2 1 2
0 2 -5
3 0 1
For clarity I would like to give all values in the range -2:2 a white color, or mark them as missing. Is there an easy way to do it?
That depends on your actual plotting style and how you encode the color.
When plotting with image you can simply mark all values inside your range as invalid (as 1/0), and then they get plotted as white:
$data <<EOD
-2 1 2
0 2 -5
3 0 1
EOD
plot $data matrix u 1:2:($3 >= -2 && $3 <= 2 ? 1/0 : $3) with image notitle
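The effect of the ternary can be checked with a short Python sketch (an illustration only; mask, LOW and HIGH are made-up names). Values inside the closed range become NaN, which plays the role of 1/0 in gnuplot, and only the remaining values would be drawn.

```python
import math

matrix = [
    [-2, 1, 2],
    [0, 2, -5],
    [3, 0, 1],
]
LOW, HIGH = -2, 2

def mask(value):
    # Values inside [LOW, HIGH] become NaN, the counterpart of 1/0 in gnuplot
    return math.nan if LOW <= value <= HIGH else value

masked = [[mask(v) for v in row] for row in matrix]
kept = [v for row in masked for v in row if not math.isnan(v)]
print(kept)
```

Note that both bounds are inclusive here, matching the >= and <= in the plot command.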

gnuplot : data table type value = 'u' and strange bars in histogram boxes

I previously asked this question. This is a related question.
Using the test.txt file :
-0.1 0 0 JANE
1 1 1 BILL
2 2 1 BILL
1 3 1 BILL
6 4 0 JANE
35 5 0 JANE
9 6 1 BILL
4 7 1 BILL
24 8 1 BILL
28 9 1 BILL
9 10 0 JANE
16 11 1 BILL
4 12 0 JANE
45 13 1 BILL
and the Gnuplot script :
file='test.txt'
binwidth=10
bin(x,width)=width*floor(x/width)
set table "data_table"
plot file using (bin($1,binwidth)):(1.0) smooth freq,\
file using (1+(bin($2,binwidth))):(1.0) smooth freq
unset table
set boxwidth 1
set logscale y
set yrange[0.01:15]
plot "data_table" index 0 using ($1):($2 == 0 ? 1/0 : $2) with boxes,\
"data_table" index 1 using ($1):($2 == 0 ? 1/0 : $2) with boxes
I get the data_table :
# Curve 0 of 2, 7 points
# Curve title: "file using (bin($1,binwidth)):(1.0)"
# x y type
-10 1 i
0 8 i
10 1 i
20 2 i
30 1 i
40 1 i
0 1 u
# Curve 1 of 2, 3 points
# Curve title: "file using (1+(bin($2,binwidth))):(1.0)"
# x y type
1 10 i
11 4 i
1 1 u
Per "help set table" in the Gnuplot shell :
". . . character R takes on one of three values:
"i" if the point is in the active range, "o" if it is out-of-range, or "u"
if it is undefined."
Question 1 : Why does the last line of each index group in data_table have a value of u, and why does its x value seem to be out of order?
Question 2 : The plot that is generated looks very similar to Miguel's second plot. If you look at the bins at (x=0, y=1), you'll notice a bar in the middle of the histogram box. What is it and how do I get rid of it?
The superfluous points, marked as undefined by the u, are due to a bug, see bug #1274.
Gnuplot itself doesn't automatically respect the values in the third column. So, although the last points in each block are marked as being undefined, gnuplot plots them which causes the additional bars at y=1.
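What filtering on the third column amounts to can be shown with a small Python sketch that reads the data_table text from the question and drops the rows flagged u. The function read_curves is a hypothetical helper written for this illustration, not something gnuplot provides.

```python
table_text = """\
# Curve 0 of 2, 7 points
# Curve title: "file using (bin($1,binwidth)):(1.0)"
# x y type
-10 1 i
0 8 i
10 1 i
20 2 i
30 1 i
40 1 i
0 1 u
# Curve 1 of 2, 3 points
# Curve title: "file using (1+(bin($2,binwidth))):(1.0)"
# x y type
1 10 i
11 4 i
1 1 u
"""

def read_curves(text):
    # Collect (x, y) pairs per curve, dropping rows whose type flag is "u"
    curves = []
    for line in text.splitlines():
        if line.startswith("# Curve title"):
            continue
        if line.startswith("# Curve"):
            curves.append([])
            continue
        if line.startswith("#") or not line.strip():
            continue
        x, y, flag = line.split()
        if flag != "u":
            curves[-1].append((float(x), float(y)))
    return curves

for number, curve in enumerate(read_curves(table_text)):
    print(number, curve)
```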
To get rid of them you must explicitly skip those points which have a u in their third column by checking for strcol(3) eq "u":
file='test.txt'
binwidth=10
bin(x,width)=width*floor(x/width)
set table "data_table"
plot file using (bin($1,binwidth)):(1.0) smooth freq,\
file using (1+(bin($2,binwidth))):(1.0) smooth freq
unset table
set boxwidth 1
set logscale y
set yrange[0.01:15]
unset key
plot "data_table" index 0 using ($1):($2 == 0 || strcol(3) eq "u" ? 1/0 : $2) with boxes,\
"data_table" index 1 using ($1):($2 == 0 || strcol(3) eq "u" ? 1/0 : $2) with boxes

Gnuplot: plotting points with variable point types

I have x,y values for points in the first 2 columns and a number that indicates the point type (symbol) in the third column, all in one data file. How do I plot the data points with different symbols?
Unfortunately, there isn't a way (AFAIK) to automatically set the point of the plot from a column value using vanilla GNUPLOT.
However, there is a way to get around that by setting a linestyle for each data series, and then plotting the values based on that defined style:
set style line 1 lc rgb 'red' pt 7 #Circle
set style line 2 lc rgb 'blue' pt 5 #Square
Remember that the number after pt is the point-type.
Then, all you have to do is plot (assuming that the data in "data.txt" is ordered ColX ColY Col3):
plot "data.txt" using 1:2 title 'Y Axis' with points ls 1, \
"data.txt" using 1:3 title 'Y Axis' with points ls 2
Try it here using this data (in the section titled "Data" - also note that column 3 "Symbol" is not used, it's mainly there for illustrative purposes):
# This file is called force.dat
# Force-Deflection data for a beam and a bar
# Deflection Col-Force Symbol
0.000 0 5
0.001 104 5
0.002 202 7
0.003 298 7
And in the Plot Script Heading:
set key inside bottom right
set xlabel 'Deflection (m)'
set ylabel 'Force (kN)'
set title 'Some Data'
set style line 1 lc rgb 'red' pt 7
set style line 2 lc rgb 'blue' pt 5
plot "data.txt" using 1:2 title 'Col-Force' with points ls 1, \
"data.txt" using 1:3 title 'Beam-Force' with points ls 2
The one caveat is of course that you have to reconfigure your data input source.
REFERENCES:
http://www.gnuplotting.org/plotting-single-points/
http://www.gnuplotting.org/plotting-data/
Here is a possible solution (which is a simple extrapolation from gnuplot conditional plotting with if), that works as long as you don't have tens of different symbols to handle.
Suppose I want to plot 2D points in a coordinate system. I have only two symbols, that I arbitrarily represented with a 0 and a 1 in the last column of my data file :
0 -0.29450470209121704 1.2279523611068726 1
1 -0.4006965458393097 1.0025811195373535 0
2 -0.7109975814819336 0.9022682905197144 1
3 -0.8540692329406738 1.0190201997756958 1
4 -0.5559651851654053 0.7677079439163208 0
5 -1.1831613779067993 1.5692367553710938 0
6 -0.24254602193832397 0.8055955171585083 0
7 -0.3412654995918274 0.6301406025886536 0
8 -0.25005266070365906 0.7788659334182739 1
9 -0.16853423416614532 0.09659398347139359 1
10 0.169997438788414 0.3473801910877228 0
11 -0.5252010226249695 -0.1398928463459015 0
12 -0.17566296458244324 0.09505800902843475 1
To achieve what I want, I just plot my file using conditionals. Using an undefined value like 1/0 results in no plotting of the given point:
# Set styles
REG_PTS = 'pointtype 7 pointsize 1.5 linecolor rgb "purple"'
NET_PTS = 'pointtype 4 pointsize 1.5 linecolor rgb "blue"'
set grid
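# Plot each category with its own style; @NAME expands the string as a macro
# (enabled by default since gnuplot 5, gnuplot 4.x needs "set macros" first)
plot "data_file" u 2:($4 == 0 ? $3 : 1/0) title "regular" @REG_PTS, \
"data_file" u 2:($4 == 1 ? $3 : 1/0) title "network" @NET_PTS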
Here is the result :
Hope this helps
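For readers who want to check which rows end up in which series, here is a minimal Python sketch of the same selection (illustration only; select is a made-up name, and only the first five rows of the data file are used). Rows that do not match the flag are left out, which is the effect 1/0 has in gnuplot.

```python
# First five rows of the data file: index, x, y, symbol flag
rows = [
    (0, -0.29450470209121704, 1.2279523611068726, 1),
    (1, -0.4006965458393097, 1.0025811195373535, 0),
    (2, -0.7109975814819336, 0.9022682905197144, 1),
    (3, -0.8540692329406738, 1.0190201997756958, 1),
    (4, -0.5559651851654053, 0.7677079439163208, 0),
]

def select(rows, flag):
    # Keep (x, y) only for rows whose last column equals flag
    return [(x, y) for _, x, y, f in rows if f == flag]

regular = select(rows, 0)
network = select(rows, 1)
print(len(regular), len(network))
```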
Variable pointtype (pt variable) was not introduced (I guess) until gnuplot 5.2.0 (Sept 2017) (check help points).
In retrospect, another (awkward) solution for those who are still using earlier versions would be the following.
Data:
1 1.0 4 # empty square
2 2.0 5 # filled square
3 3.0 6 # empty circle
4 4.0 7 # filled circle
5 5.0 8 # empty triangle up
6 6.0 9 # filled triangle up
7 7.0 15 # filled pentagon (cross in gnuplot 4.6 to 5.0)
Script: (works from gnuplot>=4.6.0, March 2012; but not necessary since 5.2.0)
### variable pointtype for gnuplot>=4.6
reset
FILE = 'SO23707979.dat'
set key noautotitle
set offsets 1,1,1,1
set pointsize 4
stats FILE u 0 nooutput
N = STATS_records # get the number of rows
p0=x1=y1=NaN
plot for [n=0:N-1 ] FILE u (x0=x1, x1=$1, x0):(y0=y1, y1=$2, y0):(p0=$3) \
every ::n::n w p pt p0 lc rgb "red", \
FILE u 1:2 every ::N-1::N-1 w p pt p0 lc rgb "red"
### end of script
Result:

Plot cyclic sum of some row data

I have a data file that stores k values for each timestamp.
Ex:
# data.dat
# Example for k = 3
# Time ID value
1 0 1.555
1 1 1.76
1 2 12.56
2 0 1.75
2 1 2.04
2 2 13.04
3 0 2.01
3 1 0.52
3 2 12.99
# ...
I can print individually the data of each ID versus the time as follows:
set xrange [0:4]
set yrange[0:14]
set xtics 1
plot "data.dat" every 3 using 1:3 title "ID=0" with lp, \
"" every 3::1 using 1:3 title "ID=1" with lp, \
"" every 3::2 using 1:3 title "ID=2" with lp
Yet I'm interested in plotting the average of the 3 values vs. time.
Of course, I could regenerate a new data file containing (with evaluated sum):
# avg_data.dat modified to
# Example for k = 3
# Time ID value
1 (1.555+1.76+12.56)/3
2 (1.75+2.04+13.04)/3
3 (2.01+0.52+12.99)/3
# ...
But of course, I'm seeking an automated way to express that in gnuplot using the data.dat file directly...
Drawing some inspiration from the running average demo on the gnuplot site:
k = 3
back1 = back2 = back3 = 0
shifter(x) = (back3 = back2, back2 = back1, back1 = x)
avger(x,y) = (shifter(x), y == k - 1 ? (back1 + back2 + back3)/3 : 1/0)
plot 'data.dat' u 1:(avger($3, $2)) with points pt 7
This works for me in gnuplot 4.6.1. If you want to have the points at each timestep connected in a line, it may be better to preprocess the data, since gnuplot in general won't connect points resulting from an expression evaluation (see discussion here and here, and in the gnuplot docs for set datafile missing).
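If preprocessing is the chosen route, the same shifting logic is easy to reproduce outside gnuplot. The following Python sketch is only an illustration under the assumption that the rows are ordered by time and then by ID, as in data.dat; the function averages is a made-up name.

```python
K = 3

# Rows of data.dat: time, ID, value
rows = [
    (1, 0, 1.555), (1, 1, 1.76), (1, 2, 12.56),
    (2, 0, 1.75), (2, 1, 2.04), (2, 2, 13.04),
    (3, 0, 2.01), (3, 1, 0.52), (3, 2, 12.99),
]

def averages(rows, k):
    # Keep the last k values; emit their mean whenever the ID equals k - 1
    window = []
    out = []
    for time, ident, value in rows:
        window = (window + [value])[-k:]
        if ident == k - 1:
            out.append((time, sum(window) / k))
    return out

for time, avg in averages(rows, K):
    print(time, round(avg, 4))
```

The output is one (time, average) pair per timestamp, which can be written to a file and plotted with lines in the usual way.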
