How to create clustered bar graph based on one of the columns? - gnuplot

I am trying to create a bar graph from the data below, binned by the category given in the first column. I am looking for a gnuplot feature like hue in seaborn.
The csv file I am using looks as follows (snippet):
DS_TYPE, arg1, arg2, arg3, arg4
type1, 24, 20000, 15, 20
type2, 48, 20000, 20, 60
type3, 96, 20000, 25, 90
type3, 144, 200000, 30, 110
...
The figure I want is a bar chart with arg1 on the x-axis, arg4 on the y-axis, and DS_TYPE (there are only 3 types) as hue.
Currently I only see solutions that add more columns to this csv file (arg1_type1, arg1_type2, and so on). I tried:
#!/bin/bash
gnuplot -persist <<-EOFMarker
set datafile separator ','
plot './test.csv' using 2:3:xtic(1) with boxes
EOFMarker
I read similar code like this from the gnuplot manual on (rowstacked) histogram, but I cannot find a solution for bar chart:
Each cluster of boxes is derived from a single row of the input data file. It is common in such input files that
the first element of each row is a label. Labels from this column may be placed along the x-axis underneath
the appropriate cluster of boxes with the xticlabels option to using.
I tried to adapt this to a bar chart with the code above (I am not sure I understand it correctly; I think xtic(1) means the first column is used for binning), but it didn't work.
The graph I am looking for (I did it with seaborn) is like this:
Note:
In this example there are only 3 types, but I am looking for a binning approach that also handles cases where the number of types N is unknown.

After the OP's clarification and illustrative example, here is an attempt to get to the desired plot.
Create unique lists of your keywords for the items and the groups. The sequence will be in the order of first occurrence in the data. Unfortunately, gnuplot has no built-in sorting feature; alphanumerical sorting would require external tools or very weird workarounds.
Plot the data in two nested loops, filtering the data accordingly.
Add the legend by using the ternary operator (check help ternary).
Add a single xtic centered per group, regardless of whether the number of items is odd or even.
This solution is not very obvious and maybe there is a simpler approach using the plotting style histogram. I would love to learn about a simpler solution.
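The list-building and positioning steps can be tried on their own before running the full script. The following is a minimal sketch, assuming the same addToList and xPos definitions as in the script; the keyword strings fed into the loop are made-up test values, not read from a file.

```gnuplot
### sketch: unique list and box positions, no data file needed
reset session

# append s to list only if it is not already one of its words
addToList(list,s) = list.( int(sum [_i=1:words(list)] word(list,_i) eq s) ? '' : ' '.s)

uniq1 = ''
do for [s in "type1 type2 type3 type3 type1"] { uniq1 = addToList(uniq1, s) }
print sprintf("items:%s (%d unique)", uniq1, words(uniq1))

# x-position of item i in group j, leaving one empty slot between groups
gap = 1
xPos(i,j) = (i-1) + (j-1)*words(uniq1) + j*gap
do for [j=1:2] {
    first = xPos(1,j)
    last  = xPos(words(uniq1),j)
    print sprintf("group %d: boxes at x = %d..%d, tic at x = %g", j, first, last, (first+last)/2.)
}
### end of sketch
```

With three unique items and gap = 1, the two groups should occupy x = 1..3 and x = 5..7, which puts the group tics at 2 and 6.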
Script:
### plot grouped box chart
reset session
$Data <<EOD
DS_TYPE, arg1, arg2, arg3, arg4
type1, 24, 20000, 15, 20
type2, 48, 20000, 20, 60
type3, 96, 20000, 25, 90
type3, 144, 200000, 30, 110
EOD
set datafile separator comma
myColors = "0x3a7ca4 0xe38a3f 0x439549"
myColor(i) = int(word(myColors,i))
colI = 1 # column item
colG = 3 # column group
colY = 5 # column y-value
uniq1 = uniq2 = ''
addToList(list,s) = list.( int(sum [_i=1:words(list)] word(list,_i) eq s) ? '' : ' '.s)
stats $Data u (uniq1=addToList(uniq1,strcol(colI)), uniq2=addToList(uniq2,strcol(colG))) skip 1 nooutput
gap = 1
xPos(i,j) = (i-1) + (j-1)*words(uniq1) + j*gap
set boxwidth 1.0
set style fill solid 1.0
set key noautotitle top left
set tics out
set yrange[0:]
set offsets 0.5,0.5,0.5,0
plot for [i=1:words(uniq1)] for [j=1:words(uniq2)] $Data u (xPos(i,j)): \
(word(uniq1,i) eq strcol(colI) && word(uniq2,j) eq strcol(colG) ? column(colY) : NaN): \
(myColor(i)) w boxes lc rgb var ti j==1?word(uniq1,i):'', \
for [j=1:words(uniq2)] '+' u ((xPos(1,j)+xPos(words(uniq1),j))/2.):(NaN):xtic(word(uniq2,j)) every ::::0
### end of script
Result:

Related

gnuplot: simple beeswarm example

I have been struggling with a basic beeswarm plot from page 62 in this doc. I imagine they are skipping some details, and I'm not sure what actual data they used. I think in particular the problem is mapping a categorical/string variable to an X-axis value.
I used this data:
A 1
A 2
A 3
B 4
B 5
B 6
With this script:
set terminal png
set output "graph.png"
set jitter
plot "data.csv" using 1:2:1 with points lc variable
I get this error:
"graph_script" line 4: warning: Skipping data file with no valid points
plot "data.csv" using 1:2:1 with points lc variable
^
"graph_script" line 4: x range is invalid
In their demos gallery, I see something like set xtics ("A" -1, "B" 0) which could maybe help me to label already-numeric data better, but what if my data doesn't start off numeric to begin with?
Do I need something like (hash_string_to_large_int($1) % 2)? There must be an easier way!
As mentioned in the comments you have to "convert" your keys into numbers in order to plot them.
You can do this by creating a list with your unique keywords and defining a function to get the indices.
First, the following example creates some random data.
The code after that knows nothing about the keywords, so it creates the unique list from scratch from the random data.
Maybe there is a simpler gnuplot-only solution that I am not aware of.
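The key-to-index lookup can be checked in isolation. This is a minimal sketch that hard-codes the keyword list (an assumption for illustration only; the full code builds the list from the data column) and uses the same getIdx definition.

```gnuplot
### sketch: map a string key to its index in a word list
Uniques = 'A B'     # hypothetical, hard-coded list of keys
getIdx(key) = (_idx=NaN, sum [_i=1:words(Uniques)] (word(Uniques,_i) eq key ? _idx=_i : 0), _idx)
print getIdx("A")   # index of the first key
print getIdx("B")   # index of the second key
print getIdx("C")   # NaN, because "C" is not in the list
### end of sketch
```

Returning NaN for unknown keys means such rows are simply not plotted, since gnuplot treats NaN coordinates as missing points.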
Code:
### bee-swarm plot with string keys
reset session
# create some random test data
myExts = '.py .sh .html'
set print $Data
do for [i=1:100] {
print sprintf("%s %d",word(myExts,int(rand(0)*3)+1),int(rand(0)*10+1)*5)
}
set print
# create a unique list of strings from a data stringcolumn
Uniques = ''
addToList(list,col) = list.( strstrt(list,'"'.strcol(col).'"') > 0 ? '' : ' "'.strcol(col).'"')
stats $Data u (Uniques = addToList(Uniques,1),0) nooutput
getIdx(key) = (_idx=NaN, sum [_i=1:words(Uniques)] (word(Uniques,_i) eq key ? _idx=_i : 0), _idx)
set offsets 0.5,0.5,1,1
set key noautotitle
set multiplot layout 1,2
set title "No jitter"
plot $Data u (idx=getIdx(strcol(1))):2:(idx):xtic(word(Uniques,idx)) w points pt 7 lc var
set title "With jitter"
set jitter
replot
unset multiplot
### end of code
Result:

Getting plot title and caption data from the data file

Consider the following file that I want to plot using gnuplot: Servos20211222_105253.csv
# Date/Time 2021/12/22, 10:52:53
# PonE=0,LsKp=200,LsKi=0,LsKd=250,HsKp=40,HsKi=0,HsKd=130,Sp=800,TDEC=1175137
#
# Rel. Time, currentPos, PosPID, currentSpeed, speedPID, Lag, ServoPos
0.00000,4693184,0,0,0,0,4693184
0.00000,4693184,2300,0,368,0,4693184
0.00391,4693185,2300,12,367,0,4693184
:
:
I would like to:
set the plot title to the date/time from the first comment record.
display the record that starts "# PonE" as a caption.
extract the value for TDEC and plot a horizontal line with the name "Target"
I have some influence over the format of the header records, so if (for example) it would be better that they were not comments but provided in some other way, then that can be done.
It is a common problem to get text values from files using only gnuplot. If you can use OS- and shell-dependent solutions, I'd suggest removing the comments from the file and trying something like
set title "`head -1 Servos20211222_105253.csv`"
You can place text anywhere using set label <"label text">, where the label text can be the 2nd line from the file.
You can plot a straight line using plot:
p sin(x), 0.5 title "TDEC"
But instead of 0.5, you need to get the value using shell scripts again, e.g. the cut unix command.
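Put together, the shell-based approach might look like the sketch below. It assumes a Unix-like system with head, sed, cut and grep available, and a header laid out exactly as in the question (comment lines starting with "# "); the label position is an arbitrary choice. It uses gnuplot's system() function instead of backticks so that the file name is kept in one variable.

```gnuplot
### sketch: read header values through the shell (Unix-like systems only)
FILE = "Servos20211222_105253.csv"

# 1st line without the leading "# " -> plot title
set title system(sprintf("head -n 1 '%s' | cut -c3-", FILE))

# 2nd line without the leading "# " -> caption
set label 1 system(sprintf("sed -n 2p '%s' | cut -c3-", FILE)) at graph 0.02, graph 0.95

# number after "TDEC=" -> level of the horizontal line
TDEC = real(system(sprintf("grep -o 'TDEC=[0-9.]*' '%s' | head -n 1 | cut -d= -f2", FILE)))

set datafile separator comma
plot FILE u 1:2 w l title "currentPos", TDEC w l lc "red" title "Target"
### end of sketch
```

Because the header lines still start with "#", gnuplot skips them as comments when reading the data, so the file does not need to be modified for this variant.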
There are ways with gnuplot only, although sometimes a bit cumbersome compared with using tools which you have available on Linux (or comparable tools which you need to install on Windows).
Update: shorter and "simplified" script
One possible gnuplot-only way:
set commentschar to nothing, i.e. ''
assign the columns to variables and/or arrays, e.g. myDate, myTime, P[1..9].
Merge P[1..8] into a multi-line string Params by "mis"-using sum (check help sum)
Convert P[9] into a floating point number TDEC for plotting
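The "mis"-use of sum as a loop is the least obvious step, so here is a minimal sketch of it alone. The array contents are sample values copied from the header in the question; array initializers and the |P| cardinality operator are assumed to be available (gnuplot 5.2 or later).

```gnuplot
### sketch: join array elements into one multi-line string
array P[3] = ["PonE=0", "LsKp=200", "LsKi=0"]   # sample values for illustration
Params = P[1]
# sum is used only as a loop: each term appends one element and contributes 0
Params = (sum [_i=2:|P|] (Params = Params.sprintf("\n%s", P[_i]), 0), Params)
print Params
### end of sketch
```

The resulting string holds one parameter per line and can be passed directly to set label.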
Script: (modified the data a bit just for illustration)
### extract values from headers with gnuplot only
reset session
$Data <<EOD
# Date/Time 2021/12/22, 10:52:53
# PonE=0,LsKp=200,LsKi=0,LsKd=250,HsKp=40,HsKi=0,HsKd=130,Sp=800,TDEC=1175137
#
# Rel. Time, currentPos, PosPID, currentSpeed, speedPID, Lag, ServoPos
0.00000,1300000,0,0,0,0,4693184
0.00200,1200000,2300,0,368,0,4693184
0.00391,1100000,2300,12,367,0,4693184
EOD
set datafile separator comma commentschar ''
array P[9] # array to store parameters
stats $Data u ($0==0 ? (myDate=strcol(1)[3:], myTime=strcol(2)) : \
sum [_i=1:9] (P[_i] = _i==1 ? strcol(_i)[3:] : strcol(_i) ,0 )) \
every ::0::1 nooutput
set datafile commentschar # set back to default
Params = P[1]
Params = (sum [_i=2:8] (Params=Params.sprintf("\n%s",P[_i]),0),Params)
set title sprintf("%s %s", myDate, myTime)
TDEC = real(P[9][6:]) # convert to real number
set label 1 at graph 0.02, first TDEC P[9] offset 0,-0.7
set label 2 at graph 0.02, graph 0.85 Params
plot $Data u 1:2 w lp pt 7 title "Data", \
TDEC w l lc "red" title "Target"
### end of script
Result:

gnuplot single plot in different colors

I have a single column of data (say 100 samples):
plot 'file' using 1 with lines
But this data is segmented: 10 points, then 10 more, etc... and I'd like each block of 10 to appear in a different color. I did filter them to 10 separate files and used
plot 'file.1' with lines, 'file.2' with lines...
But then the X axis goes 0..10 instead of 0..100 and all 10 graphs are stacked. Is there a simple way to do that without having to generate fake X data ?
Depending on your detailed data format, the following does what I think you are asking for.
Your "fake x data" is called pseudocolumn 0 (check help pseudocolumns). You can change the color with lc var (check help linecolor variable).
Code:
### variable line color
reset session
# create some test data
set print $Data
do for [i=1:100] {
print sprintf("%g", rand(0)*i)
}
set print
plot $Data u 0:1:(int($0/10)) w lp pt 7 lc var notitle
### end of code
Result:

max value for same minute over multiple days from csv with unix timestamps

I have a CSV with a unix timestamp column that was collected over multiple days, with a data row every 5 minutes (the output log of my photovoltaic rooftop power plant).
I'd like to create a plot for 24 hours that shows the maximum value for every single (fifth) minute over all days.
Can this be done with gnuplot's own capabilities, or do I have to do the processing outside gnuplot via scripts?
You don't show what your exact data structure looks like. - theozh
These files are rather large. I placed an example here:
http://www.filedropper.com/log-pv-20190607-20190811 (300kB)
I'm especially interested in columns 4 (DC1 P) and 9 (DC2 P).
Column 1 (Zeit) holds the unix timestamp.
The final goal is separate graphs (colors) for DC1 P and DC2 P, but that's a different question... ;o)
Update/Revision:
After revisiting this answer, I guess it is time for a cleanup and a simpler, extended solution. After some iterations and clarifications, and after the OP provided some data (although the link is no longer valid), I came up with some suggestions, which can be improved.
You can do it all in gnuplot, no need for external tools!
The original request to plot the maximum values from several days is easy if you use the plotting style with boxes. But this is basically only a graphical solution. In that case it was apparently sufficient. However, if you are interested in the maximum values as numbers, it takes a little more effort.
gnuplot has the options smooth unique and smooth frequency (check help smooth). With these you can easily get the average and the sum, respectively, but there is no smooth max or smooth min. As @meuh suggested, you can get the maximum or minimum with arrays, which are available since gnuplot 5.2.0.
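To see what the two smooth options do, the following sketch runs them on a tiny inline datablock and prints the results instead of plotting them. The four data rows are made-up values chosen so the result is easy to check by hand.

```gnuplot
### sketch: smooth unique (average) vs. smooth frequency (sum)
reset session
$D <<EOD
0   1
0   3
300 2
300 6
EOD
set table $Avg
plot $D u 1:2 smooth unique
unset table
set table $Sum
plot $D u 1:2 smooth frequency
unset table
print $Avg
print $Sum
### end of sketch
```

Rows sharing the same x value are merged: $Avg should hold the means (2 and 4) and $Sum the totals (4 and 8).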
Script: (Requires gnuplot>=5.2.0)
### plot time data modulo 24h avg/sum/min/max
reset session
FILE = 'log-pv-20190607-20190811.csv'
set datafile separator comma
HeaderCount = 7
myTimeFmt = "%Y-%m-%d %H:%M:%S"
StartTime = ''
EndTime = ''
# if you don't define start/end time it will be taken automatically
if (StartTime eq '' || EndTime eq '') {
stats FILE u 1 skip HeaderCount nooutput
StartTime = (StartTime eq '' ? STATS_min : strptime(myTimeFmt,StartTime))
EndTime = (EndTime eq '' ? STATS_max : strptime(myTimeFmt,EndTime))
}
Modulo24Hours(t) = (t>=StartTime && t<=EndTime) ? (int(t)%86400) : NaN
set key noautotitle
set multiplot layout 3,2
set title "All data" offset 0,-0.5
set format x "%d.%m." timedate
set grid x,y
set yrange [0:]
myHeight = 1./3*1.1
set size 1.0,myHeight
plot FILE u 1:4:(tm_mday($1)) skip HeaderCount w l lc var
set multiplot next
set title "Data per 24 hours"
set format x "%H:%M" timedate
set xtics 3600*6
set size 0.5,myHeight
plot FILE u (Modulo24Hours($1)):4:(tm_mday($1)) skip HeaderCount w l lc var
set title "Average"
set size 0.5,myHeight
plot FILE u (int(Modulo24Hours($1))):4 skip HeaderCount smooth unique w l lc "web-green"
set title "Sum"
set size 0.5,myHeight
plot FILE u (int(Modulo24Hours($1))):4 skip HeaderCount smooth freq w l
set title "Min/Max"
set size 0.5,myHeight
N = 24*60/5
SecPerDay = 3600*24
array Min[N]
array Max[N]
do for [i=1:N] { Min[i]=NaN; Max[i]=0 } # initialize arrays
stats FILE u (idx=(int($1)%SecPerDay)/300+1, $4>Max[idx] ? Max[idx]=$4:0, \
Min[idx]!=Min[idx] ? Min[idx]=$4 : $4<Min[idx] ? Min[idx]=$4:0 ) skip HeaderCount nooutput
plot Min u ($1*300):2 w l lc "web-blue", \
Max u ($1*300):2 w l lc "red"
unset multiplot
### end of script
Result:
From gnuplot 5.2 you could use the new array datatype to calculate a maximum value for each 5 minute slot. I am not a gnuplot expert, so the following example needs more work, but shows the potential.
Assume the data is similar to these lines, where each line has a date in the format yyyy.mm.dd.HH:MM, a comma, and a y value:
2018.02.03.18:23,4
2018.02.03.19:23,7
2018.02.04.18:23,8
2018.02.05.19:23,11
Instead of using gnuplot's built-in time parsing, since we want to ignore the date, we create a function fsecs to use substr(stringcolumn(...),12,16) to get just the hours and minutes from data column 1, and strptime("%H:%M",...) to convert this to seconds:
set datafile separator ","
fsecs(v) = strptime("%H:%M",substr(stringcolumn(v),12,16))
We create an array Max indexed by "5 minute slot", of which there are 24*60/5 per day. It is initialised to NaN, not-a-number.
Nitems = int(24*60/5)+1
array Max[Nitems]
do for [i=1:Nitems] {
Max[i] = NaN
}
We then "plot" the data file data.csv into a dummy table, rather than generating any output. As we go through the data we index Max by the data x value (column 1) converted to seconds by fsecs(1) and then to slot by findex(). This is Max[findex(fsecs(1))].
We call our function fmax() to return the new maximum to set in the array.
findex(x) = int(((x)/60)/5) + 1   # +1 because gnuplot arrays are 1-based (slot 1 starts at 00:00)
fmax(a,b) = ((a>=b)?a:b)
set table $Dummy
plot 'data.csv' using \
(Max[findex(fsecs(1))] = fmax(Max[findex(fsecs(1))],$2)):2
unset table
Finally, we plot the array, which is the slot number against the value held in that slot number.
plot Max using 1:(Max[$1]) with points lw 2 title "max day"
This works for me on 5.2. You still need to label the x axes with HH:MM, and change the date parsing to fit your needs.
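For the remaining HH:MM labelling, one possibility is to convert the slot number back to seconds and let gnuplot format the axis as time. The sketch below is self-contained and uses a hypothetical array Slot filled with placeholder values rather than real measurements; it assumes slot i starts (i-1)*300 seconds after midnight.

```gnuplot
### sketch: label the x-axis of an array plot as HH:MM
reset session
N = 24*60/5                      # number of 5-minute slots per day
array Slot[N]
# placeholder values only (a smooth bump around noon), not real measurements
do for [i=1:N] { Slot[i] = 100*exp(-((i-N/2.)/40.)**2) }
set format x "%H:%M" timedate
set xtics 3600*6
set xrange [0:86400]
plot Slot using (($1-1)*300):2 with lines title "max per slot"
### end of sketch
```

When an array is plotted, column 1 is the index and column 2 the stored value, so no intermediate file is needed.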
For time formatting, please see Gnuplot date/time in x axis
If you do not care about formatting the axis as time, you may use the every keyword (see the gnuplot documentation), but it only selects data points and does not compute a maximum.
For the maximum value over a given time interval I suggest an awk script, see e.g. https://unix.stackexchange.com/a/207287/297901

gnuplot setting line titles by variables

I am trying to plot multiple data lines with their titles in the key based on the variable which I am using as the index:
plot for [i=0:10] 'filename' index i u 2:7 w lines lw 2 t ' = '/(0.5*i)
However, it cannot seem to do this for a fractional multiple of i. Is there a way around this other than to set the title for each line separately?
sprintf should provide all the functionality needed, e.g.,
plot for [i=0:10] .... t sprintf(" = %.1f", 0.5*i)
in order to use the value of 0.5*i with 1 decimal digit...
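A self-contained way to try this without a data file is to plot a family of functions. This is only a sketch: the shifted sine curves and the "shift" wording in the title are made up for illustration.

```gnuplot
### sketch: numeric values in key titles via sprintf
set xrange [-pi:pi]
plot for [i=0:4] sin(x + 0.5*i) title sprintf("shift = %.1f", 0.5*i)
### end of sketch
```

The format %.1f controls the number of decimals, so the key entries read 0.0, 0.5, 1.0 and so on.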
