How do I group strings and their data using Gnuplot? - gnuplot

I'm brand new to Gnuplot and want to be able to graph a huge amount of data that looks like this:
Description violFine state
"Red Light Violation" $75.00 MD
"No Stop/Park Handicap" $502.00 MD
"Red Light Violation" $75.00 MD
"No Stop/Park Handicap" $502.00 MD
"Red Light Violation" $75.00 MD
"Red Light Violation" $75.00 MD
"Red Light Violation" $75.00 VA
"All Other Stopping or Parking Violations" $32.00 MD
"Red Light Violation" $75.00 MD
"Red Light Violation" $75.00 MD
As you can see, the top line is the names of the columns and I have many duplicate string values in the "Description" column. What I want to do is add up all the "violFine" numbers per unique "Description" and plot it with the "Description" on the x-axis and the total of the "violFines" on the y-axis. I've made a graph to illustrate what I'm talking about accessible at this link: http://i.imgur.com/NtZsZCR.jpg
(Sorry, I would've made it available on this page if I had enough reputation points).
Any help with going about this would be awesome! Thanks!

This sort of data processing task isn't well suited for gnuplot. Luckily, gnuplot is happy to let you use other tools to process the data and then pipe the result in. Here, I would use python:
from collections import defaultdict
import csv
import sys

d = defaultdict(list)
with open(sys.argv[1]) as fin:
    next(fin)  # skip the header line, which doesn't contain data
    reader = csv.reader(fin, delimiter=' ', quotechar='"')
    for row in reader:
        d[row[0]].append(float(row[1][1:]))  # strip the leading '$'
for k, v in d.items():
    print('"{0}"'.format(k), sum(v))
Now in gnuplot, you can plot this as:
plot '< python script.py datafilename' using (column(0)):2:xtic(1) with lines

You can also do it in gnuplot only without external tools.
define a function inList(), which determines if an item is already in the list
create a list of unique items
define a function to get the index (i.e. x-value) of an item in the unique list
sum up the second column (after removing $) for equal x-values via smooth freq
every ::1 is skipping the first (header) line
For gnuplot>=5.0.0 you could also use sum and word() for the function inList(). This does not work for gnuplot 4.x, however, because word() ignores matching double quotes there: word('"abc def" ghi',2) returns ghi in gnuplot 5.x, but def" in gnuplot 4.x. Hence, for 4.x there is another approach using strstrt() and an added index number, which also works for 5.x.
Script: (works for gnuplot>=4.6.0, March 2012)
### sum up values depending on keyword
reset
FILE = "SO/SO15316764.dat"
# create list of unique elements
c = 0
uniq = ''
inList(list,s) = strstrt(list,'"'.s.'"')
stats FILE u (uniq=uniq.(inList(uniq,strcol(1)) ? '' : sprintf('"%s" %d ',strcol(1),c=c+1))) every ::1 nooutput
getIndex(list,s) = (_n=inList(list,s)) ? int(word(list[_n+2+strlen(s):],1)) : 0
set boxwidth 0.8
set style fill solid 0.4
set key noautotitle
set xrange[0.5:c+0.5]
plot FILE u (getIndex(uniq,strcol(1))):(real(strcol(2)[2:])):xtic(1) every ::1 smooth freq w boxes
### end of script
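For readers who find the string-list bookkeeping hard to follow, the steps above can be sketched in Python. This only illustrates the logic (unique list, index lookup, per-index sum) and is not a replacement for the gnuplot script. The names SAMPLE, unique_index and sum_per_index are made up for this sketch, and the data is the ten sample rows from the question.

```python
import shlex

# The ten data rows from the question. The header line is already left out,
# which is the job of `every ::1` in the gnuplot script.
SAMPLE = '''"Red Light Violation" $75.00 MD
"No Stop/Park Handicap" $502.00 MD
"Red Light Violation" $75.00 MD
"No Stop/Park Handicap" $502.00 MD
"Red Light Violation" $75.00 MD
"Red Light Violation" $75.00 MD
"Red Light Violation" $75.00 VA
"All Other Stopping or Parking Violations" $32.00 MD
"Red Light Violation" $75.00 MD
"Red Light Violation" $75.00 MD'''


def unique_index(rows):
    """Map each description to a 1-based x-value, in order of first appearance."""
    index = {}
    for desc, _fine, _state in rows:
        if desc not in index:  # plays the role of inList()
            index[desc] = len(index) + 1
    return index


def sum_per_index(rows, index):
    """Add up the fines for equal x-values, like `smooth freq` does."""
    totals = {}
    for desc, fine, _state in rows:
        x = index[desc]  # plays the role of getIndex()
        totals[x] = totals.get(x, 0.0) + float(fine[1:])  # drop the leading '$'
    return totals


# shlex.split keeps a double-quoted description together as one field.
rows = [shlex.split(line) for line in SAMPLE.splitlines()]
index = unique_index(rows)
totals = sum_per_index(rows, index)

for desc, x in index.items():
    print(x, desc, totals[x])
```

The x-values follow the order in which each description first appears in the file, which is also how the gnuplot script numbers them.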

Related

Putting a value from an input data file into 'Set Label'

I'm plotting an animated surface in gnuplot and want to read in an average or sum of the mapped z values and include this in a label to be printed in the plot, so that I get a running total updated as the GIF progresses. It's probably straightforward, but I'm a "gnu"bie, so to speak, and find this system pretty confusing!
I've tried putting the running sum and average numbers in additional columns ...
splot 'output3.dat' index i:i using 1:2:(column(3), TD1 = strcol(4), TD2 = strcol(5)) with pm3d
but this doesn't plot, and the string variables TD1, TD2 don't seem to exist outside the splot command.
The command you show would indeed set variables TD1 and TD2 globally if you change the order of clauses in the serial evaluation expression (the comma-separated sub-expressions):
splot 'output3.dat' index i:i using 1:2:(TD1 = strcol(4), TD2 = strcol(5), column(3)) with pm3d
However, if the idea is to create a label using set label that will appear as part of the resulting graph, this won't work. The set label command would have to be executed before the splot command, so TD1 and TD2 will not have the correct values yet.
There is an alternative that might serve you better. Instead of trying to put this dynamically evaluated information in a label, put it in the plot title. Unlike a label, the plot title is evaluated after the corresponding plot is generated, so any variables set or updated by that plot will be current. [Caveat: this is true for current gnuplot (version 5.4) but was not always true. In older gnuplot versions the title is evaluated before the plot rather than after.]
Since current gnuplot also allows you to place the individual plot titles somewhere other than in the key proper, you have the same freedom that you would with a label to position the text anywhere on the output page. For example, if you want to sum the values in column 3 of the data file and print the total as part of a title above the resulting plot:
SUM = 0
splot 'foo.dat' using 1:2:(SUM = SUM+column(3), column(3)) with linespoints title 'foo.dat', \
keyentry title = sprintf("Points sum to %g", SUM) at screen 0.5 0.9
I used a separate keyentry clause because this makes it possible to omit the sample line segment that would otherwise be generated, but it would also be possible to make this the title of the plot itself if you want that sample line.
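The ordering argument can be mimicked in a few lines of Python. This is only an analogy, not gnuplot code: the rows are made-up stand-ins for columns 1:2:3 of a data file, and label_text and title_text are hypothetical names for text that is built before and after the data pass.

```python
# Made-up (x, y, z) rows standing in for columns 1:2:3 of a data file.
rows = [(0, 0, 1.5), (0, 1, 2.5), (1, 0, 4.0)]

total = 0.0

# A label has to be set before plotting, so it only sees the initial value.
label_text = "Points sum to %g" % total

plotted = []
for x, y, z in rows:
    # Mimics (SUM = SUM+column(3), column(3)): the assignment runs first,
    # and the last sub-expression supplies the value that is plotted.
    z_used = ((total := total + z), z)[-1]
    plotted.append((x, y, z_used))

# A title is built after the data pass, so it sees the final value.
title_text = "Points sum to %g" % total

print(label_text)
print(title_text)
```

The text built before the loop still shows the starting value, while the text built afterwards shows the full sum. That is the difference between set label and a plot title described above.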

Iteratively generate datablocks in gnuplot

Is it possible to iteratively generate datablocks, where the name of the datablock is built up inside the loop?
Let's assume I have three fruits (in reality there are more):
array namelist[3] = ['apple', 'banana', 'pineapple']
I want to create three datablocks with the names $apple_data, $banana_data and $pineapple_data, so I tried the following:
do for [i=1:|namelist|] {
    set table '$'.namelist[i]."_data"
    plot ...
    unset table
}
Unfortunately, instead of datablocks gnuplot created files with these names in the working directory. I guess gnuplot is checking whether the first character after set table is a $?
My second attempt was to remove the apostrophes around $:
set table $.namelist[i]."_data"
But this raised the weird error "Column number or datablock line expected", pointing at the period right after $.
Any ideas on how to achieve that?
The reason for all this is that I read in the banana/apple data files with a lengthy path, apply some lengthy calculations within using, and reuse these for lots of successive stats and plot commands. I would like to avoid having to hard-code and copy-paste the same long path and the cumbersome using command over and over again.
Not sure if I fully understood your detailed intention.
If you only want to avoid typing (or copy pasting) a lengthy path again and again, simply use variables:
FILE1 = 'C:/Dir1/SubDir1/SubSubDir1/SubSubSubDir1/File1'
FILE2 = 'C:/Dir2/SubDir2/SubSubDir2/SubSubSubDir2/File2'
plot FILE1 u 1:2, FILE2 u 1:2
Anyway, you asked for dynamically generated datablocks. One way which comes to my mind is using evaluate, check help evaluate. Check the example below as a starting point, which can probably be simplified.
Code: (simplified thanks to @Eldrad's comment)
### dynamically generate some datablocks
reset session
myNames = 'apple banana pineapple'
myName(i) = word(myNames,i)
N = words(myNames)
set samples 11
do for [i=1:N] {
    eval sprintf('set table $%s_data',myName(i))
    plot '+' u 1:(rand(0)) w table
    unset table
}
plot for [i=1:N] sprintf('$%s_data',myName(i)) w lp pt 7 ti myName(i)
### end of code
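The point of eval is that the whole command line is assembled as a string first and only then parsed by gnuplot, so the datablock name appears literally in the command. The Python sketch below only reproduces that string assembly, to show what text eval receives for each fruit. The helper names set_table_command and datablock_name are invented for this illustration.

```python
# Same word list as myNames in the gnuplot code above.
names = "apple banana pineapple".split()


def set_table_command(name):
    """Text that sprintf('set table $%s_data', ...) hands to eval."""
    return "set table $%s_data" % name


def datablock_name(name):
    """Text that sprintf('$%s_data', ...) produces for the final plot."""
    return "$%s_data" % name


commands = [set_table_command(n) for n in names]
blocks = [datablock_name(n) for n in names]

for command in commands:
    print(command)
```

Each printed line is a complete command in which the $ directly follows set table, as in a hand-written script.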

Bumbling around plotting two sets of seasonal data on the same chart

I have series of monthly inventory data since 2017.
I have a series of inventory_forecasts since Dec2018
I am trying to plot the inventory data on a monthly-seasonal basis, and then overlay the inventory_forecasts of Jan2019 through Dec2019.
The dataframe looks like:
The first way I tried to make the chart does show all the data I want, but I'm unable to control the color of the inventory_zj line. Its color seems to be dominated by the color=year(date):N of the alt.Chart I configured. It is ignoring the color='green' I pass to the mark_line()
base = alt.Chart(inv.loc['2000':].reset_index(), title=f"usa total inventory").mark_line().encode(
x='month',
y="inventory",
color="year(date):N"
)
#this ignores my 'green' color instruction, and marks it the same light blue 2019 color
joe = base.mark_line(color='green').encode(
alt.Y('inventory_zj', scale=alt.Scale(zero=False), )
)
base+joe
I tried to use a layering system, but it's not working at all -- I cannot get it to display the "joe" layer
base = alt.Chart(inv.loc['2000':].reset_index(), title=f"usa total inventory").encode(
x='month(date)'
)
doe = base.mark_line().encode(
alt.Y('inventory', scale=alt.Scale(zero=False), ),
color="year(date):N"
)
joe = base.mark_line(color="green").encode(
alt.Y('inventory_zj', scale=alt.Scale(zero=False), ),
)
#looks identical to the first example
alt.layer(
doe, joe
).resolve_scale(
y='shared'
).configure_axisLeft(labelColor='black').configure_axisRight(labelColor='green',titleColor='green')
#independent shows a second y-axis (which is different from the left y-axis) but no line
alt.layer(
doe, joe
).resolve_scale(
y='independent'
).configure_axisLeft(labelColor='black').configure_axisRight(labelColor='green',titleColor='green')
I feel like I must be trying to assemble this chart in a fundamentally wrong way. I should be able to share the same left y-axis, have the historic data colored by its year, and have a unique color for the 2019-forecasted data. But I seem to be making a mess of it.
As mentioned in the Customizing Visualizations docs, there are multiple ways to specify things like line color, with a well-defined hierarchy: encodings override mark properties, which override top-level configurations.
In your chart, you write base.mark_line(color='green'), where base contains a color encoding, which overrides the mark property. If the layer is not derived from a chart that has a color encoding, then the line will be green as you hoped. Something like this:
base = alt.Chart(inv.loc['2000':].reset_index(), title=f"usa total inventory")
inventory = base.mark_line().encode(
x='month',
y="inventory",
color="year(date):N"
)
joe = base.mark_line(color='green').encode(
x='month',
y=alt.Y('inventory_zj', scale=alt.Scale(zero=False))
)
inventory + joe
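The precedence rule itself can be written down as a toy model in plain Python. This is not Altair's API or its internals, just a hypothetical resolve_color function that applies the stated order: encoding first, then mark property, then top-level configuration.

```python
def resolve_color(encoding=None, mark=None, config=None):
    """Return (level, value) for the highest-priority color that is set.

    Toy model of the documented hierarchy:
    encodings override mark properties, which override configurations.
    """
    for level, value in (("encoding", encoding), ("mark", mark), ("config", config)):
        if value is not None:
            return level, value
    return "default", None


# First attempt in the question: the inherited color encoding wins over 'green'.
print(resolve_color(encoding="year(date):N", mark="green"))

# Layer without a color encoding: the mark property takes effect.
print(resolve_color(mark="green"))
```

The first call models the question's first attempt, where joe inherits the color encoding from base. The second models the corrected layer.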

Change color and legend of plotLearnerPrediction ggplot2 object

I've been producing a number of nice plots with the plotLearnerPrediction function in the mlr package for R. They look like this. From looking into the source code of the plotLearnerPrediction function it looks like the color surfaces are made with geom_tile.
A plot can for example be made by:
library(mlr)
data(iris)
#make a learner
lrn <- "classif.qda"
#make a task
my.task <- makeClassifTask(data = iris, target = "Species")
#make plot
plotLearnerPrediction(learner = lrn, task = my.task)
Now I wish to change the colors, using another red, blue and green tone to match those of some other plots that I've made for a project. For this I tried scale_fill_continuous and scale_fill_manual without any luck (Error: Discrete value supplied to continuous scale). I also wish to change the legend title and the labels for each legend entry (which I tried by giving appropriate parameters to the scale_fill functions above). There's a lot of info out there on how to set the geom_tile colours when producing the plot, but I haven't found any info on how to do this post-production (i.e. in somebody else's plot object). Any help would be much appreciated.
When you look into the source code you see how the plot is generated and then you can see which scale has to be overwritten or set.
In this example it's fairly easy:
g = plotLearnerPrediction(learner = lrn, task = my.task)
library(ggplot2)
g + scale_fill_manual(values = c(setosa = "yellow", versicolor = "blue", virginica = "red"))

How to prevent labels to overlap

I am running the following command to draw a few X,Y points in gnuplot:
plot "Output.tsv" using ($2+3):($3+3):1 with labels, "Output.tsv" using 2:3
Some of the data points are very close to each other and it makes the label unreadable. Is there a way to ask gnuplot to eliminate/reduce the overlap between labels?
I think you could consider 3 options:
1) make your graph huge and hope your labels do not overlap
2) plot the points as different series with each item having its own legend
3) use letters instead of labels, you can put a letter at each point using
plot "???" using 1:2
plot "" using 1:2:(stringcolumn(3) eq 'compare to' ? 'if equal' : 'if not equal' ) with labels
The stringcolumn function reads column 3 and compares the value to the string 'compare to'. If there is a match it puts 'if equal' at that location, otherwise 'if not equal'.
For example, I see something like Simulator in your graph. You could keep the green point and put an S on it using
plot "" using 1:2:(stringcolumn(3) eq 'Simulator' ? 'S' : '' ) with labels
I hope this helps.
