Plotting columns by calling their header with GnuPlot - gnuplot

I have a file in this format:
x y1 y2 y3 ei1 ei2 ei3 es1 es2 es3
1 4 5 4 7 7 2 4 7 7
2 7 3 3 3 8 3 3 3 8
3 2 1 4 4 9 6 4 4 9
I want to produce plots similar to what the following command would give
plot "filename" using 1:2:5:8 with yerrorbars
but using the column headers (x, y1, ei1 and es1) to refer to the columns.
How can this be done?
Page 84 of the gnuplot manual (documenting the using command) reads:
Height Weight Age
val1 val1 val1
... ... ...
then the following plot commands are all equivalent
plot 'datafile' using 3:1, '' using 3:2
plot 'datafile' using (column("Age")):(column(1)), \
     '' using (column("Age")):(column(2))
plot 'datafile' using "Age":"Height", '' using "Age":"Weight"
However when I tried them I only got the row indices versus themselves.

Taking a quick look at the documentation for gnuplot 4.4 vs. gnuplot 4.6 (the current stable release), it appears that the feature you are trying to use was probably introduced in gnuplot 4.5 (odd numbers are development branches; when they are deemed stable, they get incremented to an even number). The only way that I can think of to accomplish this in an older version is to write a simple script in some other language which prints the column number to stdout. Here's a simple example using Python, although I'm positive that you could do this in awk if you wanted to remain in an all-POSIX environment:
#python indexing is 0 based, gnuplot datafile indexing 1 based
COL_AGE=`python -c 'print(open("datafile").readline().split().index("Age")+1)'`
COL_HEIGHT=`python -c 'print(open("datafile").readline().split().index("Height")+1)'`
plot "datafile" u COL_AGE:COL_HEIGHT
This little script doesn't do anything fancy (It assumes the column headers are on the first line for example), but using the power of python, it would be pretty easy to extend the script further:
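#!/usr/bin/env python
import sys

# Print the 1-based column number of header sys.argv[2] in file sys.argv[1].
with open(sys.argv[1]) as f:
    for line in f:
        if line.strip():  # the first non-blank line holds the headers
            print(line.split().index(sys.argv[2]) + 1)
            sys.exit(0)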
Now you can call this script as: python script.py datafile Age to find out which column "Age" is in. The header name must match exactly, including case; the script raises a ValueError if "Age" is not among the headers.
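To tie this back to the original question, the same lookup can be extended to translate several header names at once into the argument of a using clause. This is only a sketch: the function names header_columns and using_spec are made up for illustration, and the header line is passed in as a string rather than read from a file.

```python
def header_columns(header_line, names):
    """Return the 1-based column number of each requested header name."""
    headers = header_line.split()
    # list.index is 0-based, gnuplot column numbers are 1-based
    return [headers.index(name) + 1 for name in names]


def using_spec(header_line, names):
    """Build the argument of gnuplot's 'using' clause, e.g. '1:2:5:8'."""
    return ":".join(str(col) for col in header_columns(header_line, names))


if __name__ == "__main__":
    header = "x y1 y2 y3 ei1 ei2 ei3 es1 es2 es3"
    print(using_spec(header, ["x", "y1", "ei1", "es1"]))
```

The printed string can then be pasted after using in the plot command, or captured with backticks as in the one-liners above.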


How can I subtract the value in first row from all other values in the column?

I want to subtract the first row of a file from all of its rows, and then plot the result. How can I do this kind of arithmetic in gnuplot?
Here is an example of what I want to do:
Let's say I have a file that has two columns and 1000 rows. I want a script that subtracts the second-column value of the first row from every value in the second column.
I am pretty sure that there are similar questions on SO; however, they are apparently not so easy to find.
I would have searched for "normalization" or "offset".
The following example works even if you have single or double empty lines in your data. The expression in the plot command uses serial evaluation (check help operators binary): on the first data row it stores the second column in y0, and for every row it returns $2-y0.
Sometimes you might see similar solutions using pseudocolumn 0 (check help pseudocolumns); however, these might lead to wrong results if you have empty lines in your data.
Script:
### offset data: subtraction of first value in a column
reset session
$Data <<EOD
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
EOD
plot t=0 $Data u 1:(t==0?y0=$2:0,t=t+1,$2-y0) w lp pt 7 lc "red"
### end of script
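For readers who prefer to preprocess the data outside gnuplot, the same logic can be sketched in Python. The function name subtract_first is hypothetical, and the sketch assumes two whitespace-separated numeric columns; like the gnuplot expression above, it remembers the first y value and ignores empty lines.

```python
def subtract_first(rows):
    """Yield (x, y - y0), where y0 is the y value of the first data row."""
    y0 = None
    for line in rows:
        fields = line.split()
        if not fields:  # skip empty lines instead of counting them
            continue
        x, y = float(fields[0]), float(fields[1])
        if y0 is None:  # first data row: remember the offset
            y0 = y
        yield x, y - y0


data = """0 10
1 11

2 12
""".splitlines()
result = list(subtract_first(data))
```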

Histogram with ggplot2 requires a continuous x variable

I have a dataset in a table format that looks like this:
test frequency
1 test40 3
2 test33 5
3 test19 2
4 test4521 1
5 test34 1
6 test27 3
7 test42 3
8 test35 1
....
If I use this command:
library(ggplot2)
ggplot(t, aes("frequency")) +
geom_histogram()
("t" is the name of my table)
Then RStudio says: "StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?"
I just want to see how many times a 3 or a 5 etc. occurs.
Thanks for your help.
It looks like your data is already aggregated. In that case ggplot2::geom_histogram() may not be the appropriate function to use. Have you tried the geom_col() function? It simply takes the numbers given in the input data frame and displays a column plot of that data.
Using the below code
# Declare data frame
t <- data.frame(test = c("test40", "test33", "test19", "test4521",
"test34", "test27", "test42", "test35"),
frequency = c(3, 5, 2, 1,
1, 3, 3, 1))
returns the data frame like this
# View data
print(t)
test frequency
1 test40 3
2 test33 5
3 test19 2
4 test4521 1
5 test34 1
6 test27 3
7 test42 3
8 test35 1
and therefore you can plot it like this
# Load package
library(ggplot2)
# Generate column plot
ggplot(t, aes(test, frequency)) +
geom_col()
If you simply wanted a count of the times that the number 2 or the number 3 occurred in your data frame, then yes, geom_histogram() is the correct function to use. The geom_histogram() function counts how often each value occurs in the data frame, then plots the result. It checks the type of data that you are trying to plot along the x-axis, and if that data is discrete, you need to pass the parameter stat="count" to the function. Note that aes("frequency") with quotes maps the constant string "frequency", which is discrete, rather than the frequency column, so write aes(frequency) without quotes. If you don't include stat="count", ggplot will try to bin your data to create the histogram, which is unnecessary when all you want is a count.
Check out this link for a description of the difference between continuous and discrete data: What is the difference between discrete data and continuous data?
With this in mind, you can plot the histogram like this
# Generate histogram plot
ggplot(t, aes(frequency)) +
geom_histogram(stat="count")
I hope that helps mate.
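As a language-neutral illustration of what a count over the frequency column produces, here is a small Python sketch. It is not ggplot2 code; it only tallies how often each value occurs, using the eight values from the example data frame.

```python
from collections import Counter

# The "frequency" column from the example data frame above
frequency = [3, 5, 2, 1, 1, 3, 3, 1]

# Tally how many times each distinct value occurs
counts = Counter(frequency)

for value in sorted(counts):
    print(value, counts[value])
```

Each printed pair corresponds to one bar of the count plot: the value on the x-axis and the number of occurrences as the bar height.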

Gnuplot with Linear Regression

I am trying to apply linear regression with fit f(x). Instead of having two columns in a data file, e.g. x and y values, this file has, for example, 5 columns. I want to take the average value of each column and feed it to the f(x) function.
Data:
A B C D E
2 2 5 10 20
4 5 6 11 1
6 8 7 12 4
8 9 12 13 8
10 11 10 14 17
Could I?
Thanks for help
Assume that your data is as specified and you wish to fit a function f(x)=m*x+b to the data where the 0-based column index (0,1,2,3 or 4) should be the x value and the column average should be the y value. We need to construct a new data file that contains the averages.
In gnuplot 5, we can use something called inline data. This is a special variable that behaves like a file. We will find the average of each column of the data and construct an inline data variable containing these. We do this by looping over the column indices and applying the stats function. The print command can be instructed to print to an inline data variable.
set print $l append
do for [i=1:5] {
stats datafile u i nooutput
print STATS_mean
}
set print # restores ordinary print behavior
With your data, we can see what is contained in $l by printing it with print $l:
6.0
7.0
8.0
12.0
10.0
We now can apply the fit command with this data
f(x) = m*x + b
fit f(x) $l u 0:1 via m,b
This will fit the data so that f(x) = average of column x (or as close to it as can be obtained with the fit).
In gnuplot 4.6, inline data is not available, but we can use a temporary file. Replacing all occurrences of $l with "tempfile" will work the same (except for the print $l command), but will add the data to a temporary file named tempfile.
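As a cross-check outside gnuplot, the same two steps (column averages, then a straight-line fit of average against 0-based column index) can be sketched in plain Python. The function names column_means and linear_fit are made up for this sketch, and it uses the closed-form ordinary least-squares formulas rather than gnuplot's iterative fit, so the results should agree only up to the fit tolerance.

```python
def column_means(text):
    """Average of each column, skipping the header row."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    data = [[float(v) for v in row] for row in rows[1:]]
    return [sum(col) / len(col) for col in zip(*data)]


def linear_fit(xs, ys):
    """Ordinary least squares for y = m*x + b; returns (m, b)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, y_mean - m * x_mean


DATA = """A B C D E
2 2 5 10 20
4 5 6 11 1
6 8 7 12 4
8 9 12 13 8
10 11 10 14 17
"""

means = column_means(DATA)
# x is the 0-based column index, matching "u 0:1" in the fit command
m, b = linear_fit(list(range(len(means))), means)
print(means, m, b)
```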

Plot all columns in a file using gnuplot without specifying number of columns

I have a large number of data files which I want to plot using gnuplot. The files are plain text with multiple columns. I want to use gnuplot to plot all columns in a given file without having to specify which columns to plot or even the total number of columns in the file, since the number of columns varies between my files. Is there some way I could do this using gnuplot?
There are different ways you can go about this, some more and some less elegant.
Take the following file data as an example:
1 2 3
2 4 5
3 1 3
4 5 2
5 9 5
6 4 2
This has 3 columns, but you want to write a general script without the assumption of any particular number. The way I would go about it would be to use awk to get the number of columns in your file within the gnuplot script by a system() call:
N = system("awk 'NR==1{print NF}' data")
plot for [i=1:N] "data" u 0:i w l title "Column ".i
Say that you don't want to use a system() call and know that the number of columns will always be below a certain maximum, for instance 10:
plot for [i=1:10] "data" u 0:i w l title "Column ".i
Then gnuplot will complain about non-existent data but will plot columns 1 to 3 nonetheless.
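Newer versions of gnuplot also accept "*" as the upper limit of the iteration, in which case the loop stops when no more data is found: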
plot for [i=1:*] 'data' using 0:i with lines title 'Column '.i
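If the plot command is generated by a wrapper script anyway, the column count can also be determined there. The following Python sketch is one way to do it; the helper names column_count and plot_command are hypothetical, and unlike the awk one-liner it skips blank lines and lines starting with # before counting fields.

```python
def column_count(lines):
    """Number of whitespace-separated fields in the first data line."""
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            return len(stripped.split())
    return 0


def plot_command(filename, n_columns):
    """One 'using 0:i' clause per column, joined into a single plot command."""
    clauses = [
        '"{0}" using 0:{1} with lines title "Column {1}"'.format(filename, i)
        for i in range(1, n_columns + 1)
    ]
    return "plot " + ", ".join(clauses)


sample = ["1 2 3", "2 4 5", "3 1 3"]
command = plot_command("data", column_count(sample))
print(command)
```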

Gnuplot: Plotting multiple series on graph, but number of different series to overlay unknown ahead of time

I am trying to write a script wrapping gnuplot that will take a dataset and produce an overlayed graph, the number of series to be plotted based on the number of distinct values in a given column, or based on the number of different datasets in the file. An example file would be:
#SeriesName x y
Series1 0 10
Series1 1 11
Series1 2 13
...
SeriesN 0 14
SeriesN 1 19
SeriesN 2 15
I have this in one continuous set of lines, but I can split it into index-able chunks if necessary. The problem is that I don't know the different SeriesName values I'll have ahead of time, nor how many distinct values there will be. But I want one line on the graph per distinct value of SeriesName. I can see how to make graphs if I know the different values of SeriesName ahead of time, but I don't know how to tell gnuplot to "make one line per value of series, and label each line with the name that is the value of SeriesName that was used for each line."
Can gnuplot do this? Otherwise, I can make two passes through the data, the first one of which I will gather the unique values of SeriesName, and then use bash/perl/python to explicitly build a `plot' statement, but it seems like gnuplot should have some functionality for a user to have to avoid that. Am I missing something?
Thanks in advance.
Update: I also posted to a forum to where the author of Gnuplot in Action (Philipp Janert) posts, and I posted a workaround to my own problem, but I don't think it qualifies as an answer, as what it ultimately does is make a second run through the data and then does a source code filter on gnuplot commands to make a gnuplot script compliant with a particular dataset. I would think that there would be an answer using just the syntax of gnuplot better than what I did. For reference, here is the link: http://www.manning-sandbox.com/thread.jspa?messageID=122752#122752
Just for the record, here is a solution which works with gnuplot>=4.4.0 and gnuplot 5.x.
When the series label changes in column 1 it will be added to a string. This string will be used later to plot the legend.
Data: SO8812078.dat
#SeriesName x y
Series1 0 10
Series1 1 11
Series1 2 13
Series2 0 12
Series2 1 13
Series2 2 14
SeriesN 0 14
SeriesN 1 19
Script: (works with gnuplot>=4.4.0, March 2010)
### take legend from column
reset
FILE = "SO8812078.dat"
myTitles = ''
set key noautotitle
plot t1='' FILE u (t0=t1,t1=strcol(1),t0 ne t1?myTitles=myTitles.' '.t1:0,$2):3:(words(myTitles)) w lp pt 7 lc var, \
for [i=0:words(myTitles)] 1/0 w lp pt 7 lc i ti word(myTitles,i)
### end of script
Result: (created with gnuplot 4.4.0)
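For comparison, the two-pass approach mentioned in the question (collect the distinct series names first, then build an explicit plot statement) can be sketched in Python. The helper names unique_series and plot_command are hypothetical, and the sketch assumes series names contain no whitespace or double quotes. It filters rows with strcol(1) eq "name" and maps all other rows to 1/0, which works best when each series occupies contiguous lines.

```python
def unique_series(lines):
    """Distinct values of column 1, in order of first appearance."""
    names = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        name = line.split()[0]
        if name not in names:
            names.append(name)
    return names


def plot_command(filename, names):
    """Build one filtered plot clause per series name."""
    parts = [
        '"{f}" using 2:(strcol(1) eq "{n}" ? $3 : 1/0) '
        'with linespoints title "{n}"'.format(f=filename, n=name)
        for name in names
    ]
    return "plot " + ", \\\n     ".join(parts)


sample = """#SeriesName x y
Series1 0 10
Series1 1 11
Series2 0 12
Series2 1 13
SeriesN 0 14
""".splitlines()

names = unique_series(sample)
print(plot_command("SO8812078.dat", names))
```

The printed text can be written to a script file or piped to gnuplot; unlike the pure gnuplot solution above, it needs a second pass over the data.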
