Gnuplot with Linear Regression - gnuplot

I am trying to apply linear regression with fit. Instead of two columns in the data file (x and y values), this file has, for example, 5 columns. I want to take the average value of each column and feed it to the f(x) function.
Data:
A B C D E
2 2 5 10 20
4 5 6 11 1
6 8 7 12 4
8 9 12 13 8
10 11 10 14 17
Could I?
Thanks for help

Assume that your data is as specified and you wish to fit a function f(x)=m*x+b to the data where the 0-based column index (0,1,2,3 or 4) should be the x value and the column average should be the y value. We need to construct a new data file that contains the averages.
In gnuplot 5, we can use something called inline data. This is a special variable that behaves like a file. We will find the average of each column of the data and construct an inline data variable containing these. We do this by looping over the column indices and applying the stats command to each column. The print command can be instructed to print to an inline data variable.
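# datafile must be a string variable holding the name of your data file
set print $l append
do for [i=1:5] {
    stats datafile u i nooutput   # statistics of column i only
    print STATS_mean              # append the column average to $l
}
set print # restores ordinary print behavior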
With your data, we can see what is contained in $l by printing it with print $l:
6.0
7.0
8.0
12.0
10.0
We now can apply the fit command with this data
f(x) = m*x + b
fit f(x) $l u 0:1 via m,b
This will fit the data so that f(x) = average of column x (or as close to it as can be obtained with the fit).
In gnuplot 4.6, inline data is not available, but we can use a temporary file. Replacing all occurrences of $l with "tempfile" will work the same (except for the print $l command), but will add the data to a temporary file named tempfile.
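As a sanity check outside gnuplot, the averages and the fitted line can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the rows are the sample data from the question with the header line left out, the helper names column_means and fit_line are made up for illustration, and the closed-form least-squares formula stands in for gnuplot's iterative fit, so the two should agree only up to the fit tolerance.

```python
# Cross-check of the gnuplot steps: column means, then a least-squares line.
# The rows are the sample data from the question (header line omitted).
rows = [
    [2, 2, 5, 10, 20],
    [4, 5, 6, 11, 1],
    [6, 8, 7, 12, 4],
    [8, 9, 12, 13, 8],
    [10, 11, 10, 14, 17],
]


def column_means(table):
    """Average of each column, the counterpart of STATS_mean per column."""
    n = len(table)
    return [sum(col) / n for col in zip(*table)]


def fit_line(ys):
    """Least-squares m, b for y = m*x + b with x = 0, 1, 2, ... (like column 0)."""
    xs = range(len(ys))
    x_bar = sum(xs) / len(ys)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_bar - m * x_bar


means = column_means(rows)
m, b = fit_line(means)
print(means)  # [6.0, 7.0, 8.0, 12.0, 10.0]
print(m, b)   # roughly 1.3 and 6.0
```

If gnuplot reports values for m and b far from these, the usual suspects are a header line being read as data or a leftover $l from an earlier run (set print ... append keeps adding to it).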

Related

How can I subtract the value in first row from all other values in the column?

I want to subtract the first row of a file from all of its rows, and then plot the result. How can I implement this kind of calculation in gnuplot?
Here is an example of what I want to do:
Let's say I have a file that has two columns and 1000 rows. I want a script that subtracts the 2nd column value of the first row from every value in the 2nd column.
I am pretty sure that there are similar questions on SO; however, they are apparently not so easy to find.
I would have searched for "normalization" or "offset".
The following example works even if you have single or double empty lines in your data. The expression in the plot command uses serial evaluation, check help operators binary.
Sometimes you might see similar solutions using the pseudocolumn 0 (check help pseudocolumns), which, however, might lead to wrong results if you have empty lines in your data.
Script:
### offset data: subtraction of first value in a column
reset session
$Data <<EOD
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
EOD
plot t=0 $Data u 1:(t==0?y0=$2:0,t=t+1,$2-y0) w lp pt 7 lc "red"
### end of script
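The logic of the serial evaluation (remember the first value once, then subtract it from every row) can be mirrored in a short Python sketch. The data are the first rows of $Data with one blank line added on purpose to show that empty lines do not disturb the result; the function name offset_by_first is hypothetical.

```python
# Subtract the first value of column 2 from every value in column 2,
# skipping blank lines instead of counting them.
text = """0 10
1 11

2 12
3 13
"""


def offset_by_first(lines):
    """Return (x, y - y0) pairs, where y0 is the first y value seen."""
    first = None
    points = []
    for line in lines:
        fields = line.split()
        if not fields:          # blank line: ignore it
            continue
        x, y = float(fields[0]), float(fields[1])
        if first is None:       # same role as t==0 ? y0=$2 : 0
            first = y
        points.append((x, y - first))
    return points


points = offset_by_first(text.splitlines())
print(points)  # [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
```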

Histogram with ggplot2 requires a continuous x variable

I have a dataset in a table format that looks like this:
test frequency
1 test40 3
2 test33 5
3 test19 2
4 test4521 1
5 test34 1
6 test27 3
7 test42 3
8 test35 1
....
If I use this command:
library(ggplot2)
ggplot(t, aes("frequency")) +
geom_histogram()
("t" is the name of my table)
Then RStudio says: "StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?"
I just want to see how many times a 3 or a 5 etc. occurs.
Thanks for your help.
It looks like your data is already aggregated. In that case ggplot2::geom_histogram() may not be the appropriate function for you. Have you tried geom_col()? It simply takes the numbers given in the input data frame and displays a column plot of them.
Using the below code
# Declare data frame
t <- data.frame(test = c("test40", "test33", "test19", "test4521",
                         "test34", "test27", "test42", "test35"),
                frequency = c(3, 5, 2, 1,
                              1, 3, 3, 1))
returns the data frame like this
# View data
print(t)
test frequency
1 test40 3
2 test33 5
3 test19 2
4 test4521 1
5 test34 1
6 test27 3
7 test42 3
8 test35 1
and therefore you can plot it like this
# Load package
library(ggplot2)
# Generate column plot
ggplot(t, aes(test, frequency)) +
geom_col()
If you simply want a count of how many times the number 2 or the number 3 occurs in your data frame, then geom_histogram() can do that. geom_histogram() counts how often each value occurs in the data frame and plots the result. It checks the type of the variable mapped to the x-axis, and if that variable is discrete, you need to pass the parameter stat="count" to the function. If you don't include this parameter, ggplot will try to bin your data to create the histogram, which makes no sense when all you want is a count. Note also that the column must be given without quotes, aes(frequency); aes("frequency") maps the literal string instead of the column, and that string is what ggplot reports as discrete.
Check out this link for a description of the difference between continuous and discrete data: What is the difference between discrete data and continuous data?
With this in mind, you can plot the histogram like this
# Generate histogram plot
ggplot(t, aes(frequency)) +
geom_histogram(stat="count")
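To check what the bars should show, the same tally can be done by hand. This is a small Python sketch using only the frequency values from the data frame above; it is a cross-check of the expected counts, not a replacement for the ggplot2 call.

```python
from collections import Counter

# The frequency column of the example data frame.
frequency = [3, 5, 2, 1, 1, 3, 3, 1]

# Count how many times each distinct value occurs.
counts = Counter(frequency)
for value in sorted(counts):
    print(value, counts[value])
# 1 3
# 2 1
# 3 3
# 5 1
```

So the plot should have bars of height 3 at x = 1 and x = 3, and bars of height 1 at x = 2 and x = 5.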
I hope that helps mate.

Plot all columns in a file using gnuplot without specifying number of columns

I have a large number of data files which I want to plot using gnuplot. The files are plain text with multiple columns. I want gnuplot to plot all columns in a given file without having to specify which columns to plot, or even the total number of columns in the file, since the number of columns varies between my files. Is there some way I could do this using gnuplot?
There are different ways you can go about this, some more and some less elegant.
Take the following file data as an example:
1 2 3
2 4 5
3 1 3
4 5 2
5 9 5
6 4 2
This has 3 columns, but you want to write a general script without the assumption of any particular number. The way I would go about it would be to use awk to get the number of columns in your file within the gnuplot script by a system() call:
N = system("awk 'NR==1{print NF}' data")
plot for [i=1:N] "data" u 0:i w l title "Column ".i
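The awk one-liner does nothing more than count the whitespace-separated fields on the first line. For readers less familiar with awk, a Python sketch of the same count follows; the sample text is the example file from above and count_columns is a hypothetical helper name.

```python
# Count the columns of a whitespace-separated file from its first line,
# mirroring awk 'NR==1{print NF}'.
sample = """1 2 3
2 4 5
3 1 3
"""


def count_columns(first_line):
    """Number of whitespace-separated fields on a line (awk's NF)."""
    return len(first_line.split())


n_columns = count_columns(sample.splitlines()[0])
print(n_columns)  # 3
```

Like the awk version, this assumes the first line is a data line with the full set of columns, not a comment or a shorter header.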
Say that you don't want to use a system() call and know that the number of columns will always be below a certain maximum, for instance 10:
plot for [i=1:10] "data" u 0:i w l title "Column ".i
Then gnuplot will complain about non-existent data but will plot columns 1 to 3 nonetheless.
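In newer versions of gnuplot you can also use * as the upper limit of the iteration, so that the loop runs until no more data is found: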
plot for [i=1:*] 'data' using 0:i with lines title 'Column '.i

How to get the value of a specific column in a specific line in any time of processing in gnuplot?

I got a data file in the format like this:
# begin
16 1
15 2
14 3
13 4
12 5
11 6
Now I want to use gnuplot to draw a line through the points:
(1, (16/16)) (2, (16/15)) (3, (16/14)) ... (6, (16/11))
As you can see, the x axis has the range [1:6], and the y values are obtained by dividing the number in the first column of the first line (i.e. 16 in this example) by the number in the first column of each line.
The problem is that I don't know how to get the value of the number at the first column in the first line (16), so that I could do something like
plot "datafile" using 2:(16/$1) with linespoints
I have searched a lot for how to achieve this, but with no luck. It seems that gnuplot doesn't provide a flexible way to select arbitrary data. Any ideas how to do that? Or maybe I just got stuck on a not so common problem?
Thanks for your help in advance.
You can use the stats command to extract a single numerical value from your data file. The row is selected with the every option, the column with using. Because only one row is selected, STATS_min is simply that value:
col = 1
row = 0
stats 'datafile' every ::row::row using col nooutput
value = STATS_min
plot "datafile" using 2:(value/$1) w lp
Note that column numbering starts at 1 and row numbering at 0 (comment lines are skipped and aren't counted).
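The points that the plot command should produce can be checked independently. The following Python sketch uses the data from the question, skips the comment line the way gnuplot does, and divides the first value of column 1 by each value of column 1; the helper name first_value_ratios is an assumption made for this example.

```python
# Reproduce the (x, y) pairs of: plot "datafile" using 2:(value/$1)
text = """# begin
16 1
15 2
14 3
13 4
12 5
11 6
"""


def first_value_ratios(lines):
    """Return (col2, first_col1 / col1) for every data line."""
    rows = [
        line.split()
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")  # skip comments
    ]
    reference = float(rows[0][0])  # row 0, column 1 (16 in the example)
    return [(float(r[1]), reference / float(r[0])) for r in rows]


pairs = first_value_ratios(text.splitlines())
print(pairs[0])   # (1.0, 1.0)
print(pairs[-1])  # x = 6.0, y = 16/11
```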

GNU set heatmap axis limits around a dynamically computed point

I'm plotting a heatmap in gnuplot from a text file that is in matrix format:
z11 z12 z13
z21 z22 z23
z31 z32 z33
and so forth, using the following command (not including axis labelling, etc, for brevity):
plot '~/some_text_file.txt' matrix notitle with image
The matrix is quite large, in excess of 50 000 elements in the majority of cases, and it's mostly due to the size of my y-dimension (#rows). I would like to know if there's a way to change the limits in the y-dimension for a set number of values around a maximum, while keeping the x and z dimensions the same. E.g. if a maximum in the matrix is at [4000, 33], I want my y range to be centred at 4000 +- let's say 20% of length of the y-dimension.
Thanks.
Edit:
The solution below is basically the correct idea; however, it works in my example but not in general because of a bug in how gnuplot uses the stats command with matrix files. See the comments after the answer for further info.
You can do this using stats to get the indices that correspond to the maximum value dynamically.
Consider the following file which I named data:
0 1 2 3 4
0 1 2 3 4
0 1 2 3 4
0 1 5 3 4
0 1 2 3 4
If I run stats I get:
gnuplot> stats "data" matrix
* FILE:
Records: 25
Out of range: 0
Invalid: 0
Blank: 0
Data Blocks: 1
* MATRIX: [5 X 5]
Mean: 2.1200
Std Dev: 1.5315
Sum: 53.0000
Sum Sq.: 171.0000
Minimum: 0.0000 [ 0 0 ]
Maximum: 5.0000 [ 3 2 ]
COG: 2.9434 2.0566
The maximum value is in position [ 3 2 ] meaning row 3+1 and column 2+1 (in gnuplot the first row/column would be number 0). After running stats some variables are created automatically (help stats for more info), with STATS_index_max_x and STATS_index_max_y among them, which store the position of the maximum:
gnuplot> print STATS_index_max_x
3.0
gnuplot> print STATS_index_max_y
2.0
Which you can use to automatically set the ranges. Now, because STATS_index_max_x actually gives you the y (instead of x) position, you'll need to be careful. The total number of rows to obtain the range can be obtained with a system call (there might be a better built-in function, which I do not know):
gnuplot> range = system("awk 'END{print NR}' data")
gnuplot> print range
5
So basically you'll do:
stats "data" matrix
range = system("awk 'END{print NR}' data")
range_center = STATS_index_max_x
d = 0.2 * range
set yrange [range_center - d : range_center + d]
which will center the yrange at the position of your maximum value and will stretch it by +-20% of its total range.
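The arithmetic behind the range can be verified without gnuplot. This Python sketch uses the 5 x 5 example matrix from above, finds the row that holds the maximum directly (so it does not depend on the stats behavior mentioned in the edit note), and returns the limits that would go into set yrange. The function name yrange_around_max and its fraction parameter are hypothetical.

```python
# Compute y limits centred on the row of the matrix maximum,
# extended by a fraction of the total number of rows on each side.
matrix = [
    [0, 1, 2, 3, 4],
    [0, 1, 2, 3, 4],
    [0, 1, 2, 3, 4],
    [0, 1, 5, 3, 4],
    [0, 1, 2, 3, 4],
]


def yrange_around_max(rows, fraction=0.2):
    """Return (low, high) around the 0-based row index of the maximum."""
    n_rows = len(rows)
    max_row = max(range(n_rows), key=lambda r: max(rows[r]))
    d = fraction * n_rows
    return max_row - d, max_row + d


low, high = yrange_around_max(matrix)
print(low, high)  # 2.0 4.0
```

For the example this gives the 0-based row 3 as the centre and a half-width of 1, matching range_center = 3 and d = 0.2 * 5 in the gnuplot commands.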
With this yrange, plot "data" matrix w image shows only the rows around the maximum instead of the whole matrix.
