Gnuplot - plotting series based on label in third column - gnuplot

I have data in the format:
1 1 A
2 3 ab
1 2 A
3 3 x
4 1 x
2 3 A
and so on. The third column indicates the series. That is in the case above there are 3 distinct data series, one designated A, another designated ab and last designated x. Is there a way to plot the three data series from such data structure in gnuplot without using eg. awk? The difficulty here is that the number of categories (here denoted A, ab, x) is quite large and it is not feasible to write them out by hand.
I was thinking along the lines:
plot data u 1:2:3 w dots
but that does not work and I get warning: Skipping data file with no valid points (I tried quoted and unquoted version of the third column). A similar question has to manually define the palette which is undesirable.

With a little bit of work you can make a list of unique categories from within gnuplot without using external tools. The following code snippet first assembles a list of the entire third column of the data file, and then loops over it to generate a list of unique category names. If memory use or processing time become an issue then one could probably combine these steps and avoid forming a single string with the entire third column.
delimiter = "#" # some character that does not appear in category name
categories = ""
stats "test.dat" using (categories = categories." ".delimiter.strcol(3).delimiter) nooutput
unique_categories = ""
do for [cat in categories] {
if (strstrt (unique_categories, cat) ==0) {
unique_categories = unique_categories." ".cat
}
}
set xrange[0:5]
set yrange [0:4]
plot for [cat in unique_categories] "test.dat" using 1:(delimiter.strcol(3).delimiter eq cat ? $2 : NaN) title cat[2:strlen(cat)-1]
Take a look at the contents of the string variables categories and unique_categories to get a better idea of what this code does.

Related

Excel - max between 2 series

I have 2 series of data. For sake of simplicity, lets say the data looks like below,
set 1:
1 3
2 3.5
3 4
4 4.5
5 5
6 5.5
7 6
8 6.5
9 7
10 7.5
set 2:
1.5 2
2.8 4.5
3.5 8
4.5 6
5.5 4.8
6.5 4
7.5 6.5
8.5 9
9.5 3
10.5 4
After charting these 2 sets, I want to get the line with the higher data. I want the black line, In the attached pic. How do I get that? My actual data has thousands of data points, so doing this manually isn't possible.
Added later: Another thing I forgot to mention, in my actual data 1 set has about 500 x,y values, and the other set has about 50 values. Though the end points have same/similar x values.
Thanks for your help.
Given your information about the chart and the tables, I would do something like this:
The new series will be based on two formulas:
In Column H, I have the formula for the max value (between your two series):
=MAX(B2,E2)
In Column G, I have the formula that based on the Max value (formula above), which X value I should use (X-value from Series 1 or 2).
=IF(H2=B2,A2,D2)
Then I can plot my graph:
Series 1, Column B
Series 2, Column E
Series 3, Column H.
All series uses the X values of Column G.
Introduction
A few assumptions/comments/pitfalls/constraints regarding my solution:
Set 1 and Set 2 are in columns A till D.
The combined data set will combine the x-values of both Sets, and will have additional data points where the lines cross.
It involves several helper columns, in particular to allow you to copy/paste this across multiple worksheet with data.
I did not try to condense too much, to improve readability, and probably some helper columns could be combined.
It was tested with the data set from the question, but difficult to guarantee all "boundary" conditions, e.g. identical data points between Set 1 and Set 2, zero overlap between the two data sets, empty data sets, etc. (I did test some of these, see my comments at the end).
Set 1 and Set 2 must be sorted (on x-values). If this is not the case, a few additional helper columns are needed to sort the data dynamically.
To better understand the solution described below, see herewith the resulting graph, based on the data set in the question (although I added one data point [2.5;3.75] to avoid having the data points of Set 1 and Set 2 perfectly alternating):
General solution outline / methodology
Combine both datasets in a single (sorted) column;
For all x-values, determine highest y-value, between the y-value in the Set, and the calculated y-value on the line segment from the neighboring values in the other Set (looks simple, in particular with the given example data set, but this is quite tricky to do when data sets have no alternating x-values);
Find the points (x & y values) where the lines of the graph are crossing (intersecting), let's call this Set 3
Combine and sort (on x-values) the three data sets in a two columns (for x & y values).
The details and formulas
For the formulas, I assume row 1 contains headings, and the data start on row 2. All formulas should be entered in row 2, except for a few, where I mention to put them in row 3 (because they need data from the preceding row). The result is in columns E (x-values) and F (y-values), and G till AG are helper columns).
Column E : =INDEX(AH$2:AH$30;MATCH(ROWS(AH$2:AH2);$AJ$2:$AJ$30;0)) These is the actual result. Gets all x-values in AH and sorts these based on an index column AJ; this should actually be the last column in the logical flow, but for presentation purposes it is cleaner to have this next to the input data sets;
F : =INDEX(AF$2:AF$30;MATCH(ROWS(AF$2:AF2);$AG$2:$AG$30;0)) Same for y-values;
G : =IF(ISNA(H2);NA();COUNTIF($H$2:$H$30;"<="&H2)) Creates index to sort combined x-values of both data sets. You also can dynamically sort without such helper column, but then you need a VLOOKUP or INDEX/MATCH and with long decimal numbers I have some bad experiences with these;
H : =IF(ROW()-1<=COUNT($A$2:$A$30);A2;IF((ROW()-1)<=(COUNT($A$2:$A$30)+COUNT($C$2:$C$30));INDEX($C$2:$C$30;ROW()-COUNT($A$2:$A$30)-1;1);NA())) Combines x-values of both data sets, i.e. in columns A & C;
I : =IF(ROW()-1<=COUNT($B$2:$B$30);B2;IF((ROW()-1)<=(COUNT($B$2:$B$30)+COUNT($D$2:$D$30));INDEX($D$2:$D$30;ROW()-COUNT($B$2:$B$30)-1;1);NA())) Same for the y-values;
J : =IF(ROW()-1<=COUNT($A$2:$A$30);"S1";IF((ROW()-1)<=(COUNT($A$2:$A$30)+COUNT($C$2:$C$30));"S2";NA())) Assign "S1", or "S2" to each data point, as indication from which data set they come;
K : =IF(J2=J3;INTERCEPT(I2:I3;H2:H3);NA()) Determines the intercept of the line segment starting at that data point;
L : =IF(J2=J3;SLOPE(I2:I3;H2:H3);NA()) Same for slope;
M : =INDEX(H$2:H$30;MATCH(ROWS(H$2:H2);$G$2:$G$30;0)) Sorts all x-values;
N : =INDEX(I$2:I$30;MATCH(ROWS(I$2:I2);$G$2:$G$30;0)) Same for y-values
O : =INDEX(J$2:J$30;MATCH(ROWS(J$2:J2);$G$2:$G$30;0)) Same for corresponding "S1/S2" value to indicate from which data set they come;
P : =INDEX(K$2:K$30;MATCH(ROWS(K$2:K2);$G$2:$G$30;0)) Same for intercept;
Q : =INDEX(L$2:L$30;MATCH(ROWS(L$2:L2);$G$2:$G$30;0)) Same for slope;
R : =IF(O2="S1";"S2";"S1") Inversion between S1 & S2.
S : {=IFERROR(INDEX($O$2:$Q2;MAX(IF($O$2:$O2=$R3;ROW($O$2:$O2)-ROW(INDEX($O$2:$O2;1;1))+1));2);NA())} Array formula to be put in cell S3 (hence ctrl+shift+enter) that will search for the intercept of the preceding data point of the other data set.
T : {=IFERROR(INDEX($O$2:$Q2;MAX(IF($O$2:$O2=$R3;ROW($O$2:$O2)-ROW(INDEX($O$2:$O2;1;1))+1));3);NA())} Same for slope;
U : =IF(OR(ISNA(N2);NOT(ISNUMBER(S2)));NA();M2*T2+S2) Calculates the y-value on the line segment of the other data set;
V : =MAX(IFNA(U2;N2);N2) Maximum value between the original y-value and the calculated y-value on the corresponding line segment of the other data set;
W : =(V2=N2) Checks whether the y-value comes from the original data set or not;
X : =IF(O2="S1";IF(W2;"S1";"S2");IF(W2;"S2";"S1")) Determines on which data set (line) the y-value sits (S1 or S2);
Y : =IFERROR(AND((X2<>X3);COUNTIF(X3:$X$30;X2)>0);FALSE) Determines when the data sets cross (i.e. the lines on the graph intersect);
Z : =IF(Y2;(S2-P2)/(Q2-T2);NA()) Calculates x-value of intersection;
AA : =IF(Y2;Z2*Q2+P2;NA()) Calculates y-value of intersection;
AB : =COUNTIF($Z$2:$Z$30;"<="&Z2) Index to sort the newly calculated intersection points (I sort them because then the combining with the other data sets is straightforward, re-using formula of column H;
AC : =INDEX(Z$2:Z$30;MATCH(ROWS(Z$2:Z2);$AB$2:$AB$30;0)) Sorted x-values of intersection points;
AD : =INDEX(AA$2:AA$30;MATCH(ROWS(AA$2:AA2);$AB$2:$AB$30;0)) Same for y-values;
AE : =IF(ROW()-1<=COUNT(M$2:M$30);M2;IF((ROW()-1)<=(COUNT(M$2:M$30)+COUNT(AC$2:AC$30));INDEX(AC$2:AC$30;ROW()-COUNT(M$2:M$30)-1;1);NA())) Combine x-values of Set 1, Set 2, and the intersection points;
AF : =IF(ROW()-1<=COUNT(V$2:V$30);V2;IF((ROW()-1)<=(COUNT(V$2:V$30)+COUNT(AD$2:AD$30));INDEX(AD$2:AD$30;ROW()-COUNT(V$2:V$30)-1;1);NA())) Same for y-values;
AG : =IF(ISNA(AE2);NA();COUNTIF($AE$2:$AE$30;"<="&AE2)) Create index to sort the resulting data set (and this is used to calculate the final results in columns E & F;
All formulas go until row 30, but this need to be changed of course based on the actual data sets. The idea is to add these formulas to one worksheet, and then columns E > AG can be copied to all other worksheets. There are obviously quite a few #NA values, but this is on purpose, and are not errors or mistakes. On request, I can share the actual spreadsheet, so you do not have to retype all formulas.
Some additional comments
You have to modify some formulas (the sort indices) if there are identical x-values, either within Set 1 (which I will not cover here, as it seems this would be unlikely, or be data input errors), or between Set 1 and Set 2. The dynamic sorting does not work in that case. A workaround is to create a "synthetic" sort column, e.g. with =TEXT(J2;"0000.00000000000")&L2. This formats all numbers the same way as text, and appends S1 or S2. So this should give unique sort values, which would sort the same way as the corresponding numbers.
Empty data sets or data sets with only 1 value are not treated correctly either (the intercept formulas and finding values for the "previous" data point are meaningless in these cases).

Scatter plot for every pairs in a two column matrix

I have a matrix which contains the atom numbers of the pairs of atoms which are in contact with each other. My matrix is like this:
column 1: atom number i;
column 2: atom number j
i,j runs from 1 to 800.
If there is a pair i-j in the matrix, place a dot corresponding to the position (i,j) of the matrix.
How do I plot such matrix?
Example:
A= [1,3; 3,8; 3,1; 6,2; 2,6; 1,2; 5,2; 8,3; 2,5; 2,1]
I want to Plot the matrix A, where X and Y-axis run from 1 to 8. Place a dot for every combination of X and Y which are present in A.
I want a plot like this:
Isn't this just a scatter plot?
If your m x 2 matrix is saved in a text file then this is trivial.
Here are the contents of an example data file "input.dat":
4 3
3 4
5 3
3 5
8 2
2 8
All you need to do is open the data file in xmgrace using xmgrace input.dat.
Now, initially it will be a line plot, but if you do 'Plot' > 'Set Appearance' and then with the only set already being selected you can set the 'Symbol Properties' 'Type:' to Diamond and 'Line Properties' 'Type:' to None you will already be on your way. Setting the symbol fill to solid red, tweaking the axis ranges and showing major tick grid lines will give a plot like the one you gave as an example.
You can save a parameters file and in future load the parameters at the beginning using
xmgrace -param template.par input2.dat.
But, having said all this, why not just plot it in matlab?

gnuplot: how to know the last column number?

I have a problem handling data using gnuplot.
My data has different column number per line.
I want to plot with X-axis of the first column and Y-axis of the last.
The last columns are always different every line.
For example, my data looks like that (my.dat)
1 2
2 1 3
3 4 4
4 5
5 2 1 3 6
plot 'my.dat' us 1:(lastcolumn) w l
Before reading in gnuplot, I can pre-process of the data.
But my gnuplot is windows version, I cannot use awk or any parsing program.
So I hope it handles only into gnuplot.
Is that possible?
Thanks
Yes, you can check that with gnuplot. The idea is as follows:
You analyze your data with stats and inside the using you check recursively with valid which column is the last valid. If an invalid column is reached you return the number of the previous column otherwise the next column is checked. The last column is then contained in the variable STATS_max
check_valid_column(c) = valid(c) ? check_valid_column(c + 1) : c - 1
stats 'my.dat' using (check_valid_column(1)) nooutput
last_column = int(STATS_max)
plot 'my.dat' using 1:last_column
Just for the records, here is an alternative suggestion. Christoph's solution is certainly more elegant and probably faster.
However, with the recursive approach you will get an error "recursion depth limit exceeded" if you have more than 250 columns (admittedly, probably very rare cases).
The solution below uses the lines as one string and counts the columns with words(). This, however, works only if you have whitespace as separator. With comma it will not work. Not sure what string length limit would be.
Code: (edit: no need to plot to a dummy table, stats can be used instead)
### find the maximum number of columns
reset session
# create some random test data
set print $Data
rows = int(rand(0)*5+5) # random 5 to 9 lines
do for [r=1:rows] {
minCols = 251 # if minCols >250, the recursive approach will fail
cols = int(rand(0)*10+minCols)
line = ''
do for [c=1:cols] { line = sprintf("%s %d",line,rand(0)*10) }
print line
}
set print
# alternative approach with word(). Works only for separator whitespace.
set datafile separator "\n"
maxCol=0
stats $Data u (cols=words(strcol(1)), cols>maxCol?maxCol=cols:0) nooutput
set datafile separator whitespace
print "words() approach: ", maxCol
# Recursive approach for comparison
print "Recursive approach: "
check_valid_column(c) = valid(c) ? check_valid_column(c + 1) : c - 1
stats $Data u (check_valid_column(1)) nooutput
last_column = int(STATS_max)
print last_column
### end of code
Result: (if number of max columns>250)
words() approach: 259
Recursive approach:
"SO41032862.gp" line 28: recursion depth limit exceeded

Gnuplot CCDF plotting and log-log scale

My data file is a set of sorted single-column:
1
1
2
2
2
3
...
999
1000
1000
I am able to successfully plot the CDF using the command like (assuming 10000 lines in the file):
plot "file" using 1:(1/10000.) smooth cumulative title "CDF"
I am also able to plot the logcale of x axis by:
set logscale x
My problem is how can I have a CCDF plotting with Gnuplot?
In additional, the CDF with log-log scale (set logscale xy) can not give me any output. What if I would like to have a log-log CCDF plotting?
Many thanks!
I found a workaround for this problem, because I do not think you can plot a CCDF only using gnuplot.
Briefly, I just parsed my data using bash to create a dataset where the cumulative data is explicit; then gnuplot may simply plot the new dataset. As an example, assuming that your file contains the (numerical) values you want to cumulate, I would do in a bash environment:
cat data | sort -n | uniq --count | awk 'BEGIN{sum=0}{print $2,$1,sum; sum=sum+$1}' > parsed.dat'
This command reads the dataset (cat data), sorts the numerical data using their value (sort -n), counts the occurrences of each sample (uniq --count) and creates a new dataset, calculating as well the cumulative sum of each data value (the awk command).
This new dataset contains 3 columns: the first column ($1 in gnuplot) contains the unique values of your dataset, the $2 contains the number of the occurrences of your values, and the third column represents the cumulative sum.
Finally, in gnuplot, you can do this:
stats "parsed.dat" using 3;
plot "parsed.dat" using 1:($3/STATS_max) with lines title "CDF",\
"" using 1:(1-$3/STATS_max) with lines title "CCDF",\
"" using 1:($2/STATS_max) with boxes title "PDF"
The stats command of gnuplot analyzes the third column (the one with the cumulative sum) and stores the values to some variables. STATS_max is the max value of this column (so it is the final cumulative sum). Now you have all the data you need to plot not only the CDF, but also the CCDF (which is 1 - CDF) and also the PDF (or the normalized histogram, for discrete values).

How to do math operation on rows in gnuplot

Say, my data file has two columns and five rows as follows,
1 3
2 5
3 3
4 4
5 2
Now I would like to plot them but with a little math operation on second column. For example,
plot 'test.dat' u 1:($2*)
What I mean by asterisk is I would like to sqrt(row2^2+row1^2), which is sqrt(5^2+3^2), on second column values. How I can do that? Many thanks!
Usually, one can access only the values of all columns of the current row. Accessing the values of a previous row is possible, but tricky. Basically, you must save the values in temporary variables.
That works in the following way:
In the first row, save the values of both columns and do not plot them (use NaN as value).
In the second row, save the current x-values, use the x-value of the previous row. Then save the current y-value, and compute your value based on the previous row (prevY) and the current row (currY).
That doesn't plot the last line. But that hasn't a next row anyway. If you want it to plot also the last line with e.g. 0 as additional value, you must add a last row with 0 0.
In the script I use set macros for better readability of the code:
set macros
prevX = currX = prevY = currY = 0
UsePreviousXvalue = '(($0 == 0) ? (prevX = NaN, currX = $1) : (prevX = currX, currX = $1)), prevX'
AssignYvalue = '(prevY = currY, currY = $2)'
plot 'test.dat' using (#UsePreviousXvalue):(#AssignYvalue, sqrt(prevY**2 + currY**2))

Resources