gnuplot auto sorts times - linux

I have a file which looks as follows:
19:40:47,2772
19:41:50,2896
19:42:50,2870
19:43:51,2851
19:44:53,2824
19:45:55,2891
.
.
.
07:52:53,2772
07:53:56,2767
07:55:00,2709
07:56:01,2713
07:57:04,2844
07:58:04,2750
07:59:05,2744
08:00:08,2812
08:01:11,2728
08:02:14,2852
and I'm trying to do the simple task of making a graph with time on the X axis and the number on the Y axis.
The code is as follows:
#!/usr/bin/gnuplot
unset multiplot
set xdata time
set datafile separator ","
set timefmt "%H:%M:%S"
set format x "%H:%M"
set title "defect number"
set xlabel "X"
set ylabel "Y"
plot "Defect_number_03-03-16_08.04.49.csv" using 1:2 w lines
pause -1
The problem is that gnuplot autosorts the times.
I want to make a chart according to the order in the file; any help would be great =)

When you give the plot command
plot datafile u 1:2
you are telling gnuplot that the first column is your x-value and the second is your y-value. Naturally, earlier times are further to the left (as you didn't post your full data, I have used only the part you did post - this will cause a "skip" in the axis labels).
You can use a pseudocolumn to use the line number as your x-value. The 0 column corresponds to the line number (see help pseudocolumns).
Thus plot datafile u 0:2 will use the line number as the x-coordinate and the 2nd column as the y-coordinate.
We still need to add the correct x-axis labels, and can't rely on them to be generated correctly in this case. We can use the xtic function to do this, as follows [1]:
plot datafile u 0:2:xtic(1)
which tells gnuplot to use the value in column 1 as the xtic label, but it will read this literally and not format it as the time the way you wanted. To do this, we can manually convert it to the correct string:
plot datafile u 0:2:xtic(strftime("%H:%M",strptime("%H:%M:%S",strcol(1)))) w lines
Here, the strcol function reads column 1 as a string, the strptime function turns this into the internal time representation using the format string given for reading it, and finally strftime formats this as a time string using the given output format.
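The same parse-then-format round trip can be sketched outside gnuplot. The Python below is only an illustration of what the strptime/strftime pair does to a single label; the function name short_label is made up for this example and is not part of gnuplot.

```python
from datetime import datetime

def short_label(raw, in_fmt="%H:%M:%S", out_fmt="%H:%M"):
    """Parse a time string with in_fmt and re-emit it with out_fmt."""
    parsed = datetime.strptime(raw, in_fmt)   # plays the role of gnuplot's strptime
    return parsed.strftime(out_fmt)           # plays the role of gnuplot's strftime

print(short_label("19:40:47"))  # 19:40
```

Note that the argument order differs: gnuplot's strptime takes the format first and the string second, while Python's takes the string first.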
As Christoph stated in his answer, these solutions will cause uniform spacing of the points. If the points are already uniformly spaced, this is not a problem, and if the points are very close to uniformly spaced, it is probably acceptable as well (it looks like your points are about 1 minute apart, give or take a couple of seconds).
However, if we want the absolutely correct spacing, we will need to add a date to the lines. This could be done in the original data file during the creation, or we could use an external process to add the dates only when needed leaving the original file exactly the same.
As you are only marking off the time and not the day in your tic marks, the actual day doesn't matter. It only matters that the times from the next morning fall on the day after the times from the previous evening.
We can use an external program to add dates. The following Python 3 program reads the data file and adds a date to it (using Jan 1st, 2015 for the first date; as previously mentioned, this date doesn't really matter). If a time is earlier in the day than the previous one, it moves to the next day. Here is the program adddates.py:
from datetime import datetime, timedelta
from sys import argv
last = None
offset = timedelta(days=0)  # whole days added so far
with open(argv[1], "r") as f:
    for x in f:
        vals = x.split(",")
        dte = datetime.strptime("01/01/2015 " + vals[0], "%m/%d/%Y %H:%M:%S") + offset
        # a time earlier than the previous one means midnight has passed
        if last is not None and last > dte:
            offset += timedelta(days=1)
            dte += timedelta(days=1)  # add one day only; offset is already included
        last = dte
        print(dte.strftime("%Y-%m-%d %H:%M:%S"), vals[1], sep=",", end="")
The output from running this on the data file looks like:
2015-01-01 19:40:47,2772
2015-01-01 19:41:50,2896
2015-01-01 19:42:50,2870
2015-01-01 19:43:51,2851
2015-01-01 19:44:53,2824
2015-01-01 19:45:55,2891
...
2015-01-02 07:52:53,2772
2015-01-02 07:53:56,2767
...
We can now read data from this program by opening a pipe in our plot command.
set timefmt "%Y-%m-%d %H:%M:%S"
plot "< python3 adddates.py datafile" u 1:2 with lines
[1] Note that this also causes labels to overlap, as it uses all of them. To use every other one, we could have used xtic(int($0) % 2 == 0 ? strcol(1) : ""). A similar technique can be used with the correctly formatted labels as well.

A proper solution is to save your data with full date and time, or as timestamps.
All other solutions that use $0 and label the xtics with xticlabel require your data to be spaced equidistantly, which doesn't seem to be the case.
So, just save your data as e.g. UNIX timestamps and you can use all the nice gnuplot features without fiddling.
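If the file has already been written with times only, one way to get to timestamps after the fact is sketched below in Python. This is an illustration under stated assumptions, not part of the answer above: the function name to_unix_timestamps and the starting day (Jan 1st, 2015, UTC) are arbitrary choices, and a time that goes backwards is assumed to mean that exactly one midnight has passed.

```python
from datetime import datetime, timedelta, timezone

def to_unix_timestamps(times, start_day=datetime(2015, 1, 1, tzinfo=timezone.utc)):
    """Turn time-of-day strings into UNIX timestamps, bumping the day on wrap-around."""
    stamps = []
    day = start_day          # assumed date of the first sample
    previous = None
    for text in times:
        clock = datetime.strptime(text, "%H:%M:%S")
        moment = day.replace(hour=clock.hour, minute=clock.minute, second=clock.second)
        if previous is not None and moment < previous:
            day += timedelta(days=1)      # time went backwards: assume midnight passed
            moment += timedelta(days=1)
        previous = moment
        stamps.append(int(moment.timestamp()))
    return stamps

sample = ["19:45:55", "07:52:53", "07:53:56"]
print(to_unix_timestamps(sample))
```

With the first column stored this way, gnuplot should be able to read it with set timefmt "%s", and the spacing between points then reflects the real elapsed time.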

Related

Gnuplot - plotting series based on label in third column

I have data in the format:
1 1 A
2 3 ab
1 2 A
3 3 x
4 1 x
2 3 A
and so on. The third column indicates the series. That is, in the case above there are 3 distinct data series: one designated A, another designated ab, and the last designated x. Is there a way to plot the three data series from such a data structure in gnuplot without using e.g. awk? The difficulty here is that the number of categories (here denoted A, ab, x) is quite large, so it is not feasible to write them out by hand.
I was thinking along the lines:
plot data u 1:2:3 w dots
but that does not work and I get warning: Skipping data file with no valid points (I tried quoted and unquoted version of the third column). A similar question has to manually define the palette which is undesirable.
With a little bit of work you can make a list of unique categories from within gnuplot without using external tools. The following code snippet first assembles a list of the entire third column of the data file, and then loops over it to generate a list of unique category names. If memory use or processing time become an issue then one could probably combine these steps and avoid forming a single string with the entire third column.
delimiter = "#" # some character that does not appear in category name
categories = ""
stats "test.dat" using (categories = categories." ".delimiter.strcol(3).delimiter) nooutput
unique_categories = ""
do for [cat in categories] {
if (strstrt (unique_categories, cat) ==0) {
unique_categories = unique_categories." ".cat
}
}
set xrange[0:5]
set yrange [0:4]
plot for [cat in unique_categories] "test.dat" using 1:(delimiter.strcol(3).delimiter eq cat ? $2 : NaN) title cat[2:strlen(cat)-1]
Take a look at the contents of the string variables categories and unique_categories to get a better idea of what this code does.
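To make the intent of the two loops easier to follow, here is a rough Python equivalent of the "collect the third column, then keep first occurrences" step. It is only a sketch for comparison; the function name unique_categories is chosen to match the gnuplot variable. Because a Python list compares whole strings, it does not need the delimiter trick, which in the gnuplot version guards against one category name being found inside another by strstrt.

```python
def unique_categories(rows):
    """Return the third-column labels in order of first appearance."""
    seen = []
    for row in rows:
        label = row.split()[2]    # third whitespace-separated field
        if label not in seen:     # whole-string comparison, so no delimiter is needed
            seen.append(label)
    return seen

rows = ["1 1 A", "2 3 ab", "1 2 A", "3 3 x", "4 1 x", "2 3 A"]
print(unique_categories(rows))  # ['A', 'ab', 'x']
```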

gnuplot: how to know the last column number?

I have a problem handling data using gnuplot.
My data has different column number per line.
I want to plot with X-axis of the first column and Y-axis of the last.
The last columns are always different every line.
For example, my data looks like that (my.dat)
1 2
2 1 3
3 4 4
4 5
5 2 1 3 6
plot 'my.dat' us 1:(lastcolumn) w l
I could pre-process the data before reading it into gnuplot.
But my gnuplot is the Windows version, and I cannot use awk or any other parsing program.
So I hope this can be handled within gnuplot alone.
Is that possible?
Thanks
Yes, you can check that with gnuplot. The idea is as follows:
You analyze your data with stats, and inside the using expression you check recursively with valid which column is the last valid one. If an invalid column is reached, you return the number of the previous column; otherwise the next column is checked. The last column is then contained in the variable STATS_max.
check_valid_column(c) = valid(c) ? check_valid_column(c + 1) : c - 1
stats 'my.dat' using (check_valid_column(1)) nooutput
last_column = int(STATS_max)
plot 'my.dat' using 1:last_column
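The recursion can be traced with a small Python stand-in. This is a sketch only: the name last_valid_column is invented here, and the test c <= len(fields) is a simplification of gnuplot's valid(c), which checks that the column holds a usable number rather than merely that it exists.

```python
def last_valid_column(fields, c=1):
    """Mirror of check_valid_column: step right until column c is missing."""
    if c <= len(fields):          # simplified stand-in for gnuplot's valid(c)
        return last_valid_column(fields, c + 1)
    return c - 1

lines = ["1 2", "2 1 3", "3 4 4", "4 5", "5 2 1 3 6"]
counts = [last_valid_column(line.split()) for line in lines]
print(counts, max(counts))  # [2, 3, 3, 2, 5] 5
```

The maximum over all rows corresponds to what stats reports as STATS_max for the sample file.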
Just for the record, here is an alternative suggestion. Christoph's solution is certainly more elegant and probably faster.
However, with the recursive approach you will get the error "recursion depth limit exceeded" if you have more than 250 columns (admittedly, probably a very rare case).
The solution below reads each line as one string and counts the columns with words(). This, however, works only if whitespace is the separator; it will not work with commas. I am not sure what the string length limit would be.
Code: (edit: no need to plot to a dummy table, stats can be used instead)
### find the maximum number of columns
reset session
# create some random test data
set print $Data
rows = int(rand(0)*5+5) # random 5 to 9 lines
do for [r=1:rows] {
minCols = 251 # if minCols >250, the recursive approach will fail
cols = int(rand(0)*10+minCols)
line = ''
do for [c=1:cols] { line = sprintf("%s %d",line,rand(0)*10) }
print line
}
set print
# alternative approach with words(). Works only for separator whitespace.
set datafile separator "\n"
maxCol=0
stats $Data u (cols=words(strcol(1)), cols>maxCol?maxCol=cols:0) nooutput
set datafile separator whitespace
print "words() approach: ", maxCol
# Recursive approach for comparison
print "Recursive approach: "
check_valid_column(c) = valid(c) ? check_valid_column(c + 1) : c - 1
stats $Data u (check_valid_column(1)) nooutput
last_column = int(STATS_max)
print last_column
### end of code
Result: (if number of max columns>250)
words() approach: 259
Recursive approach:
"SO41032862.gp" line 28: recursion depth limit exceeded

Read data in Gnuplot

I have a file with a matrix like :
1 2 3
4 5 6
7 8 9
Using gnuplot, I would like to extract the value in the 3rd row and 2nd column, and store it in a variable called X, for example. How can I do that using gnuplot?
Thanks
You can do that within a plot command,
set table "/dev/null"
X=0
X_row=2 # $0 counts rows from 0, so 2 selects the 3rd row
X_col=2
plot "file.dat" using (($0==X_row)?(X=column(X_col),X):0)
unset table
To save time the plot command can do something useful at the same time, like... plotting something.
Thanks, It's solved actually using this syntax :
plot u 0:($0==RowIndex?(VariableName=$ColumnIndex):$ColumnIndex)
#RowIndex starts with 0, ColumnIndex starts with 1
print VariableName
It's already explained quite well in an answer by StackJack.
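The indexing convention is the part that is easy to get wrong, so here is a small Python sketch of the same lookup. The function name cell is hypothetical; the row index is zero-based like $0 and the column index is one-based like gnuplot's column numbers.

```python
def cell(lines, row_index, col_index):
    """row_index is zero-based (like $0); col_index is one-based (like $2)."""
    fields = lines[row_index].split()
    return float(fields[col_index - 1])

matrix = ["1 2 3", "4 5 6", "7 8 9"]
X = cell(matrix, row_index=2, col_index=2)   # 3rd row, 2nd column
print(X)  # 8.0
```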

Prevent backward lines in gnuplot

I have some values given by clock time, where the first column is the time. However, the values until 2 o'clock still belong to the current day. Given
3 1
12 4
18 1
21 2
1 3
2 0
named as test.data, I'd like to print this in gnuplot:
set xrange [0:24]
plot 'test.data' with lines
However, the plot contains a backward line. It's striking through the whole diagram.
Is there a way to tell gnuplot to explicitly not print such backward lines, or even better, print them wrapping around the x axis (e.g. in my example drawing the line as a forward line up to 24, and then continuing it at 0)?
Note: The x axis of the plot should still start at 0 and end at 24.
As far as wrapping over the edge of the graph (a Pac-Man-like effect) goes, gnuplot can't do that on its own. Even doing it manually, you would have to calculate the right point to re-enter the graph based on the slope of the connecting line, and insert new points into the data to control where the line exits and where it re-enters. This would require external processing.
If you can do some outside preprocessing, adding a blank line before the 1 3 line will insert a discontinuity into the plot and prevent gnuplot from connecting those points (see help datafile for how gnuplot handles blank lines). Of course, you could always sort the data too.
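The blank-line preprocessing can be automated. Below is a minimal Python sketch, assuming whitespace-separated rows with x in the first column; the function name break_on_backward_x is invented for this example. It inserts an empty line whenever x is smaller than on the previous row, which is where gnuplot would otherwise draw the backward line.

```python
def break_on_backward_x(lines):
    """Insert an empty line wherever x is smaller than on the previous row."""
    out = []
    previous_x = None
    for line in lines:
        x = float(line.split()[0])
        if previous_x is not None and x < previous_x:
            out.append("")        # a blank line starts a new line segment in gnuplot
        out.append(line)
        previous_x = x
    return out

rows = ["3 1", "12 4", "18 1", "21 2", "1 3", "2 0"]
print("\n".join(break_on_backward_x(rows)))
```

For the sample data this puts the blank line between "21 2" and "1 3" and leaves the order of the rows unchanged.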
I would recommend sorting the data before plotting, but if you do want to do this wrapping effect, the following python program (wrapper.py) will set up the data for it
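data = [tuple(map(float, x.strip().split(" "))) for x in open("data.txt", "r")]
data2 = sorted(data)
back_in_to = data2[0]   # leftmost point, where the line re-enters
out_from = data2[-1]    # rightmost point, where the line leaves
xdelta = back_in_to[0] + 24 - out_from[0]
ydelta = back_in_to[1] - out_from[1]
slope = ydelta / xdelta
outy = out_from[1] + (24 - out_from[0]) * slope  # y where the line crosses x=24 (and x=0)
print(0, outy)
for x in data2:
    print(*x)
    # blank line after the last point of the original file breaks the line there
    if x[0] == data[-1][0]:
        print("")
print(24, outy)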
It reads in the data (assumed to be in data.txt) and calculates the point where the line should leave the graph and where it should re-enter, adding these points to the sorted data. It adds a blank line after the last point of the original data, causing the break in the line. We can then plot like
plot "< python3 wrapper.py" with lines
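As a quick check of the arithmetic, the crossing height for the sample data can be worked out on its own. The sketch below isolates the slope calculation from wrapper.py; the function name edge_crossing_y and the period parameter are introduced here for illustration only.

```python
def edge_crossing_y(out_from, back_in_to, period=24.0):
    """y value at which the line from out_from to (back_in_to shifted by period) hits x=period."""
    x0, y0 = out_from
    x1, y1 = back_in_to
    slope = (y1 - y0) / (x1 + period - x0)   # back_in_to lies one period to the right
    return y0 + (period - x0) * slope

# the line leaves from (21, 2) and re-enters towards (1, 3)
print(edge_crossing_y((21, 2), (1, 3)))  # 2.75
```

So for the data in the question, the extra points written by the script are (0, 2.75) and (24, 2.75).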
If we look at your original plot, we see the backward line that you referred to, which reaches from the point furthest to the right back to the next point on the left. In the plot of the data pre-processed by the Python program, the line instead runs out through the right edge of the graph and comes back in on the left to reach this point.

Gnuplot CCDF plotting and log-log scale

My data file is a set of sorted single-column:
1
1
2
2
2
3
...
999
1000
1000
I am able to successfully plot the CDF using a command like this (assuming 10000 lines in the file):
plot "file" using 1:(1/10000.) smooth cumulative title "CDF"
I am also able to set a logscale on the x axis by:
set logscale x
My problem is: how can I plot a CCDF with gnuplot?
In addition, the CDF with log-log scale (set logscale xy) does not give me any output. What if I would like to have a log-log CCDF plot?
Many thanks!
I found a workaround for this problem, because I do not think you can plot a CCDF only using gnuplot.
Briefly, I just parsed my data using bash to create a dataset where the cumulative data is explicit; then gnuplot may simply plot the new dataset. As an example, assuming that your file contains the (numerical) values you want to cumulate, I would do in a bash environment:
cat data | sort -n | uniq --count | awk 'BEGIN{sum=0}{sum=sum+$1; print $2,$1,sum}' > parsed.dat
This command reads the dataset (cat data), sorts the numerical data using their value (sort -n), counts the occurrences of each sample (uniq --count) and creates a new dataset, calculating as well the cumulative sum of each data value (the awk command).
This new dataset contains 3 columns: the first column ($1 in gnuplot) contains the unique values of your dataset, the second ($2) contains the number of occurrences of each value, and the third ($3) contains the cumulative sum.
Finally, in gnuplot, you can do this:
stats "parsed.dat" using 3;
plot "parsed.dat" using 1:($3/STATS_max) with lines title "CDF",\
"" using 1:(1-$3/STATS_max) with lines title "CCDF",\
"" using 1:($2/STATS_max) with boxes title "PDF"
The stats command of gnuplot analyzes the third column (the one with the cumulative sum) and stores the values to some variables. STATS_max is the max value of this column (so it is the final cumulative sum). Now you have all the data you need to plot not only the CDF, but also the CCDF (which is 1 - CDF) and also the PDF (or the normalized histogram, for discrete values).
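For readers who want to check the three curves against a small input, here is a Python sketch of the same table. It is an illustration only, with the invented function name distribution_table; it normalizes by the total number of samples and defines the CDF at a value as the fraction of samples less than or equal to it.

```python
from collections import Counter
from itertools import accumulate

def distribution_table(values):
    """Rows of (value, pdf, cdf, ccdf) for a list of numbers."""
    counts = Counter(values)
    keys = sorted(counts)
    total = len(values)
    running = accumulate(counts[k] for k in keys)   # cumulative counts per unique value
    return [(k, counts[k] / total, cum / total, 1 - cum / total)
            for k, cum in zip(keys, running)]

for row in distribution_table([1, 1, 2, 2, 2, 3]):
    print(row)
```

With this definition the CCDF of the largest value is exactly 0, which has no position on a logarithmic y axis, so that last point cannot appear in a log-log plot.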

Resources