I have temporal data in which some time intervals contain only missing values. I want to show those missing-value intervals explicitly.
For now, the solution I have is to check whether the value is NaN or not, as such:
plot file_name using 1:(stringcolumn(num_column) eq "NaN" ? 1/0 : column(num_column)) with lines,\
"" using 1:(stringcolumn(num_column) eq "NaN" ? 1000 : 1/0) with points
Which will result in drawing points at y = 1000 instead of the line for missing values, which gives the following result:
However, this is not ideal because a) I need to specify a y value at which to draw the points and b) it's quite ugly, especially when the dataset is longer in time.
I would like to produce something like this instead:
That is, to fill this interval completely with a color (possibly with some transparency, unlike my image). Note that in these examples there is only one interval of missing values, but in reality there can be any number of them on one plot.
We can do some pre-processing to accomplish this. Suppose that we have the following data file, data.txt
1 8
2 6
4 NaN
5 NaN
6 NaN
7 9
8 10
9 NaN
10 NaN
11 6
12 11
and the following Python 3 program, process.py (obviously, using Python is not the only way to do this; see note 1):
data = [x.strip().split() for x in open("data.txt", "r")]
i = 0
while i < len(data):
    if data[i][1] == "NaN":
        print(data[i-1][0], end=" ")  # or use data[i][0]
        i += 1
        while data[i][1] == "NaN":
            i += 1
        print(data[i][0], end=" ")  # or use data[i-1][0]
    else:
        i += 1
This Python program reads the data file and, for each range of NaN values, outputs the last good and the next good x-coordinate. For the example data file it outputs 2 7 8 11, which can be used as bounds for drawing rectangles. Now, in gnuplot (see note 2), we can do
breaks = system("process.py")
set for [i=0:words(breaks)/2-1] object (i+1) rectangle from word(breaks,2*i+1),graph 0 to word(breaks,2*i+2),graph 1 fillstyle solid noborder fc rgb "orange"
Which will draw filled rectangles over this range. It determines how many "blocks" (groups of two values) are in the breaks variable then reads these two at a time using the breaks as left and right bounds for rectangles.
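To make the pairing explicit, here is a minimal Python sketch that expands a breaks string into the individual set object commands the gnuplot loop is equivalent to. The helper name rectangle_commands is hypothetical and only serves as an illustration; it is not part of the original answer.

```python
def rectangle_commands(breaks, color="orange"):
    """Build one gnuplot 'set object' command per pair of bounds.

    rectangle_commands is a hypothetical helper used for illustration only.
    """
    words = breaks.split()
    commands = []
    # Same pairing as the gnuplot loop: words 1-2, then 3-4, and so on.
    for i in range(len(words) // 2):
        left, right = words[2 * i], words[2 * i + 1]
        commands.append(
            f'set object {i + 1} rectangle from {left},graph 0 '
            f'to {right},graph 1 fillstyle solid noborder fc rgb "{color}"'
        )
    return commands


for command in rectangle_commands("2 7 8 11"):
    print(command)
```

Each printed line is one rectangle spanning the full height of the graph between a left and a right bound.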
Finally, plotting the data
plot "data.txt" u 1:2 with lines
produces
which shows the filled rectangles over the range of NaN values.
For wider applicability, the following awk program, process.awk (see note 3), serves the same purpose as the Python program above if awk is available and Python isn't:
BEGIN {
    started = 0;
    last = "";
    vals = "";
}
($2=="NaN") {
    if (started==0) {
        vals = vals " " last;
        started = 1;
    }
}
($2!="NaN") {
    last = $1
    if (started==1) {
        vals = vals " " last;
        started = 0;
    }
}
END {
    sub(/^ /,"",vals);
    print vals;
}
We can use this by replacing the system call above with
breaks = system("awk -f process.awk data.txt")
1 The boundaries are extended to the last and next point to completely fill the gap. If this is not desired, the commented values will cover only the region identified by NaN in the file (4-6 and 8-10 in the example case). The program will not handle NaN values as the first or last data point.
2 I used solid orange for the gaps. Feel free to use any color spec there.
3 The awk program extends the boundaries in the same way as the python program, but takes more modification to get the other behavior. It has the same limitations in not handling NaN values as the first or last data point.
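The limitation mentioned in notes 1 and 3 (NaN as the first or last data point) could be lifted with a small variation. The following is only a sketch and not part of the original answer: the function name nan_intervals is made up, and the choice to fall back to the x-coordinate of the NaN row itself at the edges of the file is an assumption.

```python
def nan_intervals(rows):
    """Return (left, right) x-bounds for each run of "NaN" y-values.

    rows is a list of [x, y] string pairs. Bounds are extended to the
    neighbouring good points; at the start or end of the data, where no
    such neighbour exists, the x of the NaN row itself is used (assumption).
    """
    intervals = []
    i = 0
    while i < len(rows):
        if rows[i][1] != "NaN":
            i += 1
            continue
        start = i
        # Skip to the end of this run without running past the last row.
        while i < len(rows) and rows[i][1] == "NaN":
            i += 1
        left = rows[start - 1][0] if start > 0 else rows[start][0]
        right = rows[i][0] if i < len(rows) else rows[i - 1][0]
        intervals.append((left, right))
    return intervals


example = [line.split() for line in
           ["1 8", "2 6", "4 NaN", "5 NaN", "6 NaN", "7 9",
            "8 10", "9 NaN", "10 NaN", "11 6", "12 11"]]
print(" ".join(x for pair in nan_intervals(example) for x in pair))
```

For the example data this prints the same bounds as process.py; reading the rows from data.txt instead of the inline list is left out to keep the sketch self-contained.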
Using two filled curves
A somewhat "hacky" way of doing it is using two filled curves, as such:
plot file_name using 1:(stringcolumn(num_column) eq "NaN" ? 1/0 : column(num_column)) with lines ls 2,\
"" using 1:(stringcolumn(num_column) eq "NaN" ? 0 : 1/0) with filledcurve x1 ls 3,\
"" using 1:(stringcolumn(num_column) eq "NaN" ? 0 : 1/0) with filledcurve x2 ls 3
Both filledcurve must have the same linestyle, so that we get one uniform rectangle.
One filledcurve has x1 as parameter and the other x2, so that one fills above 0 and the other below 0.
You can remove the curve at 0 and make the filling transparent using this:
set style fill transparent solid 0.8 noborder
This is the result:
Note that the dashed line at 0 under the rectangle is a bit glitchy compared to the other dashed lines. Note also that if some rectangles are very small in width, they will look lighter than expected.
Related
I have data in the format:
1 1 A
2 3 ab
1 2 A
3 3 x
4 1 x
2 3 A
and so on. The third column indicates the series. That is, in the case above there are 3 distinct data series: one designated A, another designated ab, and the last designated x. Is there a way to plot the three data series from such a data structure in gnuplot without using e.g. awk? The difficulty here is that the number of categories (here denoted A, ab, x) is quite large, and it is not feasible to write them out by hand.
I was thinking along the lines:
plot data u 1:2:3 w dots
but that does not work and I get warning: Skipping data file with no valid points (I tried quoted and unquoted version of the third column). A similar question has to manually define the palette which is undesirable.
With a little bit of work you can make a list of unique categories from within gnuplot without using external tools. The following code snippet first assembles a list of the entire third column of the data file, and then loops over it to generate a list of unique category names. If memory use or processing time become an issue then one could probably combine these steps and avoid forming a single string with the entire third column.
delimiter = "#" # some character that does not appear in category name
categories = ""
stats "test.dat" using (categories = categories." ".delimiter.strcol(3).delimiter) nooutput
unique_categories = ""
do for [cat in categories] {
    if (strstrt (unique_categories, cat) ==0) {
        unique_categories = unique_categories." ".cat
    }
}
set xrange[0:5]
set yrange [0:4]
plot for [cat in unique_categories] "test.dat" using 1:(delimiter.strcol(3).delimiter eq cat ? $2 : NaN) title cat[2:strlen(cat)-1]
Take a look at the contents of the string variables categories and unique_categories to get a better idea of what this code does.
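For readers who prefer to trace the logic outside gnuplot, here is a rough Python mirror of the two steps. It is only an illustration: the function name unique_categories is made up, and Python's substring test stands in for strstrt. It also shows why the labels are wrapped in a delimiter: without it, a label such as a would be found inside ab and wrongly skipped.

```python
def unique_categories(labels, delimiter="#"):
    """Mimic the gnuplot snippet: collect each label once, in file order."""
    # Wrap every label in the delimiter, as the stats command does.
    categories = " ".join(delimiter + label + delimiter for label in labels)
    unique = ""
    for cat in categories.split():
        # Substring test, playing the role of strstrt(unique_categories, cat) == 0.
        if cat not in unique:
            unique = unique + " " + cat
    return unique


labels = ["A", "ab", "A", "x", "x", "A"]  # third column of the example data
result = unique_categories(labels)
print(result)
# Strip the delimiters for display, like cat[2:strlen(cat)-1] in the plot title.
print([cat[1:-1] for cat in result.split()])
```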
I have a problem handling data using gnuplot.
My data has a different number of columns on each line.
I want to plot with the first column on the X-axis and the last column on the Y-axis.
The position of the last column differs from line to line.
For example, my data looks like that (my.dat)
1 2
2 1 3
3 4 4
4 5
5 2 1 3 6
plot 'my.dat' us 1:(lastcolumn) w l
I could pre-process the data before reading it into gnuplot.
But I use the Windows version of gnuplot, so I cannot use awk or any other parsing program.
So I hope this can be handled within gnuplot alone.
Is that possible?
Thanks
Yes, you can check that with gnuplot. The idea is as follows:
You analyze your data with stats, and inside the using statement you check recursively with valid which column is the last valid one. If an invalid column is reached, you return the number of the previous column; otherwise the next column is checked. The number of the last column is then contained in the variable STATS_max.
check_valid_column(c) = valid(c) ? check_valid_column(c + 1) : c - 1
stats 'my.dat' using (check_valid_column(1)) nooutput
last_column = int(STATS_max)
plot 'my.dat' using 1:last_column
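The recursion can be traced with a small Python sketch. This is an illustration under a simplifying assumption: a column counts as valid simply if it exists on the line, which is cruder than gnuplot's valid().

```python
def check_valid_column(fields, c=1):
    """Recursive count, mirroring check_valid_column(c) in gnuplot.

    fields are the whitespace-separated entries of one line. A column is
    treated as valid if it exists (a simplification of gnuplot's valid()).
    """
    return check_valid_column(fields, c + 1) if c <= len(fields) else c - 1


lines = ["1 2", "2 1 3", "3 4 4", "4 5", "5 2 1 3 6"]  # my.dat from the question
per_line = [check_valid_column(line.split()) for line in lines]
last_column = max(per_line)  # plays the role of STATS_max
print(per_line, last_column)
```

The per-line counts show what stats sees for each row; the maximum over all rows is what ends up in last_column.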
Just for the record, here is an alternative suggestion. Christoph's solution is certainly more elegant and probably faster.
However, with the recursive approach you will get an error "recursion depth limit exceeded" if you have more than 250 columns (admittedly, probably very rare cases).
The solution below uses the lines as one string and counts the columns with words(). This, however, works only if you have whitespace as separator. With comma it will not work. Not sure what string length limit would be.
Code: (edit: no need to plot to a dummy table, stats can be used instead)
### find the maximum number of columns
reset session
# create some random test data
set print $Data
rows = int(rand(0)*5+5) # random 5 to 9 lines
do for [r=1:rows] {
    minCols = 251 # if minCols >250, the recursive approach will fail
    cols = int(rand(0)*10+minCols)
    line = ''
    do for [c=1:cols] { line = sprintf("%s %d",line,rand(0)*10) }
    print line
}
set print
# alternative approach with word(). Works only for separator whitespace.
set datafile separator "\n"
maxCol=0
stats $Data u (cols=words(strcol(1)), cols>maxCol?maxCol=cols:0) nooutput
set datafile separator whitespace
print "words() approach: ", maxCol
# Recursive approach for comparison
print "Recursive approach: "
check_valid_column(c) = valid(c) ? check_valid_column(c + 1) : c - 1
stats $Data u (check_valid_column(1)) nooutput
last_column = int(STATS_max)
print last_column
### end of code
Result: (if number of max columns>250)
words() approach: 259
Recursive approach:
"SO41032862.gp" line 28: recursion depth limit exceeded
I have some values given by clock time, where the first column is the time. However, the values until 2 o'clock still belong to the current day. Given
3 1
12 4
18 1
21 2
1 3
2 0
named as test.data, I'd like to print this in gnuplot:
set xrange [0:24]
plot 'test.data' with lines
However, the plot contains a backward line. It's striking through the whole diagram.
Is there a way to tell gnuplot to explicitly not print such backward lines, or even better, print them wrapping around the x axis (e.g. in my example drawing the line as a forward line up to 24, and then continuing it at 0)?
Note: The x axis of the plot should still start at 0 and end at 24.
As far as wrapping over the edge of the graph (a Pac-Man-like effect), gnuplot can't do that on its own. Even doing it manually, you would have to calculate the right point to re-enter the graph based on the slope of the connecting line, and insert new points into the data to control where the exiting line leaves the graph and where the re-entry line enters it. This would require external processing.
If you can do some outside preprocessing, adding a blank line before the 1 3 line will insert a discontinuity into the plot and prevent gnuplot from connecting those points (see help datafile for how gnuplot handles blank lines). Of course, you could always sort the data too.
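If the blank line is to be inserted automatically rather than by hand, one possible pre-processing step is sketched below in Python. The function name break_backward_lines and the rule "x smaller than on the previous row" are assumptions chosen for this example.

```python
def break_backward_lines(lines):
    """Insert an empty line wherever x decreases from one row to the next."""
    out = []
    previous_x = None
    for line in lines:
        x = float(line.split()[0])
        if previous_x is not None and x < previous_x:
            out.append("")  # the blank line creates the discontinuity
        out.append(line)
        previous_x = x
    return out


data = ["3 1", "12 4", "18 1", "21 2", "1 3", "2 0"]  # test.data from the question
print("\n".join(break_backward_lines(data)))
```

With the example data, the empty line lands directly before the 1 3 row.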
I would recommend sorting the data before plotting, but if you do want to do this wrapping effect, the following python program (wrapper.py) will set up the data for it
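# Read (x, y) pairs; the file is assumed to be separated by single spaces.
data = [tuple(map(float, x.strip().split(" "))) for x in open("data.txt", "r")]
data2 = sorted(data)
back_in_to = data2[0]   # leftmost point, where the wrapped line re-enters
out_from = data2[-1]    # rightmost point, where the line leaves the graph
xdelta = back_in_to[0] + 24 - out_from[0]
ydelta = back_in_to[1] - out_from[1]
slope = ydelta / xdelta
outy = out_from[1] + (24 - out_from[0]) * slope  # y where the line crosses x = 24
print(0, outy)
for x in data2:
    print(*x)
    # A blank line after the last point of the original file breaks the line there.
    if x[0] == data[-1][0]:
        print("")
print(24, outy)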
It reads in the data (assumed to be in data.txt), calculates the points where a line should leave the graph and where it should re-enter, and adds these points to the sorted data. It also adds a blank line after the last point of the original file, which causes the break in the line. We can then plot like
plot "< wrapper.py" with lines
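As a worked example of the border calculation, the interpolation step can be isolated in a small function. The name wrap_y and the period parameter are hypothetical and used only for illustration; the arithmetic is the same as in wrapper.py.

```python
def wrap_y(out_from, back_in_to, period=24.0):
    """y-value at which the wrapping line crosses the graph border."""
    # Horizontal distance covered when going forward past the right edge.
    xdelta = back_in_to[0] + period - out_from[0]
    slope = (back_in_to[1] - out_from[1]) / xdelta
    return out_from[1] + (period - out_from[0]) * slope


# Rightmost point (21, 2) and leftmost point (1, 3) of the example data.
print(wrap_y((21, 2), (1, 3)))
```

For the example data the line from (21, 2) to (1, 3) covers 4 hours across midnight and rises by 1, so it crosses the border at y = 2.75. That value is used both at x = 24 and at x = 0.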
If we look at your original plot
we see the backward line that you referred to which reaches from the furthest right point to the next left point. The plot that the python program pre-processed reaches through the right of the graph to move back to this point.
The figure above is generated from data points in a text file. My question is: how can I remove the line between two points where the graph jumps? (In my picture the graph jumps at about x~260.)
Note that I just want the graph to look like a piecewise function, meaning that the line in the middle of the graph should not be connected, because the data jumps there.
In gnuplot you can split a line in several parts either when you have an invalid data value somewhere, or an empty line.
For the first situation, you could check inside the using statement whether the difference to the previous point is too large, and invalidate the current point. But that would make you lose not only the connecting line, but also the first point after the jump:
lim=3
y2=y1=0
plot 'test.dat' using (y2=y1,y1=$2,$1):($0 > 0 && abs(y2-y1) > lim ? 1/0 : y1) with linespoints
The test data file I used is
1 1
2 1.1
3 0.95
4 1
5 5
6 6
7 5.5
8 5.8
9 -2
10 -2.5
11 -4
As you see, the points at x=5 and x=9 are lost.
Alternatively, you can pipe your data through an external tool like awk for the filtering. In this case you can insert an empty line when the difference between two consecutive y-values exceeds some limit:
filter(lim) = 'awk ''{if(NR > 1 && sqrt((y-$2)**2) > '.lim.') print ""; print; y=$2}'' test.dat'
plot '< '.filter(3) using 1:2 with lines
Note that I used sqrt((..)**2) only to simulate an abs function, which awk doesn't have.
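Where awk is not available, the same filtering idea could be sketched in Python. This is an illustration rather than a drop-in replacement for the filter above; the function name split_on_jumps is made up, and the data is given inline instead of being read from test.dat.

```python
def split_on_jumps(lines, lim):
    """Insert an empty line before any row whose y jumps by more than lim."""
    out = []
    previous_y = None
    for line in lines:
        y = float(line.split()[1])
        if previous_y is not None and abs(y - previous_y) > lim:
            out.append("")  # the empty line splits the plotted line here
        out.append(line)
        previous_y = y
    return out


data = ["1 1", "2 1.1", "3 0.95", "4 1", "5 5", "6 6",
        "7 5.5", "8 5.8", "9 -2", "10 -2.5", "11 -4"]
print("\n".join(split_on_jumps(data, 3)))
```

Unlike the approach that invalidates points inside the using statement, this keeps the points at x=5 and x=9 and only removes the connecting segments.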
I want to fit a function with a dataset using gnuplot.
I use a data set example, in the file "data":
1 2
5 4
6 5
7 8
If I do in gnuplot
>f(x) = a*x+b
>fit f(x) "data" via a,b
This works fine (with this example I get a≃0.855 and b≃0.687).
Now what I really want to do is to fit the function floor(a*x+b). So I tried exactly the same way
>f(x) = floor(a*x+b)
>fit f(x) "data" via a,b
And I get the output
Iteration 0
WSSR : 8 delta(WSSR)/WSSR : 0
delta(WSSR) : 0 limit for stopping : 1e-005
lambda : 0
initial set of free parameter values
a = 1
b = 1
Singular matrix in Givens()
error during fit
Googling it didn't help me. I also tried to find out whether there is some known restriction on using fit with floor, but again I didn't find anything.
Does anyone have an idea?
Note : I use Gnuplot 4.6 patchlevel 0, built for Windows 32bit
There is a fundamental problem fitting with floor, which is that your least squares error function is piecewise constant, so when you look for the gradient of the error with respect to your fit parameters you always get zero.
In this example the minimum sum of squares error is exactly 3 for a range of a,b in the neighborhood of .85,1.5
Mathematica (which is far more powerful) gives the result 1,1, along with a warning that, due to the zero gradient, it cannot be sure whether this is really a minimum.
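The piecewise-constant error can be checked numerically with a short Python sketch. It is an illustration only: the function name wssr is borrowed from the gnuplot output, and the step sizes are arbitrary choices.

```python
import math

DATA = [(1, 2), (5, 4), (6, 5), (7, 8)]  # the "data" file from the question


def wssr(a, b):
    """Sum of squared residuals for f(x) = floor(a*x + b)."""
    return sum((y - math.floor(a * x + b)) ** 2 for x, y in DATA)


print(wssr(1, 1))       # starting values a = 1, b = 1 used by fit
print(wssr(0.85, 1.5))  # point quoted in the answer above
# Small steps around (0.85, 1.5) do not move any floor() across an integer,
# so the error stays the same and a finite-difference gradient is zero.
for da, db in [(1e-3, 0), (-1e-3, 0), (0, 1e-3), (0, -1e-3)]:
    print(wssr(0.85 + da, 1.5 + db) - wssr(0.85, 1.5))
```

The first value matches the WSSR of 8 reported by gnuplot at the starting parameters, and the second matches the minimum of 3 mentioned above. The zero differences are what a gradient-based fit sees, which is why it cannot find a direction to move in.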