Group by column and find the min and max date by group


I have the following data set:
Time Title
00:00:00 Title #one
00:12:38 Title #one
00:13:39 Title #one
00:33:14 Title #one
00:33:44 Title #two
00:49:27 Title #two
00:49:57 Title #two
00:59:43 Title #two
00:59:51 Title #three
00:59:59 Title #three
01:28:29 Title #three
01:28:38 Title #four
01:29:08 Title #four
01:29:38 Title #four
01:29:59 Title #four
01:37:08 Title #four
01:37:53 Title #four
01:38:53 Title #four
01:46:20 Title #four
I want to group by Title and return the min and max time for each title.
I've been able to get the min and max time for each title with this:
df_max = df.groupby(['Title'])['Start time'].max()
df_min = df.groupby(['Title'])['Start time'].min()
However, I don't really understand what df_max and df_min actually return, or how I should merge them.

You should use the .agg method, which computes both aggregates in a single call:
df.groupby(['Title']).agg({'Start time':[min,max]})
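A minimal end-to-end sketch of that approach, assuming pandas is installed and that the column is called Time as in the sample data (the question's code calls it 'Start time'). The aggregations are passed as the strings 'min' and 'max' rather than the builtins, because newer pandas releases warn when builtins are passed to agg. Only a few rows of the sample are used:

```python
import io
import pandas as pd

# A few rows from the question's sample, written as CSV (column names assumed).
raw = """Time,Title
00:00:00,Title #one
00:12:38,Title #one
00:33:14,Title #one
00:33:44,Title #two
00:59:43,Title #two
00:59:51,Title #three
01:28:29,Title #three
"""

df = pd.read_csv(io.StringIO(raw))
# Convert to timedeltas so min/max compare durations rather than strings.
df["Time"] = pd.to_timedelta(df["Time"])

# One row per title, with a "min" and a "max" column.
result = df.groupby("Title", sort=False)["Time"].agg(["min", "max"])
print(result)
```

For comparison, df_min and df_max in the question are each a Series indexed by Title, so they could also be combined afterwards with pd.concat([df_min, df_max], axis=1); agg simply does both steps at once.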

Related

Getting plot title and caption data from the data file

Consider the following file that I want to plot using gnuplot: Servos20211222_105253.csv
# Date/Time 2021/12/22, 10:52:53
# PonE=0,LsKp=200,LsKi=0,LsKd=250,HsKp=40,HsKi=0,HsKd=130,Sp=800,TDEC=1175137
#
# Rel. Time, currentPos, PosPID, currentSpeed, speedPID, Lag, ServoPos
0.00000,4693184,0,0,0,0,4693184
0.00000,4693184,2300,0,368,0,4693184
0.00391,4693185,2300,12,367,0,4693184
:
:
I would like to:
set the plot title to the date/time from the first comment record.
display the record that starts "# PonE" as a caption.
extract the value for TDEC and plot a horizontal line with the name "Target"
I have some influence over the format of the header records, so if (for example) it would be better that they were not comments but provided in some other way, then that can be done.

Getting text values from files using only gnuplot is a common problem. If you can use OS- and shell-dependent solutions, I'd suggest removing the comment markers from the header lines and trying something like
set title "`head -1 Servos20211222_105253.csv`"
You can place text anywhere using set label <"label text">, where the label text can be the 2nd line from the file.
You can plot a straight line using plot:
p sin(x), 0.5 title "TDEC"
Instead of 0.5 you would use the TDEC value, which again has to be extracted with shell tools, e.g. the Unix cut command.
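If a small script outside gnuplot is acceptable, the three values can be pulled out of the header in a few lines. The following is only a sketch in Python, assuming the header layout shown in the question; parse_header is a hypothetical helper name, and the results would still have to be handed to gnuplot (for example as variables on the command line).

```python
def parse_header(lines):
    """Return (title, caption, tdec) from the leading '#' comment lines."""
    header = [ln[1:].strip() for ln in lines if ln.startswith("#")]
    # First comment line: "Date/Time 2021/12/22, 10:52:53"
    title = header[0].replace("Date/Time", "", 1).strip()
    # Second comment line: comma-separated key=value parameters
    caption = header[1]
    params = dict(item.split("=", 1) for item in caption.split(","))
    return title, caption, float(params["TDEC"])


sample = [
    "# Date/Time 2021/12/22, 10:52:53",
    "# PonE=0,LsKp=200,LsKi=0,LsKd=250,HsKp=40,HsKi=0,HsKd=130,Sp=800,TDEC=1175137",
    "#",
    "# Rel. Time, currentPos, PosPID, currentSpeed, speedPID, Lag, ServoPos",
    "0.00000,4693184,0,0,0,0,4693184",
]

title, caption, tdec = parse_header(sample)
print(title)    # plot title
print(caption)  # caption text
print(tdec)     # y value for the horizontal "Target" line
```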

There are ways with gnuplot only, although sometimes a bit cumbersome compared with using tools which you have available on Linux (or comparable tools which you need to install on Windows).
Update: shorter and "simplified" script
One possible gnuplot-only way:
set commentschar to nothing, i.e. ''
assign the columns to variables and/or arrays, e.g. myDate, myTime, P[1..9].
Merge P[1..8] into a multi-line string Params by "mis"-using sum (check help sum)
Convert P[9] into a floating point number TDEC for plotting
Script: (modified the data a bit just for illustration)
### extract values from headers with gnuplot only
reset session
$Data <<EOD
# Date/Time 2021/12/22, 10:52:53
# PonE=0,LsKp=200,LsKi=0,LsKd=250,HsKp=40,HsKi=0,HsKd=130,Sp=800,TDEC=1175137
#
# Rel. Time, currentPos, PosPID, currentSpeed, speedPID, Lag, ServoPos
0.00000,1300000,0,0,0,0,4693184
0.00200,1200000,2300,0,368,0,4693184
0.00391,1100000,2300,12,367,0,4693184
EOD
set datafile separator comma commentschar ''
array P[9] # array to store parameters
stats $Data u ($0==0 ? (myDate=strcol(1)[3:], myTime=strcol(2)) : \
sum [_i=1:9] (P[_i] = _i==1 ? strcol(_i)[3:] : strcol(_i) ,0 )) \
every ::0::1 nooutput
set datafile commentschar # set back to default
Params = P[1]
Params = (sum [_i=2:8] (Params=Params.sprintf("\n%s",P[_i]),0),Params)
set title sprintf("%s %s", myDate, myTime)
TDEC = real(P[9][6:]) # convert to real number
set label 1 at graph 0.02, first TDEC P[9] offset 0,-0.7
set label 2 at graph 0.02, graph 0.85 Params
plot $Data u 1:2 w lp pt 7 title "Data", \
TDEC w l lc "red" title "Target"
### end of script
Result:

max value for same minute over multiple days from csv with unix timestamps

I have a CSV with a unix timestamp column that was collected over multiple days, with one data row every 5 minutes (the output log of my photovoltaic rooftop power plant).
I'd like to create a plot for 24 hours that shows the maximum value for every single (fifth) minute over all days.
Can this be done with gnuplot's own capabilities or do I have to do the processing outside gnuplot via scripts?
You don't show what your exact data structure looks like. - theozh
The files are rather large. I placed an example here:
http://www.filedropper.com/log-pv-20190607-20190811 (300kB)
I'm especially interested in columns 4 (DC1 P) and 9 (DC2 P).
Column 1 (Zeit) holds the unix timestamp.
The final goal is separate graphs (colors) for DC1 P and DC2 P, but that's a different question... ;o)

Update/Revision:
After revisiting this answer, I think it is time for a clean-up and a simpler, extended solution. After some iterations and clarifications, and after the OP provided some data (although the link is no longer valid), I came up with the following suggestions, which can surely be improved further.
You can do it all in gnuplot, no need for external tools!
The original request to plot the maximum values from several days is easy if you use the plotting style with boxes. But this is basically only a graphical solution. In that case it was apparently sufficient. However, if you are interested in the maximum values as numbers, it takes a little more effort.
gnuplot has the options smooth unique and smooth frequency (check help smooth). With these you can easily get the average and the sum, respectively, but there is no smooth max or smooth min. As @meuh suggested, you can get the maximum or minimum with arrays, which are available since gnuplot 5.2.0.
Script: (Requires gnuplot>=5.2.0)
### plot time data modulo 24h avg/sum/min/max
reset session
FILE = 'log-pv-20190607-20190811.csv'
set datafile separator comma
HeaderCount = 7
myTimeFmt = "%Y-%m-%d %H:%M:%S"
StartTime = ''
EndTime = ''
# if you don't define start/end time it will be taken automatically
if (StartTime eq '' || EndTime eq '') {
stats FILE u 1 skip HeaderCount nooutput
StartTime = (StartTime eq '' ? STATS_min : strptime(myTimeFmt,StartTime))
EndTime = (EndTime eq '' ? STATS_max : strptime(myTimeFmt,EndTime))
}
Modulo24Hours(t) = (t>=StartTime && t<=EndTime) ? (int(t)%86400) : NaN
set key noautotitle
set multiplot layout 3,2
set title "All data" offset 0,-0.5
set format x "%d.%m." timedate
set grid x,y
set yrange [0:]
myHeight = 1./3*1.1
set size 1.0,myHeight
plot FILE u 1:4:(tm_mday($1)) skip HeaderCount w l lc var
set multiplot next
set title "Data per 24 hours"
set format x "%H:%M" timedate
set xtics 3600*6
set size 0.5,myHeight
plot FILE u (Modulo24Hours($1)):4:(tm_mday($1)) skip HeaderCount w l lc var
set title "Average"
set size 0.5,myHeight
plot FILE u (int(Modulo24Hours($1))):4 skip HeaderCount smooth unique w l lc "web-green"
set title "Sum"
set size 0.5,myHeight
plot FILE u (int(Modulo24Hours($1))):4 skip HeaderCount smooth freq w l
set title "Min/Max"
set size 0.5,myHeight
N = 24*60/5
SecPerDay = 3600*24
array Min[N]
array Max[N]
do for [i=1:N] { Min[i]=NaN; Max[i]=0 } # initialize arrays
stats FILE u (idx=(int($1)%SecPerDay)/300+1, $4>Max[idx] ? Max[idx]=$4:0, \
Min[idx]!=Min[idx] ? Min[idx]=$4 : $4<Min[idx] ? Min[idx]=$4:0 ) skip HeaderCount nooutput
plot Min u ($1*300):2 w l lc "web-blue", \
Max u ($1*300):2 w l lc "red"
unset multiplot
### end of script
Result:

From gnuplot 5.2 you could use the new array datatype to calculate a maximum value for each 5 minute slot. I am not a gnuplot expert, so the following example needs more work, but shows the potential.
Assume data is similar to these lines, where there is a date in the format
yyyy.mm.dd.HH:MM, a comma and a y value:
2018.02.03.18:23,4
2018.02.03.19:23,7
2018.02.04.18:23,8
2018.02.05.19:23,11
Instead of using gnuplot's built-in time parsing, since we want to ignore the date, we create a function fsecs to use substr(stringcolumn(...),12,16) to get just the hours and minutes from data column 1, and strptime("%H:%M",...) to convert this to seconds:
set datafile separator ","
fsecs(v) = strptime("%H:%M",substr(stringcolumn(v),12,16))
We create an array Max indexed by "5 minute slot", of which there are 24*60/5 per day. It is initialised to NaN, not-a-number.
Nitems = int(24*60/5)+1
array Max[Nitems]
do for [i=1:Nitems] {
Max[i] = NaN
}
We then "plot" the data file data.csv into a dummy table, rather than generating any output. As we go through the data we index Max by the data x value (column 1) converted to seconds by fsecs(1) and then to slot by findex(). This is Max[findex(fsecs(1))].
We call our function fmax() to return the new maximum to set in the array.
findex(x) = int(((x)/60)/5) + 1   # +1 because gnuplot arrays are indexed from 1
fmax(a,b) = ((a>=b)?a:b)
set table $Dummy
plot 'data.csv' using \
(Max[findex(fsecs(1))] = fmax(Max[findex(fsecs(1))],$2)):2
unset table
Finally, we plot the array, which is the slot number against the value held in that slot number.
plot Max using 1:(Max[$1]) with points lw 2 title "max day"
This works for me on 5.2. You still need to label the x axis with HH:MM, and change the date parsing to fit your needs.

For time formatting, please see Gnuplot date/time in x axis
If you do not care about formatting the values as time, you may use the every keyword (see the gnuplot documentation), but that only selects rows; it does not compute a maximum.
For the maximum value over a given time interval I suggest an awk script, see e.g. https://unix.stackexchange.com/a/207287/297901
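The linked awk answer is not reproduced here. As an alternative sketch of the same preprocessing idea, written in Python: it assumes rows of (unix timestamp, value) and 5-minute slots, and max_per_slot is a hypothetical function name. The sample rows are made up for illustration and cover the same two slots on two consecutive days.

```python
SLOT = 300    # seconds per 5-minute slot
DAY = 86400   # seconds per day


def max_per_slot(rows):
    """Map each slot's start (seconds since midnight UTC) to the largest value seen."""
    best = {}
    for ts, value in rows:
        slot = (int(ts) % DAY) // SLOT * SLOT
        if slot not in best or value > best[slot]:
            best[slot] = value
    return dict(sorted(best.items()))


# Illustrative rows: (unix timestamp, value)
rows = [
    (1559901600, 120.0),  # 2019-06-07 10:00 UTC
    (1559901900, 150.0),  # 2019-06-07 10:05 UTC
    (1559988000, 180.0),  # 2019-06-08 10:00 UTC
    (1559988300, 90.0),   # 2019-06-08 10:05 UTC
]

maxima = max_per_slot(rows)
for slot, value in maxima.items():
    print(f"{slot},{value}")  # two comma-separated columns that gnuplot can read
```

Taking the timestamp modulo 86400 yields the time of day in UTC, so a fixed offset would have to be added first if local time is wanted.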

Gnuplot : 2 questions about my histograms X axis and add percentage

I need to generate a gnuplot histogram in order to see the monthly CPU and RAM evolution of my cluster.
I want to generate the histogram from this file:
July 2018,19%,46%
August 2018,20%,45%
September 2018,20%,41%
October 2018,21%,39%
November 2018,21%,39%
December 2018,21%,41%
January 2019,25%,46%
February 2019,27%,50%
To do that, this is my code:
set title " CLUSTER 1 "
set terminal png truecolor size 960, 720
set output " cluster1.png"
set key below
set grid
set style data histograms
set style fill solid 1.00 border -1
set datafile separator ","
plot 'cluster.txt' using 2:xtic(1) title " CPU consumption (%) ", '' using 3 title " RAM consumption (%)"
For the moment, I have this result:
But as you can see, I have a problem with my x axis. The dates overlap each other and I'm not able to change that. Can you show me how to fix it?
Also, can you tell me how I can put the percentages in or above the histogram bars?
In the end, I would like a histogram like this:

To wrap words in the category labels, you can replace a space with a line break where necessary, using a ternary expression:
f(w) = (strlen(w) > 10 ? word(w, 1) . "\n" . word(w, 2) : w)
It replaces the space with "\n" if the label is longer than 10 characters.
To add a percentage sign on Y axis, set y format like this:
set format y "%g%%"
To add labels, use plot with labels:
'' using 0:($2+1):(sprintf("%g%%",$2)) with labels notitle, \
'' using 0:($3+1):(sprintf(" %g%%",$3)) with labels notitle
You may need to change bottom margin of the plot to fit two-line labels and key:
set bmargin at screen 0.1
So the script becomes like this:
f(w) = (strlen(w) > 10 ? word(w, 1) . "\n" . word(w, 2) : w)
set title "CLUSTER 1"
set terminal png truecolor size 960, 720
set output "cluster1.png"
set bmargin at screen 0.1
set key below
set grid
set style data histograms
set style fill solid 1.00 border -1
set boxwidth 0.7 relative
set yrange [0:]
set format y "%g%%"
set datafile separator ","
plot 'cluster.txt' using 2:xtic(f(stringcolumn(1))) title " CPU consumption (%) ", \
'' using 3 title " RAM consumption (%)", \
'' using 0:($2+1):(sprintf("%g%%",$2)) with labels notitle, \
'' using 0:($3+1):(sprintf(" %g%%",$3)) with labels notitle

GNUPLOT if I have 30 data/1sec the code not working,why?

Hi,
When I run this code with only 1 sample per second, it works.
But when I change the sensor sample rate to about 30 Hz, i.e. 30 samples per second, I get this plot:
I want something like this:
The code I use is this:
set term png
set autoscale yfix
set autoscale xfix
set grid
#set offsets 0,0,30,30
#set offsets 0,0,30,30
set offsets graph 0, 0, 50, 50
PATCH = system("cat ./outputs/lastPatch.txt")
set title "MPU-9150 IMU sensor: RAW Accelerometer "
set xdata time
set timefmt '%H:%M:%S'
set format x '%H:%M:%S'
set xtics rotate
set xlabel "[time]"
set ylabel "[-]"
#set output "./outputs/IMUrawAccel.png"
set output sprintf("%sIMUrawAccel.png",PATCH)
#plot './outputs/IMUrawAccel.txt' using 1:2 title 'X' with lines, './outputs/IMUrawAccel.txt' using 1:3 title 'Y' with lines , './outputs/IMUrawAccel.txt' using 1:4 title 'Z' with lines
plot sprintf("%sIMUrawAccel.txt",PATCH) using 1:2 title 'X' with lines, sprintf("%sIMUrawAccel.txt",PATCH) using 1:3 title 'Y' with lines , sprintf("%sIMUrawAccel.txt",PATCH) using 1:4 title 'Z' with lines
And the output of the sensor is this (what I use in gnuplot)
22:20:59 2704 -1310 -15666
22:20:59 2886 -1278 -15716
22:20:59 2860 -1322 -15734
22:20:59 2844 -1322 -15684
22:20:59 2854 -1362 -15680
22:20:59 2834 -1242 -15766
22:20:59 2864 -1320 -15830
22:20:59 2836 -1304 -15724
22:20:59 2882 -1342 -15744
22:20:59 2888 -1266 -15794
22:20:59 2940 -1336 -15774
22:20:59 2866 -1282 -15786
22:20:59 2860 -1320 -15756
22:20:59 2810 -1340 -15710

There are multiple datapoints with the same timestamps.
So if you plot them by timestamp, they will end up on the same x-coordinate.
There are several possible solutions.
Change the sensor output to include e.g. milliseconds. This is by far the preferred solution.
If there are a constant number of points with the same timestamp (say 30), you could use the every keyword: plot sprintf("%sIMUrawAccel.txt",PATCH) every 30 using 1:2 title 'X' with lines
If that is not possible, write a script to enhance the timestamp. Count the number of data-points with the same time-stamp, and then add some milliseconds to each timestamp to space them evenly. (Assuming of course that the data is evenly spaced.)
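Just plot the points against the row number instead of the time: plot sprintf("%sIMUrawAccel.txt",PATCH) using 0:2 title 'X' with lines (an empty first entry, as in using :2, still defaults to column 1, so use 0 and also remove the set xdata time, set timefmt and set format x lines).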
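The option of enhancing the timestamps with a script could look roughly like the following Python sketch. It assumes whitespace-separated rows that start with an HH:MM:SS stamp and that rows with the same stamp are adjacent and evenly spaced in time; spread_timestamps is a hypothetical name, and the last sample row (22:21:00) is made up to show the transition to the next second.

```python
from itertools import groupby


def spread_timestamps(lines):
    """Append evenly spaced milliseconds to rows sharing the same HH:MM:SS stamp."""
    out = []
    for stamp, group in groupby(lines, key=lambda ln: ln.split()[0]):
        rows = list(group)
        for i, row in enumerate(rows):
            ms = i * 1000 // len(rows)       # offset within the second
            rest = row.split(None, 1)[1]     # everything after the timestamp
            out.append(f"{stamp}.{ms:03d} {rest}")
    return out


sample = [
    "22:20:59 2704 -1310 -15666",
    "22:20:59 2886 -1278 -15716",
    "22:20:59 2860 -1322 -15734",
    "22:20:59 2844 -1322 -15684",
    "22:21:00 2854 -1362 -15680",  # hypothetical next-second row
]

spread = spread_timestamps(sample)
for line in spread:
    print(line)
```

With set timefmt '%H:%M:%S' gnuplot should read the fractional seconds as part of %S, so the rest of the plotting script can stay as it is.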

Changing title within loop

I am using gnuplot 5.0, and I have a data set I would like to plot using
key1 = 'Some title with multiple words'
key2 = 'Some other descriptive title '
key3 = '...and a third title'
plot for[i=1:3] datafile index i-1 using 1:2 with lines title eval('key'.i)
This is not working, but I would like to have a different string with multiple words for each curve. Using words() and word() will not work. So, how can I change the title in a plot-for command?

Gnuplot 5.0 introduces some limited support for using quoted strings with word and words:
keys = '"Some title with multiple words" '.\
'"Some other descriptive title" '.\
'"...and a third title"'
plot for[i=1:3] i*x with lines title word(keys, i)
