I have a database of hourly data for an entire year. I want to find the 98th percentile for NO2 (for example) for each hour for each season (Dec-Jan-Feb, Mar-Apr-May, etc.).
I'm trying to use MATCH and INDEX to find the cells for one hour for one season.
=INDEX(A1:E8985,MATCH(Z2,(C3:C8985=AA2,AA3,AA13)*(B3:B8985=Z2),0))
where A1:E8985 is the table area I'm looking in
Z2 is the hour (1:00), looking in column B, which contains the hours
AA2,AA3,AA13 are January, February, and December (one season), looking in column C, which contains the months.
Right now, I'm getting an #N/A error even though the criteria should be met multiple times. I have made sure that the columns match formats.
Sample of part of the table:
Date Time Month NO NO2
1/1/2016 1:00 January -0.1 0.2
1/1/2016 2:00 January -0.1 0.1
1/1/2016 3:00 January -0.1 0.1
1/1/2016 4:00 January -0.1 0.2
1/1/2016 5:00 January -0.1 0.2
1/1/2016 6:00 January -0.1 0.4
1/1/2016 7:00 January -0.1 0.3
1/1/2016 8:00 January -0.1 0.8
1/1/2016 9:00 January -0.1 0.5
1/1/2016 10:00 January -0.1 0.2
1/1/2016 11:00 January -0.1 1.3
1/1/2016 12:00 January -0.1 0.7
1/1/2016 13:00 January -0.1 0.4
1/1/2016 14:00 January 0 0.7
1/1/2016 15:00 January -0.1 0.5
1/1/2016 16:00 January -0.1 0.4
1/1/2016 17:00 January -0.1 1
1/1/2016 18:00 January -0.1 0.7
1/1/2016 19:00 January -0.1 0.9
1/1/2016 20:00 January 1.6 4.5
1/1/2016 21:00 January 2.8 6
1/1/2016 22:00 January 0.1 1.1
1/1/2016 23:00 January 0.2 1.3
1/2/2016 0:00 January 0.2 1.4
To summarize the logic: you want the 98th percentile of NO2 where the month is January, February or December and the time is 1:00, then the same for 2:00, and so on.
If so, the formula below does that, applied only to the sample data you have provided.
Note that it is an array formula, so in Excel versions without dynamic arrays it must be confirmed with Ctrl+Shift+Enter.
=PERCENTILE.INC(
IF(C1:C25="January",
IF(B1:B25=Z2,
E1:E25,
""),
IF(C1:C25="February",
IF(B1:B25=Z2,
E1:E25,
""),
IF(C1:C25="December",
IF(B1:B25=Z2,
E1:E25,
""),
""))
),0.98)
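For readers working outside Excel, the same season-and-hour logic can be sketched in pandas. This is only an illustration under assumptions: the column names `Time`, `Month` and `NO2` mirror the sample table, the `SEASONS` mapping and the tiny data frame are made up for the example, and `quantile` with its default linear interpolation is used as the counterpart of PERCENTILE.INC.

```python
import pandas as pd

# Hypothetical month-to-season mapping (an assumption, not part of the question).
SEASONS = {
    "December": "DJF", "January": "DJF", "February": "DJF",
    "March": "MAM", "April": "MAM", "May": "MAM",
    "June": "JJA", "July": "JJA", "August": "JJA",
    "September": "SON", "October": "SON", "November": "SON",
}

# Tiny made-up frame; the real table would have one row per hour of the year.
df = pd.DataFrame({
    "Time": ["1:00", "1:00", "1:00", "2:00", "2:00", "1:00"],
    "Month": ["January", "February", "December", "January", "February", "March"],
    "NO2": [0.2, 1.0, 0.6, 0.1, 0.3, 5.0],
})

# Label each row with its season, then take the 98th percentile
# of NO2 for every (season, hour) pair in one pass.
df["Season"] = df["Month"].map(SEASONS)
p98 = df.groupby(["Season", "Time"])["NO2"].quantile(0.98)
print(p98)
```

Grouping on both keys at once avoids writing one formula per hour and per season.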
Related
I have two dataframes df1 and df2
df1
Date/Time S
1/1/2012 0:00 7.51
1/1/2012 1:00 7.28
1/1/2012 2:00 6.75
1/1/2012 3:00 15.00
1/1/2012 4:00 8.18
1/1/2012 5:00 0.00
1/1/2012 6:00 5.00
df2
S Val
3.00 30
4.00 186
5.00 406
6.00 723
7.00 1169
8.00 1704
9.00 2230
10.00 2520
11.00 2620
12.00 2700
I would like to merge the two dataframes with an interpolated Val.
pd.merge(df1, df2, left_on=['S'], right_on=['S'])
For example:
Column 'S' in df1 is the lookup value, column 'S' in df2 is the lookup range, and column 'Val' in df2 is the output range.
Values below 3 and above 12 should give 0.
The output should be as shown below. How can I achieve this in pandas? Any alternative solution in Python other than looping is much appreciated.
Output
Date/Time S Val
1/1/2012 0:00 7.51 1441.9
1/1/2012 1:00 7.28 1318.8
1/1/2012 2:00 6.75 1057.5
1/1/2012 3:00 15.00 0.0
1/1/2012 4:00 8.18 1798.7
1/1/2012 5:00 0.00 0.0
1/1/2012 6:00 5.00 406.00
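Assuming df2 is sorted by column S and its S values are consecutive integers (the code looks up the floor of each df1.S value), you can do:
import numpy as np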
tmp = df1.assign(tmp=df1.S.apply(np.floor)).merge(df2.assign(tmp2=(df2.Val.shift(-1) - df2.Val)), how='outer', left_on='tmp', right_on='S')
tmp.loc[tmp.Val.isna(), 'S_x'] = 0
tmp['Val'] = (tmp['S_x'] - tmp['S_y'].fillna(0)) * tmp['tmp2'].fillna(1) + tmp['Val'].fillna(0)
print(tmp[['Date/Time', 'S_x', 'Val']].dropna().sort_values(by='Date/Time').rename(columns={'S_x': 'S'}))
Prints:
Date/Time S Val
0 1/1/2012 0:00 7.51 1441.85
1 1/1/2012 1:00 7.28 1318.80
2 1/1/2012 2:00 6.75 1057.50
3 1/1/2012 3:00 15.00 0.00
4 1/1/2012 4:00 8.18 1798.68
5 1/1/2012 5:00 0.00 0.00
6 1/1/2012 6:00 5.00 406.00
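As an alternative to the merge-based approach above, linear interpolation with zero outside the table range can also be sketched with `numpy.interp`. This is a different technique from the answer's, shown under the assumption that df2.S is sorted in increasing order; the frames below are rebuilt from the question's sample data.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Date/Time": ["1/1/2012 0:00", "1/1/2012 1:00", "1/1/2012 2:00",
                  "1/1/2012 3:00", "1/1/2012 4:00", "1/1/2012 5:00",
                  "1/1/2012 6:00"],
    "S": [7.51, 7.28, 6.75, 15.00, 8.18, 0.00, 5.00],
})
df2 = pd.DataFrame({
    "S": [3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0],
    "Val": [30, 186, 406, 723, 1169, 1704, 2230, 2520, 2620, 2700],
})

# np.interp needs increasing x-coordinates (df2.S).
# left/right set the result for lookups below 3 and above 12.
df1["Val"] = np.interp(df1["S"], df2["S"], df2["Val"], left=0.0, right=0.0)
print(df1)
```

Unlike the floor-based merge, this does not require the S values in df2 to be evenly spaced.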
I have a dataframe with a RangeIndex, timestamps in the first column and several thousand hourly temperature observations in the second.
It is easy enough to group the observations by 24 and find daily Tmax and Tmin. But I also want the timestamp of each day's max and min values.
How can I do that?
I hope I can get help without posting a working example, because the nature of the data makes it impractical.
EDIT: Here's some data, spanning two days.
DT T-C
0 2015-01-01 00:00:00 -2.5
1 2015-01-01 01:00:00 -2.1
2 2015-01-01 02:00:00 -2.3
3 2015-01-01 03:00:00 -2.3
4 2015-01-01 04:00:00 -2.3
5 2015-01-01 05:00:00 -2.0
...
24 2015-01-02 00:00:00 1.1
25 2015-01-02 01:00:00 1.1
26 2015-01-02 02:00:00 0.8
27 2015-01-02 03:00:00 0.5
28 2015-01-02 04:00:00 1.0
29 2015-01-02 05:00:00 0.7
First create a DatetimeIndex, then aggregate with a daily Grouper, using idxmax and idxmin to get the datetimes of the max and min temperatures:
df['DT'] = pd.to_datetime(df['DT'])
df = df.set_index('DT')
df = df.groupby(pd.Grouper(freq='D'))['T-C'].agg(['idxmax','idxmin','max','min'])
print (df)
idxmax idxmin max min
DT
2015-01-01 2015-01-01 05:00:00 2015-01-01 00:00:00 -2.0 -2.5
2015-01-02 2015-01-02 00:00:00 2015-01-02 03:00:00 1.1 0.5
I am trying to create a report that is grouped by day of week for each year.
I have a df that looks like this:
s1 s2 srd
dt
2004-02-04 11:21:00 2365.79 2372.37 -7.0
2004-02-05 10:15:00 2365.79 2368.03 -2.0
2004-02-17 06:43:00 2421.05 2425.26 -4.0
2004-02-17 12:43:00 2418.42 2420.53 -2.0
2004-02-17 12:44:00 2420.39 2420.53 -0.0
The dt index is in datetime format.
What I am looking for is a dataframe that looks like this (I only need srd column and function to group can be anything, like sum, count, etc.):
srd
dayOfWeek year
Mon 2004 10
2005 11
2006 8
2007 120
Tues 2004 105
2005 105
I have tried dayOfWeekDf = df.resample('B'), but I get a dataframe that looks like it is split by week number.
I also tried df.groupby([df.index.weekday, df.index.year])['srd'].transform('sum'), but it does not even group for some reason, as I get the following (Feb 17th appears 3 times).
srd
dt
2004-02-04 11:21:00 81.0
2004-02-05 10:15:00 203.0
2004-02-17 06:43:00 37.0
2004-02-17 12:43:00 37.0
2004-02-17 12:44:00 37.0
If you want the dayOfWeek and year names in the index, you can assign them as columns first (day_name() replaces weekday_name, which was removed in pandas 1.0):
>>> df.assign(year=df.index.year, dayOfWeek=df.index.day_name()).groupby(['dayOfWeek','year']).srd.sum()
dayOfWeek year
Thursday 2004 -2.0
Tuesday 2004 -6.0
Wednesday 2004 -7.0
Name: srd, dtype: float64
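Otherwise, you can keep your approach but replace transform with a plain aggregation; transform broadcasts each group's result back onto every original row, which is why Feb 17th appeared three times:
>>> df.groupby([df.index.day_name(), df.index.year])['srd'].sum()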
dt dt
Thursday 2004 -2.0
Tuesday 2004 -6.0
Wednesday 2004 -7.0
Name: srd, dtype: float64
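The difference between `transform` and an aggregation such as `sum` can be seen on the five sample rows from the question. The sketch below rebuilds only the `srd` column and uses `DatetimeIndex.day_name()`, which is the method available in current pandas (an assumption about the reader's version; older releases exposed a `weekday_name` attribute instead).

```python
import pandas as pd

df = pd.DataFrame(
    {"srd": [-7.0, -2.0, -4.0, -2.0, -0.0]},
    index=pd.to_datetime([
        "2004-02-04 11:21:00", "2004-02-05 10:15:00",
        "2004-02-17 06:43:00", "2004-02-17 12:43:00",
        "2004-02-17 12:44:00",
    ]),
)
df.index.name = "dt"

# Add the grouping keys as ordinary columns so they name the result's index.
keyed = df.assign(dayOfWeek=df.index.day_name(), year=df.index.year)
grouped = keyed.groupby(["dayOfWeek", "year"])["srd"]

# sum() collapses each group to a single row ...
report = grouped.sum()
# ... while transform("sum") keeps one row per original observation.
broadcast = grouped.transform("sum")

print(report)
print(broadcast)
```

`report` has one row per (day, year) pair, whereas `broadcast` has the same length as the input, with the group total repeated on each row.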
I have data taken at different times on different days, for example:
dateTimeRead(YYYY-MM-DD HH-mm-ss) rain_value(mm) air_pressure(hPa)
1/2/2015 0:00 0 941.5675
1/2/2015 0:15 0 941.4625
1/2/2015 0:30 0 941.3
1/2/2015 0:45 0 941.2725
1/2/2015 1:00 0.2 941.12
1/2/2015 1:15 0 940.8625
1/2/2015 1:30 0 940.7575
1/2/2015 1:45 0 940.6075
1/2/2015 2:00 0 940.545
1/2/2015 2:15 0 940.27
1/2/2015 2:30 0 940.2125
1/2/2015 16:15 0 940.625
1/2/2015 16:30 0 940.69
1/2/2015 16:45 0 940.6175
1/2/2015 17:00 0 940.635
1/2/2015 19:00 0 941.9975
1/2/2015 20:45 0 942.7925
1/2/2015 21:00 0 942.745
1/2/2015 21:15 0 942.6325
1/2/2015 21:30 0 942.735
1/2/2015 21:45 0 942.765
1/2/2015 22:00 0 7/30/1902
1/3/2015 2:30 0 941.1275
1/3/2015 2:45 0 941.125
1/3/2015 3:00 0 940.955
1/3/2015 3:15 0 941.035
There are dates with missing time stamps.
From these readings how may I extract the maximum values by day for rain_value(mm)?
There is a fairly standard array formula style to provide a pseudo-MAXIF function but I prefer to use INDEX and enter it as a standard formula.
With the date to be determined in F3, the formula in G3 is,
=MAX(INDEX(($A$2:$A$999>=$F3)*($A$2:$A$999<(F3+1))*$B$2:$B$999, , ))
A CSE array formula for the same thing would be something like,
=MAX(IF($A$2:$A$999>=$F3, IF($A$2:$A$999<$F3+1, $B$2:$B$999)))
Array formulas need to be finalized with Ctrl+Shift+Enter.
An array formula may not be suitable for your particular requirement since it seems you may have very many readings. Instead I would suggest a PivotTable, with the date/Time entries parsed (Text to Columns, Fixed width) and date for ROWS, Max of rain_value(mm) for VALUES.
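If the readings can be loaded outside Excel, the same "maximum per day" grouping that the PivotTable performs can be sketched in pandas. This is an illustrative alternative only: the shortened column names are hypothetical, only a handful of the sample rows are reproduced, and the timestamps are assumed to be in month/day/year order.

```python
import pandas as pd

# A few rows taken from the sample; gaps in the timestamps do not matter here.
readings = pd.DataFrame({
    "dateTimeRead": ["1/2/2015 0:45", "1/2/2015 1:00", "1/2/2015 1:15",
                     "1/3/2015 2:30", "1/3/2015 2:45"],
    "rain_value": [0.0, 0.2, 0.0, 0.0, 0.0],
})

# Assumption: month/day/year ordering of the date part.
readings["dateTimeRead"] = pd.to_datetime(
    readings["dateTimeRead"], format="%m/%d/%Y %H:%M"
)

# Group on the calendar date only, then take the largest reading per day.
daily_max = readings.groupby(readings["dateTimeRead"].dt.date)["rain_value"].max()
print(daily_max)
```

Because the grouping key is derived from each row's own timestamp, missing time stamps simply contribute no rows to that day.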
How can I convert my hourly data to daily data when some hourly values are missing? My Excel sheet looks like this:
date hour ENERGY(MJ)
1/01/2002 0:00 0
1/01/2002 1:00 0
1/01/2002 2:00 0
1/01/2002 3:00 0
1/01/2002 4:00 0
1/01/2002 5:00 0
1/01/2002 6:00 0.15
1/01/2002 7:00 0.74
1/01/2002 8:00 1.46
1/01/2002 9:00 2.23
1/01/2002 10:00 2.89
Thanks
If your data is in A1:C12, you have 5 non-zero hourly readings (7.47 MJ in total), which can be summed with SUM(C2:C12) and then divided by 5, a count obtainable with COUNTIF(C2:C12,">"&0), to give an average hourly rate from the data available. Multiplying that rate by 24 scales it up to a full day:
=SUM(C2:C12)*24/COUNTIF(C2:C12,">"&0)
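The arithmetic behind that formula can be checked with a short sketch on the eleven sample values. It follows the formula's own convention of counting only readings greater than zero.

```python
# ENERGY(MJ) values from the sample, i.e. cells C2:C12.
energy = [0, 0, 0, 0, 0, 0, 0.15, 0.74, 1.46, 2.23, 2.89]

# Equivalent of COUNTIF(C2:C12,">"&0): readings greater than zero.
nonzero_count = sum(1 for e in energy if e > 0)

# Equivalent of SUM(C2:C12)*24/COUNTIF(...): mean hourly rate scaled to 24 hours.
daily_estimate = sum(energy) * 24 / nonzero_count
print(round(daily_estimate, 3))
```

For the sample this works out to 7.47 * 24 / 5 = 35.856 MJ.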