Dealing with last hour in a resampled year in pandas

I have been dealing with a time series with one entry per hour over a year. To better analyse the data, I have been resampling by month with pandas and summing the results with df = df.resample('M').sum()
As the last hour of the last day runs from 23:00 31/12 to 00:00 01/01 of the following year, that final hour is resampled into January of the following year (e.g. my time series is for 2020, so the last hour of 31/12/2020 is resampled into January 2021). This means I lose data for December.
I have considered adding the data back into December manually, but is there a better way to achieve this?

Unfortunately you need to add it to the previous hour, e.g. by shifting any timestamp that falls in the following year back by one hour:
import pandas as pd

# sample data: 13 hourly entries spanning the year boundary
rng = pd.date_range('2017-12-31 12:00:00', periods=13, freq='H')
df = pd.DataFrame({'a': range(len(rng))}, index=rng)

# shift any timestamp in the following (max) year back by one hour
y = df.index.year
df.index = df.index.where(y != y.max(), df.index - pd.Timedelta(1, unit='H'))
print (df)
a
2017-12-31 12:00:00 0
2017-12-31 13:00:00 1
2017-12-31 14:00:00 2
2017-12-31 15:00:00 3
2017-12-31 16:00:00 4
2017-12-31 17:00:00 5
2017-12-31 18:00:00 6
2017-12-31 19:00:00 7
2017-12-31 20:00:00 8
2017-12-31 21:00:00 9
2017-12-31 22:00:00 10
2017-12-31 23:00:00 11
2017-12-31 23:00:00 12
df = df.resample('M').sum()
print (df)
a
2017-12-31 78
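A sketch of an equivalent fix for the 2020 case described in the question, assuming an hourly index as described: clamp any timestamp that spills into 2021 back into the final hour of 2020, then resample as before.
# clamp the stray first hour of 2021 back into 2020's final hour
end = pd.Timestamp('2020-12-31 23:00:00')
df.index = df.index.where(df.index <= end, end)
df = df.resample('M').sum()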


Pad daily(ish) dataframe into hourly

I have a dataframe of gasoline prices in my area that I'd like to combine with a dataframe of electricity prices. The issue is that my gasoline data is daily-ish while the electricity prices are hourly.
How can I duplicate my daily gas values so they fit my electricity price dataframe? Or is there a smarter way to do this, where I can reference the gas prices from the electricity dataframe?
My data:
Price
Date
2022-10-20 16.19
2022-10-19 16.49
2022-10-18 16.69
2022-10-15 16.99
I need to do hourly analysis of comparison between the two, so averaging the electricity price to daily won't work.
I tried using the following (from [here](https://stackoverflow.com/questions/39966456/pandas-generating-hourly-data-from-daily-data-from-csv)), but it failed:
df.set_index('DateTime').resample('H').pad()
You can try:
# convert index to datetime (if required):
df.index = pd.to_datetime(df.index)
# the sample data is newest-first; asfreq needs an ascending index:
df = df.sort_index()
df = df.asfreq("H").interpolate()
print(df)
Prints:
Price
Date
2022-10-15 00:00:00 16.990000
2022-10-15 01:00:00 16.985833
2022-10-15 02:00:00 16.981667
2022-10-15 03:00:00 16.977500
2022-10-15 04:00:00 16.973333
2022-10-15 05:00:00 16.969167
2022-10-15 06:00:00 16.965000
2022-10-15 07:00:00 16.960833
...
2022-10-19 21:00:00 16.227500
2022-10-19 22:00:00 16.215000
2022-10-19 23:00:00 16.202500
2022-10-20 00:00:00 16.190000
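If you want each day's price duplicated into every hour (as the question asks) rather than linearly interpolated, forward-fill instead of interpolating:
# carry each daily price forward unchanged until the next observation
df = df.asfreq("H").ffill()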

Filter for timestamp with maximum number of records for each date and extract the filtered rows into another df

I have a dataframe with a timestamp column, another date column, and a price column.
The timestamp column holds snapshots taken roughly every 5 minutes within a specific hour (between 10 am and 11 am) that I am pulling out.
Eg:
Timestamp EndDate Price
2021-01-01 10:00:00 2021-06-30 08:00:00 100
2021-01-01 10:00:00 2021-09-30 08:00:00 105
2021-01-01 10:05:00 2021-03-30 08:00:00 102
2021-01-01 10:05:00 2021-06-30 08:00:00 100
2021-01-01 10:05:00 2021-09-30 08:00:00 105
2021-01-01 10:10:00 2021-03-30 08:00:00 102
2021-01-01 10:10:00 2021-06-30 08:00:00 100
2021-01-02 10:00:00 2021-06-30 08:00:00 100
2021-01-02 10:00:00 2021-09-30 08:00:00 105
2021-01-02 10:00:00 2021-03-30 08:00:00 102
2021-01-02 10:00:00 2021-06-30 08:00:00 100
2021-01-02 10:05:00 2021-09-30 08:00:00 105
2021-01-02 10:05:00 2021-03-30 08:00:00 102
2021-01-02 10:05:00 2021-06-30 08:00:00 100
For each 5-minute snapshot, some timestamps end up with 3 records, some with 2, and some with 4.
Within that hour (or day) I want to pull out the set of records that contains the maximum number of records; so for the 1st of Jan in the above example it should pull out the 10:05 data, and for the 2nd of Jan the 10:00 data. If there are multiple sets with the same maximum number of records, it can pull out the latest time for that day.
I am not sure how I can do this efficiently; perhaps using a count?
You can split the timestamp into separate date and time columns to make it easier to work with, so I did this:
import numpy as np
import pandas as pd

filename = (r'C:xxxxxx\Example2.xlsx')
df0 = pd.read_excel(filename)
# split the timestamp into separate date and time columns
df0['new_date'] = [d.date() for d in df0['Timestamp']]
df0['new_time'] = [d.time() for d in df0['Timestamp']]
Then we can use groupby() and apply() to count the times and keep the most frequent one per date:
df = df0.groupby('new_date')['new_time'].apply(
    lambda x: x.value_counts().index[0]).reset_index()
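Note that value_counts() breaks ties arbitrarily. A sketch that also honors the tie-break rule from the question (on equal counts, take the latest time), reusing the new_date/new_time columns from above:
# count records per (date, time) snapshot
counts = df0.groupby(['new_date', 'new_time']).size().reset_index(name='n')
# sort so the biggest count comes first and, on ties, the latest time
counts = counts.sort_values(['new_date', 'n', 'new_time'],
                            ascending=[True, False, False])
best = counts.drop_duplicates('new_date')[['new_date', 'new_time']]
# keep only the rows belonging to each date's winning snapshot
result = df0.merge(best, on=['new_date', 'new_time'])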

Pandas: Find original index of a value with a grouped dataframe

I have a dataframe with a RangeIndex, timestamps in the first column, and several thousand hourly temperature observations in the second.
It is easy enough to group the observations by 24 and find daily Tmax and Tmin. But I also want the timestamp of each day's max and min values.
How can I do that?
I hope I can get help without posting a working example, because the nature of the data makes that impractical.
EDIT: Here's some data, spanning two days.
DT T-C
0 2015-01-01 00:00:00 -2.5
1 2015-01-01 01:00:00 -2.1
2 2015-01-01 02:00:00 -2.3
3 2015-01-01 03:00:00 -2.3
4 2015-01-01 04:00:00 -2.3
5 2015-01-01 05:00:00 -2.0
...
24 2015-01-02 00:00:00 1.1
25 2015-01-02 01:00:00 1.1
26 2015-01-02 02:00:00 0.8
27 2015-01-02 03:00:00 0.5
28 2015-01-02 04:00:00 1.0
29 2015-01-02 05:00:00 0.7
First create a DatetimeIndex, then aggregate with a daily Grouper, using idxmax and
idxmin to get the timestamps of the maximum and minimum temperatures:
df['DT'] = pd.to_datetime(df['DT'])
df = df.set_index('DT')
# idxmax/idxmin return the index value (timestamp) of each day's extreme
df = df.groupby(pd.Grouper(freq='D'))['T-C'].agg(['idxmax','idxmin','max','min'])
print (df)
idxmax idxmin max min
DT
2015-01-01 2015-01-01 05:00:00 2015-01-01 00:00:00 -2.0 -2.5
2015-01-02 2015-01-02 00:00:00 2015-01-02 03:00:00 1.1 0.5
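If you prefer descriptive column names, the same aggregation can be written with named aggregation (pandas 0.25+); the column names here are just illustrative:
df = df.groupby(pd.Grouper(freq='D'))['T-C'].agg(
    time_of_max='idxmax', time_of_min='idxmin', Tmax='max', Tmin='min')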

Pandas resample/groupby day of week and year

I am trying to create a report that is grouped by day of week for each year.
I have a df that looks like this:
s1 s2 srd
dt
2004-02-04 11:21:00 2365.79 2372.37 -7.0
2004-02-05 10:15:00 2365.79 2368.03 -2.0
2004-02-17 06:43:00 2421.05 2425.26 -4.0
2004-02-17 12:43:00 2418.42 2420.53 -2.0
2004-02-17 12:44:00 2420.39 2420.53 -0.0
The dt index is in datetime format.
What I am looking for is a dataframe that looks like this (I only need srd column and function to group can be anything, like sum, count, etc.):
srd
dayOfWeek year
Mon 2004 10
2005 11
2006 8
2007 120
Tues 2004 105
2005 105
I have tried dayOfWeekDf = df.resample('B'), but I get a dataframe that looks like it is split by week number.
I also tried df.groupby([df.index.weekday, df.index.year])['srd'].transform('sum'), but it does not seem to group, as I get the following (Feb 17th appears three times):
srd
dt
2004-02-04 11:21:00 81.0
2004-02-05 10:15:00 203.0
2004-02-17 06:43:00 37.0
2004-02-17 12:43:00 37.0
2004-02-17 12:44:00 37.0
If you want the dayOfWeek and year names in the index, you can assign them:
>>> df.assign(year=df.index.year, dayOfWeek=df.index.weekday_name).groupby(['dayOfWeek','year']).srd.sum()
dayOfWeek year
Thursday 2004 -2.0
Tuesday 2004 -6.0
Wednesday 2004 -7.0
Name: srd, dtype: float64
Otherwise, you can keep your original approach but omit the transform; transform broadcasts the group sums back onto every original row, which is why Feb 17th appeared three times:
>>> df.groupby([df.index.weekday_name, df.index.year])['srd'].sum()
dt dt
Thursday 2004 -2.0
Tuesday 2004 -6.0
Wednesday 2004 -7.0
Name: srd, dtype: float64
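Note that weekday_name was deprecated and then removed in pandas 1.0; on current versions use day_name() instead:
df.groupby([df.index.day_name(), df.index.year])['srd'].sum()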

Setting start time from previous night without dates from CSV using pandas

I would like to run time-series analysis on repeated-measures data (time only, no dates) taken overnight from 22:00:00 to 09:00:00 the next morning.
How can the time be set so that the time series starts at 22:00:00? At the moment, even when plotting, it starts at 00:00:00 and ends at 23:00:00, with a flat line between 09:00:00 and 23:00:00.
df = pd.read_csv('1310.csv', parse_dates=True)
df['Time'] = pd.to_datetime(df['Time'])
df['Time'].apply( lambda d : d.time() )
df = df.set_index('Time')
df['2017-05-16 22:00:00'] + pd.Timedelta('-1 day')
Note: the date in the last line of code is added automatically (it can be seen when df['Time'] is executed), so I used the same format, including the date, for 22:00:00 in the last line.
This is the error:
TypeError: Could not operate Timedelta('-1 days +00:00:00') with block values unsupported operand type(s) for +: 'numpy.ndarray' and 'Timedelta'
You should treat your timestamps as pd.Timedelta values and add a day to the samples that fall before your start time.
Create some example data:
import pandas as pd
import numpy as np

# twelve hourly samples starting at 22:00
d = pd.date_range(start='22:00:00', periods=12, freq='h')
s = pd.Series(d).dt.time
df = pd.DataFrame(np.random.randn(len(s)), index=s, columns=['value'])
df.to_csv('data.csv')
df
value
22:00:00 -0.214977
23:00:00 -0.006585
00:00:00 0.568259
01:00:00 0.603196
02:00:00 0.358124
03:00:00 0.027835
04:00:00 -0.436322
05:00:00 0.627624
06:00:00 0.168189
07:00:00 -0.321916
08:00:00 0.737383
09:00:00 1.100500
Read in, make index a timedelta, add a day to timedeltas before the start time, then assign back to the index.
df2 = pd.read_csv('data.csv', index_col=0)
df2.index = pd.to_timedelta(df2.index)
s = pd.Series(df2.index)
# roll times before the 22:00 start over to the next day
s[s < pd.Timedelta('22:00:00')] += pd.Timedelta('1d')
# anchor the timedeltas at the epoch to turn them into datetimes
df2.index = pd.Timestamp(0) + s
df2
value
1970-01-01 22:00:00 -0.214977
1970-01-01 23:00:00 -0.006585
1970-01-02 00:00:00 0.568259
1970-01-02 01:00:00 0.603196
1970-01-02 02:00:00 0.358124
1970-01-02 03:00:00 0.027835
1970-01-02 04:00:00 -0.436322
1970-01-02 05:00:00 0.627624
1970-01-02 06:00:00 0.168189
1970-01-02 07:00:00 -0.321916
1970-01-02 08:00:00 0.737383
1970-01-02 09:00:00 1.100500
If you want to set the date of the first day:
df2.index += (pd.Timestamp('2015-06-06') - pd.Timestamp(0))
df2
value
2015-06-06 22:00:00 -0.214977
2015-06-06 23:00:00 -0.006585
2015-06-07 00:00:00 0.568259
2015-06-07 01:00:00 0.603196
2015-06-07 02:00:00 0.358124
2015-06-07 03:00:00 0.027835
2015-06-07 04:00:00 -0.436322
2015-06-07 05:00:00 0.627624
2015-06-07 06:00:00 0.168189
2015-06-07 07:00:00 -0.321916
2015-06-07 08:00:00 0.737383
2015-06-07 09:00:00 1.100500
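The same idea can be applied directly to the file from the question; a minimal sketch, assuming 1310.csv has the Time column and the 2017-05-16 start date mentioned there:
import pandas as pd

df = pd.read_csv('1310.csv')
# parse the clock times as timedeltas
td = pd.to_timedelta(df['Time'])
# times before the 22:00 start belong to the next morning
td[td < pd.Timedelta('22:00:00')] += pd.Timedelta('1d')
# anchor everything on the date of the first evening
df.index = pd.Timestamp('2017-05-16') + td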
