Data resolution change in Pandas - python-3.x

I have a dataframe whose data has a resolution of 10 minutes as seen below:
                   DateTime   TSM
0       2011-03-18 14:20:00  26.8
1       2011-03-18 14:30:00  26.5
2       2011-03-18 14:40:00  26.3
...                     ...   ...
445088  2019-09-03 11:40:00  27.6
445089  2019-09-03 11:50:00  27.6
445090  2019-09-03 12:00:00  27.6
Now, I would like to reduce its resolution to 1 day. Does Pandas have any function that can help me with this?

Your dataframe needs a datetime index in order to use the resample method, and you then apply an aggregation function, for example mean():
# Make sure the DateTime column is actually datetime-typed
df['DateTime'] = pd.to_datetime(df['DateTime'])
# Set DateTime column as index
df.set_index('DateTime', inplace=True)
# 1D stands for 1 day offset
df.resample('1D').mean()

Related

How to read in an unusual date/time format

I have a small df with a date/time column using a format I have never seen.
Pandas reads it in as an object even if I use parse_dates, and to_datetime() chokes on it.
The dates in the column are formatted as such:
2019/12/29 GMT+8 18:00
2019/12/15 GMT+8 05:00
I think the best approach is using a date-parsing function. Something like this:
from datetime import datetime
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)
But I simply do not know how to approach this format.
The datetime format directive for a UTC offset (%z) is very specific. See the strftime() and strptime() Format Codes: the offset must be a + or - sign followed by a zero-padded hour and minutes, e.g. +08:00, -08:00, +10:00, or -10:00.
Use str.zfill to backfill the 0 between the sign and the digit, turning +8 into +08.
import pandas as pd
# sample data
df = pd.DataFrame({'datetime': ['2019/12/29 GMT+8 18:00', '2019/12/15 GMT+8 05:00', '2019/12/15 GMT+10 05:00', '2019/12/15 GMT-10 05:00']})
# display(df)
                  datetime
0   2019/12/29 GMT+8 18:00
1   2019/12/15 GMT+8 05:00
2  2019/12/15 GMT+10 05:00
3  2019/12/15 GMT-10 05:00
# fix the format
df.datetime = df.datetime.str.split(' ').apply(lambda x: x[0] + x[2] + x[1][3:].zfill(3) + ':00')
# convert to a utc datetime
df.datetime = pd.to_datetime(df.datetime, format='%Y/%m/%d%H:%M%z', utc=True)
# display(df)
                   datetime
0 2019-12-29 10:00:00+00:00
1 2019-12-14 21:00:00+00:00
2 2019-12-14 19:00:00+00:00
3 2019-12-15 15:00:00+00:00
print(df.info())
[out]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   datetime  4 non-null      datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1)
memory usage: 160.0 bytes
Alternatively, if every row carries the same GMT+8 offset, you could pass a custom format with the literal GMT+8 in the middle and then subtract eight hours with timedelta(hours=8):
import pandas as pd
from datetime import datetime, timedelta
df['Date'] = pd.to_datetime(df['Date'], format='%Y/%m/%d GMT+8 %H:%M') - timedelta(hours=8)
df
Date
0 2019-12-29 10:00:00
1 2019-12-14 21:00:00

Resampling data ending on a specific date

I have the following dataframe:
data
Out[8]:
Population
date
1980-11-03 1591.4
1980-11-10 1592.9
1980-11-17 1596.3
1980-11-24 1597.2
1980-12-01 1596.1
...
2020-06-01 18152.1
2020-06-08 18248.6
2020-06-15 18328.9
2020-06-22 18429.1
2020-06-29 18424.4
[2070 rows x 1 columns]
If I resample it over a year, I get this:
data.resample('Y').mean()
date
1980-12-31 1596.144444
1981-12-31 1678.686538
1982-12-31 1829.826923
Is it possible to resample so that the bins are anchored to a specific date such as Jan 1st? The dates would then be 1980-01-01 in the above example instead of 1980-12-31.
What we can do is change the frequency alias to 'YS' (year start), which labels each bin with the first day of the year:
df.resample('YS').mean()
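A small runnable sketch of the difference, using a made-up weekly series shaped like the question's (the Population values here are placeholders):

```python
import pandas as pd
import numpy as np

# Hypothetical weekly observations on Mondays, as in the question
idx = pd.date_range('1980-11-03', periods=10, freq='W-MON')
data = pd.DataFrame({'Population': np.arange(10, dtype=float)}, index=idx)

# 'YS' anchors each annual bin at year start, so labels fall on Jan 1
# instead of the Dec 31 labels produced by the year-end alias
ys = data.resample('YS').mean()
print(ys)
```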

How do I select all dates that are most recent in comparison with a different date field?

I'm trying to gather all dates between 06-01-2020 and 06-30-2020 based on the forecast date which can be 06-08-2020, 06-20-2020, and 06-24-2020. The problem I am running into is that I'm only grabbing all of the dates associated with the forecast date 06-24-2020. I need all dates that are most recent so if say 06-03-2020 occurs with the forecast date 06-08-2020 and not with 06-20-2020, I still need all of the dates associated with that forecast date. Here's the code I am currently using.
df = df[df['Forecast Date'].isin([max(df['Forecast Date'])])]
It's producing this:
Date \
5668 2020-06-25
5669 2020-06-26
5670 2020-06-27
5671 2020-06-28
5672 2020-06-29
5673 2020-06-30
Media Granularity Forecast Date
5668 NaN 2020-06-24
5669 NaN 2020-06-24
5670 NaN 2020-06-24
5671 NaN 2020-06-24
5672 NaN 2020-06-24
5673 NaN 2020-06-24
With a length of 6 (len(df[df['Forecast Date'].isin([max(df['Forecast Date'])])])). It needs to be a length of 30, one for each unique date. It is only grabbing the rows where Forecast Date equals the maximum, 06-24-2020.
I'm thinking it's something along the lines of df.sort_values(df[['Date', 'Forecast Date']]).drop_duplicates(df['Date'], keep='last') but it's giving me a key error.
It was easy, but not what I expected: pass column names to by= and subset=, not the columns themselves.
df = df.sort_values(by=['Date', 'Forecast Date']).drop_duplicates(subset=['Date'], keep='last')
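A self-contained sketch of why this keeps the most recent forecast per date, using a tiny invented frame (the dates below are illustrative, not the question's 30-row data):

```python
import pandas as pd

# Hypothetical forecasts: 2020-06-03 appears under two forecast dates,
# and the later forecast should win
df = pd.DataFrame({
    'Date': ['2020-06-03', '2020-06-03', '2020-06-25', '2020-06-26'],
    'Forecast Date': ['2020-06-08', '2020-06-20', '2020-06-24', '2020-06-24'],
})
df[['Date', 'Forecast Date']] = df[['Date', 'Forecast Date']].apply(pd.to_datetime)

# Sort so the newest Forecast Date comes last within each Date,
# then keep only that last row per Date
latest = (df.sort_values(by=['Date', 'Forecast Date'])
            .drop_duplicates(subset=['Date'], keep='last'))
print(latest)
# One row per unique Date; 2020-06-03 keeps the 2020-06-20 forecast.
```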

Resample daily to annual from the most recent day?

I would like to convert the daily series to annual but to be based on the latest observation. For example, the latest observation is 2020-06-06 so I would like to convert to annual frequency from there (...2018-06-06, 2019-06-06, 2020-06-06). When I use the resample it automatically sets the annual series to the last calendar day of each year. Is there an easier way to do this or do I need to do further indexing to get these dates out?
import pandas as pd
import numpy as np
from datetime import date
today = date.today()
dates = pd.date_range('2010-01-01', today, freq='D')
np.random.seed(100)
data = np.random.randn(len(dates))
ts = pd.Series(data=data, index=dates, name='Series')
ts_year = ts.resample('A').ffill()
2010-12-31 0.790428
2011-12-31 1.518362
2012-12-31 0.150378
2013-12-31 0.570817
2014-12-31 1.481655
2015-12-31 -1.582277
2016-12-31 0.443544
2017-12-31 -1.296233
2018-12-31 0.479207
2019-12-31 -1.484178
2020-12-31 0.044787
Freq: A-DEC, Name: Series, dtype: float64
resample takes a loffset offset parameter (deprecated in pandas 1.1 and removed in 2.0); subtracting the days left until the end of the current year should do the trick. Something like:
ts.resample('A', loffset=today - date(today.year, 12, 31)).ffill()
2010-06-06 0.790428
2011-06-06 1.518362
2012-06-06 0.150378
2013-06-06 0.570817
2014-06-06 1.481655
2015-06-06 -1.582277
2016-06-06 0.443544
2017-06-06 -1.296233
2018-06-06 0.479207
2019-06-06 -1.484178
2020-06-06 0.044787
Name: Series, dtype: float64
I'm not sure how it behaves with leap days, though, but it's also not clear from your question how it should (i.e. what happens if today is Feb 29?).
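Since loffset is gone in pandas 2.0+, one loffset-free alternative is to build the anniversary dates explicitly with pd.DateOffset and pull the value as of each one with Series.asof. This is a sketch under the question's setup, with the end date fixed at 2020-06-06 for reproducibility:

```python
import pandas as pd
import numpy as np

# Daily series as in the question, ending on the latest observation
dates = pd.date_range('2010-01-01', '2020-06-06', freq='D')
np.random.seed(100)
ts = pd.Series(np.random.randn(len(dates)), index=dates, name='Series')

# Anniversaries of the final date, one per year, oldest first
end = ts.index[-1]
targets = pd.DatetimeIndex([end - pd.DateOffset(years=n)
                            for n in range(10, -1, -1)])

# asof takes the last observation at or before each target date,
# so missing anniversaries (e.g. around leap days) fall back gracefully
annual = ts.asof(targets)
print(annual)
```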

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe with one value per hour of the day (the hour is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply the values from this lookup dataframe to create a column of another dataframe, which has datetime as index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from the data-2 column, where the weight given to data-2 depends on the corresponding hourly value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
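Here's a runnable sketch of the vectorized .loc approach on invented data (a linearly spaced lookup Series and four rows shaped like the question's frame), showing that aligning the lookup by df.index.hour replaces the Python-level loop:

```python
import pandas as pd
import numpy as np

# Hypothetical hourly multipliers, indexed 0-23
df_lookup = pd.Series(np.linspace(1.1, 0.9, 24), index=range(24))

# Hypothetical datetime-indexed frame with a data-2 column
idx = pd.to_datetime(['2015-08-09 00:00:00', '2015-08-09 00:00:00',
                      '2015-08-09 01:00:00', '2015-08-09 02:00:00'])
df = pd.DataFrame({'data-2': [2502, 2102, 1988, 2003]}, index=idx)

# Vectorized lookup: select one multiplier per row by hour,
# drop the label alignment with .values, and multiply in one shot
df['data-3'] = df['data-2'] * df_lookup.loc[df.index.hour].values
print(df)
```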
