Resample daily to annual from the most recent day? - python-3.x

I would like to convert the daily series to annual frequency, but anchored on the latest observation. For example, the latest observation is 2020-06-06, so I would like the annual series to run from there (... 2018-06-06, 2019-06-06, 2020-06-06). When I use resample, it automatically sets the annual series to the last calendar day of each year. Is there an easier way to do this, or do I need to do further indexing to get these dates out?
import pandas as pd
import numpy as np
from datetime import date
today = date.today()
dates = pd.date_range('2010-01-01', today, freq='D')
np.random.seed(100)
data = np.random.randn(len(dates))
ts = pd.Series(data=data, index=dates, name='Series')
ts_year = ts.resample('A').ffill()
2010-12-31 0.790428
2011-12-31 1.518362
2012-12-31 0.150378
2013-12-31 0.570817
2014-12-31 1.481655
2015-12-31 -1.582277
2016-12-31 0.443544
2017-12-31 -1.296233
2018-12-31 0.479207
2019-12-31 -1.484178
2020-12-31 0.044787
Freq: A-DEC, Name: Series, dtype: float64

resample takes a loffset parameter; subtracting the number of days left until the end of the current year should do the trick. Something like:
ts.resample('A', loffset=today - date(today.year, 12, 31)).ffill()
2010-06-06 0.790428
2011-06-06 1.518362
2012-06-06 0.150378
2013-06-06 0.570817
2014-06-06 1.481655
2015-06-06 -1.582277
2016-06-06 0.443544
2017-06-06 -1.296233
2018-06-06 0.479207
2019-06-06 -1.484178
2020-06-06 0.044787
Name: Series, dtype: float64
I'm not sure how it behaves with leap days, though, and it's not clear from your question how it should (i.e. what should happen if today is Feb 29?).
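For what it's worth, loffset was deprecated in pandas 1.1 and removed in 2.0, so on newer versions one workable alternative (a minimal sketch, not the only way) is to build the anniversary dates explicitly and forward-fill the daily series onto them:
import pandas as pd
import numpy as np
from datetime import date

today = date.today()
dates = pd.date_range('2010-01-01', today, freq='D')
np.random.seed(100)
ts = pd.Series(np.random.randn(len(dates)), index=dates, name='Series')

# one timestamp per year, each on today's month/day, ending today
# note: stepping by DateOffset(years=1) lands on Feb 28 in non-leap years if today is Feb 29
anniversaries = pd.date_range(end=today, periods=today.year - 2010 + 1,
                              freq=pd.DateOffset(years=1))
ts_year = ts.reindex(anniversaries, method='ffill')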

Related

How to read in an unusual date/time format

I have a small df with a date/time column using a format I have never seen.
Pandas reads it in as an object even if I use parse_dates, and to_datetime() chokes on it.
The dates in the column are formatted as such:
2019/12/29 GMT+8 18:00
2019/12/15 GMT+8 05:00
I think the best approach is using a date parsing pattern. Something like this:
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)
But I simply do not know how to approach this format.
The datetime %z directive for a UTC offset is very specific (see the strftime() and strptime() Format Codes documentation): the offset must be written as + or - followed by zero-padded hours and minutes, e.g. +08:00, -08:00, +10:00, or -10:00. Use str.zfill to pad the missing zero between the sign and the hour.
import pandas as pd
# sample data
df = pd.DataFrame({'datetime': ['2019/12/29 GMT+8 18:00', '2019/12/15 GMT+8 05:00', '2019/12/15 GMT+10 05:00', '2019/12/15 GMT-10 05:00']})
# display(df)
datetime
2019/12/29 GMT+8 18:00
2019/12/15 GMT+8 05:00
2019/12/15 GMT+10 05:00
2019/12/15 GMT-10 05:00
# fix the format
df.datetime = df.datetime.str.split(' ').apply(lambda x: x[0] + x[2] + x[1][3:].zfill(3) + ':00')
# convert to a utc datetime
df.datetime = pd.to_datetime(df.datetime, format='%Y/%m/%d%H:%M%z', utc=True)
# display(df)
datetime
2019-12-29 10:00:00+00:00
2019-12-14 21:00:00+00:00
2019-12-14 19:00:00+00:00
2019-12-15 15:00:00+00:00
print(df.info())
[out]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   datetime  4 non-null      datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1)
memory usage: 160.0 bytes
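Optionally (the zone name below is just an example, not from the original question), the tz-aware column can then be viewed in a local timezone with dt.tz_convert:
# 'Asia/Singapore' is an example IANA zone name; any valid zone works
df['datetime_local'] = df['datetime'].dt.tz_convert('Asia/Singapore')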
Alternatively, you could pass a custom format with the literal GMT+8 in the middle and then subtract eight hours with timedelta(hours=8) (note that this only works if every row uses the same +8 offset):
import pandas as pd
from datetime import datetime, timedelta
df['Date'] = pd.to_datetime(df['Date'], format='%Y/%m/%d GMT+8 %H:%M') - timedelta(hours=8)
df
Date
0 2019-12-29 10:00:00
1 2019-12-14 21:00:00
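As a hedged alternative sketch (not from either answer above), you could normalize any "GMT+H"/"GMT-HH" token into a %z-compatible offset with a regex and then parse everything in one pass, which avoids assuming a fixed +8 offset:
import pandas as pd

df = pd.DataFrame({'datetime': ['2019/12/29 GMT+8 18:00', '2019/12/15 GMT-10 05:00']})

# move the time in front of the offset and zero-pad the hours, e.g.
# '2019/12/29 GMT+8 18:00' -> '2019/12/29 18:00+0800'
normalized = df['datetime'].str.replace(
    r' GMT([+-])(\d{1,2}) (\d{2}:\d{2})',
    lambda m: f" {m.group(3)}{m.group(1)}{int(m.group(2)):02d}00",
    regex=True,
)
df['datetime'] = pd.to_datetime(normalized, format='%Y/%m/%d %H:%M%z', utc=True)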

Resampling data ending on a specific date

I have the following dataframe:
data
Out[8]:
Population
date
1980-11-03 1591.4
1980-11-10 1592.9
1980-11-17 1596.3
1980-11-24 1597.2
1980-12-01 1596.1
...
2020-06-01 18152.1
2020-06-08 18248.6
2020-06-15 18328.9
2020-06-22 18429.1
2020-06-29 18424.4
[2070 rows x 1 columns]
If I resample it over a year, I get this:
data.resample('Y').mean()
date
1980-12-31 1596.144444
1981-12-31 1678.686538
1982-12-31 1829.826923
Is it possible to resample so that the labels fall on a specific date such as Jan 1st? Then the dates would be 1980-01-01 in the above example instead of 1980-12-31.
What we can do is change the frequency to 'YS' (year start), so each bin is labelled with the first day of the year:
data.resample('YS').mean()
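For completeness, a minimal sketch with synthetic weekly data shaped like the question's frame (the Population values here are made up):
import pandas as pd
import numpy as np

# synthetic weekly data mimicking the question's index
idx = pd.date_range('1980-11-03', '2020-06-29', freq='W-MON')
data = pd.DataFrame({'Population': np.linspace(1591.4, 18424.4, len(idx))}, index=idx)

annual = data.resample('YS').mean()
print(annual.head())   # bins are now labelled 1980-01-01, 1981-01-01, ...
Anchored variants such as 'AS-JUL' (year start anchored to July) also exist if the periods should start in a different month.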

Data resolution change in Pandas

I have a dataframe whose data has a resolution of 10 minutes as seen below:
DateTime TSM
0 2011-03-18 14:20:00 26.8
1 2011-03-18 14:30:00 26.5
2 2011-03-18 14:40:00 26.3
... ... ...
445088 2019-09-03 11:40:00 27.6
445089 2019-09-03 11:50:00 27.6
445090 2019-09-03 12:00:00 27.6
Now, I would like to reduce its resolution to 1 day. Does Pandas have any function that can help me with this?
Your dataframe should have a datetime index in order to use the resample method. You also need to apply an aggregate function, for example mean():
# Make sure the DateTime column is of datetime dtype
df['DateTime'] = pd.to_datetime(df['DateTime'])
# Set the DateTime column as the index
df.set_index('DateTime', inplace=True)
# '1D' stands for a 1-day offset
df.resample('1D').mean()
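As a small aside (a sketch assuming the DateTime column has already been converted to datetime), resample can also operate directly on a column via its on= parameter, so setting the index is optional:
# assumes df still has DateTime as a regular datetime64 column
daily = df.resample('1D', on='DateTime').mean()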

pandas timeseries select single date

Selecting a single date from a time series gives a KeyError.
Setup:
import pandas as pd
import numpy as np
ts = pd.DataFrame({'date': pd.date_range(start='1/1/2017', periods=5),
                   'observations': np.random.choice(range(0, 100), 5, replace=True)}).set_index('date')
Dataframe:
observations
date
2017-01-01 58
2017-01-02 88
2017-01-03 53
2017-01-04 4
2017-01-05 26
How do I select the number of observations for a single date?
ts['2017-01-01']
Returns: KeyError: '2017-01-01'
But...
ts['2017-01-01':'2017-01-01']
...seems to work just fine.
Any suggestions how to select/subset with a single date?
As @scnerd pointed out, when you do ts['2017-01-01'] pandas tries to find '2017-01-01' as a column name of the dataframe ts, which gives you a KeyError since none of the columns in ts has this name.
To look up by the index (in your example 'date' is set as the index), you need to use the loc method, e.g. ts.loc['2017-01-01'], and you will get:
observations 54
Name: 2017-01-01 00:00:00, dtype: int32
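A brief usage sketch (a seed is added here purely so the example is reproducible; the original setup drew unseeded random values, which is why the answer's number differs from the question's output):
import pandas as pd
import numpy as np

np.random.seed(0)  # added for reproducibility; not in the original setup
ts = pd.DataFrame({'date': pd.date_range(start='1/1/2017', periods=5),
                   'observations': np.random.choice(range(0, 100), 5, replace=True)}).set_index('date')

row = ts.loc['2017-01-01']                                    # the whole row as a Series
value = ts.loc[pd.Timestamp('2017-01-01'), 'observations']    # a single scalar
print(row, value)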

Parse dates and create time series from .csv

I am using a simple csv file which contains data on calorie intake. It has 4 columns: cal, day, month, year. It looks like this:
cal month year day
3668.4333 1 2002 10
3652.2498 1 2002 11
3647.8662 1 2002 12
3646.6843 1 2002 13
...
3661.9414 2 2003 14
# data types
cal float64
month int64
year int64
day int64
I am trying to do some simple time series analysis. I would therefore like to parse month, year, and day into a single column. I tried the following with pandas:
import pandas as pd
from pandas import Series, DataFrame, Panel
data = pd.read_csv('time_series_calories.csv', header=0, parse_dates=['day', 'month', 'year'], date_parser=True, infer_datetime_format=True)
My questions are: (1) how do I parse the dates and (2) how do I define the data type of the new column? I know there are quite a few other similar questions and answers (see e.g. here, here and here), but I can't make it work so far.
You can use the parse_dates parameter in read_csv and define the column names in a nested list:
import pandas as pd
import numpy as np
import io
temp=u"""cal,month,year,day
3668.4333,1,2002,10
3652.2498,1,2002,11
3647.8662,1,2002,12
3646.6843,1,2002,13
3661.9414,2,2003,14"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), parse_dates=[['year','month','day']])
print (df)
year_month_day cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
print (df.dtypes)
year_month_day datetime64[ns]
cal float64
dtype: object
Then you can rename the column:
df.rename(columns={'year_month_day':'date'}, inplace=True)
print (df)
date cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
Or, better, pass a dictionary with the new column name to parse_dates:
df = pd.read_csv(io.StringIO(temp), parse_dates={'dates': ['year','month','day']})
print (df)
dates cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
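If the end goal is time series analysis, a small follow-up sketch (continuing from the dictionary version above, where the combined column is named 'dates') sets it as the index and resamples, e.g. to a monthly mean:
# continue from the df created with parse_dates={'dates': ['year','month','day']}
ts = df.set_index('dates').sort_index()
monthly = ts['cal'].resample('M').mean()   # monthly average calorie intake
print(monthly)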
