Resampling data ending on a specific date - python-3.x

I have the following dataframe:
data
Out[8]:
Population
date
1980-11-03 1591.4
1980-11-10 1592.9
1980-11-17 1596.3
1980-11-24 1597.2
1980-12-01 1596.1
...
2020-06-01 18152.1
2020-06-08 18248.6
2020-06-15 18328.9
2020-06-22 18429.1
2020-06-29 18424.4
[2070 rows x 1 columns]
If I resample it by year, I get this:
data.resample('Y').mean()
date
1980-12-31 1596.144444
1981-12-31 1678.686538
1982-12-31 1829.826923
Is it possible to resample it so that each period ends on a specific date, such as Jan 1st? The first date would then be 1980-01-01 in the above example instead of 1980-12-31.

You can change the frequency to 'YS' (year start), which labels each yearly bin with January 1st:
data.resample('YS').mean()
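A minimal sketch of the year-start alias in action. The weekly frame below is made up for illustration (it only borrows the column name and start date from the question), so the numbers are not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly data spanning the end of 1980 and the start of 1981.
idx = pd.date_range("1980-11-03", periods=10, freq="7D")
data = pd.DataFrame({"Population": np.arange(10, dtype=float)}, index=idx)

# 'YS' bins by calendar year and labels each bin with the first day of that year.
yearly = data.resample("YS").mean()
print(yearly)
```

The bins are still calendar years; only the label moves from December 31st to January 1st of the same year.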

Related

Get the last date before an nth date for each month in Python

I am using a csv with an accumulative number that changes daily.
Day Accumulative Number
0 9/1/2020 100
1 11/1/2020 102
2 18/1/2020 98
3 11/2/2020 105
4 24/2/2020 95
5 6/3/2020 120
6 13/3/2020 100
I am now trying to find the best way to aggregate it and compare the monthly results as of a specific date. I want to check the balance on the 11th of each month, but for some months there is no activity on that day. As a result, I am trying to get the latest day before the 12th of each month. So, the above would be:
Day Accumulative Number
0 11/1/2020 102
1 11/2/2020 105
2 6/3/2020 120
What I managed to do so far is to just get the latest day of each month:
dateparse = lambda x: pd.datetime.strptime(x, "%d/%m/%Y")
df = pd.read_csv("Accumulative.csv",quotechar="'", usecols=["Day","Accumulative Number"], index_col=False, parse_dates=["Day"], date_parser=dateparse, na_values=['.', '??'] )
df.index = df['Day']
grouped = df.groupby(pd.Grouper(freq='M')).sum()
print (df.groupby(df.index.month).apply(lambda x: x.iloc[-1]))
which returns:
Day Accumulative Number
1 2020-01-18 98
2 2020-02-24 95
3 2020-03-13 100
Is there a way to achieve this in Pandas, Python or do I have to use SQL logic in my script? Is there an easier way I am missing out in order to get the "balance" as per the 11th day of each month?
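You can do a groupby with factorize. Note that factorize numbers each month's distinct dates from 0, so the mask below keeps the row at position n within its month rather than the last row before the nth day of the month: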
n = 12
df = df.sort_values('Day')
m = df.groupby(df.Day.dt.strftime('%Y-%m')).Day.transform(lambda x :x.factorize()[0])==n
df_sub = df[m].copy()
You can try filtering the dataframe to the rows where the day of the month is less than 12, then taking the last row of each group (grouped by year and month):
df['Day'] = pd.to_datetime(df['Day'],dayfirst=True)
(df[df['Day'].dt.day.lt(12)]
.groupby([df['Day'].dt.year,df['Day'].dt.month],sort=False).last()
.reset_index(drop=True))
Day Accumulative_Number
0 2020-01-11 102
1 2020-02-11 105
2 2020-03-06 120
I would try:
# convert to datetime type:
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
# select day before the 12th
new_df = df[df['Day'].dt.day < 12]
# select the last day in each month
new_df.loc[~new_df['Day'].dt.to_period('M').duplicated(keep='last')]
Output:
Day Accumulative Number
1 2020-01-11 102
3 2020-02-11 105
5 2020-03-06 120
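The filter-then-keep-last idea from the two answers above can be sketched end to end on the sample rows from the question. The frame is built inline instead of being read from the CSV, and an explicit format string is used in place of dayfirst=True; both are choices made for this illustration only.

```python
import pandas as pd

# Sample rows from the question, typed in directly.
df = pd.DataFrame({
    "Day": ["9/1/2020", "11/1/2020", "18/1/2020", "11/2/2020",
            "24/2/2020", "6/3/2020", "13/3/2020"],
    "Accumulative Number": [100, 102, 98, 105, 95, 120, 100],
})
df["Day"] = pd.to_datetime(df["Day"], format="%d/%m/%Y")

# Keep only rows dated before the 12th of their month.
before_12th = df[df["Day"].dt.day < 12]

# Within each year-month, keep the last remaining row (rows are already in date order).
month = before_12th["Day"].dt.to_period("M")
result = before_12th.loc[~month.duplicated(keep="last")]
print(result)
```

This relies on the rows being sorted by date; if they are not, sorting by Day first would be needed.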
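Here's another way: expand the index to every calendar day, forward-fill the gaps, and keep the 11th of each month (so each row is dated the 11th and carries the last known value):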
# set as datetime
df2['Day'] = pd.to_datetime(df2['Day'], dayfirst=True)
# set as index
df2 = df2.set_index('Day')
# make a list of all dates
dates = pd.date_range(start=df2.index.min(), end=df2.index.max(), freq='1D')
# add dates
df2 = df2.reindex(dates)
# replace NA with forward fill
df2['Number'] = df2['Number'].ffill()
# filter to get output
df2 = df2[df2.index.day == 11].reset_index().rename(columns={'index': 'Date'})
print(df2)
Date Number
0 2020-01-11 102.0
1 2020-02-11 105.0
2 2020-03-11 120.0

How do I select all dates that are most recent in comparison with a different date field?

I'm trying to gather all dates between 06-01-2020 and 06-30-2020 based on the forecast date which can be 06-08-2020, 06-20-2020, and 06-24-2020. The problem I am running into is that I'm only grabbing all of the dates associated with the forecast date 06-24-2020. I need all dates that are most recent so if say 06-03-2020 occurs with the forecast date 06-08-2020 and not with 06-20-2020, I still need all of the dates associated with that forecast date. Here's the code I am currently using.
df = df[df['Forecast Date'].isin([max(df['Forecast Date'])])]
It's producing this-
Date \
5668 2020-06-25
5669 2020-06-26
5670 2020-06-27
5671 2020-06-28
5672 2020-06-29
5673 2020-06-30
Media Granularity Forecast Date
5668 NaN 2020-06-24
5669 NaN 2020-06-24
5670 NaN 2020-06-24
5671 NaN 2020-06-24
5672 NaN 2020-06-24
5673 NaN 2020-06-24
With a length of 6 (len(df[df['Forecast Date'].isin([max(df['Forecast Date'])])])). It needs to be a length of 30, one for each unique date. It is only grabbing the rows where the forecast date equals the maximum, 06-24-2020.
I'm thinking it's something along the lines of df.sort_values(df[['Date', 'Forecast Date']]).drop_duplicates(df['Date'], keep='last') but it's giving me a key error.
It was easy but not what I expected.
df = df.sort_values(by=['Date', 'Forecast Date']).drop_duplicates(subset=['Date'], keep='last')
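A small sketch of why the sort plus drop_duplicates combination works. The four rows below are invented for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical rows: 06-03 appears only in the 06-08 forecast,
# while 06-25 appears in both the 06-20 and 06-24 forecasts.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-06-25", "2020-06-03", "2020-06-25", "2020-06-26"]),
    "Forecast Date": pd.to_datetime(["2020-06-24", "2020-06-08", "2020-06-20", "2020-06-24"]),
})

# Sorting puts the newest forecast last within each Date;
# keep='last' then retains exactly that row per Date.
latest = (df.sort_values(by=["Date", "Forecast Date"])
            .drop_duplicates(subset=["Date"], keep="last"))
print(latest)
```

Each Date survives once, paired with its most recent forecast, even when that forecast is older than the overall maximum.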

Resample daily to annual from the most recent day?

I would like to convert the daily series to annual but to be based on the latest observation. For example, the latest observation is 2020-06-06 so I would like to convert to annual frequency from there (...2018-06-06, 2019-06-06, 2020-06-06). When I use the resample it automatically sets the annual series to the last calendar day of each year. Is there an easier way to do this or do I need to do further indexing to get these dates out?
import pandas as pd
import numpy as np
from datetime import date
today = date.today()
dates = pd.date_range('2010-01-01', today, freq='D')
np.random.seed(100)
data = np.random.randn(len(dates))
ts = pd.Series(data=data, index=dates, name='Series')
ts_year = ts.resample('A').ffill()
2010-12-31 0.790428
2011-12-31 1.518362
2012-12-31 0.150378
2013-12-31 0.570817
2014-12-31 1.481655
2015-12-31 -1.582277
2016-12-31 0.443544
2017-12-31 -1.296233
2018-12-31 0.479207
2019-12-31 -1.484178
2020-12-31 0.044787
Freq: A-DEC, Name: Series, dtype: float64
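resample takes a loffset parameter that shifts the result labels; passing the (negative) number of days left until the end of the current year should do the trick. Note that loffset was deprecated in pandas 1.1 and removed in 2.0, so this only works on older versions. Something like: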
ts.resample('A', loffset=today - date(today.year, 12, 31)).ffill()
2010-06-06 0.790428
2011-06-06 1.518362
2012-06-06 0.150378
2013-06-06 0.570817
2014-06-06 1.481655
2015-06-06 -1.582277
2016-06-06 0.443544
2017-06-06 -1.296233
2018-06-06 0.479207
2019-06-06 -1.484178
2020-06-06 0.044787
Name: Series, dtype: float64
I'm not sure how it behaves with leap days, though, and it's not clear from your question how it should (i.e. what happens if today is Feb 29?).
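On pandas versions where resample no longer accepts loffset, one possible alternative is to build the anniversary dates explicitly and select them from the daily series. This is a sketch, not the answerer's method: it swaps resampling for a plain reindex, fixes the end date at 2020-06-06 so the result is reproducible, and uses a simple counter as data.

```python
import numpy as np
import pandas as pd

# Daily series ending on a fixed "latest" day (assumed here for reproducibility).
dates = pd.date_range("2010-01-01", "2020-06-06", freq="D")
ts = pd.Series(np.arange(len(dates), dtype=float), index=dates, name="Series")

latest = ts.index[-1]
n_years = latest.year - ts.index[0].year

# Step back one calendar year at a time from the latest observation.
anchors = [latest] + [latest - pd.DateOffset(years=k) for k in range(1, n_years + 1)]
# Drop anniversaries that fall before the series starts, then sort ascending.
anchors = pd.DatetimeIndex(sorted(a for a in anchors if a >= ts.index[0]))

# Pick the observation on each anniversary date.
ts_year = ts.reindex(anchors)
print(ts_year)
```

Subtracting a DateOffset in years from February 29th lands on February 28th in non-leap years, which is one way to settle the leap-day question raised above. If the series has gaps, ts.asof(anchors) could be used instead of reindex to take the last available value on or before each date.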

Data resolution change in Pandas

I have a dataframe whose data has a resolution of 10 minutes as seen below:
DateTime TSM
0 2011-03-18 14:20:00 26.8
1 2011-03-18 14:30:00 26.5
2 2011-03-18 14:40:00 26.3
... ... ...
445088 2019-09-03 11:40:00 27.6
445089 2019-09-03 11:50:00 27.6
445090 2019-09-03 12:00:00 27.6
Now, I would like to reduce its resolution to 1 day. Does Pandas have any function that can help me with this?
Your dataframe should have a datetime index in order to use the resample method. You also need to apply an aggregate function, for example mean().
# Make sure the DateTime column has a datetime dtype
df['DateTime'] = pd.to_datetime(df['DateTime'])
# Set DateTime column as index
df.set_index('DateTime', inplace=True)
# 1D stands for 1 day offset
df.resample('1D').mean()
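A self-contained sketch of the same steps on synthetic data. The two days of constant readings are made up so the daily means are easy to check by eye; only the column names follow the question.

```python
import numpy as np
import pandas as pd

# Two full days of 10-minute readings (144 per day), with made-up values.
stamps = pd.date_range("2011-03-18 00:00", periods=288, freq="10min")
values = np.r_[np.full(144, 26.0), np.full(144, 28.0)]
df = pd.DataFrame({"DateTime": stamps, "TSM": values})

# Ensure a datetime dtype, index by it, then aggregate each calendar day.
df["DateTime"] = pd.to_datetime(df["DateTime"])
daily = df.set_index("DateTime").resample("1D").mean()
print(daily)
```

Other aggregations such as max(), min() or first() can replace mean(), depending on what a daily value should represent.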

Python - Plot Multiple Dataframes by Month and Day (Ignore Year)

I have multiple dataframes, each holding data for a different year.
The data in the dataframes looks like this:
>>> its[0].head(5)
Crocs
date
2017-01-01 46
2017-01-08 45
2017-01-15 43
2017-01-22 43
2017-01-29 41
>>> its[1].head(5)
Crocs
date
2018-01-07 23
2018-01-14 21
2018-01-21 23
2018-01-28 21
2018-02-04 25
>>> its[2].head(5)
Crocs
date
2019-01-06 90
2019-01-13 79
2019-01-20 82
2019-01-27 82
2019-02-03 81
I tried to plot all these dataframes in a single figure. I managed to do it, but the result was not what I wanted.
I plotted the dataframes using the following code:
>>> for p in its:
...     plt.plot(p.index, p.values)
>>> plt.show()
and I got a graph in which the years follow one another along the x-axis, which is not what I wanted.
I want the years to be drawn on top of each other.
Simply put, I want the graph to ignore the year and plot by month and day.
You can try converting the datetime index to integers based on month and day (treating each month as roughly 30 days) and plotting against those:
df3 = pd.concat(its,axis=1)
xindex= df3.index.month*30 + df3.index.day
plt.plot(xindex,df3)
plt.show()
If you want month-day labels rather than integers on the axis, you can add xticks to the plot:
labels = df3.index.month.astype(str) + "-" + df3.index.day.astype(str)
plt.xticks(df3.index.month*30 + df3.index.day, labels)
plt.show()
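As an alternative to the month*30 + day approximation, each frame can be re-indexed by day of year so that all years share one x-axis. This is a sketch of a swapped-in technique, not the answer's own code; the frames below reuse a few rows from the question, and the by_year dictionary is a name introduced here for illustration.

```python
import pandas as pd

# First rows of two of the frames shown in the question.
its = [
    pd.DataFrame({"Crocs": [46, 45, 43]},
                 index=pd.to_datetime(["2017-01-01", "2017-01-08", "2017-01-15"])),
    pd.DataFrame({"Crocs": [23, 21, 23]},
                 index=pd.to_datetime(["2018-01-07", "2018-01-14", "2018-01-21"])),
]

# Map each year to a series indexed by day of year (1..366) instead of full dates.
by_year = {}
for frame in its:
    series = frame["Crocs"]
    year = int(series.index[0].year)
    by_year[year] = pd.Series(series.to_numpy(), index=series.index.dayofyear)

for year, series in by_year.items():
    print(year, series.index.tolist(), series.tolist())
```

Each series can then be drawn with plt.plot(series.index, series.values, label=year), which overlays the years on a common 1 to 366 axis. Day of year shifts by one after February in leap years, which is usually negligible for weekly data.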
