Number of rows in a rolling window of 30 days - python-3.x

I have a sample dataframe
Account Date Amount
10 2020-06-01 100
10 2020-06-11 500
10 2020-06-21 600
10 2020-06-25 900
10 2020-07-11 1000
10 2020-07-15 600
11 2020-06-01 100
11 2020-06-11 200
11 2020-06-21 500
11 2020-06-25 1500
11 2020-07-11 2500
11 2020-07-15 6700
I want to get the number of rows in each rolling 30-day interval for each account, i.e.
Account Date Amount
10 2020-06-01 1
10 2020-06-11 2
10 2020-06-21 3
10 2020-06-25 4
10 2020-07-11 4
10 2020-07-15 4
11 2020-06-01 1
11 2020-06-11 2
11 2020-06-21 3
11 2020-06-25 4
11 2020-07-11 4
11 2020-07-15 4
I have tried Grouper and resampling, but those give me the counts per fixed 30-day bin, not the rolling counts.
Thanks in advance!

def get_rolling_amount(grp, freq):
    # closed="both" keeps both window endpoints, so a row exactly
    # 30 days back is still counted.
    return grp.rolling(freq, on="Date", closed="both")["Amount"].count()

df["Date"] = pd.to_datetime(df["Date"])
df["Amount"] = df.groupby("Account").apply(get_rolling_amount, "30D").values
print(df)
Prints:
Account Date Amount
0 10 2020-06-01 1
1 10 2020-06-11 2
2 10 2020-06-21 3
3 10 2020-06-25 4
4 10 2020-07-11 4
5 10 2020-07-15 4
6 11 2020-06-01 1
7 11 2020-06-11 2
8 11 2020-06-21 3
9 11 2020-06-25 4
10 11 2020-07-11 4
11 11 2020-07-15 4
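The same rolling count can also be written with `groupby(...).rolling(...)` on a datetime index instead of `apply`. A self-contained sketch, reconstructing the sample data from the question (the `Count` column name is just for illustration):

```python
import pandas as pd

# Reconstruction of the sample data from the question.
df = pd.DataFrame({
    "Account": [10] * 6 + [11] * 6,
    "Date": pd.to_datetime(
        ["2020-06-01", "2020-06-11", "2020-06-21",
         "2020-06-25", "2020-07-11", "2020-07-15"] * 2),
    "Amount": [100, 500, 600, 900, 1000, 600,
               100, 200, 500, 1500, 2500, 6700],
})

# Time-based rolling within each account; closed="both" keeps both
# window endpoints, so a row exactly 30 days back is still counted.
counts = (df.set_index("Date")
            .groupby("Account")["Amount"]
            .rolling("30D", closed="both")
            .count())

# Groups come back in Account order, matching the original row order here.
df["Count"] = counts.to_numpy()
```

Without `closed="both"`, the default `closed="right"` would exclude the left edge, so `2020-07-11` would count only 3 rows instead of 4.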

You can use broadcasting within each group to check how many rows fall within X days.
import pandas as pd

def within_days(s, days):
    arr = ((s.to_numpy() >= s.to_numpy()[:, None])
           & (s.to_numpy() <= (s + pd.offsets.DateOffset(days=days)).to_numpy()[:, None])).sum(axis=0)
    return pd.Series(arr, index=s.index)

df['Amount'] = df.groupby('Account')['Date'].apply(within_days, days=30)
Account Date Amount
0 10 2020-06-01 1
1 10 2020-06-11 2
2 10 2020-06-21 3
3 10 2020-06-25 4
4 10 2020-07-11 4
5 10 2020-07-15 4
6 11 2020-06-01 1
7 11 2020-06-11 2
8 11 2020-06-21 3
9 11 2020-06-25 4
10 11 2020-07-11 4
11 11 2020-07-15 4
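The broadcasting trick can be exercised in isolation on one account's dates (a minimal sketch; note the pairwise comparison builds an n x n boolean matrix per group, so memory grows quadratically with group size):

```python
import pandas as pd

def within_days(s, days):
    a = s.to_numpy()
    # mask[i, j] is True when date j falls inside [date_i, date_i + days];
    # summing over i counts, for each row j, the trailing-window rows.
    upper = (s + pd.offsets.DateOffset(days=days)).to_numpy()
    mask = (a >= a[:, None]) & (a <= upper[:, None])
    return pd.Series(mask.sum(axis=0), index=s.index)

dates = pd.Series(pd.to_datetime(
    ["2020-06-01", "2020-06-11", "2020-06-21",
     "2020-06-25", "2020-07-11", "2020-07-15"]))
counts = within_days(dates, 30)
```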

df = df.resample('30D').agg({'Date': 'count', 'Amount': 'sum'})
This will aggregate the 'Date' column by count, giving you the data you want.
However, since you first need to set 'Date' as your index for resampling, you could create a "dummy" column containing zeros:
df['dummy'] = pd.Series(np.zeros(len(df)))


Create a new column to indicate dataframe split at least n-rows groups in Python [duplicate]

Given a dataframe df as follows:
df = pd.DataFrame({'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Value': [11, 8, 10, 15, 110, 60, 100, 40]})
Out:
Date Sym Value
0 2015-05-08 aapl 11
1 2015-05-07 aapl 8
2 2015-05-06 aapl 10
3 2015-05-05 aapl 15
4 2015-05-08 aaww 110
5 2015-05-07 aaww 60
6 2015-05-06 aaww 100
7 2015-05-05 aaww 40
I hope to create a new column Group to indicate groups with a range of integers starting from 1; each group should have 3 rows, except for the last group, which may have fewer than 3 rows.
The final result will look like this:
Date Sym Value Group
0 2015-05-08 aapl 11 1
1 2015-05-07 aapl 8 1
2 2015-05-06 aapl 10 1
3 2015-05-05 aapl 15 2
4 2015-05-08 aaww 110 2
5 2015-05-07 aaww 60 2
6 2015-05-06 aaww 100 3
7 2015-05-05 aaww 40 3
How could I achieve that with Pandas or Numpy? Thanks.
My trial code:
n = 3
for g, df in df.groupby(np.arange(len(df)) // n):
    print(df.shape)
You are close: assign the output of the integer division to a new column and add 1:
n = 3
df['Group'] = np.arange(len(df)) // n + 1
print(df)
Date Sym Value Group
0 2015-05-08 aapl 11 1
1 2015-05-07 aapl 8 1
2 2015-05-06 aapl 10 1
3 2015-05-05 aapl 15 2
4 2015-05-08 aaww 110 2
5 2015-05-07 aaww 60 2
6 2015-05-06 aaww 100 3
7 2015-05-05 aaww 40 3
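The accepted one-liner can be verified in isolation; a minimal sketch using only the Value column from the example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Value": [11, 8, 10, 15, 110, 60, 100, 40]})

n = 3
# Integer-divide the positional index by the chunk size; +1 starts groups at 1.
df["Group"] = np.arange(len(df)) // n + 1
```

The last group naturally ends up with fewer than `n` rows when `len(df)` is not a multiple of `n`.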

Join and forward and/or back fill based on custom criteria

All the date(s) in df are present in ref_date of ref_df and not vice versa. Corresponding to each date in df, I need to get ref_date from ref_df based on following logic:
If a date is repeated more than once and previous or next ref_date(s) are missing, then, working outward from the edges of the repeated run, allocate rows to the nearest missing previous or next ref_date(s).
If a date is repeated more than once but there are no missing prev/next ref_date(s), then ref_date is the same as date.
There can be missing ref_date(s) not included in df. This happens when date(s) are not repeated around given ref_date(s) to fill for it.
Example:
>>> import pandas as pd
>>> from datetime import datetime as dt
>>> df = pd.DataFrame({'date':[dt(2020,1,20), dt(2020,1,20), dt(2020,1,20), dt(2020,2,25), dt(2020,3,18), dt(2020,3,18), dt(2020,4,9), dt(2020,4,12), dt(2020,4,12), dt(2020,4,12), dt(2020,4,12), dt(2020,4,12), dt(2020,5,28), dt(2020,6,1), dt(2020,6,1), dt(2020,6,1), dt(2020,6,28), dt(2020,6,28)], 'qty':range(18)})
>>> ref_df = pd.DataFrame({'ref_date':[dt(2019,12,8), dt(2020,1,20), dt(2020,2,25), dt(2020,3,18), dt(2020,4,9), dt(2020,4,10), dt(2020,4,12), dt(2020,4,13), dt(2020,4,14), dt(2020,5,28), dt(2020,5,29), dt(2020,5,30), dt(2020,6,1), dt(2020,6,2), dt(2020,6,3), dt(2020,6,28), dt(2020,6,29), dt(2020,7,7)]})
>>> df
date qty
0 2020-01-20 0
1 2020-01-20 1
2 2020-01-20 2
3 2020-02-25 3
4 2020-03-18 4
5 2020-03-18 5
6 2020-04-09 6
7 2020-04-12 7
8 2020-04-12 8
9 2020-04-12 9
10 2020-04-12 10
11 2020-04-12 11
12 2020-05-28 12
13 2020-06-01 13
14 2020-06-01 14
15 2020-06-01 15
16 2020-06-28 16
17 2020-06-28 17
>>> ref_df
ref_date
0 2019-12-08
1 2020-01-20
2 2020-02-25
3 2020-03-18
4 2020-04-09
5 2020-04-10
6 2020-04-12
7 2020-04-13
8 2020-04-14
9 2020-05-28
10 2020-05-29
11 2020-05-30
12 2020-06-01
13 2020-06-02
14 2020-06-03
15 2020-06-28
16 2020-06-29
17 2020-07-07
Expected_output:
>>> df
date qty ref_date
0 2020-01-20 0 2019-12-08
1 2020-01-20 1 2020-01-20 # Note: repeated as no gap
2 2020-01-20 2 2020-01-20
3 2020-02-25 3 2020-02-25
4 2020-03-18 4 2020-03-18
5 2020-03-18 5 2020-03-18 # Note: repeated as no gap
6 2020-04-09 6 2020-04-09
7 2020-04-12 7 2020-04-10 # Note: Filling from the edges
8 2020-04-12 8 2020-04-12
9 2020-04-12 9 2020-04-12 # Note: repeated as not enough gap
10 2020-04-12 10 2020-04-13
11 2020-04-12 11 2020-04-14
12 2020-05-28 12 2020-05-28
13 2020-06-01 13 2020-05-30 # Filling nearest previous
14 2020-06-01 14 2020-06-01 # First filling previous
15 2020-06-01 15 2020-06-02 # Filling nearest next
16 2020-06-28 16 2020-06-28
17 2020-06-28 17 2020-06-29
I am able to get the answer, but it doesn't look like the most efficient way to do so. Could someone suggest a more optimal way to do it:
ref_df['date'] = ref_df['ref_date']
df = pd.merge_asof(df, ref_df, on='date', direction='nearest')
df = df.rename(columns={'ref_date':'nearest_ref_date'})
nrd_cnt = df.groupby('nearest_ref_date')['date'].count().reset_index().rename(columns={'date':'nrd_count'})
nrd_cnt['lc'] = nrd_cnt['nearest_ref_date'].shift(1)
nrd_cnt['uc'] = nrd_cnt['nearest_ref_date'].shift(-1)
df = df.merge(nrd_cnt, how='left', on='nearest_ref_date')
# TODO: Review. Capped at 100 iterations to avoid an infinite loop in edge cases.
for _ in range(100):
    df2 = df.copy()
    df2['days'] = np.abs((df2['nearest_ref_date'] - df2['date']).dt.days)
    df2['repeat_rank'] = df2.groupby('nearest_ref_date')['days'].rank(method='first')
    reduced_ref_df = ref_df[~ref_df['ref_date'].isin(df2['nearest_ref_date'].unique())]
    df2 = pd.merge_asof(df2, reduced_ref_df, on='date', direction='nearest')
    df2 = df2.rename(columns={'ref_date': 'new_nrd'})
    df2.loc[(df2['new_nrd'] <= df2['lc']) | (df2['new_nrd'] >= df2['uc']), 'new_nrd'] = pd.to_datetime(np.nan)
    df2.loc[(~pd.isna(df2['new_nrd'])) & (df2['repeat_rank'] > 1), 'nearest_ref_date'] = df2['new_nrd']
    df2 = df2[['date', 'qty', 'nearest_ref_date', 'lc', 'uc']]
    if df.equals(df2):
        break
    df = df2
df = df[['date', 'qty', 'nearest_ref_date']]
df.loc[:, 'repeat_rank'] = df.groupby('nearest_ref_date')['nearest_ref_date'].rank(method='first')
df = pd.merge_asof(df, ref_df, on='date', direction='nearest')
# Repeated nearest_ref_date set to nearest ref_date
df.loc[df['repeat_rank'] > 1, 'nearest_ref_date'] = df['ref_date']
# Sorting nearest_ref_date within the ref_date group (without changing order of rest of cols).
df.loc[:, 'nearest_ref_date'] = df[['ref_date', 'nearest_ref_date']].sort_values(['ref_date', 'nearest_ref_date']).reset_index().drop('index',axis=1)['nearest_ref_date']
df = df[['date', 'qty', 'nearest_ref_date']]
df
date qty ref_date
0 2020-01-20 0 2019-12-08
1 2020-01-20 1 2020-01-20
2 2020-01-20 2 2020-01-20
3 2020-02-25 3 2020-02-25
4 2020-03-18 4 2020-03-18
5 2020-03-18 5 2020-03-18
6 2020-04-09 6 2020-04-09
7 2020-04-12 7 2020-04-10
8 2020-04-12 8 2020-04-12
9 2020-04-12 9 2020-04-12
10 2020-04-12 10 2020-04-13
11 2020-04-12 11 2020-04-14
12 2020-05-28 12 2020-05-28
13 2020-06-01 13 2020-05-30
14 2020-06-01 14 2020-06-01
15 2020-06-01 15 2020-06-02
16 2020-06-28 16 2020-06-28
17 2020-06-28 17 2020-06-29
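The core building block above is pd.merge_asof with direction='nearest'; a minimal isolated sketch with toy dates (not the full edge-allocation logic) shows its behavior:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-20", "2020-04-11"]),
    "qty": [0, 1],
})
ref_df = pd.DataFrame({
    "ref_date": pd.to_datetime(
        ["2019-12-08", "2020-01-20", "2020-04-09", "2020-04-12"]),
})
# merge_asof needs the join key in both frames, sorted ascending.
ref_df["date"] = ref_df["ref_date"]

# direction='nearest' picks the closest ref_date on either side:
# an exact match maps to itself; 2020-04-11 maps to 2020-04-12 (1 day away).
out = pd.merge_asof(df, ref_df, on="date", direction="nearest")
```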

Add workdays to pandas df date columns based on another column

Is there a way to increment a date field in a pandas data frame by the number of working days specified in another column?
np.random.seed(10)
df = pd.DataFrame({'Date':pd.date_range(start=dt.datetime(2020,7,1), end = dt.datetime(2020,7,10))})
df['Offset'] = np.random.randint(0,10, len(df))
Date Offset
0 2020-07-01 9
1 2020-07-02 4
2 2020-07-03 0
3 2020-07-04 1
4 2020-07-05 9
5 2020-07-06 0
6 2020-07-07 1
7 2020-07-08 8
8 2020-07-09 9
9 2020-07-10 0
I would expect this to work; however, it throws an error:
df['Date'] + pd.tseries.offsets.BusinessDay(n = df['Offset'])
TypeError: n argument must be an integer, got <class 'pandas.core.series.Series'>
pd.to_timedelta does not support working days.
Like I mentioned in my comment, you are trying to pass an entire Series as an integer. Instead, you want to apply the function row-wise:
df['your_answer'] = df.apply(lambda x:x['Date'] + pd.tseries.offsets.BusinessDay(n= x['Offset']), axis=1)
df
Date Offset your_answer
0 2020-07-01 9 2020-07-14
1 2020-07-02 7 2020-07-13
2 2020-07-03 3 2020-07-08
3 2020-07-04 2 2020-07-07
4 2020-07-05 7 2020-07-14
5 2020-07-06 7 2020-07-15
6 2020-07-07 7 2020-07-16
7 2020-07-08 2 2020-07-10
8 2020-07-09 1 2020-07-10
9 2020-07-10 0 2020-07-10
Line of code broken down:
# notice how this returns every value of that column
df.apply(lambda x:x['Date'], axis=1)
0 2020-07-01
1 2020-07-02
2 2020-07-03
3 2020-07-04
4 2020-07-05
5 2020-07-06
6 2020-07-07
7 2020-07-08
8 2020-07-09
9 2020-07-10
# same thing with `Offset`
df.apply(lambda x:x['Offset'], axis=1)
0 9
1 7
2 3
3 2
4 7
5 7
6 7
7 2
8 1
9 0
Since pd.tseries.offsets.BusinessDay(n=foo_bar) takes an integer and not a Series, we use the two columns together inside apply(): it is as if you looped each number in the Offset column through the offsets.BusinessDay() function.
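If the row-wise apply becomes a bottleneck, NumPy's np.busday_offset is a vectorized alternative. Its weekend handling differs slightly from BusinessDay (here roll='forward' moves a weekend start date to the next business day before offsetting), so treat this as a sketch rather than a drop-in replacement; the your_answer column name mirrors the answer above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("2020-07-01", "2020-07-10"),
    "Offset": [9, 4, 0, 1, 9, 0, 1, 8, 9, 0],
})

# np.busday_offset accepts whole arrays of dates and offsets at once;
# dates must be datetime64[D].
shifted = np.busday_offset(
    df["Date"].to_numpy().astype("datetime64[D]"),
    df["Offset"].to_numpy(),
    roll="forward",
)
df["your_answer"] = pd.to_datetime(shifted)
```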

Comparing 4 graphs in one graph

I have 4 dataframes with value counts of the number of occurrences per month.
I want to compare all 4 value counts in one graph, so I can see the visual difference between each month across these four years, like the image below (a grouped bar chart with months and years).
newdf2018.Month.value_counts()
output
1 3451
2 3895
3 3408
4 3365
5 3833
6 3543
7 3333
8 3219
9 3447
10 2943
11 3296
12 2909
newdf2017.Month.value_counts()
1 2801
2 3048
3 3620
4 3014
5 3226
6 3962
7 3500
8 3707
9 3601
10 3349
11 3743
12 2002
newdf2016.Month.value_counts()
1 3201
2 2034
3 2405
4 3805
5 3308
6 3212
7 3049
8 3777
9 3275
10 3099
11 3775
12 2115
newdf2015.Month.value_counts()
1 2817
2 2604
3 2711
4 2817
5 2670
6 2507
7 3256
8 2195
9 3304
10 3238
11 2005
12 2008
Create a dictionary of DataFrames, concat them together, then use plot:
dfs = {2015:newdf2015, 2016:newdf2016, 2017:newdf2017, 2018:newdf2018}
df = pd.concat({k:v['Month'].value_counts() for k, v in dfs.items()}, axis=1)
df.plot.bar()
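A tiny self-contained version of the same idea, with two stand-in years of a couple of months each (the plot call is commented out so the sketch runs headless):

```python
import pandas as pd

# Stand-ins for the yearly DataFrames from the question.
newdf2015 = pd.DataFrame({"Month": [1, 1, 2]})
newdf2016 = pd.DataFrame({"Month": [1, 2, 2]})

dfs = {2015: newdf2015, 2016: newdf2016}
# One value_counts Series per year, concatenated side by side:
# rows are months, columns are years.
counts = pd.concat({k: v["Month"].value_counts() for k, v in dfs.items()}, axis=1)
counts = counts.sort_index()
# counts.plot.bar()  # grouped bars: one group per month, one bar per year
```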

Most frequent occurence in a pandas dataframe indexed by datetime

I have a large DataFrame which is indexed by datetime, in particular by days. I am looking for an efficient function which, for each column, finds the most common non-null value in each week, and outputs a dataframe indexed by weeks consisting of these within-week most common values.
Here is an example. The following DataFrame consists of two weeks of daily data:
0 1
2015-11-12 00:00:00 8 nan
2015-11-13 00:00:00 7 nan
2015-11-14 00:00:00 nan 5
2015-11-15 00:00:00 7 nan
2015-11-16 00:00:00 8 nan
2015-11-17 00:00:00 7 nan
2015-11-18 00:00:00 5 nan
2015-11-19 00:00:00 9 nan
2015-11-20 00:00:00 8 nan
2015-11-21 00:00:00 6 nan
2015-11-22 00:00:00 6 nan
2015-11-23 00:00:00 6 nan
2015-11-24 00:00:00 6 nan
2015-11-25 00:00:00 2 nan
and should be transformed into:
0 1
2015-11-12 00:00:00 7 5
2015-11-19 00:00:00 6 nan
My DataFrame is very large so efficiency is important. Thanks.
EDIT: If possible, can someone suggest a method that would be applicable if the entries are tuples (instead of floats as in my example)?
You can use resample to group your data by the weekly interval. Then, count the number of occurrences via pd.value_counts and select the most common with idxmax:
df.resample("7D").apply(lambda x: x.apply(pd.value_counts).idxmax())
0 1
2015-11-12 00:00:00 7.0 5.0
2015-11-19 00:00:00 6.0 NaN
Edit
Here is another NumPy version, which is faster than the above solution:
def numpy_mode(series):
    values = series.values
    dropped = values[~np.isnan(values)]
    # check for an empty array and return NaN
    if not dropped.size:
        return np.nan
    uniques, counts = np.unique(dropped, return_counts=True)
    return uniques[np.argmax(counts)]

df2.resample("7D").apply(lambda x: x.apply(numpy_mode))
0 1
2015-11-12 00:00:00 7.0 5.0
2015-11-19 00:00:00 6.0 NaN
And here the timings based on the dummy data (for further improvements, have a look here):
%%timeit
df2.resample("7D").apply(lambda x: x.apply(pd.value_counts).idxmax())
>>> 100 loops, best of 3: 18.6 ms per loop
%%timeit
df2.resample("7D").apply(lambda x: x.apply(numpy_mode))
>>> 100 loops, best of 3: 3.72 ms per loop
I also tried scipy.stats.mode; however, it was also slower than the NumPy solution:
size = 1000
index = pd.date_range(start="2012-12-12", periods=size, freq="D")
dummy = pd.DataFrame(np.random.randint(0, 20, size=(size, 50)), index=index)
print(dummy.head())
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
2012-12-12 18 2 7 1 7 9 16 2 19 19 ... 10 2 18 16 15 10 7 19 9 6
2012-12-13 7 4 11 19 17 10 18 0 10 7 ... 19 11 5 5 11 4 0 16 12 19
2012-12-14 14 0 14 5 1 11 2 19 5 9 ... 2 9 4 2 9 5 19 2 16 2
2012-12-15 12 2 7 2 12 12 11 11 19 5 ... 16 0 4 9 13 5 10 2 14 4
2012-12-16 8 15 2 18 3 16 15 0 14 14 ... 18 2 6 13 19 10 3 16 11 4
%%timeit
dummy.resample("7D").apply(lambda x: x.apply(numpy_mode))
>>> 1 loop, best of 3: 926 ms per loop
%%timeit
dummy.resample("7D").apply(lambda x: x.apply(pd.value_counts).idxmax())
>>> 1 loop, best of 3: 5.84 s per loop
%%timeit
dummy.resample("7D").apply(lambda x: stats.mode(x).mode)
>>> 1 loop, best of 3: 1.32 s per loop
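As a sanity check, the NumPy mode helper can be exercised on a toy version of the example frame. This sketch assumes `.agg` with a custom callable passes each weekly column chunk as a Series, which holds on recent pandas:

```python
import numpy as np
import pandas as pd

def numpy_mode(series):
    # Most frequent non-NaN value; NaN for an all-NaN chunk.
    values = series.to_numpy(dtype=float)
    dropped = values[~np.isnan(values)]
    if not dropped.size:
        return np.nan
    uniques, counts = np.unique(dropped, return_counts=True)
    return uniques[np.argmax(counts)]

idx = pd.date_range("2015-11-12", periods=14, freq="D")
df2 = pd.DataFrame({
    0: [8, 7, np.nan, 7, 8, 7, 5, 9, 8, 6, 6, 6, 6, 2],
    1: [np.nan, np.nan, 5] + [np.nan] * 11,
}, index=idx)

# Weekly bins starting at the first index date.
out = df2.resample("7D").agg(numpy_mode)
```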
