Given a dataframe df as follows:
df = pd.DataFrame({'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Value': [11, 8, 10, 15, 110, 60, 100, 40]})
Out:
Date Sym Value
0 2015-05-08 aapl 11
1 2015-05-07 aapl 8
2 2015-05-06 aapl 10
3 2015-05-05 aapl 15
4 2015-05-08 aaww 110
5 2015-05-07 aaww 60
6 2015-05-06 aaww 100
7 2015-05-05 aaww 40
I would like to create a new column, Group, that labels groups with consecutive integers starting from 1. Each group should have 3 rows, except the last group, which may have fewer than 3 rows.
The final result should look like this:
Date Sym Value Group
0 2015-05-08 aapl 11 1
1 2015-05-07 aapl 8 1
2 2015-05-06 aapl 10 1
3 2015-05-05 aapl 15 2
4 2015-05-08 aaww 110 2
5 2015-05-07 aaww 60 2
6 2015-05-06 aaww 100 3
7 2015-05-05 aaww 40 3
How could I achieve that with Pandas or Numpy? Thanks.
My trial code:
n = 3
for g, chunk in df.groupby(np.arange(len(df)) // n):
    print(chunk.shape)
You are close: assign the array you pass to groupby to a new column and add 1:
n = 3
df['Group'] = np.arange(len(df)) // n + 1
print (df)
Date Sym Value Group
0 2015-05-08 aapl 11 1
1 2015-05-07 aapl 8 1
2 2015-05-06 aapl 10 1
3 2015-05-05 aapl 15 2
4 2015-05-08 aaww 110 2
5 2015-05-07 aaww 60 2
6 2015-05-06 aaww 100 3
7 2015-05-05 aaww 40 3
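To see why integer division gives the labels, here is a minimal self-contained sketch. It uses only the Value column from the question; the name sizes is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Only the Value column from the question is needed to show the idea.
df = pd.DataFrame({"Value": [11, 8, 10, 15, 110, 60, 100, 40]})

n = 3
# Row positions 0..7 floor-divided by 3 give 0,0,0,1,1,1,2,2; adding 1 starts the labels at 1.
df["Group"] = np.arange(len(df)) // n + 1

# Number of rows per label; the last group is shorter because 8 is not a multiple of 3.
sizes = df.groupby("Group").size()
print(df)
print(sizes)
```

Because the labels come from row positions rather than index values, this should also work when the index is not a default RangeIndex.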
Every date in df is present in ref_date of ref_df, but not vice versa. For each date in df, I need to pick a ref_date from ref_df using the following logic:
If a date is repeated more than once and the previous or next ref_date(s) are missing from df, then the rows at the edges of the repeated run are allocated to the nearest missing previous or next ref_date(s).
If a date is repeated more than once but there is no missing previous/next ref_date, then ref_date is the same as date.
Some ref_date(s) may remain unused in df. This happens when there are not enough repeated date(s) around the given ref_date(s) to fill them.
Example:
>>> import pandas as pd
>>> from datetime import datetime as dt
>>> df = pd.DataFrame({'date':[dt(2020,1,20), dt(2020,1,20), dt(2020,1,20), dt(2020,2,25), dt(2020,3,18), dt(2020,3,18), dt(2020,4,9), dt(2020,4,12), dt(2020,4,12), dt(2020,4,12), dt(2020,4,12), dt(2020,4,12), dt(2020,5,28), dt(2020,6,1), dt(2020,6,1), dt(2020,6,1), dt(2020,6,28), dt(2020,6,28)], 'qty':range(18)})
>>> ref_df = pd.DataFrame({'ref_date':[dt(2019,12,8), dt(2020,1,20), dt(2020,2,25), dt(2020,3,18), dt(2020,4,9), dt(2020,4,10), dt(2020,4,12), dt(2020,4,13), dt(2020,4,14), dt(2020,5,28), dt(2020,5,29), dt(2020,5,30), dt(2020,6,1), dt(2020,6,2), dt(2020,6,3), dt(2020,6,28), dt(2020,6,29), dt(2020,7,7)]})
>>> df
date qty
0 2020-01-20 0
1 2020-01-20 1
2 2020-01-20 2
3 2020-02-25 3
4 2020-03-18 4
5 2020-03-18 5
6 2020-04-09 6
7 2020-04-12 7
8 2020-04-12 8
9 2020-04-12 9
10 2020-04-12 10
11 2020-04-12 11
12 2020-05-28 12
13 2020-06-01 13
14 2020-06-01 14
15 2020-06-01 15
16 2020-06-28 16
17 2020-06-28 17
>>> ref_df
ref_date
0 2019-12-08
1 2020-01-20
2 2020-02-25
3 2020-03-18
4 2020-04-09
5 2020-04-10
6 2020-04-12
7 2020-04-13
8 2020-04-14
9 2020-05-28
10 2020-05-29
11 2020-05-30
12 2020-06-01
13 2020-06-02
14 2020-06-03
15 2020-06-28
16 2020-06-29
17 2020-07-07
Expected_output:
>>> df
date qty ref_date
0 2020-01-20 0 2019-12-08
1 2020-01-20 1 2020-01-20 # Note: repeated as no gap
2 2020-01-20 2 2020-01-20
3 2020-02-25 3 2020-02-25
4 2020-03-18 4 2020-03-18
5 2020-03-18 5 2020-03-18 # Note: repeated as no gap
6 2020-04-09 6 2020-04-09
7 2020-04-12 7 2020-04-10 # Note: Filling from the edges
8 2020-04-12 8 2020-04-12
9 2020-04-12 9 2020-04-12 # Note: repeated as not enough gap
10 2020-04-12 10 2020-04-13
11 2020-04-12 11 2020-04-14
12 2020-05-28 12 2020-05-28
13 2020-06-01 13 2020-05-30 # Filling nearest previous
14 2020-06-01 14 2020-06-01 # First filling previous
15 2020-06-01 15 2020-06-02 # Filling nearest next
16 2020-06-28 16 2020-06-28
17 2020-06-28 17 2020-06-29
I am able to get the answer, but my approach does not look efficient. Could someone suggest a better way to do it?
import numpy as np
ref_df['date'] = ref_df['ref_date']
df = pd.merge_asof(df, ref_df, on='date', direction='nearest')
df = df.rename(columns={'ref_date':'nearest_ref_date'})
nrd_cnt = df.groupby('nearest_ref_date')['date'].count().reset_index().rename(columns={'date':'nrd_count'})
nrd_cnt['lc'] = nrd_cnt['nearest_ref_date'].shift(1)
nrd_cnt['uc'] = nrd_cnt['nearest_ref_date'].shift(-1)
df = df.merge(nrd_cnt, how='left', on='nearest_ref_date')
# TODO: Review this. The loop is capped at 100 iterations to avoid an infinite loop (in case of edge cases)
for _ in range(100):
    df2 = df.copy()
    df2['days'] = np.abs((df2['nearest_ref_date'] - df2['date']).dt.days)
    df2['repeat_rank'] = df2.groupby('nearest_ref_date')['days'].rank(method='first')
    reduced_ref_df = ref_df[~ref_df['ref_date'].isin(df2['nearest_ref_date'].unique())]
    df2 = pd.merge_asof(df2, reduced_ref_df, on='date', direction='nearest')
    df2 = df2.rename(columns={'ref_date':'new_nrd'})
    df2.loc[(df2['new_nrd']<=df2['lc']) | (df2['new_nrd']>=df2['uc']), 'new_nrd'] = pd.to_datetime(np.nan)
    df2.loc[(~pd.isna(df2['new_nrd'])) & (df2['repeat_rank'] > 1), 'nearest_ref_date'] = df2['new_nrd']
    df2 = df2[['date', 'qty', 'nearest_ref_date', 'lc', 'uc']]
    if df.equals(df2):
        break
    df = df2
df = df[['date', 'qty', 'nearest_ref_date']]
df.loc[:, 'repeat_rank'] = df.groupby('nearest_ref_date')['nearest_ref_date'].rank(method='first')
df = pd.merge_asof(df, ref_df, on='date', direction='nearest')
# Repeated nearest_ref_date set to nearest ref_date
df.loc[df['repeat_rank'] > 1, 'nearest_ref_date'] = df['ref_date']
# Sorting nearest_ref_date within the ref_date group (without changing order of rest of cols).
df.loc[:, 'nearest_ref_date'] = df[['ref_date', 'nearest_ref_date']].sort_values(['ref_date', 'nearest_ref_date']).reset_index().drop('index',axis=1)['nearest_ref_date']
df = df[['date', 'qty', 'nearest_ref_date']]
df
date qty ref_date
0 2020-01-20 0 2019-12-08
1 2020-01-20 1 2020-01-20
2 2020-01-20 2 2020-01-20
3 2020-02-25 3 2020-02-25
4 2020-03-18 4 2020-03-18
5 2020-03-18 5 2020-03-18
6 2020-04-09 6 2020-04-09
7 2020-04-12 7 2020-04-10
8 2020-04-12 8 2020-04-12
9 2020-04-12 9 2020-04-12
10 2020-04-12 10 2020-04-13
11 2020-04-12 11 2020-04-14
12 2020-05-28 12 2020-05-28
13 2020-06-01 13 2020-05-30
14 2020-06-01 14 2020-06-01
15 2020-06-01 15 2020-06-02
16 2020-06-28 16 2020-06-28
17 2020-06-28 17 2020-06-29
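The code above relies on pd.merge_asof with direction='nearest' as its basic building block. As a minimal sketch of what that call does on its own (the tiny frames left and right and the name matched are made up for illustration, not taken from the question's data):

```python
import pandas as pd

# Both frames must be sorted by the key column for merge_asof.
left = pd.DataFrame({"date": pd.to_datetime(["2020-04-09", "2020-04-12", "2020-06-02"])})
right = pd.DataFrame({"ref_date": pd.to_datetime(
    ["2020-04-10", "2020-04-13", "2020-05-30", "2020-06-03"])})

# Duplicate ref_date under the key name so it survives the merge as its own column.
right["date"] = right["ref_date"]

# For each left date, pick the row of `right` whose key is closest in time.
matched = pd.merge_asof(left, right, on="date", direction="nearest")
print(matched)
```

Each left row gets exactly one match, so repeated dates all receive the same ref_date; that is why the question's code needs the extra ranking and re-matching steps.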
I have a sample dataframe
Account Date Amount
10 2020-06-01 100
10 2020-06-11 500
10 2020-06-21 600
10 2020-06-25 900
10 2020-07-11 1000
10 2020-07-15 600
11 2020-06-01 100
11 2020-06-11 200
11 2020-06-21 500
11 2020-06-25 1500
11 2020-07-11 2500
11 2020-07-15 6700
I want the rolling count of rows within the preceding 30 days for each account, i.e.:
Account Date Amount
10 2020-06-01 1
10 2020-06-11 2
10 2020-06-21 3
10 2020-06-25 4
10 2020-07-11 4
10 2020-07-15 4
11 2020-06-01 1
11 2020-06-11 2
11 2020-06-21 3
11 2020-06-25 4
11 2020-07-11 4
11 2020-07-15 4
I have tried Grouper and resampling, but those give me the counts per 30-day bin and not the rolling counts.
Thanks in advance!
def get_rolling_amount(grp, freq):
    # count the rows in the trailing window [Date - freq, Date] within one account
    return grp.rolling(freq, on="Date", closed="both")["Amount"].count()
df["Date"] = pd.to_datetime(df["Date"])
df["Amount"] = df.groupby("Account").apply(get_rolling_amount, "30D").values.astype(int)
print(df)
Prints:
Account Date Amount
0 10 2020-06-01 1
1 10 2020-06-11 2
2 10 2020-06-21 3
3 10 2020-06-25 4
4 10 2020-07-11 4
5 10 2020-07-15 4
6 11 2020-06-01 1
7 11 2020-06-11 2
8 11 2020-06-21 3
9 11 2020-06-25 4
10 11 2020-07-11 4
11 11 2020-07-15 4
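The same rolling count can also be sketched without apply, by calling rolling directly on the groupby object. This is only a sketch under the assumption that the frame is already sorted by Account and then Date, so the results can be assigned back positionally; the column name RollingCount is introduced here for illustration.

```python
import pandas as pd

dates = ["2020-06-01", "2020-06-11", "2020-06-21", "2020-06-25", "2020-07-11", "2020-07-15"]
df = pd.DataFrame({
    "Account": [10] * 6 + [11] * 6,
    "Date": pd.to_datetime(dates * 2),
    "Amount": [100, 500, 600, 900, 1000, 600, 100, 200, 500, 1500, 2500, 6700],
})

# closed="both" makes the window [Date - 30 days, Date], including both endpoints.
counts = (
    df.groupby("Account")
      .rolling("30D", on="Date", closed="both")["Amount"]
      .count()
)

# Assumes df is sorted by Account, then Date, so group order equals row order.
df["RollingCount"] = counts.to_numpy().astype(int)
print(df)
```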
You can use broadcasting within group to check how many rows fall within X days.
import pandas as pd
def within_days(s, days):
    dates = s.to_numpy()
    upper = (s + pd.offsets.DateOffset(days=days)).to_numpy()
    # arr[i, j] is True when dates[i] <= dates[j] <= dates[i] + days
    arr = (dates >= dates[:, None]) & (dates <= upper[:, None])
    return pd.Series(arr.sum(axis=0), index=s.index)
df['Amount'] = df.groupby('Account', group_keys=False)['Date'].apply(within_days, days=30)
Account Date Amount
0 10 2020-06-01 1
1 10 2020-06-11 2
2 10 2020-06-21 3
3 10 2020-06-25 4
4 10 2020-07-11 4
5 10 2020-07-15 4
6 11 2020-06-01 1
7 11 2020-06-11 2
8 11 2020-06-21 3
9 11 2020-06-25 4
10 11 2020-07-11 4
11 11 2020-07-15 4
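To make the broadcasting step easier to follow, here is a minimal NumPy-only sketch for a single account. It replaces the dates with day offsets from 2020-06-01 (an illustrative simplification; the names days, within and counts are introduced here).

```python
import numpy as np

# Day offsets from 2020-06-01 for one account: 06-01, 06-11, 06-21, 06-25, 07-11, 07-15.
days = np.array([0, 10, 20, 24, 40, 44])

# within[i, j] is True when row i lies in the 30 days up to and including row j,
# i.e. days[i] <= days[j] <= days[i] + 30.
within = (days[:, None] <= days[None, :]) & (days[None, :] <= days[:, None] + 30)

# Summing down each column counts the rows that fall in row j's trailing window.
counts = within.sum(axis=0)
print(counts)
```

The comparison builds an n-by-n boolean matrix per group, so memory grows quadratically with group size; for very large groups a rolling window is likely the better fit.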
df = df.resample('30D').agg({'date':'count','Amount':'sum'})
This will aggregate the 'Date' column by count, getting the data you want.
However, since you will need to first set date as your index for resampling, you could create a "dummy" column containing zeros:
df['dummy'] = pd.Series(np.zeros(len(df)))
I have the following data. I have many months of data, but I show only a few days of it here.
I tried to use the following code to get the last 15 days of data, but I always get the following error:
AttributeError: 'Series' object has no attribute 'isoweekday'
I need help here.
My code:
import datetime
df['Data_Date'] = datetime.date.today()
weekday = df['Data_Date'].isoweekday()
start = df["Data_Date"] - datetime.timedelta(days=weekday)
dates = [start - datetime.timedelta(days=d) for d in range(15)]
df["Data_Date"] = [str(d) for d in dates]
Data
Data_Date File Data
2021-03-06 18 1144396
2021-03-06 12 1069004
2021-03-06 11 2050459
2021-03-06 18 1648709
2021-03-07 18 1131606
2021-03-07 11 1069685
2021-03-07 11 2062713
2021-03-07 18 1594153
2021-03-08 18 1161566
2021-03-08 18 1068366
2021-03-08 18 2048878
2021-03-08 18 1649411
2021-03-09 19 1257021
2021-03-09 18 1055597
2021-03-09 18 2026171
2021-03-09 19 1792446
2021-03-10 18 1164453
2021-03-10 12 1088292
2021-03-10 12 2073664
2021-03-10 12 1658517
2021-03-11 12 1140799
2021-03-11 12 1030003
2021-03-11 12 1995509
2021-03-11 12 1614548
IIUC:
# parse Data_Date; this format string assumes values like 20210306
df['Data_Date'] = pd.to_datetime(df['Data_Date'], format='%Y%m%d')
# get today's Timestamp at midnight
today = pd.to_datetime('today').normalize()
# go back as many days as there are rows in the DataFrame
prev = today - pd.Timedelta(len(df), unit='d')
# generate one new datetime per row, counting up from that start date
df["new"] = pd.to_datetime(range(len(df)), unit='d', origin=prev)
print(df)
Data_Date Data new
0 2021-03-11 1762439 2021-02-20
1 2021-03-12 1678808 2021-02-21
2 2021-03-12 1665741 2021-02-22
3 2021-03-12 3043567 2021-02-23
4 2021-03-12 2362461 2021-02-24
5 2021-03-13 1166616 2021-02-25
6 2021-03-13 1156903 2021-02-26
7 2021-03-13 2121702 2021-02-27
8 2021-03-13 1779516 2021-02-28
9 2021-03-14 1381958 2021-03-01
10 2021-03-14 1389385 2021-03-02
11 2021-03-14 2523322 2021-03-03
12 2021-03-14 2086453 2021-03-04
13 2021-03-15 1194240 2021-03-05
14 2021-03-15 1205421 2021-03-06
15 2021-03-15 2184774 2021-03-07
16 2021-03-15 1813142 2021-03-08
17 2021-03-16 1194240 2021-03-09
18 2021-03-16 1205421 2021-03-10
19 2021-03-16 2184774 2021-03-11
20 2021-03-16 1813142 2021-03-12
21 2021-03-17 1194240 2021-03-13
22 2021-03-17 1205421 2021-03-14
23 2021-03-17 2184774 2021-03-15
24 2021-03-17 1813142 2021-03-16
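On the error itself: isoweekday is a method of a single date object, not of a Series, so it has to go through the .dt accessor. If the goal is to keep only the last 15 days of rows, a minimal sketch could look like this. It assumes "last 15 days" means the 15 days ending at the latest Data_Date in the frame; the small frame and the names iso_weekday, cutoff and last_15 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Data_Date": ["2021-02-20", "2021-03-06", "2021-03-10", "2021-03-11"],
    "Data": [1, 2, 3, 4],
})
df["Data_Date"] = pd.to_datetime(df["Data_Date"])

# Series-level equivalent of date.isoweekday(): Monday=1 ... Sunday=7.
iso_weekday = df["Data_Date"].dt.dayofweek + 1

# Assumption: the window is anchored at the latest date present in the data.
cutoff = df["Data_Date"].max() - pd.Timedelta(days=15)
last_15 = df[df["Data_Date"] > cutoff]
print(last_15)
```

To anchor the window at today instead, cutoff could be computed from pd.Timestamp.today().normalize() in the same way.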
I have the following dataframe:
user_id end date start date no
0 1 2018-03-01 2018-01-01 15
1 1 2018-04-01 2018-02-01 20
2 1 2018-05-01 2018-03-01 35
3 2 2018-07-01 2018-04-01 50
4 2 2018-07-01 2018-05-01 18
I want to create another dataframe such that, for a given user id, I have the start and end dates along with the last day of every month between the given start and end dates, e.g.:
user_id date no
1 2018-01-01 15
1 2018-02-28 15
1 2018-03-01 15
1 2018-02-01 20
1 2018-03-31 20
1 2018-04-01 20
1 2018-03-01 35
1 2018-04-30 35
1 2018-05-01 35
2 2018-04-01 50
2 2018-05-31 50
2 2018-06-30 50
2 2018-07-01 50
2 2018-05-01 18
2 2018-06-30 18
2 2018-07-01 18
Use:
#rename columns for using itertuples
df = df.rename(columns=lambda x: x.replace(' ', '_'))
#convert columns to datetimes if necessary
df[['start_date','end_date']] = df[['start_date','end_date']].apply(pd.to_datetime)
#repeat datetimes to Series
s = pd.concat([pd.Series(r.Index, pd.date_range(r.start_date, r.end_date, freq='MS'))
               for r in df.itertuples()])
#swap keys with values
s = pd.Series(s.index, index=s, name='date')
#add to original
df = df.join(s).reset_index(drop=True)
#check values if match
mask = df[['start_date','end_date']].ne(df['date'], axis=0).all(axis=1)
df = df.drop(['start_date','end_date'], axis=1)
#set end of month for added datetimes
df.loc[mask, 'date'] += pd.offsets.MonthEnd()
#change order of columns
df = df[['user_id','date','no']]
print (df)
user_id date no
0 1 2018-01-01 15
1 1 2018-02-28 15
2 1 2018-03-01 15
3 1 2018-02-01 20
4 1 2018-03-31 20
5 1 2018-04-01 20
6 1 2018-03-01 35
7 1 2018-04-30 35
8 1 2018-05-01 35
9 2 2018-04-01 50
10 2 2018-05-31 50
11 2 2018-06-30 50
12 2 2018-07-01 50
13 2 2018-05-01 18
14 2 2018-06-30 18
15 2 2018-07-01 18
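The core of this approach is the combination of date_range with freq='MS' and pd.offsets.MonthEnd(). A minimal sketch for a single row, assuming start and end both fall on the first day of a month and differ from each other (expand_months and result are hypothetical names used only here):

```python
import pandas as pd

def expand_months(start, end):
    # Month starts from start to end, inclusive.
    starts = pd.date_range(start, end, freq="MS")
    # Adding MonthEnd() to a month start rolls it forward to the end of that same month.
    interior = starts[1:-1] + pd.offsets.MonthEnd()
    # Keep the original start and end, and month ends for everything in between.
    return [starts[0], *interior, starts[-1]]

result = expand_months("2018-04-01", "2018-07-01")
print(result)
```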
EDIT:
To also handle start and end dates that do not fall on the first day of a month, use:
print (df)
user_id end date start date no
0 1 2018-05-10 2018-03-10 15
1 1 2018-04-01 2018-02-01 20
2 1 2018-05-01 2018-03-01 35
3 2 2018-07-01 2018-04-01 50
4 2 2018-07-01 2018-05-01 18
#convert columns to datetimes if necessary
df[['start date','end date']] = df[['start date','end date']].apply(pd.to_datetime)
#create new column with start of months
df['start_date'] = df['start date'].dt.to_period('m').dt.to_timestamp()
df['end_date'] = df['end date'].dt.to_period('m').dt.to_timestamp()
#repeat datetimes to Series
s = pd.concat([pd.Series(r.Index, pd.date_range(r.start_date, r.end_date, freq='MS'))
               for r in df.itertuples()])
#swap keys with values
s = pd.Series(s.index, index=s, name='date')
#add to original
df = df.join(s).reset_index(drop=True)
#check values if match
m1 = df['start_date'].ne(df['date'])
m2 = df['end_date'].ne(df['date'])
#replace values by start and end of groups
df['date'] = df['date'].where(m1, df['start date'])
df['date'] = df['date'].where(m2, df['end date'])
df = df.drop(['start_date','end_date','start date','end date'], axis=1)
#set end of month for added datetimes
df.loc[m1 & m2, 'date'] += pd.offsets.MonthEnd()
#change order of columns
df = df[['user_id','date','no']]
print (df)
user_id date no
0 1 2018-03-10 15
1 1 2018-04-30 15
2 1 2018-05-10 15
3 1 2018-02-01 20
4 1 2018-03-31 20
5 1 2018-04-01 20
6 1 2018-03-01 35
7 1 2018-04-30 35
8 1 2018-05-01 35
9 2 2018-04-01 50
10 2 2018-05-31 50
11 2 2018-06-30 50
12 2 2018-07-01 50
13 2 2018-05-01 18
14 2 2018-06-30 18
15 2 2018-07-01 18
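The step that makes the edited version work is flooring each date to the first day of its month before building the monthly range. A minimal sketch of that step in isolation (dates and month_start are illustrative names):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2018-03-10", "2018-05-10", "2018-02-01"]))

# Converting to a monthly period drops the day; to_timestamp() then returns the month start.
month_start = dates.dt.to_period("M").dt.to_timestamp()
print(month_start)
```

Dates already on the first of a month are unchanged, which is why the same code should handle both kinds of rows.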