All the dates in df are present in the ref_date column of ref_df, but not vice versa. For each date in df, I need to get a ref_date from ref_df based on the following logic:
If a date is repeated more than once and previous or next ref_date(s) are missing from df, then, working from the edges of the repeated block, allocate the extra rows to the nearest missing previous or next ref_date(s).
If a date is repeated more than once but there is no missing previous/next ref_date, then ref_date is the same as date.
Some missing ref_date(s) may remain unused in the output. This happens when there are not enough repeated date(s) around a given ref_date to fill it.
Example:
>>> import pandas as pd
>>> from datetime import datetime as dt
>>> df = pd.DataFrame({'date':[dt(2020,1,20), dt(2020,1,20), dt(2020,1,20), dt(2020,2,25), dt(2020,3,18), dt(2020,3,18), dt(2020,4,9), dt(2020,4,12), dt(2020,4,12), dt(2020,4,12), dt(2020,4,12), dt(2020,4,12), dt(2020,5,28), dt(2020,6,1), dt(2020,6,1), dt(2020,6,1), dt(2020,6,28), dt(2020,6,28)], 'qty':range(18)})
>>> ref_df = pd.DataFrame({'ref_date':[dt(2019,12,8), dt(2020,1,20), dt(2020,2,25), dt(2020,3,18), dt(2020,4,9), dt(2020,4,10), dt(2020,4,12), dt(2020,4,13), dt(2020,4,14), dt(2020,5,28), dt(2020,5,29), dt(2020,5,30), dt(2020,6,1), dt(2020,6,2), dt(2020,6,3), dt(2020,6,28), dt(2020,6,29), dt(2020,7,7)]})
>>> df
date qty
0 2020-01-20 0
1 2020-01-20 1
2 2020-01-20 2
3 2020-02-25 3
4 2020-03-18 4
5 2020-03-18 5
6 2020-04-09 6
7 2020-04-12 7
8 2020-04-12 8
9 2020-04-12 9
10 2020-04-12 10
11 2020-04-12 11
12 2020-05-28 12
13 2020-06-01 13
14 2020-06-01 14
15 2020-06-01 15
16 2020-06-28 16
17 2020-06-28 17
>>> ref_df
ref_date
0 2019-12-08
1 2020-01-20
2 2020-02-25
3 2020-03-18
4 2020-04-09
5 2020-04-10
6 2020-04-12
7 2020-04-13
8 2020-04-14
9 2020-05-28
10 2020-05-29
11 2020-05-30
12 2020-06-01
13 2020-06-02
14 2020-06-03
15 2020-06-28
16 2020-06-29
17 2020-07-07
Expected output:
>>> df
date qty ref_date
0 2020-01-20 0 2019-12-08
1 2020-01-20 1 2020-01-20 # Note: repeated as no gap
2 2020-01-20 2 2020-01-20
3 2020-02-25 3 2020-02-25
4 2020-03-18 4 2020-03-18
5 2020-03-18 5 2020-03-18 # Note: repeated as no gap
6 2020-04-09 6 2020-04-09
7 2020-04-12 7 2020-04-10 # Note: Filling from the edges
8 2020-04-12 8 2020-04-12
9 2020-04-12 9 2020-04-12 # Note: repeated as not enough gap
10 2020-04-12 10 2020-04-13
11 2020-04-12 11 2020-04-14
12 2020-05-28 12 2020-05-28
13 2020-06-01 13 2020-05-30 # Filling nearest previous
14 2020-06-01 14 2020-06-01 # First filling previous
15 2020-06-01 15 2020-06-02 # Filling nearest next
16 2020-06-28 16 2020-06-28
17 2020-06-28 17 2020-06-29
I am able to get the answer, but it doesn't look like the most efficient way to do it. Could someone suggest a more optimal approach?
import numpy as np

ref_df['date'] = ref_df['ref_date']
df = pd.merge_asof(df, ref_df, on='date', direction='nearest')
df = df.rename(columns={'ref_date': 'nearest_ref_date'})
nrd_cnt = df.groupby('nearest_ref_date')['date'].count().reset_index().rename(columns={'date': 'nrd_count'})
nrd_cnt['lc'] = nrd_cnt['nearest_ref_date'].shift(1)   # lower bound (previous used ref_date)
nrd_cnt['uc'] = nrd_cnt['nearest_ref_date'].shift(-1)  # upper bound (next used ref_date)
df = df.merge(nrd_cnt, how='left', on='nearest_ref_date')
# TODO: Review it. Capping the loop at 100 iterations to avoid an infinite loop (in case of edge cases)
for _ in range(100):
    df2 = df.copy()
    df2['days'] = np.abs((df2['nearest_ref_date'] - df2['date']).dt.days)
    df2['repeat_rank'] = df2.groupby('nearest_ref_date')['days'].rank(method='first')
    reduced_ref_df = ref_df[~ref_df['ref_date'].isin(df2['nearest_ref_date'].unique())]
    df2 = pd.merge_asof(df2, reduced_ref_df, on='date', direction='nearest')
    df2 = df2.rename(columns={'ref_date': 'new_nrd'})
    df2.loc[(df2['new_nrd'] <= df2['lc']) | (df2['new_nrd'] >= df2['uc']), 'new_nrd'] = pd.to_datetime(np.nan)
    df2.loc[(~pd.isna(df2['new_nrd'])) & (df2['repeat_rank'] > 1), 'nearest_ref_date'] = df2['new_nrd']
    df2 = df2[['date', 'qty', 'nearest_ref_date', 'lc', 'uc']]
    if df.equals(df2):
        break
    df = df2
df = df[['date', 'qty', 'nearest_ref_date']]
df.loc[:, 'repeat_rank'] = df.groupby('nearest_ref_date')['nearest_ref_date'].rank(method='first')
df = pd.merge_asof(df, ref_df, on='date', direction='nearest')
# Repeated nearest_ref_date set to nearest ref_date
df.loc[df['repeat_rank'] > 1, 'nearest_ref_date'] = df['ref_date']
# Sorting nearest_ref_date within the ref_date group (without changing the order of the remaining columns).
df.loc[:, 'nearest_ref_date'] = df[['ref_date', 'nearest_ref_date']].sort_values(['ref_date', 'nearest_ref_date']).reset_index().drop('index', axis=1)['nearest_ref_date']
df = df[['date', 'qty', 'nearest_ref_date']]
df
date qty ref_date
0 2020-01-20 0 2019-12-08
1 2020-01-20 1 2020-01-20
2 2020-01-20 2 2020-01-20
3 2020-02-25 3 2020-02-25
4 2020-03-18 4 2020-03-18
5 2020-03-18 5 2020-03-18
6 2020-04-09 6 2020-04-09
7 2020-04-12 7 2020-04-10
8 2020-04-12 8 2020-04-12
9 2020-04-12 9 2020-04-12
10 2020-04-12 10 2020-04-13
11 2020-04-12 11 2020-04-14
12 2020-05-28 12 2020-05-28
13 2020-06-01 13 2020-05-30
14 2020-06-01 14 2020-06-01
15 2020-06-01 15 2020-06-02
16 2020-06-28 16 2020-06-28
17 2020-06-28 17 2020-06-29
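For comparison, here is a minimal greedy sketch (not an authoritative answer) of the same allocation rule, starting from the original df and ref_df above: for each run of identical dates it hands the extra rows to the nearest unused ref_dates lying strictly between the neighbouring dates that appear in df, preferring the earlier date on ties. The helper allocate_ref_dates is hypothetical; it reproduces the expected output on this example, but on data where several repeated-date groups compete for the same missing ref_dates it may still need the iterative refinement your loop performs.
import pandas as pd

def allocate_ref_dates(df, ref_df):
    # Assumes df is sorted by 'date' and every date in df appears in ref_df['ref_date'].
    used = set(pd.to_datetime(df['date'].unique()))          # ref_dates already taken by df itself
    refs = list(pd.to_datetime(ref_df['ref_date']).sort_values())
    dates = sorted(used)                                     # unique df dates in ascending order
    out = []
    for i, d in enumerate(dates):
        n = int((df['date'] == d).sum())
        lo = dates[i - 1] if i > 0 else pd.Timestamp.min                 # previous date present in df
        hi = dates[i + 1] if i + 1 < len(dates) else pd.Timestamp.max    # next date present in df
        # candidate ref_dates: unused and strictly between the neighbouring df dates
        cands = [r for r in refs if lo < r < hi and r not in used]
        # take the n-1 nearest candidates, preferring the earlier date on ties
        cands.sort(key=lambda r: (abs((r - d).days), r))
        picked = cands[:n - 1]
        used.update(picked)
        out.extend(sorted(picked + [d] * (n - len(picked))))
    return out

df = df.sort_values('date', ignore_index=True)
df['ref_date'] = allocate_ref_dates(df, ref_df)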
I have a sample dataframe
Account Date Amount
10 2020-06-01 100
10 2020-06-11 500
10 2020-06-21 600
10 2020-06-25 900
10 2020-07-11 1000
10 2020-07-15 600
11 2020-06-01 100
11 2020-06-11 200
11 2020-06-21 500
11 2020-06-25 1500
11 2020-07-11 2500
11 2020-07-15 6700
I want to get the number of rows in each rolling 30-day interval for each account, i.e.
Account Date Amount
10 2020-06-01 1
10 2020-06-11 2
10 2020-06-21 3
10 2020-06-25 4
10 2020-07-11 4
10 2020-07-15 4
11 2020-06-01 1
11 2020-06-11 2
11 2020-06-21 3
11 2020-06-25 4
11 2020-07-11 4
11 2020-07-15 4
I have tried Grouper and resampling, but those give me counts per fixed 30-day bin, not rolling counts.
Thanks in advance!
You can use a time-based rolling window per account:
def get_rolling_amount(grp, freq):
    # closed="both" makes the 30-day window include both endpoints
    return grp.rolling(freq, on="Date", closed="both").count()["Amount"]

df["Date"] = pd.to_datetime(df["Date"])
df["Amount"] = df.groupby("Account").apply(get_rolling_amount, "30D").values
print(df)
Prints:
Account Date Amount
0 10 2020-06-01 1
1 10 2020-06-11 2
2 10 2020-06-21 3
3 10 2020-06-25 4
4 10 2020-07-11 4
5 10 2020-07-15 4
6 11 2020-06-01 1
7 11 2020-06-11 2
8 11 2020-06-21 3
9 11 2020-06-25 4
10 11 2020-07-11 4
11 11 2020-07-15 4
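A variant of the same idea without the helper function (just a sketch; like the answer above, it assumes the rows are already ordered by Account so that the positional assignment via to_numpy() lines up):
import pandas as pd

df["Date"] = pd.to_datetime(df["Date"])
df["Amount"] = (df.groupby("Account")
                  .rolling("30D", on="Date", closed="both")
                  .count()["Amount"]
                  .to_numpy())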
You can use broadcasting within group to check how many rows fall within X days.
import pandas as pd

def within_days(s, days):
    # For each date, count how many dates in the group fall within the
    # `days`-day window ending at that date (both ends inclusive).
    arr = ((s.to_numpy() >= s.to_numpy()[:, None])
           & (s.to_numpy() <= (s + pd.offsets.DateOffset(days=days)).to_numpy()[:, None])).sum(axis=0)
    return pd.Series(arr, index=s.index)

df['Amount'] = df.groupby('Account')['Date'].apply(within_days, days=30)
Account Date Amount
0 10 2020-06-01 1
1 10 2020-06-11 2
2 10 2020-06-21 3
3 10 2020-06-25 4
4 10 2020-07-11 4
5 10 2020-07-15 4
6 11 2020-06-01 1
7 11 2020-06-11 2
8 11 2020-06-21 3
9 11 2020-06-25 4
10 11 2020-07-11 4
11 11 2020-07-15 4
df = df.resample('30D').agg({'Date': 'count', 'Amount': 'sum'})
This will aggregate the 'Date' column by count, getting the data you want.
However, since you will first need to set 'Date' as your index for resampling, you could create a "dummy" column containing zeros:
df['dummy'] = pd.Series(np.zeros(len(df)))
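The fragment above is only partial; a minimal runnable sketch of this resample-based idea, assuming the Account/Date/Amount frame from the question, could look like the following. Note that it gives counts per fixed 30-day bin, not the rolling counts the question asks for:
import pandas as pd

df["Date"] = pd.to_datetime(df["Date"])
out = (df.set_index("Date")
         .groupby("Account")["Amount"]
         .resample("30D")
         .agg(["count", "sum"]))
print(out)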
I have a df as shown below. The data looks like this:
Date y
0 2020-06-14 127
1 2020-06-15 216
2 2020-06-16 4
3 2020-06-17 90
4 2020-06-18 82
5 2020-06-19 70
6 2020-06-20 59
7 2020-06-21 48
8 2020-06-22 23
9 2020-06-23 25
10 2020-06-24 24
11 2020-06-25 22
12 2020-06-26 19
13 2020-06-27 10
14 2020-06-28 18
15 2020-06-29 157
16 2020-06-30 16
17 2020-07-01 14
18 2020-07-02 343
The code to create the data frame:
# Create a dummy dataframe
import pandas as pd
import numpy as np
y0 = [127,216,4,90, 82,70,59,48,23,25,24,22,19,10,18,157,16,14,343]
def initial_forecast(data):
    data['y'] = y0
    return data
# Initial date dataframe
df_dummy = pd.DataFrame({'Date': pd.date_range('2020-06-14', periods=19, freq='1D')})
# Dates
start_date = df_dummy.Date.iloc[1]
print(start_date)
end_date = df_dummy.Date.iloc[17]
print(end_date)
# Adding y0 in the dataframe
df_dummy = initial_forecast(df_dummy)
df_dummy
From the above, I would like to interpolate the data for a particular date range.
I would like to interpolate (linearly) between 2020-06-17 and 2020-06-27.
i.e. from 2020-06-17 to 2020-06-27 the 'y' value goes from 90 to 10 in 10 steps, so on average it decreases by 8 per step:
(90 - 10) / 10 (number of steps) = 8 per step.
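A quick arithmetic check of the step size (an illustration only):
print((90 - 10) / 10)   # 8.0 -> y_new decreases by 8 per day: 90, 82, 74, ..., 18, 10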
The expected output:
Date y y_new
0 2020-06-14 127 127
1 2020-06-15 216 216
2 2020-06-16 4 4
3 2020-06-17 90 90
4 2020-06-18 82 82
5 2020-06-19 70 74
6 2020-06-20 59 66
7 2020-06-21 48 58
8 2020-06-22 23 50
9 2020-06-23 25 42
10 2020-06-24 24 34
11 2020-06-25 22 26
12 2020-06-26 19 18
13 2020-06-27 10 10
14 2020-06-28 18 18
15 2020-06-29 157 157
16 2020-06-30 16 16
17 2020-07-01 14 14
18 2020-07-02 343 343
Note: outside this date range, the y_new value should be the same as y.
I tried the code below, but it does not give the desired output:
# Function
def df_interpolate(df, start_date, end_date):
    df["Date"] = pd.to_datetime(df["Date"])
    df.loc[(df['Date'] >= start_date) & (df['Date'] <= end_date), 'y_new'] = np.nan
    df['y_new'] = df['y'].interpolate().round()
    return df

df1 = df_interpolate(df_dummy, '2020-06-17', '2020-06-27')
With some tweaks to your function it works: use np.where to create the new column, remove the = from your comparisons so the boundary dates keep their original values, interpolate y_new rather than y, and cast to int as per your expected output.
def df_interpolate(df, start_date, end_date):
    df["Date"] = pd.to_datetime(df["Date"])
    df['y_new'] = np.where((df['Date'] > start_date) & (df['Date'] < end_date), np.nan, df['y'])
    df['y_new'] = df['y_new'].interpolate().round().astype(int)
    return df
Date y y_new
0 2020-06-14 127 127
1 2020-06-15 216 216
2 2020-06-16 4 4
3 2020-06-17 90 90
4 2020-06-18 82 82
5 2020-06-19 70 74
6 2020-06-20 59 66
7 2020-06-21 48 58
8 2020-06-22 23 50
9 2020-06-23 25 42
10 2020-06-24 24 34
11 2020-06-25 22 26
12 2020-06-26 19 18
13 2020-06-27 10 10
14 2020-06-28 18 18
15 2020-06-29 157 157
16 2020-06-30 16 16
17 2020-07-01 14 14
18 2020-07-02 343 343
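An equivalent sketch of my own (assuming pandas 1.3+ for the inclusive='neither' keyword of Series.between) that masks only the interior dates, so the boundary values anchor the interpolation:
import pandas as pd

def df_interpolate_mask(df, start_date, end_date):
    df["Date"] = pd.to_datetime(df["Date"])
    inside = df["Date"].between(start_date, end_date, inclusive="neither")
    df["y_new"] = df["y"].mask(inside).interpolate().round().astype(int)
    return df

df1 = df_interpolate_mask(df_dummy, '2020-06-17', '2020-06-27')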
I'm trying to compute the ratio of certain columns in all my dataframes (stored in a dictionary) with respect to an aggregated_data dataframe.
Here data is a dictionary whose keys are the level names and whose values are the corresponding dataframes.
For example:
1) This is what the data looks like (just an example for illustration):
data = {'State': State_data, 'District': District_data}
>>> State_data
Time level value 97E03K 90KFTO FXRDW9 1I4OX9 N6HO97
0 2017-04-01 State NY 15 7 8 19 17
1 2017-05-01 State NY 11 8 9 16 11
2 2017-06-01 State NY 17 16 6 12 17
3 2017-04-01 State WDC 6 17 19 8 20
4 2017-05-01 State WDC 19 9 20 11 17
5 2017-06-01 State WDC 10 11 6 20 11
>>> District_data
Time level value 97E03K 90KFTO FXRDW9 1I4OX9 N6HO97
0 2017-04-01 District Downtown 2 1 5 3 5
1 2017-05-01 District Downtown 4 3 2 4 3
2 2017-06-01 District Downtown 4 3 4 1 3
3 2017-04-01 District Central 3 4 3 5 5
4 2017-05-01 District Central 4 3 5 4 3
5 2017-06-01 District Central 4 3 5 5 3
2) This is what the aggregated data looks like:
Time level value 97E03K 90KFTO FXRDW9 1I4OX9 N6HO97
0 2017-04-01 Aggregated Aggregated 27 21 23 30 21
1 2017-05-01 Aggregated Aggregated 27 29 26 22 30
2 2017-06-01 Aggregated Aggregated 27 30 30 25 25
3 2017-04-01 Aggregated Aggregated 22 27 30 22 25
4 2017-05-01 Aggregated Aggregated 22 21 24 22 29
5 2017-06-01 Aggregated Aggregated 25 27 23 22 24
I have to iterate over each level and find the ratio of that level to the aggregated data, based on this dictionary:
columns_to_work = {'97E03K': '97E03K', '90KFTO': '97E03K', 'FXRDW9': '97E03K', '1I4OX9': '1I4OX9', 'N6HO97': '97E03K'}
For every key, I find the ratio of its mapped value column at the current level to the aggregated level for the same date, and store it under the key's name with a '_rank' suffix.
E.g. for key 90KFTO, the value column 97E03K at the current level has to be divided by the aggregated 97E03K column for the same time point, and this ratio is stored under the key's name as 90KFTO_rank.
Likewise, I compute this for each level and append each result to a list, which I finally concatenate to get a flat dataframe containing the '_rank' columns for all inputted levels.
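A worked example drawn from the tables above, for the first State row (NY, 2017-04-01): the key '90KFTO' maps to source column '97E03K', so 90KFTO_rank = 15 / 27 ≈ 0.555556, while '1I4OX9' maps to itself, so 1I4OX9_rank = 19 / 30 ≈ 0.633333, matching the first row of the output below:
print(15 / 27, 19 / 30)   # 0.5555555555555556 0.6333333333333333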
4) The final output data (ratio of data wrt aggregated) looks something like this:
Time level value 97E03K_rank 90KFTO_rank FXRDW9_rank 1I4OX9_rank N6HO97_rank
0 2017-04-01 State NY 0.555556 0.555556 0.555556 0.633333 0.555556
1 2017-05-01 State NY 0.407407 0.407407 0.407407 0.727273 0.407407
2 2017-06-01 State NY 0.629630 0.629630 0.629630 0.480000 0.629630
3 2017-04-01 State WDC 0.272727 0.272727 0.272727 0.363636 0.272727
4 2017-05-01 State WDC 0.863636 0.863636 0.863636 0.500000 0.863636
5 2017-06-01 State WDC 0.400000 0.400000 0.400000 0.909091 0.400000
6 2017-04-01 District Downtown 0.074074 0.074074 0.074074 0.100000 0.074074
7 2017-05-01 District Downtown 0.148148 0.148148 0.148148 0.181818 0.148148
8 2017-06-01 District Downtown 0.148148 0.148148 0.148148 0.040000 0.148148
9 2017-04-01 District Central 0.136364 0.136364 0.136364 0.227273 0.136364
10 2017-05-01 District Central 0.181818 0.181818 0.181818 0.181818 0.181818
11 2017-06-01 District Central 0.160000 0.160000 0.160000 0.227273 0.160000
Now, this is the approach which needs to be optimized:
samp_data = list()
level = {}
lev = {}  # working copies (this dict was missing from the original snippet)
for l, da in data.items():  # Here l is the key and da is the dataframe
    level[l] = da.copy()
    lev[l] = pd.DataFrame()  # Just a copy to work with
    lev[l] = pd.concat([lev[l], level[l][[tim, 'level', 'value']]], sort=False)  # tim is the name of the time column
    for c, d in columns_to_work.items():
        level[l] = level[l].join(aggregated_data[[d]], on=tim, rsuffix='_rank1')
        level[l].rename(columns={d + '_rank1': c + '_rank'}, inplace=True)
        level[l][c + '_rank'] = level[l][d] / level[l][c + '_rank']
        lev[l] = pd.concat([lev[l], level[l][c + '_rank']], axis=1, sort=False)
    samp_data.append(lev[l])
Explanation of the code, if the logic is still not clear:
In the outer loop I iterate over all levels present in my dictionary, and in the inner loop I iterate over the column names. Here columns_to_work is a dictionary whose keys and values are both columns of my dataframes.
I have to calculate the ratio of column d w.r.t. the aggregated data for my current level and rename the column to c + '_rank'.
Although the above code works fine for small datasets, it fails badly when trying to scale to bigger datasets. I'm looking for an optimized way of achieving the same. Any advice/suggestions will be greatly appreciated :)
P.S. I tried using aggregated_data as a dictionary of lists to improve performance, but some time points present in the aggregated_data may not be in the level data, so the order mapping gets messed up.
This should work:
Step 1: concat state and district data
df = pd.concat([State_data, District_data])
Step 2: join state and district data to the aggregated data (using the index, since there are multiple distinct rows for the same Time)
df = pd.merge(
    left=df,
    left_index=True,
    right=aggregated_data.drop(columns=['level', 'value', 'Time']),
    right_index=True,
    suffixes=['', '_agg']
)
Step 3: Iterate through columns_to_work
for k, v in columns_to_work.items():
    df[f'{k}_rank'] = df[v] / df[f'{v}_agg']
Step 4: Sort df and drop unnecessary columns
df = df[['Time', 'level', 'value', '97E03K_rank', '90KFTO_rank', 'FXRDW9_rank', '1I4OX9_rank', 'N6HO97_rank']].sort_values('level', ascending=False)
End result:
Time level value 97E03K_rank 90KFTO_rank FXRDW9_rank 1I4OX9_rank N6HO97_rank
2017-04-01 State NY 0.556 0.556 0.556 0.633 0.556
2017-05-01 State NY 0.407 0.407 0.407 0.727 0.407
2017-06-01 State NY 0.630 0.630 0.630 0.480 0.630
2017-04-01 State WDC 0.273 0.273 0.273 0.364 0.273
2017-05-01 State WDC 0.864 0.864 0.864 0.500 0.864
2017-06-01 State WDC 0.400 0.400 0.400 0.909 0.400
2017-04-01 District Downtown 0.074 0.074 0.074 0.100 0.074
2017-05-01 District Downtown 0.148 0.148 0.148 0.182 0.148
2017-06-01 District Downtown 0.148 0.148 0.148 0.040 0.148
2017-04-01 District Central 0.136 0.136 0.136 0.227 0.136
2017-05-01 District Central 0.182 0.182 0.182 0.182 0.182
2017-06-01 District Central 0.160 0.160 0.160 0.227 0.160
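The table above appears to be rounded to three decimals for display; if you want that in the result itself, something like the following would do (an assumption on my part, not part of the original answer):
df = df.round(3)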
I have 4 dataframes with value counts of the number of occurrences per month.
I want to compare all 4 value counts in one graph, so I can see the visual difference between every month across these four years.
I would like to have output like the image below, grouped by year and month.
newdf2018.Month.value_counts()
Output:
1 3451
2 3895
3 3408
4 3365
5 3833
6 3543
7 3333
8 3219
9 3447
10 2943
11 3296
12 2909
newdf2017.Month.value_counts()
1 2801
2 3048
3 3620
4 3014
5 3226
6 3962
7 3500
8 3707
9 3601
10 3349
11 3743
12 2002
newdf2016.Month.value_counts()
1 3201
2 2034
3 2405
4 3805
5 3308
6 3212
7 3049
8 3777
9 3275
10 3099
11 3775
12 2115
newdf2015.Month.value_counts()
1 2817
2 2604
3 2711
4 2817
5 2670
6 2507
7 3256
8 2195
9 3304
10 3238
11 2005
12 2008
Create a dictionary of DataFrames, concat them together, then plot:
dfs = {2015:newdf2015, 2016:newdf2016, 2017:newdf2017, 2018:newdf2018}
df = pd.concat({k:v['Month'].value_counts() for k, v in dfs.items()}, axis=1)
df.plot.bar()
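One possible follow-up (a sketch, not part of the original answer): value_counts orders by frequency rather than by month number, so sorting the index before plotting keeps the months in calendar order, and axis labels make the chart easier to read:
import matplotlib.pyplot as plt

ax = df.sort_index().plot.bar()
ax.set_xlabel('Month')
ax.set_ylabel('Count')
plt.show()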