I have a dataframe with multiple columns and 700+ rows, and a series of 27 rows. I want to add a new column to the dataframe by matching the series index against the values of an existing column in df.
This is the dataframe I have; I need to add the series values whose index matches the value in "Reason for absence":
ID Reason for absence Month of absence Day of the week Seasons
0 11 26 7 3 1
1 36 0 7 3 1
2 3 23 7 4 1
3 7 7 7 5 1
4 11 23 7 5 1
5 3 23 7 6 1
6 10 22 7 6 1
7 20 23 7 6 1
8 14 19 7 2 1
9 1 22 7 2 1
10 20 1 7 2 1
11 20 1 7 3 1
12 20 11 7 4 1
13 3 11 7 4 1
14 3 23 7 4 1
15 24 14 7 6 1
16 3 23 7 6 1
17 3 21 7 2 1
18 6 11 7 5 1
19 33 23 8 4 1
20 18 10 8 4 1
21 3 11 8 2 1
22 10 13 8 2 1
23 20 28 8 6 1
24 11 18 8 2 1
25 10 25 8 2 1
26 11 23 8 3 1
27 30 28 8 4 1
28 11 18 8 4 1
29 3 23 8 6 1
30 3 18 8 2 1
31 2 18 8 5 1
32 1 23 8 5 1
33 2 18 8 2 1
34 3 23 8 2 1
35 10 23 8 2 1
36 11 24 8 3 1
37 19 11 8 5 1
38 2 28 8 6 1
39 20 23 8 6 1
40 27 23 9 3 1
41 34 23 9 2 1
42 3 23 9 3 1
43 5 19 9 3 1
44 14 23 9 4 1
This is the series, s_conditions:
0 Not absent
1 Infectious and parasitic diseases
2 Neoplasms
3 Diseases of the blood
4 Endocrine, nutritional and metabolic diseases
5 Mental and behavioural disorders
6 Diseases of the nervous system
7 Diseases of the eye
8 Diseases of the ear
9 Diseases of the circulatory system
10 Diseases of the respiratory system
11 Diseases of the digestive system
12 Diseases of the skin
13 Diseases of the musculoskeletal system
14 Diseases of the genitourinary system
15 Pregnancy and childbirth
16 Conditions from perinatal period
17 Congenital malformations
18 Symptoms not elsewhere classified
19 Injury
20 External causes
21 Factors influencing health status
22 Patient follow-up
23 Medical consultation
24 Blood donation
25 Laboratory examination
26 Unjustified absence
27 Physiotherapy
28 Dental consultation
dtype: object
I tried this:
df1.insert(loc=0, column="Reason_for_absence", value=s_conditions)
Output (this is wrong, because I need the Reason_for_absence column to be looked up from the s_conditions index using the value in "Reason for absence", not the row index):
Reason_for_absence ID Reason for absence \
0 Not absent 11 26
1 Infectious and parasitic diseases 36 0
2 Neoplasms 3 23
3 Diseases of the blood 7 7
4 Endocrine, nutritional and metabolic diseases 11 23
5 Mental and behavioural disorders 3 23
6 Diseases of the nervous system 10 22
7 Diseases of the eye 20 23
8 Diseases of the ear 14 19
9 Diseases of the circulatory system 1 22
10 Diseases of the respiratory system 20 1
11 Diseases of the digestive system 20 1
12 Diseases of the skin 20 11
13 Diseases of the musculoskeletal system 3 11
14 Diseases of the genitourinary system 3 23
15 Pregnancy and childbirth 24 14
16 Conditions from perinatal period 3 23
17 Congenital malformations 3 21
18 Symptoms not elsewhere classified 6 11
19 Injury 33 23
20 External causes 18 10
21 Factors influencing health status 3 11
22 Patient follow-up 10 13
23 Medical consultation 20 28
24 Blood donation 11 18
25 Laboratory examination 10 25
26 Unjustified absence 11 23
27 Physiotherapy 30 28
28 Dental consultation 11 18
29 NaN 3 23
30 NaN 3 18
31 NaN 2 18
32 NaN 1 23
I am getting values only up to row 28 and NaN after that. Instead, I need the series values matched by index for all the rows.
While this question is a bit confusing, it seems the goal is to match the series index against the dataframe's "Reason for Absence" column. If that is correct, below is a small example of how to do it. Keep in mind that the resulting dataframe will be sorted by the 'Reason for Absence Numerical' column. If my understanding is incorrect, please clarify the question so we can better assist you.
import pandas as pd

d = {'ID': [11,36,3], 'Reason for Absence Numerical': [3,2,1], 'Day of the Week': [4,2,6]}
dataframe = pd.DataFrame(data=d)
s = {0: 'Not absent', 1:'Neoplasms', 2:'Injury', 3:'Diseases of the eye'}
disease_series = pd.Series(data=s)

def add_series_to_df(df, series, index_val):
    # Rows whose numeric reason equals this series index
    df_filtered = df[df['Reason for Absence Numerical'] == index_val].copy()
    series_filtered = series[series.index == index_val]
    if not df_filtered.empty:
        df_filtered['Reason for Absence Text'] = series_filtered.item()
        return df_filtered
    # Falls through and returns None when nothing matches; pd.concat drops None entries

# Assumes the series index is 0..n-1
x = [add_series_to_df(dataframe, disease_series, index_val) for index_val in range(len(disease_series.index))]
new_df = pd.concat(x)
print(new_df)
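If the goal really is a plain lookup of each "Reason for absence" value in the series index, a shorter route is probably Series.map, which also keeps the original row order. This is a minimal sketch on a four-row subset of the question's data; the names df1, s_conditions and the new column name are taken from the question, and the small frames are built here only for illustration.

```python
import pandas as pd

# Small subset of the question's data, rebuilt here for illustration
df1 = pd.DataFrame({
    "ID": [11, 36, 3, 7],
    "Reason for absence": [26, 0, 23, 7],
})
s_conditions = pd.Series({
    0: "Not absent",
    7: "Diseases of the eye",
    23: "Medical consultation",
    26: "Unjustified absence",
})

# Look up each code in the series index; codes missing from the series become NaN
df1["Reason_for_absence"] = df1["Reason for absence"].map(s_conditions)
print(df1)
```

Unlike insert, which aligns the series on the dataframe's row index, map aligns on the values of the column, so every row gets a label as long as its code exists in the series.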
Every date in df is present in ref_date of ref_df, but not vice versa. For each date in df, I need to pick a ref_date from ref_df using the following logic:
If a date is repeated and the previous or next ref_date(s) are missing from df, then, starting from the edges of the repeated block, assign those rows to the nearest missing previous or next ref_date(s).
If a date is repeated but there is no missing previous/next ref_date, then ref_date is the same as date.
Some ref_date(s) may remain unused in the result. This happens when there are not enough repeated dates around a given ref_date to fill it.
Example:
>>> import pandas as pd
>>> from datetime import datetime as dt
>>> df = pd.DataFrame({'date':[dt(2020,1,20), dt(2020,1,20), dt(2020,1,20), dt(2020,2,25), dt(2020,3,18), dt(2020,3,18), dt(2020,4,9), dt(2020,4,12), dt(2020,4,12), dt(2020,4,12), dt(2020,4,12), dt(2020,4,12), dt(2020,5,28), dt(2020,6,1), dt(2020,6,1), dt(2020,6,1), dt(2020,6,28), dt(2020,6,28)], 'qty':range(18)})
>>> ref_df = pd.DataFrame({'ref_date':[dt(2019,12,8), dt(2020,1,20), dt(2020,2,25), dt(2020,3,18), dt(2020,4,9), dt(2020,4,10), dt(2020,4,12), dt(2020,4,13), dt(2020,4,14), dt(2020,5,28), dt(2020,5,29), dt(2020,5,30), dt(2020,6,1), dt(2020,6,2), dt(2020,6,3), dt(2020,6,28), dt(2020,6,29), dt(2020,7,7)]})
>>> df
date qty
0 2020-01-20 0
1 2020-01-20 1
2 2020-01-20 2
3 2020-02-25 3
4 2020-03-18 4
5 2020-03-18 5
6 2020-04-09 6
7 2020-04-12 7
8 2020-04-12 8
9 2020-04-12 9
10 2020-04-12 10
11 2020-04-12 11
12 2020-05-28 12
13 2020-06-01 13
14 2020-06-01 14
15 2020-06-01 15
16 2020-06-28 16
17 2020-06-28 17
>>> ref_df
ref_date
0 2019-12-08
1 2020-01-20
2 2020-02-25
3 2020-03-18
4 2020-04-09
5 2020-04-10
6 2020-04-12
7 2020-04-13
8 2020-04-14
9 2020-05-28
10 2020-05-29
11 2020-05-30
12 2020-06-01
13 2020-06-02
14 2020-06-03
15 2020-06-28
16 2020-06-29
17 2020-07-07
Expected_output:
>>> df
date qty ref_date
0 2020-01-20 0 2019-12-08
1 2020-01-20 1 2020-01-20 # Note: repeated as no gap
2 2020-01-20 2 2020-01-20
3 2020-02-25 3 2020-02-25
4 2020-03-18 4 2020-03-18
5 2020-03-18 5 2020-03-18 # Note: repeated as no gap
6 2020-04-09 6 2020-04-09
7 2020-04-12 7 2020-04-10 # Note: Filling from the edges
8 2020-04-12 8 2020-04-12
9 2020-04-12 9 2020-04-12 # Note: repeated as not enough gap
10 2020-04-12 10 2020-04-13
11 2020-04-12 11 2020-04-14
12 2020-05-28 12 2020-05-28
13 2020-06-01 13 2020-05-30 # Filling nearest previous
14 2020-06-01 14 2020-06-01 # First filling previous
15 2020-06-01 15 2020-06-02 # Filling nearest next
16 2020-06-28 16 2020-06-28
17 2020-06-28 17 2020-06-29
I am able to get the answer, but it doesn't look like the most efficient way to do so. Could someone suggest a better way to do it?
import numpy as np
ref_df['date'] = ref_df['ref_date']
df = pd.merge_asof(df, ref_df, on='date', direction='nearest')
df = df.rename(columns={'ref_date':'nearest_ref_date'})
nrd_cnt = df.groupby('nearest_ref_date')['date'].count().reset_index().rename(columns={'date':'nrd_count'})
nrd_cnt['lc'] = nrd_cnt['nearest_ref_date'].shift(1)
nrd_cnt['uc'] = nrd_cnt['nearest_ref_date'].shift(-1)
df = df.merge(nrd_cnt, how='left', on='nearest_ref_date')
# TODO: Review this. Capped at 100 iterations to avoid an infinite loop (in case of edge cases)
for _ in range(100):
    df2 = df.copy()
    df2['days'] = np.abs((df2['nearest_ref_date'] - df2['date']).dt.days)
    df2['repeat_rank'] = df2.groupby('nearest_ref_date')['days'].rank(method='first')
    reduced_ref_df = ref_df[~ref_df['ref_date'].isin(df2['nearest_ref_date'].unique())]
    df2 = pd.merge_asof(df2, reduced_ref_df, on='date', direction='nearest')
    df2 = df2.rename(columns={'ref_date':'new_nrd'})
    df2.loc[(df2['new_nrd']<=df2['lc']) | (df2['new_nrd']>=df2['uc']), 'new_nrd'] = pd.to_datetime(np.nan)
    df2.loc[(~pd.isna(df2['new_nrd'])) & (df2['repeat_rank'] > 1), 'nearest_ref_date'] = df2['new_nrd']
    df2 = df2[['date', 'qty', 'nearest_ref_date', 'lc', 'uc']]
    if df.equals(df2):
        break
    df = df2
df = df[['date', 'qty', 'nearest_ref_date']]
df.loc[:, 'repeat_rank'] = df.groupby('nearest_ref_date')['nearest_ref_date'].rank(method='first')
df = pd.merge_asof(df, ref_df, on='date', direction='nearest')
# Repeated nearest_ref_date set to nearest ref_date
df.loc[df['repeat_rank'] > 1, 'nearest_ref_date'] = df['ref_date']
# Sorting nearest_ref_date within the ref_date group (without changing order of rest of cols).
df.loc[:, 'nearest_ref_date'] = df[['ref_date', 'nearest_ref_date']].sort_values(['ref_date', 'nearest_ref_date']).reset_index().drop('index',axis=1)['nearest_ref_date']
df = df[['date', 'qty', 'nearest_ref_date']]
df
date qty ref_date
0 2020-01-20 0 2019-12-08
1 2020-01-20 1 2020-01-20
2 2020-01-20 2 2020-01-20
3 2020-02-25 3 2020-02-25
4 2020-03-18 4 2020-03-18
5 2020-03-18 5 2020-03-18
6 2020-04-09 6 2020-04-09
7 2020-04-12 7 2020-04-10
8 2020-04-12 8 2020-04-12
9 2020-04-12 9 2020-04-12
10 2020-04-12 10 2020-04-13
11 2020-04-12 11 2020-04-14
12 2020-05-28 12 2020-05-28
13 2020-06-01 13 2020-05-30
14 2020-06-01 14 2020-06-01
15 2020-06-01 15 2020-06-02
16 2020-06-28 16 2020-06-28
17 2020-06-28 17 2020-06-29
I have a sample dataframe
Account Date Amount
10 2020-06-01 100
10 2020-06-11 500
10 2020-06-21 600
10 2020-06-25 900
10 2020-07-11 1000
10 2020-07-15 600
11 2020-06-01 100
11 2020-06-11 200
11 2020-06-21 500
11 2020-06-25 1500
11 2020-07-11 2500
11 2020-07-15 6700
I want to get the number of rows in each rolling 30-day window for each account, i.e.
Account Date Amount
10 2020-06-01 1
10 2020-06-11 2
10 2020-06-21 3
10 2020-06-25 4
10 2020-07-11 4
10 2020-07-15 4
11 2020-06-01 1
11 2020-06-11 2
11 2020-06-21 3
11 2020-06-25 4
11 2020-07-11 4
11 2020-07-15 4
I have tried Grouper and resampling, but those give me the counts per 30-day bin and not the rolling counts.
Thanks in advance!
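import pandas as pd

def get_rolling_amount(grp, freq):
    # Count rows in the trailing window of length freq ending at each Date (both ends inclusive)
    return grp.rolling(freq, on="Date", closed="both").count()

df["Date"] = pd.to_datetime(df["Date"])
# Keep only the "Amount" counts; .values assigns by position, so df is assumed sorted by Account
df["Amount"] = df.groupby("Account").apply(get_rolling_amount, "30D")["Amount"].values.astype(int)
print(df)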
Prints:
Account Date Amount
0 10 2020-06-01 1
1 10 2020-06-11 2
2 10 2020-06-21 3
3 10 2020-06-25 4
4 10 2020-07-11 4
5 10 2020-07-15 4
6 11 2020-06-01 1
7 11 2020-06-11 2
8 11 2020-06-21 3
9 11 2020-06-25 4
10 11 2020-07-11 4
11 11 2020-07-15 4
You can use broadcasting within group to check how many rows fall within X days.
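import pandas as pd

def within_days(s, days):
    # Entry [i, j] is True when s[i] <= s[j] <= s[i] + days; summing over i counts,
    # for each row j, the rows dated within the previous `days` days (inclusive)
    arr = ((s.to_numpy() >= s.to_numpy()[:, None])
           & (s.to_numpy() <= (s + pd.offsets.DateOffset(days=days)).to_numpy()[:, None])).sum(axis=0)
    return pd.Series(arr, index=s.index)

# group_keys=False keeps the original row index so the result aligns with df
df['Amount'] = df.groupby('Account', group_keys=False)['Date'].apply(within_days, days=30)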
Account Date Amount
0 10 2020-06-01 1
1 10 2020-06-11 2
2 10 2020-06-21 3
3 10 2020-06-25 4
4 10 2020-07-11 4
5 10 2020-07-15 4
6 11 2020-06-01 1
7 11 2020-06-11 2
8 11 2020-06-21 3
9 11 2020-06-25 4
10 11 2020-07-11 4
11 11 2020-07-15 4
df = df.resample('30D').agg({'date':'count','Amount':'sum'})
This will aggregate the 'Date' column by count within each 30-day bin.
However, since you first need to set the date as your index for resampling, you could create a "dummy" column containing zeros to count instead:
df['dummy'] = np.zeros(len(df))
I have a time series data as follows:
ds y
0 2016-10-31 2000
1 2016-11-30 3000
2 2016-12-31 5000
3 2017-01-31 5000
4 2017-02-28 4000
5 2017-03-31 4500
6 2017-04-30 10000
7 2017-05-31 6500
8 2017-06-30 3500
9 2017-07-31 5500
10 2017-08-31 2000
11 2017-09-30 3000
12 2017-10-31 10000
13 2017-11-30 5000
14 2017-12-31 4000
15 2018-01-31 4500
16 2018-02-28 5000
17 2018-03-31 6500
18 2018-04-30 3500
19 2018-05-31 5500
20 2018-06-30 2000
21 2018-07-31 3000
22 2018-08-31 10000
23 2018-09-30 5000
24 2018-10-31 4000
25 2018-11-30 4500
26 2018-12-31 5000
27 2019-01-31 6500
28 2019-02-28 3500
29 2019-03-31 5500
I have applied FB Prophet change point detection algorithm to extract changepoints.
When I specify 5 changepoints in the code, I get the following changepoints:
5 2017-03-31
9 2017-07-31
14 2017-12-31
18 2018-04-30
23 2018-09-30
When I specify 7 changepoints in the code, I get the following changepoints:
3 2017-01-31
7 2017-05-31
10 2017-08-31
13 2017-11-30
16 2018-02-28
20 2018-06-30
23 2018-09-30
Why does the algorithm not detect points 6, 12, and 22, where the value changes the most compared to the previous point?
My code is below:
from fbprophet import Prophet
import pandas as pd
import matplotlib.pyplot as plt
m = Prophet(growth='linear', n_changepoints = 7, changepoint_range=0.8, changepoint_prior_scale=0.5)
m.fit(df)
future = m.make_future_dataframe(freq = 'M', periods=3)
fcst = m.predict(future)
from fbprophet.plot import add_changepoints_to_plot
fig = m.plot(fcst)
a = add_changepoints_to_plot(fig.gca(), m, fcst)
m.changepoints
Changepoints mark where the trend of the data changes. Your points 6, 12, and 22 are outliers, or perhaps holiday effects, and changepoints are not designed to capture these in a reliable way. Taking your 7-changepoint example, Prophet analyzed the following trend segments:
2016-10-31 - 2016-12-31
2016-12-31 - 2017-05-31
2017-05-31 - 2017-08-31
2017-08-31 - 2017-11-30
2017-11-30 - 2018-02-28
2018-02-28 - 2018-06-30
2018-06-30 - 2018-09-30
2018-09-30 - 2019-03-31 (and beyond)
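To make the distinction concrete: what the question describes ("maximum change compared to the previous point") is a point-to-point difference, not a trend change. The following is a minimal sketch of my own, using plain pandas rather than anything from Prophet, that finds those spikes directly from the y values in the question.

```python
import pandas as pd

# y values from the question, in row order 0..29
y = pd.Series([2000, 3000, 5000, 5000, 4000, 4500, 10000, 6500, 3500, 5500,
               2000, 3000, 10000, 5000, 4000, 4500, 5000, 6500, 3500, 5500,
               2000, 3000, 10000, 5000, 4000, 4500, 5000, 6500, 3500, 5500])

# Change relative to the previous point (the first entry is NaN)
step = y.diff()

# Row positions of the three largest upward jumps
largest_jumps = sorted(step.nlargest(3).index.tolist())
print(largest_jumps)  # [6, 12, 22]
```

These are single-point spikes that fall back immediately afterwards, which is why a piecewise-linear trend does not need a changepoint at them.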
How do I add rows for all dates between two columns?
ID Initiation_Date Step Start_Date End_Date Days
P-03 29-11-2018 3 2018-11-29 2018-12-10 11.0
P-04 29-11-2018 4 2018-12-03 2018-12-07 4.0
P-05 29-11-2018 5 2018-12-07 2018-12-07 0.0
Use:
import pandas as pd

mydata = [{'ID' : '10', 'Entry Date': '10/10/2016', 'Exit Date': '15/10/2016'},
          {'ID' : '20', 'Entry Date': '10/10/2016', 'Exit Date': '18/10/2016'}]
df = pd.DataFrame(mydata)
#convert columns to datetimes (the strings are day-first, e.g. 15/10/2016)
df[['Entry Date','Exit Date']] = df[['Entry Date','Exit Date']].apply(pd.to_datetime, dayfirst=True)
#repeat index by difference of dates
df = df.loc[df.index.repeat((df['Exit Date'] - df['Entry Date']).dt.days + 1)]
#add counter duplicated rows to day timedeltas to new column
df['Date'] = df['Entry Date'] + pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
#default RangeIndex
df = df.reset_index(drop=True)
print (df)
Entry Date Exit Date ID Date
0 2016-10-10 2016-10-15 10 2016-10-10
1 2016-10-10 2016-10-15 10 2016-10-11
2 2016-10-10 2016-10-15 10 2016-10-12
3 2016-10-10 2016-10-15 10 2016-10-13
4 2016-10-10 2016-10-15 10 2016-10-14
5 2016-10-10 2016-10-15 10 2016-10-15
6 2016-10-10 2016-10-18 20 2016-10-10
7 2016-10-10 2016-10-18 20 2016-10-11
8 2016-10-10 2016-10-18 20 2016-10-12
9 2016-10-10 2016-10-18 20 2016-10-13
10 2016-10-10 2016-10-18 20 2016-10-14
11 2016-10-10 2016-10-18 20 2016-10-15
12 2016-10-10 2016-10-18 20 2016-10-16
13 2016-10-10 2016-10-18 20 2016-10-17
14 2016-10-10 2016-10-18 20 2016-10-18
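The same result can probably also be reached with pd.date_range and DataFrame.explode. Below is a sketch applied to the Start_Date and End_Date columns from the question; only those columns and ID are rebuilt here, and the new column name Date and the variable expanded are assumptions made for illustration.

```python
import pandas as pd

# Subset of the question's columns, rebuilt here for illustration
df = pd.DataFrame({
    "ID": ["P-03", "P-04", "P-05"],
    "Start_Date": pd.to_datetime(["2018-11-29", "2018-12-03", "2018-12-07"]),
    "End_Date": pd.to_datetime(["2018-12-10", "2018-12-07", "2018-12-07"]),
})

# One list of daily timestamps per row, both endpoints included
df["Date"] = [
    list(pd.date_range(start, end, freq="D"))
    for start, end in zip(df["Start_Date"], df["End_Date"])
]

# Turn each list element into its own row, then restore the datetime dtype
expanded = df.explode("Date", ignore_index=True)
expanded["Date"] = pd.to_datetime(expanded["Date"])
print(expanded)
```

This is easier to read than repeating the index, at the cost of building Python lists, so the index.repeat approach is likely faster on large frames.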