I have these dataframes:
import pandas as pd
import numpy as np
from functools import reduce
a = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
                  'A_val': [0.1, np.nan, 0.3, np.nan, 0.5], 'B_val': [1.233, np.nan, 1.4, np.nan, 1.9]})
b = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
                  'A_val': [np.nan, 0.2, np.nan, 0.4, np.nan], 'B_val': [np.nan, 1.56, np.nan, 1.1, np.nan]})
c = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
                  'C_val': [121, np.nan, 334, np.nan, 555], 'D_val': [10.233, np.nan, 10.4, np.nan, 10.9]})
d = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
                  'C_val': [np.nan, 322, np.nan, 454, np.nan], 'D_val': [np.nan, 10.56, np.nan, 10.1, np.nan]})
I am dropping the NaN values:
a.dropna(inplace=True)
b.dropna(inplace=True)
c.dropna(inplace=True)
d.dropna(inplace=True)
And then, I want to merge them and get this result:
id gr_code A_val B_val C_val D_val
1 121 0.1 1.233 121.0 10.233
2 121 0.2 1.56 322 10.56
3 134 0.3 1.400 334.0 10.400
4 155 0.4 1.10 454.0 10.10
5 156 0.5 1.900 555.0 10.900
but whatever I try, it introduces NaN values.
For example:
df = pd.concat([a, b, c, d], axis=1)
df = df.loc[:,~df.columns.duplicated()]
gives:
id gr_code A_val B_val C_val D_val
1.0 121.0 0.1 1.233 121.0 10.233
3.0 134.0 0.3 1.400 334.0 10.400
5.0 156.0 0.5 1.900 555.0 10.900
NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN
If I try:
df_list = [a, b, c, d]
df = reduce(lambda left, right: pd.merge(left, right,
                                         on=['id', 'gr_code'],
                                         how='outer'), df_list)
it gives:
id gr_code A_val_x B_val_x A_val_y B_val_y C_val_x D_val_x C_val_y D_val_y
1 121 0.1 1.233 NaN NaN 121.0 10.233 NaN NaN
3 134 0.3 1.400 NaN NaN 334.0 10.400 NaN NaN
5 156 0.5 1.900 NaN NaN 555.0 10.900 NaN NaN
2 121 NaN NaN 0.2 1.56 NaN NaN 322.0 10.56
4 155 NaN NaN 0.4 1.10 NaN NaN 454.0 10.10
More dataframes:
e = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
                  'E_val': [0.11, np.nan, 0.13, np.nan, 0.35], 'F_val': [11.233, np.nan, 11.4, np.nan, 11.9]})
f = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156],
                  'E_val': [np.nan, 3222, np.nan, 4541, np.nan], 'F_val': [np.nan, 110.56, np.nan, 101.1, np.nan]})
You can use concat and then combine the duplicated columns with a groupby on the column names:
df = (pd.concat([frame.set_index(['id', 'gr_code']) for frame in df_list], axis=1)
        .groupby(level=0, axis=1).first()  # keep the first non-NaN value per column name
        .reset_index()
      )
output:
id gr_code A_val B_val C_val D_val
0 1 121 0.1 1.233 121.0 10.233
1 2 121 0.2 1.560 322.0 10.560
2 3 134 0.3 1.400 334.0 10.400
3 4 155 0.4 1.100 454.0 10.100
4 5 156 0.5 1.900 555.0 10.900
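Note: groupby(..., axis=1) is deprecated in recent pandas versions. As an alternative sketch (assuming the same df_list as above), the frames can be folded together with combine_first, which also keeps the first non-NaN value per cell:
from functools import reduce

df = reduce(
    lambda left, right: left.combine_first(right),   # fill the gaps in left with values from right
    [frame.set_index(['id', 'gr_code']) for frame in df_list]
).reset_index()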
I have a dataframe like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(
[
[2, np.nan, np.nan, np.nan, np.nan],
[np.nan, 2, np.nan, np.nan, np.nan],
[np.nan, np.nan, 2, np.nan, np.nan],
[np.nan, 2, 2, np.nan, np.nan],
[2, np.nan, 2, np.nan, 2],
[2, np.nan, np.nan, 2, np.nan],
[np.nan, 2, 2, 2, np.nan],
[2, np.nan, np.nan, np.nan, 2]
],
index=list('abcdefgh'), columns=list('ABCDE')
)
df
A B C D E
a 2.0 NaN NaN NaN NaN
b NaN 2.0 NaN NaN NaN
c NaN NaN 2.0 NaN NaN
d NaN 2.0 2.0 NaN NaN
e 2.0 NaN 2.0 NaN 2.0
f 2.0 NaN NaN 2.0 NaN
g NaN 2.0 2.0 2.0 NaN
h 2.0 NaN NaN NaN 2.0
For each row, I would like to fill with 0 the one NaN immediately before and the one NaN immediately after a non-NaN value (only one NaN on each side), using pandas.
My desired output would be the following:
A B C D E
a 2.0 0.0 NaN NaN NaN
b 0.0 2.0 0.0 NaN NaN
c NaN 0.0 2.0 0.0 NaN
d 0.0 2.0 2.0 0.0 NaN
e 2.0 0.0 2.0 0.0 2.0
f 2.0 0.0 0.0 2.0 0.0
g 0.0 2.0 2.0 2.0 0.0
h 2.0 0.0 NaN 0.0 2.0
I know how to do it with for loops, but I was wondering if it is possible to do it only with pandas.
Thank you very much for your help!
You can shift forward and backward along the columns (axis=1) and mask:
cond = (df.notna().shift(axis=1, fill_value=False)         # non-NaN cell to the left
        | df.notna().shift(-1, axis=1, fill_value=False)    # non-NaN cell to the right
        ) & df.isna()                                        # and the cell itself is NaN
df.mask(cond, 0)
output:
A B C D E
a 2.0 0.0 NaN NaN NaN
b 0.0 2.0 0.0 NaN NaN
c NaN 0.0 2.0 0.0 NaN
d 0.0 2.0 2.0 0.0 NaN
e 2.0 0.0 2.0 0.0 2.0
f 2.0 0.0 0.0 2.0 0.0
g 0.0 2.0 2.0 2.0 0.0
h 2.0 0.0 NaN 0.0 2.0
NB. This transformation is called a binary dilation; you can also use scipy.ndimage.binary_dilation for that (the scipy.ndimage.morphology module path is deprecated). The advantage of this method is that you can use various structuring elements (not only left/right/top/bottom).
import numpy as np
from scipy.ndimage import binary_dilation  # top-level import; the morphology submodule is deprecated

struct = np.array([[True, False, True]])   # look one cell to the left and one to the right
df.mask(df.isna() & binary_dilation(df.notna(), structure=struct), 0)
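For example, a plus-shaped structuring element also dilates up and down, so each non-NaN value would additionally fill the single NaN directly above and below it (an illustrative sketch, not part of the original answer; the df.isna() guard keeps existing values untouched):
struct_cross = np.array([[False, True, False],
                         [True, False, True],
                         [False, True, False]])
df.mask(df.isna() & binary_dilation(df.notna(), structure=struct_cross), 0)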
I'm trying to iterate over a huge pandas dataframe (over 370,000 rows) based on the index.
For each row, the code should look back at the last 12 entries of this index (if available) and sum them up by (running) quarters / half years / year.
If there is no information, or not enough information (e.g. only the last 3 months), the code should treat the missing months / quarters as 0.
Here is a sample of my dataframe:
This is the expected output:
So looking at DateID "1", we don't have any other information for this row. DateID "1" is the last month in this case (month 12, so to speak) and therefore falls in Q4 and H2. All other previous months do not exist and are therefore not considered.
I already found a working solution, but it's very inefficient and takes an unacceptable amount of time.
Here is my code sample:
new_df = pd.DataFrame()  # collects one result row per input row
for company_name, c in df.groupby('Account Name'):
    for i, row in c.iterrows():
        i += 1
        if i < 4:
            q4 = c.iloc[:i]['Value$'].sum()
            q3 = 0
            q2 = 0
            q1 = 0
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        elif 3 < i < 7:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[:i-3]['Value$'].sum()
            q2 = 0
            q1 = 0
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        elif 6 < i < 10:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[i-6:i-3]['Value$'].sum()
            q2 = c.iloc[:i-6]['Value$'].sum()
            q1 = 0
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        elif 9 < i < 13:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[i-6:i-3]['Value$'].sum()
            q2 = c.iloc[i-9:i-6]['Value$'].sum()
            q1 = c.iloc[:i-9]['Value$'].sum()
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        else:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[i-6:i-3]['Value$'].sum()
            q2 = c.iloc[i-9:i-6]['Value$'].sum()
            q1 = c.iloc[i-12:i-9]['Value$'].sum()
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        new_df = new_df.append({'Account Name': row['Account Name'], 'DateID': row['DateID'],
                                'Q4': q4, 'Q3': q3, 'Q2': q2, 'Q1': q1,
                                'H1': h1, 'H2': h2, 'Year': year}, ignore_index=True)
As I said, I'm looking for a more efficient way to calculate these numbers, as I have almost 10,000 Account Names and 30 DateIDs per Account.
Thanks a lot!
If I got you right, this should calculate your figures:
grouped = df.groupby('Account Name')['Value$']
last_3 = grouped.apply(lambda ser: ser.rolling(window=3, min_periods=1).sum())
last_6 = grouped.apply(lambda ser: ser.rolling(window=6, min_periods=1).sum())
last_9 = grouped.apply(lambda ser: ser.rolling(window=9, min_periods=1).sum())
last_12 = grouped.apply(lambda ser: ser.rolling(window=12, min_periods=1).sum())
df['Q4'] = last_3
df['Q3'] = last_6 - last_3
df['Q2'] = last_9 - last_6
df['Q1'] = last_12 - last_9
df['H1'] = df['Q1'] + df['Q2']
df['H2'] = df['Q3'] + df['Q4']
This outputs:
Out[19]:
Account Name DateID Value$ Q4 Q3 Q2 Q1 H1 H2
0 A 0 33 33.0 0.0 0.0 0.0 0.0 33.0
1 A 1 20 53.0 0.0 0.0 0.0 0.0 53.0
2 A 2 24 77.0 0.0 0.0 0.0 0.0 77.0
3 A 3 21 65.0 33.0 0.0 0.0 0.0 98.0
4 A 4 22 67.0 53.0 0.0 0.0 0.0 120.0
5 A 5 31 74.0 77.0 0.0 0.0 0.0 151.0
6 A 6 30 83.0 65.0 33.0 0.0 33.0 148.0
7 A 7 23 84.0 67.0 53.0 0.0 53.0 151.0
8 A 8 11 64.0 74.0 77.0 0.0 77.0 138.0
9 A 9 35 69.0 83.0 65.0 33.0 98.0 152.0
10 A 10 32 78.0 84.0 67.0 53.0 120.0 162.0
11 A 11 31 98.0 64.0 74.0 77.0 151.0 162.0
12 A 12 32 95.0 69.0 83.0 65.0 148.0 164.0
13 A 13 20 83.0 78.0 84.0 67.0 151.0 161.0
14 A 14 15 67.0 98.0 64.0 74.0 138.0 165.0
15 B 0 44 44.0 0.0 0.0 0.0 0.0 44.0
16 B 1 43 87.0 0.0 0.0 0.0 0.0 87.0
17 B 2 31 118.0 0.0 0.0 0.0 0.0 118.0
18 B 3 10 84.0 44.0 0.0 0.0 0.0 128.0
19 B 4 13 54.0 87.0 0.0 0.0 0.0 141.0
20 B 5 20 43.0 118.0 0.0 0.0 0.0 161.0
21 B 6 28 61.0 84.0 44.0 0.0 44.0 145.0
22 B 7 14 62.0 54.0 87.0 0.0 87.0 116.0
23 B 8 20 62.0 43.0 118.0 0.0 118.0 105.0
24 B 9 41 75.0 61.0 84.0 44.0 128.0 136.0
25 B 10 39 100.0 62.0 54.0 87.0 141.0 162.0
26 B 11 46 126.0 62.0 43.0 118.0 161.0 188.0
27 B 12 26 111.0 75.0 61.0 84.0 145.0 186.0
28 B 13 24 96.0 100.0 62.0 54.0 116.0 196.0
29 B 14 34 84.0 126.0 62.0 43.0 105.0 210.0
32 C 2 12 12.0 0.0 0.0 0.0 0.0 12.0
33 C 3 15 27.0 0.0 0.0 0.0 0.0 27.0
34 C 4 45 72.0 0.0 0.0 0.0 0.0 72.0
35 C 5 22 82.0 12.0 0.0 0.0 0.0 94.0
36 C 6 48 115.0 27.0 0.0 0.0 0.0 142.0
37 C 7 45 115.0 72.0 0.0 0.0 0.0 187.0
38 C 8 11 104.0 82.0 12.0 0.0 12.0 186.0
39 C 9 27 83.0 115.0 27.0 0.0 27.0 198.0
For the following test data:
data= {'Account Name': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
'DateID': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 2, 3, 4, 5, 6, 7, 8, 9],
'Value$': [33, 20, 24, 21, 22, 31, 30, 23, 11, 35, 32, 31, 32, 20, 15, 44, 43, 31, 10, 13, 20, 28, 14, 20, 41, 39, 46, 26, 24, 34, 12, 15, 45, 22, 48, 45, 11, 27]
}
df= pd.DataFrame(data)
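The question also asks for a yearly total; assuming the columns computed above, it can be added as the sum of the two half years (a small addition, not part of the original snippet):
df['Year'] = df['H1'] + df['H2']   # equals Q1 + Q2 + Q3 + Q4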
Edit: If you want to count the unique entries over the same period, you can do that as follows:
def get_nunique(np_array):
    # number of distinct values in the rolling window
    return len(np.unique(np_array))

df['Category'].rolling(window=3, min_periods=1).apply(get_nunique)
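If the unique counts should also be computed per Account Name (an assumption here, mirroring the grouped rolling sums above), a sketch could look like this:
per_group = (df.groupby('Account Name')['Category']
               .rolling(window=3, min_periods=1)
               .apply(lambda w: len(np.unique(w)), raw=True))
# drop the group level so the result aligns with the original index again
df['unique_3_months'] = per_group.reset_index(level=0, drop=True)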
I didn't want to overload the answer above, so I'm adding a new one for the second part of your question:
# define a function that
# creates the unique counts
# by aggregating period_length times
# so 3 times for the quarter mapping
# and 6 times for the half year
# it's basically doing something like
# a sliding window aggregation
def get_mapping(df, period_length=3):
    df_mapping = None
    for offset in range(period_length):
        quarter = (df['DateID'] + offset) // period_length
        aggregated = df.groupby([quarter, df['Account Name']]).agg(
            {'DateID': 'max', 'Category': lambda ser: len(set(ser))})
        incomplete_data = (
            ((aggregated['DateID'] + offset + 1) // period_length
             <= aggregated.index.get_level_values(0))
            & (aggregated.index.get_level_values(0) >= period_length)
        )
        aggregated.drop(aggregated.index[incomplete_data].to_list(), inplace=True)
        aggregated.set_index('DateID', append=True, inplace=True)
        aggregated = aggregated.droplevel(0, axis='index')
        if df_mapping is None:
            df_mapping = aggregated
        else:
            df_mapping = pd.concat([df_mapping, aggregated], axis='index')
    return df_mapping
# apply it for 3 months and merge it to the source df
df_mapping = get_mapping(df, period_length=3)
df_mapping.columns = ['unique_3_months']
df_with_3_months = df.merge(df_mapping, left_on=['Account Name', 'DateID'],
                            how='left', right_index=True)

# do the same for 6 months and merge it again
df_mapping = get_mapping(df, period_length=6)
df_mapping.columns = ['unique_6_months']
df_with_6_months = df_with_3_months.merge(df_mapping, left_on=['Account Name', 'DateID'],
                                          how='left', right_index=True)
This results in:
Out[305]:
Account Name DateID Value$ Category unique_3_months unique_6_months
0 A 0 10 1 1 1
1 A 1 12 2 2 2
2 A 1 38 1 2 2
3 A 2 20 3 3 3
4 A 3 25 3 3 3
5 A 4 24 4 2 4
6 A 5 27 8 3 5
7 A 6 30 5 3 6
8 A 7 47 7 3 5
9 A 8 30 4 3 5
10 A 9 17 7 2 4
11 A 10 20 8 3 4
12 A 11 33 8 2 4
13 A 12 45 9 2 4
14 A 13 19 2 3 5
15 A 14 24 10 3 3
15 A 14 24 10 3 4
15 A 14 24 10 3 4
15 A 14 24 10 3 5
15 A 14 24 10 3 1
15 A 14 24 10 3 2
16 B 0 41 2 1 1
17 B 1 13 9 2 2
18 B 2 17 6 3 3
19 B 3 45 7 3 4
20 B 4 11 6 2 4
21 B 5 38 8 3 5
22 B 6 44 8 2 4
23 B 7 15 8 1 3
24 B 8 50 2 2 4
25 B 9 27 7 3 4
26 B 10 38 10 3 4
27 B 11 25 6 3 5
28 B 12 25 8 3 5
29 B 13 14 7 3 5
30 B 14 25 9 3 3
30 B 14 25 9 3 4
30 B 14 25 9 3 5
30 B 14 25 9 3 5
30 B 14 25 9 3 1
30 B 14 25 9 3 2
31 C 2 31 9 1 1
32 C 3 31 7 2 2
33 C 4 26 5 3 3
34 C 5 11 2 3 4
35 C 6 15 8 3 5
36 C 7 22 2 2 5
37 C 8 33 2 2 4
38 C 9 16 5 2 3
38 C 9 16 5 2 3
38 C 9 16 5 2 3
38 C 9 16 5 2 1
38 C 9 16 5 2 2
38 C 9 16 5 2 2
The output is based on the following input data:
data= {
'Account Name': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
'DateID': [0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 2, 3, 4, 5, 6, 7, 8, 9],
'Value$': [10, 12, 38, 20, 25, 24, 27, 30, 47, 30, 17, 20, 33, 45, 19, 24, 41, 13, 17, 45, 11, 38, 44, 15, 50, 27, 38, 25, 25, 14, 25, 31, 31, 26, 11, 15, 22, 33, 16],
'Category': [1, 2, 1, 3, 3, 4, 8, 5, 7, 4, 7, 8, 8, 9, 2, 10, 2, 9, 6, 7, 6, 8, 8, 8, 2, 7, 10, 6, 8, 7, 9, 9, 7, 5, 2, 8, 2, 2, 5]
}
df= pd.DataFrame(data)