Group by multiple columns and calculate the average sum - python-3.x

I have the below dataframe:
Customer Category Month Mon_exp
1 A 1 200
1 A 1 100
1 A 2 150
1 B 2 150
1 B 3 300
2 A 1 300
2 A 1 200
2 A 2 150
2 B 2 150
2 B 3 400
Expected dataframe:
Customer Category Month Mon_exp Ave_Mon_exp
1 A 1 200 300
1 A 1 100 300
1 A 2 150 300
1 B 2 150 300
1 B 3 300 300
2 A 1 300 400
2 A 1 200 400
2 A 2 150 400
2 B 2 150 400
2 B 3 400 400
Explanation for the new column 'Ave_Mon_exp':
For each customer, sum 'Mon_exp' and divide by the count of unique 'Month' values.
For example, for Customer 1 the sum of 'Mon_exp' is 900 and the count of unique 'Month' values is 3, hence Ave_Mon_exp is 300.
Can anyone help me derive the new column 'Ave_Mon_exp'?
Thanks

import pandas as pd

sample_df = pd.DataFrame({'Customer': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                          'Category': ['A', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'B', 'B'],
                          'Month': [1, 1, 2, 2, 3, 1, 1, 2, 2, 3],
                          'Mon_exp': [200, 100, 150, 150, 300, 300, 200, 150, 150, 400]})

# Per-customer sum of Mon_exp divided by the per-customer count of unique months
new_col = (sample_df.groupby('Customer')['Mon_exp'].sum()
           / sample_df.groupby('Customer')['Month'].nunique())
new_col.name = 'Ave_Mon_exp'
sample_df = sample_df.join(new_col, on='Customer')
print(sample_df)
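For reference, a minimal alternative sketch (same sample_df as above) using groupby().transform, which broadcasts each per-customer result back onto every row so no join is needed:
sample_df['Ave_Mon_exp'] = (sample_df.groupby('Customer')['Mon_exp'].transform('sum')
                            / sample_df.groupby('Customer')['Month'].transform('nunique'))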

Related

Grouping several dataframe columns based on another column's values

I have this dataframe:
refid col2 price1 factor1 price2 factor2 price3 factor3
0 1 a 200 1 180 3 150 10
1 2 b 500 1 450 3 400 10
2 3 c 700 1 620 2 550 5
And I need to get this output:
refid col2 price factor
0 1 a 200 1
1 1 b 500 1
2 1 c 700 1
3 2 a 180 3
4 2 b 450 3
5 2 c 620 2
6 3 a 150 10
7 3 b 400 10
8 3 c 550 5
Right now I'm trying to use the df.melt method but can't get it to work. This is the code and the current result:
df2_melt = df2.melt(id_vars=["refid", "col2"],
                    value_vars=["price1", "price2", "price3",
                                "factor1", "factor2", "factor3"],
                    var_name="Price",
                    value_name="factor")
refid col2 price factor
0 1 a price1 200
1 2 b price1 500
2 3 c price1 700
3 1 a price2 180
4 2 b price2 450
5 3 c price2 620
6 1 a price3 150
7 2 b price3 400
8 3 c price3 550
9 1 a factor1 1
10 2 b factor1 1
11 3 c factor1 1
12 1 a factor2 3
13 2 b factor2 3
14 3 c factor2 2
15 1 a factor3 10
16 2 b factor3 10
17 3 c factor3 5
Since you have a wide DataFrame with common prefixes, you can use wide_to_long:
out = pd.wide_to_long(df, stubnames=['price', 'factor'],
                      i=["refid", "col2"], j='num').droplevel(-1).reset_index()
Output:
refid col2 price factor
0 1 a 200 1
1 1 a 180 3
2 1 a 150 10
3 2 b 500 1
4 2 b 450 3
5 2 b 400 10
6 3 c 700 1
7 3 c 620 2
8 3 c 550 5
Note that your expected output has an error where factors don't align with refids.
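If you do want the suffix-major row order from your expected output (all the price1/factor1 rows first), one possible tweak is to keep the suffix level and sort on it before dropping it:
out = (pd.wide_to_long(df, stubnames=['price', 'factor'],
                       i=["refid", "col2"], j='num')
         .reset_index()
         .sort_values(['num', 'refid'])  # suffix-major order
         .drop(columns='num'))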
You can also melt twice and then concat the results:
import pandas as pd

df = pd.DataFrame({'refid': [1, 2, 3], 'col2': ['a', 'b', 'c'],
                   'price1': [200, 500, 700], 'factor1': [1, 1, 1],
                   'price2': [180, 450, 620], 'factor2': [3, 3, 2],
                   'price3': [150, 400, 550], 'factor3': [10, 10, 5]})

prices = [c for c in df if c.startswith('price')]
factors = [c for c in df if c.startswith('factor')]
df1 = pd.melt(df, id_vars=["refid", "col2"], value_vars=prices, value_name='price').drop('variable', axis=1)
df2 = pd.melt(df, id_vars=["refid", "col2"], value_vars=factors, value_name='factor').drop('variable', axis=1)
df3 = pd.concat([df1, df2['factor']], axis=1).reset_index(drop=True)
print(df3)
Here is the output:
refid col2 price factor
0 1 a 200 1
1 2 b 500 1
2 3 c 700 1
3 1 a 180 3
4 2 b 450 3
5 3 c 620 2
6 1 a 150 10
7 2 b 400 10
8 3 c 550 5
One option is pivot_longer from pyjanitor:
# pip install pyjanitor
import janitor
import pandas as pd

(df
 .pivot_longer(
     index=['refid', 'col2'],
     names_to='.value',
     names_pattern=r"(.+)\d",
     sort_by_appearance=True)
)
refid col2 price factor
0 1 a 200 1
1 1 a 180 3
2 1 a 150 10
3 2 b 500 1
4 2 b 450 3
5 2 b 400 10
6 3 c 700 1
7 3 c 620 2
8 3 c 550 5
The idea behind this particular reshape is that whatever the regular-expression group paired with .value captures (here price or factor, without the trailing digit) stays as the column header.
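For comparison, a rough pure-pandas equivalent of the same idea (assumption: every reshaped column name is a stub plus a single trailing digit):
long = df.melt(id_vars=['refid', 'col2'])
long = long.join(long['variable'].str.extract(r'(?P<stub>.+)(?P<num>\d)'))
out = (long.pivot(index=['refid', 'col2', 'num'], columns='stub', values='value')
           .reset_index()
           .drop(columns='num')
           .rename_axis(None, axis=1))
The column order comes out alphabetical (factor before price), so reorder if that matters.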

Merge pandas data frame based on specific conditions

I have a df as shown below
df1:
ID Job Salary
1 A 100
2 B 200
3 B 20
4 C 150
5 A 500
6 A 600
7 A 200
8 B 150
df2:
ID Type Status Age
1 2 P 23
2 1 P 28
8 1 F 33
4 3 P 48
14 1 F 23
11 2 P 28
16 2 F 23
41 3 P 38
df3:
ID T_Type Amount
1 K 20
2 L -50
1 K 30
3 K 5
1 K 100
2 L -50
1 L -30
25 K 500
1 K 20
4 L -80
19 K 30
2 K -5
Explanation about the data:
ID is the primary key of df1.
ID is the primary key of df2.
df3 does not have any primary key.
From the above, I would like to prepare the dfs below.
1. IDs which are in df1 and df2.
Expected output1:
ID Job Salary
1 A 100
2 B 200
4 C 150
8 B 150
2. IDs which are in df1 but not in df2
output2:
ID Job Salary
3 B 20
5 A 500
6 A 600
7 A 200
3. IDs which are in both df1 and df3
output3:
ID Job Salary
1 A 100
2 B 200
3 B 20
4 C 150
4. IDs which are in df1 but not in df3.
output4:
ID Job Salary
5 A 500
6 A 600
7 A 200
8 B 150
>>> # 1. IDs which are in df1 and df2.
>>> df1[df1['ID'].isin(df2['ID'])]
ID Job Salary
0 1 A 100
1 2 B 200
3 4 C 150
7 8 B 150
>>> # 2. IDs which are there in df1 and not in df2
>>> df1[~df1['ID'].isin(df2['ID'])]
ID Job Salary
2 3 B 20
4 5 A 500
5 6 A 600
6 7 A 200
>>> # 3. IDs which are there in df1 and df3
>>> df1[df1['ID'].isin(df3['ID'])]
ID Job Salary
0 1 A 100
1 2 B 200
2 3 B 20
3 4 C 150
>>> # 4. IDs which are there in df1 and not in df3.
>>> df1[~df1['ID'].isin(df3['ID'])]
ID Job Salary
4 5 A 500
5 6 A 600
6 7 A 200
7 8 B 150
Actually, your expected results aren't merges at all, but rather selections, based on whether df1.ID is (or is not) in the ID column of the other DataFrame.
To get your expected results, run the following commands:
result_1 = df1[df1.ID.isin(df2.ID)]
result_2 = df1[~df1.ID.isin(df2.ID)]
result_3 = df1[df1.ID.isin(df3.ID)]
result_4 = df1[~df1.ID.isin(df3.ID)]
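If you do want the same results via an actual merge, a rough equivalent sketch uses an indicator column (dropping duplicate IDs from df3 first so rows aren't multiplied):
m = df1.merge(df3[['ID']].drop_duplicates(), on='ID', how='left', indicator=True)
result_3 = m[m['_merge'] == 'both'].drop(columns='_merge')
result_4 = m[m['_merge'] == 'left_only'].drop(columns='_merge')
The same pattern with df2 in place of df3 gives result_1 and result_2.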

Aggregate the given data frame based on the specific conditions in pandas

I have a df as shown below
df:
ID Number_of_Cars Age_in_days Total_amount Total_N Type
1 2 100 10000 100 A
2 5 10 1000 2 B
3 1 1000 1000 200 B
4 1 20 0 0 C
5 3 1000 100000 20 A
6 6 100 10000 20 C
7 4 200 10000 200 A
from the above df I would like to prepare df1 as shown below
df1:
ID Avg_Monthly_Amount Avg_Monthly_N Type
1 3000 30 A
2 3000 6 B
3 30 6 B
4 0 0 C
5 3000 0.6 A
6 3000 6 C
7 1500 30 A
Explanation:
Avg_Monthly_Amount is the average amount per 30-day month, and Avg_Monthly_N is the average N per 30-day month.
To prepare df1, I tried the code below:
df['Avg_Monthly_Amount'] = df['Total_amount'] / df['Age_in_days'] * 30
df['Avg_Monthly_N'] = df['Total_N'] / df['Age_in_days'] * 30
From df and df1 (or df alone) I would like to prepare below dataframe as df2
I could not a write a proper code to generate below df2
Explanation:
Aggregate the above numbers at the Type level.
Example:
There are 3 customers (ID = 1, 5, 7) with Type = A, hence for Type = A, Number_Of_Type = 3.
Avg_Cars for Type = A is (2+3+4)/3 = 3.
Avg_age_in_years for Type = A is ((100+1000+200)/3)/365.
Avg_amount_monthly for Type = A is the mean of Avg_Monthly_Amount over the Type = A rows of df1.
Avg_N_monthly for Type = A is the mean of Avg_Monthly_N over the Type = A rows of df1.
Final expected output (df2)
Type Number_Of_Type Avg_Cars Avg_age_in_years Avg_amount_monthly Avg_N_monthly
A 3 3 1.19 2500 20.2
B 2 3 1.38 1515 6
C 2 3.5 0.16 1500 3
You don't need to prepare a separate df named df1 from your original dataframe df.
Your dataframe df:
ID Number_of_Cars Age_in_days Total_amount Total_N Type
1 2 100 10000 100 A
2 5 10 1000 2 B
3 1 1000 1000 200 B
4 1 20 0 0 C
5 3 1000 100000 20 A
6 6 100 10000 20 C
7 4 200 10000 200 A
After you create/import df:
df['Avg_Monthly_Amount'] = df['Total_amount'] / df['Age_in_days'] * 30
df['Avg_Monthly_N'] = df['Total_N'] / df['Age_in_days'] * 30
df['Age_in_year']=df['Age_in_days']/365
Then:
df2 = (df.groupby('Type')
         .agg({'Type': 'count', 'Number_of_Cars': 'mean', 'Age_in_year': 'mean',
               'Avg_Monthly_Amount': 'mean', 'Avg_Monthly_N': 'mean'})
         .rename(columns={'Type': 'Number_Of_Type'}))
Now if you print df2 (or simply display it in a Jupyter notebook) you get your desired output.
Output:
Number_Of_Type Number_of_Cars Age_in_year Avg_Monthly_Amount Avg_Monthly_N
Type
A 3 3.0 1.187215 2500.0 20.2
B 2 3.0 1.383562 1515.0 6.0
C 2 3.5 0.164384 1500.0 3.0
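To match the column names and flat layout of the expected df2 exactly, a possible follow-up:
df2 = (df2.rename(columns={'Number_of_Cars': 'Avg_Cars',
                           'Age_in_year': 'Avg_age_in_years',
                           'Avg_Monthly_Amount': 'Avg_amount_monthly',
                           'Avg_Monthly_N': 'Avg_N_monthly'})
          .reset_index())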

Collapse/Transpose Columns of a DataFrame Based on Repeating Names - pandas

I have a data frame sample_df like this,
   id  pd   pd_dt     pd_tp  pd.1  pd_dt.1   pd_tp.1  pd.2  pd_dt.2   pd_tp.2
0   1  100  per year    468   200  per year      400   300  per year      320
1   2  100  per year     60   200  per year      890   300  per year      855
I need my output like this,
id  pd   pd_dt     pd_tp
1   100  per year    468
1   200  per year    400
1   300  per year    320
2   100  per year     60
2   200  per year    890
2   300  per year    855
I tried the following,
sample_df.stack().reset_index().drop('level_1',axis=1)
This does not work.
The columns pd, pd_dt, pd_tp repeat with .1, .2, ... suffixes. How can I achieve this output?
You want pd.wide_to_long, but with a tweak, since your first few columns don't share the same pattern as the rest:
# rename columns so every stub has a numeric suffix
df.columns = [x + '.0' if '.' not in x and x != 'id' else x
              for x in df.columns]
pd.wide_to_long(df, stubnames=['pd', 'pd_dt', 'pd_tp'],
                i='id', j='order', sep='.')
Output:
          pd   pd_dt     pd_tp
id order
1  0      100  per year    468
2  0      100  per year     60
1  1      200  per year    400
2  1      200  per year    890
1  2      300  per year    320
2  2      300  per year    855
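To get back to the asker's flat layout, a possible follow-up is to reset the index, sort, and drop the helper level:
out = (pd.wide_to_long(df, stubnames=['pd', 'pd_dt', 'pd_tp'],
                       i='id', j='order', sep='.')
         .reset_index()
         .sort_values(['id', 'order'])
         .drop(columns='order'))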
You can use numpy split to split it into n arrays and concatenate them back together, then repeat the id column once per group. Note the integer division, since np.split needs an integer number of sections:
import numpy as np

n_groups = (df.shape[1] - 1) // 3  # number of pd/pd_dt/pd_tp triples
new_df = pd.DataFrame(np.concatenate(np.split(df.iloc[:, 1:].values, n_groups, axis=1)))
new_df.columns = ['pd', 'pd_dt', 'pd_tp']
new_df['id'] = pd.concat([df.id] * n_groups, ignore_index=True)
new_df.sort_values('id')
Result:
   pd   pd_dt     pd_tp  id
0  100  per year    468   1
2  200  per year    400   1
4  300  per year    320   1
1  100  per year     60   2
3  200  per year    890   2
5  300  per year    855   2
You can do this:
# assumes id is the index; if it is a column, set it first
df = df.set_index('id')

dt_mask = df.columns.str.contains('dt')
tp_mask = df.columns.str.contains('tp')

new_df = pd.DataFrame()
new_df['pd'] = df[df.columns[~(dt_mask | tp_mask)]].stack().reset_index(level=1, drop=True)
new_df['pd_dt'] = df[df.columns[dt_mask]].stack().reset_index(level=1, drop=True)
new_df['pd_tp'] = df[df.columns[tp_mask]].stack().reset_index(level=1, drop=True)
new_df.reset_index(inplace=True)
print(new_df)
   id  pd   pd_dt     pd_tp
0   1  100  per year    468
1   1  200  per year    400
2   1  300  per year    320
3   2  100  per year     60
4   2  200  per year    890
5   2  300  per year    855

summing up certain rows in a pandas dataframe

I have a pandas dataframe with 1000 rows and 10 columns. I am looking to aggregate rows 100-1000 and replace them with just one row, where the index value is '>100' and each column value is the sum of rows 100-1000 of that column. Any ideas on a simple way of doing this? Thanks in advance
Say I have the below
a b c
0 1 10 100
1 2 20 100
2 3 60 100
3 5 80 100
and I want it replaced with
a b c
0 1 10 100
1 2 20 100
>1 8 140 200
You could use loc (or ix in old pandas; it has since been removed) for this, but it shows a SettingWithCopyWarning:
ind = 1
mask = df.index > ind
df1 = df[~mask]
df1.loc['>1', :] = df[mask].sum()
In [69]: df1
Out[69]:
a b c
0 1 10 100
1 2 20 100
>1 8 140 200
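The warning goes away if you take an explicit copy before adding the summary row; a minimal sketch:
ind = 1
mask = df.index > ind
df1 = df[~mask].copy()  # explicit copy, so no SettingWithCopyWarning
df1.loc['>{}'.format(ind)] = df[mask].sum()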
To set it without the warning you could also do it with pd.concat. It may not be elegant because of the two transposes, but it works:
ind = 1
mask = df.index > ind
df1 = pd.concat([df[~mask].T, df[mask].sum()], axis=1).T
df1.index = df1.index.tolist()[:-1] + ['>{}'.format(ind)]
In [36]: df1
Out[36]:
a b c
0 1 10 100
1 2 20 100
>1 8 140 200
Some demonstrations:
In [37]: df.index > ind
Out[37]: array([False, False, True, True], dtype=bool)
In [38]: df[mask].sum()
Out[38]:
a 8
b 140
c 200
dtype: int64
In [40]: pd.concat([df[~mask].T, df[mask].sum()], axis=1).T
Out[40]:
a b c
0 1 10 100
1 2 20 100
0 8 140 200
