pandas - sort columns and group by a particular field - python-3.x

I have a list of objects:
[
  {
    "companyid": long,
    "parentid": long,
    "score": long,
    ...
  }
]
The parentid is simply the cid (companyid) of the parent company.
Sample data looks something like this:
cid parentid score
1 10 1000
2 10 100
3 10 1001
10 10 20
11 100 1000
12 100 100
100 100 200
111 1000 10
112 1000 100
1000 100 2000
I need to sort the values by score, but I want to keep the values grouped by parentid.
I tried this, which didn't really fit my requirements, since it groups first and then sorts within each group:
df.groupby('parentid').apply(lambda x: x.sort_values('score'))
Sorting by score will give this result:
cid parentid score
1000 100 2000
3 10 1001
1 10 1000
11 100 1000
100 100 200
2 10 100
112 1000 100
12 100 100
10 10 20
111 1000 10
Grouping by parentid on the sorted data (which is my end goal) should give this result:
cid parentid score
1000 100 2000
11 100 1000 // since 100 is the parentid, it needs to be pushed up in the result set
100 100 200 // if multiple records are pushed up, then sorting should be based on score
12 100 100
3 10 1001 // the 2nd group is parentid 10, since 1001 is the next highest score that
1 10 1000 // doesn't belong to the parentid 100 group
2 10 100
10 10 20
112 1000 100
111 1000 10
I am using pandas v0.24.2 and Python 3.7, if it matters.

Try this:
df.sort_values(['parentid', 'score'], ascending=[False, False])
Output:
cid parentid score
8 112 1000 100
7 111 1000 10
9 1000 100 2000
4 11 100 1000
6 100 100 200
5 12 100 100
2 3 10 1001
0 1 10 1000
1 2 10 100
3 10 10 20
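Sorting by parentid descending orders the groups by their id rather than by their scores, so it will not always match the expected result above. If whole groups should be ordered by their highest score, one sketch using a temporary helper column (the name max_score is only an illustration):
df['max_score'] = df.groupby('parentid')['score'].transform('max')
out = (df.sort_values(['max_score', 'parentid', 'score'],
                      ascending=[False, True, False])
         .drop(columns='max_score'))
print(out)
With the sample data this puts the parentid 100 group first (top score 2000), then the parentid 10 group (top score 1001), then the parentid 1000 group, matching the expected result.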

Related

I have a table with a numeric column AGE. I want to cluster consecutive rows with the same number and count them.

AGE CARD SCORE
10 1 20000
10 1 3000
25 0 2000
10 1 20000
18 1 3000
10 0 2000
12 1 20000
10 1 3000
10 0 2000
I want to count Age 10 as 4: the first two rows form one group and count as 1, each 10 that appears on its own counts individually, and the last two rows (another group of age 10) again count as 1.
Assuming that data is in a table named "Table1":
=COUNT(1/FREQUENCY(IF(Table1[AGE]=10,ROW(Table1[AGE])),IF(Table1[AGE]<>10,ROW(Table1[AGE]))))
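For reference in pandas (the library used elsewhere on this page), a sketch of the same run-counting idea; the small frame below is just the AGE column re-typed, and 10 is the value being counted:
import pandas as pd

df = pd.DataFrame({'AGE': [10, 10, 25, 10, 18, 10, 12, 10, 10]})
is_ten = df['AGE'].eq(10)
# a run starts wherever a 10 is not directly preceded by another 10
runs = (is_ten & ~is_ten.shift(fill_value=False)).sum()
print(runs)  # 4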

Collapse/Transpose Columns of a DataFrame Based on Repeating - pandas

I have a data frame sample_df like this,
id pd pd_dt pd_tp pd.1 pd_dt.1 pd_tp.1 pd.2 pd_dt.2 pd_tp.2
0 1 100 per year 468 200 per year 400 300 per year 320
1 2 100 per year 60 200 per year 890 300 per year 855
I need my output like this,
id pd pd_dt pd_tp
1 100 per year 468
1 200 per year 400
1 300 per year 320
2 100 per year 60
2 200 per year 890
2 300 per year 855
I tried the following,
sample_df.stack().reset_index().drop('level_1',axis=1)
This does not work.
The pd, pd_dt, pd_tp columns repeat with .1, .2, ... suffixes. How can I achieve this output?
You want pd.wide_to_long, but with some tweak since your first few columns do not share the same patterns with the rest:
# rename: give the first block of columns a '.0' suffix so every value column matches the stub pattern
df.columns = [x + '.0' if '.' not in x and x != 'id' else x
              for x in df.columns]
pd.wide_to_long(df, stubnames=['pd', 'pd_dt', 'pd_tp'],
                i='id', j='order', sep='.')
Output:
pd pd_dt pd_tp
id order
1 0 100 per year 468
2 0 100 per year 60
1 1 200 per year 400
2 1 200 per year 890
1 2 300 per year 320
2 2 300 per year 855
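If the rows should come out ordered by id as in the requested output, a small follow-up sketch (reset_index drops the MultiIndex produced above, and the helper order column is then removed):
out = (pd.wide_to_long(df, stubnames=['pd', 'pd_dt', 'pd_tp'],
                       i='id', j='order', sep='.')
         .reset_index()
         .sort_values(['id', 'order'])
         .drop(columns='order'))
print(out)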
You can use numpy split to split it into n arrays and concatenate them back together, then repeat the id column to match the number of rows in the new dataframe:
import numpy as np

# split the nine value columns into blocks of three and stack the blocks vertically
new_df = pd.DataFrame(np.concatenate(np.split(df.iloc[:, 1:].values, (df.shape[1] - 1) // 3, axis=1)))
new_df.columns = ['pd', 'pd_dt', 'pd_tp']
# repeat the id column once per block
new_df['id'] = pd.concat([df['id']] * (new_df.shape[0] // len(df)), ignore_index=True)
new_df.sort_values('id')
Result:
pd pd_dt pd_tp id
0 100 per year 468 1
2 200 per year 400 1
4 300 per year 320 1
1 100 per year 60 2
3 200 per year 890 2
5 300 per year 855 2
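If you also want the id column first and a clean 0..n-1 index, as in the requested output, an optional cosmetic follow-up:
new_df = new_df.sort_values('id').reset_index(drop=True)[['id', 'pd', 'pd_dt', 'pd_tp']]
print(new_df)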
You can do this:
# use the id column as the index so the three stacked pieces align on it
df = df.set_index('id')
dt_mask = df.columns.str.contains('dt')
tp_mask = df.columns.str.contains('tp')
new_df = pd.DataFrame()
new_df['pd'] = df[df.columns[~(dt_mask | tp_mask)]].stack().reset_index(level=1, drop=True)
new_df['pd_dt'] = df[df.columns[dt_mask]].stack().reset_index(level=1, drop=True)
new_df['pd_tp'] = df[df.columns[tp_mask]].stack().reset_index(level=1, drop=True)
new_df.reset_index(inplace=True)
print(new_df)
id pd pd_dt pd_tp
0 1 100 per_year 468
1 1 200 per_year 400
2 1 300 per_year 320
3 2 100 per_year 60
4 2 200 per_year 890
5 2 300 per_year 855

Apply multiple operations on same columns after groupby

I have the following df,
id year_month amount
10 201901 10
10 201901 20
10 201901 30
20 201902 40
20 201902 20
I want to group by id and year_month and then get the group size and the sum of amount:
df.groupby(['id', 'year_month'], as_index=False)['amount'].sum()
df.groupby(['id', 'year_month'], as_index=False).size().reset_index(name='count')
I am wondering how to do both at the same time, in one line:
id year_month amount count
10 201901 60 3
20 201902 60 2
Use agg:
df.groupby(['id', 'year_month']).agg({'amount': ['count', 'sum']})
amount
count sum
id year_month
10 201901 3 60
20 201902 2 60
If you want to remove the multi-index, use MultiIndex.droplevel:
s = df.groupby(['id', 'year_month']).agg({'amount': ['count', 'sum']}).rename(columns={'sum': 'amount'})
s.columns = s.columns.droplevel(level=0)
s.reset_index()
id year_month count amount
0 10 201901 3 60
1 20 201902 2 60
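If you would rather avoid the MultiIndex entirely and your pandas is 0.25 or newer (the question does not state a version), named aggregation gives flat columns directly; a sketch:
out = (df.groupby(['id', 'year_month'])
         .agg(amount=('amount', 'sum'), count=('amount', 'size'))
         .reset_index())
print(out)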

Group by multiple columns and calculate the average sum

I have the below dataframe :
Customer Category Month Mon_exp
1 A 1 200
1 A 1 100
1 A 2 150
1 B 2 150
1 B 3 300
2 A 1 300
2 A 1 200
2 A 2 150
2 B 2 150
2 B 3 400
Expected Dataframe :
Customer Category Month Mon_exp Ave_Mon_exp
1 A 1 200 300
1 A 1 100 300
1 A 2 150 300
1 B 2 150 300
1 B 3 300 300
2 A 1 300 400
2 A 1 200 400
2 A 2 150 400
2 B 2 150 400
2 B 3 400 400
Explanation for the new column 'Ave_Mon_exp':
For each customer, sum the 'Mon_exp' and divide it by the count of unique 'Month' values.
For example, for Customer 1 the sum of 'Mon_exp' is 900 and the count of unique 'Month' values is 3, so the Ave_Mon_exp is 300.
Can anyone help me derive the new column 'Ave_Mon_exp'?
Thanks
import pandas as pd

sample_df = pd.DataFrame({'Customer': [1,1,1,1,1,2,2,2,2,2],
                          'Category': ['A','A','A','B','B','A','A','A','B','B'],
                          'Month': [1,1,2,2,3,1,1,2,2,3],
                          'Mon_exp': [200,100,150,150,300,300,200,150,150,400]})
# per-customer total spend divided by the number of distinct months
new_col = sample_df.groupby('Customer')['Mon_exp'].sum() / sample_df.groupby('Customer')['Month'].nunique()
new_col.name = 'Ave_Mon_exp'
sample_df = sample_df.join(new_col, on='Customer')
print(sample_df)
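An equivalent one-liner sketch using transform (same logic, no join needed):
sample_df['Ave_Mon_exp'] = (sample_df.groupby('Customer')['Mon_exp'].transform('sum')
                            / sample_df.groupby('Customer')['Month'].transform('nunique'))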

pandas calculate scores for each group based on multiple functions

I have the following df,
group_id code amount date
1 100 20 2017-10-01
1 100 25 2017-10-02
1 100 40 2017-10-03
1 100 25 2017-10-03
2 101 5 2017-11-01
2 102 15 2017-10-15
2 103 20 2017-11-05
I'd like to group by group_id and then compute a score for each group based on the following features:
if code values are all the same in a group, score 0 and 10 otherwise;
if amount sum is > 100, score 20 and 0 otherwise;
sort_values by date in descending order and sum the differences between the dates, if the sum < 5, score 30, otherwise 0.
so the result df looks like,
group_id code amount date score
1 100 20 2017-10-01 50
1 100 25 2017-10-02 50
1 100 40 2017-10-03 50
1 100 25 2017-10-03 50
2 101 5 2017-11-01 10
2 102 15 2017-10-15 10
2 103 20 2017-11-05 10
here are the functions that correspond to each feature above:
import numpy as np

def amount_score(df, amount_col, thold=100):
    if df[amount_col].sum() > thold:
        return 20
    else:
        return 0

def col_uniq_score(df, col_name):
    if df[col_name].nunique() == 1:
        return 0
    else:
        return 10

def date_diff_score(df, col_name):
    df.sort_values(by=[col_name], ascending=False, inplace=True)
    if df[col_name].diff().dropna().sum() / np.timedelta64(1, 'D') < 5:
        return 30
    else:
        return 0
I am wondering how to apply these functions to each group and calculate the sum of all the functions to give a score.
You can try groupby.transform, which returns a Series the same length as the original DataFrame, together with numpy.where for the if-else logic on each Series:
grouped = df.sort_values('date', ascending=False).groupby('group_id', sort=False)
a = np.where(grouped['code'].transform('nunique') == 1, 0, 10)
print (a)
[10 10 10 0 0 0 0]
b = np.where(grouped['amount'].transform('sum') > 100, 20, 0)
print (b)
[ 0 0 0 20 20 20 20]
c = np.where(grouped['date'].transform(lambda x:x.diff().dropna().sum()).dt.days < 5, 30, 0)
print (c)
[30 30 30 30 30 30 30]
df['score'] = a + b + c
print (df)
group_id code amount date score
0 1 100 20 2017-10-01 40
1 1 100 25 2017-10-02 40
2 1 100 40 2017-10-03 40
3 1 100 25 2017-10-03 50
4 2 101 5 2017-11-01 50
5 2 102 15 2017-10-15 50
6 2 103 20 2017-11-05 50
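Note that a, b and c follow the row order of the sorted frame, not of df, so the plain positional assignment above can pair scores with the wrong rows (each group should get the same score on every one of its rows). A sketch of one way to keep the alignment, wrapping the result in a Series indexed like the sorted frame before assigning (this is a variation, not part of the answer above):
s = df.sort_values('date', ascending=False)
g = s.groupby('group_id', sort=False)
a = np.where(g['code'].transform('nunique') == 1, 0, 10)
b = np.where(g['amount'].transform('sum') > 100, 20, 0)
c = np.where(g['date'].transform(lambda x: x.diff().dropna().sum()).dt.days < 5, 30, 0)
df['score'] = pd.Series(a + b + c, index=s.index)
With the sample data this gives 50 for every group_id 1 row and 40 for every group_id 2 row.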
