Apply multiple operations on same columns after groupby - python-3.x

I have the following df,
id year_month amount
10 201901 10
10 201901 20
10 201901 30
20 201902 40
20 201902 20
I want to groupby id and year-month and then get the group size and sum of amount,
df.groupby(['id', 'year_month'], as_index=False)['amount'].sum()
df.groupby(['id', 'year_month'], as_index=False).size().reset_index(name='count')
I am wondering how to do it at the same time in one line;
id year_month amount count
10 201901 60 3
20 201902 60 2

Use agg:
df.groupby(['id', 'year_month']).agg({'amount': ['count', 'sum']})
amount
count sum
id year_month
10 201901 3 60
20 201902 2 60
If you want to remove the multi-index, use MultiIndex.droplevel:
s = df.groupby(['id', 'year_month']).agg({'amount': ['count', 'sum']}).rename(columns ={'sum': 'amount'})
s.columns = s.columns.droplevel(level=0)
s.reset_index()
id year_month count amount
0 10 201901 3 60
1 20 201902 2 60

Related

Dynamically pass in name of column based on value from another

Say I have a dataframe like below.
eff_month eff_date latest_volume_month_1 latest_volume_month_2 latest_volume_month_3 pct
1 2022-01-13 55 60 70 .5
2 2022-02-10 40 50 60 .1
3 2022-03-02 30 50 70 .2
I am trying to create a new column that depending on the eff_month and pct will do some math for the values within the latest_volume_month_{eff_month} and pct.
The math: For the 1st month only take the sum of latest_volume_month_1 and multiple by pct to get vol.
For the 2nd month take the sum of latest_volume_month_1 and latest_volume_month_2 and then multiply by pct.
Third row take the sum of month 1,2,3 and multiply by pct for vol.
The expected output would look something like this.
eff_month eff_date latest_volume_month_1 latest_volume_month_2 latest_volume_month_3 pct vol
1 2022-01-13 55 60 70 .5 22.5
2 2022-02-10 40 50 60 .1 9.1
3 2022-03-02 30 50 70 .2 25

pandas - sort columns and group by a particular field

I have a list of objects
[
{
"companyid": long,
"parentid": long
"score": long,
...
}
]
The parentid is nothing but the cid of the parent company
Sample data will look something like this
cid parentid score
1 10 1000
2 10 100
3 10 1001
10 10 20
11 100 1000
12 100 100
100 100 200
111 1000 10
112 1000 100
1000 100 2000
I need to sort the values based on the score, but i want to group the values by parentid
I tried this which didn't really fit my requirements, since it groups then sorts
df.groupby('headcompanyid').apply(lambda x: x.sort_values('score'))
Sorting by score will give this result:
cid parentid score
1000 100 2000
3 10 1001
1 10 1000
11 100 1000
100 100 200
2 10 100
112 1000 100
12 100 100
10 10 20
111 1000 10
Grouping by parentid on the sorted data (which is my end goal), should give this result
cid parentid score
1000 100 2000
11 100 1000 // since 100 is the parentid, it needs to be pushed up the in the result set
100 100 200 // if multiple records are pushed up, then sorting should be based on score
12 100 100
3 10 1001 // 2nd group by is based on 10, since 1001 is the next highest score which
1 10 1000 // doesn't belong to the 100 parentid group
2 10 100
10 10 20
112 1000 100
111 1000 10
i am using pandas v0.24.2 and python 3.7 if it matters
Try this:
df.sort_values(['parentid', 'score'], ascending=[False, False])
Output:
cid parentid score
8 112 1000 100
7 111 1000 10
9 1000 100 2000
4 11 100 1000
6 100 100 200
5 12 100 100
2 3 10 1001
0 1 10 1000
1 2 10 100
3 10 10 20

Grouping data based on month-year in pandas and then dropping all entries except the latest one- Python

Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write a code that groups data by particular month-year and then keep the entry of latest date in that particular month-year and drop the rest. The data is till year 2020
I was only able to fetch the count by month-year. I am not able to drop create a proper code that helps to group data as per month-year and indicator and get the correct results
Use Series.dt.to_period for months periods, aggregate index of maximal date per groups by DataFrameGroupBy.idxmax and then pass to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('m'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('m'))['Date'].idxmax()]
print (df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90

Creating URNs based on a row ID

I have a pandas dataset that has rows with the same Site ID. I want to create a new ID for each row. Currently I have a df like this:
SiteID SomeData1 SomeData2
100001 20 30
100001 20 30
100002 30 40
I am looking to achieve the below output
Output:
SiteID SomeData1 SomeData2 Site_ID2
100001 20 30 1000011
100001 20 30 1000012
100002 30 40 1000021
What would be the best way to achieve this?
Add helper Series by GroupBy.cumcount converted to strings to column SiteID :
s = df.groupby(['SomeData1','SomeData2']).cumcount().add(1)
df['Site_ID2'] = df['SiteID'].astype(str).add(s.astype(str))
print (df)
SiteID SomeData1 SomeData2 Site_ID2
0 100001 20 30 1000011
1 100001 20 30 1000012
2 100002 30 40 1000021

pandas calculate scores for each group based on multiple functions

I have the following df,
group_id code amount date
1 100 20 2017-10-01
1 100 25 2017-10-02
1 100 40 2017-10-03
1 100 25 2017-10-03
2 101 5 2017-11-01
2 102 15 2017-10-15
2 103 20 2017-11-05
I like to groupby group_id and then compute scores to each group based on the following features:
if code values are all the same in a group, score 0 and 10 otherwise;
if amount sum is > 100, score 20 and 0 otherwise;
sort_values by date in descending order and sum the differences between the dates, if the sum < 5, score 30, otherwise 0.
so the result df looks like,
group_id code amount date score
1 100 20 2017-10-01 50
1 100 25 2017-10-02 50
1 100 40 2017-10-03 50
1 100 25 2017-10-03 50
2 101 5 2017-11-01 10
2 102 15 2017-10-15 10
2 103 20 2017-11-05 10
here are the functions that correspond to each feature above:
def amount_score(df, amount_col, thold=100):
if df[amount_col].sum() > thold:
return 20
else:
return 0
def col_uniq_score(df, col_name):
if df[col_name].nunique() == 1:
return 0
else:
return 10
def date_diff_score(df, col_name):
df.sort_values(by=[col_name], ascending=False, inplace=True)
if df[col_name].diff().dropna().sum() / np.timedelta64(1, 'D') < 5:
return score + 30
else:
return score
I am wondering how to apply these functions to each group and calculate the sum of all the functions to give a score.
You can try groupby.transform for same size of Series as original DataFrame with numpy.where for if-else for Series:
grouped = df.sort_values('date', ascending=False).groupby('group_id', sort=False)
a = np.where(grouped['code'].transform('nunique') == 1, 0, 10)
print (a)
[10 10 10 0 0 0 0]
b = np.where(grouped['amount'].transform('sum') > 100, 20, 0)
print (b)
[ 0 0 0 20 20 20 20]
c = np.where(grouped['date'].transform(lambda x:x.diff().dropna().sum()).dt.days < 5, 30, 0)
print (c)
[30 30 30 30 30 30 30]
df['score'] = a + b + c
print (df)
group_id code amount date score
0 1 100 20 2017-10-01 40
1 1 100 25 2017-10-02 40
2 1 100 40 2017-10-03 40
3 1 100 25 2017-10-03 50
4 2 101 5 2017-11-01 50
5 2 102 15 2017-10-15 50
6 2 103 20 2017-11-05 50

Resources