How to aggregate a pandas DataFrame so that one column's value comes from the row selected by another column's aggfunc?

I have the following data
ID DATE AGE COUNT
1 NaT 16 1
1 2021-06-06 19 2
1 2020-01-05 20 3
2 NaT 23 3
2 NaT 16 3
2 2019-02-04 36 12
I want to aggregate this so that DATE is the earliest valid date (in time) per ID, while AGE is taken from the same row in which that earliest date is found. The output should be
ID DATE AGE COUNT
1 2021-06-06 19 1
2 2019-02-04 36 3
My code, which raises the error TypeError: Must provide 'func' or named aggregation **kwargs.:
df_agg = pd.pivot_table(df, index=['ID'],
values=['DATE', 'AGE'],
aggfunc={'DATE': np.min, 'AGE': None, 'COUNT': np.min})
I don't want to use 'AGE': np.min since for ID=1, AGE=16 will be extracted which is not what I want.
///////////// Edits ///////////////
Edits made to provide a more generic example.

You can try .first_valid_index():
x = df.loc[df.groupby("ID").apply(lambda x: x["DATE"].first_valid_index())]
print(x)
Prints:
ID DATE AGE
1 1 2021-06-06 19
5 2 2019-02-04 36
EDIT: Using .pivot_table(). You can extract "DATE" and "AGE" together as a list; for "COUNT" you can use np.min or "min". The second step is to expand the "DATE"/"AGE" list into separate columns:
df_agg = pd.pivot_table(
df,
index=["ID"],
values=["DATE", "AGE", "COUNT"],
aggfunc={
"DATE": lambda x: df.loc[x.first_valid_index()][
["DATE", "AGE"]
].tolist(),
"COUNT": "min",
},
)
df_agg[["DATE", "AGE"]] = pd.DataFrame(df_agg["DATE"].apply(pd.Series))
print(df_agg)
Prints:
COUNT DATE AGE
ID
1 1 2021-06-06 19
2 3 2019-02-04 36

You can sort values and drop the duplicates (sort_index is optional)
df.sort_values(['DATE']).drop_duplicates('ID').sort_index()
ID DATE AGE
1 1 2021-06-06 19
5 2 2019-02-04 36
With groupby and transform:
df[df['DATE'] == df.groupby("ID")['DATE'].transform('min')]

Assuming you have an index, a simple solution would be:
def min_val(group):
    # idxmin() must be called; it returns the index label of the earliest DATE
    group = group.loc[group.DATE.idxmin()]
    return group
df.groupby(['ID']).apply(min_val)
If you do not have an index you can use:
df.reset_index().groupby(['ID']).apply(min_val).drop(columns=['ID'])
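A minimal, self-contained sketch of the idxmin idea that also carries the minimum COUNT per ID, as the question's expected output asks for. The variable names (valid, earliest_idx, rows, result) are illustrative, and dropping NaT rows before calling idxmin is an assumption made so that missing dates never take part in the comparison. With the sample data exactly as listed in the question, the earliest valid date for ID 1 is 2020-01-05, so this sketch selects AGE 20 for that group.

```python
import pandas as pd

# Sample data as listed in the question (NaT marks a missing date).
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "DATE": pd.to_datetime(
        [None, "2021-06-06", "2020-01-05", None, None, "2019-02-04"]
    ),
    "AGE": [16, 19, 20, 23, 16, 36],
    "COUNT": [1, 2, 3, 3, 3, 12],
})

# Keep only rows with a valid date, then find the row label of the
# earliest date within each ID.
valid = df.dropna(subset=["DATE"])
earliest_idx = valid.groupby("ID")["DATE"].idxmin()

# Pull DATE and AGE from exactly those rows.
rows = df.loc[earliest_idx, ["ID", "DATE", "AGE"]].set_index("ID")

# COUNT is aggregated independently over all rows of each ID.
rows["COUNT"] = df.groupby("ID")["COUNT"].min()

result = rows.reset_index()
print(result)
```

Because DATE and AGE are read from the same row label, they cannot come from different rows, which is the problem with aggregating each column separately.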

Related

How to exclude rows from a groupby operation

I am working on a groupby operation using the attribute column but I want to exclude the desc_type 1 and 2 that will be used to calculate total discount inside each attrib.
pd.DataFrame({'ID':[10,10,10,20,30,30],'attribute':['attrib_1','desc_type1','desc_type2','attrib_1','attrib_2','desc_type1'],'value':[100,0,0,100,30,0],'discount':[0,6,2,0,0,13.3]})
output:
ID attribute value discount
10 attrib_1 100 0
10 desc_type1 0 6
10 desc_type2 0 2
20 attrib_1 100 0
30 attrib_2 30 0
30 desc_type1 0 13.3
I want to groupby this dataframe by attribute but excluding the desc_type1 and desc_type2.
The desired output:
attribute ID_count value_sum discount_sum
attrib_1 2 200 8
attrib_2 1 30 13.3
explanations:
attrib_1 has discount_sum=8 because ID 10, which belongs to attrib_1, has two desc_type rows (6 + 2).
attrib_2 has discount_sum=13.3 because ID 30, which belongs to attrib_2, has one desc_type row.
ID=20 has no discount types.
What I did so far:
df.groupby('attribute').agg({'ID':'count','value':'sum','discount':'sum'})
But the line above does not exclude the desc_type 1 and 2 from the groupby
Important: an ID may have a discount or not.
link to the realdataset: realdataset
You can fill the attributes per ID, then groupby.agg:
m = df['attribute'].str.startswith('desc_type')
group = df['attribute'].mask(m).groupby(df['ID']).ffill()
out = (df
.groupby(group, as_index=False)
.agg(**{'ID_count': ('ID', 'nunique'),
'value_sum': ('value', 'sum'),
'discount_sum': ('discount', 'sum')
})
)
output:
ID_count value_sum discount_sum
0 2 200 8.0
1 1 30 13.3
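The output above has no attribute column. A self-contained sketch of the same mask-and-forward-fill idea that keeps the attribute labels in the result follows; the names is_discount and owner are illustrative, and the sketch assumes that each ID's attrib row appears before its desc_type rows, as in the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [10, 10, 10, 20, 30, 30],
    "attribute": ["attrib_1", "desc_type1", "desc_type2",
                  "attrib_1", "attrib_2", "desc_type1"],
    "value": [100, 0, 0, 100, 30, 0],
    "discount": [0, 6, 2, 0, 0, 13.3],
})

# Blank out the desc_type labels, then forward-fill within each ID so
# every discount row inherits the attribute of the ID it belongs to.
is_discount = df["attribute"].str.startswith("desc_type")
owner = df["attribute"].mask(is_discount).groupby(df["ID"]).ffill()

# Group by the filled labels; the group key becomes the index and is
# turned back into an "attribute" column by reset_index().
out = (
    df.groupby(owner)
      .agg(ID_count=("ID", "nunique"),
           value_sum=("value", "sum"),
           discount_sum=("discount", "sum"))
      .reset_index()
)
print(out)
```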
Hello I think this helps :
df.loc[(df['attribute'] != 'desc_type1') &( df['attribute'] != 'desc_type2')].groupby('attribute').agg({'ID':'count','value':'sum','discount':'sum'})
Output :
ID value discount
attribute
attrib_1 2 200 0.0
attrib_2 1 30 0.0

Join two dataframes based on closest combination that sums up to a target value

I'm trying to join the two dataframes below based on the closest combination of rows from the df2 column Sales that sums up to the target value in the df1 column Total Sales. The columns Name and Date must be the same in both dataframes when joining (as shown in the expected output).
For example, row 0 of df1 should be matched only with rows 0 and 1 of df2, since they share the same Name and Date (Name: John, Date: 2021-10-01).
df1 :
df1 = pd.DataFrame({"Name":{"0":"John","1":"John","2":"Jack","3":"Nancy","4":"Ahmed"},
"Date":{"0":"2021-10-01","1":"2021-11-01","2":"2021-10-10","3":"2021-10-12","4":"2021-10-30"},
"Total Sales":{"0":15500,"1":5500,"2":17600,"3":20700,"4":12000}})
Name Date Total Sales
0 John 2021-10-01 15500
1 John 2021-11-01 5500
2 Jack 2021-10-10 17600
3 Nancy 2021-10-12 20700
4 Ahmed 2021-10-30 12000
df2 :
df2 = pd.DataFrame({"ID":{"0":"JO1","1":"JO2","2":"JO3","3":"JO4","4":"JA1","5":"JA2","6":"NA1",
"7":"NA2","8":"NA3","9":"NA4","10":"AH1","11":"AH2","12":"AH3","13":"AH3"},
"Name":{"0":"John","1":"John","2":"John","3":"John","4":"Jack","5":"Jack","6":"Nancy","7":"Nancy",
"8":"Nancy","9":"Nancy","10":"Ahmed","11":"Ahmed","12":"Ahmed","13":"Ahmed"},
"Date":{"0":"2021-10-01","1":"2021-10-01","2":"2021-11-01","3":"2021-11-01","4":"2021-10-10","5":"2021-10-10","6":"2021-10-12","7":"2021-10-12",
"8":"2021-10-12","9":"2021-10-12","10":"2021-10-30","11":"2021-10-30","12":"2021-10-30","13":"2021-10-29"},
"Sales":{"0":10000,"1":5000,"2":1000,"3":5500,"4":10000,"5":7000,"6":20000,
"7":100,"8":500,"9":100,"10":5000,"11":7000,"12":10000,"13":12000}})
ID Name Date Sales
0 JO1 John 2021-10-01 10000
1 JO2 John 2021-10-01 5000
2 JO3 John 2021-11-01 1000
3 JO4 John 2021-11-01 5500
4 JA1 Jack 2021-10-10 10000
5 JA2 Jack 2021-10-10 7000
6 NA1 Nancy 2021-10-12 20000
7 NA2 Nancy 2021-10-12 100
8 NA3 Nancy 2021-10-12 500
9 NA4 Nancy 2021-10-12 100
10 AH1 Ahmed 2021-10-30 5000
11 AH2 Ahmed 2021-10-30 7000
12 AH3 Ahmed 2021-10-30 10000
13 AH3 Ahmed 2021-10-29 12000
Expected Output :
Name Date Total Sales Comb IDs Comb Total
0 John 2021-10-01 15500 JO1, JO2 15000.0
1 John 2021-11-01 5500 JO4 5500.0
2 Jack 2021-10-10 17600 JA1, JA2 17000.0
3 Nancy 2021-10-12 20700 NA1, NA2, NA3, NA4 20700.0
4 Ahmed 2021-10-30 12000 AH1, AH2 12000.0
What I have tried below works for only one row at a time, and I'm not sure how to apply it to pandas dataframes to get the expected output.
The variable numbers in the script below represents the Sales column in df2, and the variable target represents the Total Sales column in df1.
import itertools
import math
numbers = [1000, 5000, 3000]
target = 6000
best_combination = ((None,))
best_result = math.inf
best_sum = 0
for L in range(0, len(numbers)+1):
    for combination in itertools.combinations(numbers, L):
        sum = 0
        for number in combination:
            sum += number
        result = target - sum
        if abs(result) < abs(best_result):
            best_result = result
            best_combination = combination
            best_sum = sum
print("\nbest sum{} = {}".format(best_combination, best_sum))
[Out] best sum(1000, 5000) = 6000
Take the code you wrote that finds the best sum and turn it into a function (let's call it opt) with parameters for the target and a dataframe (which will be a subset of df2). It needs to return a list of IDs that correspond to the optimal combination.
Write another function that takes three arguments, name, date and target (let's call it calc). This function filters df2 by name and date, passes the result along with the target to opt, and returns whatever opt returns. Finally, iterate through the rows of df1 and call calc with each row's values (or, alternatively, use pandas.DataFrame.apply).
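The two functions described above might be sketched as follows. The names opt and calc come from the answer; passing df2 as an explicit fourth argument to calc and returning the combination total alongside the IDs are assumptions made here to keep the example self-contained and to fill the Comb Total column. The search enumerates every combination, so its cost grows exponentially with the number of matching df2 rows.

```python
import itertools
import pandas as pd

def opt(target, subset):
    """Return (ids, total) for the rows of `subset` whose Sales sum is closest to `target`."""
    records = list(zip(subset["ID"], subset["Sales"]))
    best_ids, best_total, best_gap = [], 0, float("inf")
    for size in range(1, len(records) + 1):
        for combo in itertools.combinations(records, size):
            total = sum(sales for _, sales in combo)
            gap = abs(target - total)
            if gap < best_gap:  # strictly better, so the first best match wins ties
                best_ids = [row_id for row_id, _ in combo]
                best_total, best_gap = total, gap
    return best_ids, best_total

def calc(name, date, target, df2):
    """Restrict df2 to one Name/Date pair and search only those rows."""
    subset = df2[(df2["Name"] == name) & (df2["Date"] == date)]
    return opt(target, subset)

df1 = pd.DataFrame({
    "Name": ["John", "John", "Jack", "Nancy", "Ahmed"],
    "Date": ["2021-10-01", "2021-11-01", "2021-10-10", "2021-10-12", "2021-10-30"],
    "Total Sales": [15500, 5500, 17600, 20700, 12000],
})
df2 = pd.DataFrame({
    "ID": ["JO1", "JO2", "JO3", "JO4", "JA1", "JA2", "NA1",
           "NA2", "NA3", "NA4", "AH1", "AH2", "AH3", "AH3"],
    "Name": ["John"] * 4 + ["Jack"] * 2 + ["Nancy"] * 4 + ["Ahmed"] * 4,
    "Date": ["2021-10-01", "2021-10-01", "2021-11-01", "2021-11-01",
             "2021-10-10", "2021-10-10", "2021-10-12", "2021-10-12",
             "2021-10-12", "2021-10-12", "2021-10-30", "2021-10-30",
             "2021-10-30", "2021-10-29"],
    "Sales": [10000, 5000, 1000, 5500, 10000, 7000, 20000,
              100, 500, 100, 5000, 7000, 10000, 12000],
})

# Call calc once per df1 row and attach the results as new columns.
results = [calc(n, d, t, df2)
           for n, d, t in zip(df1["Name"], df1["Date"], df1["Total Sales"])]
df1["Comb IDs"] = [", ".join(ids) for ids, _ in results]
df1["Comb Total"] = [float(total) for _, total in results]
print(df1)
```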

Unstack a dataframe with duplicated index in Pandas

Given a toy dataset as follows, which has duplicated price and quantity entries:
city item value
0 bj price 12
1 bj quantity 15
2 bj price 12
3 bj quantity 15
4 bj level a
5 sh price 45
6 sh quantity 13
7 sh price 56
8 sh quantity 7
9 sh level b
I want to reshape it into the following dataframe, adding the prefix sell_ to the first price/quantity pair and buy_ to the second pair:
city sell_price sell_quantity buy_price buy_quantity level
0 bj 12 15 12 15 a
1 sh 45 13 56 7 b
I have tried with df.set_index(['city', 'item']).unstack().reset_index(), but it raises an error: ValueError: Index contains duplicate entries, cannot reshape.
How could I get the desired output as above? Thanks.
You can prefix the second occurrence of each duplicated item with buy_ and the first occurrence with sell_, changing the values in item before applying your solution:
m1 = df.duplicated(['city', 'item'])
m2 = df.duplicated(['city', 'item'], keep=False)
df['item'] = np.where(m1, 'buy_', np.where(m2, 'sell_', '')) + df['item']
df = (df.set_index(['city', 'item'])['value']
.unstack()
.reset_index()
.rename_axis(None, axis=1))
# change the order of the columns
df = df[['city','sell_price','sell_quantity','buy_price','buy_quantity','level']]
print (df)
city sell_price sell_quantity buy_price buy_quantity level
0 bj 12 15 12 15 a
1 sh 45 13 56 7 b
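An alternative sketch of the same relabelling step uses groupby(...).cumcount() to number each occurrence of a (city, item) pair explicitly. The names occurrence, repeated, prefix and wide are illustrative, and the sketch assumes each pair appears at most twice, since only occurrences 0 and 1 are mapped to a prefix.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["bj"] * 5 + ["sh"] * 5,
    "item": ["price", "quantity", "price", "quantity", "level"] * 2,
    "value": [12, 15, 12, 15, "a", 45, 13, 56, 7, "b"],
})

# 0 for the first occurrence of a (city, item) pair, 1 for the second.
occurrence = df.groupby(["city", "item"]).cumcount()

# Only items that actually repeat get a prefix; "level" stays as it is.
repeated = df.duplicated(["city", "item"], keep=False)
prefix = occurrence.map({0: "sell_", 1: "buy_"}).where(repeated, "")

# After relabelling, every (city, item) pair is unique and can be pivoted.
wide = (
    df.assign(item=prefix + df["item"])
      .pivot(index="city", columns="item", values="value")
      .reset_index()
      .rename_axis(None, axis=1)
)
wide = wide[["city", "sell_price", "sell_quantity",
             "buy_price", "buy_quantity", "level"]]
print(wide)
```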

adding 1 to the previous row based on conditions

I have a pandas dataframe like below:
data=[['A',1,30],
['A',1,2],
['A',0,4],
['A',1,4],
['B',0,5],
['B',1,1],
['B',0,5],
['B',1,8]]
df = pd.DataFrame(data,columns=['group','var_1','var_2'])
I want to create a series of values with index based on below condition:
Step 1) The increment should always start from the first row of 'var_2' in each group. For example, for group A the increment should start from 30, and for group B it should start from 5.
Step 2) The value is incremented where 'var_1' = 1.
My desired output:
0 30
1 31
3 32
5 6
7 7
IIUC:
# Get the first index of each group and union it with the index where var_1 == 1
indx = df.drop_duplicates('group').index.union(df[(df['var_1']==1)].index)
# Reindex the dataframe, group by group, and add a cumulative count to the first value of each group.
# Use .loc to keep the rows where var_1 != 0 and select column var_2
df.reindex(indx).groupby('group')\
.transform(lambda x: x.iloc[0] + x.shift().notna().cumsum())\
.loc[lambda x: x.var_1 !=0, 'var_2']
Output:
0 30
1 31
3 32
5 6
7 7
Name: var_2, dtype: int64
Try groupby cumcount and first
df1 = df.loc[df.var_1.eq(1)]
g = df1.groupby('group')['var_2']
g.transform('first') + g.cumcount()
Out[66]:
0 30
1 31
3 32
5 1
7 2
dtype: int64
Or use duplicated with df.where and cumsum
df1 = df.loc[df.var_1.eq(1)]
df1.var_2.where(~df1.duplicated('group'), 1).groupby(df1.group).cumsum()
Out[77]:
0 30
1 31
3 32
5 1
7 2
Name: var_2, dtype: int64
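The last two snippets print 1 and 2 for group B, whereas the desired output lists 6 and 7. A minimal sketch of one reading of the rule that reproduces the desired output: start from the first var_2 of each group and add one for every row with var_1 == 1 that is not the group's first row. The names first, is_first, steps and result are illustrative.

```python
import pandas as pd

data = [['A', 1, 30], ['A', 1, 2], ['A', 0, 4], ['A', 1, 4],
        ['B', 0, 5], ['B', 1, 1], ['B', 0, 5], ['B', 1, 8]]
df = pd.DataFrame(data, columns=['group', 'var_1', 'var_2'])

# Starting value: the first var_2 of each group, whatever its var_1 is.
first = df.groupby('group')['var_2'].transform('first')

# Count the var_1 == 1 rows seen so far in each group, skipping the
# group's first row so that it keeps the starting value unchanged.
is_first = ~df.duplicated('group')
steps = (df['var_1'].eq(1) & ~is_first).astype(int).groupby(df['group']).cumsum()

# Keep only the rows where var_1 == 1.
result = (first + steps)[df['var_1'].eq(1)]
print(result)
```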

Find occurrences of conditional value from one column and count values from another column in a dataframe

I have a dataframe containing userIds, week number, and a column X as shown below:
I am trying to group by the userIds if X is greater than 3 for 3 weeks.
I have tried using groupby and lambda in pandas but I am stuck
weekly_X = df.groupby(['Userid','Week #'], as_index=False)
UserIds Week X
123 14 3
123 15 4
123 16 7
123 17 2
123 18 1
456 14 4
456 15 5
456 16 11
456 17 2
456 18 6
The result I am aiming for is a dataframe containing user 456 and how many weeks the condition occurred.
df_3 = df.groupby('UserIds').apply(lambda x: (x.X > 3).sum() > 3).to_frame('ID_want').reset_index()
df = df[df.UserIds.isin(df_3.loc[df_3.ID_want == 1,'UserIds'])]
Count the values greater than 3 per user with an aggregate sum, then keep only the users whose count is greater than 3:
s = df['X'].gt(3).astype(int).groupby(df['UserIds']).sum()
out = s[s.gt(3)].reset_index(name='count')
print (out)
UserIds count
0 456 4
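A self-contained version of the same idea, wrapped in a small helper so the two thresholds are explicit. The function name users_over_threshold and its parameters threshold and min_weeks are hypothetical; the comparison is strict (more than min_weeks qualifying weeks) to mirror s.gt(3) above.

```python
import pandas as pd

def users_over_threshold(df, threshold=3, min_weeks=3):
    """Count weeks with X > threshold per user; keep users with more than min_weeks."""
    # A boolean Series summed per group gives the number of True values.
    counts = df["X"].gt(threshold).groupby(df["UserIds"]).sum()
    return counts[counts.gt(min_weeks)].reset_index(name="count")

df = pd.DataFrame({
    "UserIds": [123] * 5 + [456] * 5,
    "Week": [14, 15, 16, 17, 18] * 2,
    "X": [3, 4, 7, 2, 1, 4, 5, 11, 2, 6],
})

out = users_over_threshold(df)
print(out)
```

If "for 3 weeks" is meant to include exactly three weeks, replacing counts.gt(min_weeks) with counts.ge(min_weeks) would be the corresponding change.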
