I have a dataset given as such in Python:
#Load the required libraries
import pandas as pd
#Create dataset
data = {'ID': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
        'Salary': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8],
        'Children': ['No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No'],
        'Days': [123, 128, 66, 120, 141, 123, 128, 66, 120, 141, 52, 96, 120, 141, 52, 96, 120, 141, 123, 15, 85, 36, 58, 89]}
#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)
Now, for every ID/group, I wish to set a floor for 'Salary': within each group, any 'Salary' value below the floor should be raised to the floor.
For example,
For ID=1, the floor of 'Salary' should be set at 4
For ID=2, the floor of 'Salary' should be set at 3
For ID=3, the floor of 'Salary' should be set at 5
Can somebody please let me know how to achieve this task in Python? The desired result is shown in the answer outputs below.
Use a custom function in GroupBy.transform, mapping each group to its bound with a helper dictionary. Note that this works because 'Salary' is a 1-based counter within each group, so the first d[x.name] positions are exactly the values at or below the bound:
d = {1: 4, 2: 3, 3: 5}

def f(x):
    x.iloc[:d[x.name]] = d[x.name]
    return x

df['Salary'] = df.groupby('ID')['Salary'].transform(f)
print(df)
ID Salary Children Days
0 1 4 No 123
1 1 4 Yes 128
2 1 4 Yes 66
3 1 4 Yes 120
4 1 5 No 141
5 1 6 No 123
6 1 7 Yes 128
7 1 8 Yes 66
8 1 9 Yes 120
9 1 10 No 141
10 2 3 Yes 52
11 2 3 Yes 96
12 2 3 No 120
13 2 4 Yes 141
14 2 5 Yes 52
15 2 6 Yes 96
16 3 5 Yes 120
17 3 5 Yes 141
18 3 5 No 123
19 3 5 Yes 15
20 3 5 No 85
21 3 6 Yes 36
22 3 7 Yes 58
23 3 8 No 89
Another idea is to use GroupBy.cumcount as a per-ID counter, compare it with the mapped bound, and replace the matching positions with the mapped Series via Series.mask:
d = {1:4, 2:3, 3:5}
s = df['ID'].map(d)
df['Salary'] = df['Salary'].mask(df.groupby('ID').cumcount().lt(s), s)
Or, since the 'Salary' column is itself the per-group counter here, you can compare the values directly:
s = df['ID'].map(d)
df['Salary'] = df['Salary'].mask(df['Salary'].le(s), s)
print (df)
ID Salary Children Days
0 1 4 No 123
1 1 4 Yes 128
2 1 4 Yes 66
3 1 4 Yes 120
4 1 5 No 141
5 1 6 No 123
6 1 7 Yes 128
7 1 8 Yes 66
8 1 9 Yes 120
9 1 10 No 141
10 2 3 Yes 52
11 2 3 Yes 96
12 2 3 No 120
13 2 4 Yes 141
14 2 5 Yes 52
15 2 6 Yes 96
16 3 5 Yes 120
17 3 5 Yes 141
18 3 5 No 123
19 3 5 Yes 15
20 3 5 No 85
21 3 6 Yes 36
22 3 7 Yes 58
23 3 8 No 89
One option is to create a series from the dictionary, merge it with the dataframe, and then update the Salary column conditionally (this one uses NumPy, so it also needs import numpy as np):
import numpy as np

ser = pd.Series(d, name='d')
ser.index.name = 'ID'

(df
 .merge(ser, on='ID')
 .assign(Salary=lambda f: np.where(f.Salary.lt(f.d), f.d, f.Salary))
 .drop(columns='d')
)
ID Salary Children Days
0 1 4 No 123
1 1 4 Yes 128
2 1 4 Yes 66
3 1 4 Yes 120
4 1 5 No 141
5 1 6 No 123
6 1 7 Yes 128
7 1 8 Yes 66
8 1 9 Yes 120
9 1 10 No 141
10 2 3 Yes 52
11 2 3 Yes 96
12 2 3 No 120
13 2 4 Yes 141
14 2 5 Yes 52
15 2 6 Yes 96
16 3 5 Yes 120
17 3 5 Yes 141
18 3 5 No 123
19 3 5 Yes 15
20 3 5 No 85
21 3 6 Yes 36
22 3 7 Yes 58
23 3 8 No 89
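Since the operation amounts to clipping each group's 'Salary' from below at its bound, a compact alternative sketch (using the same helper dictionary d) is Series.clip with a mapped lower bound:
# Map each row's ID to its bound, then raise any smaller Salary to it.
df['Salary'] = df['Salary'].clip(lower=df['ID'].map(d))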
Input df:
Store Category Item tot_table
11 AA Apple 13.5
11 AA Orange 13.5
11 BB Potato 11.5
11 BB Carrot 11.5
12 AA Apple 10
12 BB Potato 9
12 BB Carrot 9
I need to sum the unique tot_table values per Store and broadcast the result to every row of the group. I tried df.groupby('Store')['tot_table'].unique().sum(), but this line of code doesn't work.
Expected output df:
Store Category Item split_table tot_table
11 AA Apple 13.5 25
11 AA Orange 13.5 25
11 BB Potato 11.5 25
11 BB Carrot 11.5 25
12 AA Apple 10 19
12 BB Potato 9 19
12 BB Carrot 9 19
You can use groupby.transform with unique/sum:
df['tot_table'] = (df.groupby('Store')['tot_table']
.transform(lambda s: s.unique().sum())
)
output:
Store Category Item tot_table
0 11 AA Apple 25.0
1 11 AA Orange 25.0
2 11 BB Potato 25.0
3 11 BB Carrot 25.0
4 12 AA Apple 19.0
5 12 BB Potato 19.0
6 12 BB Carrot 19.0
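If you also want to keep the original per-row values as split_table, as shown in the expected output, a minimal sketch is to rename the column first and write the group sums into a fresh tot_table:
df = df.rename(columns={'tot_table': 'split_table'})
df['tot_table'] = (df.groupby('Store')['split_table']
                     .transform(lambda s: s.unique().sum()))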
I have a dataframe that I want to split into groups, based on a condition on the flag_0 and flag_1 columns: consecutive runs of rows where flag_0 is 3 and flag_1 is 1.
Here is my dataframe example:
df = pd.DataFrame({'flag_0': [1, 2, 3, 1, 2, 3, 1, 2, 3, 3, 3, 3, 1, 2, 3, 1, 2, 3, 4, 4],
                   'flag_1': [1, 2, 3, 1, 2, 3, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 3, 4, 4],
                   'dd': [1, 1, 1, 7, 7, 7, 8, 8, 8, 1, 1, 1, 7, 7, 7, 8, 8, 8, 5, 7]})
Out[172]:
flag_0 flag_1 dd
0 1 1 1
1 2 2 1
2 3 3 1
3 1 1 7
4 2 2 7
5 3 3 7
6 1 1 8
7 2 2 8
8 3 1 8
9 3 1 1
10 3 1 1
11 3 1 1
12 1 1 7
13 2 2 7
14 3 1 7
15 1 1 8
16 2 2 8
17 3 3 8
18 4 4 5
19 4 4 7
Desired output:
group_1
Out[172]:
flag_0 flag_1 dd
9 3 1 1
10 3 1 1
11 3 1 1
group 2
Out[172]:
flag_0 flag_1 dd
14 3 1 7
You can use a mask and groupby to split the dataframe:
cond = {'flag_0': 3, 'flag_1': 1}
mask = df[list(cond)].eq(cond).all(1)
groups = [g for k,g in df[mask].groupby((~mask).cumsum())]
output:
[ flag_0 flag_1 dd
8 3 1 8
9 3 1 1
10 3 1 1
11 3 1 1,
flag_0 flag_1 dd
14 3 1 7]
groups[0]
flag_0 flag_1 dd
8 3 1 8
9 3 1 1
10 3 1 1
11 3 1 1
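The key step is (~mask).cumsum(): the counter increments only on rows that fail the condition, so every row inside a run of consecutive matches carries the same label, and grouping the masked frame by that label yields one group per run. A quick illustrative sketch to inspect the labels:
# Show the boolean mask next to the run label each row would receive.
print(df.assign(match=mask, run_id=(~mask).cumsum()))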
I have the following dataframe:
Name Value
0 AA 33
1 AA 24
2 BB 23
3 BB NaN
4 CC NaN
5 CC 23
6 CC 45
How can I replace these NaNs with existing values by looking at the Name column? For CC I would like to use the max (but if that is too convoluted, I am fine with either 23 or 45). The expected output:
Name Value
0 AA 33
1 AA 24
2 BB 23
3 BB 23
4 CC 45
5 CC 23
6 CC 45
Thanks!
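For reference, a minimal reconstruction of this frame (values taken from the table above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['AA', 'AA', 'BB', 'BB', 'CC', 'CC', 'CC'],
                   'Value': [33, 24, 23, np.nan, np.nan, 23, 45]})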
You can groupby and transform with max then fillna:
df['Value'] = df['Value'].fillna(df.groupby("Name")['Value'].transform('max'))
print(df)
Name Value
0 AA 33.0
1 AA 24.0
2 BB 23.0
3 BB 23.0
4 CC 45.0
5 CC 23.0
6 CC 45.0
You can also use a lambda with transform; selecting the 'Value' column first keeps the operation on a single Series:
df["Value"] = df.groupby('Name')['Value'].transform(lambda x: x.fillna(x.max()))
df
Name Value
0 AA 33.0
1 AA 24.0
2 BB 23.0
3 BB 23.0
4 CC 45.0
5 CC 23.0
6 CC 45.0
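A transform-free sketch of the same idea: compute each group's max once, then fill the gaps through Series.map:
# Map each Name to its group maximum and use it only where Value is NaN.
df['Value'] = df['Value'].fillna(df['Name'].map(df.groupby('Name')['Value'].max()))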
I am trying to multiply two dataframes.
Df1
Name  Key  100  101  102  103  104
Abb   AB     2    6   10    5    1
Bcc   BC     1    3    7    4    2
Abb   AB     5    1   11    3    1
Bcc   BC     7    1    4    5    0
Df2
Key_1  100  101  102  103  104
AB      10    2    1    5    1
BC       1   10    2    2    4
Expected Output
Name  Key  100  101  102  103  104
Abb   AB    20   12   10   25    1
Bcc   BC     1   30   14    8    8
Abb   AB    50    2   11   15    1
Bcc   BC     7   10    8   10    0
I have tried grouping Df1 and then multiplying with Df2, but it didn't work.
Please help me with how to approach this problem.
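For reference, a minimal reconstruction of the two frames (assuming string labels '100'-'104' for the value columns, which is what the answers below align on):
import pandas as pd

df1 = pd.DataFrame({'Name': ['Abb', 'Bcc', 'Abb', 'Bcc'],
                    'Key': ['AB', 'BC', 'AB', 'BC'],
                    '100': [2, 1, 5, 7],
                    '101': [6, 3, 1, 1],
                    '102': [10, 7, 11, 4],
                    '103': [5, 4, 3, 5],
                    '104': [1, 2, 1, 0]})

df2 = pd.DataFrame({'Key_1': ['AB', 'BC'],
                    '100': [10, 1],
                    '101': [2, 10],
                    '102': [1, 2],
                    '103': [5, 2],
                    '104': [1, 4]})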
You can rename df2's Key_1 to Key (to match df1), then set the index and multiply with level=1, which aligns df2's index against level 1 ('Key') of df1's MultiIndex:
df1.set_index(['Name', 'Key']).mul(df2.rename(columns={'Key_1': 'Key'})
                                      .set_index('Key'), level=1).reset_index()
Or similarly:
df1.set_index(['Name', 'Key']).mul(df2.set_index('Key_1')
                                      .rename_axis('Key'), level=1).reset_index()
As correctly pointed out by @QuangHoang, you can also do it without renaming:
df1.set_index(['Name', 'Key']).mul(df2.set_index('Key_1'), level=1).reset_index()
Name Key 100 101 102 103 104
0 Abb AB 20 12 10 25 1
1 Bcc BC 1 30 14 8 8
2 Abb AB 50 2 11 15 1
3 Bcc BC 7 10 8 10 0
IIUC, you can use reindex_like to align df2's rows with df1's duplicated keys:
df1.set_index('Key', inplace=True)
df1 = df1.mul(df2.set_index('Key_1').reindex_like(df1).values).fillna(df1)
Out[235]:
Name 100 101 102 103 104
Key
AB Abb 20.0 12.0 10.0 25.0 1.0
BC Bcc 1.0 30.0 14.0 8.0 8.0
AB Abb 50.0 2.0 11.0 15.0 1.0
BC Bcc 7.0 10.0 8.0 10.0 0.0
We could also use DataFrame.merge with pd.Index.difference to select the columns. Merge df1's keys into df2 with df1's keys on the left frame, so the merged rows stay aligned with df1's row order:
mul_cols = df1.columns.difference(['Name', 'Key'])
aligned = df1[['Key']].merge(df2, left_on='Key', right_on='Key_1')
df1.assign(**df1[mul_cols].mul(aligned[mul_cols]))
Name Key 100 101 102 103 104
0 Abb AB 20 12 10 25 1
1 Bcc BC 1 30 14 8 8
2 Abb AB 50 2 11 15 1
3 Bcc BC 7 10 8 10 0