Python: Groupby and sum respective rows and update dataframe column - python-3.x

Input df:
Store Category Item tot_table
11 AA Apple 13.5
11 AA Orange 13.5
11 BB Potato 11.5
11 BB Carrot 11.5
12 AA Apple 10
12 BB Potato 9
12 BB Carrot 9
I need something like df.groupby('Store')['tot_table'].unique().sum(), but this line of code doesn't work.
Expected output df:
Store Category Item split_table tot_table
11 AA Apple 13.5 25
11 AA Orange 13.5 25
11 BB Potato 11.5 25
11 BB Carrot 11.5 25
12 AA Apple 10 19
12 BB Potato 9 19
12 BB Carrot 9 19

You can use groupby.transform with unique/sum:
df['tot_table'] = (df.groupby('Store')['tot_table']
.transform(lambda s: s.unique().sum())
)
output:
Store Category Item tot_table
0 11 AA Apple 25.0
1 11 AA Orange 25.0
2 11 BB Potato 25.0
3 11 BB Carrot 25.0
4 12 AA Apple 19.0
5 12 BB Potato 19.0
6 12 BB Carrot 19.0
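To match the expected output exactly (keeping the original per-row values in a split_table column), you can rename first and then apply the same transform. A self-contained sketch using the data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'Store': [11, 11, 11, 11, 12, 12, 12],
    'Category': ['AA', 'AA', 'BB', 'BB', 'AA', 'BB', 'BB'],
    'Item': ['Apple', 'Orange', 'Potato', 'Carrot', 'Apple', 'Potato', 'Carrot'],
    'tot_table': [13.5, 13.5, 11.5, 11.5, 10, 9, 9],
})

# Keep the per-row value as split_table, then write the per-store
# sum of the *unique* values into tot_table.
df = df.rename(columns={'tot_table': 'split_table'})
df['tot_table'] = (df.groupby('Store')['split_table']
                     .transform(lambda s: s.unique().sum()))
```

Store 11 has unique values {13.5, 11.5}, giving 25; store 12 has {10, 9}, giving 19.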

Related

Pandas: Combine pandas columns that have the same column name

If we have the following df,
df
A A B B B
0 10 2 0 3 3
1 20 4 19 21 36
2 30 20 24 24 12
3 40 10 39 23 46
How can I combine the content of the columns with the same names?
e.g.
A B
0 10 0
1 20 19
2 30 24
3 40 39
4 2 3
5 4 21
6 20 24
7 10 23
8 Na 3
9 Na 36
10 Na 12
11 Na 46
I tried groupby and merge, and neither does the job.
Any help is appreciated.
If column names are duplicated you can use DataFrame.melt with concat:
df = pd.concat([df['A'].melt()['value'], df['B'].melt()['value']], axis=1, keys=['A','B'])
print (df)
A B
0 10.0 0
1 20.0 19
2 30.0 24
3 40.0 39
4 2.0 3
5 4.0 21
6 20.0 24
7 10.0 23
8 NaN 3
9 NaN 36
10 NaN 12
11 NaN 46
EDIT:
uniq = df.columns.unique()
df = pd.concat([df[c].melt()['value'] for c in uniq], axis=1, keys=uniq)
print (df)
A B
0 10.0 0
1 20.0 19
2 30.0 24
3 40.0 39
4 2.0 3
5 4.0 21
6 20.0 24
7 10.0 23
8 NaN 3
9 NaN 36
10 NaN 12
11 NaN 46
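A self-contained sketch of the generalized version, building the duplicated-column frame from the question. Note that df[c] returns a multi-column DataFrame here because the label is duplicated, which is why .melt() stacks the columns:

```python
import pandas as pd

# DataFrame with duplicated column names, as in the question.
df = pd.DataFrame([[10, 2, 0, 3, 3],
                   [20, 4, 19, 21, 36],
                   [30, 20, 24, 24, 12],
                   [40, 10, 39, 23, 46]],
                  columns=['A', 'A', 'B', 'B', 'B'])

# Melt each group of same-named columns into one long column,
# then align them side by side; shorter columns are padded with NaN.
uniq = df.columns.unique()
out = pd.concat([df[c].melt()['value'] for c in uniq], axis=1, keys=uniq)
```

Because 'A' contributes 8 values and 'B' contributes 12, the A column ends with four NaNs after alignment.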

Manipulating Pandas Columns into different labels [duplicate]

Let's assume that I have the following dataframe in pandas:
AA BB CC
date
05/03 1 2 3
06/03 4 5 6
07/03 7 8 9
08/03 5 7 1
and I want to transform it to the following:
AA 05/03 1
AA 06/03 4
AA 07/03 7
AA 08/03 5
BB 05/03 2
BB 06/03 5
BB 07/03 8
BB 08/03 7
CC 05/03 3
CC 06/03 6
CC 07/03 9
CC 08/03 1
How can I do it?
The reason of the transformation from wide to long is that, in the next stage, I would like to merge this dataframe with another one, based on dates and the initial column names (AA, BB, CC).
Use pandas.melt or pandas.DataFrame.melt to transform from wide to long:
df = pd.DataFrame({
'date' : ['05/03', '06/03', '07/03', '08/03'],
'AA' : [1, 4, 7, 5],
'BB' : [2, 5, 8, 7],
'CC' : [3, 6, 9, 1]
}).set_index('date')
df
AA BB CC
date
05/03 1 2 3
06/03 4 5 6
07/03 7 8 9
08/03 5 7 1
To convert, we just need to reset the index and then melt:
df = df.reset_index()
pd.melt(df, id_vars='date', value_vars=['AA', 'BB', 'CC'])
Using .reset_index after .melt removes the need to specify value_vars:
dfm = df.melt(ignore_index=False).reset_index()
Final Result - both options
date variable value
0 05/03 AA 1
1 06/03 AA 4
2 07/03 AA 7
3 08/03 AA 5
4 05/03 BB 2
5 06/03 BB 5
6 07/03 BB 8
7 08/03 BB 7
8 05/03 CC 3
9 06/03 CC 6
10 07/03 CC 9
11 08/03 CC 1
Update
As George Liu has shown in another answer, pd.melt is the idiomatic, flexible and fast solution to this problem. Do not use unstack for this.
unstack returns a series with a multiindex:
In [38]: df.unstack()
Out[38]:
date
AA 05/03 1
06/03 4
07/03 7
08/03 5
BB 05/03 2
06/03 5
07/03 8
08/03 7
CC 05/03 3
06/03 6
07/03 9
08/03 1
dtype: int64
You can call reset_index on the returning series:
In [39]: df.unstack().reset_index()
Out[39]:
level_0 date 0
0 AA 05/03 1
1 AA 06/03 4
2 AA 07/03 7
3 AA 08/03 5
4 BB 05/03 2
5 BB 06/03 5
6 BB 07/03 8
7 BB 08/03 7
8 CC 05/03 3
9 CC 06/03 6
10 CC 07/03 9
11 CC 08/03 1
Or construct a dataframe with a multiindex:
In [40]: pd.DataFrame(df.unstack())
Out[40]:
0
date
AA 05/03 1
06/03 4
07/03 7
08/03 5
BB 05/03 2
06/03 5
07/03 8
08/03 7
CC 05/03 3
06/03 6
07/03 9
08/03 1
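Since the stated next stage is a merge with another dataframe based on the dates and the initial column names, the melted frame can be joined on ['date', 'variable']. A sketch with a small, hypothetical second frame (other and its weight column are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['05/03', '06/03'],
    'AA': [1, 4],
    'BB': [2, 5],
}).set_index('date')

# Wide to long: one row per (date, original column name).
long_df = df.reset_index().melt(id_vars='date')

# Hypothetical second frame keyed by date and the old column names.
other = pd.DataFrame({'date': ['05/03', '05/03'],
                      'variable': ['AA', 'BB'],
                      'weight': [0.5, 2.0]})

merged = long_df.merge(other, on=['date', 'variable'], how='left')
```

Rows with no match in other simply get NaN in weight because of how='left'.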

Replace NaN with existing value of the group

Name Value
0 AA 33
1 AA 24
2 BB 23
3 BB NaN
4 CC NaN
5 CC 23
6 CC 45
How can I replace these NaN with existing values by looking at column Name? For CC I would like to get the max (but if it is too convoluted, then I am fine with either 23 or 45). The expected output:
Name Value
0 AA 33
1 AA 24
2 BB 23
3 BB 23
4 CC 45
5 CC 23
6 CC 45
Thanks!
You can groupby and transform with max then fillna:
df['Value'] = df['Value'].fillna(df.groupby("Name")['Value'].transform('max'))
print(df)
Name Value
0 AA 33.0
1 AA 24.0
2 BB 23.0
3 BB 23.0
4 CC 45.0
5 CC 23.0
6 CC 45.0
You can also use a lambda with transform:
df["Value"] = df.groupby('Name').transform(lambda x:x.fillna(x.max()))
df
Name Value
0 AA 33.0
1 AA 24.0
2 BB 23.0
3 BB 23.0
4 CC 45.0
5 CC 23.0
6 CC 45.0
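A runnable version of the first answer, rebuilding the frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['AA', 'AA', 'BB', 'BB', 'CC', 'CC', 'CC'],
    'Value': [33, 24, 23, np.nan, np.nan, 23, 45],
})

# Fill each NaN with the max Value of its Name group.
df['Value'] = df['Value'].fillna(df.groupby('Name')['Value'].transform('max'))
```

transform('max') broadcasts each group's max back to the group's rows, so fillna only touches the rows that were NaN.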

Multiply 2 different dataframe with same dimension and repeating rows

I am trying to multiply two data frames.
Df1
Name  Key  100  101  102  103  104
Abb   AB     2    6   10    5    1
Bcc   BC     1    3    7    4    2
Abb   AB     5    1   11    3    1
Bcc   BC     7    1    4    5    0
Df2
Key_1  100  101  102  103  104
AB      10    2    1    5    1
BC       1   10    2    2    4
Expected Output
Name  Key  100  101  102  103  104
Abb   AB    20   12   10   25    1
Bcc   BC     1   30   14    8    8
Abb   AB    50    2   11   15    1
Bcc   BC     7   10    8   10    0
I have tried grouping Df1 and then multiplying with Df2, but it didn't work.
Please help me with how to approach this problem.
You can rename df2's Key_1 to Key (to match df1), then set the index and mul on level=1:
df1.set_index(['Name','Key']).mul(df2.rename(columns={'Key_1':'Key'})
.set_index('Key'),level=1).reset_index()
Or similar:
df1.set_index(['Name','Key']).mul(df2.set_index('Key_1')
.rename_axis('Key'),level=1).reset_index()
As correctly pointed out by @QuangHoang, you can do this without renaming too:
df1.set_index(['Name','Key']).mul(df2.set_index('Key_1'),level=1).reset_index()
Name Key 100 101 102 103 104
0 Abb AB 20 12 10 25 1
1 Bcc BC 1 30 14 8 8
2 Abb AB 50 2 11 15 1
3 Bcc BC 7 10 8 10 0
IIUC, reindex_like:
df1.set_index('Key',inplace=True)
df1=df1.mul(df2.set_index('Key_1').reindex_like(df1).values).fillna(df1)
Out[235]:
Name 100 101 102 103 104
Key
AB Abb 20.0 12.0 10.0 25.0 1.0
BC Bcc 1.0 30.0 14.0 8.0 8.0
AB Abb 50.0 2.0 11.0 15.0 1.0
BC Bcc 7.0 10.0 8.0 10.0 0.0
We could also use DataFrame.merge with pd.Index.difference to select columns.
mul_cols = df1.columns.difference(['Name','Key'])
df1.assign(**df1[mul_cols].mul(df2.merge(df1[['Key']],
left_on = 'Key_1',
right_on = 'Key')[mul_cols]))
Name Key 100 101 102 103 104
0 Abb AB 20 12 10 25 1
1 Bcc BC 10 6 7 20 2
2 Abb AB 5 10 22 6 4
3 Bcc BC 7 10 8 10 0
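If the level= alignment feels opaque, here is an alternative sketch that avoids MultiIndex alignment entirely: look up each row's factor row in df2 with .loc on the Key values, then multiply as plain NumPy arrays (this assumes the numeric column labels are strings; adjust if they are ints):

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Abb', 'Bcc', 'Abb', 'Bcc'],
    'Key':  ['AB', 'BC', 'AB', 'BC'],
    '100': [2, 1, 5, 7], '101': [6, 3, 1, 1],
    '102': [10, 7, 11, 4], '103': [5, 4, 3, 5], '104': [1, 2, 1, 0],
})
df2 = pd.DataFrame({
    'Key_1': ['AB', 'BC'],
    '100': [10, 1], '101': [2, 10], '102': [1, 2],
    '103': [5, 2], '104': [1, 4],
})

mul_cols = df2.columns.drop('Key_1')
# One factor row per df1 row, repeated keys included.
factors = df2.set_index('Key_1').loc[df1['Key'], mul_cols].to_numpy()
out = df1.copy()
out[mul_cols] = df1[mul_cols].to_numpy() * factors
```

Because .loc repeats the AB/BC rows in df1's row order, the elementwise product lines up with the expected output.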

Merge matching rows in excel & summarizing matching columns

Looking to merge some data and summarize the results. Been poking around Google but haven't found anything that will match up duplicates and summarize them.
The left side of the table is what I'm starting with, I would like the output on the right side.
Street Name Widgets Sprockets Nuts Bolts Street Name Widgets Sprockets Nuts Bolts
123 Any street ACB Co 10 248 2 50 123 Any street ACB Co 10 846 10 78
123 Any street Bob's plumbing 25 22 2 7 123 Any street Bob's plumbing 25 22 2 7
456 Another st Bill's cars 55 5 456 456 Another st Bill's cars 62 878 13 55
123 Any street ACB Co 54 4 6 789 789 Ave Shelley and co 5 2 2 78
456 Another st Bill's cars 7 878 8 55 789 Ave Divers down 7 90 10 11
789 Ave Shelley and co 5 2 2 78 456 Another st ACB Co 6 50 5
123 Any street ACB Co 544 4 22
456 Another st ACB Co 6 50 5
789 Ave Divers down 6 90 9 4
789 Ave Divers down 1 1 7
Use Pivot Tables and set the layout to tabular.
Details can be found here: https://www.youtube.com/watch?v=LkFPBn7sgEc
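For anyone doing the same merge-and-summarize outside Excel, the pandas equivalent of that pivot is a groupby().sum() on the key columns. A sketch with made-up rows in the shape of the left-hand table (not the exact figures above):

```python
import pandas as pd

# Illustrative rows only: duplicated (Street, Name) pairs to be summed.
df = pd.DataFrame({
    'Street': ['123 Any street', '123 Any street', '456 Another st'],
    'Name': ['ACB Co', 'ACB Co', "Bill's cars"],
    'Widgets': [10, 54, 55],
    'Sprockets': [248, 4, 5],
})

# Collapse duplicate (Street, Name) rows, summing the numeric columns.
summary = df.groupby(['Street', 'Name'], as_index=False).sum()
```

as_index=False keeps Street and Name as ordinary columns, matching the tabular layout of the desired output.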