How to group rows in pandas and sum a certain column - python-3.x

Given a DataFrame like this:
A B C D
0 ABC unique_ident_1 10 ONE
1 KLM unique_ident_2 2 TEN
2 KLM unique_ident_2 7 TEN
3 XYZ unique_ident_3 2 TWO
4 ABC unique_ident_1 8 ONE
5 XYZ unique_ident_3 -5 TWO
where column "B" contains a unique text identifier, columns "A" and "D" contain some constant texts dependent from unique id, and column C has a quantity. I want to group rows by unique identifiers (col "B") with quantity column summed up by ident:
A B C D
0 ABC unique_ident_1 18 ONE
1 KLM unique_ident_2 9 TEN
2 XYZ unique_ident_3 -3 TWO
How can I get this result with pandas?

Use named aggregation (keyword arguments mapping to (column, function) tuples) with groupby:
df1 = df.groupby('B', as_index=False).agg(
    A=('A', 'first'),
    C=('C', 'sum'),
    D=('D', 'first')
)[df.columns]
A B C D
0 ABC unique_ident_1 18 ONE
1 KLM unique_ident_2 9 TEN
2 XYZ unique_ident_3 -3 TWO
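
For reference, a minimal self-contained sketch (reconstructing the question's data, nothing assumed beyond it) that reproduces this result:

import pandas as pd

df = pd.DataFrame({
    'A': ['ABC', 'KLM', 'KLM', 'XYZ', 'ABC', 'XYZ'],
    'B': ['unique_ident_1', 'unique_ident_2', 'unique_ident_2',
          'unique_ident_3', 'unique_ident_1', 'unique_ident_3'],
    'C': [10, 2, 7, 2, 8, -5],
    'D': ['ONE', 'TEN', 'TEN', 'TWO', 'ONE', 'TWO'],
})

# Named aggregation: each keyword becomes an output column built from a
# (source column, aggregation function) tuple.
df1 = df.groupby('B', as_index=False).agg(
    A=('A', 'first'),
    C=('C', 'sum'),
    D=('D', 'first'),
)[df.columns]
print(df1)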

You can also create a dictionary and then group, in case you have many columns:
agg_d = {col: 'sum' if col == 'C' else 'first' for col in df.columns}
out = df.groupby('B').agg(agg_d).reset_index(drop=True)
print(out)
A B C D
0 ABC unique_ident_1 18 ONE
1 KLM unique_ident_2 9 TEN
2 XYZ unique_ident_3 -3 TWO
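Here agg_d evaluates to {'A': 'first', 'B': 'first', 'C': 'sum', 'D': 'first'}: every column except C keeps its first value per group while C is summed. Including the grouping column 'B' in the dictionary is fine, as the output above shows; it survives as a regular column after reset_index(drop=True).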

Related

Python create a column based on the values of each row of another column

I have a pandas dataframe as below:
import pandas as pd
df = pd.DataFrame({'ORDER': ["A", "A", "A", "B", "B", "B"],
                   'GROUP': ["A_2018_1B1", "A_2018_1B1H", "A_2018_1M1",
                             "B_2018_I000_1C1", "B_2018_I000_1B1", "B_2018_I000_1C1H"],
                   'VAL': [1, 3, 8, 5, 8, 10]})
df
ORDER GROUP VAL
0 A A_2018_1B1 1
1 A A_2018_1B1H 3
2 A A_2018_1M1 8
3 B B_2018_I000_1C1 5
4 B B_2018_I000_1B1 8
5 B B_2018_I000_1C1H 10
I want to create a column "CAL" as the sum of 'VAL' over rows whose GROUP name is identical except for an H character at the end. So, for example, the 'VAL' values of the first two rows are added because the only difference between their 'GROUP' values is the trailing H in the second row. Row 3 remains as it is, rows 4 and 6 are added together, and row 5 remains the same.
My expected output:
ORDER GROUP VAL CAL
0 A A_2018_1B1 1 4
1 A A_2018_1B1H 3 4
2 A A_2018_1M1 8 8
3 B B_2018_I000_1C1 5 15
4 B B_2018_I000_1B1 8 8
5 B B_2018_I000_1C1H 10 15
Try replace, then transform:
df.groupby(df.GROUP.str.replace('H','')).VAL.transform('sum')
0 4
1 4
2 8
3 15
4 8
5 15
Name: VAL, dtype: int64
df['CAL'] = df.groupby(df.GROUP.str.replace('H','')).VAL.transform('sum')
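
One caveat worth noting: str.replace('H', '') removes every H in the string, not only a trailing one. It happens to be safe for this data, but a sketch of a stricter variant (assuming the trailing-H rule from the question) anchors the replacement with a regex:

# Strip only a trailing 'H'; regex=True makes '$' act as an end-of-string anchor.
key = df.GROUP.str.replace('H$', '', regex=True)
df['CAL'] = df.groupby(key).VAL.transform('sum')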

Is there a way to hide the same values in MultiIndex level 1?

I have the following dataframe (named test) in pandas:
Group 1 Group 2 Species Adj. P-value
0 a b Parabacteroides goldsteinii 7
1 a b Parabacteroides johnsonii 8
2 a b Parabacteroides merdae 9
3 a b Parabacteroides sp 10
4 c d Bacteroides coprocola 1
5 c d Bacteroides dorei 2
I would like to write this table in LaTeX format, but with the repeated values in Group 1 and Group 2 centred across their rows. In LaTeX this is done with the multirow package, and df.to_latex has a parameter called multirow to enable this (see the to_latex docs).
However, a MultiIndex has to be created in order to use the multirow option in to_latex.
So I did this:
test.index = pd.MultiIndex.from_frame(test[["Group 1","Group 2"]])
test = test.drop(["Group 1","Group 2"], axis=1)
test
Species Adj. P-value
Group 1 Group 2
a b Parabacteroides goldsteinii 7
b Parabacteroides johnsonii 8
b Parabacteroides merdae 9
b Parabacteroides sp 10
c d Bacteroides coprocola 1
d Bacteroides dorei 2
And finally I stored the table:
test.to_latex("la_tex_tab.txt", multirow=True, index=True, float_format="{:0.3f}".format)
However, this only applies the multirow treatment to level 0 (Group 1), not to level 1 (Group 2) of the MultiIndex. Do you have any suggestions about how to avoid the repetition of the values b and d in the MultiIndex?
Thank you.
Kind of a hack if you want:
test['Group 2'] = test['Group 2'].mask(test['Group 2'].duplicated(), '')
test.set_index(["Group 1","Group 2"])
Species Adj. P-value
Group 1 Group 2
a b Parabacteroides goldsteinii 7
Parabacteroides johnsonii 8
Parabacteroides merdae 9
Parabacteroides sp 10
c d Bacteroides coprocola 1
Bacteroides dorei 2
We can also do it for display only by using assign to add a blank helper column:
test = test.assign(help='').set_index('help', append=True).drop(["Group 1", "Group 2"], axis=1)
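
For completeness, a minimal end-to-end sketch of the masking hack, assuming the test DataFrame from the question:

import pandas as pd

test = pd.DataFrame({
    'Group 1': ['a', 'a', 'a', 'a', 'c', 'c'],
    'Group 2': ['b', 'b', 'b', 'b', 'd', 'd'],
    'Species': ['Parabacteroides goldsteinii', 'Parabacteroides johnsonii',
                'Parabacteroides merdae', 'Parabacteroides sp',
                'Bacteroides coprocola', 'Bacteroides dorei'],
    'Adj. P-value': [7, 8, 9, 10, 1, 2],
})

# Blank out the repeated level-1 values, then index by both group columns
# before exporting: multirow handles level 0, the mask handles level 1.
test['Group 2'] = test['Group 2'].mask(test['Group 2'].duplicated(), '')
test = test.set_index(['Group 1', 'Group 2'])
test.to_latex('la_tex_tab.txt', multirow=True, index=True,
              float_format='{:0.3f}'.format)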

Pandas: Sort a dataframe based on multiple columns

I know this question has been asked several times, but none of the answers match my case.
I have a pandas dataframe with columns department and employee_count. I need to sort by the employee_count column in descending order, but if there is a tie between two employee_counts, they should be sorted alphabetically by department.
Department Employee_Count
0 abc 10
1 adc 10
2 bca 11
3 cde 9
4 xyz 15
required output:
Department Employee_Count
0 xyz 15
1 bca 11
2 abc 10
3 adc 10
4 cde 9
This is what I've tried.
df = df.sort_values(['Department','Employee_Count'],ascending=[True,False])
But this just sorts the departments alphabetically.
I've also tried to sort by Department first and then by Employee_Count. Like this:
df = df.sort_values(['Department'],ascending=[True])
df = df.sort_values(['Employee_Count'],ascending=[False])
This doesn't give me the correct output either:
Department Employee_Count
4 xyz 15
2 bca 11
1 adc 10
0 abc 10
3 cde 9
It gives 'adc' first and then 'abc'.
Kindly help me.
You can swap the columns in the list and also the values in the ascending parameter:
Explanation:
The order of the column names is the order of sorting: first sort descending by Employee_Count, and where Employee_Count values are tied, only those duplicate rows are sorted ascending by Department.
df1 = df.sort_values(['Employee_Count', 'Department'], ascending=[False, True])
print (df1)
Department Employee_Count
4 xyz 15
2 bca 11
0 abc 10 <-
1 adc 10 <-
3 cde 9
Or, for comparison, if the second value is also False, the tied rows are sorted descending:
df2 = df.sort_values(['Employee_Count', 'Department',],ascending=[False, False])
print (df2)
Department Employee_Count
4 xyz 15
2 bca 11
1 adc 10 <-
0 abc 10 <-
3 cde 9
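
If you also want the clean 0..4 index shown in the required output, sort_values accepts ignore_index=True (available since pandas 1.0), which is equivalent to chaining reset_index(drop=True):

df1 = df.sort_values(['Employee_Count', 'Department'],
                     ascending=[False, True], ignore_index=True)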

How to group by value for a certain time period

I have a DataFrame like below:
Item Date Count
a 6/1/2018 1
b 6/1/2018 2
c 6/1/2018 3
a 12/1/2018 3
b 12/1/2018 4
c 12/1/2018 1
a 1/1/2019 2
b 1/1/2019 3
c 1/1/2019 2
I would like to get the sum of Count per Item within the specified period from 7/1/2018 to 6/1/2019. In this case, the expected output will be:
Item TotalCount
a 5
b 7
c 3
We can use query with Series.between and chain that with GroupBy.sum:
df.query('Date.between("07-01-2018", "06-01-2019")').groupby('Item')['Count'].sum()
Output
Item
a 5
b 7
c 3
Name: Count, dtype: int64
To match your exact output, use reset_index:
df.query('Date.between("07-01-2018", "06-01-2019")').groupby('Item')['Count'].sum()\
  .reset_index(name='TotalCount')
Output
Item TotalCount
0 a 5
1 b 7
2 c 3
Here is one with .loc[] using a lambda:
# df.Date = pd.to_datetime(df.Date)  # make sure Date is a datetime column first
(df.loc[lambda x: x.Date.between("07-01-2018", "06-01-2019")]
   .groupby('Item', as_index=False)['Count'].sum())
Item Count
0 a 5
1 b 7
2 c 3
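
Both approaches assume Date already has a datetime dtype; with raw strings the comparison can silently go wrong. A sketch of the same logic with an explicit conversion and a plain boolean mask instead of query:

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])  # parse the 'm/d/yyyy' strings
mask = df['Date'].between('2018-07-01', '2019-06-01')
out = df.loc[mask].groupby('Item', as_index=False)['Count'].sum()
print(out)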

Pandas Conditionally Combine (and sum) Rows

Given the following data frame:
import pandas as pd
df = pd.DataFrame({'A': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'B': [1, 1, 2, 1, 1, 1],
                   'C': [2, 4, 6, 3, 5, 7]})
df
A B C
0 A 1 2
1 A 1 4
2 A 2 6
3 B 1 3
4 B 1 5
5 B 1 7
Wherever there are rows duplicated across columns 'A' and 'B', I'd like to combine those rows and sum the values under column 'C', like this:
A B C
0 A 1 6
2 A 2 6
3 B 1 15
So far, I can at least identify the duplicates like this:
df['Dup']=df.duplicated(['A','B'],keep=False)
Thanks in advance!
Use groupby() and sum():
In [94]: df.groupby(['A','B']).sum().reset_index()
Out[94]:
A B C
0 A 1 6
1 A 2 6
2 B 1 15
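
Equivalently, as_index=False keeps the grouping keys as columns and saves the reset_index step:

df.groupby(['A', 'B'], as_index=False)['C'].sum()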
