Group By - but sum one column, and show original columns - pandas-groupby

I have a 5-column df. I need to group by the common names in column A and sum columns B and D, but I also need to keep the output that currently sits in columns C through E.
Every time I group by, it drops the columns not involved in the grouping.
I understand some columns will have two different values for a common item in column A, and I need to display both of those values. I hope an example illustrates the problem better.
A       B   C       D  E
Apple   10  Green   1  X
Pear    15  Brown   2  Y
Pear    5   Yellow  3  Z
Banana  4   Yellow  4  P
Plum    2   Red     5  R
I'd like the output to be:
A       B   C       D  E
Apple   10  Green   1  X
Pear    20  Brown   5  Y
            Yellow     Z
Banana  4   Yellow  4  P
Plum    2   Red     5  R
I can't seem to find the right combination within the groupby function.

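# Keep the columns that are not aggregated.
df_save = df_orig.loc[:, ["A", "C", "E"]]
# Sum B and D for each value of A.
df_agg = df_orig.groupby("A").agg({"B": "sum", "D": "sum"}).reset_index()
# Attach the group totals to every original row.
df_merged = df_save.merge(df_agg, on="A")
for c in ["B", "D"]:
    # Blank the total on repeated rows of the same group; object dtype lets '' sit next to numbers.
    df_merged[c] = df_merged[c].astype(object)
    df_merged.loc[df_merged["A"].duplicated(), c] = ''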
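A       C       E  B    D
Apple   Green   X  10   1
Pear    Brown   Y  155  23
Pear    Yellow  Z
Banana  Yellow  P  4    4
Plum    Red     R  2    5
The above is the output after the operations. Note that 155 and 23 in the Pear row are string concatenations ("15" + "5" and "2" + "3"), which means B and D were stored as strings in this run; convert them to numbers first (for example with pd.to_numeric) and the sums become 20 and 5. I hope this works. Thanks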
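For readers who want to run this end to end, here is a minimal sketch with numeric B and D. It uses groupby(...).transform("sum") in place of the merge, which is a swapped-in technique and not the code from the answer above. The sample frame is rebuilt from the question, and blanking A on repeated rows is an assumption about the desired display.

```python
import pandas as pd

# Sample data rebuilt from the question; B and D are numeric here.
df_orig = pd.DataFrame({
    "A": ["Apple", "Pear", "Pear", "Banana", "Plum"],
    "B": [10, 15, 5, 4, 2],
    "C": ["Green", "Brown", "Yellow", "Yellow", "Red"],
    "D": [1, 2, 3, 4, 5],
    "E": ["X", "Y", "Z", "P", "R"],
})

out = df_orig.copy()
# Broadcast each group's total back onto every row of that group.
out[["B", "D"]] = df_orig.groupby("A")[["B", "D"]].transform("sum")

# Rows after the first one in each group repeat A, B and D.
repeat = out["A"].duplicated()

# Cast to object so that empty strings can sit next to numbers,
# then blank the repeated cells for display only.
out = out.astype(object)
out.loc[repeat, ["A", "B", "D"]] = ""

print(out.to_string(index=False))
```

Blanking cells turns B and D into mixed-type columns, so it is probably best kept as a final display step, with the numeric frame retained for any further calculation.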

Related

Pandas create a new data frame from counting rows into columns

I have something like this data frame:
item color
0 A red
1 A red
2 A green
3 B red
4 B green
5 B green
6 C red
7 C green
And I want to count how many times each color occurs for each item and group the counts into columns like this:
item red green
0 A 2 1
1 B 1 2
2 C 1 1
Any thoughts? Thanks in advance
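One possible approach, offered as a sketch and not as an answer taken from the original thread, is pd.crosstab, which counts how often each (item, color) pair occurs. The frame below is rebuilt from the question; selecting ["red", "green"] only reorders the columns to match the requested layout.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "item": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "color": ["red", "red", "green", "red", "green", "green", "red", "green"],
})

# Count occurrences of every (item, color) pair; colors become columns.
counts = pd.crosstab(df["item"], df["color"])

# Put the columns in the order shown in the question and make item a column again.
counts = counts[["red", "green"]].reset_index()
counts.columns.name = None  # drop the leftover "color" label on the column axis

print(counts)
```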

SUM based on list of categories

Consider the following Excel
   A    B  C   D
1  foo  7      whaa
2  bar  5  AA
3  baz  9  BB
4  bal  1  AA
5  oof  3      blah
6  aba  9  C
Extra:
Each row has either a value in column C OR in column D
The values in column C are categories (in this example AA, BB, C)
The values in column D can be anything
I need a SUM (based on column A) as follows:
SUM of column B for all lines that have a value (any value) in column D (called Rest)
SUM of column B for each category in column C. I have a list of the categories (see below)
So like this:
A B
1 Rest 10 <----- 7 + 3
2 AA 6 <----- 5 + 1
3 BB 9
4 C 9
What formulas do I need in column B above to get this result?
Or you can use SUMPRODUCT or SUMIF to solve it:
H2=SUMPRODUCT(($D$4:$D$9=IF(G2="Rest","",G2))*$C$4:$C$9)
H2=SUMIF($D$4:$D$9,IF(G2="Rest","",G2),$C$4:$C$9)

Sort values of multiple text columns in pandas dataframe

I have a pandas dataframe like this:
A B C D E
0 apple banana orange 5 0.09
1 orange apple banana 10 4.0
2 banana orange apple 15 1.9
3 banana apple banana 20 2.8
I want to sort values of each row only based on column A,B,C as follows:
0 apple banana orange 5 0.09
1 apple banana orange 10 4.0
2 apple banana orange 15 1.9
3 apple banana banana 20 2.8
I have tried df['F']=(df.A+df.B+df.C).map(set).map(list), hoping to create a new column F and later replace A, B, C with the values of the split list in F, but it concatenates the strings and builds a set out of their individual letters, so it is of no use, as follows:
A B C D E F
0 apple banana orange 5 0.09 [b, g, r, l, n, a, p, e, o]
1 orange apple banana 10 4.0 [b, g, r, l, n, a, p, e, o]
2 banana orange apple 15 1.9 [b, g, r, l, n, a, p, e, o]
3 banana apple banana 20 2.8 [b, l, n, a, p, e]
Try:
import numpy as np
df[['A','B','C']] = np.sort(df[['A','B','C']].to_numpy(), axis=1)
or
df[['A','B','C']] = [sorted(i) for i in df[['A','B','C']].to_numpy()]
Output:
A B C D E
0 apple banana orange 5 0.09
3 apple banana banana 20 2.80
2 apple banana orange 15 1.90
1 apple banana orange 10 4.00
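A self-contained sketch of the np.sort variant, with the frame rebuilt from the question. It also shows why the original attempt failed: adding the columns concatenates the strings, and set() of a string yields its individual characters. Passing dtype=object to to_numpy is a defensive assumption so that the strings are compared as Python objects.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "A": ["apple", "orange", "banana", "banana"],
    "B": ["banana", "apple", "orange", "apple"],
    "C": ["orange", "banana", "apple", "banana"],
    "D": [5, 10, 15, 20],
    "E": [0.09, 4.0, 1.9, 2.8],
})

# Why the attempt in the question produced letters: the set of a string is its characters.
letters = set("banana" + "apple" + "banana")

cols = ["A", "B", "C"]
# Sort the three text columns within each row (axis=1) and write them back.
df[cols] = np.sort(df[cols].to_numpy(dtype=object), axis=1)

print(df)
```

Columns D and E are untouched, and the row order is unchanged, because only the values inside each row are rearranged.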

Filter rows based on the count of unique values

I need to count how often each value occurs in column A and keep only the rows whose value occurs at least, say, 2 times
A C
Apple 4
Orange 5
Apple 3
Mango 5
Orange 1
I have calculated the counts but am not able to figure out how to filter on them with df.value_counts()
I want to keep the rows whose value in column A occurs 2 or more times. Expected DataFrame:
A C
Apple 4
Orange 5
Apple 3
Orange 1
value_counts should be called on a Series (single column) rather than a DataFrame:
counts = df['A'].value_counts()
Giving:
A
Apple 2
Mango 1
Orange 2
dtype: int64
You can then filter this to only keep those >= 2 and use isin to filter your DataFrame:
filtered = counts[counts >= 2]
df[df['A'].isin(filtered.index)]
Giving:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
Use duplicated with parameter keep=False:
df[df.duplicated(['A'], keep=False)]
Output:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
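The two answers above can be compared side by side in a short sketch. The frame is rebuilt from the question, min_count is a hypothetical name for the threshold, and the groupby/transform variant is an extra alternative that is not part of either answer. Note that duplicated(keep=False) only matches the other two when the threshold is exactly 2.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "A": ["Apple", "Orange", "Apple", "Mango", "Orange"],
    "C": [4, 5, 3, 5, 1],
})

min_count = 2  # hypothetical threshold name, not from the original answers

# Approach 1: count each value, then keep rows whose value is frequent enough.
counts = df["A"].value_counts()
by_counts = df[df["A"].isin(counts[counts >= min_count].index)]

# Approach 2 (alternative): attach the group size to every row and filter on it.
by_transform = df[df.groupby("A")["A"].transform("size") >= min_count]

# Approach 3: keep every row whose value in A occurs more than once.
by_duplicated = df[df.duplicated(["A"], keep=False)]

print(by_counts)
```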

Count by row based on more than one column

I have data shaped like the following in excel:
A B C D
"foo" 5 3 1
"foo" 2 4 5
"foo" 5 5 5
"bar" 1 2 3
"bar" 4 5 7
I want to know how many rows contain "foo" in column A and 5 in either one of column B, C or D.
In other words, I want the following formula =COUNTIFS(A1:A5;"foo";B1:B5;5;C1:C5;5;D1:D5;5), but with the B, C and D ranges or'ed together instead of and'ed. Is there a simple way to do this with an excel formula?
Try,
=SUMPRODUCT((A1:A5="foo")*SIGN((B1:B5=5)+(C1:C5=5)+(D1:D5=5)))
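As a cross-check outside Excel, the same logic (column A equals "foo" AND at least one of B, C, D equals 5) can be sketched in pandas. This is only an illustration of the or'ed condition using the data from the question, not a replacement for the formula.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar", "bar"],
    "B": [5, 2, 5, 1, 4],
    "C": [3, 4, 5, 2, 5],
    "D": [1, 5, 5, 3, 7],
})

# Condition 1: column A is "foo".
is_foo = df["A"].eq("foo")

# Condition 2: any of B, C, D equals 5 (the role SIGN(...) plays in the formula).
has_five = df[["B", "C", "D"]].eq(5).any(axis=1)

# Rows satisfying both conditions.
count = int((is_foo & has_five).sum())
print(count)
```

The last row contains a 5 but belongs to "bar", so it is excluded, just as the multiplication by (A1:A5="foo") excludes it in the formula.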
If we can use an extra column, we can also achieve this with a helper column E (the result for each row is shown after the arrow):
A B C D E
foo 5 0 1 =COUNTIF(B2:D2,5) -> 1
foo 2 5 5 =COUNTIF(B3:D3,5) -> 2
foo 5 5 5 =COUNTIF(B4:D4,5) -> 3
bar 1 5 3 =COUNTIF(B5:D5,5) -> 1
foo 4 0 7 =COUNTIF(B6:D6,5) -> 0
Then we can use COUNTIFS:
=COUNTIFS(A:A,"foo",E:E,">0") = 3
