Group By - but sum one column, and show original columns

I have a 5-column df. I need to group by the common names in column A and sum columns B and D, but I need to keep the values that currently sit in columns C through E.
Every time I group by, it drops the columns not involved in the grouping.
I understand that some columns will have two different values for a common item in column A, and I need to display both of those values. Hopefully an example illustrates the problem better.
A       B   C       D  E
Apple   10  Green   1  X
Pear    15  Brown   2  Y
Pear    5   Yellow  3  Z
Banana  4   Yellow  4  P
Plum    2   Red     5  R
I'd like to output:
A       B   C       D  E
Apple   10  Green   1  X
Pear    20  Brown   5  Y
            Yellow     Z
Banana  4   Yellow  4  P
Plum    2   Red     5  R
I can't seem to find the right combination within the groupby function.

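df_save = df_orig.loc[:, ["A", "C", "E"]]
df_agg = df_orig.groupby("A").agg({"B": "sum", "D": "sum"}).reset_index()
df_merged = df_save.merge(df_agg, on="A")
# Cast B and D to object so the numeric columns can hold empty strings
df_merged = df_merged.astype({"B": object, "D": object})
# Blank the sums on every row after the first of each group in A
df_merged.loc[df_merged["A"].duplicated(), ["B", "D"]] = ""
A       C       E  B   D
Apple   Green   X  10  1
Pear    Brown   Y  20  5
Pear    Yellow  Z
Banana  Yellow  P  4   4
Plum    Red     R  2   5
The above is the output after the operations, assuming B and D are numeric; if they hold strings, the sum concatenates them instead (15 and 5 become 155). I hope this works. Thanks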
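A possible alternative, sketched here rather than taken from the answer above: groupby(...).transform("sum") returns one value per original row, so the original column order A to E is kept and no merge is needed. The frame below is rebuilt from the question's example, and the name out is only illustrative.

```python
import pandas as pd

# Example data copied from the question
df_orig = pd.DataFrame({
    "A": ["Apple", "Pear", "Pear", "Banana", "Plum"],
    "B": [10, 15, 5, 4, 2],
    "C": ["Green", "Brown", "Yellow", "Yellow", "Red"],
    "D": [1, 2, 3, 4, 5],
    "E": ["X", "Y", "Z", "P", "R"],
})

out = df_orig.copy()
# transform("sum") broadcasts each group's total back onto every row of the group
out[["B", "D"]] = df_orig.groupby("A")[["B", "D"]].transform("sum")
# Cast to object so the numeric columns can hold empty strings
out = out.astype({"A": object, "B": object, "D": object})
# Blank A, B and D on every row after the first occurrence of each name in A
out.loc[df_orig["A"].duplicated(), ["A", "B", "D"]] = ""
print(out)
```

Blanking the cells turns B and D into object columns, so this is best done as a final display step after any further arithmetic.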

Related

Pandas create a new data frame from counting rows into columns

I have something like this data frame:
item color
0 A red
1 A red
2 A green
3 B red
4 B green
5 B green
6 C red
7 C green
I want to count how many times each color occurs for each item and spread the counts into columns like this:
item red green
0 A 2 1
1 B 1 2
2 C 1 1
Any thoughts? Thanks in advance.
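No answer is quoted for this question here. One way it could be approached, as a minimal sketch assuming the frame has exactly the item and color columns shown, is pd.crosstab, which counts co-occurrences of two columns. The variable name counts is illustrative.

```python
import pandas as pd

# Example data copied from the question
df = pd.DataFrame({
    "item": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "color": ["red", "red", "green", "red", "green", "green", "red", "green"],
})

# Rows are items, columns are colors, cells are occurrence counts
counts = pd.crosstab(df["item"], df["color"])
# Put the columns in the order asked for and turn the item index back into a column
counts = counts[["red", "green"]].reset_index()
# Drop the leftover "color" label on the column axis
counts.columns.name = None
print(counts)
```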

SUM based on list of categories

Consider the following Excel sheet:
   A    B  C   D
1  foo  7      whaa
2  bar  5  AA
3  baz  9  BB
4  bal  1  AA
5  oof  3      blah
6  aba  9  C
Extra:
Each row has a value in either column C or column D
The values in column C are categories (in this example AA, BB, C)
The values in column D can be anything
I need a SUM (based on column A) as follows:
SUM of column B for all lines that have any value in column D (called Rest)
SUM of column B for each category in column C. I have a list of the categories (see below)
So like this:
A B
1 Rest 10 <----- 7 + 3
2 AA 6 <----- 5 + 1
3 BB 9
4 C 9
What formulas do I need in column B above to get this result?

Or, you can use SUMPRODUCT to solve:
H2=SUMPRODUCT(($D$4:$D$9=IF(G2="Rest","",G2))*$C$4:$C$9)
or SUMIF:
H2=SUMIF($D$4:$D$9,IF(G2="Rest","",G2),$C$4:$C$9)
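The formulas treat a blank category cell as Rest and otherwise sum per category. The same logic can be checked outside Excel; the following is only a pandas sketch of the arithmetic, with the question's values typed in by hand and hypothetical column names value and category.

```python
import pandas as pd

# Values from the question; None marks rows whose category cell is empty
df = pd.DataFrame({
    "value": [7, 5, 9, 1, 3, 9],
    "category": [None, "AA", "BB", "AA", None, "C"],
})

# Rows without a category are grouped under "Rest", like IF(G2="Rest","",G2) does in reverse
totals = df.groupby(df["category"].fillna("Rest"))["value"].sum()
# Show the totals in the order listed in the question
result = totals.reindex(["Rest", "AA", "BB", "C"])
print(result)
```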

Sort values of multiple text columns in pandas dataframe

I have a pandas dataframe like this:
A B C D E
0 apple banana orange 5 0.09
1 orange apple banana 10 4.0
2 banana orange apple 15 1.9
3 banana apple banana 20 2.8
I want to sort the values within each row, considering only columns A, B, C, as follows:
0 apple banana orange 5 0.09
1 apple banana orange 10 4.0
2 apple banana orange 15 1.9
3 apple banana banana 20 2.8
I have tried a solution like df['F']=(df.A+df.B+df.C).map(set).map(list), so that I can create a new column F and later replace A, B, C with the values of the split list in F. But it concatenates all the letters of my strings and creates a set out of them, so it is of no use, as follows:
A B C D E F
0 apple banana orange 5 0.09 [b, g, r, l, n, a, p, e, o]
1 orange apple banana 10 4.0 [b, g, r, l, n, a, p, e, o]
2 banana orange apple 15 1.9 [b, g, r, l, n, a, p, e, o]
3 banana apple banana 20 2.8 [b, l, n, a, p, e]

Try:
import numpy as np
df[['A','B','C']] = np.sort(df[['A','B','C']].to_numpy(), axis=1)
or
df[['A','B','C']] = [sorted(i) for i in df[['A','B','C']].to_numpy()]
Output:
A B C D E
0 apple banana orange 5 0.09
1 apple banana orange 10 4.00
2 apple banana orange 15 1.90
3 apple banana banana 20 2.80

Filter rows based on the count of unique values

I need to count how often each value occurs in column A and keep only the rows whose value occurs at least, say, 2 times
A C
Apple 4
Orange 5
Apple 3
Mango 5
Orange 1
I have calculated the counts with df.value_counts() but am not able to figure out how to filter on them.
I want to keep the rows whose value in column A occurs at least 2 times. Expected DataFrame:
A C
Apple 4
Orange 5
Apple 3
Orange 1

value_counts should be called on a Series (single column) rather than a DataFrame:
counts = df['A'].value_counts()
Giving:
A
Apple 2
Orange 2
Mango 1
dtype: int64
You can then filter this to only keep those >= 2 and use isin to filter your DataFrame:
filtered = counts[counts >= 2]
df[df['A'].isin(filtered.index)]
Giving:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1

Use duplicated with parameter keep=False:
df[df.duplicated(['A'], keep=False)]
Output:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
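Note that duplicated with keep=False only covers the "occurs at least twice" case. For an arbitrary threshold, one option is to compare each row's group size against it. This is a sketch, not part of either answer; min_count is a hypothetical name.

```python
import pandas as pd

# Example data copied from the question
df = pd.DataFrame({
    "A": ["Apple", "Orange", "Apple", "Mango", "Orange"],
    "C": [4, 5, 3, 5, 1],
})

min_count = 2  # hypothetical threshold; raise it to require more occurrences
# transform("size") gives every row the number of rows sharing its value in A
group_sizes = df.groupby("A")["A"].transform("size")
result = df[group_sizes >= min_count]
print(result)
```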

Count by row on based on more than one column

I have data shaped like the following in excel:
A B C D
"foo" 5 3 1
"foo" 2 4 5
"foo" 5 5 5
"bar" 1 2 3
"bar" 4 5 7
I want to know how many rows contain "foo" in column A and 5 in either one of column B, C or D.
In other words, I want the following formula =COUNTIFS(A1:A5;"foo";B1:B5;5;C1:C5;5;D1:D5;5), but with the B, C and D ranges or'ed together instead of and'ed. Is there a simple way to do this with an excel formula?

Try,
=SUMPRODUCT((A1:A5="foo")*SIGN((B1:B5=5)+(C1:C5=5)+(D1:D5=5)))
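The SIGN call matters because the three comparisons are added, so a row with several 5s would otherwise count more than once. The plain-Python sketch below mirrors the arithmetic of the formula on the question's data to show the difference; the helper sign and the variable names are illustrative only.

```python
# Rows from the question: (A, B, C, D)
rows = [
    ("foo", 5, 3, 1),
    ("foo", 2, 4, 5),
    ("foo", 5, 5, 5),
    ("bar", 1, 2, 3),
    ("bar", 4, 5, 7),
]

def sign(n):
    """Return 1, 0 or -1 like Excel's SIGN."""
    return (n > 0) - (n < 0)

# Each comparison is True/False (1/0); adding them counts the 5s in the row,
# and sign() caps that count at 1 so the sum behaves like a logical OR
with_sign = sum((a == "foo") * sign((b == 5) + (c == 5) + (d == 5)) for a, b, c, d in rows)
# Without sign(), the third row contributes 3 instead of 1
without_sign = sum((a == "foo") * ((b == 5) + (c == 5) + (d == 5)) for a, b, c, d in rows)
print(with_sign, without_sign)
```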

If we can use an extra column, we can also achieve this:
A B C D E
foo 5 0 1 =COUNTIF(B2:D2,5) (returns 1)
foo 2 5 5 =COUNTIF(B3:D3,5) (returns 2)
foo 5 5 5 =COUNTIF(B4:D4,5) (returns 3)
bar 1 5 3 =COUNTIF(B5:D5,5) (returns 1)
foo 4 0 7 =COUNTIF(B6:D6,5) (returns 0)
Then we can use COUNTIFS:
=COUNTIFS(A:A,"foo",E:E,">0") (returns 3)
