Sort values of multiple text columns in pandas dataframe - python-3.x

I have a pandas dataframe like this:
A B C D E
0 apple banana orange 5 0.09
1 orange apple banana 10 4.0
2 banana orange apple 15 1.9
3 banana apple banana 20 2.8
I want to sort values of each row only based on column A,B,C as follows:
0 apple banana orange 5 0.09
1 apple banana orange 10 4.0
2 apple banana orange 15 1.9
3 apple banana banana 20 2.8
I have tried the solution df['F']=(df.A+df.B+df.C).map(set).map(list), so that I could create a new column F and later replace A, B, C with the values of the split list in F, but it concatenates all the letters of my strings and creates a set out of those characters, and is therefore of no use, as follows:
A B C D E F
0 apple banana orange 5 0.09 [b, g, r, l, n, a, p, e, o]
1 orange apple banana 10 4.0 [b, g, r, l, n, a, p, e, o]
2 banana orange apple 15 1.9 [b, g, r, l, n, a, p, e, o]
3 banana apple banana 20 2.8 [b, l, n, a, p, e]
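A small plain-Python illustration of what went wrong: adding the three string columns concatenates each row's strings into one long string, and set() over a string iterates its individual characters.
s = 'apple' + 'banana' + 'orange'
print(set(s))  # {'a', 'b', 'e', 'g', 'l', 'n', 'o', 'p', 'r'}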

Try:
import numpy as np

df[['A','B','C']] = np.sort(df[['A','B','C']].to_numpy(), axis=1)
or
df[['A','B','C']] = [sorted(i) for i in df[['A','B','C']].to_numpy()]
Output:
A B C D E
0 apple banana orange 5 0.09
1 apple banana orange 10 4.00
2 apple banana orange 15 1.90
3 apple banana banana 20 2.80
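For reference, a runnable end-to-end version of the first approach, with the dataframe built from the question's data; np.sort(..., axis=1) sorts the three strings within each row lexicographically and leaves D and E untouched:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['apple', 'orange', 'banana', 'banana'],
    'B': ['banana', 'apple', 'orange', 'apple'],
    'C': ['orange', 'banana', 'apple', 'banana'],
    'D': [5, 10, 15, 20],
    'E': [0.09, 4.0, 1.9, 2.8],
})
# sort each row's A/B/C values in place, row-wise
df[['A', 'B', 'C']] = np.sort(df[['A', 'B', 'C']].to_numpy(), axis=1)
print(df)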

Related

Group By - but sum one column, and show original columns

I have a 5-column df. I need to group by the common names in column A and sum columns B and D, but I also need to keep the values that currently sit in columns C and E.
Every time I group by, it drops the columns not involved in the grouping.
I understand some columns will have two non-common rows for a common item in column A, and I need to display both of those values. Hopefully the example below illustrates the problem.
A       B   C       D  E
Apple   10  Green   1  X
Pear    15  Brown   2  Y
Pear    5   Yellow  3  Z
Banana  4   Yellow  4  P
Plum    2   Red     5  R
I'd like to output:
A       B   C       D  E
Apple   10  Green   1  X
Pear    20  Brown   5  Y
            Yellow     Z
Banana  4   Yellow  4  P
Plum    2   Red     5  R
I can't seem to find the right combination of options within the groupby function.
df_save = df_orig.loc[:, ["A", "C", "E"]]
df_agg = df_orig.groupby("A").agg({"B": "sum", "D": "sum"}).reset_index()
df_merged = df_save.merge(df_agg)
for c in ["B", "D"]:
    # blank out an aggregate that repeats within the same group in A
    df_merged.loc[df_merged.duplicated(["A", c]), c] = ''
A       C       E  B   D
Apple   Green   X  10  1
Pear    Brown   Y  20  5
Pear    Yellow  Z
Banana  Yellow  P  4   4
Plum    Red     R  2   5
The above is the output after the operations. I hope this works. Thanks
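A minimal self-contained run of the above, with df_orig rebuilt from the question's input table (a sketch; B and D are assumed numeric):
import pandas as pd

df_orig = pd.DataFrame({
    'A': ['Apple', 'Pear', 'Pear', 'Banana', 'Plum'],
    'B': [10, 15, 5, 4, 2],
    'C': ['Green', 'Brown', 'Yellow', 'Yellow', 'Red'],
    'D': [1, 2, 3, 4, 5],
    'E': ['X', 'Y', 'Z', 'P', 'R'],
})

df_save = df_orig.loc[:, ['A', 'C', 'E']]
df_agg = df_orig.groupby('A').agg({'B': 'sum', 'D': 'sum'}).reset_index()
df_merged = df_save.merge(df_agg)
for c in ['B', 'D']:
    # blank out an aggregate that repeats within the same group in A
    df_merged.loc[df_merged.duplicated(['A', c]), c] = ''
print(df_merged)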

How to find the total length of a column value that has multiple values in different rows for another column

Is there a way to find the IDs that have both Apple and Strawberry and count them, and likewise count the IDs that have only Apple and the IDs that have only Strawberry?
df:
ID Fruit
0 ABC Apple <-ABC has Apple and Strawberry
1 ABC Strawberry <-ABC has Apple and Strawberry
2 EFG Apple <-EFG has Apple only
3 XYZ Apple <-XYZ has Apple and Strawberry
4 XYZ Strawberry <-XYZ has Apple and Strawberry
5 CDF Strawberry <-CDF has Strawberry
6 AAA Apple <-AAA has Apple only
Desired output:
Length of IDs that have Apple and Strawberry: 2
Length of IDs that have Apple only: 2
Length of IDs that have Strawberry only: 1
Thanks!
If the values in column Fruit are always only Apple or Strawberry, you can compare the set of fruits per group and then count the matching IDs by summing the True values:
v = ['Apple','Strawberry']
out = df.groupby('ID')['Fruit'].apply(lambda x: set(x) == set(v)).sum()
print (out)
2
EDIT: If there are many possible values:
s = df.groupby('ID')['Fruit'].agg(frozenset).value_counts()
print (s)
{Apple} 2
{Strawberry, Apple} 2
{Strawberry} 1
Name: Fruit, dtype: int64
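To turn those counts into the question's desired printout, a minimal self-contained sketch (comparing each group's frozenset against the two target sets):
import pandas as pd

df = pd.DataFrame({
    'ID': ['ABC', 'ABC', 'EFG', 'XYZ', 'XYZ', 'CDF', 'AAA'],
    'Fruit': ['Apple', 'Strawberry', 'Apple', 'Apple', 'Strawberry', 'Strawberry', 'Apple'],
})

s = df.groupby('ID')['Fruit'].agg(frozenset)
print('Length of IDs that have Apple and Strawberry:', s.map(lambda x: x == {'Apple', 'Strawberry'}).sum())
print('Length of IDs that have Apple only:', s.map(lambda x: x == {'Apple'}).sum())
print('Length of IDs that have Strawberry only:', s.map(lambda x: x == {'Strawberry'}).sum())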
You can use pivot_table together with DataFrame.value_counts (available since pandas 1.1.0):
df.pivot_table(index='ID', columns='Fruit', aggfunc='size', fill_value=0)\
.value_counts()
Output:
Apple  Strawberry
1      1             2
       0             2
0      1             1
dtype: int64
Alternatively you can use:
df.groupby(['ID', 'Fruit']).size().unstack('Fruit', fill_value=0)\
.value_counts()

Compare values in two different pandas columns

I have a dataframe that looks like this:
Fruit Cost Quantity Fruit_Copy
Apple 0.5 6 Watermelon
Orange 0.3 2 Orange
Apple 0.5 8 Apple
Apple 0.5 7 Apple
Banana 0.25 8 Banana
Banana 0.25 7 Banana
Apple 0.5 6 Apple
Apple 0.5 3 Apple
I want to write a snippet that, in pandas, compares Fruit and Fruit_Copy and outputs a new column "Match" indicating whether the value in Fruit equals the value in Fruit_Copy.
Thanks in advance!
Let's say your dataframe is fruits. Then you can use the pandas Series equality method Series.eq:
fruits['Match'] = fruits['Fruit'].eq(fruits['Fruit_Copy'])
Something like this would also work, though rows where the values differ are left as NaN:
df.loc[df['Fruit'] == df['Fruit_Copy'], 'Match'] = 'Yes'
Using numpy.where:
import numpy as np

df['Match'] = np.where(df['Fruit'] == df['Fruit_Copy'], 'Yes', 'No')
You could try something like this:
import pandas as pd
import numpy as np
fruits = pd.DataFrame({
    'Fruit': ['Apple', 'Orange', 'Apple', 'Apple', 'Banana', 'Banana', 'Apple', 'Apple'],
    'Cost': [0.5, 0.3, 0.5, 0.5, 0.25, 0.25, 0.5, 0.5],
    'Quantity': [6, 2, 8, 7, 8, 7, 6, 3],
    'Fruit_Copy': ['Watermelon', 'Orange', 'Apple', 'Apple', 'Banana', 'Banana', 'Apple', 'Apple'],
})
fruits['Match'] = np.where(fruits['Fruit'] == fruits['Fruit_Copy'], 1, 0)
fruits
Fruit Cost Quantity Fruit_Copy Match
0 Apple 0.50 6 Watermelon 0
1 Orange 0.30 2 Orange 1
2 Apple 0.50 8 Apple 1
3 Apple 0.50 7 Apple 1
4 Banana 0.25 8 Banana 1
5 Banana 0.25 7 Banana 1
6 Apple 0.50 6 Apple 1
7 Apple 0.50 3 Apple 1
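For completeness, the comparison itself already produces a boolean column, so the shortest form (reusing the fruits frame built above) is simply:
fruits['Match'] = fruits['Fruit'] == fruits['Fruit_Copy']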

Filter rows based on the count of unique values

I need to count the occurrences of each value in column A and keep only the rows whose value appears at least twice.
A C
Apple 4
Orange 5
Apple 3
Mango 5
Orange 1
I have calculated the counts with df['A'].value_counts() but am not able to figure out how to filter the dataframe with them.
I want to keep the rows whose column A value has a count of at least 2; expected DataFrame:
A C
Apple 4
Orange 5
Apple 3
Orange 1
value_counts should be called on a Series (single column) rather than a DataFrame:
counts = df['A'].value_counts()
Giving:
Apple     2
Orange    2
Mango     1
Name: A, dtype: int64
You can then filter this to only keep those >= 2 and use isin to filter your DataFrame:
filtered = counts[counts >= 2]
df[df['A'].isin(filtered.index)]
Giving:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
Use duplicated with parameter keep=False:
df[df.duplicated(['A'], keep=False)]
Output:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
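Another common idiom for the same filter is groupby with transform, which computes each row's group size and keeps the rows whose value occurs at least twice; a minimal self-contained sketch:
import pandas as pd

df = pd.DataFrame({'A': ['Apple', 'Orange', 'Apple', 'Mango', 'Orange'],
                   'C': [4, 5, 3, 5, 1]})

# keep rows whose value in A appears 2 or more times
print(df[df.groupby('A')['A'].transform('size') >= 2])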

Calculating mean with a condition in pandas: group by two columns and print only the mean for each category

Input
Fruit Count Price tag
Apple 55 35 red
Orange 60 40 orange
Apple 60 36 red
Apple 70 41 red
Output 1
Fruit Mean tag
Apple 35.5 red
Orange 40 orange
I need the mean, on the condition that Price is between 31 and 40.
Output 2
Fruit Count tag
Apple 2 red
Orange 1 orange
I need the count, on the condition that Price is between 31 and 40.
Please help.
Use between with boolean indexing for filtering:
df1 = df[df['Price'].between(31, 40)]
print (df1)
Fruit Count Price tag
0 Apple 55 35 red
1 Orange 60 40 orange
2 Apple 60 36 red
If multiple aggregated columns in one frame are acceptable:
df2 = df1.groupby(['Fruit', 'tag'])['Price'].agg(['mean','size']).reset_index()
print (df2)
Fruit tag mean size
0 Apple red 35.5 2
1 Orange orange 40.0 1
Or two separate DataFrames:
df3 = df1.groupby(['Fruit', 'tag'], as_index=False)['Price'].mean()
print (df3)
Fruit tag Price
0 Apple red 35.5
1 Orange orange 40.0
df4 = df1.groupby(['Fruit', 'tag'])['Price'].size().reset_index()
print (df4)
Fruit tag Price
0 Apple red 2
1 Orange orange 1
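Both outputs can also be produced in one pass with named aggregation (available since pandas 0.25); a self-contained sketch using the question's data:
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Orange', 'Apple', 'Apple'],
    'Count': [55, 60, 60, 70],
    'Price': [35, 40, 36, 41],
    'tag': ['red', 'orange', 'red', 'red'],
})

# filter on the Price condition, then compute the mean and the count together
out = (df[df['Price'].between(31, 40)]
       .groupby(['Fruit', 'tag'])
       .agg(Mean=('Price', 'mean'), Count=('Price', 'size'))
       .reset_index())
print(out)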
