calculating mean with a condition on python pandas Group by on two columns. And print only the mean for each category? - python-3.x

Input
Fruit Count Price tag
Apple 55 35 red
Orange 60 40 orange
Apple 60 36 red
Apple 70 41 red
Output 1
Fruit Mean tag
Apple 35.5 red
Orange 40 orange
I need mean on condition price between 31 and 40
Output 2
Fruit Count tag
Apple 2 red
Orange 1 orange
I need count on condition price between 31 and 40
pls help

Use between with boolean indexing for filtering:
df1 = df[df['Price'].between(31, 40)]
print (df1)
Fruit Count Price tag
0 Apple 55 35 red
1 Orange 60 40 orange
2 Apple 60 36 red
If possible multiple columns by aggregated functions:
df2 = df1.groupby(['Fruit', 'tag'])['Price'].agg(['mean','size']).reset_index()
print (df2)
Fruit tag mean size
0 Apple red 35.5 2
1 Orange orange 40.0 1
Or 2 separately DataFrames:
df3 = df1.groupby(['Fruit', 'tag'], as_index=False)['Price'].mean()
print (df3)
Fruit tag Price
0 Apple red 35.5
1 Orange orange 40.0
df4 = df1.groupby(['Fruit', 'tag'])['Price'].size().reset_index()
print (df4)
Fruit tag Price
0 Apple red 2
1 Orange orange 1

Related

Finding the difference between values with the same name in a merged CSV file

I need to find the difference between values with the same names.
I have two csv files that I merged together and placed in another csv file to have a side by side comparison of the number differences.
Below is the sample merged csv file:
Q1Count Q1Names Q2Count Q2Names
2 candy 2 candy
9 apple 8 apple
10 bread 5 pineapple
4 pies 12 bread
3 cookies 4 pies
32 chocolate 3 cookies
[Total count: 60] 27 chocolate
NaN NaN [Total count: 61]
All the names are the same (almost), but I would like to have a way to make a new row space for the new name that popped up under Q2Names, pinapple.
Below is the code I implemented so far:
import pandas as pd
import csv
Q1ReportsDir='/path/to/Q1/Reports/'
Q2ReportsDir='/path/to/Q2/Reports/'
Q1lineCount = f'{Q1ReportsDir}Q1Report.csv'
Q2lineCount = f'{Q2ReportsDir}Q2Report.csv'
merged_destination = f'{Q2ReportsDir}DifferenceReport.csv'
diffDF = [pd.read_csv(p) for p in (Q1lineCount, Q2lineCount)]
merged_dataframe = pd.concat(diffDF, axis=1)
merged_dataframe.to_csv(merged_destination, index=False)
diffGenDF = pd.read_csv(merged_destination)
# getting Difference
diffGenDF ['Difference'] = diffGenDF ['Q1Count'] - diffGenDF ['Q2Count']
diffGenDF = diffGenDF [['Difference', 'Q1Count', 'Q1Names', 'Q2Count ', 'Q2Names']]
diffGenDF.to_csv(merged_destination, index=False)
So, making a space under Q1Names and adding a 0 under Q1Count in the same row where pineapple is under column Q2Names would make this easier to see an accurate difference between the values.
Q1Count Q1Names Q2Count Q2Names
2 candy 2 candy
9 apple 8 apple
0 5 pineapple
10 bread 12 bread
4 pies 4 pies
3 cookies 3 cookies
32 chocolate 27 chocolate
[Total count: 60] [Total count: 61]
The final desired output I would get if I can get past that part is this:
Difference Q1Count Q1Names Q2Count Q2Names
0 2 candy 2 candy
1 9 apple 8 apple
-5 0 5 pineapple
-2 10 bread 12 bread
0 4 pies 4 pies
0 3 cookies 3 cookies
5 32 chocolate 27 chocolate
[Total count: 60] [Total count: 61]
I was able to get your same results using a pd.merge with the dataframe you provided
df_merge = pd.merge(df1, df2, left_on = 'Q1Names', right_on = 'Q2Names', how = 'outer')
df_merge[['Q1Count', 'Q2Count']] = df_merge[['Q1Count', 'Q2Count']].fillna(0)
df_merge[['Q1Names', 'Q2Names']] = df_merge[['Q1Names', 'Q2Names']].fillna('')
df_merge['Difference'] = df_merge['Q1Count'].sub(df_merge['Q2Count'])

Looking for specifics records matchs in another dataframe [duplicate]

This question already has answers here:
Pandas: how to merge two dataframes on a column by keeping the information of the first one?
(4 answers)
Closed 1 year ago.
I have df1 as follow:
Name | ID
________|_____
Banana | 10
Orange | 21
Peach | 115
Then I have a df2 like this:
ID Price
10 2.34
10 2.34
115 6.00
I want to modify df2 to add another column name Fruit to get this as output:
ID Fruit Price
10 Banana 2.34
10 Banana 2.34
115 Peach 6.00
200 NA NA
I can use iloc to get one specific match but how to do it in all records in the df2?
Have you tried looking at the merge function ?
pd.merge(df1, df2)
Output :
Name Id Price
0 Banana 10 2.34
1 Banana 10 2.34
2 Peach 115 6.00
EDIT :
If you want to add only a specific column from df2 :
df = pd.merge(df1,df2[['Id','Price']],on='Id', how='left')
Output :
Name Id Price
0 Banana 10 2.34
1 Banana 10 2.34
2 Orange 21 NaN
3 Peach 115 6.00

Find difference between two integer columns but by specific ID column [duplicate]

This question already has answers here:
Python: Sum values in DataFrame if other values match between DataFrames
(3 answers)
Closed 2 years ago.
I have the following two dataframes.
last_request_df:
name fruit_id sold
apple 123 1
melon 456 12
banana 12 23
current_request_df:
name fruit_id sold
apple 123 5
melon 456 19
banana 12 43
orange 55 3
mango 66 0
The output should be based on matching the fruit_id column from both last_request_df and current_request_df and figuring out the difference in the sold column:
difference_df:
name fruit_id sold
apple 123 4
melon 456 7
banana 12 20
orange 55 3
mango 66 0
I've tried the following but I'm afraid this is not matching by the fruid_id column.
difference_df['sold_diff'] = current_request_df['sold'] - last_request_df['sold']
Is there a preferred method to capture the difference_df based on the data I've provided?
#Reset index to name for both dfs
difference_df=current_request_df.set_index('name')
last_request_df=last_request_df.set_index('name')
#Find the difference using sub. To do this ensure the two dfs have same index by reindexing
difference_df['sold']=difference_df['sold'].sub(last_request_df.reindex(index=difference_df.index).fillna(0)['sold'])
fruit_id sold
name
apple 123 4.0
melon 456 7.0
banana 12 20.0
orange 55 3.0
mango 66 0.0

Filter rows based on the count of unique values

I need to count the unique values of column A and filter out the column with values greater than say 2
A C
Apple 4
Orange 5
Apple 3
Mango 5
Orange 1
I have calculated the unique values but not able to figure out how to filer them df.value_count()
I want to filter column A that have greater than 2, expected Dataframe
A B
Apple 4
Orange 5
Apple 3
Orange 1
value_counts should be called on a Series (single column) rather than a DataFrame:
counts = df['A'].value_counts()
Giving:
A
Apple 2
Mango 1
Orange 2
dtype: int64
You can then filter this to only keep those >= 2 and use isin to filter your DataFrame:
filtered = counts[counts >= 2]
df[df['A'].isin(filtered.index)]
Giving:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
Use duplicated with parameter keep=False:
df[df.duplicated(['A'], keep=False)]
Output:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1

How to split Pandas string column into different rows?

Here is my issue. I have data like this:
data = {
'name': ["Jack ;; Josh ;; John", "Apple ;; Fruit ;; Pear"],
'grade': [11, 12],
'color':['black', 'blue']
}
df = pd.DataFrame(data)
It looks like:
name grade color
0 Jack ;; Josh ;; John 11 black
1 Apple ;; Fruit ;; Pear 12 blue
I want it to look like:
name age color
0 Jack 11 black
1 Josh 11 black
2 John 11 black
3 Apple 12 blue
4 Fruit 12 blue
5 Pear 12 blue
So first I'd need to split name by using ";;" and then explode that list into different rows
Use Series.str.split with reshape by DataFrame.stack and add orriginal another columns by DataFrame.join:
c = df.columns
s = (df.pop('name')
.str.split(' ;; ', expand=True)
.stack()
.reset_index(level=1, drop=True)
.rename('name'))
df = df.join(s).reset_index(drop=True).reindex(columns=c)
print (df)
name grade color
0 Jack 11 black
1 Josh 11 black
2 John 11 black
3 Apple 12 blue
4 Fruit 12 blue
5 Pear 12 blue
You have 2 challenges:
split the name with ;; into a list AND have each item in the list as a column such that:
df['name']=df.name.str.split(';;')
df_temp = df.name.apply(pd.Series)
df = pd.concat([df[:], df_temp[:]], axis=1)
df.drop('name', inplace=True, axis=1)
result:
grade color 0 1 2
0 11 black Jack Josh John
1 12 blue Apple Fruit Pear
Melt the list to get desired result:
df.melt(id_vars=["grade", "color"],
value_name="Name").sort_values('grade').drop('variable', axis=1)
desired result:
grade color Name
0 11 black Jack
2 11 black Josh
4 11 black John
1 12 blue Apple
3 12 blue Fruit
5 12 blue Pear

Resources