Looking for specific records matching in another dataframe [duplicate] - python-3.x

This question already has answers here:
Pandas: how to merge two dataframes on a column by keeping the information of the first one?
(4 answers)
Closed 1 year ago.
I have df1 as follow:
Name | ID
________|_____
Banana | 10
Orange | 21
Peach | 115
Then I have a df2 like this:
ID Price
10 2.34
10 2.34
115 6.00
I want to modify df2 to add another column named Fruit to get this as output:
ID Fruit Price
10 Banana 2.34
10 Banana 2.34
115 Peach 6.00
200 NA NA
I can use iloc to get one specific match, but how do I do it for all records in df2?

Have you tried looking at the merge function?
pd.merge(df1, df2)
Output :
Name ID Price
0 Banana 10 2.34
1 Banana 10 2.34
2 Peach 115 6.00
EDIT:
If you want to add only a specific column from df2:
df = pd.merge(df1, df2[['ID', 'Price']], on='ID', how='left')
Output :
Name ID Price
0 Banana 10 2.34
1 Banana 10 2.34
2 Orange 21 NaN
3 Peach 115 6.00
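To reproduce the exact output in the question (df2 extended with a Fruit column, keeping every row of df2 and NaN where the ID has no match in df1), you can also merge in the other direction. A minimal sketch, assuming the column names shown above; the extra ID 200 from the desired output is added here only for illustration:
import pandas as pd
df1 = pd.DataFrame({'Name': ['Banana', 'Orange', 'Peach'], 'ID': [10, 21, 115]})
df2 = pd.DataFrame({'ID': [10, 10, 115, 200], 'Price': [2.34, 2.34, 6.00, None]})
# left-join df1 onto df2 so every row of df2 is kept; unmatched IDs get NaN
out = df2.merge(df1, on='ID', how='left').rename(columns={'Name': 'Fruit'})
out = out[['ID', 'Fruit', 'Price']]
print(out)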

Related

Finding the difference between values with the same name in a merged CSV file

I need to find the difference between values with the same names.
I have two CSV files that I merged and wrote to another CSV file to get a side-by-side comparison of the count differences.
Below is the sample merged csv file:
Q1Count Q1Names Q2Count Q2Names
2 candy 2 candy
9 apple 8 apple
10 bread 5 pineapple
4 pies 12 bread
3 cookies 4 pies
32 chocolate 3 cookies
[Total count: 60] 27 chocolate
NaN NaN [Total count: 61]
Almost all the names are the same, but I would like a way to insert a new row for the new name that appears under Q2Names, pineapple.
Below is the code I implemented so far:
import pandas as pd
import csv
Q1ReportsDir='/path/to/Q1/Reports/'
Q2ReportsDir='/path/to/Q2/Reports/'
Q1lineCount = f'{Q1ReportsDir}Q1Report.csv'
Q2lineCount = f'{Q2ReportsDir}Q2Report.csv'
merged_destination = f'{Q2ReportsDir}DifferenceReport.csv'
diffDF = [pd.read_csv(p) for p in (Q1lineCount, Q2lineCount)]
merged_dataframe = pd.concat(diffDF, axis=1)
merged_dataframe.to_csv(merged_destination, index=False)
diffGenDF = pd.read_csv(merged_destination)
# getting Difference
diffGenDF['Difference'] = diffGenDF['Q1Count'] - diffGenDF['Q2Count']
diffGenDF = diffGenDF[['Difference', 'Q1Count', 'Q1Names', 'Q2Count', 'Q2Names']]
diffGenDF.to_csv(merged_destination, index=False)
So, inserting a blank under Q1Names and a 0 under Q1Count in the same row where pineapple appears under Q2Names would make it easier to see an accurate difference between the values.
Q1Count Q1Names Q2Count Q2Names
2 candy 2 candy
9 apple 8 apple
0 5 pineapple
10 bread 12 bread
4 pies 4 pies
3 cookies 3 cookies
32 chocolate 27 chocolate
[Total count: 60] [Total count: 61]
The final desired output, once I get past that part, is this:
Difference Q1Count Q1Names Q2Count Q2Names
0 2 candy 2 candy
1 9 apple 8 apple
-5 0 5 pineapple
-2 10 bread 12 bread
0 4 pies 4 pies
0 3 cookies 3 cookies
5 32 chocolate 27 chocolate
[Total count: 60] [Total count: 61]
I was able to get the same results using pd.merge with the dataframes you provided:
df_merge = pd.merge(df1, df2, left_on='Q1Names', right_on='Q2Names', how='outer')
df_merge[['Q1Count', 'Q2Count']] = df_merge[['Q1Count', 'Q2Count']].fillna(0)
df_merge[['Q1Names', 'Q2Names']] = df_merge[['Q1Names', 'Q2Names']].fillna('')
df_merge['Difference'] = df_merge['Q1Count'].sub(df_merge['Q2Count'])
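Putting the pieces together with the question's file paths, a minimal end-to-end sketch (paths and column names are assumed from the question; the trailing '[Total count: ...]' rows would need to be dropped or handled separately before merging):
import pandas as pd
df1 = pd.read_csv('/path/to/Q1/Reports/Q1Report.csv')   # columns: Q1Count, Q1Names
df2 = pd.read_csv('/path/to/Q2/Reports/Q2Report.csv')   # columns: Q2Count, Q2Names
# outer-merge on the name columns so names present in only one quarter still get a row
df_merge = pd.merge(df1, df2, left_on='Q1Names', right_on='Q2Names', how='outer')
df_merge[['Q1Count', 'Q2Count']] = df_merge[['Q1Count', 'Q2Count']].fillna(0)
df_merge[['Q1Names', 'Q2Names']] = df_merge[['Q1Names', 'Q2Names']].fillna('')
df_merge['Difference'] = df_merge['Q1Count'].sub(df_merge['Q2Count'])
df_merge = df_merge[['Difference', 'Q1Count', 'Q1Names', 'Q2Count', 'Q2Names']]
df_merge.to_csv('/path/to/Q2/Reports/DifferenceReport.csv', index=False)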

Find difference between two integer columns but by specific ID column [duplicate]

This question already has answers here:
Python: Sum values in DataFrame if other values match between DataFrames
(3 answers)
Closed 2 years ago.
I have the following two dataframes.
last_request_df:
name fruit_id sold
apple 123 1
melon 456 12
banana 12 23
current_request_df:
name fruit_id sold
apple 123 5
melon 456 19
banana 12 43
orange 55 3
mango 66 0
The output should be based on matching the fruit_id column from both last_request_df and current_request_df and figuring out the difference in the sold column:
difference_df:
name fruit_id sold
apple 123 4
melon 456 7
banana 12 20
orange 55 3
mango 66 0
I've tried the following, but I'm afraid this is not matching by the fruit_id column.
difference_df['sold_diff'] = current_request_df['sold'] - last_request_df['sold']
Is there a preferred method to capture the difference_df based on the data I've provided?
# Set the index to name for both dfs
difference_df=current_request_df.set_index('name')
last_request_df=last_request_df.set_index('name')
# Find the difference using sub. To do this, ensure the two dfs have the same index by reindexing
difference_df['sold']=difference_df['sold'].sub(last_request_df.reindex(index=difference_df.index).fillna(0)['sold'])
fruit_id sold
name
apple 123 4.0
melon 456 7.0
banana 12 20.0
orange 55 3.0
mango 66 0.0
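Since the question asks to match on fruit_id rather than name, a merge-based variant gives the same result; a minimal sketch, assuming the two frames shown in the question (last_sold is a temporary helper column introduced here):
# left-join the previous counts onto the current request by fruit_id;
# fruits that did not exist last time get a previous count of 0
prev = last_request_df[['fruit_id', 'sold']].rename(columns={'sold': 'last_sold'})
difference_df = current_request_df.merge(prev, on='fruit_id', how='left')
difference_df['sold'] = difference_df['sold'] - difference_df['last_sold'].fillna(0)
difference_df = difference_df.drop(columns='last_sold')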

Filter rows based on the count of unique values

I need to count the unique values of column A and keep only the rows whose value appears at least, say, 2 times.
A C
Apple 4
Orange 5
Apple 3
Mango 5
Orange 1
I have calculated the unique values with df.value_counts() but am not able to figure out how to filter them.
I want to keep the values of column A that occur at least 2 times; expected DataFrame:
A C
Apple 4
Orange 5
Apple 3
Orange 1
value_counts should be called on a Series (single column) rather than a DataFrame:
counts = df['A'].value_counts()
Giving:
A
Apple     2
Orange    2
Mango     1
Name: count, dtype: int64
You can then filter this to only keep those >= 2 and use isin to filter your DataFrame:
filtered = counts[counts >= 2]
df[df['A'].isin(filtered.index)]
Giving:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
Use duplicated with parameter keep=False:
df[df.duplicated(['A'], keep=False)]
Output:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
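Another common pattern is to filter in one step with a groupby transform; a sketch equivalent to the isin approach above:
# keep rows whose value in column A appears at least 2 times
df[df.groupby('A')['A'].transform('size') >= 2]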

Calculating mean with a condition in pandas: group by two columns and print only the mean for each category?

Input
Fruit Count Price tag
Apple 55 35 red
Orange 60 40 orange
Apple 60 36 red
Apple 70 41 red
Output 1
Fruit Mean tag
Apple 35.5 red
Orange 40 orange
I need the mean with the condition that Price is between 31 and 40.
Output 2
Fruit Count tag
Apple 2 red
Orange 1 orange
I need the count with the condition that Price is between 31 and 40.
Please help.
Use between with boolean indexing for filtering:
df1 = df[df['Price'].between(31, 40)]
print (df1)
Fruit Count Price tag
0 Apple 55 35 red
1 Orange 60 40 orange
2 Apple 60 36 red
If multiple aggregated columns in one DataFrame are acceptable:
df2 = df1.groupby(['Fruit', 'tag'])['Price'].agg(['mean','size']).reset_index()
print (df2)
Fruit tag mean size
0 Apple red 35.5 2
1 Orange orange 40.0 1
Or two separate DataFrames:
df3 = df1.groupby(['Fruit', 'tag'], as_index=False)['Price'].mean()
print (df3)
Fruit tag Price
0 Apple red 35.5
1 Orange orange 40.0
df4 = df1.groupby(['Fruit', 'tag'])['Price'].size().reset_index()
print (df4)
Fruit tag Price
0 Apple red 2
1 Orange orange 1
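If you want the exact column names from the desired outputs (Mean and Count), named aggregation is another option; a sketch assuming the filtered df1 from above:
df5 = (df1.groupby(['Fruit', 'tag'])
          .agg(Mean=('Price', 'mean'), Count=('Price', 'size'))
          .reset_index())
print(df5)
Fruit tag Mean Count
0 Apple red 35.5 2
1 Orange orange 40.0 1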

Add rows according to other rows

My DataFrame object is similar to this one:
io Product StoreFrom StoreTo Date
1 out melon StoreQ StoreP 20170602
2 out cherry StoreW StoreO 20170614
3 out Apple StoreE StoreU 20170802
4 in Apple StoreE StoreU 20170812
I want to avoid duplication; the 3rd and 4th rows describe the same action. I am trying to reach:
Product StoreFrom StoreTo Date Days
1 out melon StoreQ StoreP 20170602
2 out cherry StoreW StoreO 20170614
5 in Apple StoreE StoreU 20170812 10
and I have more than 10k entries. I could not find similar work on this. Any help will be very useful.
cols = ['Product', 'StoreFrom', 'StoreTo']  # columns assumed to identify one shipment
d1 = df.assign(Date=pd.to_datetime(df.Date.astype(str)))
d2 = d1.assign(Days=d1.groupby(cols).Date.apply(lambda x: x - x.iloc[0]))
d2.drop_duplicates(cols, keep='last')
io Product StoreFrom StoreTo Date Days
1 out melon StoreQ StoreP 2017-06-02 0 days
2 out cherry StoreW StoreO 2017-06-14 0 days
4 in Apple StoreE StoreU 2017-08-12 10 days
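For completeness, a self-contained sketch of the same approach, reconstructing the example data (the grouping columns and the io column name are assumed from the example; transform is used instead of apply, which behaves the same here):
import pandas as pd
df = pd.DataFrame({
    'io': ['out', 'out', 'out', 'in'],
    'Product': ['melon', 'cherry', 'Apple', 'Apple'],
    'StoreFrom': ['StoreQ', 'StoreW', 'StoreE', 'StoreE'],
    'StoreTo': ['StoreP', 'StoreO', 'StoreU', 'StoreU'],
    'Date': [20170602, 20170614, 20170802, 20170812],
})
cols = ['Product', 'StoreFrom', 'StoreTo']            # identifies one shipment
d1 = df.assign(Date=pd.to_datetime(df.Date.astype(str)))
# days elapsed since the first row of each shipment
d2 = d1.assign(Days=d1.groupby(cols).Date.transform(lambda x: x - x.iloc[0]))
print(d2.drop_duplicates(cols, keep='last'))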
