Compare three dataframes and create a new column in one of them based on a condition - python-3.x

I am comparing a master dataframe against two regional dataframes and want to create a new column based on a condition where a match is found.
For example, I have master_df and two regional dataframes, asia_df and europe_df. I want to check whether each company in master_df appears in either regional dataframe and create a new region column with the value Europe or Asia accordingly.
master_df
company product
ABC Apple
BCA Mango
DCA Apple
ERT Mango
NFT Oranges
europe_df
account sales
ABC 12
BCA 13
DCA 12
asia_df
account sales
DCA 15
ERT 34
My final output dataframe is expected to be
company product region
ABC Apple Europe
BCA Mango Europe
DCA Apple Europe
DCA Apple Asia
ERT Mango Asia
NFT Oranges Others
When I try to merge and compare, some rows are removed. I need help fixing this issue.
final_df = europe_df.merge(master_df, left_on='company', right_on='account', how='left').drop_duplicates()
final1_df = asia_df.merge(master_df, left_on='company', right_on='account', how='left').drop_duplicates()
final['region'] = np.where(final_df['account'] == final_df['company'] ,'Europe','Others')
final['region'] = np.where(final1_df['account'] == final1_df['company'] ,'Asia','Others')

First use pd.concat to concatenate the dataframes asia_df and europe_df, then use DataFrame.merge to merge the result with master_df, and finally use Series.fillna to fill the NaN values in Region with Others:
r = pd.concat([europe_df.assign(Region='Europe'), asia_df.assign(Region='Asia')])\
.rename(columns={'account': 'company'})[['company', 'Region']]
df = master_df.merge(r, on='company', how='left')
df['Region'] = df['Region'].fillna('Others')
Result:
print(df)
company product Region
0 ABC Apple Europe
1 BCA Mango Europe
2 DCA Apple Europe
3 DCA Apple Asia
4 ERT Mango Asia
5 NFT Oranges Others
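As a self-contained sketch, the answer's steps can be run end to end with the sample data reconstructed from the question:

```python
import pandas as pd

# Sample frames from the question
master_df = pd.DataFrame({'company': ['ABC', 'BCA', 'DCA', 'ERT', 'NFT'],
                          'product': ['Apple', 'Mango', 'Apple', 'Mango', 'Oranges']})
europe_df = pd.DataFrame({'account': ['ABC', 'BCA', 'DCA'], 'sales': [12, 13, 12]})
asia_df = pd.DataFrame({'account': ['DCA', 'ERT'], 'sales': [15, 34]})

# Tag each regional frame with its region, stack them, and keep only
# the join key plus the new Region column
r = pd.concat([europe_df.assign(Region='Europe'),
               asia_df.assign(Region='Asia')]) \
      .rename(columns={'account': 'company'})[['company', 'Region']]

# Left-merge so every master_df row survives; companies present in both
# regions (DCA here) produce one row per region
df = master_df.merge(r, on='company', how='left')
df['Region'] = df['Region'].fillna('Others')
print(df)
```

Merging left on master_df is what prevents rows from being dropped: unmatched companies such as NFT come through with NaN, which fillna then turns into Others.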

Related

Extract the mapping dictionary between two columns in pandas

I have a dataframe as shown below.
df:
id player country_code country
1 messi arg argentina
2 neymar bra brazil
3 tevez arg argentina
4 aguero arg argentina
5 rivaldo bra brazil
6 owen eng england
7 lampard eng england
8 gerrard eng england
9 ronaldo bra brazil
10 marria arg argentina
From the above df, I would like to extract a mapping dictionary that relates the country_code column to the country column.
Expected Output:
d = {'arg':'argentina', 'bra':'brazil', 'eng':'england'}
A dictionary has unique keys, so it is possible to convert a Series with a duplicated index created from the country_code column:
d = df.set_index('country_code')['country'].to_dict()
If a country_code could map to more than one country value, the last value per code is kept.
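A minimal runnable sketch, using a subset of the sample rows above:

```python
import pandas as pd

df = pd.DataFrame({'player': ['messi', 'neymar', 'tevez', 'owen'],
                   'country_code': ['arg', 'bra', 'arg', 'eng'],
                   'country': ['argentina', 'brazil', 'argentina', 'england']})

# Duplicate index labels collapse when converting to a dict:
# to_dict keeps the last value seen for each key
d = df.set_index('country_code')['country'].to_dict()
print(d)  # {'arg': 'argentina', 'bra': 'brazil', 'eng': 'england'}
```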

How to select records with a not exists condition in a pandas dataframe

I have two dataframes as below. I want to rewrite this data-selection SQL query, which contains a NOT EXISTS condition, in pandas:
SQL
Select ORDER_NUM, DRIVER FROM DF
WHERE
1=1
AND NOT EXISTS
(
SELECT 1 FROM
order_addition oa
WHERE
oa.Flag_Value = 'Y'
AND df.ORDER_NUM = oa.ORDER_NUM)
Sample data
order_addition.head(10)
ORDER_NUM Flag_Value
22574536 Y
32459745 Y
15642314 Y
12478965 N
25845673 N
36789156 N
df.head(10)
ORDER_NUM REGION DRIVER
22574536 WEST Ravi
32459745 WEST David
15642314 SOUTH Rahul
12478965 NORTH David
25845673 SOUTH Mani
36789156 SOUTH Tim
How can this be done easily in pandas?
IIUC, you can merge df2 with the rows of df1 where Flag_Value equals Y, and then keep the rows where the merge produced NaN:
result = df2.merge(df1[df1["Flag_Value"].eq("Y")],how="left",on="ORDER_NUM")
print (result[result["Flag_Value"].isnull()])
ORDER_NUM REGION DRIVER Flag_Value
3 12478965 NORTH David NaN
4 25845673 SOUTH Mani NaN
5 36789156 SOUTH Tim NaN
Or even simpler, if your ORDER_NUM values are unique:
print (df2.loc[~df2["ORDER_NUM"].isin(df1.loc[df1["Flag_Value"].eq("Y"),"ORDER_NUM"])])
ORDER_NUM REGION DRIVER
3 12478965 NORTH David
4 25845673 SOUTH Mani
5 36789156 SOUTH Tim
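The isin variant can be checked end to end with the sample data from the question; a hedged sketch:

```python
import pandas as pd

order_addition = pd.DataFrame({'ORDER_NUM': [22574536, 32459745, 15642314,
                                             12478965, 25845673, 36789156],
                               'Flag_Value': ['Y', 'Y', 'Y', 'N', 'N', 'N']})
df = pd.DataFrame({'ORDER_NUM': [22574536, 32459745, 15642314,
                                 12478965, 25845673, 36789156],
                   'REGION': ['WEST', 'WEST', 'SOUTH', 'NORTH', 'SOUTH', 'SOUTH'],
                   'DRIVER': ['Ravi', 'David', 'Rahul', 'David', 'Mani', 'Tim']})

# NOT EXISTS: keep only orders whose ORDER_NUM has no 'Y'-flagged row
flagged = order_addition.loc[order_addition['Flag_Value'].eq('Y'), 'ORDER_NUM']
result = df.loc[~df['ORDER_NUM'].isin(flagged), ['ORDER_NUM', 'DRIVER']]
print(result)
```

The `~ ... isin(...)` mask is the direct translation of `NOT EXISTS` here, and the final column selection mirrors the `SELECT ORDER_NUM, DRIVER` in the SQL.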

Handling duplicate data with pandas

Hello everyone, I'm having some issues using the pandas Python library. Basically, I'm reading a csv
file with pandas and want to remove duplicates. I've tried everything and the problem is still there.
import sqlite3
import pandas as pd
import numpy
connection = sqlite3.connect("test.db")
## pandas dataframe
dataframe = pd.read_csv('Countries.csv')
##dataframe.head(3)
countries = dataframe.loc[:, ['Retailer country', 'Continent']]
countries.head(6)
Output of this will be:
Retailer country Continent
-----------------------------
0 United States North America
1 Canada North America
2 Japan Asia
3 Italy Europe
4 Canada North America
5 United States North America
6 France Europe
I want to drop duplicate values based on the columns of the dataframe above, so that I end up with
the unique (country, continent) pairs.
The desired output is:
Retailer country Continent
-----------------------------
0 United States North America
1 Canada North America
2 Japan Asia
3 Italy Europe
4 France Europe
I have tried some of the methods mentioned here: Using pandas for duplicate values. I also looked around the net and realized I could use the df.drop_duplicates() function, but when I use the code below with df.head(3), it displays only one row. What can I do to get those unique rows and then loop through them?
countries.head(4)
country = countries['Retailer country']
continent = countries['Continent']
df = pd.DataFrame({'a':[country], 'b':[continent]})
df.head(3)
It seems like a simple group-by could solve your problem.
import pandas as pd
na = 'North America'
a = 'Asia'
e = 'Europe'
df = pd.DataFrame({'Retailer': [0, 1, 2, 3, 4, 5, 6],
'country': ['United States', 'Canada', 'Japan', 'Italy', 'Canada', 'United States', 'France'],
'continent': [na, na, a, e, na, na, e]})
df.groupby(['country', 'continent']).agg('count').reset_index()
The Retailer column now shows a count of the number of times each country/continent combination occurs. You can remove it with `df = df[['country', 'continent']]`.
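Since the goal is just the unique (country, continent) pairs, drop_duplicates may be the more direct tool; a sketch using the question's column names:

```python
import pandas as pd

countries = pd.DataFrame({'Retailer country': ['United States', 'Canada', 'Japan',
                                               'Italy', 'Canada', 'United States',
                                               'France'],
                          'Continent': ['North America', 'North America', 'Asia',
                                        'Europe', 'North America', 'North America',
                                        'Europe']})

# Keep the first occurrence of each (country, continent) pair
unique_pairs = countries.drop_duplicates().reset_index(drop=True)
print(unique_pairs)

# Looping through the unique rows, as the question asks
for country, continent in unique_pairs.itertuples(index=False):
    print(country, '->', continent)
```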

Pandas merge two dataframes and overwrite rows

I have two data frames that I am trying to combine -
Dataframe 1 -
Product Buyer Date Store
TV Person A 9/18/2018 Boston
DVD Person B 4/10/2018 New York
Blue-ray Player Person C 9/19/2018 Boston
Phone Person A 9/18/2018 Boston
Sound System Person C 3/05/2018 Washington
Dataframe 2 -
Product Type Buyer Date Store
TV Person B 5/29/2018 New York
Phone Person A 2/10/2018 Washington
The first dataframe has about 500k rows while the second dataframe has about 80k rows. There are times when the second dataframe has extra columns, but I am trying to get a final output with the same columns as Dataframe 1, with the Dataframe 1 rows updated from Dataframe 2.
The output looks like this -
Product Buyer Date Store
TV Person B 5/29/2018 New York
DVD Person B 4/10/2018 New York
Blue-ray Player Person C 9/19/2018 Boston
Phone Person A 2/10/2018 Washington
Sound System Person C 3/05/2018 Washington
I tried the join but the columns are repeated. Is there an elegant solution to do this?
Edit 1-
I have already tried -
pd.merge(df,df_correction, left_on = ['Product'], right_on = ['Product Type'],how = 'outer')
Product Buyer_x Date_x Store_x Product Type Buyer_y Date_y Store_y
TV Person B 5/29/2018 New York TV Person B 5/29/2018 New York
DVD Person B 4/10/2018 New York NaN NaN NaN NaN
Blue-ray Player Person C 9/19/2018 Boston NaN NaN NaN NaN
Phone Person A 2/10/2018 Washington Phone Person A 2/10/2018 Washington
Sound System Person C 3/05/2018 Washington NaN NaN NaN NaN
I think combine_first is the function you are looking for: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.combine_first.html
Can you try:
df_correction.rename(columns={'Product Type': 'Product'}).set_index('Product').combine_first(df.set_index('Product')).reset_index()
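A runnable sketch with the sample frames from the question (the correction frame's key column is assumed to be 'Product Type', as in the question's merge attempt):

```python
import pandas as pd

df = pd.DataFrame({'Product': ['TV', 'DVD', 'Blue-ray Player', 'Phone', 'Sound System'],
                   'Buyer': ['Person A', 'Person B', 'Person C', 'Person A', 'Person C'],
                   'Date': ['9/18/2018', '4/10/2018', '9/19/2018', '9/18/2018', '3/05/2018'],
                   'Store': ['Boston', 'New York', 'Boston', 'Boston', 'Washington']})
df_correction = pd.DataFrame({'Product Type': ['TV', 'Phone'],
                              'Buyer': ['Person B', 'Person A'],
                              'Date': ['5/29/2018', '2/10/2018'],
                              'Store': ['New York', 'Washington']})

# Align both frames on Product; non-NaN values from the correction
# frame take precedence, rows missing from it fall through unchanged
result = (df_correction.rename(columns={'Product Type': 'Product'})
          .set_index('Product')
          .combine_first(df.set_index('Product'))
          .reset_index())
print(result)
```

One caveat worth knowing: combine_first aligns on the union of the two indexes, so the result comes back sorted by Product rather than in Dataframe 1's original row order.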

python - count elements in a pandas dataframe

I have a table with some info about districts. I have converted it into a pandas dataframe, and my question is: how can I count how many times SOUTHERN, BAYVIEW, etc. appear in the table below? I want to add an extra column next to District with the total count for each district.
District
0 SOUTHERN
1 BAYVIEW
2 CENTRAL
3 NORTH
Here you need to use groupby and the size method (you can also use other aggregations such as count).
With this dataframe:
import pandas as pd
df = pd.DataFrame({'DISTRICT': ['SOUTHERN', 'SOUTHERN', 'BAYVIEW', 'BAYVIEW', 'BAYVIEW', 'CENTRAL', 'NORTH']})
Represented as below
DISTRICT
0 SOUTHERN
1 SOUTHERN
2 BAYVIEW
3 BAYVIEW
4 BAYVIEW
5 CENTRAL
6 NORTH
You can use
df.groupby(['DISTRICT']).size().reset_index(name='counts')
which gives this output:
DISTRICT counts
0 BAYVIEW 3
1 CENTRAL 1
2 NORTH 1
3 SOUTHERN 2
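If you specifically want the count as an extra column next to each original row (as the question asks) rather than as a summary table, one option is groupby with transform('size'); a sketch:

```python
import pandas as pd

df = pd.DataFrame({'DISTRICT': ['SOUTHERN', 'SOUTHERN', 'BAYVIEW', 'BAYVIEW',
                                'BAYVIEW', 'CENTRAL', 'NORTH']})

# transform broadcasts each group's size back onto every row of that group
df['counts'] = df.groupby('DISTRICT')['DISTRICT'].transform('size')
print(df)
```

Unlike the size().reset_index() version, this keeps all seven rows and simply annotates each one with its district's total.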
