Handling duplicate data with pandas - python-3.x

Hello everyone, I'm having some issues with the pandas Python library. I'm reading a CSV
file with pandas and want to remove duplicate rows. I've tried several approaches, but the problem is still there.
import sqlite3
import pandas as pd
import numpy

connection = sqlite3.connect("test.db")

# load the CSV into a pandas dataframe
dataframe = pd.read_csv('Countries.csv')
# dataframe.head(3)
countries = dataframe.loc[:, ['Retailer country', 'Continent']]
countries.head(7)
The output of this is:
  Retailer country      Continent
0    United States  North America
1           Canada  North America
2            Japan           Asia
3            Italy         Europe
4           Canada  North America
5    United States  North America
6           France         Europe
I want to drop the duplicate rows from the dataframe above so that only the unique
country/continent pairs remain. The desired output is:
  Retailer country      Continent
0    United States  North America
1           Canada  North America
2            Japan           Asia
3            Italy         Europe
4           France         Europe
I have tried some of the methods mentioned in "Using pandas for duplicate values", looked around the net, and realized I could use the df.drop_duplicates() function. But when I use the code below, df.head(3) displays only one row. What can I do to get those unique rows and then loop through them?
countries.head(4)
country = countries['Retailer country']
continent = countries['Continent']
df = pd.DataFrame({'a':[country], 'b':[continent]})
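# note: wrapping each Series in a list ([country], [continent]) puts the whole
# Series into a single cell, which is why only one row shows up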
df.head(3)

It seems like a simple group-by could solve your problem.
import pandas as pd

na = 'North America'
a = 'Asia'
e = 'Europe'
df = pd.DataFrame({'Retailer': [0, 1, 2, 3, 4, 5, 6],
                   'country': ['United States', 'Canada', 'Japan', 'Italy', 'Canada', 'United States', 'France'],
                   'continent': [na, na, a, e, na, na, e]})
df.groupby(['country', 'continent']).agg('count').reset_index()
The Retailer column now holds a count of how many times each country/continent combination occurs. You can remove it with `df = df[['country', 'continent']]`.
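Since the question already mentions df.drop_duplicates(), it is worth noting that it gets the unique pairs directly, without the intermediate count column. A minimal sketch against the countries frame from the question (unique_countries is just an illustrative name), including the loop the asker wanted:
# keep one row per unique (country, continent) pair
unique_countries = countries.drop_duplicates(subset=['Retailer country', 'Continent']).reset_index(drop=True)

# loop over the unique rows; positional access because 'Retailer country' contains a space
for row in unique_countries.itertuples(index=False):
    print(row[0], row[1])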

Related

Python: merge two dataframes based on text similarity of their columns

I am working with two dataframes which look like this:
df1

country_1                          column1
united states of america           abcd
Ireland (Republic of Ireland)      efgh
Korea Rep Of                       fsdf
Switzerland (Swiss Confederation)  dsaa

df2

country_2      column2
united states  cdda
Ireland        ddgd
South Korea    rewt
Switzerland    tuut

desired output:

country_1                          column1  country_2      column2
united states of america           abcd     united states  cdda
Ireland (Republic of Ireland)      efgh     Ireland        ddgd
Korea Rep Of                       fsdf     South Korea    rewt
Switzerland (Swiss Confederation)  dsaa     Switzerland    tuut
I am not that familiar with text analytics, so I have been unable to find a method for this problem. I have tried string matching and regex, but they were not able to solve it.
You can use difflib.
Data:
data1 = {
    "country_1": ["united states of america", "Ireland (Republic of Ireland)", "Korea Rep Of", "Switzerland (Swiss Confederation)"],
    "column1": ["abcd", "efgh", "fsdf", "dsaa"],
}
df1 = pd.DataFrame(data1)

data2 = {
    "country_2": ["united states", "Ireland", "Korea", "Switzerland"],
    "column2": ["cdda", "ddgd", "rewt", "tuut"],
}
df2 = pd.DataFrame(data2)
Code:
import difflib
from dataclasses import dataclass
from typing import Optional

import pandas as pd


@dataclass
class FuzzyMerge:
    """
    Works like a pandas merge, except it also merges on approximate matches.
    """
    left: pd.DataFrame
    right: pd.DataFrame
    left_on: str
    right_on: str
    how: str = "inner"
    cutoff: float = 0.3

    def main(self) -> pd.DataFrame:
        # replace each right-frame key with its closest match from the left
        # frame's key column, then merge on exact equality
        temp = self.right.copy()
        temp[self.left_on] = [
            self.get_closest_match(x, self.left[self.left_on]) for x in temp[self.right_on]
        ]
        return self.left.merge(temp, on=self.left_on, how=self.how)

    def get_closest_match(self, value: str, candidates: pd.Series) -> Optional[str]:
        matches = difflib.get_close_matches(value, candidates, cutoff=self.cutoff)
        return matches[0] if matches else None
Call the class:
merged = FuzzyMerge(left=df1, right=df2, left_on="country_1", right_on="country_2").main()
print(merged)
Output:
                           country_1  column1      country_2  column2
0           united states of america     abcd  united states     cdda
1      Ireland (Republic of Ireland)     efgh        Ireland     ddgd
2                       Korea Rep Of     fsdf          Korea     rewt
3  Switzerland (Swiss Confederation)     dsaa    Switzerland     tuut
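The cutoff parameter controls how strict the matching is: difflib.get_close_matches only returns candidates whose similarity ratio is at least cutoff, so raising it toward 1.0 makes matching stricter. A quick standalone illustration (the values here are just the key column from df1):
import difflib

candidates = ["united states of america", "Ireland (Republic of Ireland)",
              "Korea Rep Of", "Switzerland (Swiss Confederation)"]

# loose cutoff: "Korea" is similar enough to "Korea Rep Of"
print(difflib.get_close_matches("Korea", candidates, cutoff=0.3))
# ['Korea Rep Of']

# strict cutoff: nothing is similar enough, so no match is returned
print(difflib.get_close_matches("Korea", candidates, cutoff=0.9))
# []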
You can also solve this problem using basic pandas operations, i.e. join, merge, and concat; I suggest starting with concat, as it is the easiest to begin with.
PS: make sure the data is in the form of a DataFrame. To convert it:
data1 = pd.DataFrame(data1)
data2 = pd.DataFrame(data2)
Using concat:
data = pd.concat([data1, data2], axis=1)
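One caveat worth adding: pd.concat with axis=1 aligns rows purely by index position, so it only produces the desired pairing here because both frames happen to list the countries in matching order. A quick check with two of the rows:
import pandas as pd

data1 = pd.DataFrame({"country_1": ["united states of america", "Ireland (Republic of Ireland)"],
                      "column1": ["abcd", "efgh"]})
data2 = pd.DataFrame({"country_2": ["united states", "Ireland"],
                      "column2": ["cdda", "ddgd"]})

# rows are matched by index position, not by text similarity
data = pd.concat([data1, data2], axis=1)
print(data)
#                        country_1 column1      country_2 column2
# 0       united states of america    abcd  united states    cdda
# 1  Ireland (Republic of Ireland)    efgh        Ireland    ddgd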

Pandas - I have a dataset where the columns are country, company and total employees. I need a dataframe of total employees in each company by country

There are 8 companies in total and around 30-40 countries. I need a dataframe that shows the total number of employees in each company by country.
Sounds like you want to use pandas' groupby feature. I'm not sure what type of data you have or what result you want, so here are some toy examples:
import pandas as pd

# numeric employee counts: sum them per (company, country)
df = pd.DataFrame({'company': ["A", "A", "B"], 'country': ["USA", "USA", "USA"], 'employees': [10, 20, 50]})
dfg = df.groupby(['company', 'country'], as_index=False)['employees'].sum()
print(dfg)
#   company country  employees
# 0       A     USA         30
# 1       B     USA         50

# one row per employee name: count them per (company, country)
df = pd.DataFrame({'company': ["A", "A", "A"], 'country': ["USA", "USA", "Japan"], 'employees': ['Art', 'Bob', 'Chris']})
dfg = df.groupby(['company', 'country'], as_index=False)['employees'].count()
print(dfg)
#   company country  employees
# 0       A   Japan          1
# 1       A     USA          2
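If you would rather see the companies as rows and the countries as columns, pivot_table gives the same totals in matrix form; a minimal sketch, assuming numeric employee counts as in the first toy example:
import pandas as pd

df = pd.DataFrame({'company': ["A", "A", "B"], 'country': ["USA", "Japan", "USA"], 'employees': [10, 20, 50]})
table = df.pivot_table(index='company', columns='country', values='employees', aggfunc='sum', fill_value=0)
print(table)
# country  Japan  USA
# company
# A           20   10
# B            0   50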

Compare three dataframes and create a new column in one of them based on a condition

I am comparing two region dataframes with master_df and creating a new column based on a condition.
For example, I have master_df and two region dataframes, asia_df and europe_df. I want to check whether each company in master_df appears in any of the region dataframes and, if so, create a new region column set to Europe or Asia.
master_df

company  product
ABC      Apple
BCA      Mango
DCA      Apple
ERT      Mango
NFT      Oranges

europe_df

account  sales
ABC      12
BCA      13
DCA      12

asia_df

account  sales
DCA      15
ERT      34
My final output dataframe is expected to be
company  product  region
ABC      Apple    Europe
BCA      Mango    Europe
DCA      Apple    Europe
DCA      Apple    Asia
ERT      Mango    Asia
NFT      Oranges  Others
When I try to merge and compare, some rows are removed. I need help on how to fix this issue:
final_df = europe_df.merge(master_df, left_on='company', right_on='account', how='left').drop_duplicates()
final1_df = asia_df.merge(master_df, left_on='company', right_on='account', how='left').drop_duplicates()
final['region'] = np.where(final_df['account'] == final_df['company'] ,'Europe','Others')
final['region'] = np.where(final1_df['account'] == final1_df['company'] ,'Asia','Others')
First use pd.concat to concatenate the dataframes asia_df and europe_df, then use DataFrame.merge to merge the result with master_df, and finally use Series.fillna to fill the NaN values in Region with Others:
r = pd.concat([europe_df.assign(Region='Europe'), asia_df.assign(Region='Asia')])\
.rename(columns={'account': 'company'})[['company', 'Region']]
df = master_df.merge(r, on='company', how='left')
df['Region'] = df['Region'].fillna('Others')
Result:
print(df)
  company  product  Region
0     ABC    Apple  Europe
1     BCA    Mango  Europe
2     DCA    Apple  Europe
3     DCA    Apple    Asia
4     ERT    Mango    Asia
5     NFT  Oranges  Others

How to select records with a NOT EXISTS condition in a pandas dataframe

I have two dataframes, shown below. I want to rewrite a SQL data-selection query that contains a NOT EXISTS condition in pandas.
SQL
SELECT ORDER_NUM, DRIVER
FROM df
WHERE 1 = 1
  AND NOT EXISTS (
      SELECT 1
      FROM order_addition oa
      WHERE oa.Flag_Value = 'Y'
        AND df.ORDER_NUM = oa.ORDER_NUM
  )
Sample data
order_addition.head(10)

ORDER_NUM  Flag_Value
22574536   Y
32459745   Y
15642314   Y
12478965   N
25845673   N
36789156   N

df.head(10)

ORDER_NUM  REGION  DRIVER
22574536   WEST    Ravi
32459745   WEST    David
15642314   SOUTH   Rahul
12478965   NORTH   David
25845673   SOUTH   Mani
36789156   SOUTH   Tim
How can this be done easily in pandas?
IIUC, you can left-merge df (here df2) with the rows of order_addition (here df1) where Flag_Value equals "Y", and then keep the rows where the merge produced NaN:
result = df2.merge(df1[df1["Flag_Value"].eq("Y")], how="left", on="ORDER_NUM")
print(result[result["Flag_Value"].isnull()])

   ORDER_NUM REGION DRIVER Flag_Value
3   12478965  NORTH  David        NaN
4   25845673  SOUTH   Mani        NaN
5   36789156  SOUTH    Tim        NaN
Or even simpler, if your ORDER_NUM values are unique:
print(df2.loc[~df2["ORDER_NUM"].isin(df1.loc[df1["Flag_Value"].eq("Y"), "ORDER_NUM"])])

   ORDER_NUM REGION DRIVER
3   12478965  NORTH  David
4   25845673  SOUTH   Mani
5   36789156  SOUTH    Tim
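Another well-known pandas idiom for this kind of anti-join (an alternative, not part of the answers above) is a left merge with indicator=True, keeping only the left-only rows:
flagged = df1.loc[df1["Flag_Value"].eq("Y"), ["ORDER_NUM"]]
result = df2.merge(flagged, on="ORDER_NUM", how="left", indicator=True)
print(result.loc[result["_merge"].eq("left_only"), ["ORDER_NUM", "REGION", "DRIVER"]])

   ORDER_NUM REGION DRIVER
3   12478965  NORTH  David
4   25845673  SOUTH   Mani
5   36789156  SOUTH    Tim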

Plotting UK Districts, Postcode Areas and Regions

I am wondering if we can do a similar choropleth to the one below with UK district, postcode area, and region maps. It would be great if you could show an example of a UK choropleth.
Geographic shape files can be downloaded from http://martinjc.github.io/UK-GeoJSON/
import os

import folium
import pandas as pd
from branca.utilities import split_six

state_geo = os.path.join('data', 'us-states.json')
state_unemployment = os.path.join('data', 'US_Unemployment_Oct2012.csv')
state_data = pd.read_csv(state_unemployment)
j1 = pd.read_json(state_geo)

threshold_scale = split_six(state_data['Unemployment'])

m = folium.Map(location=[48, -102], zoom_start=3)
m.choropleth(
    geo_path=state_geo,
    geo_str='choropleth',
    data=state_data,
    columns=['State', 'Unemployment'],
    key_on='feature.id',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Unemployment Rate (%)'
)
m
m.save('choropleth.html')
This is what I did.
First, collect your data. I used www.nomisweb.co.uk to collect employment rates for the main regions:
North East (England)
North West (England)
Yorkshire and The Humber
East Midlands (England)
West Midlands (England)
East of England
London
South East (England)
South West (England)
Wales
Scotland
Northern Ireland
I saved this dataset as UKEmploymentData.csv. Note that you will have to change the region names to match the geo data IDs.
Then I followed what you posted using the NUTS data from the ONS geoportal.
import json
import os

import folium
import pandas as pd
from branca.utilities import split_six

# read in the employment data
df = pd.read_csv('UKEmploymentData.csv')

state_geo = 'http://geoportal1-ons.opendata.arcgis.com/datasets/01fd6b2d7600446d8af768005992f76a_4.geojson'

m = folium.Map(location=[55, 4], zoom_start=5)
m.choropleth(
    geo_data=state_geo,
    data=df,
    columns=['region', 'Total in employment - aged 16 and over'],
    key_on='feature.properties.nuts118nm',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Employment Rate (%)',
    highlight=True
)
m
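As in the US example from the question, the map can be saved to a standalone HTML file (the filename here is arbitrary):
m.save('uk_choropleth.html')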
