How to replace country names in a dataframe column with continent names? - python-3.x

I have a DataFrame like this:
problem.head(30)
Out[25]:
Country
0 Sweden
1 Africa
2 Africa
3 Africa
4 Africa
5 Germany
6 Germany
7 Germany
8 Germany
9 UK
10 Germany
11 Germany
12 Germany
13 Germany
14 Sweden
15 Sweden
16 Africa
17 Africa
18 Africa
19 Africa
20 Africa
21 Africa
22 Africa
23 Africa
24 Africa
25 Africa
26 Pakistan
27 Pakistan
28 ZA
29 ZA
Now I want to replace each country name with the name of its continent.
What I did is create a list for each continent, covering the countries that appear in my DataFrame (56 countries in total):
asia = ['Afghanistan', 'Bahrain', 'United Arab Emirates','Saudi Arabia', 'Kuwait', 'Qatar', 'Oman',
'Sultanate of Oman','Lebanon', 'Iraq', 'Yemen', 'Pakistan', 'Lebanon', 'Philippines', 'Jordan']
europe = ['Germany','Spain', 'France', 'Italy', 'Netherlands', 'Norway', 'Sweden','Czech Republic', 'Finland',
'Denmark', 'Czech Republic', 'Switzerland', 'UK', 'UK&I', 'Poland', 'Greece','Austria',
'Bulgaria', 'Hungary', 'Luxembourg', 'Romania' , 'Slovakia', 'Estonia', 'Slovenia','Portugal',
'Croatia', 'Lithuania', 'Latvia','Serbia', 'Estonia', 'ME', 'Iceland' ]
africa = ['Morocco', 'Tunisia', 'Africa', 'ZA', 'Kenya']
other = ['USA', 'Australia', 'Reunion', 'Faroe Islands']
Now I am trying to do the replacement using
dataframe['Continent'] = dataframe['Country'].replace(asia, 'Asia', regex=True)
where asia is my list name and 'Asia' is the replacement text. But it is not working; it only works for
dataframe['Continent'] = dataframe['Country'].replace(np.nan, 'Asia', regex=True)
Any help will be appreciated.

Using apply with a custom function.
Demo:
import pandas as pd
asia = ['Afghanistan', 'Bahrain', 'United Arab Emirates','Saudi Arabia', 'Kuwait', 'Qatar', 'Oman',
'Sultanate of Oman','Lebanon', 'Iraq', 'Yemen', 'Pakistan', 'Lebanon', 'Philippines', 'Jordan']
europe = ['Germany','Spain', 'France', 'Italy', 'Netherlands', 'Norway', 'Sweden','Czech Republic', 'Finland',
'Denmark', 'Czech Republic', 'Switzerland', 'UK', 'UK&I', 'Poland', 'Greece','Austria',
'Bulgaria', 'Hungary', 'Luxembourg', 'Romania' , 'Slovakia', 'Estonia', 'Slovenia','Portugal',
'Croatia', 'Lithuania', 'Latvia','Serbia', 'Estonia', 'ME', 'Iceland' ]
africa = ['Morocco', 'Tunisia', 'Africa', 'ZA', 'Kenya']
other = ['USA', 'Australia', 'Reunion', 'Faroe Islands']
def GetConti(country):
    if country in asia:
        return "Asia"
    elif country in europe:
        return "Europe"
    elif country in africa:
        return "Africa"
    else:
        return "other"
df = pd.DataFrame({"Country": ["Sweden", "Africa", "Africa", "Germany", "Germany", "UK","Pakistan"]})
df['Continent'] = df['Country'].apply(GetConti)
print(df)
Output:
Country Continent
0 Sweden Europe
1 Africa Africa
2 Africa Africa
3 Germany Europe
4 Germany Europe
5 UK Europe
6 Pakistan Asia

It would be better to store your country-to-continent map as a dictionary rather than four separate lists. You can do this as follows, starting with your current lists:
continents = {country: 'Asia' for country in asia}
continents.update({country: 'Europe' for country in europe})
continents.update({country: 'Africa' for country in africa})
continents.update({country: 'Other' for country in other})
Then you can use the Pandas map function to map continents to countries:
dataframe['Continent'] = dataframe['Country'].map(continents)
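Note that map returns NaN for any country missing from the dictionary. If you want those rows to fall back to 'Other' instead, a fillna afterwards covers it (a small sketch under that assumption):
dataframe['Continent'] = dataframe['Country'].map(continents).fillna('Other')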

Related

Full country name to country code in Dataframe

I have these kinds of countries in the dataframe: some are full country names, some are alpha-2 codes.
Country
------------------------
8836 United Kingdom
1303 ES
7688 United Kingdom
12367 FR
7884 United Kingdom
6844 United Kingdom
3706 United Kingdom
3567 UK
6238 FR
588 UK
4901 United Kingdom
568 UK
4880 United Kingdom
11284 France
1273 Spain
2719 France
1386 UK
12838 United Kingdom
868 France
1608 UK
Name: Country, dtype: object
Note: Some data in Country are empty.
How will I be able to create a new column with the alpha-2 country codes in it?
Country | Country Code
---------------------------------------
United Kingdom | UK
France | FR
FR | FR
UK | UK
Italy | IT
Spain | ES
ES | ES
...
You can try this, as I already mentioned in a comment earlier.
import pandas as pd
df = pd.DataFrame([[1, 'UK'],[2, 'United Kingdom'],[3, 'ES'],[2, 'Spain']], columns=['id', 'Country'])
# Create a copy of the Country column as alpha-2
df['alpha-2'] = df['Country']
# Create a lookup table with the required values
lookup_table = {'United Kingdom': 'UK', 'Spain': 'ES'}
# Replace the alpha-2 column with the lookup values
df = df.replace({'alpha-2': lookup_table})
print(df)
Output:
   id         Country alpha-2
0   1              UK      UK
1   2  United Kingdom      UK
2   3              ES      ES
3   2           Spain      ES
You will have to define a dictionary for the replacements (or find a library that does it for you). The abbreviations look pretty close to the IBAN country codes to me, but the biggest standout is that the code for the United Kingdom is GB, as opposed to UK in your example.
I would start with the IBAN codes and define a big dictionary like this:
mappings = {
    "Afghanistan": "AF",
    "Albania": "AL",
    ...
}
df["Country Code"] = df["Country"].replace(mappings)

Python: how to remove footnotes when loading data, and how to select the first when there is a pair of numbers

I am new to python and looking for help.
import requests
import bs4 as bs

resp = requests.get("https://en.wikipedia.org/wiki/World_War_II_casualties")
soup = bs.BeautifulSoup(resp.text)
table = soup.find("table", {"class": "wikitable sortable"})
deaths = []
for row in table.findAll('tr')[1:]:
    death = row.findAll('td')[5].text.strip()
    deaths.append(death)
It comes out as
'30,000',
'40,400',
'',
'88,000',
'2,000',
'21,500',
'252,600',
'43,600',
'15,000,000[35]to 20,000,000[35]',
'100',
'340,000 to 355,000',
'6,000',
'3,000,000to 4,000,000',
'1,100',
'83,000',
'100,000[49]',
'85,000 to 95,000',
'600,000',
'1,000,000to 2,200,000',
'6,900,000 to 7,400,000',
...
'557,000',
'5,900,000[115] to 6,000,000[116]',
'40,000to 70,000',
'500,000[39]',
'36,000–50,000',
'11,900',
'10,000',
'20,000,000[141] to 27,000,000[142][143][144][145][146]',
'',
'2,100',
'100',
'7,600',
'200',
'450,900',
'419,400',
'1,027,000[160] to 1,700,000[159]',
'',
'70,000,000to 85,000,000']
I want to plot a graph, but the [] footnotes would completely ruin it, and many of the values have them. Is it also possible to select the first number when there is a pair in one cell? I'd appreciate it if anyone could help. Thank you.
You can use .find_next() with the text=True parameter, which returns just the text node (skipping the footnote tags), then split/strip accordingly.
For example:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/World_War_II_casualties'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for tr in soup.table.select('tr:has(td)')[1:]:
    tds = tr.select('td')
    if not tds[0].b:
        continue
    name = tds[0].b.get_text(strip=True, separator=' ')
    casualties = tds[5].find_next(text=True).strip()
    print('{:<30} {}'.format(name, casualties.split('–')[0].split()[0] if casualties else ''))
Prints:
Albania 30,000
Australia 40,400
Austria
Belgium 88,000
Brazil 2,000
Bulgaria 21,500
Burma 252,600
Canada 43,600
China 15,000,000
Cuba 100
Czechoslovakia 340,000
Denmark 6,000
Dutch East Indies 3,000,000
Egypt 1,100
Estonia 83,000
Ethiopia 100,000
Finland 85,000
France 600,000
French Indochina 1,000,000
Germany 6,900,000
Greece 507,000
Guam 1,000
Hungary 464,000
Iceland 200
India 2,200,000
Iran 200
Iraq 700
Ireland 100
Italy 492,400
Japan 2,500,000
Korea 483,000
Latvia 250,000
Lithuania 370,000
Luxembourg 5,000
Malaya & Singapore 100,000
Malta 1,500
Mexico 100
Mongolia 300
Nauru 500
Nepal
Netherlands 210,000
Newfoundland 1,200
New Zealand 11,700
Norway 10,200
Papua and New Guinea 15,000
Philippines 557,000
Poland 5,900,000
Portuguese Timor 40,000
Romania 500,000
Ruanda-Urundi 36,000
South Africa 11,900
South Pacific Mandate 10,000
Soviet Union 20,000,000
Spain
Sweden 2,100
Switzerland 100
Thailand 7,600
Turkey 200
United Kingdom 450,900
United States 419,400
Yugoslavia 1,027,000
Approx. totals 70,000,000
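Alternatively, if you already have the strings from your original loop, a small regex cleanup can strip the footnotes and keep the first number of each range (a sketch; the helper name clean_value is my own, and empty cells come back as None):
import re

def clean_value(s):
    # Drop footnote markers such as "[35]"
    s = re.sub(r'\[\d+\]', '', s)
    # Keep the first number when the cell holds a range ("X to Y" or "X–Y")
    first = re.split(r'\s*(?:to|–)\s*', s)[0].strip()
    return int(first.replace(',', '')) if first else None

cleaned = [clean_value(d) for d in deaths]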

Python - Cleaning US and Canadian Zip Codes with `df.loc` and `str` Methods

I have the following code to create a column with cleaned up zip codes for the USA and Canada
df = pd.read_csv(file1)
usa = df['Region'] == 'USA'
canada = df['Region'] == 'Canada'
df.loc[usa, 'ZipCleaned'] = df.loc[usa, 'Zip'].str.slice(stop=5)
df.loc[canada, 'ZipCleaned'] = df.loc[canada, 'Zip'].str.replace(' |-','')
The issue I am having is that some of the rows with "USA" as the region contain Canadian postal codes, so the USA logic above gets applied to Canadian postal codes.
I tried the edited code above along with the code below, experimenting with one province ("BC") to keep the USA logic from being applied in that case, but it didn't work:
usa = df['Region'] == 'USA'
usa = df['State'] != 'BC'
Region Date State Zip Customer Revenue
USA 1/3/2014 BC A5Z 1B6 Customer A $157.52
Canada 1/13/2014 AB Z3J-4E5 Customer B $750.00
USA 1/4/2014 FL 90210-9999 Customer C $650.75
USA 1/21/2014 FL 12345 Customer D $242.00
USA 1/25/2014 FL 45678 Customer E $15.00
USA 1/28/2014 NY 91011 Customer F $25.00
Thanks Kris. But what if I wanted to keep the original values in the Region column and set ZipCleaned based on whether Zip contains a Canadian or a US zip code? I tried the following, but it's not working:
usa = df.loc[df['Ship To Customer Zip'].str.contains('[0-9]')]
canada = df.loc[df['Ship To Customer Zip'].str.contains('[A-Za-z]')]
df.loc[usa, 'ZipCleaned'] = df.loc[usa, 'Ship To Customer Zip'].str.slice(stop=5)
df.loc[canada, 'ZipCleaned'] = df.loc[canada, 'Ship To Customer Zip'].str.replace(' |-','')
Give this a try:
# sample df provided by OP
>>> df
Region Date State Zip Customer Revenue
0 USA 2014-01-03 BC A5Z 1B6 Customer A 157.52
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750
2 USA 2014-01-04 FL 90210-999 Customer C 650.75
3 USA 2014-01-21 FL 12345 Customer D 242
4 USA 2014-01-25 FL 45678 Customer E 15
5 USA 2014-01-28 NY 91011 Customer F 25
# Edit 'Region' by testing 'Zip' for presence of letters (US Zip Codes are only numeric)
>>> df.loc[df['Zip'].str.contains('[A-Za-z]'), 'Region'] = 'Canada'
>>> df
Region Date State Zip Customer Revenue
0 Canada 2014-01-03 BC A5Z 1B6 Customer A 157.52
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750
2 USA 2014-01-04 FL 90210-999 Customer C 650.75
3 USA 2014-01-21 FL 12345 Customer D 242
4 USA 2014-01-25 FL 45678 Customer E 15
5 USA 2014-01-28 NY 91011 Customer F 25
# apply OP's original filtering and cleaning
>>> usa = df['Region'] == 'USA'
>>> canada = df['Region'] == 'Canada'
>>> df.loc[usa, 'ZipCleaned'] = df.loc[usa, 'Zip'].str.slice(stop=5)
>>> df.loc[canada, 'ZipCleaned'] = df.loc[canada, 'Zip'].str.replace(' |-','')
# display resultant df
>>> df
Region Date State Zip Customer Revenue ZipCleaned
0 Canada 2014-01-03 BC A5Z 1B6 Customer A 157.52 A5Z1B6
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750 Z3J4E5
2 USA 2014-01-04 FL 90210-999 Customer C 650.75 90210
3 USA 2014-01-21 FL 12345 Customer D 242 12345
4 USA 2014-01-25 FL 45678 Customer E 15 45678
5 USA 2014-01-28 NY 91011 Customer F 25 91011
EDIT: Update as requested by OP: we can do the following to leave the original 'Region' intact:
>>> df
Region Date State Zip Customer Revenue
0 USA 2014-01-03 BC A5Z 1B6 Customer A 157.52
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750
2 USA 2014-01-04 FL 90210-999 Customer C 650.75
3 USA 2014-01-21 FL 123456 Customer D 242
4 USA 2014-01-25 FL 45678 Customer E 15
5 USA 2014-01-28 NY 91011 Customer F 25
# create 'ZipCleaned' by referencing original 'Zip'
>>> df.loc[~df['Zip'].str.contains('[A-Za-z]'), 'ZipCleaned'] = df['Zip'].str.slice(stop=5)
>>> df.loc[df['Zip'].str.contains('[A-Za-z]'), 'ZipCleaned'] = df['Zip'].str.replace(' |-', '')
# Resultant df
>>> df
Region Date State Zip Customer Revenue ZipCleaned
0 USA 2014-01-03 BC A5Z 1B6 Customer A 157.52 A5Z1B6
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750 Z3J4E5
2 USA 2014-01-04 FL 90210-999 Customer C 650.75 90210
3 USA 2014-01-21 FL 123456 Customer D 242 12345
4 USA 2014-01-25 FL 45678 Customer E 15 45678
5 USA 2014-01-28 NY 91011 Customer F 25 91011
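The same two-step logic can also be written as a single assignment with numpy.where (a sketch; na=False guards against missing Zip values, and regex=True makes the pattern type explicit):
import numpy as np

has_letters = df['Zip'].str.contains('[A-Za-z]', na=False)
df['ZipCleaned'] = np.where(has_letters,
                            df['Zip'].str.replace(' |-', '', regex=True),
                            df['Zip'].str.slice(stop=5))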

Subtotal for each level in Pivot table

I'm trying to create a pivot table that has, besides the grand total, a subtotal between each row level.
I created my df:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([['SOUTH AMERICA', 'BRAZIL', 'SP', 500],
              ['SOUTH AMERICA', 'BRAZIL', 'RJ', 200],
              ['SOUTH AMERICA', 'BRAZIL', 'MG', 150],
              ['SOUTH AMERICA', 'ARGENTINA', 'BA', 180],
              ['SOUTH AMERICA', 'ARGENTINA', 'CO', 300],
              ['EUROPE', 'SPAIN', 'MA', 400],
              ['EUROPE', 'SPAIN', 'BA', 110],
              ['EUROPE', 'FRANCE', 'PA', 320],
              ['EUROPE', 'FRANCE', 'CA', 100],
              ['EUROPE', 'FRANCE', 'LY', 80]], dtype=object),
    columns=["CONTINENT", "COUNTRY", "LOCATION", "POPULATION"]
)
After that, I created my pivot table as shown below:
table = pd.pivot_table(df, values=['POPULATION'], index=['CONTINENT', 'COUNTRY', 'LOCATION'], fill_value=0, aggfunc=np.sum, dropna=True)
table
To build the subtotals, I started by summing at the CONTINENT level:
tab_tots = table.groupby(level='CONTINENT').sum()
tab_tots.index = [tab_tots.index, ['Total'] * len(tab_tots)]
And concatenated it with my first pivot to get the subtotals:
pd.concat([table, tab_tots]).sort_index()
And got the concatenated result (shown as an image in the original post).
How can I get the values separated by level, like in the first table? I'm not finding a way to do this.
Use margins=True, and change your pivot's index and columns a little bit:
newdf=pd.pivot_table(df, index=['CONTINENT'],values=['POPULATION'], columns=[ 'COUNTRY', 'LOCATION'], aggfunc=np.sum, dropna=True,margins=True)
newdf.drop('All').stack([1,2])
Out[132]:
POPULATION
CONTINENT COUNTRY LOCATION
EUROPE All 1010.0
FRANCE CA 100.0
LY 80.0
PA 320.0
SPAIN BA 110.0
MA 400.0
SOUTH AMERICA ARGENTINA BA 180.0
CO 300.0
All 1330.0
BRAZIL MG 150.0
RJ 200.0
SP 500.0
IIUC:
contotal = (table.groupby(level=0).sum()
            .assign(COUNTRY='TOTAL', LOCATION='')
            .set_index(['COUNTRY', 'LOCATION'], append=True))
coutotal = (table.groupby(level=[0, 1]).sum()
            .assign(LOCATION='TOTAL')
            .set_index(['LOCATION'], append=True))
df_out = (pd.concat([table,contotal,coutotal]).sort_index())
df_out
Output:
POPULATION
CONTINENT COUNTRY LOCATION
EUROPE FRANCE CA 100
LY 80
PA 320
TOTAL 500
SPAIN BA 110
MA 400
TOTAL 510
TOTAL 1010
SOUTH AMERICA ARGENTINA BA 180
CO 300
TOTAL 480
BRAZIL MG 150
RJ 200
SP 500
TOTAL 850
TOTAL 1330
You want to do something like this instead:
tab_tots.index = [tab_tots.index, ['Total'] * len(tab_tots), [''] * len(tab_tots)]
This gives the following, which I think is what you are after:
In [277]: pd.concat([table, tab_tots]).sort_index()
Out[277]:
POPULATION
CONTINENT COUNTRY LOCATION
EUROPE FRANCE CA 100
LY 80
PA 320
SPAIN BA 110
MA 400
Total 1010
SOUTH AMERICA ARGENTINA BA 180
CO 300
BRAZIL MG 150
RJ 200
SP 500
Total 1330
Note that although this solves your problem, it isn't stylistically clean: the logic on your summed levels is inconsistent.
That makes sense for a UI, but if you are consuming the data programmatically it would be better to use
tab_tots.index = [tab_tots.index, ['All'] * len(tab_tots), ['All'] * len(tab_tots)]
This follows SQL-style table logic and will give you:
In [289]: pd.concat([table, tab_tots]).sort_index()
Out[289]:
POPULATION
CONTINENT COUNTRY LOCATION
EUROPE All All 1010
FRANCE CA 100
LY 80
PA 320
SPAIN BA 110
MA 400
SOUTH AMERICA ARGENTINA BA 180
CO 300
All All 1330
BRAZIL MG 150
RJ 200
SP 500
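For reuse, the concat approach can be wrapped in a small helper that appends a subtotal row for every partial level of the index (a sketch; the function name add_subtotals and the 'All' label are my own choices):
import pandas as pd

def add_subtotals(table, label='All'):
    # Append a subtotal row for every partial level of the MultiIndex,
    # padding the shorter keys with `label` so all levels line up.
    pieces = [table]
    nlevels = table.index.nlevels
    for depth in range(1, nlevels):
        sub = table.groupby(level=list(range(depth))).sum()
        keys = [key if isinstance(key, tuple) else (key,) for key in sub.index]
        sub.index = pd.MultiIndex.from_tuples(
            [key + (label,) * (nlevels - depth) for key in keys],
            names=table.index.names)
        pieces.append(sub)
    return pd.concat(pieces).sort_index()

Calling add_subtotals(table) on the pivot above yields both the continent and the country subtotals in one pass.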

if a column has duplicates and next column has a value how to add these to duplicate values

There is a dataframe with the columns below as input. If a username has duplicates, its Owner Region value should be filled in for the duplicate rows as well.
Queue Owner Region username
xxy aan
xyz india aan
yyx aandiapp
xox UK aandiapp
yox china aashwins
zxy aashwins
yoz aus aasyed
zxo aasyed
The required output should be:
Queue Owner Region username
xxy india aan
xyz india aan
yyx UK aandiapp
xox UK aandiapp
yox china aashwins
zxy china aashwins
yoz aus aasyed
zxo aus aasyed
Can anyone please help me? Thanks in advance.
I think you need to replace the empty values with NaNs first, and then fill them per group by forward and back filling:
import numpy as np

df['Owner Region'] = df['Owner Region'].replace('', np.nan)
df['Owner Region'] = df.groupby('username')['Owner Region'].transform(lambda x: x.ffill().bfill())
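To try this on the sample data, the frame can be constructed like so (a sketch, assuming the blank cells are empty strings):
import pandas as pd

df = pd.DataFrame({
    'Queue': ['xxy', 'xyz', 'yyx', 'xox', 'yox', 'zxy', 'yoz', 'zxo'],
    'Owner Region': ['', 'india', '', 'UK', 'china', '', 'aus', ''],
    'username': ['aan', 'aan', 'aandiapp', 'aandiapp',
                 'aashwins', 'aashwins', 'aasyed', 'aasyed'],
})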
You can use mask and groupby.
df['Owner Region'] = (
    df['Owner Region']
    .mask(df['Owner Region'].str.len().eq(0))
    .groupby(df.username)
    .ffill()
    .bfill())
df
Queue Owner Region username
0 xxy india aan
1 xyz india aan
2 yyx UK aandiapp
3 xox UK aandiapp
4 yox china aashwins
5 zxy china aashwins
6 yoz aus aasyed
7 zxo aus aasyed
When you call groupby + ffill, the subsequent bfill call does not require a groupby.
If it is possible that a group contains only NaNs, you cannot avoid the apply:
df['Owner Region'] = (
    df['Owner Region']
    .mask(df['Owner Region'].str.len().eq(0))
    .groupby(df.username)
    .apply(lambda x: x.ffill().bfill()))
df
Queue Owner Region username
0 xxy india aan
1 xyz india aan
2 yyx UK aandiapp
3 xox UK aandiapp
4 yox china aashwins
5 zxy china aashwins
6 yoz aus aasyed
7 zxo aus aasyed
