How to encode pandas data frame column with three values fast? - python-3.x

I have a pandas data frame that contains a column called Country. I have more than a million rows in my data frame.
Country
USA
Canada
Japan
India
Brazil
......
I want to create a new column called Country_Encode, which will replace USA with 1, Canada with 2, and all others with 0 like the following.
Country Country_Encode
USA 1
Canada 2
Japan 0
India 0
Brazil 0
..................
I have tried the following:
for idx, row in df.iterrows():
    if df.loc[idx, 'Country'] == 'USA':
        df.loc[idx, 'Country_Encode'] = 1
    elif df.loc[idx, 'Country'] == 'Canada':
        df.loc[idx, 'Country_Encode'] = 2
    else:
        df.loc[idx, 'Country_Encode'] = 0
The above solution works but it is very slow. Do you know how I can do it in a fast way? I really appreciate any help you can provide.

Assuming no row contains two country names, you could assign values in a vectorized way using a boolean condition:
df['Country_encode'] = df['Country'].eq('USA') + df['Country'].eq('Canada')*2
Output:
Country Country_encode
0 USA 1
1 Canada 2
2 Japan 0
3 India 0
4 Brazil 0
But in general, loc is very fast:
df['Country_encode'] = 0
df.loc[df['Country'].eq('USA'), 'Country_encode'] = 1
df.loc[df['Country'].eq('Canada'), 'Country_encode'] = 2
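Another vectorized option is `Series.map` with a dictionary, using `fillna` for the catch-all zero; a sketch on a small stand-in frame (the real one has over a million rows):

```python
import pandas as pd

# small stand-in for the million-row frame
df = pd.DataFrame({'Country': ['USA', 'Canada', 'Japan', 'India', 'Brazil']})

# map the two known countries to their codes; every other country maps
# to NaN, which fillna(0) then turns into the catch-all code
df['Country_Encode'] = df['Country'].map({'USA': 1, 'Canada': 2}).fillna(0).astype(int)
```

This also scales naturally if you later need more than two special countries: just extend the dictionary.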

There are many ways to do this; the most basic is the following:
def coding(row):
    if row == "USA":
        return 1
    elif row == "Canada":
        return 2
    else:
        return 0

df["Country_code"] = df["Country"].apply(coding)
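If you prefer staying close to the if/elif shape while keeping it vectorized, `numpy.select` is another option; a sketch on a small stand-in frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'Canada', 'Japan']})

# np.select evaluates the whole column at once instead of row by row:
# each condition maps to the value at the same position, else the default
conditions = [df['Country'].eq('USA'), df['Country'].eq('Canada')]
df['Country_Encode'] = np.select(conditions, [1, 2], default=0)
```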

Related

How to compare values in a data frame in pandas [duplicate]

I am trying to calculate the biggest difference between summer gold medal counts and winter gold medal counts relative to their total gold medal count. The problem is that I need to consider only countries that have won at least 1 gold medal in both summer and winter.
Gold: Count of summer gold medals
Gold.1: Count of winter gold medals
Gold.2: Total Gold
This a sample of my data:
Gold Gold.1 Gold.2 ID diff gold %
Afghanistan 0 0 0 AFG NaN
Algeria 5 0 5 ALG 1.000000
Argentina 18 0 18 ARG 1.000000
Armenia 1 0 1 ARM 1.000000
Australasia 3 0 3 ANZ 1.000000
Australia 139 5 144 AUS 0.930556
Austria 18 59 77 AUT 0.532468
Azerbaijan 6 0 6 AZE 1.000000
Bahamas 5 0 5 BAH 1.000000
Bahrain 0 0 0 BRN NaN
Barbados 0 0 0 BAR NaN
Belarus 12 6 18 BLR 0.333333
This is the code that I have but it is giving the wrong answer:
def answer():
    Gold_Y = df2[(df2['Gold'] > 1) | (df2['Gold.1'] > 1)]
    df2['difference'] = (df2['Gold']-df2['Gold.1']).abs()/df2['Gold.2']
    return df2['diff gold %'].idxmax()

answer()
Try this code after subbing in the correct (your) function and variable names. I'm new to Python, but I think the issue was that you had to use the same variable in Line 4 (df1['difference']), and just add the method (.idxmax()) to the end. I don't think you need the first line of code for the function, either, as you don't use the local variable (Gold_Y). FYI - I don't think we're working with the same dataset.
def answer_three():
    df1['difference'] = (df1['Gold']-df1['Gold.1']).abs()/df1['Gold.2']
    return df1['difference'].idxmax()

answer_three()
def answer_three():
    atleast_one_gold = df[(df['Gold'] > 1) & (df['Gold.1'] > 1)]
    return ((atleast_one_gold['Gold'] - atleast_one_gold['Gold.1'])/atleast_one_gold['Gold.2']).idxmax()

answer_three()
def answer_three():
    _df = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
    return ((_df['Gold'] - _df['Gold.1']) / _df['Gold.2']).argmax()

answer_three()
This looks like a question from the programming assignment of the Coursera course
"Introduction to Data Science in Python"
Having said that, if you are not cheating, "maybe" the bug is here:
Gold_Y = df2[(df2['Gold'] > 1) | (df2['Gold.1'] > 1)]
You should use the & operator. The | operator selects countries that have won gold in either the Summer or the Winter Olympics, not necessarily both.
You should not get a NaN in your diff gold.
def answer_three():
    diff = df['Gold'] - df['Gold.1']
    relativegold = diff.abs() / df['Gold.2']
    df['relativegold'] = relativegold
    x = df[(df['Gold.1'] > 0) & (df['Gold'] > 0)]
    return x['relativegold'].idxmax(axis=0)

answer_three()
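To illustrate the difference between & and | mentioned above, here is a toy frame with the same column names (the numbers are made up):

```python
import pandas as pd

# toy frame: only Austria has gold in both summer and winter
df = pd.DataFrame({'Gold': [5, 0, 18], 'Gold.1': [0, 6, 59]},
                  index=['Algeria', 'Belarus', 'Austria'])

both = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]    # gold in both seasons
either = df[(df['Gold'] > 0) | (df['Gold.1'] > 0)]  # gold in at least one season
```

`both` keeps only Austria, while `either` keeps all three rows; the question asks for the `&` behaviour.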
I am pretty new to Python, and to programming as a whole, so my solution may be the most novice ever!
I love to create variables, so you'll see a lot of them in the solution.
def answer_three():
    a = df.loc[df['Gold'] > 0, 'Gold']
    # Boolean masking that keeps only the rows of 'Gold' matching the condition
    # stated in the question; in this case countries with at least one gold
    # medal in the summer Olympics.
    b = df.loc[df['Gold.1'] > 0, 'Gold.1']
    # Same comment as above, but 'Gold.1' is gold medals in the winter Olympics.
    dif = abs(a - b)
    # the absolute value of the difference between a and b.
    tots = a + b
    # I only realised later that this step wasn't essential, because the data
    # frame had already summed it up in the column 'Gold.2'.
    result = dif.dropna() / tots.dropna()
    # dropna removes the NaN rows before dividing
    return result.idxmax()
    # returns the index value of the max result
def answer_two():
    df2 = pd.Series.max(df['Gold'] - df['Gold.1'])
    df2 = df[df['Gold'] - df['Gold.1'] == df2]
    return df2.index[0]

answer_two()
def answer_three():
    m = (df['Gold'] > 0) & (df['Gold.1'] > 0)
    return ((df[m]['Gold'] - df[m]['Gold.1']) / df[m]['Gold.2']).argmax()

Groupby and calculate count and means based on multiple conditions in Pandas

Given the following dataframe:
id|address|sell_price|market_price|status|start_date|end_date
1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|Nan|4000000|forward|2019/10/9|2019/10/10
3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9
#copy above data and run below code to reproduce dataframe
df = pd.read_clipboard(sep='|')
I would like to groupby id and address and calculate mean_ratio and result_count based on the following conditions:
mean_ratio: group by id and address and calculate the mean of the rows that meet the following conditions: status is finished and start_date falls in 2019-09 or 2019-10
result_count: group by id and address and count the rows that meet the following conditions: status is either finished or failed, and start_date falls in 2019-09 or 2019-10
The desired output will like this:
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0
1 2 888 Foster Street 1.32 1
2 3 5 Pawnee Avenue 1.25 1
3 4 916 W. Mill Pond St. 1.44 3
4 5 68 Henry Drive NaN 2
I have tried so far:
# convert date
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
# calculate ratio
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
In order to filter start_date isin the range of 2019-09 and 2019-10:
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
df = df[np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])]
To filter row status is finished or failed, I use:
mask = df['status'].str.contains('finished|failed')
df[mask]
But I don't know how to combine these into the final result. Thanks for your help in advance.
I think you need GroupBy.agg, but because some rows are excluded (like id=1), add them back with DataFrame.join against all unique id and address pairs in df2, and finally replace the missing values in the result_count column:
df2 = df[['id','address']].drop_duplicates()
print (df2)
id address
0 1 7552 Atlantic Lane
2 2 888 Foster Street
5 3 5 Pawnee Avenue
9 4 916 W. Mill Pond St.
12 5 68 Henry Drive
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
mask = df['status'].str.contains('finished|failed')
mask1 = np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])
df = df[mask1 & mask]
df1 = df.groupby(['id', 'address']).agg(mean_ratio=('ratio','mean'),
                                        result_count=('ratio','size'))
df1 = df2.join(df1, on=['id','address']).fillna({'result_count': 0})
print (df1)
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0.0
2 2 888 Foster Street 1.320000 1.0
5 3 5 Pawnee Avenue 1.250000 1.0
9 4 916 W. Mill Pond St. 1.436667 3.0
12 5 68 Henry Drive NaN 2.0
Some helpers
def mean_ratio(idf):
    # filtering data
    idf = idf[
        (idf['start_date'].between('2019-09-01', '2019-10-31')) &
        (idf['mean_ratio'].notnull())]
    return np.round(idf['mean_ratio'].mean(), 2)

def result_count(idf):
    idf = idf[
        (idf['status'].isin(['finished', 'failed'])) &
        (idf['start_date'].between('2019-09-01', '2019-10-31'))]
    return idf.shape[0]

# We can calculate `mean_ratio` beforehand
df['mean_ratio'] = df['sell_price'] / df['market_price']
df = df.astype({'start_date': np.datetime64, 'end_date': np.datetime64})

# Group the df
g = df.groupby(['id', 'address'])
mean_ratio = g.apply(lambda idf: mean_ratio(idf)).to_frame('mean_ratio')
result_count = g.apply(lambda idf: result_count(idf)).to_frame('result_count')

# Final result
pd.concat((mean_ratio, result_count), axis=1)
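A third angle, not taken by either answer above: mask the values first, so a single groupby produces both aggregates and keeps every id/address pair without a join. A sketch on a hypothetical three-row frame standing in for the question's data:

```python
import pandas as pd

# hypothetical mini-frame standing in for the question's data
df = pd.DataFrame({
    'id': [1, 2, 2],
    'address': ['7552 Atlantic Lane', '888 Foster Street', '888 Foster Street'],
    'status': ['current', 'finished', 'failed'],
    'start_date': pd.to_datetime(['2019/10/12', '2019/10/14', '2019/10/08']),
    'ratio': [None, 1.32, 2.0],
})

# ratio_ok is NaN wherever the mean_ratio conditions fail, so 'mean' skips it;
# count_ok is 1 wherever the result_count conditions hold, so 'sum' counts it
in_window = df['start_date'].dt.to_period('M').isin(
    [pd.Period('2019-09'), pd.Period('2019-10')])
df['ratio_ok'] = df['ratio'].where(in_window & df['status'].eq('finished'))
df['count_ok'] = (in_window & df['status'].isin(['finished', 'failed'])).astype(int)

out = df.groupby(['id', 'address'], as_index=False).agg(
    mean_ratio=('ratio_ok', 'mean'),
    result_count=('count_ok', 'sum'))
```

Groups with no qualifying rows (like id=1 here) come out with a NaN mean and a 0 count, matching the desired output without the drop_duplicates/join step.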

How to calculate common elements in a dataframe depending on another column

I have a dataframe like this.
sport Country(s)
Foot_ball brazil
Foot_ball UK
Volleyball UK
Volleyball South_Africa
Volleyball brazil
Rugger UK
Rugger South_africa
Rugger Australia
Carrom UK
Carrom Australia
Chess UK
Chess Australia
I want to calculate the number of sports shared by two countries. For example:
Foot_ball and Volleyball are common to brazil and UK, so the number of common sports played by brazil and UK is 2.
Carrom, Chess and Rugger are common to Australia and UK, so the number of sports shared by Australia and UK is 3.
In the same way, is there any way I can get a count across the whole dataframe for
brazil, South_Africa
brazil, Australia
South_Africa, UK
etc.?
Can anybody suggest how to do this in pandas, or any other way?
With the sample data you provided, you can generate the desired output with the code below:
import pandas as pd

df = pd.DataFrame(
    [["Foot_ball", "brazil"],
     ["Foot_ball", "UK"],
     ["Volleyball", "UK"],
     ["Volleyball", "South_Africa"],
     ["Volleyball", "brazil"],
     ["Rugger", "UK"],
     ["Rugger", "South_Africa"],
     ["Rugger", "Australia"],
     ["Carrom", "UK"],
     ["Carrom", "Australia"],
     ["Chess", "UK"],
     ["Chess", "Australia"]],
    columns=["sport", "Country"])
# Function to get the number of sports in common
def countCommonSports(row):
    sports1 = df["sport"][df["Country"] == row["Country 1"]]
    sports2 = df["sport"][df["Country"] == row["Country 2"]]
    return len(set(sports1).intersection(sports2))

# Generate the combinations of countries from the original Dataframe
from itertools import combinations
comb = combinations(df["Country"].unique(), 2)
out = pd.DataFrame(list(comb), columns=["Country 1", "Country 2"])

# Find the sports in common between countries
out["common Sports count"] = out.apply(countCommonSports, axis=1)
output is then:
>>> out
Country 1 Country 2 common Sports count
0 brazil UK 2
1 brazil South_Africa 1
2 brazil Australia 0
3 UK South_Africa 2
4 UK Australia 3
5 South_Africa Australia 1
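A shorter route to the same counts, sketched on the same sample data: a self-join on sport pairs every two countries that share that sport, and a groupby then counts the distinct sports per pair. Note that pairs with nothing in common (e.g. brazil/Australia) simply don't appear in the result:

```python
import pandas as pd

df = pd.DataFrame(
    [["Foot_ball", "brazil"], ["Foot_ball", "UK"],
     ["Volleyball", "UK"], ["Volleyball", "South_Africa"], ["Volleyball", "brazil"],
     ["Rugger", "UK"], ["Rugger", "South_Africa"], ["Rugger", "Australia"],
     ["Carrom", "UK"], ["Carrom", "Australia"],
     ["Chess", "UK"], ["Chess", "Australia"]],
    columns=["sport", "Country"])

# self-join on sport: one row per (country, country) pair sharing that sport
pairs = df.merge(df, on="sport")

# keep each unordered pair once and drop the self-pairs
pairs = pairs[pairs["Country_x"] < pairs["Country_y"]]

counts = (pairs.groupby(["Country_x", "Country_y"])["sport"]
               .nunique()
               .reset_index(name="common sports"))
```

The `<` comparison is just a cheap way to deduplicate unordered pairs; it relies on country names being distinct strings.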
pd.factorize and itertools.combinations
import pandas as pd
import numpy as np
from itertools import combinations, product
# Fix Capitalization
df['Country(s)'] = ['_'.join(map(str.title, x.split('_'))) for x in df['Country(s)']]
c0, c1 = zip(*[(a, b)
               for s, c in df.groupby('sport')['Country(s)']
               for a, b in combinations(c, 2)])

i, r = pd.factorize(c0)
j, c = pd.factorize(c1)
n, m = len(r), len(c)

o = np.zeros((n, m), np.int64)
np.add.at(o, (i, j), 1)

result = pd.DataFrame(o, r, c)
result
Australia Uk South_Africa Brazil
Uk 3 0 2 1
Brazil 0 1 0 0
South_Africa 1 0 0 1
Make symmetrical
result = result.align(result.T, fill_value=0)[0]
result
Australia Brazil South_Africa Uk
Australia 0 0 0 0
Brazil 0 0 0 1
South_Africa 1 1 0 0
Uk 3 1 2 0
pd.crosstab
This will be slower... almost certainly.
c0, c1 = map(pd.Series, zip(*[(a, b)
                              for s, c in df.groupby('sport')['Country(s)']
                              for a, b in combinations(c, 2)]))

pd.crosstab(c0, c1).rename_axis(None).rename_axis(None, axis=1).pipe(
    lambda d: d.align(d.T, fill_value=0)[0]
)
Australia Brazil South_Africa Uk
Australia 0 0 0 0
Brazil 0 0 0 1
South_Africa 1 1 0 0
Uk 3 1 2 0
Or including all sports within a single country
c0, c1 = map(pd.Series, zip(*[(a, b)
                              for s, c in df.groupby('sport')['Country(s)']
                              for a, b in product(c, c)]))

pd.crosstab(c0, c1).rename_axis(None).rename_axis(None, axis=1)
Australia Brazil South_Africa Uk
Australia 3 0 1 3
Brazil 0 2 1 2
South_Africa 1 1 2 2
Uk 3 2 2 5

compare columns and replace result in existing column

I have two pandas columns, where I first compare the two columns and then replace an old string with a new one.
My data:
shopping on_List
Banana 1
Apple 0
Grapes 1
None 0
Banana 1
Nuts 0
Lemon 1
In order to compare the two I have done the following:
results = []
for shopping, on_list in zip(df.shopping, df.on_list):
    if shopping != 'None' and on_list == 1:
        items = shopping
        if items == 'Banana':
            re = items.replace('Banana', 'Bananas')
        elif items == 'Lemon':
            re = items.replace('Lemon', 'Lemons')
        elif items == 'Apples':
            re = items.replace('Apple', 'Apples')
        results.append(re)

print(results)
Output: ['Bananas','Lemons', 'Apples']
Ideally I would like to return a new column that replaces my new values with old ones in the 'shopping' column:
This is my desired output, but unfortunately my new list (results) is not the same length as the current df:
shopping
Bananas
Apples
Grapes
None
Bananas
Nuts
Lemons
I suggest creating a dictionary for the mapping and replacing only the filtered values:
d = {'Banana':'Bananas', 'Lemon':'Lemons', 'Apple':'Apples'}
mask = df['on_List'].eq(1) & df['shopping'].notnull()
df['shopping'] = df['shopping'].mask(mask, df['shopping'].map(d)).fillna(df['shopping'])
#slower solution
#df['shopping'] = df['shopping'].mask(mask, df['shopping'].replace(d))
print (df)
shopping on_List
0 Bananas 1
1 Apple 0
2 Grapes 1
3 None 0
4 Bananas 1
5 Nuts 0
6 Lemons 1
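For completeness, the same masked replacement can be written with `numpy.where`; a sketch that rebuilds the sample frame, with `d` the same mapping dictionary as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'shopping': ['Banana', 'Apple', 'Grapes', 'None', 'Banana', 'Nuts', 'Lemon'],
    'on_List': [1, 0, 1, 0, 1, 0, 1]})

d = {'Banana': 'Bananas', 'Lemon': 'Lemons', 'Apple': 'Apples'}

# replace only where on_List == 1; every other row keeps its original value,
# and items not in the dictionary (Grapes) pass through unchanged
df['shopping'] = np.where(df['on_List'].eq(1), df['shopping'].replace(d), df['shopping'])
```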
val = []
for i in range(len(df)):
    if df["shopping"][i] != "None" and df["on_List"][i] == 1:
        if df["shopping"][i] == "Banana":
            val.append("Bananas")
        elif df["shopping"][i] == "Lemon":
            val.append("Lemons")
        elif df["shopping"][i] == "Apple":
            val.append("Apples")
        else:
            # unmatched items keep their original value so the list
            # stays the same length as the frame
            val.append(df["shopping"][i])
    else:
        val.append(df["shopping"][i])

df["Result"] = pd.Series(val)

pandas pd.read_html heading shifted to the right

I'm trying to convert a wiki page table to a dataframe. The headings are shifted to the
right: 'Launches' should be where 'Successes' now is.
I have used the skiprows option, but it did not work.
df = pd.read_html(r'https://en.wikipedia.org/wiki/2018_in_spaceflight',skiprows=[1,2])[7]
df2 = df[df.columns[1:5]]
1 2 3 4
0 Launches Successes Failures Partial failures
1 India 1 1 0
2 Japan 3 3 0
3 New Zealand 1 1 0
4 Russia 3 3 0
5 United States 8 8 0
6 24 23 0 1
The problem is that there are merged cells in the first column of the original table. If you want to parse it exactly, you should write a custom parser. Provisionally, you can try:
df = pd.read_html(r'https://en.wikipedia.org/wiki/2018_in_spaceflight', header=0)[7]
df.columns = [""] + list(df.columns[:-1])
df.iloc[-1] = [""] + list(df.iloc[-1][:-1])