How to apply if, else if, else conditions in a pandas DataFrame - python-3.x

I have a column in my pandas DataFrame with country names. I want to apply different filters to the column using if-else conditions and add a new column to the DataFrame based on those conditions.
Current DataFrame:
Company Country
BV Denmark
BV Sweden
DC Norway
BV Germany
BV France
DC Croatia
BV Italy
DC Germany
BV Austria
BV Spain
I have tried this, but this way I have to define the countries again and again.
bookings_d2.loc[(bookings_d2.Country == 'Denmark') | (bookings_d2.Country == 'Norway'), 'Country'] = bookings_d2.Country
In R I currently use an ifelse condition like this; I want to implement the same thing in Python.
R Code Example 1 :
ifelse(bookings_d2$COUNTRY_NAME %in% c('Denmark','Germany','Norway','Sweden','France','Italy','Spain','Germany','Austria','Netherlands','Croatia','Belgium'),
as.character(bookings_d2$COUNTRY_NAME),'Others')
R Code Example 2 :
ifelse(bookings_d2$country %in% c('Germany'),
ifelse(bookings_d2$BOOKING_BRAND %in% c('BV'),'Germany_BV','Germany_DC'),bookings_d2$country)
Expected DataFrame:
Company Country
BV Denmark
BV Sweden
DC Norway
BV Germany_BV
BV France
DC Croatia
BV Italy
DC Germany_DC
BV Others
BV Others

Not sure exactly what you are trying to achieve, but I guess it is something along the lines of:
import pandas as pd
df = pd.DataFrame({'country': ['Sweden', 'Spain', 'China', 'Japan'], 'continent': [None] * 4})
country continent
0 Sweden None
1 Spain None
2 China None
3 Japan None
df.loc[(df.country=='Sweden') | (df.country=='Spain'), 'continent'] = "Europe"
df.loc[(df.country=='China') | (df.country=='Japan'), 'continent'] = "Asia"
country continent
0 Sweden Europe
1 Spain Europe
2 China Asia
3 Japan Asia
You can also use a Python list comprehension:
df.continent=["Europe" if (x=="Sweden" or x=="Denmark") else "Other" for x in df.country]

You can use:
For example 1: use Series.isin with numpy.where or loc; with loc it is necessary to invert the mask with ~.
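A minimal setup reproducing the question's frame for the snippets below (data copied from the question's table):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Company': ['BV', 'BV', 'DC', 'BV', 'BV', 'DC', 'BV', 'DC', 'BV', 'BV'],
    'Country': ['Denmark', 'Sweden', 'Norway', 'Germany', 'France',
                'Croatia', 'Italy', 'Germany', 'Austria', 'Spain'],
})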
# removed Austria, Spain
L = ['Denmark', 'Germany', 'Norway', 'Sweden', 'France', 'Italy',
     'Netherlands', 'Croatia', 'Belgium']
df['Country'] = np.where(df['Country'].isin(L), df['Country'], 'Others')
Alternative:
df.loc[~df['Country'].isin(L), 'Country'] ='Others'
For example 2: use numpy.select or nested np.where:
m1 = df['Country'] == 'Germany'
m2 = df['Company'] == 'BV'
df['Country'] = np.select([m1 & m2, m1 & ~m2],['Germany_BV','Germany_DC'], df['Country'])
Alternative:
df['Country'] = np.where(~m1, df['Country'],
                         np.where(m2, 'Germany_BV', 'Germany_DC'))
print(df)
Company Country
0 BV Denmark
1 BV Sweden
2 DC Norway
3 BV Germany_BV
4 BV France
5 DC Croatia
6 BV Italy
7 DC Germany_DC
8 BV Others
9 BV Others

You can also do it directly with boolean indexing:
# append the company to the 'Germany' rows (produces 'Germany_BV' / 'Germany_DC')
mask = df['Country'] == 'Germany'
df.loc[mask, 'Country'] = df.loc[mask, 'Country'] + '_' + df.loc[mask, 'Company']
# replace the countries that should be grouped together
country_others = ['Austria', 'Spain']
df.loc[df['Country'].isin(country_others), 'Country'] = 'Others'

Related

Extract the mapping dictionary between two columns in pandas

I have a dataframe as shown below.
df:
id player country_code country
1 messi arg argentina
2 neymar bra brazil
3 tevez arg argentina
4 aguero arg argentina
5 rivaldo bra brazil
6 owen eng england
7 lampard eng england
8 gerrard eng england
9 ronaldo bra brazil
10 marria arg argentina
From the above df, I would like to extract the mapping dictionary that relates the country_code column to the country column.
Expected Output:
d = {'arg':'argentina', 'bra':'brazil', 'eng':'england'}
A dictionary has unique keys, so it is possible to convert a Series whose (duplicated) index is the country_code column:
d = df.set_index('country_code')['country'].to_dict()
If some country values differ for the same country_code, the last value per country_code is used.
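If you would rather keep the first country per country_code instead, a small sketch using drop_duplicates:
d = df.drop_duplicates('country_code').set_index('country_code')['country'].to_dict()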

How to merge data with duplicates using panda python

I have the two dataframes below and I'd like to merge them to get Id onto df1. However, I find that by using merge, I cannot get the Id when a name appears more than once. df2 has unique names; df1 and df2 differ in rows and columns. My code below:
df1: Name Region
0 P Asia
1 Q Eur
2 R Africa
3 S NA
4 R Africa
5 R Africa
6 S NA
df2: Name Id
0 P 1234
1 Q 1244
2 R 1233
3 S 1111
code:
x = df1.assign(temp1 = df1.groupby('Name').cumcount())
y = df2.assign(temp1 = df2.groupby('Name').cumcount())
xy = x.merge(y, on=['Name', 'temp1'], how='left').drop(columns=['temp1'])
the output is:
df1: Name Region Id
0 P Asia 1234
1 Q Eur 1244
2 R Africa 1233
3 S NA 1111
4 R Africa NaN
5 R Africa NaN
6 S NA NaN
How do I find all the id for these duplicate names?
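Since df2 has exactly one row per Name, the cumcount trick actively prevents matches for repeated names: df2's cumcount is always 0 while df1's runs 0, 1, 2, ... A plain left merge on Name alone broadcasts each Id to every matching row (a sketch of the presumably intended call):
xy = df1.merge(df2, on='Name', how='left')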

Flag repeating entries in pandas time series

I have a data frame that takes this form (but is several millions of rows long):
import pandas as pd

data = {'id': ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
        'year': ["2000", "2001", "2002", "2000", "2001", "2003", "1999", "2000", "2001", "2000", "2000", "2001"],
        'vacation': ["France", "Morocco", "Morocco", "Germany", "Germany", "Germany", "Japan", "Australia", "Japan", "Canada", "Mexico", "China"],
        'new': [1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)
A 2000 France
A 2001 Morocco
A 2002 Morocco
B 2000 Germany
B 2001 Germany
B 2003 Germany
C 1999 Japan
C 2000 Australia
C 2001 Japan
D 2000 Canada
D 2000 Mexico
D 2001 China
For each person in each year, the holiday destination(s) is/are given; there can be multiple holiday destinations in a given year.
I would like to flag the rows where a participant goes on holiday to a destination they had not visited the year before (i.e., the destination is new). For the data above, the output would be:
id year vacation new
A 2000 France 1
A 2001 Morocco 1
A 2002 Morocco 0
B 2000 Germany 1
B 2001 Germany 0
B 2003 Germany 0
C 1999 Japan 1
C 2000 Australia 1
C 2001 Japan 1
D 2000 Canada 1
D 2000 Mexico 1
D 2001 China 1
For A, B, C, and D, the first holiday destination in the data frame is flagged as new. When A goes to Morocco two years in a row, the 2nd occurrence is not flagged, because A went there the year before. When B goes to Germany three times, the 2nd and 3rd occurrences are not flagged. When C goes to Japan twice, both occurrences are flagged, because they did not go to Japan two years in a row. D goes to three different destinations (albeit two of them in 2000) and all of them are flagged.
I have been trying to solve it myself, but have not been able to break away from iterations, which are too computationally intensive for such a massive dataset.
I'd appreciate any input; thanks.
IIUC, what we are doing is grouping by id & vacation and checking that each row is not equal to the row above, i.e. selecting the first instance of that combination.
Hopefully that's clear. Let me know if you need any more help.
df["new_2"] = (
df.groupby(["id", "vacation"])["id", "year"]
.apply(lambda x: x.ne(x.shift()))
.all(axis=1)
.add(0)
)
print(df)
id year vacation new new_2
0 A 2000 France 1 1
1 A 2001 Morocco 1 1
2 A 2002 Morocco 0 0
3 B 2000 Germany 1 1
4 B 2001 Germany 0 0
5 B 2003 Germany 0 0
6 C 1999 Japan 1 1
7 C 2000 Australia 1 1
8 C 2001 Japan 1 0
9 D 2000 Canada 1 1
10 D 2000 Mexico 1 1
11 D 2001 China 1 1
Here's one solution I came up with, using groupby and transform:
df = df.sort_values(["id", "vacation", "year"])
df["new"] = (
df.groupby(["id", "vacation"])
.transform(lambda x: x.iloc[0])
.year.eq(df.year)
.astype(int)
)
You'll get (df is left in its sorted order):
id year vacation new
0 A 2000 France 1
1 A 2001 Morocco 1
2 A 2002 Morocco 0
3 B 2000 Germany 1
4 B 2001 Germany 0
5 B 2003 Germany 0
7 C 2000 Australia 1
6 C 1999 Japan 1
8 C 2001 Japan 0
9 D 2000 Canada 1
11 D 2001 China 1
10 D 2000 Mexico 1
Here is a way using groupby + cumcount and Series.mask:
# cumcount numbers occurrences within each (id, vacation) group; keep 1 for the first, 0 otherwise
df['new'] = df.groupby(['id', 'vacation']).cumcount().add(1).mask(lambda x: x.gt(1), 0)
print(df)
id year vacation new
0 A 2000 France 1
1 A 2001 Morocco 1
2 A 2002 Morocco 0
3 B 2000 Germany 1
4 B 2001 Germany 0
5 B 2003 Germany 0
6 C 1999 Japan 1
7 C 2000 Australia 1
8 C 2001 Japan 0
9 D 2000 Canada 1
10 D 2000 Mexico 1
11 D 2001 China 1
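Note that all three snippets flag only the first occurrence of each (id, vacation) pair, so C's 2001 return to Japan comes out as 0. A sketch of the literal "not the year before" rule stated in the question (the column new_strict is introduced here; it yields 1 for B's 2003 trip, unlike the sample output, because 2002 was skipped):
# lookup of (id, vacation, year + 1) for every trip taken
seen_prev = set(zip(df['id'], df['vacation'], df['year'].astype(int) + 1))
# a trip is new unless the same destination was visited the year before
df['new_strict'] = [
    0 if (i, v, int(y)) in seen_prev else 1
    for i, v, y in zip(df['id'], df['vacation'], df['year'])
]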

How to calculate common elements in a dataframe depending on another column

I have a dataframe like this.
sport Country(s)
Foot_ball brazil
Foot_ball UK
Volleyball UK
Volleyball South_Africa
Volleyball brazil
Rugger UK
Rugger South_africa
Rugger Australia
Carrom UK
Carrom Australia
Chess UK
Chess Australia
I want to calculate the number of sports shared by two countries. For example:
Foot_ball and Volleyball are common to brazil and UK, so the number of common sports played by brazil and UK is 2.
Carrom, Chess and Rugger are common to Australia and UK, so the number of sports shared by Australia and UK is 3.
Is there any way I can get such a count for every pair of countries in the whole dataframe, e.g.
brazil, South_Africa
brazil, Australia
South_Africa, UK
etc.?
Can anybody suggest how to do this in pandas or any other way?
With the sample data you provided, you can generate the desired output with the code below:
import pandas as pd

df = pd.DataFrame(
    [["Foot_ball", "brazil"],
     ["Foot_ball", "UK"],
     ["Volleyball", "UK"],
     ["Volleyball", "South_Africa"],
     ["Volleyball", "brazil"],
     ["Rugger", "UK"],
     ["Rugger", "South_Africa"],
     ["Rugger", "Australia"],
     ["Carrom", "UK"],
     ["Carrom", "Australia"],
     ["Chess", "UK"],
     ["Chess", "Australia"]],
    columns=["sport", "Country"])
# Function to get the number of sports in common
def countCommonSports(row):
    sports1 = df["sport"][df["Country"] == row["Country 1"]]
    sports2 = df["sport"][df["Country"] == row["Country 2"]]
    return len(set(sports1).intersection(sports2))
# Generate the combinations of countries from original Dataframe
from itertools import combinations
comb = combinations(df["Country"].unique(), 2)
out = pd.DataFrame(list(comb), columns=["Country 1", "Country 2"])
# Find out the sports in common between coutries
out["common Sports count"] = out.apply(countCommonSports, axis = 1)
output is then:
>>> out
Country 1 Country 2 common Sports count
0 brazil UK 2
1 brazil South_Africa 1
2 brazil Australia 0
3 UK South_Africa 2
4 UK Australia 3
5 South_Africa Australia 1
pd.factorize and itertools.combinations
import pandas as pd
import numpy as np
from itertools import combinations, product

# Fix capitalization so variants like 'South_africa' match across rows
df['Country(s)'] = ['_'.join(map(str.title, x.split('_'))) for x in df['Country(s)']]

# All ordered pairs of countries that share a sport
c0, c1 = zip(*[(a, b)
               for s, c in df.groupby('sport')['Country(s)']
               for a, b in combinations(c, 2)])

# Encode each side of the pairs as integer positions
i, r = pd.factorize(c0)
j, c = pd.factorize(c1)
n, m = len(r), len(c)

# Accumulate the pair counts into a dense matrix
o = np.zeros((n, m), np.int64)
np.add.at(o, (i, j), 1)

result = pd.DataFrame(o, r, c)
result
Australia Uk South_Africa Brazil
Uk 3 0 2 1
Brazil 0 1 0 0
South_Africa 1 0 0 1
Make symmetrical
result = result.align(result.T, fill_value=0)[0]
result
Australia Brazil South_Africa Uk
Australia 0 0 0 0
Brazil 0 0 0 1
South_Africa 1 1 0 0
Uk 3 1 2 0
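align only squares the frame; each pair is still counted in a single orientation. To make the counts genuinely symmetric you could add the transpose (a small sketch building on result above):
sym = result + result.T
Every pair's count then appears in both (row, column) orders.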
pd.crosstab
This will be slower... almost certainly.
c0, c1 = map(pd.Series, zip(*[(a, b)
                              for s, c in df.groupby('sport')['Country(s)']
                              for a, b in combinations(c, 2)]))

pd.crosstab(c0, c1).rename_axis(None).rename_axis(None, axis=1).pipe(
    lambda d: d.align(d.T, fill_value=0)[0]
)
Australia Brazil South_Africa Uk
Australia 0 0 0 0
Brazil 0 0 0 1
South_Africa 1 1 0 0
Uk 3 1 2 0
Or including all sports within a single country
c0, c1 = map(pd.Series, zip(*[(a, b)
                              for s, c in df.groupby('sport')['Country(s)']
                              for a, b in product(c, c)]))

pd.crosstab(c0, c1).rename_axis(None).rename_axis(None, axis=1)
Australia Brazil South_Africa Uk
Australia 3 0 1 3
Brazil 0 2 1 2
South_Africa 1 1 2 2
Uk 3 2 2 5

How to access a column of grouped data to perform linear regression in pandas?

I want to perform a linear regression on the groups of a grouped data frame in pandas. The function I am calling throws a KeyError that I cannot resolve.
I have an environmental data set called dat that includes concentration data for a chemical in different tree species of various age classes at sites in different countries over the course of several time steps. I now want to regress concentration on time steps within each group of (site, species, age).
This is my code:
```
import pandas as pd
import statsmodels.api as sm
dat = pd.read_csv('data.csv')
dat.head(15)
SampleName Concentration Site Species Age Time_steps
0 batch1 2.18 Germany pine 1 1
1 batch2 5.19 Germany pine 1 2
2 batch3 11.52 Germany pine 1 3
3 batch4 16.64 Norway spruce 0 1
4 batch5 25.30 Norway spruce 0 2
5 batch6 31.20 Norway spruce 0 3
6 batch7 12.63 Norway spruce 1 1
7 batch8 18.70 Norway spruce 1 2
8 batch9 43.91 Norway spruce 1 3
9 batch10 9.41 Sweden birch 0 1
10 batch11 11.10 Sweden birch 0 2
11 batch12 15.73 Sweden birch 0 3
12 batch13 16.87 Switzerland beech 0 1
13 batch14 22.64 Switzerland beech 0 2
14 batch15 29.75 Switzerland beech 0 3
def ols_res_grouped(group):
    xcols_const = sm.add_constant(group['Time_steps'])
    linmod = sm.OLS(group['Concentration'], xcols_const).fit()
    return linmod.params[1]
grouped = dat.groupby(['Site','Species','Age']).agg(ols_res_grouped)
```
I want to get the regression coefficient of concentration over Time_steps, but I get a KeyError: 'Time_steps'. How can the statsmodels call access group["Time_steps"]?
According to the pandas documentation, agg applies functions to each column independently, so ols_res_grouped receives a single column at a time and group['Time_steps'] raises the KeyError.
It might be possible to use NamedAgg, but I am not sure.
I think it is a lot easier to just use a for loop for this:
for _, group in dat.groupby(['Site', 'Species', 'Age']):
    coeff = ols_res_grouped(group)
    # if you want to put the coeff inside the dataframe
    dat.loc[group.index, 'coeff'] = coeff
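Alternatively, groupby.apply passes each sub-DataFrame whole, so the original function works unchanged (a sketch using the question's column names):
slopes = dat.groupby(['Site', 'Species', 'Age']).apply(ols_res_grouped)
slopes is then a Series of regression slopes indexed by (Site, Species, Age).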
