Pandas: Fill rows if 2 column strings are the same - python-3.x

I have a data set with a ton of columns, and I want to back-fill missing values from existing rows. The logic is: if 'school' and 'country' match between two rows, copy the known 'state' value into the row where 'state' is empty.
Here is an example. The problem with my attempt is that it combines the rows, and I want to keep the rows separate. Is there a way? Thanks!
Sample Data:
import pandas as pd
school = ['Univ of CT','Univ of CT','Oxford','Oxford','ABC Univ']
name = ['John','Matt','John','Ashley','John']
country = ['US','US','UK','UK','']
state = ['CT','','','ENG','']
df = pd.DataFrame({'school':school,'country':country,'state':state,'name':name})
df['school'] = df['school'].str.upper()
The data above looks like this:
school      country  state  name
UNIV OF CT  US       CT     John
UNIV OF CT  US              Matt
OXFORD      UK              John
OXFORD      UK       ENG    Ashley
ABC UNIV                    John
I am looking for output like this:
school      country  state  name
UNIV OF CT  US       CT     John
UNIV OF CT  US       CT     Matt
OXFORD      UK       ENG    John
OXFORD      UK       ENG    Ashley
ABC UNIV                    John
Code I tried:
df = df.fillna('')
df = df.reset_index().groupby(['school','country']).agg(';'.join)
df = pd.DataFrame(df).reset_index()
len(df)

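You can write a small function that looks up the state from other rows with the same school and country whenever the state is blank.
def find_state(school, country, state):
    # Keep the state if the row already has one.
    if len(state) > 0:
        return state
    # Otherwise collect the states of all rows with the same school and country.
    found_state = df['state'][(df['school'] == school) & (df['country'] == country)]
    # max() prefers a non-empty string over '' when one exists.
    return max(found_state)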
So the full example would be as follows:
import pandas as pd
school = ['Univ of CT','Univ of CT','Oxford','Oxford','ABC Univ']
name = ['John','Matt','John','Ashley','John']
country = ['US','US','UK','UK','']
state = ['CT','','','ENG','']
df = pd.DataFrame({'school':school,'country':country,'state':state,'name':name})
df['school'] = df['school'].str.upper()
def find_state(school, country, state):
    if len(state) > 0:
        return state
    found_state = df['state'][(df['school'] == school) & (df['country'] == country)]
    return max(found_state)
df['state_new'] = [find_state(school, country, state) for school, country, state in
                   df[['school','country','state']].values]
print(df)
       school  country  state    name  state_new
0  UNIV OF CT       US     CT    John         CT
1  UNIV OF CT       US           Matt         CT
2      OXFORD       UK           John        ENG
3      OXFORD       UK    ENG  Ashley        ENG
4    ABC UNIV                    John
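The same lookup can also be written without a Python-level loop. The following is a minimal sketch, assuming blank states are stored as empty strings as in the sample data. The names `fill_state_by_lookup`, `lookup` and `state_found` are hypothetical and not part of the answer above.

```python
import pandas as pd


def fill_state_by_lookup(frame):
    """Return a copy with a state_new column filled from rows sharing school and country."""
    # One known state per (school, country) pair, taken from rows where state is not blank.
    lookup = (
        frame.loc[frame["state"] != "", ["school", "country", "state"]]
        .drop_duplicates(["school", "country"])
        .rename(columns={"state": "state_found"})
    )
    # A left merge keeps every original row, in the original order.
    merged = frame.merge(lookup, on=["school", "country"], how="left")
    # Keep an existing state; otherwise use the looked-up one, or '' if none exists.
    merged["state_new"] = (
        merged["state"].where(merged["state"] != "", merged["state_found"]).fillna("")
    )
    return merged.drop(columns="state_found")


df = pd.DataFrame({
    "school": ["UNIV OF CT", "UNIV OF CT", "OXFORD", "OXFORD", "ABC UNIV"],
    "country": ["US", "US", "UK", "UK", ""],
    "state": ["CT", "", "", "ENG", ""],
    "name": ["John", "Matt", "John", "Ashley", "John"],
})
result = fill_state_by_lookup(df)
print(result)
```

When a pair has several different known states, this sketch keeps the first one it meets, whereas the function above keeps the largest string.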

Try this: first convert the empty strings to NaN, then use ffill() and bfill() within each group.
import numpy as np
df = pd.DataFrame({'school':school,'country':country,'state':state,'name':name})
df['school'] = df['school'].str.upper()
df['state'] = df['state'].astype(str).replace('',np.nan)
# Chain ffill and bfill inside one lambda so both fills stay within each (school, country) group.
df['state'] = df.groupby(['school', 'country'])['state'].transform(lambda x: x.ffill().bfill())
print(df)
school country state name
UNIV OF CT US CT John
UNIV OF CT US CT Matt
OXFORD UK ENG John
OXFORD UK ENG Ashley
ABC UNIV NaN John
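To see that a group-wise fill does not borrow values from neighbouring groups, here is a small self-contained sketch. It assumes blanks are empty strings; the helper name `fill_state_within_groups` and the reordered sample rows are illustrative only.

```python
import numpy as np
import pandas as pd


def fill_state_within_groups(frame, keys=("school", "country"), target="state"):
    """Fill blank values of `target` from other rows that share the same keys."""
    out = frame.copy()
    # Treat empty strings as missing so ffill/bfill can act on them.
    out[target] = out[target].replace("", np.nan)
    # Both fills run inside the lambda, so they never cross a group boundary.
    out[target] = out.groupby(list(keys))[target].transform(lambda s: s.ffill().bfill())
    return out


# ABC UNIV sits between two other groups and has no known state.
df = pd.DataFrame({
    "school": ["UNIV OF CT", "UNIV OF CT", "ABC UNIV", "OXFORD", "OXFORD"],
    "country": ["US", "US", "", "UK", "UK"],
    "state": ["CT", "", "", "", "ENG"],
})
filled = fill_state_within_groups(df)
print(filled)
```

The ABC UNIV row stays missing because no row with the same school and country carries a state.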

Related

Python merge two dataframe based on text similarity of their columns

I am working with two dataframes which look like this:
df1
country_1                          column1
united states of america           abcd
Ireland (Republic of Ireland)      efgh
Korea Rep Of                       fsdf
Switzerland (Swiss Confederation)  dsaa
df2
country_2      column2
united states  cdda
Ireland        ddgd
South Korea    rewt
Switzerland    tuut
desired output:
country_1                          column1  country_2      column2
united states of america           abcd     united states  cdda
Ireland (Republic of Ireland)      efgh     Ireland        ddgd
Korea Rep Of                       fsdf     South Korea    rewt
Switzerland (Swiss Confederation)  dsaa     Switzerland    tuut
I am not that familiar with text analytics, so I don't know which method would tackle this problem. I have tried string matching and regex, but neither solves it.
You can use difflib.
Data:
data1 = {
"country_1": ["united states of america", "Ireland (Republic of Ireland)", "Korea Rep Of", "Switzerland (Swiss Confederation)"],
"column1": ["abcd", "efgh", "fsdf", "dsaa"]
}
df1 = pd.DataFrame(data1)
data2 = {
"country_2": ["united states", "Ireland", "Korea", "Switzerland"],
"column2": ["cdda", "ddgd", "rewt", "tuut"]
}
df2 = pd.DataFrame(data2)
Code:
import difflib
from dataclasses import dataclass
from typing import Optional
import pandas as pd
@dataclass
class FuzzyMerge:
    """
    Works like pandas merge except also merges on approximate matches.
    """
    left: pd.DataFrame
    right: pd.DataFrame
    left_on: str
    right_on: str
    how: str = "inner"
    cutoff: float = 0.3
    def main(self) -> pd.DataFrame:
        temp = self.right.copy()
        # For each right-hand key, store the closest left-hand key so a normal merge can pair them.
        temp[self.left_on] = [
            self.get_closest_match(x, self.left[self.left_on]) for x in temp[self.right_on]
        ]
        return self.left.merge(temp, on=self.left_on, how=self.how)
    def get_closest_match(self, value: str, candidates: pd.Series) -> Optional[str]:
        # get_close_matches returns the best candidates first, or an empty list below the cutoff.
        matches = difflib.get_close_matches(value, candidates, cutoff=self.cutoff)
        return matches[0] if matches else None
Call the class:
merged = FuzzyMerge(left=df1, right=df2, left_on="country_1", right_on="country_2").main()
print(merged)
Output:
   country_1                          column1  country_2      column2
0  united states of america           abcd     united states  cdda
1  Ireland (Republic of Ireland)      efgh     Ireland        ddgd
2  Korea Rep Of                       fsdf     Korea          rewt
3  Switzerland (Swiss Confederation)  dsaa     Switzerland    tuut
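The matching step on its own can be tried with nothing but the standard library. This is a small sketch of how `difflib.get_close_matches` behaves with a low and a high cutoff; the helper name `closest` is hypothetical, and the candidate list is copied from the sample data.

```python
import difflib


def closest(value, candidates, cutoff=0.3):
    """Return the candidate most similar to `value`, or None if nothing reaches the cutoff."""
    matches = difflib.get_close_matches(value, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


left_keys = [
    "united states of america",
    "Ireland (Republic of Ireland)",
    "Korea Rep Of",
    "Switzerland (Swiss Confederation)",
]

# A permissive cutoff pairs the short names with their longer forms.
print(closest("united states", left_keys))
print(closest("Ireland", left_keys))
# A strict cutoff rejects the same pair, so the helper returns None.
print(closest("Ireland", left_keys, cutoff=0.9))
```

The cutoff is a trade-off: a low value such as 0.3 tolerates long suffixes like "(Republic of Ireland)" but can also pair unrelated strings, so it is worth checking the matches on real data before merging.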
You can also approach this with the pandas combining operations join, merge and concat. I suggest starting with concat, as it is the easiest of the three. Note that concat with axis=1 pairs rows by index position, so it only produces the desired output when both frames list the countries in the same order.
First make sure both inputs are DataFrames:
data1 = pd.DataFrame(data1)
data2 = pd.DataFrame(data2)
Then concatenate them side by side:
data = pd.concat([data1, data2], axis=1)

Show differences at row level between columns of 2 dataframes Pandas

I have 2 dataframes containing names and some demographic information. The dataframes are not identical because of monthly changes.
I'd like to create another df that shows just the people whose COUNTRY, JOBCODE or MANAGERNAME changed, along with what kind of change it was.
So far I have tried the following code and can detect changes in the COUNTRY column for the rows common to both dataframes.
But I am not sure how to capture the movement in the MOVEMENT columns. I'd appreciate any help.
# Merge first
dfmerge = pd.merge(df1, df2, how='inner', on='EMAIL')
# Create a function to get the COUNTRYCHANGE column
def change_in(dfmerge):
    if dfmerge['COUNTRY_x'] != dfmerge['COUNTRY_y']:
        return 'YES'
    else:
        return 'NO'
dfmerge['COUNTRYCHANGE'] = dfmerge.apply(change_in, axis=1)
Dataframe 1
NAME          EMAIL                COUNTRY    JOBCODE  MANAGERNAME
Jason Kelly   jasonkelly@123.com   USA        1221     Jon Gilman
Jon Gilman    jongilman@123.com    CANADA     1222     Cindy Lee
Jessica Lang  jessicalang@123.com  AUSTRALIA  1221     Esther Donato
Bob Wilder    bobwilder@123.com    ROMANIA    1355     Mike Lens
Samir Bala    samirbala@123.com    CANADA     1221     Ricky Easton
Dataframe 2
NAME          EMAIL                COUNTRY    JOBCODE  MANAGERNAME
Jason Kelly   jasonkelly@123.com   VIETNAM    1221     Jon Gilman
Jon Gilman    jongilman@123.com    CANADA     4464     Sheldon Tracey
Jessica Lang  jessicalang@123.com  AUSTRALIA  2224     Esther Donato
Bob Wilder    bobwilder@123.com    ROMANIA    1355     Emilia Tanner
Desired Output
EMAIL                COUNTRY_CHANGE  COUNTRY_MOVEMENT     JOBCODE_CHANGE  JOBCODE_MOVEMENT   MGR_CHANGE  MGR_MOVEMENT
jasonkelly@123.com   YES             FROM USA TO VIETNAM  NO              NO                 NO          NO
jongilman@123.com    NO              NO                   YES             FROM 1222 to 4464  YES         FROM Cindy Lee to Sheldon Tracey
jessicalang@123.com  NO              NO                   YES             FROM 1221 to 2224  NO          NO
bobwilder@123.com    NO              NO                   NO              NO                 YES         FROM Mike Lens to Emilia Tanner
There is no direct feature in pandas for this, but we can leverage merge as follows: merge the two dataframes with a suffix for the overlapping columns, then report the differences between each suffixed pair.
# Assuming df1 and df2 are the input data frames from your example.
df3 = pd.merge(df1, df2, on=['NAME', 'EMAIL'], suffixes=('past', 'present'))
dfans = pd.DataFrame()  # this is the final output data frame
for column in df1.columns:
    if not (column + 'present' in df3.columns or column + 'past' in df3.columns):
        # Here we handle the columns that the merge does not suffix, i.e. NAME and EMAIL.
        dfans[column] = df3[column]  # copy NAME and EMAIL as they are
    else:
        # string manipulation to name columns correctly in output
        newColumn1 = '{}_CHANGE'.format(column)
        newColumn2 = '{}_MOVEMENT'.format(column)
        past, present = "{}past".format(column), "{}present".format(column)
        # creating the output based on input
        dfans[newColumn1] = (df3[past] != df3[present]).map(lambda x: "YES" if x else "NO")
        dfans[newColumn2] = ["FROM {} TO {}".format(x, y) if x != y else "NO" for x, y in
                             zip(df3[past], df3[present])]
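The merge-and-compare idea can be condensed into a vectorised helper. This is a sketch under assumptions: the function name `report_changes`, the `_old`/`_new` suffixes and the trimmed-down two-row sample are illustrative, and EMAIL is assumed to identify a person uniquely.

```python
import pandas as pd


def report_changes(old, new, key, columns):
    """Build <col>_CHANGE and <col>_MOVEMENT columns for rows present in both frames."""
    # An inner merge keeps only the keys that appear in both snapshots.
    merged = old.merge(new, on=key, suffixes=("_old", "_new"))
    report = merged[[key]].copy()
    for col in columns:
        before, after = merged[f"{col}_old"], merged[f"{col}_new"]
        changed = before != after
        report[f"{col}_CHANGE"] = changed.map({True: "YES", False: "NO"})
        # Build the text for every row, then keep it only where the value changed.
        movement = "FROM " + before.astype(str) + " TO " + after.astype(str)
        report[f"{col}_MOVEMENT"] = movement.where(changed, "NO")
    return report


old = pd.DataFrame({
    "EMAIL": ["jasonkelly@123.com", "jongilman@123.com"],
    "COUNTRY": ["USA", "CANADA"],
    "JOBCODE": [1221, 1222],
})
new = pd.DataFrame({
    "EMAIL": ["jasonkelly@123.com", "jongilman@123.com"],
    "COUNTRY": ["VIETNAM", "CANADA"],
    "JOBCODE": [1221, 4464],
})
report = report_changes(old, new, key="EMAIL", columns=["COUNTRY", "JOBCODE"])
print(report)
```

To list only the people with at least one change, the report can be filtered on the `_CHANGE` columns afterwards.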

Separate a name into first and last name using Pandas

I have a DataFrame that looks like this:
name                 birth
John Henry Smith     1980
Hannah Gonzalez      1900
Michael Thomas Ford  1950
Michelle Lee         1984
And I want to create two new columns, "middle" and "last" for the middle and last names of each person, respectively. People who have no middle name should have None in that data frame.
This would be my ideal result:
name      middle  last      birth
John      Henry   Smith     1980
Hannah    None    Gonzalez  1900
Michael   Thomas  Ford      1950
Michelle  None    Lee       1984
I have tried different approaches, such as this:
df['middle'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 2 else None)
df['last'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 1 else x.split(" ")[2])
I even made some functions that try to do the same thing more carefully, but I always get the same error: "List Index out of range". This is weird because if I go about printing df.iloc[i,0].split(" ") for i in range(len(df)), I do get lists with length 2 or length 3 only.
I also printed x.count(" ") for all x in the "name" column and I always got either 1 or 2 as a result. There are no single names.
This is my first question so thank you so much!
Use Series.str.split with expand=True.
df2 = (df['name'].str
       .split(' ', expand=True)
       .rename(columns={0: 'name', 1: 'middle', 2: 'last'}))
# Two-part names land in 'middle', so move them to 'last' and blank out 'middle'.
new_df = df2.assign(middle=df2['middle'].where(df2['last'].notnull()),
                    last=df2['last'].fillna(df2['middle']),
                    birth=df['birth'])
print(new_df)
name middle last birth
0 John Henry Smith 1980
1 Hannah NaN Gonzalez 1900
2 Michael Thomas Ford 1950
3 Michelle NaN Lee 1984
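An alternative that does not depend on the number of split parts is a single regular expression with an optional middle group. This is a sketch, not the answer's method: the pattern and the group name `first` are my own, and it assumes every name has at least a first and a last part (a single-word name would yield NaN in all three columns).

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John Henry Smith", "Hannah Gonzalez", "Michael Thomas Ford", "Michelle Lee"],
    "birth": [1980, 1900, 1950, 1984],
})

# first word, optional middle part, last word; tolerant of repeated or stray spaces
pattern = r"^\s*(?P<first>\S+)(?:\s+(?P<middle>.+?))?\s+(?P<last>\S+)\s*$"

# Named groups become column names; an unmatched optional group becomes NaN.
parts = df["name"].str.extract(pattern)
result = parts.assign(birth=df["birth"])
print(result)
```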

How to remove first character from the string and store the same into new column in Pandas?

I have a column called Student name, and each row holds four or five student names, like this: John mills, Tim Harry, Alex win, Kate marry... I want to take the first two student names and store them in new columns called Student 1 and Student 2. The names are separated by commas.
I wrote a function and can extract the first student name. The result is stored in my dataframe as student_0.
def find_student(df2):
    for i in range(2):
        df2[f"student name_{i}"] = [x.split(',')[i] for x in df2["student name"]]
        return df2
new_df = find_student(df2)
df2 is the name of my dataframe.
I am not getting the second student name. Please advise.
Use Series.str.split and select the first 2 columns by position with DataFrame.iloc if you need both names and surnames:
print (df2)
student name
0 John mills, Tim Harry, Alex win, Kate marry
1 Brando XI, James Caan, Richard S. Castellano
2 Heath Ledger, Aaron Eckhart, Michael Caine
N = 2
df3 = df2["student name"].str.split(', ', expand=True).iloc[:, :N]
#rename columns names
df3.columns = [f"student name_{i+1}" for i in range(len(df3.columns))]
print (df3)
student name_1 student name_2
0 John mills Tim Harry
1 Brando XI James Caan
2 Heath Ledger Aaron Eckhart
Or use list comprehension:
N = 2
L = [x.split(', ')[:N] for x in df2["student name"]]
df3 = pd.DataFrame(L, columns=[f"student name_{i+1}" for i in range(N)])
print (df3)
student name_1 student name_2
0 John mills Tim Harry
1 Brando XI James Caan
2 Heath Ledger Aaron Eckhart
If you need only the first names:
N = 2
L = [[y.split()[0] for y in x.split(',')[:N]] for x in df2["student name"]]
df3 = pd.DataFrame(L, columns=[f"student name_{i+1}" for i in range(N)])
print (df3)
student name_1 student name_2
0 John Tim
1 Brando James
2 Heath Aaron
#join to original if necessary
df2 = df2.join(df3)
Try this:
def find_student(df2):
    for i in range(2):
        df2[f"student name_{i}"] = pd.Series(map(lambda x: x.split(',')[i], df2["student name"]))
    return df2
Use pandas string functionality (str and split); you don't need to write a function.
df = [["John mills, Tim Harry, Alex win, Kate marry"],
["Brando XI, James Caan, Richard S. Castellano"],
["Heath Ledger,Aaron Eckhart, Michael Caine"]]
df2 = pd.DataFrame(df)
df2.columns = ['Student_Name']
df2['student name_1'] = df2.Student_Name.str.split(",").str[0]
df2['student name_2'] = df2.Student_Name.str.split(",").str[1]
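The approaches above can be wrapped into one helper that also strips the spaces left around the commas and copes with rows that hold fewer names than requested. A minimal sketch; the function name `first_n_names` and the second sample row are made up for illustration.

```python
import pandas as pd


def first_n_names(series, n=2):
    """Split comma-separated names and return the first n as separate columns."""
    # expand=True gives one column per name; short rows are padded with missing values.
    parts = series.str.split(",", expand=True)
    # Remove the spaces that follow (or precede) each comma.
    parts = parts.apply(lambda col: col.str.strip())
    # Keep exactly n columns, adding empty ones if no row has that many names.
    parts = parts.reindex(columns=range(n))
    parts.columns = [f"student name_{i + 1}" for i in range(n)]
    return parts


names = pd.Series([
    "John mills, Tim Harry, Alex win, Kate marry",
    "Solo Student",
])
out = first_n_names(names)
print(out)
```

The result can be attached to the original frame with `join`, as in the first answer.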

Appending new elements to a column in pandas dataframe

I have a pandas dataframe like this:
df1:
id name gender
1 Alice Male
2 Jenny Female
3 Bob Male
And now I want to add a new column sport which will contain values in the form of a list. Let's say I want to add Football to the rows where gender is male. So df1 will look like:
df1:
id name gender sport
1 Alice Male [Football]
2 Jenny Female NA
3 Bob Male [Football]
Now if I want to add Badminton to rows where gender is female and tennis to rows where gender is male so that final output is:
df1:
id name gender sport
1 Alice Male [Football,Tennis]
2 Jenny Female [Badminton]
3 Bob Male [Football,Tennis]
How to write a general function in python which will accomplish this task of appending new values into the column based upon some other column value?
The below should work for you. Initialize the column with empty lists and proceed:
import numpy as np
df['sport'] = np.empty((len(df), 0)).tolist()
def append_sport(df, filter_df, sport):
    # list.append returns None, so `or x` hands the updated list back to apply.
    df.loc[filter_df, 'sport'] = df.loc[filter_df, 'sport'].apply(lambda x: x.append(sport) or x)
    return df
filter_df = (df.gender == 'Male')
df = append_sport(df, filter_df, 'Football')
df = append_sport(df, filter_df, 'Cricket')
Output
id name gender sport
0 1 Alice Male [Football, Cricket]
1 2 Jenny Female []
2 3 Bob Male [Football, Cricket]
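A variant that builds new lists instead of mutating the existing ones avoids surprises when several rows end up sharing the same list object. This is a sketch under assumptions: the helper name `append_where` is hypothetical, and the frame is assumed to look like df1 from the question.

```python
import pandas as pd


def append_where(frame, mask, value, column="sport"):
    """Return a copy of `frame` with `value` appended to `column` wherever `mask` is True."""
    out = frame.copy()
    if column not in out.columns:
        # Start every row with its own empty list.
        out[column] = pd.Series([[] for _ in range(len(out))], index=out.index)
    # items + [value] creates a new list, so the input frame's lists are left untouched.
    updated = [items + [value] if hit else items for items, hit in zip(out[column], mask)]
    out[column] = pd.Series(updated, index=out.index)
    return out


df1 = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Jenny", "Bob"],
    "gender": ["Male", "Female", "Male"],
})
step1 = append_where(df1, df1["gender"] == "Male", "Football")
step2 = append_where(step1, step1["gender"] == "Female", "Badminton")
final = append_where(step2, step2["gender"] == "Male", "Tennis")
print(final)
```

Rows that never match keep an empty list rather than NA, the same as in the output above.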
