How to replace a string with loc. output? - python-3.x

I have two dataframes:
Dataframe_A:
Account_Nbr Customer_ID Gender
1234 A1234 male
5678 ? female
Dataframe_B:
Account_Nbr Customer_ID
1234 A1234
5678 B5678
And I want to replace the '?' in Dataframe_A with 'B5678'. Here is my code:
Dataframe_A = Dataframe_A.assign(
    Customer_ID=lambda x: [
        cid if cid != '?' else
        Dataframe_B.loc[Dataframe_B['Account_Nbr'] == acct, ['Customer_ID']]
        for cid, acct in zip(x.Customer_ID, x.Account_Nbr)])
Dataframe_A
But the output is not what I expect:
Account_Nbr Customer_ID Gender
1234 A1234 male
5678 Customer_ID female
B5678
It looks like it replaces the cell with a whole Series/DataFrame. How can I get the output below instead? Thank you.
Account_Nbr Customer_ID Gender
1234 A1234 male
5678 B5678 female

The below code should do the job.
import pandas as pd

df1 = pd.DataFrame(
    [[1234, 'A1234', 'male'],
     [5678, '?', 'female']],
    columns=['Account_Nbr', 'Customer_ID', 'Gender'])
df2 = pd.DataFrame(
    [[1234, 'A1234'],
     [5678, 'B5678']],
    columns=['Account_Nbr', 'Customer_ID'])

# Rows whose account numbers line up (the two frames share the same index here).
mask = df1['Account_Nbr'] == df2['Account_Nbr']
# Overwrite Customer_ID in df1 with the value from df2 for those rows.
df1.loc[mask, 'Customer_ID'] = df2.loc[mask, 'Customer_ID']
df1.head()
Output:
Account_Nbr Customer_ID Gender
0 1234 A1234 male
1 5678 B5678 female
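Note that the boolean mask above assumes df1 and df2 share the same row order and index. If that is not guaranteed, a lookup keyed on Account_Nbr may be safer; a minimal sketch reusing the frames defined above (not part of the original answer, and assuming Account_Nbr is unique in df2):
# Map each account number to its Customer_ID in df2.
lookup = df2.set_index('Account_Nbr')['Customer_ID']

# Only overwrite the cells that are '?'.
missing = df1['Customer_ID'] == '?'
df1.loc[missing, 'Customer_ID'] = df1.loc[missing, 'Account_Nbr'].map(lookup)
print(df1)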

Related

Using Regex to change the name values format in a dataframe

I'm pretty sure I'm asking the wrong question here, so here goes. I have 2 dataframes, let's call them df1 and df2.
df1 looks like this:
data = {'Employee ID': [12345, 23456, 34567],
        'Values': [123168546543154, 13513545435145434, 556423145613],
        'Employee Name': ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
        'Department Supervisor': ['Wendy Davis', 'Albus Dumbledore', 'James Halliday']}
df1 = pd.DataFrame(data, columns=['Employee ID', 'Values', 'Employee Name', 'Department Supervisor'])
df2 looks similar:
data = {'Employee ID': [12345, 23456, 34567],
        'Employee Name': ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
        'Department Supervisor': ['Davis, Wendy', 'Dumbledore, Albus', 'Halliday, James']}
df2 = pd.DataFrame(data, columns=['Employee ID', 'Employee Name', 'Department Supervisor'])
My issue is that df1 comes from an Excel file that sometimes has an Employee ID entered and sometimes doesn't. This is where df2 comes in: df2 is a SQL pull from the employee database that I'm using to validate the employee names and supervisor names, to ensure the correct employee ID is used.
Normally I'd be happy to merge the dataframes to get my desired result, but with the supervisor names being in different formats I'd like to use regex on df1 to turn 'Wendy Davis' into 'Davis, Wendy' (and likewise for the other supervisor names) so they match what df2 has. So far I'm coming up empty on how to even search for an answer. Suggestions?
IIUC, is this what you need?
df1['DS Corrected'] = df1['Department Supervisor'].str.replace(r'(\w+) (\w+)', r'\2, \1', regex=True)
Output:
Employee ID Values Employee Name Department Supervisor DS Corrected
0 12345 123168546543154 Jones, John Wendy Davis Davis, Wendy
1 23456 13513545435145434 Potter, Harry Albus Dumbledore Dumbledore, Albus
2 34567 556423145613 Watts, Wade James Halliday Halliday, James
Since Albus's full name is Albus Percival Wulfric Brian Dumbledore and James's is James Donovan Halliday (if we're talking about Ready Player One), consider a dataframe of:
Employee ID Values Employee Name Department Supervisor
0 12345 123168546543154 Jones, John Wendy Davis
1 23456 13513545435145434 Potter, Harry Albus Percival Wulfric Brian Dumbledore
2 34567 556423145613 Watts, Wade James Donovan Halliday
So we need to swap the last name to the front with...
import pandas as pd

data = {'Employee ID': [12345, 23456, 34567],
        'Values': [123168546543154, 13513545435145434, 556423145613],
        'Employee Name': ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
        'Department Supervisor': ['Wendy Davis', 'Albus Percival Wulfric Brian Dumbledore', 'James Donovan Halliday']}
df1 = pd.DataFrame(data, columns=['Employee ID', 'Values', 'Employee Name', 'Department Supervisor'])

def swap_names(text):
    # Split into first name, any middle names, and last name.
    first, *middle, last = text.split()
    if len(middle) == 0:
        return last + ', ' + first
    else:
        return last + ', ' + first + ' ' + ' '.join(middle)

df1['Department Supervisor'] = [swap_names(row) for row in df1['Department Supervisor']]
print(df1)
Outputs:
Employee ID Values Employee Name Department Supervisor
0 12345 123168546543154 Jones, John Davis, Wendy
1 23456 13513545435145434 Potter, Harry Dumbledore, Albus Percival Wulfric Brian
2 34567 556423145613 Watts, Wade Halliday, James Donovan
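As a side note (not part of this answer), the same swap can also be done with a single regex by treating everything before the last space as the given names; a sketch applied to the original, unswapped values:
# Greedy group 1 grabs everything before the final space; group 2 is the surname.
swapped = df1['Department Supervisor'].str.replace(r'^(.*)\s+(\S+)$', r'\2, \1', regex=True)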
Maybe...
df1['Department Supervisor'] = [', '.join(x.split()[::-1]) for x in df1['Department Supervisor']]
Outputs:
Employee ID Values Employee Name Department Supervisor
0 12345 123168546543154 Jones, John Davis, Wendy
1 23456 13513545435145434 Potter, Harry Dumbledore, Albus
2 34567 556423145613 Watts, Wade Halliday, James
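Since the stated goal is validating df1 against df2, a possible follow-up (a sketch, not from any answer above; 'DS Corrected' is the column built in the first answer) is to merge on the normalized names:
# Left-join df2 onto df1 using the employee name plus the reformatted supervisor name.
merged = df1.merge(
    df2,
    left_on=['Employee Name', 'DS Corrected'],
    right_on=['Employee Name', 'Department Supervisor'],
    how='left',
    suffixes=('', '_db'))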

Merge pandas dataframes based on several conditions for a large dataset

I want to merge two data frames based on certain conditions. First, I want to match on the full name only; then, for the mismatched entries, I would like to use the first and last names as the matching condition. I have two dataframes as follows:
df1
first_name last_name full_name
John Shoeb John Shoeb
John Shumon John Md Shumon
Abu Babu Abu A Babu
William Curl William Curl
df2
givenName surName displayName
John Shoeb John Shoeb
John Shumon John M Shumon
Abu Babu Abu Babu
Raju Kaju Raju Kaju
Bill Curl Bill Curl
I first merge them based on full name:
df3 = pd.merge(df1, df2, left_on=df1['full_name'].str.lower(), right_on=df2['displayName'].str.lower(), how='left')
And add status and log columns:
df3.loc[ (df3.full_name.str.lower()==df3.displayName.str.lower()), 'status'] = True
df3.loc[ (df3.full_name.str.lower()==df3.displayName.str.lower()), 'log'] = 'Full Name Matching'
So the resultant dataframe df3 now looks like:
first_name last_name full_name givenName surName displayName status log
John Shoeb John Shoeb John Shoeb John Shoeb True Full Name Matching
John Shumon John Md Shumon NaN NaN NaN NaN NaN
Abu Babu Abu A Babu NaN NaN NaN NaN NaN
William Curl William Curl NaN NaN NaN False NaN
Expected Results
Now I want to apply matching condition based on df1 (First Name and Last Name) and df2 (givenName and surName). The final dataframe should look like as follows:
first_name last_name full_name givenName surName displayName status log
John Shoeb John Shoeb John Shoeb John Shoeb True Full Name Matching
John Shumon John Md Shumon John Shumon John Shumon True FN LN Matching
Abu Babu Abu A Babu Abu Babu Abu Babu True FN LN Matching
William Curl William Curl NaN NaN NaN False NaN
Question: For the second part, i.e. the first-name and last-name matching, I was able to get it done using the dataframe's itertuples(). However, when the same operations are applied to a huge dataset, it keeps running forever. I'm looking for an efficient way to apply this to a big chunk of data.
You can use indicator=True in your merges, then check whether the first and second merges were "both" (for example with np.where):
import numpy as np
import pandas as pd

# First merge: full name vs. displayName (case-insensitive).
df3 = (
    pd.merge(
        df1,
        df2,
        left_on=df1["full_name"].str.lower(),
        right_on=df2["displayName"].str.lower(),
        how="left",
        indicator=True,
    )
    .drop(columns="key_0")
    .rename(columns={"_merge": "first_merge"})
)

# Second merge: first + last name vs. givenName + surName (case-insensitive).
df3 = pd.merge(
    df3,
    df2,
    left_on=df1["first_name"].str.lower() + " " + df1["last_name"].str.lower(),
    right_on=df2["givenName"].str.lower() + " " + df2["surName"].str.lower(),
    how="left",
    indicator=True,
)

df3["log"] = np.where(
    df3["first_merge"] == "both",
    "Full Name Matching",
    np.where(df3["_merge"] == "both", "FN LN Matching", None),
)
df3["status"] = df3["log"].notna()

df3 = df3[
    [
        "first_name",
        "last_name",
        "full_name",
        "givenName_y",
        "surName_y",
        "displayName_y",
        "status",
        "log",
    ]
].rename(
    columns={
        "givenName_y": "givenName",
        "surName_y": "surName",
        "displayName_y": "displayName",
    }
)
print(df3)
Prints:
first_name last_name full_name givenName surName displayName status log
0 John Shoeb John Shoeb John Shoeb John Shoeb True Full Name Matching
1 John Shumon John Md Shumon John Shumon John M Shumon True FN LN Matching
2 Abu Babu Abu A Babu Abu Babu Abu Babu True FN LN Matching
3 William Curl William Curl NaN NaN NaN False None
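For reference, this toy merge (the frames here are purely illustrative, not from the question) shows what indicator=True contributes: a categorical _merge column with values left_only, right_only, or both, which is what the np.where calls check:
import pandas as pd

left = pd.DataFrame({'key': [1, 2]})
right = pd.DataFrame({'key': [2, 3], 'val': ['b', 'c']})

# The extra '_merge' column records where each output row came from.
print(pd.merge(left, right, on='key', how='left', indicator=True))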

pandas help: map and match tab delimted strings in a column and print into new column

I have a dataframe data whose last column contains strings of IDs and digits, and another dataframe info that explains what those IDs mean. I want to map the user input (item) against info, then match, print, and count how many of the corresponding IDs are present in the last column of data, and finally prioritize (sort) data by the number of matches.
import pandas as pd

# data
data = {'id': [123, 456, 789, 1122, 3344],
        'Name': ['abc', 'def', 'hij', 'klm', 'nop'],
        'MP-ID': ['MP:001|MP:0085|MP:0985', 'MP:005|MP:0258', 'MP:025|MP:5890',
                  'MP:0589|MP:02546', 'MP:08597|MP:001|MP:005']}
test_data = pd.DataFrame(data)
# info
info = {'MP-ID': ['MP:001', 'MP:002', 'MP:003', 'MP:004', 'MP:005'],
        'Item': ['apple', 'orange', 'grapes', 'bannan', 'mango']}
test_info = pd.DataFrame(info)
User input example:
run.py apple mango
desired output:
id Name MP-ID match count
3344 nop MP:08597|MP:001|MP:005 MP:001|MP:005 2
123 abc MP:001|MP:0085|MP:0985 MP:001 1
456 def MP:005|MP:0258 MP:005 1
789 hij MP:025|MP:5890 0
1122 klm MP:0589|MP:02546 0
Thank you for your help in advance
First get all command-line arguments into the variable vals, filter MP-ID by Series.isin with DataFrame.loc, extract the matches with Series.str.findall and join them with Series.str.join, then count them with Series.str.count and sort with DataFrame.sort_values:
import sys

vals = sys.argv[1:]
# vals = ['apple', 'mango']
s = test_info.loc[test_info['Item'].isin(vals), 'MP-ID']
test_data['MP-ID match'] = test_data['MP-ID'].str.findall('|'.join(s)).str.join('|')
test_data['count'] = test_data['MP-ID match'].str.count('MP')
test_data = test_data.sort_values('count', ascending=False, ignore_index=True)
print(test_data)
id Name MP-ID MP-ID match count
0 3344 nop MP:08597|MP:001|MP:005 MP:001|MP:005 2
1 123 abc MP:001|MP:0085|MP:0985 MP:001 1
2 456 def MP:005|MP:0258 MP:005 1
3 789 hij MP:025|MP:5890 0
4 1122 klm MP:0589|MP:02546 0
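A regex-free variant (a sketch, not part of the answer above) splits each cell on '|' and keeps only the wanted IDs, which also sidesteps partial regex matches such as 'MP:001' hitting inside a longer ID:
# IDs that correspond to the requested items.
wanted = set(test_info.loc[test_info['Item'].isin(vals), 'MP-ID'])

test_data['MP-ID match'] = test_data['MP-ID'].apply(
    lambda cell: '|'.join(mp for mp in cell.split('|') if mp in wanted))
test_data['count'] = test_data['MP-ID match'].str.count('MP:')
test_data = test_data.sort_values('count', ascending=False, ignore_index=True)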

Python dataframe: converting columns into rows

I have a dataframe as follows
d = {'Movie': ['The Shawshank Redemption', 'The Godfather'],
     'FirstName1': ['Tim', 'Marlon'],
     'FirstName2': ['Morgan', 'Al'],
     'LastName1': ['Robbins', 'Brando'],
     'LastName2': ['Freeman', 'Pacino'],
     'ID1': ['TM', 'MB'],
     'ID2': ['MF', 'AP']}
df = pd.DataFrame(d)
df
df
I would like to rearrange it into a 4-column dataframe by converting FirstName1, LastName1, ID1, FirstName2, LastName2, ID2 into three columns (FirstName, LastName, ID), with the Movie column repeated, as follows.
In sql we do it as follows
select Movie as Movie, FirstName1 as FirstName, LastName1 as LastName, ID1 as ID from table
union
select Movie as Movie, FirstName2 as FirstName, LastName2 as LastName, ID2 as ID from table
Can we achieve this using pandas?
If the numbers in the column names can go above 9, use Series.str.extract to split each column name into its text part and its integer suffix, build a MultiIndex from the two parts, and then reshape with DataFrame.stack:
df = df.set_index('Movie')
df1 = df.columns.to_series().str.extract(r'([a-zA-Z]+)(\d+)')
df.columns = pd.MultiIndex.from_arrays([df1[0], df1[1].astype(int)])
df = df.rename_axis((None, None), axis=1).stack().reset_index(level=1, drop=True).reset_index()
print (df)
Movie FirstName ID LastName
0 The Shawshank Redemption Tim TM Robbins
1 The Shawshank Redemption Morgan MF Freeman
2 The Godfather Marlon MB Brando
3 The Godfather Al AP Pacino
If not, use string indexing to take the last character of each column name (and everything before it) and pass both parts to MultiIndex.from_arrays:
df = df.set_index('Movie')
df.columns = pd.MultiIndex.from_arrays([df.columns.str[:-1], df.columns.str[-1].astype(int)])
df = df.stack().reset_index(level=1, drop=True).reset_index()
print (df)
Movie FirstName ID LastName
0 The Shawshank Redemption Tim TM Robbins
1 The Shawshank Redemption Morgan MF Freeman
2 The Godfather Marlon MB Brando
3 The Godfather Al AP Pacino
df = df.set_index('Movie')
df.columns = pd.MultiIndex.from_tuples([(col[:-1], col[-1:]) for col in df.columns])
df.stack()
# FirstName ID LastName
#Movie
#The Shawshank Redemption 1 Tim TM Robbins
# 2 Morgan MF Freeman
#The Godfather 1 Marlon MB Brando
# 2 Al AP Pacino
Use the power of MultiIndex! With from_tuples you create a DataFrame that has one column group for FirstName, divided into FirstName1 and FirstName2 (see below), and similarly for LastName and ID. With stack you convert each group into rows. Before you do this, make Movie the index to exclude it from the reshaping. You could use reset_index() to regain everything as columns, but I'm not sure if you want that.
Before stack:
# FirstName LastName ID
# 1 2 1 2 1 2
#Movie
#The Shawshank Redemption Tim Morgan Robbins Freeman TM MF
#The Godfather Marlon Al Brando Pacino MB AP
I think an easy way to do this is to use the copy function from pandas.
You can copy the columns 'Movie', 'FirstName1', 'LastName1', 'ID1' into a new table, then drop the columns you don't need, and build a second table for the other set in the same way.
new = df[['Movie', 'FirstName1', 'LastName1', 'ID1']].copy()
Try below:
d1 = df.filter(regex="1$|Movie").rename(columns=lambda x: x[:-1])
d2 = df.filter(regex="2$|Movie").rename(columns=lambda x: x[:-1])
pd.concat([d1, d2]).rename(columns={'Movi': 'Movie'})
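For completeness (this is not from any of the answers above), pandas also ships pd.wide_to_long, which is designed for exactly this numbered-suffix layout; a minimal sketch starting again from the original df built in the question:
import pandas as pd

# Stub names are the column prefixes; the numeric suffix is collected into 'num'.
out = (pd.wide_to_long(df, stubnames=['FirstName', 'LastName', 'ID'], i='Movie', j='num')
       .reset_index()
       .drop(columns='num'))
print(out)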

Appending new elements to a column in pandas dataframe

I have a pandas dataframe like this:
df1:
id name gender
1 Alice Male
2 Jenny Female
3 Bob Male
And now I want to add a new column sport which will contain values in the form of a list. Let's say I want to add Football to the rows where gender is Male, so df1 will look like:
df1:
id name gender sport
1 Alice Male [Football]
2 Jenny Female NA
3 Bob Male [Football]
Now I want to add Badminton to rows where gender is Female and Tennis to rows where gender is Male, so that the final output is:
df1:
id name gender sport
1 Alice Male [Football,Tennis]
2 Jenny Female [Badminton]
3 Bob Male [Football,Tennis]
How do I write a general function in Python that accomplishes this task of appending new values to the column based on some other column's value?
The below should work for you. Initialize the column with empty lists and proceed:
import numpy as np

# Start every row off with its own empty list.
df['sport'] = np.empty((len(df), 0)).tolist()

def append_sport(df, filter_df, sport):
    # list.append returns None, so 'or x' hands the mutated list back to apply.
    df.loc[filter_df, 'sport'] = df.loc[filter_df, 'sport'].apply(lambda x: x.append(sport) or x)
    return df

filter_df = (df.gender == 'Male')
df = append_sport(df, filter_df, 'Football')
df = append_sport(df, filter_df, 'Cricket')
Output
id name gender sport
0 1 Alice Male [Football, Cricket]
1 2 Jenny Female []
2 3 Bob Male [Football, Cricket]
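Using the same helper for the question's exact follow-up (Badminton for Female, Tennis for Male) might then look like:
df = append_sport(df, df.gender == 'Female', 'Badminton')
df = append_sport(df, df.gender == 'Male', 'Tennis')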
