Pandas merge two dataframes and overwrite rows - python-3.x

I have two data frames that I am trying to combine -
Dataframe 1 -
Product          Buyer     Date       Store
TV               Person A  9/18/2018  Boston
DVD              Person B  4/10/2018  New York
Blue-ray Player  Person C  9/19/2018  Boston
Phone            Person A  9/18/2018  Boston
Sound System     Person C  3/05/2018  Washington
Dataframe 2 -
Product Type  Buyer     Date       Store
TV            Person B  5/29/2018  New York
Phone         Person A  2/10/2018  Washington
The first dataframe has about 500k rows while the second dataframe has about 80k rows. There are times when the second dataframe is missing some columns, but I am trying to get the final output to show the same columns as Dataframe 1, with Dataframe 1's rows updated from Dataframe 2.
The desired output looks like this -
Product          Buyer     Date       Store
TV               Person B  5/29/2018  New York
DVD              Person B  4/10/2018  New York
Blue-ray Player  Person C  9/19/2018  Boston
Phone            Person A  2/10/2018  Washington
Sound System     Person C  3/05/2018  Washington
I tried a join, but the columns end up repeated. Is there an elegant solution to do this?
Edit 1-
I have already tried -
pd.merge(df, df_correction, left_on='Product', right_on='Product Type', how='outer')
Product          Buyer_x   Date_x     Store_x     Product Type  Buyer_y   Date_y     Store_y
TV               Person A  9/18/2018  Boston      TV            Person B  5/29/2018  New York
DVD              Person B  4/10/2018  New York    NaN           NaN       NaN        NaN
Blue-ray Player  Person C  9/19/2018  Boston      NaN           NaN       NaN        NaN
Phone            Person A  9/18/2018  Boston      Phone         Person A  2/10/2018  Washington
Sound System     Person C  3/05/2018  Washington  NaN           NaN       NaN        NaN

I think combine_first is the function you are looking for: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.combine_first.html
Can you try:
df_correction.rename(columns={'Product Type': 'Product'}).set_index('Product').combine_first(df.set_index('Product')).reset_index()
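A self-contained sketch of that suggestion, using the question's frame names (df for Dataframe 1, df_correction for Dataframe 2); values from df_correction win, and everything else falls back to df:

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    'Product': ['TV', 'DVD', 'Blue-ray Player', 'Phone', 'Sound System'],
    'Buyer': ['Person A', 'Person B', 'Person C', 'Person A', 'Person C'],
    'Date': ['9/18/2018', '4/10/2018', '9/19/2018', '9/18/2018', '3/05/2018'],
    'Store': ['Boston', 'New York', 'Boston', 'Boston', 'Washington'],
})
df_correction = pd.DataFrame({
    'Product Type': ['TV', 'Phone'],
    'Buyer': ['Person B', 'Person A'],
    'Date': ['5/29/2018', '2/10/2018'],
    'Store': ['New York', 'Washington'],
})

# Align both frames on 'Product'; combine_first keeps the caller's values
# and only fills its gaps from the argument, so the corrections win.
out = (df_correction.rename(columns={'Product Type': 'Product'})
       .set_index('Product')
       .combine_first(df.set_index('Product'))
       .reset_index())
print(out)
```

Note that combine_first sorts the result by index, so the row order differs from Dataframe 1; reindexing against df['Product'] restores it if needed.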


Extract the mapping dictionary between two columns in pandas

I have a dataframe as shown below.
df:
id  player   country_code  country
1   messi    arg           argentina
2   neymar   bra           brazil
3   tevez    arg           argentina
4   aguero   arg           argentina
5   rivaldo  bra           brazil
6   owen     eng           england
7   lampard  eng           england
8   gerrard  eng           england
9   ronaldo  bra           brazil
10  marria   arg           argentina
From the above df, I would like to extract the mapping dictionary that relates the country_code and country columns.
Expected Output:
d = {'arg':'argentina', 'bra':'brazil', 'eng':'england'}
A dictionary has unique keys, so it is possible to convert the Series, whose index (country_code) contains duplicates, directly:
d = df.set_index('country_code')['country'].to_dict()
If some country could differ per country_code, the last value per key is used.
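A minimal runnable sketch (data trimmed from the question) confirming the last-value-wins behaviour of to_dict on a duplicated index:

```python
import pandas as pd

# Trimmed data from the question; 'arg' and 'bra' appear more than once.
df = pd.DataFrame({
    'country_code': ['arg', 'bra', 'arg', 'eng', 'bra'],
    'country': ['argentina', 'brazil', 'argentina', 'england', 'brazil'],
})

# Duplicated index labels collapse into one key; the last row per key wins.
d = df.set_index('country_code')['country'].to_dict()
print(d)  # {'arg': 'argentina', 'bra': 'brazil', 'eng': 'england'}
```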

How to split a Dataframe column whose data is not unique

I have a column called Users in a dataframe that doesn't have a consistent format. I am doing a data-cleanup project as the data looks unreadable.
company Users
A [{"Name":"Martin","Email":"name_1#email.com","EmpType":"Full"},{"Name":"Rick","Email":"name_2#email.com","Dept":"HR"}]
B [{"Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales" }]
I used the query below, and it broke the dataframe down as follows:
df2 = df
df2 = df2.join(df['Users'].str.split('},{', expand=True).add_prefix('Users'))
company Users0 Users1
A "Name":"Martin","Email":"name_1#email.com","EmpType":"Full" "Name":"Rick","Email":"name_2#email.com","Dept":"HR"
B "Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales"
Breaking the above df down further on "," with the same approach, I got:
Company Users01 Users02 Users03 Users10 Users11 Users12
1 "Name":"Martin" "Email":"name_1#email.com" "EmpType":"Full" "Name":"Rick" "Email":"name_2#email.com" "Dept":"HR"
2 "Name":"John" "Email":"name_2#email.com" "EmpType":"Full" "Dept":"Sales"
As this dataframe looks messy, I want output like the below. I feel the best way to name the columns is to take the key from the value itself (Name from "Name":"Martin"); if we hardcode names with df.rename, the column names will get mismatched.
Company  Name_1  Email_1           EmpType_1  Dept_1  Name_2  Email_2           Dept_2
1        Martin  name_1#email.com  Full               Rick    name_2#email.com  HR
2        John    name_2#email.com  Full       Sales
Is there any way I can get the above output from the original dataframe?
Use:
import ast

df['Users'] = df['Users'].apply(ast.literal_eval)
d = df.explode('Users').reset_index(drop=True)
d = d.join(pd.DataFrame(d.pop('Users').tolist()))
d = d.set_index(['company', d.groupby('company').cumcount().add(1).astype(str)]).unstack()
d.columns = d.columns.map('_'.join)
Details:
First we use ast.literal_eval to evaluate the strings in Users column, then use DataFrame.explode on column Users to create a dataframe d.
print(d)
company Users
0 A {'Name': 'Martin', 'Email': 'name_1#email.com', 'EmpType': 'Full'}
1 A {'Name': 'Rick', 'Email': 'name_2#email.com', 'Dept': 'HR'}
2 B {'Name': 'John', 'Email': 'name_2#email.com', 'EmpType': 'Full', 'Dept': 'Sales'}
Create a new dataframe from the Users column in d and use DataFrame.join to join this new dataframe with d.
print(d)
company Name Email EmpType Dept
0 A Martin name_1#email.com Full NaN
1 A Rick name_2#email.com NaN HR
2 B John name_2#email.com Full Sales
Use DataFrame.groupby on column company then use groupby.cumcount to create a counter for each group, then use DataFrame.set_index to set the index of d as company + counter. Then use DataFrame.unstack to reshape the dataframe creating MultiIndex columns.
print(d)
Name Email EmpType Dept
1 2 1 2 1 2 1 2
company
A Martin Rick name_1#email.com name_2#email.com Full NaN NaN HR
B John NaN name_2#email.com NaN Full NaN Sales NaN
Finally, use columns.map along with '_'.join to flatten the MultiIndex columns.
print(d)
Name_1 Name_2 Email_1 Email_2 EmpType_1 EmpType_2 Dept_1 Dept_2
company
A Martin Rick name_1#email.com name_2#email.com Full NaN NaN HR
B John NaN name_2#email.com NaN Full NaN Sales NaN
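Putting the steps together as one runnable sketch with the question's data (note that DataFrame.explode requires pandas 0.25+):

```python
import ast
import pandas as pd

# Data from the question; Users holds stringified lists of dicts.
df = pd.DataFrame({
    'company': ['A', 'B'],
    'Users': [
        '[{"Name":"Martin","Email":"name_1#email.com","EmpType":"Full"},'
        '{"Name":"Rick","Email":"name_2#email.com","Dept":"HR"}]',
        '[{"Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales"}]',
    ],
})

df['Users'] = df['Users'].apply(ast.literal_eval)   # strings -> lists of dicts
d = df.explode('Users').reset_index(drop=True)      # one row per user dict
d = d.join(pd.DataFrame(d.pop('Users').tolist()))   # dict keys -> columns
d = d.set_index(['company',
                 d.groupby('company').cumcount().add(1).astype(str)]).unstack()
d.columns = d.columns.map('_'.join)                 # flatten MultiIndex
print(d)
```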

How to fill in between rows gap comparing with other dataframe using pandas?

I want to compare df1 with df2 and fill only the blanks without overwriting other values. I have no idea how to achieve this without overwriting values or creating extra columns.
Can I do this by converting df2 into dictionary and mapping with df1?
df1 = pd.DataFrame({'players name': ['ram', 'john', 'ismael', 'sam', 'karan'],
                    'hobbies': ['jog', '', 'photos', '', 'studying'],
                    'sports': ['cricket', 'basketball', 'chess', 'kabadi', 'volleyball']})
df1:
  players name  hobbies   sports
0 ram           jog       cricket
1 john                    basketball
2 ismael        photos    chess
3 sam                     kabadi
4 karan         studying  volleyball
And df2:
df2 = pd.DataFrame({'players name': ['jagan', 'mohan', 'john', 'sam', 'karan'],
                    'hobbies': ['riding', 'tv', 'sliding', 'jumping', 'studying']})
df2:
players name hobbies
0 jagan riding
1 mohan tv
2 john sliding
3 sam jumping
4 karan studying
I want output like this:
Try this:
df1['hobbies'] = (df1['players name'].map(df2.set_index('players name')['hobbies'])
.fillna(df1['hobbies']))
df1
Output:
players name hobbies sports
0 ram jog cricket
1 john sliding basketball
2 ismael photos chess
3 sam jumping kabadi
4 karan studying volleyball
If the blank space is a NaN value:
import numpy as np

df1 = pd.DataFrame({"players name": ["ram", "john", "ismael", "sam", "karan"],
                    "hobbies": ["jog", np.nan, "photos", np.nan, "studying"],
                    "sports": ["cricket", "basketball", "chess", "kabadi", "volleyball"]})
(pd.np was removed in pandas 2.0, so import numpy directly.)
then
dicts = df2.set_index("players name")['hobbies'].to_dict()
df1['hobbies'] = df1['hobbies'].fillna(df1['players name'].map(dicts))
output:
players name hobbies sports
0 ram jog cricket
1 john sliding basketball
2 ismael photos chess
3 sam jumping kabadi
4 karan studying volleyball
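Both ideas can be folded into one runnable sketch; this variant uses mask (a swap of my own, not from either answer) so that only the empty-string blanks are filled and existing hobbies are never overwritten:

```python
import pandas as pd

# Data from the question; blanks in df1['hobbies'] are empty strings.
df1 = pd.DataFrame({'players name': ['ram', 'john', 'ismael', 'sam', 'karan'],
                    'hobbies': ['jog', '', 'photos', '', 'studying'],
                    'sports': ['cricket', 'basketball', 'chess', 'kabadi', 'volleyball']})
df2 = pd.DataFrame({'players name': ['jagan', 'mohan', 'john', 'sam', 'karan'],
                    'hobbies': ['riding', 'tv', 'sliding', 'jumping', 'studying']})

# Build a name -> hobby lookup from df2, then replace only the blank cells.
lookup = df2.set_index('players name')['hobbies']
df1['hobbies'] = df1['hobbies'].mask(df1['hobbies'] == '',
                                     df1['players name'].map(lookup))
print(df1)
```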

Appending new elements to a column in pandas dataframe

I have a pandas dataframe like this:
df1:
id name gender
1 Alice Male
2 Jenny Female
3 Bob Male
And now I want to add a new column sport which will contain values in the form of a list. Let's say I want to add Football to the rows where gender is male, so df1 will look like:
df1:
id name gender sport
1 Alice Male [Football]
2 Jenny Female NA
3 Bob Male [Football]
Now if I want to add Badminton to rows where gender is female and tennis to rows where gender is male so that final output is:
df1:
id name gender sport
1 Alice Male [Football,Tennis]
2 Jenny Female [Badminton]
3 Bob Male [Football,Tennis]
How to write a general function in python which will accomplish this task of appending new values into the column based upon some other column value?
The below should work for you. Initialize the column with empty lists and proceed:
import numpy as np

df['sport'] = np.empty((len(df), 0)).tolist()

def append_sport(df, filter_df, sport):
    df.loc[filter_df, 'sport'] = df.loc[filter_df, 'sport'].apply(lambda x: x.append(sport) or x)
    return df

filter_df = (df.gender == 'Male')
df = append_sport(df, filter_df, 'Football')
df = append_sport(df, filter_df, 'Cricket')
Output
id name gender sport
0 1 Alice Male [Football, Cricket]
1 2 Jenny Female []
2 3 Bob Male [Football, Cricket]
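The answer as a self-contained sketch; the sports here follow the question's sequence (Football, Tennis, Badminton) rather than the answer's Cricket example:

```python
import numpy as np
import pandas as pd

# Data from the question.
df = pd.DataFrame({'id': [1, 2, 3],
                   'name': ['Alice', 'Jenny', 'Bob'],
                   'gender': ['Male', 'Female', 'Male']})

# One independent empty list per row (a single shared [] would alias rows).
df['sport'] = np.empty((len(df), 0)).tolist()

def append_sport(df, mask, sport):
    # list.append returns None, so `or x` hands the mutated list back.
    df.loc[mask, 'sport'] = df.loc[mask, 'sport'].apply(lambda x: x.append(sport) or x)
    return df

df = append_sport(df, df.gender == 'Male', 'Football')
df = append_sport(df, df.gender == 'Male', 'Tennis')
df = append_sport(df, df.gender == 'Female', 'Badminton')
print(df)
```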

How to combine multiple rows of pandas dataframe into one between two other row values python3?

I have a pandas dataframe with a single column that contains name, address, and phone info separated by blank or na rows like this:
data
0 Business name one
1 1234 address ln
2 Town, ST 55655
3 (555) 555-5555
4 nan
5 Business name two
6 5678 address dr
7 New Town, ST 55677
8 nan
9 Business name three
10 nan
and so on...
What I want is this:
  Name                 Addr1            Addr2               Phone
0 Business name one    1234 address ln  Town, ST 55655      (555) 555-5555
1 Business name two    5678 address dr  New Town, ST 55677
2 Business name three
I am using python 3 and have been stuck, any help is much appreciated!
You can use:
- create groups for each row with isnull and cumsum
- to align with the non-NaN rows, add reindex
- remove NaNs by dropna, set_index to a MultiIndex with cumcount
- reshape by unstack
a = df['data'].isnull().cumsum().reindex(df.dropna().index)
print (a)
0 0
1 0
2 0
3 0
5 1
6 1
7 1
9 2
Name: data, dtype: int32
df = df.dropna().set_index([a, a.groupby(a).cumcount()])['data'].unstack()
df.columns = ['Name','Addr1','Addr2','Phone']
print (df)
      Name                 Addr1            Addr2               Phone
data
0     Business name one    1234 address ln  Town, ST 55655      (555) 555-5555
1     Business name two    5678 address dr  New Town, ST 55677  None
2     Business name three  None             None                None
If there are multiple address lines, it is possible to create the columns dynamically:
df.columns = (['Name'] +
              ['Addr{}'.format(x + 1) for x in range(len(df.columns) - 2)] +
              ['Phone'])
Alternatively, start a new group at every row that contains 'Business':
df['group'] = df['data'].str.contains('Business').cumsum().replace({True: 1}).ffill()
df1 = df.groupby('group')['data'].apply(list).apply(pd.Series).dropna(axis=1, thresh=1)
df1.columns = ['Name', 'Addr1', 'Addr2', 'Phone']
df1
Out[1221]:
       Name                 Addr1            Addr2               Phone
group
1.0    Business name one    1234 address ln  Town, ST 55655      (555) 555-5555
2.0    Business name two    5678 address dr  New Town, ST 55677  NaN
3.0    Business name three  NaN              NaN                 NaN
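The first answer's isnull/cumsum grouping, as a self-contained sketch with the question's data:

```python
import numpy as np
import pandas as pd

# Data from the question: records separated by NaN rows.
df = pd.DataFrame({'data': ['Business name one', '1234 address ln',
                            'Town, ST 55655', '(555) 555-5555', np.nan,
                            'Business name two', '5678 address dr',
                            'New Town, ST 55677', np.nan,
                            'Business name three', np.nan]})

# Each NaN bumps the cumulative sum, numbering the record every row belongs to.
a = df['data'].isnull().cumsum().reindex(df.dropna().index)
# Pivot: record number becomes the row, position within the record the column.
out = df.dropna().set_index([a, a.groupby(a).cumcount()])['data'].unstack()
out.columns = ['Name', 'Addr1', 'Addr2', 'Phone']
print(out)
```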
