How to merge two data frames with duplicate rows? - python-3.x

I have two data frames, df1 and df2. In df1, values in the name column repeat while the hobby column changes. df2 also has repeated values in the name column. I want to merge the two data frames and keep every row.
df1:
name hobby
mike cricket
mike football
jack chess
jack football
jack vollyball
pieter sleeping
pieter cyclying
df2:
name
mike
pieter
jack
mike
pieter
Now I have to merge df2 with df1 on the name column, keeping the row order of df2, so the resulting df3 should look like this:
df3:
name hobby
mike cricket
mike football
pieter sleeping
pieter cyclying
jack chess
jack football
jack vollyball
mike cricket
mike football
pieter sleeping
pieter cyclying

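Assign a rank to each row of df2 to record its order, merge on name, then sort by that rank:
import numpy as np

(df2.assign(rank=np.arange(len(df2)))    # remember df2's original row order
.merge(df1, on='name')                   # one output row per matching hobby
.sort_values('rank', kind='stable')      # restore df2's order; a stable sort keeps the hobby order within each name
.drop('rank', axis=1)
)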
Output:
name hobby
0 mike cricket
1 mike football
4 pieter sleeping
5 pieter cyclying
8 jack chess
9 jack football
10 jack vollyball
2 mike cricket
3 mike football
6 pieter sleeping
7 pieter cyclying
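
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same rank, merge, sort idea. The frames are rebuilt from the sample data in the question. The final reset_index is my own addition and only tidies the index, so treat it as optional.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (spelling kept as posted).
df1 = pd.DataFrame({
    "name": ["mike", "mike", "jack", "jack", "jack", "pieter", "pieter"],
    "hobby": ["cricket", "football", "chess", "football", "vollyball",
              "sleeping", "cyclying"],
})
df2 = pd.DataFrame({"name": ["mike", "pieter", "jack", "mike", "pieter"]})

df3 = (
    df2.assign(rank=np.arange(len(df2)))   # record df2's row order
       .merge(df1, on="name")              # expand each df2 row to all its hobbies
       .sort_values("rank", kind="stable") # back to df2's order
       .drop(columns="rank")
       .reset_index(drop=True)             # assumption: a clean 0..n-1 index is wanted
)
print(df3)
```

Each row of df2 expands to as many rows as that name has in df1, so the result here has 11 rows.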

Related

Matching two columns with the same row values in a csv file

I have a csv file with 4 columns:
Name            Dept       Email Name    Hair Color
John Smith      candy      Lincoln Tun   brown
Diana Princ     candy      John Smith    gold
Perry Plat      wood       Oliver Twist  bald
Jerry Springer  clothes    Diana Princ   gold
Calvin Klein    clothes
Lincoln Tun     warehouse
Oliver Twist    kitchen
I want to align the Email Name column with the Name column, so that matching names end up on the same row.
This is what the final output should look like:
Name            Dept       Email Name    Hair Color
John Smith      candy      John Smith    gold
Diana Princ     candy      Diana Princ   gold
Perry Plat      wood
Jerry Springer  clothes
Calvin Klein    clothes
Lincoln Tun     warehouse  Lincoln Tun   brown
Oliver Twist    kitchen    Oliver Twist  bald
I tried something like this in my code:
dfs = np.split(df,len(df.columns), axis=1)
dfs = [df.set_index(df.columns[0], drop=False) for df in dfs]
f=dfs[0].join(dfs[1:]).reset_index(drop=True).fillna(0)
This aligned the two name columns correctly but turned everything else into 0s:
Name            Dept  Email Name    Hair Color
John Smith      0     John Smith    0
Diana Princ     0     Diana Princ   0
Perry Plat      0     0             0
Jerry Springer  0     0             0
Calvin Klein    0     0             0
Lincoln Tun     0     Lincoln Tun   0
Oliver Twist    0     Oliver Twist  0
Here is my code so far:
import pandas as pd
import numpy as np
import os, csv, sys
csvPath = 'User.csv'
df= pd.read_csv(csvPath)
dfs = np.split(df,len(df.columns), axis=1)
dfs = [df.set_index(df.columns[0], drop=False) for df in dfs]
f=dfs[0].join(dfs[1:]).reset_index(drop=True).fillna(0)
testCSV = 'test_user.csv' #to check my csv file
f.to_csv(testCSV, encoding='utf-8') #send it to csv

You could use merge for that:
pd.merge(df[['Name','Dept']],df[['Email Name','Hair Color']], left_on='Name', right_on='Email Name', how='left')
Result:
   Name            Dept       Email Name    Hair Color
0  John Smith      candy      John Smith    gold
1  Diana Princ     candy      Diana Princ   gold
2  Perry Plat      wood       NaN           NaN
3  Jerry Springer  clothes    NaN           NaN
4  Calvin Klein    clothes    NaN           NaN
5  Lincoln Tun     warehouse  Lincoln Tun   brown
6  Oliver Twist    kitchen    Oliver Twist  bald
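
A runnable sketch of that merge, with the frame rebuilt by hand from the sample CSV instead of read from User.csv. Two things here are my own assumptions and not part of the answer: the empty Email Name rows are dropped from the right-hand side before merging, and the remaining NaN cells are filled with empty strings to mimic the blank cells in the desired output.

```python
import numpy as np
import pandas as pd

# Rebuilt from the question's sample; in practice this would be pd.read_csv(csvPath).
df = pd.DataFrame({
    "Name": ["John Smith", "Diana Princ", "Perry Plat", "Jerry Springer",
             "Calvin Klein", "Lincoln Tun", "Oliver Twist"],
    "Dept": ["candy", "candy", "wood", "clothes", "clothes", "warehouse", "kitchen"],
    "Email Name": ["Lincoln Tun", "John Smith", "Oliver Twist", "Diana Princ",
                   np.nan, np.nan, np.nan],
    "Hair Color": ["brown", "gold", "bald", "gold", np.nan, np.nan, np.nan],
})

left = df[["Name", "Dept"]]
# Assumption: rows without an Email Name carry no information, so drop them.
right = df[["Email Name", "Hair Color"]].dropna(subset=["Email Name"])

# The left join keeps every Name row, in its original order.
aligned = left.merge(right, left_on="Name", right_on="Email Name", how="left")
aligned = aligned.fillna("")  # blank cells instead of NaN, as in the desired output
print(aligned)
```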

How to update a dataframe column from a second dataframe by matching on two columns whose values can repeat in the first dataframe?

I have two dataframes with different information about a person. In the first dataframe, a person's name may repeat in different rows. I want to add/update the first dataframe with data from the second dataframe wherever the two columns identifying the person match in both. Here is an example of what I need to accomplish:
df1:
name surname
0 john doe
1 mary doe
2 peter someone
3 mary doe
4 john another
5 paul another
df2:
name surname account_id
0 peter someone 100
1 john doe 200
2 mary doe 300
3 john another 400
I need to accomplish this:
df1:
name surname account_id
0 john doe 200
1 mary doe 300
2 peter someone 100
3 mary doe 300
4 john another 400
5 paul another <empty>
Thanks!
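
One possible approach, sketched under the assumption that each (name, surname) pair appears at most once in df2: a left merge on both columns keeps every row of df1 in order and leaves account_id empty where there is no match. The conversion to the nullable "Int64" dtype is optional and only keeps the ids as integers.

```python
import pandas as pd

df1 = pd.DataFrame({
    "name": ["john", "mary", "peter", "mary", "john", "paul"],
    "surname": ["doe", "doe", "someone", "doe", "another", "another"],
})
df2 = pd.DataFrame({
    "name": ["peter", "john", "mary", "john"],
    "surname": ["someone", "doe", "doe", "another"],
    "account_id": [100, 200, 300, 400],
})

# validate="many_to_one" raises if a (name, surname) pair repeats in df2,
# which would otherwise silently duplicate rows of df1.
result = df1.merge(df2, on=["name", "surname"], how="left", validate="many_to_one")

# Optional: nullable integers, so unmatched rows show <NA> instead of NaN floats.
result["account_id"] = result["account_id"].astype("Int64")
print(result)
```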

How to combine columns from two CSV files in Excel?

Quick question: I have a table like this, with each player's name and the percentage of matches won:
Rank  Country  Name                Matches Won %
1     ESP      Rafael Nadal        89.06%
2     SRB      Novak Djokovic      83.82%
3     SUI      Roger Federer       83.61%
4     RUS      Daniil Medvedev     73.75%
5     AUT      Dominic Thiem       72.73%
6     GRE      Stefanos Tsitsipas  67.95%
7     JPN      Kei Nishikori       67.44%
and I have another table like this, with ace percentages:
Rank  Country  Name                Ace %
1     USA      John Isner          26.97%
2     CRO      Ivo Karlovic        25.47%
3     USA      Reilly Opelka       24.81%
4     CAN      Milos Raonic        24.63%
5     USA      Sam Querrey         20.75%
6     AUS      Nick Kyrgios        20.73%
7     RSA      Kevin Anderson      17.82%
8     KAZ      Alexander Bublik    17.06%
9     FRA      Jo Wilfried Tsonga  14.29%
---------------------------------------
85    ESP      RAFAEL NADAL        6.85%
My question: can I align the two tables? I want to keep my data ordered by matches won and add each player's ace percentage next to it, for example:
Rank Country Name Matches% Aces %
1 ESP RAFAEL NADAL 89.06% 6.85%
and likewise for all the players.

I agree with the comment above that it would be easiest to import both and then use XLOOKUP() to add the Aces % column to the first set of data. Look the value up by player name rather than by rank, because the two tables rank players differently (Nadal is 1st by matches won but 85th by ace %). If you import the first data set to Sheet1 and the second data set to Sheet2, and both have the name in Column C and the percentage in Column D, your XLOOKUP() in Sheet1 Column E would look something like:
XLOOKUP(C2, Sheet2!C:C, Sheet2!D:D)
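
If you would rather do the same lookup outside Excel, here is a rough pandas sketch. It uses only a few rows from the two tables in the question, and the helper column "key" is a name I made up. Names are compared case-insensitively because the second table writes "RAFAEL NADAL" in capitals.

```python
import pandas as pd

# A few rows from each table in the question; real data would come from the CSV files.
matches = pd.DataFrame({
    "Rank": [1, 2, 3],
    "Country": ["ESP", "SRB", "SUI"],
    "Name": ["Rafael Nadal", "Novak Djokovic", "Roger Federer"],
    "Matches Won %": ["89.06%", "83.82%", "83.61%"],
})
aces = pd.DataFrame({
    "Rank": [1, 2, 85],
    "Country": ["USA", "CRO", "ESP"],
    "Name": ["John Isner", "Ivo Karlovic", "RAFAEL NADAL"],
    "Ace %": ["26.97%", "25.47%", "6.85%"],
})

# Hypothetical helper column: a case-insensitive version of the player name.
matches["key"] = matches["Name"].str.casefold()
aces["key"] = aces["Name"].str.casefold()

# The left join keeps the matches-won order and adds Ace % where the player is found.
combined = (
    matches.merge(aces[["key", "Ace %"]], on="key", how="left")
           .drop(columns="key")
)
print(combined)
```

Players missing from the aces table (Djokovic and Federer in this cut-down sample) end up with NaN in the Ace % column.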

How to fill in between rows gap comparing with other dataframe using pandas?

I want to compare df1 with df2 and fill only the blanks, without overwriting other values. I have no idea how to achieve this without overwriting existing values or creating extra columns.
Can I do this by converting df2 into a dictionary and mapping it onto df1?
df1 = pd.DataFrame({'players name':['ram', 'john', 'ismael', 'sam', 'karan'],
'hobbies':['jog','','photos','','studying'],
'sports':['cricket', 'basketball', 'chess', 'kabadi', 'volleyball']})
df1:
players name hobbies sports
0 ram jog cricket
1 john basketball
2 ismael photos chess
3 sam kabadi
4 karan studying volleyball
And df2:
df2 = pd.DataFrame({'players name':['jagan', 'mohan', 'john', 'sam', 'karan'],
'hobbies':['riding', 'tv', 'sliding', 'jumping', 'studying']})
df2:
players name hobbies
0 jagan riding
1 mohan tv
2 john sliding
3 sam jumping
4 karan studying
I want output like this:
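
Try this, which fills only the blank hobbies and leaves existing values untouched:
# hobby from df2 for each player in df1 (NaN if the player is not in df2)
mapped = df1['players name'].map(df2.set_index('players name')['hobbies'])
# replace only the empty strings with the looked-up value
df1['hobbies'] = df1['hobbies'].mask(df1['hobbies'].eq(''), mapped)
df1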
Output:
players name hobbies sports
0 ram jog cricket
1 john sliding basketball
2 ismael photos chess
3 sam jumping kabadi
4 karan studying volleyball
If the blanks are NaN values rather than empty strings:
import numpy as np
df1 = pd.DataFrame({"players name":["ram","john","ismael","sam","karan"],
"hobbies":["jog",np.nan,"photos",np.nan,"studying"],
"sports":["cricket","basketball","chess","kabadi","volleyball"]})
then
dicts = df2.set_index("players name")['hobbies'].to_dict()
df1['hobbies'] = df1['hobbies'].fillna(df1['players name'].map(dicts))
output:
players name hobbies sports
0 ram jog cricket
1 john sliding basketball
2 ismael photos chess
3 sam jumping kabadi
4 karan studying volleyball
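
To check the "without overwriting" requirement, here is a small sketch that treats both empty strings and NaN as blanks. The function name fill_blank_hobbies is hypothetical. I have changed the sample data on purpose: sam's hobby is NaN, and karan's hobby in df2 is "reading", so it is visible that an existing value in df1 is not replaced.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "players name": ["ram", "john", "ismael", "sam", "karan"],
    "hobbies": ["jog", "", "photos", np.nan, "studying"],  # one "" and one NaN blank
    "sports": ["cricket", "basketball", "chess", "kabadi", "volleyball"],
})
df2 = pd.DataFrame({
    "players name": ["jagan", "mohan", "john", "sam", "karan"],
    # karan differs from df1 on purpose, to show nothing gets overwritten
    "hobbies": ["riding", "tv", "sliding", "jumping", "reading"],
})

def fill_blank_hobbies(target, source):
    """Return a copy of target with blank hobbies filled from source, by player name."""
    lookup = source.set_index("players name")["hobbies"]  # assumes unique names in source
    hobbies = target["hobbies"]
    blank = hobbies.fillna("").str.strip().eq("")          # True for "" and NaN
    filled = hobbies.mask(blank, target["players name"].map(lookup))
    return target.assign(hobbies=filled)

result = fill_blank_hobbies(df1, df2)
print(result)
```

A blank whose player does not appear in df2 ends up as NaN, since there is nothing to fill it with.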

How to create spark datasets from a file without using File reader

I have a data file with four sections: header data, summary data, detail data and footer data. Each section has a fixed number of columns, but the number of columns differs between sections. The sections are separated by two rows that contain only a single "#". Is there a way to avoid creating new files and use Spark's TSV (tab-separated format) reader, or any other module, to read the file into four datasets directly? If I read the file directly, I lose the extra columns in the later sections: Spark reads only as many columns as the first row of the file has.
#deptno dname location
10 Accounting New York
20 Research Dallas
30 Sales Chicago
40 Operations Boston
#
#
#grade losal hisal
1 700.00 1200.00
2 1201.00 1400.00
4 2001.00 3000.00
5 3001.00 99999.00
3 1401.00 2000.00
#
#
#ENAME DNAME JOB EMPNO HIREDATE LOC
ADAMS RESEARCH CLERK 7876 23-MAY-87 DALLAS
ALLEN SALES SALESMAN 7499 20-FEB-81 CHICAGO
BLAKE SALES MANAGER 7698 01-MAY-81 CHICAGO
CLARK ACCOUNTING MANAGER 7782 09-JUN-81 NEW YORK
FORD RESEARCH ANALYST 7902 03-DEC-81 DALLAS
JAMES SALES CLERK 7900 03-DEC-81 CHICAGO
JONES RESEARCH MANAGER 7566 02-APR-81 DALLAS
#
#
#Name Age Address
Paul 23 1115 W Franklin
Bessy the Cow 5 Big Farm Way
Zeke 45 W Main St
Output:
Dataset d1 :
#deptno dname location
10 Accounting New York
20 Research Dallas
30 Sales Chicago
40 Operations Boston
Dataset d2 :
#grade losal hisal
1 700.00 1200.00
2 1201.00 1400.00
4 2001.00 3000.00
5 3001.00 99999.00
3 1401.00 2000.00
Dataset d3 :
#ENAME DNAME JOB EMPNO HIREDATE LOC
ADAMS RESEARCH CLERK 7876 23-MAY-87 DALLAS
ALLEN SALES SALESMAN 7499 20-FEB-81 CHICAGO
BLAKE SALES MANAGER 7698 01-MAY-81 CHICAGO
CLARK ACCOUNTING MANAGER 7782 09-JUN-81 NEW YORK
FORD RESEARCH ANALYST 7902 03-DEC-81 DALLAS
JAMES SALES CLERK 7900 03-DEC-81 CHICAGO
JONES RESEARCH MANAGER 7566 02-APR-81 DALLAS
Dataset d4 :
#Name Age Address
Paul 23 1115 W Franklin
Bessy the Cow 5 Big Farm Way
Zeke 45 W Main St
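
This is not a Spark answer, just a plain-Python sketch of the splitting step. It assumes the file is small enough to read on the driver, the fields are tab-separated, and a line containing only "#" is a divider. The function name split_sections is made up for illustration. Each (header, rows) pair it returns could then be handed to Spark separately, for example through PySpark's spark.createDataFrame(rows, header), so that every section keeps its own columns.

```python
def split_sections(lines, sep="\t"):
    """Split lines into (header, rows) pairs; a line that is just '#' ends a section."""
    sections = []
    current = None  # the (header, rows) pair being filled, or None between sections
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "#":      # divider row: close the current section
            current = None
            continue
        if not line.strip():         # ignore empty lines
            continue
        if current is None:          # first line of a section is its '#'-prefixed header
            header = line.lstrip("#").split(sep)
            current = (header, [])
            sections.append(current)
        else:
            current[1].append(line.split(sep))
    return sections


# Small tab-separated sample shaped like the first two sections in the question.
sample = (
    "#deptno\tdname\tlocation\n"
    "10\tAccounting\tNew York\n"
    "20\tResearch\tDallas\n"
    "#\n"
    "#\n"
    "#grade\tlosal\thisal\n"
    "1\t700.00\t1200.00\n"
)
sections = split_sections(sample.splitlines())
for header, rows in sections:
    print(header, len(rows))
```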
