Matching two columns with the same row values in a csv file - python-3.x

I have a csv file with 4 columns:
Name Dept Email Name Hair Color
John Smith candy Lincoln Tun brown
Diana Princ candy John Smith gold
Perry Plat wood Oliver Twist bald
Jerry Springer clothes Diana Princ gold
Calvin Klein clothes
Lincoln Tun warehouse
Oliver Twist kitchen
I want to match the columns Name and Email Name by names.
This is what the final output should look like:
Name Dept Email Name Hair Color
John Smith candy John Smith gold
Diana Princ candy Diana Princ gold
Perry Plat wood
Jerry Springer clothes
Calvin Klein clothes
Lincoln Tun warehouse Lincoln Tun brown
Oliver Twist kitchen Oliver Twist bald
I tried something like this in my code:
dfs = np.split(df,len(df.columns), axis=1)
dfs = [df.set_index(df.columns[0], drop=False) for df in dfs]
f=dfs[0].join(dfs[1:]).reset_index(drop=True).fillna(0)
This matched up my two name columns, but turned everything else into 0s:
Name Dept Email Name Hair Color
John Smith 0 John Smith 0
Diana Princ 0 Diana Princ 0
Perry Plat 0 0 0
Jerry Springer 0 0 0
Calvin Klein 0 0 0
Lincoln Tun 0 Lincoln Tun 0
Oliver Twist 0 Oliver Twist 0
Here is my code so far:
import pandas as pd
import numpy as np
import os, csv, sys
csvPath = 'User.csv'
df= pd.read_csv(csvPath)
dfs = np.split(df,len(df.columns), axis=1)
dfs = [df.set_index(df.columns[0], drop=False) for df in dfs]
f=dfs[0].join(dfs[1:]).reset_index(drop=True).fillna(0)
testCSV = 'test_user.csv' #to check my csv file
f.to_csv(testCSV, encoding='utf-8') #send it to csv

You could use merge for that:
pd.merge(df[['Name','Dept']],df[['Email Name','Hair Color']], left_on='Name', right_on='Email Name', how='left')
Result
Name Dept Email Name Hair Color
0 John Smith candy John Smith gold
1 Diana Princ candy Diana Princ gold
2 Perry Plat wood NaN NaN
3 Jerry Springer clothes NaN NaN
4 Calvin Klein clothes NaN NaN
5 Lincoln Tun warehouse Lincoln Tun brown
6 Oliver Twist kitchen Oliver Twist bald
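For reference, here is a minimal self-contained sketch of the same merge. The DataFrame is typed in by hand to mirror the question (an assumption; the real data comes from User.csv), and the dropna and fillna steps are optional additions that are not part of the answer above.

```python
import pandas as pd

# Sample data mirroring the question. The right-hand pair of columns is
# shorter, so it is padded with None (read_csv would give NaN here).
df = pd.DataFrame({
    "Name": ["John Smith", "Diana Princ", "Perry Plat", "Jerry Springer",
             "Calvin Klein", "Lincoln Tun", "Oliver Twist"],
    "Dept": ["candy", "candy", "wood", "clothes", "clothes",
             "warehouse", "kitchen"],
    "Email Name": ["Lincoln Tun", "John Smith", "Oliver Twist", "Diana Princ",
                   None, None, None],
    "Hair Color": ["brown", "gold", "bald", "gold", None, None, None],
})

left = df[["Name", "Dept"]]
# Precaution: drop the empty padding rows so they cannot take part in the match.
right = df[["Email Name", "Hair Color"]].dropna(subset=["Email Name"])

# how="left" keeps every Name row, in its original order.
matched = left.merge(right, left_on="Name", right_on="Email Name", how="left")

# Replace NaN with empty strings so unmatched cells show as blanks.
matched = matched.fillna("")
print(matched)
```

If the blanks are only needed in the output file, the fillna step can probably be skipped, since to_csv writes NaN as an empty field by default.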

Related

Inserting multiple columns from one csv file into another and outputting as a new csv file

I want to append columns from one csv file into another csv file.
I have two csv files with the below data:
First file: names_dept.csv
Name Dept Area
John Smith candy 5
Diana Princ candy 5
Tyler Perry candy 5
Perry Plat wood 3
Jerry Springer clothes 2
Calvin Klein clothes 2
Mary Poppins clothes 2
Ivan Evans clothes 2
Lincoln Tun warehouse 7
Oliver Twist kitchen 6
Herman Sherman kitchen 6
Second file: name_subject.csv
Who Subject
Perry Plat EMAIL RECEIVED
Mary Poppins EMAIL RECEIVED
Ivan Evans EMAIL RECEIVED
Lincoln Tun EMAIL RECEIVED
Oliver Twist EMAIL RECEIVED
This is what I want my final output to look like:
Output file: output.csv
Name Dept Area Who Subject
John Smith candy 5 Perry Plat EMAIL RECEIVED
Diana Princ candy 5 Mary Poppins EMAIL RECEIVED
Tyler Perry candy 5 Ivan Evans EMAIL RECEIVED
Perry Plat wood 3 Lincoln Tun EMAIL RECEIVED
Jerry Springer clothes 2 Oliver Twist EMAIL RECEIVED
Calvin Klein clothes 2
Mary Poppins clothes 2
Ivan Evans clothes 2
Lincoln Tun warehouse 7
Oliver Twist kitchen 6
Herman Sherman kitchen 6
My code so far is:
import pandas as pd
import os, csv, sys
namedept_path = 'names_dept.csv'
namesubject_path = 'name_subject.csv'
output_path = 'output.csv'
df1 = pd.read_csv(namedept_path)
df2 = pd.read_csv(namesubject_path)
#this was my attempt
output = df1 ['Who'] = df2 ['Who']
output = df1 ['Subject'] = df2 ['Subject']
output.to_csv(output_path , index=False)
I get the error: TypeError: string indices must be integers as the columns do contain strings.
I also tried:
with open(namedept_path, 'r') as name, open(namesubject_path, 'r') as email, \
open(output_path, 'w') as result:
name_reader = csv.reader(name)
email_reader = csv.reader(email)
result = csv.writer(result, lineterminator='\n')
result.writerows(x + y for x, y in zip(name_reader , email_reader))
Almost what I needed, but the output ended up looking something like this instead:
Name Dept Area Who Subject
John Smith candy 5 Perry Plat EMAIL RECEIVED
Diana Princ candy 5 Mary Poppins EMAIL RECEIVED
Tyler Perry candy 5 Ivan Evans EMAIL RECEIVED
Perry Plat wood 3 Lincoln Tun EMAIL RECEIVED
Jerry Springer clothes 2 Oliver Twist EMAIL RECEIVED
You can try pd.concat on columns
out = pd.concat([df1, df2], axis=1)
print(out)
Name Dept Area Who Subject
0 John Smith candy 5 Perry Plat EMAIL RECEIVED
1 Diana Princ candy 5 Mary Poppins EMAIL RECEIVED
2 Tyler Perry candy 5 Ivan Evans EMAIL RECEIVED
3 Perry Plat wood 3 Lincoln Tun EMAIL RECEIVED
4 Jerry Springer clothes 2 Oliver Twist EMAIL RECEIVED
5 Calvin Klein clothes 2 NaN NaN
6 Mary Poppins clothes 2 NaN NaN
7 Ivan Evans clothes 2 NaN NaN
8 Lincoln Tun warehouse 7 NaN NaN
9 Oliver Twist kitchen 6 NaN NaN
10 Herman Sherman kitchen 6 NaN NaN
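A short end-to-end sketch of the concat approach, including the write back to CSV. The two files are replaced by shortened in-memory stand-ins (an assumption, so the example runs without files on disk).

```python
import io
import pandas as pd

# In-memory stand-ins for names_dept.csv and name_subject.csv (shortened).
names_dept = io.StringIO(
    "Name,Dept,Area\n"
    "John Smith,candy,5\n"
    "Diana Princ,candy,5\n"
    "Perry Plat,wood,3\n"
)
name_subject = io.StringIO(
    "Who,Subject\n"
    "Perry Plat,EMAIL RECEIVED\n"
)

df1 = pd.read_csv(names_dept)
df2 = pd.read_csv(name_subject)

# axis=1 places the frames side by side, aligned on the row index.
out = pd.concat([df1, df2], axis=1)

# to_csv writes NaN as an empty field by default, so rows missing from the
# shorter file come out blank rather than as the text "NaN".
csv_text = out.to_csv(index=False)
print(csv_text)
```

The csv.reader attempt in the question stopped after five rows because zip ends at the shortest input; itertools.zip_longest is the standard-library alternative if you want to stay with the csv module.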

Join on a second column if there is not a match on the first column of a pandas dataframe

I need to be able to match on a second column if there is not a match on the first column of a pandas dataframe (Python 3.x).
Ex.
table_df = pd.DataFrame ( {
'Name': ['James','Tim','John','Emily'],
'NickName': ['Jamie','','','Em'],
'Colour': ['Blue','Black','Red','Purple']
})
lookup_df = pd.DataFrame ( {
'Name': ['Tim','John','Em','Jamie'],
'Pet': ['Cat','Dog','Fox','Dog']
})
table_df
Name NickName Colour
0 James Jamie Blue
1 Tim Black
2 John Red
3 Emily Em Purple
lookup_df
Name Pet
0 Tim Cat
1 John Dog
2 Em Fox
3 Jamie Dog
The result I need:
Name NickName Colour Pet
0 James Jamie Blue Dog
1 Tim Black Cat
2 John Red Dog
3 Emily Em Purple Fox
That is, match on the Name column and, if there is no match, match on the NickName column.
I tried many different things, including:
pd.merge(table_df,lookup_df, how='left', left_on='Name', right_on='Name')
if Nan -> pd.merge(table_df,lookup_df, how='left', left_on='NickName', right_on='Name')
but it does not do what I need and I want to avoid having a nested loop.
Has anyone an idea on how to do this? Any feedback is really appreciated.
Thanks!
You can map on Name and fillna on NickName:
s = lookup_df.set_index("Name")["Pet"]
table_df["pet"] = table_df["Name"].map(s).fillna(table_df["NickName"].map(s))
print (table_df)
Name NickName Colour pet
0 James Jamie Blue Dog
1 Tim Black Cat
2 John Red Dog
3 Emily Em Purple Fox
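The same idea as a self-contained sketch with the intermediate steps spelled out. The drop_duplicates call is an added assumption (first entry wins), since mapping through a Series with duplicate index values can fail; the column is named Pet here to match the desired result.

```python
import pandas as pd

table_df = pd.DataFrame({
    "Name": ["James", "Tim", "John", "Emily"],
    "NickName": ["Jamie", "", "", "Em"],
    "Colour": ["Blue", "Black", "Red", "Purple"],
})
lookup_df = pd.DataFrame({
    "Name": ["Tim", "John", "Em", "Jamie"],
    "Pet": ["Cat", "Dog", "Fox", "Dog"],
})

# Series indexed by name: behaves like a name -> pet lookup table.
# Assumption: if a name appeared twice, the first entry would be kept.
s = lookup_df.drop_duplicates("Name").set_index("Name")["Pet"]

by_name = table_df["Name"].map(s)      # NaN where Name is not in the lookup
by_nick = table_df["NickName"].map(s)  # NaN where NickName is not in the lookup

# fillna with a Series fills row by row, only where by_name is NaN.
table_df["Pet"] = by_name.fillna(by_nick)
print(table_df)
```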

Code to detect Sunday to Saturday date windows and modify Dataframe

I'm trying to write code that takes a table of date windows and modifies them to fit a Sun-Sat template.
I have the data saved as follows:
Index Name: From: To:
1 Joe Doe 6/1/2020 6/8/2020
2 Joe Doe 6/14/2020 6/23/2020
3 Brandon Smith 5/9/2020 5/20/2020
4 Brandon Smith 5/26/2020 5/28/2020
5 Brandon Smith 5/12/2020 5/24/2020
6 Brandon Smith 5/26/2020 5/31/2020
7 Sarah Roberts 6/3/2020 6/25/2020
8 Sarah Roberts 6/15/2020 6/23/2020
I would like to create additional From: and To: columns that capture only windows of 7, 14, 21... days running from a Sunday to a Saturday.
For example: Index 1 would not apply, index 2 would get transformed from the 14th to the 20th, and so forth.
The resulting table that I was hoping to get would look like this:
Index Name: From: To: From_new: To_new
1 Joe Doe 6/1/2020 6/8/2020 NA NA
2 Joe Doe 6/14/2020 6/23/2020 6/14/2020 6/20/2020
3 Brandon Smith 5/9/2020 5/20/2020 5/10/2020 5/16/2020
4 Brandon Smith 5/26/2020 5/28/2020 NA NA
5 Brandon Smith 5/12/2020 5/24/2020 5/17/2020 5/23/2020
6 Brandon Smith 5/26/2020 5/31/2020 NA NA
7 Sarah Roberts 6/3/2020 6/25/2020 6/7/2020 6/20/2020
8 Sarah Roberts 6/15/2020 6/23/2020 NA NA
I've tried to loop through each record and look at the start week day, if it's Sunday then run to the next Saturday, but then I get confused if it runs for another whole week after that, or if it's not Sunday to begin with.
Thanks in advance.
You don't need a loop. The solution was in this SO post. All credits should go to @ifly6. :)
Having said that, this should work for you:
df['From_new'] = df['From:'] + pd.offsets.Week(weekday=6)
df.loc[df['From:'].dt.weekday == 6, 'From_new'] = df.loc[df['From:'].dt.weekday == 6, 'From:']
df['To_new'] = df['To:'] - pd.offsets.Week(weekday=5)
df.loc[df['To:'].dt.weekday == 5, 'To_new'] = df.loc[df['To:'].dt.weekday == 5, 'To:']
df.loc[df['To_new'] < df['From_new'], 'From_new'] = pd.NaT
df.loc[df['From_new'].isna(), 'To_new'] = pd.NaT
Output:
Index Name: From: To: From_new To_new
1 Joe Doe 2020-06-01 2020-06-08 NaT NaT
2 Joe Doe 2020-06-14 2020-06-23 2020-06-14 2020-06-20
3 Brandon Smith 2020-05-09 2020-05-20 2020-05-10 2020-05-16
4 Brandon Smith 2020-05-26 2020-05-28 NaT NaT
5 Brandon Smith 2020-05-12 2020-05-24 2020-05-17 2020-05-23
6 Brandon Smith 2020-05-26 2020-05-31 NaT NaT
7 Sarah Roberts 2020-06-03 2020-06-25 2020-06-07 2020-06-20
8 Sarah Roberts 2020-06-15 2020-06-23 NaT NaT
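As a cross-check, here is a sketch of an alternative way to get the same windows using weekday arithmetic instead of pd.offsets.Week. This is a swapped-in technique, not the answer's method. It assumes the columns start out as m/d/Y strings, as in the question, and uses only the first few rows as sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "Name:": ["Joe Doe", "Joe Doe", "Brandon Smith", "Brandon Smith"],
    "From:": ["6/1/2020", "6/14/2020", "5/9/2020", "5/12/2020"],
    "To:": ["6/8/2020", "6/23/2020", "5/20/2020", "5/24/2020"],
})

# The .dt accessor needs datetime64 columns, not strings.
for col in ["From:", "To:"]:
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y")

# dt.weekday: Monday is 0, Saturday is 5, Sunday is 6.
# Days forward to the next Sunday; 0 if the date is already a Sunday.
days_to_sunday = (6 - df["From:"].dt.weekday) % 7
# Days back to the previous Saturday; 0 if the date is already a Saturday.
days_since_saturday = (df["To:"].dt.weekday - 5) % 7

from_new = df["From:"] + pd.to_timedelta(days_to_sunday, unit="D")
to_new = df["To:"] - pd.to_timedelta(days_since_saturday, unit="D")

# If the Saturday falls before the Sunday, no full week fits in the window.
valid = to_new >= from_new
df["From_new"] = from_new.where(valid)  # NaT where no full week fits
df["To_new"] = to_new.where(valid)
print(df)
```

The modulo handles the "already on a Sunday" and "already on a Saturday" cases without a separate correction step, which is the main reason to consider this variant.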

How to merge two data frames with duplicate rows?

I have two data frames, df1 and df2. df1 has repeated values in the name column, each with a different value in the hobby column. df2 also has repeated values in the name column. I want to merge both data frames and keep everything.
df1:
name hobby
mike cricket
mike football
jack chess
jack football
jack vollyball
pieter sleeping
pieter cyclying
my df2 is
df2:
name
mike
pieter
jack
mike
pieter
Now I have to merge df2 with df1 on name column
So my resultant df3 should look like this:
df3:
name hobby
mike cricket
mike football
pieter sleeping
pieter cyclying
jack chess
jack football
jack vollyball
mike cricket
mike football
pieter sleeping
pieter cyclying
You want to assign an order to df2, merge on name, then sort by that order (np here is NumPy, imported as np):
(df2.assign(rank=np.arange(len(df2)))
.merge(df1, on='name')
.sort_values('rank')
.drop('rank', axis=1)
)
Output:
name hobby
0 mike cricket
1 mike football
4 pieter sleeping
5 pieter cyclying
8 jack chess
9 jack football
10 jack vollyball
2 mike cricket
3 mike football
6 pieter sleeping
7 pieter cyclying
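A runnable version of the same pipeline with the imports and sample frames included. The stable sort and the reset_index call are small additions, assumed here to be what you want if the row order within each name and a clean index matter.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "name": ["mike", "mike", "jack", "jack", "jack", "pieter", "pieter"],
    "hobby": ["cricket", "football", "chess", "football", "vollyball",
              "sleeping", "cyclying"],
})
df2 = pd.DataFrame({"name": ["mike", "pieter", "jack", "mike", "pieter"]})

df3 = (
    df2.assign(rank=np.arange(len(df2)))    # remember each df2 row's position
       .merge(df1, on="name")               # one row per (df2 row, hobby) pair
       .sort_values("rank", kind="stable")  # back to df2's order; ties keep merge order
       .drop(columns="rank")
       .reset_index(drop=True)
)
print(df3)
```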

Combine 2 different sheets with same data in Excel

I have the same data from different sources, both incomplete, but combined they may be less incomplete.
I have 2 files:
File #1 has: ID, Zipcode, YoB, Gender
File #2 has: Email, ID, Zipcode, YoB, Gender
The IDs in both files are the same, but #1 has some IDs that #2 doesn't, and the other way around.
The Email is connected to the ID. IDs are linked to the Zipcode, YoB and Gender. In both files some of that info is missing. E.g. File #1 and #2 both have ID 1234, but #1 has only a zipcode and YoB and no Gender, while #2 has the zipcode and Gender but no YoB.
I want to have all the information in one file;
Email, ID, YoB, Zipcode, Gender
I tried to sort both sets of IDs alphabetically, put them next to each other and search for duplicates, but because #1 has some IDs that #2 doesn't, I'm not able to combine them...
What's the best way to fix this?
By the way, it's about 12000 IDs from #1 and 9500 from #2.
If you want a list of all the unique IDs then you could create a new sheet, copy both lots of IDs into the same column and then use Advanced Filter to copy Unique records only to another column.
Then use that column to do vlookups from the two files in the columns you require.
(I'm presuming this is a one-time job and you don't mind a bit of manual-ness)...
If on your first Sheet ("Sheet1") you have:
ID F_Name S_Name Age Favourite Cheese
1 Bob Smith 25 Brie
2 Fred Jones 29 Cheddar
3 Jeff Brown 18 Edam
4 Alice Smith 39 Mozzarella
5 Mark Jones 65 Cheddar
7 Sarah Smith 29 Mozzarella
8 Nick Jones 40 Brie
10 Betty Thompson 34 Edam
and on your second Sheet ("Sheet2") you have:
ID F_Name S_Name Age
1 Bob Smith 25
3 Jeff Brown 18
4 Alice Smith 39
5 Mark Jones 65
6 Frank Brown 44
7 Sarah Smith 29
9 Tom Brown 28
10 Betty Thompson 34
Then if you're combining them on a 3rd Sheet you need to do something like:
=IFERROR(VLOOKUP($A2,Sheet1!$A$1:$E$9,COLUMN(),FALSE),VLOOKUP($A2,Sheet2!$A$1:$E$9,COLUMN(),FALSE))
If you're trying to get to:
ID F_Name S_Name Age Favourite Cheese
1 Bob Smith 25 Brie
2 Fred Jones 29 Cheddar
3 Jeff Brown 18 Edam
4 Alice Smith 39 Mozzarella
5 Mark Jones 65 Cheddar
6 Frank Brown 44 0
7 Sarah Smith 29 Mozzarella
8 Nick Jones 40 Brie
9 Tom Brown 28 0
10 Betty Thompson 34 Edam
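If doing this outside Excel is an option, the same "take the value from the first source, fall back to the second" logic can be sketched in pandas with combine_first. The two frames below are hypothetical miniature versions of File #1 and File #2 with made-up values; in practice they would be loaded with pd.read_excel or pd.read_csv.

```python
import pandas as pd

# Hypothetical miniature versions of File #1 and File #2.
file1 = pd.DataFrame({
    "ID": [1234, 2222],
    "Zipcode": ["1011AB", "2022CD"],
    "YoB": [1980, 1975],
    "Gender": [None, "F"],
})
file2 = pd.DataFrame({
    "Email": ["a@example.com", "c@example.com"],
    "ID": [1234, 3333],
    "Zipcode": ["1011AB", None],
    "YoB": [None, 1990],
    "Gender": ["M", "M"],
})

# Index both frames by ID so rows line up by ID rather than by position.
a = file1.set_index("ID")
b = file2.set_index("ID")

# combine_first keeps a's values and fills its gaps from b; IDs and columns
# that exist only in b are added as well.
combined = a.combine_first(b).reset_index()
combined = combined[["Email", "ID", "YoB", "Zipcode", "Gender"]]
print(combined)
```

IDs that appear in only one file are kept, and any cell that neither file fills stays empty (NaN).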
