Inserting multiple columns from one csv file into another and outputting as a new csv file - python-3.x

I want to append columns from one csv file into another csv file.
I have two csv files with the below data:
First file: names_dept.csv
Name Dept Area
John Smith candy 5
Diana Princ candy 5
Tyler Perry candy 5
Perry Plat wood 3
Jerry Springer clothes 2
Calvin Klein clothes 2
Mary Poppins clothes 2
Ivan Evans clothes 2
Lincoln Tun warehouse 7
Oliver Twist kitchen 6
Herman Sherman kitchen 6
Second file: name_subject.csv
Who Subject
Perry Plat EMAIL RECEIVED
Mary Poppins EMAIL RECEIVED
Ivan Evans EMAIL RECEIVED
Lincoln Tun EMAIL RECEIVED
Oliver Twist EMAIL RECEIVED
This is what I want my final output to look like:
Output file: output.csv
Name Dept Area Who Subject
John Smith candy 5 Perry Plat EMAIL RECEIVED
Diana Princ candy 5 Mary Poppins EMAIL RECEIVED
Tyler Perry candy 5 Ivan Evans EMAIL RECEIVED
Perry Plat wood 3 Lincoln Tun EMAIL RECEIVED
Jerry Springer clothes 2 Oliver Twist EMAIL RECEIVED
Calvin Klein clothes 2
Mary Poppins clothes 2
Ivan Evans clothes 2
Lincoln Tun warehouse 7
Oliver Twist kitchen 6
Herman Sherman kitchen 6
My code so far is:
import pandas as pd
import os, csv, sys
namedept_path = 'names_dept.csv'
namesubject_path = 'name_subject.csv'
output_path = 'output.csv'
df1 = pd.read_csv(namedept_path)
df2 = pd.read_csv(namesubject_path)
#this was my attempt
output = df1 ['Who'] = df2 ['Who']
output = df1 ['Subject'] = df2 ['Subject']
output.to_csv(output_path , index=False)
I get the error: TypeError: string indices must be integers as the columns do contain strings.
I also tried:
with open(namedept_path, 'r') as name, open(namesubject_path, 'r') as email, \
        open(output_path, 'w') as result:
    name_reader = csv.reader(name)
    email_reader = csv.reader(email)
    result = csv.writer(result, lineterminator='\n')
    result.writerows(x + y for x, y in zip(name_reader , email_reader))
Almost what I needed, but the output ended up looking something like this instead:
Name Dept Area Who Subject
John Smith candy 5 Perry Plat EMAIL RECEIVED
Diana Princ candy 5 Mary Poppins EMAIL RECEIVED
Tyler Perry candy 5 Ivan Evans EMAIL RECEIVED
Perry Plat wood 3 Lincoln Tun EMAIL RECEIVED
Jerry Springer clothes 2 Oliver Twist EMAIL RECEIVED

You can try pd.concat on columns
out = pd.concat([df1, df2], axis=1)
print(out)
Name Dept Area Who Subject
0 John Smith candy 5 Perry Plat EMAIL RECEIVED
1 Diana Princ candy 5 Mary Poppins EMAIL RECEIVED
2 Tyler Perry candy 5 Ivan Evans EMAIL RECEIVED
3 Perry Plat wood 3 Lincoln Tun EMAIL RECEIVED
4 Jerry Springer clothes 2 Oliver Twist EMAIL RECEIVED
5 Calvin Klein clothes 2 NaN NaN
6 Mary Poppins clothes 2 NaN NaN
7 Ivan Evans clothes 2 NaN NaN
8 Lincoln Tun warehouse 7 NaN NaN
9 Oliver Twist kitchen 6 NaN NaN
10 Herman Sherman kitchen 6 NaN NaN
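To get from that DataFrame to the requested output.csv, something along these lines might work. It is only a sketch: it assumes the files are comma-separated (the question shows them space-aligned) and uses in-memory strings with a few rows of the sample data instead of the real paths. The second half is a csv-module variant that swaps zip for itertools.zip_longest, since zip stops at the shorter file, which is why the question's second attempt lost the last rows.

```python
import csv
import io
from itertools import zip_longest

import pandas as pd

# Assumption: comma-separated content; a subset of the question's sample rows.
names_dept_csv = """Name,Dept,Area
John Smith,candy,5
Diana Princ,candy,5
Perry Plat,wood,3
Jerry Springer,clothes,2
"""
name_subject_csv = """Who,Subject
Perry Plat,EMAIL RECEIVED
Mary Poppins,EMAIL RECEIVED
"""

# pandas route: concatenate side by side; to_csv writes NaN as empty fields.
df1 = pd.read_csv(io.StringIO(names_dept_csv))
df2 = pd.read_csv(io.StringIO(name_subject_csv))
out = pd.concat([df1, df2], axis=1)
pandas_text = out.to_csv(index=False)  # pass a path such as 'output.csv' to write a file

# csv-module route: zip_longest pads the shorter file instead of truncating.
# The padding width assumes the second file is the shorter one, as in the question.
rows1 = list(csv.reader(io.StringIO(names_dept_csv)))
rows2 = list(csv.reader(io.StringIO(name_subject_csv)))
padding = [""] * len(rows2[0])
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(x + y for x, y in zip_longest(rows1, rows2, fillvalue=padding))
csv_text = buffer.getvalue()
```

Both routes leave the Who and Subject fields empty for the rows that have no partner in the second file.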

Related

Matching two columns with the same row values in a csv file

I have a csv file with 4 columns:
Name Dept Email Name Hair Color
John Smith candy Lincoln Tun brown
Diana Princ candy John Smith gold
Perry Plat wood Oliver Twist bald
Jerry Springer clothes Diana Princ gold
Calvin Klein clothes
Lincoln Tun warehouse
Oliver Twist kitchen
I want to match the columns Name and Email Name by names.
This is what the final output should look like:
Name Dept Email Name Hair Color
John Smith candy John Smith gold
Diana Princ candy Diana Princ gold
Perry Plat wood
Jerry Springer clothes
Calvin Klein clothes
Lincoln Tun warehouse Lincoln Tun brown
Oliver Twist kitchen Oliver Twist bald
I tried something like this in my code:
dfs = np.split(df,len(df.columns), axis=1)
dfs = [df.set_index(df.columns[0], drop=False) for df in dfs]
f=dfs[0].join(dfs[1:]).reset_index(drop=True).fillna(0)
Which sorted my two columns great but made everything else 0's
Name Dept Email Name Hair Color
John Smith 0 John Smith 0
Diana Princ 0 Diana Princ 0
Perry Plat 0 0 0
Jerry Springer 0 0 0
Calvin Klein 0 0 0
Lincoln Tun 0 Lincoln Tun 0
Oliver Twist 0 Oliver Twist 0
Here is my code so far:
import pandas as pd
import numpy as np
import os, csv, sys
csvPath = 'User.csv'
df= pd.read_csv(csvPath)
dfs = np.split(df,len(df.columns), axis=1)
dfs = [df.set_index(df.columns[0], drop=False) for df in dfs]
f=dfs[0].join(dfs[1:]).reset_index(drop=True).fillna(0)
testCSV = 'test_user.csv' #to check my csv file
f.to_csv(testCSV, encoding='utf-8') #send it to csv
You could use merge for that:
pd.merge(df[['Name','Dept']],df[['Email Name','Hair Color']], left_on='Name', right_on='Email Name', how='left')
Result
Name Dept Email Name Hair Color
0 John Smith candy John Smith gold
1 Diana Princ candy Diana Princ gold
2 Perry Plat wood NaN NaN
3 Jerry Springer clothes NaN NaN
4 Calvin Klein clothes NaN NaN
5 Lincoln Tun warehouse Lincoln Tun brown
6 Oliver Twist kitchen Oliver Twist bald
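For reference, here is a self-contained sketch of the same merge, assuming User.csv is comma-separated with the four columns shown in the question (the data is inlined as a string here so it runs on its own). Dropping the rows with an empty Email Name on the right side is an extra precaution of mine, not part of the one-liner above.

```python
import io

import pandas as pd

# Assumption: comma-separated file; content mirrors the question's sample.
user_csv = """Name,Dept,Email Name,Hair Color
John Smith,candy,Lincoln Tun,brown
Diana Princ,candy,John Smith,gold
Perry Plat,wood,Oliver Twist,bald
Jerry Springer,clothes,Diana Princ,gold
Calvin Klein,clothes,,
Lincoln Tun,warehouse,,
Oliver Twist,kitchen,,
"""
df = pd.read_csv(io.StringIO(user_csv))

# Treat the two column pairs as separate tables and left-join them on the name.
left = df[["Name", "Dept"]]
right = df[["Email Name", "Hair Color"]].dropna(subset=["Email Name"])
matched = left.merge(right, left_on="Name", right_on="Email Name", how="left")

# NaN cells are written as empty fields, which matches the desired layout.
csv_text = matched.to_csv(index=False)
```

A left join keeps every Name row in its original order, so Dept is preserved instead of being overwritten by fillna(0).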

How to update a dataframe column from a second dataframe by matching on two columns, when the values can repeat in the first dataframe?

I have two dataframes with different information about a person. In the first dataframe, a person's name may repeat in different rows. I want to add to/update the first dataframe with data from the second dataframe wherever the two columns containing the person's data match in both. Here is an example of what I need to accomplish:
df1:
name surname
0 john doe
1 mary doe
2 peter someone
3 mary doe
4 john another
5 paul another
df2:
name surname account_id
0 peter someone 100
1 john doe 200
2 mary doe 300
3 john another 400
I need to accomplish this:
df1:
name surname account_id
0 john doe 200
1 mary doe 300
2 peter someone 100
3 mary doe 300
4 john another 400
5 paul another <empty>
Thanks!
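One way this could be approached is a left merge on both key columns. The sketch below rebuilds the two sample frames from the question; the drop_duplicates step and the Int64 conversion are my own additions, on the assumption that only the first match in df2 should count and that the ids should stay integers.

```python
import pandas as pd

df1 = pd.DataFrame({
    "name": ["john", "mary", "peter", "mary", "john", "paul"],
    "surname": ["doe", "doe", "someone", "doe", "another", "another"],
})
df2 = pd.DataFrame({
    "name": ["peter", "john", "mary", "john"],
    "surname": ["someone", "doe", "doe", "another"],
    "account_id": [100, 200, 300, 400],
})

# Keep only the first row per (name, surname) so the merge cannot duplicate df1 rows.
lookup = df2.drop_duplicates(subset=["name", "surname"], keep="first")

# A left merge keeps every df1 row, in order; unmatched rows get a missing value.
result = df1.merge(lookup, on=["name", "surname"], how="left")

# Nullable integer dtype, so 200 is not shown as 200.0 next to the missing entry.
result["account_id"] = result["account_id"].astype("Int64")
```

Here paul another has no match in df2, so his account_id stays missing.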

Code to detect Sunday to Saturday date windows and modify Dataframe

I'm trying to write code that takes a table of date windows and modifies them to fit a Sun-Sat template.
I have the data saved as follows:
Index Name: From: To:
1 Joe Doe 6/1/2020 6/8/2020
2 Joe Doe 6/14/2020 6/23/2020
3 Brandon Smith 5/9/2020 5/20/2020
4 Brandon Smith 5/26/2020 5/28/2020
5 Brandon Smith 5/12/2020 5/24/2020
6 Brandon Smith 5/26/2020 5/31/2020
7 Sarah Roberts 6/3/2020 6/25/2020
8 Sarah Roberts 6/15/2020 6/23/2020
I would like to create additional From: and To: columns that only capture windows of 7, 14, 21, ... days running from a Sunday to a Saturday.
For example: Index 1 would not apply, index 2 would get transformed from the 14th to the 20th, and so forth.
The resulting table that I was hoping to get would look like this:
Index Name: From: To: From_new: To_new
1 Joe Doe 6/1/2020 6/8/2020 NA NA
2 Joe Doe 6/14/2020 6/23/2020 6/14/2020 6/20/2020
3 Brandon Smith 5/9/2020 5/20/2020 5/10/2020 5/16/2020
4 Brandon Smith 5/26/2020 5/28/2020 NA NA
5 Brandon Smith 5/12/2020 5/24/2020 5/17/2020 5/23/2020
6 Brandon Smith 5/26/2020 5/31/2020 NA NA
7 Sarah Roberts 6/3/2020 6/25/2020 6/7/2020 6/20/2020
8 Sarah Roberts 6/15/2020 6/23/2020 NA NA
I've tried to loop through each record and look at the start week day, if it's Sunday then run to the next Saturday, but then I get confused if it runs for another whole week after that, or if it's not Sunday to begin with.
Thanks in advance.
You don't need a loop. The solution was in this SO post. All credit should go to @ifly6. :)
Having said that, this should work for you:
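# Both columns must be datetime64 for .dt and the offsets to work.
df['From:'] = pd.to_datetime(df['From:'])
df['To:'] = pd.to_datetime(df['To:'])
df['From_new'] = df['From:'] + pd.offsets.Week(weekday=6)  # roll forward to the next Sunday
df.loc[df['From:'].dt.weekday == 6, 'From_new'] = df.loc[df['From:'].dt.weekday == 6, 'From:']
df['To_new'] = df['To:'] - pd.offsets.Week(weekday=5)  # roll back to the previous Saturday
df.loc[df['To:'].dt.weekday == 5, 'To_new'] = df.loc[df['To:'].dt.weekday == 5, 'To:']
# If no full Sun-Sat week fits inside the window, blank out both new columns.
df.loc[df['To_new'] < df['From_new'], 'From_new'] = pd.NaT
df.loc[df['From_new'].isna(), 'To_new'] = pd.NaT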
Output:
Index Name: From: To: From_new To_new
1 Joe Doe 2020-06-01 2020-06-08 NaT NaT
2 Joe Doe 2020-06-14 2020-06-23 2020-06-14 2020-06-20
3 Brandon Smith 2020-05-09 2020-05-20 2020-05-10 2020-05-16
4 Brandon Smith 2020-05-26 2020-05-28 NaT NaT
5 Brandon Smith 2020-05-12 2020-05-24 2020-05-17 2020-05-23
6 Brandon Smith 2020-05-26 2020-05-31 NaT NaT
7 Sarah Roberts 2020-06-03 2020-06-25 2020-06-07 2020-06-20
8 Sarah Roberts 2020-06-15 2020-06-23 NaT NaT
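If you want something you can run end to end, here is a rough sketch that gets to the same table with plain weekday arithmetic instead of pd.offsets.Week. This is a swapped-in variant, not the code from the linked post; the function name sun_sat_window is made up for illustration, and the dates are assumed to be in month/day/year format as in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Name:": ["Joe Doe", "Joe Doe", "Brandon Smith", "Brandon Smith",
              "Brandon Smith", "Brandon Smith", "Sarah Roberts", "Sarah Roberts"],
    "From:": ["6/1/2020", "6/14/2020", "5/9/2020", "5/26/2020",
              "5/12/2020", "5/26/2020", "6/3/2020", "6/15/2020"],
    "To:": ["6/8/2020", "6/23/2020", "5/20/2020", "5/28/2020",
            "5/24/2020", "5/31/2020", "6/25/2020", "6/23/2020"],
})
df["From:"] = pd.to_datetime(df["From:"], format="%m/%d/%Y")
df["To:"] = pd.to_datetime(df["To:"], format="%m/%d/%Y")


def sun_sat_window(frame):
    """Hypothetical helper: trim each window to whole Sunday-Saturday weeks."""
    start, end = frame["From:"], frame["To:"]
    # pandas weekdays: Monday=0 ... Saturday=5, Sunday=6
    days_to_sunday = (6 - start.dt.weekday) % 7        # 0 if already a Sunday
    days_back_to_saturday = (end.dt.weekday - 5) % 7   # 0 if already a Saturday
    from_new = start + pd.to_timedelta(days_to_sunday, unit="D")
    to_new = end - pd.to_timedelta(days_back_to_saturday, unit="D")
    # A window is only valid if the Saturday falls after the Sunday.
    valid = to_new > from_new
    return frame.assign(From_new=from_new.where(valid), To_new=to_new.where(valid))


result = sun_sat_window(df)
```

The modulo takes care of the "already on a Sunday/Saturday" cases, so no separate correction step is needed.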

How to merge two data frames with duplicate rows?

I have two data frames, df1 and df2. df1 has repeated values in the name column, but the hobby column differs. df2 also has repeated values in the name column. I want to merge both data frames and keep everything.
df1:
name hobby
mike cricket
mike football
jack chess
jack football
jack vollyball
pieter sleeping
pieter cyclying
my df2 is
df2:
name
mike
pieter
jack
mike
pieter
Now I have to merge df2 with df1 on name column
So my resultant df3 should look like this:
df3:
name hobby
mike cricket
mike football
pieter sleeping
pieter cyclying
jack chess
jack football
jack vollyball
mike cricket
mike football
pieter sleeping
pieter cyclying
You want to assign an order for df2, merge on name, then sort by the said order:
import numpy as np

(df2.assign(rank=np.arange(len(df2)))
    .merge(df1, on='name')
    .sort_values('rank')
    .drop('rank', axis=1)
)
Output:
name hobby
0 mike cricket
1 mike football
4 pieter sleeping
5 pieter cyclying
8 jack chess
9 jack football
10 jack vollyball
2 mike cricket
3 mike football
6 pieter sleeping
7 pieter cyclying
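A runnable version of the same idea, with the frames from the question rebuilt inline, might look like this. The kind="stable" argument and the reset_index call are my additions: the stable sort is there so that hobbies keep their df1 order within each name, and the reset just gives a clean 0..n index.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "name": ["mike", "mike", "jack", "jack", "jack", "pieter", "pieter"],
    "hobby": ["cricket", "football", "chess", "football", "vollyball",
              "sleeping", "cyclying"],
})
df2 = pd.DataFrame({"name": ["mike", "pieter", "jack", "mike", "pieter"]})

df3 = (
    df2.assign(rank=np.arange(len(df2)))    # remember each df2 row's position
    .merge(df1, on="name")                  # one output row per matching hobby
    .sort_values("rank", kind="stable")     # restore df2's row order
    .drop(columns="rank")
    .reset_index(drop=True)
)
```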

Combine 2 different sheets with same data in Excel

I have the same data from different sources, both incomplete, but combined they may be less incomplete.
I have 2 files:
File #1 has: ID, Zipcode, YoB, Gender
File #2 has: Email, ID, Zipcode, YoB, Gender
The IDs in both files are the same, but #1 has some IDs that #2 doesn't, and the other way around.
The Email is connected to the ID. IDs are linked to the zipcode, YoB and gender. Some of that info is missing in both files. E.g. File #1 and #2 both have ID 1234, but #1 only has a zipcode and YoB with no Gender, and #2 has the zipcode and gender but no YoB.
I want to have all the information in one file:
Email, ID, YoB, Zipcode, Gender
I tried to sort both sets of IDs alphabetically, put them next to each other and search for duplicates, but because #1 has some IDs that #2 doesn't, I'm not able to combine them...
What's the best way to fix this?
By the way, it's about 12000 IDs from #1 and 9500 from #2.
If you want a list of all the unique IDs then you could create a new sheet, copy both lots of IDs into the same column and then use Advanced Filter to copy Unique records only to another column.
Then use that column to do vlookups from the two files in the columns you require.
(I'm presuming this is a one-time job and you don't mind a bit of manual-ness)...
If on your first Sheet ("Sheet1") you have:
ID F_Name S_Name Age Favourite Cheese
1 Bob Smith 25 Brie
2 Fred Jones 29 Cheddar
3 Jeff Brown 18 Edam
4 Alice Smith 39 Mozzarella
5 Mark Jones 65 Cheddar
7 Sarah Smith 29 Mozzarella
8 Nick Jones 40 Brie
10 Betty Thompson 34 Edam
and on your second Sheet ("Sheet2") you have:
ID F_Name S_Name Age
1 Bob Smith 25
3 Jeff Brown 18
4 Alice Smith 39
5 Mark Jones 65
6 Frank Brown 44
7 Sarah Smith 29
9 Tom Brown 28
10 Betty Thompson 34
Then if you're combining them on a 3rd Sheet you need to do something like:
=IFERROR(VLOOKUP($A2,Sheet1!$A$1:$E$9,COLUMN(),FALSE),VLOOKUP($A2,Sheet2!$A$1:$E$9,COLUMN(),FALSE))
If you're trying to get to:
ID F_Name S_Name Age Favourite Cheese
1 Bob Smith 25 Brie
2 Fred Jones 29 Cheddar
3 Jeff Brown 18 Edam
4 Alice Smith 39 Mozzarella
5 Mark Jones 65 Cheddar
6 Frank Brown 44 0
7 Sarah Smith 29 Mozzarella
8 Nick Jones 40 Brie
9 Tom Brown 28 0
10 Betty Thompson 34 Edam
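If doing this outside Excel is an option, the same "take the value from Sheet1, otherwise from Sheet2" logic could be sketched in pandas with combine_first. This is only an illustration on the two sample sheets above, typed in by hand; with real workbooks you would load them first (for example with pd.read_excel), and the variable names are made up.

```python
import pandas as pd

sheet1 = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 7, 8, 10],
    "F_Name": ["Bob", "Fred", "Jeff", "Alice", "Mark", "Sarah", "Nick", "Betty"],
    "S_Name": ["Smith", "Jones", "Brown", "Smith", "Jones", "Smith", "Jones", "Thompson"],
    "Age": [25, 29, 18, 39, 65, 29, 40, 34],
    "Favourite Cheese": ["Brie", "Cheddar", "Edam", "Mozzarella", "Cheddar",
                         "Mozzarella", "Brie", "Edam"],
}).set_index("ID")

sheet2 = pd.DataFrame({
    "ID": [1, 3, 4, 5, 6, 7, 9, 10],
    "F_Name": ["Bob", "Jeff", "Alice", "Mark", "Frank", "Sarah", "Tom", "Betty"],
    "S_Name": ["Smith", "Brown", "Smith", "Jones", "Brown", "Smith", "Brown", "Thompson"],
    "Age": [25, 18, 39, 65, 44, 29, 28, 34],
}).set_index("ID")

# combine_first keeps sheet1's values and fills any gaps (missing cells or
# missing IDs) from sheet2, aligning rows on the ID index.
combined = sheet1.combine_first(sheet2).sort_index().reset_index()

# Restore the original column order, since the union of columns may be reordered.
combined = combined[["ID", "F_Name", "S_Name", "Age", "Favourite Cheese"]]
```

Unlike the VLOOKUP formula, which shows 0 where Sheet2 has no cheese column, this leaves those cells empty (NaN).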
