Python - Cleaning US and Canadian Zip Codes with `df.loc` and `str` Methods - python-3.x

I have the following code to create a column with cleaned-up zip codes for the USA and Canada:
import pandas as pd

df = pd.read_csv(file1)
usa = df['Region'] == 'USA'
canada = df['Region'] == 'Canada'
df.loc[usa, 'ZipCleaned'] = df.loc[usa, 'Zip'].str.slice(stop=5)
# regex=True: ' |-' is a regex pattern; pandas 2.0+ treats patterns literally by default
df.loc[canada, 'ZipCleaned'] = df.loc[canada, 'Zip'].str.replace(' |-', '', regex=True)
The issue I am having is that some of the rows that have "USA" as the country actually contain Canadian postal codes, so the USA logic above is being applied to Canadian postal codes.
I tried the code above along with the masks below, experimenting with one province ("BC") to prevent the USA logic from being applied in this case, but it didn't work:
usa = df['Region'] == 'USA'
usa = df['State'] != 'BC'
Region  Date       State  Zip         Customer    Revenue
USA     1/3/2014   BC     A5Z 1B6     Customer A  $157.52
Canada  1/13/2014  AB     Z3J-4E5     Customer B  $750.00
USA     1/4/2014   FL     90210-9999  Customer C  $650.75
USA     1/21/2014  FL     12345       Customer D  $242.00
USA     1/25/2014  FL     45678       Customer E  $15.00
USA     1/28/2014  NY     91011       Customer F  $25.00

Thanks Kris. But what if I wanted to maintain the original values in the Region column and change 'ZipCleaned' based on whether Zip contains a Canadian or US zip? I tried the following, but it's not working:
usa = df.loc[df['Ship To Customer Zip'].str.contains('[0-9]')]
canada = df.loc[df['Ship To Customer Zip'].str.contains('[A-Za-z]')]
df.loc[usa, 'ZipCleaned'] = df.loc[usa, 'Ship To Customer Zip'].str.slice(stop=5)
df.loc[canada, 'ZipCleaned'] = df.loc[canada, 'Ship To Customer Zip'].str.replace(' |-','')

Give this a try:
# sample df provided by OP
>>> df
Region Date State Zip Customer Revenue
0 USA 2014-01-03 BC A5Z 1B6 Customer A 157.52
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750
2 USA 2014-01-04 FL 90210-999 Customer C 650.75
3 USA 2014-01-21 FL 12345 Customer D 242
4 USA 2014-01-25 FL 45678 Customer E 15
5 USA 2014-01-28 NY 91011 Customer F 25
# Edit 'Region' by testing 'Zip' for presence of letters (US Zip Codes are only numeric)
>>> df.loc[df['Zip'].str.contains('[A-Za-z]'), 'Region'] = 'Canada'
>>> df
Region Date State Zip Customer Revenue
0 Canada 2014-01-03 BC A5Z 1B6 Customer A 157.52
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750
2 USA 2014-01-04 FL 90210-999 Customer C 650.75
3 USA 2014-01-21 FL 12345 Customer D 242
4 USA 2014-01-25 FL 45678 Customer E 15
5 USA 2014-01-28 NY 91011 Customer F 25
# apply OP's original filtering and cleaning
>>> usa = df['Region'] == 'USA'
>>> canada = df['Region'] == 'Canada'
>>> df.loc[usa, 'ZipCleaned'] = df.loc[usa, 'Zip'].str.slice(stop=5)
>>> df.loc[canada, 'ZipCleaned'] = df.loc[canada, 'Zip'].str.replace(' |-', '', regex=True)
# display resultant df
>>> df
Region Date State Zip Customer Revenue ZipCleaned
0 Canada 2014-01-03 BC A5Z 1B6 Customer A 157.52 A5Z1B6
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750 Z3J4E5
2 USA 2014-01-04 FL 90210-999 Customer C 650.75 90210
3 USA 2014-01-21 FL 12345 Customer D 242 12345
4 USA 2014-01-25 FL 45678 Customer E 15 45678
5 USA 2014-01-28 NY 91011 Customer F 25 91011
EDIT: Update as requested by OP: we can do the following to leave the original 'Region' intact:
>>> df
Region Date State Zip Customer Revenue
0 USA 2014-01-03 BC A5Z 1B6 Customer A 157.52
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750
2 USA 2014-01-04 FL 90210-999 Customer C 650.75
3 USA 2014-01-21 FL 123456 Customer D 242
4 USA 2014-01-25 FL 45678 Customer E 15
5 USA 2014-01-28 NY 91011 Customer F 25
# create 'ZipCleaned' by referencing original 'Zip'
>>> df.loc[~df['Zip'].str.contains('[A-Za-z]'), 'ZipCleaned'] = df['Zip'].str.slice(stop=5)
>>> df.loc[df['Zip'].str.contains('[A-Za-z]'), 'ZipCleaned'] = df['Zip'].str.replace(' |-', '', regex=True)
# Resultant df
>>> df
Region Date State Zip Customer Revenue ZipCleaned
0 USA 2014-01-03 BC A5Z 1B6 Customer A 157.52 A5Z1B6
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750 Z3J4E5
2 USA 2014-01-04 FL 90210-999 Customer C 650.75 90210
3 USA 2014-01-21 FL 123456 Customer D 242 12345
4 USA 2014-01-25 FL 45678 Customer E 15 45678
5 USA 2014-01-28 NY 91011 Customer F 25 91011
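An equivalent single-pass variant (a sketch, assuming the same df): numpy.where applies the US or Canadian rule row by row from one letter test, so the two masked assignments collapse into one.

import numpy as np

# Canadian postal codes contain letters; US zip codes are numeric only
is_canadian = df['Zip'].str.contains('[A-Za-z]')

df['ZipCleaned'] = np.where(
    is_canadian,
    df['Zip'].str.replace(' |-', '', regex=True),  # Canada: drop spaces and hyphens
    df['Zip'].str.slice(stop=5),                   # USA: keep the first five characters
)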

Related

Excel cell lookup in subtotaled range

I'd like to use INDEX/MATCH to look up values in a subtotaled range. Using the sample data below, from another sheet (Sheet 2), I need to look up the total NY Company hours for each employee.
Sheet 2:
| Bob | NY Company | ???? |
This formula returns the first match of NY Company Total
=INDEX(Sheet1!A1:C45,MATCH(Sheet2!B2 & " Total",Sheet1!B1:B45,0),3)
Now I need to expand the lookup to include the employee (Bob). Also, column A is blank on the total rows. I've started working with something like the following, but no luck:
=INDEX(Sheet1!A1:C45,MATCH(1,(Sheet2!B2 & " Total"=Sheet1!B1:B45)*(Sheet2!B1=Sheet1!A1:A45)),3)
Also, as the sample data below looks perfect in the preview and then looks really bad after saving, I've added a pic with the sample data.
Sample data:
A         B                    C
Employee  Customer             Hours
Bob       ABC Company          5
Bob       ABC Company          3
          ABC Company Total    8
Bob       NY Company           7
Bob       NY Company           7
Bob       NY Company           5
Bob       NY Company           3
          NY Company Total     22
Bob       Jet Company          1
          Jet Company Total    1
Carrie    ABC Company          1
Carrie    ABC Company          4
          ABC Company Total    5
Carrie    NY Company           6
Carrie    NY Company           2
Carrie    NY Company           3
          NY Company Total     11
Carrie    Jet Company          7
Carrie    Jet Company          9
          Jet Company Total    16
Carrie    XYZ Company          4
          XYZ Company Total    4
Gale      Cats Service         2
Gale      Cats Service         6
Gale      Cats Service         1
          Cats Service Total   9
Gale      NY Company           6
Gale      NY Company           8
          NY Company Total     14
Gale      XYZ Company          1
          XYZ Company Total    1
John      NY Company           3
John      NY Company           5
          NY Company Total     8
John      XYZ Company          8
John      XYZ Company          5
          XYZ Company Total    13
Ken       ABC Company          10
          ABC Company Total    10
Ken       NY Company           2
Ken       NY Company           3
Ken       NY Company           5
          NY Company Total     10
          Grand Total          132
Any suggestions?
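One way to sidestep the blank Employee cells on the total rows is to skip the subtotals entirely and aggregate the detail rows with SUMIFS; a sketch, assuming the employee name sits in Sheet2!A2 and the customer in Sheet2!B2:
=SUMIFS(Sheet1!C1:C45,Sheet1!A1:A45,Sheet2!A2,Sheet1!B1:B45,Sheet2!B2)
For Bob / NY Company this sums the four detail rows (7+7+5+3) and returns 22, matching the NY Company Total row.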

How to merge data with duplicates using pandas in Python

I have the two dataframes below, and I'd like to merge them to get ID onto df1. However, I find that when I merge, I cannot get the ID when a name appears more than once. df2 has unique names; df1 and df2 differ in rows and columns. My code is below:
df1: Name Region
0 P Asia
1 Q Eur
2 R Africa
3 S NA
4 R Africa
5 R Africa
6 S NA
df2: Name Id
0 P 1234
1 Q 1244
2 R 1233
3 S 1111
code:
x = df1.assign(temp1=df1.groupby('Name').cumcount())
y = df2.assign(temp1=df2.groupby('Name').cumcount())
xy = x.merge(y, on=['Name', 'temp1'], how='left').drop(columns=['temp1'])
the output is:
df1:  Name  Region    Id
0     P     Asia    1234
1     Q     Eur     1244
2     R     Africa  1233
3     S     NA      1111
4     R     Africa   NaN
5     R     Africa   NaN
6     S     NA       NaN
How do I find the ID for all of these duplicate names?
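Since every Name in df2 is unique, the cumcount columns actively prevent matches here: duplicate names in df1 get cumcount 1, 2, ... while df2 only ever has 0. A plain left merge on Name is enough; a minimal sketch:

import pandas as pd

df1 = pd.DataFrame({'Name': ['P', 'Q', 'R', 'S', 'R', 'R', 'S'],
                    'Region': ['Asia', 'Eur', 'Africa', 'NA', 'Africa', 'Africa', 'NA']})
df2 = pd.DataFrame({'Name': ['P', 'Q', 'R', 'S'],
                    'Id': [1234, 1244, 1233, 1111]})

# df2 has one row per Name, so merging on Name alone copies the Id
# onto every occurrence of that name in df1
out = df1.merge(df2, on='Name', how='left')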

Flag repeating entries in pandas time series

I have a data frame that takes this form (but is several million rows long):
import pandas as pd
data = {'id': ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
        'year': ["2000", "2001", "2002", "2000", "2001", "2003", "1999", "2000", "2001", "2000", "2000", "2001"],
        'vacation': ["France", "Morocco", "Morocco", "Germany", "Germany", "Germany", "Japan", "Australia", "Japan", "Canada", "Mexico", "China"],
        'new': [1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)
A 2000 France
A 2001 Morocco
A 2002 Morocco
B 2000 Germany
B 2001 Germany
B 2003 Germany
C 1999 Japan
C 2000 Australia
C 2001 Japan
D 2000 Canada
D 2000 Mexico
D 2001 China
For each person in each year, the holiday destination(s) is/are given; there can be multiple holiday destinations in a given year.
I would like to flag the rows when a participant goes to holiday to a destination to which they had not gone the year before (i.e., the destination is new). In the case above, the output would be:
id year vacation new
A 2000 France 1
A 2001 Morocco 1
A 2002 Morocco 0
B 2000 Germany 1
B 2001 Germany 0
B 2003 Germany 0
C 1999 Japan 1
C 2000 Australia 1
C 2001 Japan 1
D 2000 Canada 1
D 2000 Mexico 1
D 2001 China 1
For A, B, C, and D, the first holiday destination in our data frame is flagged as new. When A goes to Morocco two years in a row, the 2nd occurrence is not flagged, because A went there the year before. When B goes to Germany three times, the 2nd and 3rd occurrences are not flagged. When C goes to Japan twice, all of the occurrences are flagged, because they did not go to Japan two years in a row. D goes to 3 different destinations (albeit 2 destinations in 2000) and all of them are flagged.
I have been trying to solve it myself, but have not been able to break away from iterations, which are too computationally intensive for such a massive dataset.
I'd appreciate any input; thanks.
IIUC,
what we are doing is grouping by id & vacation and ensuring that year is not equal to the year above; equivalently, we are selecting the first instance of that combination.
Hopefully that's clear. Let me know if you need any more help.
df["new_2"] = (
df.groupby(["id", "vacation"])["id", "year"]
.apply(lambda x: x.ne(x.shift()))
.all(axis=1)
.add(0)
)
print(df)
id year vacation new_2
0 A 2000 France 1
1 A 2001 USA 1
2 A 2002 France 0
3 B 2001 Germany 1
4 B 2002 Germany 0
5 B 2003 Germany 0
6 C 1999 Japan 1
7 C 2000 Australia 1
8 C 2001 France 1
Here's one solution I came up with, using groupby and transform:
df = df.sort_values(["id", "vacation", "year"])
df["new"] = (
    df.groupby(["id", "vacation"])
    .transform(lambda x: x.iloc[0])
    .year.eq(df.year)
    .astype(int)
)
You'll get
id year vacation new
0 A 2000 France 1
1 A 2001 USA 1
2 A 2002 France 0
3 B 2001 Germany 1
4 B 2002 Germany 0
5 B 2003 Germany 0
6 C 1999 Japan 1
7 C 2000 Australia 1
8 C 2001 France 1
Here is a way using groupby+cumcount and series.mask:
df['new'] = df.groupby(['id', 'vacation']).cumcount().add(1).mask(lambda x: x.gt(1), 0)
print(df)
id year vacation new
0 A 2000 France 1
1 A 2001 USA 1
2 A 2002 France 0
3 B 2001 Germany 1
4 B 2002 Germany 0
5 B 2003 Germany 0
6 C 1999 Japan 1
7 C 2000 Australia 1
8 C 2001 France 1
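All three answers above flag the first occurrence of each id/vacation pair; none of them checks whether the previous trip was literally the year before. If the rule is strictly "new unless the same person went to the same destination the year before", a grouped shift comes closer. A minimal sketch (the column name new_strict is made up here; note it flags B's 2003 Germany trip, since B skipped 2002):

df = df.sort_values(["id", "vacation", "year"]).copy()
df["year"] = df["year"].astype(int)

# year of the previous trip by the same person to the same destination (NaN if none)
prev_year = df.groupby(["id", "vacation"])["year"].shift()

# new unless that previous trip was exactly one year earlier
df["new_strict"] = ((df["year"] - prev_year) != 1).astype(int)
df = df.sort_index()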

How to calculate common elements in a dataframe depending on another column

I have a dataframe like this.
sport Country(s)
Foot_ball brazil
Foot_ball UK
Volleyball UK
Volleyball South_Africa
Volleyball brazil
Rugger UK
Rugger South_africa
Rugger Australia
Carrom UK
Carrom Australia
Chess UK
Chess Australia
I want to calculate the number of sports shared by two countries. For example:
Foot_ball and Volleyball are common to brazil and UK, so the number of common sports played by brazil and UK is 2.
Carrom, Chess, and Rugger are common to Australia and UK, so the number of sports shared by Australia and UK is 3.
Like this, is there any way I can get a count across the whole dataframe for
brazil, South_Africa
brazil, Australia
South_Africa, UK
etc.?
Can anybody suggest how to do this in pandas or any other way?
With the sample data you provided, you can generate the desired output with the code below:
import pandas as pd

df = pd.DataFrame(
    [["Foot_ball", "brazil"],
     ["Foot_ball", "UK"],
     ["Volleyball", "UK"],
     ["Volleyball", "South_Africa"],
     ["Volleyball", "brazil"],
     ["Rugger", "UK"],
     ["Rugger", "South_Africa"],
     ["Rugger", "Australia"],
     ["Carrom", "UK"],
     ["Carrom", "Australia"],
     ["Chess", "UK"],
     ["Chess", "Australia"]],
    columns=["sport", "Country"])
# Function to get the number of sports in common between two countries
def countCommonSports(row):
    sports1 = df["sport"][df["Country"] == row["Country 1"]]
    sports2 = df["sport"][df["Country"] == row["Country 2"]]
    return len(set(sports1).intersection(sports2))

# Generate the combinations of countries from the original dataframe
from itertools import combinations
comb = combinations(df["Country"].unique(), 2)
out = pd.DataFrame(list(comb), columns=["Country 1", "Country 2"])

# Count the sports in common between countries
out["common Sports count"] = out.apply(countCommonSports, axis=1)
output is then:
>>> out
Country 1 Country 2 common Sports count
0 brazil UK 2
1 brazil South_Africa 1
2 brazil Australia 0
3 UK South_Africa 2
4 UK Australia 3
5 South_Africa Australia 1
pd.factorize and itertools.combinations
import pandas as pd
import numpy as np
from itertools import combinations, product

# Fix capitalization so country names are consistent ('South_africa' -> 'South_Africa')
df['Country(s)'] = ['_'.join(map(str.title, x.split('_'))) for x in df['Country(s)']]

# Build every ordered pair of countries that share a sport
c0, c1 = zip(*[(a, b)
               for s, c in df.groupby('sport')['Country(s)']
               for a, b in combinations(c, 2)])

# Encode each side of the pair as integer positions and tally co-occurrences
i, r = pd.factorize(c0)
j, c = pd.factorize(c1)
n, m = len(r), len(c)
o = np.zeros((n, m), np.int64)
np.add.at(o, (i, j), 1)

result = pd.DataFrame(o, r, c)
result
result
Australia Uk South_Africa Brazil
Uk 3 0 2 1
Brazil 0 1 0 0
South_Africa 1 0 0 1
Make symmetrical
result = result.align(result.T, fill_value=0)[0]
result
Australia Brazil South_Africa Uk
Australia 0 0 0 0
Brazil 0 0 0 1
South_Africa 1 1 0 0
Uk 3 1 2 0
pd.crosstab
This will be slower... almost certainly.
c0, c1 = map(pd.Series, zip(*[(a, b)
                              for s, c in df.groupby('sport')['Country(s)']
                              for a, b in combinations(c, 2)]))

pd.crosstab(c0, c1).rename_axis(None).rename_axis(None, axis=1).pipe(
    lambda d: d.align(d.T, fill_value=0)[0]
)
Australia Brazil South_Africa Uk
Australia 0 0 0 0
Brazil 0 0 0 1
South_Africa 1 1 0 0
Uk 3 1 2 0
Or including all sports within a single country
c0, c1 = map(pd.Series, zip(*[(a, b)
                              for s, c in df.groupby('sport')['Country(s)']
                              for a, b in product(c, c)]))

pd.crosstab(c0, c1).rename_axis(None).rename_axis(None, axis=1)
Australia Brazil South_Africa Uk
Australia 3 0 1 3
Brazil 0 2 1 2
South_Africa 1 1 2 2
Uk 3 2 2 5
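A plain self-merge on sport gives the same pair counts without factorize or crosstab; a minimal sketch, assuming the capitalization fix above has already been applied:

# pair every two countries that share a sport
pairs = df.merge(df, on='sport')

# keep each unordered pair once and drop country-with-itself rows
pairs = pairs[pairs['Country(s)_x'] < pairs['Country(s)_y']]

counts = (pairs.groupby(['Country(s)_x', 'Country(s)_y'])
               .size()
               .reset_index(name='common sports'))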

How to split a string variable and add its values in separate rows

I want to add multiple rows by deriving them from a string column in Stata.
I have a dataset like the following one:
year countryname intensitylevel
1990 India, Pakistan 1
1991 India, Pakistan 1
1992 India, Pakistan 1
1996 India, Pakistan 1
To be more precise, I want to split the country name variable for each country separately.
In the end, I want to have a dataset like the one below:
year countryname intensitylevel
1990 India 1
1990 Pakistan 1
1991 India 1
1991 Pakistan 1
This is a simple split and reshape:
clear
input year str15 countryname intensitylevel
1990 "India, Pakistan" 1
1991 "India, Pakistan" 1
1992 "India, Pakistan" 1
1996 "India, Pakistan" 1
end
split countryname, p(,)
drop countryname
reshape long countryname, i(year) j(part)
drop part
sort year countryname
list year countryname intensitylevel, abbreviate(15) sepby(year)
+-------------------------------------+
| year countryname intensitylevel |
|-------------------------------------|
1. | 1990 Pakistan 1 |
2. | 1990 India 1 |
|-------------------------------------|
3. | 1991 Pakistan 1 |
4. | 1991 India 1 |
|-------------------------------------|
5. | 1992 Pakistan 1 |
6. | 1992 India 1 |
|-------------------------------------|
7. | 1996 Pakistan 1 |
8. | 1996 India 1 |
+-------------------------------------+
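For comparison with the pandas questions above, the same split-and-reshape can be done with str.split plus DataFrame.explode (available since pandas 0.25); a minimal sketch:

import pandas as pd

df = pd.DataFrame({
    "year": [1990, 1991, 1992, 1996],
    "countryname": ["India, Pakistan"] * 4,
    "intensitylevel": 1,
})

# split the comma-separated names into lists, give each name its own row,
# then strip the space left behind by the split
df["countryname"] = df["countryname"].str.split(",")
df = df.explode("countryname").reset_index(drop=True)
df["countryname"] = df["countryname"].str.strip()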
