How to split a string variable and add its values in separate rows

I want to add multiple rows by deriving them from a string column in Stata.
I have a dataset like the following one:
year countryname intensitylevel
1990 India, Pakistan 1
1991 India, Pakistan 1
1992 India, Pakistan 1
1996 India, Pakistan 1
To be more precise, I want to split the country name variable for each country separately.
In the end, I want to have a dataset like the one below:
year countryname intensitylevel
1990 India 1
1990 Pakistan 1
1991 India 1
1991 Pakistan 1

This is a simple split and reshape:
clear
input year str15 countryname intensitylevel
1990 "India, Pakistan" 1
1991 "India, Pakistan" 1
1992 "India, Pakistan" 1
1996 "India, Pakistan" 1
end
split countryname, p(,)
drop countryname
reshape long countryname, i(year) j(which)
sort year countryname
list year countryname intensitylevel, abbreviate(15) sepby(year)
+-------------------------------------+
| year countryname intensitylevel |
|-------------------------------------|
1. | 1990 Pakistan 1 |
2. | 1990 India 1 |
|-------------------------------------|
3. | 1991 Pakistan 1 |
4. | 1991 India 1 |
|-------------------------------------|
5. | 1992 Pakistan 1 |
6. | 1992 India 1 |
|-------------------------------------|
7. | 1996 Pakistan 1 |
8. | 1996 India 1 |
+-------------------------------------+
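For comparison (an assumption on my part, not part of the Stata answer above), the same split-to-rows step can be sketched in pandas with str.split and explode:

```python
import pandas as pd

df = pd.DataFrame({
    "year": [1990, 1991, 1992, 1996],
    "countryname": ["India, Pakistan"] * 4,
    "intensitylevel": [1, 1, 1, 1],
})

# One list element per country, then one row per element
df["countryname"] = df["countryname"].str.split(",")
df = df.explode("countryname")
df["countryname"] = df["countryname"].str.strip()  # drop the space after the comma
df = df.sort_values(["year", "countryname"]).reset_index(drop=True)
print(df)
```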

Collect columns with the same header prefix

I have a sheet with the following header:
'''
+--------------+------------------+----------------+--------------+---------------+
| usa_alaska | usa_california | france_paris | italy_roma | france_lyon |
|--------------+------------------+----------------+--------------+---------------|
+--------------+------------------+----------------+--------------+---------------+
'''
df = pd.DataFrame([], columns = 'usa_alaska usa_california france_paris italy_roma france_lyon'.split())
I want to separate the headers by country and region so that when I select france, I get paris and lyon as columns.
Create a MultiIndex from your column names:
Suppose this dataframe:
>>> df
usa_alaska usa_california france_paris italy_roma france_lyon
0 1 2 3 4 5
df.columns = df.columns.str.split('_', expand=True)
df = df.sort_index(axis=1)
Output
>>> df
france italy usa
lyon paris roma alaska california
0 5 3 4 1 2
>>> df['france']
lyon paris
0 5 3
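Collected into one runnable sketch (same toy data as above):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5]],
                  columns="usa_alaska usa_california france_paris italy_roma france_lyon".split())

# Split each 'country_region' name into a two-level MultiIndex
df.columns = df.columns.str.split("_", expand=True)
df = df.sort_index(axis=1)

print(df["france"])  # only the lyon and paris columns
```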

Extract the mapping dictionary between two columns in pandas

I have a dataframe as shown below.
df:
id player country_code country
1 messi arg argentina
2 neymar bra brazil
3 tevez arg argentina
4 aguero arg argentina
5 rivaldo bra brazil
6 owen eng england
7 lampard eng england
8 gerrard eng england
9 ronaldo bra brazil
10 marria arg argentina
from the above df, I would like to extract the mapping dictionary that relates the country_code with country columns.
Expected Output:
d = {'arg':'argentina', 'bra':'brazil', 'eng':'england'}
Dictionary keys are unique, so you can convert a Series whose (duplicated) index comes from column country_code:
d = df.set_index('country_code')['country'].to_dict()
If a country_code ever maps to more than one country value, the last value seen for that code is kept.
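A minimal sketch of that behavior on a toy subset of the data above (later rows overwrite earlier ones for a duplicated key):

```python
import pandas as pd

df = pd.DataFrame({
    "country_code": ["arg", "bra", "arg", "eng"],
    "country": ["argentina", "brazil", "argentina", "england"],
})

d = df.set_index("country_code")["country"].to_dict()
print(d)  # {'arg': 'argentina', 'bra': 'brazil', 'eng': 'england'}
```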

Excel cell lookup in subtotaled range

I'd like to use index/match to lookup values in a subtotaled range. Using the sample data below, from another sheet (Sheet 2), I need to lookup the total NY Company hours for each employee.
Sheet 2:
| Bob | NY Company | ???? |
This formula returns the first match of NY Company Total
=INDEX(Sheet1!A1:C45,MATCH(Sheet2!B2 & " Total",Sheet1!B1:B45,0),3)
Now I need to expand the lookup to include the Employee (Bob). Also, Column A is blank on the total Row. I've started to work with something like the following but no luck.
=INDEX(Sheet1!A1:C45,MATCH(1,(Sheet2!B2 & " Total"=Sheet1!B1:B45)*(Sheet2!B1=Sheet1!A1:A45)),3)
Sample data (column A is blank on the subtotal rows):

| A (Employee) | B (Customer) | C (Hours) |
|--------------|--------------|-----------|
| Bob | ABC Company | 5 |
| Bob | ABC Company | 3 |
|  | ABC Company Total | 8 |
| Bob | NY Company | 7 |
| Bob | NY Company | 7 |
| Bob | NY Company | 5 |
| Bob | NY Company | 3 |
|  | NY Company Total | 22 |
| Bob | Jet Company | 1 |
|  | Jet Company Total | 1 |
| Carrie | ABC Company | 1 |
| Carrie | ABC Company | 4 |
|  | ABC Company Total | 5 |
| Carrie | NY Company | 6 |
| Carrie | NY Company | 2 |
| Carrie | NY Company | 3 |
|  | NY Company Total | 11 |
| Carrie | Jet Company | 7 |
| Carrie | Jet Company | 9 |
|  | Jet Company Total | 16 |
| Carrie | XYZ Company | 4 |
|  | XYZ Company Total | 4 |
| Gale | Cats Service | 2 |
| Gale | Cats Service | 6 |
| Gale | Cats Service | 1 |
|  | Cats Service Total | 9 |
| Gale | NY Company | 6 |
| Gale | NY Company | 8 |
|  | NY Company Total | 14 |
| Gale | XYZ Company | 1 |
|  | XYZ Company Total | 1 |
| John | NY Company | 3 |
| John | NY Company | 5 |
|  | NY Company Total | 8 |
| John | XYZ Company | 8 |
| John | XYZ Company | 5 |
|  | XYZ Company Total | 13 |
| Ken | ABC Company | 10 |
|  | ABC Company Total | 10 |
| Ken | NY Company | 2 |
| Ken | NY Company | 3 |
| Ken | NY Company | 5 |
|  | NY Company Total | 10 |
|  | Grand Total | 132 |
Any suggestions??
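As a cross-check of what the lookup should return (a pandas sketch, not an Excel formula; column names follow the sample data, and only the detail rows are included since the subtotal rows have a blank Employee cell):

```python
import pandas as pd

# Detail rows for Bob and Carrie from the sample data
df = pd.DataFrame({
    "Employee": ["Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Carrie", "Carrie", "Carrie"],
    "Customer": ["ABC Company", "ABC Company",
                 "NY Company", "NY Company", "NY Company", "NY Company",
                 "NY Company", "NY Company", "NY Company"],
    "Hours": [5, 3, 7, 7, 5, 3, 6, 2, 3],
})

# Equivalent of looking up Bob's "NY Company Total"
total = df.loc[(df["Employee"] == "Bob") & (df["Customer"] == "NY Company"), "Hours"].sum()
print(total)  # 22
```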

Flag repeating entries in pandas time series

I have a data frame that takes this form (but is several millions of rows long):
import pandas as pd
data = {'id': ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
        'year': ["2000", "2001", "2002", "2001", "2002", "2003", "1999", "2000", "2001", "2000", "2000", "2001"],
        'vacation': ["France", "Morocco", "Morocco", "Germany", "Germany", "Germany", "Japan", "Australia", "Japan", "Canada", "Mexico", "China"],
        'new': [1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)
A 2000 France
A 2001 Morocco
A 2002 Morocco
B 2001 Germany
B 2002 Germany
B 2003 Germany
C 1999 Japan
C 2000 Australia
C 2001 Japan
D 2000 Canada
D 2000 Mexico
D 2001 China
For each person in each year, the holiday destination(s) is/are given; there can be multiple holiday destinations in a given year.
I would like to flag the rows when a participant goes to holiday to a destination to which they had not gone the year before (i.e., the destination is new). In the case above, the output would be:
id year vacation new
A 2000 France 1
A 2001 Morocco 1
A 2002 Morocco 0
B 2001 Germany 1
B 2002 Germany 0
B 2003 Germany 0
C 1999 Japan 1
C 2000 Australia 1
C 2001 Japan 1
D 2000 Canada 1
D 2000 Mexico 1
D 2001 China 1
For A, B, C, and D, the first holiday destination in our data frame is flagged as new. When A goes to Morocco two years in a row, the 2nd occurrence is not flagged, because A went there the year before. When B goes to Germany 3 times in a row, the 2nd and 3rd occurrences are not flagged. When person C goes to Japan twice, all of the occurrences are flagged, because they did not go to Japan two years in a row. D goes to 3 different destinations (albeit to 2 destinations in 2000) and all of them are flagged.
I have been trying to solve it myself, but have not been able to break away from iterations, which are too computationally intensive for such a massive dataset.
I'd appreciate any input; thanks.
IIUC,
what we are doing is grouping by id & vacation and checking whether the year is exactly one more than the previous year for that combination; if not, the destination counts as new.
Hopefully that's clear; let me know if you need any more help.
df["year"] = df["year"].astype(int)
df["new_2"] = (
    df.groupby(["id", "vacation"])["year"]
    .diff()      # years since the previous trip to the same destination
    .ne(1)       # new unless they went exactly the year before
    .astype(int)
)
print(df)
id year vacation new new_2
0 A 2000 France 1 1
1 A 2001 Morocco 1 1
2 A 2002 Morocco 0 0
3 B 2001 Germany 1 1
4 B 2002 Germany 0 0
5 B 2003 Germany 0 0
6 C 1999 Japan 1 1
7 C 2000 Australia 1 1
8 C 2001 Japan 1 1
9 D 2000 Canada 1 1
10 D 2000 Mexico 1 1
11 D 2001 China 1 1
Here's one solution I came up with, using groupby and transform:
df = df.sort_values(["id", "vacation", "year"])
df["new"] = (
    df.groupby(["id", "vacation"])["year"]
    .transform(lambda s: s.astype(int).diff().ne(1))
    .astype(int)
)
You'll get
id year vacation new
0 A 2000 France 1
1 A 2001 Morocco 1
2 A 2002 Morocco 0
3 B 2001 Germany 1
4 B 2002 Germany 0
5 B 2003 Germany 0
7 C 2000 Australia 1
6 C 1999 Japan 1
8 C 2001 Japan 1
9 D 2000 Canada 1
11 D 2001 China 1
10 D 2000 Mexico 1
Here is a way using groupby+cumcount and Series.mask. Note that it flags only the first occurrence of each id/vacation pair, so a destination revisited after a gap year (C's Japan in 2001) stays 0:
df['new'] = df.groupby(['id', 'vacation']).cumcount().add(1).mask(lambda x: x.gt(1), 0)
print(df)
id year vacation new
0 A 2000 France 1
1 A 2001 Morocco 1
2 A 2002 Morocco 0
3 B 2001 Germany 1
4 B 2002 Germany 0
5 B 2003 Germany 0
6 C 1999 Japan 1
7 C 2000 Australia 1
8 C 2001 Japan 0
9 D 2000 Canada 1
10 D 2000 Mexico 1
11 D 2001 China 1

Python - Cleaning US and Canadian Zip Codes with `df.loc` and `str` Methods

I have the following code to create a column with cleaned up zip codes for the USA and Canada
df = pd.read_csv(file1)
usa = df['Region'] == 'USA'
canada = df['Region'] == 'Canada'
df.loc[usa, 'ZipCleaned'] = df.loc[usa, 'Zip'].str.slice(stop=5)
df.loc[canada, 'ZipCleaned'] = df.loc[canada, 'Zip'].str.replace(' |-', '', regex=True)
The issue that i am having is that some of the rows that have "USA" as the country contain Canadian postal codes in the dataset. So the USA logic from above is being applied to Canadian postal codes.
I tried the edited code above along with the below, and experimented with one province ("BC") to prevent the USA logic from being applied in this case, but it didn't work:
usa = df['Region'] == 'USA'
usa = df['State'] != 'BC'
| Region | Date | State | Zip | Customer | Revenue |
|--------|------|-------|-----|----------|---------|
| USA | 1/3/2014 | BC | A5Z 1B6 | Customer A | $157.52 |
| Canada | 1/13/2014 | AB | Z3J-4E5 | Customer B | $750.00 |
| USA | 1/4/2014 | FL | 90210-9999 | Customer C | $650.75 |
| USA | 1/21/2014 | FL | 12345 | Customer D | $242.00 |
| USA | 1/25/2014 | FL | 45678 | Customer E | $15.00 |
| USA | 1/28/2014 | NY | 91011 | Customer F | $25.00 |
Thanks Kris. But what if I wanted to maintain the original values in the Region column and change "Zip Cleaned" based on whether Zip contains a Canadian or USA Zip. I tried the following but it's not working
usa = df.loc[df['Ship To Customer Zip'].str.contains('[0-9]')]
canada = df.loc[df['Ship To Customer Zip'].str.contains('[A-Za-z]')]
df.loc[usa, 'ZipCleaned'] = df.loc[usa, 'Ship To Customer Zip'].str.slice(stop=5)
df.loc[canada, 'ZipCleaned'] = df.loc[canada, 'Ship To Customer Zip'].str.replace(' |-','')
Give this a try:
# sample df provided by OP
>>> df
Region Date State Zip Customer Revenue
0 USA 2014-01-03 BC A5Z 1B6 Customer A 157.52
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750
2 USA 2014-01-04 FL 90210-999 Customer C 650.75
3 USA 2014-01-21 FL 12345 Customer D 242
4 USA 2014-01-25 FL 45678 Customer E 15
5 USA 2014-01-28 NY 91011 Customer F 25
# Edit 'Region' by testing 'Zip' for presence of letters (US Zip Codes are only numeric)
>>> df.loc[df['Zip'].str.contains('[A-Za-z]'), 'Region'] = 'Canada'
>>> df
Region Date State Zip Customer Revenue
0 Canada 2014-01-03 BC A5Z 1B6 Customer A 157.52
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750
2 USA 2014-01-04 FL 90210-999 Customer C 650.75
3 USA 2014-01-21 FL 12345 Customer D 242
4 USA 2014-01-25 FL 45678 Customer E 15
5 USA 2014-01-28 NY 91011 Customer F 25
# apply OP's original filtering and cleaning
>>> usa = df['Region'] == 'USA'
>>> canada = df['Region'] == 'Canada'
>>> df.loc[usa, 'ZipCleaned'] = df.loc[usa, 'Zip'].str.slice(stop=5)
>>> df.loc[canada, 'ZipCleaned'] = df.loc[canada, 'Zip'].str.replace(' |-', '', regex=True)
# display resultant df
>>> df
Region Date State Zip Customer Revenue ZipCleaned
0 Canada 2014-01-03 BC A5Z 1B6 Customer A 157.52 A5Z1B6
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750 Z3J4E5
2 USA 2014-01-04 FL 90210-999 Customer C 650.75 90210
3 USA 2014-01-21 FL 12345 Customer D 242 12345
4 USA 2014-01-25 FL 45678 Customer E 15 45678
5 USA 2014-01-28 NY 91011 Customer F 25 91011
EDIT: Update as requested by OP: we can do the following to leave the original 'Region' intact:
>>> df
Region Date State Zip Customer Revenue
0 USA 2014-01-03 BC A5Z 1B6 Customer A 157.52
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750
2 USA 2014-01-04 FL 90210-999 Customer C 650.75
3 USA 2014-01-21 FL 123456 Customer D 242
4 USA 2014-01-25 FL 45678 Customer E 15
5 USA 2014-01-28 NY 91011 Customer F 25
# create 'ZipCleaned' by referencing original 'Zip'
>>> df.loc[~df['Zip'].str.contains('[A-Za-z]'), 'ZipCleaned'] = df['Zip'].str.slice(stop=5)
>>> df.loc[df['Zip'].str.contains('[A-Za-z]'), 'ZipCleaned'] = df['Zip'].str.replace(' |-', '', regex=True)
# Resultant df
>>> df
Region Date State Zip Customer Revenue ZipCleaned
0 USA 2014-01-03 BC A5Z 1B6 Customer A 157.52 A5Z1B6
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750 Z3J4E5
2 USA 2014-01-04 FL 90210-999 Customer C 650.75 90210
3 USA 2014-01-21 FL 123456 Customer D 242 12345
4 USA 2014-01-25 FL 45678 Customer E 15 45678
5 USA 2014-01-28 NY 91011 Customer F 25 91011
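As a variant of the approach above (my own sketch, not from the original thread), the letter test and both cleanups can be combined into one statement with numpy.where; the Zip values are modeled on the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Zip": ["A5Z 1B6", "Z3J-4E5", "90210-9999", "12345"]})

is_canadian = df["Zip"].str.contains("[A-Za-z]")  # US ZIPs are purely numeric
df["ZipCleaned"] = np.where(
    is_canadian,
    df["Zip"].str.replace(" |-", "", regex=True),  # Canada: strip spaces/hyphens
    df["Zip"].str.slice(stop=5),                   # USA: keep the 5-digit ZIP
)
print(df["ZipCleaned"].tolist())  # ['A5Z1B6', 'Z3J4E5', '90210', '12345']
```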