Convert a dataset with a single level header to a dataset with multiple level headers - python-3.x

Given the dataset df below, with a single-level header, I would like to convert it into a dataset with a three-level header. How can I achieve this? Thanks.
   year       cpi       ppi Country code Country name
0  1995  0.107797  0.085297          AUS    Australia
1  1996  0.110997  0.082973          AUS    Australia
2  1997  0.110098  0.083651          AUS    Australia
3  1998  0.107635  0.086049          AUS    Australia
4  1999  0.104969  0.089700          AUS    Australia
5  1995  0.120071  0.129769          MEX       Mexico
6  1996  0.119249  0.142606          MEX       Mexico
7  1997  0.124866  0.151372          MEX       Mexico
8  1998  0.127448  0.153303          MEX       Mexico
9  1999  0.134342  0.159876          MEX       Mexico
The expected output:
  Country name  Australia  Australia    Mexico    Mexico
  Country code        AUS        AUS       MEX       MEX
Indicator Name        cpi        ppi       cpi       ppi
0         1995   0.107797   0.085297  0.120071  0.129769
1         1996   0.110997   0.082973  0.119249  0.142606
2         1997   0.110098   0.083651  0.124866  0.151372
3         1998   0.107635   0.086049  0.127448  0.153303
4         1999   0.104969   0.089700  0.134342  0.159876

Use DataFrame.pivot with DataFrame.reorder_levels and DataFrame.sort_index:
df = (df.pivot(index=['year'],
               columns=['Country code', 'Country name'])
        .reorder_levels([2, 1, 0], axis=1)
        .sort_index(axis=1))
print(df)
Country name Australia           Mexico
Country code       AUS               MEX
                   cpi       ppi      cpi       ppi
year
1995          0.107797  0.085297 0.120071  0.129769
1996          0.110997  0.082973 0.119249  0.142606
1997          0.110098  0.083651 0.124866  0.151372
1998          0.107635  0.086049 0.127448  0.153303
1999          0.104969  0.089700 0.134342  0.159876
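An equivalent route is set_index plus unstack; a minimal sketch, assuming the df from the question (the rename_axis call is only there to label the column levels like the expected output, with Indicator Name as the assumed name for the last level):

out = (df.set_index(['year', 'Country name', 'Country code'])
         .unstack(['Country name', 'Country code'])  # indicators stay as the first column level
         .reorder_levels([1, 2, 0], axis=1)          # country name, country code, indicator
         .sort_index(axis=1)
         .rename_axis(columns=['Country name', 'Country code', 'Indicator Name']))
print(out)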


Full country name to country code in Dataframe

I have these kinds of values in the Country column of my dataframe: some rows contain full country names, and some contain alpha-2 codes.
Country
------------------------
8836 United Kingdom
1303 ES
7688 United Kingdom
12367 FR
7884 United Kingdom
6844 United Kingdom
3706 United Kingdom
3567 UK
6238 FR
588 UK
4901 United Kingdom
568 UK
4880 United Kingdom
11284 France
1273 Spain
2719 France
1386 UK
12838 United Kingdom
868 France
1608 UK
Name: Country, dtype: object
Note: Some data in Country are empty.
How will I be able to create a new column with the alpha-2 country codes in it?
Country | Country Code
---------------------------------------
United Kingdom | UK
France | FR
FR | FR
UK | UK
Italy | IT
Spain | ES
ES | ES
...
You can try this, as I already mentioned in an earlier comment.
import pandas as pd
df = pd.DataFrame([[1, 'UK'],[2, 'United Kingdom'],[3, 'ES'],[2, 'Spain']], columns=['id', 'Country'])
# Create a copy of the Country column as alpha-2
df['alpha-2'] = df['Country']
# Create a lookup with the required values
lookup_table = {'United Kingdom': 'UK', 'Spain': 'ES'}
# Replace the alpha-2 column with the lookup values
df = df.replace({'alpha-2': lookup_table})
print(df)
Output:
   id         Country alpha-2
0   1              UK      UK
1   2  United Kingdom      UK
2   3              ES      ES
3   2           Spain      ES
You will have to define a dictionary for the replacements (or find a library that does it for you). The abbreviations look pretty close to the IBAN country codes to me. The biggest standout was United Kingdom => GB, as opposed to UK in your example.
I would start with the IBAN codes and define a big dictionary like this:
mappings = {
    "Afghanistan": "AF",
    "Albania": "AL",
    ...
}
df["Country Code"] = df["Country"].replace(mappings)

Pandas: How to average rows with two columns having similar id's? [duplicate]

This question already has answers here:
Pandas dataframe: Group by two columns and then average over another column
(2 answers)
Closed 2 years ago.
I have a dataframe like the following:
State Name    County Name  Value
Idaho         Ada             20
Idaho         Ada             50
Pennsylvania  Adams           70
Colorado      Adams           25
Pennsylvania  Adams           21
Illinois      Adams           45
Illinois      Madison         45
Illinois      Madison         75
I want to average the rows with the same State Name and County Name, so that the dataframe becomes this:
State Name    County Name  Mean
Idaho         Ada          35.0
Pennsylvania  Adams        45.5
Colorado      Adams        25.0
Illinois      Adams        45.0
Illinois      Madison      60.0
Any kind of help is appreciated.
Try:
df.groupby(['State Name','County Name']).mean()
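A fuller, runnable sketch of the same idea, with the data reconstructed from the question; as_index=False keeps the grouping keys as ordinary columns, and the rename produces the Mean header from the expected output:

import pandas as pd

df = pd.DataFrame({
    "State Name": ["Idaho", "Idaho", "Pennsylvania", "Colorado",
                   "Pennsylvania", "Illinois", "Illinois", "Illinois"],
    "County Name": ["Ada", "Ada", "Adams", "Adams",
                    "Adams", "Adams", "Madison", "Madison"],
    "Value": [20, 50, 70, 25, 21, 45, 45, 75],
})

out = (df.groupby(["State Name", "County Name"], as_index=False)["Value"]
         .mean()
         .rename(columns={"Value": "Mean"}))
print(out)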

Flag repeating entries in pandas time series

I have a data frame that takes this form (but is several million rows long):
import pandas as pd

data = {'id': ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
        'year': ["2000", "2001", "2002", "2000", "2001", "2003", "1999", "2000", "2001", "2000", "2000", "2001"],
        'vacation': ["France", "Morocco", "Morocco", "Germany", "Germany", "Germany", "Japan", "Australia", "Japan", "Canada", "Mexico", "China"],
        'new': [1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)
A 2000 France
A 2001 Morocco
A 2002 Morocco
B 2000 Germany
B 2001 Germany
B 2003 Germany
C 1999 Japan
C 2000 Australia
C 2001 Japan
D 2000 Canada
D 2000 Mexico
D 2001 China
For each person in each year, the holiday destination(s) is/are given; there can be multiple holiday destinations in a given year.
I would like to flag the rows where a participant goes on holiday to a destination to which they had not gone the year before (i.e., the destination is new). In the case above, the output would be:
id year vacation new
A 2000 France 1
A 2001 Morocco 1
A 2002 Morocco 0
B 2001 Germany 1
B 2002 Germany 0
B 2003 Germany 0
C 1999 Japan 1
C 1999 Australia 1
C 2000 Japan 1
D 2000 Canada 1
D 2000 Mexico 1
D 2001 China 1
For A, B, C, and D, the first holiday destination in the data frame is flagged as new. When A goes to Morocco two years in a row, the second occurrence is not flagged, because A went there the year before. When B goes to Germany three times in a row, the second and third occurrences are not flagged. When C goes to Japan twice, both occurrences are flagged, because C did not go to Japan two years in a row. D goes to three different destinations (albeit two of them in 2000), and all of them are flagged.
I have been trying to solve this myself, but I have not been able to break away from iteration, which is too computationally intensive for such a massive dataset.
I'd appreciate any input; thanks.
IIUC, we are grouping by id & vacation and checking that the year is not equal to the year above; put differently, we are selecting the first instance of that combination.
Hopefully that's clear; let me know if you need any more help.
df["new_2"] = (
df.groupby(["id", "vacation"])["id", "year"]
.apply(lambda x: x.ne(x.shift()))
.all(axis=1)
.add(0)
)
print(df)
id year vacation new_2
0 A 2000 France 1
1 A 2001 USA 1
2 A 2002 France 0
3 B 2001 Germany 1
4 B 2002 Germany 0
5 B 2003 Germany 0
6 C 1999 Japan 1
7 C 2000 Australia 1
8 C 2001 France 1
Here's one solution I came up with, using groupby and transform:
df = df.sort_values(["id", "vacation", "year"])
df["new"] = (
    df.groupby(["id", "vacation"])
      .transform(lambda x: x.iloc[0])
      .year.eq(df.year)
      .astype(int)
)
You'll get
id year vacation new
0 A 2000 France 1
1 A 2001 USA 1
2 A 2002 France 0
3 B 2001 Germany 1
4 B 2002 Germany 0
5 B 2003 Germany 0
6 C 1999 Japan 1
7 C 2000 Australia 1
8 C 2001 France 1
Here is a way using groupby+cumcount and series.mask:
df['new'] = df.groupby(['id', 'vacation']).cumcount().add(1).mask(lambda x: x.gt(1), 0)
print(df)
id year vacation new
0 A 2000 France 1
1 A 2001 USA 1
2 A 2002 France 0
3 B 2001 Germany 1
4 B 2002 Germany 0
5 B 2003 Germany 0
6 C 1999 Japan 1
7 C 2000 Australia 1
8 C 2001 France 1
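All three approaches above flag only the first occurrence of each id/vacation pair. If, as in the question's expected output for C's Japan trips, a destination should be flagged again after a gap year, one sketch (assuming the df from the question, with new_2 as an assumed column name) is to compare each visit against the previous visit's year:

# 'year' arrives as strings in the question's data, so cast it first
df["year"] = df["year"].astype(int)

# for each row, the previous year this id visited this destination (NaN if never)
prev = (df.sort_values("year")
          .groupby(["id", "vacation"])["year"]
          .shift())

# new when never visited before, or when the last visit was more than a year ago
df["new_2"] = (prev.isna() | (df["year"] - prev).gt(1)).astype(int)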

Python - Cleaning US and Canadian Zip Codes with `df.loc` and `str` Methods

I have the following code to create a column with cleaned-up zip codes for the USA and Canada:
df = pd.read_csv(file1)
usa = df['Region'] == 'USA'
canada = df['Region'] == 'Canada'
df.loc[usa, 'ZipCleaned'] = df.loc[usa, 'Zip'].str.slice(stop=5)
df.loc[canada, 'ZipCleaned'] = df.loc[canada, 'Zip'].str.replace(' |-', '', regex=True)
The issue I am having is that some of the rows with "USA" as the country contain Canadian postal codes, so the USA logic above is being applied to Canadian postal codes.
I tried the code above along with the following, experimenting with one province ("BC") to keep the USA logic from being applied in that case, but it didn't work:
usa = df['Region'] == 'USA'
usa = df['State'] != 'BC'
Region Date State Zip Customer Revenue
USA 1/3/2014 BC A5Z 1B6 Customer A $157.52
Canada 1/13/2014 AB Z3J-4E5 Customer B $750.00
USA 1/4/2014 FL 90210-9999 Customer C $650.75
USA 1/21/2014 FL 12345 Customer D $242.00
USA 1/25/2014 FL 45678 Customer E $15.00
USA 1/28/2014 NY 91011 Customer F $25.00
Thanks Kris. But what if I wanted to keep the original values in the Region column and set ZipCleaned based on whether Zip contains a Canadian or a USA zip? I tried the following, but it's not working:
usa = df.loc[df['Ship To Customer Zip'].str.contains('[0-9]')]
canada = df.loc[df['Ship To Customer Zip'].str.contains('[A-Za-z]')]
df.loc[usa, 'ZipCleaned'] = df.loc[usa, 'Ship To Customer Zip'].str.slice(stop=5)
df.loc[canada, 'ZipCleaned'] = df.loc[canada, 'Ship To Customer Zip'].str.replace(' |-','')
Give this a try:
# sample df provided by OP
>>> df
Region Date State Zip Customer Revenue
0 USA 2014-01-03 BC A5Z 1B6 Customer A 157.52
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750
2 USA 2014-01-04 FL 90210-999 Customer C 650.75
3 USA 2014-01-21 FL 12345 Customer D 242
4 USA 2014-01-25 FL 45678 Customer E 15
5 USA 2014-01-28 NY 91011 Customer F 25
# Edit 'Region' by testing 'Zip' for presence of letters (US Zip Codes are only numeric)
>>> df.loc[df['Zip'].str.contains('[A-Za-z]'), 'Region'] = 'Canada'
>>> df
Region Date State Zip Customer Revenue
0 Canada 2014-01-03 BC A5Z 1B6 Customer A 157.52
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750
2 USA 2014-01-04 FL 90210-999 Customer C 650.75
3 USA 2014-01-21 FL 12345 Customer D 242
4 USA 2014-01-25 FL 45678 Customer E 15
5 USA 2014-01-28 NY 91011 Customer F 25
# apply OP's original filtering and cleaning
>>> usa = df['Region'] == 'USA'
>>> canada = df['Region'] == 'Canada'
>>> df.loc[usa, 'ZipCleaned'] = df.loc[usa, 'Zip'].str.slice(stop=5)
>>> df.loc[canada, 'ZipCleaned'] = df.loc[canada, 'Zip'].str.replace(' |-', '', regex=True)
# display resultant df
>>> df
Region Date State Zip Customer Revenue ZipCleaned
0 Canada 2014-01-03 BC A5Z 1B6 Customer A 157.52 A5Z1B6
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750 Z3J4E5
2 USA 2014-01-04 FL 90210-999 Customer C 650.75 90210
3 USA 2014-01-21 FL 12345 Customer D 242 12345
4 USA 2014-01-25 FL 45678 Customer E 15 45678
5 USA 2014-01-28 NY 91011 Customer F 25 91011
EDIT: Update as requested by OP: we can do the following to leave the original 'Region' intact:
>>> df
Region Date State Zip Customer Revenue
0 USA 2014-01-03 BC A5Z 1B6 Customer A 157.52
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750
2 USA 2014-01-04 FL 90210-999 Customer C 650.75
3 USA 2014-01-21 FL 123456 Customer D 242
4 USA 2014-01-25 FL 45678 Customer E 15
5 USA 2014-01-28 NY 91011 Customer F 25
# create 'ZipCleaned' by referencing original 'Zip'
>>> df.loc[~df['Zip'].str.contains('[A-Za-z]'), 'ZipCleaned'] = df['Zip'].str.slice(stop=5)
>>> df.loc[df['Zip'].str.contains('[A-Za-z]'), 'ZipCleaned'] = df['Zip'].str.replace(' |-', '', regex=True)
# Resultant df
>>> df
Region Date State Zip Customer Revenue ZipCleaned
0 USA 2014-01-03 BC A5Z 1B6 Customer A 157.52 A5Z1B6
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750 Z3J4E5
2 USA 2014-01-04 FL 90210-999 Customer C 650.75 90210
3 USA 2014-01-21 FL 123456 Customer D 242 12345
4 USA 2014-01-25 FL 45678 Customer E 15 45678
5 USA 2014-01-28 NY 91011 Customer F 25 91011
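One caution on the follow-up attempt quoted above: df.loc[mask] returns a DataFrame, not a boolean mask, so assigning it to usa / canada and reusing those names inside df.loc[...] will not behave as intended. A sketch of that attempt with boolean Series instead (the column name Ship To Customer Zip is taken from the comment; na=False guards empty cells, and the USA test must exclude letters because Canadian postal codes contain digits as well):

zips = df['Ship To Customer Zip']
canada = zips.str.contains('[A-Za-z]', na=False)
usa = zips.str.contains('[0-9]', na=False) & ~canada

df.loc[usa, 'ZipCleaned'] = zips.str.slice(stop=5)
df.loc[canada, 'ZipCleaned'] = zips.str.replace(' |-', '', regex=True)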

If a column has duplicates and the next column has a value, how to add that value to the duplicates

There is a dataframe with the columns below as input. If a username has duplicate rows, the Owner Region should be filled in for the duplicates as well.
Queue  Owner Region  username
xxy                  aan
xyz    india         aan
yyx                  aandiapp
xox    UK            aandiapp
yox    china         aashwins
zxy                  aashwins
yoz    aus           aasyed
zxo                  aasyed
The required output should be
Queue  Owner Region  username
xxy    india         aan
xyz    india         aan
yyx    UK            aandiapp
xox    UK            aandiapp
yox    china         aashwins
zxy    china         aashwins
yoz    aus           aasyed
zxo    aus           aasyed
Please, can anyone help me? Thanks in advance.
I think you need to replace the empty values with NaN first and then, per group, fill them by forward and back filling:
import numpy as np

df['Owner Region'] = df['Owner Region'].replace('', np.nan)
df['Owner Region'] = df.groupby('username')['Owner Region'].transform(lambda x: x.ffill().bfill())
You can use mask and groupby.
df['Owner Region'] = (
    df['Owner Region']
    .mask(df['Owner Region'].str.len().eq(0))
    .groupby(df.username)
    .ffill()
    .bfill())
df
Queue Owner Region username
0 xxy india aan
1 xyz india aan
2 yyx UK aandiapp
3 xox UK aandiapp
4 yox china aashwins
5 zxy china aashwins
6 yoz aus aasyed
7 zxo aus aasyed
When you call groupby + ffill, the subsequent bfill call does not require a groupby.
But if it is possible for a group to contain only NaNs, you cannot avoid the apply:
df['Owner Region'] = (
    df['Owner Region']
    .mask(df['Owner Region'].str.len().eq(0))
    .groupby(df.username)
    .apply(lambda x: x.ffill().bfill()))
df
Queue Owner Region username
0 xxy india aan
1 xyz india aan
2 yyx UK aandiapp
3 xox UK aandiapp
4 yox china aashwins
5 zxy china aashwins
6 yoz aus aasyed
7 zxo aus aasyed
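Since every username group here carries exactly one non-empty region, a shorter variant is transform('first'), which broadcasts the first non-null value of each group. A sketch with the data reconstructed from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Queue': ['xxy', 'xyz', 'yyx', 'xox', 'yox', 'zxy', 'yoz', 'zxo'],
    'Owner Region': ['', 'india', '', 'UK', 'china', '', 'aus', ''],
    'username': ['aan', 'aan', 'aandiapp', 'aandiapp',
                 'aashwins', 'aashwins', 'aasyed', 'aasyed'],
})

# blanks become NaN so that the group-wise 'first' skips them
df['Owner Region'] = df['Owner Region'].replace('', np.nan)
df['Owner Region'] = df.groupby('username')['Owner Region'].transform('first')
print(df)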
