Read excel and reformat the multi-index headers in Pandas - python-3.x

Given an Excel file with the following format:
Reading with pd.read_clipboard, I get:
year 2018 Unnamed: 2 2019 Unnamed: 4
0 city quantity price quantity price
1 bj 10 2 4 7
2 sh 6 8 3 4
Just wondering if it's possible to convert to the following format with Pandas:
year city quantity price
0 2018 bj 10 2
1 2019 bj 4 7
2 2018 sh 6 8
3 2019 sh 3 4

I think it is best to read the Excel file into a DataFrame with a MultiIndex in the columns and the first column as the index:
df = pd.read_excel(file, header=[0,1], index_col=[0])
print (df)
year 2018 2019
city quantity price quantity price
bj 10 2 4 7
sh 6 8 3 4
print (df.columns)
MultiIndex([('2018', 'quantity'),
            ('2018',    'price'),
            ('2019', 'quantity'),
            ('2019',    'price')],
           names=['year', 'city'])
Then reshape by DataFrame.stack, change the order of the levels by DataFrame.swaplevel, set the index and column names by DataFrame.rename_axis, and last convert the index to columns by DataFrame.reset_index; if necessary, convert year to integers:
df1 = (df.stack(0)
         .swaplevel(0, 1)
         .rename_axis(index=['year', 'city'], columns=None)
         .reset_index()
         .assign(year=lambda x: x['year'].astype(int)))
print (df1)
year city price quantity
0 2018 bj 2 10
1 2019 bj 7 4
2 2018 sh 8 6
3 2019 sh 4 3

Related

How to find again the index after pivoting dataframe?

I created a dataframe from a CSV file containing data on the number of deaths by year (running from 1946 to 2021) and month (within year):
dataD = pd.read_csv('MY_FILE.csv', sep=',')
The first rows (out of 902) of the output are:
dataD
Year Month Deaths
0 2021 2 55500
1 2021 1 65400
2 2020 12 62800
3 2020 11 64700
4 2020 10 56900
As expected, the dataframe contains an index numbered 0,1,2, ... and so on.
Now, I pivot this dataframe in order to have only one row per year and the months in columns, using the following code:
dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths')
The first rows of the result are now:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
My question is:
What do I have to change in the pivoting code above in order to get back the usual index 0, 1, 2, ... when I output the pivoted file? I understand I need to specify index=*** to make the pivot instruction run, but afterwards I would like to recover a default index, exactly as in my original dataD.
Any possibility?
You can reset_index() after pivoting:
dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths').reset_index()
This would give you the following:
Month Year 1 2 3 4 5 6 7 8 9 10 11 12
0 1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1 1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
2 1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
3 1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
4 1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
Note that "Month" here might look like the index name but is actually the columns name (dataDW.columns.name). You can unset it if preferred:
dataDW.columns.name = None
Which then gives you:
Year 1 2 3 4 5 6 7 8 9 10 11 12
0 1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1 1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
2 1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
3 1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
4 1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
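The reset_index and the columns-name cleanup can also be chained in one go with rename_axis; a minimal sketch using a few hypothetical sample rows:

```python
import pandas as pd

dataD = pd.DataFrame({
    'Year':   [2021, 2021, 2020],
    'Month':  [2, 1, 12],
    'Deaths': [55500, 65400, 62800],
})

dataDW = (dataD.pivot(index='Year', columns='Month', values='Deaths')
               .reset_index()
               .rename_axis(columns=None))   # drop the leftover "Month" columns name
```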

Unstack a dataframe with duplicated index in Pandas

Given a toy dataset as follows, which has duplicated price and quantity rows:
city item value
0 bj price 12
1 bj quantity 15
2 bj price 12
3 bj quantity 15
4 bj level a
5 sh price 45
6 sh quantity 13
7 sh price 56
8 sh quantity 7
9 sh level b
I want to reshape it into the following dataframe, which means add sell_ for the first pair and buy_ for the second pair:
city sell_price sell_quantity buy_price buy_quantity level
0 bj 12 15 12 15 a
1 sh 45 13 56 7 b
I have tried with df.set_index(['city', 'item']).unstack().reset_index(), but it raises an error: ValueError: Index contains duplicate entries, cannot reshape.
How could I get the desired output as above? Thanks.
You can prefix the first of each pair of duplicates with sell_ and the second with buy_, changing the values in item before applying your solution:
import numpy as np

m1 = df.duplicated(['city', 'item'])
m2 = df.duplicated(['city', 'item'], keep=False)
# first of each duplicated pair gets 'sell_', the second gets 'buy_'
df['item'] = np.where(m1, 'buy_', np.where(m2, 'sell_', '')) + df['item']
df = (df.set_index(['city', 'item'])['value']
        .unstack()
        .reset_index()
        .rename_axis(None, axis=1))
# change the order of the columns
df = df[['city', 'sell_price', 'sell_quantity', 'buy_price', 'buy_quantity', 'level']]
print (df)
city sell_price sell_quantity buy_price buy_quantity level
0 bj 12 15 12 15 a
1 sh 45 13 56 7 b
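An equivalent sketch (variable names hypothetical) that numbers the repeats with groupby.cumcount instead of two duplicated masks:

```python
import pandas as pd

df = pd.DataFrame({
    'city':  ['bj'] * 5 + ['sh'] * 5,
    'item':  ['price', 'quantity', 'price', 'quantity', 'level'] * 2,
    'value': [12, 15, 12, 15, 'a', 45, 13, 56, 7, 'b'],
})

# number the repeats of each (city, item) pair: 0 -> sell_, 1 -> buy_
prefix = df.groupby(['city', 'item']).cumcount().map({0: 'sell_', 1: 'buy_'})
# only prefix rows that actually have a duplicate (leaves 'level' untouched)
dup = df.duplicated(['city', 'item'], keep=False)
df['item'] = prefix.where(dup, '') + df['item']

out = (df.set_index(['city', 'item'])['value']
         .unstack()
         .reset_index()
         .rename_axis(None, axis=1))
```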

Update multiple columns from another dataframe based on one common column in Pandas

Given the following two dataframes:
df1:
id city district year price
0 1 bjs cyq 2018 12
1 2 bjs cyq 2019 6
2 3 sh hp 2018 4
3 4 shs hpq 2019 3
df2:
id city district year
0 1 bj cy 2018
1 2 bj cy 2019
2 4 sh hp 2019
Let's say some values in city and district from df1 have errors, so I need to update the city and district values in df1 with those of df2, matched on id. My expected result is like this:
id city district year price
0 1 bj cy 2018 12
1 2 bj cy 2019 6
2 3 sh hp 2018 4
3 4 sh hp 2019 3
How could I do that in Pandas? Thanks.
Update:
Solution 1:
cities = df2.set_index('id')['city']
district = df2.set_index('id')['district']
df1['city'] = df1['id'].map(cities)
df1['district'] = df1['id'].map(district)
Solution 2:
df1[["city","district"]] = pd.merge(df1,df2,on=["id"],how="left")[["city_y","district_y"]]
print(df1)
Out:
id city district year price
0 1 bj cy 2018 12
1 2 bj cy 2019 6
2 3 NaN NaN 2018 4
3 4 sh hp 2019 3
Note that city and district for id 3 are NaN, but I want to keep the values from df1.
Try combine_first:
df2.set_index('id').combine_first(df1.set_index('id')).reset_index()
Output:
id city district price year
0 1 bj cy 12.0 2018.0
1 2 bj cy 6.0 2019.0
2 3 sh hp 4.0 2018.0
3 4 sh hp 3.0 2019.0
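Note that combine_first sorts the union of columns alphabetically and upcasts to float wherever alignment introduces NaN, which is why price and year come out as floats above. A sketch (rebuilding the two frames from the question) that restores the original column order and integer dtypes:

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'city': ['bjs', 'bjs', 'sh', 'shs'],
    'district': ['cyq', 'cyq', 'hp', 'hpq'],
    'year': [2018, 2019, 2018, 2019],
    'price': [12, 6, 4, 3],
})
df2 = pd.DataFrame({
    'id': [1, 2, 4],
    'city': ['bj', 'bj', 'sh'],
    'district': ['cy', 'cy', 'hp'],
    'year': [2018, 2019, 2019],
})

out = (df2.set_index('id')
          .combine_first(df1.set_index('id'))   # df2 wins, gaps filled from df1
          .reset_index()
          .reindex(columns=df1.columns)         # restore the original column order
          .astype({'year': int, 'price': int}))
```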
Try this
df1[["city","district"]] = pd.merge(df1,df2,on=["id"],how="left")[["city_y","district_y"]]
IIUC, we can use .map
edit - input changed.
target_cols = ['city', 'district']
df1.loc[df1['id'].isin(df2['id']), target_cols] = np.nan
cities = df2.set_index('id')['city']
districts = df2.set_index('id')['district']
df1['city'] = df1['city'].fillna(df1['id'].map(cities))
df1['district'] = df1['district'].fillna(df1['id'].map(districts))
print(df1)
   id city district  year  price
0   1   bj       cy  2018     12
1   2   bj       cy  2019      6
2   3   sh       hp  2018      4
3   4   sh       hp  2019      3

Groupby one column and forward replace values in multiple columns based on condition using Pandas

Given a dataframe as follows:
id city district year price
0 1 bj cy 2018 8
1 1 bj cy 2019 6
2 1 xd dt 2020 7
3 2 sh hp 2018 4
4 2 sh hp 2019 3
5 2 sh pd 2020 5
Say there are typos in the city and district columns for rows whose year is 2020, so I want to group by id and forward-fill those columns with the previous values.
How could I do that in Pandas? Thanks a lot.
The desired output will like this:
id city district year price
0 1 bj cy 2018 8
1 1 bj cy 2019 6
2 1 bj cy 2020 7
3 2 sh hp 2018 4
4 2 sh hp 2019 3
5 2 sh hp 2020 5
The following code works, but I'm not sure if it's the best solution. If you have other solutions, welcome to share. Thanks.
import numpy as np

df.loc[df['year'].isin([2020]), ['city', 'district']] = np.nan
df[['city', 'district']] = df[['city', 'district']].fillna(df.groupby('id')[['city', 'district']].ffill())
Out:
id city district year price
0 1 bj cy 2018 8
1 1 bj cy 2019 6
2 1 bj cy 2020 7
3 2 sh hp 2018 4
4 2 sh hp 2019 3
5 2 sh hp 2020 5
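An equivalent sketch using mask instead of a loc assignment (assuming year is an integer column): the 2020 values are treated as missing and then forward-filled within each id group.

```python
import pandas as pd

df = pd.DataFrame({
    'id':       [1, 1, 1, 2, 2, 2],
    'city':     ['bj', 'bj', 'xd', 'sh', 'sh', 'sh'],
    'district': ['cy', 'cy', 'dt', 'hp', 'hp', 'pd'],
    'year':     [2018, 2019, 2020, 2018, 2019, 2020],
    'price':    [8, 6, 7, 4, 3, 5],
})

cols = ['city', 'district']
# blank out the 2020 rows, then forward-fill within each id group
df[cols] = df[cols].mask(df['year'].eq(2020)).groupby(df['id']).ffill()
```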

How to split rows in pandas with special condition of date?

I have a DataFrame like:
Code Date sales
1 2/2013 10
1 3/2013 11
2 3/2013 12
2 4/2013 14
...
I want to convert it into a DataFrame with a timeline, code, and sales of each type of item:
Date Code Sales1 Code Sales2
2/2013 1 10 NA NA
3/2013 1 11 2 12
4/2013 NA NA 2 14
....
or into a simpler way:
Date Code Sales1 Date Code Sales2 .....
2/2013 1 10 3/2013 2 12
3/2013 1 11 4/2013 2 14
or even into the simplest way, splitting into many small DataFrames
IIUC, using concat with the groupby result:
df.index = df.groupby('Code').cumcount()  # create the key for concat
pd.concat([x for _, x in df.groupby('Code')], axis=1)
Out[392]:
Code Date sales Code Date sales
0 1 2/2013 10 2 3/2013 12
1 1 3/2013 11 2 4/2013 14
Actually, I was overcomplicating things by splitting the data that way; I rethought it and solved the problem with pivot_table:
pd.pivot_table(df, values=['sales'], index=['Code'], columns=['Date'])
and the result should look like:
        sales
Date   2/2013 3/2013 4/2013 ....
Code
1        10     11    NaN
2       NaN     12     14
...
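A runnable sketch of the pivot_table approach, with the column names capitalized as in the sample data (the default aggfunc is mean, which is a no-op here since each Code/Date pair appears once):

```python
import pandas as pd

df = pd.DataFrame({
    'Code':  [1, 1, 2, 2],
    'Date':  ['2/2013', '3/2013', '3/2013', '4/2013'],
    'sales': [10, 11, 12, 14],
})

# one row per Code, one column per Date; missing combinations become NaN
wide = pd.pivot_table(df, values='sales', index='Code', columns='Date')
```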
