Read multi-index excel file and reshape the headers in Pandas - python-3.x

Given an Excel file data.xlsx as follows:
I have read it with df = pd.read_excel('data.xlsx', header=[0, 1], index_col=[0, 1], sheet_name='Sheet1').
Out:
district       2018         2019
              price ratio  price ratio
bj      cy       12  0.01      6  0.02
sh      hp        4  0.02      3  0.05
I wonder if it's possible to transform it to the following format? Thank you for your help.

Use DataFrame.stack with DataFrame.rename_axis and DataFrame.reset_index:
df = df.stack(0).rename_axis(('city','district','year')).reset_index()
print (df)
city district year price ratio
0 bj cy 2018 12 0.01
1 bj cy 2019 6 0.02
2 sh hp 2018 4 0.02
3 sh hp 2019 3 0.05
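Since data.xlsx is not attached here, a minimal sketch that rebuilds the frame above by hand (the column level name district mirrors the print above) and applies the same chain:

import pandas as pd

# rebuild the frame shown in the question
columns = pd.MultiIndex.from_product([[2018, 2019], ['price', 'ratio']],
                                     names=['district', None])
index = pd.MultiIndex.from_tuples([('bj', 'cy'), ('sh', 'hp')])
df = pd.DataFrame([[12, 0.01, 6, 0.02], [4, 0.02, 3, 0.05]],
                  index=index, columns=columns)
df = df.stack(0).rename_axis(('city', 'district', 'year')).reset_index()
print(df)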

Related

Create new columns by comparing the current row's values and previous in Pandas

Given a dummy dataset df as follows:
year v1 v2
0 2017 0.3 0.1
1 2018 0.1 0.1
2 2019 -0.2 0.5
3 2020 NaN -0.3
4 2021 0.8 0.0
or:
[{'year': 2017, 'v1': 0.3, 'v2': 0.1},
{'year': 2018, 'v1': 0.1, 'v2': 0.1},
{'year': 2019, 'v1': -0.2, 'v2': 0.5},
{'year': 2020, 'v1': nan, 'v2': -0.3},
{'year': 2021, 'v1': 0.8, 'v2': 0.0}]
I need to create two more columns trend_v1 and trend_v2 based on v1 and v2 respectively.
The logic for the new columns is: for each column, if the current value is greater than the previous one, the trend is increase; if it is less than the previous one, the trend is decrease; if it is equal to the previous one, the trend is equal; and if the current or previous value is NaN, the trend is also NaN.
year v1 v2 trend_v1 trend_v2
0 2017 0.3 0.1 NaN NaN
1 2018 0.1 0.1 decrease equal
2 2019 -0.2 0.5 decrease increase
3 2020 NaN -0.3 NaN decrease
4 2021 0.8 0.0 NaN increase
How could I achieve that in Pandas? Thanks for your help in advance.
You can select the columns to test and compare them with their shifted values, handling missing values separately:
cols = ['v1','v2']
arr = np.where(df[cols] < df[cols].shift(), 'decrease',
      np.where(df[cols] > df[cols].shift(), 'increase',
      np.where(df[cols].isna() | df[cols].shift().isna(), None, 'equal')))
df = df.join(pd.DataFrame(arr, columns=cols, index=df.index).add_prefix('trend_'))
print (df)
year v1 v2 trend_v1 trend_v2
0 2017 0.3 0.1 None None
1 2018 0.1 0.1 decrease equal
2 2019 -0.2 0.5 decrease increase
3 2020 NaN -0.3 None decrease
4 2021 0.8 0.0 None increase
Or:
cols = ['v1','v2']
m1 = df[cols] < df[cols].shift()
m2 = df[cols] > df[cols].shift()
m3 = df[cols].isna() | df[cols].shift().isna()
arr = np.select([m1, m2, m3],['decrease','increase', None], default='equal')
df = df.join(pd.DataFrame(arr, columns=cols, index=df.index).add_prefix('trend_'))
EDIT:
A nice improvement is to change m3 as mentioned in the comments:
cols = ['v1','v2']
m1 = df[cols] < df[cols].shift()
m2 = df[cols] > df[cols].shift()
m3 = df[cols] == df[cols].shift()
arr = np.select([m1, m2, m3],['decrease','increase', 'equal'], default=None)
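As another sketch (not from the original answer; the frame is rebuilt by hand), the sign of diff() maps straight onto the three labels, and NaN propagates on its own because diff() is NaN whenever the current or previous value is missing:

import numpy as np
import pandas as pd

df = pd.DataFrame({'year': [2017, 2018, 2019, 2020, 2021],
                   'v1': [0.3, 0.1, -0.2, np.nan, 0.8],
                   'v2': [0.1, 0.1, 0.5, -0.3, 0.0]})

cols = ['v1', 'v2']
# sign of the row-to-row difference: 1 -> increase, -1 -> decrease, 0 -> equal, NaN stays NaN
trend = np.sign(df[cols].diff()).replace({1.0: 'increase', -1.0: 'decrease', 0.0: 'equal'})
df = df.join(trend.add_prefix('trend_'))
print(df)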

Update multiple columns from another dataframe based on one common column in Pandas

Given the following two dataframes:
df1:
id city district year price
0 1 bjs cyq 2018 12
1 2 bjs cyq 2019 6
2 3 sh hp 2018 4
3 4 shs hpq 2019 3
df2:
id city district year
0 1 bj cy 2018
1 2 bj cy 2019
2 4 sh hp 2019
Let's say some values in city and district in df1 have errors, so I need to update the city and district values in df1 with those from df2, matched on id. My expected result is:
id city district year price
0 1 bj cy 2018 12
1 2 bj cy 2019 6
2 3 sh hp 2018 4
3 4 sh hp 2019 3
How could I do that in Pandas? Thanks.
Update:
Solution 1:
cities = df2.set_index('id')['city']
district = df2.set_index('id')['district']
df1['city'] = df1['id'].map(cities)
df1['district'] = df1['id'].map(district)
Solution 2:
df1[["city","district"]] = pd.merge(df1,df2,on=["id"],how="left")[["city_y","district_y"]]
print(df1)
Out:
id city district year price
0 1 bj cy 2018 12
1 2 bj cy 2019 6
2 3 NaN NaN 2018 4
3 4 sh hp 2019 3
Note that the city and district for id 3 are NaN, but I want to keep the values from df1.
Try combine_first:
df2.set_index('id').combine_first(df1.set_index('id')).reset_index()
Output:
id city district price year
0 1 bj cy 12.0 2018.0
1 2 bj cy 6.0 2019.0
2 3 sh hp 4.0 2018.0
3 4 sh hp 3.0 2019.0
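Note that combine_first reorders the columns and upcasts price and year to float; if that matters, a small sketch (using the same frames) to restore the original order and integer dtypes afterwards:

out = df2.set_index('id').combine_first(df1.set_index('id')).reset_index()
# put the columns back in the original order and cast back to int
out = out[['id', 'city', 'district', 'year', 'price']].astype({'year': int, 'price': int})
print(out)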
Try this
df1[["city","district"]] = pd.merge(df1,df2,on=["id"],how="left")[["city_y","district_y"]]
IIUC, we can use .map.
Edit: the input changed.
target_cols = ['city','district']
df1.loc[df1['id'].isin(df2['id']),target_cols] = np.nan
cities = df2.set_index('id')['city']
district = df2.set_index('id')['district']
df1['city'] = df1['city'].fillna(df1['id'].map(cities))
df1['district'] = df1['district'].fillna(df1['id'].map(district))
print(df1)
id city district year price
0 1 bj cy 2018 12
1 2 bj cy 2019 6
2 3 sh hp 2018 4
3 4 sh hp 2019 3
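Another option, not shown above, is DataFrame.update, which aligns on the index and overwrites only where df2 actually supplies a value, so id 3 keeps its df1 values. A minimal sketch with the frames rebuilt by hand:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'city': ['bjs', 'bjs', 'sh', 'shs'],
                    'district': ['cyq', 'cyq', 'hp', 'hpq'],
                    'year': [2018, 2019, 2018, 2019],
                    'price': [12, 6, 4, 3]})
df2 = pd.DataFrame({'id': [1, 2, 4],
                    'city': ['bj', 'bj', 'sh'],
                    'district': ['cy', 'cy', 'hp'],
                    'year': [2018, 2019, 2019]})

df1 = df1.set_index('id')
# update works in place: non-NaN cells of df2 overwrite the aligned cells of df1
df1.update(df2.set_index('id')[['city', 'district']])
df1 = df1.reset_index()
print(df1)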

Groupby one column and forward replace values in multiple columns based on condition using Pandas

Given a dataframe as follows:
id city district year price
0 1 bj cy 2018 8
1 1 bj cy 2019 6
2 1 xd dt 2020 7
3 2 sh hp 2018 4
4 2 sh hp 2019 3
5 2 sh pd 2020 5
Say there are typos in the city and district columns for rows where year is 2020, so I want to group by id and ffill those columns with the previous values.
How could I do that in Pandas? Thanks a lot.
The desired output will like this:
id city district year price
0 1 bj cy 2018 8
1 1 bj cy 2019 6
2 1 bj cy 2020 7
3 2 sh hp 2018 4
4 2 sh hp 2019 3
5 2 sh hp 2020 5
The following code works, but I'm not sure it's the best solution.
If you have other approaches, please share. Thanks.
df.loc[df['year'].isin(['2020']), ['city', 'district']] = np.nan
df[['city', 'district']] = df[['city', 'district']].fillna(df.groupby('id')[['city', 'district']].ffill())
Out:
id city district year price
0 1 bj cy 2018 8
1 1 bj cy 2019 6
2 1 bj cy 2020 7
3 2 sh hp 2018 4
4 2 sh hp 2019 3
5 2 sh hp 2020 5
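As a slightly simpler variant (a sketch, with the frame rebuilt by hand), the outer fillna is not needed: blank out the 2020 rows and assign the group-wise ffill directly:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                   'city': ['bj', 'bj', 'xd', 'sh', 'sh', 'sh'],
                   'district': ['cy', 'cy', 'dt', 'hp', 'hp', 'pd'],
                   'year': [2018, 2019, 2020, 2018, 2019, 2020],
                   'price': [8, 6, 7, 4, 3, 5]})

cols = ['city', 'district']
# blank out the suspect 2020 rows, then forward-fill within each id group
df.loc[df['year'].eq(2020), cols] = np.nan
df[cols] = df.groupby('id')[cols].ffill()
print(df)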

Read excel and reformat the multi-index headers in Pandas

Given an Excel file with the following format:
Reading with pd.read_clipboard, I get:
year 2018 Unnamed: 2 2019 Unnamed: 4
0 city quantity price quantity price
1 bj 10 2 4 7
2 sh 6 8 3 4
Just wondering if it's possible to convert to the following format with Pandas:
year city quantity price
0 2018 bj 10 2
1 2019 bj 4 7
2 2018 sh 6 8
3 2019 sh 3 4
I think it is best here to convert the Excel file to a DataFrame with a MultiIndex in the columns and the first column as the index:
df = pd.read_excel(file, header=[0,1], index_col=[0])
print (df)
year      2018           2019
city  quantity price quantity price
bj          10     2        4     7
sh           6     8        3     4
print (df.columns)
MultiIndex([('2018', 'quantity'),
            ('2018',    'price'),
            ('2019', 'quantity'),
            ('2019',    'price')],
           names=['year', 'city'])
Then reshape with DataFrame.stack, change the order of the levels with DataFrame.swaplevel, set the index and column names with DataFrame.rename_axis, and last convert the index to columns; if necessary, convert year to integers:
df1 = (df.stack(0)
         .swaplevel(0,1)
         .rename_axis(index=['year','city'], columns=None)
         .reset_index()
         .assign(year=lambda x: x['year'].astype(int)))
print (df1)
year city price quantity
0 2018 bj 2 10
1 2019 bj 7 4
2 2018 sh 8 6
3 2019 sh 4 3
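Since the Excel file is not attached here, a minimal sketch that rebuilds the same MultiIndex frame by hand and runs the chain above:

import pandas as pd

# rebuild the frame that read_excel produces, matching the print above
columns = pd.MultiIndex.from_product([['2018', '2019'], ['quantity', 'price']],
                                     names=['year', 'city'])
df = pd.DataFrame([[10, 2, 4, 7], [6, 8, 3, 4]],
                  index=['bj', 'sh'], columns=columns)
df1 = (df.stack(0)
         .swaplevel(0, 1)
         .rename_axis(index=['year', 'city'], columns=None)
         .reset_index()
         .assign(year=lambda x: x['year'].astype(int)))
print(df1)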

Python correlation matrix 3d dataframe

I have a historical return table in SQL Server, by date and asset Id, like this:
[Date] [Asset] [1DRet]
jan asset1 0.52
jan asset2 0.12
jan asset3 0.07
feb asset1 0.41
feb asset2 0.33
feb asset3 0.21
...
So I need to calculate the correlation matrix for a given date range for all asset combinations: A1,A2; A1,A3; A2,A3.
I'm using pandas, and in my SQL SELECT WHERE clause I'm filtering the date range and ordering by date.
I'm trying to do it using pandas df.corr(), numpy.corrcoef and SciPy, but I'm not able to do it for my n-variable dataframe.
The examples I see are always for a dataframe where you have one asset per column and one row per day.
This is the code block where I'm doing it:
qryRet = "Select * from IndexesValue where Date > '20100901' and Date < '20150901' order by Date"
result = conn.execute(qryRet)
df = pd.DataFrame(data=list(result),columns=result.keys())
df1d = df[['Date','Id_RiskFactor','1DReturn']]
corr = df1d.set_index(['Date','Id_RiskFactor']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
print(corr)
conn.close()
For it I'm receiving this message:
corr.columns = corr.columns.droplevel()
AttributeError: 'Index' object has no attribute 'droplevel'
print(df1d.head()):
Date Id_RiskFactor 1DReturn
0 2010-09-02 149 0E-12
1 2010-09-02 150 -0.004242875148
2 2010-09-02 33 0.000590000011
3 2010-09-02 28 0.000099999997
4 2010-09-02 34 -0.000010000000
print(df.head()):
Date Id_RiskFactor Value 1DReturn 5DReturn
0 2010-09-02 149 0.040096000000 0E-12 0E-12
1 2010-09-02 150 1.736700000000 -0.004242875148 -0.013014321215
2 2010-09-02 33 2.283000000000 0.000590000011 0.001260000048
3 2010-09-02 28 2.113000000000 0.000099999997 0.000469999999
4 2010-09-02 34 0.615000000000 -0.000010000000 0.000079999998
print(corr.columns):
Index([], dtype='object')
Create a sample DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'daily_return': np.random.random(15),
                   'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
                   'date': np.tile(pd.date_range('1-1-2015', periods=5), 3)})
>>> df
daily_return date symbol
0 0.011467 2015-01-01 A
1 0.613518 2015-01-02 A
2 0.334343 2015-01-03 A
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
5 0.431729 2015-01-01 B
6 0.474905 2015-01-02 B
7 0.372366 2015-01-03 B
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
10 0.946504 2015-01-01 C
11 0.337204 2015-01-02 C
12 0.798704 2015-01-03 C
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
I'll assume you've already filtered your DataFrame for the relevant dates. You then want a pivot table where you have unique dates as your index and your symbols as separate columns, with daily returns as the values. Finally, you call corr() on the result.
corr = df.set_index(['date','symbol']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
>>> corr
symbol_2 A B C
symbol_1
A 1.000000 0.188065 -0.745115
B 0.188065 1.000000 -0.688808
C -0.745115 -0.688808 1.000000
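As a side note (a sketch, not part of the original answer), pivoting directly gives the same matrix without the droplevel and index fix-ups:

# one column per symbol, one row per date, then corr() works directly
corr = (df.pivot(index='date', columns='symbol', values='daily_return')
          .corr()
          .rename_axis(index='symbol_1', columns='symbol_2'))
print(corr)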
You can select the subset of your DataFrame based on dates as follows:
start_date = pd.Timestamp('2015-1-4')
end_date = pd.Timestamp('2015-1-5')
>>> df.loc[df.date.between(start_date, end_date), :]
daily_return date symbol
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
If you want to flatten your correlation matrix:
corr.stack().reset_index()
symbol_1 symbol_2 0
0 A A 1.000000
1 A B 0.188065
2 A C -0.745115
3 B A 0.188065
4 B B 1.000000
5 B C -0.688808
6 C A -0.745115
7 C B -0.688808
8 C C 1.000000
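If you prefer a named value column instead of the default 0, a small variant:

corr.stack().reset_index(name='correlation')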
