Converting exchange rates in pandas dataframe to GBP - python-3.x

I have a dataframe that looks like this. I'm trying to convert these stock prices to GBP using the exchange rate that corresponds to each row's currency.
import pandas as pd

df_test = pd.DataFrame({'Stock': ['AAPL', 'AAL', 'BBW', 'BBY'],
                        'Price': [123, 21, 15, 311],
                        'Currency': ['USD', 'CAD', 'EUR', 'GBP'],
                        'Date': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'],
                        'USD': [1.6, 1.5, 1.55, 1.57],
                        'CAD': [1.7, 1.79, 1.75, 1.74],
                        'EUR': [1.17, 1.21, 1.18, 1.19],
                        'GBP': [1, 1, 1, 1]})
I want to multiply the stock price in each row by that row's own exchange rate. For example, since AAPL is priced in USD, it should be multiplied by 1.6 to convert it into GBP. The end output I am trying to get would look like this:
df_end = pd.DataFrame({'Stock': ['AAPL', 'AAL', 'BBW', 'BBY'],
                       'Price': [123, 21, 15, 311],
                       'Currency': ['USD', 'CAD', 'EUR', 'GBP'],
                       'Date': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'],
                       'USD': [1.6, 1.5, 1.55, 1.57],
                       'CAD': [1.7, 1.79, 1.75, 1.74],
                       'EUR': [1.17, 1.21, 1.18, 1.19],
                       'GBP': [1, 1, 1, 1],
                       'GBP value': [196.8, 37.59, 17.7, 311]})
Any help at all on this would be very much appreciated, thank you.

This used to be done with .lookup, which is deprecated as of pandas 1.2.0. A way around this is get_indexer:
idx = df_test.columns.get_indexer(df_test['Currency'])
df_test['GBP val'] = df_test['Price'] * df_test.values[df_test.index, idx]
Output:
Stock Price Currency Date USD CAD EUR GBP GBP val
0 AAPL 123 USD 2020-01-01 1.60 1.70 1.17 1 196.8
1 AAL 21 CAD 2020-01-02 1.50 1.79 1.21 1 37.59
2 BBW 15 EUR 2020-01-03 1.55 1.75 1.18 1 17.7
3 BBY 311 GBP 2020-01-04 1.57 1.74 1.19 1 311
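
For reference, the same thing was a one-liner with the now-deprecated DataFrame.lookup (removed in pandas 2.0), shown here only for comparison:

# Works on pandas < 2.0 only; deprecated since 1.2.0.
df_test['GBP val'] = df_test['Price'] * df_test.lookup(df_test.index, df_test['Currency'])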

df_test["GBP value"] = df_test.apply(
    lambda x: x["Price"] * x[x["Currency"]], axis=1
)
print(df_test)
Prints:
Stock Price Currency Date USD CAD EUR GBP GBP value
0 AAPL 123 USD 2020-01-01 1.60 1.70 1.17 1 196.80
1 AAL 21 CAD 2020-01-02 1.50 1.79 1.21 1 37.59
2 BBW 15 EUR 2020-01-03 1.55 1.75 1.18 1 17.70
3 BBY 311 GBP 2020-01-04 1.57 1.74 1.19 1 311.00

Related

Quartile calculations and classifications, filtered by product

I am having a hard time getting this done.
What I have: pandas dataframe:
product seller price
A Yo 10
A Ka 5
A Poy 7.5
A Nyu 2.5
A Poh 1.25
B Poh 11.25
What I want:
Given a df like the one above (product, seller, price), I want to calculate 4 quartiles based on the price column for each particular product and classify the price of each seller of that product into these quartiles.
When all prices are the same, the 4 quartiles have the same value and the price is classified as 1st quartile.
Expected Output:
product seller price Quartile 1Q 2Q 3Q 4Q
A Yo 10 4 2.5 5 7.5 10
A Ka 5 2 2.5 5 7.5 10
A Poy 7.5 3 2.5 5 7.5 10
A Nyu 2.5 1 2.5 5 7.5 10
A Poh 1.25 1 2.5 5 7.5 10
B Poh 11.25 1 11.25 11.25 11.25 11.25
What I did so far:
if I use df['Price'].quantile([0.25, 0.5, 0.75, 1]), it will calculate 4 quartiles over all prices without filtering by product, so it's wrong.
I am lost because I don't know how to do this in python.
Can anyone give me some light here?
Thanks
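
For reference, the sample frame can be reconstructed roughly like this (a sketch; note the answer below expects the price column capitalized as 'Price'):

import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'A', 'A', 'A', 'A', 'B'],
    'seller': ['Yo', 'Ka', 'Poy', 'Nyu', 'Poh', 'Poh'],
    'Price': [10, 5, 7.5, 2.5, 1.25, 11.25],
})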
Try:
dfQuantile = (df.groupby("product")['Price']
                .quantile([0.25, 0.5, 0.75, 1])
                .unstack()
                .reset_index()
                .rename(columns={0.25: "1Q", 0.5: "2Q", 0.75: "3Q", 1: "4Q"}))
out = pd.merge(df, dfQuantile, on="product", how="left")
out["Quantile"] = df.groupby(['product'])['Price'].transform(
    lambda x: pd.qcut(x, 4, labels=False, duplicates="drop")).fillna(0).add(1)
print(out)
product seller Price Quantile 1Q 2Q 3Q 4Q
0 A Yo 10.00 4 2.50 5.00 7.50 10.00
1 A Ka 5.00 2 2.50 5.00 7.50 10.00
2 A Poy 7.50 3 2.50 5.00 7.50 10.00
3 A Nyu 2.50 1 2.50 5.00 7.50 10.00
4 A Poh 1.25 1 2.50 5.00 7.50 10.00
5 B Poh 11.25 1 11.25 11.25 11.25 11.25

Handling negative values in pct_change

I have this df_is_processed dataframe which contains ticker (stock code), year, and net profit. I would like to compute the percentage change of net income, grouped by ticker and ordered by year.
Ticker  Year  Net Income
AAPL    2005         151
AAPL    2004         -50
MSFT    2005          80
MSFT    2004         100
Doing that is actually straightforward with the help of the pct_change function in pandas.
df_is_processed['Delta Net Income'] = df_is_processed.sort_values('Year').groupby(['Ticker'])['Net Income'].pct_change()
However, I run into an issue when negative values are encountered, which can be seen on the AAPL ticker.
Ticker  Year  Net Income  Delta Net Income
AAPL    2005         151             -4.02
AAPL    2004         -50               NaN
MSFT    2005          80             -0.20
MSFT    2004         100               NaN
The expected outcome is that AAPL has a positive delta net income in 2005.
Ticker  Year  Net Income  Delta Net Income
AAPL    2005         151              4.02
AAPL    2004         -50               NaN
MSFT    2005          80             -0.20
MSFT    2004         100               NaN
I have tried the approach in this post, but it does not work and doesn't have a groupby option:
https://stackoverflow.com/questions/37362884/how-to-make-pandas-core-generic-pct-change-to-return-a-positive-value-when-chang
Pandas version : 1.2.2
Python version : 3.7.9
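For reference, a minimal reconstruction of the dataframe (a sketch; the name is shortened to df to match the answers below):

import pandas as pd

df = pd.DataFrame({
    'Ticker': ['AAPL', 'AAPL', 'MSFT', 'MSFT'],
    'Year': [2005, 2004, 2005, 2004],
    'Net Income': [151, -50, 80, 100],
})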
groupby + shift
s = df.sort_values('Year').groupby('Ticker')['Net Income'].shift()
df['Delta Net Income'] = df['Net Income'].sub(s).div(s.abs())
Explanation
Sort the dataframe on Year then group the dataframe by Ticker and shift the column Net Income one unit downwards.
>>> s
1 NaN
3 NaN
0 -50.0
2 100.0
Name: Net Income, dtype: float64
Subtract the shifted column s from Net Income to get the difference.
>>> df['Net Income'].sub(s)
0 201.0
1 NaN
2 -20.0
3 NaN
Name: Net Income, dtype: float64
Divide the above difference by the magnitude (absolute value) of the previous value to calculate the percent change.
>>> df['Net Income'].sub(s).div(s.abs())
0 4.02
1 NaN
2 -0.20
3 NaN
Name: Net Income, dtype: float64
Result
>>> df
Ticker Year Net Income Delta Net Income
0 AAPL 2005 151 4.02
1 AAPL 2004 -50 NaN
2 MSFT 2005 80 -0.20
3 MSFT 2004 100 NaN
Use this instead of pct_change():
>>> import numpy as np
>>> df['Delta Net Income'] = -((df['Net Income'].shift(1) - df['Net Income']) / np.abs(df['Net Income'].shift(1)) * 100)
>>> df[['Ticker', 'Delta Net Income']].groupby('Ticker').sum()
Inspired by this post: https://math.stackexchange.com/questions/716767/how-to-calculate-the-percentage-of-increase-decrease-with-negative-numbers/716770
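
If it helps, the first answer's logic can be folded into a reusable helper (a sketch; the name signed_pct_change is not from either answer):

def signed_pct_change(s):
    # Like pct_change, but divides by |previous| so the sign
    # always reflects the direction of the change.
    prev = s.shift()
    return (s - prev) / prev.abs()

df['Delta Net Income'] = (
    df.sort_values('Year')
      .groupby('Ticker')['Net Income']
      .transform(signed_pct_change)
)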

How to apply masking while creating next row value which is based on previous row's value and another column in Python Pandas?

Here is the data:
import numpy as np
import pandas as pd

data = {
    'cases': [120, 100, np.nan, np.nan, np.nan, np.nan, np.nan],
    'percent_change': [0.03, 0.01, 0.00, -0.001, 0.05, -0.1, 0.003],
    'tag': [7, 6, 5, 4, 3, 2, 1],
}
df = pd.DataFrame(data)
cases percent_change tag
0 120.0 0.030 7
1 100.0 0.010 6
2 NaN 0.000 5
3 NaN -0.001 4
4 NaN 0.050 3
5 NaN -0.100 2
6 NaN 0.003 1
I want to create the next cases value as (next value) = (previous value) * (1 + current percent_change). Specifically, I want it done only in rows that have a tag value less than 6, and I must use a mask (i.e., df.loc) for this row selection. This should give me:
cases percent_change tag
0 120.0 0.030 7
1 100.0 0.010 6
2 100.0 0.000 5
3 99.9 -0.001 4
4 104.9 0.050 3
5 94.4 -0.100 2
6 94.7 0.003 1
I tried this, but it doesn't work (the cumprod runs over the whole column, so the percent changes from the tag 7 and 6 rows get compounded in as well):
df_index = np.where(df['tag'] == 6)
index = df_index[0][0]
df.loc[(df.tag<6), 'cases'] = (df.percent_change.shift(0).fillna(1) + 1).cumprod() * df.at[index, 'cases']
cases percent_change tag
0 120.000000 0.030 7
1 100.000000 0.010 6
2 104.030000 0.000 5
3 103.925970 -0.001 4
4 109.122268 0.050 3
5 98.210042 -0.100 2
6 98.504672 0.003 1
I would do:
s = df.cases.isna()
percents = df.percent_change.where(s,0)+1
df['cases'] = df.cases.ffill()*percents.cumprod()
Output:
cases percent_change tag
0 120.000000 0.030 7
1 100.000000 0.010 6
2 100.000000 0.000 5
3 99.900000 -0.001 4
4 104.895000 0.050 3
5 94.405500 -0.100 2
6 94.688716 0.003 1
Update: if you really insist on masking based on tag == 6:
s = df.tag.eq(6).shift()
s = s.where(s).ffill()
percents = df.percent_change.where(s,0)+1
df['cases'] = df.cases.ffill()*percents.cumprod()
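
Since the question insists on df.loc masking, here is a sketch that keeps that constraint explicit (it assumes, as in the sample, that the last known cases value sits on the tag == 6 row):

# Boolean mask for the rows to fill, then compound the growth
# factors forward from the last observed value.
mask = df['tag'] < 6
base = df.loc[df['tag'] == 6, 'cases'].iloc[0]           # last observed value (100.0)
growth = (df.loc[mask, 'percent_change'] + 1).cumprod()  # cumulative growth factors
df.loc[mask, 'cases'] = base * growth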

How to fill missing values relative to a value from another column

I'd like to fill missing values with conditions relative to the country:
For example, I want to replace China's missing values with the mean of Age, and for the USA with the median of Age. For now, I don't want to touch EU's missing values.
How could I do it?
Below the dataframe
import pandas as pd
data = [['USA', ], ['EU', 15], ['China', 35],
        ['USA', 45], ['EU', 30], ['China', ],
        ['USA', 28], ['EU', 26], ['China', 78],
        ['USA', 65], ['EU', 53], ['China', 66],
        ['USA', 32], ['EU', ], ['China', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Country', 'Age'])
df.head(11)
Country Age
0 USA NaN
1 EU 15.0
2 China 35.0
3 USA 45.0
4 EU 30.0
5 China NaN
6 USA 28.0
7 EU 26.0
8 China 78.0
9 USA 65.0
10 EU NaN
Thank you
Not sure if this is the best way to do it, but it is one way:
age_series = df['Age'].copy()
df.loc[(df['Country'] == 'China') & (df['Age'].isnull()), 'Age'] = age_series.mean()
df.loc[(df['Country'] == 'USA') & (df['Age'].isnull()), 'Age'] = age_series.median()
Note that I copied the age column beforehand so that the median comes from the original age series, not one recomputed after filling China's NaN with the mean. This is the final result:
Country Age
0 USA 33.500000
1 EU 15.000000
2 China 35.000000
3 USA 45.000000
4 EU 30.000000
5 China 40.583333
6 USA 28.000000
7 EU 26.000000
8 China 78.000000
9 USA 65.000000
10 EU 53.000000
11 China 66.000000
12 USA 32.000000
13 EU NaN
14 China 14.000000
Maybe you can try this:
import numpy as np

df['Age'] = (np.where((df['Country'] == 'China') & (df['Age'].isnull()), df['Age'].mean(),
             np.where((df['Country'] == 'USA') & (df['Age'].isnull()), df['Age'].median(), df['Age']))).round()
Output
Country Age
0 USA 34.0
1 EU 15.0
2 China 35.0
3 USA 45.0
4 EU 30.0
5 China 41.0
6 USA 28.0
7 EU 26.0
8 China 78.0
9 USA 65.0
10 EU 53.0
11 China 66.0
12 USA 32.0
13 EU NaN
14 China 14.0
IIUC, we can create a function to handle this, as it's not easily automated (although I may be wrong). The idea is to pass in the country name and fill type (i.e. mean or median); you can extend the function to add your own agg types. It returns a dataframe that modifies yours, so you can assign the result back to your column.
def missing_values(dataframe, country, fill_type):
    """
    Takes 3 arguments: dataframe, country & fill_type.
    fill_type is the method used to fill `NA` values: mean, median, etc.
    """
    fill_dict = dataframe.loc[dataframe['Country'] == country] \
        .groupby("Country")["Age"].agg(["mean", "median"]) \
        .to_dict(orient='index')
    dataframe.loc[dataframe['Country'] == country, 'Age'] = \
        dataframe['Age'].fillna(fill_dict[country][fill_type])
    return dataframe

print(missing_values(df, 'China', 'mean'))
Country Age
0 USA NaN
1 EU 15.00
2 China 35.00
3 USA 45.00
4 EU 30.00
5 China 48.25
6 USA 28.00
7 EU 26.00
8 China 78.00
9 USA 65.00
10 EU 53.00
11 China 66.00
12 USA 32.00
13 EU NaN
14 China 14.00
print(missing_values(df,'USA','median'))
Country Age
0 USA 38.50
1 EU 15.00
2 China 35.00
3 USA 45.00
4 EU 30.00
5 China 48.25
6 USA 28.00
7 EU 26.00
8 China 78.00
9 USA 65.00
10 EU 53.00
11 China 66.00
12 USA 32.00
13 EU NaN
14 China 14.00
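
A more compact alternative (a sketch, starting from the original df): precompute per-country statistics with groupby().transform() and fill only the targeted countries with Series.mask, leaving EU untouched. Like the function above, this uses each country's own mean/median rather than the whole column's:

group_mean = df.groupby('Country')['Age'].transform('mean')
group_median = df.groupby('Country')['Age'].transform('median')

# Replace Age only where the country matches and the value is missing
df['Age'] = df['Age'].mask(df['Country'].eq('China') & df['Age'].isna(), group_mean)
df['Age'] = df['Age'].mask(df['Country'].eq('USA') & df['Age'].isna(), group_median)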

Rolling window percentile rank over a multi-index Pandas DataFrame

I am creating a percentile rank over a rolling window of time and would like help refining my approach.
My DataFrame has a multi-index with the first level set to datetime and the second set to an identifier. Ultimately, I’d like the rolling window to evaluate the trailing n periods, including the current period, and produce the corresponding percentile ranks.
I referenced the posts shown below but found they were working with the data a bit differently than how I intend to. In those posts, the final functions group results by identifier and then by datetime, whereas I'm looking to use rolling panels of data in my function (dates and identifiers).
using rolling functions on multi-index dataframe in pandas
Panda rolling window percentile rank
This is an example of what I am after.
Create a sample DataFrame:
import numpy as np
import pandas as pd
from pandas.tseries.offsets import BDay

num_days = 5
max_value = 300  # not defined in the original snippet; value assumed here
np.random.seed(8675309)
stock_data = {
    "AAPL": np.random.randint(1, max_value, size=num_days),
    "MSFT": np.random.randint(1, max_value, size=num_days),
    "WMT": np.random.randint(1, max_value, size=num_days),
    "TSLA": np.random.randint(1, max_value, size=num_days)
}
dates = pd.date_range(
    start="2013-01-03",
    periods=num_days,
    freq=BDay()
)
sample_df = pd.DataFrame(stock_data, index=dates)
sample_df = sample_df.stack().to_frame(name='data')
sample_df.index.names = ['date', 'ticker']
Which outputs:
date ticker
2013-01-03 AAPL 2
MSFT 93
TSLA 39
WMT 21
2013-01-04 AAPL 141
MSFT 43
TSLA 205
WMT 20
2013-01-07 AAPL 256
MSFT 93
TSLA 103
WMT 25
2013-01-08 AAPL 233
MSFT 60
TSLA 13
WMT 104
2013-01-09 AAPL 19
MSFT 120
TSLA 282
WMT 293
The code below breaks sample_df into 2-day increments and ranks within each group, rather than ranking over a rolling window of time. So it's close, but not what I'm after.
sample_df.reset_index(level=1, drop=True)[['data']] \
    .apply(
        lambda x: x.groupby(pd.Grouper(level=0, freq='2d')).rank()
    )
I then tried what's shown below without much luck either.
from scipy.stats import rankdata

def rank(x):
    return rankdata(x, method='ordinal')[-1]

sample_df.reset_index(level=1, drop=True) \
    .rolling(window="2d", min_periods=1) \
    .apply(lambda x: rank(x))
I finally arrived at the output I'm looking for but the formula seems a bit contrived, so I'm hoping to identify a more elegant approach if one exists.
import numpy as np
import pandas as pd
from scipy import stats
from pandas.tseries.offsets import BDay
window_length = 1
target_column = "data"
def rank(df, target_column, ids, window_length):
    percentile_ranking = []
    list_of_ids = []
    date_index = df.index.get_level_values(0).unique()
    for date in date_index:
        rolling_start_date = date - BDay(window_length)
        first_date = date_index[0] + BDay(window_length)
        trailing_values = df.loc[rolling_start_date:date, target_column]
        # Only calc rolling percentile after the rolling window has elapsed
        if date < first_date:
            pass
        else:
            percentile_ranking.append(
                df.loc[date, target_column].apply(
                    lambda x: stats.percentileofscore(trailing_values, x, kind="rank")
                )
            )
            list_of_ids.append(df.loc[date, ids])
    ranks, output_ids = pd.concat(percentile_ranking), pd.concat(list_of_ids)
    df = pd.DataFrame(
        ranks.values, index=[ranks.index, output_ids], columns=["percentile_rank"]
    )
    return df
ranks = rank(
    sample_df.reset_index(level=1),
    window_length=1,
    ids='ticker',
    target_column="data"
)
sample_df.join(ranks)
I get the feeling that my rank function is more than what's needed here. I appreciate any ideas/feedback to help in simplifying this code to arrive at the output below. Thank you!
data percentile_rank
date ticker
2013-01-03 AAPL 2 NaN
MSFT 93 NaN
TSLA 39 NaN
WMT 21 NaN
2013-01-04 AAPL 141 87.5
MSFT 43 62.5
TSLA 205 100.0
WMT 20 25.0
2013-01-07 AAPL 256 100.0
MSFT 93 50.0
TSLA 103 62.5
WMT 25 25.0
2013-01-08 AAPL 233 87.5
MSFT 60 37.5
TSLA 13 12.5
WMT 104 75.0
2013-01-09 AAPL 19 25.0
MSFT 120 62.5
TSLA 282 87.5
WMT 293 100.0
Edited: the original answer was taking 2-day groups without the rolling effect, just grouping days pairwise as they appeared. If you want a rolling 2-day window:
Pivot the dataframe to keep the dates as index and the tickers as columns:
pivoted = sample_df.reset_index().pivot(index='date', columns='ticker', values='data')
Output
ticker AAPL MSFT TSLA WMT
date
2013-01-03 2 93 39 21
2013-01-04 141 43 205 20
2013-01-07 256 93 103 25
2013-01-08 233 60 13 104
2013-01-09 19 120 282 293
Now we can apply a rolling function that considers all stocks inside each rolling window:
from scipy.stats import rankdata
def pctile(s):
    wdw = sample_df.loc[s.index, :].values.flatten()  # all stock values in the period
    ranked = rankdata(wdw) / len(wdw) * 100           # their percentiles
    return ranked[np.where(wdw == s.iloc[-1])][0]     # this value's percentile
pivoted_pctile = pivoted.rolling('2D').apply(pctile, raw=False)
Output
ticker AAPL MSFT TSLA WMT
date
2013-01-03 25.0 100.0 75.0 50.0
2013-01-04 87.5 62.5 100.0 25.0
2013-01-07 100.0 50.0 75.0 25.0
2013-01-08 87.5 37.5 12.5 75.0
2013-01-09 25.0 62.5 87.5 100.0
To get the original format back, we just melt the results:
pd.melt(pivoted_pctile.reset_index(),'date')\
.sort_values(['date', 'ticker']).reset_index()
Output
value
date ticker
2013-01-03 AAPL 25.0
MSFT 100.0
TSLA 75.0
WMT 50.0
2013-01-04 AAPL 87.5
MSFT 62.5
TSLA 100.0
WMT 25.0
2013-01-07 AAPL 100.0
MSFT 50.0
TSLA 75.0
WMT 25.0
2013-01-08 AAPL 87.5
MSFT 37.5
TSLA 12.5
WMT 75.0
2013-01-09 AAPL 25.0
MSFT 62.5
TSLA 87.5
WMT 100.0
If you prefer it all in one execution:
pd.melt(
    sample_df
    .reset_index()
    .pivot(index='date', columns='ticker', values='data')
    .rolling('2D').apply(pctile, raw=False)
    .reset_index(),
    'date'
).sort_values(['date', 'ticker']).set_index(['date', 'ticker'])
Note that on day 7 this differs from the output you displayed. This is a true rolling window: because there is no day 6, the day 7 values are ranked only against that day's 4 values (windows don't look forward).
Original
Is this something you might be looking for? I combined the groupby on the date (2 days) with transform, so the number of observations matches the series provided. As you can see, I kept the first observation of each window group.
df = sample_df.reset_index()
df['percentile_rank'] = df.groupby(pd.Grouper(key='date', freq='2D'))['data'] \
    .transform(lambda x: x.rank(ascending=True) / len(x) * 100)
Output
Out[19]:
date ticker data percentile_rank
0 2013-01-03 AAPL 2 12.5
1 2013-01-03 MSFT 93 75.0
2 2013-01-03 WMT 39 50.0
3 2013-01-03 TSLA 21 37.5
4 2013-01-04 AAPL 141 87.5
5 2013-01-04 MSFT 43 62.5
6 2013-01-04 WMT 205 100.0
7 2013-01-04 TSLA 20 25.0
8 2013-01-07 AAPL 256 100.0
9 2013-01-07 MSFT 93 50.0
10 2013-01-07 WMT 103 62.5
11 2013-01-07 TSLA 25 25.0
12 2013-01-08 AAPL 233 87.5
13 2013-01-08 MSFT 60 37.5
14 2013-01-08 WMT 13 12.5
15 2013-01-08 TSLA 104 75.0
16 2013-01-09 AAPL 19 25.0
17 2013-01-09 MSFT 120 50.0
18 2013-01-09 WMT 282 75.0
19 2013-01-09 TSLA 293 100.0
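
For what it's worth, the question's rank function can also be trimmed while keeping the same scipy approach on the stacked frame (a sketch; the helper name trailing_pctile is mine). Each day's values are ranked against the window of the current plus the previous n business days:

from pandas.tseries.offsets import BDay
from scipy import stats

def trailing_pctile(df, n=1):
    dates = df.index.get_level_values('date').unique()
    parts = []
    for d in dates[n:]:  # skip dates whose trailing window is incomplete
        window = df.loc[d - BDay(n):d, 'data']  # current day plus previous n business days
        parts.append(df.loc[[d], 'data'].apply(
            lambda x: stats.percentileofscore(window, x, kind='rank')))
    return pd.concat(parts).rename('percentile_rank')

sample_df.join(trailing_pctile(sample_df))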
