How to resample data inside multiindex dataframe - python-3.x

I have a dataframe with a MultiIndex (a datetime level plus month, week and day levels) and an Adj Close column.
I need to resample the data to calculate the weekly pct_change(). How can I get the weekly change?
Something like data['pct_week'] = data['Adj Close'].resample('W').ffill().pct_change(), but the data would need to be grouped with data.groupby(['month', 'week']).
This way every month would yield 4 values for the weekly change, which I can then graph.
What I tried was df['pct_week'] = data['Adj Close'].groupby(['week', 'day']).pct_change(), but I got this error: TypeError: 'type' object does not support item assignment

If you want grouping together with resample, a DatetimeIndex is required, so first use DataFrame.reset_index on all levels except the first, then group and resample with a custom function, because pct_change is not implemented for resample:
def percent_change(x):
    return pd.Series(x).pct_change()
Another idea is to use a NumPy-based version of pct_change:
def percent_change(x):
    return x / np.concatenate(([np.nan], x[:-1])) - 1
df1 = (df.reset_index(level=[1,2,3])
         .groupby(['month', 'week'])['Adj Close']
         .resample('W')
         .apply(percent_change))
This way every month yields 4 values for the weekly change.
If no grouping is actually needed, only a downsample such as sum, then chain Series.pct_change instead:
df2 = (df.reset_index(level=[1,2,3])
         .resample('W')['Adj Close']
         .sum()
         .pct_change())

Drop the unwanted index levels; the datetime index is enough for resampling / grouping:
df.index = df.index.droplevel(['month', 'week', 'day'])
Resample by week, select the column needed, apply an aggregation function, and then calculate the percentage change:
df.resample('W')['Adj Close'].mean().pct_change()
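A minimal end-to-end sketch of this approach, assuming a frame indexed by (Date, month, week, day) with an Adj Close column; the sample data below is made up purely for illustration:
import numpy as np
import pandas as pd
# Build a small frame shaped like the question's: a MultiIndex of
# (Date, month, week, day) with an 'Adj Close' price column.
dates = pd.date_range('2019-01-01', periods=60, freq='D')
df = pd.DataFrame(
    {'Adj Close': np.random.uniform(90, 110, size=len(dates))},
    index=pd.MultiIndex.from_arrays(
        [dates, dates.month, (dates.day - 1) // 7 + 1, dates.day],
        names=['Date', 'month', 'week', 'day']))
# Keep only the datetime level, then resample weekly and compute the change.
df.index = df.index.droplevel(['month', 'week', 'day'])
pct_week = df.resample('W')['Adj Close'].mean().pct_change()
print(pct_week.head())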

Related

Pandas merge is not working, though data types of keys are the same

I am using pandas.merge to join two dataframes based on two columns values. I checked the data types in both dataframes, and they are the same. Besides, for each of the two columns in both datasets I calculated the intersection between sets and it is definitely not empty. However, merge is not working properly and I just cannot find a reason for it.
I will post a piece of code here, though I don't have a minimal dataset to reproduce it; the data are too big. There are datasets df and q0, both of which have the columns permno and date, and I want to merge based on them.
# just to make sure the data types are the same
df['permno'] = df['permno'].astype(int)
q0['permno'] = q0['permno'].astype(int)
df['date'] = df['date'].dt.to_period('D')
q0['date'] = q0['date'].dt.to_period('D')
print(df.dtypes, q0.dtypes)
Output for q0:
date period[D]
permno int64
sum float64
dtype: object
Output for df:
permno int64
date period[D]
sic int64
prc float64
...
Another sanity check is to take the intersection of the column values:
print(len(set(df.date.unique())&set(q0.date.unique())))
print(len(set(df.permno.unique())&set(q0.permno.unique())))
Output:
9154
5925
Merge:
df = pd.merge(df, q0, on=['permno', 'date'], how='inner')
print(len(df))
Output:
0
I tried it so many times but I can't figure out why it is not working now.
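One thing the per-column intersections above cannot confirm is whether any (permno, date) pairs are shared by both frames, which is what an inner merge on both keys actually requires: the same permno can appear in both frames but only on dates where the other frame has no row for it. A quick pair-level check (a sketch, reusing the df and q0 names from the question) would be:
pairs_df = set(zip(df['permno'], df['date']))
pairs_q0 = set(zip(q0['permno'], q0['date']))
# An empty intersection here would explain the empty inner merge.
print(len(pairs_df & pairs_q0))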

Create Dynamic Columns with Calculation

I have a dataframe called prices, with historical stocks prices for the following companies:
['APPLE', 'AMAZON', 'GOOGLE']
So far, with the help of a friendly user, I was able to create a dataframe for each of these periods with the following code:
import pandas as pd
import numpy as np
from datetime import datetime, date
prices = pd.read_excel('database.xlsx')
companies=prices.columns
companies=list(companies)
del companies[0]
timestep = 250
prices_list = [prices[day:day + timestep] for day in range(len(prices) - timestep)]
Now, I need to evaluate the change in price over every period of 251 days (Price251/Price1, Price252/Price2, Price253/Price3, and so on) for each one of the companies, and create a column for each of them.
I would also like to make the column names dynamic, so I can replicate this for a much larger database.
So, I would get a dataframe similar to the one shown in the image linked in the original post.
Here you can find the dataframe head(3): Initial Dataframe
IIUC, try this:
def create_cols(df, num_dates):
    for col in list(df)[1:]:
        df['{}%'.format(col)] = -((df[col].shift(num_dates) - df[col]) / df[col].shift(num_dates)).shift(-num_dates)
    return df
create_cols(prices,251)
You would only have to format the columns as percentages.
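Unless I am mistaken, the expression above is equivalent (assuming no missing prices) to a forward-looking pct_change, which may read more simply:
def create_cols_pct(df, num_dates):
    # Same result as create_cols, written with the built-in pct_change:
    # (price[t + num_dates] / price[t]) - 1, aligned to the starting row t.
    for col in list(df)[1:]:
        df['{}%'.format(col)] = df[col].pct_change(periods=num_dates).shift(-num_dates)
    return df
create_cols_pct(prices, 251)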

Pandas DataFrame - adding columns in for loop vs another approach

We have measurements for n points (say 22 points) over a period of time, stored in a real-time store. We are now looking for some understanding of trends for the points mentioned above. To that end we read the measurements into a pandas DataFrame (Python). Within this DataFrame the points are columns and the rows are the respective measurement times.
We would like to extend the DataFrame by inserting a 'mean' and a 'std' column for each existing column, i.e. for each particular measurement. This means two new columns for each of the 22 measurement points.
The question is whether this is best achieved by adding the new mean and std columns while iterating over the existing columns, or whether there is a more effective built-in DataFrame operation or trick.
Our understanding is that updating a DataFrame in a for loop would be the worst practice by far.
Thanks for any comment or proposal.
From the comments, I guess this is what you are looking for -
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.normal(size = (1000,22))) # created an example dataframe
df.loc[:, 'means'] = df.mean(axis = 1)
df.loc[:, 'std'] = df.std(axis = 1)
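If instead one rolling mean and std column per original measurement column is what was meant (the question is ambiguous about the window; the 10-row window below is purely illustrative), a loop-free sketch could build all the new columns at once and concatenate them:
measure_cols = list(df.columns[:22])   # the original 22 measurement columns
window = 10                            # assumed window size, adjust as needed
stats = pd.concat(
    [df[measure_cols].rolling(window).mean().add_suffix('_mean'),
     df[measure_cols].rolling(window).std().add_suffix('_std')],
    axis=1)
df = pd.concat([df, stats], axis=1)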

How to calculate mean with hierarchical index in pandas

I have a pandas dataframe with 1 million rows and hierarchical indexes (country, state, city, in this order) with price observations of a product in each row. How can I calculate the mean and standard deviation (std) for each country, state and city (keeping in mind I am avoiding loops, as my df is big)?
For each level of mean and std, I want to save the values in new columns in this dataframe for future access.
Use groupby with the level argument to group your data, then use mean and std. If you want to have the mean as a new column in your existing dataframe, use transform, which returns a Series with the same index as your df:
grouped = df.groupby(level = ['Country','State', 'City'])
df['Mean'] = grouped['price_observation'].transform('mean')
df['Std'] = grouped['price_observation'].transform('std')
If you want to read more on grouping, see the pandas documentation on groupby.
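If constant per-group columns in the original frame are not needed, a compact summary with one row per (Country, State, City) can also be produced with agg (a sketch, using the same names as above):
summary = (df.groupby(level=['Country', 'State', 'City'])['price_observation']
             .agg(['mean', 'std']))
# One row per group, with 'mean' and 'std' side by side.
print(summary.head())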

Python Pandas unique values

df = pd.DataFrame({"ID": ['A','B','C','D','E','F'],
                   "IPaddress": ['12.345.678.01','12.345.678.02','12.345.678.01','12.345.678.18','12.345.678.02','12.345.678.01'],
                   "score": [8,9,5,10,3,7]})
I'm using Python, and Pandas library. For those rows with duplicate IP addresses, I want to select only one row with highest score (score being from 0-10), and drop all duplicates.
I'm having a difficult time in turning this logic into a Python function.
Step 1: Using the groupby function of Pandas, split the df into groups of IPaddress.
df.groupby('IPaddress')
The result of this is a groupby object. If you check the type of this object, it will be the following: pandas.core.groupby.groupby.DataFrameGroupBy
Step 2: With the Pandas groupby object created in step 1, calling .idxmax() on the score returns a Pandas Series with the index (row label) of the maximum score for each IPaddress:
df.groupby('IPaddress').score.idxmax()
(Optional) Step 3: If you want to turn the above Series into a dataframe, you can do the following:
df.loc[df.groupby('IPaddress').score.idxmax(),['IPaddress','score']]
Here, you are selecting all the rows with the max scores and showing the IPaddress and score columns.
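For reference, an equivalent approach (a sketch; ties between equal scores are broken arbitrarily) sorts by score and keeps the first, i.e. highest-scoring, row per address:
result = (df.sort_values('score', ascending=False)
            .drop_duplicates('IPaddress'))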
Useful references:
1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
2. https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html
3. https://www.geeksforgeeks.org/python-pandas-dataframe-idxmax/
