Converting simple returns to monthly log returns - python-3.x

I have a pandas DataFrame with simple daily returns. I need to convert them to monthly log returns and add a column to the current DataFrame. I have to use np.log to compute the monthly return, but so far I can only compute the daily log return. Below is my code:
df['return_monthly'] = np.log(df['Simple Daily Returns'] + 1)
The code only produces daily log returns. Is there a particular method I should be using in the above code to get monthly returns?
Please see my input pandas DataFrame; the third column in the Excel screenshot is the expected output.

The question is a little confusing, but it seems like you want to group the rows by month. This can be done with pandas.resample (if you have a datetime index), pandas.groupby, or pandas.pivot.
Here is a simple implementation; let us know if this isn't what you're looking for. Note that your values are less than 1, so the log is negative; you can adjust as needed. I aggregated the months with sum, but there are many other aggregation functions such as mean(), median(), size(), and more. See the pandas documentation for a full list of aggregation functions.
import numpy as np
import pandas as pd

# create a DataFrame with 1220 values that match your dataset
df = pd.DataFrame({
    'Date': pd.date_range(start='1/1/2019', end='5/4/2022', freq='1D'),
    'Return': np.random.uniform(low=1e-6, high=1.0, size=1220)  # avoid log(0), which returns NaN
}).set_index('Date')  # set the index to the date so we can use resample

df['Log_return'] = np.log(df['Return'])  # daily log of each value
print(df.resample('M').sum())            # aggregate each month with sum
Return Log_return
Date
2019-01-31 14.604863 -33.950987
2019-02-28 13.118111 -32.025086
2019-03-31 14.541947 -32.962914
2019-04-30 14.212689 -33.684422
2019-05-31 14.154918 -33.347081
2019-06-30 10.710209 -43.474120
2019-07-31 12.358001 -43.051723
2019-08-31 17.932673 -30.328784
...
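If the goal is specifically monthly log returns computed from simple daily returns, one common approach (a sketch, assuming the Return column holds simple daily returns such as 0.01 for 1%) is to sum the daily log returns within each month, since log returns are additive over time:

df['daily_log_return'] = np.log1p(df['Return'])  # np.log1p(x) computes log(1 + x)
monthly_log_returns = df['daily_log_return'].resample('M').sum()  # log returns add across days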

Related

calculate average difference between dates using pyspark

I have a data frame that looks like this: user IDs and dates of activity. I need to calculate the average difference between dates using RDD functions (such as reduce and map), not SQL.
The dates for each ID need to be sorted in order before calculating the difference, as I need the difference between each pair of consecutive dates.
ID   Date
1    2020-09-03
1    2020-09-03
2    2020-09-02
1    2020-09-04
2    2020-09-06
2    2020-09-16
The needed outcome for this example will be:
ID   average difference
1    0.5
2    7
Thanks for helping!
You can use datediff with a window function to calculate the difference, then take the average.
lag is a window function that pulls a value from the previous row within the window.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# define the window: one partition per ID, ordered by date
w = Window.partitionBy('ID').orderBy('Date')

# datediff takes the date difference from the first arg to the second arg (first - second)
(df.withColumn('diff', F.datediff(F.col('Date'), F.lag('Date').over(w)))
    .groupby('ID')  # aggregate over ID
    .agg(F.avg(F.col('diff')).alias('average difference'))
)
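Since the question asks for RDD operations such as map and reduce rather than SQL, here is a rough sketch of the same computation on the underlying RDD. It assumes each row carries an ID and a datetime.date in the Date column:

avg_diff = (df.rdd
    .map(lambda r: (r['ID'], [r['Date']]))                             # (ID, [date]) pairs
    .reduceByKey(lambda a, b: a + b)                                   # gather all dates per ID
    .mapValues(sorted)                                                 # sort dates chronologically
    .mapValues(lambda ds: [(b - a).days for a, b in zip(ds, ds[1:])])  # consecutive gaps in days
    .mapValues(lambda gaps: sum(gaps) / len(gaps)))                    # average gap per ID
avg_diff.collect()  # e.g. [(1, 0.5), (2, 7.0)]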

Arithmetic operations for groups within a dataframe

I have loaded multiple CSVs (time series) to create one dataframe. This dataframe contains data for multiple stocks. Now I want to calculate the 1 month return for all the datapoints.
There are 172 datapoints for each stock, i.e. from index 0 to 171. The time series for the next stock starts from index 0 again.
When I try to calculate the 1 month return, it is calculated correctly for all data points except for index 0 of each new stock, because the difference is taken with index 171 of the previous stock.
I want the return to be calculated per stock name, so I tried a for loop, but it doesn't seem to work.
E.g. in the attached image (highlighted) the 1 month return for company ITC is calculated using SHREECEM data. I expect the first value of 1Mreturn for SHREECEM to be NaN.
Using groupby instead of a for loop you can get the result you want:
Mreturn_function = lambda df: df['mean_price'].diff(periods=1) / df['mean_price'].shift(1) * 100
gw_stocks.groupby('CompanyName').apply(Mreturn_function)
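Equivalently, pandas' built-in pct_change does the same per-group calculation (assuming gw_stocks has the CompanyName and mean_price columns from your screenshot); the first row of each company comes out as NaN, as expected:

gw_stocks['1Mreturn'] = gw_stocks.groupby('CompanyName')['mean_price'].pct_change() * 100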

Date stuck as unformattable in pandas dataframe

I am trying to plot time series data, but my date column is stuck like this, and I cannot figure out what datatype it is in order to change it; adding verbose=True doesn't yield any explanation of the data.
Here is a screenshot of the output: Date formatting
Here is the code I have for importing the dataset and all the formatting I've done to it. Ultimately, I want date and time to be separate values, but I cannot figure out why pandas is auto-formatting the column and applying today's date.
df = pd.read_csv('dataframe.csv')
df['Avg_value'] = (df['Open'] + df['High'] + df['Low'] + df['Close'])/4
df['Date'] = pd.to_datetime(df['Date'])
df['Time'] = pd.to_datetime(df['Time'])
Any help would be appreciated
The output I'd be looking for would be something like:
Date: 2016-01-04 10:00:00
As one column rather than 2 separate ones.
When you pass a Pandas Series into pd.to_datetime(...), it parses the values and returns a new Series of dtype datetime64[ns] containing the parsed values:
>>> pd.to_datetime(pd.Series(["12:30:00"]))
0 2021-08-24 12:30:00
dtype: datetime64[ns]
The reason you see today's date is that a datetime64 value must have both date and time information. Since the date is missing, the current date is substituted by default.
Pandas does not really have a dtype that supports "just time of day, no date" (timedelta64 supports time intervals, such as "5 minutes", but that is not the same thing). The Python standard library does, however: the datetime.time class.
Pandas provides the .dt accessor for extracting a Series of datetime.time objects from a Series of datetime64 values:
>>> pd.to_datetime(pd.Series(["12:30:00"])).dt.time
0 12:30:00
dtype: object
Note that the dtype is just object which is the fallback dtype Pandas uses for anything which is not natively supported by Pandas or NumPy. This means that working with these datetime.time objects will be a bit slower than working with a native Pandas dtype, but this probably won't be a bottleneck for your application.
Recommended reference: https://pandas.pydata.org/docs/user_guide/timeseries.html
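If what you ultimately want is a single combined datetime column, one sketch (assuming Date and Time are read from the CSV as strings) is to concatenate the two columns before parsing:

df = pd.read_csv('dataframe.csv')
df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])  # e.g. 2016-01-04 10:00:00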

pandas is not summing a numeric column

I have read an Excel spreadsheet into a DataFrame, with column names such as Gross, Fee, Net, etc. When I invoked the sum method on the resulting DataFrame, I saw that it was not summing the Fee column because several rows had string data in that column. So I first loop through each row testing that column: if it contains a string, I replace it with a 0. The DataFrame sum method still does not sum the Fee column. Yet when I write the resulting DataFrame out to a new Excel spreadsheet, read it back in, and apply the sum method to the resulting DataFrame, it does sum the Fee column. Can anyone explain this? Here is the code and the printed output:
import pandas as pd
pp = pd.read_excel('pp.xlsx')
# get rid of any strings in column 'Fee':
for i in range(pp.shape[0]):
    if isinstance(pp.loc[i, 'Fee'], str):
        pp.loc[i, 'Fee'] = 0
pd.to_numeric(pp['Fee']) #added this but it makes no difference
# the Fee column is still not summed:
print(pp.sum(numeric_only=True))
print('\nSecond Spreadsheet\n')
# write the DataFrame out to an Excel spreadsheet:
with pd.ExcelWriter('pp2.xlsx') as writer:
    pp.to_excel(writer, sheet_name='PP')
# now read the spreadsheet back into another DataFrame:
pp2 = pd.read_excel('pp2.xlsx')
# the Fee column is summed:
print(pp2.sum(numeric_only=True))
Prints:
Gross 8677.90
Net 8572.43
Address Status 0.00
Shipping and Handling Amount 0.00
Insurance Amount 0.00
Sales Tax 0.00
etc.
Second Spreadsheet
Unnamed: 0 277885.00
Gross 8677.90
Fee -105.47
Net 8572.43
Address Status 0.00
Shipping and Handling Amount 0.00
Insurance Amount 0.00
Sales Tax 0.00
etc.
Try using pd.to_numeric
Ex:
pp = pd.read_excel('pp.xlsx')
print(pd.to_numeric(pp['Fee'], errors='coerce').dropna().sum())
The problem here is that the Fee column isn't numeric. So you need to convert it to a numeric field, save that updated field in the existing dataframe, and then compute the sum.
So that would be:
df = df.assign(Fee=pd.to_numeric(df['Fee'], errors='coerce'))
print(df.sum())
From what I can see, you are replacing the strings with an integer, so the values of the 'Fee' column are a mix of floats and integers, which means the dtype of that column is object. When you do pp.sum(numeric_only=True), it ignores the object column because of the numeric_only condition. Convert the column to float64 with pp['Fee'] = pd.to_numeric(pp['Fee']) and it should work for you.
The reason it works the second time is that Excel does the data conversion for you, so when you read the spreadsheet back in, the column already has a numeric dtype.
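A minimal sketch of the dtype issue, with made-up values:

s = pd.Series([1.5, 2.25, 'bad']).replace('bad', 0)  # string values force dtype object
print(s.dtype)                 # object -> skipped by sum(numeric_only=True)
print(pd.to_numeric(s).dtype)  # float64 -> included in the sum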
Everyone who has responded should get partial credit for telling me about pd.to_numeric. But they were all missing one piece: it is not sufficient to say pd.to_numeric(pp['Fee']). That returns the column converted to numeric, but it does not update the original DataFrame, so when I do pp.sum(), nothing in pp has been modified. You need:
pp['Fee'] = pd.to_numeric(pp['Fee'])
pp.sum()

Pandas Create Range of Dates Without Weekends

Given the following data frame:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'c'],
                   'first_date': ['2015-08-31 00:00:00', '2015-08-24 00:00:00', '2015-08-25 00:00:00']})
df.first_date = pd.to_datetime(df.first_date)  # dtype='<M8[ns]'
df['last_date'] = pd.to_datetime('5/6/2016')   # dtype='datetime64[ns]'
df
A first_date last_date
0 a 2015-08-31 2016-05-06
1 b 2015-08-24 2016-05-06
2 c 2015-08-25 2016-05-06
I'd like to create a new column which contains the list (or array) of dates between 'first_date' and 'last_date' which excludes weekends.
So far, I've tried this:
pd.date_range(df['first_date'],df['last_date'])
...but this error occurs:
TypeError: Cannot convert input to Timestamp
I also tried this before pd.date_range...
pd.Timestamp(df['first_date'])
...but no dice.
Thanks in advance!
P.S.:
After this hurdle, I'm going to look at other lists of dates and, if they fall within the generated array (per row in 'A'), subtract them from the list or array. I'll post that as a separate question.
freq='B' gives you business days, i.e. no weekends.
Your error:
TypeError: Cannot convert input to Timestamp
is the result of passing a Series to the pd.date_range function when it expects a Timestamp. Instead, use apply.
However, I still find it tricky to get lists into specific cells of dataframes. The way I do it is to wrap the list as pd.Series([mylist]). Notice it is a list of a list: if it were just pd.Series(mylist), pandas would convert the list into a series, and you'd get a series of series, which is a dataframe.
try:
def fnl(x):
    l = pd.date_range(x.loc['first_date'], x.loc['last_date'], freq='B')
    return pd.Series([l])

df['range'] = df.apply(fnl, axis=1)
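As an alternative sketch, pd.bdate_range defaults to business-day frequency, so a list comprehension over the two date columns gives the same result without apply:

df['range'] = [pd.bdate_range(s, e) for s, e in zip(df['first_date'], df['last_date'])]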
