Using PyFolio alongside Pandas - python-3.x

I want to do a time series analysis of financial data. Since I am working with the Pakistan Stock Exchange (PSX), the data is not available on Yahoo Finance. In the tutorials I looked at on Quantopian, the first step, data extraction, is done through Yahoo Finance.
When I use the PyFolio module and read in a CSV containing the data (with the pandas read_csv function), there is an issue between the datetime formats of pandas and PyFolio. Below is the code of what I am doing.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pyfolio as pf
import datetime
from datetime import datetime
from datetime import timedelta
start_date = '2015-02-01'
end_date = '2017-03-20'
live_date = '2017-03-15'
symbols = ['KEL']
def converter(start_date):
    convert = datetime.strptime(start_date, "%Y-%m-%d")
    return convert
def data(symbols):
    dates = pd.date_range(start_date, end_date)
    df = pd.DataFrame(index=dates)
    df_temp = pd.read_csv('/home/furqan/Desktop/python_data/{}.csv'.format(str(symbols[0])), usecols=['Date', 'Close'],
                          parse_dates=True, index_col='Date', na_values=['nan'])
    df_temp = df_temp.rename(columns={'Close': symbols[0]})
    df = df.join(df_temp)
    df = df.fillna(method='ffill')
    df = df.fillna(method='bfill')
    return df
new_date = converter (live_date)
df= data(symbols)
sheet = pf.create_returns_tear_sheet(df, live_start_date=new_date)
The above code leads to the following error:
TypeError: Cannot compare tz-naive and tz-aware timestamps
Given the above information, I have two questions.
1) Can Quantopian be of any use for my analysis if the data is on my PC, since it is not available on Yahoo Finance?
2) What exactly does the above error mean, and how can I fix it?
For reference, below are the links to the PyFolio and pandas documentation.
https://quantopian.github.io/pyfolio/notebooks/single_stock_example/#fetch-the-daily-returns-for-a-stock
http://pandas.pydata.org/pandas-docs/stable/

I got around this problem by adding time zone information to my series. If you know the time zone of your datetime index, you can apply the following method (tz_localize returns a new object, so assign the result):
df = df.tz_localize('UTC')
I hope it helps.
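To illustrate the error and the fix, here is a minimal sketch on made-up data. The series values, dates, and variable names are assumptions for illustration, not the asker's data. The error arises when a tz-naive index is compared against a tz-aware timestamp; localizing the index puts both sides on the same footing.

```python
import pandas as pd

# Hypothetical daily returns with a tz-naive index (no time zone attached).
idx = pd.date_range("2017-03-13", periods=5, freq="D")
returns = pd.Series([0.01, -0.02, 0.005, 0.0, 0.01], index=idx)

# A tz-aware cut-off date, similar in spirit to a live start date in UTC.
live_start = pd.Timestamp("2017-03-15", tz="UTC")

# Ordering comparisons between tz-naive and tz-aware values are expected
# to raise TypeError in recent pandas versions.
try:
    returns.index >= live_start
except TypeError as exc:
    print("comparison failed:", exc)

# Attach UTC to the index; tz_localize returns a new Series.
returns_utc = returns.tz_localize("UTC")

# Both sides are now tz-aware, so the comparison works.
out_of_sample = returns_utc[returns_utc.index >= live_start]
print(out_of_sample)
```

If the data were recorded in local exchange time rather than UTC, the matching time zone name would be passed to tz_localize instead.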

Related

Pandas - comparing average of hour periods against each other for a given date range

I'm trying to get used to working with datetime data in pandas and plotting different comparisons for a given dataset. I'm using the London Air Quality dataset for Ozone to practice and am trying to replicate the chart below (which I created using a pivot table in Excel) with pandas and matplotlib.
The chart plots the average of each hour's Ozone reading for each location across the entire dataset, to see whether one location is consistently higher than the others or whether different locations have the highest Ozone levels at different times of the day.
Essentially, I'm looking to plot the hourly average of Ozone for each location.
I've attempted to reshape the data into a MultiIndex format and then plot it, similar to what I'd do in Excel, but I am unsure whether this is the right approach. The reshaping code is below. I am still getting used to reshaping, so I am open to other methods. Any assistance would be much appreciated!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
data = pd.read_csv('/Users/xx/Downloads/LaqnData.csv')
data['ReadingDateTime'] = pd.to_datetime(data['ReadingDateTime'])
data['Date'] = pd.to_datetime(data['ReadingDateTime']).dt.date
data['Time'] = pd.to_datetime(data['ReadingDateTime']).dt.time
data.set_index(['Date', 'Time'], inplace = True)
hourly_dataframe = data.pivot_table(columns = 'Site', values = 'Value', index = ['Date', 'Time'])
hourly_dataframe.fillna(method = 'ffill', inplace = True)
hourly_dataframe[hourly_dataframe < 0] = 0
I went to the site and downloaded 24 hours of readings for the following sites:
data.Site.unique()
array(['BX1', 'TH4', 'BT4', 'HI0', 'BL0', 'RD0'], dtype=object)
I adapted your code up to this point:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
data = pd.read_csv('/Users/xx/Downloads/LaqnData.csv')
data['ReadingDateTime'] = pd.to_datetime(data['ReadingDateTime'])
I then set ReadingDateTime as the index, so that the datetime index can be used to get each hour in the groupby call.
data = data.set_index('ReadingDateTime')
data.groupby([data.index.hour, data['Site']])['Value'].mean().reset_index()  # Convert to a DataFrame.
To plot, I chain unstack onto the groupby result and plot directly.
data.groupby([data.index.hour, data['Site']])['Value'].mean().unstack().plot()
plt.xlabel('Hour of the day')
plt.ylabel('Ozone')
plt.title('Average hourly comparison')
plt.legend()  # If you want the legend to appear in the default location
If you are fussed about the legend location, this post explains it very well. In your case:
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
           fancybox=True, shadow=True, ncol=6)
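For anyone who wants to try the groupby/unstack approach without downloading the file, here is a small self-contained sketch on synthetic data. The column names match the ones used above, but the two sites and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Three days of hourly timestamps (synthetic, not the LAQN download).
times = pd.Timestamp("2019-01-01") + pd.to_timedelta(np.arange(72), unit="h")

# One block of rows per site; the value is simply the hour plus a site offset.
frames = []
for offset, site in enumerate(["BX1", "TH4"]):
    frames.append(pd.DataFrame({
        "ReadingDateTime": times,
        "Site": site,
        "Value": times.hour + 10.0 * offset,
    }))
data = pd.concat(frames, ignore_index=True)

# Average per hour of day and site, then pivot sites into columns.
hourly = (
    data.groupby([data["ReadingDateTime"].dt.hour, "Site"])["Value"]
    .mean()
    .unstack("Site")
)
print(hourly.head())
```

Calling hourly.plot() on the result should then draw one line per site, with the hour of the day on the x-axis.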

Pandas error "No numeric data to plot" when using stock data from datareader

I have a dataframe with closing stock prices:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn; seaborn.set()
from pandas_datareader import data
import pandas_datareader.data as web
from pandas.tseries.offsets import BDay
f = web.DataReader('^DJI', 'stooq')
CLOSE = f['Close']
CLOSE.plot(alpha= 0.5,style='-')
CLOSE.resample('BA').mean().plot(style=':')
CLOSE.asfreq(freq='BA').plot(style='--')
plt.legend(['input','resample','asfreq'],loc='upper left')
With resample() I get the average of the previous year. This works.
With asfreq() I try to get the closing value at the end of the year. This doesn't work.
I get the following error in the asfreq() line: TypeError: no numeric data to plot
f.info() displays that close is a non-null float64 type.
What could be wrong?
The index was not sorted. Sorting it solved the problem:
f = f.sort_index(axis=0)
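As a rough sketch of the fix on synthetic data (the prices below are made up, and how asfreq behaves on an unsorted index may differ between pandas versions), sorting the index first gives asfreq a monotonically increasing range to work with. The offset object pd.offsets.BYearEnd() is used here in place of the 'BA' alias, since frequency alias spellings have changed across pandas releases.

```python
import numpy as np
import pandas as pd

# Synthetic business-day "closing prices", stored newest-first on purpose.
days = pd.bdate_range("2017-01-02", "2019-12-31")
prices = pd.Series(np.arange(len(days), dtype=float), index=days)
descending = prices.iloc[::-1]
print(descending.index.is_monotonic_increasing)

# Sort the index into ascending order before changing the frequency.
ordered = descending.sort_index()

# Pick the value on the last business day of each year.
year_end = ordered.asfreq(pd.offsets.BYearEnd())
print(year_end)
```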

Pandas plotting graph with timestamp

pandas 0.23.4
python 3.5.3
I have some code that looks like below
import pandas as pd
from datetime import datetime
from matplotlib import pyplot
def dateparse():
    return datetime.strptime("2019-05-28T00:06:20,927", '%Y-%m-%dT%H:%M:%S,%f')
series = pd.read_csv('sample.csv', delimiter=";", parse_dates=True,
                     date_parser=dateparse, header=None)
series.plot()
pyplot.show()
The CSV file looks like below
2019-05-28T00:06:20,167;2070
2019-05-28T00:06:20,426;147
2019-05-28T00:06:20,927;453
2019-05-28T00:06:22,688;2464
2019-05-28T00:06:27,260;216
As you can see 2019-05-28T00:06:20,167 is the timestamp with milliseconds and 2070 is the value that I want plotted.
When I run this the graph gets printed however on the X-Axis I see numbers which is a bit odd. I was expecting to see actual timestamps (like MS Excel). Can someone tell me what I am doing wrong?
You did not set the datetime column as the index. Also, you don't need a date parser; just pass the columns you want to parse:
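import io
import pandas as pd
import matplotlib.pyplot as plt
dfstr = '''2019-05-28T00:06:20,167;2070
2019-05-28T00:06:20,426;147
2019-05-28T00:06:20,927;453
2019-05-28T00:06:22,688;2464
2019-05-28T00:06:27,260;216'''
# pd.compat.StringIO was removed from newer pandas releases; io.StringIO is the standard-library equivalent
df = pd.read_csv(io.StringIO(dfstr), sep=';',
                 header=None, parse_dates=[0])
plt.plot(df[0], df[1])
plt.show()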
Output:
Or:
df.set_index(0)[1].plot()
gives a little better plot:
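If the automatic date parsing does not cope with the comma before the milliseconds, one alternative is to parse the column explicitly. This is only a sketch: the column names timestamp and value are made up for illustration, and the format string is taken from the question.

```python
import io
import pandas as pd

raw = """2019-05-28T00:06:20,167;2070
2019-05-28T00:06:20,426;147
2019-05-28T00:06:20,927;453
2019-05-28T00:06:22,688;2464
2019-05-28T00:06:27,260;216"""

# Read both columns as-is; the names are hypothetical labels.
df = pd.read_csv(io.StringIO(raw), sep=";", header=None,
                 names=["timestamp", "value"])

# Parse with an explicit format; %f consumes the digits after the comma.
df["timestamp"] = pd.to_datetime(df["timestamp"],
                                 format="%Y-%m-%dT%H:%M:%S,%f")

# With a DatetimeIndex, Series.plot() labels the x-axis with timestamps.
series = df.set_index("timestamp")["value"]
print(series)
```

Calling series.plot() followed by plt.show() should then give a time axis rather than row numbers.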

Keyerror in time/Date Components of datetime - what to do?

I am using a pandas DataFrame with datetime indexing. I know from the xarray documentation that datetime components can be accessed as ds['date.year'], with ds being the xarray DataArray, date the date index, and year the year component of the dates. xarray points to datetime components, which in turn leads to DatetimeIndex in the pandas documentation. So I thought of doing the same with pandas, as I really like this feature.
However, it is not working for me. Here is what I did so far:
# Import required modules
import pandas as pd
import numpy as np
# Create DataFrame (name: df)
df = pd.DataFrame({'Date': ['2017-04-01', '2017-04-01',
                            '2017-04-02', '2017-04-02'],
                   'Time': ['06:00:00', '18:00:00',
                            '06:00:00', '18:00:00'],
                   'Active': [True, False, False, True],
                   'Value': np.random.rand(4)})
# Combine str() information of Date and Time and format to datetime
df['Date']=pd.to_datetime(df['Date'] + ' ' + df['Time'],format = '%Y-%m-%d %H:%M:%S')
# Make the combined data the index
df = df.set_index(df['Date'])
# Erase the rest, as it is not required anymore
df = df.drop(['Time','Date'], axis=1)
# Show me the first day (recent pandas needs .loc to select rows by date string)
df.loc['2017-04-01']
Ok, so this shows me only the first entries. So far, so good.
However
df['Date.year']
results in KeyError: 'Date.year'
I would expect an output like
array([2017,2017,2017,2017])
What am I doing wrong?
EDIT:
I have a workaround, which I am able to go on with, but I am still not satisfied, as this doesn't explain my question. I did not use a pandas DataFrame, but an xarray Dataset and now this works:
# Load modules
import pandas as pd
import numpy as np
import xarray as xr
# Prepare time array
Date = ['2017-04-01','2017-04-01', '2017-04-02','2017-04-02']
Time = ['06:00:00','18:00:00', '06:00:00','18:00:00']
time = [Date[i] + ' ' + Time[i] for i in range(len(Date))]
time = pd.to_datetime(time,format = '%Y-%m-%d %H:%M:%S')
# Create Dataset (name: ds)
ds = xr.Dataset({'time': time,
                 'Active': [True, False, False, True],
                 'Value': np.random.rand(4)})
ds['time.year']
which gives:
<xarray.DataArray 'year' (time: 4)>
array([2017, 2017, 2017, 2017])
Coordinates:
* time (time) datetime64[ns] 2017-04-01T06:00:00 ... 2017-04-02T18:00:00
As for what you're doing wrong, you are
a) trying to access an index as if it were a Series
b) chaining commands within a string: df['Date'] is a single column, whereas df['Date.year'] is a column literally called 'Date.year'
If your datetime is the index, use .year directly; use .dt.year if it's a Series.
df.index.year
# or, if Date is still a column with a proper datetime dtype (your code indicates it is)
df.Date.dt.year
Hope that helps, bud.
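Here is a short sketch of both routes on a frame shaped like the one in the question. The values are random and the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

stamps = pd.to_datetime([
    "2017-04-01 06:00:00", "2017-04-01 18:00:00",
    "2017-04-02 06:00:00", "2017-04-02 18:00:00",
])
df = pd.DataFrame(
    {"Active": [True, False, False, True], "Value": np.random.rand(4)},
    index=stamps,
)
df.index.name = "Date"

# Route 1: the datetime is the index, so read the component off the index.
years_from_index = df.index.year

# Route 2: the datetime is a regular column, so go through the .dt accessor.
as_column = df.reset_index()
years_from_column = as_column["Date"].dt.year

print(list(years_from_index))
print(list(years_from_column))
```

The string 'Date.year' never names a column in either case, which is why df['Date.year'] raises a KeyError.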

numpy busday_count for days difference gives TypeError: dtype('<M8[us]') to dtype('<M8[D]') according to the rule 'safe'

I am trying to calculate the number of days between two dates and I get an error:
TypeError: ("Iterator operand 0 dtype could not be cast from dtype('<M8[us]') to dtype('<M8[D]') according to the rule 'safe'", 'occurred at index 0')
The column REASSIGN_DATE has blank values in a few rows, which I am filling from another column called SETUP_DATE to remove any blank values.
Then I am trying to calculate the number of days between these two dates using the code below.
import pandas as pd
import numpy as np
import datetime
from datetime import date
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
import xlrd
import workdays
import defusedxml
from defusedxml import EntitiesForbidden  # needed for the except clause below
from xlrd import open_workbook
defusedxml.defuse_stdlib()
def secure_open_workbook(**kwargs):
    try:
        return open_workbook(**kwargs)
    except EntitiesForbidden:
        raise ValueError('Please use a xlsx file without XEE')
#loading Raw Data
releases = pd.read_excel(r'C:\Desktop\Releases.xlsx',
                         sheet_name='Releases',
                         header=0
                         )
releases.loc[releases['REASSIGN_DATE'].isnull(), 'REASSIGN_DATE'] = releases['SETUP_DATE']
releases['REASSIGN_DATE'] = pd.to_datetime(releases['REASSIGN_DATE'])
releases['RELEASED_DATE'] = pd.to_datetime(releases['RELEASED_DATE'])
releases['RELEASED_DAYS'] = releases.apply(lambda x:
    np.busday_count(x.REASSIGN_DATE, x.RELEASED_DATE), axis=1)
releases_2 = releases.drop(['SETUP_DATE', 'RELEASED_DATE', 'REASSIGN_DATE'], axis=1)
I get the error.
I even tried adding astype('datetime64[D]') as well; however, I then get an error saying that datetime64[ns] cannot be converted to datetime64[D].
How can I avoid this error?
Regards,
Ren.
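One possible way around the cast error, offered as a sketch rather than a confirmed fix for this exact workbook: np.busday_count works on day-resolution dates, so the columns can be converted to NumPy datetime64[D] arrays first and passed in whole, instead of row by row through apply. The dates below are made up; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Excel sheet, with one blank REASSIGN_DATE.
releases = pd.DataFrame({
    "REASSIGN_DATE": pd.to_datetime(["2019-05-01", None, "2019-05-10"]),
    "SETUP_DATE": pd.to_datetime(["2019-04-29", "2019-05-06", "2019-05-09"]),
    "RELEASED_DATE": pd.to_datetime(["2019-05-08", "2019-05-13", "2019-05-20"]),
})

# Fill blank reassign dates from the setup date.
releases["REASSIGN_DATE"] = releases["REASSIGN_DATE"].fillna(releases["SETUP_DATE"])

# Drop to plain NumPy arrays and truncate to whole days.
start = releases["REASSIGN_DATE"].to_numpy().astype("datetime64[D]")
end = releases["RELEASED_DATE"].to_numpy().astype("datetime64[D]")

# Business days in the half-open interval [start, end), computed for all rows at once.
releases["RELEASED_DAYS"] = np.busday_count(start, end)
print(releases)
```

Note that busday_count excludes the end date itself and cannot handle missing dates, so any remaining blanks would need to be filled or filtered out beforehand.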
