How to show alternative calendar dates in mplfinance? - python-3.x

TL;DR - The issue
I have an mplfinance plot based on a pandas dataframe in which the indices are in Georgian calendar format and I need to have them displayed as Jalali format.
My data and code
My data looks like this:
open high low close
date
2021-03-15 67330.0 69200.0 66870.0 68720.0
2021-03-16 69190.0 71980.0 69000.0 71620.0
2021-03-17 72450.0 73170.0 71700.0 71820.0
2021-03-27 71970.0 73580.0 70000.0 73330.0
2021-03-28 73330.0 73570.0 71300.0 71850.0
... ... ... ... ...
The first column is both a date and the index. This is required by mplfinance plot the data correctly;
Which I can plot with something like this:
import mplfinance as mpf
mpf.plot(chart_data.tail(7), figratio=(16,9), type="candle", style='yahoo', ylabel='', tight_layout=True, xrotation=90)
Where chart_data is the data above and the rest are pretty much formatting stuff.
What I have now
My chart looks like this:
However, the I need the dates to look like this: 1400-01-12. Here's a table of equivalence to further demonstrate my case.
2021-03-15 1399-12-25
2021-03-16 1399-12-26
2021-03-17 1399-12-27
2021-03-27 1400-01-07
2021-03-28 1400-01-08
What I've tried
Setting Jdates as my indices:
chart_data.index = history.jdate
mpf.plot(chart_data_j)
Throws this exception:
TypeError('Expect data.index as DatetimeIndex')
So I tried converting the jdates into datetimes:
chart_data_j.index = pd.to_datetime(history.jdate)
Which threw an out of bounds exception:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1398-03-18 00:00:00
So I though maybe changing the timezone/locale would be an option, so I tried changing the timezones, following the official docs:
pd.to_datetime(history.date).tz_localize(tz='US/Eastern')
But I got this exception:
raise TypeError(f"{ax_name} is not a valid DatetimeIndex or PeriodIndex")
And finally I tried using libraries such as PersianTools and pandas_jalali to no avail.

You can get this to work by creating your own custom DateFormatter class, and using mpf.plot() kwarg returnfig=True to gain access to the Axes objects to be able to install your own custom DateFormatter.
I have written a custom DateFormatter (see code below) that is aware of the special way that MPLfinance handles the x-axis when show_nontrading=False (i.e. the default value).
import pandas as pd
import mplfinance as mpf
import jdatetime as jd
import matplotlib.dates as mdates
from matplotlib.ticker import Formatter
class JalaliDateTimeFormatter(Formatter):
"""
Formatter for JalaliDate in mplfinance.
Handles both `show_nontrading=False` and `show_nontrading=True`.
When show_nonntrading=False, then the x-axis is indexed by an
integer representing the row number in the dataframe, thus:
Formatter for axis that is indexed by integer, where the integers
represent the index location of the datetime object that should be
formatted at that lcoation. This formatter is used typically when
plotting datetime on an axis but the user does NOT want to see gaps
where days (or times) are missing. To use: plot the data against
a range of integers equal in length to the array of datetimes that
you would otherwise plot on that axis. Construct this formatter
by providing the arrange of datetimes (as matplotlib floats). When
the formatter receives an integer in the range, it will look up the
datetime and format it.
"""
def __init__(self, dates=None, fmt='%b %d, %H:%M', show_nontrading=False):
self.dates = dates
self.len = len(dates) if dates is not None else 0
self.fmt = fmt
self.snt = show_nontrading
def __call__(self, x, pos=0):
'''
Return label for time x at position pos
'''
if self.snt:
jdate = jd.date.fromgregorian(date=mdates.num2date(x))
formatted_date = jdate.strftime(self.fmt)
return formatted_date
ix = int(round(x,0))
if ix >= self.len or ix < 0:
date = None
formatted_date = ''
else:
date = self.dates[ix]
jdate = jd.date.fromgregorian(date=mdates.num2date(date))
formatted_date = jdate.strftime(self.fmt)
return formatted_date
# ---------------------------------------------------
df = pd.read_csv('so_67001540.csv',index_col=0,parse_dates=True)
mpf.plot(df,figratio=(16,9),type="candle",style='yahoo',ylabel='',xrotation=90)
dates = [mdates.date2num(d) for d in df.index]
formatter = JalaliDateTimeFormatter(dates=dates,fmt='%Y-%m-%d')
fig, axlist = mpf.plot(df,figratio=(16,9),
type="candle",style='yahoo',
ylabel='',xrotation=90,
returnfig=True)
axlist[0].xaxis.set_major_formatter(formatter)
mpf.show()
The file 'so_67001540.csv' looks like this:
date,open,high,low,close,alt_date
2021-03-15,67330.0,69200.0,66870.0,68720.0,1399-12-25
2021-03-16,69190.0,71980.0,69000.0,71620.0,1399-12-26
2021-03-17,72450.0,73170.0,71700.0,71820.0,1399-12-27
2021-03-27,71970.0,73580.0,70000.0,73330.0,1400-01-07
2021-03-28,73330.0,73570.0,71300.0,71850.0,1400-01-08
When you run the above script, you should get the following two plots:

Have you tried making these dates
1399-12-25
1399-12-26
1399-12-27
1400-01-07
1400-01-08
the index of the dataframe (maybe that's what you mean by "swapping the indices"?) and set kwarg datetime_format='%Y-%m-%d' ?
I think that should work.
UPDATE:
It appears to me that the problem is that
mplfinace requires a Pandas.DatetimeIndex as the index of your dataframe, and
Pandas.DatetimeIndex is made up of Pandas.Timestamp objects, and
Pandas.Timestamp has limits which preclude dates having years less than 1677:
In [1]: import pandas as pd
In [2]: pd.Timestamp.max
Out[2]: Timestamp('2262-04-11 23:47:16.854775807')
In [3]: pd.Timestamp.min
Out[3]: Timestamp('1677-09-21 00:12:43.145225')
I am going to poke around and see if I can come up with another solution. Internally Matplotlib dates can go to year zero.

Related

Not able to make a class in Python and output it to excel

I'm fairly new in Python and working in a inventory management position.
One important thing in inventory management is calculating the safety stock.
So, this is what I'm trying to achieve.
I have imported a file with 3 columns; FR, sigma and LT for 3 rows. See hereunder the code and the output:
code:
import pandas as pd
df = pd.read_excel("Desktop\\TestPython4.xlsx")
xcol=["FR","sigma","LT"]
x=df[xcol].values
output:
snapshot
To calculate the safety stock, I have the following (simplified) formula of it;
CDF(FR)*sigma*sqrt(LT)
where CDF is the cumulative distribution function of the normal distribution and FR is a number between 0 and 1 (thus the well-knowned z-value is the output).
I want to output a the file with an extra column that displays the safety stock.
For this I made a class safetystock with the following code:
class Safetystock:
def __init__(self,FR,sigma,LT):
self.FR = FR
self.sigma = sigma
self.LT = LT
pass
def calculate():
SS=st.norm.ppf(FR)
return print(SS*sigma*np.sqrt(LT))
pass
Then I made the variable: "output"
Output = Safetystock(df.FR,df.sigma,df.LT)
I said that the data in the file needs to be taken into account.
Then I added a column to df, named output that needs to contain the variable "Output":
df["output"]=Output
Now, when I want to call df, it gives me this:
actual output
What am I doing wrong?
Cheers,
Steven
What about
import pandas as pd
import numpy as np
import scipy.stats as st
df = pd.read_excel("Desktop\\TestPython4.xlsx")
df["output"] = st.norm.ppf(df.FR)*df.sigma*np.sqrt(df.LT)

Bokeh ValueError: expected an element of either Seq(String)

I'm trying to build a simple bar chart via bokeh but struggling for it to recognize the x-axis and keep getting a ValueError... I think it needs to be in string format but for some reason whatever I try it just won't work. Please note, the column that contains the Years (as floats by the looks of it) is called RegionName, if it seems confusing. Please see my code below, any suggestions?
import pandas as pd
from bokeh.plotting import figure, output_file, show
from bokeh.models import ColumnDataSource
from bokeh.models.tools import HoverTool
import os
from bokeh.palettes import Spectral5
from bokeh.transform import factor_cmap
os.chdir("C:/Users/Vladimir.Tikhnenko/Python/Land Reg")
# Pivot data
def pivot2(infile="Land Registry.csv", outfile="SalesVolume.csv"):
df=pd.read_csv(infile)
table=pd.pivot_table(df,index=
["RegionName"],columns="Year",values="SalesVolume",aggfunc=sum)
table.to_csv(outfile)
return table
pivot2()
# Transpose data
df=pd.read_csv("SalesVolume.csv")
df=df.drop(df.columns[1:28],1)
df=pd.read_csv("SalesVolume.csv", index_col=0, header=None).T
df.to_csv("C:\\Users\Vladimir.Tikhnenko\Python\Land
Reg\SalesVolume.csv",index=None)
df=pd.read_csv("SalesVolume.csv")
source = ColumnDataSource(df)
years = source.data['RegionName'].tolist()
p = figure(x_range=['RegionName'])
color_map = factor_cmap(field_name='RegionName',palette=Spectral5,
factors=years)
p.vbar(x='RegionName', top='Southwark', source=source, width=1,
color=color_map)
p.title.text ='Transactions'
p.xaxis.axis_label = 'Years'
p.yaxis.axis_label = 'Number of Sales'
show(p)
the error message is
ValueError: expected an element of either Seq(String), Seq(Tuple(String,
String)) or Seq(Tuple(String, String, String)), got [1968.0, 1969.0, 1970.0,
1971.0, 1972.0, 1973.0, 1974.0, 1975.0, 1976.0, 1977.0, 1978.0, 1979.0,
1980.0, 1981.0, 1982.0, 1983.0, 1984.0, 1985.0, 1986.0, 1987.0, 1988.0,
1989.0, 1990.0, 1991.0, 1992.0, 1993.0, 1994.0, 1995.0, 1996.0, 1997.0,
1998.0, 1999.0, 2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, 2006.0,
2007.0, 2008.0, 2009.0, 2010.0, 2011.0, 2012.0, 2013.0, 2014.0, 2015.0,
2016.0, 2017.0, 2018.0]
Categorical factors must only be strings (or sequences of strings for nested factors), so factor_cmap only accepts lists of those things. You passed it a list a numbers, which causes the error shown. To use use the years as categorical factors, you need to convert them to strings as suggested, and use those string values to initialize x_range, and for the coordinates to vbar.
Alternatively, if you want to use numerical values for the years, but just want to have fixed, controlled tick locations, do this:
p = figure() # don't pass x_range
p.xaxis.ticker = years
And then also use linear_cmap to map the numerical values (instead of factor_cmap)

Pandas dataframe column float inside string (i.e. "float") to int

I'm trying to clean some data in a pandas df and I want the 'volume' column to go from a float to an int.
EDIT: The main issue was that the dtype for the float variable I was looking at was actually a str. So first it needed to be floated, before being changed.
I deleted the two other solutions I was considering, and left the one I used. The top one is the one with the errors, and the bottom one is the solution.
import pandas as pd
import numpy as np
#Call the df
t_df = pd.DataFrame(client.get_info())
#isolate only the 'symbol' column in t_df
tickers = t_df.loc[:, ['symbol']]
def tick_data(tickers):
for i in tickers:
tick_df = pd.DataFrame(client.get_ticker())
tick = tick_df.loc[:, ['symbol', 'volume']]
tick.iloc[:,['volume']].astype(int)
if tick['volume'].dtype != np.number:
print('yes')
else:
print('no')
return tick
Below is the revised code:
import pandas as pd
#Call the df
def ticker():
t_df = pd.DataFrame(client.get_info())
#isolate only the 'symbol' column in t_df
tickers = t_df.loc[:, ['symbol']]
for i in tickers:
#pulls out market data for each symbol
tickers = pd.DataFrame(client.get_ticker())
#isolates the symbol and volume
tickers = tickers.loc[:, ['symbol', 'volume']]
#floats volume
tickers['volume'] = tickers.loc[:, ['volume']].astype(float)
#volume to int
tickers['volume'] = tickers.loc[:, ['volume']].astype(int)
#deletes all symbols > 20,000 in volume, returns only symbol
tickers = tickers.loc[tickers['volume'] >= 20000, 'symbol']
return tickers
You have a few issues here.
In your first example, iloc only accepts integer locations for the rows and columns in the DataFrame, which is generating your error. I.e.
tick.iloc[:,['volume']].astype(int)
doesn't work. If you want label-based indexing, use .loc:
tick.loc[:,['volume']].astype(int)
Alternately, use bracket-based indexing, which allows you to take a whole column directly without using slice syntax (:) on the rows:
tick['volume'].astype(int)
Next, astype(int) returns a new value, it does not modify in-place. So what you want is
tick['volume'] = tick['volume'].astype(int)
As for your dtype is a number check, you don't want to check == np.number, but you don't want to check is either, which only returns True if it's np.number and not if it's a subclass like np.int64. Use np.issubdtype, or pd.api.types.is_numeric_dtype, i.e.:
if np.issubdtype(tick['volume'].dtype, np.number):
or:
if pd.api.types.is_numeric_dtype(tick['volume'].dtype):

Sort by datetime in python3

Looking for help on how to sort a python3 dictonary by a datetime object (as shown below, a value in the dictionary) using the timestamp below.
datetime: "2018-05-08T14:06:54-04:00"
Any help would be appreciated, spent a bit of time on this and know that to create the object I can do:
format = "%Y-%m-%dT%H:%M:%S"
# Make strptime obj from string minus the crap at the end
strpTime = datetime.datetime.strptime(ts[:-6], format)
# Create string of the pieces I want from obj
convertedTime = strpTime.strftime("%B %d %Y, %-I:%m %p")
But I'm unsure how to go about comparing that to the other values where it accounts for both day and time correctly, and cleanly.
Again, any nudges in the right direction would be greatly appreciated!
Thanks ahead of time.
Datetime instances support the usual ordering operators (< etc), so you should order in the datetime domain directly, not with strings.
Use a callable to convert your strings to timezone-aware datetime instances:
from datetime import datetime
def key(s):
fmt = "%Y-%m-%dT%H:%M:%S%z"
s = ''.join(s.rsplit(':', 1)) # remove colon from offset
return datetime.strptime(s, fmt)
This key func can be used to correctly sort values:
>>> data = {'s1': "2018-05-08T14:06:54-04:00", 's2': "2018-05-08T14:05:54-04:00"}
>>> sorted(data.values(), key=key)
['2018-05-08T14:05:54-04:00', '2018-05-08T14:06:54-04:00']
>>> sorted(data.items(), key=lambda item: key(item[1]))
[('s2', '2018-05-08T14:05:54-04:00'), ('s1', '2018-05-08T14:06:54-04:00')]

Matplotlib: Import and plot multiple time series with legends direct from .csv

I have several spreadsheets containing data saved as comma delimited (.csv) files in the following format: The first row contains column labels as strings ('Time', 'Parameter_1'...). The first column of data is Time and each subsequent column contains the corresponding parameter data, as a float or integer.
I want to plot each parameter against Time on the same plot, with parameter legends which are derived directly from the first row of the .csv file.
My spreadsheets have different numbers of (columns of) parameters to be plotted against Time; so I'd like to find a generic solution which will also derive the number of columns directly from the .csv file.
The attached minimal working example shows what I'm trying to achieve using np.loadtxt (minus the legend); but I can't find a way to import the column labels from the .csv file to make the legends using this approach.
np.genfromtext offers more functionality, but I'm not familiar with this and am struggling to find a way of using it to do the above.
Plotting data in this style from .csv files must be a common problem, but I've been unable to find a solution on the web. I'd be very grateful for your help & suggestions.
Many thanks
"""
Example data: Data.csv:
Time,Parameter_1,Parameter_2,Parameter_3
0,10,0,10
1,20,30,10
2,40,20,20
3,20,10,30
"""
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt('Data.csv', skiprows=1, delimiter=',') # skip the column labels
cols = data.shape[1] # get the number of columns in the array
for n in range (1,cols):
plt.plot(data[:,0],data[:,n]) # plot each parameter against time
plt.xlabel('Time',fontsize=14)
plt.ylabel('Parameter values',fontsize=14)
plt.show()
Here's my minimal working example for the above using genfromtxt rather than loadtxt, in case it is helpful for anyone else.
I'm sure there are more concise and elegant ways of doing this (I'm always happy to get constructive criticism on how to improve my coding), but it makes sense and works OK:
import numpy as np
import matplotlib.pyplot as plt
arr = np.genfromtxt('Data.csv', delimiter=',', dtype=None) # dtype=None automatically defines appropriate format (e.g. string, int, etc.) based on cell contents
names = (arr[0]) # select the first row of data = column names
for n in range (1,len(names)): # plot each column in turn against column 0 (= time)
plt.plot (arr[1:,0],arr[1:,n],label=names[n]) # omitting the first row ( = column names)
plt.legend()
plt.show()
The function numpy.genfromtxt is more for broken tables with missing values rather than what you're trying to do. What you can do is simply open the file before handing it to numpy.loadtxt and read the first line. Then you don't even need to skip it. Here is an edited version of what you have here above that reads the labels and makes the legend:
"""
Example data: Data.csv:
Time,Parameter_1,Parameter_2,Parameter_3
0,10,0,10
1,20,30,10
2,40,20,20
3,20,10,30
"""
import numpy as np
import matplotlib.pyplot as plt
#open the file
with open('Data.csv') as f:
#read the names of the colums first
names = f.readline().strip().split(',')
#np.loadtxt can also handle already open files
data = np.loadtxt(f, delimiter=',') # no skip needed anymore
cols = data.shape[1]
for n in range (1,cols):
#labels go in here
plt.plot(data[:,0],data[:,n],label=names[n])
plt.xlabel('Time',fontsize=14)
plt.ylabel('Parameter values',fontsize=14)
#And finally the legend is made
plt.legend()
plt.show()

Resources