I'm learning how to extract data from a URL and then graph it.
For this tutorial, I am using the Yahoo dataset for a stock.
The code is as follows:
import matplotlib.pyplot as plt
import numpy as np
import urllib.request
import matplotlib.dates as mdates
import datetime

def bytespdate2num(fmt, encoding='utf-8'):
    # Wrap the string-to-date-number converter so it accepts bytes
    strconverter = mdates.strpdate2num(fmt)
    def bytesconverter(b):
        s = b.decode(encoding)
        return strconverter(s)
    return bytesconverter

def graph_data(stock):
    stock_price_url = 'https://pythonprogramming.net/yahoo_finance_replacement'
    source_code = urllib.request.urlopen(stock_price_url).read().decode()
    stock_data = []
    split_source = source_code.split('\n')
    print(len(split_source))
    # Keep only lines that have all seven comma-separated fields
    for line in split_source:
        split_line = line.split(',')
        if len(split_line) == 7:
            stock_data.append(line)
    date,openn,closep,highp,lowp,openp,volume=np.loadtxt(stock_data,delimiter=',',unpack=True,converters={0:bytespdate2num('%Y-%m-%d')})
    plt.plot_date(date,closep)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title('Graph')
    plt.show()

graph_data('TSLA')
The whole code is pretty easy to understand except the part that converts the string data into a date format using the bytespdate2num function.
Is there an easier way to convert strings read from a URL into dates during the NumPy extraction, or is there another method I can use?
Thank you
With a guess as to the csv format, I can use the numpy 'native' datetime dtype:
In [183]: txt = ['2020-10-23 1 2.3']*3
In [184]: txt
Out[184]: ['2020-10-23 1 2.3', '2020-10-23 1 2.3', '2020-10-23 1 2.3']
If I let genfromtxt do its own dtype conversions:
In [187]: np.genfromtxt(txt, dtype=None, encoding=None)
Out[187]:
array([('2020-10-23', 1, 2.3), ('2020-10-23', 1, 2.3),
       ('2020-10-23', 1, 2.3)],
      dtype=[('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])
The date column is rendered as a string.
If I specify a datetime64 format:
In [188]: np.array('2020-10-23', dtype='datetime64[D]')
Out[188]: array('2020-10-23', dtype='datetime64[D]')
In [189]: np.genfromtxt(txt, dtype=['datetime64[D]',int,float], encoding=None)
Out[189]:
array([('2020-10-23', 1, 2.3), ('2020-10-23', 1, 2.3),
       ('2020-10-23', 1, 2.3)],
      dtype=[('f0', '<M8[D]'), ('f1', '<i8'), ('f2', '<f8')])
This date appears to work in plt:
In [190]: plt.plot_date(_['f0'], _['f1'])
I used genfromtxt because I'm more familiar with its ability to handle dtypes.
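As a further sketch along the same lines (not part of the session above): if the rows are comma-separated like the data in the question, the date column can be split off by hand and converted with the same datetime64[D] dtype. The sample rows and the column order below are assumptions for illustration only.

```python
import numpy as np

# Hypothetical sample rows in the comma-separated layout from the question
lines = [
    "2017-07-26,153.35,153.93,153.06,153.50,153.50,12778195.00",
    "2017-07-25,151.80,153.84,151.80,152.74,152.74,18853932.00",
    "",  # a trailing blank line, as produced by split('\n')
]

# Keep only complete rows with seven fields
rows = [line.split(",") for line in lines if line.count(",") == 6]

# Dates become a native NumPy datetime array, the rest a float array
dates = np.array([r[0] for r in rows], dtype="datetime64[D]")
values = np.array([[float(v) for v in r[1:]] for r in rows])

# Column 1 of `values` is the one the question calls closep (assumed order)
closep = values[:, 1]
```

The dates array should then be usable directly as the x argument of a matplotlib plot call, without a custom converter function.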
import numpy as np
import pandas as pd
df = pd.DataFrame({'dt': ['2021-2-13', '2022-2-15'],
                   'w': [5, 7],
                   'n': [11, 8]})
df.reset_index()
print(list(df.loc[:,'dt'].values))
gives: ['2021-2-13', '2022-2-15']
NEEDED: [('2021-2-13'), ('2022-2-15')]
Important (in answer to a question in the comments): the "NEEDED" form is what mplfinance accepts for the vlines argument of plot (I checked). I need to draw vertical lines at specified dates on the x-axis of the chart.
import mplfinance as mpf
RES['Date'] = RES['Date'].dt.strftime('%Y-%m-%d')
my_vlines=RES.loc[:,'Date'].values # does NOT work
fig, axlist = mpf.plot( ohlc_df, type="candle", vlines= my_vlines, xrotation=30, returnfig=True, figsize=(6,4))
This only works with an explicit my_vlines= [('2022-01-18'), ('2022-02-25')]
SOLVED: Oh, it really appears to be so simple after all:
my_vlines=list(RES.loc[:,'Date'].values)
Your question asks for a list of NumPy arrays, but your desired output looks like tuples. If you need tuples, note that it is the comma that makes the tuple, not the parentheses, so you would do something like this:
desired_format = [(x,) for x in list(df.loc[:,'dt'].values)]
If you want NumPy arrays, you could do this:
desired_format = [np.array(x) for x in list(df.loc[:,'dt'].values)]
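To make the point about the comma concrete, here is a small sketch in plain Python. It also shows why the "NEEDED" form in the question is the same thing as a plain list of strings. The variable names are only for illustration.

```python
# Parentheses alone do not create a tuple; the trailing comma does
just_a_string = ('2021-2-13')
one_element_tuple = ('2021-2-13',)

dates = ['2021-2-13', '2022-2-15']

# The "NEEDED" literal from the question is just a list of strings
needed = [('2021-2-13'), ('2022-2-15')]

# A list of real one-element tuples, if that is what is wanted
as_tuples = [(d,) for d in dates]
```

Since needed and dates compare equal, converting the array of values to a list is enough, which matches the "SOLVED" note in the question.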
I think I understand your problem. Please see the example code below and let me know if this resolves your problem. I expanded on your dataframe to meet mplfinance plot criteria.
import pandas as pd
import numpy as np
import mplfinance as mpf
df = pd.DataFrame({'dt': ['2021-2-13', '2022-2-15'],'Open': [5,7],'Close': [11, 8],'High': [21,30],'Low': [7, 3]})
df['dt']=pd.to_datetime(df['dt'])
df.set_index('dt', inplace = True)
mpf.plot(df, vlines = dict(vlines = df.index.tolist()))
Say I would like to smooth the following daily data, named oildata, with scipy.signal.savgol_filter:
from scipy.signal import savgol_filter
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = np.random.uniform(0, 10, size=90)
index= pd.date_range('20130226', periods=90)
oildata = pd.Series(data, index)
savgol_filter(oildata, 5, 3)
plt.plot(oildata)
plt.plot(pd.Series(savgol_filter(oildata, 5, 3), index=oildata.index))
plt.show()
Out: [figure: oildata and the series smoothed with savgol_filter(oildata, 5, 3)]
When I replace savgol_filter(oildata, 5, 3) with savgol_filter(oildata, 31, 3):
[figure: oildata and the series smoothed with savgol_filter(oildata, 31, 3)]
Besides trial and error, are there any criteria or methods for quickly selecting a suitable window_length (which must be a positive odd integer) and polyorder (which must be less than window_length)? Thanks.
Reference:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.savgol_filter.html
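This is not a selection rule, only a sketch of how the trial and error could be made less manual: loop over a few candidate window lengths and compare a simple roughness number for each result. The roughness measure (standard deviation of first differences), the candidate list, and the fixed seed are my own assumptions, not something from the SciPy documentation.

```python
import numpy as np
from scipy.signal import savgol_filter

# Same kind of data as in the question, seeded so the run is repeatable
rng = np.random.default_rng(0)
data = rng.uniform(0, 10, size=90)

def roughness(y):
    # Hypothetical measure: spread of day-to-day changes
    return float(np.std(np.diff(y)))

polyorder = 3
results = {}
for window_length in (5, 11, 21, 31):  # odd and larger than polyorder
    smoothed = savgol_filter(data, window_length, polyorder)
    results[window_length] = roughness(smoothed)

for window_length, value in results.items():
    print(window_length, round(value, 3))
```

A larger window gives a lower roughness value, so the numbers only show how strongly each setting smooths. Deciding how much smoothing is appropriate still depends on the data.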
Below is a snippet that converts data into a NumPy array. It is then converted to a Pandas DataFrame where I intend to process it. I'm attempting to convert it back to a NumPy array. I'm failing at this. Badly.
import pandas as pd
import numpy as np
from pprint import pprint
data = [
    ('2020-11-01 00:00:00', 1.0),
    ('2020-11-02 00:00:00', 2.0)
]
coordinatesType = [('timestamp', 'datetime64[s]'), ('value', '<f8')]
npArray = np.asarray(data, coordinatesType)
df = pd.DataFrame(data = npArray)
# do some pandas processing, then convert back to a numpy array
mutatedNpArray = df.to_numpy(coordinatesType)
pprint(mutatedNpArray)
# don't supply dtype for kicks
pprint(df.to_numpy())
This yields crazytown:
array([[('2020-11-01T00:00:00', 1.6041888e+18),
        ('1970-01-01T00:00:01', 1.0000000e+00)],
       [('2020-11-02T00:00:00', 1.6042752e+18),
        ('1970-01-01T00:00:02', 2.0000000e+00)]],
      dtype=[('timestamp', '<M8[s]'), ('value', '<f8')])
array([[Timestamp('2020-11-01 00:00:00'), 1.0],
       [Timestamp('2020-11-02 00:00:00'), 2.0]], dtype=object)
I realize a DataFrame is really a fancy NumPy array under the hood, but I'm passing back to a function that accepts a simple NumPy array. Clearly I'm not handling dtypes correctly and/or I don't understand the data structure inside my DataFrame. Below is what the function I'm calling expects:
array([('2020-11-01T00:00:00', 1.000 ),
       ('2020-11-02T00:00:00', 2.000 )],
      dtype=[('timestamp', '<M8[s]'), ('value', '<f8')])
I'm really lost on how to do this. Or what I should be doing instead.
Help!
As @hpaul suggested, I tried the following:
# ...
df = df.set_index('timestamp')
# do some pandas processing, then convert back to a numpy array
mutatedNpArray = df.to_records(coordinatesType)
# ...
All good!
Besides the to_records approach mentioned in comments, you can do:
df.apply(tuple, axis=1).to_numpy(coordinatesType)
Output:
array([('2020-11-01T00:00:00', 1.), ('2020-11-02T00:00:00', 2.)],
dtype=[('timestamp', '<M8[s]'), ('value', '<f8')])
Considerations:
I believe the issue here is related to the difference in shape between the original array and the dataframe.
The shape of your original NumPy array is (2,), where each value is a tuple. When creating the dataframe, both df.shape and df.to_numpy() have shape (2, 2), so the dtype constructor does not work as expected. When converting the rows to tuples in a pd.Series, you get back the original shape of (2,).
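Another way to sidestep the (2, 2) shape problem described above, offered only as a sketch: allocate an empty structured array of shape (n,) and fill it one field at a time from the dataframe columns. The dataframe below is a hypothetical stand-in built directly with pandas rather than from the question's npArray.

```python
import numpy as np
import pandas as pd

coordinatesType = [('timestamp', 'datetime64[s]'), ('value', '<f8')]

# Hypothetical stand-in for the processed DataFrame from the question
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2020-11-01 00:00:00',
                                 '2020-11-02 00:00:00']),
    'value': [1.0, 2.0],
})

# One record per row, so the result has shape (len(df),), not (len(df), 2)
out = np.empty(len(df), dtype=coordinatesType)
for name in out.dtype.names:
    # Each column is cast to the dtype of the matching field on assignment
    out[name] = df[name].to_numpy()
```

This assumes the dataframe column names match the field names in coordinatesType.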
I've been trying to convert date strings into a format that can be plotted on a graph.
The code:
import matplotlib.pyplot as plt
import numpy as np
import urllib.request
import matplotlib.dates as mdates
import datetime

def graph_data():
    fig=plt.figure()
    ax1=plt.subplot2grid((1,1),(0,0))
    stock_price_url = 'https://pythonprogramming.net/yahoo_finance_replacement'
    source_code = urllib.request.urlopen(stock_price_url).read().decode()
    stock_data = []
    split_source=source_code.split('\n')
    print(len(split_source))
    for line in split_source[1:]:
        stock_data.append(line)
    date,openn,closep,highp,lowp,openp,volume=np.loadtxt(stock_data,delimiter=',',unpack=True)
    x = [datetime.strptime(d, '%Y-%m-%d') for d in date]
    ax1.plot_date(x,closep,'-',linewidth=0.1)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title('Graph')
    plt.show()

graph_data()
Any method of conversion just gives the same error:
ValueError: could not convert string to float: '2017-07-26'
What method can I use to convert the string into a date that can be plotted?
There's nothing wrong with your code. The problem is with the data.
If you look at the data, you will find that each row, from date to volume, is a single string like this:
data = '2017-07-26,153.3500,153.9300,153.0600,153.5000,153.5000,12778195.00'
So you need to do some preprocessing first. There are various ways to do so.
I found this method helpful:
First, replace the commas in each row with spaces, then use the split function to break the row into separate fields.
So, you need to make these changes in your code:
date = []
closep = []
for i in range(len(stock_data)):
    temp = stock_data[i].replace(',', ' ').split()
    date.append(temp[0])
    closep.append(temp[2])
0 and 2 are the positions of date and closep in your dataset.
Now, instead of 'x' and 'closep' as used in your plot call, use the 'date' and 'closep' lists built in the code above.
One more thing: I think the graph has trouble with a dataset this big.
So use date[0:100], closep[0:100] to try the plot on a smaller dataset.
The complete code would look like this:
import matplotlib.pyplot as plt
import numpy as np
import urllib.request
import matplotlib.dates as mdates
import datetime

def graph_data():
    fig = plt.figure()
    ax1 = plt.subplot2grid((1, 1), (0, 0))
    stock_price_url = 'https://pythonprogramming.net/yahoo_finance_replacement'
    source_code = urllib.request.urlopen(stock_price_url).read().decode()
    stock_data = []
    split_source = source_code.split('\n')
    # Skip the header line and any blank lines
    for line in split_source[1:]:
        if line.strip():
            stock_data.append(line)
    date = []
    closep = []
    for i in range(len(stock_data)):
        temp = stock_data[i].replace(',', ' ').split()
        date.append(temp[0])
        closep.append(temp[2])
    ax1.plot_date(date[0:100], closep[0:100], '-', linewidth=0.1)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title('Graph')
    plt.show()

graph_data()
Hope this helps.
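Note that the date and closep lists built this way still hold strings. A possible refinement, as a sketch under assumptions: convert each date with datetime.strptime and each price with float while splitting. The sample lines, the header text, and the parse_rows helper below are hypothetical. Only the seven-field layout and the field positions 0 and 2 come from the posts above.

```python
from datetime import datetime

# Hypothetical sample of the downloaded text, already split on '\n'
lines = [
    "Date,Open,High,Low,Close,Adjusted_close,Volume",
    "2017-07-26,153.3500,153.9300,153.0600,153.5000,153.5000,12778195.00",
    "2017-07-25,151.8000,153.8400,151.8000,152.7400,152.7400,18853932.00",
    "",
]

def parse_rows(lines):
    """Return (dates, closep) with real datetime and float values."""
    dates, closep = [], []
    for line in lines[1:]:          # skip the header line
        fields = line.strip().split(",")
        if len(fields) != 7:        # skip blank or incomplete lines
            continue
        dates.append(datetime.strptime(fields[0], "%Y-%m-%d"))
        closep.append(float(fields[2]))
    return dates, closep

dates, closep = parse_rows(lines)
```

With datetime objects and floats, a call such as ax1.plot(dates, closep, '-') should give a proper date axis instead of treating every value as a text label.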
The numpy.genfromtxt page in the SciPy docs shows the following code. I cannot make sense of it, especially the dtype and the part that reads the string. The code is:
from io import StringIO
import numpy as np
s=StringIO(u"1,1.3,abced")
data=np.genfromtxt(s, dtype=[('myint', 'i8'),('myfloat','f8'), ('mystring','S5')], delimiter=",")
OK. I get that 1, 1.3 and abced are being read from s=StringIO(u"1,1.3,abced"). But what does u do?
Also, I get that i8 is an 8-byte integer. But what do 'myint', 'myfloat' and 'mystring' do?
'u' is for 'unicode', which is the default string type in Py3, so it isn't needed here. The StringIO isn't needed either; I can just give genfromtxt a list of strings:
In [221]: txt = ["1,1.3,abced"]
In [223]: np.genfromtxt(txt,
              dtype=[('myint', 'i8'),('myfloat','f8'), ('mystring','S5')],
              delimiter=",")
Out[223]:
array((1, 1.3, b'abced'),
      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', 'S5')])
The dtype defines a compound dtype, one with 3 fields, one for each column. You access fields by name:
data['myint']
data['myfloat']
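To show what the field names buy you, here is a small sketch that builds an array with the same compound dtype directly, without genfromtxt. The second row is made up for illustration.

```python
import numpy as np

# Same compound dtype as above: three named fields, one per column
dt = [('myint', 'i8'), ('myfloat', 'f8'), ('mystring', 'S5')]

# Each tuple is one record; the second row is hypothetical sample data
data = np.array([(1, 1.3, b'abced'),
                 (2, 2.6, b'fghij')], dtype=dt)

ints = data['myint']         # all values of the 'myint' field
first = data[0]              # one whole record
first_string = first['mystring']
```

Indexing by name selects a column across all records, while indexing by position selects one record, whose fields can again be read by name.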