how to predict with gaussianhmm sklearn - scikit-learn

I'm trying to predict stock prices using sklearn. I'm new to prediction. I tried the scikit-learn example for stock prediction with a Gaussian HMM, but predict gives a sequence of hidden states overlaid on the price, and those states come from the given input close prices. My question is: how do I generate the next 10 prices?

You will always use the last state to predict the next state, so let's add 10 days worth of inputs by changing the end date to the 23rd:
date2 = datetime.date(2012, 1, 23)
You can double-check the rest of the code to make sure I am not actually using future data for the prediction. The rest of these lines can be added to the bottom of the file. First we want to find out what the expected return is for a given state. The model.means_ array has returns, but those are the returns that got us to this state, not the future returns, which is what you want. To get the future returns, we consider the probability of moving to each of the 5 states and what the return of each of those states is. We get the probability of moving to any particular state from the model.transmat_ matrix, and for the return of each state we use the model.means_ values. We take the dot product to get the expected return for each state. Then we drop the volume column (you can leave it in if you want, but you seemed to be most interested in future prices).
expected_returns_and_volumes = np.dot(model.transmat_, model.means_)
returns_and_volumes_columnwise = list(zip(*expected_returns_and_volumes))  # transpose: one tuple per column
returns = returns_and_volumes_columnwise[0]
If you print the value for returns[0], you'll see the expected return in dollars for state 0, returns[1] for state 1 etc. Now, given a day and a state, we want to predict the price for tomorrow. You said 10 days so let's use that for lastN.
predicted_prices = []
lastN = 10
for idx in range(lastN):
    state = hidden_states[-lastN + idx]
    current_price = quotes[-lastN + idx][2]
    current_date = datetime.date.fromordinal(dates[-lastN + idx])
    predicted_date = current_date + datetime.timedelta(days=1)
    predicted_prices.append((predicted_date, current_price + returns[state]))
print(predicted_prices)
If you were running this in "production" you would set date2 to the last date you have and then lastN would be 1. Note that I don't take into account weekends for the predicted_date.
This is a fun exercise, but you probably wouldn't run it in production, hence the quotes. First, the time series is the raw price; this should really be percentage returns or log returns. Also, there is no justification for picking 5 states for the HMM, or even for assuming an HMM is a good fit for this kind of problem, which I doubt. It was probably just picked as an example. I think the other sklearn example using PCA is much more interesting.
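If you did want to feed returns to the HMM instead of raw prices, the preprocessing is short with NumPy (a minimal sketch, assuming close_v is the array of closing prices as in the sklearn example):
import numpy as np

# sketch: build return series from a 1-D array of closing prices (close_v assumed)
log_returns = np.diff(np.log(close_v))         # log return between consecutive days
pct_returns = np.diff(close_v) / close_v[:-1]  # simple percentage return
# either series (optionally stacked with volume via np.column_stack) could be fed
# to GaussianHMM.fit instead of the raw price differences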

Related

Contribution analysis versus lca.score

I am interested in which processes/activities contribute most to the Life Cycle Impact Assessment (LCIA) that I am conducting. For this, I run a contribution analysis (see code below). To crosscheck the results of my contribution analysis and to ensure that I get everything right, I wanted to compare the returned contributions with the impact assessment result (lca.score).
The documentation of ca.annotated_top_processes(lca) says: "Returns a list of tuples: (lca score, supply, activity)."
In my understanding, lca.score should be the same value as the sum of all the first values in the tuples that are returned by ca.annotated_top_processes(lca) (the printed values). However, this is not the case. What am I missing? Is there some sort of cut-off applied or did I misunderstand something?
import bw2data
import bw2analyzer as bwa

# db_ei381 is assumed to be a previously loaded database
random_act = db_ei381.random()
lca = bw2data.LCA(
    {random_act: 1},
    ('ReCiPe Midpoint (H) V1.13', 'water depletion', 'WDP')
)
lca.lci()
lca.lcia()
print(lca.score)

# %% Contribution analysis
ca = bwa.ContributionAnalysis()
contributions = ca.annotated_top_processes(lca)
print(sum([i[0] for i in contributions]))
It is not well documented, but you can pass an argument limit that specifies the number of activities considered in the contribution analysis. The default value is, I think, 25. The results are sorted so the most important activities come first. If you write something like the following, you should see how the result converges to the total score as the number of activities increases:
import matplotlib.pyplot as plt

cutoff = [25, 50, 100, 500, 1000, 1200]
scores = []
for n in cutoff:
    contributions = ca.annotated_top_processes(lca, limit=n)
    contr_sum = sum([i[0] for i in contributions])
    scores.append(contr_sum)

plt.plot(cutoff, scores)
plt.axhline(lca.score, ls='--', color='r');
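If the goal is only to confirm that the contributions add up to lca.score, a single call with a very large limit should get you essentially there (a sketch reusing the same ca and lca objects; 10000 is just an arbitrarily large value):
contributions = ca.annotated_top_processes(lca, limit=10000)
# each tuple is (lca score, supply, activity), so the sum of the first elements
# should approach lca.score as limit grows
print(sum(score for score, supply, activity in contributions), lca.score)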

How to process the data returned from a function (Python 3.7)

Background:
My question should be relatively easy, however I am not able to figure it out.
I have written a function regarding queueing theory and it will be used for ambulance service planning. For example, how many calls for service can I expect in a given time frame.
The function takes two parameters: a starting value for the number of ambulances in my system, starting at 0 and ending at 100 ambulances. This will show the probability of zero calls for service, one call for service, two calls for service… up to 100 calls for service. The second parameter is an arrival rate, which is the past historical arrival rate in my system.
The function runs and prints out the result to my screen. I have checked the math and it appears to be correct.
This is Python 3.7 with the Anaconda distribution.
My question is this:
I would like to process this data even further, but I don’t know how to capture it and do more math. For example, I would like to take this list and accumulate the probability values. With an arrival rate of five, there is a cumulative probability of 61.56% of five or fewer calls for service, etc.
A second example of how I would like to process this data is to format it as percentages and write out a text file
A third example would be to process the cumulative probabilities and exclude any values higher than the 99% cumulative value (because these vanish into extremely small numbers).
A fourth example would be to create a bar chart showing the probability of n calls for service.
These are some of the things I want to do with the queueing theory calculations. And there are a lot more. I am planning on writing a larger application. But I am stuck at this point. The function writes an output into my Python 3.7 console. How do I “capture” that output as an object or something and perform other processing on the data?
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import math
import csv

def probability_x(start_value = 0, arrival_rate = 0):
    probability_arrivals = []
    while start_value <= 100:
        probability_arrivals = [start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)]
        print(probability_arrivals)
        start_value = start_value + 1
    return probability_arrivals

#probability_x(arrival_rate = 5, x = 5)
#The code written above prints to the console, but my goal is to take the returned values and make other calculations.
#How do I 'capture' this data for further processing is where I need help (for example, bar plots, cumulative frequency, etc )

#failure. TypeError: writerows() argument must be iterable.
with open('ExpectedProbability.csv', 'w') as writeFile:
    writer = csv.writer(writeFile)
    for value in probability_x(arrival_rate = 5):
        writer.writerows(value)
writeFile.close()

#Failure. Why does it return 2. Yes there are two columns but I was expecting 101 as the length because that is the end of my loop.
print(len(probability_x(arrival_rate = 5)))
The problem is, when you write
probability_arrivals = [start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)]
You're overwriting the previous contents of probability_arrivals. Everything that it held previously is lost.
Instead of using = to reassign probability_arrivals, you want to append another entry to the list:
probability_arrivals.append([start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)])
I'll also note that your while loop can be improved. You're basically just looping start_value up to a fixed limit, so a for loop would be more appropriate here (using append, as above):
for s in range(start_value, 101):  # the end value is exclusive, so it's 101, not 100
    probability_arrivals.append([s, math.pow(arrival_rate, s) * math.pow(math.e, -arrival_rate) / math.factorial(s)])
    print(probability_arrivals[-1])
Now you don't need to manually worry about incrementing the counter.
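To address the original question of capturing the output for further processing, here is a minimal sketch (not part of the code above; the function name poisson_probabilities and the file name are illustrative) that returns the full list and then builds cumulative probabilities, writes the values as percentages to a CSV, and draws a bar chart:
import csv
import math
import matplotlib.pyplot as plt

def poisson_probabilities(arrival_rate, max_calls=100):
    """Return [(n, P(N = n)), ...] for n = 0..max_calls."""
    table = []
    for n in range(max_calls + 1):
        p = math.pow(arrival_rate, n) * math.exp(-arrival_rate) / math.factorial(n)
        table.append((n, p))
    return table

table = poisson_probabilities(arrival_rate=5)

# cumulative probabilities: P(N <= n)
cumulative = []
running = 0.0
for n, p in table:
    running += p
    cumulative.append((n, running))

# write the probabilities as percentages to a CSV file
with open('ExpectedProbability.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['calls', 'probability_pct'])
    for n, p in table:
        writer.writerow([n, f'{100 * p:.4f}'])

# bar chart of P(N = n), truncated at the 99% cumulative value
cutoff = next(n for n, c in cumulative if c >= 0.99)
plt.bar([n for n, _ in table[:cutoff + 1]], [p for _, p in table[:cutoff + 1]])
plt.xlabel('calls for service')
plt.ylabel('probability')
plt.show()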

pyramid-arima auto_arima order selection

I am working on time series forecasting (daily data) using pyramid-arima's auto_arima in Python, where y is my target and x_features are all exogenous variables. I want the best model order based on the lowest AIC, but auto_arima returns only a few order combinations.
In the attached output, the first call (start_p = start_q = 0, max_p = 0, max_q = 3) returns all 4 combinations, but the second call (start_p = start_q = 0, max_p = 3, max_q = 3) returns only 7 combinations; it didn't try (0,1,2), (0,1,3) and others, which leads to the wrong model being selected based on AIC. All other parameters are left at their defaults, e.g. max_order = 10.
Is there anything I am missing or have done wrong?
Thank you in advance.
You say error_action='ignore', so probably (0,1,2) and (0,1,3) (and other orders) gave errors, so they didn't appear in the results.
(I don't have enough reputation to write a comment, sorry).
The number of models auto_arima trains depends on the data you feed in and on the stepwise parameter. With stepwise=True, auto_arima uses a proven stepwise search to reduce the number of iterations needed to find the best model, and it works well in most cases unless the data is very irregular.
If you want the remaining models to be tried as well, and the run time is acceptable, try stepwise=False, which trains with all possible parameter combinations.
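For reference, a non-stepwise call that also surfaces the failing orders instead of silently skipping them could look something like this (a sketch, not the asker's exact code; the names y and x_features come from the question):
from pyramid.arima import auto_arima

# exhaustive search over (p, q) up to 3; trace=True prints every order tried,
# and error_action='warn' reports orders that fail instead of silently dropping them
model = auto_arima(y, exogenous=x_features,
                   start_p=0, start_q=0, max_p=3, max_q=3,
                   stepwise=False, trace=True,
                   error_action='warn', suppress_warnings=True)
print(model.order, model.aic())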
Hope this helps

Pandas .rolling.corr using date/time offset

I am having a bit of an issue with pandas's rolling function and I'm not quite sure where I'm going wrong. If I mock up two test series of numbers:
df_index = pd.date_range(start='1990-01-01', end ='2010-01-01', freq='D')
test_df = pd.DataFrame(index=df_index)
test_df['Series1'] = np.random.randn(len(df_index))
test_df['Series2'] = np.random.randn(len(df_index))
Then it's easy to have a look at their rolling annual correlation:
test_df['Series1'].rolling(365).corr(test_df['Series2']).plot()
which produces:
All good so far. If I then try to do the same thing using a datetime offset:
test_df['Series1'].rolling('365D').corr(test_df['Series2']).plot()
I get a wildly different (and obviously wrong) result:
Is there something wrong with pandas or is there something wrong with me?
Thanks in advance for any light you can shed on this troubling conundrum.
It's very tricky; I think the behavior of the window as an int vs. an offset is different. From the docs:
New in version 0.19.0 is the ability to pass an offset (or convertible) to a .rolling() method and have it produce variable-sized windows based on the passed time window. For each time point, this includes all preceding values occurring within the indicated time delta.
This can be particularly useful for a non-regular time frequency index.
You should check out the documentation on Time-aware Rolling.
r1 = test_df['Series1'].rolling(window=365) # has default `min_periods=365`
r2 = test_df['Series1'].rolling(window='365D') # has default `min_periods=1`
r3 = test_df['Series1'].rolling(window=365, min_periods=1)
r1.corr(test_df['Series2']).plot()
r2.corr(test_df['Series2']).plot()
r3.corr(test_df['Series2']).plot()
This code produces plots of a similar shape for r2.corr().plot() and r3.corr().plot(), but note that the underlying results still differ slightly: compare r2.corr(test_df['Series2']) with r3.corr(test_df['Series2']).
I think for a regular time frequency index, you should just stick with r1.
This is mainly because the results of rolling(365) and rolling('365D') are different.
For example
sub = test_df.head()
sub['Series2'].rolling(2).sum()
Out[15]:
1990-01-01 NaN
1990-01-02 -0.355230
1990-01-03 0.844281
1990-01-04 2.515529
1990-01-05 1.508412
sub['Series2'].rolling('2D').sum()
Out[16]:
1990-01-01 -0.043692
1990-01-02 -0.355230
1990-01-03 0.844281
1990-01-04 2.515529
1990-01-05 1.508412
Since rolling(365) produces a lot of NaN values (it needs 365 observations before it emits a result), the correlations computed the two ways end up quite different.
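A quick way to see the size of that effect (a sketch reusing test_df from the question) is to count the NaNs each variant produces:
int_corr = test_df['Series1'].rolling(365).corr(test_df['Series2'])
offset_corr = test_df['Series1'].rolling('365D').corr(test_df['Series2'])
# the integer window needs 365 observations, so it leaves 364 leading NaNs;
# the offset window (min_periods=1) starts producing values almost immediately
print(int_corr.isna().sum(), offset_corr.isna().sum())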

Predicting a users next action based on current day and time

I'm using Microsoft Azure Machine Learning Studio to try an experiment where I use previous analytics captured about a user (at a time, on a day) to try and predict their next action (based on day and time) so that I can adjust the UI accordingly. So if a user normally visits a certain page every Thursday at 1pm, then I would like to predict that behaviour.
Warning - I am a complete novice with ML, but have watched quite a few videos and worked through tutorials like the movie recommendations example.
I have a csv dataset with userid, action, datetime and would like to train a Matchbox recommendation model, which, from my research, appears to be the best model to use. I can't see a way to use date/time in the training. The idea is that if I could pass in a userid and the date, then the recommendation model should be able to give me a probable result of what that user is most likely to do.
I get results from the predictive endpoint, but the training endpoint gives the following error:
{
  "error": {
    "code": "ModuleExecutionError",
    "message": "Module execution encountered an error.",
    "details": [
      {
        "code": "18",
        "target": "Train Matchbox Recommender",
        "message": "Error 0018: Training dataset of user-item-rating triples contains invalid data."
      }
    ]
  }
}
Here is a link to a public version of the experiment
Any help would be appreciated.
Thanks.
Maybe this answer could be helpful; you may also take a look at this one, where you can read:
The problem is probably with the range of rating data. There's an upper limit for rating range, because the training gets expensive if the range between smallest and largest rating is too large.
[...]
One option would be to scale the ratings to a narrower range.
According to this MSDN page, note that you cannot have a gap between the min and max rating larger than 100.
So you have to do some pre-processing on your csv column data (userid, action, datetime, etc.) in order to keep every column's data in the [0-99] range.
Please see below a Python implementation (to share the logic):
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
big_gap_arr = [-250, -2350, 850, -120, -1235, 3212, 1, 5, 65, 48, 265, 1204, 65, 23, 45, 895, 5000, 3, 325, 3244, 5482]  # data with a big gap
abs_min = abs(min(big_gap_arr))         # absolute value of the minimal value
max_diff = max(big_gap_arr) + abs_min   # maximal difference
specific_range_arr = []
for each_value in big_gap_arr:
    new_value = 99. * (abs_min + each_value) / max_diff  # corresponding value in the [0, 99] range
    specific_range_arr.append(new_value)
print(specific_range_arr)  # post-computed data => all in range [0, 99]
Which gives you:
[26.54494382022472, 0.0, 40.449438202247194, 28.18820224719101, 14.094101123595506, 70.3061797752809, 29.71769662921348, 29.76825842696629, 30.526685393258425, 30.31179775280899, 33.05477528089887, 44.924157303370784, 30.526685393258425, 29.995786516853933, 30.27387640449438, 41.01825842696629, 92.90730337078652, 29.742977528089888, 33.813202247191015, 70.71067415730337, 99.0]
Note that all data are now in the [0,99] range
Following this process:
User id could be a float instead of an integer
Action is an integer (if you have fewer than 100 actions) or a float (if you have more than 100 actions)
Datetime will be split into two integers (or one integer and one float), see below:
Concerning:
(A) way to use date/time in the training
You may split your datetime into two columns, something like:
one column for the weekday:
0: Sunday
1: Monday
2: Tuesday
[...]
6: Saturday
one column for the time in the day:
0: Between 00:00 & 00:15
1: Between 00:15 & 00:30
2: Between 00:30 & 00:45
[...]
95 : Between 23:45 & 00:00
If you need better granularity (here it is a 15-minute window) you may also use a float for the time column.
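As a concrete illustration of that split (a sketch using Python's standard datetime module; note the 0 = Sunday convention above differs from Python's 0 = Monday, hence the shift):
import datetime

def datetime_to_features(dt):
    # map Monday..Sunday (Python's 0..6) to the Sunday-first convention used above
    weekday = (dt.weekday() + 1) % 7          # 0 = Sunday, ..., 6 = Saturday
    slot = dt.hour * 4 + dt.minute // 15      # 0..95, one slot per 15-minute window
    return weekday, slot

print(datetime_to_features(datetime.datetime(2019, 3, 7, 13, 5)))  # a Thursday at 13:05 -> (4, 52)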
So from messing with this for a while, I think I see where the issue may lie. I think that the first three inputs of the Train Matchbox Recommender need to be filled in for an accurate prediction; the restaurant recommendation sample shows what each of these looks like.
The first input would be the dataset consisting of the user, item, and rating.
The second input would be the features of each user.
And the third input would be the features of each feature (restaurant in this case).
So to help with the date/time issue, I'm wondering if the data would need to be munged to match something similar to the restaurant and user data.
I know it's not much, but I hope it helps lead you down the right track.
