pyramid-arima auto_arima order selection - python-3.x

I am working on Time Series Forecasting(Daily entry) using pyramid-arima auto_arima in python where y is my target and x_features are all exogenous variables. I want best order model based on lowest aic, But auto_arima returns only few order combinations.
PFA where 1st code line (start_p = start_q = 0 & max_p = 0, max_q = 3) returns all 4 combinations, but 2nd code line(start_p = start_q = 0 & max_p = 3, max_q = 3) returns only 7 combinations , din't gave (0,1,2) and (0,1,3) and others, which leads wrong model selection based on aic. All other parameters are as default e.g max_order = 10
Is there anything I am missing or wrongly done?
Thankyou in advance.

You say error_action='ignore', so probably (0,1,2) and (0,1,3) (and other orders) gave errors, so they didn't appear in the results.
(I don't have enough reputation to write a comment, sorry).

The number of models autoarima trains is based on the data you feed in and also the stepwise= True if it is True autoarima uses a proven way to reduce number of iterations to find the best model and it is the best 90% cases unless data is very varying.
If you want the rest of models also to run as it isnt taking alot of time to execute try keeping stepwise=False where it trains with all possible param combinations.
Hope this helps

Related

Pandas .rolling.corr using date/time offset

I am having a bit of an issue with pandas's rolling function and I'm not quite sure where I'm going wrong. If I mock up two test series of numbers:
df_index = pd.date_range(start='1990-01-01', end ='2010-01-01', freq='D')
test_df = pd.DataFrame(index=df_index)
test_df['Series1'] = np.random.randn(len(df_index))
test_df['Series2'] = np.random.randn(len(df_index))
Then it's easy to have a look at their rolling annual correlation:
test_df['Series1'].rolling(365).corr(test_df['Series2']).plot()
which produces:
All good so far. If I then try to do the same thing using a datetime offset:
test_df['Series1'].rolling('365D').corr(test_df['Series2']).plot()
I get a wildly different (and obviously wrong) result:
Is there something wrong with pandas or is there something wrong with me?
Thanks in advance for any light you can shed on this troubling conundrum.
It's very tricky, I think the behavior of window as int and offset is different:
New in version 0.19.0 are the ability to pass an offset (or
convertible) to a .rolling() method and have it produce variable sized
windows based on the passed time window. For each time point, this
includes all preceding values occurring within the indicated time
delta.
This can be particularly useful for a non-regular time frequency index.
You should checkout the doc of Time-aware Rolling.
r1 = test_df['Series1'].rolling(window=365) # has default `min_periods=365`
r2 = test_df['Series1'].rolling(window='365D') # has default `min_periods=1`
r3 = test_df['Series1'].rolling(window=365, min_periods=1)
r1.corr(test_df['Series2']).plot()
r2.corr(test_df['Series2']).plot()
r3.corr(test_df['Series2']).plot()
This code would produce similar shape of plots for r2.corr().plot() and r3.corr().plot(), but note that the calculation results still different: r2.corr(test_df['Series2']) == r3.corr(test_df['Series2']).
I think for regular time frequency index, you should just stick to r1.
This mainly because the result of two rolling 365 and 365D are different.
For example
sub = test_df.head()
sub['Series2'].rolling(2).sum()
Out[15]:
1990-01-01 NaN
1990-01-02 -0.355230
1990-01-03 0.844281
1990-01-04 2.515529
1990-01-05 1.508412
sub['Series2'].rolling('2D').sum()
Out[16]:
1990-01-01 -0.043692
1990-01-02 -0.355230
1990-01-03 0.844281
1990-01-04 2.515529
1990-01-05 1.508412
Since there are a lot NaN in rolling 365, so the corr of two series in two way are quit different.

Finding the top three relevant category and its corresponding probabilities

From the below script, I find the highest probability and its corresponding category in a multi class text classification problem. How do I find the highest top 3 predicted probability and its corresponding category in a best efficient way without using loops.
probabilities = classifier.predict_proba(X_test)
max_probabilities = probabilities.max(axis=1)
order=np.argsort(probabilities, axis=1)
classification=(classifier.classes_[order[:, -1:]])
print(accuracy_score(classification,y_test))
Thanks in advance.
( I have around 50 categories, I want to extract the top 3 best relevant category among 50 categories for each of my narrations and display them in a dataframe)
You've done most of the hard work here, just missing a bit of numpy foo to finish it off. Your line
order = np.argsort(probabilities, axis=1)
Contains the indices of the sorted probabilities, so [[lowest_prob_class_1, ..., highest_prob_class_1]...] for each of your samples. Which you have used to give your classification with order[:, -1:], i.e. the index of the highest probability class. So to get the top three classes we can just make a simple change
top_3_classes = classifier.classes_[order[:, -3:]]
Then to get the corresponding probabilities we can use
top_3_probabilities = probabilities[np.repeat(np.arange(order.shape[0]), 3),
order[:, -3:].flatten()].reshape(order.shape[0], 3)

Spark - Optimize calculation time over a data frame, by using groupBy() instead of filter()

I have a data frame which contains different columns ('features').
My goal is to calculate column X statistical measures:
Mean, Standart-Deviation, Variance
But, to calculate all of those, with dependency on column Y.
e.g. Get all rows which Y = 1, and for them calculate mean,stddev, var,
then do the same for all rows which Y = 2 for them.
My current implementation is:
print "For CONGESTION_FLAG = 0:"
log_df.filter(log_df[flag_col] == 0).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
print "For CONGESTION_FLAG = 1:"
log_df.filter(log_df[flag_col] == 1).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
print "For CONGESTION_FLAG = 2:"
log_df.filter(log_df[flag_col] == 2).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
I was told the filter() way is wasteful in terms of computation times, and received an advice that for making those calculation run faster (i'm using this on 1GB data file), it would be better use groupBy() method.
Can someone please help me transform those lines to do the same calculations by using groupBy instead?
I got mixed up with the syntax and didn't manage to do so correctly.
Thanks.
Filter by itself is not wasteful. The problem is that you are calling it multiple times (once for each value) meaning you are scanning the data 3 times. The operation you are describing is best achieved by groupby which basically aggregates data per value of the grouped column.
You could do something like this:
agg_df = log_df.groupBy(flag_col).agg(mean(size_col).alias("mean"), stddev(size_col).alias("stddev"), pow(stddev(size_col),2).alias("pow"))
You might also get better performance by calculating stddev^2 after the aggregation (you should try it on your data):
agg_df = log_df.groupBy(flag_col).agg(mean(size_col).alias("mean"), stddev(size_col).alias("stddev"))
agg_df2 = agg_df.withColumn("pow", agg_df["stddev"] * agg_df["stddev"])
You can:
log_df.groupBy(log_df[flag_col]).agg(
mean(size_col), stddev(size_col), pow(stddev(size_col), 2)
)

Why I get different values everytime I run the function hmmlearn.hmm.GaussianHMM.fit()

I have a program.
n = 6
data=pd.read_csv('11.csv',index_col='datetime')
volume = data['TotalVolumeTraded']
close = data['ClosingPx']
logDel = np.log(np.array(data['HighPx'])) - np.log(np.array(data['LowPx']))
logRet_1 = np.array(np.diff(np.log(close)))
logRet_5 = np.log(np.array(close[5:])) - np.log(np.array(close[:-5]))
logVol_5 = np.log(np.array(volume[5:])) - np.log(np.array(volume[:-5]))
logDel = logDel[5:]
logRet_1 = logRet_1[4:]
close = close[5:]
Date = pd.to_datetime(data.index[5:])
A = np.column_stack([logDel,logRet_5,logVol_5])
model = GaussianHMM(n_components= n, covariance_type="full", n_iter=2000).fit([A])
hidden_states = model.predict(A)
I run the code the first time ,the value of "hidden_states" is as follow,
I run the code the second time ,the value of "hidden_states" is as follow,
Why are two values "hidden_states" different?
I am not completely sure what happens here, but here're two possible explanations for the results you're seeing.
The model does not maintain any ordering over state labels. So state labelled as 1 in one run could end up being 4 in another run. This is known as label switching problem in latent variable models.
GaussianHMM initializes emission parameters via k-means which might converge to different values depending on the data. The initial parameters are passed to the EM-algorithm which is also prone to local maxima. Therefore different runs could result in different parameter estimates and (as a result) slightly different predictions.
Try to control the randomness by setting the seed and the random_state when you define your model. Moreover you could initialize the startprob_ and the transmat_ and see how it behaves.
That way you might have a better explanation about the cause of this behavior.

how to predict with gaussianhmm sklearn

I'm trying to predict stock prices using sklearn. I'm new to prediction. I tried the example from sklearn for stock prediction with gaussian hmm. But predict gives states sequence which overlay on the price and it also takes points from given input close price. My question is how to generate next 10 prices?
You will always use the last state to predict the next state, so let's add 10 days worth of inputs by changing the end date to the 23rd:
date2 = datetime.date(2012, 1, 23)
You can double check the rest of the code to make sure I am not actually using future data for the prediction. The rest of these lines can be added to the bottom of the file. First we want to find out what the expected return is for a given state. The model.means_ array has returns, both those were the returns that got us to this state, not the future returns which is what you want. To get the future returns, we consider the probability of going to any one of the 5 states, and what the return of those states is. We get the probability of going to any particular state from the model.transmat_ matrix, the for the return of each state we use the model.means_ values. We take the dot product to get the expected return for a particular state. Then we remove the volume data (you can leave it in if you want, but you seemed to be most interested in future prices).
expected_returns_and_volumes = np.dot(model.transmat_, model.means_)
returns_and_volumes_columnwise = zip(*expected_returns_and_volumes)
returns = returns_and_volumes_columnwise[0]
If you print the value for returns[0], you'll see the expected return in dollars for state 0, returns[1] for state 1 etc. Now, given a day and a state, we want to predict the price for tomorrow. You said 10 days so let's use that for lastN.
predicted_prices = []
lastN = 10
for idx in xrange(lastN):
state = hidden_states[-lastN+idx]
current_price = quotes[-lastN+idx][2]
current_date = datetime.date.fromordinal(dates[-lastN+idx])
predicted_date = current_date + datetime.timedelta(days=1)
predicted_prices.append((predicted_date, current_price + returns[state]))
print(predicted_prices)
If you were running this in "production" you would set date2 to the last date you have and then lastN would be 1. Note that I don't take into account weekends for the predicted_date.
This is a fun exercise but you probably wouldn't run this in production, hence the quotes. First, the time series is the raw price; this should really be percentage returns or log returns. Plus there is no justification for picking 5 states for the HMM, or that a HMM is even good for this kinda problem, which I doubt. They probably just picked it as an example. I think the other sklearn example using PCA is much more interesting.

Resources