I am having a bit of an issue with pandas's rolling function and I'm not quite sure where I'm going wrong. If I mock up two test series of numbers:
df_index = pd.date_range(start='1990-01-01', end ='2010-01-01', freq='D')
test_df = pd.DataFrame(index=df_index)
test_df['Series1'] = np.random.randn(len(df_index))
test_df['Series2'] = np.random.randn(len(df_index))
Then it's easy to have a look at their rolling annual correlation:
test_df['Series1'].rolling(365).corr(test_df['Series2']).plot()
which produces:
All good so far. If I then try to do the same thing using a datetime offset:
test_df['Series1'].rolling('365D').corr(test_df['Series2']).plot()
I get a wildly different (and obviously wrong) result:
Is there something wrong with pandas or is there something wrong with me?
Thanks in advance for any light you can shed on this troubling conundrum.
It's a bit tricky: the behavior of rolling with an integer window and with an offset window is different. From the pandas docs:
New in version 0.19.0 are the ability to pass an offset (or
convertible) to a .rolling() method and have it produce variable sized
windows based on the passed time window. For each time point, this
includes all preceding values occurring within the indicated time
delta.
This can be particularly useful for a non-regular time frequency index.
You should check out the docs on Time-aware Rolling.
r1 = test_df['Series1'].rolling(window=365) # has default `min_periods=365`
r2 = test_df['Series1'].rolling(window='365D') # has default `min_periods=1`
r3 = test_df['Series1'].rolling(window=365, min_periods=1)
r1.corr(test_df['Series2']).plot()
r2.corr(test_df['Series2']).plot()
r3.corr(test_df['Series2']).plot()
This code produces plots of a similar shape for r2.corr(...).plot() and r3.corr(...).plot(), but note that the calculated values are still not identical; compare r2.corr(test_df['Series2']) with r3.corr(test_df['Series2']) to see the differences.
I think for a regular time-frequency index you should just stick with r1.
This is mainly because rolling(365) and rolling('365D') produce different results.
For example
sub = test_df.head()
sub['Series2'].rolling(2).sum()
Out[15]:
1990-01-01 NaN
1990-01-02 -0.355230
1990-01-03 0.844281
1990-01-04 2.515529
1990-01-05 1.508412
sub['Series2'].rolling('2D').sum()
Out[16]:
1990-01-01 -0.043692
1990-01-02 -0.355230
1990-01-03 0.844281
1990-01-04 2.515529
1990-01-05 1.508412
Since rolling(365) produces a lot of NaNs (the first 364 windows are incomplete), the correlations of the two series computed in the two ways are quite different.
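If you do want the offset-based window to behave like the integer window, a minimal sketch (assuming a gap-free daily index, as in the mock data above) is to raise min_periods explicitly so incomplete windows are dropped the same way:
import numpy as np
import pandas as pd

df_index = pd.date_range(start='1990-01-01', end='2010-01-01', freq='D')
test_df = pd.DataFrame(index=df_index)
test_df['Series1'] = np.random.randn(len(df_index))
test_df['Series2'] = np.random.randn(len(df_index))

# Integer window: exactly 365 rows, NaN until 365 observations are available.
r_int = test_df['Series1'].rolling(365).corr(test_df['Series2'])

# Offset window with min_periods raised to match: on a gap-free daily index a
# '365D' window also holds 365 rows, so the two are now directly comparable.
r_off = test_df['Series1'].rolling('365D', min_periods=365).corr(test_df['Series2'])
On a regular daily index this removes the min_periods discrepancy that caused the wildly different plot above.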
I'm trying to interpolate the following data with Python (3.8.1) using scipy's LinearNDInterpolator (official documentation here; source code here). The official documentation is incredibly sparse, so I'm hopeful that someone else out there has experience using the function and may know the source of this issue. Specifically, I run the following four lines of code:
predictor = [[-1.7134013337139833, 0.9582376963057636, -0.21528572746395735], [3.25933089248862, -0.7087236333980123, 0.012808817274351122], [-0.5596739049487544, -1.8723369742231246, 0.03114189522349198], [0.23080764211370225, 1.0639221305852422, -0.602148693975945], [-0.9879484423429669, -0.16678510825693527, 0.5570132252912631], [0.0029439785978213986, -0.10016927713200409, -0.18197412051828055], [0.3530872261969887, 0.6347161018351574, 0.7285361235605389], [-1.122894723267098, 0.22837861478723648, -0.9022469946784363], [-0.02862856314533664, 0.014623415207400122, 3.078346263312741], [-1.3367570531570616, -0.3218239542354167, 0.489878302042675]]
response = [0.020235605909933625, 1.4729016163456679e-05, 0.021931080605237303, 0.21271851410989498, 0.26870984350693583, 0.9577608837143238, 0.3470452852299319, 0.11918254249689647, 7.657429164576589e-05, 0.1187813551565562]
from scipy.interpolate import LinearNDInterpolator
away = LinearNDInterpolator(predictor, response)
Now, if I write away.__call__([0,0,0])[0] then Python returns 0.8208492283847619,
which is the desired outcome and is a sensible value based on the given test data. Similarly, away.__call__([0,0,1])[0] returns 0.22018657078617598 which is also a sensible value.
However, away.__call__([0,1,1])[0] returns nan. What changed? Does anyone happen to know?
Thank you.
This occurs when away.__call__(x) is passed a value x which lies outside the convex hull of the predictor points - essentially, when x lies outside the region of interpolation, so linear interpolation is not defined there and nan is returned.
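If you want to check this yourself, one option (a small sketch, assuming scipy and the predictor list from above are available) is to build a Delaunay triangulation of the predictor points and test whether a query point falls inside it; find_simplex returns -1 for points outside the convex hull:
import numpy as np
from scipy.spatial import Delaunay

hull = Delaunay(np.array(predictor))  # same predictor points as above

# find_simplex returns -1 for points outside the convex hull,
# i.e. exactly the points where LinearNDInterpolator gives nan.
print(hull.find_simplex(np.array([0.0, 0.0, 0.0])) >= 0)  # expected True, per the question
print(hull.find_simplex(np.array([0.0, 1.0, 1.0])) >= 0)  # expected False -> nan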
I have a dataset that consists of 6169 time-series data points. I am trying to find the minimum within a certain rolling window. In this case, the window is 396 points (slightly over a year). I have written the code below using pandas' rolling function. However, when I run it I end up with many more values than I should. What I mean is I should end up with roughly 6169/396 = 15 or 16 values, but instead I get 258 values. Any ideas why? To get an idea of the data I have posted a plot, where I have marked a few points with red circles that it should catch; by observing the graph it definitely shouldn't catch that many points. Is there anything wrong with my line of code?
m4_minidx = df['fitted.values'].rolling(window = 396).min() == df['fitted.values']
m4_min = df[m4_minidx]
print(df.shape)
print(m4_min.shape)
output:
(6169, 5)
(258, 5)
The problem is the rolling window: your comparison flags every point that equals the minimum of its own trailing 396-sample window, so every new low gets picked up, not one minimum per year. Here's a sketch to explain:
The black lines are the moving windows, while the red circles are the local minima.
The problem you want to solve is slightly more complex; finding local minima is not trivial in general. Take a look at these other resources: local minima x-y,
local minima 1d array, or the peak finder in the scipy library.
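For a rough idea of what the scipy route looks like, here is a minimal sketch (assuming your values are in df['fitted.values']) that finds local minima by running find_peaks on the negated series and requiring minima to be at least 396 samples apart:
import numpy as np
from scipy.signal import find_peaks

y = df['fitted.values'].to_numpy()

# find_peaks looks for local maxima, so negate the series to get minima;
# distance=396 keeps at most one minimum per ~396-sample neighbourhood.
min_idx, _ = find_peaks(-y, distance=396)
m4_min = df.iloc[min_idx]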
============= edit ==================
If you use random data with no trend or repetition, you obtain roughly the number of points you expected:
import numpy as np
import pandas as pd

x = np.random.random(6169)
df = pd.DataFrame({'fitted.values': x})
m4_minidx = df['fitted.values'].rolling(window = 396).min() == df['fitted.values']
m4_min = df[m4_minidx]
print(df.shape)
print(m4_min.shape)
output:
(6169, 1)
(14, 1)
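If what you actually want is one minimum per non-overlapping block of 396 rows (which is where the 6169/396 = 15 or 16 expectation comes from), a simple sketch is to group by an integer block index instead of using a rolling window:
import numpy as np

# Label each row with the number of the 396-row block it belongs to,
# then take one minimum per block: ceil(6169 / 396) = 16 values.
block = np.arange(len(df)) // 396
block_min = df['fitted.values'].groupby(block).min()
print(block_min.shape)  # (16,)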
So I'm trying to normalize my features by using .apply() iteratively on all columns of the dataframe, but it gives a KeyError. Can someone help?
I've tried the code below but it doesn't work:
for x in df.columns:
    df[x+'_norm'] = df[x].apply(lambda x: (x-df[x].mean())/df[x].std())
The KeyError comes from reusing x inside the lambda: there, x is the cell value rather than the column name, so df[x] fails. Beyond that, I don't think it's a good idea to call mean and std inside the apply, because you are recalculating them for every row. Instead, compute them once at the start of the loop and use them in the apply, like below:
for x in df.columns:
    mean = df[x].mean()
    std = df[x].std()
    df[x+'_norm'] = df[x].apply(lambda y: (y-mean)/std)
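A further simplification (a sketch assuming all columns are numeric) is to skip apply entirely, since the arithmetic is vectorised on the whole column:
for x in df.columns:
    # (Series - scalar) / scalar is computed element-wise, no apply needed.
    df[x + '_norm'] = (df[x] - df[x].mean()) / df[x].std()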
I am working on time-series forecasting (daily entries) using pyramid-arima's auto_arima in Python, where y is my target and x_features are all exogenous variables. I want the best-order model based on the lowest AIC, but auto_arima returns only a few order combinations.
PFA: the 1st code line (start_p = start_q = 0 and max_p = 0, max_q = 3) returns all 4 combinations, but the 2nd code line (start_p = start_q = 0 and max_p = 3, max_q = 3) returns only 7 combinations; it didn't give (0,1,2), (0,1,3) and others, which leads to wrong model selection based on AIC. All other parameters are at their defaults, e.g. max_order = 10.
Is there anything I am missing or have done wrong?
Thank you in advance.
You say error_action='ignore', so probably (0,1,2) and (0,1,3) (and other orders) gave errors, which is why they didn't appear in the results.
(I don't have enough reputation to write a comment, sorry).
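To see which orders fail and why, one option (a sketch, assuming the pyramid-arima/pmdarima auto_arima API and the y and x_features from the question; parameter names vary slightly between versions, e.g. exogenous was renamed to X in newer pmdarima) is to rerun with trace=True and a non-silent error_action:
from pmdarima import auto_arima  # pyramid-arima was later renamed to pmdarima

model = auto_arima(y, exogenous=x_features,
                   start_p=0, start_q=0, max_p=3, max_q=3,
                   trace=True,            # print every order tried and its AIC
                   error_action='warn')   # warn instead of silently skipping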
The number of models auto_arima trains depends on the data you feed in and on stepwise=True. If it is True, auto_arima uses a proven stepwise search to reduce the number of iterations needed to find the best model, and that is the better choice in 90% of cases unless the data is very irregular.
If you want the rest of the models to run as well, and execution time is not a concern, try stepwise=False, which trains on all possible parameter combinations.
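A minimal sketch of that exhaustive search (again assuming the auto_arima API and the y and x_features from the question):
from pmdarima import auto_arima

exhaustive = auto_arima(y, exogenous=x_features,
                        start_p=0, start_q=0, max_p=3, max_q=3,
                        stepwise=False,   # grid-search every (p, d, q) in range
                        trace=True)
print(exhaustive.order, exhaustive.aic())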
Hope this helps
I'm trying to predict stock prices using sklearn. I'm new to prediction. I tried the sklearn example for stock prediction with a Gaussian HMM, but predict gives a state sequence that is overlaid on the price, and it only uses points from the given input close prices. My question is: how do I generate the next 10 prices?
You will always use the last state to predict the next state, so let's add 10 days worth of inputs by changing the end date to the 23rd:
date2 = datetime.date(2012, 1, 23)
You can double-check the rest of the code to make sure I am not actually using future data for the prediction. The rest of these lines can be added to the bottom of the file. First we want to find out what the expected return is for a given state. The model.means_ array has returns, but those were the returns that got us to this state, not the future returns, which is what you want. To get the future returns, we consider the probability of going to any one of the 5 states, and what the return of each of those states is. We get the probability of going to any particular state from the model.transmat_ matrix, and for the return of each state we use the model.means_ values. We take the dot product to get the expected return for a particular state. Then we remove the volume data (you can leave it in if you want, but you seemed to be most interested in future prices).
expected_returns_and_volumes = np.dot(model.transmat_, model.means_)
# In Python 3, zip returns an iterator, so materialise it before indexing.
returns_and_volumes_columnwise = list(zip(*expected_returns_and_volumes))
returns = returns_and_volumes_columnwise[0]
If you print the value for returns[0], you'll see the expected return in dollars for state 0, returns[1] for state 1 etc. Now, given a day and a state, we want to predict the price for tomorrow. You said 10 days so let's use that for lastN.
predicted_prices = []
lastN = 10
for idx in range(lastN):  # xrange in the original Python 2 code
    state = hidden_states[-lastN+idx]
    current_price = quotes[-lastN+idx][2]
    current_date = datetime.date.fromordinal(dates[-lastN+idx])
    predicted_date = current_date + datetime.timedelta(days=1)
    predicted_prices.append((predicted_date, current_price + returns[state]))
print(predicted_prices)
If you were running this in "production" you would set date2 to the last date you have, and then lastN would be 1. Note that I don't take weekends into account for the predicted_date.
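If you wanted to skip weekends, a small sketch (standard library only, not part of the original example) is to keep adding days until the predicted date falls on a weekday:
import datetime

def next_trading_day(d):
    # Step forward one day at a time until we leave Saturday (5) / Sunday (6).
    d = d + datetime.timedelta(days=1)
    while d.weekday() >= 5:
        d = d + datetime.timedelta(days=1)
    return d

# e.g. inside the loop above: predicted_date = next_trading_day(current_date)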
This is a fun exercise but you probably wouldn't run this in production, hence the quotes. First, the time series is the raw price; this should really be percentage returns or log returns. Also, there is no justification for picking 5 states for the HMM, or for assuming an HMM is even a good fit for this kind of problem, which I doubt. They probably just picked it as an example. I think the other sklearn example using PCA is much more interesting.
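For reference, a tiny sketch of converting raw prices into the returns mentioned above (close is just an example array of close prices):
import numpy as np

close = np.asarray([100.0, 101.5, 100.8, 102.3])  # example close prices

pct_returns = np.diff(close) / close[:-1]   # percentage returns
log_returns = np.diff(np.log(close))        # log returns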