StatsModels SARIMAX with exogenous variable and linear time trend - python-3.x

I am trying to forecast with a SARIMAX model that includes a linear time trend taking the value 1 for the first data point in the sample and increasing by 1 for each successive observation up to N = sample size. The trend term is included because it improves the model's predictive power significantly, but for out-of-sample forecasting we want to freeze it at the last observed value. That is, if the in-sample size is 100, we want to use the value 100 for every forecast step instead of increasing it by 1 at each step.
The model has been fitted as follows
import statsmodels.api as sm
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(endog=Unemployment_series, exog=sm.add_constant(insample['GDP_yoy'].values), order=(1,0,0), trend='t').fit(disp=-1)
According to the statsmodels documentation at https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html, the trend parameter allows us to include the linear time trend.
The problem arises when I try to forecast using the get_forecast or get_prediction methods:
forecast = model.get_forecast(steps=len(outsample),exog = sm.add_constant(outsample['GDP_yoy'].values,has_constant='add'))
or
forecast = model.get_prediction(start=len(insample),end=len(insample)+len(outsample)-1,exog = sm.add_constant(outsample['GDP_yoy'].values,has_constant='add'))
I have not found any parameter that allows me to control the behavior of the time trend during forecasting. Any advice?
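I have not found a built-in option for this either, but one possible workaround (a sketch only, not part of the SARIMAX API) is to drop trend='t' and instead pass the time index 1..N as an additional exogenous column; at forecast time that column can then be held at its last in-sample value. Unemployment_series, insample and outsample below are the objects from the question.
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.statespace.sarimax import SARIMAX

n = len(insample)
# In-sample exog: constant, GDP_yoy and an explicit linear time trend 1..n
exog_in = np.column_stack([np.ones(n), insample['GDP_yoy'].values, np.arange(1, n + 1)])
model = SARIMAX(endog=Unemployment_series, exog=exog_in, order=(1, 0, 0)).fit(disp=-1)

m = len(outsample)
# Out-of-sample exog: the trend column is frozen at its last observed value (n)
exog_out = np.column_stack([np.ones(m), outsample['GDP_yoy'].values, np.full(m, n)])
forecast = model.get_forecast(steps=m, exog=exog_out)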

Related

Scoring Model giving reversed results using logistic regression

I am trying to implement a scoring model following this guide: https://rstudio-pubs-static.s3.amazonaws.com/376828_032c59adbc984b0ab892ce0026370352.html#1_introduction.
After completing the implementation, when I create a pivot of my generated scores against the original labels, the average score for the "good" labels is significantly lower than the one for the "bad" labels.
Hence, my problem can be simplified to: why would logistic regression give reversed probabilities for a 0-1 target variable? (In my model I am using 0 for bad and 1 for good.)
Any suggestions and solutions would be welcome.
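One common cause is scoring with the wrong probability column, i.e. using the probability of the "bad" class where the probability of the "good" class was intended. The question does not name the library, but as an illustration only, here is a minimal scikit-learn sketch (placeholder data) showing how to check which column of predict_proba corresponds to which class:
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 3)               # placeholder features
y = np.random.randint(0, 2, size=100)    # 0 = bad, 1 = good, as in the question
clf = LogisticRegression().fit(X, y)

# predict_proba returns one column per class, ordered as in clf.classes_.
# Scoring with the wrong column makes "good" rows look worse than "bad" ones.
print(clf.classes_)                      # e.g. [0 1]
p_good = clf.predict_proba(X)[:, list(clf.classes_).index(1)]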

Weighted Least Squares vs Monte Carlo comparison

I have an experimental dataset of the following values (y, x1, x2, w), where y is the measured quantity, x1 and x2 are the two independent variables and w is the error of each measurement.
The function I've chosen to describe my data is
These are my tasks:
1) Estimate values of bi
2) Estimate their standard errors
3) Calculate predicted values of f(x1, x2) on a mesh grid and estimate their confidence intervals
4) Calculate predicted values of
and definite integral
and their confidence intervals on a mesh grid
I have several questions:
1) Can all of my tasks be solved by weighted least squares? I've solved tasks 1-3 using WLS in matrix form by linearising the chosen function, but I have no idea how to solve step 4.
2) I've performed Monte Carlo simulations to estimate the bi and their standard errors. I generated perturbed values y'i from a normal distribution with mean yi and standard deviation wi, and repeated this operation N = 5000 times. For each perturbed dataset I estimated b'i, and from the 5000 values of b'i I calculated mean values and their standard deviation. In the end, the bi estimated from the Monte Carlo simulation coincide with those found by WLS. Am I correct that the standard deviations of the b'i must be divided by the number of degrees of freedom to obtain the standard errors?
3) How can I estimate confidence bands for the predicted values of y using the Monte Carlo approach? I generated a set of perturbed bi values from a normal distribution, using their BLUE as the mean together with their standard deviations. Then I calculated many predicted values of f(x1, x2) and found their means and standard deviations. The values of f(x1, x2) found by WLS and MC coincide, but the standard deviations from MC are 5-45 times higher than those from WLS. What is the scaling factor I am missing here?
4) It seems that some of the parameters b are not independent of each other, since there are only 2 independent variables. Should I take this into account in question 3 when I generate the bi values? If so, how can this be done? Should I use a chi-squared test to decide whether generated values of bi are suitable for further calculations, or whether they should be rejected?
In fact, I not only want to solve the tasks mentioned above, but also to compare the two methods of regression analysis. I would appreciate any help and suggestions!
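Regarding points 2-4, one detail worth noting is that a weighted least squares fit already provides standard errors and the full parameter covariance, and that Monte Carlo draws of the bi should respect that covariance (i.e. be drawn jointly, not independently). As an illustration only, here is a sketch on synthetic data with statsmodels, assuming a linear(ised) model in x1 and x2:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1, x2 = rng.uniform(0, 1, 200), rng.uniform(0, 1, 200)
w = rng.uniform(0.05, 0.2, 200)                    # measurement errors
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, w)   # hypothetical linearised model

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.WLS(y, X, weights=1.0 / w**2).fit()
print(fit.params, fit.bse)                         # bi and their standard errors directly

# Monte Carlo draws of the parameters: use the full covariance matrix so the
# correlations between the bi are preserved (relevant to question 4), instead
# of drawing each bi independently.
draws = rng.multivariate_normal(fit.params, fit.cov_params(), size=5000)
preds = draws @ X.T                                # predicted f(x1, x2) for each draw
band = np.percentile(preds, [2.5, 97.5], axis=0)   # pointwise 95% confidence band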

Statsmodels seasonal decomposition - Trend not a straight line

This question concerns the decomposition of the classic airline passengers data into trend, seasonal and residual components. We expected the trend to be a straight line, but the result is not. I wonder what the logic behind the extraction of the trend is. Can you please shed some light on this?
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(airline['Thousands of Passengers'], model='additive')
result.plot();
Two things to clarify:
1) Not all trends are linear
2) Even linear trends can be subject to some variation depending on the time series in question.
For instance, let's consider the trend for maximum air temperature in Dublin, Ireland over a number of years (modelled using statsmodels):
In this example, you can see that the trend both ascends and descends; given that air temperature is subject to changing seasons, we would expect this.
In terms of the airline dataset, the trend is observed over a number of years. Even once the seasonal and residual components have been extracted, the trend itself will be subject to shifts over time.
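It may also help to know that, by default, the trend component returned by seasonal_decompose is not a fitted straight line at all: it is roughly a centred moving average over one seasonal period, which is why it follows the slower ups and downs of the series. A small sketch, assuming the airline DataFrame from the question and a 12-month period:
from statsmodels.tsa.seasonal import seasonal_decompose

series = airline['Thousands of Passengers']
result = seasonal_decompose(series, model='additive')

# Approximately reproduce the trend with a 2x12 centred moving average
# (the exact alignment may differ slightly from the statsmodels implementation).
approx_trend = series.rolling(12, center=True).mean().rolling(2, center=True).mean()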

How can I use Correlation Coefficient to calculate change in variables

I calculated the correlation of two variables (size of plot/house vs. cost); the correlation stands at 0.87. I want to use this index to measure the increase or decrease in cost when the size is increased or decreased. Is that possible using correlation? How?
Correlation only tells us how strongly two variables are linearly related based on the data we have; it does not provide a method to calculate the value of one variable given the value of the other.
If the variables are linearly related, we can predict the actual values that a variable Y will take when a variable X has some value by using linear regression:
The idea is to fit the data to a linear function and use it to predict values:
Y = bX + a
Usually we first check whether two variables are related using a correlation coefficient (e.g. the Pearson coefficient), and then use a regression method (e.g. linear regression) to predict values of the variable of interest given the other.
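As a small illustration with made-up numbers (a sketch only, not based on the questioner's data), fitting Y = bX + a for cost against size and using it to predict could look like this in Python:
import numpy as np
from sklearn.linear_model import LinearRegression

size = np.array([[50], [80], [120], [160], [200]])   # hypothetical plot sizes (m^2)
cost = np.array([100, 150, 230, 300, 370])           # hypothetical costs (thousands)

reg = LinearRegression().fit(size, cost)
print(reg.coef_[0], reg.intercept_)                  # b (cost change per unit of size) and a
print(reg.predict([[140]]))                          # predicted cost for a 140 m^2 plot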
Here is an easy to follow tutorial on Linear Regression in Python with some theory:
https://realpython.com/linear-regression-in-python/#what-is-regression
Here is a tutorial on the typical problem of house price prediction:
https://blog.akquinet.de/2017/09/19/predicting-house-prices-on-kaggle-part-i/

Why does k=1 in KNN give the best accuracy?

I am using Weka IBk for text classification. Each document is basically a short sentence. The training dataset contains 15,000 documents. While testing, I can see that k=1 gives the best accuracy. How can this be explained?
If you are querying your learner with the same dataset you trained on, then with k=1 the output values should be perfect, unless you have data points with identical features but different outcome values. Do some reading on overfitting as it applies to KNN learners.
When you query with the same dataset you trained with, each query arrives with some given feature values. Because that exact point exists in the training data, the learner will match it as the closest neighbour and therefore output whatever Y value existed for that training point, which in this case is the same as the value of the point you queried with.
The possibilities are:
The training data and the test data are the same
The test data are highly similar to the training data
The boundaries between classes are very clear
The optimal value of k depends on the data. In general, a larger value of k reduces the effect of noise on the classification, but makes the boundaries between classes more blurred.
If your result variable contains values of 0 or 1, then make sure you are using as.factor; otherwise the data might be interpreted as continuous.
Accuracy should generally be calculated on points that are not in the training dataset, i.e. on unseen data points; only then can you claim that your model's accuracy is its accuracy on unseen values.
If you calculate accuracy on the training dataset with k=1, you get 100%, because those values have already been seen by the model and a rough decision boundary is formed for k=1. When you calculate the accuracy on unseen data, it performs really badly: the training error is very low but the actual (test) error is very high. So it would be better to choose an optimal k. To do that, plot the error against the value of k for the unseen (test) data and choose the value of k where the error is lowest (a sketch of this appears after the list below).
To answer your question now,
1) You might have taken the entire dataset as the training dataset and chosen a sub-part of it as the test dataset,
(or)
2) you might have measured accuracy on the training dataset.
If neither of these is the case, then please check the accuracy values for higher k; you will likely get even better accuracy for k > 1 on the unseen (test) data.
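The question uses Weka IBk, but the idea of choosing k on a held-out test set can be sketched in Python with scikit-learn (synthetic data, for illustration only):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in [1, 3, 5, 11, 21]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_err = 1 - knn.score(X_tr, y_tr)   # ~0 for k=1 (memorisation)
    test_err = 1 - knn.score(X_te, y_te)    # this is what should drive the choice of k
    print(k, train_err, test_err)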
