Time Series Modeling of Choppy Data - statistics

I'm trying to model 10 years of monthly time series data that is very choppy and, overall, has an upward trend. At first glance it looks like a strongly seasonal series; however, the test results indicate that it is definitely not seasonal. This is a pricing variable that I'm trying to model as a function of the macroeconomic environment, such as interest rates and yield curves.
I've tried linear OLS regression (proc reg), but I don't get a very good model with that.
I've also tried autoregressive error models (proc autoreg), but they pick up 7 lags of the error term as significant factors. I don't really want to include that many lags of the error term in the model. In addition, most of the macroeconomic variables become insignificant when I include all these error lags in the model.
Any suggestions on a modeling method or technique that could help me model this choppy data would be really appreciated.

On a past project, we used proc arima to predict future product sales based on a time series of past sales:
http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/viewer.htm#etsug_arima_sect019.htm (note that arima is also an autoregressive model)
But as Joe said, for properly statistical feedback on your question, you're better off asking at the Cross Validated site.
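As a rough illustration of why an ARIMA-style model may suit a trending, choppy series, the sketch below applies first differencing (the "I" step in ARIMA) and compares the sample autocorrelation before and after. The `prices` series is made up for the example, and `difference` and `autocorrelation` are hypothetical helper names, not SAS or library functions.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag]; lag=1 removes a roughly linear trend."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


# Made-up monthly prices: an upward trend plus an alternating "choppy" component.
prices = [100 + 2 * t + (5 if t % 2 else -5) for t in range(120)]
changes = difference(prices)

for lag in (1, 2, 12):
    print(lag, round(autocorrelation(prices, lag), 3), round(autocorrelation(changes, lag), 3))
```

On the raw series the trend dominates every lag; after differencing, what remains is the month-to-month structure that the autoregressive terms would have to explain.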

Related

Normality Assumption - how to check you have not violated it?

I am relatively new to statistics and am struggling with the normality assumption.
I understand that parametric tests are underpinned by the assumption that the data is normally distributed, but there seem to be lots of papers and articles providing conflicting information.
Some articles say that independent variables need to be normally distributed and that this may require a transformation (log, SQRT, etc.). Others say that in linear modelling there are no assumptions about the distribution of the independent variables.
I am trying to create a multiple regression model to predict highest pain scores on hospital admissions:
DV: numeric pain scores (0 = no pain -> 5 = intense pain) (discrete dependent variable).
IVs: age (continuous), weight (continuous), sex (nominal), deprivation status (ordinal), race (nominal).
Can someone help clear up the following for me?
Before fitting a model, do I need to check whether my independent variables are normally distributed? If so, why? Does this only apply to continuous variables (e.g. age and weight in my model)?
If age is positively skewed, would a transformation (e.g. log, SQRT) be appropriate, and why? Is it best to do this before or after fitting a model? I assume I am trying to get close to a linear relationship between my DV and IV.
Among its outputs, SPSS provides plots of the standardised residuals against predicted values and also normal P-P plots of the standardised residuals. Are these plots all that is needed to check the normality assumption after fitting a model?
Many Thanks in advance!
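For readers unfamiliar with the normal P-P plot mentioned in the question, here is a minimal sketch of what such a plot computes, using only the Python standard library. The residual values are invented, and `pp_points` is a hypothetical helper, not an SPSS function.

```python
from statistics import NormalDist, mean, stdev


def pp_points(residuals):
    """Return (empirical, theoretical) cumulative probabilities for a normal P-P plot."""
    centre, spread = mean(residuals), stdev(residuals)
    standardised = sorted((r - centre) / spread for r in residuals)
    n = len(standardised)
    normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    return [((i + 0.5) / n, normal.cdf(z)) for i, z in enumerate(standardised)]


# Invented residuals (observed pain score minus fitted value).
residuals = [-1.2, -0.7, -0.3, -0.1, 0.0, 0.2, 0.4, 0.6, 1.1]
points = pp_points(residuals)
largest_gap = max(abs(empirical - theoretical) for empirical, theoretical in points)
print(f"largest departure from the diagonal: {largest_gap:.3f}")
```

Points close to the diagonal (empirical roughly equal to theoretical) are what one hopes to see; the check is applied to the residuals of the fitted model, not to the raw variables.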

Multi-features modeling based on one binary-feature which is rarely 1 (imbalanced data) when there is a cost

I need to model multivariate time series data to predict a binary target which is rarely 1 (imbalanced data).
That is, the model is built around one binary feature (outbreak) that is rarely 1.
All of the features are binary and rarely 1.
What is the suggested solution?
This feature drives the cost function given below. We want to decide whether to be prepared or not prepared, given that cost.
Problem Definition:
Model based on outbreak, which is rarely 1.
Decide whether to be prepared or not prepared for a disease outbreak; the cost of an unprepared outbreak is 20 times the cost of preparation.
Cost of each day (next day):
cost=20*outbreak*!prepared+prepared
Model: on which days should we prepare (for the next day) for an outbreak?
Questions:
Build a model to predict outbreaks?
Report the cost estimation for every year
A CSV file is uploaded; the data are as of the end of each day.
Each row of the CSV file is one day with its features. Some of the features are binary, and the last feature is outbreak, which is rarely 1 and is the main feature in the cost.
You are describing class imbalance.
A typical approach is to generate balanced training data by repeatedly running through the examples containing your (rare) positive class, and each time choosing a new random sample from the negative class.
Also, pay attention to your cost function. You wouldn't want to reward a simple model for always choosing the majority class.
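The resampling scheme described above can be sketched as follows. This is a minimal illustration, assuming the examples are already split into positives and negatives and that there are at least as many negatives as positives; `balanced_batches` is a hypothetical helper name.

```python
import random


def balanced_batches(positives, negatives, n_rounds, seed=0):
    """Yield balanced training sets: every positive plus a fresh random sample of negatives."""
    rng = random.Random(seed)
    for _ in range(n_rounds):
        # Draw as many negatives as there are positives, without replacement.
        sampled_negatives = rng.sample(negatives, k=len(positives))
        batch = [(x, 1) for x in positives] + [(x, 0) for x in sampled_negatives]
        rng.shuffle(batch)
        yield batch


# Made-up example identifiers: 5 outbreak days and 100 ordinary days.
positives = list(range(5))
negatives = list(range(100, 200))

for batch in balanced_batches(positives, negatives, n_rounds=3):
    print(len(batch), sum(label for _, label in batch))
```

Each round reuses all positives but sees a different slice of the negative class, so a model trained across rounds is exposed to most of the negatives without any single batch being imbalanced.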
My suggestions:
Supervised Approach
SMOTE for upsampling
XGBoost, tuning scale_pos_weight
Replicate the minority class, e.g. 10 times
Try to use ensemble tree algorithms; trying to fit a linear surface is risky in your case.
Since your data is a time series, you can generate minority-class days just before a real outbreak happened. For example, suppose you have a minority-class observation at 2010-07-20 and the last observation before it is at 2010-06-27. You can generate observations dated 2010-07-15, 2010-07-18, etc. by slightly varying the values.
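Two of the items above are simple enough to sketch without any library. The first function computes the negative-to-positive ratio that the XGBoost documentation suggests as a typical starting value for scale_pos_weight; the second replicates minority rows. Both helper names and the label data are made up for illustration.

```python
def scale_pos_weight(labels):
    """Ratio of negative to positive examples, a common starting value for scale_pos_weight."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos


def replicate_minority(rows, labels, times=10):
    """Naive oversampling: repeat every positive row `times` times, keep negatives once."""
    out_rows, out_labels = [], []
    for row, y in zip(rows, labels):
        copies = times if y == 1 else 1
        out_rows.extend([row] * copies)
        out_labels.extend([y] * copies)
    return out_rows, out_labels


# Made-up data: 95 ordinary days and 5 outbreak days.
labels = [0] * 95 + [1] * 5
rows = [[i] for i in range(len(labels))]

print(scale_pos_weight(labels))
new_rows, new_labels = replicate_minority(rows, labels)
print(len(new_rows), sum(new_labels))
```

If replication is used, it is usually applied to the training split only, so that copies of the same day do not end up in both training and evaluation data.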
Unsupervised Approach
Try anomaly detection algorithms, such as IsolationForest (try the extended version of it as well).
Cluster your observations and check whether the minority class forms a cluster of its own. If it does, you can label your data with cluster names (cluster1, cluster2, cluster3, etc.) and then train a decision tree to see the split patterns (KMeans + DecisionTreeClassifier).
Model Evaluation
Set up a cost matrix. Do not rely directly on the confusion matrix, precision, etc. You can find further information about cost matrices here: http://mlwiki.org/index.php/Cost_Matrix
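For the cost function given in the question, the cost matrix leads to a simple decision rule: not preparing costs 20 * p in expectation (p being the predicted outbreak probability) and preparing costs 1, so preparing is cheaper whenever p > 1/20. The sketch below encodes that; the function names and the probabilities are illustrative assumptions, and the rule presumes the model's probabilities are reasonably calibrated.

```python
OUTBREAK_COST = 20    # cost of an unprepared outbreak, from the question
PREPARATION_COST = 1  # cost of preparing for a day, from the question


def daily_cost(outbreak, prepared):
    """cost = 20 * outbreak * !prepared + prepared, for 0/1 (or bool) inputs."""
    prepared = int(prepared)
    return OUTBREAK_COST * outbreak * (1 - prepared) + prepared


def should_prepare(p_outbreak):
    """Prepare when the expected cost of not preparing exceeds the cost of preparing."""
    return OUTBREAK_COST * p_outbreak > PREPARATION_COST


def total_cost(outbreaks, probabilities):
    """Total cost when the prepare decision follows the predicted probabilities."""
    return sum(daily_cost(o, should_prepare(p)) for o, p in zip(outbreaks, probabilities))


# Made-up outcomes and predicted probabilities for four days.
print(total_cost([0, 1, 0, 1], [0.01, 0.5, 0.2, 0.02]))
```

Comparing this total against the two trivial policies (always prepare, never prepare) gives a more meaningful baseline than accuracy does on data this imbalanced.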
Note:
Following OP's question in the comments, a group-by on year could be done like this:
import pandas as pd
df["date"] = pd.to_datetime(df["date"])  # parse the date column
df.groupby(df["date"].dt.year).mean()
You can use other aggregators as well (sum, count, etc.).
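For the yearly cost report, the natural aggregator is a sum rather than a mean. A dependency-free sketch of the same group-by-year idea follows; the record layout (ISO date string, daily cost) and the helper name `yearly_totals` are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def yearly_totals(records):
    """Sum daily costs per calendar year; records are (ISO date string, cost) pairs."""
    totals = defaultdict(int)
    for day, cost in records:
        totals[date.fromisoformat(day).year] += cost
    return dict(sorted(totals.items()))


# Made-up daily costs.
records = [
    ("2010-06-27", 1),
    ("2010-07-20", 20),
    ("2011-01-03", 0),
    ("2011-02-01", 1),
]
print(yearly_totals(records))
```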

Using SVM to perform classification on multi-dimensional time series datasets

I would like to use scikit-learn's svm.SVC() estimator to perform classification tasks on multi-dimensional time series - that is, on time series where the points in the series take values in R^d, where d > 1.
The issue with doing this is that svm.SVC() will only take ndarray objects of dimension at most 2, whereas the dimension of such a dataset would be 3. Specifically, the shape of a given dataset would be (n_samples, n_features, d).
Is there a workaround available? One simple solution would be to reshape the dataset so that it is 2-dimensional; however, I imagine this would lead to the classifier not learning from the dataset properly.
Without any further knowledge about the data, reshaping is the best you can do. Feature engineering is a very manual art that depends heavily on domain knowledge.
As a rule of thumb: if you don't really know anything about the data, throw in the raw data and see if it works. If you have an idea of which properties of the data may be beneficial for classification, try to work them into a feature.
Say we want to classify swiping patterns on a touch screen. This closely resembles your data: We acquired many time series of such patterns by recording the 2D position every few milliseconds.
In the raw data, each time series is characterized by n_timepoints * 2 features. We can use that directly for classification. If we have additional knowledge we can use that to create additional/alternative features.
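The reshaping step for the swipe example can be sketched in plain Python as below; the swipe coordinates are made up. For data already held in a NumPy array of shape (n_samples, n_timepoints, d), the equivalent is `X.reshape(n_samples, -1)`.

```python
def flatten_series(sample):
    """Turn one series of shape (n_timepoints, d) into a flat list of n_timepoints * d values."""
    return [value for point in sample for value in point]


# Two made-up swipes, each with 3 time points of 2D positions.
swipes = [
    [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)],
    [(0.0, 0.0), (0.5, 1.0), (1.0, 2.0)],
]

# 2D layout expected by classifiers such as svm.SVC: one row per series.
X = [flatten_series(swipe) for swipe in swipes]
print(X)
```

All series must have the same number of time points for the rows to line up; each column then always refers to the same coordinate at the same time step.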
Let's assume we want to distinguish between zig-zag and wavy patterns. In that case smoothness (however that is defined) may be a very informative feature that we can add as a further column to the raw data.
On the other hand, if we want to distinguish between slow and fast patterns, the instantaneous velocity may be a good feature. However, the velocity can be computed as a simple difference along the time axis. Even linear classifiers can model this easily so it may turn out that such features, although good in principle, do not improve classification of raw data.
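A small sketch of the velocity feature discussed above, computed as a difference along the time axis. The sampling interval `dt` and both helper names are assumptions for illustration. The per-step velocities are linear in the raw positions, while the mean speed (the norm of the velocity) is not, so it is the kind of summary one might append as an extra column.

```python
import math


def velocities(sample, dt=1.0):
    """Finite-difference velocity between consecutive points of one series."""
    return [
        tuple((b - a) / dt for a, b in zip(p0, p1))
        for p0, p1 in zip(sample, sample[1:])
    ]


def mean_speed(sample, dt=1.0):
    """Average magnitude of the velocity over the whole series."""
    speeds = [math.hypot(*v) for v in velocities(sample, dt)]
    return sum(speeds) / len(speeds)


# Made-up swipe moving at constant velocity (3, 4) per time step.
swipe = [(0, 0), (3, 4), (6, 8)]
print(velocities(swipe))
print(mean_speed(swipe))
```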
If you have lots and lots and lots and lots of data (say an internet full of good examples), Deep Learning neural networks can automatically learn features to some extent, but let's say this is rather advanced. In the end, most practical applications come down to trial and error. See what features you can come up with and try them out in practice. And beware the overfitting gremlin.

find important features for classification

I'm trying to classify some EEG data using a logistic regression model (this seems to give the best classification of my data). The data I have is from a multichannel EEG setup, so in essence I have a matrix of 63 x 116 x 50 (that is, channels x time points x number of trials; there are two trial types of 50). I have reshaped this to a long vector, one for each trial.
What I would like to do after the classification is to see which features were the most useful in classifying the trials. How can I do that, and is it possible to test the significance of these features? For example, to say that the classification was driven mainly by N features and these are features x to z. So I could, for instance, say that channel 10 at time points 90-95 was significant or important for the classification.
So is this possible or am I asking the wrong question?
Any comments or paper references are much appreciated.
Scikit-learn includes quite a few methods for feature ranking, among them:
Univariate feature selection (http://scikit-learn.org/stable/auto_examples/feature_selection/plot_feature_selection.html)
Recursive feature elimination (http://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_digits.html)
Randomized Logistic Regression/stability selection (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RandomizedLogisticRegression.html)
(see more at http://scikit-learn.org/stable/modules/feature_selection.html)
Among those, I definitely recommend giving Randomized Logistic Regression a shot. In my experience, it consistently outperforms other methods and is very stable.
Paper on this: http://arxiv.org/pdf/0809.2932v2.pdf
Edit:
I have written a series of blog posts on different feature selection methods and their pros and cons, which are probably useful for answering this question in more detail:
http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/
http://blog.datadive.net/selecting-good-features-part-ii-linear-models-and-regularization/
http://blog.datadive.net/selecting-good-features-part-iii-random-forests/
http://blog.datadive.net/selecting-good-features-part-iv-stability-selection-rfe-and-everything-side-by-side/
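Whichever ranking method is used, the selected feature indices still have to be mapped back to channels and time points. The sketch below does that for a list of per-feature weights (for instance the fitted `coef_` of a scikit-learn logistic regression). It assumes each trial was flattened channel by channel (row-major, 0-based indices), which the question does not state; the helper names and the weights are hypothetical. Ranking by absolute weight is only meaningful when features are on a comparable scale, and it is not a significance test.

```python
N_CHANNELS, N_TIMEPOINTS = 63, 116  # dimensions given in the question


def feature_to_channel_time(index, n_timepoints=N_TIMEPOINTS):
    """Map a flat feature index to (channel, time point), assuming row-major flattening."""
    return divmod(index, n_timepoints)


def top_features(coefficients, k=5):
    """Return the k features with the largest absolute weight as ((channel, time), weight)."""
    ranked = sorted(range(len(coefficients)), key=lambda i: abs(coefficients[i]), reverse=True)
    return [(feature_to_channel_time(i), coefficients[i]) for i in ranked[:k]]


# Made-up weights: mostly zero, with two features standing out.
weights = [0.0] * (N_CHANNELS * N_TIMEPOINTS)
weights[10 * N_TIMEPOINTS + 90] = -2.0  # channel 10, time point 90
weights[5] = 1.0                        # channel 0, time point 5

print(top_features(weights, k=2))
```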

Looking for a simple machine learning approach to predict final exam score from training set

I am trying to predict test results based on known previous scores. The test is made up of three subjects, each contributing to the final exam score. For all students I have their previous scores for mini-tests in each of the three subjects, and I know which teacher they had. For half of the students (the training set) I have their final score; for the other half I don't (the test set). I want to predict their final score.
So the training set looks like this:
student teacher subject1score subject2score subject3score finalscore
while the test set is the same but without the final score
student teacher subject1score subject2score subject3score
So I want to predict the final score of the test set students. Any ideas for a simple learning algorithm or statistical technique to use?
The simplest and most reasonable method to try is a linear regression, with the teacher and the three scores used as predictors. (This is based on the assumption that the teacher and the three test scores will each have some predictive ability towards the final exam, but that they could contribute differently; for example, the third test might matter the most.)
You don't mention a specific language, but let's say you loaded it into R as two data frames called training.scores and test.scores. Fitting the model would be as simple as using lm:
lm.fit = lm(finalscore ~ teacher + subject1score + subject2score + subject3score, training.scores)
And then the prediction would be done as:
predicted.scores = predict(lm.fit, test.scores)
Googling for "R linear regression", "R linear models", or similar searches will find many resources that can help. You can also learn about slightly more sophisticated methods such as generalized linear models or generalized additive models, which are almost as easy to perform as the above.
ETA: There have been books written about the topic of interpreting linear regression; a simple example guide is here. In general, you'll call summary(lm.fit) to print a bunch of information about the fit. You'll see a table of coefficients in the output that will look something like:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.4511     7.0938  -2.037 0.057516 .
setting       0.2706     0.1079   2.507 0.022629 *
effort        0.9677     0.2250   4.301 0.000484 ***
The Estimate will give you an idea of how strong the effect of that variable was, while the p-values (Pr(>|t|)) give you an idea of whether each variable actually helped or whether its apparent effect could be due to random noise. There's a lot more to it, but I invite you to read the excellent resources available online.
Also, plot(lm.fit) will draw graphs of the residuals (a residual is the amount by which a prediction is off for an observation in your training set), which you can use to determine whether the assumptions of the model are fair.
