Regression using two dependent variables

I have some data for time series prediction. Variable 1 is the vehicle's speed and variable 2 is the time of day the vehicle starts. The output is the time taken for the vehicle to reach its destination. I used both variable 1 and variable 2 as inputs for SVR using LIBSVM, but later found out that variable 1 and variable 2 are dependent, since the speed of the vehicle depends on the time of day.
Can we do regression using two dependent variables as inputs? As far as I know, the regression model y = a + b1.x1 + b2.x2 + ... + e is for independent variables.

The standard regression model does not require independent inputs: no assumption is made about dependence between the input variables. However, if there is an interaction effect, you might find that simply adding an interaction term to the regression model improves results. With this, your model becomes:
y = a + b1.x1 + b2.x2 + b3.x1.x2 + e
I'm not sure whether SVR implementations let you specify such an interaction directly; you can certainly fake it by adding that product as a feature to the input, or use a regression method which directly supports it.
Another potential hazard is how you're representing time, as I can easily see this going wrong. What does your time input look like?
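One common pitfall with a raw clock value is that it wraps around at midnight, so 23:00 and 01:00 look far apart; a cyclical sin/cos encoding avoids this, and the interaction term can be added as an extra feature in the same pass. A minimal sketch (the feature layout and hour-based encoding are illustrative assumptions, not the asker's actual setup):

```python
import math

def make_features(speed, start_hour):
    """Build an SVR input vector from raw speed and start time.

    The sin/cos pair encodes time of day cyclically, so 23:00 and
    01:00 end up close together, and the speed * time products stand
    in for the interaction term b3.x1.x2 from the regression model.
    """
    angle = 2 * math.pi * start_hour / 24.0
    hour_sin, hour_cos = math.sin(angle), math.cos(angle)
    return [speed, hour_sin, hour_cos,
            speed * hour_sin, speed * hour_cos]  # interaction features

features = make_features(speed=60.0, start_hour=8)
```

Whatever encoding is chosen, the same transformation has to be applied to both the training and the test inputs.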

Related

XGboost model variable transformation

I am working on an XGBoost model with a few input variables. There is one variable X for which I am testing different ways of transformation:
Option 1. I take a group-by average of X and use the deviation X - group_by_mean(X) as input.
Option 2. I fit a simple linear regression y = aX + b on the grouped X and use the fitted y as input.
I run two models with otherwise identical inputs.
Result: I get better model predictions from option 1 than from option 2 in the XGBoost model.
My question is: can anyone point me to a potential theoretical reason why option 1 gives a better result as an input to an XGBoost model than option 2?
I suspect it is because option 2, a simple linear regression, introduces unobserved error, while option 1, a simple average, has zero unobserved error, since I use all known information in option 1. But I would appreciate more theoretical reasoning and backing if possible.
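For concreteness, the two transformations could be sketched roughly as follows (plain-Python stand-ins for the group-by and regression steps; the exact grouping key and target variable are assumptions, not the asker's setup):

```python
def group_means(xs, groups):
    """Mean of X within each group (option 1's group_by_mean(X))."""
    sums, counts = {}, {}
    for x, g in zip(xs, groups):
        sums[g] = sums.get(g, 0.0) + x
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}

def option1(xs, groups):
    """Option 1: deviation from the group mean, X - group_by_mean(X)."""
    means = group_means(xs, groups)
    return [x - means[g] for x, g in zip(xs, groups)]

def option2(xs, ys):
    """Option 2: fitted values of a simple OLS line y = a*X + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return [a * x + b for x in xs]
```

Note that option 1 is a lossless re-centering of X, while option 2 replaces X with a one-dimensional linear summary, which discards any signal not captured by the fitted line.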

What's the selection criterion for the "best" model in "regsubsets" & how do I access several "best" models?

I've been playing around with the regsubsets function a bit, using the "forward" method to select variables for a linear regression model. However, despite also reading the documentation, I can't seem to figure out how the leaps.setup underlying this function determines the "best" model for each separate number of variables in a model.
Say I have a model with 10 potential variables in it (and nvmax = 10): I get exactly one "best" model with 1 variable, 2 variables, etc. But how is this model selected by the function? I wonder particularly because, after having run this function, I'm able to extract the best model among all models of different(!) sizes by specifying a criterion (e.g., adjr2).
Related to this, I wonder: If I set, for example, nbest = 5 I understand that the function calculates the five best models for each model size (i.e., for a model with ten variables it gives five different variations that perform better than the rest). If I understand that correctly, is there any way to extract these five models for a specific model size? That is, for example, display the coefficients of these five best models?
I hope I'm being clear about my problems here... Please let me know if example data or any further information would help to clarify the issue!
For each model size, the "best" model picked by regsubsets is the one that minimizes the residual sum of squares (RSS).
I'm still working on the second question...
Addressing the second question: the code below displays the coefficients of the 5 best models for each number of explanatory variables, from 1 to 3. Y is the response variable of the models.
library(leaps)
# Keep the 5 best subsets (nbest) of each size, up to 3 predictors (nvmax)
best_models = regsubsets(Y ~ ., data = data_set, nvmax = 3, nbest = 5)
# 3 sizes x 5 models each = 15 models in total; print all their coefficients
coef(best_models, 1:15)

How to combine LIBSVM probability estimates from two (or three) two-class SVM classifiers

I have training data that falls into two classes, let's say Yes and No. The data represents three tasks, easy, medium and difficult. A person performs these tasks and is classified into one of the two classes as a result. Each task is classified independently and then the results are combined. I am using 3 independently trained SVM classifiers and then voting on the final result.
I am looking to provide a measure of confidence or probability associated with each classification. LIBSVM can provide a probability estimate along with the classification for each task (easy, medium and difficult, say Pe, Pm and Pd) but I am unsure of how best to combine these into an overall estimate for the final classification of the person (let's call it Pp).
My attempts so far have been along the lines of a simple average:
Pp = (Pe + Pm + Pd) / 3
An inverse-variance weighted average (since each task is repeated a few times, sample variances (VARe, VARm and VARd) can be calculated; in this case Pe would be a simple average over all the easy samples):
Pp = (Pe/VARe + Pm/VARm + Pd/VARd) / (( 1/VARe ) + ( 1/VARm ) + ( 1/VARd ))
Or a multiplication (under the assumption that these events are independent, which I am unsure of since the underlying tasks are related):
Pp = Pe * Pm * Pd
The multiplication yields a very low number, so it's unclear how to interpret that as an overall probability, even when the voting result is very clear.
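The three candidates can be written out directly. One remark on the raw product: it is not itself a probability, but under the same independence assumption (plus a uniform prior over Yes/No) Bayes' rule gives a normalized version, product / (product + product of complements). That normalized variant is included below as a comparison point, not something from the question:

```python
def simple_average(ps):
    """Pp = (Pe + Pm + Pd) / 3"""
    return sum(ps) / len(ps)

def inv_variance_average(ps, variances):
    """Pp = sum(Pi / VARi) / sum(1 / VARi)"""
    weights = [1.0 / v for v in variances]
    return sum(p * w for p, w in zip(ps, weights)) / sum(weights)

def normalized_product(ps):
    """Raw product Pe*Pm*Pd, renormalized against the product of the
    complements so it lies in [0, 1] and can be read as a probability
    (assumes independence and a uniform prior over the two classes)."""
    yes, no = 1.0, 1.0
    for p in ps:
        yes *= p
        no *= 1.0 - p
    return yes / (yes + no)
```

With equal variances the inverse-variance average reduces to the simple average, and the normalized product is more confident than either average when the individual estimates agree.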
Would any of these three options be the best or is there some other method / detail I'm overlooking?
Based on your comment, I will make the following suggestion. If you need to do this as an SVM (and because, as you say, you get better performance when you do it this way), take the outputs from your intermediate classifiers and feed them as features to your final classifier. Even better, switch to a multi-layer neural net: the inputs represent the inputs to the intermediate problems, the (first) hidden layer represents the outputs of the intermediate problems, and subsequent layer(s) represent the final decision you want. This way you still get the benefit of an intermediate layer, but its output is optimised to help with the final prediction rather than for accuracy in its own right (which I assume you don't really care about).
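A minimal stacking sketch along these lines, with a hand-rolled logistic regression standing in for the final classifier (the toy data and model are invented for illustration; in practice the three feature columns would be the probability outputs Pe, Pm, Pd of the LIBSVM models):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Plain stochastic-gradient-descent logistic regression."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of log-loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Each row holds the three intermediate probabilities (Pe, Pm, Pd);
# the label is the person's final Yes/No class.  Invented toy data.
X = [[0.9, 0.8, 0.6], [0.8, 0.7, 0.7], [0.2, 0.3, 0.1], [0.3, 0.1, 0.2]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
```

The learned weights then play the role of the combination rule: instead of fixing an average by hand, the final classifier decides how much each task's estimate should count.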
The correct generative model for these tests likely looks something like the following:
Generate an intelligence/competence score i
For each test t: generate pass/fail according to p_t(pass | i)
This is simplified, but I think it should illustrate that you have a latent variable i on which these tests depend (and there is also structure between them, since presumably p_easy(pass|i) > p_medium(pass|i) > p_hard(pass|i); you could potentially model this as a logistic regression with a continuous 'hardness' feature). I suspect what you're asking about is a way to do inference on some thresholding function of i, but you want to do it in a classification way rather than as a probabilistic model. That's fine, but without explicitly encoding the latent variable and the structure between the tests it's going to be hard (and no average of the probabilities will account for the missing structure).
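A toy version of that generative story, with p_t(pass | i) modeled as a logistic function of competence minus a per-test hardness value (the functional form and the numbers are illustrative assumptions):

```python
import math

def p_pass(i, hardness):
    """Chance of passing a test of the given hardness at competence i."""
    return 1.0 / (1.0 + math.exp(-(i - hardness)))

# Hypothetical hardness values: easier tests sit lower on the scale.
HARDNESS = {"easy": -1.0, "medium": 0.0, "hard": 1.0}

# For any fixed competence i, the structure
# p_easy(pass|i) > p_medium(pass|i) > p_hard(pass|i) holds automatically.
probs = {t: p_pass(0.3, h) for t, h in HARDNESS.items()}
```

With this parameterization, the ordering between the tests is built into the model rather than something an after-the-fact average of probabilities has to recover.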
I hope that helps---if I've made assumptions that aren't justified, please feel free to correct.

Online learning with Naive Bayes Classifier

I am trying to predict the inter-arrival time of incoming network packets. I measure the inter-arrival times of network packets and represent this data in the form of binary features: xi = 0,1,1,1,0,..., where xi = 0 if the inter-arrival time is less than a break-even time and 1 otherwise. The data has to be mapped into two possible classes C = {0,1}, where C = 0 represents a short inter-arrival time and C = 1 represents a long one. I want to implement the classifier in an online fashion: as soon as I observe a feature vector xi = 0,1,1,0,..., I compute the MAP class. Since I don't have a prior estimate of the conditional and prior probabilities, I initialize them as follows:
p(x=0|c=0)=p(x=1|c=0)=p(x=0|c=1)=p(x=1|c=1)=0.5
p(c=0)=p(c=1)=0.5
For each feature vector (x1=m1,x2=m2,...,xn=mn), when I output a class C, I update the conditional and prior probabilities as follows:
p(xi=mi|y=c) = a + (1-a)*p(xi=mi|y=c)
p(y=c)=b+(1-b)*p(y=c)
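In code, such an exponential-forgetting update could look as follows. One detail is added here as an assumption about the intent: the value that was not observed is decayed by (1 - a), so that each conditional distribution still sums to 1 after the update (this keeps the estimates coherent but does not, by itself, fix class imbalance):

```python
def update_conditional(cond, feature, value, cls, a=0.05):
    """cond[(feature, value, cls)] holds p(x_feature = value | y = cls).

    Observed value:    p <- a + (1 - a) * p
    Unobserved value:  p <- (1 - a) * p
    so the two probabilities for the feature still sum to 1.
    """
    for v in (0, 1):
        p = cond[(feature, v, cls)]
        cond[(feature, v, cls)] = (a if v == value else 0.0) + (1 - a) * p

# Start from the uniform initialization given above (one binary feature).
cond = {(0, v, c): 0.5 for v in (0, 1) for c in (0, 1)}
update_conditional(cond, feature=0, value=1, cls=0)
```

The prior update p(y=c) = b + (1-b)*p(y=c) can be handled the same way, decaying the other class by (1 - b).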
The problem is that I always get a biased prediction. Since the number of long inter-arrival times is comparatively smaller than the number of short ones, the posterior of the short class always remains higher than that of the long class. Is there any way to improve this, or am I doing something wrong? Any help will be appreciated.
Since you have a long time series, the best path would probably be to take into account more than a single previous value. The standard way of doing this is to use a time window, i.e. split the long vector Xi into overlapping pieces of a constant length, with the last value treated as the class, and use them as the training set. This can also be done on streaming data in an online manner, by incrementally updating the NB model as new data arrives.
Note that using this method, other algorithms might end up being a better choice than NB.
Weka (version 3.7.3 and up) has a very nice dedicated tool supporting time-series analysis. Alternatively, MOA, which is also based on Weka, supports modeling of streaming data.
EDIT: it might also be a good idea to move from binary features to the real values (maybe normalized) and apply the threshold post-classification. This would give more information to the model (NB or other), allowing better accuracy.
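The time-window idea can be sketched as follows: slide a fixed-length window over the binary series and treat the last element of each window as the class label (the window length here is an arbitrary choice for illustration):

```python
def windows(series, length):
    """Split a long 0/1 series into overlapping (features, label) pairs:
    the first length-1 values are the features, the last is the class."""
    out = []
    for start in range(len(series) - length + 1):
        piece = series[start:start + length]
        out.append((piece[:-1], piece[-1]))
    return out

pairs = windows([0, 1, 1, 1, 0, 1], length=4)
```

Each pair can then be fed to the incrementally updated NB model (or any other online learner) as it is produced from the stream.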

SPSS logistic regression

I'm wondering if there is a way to fit many single-covariate logistic regressions. I want to do this for all my variables because of the missing values: I wanted to fit a multiple logistic regression, but I have too many missing values. I don't want to set up a separate logistic regression for each variable in my DB by hand; is there an automatic way?
Thank you very much!
You can code it using SPSS syntax.
For example:
LOGISTIC REGRESSION VARIABLES F2B16C    /* dependent variable */
/METHOD=BSTEP                           /* backward stepwise: all variables in, then see what can be backed out */
XRACE BYSES2 BYTXMSTD F1RGPP2 F1STEXP XHiMath    /* independent variables */
/contrast (xrace)=indicator(6)          /* creates the dummy variables with #6 as the base case */
/contrast (F1Rgpp2)=indicator(6)
/contrast (f1stexp)=indicator(6)
/contrast (XHiMath)=indicator(5)
/PRINT=GOODFIT CORR ITER(1)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
If you do that, you can also tell it to keep records with missing values where appropriate by adding:
/MISSING=INCLUDE
If anyone knows of a good explanation of the implications of including the missing values, I'd love to hear it.
