What's the selection criterion for the "best" model in "regsubsets" & how do I access several "best" models? - subset

I've been playing around with the regsubsets function a bit, using the "forward" method to select variables for a linear regression model. However, despite also reading the documentation I can't seem to figure out, how the leaps.setup underlying this function determines the "best" model for each separate number of variables in a model.
Say I have a model with potential 10 variables in it (and nvmax = 10), I get exactly one "best" model for a model with 1 var, 2 vars etc. But how is this model selected by the function? I wonder particularly because after having run this function, I'm able to extract the best model of all models with different(!) sizes by determining a specific criterion (e.g., adjr2).
Related to this, I wonder: If I set, for example, nbest = 5 I understand that the function calculates the five best models for each model size (i.e., for a model with ten variables it gives five different variations that perform better than the rest). If I understand that correctly, is there any way to extract these five models for a specific model size? That is, for example, display the coefficients of these five best models?
I hope, I'm being clear about my problems here... Please, let me know, if exemplary data or any further information will help to clarify the issue!

The "best" model picked by regsubsets is the one that minimizes the sum of the squares of the residuals.
I'm still working on the second question...

Addressing the second question: the next code displays the coefficients of the 5 best models for each quantity of explanatory variables, from 1 to 3 variables. Y is the response variable of the models.
library(leaps)
best_models = regsubsets( Y ~ ., data = data_set, nvmax=3, nbest=5)
coef(best_models, 1:15)

Related

Normality Assumption - how to check you have not violated it?

I am rleatively new to statistics and am stuggling with the normality assumption.
I understand that parametric tests are underpinned by the assumption that the data is normally distributed, but there seems to be lots of papers and articles providing conflicting information.
Some articles say that independant variables need to be normally disrbiuted and this may require a transformation (log, SQRT etc.). Others says that in linear modelling there are no assumptions about any linear the distribution of the independent variables.
I am trying to create a multiple regression model to predict highest pain scores on hospital admissions:
DV: numeric pain scores (0-no pain -> 5 intense pain)(discrete- dependant variable).
IVs: age (continuous), weight (continuous), sex (nominal), depreviation status (ordinal), race (nominal).
Can someone help clear up the following for me?
Before fitting a model, do I need to check the whether my independant variables are normally distributed? If so, why? Does this only apply to continuous variables (e.g. age and weight in my model)?
If age is positively skewed, would a transformation (e.g. log, SQRT) be appropriate and why? Is it best to do this before or after fitting a model? I assume I am trying to get close to a linear relationship between my DV and IV.
As part of the SPSS outputs it provides plots of the standardised residuals against predicted values and also normal P-P plots of standardised residuals. Are these tests all that is needed to check the normality assumption after fitting a model?
Many Thanks in advance!

Is there a way to calculate standard errors when using predict.mppm?

I'm using spatstat to run some mppm models and would like to be able to calculate standard errors for the predictions as in predict.ppm. I could use predict.ppm on each point process individually of course, but I'm wondering if this in invalid for any reason or if there is a better way of doing so?
This is not yet implemented as an option in predict.mppm. (It is on our long list of things to do. I will move it closer to the top of the list.)
However, it is available by applying predict.ppm to each element of subfits(model), where model was the original fitted model of class mppm. Something like:
m <- mppm(......)
fits <- subfits(m)
Y <- lapply(fits, predict, se=TRUE)
Just to clarify, fits[[i]] is a point process model, of class ppm, for the data in row i of the data hyperframe, implied by the big model m. The parameter estimates and variance estimates in fits[[i]] are based on information from the entire hyperframe. This is not the same as fitting a separate model of class ppm to the data in each row of the hyperframe and calculating predictions and standard errors for those fits.

What is the difference between scale and fit in sklearn?

i am new to datascience and when i was going through one of the kaggle blog, i saw that the user is using both scale and fit on the data set. i tried to understand the difference by going through the documentation but was not able to understand
It's hard to understand the source of your confusion without any code. Inside the link you provided, the data is first scaled with sklearn.preprocessing.scale() and then fit to a sklearn.ensemble.GradientBoostingRegressor.
So the scaling operation transforms data such that all the features are represented on the same scale, and the fitting operation trains the model with the said data.
From your question it sounds like you thought these two operations were mutually exclusive, or somehow equivalent, but they are actually logical consecutive steps.
In general, before model is trained, data is somehow preprocessed (with .scale() in this case), then trained. In sklearn the .fit() methods are for training (fitting functions/models to the data).
Hope it makes sense!
Scale is a data normalization technique and it is used when data in different features are of not similar values like in one feature you have values ranging from 1 to 10 and in other features you have values ranging from 1000 to 10000.
Where as fit is the function that actually starts your model training
Scaling is conversion of data, a method used to normalize the range of independent variables or features of data. The fit method is a training step.

How to combine LIBSVM probability estimates from two (or three) two class SVM classifiers.

I have training data that falls into two classes, let's say Yes and No. The data represents three tasks, easy, medium and difficult. A person performs these tasks and is classified into one of the two classes as a result. Each task is classified independently and then the results are combined. I am using 3 independently trained SVM classifiers and then voting on the final result.
I am looking to provide a measure of confidence or probability associated with each classification. LIBSVM can provide a probability estimate along with the classification for each task (easy, medium and difficult, say Pe, Pm and Pd) but I am unsure of how best to combine these into an overall estimate for the final classification of the person (let's call it Pp).
My attempts so far have been along the lines of a simple average:
Pp = (Pe + Pm + Pd) / 3
An Inverse-variance weighted average (since each task is repeated a few times and sample variance (VARe, VARm and VARd) can be calculated - in which case Pe would be a simple average of all the easy samples):
Pp = (Pe/VARe + Pm/VARm + Pd/VARd) / (( 1/VARe ) + ( 1/VARm ) + ( 1/VARd ))
Or a multiplication (under the assumption that these events are independent, which I am unsure of since the underlying tasks are related):
Pp = Pe * Pm * Pd
The multiplication would provide a very low number, so it's unclear how to interpret that as an overall probability when the results of the voting are very clear.
Would any of these three options be the best or is there some other method / detail I'm overlooking?
Based on your comment, I will make the following suggestion. If you need to do this as an SVM (and because, as you say, you get better performance when you do it this way), take the output from your intermediate classifiers and feed them as features to your final classifier. Even better, switch to a multi-layer Neural Net where your inputs represent inputs to the intermediates, the (first) hidden layer represents outputs to the intermediate problem, and subsequent layer(s) represent the final decision you want. This way you get the benefit of an intermediate layer, but its output is optimised to help with the final prediction rather than for accuracy in its own right (which I assume you don't really care about).
The correct generative model for these tests likely looks something like the following:
Generate an intelligence/competence score i
For each test t: generate pass/fail according to p_t(pass | i)
This is simplified, but I think it should illustrate tht you have a latent variable i on which these tests depend (and there's also structure between them, since presumably p_easy(pass|i) > p_medium(pass|i) > p_hard(pass|i); you could potentially model this as a logistic regression with a continuous 'hardness' feature). I suspect what you're asking about is a way to do inference on some thresholding function of i, but you want to do it in a classification way rather than as a probabilistic model. That's fine, but without explicitly encoding the latent variable and the structure between the tests it's going to be hard (and no average of the probabilities will account for the missing structure).
I hope that helps---if I've made assumptions that aren't justified, please feel free to correct.

Looking for a simple machine learning approach to predict final exam score from training set

I am trying to predict test reuslts based on known previous scores. The test is made up of three subjects, each contributing to the final exam score. For all students I have their previous scores for mini-tests in each of the three subjects, and I know which teacher they had. For half of the students (the training set) I have their final score, for the other half I don't (the test set). I want predict their final score.
So the test set looks like this:
student teacher subject1score subject2score subject3score finalscore
while the test set is the same but without the final score
student teacher subject1score subject2score subject3score
So I want to predict the final score of the test set students. Any ideas for a simple learning algorithm or statistical technique to use?
The simplest and most reasonable method to try is a linear regression, with the teacher and the three scores used as predictors. (This is based on the assumption that the teacher and the three test scores will each have some predictive ability towards the final exam, but they could contribute differently- for example, the third test might matter the most).
You don't mention a specific language, but let's say you loaded it into R as two data frames called 'training.scoresandtest.scores`. Fitting the model would be as simple as using lm:
lm.fit = lm(finalscore ~ teacher + subject1score + subject2score + subject3score, training.scores)
And then the prediction would be done as:
predicted.scores = predict(lm.fit, test.scores)
Googling for "R linear regression", "R linear models", or similar searches will find many resources that can help. You can also learn about slightly more sophisticated methods such as generalized linear models or generalized additive models, which are almost as easy to perform as the above.
ETA: There have been books written about the topic of interpreting linear regression- an example simple guide is here. In general, you'll be printing summary(lm.fit) to print a bunch of information about the fit. You'll see a table of coefficients in the output that will look something like:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.4511 7.0938 -2.037 0.057516 .
setting 0.2706 0.1079 2.507 0.022629 *
effort 0.9677 0.2250 4.301 0.000484 ***
The Estimate will give you an idea how strong the effect of that variable was, while the p-values (Pr(>|T|)) give you an idea whether each variable actually helped or was due to random noise. There's a lot more to it, but I invite you to read the excellent resources available online.
Also plot(lm.fit) will graphs of the residuals (residuals mean the amount each prediction is off by in your testing set), which tells you can use to determine whether the assumptions of the model are fair.

Resources