I'm trying to model outcomes of 2 binomial variables that I don't believe to be independent of one another. I had the idea of nesting the outcomes of one variable within the other like this:
but I'm unsure how to go about this in SAS. Otherwise, is it acceptable to use multinomial regression and create dummy variables to compare groups without nesting? The two variables appear to be related, but there are other variables we're also interested in testing for significance.
Related
I've been playing around with the regsubsets function a bit, using the "forward" method to select variables for a linear regression model. However, even after reading the documentation, I can't figure out how the leaps.setup function underlying it determines the "best" model for each number of variables.
Say I have a model with 10 potential variables (and nvmax = 10). I get exactly one "best" model with 1 variable, one with 2 variables, and so on. But how does the function select each of these models? I wonder particularly because, after running the function, I can extract the best model across all the different(!) sizes by specifying a criterion (e.g., adjr2).
Related to this, I wonder: If I set, for example, nbest = 5 I understand that the function calculates the five best models for each model size (i.e., for a model with ten variables it gives five different variations that perform better than the rest). If I understand that correctly, is there any way to extract these five models for a specific model size? That is, for example, display the coefficients of these five best models?
I hope I'm being clear about my problems here. Please let me know if example data or any further information would help to clarify the issue!
The "best" model picked by regsubsets is the one that minimizes the sum of the squares of the residuals.
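To make that concrete, here is a minimal Python sketch of greedy forward selection by RSS. It is not how leaps is actually implemented, only an illustration of the idea; the function names (`rss`, `forward_select`, `adjusted_r2`) and the synthetic data are made up for this example. For a fixed number of variables, criteria such as adjusted R² are monotone in RSS, so they rank same-size models identically; they only matter when comparing models of different sizes.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_select(X, y, nvmax):
    """Greedy forward selection: at each size, add the column that lowers RSS most."""
    chosen, path = [], []
    remaining = list(range(X.shape[1]))
    for _ in range(min(nvmax, X.shape[1])):
        best = min(remaining, key=lambda c: rss(X, y, chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
        path.append((list(chosen), rss(X, y, chosen)))
    return path

def adjusted_r2(rss_value, y, k):
    """Adjusted R^2 for a model with k predictors plus an intercept."""
    n = len(y)
    tss = float(((y - y.mean()) ** 2).sum())
    return 1.0 - (rss_value / (n - k - 1)) / (tss / (n - 1))

# Synthetic data (hypothetical): only columns 0 and 2 drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

path = forward_select(X, y, nvmax=5)
for cols, r in path:
    # One "best" model per size; adjusted R^2 is what lets you compare across sizes.
    print(len(cols), cols, round(r, 2), round(adjusted_r2(r, y, len(cols)), 4))
```

RSS can only go down as variables are added, which is why a size-penalizing criterion is needed to pick one model across sizes.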
I'm still working on the second question...
Addressing the second question: the following code displays the coefficients of the 5 best models for each number of explanatory variables, from 1 to 3 (15 models in total, assuming the data set has enough variables to form 5 distinct models of each size). Y is the response variable of the models.
library(leaps)
best_models = regsubsets( Y ~ ., data = data_set, nvmax=3, nbest=5)
coef(best_models, 1:15)
I plan on using a data set which contains 3 target values of interest. Ultimately I will be trying classification methods on a binary target and also plan on using regression methods for two separate continuous targets.
Is it bad practice to do a different train/test split for each target variable?
Otherwise, I am not sure how to split the data in a way that will allow me to predict each target, separately.
If they're effectively 3 different models, trained and evaluated separately, then for the purpose of evaluating each model's performance it doesn't matter whether you use a different train/test split for each one, as no information leaks from model to model. But if you plan on comparing the results of the 3 models, or combining all 3 scores into some aggregate metric, then you should use the same train/test split so that all 3 models work from the same training data. Otherwise rows held out as test data for one model may be training data for another, and your combined score will to some extent be a function of your test data.
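One way to get a shared split is to shuffle the row indices once and reuse them for every target. Below is a minimal standard-library sketch; `split_indices`, `take`, and the toy `rows` layout are hypothetical names and data invented for illustration.

```python
import random

def split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices once and return (train_indices, test_indices)."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n_rows * test_fraction))
    return idx[n_test:], idx[:n_test]

# Toy rows (hypothetical layout): (features, binary_target, cont_target_a, cont_target_b)
rows = [((i, i % 3), i % 2, i * 0.5, i * 1.5) for i in range(10)]

train_idx, test_idx = split_indices(len(rows))

def take(indices, column):
    """Pull one column of `rows` for the given row indices."""
    return [rows[i][column] for i in indices]

# The same indices are reused for the features and for every target,
# so all three models see identical training and test rows.
X_train, X_test = take(train_idx, 0), take(test_idx, 0)
targets = {"binary": 1, "cont_a": 2, "cont_b": 3}
splits = {name: (take(train_idx, col), take(test_idx, col))
          for name, col in targets.items()}
```

Each model is then fit on `X_train` with its own entry of `splits`, and scored on `X_test`.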
Maybe it is obvious but I would like to be sure of what I am doing:
I understand that group k-fold, as implemented in sklearn, is a variation of k-fold cross-validation which ensures that data belonging to the same group is never present in the train and test sets at the same time.
That is exactly what I need. However, before I discovered this implementation of group k-fold, while I was trying to compute the validation curve for a problem, I noticed the following parameter (groups):
validation_curve(estimator, X, y, param_name, param_range, groups=None, cv=None...)
According to the documentation if I provide a list of size [n_samples] providing the labels for the corresponding groups, then train/test dataset splitting will be done according to these labels.
And here comes the question. Since such a convenient parameter is provided, why does everyone who needs group k-fold validation (according to my searches) first reach for sklearn.model_selection.GroupKFold?
Am I missing something here?
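As far as I can tell from the scikit-learn documentation, the two are meant to be used together: `groups` only supplies the labels, and it is only used when `cv` is a group-aware splitter such as `GroupKFold`. A minimal sketch on synthetic data follows; the Ridge estimator, the `alpha` grid, and the data are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, validation_curve

# Synthetic data (hypothetical): 60 samples in 6 groups of 10.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=60)
groups = np.repeat(np.arange(6), 10)

cv = GroupKFold(n_splits=3)

# Sanity check: no group ever appears on both sides of a split.
for train_idx, test_idx in cv.split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

# `groups` is forwarded to the splitter passed as `cv`.
train_scores, test_scores = validation_curve(
    Ridge(), X, y,
    param_name="alpha", param_range=[0.01, 1.0, 100.0],
    groups=groups, cv=cv,
)
print(test_scores.mean(axis=1))  # one mean test score per alpha value
```

With `cv=None` the default splitter is not group-aware, so passing `groups` alone would not keep groups apart.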
Consider a hypothetical situation where I have 3 independent variables: one has a non-linear (exponential) relationship with the dependent variable, and the other two are linearly related to it. In such a case, what would be the best approach for running a regression analysis?
Assume I have already tried transforming the one non-linear independent variable.
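For reference, here is a small sketch of what "transforming the non-linear variable" can look like. It assumes one specific reading of the question, namely that the dependent variable is additive in x1, x2 and exp(x3); under that assumption the model is still linear in its coefficients, so ordinary least squares applies after the transform. The data and coefficient values are synthetic.

```python
import numpy as np

# Synthetic data (assumption): y = 1 + 2*x1 - 1*x2 + 0.5*exp(x3) + noise
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.uniform(0.0, 2.0, size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * np.exp(x3) + rng.normal(scale=0.1, size=n)

# Design matrix: intercept, the two linear predictors, and the transformed one.
A = np.column_stack([np.ones(n), x1, x2, np.exp(x3)])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # estimates of (intercept, b1, b2, b3)
```

If the exponential instead has an unknown rate inside it, e.g. exp(c * x3), the model is no longer linear in its parameters and this sketch does not apply.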
I'm wondering if there is a way to run many single-covariate logistic regressions at once. I want to do this for all my variables because of missing values: I wanted to fit a multiple logistic regression, but I have too many missing values. I don't want to set up a logistic regression by hand for each variable in my database. Is there an automatic way?
Thank you very much!
You can code it using SPSS syntax.
For example:
LOGISTIC REGRESSION VARIABLES F2B16C  /* Dependent variable */
/METHOD=BSTEP  /* Backward stepwise: all variables in, then see what can be backed out */
XRACE BYSES2 BYTXMSTD F1RGPP2 F1STEXP XHiMath  /* Independent variables */
/CONTRAST (XRACE)=INDICATOR(6)  /* Creates the dummy variables with #6 as the base case */
/CONTRAST (F1RGPP2)=INDICATOR(6)
/CONTRAST (F1STEXP)=INDICATOR(6)
/CONTRAST (XHiMath)=INDICATOR(5)
/PRINT=GOODFIT CORR ITER(1)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
If you do that, you can also tell it to keep records with missing values where appropriate, by adding this subcommand:
/MISSING=INCLUDE
If anyone knows of a good explanation of the implications of including the missing values, I'd love to hear it.