logistic regression with sparse predictor variables

I am currently modeling some data using a binary logistic regression. The dependent variable has a good number of both positive and negative cases - it is not sparse. I also have a large training set (> 100,000 observations), and the number of main effects I'm interested in is about 15, so I'm not worried about a p > n issue.
What I'm concerned about is that many of my predictor variables, if continuous, are zero most of the time, and if nominal, are null most of the time. When these sparse predictor variables do take a value > 0 (or are not null), I know from familiarity with the data that they should be important in predicting my positive cases. I have been trying to find information on how the sparseness of these predictors could be affecting my model.
In particular, I would not want a sparse but important variable to be left out of my model because there is another predictor that is not sparse, is correlated with it, but actually doesn't do as good a job of predicting the positive cases. To illustrate with an example: if I were trying to model whether or not someone ended up being accepted at a particular Ivy League university, and my three predictors were SAT score, GPA, and "donation > $1M" as a binary variable, I have reason to believe that "donation > $1M", when true, is going to be very predictive of acceptance - more so than a high GPA or SAT - but it is also very sparse. How, if at all, is this going to affect my logistic model, and do I need to make adjustments for it? Also, would another type of model (say a decision tree, random forest, etc.) handle this better?
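To make the scenario concrete, here is a small simulation sketch of the kind of situation I mean (the variable names, effect sizes, and sparsity level are all made up for illustration):

# Minimal simulation of a rare-but-strong binary predictor alongside
# a correlated, dense predictor (all numbers are hypothetical).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100_000

sat = rng.normal(0, 1, n)                    # standardized SAT score (dense)
donation = rng.binomial(1, 0.002, n)         # "donation > $1M" flag, ~0.2% of rows
sat = sat + 0.5 * donation                   # donors also tend to score well (correlation)

# Assumed "true" model: the rare flag has a large effect when present.
logit = -2.0 + 0.8 * sat + 3.0 * donation
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = sm.add_constant(np.column_stack([sat, donation]))
fit = sm.Logit(y, X).fit(disp=0)
print(fit.summary(xname=["const", "sat", "donation"]))
# The precision of the donation coefficient depends mainly on the small
# number of donors, not on the full sample size.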
Thanks,
Christie

Related

Normality Assumption - how to check you have not violated it?

I am relatively new to statistics and am struggling with the normality assumption.
I understand that parametric tests are underpinned by the assumption that the data are normally distributed, but there seem to be lots of papers and articles providing conflicting information.
Some articles say that independent variables need to be normally distributed and that this may require a transformation (log, square root, etc.). Others say that in linear modelling there are no assumptions about the distribution of the independent variables.
I am trying to create a multiple regression model to predict highest pain scores on hospital admissions:
DV: numeric pain score (0 = no pain to 5 = intense pain); discrete dependent variable.
IVs: age (continuous), weight (continuous), sex (nominal), deprivation status (ordinal), race (nominal).
Can someone help clear up the following for me?
Before fitting a model, do I need to check whether my independent variables are normally distributed? If so, why? Does this only apply to continuous variables (e.g. age and weight in my model)?
If age is positively skewed, would a transformation (e.g. log, square root) be appropriate, and why? Is it best to do this before or after fitting a model? I assume I am trying to get close to a linear relationship between my DV and IVs.
As part of its output, SPSS provides plots of the standardised residuals against predicted values and also normal P-P plots of the standardised residuals. Are these checks all that is needed to assess the normality assumption after fitting a model?
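For reference, this is roughly what I am checking at the moment, sketched in Python rather than SPSS (the file and column names are placeholders for my data):

# Rough sketch of the post-fit residual checks (placeholder file/column names).
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
import matplotlib.pyplot as plt

df = pd.read_csv("admissions.csv")  # hypothetical file with the variables below

model = smf.ols("pain_score ~ age + weight + C(sex) + deprivation + C(race)",
                data=df).fit()

# Residuals vs fitted values (looking for non-linearity / heteroscedasticity)
plt.scatter(model.fittedvalues, model.resid, s=5)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Q-Q plot of the residuals (similar in spirit to SPSS's P-P plot)
sm.qqplot(model.resid, line="45", fit=True)
plt.show()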
Many Thanks in advance!

What to do if a significant factor in univariate linear regression becomes insignificant in multivariate analysis?

I am studying the impact of several factors (sleeping time, studying time, anxiety degree, depression degree, ...) on students' final exam marks.
When I did the univariate linear regression analyses, all models were significant (with final exam mark as the dependent variable), although some had a small R^2.
Then I put all the predictors into one multiple linear regression model; most of them were insignificant, with the exception of study time, which was significant and had a large R^2 in both the univariate and multivariate analyses.
How should I explain this in my paper? Is it okay to have this result, or should I search for another model?
It sounds like you have highly correlated predictors. This gives you a very unstable model, one where small changes in a few observations could produce large changes in the regression coefficients.
You should try various models that use subsets of your predictors, and select a final model that has a significant overall F statistic, and significant t stats for your included predictors.
In your paper, you could explain that anxiety score and depression score were too highly correlated to allow them both into the model and you’ve selected the best model that doesn’t contain both of these scores.
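If it helps, a quick check along these lines (Python sketch with made-up file and column names) would show you the pairwise correlations and variance inflation factors before you start comparing subset models:

# Quick collinearity check before comparing subset models
# (hypothetical column names; adapt to your data).
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("students.csv")
predictors = ["sleep_time", "study_time", "anxiety", "depression"]

# Pairwise correlations between predictors
print(df[predictors].corr().round(2))

# Variance inflation factors (values above roughly 5-10 usually signal trouble)
X = sm.add_constant(df[predictors])
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, i), 2))

# Then compare candidate subset models, e.g. with and without depression
full = smf.ols("exam_mark ~ sleep_time + study_time + anxiety + depression", df).fit()
reduced = smf.ols("exam_mark ~ sleep_time + study_time + anxiety", df).fit()
print(full.summary())
print(reduced.summary())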

How to combine LIBSVM probability estimates from two (or three) two-class SVM classifiers.

I have training data that falls into two classes, let's say Yes and No. The data represents three tasks, easy, medium and difficult. A person performs these tasks and is classified into one of the two classes as a result. Each task is classified independently and then the results are combined. I am using 3 independently trained SVM classifiers and then voting on the final result.
I am looking to provide a measure of confidence or probability associated with each classification. LIBSVM can provide a probability estimate along with the classification for each task (easy, medium and difficult, say Pe, Pm and Pd) but I am unsure of how best to combine these into an overall estimate for the final classification of the person (let's call it Pp).
My attempts so far have been along the lines of a simple average:
Pp = (Pe + Pm + Pd) / 3
An inverse-variance weighted average (since each task is repeated a few times, sample variances (VARe, VARm and VARd) can be calculated; in this case Pe would be a simple average over all the easy samples):
Pp = (Pe/VARe + Pm/VARm + Pd/VARd) / (1/VARe + 1/VARm + 1/VARd)
Or a multiplication (under the assumption that these events are independent, which I am unsure of since the underlying tasks are related):
Pp = Pe * Pm * Pd
The multiplication would provide a very low number, so it's unclear how to interpret that as an overall probability when the results of the voting are very clear.
Would any of these three options be the best or is there some other method / detail I'm overlooking?
Based on your comment, I will make the following suggestion. If you need to do this as an SVM (and because, as you say, you get better performance when you do it this way), take the outputs from your intermediate classifiers and feed them as features to your final classifier. Even better, switch to a multi-layer neural net where the inputs represent the inputs to the intermediate problems, the (first) hidden layer represents the outputs of the intermediate problems, and subsequent layer(s) represent the final decision you want. This way you get the benefit of an intermediate layer, but its output is optimised to help with the final prediction rather than for accuracy in its own right (which I assume you don't really care about).
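As a rough sketch of the first suggestion (using scikit-learn rather than LIBSVM directly, and with placeholder names), the per-task probabilities simply become a three-dimensional feature vector for the final classifier:

# Sketch of stacking: per-task SVM probabilities become features
# for a final person-level classifier (placeholder names throughout).
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# X_easy, X_medium, X_hard: per-task feature matrices for the same people
# y: final Yes/No label per person (0/1)
def stacked_predict(X_easy, X_medium, X_hard, y):
    tasks = [X_easy, X_medium, X_hard]
    # Train one probabilistic SVM per task, as before
    svms = [SVC(probability=True).fit(X, y) for X in tasks]
    # Use P(Yes) from each task as a feature for the final model
    Z = np.column_stack([m.predict_proba(X)[:, 1] for m, X in zip(svms, tasks)])
    final = LogisticRegression().fit(Z, y)
    return final.predict_proba(Z)[:, 1]   # person-level probability Pp

In practice you would want the stacked features to come from cross-validated predictions rather than from the same data the task-level SVMs were trained on; otherwise the final model is fed optimistically good inputs.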
The correct generative model for these tests likely looks something like the following:
Generate an intelligence/competence score i
For each test t: generate pass/fail according to p_t(pass | i)
This is simplified, but I think it should illustrate that you have a latent variable i on which these tests depend (and there's also structure between them, since presumably p_easy(pass|i) > p_medium(pass|i) > p_hard(pass|i); you could potentially model this as a logistic regression with a continuous 'hardness' feature). I suspect what you're asking about is a way to do inference on some thresholding function of i, but you want to do it in a classification way rather than as a probabilistic model. That's fine, but without explicitly encoding the latent variable and the structure between the tests it's going to be hard (and no average of the probabilities will account for the missing structure).
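To make that concrete, a toy version of this generative story might look like the following (the parameter values are purely illustrative):

# Toy version of the latent-ability generative model (illustrative numbers only).
import numpy as np

rng = np.random.default_rng(0)

def simulate_person():
    i = rng.normal(0, 1)                       # latent competence score
    hardness = {"easy": -1.0, "medium": 0.0, "hard": 1.0}
    results = {}
    for task, h in hardness.items():
        # Logistic model with a continuous 'hardness' feature:
        # harder tasks shift the pass probability down for the same i.
        p_pass = 1 / (1 + np.exp(-(2.0 * i - 2.0 * h)))
        results[task] = rng.binomial(1, p_pass)
    return i, results

print(simulate_person())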
I hope that helps---if I've made assumptions that aren't justified, please feel free to correct.

Comparing a Poisson regression to a logistic regression

I have data with an associated binary outcome variable. Naturally I ran a logistic regression in order to see parameter estimates and odds ratios. I was curious, though, to change this data from a binary outcome to count data. Then I ran a Poisson regression (and a negative binomial regression) on the count data.
I have no idea of how to compare these different models though, all comparisons I see seem to only be concerned with nested models.
How would you go about deciding on the best model to use in this situation?
Essentially both models will be roughly equal. What really matters is your objective: what you really want to predict. If you want to determine whether cases are good or bad (1 or 0), then go for logistic regression. If you are really interested in how much the cases are going to do (counts), then do Poisson.
In other words, the only difference between these two models is the link function (and the assumed distribution) and the fact that logistic regression tries to minimize the misclassification error (-2 log likelihood). To put it simply, even if you run a linear regression (OLS) on the binary outcome, you should not see big differences from your logistic model apart from the fact that the results may not be between 0 and 1 (e.g. the area under the ROC curve will be similar to that of the logistic model).
To sum up, don't worry about which of these two models is better; they should be roughly the same in the way they capture your features' information. Just think about what makes more sense to optimize, counts or probabilities. The answer might have been different if you were considering non-linear models (e.g. random forests or neural networks), but the two you are considering are both (almost) linear, so don't worry about it.
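As a quick way to see this for yourself, something like the following sketch (statsmodels / scikit-learn, with a hypothetical data frame and column names) compares how similarly the fitted models rank your cases:

# Sketch: logistic regression vs plain OLS on a binary outcome usually
# rank the observations almost identically (hypothetical column names).
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import roc_auc_score

df = pd.read_csv("mydata.csv")        # binary outcome 'y', predictors x1..x3
formula = "y ~ x1 + x2 + x3"

logit_fit = smf.logit(formula, df).fit(disp=0)
ols_fit = smf.ols(formula, df).fit()

print("AUC (logistic):", roc_auc_score(df["y"], logit_fit.predict(df)))
print("AUC (OLS):     ", roc_auc_score(df["y"], ols_fit.predict(df)))
# If you also have the count version of the outcome, a Poisson fit on the
# counts will typically order the same cases in much the same way:
# poisson_fit = smf.poisson("y_count ~ x1 + x2 + x3", df).fit(disp=0)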
One thing to consider is the sampling design. If you are using a case-control study, then logistic regression is the way to go because of its logit link function, rather than the log link used in Poisson regression. This is because, when cases are oversampled, as in a case-control study, the odds ratio remains unbiased.

Training set - proportion of pos / neg / neutral sentences

I am hand-tagging Twitter messages as Positive, Negative, or Neutral. I am trying to work out whether there is some logic one can use to decide what proportion of the training set's messages should be positive, negative, and neutral.
So, for example, if I am training a Naive Bayes classifier with 1,000 Twitter messages, should the proportion of pos : neg : neutral be 33% : 33% : 33%, or should it be 25% : 25% : 50%?
Logically, it seems to me that if I train with more samples for neutral, the system would be better at identifying neutral sentences than at telling whether they are positive or negative - is that true, or am I missing some theory here?
Thanks
Rahul
The problem you're referring to is known as the class imbalance problem. Many machine learning algorithms perform badly when confronted with imbalanced training data, i.e. when the instances of one class heavily outnumber those of the other classes. Read this article to get a good overview of the problem and how to approach it. For techniques like Naive Bayes or decision trees it is always a good idea to balance your data somehow, e.g. by random oversampling (explained in the referenced paper).
I disagree with mjv's suggestion to have the training set match the proportions in the real world. This may be appropriate in some cases, but I'm quite confident it's not in your setting. For a classification problem like the one you describe, the more the sizes of the class sets differ, the more trouble most ML algorithms will have discriminating the classes properly. However, you can always use the information about which class is the largest in reality by taking it as a fallback, such that when the classifier's confidence for a particular instance is low, or the instance couldn't be classified at all, you assign it to the largest class.
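A minimal version of random oversampling, plus the "largest class as a fallback" idea, could look like this (plain NumPy, placeholder names; this is a sketch, not the exact procedure from the referenced paper):

# Random oversampling of minority classes, plus a majority-class fallback
# when the classifier's confidence is low (placeholder names throughout).
import numpy as np

def random_oversample(X, y, seed=0):
    # Duplicate minority-class rows at random until all classes match the largest.
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        c_idx = np.flatnonzero(y == c)
        idx.append(np.concatenate([c_idx, rng.choice(c_idx, target - n, replace=True)]))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

def predict_with_fallback(proba, classes, majority_class, threshold=0.5):
    # Fall back to the real-world majority class when the top probability is low.
    # proba: e.g. clf.predict_proba(X_test); classes: e.g. clf.classes_
    labels = classes[proba.argmax(axis=1)]
    labels[proba.max(axis=1) < threshold] = majority_class
    return labels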
One further remark: finding the positivity/negativity/neutrality in Twitter messages seems to me to be a question of degree. As such, it may be viewed as a regression rather than a classification problem, i.e. instead of a three-class scheme you may want to calculate a score which tells you how positive or negative the message is.
There are many other factors... but an important one (in determining a suitable ratio and volume of training data) is the expected distribution of each message category (Positive, Neutral, Negative) in the real world. Effectively, a good baseline for the training set (and the control set) is
[qualitatively] as representative as possible of the whole "population"
[quantitatively] big enough that measurements made from such sets are statistically significant.
The effect of the [relative] abundance of a certain category of messages in the training set is hard to determine; it is in any case a lesser factor, or rather one that is highly sensitive to other factors. Improvements in the accuracy of the classifier, as a whole or with regard to a particular category, are typically tied more to the specific implementation of the classifier (e.g. is it Bayesian, what are the tokens, are noise tokens eliminated, is proximity a factor, are we using bi-grams, etc.) than to purely quantitative characteristics of the training set.
While the above is generally true but only moderately helpful for selecting the training set's size and composition, there are ways of determining, post facto, whether an adequate size and composition of training data has been supplied.
One way to achieve this is to introduce a control set, i.e. one that is manually labeled but is not part of the training set, and to measure, over test runs with various subsets of the training set, the recall and precision obtained for each category (or some similar accuracy measurement) when classifying the control set. When these measurements stop improving (or degrading) beyond what is statistically meaningful, the size and composition of the training [sub-]set is probably the right one (unless it is an over-fitting set :-(, but that's another issue altogether...).
This approach implies that one uses a labeled pool that could be 3 to 5 times the size of the training subset effectively needed, so that one can randomly build (within each category) many different subsets for the various tests.
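A schematic of that procedure might look like the following (the classifier and metrics are placeholder choices; a Naive Bayes text classifier is assumed, as in the question):

# Schematic of the "grow the training subset until the control-set scores
# stop improving" procedure (classifier and metric choices are placeholders).
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_recall_fscore_support

def control_set_curve(X_train, y_train, X_control, y_control,
                      fractions=(0.2, 0.4, 0.6, 0.8, 1.0), seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_train)
    for frac in fractions:
        # Random subset of the (already labelled) training pool
        idx = rng.choice(n, size=int(frac * n), replace=False)
        clf = MultinomialNB().fit(X_train[idx], y_train[idx])
        p, r, f, _ = precision_recall_fscore_support(
            y_control, clf.predict(X_control), average=None)
        print(f"{frac:.0%} of pool -> per-class precision {p.round(2)}, recall {r.round(2)}")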
