Statistics Dummy Variable as Dependent Variable Regression - statistics

I have a bunch of independent variables: height, weight, etc that I want to regress a dummy variable on to. For instance, if I wanted to regress diabetes (0 if patient doesnt have diabetes, 1 if patient does have diabetes) and I wanted to figure out the effect of an increase in 1 pound of weight on the probability of having diabetes, how would I do that? I'm sure there are multiple ways of doing it but I just never have heard of a model that does this. I thought it was the probit model but I'm not sure. Any thoughts?

The problem you are describing is known as logistic regression; a web search for that should turn up a lot of hits. Most commonly, the response is some function of a linear combination of inputs, but more generally, the response could be a nonlinear function of inputs.
The dependence of the response on an input (e.g. weight) is interesting, but not exactly well-posed, since the change of the probability of the response varies over the range of the input variable; the change is very small for very large or very small values of the input, and reaches some maximum in between.

Related

why do object detection methods have an output value for every class

Most recent object detection methods rely on a convolutional neural network. They create a feature map by running input data through a feature extraction step. They then add more convolutional layers to output a set of values like so (this set is from YOLO, but other architectures like SSD differ slightly):
pobj: probability of being an object
c1, c2 ... cn: indicating which class the object belongs to
x, y, w, h: bounding box of the object
However, one particular box cannot be multiple objects. As in, wouldn't having a high value for, say, c1 mean that the values for all the others c2 ... cn would be low? So why use different values for c1, c2 ... cn? Couldn't they all be represented by a single value, say 0-1, where each object has a certain range within the 0-1, say 0-0.2 is c1, 0.2-0.4 is c2 and so on...
This would reduce the dimension of the output from NxNx(5+C) (5 for the probability and bounding box, +C one for each class) to NxNx(5+1) (5 same as before and 1 for the class)
Thank you
Short answer, NO! That is almost certainly not an acceptable solution. It sounds like your core question is: Why is a a single value in the range [0,1] not a sufficient, compact output for object classification? As a clarification, I'd say this doesn't really have to do with single-shot detectors; the outputs from 2-stage detectors and most all classification networks follows this same 1D embedding structure. As a secondary clarification, I'd say that many 1-stage networks also don't output pobj in their original implementations (YOLO is the main one that does but Retinanet and I believe SSD does not).
An object's class is a categorical attribute. Assumed within a standard classification problem is that the set of possible classes is flat (i.e. no class is a subclass of any other), mutually exclusive (each example falls into only a single class), and unrelated (not quite the right term here but essentially no class is any more or less related to any other class).
This assumed attribute structure is well represented by an orthonormal encoding vector of the same length as the set of possible attributes. A vector [1,0,0,0] is no more similar to [0,1,0,0] than it is to [0,0,0,1] in this space.
(As an aside, a separate branch of ML problems called multilabel classification removes the mutual exclusivity constrain (so [0,1,1,0] and [0,1,1,1] would both be valid label predictions. In this space class or label combinations COULD be construed as more or less related since they share constituent labels or "basis vectors" in the orthonormal categorical attribute space. But enough digression..)
A single, continuous variable output for class destroys the assumption that all classes are unrelated. In fact, it assumes that the relation between any two classes is exact and quantifiable! What an assumption! Consider attempting to arrange the classes of, let's say, the ImageNet classification task, along a single dimension. Bus and car should be close, no? Let's say 0.1 and 0.2, respectively in our 1D embedding range of [0,1]. Zebra must be far away from them, maybe 0.8. But should be close to zebra fish (0.82)? Is a striped shirt closer to a zebra or a bus? Is the moon more similar to a bicycle or a trumpet? And is a zebra really 5 times more similar to a zebra fish than a bus is to a car? The exercise is immediately, patently absurd. A 1D embedding space for object class is not sufficiently rich to capture the differences between object classes.
Why can't we just place object classes randomly in the continuous range [0,1]? In a theoretical sense nothing is stopping you, but the gradient of the network would become horrendously, unmanageably non-convex and conventional approaches to training the network would fail. Not to mention the network architecture would have to encode extremely non-linear activation functions to predict the extremely hard boundaries between neighboring classes in the 1D space, resulting in a very brittle and non-generalizable model.
From here, the nuanced reader might suggest that in fact, some classes ARE related to one another (i.e. the unrelated assumption of the standard classification problem is not really correct). Bus and car are certainly more related than bus and trumpet, no? Without devolving into a critique on the limited usefulness of strict ontological categorization of the world, I'll simply suggest that in many cases there is an information embedding that strikes a middle ground. A vast field of work has been devoted to finding embedding spaces that are compact (relative to the exhaustive enumeration of "everything is its own class of 1") but still meaningful. This is the work of principal component analysis and object appearance embedding in deep learning.
Depending on the particular problem, you may be able to take advantage of a more nuanced embedding space better suited towards the final task you hope to accomplish. But in general, canonical deep learning tasks such as classification / detection ignore this nuance in the hopes of designing solutions that are "pretty good" generalized over a large range of problem spaces.
For object classification head, usually cross-entropy loss function is used which operates on the probability distribution to compute the difference between ground-truth(a one hot encoded vector) and prediction class scores.
On the otherhand, you are proposing a different way of encoding the ground-truth class labels which can be further used with certain custom loss function say L1/l2 loss function, which looks theoretically correct but it might not be as good as cross-entropy function in terms of model convergence/optimization.

Scikit learn models gives weight to random variable? Should I remove features with less importance?

I do some feature selection by removing correlated variables and backwards elimination. However, after all that is done as a test I threw in a random variable, and then trained logistic regression, random forest and XGBoost. All 3 models have the feature importance of the random feature as greater than 0. First, how can that be? Second, all models have it ranked toward the bottom, but it's not the lowest feature. Is this a valid step for another round of feature selection -i.e. remove all those who score below the random feature?
The random feature is created with
model_data['rand_feat'] = random.randint(100, size=(model_data.shape[0]))
This can happen, What random is the number you sample, but this random sampling can still generate a pattern by chance. I dont know whether you are doing classification or regression but lets consider the simple example of binary classification. We have class 1 and 0 and 1000 data point from each. When you sample a random number for each data point, it can happen that for example a majority of class 1 gets some value higher than 50, whereas majority of class 0 gets a random number smaller than 50.
So in the end effect, this might result into some pattern. So I would guess everytime you run your code the random feature importance changes. It is always ranked low because it is very unlikely that a good pattern is generated(e.g all 1s get higher than 50 and all 0s get lower than 50).
Finally, yes you should consider to drop the features with low value
I agree with berkay's answer that a random variable can have patterns that are by chance associated to your outcome variable. Secondly, I will neither include random variable in model building nor as my filtering threshold because if random variable has by chance significant or nearly significant association to the outcome it will suppress the expression of important features of original data and you probably end up losing those important features.
In early phase of model development I always include two random variables.
For me it is like a 'sanity check' since these are in effect junk variables or junk features.
If any of my features are worse in importance than the junk features then that is a warning sign that I need to look more carefully at the worth* of those features or to do some better feature engineering.
For example what does theory suggest about the inclusion of those features?

Under what conditions can two classes have different average values, yet be indistinguishable to a SVM?

I am asking because I have observed sometimes in neuroimaging that a brain region might have different average activation between two experimental conditions, but sometimes an SVM classifier somehow can't distinguish the patterns of activation between the two conditions.
My intuition is that this might happen in cases where the within-class variance is far greater than the between-class variance. For example, suppose we have two classes, A and B, and that for simplicity our data consists just of integers (rather than vectors). Let the data falling under class A be 0,0,0,0,0,10,10,10,10,10. Let the data falling under class B be 1,1,1,1,1,11,11,11,11,11. Here, A and B are clearly different on average, yet there's no decision boundary that would allow A and B to be distinguished. I believe this logic would hold even if our data consisted of vectors, rather than integers.
Is this a special case of some broader range of cases where an SVM would fail to distinguish two classes that are different on average? Is it possible to delineate the precise conditions under which an SVM classifier would fail to distinguish two classes that differ on average?
EDIT: Assume a linear SVM.
As described in the comments - there are no such conditions because SVM will separate data just fine (I am not talking about any generalisation here, just separating training data). For the rest of the answer I am assuming there are no two identical points with different labels.
Non-linear case
For a kernel case, using something like RBF kernel, SVM will always perfectly separate any training set, given that C is big enough.
Linear case
If data is linearly separable then again - with big enough C it will separate data just fine. If data is not linearly separable, cranking up C as much as possible will lead to smaller and smaller training error (of course it will not get 0 since data is not linearly separable).
In particular for the data you provided kernelized SVM will get 100%, and any linear model will get 50%, but it has nothing to do with means being different or variances relations - it is simply a dataset where any linear separator has at most 50% accuracy, literally every decision point, thus it has nothing to do with SVM. In particular it will separate them "in the middle", meaning that the decision point will be somewhere around "5".

standard error of addition, subtraction, multiplication and ratio

Let's say, I have two random variables,x and y, both of them have n observations. I've used a forecasting method to estimate xn+1 and yn+1, and I also got the standard error for both xn+1 and yn+1. So my question is that what the formula would be if I want to know the standard error of xn+1 + yn+1, xn+1 - yn+1, (xn+1)*(yn+1) and (xn+1)/(yn+1), so that I can calculate the prediction interval for the 4 combinations. Any thought would be much appreciated. Thanks.
Well, the general topic you need to look at is called "change of variables" in mathematical statistics.
The density function for a sum of random variables is the convolution of the individual densities (but only if the variables are independent). Likewise for the difference. In special cases, that convolution is easy to find. For example, for Gaussian variables the density of the sum is also a Gaussian.
For product and quotient, there aren't any simple results, except in special cases. For those, you might as well compute the result directly, maybe by sampling or other numerical methods.
If your variables x and y are not independent, that complicates the situation. But even then, I think sampling is straightforward.

Occurrence prediction

I'd like to know what method is best suited for predicting event occurrences.
For example, given a set of data from 5 years of malaria infection occurrences and several other factors that affect the occurrences, I'd like to predict the next five years for malaria infection occurrences.
What I thought of doing was to derive a kind of occurrence factor using fuzzy logic rules, and then average the occurrences with the occurrence factor to get the first predicted occurrence, and then average all again with the predicted occurrence and keep on iterating for all five years, but I decided to seek for help online.
There are many ways to do forecasting, each has its own advantages and disadvantages. The science of determining the accuracy of a forecast often consists of trying to minimize error. All forecasting comes down to using the past as a predictor of the future, adjusting it by some amount. E.g. tomorrow the temperature will be the same as today, plus or minus some amount. How you decide the +/- is what varies.
Here are a range of techniques you might want to review:
Moving Averages (simple, single, double)
Exponential Smoothing
Decomposition(Trend + Seasonality + Cyclicals + Irregualrities)
Linear Regression
Multiple Regression
Box-Jenkis (a.k.a. ARIMA,
Auto-Regressive Integrated Moving
Average)
Sorry, for the vague answer but forecasting is complex stuff.
What you describe about feeding your predictions back into the model to produce future predictions is standard stuff. I don't know if "fuzzy logic" gets you anything in particular. As any forecasting instructor will tell you, sometimes you just squint and look at the data. Context is everything.
I would use a logit or probit model to predict occurrence given a set of exogenous circumstances. Not sure why you want to iterate. That would basically be equivalent to including a lag in the regression formula. You could do it, and as long as the coefficient was <1, you wouldn't have the explosion problem.
If you want to introduce an element of endogeneity to the independent variables, you could use a VAR.
I think with your idea as stated, you'll have asymptotic behavior as time goes by. Either your data will converge to 0, or it will explode. That said, you'd probably have to give some data and/or describe its properties before anyone can help you. This is basically a simulation, and the factors are everything when it comes to extrapolation.

Resources