Logistic regression with missing values

Can I run a logistic regression with missing values?
I have many continuous attributes and some categorical ones. Could I set them as user-missing? Would that be useful?

For a regression analysis you need every variable measured for each observation. Another technique might work with missing attributes, but not regression.
BTW, you should try posting the question at https://stats.stackexchange.com/
HTH!

Most regression procedures require complete data, but there are a variety of methods for dealing with missing values. This is a subtle topic, so I won't pretend to give a complete answer here, and recommend doing some reading on the subject. Briefly, though:
1. Never delete observations to fix this problem.
2. Deleting variables is always allowed, but is obviously quite severe in terms of one's data budget.
3. Filling in missing values with global constants, such as the mean or median of the non-missing values, should be done sparingly (when the proportion of missings is very low), if at all.
4. Filling in missing values with values chosen based on the other independent variables is preferred over number 3, above.
To learn more about this subject, look up the terms "imputation" (especially "single imputation" and "multiple imputation"), "missing at random" and "missing completely at random".
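To make those options concrete, here is a minimal sketch with scikit-learn; the data is made up for illustration. SimpleImputer fills in a global constant (option 3), while IterativeImputer predicts each missing value from the other variables (option 4).

    import numpy as np
    # This experimental import must come before IterativeImputer.
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import SimpleImputer, IterativeImputer

    # Toy matrix with missing entries (invented for illustration).
    X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

    # Option 3: single imputation with a global constant (the column mean).
    # Use sparingly, only when the proportion of missings is very low.
    X_mean = SimpleImputer(strategy="mean").fit_transform(X)

    # Option 4: model-based imputation, predicting each feature with
    # missings from the other features; generally preferable.
    X_iter = IterativeImputer(random_state=0).fit_transform(X)

    print(X_mean)
    print(X_iter)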


How do I analyze the change in the relationship between two variables?

I'm working on a simple project in which I'm trying to describe the relationship between two positively correlated variables and determine whether that relationship is changing over time, and if so, to what degree. I feel like this is something people probably do pretty often, but maybe I'm just not using the correct terminology, because Google isn't helping me very much.
I've plotted the variables on a scatter plot and know how to compute the correlation coefficient and fit a linear regression. I thought this might be a good first step because the linear regression tells me what I can expect y to be for a given x value. This means I can quantify how "far away" each data point is from the regression line (I believe this is called the residual, and its square the squared error?). Now I'd like to see what the error looks like for each data point over time. For example, if I have 100 data points and the most recent 20 are much farther away from where the regression line says they should be, maybe I could say that the relationship between the variables is showing signs of changing? Does that make any sense at all, or am I way off base?
I have a suspicion that there is a much simpler way to do this and/or that I'm going about it in the wrong way. I'd appreciate any guidance you can offer!
I can suggest two strands of literature that study changing relationships over time. Typing these names into Google should provide you with a large number of references, so I'll stick to concise descriptions.
(1) Structural break modelling. As the name suggests, this assumes that there has been a sudden change in parameters (e.g. a correlation coefficient). This is applicable if there has been a policy change, a change in measurement device, etc. The estimation approach is indeed very close to the procedure you suggest: you estimate the squared error (or some other measure of fit) on the full sample and on the two sub-samples (before and after the break). If the gains in fit are large when dividing the sample, then you favour the model with the break and use different coefficients before and after the structural change.
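As a rough illustration of that split-sample comparison (not your actual data; the break point and the series are simulated here), one can fit a regression on the full sample and on both sub-samples and compare the sums of squared errors:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def sse(model, X, y):
        """Sum of squared errors of a fitted model."""
        resid = y - model.predict(X)
        return float(resid @ resid)

    rng = np.random.default_rng(0)
    x = rng.normal(size=100).reshape(-1, 1)
    # Simulated series whose slope changes after observation 80.
    y = np.where(np.arange(100) < 80, 2.0 * x[:, 0], 0.5 * x[:, 0])
    y = y + rng.normal(scale=0.3, size=100)

    break_at = 80  # assumed known for this sketch
    full = LinearRegression().fit(x, y)
    pre = LinearRegression().fit(x[:break_at], y[:break_at])
    post = LinearRegression().fit(x[break_at:], y[break_at:])

    sse_full = sse(full, x, y)
    sse_split = (sse(pre, x[:break_at], y[:break_at])
                 + sse(post, x[break_at:], y[break_at:]))
    # A large drop in SSE when splitting favours the model with a break.
    print(sse_full, sse_split)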
(2) Time-varying coefficient models. This approach is more subtle as coefficients will now evolve more slowly over time. These changes can originate from the time evolution of some observed variables or they can be modeled through some unobserved latent process. In the latter case the estimation typically involves the use of state-space models (and thus the Kalman filter or some more advanced filtering techniques).
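And for a flavour of the time-varying idea without a full state-space model, a crude rolling-window re-estimation of the slope (window length chosen arbitrarily, on synthetic data) already reveals a drifting coefficient:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    n, window = 200, 40
    x = rng.normal(size=(n, 1))
    beta = np.linspace(2.0, 0.5, n)  # slowly drifting true slope
    y = beta * x[:, 0] + rng.normal(scale=0.3, size=n)

    # Re-fit the regression in a moving window and track the slope.
    slopes = [
        LinearRegression().fit(x[i:i + window], y[i:i + window]).coef_[0]
        for i in range(n - window + 1)
    ]
    print(slopes[0], slopes[-1])  # near 2 early, near 0.5 late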
I hope this helps!

What are the things to note in order to maximize the auc_roc_score?

What are the tips & tricks to improve our auc_roc_score?
Example:
Is balanced data required?
Is recall more important than precision?
Is oversampling usually better than undersampling?
Thanks again!
This is highly dependent on the type of data in use, the type of model it is trained on, and the hyperparameters currently being used. The ROC AUC is just the outcome of how well the data preprocessing and model building are done, so you must look into ways of improving your model. Also, precision and recall are useful in their own ways; it depends on the scenario.
Precision answers the question:
What proportion of the predicted positive labels were actually correct?
whereas recall answers:
What proportion of the actually positive labels were identified correctly?
Thus you would want a higher recall when, say, identifying a COVID-19 patient, while you would prefer a higher precision when the cost of acting is high and the cost of not acting is low.
Also, what exactly you mean by balanced data varies from situation to situation.
So if you could specify the kind of model, problem and data in use, we may be able to help you more.
Please share the code of your model as well.
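For reference, a minimal sketch of the metrics discussed above with scikit-learn, on toy labels; note that ROC AUC is computed from predicted scores or probabilities, not from hard class labels:

    from sklearn.metrics import precision_score, recall_score, roc_auc_score

    y_true = [0, 0, 1, 1, 1, 0, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0, 1, 0]  # hard class predictions
    y_score = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1, 0.7, 0.3]  # predicted probabilities

    # Precision: what proportion of predicted positives were actually positive?
    print("precision:", precision_score(y_true, y_pred))
    # Recall: what proportion of actual positives were identified?
    print("recall:", recall_score(y_true, y_pred))
    # ROC AUC is based on the ranking of the scores, not on 0/1 predictions.
    print("roc_auc:", roc_auc_score(y_true, y_score))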

Handling optional data in Logistic regression

I am working with data which contains marks and other features of students, and I am trying to predict whether they will get a high salary or not using scikit-learn in Python. I ran into a problem:
since a student does not take every subject, his/her score in a subject is -1 if he/she has not taken it (a student can take multiple subjects).
Below is a snapshot taken from the data file: [snapshot image]
I am trying to find a way to interpret the -1 values in a way that doesn't alter the data much.
My approaches:
1. Take the percentile marks for each student, then average all percentiles per student, giving a single number for each student which is a lot easier to work with; but this method may lose some information about the distribution of marks.
2. Fill the -1 values with the average of the marks of all students in that subject; but this will not work if the data is biased towards one subject.
Is there any better way to deal with this kind of data?
Your "-1"'s amount to missing data, so you are asking how to approach a classification task with missing data. See here and here and here, among many others, for discussions on this topic.
A couple important considerations that come to mind:
One option is to "impute" the missing values, which is what you're describing with using "average marks." This approach often requires the assumption that the data is "missing at random" which in your case is unlikely to be true: for example, a bad student is more likely to not take a difficult subject, so missing values tell you something.
Using regression models (like logistic regression) are in general going to require some type of imputation. But there are other models, like decision trees or Random forests, that can handle missing data without imputation.
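As a small sketch of that last point (column names invented for illustration), one can recode the -1s as NaN and fit scikit-learn's HistGradientBoostingClassifier, which handles missing values natively, so no imputation is needed:

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import HistGradientBoostingClassifier

    # Toy data in the shape described in the question; -1 = subject not taken.
    df = pd.DataFrame({
        "maths":       [55, -1, 78, 90, -1, 60],
        "physics":     [-1, 70, 65, -1, 80, 50],
        "high_salary": [0, 1, 1, 1, 0, 0],
    })

    # Recode -1 as NaN so the model treats it as missing, not as a mark.
    X = df[["maths", "physics"]].replace(-1, np.nan)
    y = df["high_salary"]

    # Gradient-boosted trees route NaNs down a learned branch at each split.
    clf = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
    print(clf.predict(X))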

Questions about standardizing and scaling

I am trying to generate a model that uses several physico-chemical properties of a molecule (incl. number of atoms, number of rings, volume, etc.) to predict a numeric value Y. I would like to use PLS regression, and I understand that standardization is very important here. I am programming in Python, using scikit-learn. The type and range of the features vary. Some are int64 while others are float. Some features generally have small (positive or negative) values, while others have very large values. I have tried various scalers (e.g. standard scaler, normalizer, min-max scaler, etc.), yet the R2/Q2 are still low. I have a few questions:
1. Is it possible that by scaling, some of the very important features lose their significance, and thus contribute less to explaining the variance of the response variable?
2. If yes, and I identify some important features (by expert knowledge), is it OK to scale all features but those? Or to scale the important features only?
3. Some of the features, although not always correlated, have values in a similar range (e.g. 100-400) compared to others (e.g. -1 to 10). Is it possible to scale only a specific group of features that are within the same range?
The whole idea of scaling is to make models more robust to analysis on the feature space. For example, if you have two features recorded as 5 kg and 5000 g, we know both are the same quantity, but algorithms that are sensitive to the metric space, such as KNN or PCA, will be weighted towards the second feature, so scaling must be done for those algorithms.
Now coming to your questions:
1. Scaling doesn't affect the significance of features. As explained above, it helps in better analysis of the data.
2. No, you should not, for the reason explained above.
3. If you want to include domain knowledge in your model, you can use it as prior information; for a linear model, this amounts to regularization. If you think you have many useless features, you can use L1 regularization, which creates a sparsity effect on the feature space, i.e. it assigns zero weight to useless features (see the sketch after this answer). Here is the link for more info.
One more point: some methods, such as tree-based models, don't need scaling. In the end, it mostly depends on the model you choose.
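A minimal sketch of that L1/sparsity point, on synthetic data where only two of five features matter; the Lasso zeroes out the coefficients of the uninformative ones:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    # Only the first two features carry signal.
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

    lasso = Lasso(alpha=0.1).fit(X, y)
    print(lasso.coef_)  # last three coefficients shrink to ~0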
1. Lose significance? Yes. Contribute less? No.
2. No, it's not OK. It's either all or nothing.
3. No. The idea of scaling is not to decrease or increase the significance or effect of a variable; it's to transform all variables to a common scale that can be interpreted.
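To illustrate the "all or nothing" advice, here is a minimal sketch (on made-up data standing in for the physico-chemical features) that scales every feature inside a Pipeline before PLS regression, so the scaler is re-fit within each cross-validation fold:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    # Mix of large-range and small-range features, as in the question.
    X = np.hstack([
        rng.uniform(100, 400, size=(50, 3)),
        rng.uniform(-1, 10, size=(50, 4)),
    ])
    y = X @ rng.normal(size=7) + rng.normal(scale=5.0, size=50)

    # Scale all features, then fit PLS; never scale just a subset.
    model = make_pipeline(StandardScaler(), PLSRegression(n_components=2))
    print(cross_val_score(model, X, y, scoring="r2", cv=5).mean())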

Dummy Coding of Nominal Attributes - Effect of Using K Dummies, Effect of Attribute Selection

Summing up my understanding of the topic: 'dummy coding' is usually understood as coding a nominal attribute with K possible values as K-1 binary dummies. Using K dummies would cause redundancy and would have a negative impact, e.g. on logistic regression, as far as I have learned. So far, everything's clear to me.
Yet, two issues are unclear to me:
1) Bearing in mind the issue stated above, I am confused that the 'Logistic' classifier in WEKA actually uses K dummies (see picture). Why would that be the case?
2) An issue arises as soon as I consider attribute selection. Whereas the left-out attribute value is implicitly included as the case where all dummies are zero when all K dummies are used in the model, it is no longer clearly included if one dummy is missing (because it was not selected during attribute selection). The issue is much easier to understand with the sketch I uploaded. How can that issue be treated?
Images
WEKA Output: The Logistic algorithm was run on the UCI dataset German Credit, where the possible values of the first attribute are A11,A12,A13,A14. All of them are included in the logistic regression model. http://abload.de/img/bildschirmfoto2013-089out9.png
Decision Tree Example: Sketch showing the issue that arises when running decision trees on datasets with dummy-coded instances after attribute selection. http://abload.de/img/sketchziu5s.jpg
The output is generally easier to read, interpret and use when you use k dummies instead of k-1 dummies. I figure that is why everybody seems to actually use k dummies.
But yes, as the k values sum to 1, there is a linear dependence that may cause problems. Correlations in data sets are common, though; you will never completely get rid of them!
I believe feature selection and dummy coding just don't fit together: dropping a dummy equals dropping some values from the attribute. Why do you insist on doing feature selection?
You should really be using weighting, or consider more advanced algorithms that can handle such data. In fact the dummy variables can cause just as much trouble, because they are binary, and so many algorithms (e.g. k-means) don't make much sense on binary variables.
As for the decision tree: don't perform feature selection on your output attribute...
Plus, as a decision tree already selects features, it does not make sense to do all this anyway. Leave it to the decision tree to decide which attribute to use for splitting; this way, it can learn dependencies, too.
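For completeness, a minimal sketch of k versus k-1 dummy coding with pandas; the category values mirror the first German Credit attribute mentioned above:

    import pandas as pd

    s = pd.Series(["A11", "A12", "A13", "A14", "A11"], name="status")

    # k dummies (what WEKA's Logistic output shows): one column per value.
    print(pd.get_dummies(s))

    # k-1 dummies: drop one level to remove the exact linear dependence;
    # the dropped level becomes the all-zeros baseline.
    print(pd.get_dummies(s, drop_first=True))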
