what is the concordance index (c-index)? - statistics

I am relatively new in statistics and I need some help with some basic concepts,
could somebody explain the following questions relative to the c-index?
What is the c-index?
Why is it used over other methods?

The c-index is "A measure of goodness of fit for binary outcomes in a logistic regression model."
The reason we use the c-index is because we can predict more accurately whether a patient has a condition or not.
The C-statistic is actually NOT used very often as it only gives you a general idea about a model; A ROC curve contains much more information about accuracy, sensitivity and specificity.
ROC curve

Related

How to know which features contribute significantly in prediction models?

I am novice in DS/ML stuff. I am trying to solve Titanic case study in Kaggle, however my approach is not systematic till now. I have used correlation to find relationship between variables and have used KNN and Random Forest Classification, however my models performance has not improved. I have selected features based on the result of correlation between variables.
Please guide me if there are certain sk-learn methods which can be used to identify features which can contribute significantly in forecasting.
Through Various Boosting Techniques You can Improve accuracy approx 99% I suggest you to use Gradient Boosting.

Does SciKit Have A InHouse Function That Tallies The Accuracy For Each Y Solution?

I have LinearSVC algorithm that predicts some data for stock. It has a 90% acc rating, but I think this might be due to the fact that some y's are far more likely than others. I want to see if there is a way to see if for each y I've defined, how accurately that y was predicted.
I haven't seen anything like this in the docs, but it just makes sense to have it.
If what your really want is a measure of confidence rather than actual probabilities, you can use the method LinearSVC.decision_function(). See the documentation or the probability calibration CalibratedClassifierCV using this documentation.
You can use a confusion matrix representation implemented in SciKit to generate an accuracy matrix between the predicted and real values of your classification problem for each individual attribute. The diagonal represents the raw accuracy, which can easily be converted to a percentage accuracy.

Adaboost vs. Gaussian Naive Bayes

I'm new to Adaboost, but have been reading about it, and it seemed like the perfect solution for a problem I've been working on.
I have a data set where the classes are 'UP' and 'DOWN'. The Gaussian Naive Bayes classifier classifies both classes with ~55% accuracy (weakly accurate). I thought that using Adaboost with Gaussian Naive Bayes as my base estimator would allow me to get a greater accuracy, however when I do this, my accuracy drops to around 45-50%.
Why is this? I find it very unusual that Adaboost would underperform its base estimator. Additionally, any tips for getting Adaboost to work better would be appreciated. I have tried it with many different estimators with similar poor results.
The reason could be the Diversity dilemma of the Ensemble methods, which particularly concerns the Adaboost algorithm.
Diversity is the error between the component classifiers of the Adaboost algorithm, which we prefer to keep uncorrelated. Otherwise, component classifiers will perform worse than single component classifiers. On the other hand, if we use weak base classifiers but achieve reasonable accuracy, the final ensemble will achieve higher accuracy.
This is well explained in this paper.
From which we can retrieve this explanation:
Accuracy and Diversity dilemma of Adaboost
This diagram is a scatter-plot where each point corresponds to a component classifier. The x coordinate value of a point is the diversity value of the corresponding component classifier while the y coordinate value is the accuracy value of the corresponding component classifier. From this figure, it can be observed that, if the component classifiers are too accurate, it is difficult to find very diverse ones, and combining these accurate but non-diverse classifiers often leads to very limited improvement (Windeatt, 2005). On the other hand, if the component classifiers are too inaccurate, although we can find diverse ones, the
combination result may be worse than that of combining both more accurate and diverse component classifiers. This is because if the combination result is dominated by too many inaccurate component classifiers, it will be wrong most of the time, leading to poor classification result
To directly answer your question, it may be that using the Guassian Naive Bayes as base estimators is creating classifiers that do not disagree (enough) with each other (diversify the error), hence Adaboost generalizes even worse than the single Gaussian Naive Bayes.

Binary semi-supervised classification with positive only and unlabeled data set

My data consist of comments (saved in files) and few of them are labelled as positive. I would like to use semi-supervised and PU classification to classify these comments into positive and negative classes. I would like to know if there is any public implementation for semi-supervised and PU implementations in python (scikit-learn)?
You could try to train a one-class SVM and see what kind of results that gives you. I haven't heard about the PU paper. I think for all practical purposes you will be much better of labelling some points and then using semi-supervised methods.
If finding negative points is hard, I would try to use heuristics to find putative negative points (which I think is similar to the techniques in the PU paper). You could either classify unlabelled vs positive and then only look at the ones that score strongly for unlabelled, or learn a one-class SVM or similar and then look for negative points in the outliers.
If you are interested in actually solving the task, I would much rather invest time in manual labelling than implementing fancy methods.

information criteria for confusion matrices

One can measure goodness of fit of a statistical model using Akaike Information Criterion (AIC), which accounts for goodness of fit and for the number of parameters that were used for model creation. AIC involves calculation of maximized value of likelihood function for that model (L).
How can one compute L, given prediction results of a classification model, represented as a confusion matrix?
It is not possible to calculate the AIC from a confusion matrix since it doesn't contain any information about the likelihood. Depending on the model you are using it may be possible to calculate the likelihood or quasi-likelihood and hence the AIC or QIC.
What is the classification problem that you are working on, and what is your model?
In a classification context often other measures are used to do GoF testing. I'd recommend reading through The Elements of Statistical Learning by Hastie, Tibshirani and Friedman to get a good overview of this kind of methodology.
Hope this helps.
Information-Based Evaluation Criterion for Classifier's Performance by Kononenko and Bratko is exactly what I was looking for:
Classification accuracy is usually used as a measure of classification performance. This measure is, however, known to have several defects. A fair evaluation criterion should exclude the influence of the class probabilities which may enable a completely uninformed classifier to trivially achieve high classification accuracy. In this paper a method for evaluating the information score of a classifier''s answers is proposed. It excludes the influence of prior probabilities, deals with various types of imperfect or probabilistic answers and can be used also for comparing the performance in different domains.

Resources