auc_score in scikit-learn 0.14 - scikit-learn

I'm training a RandomForestClassifier on a binary classification problem in scikit-learn. I want to maximize my auc score for the model. I understand this is not possible in the 0.13 stable version but is possible in the 0.14 bleeding edge version.
I tried this but I seemed to get a worse result:
ic = RandomForestClassifier(n_estimators=100, compute_importances=True, criterion='entropy', score_func = auc_score);
Does this work as a parameter for the model or only in gridsearchCV?
If I use it in gridsearchCV will it make the model fit the data better for auc_score? I also want to try it to maximize recall_score.

I am surprised the above does not raise an error. You can use the AUC only for model selection as in GridSearchCV.
If you use it there (scoring='roc_auc' iirc), this means that the model with the best auc will be selected. It does not make the individual models better with respect to this score.
It is still worth trying, though.

I have found a journal article that addresses highly imbalanced classes with random forests. Although it is aimed at running RDF on Hadoop clusters, the same techniques seem to work well on smaller problems as well:
del Río, S., López, V., Benítez, J. M., & Herrera, F. (2014). On the use of MapReduce for imbalanced big data using Random Forest. Information Sciences, 285, 112-137.
http://sci2s.ugr.es/rf_big_imb/pdf/rio14_INS.pdf

Related

How to put more weight on one class during training in Pytorch [duplicate]

I have a multilabel classification problem, which I am trying to solve with CNNs in Pytorch. I have 80,000 training examples and 7900 classes; every example can belong to multiple classes at the same time, mean number of classes per example is 130.
The problem is that my dataset is very imbalance. For some classes, I have only ~900 examples, which is around 1%. For “overrepresented” classes I have ~12000 examples (15%). When I train the model I use BCEWithLogitsLoss from pytorch with a positive weights parameter. I calculate the weights the same way as described in the documentation: the number of negative examples divided by the number of positives.
As a result, my model overestimates almost every class… Mor minor and major classes I get almost twice as many predictions as true labels. And my AUPRC is just 0.18. Even though it’s much better than no weighting at all, since in this case the model predicts everything as zero.
So my question is, how do I improve the performance? Is there anything else I can do? I tried different batch sampling techniques (to oversample minority class), but they don’t seem to work.
I would suggest either one of these strategies
Focal Loss
A very interesting approach for dealing with un-balanced training data through tweaking of the loss function was introduced in
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollar Focal Loss for Dense Object Detection (ICCV 2017).
They propose to modify the binary cross entropy loss in a way that decrease the loss and gradient of easily classified examples while "focusing the effort" on examples where the model makes gross errors.
Hard Negative Mining
Another popular approach is to do "hard negative mining"; that is, propagate gradients only for part of the training examples - the "hard" ones.
see, e.g.:
Abhinav Shrivastava, Abhinav Gupta and Ross Girshick Training Region-based Object Detectors with Online Hard Example Mining (CVPR 2016)
#Shai has provided two strategies developed in the deep learning era. I would like to provide you some additional traditional machine learning options: over-sampling and under-sampling.
The main idea of them is to produce a more balanced dataset by sampling before starting your training. Note that you probably will face some problems such as losing the data diversity (under-sampling) and overfitting the training data (over-sampling), but it might be a good start point.
See the wiki link for more information.

Getting feature_importances_ from RandomForestRegressor() or RandomForestClassifier() for regression and classification problems

There is a difference between the feature_importance_ matrices from RandomForestClassifier() and RandomForestRegressor().
So if I want to get best features for Classification/Regression problems, which one should I choose based on the class/reg problem?
feature_importance_ gives the relative importance of the random-forest. Regarding the choice between regressor or classifier, it will depend on your target and the problem that you are solving. If your target is discrete, you probably have a classification problem. If it is continuous, you have a regression problem.
However, be aware that feature_importance_ have limitation and should be used carefully. You are better off using the permutation_importance. You can look at the example which explains the issue with low cardinality feature for which the importance is reported low with feature_importance_: https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html#sphx-glr-auto-examples-inspection-plot-permutation-importance-py

Multilabel classification with class imbalance in Pytorch

I have a multilabel classification problem, which I am trying to solve with CNNs in Pytorch. I have 80,000 training examples and 7900 classes; every example can belong to multiple classes at the same time, mean number of classes per example is 130.
The problem is that my dataset is very imbalance. For some classes, I have only ~900 examples, which is around 1%. For “overrepresented” classes I have ~12000 examples (15%). When I train the model I use BCEWithLogitsLoss from pytorch with a positive weights parameter. I calculate the weights the same way as described in the documentation: the number of negative examples divided by the number of positives.
As a result, my model overestimates almost every class… Mor minor and major classes I get almost twice as many predictions as true labels. And my AUPRC is just 0.18. Even though it’s much better than no weighting at all, since in this case the model predicts everything as zero.
So my question is, how do I improve the performance? Is there anything else I can do? I tried different batch sampling techniques (to oversample minority class), but they don’t seem to work.
I would suggest either one of these strategies
Focal Loss
A very interesting approach for dealing with un-balanced training data through tweaking of the loss function was introduced in
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollar Focal Loss for Dense Object Detection (ICCV 2017).
They propose to modify the binary cross entropy loss in a way that decrease the loss and gradient of easily classified examples while "focusing the effort" on examples where the model makes gross errors.
Hard Negative Mining
Another popular approach is to do "hard negative mining"; that is, propagate gradients only for part of the training examples - the "hard" ones.
see, e.g.:
Abhinav Shrivastava, Abhinav Gupta and Ross Girshick Training Region-based Object Detectors with Online Hard Example Mining (CVPR 2016)
#Shai has provided two strategies developed in the deep learning era. I would like to provide you some additional traditional machine learning options: over-sampling and under-sampling.
The main idea of them is to produce a more balanced dataset by sampling before starting your training. Note that you probably will face some problems such as losing the data diversity (under-sampling) and overfitting the training data (over-sampling), but it might be a good start point.
See the wiki link for more information.

Logistic regression overfits even using cross validation in sklearn?

I am implementing a logistic regression model using sklearn, for a text classification competition on Kaggle.
When I use unigram, there are 23,617 features. The best mean_test_score Cross validation search (sklearn's GridSearchCV) gives me is similar to the score I got from Kaggle, using the best model.
There are 1,046,524 features if I use bigram. GridSearchCV gives me a better mean_test_score compared to unigram, but using this new model I got a much much lower score on Kaggle.
I guess the reason might be overfitting, since I have too many features. I have tried to set the GridSearchCV using 5-fold, or even 2-fold, but the scores are still inconsistent.
Does it really indicate my second model is overfitting, even in the validation stage? If so, how can I tune the regularization term for my logistic model using sklearn? Any suggestions are appreciated!
Assuming you are using sklearn. You could try looking into using the tuning parameters max_df, min_df, and max_features. Throwing these into a GridSearch may take a long time but you will likely get some interesting results back. I know these features are implemented in the sklearn.feature_extraction.text.TfidfVectorizer, but I am sure they use them elsewhere as well. Essentially the idea is that including too many grams can lead to overfitting, same thing with having too many grams with low or high document frequencies.

Is there class weight (or alternative way) for GradientBoostingClassifier in Sklearn when dealing with VotingClassifier or Grid search?

I'm using GradientBoostingClassifier for my unbalanced labeled datasets. It seems like class weight doesn't exist as a parameter for this classifier in Sklearn. I see I can use sample_weight when fit but I cannot use it when I deal with VotingClassifier or GridSearch. Could someone help?
Currently there isn't a way to use class_weights for GB in sklearn.
Don't confuse this with sample_weight
Sample Weights change the loss function and your score that you're trying to optimize. This is often used in case of survey data where sampling approaches have gaps.
Class Weights are used to correct class imbalances as a proxy for over \ undersampling. There is no direct way to do that for GB in sklearn (you can do that in Random Forests though)
Very late, but I hope it can be useful for other members.
In the article of Zichen Wang in towardsdatascience.com, the point 5 Gradient Boosting it is told:
For instance, Gradient Boosting Machines (GBM) deals with class imbalance by constructing successive training sets based on incorrectly classified examples. It usually outperforms Random Forest on imbalanced dataset For instance, Gradient Boosting Machines (GBM) deals with class imbalance by constructing successive training sets based on incorrectly classified examples. It usually outperforms Random Forest on imbalanced dataset.
And a chart shows that the half of the grandient boosting model have an AUROC over 80%. So considering GB models performances and the way they are done, it seems not to be necessary to introduce a kind of class_weight parameter as it is the case for RandomForestClassifier in sklearn package.
In the book Introduction To Machine Learning with Pyhton written by Andreas C. Müller and Sarah Guido, edition 2017, page 89, Chapter 2 *Supervised Learning, section Ensembles of Decision Trees, sub-section Gradient boosted regression trees (gradient boosting machines):
They are generally a bit more sensitive to
parameter settings than random forests, but can provide better accuracy if the parameters are set correctly.
Now if you still have scoring problems due to imbalance proportions of categories in the target variable, it is possible you should see if your data should be splited to apply different models on it, because they are not as homogeneous as it seems to be. I mean it may have a variable you have not in your dataset train (an hidden variable clearly) that influences a lot the model results, then it is difficult even for the greater GB to give correct scoring because it misses a huge information that you cannot make appear in the matrix to compute sometimes for many reasons.
Some updates:
I found, by random, there are libraries that implement it as parameters of their gradient boosting instance objects. It is the case of H2O where for the parameter balance_classes it is told:
Balance training data class counts via over/under-sampling (for
imbalanced data).
Type: bool (default: False).
If you want to keep with sklearn you should do as HakunaMaData told: over/under-sampling because that's what other libraries finally do when the parameter exist.

Resources