feature_importances_ when using random forests in scikit-learn - scikit-learn

I am using random forests in scikit-learn. I used feature_importances_ to see how important each feature is for the prediction target, but I don't understand what this score is. Googling feature_importances_ says it is the mean decrease in impurity, but I'm still unsure whether this is the same as the mean decrease in Gini impurity. If so, how is it calculated for trees and random forests? Besides the math, I want to really understand what it means.

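The feature_importances_ attribute tells you how much each feature contributes to the prediction, measured by the impurity reduction (information gain) at the splits that use it.
A random forest scores the independent variables (features) using the splitting criterion its trees were grown with. In scikit-learn's classifiers this is Gini impurity by default, with entropy as an alternative, so with the default setting the score is the mean decrease in Gini impurity. For a single tree, a feature's score is the total impurity decrease over the nodes that split on it, each weighted by the fraction of samples reaching that node, and the scores are normalized to sum to 1. The forest averages these scores over its trees. The features that contribute most to the impurity reduction get the highest scores.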
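To see what the score contains, the mean decrease in impurity can be recomputed by hand from a fitted forest. This is a minimal sketch, assuming the iris dataset and a small forest as stand-ins; the helper name mean_decrease_impurity is hypothetical and not part of scikit-learn. It only reads the public node arrays on each tree's tree_ attribute.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


def mean_decrease_impurity(decision_tree, n_features):
    """Hypothetical helper: recompute one tree's impurity-based importances."""
    t = decision_tree.tree_
    importances = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf: no split, so no impurity decrease
            continue
        # Sample-weighted impurity of the node minus that of its two children.
        decrease = (
            t.weighted_n_node_samples[node] * t.impurity[node]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right]
        )
        importances[t.feature[node]] += decrease
    importances /= t.weighted_n_node_samples[0]  # express weights as sample fractions
    total = importances.sum()
    return importances / total if total > 0 else importances


# Assumption: iris and 25 trees are illustrative choices only.
X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# The forest-level score is the average of the per-tree scores.
per_tree = np.array(
    [mean_decrease_impurity(tree, X.shape[1]) for tree in forest.estimators_]
)
manual = per_tree.mean(axis=0)
manual /= manual.sum()

print(np.round(manual, 3))
print(np.round(forest.feature_importances_, 3))
```

The two printed vectors should agree, which shows that the score is an average over trees of normalized, sample-weighted impurity decreases and nothing more.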

Related

Find optimized threshold value

I have a dataset with a fraud_label and a set of other feature variables. How can I find the best rule for identifying fraud_label correctly, with the best precision and recall? Examples of features are number_of_site_visits, external_fraud_score, etc. I need to come up with a rule that says: if number_of_site_visits is less than X and external_fraud_score is greater than Y, then we get the best precision and recall. I have to do this in Python, and any help or direction would be appreciated.
I have tried a random forest model, but that gives me feature importances and not exact threshold values.
One way to find a rule that identifies fraud_label with good precision and recall is to train a supervised machine learning model, such as logistic regression or a support vector machine, on your dataset and use the trained model to predict fraud_label. The predictions can then be evaluated with metrics such as precision and recall.
You can also use grid search with cross-validation to tune the model's parameters and to compare candidate threshold values for each feature variable, keeping the combination that gives the best precision and recall. The winning combination is your rule.
In Python, you can use the scikit-learn library to implement these algorithms.
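A direct grid search over the two thresholds in the question might look like the sketch below. It is a brute-force search over the rule form the question asks for, not the model-based approach described above. The data is synthetic and made up for illustration, the helper evaluate_rule is hypothetical, and using F1 to combine precision and recall into one number is an assumption; another trade-off may suit a fraud setting better.

```python
import numpy as np

# Assumption: synthetic data standing in for the real dataset.
rng = np.random.default_rng(0)
n = 2000
number_of_site_visits = rng.poisson(10, size=n)
external_fraud_score = rng.uniform(0.0, 1.0, size=n)
# Synthetic labels: fraud when visits < 8 and score > 0.5, with about 1% flipped.
flipped = rng.uniform(size=n) < 0.01
fraud_label = ((number_of_site_visits < 8) & (external_fraud_score > 0.5)) ^ flipped


def evaluate_rule(x, y):
    """Precision, recall and F1 of the rule: visits < x and score > y."""
    predicted = (number_of_site_visits < x) & (external_fraud_score > y)
    true_positives = np.sum(predicted & fraud_label)
    precision = true_positives / predicted.sum() if predicted.sum() else 0.0
    recall = true_positives / fraud_label.sum() if fraud_label.sum() else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Try every observed visit count as X and a coarse grid of scores as Y.
candidates = [
    (x, y, *evaluate_rule(x, y))
    for x in np.unique(number_of_site_visits)
    for y in np.arange(0.05, 1.0, 0.05)
]
best_x, best_y, best_precision, best_recall, best_f1 = max(
    candidates, key=lambda row: row[4]
)
print(f"visits < {best_x} and score > {best_y:.2f}: "
      f"precision={best_precision:.3f} recall={best_recall:.3f} f1={best_f1:.3f}")
```

On real data the thresholds should be chosen on a training split and checked on held-out data, since a rule tuned and scored on the same rows will look better than it is.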

How to know which features contribute significantly in prediction models?

I am a novice in DS/ML. I am trying to solve the Titanic case study on Kaggle, but my approach has not been systematic so far. I have used correlation to find relationships between variables and have tried KNN and random forest classification; however, my models' performance has not improved. I selected features based on the correlations between variables.
Please tell me whether there are scikit-learn methods that can be used to identify features that contribute significantly to the prediction.
Boosting techniques can improve accuracy, reportedly up to about 99%. I suggest you try Gradient Boosting.
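One scikit-learn tool for the question of which features matter is permutation importance, which measures how much a fitted model's held-out score drops when a single feature is shuffled. The sketch below is illustrative only: it uses synthetic data from make_classification rather than the Titanic dataset, and pairing it with GradientBoostingClassifier is an assumption that follows the suggestion above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Assumption: synthetic data. With shuffle=False the first 3 columns are the
# informative ones and the remaining 5 are pure noise.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature on the held-out split and record the drop in accuracy.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
ranking = result.importances_mean.argsort()[::-1]
for index in ranking:
    print(f"feature {index}: {result.importances_mean[index]:.4f} "
          f"+/- {result.importances_std[index]:.4f}")
```

Features whose mean importance is close to zero are candidates for removal. Unlike a correlation with the target, this score reflects what the fitted model actually uses.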

Loss functions in LightFM

I recently came across LightFM while learning to train a recommender system. So far I know that it offers four loss functions: logistic, BPR, WARP and k-OS WARP. I have not gone through the math behind them. What I am confused about is how to know which loss function to use in which situation.
From the LightFM model documentation page:
logistic: useful when both positive (1) and negative (-1) interactions are present.
BPR: Bayesian Personalised Ranking [1] pairwise loss. Maximises the prediction difference between a positive example and a randomly chosen negative example. Useful when only positive interactions are present and optimising ROC AUC is desired.
WARP: Weighted Approximate-Rank Pairwise [2] loss. Maximises the rank of positive examples by repeatedly sampling negative examples until rank violating one is found. Useful when only positive interactions are present and optimising the top of the recommendation list (precision@k) is desired.
k-OS WARP: k-th order statistic loss [3]. A modification of WARP that uses the k-th positive example for any given user as a basis for pairwise updates.
Everything boils down to how your dataset is structured and what kind of user interactions you're looking at. One obvious approach is to include the loss function in your parameter grid during hyperparameter tuning (at least that's what I did) and compare model accuracy. I find that investigating why a given loss function performed better or worse on a dataset is a good learning exercise.
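The difference between the BPR and WARP descriptions quoted above can be made concrete with a small NumPy sketch. This is not LightFM's implementation. It assumes a single positive item per user, treats every other item as a negative, uses a margin of 1, and weights the WARP loss by a harmonic sum of the estimated rank, as in the WSABIE formulation of WARP. The function names bpr_loss and warp_sample are hypothetical.

```python
import numpy as np


def bpr_loss(pos_score, neg_score):
    """Pairwise BPR loss for one (positive, negative) pair: -log(sigmoid(pos - neg))."""
    return np.logaddexp(0.0, -(pos_score - neg_score))


def warp_sample(scores, pos_item, rng, margin=1.0, max_trials=None):
    """Sample negatives until one violates the margin; return (negative, weighted loss)."""
    n_items = len(scores)
    negatives = [i for i in range(n_items) if i != pos_item]
    max_trials = max_trials or len(negatives)
    for trial in range(1, max_trials + 1):
        neg = rng.choice(negatives)
        if scores[neg] > scores[pos_item] - margin:  # rank-violating negative
            # Few trials needed means the positive is probably ranked low.
            rank_estimate = (n_items - 1) // trial
            weight = sum(1.0 / j for j in range(1, rank_estimate + 1))
            hinge = margin + scores[neg] - scores[pos_item]
            return neg, weight * hinge
    return None, 0.0  # no violation found: the positive already ranks well


rng = np.random.default_rng(0)

# Positive item 0 is far ahead of every negative: nothing to update.
well_ranked = np.array([5.0, 0.0, 0.0, 0.0])
print(warp_sample(well_ranked, pos_item=0, rng=rng))

# Positive item 0 is behind every negative: the first sample is a violation.
badly_ranked = np.array([0.0, 5.0, 5.0, 5.0])
violator, loss = warp_sample(badly_ranked, pos_item=0, rng=rng)
print(violator, loss)

# BPR penalises any pair in proportion to how badly it is ordered.
print(bpr_loss(5.0, 0.0), bpr_loss(0.0, 5.0))
```

BPR updates on every sampled pair, which suits a global ordering metric such as ROC AUC. WARP only updates when a violation is found and weights the update more heavily when violations are easy to find, which concentrates effort on the top of the list.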

Why do more features in a random forest decrease accuracy dramatically?

I am using sklearn's random forests module to predict values based on 50 different dimensions. When I increase the number of dimensions to 150, the accuracy of the model decreases dramatically. I would expect more data to only make the model more accurate, but more features tend to make the model less accurate.
I suspect that splitting might only be done across one dimension which means that features which are actually more important get less attention when building trees. Could this be the reason?
Yes. The additional features you added might not have good predictive power, and because a random forest considers only a random subset of features at each split, the original 50 features may be left out of many splits. To test this hypothesis, you can plot the variable importances using sklearn.
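The hypothesis can be checked on a small scale by appending noise columns to a dataset and looking at where the importance goes. The sketch below is an assumption-laden stand-in for the real data: it uses synthetic features from make_classification (10 original columns, 5 of them informative) plus 20 random columns, and prints the importances rather than plotting them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumption: synthetic data. With shuffle=False, columns 0-4 are informative
# and columns 5-9 are noise produced by make_classification.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_redundant=0,
    shuffle=False, random_state=0,
)
rng = np.random.default_rng(0)
X_wide = np.hstack([X, rng.normal(size=(X.shape[0], 20))])  # add 20 noise columns


def cv_accuracy(features):
    """Mean 3-fold cross-validated accuracy of a fresh forest."""
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(forest, features, y, cv=3).mean()


accuracy_original = cv_accuracy(X)
accuracy_wide = cv_accuracy(X_wide)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_wide, y)
importances = forest.feature_importances_

print(f"accuracy with 10 features: {accuracy_original:.3f}")
print(f"accuracy with 30 features: {accuracy_wide:.3f}")
print(f"importance share of the 5 informative columns: {importances[:5].sum():.3f}")
print(f"importance share of the 20 added noise columns: {importances[10:].sum():.3f}")
```

If the added columns collect a sizeable share of the importance while accuracy falls, the new features are crowding out the useful ones, and dropping low-importance features or raising max_features is worth trying.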
Your model is overfitting the data.
From Wikipedia:
An overfitted model is a statistical model that contains more parameters than can be justified by the data.
https://qph.fs.quoracdn.net/main-qimg-412c8556aacf7e25b86bba63e9e67ac6-c
There are plenty of illustrations of overfitting. For instance, this 2D plot shows the different functions that could be learned for a binary classification task. Because the function on the right has too many parameters, it learns wrong patterns in the data that don't generalize properly.

information criteria for confusion matrices

One can measure the goodness of fit of a statistical model using the Akaike Information Criterion (AIC), which accounts both for goodness of fit and for the number of parameters used to build the model. Computing the AIC requires the maximized value of the model's likelihood function (L).
How can one compute L given the prediction results of a classification model, represented as a confusion matrix?
It is not possible to calculate the AIC from a confusion matrix since it doesn't contain any information about the likelihood. Depending on the model you are using it may be possible to calculate the likelihood or quasi-likelihood and hence the AIC or QIC.
What is the classification problem that you are working on, and what is your model?
In a classification context often other measures are used to do GoF testing. I'd recommend reading through The Elements of Statistical Learning by Hastie, Tibshirani and Friedman to get a good overview of this kind of methodology.
Hope this helps.
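The point that a confusion matrix carries no likelihood information can be illustrated with two sets of predicted probabilities that give the same confusion matrix but different likelihoods. The numbers below are made up for illustration, the parameter count of 3 is a hypothetical placeholder, and the formula used is the standard AIC = 2k - 2 ln L for a binary classifier that outputs probabilities.

```python
import numpy as np


def log_likelihood(y_true, p_pos):
    """Bernoulli log-likelihood of binary labels given predicted P(y = 1)."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_pos, dtype=float)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def aic(y_true, p_pos, n_params):
    """AIC = 2k - 2 ln L, with k the number of fitted parameters."""
    return 2.0 * n_params - 2.0 * log_likelihood(y_true, p_pos)


def confusion(y_true, p_pos, threshold=0.5):
    """2x2 confusion matrix: rows are true classes, columns are predictions."""
    predicted = (np.asarray(p_pos) >= threshold).astype(int)
    matrix = np.zeros((2, 2), dtype=int)
    for truth, guess in zip(y_true, predicted):
        matrix[truth, guess] += 1
    return matrix


# Assumption: made-up labels and probabilities from two imaginary models.
y_true = [1, 1, 0, 0, 1, 0]
confident = [0.90, 0.80, 0.10, 0.20, 0.40, 0.60]
hesitant = [0.60, 0.55, 0.45, 0.40, 0.45, 0.55]
n_params = 3  # hypothetical: both models are assumed to have 3 parameters

print(confusion(y_true, confident))
print(confusion(y_true, hesitant))
print(aic(y_true, confident, n_params), aic(y_true, hesitant, n_params))
```

Both models make the same hard predictions at a 0.5 threshold, so their confusion matrices are identical, yet their likelihoods and therefore their AIC values differ. Recovering L requires the predicted probabilities, not just the counts.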
Information-Based Evaluation Criterion for Classifier's Performance by Kononenko and Bratko is exactly what I was looking for:
Classification accuracy is usually used as a measure of classification performance. This measure is, however, known to have several defects. A fair evaluation criterion should exclude the influence of the class probabilities which may enable a completely uninformed classifier to trivially achieve high classification accuracy. In this paper a method for evaluating the information score of a classifier's answers is proposed. It excludes the influence of prior probabilities, deals with various types of imperfect or probabilistic answers and can be used also for comparing the performance in different domains.

Resources