Using cross-validation to calculate feature importance "Some Questions" - python-3.x

I am currently working on a project. I already selected my features and want to check their importance. I have some questions if anyone can help me please.
1- Does it make sense if I use RandomForestClassifier with cross-validation to calculate the feature importance?
2- I tried to calculate the feature importance using the cross_validate function (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html). The function provides the test_score and train_score results. With 10-fold cross-validation I got the following:
test_score [0.99950158, 0.9997231 , 0.9997231 , 0.99994462, 0.99977848, 0.99983386, 0.99977848, 0.9997231 , 0.99977847, 1.]
train_score [0.99998769, 0.99998154, 0.99997539, 0.99997539, 0.99998154,0.99997539, 0.99998154, 0.99997539, 0.99998154, 0.99997539],
Can anyone explain these results and what they indicate?
3- The cross_validate function has a parameter called scoring, which accepts different values such as accuracy, balanced_accuracy and f1. What does the scoring parameter do? What do these values mean, and how should I decide which one to choose? I already read the scikit-learn documentation, but it wasn't clear to me.
Thank you.

Question 1 is slightly out of scope here. Each fold of cross-validation fits a separate model, so you get one array of feature importances per fold. You then have to decide how to combine those into a single importance per feature. A feature that scores high in several folds is probably important, but the values can vary from fold to fold.
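One possible way to combine the per-fold importances is to average them and look at their spread. The sketch below assumes synthetic data from make_classification in place of the real features, and uses the return_estimator option of cross_validate to keep the fitted model from each fold. Averaging is only one choice of aggregation, not the only reasonable one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic data stands in for the real feature matrix (an assumption).
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# return_estimator=True keeps the fitted model from every fold.
cv_results = cross_validate(clf, X, y, cv=5, return_estimator=True)

# One row of importances per fold, one column per feature.
fold_importances = np.array(
    [est.feature_importances_ for est in cv_results["estimator"]]
)
mean_importance = fold_importances.mean(axis=0)
std_importance = fold_importances.std(axis=0)

# Print features from most to least important on average.
for i in np.argsort(mean_importance)[::-1]:
    print(f"feature {i}: {mean_importance[i]:.3f} +/- {std_importance[i]:.3f}")
```

A large standard deviation relative to the mean suggests the ranking of that feature is not stable across folds.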
cross_validate returns the estimator's default score unless the scoring parameter is set. If you leave scoring unset, it uses RandomForestClassifier's score() method, which returns accuracy.
(In scikit-learn, score() returns accuracy for all classifiers and the R-squared value for all regressors.)
So for your question 2: the returned scores are accuracies per CV fold.
If you want a score other than accuracy, set the scoring parameter in cross_validate.
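To make the effect of scoring concrete, here is a small sketch on synthetic, imbalanced data (an assumption, not the asker's data). It runs cross_validate once with the default score and once with a list of named metrics, and shows which keys appear in the result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced data (an assumption): about 80% class 0, 20% class 1.
X, y = make_classification(n_samples=300, n_features=6, weights=[0.8, 0.2],
                           random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# No scoring given: the classifier's own score() is used, i.e. accuracy.
default_scores = cross_validate(clf, X, y, cv=5)

# Several metrics at once: one "test_<name>" entry per metric.
multi_scores = cross_validate(
    clf, X, y, cv=5, scoring=["accuracy", "balanced_accuracy", "f1"]
)

print(sorted(default_scores))   # includes 'test_score'
print(sorted(multi_scores))     # includes 'test_accuracy', 'test_f1', ...
for name in ("accuracy", "balanced_accuracy", "f1"):
    print(name, multi_scores[f"test_{name}"].mean())
```

On imbalanced classes, accuracy can look high even when the minority class is predicted poorly, which is the usual reason to look at balanced_accuracy or f1 as well.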

Related

What do sklearn.cross_validation scores mean?

I am working on a time-series prediction problem using GradientBoostingRegressor, and I think I'm seeing significant overfitting, as evidenced by a significantly better RMSE for training than for prediction. In order to examine this, I'm trying to use sklearn.model_selection.cross_validate, but I'm having problems understanding the result.
First: I was calculating RMSE by fitting to all my training data, then "predicting" the training data outputs using the fitted model and comparing those with the training outputs (the same ones I used for fitting). The RMSE that I observe is the same order of magnitude as the predicted values and, more importantly, it's in the same ballpark as the RMSE I get when I submit my predicted results to Kaggle (although the latter is lower, reflecting overfitting).
Second, I use the same training data, but apply sklearn.model_selection.cross_validate as follows:
cross_validate( predictor, features, targets, cv = 5, scoring = "neg_mean_squared_error" )
I figure the neg_mean_squared_error should be the square of my RMSE. Accounting for that, I still find that the error reported by cross_validate is one or two orders of magnitude smaller than the RMSE I was calculating as described above.
In addition, when I modify my GradientBoostingRegressor max_depth from 3 to 2, which I would expect reduces overfitting and thus should improve the CV error, I find that the opposite is the case.
I'm keenly interested to use Cross Validation so I don't have to validate my hyperparameter choices by using up Kaggle submissions, but given what I've observed, I'm not clear that the results will be understandable or useful.
Can someone explain how I should be using Cross Validation to get meaningful results?
I think there is a conceptual problem here.
If you want to compute the error of a prediction, you should not use the training data. As the name says, this type of data is used only for training; to evaluate accuracy scores you have to use data that the model has never seen.
Cross-validation is an approach for evaluating a model on several different training/testing splits. The process is as follows: you divide your data into n groups and iterate, changing the group held out for testing each time. With n groups you do n iterations, and each time the training and testing sets are different.
Basically, what you should do is like this:
Train the model using months from 0 to 30 (for example)
See the predictions made with months from 31 to 35 as input.
If the input has to be the same length, divide the features in half (should be 17 months).
I hope I understood correctly; otherwise, comment.
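The "train on earlier months, test on later months" idea can be sketched with TimeSeriesSplit, together with the conversion from neg_mean_squared_error to RMSE. The data below is synthetic (an assumption), and the row order is treated as time order. Note that neg_mean_squared_error is the negated MSE, so RMSE is the square root of the sign-flipped score.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_validate

# Synthetic data (an assumption); rows are treated as if ordered in time.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = GradientBoostingRegressor(max_depth=3, random_state=0)

# Each split trains on earlier rows and tests on the rows that follow.
cv = TimeSeriesSplit(n_splits=5)
res = cross_validate(model, X, y, cv=cv,
                     scoring="neg_mean_squared_error",
                     return_train_score=True)

# Scores are negative MSE values: flip the sign, then take the square root.
test_rmse = np.sqrt(-res["test_score"])
train_rmse = np.sqrt(-res["train_score"])

print("train RMSE per fold:", np.round(train_rmse, 2))
print("test RMSE per fold: ", np.round(test_rmse, 2))
```

A large gap between the train and test RMSE is the overfitting signal the question is looking for. Comparing the mean test RMSE across hyperparameter settings is then a substitute for spending Kaggle submissions.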

How to know which features have more impact in predicting the target class?

I have a business problem. I ran a regression model in Python to predict my target value. When validating it with my test set, I found that my predicted values are very far from the actual values. What I want to extract from this model is which features caused the predicted value to deviate from the actual value (let's say when the difference exceeds some threshold value).
I want to rank the features by impact so that I can present this to my client.
Thanks
It depends on the estimator you chose. Linear models often have a coef_ attribute that holds the coefficient used for each feature; provided the features are normalized, this tells you what you want to know.
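A minimal sketch of the coef_ approach, assuming synthetic data and a plain LinearRegression (the original post does not say which linear model is used). The features are standardized inside a pipeline so the coefficient magnitudes are comparable.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): only the first two features are informative.
X, y = make_regression(n_samples=200, n_features=4, n_informative=2,
                       random_state=0, shuffle=False)

# Rescale the columns to mimic features measured in very different units.
X = X * np.array([1.0, 100.0, 0.01, 10.0])

# Standardizing first makes the coefficient magnitudes comparable.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X, y)

coefs = pipe.named_steps["linearregression"].coef_
ranking = np.argsort(np.abs(coefs))[::-1]

for i in ranking:
    print(f"feature {i}: standardized coefficient {coefs[i]:.3f}")
```

Without the scaling step, a feature expressed in small units would get a large raw coefficient regardless of how much it actually matters.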
As mentioned above, tree models expose feature importances. You can also use libraries like treeinterpreter, described here:
Interpreting Random Forest
examples
You can have a look at this -
Feature selection
Check the Random Forest Regressor - for performing Regression.
# Example
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_features=4, n_informative=2,
                       random_state=0, shuffle=False)
regr = RandomForestRegressor(max_depth=2, random_state=0,
                             n_estimators=100)
regr.fit(X, y)
print(regr.feature_importances_)
print(regr.predict([[0, 0, 0, 0]]))
Check regr.feature_importances_ to find the features with higher importance. Further information on FeatureImportance
Edit-1:
As pointed out in @blacksite's comment, feature_importances_ alone does not provide a complete interpretation of a random forest. For further analysis of the results and the responsible features, please refer to the following blogs:
https://medium.com/usf-msds/intuitive-interpretation-of-random-forest-2238687cae45 (preferred as it provides multiple techniques )
https://blog.datadive.net/interpreting-random-forests/ (focuses on 1 technique but also provides python library - treeinterpreter)
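As one example of a complementary technique, scikit-learn ships permutation importance in sklearn.inspection. I am naming it as my own choice of illustration here; it is not necessarily the method the linked posts focus on. The sketch reuses the synthetic setup from the example above and measures how much the held-out score drops when a feature's values are shuffled.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Same synthetic setup as the example above; the train/test split is added.
X, y = make_regression(n_features=4, n_informative=2,
                       random_state=0, shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

regr = RandomForestRegressor(n_estimators=100, random_state=0)
regr.fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and record the score drop.
result = permutation_importance(regr, X_test, y_test,
                                n_repeats=10, random_state=0)

for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

Because it is computed on held-out data, this view is less tied to how the trees happened to split during training than feature_importances_ is.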
More on feature_importance:
If you simply use the feature_importances_ attribute to select the features with the highest importance score: Feature selection using feature importances
Feature importance also depends on the criteria used for splitting and calculating importance: Interpreting Decision Tree in context of feature importances

How to provide weighted eval set to XGBClassifier.fit()?

From the sklearn-style API of XGBClassifier, we can provide eval examples for early-stopping.
eval_set (list, optional) – A list of (X, y) pairs to use as a
validation set for early-stopping
However, the format only mentions a pair of features and labels. So if the doc is accurate, there is no place to provide weights for these eval examples.
Am I missing anything?
If it's not achievable in the sklearn-style, is it supported in the original (i.e. non-sklearn) XGBClassifier API? A short example will be nice, since I never used that version of the API.
As of a few weeks ago, there is a new parameter for the fit method, sample_weight_eval_set, that allows you to do exactly this. It takes a list of weight variables, i.e. one per evaluation set. I don't think this feature has made it into a stable release yet, but it is available right now if you compile xgboost from source.
https://github.com/dmlc/xgboost/blob/b018ef104f0c24efaedfbc896986ad3ed1b66774/python-package/xgboost/sklearn.py#L235
EDIT - UPDATED per conversation in comments
Given that you have a target-variable representing real-valued gain/loss values which you would like to classify as "gain" or "loss", and you would like to make sure the validation-set of the classifier weighs the large-absolute-value gains/losses heaviest, here are two possible approaches:
1- Create a custom classifier which is just XGBRegressor fed to a threshold, where the real-valued regression predictions are converted to 1/0 or "gain"/"loss" classifications. The .fit() method of this classifier would just call .fit() of the regressor, while its .predict() method would call .predict() of the regressor and then return the thresholded category predictions.
2- You mentioned you would like to try weighting the treatment of the records in your validation set, but there is no option for this in xgboost. The way to implement this would be to implement a custom eval-metric. However, you pointed out that eval_metric must be able to return a score for a single label/pred record at a time, so it couldn't accept all your row-values and perform the weighting in the eval metric. The solution to this you mentioned in your comment was "create a callable which has a ref to all validation examples, pass the indices (instead of labels and scores) into eval_set, use the indices to fetch labels and scores from within the callable and return metric for each validation examples." This should also work.
I would tend to prefer option 1 as more straightforward, but trying two different approaches and comparing results is generally a good idea if you have the time, so interested how these turn out for you.
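Option 1 could look roughly like the sketch below. The class name ThresholdedRegressorClassifier is hypothetical, the data is synthetic, and scikit-learn's GradientBoostingRegressor is used as a stand-in so the example runs without xgboost; any regressor with fit() and predict(), such as xgboost's XGBRegressor, should slot in the same way.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


class ThresholdedRegressorClassifier:
    """Hypothetical wrapper: fit a regressor on real-valued gains/losses,
    then threshold its predictions into 1 ("gain") or 0 ("loss")."""

    def __init__(self, regressor, threshold=0.0):
        self.regressor = regressor
        self.threshold = threshold

    def fit(self, X, y_real, **fit_params):
        # Delegate training to the wrapped regressor.
        self.regressor.fit(X, y_real, **fit_params)
        return self

    def predict(self, X):
        # Regress first, then convert to class labels.
        return (self.regressor.predict(X) > self.threshold).astype(int)


# Synthetic gains/losses (an assumption) driven by the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
gains = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

clf = ThresholdedRegressorClassifier(GradientBoostingRegressor(random_state=0))
clf.fit(X[:150], gains[:150])

labels = clf.predict(X[150:])
accuracy = (labels == (gains[150:] > 0).astype(int)).mean()
print("held-out accuracy:", accuracy)
```

Because the regressor is trained on the real-valued target, large gains and losses influence the fit more than small ones, which is the weighting effect the discussion is after.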

Custom loss function: Apply weights to binary cross-entropy error

I am playing around with Keras and try to predict a word from within a context e.g. from a sentence "I have to say the food was tasty!" I hope to get something like this:
[say the ? was tasty] -> food, meals, spaghetti, drinks
However, my problem currently is that the network I am training appears to learn just the probabilities of the single words, and not the probabilities they have in a particular context.
Since the frequency of words is not balanced I thought I might/could/should apply weights to my loss function - which is currently the binary-cross entropy function.
I simply multiply the converse probability of each word with the error:
def weighted_binary_crossentropy(y_true, y_pred):
    return K.mean(K.binary_crossentropy(y_pred, y_true) * (1-word_weights), axis=1)
This function is being used by the model as loss function:
model.compile(optimizer='adam', loss=weighted_binary_crossentropy)
However, my results are exactly the same, and I am not sure whether my model is just broken or whether I am using the loss parameter/function wrong.
Is my weighted_binary_crossentropy() function doing what I just described? I ask because for some reason this works similarly:
word_weights), axis=1)
Actually, as the documentation of the fit function explains, you can provide sample_weight, which seems to be exactly what you want to use.
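To check what the per-word weighting in the question actually computes, here is a plain NumPy sketch of the same arithmetic. It is not Keras code, and the word frequencies are made-up values for illustration. Each word's binary cross-entropy term is multiplied by one minus that word's frequency before averaging over the vocabulary axis.

```python
import numpy as np


def weighted_binary_crossentropy(y_true, y_pred, weights, eps=1e-7):
    """Per-sample mean of element-wise binary cross-entropy times weights."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return np.mean(bce * weights, axis=1)


# One sample, vocabulary of three words; frequencies are hypothetical.
y_true = np.array([[1.0, 0.0, 0.0]])
y_pred = np.array([[0.5, 0.5, 0.5]])
word_freq = np.array([0.7, 0.2, 0.1])
weights = 1 - word_freq  # rare words get the larger weight

loss = weighted_binary_crossentropy(y_true, y_pred, weights)
unweighted = weighted_binary_crossentropy(y_true, y_pred, np.ones(3))

print("weighted:  ", loss)
print("unweighted:", unweighted)
```

With every prediction at 0.5, each term equals ln 2, so the weighted loss is ln 2 times the mean weight (2/3 here) and the unweighted loss is ln 2 itself.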

Does GridSearchCV use predict or predict_proba, when using auc_score as score function?

The predict function generates predicted class labels, which will always result in a triangular ROC-curve. A more curved ROC-curve is obtained using the predicted class probabilities. The latter one is, as far as I know, more accurate. If so, the area under the 'curved' ROC-curve is probably best to measure classification performance within the grid search.
Therefore I am curious if either the class labels or class probabilities are used for the grid search, when using the area under the ROC-curve as performance measure. I tried to find the answer in the code, but could not figure it out. Does anyone here know the answer?
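The difference between the two ROC curves can be seen directly. The sketch below uses synthetic data and LogisticRegression, both assumptions chosen only because they run quickly. It computes the ROC curve and AUC once from predict() labels and once from predict_proba() scores.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data and a simple classifier (both assumptions).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

labels = clf.predict(X_te)             # hard 0/1 class labels
probs = clf.predict_proba(X_te)[:, 1]  # probability of class 1

# Hard labels give only one interior point, hence the "triangular" curve.
fpr_l, tpr_l, _ = roc_curve(y_te, labels)
fpr_p, tpr_p, _ = roc_curve(y_te, probs)

auc_labels = roc_auc_score(y_te, labels)
auc_probs = roc_auc_score(y_te, probs)

print("points on ROC from labels:       ", len(fpr_l))
print("points on ROC from probabilities:", len(fpr_p))
print("AUC from labels:", auc_labels, " AUC from probabilities:", auc_probs)
```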
Thanks
To use auc_score for grid searching you really need to use predict_proba or decision_function, as you pointed out. This is not possible in the 0.13 release. If you do score_func=auc_score it will use predict, which doesn't make any sense.
Edit: since 0.14 it is possible to grid-search on AUC by setting the new scoring parameter to roc_auc: GridSearchCV(est, param_grid, scoring='roc_auc'). It will do the right thing and use predict_proba (or decision_function if predict_proba is not available).
See the what's new page of the current dev version.
You need to install the current master from GitHub to get this functionality or wait until April (?) for 0.14.
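With a current scikit-learn release, the scoring='roc_auc' usage looks like the sketch below. The dataset, the estimator and the parameter grid are assumptions picked only to make the example run.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small grid (both assumptions).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# scoring="roc_auc" ranks candidates by AUC on continuous scores,
# not on hard class labels.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="roc_auc", cv=5)
search.fit(X, y)

print("best params:", search.best_params_)
print("best mean CV AUC:", search.best_score_)
```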
After performing some experiments with Sklearn SVC (which has predict_proba available) comparing some results with predict_proba and decision_function, it seems that roc_auc in GridSearchCV uses decision_function to compute AUC scores. I found a similar discussion here: Reproducing Sklearn SVC within GridSearchCV's roc_auc scores manually
