Computing values of feature importances - scikit-learn

Where does scikit-learn compute the values of sklearn.ensemble.RandomForestClassifier.feature_importances_?

The corresponding code should be in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_forest.py, but I cannot find sklearn.ensemble.RandomForestClassifier.feature_importances_ there. There is, however, a class called RandomForestClassifier.

The relevant section of the source code is here: https://github.com/scikit-learn/scikit-learn/blob/b194674c4/sklearn/ensemble/_forest.py#L415
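To make the linked code easier to follow, here is a minimal sketch of what it appears to do: feature_importances_ is exposed as a property on the forest rather than assigned as a plain attribute, and its value is the average of the per-tree importances, normalized to sum to one. The dataset and hyperparameters below are arbitrary choices for illustration, and the single-node filter reflects my reading of the source rather than a documented guarantee.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Looked up on the class, feature_importances_ is a property object,
# which is why no plain attribute assignment shows up in _forest.py.
is_property = isinstance(
    getattr(RandomForestClassifier, "feature_importances_"), property
)

# Average the per-tree importances (skipping trees that are a single node),
# then normalize so that the result sums to one.
per_tree = [
    tree.feature_importances_
    for tree in clf.estimators_
    if tree.tree_.node_count > 1
]
manual = np.mean(per_tree, axis=0)
manual = manual / manual.sum()

matches = np.allclose(manual, clf.feature_importances_)
print(is_property, matches)
```

Each tree's own importances come from the fitted tree object, so the forest only has to aggregate them.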

Related

How to use GroupKFold with CalibratedClassifierCV?

Unlike GridSearchCV, CalibratedClassifierCV doesn't seem to support passing the groups parameter to the fit method. I found this very old GitHub issue that reports the problem, but it doesn't seem to be fixed yet. The documentation makes it clear that not properly stratifying your cv folds will result in an incorrectly calibrated model. My dataset has multiple observations from the same users, so I need to use GroupKFold to ensure proper calibration.

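scikit-learn accepts an iterable of (train, test) index splits as the cv argument, so you can create the splits manually. For example:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import GroupKFold
# Materialize the grouped splits as a list so they can be iterated more than once
my_cv = list(GroupKFold(n_splits=5).split(X, groups=my_groups))
cal_clf = CalibratedClassifierCV(clf, cv=my_cv)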
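For completeness, here is a self-contained sketch of the same idea on synthetic data. The dataset, the 20 made-up "users", and the choice of LogisticRegression as the base model are all assumptions for illustration only.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold

# Synthetic stand-in data: 200 rows, 20 hypothetical users with 10 rows each
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
my_groups = np.repeat(np.arange(20), 10)

# Precompute grouped (train, test) index pairs
splits = list(GroupKFold(n_splits=5).split(X, y, groups=my_groups))

# Sanity check: no user appears on both sides of any split
no_leak = all(
    set(my_groups[train]).isdisjoint(my_groups[test]) for train, test in splits
)

cal_clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000), cv=splits)
cal_clf.fit(X, y)

# One calibrated model is fitted per precomputed split
n_fitted = len(cal_clf.calibrated_classifiers_)
print(no_leak, n_fitted)
```

Because the splits are plain index arrays, fit never needs to receive the groups argument.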

I've created a modified version of CalibratedClassifierCV that addresses this issue for now. Until this is fixed in sklearn master, you can similarly modify the fit method of CalibratedClassifierCV to use GroupKFold. My solution can be found in this gist. It is based on sklearn version 0.24.1, but you can easily adapt it to your version of sklearn as needed.

LightGBM Python API. Best_iteration and best_score for custom evaluation function (feval)

I'm using lightgbm.train with valid_sets, early_stopping_rounds and a feval function for a multiclass problem with "objective": "multiclass". I want to find best_iteration and best_score for my custom evaluation function, but LightGBM reports them for the multi_logloss metric, which corresponds to the specified objective. So the question is: can I get best_iteration and best_score for my feval function in LightGBM, and if so, how?

This happens because the metric that corresponds to the objective is included in the list of evaluation metrics by default, and early stopping in LightGBM can be triggered by any included metric. See the short summary, and the link to another issue with a longer discussion, in this LightGBM issue.
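As an illustration, here is a sketch of a custom multiclass metric in the (name, value, is_higher_better) form that feval expects. The function name, the toy scores and the _FakeDataset helper are hypothetical. The reshape branch assumes that older LightGBM releases pass multiclass predictions as a flat array grouped by class, while newer ones pass a 2-D array. Check this against your installed version.

```python
import numpy as np


def multiclass_accuracy(preds, eval_data):
    """Custom metric returning (name, value, is_higher_better)."""
    labels = np.asarray(eval_data.get_label()).astype(int)
    preds = np.asarray(preds)
    if preds.ndim == 1:
        # Assumption: flat scores are laid out class by class,
        # so reshape to (n_classes, n_samples) and transpose.
        preds = preds.reshape(-1, labels.size).T
    accuracy = float(np.mean(preds.argmax(axis=1) == labels))
    return "accuracy", accuracy, True


class _FakeDataset:
    """Hypothetical stand-in for lightgbm.Dataset, used only to try the metric."""

    def __init__(self, labels):
        self._labels = np.asarray(labels)

    def get_label(self):
        return self._labels


# Toy scores for 4 samples and 3 classes; the last sample is misclassified
scores = np.array(
    [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
)
fake = _FakeDataset([0, 1, 2, 1])

name, value, higher_is_better = multiclass_accuracy(scores, fake)
_, value_flat, _ = multiclass_accuracy(scores.T.ravel(), fake)
print(name, value, value_flat)
```

To make early stopping follow only this metric, pass it as feval=multiclass_accuracy to lightgbm.train and set "metric": "None" (the string) in the parameters so that the built-in multi_logloss is not evaluated. best_iteration and best_score should then refer to the custom metric.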

You can use metric: "multi_error", or you can combine metrics as
metric: "multi_error", "multi_logloss"
Note that multi_error is an evaluation metric rather than an objective. It measures the classification error rate, so it reflects accuracy directly.

How to see the column that influences the prediction result the most?

I'm using Azure Machine Learning Studio in order to predict a column using Two-Class Boosted Decision Tree and split data.
The diagram that I have assembled can be found here:
I'd like to see which column in the dataset influences the prediction the most, in other words, the column that changes the prediction result more than any other column in the dataset.
Sorry if this has been asked before, but I couldn't find a proper answer to this simple question.

As said before, Permutation Feature Importance does the trick. Attach the Permutation Feature Importance block to the train block, click on the output port, and select Visualize to see the results of the module. The figure above shows the list of features sorted in descending order of their permutation importance scores.
A word of advice: be careful when interpreting permutation scores when you have highly correlated features.
For more info, see:
https://standupdata.com/category/permutation-feature-importance/
https://gallery.cortanaintelligence.com/Experiment/Permutation-Feature-Importance-5
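Outside Azure ML Studio, the same idea can be sketched with scikit-learn's permutation_importance: shuffle one column at a time on held-out data and measure how much the score drops. The synthetic data, the column names and the choice of a gradient-boosted classifier are assumptions made for illustration, and this is not the Azure module itself.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data in which, by construction, only the first column matters
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)
column_names = ["col_a", "col_b", "col_c", "col_d"]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each column of the held-out set several times and record the score drop
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

# Columns sorted from most to least influential
ranking = sorted(
    zip(column_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for column, score in ranking:
    print(f"{column}: {score:.3f}")
```

With highly correlated columns, shuffling one of them can leave the score almost unchanged because the model can still rely on the other, which is why such scores need careful interpretation.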

Most ML implementations of decision trees include something called "feature importance" in the model. For example, the scikit-learn DecisionTreeClassifier has an attribute that indicates the importance of each feature.
The Azure ML implementation should be no exception. Please look at the link below: Permutation Feature Importance.
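For the scikit-learn case mentioned above, a minimal sketch of reading that attribute and ranking the columns might look like this. The iris dataset is used only as a convenient example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# feature_importances_ holds one score per input column
ranked = sorted(
    zip(iris.feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
total = sum(score for _, score in ranked)

for column, score in ranked:
    print(f"{column}: {score:.3f}")
```

These scores are derived from the training data alone, so they can differ from permutation scores computed on held-out data.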

How to find the definition of Keras function?

Keras provides only a very brief definition of its functions, like:
Available metrics
binary_accuracy
binary_accuracy(y_true, y_pred)
...
But I want a math-formula-style definition. Does anyone know where I can find the math definitions of all the functions?

Keras is an open source project and you can find everything on GitHub. For your question, the metrics calculation code can be found here:
binary_accuracy: https://github.com/fchollet/keras/blob/master/keras/metrics.py#L20
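As far as I can tell from that source, binary_accuracy is the fraction of predictions that match the true label after rounding to 0 or 1, that is, mean(y_true == round(y_pred)). The NumPy function below is an assumed equivalent written for illustration, not Keras code. Newer Keras versions compare against a configurable threshold instead of rounding.

```python
import numpy as np


def binary_accuracy_sketch(y_true, y_pred):
    """Mean agreement between the true labels and the rounded predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.equal(y_true, np.round(y_pred)), axis=-1))


y_true = [1, 0, 1, 1]
y_pred = [0.9, 0.2, 0.4, 0.7]  # rounds to [1, 0, 0, 1], so 3 of 4 match
accuracy = binary_accuracy_sketch(y_true, y_pred)
print(accuracy)
```

Reading the other metrics in the same file this way gives their formulas directly.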

Can I predict data price based on a survey in Azure Machine Learning?

I want to predict a price for my input based on a list of questions and answers using Azure Machine Learning.
I built a model using "Bayesian Linear Regression", but it seems to predict the price based on the prices I have in my dataset and not based on the Q/A.
Am I on the wrong path, or am I missing something?
Any suggestion would be helpful.

Check that the Q/A data you are using has no missing values. If there are any missing values, use data preprocessing techniques to fill them.
What kind of answers do you have as inputs (yes/no, numeric values, free-text answers, etc.)? In my opinion, numerical values and yes/no inputs make your model more accurate.
Try different regression algorithms (https://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-cheat-sheet/) and compare their accuracy.
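The same checks can be sketched in Python with scikit-learn: fill missing answers, encode yes/no answers as numbers, and keep the price out of the inputs so that it is used only as the label. The survey columns and all values below are made up for illustration, and BayesianRidge is used here as a rough counterpart to the Azure module, not as its equivalent.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import BayesianRidge
from sklearn.preprocessing import OneHotEncoder

# Made-up survey answers: one yes/no question and one numeric question
has_garage = np.array(
    [["yes"], ["no"], ["yes"], ["no"], ["yes"], ["no"], ["yes"], ["no"]]
)
rooms = np.array([[3.0], [2.0], [np.nan], [1.0], [4.0], [2.0], [3.0], [np.nan]])

# The label: the price must not also appear among the input columns
price = np.array([300.0, 180.0, 310.0, 120.0, 400.0, 190.0, 320.0, 170.0])

# Encode the categorical answer and fill the missing numeric answers
encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(has_garage).toarray()
filled = SimpleImputer(strategy="median").fit_transform(rooms)

X = np.hstack([encoded, filled])  # features built from the answers only
model = BayesianRidge().fit(X, price)
predictions = model.predict(X)
print(X.shape, predictions.round(1))
```

If the model still seems to follow only the prices, the price column has probably leaked into the features.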

You need to set the features and the label properly. If you publish your experiment in the Gallery in unlisted mode and paste the link here, we can take a look.
