Is there a Python package or function that can calculate %incMSE and %incNodePurity in the same way that the randomForest package in R calculates them through its importance function?
If I understand correctly, %incNodePurity refers to the impurity-based (Gini) feature importance; this is exposed as the feature_importances_ attribute of sklearn.ensemble.RandomForestClassifier (RandomForestRegressor has the same attribute for regression). According to the original Random Forest paper, this gives a "fast variable importance that is often very consistent with the permutation importance measure."
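As far as I know, the permutation feature importance itself (%incMSE) was not implemented in scikit-learn when this was written. Releases from 0.22 onward provide sklearn.inspection.permutation_importance, which permutes the columns of a dataset you supply rather than the out-of-bag samples that randomForest uses, so the numbers will not match R exactly.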
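The permutation idea behind %incMSE can be sketched in plain Python. This is only an illustration of the technique, not a reproduction of what randomForest does internally (which, as far as I understand, works on out-of-bag samples tree by tree). The function name permutation_importance_mse, the toy data, and the toy_model stand-in are all made up for the example.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error of two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance_mse(predict, X, y, n_repeats=10, seed=0):
    """Percent increase in MSE after shuffling each column of X.

    `predict` is any callable mapping a list of rows to a list of
    predictions (a hypothetical stand-in for a fitted model).
    """
    rng = random.Random(seed)
    baseline = mse(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Copy X with only column j replaced by its shuffled version.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, predict(X_perm)) - baseline)
        mean_increase = sum(increases) / n_repeats
        importances.append(100.0 * mean_increase / baseline)
    return importances


# Toy data: y depends strongly on feature 0, weakly on feature 1,
# and not at all on feature 2.
data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
y = [5.0 * r[0] + 0.5 * r[1] + data_rng.gauss(0, 0.1) for r in X]


def toy_model(rows):
    """Stand-in for a fitted regressor; it ignores feature 2 entirely."""
    return [5.0 * r[0] + 0.5 * r[1] for r in rows]


scores = permutation_importance_mse(toy_model, X, y)
print([round(s, 1) for s in scores])
```

Shuffling a column the model relies on destroys its link to the target, so the error rises; shuffling a column the model ignores changes nothing, which is why the third score comes out as zero.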
As the title says, I was wondering where I can check which decision tree algorithm is used by RandomForestClassifier in scikit-learn. The attributes list says base_estimator_ = DecisionTreeClassifier, and DecisionTreeClassifier in scikit-learn is CART, so is that my answer?
link to scikit-learn RandomForest
Any suggestions would be appreciated
Scikit-learn uses an optimized version of CART by default (https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart).
It constructs trees by 'using the feature and threshold that yield the largest information gain'. The function to measure the quality of the split (a.k.a. largest information gain) in the trees can be set using the criterion parameter in the RandomForestClassifier.
The default function is the Gini impurity, but you can also select entropy. In practice these two behave quite similarly, but you can find more information here: https://datascience.stackexchange.com/questions/10228/when-should-i-use-gini-impurity-as-opposed-to-information-gain-entropy
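To make the two criteria concrete, here is a small plain-Python sketch that scores two candidate splits with both Gini impurity and entropy. It only illustrates the formulas; it is not scikit-learn's implementation, and the helper names and toy labels are invented for the example.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


parent = ["a"] * 5 + ["b"] * 5
# The first candidate split separates the classes well; the second barely helps.
good = (["a"] * 4 + ["b"], ["a"] + ["b"] * 4)
poor = (["a"] * 3 + ["b"] * 2, ["a"] * 2 + ["b"] * 3)

for name, crit in (("gini", gini), ("entropy", entropy)):
    print(name,
          round(impurity_decrease(parent, *good, crit), 3),
          round(impurity_decrease(parent, *poor, crit), 3))
```

Both criteria rank the cleaner split higher here, which matches the observation that they rarely disagree in practice.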
I'm translating a random forest built with H2O in R into a random forest using scikit-learn's RandomForestClassifier in Python. H2O's randomForest model has an argument 'stopping_rounds'. Is there a way to do this in Python with the sklearn RandomForestClassifier model? I've looked through the documentation, and I'm afraid I might have to hard-code this.
No, I don't believe scikit-learn's random forest has any sort of automatic early stopping mechanism (that's what stopping_rounds relates to in H2O algorithms). You will have to figure out the optimal number of trees manually.
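Finding the number of trees manually can be sketched as a simple patience loop. This is a rough imitation of the idea behind stopping_rounds under my own assumptions, not H2O's actual rule. The names grow_until_plateau, make_oob_scorer, and fake_curve are hypothetical; the scikit-learn pieces used inside make_oob_scorer (warm_start, oob_score, set_params, oob_score_) are real, but that helper is not exercised by the demo at the bottom, which uses a synthetic score curve instead.

```python
def grow_until_plateau(score_at, start=25, step=25, max_trees=500,
                       patience=2, tol=1e-3):
    """Increase the tree count until the score stops improving.

    `score_at(n)` must return a validation-style score (higher is better)
    for a forest with n trees. Returns (best_n, best_score).
    """
    best_n, best_score, stale = None, float("-inf"), 0
    for n in range(start, max_trees + 1, step):
        score = score_at(n)
        if score > best_score + tol:
            best_n, best_score, stale = n, score, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no meaningful gain for `patience` checks in a row
    return best_n, best_score


def make_oob_scorer(X, y, seed=0):
    """Build a `score_at` callable backed by a scikit-learn forest.

    scikit-learn is imported lazily so the loop above has no hard
    dependency on it. With warm_start=True, raising n_estimators and
    refitting adds trees to the existing forest instead of starting over.
    """
    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(warm_start=True, oob_score=True,
                                    bootstrap=True, random_state=seed)

    def score_at(n):
        forest.set_params(n_estimators=n)
        forest.fit(X, y)
        return forest.oob_score_  # out-of-bag accuracy

    return score_at


# Demonstration with a synthetic score curve that saturates at 0.9.
def fake_curve(n):
    return min(0.9, 0.8 + n / 1000)


best_n, best_score = grow_until_plateau(fake_curve)
print(best_n, best_score)
```

With real data you would call grow_until_plateau(make_oob_scorer(X, y)); the out-of-bag score is used so that no separate validation set is needed, though a held-out set would work just as well.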
Per the sklearn RandomForestClassifier docs, the min_impurity_split (deprecated, and removed in later releases) and min_impurity_decrease arguments stop individual trees from splitting further. That is not the same functionality as H2O's stopping_rounds, which limits the number of trees, but it might be what you're looking for.
I am trying to figure out what algorithms are used within the pROC package to conduct ROC analysis. For instance what algorithm corresponds to the condition 'algorithm==2'? I only recently started using R in conjunction with Python because of the ease of finding CI estimates, significance test results etc. My Python code uses Linear Discriminant Analysis to get results on a binary classification problem. When using the pROC package to compute confidence interval estimates for AUC, sensitivity, specificity, etc., all I have to do is load my data and run the package. The AUC I get when using pROC is the same as the AUC that is returned by my Python code that uses Linear Discriminant Analysis (LDA). In order to be able to report consistent results I am trying to find out if LDA is one of the algorithm choices within pROC? Any ideas on this or how to go about figuring this out would be very helpful. Where can I access the source code for pROC?
The core algorithms of pROC are described in a 2011 BMC Bioinformatics paper. Some algorithms added later are described in the PDF manual. As with every CRAN package, the source code is available from the CRAN package page. Like many R packages these days, it is also on GitHub.
To answer your question specifically, unfortunately I don't have a good reference for the algorithm to calculate the points of the ROC curve with algorithm 2. By looking at it you will realize it is ultimately equivalent to the standard ROC curve algorithm, albeit more efficient when the number of thresholds increases, as I tried to explain in this answer to a question on Cross Validated. But you have to trust me (and most packages calculating ROC curves) on it.
Which binary classifier you use, whether LDA or another, is irrelevant to ROC analysis and outside the scope of pROC. ROC analysis is a generic way to assess predictions, scores, or more generally the signal coming out of a binary classifier. It doesn't assess the binary classifier itself, or the signal detector, only the signal itself. This makes it very easy to compare different classification methods, and is instrumental to the success of ROC analysis in general.
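The point that ROC analysis only looks at the scores can be checked with a small plain-Python sketch. This is the textbook threshold sweep, not pROC's code, and the function names and toy numbers are made up for the example. It shows that the AUC depends only on how the scores rank the observations, which is why an LDA score evaluated in Python and in pROC gives the same value.

```python
import math


def roc_points(labels, scores):
    """Return (fpr, tpr) pairs, one per distinct threshold, highest first.

    labels are 0/1; a higher score means "more likely positive".
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last observation of a tied score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6]

auc_raw = auc(roc_points(labels, scores))
# A strictly increasing transform (here a logistic squashing) keeps the
# ranking, so the ROC curve and the AUC are unchanged.
squashed = [1 / (1 + math.exp(-5 * (s - 0.5))) for s in scores]
auc_squashed = auc(roc_points(labels, squashed))
print(auc_raw, auc_squashed)
```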
I want to build a Random Forest Regressor to model count data (Poisson distribution). The default 'mse' loss function is not suited to this problem. Is there a way to define a custom loss function and pass it to the random forest regressor in Python (sklearn, etc.)?
Is there any implementation to fit count data in Python in any packages?
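In sklearn, passing a custom loss function is currently not supported. See the discussion in the corresponding issue here, or this for another class, where they discuss the reasons in a bit more detail (mainly the large computational overhead of calling a Python function). For count data specifically, note that scikit-learn 1.0 and later ship a built-in criterion="poisson" for RandomForestRegressor, which may remove the need for a custom loss.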
So it could be done as discussed within the issues, by forking sklearn, implementing the cost function in Cython and then adding it to the list of available 'criterion'.
If the problem is that the counts c_i arise from different exposure times t_i, then indeed one cannot fit the counts, but one can still fit the rates r_i = c_i/t_i using MSE loss function, where one should, however, use weights proportional to the exposures, w_i = t_i.
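A quick plain-Python check shows why the weights w_i = t_i are the right choice (toy numbers, invented for the example). The constant that minimizes the weighted squared error over a group of samples is the weighted mean of the rates, and with exposure weights that equals total events divided by total exposure, which is the natural pooled rate estimate. In scikit-learn such weights would be passed through the sample_weight argument of fit.

```python
def weighted_mean(values, weights):
    """Minimizer of sum_i w_i * (v_i - m)^2 over m."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


counts = [0, 3, 1, 12]            # c_i: observed event counts
exposure = [0.5, 2.0, 1.0, 8.0]   # t_i: observation time per sample

rates = [c / t for c, t in zip(counts, exposure)]  # r_i = c_i / t_i

pooled = sum(counts) / sum(exposure)       # total events / total exposure
weighted = weighted_mean(rates, exposure)  # w_i = t_i
unweighted = sum(rates) / len(rates)       # ignores exposure

print(pooled, weighted, unweighted)
```

The unweighted mean treats a half-hour observation the same as an eight-hour one, so it drifts away from the pooled rate; the exposure-weighted mean does not.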
For a true Random Forest Poisson regression, I've seen that in R there is the rpart library for building a single CART tree, which has a Poisson regression option. I wish this kind of algorithm had been ported to scikit-learn.
In R, writing a custom objective function is fairly simple.
The randomForestSRC package in R has a provision for writing your own custom split rule. The custom split rule, however, has to be written in pure C.
All you have to do is write your own custom split rule, register the split rule, then compile and install the package.
The custom split rule has to be defined in the file called splitCustom.c in randomForestSRC source code.
You can find more info here.
The file in which you define the split rule is this.
I am working with sklearn's implementation of KNN. While my input data has about 20 features, I believe some of the features are more important than others. Is there a way to:
set the feature weights for each feature when "training" the KNN learner.
learn what the optimal weight values are with or without pre-processing the data.
On a related note, I understand that KNN generally does not require training, but since sklearn implements it using KDTrees, the tree must be generated from the training data. However, this sounds like it's turning KNN into a binary tree problem. Is that the case?
Thanks.
kNN is simply based on a distance function. When you say "feature two is more important than others", it usually means that a difference in feature two is worth, say, 10x the same difference in the other coords. A simple way to achieve this is to multiply coord #2 by its weight. So you put into the tree not the original coords but the coords multiplied by their respective weights.
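Here is a minimal plain-Python sketch of that scaling trick, using a brute-force neighbor search instead of a KD-tree to keep it short. The helper names, the toy points, and the weight of 10 are all invented for the example. With scikit-learn you would do the same thing by multiplying the columns of X by the weights before fitting KNeighborsClassifier, and applying the same scaling to every query.

```python
import math
from collections import Counter


def scale(row, weights):
    """Multiply each coordinate by its feature weight."""
    return [w * x for w, x in zip(weights, row)]


def knn_predict(train_X, train_y, query, k=3, weights=None):
    """Majority vote among the k nearest training points (Euclidean)."""
    if weights is not None:
        train_X = [scale(r, weights) for r in train_X]
        query = scale(query, weights)
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


# Feature 0 is uninformative; feature 1 separates the two classes.
train_X = [[0.0, 0.0], [1.0, 0.1], [2.0, 0.0],
           [0.1, 1.0], [1.1, 1.1], [2.1, 1.0]]
train_y = ["low", "low", "low", "high", "high", "high"]
query = [0.0, 0.6]

plain = knn_predict(train_X, train_y, query)
weighted = knn_predict(train_X, train_y, query, weights=[1.0, 10.0])
print(plain, weighted)
```

Without weights the query's neighbors are dominated by closeness in the uninformative first coordinate; after stretching the second coordinate by 10, the neighbors are chosen mostly by the informative feature and the vote flips.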
In case your features are combinations of the coords, you might need to apply appropriate matrix transform on your coords before applying weights, see PCA (principal component analysis). PCA is likely to help you with question 2.
The answer to question two is called "metric learning" and is currently not implemented in scikit-learn. Using the popular Mahalanobis distance with a diagonal covariance matrix amounts to rescaling the data using StandardScaler (with a full covariance matrix it corresponds to whitening the data). Ideally you would want your metric to take the labels into account.
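The link between the Mahalanobis distance and standardization can be verified numerically for the diagonal-covariance case. This is a plain-Python sketch with made-up data and helper names; the standardize function mimics what StandardScaler does (centering and dividing by the population standard deviation) rather than calling it.

```python
import math
import statistics


def standardize(rows):
    """Center each column and divide by its population std. deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    scaled = [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]
    return scaled, stds


def diag_mahalanobis(a, b, stds):
    """Mahalanobis distance when the covariance matrix is diagonal."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, stds)))


data = [[1.0, 200.0], [2.0, 180.0], [3.0, 260.0], [4.0, 240.0]]
scaled, stds = standardize(data)

d_maha = diag_mahalanobis(data[0], data[2], stds)
d_eucl = math.dist(scaled[0], scaled[2])
print(d_maha, d_eucl)
```

The two distances agree up to rounding. Neither uses the class labels, which is the gap that metric learning is meant to fill.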