Save Scikit-learn RandomForestClassifier in Weka J48 format

Save Scikit-learn RandomForestClassifier in Weka J48 format - scikit-learn

I'm looking to save a fitted RandomForestClassifier in the format specified Page 27 - Figure 26 here. Alternatively, if such a thing does not exist, how can I extract the internals of the decision trees to build the J48 formatted file myself? DecisionTreeClassifier makes this available through tree.export_text, but RandomForestClassifier does not have a tree_ attribute.

Toward the last sentence of your question, RandomForestClassifier has its fitted trees accessible through the attribute estimators_; these are instances of DecisionTreeClassifier, so each of them has a tree_ attribute.

If you use the sklearn-weka-plugin (examples), and train Weka's RandomForest on your data, then you can create that output by simply turning the model into a string using str(...).
Of course, this will require a Java JDK installation to run.

Related

Extracting fixed vectors from BioBERT without using terminal command?

If we want to use weights from pretrained BioBERT model, we can execute following terminal command after downloading all the required BioBERT files.
os.system('python3 extract_features.py \
--input_file=trial.txt \
--vocab_file=vocab.txt \
--bert_config_file=bert_config.json \
--init_checkpoint=biobert_model.ckpt \
--output_file=output.json')
The above command actually reads individual file containing the text, reads the textual content from it, and then writes the extracted vectors to another file. So, the problem with this is that it could not be scaled easily for very large data-sets containing thousands of sentences/paragraphs.
Is there is a way to extract these features on the go (using an embedding layer) like it could be done for the word2vec vectors in PyTorch or TF1.3?
Note: BioBERT checkpoints do not exist for TF2.0, so I guess there is no way it could be done with TF2.0 unless someone generates TF2.0 compatible checkpoint files.
I will be grateful for any hint or help.

You can get the contextual embeddings on the fly, but the total time spend on getting the embeddings will always be the same. There are two options how to do it: 1. import BioBERT into the Transformers package and treat use it in PyTorch (which I would do) or 2. use the original codebase.
1. Import BioBERT into the Transformers package
The most convenient way of using pre-trained BERT models is the Transformers package. It was primarily written for PyTorch, but works also with TensorFlow. It does not have BioBERT out of the box, so you need to convert it from TensorFlow format yourself. There is convert_tf_checkpoint_to_pytorch.py script that does that. People had some issues with this script and BioBERT (seems to be resolved).
After you convert the model, you can load it like this.
import torch
from transformers import *
# Load dataset, tokenizer, model from pretrained model/vocabulary
tokenizer = BertTokenizer.from_pretrained('directory_with_converted_model')
model = BertModel.from_pretrained('directory_with_converted_model')
# Call the model in a standard PyTorch way
embeddings = model([tokenizer.encode("Cool biomedical tetra-hydro-sentence.", add_special_tokens=True)])
2. Use directly BioBERT codebase
You can get the embeddings on the go basically using the code that is exctract_feautres.py. On lines 346-382, they initialize the model. You get the embeddings by calling estimator.predict(...).
For that, you need to format your format the input. First, you need to format the string (using code on line 326-337) and then apply and call convert_examples_to_features on it.

XGboost classifier

I am new to XGBoost and I am currently working on a project where we have built an XGBoost classifier. Now we want to run some feature selection techniques. Is backward elimination method a good idea for this? I have used it in regression but I am not sure if/how to use it in a classification problem. Any leads will be greatly appreciated.
Note: I have already tried permutation line importance and it has yielded good results! Looking for another method to evaluate the features in the model.

Consider asking your question on Cross Validated since feature selection is more about theory/practice than code.
What is your concern ? Remove "noisy" features who drive down your results, obtain a sparse model ? Backward selection is one way to do of course. That being said, not sure if you are aware of this but XGBoost computes its own "variable importance" values.
# plot feature importance using built-in function
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model)
pyplot.show()
Something like this. This importance is based on how many times a feature is used to make a split. You can then define for instance a threshold below which you do not keep the variables. However do not forget that :
This variable importance has been obtained on the training data only
The removal of a variable with high importance may not affect your prediction error, e.g. if it is correlated with another highly important variable. Other tricks such as this one may exist.

Adding correct labels to decision trees

I am using a Random Forests regressor in a Machine Learning Project. In order to better understand the logic of the predictions, I'd like to visualize some decision trees and check which features are used when.
In order to do so, I wrote the following code:
from sklearn.tree import export_graphviz
from subprocess import call
from IPython.display import Image
# Select one estimator from the Random Forests
estimator = best_estimators_regr['RandomForestRegressor'][0].estimators_[0]
export_graphviz(estimator, out_file=path+'tree.dot',
rounded=True, proportion=False,
precision=2, filled=True)
call(['dot', '-Tpng', path+'tree.dot', '-o', path+'tree.png', '-Gdpi=600'])
Image(filename=path+'tree.png')
The problem is that I use the max_features parameter when training the model, so I do not know which features are used in each tree. Thus, when plotting a tree, I simply get X[some_number]. Does this number correspond to the column in the original dataset? If not, how can I tell it to use the name of the columns rather than the number?

The 'max_features' parameter in RandomForestClassifier is used to get the number of features at a time to find the best split. That parameter is passed to all the individual estimators (DecisionTreeClassifier). The base DecisionTreeClassifier objects all accept the whole data (where the samples are sampled from the training data but all column features are passed to each tree). The feature ordering is decided into that single DecisionTreeClassifier object. So no need to worry about that.
You can just use the feature_names parameter in export_graphviz to pass the names of each features for all your features.

Convert scikit-learn SVM model to LibSVM

I have trained a SVM (svc) using scikit-learn over half a terabyte of data. The model is working fine and I need to port it to C, but I don't want to re-train the SVM from scratch because it takes way too long for me. Is there a way to easily export the model generated by scikit-learn and import it into LibSVM? Internally scikit-learn uses LibSVM so theoretically it should be possible, but I haven't been able to find anything in the documentation. Any suggestion?

Is there a way to easily export the model generated by scikit-learn and import it into LibSVM?
No. The scikit-learn version of LIBSVM has been hacked up severely to fit it into the Python environment and the model is stored as NumPy/SciPy data structures.
Your best shot is to study the SVM decision function and reimplement it in C. The support vectors can be obtained from the SVC object as NumPy arrays, which are easily translated to C arrays.

One standard error rule for cross-validation in scikit-learn

I'm trying to fit some models in scikit-learn using grisSearchCV, and I would like to use the "one standard error" rule to select the best model, i.e. selecting the most parsimonious model from the subset of models whose score is within one standard error of the best score. Is there a way to do this?

You can compute the standard error of the mean of the validation scores using:
from scipy.stats import sem
Then access the grid_scores_ attribute of the fitted GridSearchCV object. This attribute has changed in the master branch of scikit-learn so please use an interactive shell to introspect its structure.
As for selecting the most parsimonious model, the model parameters of the models do not always have a degrees of freedom interpretation. The meaning of the parameters is often model specific and there is no high level metadata to interpret their "parsimony". You can have to encode your interpretation on a case by case basis for each model class.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Save Scikit-learn RandomForestClassifier in Weka J48 format - scikit-learn

Toward the last sentence of your question, RandomForestClassifier has its fitted trees accessible through the attribute estimators_; these are instances of DecisionTreeClassifier, so each of them has a tree_ attribute.

If you use the sklearn-weka-plugin (examples), and train Weka's RandomForest on your data, then you can create that output by simply turning the model into a string using str(...). Of course, this will require a Java JDK installation to run.

Related

Extracting fixed vectors from BioBERT without using terminal command?

XGboost classifier

Adding correct labels to decision trees

Convert scikit-learn SVM model to LibSVM

One standard error rule for cross-validation in scikit-learn

Categories

Resources