intel daal4py classifiers with scikit-learn

I am testing the sklearn-compatible wrappers for the latest version of the intel daal4py classifiers. The intel k-nearest classifier works fine with sklearn’s cross_val_score() and GridSearchCV. The performance boost from the intel classifier is significant and the intel and sklearn models provide generally comparable results across 10 different large public datasets and some simulated datasets.
The sklearn-compatible wrapper for the intel random forest classifier seems to be completely broken. The score() method does not work, so I cannot proceed further with the intel random forest wrapper class.
I posted this at the intel AI Developer Forum, but I was wondering if anyone here has gotten the intel sklearn-compatible random forest classifier to work.
My next step is to test the native daal4py random forest object and possibly write my own wrapper, because the native daal4py API is so different from sklearn's. I was hoping to avoid this.
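In case it helps anyone going down the same path, here is a rough sketch of what the native daal4py calls look like, modelled on the batch examples shipped with daal4py (parameter names may differ between versions; train_data, train_labels and test_data are placeholders):
import daal4py as d4p
import numpy as np
n_classes = len(np.unique(train_labels))
# train a decision forest with the native daal4py batch API
train_algo = d4p.decision_forest_classification_training(nClasses=n_classes, nTrees=100)
train_result = train_algo.compute(train_data, train_labels.reshape(-1, 1))
# predict with the trained model and pull out the predicted labels
predict_algo = d4p.decision_forest_classification_prediction(nClasses=n_classes)
y_pred = predict_algo.compute(test_data, train_result.model).prediction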
There seems to be some confusion on the intel site regarding the names of the wrapper classes.
I am using:
For k-nearest: daal4py.sklearn.neighbors.kdtree_knn_classifier (this works fine)
For random forest: daal4py.sklearn.ensemble.decision_forest.RandomForestClassifier
The failure in the intel RandomForestClassifier is in forest.py: n_classes_ is an int (it matches the number of classes for the integer label variable that is passed), but the code below indexes it as if it were a per-output list:
predictions = [np.zeros((n_samples, n_classes_[k]))
               for k in range(self.n_outputs_)]
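For what it's worth, a two-line illustration of the failure mode (purely illustrative, not code from forest.py):
n_classes_ = 3   # single-output case: an int, as described above
n_classes_[0]    # TypeError: 'int' object is not subscriptable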

Please find below the steps we used to compute scores for the daal4py RandomForestClassifier.
(i) For cross_val_score
from daal4py.sklearn.ensemble.decision_forest import RandomForestClassifier
from sklearn.model_selection import cross_val_score
clf = RandomForestClassifier()
scores = cross_val_score(clf, train_data, train_labels, cv=3)
print(scores)
(ii) For GridSearchCV
from sklearn.model_selection import GridSearchCV
from daal4py.sklearn.ensemble.decision_forest import RandomForestClassifier
param_grid = {
'n_estimators': [200, 700],
'max_features': ['auto', 'sqrt', 'log2']
}
clf = RandomForestClassifier()
CV_rfc = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5)
CV_rfc.fit(train_data, train_labels)
score = CV_rfc.score(train_data, train_labels)
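After the search finishes, it is also worth inspecting the selected parameters and scores, e.g.:
print(CV_rfc.best_params_)
print(CV_rfc.best_score_)
print(score)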

Related

RandomForestClassifier crashes on training

I am using scikit-learn's RandomForestClassifier to build a binary classifier. Whenever I try fitting the model instance on the training dataset, the kernel crashes after 3 or 4 seconds. I read a Stack Overflow answer (Link: Jupyter Notebook and Colab Keep Crashing From Running Random Forest Model) suggesting pruning methods, but they don't seem to work.
The code for the Classifier is as follows -
# implementing RF
from sklearn.ensemble import RandomForestClassifier
# Instantiating rf model
rf_model = RandomForestClassifier(n_estimators = 10, random_state = 42, max_depth=10, max_leaf_nodes=10, max_features=None)
# Fitting the model
rf_model.fit(train_features, train_labels.ravel())
The shapes of the training and testing datasets are as follows -
Training Features shape: (224553, 54)
Training Labels shape: (224553, 1)
Testing Features shape: (74852, 54)
Testing Labels shape: (74852, 1)
I have tried various methods but can't seem to fit the dataset, neither on my local machine nor on Google Colab. My machine specs are -
16 GB RAM
Intel i7
Nvidia Quadro 4 GB graphics card
It would be great if you could help me with it. Thank you in advance.

Single prediction using a model pre-trained with scaled features

I trained an SVM scikit-learn model with scaled features and persisted it to be used later. In another file I loaded the saved model and I want to submit a new set of features to perform a prediction. Do I have to scale this new set of features? How can I do this with only one set of features?
I am not scaling the new values, I am getting weird outcomes, and I cannot do the predictions. Despite this, prediction with a large test set generated by StratifiedShuffleSplit works fine and I get 97% accuracy.
The problem is with single predictions using a persisted SVM model trained with scaled features. Any idea what I am doing wrong?
Yes, you should absolutely perform the same scaling on the new data. However, this might be impossible if you haven't saved the scaler you trained before.
This is why, instead of training and saving only your SVM, you should train and save your scaler together with your SVM. In machine learning jargon, this is called a Pipeline.
This is how you would use it on a toy example:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X,y)
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
This pipeline then supports the same operations as a regular scikit-learn model:
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
When fitting the pipe, it first scales and then feeds the scaled features into the classifier.
Once it is trained, you can save the pipe object just like you saved the SVM before. When you load it and apply it to new data, it will do the scaling as desired before making predictions.
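For instance, a minimal sketch of persisting and reusing the fitted pipeline with joblib (the file name and new_features are placeholders):
import joblib
import numpy as np
# save the fitted pipeline (scaler + SVM together)
joblib.dump(pipe, 'svc_pipeline.joblib')
# later, in another file: load it and predict on a single sample
loaded_pipe = joblib.load('svc_pipeline.joblib')
new_features = X_test[0]   # one unscaled feature vector
prediction = loaded_pipe.predict(np.asarray(new_features).reshape(1, -1))
The loaded pipeline applies the same StandardScaler that was fitted on the training data before the SVM sees the sample, so single predictions stay consistent with the test-set results.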

Balanced Random Forest in scikit-learn (python)

I'm wondering if there is an implementation of the Balanced Random Forest (BRF) in recent versions of the scikit-learn package. BRF is used in the case of imbalanced data. It works like a normal RF, but for each bootstrapping iteration it balances the class prevalence by undersampling. For example, given two classes with N0 = 100 and N1 = 30 instances, at each random sampling it draws (with replacement) 30 instances from the first class and the same number of instances from the second class, i.e. it trains each tree on a balanced data set. For more information please refer to this paper.
RandomForestClassifier() does have the 'class_weight=' parameter, which might be set to 'balanced', but I'm not sure that it is related to downsampling of the bootstrapped training samples.
What you're looking for is the BalancedBaggingClassifier from imblearn.
imblearn.ensemble.BalancedBaggingClassifier(base_estimator=None,
n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True,
bootstrap_features=False, oob_score=False, warm_start=False, ratio='auto',
replacement=False, n_jobs=1, random_state=None, verbose=0)
Effectively, what it allows you to do is successively undersample your majority class while fitting an estimator on top. You can use a random forest or any base estimator from scikit-learn. Here is an example:
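A minimal sketch, assuming the older imblearn signature shown above (newer releases rename base_estimator to estimator and ratio to sampling_strategy) and placeholder X_train, y_train, X_test arrays:
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
                                n_estimators=10,
                                ratio='auto',
                                replacement=False,
                                random_state=0)
bbc.fit(X_train, y_train)
y_pred = bbc.predict(X_test)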
There is now a class in imblearn called BalancedRandomForestClassifier. It works similarly to the previously mentioned BalancedBaggingClassifier but is tailored specifically to random forests.
from imblearn.ensemble import BalancedRandomForestClassifier
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
brf.fit(X_train, y_train)
y_pred = brf.predict(X_test)

Use caffe to simulate the SGDClassifier or LogisticRegression linear models in sklearn

I'm trying to use caffe to simulate the SGDClassifier and LogisticRegression linear models in sklearn. As we all know, in caffe one "InnerProduct" layer plus one "SoftmaxWithLoss" layer represent a logistic regression Y = Logit(WX+b).
I'm now using the digits dataset in the sklearn datasets package as the training set (5/6 of all the data-label pairs) and testing set (the remaining 1/6). However, the accuracy obtained by SGDClassifier() or LogisticRegression() reaches nearly 90%, while the accuracy obtained by the two-layer neural network cannot exceed 30% after training. Is this because of the parameter settings or something else? The gap between them just seems too large.
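For reference, here is a minimal sketch of the sklearn side of that comparison (digits dataset, roughly a 5/6 train and 1/6 test split); the caffe side is not reproduced here:
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.model_selection import train_test_split
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=1/6, random_state=0)
for clf in (SGDClassifier(max_iter=1000), LogisticRegression(max_iter=1000)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))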

sklearn: vectorizing in cross validation for text classification

I have a question about using cross validation for text classification in sklearn. It is problematic to vectorize all data before cross validation, because the classifier would have "seen" the vocabulary that occurs in the test data. Weka has a FilteredClassifier to solve this problem. What is the sklearn equivalent of this functionality? I mean that for each fold, the feature set would be different because the training data are different.
The scikit-learn solution to this problem is to cross-validate a Pipeline of estimators, e.g.:
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import LinearSVC
>>> clf = Pipeline([('vect', TfidfVectorizer()), ('svm', LinearSVC())])
clf is now a composite estimator that does feature extraction and SVM model fitting. Given a list of documents (i.e. an ordinary Python list of strings) documents and their labels y, calling
>>> cross_val_score(clf, documents, y)
will do feature extraction in each fold separately, so that each of the SVMs knows only the vocabulary of its (k-1)-fold training set.
