sklearn random forest: .oob_score_ too low? - scikit-learn

I was searching for applications for random forests, and I found the following knowledge competition on Kaggle:
https://www.kaggle.com/c/forest-cover-type-prediction.
Following the advice at
https://www.kaggle.com/c/forest-cover-type-prediction/forums/t/8182/first-try-with-random-forests-scikit-learn,
I used sklearn to build a random forest with 500 trees.
The .oob_score_ was ~2%, but the score on the holdout set was ~75%.
There are only seven classes to classify, so 2% is really low. I also consistently got scores near 75% when I cross validated.
Can anyone explain the discrepancy between the .oob_score_ and the holdout/cross validated scores? I would expect them to be similar.
There's a similar question here:
https://stats.stackexchange.com/questions/95818/what-is-a-good-oob-score-for-random-forests
Edit: I think it might be a bug, too.
The code is given by the original poster in the second link I posted. The only change is that you have to set oob_score = True when you build the random forest.
I didn't save the cross validation testing I did, but I could redo it if people need to see it.

Q: Can anyone explain the discrepancy ...
A: The .oob_score_ value you observed on the sklearn.ensemble.RandomForestClassifier object is not a bug-related issue.
First, RandomForest-based predictors { Classifier | Regressor } belong to a rather specific corner of the so-called ensemble methods, so be aware that typical approaches, incl. Cross-Validation, do not work the same way as for other AI/ML learners.
The RandomForest "inner" logic relies heavily on a RANDOM-PROCESS: while the forest is generated, the Samples ( DataSET X ) with known y == { labels ( for Classifier ) | targets ( for Regressor ) } are bootstrapped, so that each tree gets a RANDOMLY drawn part of the DataSET that it can see and a part that it will not see ( thus forming an inner oob-subSET ).
Besides other effects on sensitivity to overfitting et al, the RandomForest ensemble does not need to be Cross-Validated, because by design it does not over-fit as more trees are added. Many papers, and also Breiman's ( Berkeley ) empirical results, support this statement, as they brought evidence that a CV-ed predictor will score about the same as the .oob_score_.
import sklearn.ensemble
aRF_PREDICTOR = sklearn.ensemble.RandomForestRegressor( n_estimators = 10, # The number of trees in the forest.
criterion = 'mse', # { Regressor: 'mse' | Classifier: 'gini' }
max_depth = None,
min_samples_split = 2,
min_samples_leaf = 1,
min_weight_fraction_leaf = 0.0,
max_features = 'auto',
max_leaf_nodes = None,
bootstrap = True,
oob_score = False, # SET True to get inner-CrossValidation-alike .oob_score_ attribute calculated right during Training-phase on the whole DataSET
n_jobs = 1, # { 1 | n-cores | -1 == all-cores }
random_state = None,
verbose = 0,
warm_start = False
)
aRF_PREDICTOR.estimators_ # aList of <DecisionTreeRegressor> The collection of fitted sub-estimators.
aRF_PREDICTOR.feature_importances_ # array of shape = [n_features] The feature importances (the higher, the more important the feature).
aRF_PREDICTOR.oob_score_ # float Score of the training dataset obtained using an out-of-bag estimate.
aRF_PREDICTOR.oob_prediction_ # array of shape = [n_samples] Prediction computed with out-of-bag estimate on the training set.
aRF_PREDICTOR.apply( X ) # Apply trees in the forest to X, return leaf indices.
aRF_PREDICTOR.fit( X, y[, sample_weight] ) # Build a forest of trees from the training set (X, y).
aRF_PREDICTOR.fit_transform( X[, y] ) # Fit to data, then transform it.
aRF_PREDICTOR.get_params( [deep] ) # Get parameters for this estimator.
aRF_PREDICTOR.predict( X ) # Predict regression target for X.
aRF_PREDICTOR.score( X, y[, sample_weight] ) # Returns the coefficient of determination R^2 of the prediction.
aRF_PREDICTOR.set_params( **params ) # Set the parameters of this estimator.
aRF_PREDICTOR.transform( X[, threshold] ) # Reduce X to its most important features.
One should also be aware that the default values do not serve best, and certainly do not serve well under all circumstances. One should consider the problem domain and propose a reasonable ensemble parametrisation before moving further.
Q: What is a good .oob_score_ ?
A: .oob_score_ is RANDOM! Yes, it MUST ( be random ).
While this sounds like a provocative epilogue, do not throw your hopes away.
The RandomForest ensemble is a great tool. Some problems may come with categorical values in features ( DataSET X ), however the costs of processing the ensemble are still adequate once you need not struggle with either bias or overfitting. That's great, isn't it?
Because results should be reproducible on subsequent re-runs, it is a recommendable practice to (re-)set numpy.random & .set_params( random_state = ... ) to a known state before the RANDOM-PROCESS ( embedded into every bootstrapping of the RandomForest ensemble ). Doing that, one may observe a "de-noised" progression of the RandomForest-based predictor towards a better .oob_score_ that is due to truly improved predictive power, introduced by more ensemble members ( n_estimators ) or less constrained tree construction ( max_depth, max_leaf_nodes et al ), and not just to "better luck" during the RANDOM-PROCESS of how the DataSET gets split...
Getting closer to better solutions typically involves adding more trees to the ensemble ( RandomForest decisions are based on a majority vote, so 10 estimators is not a big basis for making good decisions on highly complex DataSETs ). Numbers above 2000 are not uncommon. One may iterate over a range of sizings ( with the RANDOM-PROCESS kept under state-full control ) to demonstrate the ensemble "improvements".
If initial values of .oob_score_ fall somewhere around 0.51 - 0.53 on a balanced two-class problem, your ensemble is only 1% - 3% better than a RANDOM-GUESS ( with seven equally frequent classes, a random guess scores about 0.14 ).
Only after you have tuned your ensemble-based predictor into something better should you move on to additional tricks such as feature engineering et al.
aRF_PREDICTOR.oob_score_ Out[79]: 0.638801 # n_estimators = 10
aRF_PREDICTOR.oob_score_ Out[89]: 0.789612 # n_estimators = 100
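To make the comparison between .oob_score_ and a cross-validated score concrete, here is a minimal sketch. It does not reproduce the Kaggle data: the synthetic data set from make_classification, the three ensemble sizes and the fixed random_state = 0 are all assumptions chosen for illustration, and y is passed as a plain 1-D array of class labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 3-class problem (an assumption, not the Kaggle forest-cover data).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

results = {}
for n_trees in (25, 100, 200):
    # A fixed random_state keeps the bootstrap sampling reproducible between runs.
    forest = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                    random_state=0).fit(X, y)
    # 5-fold cross-validated accuracy of an identically configured forest.
    cv_score = cross_val_score(
        RandomForestClassifier(n_estimators=n_trees, random_state=0),
        X, y, cv=5).mean()
    results[n_trees] = (forest.oob_score_, cv_score)
    print(f"n_estimators={n_trees:4d}  oob={forest.oob_score_:.3f}  cv={cv_score:.3f}")
```

On data like this the two estimates should land close to each other, so an OOB score that sits far below the cross-validated score is worth investigating rather than accepting as noise.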

Related

How does a trained SVR model predict values?

I've been trying to understand how a model trained with support vector machines for regression predicts values. I have trained a model with sklearn.svm.SVR, and now I'm wondering how to "manually" predict the outcome of an input.
Some background - the model is trained with kernel SVR, with RBF function and uses the dual formulation. So now I have arrays of the dual coefficients, the indexes of the support vectors, and the support vectors themselves.
I found the function which is used to fit the hyperplane but I've been unsuccessful in applying that to "manually" predict outcomes without the function .predict.
The few things I tried all include the dot products of the input (features) array, and all the support vectors.
If anyone ever needs this, I've managed to understand the equation and code it in python.
The following is the equation used for the dual formulation:
$$ f(x) = \sum_{i=1}^{N} \alpha_i y_i \, x_i^T x + b $$
where N is the number of observations, and αi multiplied by yi are the dual coefficients found in the model's attribute model.dual_coef_. The xiT are some of the observations used for training (the support vectors), accessed through the attribute model.support_vectors_ (transposed to allow multiplication of the two matrices), x is the input vector containing a value for each feature (it is the one observation for which we want a prediction), and b is the intercept, accessed through model.intercept_.
The xiT and x, however, are the observations transformed into a higher-dimensional space, as explained by mery in this post.
The RBF transformation can be either applied manually, step by step, or computed with sklearn.metrics.pairwise.rbf_kernel.
With the latter, the code would look like this (my case shows I have 589 support vectors, and 40 features).
First we access the coefficients and vectors:
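import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

support_vectors = model.support_vectors_  # shape (589, 40) in my case
dual_coefs = model.dual_coef_[0]          # shape (589,)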
Then:
pred = (np.matmul(dual_coefs.reshape(1,589),
rbf_kernel(support_vectors.reshape(589,40),
Y=input_array.reshape(1,40),
gamma=model.get_params()['gamma']
)
)
+ model.intercept_
)
If the RBF function needs to be applied manually, step by step, then:
vrbf = support_vectors.reshape(589,40) - input_array.reshape(1,40)
pred = (np.matmul(dual_coefs.reshape(1,589),
np.diag(np.exp(-model.get_params()['gamma'] *
np.matmul(vrbf, vrbf.T)
)
).reshape(589,1)
)
+ model.intercept_
)
I placed the .reshape() function even where it is not necessary, just to emphasize the shapes for the matrix operations.
These both give the same results as model.predict(input_array)
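The same check can be written without hard-coded shapes. The following is a minimal sketch on synthetic data: the data set, the values of C and gamma, and the variable names are assumptions made for illustration. gamma is set to an explicit number here, because a string setting such as 'scale' cannot be passed to rbf_kernel directly.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 4))                 # synthetic training inputs
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.randn(200)

gamma = 0.5                                           # explicit numeric gamma (assumption)
model = SVR(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

x_new = rng.uniform(-3, 3, size=(5, 4))               # five new observations

# Variant 1: kernel values between support vectors and inputs via rbf_kernel.
K = rbf_kernel(model.support_vectors_, x_new, gamma=gamma)     # (n_sv, 5)
manual = (model.dual_coef_ @ K + model.intercept_).ravel()

# Variant 2: the RBF kernel by hand, exp(-gamma * squared Euclidean distance).
diff = model.support_vectors_[:, None, :] - x_new[None, :, :]  # (n_sv, 5, 4)
K_hand = np.exp(-gamma * np.sum(diff ** 2, axis=2))
manual_hand = (model.dual_coef_ @ K_hand + model.intercept_).ravel()

print(np.allclose(manual, model.predict(x_new)))
print(np.allclose(manual_hand, model.predict(x_new)))
```

Summing the squared differences per support vector avoids building the full square matrix and taking its diagonal, which matters once there are many support vectors.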

Multiclass semantic segmentation model evaluation

I am doing a project on multiclass semantic segmentation. I have built a model that outputs pretty decent segmented images by decreasing the loss value. However, I cannot evaluate the model performance with metrics such as mean IoU or the Dice coefficient.
In case of binary semantic segmentation it was easy just to set the threshold of 0.5, to classify the outputs as an object or background, but it does not work in the case of multiclass semantic segmentation. Could you please tell me how to obtain model performance on the aforementioned metrics? Any help will be highly appreciated!
By the way, I am using PyTorch framework and CamVid dataset.
If anyone is interested in this answer, please also look at this issue. The author of the issue points out that mIoU can be computed in a different way (and that method is more accepted in literature). So, consider that before using the implementation for any formal publication.
Basically, the other method suggested by the issue-poster is to accumulate the intersections and unions separately over the entire dataset and divide them only at the final step. The method in the original answer below computes the intersection and union for a batch of images, divides them to get the IoU for the current batch, and then takes the mean of the batch IoUs over the entire dataset.
However, the original method given below is problematic because the final mean IoU varies with the batch size. With the method mentioned in the issue, the mIoU does not vary with the batch size, because the separate accumulation makes the batch size irrelevant (though a larger batch size can definitely help speed up the evaluation).
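A minimal sketch of the dataset-level accumulation described above, written with NumPy on arrays of class indices (PyTorch tensors would first need something like .cpu().numpy()). The function names are hypothetical, and counting a class as present whenever its accumulated union is non-zero is an assumption of this sketch, not necessarily the convention used in the issue.

```python
import numpy as np

def batch_counts(label, pred, num_classes):
    """Per-class intersection and union pixel counts for one batch."""
    label = np.asarray(label).ravel()
    pred = np.asarray(pred).ravel()
    inter = np.zeros(num_classes, dtype=np.int64)
    union = np.zeros(num_classes, dtype=np.int64)
    for c in range(num_classes):
        pred_inds = pred == c
        target_inds = label == c
        inter[c] = np.logical_and(pred_inds, target_inds).sum()
        union[c] = np.logical_or(pred_inds, target_inds).sum()
    return inter, union

def dataset_miou(batches, num_classes):
    """Accumulate counts over all batches and divide only once at the end."""
    total_inter = np.zeros(num_classes, dtype=np.int64)
    total_union = np.zeros(num_classes, dtype=np.int64)
    for label, pred in batches:
        inter, union = batch_counts(label, pred, num_classes)
        total_inter += inter
        total_union += union
    present = total_union > 0            # skip classes never seen anywhere
    return float(np.mean(total_inter[present] / total_union[present]))

label = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
one_batch = dataset_miou([(label, pred)], num_classes=3)
two_batches = dataset_miou([(label[:2], pred[:2]), (label[2:], pred[2:])],
                           num_classes=3)
print(one_batch, two_batches)
```

Both calls give the same value because the division happens only after all counts are summed, which is what makes the result independent of how the data is batched.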
Original answer:
Given below is an implementation of mean IoU (Intersection over Union) in PyTorch.
import numpy as np
import torch
import torch.nn.functional as F

def mIOU(label, pred, num_classes=19):
    # pred: raw model output of shape (N, C, H, W); label: class indices
    pred = F.softmax(pred, dim=1)
    pred = torch.argmax(pred, dim=1).squeeze(1)
    iou_list = list()
    present_iou_list = list()
    pred = pred.view(-1)
    label = label.view(-1)
    # Note: Following for loop goes from 0 to (num_classes-1)
    # and ignore_index is num_classes, thus ignore_index is
    # not considered in computation of IoU.
    for sem_class in range(num_classes):
        pred_inds = (pred == sem_class)
        target_inds = (label == sem_class)
        if target_inds.long().sum().item() == 0:
            # class absent from the target: mark as NaN and skip it in the mean
            iou_now = float('nan')
        else:
            intersection_now = (pred_inds[target_inds]).long().sum().item()
            union_now = pred_inds.long().sum().item() + target_inds.long().sum().item() - intersection_now
            iou_now = float(intersection_now) / float(union_now)
            present_iou_list.append(iou_now)
        iou_list.append(iou_now)
    return np.mean(present_iou_list)
The prediction of your model holds one score per class at each pixel, so first take softmax (if your model doesn't already) followed by argmax to get the index of the class with the highest probability at each pixel. Then, we calculate IoU for each class (and take the mean over the classes at the end).
We can reshape both the prediction and the label into 1-D vectors (I read that it makes the computation faster). For each class, we first identify the pixels of that class using pred_inds = (pred == sem_class) and target_inds = (label == sem_class). The resulting pred_inds and target_inds will have 1 at pixels labelled as that particular class and 0 for any other class.
Then, there is a possibility that the target does not contain that particular class at all. This makes that class's IoU calculation invalid, as the class is not present in the target. So, you assign such classes a NaN IoU (so you can identify them later) and do not involve them in the calculation of the mean.
If the particular class is present in the target, then pred_inds[target_inds] will give a vector of 1s and 0s where indices with 1 are those where prediction and target are equal and zero otherwise. Taking the sum of all elements of this will give us the intersection.
If we add all the elements of pred_inds and target_inds, we'll get the union + intersection of pixels of that particular class. So, we subtract the already calculated intersection to get the union. Then, we can divide the intersection and union to get the IoU of that particular class and add it to a list of valid IoUs.
At the end, you take the mean of the entire list to get the mIoU. If you want the Dice Coefficient, you can calculate it in a similar fashion.
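For the Dice coefficient, a minimal NumPy sketch following the same per-class pattern could look like this. It assumes the usual definition Dice = 2 * intersection / (|prediction| + |target|); the function name mean_dice is hypothetical, and classes absent from the target are skipped, as in the IoU explanation above.

```python
import numpy as np

def mean_dice(label, pred, num_classes):
    """Mean per-class Dice over the classes that occur in the target."""
    label = np.asarray(label).ravel()
    pred = np.asarray(pred).ravel()
    scores = []
    for c in range(num_classes):
        pred_inds = pred == c
        target_inds = label == c
        if target_inds.sum() == 0:
            continue                      # class not present in the target
        intersection = np.logical_and(pred_inds, target_inds).sum()
        scores.append(2.0 * intersection / (pred_inds.sum() + target_inds.sum()))
    return float(np.mean(scores))

label = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
dice = mean_dice(label, pred, num_classes=3)
print(dice)  # per-class Dice is 0.5, 0.8 and 2/3 for this toy input
```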

SVM model averaging in sklearn

I would like to average the scores of two different SVMs trained on different samples but the same classes
# Data have the same label: x_1[1] has y_1[1] and x_2[1] has y_2[1]
# Where y_2[1] == y_1[1]
Dataset_1=(x_1,y)
Dataset_2=(x_2,y)
test_data=(test_sample,test_labels)
We have 50 classes. Same classes for dataset_1 and dataset_2 :
list(set(y_1))=list(set(y_2))
What I have tried:
from sklearn.svm import SVC
clf_1 = SVC(kernel='linear', random_state=42).fit(x_1, y)
clf_2 = SVC(kernel='linear', random_state=42).fit(x_2, y)
How can I average the scores of clf_1 and clf_2 before calling predict(test_sample)?
What I would like to do?
Not sure I understand your question; to simply average the scores as in a typical ensemble, you should first get prediction probabilities from each model separately, and then just take their average:
pred1 = clf_1.predict_proba(test_sample)
pred2 = clf_2.predict_proba(test_sample)
pred = (pred1 + pred2)/2
In order to get prediction probabilities instead of hard classes, you should initialize the SVC using the additional argument probability=True.
Each row of pred will be an array of length 50, as many as your classes, with each element representing the probability that the sample belongs to the respective class.
After averaging, simply take the argmax of pred - just be sure that the order of the returned probabilities is OK; according to the docs:
The columns correspond to the classes in sorted order, as they appear in the attribute classes_
As I am not exactly sure what this means, run some checks with predictions on your training set, to be sure that the order is correct.
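One way to run such a check is to compare the classes_ attribute of both classifiers and to map the argmax back to labels through it. The sketch below uses synthetic data with three classes instead of 50; the data, the split into two training sets and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
x_1, y_1 = X[:150], y[:150]              # first training sample
x_2, y_2 = X[150:300], y[150:300]        # second training sample, same classes
test_sample, test_labels = X[300:], y[300:]

# probability=True is required for predict_proba.
clf_1 = SVC(kernel="linear", probability=True, random_state=42).fit(x_1, y_1)
clf_2 = SVC(kernel="linear", probability=True, random_state=42).fit(x_2, y_2)

# Column j of predict_proba belongs to classes_[j], so both models must agree.
if not np.array_equal(clf_1.classes_, clf_2.classes_):
    raise ValueError("classifiers were trained on different label sets")

pred = (clf_1.predict_proba(test_sample) + clf_2.predict_proba(test_sample)) / 2
predicted_labels = clf_1.classes_[np.argmax(pred, axis=1)]
print((predicted_labels == test_labels).mean())
```

Indexing classes_ with the argmax avoids relying on the labels happening to be 0, 1, 2, ...; it also works for string labels.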

How to compare predictive power of PCA and NMF

I would like to compare the output of an algorithm with different preprocessed data: NMF and PCA.
To get a somewhat comparable result, instead of just choosing the same number of components for both PCA and NMF, I would like to pick the number that explains e.g. 95% of the variance.
I was wondering if it is possible to identify the variance retained in each component of NMF.
For instance using PCA this would be given by:
retainedVariance(i) = eigenvalue(i) / sum(eigenvalue)
Any ideas?
TL;DR
You should loop over different values of n_components and estimate the explained_variance_score of the decoded X at each iteration. This will show you how many components you need to explain 95% of the variance.
Now I will explain why.
Relationship between PCA and NMF
NMF and PCA, like many other unsupervised learning algorithms, aim to do two things:
encode input X into a compressed representation H;
decode H back to X', which should be as close to X as possible.
They do it in a somewhat similar way:
Decoding is similar in PCA and NMF: they output X' = dot(H, W), where W is a learned matrix parameter.
Encoding is different. In PCA, it is also linear: H = dot(X, V), where V is also a learned parameter. In NMF, H = argmin(loss(X, H, W)) (with respect to H only), where loss is mean squared error between X and dot(H, W), plus some additional penalties. Minimization is performed by coordinate descent, and result may be nonlinear in X.
Training is also different. PCA learns sequentially: the first component minimizes MSE without constraints, each next kth component minimizes residual MSE subject to being orthogonal with the previous components. NMF minimizes the same loss(X, H, W) as when encoding, but now with respect to both H and W.
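The encode/decode view can be checked directly against scikit-learn. In the sketch below, the NMF settings (init, max_iter, random_state) are assumptions chosen only to make the run deterministic. Note that scikit-learn's PCA centers the data, so its decoder also adds the training mean back, a detail the simplified formula X' = dot(H, W) leaves out.

```python
import numpy as np
from sklearn import datasets, decomposition

X, _ = datasets.load_iris(return_X_y=True)

pca = decomposition.PCA(n_components=2).fit(X)
nmf = decomposition.NMF(n_components=2, init="nndsvda", max_iter=1000,
                        random_state=0).fit(X)

# PCA: linear encoder on the centered data, linear decoder plus the mean.
H_pca = pca.transform(X)
H_pca_manual = (X - pca.mean_) @ pca.components_.T
X_pca = H_pca @ pca.components_ + pca.mean_

# NMF: the encoder solves an optimization problem for H, the decoder is linear.
H_nmf = nmf.transform(X)
X_nmf = H_nmf @ nmf.components_

print(np.allclose(H_pca, H_pca_manual))
print(np.allclose(X_pca, pca.inverse_transform(H_pca)))
print(np.allclose(X_nmf, nmf.inverse_transform(H_nmf)))
```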
How to measure performance of dimensionality reduction
If you want to measure performance of an encoding/decoding algorithm, you can follow the usual steps:
1. Train your encoder+decoder on X_train
2. To measure in-sample performance, compare X_train'=decode(encode(X_train)) with X_train using your preferred metric (e.g. MAE, RMSE, or explained variance)
3. To measure out-of-sample performance (generalizing ability) of your algorithm, do step 2 with the unseen X_test.
Let's try it with PCA and NMF!
from sklearn import decomposition, datasets, model_selection, preprocessing, metrics
# use the well-known Iris dataset
X, _ = datasets.load_iris(return_X_y=True)
# split the dataset, to measure overfitting
X_train, X_test = model_selection.train_test_split(X, test_size=0.5, random_state=1)
# I scale the data in order to give equal importance to all its dimensions
# NMF does not allow negative input, so I don't center the data
scaler = preprocessing.StandardScaler(with_mean=False).fit(X_train)
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)
# train the both decomposers
pca = decomposition.PCA(n_components=2).fit(X_train_sc)
nmf = decomposition.NMF(n_components=2).fit(X_train_sc)
print(sum(pca.explained_variance_ratio_))
This prints an explained variance ratio of 0.9536930834362043 - the default metric of PCA, estimated using its eigenvalues. We can measure it in a more direct way - by applying a metric to actual and "predicted" values:
def get_score(model, data, scorer=metrics.explained_variance_score):
    """ Estimate performance of the model on the data """
    prediction = model.inverse_transform(model.transform(data))
    return scorer(data, prediction)
print('train set performance')
print(get_score(pca, X_train_sc))
print(get_score(nmf, X_train_sc))
print('test set performance')
print(get_score(pca, X_test_sc))
print(get_score(nmf, X_test_sc))
which gives
train set performance
0.9536930834362043 # same as before!
0.937291711378812
test set performance
0.9597828443047842
0.9590555069007827
You can see that on the training set PCA performs better than NMF, but on the test set their performance is almost identical. This happens because NMF is more constrained and can be regularized:
H and W (the learned parameter) must be non-negative
H should be as small as possible (L1 and L2 penalties, which are switched off under scikit-learn's default settings)
W should be as small as possible (L1 and L2 penalties, likewise off by default)
These constraints make NMF fit the training data worse than it otherwise could, but they might improve its generalizing ability, which happened in our case.
How to choose the number of components
In PCA, it is simple, because its components h_1, h_2, ... h_k are learned sequentially. If you add the new component h_(k+1), the first k will not change. Thus, you can estimate the performance of each component, and these estimates will not depend on the number of components. This makes it possible for PCA to output the explained_variance_ratio_ array after only a single fit to data.
NMF is more complex, because all its components are trained at the same time, and each one depends on all the rest. Thus, if you add the (k+1)th component, the first k components will change, and you cannot match each particular component with its explained variance (or any other metric).
But what you can do is fit a new instance of NMF for each number of components, and compare the total explained variance:
ks = [1,2,3,4]
perfs_train = []
perfs_test = []
for k in ks:
    nmf = decomposition.NMF(n_components=k).fit(X_train_sc)
    perfs_train.append(get_score(nmf, X_train_sc))
    perfs_test.append(get_score(nmf, X_test_sc))
print(perfs_train)
print(perfs_test)
which would give
[0.3236945680665101, 0.937291711378812, 0.995459457205891, 0.9974027602663655]
[0.26186701106012833, 0.9590555069007827, 0.9941424954209546, 0.9968456603914185]
Thus, three components (judging by the train set performance) or two components (by the test set) are required to explain at least 95% of variance. Please notice that this case is unusual and caused by a small size of training and test data: usually performance degrades a little bit on the test set, but in my case it actually improved a little.
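Reading the threshold off such lists can be automated. The helper below is a hypothetical sketch (the name smallest_k is not from any library); it is applied to the scores printed above.

```python
def smallest_k(ks, scores, threshold=0.95):
    """Return the first k whose score reaches the threshold, or None."""
    for k, score in zip(ks, scores):
        if score >= threshold:
            return k
    return None

ks = [1, 2, 3, 4]
# Scores copied from the output shown above.
perfs_train = [0.3236945680665101, 0.937291711378812,
               0.995459457205891, 0.9974027602663655]
perfs_test = [0.26186701106012833, 0.9590555069007827,
              0.9941424954209546, 0.9968456603914185]

print(smallest_k(ks, perfs_train))  # 3
print(smallest_k(ks, perfs_test))   # 2
```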

Use features based on tf idf score for text classification using naive bayes (sklearn)

I am learning to implement text classification (into two classes) using tf-idf and naive Bayes by referring to this blog and sklearn tfidf.
Below is the code snippet:
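import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

# documents (list of str), labels (NumPy array) and stop_words are defined elsewhere
kf = StratifiedKFold(n_splits=5)
totalNB = 0
totalMatNB = np.zeros((2, 2))
for train_index, test_index in kf.split(documents, labels):
    X_train = [documents[i] for i in train_index]
    X_test = [documents[i] for i in test_index]
    y_train, y_test = labels[train_index], labels[test_index]
    # fit the vocabulary and idf weights on the training fold only
    vectorizer = TfidfVectorizer(min_df=2, max_df=0.2, use_idf=True, stop_words=stop_words)
    train_corpus_tf_idf = vectorizer.fit_transform(X_train)
    test_corpus_tf_idf = vectorizer.transform(X_test)
    model2 = MultinomialNB()
    model2.fit(train_corpus_tf_idf, y_train)
    result2 = model2.predict(test_corpus_tf_idf)
    # accumulate the confusion matrix and the number of correct predictions
    totalMatNB = totalMatNB + confusion_matrix(y_test, result2)
    totalNB = totalNB + sum(y_test == result2)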
The above code is working as expected.
I have read the documentation, but I am still confused about min_df and max_df.
How can I use features for the classification based on their tf-idf score, i.e. filter the features based on the tf-idf score?
e.g.
use the features whose tf-idf score is greater than x [ score(features) > x ]
use the features whose tf-idf score is between x and y [ y > score(features) > x ] or [ y >= score(features) >= x ]
When training the vectorizer, setting specific values for min_df and max_df is supposed to help you tweak the eventual tf-idf representation to best suit your needs by limiting the vocabulary. It also helps with reducing the dimension of the vector representation which is usually a good thing since they tend to be huge.
Setting a high min_df value will remove relatively infrequent terms from the representation. If your eventual model is not supposed to care too much about very rare terms, this would be a good thing.
Setting a low max_df will remove relatively frequent terms from the representation. If your eventual model doesn't care about words that are used in many contexts (e.g. "the", "or", "and"), then this would be a good thing. Note that both parameters accept either an integer (an absolute number of documents) or a float between 0.0 and 1.0 (a proportion of documents), so "low" can mean a small integer or a float close to 0.
Important note: your suggestion of filtering features after-the-fact based on their tf-idf weight is a totally different thing. Setting min_df and max_df when fitting the vectorizer will limit the eventual vocabulary based on document frequency across the entire training sample. Whereas the eventual tf-idf weight in a given vector is a document-specific value (since it's also impacted by the term frequency in that specific document).
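The difference can be seen on a toy corpus. In the sketch below, the four documents and the thresholds are made up for illustration, and scoring a feature by its maximum tf-idf weight over all documents is only one possible definition (an assumption of this sketch; the mean weight would be another). The helper name filter_by_score is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "a bird sang on the log",
]

# Vocabulary pruning by document frequency: keep terms that occur in at
# least 2 documents and in at most 75% of the documents.
vectorizer = TfidfVectorizer(min_df=2, max_df=0.75)
tfidf = vectorizer.fit_transform(docs)
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)  # "the" is too frequent; "mat", "chased", "bird", "sang" too rare

def filter_by_score(matrix, low, high):
    """Keep columns whose maximum tf-idf weight lies strictly between low and high."""
    scores = matrix.max(axis=0).toarray().ravel()
    keep = np.flatnonzero((scores > low) & (scores < high))
    return matrix[:, keep], keep

# Filtering after the fact, based on the weights themselves.
filtered, kept_columns = filter_by_score(tfidf, low=0.6, high=1.0)
print([terms[j] for j in kept_columns])
```

A column filter chosen this way would have to be derived from the training fold only and then applied unchanged to the test fold, just like the fitted vectorizer.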
Hope this helps!
