I'm noticing that given the same feature table (training data) and feature vector for an SVC, I am getting different results for the predict_proba output.
Is this expected behavior for an SVC or should I be getting consistent results?
Thanks for your help!
I think this is caused by the fact that libsvm calibrates probabilities using cross-validation on random folds of the dataset. In recent versions of scikit-learn (0.14.1+), passing random_state=0 as a constructor parameter should fix the PRNG seed used internally by libsvm. If that does not fix the outcome, please feel free to open a GitHub issue with a minimal reproduction script.
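For illustration, a minimal sketch of that fix (X, y and X_new below are placeholders for your own training data and query vector):
from sklearn.svm import SVC
# probability=True enables the cross-validated Platt calibration mentioned above;
# fixing random_state makes the internal fold assignment, and hence predict_proba,
# repeatable across runs
clf = SVC(probability=True, random_state=0)
clf.fit(X, y)
proba = clf.predict_proba(X_new)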
I am new to XGBoost and I am currently working on a project where we have built an XGBoost classifier. Now we want to run some feature selection techniques. Is the backward elimination method a good idea for this? I have used it in regression, but I am not sure if/how to use it in a classification problem. Any leads will be greatly appreciated.
Note: I have already tried permutation importance and it has yielded good results! I'm looking for another method to evaluate the features in the model.
Consider asking your question on Cross Validated since feature selection is more about theory/practice than code.
What is your concern? Removing "noisy" features that drive down your results, or obtaining a sparse model? Backward selection is one way to do it, of course. That said, in case you are not aware of it, XGBoost computes its own "variable importance" values.
# plot feature importance using built-in function
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model)
pyplot.show()
Something like this. This importance is based on how many times a feature is used to make a split. You can then define, for instance, a threshold below which you do not keep the variables (see the sketch after this list). However, do not forget that:
This variable importance has been obtained on the training data only
The removal of a variable with high importance may not affect your prediction error, e.g. if it is correlated with another highly important variable. Other subtleties like this may exist.
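As a rough sketch of the thresholding idea (not part of the original answer): sklearn's SelectFromModel can wrap a fitted XGBClassifier and drop features whose importance falls below a cutoff. X, y and the threshold value here are placeholders/illustrative.
from xgboost import XGBClassifier
from sklearn.feature_selection import SelectFromModel
model = XGBClassifier()
model.fit(X, y)  # X, y: your training features and labels (placeholders)
# keep only the features whose importance exceeds the (illustrative) threshold
selector = SelectFromModel(model, threshold=0.01, prefit=True)
X_reduced = selector.transform(X)
print("kept", X_reduced.shape[1], "of", X.shape[1], "features")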
I'm new to Keras and I have one question.
To get reproducible results, I fixed the seed. If the fit function's shuffle parameter is True, is the training data order always the same for all epochs or not?
Thanks in advance.
Yes, if you set the seed to a certain value, the training order should always be the same with the same seed. However, there were some problems regarding reproducibility when using TF and multiprocessing. I'm not sure if this has been solved by now.
You can also check out this page in the Keras documentation.
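For reference, the usual seed-fixing boilerplate looks roughly like this (assuming a TensorFlow 2.x backend; the seed value is arbitrary):
import random
import numpy as np
import tensorflow as tf
# fix all relevant seeds before building and fitting the model
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)
# model.fit(X_train, y_train, epochs=10, shuffle=True)
# with the seeds fixed, the shuffling is repeatable from run to run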
I am implementing a logistic regression model using sklearn, for a text classification competition on Kaggle.
When I use unigrams, there are 23,617 features. The best mean_test_score that cross-validation search (sklearn's GridSearchCV) gives me is similar to the score I got on Kaggle using the best model.
There are 1,046,524 features if I use bigrams. GridSearchCV gives me a better mean_test_score compared to unigrams, but with this new model I got a much lower score on Kaggle.
I guess the reason might be overfitting, since I have too many features. I have tried to set the GridSearchCV using 5-fold, or even 2-fold, but the scores are still inconsistent.
Does it really indicate my second model is overfitting, even in the validation stage? If so, how can I tune the regularization term for my logistic model using sklearn? Any suggestions are appreciated!
Assuming you are using sklearn, you could try looking into the tuning parameters max_df, min_df, and max_features. Throwing these into a GridSearch may take a long time, but you will likely get some interesting results back. I know these parameters are implemented in sklearn.feature_extraction.text.TfidfVectorizer, and CountVectorizer exposes them as well. Essentially, the idea is that including too many grams can lead to overfitting, and the same goes for keeping grams with very low or very high document frequencies.
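A rough sketch of how such a grid search could look, assuming texts and labels are placeholders for your raw documents and targets, and all parameter values are only illustrative:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "tfidf__min_df": [2, 5],              # drop very rare n-grams
    "tfidf__max_df": [0.9, 0.95],         # drop very common n-grams
    "tfidf__max_features": [50000, 100000],
    "clf__C": [0.1, 1.0, 10.0],           # inverse regularization strength for LogisticRegression
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
Tuning C alongside the vocabulary-size parameters also addresses the regularization question from the post above.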
I'm new to scikit-learn, and to SVM methods in general. I've got my data set working well with scikit-learn's OneClassSVM for detecting outliers; I train the OneClassSVM on observations, all of which are 'inliers', and then use predict() to generate binary inlier/outlier predictions on my test set.
However, to continue with my analysis, I'd like to get the probability associated with each new observation in my test set, e.g. the probability of it being an outlier. I've noticed that other classification methods in scikit-learn offer the ability to pass the parameter probability=True to compute this, but OneClassSVM does not. Is there an easy way to get these results?
I was searching for an answer to the same question until I got to this page. After being stuck for some time, I went back to check the original LIBSVM package, since scikit-learn's OneClassSVM is based on the LIBSVM implementation, as stated here.
On the main page of LIBSVM, they state the following for the '-b' option, which is used to enable probability output scores for some SVM variants:
-b probability_estimates: whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
In other words, the one-class SVM which is of type SVM (neither SVC nor SVR) does not have implementation for probability estimation.
If I go and try to force this option (i.e. -b) using the command line interface of LIBSVM, for example:
./svm-train -s 2 -t 2 -b 1 heart_scale
I receive the following error message:
ERROR: one-class SVM probability output not supported yet
In summary, this much-desired output is not yet supported by LIBSVM, and thus scikit-learn does not offer it for the moment. I hope they enable this functionality in the near future and update this thread.
OneClassSVM does provide decision function scores, which in theory are the distance from the decision boundary separating normal points from anomalies. OCSVM does unsupervised classification, which means that inside the algorithm an anomaly is defined based on its distance to the origin (see Schölkopf's NIPS paper: https://papers.nips.cc/paper/1999/file/8725fb777f25776ffa9076e44fcfd776-Paper.pdf).
TLDR: use
clf.decision_function(samples) * (-1)
as scores. You get a sparse distribution of scores, with higher values indicating more anomalous samples.
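A minimal sketch of that workaround, with X_train (inlier-only) and X_test as placeholders and the hyperparameters chosen arbitrarily:
from sklearn.svm import OneClassSVM
clf = OneClassSVM(kernel="rbf", gamma="auto", nu=0.1)
clf.fit(X_train)  # X_train: inlier-only observations (placeholder)
# decision_function is positive for predicted inliers and negative for outliers;
# negating it makes larger values mean "more anomalous"
anomaly_scores = -clf.decision_function(X_test)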
Does GridSearchCV use predict or predict_proba, when using auc_score as score function?
The predict function generates predicted class labels, which will always result in a triangular ROC curve. A more curved ROC curve is obtained using the predicted class probabilities. The latter is, as far as I know, more accurate. If so, the area under the 'curved' ROC curve is probably the better measure of classification performance within the grid search.
Therefore I am curious if either the class labels or class probabilities are used for the grid search, when using the area under the ROC-curve as performance measure. I tried to find the answer in the code, but could not figure it out. Does anyone here know the answer?
Thanks
To use auc_score for grid searching you really need to use predict_proba or decision_function, as you pointed out. This is not possible in the 0.13 release. If you pass score_func=auc_score, it will use predict, which doesn't make any sense.
[edit]Since 0.14[/edit] it is possible to do a grid search using AUC by setting the new scoring parameter to roc_auc: GridSearchCV(est, param_grid, scoring='roc_auc'). It will do the right thing and use predict_proba (or decision_function if predict_proba is not available).
See the whats new page of the current dev version.
You need to install the current master from github to get this functionality or wait until April (?) for 0.14.
After performing some experiments with sklearn's SVC (which has predict_proba available), comparing results from predict_proba and decision_function, it seems that roc_auc in GridSearchCV uses decision_function to compute AUC scores. I found a similar discussion here: Reproducing Sklearn SVC within GridSearchCV's roc_auc scores manually
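One way to check this yourself on a toy problem (make_classification is used here only to keep the example self-contained):
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC(probability=True, random_state=0).fit(X_tr, y_tr)
# AUC from the decision function vs. from Platt-calibrated probabilities;
# the two can differ slightly, which reveals which one a scorer actually used
auc_decision = roc_auc_score(y_te, clf.decision_function(X_te))
auc_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(auc_decision, auc_proba)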