Recursive feature elimination with CV doesn't reduce feature count - scikit-learn

I have a protein dataset on which I need to perform RFE. There are 100 examples with binary class labels (sick = 1, healthy = 0) and 9847 features per example. To reduce the dimensionality I am running an RFECV with a LogisticRegression estimator and 5-fold CV. This is the code:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

model = LogisticRegression()
rfecv = RFECV(estimator=model, step=1, cv=StratifiedKFold(5), n_jobs=-1)
rfecv.fit(X_train, y_train)
print("Number of features selected: %d" % rfecv.n_features_)
Number of features selected: 9874
I then plot the number of features vs the CV scores:
import matplotlib.pyplot as plt

plt.figure()
plt.xlabel("feature count")
plt.ylabel("CV accuracy")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
What I think is happening (and this is where I need an expert) is that the first peak shows the optimal number of features. After that the curve drops and only starts to climb again because of overfitting, i.e. the model is no longer really separating the classes but memorizing individual examples. Could this be the case? And if so, how can I obtain these features (i.e. the ones at that first peak), given that rfecv.support_ only gives me the ones where the highest accuracy was reached (meaning: all of them)?
And while I am at it: how would I choose the best estimator for the RFE? Is it just trial and error, going through all possible classifiers, or is there any logic to why I would use a logistic regression over a linear SVC, for example?

One approach I use for assessing feature relevance is a RandomForest or ExtraTrees (extremely randomized trees) estimator.
You can use:
rfecv.n_features_
to see how many features were selected, and:
rfecv.ranking_
to see each feature's rank (1 for selected features, higher numbers for features eliminated earlier). Another algorithm you can use is PCA to reduce the dimensionality of your dataset.
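If what you are really after is the feature subset at that first peak of the CV curve rather than at the global maximum (which here is all features), one option is to read the peak position off rfecv.grid_scores_ and re-run a plain RFE with n_features_to_select fixed to that value. A rough sketch, assuming the peak lies in the early part of the curve (the window of 50 is illustrative):
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# grid_scores_[i] is the CV score with i+1 features (step=1); look for the
# early peak in an illustrative window instead of the global maximum
scores = np.asarray(rfecv.grid_scores_)
k_peak = int(np.argmax(scores[:50])) + 1

rfe = RFE(estimator=LogisticRegression(), n_features_to_select=k_peak, step=1)
rfe.fit(X_train, y_train)

peak_mask = rfe.support_              # boolean mask of the k_peak selected features
X_train_peak = X_train[:, peak_mask]  # assumes X_train is a NumPy array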

Related

Evaluating Gaussian Mixture model using a score metric?

I have 1D data (a single column of data). I used a Gaussian Mixture Model (GMM) for density estimation, using this implementation in Python: https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html. By relying on the AIC/BIC criterion I was able to determine the number of components. After I fit the GMM, I plotted the kernel density estimate of the original observations plus that of data sampled from the GMM. The plots of the original and sampled densities are quite similar (that is good). But I would like some metric to report how good the fitted model is.
g = GaussianMixture(n_components=35)
data = df['x'].values.reshape(-1, 1)  # data taken from a data frame (10,000 data points)
clf = g.fit(data)                     # fit the model
samples = clf.sample(10000)[0]        # generate sample data points (same number as original data points)
I found score in the implementation, but I am not sure how to use it. Am I doing it wrong? Or is there any better way to show how accurate the fitted model is, apart from histograms or kernel density plots?
print(clf.score(data))
print(clf.score(samples))
You can use normalized_mutual_info_score, adjusted_rand_score or the silhouette score to evaluate your clusters. All of these metrics are implemented in the sklearn.metrics module.
EDIT: You can check this link for more detailed explanations.
In a summary:
Adjusted Rand Index: measures the similarity of the two assignments.
Normalized Mutual Information: measures the agreement of the two assignments.
Silhouette Coefficient: measures how well-assigned each individual point is.
from sklearn.metrics import silhouette_score

gmm.fit(x_vec)
pred = gmm.predict(x_vec)
print("gmm silhouette:", silhouette_score(x_vec, pred))
I would rather use cross-validation and look at the accuracy of the trained model on held-out data.
Use the predict method of the fitted model to predict the labels of unseen data (use cross-validation and report the accuracy): https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture.predict
Toy example:
g = GaussianMixture(n_components = 35)
g.fit(train_data)  # fit model
y_pred = g.predict(test_data)
EDIT:
There are several options to measure the performance in your unsupervised case. For a GMM, which models the data with an explicit probability density, the most common are BIC and AIC. They are included directly in the scikit-learn GaussianMixture class as the bic and aic methods.
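A minimal sketch of that selection, scoring each candidate number of components with BIC (use aic for AIC); the search range here is illustrative, not prescribed:
import numpy as np
from sklearn.mixture import GaussianMixture

candidate_components = range(1, 51)  # illustrative search range
bics = []
for k in candidate_components:
    gmm = GaussianMixture(n_components=k, random_state=0).fit(data)
    bics.append(gmm.bic(data))       # lower is better; gmm.aic(data) for AIC

best_k = list(candidate_components)[int(np.argmin(bics))]
print("Best number of components by BIC:", best_k)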

SMOTE oversampling applied to text classification

I am working on text classification, where I am using a Multinomial Naive Bayes classifier to predict article titles into their respective subject categories. Both of these are stored in a pandas data frame as text columns. However, there are two categories which contain 50,000 records and 30,000 records respectively. Hence I need to oversample the data and then apply the algorithm. When I do the oversampling it reduces the model accuracy score and gives me 15%. Please tell me how I can improve it.
X_train, X_test, Y_train, Y_test = train_test_split(df['Title'], df['Subjects'], test_size=0.2, random_state=42)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_sample(X_train_tfidf, Y_train)
print("Shape after smote is:",X_train_res.shape,y_train_res.shape)
nb = Pipeline([('clf', MultinomialNB())])
nb.fit(X_train_res, y_train_res)
y_pred = nb.predict(count_vect.transform(X_test))
print(accuracy_score(Y_test,y_pred))
I expected the model accuracy to increase by doing so. Model accuracy without oversampling is 62% and after oversampling it is 15%, when it should actually be higher.
Actually, using SMOTE for balancing/oversampling classes can be problematic in text classification tasks. There are nice explanations and suggestions for alternatives here:
https://datascience.stackexchange.com/a/27758
In short, the SMOTE output may not represent "meaningful" substitutes and due to the size of the feature space its nearest-neighbor based approach may yield poor results.
Some more ideas (a short sketch follows this list):
Instead of using accuracy, it is advisable to use F1 or a similar metric.
It is rather unlikely to help, but did you try undersampling?
For the MultinomialNB classifier you might try setting class_prior explicitly.
Finally, other methods like Forests and Boosting approaches might be better suited for imbalanced datasets.
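To make the F1 and class_prior suggestions concrete, here is a rough sketch built on the variable names from the question; the class_prior values are illustrative, not tuned, and assume a two-class problem:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, f1_score

# Plain Naive Bayes with an explicit (illustrative) class prior instead of SMOTE
nb = MultinomialNB(class_prior=[0.5, 0.5])  # adjust the length to the number of classes
nb.fit(X_train_tfidf, Y_train)

# Apply the same count + TF-IDF transforms to the test titles before predicting
X_test_tfidf = tfidf_transformer.transform(count_vect.transform(X_test))
y_pred = nb.predict(X_test_tfidf)

# Report per-class precision/recall/F1 instead of plain accuracy
print(classification_report(Y_test, y_pred))
print("Macro F1:", f1_score(Y_test, y_pred, average="macro"))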

Scikit SVM gives very poor accuracy for STL-10 dataset

I am using a scikit-learn SVM to train my model on the STL-10 dataset, which contains 5000 training images (10 pre-defined folds). So I have a 5000*96*96*3 dataset for training and test purposes. I used the following code to train it and measure the accuracy on the test set (80%/20% split). The final result was 0.323 accuracy. How can I increase the accuracy of the SVM?
This is the STL-10 dataset.
def train_and_evaluate(clf, train_x, train_y):
    clf.fit(train_x, train_y)

# make a 2D array, since fit() only accepts 2D input
nsamples, nx, ny, nz = images.shape
reshaped_train_dataset = images.reshape((nsamples, nx * ny * nz))

X_train, X_test, Y_train, Y_test = train_test_split(
    reshaped_train_dataset, read_labels(LABEL_PATH), test_size=0.20, random_state=33)

train_and_evaluate(my_svc, X_train, Y_train)
print(metrics.accuracy_score(Y_test, my_svc.predict(X_test)))
So it seems you are using raw SVM directly on the images. That is usually not a good idea (it is rather bad actually).
I will describe the classic image-classification pipeline that was popular for the last decades. Keep in mind that the highest-performing approaches right now might use deep neural networks to combine some of these steps (a very different approach; a lot of research in recent years!).
First step:
Preprocessing is needed!
Normalize mean and variance (I would not expect your dataset to be already normalized)
Optional: histogram-equalization
Second step:
Feature-extraction -> you should learn some features from these images. There are a lot of approaches including
(Kernel-)PCA
(Kernel-)LDA
Dictionary-learning
Matrix-factorization
Local binary patterns
... (just test with LDA initially)
Third:
SVM for classification
Again, there might be a normalization step needed before this and, as mentioned in the comments by @David Batista, there might be some parameter tuning needed (especially for a kernel SVM).
It is also not clear whether using the color information is wise here. For simpler approaches I expect black-and-white images to be superior (you lose information, but tuning your pipeline is more robust; high-performance approaches will of course use color information).
See here for a random tutorial describing a similar problem. While I don't know if it is good work, you can immediately recognize the processing pipeline mentioned above (preprocessing, feature extraction, classifier learning).
Edit:
Why preprocessing? Some algorithms assume centered samples with unit variance, therefore normalization is needed. This is (at least) very important for PCA, LDA and SVMs.
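A rough sketch of such a pipeline in scikit-learn (scaling, PCA for feature extraction, then an SVM with a small grid search); the number of PCA components and the parameter grid are illustrative, not tuned:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("scale", StandardScaler()),     # normalize mean and variance
    ("pca", PCA(n_components=100)),  # illustrative number of components
    ("svm", SVC(kernel="rbf")),
])

param_grid = {"svm__C": [1, 10, 100], "svm__gamma": ["scale", 1e-3, 1e-4]}
search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1)
search.fit(X_train, Y_train)

print("Best CV accuracy:", search.best_score_)
print("Test accuracy:", search.score(X_test, Y_test))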

scikit-learn RandomForestClassifier probabilistic prediction vs majority vote

In the documentation of scikit-learn in section 1.9.2.1 (excerpt is posted below), why does the implementation of random forest differ from the original paper by Breiman? As far as I'm aware, Breiman opted for a majority vote (mode) for classification and an average for regression (paper written by Liaw and Wiener, the maintainers of the original R code with citation below) when aggregating the ensembles of classifiers.
Why does scikit-learn use probabilistic prediction instead of a majority vote?
Is there any advantage in using probabilistic prediction?
The section in question:
In contrast to the original publication [B2001], the scikit-learn
implementation combines classifiers by averaging their probabilistic
prediction, instead of letting each classifier vote for a single
class.
Source: Liaw, A., & Wiener, M. (2002). Classification and Regression by randomForest. R news, 2(3), 18-22.
This question has now been answered on Cross Validated. Included here for reference:
Such questions are always best answered by looking at the code, if
you're fluent in Python.
RandomForestClassifier.predict, at least in the current version
0.16.1, predicts the class with highest probability estimate, as given by predict_proba. (this
line)
The documentation for predict_proba says:
The predicted class probabilities of an input sample is computed as the mean predicted class probabilities of the trees in the forest. The
class probability of a single tree is the fraction of samples of the
same class in a leaf.
The difference from the original method is probably just so that
predict gives predictions consistent with predict_proba. The
result is sometimes called "soft voting", rather than the "hard"
majority vote used in the original Breiman paper. I couldn't in quick
searching find an appropriate comparison of the performance of the two
methods, but they both seem fairly reasonable in this situation.
The predict documentation is at best quite misleading; I've submitted a pull request to fix it.
If you want to do majority vote prediction instead, here's a function
to do it. Call it like predict_majvote(clf, X) rather than
clf.predict(X). (Based on predict_proba; only lightly tested, but
I think it should work.)
import numpy as np
from scipy.stats import mode
from sklearn.ensemble.forest import _partition_estimators, _parallel_helper
from sklearn.tree._tree import DTYPE
from sklearn.externals.joblib import Parallel, delayed
from sklearn.utils import check_array
from sklearn.utils.validation import check_is_fitted


def predict_majvote(forest, X):
    """Predict class for X.

    Uses majority voting, rather than the soft voting scheme
    used by RandomForestClassifier.predict.

    Parameters
    ----------
    X : array-like or sparse matrix of shape = [n_samples, n_features]
        The input samples. Internally, it will be converted to
        ``dtype=np.float32`` and if a sparse matrix is provided
        to a sparse ``csr_matrix``.

    Returns
    -------
    y : array of shape = [n_samples] or [n_samples, n_outputs]
        The predicted classes.
    """
    check_is_fitted(forest, 'n_outputs_')

    # Check data
    X = check_array(X, dtype=DTYPE, accept_sparse="csr")

    # Assign chunk of trees to jobs
    n_jobs, n_trees, starts = _partition_estimators(forest.n_estimators,
                                                    forest.n_jobs)

    # Parallel loop: collect each tree's hard prediction
    all_preds = Parallel(n_jobs=n_jobs, verbose=forest.verbose,
                         backend="threading")(
        delayed(_parallel_helper)(e, 'predict', X, check_input=False)
        for e in forest.estimators_)

    # Reduce: take the per-sample mode (majority vote) across trees
    modes, counts = mode(all_preds, axis=0)

    if forest.n_outputs_ == 1:
        return forest.classes_.take(modes[0], axis=0)
    else:
        n_samples = all_preds[0].shape[0]
        preds = np.zeros((n_samples, forest.n_outputs_),
                         dtype=forest.classes_.dtype)
        for k in range(forest.n_outputs_):
            preds[:, k] = forest.classes_[k].take(modes[:, k], axis=0)
        return preds
On the dumb synthetic case I tried, predictions agreed with the
predict method every time.
This was studied by Breiman in Bagging Predictors (http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf).
This gives nearly identical results, but using soft voting gives smoother probabilities. Note that if you are using fully grown trees, you won't see any difference.
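For recent scikit-learn versions, where the private helpers imported in the function above have moved or been removed, a simpler hard-voting sketch using only the public estimators_ and classes_ attributes should behave the same way for single-output classification:
import numpy as np
from scipy.stats import mode

def predict_hard_vote(forest, X):
    """Majority vote over the individual trees (single-output classification only)."""
    # Each tree predicts encoded class indices; map the winning index back
    # through forest.classes_ to recover the original labels.
    per_tree = np.asarray([tree.predict(X) for tree in forest.estimators_])
    winners = np.ravel(mode(per_tree, axis=0).mode).astype(int)
    return forest.classes_.take(winners, axis=0)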

How can I know the probability of the class predicted by the predict() function in a Support Vector Machine?

How can I know a sample's probability of belonging to the class predicted by Scikit-Learn's predict() function for a Support Vector Machine?
>>> print(clf.predict([fv]))
[5]
Is there any function for this?
Definitely read this section of the docs, as there are some subtleties involved. See also Scikit-learn predict_proba gives wrong answers.
Basically, if you have a multi-class problem with plenty of data predict_proba as suggested earlier works well. Otherwise, you may have to make do with an ordering that doesn't yield probability scores from decision_function.
Here's a nice motif for using predict_proba to get a dictionary or list of class vs probability:
model = svm.SVC(probability=True)
model.fit(X, Y)
results = model.predict_proba(test_data)[0]

# gets a dictionary of {'class_name': probability}
prob_per_class_dictionary = dict(zip(model.classes_, results))

# gets a list of classes ordered from most to least probable
results_ordered_by_probability = [cls for cls, prob in
                                  sorted(zip(model.classes_, results),
                                         key=lambda x: x[1], reverse=True)]
Use clf.predict_proba([fv]) to obtain a list with predicted probabilities per class. However, this function is not available for all classifiers.
Regarding your comment, consider the following:
>> prob = [ 0.01357713, 0.00662571, 0.00782155, 0.3841413, 0.07487401, 0.09861277, 0.00644468, 0.40790285]
>> sum(prob)
1.0
The probabilities sum to 1.0, so multiply by 100 to get percentage.
When creating the SVC instance, set probability=True so that it computes probability estimates:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
Then call fit as usual, and afterwards predict_proba([fv]).
For a clearer answer, I repost the relevant information from the scikit-learn documentation on SVMs:
Needless to say, the cross-validation involved in Platt scaling is an expensive operation for large datasets. In addition, the probability estimates may be inconsistent with the scores, in the sense that the “argmax” of the scores may not be the argmax of the probabilities. (E.g., in binary classification, a sample may be labeled by predict as belonging to a class that has probability <½ according to predict_proba.) Platt’s method is also known to have theoretical issues. If confidence scores are required, but these do not have to be probabilities, then it is advisable to set probability=False and use decision_function instead of predict_proba.
For other classifiers such as Random Forest, AdaBoost and Gradient Boosting, it should be fine to use the predict_proba function in scikit-learn directly.
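For example, with a random forest the class probabilities are available directly, without the extra Platt-scaling step that SVC needs. A minimal sketch, reusing the X, Y and test_data names from the snippet above; the number of trees is illustrative:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, Y)

proba = rf.predict_proba(test_data)      # shape (n_samples, n_classes)
print(dict(zip(rf.classes_, proba[0])))  # per-class probability for the first sample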
This is one way of obtaining probability-like scores:
from sklearn.svm import SVC

svc = SVC(probability=True)
preds_svc = svc.fit(X_train, y_train).predict(X_test)
# The decision function tells us on which side of the separating hyperplane we are (and how far away from it)
probs_svc = svc.decision_function(X_test)
# Min-max rescale to [0, 1]; note these are not calibrated probabilities
probs_svc = (probs_svc - probs_svc.min()) / (probs_svc.max() - probs_svc.min())
