Very few distinct prediction probabilities for CV instances with sparse SVM - scikit-learn

I’m having an issue using the prediction probabilities for sparse SVM, where many of the predictions come out the same for my test instances. These probabilities are produced during cross validation, and when I plot an ROC curve for the folds, the results look very strange: there are only a handful of clustered points on the graph. Here is my cross-validation code, which I based on the samples on the scikit-learn website:
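import numpy as np
from sklearn import svm
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold

# assumed setup for the averaged curve, as in the scikit-learn ROC example
mean_tpr = 0.0
mean_fpr = np.linspace(0, 1, 100)
skf = StratifiedKFold(n_splits=numfolds)
for fold, (train_index, test_index) in enumerate(skf.split(X_scaled, y)):
    # split the training and testing sets
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # train on the subset for this fold
    print('Training on fold ' + str(fold))
    classifier = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val, probability=True)
    probas_ = classifier.fit(X_train, y_train).predict_proba(X_test)
    # compute the ROC curve and the area under the curve
    fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
    mean_tpr += np.interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
    roc_auc = auc(fpr, tpr)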
I’m just trying to figure out if there’s something I’m obviously missing here, since I used this same training set and SVM parameters with libsvm and got much better results. When I used libsvm, printed out the distances from the hyperplane for the CV test instances, and then plotted the ROC, the curve came out much more like I expected, with a much better AUC.
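Since the libsvm run scored test instances by their distance from the hyperplane, one comparison that may be worth making in scikit-learn is to build the ROC from `decision_function` rather than `predict_proba`. The sketch below is only an illustration under assumptions: the data come from `make_classification`, and `C=1.0` and `gamma="scale"` are placeholders, not the settings from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for X_scaled and y (an assumption, not the original data).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

fold_aucs = []
distinct_scores = []
for train_index, test_index in StratifiedKFold(n_splits=5).split(X, y):
    # probability=True is not needed when only decision values are used.
    clf = SVC(C=1.0, kernel="rbf", gamma="scale")
    clf.fit(X[train_index], y[train_index])
    # Signed distance-like score for each test instance.
    scores = clf.decision_function(X[test_index])
    distinct_scores.append(np.unique(scores).size)
    fold_aucs.append(roc_auc_score(y[test_index], scores))

print("distinct scores per fold:", distinct_scores)
print("mean AUC:", np.mean(fold_aucs))
```

Counting the distinct scores per fold gives a quick check on whether the ROC curve is being built from only a few thresholds.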

Related

Different result roc_auc_score and plot_roc_curve

I am training a RandomForestClassifier (sklearn) to predict credit card fraud. When I test the model and check the ROC AUC score, I get different values from roc_auc_score and plot_roc_curve: roc_auc_score gives me around 0.89, while plot_roc_curve calculates the AUC as 0.96. Why is that?
The labels are all 0 and 1, and the predictions are also 0 or 1.
Code:
clf = RandomForestClassifier(random_state =42)
clf.fit(X_train, y_train[target].values)
pred_test = clf.predict(X_test)
print(roc_auc_score(y_test, pred_test))
clf_disp = plot_roc_curve(clf, X_test, y_test)
plt.show()
Output of the code (the roc_auc_score is just above the graph):
You are feeding the prediction classes instead of prediction probabilities to roc_auc_score.
From Documentation:
y_score: array-like of shape (n_samples,) or (n_samples, n_classes)
Target scores. In the binary and multilabel cases, these can be either probability estimates or non-thresholded decision values (as returned by decision_function on some classifiers).
Change your code to:
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train[target].values)
# predict_proba returns one column per class; column 1 is the positive class
y_score = clf.predict_proba(X_test)
print(roc_auc_score(y_test, y_score[:, 1]))
The ROC Curve and the roc_auc_score take the prediction probabilities as input, but as I can see from your code you are providing the prediction labels. You need to fix that.
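To see the effect described in the answers, here is a minimal sketch that scores the same fitted model twice, once with hard labels and once with class-1 probabilities. The imbalanced dataset from `make_classification` is an assumption standing in for the fraud data, so the exact numbers will differ from those in the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the fraud data (an assumption).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Hard 0/1 labels collapse the ROC curve to a single operating point.
auc_from_labels = roc_auc_score(y_test, clf.predict(X_test))
# Class-1 probabilities let the metric sweep over all thresholds.
auc_from_scores = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print("AUC from predict:      ", auc_from_labels)
print("AUC from predict_proba:", auc_from_scores)
```

The second value is the one that should be comparable to the AUC reported on the plotted curve, since the plot is also built from continuous scores.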

Scaling of stock data

I am trying to apply machine learning to stock prediction, and I have run into a problem with scaling future, unseen (much higher) stock close values.
Let's say I use random forest regression to predict the stock price. I break the data into a train set and a test set.
For the train set, I use StandardScaler and call fit and transform.
Then I fit the regressor.
For the test set, I use the same StandardScaler and call transform only.
Then I use the regressor to predict and compare the result to the test labels.
If I plot the predictions and the test labels on a graph, the predictions seem to max out, as if they hit a ceiling. The problem is that the StandardScaler is fit on the train set, while the test set (later in the timeline) has much higher values, so the algorithm does not know what to do with these extreme data.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def test(X, y):
    # split the data chronologically (no shuffling)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
    # preprocess the data
    pipeline = Pipeline([
        ('std_scaler', StandardScaler()),
    ])
    # model = LinearRegression()
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    # preprocessing fit transform on train data
    X_train = pipeline.fit_transform(X_train)
    # fit model on train data with train label
    model.fit(X_train, y_train)
    # transform on test data
    X_test = pipeline.transform(X_test)
    # predict on test data
    y_pred = model.predict(X_test)
    # print(np.sqrt(mean_squared_error(y_test, y_pred)))
    d = {'actual': y_test, 'predict': y_pred}
    plot_data = pd.DataFrame.from_dict(d)
    sns.lineplot(data=plot_data)
    plt.show()
What should be done with the scaling?
This is what I got when plotting the predicted and actual close price against time:
The problem mainly comes from the model you are using. A RandomForest regressor is built from decision trees. It learns to map an input to an output from the examples in the training set, and each tree predicts by averaging the training targets that fall into a leaf. Consequently, a RandomForest regressor works for values within the range seen during training, but it cannot predict beyond the largest training target, so for extreme values that it hasn't seen it hits a ceiling, as your picture shows.
What you want is to learn a function directly, using linear/polynomial regression or more advanced algorithms such as ARIMA.
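The ceiling effect can be reproduced on a toy series. The sketch below assumes a made-up, perfectly linear "price" that keeps rising over time (not real stock data) and a chronological split. It compares the forest with a plain linear regression on the unseen, higher range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical upward-trending series: the target grows linearly with time.
t = np.arange(100, dtype=float).reshape(-1, 1)
price = 2.0 * t.ravel() + 10.0

# Chronological split: train on the first 80 steps, test on the last 20.
t_train, t_test = t[:80], t[80:]
p_train, p_test = price[:80], price[80:]

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(t_train, p_train)
linear = LinearRegression().fit(t_train, p_train)

forest_pred = forest.predict(t_test)
linear_pred = linear.predict(t_test)

# The forest never exceeds the largest target it saw during training.
print("largest training target:", p_train.max())
print("largest forest prediction:", forest_pred.max())
# The linear model follows the trend into the unseen range.
print("last linear prediction vs. actual:", linear_pred[-1], p_test[-1])
```

Scaling the inputs does not change this picture, because tree splits depend only on the ordering of feature values, not on their scale.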

Why is my detection score high inspite of obvious misclassifications during prediction?

I am working on an intrusion classification problem using the NSL-KDD dataset. I used 10 features (out of 42) for training, selected by recursive feature elimination with a Random Forest Classifier as the estimator and the Gini index as the splitting criterion for the decision trees. After training the classifier, I use the same classifier to predict the classes of the test data. My cross-validation scores (accuracy, precision, recall, F-score) from sklearn's cross_val_score were all above 99%. But the confusion matrix showed otherwise, with high false positive and false negative counts. Clearly, these do not match the accuracy and the other scores. Where did I go wrong?
import pandas
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Train set contains X_train (dataframe of features) and Y_train (series of target labels)
# Test set contains X_test and Y_test
# Classifier variable
clf = RandomForestClassifier(n_estimators=10, criterion='gini')
# Training
clf.fit(X_train, Y_train)
# Testing
Y_pred = clf.predict(X_test)
pandas.crosstab(Y_test, Y_pred, rownames=['Actual'], colnames=['Predicted'])
# Scoring
accuracy = cross_val_score(clf, X_test, Y_test, cv=10, scoring='accuracy')
print("Accuracy: %0.5f (+/- %0.5f)" % (accuracy.mean(), accuracy.std() * 2))
precision = cross_val_score(clf, X_test, Y_test, cv=10, scoring='precision_weighted')
print("Precision: %0.5f (+/- %0.5f)" % (precision.mean(), precision.std() * 2))
recall = cross_val_score(clf, X_test, Y_test, cv=10, scoring='recall_weighted')
print("Recall: %0.5f (+/- %0.5f)" % (recall.mean(), recall.std() * 2))
f = cross_val_score(clf, X_test, Y_test, cv=10, scoring='f1_weighted')
print("F-Score: %0.5f (+/- %0.5f)" % (f.mean(), f.std() * 2))
I got accuracy, precision, recall and f-score of
Accuracy 0.99825
Precision 0.99826
Recall 0.99825
F-Score 0.99825
However, the confusion matrix showed otherwise (rows are actual classes, columns are predicted classes):
9670    41
5113  2347
Am I training the whole thing wrong, or is it just a misclassification problem caused by poor feature selection?
Your predicted values are stored in Y_pred.
accuracy_score(Y_test, Y_pred)
Just check whether this works...
You are not comparing equivalent results! For the confusion matrix, you train on (X_train, Y_train) and test on (X_test, Y_test).
However, cross_val_score fits the estimator on k-1 folds of (X_test, Y_test) and tests it on the remaining fold of (X_test, Y_test), because cross_val_score does its own cross-validation (with 10 folds here) on the dataset you provide. Check out the cross_val_score documentation for more explanation.
So basically, the two results do not come from fitting and testing your algorithm on the same data. This might explain some inconsistency in the results.
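A minimal sketch of keeping the two evaluations separate, assuming synthetic data from `make_classification` in place of NSL-KDD: the cross-validation estimate is computed on the training data only, while the held-out accuracy and the confusion matrix are both derived from the same predictions on the test set, so those two always agree with each other.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the intrusion data (an assumption).
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=10, criterion="gini", random_state=0)

# Cross-validation estimate, computed on the training data only.
cv_accuracy = cross_val_score(clf, X_train, y_train, cv=10, scoring="accuracy")

# Held-out evaluation: fit on train, then score and tabulate the same predictions.
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)

print("CV accuracy (train):", cv_accuracy.mean())
print("held-out accuracy:  ", test_accuracy)
print(matrix)
```

The diagonal of the matrix divided by its total reproduces the held-out accuracy, which is a quick consistency check between a reported score and a confusion matrix.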

Is passing sklearn tfidf matrix to train MultinomialNB model proper?

I'm doing some text classification tasks. What I have observed is that, when fed a tfidf matrix (from sklearn's TfidfVectorizer), the Logistic Regression model always outperforms the MultinomialNB model. Below is my code for training both:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X = df_new['text_content']
y = df_new['label']
X_train, X_test, y_train, y_test = train_test_split(X, y)
vectorizer = TfidfVectorizer(stop_words='english')
X_train_dtm = vectorizer.fit_transform(X_train)
X_test_dtm = vectorizer.transform(X_test)
clf_lr = LogisticRegression()
clf_lr.fit(X_train_dtm, y_train)
y_pred = clf_lr.predict(X_test_dtm)
lr_score = accuracy_score(y_test, y_pred) # perfectly balanced binary classes
clf_mnb = MultinomialNB()
clf_mnb.fit(X_train_dtm, y_train)
y_pred = clf_mnb.predict(X_test_dtm)
mnb_score = accuracy_score(y_test, y_pred) # perfectly balanced binary classes
Currently lr_score > mnb_score always. I'm wondering how exactly MultinomialNB uses the tfidf matrix, since the term frequency in tfidf is calculated without any class information. Is there any chance that I should not feed the tfidf matrix to MultinomialNB the same way I did to LogisticRegression?
Update: I understand the difference between the results of TfidfVectorizer and CountVectorizer. I also just checked the source code of sklearn's MultinomialNB.fit() function, and it looks like it expects counts as opposed to frequencies. This would also explain the performance boost mentioned in my comment below. However, I'm still wondering whether passing tfidf into MultinomialNB makes sense under any circumstances. The sklearn documentation briefly mentions the possibility, but without much detail.
Any advice would be much appreciated!
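One way to compare the two inputs directly is to train MultinomialNB once on raw counts and once on tfidf weights. The sketch below uses a tiny made-up corpus and hypothetical labels purely for illustration; on real data the two scores may well differ, and nothing here shows which representation is better in general.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus (an assumption, not the data from the question).
train_texts = [
    "cheap pills buy now", "win money prize now", "buy cheap prize money",
    "meeting agenda attached today", "project schedule review meeting",
    "agenda review schedule today",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["cheap money prize", "project meeting agenda"]
test_labels = [1, 0]

scores = {}
for name, vectorizer in [
    ("counts", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfVectorizer(stop_words="english")),
]:
    # Same classifier, different feature representation.
    model = make_pipeline(vectorizer, MultinomialNB())
    model.fit(train_texts, train_labels)
    scores[name] = accuracy_score(test_labels, model.predict(test_texts))

print(scores)
```

MultinomialNB accepts both inputs because it only requires non-negative features. With tfidf, the fractional weights are summed per class in place of integer counts.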

Binary classification (logistic regression) predict wrong label with high accuracy

I have a problem with a binary logistic regression classifier (using scikit-learn, Python 2.7) that predicts the wrong/opposite class with high accuracy. That is, after fitting the model, the predicted scores and predicted probabilities for each class are very consistent, but always for the wrong class. I cannot share the data, but some pseudo-code of my approach is:
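import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

X = np.vstack((cond_1, cond_2))  # shape of X = 200*51102
y = np.concatenate([np.zeros(len(cond_1)), np.ones(len(cond_2))])
scls = []
clfs = []
scores = []
# cv is a cross-validation splitter defined elsewhere (not shown in the question)
for train, test in cv.split(X, y):
    clf = LogisticRegression(C=1)
    scl = StandardScaler()
    # fit the scaler on the training fold only
    scl.fit(X[train])
    X_train = scl.transform(X[train])
    scls.append(scl)
    X_test = scl.transform(X[test])
    clf.fit(X_train, y[train])
    y_pred = clf.predict(X_test)
    scores.append(roc_auc_score(y[test], y_pred))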
The roc_auc scores have a mean of 0.065% and a standard deviation of 0.05%, so something seems to be going on, but what? I have plotted the features and they seem to be roughly normally distributed. I also looked at the probabilities from predict_proba, and they are mostly above 80% for the wrong class/label.
Any ideas what is going on and/or how to properly diagnose the problem?
I apologise for not being able to ask a more precise question but I'm lacking the vocabulary.
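As a first diagnostic step, it may help to rule out a mix-up between `predict_proba` columns and class labels, and to score each fold with probabilities rather than hard labels. The sketch below is only an assumption-laden illustration: it uses synthetic data from `make_classification`, a hypothetical `StratifiedKFold` splitter, and `max_iter=1000`, none of which come from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the two conditions (an assumption).
X, y = make_classification(n_samples=200, n_features=50, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train, test in cv.split(X, y):
    scl = StandardScaler().fit(X[train])
    clf = LogisticRegression(C=1, max_iter=1000)
    clf.fit(scl.transform(X[train]), y[train])
    # Columns of predict_proba follow clf.classes_, so look up the column
    # of the positive label instead of assuming its position.
    positive_column = list(clf.classes_).index(1)
    proba = clf.predict_proba(scl.transform(X[test]))[:, positive_column]
    scores.append(roc_auc_score(y[test], proba))

print("classes_:", clf.classes_)
print("AUC mean/std:", np.mean(scores), np.std(scores))
```

If the column order and the label encoding both check out, the next things to inspect would be how the splitter assigns samples to folds and whether the labels are aligned with the rows of `X`.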
