Model performance is "Good". But coefficient weightings are strange - scikit-learn

I am training a model to detect Good/Bad clients. My input features are:
'Net Receivables', 'Sales', 'Cost of Goods sold', 'Current Assets',
'Property, plant and equipment', 'Securities', 'Total assets',
'Depreciation', 'Selling, General & Administrative Expense',
'Total long term debt', 'Current Liabilites', 'Net Receivables.1',
'Sales.1', 'Cost of Goods sold.1', 'Current Assets.1',
'Property, plant and equipment.1', 'Securities.1', 'Total assets.1',
'Depreciation.1', 'Selling, General & Administrative Expense.1',
'Total long term debt.1', 'Current Liabilites.1',
'Income from Continuing Operations', 'Cash Flows from Operations'
I trained a simple model using logistic regression:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
Then I try to evaluate the model using AUC and accuracy:
from sklearn.metrics import roc_auc_score, accuracy_score

print(roc_auc_score(y_test, pred))
print(accuracy_score(y_test, pred))
The result is
0.765625
0.7727272727272727
But when I try to evaluate the feature importance with
odds = np.exp(clf.coef_[0])
I get strange coefficients; it looks as if no feature is more significant than any other:
array([1.00000001, 1.00000035, 0.99999963, 0.99999987, 0.99999928,
1. , 1. , 0.99999993, 1.00000019, 0.9999994 ,
0.99999976, 1.00000016, 0.99999996, 1.00000003, 0.99999967,
0.99999967, 1. , 1.00000035, 0.99999995, 0.99999985,
1.00000035, 1.00000021, 1.00000008, 1.00000051])
My training set is relatively small: 174 rows * 24 features.
Can I trust the score of the model?

Why do you use np.exp?
And why do you use coef_[0]? The usual way to get the coefficients of your logistic regression is:
print(clf.coef_, clf.intercept_)
See also this post.
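As a quick illustration (assuming X is a pandas DataFrame whose columns are the features listed above, which the original post does not state explicitly), you could pair each coefficient with its feature name instead of reading the raw array:

import pandas as pd

# One row of coefficients because this is a binary problem, so clf.coef_[0] is fine.
coef_table = pd.DataFrame({
    "feature": X.columns,
    "coefficient": clf.coef_[0],
})
print(coef_table.sort_values("coefficient", key=abs, ascending=False))
print("intercept:", clf.intercept_)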

Related

Different results using OneVsRestClassifier(KNeighborsClassifier(n_neighbors=2)) compared to KNeighborsClassifier(n_neighbors=2)

I'm implementing a multi-class classifier and I'm getting different results when I wrap KNN in a multi-class classifier compared to using KNN on its own.
I'm not sure why, as I understood KNN already handles multiclass problems.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, balanced_accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier

y = rock_df['Sample_type']
X = rock_df[col_list]

def model_eval(model, X, y):
    """Implements the classifier model on X and y with a 0.33 test hold-out, stratified by y, and prints cross-validated accuracy and standard deviation.
    Inputs:
    model: The ML model to be tested
    X: the cleaned and preprocessed data (normalized, and NaN dealt with)
    y: Target labels for input data X
    """
    # Split train/test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
    n = X_test.size
    # Fit model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Scoring
    confusion_matrix(y_test, y_pred)
    balanced_accuracy_score(y_test, y_pred)
    scores = cross_val_score(model, X, y, cv=3)
    mean = scores.mean()
    sd = scores.std()
    print("For {} : {:.1%} accuracy on cross validation, with a standard deviation of {:.1%}".format(model, mean, sd))
    # binomial confidence interval - 95% -- confirm difference with SD
    # interval = 1.96 * sqrt((mean * (1 - mean)) / n)
    # print('Confidence Interval: {:.3%}'.format(interval))
    # return balanced_accuracy_score, confusion_matrix

model = OneVsRestClassifier(KNeighborsClassifier(n_neighbors=2))
model_eval(model, X, y)

model = KNeighborsClassifier(n_neighbors=2)
model_eval(model, X, y)
With the first model I get:
For OneVsRestClassifier(estimator=KNeighborsClassifier(n_neighbors=2)) : 78.6% accuracy on cross validation, with a standard deviation of 5.8%
and with the second:
For KNeighborsClassifier(n_neighbors=2) : 83.3% accuracy on cross validation, with a standard deviation of 8.9%
thanks
It is OK that you get different results. KNeighborsClassifier doesn't employ a one-vs-rest strategy; majority voting works with 3 or more classes, so there is no need for OvR in the original implementation. But trying OneVsRestClassifier might be useful as well. I believe that in general the decision boundaries will be different. Here I played with the Iris dataset to compare the decision boundaries of KNeighborsClassifier(n_neighbors=5) and OneVsRestClassifier(KNeighborsClassifier(n_neighbors=5)); a sketch of that comparison follows.
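The plot from the original answer is not reproduced here. A minimal sketch of that kind of experiment, assuming only two Iris features so the boundaries can be drawn in 2D (the grid resolution and plot styling are my own choices):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]  # keep two features so the decision boundary can be plotted

models = {
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=5),
    "OneVsRestClassifier(KNN)": OneVsRestClassifier(KNeighborsClassifier(n_neighbors=5)),
}

# Grid over the 2D feature space
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300),
)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (name, model) in zip(axes, models.items()):
    model.fit(X, y)
    # Predict the class for every grid point and shade the regions
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k", s=20)
    ax.set_title(name)
plt.show()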

How to reduce false positives in xgboost?

My dataset is evenly split between the 0 and 1 classes: 100,000 data points in total, with 50,000 classified as 0 and the other 50,000 classified as 1. I did an 80/20 train/test split and got a 98% accuracy score. However, when looking at the confusion matrix I have an awful lot of false positives. I'm new to xgboost and decision trees in general. What settings can I change in the XGBClassifier to reduce the number of false positives, or is it even possible? Thank you.
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0, stratify=y)  # 80% training and 20% test

model = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                          colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
                          importance_type='gain', interaction_constraints='',
                          learning_rate=0.1, max_delta_step=0, max_depth=9,
                          min_child_weight=1, missing=None, monotone_constraints='()',
                          n_estimators=180, n_jobs=4, num_parallel_tree=1, random_state=0,
                          reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
                          tree_method='exact', use_label_encoder=False,
                          validate_parameters=1, verbosity=None)

model.fit(X_train,
          y_train,
          verbose=True,
          early_stopping_rounds=10,
          eval_metric="aucpr",
          eval_set=[(X_test, y_test)])

plot_confusion_matrix(model,
                      X_test,
                      y_test,
                      values_format='d',
                      display_labels=['Old Forests', 'Not Old Forests'])
Yes.
If you are looking for a simple fix, lower the value of scale_pos_weight. This will lower the false positive rate even though your dataset is balanced.
For a more robust fix, you will need to run a hyperparameter tuning search. In particular, try different values of scale_pos_weight, alpha, lambda, gamma and min_child_weight, since these are the parameters with the most impact on how conservative the model is going to be. A sketch of such a search follows.
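A minimal sketch of that kind of search using scikit-learn's RandomizedSearchCV, reusing X_train and y_train from the split above; the parameter ranges and the precision scoring choice are illustrative assumptions, not recommendations:

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

# Illustrative search space over the parameters mentioned above; adapt the ranges to your data.
param_distributions = {
    "scale_pos_weight": uniform(0.5, 1.0),   # samples from [0.5, 1.5]
    "reg_alpha": uniform(0, 2),              # alpha
    "reg_lambda": uniform(0.5, 2),           # lambda
    "gamma": uniform(0, 5),
    "min_child_weight": randint(1, 10),
}

search = RandomizedSearchCV(
    xgb.XGBClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=50,
    scoring="precision",   # precision directly penalizes false positives
    cv=5,
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_)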

Models evaluation and parameter tuning with CV

I'm trying to compare three models: SVM, RandomForest and LogisticRegression.
I have an imbalanced dataset. First I split it into a train and test set with an 80%/20% ratio, setting stratify=y.
Next, I used StratifiedKFold only on the train set. What I'm trying to do now is fit the models and choose the best one. I also want to use grid search for each of the models to find the best parameters.
My code so far is:
from sklearn.model_selection import train_test_split, StratifiedKFold

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, shuffle=True, stratify=y, random_state=42)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=21)

for train_index, test_index in skf.split(X_train, y_train):
    X_train_folds, X_test_folds = X_train[train_index], X_train[test_index]
    y_train_folds, y_test_folds = y_train[train_index], y_train[test_index]
    X_train_2, X_test_2, y_train_2, y_test_2 = X[train_index], X[test_index], y[train_index], y[test_index]
How can I fit a model using all the folds? How can I grid search? Should I have a double loop? Can you help?
You can use scikit-learn's GridSearchCV.
You will find an example here of how to evaluate the performance of the various models and assess the statistical significance of the results.
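A minimal sketch of how that might look for the three models, reusing the StratifiedKFold splitter on the train set; the parameter grids and the f1 scoring choice are illustrative assumptions:

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=21)

# Illustrative grids; adjust to your problem.
candidates = {
    "SVM": (SVC(), {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}),
    "RandomForest": (RandomForestClassifier(random_state=42),
                     {"n_estimators": [100, 300], "max_depth": [None, 10]}),
    "LogisticRegression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
}

for name, (estimator, grid) in candidates.items():
    # GridSearchCV runs the fold loop internally, so no explicit loop over skf.split() is needed.
    search = GridSearchCV(estimator, grid, cv=skf, scoring="f1")
    search.fit(X_train, y_train)
    print(name, search.best_score_, search.best_params_)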

Why is my detection score high inspite of obvious misclassifications during prediction?

I am working on an intrusion classification problem using the NSL-KDD dataset. I used 10 features (out of 42) for training, selected with recursive feature elimination using a Random Forest classifier as the estimator and the Gini index as the splitting criterion. After training the classifier, I use the same classifier to predict the classes of the test data. My cross-validation scores (accuracy, precision, recall, F-score) from sklearn's cross_val_score are all above 99%. But the confusion matrix shows otherwise, with high false positive and false negative counts. Clearly, they do not match the accuracy and the other scores. Where did I go wrong?
# Train set contains X_train (dataframe of features) and Y_train (series of target labels)
# Test set contains X_test and Y_test

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas

# Classifier
clf = RandomForestClassifier(n_estimators=10, criterion='gini')

# Training
clf.fit(X_train, Y_train)

# Testing
Y_pred = clf.predict(X_test)
pandas.crosstab(Y_test, Y_pred, rownames=['Actual'], colnames=['Predicted'])

# Scoring
accuracy = cross_val_score(clf, X_test, Y_test, cv=10, scoring='accuracy')
print("Accuracy: %0.5f (+/- %0.5f)" % (accuracy.mean(), accuracy.std() * 2))

precision = cross_val_score(clf, X_test, Y_test, cv=10, scoring='precision_weighted')
print("Precision: %0.5f (+/- %0.5f)" % (precision.mean(), precision.std() * 2))

recall = cross_val_score(clf, X_test, Y_test, cv=10, scoring='recall_weighted')
print("Recall: %0.5f (+/- %0.5f)" % (recall.mean(), recall.std() * 2))

f = cross_val_score(clf, X_test, Y_test, cv=10, scoring='f1_weighted')
print("F-Score: %0.5f (+/- %0.5f)" % (f.mean(), f.std() * 2))
I got accuracy, precision, recall and f-score of
Accuracy 0.99825
Precision 0.99826
Recall 0.99825
F-Score 0.99825
However, the confusion matrix showed otherwise
Predicted
Actual
           9670     41
           5113   2347
Am I training the whole thing wrong, or is it just a misclassification problem caused by poor feature selection?
Your predicted values are stored in Y_pred.
accuracy_score(Y_test, Y_pred)
Just check whether this works...
You are not comparing equivalent results! For the confusion matrix, you train on (X_train, Y_train) and test on (X_test, Y_test).
However, cross_val_score fits the estimator on k-1 folds of (X_test, Y_test) and tests it on the remaining fold of (X_test, Y_test), because cross_val_score does its own cross-validation (with 10 folds here) on the dataset you provide. Check out the cross_val_score documentation for more explanation.
So basically, you don't fit and test your algorithm on the same data, which can explain the inconsistency in the results.
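To make the two evaluations comparable, one option (a sketch assuming the variables from the question's code are available) is to score the same hold-out predictions that the confusion matrix is built from, and to run cross-validation on the training data rather than the test data:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score

# Scores computed on the same predictions used for the confusion matrix
print("Hold-out accuracy :", accuracy_score(Y_test, Y_pred))
print("Hold-out precision:", precision_score(Y_test, Y_pred, average='weighted'))
print("Hold-out recall   :", recall_score(Y_test, Y_pred, average='weighted'))
print("Hold-out F-score  :", f1_score(Y_test, Y_pred, average='weighted'))

# Cross-validation on the training data, not the test data
cv_accuracy = cross_val_score(clf, X_train, Y_train, cv=10, scoring='accuracy')
print("CV accuracy (train): %0.5f (+/- %0.5f)" % (cv_accuracy.mean(), cv_accuracy.std() * 2))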

Very few distinct prediction probabilities for CV instances with sparse SVM

I'm having an issue using the prediction probabilities for a sparse SVM, where many of the predictions come out the same for my test instances. These probabilities are produced during cross validation, and when I plot an ROC curve for the folds, the results look very strange, as there are a handful of clustered points on the graph. Here is my cross validation code; I based it on the examples on the scikit-learn website:
skf = StratifiedKFold(y, n_folds=numfolds)
for train_index, test_index in skf:
    # split the training and testing sets
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # train on the subset for this fold
    print 'Training on fold ' + str(fold)
    classifier = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val, probability=True)
    probas_ = classifier.fit(X_train, y_train).predict_proba(X_test)

    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
    mean_tpr += interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
    roc_auc = auc(fpr, tpr)
I'm just trying to figure out if there's something I'm obviously missing here, since I used this same training set and SVM parameters with libsvm and got much better results. When I used libsvm, printed out the distances from the hyperplane for the CV test instances, and then plotted the ROC, it came out much closer to what I expected, with a much better AUC.
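One thing that may be worth trying (my suggestion, not part of the original post): build the ROC curve from the SVM's decision_function, i.e. the signed distances to the hyperplane that libsvm was printing, instead of the Platt-scaled predict_proba output. A sketch in current scikit-learn style, reusing the per-fold variables from the loop above:

from sklearn import svm
from sklearn.metrics import roc_curve, auc

# decision_function returns the signed distance to the separating hyperplane,
# so no probability calibration is involved.
classifier = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val)
scores = classifier.fit(X_train, y_train).decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, scores)
roc_auc = auc(fpr, tpr)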
