I am following the TensorFlow images tutorial, but I am having trouble setting up a confusion matrix because the tutorial does not use the X_test, y_test format that traditional examples do:
Ex:
from sklearn.metrics import classification_report, confusion_matrix
y_proba = model.predict(X_test)
y_pred = np.argmax(y_proba, axis=1)
print('Confusion Matrix')
print(confusion_matrix(y_test, y_pred))
print('Classification Report')
print(classification_report(y_test, y_pred))
How can I set up a confusion matrix based on the TensorFlow images tutorial?
I looked over the tutorial that you provided, but I cannot find the predicted values.
You could try to first predict and then compare the true values with the predicted ones.
Try the following or something similar:
import numpy as np
from sklearn.metrics import confusion_matrix
# Collect the true labels from the batched validation dataset
y_true = np.concatenate([labels for images, labels in val_ds], axis=0)
y_proba = model.predict(val_ds)
y_pred = np.argmax(y_proba, axis=1)
print('Confusion Matrix')
print(confusion_matrix(y_true, y_pred))
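One caveat: the snippet above assumes val_ds yields batches in the same order on both passes. If the dataset reshuffles between iterations, the collected labels will not line up with the predictions; a single-pass sketch (assuming val_ds yields (images, labels) batches) avoids that:
import numpy as np
from sklearn.metrics import confusion_matrix
y_true, y_pred = [], []
for images, labels in val_ds:  # one pass, so labels and predictions stay aligned
    probs = model.predict(images, verbose=0)
    y_true.append(labels.numpy())
    y_pred.append(np.argmax(probs, axis=1))
print(confusion_matrix(np.concatenate(y_true), np.concatenate(y_pred)))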
Well, I am trying to solve a classification problem that involves the Gaussian Naive Bayes algorithm.
Question:
Classification
Consider the data in the file (link below). Train a Gaussian Naive Bayes classifier using the holdout cross-validation method (use the first 700 lines for the training set and the rest for the test set). What is the accuracy on the training set? What is the accuracy on the test set? Do the same training with the Leave-One-Out method. What is the average accuracy for the training set? What is the average accuracy for the test set?
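For reference, the holdout split the statement describes could be set up as below; this is only a sketch, and the file name and the label-in-last-column layout are assumptions:
import pandas as pd
from sklearn.naive_bayes import GaussianNB
df = pd.read_csv('data.csv')  # hypothetical file name
X, y = df.iloc[:, :-1], df.iloc[:, -1]  # assumes the label is the last column
# Holdout as specified: first 700 rows for training, the rest for testing
X_train, y_train = X.iloc[:700], y.iloc[:700]
X_test, y_test = X.iloc[700:], y.iloc[700:]
classifier = GaussianNB().fit(X_train, y_train)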
My solution that I am not sure about:
Basic code (full code in the Colab link below):
# Using holdout
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred_train = classifier.predict(X_train)
y_pred = classifier.predict(X_test)
cm0 = confusion_matrix(y_train, y_pred_train)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
print(accuracy_score(y_train, y_pred_train))
My Answer for the holdout:
[[ 23  51]
 [ 21 205]]
0.76
0.7871428571428571
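(As a sanity check, the test accuracy matches the confusion matrix: (23 + 205) / 300 = 0.76.)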
LOO:
# Using Leave-One-Out
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import LeaveOneOut
# This is where I got the code: https://machinelearningmastery.com/loocv-for-evaluating-machine-learning-algorithms/
cv = LeaveOneOut()
# Score against the true labels (y_train), not the model's own predictions
accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, scoring='accuracy', cv=cv)
print(f"Accuracy Train {accuracies.mean()}")
print(f"Standard Deviation {accuracies.std()}")
accuraciestest = cross_val_score(estimator=classifier, X=X_test, y=y_test, scoring='accuracy', cv=cv)
print(f"Accuracy Test {accuraciestest.mean()}")
print(f"Standard Deviation Test {accuraciestest.std()}")
My Answer for the LeaveOneOut:
Accuracy Train 0.9771428571428571
Standard Deviation 0.1494479637785374
Accuracy Test 0.7433333333333333
Standard Deviation Test 0.43679387460092534
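Note that with LeaveOneOut each fold holds out a single sample, so every entry of accuracies is either 0 or 1: the mean is the LOO accuracy, and standard deviations as large as these are expected for 0/1 scores.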
Data:
https://drive.google.com/file/d/1v9V-007yV3vVckPcQN0Q5VuNZYF_JjBW/view?usp=sharing
Colab link: https://colab.research.google.com/drive/1X68-Li6FacnAAQ4ASg3mmqdrU15v2ReP?usp=sharing
I'm trying to create a Naive Bayes classifier in Python. For finding the accuracy of the classifier, I have train and test data explicitly available, and I want to train my model using train.csv and then test it on test.csv.
Is there a function other than scikit-learn's train_test_split that can help me do that?
Following on from the comment above:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import mean_squared_error
# Create an instance
nv_clf = GaussianNB()
# Fit on the training set
nv_clf.fit(X_train, y_train)
# Predict on X_test
y_pred = nv_clf.predict(X_test)
# Calculate error on y_test
nv_mse = mean_squared_error(y_test, y_pred)
# or
nv_rmse = np.sqrt(nv_mse)  # root mean squared error
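For completeness, the X_train / y_train variables above can come straight from the two files, with no train_test_split involved; in this sketch the target column name 'label' is a placeholder:
import pandas as pd
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# 'label' stands in for the actual target column name
X_train, y_train = train.drop(columns=['label']), train['label']
X_test, y_test = test.drop(columns=['label']), test['label']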
I am new to machine learning and I am building my first model independently. I have a dataset that evaluates cars: it contains features for price, safety, and luxury, and classifies each car as good, very good, acceptable, or unacceptable. I converted all the non-numeric columns into numeric ones, trained the model, and predicted on a test set. However, my predictions are awful: I used LinearRegression, and r2_score outputs 0.05, which is practically 0. I have tried a few different models and all of them have given me horrible predictions and accuracy.
What am I doing wrong? I have seen tutorials and read articles with similar methodology, yet they end up with 0.92 accuracy and I'm getting 0.05. How do you make a good model for your data, and how do you know which model to use?
Code:
import numpy as np
import pandas as pd
from sklearn import preprocessing, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class value']
df = pd.read_csv('car.data.txt', index_col=False, names=columns)
for col in df.columns.values:
    try:
        if df[col].astype(int):
            pass
    except ValueError:
        enc = preprocessing.LabelEncoder()
        enc.fit(df[col])
        df[col] = enc.transform(df[col])
#Split the data
class_y = df.pop('class value')
x_train, x_test, y_train, y_test = train_test_split(df, class_y, test_size=0.2, random_state=0)
#Make the model
regression_model = linear_model.LinearRegression()
regression_model = regression_model.fit(x_train, y_train)
#Predict the test data
y_pred = regression_model.predict(x_test)
score = r2_score(y_test, y_pred)
You should not use linear regression, which is for predicting continuous values, not categorical ones. In your case, what you are trying to predict is categorical: each of the four ratings (good, very good, acceptable, unacceptable) is a class.
I would suggest trying logistic regression or another classification method, such as Naive Bayes, an SVM, or a decision tree classifier, instead.
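As a sketch of that change against the code above, swapping LinearRegression for LogisticRegression and scoring with accuracy (max_iter is raised only so the default solver converges):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
clf = LogisticRegression(max_iter=1000)  # a classifier instead of a regressor
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print(accuracy_score(y_test, y_pred))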
I would like to use k-fold cross validation while learning a model. So far I am doing it like this:
# splitting dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(dataset_1, df1['label'], test_size=0.25, random_state=4222)
# learning a model
model = MultinomialNB()
model.fit(X_train, y_train)
scores = cross_val_score(model, X_train, y_train, cv=5)
At this step I am not quite sure whether I should call model.fit() or not, because in the official scikit-learn documentation they do not fit but just call cross_val_score, as follows (they do not even split the data into training and test sets):
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
I would also like to tune the hyperparameters of the model while training it. What is the right pipeline?
If you want to do hyperparameter selection, look into RandomizedSearchCV or GridSearchCV. If you want to use the best model afterwards, call either of these with refit=True and then use best_estimator_.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
# 'liblinear' supports both the 'l1' and 'l2' penalties
log_params = {'penalty': ['l1', 'l2'], 'C': [1E-7, 1E-6, 1E-5, 1E-4, 1E-3]}
clf = LogisticRegression(solver='liblinear')
search = RandomizedSearchCV(clf, scoring='average_precision', cv=10,
                            n_iter=10, param_distributions=log_params,
                            refit=True, n_jobs=-1)
search.fit(X_train, y_train)
clf = search.best_estimator_
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
Your second example is right for doing the cross validation. See the example here: http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics
The fitting will be done inside the cross_val_score function; you don't need to worry about doing it beforehand.
If, besides cross-validation, you want to end up with a trained model, you can call model.fit() afterwards.
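Putting that together with the question's own variables, a minimal sketch of the pipeline:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
X_train, X_test, y_train, y_test = train_test_split(
    dataset_1, df1['label'], test_size=0.25, random_state=4222)
model = MultinomialNB()
# cross_val_score clones and fits the model internally on each fold
scores = cross_val_score(model, X_train, y_train, cv=5)
# Fit once on the full training set to get a usable final model
model.fit(X_train, y_train)
print(model.score(X_test, y_test))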
I'm using scikit-learn cross-validation (http://scikit-learn.org/stable/modules/cross_validation.html) and get, for example, a mean score of 0.82 (with r2_scorer).
How can I tell whether I am overfitting or underfitting using scikit-learn functions?
Unfortunately, I can confirm that there is no built-in tool to compare train and test scores in a CV setup; the cross_val_score tool only reports test scores.
You can set up your own loop with the train_test_split function, as in Ando's answer, but you can also use any other CV scheme:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import get_scorer
scorer = get_scorer('r2')
cv = KFold(n_splits=5)
train_scores, test_scores = [], []
# Fit on each training fold and score on both parts of the split
for train, test in cv.split(X):
    regressor.fit(X[train], y[train])
    train_scores.append(scorer(regressor, X[train], y[train]))
    test_scores.append(scorer(regressor, X[test], y[test]))
mean_train_score = np.mean(train_scores)
mean_test_score = np.mean(test_scores)
If you compute the mean train and test scores with cross validation you can then find out if you are:
Underfitting: the train score is far from the perfect score (which is 1.0 for r2)
Overfitting: the train and test scores are not close to one another (the mean test score is significantly lower than the mean train score).
Note: you can be both significantly underfitting and overfitting at the same time if your model is inadequate and your data is too noisy.
You should compare your scores on the training and the testing data. If the scores are far apart, you are likely overfitting (unless you are using a method such as random forest); if they are close but both poor, you are likely underfitting.
To compute the scores for both the train and test data, you can use something along the following lines (assuming your data is in variables X and Y):
from sklearn import svm
from sklearn.model_selection import train_test_split
# Do five iterations
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4)
    # Your predictor, a linear SVM in this example
    clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
    print("Test score", clf.score(X_test, y_test))
    print("Train score", clf.score(X_train, y_train))