confusion matrix for multiclass data - keras

I have fit the following model to my 7 classes data and I want to create a confusion matrix for my model:
history1 = model1.fit(
    data_generator.flow(train_x, to_categorical(train_y), batch_size=BATCH_SIZE),
    steps_per_epoch=len(train_x) / BATCH_SIZE,
    validation_data=data_generator.flow(val_x, to_categorical(val_y), batch_size=BATCH_SIZE),
    validation_steps=len(val_x) / BATCH_SIZE,
    epochs=NUM_EPOCHS)
Also, when I did the following to predict the training set results:
y_train_pred = model1.predict(train_x)
cm_train = confusion_matrix(train_y, y_train_pred)
It gave me this error:
Classification metrics can't handle a mix of unknown and multiclass targets
Can you please guide me on how to do it?

You seem to be using
sklearn.metrics.confusion_matrix
try using
from sklearn.metrics import multilabel_confusion_matrix
instead.
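Whichever metric you use, note that model1.predict returns one probability per class, so the predictions have to be collapsed to integer class labels first. A minimal sketch, assuming train_y holds integer labels 0-6 and the model ends in a 7-way softmax:
import numpy as np
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix

# Convert per-class probabilities into predicted class labels
y_train_prob = model1.predict(train_x)
y_train_pred = np.argmax(y_train_prob, axis=1)

cm_train = confusion_matrix(train_y, y_train_pred)              # one 7x7 matrix
mcm_train = multilabel_confusion_matrix(train_y, y_train_pred)  # seven 2x2 one-vs-rest matrices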

Related

Gaussian Process Regressor scikit learn does not recognise 'eval_MSE=True'

I am using the Gaussian Process Regressor from scikit-learn to predict data for my models. While using the GP, I also need to find the uncertainty of each value present in the dataset. The documentation suggests using "gp.predict(self, X, eval_MSE=True)". I used the same 'eval_MSE' in code available online to test it, but it gives me this error.
TypeError: predict() got an unexpected keyword argument 'eval_MSE'
The code I used for testing:
gp = GaussianProcessRegressor(corr='squared_exponential', theta0=1e-1,
                              thetaL=1e-3, thetaU=1,
                              nugget=(dy / y) ** 2,
                              random_start=100)
gp.fit(X, y)
y_pred, MSE = gp.predict(x, eval_MSE=True)
sigma = np.sqrt(MSE)
Can anyone provide a solution for this?
Either go back to a previous version of scikit-learn and use GaussianProcess.predict,
... or adapt to the newest API: GaussianProcessClassifier.predict
Not only have the predict arguments changed, but also the name of the class itself, its input arguments, etc.
Summary from previous links:
Old GaussianProcess (version 0.17):
class sklearn.gaussian_process.GaussianProcess(regr='constant', corr='squared_exponential', beta0=None, storage_mode='full', verbose=False, theta0=0.1, thetaL=None, thetaU=None, optimizer='fmin_cobyla', random_start=1, normalize=True, nugget=2.2204460492503131e-15, random_state=None)
predict(X, eval_MSE=False, batch_size=None)
New GaussianProcessClassifier:
class sklearn.gaussian_process.GaussianProcessClassifier(kernel=None, *, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0, max_iter_predict=100, warm_start=False, copy_X_train=True, random_state=None, multi_class='one_vs_rest', n_jobs=None)
predict(X)
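Since the question actually uses GaussianProcessRegressor, the modern counterpart of eval_MSE=True in the regressor API is return_std=True on GaussianProcessRegressor.predict. A rough sketch under that assumption (the old corr/theta0/nugget arguments are replaced by a kernel object and the alpha noise term):
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

kernel = RBF(length_scale=0.1)                      # plays the role of corr/theta0
gp = GaussianProcessRegressor(kernel=kernel,
                              alpha=(dy / y) ** 2,  # per-sample noise, like the old nugget
                              n_restarts_optimizer=10)
gp.fit(X, y)

# return_std=True returns the predictive standard deviation directly
y_pred, sigma = gp.predict(x, return_std=True)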

How to use cross_val_predict to predict probabilities for a new dataset?

I am using sklearn's cross_val_predict for training like so:
myprobs_train = cross_val_predict(LogisticRegression(), X=x_old, y=y_old, method='predict_proba', cv=10)
I am happy with the returned probabilities, and would like now to score up a brand-new dataset. I tried:
myprobs_test = cross_val_predict(LogisticRegression(), X=x_new, y=None, method='predict_proba', cv=10)
but this did not work; it complains about y having zero shape. Does that mean there is no way to apply the trained and cross-validated model from cross_val_predict to new data? Or am I just using it wrong?
Thank you!
You are looking at the wrong method. Cross-validation methods do not return a trained model; they return values that evaluate the performance of a model (logistic regression in your case). Your goal is to fit some data and then generate predictions for new data. The relevant methods are fit and predict of the LogisticRegression class. Here is the basic structure:
from sklearn import linear_model

logreg = linear_model.LogisticRegression()
logreg.fit(x_old, y_old)
predictions = logreg.predict(x_new)
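Since the original goal was probabilities rather than hard labels, the same fitted model can also return those (assuming the fitted logreg above):
probs_new = logreg.predict_proba(x_new)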
I have the same concern as #user3490622. If we can only use cross_val_predict on training and testing sets, why is y (target) None by default? (sklearn page)
To partially achieve the desired result of multiple predicted probabilities, one could repeatedly use the fit-then-predict approach to mimic the cross-validation, as sketched below.
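A rough sketch of that idea, assuming x_old, y_old, and x_new are NumPy arrays: fit one model per training fold and average the per-fold probabilities for the new data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

fold_probs = []
for train_idx, _ in KFold(n_splits=10, shuffle=True, random_state=0).split(x_old):
    # One model per fold, each scoring the brand-new data
    model = LogisticRegression().fit(x_old[train_idx], y_old[train_idx])
    fold_probs.append(model.predict_proba(x_new))

# Average the ten per-fold probability estimates for x_new
myprobs_test = np.mean(fold_probs, axis=0)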

GridsearchCV with RandomForest

So I am doing some parameter tuning with RandomForest and GridSearchCV. Here is my code.
# Import 'GridSearchCV', 'make_scorer', and the classifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, f1_score
from sklearn.ensemble import RandomForestClassifier

# Create the parameters list you wish to tune
parameters = {'n_estimators': [5, 10, 15]}

# Initialize the classifier
clf = GridSearchCV(RandomForestClassifier(), parameters)

# Make an f1 scoring function using 'make_scorer'
f1_scorer = make_scorer(f1_score)

# Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(clf, param_grid=parameters, scoring=f1_scorer, cv=5)
print(clf.get_params().keys())

# Fit the grid search object to the training data and find the optimal parameters
grid_obj = grid_obj.fit(X_train_100, y_train_100)
So the issue is the following error: "ValueError: Invalid parameter max_features for estimator GridSearchCV. Check the list of available parameters with estimator.get_params().keys()."
I followed the advice given by the error, and the output of print(clf.get_params().keys()) is below. However, even when I copy and paste these keys into my parameter dictionary, I still get an error. I've hunted around Stack Overflow and most people are using really similar parameter dictionaries to mine. Anyone have any idea on how to iron out this issue? Thanks again!
dict_keys(['pre_dispatch', 'cv', 'estimator__max_features', 'param_grid', 'refit', 'estimator__min_impurity_split', 'n_jobs', 'estimator__random_state', 'error_score', 'verbose', 'estimator__min_samples_split', 'estimator__n_jobs', 'fit_params', 'estimator__min_weight_fraction_leaf', 'scoring', 'estimator__warm_start', 'estimator__criterion', 'estimator__verbose', 'estimator__bootstrap', 'estimator__class_weight', 'estimator__oob_score', 'iid', 'estimator', 'estimator__max_depth', 'estimator__max_leaf_nodes', 'estimator__min_samples_leaf', 'estimator__n_estimators', 'return_train_score'])
I think the problem is with the two lines:
clf = GridSearchCV(RandomForestClassifier(), parameters)
grid_obj = GridSearchCV(clf, param_grid=parameters, scoring=f1_scorer, cv=5)
What this is essentially doing is creating an object with a structure like:
grid_obj = GridSearchCV(GridSearchCV(RandomForestClassifier()))
which is probably one more GridSearchCV than you want.
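A minimal sketch of the corrected structure, assuming X_train_100 / y_train_100 are as in the question (the average argument of f1_score is only needed if the target has more than two classes):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, f1_score

parameters = {'n_estimators': [5, 10, 15]}
f1_scorer = make_scorer(f1_score, average='weighted')

# Wrap the classifier in a single GridSearchCV, not a GridSearchCV of a GridSearchCV
grid_obj = GridSearchCV(RandomForestClassifier(), param_grid=parameters,
                        scoring=f1_scorer, cv=5)
grid_obj = grid_obj.fit(X_train_100, y_train_100)
print(grid_obj.best_params_)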

How do I correctly manually recreate sklearn (python) logistic regression predict_proba outcome for multiple classification

If I run a basic logistic regression with 4 classes, I can get the predict_proba array.
How can I manually calculate the probabilities using the coefficients and intercepts? What are the exact steps to get the same answers that predict_proba generates?
There seem to be multiple questions about this online and several suggestions which are either incomplete or don't match up anyway.
For example, I can't replicate this process from my sklearn model, so what is missing?
https://stats.idre.ucla.edu/stata/code/manually-generate-predicted-probabilities-from-a-multinomial-logistic-regression-in-stata/
Thanks,
Because I had the same question but could not find an answer that gave the same results, I had a look at the sklearn GitHub repository. Using the functions from their own package I was able to create the same results I got from predict_proba().
It appears that, in its code, sklearn uses a special softmax() function that differs from the usual softmax function.
Let's assume you build a model like this:
from sklearn.linear_model import LogisticRegression
X = ...
Y = ...
model = LogisticRegression(multi_class="multinomial", solver="saga")
model.fit(X, Y)
Then you can calculate the probabilities either with model.predict_proba(X) or use the sklearn function mentioned above to calculate them manually like this:
from sklearn.utils.extmath import softmax
import numpy as np

scores = np.dot(X, model.coef_.T) + model.intercept_
softmax(scores)  # sklearn's implementation
In the documentation for their own softmax() function, they note that
The softmax function is calculated by
np.exp(X) / np.sum(np.exp(X), axis=1)
This will cause overflow when large values are exponentiated. Hence
the largest value in each row is subtracted from each data point to
prevent this.
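That stabilisation is easy to reproduce with plain NumPy; a minimal sketch (subtracting the row-wise maximum does not change the result, it only prevents overflow):
import numpy as np

def stable_softmax(scores):
    # Subtract the row-wise max before exponentiating, as sklearn's softmax does
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)
For a model fitted with multi_class="multinomial", stable_softmax(scores) should match model.predict_proba(X) row for row.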
Replicate sklearn calcs (saw this on a different post):
V = X_train.values.dot(model.coef_.transpose())
U = V + model.intercept_
A = np.exp(U)
P = A / (1 + A)
P /= P.sum(axis=1).reshape((-1, 1))
It seems slightly different from the softmax calculation, or the UCLA stats example, but it works.

SVM overfitting in scikit learn

I am building a digit recognition classifier using SVM. I have 10000 samples and I split them into training and test data with a ratio of 7:3. I use a linear kernel.
It turns out that the training accuracy is always 1 regardless of the number of training examples, while the test accuracy is only around 0.9 (I was expecting a much better accuracy, at least 0.95). I think this indicates overfitting. However, when I worked on the parameters, like C, gamma, ..., they didn't change the results very much.
Can anyone help me out with how to deal with overfitting in SVM? Thanks very much in advance for your time and help.
The following is my code:
from sklearn import svm, cross_validation

svc = svm.SVC(kernel='linear', C=10000, gamma=0.0, verbose=True).fit(sample_X, sample_y_1Num)
clf = svc
predict_y_train = clf.predict(sample_X)
predict_y_test = clf.predict(test_X)
accuracy_train = clf.score(sample_X, sample_y_1Num)
accuracy_test = clf.score(test_X, test_y_1Num)

# conduct cross-validation
cv = cross_validation.ShuffleSplit(sample_y_1Num.size, n_iter=10, test_size=0.2, random_state=None)
scores = cross_validation.cross_val_score(clf, sample_X, sample_y_1Num, cv=cv)
score_mean = scores.mean()
One way to reduce the overfitting is by adding more training observations. Since your problem is digit recognition, it is easy to synthetically generate more training data by slightly changing the observations in your original data set. You can generate 4 new observations from each of your existing observations by shifting the digit images one pixel left, right, up, and down. This will greatly increase the size of your training data set and should help the classifier learn to generalize, instead of learning the noise.
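A rough sketch of that augmentation, assuming the digits are stored as a (n_samples, height, width) NumPy array (the sample_X_images name below is hypothetical, a reshaped view of the training set). Note that np.roll wraps pixels around the edge, so scipy.ndimage.shift, which pads with zeros, may be preferable for real images:
import numpy as np

def shift_digits(images, labels):
    # Create four extra copies of every image: one pixel up, down, left, and right
    augmented_images = [images]
    augmented_labels = [labels]
    for axis, step in [(1, -1), (1, 1), (2, -1), (2, 1)]:
        augmented_images.append(np.roll(images, step, axis=axis))
        augmented_labels.append(labels)
    return np.concatenate(augmented_images), np.concatenate(augmented_labels)

# sample_X_aug, sample_y_aug = shift_digits(sample_X_images, sample_y_1Num)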