Cross_val_predict: Getting predicted values and predicted probabilities in one step - scikit-learn

The following example script outputs the predicted values and the predicted probabilities:
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
lg = linear_model.LogisticRegression(random_state=0, solver='lbfgs')
y_prob = cross_val_predict(lg, X, y, cv=4, method='predict_proba')
y_pred = cross_val_predict(lg, X, y, cv=4)
y_prob[0:5]
y_pred[0:5]
I tried the following without success:
test = cross_val_predict(lg, X, y, cv=4, method=['predict','predict_proba'])
Question: Is there a way to get both the predicted values and the predicted probabilities in one step, without running the cross-validation twice? I also have to make sure that the values and the probabilities correspond to the same input data.

The values of y_pred can be derived from y_prob:
import numpy as np
# The probabilities, as in the original code sample
y_prob = cross_val_predict(lg, X, y, cv=4, method='predict_proba')
# Get the list of classes that matches the columns of `y_prob`
y_sorted = np.unique(y)
# Use the highest probability for predicting the label
indices = np.argmax(y_prob, axis=1)
# Get the label for each sample
y_pred = y_sorted[indices]
Now, it may happen that y_pred from cross_val_predict does not match the y_pred derived here in all cases. This happens when there are multiple classes with an identical highest probability, as is the case in your sample code: for the first sample, the predicted probabilities are zero for all classes. Anyway, it seems to me that logistic regression (which is, in fact, a classification method) is not suitable for the diabetes dataset, whose target is continuous.
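As a quick sanity check (a minimal sketch, not part of the original answer), the derived labels can be compared against a direct cross_val_predict call; the two should agree everywhere the maximum probability is unique:
# Mismatches can only occur where the highest probability is tied
y_pred_direct = cross_val_predict(lg, X, y, cv=4)
print(np.mean(y_pred == y_pred_direct))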
For the rationale of y_sorted see the cross_val_predict docs:
method : string, optional, default: ‘predict’
Invokes the passed method name of the passed estimator. For method=’predict_proba’, the columns correspond to the classes in sorted order.

Related

How to use Sklearn linear regression with doc2vec input

I have 250k text documents (tweets and newspaper articles) represented as vectors obtained with a doc2vec model. Now, I want to use a regressor (multiple linear regression) to predict continuous value outputs - in my case the UK Consumer Confidence Index.
My code has been running forever. What am I doing wrong?
I imported my data from Excel and split it into x_train and x_dev. The data consist of preprocessed text and continuous CCI values.
# Import the doc2vec models
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from gensim.test.test_doc2vec import ConcatenatedDoc2Vec
dbow = Doc2Vec.load('dbow_extended.d2v')
dmm = Doc2Vec.load('dmm_extended.d2v')
concat = ConcatenatedDoc2Vec([dbow, dmm]) # model uses vector_size 400
def get_vectors(model, input_docs):
    vectors = [model.infer_vector(doc.words) for doc in input_docs]
    return vectors
# Prepare X_train and y_train
train_text = x_train["preprocessed_text"].tolist()
train_tagged = [TaggedDocument(words=str(_d).split(), tags=[str(i)]) for i, _d in list(enumerate(train_text))]
X_train = get_vectors(concat, train_tagged)
y_train=x_train['CCI_UK']
# Fit regressor
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)
# Predict and evaluate
prediction = reg.predict(X_dev)
print(classification_report(y_true=y_dev, y_pred=prediction), '\n')
Since the fitting never completes, I wonder whether I am passing the wrong input. However, no error message is shown and the code simply runs forever. What am I doing wrong?
Thank you so much for your help!!
The variable X_train is a list of vectors (since the function get_vectors() returns a list), whereas the input to sklearn's LinearRegression should be a 2-D array.
Try converting X_train to an array:
import numpy as np
X_train = np.array(X_train)
This should help!
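A quick sanity check after the conversion (a minimal sketch): the result should be a 2-D float array with one row per document.
print(X_train.shape)  # expect (n_samples, n_dims); a 1-D object array signals ragged vectors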

Multinomial naive bayes softmax altering

In scikit-learn, I was doing multi-class classification using MultinomialNB on labelled text data.
While predicting, I used the "predict_proba" feature of MultinomialNB:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
print(clf.fit(X_train,Y_train))
clf.predict_proba(X_test[0])
As a result I got a vector of probability values for each class which added up to 1. I know this is because of the softmax cross-entropy function.
array([[0.01245064, 0.02346781, 0.84694063, 0.03238112, 0.01833107,
        0.03103464, 0.03539408]])
My question here is: while predicting, I need binary cross-entropy, so that I get a vector of probability values for each class between 0 and 1, independent of each other. So how do I change the function when doing prediction in scikit-learn?
You can get the log likelihood for every class by using:
def _joint_log_likelihood(self, X):
    """Compute the unnormalized posterior log probability of X

    I.e. ``log P(c) + log P(x|c)`` for all rows x of X, as an array-like of
    shape [n_classes, n_samples].

    Input is passed to _joint_log_likelihood as-is by predict,
    predict_proba and predict_log_proba.
    """
Naive Bayes' predict_log_proba works simply by normalizing the function above:
def predict_log_proba(self, X):
    """
    Return log-probability estimates for the test vector X.

    Parameters
    ----------
    X : array-like, shape = [n_samples, n_features]

    Returns
    -------
    C : array-like, shape = [n_samples, n_classes]
        Returns the log-probability of the samples for each class in
        the model. The columns correspond to the classes in sorted
        order, as they appear in the attribute `classes_`.
    """
    jll = self._joint_log_likelihood(X)
    # normalize by P(x) = P(f_1, ..., f_n)
    log_prob_x = logsumexp(jll, axis=1)
    return jll - np.atleast_2d(log_prob_x).T
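On a fitted classifier you can call this method directly to get per-class scores that are not forced to sum to 1 (a sketch; note that _joint_log_likelihood is a private sklearn method and may change between versions):
import numpy as np
jll = clf._joint_log_likelihood(X_test)  # log P(c) + log P(x|c) per class
unnormalized = np.exp(jll)               # independent per-class scores, no normalization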

Cross_val_predict return each test result independently

From my understanding, cross_val_predict with the cv parameter set to 10 will split the data into 10 folds, use one as the test set and the other 9 as the training set, and repeat this until all 10 folds have been used as the test data. The return value is one combined data set containing all the predictions for the data, with confidence levels.
What I want is something like an array containing ten arrays of predictions, so I can calculate a mean score. I know cross_val_score returns this, but I'm trying to compare the ROC AUC metric, so I need to use the 'predict_proba' method. Below is the code that produces the single output I currently receive:
from sklearn.preprocessing import label_binarize
y_bin = label_binarize(y, classes=['label1','label2'])
#import module to calculate roc_auc score
from sklearn.metrics import roc_auc_score
#Import the cross-validation prediction module
from sklearn.model_selection import cross_val_predict
#using the loops to change parameters again
from sklearn.neural_network import MLPClassifier
NNmodel = MLPClassifier(hidden_layer_sizes=(10,), activation='logistic')
proba = cross_val_predict(NNmodel, X, y, cv=10, method='predict_proba')
auc = roc_auc_score(y_bin, proba[:, 1])
print("Hidden Neurons set to 10", "ROC AUC Score = %", auc * 100)
I would much rather report the mean score of each test set rather than just the single result I can produce at the moment.
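One way to get the per-fold scores (a sketch, reusing NNmodel from above): cross_val_score evaluates each test fold separately, and the 'roc_auc' scorer obtains the probability estimates it needs internally, so it returns one AUC per fold:
from sklearn.model_selection import cross_val_score
fold_aucs = cross_val_score(NNmodel, X, y, cv=10, scoring='roc_auc')  # one AUC per test fold
print("Mean ROC AUC = %", fold_aucs.mean() * 100, "+/-", fold_aucs.std() * 100)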

scikit-learn LogisticRegressionCV: best coefficients

I am trying to understand how the best coefficients are calculated in a logistic regression cross-validation, where the "refit" parameter is True.
If I understand the docs correctly, the best coefficients are the result of first determining the best regularization parameter "C", i.e., the value of C that has the highest average score over all folds. Then, the best coefficients are simply the coefficients that were calculated on the fold that has the highest score for the best C. I assume that if the maximum score is achieved by several folds, the coefficients of these folds would be averaged to give the best coefficients (I didn't see anything on how this case is handled in the docs).
To test my understanding, I determined the best coefficients in two different ways:
directly from the coef_ attribute of the fitted model, and
from the coefs_paths attribute, which contains the path of the coefficients obtained during cross-validating across each fold and then across each C.
The results I get from 1. and 2. are similar but not identical, so I was hoping someone could point out what I am doing wrong here.
Thanks!
An example to demonstrate the issue:
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Set parameters
n_folds = 10
C_values = [0.001, 0.01, 0.05, 0.1, 1., 100.]
# Load and preprocess data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_train)
# Fit model
clf = LogisticRegressionCV(Cs=C_values, cv=n_folds, penalty='l1',
                           refit=True, scoring='roc_auc',
                           solver='liblinear', random_state=0,
                           fit_intercept=False)
clf.fit(X_train_scaled, y_train)
########################
# Get and plot coefficients using method 1
########################
coefs1 = clf.coef_
coefs1_series = pd.Series(coefs1.ravel(), index=cancer['feature_names'])
coefs1_series.sort_values().plot(kind="barh")
########################
# Get and plot coefficients using method 2
########################
# mean of scores of class "1"
scores = clf.scores_[1]
mean_scores = np.mean(scores, axis=0)
# Get index of the C that has the highest average score across all folds
best_C_idx = np.where(mean_scores==np.max(mean_scores))[0][0]
# Get index (here: indices) of the folds with highest scores for the
# best C
best_folds_idx = np.where(scores[:, best_C_idx]==np.max(scores[:, best_C_idx]))[0]
paths = clf.coefs_paths_[1] # has shape (n_folds, len(C_values), n_features)
coefs2 = np.squeeze(paths[best_folds_idx, best_C_idx, :])
coefs2 = np.mean(coefs2, axis=0)
coefs2_series = pd.Series(coefs2.ravel(), index=cancer['feature_names'])
coefs2_series.sort_values().plot(kind="barh")
I think this article answers your question: https://orvindemsy.medium.com/understanding-grid-search-randomized-cvs-refit-true-120d783a5e94.
The key point is the refit parameter of LogisticRegressionCV.
According to sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html)
refit : bool, default=True
If set to True, the scores are averaged across all folds, and the coefs and the C that corresponds to the best score is taken, and a final refit is done using these parameters. Otherwise the coefs, intercepts and C that correspond to the best scores across folds are averaged.
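The practical consequence for the comparison above (a sketch based on the quoted docs): with refit=True, clf.coef_ comes from one final fit on the full training data at the best C, so it will generally not match any single entry or average of the per-fold paths in clf.coefs_paths_:
print(clf.C_)     # best C, chosen by averaging scores across folds
print(clf.coef_)  # from the final refit on all of X_train_scaled, not from coefs_paths_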
Best.

Multi-label classification with class weights in Keras

I have 1000 classes in the network and they have multi-label outputs. For each training example, the number of positive outputs is the same (i.e. 10), but they can be assigned to any of the 1000 classes. So 10 classes have output 1 and the remaining 990 have output 0.
For the multi-label classification, I am using 'binary cross-entropy' as the cost function and 'sigmoid' as the activation function. When I applied the rule of 0.5 as the cut-off for 1 or 0, all of the outputs were 0. I understand this is a class imbalance problem. From this link, I understand that I might have to create extra output labels. Unfortunately, I haven't been able to figure out how to incorporate that into a simple neural network in Keras.
import keras
from keras.layers import Input, Dense
from keras.models import Model

nclasses = 1000
# if we wanted to maximize an imbalance problem!
#class_weight = {k: len(Y_train)/(nclasses*(Y_train==k).sum()) for k in range(nclasses)}
inp = Input(shape=[X_train.shape[1]])
x = Dense(5000, activation='relu')(inp)
x = Dense(4000, activation='relu')(x)
x = Dense(3000, activation='relu')(x)
x = Dense(2000, activation='relu')(x)
x = Dense(nclasses, activation='sigmoid')(x)
model = Model(inputs=[inp], outputs=[x])
adam = keras.optimizers.Adam(lr=0.00001)
model.compile(adam, 'binary_crossentropy')  # pass the configured optimizer so the custom lr is used
history = model.fit(X_train, Y_train, batch_size=32, epochs=50, verbose=0, shuffle=False)
Could anyone help me with the code here? I would also highly appreciate it if you could suggest a good 'accuracy' metric for this problem.
Thanks a lot :) :)
I have a similar problem and unfortunately have no answer for most of the questions, especially the class imbalance problem.
In terms of metrics there are several possibilities: in my case, I use the top 1/2/3/4/5 results and check if one of them is right. Because in your case you always have the same number of labels set to 1, you could take your top 10 results, see what percentage of them are right, and average this result over your batch size; a sketch of this metric follows below. I didn't find a way to include this algorithm as a Keras metric; instead, I wrote a callback which calculates the metric on epoch end on my validation data set.
Also, if you predict the top n results on a test dataset, see how many times each class is predicted. The Counter class from Python's collections module is really convenient for this purpose.
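As an illustration of the metric described above (a minimal numpy sketch; top_k_label_accuracy is a hypothetical helper name, y_true the multi-hot label matrix and y_prob the sigmoid outputs):
import numpy as np

def top_k_label_accuracy(y_true, y_prob, k=10):
    # indices of the k highest-scoring classes for each sample
    top_k = np.argsort(y_prob, axis=1)[:, -k:]
    # 1 where a top-k class is actually a positive label, 0 otherwise
    hits = np.take_along_axis(y_true, top_k, axis=1)
    # fraction of true labels among the top k, averaged over all samples
    return hits.mean()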
Edit: I found a method to include class weights without splitting the output.
You need a 2-D numpy array containing the weights, with shape [number of classes to predict, 2] (background and signal).
Such an array could be calculated with this function:
def calculating_class_weights(y_true):
    from sklearn.utils.class_weight import compute_class_weight
    number_dim = np.shape(y_true)[1]
    weights = np.empty([number_dim, 2])
    for i in range(number_dim):
        weights[i] = compute_class_weight('balanced', [0., 1.], y_true[:, i])
    return weights
The solution is then to build your own binary cross-entropy loss function, in which you multiply in the weights yourself:
def get_weighted_loss(weights):
    def weighted_loss(y_true, y_pred):
        return K.mean((weights[:, 0] ** (1 - y_true)) * (weights[:, 1] ** y_true) * K.binary_crossentropy(y_true, y_pred), axis=-1)
    return weighted_loss
weights[:,0] is an array with all the background weights and weights[:,1] contains all the signal weights.
All that is left is to include this loss into the compile function:
model.compile(optimizer=Adam(), loss=get_weighted_loss(class_weights))
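Putting the pieces together (a sketch; Y_train is the multi-hot label matrix from the question, and K above refers to the Keras backend used in weighted_loss):
import numpy as np
from keras import backend as K
from keras.optimizers import Adam

class_weights = calculating_class_weights(Y_train)  # shape (nclasses, 2)
model.compile(optimizer=Adam(), loss=get_weighted_loss(class_weights))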
