Multinomial naive bayes softmax altering - python-3.x

In scikit-learn, I was doing multi-class classification using MultinomialNB on labelled text data.
While predicting I used the predict_proba method of MultinomialNB:
clf = MultinomialNB()
print(clf.fit(X_train, Y_train))
clf.predict_proba(X_test[0])
As a result I got a vector of probability values for each class that added up to 1. I know this is because of the softmax cross-entropy function.
array([[0.01245064, 0.02346781, 0.84694063, 0.03238112, 0.01833107,
        0.03103464, 0.03539408]])
My question is: while predicting, I need something like binary cross-entropy so that I get a probability value for each class between 0 and 1, independent of the other classes. How do I change this behaviour when predicting in scikit-learn?

You can get the (unnormalized) log likelihood for every class by using:
def _joint_log_likelihood(self, X):
    """Compute the unnormalized posterior log probability of X

    I.e. ``log P(c) + log P(x|c)`` for all rows x of X, as an array-like of
    shape [n_classes, n_samples].

    Input is passed to _joint_log_likelihood as-is by predict,
    predict_proba and predict_log_proba.
    """
Naive Bayes' predict_log_proba works simply by normalizing the function above.
def predict_log_proba(self, X):
    """
    Return log-probability estimates for the test vector X.

    Parameters
    ----------
    X : array-like, shape = [n_samples, n_features]

    Returns
    -------
    C : array-like, shape = [n_samples, n_classes]
        Returns the log-probability of the samples for each class in
        the model. The columns correspond to the classes in sorted
        order, as they appear in the attribute `classes_`.
    """
    jll = self._joint_log_likelihood(X)
    # normalize by P(x) = P(f_1, ..., f_n)
    log_prob_x = logsumexp(jll, axis=1)
    return jll - np.atleast_2d(log_prob_x).T
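For reference, a minimal sketch on hypothetical toy data (the variable names and shapes here are made up) showing how you could call _joint_log_likelihood on a fitted MultinomialNB to get these unnormalized per-class scores; it is a private method, so it may change between scikit-learn versions:
import numpy as np
from sklearn.naive_bayes import MultinomialNB
# hypothetical toy data: 20 samples, 10 count features, 3 classes
X_train = np.random.randint(0, 5, size=(20, 10))
Y_train = np.random.randint(0, 3, size=20)
X_test = np.random.randint(0, 5, size=(2, 10))
clf = MultinomialNB()
clf.fit(X_train, Y_train)
# unnormalized log P(c) + log P(x|c): one row per test sample, one column per
# class in clf.classes_; unlike the output of predict_proba, these values
# are not normalized to sum to 1
jll = clf._joint_log_likelihood(X_test)
print(jll)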

Related

Custom loss for single-label, multi-class problem

I have a single-label, multi-class classification problem, i.e., a given sample is in exactly one class (say, class 3), but for training purposes, predicting class 2 or 5 is still acceptable and should not penalise the model as heavily.
For example, the ground truth for one sample is [0,1,1,0,1] over 5 classes, instead of a one-hot vector. This implies that the model predicting any one (not necessarily all) of the above classes (2, 3 or 5) is fine.
For every batch, the predicted output dimension is of the shape bs x n x nc, where bs is the batch size, n is the number of samples per point and nc is the number of classes. The ground truth is also of the same shape as the predicted tensor.
For every batch, I'm expecting my loss function to compare n tensors across nc classes and then average it across n.
Eg: When dimensions are 32 x 8 x 5000. There are 32 batch points in a batch (for bs=32). Each batch point has 8 vector points, and each vector point has 5000 classes. For a given batch point, I wish to compute loss across all (8) vector points, compute their average and do so for the rest of the batch points (32). Final loss would be loss over all losses from each batch point.
How can I approach designing such a loss function? Any help would be deeply appreciated
P.S.: Let me know if the question is ambiguous
One way to go about this is to use a sigmoid function on the network output, which removes the implicit interdependency between class scores that a softmax function has.
As for the loss function, you can then calculate the loss based on the highest prediction for any of your target classes and ignore all other class predictions. For your example:
# your model output
y_out = torch.tensor([[0.1, 0.2, 0.95, 0.1, 0.01]], requires_grad=True)
# class labels
y = torch.tensor([[0,1,1,0,1]])
Since we only care about the highest score among the target classes, we set the score of every target class to the maximum value achieved by any one of them:
class_mask = y == 1
max_class_score = torch.max(y_out[class_mask])
y_hat = torch.where(class_mask, max_class_score, y_out)
From there we can use a regular cross-entropy loss function:
loss_fn = torch.nn.CrossEntropyLoss()
loss = loss_fn(y_hat, y.float())
loss.backward()
When inspecting the gradients, we see that this only updates the prediction that achieved the highest score, as well as all predictions outside of any of the target classes.
>>> y_out.grad
tensor([[ 0.3326, 0.0000, -0.6653, 0.3326, 0.0000]])
Predictions for other target classes do not receive a gradient update. Note that if you have a very high ratio of possible classes, this might slow down your convergence.
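If you need the batched bs x n x nc version described in the question, here is a hedged sketch of how the same masking idea could be applied, flattening the batch and point dimensions before the loss; the helper name and the toy shapes are made up, and the soft-target CrossEntropyLoss call is the same one used above (it needs a reasonably recent PyTorch):
import torch
def relaxed_loss(y_out, y):
    # y_out: raw scores, shape (bs, n, nc); y: 0/1 multi-hot targets, same shape
    class_mask = y == 1
    # highest score among the allowed classes for each (batch, point) pair
    masked = y_out.masked_fill(~class_mask, float('-inf'))
    max_class_score = torch.amax(masked, dim=-1, keepdim=True)
    y_hat = torch.where(class_mask, max_class_score, y_out)
    nc = y_out.shape[-1]
    # CrossEntropyLoss averages over all bs * n points, as asked in the question
    loss_fn = torch.nn.CrossEntropyLoss()
    return loss_fn(y_hat.reshape(-1, nc), y.float().reshape(-1, nc))
# toy usage with bs=2, n=3, nc=5
y_out = torch.randn(2, 3, 5, requires_grad=True)
y = torch.randint(0, 2, (2, 3, 5))
y[..., 0] = 1  # make sure every point has at least one allowed class
loss = relaxed_loss(y_out, y)
loss.backward()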

Threshold does not work on numpy array for accuracy metric

I am trying to implement logistic regression from scratch using numpy. I wrote a class with the following methods to implement logistic regression for a binary classification problem and to score it based on BCE loss or Accuracy.
def accuracy(self, true_labels, predictions):
    """
    This method implements the accuracy score, i.e. the fraction of
    correct predictions the model makes.
    args:
        true_labels: vector of shape (1, m) that contains the class labels,
            where m is the number of samples in the batch.
        predictions: vector of shape (1, m) that contains the model predictions.
    """
    counter = 0
    for y_true, y_pred in zip(true_labels, predictions):
        if y_true == y_pred:
            counter += 1
    return counter / len(true_labels)
def train(self, score='loss'):
    """
    This function trains the logistic regression model and updates the
    parameters based on the Batch-Gradient Descent algorithm.
    The function prints the training loss and validation loss on every epoch.
    args:
        X: input features with shape (num_features, m) or (num_features) for a
            singular sample, where m is the size of the dataset.
        Y: gold class labels of shape (1, m) or (1) for a singular sample.
    """
    train_scores = []
    dev_scores = []
    for i in range(self.epochs):
        # perform forward and backward propagation & get the training predictions.
        training_predictions = self.propagation(self.X_train, self.Y_train)
        # get the predictions of the validation data
        dev_predictions = self.predict(self.X_dev, self.Y_dev)
        # calculate the scores of the predictions.
        if score == 'loss':
            train_score = self.loss_function(training_predictions, self.Y_train)
            dev_score = self.loss_function(dev_predictions, self.Y_dev)
        elif score == 'accuracy':
            train_score = self.accuracy((training_predictions==+1).squeeze(), self.Y_train)
            dev_score = self.accuracy((dev_predictions==+1).squeeze(), self.Y_dev)
        train_scores.append(train_score)
        dev_scores.append(dev_score)
    plot_training_and_validation(train_scores, dev_scores, self.epochs, score=score)
After testing the code with the following input:
model = LogisticRegression(num_features=X_train.shape[0],
                           Learning_rate=0.01,
                           Lambda=0.001,
                           epochs=500,
                           X_train=X_train,
                           Y_train=Y_train,
                           X_dev=X_dev,
                           Y_dev=Y_dev,
                           normalize=False,
                           regularize=False)
model.train(score='loss')
I get the following results.
However, when I swap the scoring metric from loss to accuracy, as follows: model.train(score='accuracy'), I get the following result:
I have removed normalization and regularization to make sure I am using a simple implementation of logistic regression.
Note that I use an external method to visualize the training/validation score over time in the LogisticRegression.train() method.
The trick you are using to create your predictions before passing them into the accuracy method is wrong. You are using (dev_predictions==+1).
Your problem statement is a logistic regression model that generates a value between 0 and 1. Most of the time, the values will NOT be exactly equal to +1.
So essentially, every time you are passing a bunch of False or 0 values to the accuracy function. I bet if you check, the proportion of samples in your datasets having the label False or 0 would be:
exactly 51.7 % in the validation dataset
exactly 56.2 % in the training dataset.
To fix this, you can use an in-between threshold like 0.5 to generate your labels. So use something like dev_predictions > 0.5.
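For illustration, a minimal sketch with made-up numbers (not the asker's data) of how thresholding at 0.5 turns the continuous sigmoid outputs into labels before computing accuracy:
import numpy as np
# hypothetical sigmoid outputs and gold labels
dev_predictions = np.array([0.12, 0.87, 0.55, 0.03])
Y_dev = np.array([0, 1, 1, 0])
# threshold at 0.5 instead of comparing against +1
dev_labels = (dev_predictions > 0.5).astype(int)
accuracy = (dev_labels == Y_dev).mean()
print(accuracy)  # 1.0 on this toy example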

Cross_val_predict: Getting predicted values and predicted probabilities in one step

The following example script outputs the predicted values and predicted probabilities:
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
lg = linear_model.LogisticRegression(random_state=0, solver='lbfgs')
y_prob = cross_val_predict(lg, X, y, cv=4, method='predict_proba')
y_pred = cross_val_predict(lg, X, y, cv=4)
y_prob[0:5]
y_pred[0:5]
I tried the following without success:
test = cross_val_predict(lg, X, y, cv=4, method=['predict','predict_proba'])
Question: Is there a way to get both predicted values and predicted probabilities in one step, without running cross-validation twice? Also, I have to make sure that the values and probabilities correspond to the same input data.
The values of y_pred can be derived from y_prob:
# The probabilities as in the original code sample
y_prob = cross_val_predict(lg, X, y, cv=4, method='predict_proba')
import numpy as np
# Get a list of classes that matches the columns of `y_prob`
y_sorted = np.unique(y)
# Use the highest probability for predicting the label
indices = np.argmax(y_prob, axis=1)
# Get the label for each sample
y_pred = y_sorted[indices]
Now, it may happen that y_pred from cross_val_predict does not match the y_pred here in all cases. This happens when there are multiple classes with identical highest probability, as is the case in your sample code. For example, the predicted probabilities are zero for all classes for the first sample. Anyway, it seems to me that logistic regression (which is, in fact, classification) is not suitable for the diabetes dataset.
For the rationale of y_sorted see the cross_val_predict docs:
method : string, optional, default: ‘predict’
Invokes the passed method name of the passed estimator. For method=’predict_proba’, the columns correspond to the classes in sorted order.
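As an optional sanity check, reusing the variables from the snippets above, you can compare the argmax-derived labels with the labels returned by a separate predict run; ties in the probabilities are the expected source of any differences (note this runs cross-validation a second time, so it is only a check):
y_pred_direct = cross_val_predict(lg, X, y, cv=4)
print(np.mean(y_pred == y_pred_direct))  # fraction of samples where both agree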

keras: unsupervised learning with external constraint

I have to train a network on unlabelled data of binary type (True/False), which sounds like unsupervised learning. This is what the normalised data look like:
array([[-0.05744527, -1.03575495, -0.1940105 , -1.15348956, -0.62664491,
-0.98484037],
[-0.05497629, -0.50935675, -0.19396862, -0.68990988, -0.10551919,
-0.72375012],
[-0.03275552, 0.31480204, -0.1834951 , 0.23724946, 0.15504367,
0.29810553],
...,
[-0.05744527, -0.68482282, -0.1940105 , -0.87534175, -0.23580062,
-0.98484037],
[-0.05744527, -1.50366446, -0.1940105 , -1.52435329, -1.14777063,
-0.98484037],
[-0.05744527, -1.26970971, -0.1940105 , -1.33892142, -0.88720777,
-0.98484037]])
However, I do have a constraint on the total number of True labels in my data. This doesn't mean I can build a classical custom loss function in Keras taking (y_true, y_pred) arguments as required: my external constraint is just on the predicted total of True and False, not on the individual labels.
My question is whether there is a somewhat "standard" approach to this kind of problems, and how that is implementable in Keras.
POSSIBLE SOLUTION
Should I assign y_true randomly as 0/1, have a network return y_pred as 1/0 with a sigmoid activation function, and then define my loss function as
sum_y_true = 500 # arbitrary constant known a priori
def loss_function(y_true, y_pred):
    loss = np.abs(y_pred.sum() - sum_y_true)
    return loss
In the end, I went with the following solution, which worked.
1) Define batches in your dataframe df with a batch_id column, so that in each batch Y_train is your identical "batch ground truth" (in my case, the total number of True labels in the batch). You can then pass these instances together to the network. This can be done with a generator:
def grouper(g, x, y):
    while True:
        for gr in g.unique():
            # this assigns indices to the entire set of values in g,
            # then subsets to all the rows in which g == gr
            indices = g == gr
            yield (x[indices], y[indices])
# train set
train_generator = grouper(df.loc[df['set'] == 'train', 'batch_id'], X_train, Y_train)
# validation set
val_generator = grouper(df.loc[df['set'] == 'val', 'batch_id'], X_val, Y_val)
2) Define a custom loss function to track how closely the total number of instances predicted as True matches the ground truth:
from keras import backend as K

def custom_delta(y_true, y_pred):
    loss = K.abs(K.mean(y_true) - K.sum(y_pred))
    return loss

def custom_wrapper():
    def custom_loss_function(y_true, y_pred):
        return custom_delta(y_true, y_pred)
    return custom_loss_function
Note that here:
a) each y_true label is already the sum of the ground truth in our batch (since we don't have individual values), which is why y_true is not summed over;
b) K.mean is actually a bit of overkill to extract a single scalar from this uniform tensor, in which all y_true values in each batch are identical; K.min or K.max would also work, but I haven't tested whether they are faster.
3) Use fit_generator instead of fit:
fmodel = Sequential()
# ...your layers...
# Create the loss function object using the wrapper function above
loss_ = custom_wrapper()
fmodel.compile(loss=loss_, optimizer='adam')
history1 = fmodel.fit_generator(train_generator,
                                steps_per_epoch=total_batches,
                                validation_data=val_generator,
                                validation_steps=df.loc[df['set'] == 'val', 'batch_id'].nunique(),
                                epochs=20, verbose=2)
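For completeness, a hedged sketch of what the elided layers might look like, assuming the six normalised input features shown in the question and a single sigmoid output per sample (so that K.sum(y_pred) in the loss is the predicted count of True labels in a batch); the actual architecture is not part of the original answer:
from keras.models import Sequential
from keras.layers import Dense
fmodel = Sequential()
fmodel.add(Dense(16, activation='relu', input_shape=(6,)))  # hypothetical hidden layer
fmodel.add(Dense(1, activation='sigmoid'))                  # per-sample True probability
fmodel.compile(loss=custom_wrapper(), optimizer='adam')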
This way the problem is basically addressed as one of supervised learning, although without individual labels, which means that notions like true/false positive are meaningless here.
This approach not only gave me a y_pred that closely matches the totals I know per batch; it actually finds two groups (True/False) that occupy the expected different portions of parameter space.

Can GridSearchCV use predict_proba when using a custom score function?

I am trying to use a custom scoring function that calculates multi-class log loss with the ground truth and predict_proba y array. Is there a way to make GridSearchCV use this scoring function?
import numpy as np

def multiclass_log_loss(y_true, y_pred):
    """Multi-class logarithmic loss.

    Parameters
    ----------
    y_true : array, shape = [n_samples]
        true class, integers in [0, n_classes - 1)
    y_pred : array, shape = [n_samples, n_classes]

    Returns
    -------
    loss : float
    """
    eps = 1e-15
    predictions = np.clip(y_pred, eps, 1 - eps)
    # normalize row sums to 1
    predictions /= predictions.sum(axis=1)[:, np.newaxis]
    actual = np.zeros(y_pred.shape)
    n_samples = actual.shape[0]
    actual[np.arange(n_samples), y_true.astype(int)] = 1
    vectsum = np.sum(actual * np.log(predictions))
    loss = -1.0 / n_samples * vectsum
    return loss
I see that there are multiple options, score_func, loss_func and make_scorer. I tried using make_scorer with greater_is_better=False and also tried the loss_func parameter but it seems to still use the .predict method. How can I get around this problem?
UPDATE - if I set needs_threshold=True I get a multi-class error. Am I correct to understand multi-class is not supported in this case? If yes, can someone suggest a workaround?
Thanks.
The top answer to this question:
Pass estimator to custom score function via sklearn.metrics.make_scorer
might have what you need. One can define a scorer that takes as arguments a classifier clf, feature array X, and targets y_true, and feed the result of the clf.predict_proba() method to a scoring function that returns the error. As a hint, for binary classification, you probably need to use
clf.predict_proba(X)[:,1]
This worked for my needs (a normalized Gini score). For some reason, I couldn't get sklearn's metrics.make_scorer to work with my custom function that needs probabilities.
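Building on that suggestion, a hedged sketch of a scorer with the (estimator, X, y) signature that GridSearchCV accepts as a callable, wired to the multiclass_log_loss defined in the question; the estimator and parameter grid here are hypothetical placeholders:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
def proba_scorer(clf, X, y_true):
    # the callable receives the fitted estimator, so it can call predict_proba itself
    y_proba = clf.predict_proba(X)
    # GridSearchCV maximizes the score, so return the negated loss
    return -multiclass_log_loss(y_true, y_proba)
param_grid = {'n_estimators': [50, 100]}  # hypothetical grid
grid = GridSearchCV(RandomForestClassifier(), param_grid, scoring=proba_scorer, cv=3)
# grid.fit(X, y)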
