What is the difference between "predict" and "predict_class" functions in Keras?

What is the difference between predict and predict_class functions in keras?
Why doesn't the Model object have a predict_class function?

predict will return the scores of the regression and predict_class will return the class of your prediction. Although they seem similar, there are some differences:
Imagine you are trying to predict if the picture is a dog or a cat (you have a classifier):
predict will return you: 0.6 cat and 0.4 dog (for example).
predict_class will return the index of the class with the maximum value. For example, if cat is 0.6 and dog is 0.4, it will return 0 (assuming the class cat is at index 0).
Now, imagine you are trying to predict house prices (you have a regressor):
predict will return the predicted price
predict_class will not make sense here since you do not have a classifier
TL;DR: use predict_class for classifiers (outputs are labels) and use predict for regression (outputs are non-discrete).
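As a concrete sketch of the distinction (the toy model and data below are made up for illustration; note also that newer TensorFlow/Keras versions have removed predict_classes, so np.argmax over predict is the usual replacement):
import numpy as np
from tensorflow import keras

# Toy cat/dog classifier; the architecture is illustrative only.
model = keras.Sequential([
    keras.Input(shape=(32,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),   # index 0 = cat, index 1 = dog
])
model.compile(loss="categorical_crossentropy")

x = np.random.rand(1, 32)            # one fake sample
probs = model.predict(x)             # e.g. [[0.6, 0.4]] -- class scores
label = np.argmax(probs, axis=-1)    # e.g. [0], i.e. "cat" -- what predict_classes would return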
Hope it helps!
For your second question, the answer is here

The predict_classes method is only available for the Sequential class, not for the Model class.
You can check this answer

Why doesn't the Model object have a predict_class function?
The answer was given here, in this GitHub issue. (Nevertheless, this is still a fairly complex explanation; any help is welcome.)
For models that have more than one output, these concepts are
ill-defined. And it would be a bad idea to make available something in
the one-output case but not in other cases (inconsistent API).
For the Sequential model, the reason this is supported is for
backwards compatibility only.
predict_classes is missing from the functional API.
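In practice, with the functional Model API you compute the class index yourself from the output of predict; a minimal sketch (shapes and layer sizes are arbitrary):
import numpy as np
from tensorflow import keras

inputs = keras.Input(shape=(32,))
outputs = keras.layers.Dense(3, activation="softmax")(inputs)   # 3-class example
model = keras.Model(inputs, outputs)
model.compile(loss="categorical_crossentropy")

x = np.random.rand(4, 32)
probs = model.predict(x)              # shape (4, 3): class probabilities
classes = np.argmax(probs, axis=-1)   # shape (4,): predicted class indices
# For a single-unit sigmoid output you would threshold instead:
# classes = (model.predict(x) > 0.5).astype("int32")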

Related

Why my LSTM for Multi-Label Text Classification underperforms?

I'm using Windows 10 machine.
Libraries: Keras with Tensorflow 2.0
Embeddings: GloVe (100 dimensions)
I am trying to implement an LSTM architecture for multi-label text classification.
My problem is that no matter how much fine-tuning I do, the results are really bad.
I am not experienced in practical DL implementations, which is why I'm asking for your advice.
Below I will state basic information about my dataset and my model so far.
I can't embed images since I am a new member so they appear as links.
Dataset form + Embeddings form + train-test split form
Dataset's labels distribution
My Implementation of LSTM
Model's Summary
Model's Accuracy plot
Model's Loss plot
As you can see, my dataset is really small (~6,000 examples), and maybe that's one reason why I cannot achieve better results. Still, I chose it because it's unbiased.
I'd like to know if there is any fundamental mistake in my code regarding the dimensions, shape, activation functions, and loss functions for multi-label text classification?
What would you recommend to achieve better results with my model? Also, any general advice regarding optimizers, methods, number of nodes, layers, dropout, etc. is very welcome.
Model's best val accuracy that I achieved so far is ~0.54 and even if I tried to raise it, it seems stuck there.
There are many ways to get this wrong, but the most common mistake is to let your model overfit the training data.
I suspect that 0.54 accuracy means that your model selects the most common label (offensive) for almost all cases.
So, consider one of these simple solutions:
Create balanced training data: like 400 samples from each class.
or sample balanced batches for training (exactly the same number of labels on each training batch)
In addition to tracking accuracy and loss, look at precision, recall, and F1, or better yet plot the area under the ROC curve; different classes may need different activation thresholds. (If you are using a sigmoid on the last layer, one class might perform better with a threshold of 0.2 and another class with 0.7.)
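As a toy illustration of per-class thresholds (the probabilities and thresholds below are made up):
import numpy as np

# Hypothetical sigmoid outputs of a 3-label model for 2 samples.
probs = np.array([[0.25, 0.80, 0.55],
                  [0.10, 0.40, 0.90]])

# One threshold per class, e.g. tuned on a validation precision-recall curve.
thresholds = np.array([0.2, 0.7, 0.5])

predictions = (probs >= thresholds).astype(int)
# [[1 1 1]
#  [0 0 1]]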
Also, first try a simple model: an embedding layer, one LSTM layer, then the classification layer (a minimal sketch follows below).
Check how you tokenize the text: is the vocabulary size large enough?
Try Dice loss.
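Regarding the "simple model first" suggestion, a minimal multi-label setup might look like the sketch below (vocabulary size, embedding dimension, and number of labels are placeholders); the parts that matter for multi-label classification are the sigmoid output layer and the binary_crossentropy loss:
from tensorflow import keras

VOCAB_SIZE = 20000   # placeholder: size of your tokenizer's vocabulary
NUM_LABELS = 5       # placeholder: number of labels in your dataset

model = keras.Sequential([
    keras.layers.Embedding(VOCAB_SIZE, 100),   # 100-d vectors, e.g. initialized from GloVe
    keras.layers.LSTM(64),
    keras.layers.Dropout(0.3),
    # Multi-label: one independent sigmoid unit per label (NOT softmax).
    keras.layers.Dense(NUM_LABELS, activation="sigmoid"),
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",   # the standard multi-label loss
              metrics=["binary_accuracy"])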

How to get the probability of a particular token(word) in a sentence given the context

I'm trying to calculate the probability, or any kind of score, for words in a sentence using NLP. I've tried this approach with the GPT-2 model using the Hugging Face Transformers library, but I couldn't get satisfactory results due to the model's unidirectional nature, which didn't seem to predict within context for my purposes. So I was wondering whether there is a way to calculate the above using BERT, since it's bidirectional.
I found this related post the other day, but it didn't have any answer that was useful for me either.
Hope I will be able to receive ideas or a solution for this. Any help is appreciated. Thank you.
BERT is trained as a masked language model, i.e., it is trained to predict tokens that were replaced by a [MASK] token.
import torch
from transformers import AutoTokenizer, BertForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-cased")
bert = BertForMaskedLM.from_pretrained("bert-base-cased")

input_idx = tok.encode(f"The {tok.mask_token} were the best rock band ever.")
logits = bert(torch.tensor([input_idx]))[0]      # shape (1, seq_len, vocab_size)
prediction = logits[0].argmax(dim=1)             # most likely token id at each position
print(tok.convert_ids_to_tokens(prediction[2].numpy().tolist()))   # position 2 is the [MASK]
It prints token no. 11581 which is:
Beatles
To get a normalized probability distribution over BERT's vocabulary, you can normalize the logits with the softmax function over the vocabulary dimension, i.e., F.softmax(logits, dim=-1) (assuming the standard import torch.nn.functional as F).
The tricky thing is that words might be split into multiple subwords. You can simulate that by adding multiple [MASK] tokens, but then you have the problem of reliably comparing scores of predictions of different lengths. I would probably average the probabilities, but maybe there is a better way.
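Putting the pieces together, here is a hedged sketch of scoring one candidate (single-subword) word at the masked position, using the same model and sentence as above; the mask position is located via tok.mask_token_id rather than hard-coded:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, BertForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-cased")
bert = BertForMaskedLM.from_pretrained("bert-base-cased")

input_ids = tok.encode(f"The {tok.mask_token} were the best rock band ever.")
mask_pos = input_ids.index(tok.mask_token_id)     # position of the [MASK] token

with torch.no_grad():
    logits = bert(torch.tensor([input_ids]))[0]   # (1, seq_len, vocab_size)

probs = F.softmax(logits[0, mask_pos], dim=-1)    # distribution over the vocabulary
word_id = tok.convert_tokens_to_ids("Beatles")    # candidate word (one subword here)
print(float(probs[word_id]))                      # its probability in this context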

Keras. figuring out class label encoding

I am doing binary classification with a one-output layer. I want to know which class is encoded as 0 and which as 1, so that I can interpret the probability scores from model.predict() in Keras (which I think are scores for label 1). Does it make sense to use predict_classes on the training data to inspect the class labels, given that the training loss is small? Is there any better way to do this?
Yes, it makes sense to use predict(trainingData) to study the results and manually compare the predicted values against the true values.
But it's you who define what 0 and 1 are when you create the true values.
The answer is in your true data, what they usually call "Y". The model will learn what is in Y and that is the classification. Only you (who created the data) can know that.
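As a hedged sketch of that check (model, x_train and y_train below stand for your fitted single-sigmoid-output model and the training data/labels you encoded yourself): with one output unit, predict returns the probability of whichever class you encoded as 1 in Y, so comparing thresholded predictions against y_train on well-fit training data confirms the mapping.
import numpy as np

# `model`, `x_train`, `y_train` are assumed to be your fitted Keras model
# (single sigmoid output) and the training data/labels you created.
probs = model.predict(x_train).ravel()      # P(class encoded as 1 in Y)
predicted = (probs > 0.5).astype(int)

# With a small training loss this should mostly match y_train, telling you
# which original label corresponds to 1 (and hence what the scores mean).
print(np.mean(predicted == y_train))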

Does GridSearchCV use predict or predict_proba, when using auc_score as score function?

Does GridSearchCV use predict or predict_proba, when using auc_score as score function?
The predict function generates predicted class labels, which will always result in a triangular ROC-curve. A more curved ROC-curve is obtained using the predicted class probabilities. The latter one is, as far as I know, more accurate. If so, the area under the 'curved' ROC-curve is probably best to measure classification performance within the grid search.
Therefore I am curious if either the class labels or class probabilities are used for the grid search, when using the area under the ROC-curve as performance measure. I tried to find the answer in the code, but could not figure it out. Does anyone here know the answer?
Thanks
To use auc_score for grid searching you really need to use predict_proba or decision_function as you pointed out. This is not possible in the 0.13 release. If you do score_func=auc_score it will use predict which doesn't make any sense.
Since 0.14 it is possible to do a grid search using auc_score by setting the new scoring parameter to roc_auc: GridSearchCV(est, param_grid, scoring='roc_auc'). It will do the right thing and use predict_proba (or decision_function if predict_proba is not available).
See the what's new page of the current dev version.
You need to install the current master from github to get this functionality or wait until April (?) for 0.14.
After performing some experiments with Sklearn SVC (which has predict_proba available) comparing some results with predict_proba and decision_function, it seems that roc_auc in GridSearchCV uses decision_function to compute AUC scores. I found a similar discussion here: Reproducing Sklearn SVC within GridSearchCV's roc_auc scores manually
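For reference, with current scikit-learn versions the grid search looks like the sketch below (the estimator and parameter grid are illustrative); the roc_auc scorer ranks parameters using decision_function, falling back to predict_proba, rather than the hard labels from predict:
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# scoring='roc_auc' makes GridSearchCV rank parameter settings by AUC,
# computed from continuous scores rather than predicted labels.
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, scoring="roc_auc", cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)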

How can i know probability of class predicted by predict() function in Support Vector Machine?

How can I know a sample's probability of belonging to the class predicted by the predict() function of scikit-learn's Support Vector Machine?
>>> print(clf.predict([fv]))
[5]
Is there any function for this?
Definitely read this section of the docs as there's some subtleties involved. See also Scikit-learn predict_proba gives wrong answers
Basically, if you have a multi-class problem with plenty of data predict_proba as suggested earlier works well. Otherwise, you may have to make do with an ordering that doesn't yield probability scores from decision_function.
Here's a useful pattern for using predict_proba to get a dictionary or list of classes vs. probabilities:
from sklearn import svm

# X, Y are your training data and labels; test_data holds the samples to score.
model = svm.SVC(probability=True)
model.fit(X, Y)
results = model.predict_proba(test_data)[0]
# gets a dictionary of {class_label: probability}
prob_per_class_dictionary = dict(zip(model.classes_, results))
# gets a list of class labels ordered from most to least probable
results_ordered_by_probability = [label for label, _ in
    sorted(zip(model.classes_, results), key=lambda x: x[1], reverse=True)]
Use clf.predict_proba([fv]) to obtain a list with predicted probabilities per class. However, this function is not available for all classifiers.
Regarding your comment, consider the following:
>> prob = [ 0.01357713, 0.00662571, 0.00782155, 0.3841413, 0.07487401, 0.09861277, 0.00644468, 0.40790285]
>> sum(prob)
1.0
The probabilities sum to 1.0, so multiply by 100 to get percentage.
When creating the SVC object, enable probability estimates by setting probability=True:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
Then call fit as usual and then predict_proba([fv]).
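A minimal sketch of that flow on toy data (fv stands for a single feature vector, as in the question):
from sklearn import datasets
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

clf = SVC(probability=True)     # enables Platt-scaled probability estimates
clf.fit(X, y)

fv = X[0]                        # one feature vector
print(clf.predict([fv]))         # predicted class label
print(clf.predict_proba([fv]))   # probability per class, in the order of clf.classes_
print(clf.classes_)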
For clarity, here is the relevant information from the scikit-learn documentation on SVMs again.
Needless to say, the cross-validation involved in Platt scaling is an expensive operation for large datasets. In addition, the probability estimates may be inconsistent with the scores, in the sense that the “argmax” of the scores may not be the argmax of the probabilities. (E.g., in binary classification, a sample may be labeled by predict as belonging to a class that has probability <½ according to predict_proba.) Platt’s method is also known to have theoretical issues. If confidence scores are required, but these do not have to be probabilities, then it is advisable to set probability=False and use decision_function instead of predict_proba.
For other classifiers such as Random Forest, AdaBoost, or Gradient Boosting, it should be okay to use the predict function in scikit-learn.
This is one way of obtaining probability-like scores (note that a min-max scaled decision function is not a calibrated probability):
from sklearn.svm import SVC

svc = SVC(probability=True)
preds_svc = svc.fit(X_train, y_train).predict(X_test)
# The decision function tells us on which side of the separating hyperplane we
# are (and how far away from it); min-max scale it into [0, 1] as a rough score.
probs_svc = svc.decision_function(X_test)
probs_svc = (probs_svc - probs_svc.min()) / (probs_svc.max() - probs_svc.min())
