How to normalize output from BERT classifier - python-3.x

I've trained a BERT classifier using HuggingFace transformers.TFBertForSequenceClassification classifier. It's working fine, but when using the model.predict() method, it gives a tuple as output which are not normalized between [0, 1]. E.g. I trained the model to classify news articles into fraud and non-fraud category. Then I fed the following 4 test data to the model for prediction:
articles = ['He was involved in the insider trading scandal.',
'Johnny was a good boy. May his soul rest in peace',
'The fraudster stole money using debit card pin',
'Sun rises in the east']
The outputs are:
[[-2.8615277, 2.6811066],
[ 2.8651822, -2.564444 ],
[-2.8276567, 2.4451752],
[ 2.770451 , -2.3713884]]
For me label-0 is for non-fraud, and label-1 is for fraud, so that's working fine. But how do I prepare the scoring confidence from here? Does normalization using softmax make sense in this context? Also, if I want to look at those predictions where the model is kind of indecisive, how would I do that? In that case would both the values be very close to each other?

Yes. You can use softmax. To be more precise, use an argmax over softmax to get label predictions like 0 or 1.
y_pred = tf.nn.softmax(model.predict(test_dataset))
y_pred_argmax = tf.math.argmax(y_pred, axis=1)
This blog was helpful for me when I had the same query..
To answer your second question, I would ask you to focus on what test instances that your classification model had misclassified than trying to find where the model went indecisive.
Because, argmax is always going to return 0 or 1 and never 0.5. And, I would say that a label 0.5 will be the appropriate value for claiming your model to be indecisive..

Related

Why my LSTM for Multi-Label Text Classification underperforms?

I'm using Windows 10 machine.
Libraries: Keras with Tensorflow 2.0
Embeddings:Glove(100 dimensions)
I am trying to implement an LSTM architecture for multi-label text classification.
My problem is that no matter how much fine-tuning I do, the results are really bad.
I am not experienced in DL practical implementations that's why I ask for your advice.
Below I will state basic information about my dataset and my model so far.
I can't embed images since I am a new member so they appear as links.
Dataset form+Embedings form+train-test-split form
Dataset's labels distribution
My Implementation of LSTM
Model's Summary
Model's Accuracy plot
Model's Loss plot
As you can see my dataset is really small (~6.000 examples) and maybe that's one reason why I cannot achieve better results. Still, I chose it because it's unbiased.
I'd like to know if there is any fundamental mistake in my code regarding the dimensions, shape, activation functions, and loss functions for multi-label text classification?
What would you recommend to achieve better results on my model? Also any general advice regarding optimizing, methods,# of nodes, layers, dropouts, etc is very welcome.
Model's best val accuracy that I achieved so far is ~0.54 and even if I tried to raise it, it seems stuck there.
There are many ways to get this wrong but the most common mistake is to get your model overfit the training data.
I suspect that 0.54 accuracy means that your model selects the most common label (offensive) for almost all cases.
So, consider one of these simple solutions:
Create balanced training data: like 400 samples from each class.
or sample balanced batches for training (exactly the same number of labels on each training batch)
In addition to tracking accuracy and loss, look at precision-recall-f1 or even better try plotting area under curve, maybe different classes need different thresholds of activation. (If you are using Sigmoid on last layer maybe one class could perform better with 0.2 activations and another class with 0.7)
first try simple model. embedding 1 layer LSTM than classify
how to tokenize text , is vocab size enough ?
try dice loss

How to get the probability of a particular token(word) in a sentence given the context

I'm trying to calculate the probability or any type of score for words in a sentence using NLP. I've tried this approach with GPT2 model using Huggingface Transformers library, but, I couldn't get satisfactory results due to the model's unidirectional nature which for me didn't seem to predict within context. So I was wondering whether there is a way, to calculate the above said using BERT since it's Bidirectional.
I've found this post relatable, which I randomly saw the other day but didn't see any answer which would be useful for me as well.
Hope I will be able to receive ideas or a solution for this. Any help is appreciated. Thank you.
BERT is trained as a masked language model, i.e., it is trained to predict tokens that were replaced by a [MASK] token.
from transformers import AutoTokenizer, BertForMaskedLM
tok = AutoTokenizer.from_pretrained("bert-base-cased")
bert = BertForMaskedLM.from_pretrained("bert-base-cased")
input_idx = tok.encode(f"The {tok.mask_token} were the best rock band ever.")
logits = bert(torch.tensor([input_idx]))[0]
prediction = logits[0].argmax(dim=1)
print(tok.convert_ids_to_tokens(prediction[2].numpy().tolist()))
It prints token no. 11581 which is:
Beatles
To get a normalized probability distribution over BERT's vocabulary, you can normalize the logits using the softmax function, i.e., F.softmax(logits, dim=1), (assuming standart import torch.nn.fucntional as F).
The tricky thing is that words might be split into multiple subwords. You can simulate that by adding multiple [MASK] tokens, but then you have a problem with how to compare the scores of prediction so different lengths reliably. I would probably average the probabilities, but maybe there is a better way.

How to adopt multiple different loss functions in each steps of LSTM in Keras

I have a set of sentences and their scores, I would like to train a marking system that could predict the score for a given sentence, such one example is like this:
(X =Tomorrow is a good day, Y = 0.9)
I would like to use LSTM to build such a marking system, and also consider the sequential relationship between each word in the sentence, so the training example shown above is transformed as following:
(x1=Tomorrow, y1=is) (x2=is, y2=a) (x3=a, y3=good) (x4=day, y4=0.9)
When training this LSTM, I would like the first three time steps using a softmax classifier, and the final step using a MSE. It is obvious that the loss function used in this LSTM is composed of two different loss functions. In this case, it seems the Keras does not provide the way to address my problem directly. In addition, I am not sure whether my method to build the marking system is correct or not.
Keras support multiple loss functions as well:
model = Model(inputs=inputs,
outputs=[lang_model, sent_model])
model.compile(optimizer='sgd',
loss=['categorical_crossentropy', 'mse'],
metrics=['accuracy'], loss_weights=[1., 1.])
Based on your explanation, I think you need a model that first, predict a token based on previous tokens, in NLP domain it usually called Language model, and then compute a score which I assume it is a sentiment (it is applicable to other domain).
To do so, you can train your language model with LSTM and pick the last output of LSTM for your ranking task. To this end, you need to define two loss function: categorical_crossentropy for the language model and MSE for the ranking task.
This tutorial would be helpful: https://www.pyimagesearch.com/2018/06/04/keras-multiple-outputs-and-multiple-losses/

Interpreting results from lightFM

I built a recommendation model on a user-item transactional dataset where each transaction is represented by 1.
model = LightFM(learning_rate=0.05, loss='warp')
Here are the results
Train precision at k=3: 0.115301
Test precision at k=3: 0.0209936
Train auc score: 0.978294
Test auc score : 0.810757
Train recall at k=3: 0.238312330233
Test recall at k=3: 0.0621618086561
Can anyone help me interpret this result? How is it that I am getting such good auc score and such bad precision/recall? The precision/recall gets even worse for 'bpr' Bayesian personalized ranking.
Prediction task
users = [0]
items = np.array([13433, 13434, 13435, 13436, 13437, 13438, 13439, 13440])
model.predict(users, item)
Result
array([-1.45337546, -1.39952552, -1.44265926, -0.83335167, -0.52803332,
-1.06252205, -1.45194077, -0.68543684])
How do I interpret the prediction scores?
Thanks
When it comes to the difference between precision#K at AUC, you may want to have a look at my answer here: Evaluating the LightFM Recommendation Model.
The scores themselves do not have a defined scale and are not interpretable. They only make sense in the context of defining a ranking over items for a given user, with higher scores denoting a stronger predicted preference.

predict() returns image similarities with SVM in scikit learn

A silly question: after i train my SVM in scikit-learn i have to use predict function: predict(X) for predicting at which class belongs? (http://scikit-learn.org/dev/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.predict)
X parameter is the image feature vector?
In case i give an image not trained (not trained because SVM ask at least 3 samples for class), what returns?
First remark: "predict() returns image similarities with SVM in scikit learn" is not a question. Please put a question in the header of Stack Overflow entries.
Second remark: the predict method of the SVC class in sklearn does not return "image similarities" but a class assignment prediction. Read the http://scikit-learn.org documentation and tutorials to understand what we mean by classification and prediction in machine learning.
X parameter is the image feature vector?
No, X is not "the image" feature vector: it is a set of image feature vectors with shape (n_samples, n_features) as explained in the documentation you refer to. In your case a sample is an image hence the expected shape would be (n_images, n_features). The predict API was design to compute many predictions at once for efficiency reason. If you want to compute a single prediction, you will have to wrap your single feature vector in an array with shape (1, n_features).
For instance if you have a single feature vector (1D) called my_single_image_features with shape (n_features,) you can call predict with:
predictions = clf.predict([my_single_image_features])
my_single_prediction = predictions[0]
Please note the [] signs around the my_single_image_features variable to turn it into a 2D array.
my_single_prediction will be an integer whose meaning depends on the integer values provided by you when calling the clf.fit(X_train, y_train) method in the first place.
In case i give an image not trained (not trained because SVM ask at least 3 samples for class), what returns?
An image is not "trained". Only the model is trained. Of course you can pass samples / images that are not part of the training set to the predict method. This is the whole purpose of machine learning: making predictions on new unseen data based on what you learn from the statistical regularities seen in the past training data.

Resources