BLEU Score as a Model Evaluation Metric - Keras

In many seq2seq implementations, I have seen that accuracy is used as the metric when compiling the model, and the BLEU score is used only on predictions.
Why don't they use the BLEU score during training as well? Wouldn't that be more effective, or am I misunderstanding something?

The Bilingual Evaluation Understudy (BLEU) score was meant to replace humans, hence the word "understudy" in its name.
When you are training, you already have the target values, so you can directly compare your generated output against them. But when you predict on a new dataset, you have no way of measuring whether the sentence you produced is a correct translation. That is why you use BLEU: no human can check after each machine translation whether the prediction is correct, and BLEU provides a sanity check.
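For illustration, a sentence-level BLEU score against a reference translation can be computed with NLTK (NLTK is just one option and is not mentioned in the question; the tokenized sentences below are made-up examples):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]   # one or more human reference translations
candidate = ["the", "cat", "is", "on", "the", "mat"]      # model output to be scored
# smoothing avoids zero scores when some higher-order n-grams have no overlap
score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(score)
```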
P.S. An understudy is someone who learns from a mentor in order to replace them if need be; BLEU "learns" from human reference translations and is then able to score a machine translation.
For further reference, check out https://www.youtube.com/watch?v=9ZvTxChwg9A&list=PL1w8k37X_6L_s4ncq-swTBvKDWnRSrinI&index=28
If you have any questions, comment below.

Related

Find optimized threshold value

I have a dataset that has a fraud_label and a set of other feature variables. How can I find the rule that identifies fraud_label correctly with the best precision and recall values? Examples of features are number_of_site_visits, external_fraud_score, etc. I need to come up with a rule that says: if number_of_site_visits is less than X and external_fraud_score is greater than Y, then we get the best precision and recall. I have to do this in Python, and any help or direction you can provide would be very helpful.
I have tried a Random Forest model, but that gives me feature importances, not exact threshold values.
A good way to find a rule that identifies fraud_label correctly with strong precision and recall is to use a supervised machine learning algorithm such as logistic regression or a support vector machine. You can train a model on your dataset and then use the trained model to predict fraud_label. The model can then be evaluated using metrics such as precision and recall.
You can also use grid search or cross-validation to find the optimal parameters for your model, which will help you identify the best thresholds for each feature variable. This will allow you to create a rule that gives you the best precision and recall values.
In Python, you can use the scikit-learn library to implement these algorithms.
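As a rough sketch of that workflow (the column names and parameter grid below are assumptions based on your description, not a definitive recipe):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report

# X: features such as number_of_site_visits and external_fraud_score; y: fraud_label (placeholders)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000, class_weight="balanced"),  # weight the rare fraud class
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="f1",   # f1 balances precision and recall; swap in "precision" or "recall" as needed
    cv=5,
)
grid.fit(X_train, y_train)
print(classification_report(y_test, grid.predict(X_test)))  # precision/recall on held-out data
```

If you specifically need human-readable thresholds ("if feature < X and feature > Y"), a shallow decision tree fitted the same way exposes its split thresholds directly via tree_.threshold.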

How to put more weight on one class during training in Pytorch [duplicate]

I have a multilabel classification problem, which I am trying to solve with CNNs in PyTorch. I have 80,000 training examples and 7,900 classes; every example can belong to multiple classes at the same time, and the mean number of classes per example is 130.
The problem is that my dataset is very imbalanced. For some classes I have only ~900 examples, which is around 1%. For the "overrepresented" classes I have ~12,000 examples (15%). When I train the model I use BCEWithLogitsLoss from PyTorch with the positive-weights parameter. I calculate the weights the same way as described in the documentation: the number of negative examples divided by the number of positive ones.
As a result, my model overestimates almost every class. For both minor and major classes I get almost twice as many predictions as true labels, and my AUPRC is just 0.18. Even so, it is much better than no weighting at all, since in that case the model predicts everything as zero.
So my question is: how do I improve the performance? Is there anything else I can do? I have tried different batch sampling techniques (to oversample the minority classes), but they don't seem to work.
I would suggest one of these two strategies:
Focal Loss
A very interesting approach to dealing with unbalanced training data through tweaking of the loss function was introduced in
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollar Focal Loss for Dense Object Detection (ICCV 2017).
They propose to modify the binary cross-entropy loss in a way that decreases the loss and gradient of easily classified examples while "focusing the effort" on examples where the model makes gross errors.
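A minimal sketch of a focal-loss variant of BCEWithLogitsLoss for a multilabel setup might look like this (the gamma/alpha values are the paper's defaults, not tuned for your data):

```python
import torch
import torch.nn.functional as F

def focal_bce_with_logits(logits, targets, gamma=2.0, alpha=0.25):
    # element-wise binary cross-entropy, kept unreduced so it can be re-weighted
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    probs = torch.sigmoid(logits)
    # p_t: the predicted probability assigned to the correct label of each element
    p_t = probs * targets + (1 - probs) * (1 - targets)
    # alpha_t balances positives vs. negatives; (1 - p_t)**gamma down-weights easy examples
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```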
Hard Negative Mining
Another popular approach is "hard negative mining": propagate gradients only for part of the training examples - the "hard" ones.
See, e.g.:
Abhinav Shrivastava, Abhinav Gupta and Ross Girshick Training Region-based Object Detectors with Online Hard Example Mining (CVPR 2016)
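A rough sketch of an online-hard-example-mining style loss (the keep_ratio and the per-example averaging are assumptions for illustration, not the exact scheme from the paper):

```python
import torch
import torch.nn.functional as F

def ohem_bce_with_logits(logits, targets, keep_ratio=0.25):
    # per-element loss, then one average loss value per example in the batch
    per_element = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    per_example = per_element.mean(dim=1)
    # backpropagate only through the hardest fraction of the batch
    k = max(1, int(keep_ratio * per_example.numel()))
    hardest, _ = per_example.topk(k)
    return hardest.mean()
```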
@Shai has provided two strategies developed in the deep learning era. I would like to offer some additional, traditional machine learning options: over-sampling and under-sampling.
The main idea of both is to produce a more balanced dataset by resampling before you start training. Note that you will probably face problems such as losing data diversity (under-sampling) or overfitting the training data (over-sampling), but they can be a good starting point.
See the wiki link for more information.
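In PyTorch, one way to oversample rare classes is a WeightedRandomSampler; the weighting heuristic below (weight each example by its rarest positive label) is just one possible choice for a multilabel dataset:

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# y: multi-hot label matrix of shape (num_examples, num_classes); train_dataset: your Dataset
class_counts = y.sum(dim=0)
class_weights = 1.0 / class_counts.clamp(min=1)            # rarer class -> larger weight
sample_weights = (y * class_weights).max(dim=1).values     # weight by the rarest positive label
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights), replacement=True)
loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)
```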

How to get sentiment score for a word in a given dataset

I have a sentiment analysis dataset that is labeled in three categories: positive, negative, and neutral. I also have a list of words (mostly nouns), for which I want to calculate the sentiment value, to understand "how" (positively or negatively) these entities were talked about in the dataset. I have read some online resources like blogs and thought about a couple of approaches for calculating the sentiment score for a particular word X.
Count how many of the data instances (sentences) containing the word X have "positive", "negative", and "neutral" labels, and then calculate the weighted average sentiment for that word (a rough sketch of what I mean is below).
Take a generic untrained BERT architecture and train it on the dataset. Then pass each word from the list to that trained model to get a sentiment score for the word.
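As a rough illustration of the first approach (the label weights and whitespace tokenization here are arbitrary choices, just to make the idea concrete):

```python
from collections import defaultdict

LABEL_WEIGHTS = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}  # arbitrary weighting

def word_sentiment_scores(dataset, words):
    # dataset: iterable of (sentence, label) pairs; words: the target words
    counts = {w: defaultdict(int) for w in words}
    for sentence, label in dataset:
        tokens = set(sentence.lower().split())
        for w in words:
            if w.lower() in tokens:
                counts[w][label] += 1
    scores = {}
    for w, label_counts in counts.items():
        total = sum(label_counts.values())
        if total:  # weighted average over the sentences that mention the word
            scores[w] = sum(LABEL_WEIGHTS[l] * n for l, n in label_counts.items()) / total
    return scores
```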
Do any of these approaches make sense? If so, can you suggest some related work that I can look at?
If these approaches don't make sense, could you please advise how I can calculate the sentiment score for a word in a given dataset?
The first method will suffer from the same drawbacks as other bag-of-words models. Consider that you have a dataset of movie reviews with their sentiment scores, and you want to find the sentiment for a particular actor called X. The label for a sample like "X's acting was the only good thing in an otherwise bad movie" will be negative, but the sentiment towards X is positive. A simple approach like the first one can't handle such cases.
The second approach also does not make much sense, as BERT models may not perform well without context. You can try weakly supervised learning, which can help in creating token-level labels. Read section 3.3 of this paper to get an idea of how that works. Disclaimer: I'm one of the authors of that paper.

How to evaluate and explain the trained model in this machine learning?

I am new to machine learning. I ran a test but do not know how to explain and evaluate the results.
Case 1:
I first randomly divide the data (data A, about 8,000 words) into 10 groups (a1..a10). Within each group, I use 90% of the data to build an n-gram model, which is then tested on the remaining 10% of the data in that group. The result is below 10% accuracy. The other 9 groups are handled the same way (a model is built for each and tested on that group's remaining 10% of data). All results are around 10% accuracy. (Is this 10-fold cross-validation?)
Case 2:
I first build an n-gram model on the entire dataset (data A) of about 8,000 words. Then I divide A into 10 groups (a1, a2, ..., a10), randomly of course. I then use this n-gram model to test a1, a2, ..., a10 respectively. I find the model achieves almost 96% accuracy on all groups.
How can I explain these situations?
Thanks in advance.
Yes, Case 1 is 10-fold cross-validation.
Case 2 has the common flaw of testing on the training set. That is why the accuracy is inflated. It is unrealistic because, in real life, your test instances are novel and previously unseen by the system.
N-fold cross validation is a valid evaluation method used in many works.
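Since the original n-gram model isn't shown, here is a rough sketch using a scikit-learn n-gram classifier as a stand-in, just to contrast evaluating on held-out folds (Case 1) with evaluating on the same data the model was fit on (Case 2):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# texts, labels: numpy arrays holding your sentences and their targets (placeholders)
fold_scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(texts):
    vec = CountVectorizer(ngram_range=(1, 2))
    X_train = vec.fit_transform(texts[train_idx])   # n-gram features fit only on the training fold
    clf = MultinomialNB().fit(X_train, labels[train_idx])
    X_test = vec.transform(texts[test_idx])          # the held-out fold was never seen in training
    fold_scores.append(accuracy_score(labels[test_idx], clf.predict(X_test)))

print(np.mean(fold_scores))  # honest estimate; scoring on the training data (Case 2) inflates it
```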
You need to read up on the topic of overfitting.
The situation you describe gives the impression that your n-gram model is heavily overfitted: it can "memorize" 96% of the training data, but when trained on a proper subset, it achieves only about 10% accuracy on unseen data.
This is called 10-fold cross-validation.

information criteria for confusion matrices

One can measure the goodness of fit of a statistical model using the Akaike Information Criterion (AIC), which accounts for both goodness of fit and the number of parameters used to build the model. AIC involves calculating the maximized value of the likelihood function for that model (L).
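(For reference, AIC = 2k - 2·ln(L), where k is the number of estimated parameters and L is the maximized value of the likelihood function.)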
How can one compute L, given prediction results of a classification model, represented as a confusion matrix?
It is not possible to calculate the AIC from a confusion matrix, since it doesn't contain any information about the likelihood. Depending on the model you are using, it may be possible to calculate the likelihood or quasi-likelihood and hence the AIC or QIC.
What is the classification problem that you are working on, and what is your model?
In a classification context, other measures are often used for goodness-of-fit testing. I'd recommend reading through The Elements of Statistical Learning by Hastie, Tibshirani and Friedman to get a good overview of this kind of methodology.
Hope this helps.
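For a probabilistic classifier, the likelihood can be computed from the predicted probabilities rather than from the confusion matrix; a minimal sketch, assuming a binary problem and scikit-learn's logistic regression (your actual model may differ), might look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X, y: feature matrix and 0/1 labels (placeholders)
model = LogisticRegression().fit(X, y)
p = model.predict_proba(X)[:, 1]                             # predicted P(y = 1 | x)
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
k = X.shape[1] + 1                                           # coefficients plus the intercept
aic = 2 * k - 2 * log_likelihood
print(aic)
```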
Information-Based Evaluation Criterion for Classifier's Performance by Kononenko and Bratko is exactly what I was looking for:
Classification accuracy is usually used as a measure of classification performance. This measure is, however, known to have several defects. A fair evaluation criterion should exclude the influence of the class probabilities which may enable a completely uninformed classifier to trivially achieve high classification accuracy. In this paper a method for evaluating the information score of a classifier's answers is proposed. It excludes the influence of prior probabilities, deals with various types of imperfect or probabilistic answers and can be used also for comparing the performance in different domains.
