BLEU score value higher than 1

I've been looking at how the BLEU score works. What I understood from the online videos + the original research paper is that the BLEU score should be within the range 0-1.
Then, when I started to look at some research papers, I found that the reported BLEU value is (almost) always higher than 1!
For instance, have a look here:
https://www.aclweb.org/anthology/W19-8624.pdf
https://arxiv.org/pdf/2005.01107v1.pdf
Am I missing something?
Another small point: what do the headers in the table below mean? Was the BLEU score calculated using unigrams, then unigrams & bigrams (averaged), etc., or was each n-gram size calculated independently?
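For what it's worth, BLEU as defined in the paper does lie in [0, 1]; papers conventionally report it multiplied by 100, which is why you see values above 1. Headers like BLEU-1 through BLEU-4 usually denote cumulative scores (a geometric mean of the modified precisions up to that n-gram order), though some papers do report each order independently. A minimal sketch with NLTK (the example sentences are made up):

from nltk.translate.bleu_score import corpus_bleu

references = [[["the", "cat", "sat", "on", "the", "mat"]]]  # one reference per hypothesis
hypotheses = [["the", "cat", "sat", "on", "a", "mat"]]

# Cumulative BLEU-1 .. BLEU-4: weights spread evenly over n-gram orders up to n
cumulative_weights = [(1, 0, 0, 0), (0.5, 0.5, 0, 0),
                      (1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)]
for n, weights in enumerate(cumulative_weights, start=1):
    score = corpus_bleu(references, hypotheses, weights=weights)  # in [0, 1]
    print(f"BLEU-{n}: {100 * score:.2f}")  # the 0-100 convention papers use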

Related

Ratios of co-occurrence probabilities can encode meaning components

I'm studying NLP these days through CS224N, Stanford's NLP course, and I have a question about co-occurrence probabilities.
I can understand that the first and second rows each show a co-occurrence probability, but it's hard to understand the last row. Hard to figure out what it is...
In the lecture it was said that a meaning component is something like female-to-male, or king-to-queen, so I thought it would be gender in this example.
In the last row I thought it was condition, but it's hard to find the correlation between Large and Small..
[Image: table of co-occurrence probabilities and their ratios from the lecture slides]
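If the table in your screenshot is the ice/steam example from the GloVe paper (my assumption from the description), a toy calculation shows what the last row encodes:

# Co-occurrence probabilities loosely taken from the GloVe paper's
# ice/steam table (assuming that's what the screenshot shows).
P = {
    "ice":   {"solid": 1.9e-4, "gas": 6.6e-5, "water": 3.0e-3, "fashion": 1.7e-5},
    "steam": {"solid": 2.2e-5, "gas": 7.8e-4, "water": 2.2e-3, "fashion": 1.8e-5},
}

for k in ["solid", "gas", "water", "fashion"]:
    print(f"P({k}|ice) / P({k}|steam) = {P['ice'][k] / P['steam'][k]:.2f}")

# Large ratio (solid, ~8.9)     -> k relates to ice but not steam.
# Small ratio (gas, ~0.085)     -> k relates to steam but not ice.
# Ratio near 1 (water, fashion) -> k is equally (ir)relevant to both.
# So the last row isolates the solid/gas (state-of-matter) meaning
# component, cancelling out words like "water" that relate to both.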

With an RNN, how can we predict whether a currency price will reach a specific value in a given time period?

I am working on a bitcoin price predictor, and I realize that it makes no sense to predict an exact price at a given time.
What we want when predicting a currency price can be summarized by this question: "What is the probability of the price reaching value X within a specific time range?"
I'm having a hard time integrating this idea into an RNN/LSTM architecture. My first thought was to build a custom loss function that compares the output of the RNN (typically, a predicted price) with the real lower and upper prices of the next day: if lower_price < predicted_value < upper_price, the RNN output would be "classified" as correct (loss = 0); otherwise the loss would be > 0. But I am sure there already exists a better solution for this kind of problem.
Any idea?
Thank you
There are a number of different ways to do what you are asking. However, I think what you are looking for is a quantile loss function.
from tensorflow.keras import backend as K

def tilted_loss(q, y, f):
    e = y - f  # error: positive when the true value lies above the prediction
    return K.mean(K.maximum(q * e, (q - 1) * e), axis=-1)
Notebook with full source code. Or, if you prefer PyTorch, you can find an implementation here.
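As a usage sketch (hypothetical shapes and layer sizes, assuming TensorFlow/Keras): train one model per quantile, and the resulting quantile band answers the "probability of reaching X" question directly.

import numpy as np
from tensorflow import keras

def build_model(q, timesteps=30, features=1):
    # Hypothetical architecture: one small LSTM model per quantile of interest
    model = keras.Sequential([
        keras.layers.LSTM(32, input_shape=(timesteps, features)),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss=lambda y, f: tilted_loss(q, y, f))
    return model

# The 0.1- and 0.9-quantile predictions form an 80% band for the next price;
# checking whether a target X falls inside that band approximates your question.
lower_model, upper_model = build_model(0.1), build_model(0.9)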

How to evaluate an auto-generated summary against gold summaries with the ROUGE metric?

I'm working on an automatic summarization system and I want to evaluate my output summary against my gold summaries. I have multiple gold summaries of different lengths for each case, so I'm a little confused here.
My question is: how should I evaluate my summary against these gold summaries? Should I evaluate mine against each gold summary and then average the results, or treat the union of the gold summaries as one gold summary and evaluate mine against that?
Thank you in advance
The ROUGE measure compares your summary with all of the reference summaries.
For example, ROUGE-N is computed as the sum of matching n-gram counts between your summary and each of the reference summaries, divided by the total number of n-grams occurring in all of the reference summaries.
This paper on ROUGE will help you.
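A minimal sketch of that computation (my own illustrative implementation, not an official package; for publishable numbers you'd use a standard ROUGE toolkit):

from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=2):
    # Matched n-grams (clipped per reference) summed over all references,
    # divided by the total n-gram count of all references (a recall).
    cand = ngram_counts(candidate.split(), n)
    matched, total = 0, 0
    for ref in references:
        ref_counts = ngram_counts(ref.split(), n)
        matched += sum(min(count, ref_counts[g]) for g, count in cand.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0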

Fleiss' kappa score for inter-annotator agreement

In my dataset I have a set of categories, where for every category I have a set of 150 examples. Each example has been annotated as true/false by 5 human raters. I am computing the inter-annotator agreement using the Fleiss-kappa score:
1) for the entire dataset
2) for each category in particular
However, the results I obtained show that the Fleiss-kappa score for the entire dataset does not equal the average of the Fleiss-kappa score for each category. In my computation I am using a standard built-in package to compute the scores. Could this be due to a bug in my matrix computation, or are the scores not supposed to be equal? Thanks!
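For what it's worth, the scores are not supposed to be equal in general: kappa normalizes observed agreement by a chance-agreement term computed from each table's own marginals, and that term changes when you pool categories. A quick synthetic check with statsmodels (data shapes assumed to match your description):

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
# Synthetic stand-in: 4 categories x 150 examples x 5 raters, binary labels
ratings = rng.integers(0, 2, size=(4, 150, 5))

per_category = [fleiss_kappa(aggregate_raters(cat)[0]) for cat in ratings]
overall = fleiss_kappa(aggregate_raters(ratings.reshape(-1, 5))[0])
print(np.mean(per_category), overall)  # these generally differ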

binomial distribution z-score value too large

I tried to solve this question with
n = 500, p = 0.9/100, and q = 1 - 0.9/100,
but I'm getting a very large z-score and mean.
Paycheck Errors: The payroll department of a hospital has found that in one year, 0.9% of its paychecks are calculated incorrectly. The hospital has 500 employees.
(a) What is the probability that in one month’s records no paycheck errors are made?
(b) What is the probability that in one month’s records at least one paycheck error is made?
The Z transformation is a poor approximation to the binomial distribution for npq < 10. For your problem, npq = 4.4595, so the Z approximation is a no-go.
You'd do better to calculate it exactly as a binomial using software, or approximate it as a Poisson with rate λ=np. Once you solve part (a), part (b) is just the complement.
I went ahead and calculated part (a) both ways. The Poisson approximation differs from the exact calculation by only 0.00022.
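A sketch of both calculations with SciPy (taking n = 500 paychecks in the month, which is my reading of the problem):

from scipy.stats import binom, poisson

n, p = 500, 0.009  # 500 paychecks in the month, 0.9% error rate

p0_exact = binom.pmf(0, n, p)       # (a) exact binomial: ~0.01089
p0_poisson = poisson.pmf(0, n * p)  # Poisson approximation, lambda = 4.5: ~0.01111

p_at_least_one = 1 - p0_exact       # (b) complement of (a): ~0.98911
print(p0_exact, p0_poisson, p_at_least_one)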
You should use the binomial distribution formula rather than the sampling distribution formula.
