"When 1-gram precision is high, the reference tends to satisfy
adequacy.
When longer n-gram precision is high, the reference tends to account
for fluency."
What does this mean?
Adequacy: How much of the source information is preserved?
Fluency: How good is the generated target language quality?
For Machine Translation task,
English(S-V-O): A dog chased a cat
Reference translation in Hindi would look like:
Hindi(S-O-V): A dog a cat chased
When 1-gram precision is high, correct word-to-word translations will be high but order of those words might not be correct in translated sentence. But still, majority of the source information is preserved.
High 1-gram, low N-gram(2-gram) precision value example: chased dog cat
When N-gram precision is high, order of those words will be preserved to some extent and you would get a fluent sentence in Hindi.
High N-gram(2-gram) precision example: A dog cat chased
Related
So here is an example of a column in my data-set:
"industries": ["Gaming", "fitness and wellness"]
The industries column has hundreds of different tags, some of which can have the same meaning, for example, some rows have: "Gaming" and some have "video games" and others "Games & consoles".
I'd like to "lemmatize" these tags so I could query the data and not worry about minute differences in the presentation (if they are basically the same).
What is the standard solution in this case?
I don't know that there is a "standard" solution, but I can suggest a couple of approaches, ranked by increasing depth of knowledge, or going from the surface form to the meaning.
String matching
Lemmatisation/stemming
Word embedding vector distance
String matching is based on the calculating the difference between strings, as a measure of how many characters they share or how many editing steps it takes to transform one into the other. Levenshtein distance is one of the most common ones. However, depending on the size of your data, it might be a bit inefficient to use. This is a really cool approach to find most similar strings in a large data set.
However, it might not be the most suitable one for your particular data set, as your similarities seem more semantic and less bound to the surface form of the words.
Lemmatisation/stemming goes beyond the surface by analysing the words apart based on their morphology. In your example, gaming and games both have the same stem game, so you could base your similarity measure on matching stems. This can be better than pure string matching as you can see that *go" and went are related
Word embeddings go beyond the surface form by encoding meaning as the context in which words appear and as such, might find a semantic similarity between health and *fitness", that is not apparent from the surface at all! The similarity is measured as the cosine distance/similarity between two word vectors, which is basically the angle between the two vectors.
It seems to me that the third approach might be most suitable for your data.
This is a tough NLU question! Basically 'what are synonyms or near synonyms of each other, even if there's not exact string overlap?'.
1. Use GLoVE word embeddings to judge synonymous words
It might be interesting to use spaCy's pre-trained GLoVE model (en_vectors_web_lg) for word embeddings, to get the pairwise distances between tokens, and use that as a metric for judging 'closeness'.
nlp = spacy.load('en_vectors_web_lg')
doc1 = nlp('board games')
doc2 = nlp('Games & Recreation')
doc3 = nlp('video games')
for doc in [doc1, doc2, doc3]:
for comp in [doc1, doc2, doc3]:
if doc != comp:
print(f'{doc} | {comp} | similarity: {round(doc.similarity(comp), 4)}')
board games | Games & Recreation | similarity: 0.6958
board games | video games | similarity: 0.7732
Games & Recreation | board games | similarity: 0.6958
Games & Recreation | video games | similarity: 0.675
video games | board games | similarity: 0.7732
video games | Games & Recreation | similarity: 0.675
(GLoVE is cool - really nice mathematical intuition for word embeddings.)
PROS: GLoVE is robust, spaCy has it built in, vector space comparisons are easy in spaCy
CONS: It doesn't handle out of vocabulary words well, spaCy's just taking the average of all the token vectors here (so it's sensitive to document length)
2. Try using different distance metrics/fuzzy string matching
You might also look at different kinds of distance metrics -- cosine distance isn't the only one.
FuzzyWuzzy is a good implementation of Levenshtein distance for fuzzy string matching (no vectors required).
This library implements a whole slew of string-matching algorithms.
PROS: Using a preconfigured library saves you some coding, other distance metrics might help you find new correlations, don't need to train a vector model
CONS: More dependencies, some kinds of distance aren't appropriate and will miss synonymous words without literal string overlap
3. Use WordNet to get synonym sets
You could also get a sort of dictionary of synonym sets ('synsets') from WordNet, which was put together by linguists as a kind of semantic knowledge graph.
The nice thing about this is it gets you some textual entailment -- that is, given sentence A, would a reader think sentence B is most likely true?
Because it was handmade by linguists and grad students, WordNet isn't as dependent on string overlap and can give you nice semantic enrichment. It also provides things like hyponyms/meroynms and hypernyms/holonyms -- so you could, e.g., say 'video game' is a subtype of 'game', which is a subset of 'recreation' -- just based off of WordNet.
You can access WordNet in python through the textblob library.
from textblob import Word
from textblob.wordnet import NOUN
game = Word('game').get_synsets(pos=NOUN)
for synset in game:
print(synset.definition())
a contest with rules to determine a winner
a single play of a sport or other contest
an amusement or pastime
animal hunted for food or sport
(tennis) a division of play during which one player serves
(games) the score at a particular point or the score needed to win
the flesh of wild animals that is used for food
a secret scheme to do something (especially something underhand or illegal)
the game equipment needed in order to play a particular game
your occupation or line of work
frivolous or trifling behavior
print(game[0].hyponyms())
[Synset('athletic_game.n.01'),
Synset('bowling.n.01'),
Synset('card_game.n.01'),
Synset('child's_game.n.01'),
Synset('curling.n.01'),
Synset('game_of_chance.n.01'),
Synset('pall-mall.n.01'),
Synset('parlor_game.n.01'),
Synset('table_game.n.01'),
Synset('zero-sum_game.n.01')]
Even cooler, you can get the similarity based on these semantic features between any words you like.
print((Word('card_game').synsets[0]).shortest_path_distance(Word('video_game').synsets[0]))
5
PROS: Lets you use semantic information like textual entailment to get at your objective, which is hard to get in other ways
CONS: WordNet is limited to what is in WordNet, so again out-of-vocabulary words may be a problem for you.
I Suggest to use the word2vector approcah or the lemmatisation approach:
With the first one you can compute vectors starting from words, and so you have a projection into a vectorial space. With this projection you can compute the similarity between words (with cosine similarity as #Shnipp sad) and then put a threshold, above which you say that two words belong do different arguments.
Using lemmatisation you can compare bare words/lemma using SequenceMatcher. In this case you condition of equality could be based on the presence of very similar lemmas (similarity above 95%).
It's us to you to choose the best for your purpouse. If you want something solid and structured use word2vec. Othervise if you something simple and fast to implement use the lemmatisation approcah.
With the results of two different summary systems (sys1 and sys2) and the same reference summaries, I evaluated them with both BLEU and ROUGE. The problem is: All ROUGE scores of sys1 was higher than sys2 (ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-4, ROUGE-L, ROUGE-SU4, ...) but the BLEU score of sys1 was less than the BLEU score of sys2 (quite much).
So my question is: Both ROUGE and BLEU are based on n-gram to measure the similar between the summaries of systems and the summaries of human. So why there are differences in results of evaluation like that? And what's the main different of ROUGE vs BLEU to explain this issue?
In general:
Bleu measures precision: how much the words (and/or n-grams) in the machine generated summaries appeared in the human reference summaries.
Rouge measures recall: how much the words (and/or n-grams) in the human reference summaries appeared in the machine generated summaries.
Naturally - these results are complementing, as is often the case in precision vs recall. If you have many words from the system results appearing in the human references you will have high Bleu, and if you have many words from the human references appearing in the system results you will have high Rouge.
In your case it would appear that sys1 has a higher Rouge than sys2 since the results in sys1 consistently had more words from the human references appear in them than the results from sys2. However, since your Bleu score showed that sys1 has lower recall than sys2, this would suggest that not so many words from your sys1 results appeared in the human references, in respect to sys2.
This could happen for example if your sys1 is outputting results which contain words from the references (upping the Rouge), but also many words which the references didn't include (lowering the Bleu). sys2, as it seems, is giving results for which most words outputted do appear in the human references (upping the Blue), but also missing many words from its results which do appear in the human references.
BTW, there's something called brevity penalty, which is quite important and has already been added to standard Bleu implementations. It penalizes system results which are shorter than the general length of a reference (read more about it here). This complements the n-gram metric behavior which in effect penalizes longer than reference results, since the denominator grows the longer the system result is.
You could also implement something similar for Rouge, but this time penalizing system results which are longer than the general reference length, which would otherwise enable them to obtain artificially higher Rouge scores (since the longer the result, the higher the chance you would hit some word appearing in the references). In Rouge we divide by the length of the human references, so we would need an additional penalty for longer system results which could artificially raise their Rouge score.
Finally, you could use the F1 measure to make the metrics work together:
F1 = 2 * (Bleu * Rouge) / (Bleu + Rouge)
Both ROUGE and BLEU are based on n-gram to measure the similar between the summaries of systems and the summaries of human. So why there are differences in results of evaluation like that? And what's the main different of ROUGE vs BLEU to explain this issue?
There exist both the ROUGE-n precision and the ROUGE-n precision recall. the original ROUGE implementation from the paper that introduced ROUGE {3} computes both, as well as the resulting F1-score.
From http://text-analytics101.rxnlp.com/2017/01/how-rouge-works-for-evaluation-of.html (mirror):
ROUGE recall:
ROUGE precision:
(The original ROUGE implementation from the paper that introduced ROUGE {1} may perform a few more things such as stemming.)
The ROUGE-n precision and recall are easy to interpret, unlike BLEU (see Interpreting ROUGE scores).
The difference between the ROUGE-n precision and BLEU is that BLEU introduces a brevity penalty term, and also compute the n-gram match for several size of n-grams (unlike the ROUGE-n, where there is only one chosen n-gram size).
Stack Overflow does not support LaTeX so I won't go into more formulas to compare against BLEU. {2} explains BLEU clearly.
References:
{1} Lin, Chin-Yew. "Rouge: A package for automatic evaluation of summaries." In Text summarization branches out: Proceedings of the ACL-04 workshop, vol. 8. 2004. https://scholar.google.com/scholar?cluster=2397172516759442154&hl=en&as_sdt=0,5 ; http://anthology.aclweb.org/W/W04/W04-1013.pdf
{2} Callison-Burch, Chris, Miles Osborne, and Philipp Koehn. "Re-evaluation the Role of Bleu in Machine Translation Research." In EACL, vol. 6, pp. 249-256. 2006. https://scholar.google.com/scholar?cluster=8900239586727494087&hl=en&as_sdt=0,5 ;
ROGUE and BLEU are both set of metrics applicable for the task of creating the text summary. Originally BLEU was needed for machine translation, but it is perfectly applicable for the text summary task.
It is best to understand the concepts using examples. First, we need to have summary candidate (machine learning created summary) like this:
the cat was found under the bed
And the gold standard summary (usually created by human):
the cat was under the bed
Let's find precision and recall for the unigram (each word) case. We use words as metrics.
Machine learning summary has 7 words (mlsw=7), gold standard summary has 6 words (gssw=6), and the number of overlapping words is again 6 (ow=6).
The recall for the machine learning would be: ow/gssw=6/6=1
The precision for the machine learning would be: ow/mlsw=6/7=0.86
Similarly we can compute precision and recall scores on grouped unigrams, bigrams, n-grams...
For the ROGUE we know it uses both recall and precision, and also the F1 score which is the harmonic mean of these.
For BLEU, well it also use precision twinned with recall but uses geometric mean and brevity penalty.
Subtle differences, but it is important to note they both use precision and recall.
At least 3 types of n-grams can be considered for representing text documents:
byte-level n-grams
character-level n-grams
word-level n-grams
It's unclear to me which one should be used for a given task (clustering, classification, etc). I read somewhere that character-level n-grams are preferred to word-level n-grams when the text contains typos, so that "Mary loves dogs" remains similar to "Mary lpves dogs".
Are there other criteria to consider for choosing the "right" representation?
Evaluate. The criterion for choosing the representation is whatever works.
Indeed, character level (!= bytes, unless you only care about english) probably is the most common representation, because it is robust to spelling differences (which do not need to be errors, if you look at history; spelling changes). So for spelling correction purposes, this works well.
On the other hand, Google Books n-gram viewer uses word level n-grams on their books corpus. Because they don't want to analyze spelling, but term usage over time; e.g. "child care", where the individual words aren't as interesting as their combination. This was shown to be very useful in machine translation, often referred to as "refrigerator magnet model".
If you are not processing international language, bytes may be meaningful, too.
I would outright discard byte-level n-grams for text-related tasks, because bytes are not a meaningful representation of anything.
Of the 2 remaining levels, the character-level n-grams will need much less storage space and will , subsequently, hold much less information. They are usually utilized in such tasks as language identification, writer identification (i.e. fingerprinting), anomaly detection.
As for word-level n-grams, they may serve the same purposes, and much more, but they need much more storage. For instance, you'll need up to several gigabytes to represent in memory a useful subset of English word 3-grams (for general-purpose tasks). Yet, if you have a limited set of texts you need to work with, word-level n-grams may not require so much storage.
As for the issue of errors, a sufficiently large word n-grams corpus will also include and represent them. Besides, there are various smoothing methods to deal with sparsity.
There other issue with n-grams is that they will almost never be able to capture the whole needed context, so will only approximate it.
You can read more about n-grams in the classic Foundations of Statistical Natural Language Processing.
I use character ngrams on small strings, and word ngrams for something like text classification of larger chunks of text. It is a matter of which method will preserve the context you need more or less...
In general for classification of text, word ngrams will help a bit with word-sense dissambiguation, where character ngrams would be easily confused and your features could be completely ambiguous. For unsupervised clustering, it will depend on how general you want your clusters, and on what basis you want docs to converge. I find stemming, stopword removal, and word bigrams work well in unsupervised clustering tasks on fairly large corpora.
Character ngrams are great for fuzzy string matching of small strings.
I like to think of a set of grams as a vector, and imagine comparing vectors with the grams you have, then ask yourself if what you are comparing maintains enough context to answer the question you are trying to answer.
HTH
I'm trying to identify important terms in a set of government documents. Generating the term frequencies is no problem.
For document frequency, I was hoping to use the handy Python scripts and accompanying data that Peter Norvig posted for his chapter in "Beautiful Data", which include the frequencies of unigrams in a huge corpus of data from the Web.
My understanding of tf-idf, however, is that "document frequency" refers to the number of documents containing a term, not the number of total words that are this term, which is what we get from the Norvig script. Can I still use this data for a crude tf-idf operation?
Here's some sample data:
word tf global frequency
china 1684 0.000121447
the 352385 0.022573582
economy 6602 0.0000451130774123
and 160794 0.012681757
iran 2779 0.0000231482902018
romney 1159 0.000000678497795593
Simply dividing tf by gf gives "the" a higher score than "economy," which can't be right. Is there some basic math I'm missing, perhaps?
As I understand, Global Frequency is equal "inverse total term frequency" mentioned here Robertson. From this Robertson's paper:
One possible way to get away from this problem would be to make a fairly radical re-
placement for IDF (that is, radical in principle, although it may be not so radical
in terms of its practical effects). ....
the probability from the event space of documents to the event space of term positions
in the concatenated text of all the documents in the collection.
Then we have a new measure, called here
inverse total term frequency:
...
On the whole, experiments with inverse total term frequency weights have tended to show
that they are not as effective as IDF weights
According to this text, you can use inverse global frequency as IDF term, albeit more crude than standard one.
Also you are missing stop words removal. Words such as the are used in almost all documents, therefore they do not give any information. Before tf-idf , you should remove such stop words.
In the part of speech tagger, the best probable tags for the given sentence is determined using HMM by
P(T*) = argmax P(Word/Tag)*P(Tag/TagPrev)
T
But when 'Word' did not appear in the training corpus, P(Word/Tag) produces ZERO for given all possible tags, this leaves no room for choosing the best.
I have tried few ways,
1) Assigning small amount of probability for all unknown words, P(UnknownWord/AnyTag)~Epsilon... means this completely ignores the P(Word/Tag) for unknowns word by assigning the constant probability.. So decision making on unknown word is by prior probabilities.. As expected it is not producing good result.
2) Laplace Smoothing
I confused with this. I don't know what is difference between (1) and this. My way of understanding Laplace Smoothing adds the constant probability(lambda) to all unknown & Known words.. So the All Unknown words will get constant probability(fraction of lambda) and Known words probabilities will be the same relatively since all word's prob increased by Lambda.
Is the Laplace Smoothing same as the previous one ?
*)Is there any better way of dealing with unknown words ?
Your two approaches are similar, but, if I understand correctly, they differ in one key way. In (1) you are assigning extra mass to counts of unknown words and in (2) you are assigning extra mass to all counts. You definitely want to do (2) and not (1).
One of the problems with Laplace smoothing is that it give too much of a boost to unknown words and drags down the probabilities of high-probability words too much (relatively speaking). Your version (1) would actually worsen this problem. Basically, it would over-smooth.
Laplace smoothing words ok for an HMM, but it's not great. Most people do add-one smoothing but you could experiment with things like add-one-half or whatever.
If you want to move beyond this naive approach to smoothing, check out "one-count smoothing", as described in the Appendix of Jason Eisner's HMM tutorial. The basic idea here is that for unknown words more probability mass should be given to tags that appear with a wider variety of low frequency words. For example, since the tag NOUN appears on a large number of different words and DETERMINER appears on a small number of different words, it is more likely that an unseen word will be a NOUN.
If you want to get even fancier, you could use a Chinese Restaurant Process model taken from non-parametric Bayesian statistics to put a prior distribution on unseen word/tag combinations. Kevin Knight's Bayesian inference tutorial has details.
I think the HMM-based TnT tagger provides a better approach to handle unknown words (see the approach in TnT tagger's paper).
The accuracy results (for known words and unknown words) of TnT and other two POS and morphological taggers on 13 languages including Bulgarian, Czech, Dutch, English, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, Thai and Vietnamese, can be found in this article.