I am providing class labels to ELKI (elki-bundle-0.7.1).
It outputs a lot of statistics, such as those below, but I can't find information about what they are.
I know F1-measure, precision, and recall, but how can there be multiple versions of each measure? Aren't they supposed to be calculated from the clustering result?
Thank you
Pair counting measures?
Jaccard 0.3851744186046512
F1-Measure 0.5561385099685204
Precision 0.6463414634146342
Recall 0.4880294659300184
Rand 0.8368055555555556
ARI 0.458537539334965
FowlkesMallows 0.5616348272664993
Entropy based measures?
NMI Joint 0.5758289911830176
NMI Sqrt 0.7309481146561948
BCubed-based measures?
F1-Measure 0.7033781601851384
Recall 0.6901589423648247
Precision 0.7171136653895275
Set-Matching-based measures?
F1-Measure 0.7702702702702702
Purity 0.7916666666666667
Inverse Purity 0.7499999999999998
Editing-distance measures?
F1-Measure 0.6312576312576313
Precision 0.6527777777777778
Recall 0.6111111111111112
Gini measures?
Mean +-0.2958 0.703636303877176
Please see the ELKI documentation. We have implemented many evaluation measures. Here is an excerpt from the list at http://elki.dbs.ifi.lmu.de/wiki/RelatedPublications:
Silhouette:
P. J. Rousseeuw
Silhouettes: A graphical aid to the interpretation and validation of cluster analysis
In: Journal of Computational and Applied Mathematics, Volume 20
Rand Index:
Rand, W. M.
Objective Criteria for the Evaluation of Clustering Methods
In: Journal of the American Statistical Association, Vol. 66 Issue 336
Fowlkes-Mallows:
Fowlkes, E.B. and Mallows, C.L.
A method for comparing two hierarchical clusterings
In: Journal of the American Statistical Association, 78(383)
BCubed:
A. Bagga and B. Baldwin
Entity-based cross-document coreferencing using the Vector Space Model
In: Proc. COLING '98 Proceedings of the 17th international conference on Computational linguistics
Edit-Distance:
Pantel, P. and Lin, D.
Document clustering with committees
In: Proc. 25th ACM SIGIR conference on Research and development in information retrieval
Entropy-based measures:
Meilă, M.
Comparing clusterings by the variation of information
In: Learning theory and kernel machines
Nguyen, X. V. and Epps, J. and Bailey, J.
Information theoretic measures for clusterings comparison: is a correction for chance necessary?
In: Proc. ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning
Set-Matching purity:
Steinbach, M. and Karypis, G. and Kumar, V.
A comparison of document clustering techniques
In: KDD workshop on text mining, 2000
E. Amigó, J. Gonzalo, J. Artiles, and F. Verdejo
A comparison of extrinsic clustering evaluation metrics based on formal constraints
In: Inf. Retrieval, vol. 12, no. 5
Meilă, M.
Comparing clusterings
In: University of Washington, Seattle, Technical Report 418, 2002
Zhao, Y. and Karypis, G.
Criterion functions for document clustering: Experiments and analysis
In: University of Minnesota, Department of Computer Science, Technical Report 01-40, 2001
C-Index:
L. J. Hubert and J. R. Levin
A general statistical framework for assessing categorical clustering in free recall.
In: Psychological Bulletin, Vol. 83(6)
Concordant pairs:
F. B. Baker, and L. J. Hubert
Measuring the Power of Hierarchical Cluster Analysis
In: Journal of the American Statistical Association, 70(349)
F. J. Rohlf
Methods of comparing classifications
In: Annual Review of Ecology and Systematics
Davies-Bouldin:
D. L. Davies and D. W. Bouldin
A Cluster Separation Measure
In: IEEE Transactions Pattern Analysis and Machine Intelligence PAMI-1(2)
PBM:
M. K. Pakhira, and S. Bandyopadhyay, and U. Maulik
Validity index for crisp and fuzzy clusters
In: Pattern recognition, 37(3)
Variance-Ratio Criteria:
T. Caliński and J. Harabasz
A dendrite method for cluster analysis
In: Communications in Statistics-theory and Methods, 3(1)
We also have DBCV, but the code is not reviewed and merged yet.
My personal recommendation is to use the Adjusted Rand Index, because of its nice adjustment for chance. An ARI below 0 means the result is worse than random; with almost every other measure, even a random result will score positively.
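If you want to see the effect of the chance adjustment outside ELKI, here is a minimal sketch (using scikit-learn rather than ELKI, purely for illustration; it assumes a recent scikit-learn that provides rand_score):

from random import randrange
from sklearn.metrics import adjusted_rand_score, rand_score

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
random_labels = [randrange(3) for _ in truth]

# The unadjusted Rand index of a random labeling is typically well above 0,
# while the Adjusted Rand Index stays around 0 and can even go negative.
print(rand_score(truth, random_labels))
print(adjusted_rand_score(truth, random_labels))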
Related
Sorry for this very basic question. I've trawled through previous pages and cannot quite find a case that corresponds to our situation.
320 individuals rated two types of films. The rating was provided on a 1-11 scale. There are many films of each type. In short, the DV is a continuous variable.
20 individuals have a particular disease that we consider of interest. We would like to examine the effect of the disease on the rating.
We conducted a 2-way repeated measures ANOVA, using 'situation type' as a within-subject factor and 'disease status' as a between-subject factor, using SPSS. The design is obviously unbalanced, with more observations in the healthy group. The data appeared to be normally distributed, and Levene's test suggested equality of variances. Does that mean it is appropriate to use ANOVA for this analysis?
With the results of two different summary systems (sys1 and sys2) and the same reference summaries, I evaluated them with both BLEU and ROUGE. The problem is: all ROUGE scores of sys1 were higher than those of sys2 (ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-4, ROUGE-L, ROUGE-SU4, ...), but the BLEU score of sys1 was lower than the BLEU score of sys2 (by quite a lot).
So my question is: both ROUGE and BLEU are n-gram-based measures of the similarity between system summaries and human summaries, so why are the evaluation results so different? And what is the main difference between ROUGE and BLEU that explains this?
In general:
Bleu measures precision: how many of the words (and/or n-grams) in the machine-generated summaries appear in the human reference summaries.
Rouge measures recall: how many of the words (and/or n-grams) in the human reference summaries appear in the machine-generated summaries.
Naturally, these measures are complementary, as is often the case with precision vs. recall. If many words from the system results appear in the human references you will have a high Bleu, and if many words from the human references appear in the system results you will have a high Rouge.
In your case, sys1 has a higher Rouge than sys2 because the results of sys1 consistently contained more of the words from the human references than the results of sys2 did. However, since your Bleu score showed that sys1 has lower precision than sys2, this suggests that not so many of the words in your sys1 results appeared in the human references, relative to sys2.
This could happen, for example, if your sys1 is outputting results which contain words from the references (upping the Rouge), but also many words which the references didn't include (lowering the Bleu). sys2, it seems, is giving results for which most of the words output do appear in the human references (upping the Bleu), but whose results are also missing many words that do appear in the human references (lowering the Rouge).
BTW, there's something called the brevity penalty, which is quite important and is already included in standard Bleu implementations. It penalizes system results which are shorter than the general length of a reference. This complements the n-gram metric's behavior, which in effect already penalizes results that are longer than the reference, since the denominator grows the longer the system result is.
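Roughly, the brevity penalty works like this (a sketch; details vary between implementations):

import math

def brevity_penalty(candidate_len, reference_len):
    # No penalty when the candidate is at least as long as the reference;
    # otherwise the penalty decays exponentially with the missing length.
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)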
You could also implement something similar for Rouge, but this time penalizing system results which are longer than the general reference length, which would otherwise let them obtain artificially higher Rouge scores (the longer the result, the higher the chance of hitting some word that appears in the references). In Rouge we divide by the length of the human references, so we need an additional penalty for longer system results which could otherwise artificially raise their Rouge score.
Finally, you could use the F1 measure to make the metrics work together:
F1 = 2 * (Bleu * Rouge) / (Bleu + Rouge)
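In code this is just the harmonic mean (a trivial sketch; the variable names are mine):

def combined_f1(bleu, rouge):
    # Harmonic mean of the precision-like and recall-like scores.
    return 2 * bleu * rouge / (bleu + rouge) if (bleu + rouge) > 0 else 0.0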
Both ROUGE and BLEU are n-gram-based measures of the similarity between system summaries and human summaries, so why are the evaluation results so different? And what is the main difference between ROUGE and BLEU that explains this?
There exist both ROUGE-n precision and ROUGE-n recall. The original ROUGE implementation from the paper that introduced ROUGE {1} computes both, as well as the resulting F1-score.
From http://text-analytics101.rxnlp.com/2017/01/how-rouge-works-for-evaluation-of.html (mirror):
ROUGE recall: (number of overlapping n-grams) / (total number of n-grams in the reference summary)
ROUGE precision: (number of overlapping n-grams) / (total number of n-grams in the system summary)
(The original ROUGE implementation from the paper that introduced ROUGE {1} may perform a few more things such as stemming.)
The ROUGE-n precision and recall are easy to interpret, unlike BLEU (see Interpreting ROUGE scores).
The difference between ROUGE-n precision and BLEU is that BLEU introduces a brevity penalty term and also computes the n-gram match for several n-gram sizes (unlike ROUGE-n, where there is a single chosen n-gram size).
Stack Overflow does not support LaTeX so I won't go into more formulas to compare against BLEU. {2} explains BLEU clearly.
References:
{1} Lin, Chin-Yew. "Rouge: A package for automatic evaluation of summaries." In Text summarization branches out: Proceedings of the ACL-04 workshop, vol. 8. 2004. https://scholar.google.com/scholar?cluster=2397172516759442154&hl=en&as_sdt=0,5 ; http://anthology.aclweb.org/W/W04/W04-1013.pdf
{2} Callison-Burch, Chris, Miles Osborne, and Philipp Koehn. "Re-evaluation the Role of Bleu in Machine Translation Research." In EACL, vol. 6, pp. 249-256. 2006. https://scholar.google.com/scholar?cluster=8900239586727494087&hl=en&as_sdt=0,5
ROUGE and BLEU are both sets of metrics applicable to the task of creating text summaries. BLEU was originally developed for machine translation, but it is perfectly applicable to the text summarization task as well.
It is best to understand the concepts using examples. First, we need a candidate summary (the machine-generated summary), like this:
the cat was found under the bed
And the gold-standard summary (usually created by a human):
the cat was under the bed
Let's find precision and recall for the unigram (single-word) case, using words as the unit of comparison.
The machine-generated summary has 7 words (mlsw = 7), the gold-standard summary has 6 words (gssw = 6), and the number of overlapping words is 6 (ow = 6).
The recall for the machine-generated summary is: ow/gssw = 6/6 = 1
The precision for the machine-generated summary is: ow/mlsw = 6/7 = 0.86
Similarly we can compute precision and recall scores on grouped unigrams, bigrams, n-grams...
ROUGE, as we know, uses both recall and precision, and also the F1 score, which is their harmonic mean.
BLEU also uses precision, twinned with a brevity penalty (a crude stand-in for recall), and combines the n-gram precisions with a geometric mean.
Subtle differences, but it is important to note they both use precision and recall.
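A minimal sketch of this unigram computation (the variable names are mine, not from either metric's reference implementation):

from collections import Counter

candidate = "the cat was found under the bed".split()
reference = "the cat was under the bed".split()

# Clipped overlap: each reference word counts at most as often as it
# appears in the candidate.
cand_counts = Counter(candidate)
ref_counts = Counter(reference)
overlap = sum(min(cand_counts[w], ref_counts[w]) for w in ref_counts)

recall = overlap / len(reference)     # 6 / 6 = 1.0
precision = overlap / len(candidate)  # 6 / 7 = 0.86
f1 = 2 * precision * recall / (precision + recall)
print(recall, precision, f1)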
I am studying lexical semantics. I have 65 pairs of synonyms with their sense relatedness. The dataset is derived from the paper:
Rubenstein, Herbert, and John B. Goodenough. "Contextual correlates of synonymy." Communications of the ACM 8.10 (1965): 627-633.
I extract sentences containing those synonyms, convert the neighbouring words appearing in those sentences into vectors, calculate the cosine distance between the different vectors, and finally compute the Pearson correlation between the distances I calculate and the sense relatedness given by Rubenstein and Goodenough.
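For concreteness, my pipeline boils down to roughly this sketch (the vectors and scores below are placeholders, not my real data; numpy and scipy are assumed):

import numpy as np
from scipy.spatial.distance import cosine
from scipy.stats import pearsonr

# Placeholder context vectors for the two words of each of the 65 pairs,
# and placeholder human relatedness judgments.
vectors_a = np.random.rand(65, 100)
vectors_b = np.random.rand(65, 100)
human_scores = np.random.uniform(0, 4, size=65)

# Cosine similarity between the two members of each pair.
similarities = [1 - cosine(a, b) for a, b in zip(vectors_a, vectors_b)]

r, p = pearsonr(similarities, human_scores)
print(r, p)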
Suppose I get a Pearson correlation of 0.79 for Method 1 and 0.78 for Method 2. How do I test whether Method 1 is significantly better than Method 2?
Well, strictly speaking this is not a programming question, but since it is unanswered on other Stack Exchange sites, I'll describe the approach I would take.
There are other benchmarks for similar tasks against which you can check your approaches. You can evaluate how your methods perform on those benchmarks and analyze the results. Some methods capture similarity better, others relatedness, and some both.
Here is the link: WordVec Demo, which automatically scores your vectors and provides you with the results.
I'm trying to identify important terms in a set of government documents. Generating the term frequencies is no problem.
For document frequency, I was hoping to use the handy Python scripts and accompanying data that Peter Norvig posted for his chapter in "Beautiful Data", which include the frequencies of unigrams in a huge corpus of data from the Web.
My understanding of tf-idf, however, is that "document frequency" refers to the number of documents containing a term, not the total number of word tokens that are this term, which is what we get from the Norvig script. Can I still use this data for a crude tf-idf operation?
Here's some sample data:
word tf global frequency
china 1684 0.000121447
the 352385 0.022573582
economy 6602 0.0000451130774123
and 160794 0.012681757
iran 2779 0.0000231482902018
romney 1159 0.000000678497795593
Simply dividing tf by gf gives "the" a higher score than "economy," which can't be right. Is there some basic math I'm missing, perhaps?
As I understand it, your global frequency corresponds to the "inverse total term frequency" mentioned in Robertson's paper. From that paper:
One possible way to get away from this problem would be to make a fairly radical replacement for IDF (that is, radical in principle, although it may be not so radical in terms of its practical effects). [...] the probability from the event space of documents to the event space of term positions in the concatenated text of all the documents in the collection. Then we have a new measure, called here inverse total term frequency: [...] On the whole, experiments with inverse total term frequency weights have tended to show that they are not as effective as IDF weights.
According to this text, you can use the inverse global frequency as the IDF term, albeit a cruder one than the standard IDF.
Also, you are missing stop-word removal. Words such as "the" are used in almost all documents and therefore do not give any information. Before computing tf-idf, you should remove such stop words.
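A rough sketch of this on your sample numbers, dropping stop words first (the variable names are mine):

import math

rows = [
    ("china", 1684, 0.000121447),
    ("the", 352385, 0.022573582),
    ("economy", 6602, 0.0000451130774123),
    ("and", 160794, 0.012681757),
    ("iran", 2779, 0.0000231482902018),
    ("romney", 1159, 0.000000678497795593),
]
stop_words = {"the", "and"}

for word, tf, gf in rows:
    if word in stop_words:
        continue
    # Crude tf-idf: term frequency times the log of the inverse global frequency.
    score = tf * math.log(1.0 / gf)
    print(word, round(score, 1))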
I'm using a lexicon-based approach to text analysis. Basically I have a long list of words marked with whether they are positive/negative/angry/sad/happy etc. I match the words in the text I want to analyze to the words in the lexicon in order to help me determine if my text is positive/negative/angry/sad/happy etc.
But the length of the texts I want to analyze vary. Most of them are under 100 words, but consider the following example:
John is happy. (1 word in the category 'happy' giving a score of 33% for happy)
John told Mary yesterday that he was happy. (12.5% happy)
So comparing across different sentences, it seems that my first sentence is more 'happy' than my second sentence, simply because the sentence is shorter, and gives a disproportionate % to the word 'happy'.
Is there an algorithm or way of calculation you can think of that would allow me to make a fairer comparison, perhaps by taking into account the length of the sentence?
As many have pointed out, you have to go down to the syntactic tree, doing something similar to this work.
Also, consider this:
John told Mary yesterday that he was happy.
John told Mary yesterday that she was happy.
The second one says nothing about John's happiness, but a naive algorithm would quickly be confused. So in addition to syntactic parsing, pronouns have to be linked to their subjects. In particular, this means the algorithm should know that "he" is John and "she" is Mary.
Ignoring the issue of negation raised by HappyTimeGopher, you can simply divide the number of happy words in the sentence by the length of the sentence. You get:
John is happy. (1 word in the category 'happy' / 3 words in sentence = score of 33% for happy)
John told Mary yesterday that he was happy. (1/8 = 12.5% happy)
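A minimal sketch of this normalisation (the word list and tokenisation here are toy placeholders):

happy_words = {"happy", "glad", "joyful"}  # toy lexicon for the 'happy' category

def happy_score(sentence):
    # Proportion of tokens that fall in the 'happy' category.
    tokens = sentence.lower().rstrip(".").split()
    hits = sum(1 for t in tokens if t in happy_words)
    return hits / len(tokens)

print(happy_score("John is happy."))                               # 1/3 = 0.33
print(happy_score("John told Mary yesterday that he was happy."))  # 1/8 = 0.125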
Keep in mind that word-list-based approaches will only go so far. What should the score be for "I was happy with the food, but the waiter was horrible"? Consider using a more sophisticated system; the papers below are a good place to start your research:
Choi, Y., & Cardie, C. (2008). Learning with compositional semantics as structural inference for subsentential sentiment analysis.
Moilanen, K., & Pulman, S. (2009). Multi-entity sentiment scoring.
Pang, B., & Lee, L. (2008). Opinion Mining and Sentiment Analysis.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up?: sentiment classification using machine learning techniques.
Turney, P. D., & Littman, M. L. (2003). Measuring praise and criticism: Inference of semantic orientation from association.