Fleiss' kappa score for inter-annotator agreement

In my dataset I have a set of categories, and for every category I have a set of 150 examples. Each example has been annotated as true/false by 5 human raters. I am computing the inter-annotator agreement using Fleiss' kappa:
1) for the entire dataset
2) for each category individually
However, the results I obtained show that the Fleiss' kappa score for the entire dataset does not equal the average of the per-category Fleiss' kappa scores. In my computation I am using a standard built-in package to compute the scores. Could this be due to a bug in my matrix computation, or are the scores simply not supposed to be equal? Thanks!
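For reference, here is a minimal sketch of the two computations, assuming the statsmodels implementation of Fleiss' kappa; the 150-example, 5-rater, true/false layout is taken from the question, but the category names and the random stand-in labels are purely illustrative:

import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# ratings[cat] is a (150, 5) array of 0/1 labels: 150 examples, 5 raters each.
# Random labels stand in for the real annotations here.
rng = np.random.default_rng(0)
ratings = {cat: rng.integers(0, 2, size=(150, 5)) for cat in ["cat_A", "cat_B", "cat_C"]}

def to_count_table(labels):
    # Convert (n_examples, n_raters) 0/1 labels into the (n_examples, 2) table of
    # per-example counts (raters saying false, raters saying true) that fleiss_kappa expects.
    n_true = labels.sum(axis=1)
    return np.column_stack([labels.shape[1] - n_true, n_true])

# 2) kappa for each category individually
per_category = {cat: fleiss_kappa(to_count_table(labels)) for cat, labels in ratings.items()}

# 1) kappa for the entire dataset (all categories stacked into one table)
overall = fleiss_kappa(to_count_table(np.vstack(list(ratings.values()))))

print(per_category)
print(overall)  # in general this will not equal the mean of the per-category values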

Related

BLEU score value higher than 1

I've been looking at how the BLEU score works. What I understood from online videos and the original research paper is that the BLEU score should be within the range 0-1.
Then, when I started to look at some research papers, I found that the reported BLEU value is (almost) always higher than 1!
For instance, have a look here:
https://www.aclweb.org/anthology/W19-8624.pdf
https://arxiv.org/pdf/2005.01107v1.pdf
Am I missing something?
Another small point: what do the headers in the table below mean? Was the BLEU score calculated using unigrams, then unigrams & bigrams (averaged), and so on, or was each n-gram size calculated independently?
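For what it's worth, a small sketch with NLTK's implementation (the sentences are made up) shows that the value it returns lies in [0, 1]; papers and toolkits often report that value multiplied by 100, and the weights argument controls whether n-gram orders are averaged cumulatively or scored individually:

from nltk.translate.bleu_score import sentence_bleu

reference = [["the", "cat", "sat", "on", "the", "mat"]]
hypothesis = ["the", "cat", "sat", "on", "the", "mat", "today"]

# Default weights (0.25, 0.25, 0.25, 0.25) average the 1- to 4-gram precisions geometrically;
# weights=(1, 0, 0, 0) would score unigrams only, (0, 1, 0, 0) bigrams only, and so on.
score = sentence_bleu(reference, hypothesis)
print(score)        # a value in [0, 1]
print(score * 100)  # the 0-100 scaling that many papers report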

How to measure how distinct a document is based on predefined linguistic categories?

I have 3 categories of words that correspond to different types of psychological drives (need-for-power, need-for-achievement, and need-for-affiliation). Currently, for every document in my sample (n=100,000), I am using a tool to count the number of words in each category, and calculating a proportion score for each category by converting the raw word counts into a percentage based on total words used in the text.
                n-power   n-achieve   n-affiliation
Document1         0.010       0.025           0.100
Document2         0.045       0.010           0.050
...
Document100000    0.100       0.020           0.010
For each document, I would like to get a measure of distinctiveness that indicates the degree to which the content of a document on the three psychological categories differs from the average content of all documents (i.e., the prototypical document in my sample). Is there a way to do this?
Essentially what you have is a clustering problem. You have made a representation of each of your documents with 3 numbers; let's call it a vector (essentially you have cooked up some embeddings). To do what you want, you can:
1) Calculate an average vector for the whole set. Basically add up all numbers in each column and divide by the number of documents.
2) Pick a metric you like that reflects how well each document vector aligns with the average. You can just use Euclidean distance
sklearn.metrics.pairwise.euclidean_distances
or cosine distance
sklearn.metrics.pairwise.cosine_distances
where X will be your list of document vectors and Y will be a list containing the single average vector. This is a good place to start.
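For instance, a minimal sketch of those two steps with scikit-learn, using a tiny toy matrix in place of the 100,000 × 3 matrix above:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, cosine_distances

# One row per document; columns are the n-power / n-achieve / n-affiliation proportions.
X = np.array([[0.010, 0.025, 0.100],
              [0.045, 0.010, 0.050],
              [0.100, 0.020, 0.010]])

# 1) average vector for the whole set
mean_vector = X.mean(axis=0, keepdims=True)  # shape (1, 3), used as Y below

# 2) distance of each document from the prototypical (average) document
distinct_euclidean = euclidean_distances(X, mean_vector).ravel()
distinct_cosine = cosine_distances(X, mean_vector).ravel()
print(distinct_euclidean)
print(distinct_cosine)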
If I were doing it myself, I would skip the average-vector approach, since you are in fact dealing with a clustering problem, and use KMeans instead; see the scikit-learn user guide for more.
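For example, a rough KMeans sketch on the same toy matrix; the number of clusters is purely illustrative and would need tuning (e.g. with an elbow plot or a silhouette score):

import numpy as np
from sklearn.cluster import KMeans

# Same toy (n_documents, 3) proportion matrix as in the sketch above.
X = np.array([[0.010, 0.025, 0.100],
              [0.045, 0.010, 0.050],
              [0.100, 0.020, 0.010]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each document
print(kmeans.cluster_centers_)  # the prototypical profile of each cluster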
Hope this helps!

How to choose the right value of k in K Nearest Neighbor

I have a dataset with 9448 data points (rows).
Whenever I choose a value of K between 1 and 10, the accuracy comes out to be 100 percent (which is of course an ideal case!) and seems weird.
If I choose my K value to be 100 or above, the accuracy decreases gradually (from 95% to 90%).
How does one choose the value of K? We want decent accuracy, not a hypothetical 100 percent.
Well, a simple approach to selecting k is sqrt(number of data points); in this case sqrt(9448) = 97.2 ≈ 97. Please keep in mind that it is inappropriate to say which k value suits best without looking at the data. If training samples of the same class form clusters, then a k value from 1 to 10 will achieve good accuracy. If the data is randomly distributed, then one cannot say which k value will give the best results; in such cases, you need to find it by performing an empirical analysis.
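If you want to do that empirical analysis systematically, one common approach is a cross-validated grid search over k; here is a sketch with scikit-learn, using synthetic data in place of the 9448-row dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the 9448-row dataset from the question.
X, y = make_classification(n_samples=9448, n_features=10, random_state=0)

# Compare cross-validated accuracy across a range of k values instead of
# relying on a single (possibly over-optimistic) accuracy number.
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": list(range(1, 102, 10))},
                      scoring="accuracy",
                      cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)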

Is there a ranking metric based on percentages that favors larger magnitudes?

I have two groups, "in" and "out," and item categories that can be split up among the groups. For example, I can have item category A that is 99% "in" and 1% "out," and item B that is 98% "in" and 2% "out."
For each of these items, I actually have the counts that are in/out. For example, A could have 99 items in and 1 item out, and B could have 196 items that are in and 4 that are out.
I would like to rank these items based on the percentage that are "in," but I would also like to give some priority to items that have larger overall populations. This is because I would like to focus on items that are very relevant to the "in" group, but still have a large number of items in the "out" group that I could pursue.
Is there some kind of score that could do this?
I'd be tempted to use a probabilistic rank: the probability that an item category is from the "in" group, given the actual counts for that category. This requires making some assumptions about the data set, including why a category may have any out-of-group items at all. You might take a look at the binomial test or the Mann-Whitney U test for a start. You might also look at other kinds of nonparametric statistics.
I ultimately ended up using Bayesian averaging, which was recommended in this post. The technique is briefly described in this Wikipedia article and more thoroughly described in this post by Evan Miller and this post by Paul Masurel.
In Bayesian averaging, "prior" values are used to pull the numerator and denominator toward their expected values. Essentially, the expected numerator and expected denominator are added to the actual numerator and denominator. When the actual numerator and denominator are small, the prior values have a larger impact because they represent a larger proportion of the new numerator/denominator. As the numerator and denominator grow in magnitude, the Bayesian average approaches the actual average, reflecting the increased confidence.
In my case, the prior value for the average was fairly low, which biased the averages with small denominators downward.
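As a concrete illustration of that smoothing (the prior mean and prior weight below are made-up values, not the ones actually used):

def bayesian_average(in_count, total_count, prior_mean=0.3, prior_weight=10):
    # Pseudo-counts derived from the prior are added to the actual numerator and
    # denominator, so small samples are pulled toward the prior mean while large
    # samples stay close to their raw proportion.
    return (prior_weight * prior_mean + in_count) / (prior_weight + total_count)

# Counts from the question: A has 99 in / 100 total, B has 196 in / 200 total.
print(bayesian_average(99, 100))   # ~0.93
print(bayesian_average(196, 200))  # ~0.95 -> B ranks higher despite the lower raw percentage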

Compute statistical significance with Excel

I have 2 columns and multiple rows of data in Excel. Each column represents an algorithm, and the values in the rows are the results of these algorithms with different parameters. I want to run a statistical significance test on these two algorithms in Excel. Can anyone suggest a function?
As a result, it would be nice to be able to state something like "Algorithm A performs 8% better than Algorithm B with 0.9 probability (or at a 95% confidence level)".
The Wikipedia article explains exactly what I need:
http://en.wikipedia.org/wiki/Statistical_significance
It seems like a very easy task, but I failed to find a suitable built-in function.
Any advice on a built-in Excel function or on function snippets is appreciated.
Thanks.
Edit:
After tharkun's comments, I realized I should clarify some points:
The results are simply real numbers between 1 and 100 (they are percentage values). As each row represents a different parameter, the values in a row represent each algorithm's result for that parameter. The results do not depend on each other.
When I take the average of all values for Algorithm A and Algorithm B, I see that the mean of the results Algorithm A produced is 10% higher than Algorithm B's. But I don't know whether this is statistically significant or not. In other words, maybe Algorithm A scored 100 percent higher than Algorithm B for one parameter while Algorithm B has higher scores for the rest, and the 10% difference in the averages is due to that one result alone.
And I want to do this calculation using just Excel.
Thanks for the clarification. In that case you want to do an independent-samples t-test, meaning you want to compare the means of two independent data sets.
Excel has a function, TTEST; that's what you need.
For your example you should probably use two tails and type 2.
The formula outputs a probability value known as the probability of alpha error. This is the error you would make if you assumed the two data sets are different when in fact they are not. The lower the alpha error probability, the higher the chance that your sets are different.
You should only accept that the two data sets differ if this value is lower than 0.01 (1%), or for critical outcomes even 0.001 or lower. You should also know that the t-test needs at least around 30 values per data set to be reliable enough, and that the type 2 test assumes equal variances of the two data sets. If equal variances cannot be assumed, you should use the type 3 test.
http://depts.alverno.edu/nsmt/stats.htm
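If you want to sanity-check the Excel result (e.g. =TTEST(A1:A30, B1:B30, 2, 2) for two columns A and B with two tails and type 2), the same test can be reproduced with scipy; the numbers below are made-up stand-ins for the two columns:

import numpy as np
from scipy import stats

# Made-up percentage scores standing in for the two Excel columns
# (one value per parameter setting for each algorithm).
algorithm_a = np.array([72.0, 65.5, 80.2, 68.9, 75.0, 70.3])
algorithm_b = np.array([61.0, 66.2, 70.5, 59.8, 68.0, 64.1])

# Two-tailed, equal-variance ("type 2") independent-samples t-test.
t_stat, p_value = stats.ttest_ind(algorithm_a, algorithm_b, equal_var=True)
print(p_value)  # compare against your chosen threshold, e.g. 0.01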
