How to measure how distinct a document is based on predefined linguistic categories? - nlp

I have 3 categories of words that correspond to different types of psychological drives (need-for-power, need-for-achievement, and need-for-affiliation). Currently, for every document in my sample (n=100,000), I am using a tool to count the number of words in each category, and calculating a proportion score for each category by converting the raw word counts into a percentage based on total words used in the text.
                 n-power  n-achieve  n-affiliation
Document1        0.010    0.025      0.100
Document2        0.045    0.010      0.050
:                :        :          :
:                :        :          :
Document100000   0.100    0.020      0.010
For each document, I would like to get a measure of distinctiveness that indicates the degree to which the content of a document on the three psychological categories differs from the average content of all documents (i.e., the prototypical document in my sample). Is there a way to do this?

Essentially, what you have is a clustering problem. You currently represent each document with 3 numbers; let's call that a vector (in effect, you have built simple embeddings). To do what you want, you can:
1) Calculate an average vector for the whole set: add up the numbers in each column and divide by the number of documents.
2) Pick a metric you like that reflects how closely each document vector aligns with the average. You can just use Euclidean distance
sklearn.metrics.pairwise.euclidean_distances
or cosine
sklearn.metrics.pairwise.cosine_distances
Here X is your list of document vectors and Y is a list containing only the average vector. This is a good place to start.
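The two steps above can be sketched without any dependencies. This is only an illustration: the three rows are the values visible in the question's table, the helper names (`mean_vector`, `cosine_distance`) are made up here, and the scikit-learn functions named above would be the practical choice for 100,000 documents.

```python
import math

# Proportion scores (n-power, n-achieve, n-affiliation) for the three
# documents whose values are shown in the question's table.
docs = {
    "Document1": (0.010, 0.025, 0.100),
    "Document2": (0.045, 0.010, 0.050),
    "Document100000": (0.100, 0.020, 0.010),
}

def mean_vector(vectors):
    """Column-wise average: the 'prototypical' document."""
    vectors = list(vectors)
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))

def cosine_distance(u, v):
    """1 - cosine similarity; compares direction and ignores magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Step 1: the average vector.
centroid = mean_vector(docs.values())

# Step 2: distance of every document from the average, under both metrics.
distinctiveness = {
    name: {
        "euclidean": math.dist(vec, centroid),
        "cosine": cosine_distance(vec, centroid),
    }
    for name, vec in docs.items()
}

for name, scores in distinctiveness.items():
    print(name, scores)
```

A larger distance means a more distinctive document. Euclidean distance reacts to how much motive language a document uses overall, while cosine distance only reacts to the mix between the three categories, so the choice depends on which of those you mean by "distinct".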
If I were doing this, I would skip the average-vector approach, since you are in fact dealing with a clustering problem, and use KMeans instead.
See the guide for more details.
Hope this helps!

Related

Calculating confidence interval and sample size for data conversions

Say we are converting data from one bookstore application to another. How would one go about calculating the sample size of books to review after the data conversion to be sure 90+-5% of all books converted correctly?
Say our existing book list contains 30 books. How many books would we have to review in the new application after the data conversion to be 85-95% certain that all books converted correctly?
Ok, let's assume the variable X=Proportion of books converted correctly, distributed normally, with values between 0 and 1
Sample size = this is what we want to determine
Population size = 30
"Existing book list contains 30 books"
Estimated value = 0.90
That is, the value of X that you believe to be the true one.
"90+-5% of all books converted correctly"
If you have no idea what the actual value is, use 0.5 instead.
Error margin = 0.05
The largest difference you will tolerate between the real value and the estimated value. As you stated above, this would be +-5%.
Confidence level = 0.95
This is NOT the same as the error margin. You are making a prediction; how sure do you want to be of it? That is the confidence level. You gave two values above:
"to be 85-95% certain that all books converted correctly"
So we're going with 95%, just to be sure.
The recommended sample size is 25
You can use this calculator to arrive at the same result:
https://select-statistics.co.uk/calculators/sample-size-calculator-population-proportion/
And it also has a magnificent explanation of all the input values above.
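For reference, here is a minimal sketch of where the 25 comes from. It assumes the usual normal-approximation formula for a proportion with a finite population correction, which appears to be what the linked calculator uses; the function name `sample_size` is made up for this example.

```python
import math
from statistics import NormalDist

def sample_size(population, p_estimate, margin, confidence):
    """Sample size for estimating a proportion in a finite population."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Size that would be needed for a very large population.
    n0 = z ** 2 * p_estimate * (1 - p_estimate) / margin ** 2
    # Finite population correction for a population of the given size.
    n = population * n0 / (n0 + population - 1)
    return math.ceil(n)

# Inputs from the answer: 30 books, estimated value 0.90, +-5%, 95% confidence.
print(sample_size(population=30, p_estimate=0.90, margin=0.05, confidence=0.95))
```

With a population this small, the correction does most of the work: the uncorrected size is well over 100, which is why the result ends up close to reviewing every book.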
Hope it works for you. Cheers!

numerical entity extraction from unstructured texts using python

I want to extract numerical entities such as temperature and duration from unstructured text using neural models like CRF in Python. I would like to know how to proceed for numerical extraction, as most of the examples available on the internet are for extracting specific words or strings.
Input: 'For 5 minutes there, I felt like baking in an oven at 350 degrees F'
Output: temperature: 350
        duration: 5 minutes
So far, my research shows that you can treat numbers as words.
This raises an issue: learning 5 will be fine, but 19684 will be too rare to be learned.
One proposal is to convert numbers into words ("nineteen thousand six hundred eighty four") and embed each word. The drawback is that a single number now expands into several tokens (six in this example), one per word.
Depending on your use case, you can also give 0 to 3000 distinct ids, map everything from 3001 to 10000 to id 3001 in your dictionary, and then add one more id to your dictionary for each further factor of 10.
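The bucketing idea in the last paragraph can be sketched as a small lookup function. The thresholds 3000 and 10000 come from the text above; the function name, the constant names, and the handling of negative input are assumptions made for this example.

```python
MAX_EXACT = 3000          # 0..3000 each keep their own id
FIRST_BUCKET_TOP = 10000  # 3001..10000 all share id 3001

def number_to_id(n):
    """Map a non-negative integer to a vocabulary id."""
    if n < 0:
        raise ValueError("this sketch only covers non-negative integers")
    if n <= MAX_EXACT:
        return n
    if n <= FIRST_BUCKET_TOP:
        return MAX_EXACT + 1
    # One extra id per further factor of 10: 10001..100000 -> 3002, and so on.
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    bucket = 1
    top = FIRST_BUCKET_TOP * 10
    while n > top:
        top *= 10
        bucket += 1
    return MAX_EXACT + 1 + bucket

for n in (5, 350, 3001, 19684, 1_000_000):
    print(n, "->", number_to_id(n))
```

Common values such as 5 or 350 keep their own embedding, while rare large values like 19684 share one with every other number of similar magnitude.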

Fleiss-kappa score for interannotator agreement

In my dataset I have a set of categories, where for every category I have a set of 150 examples. Each example has been annotated as true/false by 5 human raters. I am computing the inter-annotator agreement using the Fleiss-kappa score:
1) for the entire dataset
2) for each category in particular
However, the results I obtained show that the Fleiss-kappa score for the entire dataset does not equal the average of the Fleiss-kappa score for each category. In my computation I am using a standard built-in package to compute the scores. Could this be due to a bug in my matrix computation, or are the scores not supposed to be equal? Thanks!

How to generate random numbers within a normal distribution using Excel

I want to use the RAND() function in Excel to generate a random number between 0 and 1.
However, I would like 80% of the values to fall between 0 and 0.2, 90% of the values to fall between 0 and 0.3, 95% of the values to fall between 0 and 0.5, etc.
This reminds me that I took an applied statistics course once upon a time, but not of what was actually in the course...
What is the best way to achieve this result using an Excel formula? Alternatively, what is this kind of statistical calculation called, or are there any other pointers that I can Google for?
=================
Use case:
I have a single column of meter readings, which I would like to duplicate 7 times (one column per month). Each column has 55,000 rows. While the meter readings need to vary from month to month, each meter number should have 7 realistic readings when taken as a time series.
The aim is to produce realistic data to turn into heat maps (i.e. flag outlying meter readings)
I don't think that there is a formula which would fit exactly to your requirements. I would use a very straightforward solution:
Generate 80% of data using =RANDBETWEEN(0,20)/100
Generate 10% of data using =RANDBETWEEN(20,30)/100
Generate 5% of data using =RANDBETWEEN(30,50)/100
and so on
You can easily change the precision of the generated data by modifying the parameters. For example, =RANDBETWEEN(0,2000)/10000 will generate data with up to 4 digits after the decimal point.
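To check that this band-by-band approach gives the requested shares, here is a rough simulation of the same logic outside Excel. The first three bands come from the formulas above; the last band (0.5 to 1.0 for the remaining 5%) is an assumption standing in for "and so on", and values are drawn continuously within a band rather than in steps of 0.01.

```python
import random

# (low, high, share of values). The final row is an assumption that fills
# in the remaining 5% of the data.
BANDS = [
    (0.0, 0.2, 0.80),
    (0.2, 0.3, 0.10),
    (0.3, 0.5, 0.05),
    (0.5, 1.0, 0.05),
]

def sample_value(rng):
    """Pick a band according to its share, then a uniform value inside it."""
    weights = [share for _, _, share in BANDS]
    low, high, _ = rng.choices(BANDS, weights=weights)[0]
    return rng.uniform(low, high)

rng = random.Random(42)  # fixed seed so the run is repeatable
values = [sample_value(rng) for _ in range(100_000)]

# Cumulative shares, to compare with the 80% / 90% / 95% targets.
for limit in (0.2, 0.3, 0.5):
    share = sum(v <= limit for v in values) / len(values)
    print(f"share of values <= {limit}: {share:.3f}")
```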
UPDATE
Use a normal distribution for the use case, for example:
=NORMINV(RAND(), 20, 5)
where 20 is the mean and 5 is the standard deviation.
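What the formula does is inverse-CDF sampling: RAND() supplies a uniform probability and NORMINV converts it into the matching point of the normal distribution. A minimal Python analogue, shown only to make the mechanism visible (the function name `norminv_rand` is made up here):

```python
import random
from statistics import NormalDist, mean, stdev

def norminv_rand(rng, mu, sigma):
    """Analogue of =NORMINV(RAND(), mu, sigma) using inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # the inverse CDF is undefined at exactly 0, so redraw
        u = rng.random()
    return NormalDist(mu, sigma).inv_cdf(u)

rng = random.Random(0)  # fixed seed so the run is repeatable
readings = [norminv_rand(rng, 20, 5) for _ in range(55_000)]

# The sample mean and standard deviation should land close to 20 and 5.
print(round(mean(readings), 2), round(stdev(readings), 2))
```

Nothing here stops a reading from going negative, so for meter data you may want to clip or redraw such values.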

Compute statistical significance with Excel

I have 2 columns and multiple rows of data in Excel. Each column represents an algorithm, and the values in the rows are the results of these algorithms with different parameters. I want to run a statistical significance test on these two algorithms in Excel. Can anyone suggest a function?
As a result, it will be nice to state something like "Algorithm A performs 8% better than Algorithm B with .9 probability (or 95% confidence interval)"
The wikipedia article explains accurately what I need:
http://en.wikipedia.org/wiki/Statistical_significance
It seems like a very easy task, but I failed to find a scientific measurement function.
Any advice on a built-in Excel function, or any function snippets, would be appreciated.
Thanks..
Edit:
After tharkun's comments, I realized I should clarify some points:
The results are merely real numbers between 1 and 100 (they are percentage values). As each row represents a different parameter, the values in a row represent each algorithm's result for that parameter. The results do not depend on each other.
When I take the average of all values for Algorithm A and Algorithm B, I see that the mean of all results that Algorithm A produced is 10% higher than Algorithm B's. But I don't know whether this is statistically significant. In other words, maybe for one parameter Algorithm A scored 100 percent higher than Algorithm B, and for the rest Algorithm B has higher scores, but just because of this one result the difference in the averages is 10%.
And I want to do this calculation using just Excel.
Thanks for the clarification. In that case you want to do an independent-samples t-test, meaning you want to compare the means of two independent data sets.
Excel has a TTEST function (named T.TEST in Excel 2010 and later), which is what you need.
For your example you should probably use two tails and type 2.
The formula outputs a probability (the p-value), also known as the probability of an alpha error. This is the error you would make if you concluded that the two datasets are different when they actually are not. The lower this probability, the stronger the evidence that your sets are different.
You should only accept the difference between the two datasets if the value is lower than 0.01 (1%), or for critical outcomes even 0.001 or lower. You should also know that the t-test needs roughly 30 values or more per dataset to be reasonably reliable, and that the type 2 test assumes the two datasets have equal variances. If the variances are not equal, you should use the type 3 test.
http://depts.alverno.edu/nsmt/stats.htm
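To show what the difference between the type 2 and type 3 tests amounts to, here is a small sketch of the two test statistics. It is not Excel code and stops short of the p-value (the standard library has no t-distribution; `scipy.stats.ttest_ind` with `equal_var=True` or `equal_var=False` would give it). The function names and the scores are made up for illustration and are not data from the question.

```python
import math
from statistics import mean, variance

def pooled_t(a, b):
    """Equal-variance two-sample t statistic and degrees of freedom (type 2)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df

def welch_t(a, b):
    """Unequal-variance (Welch) t statistic and degrees of freedom (type 3)."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical percentage scores, one value per parameter setting.
algo_a = [72.0, 85.0, 78.0, 90.0, 66.0, 81.0]
algo_b = [65.0, 80.0, 70.0, 79.0, 60.0, 74.0]

print("type 2 (pooled):", pooled_t(algo_a, algo_b))
print("type 3 (Welch): ", welch_t(algo_a, algo_b))
```

When both columns have the same number of rows, the two statistics coincide and only the degrees of freedom differ, so the choice between type 2 and type 3 matters most when the sample sizes or the variances are clearly unequal.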
