Descriptive statistics in Stata - Word frequencies - statistics

I have a large data set containing as variables fileid, year and about 1000 words (each word is a separate variable). Each observation comes from a company report, indicating the year, a unique fileid and the absolute frequency of each word in that report. Now I want some descriptive statistics: the number of words not used at all, the mean of each word, the variance of each word, and the top percentile of words. How can I program that in Stata?

Caveat: You are probably better off using a text processing package in R or another program. But since no one else has answered, I'll give it a Stata-only shot. There may be an ado file already built that is much better suited, but I'm not aware of one.
I'm assuming that "each word is a separate variable" means that there is a variable word_profit that takes a value k from 0 to K, where word_profit[i] is the number of times "profit" is written in the i-th report, fileid[i].
Mean of words
collapse (mean) word_* will give you the average number of times the words are used. Adding a by(year) option will give you those means by year. To make this more manageable than a very wide one-observation dataset, you'll want to run the following after the collapse:
gen temp = 1                                  // constant id required by reshape
reshape long word_, i(temp) j(word) string    // one row per word
rename word_ count                            // word_ now holds the collapsed means
drop temp
Variance of words
collapse (sd) word_* will give you the standard deviations. To get the variances, just square the standard deviations.
Number of words not used at all
Without a bit more clarity, I don't have a good idea of what you want here. You could count zeros for each word with:
foreach var of varlist word_* {
    gen zero_`var' = (`var' == 0)    // 1 if the word never appears in that report
}
collapse (sum) zero_*                // per word: number of reports in which it is never used

Related

Quanteda: Removing documents with low occurrence of word x

When reading about methods of textual analysis, I have seen some eliminate documents with the "10% lowest density score", that is, documents that are relatively long compared to the number of occurrences of a certain keyword. How can I achieve a similar result in quanteda?
I've created a corpus using a query of the words "refugee" and "asylum seeker". Now I would like to remove all documents where the count frequency of refugee|asylum_seeker is below 3. However, I imagine it is also possible to use the relative frequency if document length is to be taken into account.
Could someone help me? The solution in my head looks like this; however, I don't know how to implement it.
For count frequency: add the counts of occurrences of refugee|asylum_seeker per document and remove documents with a combined count below 3.
For relative frequency: inspect the overall average relative frequency of both words refugee and asylum_seeker, then calculate the per-document relative frequencies of the features and apply a function to remove all documents with a relative frequency of both features below X.
Create a dfm from your tokenised corpus, using dfmat <- dfm(your_tokens).
Then drop the documents that fall below the threshold with dfm_subset(), which keeps the documents for which the condition is TRUE:
dfmat <- dfm_subset(dfmat,
                    rowSums(dfmat[, c("refugee", "asylum_seeker")]) >= 3)
This keeps only the documents in which the two features together occur at least 3 times.

counting most commonly used words across thousands of text files using Octave

I've accumulated a list of over 10,000 text files in Octave. I've got a function which cleans up the contents of each file, normalizing things in various ways (lowercasing, reducing repeated whitespace, etc.). I'd like to distill from all these files a list of words that appear in at least 100 files. I'm not at all familiar with Octave structs, cell arrays, or the Octave sorting functions, and was hoping someone could help me understand how to:
initialize an appropriate data structure (word_appearances) to count how many emails contain a particular word
loop through the unique words that appear in an email string and, for each of those words, increment the count I'm tracking in word_appearances -- ideally ignoring words shorter than two characters and a short list of stop_words.
reduce word_appearances to only contain words that appear some minimum number of times, e.g., min_appearances = 100 times.
sort the words in word_appearances alphabetically and either export this as a .MAT file or as a CSV file like so:
1 aardvark
2 albatross
etc.
I currently have this code to loop through my files one by one:
for i = 1:total_files
  filename = char(files{i}(1));
  printf("processing %s\n", filename);
  file_contents = jPreProcessFile(readFile(filename));
endfor
Note that the file_contents that comes back is pretty clean -- usually just a bunch of words, some repeated, separated by single spaces like so:
email market if done right is both highli effect and amazingli cost effect ok it can be downright profit if done right imagin send your sale letter to on million two million ten million or more prospect dure the night then wake up to sale order ring phone and an inbox full of freshli qualifi lead we ve been in the email busi for over seven year now our list ar larg current and deliver we have more server and bandwidth than we current need that s why we re send you thi note we d like to help you make your email market program more robust and profit pleas give us permiss to call you with a propos custom tailor to your busi just fill out thi form to get start name email address phone number url or web address i think you ll be delight with the bottom line result we ll deliv to your compani click here emailaddr thank you remov request click here emailaddr licatdcknjhyynfrwgwyyeuwbnrqcbh
Obviously, I need to create the word_appearances data structure such that each element in it specifies a word and how many files have contained that word so far. My primary point of confusion is what sort of data structure word_appearances should be, how I would search this data structure to see if some new word is already in it, and if found, increment its count, otherwise add a new element to word_appearances with count=1.
Octave has containers.Map to hold key-value pairs. This is the simple usage:
% initialize map
m = containers.Map('KeyType', 'char', 'ValueType', 'int32');
% count a word: insert it the first time it is seen, otherwise increment its value
word = 'hello';
if !m.isKey(word)
  m(word) = 1;
else
  m(word) += 1;
endif
This is one way to extract the most frequent words from a map like the one above:
% keys and values are returned in corresponding order
words = m.keys;
counts = cell2mat(m.values);
[sorted_counts, indices] = sort(counts);
top10_indices = indices(end:-1:end-9);   % positions of the 10 largest counts
top10_words = words(top10_indices);
I must warn you though, Octave may be pretty slow at this task, considering that you have thousands of files. Use it only if running time isn't that important for you.
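If Octave does turn out to be too slow, a rough sketch of the same document-frequency pipeline in Python might look like the following (min_appearances, the stop-word filtering and the output format come from the question; the file list and the cleanup step are placeholders for your readFile/jPreProcessFile logic):
import csv
from collections import Counter

min_appearances = 100
stop_words = {"the", "and", "a"}           # placeholder stop-word list

word_appearances = Counter()                # word -> number of files containing it
for filename in files:                      # `files` is your list of file paths
    with open(filename) as fh:
        words = set(fh.read().split())      # unique words in this (already cleaned) file
    words = {w for w in words if len(w) >= 2 and w not in stop_words}
    word_appearances.update(words)          # each file counts at most once per word

# keep only words appearing in at least min_appearances files, sorted alphabetically
frequent = sorted(w for w, c in word_appearances.items() if c >= min_appearances)

with open("vocab.csv", "w", newline="") as out:
    writer = csv.writer(out, delimiter=" ")
    for i, word in enumerate(frequent, start=1):
        writer.writerow([i, word])          # e.g. "1 aardvark"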

How to detect repeating "sequences of words" across too many texts?

The problem is to detect repeating sequences of words across a big number of text pieces. It is an approximation and efficiency problem, since the data I want to work with is huge. I want to assign numbers to texts while indexing them, if they have matching parts with texts that are already indexed.
For example, if a TextB which I am indexing now has a matching part with 2 other texts in the database, I want to assign a number to it, p1.
If that matching part were longer, I would want to assign p2 (p2 > p1).
If TextB had a matching part with only 1 other text, it should get p3 (p3 < p1).
These two parameters (length of the sequence, size of the matching group) would have maximum values; once those maxima are exceeded, the assigned number stops increasing.
I can think of a way to do this by brute force, but I need efficiency. My boss directed me to learn about NLP and look for solutions there, and I am planning to follow these Stanford video lectures.
But I have doubts about whether that is the right approach, so I wanted to ask your opinion.
Example:
Text 1:"I want to become an artist and travel the world."
Text 2:"I want to become a musician."
Text 3:"travel the world."
Text 4:"She wants to travel the world."
Given these texts, I want data that looks like this:
- "I want to become", 2 instances, [1,2]
- "travel the world", 3 instances, [1,3,4]
Finally, after having this data, I want to do the following procedure (given the previous data, this may be trivial):
(A matrix called A has values at the necessary indexes; I will determine these after some trials.)
Match groups have numeric values, which they retrieve from matrix A.
Group 1 = A(4,2) % 4 words, 2 instances
Group 2 = A(3,3) % 3 words , 3 instances
Then I will assign each text a number, which is the sum of the numbers of the groups it belongs to.
My problem is forming this dataset in an efficient manner.
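A brute-force sketch of the index described above, in Python (the n-gram lengths, the caps and the placeholder values in A are illustrative assumptions; note that this simple version also counts sub-sequences of longer matches):
from collections import defaultdict

texts = {
    1: "I want to become an artist and travel the world",
    2: "I want to become a musician",
    3: "travel the world",
    4: "She wants to travel the world",
}

MAX_LEN, MAX_GROUP = 5, 4                    # caps on sequence length / group size
A = [[length * group for group in range(MAX_GROUP + 1)]
     for length in range(MAX_LEN + 1)]       # placeholder values for matrix A

index = defaultdict(set)                     # phrase (tuple of words) -> ids of texts containing it
for tid, text in texts.items():
    words = text.lower().split()
    for n in range(3, MAX_LEN + 1):          # sequences of 3 .. MAX_LEN words
        for i in range(len(words) - n + 1):
            index[tuple(words[i:i + n])].add(tid)

scores = defaultdict(int)                    # text id -> summed group value
for phrase, tids in index.items():
    if len(tids) >= 2:                       # keep only sequences shared by 2+ texts
        value = A[min(len(phrase), MAX_LEN)][min(len(tids), MAX_GROUP)]
        for tid in tids:
            scores[tid] += value

print(dict(scores))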

Looping through several string variables. How to account for replicates?

As mentioned in a prior question (kindly answered with perfectly working syntax), I have a very large dataset in SPSS with multiple diagnoses (25) per patient, represented by ICD-10 codes. For brevity's sake I have posted a snapshot of what I am attempting to replicate, using a simple test dataset of 3 string variables labeled DIAG1 to DIAG3 with random codes:
Assume each row represents a patient. The outcome presented in column "O74Updated" is what I am attempting to replicate: essentially a presence/absence-style variable, but with a number representing how many times a patient had an "O74" diagnosis across any of the "DIAG" columns. The current working syntax that generates the outcome in column "O74" is:
compute O74 = 0.
do repeat x = DIAG1 to DIAG3.
if O74=0 O74 = (char.index(UPPER(x),'O74')>0).
end repeat.
As mentioned, the syntax provided above runs wonderfully. However, I have come across a few hundred patients who have multiple "O74" diagnoses, which the above code does not accurately capture. I want to ensure all occurrences of O74 are accounted for by providing a total count for each patient. Is it possible to adapt the syntax provided above so that patients with multiple diagnoses are accounted for?
Again, I greatly appreciate any input/guidance into what is likely a very elementary syntax question in SPSS.
The syntax in your post yields a 1 if any of the diagnoses contains 'O74'. A small change to the syntax will make it count the number of occurrences:
compute O74 = 0.
do repeat x = DIAG1 to DIAG3.
if char.index(UPPER(x),'O74')>0 O74 = O74 + 1.
end repeat.

Count no. of words in O(n)

I am on an interview ride here. One more interview question I had difficulties with.
“A rose is a rose is a rose.” Write an algorithm that prints the number of times each character/word occurs, e.g. A – 3, Rose – 3, Is – 2. Also ensure that when you print the results, they are in the order in which they appeared in the original sentence. All this in order n.
I did get a solution for counting the number of occurrences of each word in the sentence, in the order they appear in the original sentence; I used a Dictionary<string,int> to do it. However, I did not understand what is meant by "order of n". That is something I need you to explain.
There are only 26 characters, so for the character case you can use a counting-sort style array of 26 counters; alongside each counter keep the index at which that character was first seen, so you can report the counts in order of first occurrence. (If you also need them ordered by count, the counts are bounded integers and can be sorted with a counting or radix sort, still in O(n).)
Edit: for words, the first thing everyone thinks of is a hash table: insert each word into the hash and count it there. The counts all lie in 1..n, so they can still be sorted in O(n) with counting sort; and to preserve the order of occurrence, record the position at which each word first appears while you traverse the string.
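A minimal sketch of the character-counting idea in Python (a 26-slot counter array plus a parallel array recording where each letter was first seen, so the output follows the order of appearance):
def letter_counts(sentence):
    counts = [0] * 26                    # one counter per letter a..z
    first_seen = [None] * 26             # position of first occurrence, to keep sentence order
    for pos, ch in enumerate(sentence.lower()):
        if "a" <= ch <= "z":
            i = ord(ch) - ord("a")
            if counts[i] == 0:
                first_seen[i] = pos
            counts[i] += 1
    order = sorted((p, i) for i, p in enumerate(first_seen) if p is not None)
    return [(chr(i + ord("a")), counts[i]) for _, i in order]

print(letter_counts("A rose is a rose is a rose"))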
Order of n means you traverse the string only once, or at most some small constant multiple of n times, where n is the number of characters in the string.
So your solution, which stores each string together with the number of its occurrences, is O(n), order of n, as you loop through the complete string only once.
However, it uses extra space in the form of the list you created.
Order N refers to Big O computational complexity analysis, where you get a good upper bound on an algorithm's running time. It is a theory we cover early in a Data Structures class, so we can torment, I mean help, the student gain facility with it as we traverse, in a balanced way, heaps of different trees of knowledge, all different. In your case they want your algorithm's compute time to grow proportionally with the size of the text.
It's a reference to Big O notation. Basically the interviewer means that you have to complete the task with an O(N) algorithm.
"Order n" is referring to Big O notation. Big O is a way for mathematicians and computer scientists to describe the behavior of a function. When someone specifies searching a string "in order n", that means that the time it takes for the function to execute grows linearly as the length of that string increases. In other words, if you plotted time of execution vs length of input, you would see a straight line.
Saying that your function must be of order n does not mean that your function must be exactly O(n); a function with a Big O less than O(n) would also be considered acceptable. In your problem's case, though, that is not possible, because in order to count a letter you must "touch" that letter, so there must be at least one operation per unit of input.
One possible method is to traverse the string linearly while maintaining a hash and a list. The idea is to use the word as the hash key and increment its value for each occurrence. If the word is not yet in the hash, add it to the end of the list. After traversing the string, go through the list in order, using the hash values as the counts.
The order of the algorithm is O(n). The hash lookup and list add operations are O(1) (or very close to it).
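A small sketch of that method in Python (lowercasing and splitting on whitespace are assumptions about how the words are extracted):
def word_counts(sentence):
    counts = {}                      # word -> number of occurrences
    order = []                       # words in order of first appearance
    for word in sentence.lower().split():
        if word not in counts:
            counts[word] = 0
            order.append(word)       # first time we see this word
        counts[word] += 1
    return [(word, counts[word]) for word in order]

print(word_counts("A rose is a rose is a rose"))
# [('a', 3), ('rose', 3), ('is', 2)]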
