counting most commonly used words across thousands of text files using Octave - struct

I've accumulated a list of over 10,000 text files in Octave. I've got a function which cleans up the contents of each file, normalizing things in various ways (lowercase, reducing repeated whitespace, etc). I'd like to distill from all these files a list of words that appear in at least 100 files. I'm not at all familiar with Octave data structs or cell arrays or the Octave sorting functions, and was hoping someone could help me understand how to:
initialize an appropriate data structure (word_appearances) to count how many emails contain a particular word
loop through the unique words that appear in an email string and, for each of those words, increment the count I'm tracking in word_appearances -- ideally we'd ignore words less than two characters in length and also exclude a short list of stop_words.
reduce word_appearances to only contain words that appear at least some minimum number of times, e.g., min_appearances=100 times.
sort the words in word_appearances alphabetically and either export this as a .MAT file or as a CSV file like so:
1 aardvark
2 albatross
etc.
I currently have this code to loop through my files one by one:
for i = 1:total_files
  filename = char(files{i}(1));
  printf("processing %s\n", filename);
  file_contents = jPreProcessFile(readFile(filename));
endfor
Note that the file_contents that comes back is pretty clean -- usually just a bunch of words, some repeated, separated by single spaces like so:
email market if done right is both highli effect and amazingli cost effect ok it can be downright profit if done right imagin send your sale letter to on million two million ten million or more prospect dure the night then wake up to sale order ring phone and an inbox full of freshli qualifi lead we ve been in the email busi for over seven year now our list ar larg current and deliver we have more server and bandwidth than we current need that s why we re send you thi note we d like to help you make your email market program more robust and profit pleas give us permiss to call you with a propos custom tailor to your busi just fill out thi form to get start name email address phone number url or web address i think you ll be delight with the bottom line result we ll deliv to your compani click here emailaddr thank you remov request click here emailaddr licatdcknjhyynfrwgwyyeuwbnrqcbh
Obviously, I need to create the word_appearances data structure such that each element in it specifies a word and how many files have contained that word so far. My primary point of confusion is what sort of data structure word_appearances should be, how I would search this data structure to see if some new word is already in it, and if found, increment its count, otherwise add a new element to word_appearances with count=1.

Octave has containers.Map to hold key-value pairs. Here is the basic usage:
% initialize the map
m = containers.Map('KeyType', 'char', 'ValueType', 'int32');
% count a word: start at 1 the first time it is seen, otherwise increment
word = 'hello';
if !m.isKey(word)
  m(word) = 1;     % first appearance
else
  m(word) += 1;    % already present: bump the count
endif
This is one way to extract the most frequent words from a map like the one above:
counts = cell2mat(m.values);
[sorted_counts, indices] = sort(counts);
top10_indices = indices(end:-1:end-9);
% keys() and values() return entries in the same order, so the indices line up
all_words = m.keys;
top10_words = all_words(top10_indices);
I must warn you though, Octave may be pretty slow at this task, considering that you have thousands of files. Use it only if running time isn't that important for you.
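Putting the pieces together, here is one way the whole pipeline from the question might look; it is only a sketch. It assumes, as in the question, that files, total_files, readFile, jPreProcessFile, and a cell array of strings stop_words already exist, and that file_contents comes back as a single space-separated string; the output file names (vocab.csv, vocab.mat) are arbitrary. The speed warning above still applies.
% word_appearances maps each word to the number of files containing it
word_appearances = containers.Map('KeyType', 'char', 'ValueType', 'int32');
min_appearances = 100;
for i = 1:total_files
  file_contents = jPreProcessFile(readFile(char(files{i}(1))));
  % unique() makes sure each word is counted at most once per file
  words = unique(strsplit(strtrim(file_contents), ' '));
  for j = 1:numel(words)
    w = words{j};
    if length(w) < 2 || ismember(w, stop_words)
      continue;                      % skip short words and stop words
    endif
    if !word_appearances.isKey(w)
      word_appearances(w) = 0;
    endif
    word_appearances(w) += 1;
  endfor
endfor
% keep only words that appear in at least min_appearances files
all_words = word_appearances.keys();
counts = cell2mat(word_appearances.values());
vocab = sort(all_words(counts >= min_appearances));   % alphabetical order
% write "index word" lines to a text/CSV file, and also save a .mat copy
fid = fopen('vocab.csv', 'w');
for k = 1:numel(vocab)
  fprintf(fid, '%d %s\n', k, vocab{k});
endfor
fclose(fid);
save('vocab.mat', 'vocab');
Because unique() is applied per file, a word repeated many times inside one email still adds only 1 to its count, which matches the "appears in at least 100 files" requirement.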

Related

How can I separate consecutive strings without any delimiters?

My input data is a VCF (Variant Call Format) file. Each line that I am interested in looks like this:
chrI 22232 DEL00BED N <DEL> . PASS SUPP=1159;SUPP_VEC=11111111111111111111111011111111111
I want to count the presence (1) of a specific deletion at a specific position (22232) supported by n samples. For this reason, I looked at the SUPP_VEC= values; however, I don't know how to split each value because 1) it is a string, and 2) it doesn't have delimiters. How could I add a space between every character? Or how could I split/count the values from SUPP_VEC= in Python 3?
I was also curious to know what SUPP means. I found one SUPP=2 and checked in Excel whether the presence (1) / absence (0) values in SUPP_VEC added up to the value of SUPP; nevertheless, I could only count 1 instead of 2. Perhaps somebody knows what SUPP means.
The reason for my procedure is to have a frequency table for a specific deletion type.
I hope I made myself clear.
Thank you in advance.

How to detect repeating "sequences of words" across too many texts?

The problem is to detect repeating sequences of words across a big number of text pieces. It is an approximation and efficiency problem, since the data I want to work with is huge. I want to assign numbers to texts while indexing them, if they have matching parts with the texts which are already indexed.
For example, if TextB, which I am indexing now, has a matching part with 2 other texts in the database, I want to assign a number to it, p1.
If that matching part were longer, then I would want it to assign p2 (p2 > p1).
If TextB has a matching part with only 1 other text, then it should give p3 (p3 < p1).
These two parameters (length of the sequence, size of the matching group) would have maximum values, meaning that after these max values have been surpassed, the number being assigned stops increasing.
I can think of a way to do this by brute force, but I need efficiency. My boss directed me to learn about NLP and look for solutions there, and I am planning to work through the Stanford video lectures.
But I have doubts about whether that is the right approach, so I wanted to ask your opinion.
Example:
Text 1:"I want to become an artist and travel the world."
Text 2:"I want to become a musician."
Text 3:"travel the world."
Text 4:"She wants to travel the world."
Having these texts, I want to end up with data that looks like this:
-"I want to become" , 2 instances , [1,2]
-"travel the world" , 3 instances , [1,3,4]
After having this data, finally, I want to do the following procedure (given the previous data, this part may be trivial):
(A matrix called A has some values at necessary indexes. I will determine these after some trials.)
Match groups have numeric values, which they retrieve from matrix A.
Group 1 = A(4,2) % 4 words, 2 instances
Group 2 = A(3,3) % 3 words , 3 instances
Then I will assign each text a number, which is the sum of the numbers of the groups it is part of.
My problem is forming this dataset in an efficient manner.
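Not an attempt at the efficiency requirement, but to make the target data shape concrete, here is a brute-force sketch in Octave (matching the A(4,2)-style notation used above). It builds a map from each n-gram to the list of texts containing it; the n-gram length range and variable names are invented for the example:
texts = {"I want to become an artist and travel the world.", ...
         "I want to become a musician.", ...
         "travel the world.", ...
         "She wants to travel the world."};
% index: n-gram string -> vector of indices of the texts that contain it
index = containers.Map('KeyType', 'char', 'ValueType', 'any');
for t = 1:numel(texts)
  % normalize: lowercase, drop punctuation, split on spaces
  words = strsplit(regexprep(tolower(texts{t}), '[^a-z ]', ''), ' ');
  for n = 3:4                        % n-gram lengths to look for (invented)
    for j = 1:(numel(words) - n + 1)
      key = strjoin(words(j:j+n-1), ' ');
      if index.isKey(key)
        index(key) = unique([index(key), t]);
      else
        index(key) = t;
      endif
    endfor
  endfor
endfor
% report n-grams shared by at least two texts
ngrams = index.keys();
for k = 1:numel(ngrams)
  ids = index(ngrams{k});
  if numel(ids) >= 2
    printf("\"%s\" , %d instances , [%s]\n", ngrams{k}, numel(ids), num2str(ids));
  endif
endfor
This only illustrates the data; the nested scan over every text and every n-gram length is exactly the brute force the question wants to avoid, so an inverted index or shingling approach would be needed at scale.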

Assigning and reading multidimensional arrays in Python

I'm stumped.
for a in range(0, 500):  # 500 is a highly variable number, but using it for example purposes
    b = findall(r'<(.*?)>', d)  # d will return a highly variable number of matches, anywhere from 45-10000
    c.append([b])
print(c[0][1])
This returns an error because everything from 'b' goes into c[0][0]. I can understand this. The question is how do I split 'b' apart so that I can put it into c and then
print(c[0][234])
and get it to give me back the 235th, err, element 234 of the 1st, err, 0th, line?
This is a situation, like I said above, where the number of items going into 'b' will be variable. At least for right now, until I get the entire file prepped, I can only say that 'b' in the end will be way north of 10,000 and probably closer to 100,000 by the time I have all the data collection finished. The number of elements that are stored can and will be highly variable depending on the file that they come from. They are all coming from a csv file, but I'm hoping not to add any 'complexity' by going out and having to deal with the csv module... since I've never used it before and that will probably just lead to more questions.
I have tried something similar to... different variables naturally, so things would be appropriately matched up
d = list(zip(*(e.split(',') for e in b)))
All this did is split on each and every letter instead of on the comma.
Your error is coming from the square brackets you have in c.append([b]). The brackets create an extra list that contains the list b. So rather than a two dimensional data structure, you're ending up with three dimensions. Your indexing fails because c[0][1] is trying to get a second value from the middle list (which only ever has one item in it).
You might get what you want with c[0][0][1] instead. But you probably don't actually want that extra level in your data structure. You can avoid creating it by using: c.append(b)

Descriptive statistics in Stata - Word frequencies

I have a large data set containing as variables fileid, year and about 1000 words (each word is a separate variable). All line entries come from company reports indicating the year, a unique fileid and the respective absolute frequency of each word in that report. Now I want some descriptive statistics: the number of words not used at all, the mean of words, the variance of words, and the top percentile of words. How can I program that in Stata?
Caveat: You are probably better off using a text processing package in R or another program. But since no one else has answered, I'll give it a Stata-only shot. There may be an ado file already built that is much better suited, but I'm not aware of one.
I'm assuming that "each word is a separate variable" means that there is a variable word_profit that takes a value k from 0 to K, where word_profit[i] is the number of times profit is written in the i-th report, fileid[i].
Mean of words
collapse (mean) word_* will give you the average number of times the words are used. Adding a by(year) option will give you those means by year. To make this more manageable than a very wide one-observation dataset, you'll want to run the following after the collapse:
gen temp = 1
reshape long word_, i(temp) j(str) string
rename word_ count
drop temp
Variance of words
collapse (sd) word_* will give you the standard deviation. To get variances, just square the standard deviation.
Number of words not used at all
Without a bit more clarity, I don't have a good idea of what you want here. You could count zeros for each word with:
foreach var of varlist word_* {
    gen zero_`var' = (`var' == 0)
}
collapse (sum) zero_*

Generating unique combinations of text

I'm building a string of text from different parts: Group A + Group B + Group C + Group D.
The text is put together in this exact order. Each sentence is unique.
I randomly take one sentence from each group and put them together, so the total number of unique combinations is A*B*C*D, where A, B, C, D are the numbers of sentences in the respective groups.
My problem is: how do I track that I don't generate duplicates this way, and how do I know when I have used up all possible combinations?
Storing all possible combinations somewhere seems like a rather inefficient way to do this. So what options do I have?
As random strings of text are pulled from each group, simply store the starting position of the sentence within the group along with the length into a container like a dictionary or perhaps HashSet. This would act as the key to the container. If the number of sentences in each group is small enough, you might be able to pack the data into a single integer or long value, otherwise define a structure or class for it. The code should look in the container to see if the random combination generated has already been used. If it has been used, then loop until a unique one has been found. If the total number of combinations is small enough such that the user might go through them all, then pre-calculate the total-count and check to see if the container reaches that count, in which case some sort of exit processing should be performed.
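As a rough illustration of that idea (the question does not name a language, so this follows the Octave used elsewhere on this page): each combination is packed into a single linear index, a logical vector records which combinations have been used, and generation stops once all of them have been seen. The group sizes here are invented:
% number of sentences in each group (invented sizes)
A = 5; B = 4; C = 6; D = 3;
total = A * B * C * D;            % total number of possible combinations
used  = false(total, 1);          % marks combinations that have been generated
while !all(used)
  % draw random sentence indices until an unused combination turns up
  do
    pick = [randi(A), randi(B), randi(C), randi(D)];
    key  = sub2ind([A B C D], pick(1), pick(2), pick(3), pick(4));
  until !used(key)
  used(key) = true;
  printf("new combination: sentences [%d %d %d %d]\n", pick);
endwhile
printf("all %d combinations have been used\n", total);
Note that drawing at random and rejecting duplicates gets slower as the space fills up; if every combination really will be consumed, shuffling 1..total once with randperm(total) and decoding each value back into group indices avoids the retry loop entirely.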
