How to detect repeating "sequences of words" across many texts?

The problem is to detect repeating sequences of words across a large number of text pieces. It is an approximation and efficiency problem, since the data I want to work with is huge. While indexing texts, I want to assign a number to each text based on whether it has parts matching texts that are already indexed.
For example, if a TextB which I am indexing now has a part matching 2 other texts in the database, I want to assign a number p1 to it.
If that matching part were longer, it should be assigned p2 (p2 > p1).
If TextB has a matching part with only 1 other text, it should get p3 (p3 < p1).
These two parameters (length of the sequence, size of the matching group) would have maximum values, meaning that once the maxima have been surpassed, the assigned number stops increasing.
I can think of a brute-force way to do this, but I need efficiency. My boss directed me to learn about NLP and look for solutions there, and I am planning to follow the Stanford video lectures.
But I have doubts about whether that is the right approach, so I wanted to ask your opinion.
Example:
Text 1:"I want to become an artist and travel the world."
Text 2:"I want to become a musician."
Text 3:"travel the world."
Text 4:"She wants to travel the world."
Given these texts, I want data that looks like this:
-"I want to become" , 2 instances , [1,2]
-"travel the world" , 3 instances , [1,3,4]
After having this data, I finally want to apply the following procedure (given the previous data, this may be trivial):
(A matrix called A has values at the necessary indices. I will determine these values after some trials.)
Match groups have numeric values, which they retrieve from matrix A.
Group 1 = A(4,2) % 4 words, 2 instances
Group 2 = A(3,3) % 3 words , 3 instances
Then each text is assigned a number: the sum of the values of the groups it belongs to.
My problem is forming this dataset in an efficient manner.
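For scale, the usual trick (as in near-duplicate and plagiarism detection) is an inverted index from word n-grams ("shingles") to the texts containing them; suffix arrays or MinHash can take it further. Below is a minimal Python sketch of the indexing and scoring idea, using the example texts above; the matrix A is stubbed with a placeholder function and the caps MAX_LEN and MAX_GROUP are assumed values, not anything from the question.

from collections import defaultdict

texts = {
    1: "I want to become an artist and travel the world.",
    2: "I want to become a musician.",
    3: "travel the world.",
    4: "She wants to travel the world.",
}

MAX_LEN = 4     # cap on sequence length (assumed value)
MAX_GROUP = 3   # cap on matching-group size (assumed value)

def ngrams(words, n_lo=2, n_hi=MAX_LEN):
    # yield all word n-grams of length n_lo..n_hi as tuples
    for n in range(n_lo, n_hi + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

index = defaultdict(set)   # inverted index: n-gram -> ids of texts containing it
for tid, text in texts.items():
    tokens = text.lower().replace(".", "").split()
    for gram in ngrams(tokens):
        index[gram].add(tid)

def A(length, instances):
    # placeholder for matrix A; both dimensions are capped as described
    return min(length, MAX_LEN) * min(instances, MAX_GROUP)

scores = defaultdict(int)
for gram, tids in index.items():
    if len(tids) >= 2:                 # only sequences shared across texts
        for tid in tids:
            scores[tid] += A(len(gram), len(tids))

One caveat: every sub-sequence of a shared sequence is itself shared, so this naive version double-counts; keeping only maximal matches (e.g., via a generalized suffix array) fixes that, and is also where the real efficiency gain comes from.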

Related

counting most commonly used words across thousands of text files using Octave

I've accumulated a list of over 10,000 text files in Octave. I've got a function which cleans up the contents of each file, normalizing things in various ways (lowercase, reducing repeated whitespace, etc.). I'd like to distill from all these files a list of words that appear in at least 100 files. I'm not at all familiar with Octave data structs, cell arrays, or the Octave sorting functions, and was hoping someone could help me understand how to:
initialize an appropriate data structure (word_appearances) to count how many emails contain a particular word
loop through the unique words that appear in an email string and increment the count I'm tracking in word_appearances for each of those words -- ideally we'd ignore words less than two chars in length and also exclude a short list of stop_words.
reduce word_appearances to only contain words that appear some number of times, e.g., min_appearances=100 times.
sort the words in word_appearances alphabetically and either export this as a .MAT file or as a CSV file like so:
1 aardvark
2 albatross
etc.
I currently have this code to loop through my files one by one:
for i = 1:total_files
  filename = char(files{i}(1));
  printf("processing %s\n", filename);
  file_contents = jPreProcessFile(readFile(filename));  % semicolon keeps Octave from echoing the whole file
endfor
Note that the file_contents that comes back is pretty clean -- usually just a bunch of words, some repeated, separated by single spaces like so:
email market if done right is both highli effect and amazingli cost effect ok it can be downright profit if done right imagin send your sale letter to on million two million ten million or more prospect dure the night then wake up to sale order ring phone and an inbox full of freshli qualifi lead we ve been in the email busi for over seven year now our list ar larg current and deliver we have more server and bandwidth than we current need that s why we re send you thi note we d like to help you make your email market program more robust and profit pleas give us permiss to call you with a propos custom tailor to your busi just fill out thi form to get start name email address phone number url or web address i think you ll be delight with the bottom line result we ll deliv to your compani click here emailaddr thank you remov request click here emailaddr licatdcknjhyynfrwgwyyeuwbnrqcbh
Obviously, I need to create the word_appearances data structure such that each element in it specifies a word and how many files have contained that word so far. My primary point of confusion is what sort of data structure word_appearances should be, how I would search this data structure to see if some new word is already in it, and if found, increment its count, otherwise add a new element to word_appearances with count=1.
Octave has containers.Map to hold key-value pairs. Here is the basic usage:
% initialize the map
m = containers.Map('KeyType', 'char', 'ValueType', 'int32');
% count a word: insert it with count 1, or increment the existing count
word = 'hello';
if !m.isKey(word)
  m(word) = 1;
else
  m(word) += 1;
endif
This is one way to extract most frequent words from a map like the one above:
% keys and values are returned in matching (key-sorted) order
words = m.keys;
counts = cell2mat(m.values);
[sorted_counts, indices] = sort(counts, 'descend');
top10_words = words(indices(1:10));
I must warn you though, Octave may be pretty slow at this task, considering that you have thousands of files. Use it only if running time isn't that important for you.
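Given that warning, if switching tools is an option, the same document-frequency count is short in Python. A minimal sketch, assuming each file is already preprocessed into space-separated words; the directory name, threshold, and stop list are placeholders, not anything from the question:

from collections import Counter
from pathlib import Path

min_appearances = 100
stop_words = {"the", "and", "a"}   # short stop list (placeholder)

doc_freq = Counter()               # word -> number of files containing it
for path in Path("emails").glob("*.txt"):   # hypothetical directory
    words = set(path.read_text().split())   # unique words in this file
    doc_freq.update(w for w in words
                    if len(w) >= 2 and w not in stop_words)

# keep words in at least min_appearances files, sorted alphabetically
kept = sorted(w for w, c in doc_freq.items() if c >= min_appearances)
for i, w in enumerate(kept, start=1):
    print(i, w)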

LibreOffice or Excel: Randomization of items across columns without repetition

I have 100 people and I want them to judge words as either positive or negative (e.g. 'insurance' and 'car accident'). I have a total of 100 such words. I also want each person to judge three words, as I am interested in some statistical properties (i.e. seeing how well people agree).
I want to assign words to people by creating three columns with the same words in each column. However, I want the words randomized so that there is no repetition in any row. Randomization is obviously important as I want to avoid any bias, but it would be silly to ask the same person the same word twice (or, worse, three times).
So, here is the data structure that I try to achieve:
person1, word1, word65, word33;
person2, word55, word56, word44;
person3, word23, word23, word3; <--- This should not happen
Is there a simple formula or other way to do this form of column-spanning randomization without repetition in LibreOffice Calc or Excel?
Thanks in advance!
What you need is a random permutation of the words that you type in different cells. You can do this task using the LibreOffice extension Permutate! (download here: https://sourceforge.net/projects/permutate/). Since I am the developer of this simple extension, please do not hesitate to ask for any clarifications.
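If a small script is acceptable instead of a spreadsheet formula or extension, the idea is simple enough to sketch in Python (plain rejection sampling, not how Permutate! works internally): shuffle the word list once per column and re-shuffle until no row contains the same word twice. The word list here is a placeholder.

import random

words = ["word%d" % i for i in range(1, 101)]   # placeholder word list

def row_distinct_columns(words, n_cols=3, max_tries=10000):
    # shuffle the list once per column; retry until no row repeats a word
    for _ in range(max_tries):
        cols = [random.sample(words, len(words)) for _ in range(n_cols)]
        if all(len({col[r] for col in cols}) == n_cols
               for r in range(len(words))):
            return cols
    raise RuntimeError("no conflict-free shuffle found")

col1, col2, col3 = row_distinct_columns(words)
for person, row in enumerate(zip(col1, col2, col3), start=1):
    print("person%d, %s" % (person, ", ".join(row)))

With 100 words and three columns only a handful of retries are typically needed, so the loop is cheap in practice.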

Filtering letter combinations

Hi – I’m looking for help with the following problem.
I have a utility that gives me all the combinations for a set of letters (or values). This is in the form of 8 choose n, i.e. there are 8 letters and I can produce all the combinations for sequences of no more than 4 letters, so n can be 2, 3, or 4.
Now here it gets a bit more complex: the 8 letters are made up of three lists or groups. Hence: A,B,C,D; E1,E2; F1,F2.
As I say, I can get all the 2-, 3- and 4-sequences without a problem. But I need to filter the results so that (in the n=2 condition) each combination contains at least one letter from A,B,C,D and one from either the E set or the F set.
So, as a few examples, where n=2
AE1 or DF2… is ok but AB or E1E2 or E1F1… is not ok
Where n=3 the rules alter slightly but it’s the same principle
ABE1, ABF1, BDF2 or BE2F1… is ok but ABC, ABD, AE1E2, DF1F2 or E1E2F1… is not ok.
Similarly, where n=4
ABE1F1, ABE1F2… is ok but ABCD, ABE1E2, CDF1F2 or E1E2F1F2… is not ok.
I’ve tried a few things using different formulas such as MATCH and COUNTIF but can’t quite figure it out, so I would be very grateful for any help.
Jon
I've been trying to find an approach to this problem that takes some of the messiness out of it. There are two factors that make this a bit awkward to deal with:
(a) the combination of single letters and bigrams (digrams?);
(b) the possibility of several different letters / bigrams at each position in the string.
It's possible to deal with both of these issues by classifying the letters or bigrams into three groups or classes:
(1) Letters A-D - let's call this group L
(2) First pair of bigrams E1 & E2 - let's call this group M
(3) Second pair of bigrams F1 & F2 - let's call this group N.
Then we can make a list of the allowed combinations of groups, which as far as I can work out is something like this:
For N=2
LM
LN
For N=3
LLM
LLN
LMN
For N=4
LLMN
(I don't know if LLLM etc. is allowed but these can be added)
I'm going to make a big assumption that the utility mentioned in the OP doesn't generate strings like AAAA or E1E1E1E1; otherwise it would be pretty useless and you would be better off starting from scratch.
So you just need a SUBSTITUTE formula that looks like this:
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2,"A","L"),"B","L"),"C","L"),"D","L"),"E1","M"),"E2","M"),"F1","N"),"F2","N")
And a lookup in the list of allowed patterns:
=ISNUMBER(MATCH(B2,$D$2:$D$10,0))
and filter on the lookup value being TRUE.
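The same classify-then-filter logic is easy to check outside a spreadsheet. Here is a small Python sketch; the allowed-pattern list mirrors the one above and can be extended the same way:

from itertools import combinations

letters = ["A", "B", "C", "D", "E1", "E2", "F1", "F2"]
group = {"A": "L", "B": "L", "C": "L", "D": "L",
         "E1": "M", "E2": "M", "F1": "N", "F2": "N"}
# allowed class patterns, written in sorted letter order
allowed = {"LM", "LN", "LLM", "LLN", "LMN", "LLMN"}

for n in (2, 3, 4):
    # combinations() never repeats a token, matching the assumption above
    for combo in combinations(letters, n):
        pattern = "".join(sorted(group[x] for x in combo))
        if pattern in allowed:
            print("".join(combo))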

Descriptive statistics in Stata - Word frequencies

I have a large data set containing as variables fileid, year, and about 1000 words (each word is a separate variable). Each line comes from a company report, indicating the year, a unique fileid, and the absolute frequency of each word in that report. Now I want some descriptive statistics: the number of words not used at all, the mean and variance of the words, and the top percentile of words. How can I program that in Stata?
Caveat: You are probably better off using a text processing package in R or another program. But since no one else has answered, I'll give it a Stata-only shot. There may be an ado file already built that is much better suited, but I'm not aware of one.
I'm assuming that
each word is a separate variable
means that there is a variable word_profit that takes a value k from 0 to K where word_profit[i] is the number of times profit is written in the i-th report, fileid[i].
Mean of words
collapse (mean) word_* will give you the average number of times the words are used. Adding a by(year) option will give you those means by year. To make this more manageable than a very wide one-observation dataset, you'll want to run the following after the collapse:
gen temp = 1                                 // constant id so reshape has an i() key
reshape long word_, i(temp) j(word) string   // one observation per word
rename word_ count
drop temp
Variance of words
collapse (sd) word_* will give you the standard deviation. To get variances, just square the standard deviations.
Number of words not used at all
Without a bit more clarity, I don't have a good idea of what you want here. You could count zeros for each word with:
foreach var of varlist word_* {
    gen zero_`var' = (`var' == 0)   // 1 if the word never appears in this report
}
collapse (sum) zero_*
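In line with the caveat at the top of this answer, the same summaries are one-liners in Python/pandas. A sketch assuming the data sits in wide format with word_* columns; the file name is hypothetical:

import pandas as pd

df = pd.read_stata("reports.dta")      # hypothetical file name
words = df.filter(like="word_")        # the ~1000 word-count columns

means = words.mean()                   # mean frequency of each word
variances = words.var()                # variance of each word
never_used = (words.sum() == 0).sum()  # words never used in any report
n_top = max(1, len(words.columns) // 100)
top_words = words.sum().nlargest(n_top)  # top 1% of words by total count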

Generating unique combinations of text

I'm building a string of text from different parts: Group A + Group B + Group C + Group D.
The text is put together in this exact order. Each sentence is unique.
I randomly take one sentence from each group and put them together, so the total number of unique combinations would be A*B*C*D, where A, B, C, D are the number of sentences in their respective groups.
My problem is: how do I track that I don't generate duplicates this way, and how do I know when I have used up all possible combinations?
Storing all possible combinations somewhere seems a rather inefficient way to do this, so what options do I have?
As random strings of text are pulled from each group, simply store the starting position of the sentence within the group, along with its length, in a container like a dictionary or perhaps a HashSet; this acts as the key to the container. If the number of sentences in each group is small enough, you might be able to pack the data into a single integer or long value; otherwise, define a structure or class for it.
The code should look in the container to see whether the random combination generated has already been used. If it has, loop until a unique one is found. If the total number of combinations is small enough that the user might go through them all, pre-calculate the total count and check whether the container has reached it, in which case some sort of exit processing should be performed.
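A minimal sketch of that bookkeeping in Python; the group contents are placeholders, and a set of index tuples stands in for the dictionary/HashSet key container:

import random

groups = [
    ["A1.", "A2."],          # placeholder sentences for Group A
    ["B1.", "B2.", "B3."],   # Group B
    ["C1."],                 # Group C
    ["D1.", "D2."],          # Group D
]

total = 1
for g in groups:
    total *= len(g)          # A*B*C*D possible combinations

used = set()                 # keys of combinations handed out so far

def next_unique():
    # return a fresh random combination, or None once all are used
    if len(used) == total:
        return None          # exhausted: time for exit processing
    while True:
        key = tuple(random.randrange(len(g)) for g in groups)
        if key not in used:
            used.add(key)
            return " ".join(g[i] for g, i in zip(groups, key))

One design note: as the used count approaches the total, the retry loop slows down; if that matters, shuffling the full list of index tuples up front trades the memory the question wanted to avoid for a guaranteed O(1) next pick.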
