Batch running spaCy nlp() pipeline for large documents - python-3.x

I am trying to run the nlp() pipeline on a series of transcripts amounting to 20,211,676 characters. I am running on a machine with 8 GB of RAM. I'm very new to both Python and spaCy, but the corpus comparison tools and sentence chunking features are perfect for the paper I'm working on now.
What I've tried
I've begun by loading the English pipeline and disabling 'ner' for faster processing:
nlp = spacy.load('en_core_web_lg', disable = ['ner'])
Then I break the corpus into pieces of 800,000 characters, since spaCy's guidance is roughly 100,000 characters per GB of memory:
split_text = [text[i:i+800000] for i in range(0, len(text), 800000)]
Then I loop the pieces through the pipeline and create a list of Doc objects:
nlp_text = []
for piece in split_text:
    piece = nlp(piece)
    nlp_text.append(piece)
This works after a long wait. Note: I have tried raising the threshold via nlp.max_length, but anything above 1,200,000 crashes my Python session.
Now that I have everything piped through, I need to concatenate everything back together, since I will eventually need to compare the whole document to another (of roughly equal size). I would also be interested in finding the most frequent noun phrases in the document as a whole, not just within artificial 800,000-character pieces.
nlp_text = ''.join(nlp_text)
However I get the error message:
TypeError: sequence item 0: expected str instance, spacy.tokens.doc.Doc found
I realize that I could convert everything to strings and concatenate those, but that would defeat the purpose of having token objects to work with.
What I need
Is there anything I can do (apart from paying for expensive AWS CPU time) to split my documents, run the nlp() pipeline, and then join the tokens to reconstruct my complete document as an object of study? Am I running the pipeline wrong for a big document? Am I doomed to finding 64 GB of RAM somewhere?
Edit 1: Response to Ongenz
(1) Here is the error message I receive
ValueError: [E088] Text of length 1071747 exceeds maximum of 1000000.
The v2.x parser and NER models require roughly 1GB of temporary memory
per 100,000 characters in the input. This means long texts may cause
memory allocation errors. If you're not using the parser or NER, it's
probably safe to increase the nlp.max_length limit. The limit is in
number of characters, so you can check whether your inputs are too
long by checking len(text).
I could not find a part of the documentation that refers to this directly.
(2) My goal is to do a series of measures including (but not limited to, if the need arises): word frequency, tf-idf counts, sentence counts, counting the most frequent noun chunks, and comparing two corpora using w2v or d2v strategies.
My understanding is that I need every part of the spaCy pipeline apart from NER for this.
(3) You are completely right about cutting the document; in a perfect world I would cut on a line break instead. But as you mentioned, I cannot use join to regroup my broken-apart corpus, so it might not be relevant anyway.
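For what it's worth, a minimal sketch of cutting on line breaks rather than at fixed offsets might look like this (the 800,000-character budget is carried over from above; variable names are illustrative):
chunks = []
current = ""
for line in text.splitlines(keepends=True):
    # Start a new chunk once adding this line would exceed the character budget
    if current and len(current) + len(line) > 800_000:
        chunks.append(current)
        current = ""
    current += line
if current:
    chunks.append(current)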

You need to join the resulting Docs using the Doc.from_docs method:
from spacy.tokens import Doc

docs = []
for piece in split_text:
    doc = nlp(piece)
    docs.append(doc)

merged = Doc.from_docs(docs)
See the documentation here for more details.
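As a follow-up to the goal of finding frequent noun phrases across the whole document, a minimal sketch on the merged Doc could be (the parser must stay enabled, which it is here since only 'ner' was disabled):
from collections import Counter

noun_chunk_counts = Counter(chunk.text.lower() for chunk in merged.noun_chunks)
print(noun_chunk_counts.most_common(20))  # 20 most frequent noun chunks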

Related

My training data contains line breaks; how can I work with Gensim's LineSentence format for the corpus_file parameter?

Per Gensim's documentation, changelog, and previous StackOverflow answers, I know that passing training data in the LineSentence format to the corpus_file parameter can dramatically speed up Any2Vec training.
Documentation on the LineSentence format reads as follows:
Iterate over a file that contains sentences: one line = one sentence. Words must be already preprocessed and separated by whitespace.
My training data comprises tens of millions (and potentially over one hundred million) of sentences extracted from plaintext files using spaCy. A sample sentence quite often contains one or more line-break characters (\n).
How can I make these samples compatible with the LineSentence format? As far as I understand, these samples should be "understood" in the context of their linebreaks, as these breaks are present in the target text (data not trained upon). That means I can't just remove them from the training data.
Do I escape the newline characters with \\n? Is there a way to pass a custom delimiter?
I appreciate any guidance. Thanks in advance.
LineSentence is only an appropriate iterable for classes like Word2Vec that expect a corpus to be a Python sequence, where each item is a list of tokens.
The exact placement of linebreaks is unlikely to make much difference in usual word-vector training. If reading a file line-by-line, all words on the same line will still appear in each other's contexts. (Extra linebreaks will just prevent words from 'bleeding' over slightly into the contexts of preceding/subsequent texts – which in a large training corpus probably makes no net difference for end results.)
So mainly: don't worry about it.
If you think it might be a problem you could try either...
Removing the newlines between texts that you conjecture "should" be a single text, creating longer lines. (Note, though, that you don't want any of your texts to be over 10,000 tokens, or else an internal implementation limit in Gensim will mean tokens past the first 10,000 will be ignored.)
Replacing the newlines, in texts that you conjecture "should" be a single text, with some synthetic token, like say <nl> (or whatever).
...then evaluate whether the results have improved over simply not doing that. (I doubt they will improve, for basic Word2Vec/FastText training.)
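If you do try the second option, a minimal sketch of writing such a corpus file might look like this (texts and the <nl> marker are placeholders):
def to_line(text):
    # Replace internal newlines with a synthetic token, then collapse remaining whitespace
    return " ".join(text.replace("\n", " <nl> ").split())

with open("corpus_file.txt", "w", encoding="utf-8") as fout:
    for text in texts:  # `texts` is assumed to be your iterable of multi-line samples
        fout.write(to_line(text) + "\n")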
For Doc2Vec, you might have to pay more attention to ensuring that all words of a 'document' are handled as a single text. In that case, you should make sure that whatever iterable sequence you have that produces TaggedDocument-like objects assigns the desired, same tag to all raw text that should be considered part of the same document.
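A minimal sketch of such an iterable, assuming documents is a list of (doc_id, full multi-line text) pairs:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def tagged_corpus(documents):
    # All raw text belonging to one document gets the same single tag
    for doc_id, full_text in documents:
        yield TaggedDocument(words=full_text.split(), tags=[doc_id])

corpus = list(tagged_corpus(documents))
model = Doc2Vec(vector_size=100, epochs=10)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)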

How to measure similarity between sentences inside a cluster after clustering?

I'm conducting topic modeling analysis on messages from public Telegram groups, super new to this area so just learning.
I've been following this example here (https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6), and tried swapping out the HDBSCAN clustering algorithm with the one in BERT's documentation util.community_detection (https://www.sbert.net/docs/package_reference/util.html).
When I output the results of the clusters in this example (4,899 Telegram messages), the output is organized roughly as follows:
Topic: just a cluster label
Doc: all the messages in that cluster combined together
0: top keywords found via tf-idf
The problem I'm concerned with is that there are clearly a ton of messages that are basically identical to each other (I've marked them in yellow in my output). A few examples:
Cluster 3: this is just a bunch of "hellos" and variations thereof
Cluster 5: this is just a bunch of "Ok"s, people saying yes / ok
Cluster 7: people just saying thanks and variations on that
Cluster 9: some variations and misspellings of the word "gas"
Cluster 19: just "siap" which I think means "sorry if I already posted"
To a human reader, I feel like this type of text should just be excluded from the analysis altogether; the question is how to detect it.
Since they're already grouped together by the clustering algorithm, the algorithm must have some way to measure the "similarity" between these messages within a cluster. But I don't seem to be able to find these values exposed anywhere, or what the measure is called. For example, with the HDBSCAN algorithm (https://hdbscan.readthedocs.io/en/latest/basic_hdbscan.html#), I skimmed through the docs a few times and didn't find any such property or measure exposed. Am I missing something here?
My hypothesis is that for the cases where it's just a word or a short phrase repeated over and over again, this similarity value must be super super high, and I'd just say "clusters whose internal similarity is higher than this threshold are getting thrown out".
Any help & advice would be greatly appreciated, thanks!
Index the corpus of your interest (for example with FAISS). Just to give an idea, example code is below:
import faiss
import numpy as np

def build_index(self):
    """Returns an index over the search documents (assumes self.documents and an encoder self.encode that returns one vector per document)."""
    vectors = [self.encode(document) for document in self.documents]
    index = faiss.IndexIDMap(faiss.IndexFlatIP(768))  # 768 = dimensionality of the vector space
    # Add document vectors to the index as numpy arrays; IDs match positions in self.documents
    index.add_with_ids(np.array([vec.numpy() for vec in vectors]), np.array(range(len(self.documents))))
    return index
Then apply any similarity metric, like L2 (Euclidean) distance or cosine similarity via dot products. Essentially, the concept is that once we transform documents into vectors in an n-dimensional space, vectors with similar semantics end up grouped together. Therefore, computing similarity is just computing the angle between them and applying a cosine to it. Similar vectors have a smaller angle and therefore a higher cosine value, and vice versa.
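For illustration, that computation is just a normalized dot product; a tiny numpy sketch:
import numpy as np

def cosine_similarity(a, b):
    # Smaller angle between vectors -> higher cosine -> more similar
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))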
Check the following topics for your problem.
Cosine Similarity
FAISS
Sentence Vectors (similar to word vectors, but are good for long documents)
Check this repository for a better understanding of sentence vectorization and computing similarity to retrieve top n sentences.
In short,
Create an index file using FAISS for your data of interest.
Compute similarity by calling one of its methods.
Get top n most similar results.
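A rough sketch of the query step (steps 2 and 3 above), assuming encode is the same encoder used to build the index and documents is the indexed list:
import numpy as np

query_vec = np.array([encode("example query message").numpy()]).astype("float32")
scores, ids = index.search(query_vec, 5)  # top 5 most similar documents by inner product
top_docs = [documents[i] for i in ids[0]]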
Removing stop words:
Essentially, your problem can be attributed to a finite list of stop words. If you can identify them, up to some finite number (e.g. around 25 such distinct keywords at most), then the task becomes stop-word removal. You can use the NLTK or spaCy libraries for easy stop-word removal. You can also specify them as a list of strings and write a condition so that if a token matches one of those strings, it is dropped from downstream processing. Stop-word removal is a common and often necessary pre-processing task in NLP. Your Telegram data task is also similar to Twitter analysis. Check this and this.
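A minimal sketch of that kind of filtering, for example with spaCy (the custom strings are just examples drawn from the clusters above):
import spacy

nlp = spacy.load("en_core_web_sm")
custom_stops = {"hello", "hi", "ok", "thanks", "siap"}  # example message-specific stop words

def content_tokens(message):
    # Drop built-in stop words plus the custom ones before downstream processing
    return [t.text for t in nlp(message) if not t.is_stop and t.text.lower() not in custom_stops]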

faster way of reading word2vec txt in python

I have a standard word2vec output which is a .txt file formatted as follows:
[number of words] [dimension (300)]
word1 [300 float numbers separated by spaces]
word2 ...
Now I want to read at most M word representations out of this file. A simple way is to loop over the first M+1 lines in the file and store the M vectors in a numpy array. But this is super slow; is there a faster way?
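For reference, the straightforward loop described above might look like this (the file path and M are placeholders):
import numpy as np

def load_first_m(path, m):
    words, vectors = [], []
    with open(path, encoding="utf-8") as f:
        n_words, dim = map(int, f.readline().split())  # header: [number of words] [dimension]
        for _ in range(min(m, n_words)):
            parts = f.readline().rstrip().split(" ")
            words.append(parts[0])
            vectors.append(np.array(parts[1:], dtype=np.float32))
    return words, np.vstack(vectors)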
What do you mean, "is super slow"? Compared to what?
Because it's a given text format, there's no way around reading the file line-by-line, parsing the floats, and assigning them into a usable structure. But you might be doing things very inefficiently – without seeing your code, it's hard to tell.
The gensim library in Python includes classes for working with word-vectors in this format, and its routines include an optional limit argument for reading just a certain number of vectors from the front of a file. For example, this will read the first 1,000 vectors from a file named word-vectors.txt:
from gensim.models import KeyedVectors

word_vecs = KeyedVectors.load_word2vec_format('word-vectors.txt',
                                               binary=False,
                                               limit=1000)
I've never noticed it as being a particularly slow operation, even when loading something like the 3GB+ set of word-vectors Google released. (If it does seem super-slow, it could be you have insufficient RAM, and the attempted load is relying on virtual memory paging – which you never want to happen with a random-access data structure like this.)
If you then save the vectors in gensim's native format, via .save(), and if the constituent numpy arrays are large enough to be saved as separate files, then you'd have the option of using gensim's native .load() with the optional mmap='r' argument. This would entirely skip any parsing of the raw on-disk numpy arrays, just memory-mapping them into addressable space – making .load() complete very quickly. Then, as ranges of the array are accessed, they'd be paged into RAM. You'd still be paying the cost of reading-from-disk all the data – but incrementally, as needed, rather than in a big batch up front.
For example...
word_vecs.save('word-vectors.gensim')
...then later...
word_vecs2 = KeyedVectors.load('word-vectors.gensim', mmap='r')
(There's no 'limit' option for the native .load().)

Find most repeated phrase on huge text

I have huge text data. My entire database is plain text in UTF-8.
I need a list of the most repeated phrases in my whole text data.
For example, my desired output would be something like this:
{
'a': 423412341,
'this': 423412341,
'is': 322472341,
'this is': 222472341,
'this is a': 122472341,
'this is a my': 5235634
}
Processing and storing each phrase takes a huge amount of database space, for example in MySQL or MongoDB.
The question is: is there a more efficient database or algorithm for finding this result? Solr, Elasticsearch, etc.?
I think a maximum of 10 words in each phrase would be good for me.
I'd suggest combining ideas from two fields, here: Streaming Algorithms, and the Apriori Algorithm From Market-Basket Analysis.
Let's start with the problem of finding the k most frequent single words without loading the entire corpus into memory. A very simple algorithm, Sampling (see Finding Frequent Items in Data Streams), can do so very easily. Moreover, it is very amenable to parallel implementation (described below). There is a plethora of work on top-k queries, including some on distributed versions (see, e.g., Efficient Top-K Query Calculation in Distributed Networks).
Now to the problem of the k most frequent phrases (of possibly multiple words). Clearly, the most frequent phrases of length l + 1 must contain the most frequent phrases of length l as a prefix, as appending a word to a phrase cannot increase its popularity. Hence, once you have the k most frequent single words, you can scan the corpus for only them (which is faster) to build the most frequent phrases of length 2. Using this, you can build the most frequent phrases of length 3, and so on. The stopping condition is when a phrase of length l + 1 does not evict any phrase of length l.
A Short Description of The Sampling Algorithm
This is a very simple algorithm which will, with high probability, find the top k items out of those having frequency at least f. It operates in two stages: the first finds candidate elements, and the second counts them.
In the first stage, randomly select ~ log(n) / f words from the corpus (note that this is much less than n). With high probability, all your desired words appear in the set of these words.
In the second stage, maintain a dictionary of the counts of these candidate elements; scan the corpus, and count the occurrences.
Output the top k of the items resulting from the second stage.
Note that the second stage is very amenable to parallel implementation. If you partition the text into different segments, and count the occurrences in each segment, you can easily combine the dictionaries at the end.
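A rough in-memory sketch of those two stages (the token list, k, and f are assumptions; a real streaming version would not hold all words in memory):
import math
import random
from collections import Counter

def sample_top_k(words, k, f):
    # Stage 1: randomly pick ~log(n)/f candidate words
    n = len(words)
    candidates = set(random.sample(words, min(int(math.log(n) / f), n)))
    # Stage 2: count occurrences of the candidates only
    counts = Counter(w for w in words if w in candidates)
    return counts.most_common(k)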
If you can store the data in Apache Solr, then the Luke Request Handler could be used to find the most common phrases. Example query:
http://127.0.0.1:8983/solr/admin/luke?fl=fulltext&numTerms=100
Additionally, the Terms Component may help find the most common individual words. Here is an article about Self Updating Solr Stopwords which uses the Terms Component to find the 100 most common indexed words and add them to the Stopwords file. Example query:
http://127.0.0.1:8983/solr/terms?terms.fl=fulltext&terms.limit=100
Have you considered using MapReduce?
Assuming you have access to a proper infrastructure, this seems to be a clear fit for it. You will need a tokenizer that splits lines into multi-word tokens up to 10 words. I don't think that's a big deal. The outcome from the MR job will be token -> frequency pairs, which you can pass to another job to sort them on the frequencies (one option). I would suggest to read up on Hadoop/MapReduce before considering other solutions. You may also use HBase to store any intermediary outputs.
Original paper on MapReduce by Google.
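For a concrete flavor, a Hadoop Streaming mapper/reducer pair in Python for emitting and summing n-gram counts (up to 10 words) might look roughly like this; it is only a sketch, assuming whitespace tokenization:
# mapper.py: emit "<n-gram>\t1" for every n-gram of 1 to 10 words
import sys

for line in sys.stdin:
    tokens = line.split()
    for n in range(1, 11):
        for i in range(len(tokens) - n + 1):
            print(" ".join(tokens[i:i + n]) + "\t1")

# reducer.py: input arrives sorted by key, so counts for one n-gram are contiguous
import sys

current, total = None, 0
for line in sys.stdin:
    gram, count = line.rsplit("\t", 1)
    if gram != current and current is not None:
        print(current + "\t" + str(total))
        total = 0
    current = gram
    total += int(count)
if current is not None:
    print(current + "\t" + str(total))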
Tokenize the text into n-grams of 1 to 10 words and insert them into 10 SQL tables by token length. Make sure to use a hash index on the column with the string tokens. Then just call SELECT token,COUNT(*) FROM tablename GROUP BY token on each table, dump the results somewhere, and wait.
EDIT: that would be infeasible for large datasets; instead, for each n-gram just update its count by +1 or insert a new row into the table (in MySQL the useful query would be INSERT ... ON DUPLICATE KEY UPDATE). You should definitely still use hash indexes, though.
After that, just sort by number of occurrences and merge the data from these 10 tables (you could do that in a single step, but that would put more strain on memory).
Be wary of heuristic methods like the one suggested by Ami Tavory; if you select the wrong parameters, you can get wrong results (a flaw of the sampling algorithm can be seen with some classic terms or phrases, e.g. "habeas corpus": neither "habeas" nor "corpus" will be selected as frequent by itself, but as a two-word phrase it may very well rank higher than some phrases you get by appending/prepending to a common word). There is surely no need to use such methods for tokens of lesser length; you could use them only when classic methods fail (take too much time or memory).
The top answer by Ami Tavory states:
Clearly, the most frequent phrases of length l + 1 must contain the most frequent phrases of length l as a prefix, as appending a word to a phrase cannot increase its popularity.
While it is true that appending a word to a phrase cannot increase its popularity, there is no reason to assume that the frequency of 2-grams is bounded by the frequency of 1-grams. To illustrate, consider the following corpus (constructed specifically to illustrate this point):
Here, a tricksy corpus will exist; a very strange, a sometimes cryptic corpus will dumbfound you maybe, perhaps a bit; in particular since my tricksy corpus will not match the pattern you expect from it; nor will it look like a fish, a boat, a sunflower, or a very handsome kitten. The tricksy corpus will surprise a user named Ami Tavory; this tricksy corpus will be fun to follow a year or a month or a minute from now.
Looking at the most frequent single words, we get:
1-Gram Frequency
------ ---------
a 12
will 6
corpus 5
tricksy 4
or 3
from 2
it 2
the 2
very 2
you 2
The method suggested by Ami Tavori would identify the top 1-gram, 'a', and narrow the search to 2-grams with the prefix 'a'. But looking at the corpus from before, the top 2-grams are:
2-Gram Frequency
------ ---------
corpus will 5
tricksy corpus 4
or a 3
a very 2
And moving on to 3-grams, there is only a single repeated 3-gram in the entire corpus, namely:
3-Gram Frequency
------ ---------
tricksy corpus will 4
To generalize: you can't use the top m-grams to extrapolate directly to top (m+1)-grams. What you can do is throw away the bottom m-grams, specifically the ones which do not repeat at all, and look at all the ones that do. That narrows the field a bit.
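A rough in-memory sketch of that pruning idea (it assumes the token list fits in memory, which the original question suggests it may not; it only illustrates the logic):
from collections import Counter

def repeated_ngrams(tokens, max_n=10):
    results = {}
    starts = range(len(tokens))  # positions whose (n-1)-gram survived the previous round
    for n in range(1, max_n + 1):
        counts = Counter(tuple(tokens[i:i + n]) for i in starts if i + n <= len(tokens))
        repeated = {g: c for g, c in counts.items() if c > 1}  # throw away n-grams that never repeat
        if not repeated:
            break
        results[n] = repeated
        starts = [i for i in starts if i + n <= len(tokens) and tuple(tokens[i:i + n]) in repeated]
    return results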
This can be simplified greatly. You don't need a database at all. Just store the full text in a file. Then write a PHP script to open and read the file contents. Use the PHP regex function to extract matches. Keep the total in a global variable. Write the results to another file. That's it.

How to efficiently compute similarity between documents in a stream of documents

I gather text documents (in Node.js), where a document i is represented as a list of words.
What is an efficient way to compute the similarity between these documents, taking into account that new documents are coming as a sort of stream of documents?
I currently use cosine similarity on the normalized frequency of the words within each document. I don't use TF-IDF (term frequency, inverse document frequency) because of the scalability issue, since I get more and more documents.
Initially
My first version was to start with the currently available documents, compute a big Term-Document matrix A, and then compute S = A^T x A so that S(i, j) is (after normalization by both norm(doc(i)) and norm(doc(j))) the cos-similarity between documents i and j whose word frequencies are respectively doc(i) and doc(j).
For new documents
What do I do when I get a new document doc(k)? Well, I have to compute the similarity of this document with all the previous ones, which doesn't require building a whole matrix. I can just take the inner product doc(k) · doc(j) for all previous j, and that gives S(k, j), which is great.
The troubles
Computing S in Node.js takes really long. Way too long, in fact! So I decided to create a C++ module which would do the whole thing much faster. And it does! But I cannot wait for it; I should be able to use intermediate results. And what I mean by "not wait for it" is both
a. wait for the computation to be done, but also
b. wait for the matrix A to be built (it's a big one).
Computing a new S(k, j) can take advantage of the fact that documents have far fewer words than the set of all the given words (which I use to build the whole matrix A). Thus, it looks faster to do it in Node.js, avoiding a lot of extra resources being used to access the data.
But is there any better way to do that?
Note: the reason I started computing S is that I can easily build A in Node.js, where I have access to all the data, and then do the matrix multiplication in C++ and get it back in Node.js, which speeds the whole thing up a lot. But now that computing S is becoming impracticable, it looks useless.
Note 2: yep, I don't have to compute the whole S; I can just compute the upper-right elements (or the lower-left ones), but that's not the issue. The computation-time problem is not of that order.
If one has to solve this today, just use pre-trained word vectors from fastText or word2vec.
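A minimal sketch of that idea with gensim in Python (the vector file, format, and tokenization are placeholders; the question's pipeline is in Node.js, so this is purely illustrative):
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("pretrained-vectors.bin", binary=True)

def doc_vector(words):
    # Average the pre-trained vectors of known words; zero vector if none are known
    vecs = [kv[w] for w in words if w in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

def doc_similarity(doc_a, doc_b):
    a, b = doc_vector(doc_a), doc_vector(doc_b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0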
