I want to try tf-idf with scikit-learn (or NLTK, or I am open to other suggestions). The data I have is a relatively large number of discussion forum posts (~65k) that we have scraped and stored in MongoDB. Each post has a post title, date and time of the post, the text of the post message (or a "re:" if it is a reply to an existing post), user name, message ID, and whether it is a child or parent post (in a thread, where you have the original post and then replies to the OP, or nested replies, i.e. the tree).
I figure each post would be a separate document, and, similar to the 20 newsgroups dataset, each document would have the fields I mentioned at the top and the text of the message post at the bottom, which I would extract out of Mongo and write into the required format as a text file.
For loading the data into scikit, I know of:
http://scikit-learn.org/dev/modules/generated/sklearn.datasets.load_files.html (but my data is not categorized)
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html - For the input, I know I would be using filenames, but because I would have a large number of files (one per post), is there a way to have the filenames read from a text file? Or is there an example implementation someone could point me towards?
Also, any advice on structuring the filenames for each of these discussion forum posts, so I can identify them later when I get the tf-idf vectors and the cosine similarity array?
Thanks
You can pass a Python generator or generator expression of either filenames or string objects instead of a list, and thus load the data lazily from the drive as you go. Here is a toy example of a CountVectorizer taking a generator expression as its argument:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> CountVectorizer().fit_transform(('a' * i for i in range(100)))
<100x98 sparse matrix of type '<class 'numpy.int64'>'
	with 98 stored elements in Compressed Sparse Row format>
Note that generator support makes it possible to vectorize the data directly from a MongoDB query result iterator rather than going through filenames.
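For example, here is a minimal sketch of that idea (hedged: pymongo, the forum_db/posts names, and the message/messageid fields are assumptions about your setup, not something scikit-learn requires):

from pymongo import MongoClient
from sklearn.feature_extraction.text import TfidfVectorizer

client = MongoClient()                       # hypothetical local MongoDB
coll = client['forum_db']['posts']           # hypothetical database / collection names

def iter_posts():
    # yield one string per post, in a deterministic order (sorted by messageid)
    for doc in coll.find({}, {'message': 1}).sort('messageid', 1):
        yield doc.get('message', '')

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(iter_posts())   # generator: no intermediate list of files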
Also, a list of 65k filenames of 10 chars each is just ~650 kB in memory (plus the overhead of the Python list), so it should not be a problem to load all the filenames ahead of time anyway.
any advice on structuring the filenames for each of these discussion forum posts, so I can identify them later when I get the tf-idf vectors and the cosine similarity array?
Just use a deterministic ordering to be able to sort the list of filenames before feeding them to the vectorizer.
I was able to get these tasks done. In case it is helpful, below is the code for specifying the set of text files you want to use, setting the flags, and passing the filenames to the vectorizer:
import os
from sklearn.feature_extraction.text import CountVectorizer

path = "/wherever/yourfolder/oftextfiles/are"
filenames = os.listdir(path)
filenames.sort()
try:
    filenames.remove('.DS_Store')  # because I am on a Mac
except ValueError:
    pass  # or scream: thing not in some_list!
except AttributeError:
    pass  # call security, some_list not quacking like a list!

# os.listdir() returns bare names, so prepend the directory path
filenames = [os.path.join(path, name) for name in filenames]

vectorizer = CountVectorizer(input='filename', analyzer='word', strip_accents='unicode', stop_words='english')
X = vectorizer.fit_transform(filenames)
The MongoDB part is basic, but for what it's worth (find all entries with boardid 10 and sort by messageid in ascending order):
cursor=coll.find({'boardid': 10 }).sort('messageid', 1)
Per Gensim's documentation, changelog, and previous StackOverflow answers, I know that passing training data in the LineSentence format via the corpus_file parameter can dramatically speed up Any2Vec training.
Documentation on the LineSentence format reads as follows:
Iterate over a file that contains sentences: one line = one sentence. Words must be already preprocessed and separated by whitespace.
My training data consists of tens of millions (and potentially 1xx million) of sentences extracted from plaintext files using spaCy. A sample sentence quite often contains one or more line-break characters (\n).
How can I make these samples compatible with the LineSentence format? As far as I understand, these samples should be "understood" in the context of their linebreaks, as these breaks are present in the target text (data not trained upon). That means I can't just remove them from the training data.
Do I escape the newline characters with \\n? Is there a way to pass a custom delimiter?
I appreciate any guidance. Thanks in advance.
LineSentence is only an appropriate iterable for classes like Word2Vec that expect a corpus to be a Python sequence where each item is a list of tokens.
The exact placement of linebreaks is unlikely to make much difference in usual word-vector training. If reading a file line-by-line, all words on the same line will still appear in each other's contexts. (Extra linebreaks will just prevent words from 'bleeding' over slightly into the contexts of preceding/subsequent texts – which in a large training corpus probably makes no net difference for end results.)
So mainly: don't worry about it.
If you think it might be a problem you could try either...
Removing the newlines between texts that you conjecture "should" be a single text, creating longer lines. (Note, though, that you don't want any of your texts to be over 10,000 tokens, or an internal implementation limit in Gensim will mean tokens past the first 10,000 are ignored.)
Replacing the newlines, in texts that you conjecture "should" be a single text, with some synthetic token, like say <nl> (or whatever); a sketch of this appears after this list.
...then evaluate whether the results have improved over simply not doing that. (I doubt they will improve, for basic Word2Vec/FastText training.)
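If you do try the second option, here is a minimal sketch of the preprocessing (the blank-line separator between texts and the <nl> marker token are assumptions about your data, not anything Gensim requires):

def to_linesentence(in_path, out_path, nl_token='<nl>'):
    """Write a LineSentence-format file: one text per line, tokens separated by spaces."""
    with open(in_path, encoding='utf8') as fin, open(out_path, 'w', encoding='utf8') as fout:
        for text in fin.read().split('\n\n'):             # assume blank lines separate texts
            tokens = text.replace('\n', ' ' + nl_token + ' ').split()
            if tokens:
                fout.write(' '.join(tokens) + '\n')        # one line = one training text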
For Doc2Vec, you might have to pay more attention to ensuring that all words of a 'document' are handled as a single text. In that case, you should make sure that whatever iterable sequence you have that produces TaggedDocument-like objects assigns the desired, same tag to all raw text that should be considered part of the same document.
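For example, here is a minimal Doc2Vec sketch along those lines (the documents dict and its contents are made up for illustration; the key point is that every line of a document ends up in one token list with one shared tag):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# toy data: each value is the full multi-line text of one document
documents = {
    'doc1': "first line of document one\nsecond line of document one",
    'doc2': "the only line of document two",
}

def iter_tagged(docs):
    for doc_id, text in docs.items():
        # all lines of a document become a single token list sharing a single tag
        yield TaggedDocument(words=text.replace('\n', ' ').split(), tags=[doc_id])

corpus = list(iter_tagged(documents))
model = Doc2Vec(vector_size=50, min_count=1, epochs=20)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)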
I have a standard word2vec output which is a .txt file formatted as follows:
[number of words] [dimension (300)]
word1 [300 float numbers separated by spaces]
word2 ...
Now I want to read at most M word representations out of this file. A simple way is to loop over the first M+1 lines of the file and store the M vectors in a numpy array, but this is super slow. Is there a faster way?
What do you mean, "is super slow"? Compared to what?
Because it's a given text format, there's no way around reading the file line-by-line, parsing the floats, and assigning them into a usable structure. But you might be doing things very inefficiently – without seeing your code, it's hard to tell.
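For reference, one reasonably efficient way to do it by hand is to read the header, preallocate the numpy array, and fill it row by row (a sketch, assuming the format shown above and words that contain no spaces):

import numpy as np

def load_first_m(path, M):
    words = []
    with open(path, encoding='utf8') as f:
        n_words, dim = map(int, f.readline().split())   # header: vocabulary size and dimension
        m = min(M, n_words)
        vecs = np.empty((m, dim), dtype=np.float32)      # preallocate instead of growing a list
        for i in range(m):
            parts = f.readline().rstrip().split(' ')
            words.append(parts[0])
            vecs[i] = [float(x) for x in parts[1:]]
    return words, vecs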
The gensim library in Python includes classes for working with word-vectors in this format, and its routines include an optional limit argument for reading just a certain number of vectors from the front of a file. For example, this will read the first 1,000 vectors from a file named word-vectors.txt:
from gensim.models import KeyedVectors

word_vecs = KeyedVectors.load_word2vec_format('word-vectors.txt',
                                               binary=False,
                                               limit=1000)
I've never noticed it being a particularly slow operation, even when loading something like the 3GB+ set of word-vectors Google released. (If it does seem super slow, it could be that you have insufficient RAM and the attempted load is relying on virtual-memory paging, which you never want to happen with a random-access data structure like this.)
If you then save the vectors in gensim's native format, via .save(), and if the constituent numpy arrays are large enough to be saved as separate files, then you'd have the option of using gensim's native .load() with the optional mmap='r' argument. This would entirely skip any parsing of the raw on-disk numpy arrays, just memory-mapping them into addressable space – making .load() complete very quickly. Then, as ranges of the array are accessed, they'd be paged into RAM. You'd still be paying the cost of reading-from-disk all the data – but incrementally, as needed, rather than in a big batch up front.
For example...
word_vecs.save('word-vectors.gensim')
...then later...
word_vecs2 = KeyedVectors.load('word-vectors.gensim', mmap='r')
(There's no 'limit' option for the native .load().)
I have been using quanteda for the past couple of months and really enjoy the package. One question I have is how many rows of a dfm the textstat_simil function can handle before the time to create the similarity matrix becomes too long.
I have a search corpus containing 15 million documents. Each document is a short sentence containing anywhere from 5 to 10 words (the documents sometimes include some 3-4 digit numbers too). I have tokenized this search corpus using character bigrams and created a dfm from it.
I also have another corpus that I call the match corpus. It has a couple hundred documents of similar length, has had the same tokenization, and a dfm created for it also. The aim is to find the closest matching document from the search corpus for each of the match corpus documents.
A combined dfm is made by rbinding the match dfm with the search dfm. The number of unique tokens (features) in the combined dfm is about 1,580. I then run textstat_simil on this combined dfm with the "cosine" method, "documents" as the margin, and the selection set to just one of the match corpus documents for now, as a test. However, textstat_simil takes over 5 minutes to run.
Is this sort of volume too much for this type of approach using quanteda?
Cheers,
Sof
In quanteda v1.3.13 we reprogrammed the function for computing cosine similarities so that it is more efficient in both memory and storage. However, it sounds like you are still trying to get a document-by-document similarity matrix which, excluding the diagonal, will have (15,000,000^2 - 15,000,000) / 2 ≈ 1.12e+14 cells. If you are able to get this to run at all, I'm very impressed with your machine!
For your target set of a couple hundred match documents, however, you can narrow this down by using the selection argument.
Also, look for the experimental textstat_proxy() function in v1.3.13, which we created for this sort of problem. You can specify a minimum distance below which a distance will not be recorded, and it returns a distance matrix using a sparse matrix object. This is still experimental because the sparse values are not zeroes, but will be treated as zeroes by any operations on the sparse matrix. (This violates some distance properties - see the discussion here.)
I am trying to run the nlp() pipeline on a series of transcripts amounting to 20,211,676 characters. I am running on a machine with 8 GB of RAM. I'm very new to both Python and spaCy, but the corpus comparison tools and sentence chunking features are perfect for the paper I'm working on now.
What I've tried
I've begun by importing the English pipeline and disabling 'ner' for faster speed:
import spacy

nlp = spacy.load('en_core_web_lg', disable=['ner'])
Then I break the corpus up into pieces of 800,000 characters, since spaCy recommends 100,000 characters per GB of RAM:
split_text = [text[i:i+800000] for i in range(0, len(text), 800000)]
Then I loop the pieces through the pipeline and create a list of Doc objects:
nlp_text = []
for piece in split_text:
    piece = nlp(piece)
    nlp_text.append(piece)
This works after a long wait. Note: I have tried upping the threshold via nlp.max_length, but anything above 1,200,000 breaks my Python session.
Now that I have everything piped through, I need to concatenate everything back, since I will eventually need to compare the whole document to another (of roughly equal size). I would also be interested in finding the most frequent noun phrases in the document as a whole, not just in artificial 800,000-character pieces.
nlp_text = ''.join(nlp_text)
However I get the error message:
TypeError: sequence item 0: expected str instance, spacy.tokens.doc.Doc found
I realize that I could convert each Doc to a string and concatenate those, but that would defeat the purpose of having token objects to work with.
What I need
Is there anything I can do (apart from paying for expensive AWS CPU time) to split my documents, run the nlp() pipeline, and then join the tokens to reconstruct my complete document as an object of study? Am I running the pipeline wrong for a big document? Am I doomed to getting 64 GB of RAM somewhere?
Edit 1: Response to Ongenz
(1) Here is the error message I receive
ValueError: [E088] Text of length 1071747 exceeds maximum of 1000000.
The v2.x parser and NER models require roughly 1GB of temporary memory
per 100,000 characters in the input. This means long texts may cause
memory allocation errors. If you're not using the parser or NER, it's
probably safe to increase the nlp.max_length limit. The limit is in
number of characters, so you can check whether your inputs are too
long by checking len(text).
I could not find a part of the documentation that refers to this directly.
(2) My goal is to do a series of measures including (but not limited to, if the need arises): word frequency, tf-idf counts, sentence counts, counting the most frequent noun chunks, and comparing two corpora using w2v or d2v strategies.
My understanding is that I need every part of the spaCy pipeline apart from NER for this.
(3) You are completely right about cutting the document; in a perfect world I would cut on a line break instead. But as you mentioned, I cannot use join to regroup my broken-apart corpus, so it might not be relevant anyway.
You need to join the resulting Docs using the Doc.from_docs method:
from spacy.tokens import Doc  # Doc.from_docs is available in spaCy v3.0+

docs = []
for piece in split_text:
    doc = nlp(piece)
    docs.append(doc)

merged = Doc.from_docs(docs)
See the documentation here for more details.
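Once merged, the combined Doc behaves like any other Doc; for instance, here is a quick sketch of the noun-chunk counting mentioned in the question:

from collections import Counter

# most frequent noun chunks across the whole reconstructed document
chunk_counts = Counter(chunk.text.lower() for chunk in merged.noun_chunks)
print(chunk_counts.most_common(20))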
Thanks for taking the time to read my question!
So I am running an experiment to see if I can predict whether an individual has been diagnosed with depression (or at least says they have been) based on the words (or tokens) they use in their tweets. I found 139 users that at some point tweeted "I have been diagnosed with depression" or some variant of this phrase in an earnest context (i.e. not joking or sarcastic; human beings who were native speakers of the language of the tweet were used to discern whether the tweet was genuine or not).
I then collected the entire public timeline of each of these users, giving me a "depressed user tweet corpus" of about 17,000 tweets.
Next, I created a database of about 4,000 random "control" users, and from their timelines created a "control tweet corpus" of about 800,000 tweets.
Then I combined them both into a big dataframe, which looks like this:
,class,tweet
0,depressed,tweet text .. *
1,depressed,tweet text.
2,depressed,# tweet text
3,depressed,저 tweet text
4,depressed,# tweet text😚
5,depressed,# tweet text😍
6,depressed,# tweet text ?
7,depressed,# tweet text ?
8,depressed,tweet text *
9,depressed,# tweet text ?
10,depressed,# tweet text
11,depressed,tweet text *
12,depressed,#tweet text
13,depressed,
14,depressed,tweet text !
15,depressed,tweet text
16,depressed,tweet text. .
17,depressed,tweet text
...
50595,control,#tweet text?
150596,control,"# tweet text."
150597,control,# tweet text.
150598,control,"# tweet text. *"
150599,control,"#tweet text?"t
150600,control,"# tweet text?"
150601,control,# tweet text?
150602,control,# tweet text.
150603,control,#tweet text~
150604,control,# tweet text.
Then I trained a multinomial Naive Bayes classifier, using counts produced by a CountVectorizer object imported from the sklearn library:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(tweet_corpus['tweet'].values)
classifier = MultinomialNB()
targets = tweet_corpus['class'].values
classifier.fit(counts, targets)
# MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Unfortunately, after running a 6-fold cross-validation test, the results suck, and I am trying to figure out why.
Total tweets classified: 613952
Score: 0.0
Confusion matrix:
[[596070 743]
[ 17139 0]]
So, I didn't correctly predict a single depressed person's tweet! My initial thought is that I have not properly normalized the counts of the control group, and therefore even tokens which appear more frequently in the depressed user corpus are over-represented in the control tweet corpus due to its much larger size. I was under the impression that .fit() did this already, so maybe I am on the wrong track here? If not, any suggestions on the most efficient way to normalize the data between two groups of disparate size?
You should use re-sampling techniques to deal with unbalanced classes. There are many ways to do that "by hand" in Python, but I recommend imbalanced-learn, which compiles re-sampling techniques commonly used for datasets showing strong between-class imbalance.
If you are using Anaconda, you can use:
conda install -c glemaitre imbalanced-learn
or simply:
pip install -U imbalanced-learn
This library is compatible with scikit-learn; a sketch of how it could slot into your code is shown below. Your dataset looks very interesting, is it public? Hope this helps.
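For example, here is a minimal sketch of random under-sampling plugged into the code from your question (RandomUnderSampler is just one of the strategies the library offers; tweet_corpus is the dataframe from your question):

from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(tweet_corpus['tweet'].values)
targets = tweet_corpus['class'].values

# down-sample the majority ('control') class so both classes have the same size
rus = RandomUnderSampler(random_state=0)
counts_resampled, targets_resampled = rus.fit_resample(counts, targets)

classifier = MultinomialNB()
classifier.fit(counts_resampled, targets_resampled)

Under-sampling keeps the per-class token frequencies comparable before the classifier sees them; over-sampling the minority class is another option the library supports.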