Faster way of reading word2vec txt in Python

I have a standard word2vec output which is a .txt file formatted as follows:
[number of words] [dimension (300)]
word1 [300 float numbers separated by spaces]
word2 ...
Now I want to read at most M word representations out of this file. A simple way is to loop over the first M+1 lines in the file and store the M vectors into a numpy array. But this is super slow; is there a faster way?

What do you mean, "is super slow"? Compared to what?
Because it's a plain text format, there's no way around reading the file line by line, parsing the floats, and assigning them into a usable structure. But you might be doing that very inefficiently; without seeing your code, it's hard to tell.
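For reference, here's a minimal sketch of one reasonably efficient way to do that manual parse, preallocating a float32 numpy array rather than growing Python lists of floats (the function name, file path, M, and dimension below are just placeholders, not your code):
import numpy as np

def load_first_m(path, m, dim=300):
    # hypothetical helper: read at most m vectors from a word2vec-format .txt file
    words = []
    vecs = np.empty((m, dim), dtype=np.float32)  # preallocate instead of appending floats
    with open(path, encoding='utf-8') as f:
        next(f)  # skip the "[number of words] [dimension]" header line
        for i, line in enumerate(f):
            if i >= m:
                break
            parts = line.rstrip().split(' ')
            words.append(parts[0])
            vecs[i] = np.asarray(parts[1:dim + 1], dtype=np.float32)
    return words, vecs[:len(words)]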
The gensim library in Python includes classes for working with word-vectors in this format, and its routines include an optional limit argument for reading just a certain number of vectors from the front of a file. For example, this will read the first 1,000 from a file named word-vectors.txt:
from gensim.models import KeyedVectors

word_vecs = KeyedVectors.load_word2vec_format('word-vectors.txt',
                                               binary=False,
                                               limit=1000)
I've never noticed it being a particularly slow operation, even when loading something like the 3GB+ set of word-vectors Google released. (If it does seem super slow, it could be that you have insufficient RAM and the attempted load is relying on virtual-memory paging, which you never want to happen with a random-access data structure like this.)
If you then save the vectors in gensim's native format via .save(), and if the constituent numpy arrays are large enough to be saved as separate files, you'd have the option of using gensim's native .load() with the optional mmap='r' argument. This would entirely skip any parsing of the raw on-disk numpy arrays, just memory-mapping them into addressable space, making .load() complete very quickly. Then, as ranges of the array are accessed, they'd be paged into RAM. You'd still pay the cost of reading all the data from disk, but incrementally, as needed, rather than in one big batch up front.
For example...
word_vecs.save('word-vectors.gensim')
...then later...
word_vecs2 = KeyedVectors.load('word-vectors.gensim', mmap='r')
(There's no 'limit' option for the native .load().)

Related

My training data contains line breaks; how can I work with Gensim's LineSentence format for the corpus_file parameter?

Per Gensim's documentation, changelog, and previous StackOverflow answers, I know that passing training data in the LineSentence format to the corpus_file parameter can dramatically speed up Any2Vec training.
Documentation on the LineSentence format reads as follows:
Iterate over a file that contains sentences: one line = one sentence. Words must be already preprocessed and separated by whitespace.
My training data comprises tens of millions (and potentially over 100 million) of sentences extracted from plaintext files using spaCy. A sample sentence quite often contains one or more line-break characters (\n).
How can I make these samples compatible with the LineSentence format? As far as I understand, these samples should be "understood" in the context of their linebreaks, as these breaks are present in the target text (data not trained upon). That means I can't just remove them from the training data.
Do I escape the newline characters with \\n? Is there a way to pass a custom delimiter?
I appreciate any guidance. Thanks in advance.
LineSentence is only an appropriate iterable for classes like Word2Vec that expect a corpus to be a Python sequence in which each item is a list of tokens.
The exact placement of linebreaks is unlikely to make much difference in usual word-vector training. If reading a file line-by-line, all words on the same line will still appear in each other's contexts. (Extra linebreaks will just prevent words from 'bleeding' over slightly into the contexts of preceding/subsequent texts – which in a large training corpus probably makes no net difference for end results.)
So mainly: don't worry about it.
If you think it might be a problem you could try either...
Removing the newlines between texts that you conjecture "should" be a single text, creating longer lines. (Note, though, that you don't want any of your texts to be over 10,000 tokens, or else an internal implementation limit in Gensim will mean tokens past the first 10,000 are ignored.)
Replacing the newlines, in texts that you conjecture "should" be a single text, with some synthetic token, like say <nl> (or whatever); a minimal sketch of this appears after this list.
...then evaluate whether the results have improved over simply not doing that. (I doubt they will improve, for basic Word2Vec/FastText training.)
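If you do try the second option, the preprocessing might look like the following sketch (texts is assumed to be an iterable of strings, each possibly containing internal newlines; the output file name is a placeholder):
# Hypothetical preprocessing sketch: replace internal newlines with a synthetic
# token and write one space-separated text per line, as LineSentence/corpus_file expects.
with open('corpus_linesentence.txt', 'w', encoding='utf-8') as fout:
    for text in texts:
        tokens = text.replace('\n', ' <nl> ').split()  # newline -> synthetic token
        fout.write(' '.join(tokens) + '\n')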
For Doc2Vec, you might have to pay more attention to ensuring that all words of a 'document' are handled as a single text. In that case, you should make sure that whatever iterable sequence you have that produces TaggedDocument-like objects assigns the desired, same tag to all raw text that should be considered part of the same document.
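For example, a minimal sketch of such an iterable (documents here is a hypothetical iterable of (doc_id, full_text) pairs; the point is only that all words of one document share a single tag):
from gensim.models.doc2vec import TaggedDocument

def tagged_corpus(documents):
    # 'documents' is assumed to yield (doc_id, full_text) pairs (hypothetical input)
    for doc_id, full_text in documents:
        tokens = full_text.split()  # simple whitespace tokenization, as in LineSentence
        yield TaggedDocument(words=tokens, tags=[doc_id])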

Batch running spaCy nlp() pipeline for large documents

I am trying to run the nlp() pipeline on a series of transcripts amounting to 20,211,676 characters. I am running on a machine with 8 GB of RAM. I'm very new to both Python and spaCy, but the corpus-comparison tools and sentence-chunking features are perfect for the paper I'm working on now.
What I've tried
I've begun by loading the English pipeline and disabling 'ner' for faster speeds:
nlp = spacy.load('en_core_web_lg', disable = ['ner'])
Then I break the corpus up into pieces of 800,000 characters, since spaCy recommends 100,000 characters per GB of RAM:
split_text = [text[i:i+800000] for i in range(0, len(text), 800000)]
Then I loop the pieces through the pipeline and create a list of Doc objects:
nlp_text = []
for piece in split_text:
    piece = nlp(piece)
    nlp_text.append(piece)
This works after a long wait. Note: I have tried raising the threshold via nlp.max_length, but anything above 1,200,000 breaks my Python session.
Now that I have everything piped through I need to concatenate everything back since I will eventually need to compare the whole document to another (of roughly equal size). Also I would be interested in finding the most frequent noun-phrases in the document as a whole, not just in artificial 800,000 character pieces.
nlp_text = ''.join(nlp_text)
However I get the error message:
TypeError: sequence item 0: expected str instance,
spacy.tokens.doc.Doc found
I realize that I could convert to strings and concatenate, but that would defeat the purpose of having "token" objects to work with.
What I need
Is there anything I can do (apart from paying for expensive AWS CPU time) to split my documents, run the nlp() pipeline, and then join the tokens to reconstruct my complete document as an object of study? Am I running the pipeline wrong for a big document? Am I doomed to getting 64 GB of RAM somewhere?
Edit 1: Response to Ongenz
(1) Here is the error message I receive
ValueError: [E088] Text of length 1071747 exceeds maximum of 1000000.
The v2.x parser and NER models require roughly 1GB of temporary memory
per 100,000 characters in the input. This means long texts may cause
memory allocation errors. If you're not using the parser or NER, it's
probably safe to increase the nlp.max_length limit. The limit is in
number of characters, so you can check whether your inputs are too
long by checking len(text).
I could not find a part of the documentation that refers to this directly.
(2) My goal is to do a series of measures including (but not limited to, if the need arises): word frequency, tf-idf counts, sentence counts, counts of the most frequent noun-chunks, and comparing two corpora using w2v or d2v strategies.
My understanding is that I need every part of the spaCy pipeline apart from NER for this.
(3) You are completely right about cutting the document; in a perfect world I would cut on a line break instead. But as you mentioned, I cannot use join to regroup my broken-apart corpus, so it might not be relevant anyway.
You need to join the resulting Docs using the Doc.from_docs method:
from spacy.tokens import Doc

docs = []
for piece in split_text:
    doc = nlp(piece)
    docs.append(doc)

merged = Doc.from_docs(docs)
See the documentation here for more details.
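As a follow-up, since the question also asks about the most frequent noun-phrases over the whole document: the merged Doc keeps the per-token annotations (and the parser was left enabled in the spacy.load call above), so something like this sketch should work:
from collections import Counter

# count noun chunks over the merged Doc (a sketch, untested on your data)
noun_chunk_counts = Counter(chunk.text.lower() for chunk in merged.noun_chunks)
print(noun_chunk_counts.most_common(20))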

How do I limit word length in FastText?

I am using FastText to compute skipgrams on a corpus containing a long sequence of characters with no spaces. After an hour or so, FastText produces a model containing vectors (of length 100) corresponding to "words" of length 50 characters from the corpus.
I tried setting the -minn and -maxn parameters, but that does not help (I kind of knew it wouldn't, but tried anyway), and the -wordNgrams parameter only applies if there are spaces, I guess (?!). This is just a long stream of characters representing state, without spaces.
The documentation doesn't seem to have any information on this (or perhaps I'm missing something?)
The tool just takes whatever space-delimited tokens you feed it.
If you want to truncate, or discard, tokens that are longer than 50 characters (or any other threshold), you'd need to preprocess the data yourself.
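For example, a minimal preprocessing sketch in Python (file names and the 50-character threshold are placeholders) that drops over-long space-delimited tokens before you hand the file to fasttext; swap the filter for tok[:MAX_LEN] if you'd rather truncate than discard:
MAX_LEN = 50  # hypothetical threshold

with open('corpus_raw.txt', encoding='utf-8') as fin, \
        open('corpus_clean.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        kept = [tok for tok in line.split() if len(tok) <= MAX_LEN]  # drop over-long tokens
        fout.write(' '.join(kept) + '\n')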
(If your question is actually something else, add more details to the question showing example lines from your corpus, how you're invoking fasttext on it, how you are reviewing unsatisfactory results, and how you would expect satisfactory results to look instead.)

SystemML binary format

SystemML comes packaged with a range of scripts that generate random input data files for use by the various algorithms. Each script accepts an option 'format' which determines whether the data files should be written in CSV or binary format.
I've taken a look at the binary files but they're not in any format I recognize. There doesn't appear to be documentation anywhere online. What is the binary format? What fields are in the header? For dense matrices, are the data contiguously packed at the end of the file (IEEE-754 32-bit float), or are there metadata fields spaced throughout the file?
Essentially, our binary format for matrices and frames is Hadoop sequence files (a single file or a directory of part files) of type <MatrixIndexes,MatrixBlock> (with MatrixIndexes being a long-long pair for row/column block indexes) and <LongWritable,FrameBlock>, respectively. So anybody with Hadoop IO libraries and SystemML in the classpath can consume these files.
In detail, this binary blocked format is our internal tiled matrix representation (with default blocksize of 1K x 1K entries, and hence fixed logical but potentially variable physical size). Any external format provided to SystemML, such as csv or matrix market, is automatically converted into binary block format and all operations work over these binary intermediates. Depending on the backend, there are different representations, though:
For singlenode, in-memory operations and storage, the entire matrix is represented as a single block in deserialized form (where we use linearized double arrays for dense and MCSR, CSR, or COO for sparse).
For spark operations and storage, a matrix is represented as JavaPairRDD<MatrixIndexes, MatrixBlock> and we use MEMORY_AND_DISK (deserialized) as default storage level in aggregated memory.
For mapreduce operations and storage, matrices are actually persisted to sequence files (similar to inputs/outputs).
Furthermore, in serialized form (as written to sequence files or during shuffle), matrix blocks are encoded in one of the following: (1) empty (header: int rows, int cols, byte type), (2) dense (header plus serialized double values), (3) sparse (header plus for each row: nnz per row, followed by column index, value pairs), (4) ultra-sparse (header plus triples of row/column indexes and values, or pairs of row indexes and values for vectors). Note that we also redirect java serialization via writeExternal(ObjectOutput os) and readExternal(ObjectInput is) to the same serialization code path.
There are more details, especially with regard to the recently added compressed matrix blocks and frame blocks - so please ask if you're interested in anything specific here.

tf-idf on a somewhat large (65k) amount of text files

I want to try tf-idf with scikit-learn (or nltk; I am open to other suggestions). The data I have is a relatively large number of discussion forum posts (~65k) we have scraped and stored in MongoDB. Each post has a post title, the date and time of the post, the text of the post message (or a "re:" if it is a reply to an existing post), a user name, a message ID, and whether it is a child or parent post (in a thread, where you have the original post, then replies to this OP or nested replies: the tree).
I figure each post would be a separate document; similar to the 20 newsgroups dataset, each document would have the fields I mentioned at the top and the text of the message post at the bottom, which I would extract out of Mongo and write into the required format for each text file.
For loading the data into scikit, I know of:
http://scikit-learn.org/dev/modules/generated/sklearn.datasets.load_files.html (but my data is not categorized)
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html - For the input, I know I would be using filenames, but because I would have a large number of files (one per post), is there a way to have the filenames read from a text file? Or is there some example implementation someone could point me towards?
Also, any advice on structuring the filenames for each of these discussion forum posts, for later identification when I get the tf-idf vectors and cosine-similarity array?
Thanks
You can pass a Python generator or a generator expression of either filenames or string objects instead of a list, and thus lazily load the data from the drive as you go. Here is a toy example of a CountVectorizer taking a generator expression as its argument:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> CountVectorizer().fit_transform(('a' * i for i in xrange(100)))
<100x98 sparse matrix of type '<type 'numpy.int64'>'
with 98 stored elements in Compressed Sparse Column format>
Note that generator support can make it possible to vectorize the data directly from a MongoDB query result iterator rather than going through filenames.
Also a list of 65k filenames of 10 chars each is just 650kB in memory (+ the overhead of the python list) so it should not be a problem to load all the filenames ahead of time anyway.
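For example, a hypothetical sketch of vectorizing straight from a MongoDB cursor (the collection, query, and 'text' field name are placeholders, not your actual schema):
from sklearn.feature_extraction.text import TfidfVectorizer

# 'coll' is assumed to be a pymongo collection; 'text' is a placeholder field name
def post_texts():
    for post in coll.find({'boardid': 10}).sort('messageid', 1):
        yield post['text']

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(post_texts())  # generator input, no intermediate files on disk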
any advice on structuring the filenames for each of these discussion forum posts, for later identifying when I get the tfidf vectors and cosine similarity array
Just use a deterministic ordering to be able to sort the list of filenames before feeding them to the vectorizer.
I was able to get these tasks done. In case it is helpful, below is the code for specifying a set of text files you want to use, how to set the flags, and how to pass the filenames:
import os
from sklearn.feature_extraction.text import CountVectorizer

path = "/wherever/yourfolder/oftextfiles/are"
filenames = os.listdir(path)
filenames.sort()
try:
    filenames.remove('.DS_Store')  # Because I am on a Mac
except ValueError:
    pass  # or scream: thing not in some_list!
except AttributeError:
    pass  # call security, some_list not quacking like a list!
# prepend the directory so CountVectorizer(input='filename') can find the files
filenames = [os.path.join(path, name) for name in filenames]
vectorizer = CountVectorizer(input='filename', analyzer='word', strip_accents='unicode', stop_words='english')
X = vectorizer.fit_transform(filenames)
The MongoDB part is basic, but for what it's worth (find all entries with boardid 10 and sort by messageid in ascending order):
cursor=coll.find({'boardid': 10 }).sort('messageid', 1)
