Paragraph Vector or Doc2vec model size - nlp

I am using the deeplearning4j Java library to build a paragraph vector (doc2vec) model of dimension 100. I am training on a text file with around 17 million lines; the file is 330 MB.
I can train the model and compute paragraph vectors, which give reasonably good results.
The problem is that when I save the model to disk with WordVectorSerializer.writeParagraphVectors (a dl4j method), it takes around 20 GB of space, and around 30 GB when I use the native Java serializer.
I'm thinking maybe the model is too big for that much data. Is a model size of 20 GB reasonable for 330 MB of text data?
Comments are also welcome from people who have used doc2vec/paragraph vectors in other libraries or languages.
Thank you!

I'm not familiar with the dl4j implementation, but model size is dominated by the number of unique word-vectors/doc-vectors, and the chosen vector size.
(330MB / 17 million) means each of your documents averages only 20 bytes – very small for Doc2Vec!
But if for example you're training up a 300-dimensional doc-vector for each doc, and each dimension is (as typical) a 4-byte float, then (17 million * 300 dims * 4 bytes/dim) = 20.4GB. And then there'd be more space for word-vectors and model inner-weights/vocabulary/etc, so the storage sizes you've reported aren't implausible.
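As a rough sanity-check, that arithmetic looks like this (the 300-dimension figure is the illustrative value above, not the 100 dimensions actually used in the question, and the vocabulary size is an assumption):

# Back-of-envelope estimate of serialized model size (illustrative values only)
n_docs = 17_000_000       # one document per line, from the question
dims = 300                # example dimensionality used above (the question used 100)
bytes_per_float = 4       # typical float32
vocab_size = 1_000_000    # assumed unique-word count; not given in the question

doc_vector_gb = n_docs * dims * bytes_per_float / 1e9
word_vector_gb = vocab_size * dims * bytes_per_float / 1e9  # plus a similar amount again for model inner weights

print(round(doc_vector_gb, 1))   # 20.4
print(round(word_vector_gb, 1))  # 1.2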
With the sizes you've described, there's also a big risk of overfitting - if using 300 dimensions, you'd be modeling docs of <20 bytes of source material as (300*4=) 1200-byte doc-vectors.
To some extent, that makes the model tend towards a giant, memorized-inputs lookup table, and thus less-likely to capture generalizable patterns that help understand training docs, or new docs. Effective learning usually instead looks somewhat like compression: modeling the source materials as something smaller but more salient.

Related

What are the negative & sample parameters?

I am new to NLP and Doc2Vec. I want to understand the parameters of Doc2Vec. Thank you
Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, sample=0, seed=0)
vector_size: I believe this is to control over-fitting. A larger feature vector will learn more details, so it tends to over-fit. Is there a method to determine an appropriate vector size based on the number of documents or the total words across all docs?
negative: how many “noise words” should be drawn. What is a noise word?
sample: the threshold for configuring which higher-frequency words are randomly down-sampled. So what does sample=0 mean?
As a beginner, only vector_size will be of initial interest.
Typical values are 100-1000, but larger dimensionalities require far more training data & more memory. There are no hard & fast rules – try different values, & see what works for your purposes.
Very vaguely, you'll want your count of unique vocabulary words to be much larger than the vector_size, at least the square of the vector_size: the gist of the algorithm is to force many words into a smaller number of dimensions. (If for some reason you're running experiments on tiny amounts of data with a tiny vocabulary – for which word2vec isn't really good anyway – you'll have to shrink the vector_size very low.)
The negative value controls a detail of how the internal neural network is adjusted: how many random 'noise' words the network is tuned away from predicting for each target positive word it's tuned towards predicting. The default of 5 is good unless/until you have a repeatable way to rigorously score other values against it.
Similarly, sample controls how much (if at all) more-frequent words are sometimes randomly skipped (down-sampled). (So many redundant usage examples are overkill, wasting training time/effort that could better be spent on rarer words.) Again, you'd only want to tinker with this if you've got a way to compare the results of alternate values. Smaller values make the downsampling more aggressive (dropping more words). sample=0 would turn off such down-sampling completely, leaving all training text words used.
Though you didn't ask:
dm=0 turns off the default PV-DM mode in favor of the PV-DBOW mode. That will train doc-vectors a bit faster, and often works very well on short texts, but won't train word-vectors at all (unless you turn on an extra dbow_words=1 mode to add back interleaved skip-gram word-vector training).
hs is an alternate mode to train the neural-network that uses multi-node encodings of words, rather than one node per (positive or negative) word. If enabled via hs=1, you should disable the negative-sampling with negative=0. But negative-sampling mode is the default for a reason, & tends to get relatively better with larger amounts of training data - so it's rare to use this mode.
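For concreteness, here's a minimal gensim sketch tying those parameters together (the toy corpus and parameter values are placeholders for illustration, not tuning advice):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus -- replace with your own tokenized documents.
texts = [["human", "machine", "interface"], ["graph", "of", "trees"]]
corpus = [TaggedDocument(words, tags=[i]) for i, words in enumerate(texts)]

model = Doc2Vec(
    documents=corpus,
    dm=0,             # PV-DBOW mode
    dbow_words=1,     # also train skip-gram word-vectors (optional, slower)
    vector_size=100,  # typical values 100-1000; must suit your data size
    negative=5,       # number of random 'noise words' per positive example
    hs=0,             # hierarchical softmax off (use hs=1 with negative=0 instead)
    sample=1e-5,      # downsample very frequent words; 0 disables downsampling
    min_count=1,      # kept at 1 here only because the toy corpus is tiny
    epochs=20,
)

vector = model.infer_vector(["human", "interface"])  # doc-vector for a new text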

Size of input allowed in AllenNLP - Predict when using a Predictor

Does anyone have an idea what the input text size limit is that can be passed to the
predict(passage, question) method of the AllenNLP Predictors?
I have tried with a passage of 30-40 sentences, which works fine. But it stops working for me when I pass a significant amount of text, around 5K statements.
Which model are you using? Some models truncate the input, others try to handle arbitrary length input using a sliding window approach. With the latter, the limit will depend on the memory available on your system.
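I'm not certain of the exact limit either, but if your model is one that truncates, a common workaround is to split the passage yourself and run the predictor per chunk. A rough sketch, where the model path, the chunk size, and the 'best_span_str' output key are assumptions you'd adapt to your model:

from allennlp.predictors.predictor import Predictor

# Placeholder path -- substitute the reading-comprehension model you are actually using.
predictor = Predictor.from_path("path/or/url/to/your-rc-model.tar.gz")

def predict_long_passage(passage, question, max_words=400):
    # Naively split the passage into word chunks and return one answer per chunk.
    # Assumes the model's output contains 'best_span_str'; adjust the key to your model.
    words = passage.split()
    chunks = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    answers = [predictor.predict(passage=chunk, question=question) for chunk in chunks]
    return [a.get("best_span_str") for a in answers]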

How to fit a huge distance matrix into a memory?

I have a huge distance matrix of size around 590000 * 590000 (the data type of each element is float16). Will it fit in memory for a clustering algorithm? If not, could anyone give an idea of how to use it with the DBSCAN clustering algorithm?
590000 * 590000 * 2 bytes (float16 size) = 696.2 GB of RAM
It won't fit in memory on a standard computer. Moreover, float16 values are converted to float32 in order to perform computations (see "Python numpy float16 datatype operations, and float8?"), so it might use a lot more than 700 GB of RAM.
Why do you have a square matrix? Can't you use a condensed matrix? It would use half the memory needed for a square matrix.
Clustering in chunks to decrease the problem size for DBSCAN can, for example, be done by splitting the data into areas with overlapping regions.
The size of the overlapping regions has to fit your problem.
Find a reasonable size for the chunks of your problem and for the overlapping region.
Afterwards, stitch the results together manually by iterating over and comparing the clusters found in the overlapping regions.
You have to check whether the elements in one cluster are also present in other chunks.
You might have to apply some stitching parameters, e.g. if some number of elements are shared by clusters in two different chunks, treat them as the same cluster.
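A minimal sketch of that stitching step, assuming you already have per-chunk DBSCAN labels keyed by global point ids (the function names and the min_shared rule are illustrative):

from collections import defaultdict

def stitch(chunk_labels, min_shared=1):
    # chunk_labels: list of dicts mapping global point id -> cluster label (-1 = noise).
    # Clusters from different chunks sharing >= min_shared points become one cluster.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        if parent[x] != x:
            parent[x] = find(parent[x])
        return parent[x]

    def union(a, b):
        parent[find(a)] = find(b)

    # For each point, record which (chunk, cluster) pairs it belongs to.
    point_members = defaultdict(list)
    for ci, labels in enumerate(chunk_labels):
        for point, label in labels.items():
            if label != -1:
                point_members[point].append((ci, label))

    # Count shared points per pair of clusters, then merge pairs over the threshold.
    shared = defaultdict(int)
    for members in point_members.values():
        for a in members:
            for b in members:
                if a < b:
                    shared[(a, b)] += 1
    for (a, b), count in shared.items():
        if count >= min_shared:
            union(a, b)

    # Final label per point: the representative of its merged cluster.
    # Points that are noise (-1) in every chunk are omitted.
    return {p: find(m[0]) for p, m in point_members.items()}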
I just saw this:
"The problem apparently is a non-standard DBSCAN implementation in scikit-learn. DBSCAN does not need a distance matrix."
But this has probably been fixed years ago.
Which implementation are you using?
DBSCAN only needs the neighbors of each point.
So if you would know the appropriate parameters (which I doubt), you could read the huge matrix one row at a time, and build a list of neighbors within your distance threshold. Assuming that less than 1% are neighbors (on such huge data, you'll likely want to go even lower) that would reduce the memory need 100x.
But usually you want to avoid computing such a matrix at all!
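As a sketch of that row-at-a-time idea with scikit-learn (the file name, eps, and min_samples are placeholders; this assumes the matrix is stored on disk as a raw float16 array and that well under 1% of pairs fall within eps):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

n = 590_000
eps = 0.5  # placeholder distance threshold -- must suit your data

# Memory-map the on-disk matrix so only one row is read into RAM at a time.
dist = np.memmap("distances.f16", dtype=np.float16, mode="r", shape=(n, n))

# Keep only distances <= eps, building a sparse neighbourhood graph in CSR form.
indptr, indices, data = [0], [], []
for i in range(n):
    row = dist[i].astype(np.float32)
    cols = np.flatnonzero(row <= eps)
    indices.extend(cols.tolist())
    data.extend(row[cols].tolist())
    indptr.append(len(indices))
graph = csr_matrix((data, indices, indptr), shape=(n, n), dtype=np.float32)

# With metric='precomputed', entries missing from the sparse matrix are treated
# as non-neighbours, so the ~700 GB dense matrix is never held in memory.
labels = DBSCAN(eps=eps, min_samples=5, metric="precomputed").fit_predict(graph)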

How can Stanford CoreNLP Named Entity Recognition capture measurements like 5 inches, 5", 5 in., 5 in

I'm looking to capture measurements using Stanford CoreNLP. (If you can suggest a different extractor, that is fine too.)
For example, I want to find 15kg, 15 kg, 15.0 kg, 15 kilogram, 15 lbs, 15 pounds, etc. But among CoreNLP's extraction rules, I don't see one for measurements.
Of course, I can do this with pure regexes, but toolkits can run more quickly, and they offer the opportunity to chunk at a higher level, e.g. to treat gb and gigabytes together, and RAM and memory as building blocks--even without full syntactic parsing--as they build bigger units like 128 gb RAM and 8 gigabytes memory.
I want an extractor for this that is rule-based (not machine-learning-based), but I don't see one as part of RegexNer or elsewhere. How do I go about this?
IBM Named Entity Extraction can do this. The regexes are run in an efficient way rather than passing the text through each one. And the regexes are bundled to express meaningful entities, as for example one that unites all the measurement units into a single concept.
I don't think a rule-based system exists for this particular task. However, it shouldn't be hard to make with TokensregexNER. For example, a mapping like:
[{ner:NUMBER}]+ /(k|m|g|t)b/ memory? MEMORY
[{ner:NUMBER}]+ /"|''|in(ches)?/ LENGTH
...
You could try using vanilla TokensRegex as well, and then just extract out the relevant value with a capture group:
(?$group_name [{ner:NUMBER}]+) /(k|m|g|t)b/ memory?
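For comparison, the pure-regex route mentioned in the question can be sketched in plain Python; the unit lists and patterns below are deliberately incomplete and only show the shape of the approach:

import re

# Illustrative, far-from-complete unit inventory grouped into concepts.
UNITS = {
    "WEIGHT": r"kg|kgs|kilograms?|lbs?|pounds?",
    "LENGTH": r"\"|''|in(?:ches|\.)?|inch",
    "MEMORY": r"[kmgt]b|gigabytes?|megabytes?",
}

# Number + optional space + unit, e.g. 15kg, 15.0 kg, 5", 8 gigabytes.
PATTERNS = {
    label: re.compile(rf"\b\d+(?:\.\d+)?\s*(?:{units})(?!\w)", re.IGNORECASE)
    for label, units in UNITS.items()
}

def extract_measurements(text):
    # Return (label, matched text) pairs for each measurement found.
    return [(label, m.group(0))
            for label, pattern in PATTERNS.items()
            for m in pattern.finditer(text)]

print(extract_measurements("The laptop has 8 gigabytes memory and weighs 15.0 kg."))
# [('WEIGHT', '15.0 kg'), ('MEMORY', '8 gigabytes')]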
You can build your own training data and label the required measurements accordingly.
For example, if you have a sentence like "Jack weighs about 50 kgs",
the model will classify your input as:
Jack, PERSON
weighs, O
about, O
50, MES
kgs, MES
Where MES stands for measurements.
I recently created training data for the Stanford NER tagger for my own customized problem and built a model from it.
I think you can do the same thing with Stanford CoreNLP NER.
Note that this is a machine-learning-based approach rather than a rule-based one.

How to resolve mkcls taking up lots of memory and time for word alignment using GIZA++?

I am using GIZA++ to align words in bitexts from the Europarl corpus.
Before I train the alignment model using GIZA++, I need to use the mkcls tool to make the word classes needed by the Hidden Markov Model algorithm, like so:
mkcls -n10 -pcorp.tok.low.src -Vcorp.tok.low.src.vcb.classes
I have tried it with a small 1,000-line corpus and it works properly, completing in a few minutes. Now I'm trying it on a corpus with 1,500,000 lines and it's taking up 100% of one of my CPU cores (Six-Core AMD Opteron(tm) Processor 2431 × 12).
Before making the classes, I took the necessary steps to tokenize, lowercase everything, and filter out lines with more than 40 words.
Does anyone have similar experience with mkcls for GIZA++? How did you solve it? If anyone has done the same on the Europarl corpus, how long did it take you to run mkcls?
Because the mkcls tool used by MOSES and GIZA++ isn't parallelized, and given the number of sentences and words in the 1.5-million-line Europarl corpus, it takes around 1-2 hours to make the vocabulary classes.
The other pre-GIZA++ processing steps (viz. plain2snt, snt2cooc) take much, much less time and processing power.
Try mgiza (http://www.kyloo.net/software/doku.php/mgiza:overview), which supports multi-threading. It should significantly decrease the amount of time needed to accomplish your task.
