Dataset for Doc2vec - nlp

I have a question is there already any free dataset available to test doc2vec and if in case I wanted to create my own dataset what could be an appropriate way to do it.

Assuming you mean the 'Paragraph Vectors' algorithm, which is often called Doc2Vec, any textual dataset is a potential test/demo dataset.
The original papers by the creators of Doc2Vec showed results from applying it to:
movie reviews
search engine summary snippets
Wikipedia articles
scientific articles from Arxiv
People have also used it on…
titles of articles/books
abstracts of larger articles
full news articles or scientific papers
tweets
blogposts or social media posts
resumes
When learning, it's best to pick very simple, common datasets when you're 1st starting, and then larger datasets that you somewhat understand or are related to your areas of interest – if you don't already have a sufficient project-related dataset.
Note that the algorithm, like others in the [something]2vec family of algorithms, works best with lots of varied training data – many tens of thousands of unique words each with many contrasting usage examples, over many tens of thousands (or many more) of documents.
If you crank the vector_size way down, & the training-epochs way up, you can eke some hints of its real performance out of smaller datasets of a few hundred contrasting documents. For example, in the Python Gensim library's Doc2Vec intro-tutorial & test-cases, a tiny set of 300 news-summaries (from about 20 years ago called the 'Lee Corpus') are used, and each text is only a few hundreds words long.
But the vector_size is reduced to 50 – much smaller than the hundreds-of-dimensions typical with larger training data, and perhaps still too many dimensions for such a small amount of data. And, the training epochs is increased to 40, much larger than the default of 5 or typical Doc2Vec choices in published papers of 10-20 epochs. And even with those changes, with such little data & textual variety, the effect of moving similar documents to similar vector coordinates will be appear weaker to human review, & be less consistent between runs, than a better dataset will usually show (albeit using many more minutes/hours of training time).

Related

Evaluation of gensim Doc2Vec model for Recommendations

I have developed a pipeline to extract text from documents, preprocess the text, and train a gensim Doc2vec model on given documents. Given a document in my corpus, I would like to recommend other documents in the corpus.
I want to know how I can evaluate my model without having a pre-defined list of "good" recommendations. Any ideas?
One simple self-check that can be used to catch some big problems with a Doc2Vec model training pipeline – like gross misparameterizations, or insufficient data/epochs – is to re-infer vectors for the training texts (using .infer_vector()), and check that generally:
the bulk-trained vector for the same text is "close to" the re-inferred vector - such as its nearest-neighbor, or one of the top neighbors, in a .most_similar() operation on the re-inferred text
the overall list of nearest-neighbors (from .most_similar()) for the bulk-trained vector, & the re-inferred vector, are very similar.
They won't necessarily be identical, for reasons explained in Q11 & Q12 of the Gensim Project FAQ, but if they're wildly-different, then something foundational has gone wrong, like:
insufficient (in quantity or quality/form) training data
misparameterizations, like too few epochs or too-large (overfitting-prone) vectors for the quantity of data
Ultimately, though, the variety of data sources & intended uses & possible dimensions of "recommendation-worthy" mean that you need cusomt inputs, based on your project's needs, usually from the intended audience (or your own ability to simulate/represent it).
In the original paper introducing the "Paragraph Vector" algorithm (what's inside the Doc2Vec class), and a followup evaluating it on Wikipedia & arXiv articles, several of the evaluations used triplets of documents, where 2 of the triplet were conjectured to be "necessarily similar" based on some preexisting system's groupings, and the 3rd randomly-chosen.
The algorithm's performance, and relative performance under different parameter choices, was scored based on how often it placed the 2 presumptively-related documents closer-together than the 3rd randomly-chosen document.
For example, one of the original paper's evaluations use brief search-engine-result snippets as documents, and considered any 2 documents that appeared as sibling top-10 results for the same query as presumptively-related. Two of the followup paper's evaluation used the human-curated categories of Wikipedia or arXiv as signalling that articles of the same category should be presumptively-related.
It's imperfect, but allowed the creation of large evaluation sets from already-existing systems/data, which generally pointed results in the same direction as human senses-of-relatedness.
Perhaps you can find a similar preexisting guide for your data. Or, as you perform ad-hoc checking, be sure to capture every judgement you make, so that it becomes, over time, a growing dataset of desirable pairings that are either (a) better than some other result that was co-presented; or (b) just "presumably good enough" that they usually should rank higher than other random 3rd documents. A large amount of imprecision in such desirability-data is tolerable, as it can even out as the set of probe-pairings grows, and the power of being able to automate bulk quantitative evaluations (reusing old assessments against new parameters/models) drives far more overall improvement than any small glitches in the evaluations cost.

Word2Vec clustering: embed with low dimensionality or with high dimensionality and then reduce?

I am using K-means for topic modelling using Word2Vec and would like to understand the implications of vectorizing up to, let's say, 10 dimensions, against embedding it with 200 dimensions and then using PCA to get down to 10. Does the second approach make sense at all?
Which one worked better for your specific purposes, & your specific data, after trying both & comparing the end-results against each other, either in some ad-hoc ("eyeballing") or rigorous way?
There's no reason to prematurely reject any approach, given how many details about your data & ultimate end-goals are unstated.
It would be atypical to train a word2vec model to have only 10 dimensions. Published work most often shows the use of 100 to 1000 dimensions, often 300 or 400, assuming you've got enough bulk training data to make the algorithm worthwhile.
(Word2vec needs a lot of varied training text, with many contrasting usage examples for every word of interest, to generate good results. You may occasionally see toy-sized demos, on smaller amounts of data, just to quickly show steps, or some major qualities of the results. But good results, in the aspects for which word2vec is most appreciated, depend on plentiful training data.)
Also, whether or not your aims would be helped by the extra step of PCA to reduce the dimensionality of a larger word2vec model seems another separable question, to be determined experimentally by comparing results with and without that step, on your actual data/problem, rather than guessed at from intuitions from other projects that might not be comparable.

Which method dm or dbow works well for document similarity using Doc2Vec?

I'm trying to find out the similarity between 2 documents. I'm using Doc2vec Gensim to train around 10k documents. There are around 10 string type of tags. Each tag consists of a unique word and contains some sort of documents. Model is trained using distributed memory method.
Doc2Vec(alpha=0.025, min_alpha=0.0001, min_count=2, window=10, dm=1, dm_mean=1, epochs=50, seed=25, vector_size=100, workers=1)
I've tried both dm and dbow as well. dm gives better result(similarity score) as compared to dbow. I understood the concepts of dm vs dbow. But don't know which method is good for similarity measures between two documents.
First question: Which method is the best to perform well on similarities?
model.wv.n_similarity(<words_1>, <words_2>) gives similarity score using word vectors.
model.docvecs.similarity_unseen_docs(model, doc1, doc2) gives similarity score using doc vectors where doc1 and doc2 are not tags/ or indexes of doctags. Each doc1 and doc2 contains 10-20 words kind of sentences.
Both wv.n_similarity and docvecs.similarity_unseen_docs provide different similarity scores on same types of documents.
docvecs.similarity_unseen_docs gives little bit good results as compared to wv.n_similarity but wv.n_similarity sometimes also gives good results.
Question: What is the difference between docvecs.similarity_unseen_docs and wv.n_similarity? Can I use docvecs.similarity_unseen_docs to find the similarity score between unseen data (It might be a silly question)?
Why I asked because docvecs.similarity_unseen_docs provides similarity score on tags, not on actual words belonging to their tags. I'm not sure, please correct me here, if I'm wrong.
How can I convert cosine similarity score to probability?
Thanks.
model = Doc2Vec(alpha=0.025, min_alpha=0.0001, min_count=2, window=10, dm=1, dm_mean=1, epochs=50, seed=25, vector_size=100, workers=4)
# Training of the model
tagged_data = [TaggedDocument(words=_d, tags=[str(i)]) for i, _d in enumerate(<list_of_list_of_tokens>)]
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)
# Finding similarity score
model.wv.n_similarity(<doc_words1>, <doc_words2>)
model.random.seed(25)
model.docvecs.similarity_unseen_docs(model, <doc_words1>, <doc_words2>)
Both PV-DM mode (dm=1, the default) and PV-DBOW mode (dm=0) can work well. Which is better will depend on your data and goals. Once you have a robust way to quantitatively score the quality of a model's results, for your project goals – which you'll want to be able to tune all of the model's meta-parameters, including DM/DBOW mode – you can and should try both.
PV-DBOW trains fast, and often works very well on shortish-docs (a few dozens of words). Note, though, that this mode doesn't train usable word-vectors unless you also add the dbow_words=1 option, which will slow training.
Using model.wv.n_similarity() relies on word-vectors only. It averages each set f word-vectors, then reports the cosine-similarity between those two averages. (So, it will only be sensible in PV-DM mode, or PV-DBOW with dbow_words=1 activated.
Using model. docvecs.similarity_unseen_docs() uses infer_vector() to treat each of the supplied docs as new texts, for which a true Doc2Vec doc-vector (not a mere average-of-word-vectors) is calculated. (This method operates on lists-of-words, not lists-of-tags.)
Which is better is something you should test for your goals. The average-of-word-vectors is a simpler, faster technique for making a text-vector – but still works ok for a lot of purposes. The inferred doc-vectors take longer to calculate, but with a good model, may be better for some tasks.
Other notes on your setup:
often, setting min_count as low as 2 is a bad idea: those rare words don't have enough examples to mean much, and actually interfere with the quality of surrounding words
10k documents is on the smallish side for a training corpus, compared to published Doc2Vec results (which usually use tens-of-thousands to millions of documents).
published results often use 10-20 training epochs (though more, like your choice of 50, might be helpful especially for smaller corpuses)
on typical multi-core machines workers=1 will be much slower than the default (workers=3); on a machine with 8 or more cores, up to workers=8 is often a good idea. (Though, unless using the newer corpus_file input option, more workers up to the full count of 16, 32, etc cores doesn't help.)
classic Doc2Vec usage doesn't assign docs just known labels (as in your "10 string type of tags"), but unique IDs for each document. In some cases using, or adding, known labels as tags may help, but beware that if you're only supplying 10 tags, you've essentially turned your 10,000 documents into 10 documents (from the perspective of the model's view, which sees all texts with the same tag as if they were segments of one larger document with that tag). In plain PV-DBOW, training only 10 doc-vectors, of 100-dimensions each, from just 10 distinct examples wouldn't make much sense: it'd be prone to severe overfitting. (In PV-DM or PV-DBOW with dbow_words, the fact that the model is training both 10 doc-vectors and many hundreds/thousands of other vocabulary-word word-vectors would help offset the risk of overfitting.)

What is an appropriate training set size for sentiment analysis?

I'm looking to use some tweets about measles/ the mmr vaccine to see how sentiment about vaccination changes over time. I plan on creating the training set from the corpus of data I currently have (unless someone has a recommendation on where I can get similar data).
I would like to classify a tweet as either: Pro-vaccine, Anti-Vaccine, or Neither (these would be factual tweets about outbreaks).
So the question is: How big is big enough? I want to avoid problems of overfitting (so I'll do a test train split) but as I include more and more tweets, the number of features needing to be learned increases dramatically.
I was thinking 1000 tweets (333 of each). Any input is appreciated here, and if you could recommend some resources, that would be great too.
More is always better. 1000 tweets on a 3-way split seems quite ambitious, I would even consider 1000 per class for a 3-way split on tweets quite low. Label as many as you can within a feasible amount of time.
Also, it might be worth taking a cascaded approach (esp. with so little data), i.e. label a set vaccine vs non-vaccine, and within the vaccine subset you'd have a pro vs anti set.
In my experience trying to model a catch-all "neutral" class, that contains everything that is not explicitly "pro" or "anti" is quite difficult because there is so much noise. Especially with simpler models such as Naive Bayes, I have found the cascaded approach to be working quite well.

NLP software for classification of large datasets

Background
For years I've been using my own Bayesian-like methods to categorize new items from external sources based on a large and continually updated training dataset.
There are three types of categorization done for each item:
30 categories, where each item must belong to one category, and at most two categories.
10 other categories, where each item is only associated with a category if there is a strong match, and each item can belong to as many categories as match.
4 other categories, where each item must belong to only one category, and if there isn't a strong match the item is assigned to a default category.
Each item consists of English text of around 2,000 characters. In my training dataset there are about 265,000 items, which contain a rough estimate of 10,000,000 features (unique three word phrases).
My homebrew methods have been fairly successful, but definitely have room for improvement. I've read the NLTK book's chapter "Learning to Classify Text", which was great and gave me a good overview of NLP classification techniques. I'd like to be able to experiment with different methods and parameters until I get the best classification results possible for my data.
The Question
What off-the-shelf NLP tools are available that can efficiently classify such a large dataset?
Those I've tried so far:
NLTK
TIMBL
I tried to train them with a dataset that consisted of less than 1% of the available training data: 1,700 items, 375,000 features. For NLTK I used a sparse binary format, and a similarly compact format for TIMBL.
Both seemed to rely on doing everything in memory, and quickly consumed all system memory. I can get them to work with tiny datasets, but nothing large. I suspect that if I tried incrementally adding the training data the same problem would occur either then or when doing the actual classification.
I've looked at Google's Prediction API, which seem to do much of what I'm looking for but not everything. I'd also like to avoid relying on an external service if possible.
About the choice of features: in testing with my homebrew methods over the years, three word phrases produced by far the best results. Although I could reduce the number of features by using words or two word phrases, that would most likely produce inferior results and would still be a large number of features.
After this post and based on the personal experience, I would recommend Vowpal Wabbit. It is said to have one of the fastest text classification algorithms.
MALLET has a number of classifiers (NB, MaxEnt, CRF, etc). It's written Andrew McCallum's group. SVMLib is another good option, but SVM models typically require a bit more tuning than MaxEnt. Alternatively some sort of online clustering like K-means might not be bad in this case.
SVMLib and MALLET are quite fast (C and Java) once you have your model trained. Model training can take a while though! Unfortunately it's not always easy to find example code. I have some examples of how to use MALLET programmatically (along with the Stanford Parser, which is slow and probably overkill for your purposes). NLTK is a great learning tool and is simple enough that is you can prototype what you are doing there, that's ideal.
NLP is more about features and data quality than which machine learning method you use. 3-grams might be good, but how about character n-grams across those? Ie, all the character ngrams in a 3-gram to account for spelling variations/stemming/etc? Named entities might also be useful, or some sort of lexicon.
I would recommend Mahout as it is intended for handling very large scale data sets.
The ML algorithms are built over Apache Hadoop(map/reduce), so scaling is inherent.
Take a look at classification section below and see if it helps.
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
Have you tried MALLET?
I can't be sure that it will handle your particular dataset but I've found it to be quite robust in previous tests of mine.
However, I my focus was on topic modeling rather than classification per se.
Also, beware that with many NLP solutions you needn't input the "features" yourself (as the N-grams, i.e. the three-words-phrases and two-word-phrases mentioned in the question) but instead rely on the various NLP functions to produce their own statistical model.

Resources