doc2vec infer words from vectors - doc2vec

I am clustering comments.
After preprocessing and a vectorization of a text, I have inferred vectors from my doc2vec model and applied kmeans.
After that I want to convert cluster centroid vectors to words to kinda look at the semantic cores of the clusters. Is it possible?
Edit: I use python/gensim.

There are a bunch of potential approaches you could try, and see which might offer what you want.
First & foremost, some of the Gensim Doc2Vec modes co-train word-vectors into the same coordinate system as the doc-vectors – allowing direct comparisons betwee words & docs, sometimes even to the level of compositional 'vector-arithmetic' (like in the famous word2vec analogy-solving examples).
You can see this potential discussed in the paper "Document Embedding with Paragraph Vectors".
The default PV-DM mode (parameter dm=1) automatically co-trains words and docs in the same space. You can also add interleaved word-vector skip-gram training into the other PV-DBOW dm=0 mode by adding the optional parameter dbow_words=1.
While it is still the case that d2v_model.dv.most_similar(docvec_or_doctag) will only return doc-vector results, and d2v_model.wv.most_similar(wordvec_or_word_token) will only return word-vector results, you can absolutely provide a raw vector of a document to the set of word-vectors, or a word-vector to the set of doc-vectors, to get the nearest-neighbors of the other type.
So in one of these modes, with doc-vector, you can use...
d2v_model.wv.most_simlar(positive=[doc_vector])
...to get a list-of-words that are closest to that doc-vector. Whether they're sufficiently representative will vary based on lots of factors. (If they seem totally random, there may be other problems with your data-sufficiency or process, or you may be using the dm=0, dbow_words=0 mode that leaves words random & untrained.)
You could use this on the centroid of your clusters – but note, a centroid might hide lots of the variety of a larger grouping, which might include docs not all in a tight 'ball' around the centroid. So you could also use this on all docs in a cluster, to get the top-N closest words to each – and then summarize the cluster as the words most often appearing in those many top-N lists, or most uniquely appearing in those top-N lists (versus the top-N lists of other clusters). That might describe more of the full cluster.
Separately, there's a method from Gensim's Word2Vec, predict_output_word(), which vaguely simulates the word2vec training-predictions to give a ranked list of predictions of a word from its surrounding words. The same code could be generalized to predict document-words from a doc-vector – there's an open pending issue to do so, and it'd be a simple bit of coding, though no-one's tackled it yet. (It'd be a welcome, and pretty easy, 1ast contribution to the Gensim project.)
Also: after having established your clusters, you could even put the Doc2Vec model aside, and use more traditional direct counting/frequency methods to pick out the most-salient words in each cluster. For example, turn each cluser into a single synthetic pseudodocument. Rank the words inside by TF-IDF, compared to the other cluster pseudodocs. (Or, get the top TF-IDF terms for every one of the individual original documents; describe each cluster by the most-often-relevant words tallied across all cluster docs.)

Though gojomo's answer makes perfect sense, I've decided to go the other way around with classification instead of clusterization. The article about the library I have found useful:
https://towardsdatascience.com/unsupervised-text-classification-with-lbl2vec-6c5e040354de]

Related

Truncate LDA topics

I am training an LDA model. While I obtain decently interpretable topics (based on the top words), particular documents tend to load heavily on very "generic" topics rather than specialized ones -- even though the most frequent words in the document are specialized.
For example, I have a real estate report as a document. Top words by frequency are "rent", "reit", "growth". Now, I have a "specialized" topic with top words being exactly those three. However, the loading of the specialized topic is 9%, and 32% goes to a topic which is very diffuse and the top words are rather common.
How can I increase the weight of "specialized" topics? Is it possible to truncate topics such that I only include the top 10 words and assign zero probability to anything else? Is it desirable to do so?
I am using the gensim package. Thank you!
It seems that you want a very precise control over the topics which looks much more like clustering with a set of centroids chosen ahead of time than LDA which is generally not very deterministic and hence controllable.
One of the ways you can strive to achieve your goal with LDA is to filter more words out of the documents (same as you do with stopwords). Then the "rather common" words that go into one of the topics stop obscuring the LDA model creation process and you get more crisply delineated topics (hopefully).
Removing the most common words is quite a common practice for preprocessing in topic modeling. Because topics are usually generated from the most frequent words, but usually these words are not very informative. You can also remove the most common words as a post-processing step (See Pulling Out the Stops: Rethinking Stopword Removal for Topic Models)
About having sparser word-topic distributions, you can use Non-negative Matrix Factorization (NMF) instead of LDA. If you adjust the sparsity parameters, you can get more spiked proportions of the topics. You can use scikit-learn NMF's implementation.

how can I simplify BoWs?

I'm trying to apply some binary text classification but I don't feel that having millions of >1k length vectors is a good idea. So, which alternatives are there for the basic BOW model?
I think there are quite a few different approaches, based on what exactly you are aiming for in your prediction task (processing speed over accuracy, variance in your text data distribution, etc.).
Without any further information on your current implementation, I think the following avenues offer ways for improvement in your approach:
Using sparse data representations. This might be a very obvious point, but choosing the right data structure to represent your input vectors can already save you a great deal of pain. Sklearn offers a variety of options, and detail them in their great user guide. Specifically, I would point out that you could either use scipy.sparse matrices, or alternatively represent something with sklearn's DictVectorizer.
Limit your vocabulary. There might be some words that you can easily ignore when building your BoW representation. I'm again assuming that you're working with some implementation similar to sklearn's CountVectorizer, which already offers a great number of possibilities. The most obvious option are stopwords, which can simply be dropped from your vocabulary entirely, but of course you can also limit it further by using pre-processing steps such as lemmatization/stemming, lowercasing, etc. CountVectorizer specifically also allows you to control the minimum and maximum document frequency (don't confuse this with corpus frequency), which again should limit the size of your vocabulary.

Cluster similar words using word2vec

I have various restaurant labels with me and i have some words that are unrelated to restaurants as well. like below:
vegan
vegetarian
pizza
burger
transportation
coffee
Bookstores
Oil and Lube
I have such mix of around 500 labels. I want to know is there a way pick the similar labels that are related to food choices and leave out words like oil and lube, transportation.
I tried using word2vec but, some of them have more than one word and could not figure out a right way.
Brute-force approach is to tag them manually. But, i want to know is there a way using NLP or Word2Vec to cluster all related labels together.
Word2Vec could help with this, but key factors to consider are:
How are your word-vectors trained? Using off-the-shelf vectors (like say the popular GoogleNews vectors trained on a large corpus of news stories) are unlikely to closely match the senses of these words in your domain, or include multi-word tokens like 'oil_and_lube'. But, if you have a good training corpus from your own domain, with multi-word tokens from a controlled vocabulary (like oil_and_lube) that are used in context, you might get quite good vectors for exactly the tokens you need.
The similarity of word-vectors isn't strictly 'synonymity' but often other forms of close-relation including oppositeness and other ways words can be interchangeable or be used in similar contexts. So whether or not the word-vector similarity-values provide a good threshold cutoff for your particular desired "related to food" test is something you'd have to try out & tinker around. (For example: whether words that are drop-in replacements for each other are closest to each other, or words that are common-in-the-same-topics are closest to each other, can be influenced by whether the window parameter is smaller or larger. So you could find tuning Word2Vec training parameters improve the resulting vectors for your specific needs.)
Making more recommendations for how to proceed would require more details on the training data you have available – where do these labels come from? what's the format they're in? how much do you have? – and your ultimate goals – why is it important to distinguish between restaurant- and non-restaurant- labels?
OK, thank you for the details.
In order to train on word2vec you should take into account the following facts :
You need a huge and variate text dataset. Review your training set and make sure it contains the useful data you need in order to obtain what you want.
Set one sentence/phrase per line.
For preprocessing, you need to delete punctuation and set all strings to lower case.
Do NOT lemmatize or stemmatize, because the text will be less complex!
Try different settings:
5.1 Algorithm: I used word2vec and I can say BagOfWords (BOW) provided better results, on different training sets, than SkipGram.
5.2 Number of layers: 200 layers provide good result
5.3 Vector size: Vector length = 300 is OK.
Now run the training algorithm. The, use the obtained model in order to perform different tasks. For example, in your case, for synonymy, you can compare two words (i.e. vectors) with cosine (or similarity). From my experience, cosine provides a satisfactory result: the distance between two words is given by a double between 0 and 1. Synonyms have high cosine values, you must find the limit between words which are synonyms and others that are not.

Can we compare word vectors from different models using transfer learning?

I want to train two word2vec/GLoVe models on different corpora and then compare the vectors of a single word. I know that it makes no sense to do so as different models start at different random states, but what if we use pre-trained word vectors as the starting point. Can we assume that the two models will continue to build upon the pre-trained vectors by incorporating the respective domain-specific knowledge, and not go into completely different states?
Tried to find some research papers which discuss this problem, but couldn't find any.
Simply starting your models with pre-trained bectors would eliminate some of the randomness, but with each training epoch on your new corpora:
there's still randomness introduced by negative-sampling (if using that default mode), by frequent-word downsampling (if using default values of the sample parameter in word2vec), and by the interplay of different threads
each epoch with your new corpora will be pulling the word-vectors for present words to new, better positions for that corpora, but leaving original words unmoved. The net movements over many epochs could move words arbitrarily far from where they started, in response to the whole-corpus-effects on all words.
So, doing so wouldn't necessarily achieve your goal in a reliable (or theoretically-defensible) way, though it might kinda-work – at least better than starting from purely random initialization – especially if your corpora are small and you do few training epochs. (That's usually a bad idea – you want big varied training data and enough passes for extra passes to make little incremental difference. But doing those things "wrong" could make your results look "better" in this scenario, where you don't want your training to change the original coordinate-space "too much". I wouldn't rely on such an approach.)
Especially if the words you need to compare are a small subset of the total vocabulary, a couple things you could consider:
combine the corpora into one training corpus, shuffled together, but for those words you need to compare, replace them with corpora-specific tokens. For example, replace 'sugar' with 'sugar_c1' and 'sugar_c2' – leaving the vast majority of surrounding words to be the same tokens (and thus learn a single vector across the whole corpus). Then, the two variant tokens for the "same word" will learn different vectors, based on their differing contexts that still share many of the same tokens.
using some "anchor set" of words that you know (or confidently conjecture) either do mean the same across both contexts, or should mean the same, train two models but learn a transformation between the two space based on those guide words. Then, when you apply that transformation to other words, that weren't used to learn the transformation, they'll land in contrasting positions in each others' spaces, maybe achieving the comparison you need. This is a technique that's been used for language-to-language translation, and there's a helper class and example notebook included with the Python gensim library.
There may be other better approaches, these are just two quick ideas that might work without much change to existing libraries. A project like 'HistWords', which used word-vector training to try to track evolving changes in word-meaning over time, might also have ideas for usable techniques.

human-interpretable, meaningful clusters using doc2vec

I am clustering a set of education documents using doc2vec.
As a human, I think of these as in categories such as:
computer-related
language related
collaboration
arts
etc.
I wonder if there is a way to 'guide' the doc2vec clustering into a set of clusters that are human-interpretable.
One strategy I have been trying is to filter out all 'nonsense' words, and only train doc2vec on the words that seem meaningful. But of course, this seems to perhaps ruin the training.
Something just occurred to me that might work:
Train on entire documents (don't filter out words) to create doc2vec space
Filter nonsense words ('help', 'student', etc. are words that have very little meaning in this space) out of each document
Project filtered documents into doc2vec space
then process using k-means etc
I would appreciate any constructive suggestions or next steps.
best
Your plan is fine; you should try it to evaluate the results. The clusters may not map tightly to your preconceived groupings, but by looking at the example docs per cluster, you'll probably be able to form your own rough idea of what the cluster "is" in human-crafted descriptive terms.
Don't try too much guesswork preprocessing (like eliminating words) at first. Try those kinds of variations after you have the simplest possible approach working, as a baseline – so you can evaluate (even if only by ad hoc eyeballing) whether they're helping as expected. (For example, if a word like 'student' truly appears across all documents equally, it won't have much influence either way on Doc2Vec final doc coordinates... so you don't have to make that judgement call yourself, it'll just be deemphasized automatically.)
I'm assuming that by Doc2Vec you mean the 'Paragraph Vector' algorithm, as implemented by the Doc2Vec class in Python gensim. Some PV-Doc2Vec modes, including the default PV-DM (dm=1) and also the simpler PV-DBOW if you also enable concurrent word-training (dm=0, dbow_words=1), train word-vectors into the same space as doc-vectors. So the word-vectors that are closest to the doc-vectors in a cluster, or the cluster's centroid, might be useful as interpretable descriptions of the cluster.
(In the word-vector space, there's also research that tries to make the individual dimensions of word-vectors more-interpretable by constraining training in some way, such as requiring vectors to be spares with only non-negative dimensions. See for example this NNSE work and other papers like it. Presumably that might also be applicable to doc-vectors, but I don't know offhand any papers or libraries to do that.)
You could also apply other topic-modeling algorithms, like LDA, that calculate discrete 'topics' that are usually fairly interpretable, and report the strongest topics in each document. (You can cluster on the full doc-topics weights, or perhaps just naively assign each document to its one strongest topic as a simple kind of clustering.)

Resources