Can I interpret doc2vec components? - nlp

I am solving a binary text classification problem with corporate filings. Using Doc2Vec embeddings of length 100 with LightGBM is producing great results. However, for this project it would be very valuable to approximate a thematic meaning for at least one of the components. Ideally, this would be a feature ranked with high importance by LightGBM explained anecdotally with a few examples.
Has anyone attempted this, or should interpretation be off the table for a high-dimensional model with this level of complexity?

The individual dimensions of a Doc2Vec representation should not be considered independent, interpretable features. They're only useful in concert with each other, and the exact directions aligned with individual coordinate-axes may not be strongly meaningful in any human-describable sense.
However, neighborhoods of the space may loosely fit describable themes, and certain directions (not specifically parallel with coordinate-axes) may loosely fit semantic themes.
But to characterize those, you might try to find the centroid points of groups-of-related-documents, or discovered clusters, and compare the relative distances/directions between those centroids.
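A minimal sketch of the centroid idea, assuming you already have one vector per document and a group label for each (the toy 2-d vectors and the two groups below are made up for illustration; real Doc2Vec vectors would have 100 dimensions):

```python
import numpy as np

# Hypothetical doc-vectors (one row per document) and a group label per row.
doc_vectors = np.array([
    [1.0, 0.1],
    [0.9, -0.1],
    [0.1, 1.0],
    [-0.1, 0.9],
])
groups = np.array([0, 0, 1, 1])

# Centroid of each group of related documents.
centroids = {int(g): doc_vectors[groups == g].mean(axis=0) for g in np.unique(groups)}

# Unit direction pointing from group 0's centroid toward group 1's centroid.
direction = centroids[1] - centroids[0]
direction = direction / np.linalg.norm(direction)

# Projecting every document onto that direction gives one scalar per doc,
# which is easier to describe than any single raw coordinate.
projections = doc_vectors @ direction
print(projections)
```

A projection like this could be fed to the classifier as an extra feature, and it has a describable meaning ("closer to group 1 than group 0") that a raw coordinate axis usually lacks.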

Related

doc2vec infer words from vectors

I am clustering comments.
After preprocessing and a vectorization of a text, I have inferred vectors from my doc2vec model and applied kmeans.
After that I want to convert cluster centroid vectors to words to kinda look at the semantic cores of the clusters. Is it possible?
Edit: I use python/gensim.
There are a bunch of potential approaches you could try, and see which might offer what you want.
First & foremost, some of the Gensim Doc2Vec modes co-train word-vectors into the same coordinate system as the doc-vectors – allowing direct comparisons between words & docs, sometimes even to the level of compositional 'vector-arithmetic' (like in the famous word2vec analogy-solving examples).
You can see this potential discussed in the paper "Document Embedding with Paragraph Vectors".
The default PV-DM mode (parameter dm=1) automatically co-trains words and docs in the same space. You can also add interleaved word-vector skip-gram training into the other PV-DBOW dm=0 mode by adding the optional parameter dbow_words=1.
While it is still the case that d2v_model.dv.most_similar(docvec_or_doctag) will only return doc-vector results, and d2v_model.wv.most_similar(wordvec_or_word_token) will only return word-vector results, you can absolutely provide a raw vector of a document to the set of word-vectors, or a word-vector to the set of doc-vectors, to get the nearest-neighbors of the other type.
So in one of these modes, with doc-vector, you can use...
d2v_model.wv.most_similar(positive=[doc_vector])
...to get a list-of-words that are closest to that doc-vector. Whether they're sufficiently representative will vary based on lots of factors. (If they seem totally random, there may be other problems with your data-sufficiency or process, or you may be using the dm=0, dbow_words=0 mode that leaves words random & untrained.)
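Conceptually, that lookup is a cosine-similarity ranking of every word-vector against the doc-vector. Here is a rough numpy sketch of the same idea, not Gensim's actual code; the word list, the toy 2-d vectors and the helper name nearest_words are all invented for illustration:

```python
import numpy as np

def nearest_words(doc_vector, word_vectors, words, topn=3):
    """Rank words by cosine similarity to a single doc-vector."""
    unit_words = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
    unit_doc = doc_vector / np.linalg.norm(doc_vector)
    sims = unit_words @ unit_doc
    order = np.argsort(-sims)[:topn]
    return [(words[i], float(sims[i])) for i in order]

# Hypothetical word-vectors that share a space with the doc-vectors.
words = ["revenue", "profit", "server", "code"]
word_vectors = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
])
doc_vector = np.array([1.0, 0.15])

closest = nearest_words(doc_vector, word_vectors, words, topn=2)
print(closest)
```

The ranking only means something if the word-vectors were actually trained in the same space as the doc-vectors, which is why the training mode matters.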
You could use this on the centroid of your clusters – but note, a centroid might hide lots of the variety of a larger grouping, which might include docs not all in a tight 'ball' around the centroid. So you could also use this on all docs in a cluster, to get the top-N closest words to each – and then summarize the cluster as the words most often appearing in those many top-N lists, or most uniquely appearing in those top-N lists (versus the top-N lists of other clusters). That might describe more of the full cluster.
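The tallying step could look something like this. It assumes you have already collected a top-N nearest-word list for every doc in every cluster; the lists below are made up, and summarize_clusters is a hypothetical helper name:

```python
from collections import Counter

def summarize_clusters(top_words_by_cluster, n_terms=2):
    """Describe each cluster by its most frequent and most distinctive words."""
    counts = {
        cluster: Counter(word for doc_words in lists for word in doc_words)
        for cluster, lists in top_words_by_cluster.items()
    }
    summary = {}
    for cluster, counter in counts.items():
        # How often each word shows up in the top-N lists of the OTHER clusters.
        elsewhere = Counter()
        for other, other_counter in counts.items():
            if other != cluster:
                elsewhere.update(other_counter)
        frequent = sorted(counter, key=lambda w: (-counter[w], w))[:n_terms]
        distinctive = sorted(
            counter, key=lambda w: (-(counter[w] - elsewhere[w]), w)
        )[:n_terms]
        summary[cluster] = {"frequent": frequent, "distinctive": distinctive}
    return summary

# Hypothetical top-3 nearest words for each doc, grouped by cluster.
top_words_by_cluster = {
    "A": [["price", "refund", "order"], ["refund", "delivery", "order"], ["refund", "price", "thanks"]],
    "B": [["login", "password", "thanks"], ["password", "reset", "login"], ["thanks", "login", "order"]],
}
summary = summarize_clusters(top_words_by_cluster)
print(summary)
```

The "distinctive" score here is just the in-cluster count minus the count elsewhere, a deliberately crude choice; any contrastive measure would do.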
Separately, there's a method from Gensim's Word2Vec, predict_output_word(), which vaguely simulates the word2vec training-predictions to give a ranked list of predictions of a word from its surrounding words. The same code could be generalized to predict document-words from a doc-vector – there's an open pending issue to do so, and it'd be a simple bit of coding, though no-one's tackled it yet. (It'd be a welcome, and pretty easy, first contribution to the Gensim project.)
Also: after having established your clusters, you could even put the Doc2Vec model aside, and use more traditional direct counting/frequency methods to pick out the most-salient words in each cluster. For example, turn each cluster into a single synthetic pseudodocument. Rank the words inside by TF-IDF, compared to the other cluster pseudodocs. (Or, get the top TF-IDF terms for every one of the individual original documents; describe each cluster by the most-often-relevant words tallied across all cluster docs.)
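A small sketch of the pseudodocument variant using scikit-learn's TfidfVectorizer. The cluster names and texts are invented, and in practice you would build cluster_docs from your k-means assignments:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical clusters: cluster label -> raw texts of the docs assigned to it.
cluster_docs = {
    "finance": ["revenue profit quarter report", "profit loss revenue report"],
    "tech": ["software server code report", "code bug software"],
}

labels = list(cluster_docs)
# One synthetic pseudodocument per cluster.
pseudo_docs = [" ".join(cluster_docs[label]) for label in labels]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(pseudo_docs)

# Column index -> term, recovered from the fitted vocabulary.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

top_terms = {}
for row, label in enumerate(labels):
    weights = tfidf[row].toarray().ravel()
    best = weights.argsort()[::-1][:2]  # two highest-weighted terms
    top_terms[label] = [terms[i] for i in best]

print(top_terms)
```

Note how "report", which occurs in both pseudodocs, gets a lower IDF and drops out of the top terms even though it is as frequent as "revenue" in the finance cluster.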
Though gojomo's answer makes perfect sense, I've decided to go the other way around with classification instead of clusterization. The article about the library I have found useful:
https://towardsdatascience.com/unsupervised-text-classification-with-lbl2vec-6c5e040354de

how can I simplify BoWs?

I'm trying to apply some binary text classification but I don't feel that having millions of >1k length vectors is a good idea. So, which alternatives are there for the basic BOW model?
I think there are quite a few different approaches, based on what exactly you are aiming for in your prediction task (processing speed over accuracy, variance in your text data distribution, etc.).
Without any further information on your current implementation, I think the following avenues offer ways for improvement in your approach:
Using sparse data representations. This might be a very obvious point, but choosing the right data structure to represent your input vectors can already save you a great deal of pain. Sklearn offers a variety of options, and details them in its great user guide. Specifically, I would point out that you could either use scipy.sparse matrices, or alternatively represent something with sklearn's DictVectorizer.
Limit your vocabulary. There might be some words that you can easily ignore when building your BoW representation. I'm again assuming that you're working with some implementation similar to sklearn's CountVectorizer, which already offers a great number of possibilities. The most obvious option is stopwords, which can simply be dropped from your vocabulary entirely, but of course you can also limit it further by using pre-processing steps such as lemmatization/stemming, lowercasing, etc. CountVectorizer specifically also allows you to control the minimum and maximum document frequency (don't confuse this with corpus frequency), which again should limit the size of your vocabulary.
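To make both points concrete, here is a small sketch with CountVectorizer. The four toy documents and the particular min_df/max_df values are arbitrary choices for illustration, not recommendations:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "the bird flew over the log",
]

# Baseline: every token of two or more characters ends up in the vocabulary.
full = CountVectorizer()
X_full = full.fit_transform(docs)

# Limited: drop English stopwords, terms seen in fewer than 2 documents,
# and terms seen in more than 90% of documents.
limited = CountVectorizer(stop_words="english", min_df=2, max_df=0.9)
X_limited = limited.fit_transform(docs)

# Both results are scipy.sparse matrices, so only non-zero counts are stored.
print(type(X_limited).__name__, X_full.shape, X_limited.shape)
print(sorted(limited.vocabulary_))
```

On a real corpus the same three parameters typically shrink the vocabulary considerably, and the sparse output means the width of the matrix matters much less than the number of non-zero entries.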

Bert fine-tuned for semantic similarity

I would like to apply fine-tuning Bert to calculate semantic similarity between sentences.
I searched a lot of websites, but found almost nothing about this as a downstream task.
I just found STS benchmark.
I wonder if I can use STS benchmark dataset to train a fine-tuning bert model, and apply it to my task.
Is it reasonable?
As far as I know, there are many methods to calculate similarity, including cosine similarity, Pearson correlation, Manhattan distance, etc.
How should I choose one for semantic similarity?
In addition, if you're after a binary verdict (yes/no for 'semantically similar'), BERT was actually benchmarked on this task, using the MRPC (Microsoft Research Paraphrase Corpus).
The google github repo https://github.com/google-research/bert includes some example calls for this, see --task_name=MRPC in section Sentence (and sentence-pair) classification tasks.
As a general remark ahead, I want to stress that this kind of question might not be considered on-topic on Stackoverflow, see How to ask. There are, however, related sites that might be better for these kinds of questions (no code, theoretical PoV), namely AI Stackexchange, or Cross Validated.
If you look at a rather popular paper in the field by Mueller and Thyagarajan, which is concerned with learning sentence similarity on LSTMs, they use a closely related dataset (the SICK dataset), which is also hosted by the SemEval competition, and ran alongside the STS benchmark in 2014.
Either one of those should be a reasonable set to fine-tune on, but STS has run over multiple years, so the amount of available training data might be larger.
As a great primer on the topic, I can also highly recommend the Medium article by Adrien Sieg (see here), which comes with an accompanying GitHub reference.
For semantic similarity, I would estimate that you are better off with fine-tuning (or training) a neural network, as most classical similarity measures you mentioned have a more prominent focus on the token similarity (and thus, syntactic similarity, although not even that necessarily). Semantic meaning, on the other hand, can sometimes differ wildly on a single word (maybe a negation, or the swapped sentence position of two words), which is difficult to interpret or evaluate with static methods.
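A tiny illustration of the negation problem, using plain bag-of-words cosine similarity (standard library only; the two sentences and the helper name bow_cosine are made up for the example):

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity between whitespace-tokenized bag-of-words counts."""
    counts_a = Counter(a.lower().split())
    counts_b = Counter(b.lower().split())
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# The two sentences mean the opposite, yet share four of five tokens.
score = bow_cosine("the movie is good", "the movie is not good")
print(round(score, 3))
```

The score comes out close to 1 although the meanings are opposed, which is the kind of case where a model fine-tuned on labeled sentence pairs has a chance and a token-overlap measure does not.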

human-interpretable, meaningful clusters using doc2vec

I am clustering a set of education documents using doc2vec.
As a human, I think of these as falling into categories such as:
computer-related
language related
collaboration
arts
etc.
I wonder if there is a way to 'guide' the doc2vec clustering into a set of clusters that are human-interpretable.
One strategy I have been trying is to filter out all 'nonsense' words, and only train doc2vec on the words that seem meaningful. But of course, this seems to perhaps ruin the training.
Something just occurred to me that might work:
Train on entire documents (don't filter out words) to create doc2vec space
Filter nonsense words ('help', 'student', etc. are words that have very little meaning in this space) out of each document
Project filtered documents into doc2vec space
then process using k-means etc
I would appreciate any constructive suggestions or next steps.
best
Your plan is fine; you should try it to evaluate the results. The clusters may not map tightly to your preconceived groupings, but by looking at the example docs per cluster, you'll probably be able to form your own rough idea of what the cluster "is" in human-crafted descriptive terms.
Don't try too much guesswork preprocessing (like eliminating words) at first. Try those kinds of variations after you have the simplest possible approach working, as a baseline – so you can evaluate (even if only by ad hoc eyeballing) whether they're helping as expected. (For example, if a word like 'student' truly appears across all documents equally, it won't have much influence either way on Doc2Vec final doc coordinates... so you don't have to make that judgement call yourself, it'll just be deemphasized automatically.)
I'm assuming that by Doc2Vec you mean the 'Paragraph Vector' algorithm, as implemented by the Doc2Vec class in Python gensim. Some PV-Doc2Vec modes, including the default PV-DM (dm=1) and also the simpler PV-DBOW if you also enable concurrent word-training (dm=0, dbow_words=1), train word-vectors into the same space as doc-vectors. So the word-vectors that are closest to the doc-vectors in a cluster, or the cluster's centroid, might be useful as interpretable descriptions of the cluster.
(In the word-vector space, there's also research that tries to make the individual dimensions of word-vectors more-interpretable by constraining training in some way, such as requiring vectors to be sparse with only non-negative dimensions. See for example this NNSE work and other papers like it. Presumably that might also be applicable to doc-vectors, but I don't know offhand any papers or libraries to do that.)
You could also apply other topic-modeling algorithms, like LDA, that calculate discrete 'topics' that are usually fairly interpretable, and report the strongest topics in each document. (You can cluster on the full doc-topics weights, or perhaps just naively assign each document to its one strongest topic as a simple kind of clustering.)
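As a rough sketch of the LDA route with scikit-learn (the six toy documents, the choice of 2 topics and the random_state are arbitrary; on a corpus this small the topics themselves won't be meaningful, the point is only the shape of the workflow):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "python programming computer code",
    "computer software programming course",
    "spanish grammar vocabulary language",
    "french language grammar course",
    "painting drawing sculpture arts",
    "music arts drawing course",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row of topic weights per document

# Naive clustering: assign each document to its single strongest topic.
strongest_topic = doc_topics.argmax(axis=1)

# Highest-weighted words per topic, to help name the topics by hand.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for topic_idx, weights in enumerate(lda.components_):
    top = [terms[i] for i in np.argsort(weights)[::-1][:3]]
    print(topic_idx, top)
print(strongest_topic)
```

Instead of the argmax you could run k-means on the full doc_topics rows, which keeps documents that straddle two topics from being forced into one.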

Need help applying scikit-learn to this unbalanced text categorization task

I have a multi-class text classification/categorization problem. I have a set of ground truth data with K different mutually exclusive classes. This is an unbalanced problem in two respects. First, some classes are a lot more frequent than others. Second, some classes are of more interest to us than others (those generally positively correlate with their relative frequency, although there are some classes of interest that are fairly rare).
My goal is to develop a single classifier or a collection of them to be able to classify the k << K classes of interest with high precision (at least 80%) while maintaining reasonable recall (what's "reasonable" is a bit vague).
Features that I use are mostly typical unigram-/bigram-based ones plus some binary features coming from metadata of the incoming documents that are being classified (e.g. whether they were submitted via email or through a web form).
Because of the unbalanced data, I am leaning toward developing binary classifiers for each of the important classes, instead of a single one like a multi-class SVM.
What ML algorithms (binary or not) implemented in scikit-learn allow for training tuned to precision (versus for example recall or F1) and what options do I need to set for that?
What data analysis tools in scikit-learn can be used for feature selection to narrow down the features that might be the most relevant to the precision-oriented classification of a particular class?
This is not really a "big data" problem: K is about 100, k is about 15, the total number of samples available to me for training and testing is about 100,000.
Thx
Given that k is small, I would just do this manually. For each desired class, train your individual (one vs the rest) classifier, take a look at the precision-recall curve, and then choose the threshold that gives the desired precision.
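The threshold-picking step could be sketched like this with scikit-learn's precision_recall_curve. The labels and scores are toy values, threshold_for_precision is a hypothetical helper, and in practice the scores would come from your one-vs-rest classifier on held-out data:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_precision(y_true, scores, target_precision):
    """Lowest score threshold whose precision meets the target, or None."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision and recall have one more entry than thresholds;
    # the final (precision=1, recall=0) point has no threshold, so skip it.
    meets_target = precision[:-1] >= target_precision
    if not meets_target.any():
        return None
    # Recall never increases as the threshold rises, so the first match
    # is the one that keeps the most recall.
    idx = int(np.argmax(meets_target))
    return float(thresholds[idx]), float(precision[idx]), float(recall[idx])

# Toy held-out data for one class of interest (1 = belongs to the class).
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
scores = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])

result = threshold_for_precision(y_true, scores, target_precision=0.8)
print(result)
```

Pick the threshold on a validation split rather than the training data, otherwise the precision you read off the curve will be optimistic.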
