What are the applications of length normalization? - nlp

I found some info about Length Normalization. I found it mentioned only in the context of search engines. Have people used it for different textual purposes? (please forgive my ignorance. I've truly searched for other uses of it but google keeps confusing the term "normalization" with "scaling"...).

The link you provide in the question already mentions one reason for using length-normalization: to keep long documents from dominating simply because their raw term-frequency counts are higher. This affects document ranking considerably, and a direct application is, of course, query-based document retrieval.
There are other algorithm-specific applications as well. For example, if you want to cluster documents using cosine similarity between the vectors: simple clustering algorithms such as k-means work with Euclidean distance, which only matches cosine similarity when the vectors are all on a sphere, i.e. all vectors have the same length.
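For instance, here's a minimal sketch of that setup with scikit-learn (the toy corpus and cluster count are just placeholders): TfidfVectorizer length-normalizes every document vector to unit L2 length by default, so plain Euclidean k-means then behaves like clustering by cosine similarity.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Toy corpus; replace with your own documents
    docs = [
        "the cat sat on the mat",
        "dogs and cats make good pets",
        "stock markets fell sharply today",
        "investors sold shares as markets dropped",
    ]

    # norm='l2' (the default) puts every document vector on the unit sphere
    X = TfidfVectorizer(norm="l2").fit_transform(docs)

    # On unit-length vectors, Euclidean distance tracks cosine similarity:
    # ||a - b||^2 = 2 - 2*cos(a, b) when ||a|| = ||b|| = 1
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)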

Related

doc2vec infer words from vectors

I am clustering comments.
After preprocessing and vectorizing the text, I have inferred vectors from my doc2vec model and applied kmeans.
After that I want to convert cluster centroid vectors to words to kinda look at the semantic cores of the clusters. Is it possible?
Edit: I use python/gensim.
There are a bunch of potential approaches you could try, and see which might offer what you want.
First & foremost, some of the Gensim Doc2Vec modes co-train word-vectors into the same coordinate system as the doc-vectors – allowing direct comparisons between words & docs, sometimes even to the level of compositional 'vector-arithmetic' (like in the famous word2vec analogy-solving examples).
You can see this potential discussed in the paper "Document Embedding with Paragraph Vectors".
The default PV-DM mode (parameter dm=1) automatically co-trains words and docs in the same space. You can also add interleaved word-vector skip-gram training into the other PV-DBOW dm=0 mode by adding the optional parameter dbow_words=1.
While it is still the case that d2v_model.dv.most_similar(docvec_or_doctag) will only return doc-vector results, and d2v_model.wv.most_similar(wordvec_or_word_token) will only return word-vector results, you can absolutely provide a raw vector of a document to the set of word-vectors, or a word-vector to the set of doc-vectors, to get the nearest-neighbors of the other type.
So in one of these modes, with doc-vector, you can use...
d2v_model.wv.most_similar(positive=[doc_vector])
...to get a list-of-words that are closest to that doc-vector. Whether they're sufficiently representative will vary based on lots of factors. (If they seem totally random, there may be other problems with your data-sufficiency or process, or you may be using the dm=0, dbow_words=0 mode that leaves words random & untrained.)
You could use this on the centroid of your clusters – but note, a centroid might hide lots of the variety of a larger grouping, which might include docs not all in a tight 'ball' around the centroid. So you could also use this on all docs in a cluster, to get the top-N closest words to each – and then summarize the cluster as the words most often appearing in those many top-N lists, or most uniquely appearing in those top-N lists (versus the top-N lists of other clusters). That might describe more of the full cluster.
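A rough sketch of that centroid-to-words idea, assuming Gensim's Doc2Vec in a word-training mode (dm=1) and scikit-learn's k-means; the tiny corpus and all parameter values are placeholders you'd replace with your own comments and tuning:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.cluster import KMeans

    # Replace with your own tokenized comments
    texts = [
        ["students", "learn", "python", "programming"],
        ["kids", "paint", "and", "draw", "in", "art", "class"],
        ["teams", "collaborate", "on", "group", "projects"],
        ["language", "lessons", "improve", "vocabulary"],
    ]
    corpus = [TaggedDocument(words, [i]) for i, words in enumerate(texts)]

    # dm=1 (PV-DM) co-trains word-vectors and doc-vectors in the same space
    model = Doc2Vec(corpus, vector_size=50, dm=1, min_count=1, epochs=40)

    doc_vectors = [model.dv[i] for i in range(len(texts))]
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(doc_vectors)

    for cluster_id, centroid in enumerate(kmeans.cluster_centers_):
        # nearest word-vectors to this cluster centroid (a vector in the shared space)
        print(cluster_id, model.wv.most_similar(positive=[centroid], topn=5))

The same most_similar() call also works on each individual doc-vector, if you'd rather tally per-document top-N word lists as described above.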
Separately, there's a method from Gensim's Word2Vec, predict_output_word(), which vaguely simulates the word2vec training-predictions to give a ranked list of predictions of a word from its surrounding words. The same code could be generalized to predict document-words from a doc-vector – there's an open pending issue to do so, and it'd be a simple bit of coding, though no-one's tackled it yet. (It'd be a welcome, and pretty easy, first contribution to the Gensim project.)
Also: after having established your clusters, you could even put the Doc2Vec model aside, and use more traditional direct counting/frequency methods to pick out the most-salient words in each cluster. For example, turn each cluster into a single synthetic pseudodocument. Rank the words inside by TF-IDF, compared to the other cluster pseudodocs. (Or, get the top TF-IDF terms for every one of the individual original documents; describe each cluster by the most-often-relevant words tallied across all cluster docs.)
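A hedged sketch of that pseudodocument idea with scikit-learn's TfidfVectorizer (the texts and cluster labels below are toy stand-ins for your clustered documents):

    from collections import defaultdict
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy documents and cluster assignments (e.g. from k-means)
    texts = [
        "students learn python and write programs",
        "the class painted murals for the art show",
        "programming projects in java and python",
        "watercolor painting and sketching exercises",
    ]
    labels = [0, 1, 0, 1]

    # Concatenate each cluster's documents into one synthetic pseudodocument
    grouped = defaultdict(list)
    for text, label in zip(texts, labels):
        grouped[label].append(text)
    cluster_ids = sorted(grouped)
    pseudodocs = [" ".join(grouped[c]) for c in cluster_ids]

    # Rank each pseudodocument's words by TF-IDF against the other clusters
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(pseudodocs)
    terms = np.array(vectorizer.get_feature_names_out())

    for row, cluster_id in enumerate(cluster_ids):
        weights = tfidf[row].toarray().ravel()
        print(cluster_id, list(terms[weights.argsort()[::-1][:5]]))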
Though gojomo's answer makes perfect sense, I've decided to go the other way around and use classification instead of clustering. Here is an article about the library I found useful:
https://towardsdatascience.com/unsupervised-text-classification-with-lbl2vec-6c5e040354de

Domain-specific word similarity

Does anyone know of an accurate tool or method that can be used to compute word embeddings or find similarity among domain-specific words? I'm working on an NLP project that involves computing cosine similarity between technical terms, such as "address" and "socket", but pre-trained models like word2vec aren't giving useful embeddings or accurate cosine similarities because they aren't specific to technical terms. Since the more general, non-technical meanings of "address" and "socket" aren't similar to one another, these pretrained models aren't giving them sufficiently high similarity scores for the purposes of my project. Would appreciate any advice people would be able to offer. Thank you!
With sufficient data from your specific domain, you can train your own word2vec model - whose resulting word-vectors, being only influenced by your domain data, will be far more reflective of the in-domain meanings.
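A minimal sketch of that with Gensim's Word2Vec; the three 'sentences' below are placeholders for a real in-domain corpus, which would need to be much larger:

    from gensim.models import Word2Vec

    # Replace with tokenized sentences from your own technical corpus
    sentences = [
        ["bind", "the", "socket", "to", "a", "local", "ip", "address"],
        ["the", "server", "listens", "on", "a", "tcp", "socket"],
        ["packets", "are", "routed", "to", "the", "destination", "address"],
    ]
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=50)

    # With enough in-domain data, this similarity reflects the technical senses
    print(model.wv.similarity("address", "socket"))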
Similarly, if you have a mixture of data where you have hints that some word uses are for different senses of a polysemous word, you could try preprocessing your text, using those hints, replacing the ambiguous tokens (like say 'address') with a larger number of distinct tokens (like 'address*networking', 'address*delivery', etc). Even with a lot of error in such a process, its results might be sufficient for a specific purpose.
For example, maybe you'd assume all docs of a certain type – like articles from a particular publication – always mean 'address*networking' when they write 'address'. That crude replacement, on just some subset of docs sufficient to collect enough varied examples of 'address*networking' usage, might leave you with a good-enough word-vector for 'address*networking'.
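A crude sketch of that kind of hint-based replacement; the doc-type rule and the 'address*networking' / 'address*delivery' tokens are purely illustrative:

    def tag_senses(tokens, doc_type):
        # Hypothetical rule: assume networking publications always use the
        # networking sense of "address"; everything else gets the delivery sense.
        sense = "address*networking" if doc_type == "networking" else "address*delivery"
        return [sense if tok == "address" else tok for tok in tokens]

    print(tag_senses(["bind", "the", "socket", "to", "an", "address"], doc_type="networking"))
    # -> ['bind', 'the', 'socket', 'to', 'an', 'address*networking']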
(More generally, deciding which word sense of multiple candidates is meant by a particular word is called "word sense disambiguation", and it might be possible to use other preexisting code for performing that to help preprocess texts – replacing ambiguous tokens with more-specific stand-ins – before performing word2vec training.)
Even without such assistive pre-processing, there've been a number of research attempts to extend word2vec to better model words with multiple contrasting meanings. Googling for [word2vec polysemy] or [polysemous embeddings] should turn up a bunch of examples.
But I don't know of any of those techniques having become widely used, or being explicitly supported by major word2vec libraries, so I can't specifically recommend or show working code for any. There's no standard best practice or off-the-shelf solution here; you'd have to treat adopting those ideas from research papers as an R&D project, doing a lot of your own implementation and evaluation to see whether any of them help with your goals.

how can I simplify BoWs?

I'm trying to apply some binary text classification but I don't feel that having millions of >1k length vectors is a good idea. So, which alternatives are there for the basic BOW model?
I think there are quite a few different approaches, based on what exactly you are aiming for in your prediction task (processing speed over accuracy, variance in your text data distribution, etc.).
Without any further information on your current implementation, I think the following avenues offer ways for improvement in your approach:
Using sparse data representations. This might be a very obvious point, but choosing the right data structure to represent your input vectors can already save you a great deal of pain. Sklearn offers a variety of options and details them in its user guide. Specifically, you could either work with scipy.sparse matrices directly, or build sparse feature vectors with sklearn's DictVectorizer (see the sketch after the next point).
Limit your vocabulary. There might be some words that you can easily ignore when building your BoW representation. I'm again assuming that you're working with something similar to sklearn's CountVectorizer, which already offers a great number of possibilities. The most obvious option is stopwords, which can simply be dropped from your vocabulary entirely, but you can also shrink it further with pre-processing steps such as lemmatization/stemming, lowercasing, etc. CountVectorizer also lets you control the minimum and maximum document frequency (don't confuse this with corpus frequency), which again should limit the size of your vocabulary.
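A short sketch pulling both points together with CountVectorizer, which returns a scipy.sparse matrix and supports stop-word removal, lowercasing, and document-frequency cut-offs (the toy corpus and thresholds are placeholders to tune on your own data; DictVectorizer is the analogous choice if your features are already dicts of counts):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "The cat sat on the mat",
        "The dog chased the cat",
        "Stock markets fell sharply today",
    ]
    vectorizer = CountVectorizer(
        stop_words="english",  # drop common English stopwords from the vocabulary
        lowercase=True,        # fold case before counting
        min_df=1,              # raise this (e.g. to 5) on a real corpus to drop rare terms
        max_df=0.9,            # drop terms appearing in more than 90% of documents
    )
    X = vectorizer.fit_transform(docs)   # sparse document-term matrix, not a dense array
    print(type(X), X.shape, vectorizer.get_feature_names_out())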

human-interpretable, meaningful clusters using doc2vec

I am clustering a set of education documents using doc2vec.
As a human, I think of these as in categories such as:
computer-related
language related
collaboration
arts
etc.
I wonder if there is a way to 'guide' the doc2vec clustering into a set of clusters that are human-interpretable.
One strategy I have been trying is to filter out all 'nonsense' words, and only train doc2vec on the words that seem meaningful. But of course, this seems to perhaps ruin the training.
Something just occurred to me that might work:
Train on entire documents (don't filter out words) to create doc2vec space
Filter nonsense words ('help', 'student', etc. are words that have very little meaning in this space) out of each document
Project filtered documents into doc2vec space
then process using k-means etc
I would appreciate any constructive suggestions or next steps.
best
Your plan is fine; you should try it to evaluate the results. The clusters may not map tightly to your preconceived groupings, but by looking at the example docs per cluster, you'll probably be able to form your own rough idea of what the cluster "is" in human-crafted descriptive terms.
Don't try too much guesswork preprocessing (like eliminating words) at first. Try those kinds of variations after you have the simplest possible approach working, as a baseline – so you can evaluate (even if only by ad hoc eyeballing) whether they're helping as expected. (For example, if a word like 'student' truly appears across all documents equally, it won't have much influence either way on Doc2Vec final doc coordinates... so you don't have to make that judgement call yourself, it'll just be deemphasized automatically.)
I'm assuming that by Doc2Vec you mean the 'Paragraph Vector' algorithm, as implemented by the Doc2Vec class in Python gensim. Some PV-Doc2Vec modes, including the default PV-DM (dm=1) and also the simpler PV-DBOW if you also enable concurrent word-training (dm=0, dbow_words=1), train word-vectors into the same space as doc-vectors. So the word-vectors that are closest to the doc-vectors in a cluster, or the cluster's centroid, might be useful as interpretable descriptions of the cluster.
(In the word-vector space, there's also research that tries to make the individual dimensions of word-vectors more interpretable by constraining training in some way, such as requiring vectors to be sparse with only non-negative dimensions. See for example this NNSE work and other papers like it. Presumably that might also be applicable to doc-vectors, but I don't know offhand any papers or libraries to do that.)
You could also apply other topic-modeling algorithms, like LDA, that calculate discrete 'topics' that are usually fairly interpretable, and report the strongest topics in each document. (You can cluster on the full doc-topics weights, or perhaps just naively assign each document to its one strongest topic as a simple kind of clustering.)
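A minimal sketch of that with Gensim's LdaModel (the topic count, passes, and toy corpus are placeholders):

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # Replace with your tokenized education documents
    texts = [
        ["students", "write", "python", "code", "in", "the", "computer", "lab"],
        ["pupils", "practice", "french", "grammar", "and", "vocabulary"],
        ["teams", "collaborate", "on", "a", "shared", "art", "project"],
    ]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    lda = LdaModel(corpus, num_topics=3, id2word=dictionary, passes=20, random_state=0)

    # Human-readable top words per topic
    for topic_id, words in lda.show_topics(num_topics=3, num_words=5, formatted=False):
        print(topic_id, [w for w, _ in words])

    # Strongest topic per document: a naive one-topic-per-doc 'clustering'
    for i, bow in enumerate(corpus):
        print(i, max(lda.get_document_topics(bow), key=lambda t: t[1]))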

Best for resume, document matching

I have used three different ways to calculate the matching between the resume and the job description. Can anyone tell me that what method is the best and why?
I used NLTK for keyword extraction and then RAKE for keyword/keyphrase scoring, then applied cosine similarity.
Scikit-learn for keyword extraction, TF-IDF, and cosine similarity calculation.
The Gensim library with an LSA/LSI model to extract keywords and calculate cosine similarity between documents and the query.
Nobody here can give you the answer. The only way to decide which method works better is to have one or more humans independently match lots and lots of resumes and job descriptions, and compare what they do to what your algorithms do. Ideally you'd have a dataset of already matched resumes and job descriptions (companies must do this kind of thing when people apply), because it takes a lot of work to create a sufficiently large dataset.
Next time you take on this kind of project, start by considering how you are going to evaluate the performance of the solution you'll put together.
As already mentioned in other answers, try to use Doc2Vec.
Using Doc2Vec from Gensim on both corpora (CVs and job descriptions) and then taking the cosine similarity between the two vectors seems to be the easiest workflow. It works better than the other methods on documents that are not similar in form and word content but are similar in context and semantics, so keywords alone would not help much here.
Then you can try to train a CNN on a corpus of matched CV/JD pairs, with yes/no labels if available, and use it to qualify CVs/resumes against job descriptions.
Basically, I'm going to try these approaches on my own, very similar task; please see https://datascience.stackexchange.com/questions/22421/is-there-an-algorithm-or-nn-to-match-two-documents-basically-not-closely-simila
Since it's highly likely that job-description and resume content will differ, you should think from a semantics point of view. One possible thing you can do is use some domain knowledge, but it's pretty difficult to gain domain knowledge for a wide variety of job types. Researchers sometimes use dictionaries to augment the similarity matching between documents.
Researchers are also using deep neural networks to capture both the syntactic and semantic structure of documents. You can use Doc2Vec to compare two documents; Gensim can produce the Doc2Vec representations for you. I believe that will give better results than keyword extraction plus similarity computation. You could also build your own neural network model trained on job descriptions and resumes; I expect neural networks would be effective for this task.
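A hedged sketch of that Doc2Vec flow with Gensim; here a single model is trained over the resumes and the job description so all vectors share one space, and the texts and parameters are placeholders:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    import numpy as np

    resumes = [
        "experienced python developer with a machine learning background",
        "registered nurse with five years of icu experience",
    ]
    job_description = "looking for a software engineer skilled in python and ml"

    # Train one model over both corpora so resume and JD vectors are comparable
    docs = [TaggedDocument(text.split(), [i]) for i, text in enumerate(resumes + [job_description])]
    model = Doc2Vec(docs, vector_size=50, dm=1, min_count=1, epochs=40)

    jd_vec = model.infer_vector(job_description.split())
    for i, resume in enumerate(resumes):
        r_vec = model.infer_vector(resume.split())
        cosine = np.dot(jd_vec, r_vec) / (np.linalg.norm(jd_vec) * np.linalg.norm(r_vec))
        print(i, round(float(cosine), 3))   # higher cosine = better match candidate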
