Domain-specific word similarity - nlp

Does anyone know how of an accurate tool or method that can be used to compute word embeddings or find similarity among domain-specific words? I'm working on an NLP project that involves computing cosine similarity between technical terms, such as "address" and "socket", but pre-trained models like word2vec aren't giving useful embeddings or accurate cosine similarities because they aren't specific to technical terms. Since the more general-nontechnical meanings of "address" and "socket" aren't similar to one another, these pretrained models aren't giving them sufficiently high similarity scores for the purposes of my project. Would appreciate any advice people would be able to offer. Thank you!

With sufficient data from your specific domain, you can train your own word2vec model - whose resulting word-vectors, being only influenced by your domain data, will be far more reflective of the in-domain meanings.
Similarly, if you have a mixture of data where you have hints that some word uses are for different senses of a polysemous word, you could try preprocessing your text, using those hints, replacing the ambiguous tokens (like say 'address') with a larger number of distinct tokens (like 'address*networking', 'address*delivery', etc). Even with a lot of error in such a process, its results might be sufficient for a specific purpose.
For example, maybe you'd assume all docs of a certain type – like articles from a particular publication – always mean 'address*networking' when they write 'address'. That crude replacement, on just some subset of docs sufficient to collect enough varied examples of 'address*networking' usage, might leave you with a good-enough word-vector for 'address*networking'.
(More generally, deciding which word sense of multiple candidates is meant by a particular word is called "word sense disambiguation", and it might be possible to use other preexisting code for performing that to help preprocess texts - replacing ambiguous tokens with more-speciific stand-ins – before performing word2vec training.)
Even without such assistive pre-processing, there've been a number of research attempts to extend word2vec to better model words with multiple contrasting meanings. Googling for [word2vec polysemy] or [polysemous embeddings] should turn up a bunch of examples.
But I don't know any of those techniques that have become widely-used, or that are explicitly supported by major word2vec libraries, so I can't specifically recommend or show working code for any. I don't know a standard best-practice or off-the-shelf solution – you'd have to treat adopting those ideas from research papers as an R&D project, performing a lot of your own implementation/evaluation to see if any help with your goals.

Related

Find list of Out Of Vocabulary (OOV) words from my domain spectific pdf while using FastText model

How to find list of Out Of Vocabulary (OOV) words from my domain spectific pdf while using FastText model? I need to fine tune FastText with my domain specific words.
A FastText model will already be able to generate vectors for OOV words.
So there's not necessarily any need to either list the specifically OOV words in your PDF, nor 'fine tune' as FastText model.
You just ask it for vectors, it gives them back. The vectors for full in-vocabulary words, that were trained from relevant training material, will likely be best, while vectors synthesized for OOV words from word-fragments (character n-grams) shared with training material will just be rough guesses - better than nothing, but not great.
(To train a good word-vector requires many varied examples of a word's use, interleaved with similarly good examples of its many 'peer' words – and traditionally, in one unified, balanced training session.)
If you think you need to do more, you should expand your questin with more details about why you think that's necessary, and what existing precedents (in docs/tutorials/papers) you're trying to match.
I've not seen a well-documented way to casually fine-tune, or incrementally expand the known-vocabulary of, an existing FastText model. There would be a lot of expert tradeoffs required, and in many cases simply training a new model with sufficient data is likely to be a safer approach.
Anyone seeking such fine-tuning should have a clear idea of:
what their incremental data might be able to add to an existing model
what process/code will they be using, and why that process/code might be expected to give meaningful results with their specific starting model & new data
how the results of any such process can be evaluated to ensure the extra fine-tuning steps are beneficial compared to alternatives

Using Bert and cosine similarity fo identify similar documents

we have a news website where we have to match news to a particular user.
We have to use for the matching only the user textual information, like for example the interests of the user or a brief description about them.
I was thinking to threat both the user textual information and the news text as document and find document similarity.
In this way, I hope, that if in my profile I wrote sentences like: I loved the speach of the president in Chicago last year, and a news talks about: Trump is going to speak in Illinois I can have a match (the example is purely casual).
I tried, first, to embed my documents using TF-IDF and then I tried a kmeans to see if there was something that makes sense, but I don't like to much the results.
I think the problem derives from the poor embedding that TF-IDF gives me.
Thus I was thinking of using BERT embedding to retrieve the embedding of my documents and then use cosine similarity to check similarity of two document (a document about the user profile and a news).
Is this an approach that could make sense? Bert can be used to retrieve the embedding of sentences, but there is a way to embed an entire document?
What would you advice me?
Thank you
BERT is trained on pairs of sentences, therefore it is unlikely to generalize for much longer texts. Also, BERT requires quadratic memory with the length of the text, using too long texts might result in memory issues. In most implementations, it does not accept sequences longer than 512 subwords.
Making pre-trained Transformers work efficiently for long texts is an active research area, you can have a look at a paper called DocBERT to have an idea what people are trying. But it will take some time until there is a nicely packaged working solution.
There are also other methods for document embedding, for instance Gensim implements doc2vec. However, I would still stick with TF-IDF.
TF-IDF is typically very sensitive to data pre-processing. You certainly need to remove stopwords, in many languages it also pays off to do lemmatization. Given the specific domain of your texts, you can also try expanding the standard list of stop words by words that appear frequently in news stories. You can get further improvements by detecting and keeping together named entities.

Bert fine-tuned for semantic similarity

I would like to apply fine-tuning Bert to calculate semantic similarity between sentences.
I search a lot websites, but I almost not found downstream about this.
I just found STS benchmark.
I wonder if I can use STS benchmark dataset to train a fine-tuning bert model, and apply it to my task.
Is it reasonable?
As I know, there are a lot method to calculate similarity including cosine similarity, pearson correlation, manhattan distance, etc.
How choose for semantic similarity?
In addition, if you're after a binary verdict (yes/no for 'semantically similar'), BERT was actually benchmarked on this task, using the MRPC (Microsoft Research Paraphrase Corpus).
The google github repo https://github.com/google-research/bert includes some example calls for this, see --task_name=MRPC in section Sentence (and sentence-pair) classification tasks.
As a general remark ahead, I want to stress that this kind of question might not be considered on-topic on Stackoverflow, see How to ask. There are, however, related sites that might be better for these kinds of questions (no code, theoretical PoV), namely AI Stackexchange, or Cross Validated.
If you look at a rather popular paper in the field by Mueller and Thyagarajan, which is concerned with learning sentence similarity on LSTMs, they use a closely related dataset (the SICK dataset), which is also hosted by the SemEval competition, and ran alongside the STS benchmark in 2014.
Either one of those should be a reasonable set to fine-tune on, but STS has run over multiple years, so the amount of available training data might be larger.
As a great primer on the topic, I can also highly recommend the Medium article by Adrien Sieg (see here, which comes with an accompanied GitHub reference.
For semantic similarity, I would estimate that you are better of with fine-tuning (or training) a neural network, as most classical similarity measures you mentioned have a more prominent focus on the token similarity (and thus, syntactic similarity, although not even that necessarily). Semantic meaning, on the other hand, can sometimes differ wildly on a single word (maybe a negation, or the swapped sentence position of two words), which is difficult to interpret or evaluate with static methods.

human-interpretable, meaningful clusters using doc2vec

I am clustering a set of education documents using doc2vec.
As a human, I think of these as in categories such as:
computer-related
language related
collaboration
arts
etc.
I wonder if there is a way to 'guide' the doc2vec clustering into a set of clusters that are human-interpretable.
One strategy I have been trying is to filter out all 'nonsense' words, and only train doc2vec on the words that seem meaningful. But of course, this seems to perhaps ruin the training.
Something just occurred to me that might work:
Train on entire documents (don't filter out words) to create doc2vec space
Filter nonsense words ('help', 'student', etc. are words that have very little meaning in this space) out of each document
Project filtered documents into doc2vec space
then process using k-means etc
I would appreciate any constructive suggestions or next steps.
best
Your plan is fine; you should try it to evaluate the results. The clusters may not map tightly to your preconceived groupings, but by looking at the example docs per cluster, you'll probably be able to form your own rough idea of what the cluster "is" in human-crafted descriptive terms.
Don't try too much guesswork preprocessing (like eliminating words) at first. Try those kinds of variations after you have the simplest possible approach working, as a baseline – so you can evaluate (even if only by ad hoc eyeballing) whether they're helping as expected. (For example, if a word like 'student' truly appears across all documents equally, it won't have much influence either way on Doc2Vec final doc coordinates... so you don't have to make that judgement call yourself, it'll just be deemphasized automatically.)
I'm assuming that by Doc2Vec you mean the 'Paragraph Vector' algorithm, as implemented by the Doc2Vec class in Python gensim. Some PV-Doc2Vec modes, including the default PV-DM (dm=1) and also the simpler PV-DBOW if you also enable concurrent word-training (dm=0, dbow_words=1), train word-vectors into the same space as doc-vectors. So the word-vectors that are closest to the doc-vectors in a cluster, or the cluster's centroid, might be useful as interpretable descriptions of the cluster.
(In the word-vector space, there's also research that tries to make the individual dimensions of word-vectors more-interpretable by constraining training in some way, such as requiring vectors to be spares with only non-negative dimensions. See for example this NNSE work and other papers like it. Presumably that might also be applicable to doc-vectors, but I don't know offhand any papers or libraries to do that.)
You could also apply other topic-modeling algorithms, like LDA, that calculate discrete 'topics' that are usually fairly interpretable, and report the strongest topics in each document. (You can cluster on the full doc-topics weights, or perhaps just naively assign each document to its one strongest topic as a simple kind of clustering.)

Best for resume, document matching

I have used three different ways to calculate the matching between the resume and the job description. Can anyone tell me that what method is the best and why?
I used NLTK for keyword extraction and then RAKE for
keywords/keyphrase scoring, then I applied cosine similarity.
Scikit for keywords extraction, tf-idf and cosine similarity
calculation.
Gensim library with LSA/LSI model to extract keywords and calculate
cosine similarity between documents and query.
Nobody here can give you the answer. The only way to decide which method works better is to have one or more humans independently match lots and lots of resumes and job descriptions, and compare what they do to what your algorithms do. Ideally you'd have a dataset of already matched resumes and job descriptions (companies must do this kind of thing when people apply), because it takes a lot of work to create a sufficiently large dataset.
Next time you take on this kind of project, start by considering how you are going to evaluate the performance of the solution you'll put together.
As already mentioned in answers, try ti use Doc2Vec.
Seems using Doc2Vec from Gensim on both corpora (CVs and job descriptions) separately and then using cosine similarity between the two vectors is the easiest flow to work. It works better than others on documents which are not similar in form and words content but similar in context and sematics, so merely keywords would not help much here.
Then you can try to train CNN on the corpus of pairs of matched CV&JD with labels like yes/no if available and use it to qulaify CVs/resumees against job descriptions.
Basically I'm going to try these aproaches in my pretty much the same task, pls see https://datascience.stackexchange.com/questions/22421/is-there-an-algorithm-or-nn-to-match-two-documents-basically-not-closely-simila
Since its highly likely that job description and resume content can be different, you should think from semantics point of view. One thing possible you can do is use some domain knowledge. But its pretty difficult to gain domain knowledge for a variety of job types. Researchers sometimes use dictionary to augment the similarity matching between documents.
Researchers are using deep neural networks to capture both syntactic and semantic structure of documents. You can use doc2Vec to compare two documents. Gensim can produce doc2Vec representation for you. I believe that will give better results compared to keyword extraction and similarity computation. You can build your own neural network model to train on job descriptions and resumes. I guess neural networks will be effective for your work.

Resources