How to identify similar words using word2vec - nlp

Input: a set of words (N) and an input sentence.
Problem statement:
The sentence is dynamic; the user can give any sentence related to one business domain, and we have to map the input sentence's tokens to the words in the set based on closeness.
Users can ask the same question with different wording, and it is hard to maintain every synonym, so we need a mechanism that finds similar words so we can map them easily. For example:
1) A meeting scheduled by john
2) A meeting organized by john
A user can frame a sentence in different ways, as in the example above, and "scheduled" and "organized" are very close.
The set N contains the word "scheduled". If a user gives a sentence like (2), I have to map "organized" to "scheduled".

Take a look at "Word Mover's Distance", a way to calculate differences between texts that's essentially based on "bags of word-vectors". It can be expensive to calculate, especially on longer texts, but generally identifies "similar" ranges-of-text better than a simple baseline like "average of all word-vectors".
Beyond that, some of the deeper-neural-network methods of vectorizing text – BERT, ELMo, etc – may do an even-more effective job of placing such "similar intent by different words" texts into close positions in a shared coordinate space.
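As a rough illustration of the vector-similarity idea (not code from the answer above), the sketch below maps each token of an incoming sentence to its closest word in the fixed set N using pretrained gensim vectors, and also compares the two example sentences with Word Mover's Distance; the model name, the contents of N, and the 0.5 threshold are placeholder assumptions.

    # Hedged sketch: map sentence tokens onto a fixed word set N via cosine
    # similarity of pretrained word vectors. Model, set contents and threshold
    # are illustrative assumptions, not part of the original question/answer.
    import gensim.downloader as api

    kv = api.load("glove-wiki-gigaword-100")     # any pretrained KeyedVectors works

    N = {"scheduled", "meeting", "cancelled"}    # the fixed word set

    def map_to_set(tokens, word_set, threshold=0.5):
        """Return {token: closest word in word_set} for tokens passing the threshold."""
        mapping = {}
        for tok in tokens:
            if tok in word_set:
                mapping[tok] = tok
                continue
            if tok not in kv:
                continue
            candidates = [w for w in word_set if w in kv]
            best = max(candidates, key=lambda w: kv.similarity(tok, w))
            if kv.similarity(tok, best) >= threshold:
                mapping[tok] = best
        return mapping

    print(map_to_set("a meeting organized by john".split(), N))
    # 'organized' should map to 'scheduled' if the vectors agree

    # Word Mover's Distance between the two example sentences (needs the POT package)
    print(kv.wmdistance("a meeting scheduled by john".split(),
                        "a meeting organized by john".split()))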

Related

How do I use NLP to find which group of words a sentence is closest to?

I am trying to use NLP to see how well survey responses fit into predetermined categories. I can't use normal text-classification methods since a given response usually contains multiple categories.
Instead, I've pulled out the 10-20 words most commonly used in each category, and I want to build a script that takes a survey response as input and computes how much it aligns with each list of words. Ideally I'd like it to also recognize words similar to the ones in each list. The final result should be a vector describing how much the response aligns with each group of words.
My only idea so far is a for loop over every word in a response, with a counter for each group that goes up when a word matches. However, that wouldn't handle synonyms or similar words. Is there any way to work this out?
I do not have 50 reputation so I can't comment, but I think that if you build a vector representation of every word, you can capture word meaning more precisely: that is, create a vector for each word and then, to measure how related two words are, just calculate their cosine similarity.
The problem here is which features you should use to create these vectors. Well, your question is a bit open, so we cannot help you much there; there are several ways to do this.
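As one hedged illustration of the cosine-similarity idea (not the original answerer's code), the sketch below scores a survey response against each category's word list using pretrained word vectors; the category lists and the model name are made up.

    # Hedged sketch: score a response against each category's word list with
    # word-vector cosine similarity instead of exact-match counting.
    # The categories and model name are illustrative placeholders.
    import gensim.downloader as api
    import numpy as np

    kv = api.load("glove-wiki-gigaword-100")

    categories = {
        "price":   ["price", "cost", "expensive", "cheap"],
        "support": ["support", "help", "service", "staff"],
    }

    def category_scores(response):
        tokens = [t for t in response.lower().split() if t in kv]
        scores = {}
        for name, words in categories.items():
            words = [w for w in words if w in kv]
            # for each response token, take its best-matching category word, then average
            sims = [max(kv.similarity(t, w) for w in words) for t in tokens]
            scores[name] = float(np.mean(sims)) if sims else 0.0
        return scores

    print(category_scores("the pricing was too costly for us"))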

word2vec or alternative for finding synonymous phrases based on position

I work for a hospital. One of the tasks I'm working on now is finding codes, from a controlled vocabulary (RxNorm), that correspond to the string representation of medicines from our electronic health records.
For example, "500 mg tylenol tablet" would be mapped to RxNorm 209459, "Acetaminophen 500 MG Oral Tablet [Tylenol]" with a score of 0.8, using the RxNav API
There's lots of ways to do this nowadays. I would like to optimize our success by finding abbreviations and other tokens that are common in our mediation strings but not in any of the medication labels in RxNorm.
For example, "500 mg tylenol po tab" also maps to RxNorm 209459, but only with a score of 0.67, because RxNorm doesn't seem to know that "po" is common medical jargon for "by mouth" or "oral", and tab is a lexical variant of "tablet". It seems to work very well, but only with perfect word matches.
Can word2vec, or something else, detect the similarity between "po tab" and "oral tablet" since the EHR frequently contains strings like
"blah blah po tab"
And RxNorm has
"blah blah oral tablet"
with the same "blahs"?
I tried following the word2vec demo scripts, but got almost all noise. Obviously my strings are themselves short phrases, not snippets from narratives, and the training set is small too: so far I've been training on a well-characterized corpus of 11,026,087 (non-unique) words spread over 2,148,750 lines.
I have been using the 2013 fork of word2vec that compiles under macOS clang without any fiddling.
Though these small phrases aren't quite like the varied natural-language text usually used with word2vec & related algorithms, with enough data, it might be helpful. It will tend to learn which words are "highly related", even if not exact synonyms.
The best data would have many examples of each token's use, in varied contexts, including mixes of different lingo. For example, if you only have training data that includes...
blah blah oral tablet
blah blah po tab
...it'll be harder for it to discover the similarities between 'oral' & 'po', and 'tablet' & 'tab', than if you also had training examples which included:
blah blah oral tab
blah blah po tablet
(That is: data that's a little more chaotic/gradual in its lingo mixes may be better than something that keeps alternate conventions totally separate.)
When you say you're getting "all noise", are the lists of most-similar words sensible for your purposes? (For example, are 'oral' and 'po' very close, after training?) If so, at least a little, you may be on the right path and be able to tune further to get better results. If not, your data or training parameters may be insufficient or have some other problems.
In training, with smaller or less-varied data, it can be helpful to reduce the vector-dimensionality, or up the number of training-epochs, to squeeze meaningful final vector-positions out. If your data has some natural sort-order that groups all related items together – such that certain words only appear all early, or all late – an initial shuffle of examples may help a little.
The window parameter can be especially influential in affecting whether the resulting model emphasizes exact 'syntactic' (drop-in-replacement word) similarity, or general domain/topic similarity. Smaller windows – say just 1-3 words – emphasize drop-in replacement words (both synonyms & antonyms), while larger windows find more general associations.
(See this answer for a bit more context & a link to a paper which observed this window-size effect.)
You might want to try a later word2vec implementation, like that in the Python gensim library, if any part of your pipeline is in Python or you want to try a few options that weren't in the initial Google word2vec.c (like using non-default ns_exponent values, which one paper suggested was especially useful in recommendation-applications where related-item-basket token frequencies are somewhat different from natural-language).
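A minimal gensim training sketch along those lines is below; the file name and the exact parameter values are placeholders to experiment with, not recommendations from the original answer.

    # Hedged sketch: train word2vec with gensim using a smaller dimensionality,
    # more epochs, a small window, and a shuffled corpus, as discussed above.
    # "medication_strings.txt" and all parameter values are placeholders.
    import random
    from gensim.models import Word2Vec

    with open("medication_strings.txt", encoding="utf8") as f:
        sentences = [line.lower().split() for line in f]

    random.shuffle(sentences)      # break any ordering that groups related lines together

    model = Word2Vec(
        sentences,
        vector_size=100,    # smaller dimensionality for a smallish corpus
        window=2,           # small window -> more drop-in-replacement similarity
        min_count=2,
        epochs=20,          # extra passes to compensate for limited data
        ns_exponent=0.75,   # also worth trying values other than this default
        workers=4,
    )

    print(model.wv.most_similar("po", topn=10))   # sanity check: is 'oral' near the top?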
If many of your 'unknown' words are in fact abbreviations or typos of known-words, using the Facebook 'FastText' refinement of word2vec may be a help. It also learns vectors for subwords, so pulls 'tab' and 'tablet' closer to each other, and when confronted with a never-before-seen word can assemble a candidate-vector from word fragments that's usually better than a random guess, same as people intuit a word's general gist from word-roots. (Python gensim also contains a FastText implementation.)
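And a corresponding FastText sketch, also with gensim, on the same placeholder corpus:

    # Hedged sketch: FastText (via gensim) on the same placeholder corpus.
    # Subword n-grams let 'tab' and 'tablet' share character fragments.
    from gensim.models import FastText

    with open("medication_strings.txt", encoding="utf8") as f:   # placeholder corpus
        sentences = [line.lower().split() for line in f]

    ft = FastText(
        sentences,
        vector_size=100,
        window=2,
        min_count=2,
        epochs=20,
        min_n=3, max_n=6,    # character n-gram range used for the subword vectors
    )

    print(ft.wv.similarity("tab", "tablet"))
    print(ft.wv.most_similar("po"))      # also works for rare or unseen spellings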
If you do achieve a word-model whose lists of most-similar words seem sensible to you, you might then try:
when you have a text with words you know aren't in RxNorm, try replacing the unknown words with their nearest neighbor that is in RxNorm
using "Word Mover's Distance" to compare your phrases with known phrases - it's often good at quantifying the shift between short phrases, using word-vectors as an input. It's expensive on larger texts, but on 4-6 word fragments it might work really well. (It's available on the gensim word-vector classes as a .wmdistance() method; both ideas are sketched below this list.)
Finally, to the extent there's a limited number of 'tab'->'tablet' type exact replacements, progressively replacing any fuzzy discoveries from word2vec analysis with expert-confirmed synonyms seems a good idea, to replace statistical guesses with sure-things.
Going back to the example above, if you already knew 'tab'->'tablet', but not yet 'po'->'oral', it might even make sense to take all texts that have 'tab' or 'tablet' & create new additional synthetic examples with that word reversed. That could give subsequent word2vec training an extra hint/shove in the direction of being able to realize that 'po'/'oral' fills the same relative-role with both 'tab'/'tablet'.
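A tiny sketch of that augmentation step; the two example lines and the 'tab'/'tablet' pair stand in for the real corpus and the expert-confirmed synonyms:

    # Hedged sketch: for every line containing a word with an expert-confirmed
    # replacement, emit an extra synthetic copy with the word swapped, so later
    # training sees both conventions in the same contexts.
    sentences = [["500", "mg", "tylenol", "po", "tab"],
                 ["acetaminophen", "500", "mg", "oral", "tablet"]]   # stand-in corpus
    known = {"tab": "tablet", "tablet": "tab"}                       # confirmed pairs

    augmented = list(sentences)
    for sent in sentences:
        if any(tok in known for tok in sent):
            augmented.append([known.get(tok, tok) for tok in sent])

    print(augmented)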

Token sequence labeling

I have a task which is half matching and half entity extraction. I want to label words that, in certain contexts, refer to a given label. Named entity recognition would be the way to go, but these words do not necessarily share structure (they can be verbs, nouns, etc.). I could simply use a dictionary, but I would like to use context to label them. I am having trouble finding a solution to this problem. Can NER be used for this, or is this a completely different task?
To give an example, say I am interested in the category "customer acceptance". These are two possible sentences: "this is a fair amount of data!" and "this condition is not fair". I want my word extractor to find only the second 'fair'.
In other words, it is like a dictionary that takes context into account.

How can I quantify the difference in meaning of two terms? For example "bird" and "chair"

Edited:
I have some terms/topics and I want to quantify how different these terms/topics are in meaning or domain from each other. Following is the use case in which I want to apply it:
Right now I have a dataset from Twitter about a particular cricket match (tweets with the hashtag of that match). I want to see how many other topics, unrelated to the cricket match, make their way into such tweets. For example, if someone starts talking about "Syrian refugees" in such a tweet, that will not be very related to the topic of the cricket game.
My basic approach is to extract topics from these tweets and then identify which topics are closely related to the domain of cricket and which ones are not.
Statistically, you can look at word2vec, fasttext, and similar models. Here "difference" can be the distance (Euclidean) or cosine similarity between two points in the vector space. In short, you load your corpus into an engine which creates an n-dimensional space, placing words (and sometimes documents or character n-grams) as points in that space in such a way that words appearing in similar contexts get close representations (vectors).
One drawback of most such representations is that antonyms often appear close to each other: for instance, in "I love you" and "I hate you", love and hate have very similar contexts.
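For a quick, hedged illustration of the statistical option (the pretrained model name is just an example):

    # Hedged sketch: cosine similarity between two terms using pretrained vectors.
    import gensim.downloader as api

    kv = api.load("glove-wiki-gigaword-100")
    print(kv.similarity("bird", "chair"))     # expected to be relatively low
    print(kv.similarity("bird", "sparrow"))   # expected to be much higher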
From a semantic point of view, since you added the ontology tag, you can use a structured knowledge base or ontology. One option is to define "distance" as the separation in the taxonomy between the two terms: you can check whether they appear on the same level as siblings, whether one is a parent of the other, or other relations. I believe the most straightforward way is to manually define weights for each relation, but statistical approaches based on graph traversal and clustering may also be appropriate.
For classes you can use the number of instances you have and any relations between those instances. For instance, you can calculate the distance between "bird" and "chair" by the number of instances of birds and chairs for which you have the relation "sits on". Hopefully "person" and "chair" will be much closer, as most of your person objects will have a designated "chair" object.
For a quick look, you can use bird-noun-1 and chair-noun-1 with WordNet at http://labs.fc.ul.pt/dishin/, which gives you:
Resnik: 0.315625756544
Lin: 0.0574161071905
Jiang&Conrath: 0.0964964414156
The Python code is at https://github.com/lasigeBioTM/DiShIn
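For comparison, a WordNet-based measure can also be computed locally with NLTK; path and Wu-Palmer similarity are shown, which are different measures from the Resnik/Lin/Jiang&Conrath figures above, so the numbers will differ.

    # Hedged sketch: local WordNet similarity with NLTK, as an alternative to
    # the DiShIn web tool above (different measures, so different numbers).
    import nltk
    nltk.download("wordnet")
    from nltk.corpus import wordnet as wn

    bird = wn.synset("bird.n.01")
    chair = wn.synset("chair.n.01")
    print(bird.path_similarity(chair))
    print(bird.wup_similarity(chair))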

Finding how relevant a text is, given a whitelist and blacklist of words/phrases

This is a case of me wanting to search for something online but not knowing what it's called.
I have a collection of job descriptions in text files, some only a sentence or two long, most a paragraph or two. I want to write a script that, given a set of rules, will notify me when it finds a job description I would want.
For example, let's say I am looking for a job in PHP programming, but not a full-time position and not a designing position. So my "rule book" could be:
want: PHP
want: web programming
want: telecommuting
do not want: designing
do not want: full-time position
What is a method I could use to sort these files into a "pass" pile (descriptions that match what I'm looking for) and a "fail" pile (descriptions that are not relevant)? Some ideas I was considering:
Count the occurrences of the phrases in the text file that are also in my "rule book", and reject those that contain words that I do not want. This doesn't always work, though, because what if a description says "web designing not required"? Then my algorithm would say "That contains the word designing so it is not relevant" when it really was!
When searching the text for phrases that I do and do not want, count phrases within a certain Levenshtein distance as the same phrase. For example, designing and design should be treated the same way, as well as misspellings of words, such as programing.
I have a large collection of descriptions that I have looked through manually. Is there a way I could "teach" the program "these are examples of good descriptions, these are examples of bad ones"?
Does anyone know what this "filtering process" is called, and/or have any advice or methods on how I can accomplish this?
You basically have a text classification or document classification problem. This is a specific case of binary classification, which is itself a specific case of supervised learning. It's a well-studied problem and there are many tools to do it. Basically, you give a set of good documents and bad documents to a learning or training process, which finds words that correlate strongly with positive and negative documents and outputs a function capable of classifying unseen documents as positive or not. Naive Bayes is the simplest learning algorithm for this kind of task, and it will do a decent job. There are fancier algorithms, like Logistic Regression and Support Vector Machines, which will probably do somewhat better, but they are more complicated.
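A minimal scikit-learn sketch of that setup; the tiny labelled lists are placeholders for the manually sorted descriptions.

    # Hedged sketch: Naive Bayes over bag-of-words features with scikit-learn.
    # The 'good'/'bad' lists stand in for the manually sorted descriptions.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    good = ["php web programming, telecommuting ok", "remote php developer wanted"]
    bad  = ["full-time web designing position", "on-site graphic design role"]

    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(good + bad, ["pass"] * len(good) + ["fail"] * len(bad))

    print(clf.predict(["part-time php programmer, work from home"]))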
To determine which word variants are actually equivalent to each other, you want to do some kind of stemming. The Porter stemmer is a common choice here.
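A short stemming sketch with NLTK's Porter stemmer, so that variants like 'designing' and 'design' collapse to the same token before counting or classification:

    # Hedged sketch: Porter stemming with NLTK so spelling variants collapse
    # to a shared stem before matching or classification.
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in "designing design designer programming".split()])
    # 'designing' and 'design' both reduce to 'design'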
