Words that are similar to or mean 'yes' and 'no' - nlp

I'm wondering if there's a corpus of words that basically mean "yes" or "no". If not, what are the possible algorithms/techniques to collect such information?
I just started to learn NLP, so please bear with me if this is an obvious question. Thank you!

One way to solve this problem is to find words similar to "yes" and "no" in a corpus.
To measure word similarity you can use a model called Word2Vec, introduced by Mikolov et al.
If you train this model on a corpus, it maps each word to a representation in a vector space: each word is represented by a vector (hence the name Word2Vec). Word2Vec assigns high similarity to words that tend to appear in the same contexts.
You can then measure the similarity of two words by calculating the cosine similarity of their word vectors.
Here are the results I get when training Word2Vec on a corpus of product reviews:
First 4 most similar words to yes:
'yeah', 'oh', 'hey', 'sure'
First 4 most similar words to no:
'whatsoever', 'discernible', 'denying', 'zero'
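As a rough illustration, here is a minimal sketch of training such a model with gensim (the library linked in the references below). The toy corpus is a placeholder assumption, and the parameter names follow gensim 4.x:

import gensim
from gensim.models import Word2Vec

# Hypothetical corpus: a list of tokenized sentences (replace with your reviews).
sentences = [
    ["yes", "sure", "i", "would", "recommend", "it"],
    ["no", "it", "did", "not", "work", "whatsoever"],
    # ... many more tokenized sentences
]

# Train a small model; min_count=1 only because this toy corpus is tiny.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

print(model.wv.similarity("yes", "no"))       # cosine similarity of two words
print(model.wv.most_similar("yes", topn=4))   # nearest neighbours of "yes"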
Some Word2vec references :
https://radimrehurek.com/gensim/models/word2vec.html
http://rare-technologies.com/word2vec-tutorial/
EDIT:
You can find words related to "no" and "yes" on the General Inquirer website as well:
http://www.wjh.harvard.edu/~inquirer/No.html
http://www.wjh.harvard.edu/~inquirer/Yes.html
Hope this helps.

Related

How to find a similar sentence in a corpus with word2vec?

I have implemented word2vec on my corpus using the TensorFlow tutorial: https://www.tensorflow.org/tutorials/text/word2vec#next_steps
Now I want to give a sentence as input and find a similar sentence in the corpus.
Any leads on how I can do this?
A simple word2vec model is not capable of such a task, as it only relates the semantics of words to each other, not the semantics of whole sentences. Inherently, such a model has no generative function; it only serves as a look-up table.
Word2vec models map word strings to vectors in the embedding space. To find similar words for a given sample word, one can simply go through all vectors in the vocabulary and find the ones that are closest (in terms of the 2-norm) to the sample word vector. For further information you could go here or here.
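For the word-level lookup, a minimal sketch could look like this, assuming `embeddings` is a dict mapping each vocabulary word to a NumPy vector (e.g. exported from the tutorial's embedding layer); the names here are illustrative:

import numpy as np

def most_similar(word, embeddings, topn=5):
    query = embeddings[word]
    # 2-norm distance from the query vector to every other vocabulary vector.
    distances = {
        other: np.linalg.norm(vec - query)
        for other, vec in embeddings.items()
        if other != word
    }
    # Smallest distances first.
    return sorted(distances, key=distances.get)[:topn]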
However, this does not work for sentences, as it would require a whole vocabulary of sentences from which to pick similar ones - which is not feasible.
Edit: This seems to be a duplicate of this question.

How does gensim word2vec extract training word pairs for a 1-word sentence?

Refer to the image below (the process of how word2vec skip-gram extracts training datasets - the word pairs - from the input sentences).
E.G. "I love you." ==> [(I,love), (I, you)]
May I ask what the word pair is when the sentence contains only one word?
Is it "Happy!" ==> [(happy,happy)] ?
I tested the word2vec algorithm in gensim: when there is just one word in a sentence in the training set (and this word is not included in other sentences), the word2vec algorithm can still construct an embedding vector for this specific word. I am not sure how the algorithm is able to do so.
=============== UPDATE ===============
As the answer posted below explains, I think the word embedding vector created for the word in the 1-word sentence is just the random initialization of the neural network weights.
No word2vec training is possible from a 1-word sentence, because there are no neighboring words to use as input to predict a center/target word. Essentially, that sentence is skipped.
If that was the only appearance of the word in the corpus, and you're seeing a vector for that word, it's just the word's starting random initialization, with no further training. (And you should probably use a higher min_count, as keeping such rare words is usually a mistake in word2vec: they won't get good vectors, and other nearby words' vectors will improve if the 'noise' from all such insufficiently modelable rare words is removed.)
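This is easy to verify with a small gensim sketch (the corpus and parameter values are illustrative assumptions):

from gensim.models import Word2Vec

sentences = [
    ["happy"],                    # 1-word sentence: yields no (context, target) pairs
    ["i", "love", "you"],
    ["you", "love", "me"],
]

# min_count=1 keeps even the word that only ever appears alone.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=1)

# "happy" still has a vector, but it is only its random initialization,
# since no training pairs were ever generated from its sentence.
print(model.wv["happy"][:5])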
If that 1-word sentence actually appeared next to other real sentences in your corpus, it could make sense to combine it with the surrounding texts. There's nothing magic about actual sentences for this kind of word-from-surroundings modeling - the algorithm is just working on 'neighbors', and it's common to use multi-sentence chunks as the texts for training; sometimes even punctuation (like sentence-ending periods) is retained as 'words'. Then words from an actually-separate sentence - but still related by having appeared in the same document - will appear in each other's contexts.

Find a sentence is related to a medical term or not

Input: the user enters a sentence.
If the sentence is related to any medical term, or the user needs medical attention:
Output = True
else:
Output = False
I am reading https://www.nltk.org/. I scraped https://www.merriam-webster.com/browse/medical/a to get medical-related words, but I am unable to figure out how to detect sentences that are related to a medical term. I haven't written any code because the algorithm is not clear to me.
I want to know what I should use and where to start; a tutorial link for implementing this would help. Any guidance will be highly appreciated.
I will list various ways you can do this, from naive to more intelligent:
1. Get a large vocabulary of medical terms, iterate over the sentence, and return yes or no in case you find anything.
2. Get a large vocabulary of medical terms, iterate over the sentence, and do a fuzzy match with each word, so that words that are syntactic (alphabetical) variations of the same word are still detected and caught. [Check the fuzzywuzzy library in Python.]
3. Get a large vocabulary of medical terms with definitions for each. Use pre-trained word embeddings (word2vec, GloVe etc.) for each word in the descriptions of those terms. Take a weighted sum of the word embeddings, with weights set to the TF-IDF of each word, to represent each medical term (its description, to be precise) as a vector. Repeat the process for the sentence as well. Then take the cosine similarity between them to calculate how contextually similar the text is to the description of the medical term. If the similarity is above a certain threshold that you fix, return True. [This approach doesn't need the exact term; even if the person is only talking about the condition, it should be able to detect it - see the sketch after this list.]
4. Label a large number of sentences with the respective medical terms in them (annotate using something like the API.AI or RASA entity annotation tools). Create a neural network with an input embedding layer (which you can initialise with word2vec embeddings if you like), bi-LSTM layers, and a softmax output over the list of medical terms/conditions. This will give you the probability of each condition or term being associated with the sentence.
5. Create a neural network with an encoder-decoder architecture and an attention layer between them. Create encoder embeddings from the input sentence, and have the decoder output a string of medical terms. Train the encoder-decoder and attention layers with pre-annotated data.
6. Create a pointer network which takes a sentence as input and returns pointers back to the inputs, marking each as medical term or non-medical term. (Not easy to build, FYI...)
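A minimal sketch of approach 3, assuming you already have an `embedding(word)` helper returning a pre-trained vector (e.g. from gensim) and a `tfidf(word)` helper returning the word's TF-IDF weight; both helpers and the threshold are hypothetical placeholders:

import numpy as np

def text_vector(tokens, embedding, tfidf):
    # TF-IDF-weighted average of the word vectors in a text.
    vectors = [tfidf(w) * embedding(w) for w in tokens if embedding(w) is not None]
    return np.mean(vectors, axis=0)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def is_medical(sentence_tokens, description_tokens, embedding, tfidf, threshold=0.6):
    # True if the sentence is contextually close to the term's description.
    return cosine(
        text_vector(sentence_tokens, embedding, tfidf),
        text_vector(description_tokens, embedding, tfidf),
    ) >= threshold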
OK, so I don't understand which part you don't understand, because the idea is rather simple and one Google search gives you good, easy results - unless the issue is that you don't know Python, in which case it will be very hard for you to implement this.
The idea itself is simple: tokenize the sentence (put each word as its own item in a list) and search the list of medical terms. If a word is in the list, the term is medical, so the sentence is related to that medical term as well. If you imagine that you have a list of medical terms in a medical_terms list, then in Python it would look something like this:
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthurs' abdomen was hurting."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthurs', 'abdomen', 'was', 'hurting', '.']
>>> def is_medical(tokens):
...     # Return True as soon as any token is a known medical term.
...     for token in tokens:
...         if token in medical_terms:
...             return True
...     # No token matched, so the sentence is not medical.
...     return False
>>> is_medical(tokens)
True
You just tokenize the input sentence with NLTK and then check whether any of the words in the sentence is a medical term. You can adapt this function to work with n-grams as well. There are many other approaches and special cases that have to be handled, but this is a good start.

Calculating the similarity between two vectors

I did LDA over a corpus of documents with topic_number=5. As a result, I have five vectors of words, each word associated with a weight or degree of importance, like this:
Topic_A = {(word_A1, weight_A1), (word_A2, weight_A2), ..., (word_Ak, weight_Ak)}
Topic_B = {(word_B1, weight_B1), (word_B2, weight_B2), ..., (word_Bk, weight_Bk)}
...
Topic_E = {(word_E1, weight_E1), (word_E2, weight_E2), ..., (word_Ek, weight_Ek)}
Some of the words are shared between topics. Now I want to know how I can calculate the similarity between these vectors. I can calculate cosine similarity (and other similarity measures) by programming from scratch, but I was thinking there might be an easier way to do it. Any help would be appreciated. Thank you in advance for spending time on this.
I am programming with Python 3.6 and gensim library (but I am open to any other library)
I know someone else has asked a similar question (Cosine Similarity and LDA topics) but because he didn't get an answer, I am asking it again.
After LDA you have topics characterized as distributions over words. If you plan to compare these probability vectors (weight vectors if you prefer), you can simply use any cosine similarity implemented for Python - sklearn, for instance.
However, this approach will only tell you which topics place, in general, similar probabilities on the same words.
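For instance, a minimal sketch with sklearn, assuming `lda` is a trained gensim LdaModel (an assumption; any topic-word matrix would work the same way):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# get_topics() returns a (num_topics x vocabulary_size) matrix where each
# row is one topic's probability distribution over the vocabulary.
topic_word = lda.get_topics()

# Pairwise cosine similarity between all five topic distributions.
print(np.round(cosine_similarity(topic_word), 3))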
If you want to measure similarities based on semantic information instead of word occurrences, you may want to use word vectors (as those learned by Word2Vec, GloVe or FastText).
These models learn to represent words as low-dimensional vectors that encode certain semantic information. They're easy to use in gensim, and the typical approach is to load a pre-trained model, learned from Wikipedia articles or news.
If you have topics defined by words, you can represent those words as vectors and take the average of the cosine similarities between the words in two topics (we did this for a workshop), as sketched below. There are some sources that use these word vectors (also called word embeddings) to represent topics or documents in some way. For instance, this one.
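A minimal sketch of that averaging approach, assuming the pre-trained "glove-wiki-gigaword-100" vectors available through gensim's downloader (any pre-trained KeyedVectors would work):

from itertools import product
import gensim.downloader as api

# Load pre-trained word vectors (downloaded on first use).
wv = api.load("glove-wiki-gigaword-100")

def topic_similarity(words_a, words_b):
    # Average cosine similarity over all word pairs found in the vocabulary.
    pairs = [(a, b) for a, b in product(words_a, words_b) if a in wv and b in wv]
    return sum(wv.similarity(a, b) for a, b in pairs) / len(pairs)

print(topic_similarity(["sales", "price"], ["market", "discount"]))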
There are some recent publications combining Topic Models and Word Embeddings, you can look for them if you're interested.

Classify words with the same meaning

I have 50,000 subject lines from emails and I want to group the words in them based on synonyms or words that can be used in place of others.
For example:
Top sales!
Best sales
I want them to be in the same group.
I built the following function with NLTK's WordNet, but it doesn't work well.
from nltk.corpus import wordnet

def synonyms(w, group, guide):
    try:
        # Look up the first sense of each word for the given POS tag (guide).
        w1 = wordnet.synset(w + '.' + guide + '.01')
        w2 = wordnet.synset(group + '.' + guide + '.01')
        # Wu-Palmer similarity of at least 0.7 counts as the same group.
        return w1.wup_similarity(w2) >= 0.7
    except Exception:
        # Word not found in WordNet, or no similarity score available.
        return False
Any ideas or tools to accomplish this?
The easiest way to accomplish this would be to compare the similarity of the respective word embeddings (the most common implementation of this is Word2Vec).
Word2Vec is a way of representing the semantic meaning of a token in a vector space, which enables the meanings of words to be compared without requiring a large dictionary/thesaurus like WordNet.
One problem with regular implementations of Word2Vec is that they do not differentiate between different senses of the same word. For example, the word bank would have the same Word2Vec representation in all of these sentences:
The river bank was dry.
The bank loaned money to me.
The plane may bank to the left.
Bank has the same vector in each of these cases, but you may want them to be sorted into different groups.
One way to solve this is to use a Sense2Vec implementation. Sense2Vec models take into account the context and part of speech (and potentially other features) of the token, allowing you to differentiate between the meanings of different senses of the word.
A great library for this in Python is Spacy. It is like NLTK, but much faster as it is written in Cython (20x faster for tokenization and 400x faster for tagging). It also has Sense2Vec embeddings inbuilt, so you can accomplish your similarity task without needing other libraries.
It's as simple as:
import spacy

# Similarity needs a model with word vectors, e.g. en_core_web_md
# (install with: python -m spacy download en_core_web_md).
nlp = spacy.load('en_core_web_md')
apples, and_, oranges = nlp('apples and oranges')
print(apples.similarity(oranges))
It's free and has a liberal license!
An idea is to solve this with embeddings and word2vec. The outcome will be a mapping from words to vectors that are "near" when the words have similar meanings; for example, "car" and "vehicle" will be near, while "car" and "food" will not. You can then measure the vector distance between two words and define a threshold to decide whether they are near enough to mean the same thing. As I said, it's just an idea based on word2vec.
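A minimal sketch of that thresholding idea, assuming pre-trained vectors loaded via gensim's downloader; the 0.7 cutoff is an arbitrary placeholder to tune on your data:

import gensim.downloader as api

# Hypothetical choice of pre-trained vectors; any KeyedVectors works.
wv = api.load("glove-wiki-gigaword-100")

def same_meaning(word_a, word_b, threshold=0.7):
    # Cosine similarity lies in [-1, 1]; above the threshold => treat as synonyms.
    return wv.similarity(word_a, word_b) >= threshold

print(same_meaning("car", "vehicle"))  # likely True
print(same_meaning("car", "food"))     # likely False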
The computation behind what Nick said is to calculate the distance (cosine distance) between the vectors of the two phrases:
Top sales!
Best sales
Here is one way to do so: How to calculate phrase similarity between phrases
