Find whether a sentence is related to a medical term or not - python-3.x

Input: the user enters a sentence.
If the sentence is related to any medical term, or the user needs any medical attention,
Output = True
else
Output = False
I am reading https://www.nltk.org/. I scraped https://www.merriam-webster.com/browse/medical/a to get medical-related words, but I am unable to figure out how to detect whether a sentence is related to a medical term. I haven't written any code because the algorithm is not clear to me.
I want to know what I should use and where to start; I need a tutorial link to implement this. Any guidance will be highly appreciated.

I will list the various ways you can do this, from naive to intelligent:
Get a large vocabulary of medical terms, iterate over the sentence, and return True or False depending on whether you find any of them.
Get a large vocabulary of medical terms, iterate over the sentence, and do a fuzzy match with each word, so that words that are syntactic (alphabetic) variations of the same word are still detected and caught. [Check the fuzzywuzzy library in Python; a sketch follows this list.]
Get a large vocabulary of medical terms with a definition for each. Use pre-trained word embeddings (word2vec, GloVe etc.) for each word in the descriptions of those terms. Take a weighted sum of the word embeddings, with weights set to the TF-IDF of each word, to represent each medical term (its description, to be precise) as a vector. Repeat the process for the sentence as well. Then take the cosine similarity between them to calculate how contextually similar the text is to the description of the medical term. If the similarity is above a threshold that you fix, return True. [This approach doesn't need the exact term: even if the person is only describing the condition, it should be able to detect it. A sketch also follows this list.]
Label a large number of sentences with the respective medical terms in them (annotate using something like the API.AI entity annotation tool or the RASA entity annotation tool). Create a neural network with an input embedding layer (which you can initialise with word2vec embeddings if you like), bi-LSTM layers, and a softmax output over the list of medical terms / conditions. This will get you the probability of each condition or term being associated with the sentence.
Create a neural network with an encoder-decoder architecture and an attention layer between them. Create encoder embeddings from the input sentence, and have the decoder output a string of medical terms. Train the encoder-decoder with attention on pre-annotated data.
Create a pointer network which takes a sentence as input and returns pointers that point back to the inputs, marking each as a medical term or a non-medical term. (Not easy to build, FYI...)
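For options 1 and 2, a minimal sketch, assuming a medical_terms list built from your scrape (fuzzywuzzy's fuzz.ratio and the threshold of 85 are illustrative choices, not fixed rules):
from fuzzywuzzy import fuzz

medical_terms = ['abdomen', 'aspirin', 'fracture']  # assumed: from your scrape

def mentions_medical_term(sentence, threshold=85):
    # typo-tolerant match of every word against the vocabulary
    for word in sentence.lower().split():
        for term in medical_terms:
            if fuzz.ratio(word, term) >= threshold:
                return True
    return False

print(mentions_medical_term('My abdomen hurts'))  # True
print(mentions_medical_term('My abdomn hurts'))   # True, despite the typo
print(mentions_medical_term('I love football'))   # False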
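For option 3, a sketch under stated assumptions: pre-trained GloVe vectors loaded through gensim's downloader, a toy term_descriptions dict standing in for your scraped definitions, and an arbitrary 0.5 threshold:
import numpy as np
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer

wv = api.load('glove-wiki-gigaword-50')  # small pre-trained embeddings

term_descriptions = {  # assumed: scraped term -> definition text
    'abdomen': 'the part of the body between the chest and the pelvis',
    'aspirin': 'a drug used to relieve pain and to reduce fever',
}

vectorizer = TfidfVectorizer().fit(term_descriptions.values())
idf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

def embed(text):
    # IDF-weighted average of the word vectors (TF enters via repeated words)
    words = [w for w in text.lower().split() if w in wv]
    if not words:
        return np.zeros(wv.vector_size)
    weights = [idf.get(w, 1.0) for w in words]
    return np.average([wv[w] for w in words], axis=0, weights=weights)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def is_medical(sentence, threshold=0.5):
    s = embed(sentence)
    return any(cosine(s, embed(d)) > threshold for d in term_descriptions.values())

print(is_medical('my stomach is aching badly'))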

OK, so I don't understand which part you don't understand, because the idea is rather simple and one Google search gives you great and easy results. Unless the issue is that you don't know Python; in that case it will be very hard for you to implement this.
The idea itself is simple: tokenize the sentence (put each word by itself in a list) and search the list of medical terms. If the current word is in the list, the term is medical, so the sentence is related to that medical term as well. If you imagine that you have a list of medical terms in a medical_terms list, then in Python it would look something like this:
>>> import nltk
>>> medical_terms = {'abdomen', 'fracture', 'fever'}  # your scraped vocabulary
>>> sentence = """At eight o'clock on Thursday morning
... Arthurs' abdomen was hurting."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
 'Arthurs', "'", 'abdomen', 'was', 'hurting', '.']
>>> def is_medical(tokens):
...     # True as soon as any token is a known medical term
...     for token in tokens:
...         if token.lower() in medical_terms:
...             return True
...     return False
...
>>> is_medical(tokens)
True
You just tokenize the input sentence with NLTK and then check whether any of the words in the sentence is a medical term. You can adapt this function to work with n-grams as well (see the sketch below). There are many other approaches and special cases that would have to be handled, but this is a good start.
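A possible n-gram adaptation, assuming medical_terms also holds lower-cased multi-word terms such as 'heart attack' (the cap of 3 words is arbitrary):
from nltk import ngrams

def is_medical_ngrams(tokens, max_n=3):
    # check every 1- to max_n-word span against the vocabulary,
    # so multi-word terms like 'heart attack' are caught too
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            if ' '.join(gram).lower() in medical_terms:
                return True
    return False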

Related

How does gensim word2vec word embedding extract training word pair for 1 word sentence?

Refer to the image below (it illustrates how the word2vec skip-gram model extracts training data - the word pairs - from the input sentences).
E.g. "I love you." ==> [(I, love), (I, you)]
May I ask what the word pair is when the sentence contains only one word?
Is it "Happy!" ==> [(happy, happy)]?
I tested the word2vec algorithm in gensim: when there is just one word in a training sentence (and this word is not included in any other sentence), word2vec can still construct an embedding vector for that specific word. I am not sure how the algorithm is able to do so.
=============== UPDATE ===============
As the answer posted below explains, the embedding vector created for the word in the 1-word sentence is just the random initialization of the neural network weights.
No word2vec training is possible from a 1-word sentence, because there are no neighboring words to use as input to predict a center/target word. Essentially, that sentence is skipped.
If that was the only appearance of the word in the corpus, and you're seeing a vector for that word, it's just the word's random initialization, with no further training. (And you should probably use a higher min_count, as keeping such rare words is usually a mistake in word2vec: they won't get good vectors, and other nearby words' vectors will improve if the 'noise' from all such insufficiently modelable rare words is removed.)
If that 1-word sentence actually appeared next to other real sentences in your corpus, it could make sense to combine it with the surrounding texts. There's nothing magic about actual sentences for this kind of word-from-surroundings modeling - the algorithm just works on 'neighbors', and it's common to use multi-sentence chunks as the texts for training; sometimes even punctuation (like sentence-ending periods) is retained as 'words'. Then words from an actually-separate sentence - but still related by having appeared in the same document - will appear in each other's contexts.
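This is easy to see in gensim with a toy corpus (vector_size is the gensim 4.x parameter name; older releases called it size):
from gensim.models import Word2Vec

# 'happy' appears only in a one-word sentence
sentences = [['i', 'love', 'you'], ['happy']]
model = Word2Vec(sentences, vector_size=8, min_count=1, seed=1)

# 'happy' is in the vocabulary and has a vector...
print(model.wv['happy'])
# ...but no (context, target) pairs can be formed from a one-word
# sentence, so that vector is just its untrained random initialization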

What is the difference between keras.tokenize.text_to_sequences and word embeddings

Difference between tokenize.fit_on_text, tokenize.text_to_sequence and word embeddings?
Tried to search on various platforms but didn't get a suitable answer.
Word embeddings are a way of representing words such that words with the same or similar meaning have a similar representation. Two commonly used algorithms that learn word embeddings are Word2Vec and GloVe.
Note that word embeddings can also be learned from scratch while training your neural network for text processing, on your specific NLP problem. You can also use transfer learning; in this case, it means transferring the representation of the words learned on huge datasets to your problem.
As for the tokenizer (I assume it's Keras that we're speaking of), taking from the documentation (note that the actual method names are plural: fit_on_texts and texts_to_sequences):
tokenizer.fit_on_texts() --> Creates the vocabulary index based on word frequency. For example, if you had the phrase "My dog is different from your dog, my dog is prettier", then word_index["dog"] = 1, since dog is the most frequent word (lower integers mean more frequent words, and index 0 is reserved for padding).
tokenizer.texts_to_sequences() --> Transforms each text into a sequence of integers; basically, it replaces each word in a sentence with its integer index. You can inspect tokenizer.word_index (a dictionary attribute) to verify the integer assigned to each word.
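A minimal sketch of both calls, using the Tokenizer bundled with tensorflow.keras and the example phrase from above:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["My dog is different from your dog, my dog is prettier"])

print(tokenizer.word_index)
# {'dog': 1, 'my': 2, 'is': 3, 'different': 4, ...} - ranked by frequency, 0 reserved

print(tokenizer.texts_to_sequences(["my dog is prettier"]))
# [[2, 1, 3, 7]] - each word replaced by its vocabulary index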

Classify words with the same meaning

I have 50,000 subject lines from emails and I want to classify the words in them based on synonyms, or words that can be used instead of others.
For example:
Top sales!
Best sales
I want them to be in the same group.
I built the following function with nltk's WordNet, but it doesn't work well.
from nltk.corpus import wordnet

def synonyms(w, group, guide):
    try:
        # Compare the first synset of each word for the POS tag `guide`
        w1 = wordnet.synset(w + '.' + guide + '.01')
        w2 = wordnet.synset(group + '.' + guide + '.01')
        # Wu-Palmer similarity lies in [0, 1]
        return w1.wup_similarity(w2) >= 0.7
    except Exception:
        # unknown lemma/POS, or similarity undefined
        return False
Any ideas or tools to accomplish this?
The easiest way to accomplish this would be to compare the similarity of the respective word embeddings (the most common implementation of this is Word2Vec).
Word2Vec is a way of representing the semantic meaning of a token in a vector space, which enables the meanings of words to be compared without requiring a large dictionary/thesaurus like WordNet.
One problem with regular implementations of Word2Vec is that they do not differentiate between different senses of the same word. For example, the word bank would have the same Word2Vec representation in all of these sentences:
The river bank was dry.
The bank loaned money to me.
The plane may bank to the left.
Bank has the same vector in each of these cases, but you may want them to be sorted into different groups.
One way to solve this is to use a Sense2Vec implementation. Sense2Vec models take into account the context and part of speech (and potentially other features) of the token, allowing you to differentiate between the meanings of different senses of the word.
A great library for this in Python is spaCy. It is like NLTK, but much faster, as it is written in Cython (20x faster for tokenization and 400x faster for tagging). Sense2Vec embeddings are available for it through the companion sense2vec package, so you can accomplish your similarity task without many other libraries.
It's as simple as:
import spacy

# a model that ships with word vectors is needed for similarity
nlp = spacy.load('en_core_web_md')
apples, and_, oranges = nlp('apples and oranges')
print(apples.similarity(oranges))
It's free and has a liberal license!
An idea is to solve this with embeddings and word2vec. The outcome will be a mapping from words to vectors that are "near" each other when the words have similar meanings: for example, "car" and "vehicle" will be near, while "car" and "food" will not. You can then measure the vector distance between two words and define a threshold to decide whether they are near enough to mean the same thing. As I said, this is just the idea behind word2vec.
The computation behind what Nick said is to calculate the (cosine) distance between the vectors of the two phrases:
Top sales!
Best sales
Here is one way to do so: How to calculate phrase similarity between phrases
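A rough sketch of that computation, averaging pre-trained word vectors per phrase (gensim's downloader with the glove-wiki-gigaword-50 model is one convenient choice; TF-IDF weighting would be a common refinement):
import re
import numpy as np
import gensim.downloader as api

wv = api.load('glove-wiki-gigaword-50')  # small pre-trained vectors

def phrase_vector(phrase):
    # average the vectors of the phrase's in-vocabulary words
    words = [w for w in re.findall(r'[a-z]+', phrase.lower()) if w in wv]
    return np.mean([wv[w] for w in words], axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(phrase_vector('Top sales!'),
                        phrase_vector('Best sales')))
# high (close to 1), since the phrases share 'sales' and top/best are near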

Words that are similar to or mean 'yes' and 'no'

I'm wondering if there's a corpus of words that basically mean "yes" or "no"? If not, what are the possible algorithms/techniques to collect such information?
I just started to learn NLP, so please bear with me if this is an obvious question. Thank you!
One way to solve this problem is to find the words most similar to "yes" and "no" in a suitable corpus.
In order to measure word similarity you can use a model called Word2Vec, introduced by Mikolov et al.
If you train this model on a corpus, it maps each word to a representation in a vector space; basically, each word is represented by a vector (hence the name Word2Vec). Word2vec assigns high similarity to words that tend to appear in the same contexts.
After that, you can measure the similarity of two words by calculating the cosine similarity of their vectors.
Here are the results that I got when training Word2vec on a corpus of product reviews:
First 4 most similar words to 'yes':
'yeah', 'oh', 'hey', 'sure'
First 4 most similar words to 'no':
'whatsoever', 'discernible', 'denying', 'zero'
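A sketch of how such lists are produced with gensim (the two-sentence corpus here is only a stand-in for the product reviews; meaningful neighbours need far more data):
from gensim.models import Word2Vec

# assumed: a list of tokenized review sentences
reviews = [['yes', 'sure', 'it', 'works', 'great'],
           ['no', 'never', 'buying', 'this', 'again']]
model = Word2Vec(reviews, vector_size=100, window=5, min_count=1)

print(model.wv.most_similar('yes', topn=4))
print(model.wv.most_similar('no', topn=4))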
Some Word2vec references :
https://radimrehurek.com/gensim/models/word2vec.html
http://rare-technologies.com/word2vec-tutorial/
EDIT:
You can find words related to No and Yes on the General Inquirer website as well:
http://www.wjh.harvard.edu/~inquirer/No.html
http://www.wjh.harvard.edu/~inquirer/Yes.html
Hope this helps.

Identifying the entity in sentiment analysis using Lingpipe

I have implemented sentiment analysis using the sentiment analysis module of Lingpipe. I know that it uses a dynamic LR model for this. It just tells me whether the test string carries positive or negative sentiment. What ideas could I use to determine the object for which the sentiment has been expressed?
If the text is categorized as positive sentiment, I would like to get the object for which the sentiment has been expressed - this could be a movie name, product name or others.
Although this question is really old, I would like to answer it for others' benefit.
What you want here is concept-level sentiment analysis. For a very basic version, I would recommend following these steps:
Apply a sentence splitter. You can use either Lingpipe's Sentence Splitter or the OpenNLP Sentence Detector.
Apply part-of-speech tagging. Again, you can use either Lingpipe's POS tagger or the OpenNLP POS Tagger.
Then identify the token(s) tagged as nouns by the POS tagger. These token(s) are the candidates for the targeted entity in the sentence.
Then find the sentiment words in the sentence. The easiest way to do this is with a dictionary of sentiment-bearing words; you can find many such dictionaries online.
The next step is to find the dependency relations in the sentence. This can be done with the Stanford Dependency Parser. For example, if you try the sentence "This phone is good." in their online demo, you see the following typed dependencies:
det(phone-2, This-1),
nsubj(good-4, phone-2),
cop(good-4, is-3),
root(ROOT-0, good-4)
The dependency nsubj(good-4, phone-2) here indicates that phone is the nominal subject of the token good, implying that the word good is expressed for phone. I am sure that your sentiment dictionary will contain the word good and phone would have been identified as a noun by the POS tagger. Thus, you can conclude that the sentiment good was expressed for the entity phone.
This was a very basic example. You can go a step further and create rules around the dependency relations to extract more complex sentiment-entity pairs. You can also assign scores to your sentiment terms and compute a total score for the sentence depending on the number of occurrences of sentiment words in it. A rough sketch of the nsubj heuristic follows below.
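A rough sketch of that heuristic, using spaCy's dependency parser in place of the Stanford parser (the tiny sentiment lexicon is a placeholder; note that spaCy attaches a predicate adjective as 'acomp' of the copula, so the subject is found on the verb):
import spacy

nlp = spacy.load('en_core_web_sm')
SENTIMENT_WORDS = {'good', 'great', 'bad', 'terrible'}  # placeholder lexicon

def sentiment_targets(text):
    pairs = []
    for token in nlp(text):
        if token.lemma_.lower() not in SENTIMENT_WORDS:
            continue
        if token.dep_ == 'amod':
            # attributive use: "a good phone" - the head noun is the target
            pairs.append((token.text, token.head.text))
            continue
        # predicative use: "this phone is good" - subject hangs off the copula
        head = token.head if token.dep_ == 'acomp' else token
        for child in head.children:
            if child.dep_ == 'nsubj':
                pairs.append((token.text, child.text))
    return pairs

print(sentiment_targets('This phone is good.'))  # [('good', 'phone')]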
Usually the main entity of a sentiment-bearing sentence is the object of that sentiment, so a basic heuristic is to run NER and take the first entity. Otherwise you should use deep-parsing NLP toolkits and write some rules to link the sentiment to its object.
